Repurchasing Contracts from SEC Filings

Collect and Parse 10-Ks

Author

Hongyi Xu

Published

September 16, 2024

Modified

October 14, 2024

The purpose of this file is to replicate the Moon and Phillips (2020) paper on the purchasing contracts of US public firms.

Cheat sheet for REGEX in R.

Functions used in following analyses are recorded in file functions_collect_parse_10Ks.R.

S1. Download US Public Firms’ SEC 10-K Filings

Although parsing SEC EDGAR using R package edgar is one way to get all SEC filings we need, we directly download raw SEC filings from the Notre Dame Software Repository for Accounting and Finance (SRAF).

Code

# store sharing links of all files on the google drive 
googledrive_links <- strsplit(x = "https://drive.google.com/file/d/1G9Qyte36a-9AN8JoH79g7iioNn1AT24d/view?usp=drive_link > 10-X_2022\nhttps://drive.google.com/file/d/14zgoElxrhFJjR_oyw93WQo1AcaL-ZEsg/view?usp=drive_link > 10-X_2019\nhttps://drive.google.com/file/d/10a7myBg5h_-Vq_QkqRqjn6j6X4EOQuPp/view?usp=drive_link > 10-X_2020\nhttps://drive.google.com/file/d/10Fi5SU5LVUGq1oJz1iPWpCMwa_eAdnjD/view?usp=sharing > 10-X_2019 \nhttps://drive.google.com/file/d/10Dl_Dw2_JmMnT6tJv7duo3KOZfGUQREv/view?usp=drive_link > 10-X_2018\nhttps://drive.google.com/file/d/10DSuO8pVtjG_IyWk1k46oiIsLhwK3UiD/view?usp=sharing > 10-X_2017\nhttps://drive.google.com/file/d/10B4dOF2M1iUI6MooT7mEIJcuX6YYgLTQ/view?usp=drive_link > 10-X_2016\nhttps://drive.google.com/file/d/107dzze4B1Q9L8rb_WWMR0M_HD9zyK8W7/view?usp=drive_link > 10-X_2015", split = "\n")[[1]] %>% 
  strsplit(split = ">") %>% 
  do.call(rbind.data.frame, args = .) %>% 
  `colnames<-`(c("Link", "File")) %>% 
  mutate(
    id = str_extract(string = Link, pattern = "(?<=/d/)[^/]+(?=/view)"), # extract the file id 
    Year = as.integer(str_extract(string = File, pattern = "(?<=_)\\d{4}")), # extract the file year 
    File = paste(str_trim(File, "both"), ".zip", sep = "")
  ) %>% 
  select(Year, id, File)

## tabulation: Table 1
gt(googledrive_links) %>% 
  tab_header(
    title = "Table 1: Google drive file information from Notre Dame SRAF", 
  ) %>% 
  tab_options(table.font.size = 10, heading.align = 'left' ) %>% 
  tab_style( # update the font size for table cells. 
    style = cell_text(size = px(10)),
    locations = cells_body()
  )

Table 1:
Google drive file information from Notre Dame SRAF
Year	id	File
Table 1: Google drive file information from Notre Dame SRAF
2022	1G9Qyte36a-9AN8JoH79g7iioNn1AT24d	10-X_2022.zip
2019	14zgoElxrhFJjR_oyw93WQo1AcaL-ZEsg	10-X_2019.zip
2020	10a7myBg5h_-Vq_QkqRqjn6j6X4EOQuPp	10-X_2020.zip
2019	10Fi5SU5LVUGq1oJz1iPWpCMwa_eAdnjD	10-X_2019.zip
2018	10Dl_Dw2_JmMnT6tJv7duo3KOZfGUQREv	10-X_2018.zip
2017	10DSuO8pVtjG_IyWk1k46oiIsLhwK3UiD	10-X_2017.zip
2016	10B4dOF2M1iUI6MooT7mEIJcuX6YYgLTQ	10-X_2016.zip
2015	107dzze4B1Q9L8rb_WWMR0M_HD9zyK8W7	10-X_2015.zip

Steps and command to download file(s) in the CMD:

First, get the API key from Oauth 2.0 Playground - Google Developers.
Steps: Select & authorize APIs \rightarrow Drive API v3 \rightarrow https://www.googleapis.com/auth/drive.readonly.
Get the Access token and copy to the api_key variable below.
Next, choose the files you want to download and use function googledrive_download() to generate the curl command and download the files.
Here I set the default location of downloaded file to "/scratch/nhh/sec/" and you can update to your own.
The code below only needs to be run once. Alternatively, you can use system() inside the for loop to save the storage space.

Code

# curl -H "Authorization: Bearer ya29.a0AcM612zWU6Da_xDydbg-ZDc6vBpMBUoYBTO44oTpTeDvHOhzOyjMr66xdIO7nCfclFiE_NdjSJZvdpFWsNoj2Ds-7L8O2jnrJ15I3MfYb4vlsPcxFj0tPx8Mr8MIa081ZnoIXWnvft6D4aQ1qjda4LyoCl1j0iYuNPKrN3jfaCgYKAS8SARMSFQHGX2Mi8_vM15q895m7oulKDmaIrw0175" https://www.googleapis.com/drive/v3/files/10Fi5SU5LVUGq1oJz1iPWpCMwa_eAdnjD?alt=media -o /scratch/nhh/sec/10-X_2019.zip

## function to automatically generate the downloading code 
googledrive_download <- function(api_key, file_info, file_link, file_name, to = "/scratch/nhh/sec/") {
  if (missing(file_info)) { 
    # if argument `file_info` is not provided 
    cmd_output <- paste('curl -H "Authorization: Bearer ', api_key, '" ', "https://www.googleapis.com/drive/v3/files/", file_link, "?alt=media -o ", to, file_name, "   ", sep = "") 
  } else {
    # if argument `file_info` is provided 
    if (is.data.frame(file_info) & dim(file_info)[2] == 2) {
      cmd_output <- paste(apply(X = file_info, MARGIN = 1, FUN = function(x) paste('curl -H "Authorization: Bearer ', api_key, '" ', "https://www.googleapis.com/drive/v3/files/", x[1], "?alt=media -o ", to, x[2], sep = "") ), collapse = ";  " )
    } else {
      stop("Error: 'file_info' parameter is not data.frame object or has more than 2 columns.")
    }
  }
  return(cmd_output)
} 

api_key = "ya29.a0AcM612xRrh1wolCu4ufaTDwkJqXerA0HA6CyTgLUtQJ6V0JyWeZnlCF7oYx2R8CZ9KhwFPfrdvFPG2WKcAS_8GVK_ve7VvIg-HZAxiD2Xhx4vnUcYLgWtSLMRPzdUnJS_3YmO2CDHVhY8UQDmYmJKsvkmtSElGdLqxKON64laCgYKAaISARMSFQHGX2Mi8eNrVwx4jsGnn9JrhKkexA0175"

googledrive_download_cmd <- googledrive_download(api_key, file_info = googledrive_links[1, 2:3])
## directly run the code in the console, instead of in the terminal 
system(googledrive_download_cmd)
list.files(path = "/scratch/nhh/sec/", pattern = ".zip") 

## unzip one of the .zip file 
utils::unzip(zipfile = "/scratch/nhh/sec/10-X_2019.zip", exdir = "/scratch/nhh/sec/")

S2. Look and Read the SEC filings.

To understand the information contained in the name of the csv files, please refer to the “Paths and directory structure” section in Accessing EDGAR Data

Central Index Key (CIK): EDGAR assigns to filers a unique numerical identifier, known as a Central Index Key (CIK), when they sign up to make filings to the SEC. CIK numbers remain unique to the filer; they are not recycled. It is named as CIK in the cleaned dataset.
Accession number: For example, 0001193125-15-118890 is the accession number, a unique identifier assigned automatically to an accepted submission by EDGAR. The first set of numbers (0001193125) is the CIK of the entity submitting the filing. This could be the company or a third-party filer agent. Some filer agents without a regulatory requirement to make disclosure filings with the SEC have a CIK but no searchable presence in the public EDGAR database. The next two numbers (15) represent the year. The last series of numbers represent a sequential count of submitted filings from that CIK. The count is usually, but not always, reset to zero at the start of each calendar year. It is named as accession_num in the cleaned dataset.

Code

# list of files in the folder 
sec_csv <- list.files("/scratch/nhh/sec/2019", recursive=TRUE, full.names = TRUE)
# set.seed(123); csv_sample <- sample(sec_csv, size = 10)

# extract the list directly from .ZIP file 
sec_csv2 <- unzip(zipfile = "/scratch/nhh/sec/10-X_2019.zip", list = TRUE) %>% 
  grep(pattern = ".txt$", x = .$Name, value = T)

# extract filing name information using `sec_filing_nameinfo()`
sec_csvinfo <- sec_filing_nameinfo(file_path = sec_csv2, keep.original = TRUE) %>%
  as_tibble() %>% 
  mutate(filing_date = as.Date(x = filing_date, format = "%Y%m%d"))

## tabulation: Table 2
set.seed(123)
gt(sample_n(sec_csvinfo, size = 10)) %>% 
  tab_header(
    title = "Information from File Paths of .Zip Files",
    subtitle = md("File: 10-X_2019.zip \u2192 Variable: `sec_csvinfo`")
  ) %>% 
  tab_options(table.font.size = 10, heading.align = 'left' ) %>% 
  tab_style( # update the font size for table cells. 
    style = cell_text(size = px(10)),
    locations = cells_body()
  ) %>% 
  tab_style(
    style = cell_text(size = px(8)), 
    locations = cells_body(columns = c(full_path))
  ) %>% 
  cols_width(
    filing_date ~ cm(30)
  )

Table 2:
Information from File Paths of .Zip Files
qtr_calender	filing_date	file_type	CIK	accession_num	full_path
Information from File Paths of .Zip Files
File: 10-X_2019.zip → Variable: `sec_csvinfo`
QTR3	2019-08-14	10-Q	1475430	0001010412-19-000025.txt	2019/QTR3/20190814_10-Q_edgar_data_1475430_0001010412-19-000025.txt
QTR3	2019-08-14	10-Q	1546296	0001493152-19-012453.txt	2019/QTR3/20190814_10-Q_edgar_data_1546296_0001493152-19-012453.txt
QTR4	2019-12-18	10-K-A	1621221	0001640334-19-002608.txt	2019/QTR4/20191218_10-K-A_edgar_data_1621221_0001640334-19-002608.txt
QTR4	2019-11-13	10-Q	18672	0001089819-19-000030.txt	2019/QTR4/20191113_10-Q_edgar_data_18672_0001089819-19-000030.txt
QTR1	2019-02-28	10-K	1560327	0001560327-19-000046.txt	2019/QTR1/20190228_10-K_edgar_data_1560327_0001560327-19-000046.txt
QTR1	2019-02-21	10-K	1420302	0001564590-19-003676.txt	2019/QTR1/20190221_10-K_edgar_data_1420302_0001564590-19-003676.txt
QTR4	2019-11-14	10-Q	313364	0001047469-19-006308.txt	2019/QTR4/20191114_10-Q_edgar_data_313364_0001047469-19-006308.txt
QTR1	2019-03-01	10-K	1600125	0001564590-19-005807.txt	2019/QTR1/20190301_10-K_edgar_data_1600125_0001564590-19-005807.txt
QTR2	2019-05-13	10-Q	1107280	0001437749-19-009550.txt	2019/QTR2/20190513_10-Q_edgar_data_1107280_0001437749-19-009550.txt
QTR1	2019-03-18	10-K	1325964	0001553350-19-000222.txt	2019/QTR1/20190318_10-K_edgar_data_1325964_0001553350-19-000222.txt

After getting basic information from the title of the SEC filings, we now will import the .txt file and parse required information.

s2.1. Import 10-K Filings and Familiarise the Structure

Code

## keep the universe of 10-K filings
sec_10konly <- sec_csvinfo %>% 
  filter(grepl(pattern = "10-K", x = file_type, fixed = T))

## get the 10-K for Apple Inc
cat(paste("Apple Inc.>", sec_10konly %>% filter(CIK == "320193") %>% .$full_path, collapse = " "))

Apple Inc.> 2019/QTR4/20191031_10-K_edgar_data_320193_0000320193-19-000119.txt

To extract the purchasing/outsourcing contract information as in Moon and Phillips (2020), we only need to look at 10-K filings. Currently, all 10-X filing types include 10-K, 10-K-A, 10-KT, 10-KT-A, 10-Q, 10-Q-A, 10-QT, and we will drop all 10-Q related filings.

This step significantly reduce the number of observations we need to examine. For instance, the total number of 10-X filings in 2019 drops from 27,014 in sec_csvinfo to 7,944 in sec_10konly.

Here I start with the file 2019/QTR4/20191031_10-K_edgar_data_320193_0000320193-19-000119.txt to demonstrate.

s2.1A. Header Info in 10-K Filings

Code

## read .txt files into the R environment 
# filing_path <- sec_10konly$full_path[1]
filing_path <- sec_10konly %>% filter(CIK == "320193") %>% .$full_path # Apple Inc: // 2488
    ## testing candidcates: 
    ## [ ]: https://www.sec.gov/Archives/edgar/data/1173281/000138713119000051/ohr-10k_093018.htm#ohr10k123118a010
    ## [ ]: https://www.sec.gov/ix?doc=/Archives/edgar/data/831641/000083164120000154/ttek-20200927.htm (table is more complex) 
    ## [ ]: https://www.sec.gov/Archives/edgar/data/320193/000032019323000106/0000320193-23-000106.txt 
    ## [ ]: https://www.sec.gov/Archives/edgar/data/1045810/000104581020000010/nvda-2020x10k.htm#s82E07D2E693B525F8500B3A76673C74A
    ## [ ]: https://www.sec.gov/ix?doc=/Archives/edgar/data/1652044/000165204424000022/goog-20231231.htm > Alphabet Inc. 
    ## [ ]: https://www.sec.gov/ix?doc=/Archives/edgar/data/2488/000000248824000012/amd-20231230.htm > AMD > search "purchase commitments" 
    ## [ ]: https://www.sec.gov/ix?doc=/Archives/edgar/data/1543151/000154315124000012/uber-20231231.htm > [Uber Tech. > Purchase Commitment in Text] 
    ### Tables in Notes under Item 8. 
    ## [ ]: https://www.sec.gov/ix?doc=/Archives/edgar/data/0001730168/000173016823000096/avgo-20231029.htm > [Broadcom Inc. > The table is not under Item 7, but under Note 13.] 
    ## [ ]: https://www.sec.gov/ix?doc=/Archives/edgar/data/1090727/000109072724000008/ups-20231231.htm > [UPS > Table is under Item 8, Note 9]

# filing <- readLines(con = filing_path) 
filing <- readLines(archive::archive_read("/scratch/nhh/sec/10-X_2019.zip", filing_path))
## create a more structured file for the main body of the document 
filing_structured <- 
  filing[(grep("<DOCUMENT>", filing, ignore.case = TRUE)[1]):(grep("</DOCUMENT>", filing, ignore.case = TRUE)[1])] %>% 
  paste(., collapse = " ") %>% 
  str_squish() %>% 
  clean_html2(input_string = ., pattern = "(</[^>]+>\\s*</[^>]+>\\s*)<([^/])") %>% # *updated Oct 10, 2024   
  as.vector() 
## keep only the plain text in each section 
filing_cleantext <- sapply(X = filing_structured,
                           FUN = function(x) html_text(read_html(x), trim = TRUE), USE.NAMES = FALSE)

## get the filing headers 
filing_header <- filing.header(x = filing)
### tabulate the header information 
gt(as.data.frame(`colnames<-`(filing_header, c("Item", "Input"))) ) %>% 
  # tab_header(title = "Table 3: Header Information in 10-K Filings") %>% 
  tab_options(table.font.size = 10, heading.align = 'left' ) %>% 
  tab_style( # update the font size for table cells. 
    style = cell_text(size = px(10)),
    locations = cells_body()
  )

Table 3:
Header Information in 10-K Filings
Item	Input
ACCESSION NUMBER	0000320193-19-000119
CONFORMED SUBMISSION TYPE	10-K
PUBLIC DOCUMENT COUNT	96
CONFORMED PERIOD OF REPORT	20190928
FILED AS OF DATE	20191031
DATE AS OF CHANGE	20191030
FILER:
COMPANY DATA:
COMPANY CONFORMED NAME	Apple Inc.
CENTRAL INDEX KEY	0000320193
STANDARD INDUSTRIAL CLASSIFICATION	ELECTRONIC COMPUTERS [3571]
IRS NUMBER	942404110
STATE OF INCORPORATION	CA
FISCAL YEAR END	0928
FILING VALUES:
FORM TYPE	10-K
SEC ACT	1934 Act
SEC FILE NUMBER	001-36743
FILM NUMBER	191181423
BUSINESS ADDRESS:
STREET 1	ONE APPLE PARK WAY
CITY	CUPERTINO
STATE	CA
ZIP	95014
BUSINESS PHONE	(408) 996-1010
MAIL ADDRESS:
STREET 1	ONE APPLE PARK WAY
CITY	CUPERTINO
STATE	CA
ZIP	95014
FORMER COMPANY:
FORMER CONFORMED NAME	APPLE INC
DATE OF NAME CHANGE	20070109
FORMER COMPANY:
FORMER CONFORMED NAME	APPLE COMPUTER INC
DATE OF NAME CHANGE	19970808

Code

# ## get the table of contents (toc) 
# filing_toc <- filing.toc(x = filing, regex_toc = "<text>|</text>") 
# # browsable(HTML(as.character(filing_toc))) # view the ToC in html format

The function filing.toc() does not necessarily extract the table of contents. Although we can use regex_toc = "<table>|</table>" to get the first table we saw in the .txt file, it may not be the table of contents we want. The function loc.item() is better at identify the item; however, it is not structured to get the full table of contents. We modify function loc.item() and create function loc.item_MDnA() to better extract the item we want.

s2.1B. All Items in 10-K filings

To see the full list of items in a 10-K filing, I look into this file and extract all items. The Table 4 below presents all items included in a 10-K filing.

Code

## This section extract the names of each items in the 10-K filing. 
# URL of the webpage
url <- "https://www.sec.gov/answers/reada10k.htm" 
# Set a custom user-agent string to mimic a browser
user_agent_string <- "leonardo.xu@gmail.com"
# Read the webpage with the custom user-agent
reada10k <- read_html(GET(url, user_agent(user_agent_string)))
# browsable(HTML(as.character(reada10k)) )

# generate the item text 
reada10k_items <- html_nodes(reada10k, "table")[3] %>% 
  html_nodes(., "p") %>% 
  .[grep(pattern = ">Item\\s+[1-9]", x = ., ignore.case = FALSE)] %>% 
  sapply(X = ., FUN = function(x) {
    text_raw <- str_squish(html_text(html_nodes(x, "strong, b"))) 
    text_output <- paste(text_raw[text_raw != ""], collapse = " - ") 
    if (!grepl("-", text_output)) {
      text_output <- gsub("\"|“|”|\u0093", "- ", text_output)
    }
    output <- str_replace_all(string = text_output, pattern = "\"|“|”|\u0093|\u0094", replacement = "")
    return(output)
  }, simplify = TRUE) %>% 
  str_split_fixed(string = ., pattern = " - ", 2)
# tabulate the items 
gt(`colnames<-`(data.frame(reada10k_items), c("Item", "Content"))) %>% 
  # tab_header(title = "Table 4: Items in the Table of Contents in 10-K Filings") %>% 
  tab_options(table.font.size = 10, heading.align = 'left' ) %>% 
  tab_style( # update the font size for table cells. 
    style = cell_text(size = px(10)),
    locations = cells_body()
  ) %>% 
  cols_width(
    Item ~ cm(30)
  ) %>% 
  tab_style( # headers to bold
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  )

Table 4:
Items in the Table of Contents in 10-K Filings
Item	Content
Item 1	Business
Item 1A	Risk Factors
Item 1B	Unresolved Staff Comments
Item 2	Properties
Item 3	Legal Proceedings
Item 4
Item 5	Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities
Item 6	Selected Financial Data
Item 7	Management’s Discussion and Analysis of Financial Condition and Results of Operations
Item 8	Financial Statements and Supplementary Data
Item 9	Changes in and Disagreements with Accountants on Accounting and Financial Disclosure
Item 9A	Controls and Procedures
Item 9B	Other Information
Item 10	Directors, Executive Officers and Corporate Governance
Item 11	Executive Compensation-
Item 12	Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters
Item 13	Certain Relationships and Related Transactions, and Director Independence
Item 14	Principal Accountant Fees and Services
Item 15	Exhibits, Financial Statement Schedules

From Table 4, Item 7, MD&A, is the item of interest. We need to first locate the whole section of Item 7 and then extract the information from that. The two candidate REGEXs are:

“(?=.purchas)(?=.(obligation|commitment|agreement|order|contract))”
“\bpurchas[^\\.]*\b(obligat|commitment|agreement|order|contract)”

and the second is faster while the first is more general.

Next, we extracted information from identified Items.

s2.2. Information about Purchase Obligation Disclosure

SEC publishes the final rule on Disclosure in Management’s Discussion and Analysis About Off-Balance Sheet Arrangements and Aggregate Contractual Obligations in 2003. The required table of contractual obligations includes the following four categories of contractual obligations:

Long-term debt obligations;
Capital lease obligations;
Operating lease obligations;
Purchase obligations; and
Other long-term liabilities reflected on the registrant’s balance sheet under GAAP.

However, this information is not only disclosed in Item 7, but also in the Note(s) in Item 8. E.g. Apple Inc. 2019 10K.

About the Lack of Consistency in the Reporting Format!

The purchase obligation disclosure is not always tabulated and can sometimes be in plain text. Here is one example from Apple’s 10-K filing in 2023. Even though Apple tabulates these numbers in its previous filings, there is no guarantee that they will continue the same reporting format.

So, we start with Item 7:

function loc.item_MDnA() is created to locate the MD&A section in the 10-K filing.

Code

## Item 7: 
# cat("Information to locate Item 7 in the 10-K filing:") 
loc_item7 <- loc.item_MDnA(x = filing_structured, filing_type = "10-K")

## Extract the table: 
### From Item 7: 
item7_purchase <- filing.10kitem_purchase(x = filing_structured,
                                          loc_item = loc_item7$loc_item, 
                                          item_regex = "(?=.*purchas)(?=.*(obligation|commitment|agreement|order|contract))")

We can see from variable loc_item7 that Line 947 to 1108 in the cleaned raw HTML variable filing_structured are for Item 7.

Then, we use function filing.10kitem_purchase() to obtain the purchase obligation information in variable item7_purchase. As shown in Table Table 5, the purchase obligations are detailed below. Unit measure, plain text and raw HTML can also be found in item7_purchase.

Table 5:
Purchae Obligation Table in Item 7
item	variable	value¹	rank
Deemed repatriation tax payable	D$- Payments due in 2020	—	1
Deemed repatriation tax payable	D$- Payments due in 2021–2022	4,350	2
Deemed repatriation tax payable	D$- Payments due in 2023–2024	8,501	3
Deemed repatriation tax payable	D$- Payments due after 2024	16,655	4
Deemed repatriation tax payable	D$- Total	29,506	NA
Manufacturing purchase obligations	D$- Payments due in 2020	40,076	1
Manufacturing purchase obligations	D$- Payments due in 2021–2022	1,974	2
Manufacturing purchase obligations	D$- Payments due in 2023–2024	808	3
Manufacturing purchase obligations	D$- Payments due after 2024	69	4
Manufacturing purchase obligations	D$- Total	42,927	NA
Operating leases	D$- Payments due in 2020	1,306	1
Operating leases	D$- Payments due in 2021–2022	2,413	2
Operating leases	D$- Payments due in 2023–2024	1,746	3
Operating leases	D$- Payments due after 2024	5,373	4
Operating leases	D$- Total	10,838	NA
Other purchase obligations	D$- Payments due in 2020	3,744	1
Other purchase obligations	D$- Payments due in 2021–2022	2,271	2
Other purchase obligations	D$- Payments due in 2023–2024	572	3
Other purchase obligations	D$- Payments due after 2024	41	4
Other purchase obligations	D$- Total	6,628	NA
Term debt	D$- Payments due in 2020	10,270	1
Term debt	D$- Payments due in 2021–2022	18,278	2
Term debt	D$- Payments due in 2023–2024	19,329	3
Term debt	D$- Payments due after 2024	53,802	4
Term debt	D$- Total	101,679	NA
Total	D$- Payments due in 2020	55,396	1
Total	D$- Payments due in 2021–2022	29,286	2
Total	D$- Payments due in 2023–2024	30,956	3
Total	D$- Payments due after 2024	75,940	4
Total	D$- Total	191,578	NA
¹ Unit: in millions

Then, we look into Item 8. We still use function loc.item_MDnA(). Different from searching Item 7, we adjust the two parameters regex_item and regex_num in the function to locate Item 8.

Code

## Item 8: 
loc_item8 <- loc.item_MDnA(x = filing_structured, filing_type = "10-K", 
                           regex_item = c(NA, "(?=.*finan)(?=.*statem)(?=.*supple)(?=.*data)"), # item header 
                           regex_num = c("[>](Item|ITEM)[^0-9]+2\\.", "[>](Item|ITEM)[^0-9]+8\\."), # item number 
                           regex_perl = TRUE) 
item8_notes_html <- filing.item8_notes(x = filing_structured, 
                                      loc_item = loc_item8$loc_item, 
                                      item_regex = "\\bpurchas[^\\.]*\\b(obligat|commitment|agreement|order|contract)",
                                      note_regex = NULL) 


## Extract the table: 
### From Item 8: 
item8_purchase <- sapply(X = item8_notes_html, FUN = function(x) {
  filing.10kitem_purchase(x = x,
                          loc_item = c(1, length(x)),
                          item_regex = "(?=.*purchas)(?=.*(obligation|commitment|agreement|order|contract))")
}, simplify = FALSE, USE.NAMES = TRUE)
#### further assign each element in the list into a single variable.
{
  for (x in seq_along(item8_notes_html)) {
    print(paste("item8", letters[x], "_purchase", sep = ""))
    assign(x = paste("item8", letters[x], "_purchase", sep = ""),
           value = item8_purchase[[x]])
  } 
  rm(item8_purchase) # remove the collection if it contains more than 1. 
}

[1] "item8a_purchase"

We can see from variable loc_item8 that Line 1123 to 1701 in the cleaned raw HTML variable filing_structured are for Item 8.

After locating Item 8, function filing.item8_notes() is used to (1) Extract the HTML for Item 8 and create of a Table of Notes (“ToN”). (2) Identify the Note(s) that match the item_regex. Alternatively, the Note(s) can be extracted by note_regex. (3) Return the HTML for the sub-Note in the Note(s) of Interest. The output of function filing.item8_notes() is a list with the name of the element being the Note name and element is a character vector recording the raw HTML for the sub-Note(s).

Additional Notes for function filing.item8_notes()

This additional step of using this function is to extract the header of the sub-Note and use it as the name of the “item” variable in the final output. As you can see from the function name, it is for Item 8 only.

Output item8_notes_html contains information about the matched sub-Note HTML of corresponding Note(s).

Then, we use elements in item8_note_html as the input in function filing.10kitem_purchase() and obtain output item8_purchase, which is a list. We can also choose to rename each element in the list separately. Using code ls(pattern = "item\\d(\\w)*_purchase"), we can find that variables containing purchase obligation information are

[1] "item7_purchase"  "item8a_purchase"

As shown in Table Table 6, the purchase obligations are detailed below. Unit measure, plain text and raw HTML can also be found in item8a_purchase.

Table 6:
Purchae Obligation Table in Item 8
item	variable	value¹	rank
D$- Unconditional Purchase Obligations	2020	2,476	1
D$- Unconditional Purchase Obligations	2021	2,386	2
D$- Unconditional Purchase Obligations	2022	1,859	3
D$- Unconditional Purchase Obligations	2023	1,162	4
D$- Unconditional Purchase Obligations	2024	218	5
D$- Unconditional Purchase Obligations	Thereafter	110	6
D$- Unconditional Purchase Obligations	Total	8,211	NA
¹ Unit: in millions

S3. Up Next!

With all functions presented in this document, I will write the parallel functions to parse information from all files. Let’s start with the 2019 filings.

Here is a list of functions used in the file:

Code

lsf.str()

clean_html : function (input_string, pattern = "(</[^>]+>\\s*</[^>]+>\\s*)<([^/])")  
clean_html2 : function (input_string, pattern = "(</[^>]+>\\s*</[^>]+>\\s*)<([^/])")  
filing_item7_extract : function (filing_txt, filing_headerid, item_regex = "(?=.*purchas)(?=.*(obligation|commitment|agreement|order|contract))")  
filing.10kitem_purchase : function (x, loc_item, item_regex = "(?=.*purchas)(?=.*(obligation|commitment|agreement|order|contract))")  
filing.cleaned : function (loc_file, zip_file, text_break_node)  
filing.cleaned_errorid : function (cleaned_dt)  
filing.cleaned_multiple : function (loc_file, zip_file, text_break_node)  
filing.cleaned_parallel : function (loc_file, zip_file, text_break_node, errors = 1)  
filing.cleaned_parts : function (cleaned_dt)  
filing.header : function (x, regex_header = "ACCESSION NUMBER:|</SEC-(HEADER|Header)>")  
filing.item : function (x, loc_item, item_id, item, item_id_backup, reporting_qrt, text_break_node, 
    table = TRUE, parts = c("footnote"))  
filing.item_multiple : function (x, loc_item, item_id, item, item_id_backup, reporting_qrt, text_break_node, 
    table = TRUE, parts = c("footnote"))  
filing.item_purchase : function (x, loc_item, item, item_id_backup, reporting_qrt, item_regex, 
    table = TRUE, parts = c("footnote"))  
filing.item8_notes : function (x, loc_item, item_regex, note_regex = NULL)  
filing.toc : function (x, regex_toc = "<text>|</text>")  
googledrive_download : function (api_key, file_info, file_link, file_name, to = "/scratch/nhh/sec/")  
has_annotations : function (input)  
html_to_structure : function (item_html, headerlength = c(1, 8))  
html_to_table : function (tbl_html, item_html, tbl_colname, item_regex, allcurrency = TRUE)  
item2_html_table : function (item_html, reporting_qrt)  
loc.item : function (x, filing_type, regex_item = c("(Unregistered|UNREGISTERED|UNRE\\w+)\\s+(Sale|sale|SALE)(s|S|)\\s*(of|Of|OF)", 
    "(Market|MARKET)\\s+(for|For|FOR)\\s*(The|THE|the)?\\s*(Registrant|REGISTRANT|registrant|Re|re|RE|CO)"), 
    regex_perl = TRUE)  
loc.item_MDnA : function (x, filing_type = "10-K", regex_item = c(NA, "(?=.*management)(?=.*discussion)(?=.*analysis)(?=.*operation)"), 
    regex_num = c("[>](Item|ITEM)[^0-9]+2\\.", "[>](Item|ITEM)[^0-9]+7\\."), 
    regex_perl = TRUE)  
sec_filing_nameinfo : function (file_path, keep.original = TRUE)  
source.folder : function ()  
table.cleaned : function (id_table_raw, text_break_node)  
tbl.rowkeep : function (regex_row = "(\\w+(\\s+?)\\d{1,2},\\s+\\d{4}|Total|to|[-]|\\d+\\/\\d+\\/\\d+)|(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)", 
    row_name, reporting_qrt)  
tbl.rowkeep2 : function (regex_row = "(\\w+(\\s+?)\\d{1,2},\\s+\\d{4}|Total|[^a-zA-Z]to|[-]|\\d+\\/\\d+\\/\\d+)|((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\\b)|quarter", 
    row_name, reporting_qrt)  
text_to_table : function (text, var_name, largeunits_regex = "thousand|milli|bill|tril")  
text_to_tokens : function (sentences, item_regex)  
vectors_to_matrix : function (vec_list)

Code

{
  ## information extraction from function `fling.cleaned()`
  info <- filing_header
  
  selected_headers <- c('ACCESSION NUMBER','CONFORMED SUBMISSION TYPE','PUBLIC DOCUMENT COUNT','CONFORMED PERIOD OF REPORT','FILED AS OF DATE','DATE AS OF CHANGE','FILER:','COMPANY DATA:','COMPANY CONFORMED NAME','CENTRAL INDEX KEY','STANDARD INDUSTRIAL CLASSIFICATION','IRS NUMBER','STATE OF INCORPORATION','FISCAL YEAR END','FILING VALUES:','FORM TYPE','SEC ACT','SEC FILE NUMBER','FILM NUMBER','BUSINESS ADDRESS:','STREET 1','STREET 2','CITY','STATE','ZIP','BUSINESS PHONE')
  
  info_cleaned <- info[match(selected_headers, 
                             table = info[1:max(grep("mail", info[,1], ignore.case = T)[1]-1, nrow(info), na.rm = T),1]), 2] # all info before section "MAIL ADDRESS:"
  
  info_cleaned
  
  ## generate cleaned info 
  item2_cleaned <- filing.item(x = filing_structured,
                               loc_item = loc_item7$loc_item,
                               item_id = loc_item7$item_id,
                               item = loc_item7$item,
                               item_id_backup = loc_item7$item_id_backup, ## updated August 8, 2023 
                               text_break_node = text_break_node, 
                               reporting_qrt = info_cleaned[4],
                               parts = "footnote")
}

Notes

All codes here may have issues with pure text documents, i.e. filings are not in the HTML format.
A new set of codes need to be written to identify whether the filing is in HTML or TEXT format.
This situation should be a minority in the whole sample, but still need to be checked.

Appendix

footnote indicates manufacturing obligations are primarily non-cancelable. — Figure 1: Apple 2019 10-K Contractual Obligations

The footnote (1) in Figure 1 shows that manufacturing purchase obligations are primarily non-cancellable, which indicates that some numbers of the purchase obligations may not cover items consistent with our story.

No purchase contract is found and more advanced function is needed to parse this kind of tables. — Figure 2: Tetra Tech Inc. 2020 10-K Contractual Obligations

Smaller reporting companies defined under Item 10 of Regulation S-K are not required to disclose contractual obligations under Item 7. You need to find a way to separate these filings from the very beginning to improve the parsing efficiency. — Figure 3: Karbon-X Corp 2024 10-K Contractual Obligations

The underlined paragraph in Figure 3 shows that smaller reporting companies defined under Item 10 of Regulation S-K are not required to disclose contractual obligations under Item 7. You need to find a way to separate these filings from the very beginning to improve the parsing efficiency.

S1. Download US Public Firms’ SEC 10-K Filings

S2. Look and Read the SEC filings.

s2.1. Import 10-K Filings and Familiarise the Structure

s2.1A. Header Info in 10-K Filings

s2.1B. All Items in 10-K filings

s2.2. Information about Purchase Obligation Disclosure

S3. Up Next!

Notes

Appendix

Reference