Repurchasing Contracts from SEC Filings

Collect and Parse 10-Ks

Author

Hongyi Xu

Published

September 16, 2024

Modified

October 14, 2024

The purpose of this file is to replicate the Moon and Phillips (2020) paper on the purchasing contracts of US public firms.

Cheat sheet for REGEX in R.

The functions used in the following analyses are defined in the file functions_collect_parse_10Ks.R.

S1. Download US Public Firms’ SEC 10-K Filings

Although the R package edgar is one way to get all the SEC filings we need from SEC EDGAR, we instead download the raw filings directly from the Notre Dame Software Repository for Accounting and Finance (SRAF).

Code
# store sharing links of all files on the google drive 
googledrive_links <- strsplit(x = "https://drive.google.com/file/d/1G9Qyte36a-9AN8JoH79g7iioNn1AT24d/view?usp=drive_link > 10-X_2022\nhttps://drive.google.com/file/d/14zgoElxrhFJjR_oyw93WQo1AcaL-ZEsg/view?usp=drive_link > 10-X_2019\nhttps://drive.google.com/file/d/10a7myBg5h_-Vq_QkqRqjn6j6X4EOQuPp/view?usp=drive_link > 10-X_2020\nhttps://drive.google.com/file/d/10Fi5SU5LVUGq1oJz1iPWpCMwa_eAdnjD/view?usp=sharing > 10-X_2019 \nhttps://drive.google.com/file/d/10Dl_Dw2_JmMnT6tJv7duo3KOZfGUQREv/view?usp=drive_link > 10-X_2018\nhttps://drive.google.com/file/d/10DSuO8pVtjG_IyWk1k46oiIsLhwK3UiD/view?usp=sharing > 10-X_2017\nhttps://drive.google.com/file/d/10B4dOF2M1iUI6MooT7mEIJcuX6YYgLTQ/view?usp=drive_link > 10-X_2016\nhttps://drive.google.com/file/d/107dzze4B1Q9L8rb_WWMR0M_HD9zyK8W7/view?usp=drive_link > 10-X_2015", split = "\n")[[1]] %>% 
  strsplit(split = ">") %>% 
  do.call(rbind.data.frame, args = .) %>% 
  `colnames<-`(c("Link", "File")) %>% 
  mutate(
    id = str_extract(string = Link, pattern = "(?<=/d/)[^/]+(?=/view)"), # extract the file id 
    Year = as.integer(str_extract(string = File, pattern = "(?<=_)\\d{4}")), # extract the file year 
    File = paste(str_trim(File, "both"), ".zip", sep = "")
  ) %>% 
  select(Year, id, File)

## tabulation: Table 1
gt(googledrive_links) %>% 
  tab_header(
    title = "Table 1: Google drive file information from Notre Dame SRAF", 
  ) %>% 
  tab_options(table.font.size = 10, heading.align = 'left' ) %>% 
  tab_style( # update the font size for table cells. 
    style = cell_text(size = px(10)),
    locations = cells_body()
  )
Table 1:

Google drive file information from Notre Dame SRAF

Year id File
2022 1G9Qyte36a-9AN8JoH79g7iioNn1AT24d 10-X_2022.zip
2019 14zgoElxrhFJjR_oyw93WQo1AcaL-ZEsg 10-X_2019.zip
2020 10a7myBg5h_-Vq_QkqRqjn6j6X4EOQuPp 10-X_2020.zip
2019 10Fi5SU5LVUGq1oJz1iPWpCMwa_eAdnjD 10-X_2019.zip
2018 10Dl_Dw2_JmMnT6tJv7duo3KOZfGUQREv 10-X_2018.zip
2017 10DSuO8pVtjG_IyWk1k46oiIsLhwK3UiD 10-X_2017.zip
2016 10B4dOF2M1iUI6MooT7mEIJcuX6YYgLTQ 10-X_2016.zip
2015 107dzze4B1Q9L8rb_WWMR0M_HD9zyK8W7 10-X_2015.zip

Steps and commands to download the file(s) from the command line:

  • First, get an access token from the OAuth 2.0 Playground - Google Developers.
  • Steps: Select & authorize APIs → Drive API v3 → https://www.googleapis.com/auth/drive.readonly.
  • Copy the access token into the api_key variable below.
  • Next, choose the files you want to download and use the function googledrive_download() to generate the curl command that downloads them.
  • The default download location is "/scratch/nhh/sec/"; change it to your own.
  • The code below only needs to be run once. Alternatively, you can call system() on the curl command inside the for loop to save storage space.
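As a minimal sketch of the loop-based alternative in the last bullet, the commands could be built one per file and then run one at a time. The helper build_curl_cmd() and the placeholder token are hypothetical and are not part of functions_collect_parse_10Ks.R; the file ids are taken from Table 1.

```r
# Hypothetical helper: build one curl command per file, quoting the header,
# the URL, and the destination path for the shell.
build_curl_cmd <- function(access_token, file_id, file_name, to = "/scratch/nhh/sec/") {
  url  <- paste0("https://www.googleapis.com/drive/v3/files/", file_id, "?alt=media")
  dest <- paste0(to, file_name)
  paste(
    "curl -H", shQuote(paste("Authorization: Bearer", access_token)),
    shQuote(url), "-o", shQuote(dest)
  )
}

cmds <- build_curl_cmd(
  access_token = "<ACCESS_TOKEN>", # placeholder, not a real token
  file_id      = c("10Dl_Dw2_JmMnT6tJv7duo3KOZfGUQREv", "10DSuO8pVtjG_IyWk1k46oiIsLhwK3UiD"),
  file_name    = c("10-X_2018.zip", "10-X_2017.zip")
)

# Download and process one archive at a time (commented out because it needs
# a valid token and network access):
# for (cmd in cmds) {
#   system(cmd)
#   ## ... parse the archive here, then delete it to free storage ...
# }
```

Quoting with shQuote() is a design choice of this sketch; it guards against spaces or special characters in the destination path.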
Code
# curl -H "Authorization: Bearer ya29.a0AcM612zWU6Da_xDydbg-ZDc6vBpMBUoYBTO44oTpTeDvHOhzOyjMr66xdIO7nCfclFiE_NdjSJZvdpFWsNoj2Ds-7L8O2jnrJ15I3MfYb4vlsPcxFj0tPx8Mr8MIa081ZnoIXWnvft6D4aQ1qjda4LyoCl1j0iYuNPKrN3jfaCgYKAS8SARMSFQHGX2Mi8_vM15q895m7oulKDmaIrw0175" https://www.googleapis.com/drive/v3/files/10Fi5SU5LVUGq1oJz1iPWpCMwa_eAdnjD?alt=media -o /scratch/nhh/sec/10-X_2019.zip

## function to automatically generate the downloading code 
googledrive_download <- function(api_key, file_info, file_link, file_name, to = "/scratch/nhh/sec/") {
  if (missing(file_info)) { 
    # if argument `file_info` is not provided 
    cmd_output <- paste('curl -H "Authorization: Bearer ', api_key, '" ', "https://www.googleapis.com/drive/v3/files/", file_link, "?alt=media -o ", to, file_name, "   ", sep = "") 
  } else {
    # if argument `file_info` is provided 
    if (is.data.frame(file_info) && ncol(file_info) == 2) { # `&&` short-circuits, so ncol() is only checked for data frames
      cmd_output <- paste(apply(X = file_info, MARGIN = 1, FUN = function(x) paste('curl -H "Authorization: Bearer ', api_key, '" ', "https://www.googleapis.com/drive/v3/files/", x[1], "?alt=media -o ", to, x[2], sep = "") ), collapse = ";  " )
    } else {
      stop("'file_info' must be a data.frame with exactly 2 columns (file id, file name).")
    }
  }
  return(cmd_output)
} 

api_key = "ya29.a0AcM612xRrh1wolCu4ufaTDwkJqXerA0HA6CyTgLUtQJ6V0JyWeZnlCF7oYx2R8CZ9KhwFPfrdvFPG2WKcAS_8GVK_ve7VvIg-HZAxiD2Xhx4vnUcYLgWtSLMRPzdUnJS_3YmO2CDHVhY8UQDmYmJKsvkmtSElGdLqxKON64laCgYKAaISARMSFQHGX2Mi8eNrVwx4jsGnn9JrhKkexA0175"

googledrive_download_cmd <- googledrive_download(api_key, file_info = googledrive_links[1, 2:3])
## directly run the code in the console, instead of in the terminal 
system(googledrive_download_cmd)
list.files(path = "/scratch/nhh/sec/", pattern = "\\.zip$") # escape the dot and anchor the extension

## unzip one of the .zip file 
utils::unzip(zipfile = "/scratch/nhh/sec/10-X_2019.zip", exdir = "/scratch/nhh/sec/")

S2. Inspect and Read the SEC Filings

To understand the information encoded in the file names of the filings, refer to the “Paths and directory structure” section in Accessing EDGAR Data.

  • Central Index Key (CIK): EDGAR assigns to filers a unique numerical identifier, known as a Central Index Key (CIK), when they sign up to make filings to the SEC. CIK numbers remain unique to the filer; they are not recycled. It is named as CIK in the cleaned dataset.
  • Accession number: For example, 0001193125-15-118890 is the accession number, a unique identifier assigned automatically to an accepted submission by EDGAR. The first set of numbers (0001193125) is the CIK of the entity submitting the filing. This could be the company or a third-party filer agent. Some filer agents without a regulatory requirement to make disclosure filings with the SEC have a CIK but no searchable presence in the public EDGAR database. The next two numbers (15) represent the year. The last series of numbers represent a sequential count of submitted filings from that CIK. The count is usually, but not always, reset to zero at the start of each calendar year. It is named as accession_num in the cleaned dataset.
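To make the naming convention concrete, the sketch below splits file paths from the archive into their components. parse_filing_path() is a hypothetical helper written for illustration; the document's own sec_filing_nameinfo() may work differently. The sketch assumes every path follows the pattern <date>_<form type>_edgar_data_<CIK>_<accession number>.txt.

```r
# Hypothetical helper: split an SRAF file path into its name components.
# Assumes every path matches the pattern; unmatched paths are not handled.
parse_filing_path <- function(path) {
  file_name <- basename(path)
  pattern <- "^([0-9]{8})_(.+)_edgar_data_([0-9]+)_([0-9]{10})-([0-9]{2})-([0-9]{6})\\.txt$"
  m <- do.call(rbind, regmatches(file_name, regexec(pattern, file_name)))
  data.frame(
    filing_date   = as.Date(m[, 2], format = "%Y%m%d"),
    file_type     = m[, 3],
    CIK           = m[, 4],                                # filer's CIK
    accession_num = paste(m[, 5], m[, 6], m[, 7], sep = "-"),
    submitter_cik = m[, 5],                                # first block of the accession number
    year_2digit   = m[, 6],                                # two-digit filing year
    sequence      = as.integer(m[, 7]),                    # sequential count for the submitter
    stringsAsFactors = FALSE
  )
}

parsed <- parse_filing_path(c(
  "2019/QTR4/20191031_10-K_edgar_data_320193_0000320193-19-000119.txt",
  "2019/QTR4/20191218_10-K-A_edgar_data_1621221_0001640334-19-002608.txt"
))
parsed
```

In the second path the submitter block of the accession number differs from the filer's CIK, which is consistent with a filing made through a third-party filer agent.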
Code
# list of files in the folder 
sec_csv <- list.files("/scratch/nhh/sec/2019", recursive=TRUE, full.names = TRUE)
# set.seed(123); csv_sample <- sample(sec_csv, size = 10)

# extract the list directly from .ZIP file 
sec_csv2 <- unzip(zipfile = "/scratch/nhh/sec/10-X_2019.zip", list = TRUE) %>% 
  { grep(pattern = "\\.txt$", x = .$Name, value = TRUE) } # braces stop %>% from also passing the data frame as grep()'s first unnamed argument

# extract filing name information using `sec_filing_nameinfo()`
sec_csvinfo <- sec_filing_nameinfo(file_path = sec_csv2, keep.original = TRUE) %>%
  as_tibble() %>% 
  mutate(filing_date = as.Date(x = filing_date, format = "%Y%m%d"))

## tabulation: Table 2
set.seed(123)
gt(sample_n(sec_csvinfo, size = 10)) %>% 
  tab_header(
    title = "Information from File Paths of .Zip Files",
    subtitle = md("File: 10-X_2019.zip \u2192 Variable: `sec_csvinfo`")
  ) %>% 
  tab_options(table.font.size = 10, heading.align = 'left' ) %>% 
  tab_style( # update the font size for table cells. 
    style = cell_text(size = px(10)),
    locations = cells_body()
  ) %>% 
  tab_style(
    style = cell_text(size = px(8)), 
    locations = cells_body(columns = c(full_path))
  ) %>% 
  cols_width(
    filing_date ~ cm(30)
  )
Table 2:

Information from File Paths of .Zip Files

File: 10-X_2019.zip → Variable: sec_csvinfo

qtr_calender filing_date file_type CIK accession_num full_path
QTR3 2019-08-14 10-Q 1475430 0001010412-19-000025.txt 2019/QTR3/20190814_10-Q_edgar_data_1475430_0001010412-19-000025.txt
QTR3 2019-08-14 10-Q 1546296 0001493152-19-012453.txt 2019/QTR3/20190814_10-Q_edgar_data_1546296_0001493152-19-012453.txt
QTR4 2019-12-18 10-K-A 1621221 0001640334-19-002608.txt 2019/QTR4/20191218_10-K-A_edgar_data_1621221_0001640334-19-002608.txt
QTR4 2019-11-13 10-Q 18672 0001089819-19-000030.txt 2019/QTR4/20191113_10-Q_edgar_data_18672_0001089819-19-000030.txt
QTR1 2019-02-28 10-K 1560327 0001560327-19-000046.txt 2019/QTR1/20190228_10-K_edgar_data_1560327_0001560327-19-000046.txt
QTR1 2019-02-21 10-K 1420302 0001564590-19-003676.txt 2019/QTR1/20190221_10-K_edgar_data_1420302_0001564590-19-003676.txt
QTR4 2019-11-14 10-Q 313364 0001047469-19-006308.txt 2019/QTR4/20191114_10-Q_edgar_data_313364_0001047469-19-006308.txt
QTR1 2019-03-01 10-K 1600125 0001564590-19-005807.txt 2019/QTR1/20190301_10-K_edgar_data_1600125_0001564590-19-005807.txt
QTR2 2019-05-13 10-Q 1107280 0001437749-19-009550.txt 2019/QTR2/20190513_10-Q_edgar_data_1107280_0001437749-19-009550.txt
QTR1 2019-03-18 10-K 1325964 0001553350-19-000222.txt 2019/QTR1/20190318_10-K_edgar_data_1325964_0001553350-19-000222.txt

With the basic information extracted from the file names of the SEC filings, we now import the .txt files and parse the required information.

s2.1. Import 10-K Filings and Get Familiar with the Structure

Code
## keep the universe of 10-K filings
sec_10konly <- sec_csvinfo %>% 
  filter(grepl(pattern = "10-K", x = file_type, fixed = T))

## get the 10-K for Apple Inc
cat(paste("Apple Inc.>", sec_10konly %>% filter(CIK == "320193") %>% .$full_path, collapse = " "))
Apple Inc.> 2019/QTR4/20191031_10-K_edgar_data_320193_0000320193-19-000119.txt

To extract the purchasing/outsourcing contract information as in Moon and Phillips (2020), we only need to look at 10-K filings. The 10-X filing types currently include 10-K, 10-K-A, 10-KT, 10-KT-A, 10-Q, 10-Q-A, and 10-QT; we drop all 10-Q related filings.

This step significantly reduces the number of filings we need to examine. For instance, the number of 10-X filings in 2019 drops from 27,014 in sec_csvinfo to 7,944 in sec_10konly.
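The filter can be checked against the form types listed above. A small sketch, using only the form types named in this section:

```r
# Form types named in this section
file_type <- c("10-K", "10-K-A", "10-KT", "10-KT-A", "10-Q", "10-Q-A", "10-QT")

# Same fixed-string match as in the filter() call above
is_10k  <- grepl(pattern = "10-K", x = file_type, fixed = TRUE)
kept    <- file_type[is_10k]   # the 10-K family
dropped <- file_type[!is_10k]  # the 10-Q family

kept
dropped
```

A fixed-string match on "10-K" keeps the amended and transition variants (10-K-A, 10-KT, 10-KT-A) along with the plain 10-K.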

Here I start with the file 2019/QTR4/20191031_10-K_edgar_data_320193_0000320193-19-000119.txt to demonstrate.

s2.1A. Header Info in 10-K Filings
Code
## read .txt files into the R environment 
# filing_path <- sec_10konly$full_path[1]
filing_path <- sec_10konly %>% filter(CIK == "320193") %>% .$full_path # Apple Inc: // 2488
    ## testing candidates: 
    ## [ ]: https://www.sec.gov/Archives/edgar/data/1173281/000138713119000051/ohr-10k_093018.htm#ohr10k123118a010
    ## [ ]: https://www.sec.gov/ix?doc=/Archives/edgar/data/831641/000083164120000154/ttek-20200927.htm (table is more complex) 
    ## [ ]: https://www.sec.gov/Archives/edgar/data/320193/000032019323000106/0000320193-23-000106.txt 
    ## [ ]: https://www.sec.gov/Archives/edgar/data/1045810/000104581020000010/nvda-2020x10k.htm#s82E07D2E693B525F8500B3A76673C74A
    ## [ ]: https://www.sec.gov/ix?doc=/Archives/edgar/data/1652044/000165204424000022/goog-20231231.htm > Alphabet Inc. 
    ## [ ]: https://www.sec.gov/ix?doc=/Archives/edgar/data/2488/000000248824000012/amd-20231230.htm > AMD > search "purchase commitments" 
    ## [ ]: https://www.sec.gov/ix?doc=/Archives/edgar/data/1543151/000154315124000012/uber-20231231.htm > [Uber Tech. > Purchase Commitment in Text] 
    ### Tables in Notes under Item 8. 
    ## [ ]: https://www.sec.gov/ix?doc=/Archives/edgar/data/0001730168/000173016823000096/avgo-20231029.htm > [Broadcom Inc. > The table is not under Item 7, but under Note 13.] 
    ## [ ]: https://www.sec.gov/ix?doc=/Archives/edgar/data/1090727/000109072724000008/ups-20231231.htm > [UPS > Table is under Item 8, Note 9]

# filing <- readLines(con = filing_path) 
filing <- readLines(archive::archive_read("/scratch/nhh/sec/10-X_2019.zip", filing_path))
## create a more structured file for the main body of the document 
filing_structured <- 
  filing[(grep("<DOCUMENT>", filing, ignore.case = TRUE)[1]):(grep("</DOCUMENT>", filing, ignore.case = TRUE)[1])] %>% 
  paste(., collapse = " ") %>% 
  str_squish() %>% 
  clean_html2(input_string = ., pattern = "(</[^>]+>\\s*</[^>]+>\\s*)<([^/])") %>% # *updated Oct 10, 2024   
  as.vector() 
## keep only the plain text in each section 
filing_cleantext <- sapply(X = filing_structured,
                           FUN = function(x) html_text(read_html(x), trim = TRUE), USE.NAMES = FALSE)

## get the filing headers 
filing_header <- filing.header(x = filing)
### tabulate the header information 
gt(as.data.frame(`colnames<-`(filing_header, c("Item", "Input"))) ) %>% 
  # tab_header(title = "Table 3: Header Information in 10-K Filings") %>% 
  tab_options(table.font.size = 10, heading.align = 'left' ) %>% 
  tab_style( # update the font size for table cells. 
    style = cell_text(size = px(10)),
    locations = cells_body()
  )
Table 3:

Header Information in 10-K Filings

Item Input
ACCESSION NUMBER 0000320193-19-000119
CONFORMED SUBMISSION TYPE 10-K
PUBLIC DOCUMENT COUNT 96
CONFORMED PERIOD OF REPORT 20190928
FILED AS OF DATE 20191031
DATE AS OF CHANGE 20191030
FILER:
COMPANY DATA:
COMPANY CONFORMED NAME Apple Inc.
CENTRAL INDEX KEY 0000320193
STANDARD INDUSTRIAL CLASSIFICATION ELECTRONIC COMPUTERS [3571]
IRS NUMBER 942404110
STATE OF INCORPORATION CA
FISCAL YEAR END 0928
FILING VALUES:
FORM TYPE 10-K
SEC ACT 1934 Act
SEC FILE NUMBER 001-36743
FILM NUMBER 191181423
BUSINESS ADDRESS:
STREET 1 ONE APPLE PARK WAY
CITY CUPERTINO
STATE CA
ZIP 95014
BUSINESS PHONE (408) 996-1010
MAIL ADDRESS:
STREET 1 ONE APPLE PARK WAY
CITY CUPERTINO
STATE CA
ZIP 95014
FORMER COMPANY:
FORMER CONFORMED NAME APPLE INC
DATE OF NAME CHANGE 20070109
FORMER COMPANY:
FORMER CONFORMED NAME APPLE COMPUTER INC
DATE OF NAME CHANGE 19970808
Code
# ## get the table of contents (toc) 
# filing_toc <- filing.toc(x = filing, regex_toc = "<text>|</text>") 
# # browsable(HTML(as.character(filing_toc))) # view the ToC in html format 

The function filing.toc() does not necessarily extract the table of contents. Although we can use regex_toc = "<table>|</table>" to get the first table in the .txt file, it may not be the table of contents we want. The function loc.item() is better at identifying items; however, it is not designed to return the full table of contents. We therefore modify loc.item() and create the function loc.item_MDnA() to better extract the item we want.

s2.1B. All Items in 10-K filings

To see the full list of items in a 10-K filing, I read the SEC page https://www.sec.gov/answers/reada10k.htm and extract all item names. Table 4 presents all items included in a 10-K filing.

Code
## This section extracts the name of each item in the 10-K filing. 
# URL of the webpage
url <- "https://www.sec.gov/answers/reada10k.htm" 
# Set a custom user-agent string to mimic a browser
user_agent_string <- "leonardo.xu@gmail.com"
# Read the webpage with the custom user-agent
reada10k <- read_html(GET(url, user_agent(user_agent_string)))
# browsable(HTML(as.character(reada10k)) )

# generate the item text 
reada10k_items <- html_nodes(reada10k, "table")[3] %>% 
  html_nodes(., "p") %>% 
  .[grep(pattern = ">Item\\s+[1-9]", x = ., ignore.case = FALSE)] %>% 
  sapply(X = ., FUN = function(x) {
    text_raw <- str_squish(html_text(html_nodes(x, "strong, b"))) 
    text_output <- paste(text_raw[text_raw != ""], collapse = " - ") 
    if (!grepl("-", text_output)) {
      text_output <- gsub("\"|“|”|\u0093", "- ", text_output)
    }
    output <- str_replace_all(string = text_output, pattern = "\"|“|”|\u0093|\u0094", replacement = "")
    return(output)
  }, simplify = TRUE) %>% 
  str_split_fixed(string = ., pattern = " - ", 2)
# tabulate the items 
gt(`colnames<-`(data.frame(reada10k_items), c("Item", "Content"))) %>% 
  # tab_header(title = "Table 4: Items in the Table of Contents in 10-K Filings") %>% 
  tab_options(table.font.size = 10, heading.align = 'left' ) %>% 
  tab_style( # update the font size for table cells. 
    style = cell_text(size = px(10)),
    locations = cells_body()
  ) %>% 
  cols_width(
    Item ~ cm(30)
  ) %>% 
  tab_style( # headers to bold
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  )
Table 4:

Items in the Table of Contents in 10-K Filings

Item Content
Item 1 Business
Item 1A Risk Factors
Item 1B Unresolved Staff Comments
Item 2 Properties
Item 3 Legal Proceedings
Item 4
Item 5 Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities
Item 6 Selected Financial Data
Item 7 Management’s Discussion and Analysis of Financial Condition and Results of Operations
Item 8 Financial Statements and Supplementary Data
Item 9 Changes in and Disagreements with Accountants on Accounting and Financial Disclosure
Item 9A Controls and Procedures
Item 9B Other Information
Item 10 Directors, Executive Officers and Corporate Governance
Item 11 Executive Compensation-
Item 12 Security Ownership of Certain Beneficial Owners and Management and Related Stockholder Matters
Item 13 Certain Relationships and Related Transactions, and Director Independence
Item 14 Principal Accountant Fees and Services
Item 15 Exhibits, Financial Statement Schedules

From Table 4, Item 7 (MD&A) is the item of interest. We first need to locate the whole section of Item 7 and then extract the information from it. The two candidate regular expressions are:

  1. “(?=.*purchas)(?=.*(obligation|commitment|agreement|order|contract))”
  2. “\\bpurchas[^\\.]*\\b(obligat|commitment|agreement|order|contract)”

The second is faster, while the first is more general: the first matches the keywords in any order, whereas the second requires “purchas” to precede the keyword within the same sentence.
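The difference between the two patterns can be seen on a few short strings. The sketch below uses made-up test strings, not text from any filing, and runs both patterns with perl = TRUE because the first one relies on lookaheads.

```r
regex_lookahead <- "(?=.*purchas)(?=.*(obligation|commitment|agreement|order|contract))"
regex_ordered   <- "\\bpurchas[^\\.]*\\b(obligat|commitment|agreement|order|contract)"

# Made-up test strings, not taken from any filing
examples <- c(
  "Manufacturing purchase obligations",
  "Outstanding commitments to purchase inventory",
  "We purchased goods. The contract expired.",
  "Operating leases"
)

result <- data.frame(
  text      = examples,
  lookahead = grepl(regex_lookahead, examples, perl = TRUE),
  ordered   = grepl(regex_ordered,   examples, perl = TRUE),
  stringsAsFactors = FALSE
)
result
```

Only the first string matches both patterns. The second and third match the lookahead pattern alone, because the keyword either comes before "purchase" or sits in a different sentence.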

Next, we extract information from the identified items.

s2.2. Information about Purchase Obligation Disclosure

The SEC published the final rule on Disclosure in Management’s Discussion and Analysis About Off-Balance Sheet Arrangements and Aggregate Contractual Obligations in 2003. The required table of contractual obligations includes the following five categories of contractual obligations:

  • Long-term debt obligations;
  • Capital lease obligations;
  • Operating lease obligations;
  • Purchase obligations; and
  • Other long-term liabilities reflected on the registrant’s balance sheet under GAAP.

However, this information is disclosed not only in Item 7 but also in the Note(s) in Item 8; see, e.g., the Apple Inc. 2019 10-K.

About the Lack of Consistency in the Reporting Format!

The purchase obligation disclosure is not always tabulated and can sometimes be in plain text. Here is one example from Apple’s 10-K filing in 2023. Even though Apple tabulates these numbers in its previous filings, there is no guarantee that they will continue the same reporting format.

So, we start with Item 7. The function loc.item_MDnA() is created to locate the MD&A section in the 10-K filing.

Code
## Item 7: 
# cat("Information to locate Item 7 in the 10-K filing:") 
loc_item7 <- loc.item_MDnA(x = filing_structured, filing_type = "10-K")

## Extract the table: 
### From Item 7: 
item7_purchase <- filing.10kitem_purchase(x = filing_structured,
                                          loc_item = loc_item7$loc_item, 
                                          item_regex = "(?=.*purchas)(?=.*(obligation|commitment|agreement|order|contract))")

We can see from the variable loc_item7 that lines 947 to 1108 of the cleaned raw HTML variable filing_structured correspond to Item 7.

Then, we use the function filing.10kitem_purchase() to obtain the purchase obligation information and store it in the variable item7_purchase. Table 5 details the purchase obligations. The unit of measure, plain text, and raw HTML can also be found in item7_purchase.

Table 5:

Purchase Obligation Table in Item 7

item variable value1 rank
Deemed repatriation tax payable D$- Payments due in 2020 1
Deemed repatriation tax payable D$- Payments due in 2021–2022 4,350 2
Deemed repatriation tax payable D$- Payments due in 2023–2024 8,501 3
Deemed repatriation tax payable D$- Payments due after 2024 16,655 4
Deemed repatriation tax payable D$- Total 29,506 NA
Manufacturing purchase obligations D$- Payments due in 2020 40,076 1
Manufacturing purchase obligations D$- Payments due in 2021–2022 1,974 2
Manufacturing purchase obligations D$- Payments due in 2023–2024 808 3
Manufacturing purchase obligations D$- Payments due after 2024 69 4
Manufacturing purchase obligations D$- Total 42,927 NA
Operating leases D$- Payments due in 2020 1,306 1
Operating leases D$- Payments due in 2021–2022 2,413 2
Operating leases D$- Payments due in 2023–2024 1,746 3
Operating leases D$- Payments due after 2024 5,373 4
Operating leases D$- Total 10,838 NA
Other purchase obligations D$- Payments due in 2020 3,744 1
Other purchase obligations D$- Payments due in 2021–2022 2,271 2
Other purchase obligations D$- Payments due in 2023–2024 572 3
Other purchase obligations D$- Payments due after 2024 41 4
Other purchase obligations D$- Total 6,628 NA
Term debt D$- Payments due in 2020 10,270 1
Term debt D$- Payments due in 2021–2022 18,278 2
Term debt D$- Payments due in 2023–2024 19,329 3
Term debt D$- Payments due after 2024 53,802 4
Term debt D$- Total 101,679 NA
Total D$- Payments due in 2020 55,396 1
Total D$- Payments due in 2021–2022 29,286 2
Total D$- Payments due in 2023–2024 30,956 3
Total D$- Payments due after 2024 75,940 4
Total D$- Total 191,578 NA
1 Unit: in millions
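One way to sanity-check a parsed table is to confirm that the maturity buckets add up to the reported total. The sketch below does this for the two purchase obligation rows, with values copied from the table above. The data frame layout is a simplification for illustration and is not the actual structure of item7_purchase.

```r
# Values copied from the Item 7 table above (in millions); rank NA marks the "Total" row
obligations <- data.frame(
  item  = rep(c("Manufacturing purchase obligations", "Other purchase obligations"), each = 5),
  value = c("40,076", "1,974", "808", "69", "42,927",
            "3,744", "2,271", "572", "41", "6,628"),
  rank  = rep(c(1:4, NA), times = 2),
  stringsAsFactors = FALSE
)

# Convert the formatted strings to numbers
obligations$amount <- as.numeric(gsub(",", "", obligations$value, fixed = TRUE))

# Sum the maturity buckets per item and compare with the reported total
is_total  <- is.na(obligations$rank)
parts_sum <- tapply(obligations$amount[!is_total], obligations$item[!is_total], sum)
reported  <- tapply(obligations$amount[is_total],  obligations$item[is_total],  sum)

check <- data.frame(
  item      = names(parts_sum),
  parts_sum = as.numeric(parts_sum),
  reported  = as.numeric(reported[names(parts_sum)]),
  stringsAsFactors = FALSE
)
check$consistent <- check$parts_sum == check$reported
check
```

A mismatch would suggest that a cell was dropped or shifted during parsing, which is worth flagging before running the parser over all filings.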

Then, we look into Item 8, again using the function loc.item_MDnA(). Unlike the search for Item 7, we adjust the two parameters regex_item and regex_num in the function to locate Item 8.

Code
## Item 8: 
loc_item8 <- loc.item_MDnA(x = filing_structured, filing_type = "10-K", 
                           regex_item = c(NA, "(?=.*finan)(?=.*statem)(?=.*supple)(?=.*data)"), # item header 
                           regex_num = c("[>](Item|ITEM)[^0-9]+2\\.", "[>](Item|ITEM)[^0-9]+8\\."), # item number 
                           regex_perl = TRUE) 
item8_notes_html <- filing.item8_notes(x = filing_structured, 
                                      loc_item = loc_item8$loc_item, 
                                      item_regex = "\\bpurchas[^\\.]*\\b(obligat|commitment|agreement|order|contract)",
                                      note_regex = NULL) 


## Extract the table: 
### From Item 8: 
item8_purchase <- sapply(X = item8_notes_html, FUN = function(x) {
  filing.10kitem_purchase(x = x,
                          loc_item = c(1, length(x)),
                          item_regex = "(?=.*purchas)(?=.*(obligation|commitment|agreement|order|contract))")
}, simplify = FALSE, USE.NAMES = TRUE)
#### further assign each element in the list into a single variable.
{
  for (x in seq_along(item8_notes_html)) {
    print(paste("item8", letters[x], "_purchase", sep = ""))
    assign(x = paste("item8", letters[x], "_purchase", sep = ""),
           value = item8_purchase[[x]])
  } 
  rm(item8_purchase) # remove the collection if it contains more than 1. 
} 
[1] "item8a_purchase"

We can see from the variable loc_item8 that lines 1123 to 1701 of the cleaned raw HTML variable filing_structured correspond to Item 8.

After locating Item 8, the function filing.item8_notes() is used to (1) extract the HTML for Item 8 and create a Table of Notes (“ToN”); (2) identify the Note(s) that match item_regex (alternatively, the Note(s) can be selected by note_regex); and (3) return the HTML for the sub-Note(s) in the Note(s) of interest. The output of filing.item8_notes() is a list in which the name of each element is the Note name and each element is a character vector holding the raw HTML of the sub-Note(s).

Additional Notes for function filing.item8_notes()

The purpose of this additional step is to extract the header of the sub-Note and use it as the name of the “item” variable in the final output. As the function name suggests, it applies to Item 8 only.

The output item8_notes_html contains the HTML of the matched sub-Note(s) of the corresponding Note(s).

Then, we pass the elements of item8_notes_html to the function filing.10kitem_purchase() and obtain the output item8_purchase, which is a list. We can also assign each element of the list to its own variable. Running ls(pattern = "item\\d(\\w)*_purchase") shows that the variables containing purchase obligation information are

[1] "item7_purchase"  "item8a_purchase"
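An alternative to assign() is to keep the results in one named list, which is easier to loop over later. The sketch below uses placeholder data frames in place of the parsed output of each Note; the naming scheme follows the loop above.

```r
# Placeholder content standing in for the parsed output of each Note
item8_purchase <- list(data.frame(value = 1), data.frame(value = 2))

# Name the elements item8a_purchase, item8b_purchase, ...
names(item8_purchase) <- paste0("item8", letters[seq_along(item8_purchase)], "_purchase")
names(item8_purchase)

# The pattern used with ls() above selects these names as well
matches <- grepl("item\\d(\\w)*_purchase",
                 c("item7_purchase", names(item8_purchase), "item8_notes_html"))
matches
```

Elements can then be accessed by name, for example item8_purchase[["item8a_purchase"]], without creating variables in the global environment.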

Table 6 details the purchase obligations. The unit of measure, plain text, and raw HTML can also be found in item8a_purchase.

Table 6:

Purchase Obligation Table in Item 8

item variable value1 rank
D$- Unconditional Purchase Obligations 2020 2,476 1
D$- Unconditional Purchase Obligations 2021 2,386 2
D$- Unconditional Purchase Obligations 2022 1,859 3
D$- Unconditional Purchase Obligations 2023 1,162 4
D$- Unconditional Purchase Obligations 2024 218 5
D$- Unconditional Purchase Obligations Thereafter 110 6
D$- Unconditional Purchase Obligations Total 8,211 NA
1 Unit: in millions

S3. Up Next!

With all functions presented in this document, I will write the parallel functions to parse information from all files. Let’s start with the 2019 filings.

Here is a list of functions used in the file:

Code
lsf.str()
clean_html : function (input_string, pattern = "(</[^>]+>\\s*</[^>]+>\\s*)<([^/])")  
clean_html2 : function (input_string, pattern = "(</[^>]+>\\s*</[^>]+>\\s*)<([^/])")  
filing_item7_extract : function (filing_txt, filing_headerid, item_regex = "(?=.*purchas)(?=.*(obligation|commitment|agreement|order|contract))")  
filing.10kitem_purchase : function (x, loc_item, item_regex = "(?=.*purchas)(?=.*(obligation|commitment|agreement|order|contract))")  
filing.cleaned : function (loc_file, zip_file, text_break_node)  
filing.cleaned_errorid : function (cleaned_dt)  
filing.cleaned_multiple : function (loc_file, zip_file, text_break_node)  
filing.cleaned_parallel : function (loc_file, zip_file, text_break_node, errors = 1)  
filing.cleaned_parts : function (cleaned_dt)  
filing.header : function (x, regex_header = "ACCESSION NUMBER:|</SEC-(HEADER|Header)>")  
filing.item : function (x, loc_item, item_id, item, item_id_backup, reporting_qrt, text_break_node, 
    table = TRUE, parts = c("footnote"))  
filing.item_multiple : function (x, loc_item, item_id, item, item_id_backup, reporting_qrt, text_break_node, 
    table = TRUE, parts = c("footnote"))  
filing.item_purchase : function (x, loc_item, item, item_id_backup, reporting_qrt, item_regex, 
    table = TRUE, parts = c("footnote"))  
filing.item8_notes : function (x, loc_item, item_regex, note_regex = NULL)  
filing.toc : function (x, regex_toc = "<text>|</text>")  
googledrive_download : function (api_key, file_info, file_link, file_name, to = "/scratch/nhh/sec/")  
has_annotations : function (input)  
html_to_structure : function (item_html, headerlength = c(1, 8))  
html_to_table : function (tbl_html, item_html, tbl_colname, item_regex, allcurrency = TRUE)  
item2_html_table : function (item_html, reporting_qrt)  
loc.item : function (x, filing_type, regex_item = c("(Unregistered|UNREGISTERED|UNRE\\w+)\\s+(Sale|sale|SALE)(s|S|)\\s*(of|Of|OF)", 
    "(Market|MARKET)\\s+(for|For|FOR)\\s*(The|THE|the)?\\s*(Registrant|REGISTRANT|registrant|Re|re|RE|CO)"), 
    regex_perl = TRUE)  
loc.item_MDnA : function (x, filing_type = "10-K", regex_item = c(NA, "(?=.*management)(?=.*discussion)(?=.*analysis)(?=.*operation)"), 
    regex_num = c("[>](Item|ITEM)[^0-9]+2\\.", "[>](Item|ITEM)[^0-9]+7\\."), 
    regex_perl = TRUE)  
sec_filing_nameinfo : function (file_path, keep.original = TRUE)  
source.folder : function ()  
table.cleaned : function (id_table_raw, text_break_node)  
tbl.rowkeep : function (regex_row = "(\\w+(\\s+?)\\d{1,2},\\s+\\d{4}|Total|to|[-]|\\d+\\/\\d+\\/\\d+)|(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)", 
    row_name, reporting_qrt)  
tbl.rowkeep2 : function (regex_row = "(\\w+(\\s+?)\\d{1,2},\\s+\\d{4}|Total|[^a-zA-Z]to|[-]|\\d+\\/\\d+\\/\\d+)|((Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\\b)|quarter", 
    row_name, reporting_qrt)  
text_to_table : function (text, var_name, largeunits_regex = "thousand|milli|bill|tril")  
text_to_tokens : function (sentences, item_regex)  
vectors_to_matrix : function (vec_list)  
Code
{
  ## information extraction from function `filing.cleaned()`
  info <- filing_header
  
  selected_headers <- c('ACCESSION NUMBER','CONFORMED SUBMISSION TYPE','PUBLIC DOCUMENT COUNT','CONFORMED PERIOD OF REPORT','FILED AS OF DATE','DATE AS OF CHANGE','FILER:','COMPANY DATA:','COMPANY CONFORMED NAME','CENTRAL INDEX KEY','STANDARD INDUSTRIAL CLASSIFICATION','IRS NUMBER','STATE OF INCORPORATION','FISCAL YEAR END','FILING VALUES:','FORM TYPE','SEC ACT','SEC FILE NUMBER','FILM NUMBER','BUSINESS ADDRESS:','STREET 1','STREET 2','CITY','STATE','ZIP','BUSINESS PHONE')
  
  info_cleaned <- info[match(selected_headers, 
                             table = info[1:min(grep("mail", info[,1], ignore.case = TRUE)[1]-1, nrow(info), na.rm = TRUE),1]), 2] # all info before section "MAIL ADDRESS:" (min() caps the range at that row; falls back to nrow(info) if absent)
  
  info_cleaned
  
  ## generate cleaned info 
  item2_cleaned <- filing.item(x = filing_structured,
                               loc_item = loc_item7$loc_item,
                               item_id = loc_item7$item_id,
                               item = loc_item7$item,
                               item_id_backup = loc_item7$item_id_backup, ## updated August 8, 2023 
                               text_break_node = text_break_node, 
                               reporting_qrt = info_cleaned[4],
                               parts = "footnote")
}

Notes

  1. The code here may fail on pure-text documents, i.e. filings that are not in HTML format.
  2. New code needs to be written to identify whether a filing is in HTML or TEXT format.
  3. Such filings should be a minority of the whole sample, but this still needs to be checked.
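A first pass at the format check in Note 2 could count common HTML tags in the filing body. The sketch below is a heuristic under stated assumptions: is_html_filing(), the tag list, and the min_tags threshold are all hypothetical, and table-style tags are left out on purpose because older plain-text filings can contain markers such as <TABLE> and <CAPTION>.

```r
# Hypothetical heuristic: TRUE when the filing contains enough opening HTML tags.
# The tag list and the threshold are assumptions and would need tuning on real data.
is_html_filing <- function(lines, min_tags = 3) {
  body <- paste(lines, collapse = " ")
  tag_regex <- "<(html|body|div|tr|td|font|span)(\\s[^>]*)?>"
  hits <- gregexpr(tag_regex, body, ignore.case = TRUE, perl = TRUE)[[1]]
  n_tags <- if (hits[1] == -1) 0L else length(hits)
  n_tags >= min_tags
}

# Made-up miniature filings for illustration
html_lines <- c("<DOCUMENT>", "<TEXT>", "<html><body>",
                "<div><table><tr><td>Purchase obligations</td></tr></table></div>",
                "</body></html>", "</TEXT>", "</DOCUMENT>")
text_lines <- c("<DOCUMENT>", "<TYPE>10-K", "<TEXT>", "<TABLE>", "<CAPTION>",
                "<S> <C>", "Purchase obligations 1,000", "</TABLE>", "</TEXT>", "</DOCUMENT>")

html_flag <- is_html_filing(html_lines)
text_flag <- is_html_filing(text_lines)
c(html = html_flag, text = text_flag)
```

Filings flagged as plain text could be routed to a separate parser or set aside for manual review.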

Appendix

footnote indicates manufacturing obligations are primarily non-cancelable.

Figure 1: Apple 2019 10-K Contractual Obligations

The footnote (1) in Figure 1 shows that manufacturing purchase obligations are primarily non-cancellable, which indicates that some numbers of the purchase obligations may not cover items consistent with our story.

No purchase contract is found; a more advanced function is needed to parse this kind of table.

Figure 2: Tetra Tech Inc. 2020 10-K Contractual Obligations

Figure 3: Karbon-X Corp 2024 10-K Contractual Obligations

The underlined paragraph in Figure 3 shows that smaller reporting companies defined under Item 10 of Regulation S-K are not required to disclose contractual obligations under Item 7. These filings should be separated out at the very beginning to improve parsing efficiency.
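One possible way to separate such filings early is a keyword screen on the cleaned text. The sketch below is only a heuristic: flag_src_omission() and both phrases are assumptions, the sample sentences are made up rather than quoted from a filing, and the wording in real filings varies.

```r
# Hypothetical screen: TRUE when the text mentions smaller-reporting-company status
# together with a statement that a disclosure is not required.
flag_src_omission <- function(text) {
  text <- tolower(paste(text, collapse = " "))
  grepl("smaller reporting compan(y|ies)", text) &&
    grepl("not required to (provide|disclose|furnish)", text)
}

# Made-up sentences for illustration
omitted   <- "As a smaller reporting company, we are not required to provide the information required by this Item."
disclosed <- "The following table summarizes our contractual obligations as of year end."

omitted_flag   <- flag_src_omission(omitted)
disclosed_flag <- flag_src_omission(disclosed)
c(omitted = omitted_flag, disclosed = disclosed_flag)
```

Because a filing may use this wording under a different item, such as the market risk disclosures, a flag of this kind is best treated as a hint to be confirmed within the Item 7 section rather than a final classification.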

Reference