library(rvest)
library(tidyverse)
library(htm2txt) # convert html to txt.
library(reshape2)
library(gt)
Here we use a 10-Q filing from Apple:
# 1. parameters ----
filing_qrt <- "QTR2" # the filing quarter
loc_file <- "Edgar filings_full text/Form 10-Q/320193/320193_10-Q_2015-07-22_0001193125-15-259935.txt"
# 2. import filing -----
filing <- readLines(loc_file) # read txt filing
Function `filing.header` extracts the filing information from the header.
filing.header <- function(x, # the file
regex_header = 'ACCESSION NUMBER:|</SEC-HEADER>' # the regex of the start to end of the header section in the filing
) { # parse filing header info
header <- grep(pattern = regex_header, x = x, perl = TRUE) # search the argument `x`, not the global `filing`
header_cleaned <- str_squish(x[header[1]:(header[2]-1)])
header_info <- str_split_fixed(header_cleaned[header_cleaned != ""],
pattern = ":\\s", 2)
return(header_info)
}
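To make the header parsing concrete, here is a minimal sketch of the same three steps (locate the header bounds, squish whitespace, split on the first colon) applied to a few hand-written lines. The vector `toy_filing` is a made-up stand-in for a real EDGAR file, not an excerpt from the Apple filing.

```r
library(stringr)

# Hypothetical header lines that mimic the layout of an EDGAR full-text filing
toy_filing <- c(
  "<SEC-HEADER>0001193125-15-259935.hdr.sgml : 20150722",
  "ACCESSION NUMBER:\t\t0001193125-15-259935",
  "CONFORMED SUBMISSION TYPE:\t10-Q",
  "",
  "FILER:",
  "\tCOMPANY DATA:",
  "\t\tCOMPANY CONFORMED NAME:\t\t\tAPPLE INC",
  "</SEC-HEADER>"
)

# 1. first and last line of the header section
bounds <- grep("ACCESSION NUMBER:|</SEC-HEADER>", toy_filing, perl = TRUE)
# 2. collapse tabs and repeated spaces, drop the closing tag line
lines <- str_squish(toy_filing[bounds[1]:(bounds[2] - 1)])
# 3. split each non-empty line into label and value at the first ": "
toy_header <- str_split_fixed(lines[lines != ""], pattern = ":\\s", n = 2)
toy_header
```

Section labels such as `FILER:` contain no `": "` separator, so they keep their trailing colon and get an empty value, which matches the empty cells in the table of the real header.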
The extracted filing info looks like this:
filing_header <- filing.header(x = filing)
filing_type <- filing_header[2,2] # get the filing type (10-Q/K)
filing_cik <- filing_header[10,2] # get cik
gt(data = as.data.frame(filing_header)) %>%
cols_label(V1 = "Label", V2 = "Info")
Label | Info |
---|---|
ACCESSION NUMBER | 0001193125-15-259935 |
CONFORMED SUBMISSION TYPE | 10-Q |
PUBLIC DOCUMENT COUNT | 11 |
CONFORMED PERIOD OF REPORT | 20150627 |
FILED AS OF DATE | 20150722 |
DATE AS OF CHANGE | 20150722 |
FILER: | |
COMPANY DATA: | |
COMPANY CONFORMED NAME | APPLE INC |
CENTRAL INDEX KEY | 0000320193 |
STANDARD INDUSTRIAL CLASSIFICATION | ELECTRONIC COMPUTERS [3571] |
IRS NUMBER | 942404110 |
STATE OF INCORPORATION | CA |
FISCAL YEAR END | 0927 |
FILING VALUES: | |
FORM TYPE | 10-Q |
SEC ACT | 1934 Act |
SEC FILE NUMBER | 001-36743 |
FILM NUMBER | 151000501 |
BUSINESS ADDRESS: | |
STREET 1 | ONE INFINITE LOOP |
CITY | CUPERTINO |
STATE | CA |
ZIP | 95014 |
BUSINESS PHONE | (408) 996-1010 |
MAIL ADDRESS: | |
STREET 1 | ONE INFINITE LOOP |
CITY | CUPERTINO |
STATE | CA |
ZIP | 95014 |
FORMER COMPANY: | |
FORMER CONFORMED NAME | APPLE COMPUTER INC |
DATE OF NAME CHANGE | 19970808 |
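The filing type and CIK above are read by row position (rows 2 and 10), which assumes every header has the same layout. As a possible alternative, the sketch below looks a value up by its label instead. The helper `header_value` and the small matrix `toy_header` are hypothetical and only illustrate the idea.

```r
# Hypothetical helper: return the value stored next to a given header label
header_value <- function(header, label) {
  hit <- which(header[, 1] == label)
  if (length(hit) == 0) return(NA_character_) # label not present
  unname(header[hit[1], 2])                   # first match only
}

# A small two-column matrix shaped like the output of filing.header()
toy_header <- matrix(c("ACCESSION NUMBER", "0001193125-15-259935",
                       "CONFORMED SUBMISSION TYPE", "10-Q",
                       "CENTRAL INDEX KEY", "0000320193"),
                     ncol = 2, byrow = TRUE)

header_value(toy_header, "CONFORMED SUBMISSION TYPE")
header_value(toy_header, "CENTRAL INDEX KEY")
```

A label lookup keeps working if a filing adds or drops header rows, at the cost of depending on the exact label text.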
Function `filing.toc` extracts the Table of Contents (ToC) from the filing.
Function `loc.item` locates the item of interest, which potentially contains share repurchase information, in the Table of Contents.
filing.toc <- function(x, # filing
regex_toc = '<text>|</text>' # locate ToC
){ # find the table of content(s)
toc <- grep(pattern = regex_toc, x = x, ignore.case = T)[1:2] # the part containing the ToC
filing_toc <- read_html(paste0(x[toc[1]:toc[2]], collapse = "")) # extract the toc
return(filing_toc)
}
loc.item <- function(x, # filing
filing_type, # filing type from the previous input
regex_item = c("Unregistered Sales of Equity Securities and Use of Proceeds",
"Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities")
) { # locate the section of the item of interest
# > item 2 in 10-Q: "Unregistered Sales of Equity Securities and Use of Proceeds" ;
# > item 5 in 10-K: "Market for Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities" ;
toc <- filing.toc(x = x) # use the argument `x`, not the global `filing`
regex <- regex_item[filing_type == c("10-Q", "10-K")] # identify the regex
toc_txt <- html_nodes(html_nodes(toc, "table"), "a")
item_id <- gsub(x = unique(html_attr(toc_txt[which(grepl(pattern = regex,
x = html_text(toc_txt),
ignore.case = T)) + 0:6],"href"))[1:2],
pattern = '#', replacement = '')
loc_item <- vapply(X = item_id,
FUN = function(p) {
loc_item0 <- grep(pattern = p, x = x, fixed = T)
return(ifelse(length(loc_item0) != 1, loc_item0[2], loc_item0[1]))
},
FUN.VALUE = numeric(1))
return(list(loc_item = loc_item, item_id = item_id))
}
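The lookup in `loc.item` can be hard to follow on a real filing, so here is a minimal sketch of the idea on a toy document: find the ToC link whose text matches the item title, take its `href` and the next one, and search the filing lines for those anchors. The document `toy_doc` is invented, and matching on `name="..."` is an assumption about how the anchors are marked up; the function above instead searches for the bare id and skips the first hit when there are several.

```r
library(rvest)

# Hypothetical filing: a ToC table followed by three anchored sections
toy_doc <- c(
  '<html><body>',
  '<table>',
  '<tr><td><a href="#item1">Legal Proceedings</a></td></tr>',
  '<tr><td><a href="#item2">Unregistered Sales of Equity Securities and Use of Proceeds</a></td></tr>',
  '<tr><td><a href="#item3">Defaults Upon Senior Securities</a></td></tr>',
  '</table>',
  '<p><a name="item1"></a>Item 1. Legal Proceedings</p>',
  '<p><a name="item2"></a>Item 2. Unregistered Sales of Equity Securities</p>',
  '<p><a name="item3"></a>Item 3. Defaults Upon Senior Securities</p>',
  '</body></html>'
)

# links inside the ToC table
links <- html_nodes(read_html(paste0(toy_doc, collapse = "")), "table a")
# position of the item of interest among the ToC links
hit <- grep("Unregistered Sales of Equity Securities", html_text(links),
            ignore.case = TRUE)[1]
# anchor ids of the item and of the item that follows it
ids <- sub("#", "", html_attr(links[hit + 0:1], "href"), fixed = TRUE)
# line numbers where those anchors are defined
bounds <- vapply(ids,
                 function(id) grep(paste0('name="', id, '"'), toy_doc, fixed = TRUE)[1],
                 integer(1))
bounds
```

The two line numbers in `bounds` play the role of `loc_item`: the item of interest starts at the first and ends where the next item begins.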
Function `loc.item` automatically locates Item 2 in a 10-Q and Item 5 in a 10-K based on the filing type. Running `filing.toc` on the variable `filing` stores the ToC in the variable `filing_title`, and running `loc.item` stores the location of the item in `loc_item2`.
filing_title <- filing.toc(x = filing)
filing_title
## {html_document}
## <html>
## [1] <body><text><title>Form 10-Q</title>\n<h5 align="left"><a href="#toc">Tab ...
loc_item2 <- loc.item(x = filing, filing_type = filing_type) # the location for the item 2/5
cat("The item of interest is from line", loc_item2$loc_item[1], "to", loc_item2$loc_item[2], ".")
## The item of interest is from line 13139 to 13405 .
Function `filing.item` extracts the information in Item 2 of a 10-Q (Item 5 of a 10-K). `x` contains the text of the filing. `loc_item` gives the position of the item as returned by function `loc.item`. `filing_qrt` is the filing quarter and is defined outside the function.
Function `filing.item_txt` extracts the text before and after the table in the item and records it in `$header` and `$footnote`.
## extract the txt header and/or footnote from the item
filing.item_txt <- function(item_txt, # txt in character for the item
item_tbl_id, # the location of the table in the item `item_txt`
parts = c("header", "footnote") # kept parts
) {
if (length(parts) != 0) { # proceed only when at least one part is requested
parts_id <- c(1, grep('<(/?)table', item_txt, ignore.case = T)[1 + 2*item_tbl_id + (-1:0)], length(item_txt))
loc_item_section <- list(header = parts_id[1:2],
footnote = parts_id[3:4])
filing_item2_txt <- lapply(X = loc_item_section[parts],
FUN = function(id) {html_text(read_html(paste(item_txt[id[1]:id[2]], collapse = ""),
options = 'HUGE'), trim = T)} )
return(filing_item2_txt)
} else {
return(list( header = NULL, footnote = NULL ))
}
}
## extract section/item txt
filing.item <- function(x, # filing
loc_item, # the location of the item of interest
item_id, # the identifier from 'href' for the section
filing_qrt, # the quarter the filing was made
table = TRUE, # whether to scrape the table numbers
parts = c("footnote") # the parts of information that you want
) { # extract info from the section/item
if (loc_item[1] == loc_item[2]) {
item_parse <- str_split_fixed(string = x[loc_item[1]:loc_item[2]],
pattern = item_id[1], n = Inf) %>% .[1, ncol(.)]
item_txt <- str_extract(string = item_parse,
pattern = paste0("^(.*?)", item_id[2], collapse = ""))
} else {
# the full item
item_txt <- x[loc_item[1]:loc_item[2]]
}
# find the table(s)
item_html <- read_html(paste0(item_txt, collapse = ""))
item_tbls <- html_nodes(item_html, "table")
item_tbl_id <- grep(pattern = "Total", x = item_tbls, fixed = T)[1] # identify the correct table
## extract the table
if (!is.na(item_tbl_id)) {
##
if (!all(parts %in% c("header", "footnote"))) { # `parts` may hold more than one value
stop("Invalid `parts` variable")
} else {
item_htm2txt <- html_text(item_html, trim = T) # pure text document
filing_item2_txt <- strsplit(x = item_htm2txt, split = (html_text(item_tbls[[item_tbl_id]], trim = F)), fixed = T)[[1]][match(parts, c("header", "footnote"))]
}
### extract the unit information
item_table_unit <- c(na.omit((str_extract(string = item_htm2txt,
pattern = str_extract(html_text(item_html), pattern = "\\(([^()]+)\\)")))))
### <Tables starts here!>
### clean the table
item_table <- unique.matrix(as.matrix(html_table(item_tbls[[item_tbl_id]])))[-1,]
tbl_periods_id <- grep(pattern = '(\\w+\\d{1,2},\\s+\\d{4}|Total|total)', item_table[,1]) # id_row for the periods
tbl_periods <- rep(item_table[tbl_periods_id,1],
times = c(diff(tbl_periods_id), 1)
) # return the periods
tbl_periods[tbl_periods == "Total"] <- filing_qrt # entering the filing quarter
tbl_title <- c("item", item_table[1,][-1])
tbl_numbers <- item_table[-(1:(tbl_periods_id[1]-1)),] %>% # remove the first line
cbind(., "period" =`length<-`(tbl_periods, nrow(.))) %>% # add 'period' column
.[-(tbl_periods_id[which(c(diff(tbl_periods_id), 1) != 1)] - (tbl_periods_id[1]-1)), # clean duplicated rows
c(TRUE, duplicated(tbl_title[-1], incomparables = c(NA, "")), TRUE)] # clean duplicated columns
tbl_numbers <- matrix(str_replace(tbl_numbers,
pattern = "\\$|(\\s*?)\\(\\d\\)",
replacement = ""),
ncol = ncol(tbl_numbers),
dimnames = list(NULL,
c("item",
tbl_title[duplicated(tbl_title[-1], incomparables = c(NA, ""))],
"period")))
### return the cleaned table
tbl_numbers_cleaned <- melt(as_tibble(tbl_numbers), id.vars = c("item", "period")) # as.tibble() is deprecated
return(list(table = tbl_numbers_cleaned,
parts = filing_item2_txt,
table_unit = item_table_unit
))
} else { # if no table in the item
return(list(table = NULL,
parts = NULL,
table_unit = NULL
))
}
}
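The period handling inside `filing.item` is compact, so the sketch below isolates it: rows of the first column that look like a date range (or "Total") are treated as period labels, and each label is repeated down to the next one. The vector `first_col` is a hand-made imitation of the first column of the repurchase table, not output from the function.

```r
# Hypothetical first column of a repurchase table
first_col <- c("March 29, 2015 to May 2, 2015:",
               "Open market and privately negotiated purchases",
               "May 3, 2015 to May 30, 2015:",
               "May 2015 ASR",
               "Open market and privately negotiated purchases",
               "Total")

# rows holding a period label: a "Month day, year" date or the total row
period_rows <- grep("(\\w+\\d{1,2},\\s+\\d{4}|Total|total)", first_col)

# repeat each label until the next label starts
periods <- rep(first_col[period_rows], times = c(diff(period_rows), 1))

data.frame(item = first_col, period = periods)
```

Note that "May 2015 ASR" is not mistaken for a period because the pattern requires a day followed by a comma before the year.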
Function `filing.item` first extracts the item text and checks for the table recording share repurchase information, whose position is stored in `item_tbl_id`. If no table is found, which is checked by `!is.na(item_tbl_id)`, there was no repurchase in the reporting quarter. If a credible table is identified, the function locates the table (`$table`) and separates the part before the table (`$header`) from the part after the table (`$footnote`). The function also finds the unit for the numbers in the table (`$table_unit`).
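The header/footnote separation described above comes down to splitting the item's plain text at the table's own text. A minimal sketch with invented strings (`item_text` and `table_text` are placeholders, not taken from the filing):

```r
# Hypothetical plain text of an item and of the table it contains
item_text  <- "Share repurchase activity was as follows: Period Total 69,551 (1) Footnote text."
table_text <- "Period Total 69,551"

# split at the literal table text; fixed = TRUE disables regex interpretation
pieces <- strsplit(item_text, split = table_text, fixed = TRUE)[[1]]
parts  <- setNames(as.list(trimws(pieces)), c("header", "footnote"))

parts$header
parts$footnote
```

Using `fixed = TRUE` matters here because table text usually contains characters such as `$`, `(` and `)` that would otherwise be read as regular expression syntax.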
For reference, see the 10-K from Apple in 2019 and from Twitter in 2019 and 2021. While Apple in 2019 and Twitter in 2021 both reported share repurchases in their 10-K, Twitter in 2019 did not repurchase and its filing has no section reporting this information. The same goes for the NVIDIA 10-Q in 2021 QTR1.
The original filing on EDGAR is here.
From the Apple 10-Q in 2015 QTR2, I get the cleaned table of repurchase information below:
item2_cleaned <- filing.item(x = filing,
loc_item = loc_item2$loc_item,
item_id = loc_item2$item_id,
filing_qrt = filing_qrt)
gt(data = as.data.frame(item2_cleaned$table)) %>%
tab_footnote(footnote = item2_cleaned$table_unit,
locations = cells_column_labels(columns = value))
item | period | variable | value1 |
---|---|---|---|
Open market and privately negotiated purchases | March 29, 2015 to May 2, 2015: | Total Numberof SharesPurchased | 6,364 |
May 2015 ASR | May 3, 2015 to May 30, 2015: | Total Numberof SharesPurchased | 38,320 |
Open market and privately negotiated purchases | May 3, 2015 to May 30, 2015: | Total Numberof SharesPurchased | 20,190 |
Open market and privately negotiated purchases | May 31, 2015 to June 27, 2015: | Total Numberof SharesPurchased | 4,677 |
Total | QTR2 | Total Numberof SharesPurchased | 69,551 |
Open market and privately negotiated purchases | March 29, 2015 to May 2, 2015: | AveragePrice PaidPer Share | 126.49 |
May 2015 ASR | May 3, 2015 to May 30, 2015: | AveragePrice PaidPer Share | |
Open market and privately negotiated purchases | May 3, 2015 to May 30, 2015: | AveragePrice PaidPer Share | 128.53 |
Open market and privately negotiated purchases | May 31, 2015 to June 27, 2015: | AveragePrice PaidPer Share | 128.28 |
Total | QTR2 | AveragePrice PaidPer Share | |
Open market and privately negotiated purchases | March 29, 2015 to May 2, 2015: | Total Number ofSharesPurchased asPart of PubliclyAnnouncedPlans orPrograms | 6,364 |
May 2015 ASR | May 3, 2015 to May 30, 2015: | Total Number ofSharesPurchased asPart of PubliclyAnnouncedPlans orPrograms | 38,320 |
Open market and privately negotiated purchases | May 3, 2015 to May 30, 2015: | Total Number ofSharesPurchased asPart of PubliclyAnnouncedPlans orPrograms | 20,190 |
Open market and privately negotiated purchases | May 31, 2015 to June 27, 2015: | Total Number ofSharesPurchased asPart of PubliclyAnnouncedPlans orPrograms | 4,677 |
Total | QTR2 | Total Number ofSharesPurchased asPart of PubliclyAnnouncedPlans orPrograms | |
Open market and privately negotiated purchases | March 29, 2015 to May 2, 2015: | ApproximateDollar Value ofShares ThatMay Yet BePurchasedUnder thePlans orPrograms (1) | |
May 2015 ASR | May 3, 2015 to May 30, 2015: | ApproximateDollar Value ofShares ThatMay Yet BePurchasedUnder thePlans orPrograms (1) | |
Open market and privately negotiated purchases | May 3, 2015 to May 30, 2015: | ApproximateDollar Value ofShares ThatMay Yet BePurchasedUnder thePlans orPrograms (1) | |
Open market and privately negotiated purchases | May 31, 2015 to June 27, 2015: | ApproximateDollar Value ofShares ThatMay Yet BePurchasedUnder thePlans orPrograms (1) | |
Total | QTR2 | ApproximateDollar Value ofShares ThatMay Yet BePurchasedUnder thePlans orPrograms (1) | 50,050 |
1 in millions, except number of shares, which are reflected in thousands, and per share amounts |
##
table_var <- as.character(unique(item2_cleaned$table$variable))
item2_cleaned$table_unit
## [1] "in millions, except number of shares, which are reflected in thousands, and per share amounts"
As for the units, it seems that share counts are normally reported in thousands or millions, dollar values in millions, and prices in dollars per share.
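Since the table values are still character strings, a natural next step is to convert them to numbers and apply the unit. The sketch below does this for the share counts shown in the table above; the helper `to_number` is hypothetical, and the factor of 1,000 follows the unit note for this particular filing only.

```r
# Share counts as they appear in the cleaned table (in thousands)
shares_raw <- c("6,364", "38,320", "20,190", "4,677", "69,551")

# Hypothetical helper: drop dollar signs and thousands separators
to_number <- function(x) as.numeric(gsub("[$,]", "", x))

shares <- to_number(shares_raw) * 1e3 # thousands of shares -> shares

# the four period rows should add up to the reported total
sum(shares[1:4]) == shares[5]
```

A check like the last line is a cheap way to catch a misaligned row or a wrongly parsed number before the data go into any analysis.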
In the third line of code, we retain only the `$parts` element with the footnote information and discard the text before the table.
item2_cleaned$parts # here is the default footnote
## [1] " (1) In 2012, the Company\u0092s Board of Directors authorized a program to repurchase up to $10 billion of the Company\u0092s common stock beginning in 2013. TheCompany\u0092s Board of Directors increased the authorization to repurchase the Company\u0092s common stock to $60 billion in April 2013, to $90 billion in April 2014 and to $140 billion in April 2015. As of June 27, 2015, $90 billion of the$140 billion had been utilized. The remaining $50 billion in the table represents the amount available to repurchase shares under the authorized repurchase program as of June 27, 2015. The Company\u0092s share repurchase program does notobligate it to acquire any specific number of shares. Under the program, shares may be repurchased in privately negotiated and/or open market transactions, including under plans complying with Rule 10b5-1 under the Exchange Act. (2) In May 2015, the Company entered into a new accelerated share repurchase arrangement (\u0093ASR\u0094) to purchase up to $6.0 billion of the Company\u0092scommon stock. In exchange for up-front payments totaling $6.0 billion, the financial institutions committed to deliver shares during the ASR\u0092s purchase period, which will end in or before November 2015. The total number of shares ultimatelydelivered, and therefore the average price paid per share, will be determined at the end of the applicable purchase period based on the volume weighted-average price of the Company\u0092s common stock during that period. During the third quarter of2015, 38.3 million net shares were delivered and retired under the May 2015 ASR, and the final number of shares to be delivered will be determined at the conclusion of the purchase period. Item 3."
filing.item(x = filing, loc_item = loc_item2$loc_item, item_id = loc_item2$item_id, filing_qrt = filing_qrt, parts = "footnote")$parts
## [1] " (1) In 2012, the Company\u0092s Board of Directors authorized a program to repurchase up to $10 billion of the Company\u0092s common stock beginning in 2013. TheCompany\u0092s Board of Directors increased the authorization to repurchase the Company\u0092s common stock to $60 billion in April 2013, to $90 billion in April 2014 and to $140 billion in April 2015. As of June 27, 2015, $90 billion of the$140 billion had been utilized. The remaining $50 billion in the table represents the amount available to repurchase shares under the authorized repurchase program as of June 27, 2015. The Company\u0092s share repurchase program does notobligate it to acquire any specific number of shares. Under the program, shares may be repurchased in privately negotiated and/or open market transactions, including under plans complying with Rule 10b5-1 under the Exchange Act. (2) In May 2015, the Company entered into a new accelerated share repurchase arrangement (\u0093ASR\u0094) to purchase up to $6.0 billion of the Company\u0092scommon stock. In exchange for up-front payments totaling $6.0 billion, the financial institutions committed to deliver shares during the ASR\u0092s purchase period, which will end in or before November 2015. The total number of shares ultimatelydelivered, and therefore the average price paid per share, will be determined at the end of the applicable purchase period based on the volume weighted-average price of the Company\u0092s common stock during that period. During the third quarter of2015, 38.3 million net shares were delivered and retired under the May 2015 ASR, and the final number of shares to be delivered will be determined at the conclusion of the purchase period. Item 3."
The running time for parsing one 10-Q filing is:
## Time difference of 0.3136666 secs
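The timing code itself is not shown above. One common way to obtain a "Time difference" result like this is to subtract two `Sys.time()` stamps, as in the sketch below; the workload is a placeholder computation, not the actual parsing call, and this may differ from how the figure above was produced.

```r
start_time <- Sys.time()

# placeholder workload; in practice this would be the call to filing.item()
invisible(sum(sqrt(seq_len(1e5))))

elapsed <- Sys.time() - start_time # a "difftime" object
elapsed
```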