<- pdftools::pdf_text("2023.pdf") orig_pdf
Example of parsing a PDF
This is an example of how to use pdf_tools to parse a simple pdf that has decent markers and decent spacing – it would be far more complex if you had a bad pdf. This may look complex, but it actually only has about 25 lines of code. If you had something similar, you could just change the parts that are specific to this document, such as the empty lines, the garbage lines, and the way to split the lines.
Here is what the original file looks like. It is a text PDF, meaning it doesn’t require character recognition. There are 14 pages.
The strategy here is to isolate the rows that are “headings” for the lobbyist organization from the rows that are detail lobbyist names. I also want to make sure that we keep page and row numbers so it can be checked.
Step 1
Extract the text from the pdf file using pdftools
library (must be installed)
I’ve saved the file in my project’s directory. You would need to do the same thing.
This creates a character vector of one element per page, so there are 14 entries into the vector. The names of each item is the position in the vector, which correspond to the pages in the original.
Step 2:
We can create a data frame out this vector using the as_tibble()
function, and set the page as the row number since each page is still one long character string. Then we can split the pages into a vector of lines in each page, and “unnest” them back into the dataframe.
<-
unlisted_pdf |>
orig_pdf as_tibble( ) |>
mutate ( page = row_number(),
line = str_split( value, "\n")
|>
) unnest( line) |>
rowid_to_column() |>
filter ( line != "")
Here’s a test of what it looks like, sampling 10 rows:
Now there is a data frame, with three columns: The page number, the row number, and the value for that row.
Step 3:
We can get rid of some rows that have no good information:
<-
all_rows |>
unlisted_pdf filter ( str_detect ( line,
"(City of Phoenix|Year 2023|Report Generated|\\bPage\\s+\\d+)",
negate=TRUE)) |>
select ( page, rowid, line)
Step 4:
Now that we can parse the rows that are company headings. Creating an object called “date_pattern” so that if I test the regular expression, I know it works everywhere. I need to use the date_updated column as a character value, so it isn’t converted to a date until the end.
<- r"(\d{1,2}/\d{1,2}/\d{4}\s*$)"
date_pattern
# now get the values
|>
all_rows mutate ( is_co_header = str_detect ( line, date_pattern) ,
date_updated = str_extract( line, date_pattern) ,
lobbying_org = str_trim( str_replace ( line, date_updated, "") ),
lobbyist = if_else ( is_co_header == FALSE, line, NA_character_),
date_updated = mdy(date_updated)
|>
) fill ( c(date_updated, lobbying_org), .direction="down") |>
filter ( is_co_header == FALSE) |>
select ( date_updated, lobbying_org, lobbyist) |>
::datatable( ) DT
You would need to do a little checking to make sure that the regular expression didn’t exclude something important, or that there were no glitches, especially between pages. But generally this looks right.