Example of parsing a PDF

Author: Sarah Cohen
Published: March 9, 2023

This is an example of how to use the pdftools package to parse a simple PDF that has decent markers and decent spacing; it would be far more complex with a badly formatted PDF. This may look complicated, but it actually takes only about 25 lines of code. If you had something similar, you could just change the parts that are specific to this document, such as the empty lines, the garbage lines, and the way the lines are split.

Here is what the original file looks like. It is a text PDF, meaning it doesn’t require optical character recognition. There are 14 pages.

The strategy here is to isolate the rows that are “headings” for the lobbying organization from the rows that are detail lobbyist names. I also want to keep page and row numbers so the results can be checked against the original.

Step 1:

Extract the text from the PDF file using the pdftools library (which must be installed).

I’ve saved the file in my project’s directory. You would need to do the same thing.

# read the PDF into a character vector, one element per page
orig_pdf <- pdftools::pdf_text("2023.pdf")
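If the package isn’t installed yet, install it once from CRAN (run this in the console rather than in the script):

install.packages("pdftools")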

This creates a character vector with one element per page, so there are 14 entries in the vector. Each element’s position corresponds to its page number in the original document.
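One quick way to sanity-check the result:

length(orig_pdf)              # 14, one entry per page
substr(orig_pdf[1], 1, 80)    # peek at the start of page 1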

Step 2:

We can create a data frame out of this vector using the as_tibble() function and set the page as the row number, since each page is still one long character string. Then we can split each page into a vector of lines and “unnest” them back into the data frame.

library(tidyverse)   # dplyr, tidyr, stringr and tibble functions are used below

unlisted_pdf <-
  orig_pdf |>
  as_tibble() |>                         # one row per page, in a column named `value`
  mutate(page = row_number(),            # the row number is the page number
         line = str_split(value, "\n")   # split each page into its lines
  ) |>
  unnest(line) |>                        # one row per line, keeping its page number
  rowid_to_column() |>                   # add a sequential row id for checking
  filter(line != "")                     # drop the empty lines

Here’s a test of what it looks like, sampling 10 rows:
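One way to draw that sample is dplyr’s slice_sample():

unlisted_pdf |>
  slice_sample(n = 10)   # 10 random rows to eyeball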

Now there is a data frame with the columns we need: the page number (page), a sequential row number (rowid), and the text of each line (line). The original full-page value column is still there; we drop it in the next step.

Step 3:

We can get rid of some rows that have no useful information; in this file, those are the repeated page headers and footers:

all_rows <-
  unlisted_pdf |>
  filter(str_detect(line,
                    "(City of Phoenix|Year 2023|Report Generated|\\bPage\\s+\\d+)",
                    negate = TRUE)) |>   # keep only the lines that do NOT match
  select(page, rowid, line)              # drop the full-page `value` column
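It’s worth eyeballing what the filter threw away. A minimal spot check reuses the same pattern without negate = TRUE, so it shows only the discarded rows:

unlisted_pdf |>
  filter(str_detect(line,
                    "(City of Phoenix|Year 2023|Report Generated|\\bPage\\s+\\d+)")) |>
  slice_sample(n = 10)   # a random sample of the rows that were removed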

Step 4:

Now we can parse the rows that are company headings. I’m creating an object called “date_pattern” so that once I’ve tested the regular expression, I know the same pattern is used everywhere. I need to keep the date_updated column as a character value until the end, because the extracted text is reused to strip the date off the organization name; only then is it converted to a real date.

# one or two digits / one or two digits / four digits, at the end of the line
date_pattern <- r"(\d{1,2}/\d{1,2}/\d{4}\s*$)"
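A quick way to test it is to try the pattern on a made-up heading line and a made-up detail line (both strings below are invented for illustration, not taken from the file):

str_detect(c("Acme Consulting LLC        1/15/2023", "Jane Doe"), date_pattern)   # TRUE FALSE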

library(lubridate)   # for mdy(); the DT package must also be installed for the table below

# now get the values
all_rows |>
  mutate(is_co_header = str_detect(line, date_pattern),                   # heading rows end in a date
         date_updated = str_extract(line, date_pattern),                  # pull that date out as text
         lobbying_org = str_trim(str_replace(line, date_updated, "")),    # strip the date off (NA on detail rows)
         lobbyist = if_else(is_co_header == FALSE, line, NA_character_),  # detail rows are lobbyist names
         date_updated = mdy(date_updated)                                 # now convert the text to a real date
  ) |>
  fill(c(date_updated, lobbying_org), .direction = "down") |>   # copy heading info down onto detail rows
  filter(is_co_header == FALSE) |>                              # keep only the detail rows
  select(date_updated, lobbying_org, lobbyist) |>
  DT::datatable()

You would need to do a little checking to make sure that the regular expression didn’t exclude something important, and that there were no glitches, especially where a listing continues across a page break. But generally this looks right.
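One way to do that cross-page check is to keep the page and rowid columns and look at the first detail row that survives on each page. This sketch re-runs the same pipeline but saves it to an object (the name parsed is just an illustration):

parsed <-
  all_rows |>
  mutate(is_co_header = str_detect(line, date_pattern),
         date_updated = str_extract(line, date_pattern),
         lobbying_org = str_trim(str_replace(line, date_updated, "")),
         lobbyist = if_else(is_co_header == FALSE, line, NA_character_),
         date_updated = mdy(date_updated)) |>
  fill(c(date_updated, lobbying_org), .direction = "down") |>
  filter(is_co_header == FALSE)

# the first lobbyist row on each page should still carry the organization
# from the previous page's heading when a listing continues across pages
parsed |>
  group_by(page) |>
  slice_head(n = 1) |>
  select(page, rowid, lobbying_org, lobbyist)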