Turning a PDF into a Pandas DataFrame

While looking for some specific NYC school information, the only source I could originally find was in the form of a PDF. Nobody wants to go through a PDF and manually enter a bunch of info, so I decided to see if I could extract the data from the PDF with Python. Since I was eventually hoping to merge this info into a larger dataframe, I figured I would go ahead and put the PDF into a pandas DataFrame for easy manipulation.

It turns out that I could not actually use the info from the PDF for a couple of reasons, but the value of extracting data from a PDF was immediately obvious, so I thought I would write down the process for future use (and maybe someone else will find it useful). With this particular PDF, we are lucky in that it is already set up in a table.

[Image: the SREH school list PDF, already laid out as a table]

Thankfully, the PyPDF2 library already exists to extract text from PDFs, so the heavy lifting has been done. We just have to do some cleaning up. First, make sure you have PyPDF2 installed on your environment, then we will import our libraries.
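If it isn't installed yet, a quick pip install does the trick (swap in conda or whatever package manager fits your setup):

```shell
# install PyPDF2 into the current environment
pip install PyPDF2
```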

# import libraries
import pandas as pd
import PyPDF2

Then we will open the PDF as an object and read it into PyPDF2.

pdfFileObj = open('2017_SREH_School_List.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

Now we can take a look at the first page of the PDF by creating a page object and then extracting the text (note that the PDF pages are zero-indexed). We can see that it's really messy and comes in the form of one really long string, but there is enough order in the chaos for us to work with.

# create page object and extract text
pageObj = pdfReader.getPage(0)
page1 = pageObj.extractText()

Out[]:' \nPage \n1\n \nof \n10\n \n \n \nDBN\n \nSchool Name \n \n(Click to View Webpage) \n \nAddress\n \nSchool Grade \nLevel\n \nPoint of Contact\n \n07X001\n \nP.S. 001 Courtlandt School\n \n335 East 152 Street, Bronx, NY 10451\n \nElementary\n \nstirado2@schools.nyc.gov\n \n07X005\n \nP.S. 5 Port Morris\n \n564 Jackson Avenue, Bronx, NY 10455\n \nK\n-\n8\n \nTCozzocrea@schools.nyc.gov\n \n07X154\n \nP.S. 154 Jonathan D. Hyatt\n \n333 East 135 Street, Bronx, NY 10454\n \nElementary\n \nACoviello@schools.nyc.gov\n \n07X296\n \nSouth Bronx Academy for Applied Media\n \n778 Forest Avenue, Bronx, NY 10456\n \nMiddle School\n \nRAult@schools.nyc.gov\n \n07X381\n....

The first thing I did was to strip away the leading 25 characters, which are the unnecessary page header info. The easiest way I could see to convert this to a dataframe was to turn it into a CSV. As with most things in code, there are a few different ways to approach this, but this is what I chose to do:

  1. Find a good place to separate text variables with commas
  2. Create breaks to split the text into separate strings for each school/row of data
  3. Clean up the mess

Looking at the raw printout, it turned out that strings of ‘\n \n’ were intuitive places for commas and leftover ‘\n’ strings could be removed.

# strip away page header
page1 = page1[25:]

# insert commas to separate variables and then remove excess strings
page1 = page1.replace('\n \n',', ').replace('\n','')

Line breaks will come after one of three strings: ‘Point of Contact’, ‘.org’, or ‘.gov’. I was having issues with some of the email addresses getting truncated, so I inserted a ‘ break’ at the end of these three strings, which I would then use to, well, break the text into separate lines, one per school.

# set up line breaks
page1 = page1.replace('Point of Contact', 'Point of Contact break')
page1 = page1.replace('.org', '.org break')
page1 = page1.replace('.gov', '.gov break')

# split the text at the break
page1 = page1.split(' break')

Out[]: ['DBN, School Name , (Click to View Webpage) , Address, School Grade Level, Point of Contact',
', 07X001, P.S. 001 Courtlandt School, 335 East 152 Street, Bronx, NY 10451, Elementary, stirado2@schools.nyc.gov',
', 07X005, P.S. 5 Port Morris, 564 Jackson Avenue, Bronx, NY 10455, K-8, TCozzocrea@schools.nyc.gov',
', 07X154, P.S. 154 Jonathan D. Hyatt, 333 East 135 Street, Bronx, NY 10454, Elementary, ACoviello@schools.nyc.gov',
', 07X296, South Bronx Academy for Applied Media, 778 Forest Avenue, Bronx, NY 10456, Middle School, RAult@schools.nyc.gov',
', 07X381, Bronx Haven High School, 333 East 151 Street, Bronx, NY 10451, High school, JRivera7@schools.nyc.gov',
', 07X522, Bronx Design and Construction Academy, 333 East 151 Street, Bronx, NY 10451, High school, JPulphus@schools.nyc.gov',...]

You can see that we get leading commas in our strings, so we will simply strip them away.

# clean up leading commas
for i in range(1, len(page1)):
    page1[i] = page1[i][2:]

Now we have a fairly clean looking set of comma-separated variables. Let’s throw them in a dataframe.

# Create dataframe
dbn_df = pd.DataFrame([sub.split(",") for sub in page1])

   0       1                                      2                        3        4                   5                 6
0  DBN     School Name                            (Click to View Webpage)  Address  School Grade Level  Point of Contact  None
1  07X001  P.S. 001 Courtlandt School             335 East 152 Street      Bronx    NY 10451            Elementary        stirado2@schools.nyc.gov
2  07X005  P.S. 5 Port Morris                     564 Jackson Avenue       Bronx    NY 10455            K-8               TCozzocrea@schools.nyc.gov
3  07X154  P.S. 154 Jonathan D. Hyatt             333 East 135 Street      Bronx    NY 10454            Elementary        ACoviello@schools.nyc.gov
4  07X296  South Bronx Academy for Applied Media  778 Forest Avenue        Bronx    NY 10456            Middle School     RAult@schools.nyc.gov

Now we just need to clean up the dataframe a bit. First, we can rename the headers and drop the first row. Next, I decided to keep the different elements of the address separated and just drop the state, since they are all in NY. This seemed more useful to me for future use, but you do you.

# create new header names and drop first row
header = ['DBN', 'School Name', 'Address', 'City', 'Zip', 'Grade Level', 'Point of Contact']
dbn_df = dbn_df[1:]
dbn_df.columns = header

# strip the leading ' NY ' from the Zip column
dbn_df['Zip'] = dbn_df['Zip'].str[4:]

   DBN     School Name                            Address              City   Zip    Grade Level    Point of Contact
0  07X001  P.S. 001 Courtlandt School             335 East 152 Street  Bronx  10451  Elementary     stirado2@schools.nyc.gov
1  07X005  P.S. 5 Port Morris                     564 Jackson Avenue   Bronx  10455  K-8            TCozzocrea@schools.nyc.gov
2  07X154  P.S. 154 Jonathan D. Hyatt             333 East 135 Street  Bronx  10454  Elementary     ACoviello@schools.nyc.gov
3  07X296  South Bronx Academy for Applied Media  778 Forest Avenue    Bronx  10456  Middle School  RAult@schools.nyc.gov
4  07X381  Bronx Haven High School                333 East 151 Street  Bronx  10451  High school    JRivera7@schools.nyc.gov

That’s it! We have turned a PDF page into a pandas dataframe. Now we can just loop through the other pages and append them to the original dataframe.

for i in range(1, pdfReader.numPages):
    pageObj = pdfReader.getPage(i)
    page = pageObj.extractText()

    page = page[25:]
    page = page.replace('\n \n', ', ')
    page = page.replace('\n', '')
    page = page.replace('Point of Contact', 'Point of Contact break')
    page = page.replace('.org', '.org break')
    page = page.replace('.gov', '.gov break')
    page = page.split(' break')
    for j in range(1, len(page)):
        page[j] = page[j][2:]

    # Create dataframe
    page_df = pd.DataFrame([sub.split(",") for sub in page])
    page_df = page_df.iloc[1:, :7]
    page_df.columns = header

    # strip the leading ' NY ' from the Zip column
    page_df['Zip'] = page_df['Zip'].str[4:]

    # Concat with dbn_df
    dbn_df = pd.concat([dbn_df, page_df], axis=0, ignore_index=True, sort=False)

That is as far as I’m going to go, but there is definitely still some work to be done. For example, if you take a look at the tail of the dataframe there are two obvious problems.

  1. There is at least one school which must have had an extra comma inserted into the name.
  2. There is an empty row at the bottom of possibly every page.

These are fixable issues, but that's a matter of cleaning, and, like I said, I can't actually use this data, so I'm not going to spend any more time on it. I hope someone (including my future self) finds this helpful.
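If you do want to finish the job, here is a minimal sketch of how those two cleanups might look. The toy frame below is made up to mimic the problems (the ‘07X999’ row and its contents are invented, not from the real PDF): one blank row, and one school whose comma-containing name pushed every later field one column to the right.

```python
import pandas as pd

# toy frame mimicking the two problems: a blank row left at the
# bottom of a page, and a name with a comma that shifted its fields
df = pd.DataFrame({
    'DBN': ['07X001', '', '07X999'],
    'School Name': ['P.S. 001 Courtlandt School', '', 'Academy of Math'],
    'Address': ['335 East 152 Street', '', ' Science and Art'],
    'City': ['Bronx', '', '123 Main Street'],
    'Zip': ['10451', '', 'Bronx'],
    'Grade Level': ['Elementary', '', '10455'],
    'Point of Contact': ['stirado2@schools.nyc.gov', '', 'Middle School'],
})

# 1) drop the empty rows: a real school always has a DBN
df = df[df['DBN'].str.strip() != ''].reset_index(drop=True)

# 2) flag the shifted rows: the contact column should hold an
#    email address, so anything without an '@' is suspect
shifted = df[~df['Point of Contact'].str.contains('@')]
```

Once the shifted rows are flagged, actually repairing them means re-joining the split name pieces row by row, which is easy enough to do by hand for a handful of schools.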

// Header photo by Quintin Gellar from Pexels
