Turning a PDF into a Pandas DataFrame

While looking for some specific NYC school information, the only source I could originally find was in the form of a PDF. Nobody wants to go through a PDF and manually enter a bunch of info, so I decided to see if I could extract the data from the PDF with python. Since I was eventually hoping to merge this info into a larger dataframe, I figured I would go ahead and put the PDF into a pandas DataFrame for easy manipulation.

It turns out that I could not actually use the info from the PDF for a couple of reasons, but the value of extracting data from a PDF was immediately obvious, so I thought I would write down the process for future use (and maybe someone else will find it useful). With this particular PDF, we are lucky in that it is already set up in a table.

Thankfully, the PyPDF2 library already exists to extract text from PDFs, so the heavy lifting has been done. We just have to do some cleaning up. First, make sure you have PyPDF2 installed on your environment, then we will import our libraries.

# import libraries import pandas as pd import PyPDF2

Then we will open the PDF as an object and read it into PyPDF2.

pdfFileObj = open('2017_SREH_School_List.pdf', 'rb') pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

Now we can take a look at the first page of the PDF, by creating an object and then extracting the text (note that the PDF pages are zero-indexed). We can see that its really messy and comes in the form of one really long string, but there is enough order in the chaos with which we can work.

# create page object and extract text pageObj = pdfReader.getPage(0) page1 = pageObj.extractText() page1

Out[]:' \nPage \n1\n \nof \n10\n \n \n \nDBN\n \nSchool Name \n \n(Click to View Webpage) \n \nAddress\n \nSchool Grade \nLevel\n \nPoint of Contact\n \n07X001\n \nP.S. 001 Courtlandt School\n \n335 East 152 Street, Bronx, NY 10451\n \nElementary\n \nstirado2@schools.nyc.gov\n \n07X005\n \nP.S. 5 Port Morris\n \n564 Jackson Avenue, Bronx, NY 10455\n \nK\n-\n8\n \nTCozzocrea@schools.nyc.gov\n \n07X154\n \nP.S. 154 Jonathan D. Hyatt\n \n333 East 135 Street, Bronx, NY 10454\n \nElementary\n \nACoviello@schools.nyc.gov\n \n07X296\n \nSouth Bronx Academy for Applied Media\n \n778 Forest Avenue, Bronx, NY 10456\n \nMiddle School\n \nRAult@schools.nyc.gov\n \n07X381\n....

The first thing I did was to strip away the leading 24 characters, which are the unnecessary page header info. The easiest way I could see to convert this to a dataframe was to turn it into a csv. As with most things in code, there are a few different ways to approach this, but this is what I chose to do:

Find a good place to separate text variables with commas
Create breaks to split the text into separate strings for each school/row of data
Clean up the mess

Looking at the raw printout, it turned out that strings of ‘\n \n’ were intuitive places for commas and leftover ‘\n’ strings could be removed.

# strip away page header page1 = page1[25:]

# insert commas to separate variables and then remove excess strings page1 = page1.replace('\n \n',', ').replace('\n','')

Line breaks will come after one of three strings: ‘Point of Contact’, ‘.org’, or ‘.gov’. I was having issues with some of the email addresses getting truncated, so I inserted a ‘ break’ at the end of these three strings, which would use to, well, break the string into separate lines of text.

# set up line breaks page1 = page1.replace('Point of Contact','Point of Contact break').replace('.org','.org break').replace('.gov', '.gov break') #page1 = page1.replace('.org','.org break') #page1 = page1.replace('.gov', '.gov break')

# split the text at the break page1 = page1.split(' break') page1

Out[]: ['DBN, School Name , (Click to View Webpage) , Address, School Grade Level, Point of Contact', ', 07X001, P.S. 001 Courtlandt School, 335 East 152 Street, Bronx, NY 10451, Elementary, stirado2@schools.nyc.gov', ', 07X005, P.S. 5 Port Morris, 564 Jackson Avenue, Bronx, NY 10455, K-8, TCozzocrea@schools.nyc.gov', ', 07X154, P.S. 154 Jonathan D. Hyatt, 333 East 135 Street, Bronx, NY 10454, Elementary, ACoviello@schools.nyc.gov', ', 07X296, South Bronx Academy for Applied Media, 778 Forest Avenue, Bronx, NY 10456, Middle School, RAult@schools.nyc.gov', ', 07X381, Bronx Haven High School, 333 East 151 Street, Bronx, NY 10451, High school, JRivera7@schools.nyc.gov', ', 07X522, Bronx Design and Construction Academy, 333 East 151 Street, Bronx, NY 10451, High school, JPulphus@schools.nyc.gov',...]

You can see that we get leading commas in our strings, so we will simply strip them away.

# clean up leading commas for i in range(1,len(page1)): page1[i] = page1[i][2:]

Now we have a fairly clean looking set of comma-separated variables. Let’s throw them in a dataframe.

# Create dataframe dbn_df = pd.DataFrame([sub.split(",") for sub in page1]) dbn_df.head()

	0	1	2	3	4	5	6
0	DBN	School Name	(Click to View Webpage)	Address	School Grade Level	Point of Contact	None
1	07X001	P.S. 001 Courtlandt School	335 East 152 Street	Bronx	NY 10451	Elementary	stirado2@schools.nyc.gov
2	07X005	P.S. 5 Port Morris	564 Jackson Avenue	Bronx	NY 10455	K-8	TCozzocrea@schools.nyc.gov
3	07X154	P.S. 154 Jonathan D. Hyatt	333 East 135 Street	Bronx	NY 10454	Elementary	ACoviello@schools.nyc.gov
4	07X296	South Bronx Academy for Applied Media	778 Forest Avenue	Bronx	NY 10456	Middle School	RAult@schools.nyc.gov

Now we just need to clean up the dataframe a bit. First, we can rename the headers and drop the first row. Next, I decided to keep the different elements of the address separated and just to drop the state, since they are all in NY. This seemed more useful to me for future use, but you do you.

# create new header names and drop first row
header = [‘DBN’,’School Name’,’Address’,’City’,’Zip’,’Grade Level’,’Point of Contact’]
dbn_df = dbn_df[1:]
dbn_df.columns = header

# strip NY from zip column
for i in range(1,len(dbn_df.Zip)):
dbn_df[‘Zip’][i] = dbn_df[‘Zip’][i][4:]

	DBN	School Name	Address	City	Zip	Grade Level	Point of Contact
0	07X001	P.S. 001 Courtlandt School	335 East 152 Street	Bronx	10451	Elementary	stirado2@schools.nyc.gov
1	07X005	P.S. 5 Port Morris	564 Jackson Avenue	Bronx	10455	K-8	TCozzocrea@schools.nyc.gov
2	07X154	P.S. 154 Jonathan D. Hyatt	333 East 135 Street	Bronx	10454	Elementary	ACoviello@schools.nyc.gov
3	07X296	South Bronx Academy for Applied Media	778 Forest Avenue	Bronx	10456	Middle School	RAult@schools.nyc.gov
4	07X381	Bronx Haven High School	333 East 151 Street	Bronx	10451	High school	JRivera7@schools.nyc.gov

That’s it! We have turned a PDF page into a pandas dataframe. Now we can just loop through the other pages and append them to the original dataframe.

for i in range(1, pdfReader.numPages): pageObj = pdfReader.getPage(i) page = pageObj.extractText()

page = page[25:] page = page.replace('\n \n',', ') page = page.replace('\n','') page = page.replace('Point of Contact','Point of Contact break') page = page.replace('.org','.org break') page = page.replace('.gov', '.gov break') page = page.split(' break') for i in range(1,len(page)): page[i] = page[i][2:]

# Create dataframe page_df = pd.DataFrame([sub.split(",") for sub in page]) page_df = page_df.iloc[1:,:7] page_df.columns = header for i in range(1,len(page_df.Zip)): page_df['Zip'][i] = page_df['Zip'][i][4:]

# Concat with dbn_df dbn_df = pd.concat([dbn_df,page_df], axis=0, ignore_index=True, sort=False)

That is as far as I’m going to go, but there is definitely still some work to be done. For example, if you take a look at the tail of the dataframe there are two obvious problems.

There is at least one school which must have had an extra comma inserted into the name.
There is an empty row at the bottom of possibly every page.

These are fixable issues, but that’s a matter of cleaning, and, like I said, I can’t actually use this data, so I’m going to stop spending more time on it. I hope someone (including my future self) finds this helpful.

// Header photo by Quintin Gellar from Pexels

Turning a PDF into a Pandas DataFrame

Like this:

Leave a Reply Cancel reply

Share this:

Like this:

Leave a Reply Cancel reply