While looking for some specific NYC school information, the only source I could originally find was in the form of a PDF. Nobody wants to go through a PDF and manually enter a bunch of info, so I decided to see if I could extract the data from the PDF with python. Since I was eventually hoping to merge this info into a larger dataframe, I figured I would go ahead and put the PDF into a pandas DataFrame for easy manipulation.
It turns out that I could not actually use the info from the PDF for a couple of reasons, but the value of extracting data from a PDF was immediately obvious, so I thought I would write down the process for future use (and maybe someone else will find it useful). With this particular PDF, we are lucky in that it is already set up in a table.
Thankfully, the PyPDF2 library already exists to extract text from PDFs, so the heavy lifting has been done. We just have to do some cleaning up. First, make sure you have PyPDF2 installed in your environment, then we will import our libraries.
# import libraries
import pandas as pd
import PyPDF2
Then we will open the PDF as an object and read it into PyPDF2. (Note: newer releases renamed the package to pypdf and changed the API — PdfReader, .pages[n], .extract_text() — but the code below uses the classic PyPDF2 interface.)
pdfFileObj = open('2017_SREH_School_List.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
Now we can take a look at the first page of the PDF by creating a page object and then extracting the text (note that the PDF pages are zero-indexed). We can see that it's really messy and comes in the form of one really long string, but there is enough order in the chaos to work with.
# create page object and extract text
pageObj = pdfReader.getPage(0)
page1 = pageObj.extractText()
page1
Out[]:' \nPage \n1\n \nof \n10\n \n \n \nDBN\n \nSchool Name \n \n(Click to View Webpage) \n \nAddress\n \nSchool Grade \nLevel\n \nPoint of Contact\n \n07X001\n \nP.S. 001 Courtlandt School\n \n335 East 152 Street, Bronx, NY 10451\n \nElementary\n \nstirado2@schools.nyc.gov\n \n07X005\n \nP.S. 5 Port Morris\n \n564 Jackson Avenue, Bronx, NY 10455\n \nK\n-\n8\n \nTCozzocrea@schools.nyc.gov\n \n07X154\n \nP.S. 154 Jonathan D. Hyatt\n \n333 East 135 Street, Bronx, NY 10454\n \nElementary\n \nACoviello@schools.nyc.gov\n \n07X296\n \nSouth Bronx Academy for Applied Media\n \n778 Forest Avenue, Bronx, NY 10456\n \nMiddle School\n \nRAult@schools.nyc.gov\n \n07X381\n....
The first thing I did was to strip away the leading 25 characters, which are the unnecessary page header info. The easiest way I could see to convert this to a dataframe was to turn it into a csv. As with most things in code, there are a few different ways to approach this, but this is what I chose to do:
- Find a good place to separate text variables with commas
- Create breaks to split the text into separate strings for each school/row of data
- Clean up the mess
Looking at the raw printout, it turned out that strings of ‘\n \n’ were intuitive places for commas and leftover ‘\n’ strings could be removed.
# strip away page header
page1 = page1[25:]
# insert commas to separate variables and then remove excess strings
page1 = page1.replace('\n \n',', ').replace('\n','')
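To make the two replacements concrete, here is the same substitution applied to a small hand-made snippet shaped like the raw extracted text (the sample string is my own, not from the PDF):

```python
# A hypothetical fragment in the shape of the raw output:
# fields are delimited by '\n \n', with stray '\n' as noise.
sample = 'DBN\n \nSchool Name\n \n07X001\n \nP.S. 001 Courtlandt School'

# '\n \n' marks a field boundary -> comma; any leftover '\n' is dropped.
cleaned = sample.replace('\n \n', ', ').replace('\n', '')
print(cleaned)  # DBN, School Name, 07X001, P.S. 001 Courtlandt School
```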
Line breaks will come after one of three strings: ‘Point of Contact’, ‘.org’, or ‘.gov’. I was having issues with some of the email addresses getting truncated, so I inserted a ‘ break’ at the end of these three strings, which I would then use to, well, break the string into separate lines of text.
# set up line breaks
page1 = page1.replace('Point of Contact','Point of Contact break').replace('.org','.org break').replace('.gov', '.gov break')
# split the text at the break
page1 = page1.split(' break')
page1
Out[]:
['DBN, School Name , (Click to View Webpage) , Address, School Grade Level, Point of Contact',
', 07X001, P.S. 001 Courtlandt School, 335 East 152 Street, Bronx, NY 10451, Elementary, stirado2@schools.nyc.gov',
', 07X005, P.S. 5 Port Morris, 564 Jackson Avenue, Bronx, NY 10455, K-8, TCozzocrea@schools.nyc.gov',
', 07X154, P.S. 154 Jonathan D. Hyatt, 333 East 135 Street, Bronx, NY 10454, Elementary, ACoviello@schools.nyc.gov',
', 07X296, South Bronx Academy for Applied Media, 778 Forest Avenue, Bronx, NY 10456, Middle School, RAult@schools.nyc.gov',
', 07X381, Bronx Haven High School, 333 East 151 Street, Bronx, NY 10451, High school, JRivera7@schools.nyc.gov',
', 07X522, Bronx Design and Construction Academy, 333 East 151 Street, Bronx, NY 10451, High school, JPulphus@schools.nyc.gov',...]
You can see that we get leading commas in our strings, so we will simply strip them away.
# clean up leading commas
for i in range(1,len(page1)):
    page1[i] = page1[i][2:]
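If you prefer not to mutate the list in place, the same strip can be written as a comprehension. A sketch on a hypothetical sample in the same shape (keep the header line, drop the leading ', ' from the rest):

```python
# Two lines shaped like the split output: a header, then a row
# with the leading ', ' left over from the split.
lines = ['DBN, School Name', ', 07X001, P.S. 001 Courtlandt School']
lines = [lines[0]] + [line[2:] for line in lines[1:]]
print(lines)  # ['DBN, School Name', '07X001, P.S. 001 Courtlandt School']
```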
Now we have a fairly clean looking set of comma-separated variables. Let’s throw them in a dataframe.
# Create dataframe
dbn_df = pd.DataFrame([sub.split(",") for sub in page1])
dbn_df.head()
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | DBN | School Name | (Click to View Webpage) | Address | School Grade Level | Point of Contact | None |
| 1 | 07X001 | P.S. 001 Courtlandt School | 335 East 152 Street | Bronx | NY 10451 | Elementary | stirado2@schools.nyc.gov |
| 2 | 07X005 | P.S. 5 Port Morris | 564 Jackson Avenue | Bronx | NY 10455 | K-8 | TCozzocrea@schools.nyc.gov |
| 3 | 07X154 | P.S. 154 Jonathan D. Hyatt | 333 East 135 Street | Bronx | NY 10454 | Elementary | ACoviello@schools.nyc.gov |
| 4 | 07X296 | South Bronx Academy for Applied Media | 778 Forest Avenue | Bronx | NY 10456 | Middle School | RAult@schools.nyc.gov |
Now we just need to clean up the dataframe a bit. First, we can rename the headers and drop the first row. Next, I decided to keep the different elements of the address separated and just to drop the state, since they are all in NY. This seemed more useful to me for future use, but you do you.
# create new header names and drop first row
header = ['DBN','School Name','Address','City','Zip','Grade Level','Point of Contact']
dbn_df = dbn_df[1:]
dbn_df.columns = header
# strip ' NY ' from zip column
for i in dbn_df.index:
    dbn_df.loc[i, 'Zip'] = dbn_df.loc[i, 'Zip'][4:]
|   | DBN | School Name | Address | City | Zip | Grade Level | Point of Contact |
|---|---|---|---|---|---|---|---|
| 0 | 07X001 | P.S. 001 Courtlandt School | 335 East 152 Street | Bronx | 10451 | Elementary | stirado2@schools.nyc.gov |
| 1 | 07X005 | P.S. 5 Port Morris | 564 Jackson Avenue | Bronx | 10455 | K-8 | TCozzocrea@schools.nyc.gov |
| 2 | 07X154 | P.S. 154 Jonathan D. Hyatt | 333 East 135 Street | Bronx | 10454 | Elementary | ACoviello@schools.nyc.gov |
| 3 | 07X296 | South Bronx Academy for Applied Media | 778 Forest Avenue | Bronx | 10456 | Middle School | RAult@schools.nyc.gov |
| 4 | 07X381 | Bronx Haven High School | 333 East 151 Street | Bronx | 10451 | High school | JRivera7@schools.nyc.gov |
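A row-by-row loop works, but pandas can do the same slice column-wide with the .str accessor, which also avoids chained-assignment warnings. A drop-in alternative, sketched on a hypothetical frame with the same shape as the Zip column after the split (values look like ' NY 10451', so dropping the first four characters leaves the five-digit zip):

```python
import pandas as pd

# Hypothetical frame shaped like the Zip column right after the split.
df = pd.DataFrame({'Zip': [' NY 10451', ' NY 10455']})

# Slice every value at once instead of looping.
df['Zip'] = df['Zip'].str[4:]
print(df['Zip'].tolist())  # ['10451', '10455']
```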
That’s it! We have turned a PDF page into a pandas dataframe. Now we can just loop through the other pages and append them to the original dataframe.
for i in range(1, pdfReader.numPages):
    pageObj = pdfReader.getPage(i)
    page = pageObj.extractText()
    # strip away page header
    page = page[25:]
    # insert commas, remove excess strings, and split on the breaks
    page = page.replace('\n \n',', ')
    page = page.replace('\n','')
    page = page.replace('Point of Contact','Point of Contact break')
    page = page.replace('.org','.org break')
    page = page.replace('.gov', '.gov break')
    page = page.split(' break')
    # clean up leading commas (use j here so we don't clobber the page counter i)
    for j in range(1, len(page)):
        page[j] = page[j][2:]
    # Create dataframe
    page_df = pd.DataFrame([sub.split(",") for sub in page])
    page_df = page_df.iloc[1:,:7]
    page_df.columns = header
    # strip ' NY ' from zip column, skipping malformed rows where Zip is missing
    for j in page_df.index:
        if isinstance(page_df.loc[j, 'Zip'], str):
            page_df.loc[j, 'Zip'] = page_df.loc[j, 'Zip'][4:]
    # Concat with dbn_df
    dbn_df = pd.concat([dbn_df, page_df], axis=0, ignore_index=True, sort=False)
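Since every page gets exactly the same treatment, the cleaning steps can be collected into one helper that takes the raw text of a page and returns clean row strings. This is a sketch of the same logic shown above; the 25-character header length is taken from this particular PDF and may differ for others:

```python
def page_text_to_rows(text, header_len=25):
    """Turn the raw extracted text of one PDF page into a list of
    comma-separated row strings (same steps as in the post)."""
    text = text[header_len:]                      # drop the page header
    text = text.replace('\n \n', ', ').replace('\n', '')
    # mark the end of each row, then split on the marker
    for marker in ('Point of Contact', '.org', '.gov'):
        text = text.replace(marker, marker + ' break')
    rows = text.split(' break')
    # drop the leading ', ' the split leaves on every row but the first
    return [rows[0]] + [r[2:] for r in rows[1:]]
```

With this in place, the page loop reduces to calling the helper, building a DataFrame from each result, and concatenating.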
That is as far as I’m going to go, but there is definitely still some work to be done. For example, if you take a look at the tail of the dataframe there are two obvious problems.
- There is at least one school which must have had an extra comma inserted into the name.
- There is an empty row at the bottom of possibly every page.
These are fixable issues, but that’s a matter of cleaning, and, like I said, I can’t actually use this data, so I’m going to stop spending more time on it. I hope someone (including my future self) finds this helpful.
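For anyone who does want to continue, the empty rows at least are a quick fix: keep only rows that actually have a DBN. A sketch against a hypothetical frame in the same shape as dbn_df (the sample values are mine):

```python
import pandas as pd

# Hypothetical frame shaped like dbn_df, with the kind of empty
# row that the page breaks leave behind.
df = pd.DataFrame({'DBN': ['07X001', ''],
                   'School Name': ['P.S. 001 Courtlandt School', None]})

# Keep only rows with a non-blank DBN.
df = df[df['DBN'].astype(str).str.strip().ne('')]
print(len(df))  # 1
```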
// Header photo by Quintin Gellar from Pexels