About six months ago I took a deep dive into becoming a data scientist. Since then, I have been wanting to start documenting my journey, but it is a little intimidating. I don’t really know where to start. I am still new to all of this; what could I possibly have to write about? Well, I’m just going to start with a playful analysis of my kids’ names using the Social Security Administration’s baby names dataset. After all, data science is rooted in data exploration and analysis.
I have created a GitHub repository (since using Git is a part of this learning journey) where I will upload all of the datasets and code that I used for this analysis, so you can explore and share your own analyses. Well, let’s get started by retrieving the datasets. Unfortunately, the baby names dataset is divided into separate comma-delimited text files for each year, but this just gives me a chance to practice automating the process of importing and concatenating CSV files.
First, let’s import some libraries.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
Next, I am going to create a list of years and use that list to concatenate all of the CSV files into one data frame.
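years = list(range(1880, 2017))  # 1880 through 2016
columns = ['Name', 'Sex', 'Count']
babynames = []
for each in years:
    file_name = "yob%s.txt" % each
    # NOTE: header=0 treats the first row of each file as a header and replaces it
    # with `columns`. If the yob files have no header row, that drops the first
    # record of every year; header=None would keep it.
    babyname_df = pd.read_csv(file_name, sep=',', header=0, names=columns)
    babyname_df.insert(0, 'Year', each)  # tag every row with its file's year
    babynames.append(babyname_df)
babynames = pd.concat(babynames, axis=0)
babynames.head()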
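As a variation on the loop above, here is a minimal sketch that discovers the year files on disk instead of hard-coding the range. The helper name `load_babynames` and the tiny demo files are made up for illustration, and the sketch assumes the files have no header row (hence `header=None`).

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_babynames(data_dir):
    """Read every yobYYYY.txt file in data_dir into one data frame."""
    frames = []
    for path in sorted(Path(data_dir).glob("yob*.txt")):
        year = int(path.stem[3:])  # "yob1880" -> 1880
        # Assumption: no header row, so every line is a data record
        frame = pd.read_csv(path, header=None, names=["Name", "Sex", "Count"])
        frame.insert(0, "Year", year)
        frames.append(frame)
    # ignore_index gives the combined frame one clean 0..n-1 index
    return pd.concat(frames, ignore_index=True)


# Tiny made-up files, only to show the function running end to end
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "yob1880.txt").write_text("Anna,F,10\nJohn,M,12\n")
    Path(tmp, "yob1881.txt").write_text("Anna,F,11\n")
    demo = load_babynames(tmp)

print(demo)
```

With the real files, the call would simply point at the folder that holds them.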
Printing the first five rows of the data should return a table like this:
Determining the Questions
This is a very simple, clean dataset, so we don’t need to worry about cleaning it up. However, I needed to figure out what kind of information to look at. I did not want to simply look at the most popular names (the SSA already does that pretty well). Instead, I was curious about the fact that before my daughter, I only ever knew of two women named Maya: Angelou and Rudolph. We named our daughter after the former, believing it to be a beautiful and seemingly unique name, and Maya Angelou was a great poet and activist who had died shortly before our daughter’s birth. (Side note: it also allowed me to tie in my love for astronomy: the “oldest sister” in the Pleiades star cluster is named Maia.) However, when we moved to California a couple of years ago, we started meeting several girls between the ages of 2 and 12 named Maya. Is the name more popular than we realized? Or is this a case of frequency illusion? With regard to our son, Henry, we assumed that the average age of Henrys has to be around 80. Let’s see what we can find out.
The Popularity of Maya
I started by writing a line of code that defined a new data frame with only my daughter’s name. As I started writing the same line of code for my son’s name, I realized that I could easily wrap it in a function that takes a name and a sex. I did the same thing for plotting the data, with a function that takes a list of names so that I can plot several of them in just one line of code. I used these functions to get data on me, my wife, and our parents, but we’ll focus on my kids’ names for the sake of space.
def get_info(name, sex):
    # Rows for a single name and sex across all years
    name_data = babynames[(babynames['Name'] == name) & (babynames['Sex'] == sex)]
    return name_data

def plot_trends(names, sex, color, scale='linear'):
    # Draw one line per name; names, sex, and color are parallel lists
    for idx in range(len(names)):
        info = get_info(names[idx], sex[idx])
        axes = plt.plot(info['Year'], info['Count'], label=names[idx], alpha=0.8, c=color[idx])
    legend = plt.legend(bbox_to_anchor=(.8, .85), bbox_transform=plt.gcf().transFigure)
    plt.yscale(scale)
    return axes

maya = get_info('Maya','F')
henry = get_info('Henry', 'M')
kids = ['Maya','Henry']
plot_trends(kids, ['F','M'], ['blue', 'red'])
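For pulling several names in one call, a rough sketch of one alternative is shown here. The function `get_info_many` and the toy rows are hypothetical, not part of the code above; the idea is to match (name, sex) pairs with an inner merge and then pivot to one column per name.

```python
import pandas as pd

# Made-up rows in the same Year/Name/Sex/Count layout as babynames
toy = pd.DataFrame({
    "Year": [2013, 2013, 2014, 2014, 2014],
    "Name": ["Maya", "Henry", "Maya", "Henry", "Olivia"],
    "Sex": ["F", "M", "F", "M", "F"],
    "Count": [30, 100, 35, 110, 190],
})


def get_info_many(df, names, sexes):
    """Return the rows matching any (name, sex) pair, e.g. ('Maya', 'F')."""
    wanted = pd.DataFrame({"Name": names, "Sex": sexes})
    # An inner merge keeps only rows whose (Name, Sex) pair is in `wanted`
    return df.merge(wanted, on=["Name", "Sex"], how="inner")


kids = get_info_many(toy, ["Maya", "Henry"], ["F", "M"])
# One column per name, indexed by year
kids_wide = kids.pivot(index="Year", columns="Name", values="Count")
print(kids_wide)
```

A wide frame like `kids_wide` can be plotted in one step with its `.plot()` method, which draws one line per column.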
Regarding the name Maya, the first thing that stands out is that it doesn’t appear in the SSA data until 1940. The name is found in lots of cultures, so I’m curious if the introduction of the name into the US was the result of an influx of World War II refugees. Unfortunately, that will remain speculation with this current data set, but maybe something to look into in the future. There has definitely been a steady increase in girls named Maya since the mid-1980s, with a peak just before my daughter was born. This peak, however, is nothing compared to the numbers we are getting from the name Henry. Seems like Henry is making a very strong comeback. Looks like we are going to have a lot of Henrys over the age of 75 and under the age of 10.
It definitely seems like Maya is a relatively unique name. To get a better idea of how unique, let’s plot Maya against the five most popular girls’ names of 2014.
# Girls' names for 2014; rows keep the order they had in that year's file
top5_2014 = babynames[(babynames['Sex'] == 'F') & (babynames['Year'] == 2014)]
# Column 1 is 'Name'; take the first five rows as a plain list
top5_2014 = np.array(top5_2014.iloc[:5,1]).tolist()
Printing the top five names from 2014 should give you:
['Olivia', 'Sophia', 'Isabella', 'Ava', 'Mia']
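Taking the first five rows relies on each year's file already being sorted by count. A small sketch that sorts explicitly with `nlargest` is shown here; the rows are made up, and only the column layout matches the real data.

```python
import pandas as pd

# Made-up rows, deliberately out of order
toy = pd.DataFrame({
    "Year": [2014, 2014, 2014, 2014, 2013],
    "Name": ["Ava", "Olivia", "Henry", "Mia", "Olivia"],
    "Sex": ["F", "F", "M", "F", "F"],
    "Count": [150, 190, 110, 130, 999],
})

girls_2014 = toy[(toy["Sex"] == "F") & (toy["Year"] == 2014)]
# nlargest sorts by Count, so the result does not depend on row order
top2 = girls_2014.nlargest(2, "Count")["Name"].tolist()
print(top2)
```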
Now I will create data frames for each of these names so that we can pull the birth count numbers over time. Then we can plot the data compared to the name Maya. Interestingly enough, if you plot all of the birth counts of these names over the years, you will see that they were all relatively unpopular until the mid-1980s and 1990s, when they all began to climb rapidly. For this reason, I start the plot at 1980 so that we can get a better look.
olivia = get_info('Olivia','F')
sophia = get_info('Sophia','F')
isabella = get_info('Isabella','F')
ava = get_info('Ava','F')
mia = get_info('Mia','F')
# Plot number of Mayas against Top 5
_top = plt.plot(maya['Year'], maya['Count'])
_top = plt.plot(olivia['Year'], olivia['Count'])
_top = plt.plot(sophia['Year'], sophia['Count'])
_top = plt.plot(isabella['Year'], isabella['Count'])
_top = plt.plot(ava['Year'], ava['Count'])
_top = plt.plot(mia['Year'], mia['Count'])
_top = plt.legend(['Maya','Olivia', 'Sophia', 'Isabella', 'Ava', 'Mia'])
_top = plt.xlabel('Year')
_top = plt.ylabel('Number of Births')
_top = plt.title('Maya vs 2014 Top 5 Names')
_top = plt.xlim(1980,2016)  # zoom in on the years where the names take off
plt.show()
We can see from the plot that the name Maya was only about one-fifth as popular as the name Olivia. However, the plot does not give us a very robust idea of the name’s popularity. We could really dive in and start calculating the number of Mayas projected to be living between the ages of 2 and 14, and from that determine the percentage of 2- to 14-year-olds named Maya, but let’s just look at one last calculation to get a better idea of the popularity of the names in 2014.
# Total births per year, and per year and sex, repeated on every matching row
babynames['Total Births by Year'] = babynames.groupby('Year')['Count'].transform('sum')
babynames['Total by Year and Sex'] = babynames.groupby(['Year', 'Sex'])['Count'].transform('sum')
babynames['Pct of Total'] = (babynames['Count'] / babynames['Total Births by Year']) * 100
babynames['Pct by Sex'] = (babynames['Count'] / babynames['Total by Year and Sex']) * 100
# Re-query the names so the new percentage columns are included; the earlier
# olivia, sophia, ... frames were created before these columns existed.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
names_2014 = ['Olivia', 'Sophia', 'Isabella', 'Ava', 'Mia', 'Maya']
top5_2014_merged = pd.concat([get_info(name, 'F') for name in names_2014])
top5_2014_merged[top5_2014_merged['Year'] == 2014]
Despite being ranked the 73rd most popular name in 2014 and accounting for only 0.22% of the babies born, Maya was not much more unique than the top five names, which range from 1.12% to 0.77% of the babies born that year. So it can be argued that Maya is a unique name, but so are all of the other names. We will have to chalk this one up to the frequency illusion bias. A better question might be: is it more unique to give your daughter a name that doesn’t end in ‘a’?
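A rank like the one quoted above can be computed alongside the percentage. The following is only a sketch on made-up counts (not real SSA figures), with a hypothetical `Rank` column added for illustration.

```python
import pandas as pd

# Made-up counts for a single year and sex, not real SSA figures
toy = pd.DataFrame({
    "Year": [2014] * 4,
    "Name": ["Olivia", "Sophia", "Maya", "Zelda"],
    "Sex": ["F"] * 4,
    "Count": [500, 300, 150, 50],
})

groups = toy.groupby(["Year", "Sex"])["Count"]
# Rank 1 is the most common name within each year and sex
toy["Rank"] = groups.rank(method="min", ascending=False).astype(int)
# Share of that year's births of the same sex, as a percentage
toy["Pct by Sex"] = toy["Count"] / groups.transform("sum") * 100

maya_row = toy[toy["Name"] == "Maya"].iloc[0]
print(maya_row["Rank"], maya_row["Pct by Sex"])
```

Using `method="min"` means tied names share the better rank, which is one common convention; other tie rules would shift ranks slightly.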
The Age of Henry
Ok, that took a little longer than I planned, but I asked another question (‘What is the average age of Henry?’), so I have a little more work to do. First, we need to pull in data from the SSA’s Actuarial Life Tables. Then I’m going to merge the life table data with Henry’s data frame and calculate the estimated number of living Henrys for each birth year. To get a visual of the number of Henrys by age, I made this plot, too.
life_table_male = pd.read_csv('LifeTables_M_2017.csv', skiprows=4)
# Match each birth year with its life-table row (the table needs a 'Year' column)
henry_alive = henry.merge(life_table_male, on='Year')
# l(x) is the number of survivors to age x out of 100,000 births
henry_alive['n_alive'] = henry_alive['l(x)']*henry_alive['Count']/(10**5)
_h = plt.plot(henry_alive.x, henry_alive.n_alive)
_h = plt.title('Living Henrys by Age')
_h = plt.xlabel('Age')
_h = plt.ylabel('Number of Living Henrys')
_h = plt.axis([2,115,0,9500])
plt.show()
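To make the survival arithmetic concrete, here is a toy version with made-up numbers. It assumes the life table gives age `x` and survivors `l(x)` per 100,000 births, and it derives a `Year` column from a hypothetical `reference_year`; the actual file may already contain that column.

```python
import pandas as pd

# Made-up birth counts for one name
births = pd.DataFrame({"Year": [1940, 1980, 2010], "Count": [1000, 500, 2000]})

# Made-up life-table rows: x is age in the reference year, l(x) is the
# number of survivors to age x out of 100,000 births
life = pd.DataFrame({"x": [77, 37, 7], "l(x)": [60000, 96000, 99000]})
reference_year = 2017  # assumption: the year the ages refer to
life["Year"] = reference_year - life["x"]  # birth year for each age

alive = births.merge(life, on="Year")
# Scale each year's births by the fraction expected to survive to age x
alive["n_alive"] = alive["Count"] * alive["l(x)"] / 100_000
print(alive[["Year", "x", "n_alive"]])
```

In this toy table, 1,000 births in 1940 with 60% survival leave an estimated 600 people alive at age 77.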
We can see that although there is a large grouping of Henrys aged 50-70, they are dwarfed by the number of Henrys that have been born in the past 10 years. This should have a noticeable effect by dragging the average under the age of 50. Let’s calculate and see:
# Weight each age by the estimated number of living Henrys at that age
henry_alive['rel_age'] = henry_alive.x * henry_alive.n_alive
avg_age = henry_alive.rel_age.sum() / henry_alive.n_alive.sum()
And the average age of Henry is… 38! I guess Henrys could say they are getting younger every day (statistically speaking, of course).
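The calculation above is a weighted average: each age counts in proportion to the number of people estimated to be alive at that age. A short sketch with made-up numbers shows the same result two ways, using NumPy's `np.average` and the manual sum.

```python
import numpy as np
import pandas as pd

# Made-up ages and living counts, not the real Henry data
alive = pd.DataFrame({"x": [77, 37, 7], "n_alive": [600.0, 480.0, 1980.0]})

# Weighted mean of age, weighted by the number alive at each age
avg_age = np.average(alive["x"], weights=alive["n_alive"])

# The same quantity written out by hand
manual = (alive["x"] * alive["n_alive"]).sum() / alive["n_alive"].sum()

print(round(avg_age, 1), round(manual, 1))
```

Because the youngest group is by far the largest in this toy data, the weighted average lands much closer to 7 than to 77, which is the same effect the recent surge of Henrys has on the real figure.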
Well, it looks like I was wrong on both counts: there is not an inordinate number of Mayas being born, and the average age of Henry is not even close to 80. At least I can say that I learned a little something in the process. Actually, I learned a lot, and not just about my kids’ names. In order to create some of this code, I had to look up a few new tricks. It just goes to show that the best way to learn is to just do it. I will be back again soon. I am working on a few other projects, but between studying, work, and family, I am not going to make any promises about timelines.