Let’s be honest: data dictionaries are not something that comes up often in conversation. They just aren’t as interesting as building models, creating engaging visualizations, or even building ETL pipelines. They are often something we do only when we feel we have to. A good data dictionary, however, can be an essential component of data exploration and thus of visualization and model building. It turns the hard work of ETL pipelines into a user-friendly guide. It can provide insight into poorly named variables, flag deviations from expected values, and tell us whom to ask when we have questions about the source of truth for a given variable. And while a dictionary might be time-consuming to create, it can save someone else (including future you) a ton of time when the dataset and project are revisited down the road.
What goes in a data dictionary?
So for this week’s dataset, I thought I would build a simple data dictionary, limited to the following information:
- Last Updated: the date of the most recent update; this helps give a sense of whether the dataset is likely out of date
- Document Owner: this is the name of the person who is responsible for keeping the document up to date.
- Term: the name given to the variable/column in the dataset
- Term Owner: the person who can answer any questions you may have about a specific term. This may be the person who originally defined the term or simply the person who knows the most about the table/domain from which it was pulled.
- Data Type: the form the variable is expected to take. You can define this by the actual data type (int8, float64, etc. in Python libraries such as NumPy and pandas), but I prefer to give a more informal definition (integer, category, string).
- Description: a brief description of what the data is depicting. This can be VERY helpful when you have a poorly named variable, and this dataset has plenty.
You can see the full dictionary here.
You may also consider adding a column for data range, containing minimum and maximum values for continuous variables or the full set of values for categorical variables. Some other useful columns include the percentage of missing data, summary statistics, originating source, and a sensitive data flag. A data dictionary can be as detailed or simplified as you need. Data dictionaries can be living documents (hence the last updated feature), and I think the best things to add are the ones you discover and think to yourself, “I wish I knew this when I started.”
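If the data lives in a pandas DataFrame, a first draft of several of these columns can be generated automatically. The following is a minimal sketch rather than the code behind the dictionary linked above: the function names (`informal_type`, `draft_data_dictionary`), the `max_categories` cutoff, and the sample data are all assumptions made for illustration. Term Owner and Description are left blank on purpose, since only a person can fill those in.

```python
import pandas as pd


def informal_type(series: pd.Series) -> str:
    """Map a pandas dtype to an informal, reader-friendly label."""
    if pd.api.types.is_bool_dtype(series):
        return "boolean"
    if pd.api.types.is_integer_dtype(series):
        return "integer"
    if pd.api.types.is_float_dtype(series):
        return "float"
    if pd.api.types.is_datetime64_any_dtype(series):
        return "date"
    return "string"


def draft_data_dictionary(df: pd.DataFrame, max_categories: int = 10) -> pd.DataFrame:
    """Build a skeleton data dictionary with one row per column of df."""
    rows = []
    for term in df.columns:
        col = df[term]
        kind = informal_type(col)
        if kind in ("integer", "float"):
            # Continuous variables: report minimum and maximum
            value_range = f"{col.min()} to {col.max()}"
        elif col.nunique() <= max_categories:
            # Few distinct values: treat text as a category and list the full set
            kind = "category" if kind == "string" else kind
            value_range = ", ".join(sorted(col.dropna().astype(str).unique()))
        else:
            value_range = f"{col.nunique()} distinct values"
        rows.append({
            "Term": term,
            "Data Type": kind,
            "Range": value_range,
            "Percent Missing": round(col.isna().mean() * 100, 1),
            "Term Owner": "",   # to be filled in by hand
            "Description": "",  # to be filled in by hand
        })
    return pd.DataFrame(rows)


# Made-up sample data, not taken from the Design Census
sample = pd.DataFrame({
    "age": [25, 31, 47, 38],
    "salary": [55000.0, None, 82000.0, 61000.0],
    "job_type": ["Full-time", "Freelance", "Full-time", None],
})
dictionary = draft_data_dictionary(sample)
print(dictionary.to_string(index=False))
```

The generated table is only a starting point. The range and missing-data columns are cheap to recompute whenever the dataset changes, which pairs well with the Last Updated field.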
Creating a dictionary when you don’t know the data
Since I didn’t create this dataset (and the variable names are less than intuitive), I went through each variable and plotted it to get a better sense of distributions and definitions. I have included a few plots below, but if you want to see the full exploration and all of my very uncreative graphs, you can check out the notebook.
The only mildly creative visualization in this project was this map of where designers live, because I never get to create maps at work (heatmaps don’t count) and I really like making maps. The map is focused on the Bay Area, but the dataset covers the U.S., so you can move around to see other parts of the country. You can also zoom in or out to change how the designers are grouped geographically. No surprise that we see large groupings on the East and West Coasts.
I’m not sure this waffle plot qualifies as creative, but it is rather colorful. Respondents are 17% more White, 59% more Asian, 75% less Black, and 57% less Latinx than the country as a whole.
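Percentages like these are easiest to read as relative differences: how far a group’s share among respondents sits above or below its share of the national population, expressed as a percentage of the national share. That reading is an assumption on my part, and the helper below (`relative_difference`, a hypothetical name) uses made-up shares rather than the survey or census figures.

```python
def relative_difference(sample_share: float, population_share: float) -> float:
    """Percent by which sample_share is above (+) or below (-) population_share."""
    if population_share == 0:
        raise ValueError("population_share must be non-zero")
    return (sample_share - population_share) / population_share * 100


# Illustrative numbers only: a group that makes up 30% of respondents
# but 20% of the population is over-represented by 50%.
print(relative_difference(30, 20))  # 50.0

# A group that makes up 5% of respondents but 20% of the population
# is under-represented by 75%.
print(relative_difference(5, 20))   # -75.0
```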
The curse of bad feature names
To bring home the need for a data dictionary (and the need to properly name your features), I thought I would focus on the four features all named “I work in”. The name tells us nothing about the data it contains. I work in California? I work in marketing? I work in a cubicle? I don’t know. Let’s see.
The first (“I work in:”) lists the field(s) in which the designer works (e.g., graphic design, UX, education). Designers could give more than one response to this question, so you could dig further to see how these fields are related.
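Counting answers in a multi-response column takes one extra step, because each cell can hold several values. Here is a rough sketch in pandas, assuming the answers in a cell are separated by a `|` character; both the delimiter and the sample responses are assumptions, so check the raw file before relying on them.

```python
import pandas as pd

# Hypothetical responses; the "|" delimiter is an assumption about the raw data
responses = pd.Series([
    "Graphic Design|UX Design",
    "UX Design",
    "Education|Graphic Design",
    None,  # respondent skipped the question
])


def count_multi_response(series: pd.Series, sep: str = "|") -> pd.Series:
    """Count how often each individual answer appears across all respondents."""
    return (
        series.dropna()
        .str.split(sep)   # a single-character separator is treated literally
        .explode()        # one row per individual answer
        .str.strip()
        .value_counts()
    )


field_counts = count_multi_response(responses)
print(field_counts)
```

Because each respondent can contribute several answers, the counts add up to more than the number of respondents. That is worth noting in the dictionary’s Description column.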
The second (“I work in:.1”) is a little difficult to distinguish from the first, but I think of it as the function that the designer’s work serves (e.g., branding, advertising, product design).
I define the third (“I work in:.2”) as the sector in which the designer works. These can also be multiple answers, such as Non-Profit Education or For-Profit Start-up.
That brings us to the fourth and final (“I work in:.3”) feature in this series, which also happens to be the least likely of the four to be answered. That’s because this seems to be a subsection of the “I work in:.2” answer of “Education”. Here, respondents can explain in what type of education they are working (undergrad, grad school, online, etc).
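The “.1”, “.2”, and “.3” suffixes are what pandas appends when it reads a file with repeated column headers. Once the four features are understood, renaming them is a small step. The sketch below uses a tiny made-up CSV in place of the real file, and the new names (`work_field`, `work_function`, `work_sector`, `education_type`) are simply my suggestions based on the definitions above, not names from the dataset.

```python
import io

import pandas as pd

# Tiny stand-in for the real file: four columns sharing the same header
raw_csv = io.StringIO(
    "I work in:,I work in:,I work in:,I work in:\n"
    "Graphic Design,Branding,For-Profit,\n"
    "Education,Product Design,Non-Profit Education,Undergrad\n"
)

# read_csv de-duplicates repeated headers by appending .1, .2, .3
df = pd.read_csv(raw_csv)

# Suggested replacement names (assumptions, not part of the dataset)
rename_map = {
    "I work in:": "work_field",
    "I work in:.1": "work_function",
    "I work in:.2": "work_sector",
    "I work in:.3": "education_type",
}

# Fail loudly if an expected column is missing instead of silently skipping it
missing = set(rename_map) - set(df.columns)
if missing:
    raise KeyError(f"Columns not found: {sorted(missing)}")

df = df.rename(columns=rename_map)
print(df.columns.tolist())
```

Keeping the original names in the data dictionary alongside the new ones lets anyone trace a renamed feature back to the source survey.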
I hope this helps illuminate the need for well-designed data dictionaries and feature names. Let me know if you continue digging into this dataset or have dictionary tips that you swear by.
GitHub: https://github.com/TheeChris/dataset-of-the-week/211119-ecl-design-census.ipynb
Data: https://designcensus.org/