Navigating FiveThirtyEight’s Polling Averages Archive Using Pandas

Steve Younessi
6 min read · Jul 1, 2020

With election season heating up, I recently became interested in historical polling data. The great people over at FiveThirtyEight have compiled polling averages of every presidential election from 1968 to 2016 and made it available for download on their GitHub page.

https://github.com/fivethirtyeight/data/blob/master/polls/pres_pollaverages_1968-2016.csv

A file with that much data is a gargantuan labyrinth of information that isn’t exactly easy to work your way around. Thankfully, using pandas is a great way to quickly and effectively navigate and alter massive amounts of data.

With Python installed, the next step is to install Jupyter Notebook and pandas. On Windows, open Command Prompt and type:

pip install jupyter

After Jupyter has finished installing, do the same for pandas.

pip install pandas

While here, it is well worth installing numpy and matplotlib since those packages often work in conjunction with pandas on a variety of data analysis/visualization tasks.

pip install numpy
pip install matplotlib

For future reference, these packages (and others) are updated fairly often; to check whether newer versions are available, use this command.

pip list --outdated

This will list outdated packages and the latest available versions.

Any package can be upgraded this way.

pip install --upgrade numpy

With all that groundwork finished, it’s time to launch Jupyter Notebook and work on the polling data. In Command Prompt, run:

jupyter notebook

At this point, Jupyter Notebook launches through the web browser. Create a new notebook to work in by navigating to the “New” menu and finding the new notebook option.

How a new notebook is created in Jupyter

Once in the new notebook, import pandas in the first cell and use SHIFT+ENTER to both run the current cell and create a new one below.

import pandas as pd

In the new cell, load the .csv file into pandas as a dataframe.

df = pd.read_csv('pres_pollaverages_1968-2016.csv')

The full file path is only necessary if the file is somewhere other than the Jupyter Notebook directory. If it is in that directory, use the filename only.
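To illustrate the difference, here is a minimal sketch that writes a tiny sample file and reads it back by its full path; the path and column values are made up for illustration only.

```python
import os
import tempfile

import pandas as pd

# Create a throwaway sample file standing in for the real polling CSV
path = os.path.join(tempfile.gettempdir(), 'sample_polls.csv')
pd.DataFrame({'cycle': [2016], 'pct_estimate': [45.0]}).to_csv(path, index=False)

# A full path works regardless of the current working directory
df = pd.read_csv(path)
print(df.shape)
```

If the file sits in the notebook's own directory, `pd.read_csv('sample_polls.csv')` with the bare filename works just as well.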

Use the display function to view the composition of the dataframe.

display(df)
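If the full display is more than needed, a couple of quick alternatives give a faster feel for the data. A sketch on a small made-up frame standing in for the real polling data:

```python
import pandas as pd

# Small made-up frame standing in for the real polling data
df = pd.DataFrame({'cycle': [2016, 2016, 2016],
                   'pct_estimate': [43.5, 44.1, 45.0]})

print(df.head())   # first five rows (all three here)
print(df.shape)    # (rows, columns) tuple
```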

At this point, this becomes a “choose your own adventure” situation and where you go depends entirely on what you want to derive from the dataframe.

For instance, I wanted to get the national polling averages for the month of August during the 2016 presidential election between Donald Trump and Hillary Clinton.

First, I made sure to remove all of the columns that I had no need for by selecting the ones I wanted to remain.

df = df[['cycle', 'state', 'modeldate', 'candidate_name',
         'pct_estimate']]
Only the selected columns remain

Conversely, I can drop only a few columns I don’t want and keep the rest.

df = df.drop(['pct_trend_adjusted', 'comment', 'timestamp'], axis=1)
Only columns selected to drop are removed

It helps to be aware of both methods because one or the other saves time depending on how much of the dataframe you plan to keep. In my case, it would be too time-consuming to spell out every column I want to drop, so I specify the ones I want to keep instead.

The next step is to figure out all of the unique entries under the candidate name column and find the entries I want to keep.

df.candidate_name.unique()
No offense to Libertarian Gary Johnson, but I wasn’t interested in his polling data

Now I can filter my dataframe down to only Donald Trump and Hillary Rodham Clinton. Notice that Clinton’s entries include her birth name (Rodham), so by using the .unique() method first, I avoid the mistake of searching for just “Hillary Clinton” and coming up with nothing.

df = df[(df.candidate_name.isin(['Donald Trump',
                                 'Hillary Rodham Clinton']))]
Only Trump and Clinton remain

Before I filter my data down further to only include national polling, let’s suppose I were curious to see the three pivotal swing states of the 2016 election instead.

df[(df.state.isin(['Wisconsin', 'Michigan', 'Pennsylvania']))]
Three pivotal swing states and their polling averages

This method can be used to filter row entries within a specific column in any number of ways. In my case, I can select only national polling data.

df = df[(df.state.isin(['National']))]
National polling averages only

It is also possible to split the remaining data into separate dataframes for each candidate and work on them individually.

dfClinton = df[(df.candidate_name.isin(['Hillary Rodham Clinton']))]
dfTrump = df[(df.candidate_name.isin(['Donald Trump']))]

In doing so, I can identify the lowest and highest points of a candidate’s polling support or any other descriptive statistics I want to derive from their respective entries like average, mode, median, variance, and so on.

dfClinton.nsmallest(1, ['pct_estimate'])
dfClinton.nlargest(1, ['pct_estimate'])
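The other descriptive statistics mentioned above are one-liners as well. A sketch with made-up numbers standing in for the real pct_estimate values:

```python
import pandas as pd

# Made-up values standing in for Clinton's pct_estimate column
dfClinton = pd.DataFrame({'pct_estimate': [43.2, 44.8, 45.1, 44.8, 46.0]})

print(dfClinton['pct_estimate'].mean())    # average
print(dfClinton['pct_estimate'].median())  # median
print(dfClinton['pct_estimate'].mode())    # most frequent value(s)
print(dfClinton['pct_estimate'].var())     # sample variance
```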

Now to filter by date. First, make sure that the column which contains the dates is properly classified.

df.dtypes
Date column is not properly classified

The modeldate column has to be converted with pd.to_datetime so that pandas treats its contents as dates.

df.modeldate = pd.to_datetime(df.modeldate)
Column now reflects that it contains dates

With that done, I can search the date column and find any rows which satisfy my conditions. For instance, I can look up the final polling average of Clinton and Trump on election day.

df[(df['modeldate'] == '11/8/2016')]
The final polling data of 2016

As for my original goal, I can filter based on date ranges and only include entries from August 2016.

df = df[(df['modeldate'] >= '8/01/2016') &
        (df['modeldate'] <= '8/31/2016')]
August data only
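An equivalent way to grab a single month, once modeldate is a datetime column, is the .dt accessor. A sketch with made-up dates:

```python
import pandas as pd

# Made-up dates; assumes modeldate was already converted with pd.to_datetime
df = pd.DataFrame({'modeldate': pd.to_datetime(
    ['2016-07-31', '2016-08-15', '2016-09-01'])})

# .dt exposes the date parts of each entry, so no range endpoints are needed
august = df[(df['modeldate'].dt.year == 2016) & (df['modeldate'].dt.month == 8)]
print(len(august))
```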

Now that I have chiseled down the data to exactly what I wanted and a very modest 62 rows in 5 columns, I can save a new .csv file and open the file in Excel or Tableau with ease.

df.to_csv('august2016_nationalpolling_trumpvsclinton.csv')

This will save the identified dataframe to a .csv file in the Jupyter Notebook directory.
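One detail worth knowing: by default, to_csv also writes the dataframe’s row index as an unnamed first column, and passing index=False keeps it out of the file, which is usually tidier for Excel or Tableau. A sketch writing to an in-memory buffer with made-up values:

```python
import io

import pandas as pd

# Tiny made-up frame; io.StringIO stands in for a real file on disk
df = pd.DataFrame({'candidate_name': ['Donald Trump'], 'pct_estimate': [43.5]})

buf = io.StringIO()
df.to_csv(buf, index=False)  # index=False drops the row-number column
print(buf.getvalue().splitlines()[0])  # header: candidate_name,pct_estimate
```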

None of this is particularly groundbreaking or impossibly complicated, but most of the early effort in data analysis and visualization goes into wrangling, cleaning, sorting, and filtering data. Pandas is a quick and effective way to distill large amounts of data with a specific intent in mind. It is also a good way to conduct a deep dive into large datasets and uncover the granular details that are not initially evident.
