Navigating FiveThirtyEight’s Polling Averages Archive Using Pandas
With election season heating up, I recently became interested in historical polling data. The great people over at FiveThirtyEight have compiled polling averages of every presidential election from 1968 to 2016 and made it available for download on their GitHub page.
https://github.com/fivethirtyeight/data/blob/master/polls/pres_pollaverages_1968-2016.csv
A file with that much data is a gargantuan labyrinth of information that isn’t exactly easy to work your way around. Thankfully, using pandas is a great way to quickly and effectively navigate and alter massive amounts of data.
After installing Python, install Jupyter Notebook and pandas. This is done by opening Command Prompt in Windows and typing:
pip install jupyter
After Jupyter has finished installing, do the same for pandas.
pip install pandas
While here, it is well worth installing numpy and matplotlib since those packages often work in conjunction with pandas on a variety of data analysis/visualization tasks.
pip install numpy
pip install matplotlib
For future reference, these packages (and others) are updated fairly often; to check whether newer versions are available, use this command.
pip list --outdated
This will list outdated packages and the latest available versions.
Any package can be upgraded this way.
pip install --upgrade numpy
With all that groundwork finished, it’s time to launch Jupyter Notebook and work on the polling data. In Command Prompt, this command launches Jupyter Notebook.
jupyter notebook
At this point, Jupyter Notebook launches through the web browser. Create a new notebook to work in by navigating to the “New” menu and finding the new notebook option.
Once in the new notebook, import pandas in the first cell and use SHIFT+ENTER to both run the current cell and create a new one below.
import pandas as pd
In the new cell, load the .csv file into pandas as a dataframe.
df = pd.read_csv('pres_pollaverages_1968-2016.csv')
The full file path is only necessary if the file is somewhere other than the Jupyter Notebook directory. If it is in that directory, use the filename only.
Use the display function to view the composition of the dataframe.
display(df)
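Before choosing a direction, it helps to get a quick read on the dataframe's size and columns. Here is a minimal sketch using a small made-up stand-in for the FiveThirtyEight file (the rows and numbers below are invented for illustration, not real polling data):

```python
import pandas as pd

# Hypothetical sample rows mimicking a few of the file's columns
df = pd.DataFrame({
    'cycle': [2016, 2016],
    'state': ['National', 'Wisconsin'],
    'modeldate': ['11/8/2016', '11/8/2016'],
    'candidate_name': ['Donald Trump', 'Hillary Rodham Clinton'],
    'pct_estimate': [44.1, 45.5],
})

print(df.shape)              # (rows, columns)
print(df.columns.tolist())   # column names
print(df.head())             # first few rows
```

`shape`, `columns`, and `head()` give a fast overview when the full `display(df)` output is too large to scan.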
At this point, this becomes a “choose your own adventure” situation and where you go depends entirely on what you want to derive from the dataframe.
For instance, I wanted to get the national polling averages for the month of August during the 2016 presidential election between Donald Trump and Hillary Clinton.
First, I made sure to remove all of the columns that I had no need for by selecting the ones I wanted to remain.
df = df[['cycle', 'state', 'modeldate', 'candidate_name',
'pct_estimate']]
Conversely, I can drop only a few columns I don’t want and keep the rest.
df = df.drop(['pct_trend_adjusted', 'comment', 'timestamp'], axis=1)
It helps to be aware of both methods because each one saves time depending on how much of your dataframe you plan on preserving or eliminating. In my case, it would be too time-consuming to spell out all the columns I want to drop, so I go with the ones I want to keep instead.
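To see that both approaches land in the same place, here is a small sketch on a toy frame (made-up values) that applies the keep-list and the drop-list side by side:

```python
import pandas as pd

# Toy frame with the five columns we want plus the three we don't
# (all values are invented for illustration)
df = pd.DataFrame({
    'cycle': [2016], 'state': ['National'], 'modeldate': ['8/1/2016'],
    'candidate_name': ['Donald Trump'], 'pct_estimate': [41.0],
    'pct_trend_adjusted': [40.8], 'comment': [''], 'timestamp': [''],
})

# Method 1: select the columns to keep
kept = df[['cycle', 'state', 'modeldate', 'candidate_name', 'pct_estimate']]

# Method 2: drop the columns we don't want
dropped = df.drop(['pct_trend_adjusted', 'comment', 'timestamp'], axis=1)

# Both leave the same five columns
print(kept.columns.equals(dropped.columns))
```

Whichever list is shorter to type is usually the better choice.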
The next step is to figure out all of the unique entries under the candidate name column and find the entries I want to keep.
df.candidate_name.unique()
Now I can filter my dataframe down to only Donald Trump and Hillary Rodham Clinton. Notice that Clinton's entries include her birth name (Rodham), so by using the unique() method I avoid the mistake of searching for just "Hillary Clinton" and coming up with nothing.
df = df[(df.candidate_name.isin(['Donald Trump',
'Hillary Rodham Clinton']))]
Before I filter my data down further to only include national polling, let’s suppose I were curious to see the three pivotal swing states of the 2016 election instead.
df[(df.state.isin(['Wisconsin', 'Michigan', 'Pennsylvania']))]
This method can be used to filter row entries within a specific column in any number of ways. In my case, I can select only national polling data.
df = df[(df.state.isin(['National']))]
It is also possible to split the remaining data into separate dataframes for each candidate and work on them individually.
dfClinton = df[(df.candidate_name.isin(['Hillary Rodham Clinton']))]
dfTrump = df[(df.candidate_name.isin(['Donald Trump']))]
In doing so, I can identify the lowest and highest points of a candidate’s polling support or any other descriptive statistics I want to derive from their respective entries like average, mode, median, variance, and so on.
dfClinton.nsmallest(1, ['pct_estimate'])
dfClinton.nlargest(1, ['pct_estimate'])
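The other summary statistics mentioned above come straight off the pct_estimate column. A quick sketch, using a handful of made-up daily estimates for one candidate:

```python
import pandas as pd

# Hypothetical daily polling estimates (invented numbers)
dfClinton = pd.DataFrame({'pct_estimate': [44.8, 45.2, 46.0, 45.1, 44.5]})

print(dfClinton['pct_estimate'].min())     # lowest polling average
print(dfClinton['pct_estimate'].max())     # highest polling average
print(dfClinton['pct_estimate'].mean())    # average
print(dfClinton['pct_estimate'].median())  # median
print(dfClinton['pct_estimate'].var())     # variance
```

`dfClinton['pct_estimate'].describe()` bundles most of these into a single summary.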
Now to filter by date. First, make sure that the column which contains the dates is properly classified.
df.dtypes
The model date column has to be changed using pd.to_datetime to reflect that it contains dates.
df.modeldate = pd.to_datetime(df.modeldate)
With that done, I can search the date column and find any rows which satisfy my conditions. For instance, I can look up the final polling average of Clinton and Trump on election day.
df[(df['modeldate'] == '11/8/2016')]
As for my original goal, I can filter based on date ranges and only include entries from August 2016.
df = df[(df['modeldate'] >= '8/01/2016') &
(df['modeldate'] <= '8/31/2016')]
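Once the column holds real datetimes, the .dt accessor offers an equivalent way to express the same month filter without spelling out the range endpoints. A small sketch on toy rows (dates and numbers are made up):

```python
import pandas as pd

# Toy rows spanning July through September 2016 (invented values)
df = pd.DataFrame({
    'modeldate': ['7/31/2016', '8/01/2016', '8/15/2016', '9/01/2016'],
    'pct_estimate': [43.0, 43.5, 44.0, 44.2],
})
df['modeldate'] = pd.to_datetime(df['modeldate'])

# Keep only rows from August 2016, same result as the range comparison
august = df[(df['modeldate'].dt.year == 2016) &
            (df['modeldate'].dt.month == 8)]
print(len(august))
```

The range comparison and the .dt version are interchangeable here; the .dt form reads more clearly when the filter is "a whole month" rather than an arbitrary span.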
Now that I have chiseled the data down to exactly what I wanted, a very modest 62 rows across 5 columns, I can save a new .csv file and open it in Excel or Tableau with ease.
df.to_csv('august2016_nationalpolling_trumpvsclinton.csv')
This will save the identified dataframe to a .csv file in the Jupyter Notebook directory.
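One small detail worth knowing: by default, to_csv writes the dataframe's row index as an unnamed first column, which shows up as an extra column in Excel. Passing index=False keeps the file to just the data columns. A sketch writing to an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# Toy frame standing in for the filtered polling data (made-up values)
df = pd.DataFrame({'candidate_name': ['Donald Trump'],
                   'pct_estimate': [43.2]})

# index=False omits the row-index column from the output
buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue())
```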
None of this is particularly groundbreaking or impossibly complicated, but most of the early effort in data analysis and visualization goes into wrangling, cleaning, sorting, and filtering data. Using pandas is a quick and effective way to distill large amounts of data with a specific intent in mind. It is also a good way to conduct a deep dive into large datasets and uncover granular details that are not initially evident.