Have you ever wondered who chooses our government and what factors influence U.S. elections? Which social groups have the most power, and who is targeted most during campaigning? And what does voter turnout actually look like? At some point we have all formed our own hypotheses about these questions, and now our group has a chance to test them.
This project focuses on different aspects of voting patterns and influence in elections. The data used for this work was taken from the United States Census Bureau (www.census.gov) and Harvard Dataverse (www.dataverse.harvard.edu). Our exploratory analysis and Machine Learning prediction model can be found in our notebook.
In the exploratory analysis we looked at the Census Bureau's "Table A-1. Reported Voting and Registration by Race, Hispanic Origin, Sex and Age Groups: November 1964 to 2020" (https://www.census.gov/data/tables/time-series/demo/voting-and-registration/voting-historical-time-series.html), as it summarizes most of the categories we were interested in: race, age, sex, and the numbers of people who voted and/or registered, split by each of these groups and by election year. We wanted to know how actively each of these groups participates in elections and to draw new insights from the data. We did some preprocessing before starting the exploration: the rows for 2002 and earlier had to be removed because, until 2004, half of the important data had not been recorded, such as the numbers of Black, Asian, and Hispanic people involved in the election (they simply did not ask the question about race until then), so the earlier data was not directly comparable with the later data. The race and age categories were also defined differently prior to 2004. As a result, we ended up with a workable dataset covering all the elections from 2004 to 2020. The table initially had about 300 rows (excluding comments and column names), and we were left with 90 rows to work with. We discuss the results of the analysis below.
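The year filter described above can be sketched in pandas roughly as follows. This is a minimal illustration, not the code from our notebook: the column names (`year`, `group`, `voted_pct`) and all the values are made up for the example and do not match the real Census table.

```python
import pandas as pd

# Hypothetical miniature of Table A-1. Column names and numbers are
# placeholders for illustration only, not real Census figures.
table = pd.DataFrame({
    "year": [2000, 2002, 2004, 2006, 2020],
    "group": ["Total"] * 5,
    "voted_pct": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Keep only the elections from 2004 onward, where the categories are comparable.
recent = table[table["year"] >= 2004].reset_index(drop=True)
```

The same boolean-mask pattern works for any cutoff year, so the filter is easy to adjust if the comparability window changes.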
For the Machine Learning aspect of this project we chose to pivot towards predicting the party outcome in a district based on population factors such as education, age, gender, and race. The data from our exploratory analysis gave us enough information to draw insights and test hypotheses, but it wasn't enough to train a predictive model, because our total percent registration data didn't account for the real reasons that people go out to vote (or not). We took new datasets, also from the U.S. Census website (https://www.census.gov/data/tables/time-series/demo/voting-and-registration/congressional-voting-tables.html); these are broken down by congressional district within each state and cover the 2018 elections only. There are 4 datasets in total, with information on the voting population's age, sex, poverty status, education, and, of course, race and Hispanic origin. The preprocessing included dropping the "Margin of error", "Percent of total", and "Percent of total margin of error" columns, because the "Estimate" column provided us with enough information and the other three could affect our model in unpredictable ways, which we wanted to avoid. We also regrettably had to drop the American Indian and Native Hawaiian columns, as many states did not have any data for them, and filling the gaps with zeros caused our models to overfit. One more dataset we needed was the outcomes of the 2018 election, so that we could train our models and check our predictions; we found it in the data provided by the MIT Election Lab at this link: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IG0UN2 . For our baseline, we used the 2016 U.S. House election results from the same dataset. Some of the issues related to that year are explained in the notebook.
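As a rough sketch of the column clean-up described above, the snippet below drops the three uncertainty columns and keeps only the estimates. The DataFrame is a made-up stand-in: the `district` column and all values are assumptions, and the real Census tables are laid out differently.

```python
import pandas as pd

# Hypothetical two-row stand-in for one of the district tables.
# The "district" column and every value here are placeholders.
districts = pd.DataFrame({
    "district": ["District A", "District B"],
    "Estimate": [1000, 2000],
    "Margin of error": [50, 60],
    "Percent of total": [40.0, 60.0],
    "Percent of total margin of error": [1.5, 1.2],
})

# Columns named in the text as unnecessary for the model.
uncertainty_columns = [
    "Margin of error",
    "Percent of total",
    "Percent of total margin of error",
]

# Keep only the identifier and the "Estimate" column.
estimates_only = districts.drop(columns=uncertainty_columns)
```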
We will discuss some of the most interesting points of the analysis and show accompanying graphs.
More people are registered to vote than actually vote.
This is probably the most obvious finding and something we already knew intuitively, but this time we had actual data to back it up. There are a number of reasons why people cannot or choose not to vote even though they are registered (we assume that part of the reason the number of registered people is higher is that there are many ways to get registered automatically when using certain services, while voting takes more time and intention). While looking for data we also found some interesting Census datasets on the actual reasons why certain age and race groups did not vote. If you are interested, they are definitely worth checking out; we did not need them for our analysis.
Below is a line plot that compares the number of people who actually voted against the number registered to vote, across the total population. The lines dip and rise on a 4-year cycle. This trend can be explained by the excitement around presidential elections compared with congressional-only elections: the lines rise in every presidential election year and fall in between. It appears that people are generally more interested in presidential elections than in other kinds. The highest voter turnout over the last 16 years was in 2008 and 2020 (the Democratic Party won the presidency in both years). The gap between the numbers of people who voted and who registered also appears smallest in 2008 and 2020, which means that in each of those two years a larger share of registered people actually showed up to vote than in other years.
This particular plot looks at the female population by year. The distance between the lines appears to shrink from year to year, so it is fair to say that a growing share of registered women actually vote. Although the slope of the registration line does not change much, the slope of the voting line increases, so registered women are becoming more likely to vote.
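One way to put a number on the gap between the two lines is the share of registered people who actually voted in each election. The sketch below assumes per-year totals are available; the figures are made up purely to show the calculation and are not Census numbers.

```python
# Made-up totals (in millions) for two hypothetical election years.
totals = {
    "presidential year": {"registered": 150.0, "voted": 135.0},
    "midterm year": {"registered": 150.0, "voted": 120.0},
}

def share_of_registered_who_voted(registered, voted):
    """Fraction of registered people who cast a vote."""
    return voted / registered

shares = {
    label: share_of_registered_who_voted(row["registered"], row["voted"])
    for label, row in totals.items()
}
```

A share closer to 1 corresponds to a smaller gap between the registered and voted lines on the plot.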
The interest in elections grows with age.
We found that older people are more involved in elections than younger people and that involvement grows consistently from one age group to the next. In simple words, each age group is more involved than the one before it, with 18-24 year olds the least involved (by numbers of population) and those 65+ the most interested in elections (also in proportion to their population). Our conclusion: the older a person is, the more likely they are to vote.
The next four graphs look into different subsets of voters, specifically the ones we guessed would offer the most interesting comparisons, as the others were more predictable. We look at the Asian citizen population by year, along with the Male, Female, and Hispanic citizen populations. Each plot breaks its voter group into age groups, and for each age group it draws a line of best fit and shows the difference curve that takes outliers into account. We will use these graphs for a few more points later.
Note: the data points are distributed by year, and there are several data points for each age group: one for registered (higher on the graph) and one for voted (lower on the graph). Our graph shows the line of best fit between the two for each age group.
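For readers who want to reproduce the idea of a per-age-group line of best fit, here is a minimal ordinary least-squares sketch in plain Python. It is not the plotting code from our notebook: `fit_line` is a helper written for this example, and the turnout values are made up so that the slopes are easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up turnout percentages per election year for two age groups.
years = [2004, 2008, 2012, 2016, 2020]
turnout_by_age = {
    "18-24": [40.0, 42.0, 44.0, 46.0, 48.0],
    "65+": [70.0, 70.5, 71.0, 71.5, 72.0],
}

# Slope = average change in turnout (percentage points) per year.
slopes = {age: fit_line(years, values)[0] for age, values in turnout_by_age.items()}
fastest_growing = max(slopes, key=slopes.get)
```

Comparing the slopes across age groups is what lets us say which group's involvement is growing fastest, independent of where each line starts.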
Each year there is a slight increase in the numbers of people who registered and voted.
In particular, Graph 1 above ("Asian, Citizen population that Voted/Registered per Year") and Graph 2 ("Hispanic, Citizen population that Voted/Registered per Year") show noticeable growth in involvement with elections. Graph 3 ("Male, Citizen population that Voted/Registered per Year") shows the least growth, which is somewhat expected, since male citizens have been eligible to vote for far longer than other groups.
Involvement of the youngest age group (18-24) is increasing each year.
Despite the previous conclusion, the 18-24 age group becomes more involved each year: in Graph 2 ("Hispanic, Citizen population that Voted/Registered per Year") and Graph 3 ("Male, Citizen population that Voted/Registered per Year"), where the other age groups have a roughly steady slope, the slope for the 18-24 age group is noticeably steeper. 18-24 year olds are becoming more likely to register and vote, and this trend will most likely continue in future elections.
White and Black citizen populations have similar voter turnout distributions that are highly correlated with each other; however, involvement among Black citizens is lower.
Now we want to introduce a new plot that shows the distributions of voter turnout percentages by race and sex. The graph right below shows the distributions of observations for all given races in the citizen population. Voter turnout appears to be highest among White citizens (as expected), and turnout for Black citizens is very similar, with slightly smaller percentages but nearly the same distribution curve.
Hispanic and Asian populations show noticeable growth in involvement in recent years, although the Asian population is still the least involved of all races.
To draw this conclusion we used several visualizations. First of all, it can be seen in Graphs 1 and 2, which we discussed earlier. Then we compared those findings with the distribution plot right above. The curve for the Hispanic citizen population peaks at around 55%, and the curve for Asian citizens peaks at around 50%, which indicates a relatively smaller turnout.
Females, while historically barred from voting, are now more involved with the elections than males.
The graph below shows voter turnout by sex. It was the most surprising result so far, because it shows higher percentages and a higher distribution for females, while we expected to see this for males (because, as mentioned above, males have been participating in elections since the beginning, and females earned this right relatively recently). We can now draw a new conclusion: females vote more frequently and actively than males.
For the process of training and validating our model, please read our notebook, which elaborates on everything we did and why. We used 2018 U.S. Congressional District demographics and the results of the House election from the same year to create a model that would accurately predict the winning candidate's party. We tried 5 different models (because we also wanted to be able to compare our predictions with real data before trying to predict the future). Eventually, we chose Logistic Regression because it wasn't overfit and was reasonably accurate. Our final accuracy score on the test data ended up higher than the baseline, at 83.63%, which was also very close to the validation score we initially saw.
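For readers unfamiliar with the workflow, the sketch below shows one common way to fit a logistic regression with separate training, validation, and test splits in scikit-learn. It is an illustration under assumptions, not our notebook's code: the data is randomly generated, the split proportions and `random_state` values are arbitrary, and the resulting scores have nothing to do with the 83.63% reported above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for district demographics: 400 "districts", 4 features.
X = rng.normal(size=(400, 4))
# Synthetic party label driven by the first two features plus noise.
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

# Split into 60% train, 20% validation, 20% test (proportions are an assumption).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Accuracy on data the model has not seen during fitting.
val_accuracy = model.score(X_val, y_val)
test_accuracy = model.score(X_test, y_test)
```

A validation score close to the test score, as we observed, is one sign that the model is not badly overfit.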
What's interesting is that, when we look at the way our data was split, our test set had one more row than our validation set, so our model outperformed the baseline by just a single row. The takeaway is that simply guessing the previous election's results would get you about the same accuracy as the model.
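The baseline idea, predicting that each district is won by the same party as in the previous election, can be scored with a few lines of plain Python. The party labels below are made up for illustration, and `accuracy` is a helper written for this example rather than a function from our notebook.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the prediction matches the actual label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Made-up winners for five hypothetical districts (D or R).
winners_previous = ["D", "R", "R", "D", "R"]
winners_current = ["D", "R", "D", "D", "R"]

# Baseline: assume every district repeats its previous result.
baseline_accuracy = accuracy(winners_previous, winners_current)
```

Scoring the model and the baseline with the same function on the same districts makes the one-row difference described above easy to see.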
From here, we could break the model down into its coefficients, plotted below, to see which factors drive its predictions. First, we looked at all the coefficients together to see which areas impacted our model most, then we grouped them to see which coefficient was the most important within each group.
Logistic Regression Coefficients in Order of Influence
As we can see from the graph above, White, Percent Poverty in the district, and White Non-Hispanic were the three largest coefficients in our model. Two of the top three coefficients are race based, showing that for our specific model, race-based factors proved essential in turning the tide in either party's favor. The other large coefficient was poverty percentage, a surprising result, as we expected the top 3 to all be race based.
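Ordering coefficients by influence usually means sorting them by absolute value, since a large negative coefficient matters as much as a large positive one. Below is a minimal sketch with placeholder feature names and made-up values, not the coefficients from our model. Note that magnitudes are only comparable when the features are on similar scales.

```python
# Placeholder feature names and made-up coefficient values.
coefficients = {
    "feature_a": 0.2,
    "feature_b": -1.5,
    "feature_c": 0.9,
}

# Sort by magnitude, largest first, keeping the sign for interpretation.
ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)
top_feature = ranked[0][0]
```

The sign of each coefficient then tells which class the feature pushes a prediction towards, while its rank tells how strongly.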
Coefficients by Age, Gender, Education, and Race
Our new results with coefficients that impacted our ML model:
This concludes our discussion of the main results from the data we were able to find and the model we created. Many factors affect election results, and many of them are out of our reach at this point because of data privacy policies and the lack of data on demographics and on certain social and political events that happen around election time. Learning more about demographics and conducting statistical research on populations' preferences in elections would be an interesting way to dig deeper and improve this analysis.