What makes a man turn neutral? Lust for gold? Power? Or were you just born with a heart full of neutrality?
- Zapp Brannigan, Futurama

I'm an avid fan of r/dataisbeautiful, and I've seen plenty of US map based visualizations over time. I've always wanted to create something spectacular that would stand out amongst all these beautiful maps, but I never got the inspiration until now.

While keeping an eye on the US elections like everyone else, I noticed something curious. Most large cities were voting Democrat; even in hardcore red states.

Texas election results (2020). Source: Google
Tennessee election results (2020). Source: Google

I've been meaning to learn R for quite some time now, so I thought this would be the perfect chance to get started. After a quick, painless installation, I am ready to go.

My goal here is to make a map that shows how each county voted in the years past and to then compare that to the county's population density.

Firstly, I need to plot the map of the US. Turns out, there's a pretty comprehensive package for that. usmap by Paolo Di Lorenzo is built specifically for the purpose of plotting choropleths on the US map. Yeah, I had to look up what that means.

Then, I need the data. The MIT Election Lab is a good source for US elections related data, and the US Census provides data about anything and everything US.

It's at this point that I realize I've got my priorities messed up. I first need to learn R. A crash course is in order. Bill Petti tells me everything I need to know. Now, I'm truly ready to go.

# install packages using install.packages('package')
library(readxl)
library(tidyverse)
library(usmap)
library(ggplot2)

# POP01.xls - Source: US Census database
# POP060210D - population density from 2010 census
county_pop <- read_excel('./data/POP01.xls', sheet = 'POP01C') %>%
	subset(select=c(Area_name, POP060210D)) %>%
	separate(col=Area_name, into=c('County','State'), sep=', ') %>%
	drop_na()

colnames(county_pop) <- c('county', 'abbr', 'pop_density')

# some data cleaning

county_pop <- county_pop %>%
	transform(county = str_c(county, ' County')) %>%
	merge(usmap_countypop %>% subset(select = -pop_2015)) %>%
 	subset(select = -c(abbr, county))
Extract and clean population density data

The usmap dataset has FIPS codes that are necessary to plot a US choropleth. But before getting at that data, there was some data cleaning that I had to do due to different naming schemes followed by usmap and the US Census. Some counties also changed after the last census. Shannon County, SD changed their name to Oglala Lakota County in 2015, and Bedford County, VA merged with the previously independent city of Bedford in a very confusing administrational restructure. Thank me when you win Trivia Night.

Due to the way the data differed, cleaning it proved to be easier in Excel. So, I exported the data as csv, cleaned it, and reimported it before proceeding.

write.csv(usmap::countypop, './data/county_population.csv')
# clean the data
usmap_countypop <- read.csv('./data/county_population.csv', index_col=FALSE)
In case you too prefer messing around with enormous Excel sheets
img <- plot_usmap(data = county_pop, values = 'pop_density') + 
  labs(title = 'Population density') +
  scale_fill_continuous(low = 'white', high = 'red') + 
  theme(legend.position = 'right')
Plot the population density on the map

Let's see how that looks.

Move along. Nothing to see here.

Well that was bad. It's probably because the population density is extremely skewed right. There are a few small counties with ridiculously high population densities (NYC, I'm looking at you), while the majority of the counties have less than 100 people/sq. mile. The former don't show up because they are tiny. The skew can easily be fixed by logarithmic scaling. So let's try that.

breaks <- c(1, 3, 10, 30, 100, 300, 1000, 3000, 10000, 30000, 100000)
img <- plot_usmap(data = county_pop, values = 'pop_density', colour=NA) + 
  labs(title = 'Population density') +
  scale_fill_continuous(high='red', low='yellow', trans='log10',
  						breaks = breaks, limits=c(0.1,70000)) + 
  theme(legend.position = 'right')
Plot the population density after logarithmic scaling, and have some fun with the colours
Much better

We can see the coastal areas are densely populated, while nobody wants to live in the rural Midwest, Utah, Wyoming, or Alaska. Now that that works, we can move on to plotting how each county voted.

# countypres_2000-2016.csv - Source: MIT Election Lab
voting <- read.csv('./data/countypres_2000-2016.csv') %>%
  subset(select = c(year, FIPS, party, candidatevotes, totalvotes)) %>%
  drop_na()

voting$percentagevotes = voting$candidatevotes/voting$totalvotes
voting <- subset(voting, select=-c(candidatevotes, totalvotes))

winner <- voting$percentagevotes %>%
  aggregate(by = list(voting$year, voting$FIPS), max) %>%
  setNames(c('year', 'fips', 'percentagevotes')) %>%
  merge(voting) %>%
  subset(select = -percentagevotes) %>%
  merge(county_pop)

winner$party <- as.factor(winner$party)
Extraction and clean up
img <- plot_usmap(data = filter(winner, winner$year==2016), values = 'party', size=0.01) +
  labs(title = '2016 US Presidential Election') +
  scale_fill_manual(values = c('democrat'='blue', 'republican'='red'), name='party') +
  theme(legend.position = 'right')
Plot the election map for 2016

What's with Alaska, you ask? Maybe they are just religiously neutral? Maybe they vote white because they love the snow? This led me into another rabbit hole where I learnt that Alaska has boroughs and parishes instead of counties. But, for elections, they have voting districts, and these coincide with their boroughs in only 3 cases: Aleutians East, Aleutians West, and Anchorage; all of which are visible on the map. I've decided that that's more than enough detail for the scope of this blog, so I'm going to pretend Alaska doesn't exist for a while.

Now to see if we can make an animated map with all the results from 2000 - 2016. gganimate is a wonderful library that helps animate ggplots (on which usmap is built) based graphs.

library(gganimate)

anim <- plot_usmap(data = winner, values = 'party') +
  labs(title = '{closest_state}') +
  scale_fill_manual(values = c('democrat'='blue', 'republican'='red'), name='party') +
  theme(legend.position = 'right') +
  transition_states(year) +
  ease_aes('sine-in')
Set the animation parameters
a <- animate(anim, fps=60, duration=15)
Render the animation
Error: cannot allocate vector of size 320.0 Mb
Nope

There is a workaround to this that involves increasing the memory available to R using memory.limit(), but my computer freaked out when I did that. It seems that R runs entirely on the RAM and hence has trouble with large animations. So, I decided to tone it down.

a <- animate(anim, fps=20, duration=15)
20 fps? Why ever did I buy a 144Hz monitor?
Pretty consistent, I'd say

Why does it darken at the end of every year? I don't know. Why is most of Alaska missing? I don't know. Why is the animation not smooth? I don't know. Actually, I do. 20 fps. It doesn't matter though because I'm not exactly being sponsored by Disney here.

Now, to overlay population density data on this. My idea is to set alpha values based on the population density, so high density counties show up brighter. Unfortunately, I simply can't see a way to do that. My search led me to a StackOverflow question, but the solution provided there throws an error. After much trial and error, and banging my head against a wall for a ridiculous amount of time, I've decided that the grapes are sour and this is not the best way to check if population density has anything to do with voting habits.

I opt for a simpler method. Statistical analysis. Our null hypothesis is that voting habits and population density are independent. I feel that a Chi Squared test would suit us just fine, so I group the counties into categories based on population density. For lack of better terms, let's call them the wilderness (0-10 people/sq. mile), villages (10-100), towns (100-1000), cities (1000-10,000),  megacities (10,000-60,000), and NYC (way too many).

library(arules) # for the discretize method

breaks <- c(0,10,100,1000,10000,100000)
types <- c('The Wilderness', 'Village', 'Town', 'City', 'Megacity')
winner$type <- winner$pop_density %>%
  discretize(method = 'fixed', breaks = breaks, labels = types)

for (year_ in unique(winner$year) %>% sort()) {
  winner_by_year <- filter(winner, winner$year == year_)
  chisq_result <- table(winner_by_year$party, winner_by_year$type) %>%
    chisq.test()
  print(year)
  print(chisq_result)
}
Testing our hypothesis
2000 
             The Wilderness Village Town City Megacity
  democrat               39     361  181   85        8
  republican            502    1336  555   54        0
  
2004 
             The Wilderness Village Town City Megacity
  democrat               52     290  148   89        8
  republican            489    1407  588   51        0
  
2008 
             The Wilderness Village Town City Megacity
  democrat               86     408  261  114        8
  republican            455    1289  475   26        0
  
2012
             The Wilderness Village Town City Megacity
  democrat               65     299  215  112        8
  republican            476    1398  521   28        0

2016
             The Wilderness Village Town City Megacity
  democrat               41     168  165  112        8
  republican            500    1529  571   28        0
A look at our data
	Pearson's Chi-squared test

2000
data:  table(winner$party, winner$type)
X-squared = 227.63, df = 4, p-value < 2.2e-16

2004
data:  table(winner$party, winner$type)
X-squared = 252.36, df = 4, p-value < 2.2e-16

2008
data:  table(winner$party, winner$type)
X-squared = 291.04, df = 4, p-value < 2.2e-16

2012
data:  table(winner$party, winner$type)
X-squared = 370.58, df = 4, p-value < 2.2e-16

2016
data:  table(winner$party, winner$type)
X-squared = 571.82, df = 4, p-value < 2.2e-16
The results are out!!

With P-values that close to zero, our null hypothesis can be safely rejected. We can say with absolute certainty that voting habits and population density are correlated. People who live in high density areas definitely vote Democrat. The 9 most densely populated cities in the US all consistently voted Democrat from 2000-2016. Of the top 40 most densely populated counties, 29 consistently voted blue. In fact, the first and only city in that list to consistently vote Republican is Richmond, Virginia which is the 29th most densely populated city in the US. My guess has been proved right.

Are crowds inherently blue, though? Does living in a city magically make you a liberal? Or does being a liberal make you move to cities? It is far more likely that there are some other variables at play here that affect both these factors, and hence the high correlation. Maybe educated folk tend to lean left and tend to live in cities. Maybe people with exposure to different cultures tend to be more progressive and enjoy living in cities. Or maybe the economic policies of the Democrats help city folk while Republican policies help rural folk. Future posts will delve deeper into this topic.

Let me end by reiterating the golden rule of statistics: correlation doesn't imply causation. We've proved that population density and voting habits are intensely intertwined (at least in the US); not that high population density makes people vote blue. If that was the case, the Dems need only stuff people in a small room to gain votes.

Mission accomplished, I'd say. Despite not having a beautiful map to show-off, I've learnt R. And in the end, isn't the journey all that matters?