What makes a man turn neutral? Lust for gold? Power? Or were you just born with a heart full of neutrality?
- Zapp Brannigan, Futurama
I'm an avid fan of r/dataisbeautiful, and I've seen plenty of US map based visualizations over time. I've always wanted to create something spectacular that would stand out amongst all these beautiful maps, but I never got the inspiration until now.
While keeping an eye on the US elections like everyone else, I noticed something curious. Most large cities were voting Democrat; even in hardcore red states.


I've been meaning to learn R for quite some time now, so I thought this would be the perfect chance to get started. After a quick, painless installation, I am ready to go.
My goal here is to make a map that shows how each county voted in the years past and to then compare that to the county's population density.
Firstly, I need to plot the map of the US. Turns out, there's a pretty comprehensive package for that. usmap
by Paolo Di Lorenzo is built specifically for the purpose of plotting choropleths on the US map. Yeah, I had to look up what that means.
Then, I need the data. The MIT Election Lab is a good source for US elections related data, and the US Census provides data about anything and everything US.
It's at this point that I realize I've got my priorities messed up. I first need to learn R. A crash course is in order. Bill Petti tells me everything I need to know. Now, I'm truly ready to go.
# install packages using install.packages('package')
library(readxl)
library(tidyverse)
library(usmap)
library(ggplot2)
# POP01.xls - Source: US Census database
# POP060210D - population density from 2010 census
county_pop <- read_excel('./data/POP01.xls', sheet = 'POP01C') %>%
subset(select=c(Area_name, POP060210D)) %>%
separate(col=Area_name, into=c('County','State'), sep=', ') %>%
drop_na()
colnames(county_pop) <- c('county', 'abbr', 'pop_density')
# some data cleaning
county_pop <- county_pop %>%
transform(county = str_c(county, ' County')) %>%
merge(usmap_countypop %>% subset(select = -pop_2015)) %>%
subset(select = -c(abbr, county))
The usmap
dataset has FIPS codes that are necessary to plot a US choropleth. But before getting at that data, there was some data cleaning that I had to do due to different naming schemes followed by usmap
and the US Census. Some counties also changed after the last census. Shannon County, SD changed their name to Oglala Lakota County in 2015, and Bedford County, VA merged with the previously independent city of Bedford in a very confusing administrational restructure. Thank me when you win Trivia Night.
Due to the way the data differed, cleaning it proved to be easier in Excel. So, I exported the data as csv, cleaned it, and reimported it before proceeding.
write.csv(usmap::countypop, './data/county_population.csv')
# clean the data
usmap_countypop <- read.csv('./data/county_population.csv', index_col=FALSE)
img <- plot_usmap(data = county_pop, values = 'pop_density') +
labs(title = 'Population density') +
scale_fill_continuous(low = 'white', high = 'red') +
theme(legend.position = 'right')
Let's see how that looks.

Well that was bad. It's probably because the population density is extremely skewed right. There are a few small counties with ridiculously high population densities (NYC, I'm looking at you), while the majority of the counties have less than 100 people/sq. mile. The former don't show up because they are tiny. The skew can easily be fixed by logarithmic scaling. So let's try that.
breaks <- c(1, 3, 10, 30, 100, 300, 1000, 3000, 10000, 30000, 100000)
img <- plot_usmap(data = county_pop, values = 'pop_density', colour=NA) +
labs(title = 'Population density') +
scale_fill_continuous(high='red', low='yellow', trans='log10',
breaks = breaks, limits=c(0.1,70000)) +
theme(legend.position = 'right')

We can see the coastal areas are densely populated, while nobody wants to live in the rural Midwest, Utah, Wyoming, or Alaska. Now that that works, we can move on to plotting how each county voted.
# countypres_2000-2016.csv - Source: MIT Election Lab
voting <- read.csv('./data/countypres_2000-2016.csv') %>%
subset(select = c(year, FIPS, party, candidatevotes, totalvotes)) %>%
drop_na()
voting$percentagevotes = voting$candidatevotes/voting$totalvotes
voting <- subset(voting, select=-c(candidatevotes, totalvotes))
winner <- voting$percentagevotes %>%
aggregate(by = list(voting$year, voting$FIPS), max) %>%
setNames(c('year', 'fips', 'percentagevotes')) %>%
merge(voting) %>%
subset(select = -percentagevotes) %>%
merge(county_pop)
winner$party <- as.factor(winner$party)
img <- plot_usmap(data = filter(winner, winner$year==2016), values = 'party', size=0.01) +
labs(title = '2016 US Presidential Election') +
scale_fill_manual(values = c('democrat'='blue', 'republican'='red'), name='party') +
theme(legend.position = 'right')

What's with Alaska, you ask? Maybe they are just religiously neutral? Maybe they vote white because they love the snow? This led me into another rabbit hole where I learnt that Alaska has boroughs and parishes instead of counties. But, for elections, they have voting districts, and these coincide with their boroughs in only 3 cases: Aleutians East, Aleutians West, and Anchorage; all of which are visible on the map. I've decided that that's more than enough detail for the scope of this blog, so I'm going to pretend Alaska doesn't exist for a while.
Now to see if we can make an animated map with all the results from 2000 - 2016. gganimate
is a wonderful library that helps animate ggplots
(on which usmap
is built) based graphs.
library(gganimate)
anim <- plot_usmap(data = winner, values = 'party') +
labs(title = '{closest_state}') +
scale_fill_manual(values = c('democrat'='blue', 'republican'='red'), name='party') +
theme(legend.position = 'right') +
transition_states(year) +
ease_aes('sine-in')
a <- animate(anim, fps=60, duration=15)
Error: cannot allocate vector of size 320.0 Mb
There is a workaround to this that involves increasing the memory available to R using memory.limit()
, but my computer freaked out when I did that. It seems that R runs entirely on the RAM and hence has trouble with large animations. So, I decided to tone it down.
a <- animate(anim, fps=20, duration=15)

Why does it darken at the end of every year? I don't know. Why is most of Alaska missing? I don't know. Why is the animation not smooth? I don't know. Actually, I do. 20 fps. It doesn't matter though because I'm not exactly being sponsored by Disney here.
Now, to overlay population density data on this. My idea is to set alpha values based on the population density, so high density counties show up brighter. Unfortunately, I simply can't see a way to do that. My search led me to a StackOverflow question, but the solution provided there throws an error. After much trial and error, and banging my head against a wall for a ridiculous amount of time, I've decided that the grapes are sour and this is not the best way to check if population density has anything to do with voting habits.
I opt for a simpler method. Statistical analysis. Our null hypothesis is that voting habits and population density are independent. I feel that a Chi Squared test would suit us just fine, so I group the counties into categories based on population density. For lack of better terms, let's call them the wilderness (0-10 people/sq. mile), villages (10-100), towns (100-1000), cities (1000-10,000), megacities (10,000-60,000), and NYC (way too many).
library(arules) # for the discretize method
breaks <- c(0,10,100,1000,10000,100000)
types <- c('The Wilderness', 'Village', 'Town', 'City', 'Megacity')
winner$type <- winner$pop_density %>%
discretize(method = 'fixed', breaks = breaks, labels = types)
for (year_ in unique(winner$year) %>% sort()) {
winner_by_year <- filter(winner, winner$year == year_)
chisq_result <- table(winner_by_year$party, winner_by_year$type) %>%
chisq.test()
print(year)
print(chisq_result)
}
2000
The Wilderness Village Town City Megacity
democrat 39 361 181 85 8
republican 502 1336 555 54 0
2004
The Wilderness Village Town City Megacity
democrat 52 290 148 89 8
republican 489 1407 588 51 0
2008
The Wilderness Village Town City Megacity
democrat 86 408 261 114 8
republican 455 1289 475 26 0
2012
The Wilderness Village Town City Megacity
democrat 65 299 215 112 8
republican 476 1398 521 28 0
2016
The Wilderness Village Town City Megacity
democrat 41 168 165 112 8
republican 500 1529 571 28 0
Pearson's Chi-squared test
2000
data: table(winner$party, winner$type)
X-squared = 227.63, df = 4, p-value < 2.2e-16
2004
data: table(winner$party, winner$type)
X-squared = 252.36, df = 4, p-value < 2.2e-16
2008
data: table(winner$party, winner$type)
X-squared = 291.04, df = 4, p-value < 2.2e-16
2012
data: table(winner$party, winner$type)
X-squared = 370.58, df = 4, p-value < 2.2e-16
2016
data: table(winner$party, winner$type)
X-squared = 571.82, df = 4, p-value < 2.2e-16
With P-values that close to zero, our null hypothesis can be safely rejected. We can say with absolute certainty that voting habits and population density are correlated. People who live in high density areas definitely vote Democrat. The 9 most densely populated cities in the US all consistently voted Democrat from 2000-2016. Of the top 40 most densely populated counties, 29 consistently voted blue. In fact, the first and only city in that list to consistently vote Republican is Richmond, Virginia which is the 29th most densely populated city in the US. My guess has been proved right.
Are crowds inherently blue, though? Does living in a city magically make you a liberal? Or does being a liberal make you move to cities? It is far more likely that there are some other variables at play here that affect both these factors, and hence the high correlation. Maybe educated folk tend to lean left and tend to live in cities. Maybe people with exposure to different cultures tend to be more progressive and enjoy living in cities. Or maybe the economic policies of the Democrats help city folk while Republican policies help rural folk. Future posts will delve deeper into this topic.
Let me end by reiterating the golden rule of statistics: correlation doesn't imply causation. We've proved that population density and voting habits are intensely intertwined (at least in the US); not that high population density makes people vote blue. If that was the case, the Dems need only stuff people in a small room to gain votes.
Mission accomplished, I'd say. Despite not having a beautiful map to show-off, I've learnt R. And in the end, isn't the journey all that matters?