The quarantine has failed. Give me a worst case scenario, and make it grim!
- Richard Nixon's head, Futurama

Disclaimer: Unlike my previous post, there's no analysis at the end of this one. This is just documentation of my thought process.

I've been busy with other stuff, and that, coupled with a severe case of writer's block, has had me putting off this post for a long time. However, inspiration struck in the form of the recent COVID-related lockdown extensions here in BC. I've decided to do an animated choropleth of the spread of COVID-19 in Canada.

I don't want to use R. In my previous post, a render involving just 4 unique frames caused R to seize up and refuse to work. I had to settle for a 20fps render, and even that took about 10 minutes despite having 16 gigs of RAM to work with. Python's better at this sort of thing, so I'm sticking with it this time.

I don't have to go through the trouble of learning a new language because I'm already comfortable with Python, so I jump ahead to finding a data source. I stumble upon an ArcGIS database with COVID data from various Canadian sources, and it proves to be everything I'm looking for. They update the data every day, and they also supply shapefiles for the Canadian Health Regions. This is golden.

Next, I look up libraries I could use and come across geopandas (an extension of pandas for geographical data) and plotly for interactive plotting. Installing them should be as easy as pip install geopandas plotly.

ERROR: Command errored out with exit status 1:
     command: 'g:\prax\Codereido\scripts\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\codereido\\AppData\\Local\\Temp\\pip-install-wgh7cpq7\\fiona\\setup.py'"'"'; __file__='"'"'C:\\Users\\codereido\\AppData\\Local\\Temp\\pip-install-wgh7cpq7\\fiona\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users\codereido\AppData\Local\Temp\pip-pip-egg-info-_sj0cnze'
         cwd: C:\Users\codereido\AppData\Local\Temp\pip-install-wgh7cpq7\fiona\
    Complete output (1 lines):
    A GDAL API version must be specified. Provide a path to gdal-config using a
GDAL_CONFIG environment variable or use a GDAL_VERSION environment variable.
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
No such luck

It seems that one of the dependencies of geopandas is fiona, which is an API for GDAL, which in turn is a library for translating raster and vector geospatial data formats. This requires that I have GDAL installed, so I try doing that. OSGeo4W is a project that brings various FOSS geospatial software to Windows (which is sadly my current development environment). The installer bundles all of the software under the OSGeo4W umbrella, but we do have the option to install GDAL and its Python bindings alone, since the entire suite is several GBs in size. With the installation successfully done, we can safely proceed.

ERROR: Command errored out with exit status 1:
	.
	.
	.
    Complete output (1 lines):
    A GDAL API version must be specified. Provide a path to gdal-config using a
GDAL_CONFIG environment variable or use a GDAL_VERSION environment variable.
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
Nope

Got the same error again. I take a deeper look into the error message, and a quick Google search gives me new information. gdal-config, a script that fiona requires, is not available in the OSGeo4W build of GDAL; it's only present in the Linux builds. Why do I punish myself by continuing to stick to Windows?

Yeah... why ever do I continue to use Windows?

Apparently, Anaconda has a painless geopandas installation. I don't like Anaconda because it adds another Python environment for me to manage, but it looks like I don't have a choice. conda install geopandas does the job. Installation takes a few minutes, but is painless as promised.

Let's get right to the code.

import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt

shapefile = './data/Canada COVID Data/RegionalhealthBoundaries.shp'
gdf = gpd.read_file(shapefile)		# health region boundaries along with their case counts

f, ax = plt.subplots(figsize=(10, 10))
ax.set_axis_off()					# a map doesn't need axes
# colour each health region by its current case count
gdf.plot(ax=ax, column='CurrentCas', cmap='OrRd', linewidth=.1, edgecolor='#B3B3B3')
plt.show()
Take a look at the current cases
That looks extremely informative!

It is at this point that I realize why there aren't a lot of Canadian choropleths showing data that has anything to do with the populace. More than 50% of the population lives in less than 0.2% of the total land area. We can see the Health Regions that encompass Metro Vancouver, Calgary, Edmonton, the GTA, Montréal, Ottawa, and Québec City highlighted on the map, and that is where everyone lives. This map is going to look exactly like this at any point in time. Definitely not worth the trouble of animating, especially after all the trouble I went through to install geopandas.
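
Out of curiosity, a quick check along these lines would show just how lopsided the distribution is (only a sketch against the same CurrentCas column; picking 7 regions is my own choice to match the metro areas named above, and I haven't verified the exact share):

cases = gdf['CurrentCas'].sort_values(ascending=False)		# current cases per health region
top_share = cases.head(7).sum() / cases.sum()				# share held by the 7 biggest regions
print(f'Top 7 health regions account for {top_share:.0%} of current cases')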

Now, I'm not going to let this goldmine of a data source go to waste. I'm going to at the very least make a nice graph showing the rise in COVID cases in BC through the year.

import numpy as np
import pandas as pd
import plotly.graph_objects as go

data = pd.read_csv('./src/data/Canada COVID Data/Provincial_Daily_Totals.csv')
data = data[data['Abbreviation']=='BC']		# select data relevant to BC alone
data = data.reset_index(drop=True)
data = data.fillna(0)
# drop unnecessary columns
del data['Province']
del data['Abbreviation']
del data['OBJECTID']
del data['DailyTested']
del data['TotalTested']

# fix the date column (keep just the date part and parse it)
data['SummaryDate'] = data['SummaryDate'].apply(lambda x: x.split()[0]).astype(np.datetime64)
# set proper integer types for the other columns
cols = list(data.columns)
cols.remove('SummaryDate')
for item in cols:
	data[item] = data[item].astype(np.int64)
Set up the data
     DailyTotals SummaryDate  TotalCases  TotalRecovered  DailyRecovered  TotalDeaths  DailyDeaths  TotalActive  DailyActive  TotalHospitalized  DailyHospitalized  TotalICU  DailyICU
0              0  2020-01-25           0               0               0            0            0            0            0                  0                  0         0         0
1              0  2020-01-26           0               0               0            0            0            0            0                  0                  0         0         0
2              0  2020-01-27           0               0               0            0            0            0            0                  0                  0         0         0
3              1  2020-01-28           1               0               0            0            0            1            1                  0                  0         0         0
4              0  2020-01-29           1               0               0            0            0            1            0                  0                  0         0         0
..           ...         ...         ...             ...             ...          ...          ...          ...          ...                ...                ...       ...       ...
295            0  2020-11-15       20985           14901               0          290            0         5579            0                167                  0        50         0
296         1959  2020-11-16       22944           16087            1186          299            9         6279          700                181                 14        57         7
297          717  2020-11-17       23661           16469             382          310           11         6589          310                198                 17        63         6
298          761  2020-11-18       24422           16914             445          320           10         6861          272                209                 11        58        -5
299          536  2020-11-19       24958           17206             292          321            1         6926           65                217                  8        59         1

[300 rows x 13 columns]
A glance at the data
fig = go.Figure()
fig.add_trace(
	go.Scatter(
		x=data['SummaryDate'], y=data['TotalCases'], mode='lines', name='Total cases'))
fig.update_layout(title='Cumulative cases', yaxis_title='No. of people', xaxis_title='Time')
fig.show()
Draw a quick and dirty plot
What's with the squiggles?

That's curious. COVID numbers shouldn't follow a regular pattern like that. While looking at the data, I notice that there are a few days where the daily reported cases are all zero.
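
Just to see where those zeros fall, a quick check like this does the trick (a sketch; the quiet days back in January and February will show up here too):

zero_days = data.loc[data['DailyTotals'] == 0, 'SummaryDate']	# dates with no new reported cases
print(zero_days.dt.day_name().value_counts())					# which weekdays they land on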

fig = go.Figure()
fig.add_trace(
	go.Scatter(
    	x=data['SummaryDate'], y=data['DailyTotals'], mode='lines', name='New cases'))
fig.update_layout(title='Daily cases', yaxis_title='No. of people', xaxis_title='Time')
fig.show()
Some wise man once said, "A quick and dirty plot speaks a thousand words"
There's a very clear pattern

Turns out, the BCCDC doesn't report numbers on weekends and instead gives out the accumulated numbers on Mondays or, sometimes, Tuesdays. This makes for a misleading graph, as BC has had nowhere near 2000 cases in a single day yet. We can correct for this pretty easily by spreading each dump evenly over the days it covers; the 1959 cases reported on November 16, for instance, become 653 apiece across Saturday, Sunday, and Monday. Our healthcare workers need a break too, so it's all good.

def correct_for_weekends(series, start):
	# spread each Monday (or Tuesday) dump evenly over the unreported days before it
	for i in range(start, len(series), 7):
		n = 3
		if series[i] == 0:				# if the numbers were given out on Tuesday instead
			n += 1
			i += 1
		x = series[i] % n
		y = series[i] - x				# so we don't have 1/3rd cases on some days
		series.loc[i-n+1:i] = [y//n]*n	# same share for each of the n days
		series[i] += x					# the leftover goes to the reporting day

for item in cols:
	if 'Daily' in item:
		print(item)
		correct_for_weekends(data[item], 2)			# Correct for weekends

# Remembrance Day: split the Nov 12 report evenly over Nov 11 and Nov 12
index = data[data['SummaryDate']==np.datetime64('2020-11-11')].index[0]	# Lest we forget
for item in cols:
	if 'Daily' in item:
		data.loc[index:index+1, item] = [data.loc[index+1, item]//2] * 2
That looks far more realistic
data['TotalCases'] = data['DailyTotals'].cumsum()
data['TotalRecovered'] = data['DailyRecovered'].cumsum()
data['TotalDeaths'] = data['DailyDeaths'].cumsum()
data['TotalActive'] = data['DailyActive'].cumsum()
data['TotalHospitalized'] = data['DailyHospitalized'].cumsum()
data['TotalICU'] = data['DailyICU'].cumsum()
Recalculate the cumulative data
     DailyTotals SummaryDate  TotalCases  TotalRecovered  DailyRecovered  TotalDeaths  DailyDeaths  TotalActive  DailyActive  TotalHospitalized  DailyHospitalized  TotalICU  DailyICU
0              0  2020-01-25           0               0               0            0            0            0            0                  0                  0         0         0
1              0  2020-01-26           0               0               0            0            0            0            0                  0                  0         0         0
2              0  2020-01-27           0               0               0            0            0            0            0                  0                  0         0         0
3              1  2020-01-28           1               0               0            0            0            1            1                  0                  0         0         0
4              0  2020-01-29           1               0               0            0            0            1            0                  0                  0         0         0
..           ...         ...         ...             ...             ...          ...          ...          ...          ...                ...                ...       ...       ...
295          653  2020-11-15       21835           15319             395          275            3         5931          233                180                  4        53         2
296          653  2020-11-16       22488           15715             396          278            3         6165          234                186                  6        56         3
297          717  2020-11-17       23205           16097             382          289           11         6475          310                203                 17        62         6
298          761  2020-11-18       23966           16542             445          299           10         6747          272                214                 11        57        -5
299          536  2020-11-19       24502           16834             292          300            1         6812           65                222                  8        58         1

[300 rows x 13 columns]
A look at the new data

There seems to be some discrepancy in the totals, e.g. Total Recovered went from 17206 to 16834. The code to find the cumulative sum can't have a bug, since it is literally a one-liner. So either my computer is bad at math or the BCCDC's numbers don't add up. Now, there might very well be a perfectly sensible reason for the numbers being this way, but right now I have to choose between data that's honest to the source and data that doesn't have squiggles in it. I make the sensible decision and go with the one that looks better (pro tip: this is generally considered a bad move).
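
If I wanted to pin down the blame, a check along these lines (a sketch that re-reads the untouched CSV, using the recovered counts as an example; I haven't run it) would show whether the source's own running totals agree with a cumulative sum of its daily figures:

raw = pd.read_csv('./src/data/Canada COVID Data/Provincial_Daily_Totals.csv')
raw = raw[raw['Abbreviation'] == 'BC'].reset_index(drop=True).fillna(0)
# rows where the reported running total disagrees with the sum of the dailies so far
mismatch = raw['TotalRecovered'] != raw['DailyRecovered'].cumsum()
print(raw.loc[mismatch, ['SummaryDate', 'DailyRecovered', 'TotalRecovered']].tail())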

fig = go.Figure()

trace_tCases = go.Scatter(x=data['SummaryDate'], y=data['TotalCases'], mode='lines',  name='Total cases',  line_shape='spline')
trace_tActive = go.Scatter(x=data['SummaryDate'], y=data['TotalActive'], mode='lines',  name='Total active',  line_shape='spline')
trace_tRecovered = go.Scatter(x=data['SummaryDate'], y=data['TotalRecovered'], mode='lines',  name='Total recovered',  line_shape='spline')
trace_tDeaths = go.Scatter(x=data['SummaryDate'], y=data['TotalDeaths'], mode='lines',  name='Total deaths',  line_shape='spline')

frames = [dict(data=[dict(type='scatter',
				x=data['SummaryDate'][:k+1],	# each frame includes all data up to day k
				y=data['TotalCases'][:k+1]),
			dict(type='scatter',
				x=data['SummaryDate'][:k+1],
				y=data['TotalActive'][:k+1]),
			dict(type='scatter',
				x=data['SummaryDate'][:k+1],
				y=data['TotalRecovered'][:k+1]),
			dict(type='scatter',
				x=data['SummaryDate'][:k+1],
				y=data['TotalDeaths'][:k+1])],
		# frames[k]['data'][0] updates TotalCases, [1] -> TotalActive, [2] -> TotalRecovered, [3] -> TotalDeaths
		traces=[0, 1, 2, 3])
	for k in range(1, len(data)-1)]


layout = go.Layout(title='COVID-19 cases in BC',
				   yaxis_title='No. of people',
				   xaxis_title='Time',
				   hovermode='x unified',
				   updatemenus=[dict(type='buttons', showactive=False,
								y=1.05,
								x=1.15,
								xanchor='right',
								yanchor='top',
								pad=dict(t=0, r=10),
								buttons=[dict(label='Play',
											  method='animate',
											  args=[None, 
													dict(frame=dict(duration=3, 
																	redraw=False),
														 transition=dict(duration=0),
														 fromcurrent=True,
														 mode='immediate')])])])

fig = go.Figure(data=[trace_tCases, trace_tActive, trace_tRecovered, trace_tDeaths], frames=frames, layout=layout)
fig.show()
Animate the graph
Exponential growth is scary

The anomalous dip in active cases in mid-September is due to a delay in reporting data, as explained in this news article. Other than that, the graph follows a pretty standard exponential growth curve. I really, really hope that BC's healthcare system holds out through this pandemic and that we continue to see the majority of affected people make a full recovery.

Click for an interactive graph