by: Bates Rambow and Benjamin Millard

March 26, 2020

At CartoLab, we build a lot of data-driven web applications. Web dashboards with maps are popular for sharing information about the spread of the novel Coronavirus, so we decided to build one. We built it from scratch in the span of a couple of days, and now we want to share how we did it. This is part one of a two-part series covering how we built our dashboard, starting with our data processing pipeline.

Step 1: Choosing What Data to Show

There are already many variations of map dashboards out there showing global data counts. We wanted to add some additional features to our dashboard like time-series animations on a county-level, 3D visualizations, and overlays with other contextual data like demographics.

For our data source on virus counts we chose data from usafacts.org, which compiles cumulative data on a county-by-county basis along with data from Johns Hopkins University.

We also wanted to show additional data to add context to the count, so we pulled some data from the U.S. Census Bureau including county boundaries and statistics on a county level such as:

Total population per county

Percentage of people over 65

Percentage of people with health insurance

Median age

Our dashboard includes virus counts, ratios, and demographic data

Step 2: Data ETL

We wrote ETL (extract-transform-load) pipelines to load the virus data into a PostGIS database alongside the Census Bureau data described above: county boundaries and demographic statistics.

COVID-19 Data

Example of raw data structure from USAFacts

The COVID-19 data from USAFacts is available as two CSV files: one with cumulative confirmed case counts (a running total for each day since tracking began) and one with cumulative deaths. We used a Python script to pull each CSV file into an in-memory IO stream and import the data into a Pandas data frame. Pandas is a Python library that is invaluable for data ETL and cleanup operations.
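As a minimal sketch of that import step (a literal string stands in for the HTTP download, and the column names are illustrative of the USAFacts layout rather than exact):

```python
import io
import pandas as pd

# Stand-in for the downloaded CSV body; in production this text would come
# from an HTTP response for the USAFacts confirmed-cases file.
csv_text = """countyFIPS,County Name,State,1/22/20,1/23/20
1001,Autauga County,AL,0,0
1003,Baldwin County,AL,0,1
"""

# Wrap the raw text in an IO stream and hand it to Pandas.
confirmed = pd.read_csv(io.StringIO(csv_text))
print(confirmed.shape)
```

The same pattern is used for the deaths file; each date is its own column at this stage, which is what the pivot described below fixes.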

The data from USAFacts needs a little cleanup to be truly useful: the time series data must be pivoted from wide to long format. With Pandas we tidied the data (column headers, etc.), set an index on all columns except the time series columns, and ran the Pandas stack function to pivot the date columns into rows. Finally, we merged the two CSVs into a single data table and removed any duplicate rows.
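The pivot can be sketched like this (column names are hypothetical stand-ins for the USAFacts layout):

```python
import pandas as pd

# Wide-format frame: one column per date, as it arrives from the CSV.
raw = pd.DataFrame({
    "countyFIPS": [1001, 1003],
    "County Name": ["Autauga County", "Baldwin County"],
    "State": ["AL", "AL"],
    "1/22/20": [0, 0],
    "1/23/20": [1, 2],
})

# Index every identifier column, then stack() the remaining date columns
# into rows; reset_index() turns the stacked Series back into a frame.
tidy = (
    raw.set_index(["countyFIPS", "County Name", "State"])
       .stack()
       .reset_index()
       .rename(columns={"level_3": "date", 0: "confirmed_cases"})
)
tidy = tidy.drop_duplicates()
print(tidy)
```

The deaths file gets the same treatment, after which the two long tables can be merged on the county FIPS code and date.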

We loaded the data into a PostgreSQL database (more on that later), so with each update we truncated the destination tables and reset indexes before inserting new data. We also used a database cursor to refresh some materialized views in our database (more on that below as well).
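A sketch of that reload step, assuming a DB-API connection (we used a standard PostgreSQL driver); the table and view names here are illustrative, not our exact schema:

```python
def reload_covid_table(conn, rows, columns,
                       table="covid.covid_data",
                       matviews=("covid.county_covid_mv",)):
    """Truncate the destination table, bulk-insert fresh rows, and refresh
    dependent materialized views, all on one cursor."""
    placeholders = ", ".join(["%s"] * len(columns))
    with conn.cursor() as cur:
        # Truncate (rather than delete) so each update starts clean and
        # identity sequences are reset.
        cur.execute(f"TRUNCATE TABLE {table} RESTART IDENTITY;")
        cur.executemany(f"INSERT INTO {table} VALUES ({placeholders});", rows)
        # Materialized views do not update on their own; refresh explicitly.
        for mv in matviews:
            cur.execute(f"REFRESH MATERIALIZED VIEW {mv};")
    conn.commit()
```

In practice a bulk COPY would be faster than executemany for large frames; the control flow is the same.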

Data structure after being loaded into PostgreSQL

Census Data

For county boundaries and demographics, we obtained the data from the US Census Bureau. The county boundaries come from the Cartographic Boundary dataset. The demographic data was pulled from the American Community Survey. We used another script to pull specific metrics for all counties via the Bureau's API, but the same data could be downloaded from the Data Finder tools as well.
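A sketch of the API pull, following the Census API's conventions (variable-list `get` parameter, `for=county:*` geography filter, and a list-of-lists JSON payload whose first row is the header); treat the endpoint year and variable codes as assumptions:

```python
import pandas as pd

# ACS 5-year data profile endpoint (2018 vintage, matching the table
# names used elsewhere in this post).
ACS_URL = "https://api.census.gov/data/2018/acs/acs5/profile"

def acs_params(variables):
    """Build the query parameters for a set of data-profile variables,
    requested for every county in the country."""
    return {"get": ",".join(["NAME"] + variables), "for": "county:*"}

def acs_to_frame(payload):
    """Convert the API's list-of-lists response (header row first)
    into a Pandas data frame."""
    header, *rows = payload
    return pd.DataFrame(rows, columns=header)
```

Usage would be something like `acs_to_frame(requests.get(ACS_URL, params=acs_params(["DP05_0001E", "DP05_0018E"])).json())`, giving one row per county ready to load into the database.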

We imported the county boundaries into our PostgreSQL database (which is spatially-enabled with PostGIS) using OGR. We also imported the demographic data to a table in the database.
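The OGR import boils down to one ogr2ogr invocation; a sketch as a command builder (the connection string is a placeholder, and the layer-creation options reflect common practice rather than our exact invocation):

```python
def ogr2ogr_command(shapefile, table, dsn="PG:dbname=covid"):
    """Build the ogr2ogr command that loads a shapefile into PostGIS."""
    return [
        "ogr2ogr",
        "-f", "PostgreSQL", dsn,        # write into the PostGIS-enabled DB
        shapefile,
        "-nln", table,                  # destination table name
        "-nlt", "MULTIPOLYGON",         # normalize mixed polygon types
        "-lco", "GEOMETRY_NAME=geom",   # name the geometry column 'geom'
    ]

cmd = ogr2ogr_command("cb_2018_us_county_5m.shp", "census.cb_2018_us_county_5m")
# subprocess.run(cmd, check=True)  # run this where GDAL/OGR is installed
```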

Step 3: Data Joins

The point of loading this data into a PostgreSQL database is that we can perform data joins between tables. For example, we can join virus cases and deaths to the county boundaries, and we can also join census demographics to make a useful final table we can export and load into vector tiles to display on a map. Here is a look at the SQL we used to build a materialized view that displays virus and other data alongside GIS boundaries, suitable for displaying on a map.

SELECT tl."STATEFP" AS state_fp,
    tl."COUNTYFP" AS county_fp,
    tl."GEOID" AS geoid,
    tl."NAME" AS county_name,
    tl.geom,
    st_x(st_centroid(tl.geom)) AS lon,
    st_y(st_centroid(tl.geom)) AS lat,
    cd.county_fips AS cvd_geoid,
    cd.county_name AS cvd_county_name,
    cd.state_name,
    (cd.date)::character varying AS date,
    date_part('epoch'::text, cd.date) AS unix_epoch,
    date_part('day'::text, ((cd.date)::timestamp without time zone - '2020-01-22 00:00:00'::timestamp without time zone)) AS day_count,
    cd.confirmed_cases,
    cd.deaths,
    "right"((dp."GEO_ID")::text, 5) AS dp_geoid,
    (dp."DP05_0001E")::integer AS total_pop,
    (dp."DP05_0018E")::double precision AS median_age,
    (dp."DP05_0024E")::integer AS tot_65_and_over,
    round((((dp."DP05_0024E")::double precision / (dp."DP05_0001E")::double precision))::numeric, 2) AS pct_65_and_over,
    (dp."DP03_0062E")::integer AS median_hh_income,
    (dp."DP03_0095E")::integer AS tot_healthcare_pop,
    (dp."DP03_0096E")::integer AS tot_w_health_coverage,
    round((((dp."DP03_0096E")::double precision / (dp."DP03_0095E")::double precision))::numeric, 2) AS pct_w_health_coverage,
    CASE
        WHEN ((dp."DP05_0001E")::integer > 0) THEN round(((cd.confirmed_cases)::numeric / (((dp."DP05_0001E")::integer)::numeric * 0.001)), 2)
        ELSE NULL::numeric
    END AS confirmed_cases_per_1000,
    CASE
        WHEN ((dp."DP05_0001E")::integer > 0) THEN round(((cd.confirmed_cases)::numeric / (((dp."DP05_0001E")::integer)::numeric * 0.0001)), 2)
        ELSE NULL::numeric
    END AS confirmed_cases_per_10000
FROM ((census.cb_2018_us_county_5m tl
    LEFT JOIN covid.covid_data cd ON (((cd.county_fips)::text = (tl."GEOID")::text)))
    LEFT JOIN census.acs_5yr_dp_2018_counties dp ON (("right"((dp."GEO_ID")::text, 5) = (tl."GEOID")::text)))

Step 4: Building Mapbox Vector Tiles

At CartoLab, we like to use the Mapbox Vector Tile data format for displaying data in web maps. Vector tiles are great because we get the performance of tiled data combined with the data-access capabilities of vector data. We feel that vector tiles, coupled with a WebGL-enhanced mapping library like MapboxGL.js, are perfect for displaying large amounts of data (say, county-level time series data) in a web map. So how do we get from a PostGIS table or view to a vector tile?

Since we do this kind of work a lot, we have utility scripts built around Python and a Mapbox command-line tool called Tippecanoe. Essentially, the script reads a PostGIS table, converts it to GeoJSON, and then processes it with Tippecanoe. There are other ways to make vector tiles, but we feel Tippecanoe offers the most control over the final .mbtiles file in terms of size and how it renders in a browser. The script also handles uploading the tiles to Mapbox.com, which we are using as a tile server on this project. Once the vector tiles are uploaded to the tile server, they are ready to be used in a web map.
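The tiling step can be sketched as another command builder; the flags shown are real Tippecanoe options, but the layer name and zoom strategy are assumptions, not our exact configuration:

```python
def tippecanoe_command(geojson_path, mbtiles_path, layer="counties"):
    """Build the Tippecanoe command that turns a GeoJSON export
    into an .mbtiles vector tileset."""
    return [
        "tippecanoe",
        "-o", mbtiles_path,             # output .mbtiles file
        "-l", layer,                    # layer name inside the tileset
        "-zg",                          # let Tippecanoe guess the max zoom
        "--drop-densest-as-needed",     # thin features to keep tiles small
        geojson_path,
    ]

cmd = tippecanoe_command("counties.geojson", "counties.mbtiles")
# subprocess.run(cmd, check=True)  # run where Tippecanoe is installed
```

The resulting .mbtiles file is what gets uploaded to the tile server.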

Vector tile uploaded to Mapbox.com

This concludes Part 1 of our series. Stay tuned for Part 2, in which we discuss developing the front end of the dashboard.