Image source: gratisography.com

Note from the editors: Towards Data Science is a Medium publication primarily based on the study of data science and machine learning. We are not health professionals or epidemiologists, and the opinions of this article should not be interpreted as professional advice. To learn more about the coronavirus pandemic, you can click here.

Where the F@#$ is the data?

This all began when I tried to answer what should have been a simple question.

I live in the San Francisco Bay Area, one of the areas hardest hit by COVID-19 and one of the early epicenters for this virus. But I don’t live in Santa Clara County, the part of the Bay Area that had a huge early surge. I live a bit north and across the bay, in Oakland.

So although Santa Clara was going gangbusters with new cases, I wanted to know how Oakland — and Alameda County (the county where Oakland is located) — was affected.

Like everyone else, I had experiences that made me wonder how bad this was getting in my city. I stood in a line at Trader Joe’s that wrapped from the registers clear around the store. I discovered with bafflement that toilet paper was the new hot commodity. And, just as at my own company, I kept hearing of more and more friends working from home. But was this just people taking precautions, or was it an indicator of how bad things were getting? I needed data.

My first stop was where everyone was going, the Johns Hopkins dashboard.

The dashboard was well designed, but it didn’t have the granularity I needed to figure out what was happening in my local community. It actually did have regional and county-level data in the early days, but that was removed at some point, and there was no way to drill down further into the data. I also wanted to explore more visualizations than a simple count of cases.

I also tried the dashboards made by the New York Times, the World Health Organization, and the one by ESRI (maker of ArcGIS). None of them really gave me the specificity of data I wanted.

Luckily, right around this time my coworker Jay sent me an out-of-the-blue text (okay, he’s technically my boss, but I don’t do well with hierarchy so I tend to think of everyone as collaborators).

“Hey man, have you come across any good Coronavirus data sources?”

Turns out he had been doing the same kind of searching. He had also been frustrated by the lack of easily accessible data on the novel Coronavirus.

Building it for ourselves

For Jay and me, that text kicked off two weeks of building and rebuilding our own dashboards that pulled in any trusted COVID-19 info we could find. We originally built them for ourselves to help us wrap our heads around what was going on, but as things progressed, we kept adding more and more dashboards and eventually decided we should share them publicly.

But deciding to release it publicly created a whole other challenge. We had been building these for ourselves. We didn’t have centralized navigation or clean organization. Hell, most of our charts were missing titles and axis labels — which would have made my grad school advisors cringe.

My grad school advisors if they had seen those early data visualization dashboards (source: knowyourmeme.com)

But we redoubled our effort and, with a few sleepless nights, built something we’re actually proud of.

Something we’re not embarrassed to show people

You can check out the final result here: COVID-19 Data Hub

Image by the author. See the dashboard here.

In the end, we were able to release dashboards for 10 countries, a search-by-zip-code page so people can see their local area, an SF Bay Area dashboard (my favorite), and 10 US state dashboards.

We also recently added a dashboard that shows testing data in the US, opened up the API so anyone can pull our data, and built a couple of pages for search-based analytics.

The testing map of the United States. See the testing dashboard here (Image by Author)

I fully admit there are still some parts of it that need a little design and polish. But I am very excited to be building something that will hopefully help get people actionable information during this crisis.

We’re also happy to take suggestions or requests on what to build out next. Below is an image of a new dashboard that allows an easy comparison of the COVID-19 growth curves for various countries. We built it in response to comments that people wanted to be able to monitor which countries are successfully flattening the curve. If there is sufficient demand, we may build a similar one for US states.

This dashboard’s goal is to make it relatively easy to see which countries are flattening the curve (Image by author)

We are also continuing to build this out. Between the time I started writing this article and today, we released a new animated visualization that shows how COVID-19 cases progressed over time by state (see below).

Video by author. See the full/current animated COVID-19 timeline here.

Ongoing challenges

The biggest ongoing challenge is that data access keeps changing. The Johns Hopkins dataset has been amazing and we are incredibly grateful to them for it. That said, both the ability to access it and the method for doing so have changed a couple of times in the past couple of weeks, requiring us to rework our data hub. Our ideal would be to have a clear and consistent API for pulling the data. But hopefully, that’s what our API can be for other people.

How it feels when your data source changes their API, data format, or how it’s accessed. (Source: gratisography.com)
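
One way to soften this kind of churn is to map every header variant you have seen onto a single canonical name before anything downstream touches the data. The Python sketch below is illustrative only and is not our production code: the alias table, the required-column set, and the function name `normalize_rows` are all assumptions made up for this example, and the sample rows use placeholder values rather than real case counts.

```python
import csv
import io

# Hypothetical alias table: each header variant maps to one canonical name.
# Extend it whenever the upstream layout changes again.
COLUMN_ALIASES = {
    "province/state": "province_state",
    "province_state": "province_state",
    "country/region": "country_region",
    "country_region": "country_region",
    "last update": "last_update",
    "last_update": "last_update",
    "confirmed": "confirmed",
    "deaths": "deaths",
}

# Assumed minimum set of columns a dashboard needs to draw anything.
REQUIRED = {"country_region", "confirmed"}


def normalize_rows(csv_text):
    """Parse CSV text and return rows keyed by canonical column names."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        row = {}
        for header, value in raw.items():
            if header is None or value is None:
                continue  # ragged line: extra or missing cells
            key = COLUMN_ALIASES.get(header.strip().lower())
            if key is not None:
                row[key] = value.strip()
        missing = REQUIRED - set(row)
        if missing:
            # Fail loudly instead of silently charting an empty column.
            raise ValueError(f"unrecognized layout, missing: {sorted(missing)}")
        rows.append(row)
    return rows


# Two layouts with different headers (placeholder values, not real data).
old_layout = "Province/State,Country/Region,Last Update,Confirmed,Deaths\nSomewhere,US,2020-03-10,10,1\n"
new_layout = "Province_State,Country_Region,Last_Update,Confirmed,Deaths\nSomewhere,US,2020-03-25,20,2\n"

for text in (old_layout, new_layout):
    print(sorted(normalize_rows(text)[0]))
```

Both layouts come out with the same keys, so the charts never need to know which version of the file they were fed. Raising on an unrecognized layout is a deliberate choice here: a broken import is easier to notice than a dashboard quietly showing zeros.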

The other big challenge has been getting access to richer datasets related to the Coronavirus outbreak so we can create more actionable visualizations and dashboards. Some at the top of our list are hospital bed utilization, city-level data, and projections related to jobless claims.

I’ve seen at least one county get access to city-level data, but due to the source of this data, it may remain inaccessible to everyone else for some time.

Pie in the sky: city-level data

Contra Costa County has managed to acquire and post city-level data by manually pulling it from CalREDIE and manually entering it into their own system.

For those who don’t know, CalREDIE stands for California Reportable Disease Information Exchange. It’s a website hosted by the California Department of Public Health for electronic disease reporting and surveillance.

You see, there are some diseases and conditions that healthcare providers and laboratories are required by State law to report. When healthcare providers and labs report these conditions, the reports go into CalREDIE. COVID-19 is one such condition, so in theory, all COVID-19 cases in California are going into CalREDIE.

Here’s the problem: currently — and for perfectly valid privacy reasons — the data in CalREDIE can only be accessed by local health departments and the California Department of Public Health. There is no API for anonymized versions of the data. That’s why Contra Costa County is having to manually scrape that data, anonymize it, and put it into their COVID-19 portal.
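
I don’t know the exact steps Contra Costa County follows, but one common way to anonymize case-level records is to publish only per-city totals and to hide any total small enough to point at individual people. Here is a minimal Python sketch of that idea. The `city` field name, the `min_cell_size` threshold of 11, and the function name are assumptions for illustration; none of them reflect CalREDIE’s actual schema or the county’s actual process.

```python
from collections import Counter


def city_counts(case_records, min_cell_size=11):
    """Collapse case-level records into per-city counts.

    Counts below min_cell_size are returned as None (suppressed), so a
    handful of cases in a small city cannot be traced back to individuals.
    """
    counts = Counter(record["city"] for record in case_records)
    return {
        city: (n if n >= min_cell_size else None)
        for city, n in counts.items()
    }


# Placeholder records: only the city field survives aggregation.
records = [{"city": "City A"}] * 12 + [{"city": "City B"}] * 3
print(city_counts(records))
```

In this example City A is published with its count of 12, while City B’s 3 cases are suppressed. The right threshold is a policy decision for the health department, not something a script should decide on its own.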

For news outlets, journalists, reporters, and bloggers

Some news outlets have built really amazing data visualizations. The SF Chronicle’s page is one I still visit regularly, even with our own data hub launched. But not all news outlets have that capability, even though their readers would benefit from outbreak data. If you work for an organization that needs a visualization, reach out to me. You can add any of our visualizations to your website for free.

Any visualization in our data hub can be embedded in another website with an iframe or some JavaScript, and I would be happy to help you get set up.
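
For anyone wiring this up by hand, an iframe embed is just a short HTML tag pointing at the visualization’s URL. The small Python helper below builds one as a rough sketch. The URL is a placeholder on example.com, and the function name and default sizes are assumptions; the real embed URL for a given chart would come from the data hub itself.

```python
from html import escape


def iframe_snippet(src, width="100%", height=600, title="COVID-19 dashboard"):
    """Return an HTML iframe tag that embeds the page at src."""
    # Escape every attribute value so characters like & and " stay valid HTML.
    return (
        f'<iframe src="{escape(src, quote=True)}" '
        f'width="{escape(str(width), quote=True)}" '
        f'height="{escape(str(height), quote=True)}" '
        f'title="{escape(title, quote=True)}" '
        'style="border:0" loading="lazy"></iframe>'
    )


# Placeholder URL, not a real dashboard address.
print(iframe_snippet("https://example.com/dashboard?region=bay-area&view=cases"))
```

Paste the printed tag into your page’s HTML wherever the chart should appear. The `title` attribute is there for screen readers, and `loading="lazy"` asks the browser to hold off loading the chart until the reader scrolls near it.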