COVID-19 has hit hard in the past couple of weeks and its impact has been notorious both from a sanitary perspective and an economic one. Plenty has been written about it, especially statistical reports on its exponential growth and the importance of “flattening the curve”.

At RMOTR, we wanted to help raise awareness of the issues associated with the spread of COVID-19 by making a dynamic and interactive analysis of the situation using Python and Data Science.

We’ve made an interactive project that you can fork and follow step by step. You can see the process that Data Scientists follow to analyze the situation and make predictions. Here is a quick summary.

👉 Click this button to fork the project and follow it step by step👈

This project has been created by @yosoymatias, one of our expert Data Scientists on Staff. This is a write up of his work.

Part 1: The Basics of Exploratory Data Analysis and Data Wrangling

We’ll start with the first Notebook, Part 1.ipynb and follow the basic steps of every Data Science project.

Step 1: Reading Data

The first step is getting the data. In this case, we’re using this Github repo by Johns Hopkins University that contains CSV files updated daily.

We’re using a neat Pandas technique of reading the CSV directly from the Github repo, which means we can run our notebooks everyday and it’ll stay up to date:

Step 2: Data Cleaning

This is the mandatory second step of our project. Here, the data is fairly clean, so there isn’t much more to do. We’ll replace a few country names and fill in blanks.

Step 3 & 4: Analysis and Data Wrangling

In the first stage of analyzing, we’ll start with a worldwide impact analysis of COVID-19. To conduct our analysis, we’ll need to create new columns, create intermediate DataFrames, and re-shape (melt, stack, group) our data. This is usually known as the “Data Wrangling” process.