Baby steps with Covid-19 data for (Clojure) programmers

March 18, 2020



The Corona pandemic is on everyone's mind. If your country has not been locked down yet, it will be soon. The world and human society are not going to be the same as before. Let's not worry too much about how the economy is going to be hit, but hope that we solve the health crisis first.

Anyway, soon most of us will have to spend most of our time inside, and once we have taken care of basic needs and made sure our loved ones are safe, we'll have some time to pass. Most people will re-watch their favorite TV shows and re-play their games. I'm sure that many will acquire various skills of America's Got Talent quality. Some programmers, though, will itch to try their skills on Covid-19 data.

Maybe you wanted to add some machine learning skills to your programming-fu, but the day job just made that impossible. Why not start now? If this thing is going to be on our minds for months, if not years, we might as well throw some programming magic at it.

By now, we've seen many shiny visualizations and analyses of the pandemic, published by experts and amateurs. Maybe you itch to throw the data into a magic machine learning framework and get some super-x-ray insight from artificial intelligence.

I won't do that here. First, because I know nothing about epidemiology. Please do not take any conclusions you hear from non-experts for granted; they don't know what they're talking about. Second, because the data that we can publicly access is too scarce to be thrown at any machine learning beast. You might get some numbers out, but at best these numbers will only tell you what's already obvious from the visualizations, and at worst they will be complete garbage.

Once the data becomes more reliable and abundant, we might be able to use it for some insight, provided that we learn some basics of epidemiology by then. Until then, I propose that we brush up on our basic data skills through pure play.

So, there is no need to feel sorry that you haven't learned some Big Machine Learning Framework yet. With basic programming skills, you can dissect the currently available data just fine, if not more easily! Just pure, plain Clojure, without specialized libraries!

Let's take some basic steps with the Covid-19 data published by Johns Hopkins University. I'll use a copy provided by Oscar Wahltinez in this Git repository.

Loading the data

The data is in a CSV file, so we require some useful namespaces for working with files.

    (ns dragan.rocks.covid-19.world
      (:require [clojure.java.io :as io]
                [clojure.data.csv :as csv]))

CSV files are textual files in which values are separated by commas and newlines. Typically, each line represents one observation, while the values of the different variables for that observation are separated by commas. The first task is to translate that into convenient data structures in memory. We load the file as a resource, slurp its contents into a string, and parse it into a lazy sequence.

    (csv/read-csv (slurp (io/resource "open-covid-19/output/world.csv")))

    (["Date" "CountryCode" "CountryName" "Confirmed" "Deaths" "Latitude" "Longitude"]
     ["2019-12-31" "AE" "United Arab Emirates" "0" "0" "23.424076" "53.847818"]
     ["2019-12-31" "AF" "Afghanistan" "0" "0" "33.93911" "67.709953"]
     ["2019-12-31" "AM" "Armenia" "0" "0" "40.069099" "45.038189"]
     ["2019-12-31" "AT" "Austria" "0" "0" "47.516231" "14.550072"]
     ["2019-12-31" "AU" "Australia" "0" "0" "-25.274398" "133.775136"]
     ["2019-12-31" "AZ" "Azerbaijan" "0" "0" "40.143105" "47.576927"]
     ["2019-12-31" "BE" "Belgium" "0" "0" "50.503887" "4.469936"]
     ["2019-12-31" "BH" "Bahrain" "0" "0" "25.930414" "50.637772"]
     ["2019-12-31" "BR" "Brazil" "0" "0" "-14.235004" "-51.92528"]
     ["2019-12-31" "BY" "Belarus" "0" "0" "53.709807" "27.953389"]
     ["2019-12-31" "CA" "Canada" "0" "0" "56.130366" "-106.346771"]
     ["2019-12-31" "CH" "Switzerland" "0" "0" "46.818188" "8.227512"]
     ["2019-12-31" "CN" "China" "27" "0" "35.86166" "104.195397"]
     ["2019-12-31" "CZ" "Czech Republic" "0" "0" "49.817492" "15.472962"]
     ["2019-12-31" "DE" "Germany" "0" "0" "51.165691" "10.451526"]
     ...)

As you can see, this is a sequence of vectors such as ["2019-12-31" "AT" "Austria" "0" "0" "47.516231" "14.550072"]. This is a sign that we successfully loaded the data.
This is the data from the world.csv file; there are a few more similar datasets: usa.csv, china.csv. We can create a convenience function for loading these files.

    (defn read-open-covid [csv-name]
      (csv/read-csv (slurp (io/resource (format "open-covid-19/output/%s.csv" csv-name)))))

Now we can use it to load the world data and stash it into a global var (normally a bad, bad programming practice, but acceptable when we are only playing in the REPL, notebook-style). BTW, I run this code in Emacs+CIDER, and automatically generate this post from org-mode. If you copy and paste the code, it should run in any Clojure REPL setup.

    (def world-data (read-open-covid "world"))

Feeling the basic structure

Now, the most basic info I can get is: "What variables does this data set have?" CSV files typically list that in the first line, and we access it as the first element in our sequence.

    (first world-data)
    ;; => ["Date" "CountryCode" "CountryName" "Confirmed" "Deaths" "Latitude" "Longitude"]

To see what example data looks like, let's take the second row.

    (second world-data)
    ;; => ["2019-12-31" "AE" "United Arab Emirates" "0" "0" "23.424076" "53.847818"]

So, the date is in the YYYY-MM-DD format, which is convenient for sorting. There is hope that Clojure can handle the comparison and sorting of these strings as-is, without conversion to proper date objects (spoiler: it can). Next is the country code of the observation, which is obviously a useful identifier. CountryName is redundant, but can be a fine time saver for all of us who do not remember all country codes. Next are the official number of confirmed cases of infection by Covid-19 and the official death toll. Latitude and longitude refer to the position of the country, and are included because this data set is used as a source for the visualization of the pandemic on the interactive world map that you can access here. Unsurprisingly, on New Year's Eve there were no (discovered!) cases of infection in the UAE.
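Since the dates are ISO-8601 strings, lexicographic order coincides with chronological order, so Clojure's default comparator sorts them correctly. A quick sanity check (my own throwaway example, not part of the dataset):

```clojure
;; ISO-8601 date strings sort chronologically with the default
;; string comparator, so no date parsing is needed for ordering.
(sort ["2020-03-11" "2019-12-31" "2020-01-02"])
;; => ("2019-12-31" "2020-01-02" "2020-03-11")
```

This works because every field in YYYY-MM-DD is zero-padded and ordered from most to least significant, so character-by-character comparison agrees with date comparison.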

How many observations do we have

The answer to this question is so easy to get that I'm out of inspiration for this paragraph.

    (count world-data)
    ;; => 5322

We have a little more than 5000 observations. How many countries do we have this data for? To answer this question, we should access the country code of each observation and then see how many distinct codes there are. It may require more fiddling in some other programming languages, but in Clojure it's the bee's knees.

    (count (distinct (map second world-data)))
    ;; => 143

So, is our data complete?

    (rem (count world-data) (count (distinct (map second world-data))))
    ;; => 31

Apparently not, since there is a remainder in this division. Some dates are certainly missing for some countries. This means that we can't blindly treat the data for all countries uniformly; whatever analysis we plan to do, we will have to do something about that.
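One way to act on this is to find out exactly which countries have incomplete series. Here is a small sketch (a hypothetical helper of mine, not code from the original post) that, given [country-code count] pairs such as those produced by frequencies, lists the countries with fewer observations than expected:

```clojure
;; Hypothetical helper: list country codes whose number of
;; observations falls short of the expected number of dates.
(defn incomplete-countries [code-counts expected]
  (map first (remove #(= expected (second %)) code-counts)))

(incomplete-countries [["IT" 79] ["RS" 8] ["CN" 79]] 79)
;; => ("RS")
```

With such a list in hand, we could decide per country whether to pad the missing dates with zeros or simply exclude the country from an analysis.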

How many observations are missing

First, let's see how many distinct dates there are. Today is the 18th of March 2020, and I could count that by hand on the calendar, but the point here is to do it in code.

    (count (distinct (map first world-data)))
    ;; => 79

Since there are 143 countries and 79 dates, ideally there would be this many observations:

    (* (count (distinct (map first world-data)))
       (count (distinct (map second world-data))))
    ;; => 11297

Which means that we are missing half the data. But that's not all. How many observations of the Confirmed variable are 0?

    (count (filter zero? (map #(nth % 3) world-data)))

    Execution error (ClassCastException) at dragan.rocks.covid-19.world/eval14376 (form-init3222897912165483565.clj:1).
    java.lang.String cannot be cast to java.lang.Number

We get an exception, since "0" and "4" are not numbers, but strings of characters. Let's convert these columns to proper types (dropping the header row, which is not an observation):

    (def world-data2
      (map (fn [[d cc cn conf death]]
             [d cc cn (Long/parseLong conf) (Long/parseLong death)])
           (rest world-data)))

    (count (filter zero? (map #(nth % 3) world-data2)))
    ;; => 2976

In roughly half of the observations there were no confirmed cases. But not all zeros are equal. Some zeros are there because the pandemic hadn't reached a country by that particular date. Other zeros might be there because no new cases were discovered in a country that already had cases. But even that does not mean there were no new cases. In my country, Serbia, on some dates no tests were done (or, perhaps, they were done but the results haven't been published, who knows). The point is that this data is so early that it is very scattered and very rough. Anyway, let's see how much data is recorded at all per day (0 or otherwise).
    (def date-freqs
      (sort-by first (frequencies (map first world-data2))))

    (["2019-12-31" 66] ["2020-01-01" 66] ["2020-01-02" 66] ["2020-01-03" 66]
     ["2020-01-04" 66] ["2020-01-05" 66] ["2020-01-06" 66] ["2020-01-07" 66]
     ["2020-01-08" 66] ["2020-01-09" 66] ["2020-01-10" 66] ["2020-01-11" 66]
     ["2020-01-12" 66] ["2020-01-13" 66] ["2020-01-14" 66] ["2020-01-15" 66]
     ...)

At the beginning, data is available for (probably) the same 66 countries. Let's discover (by code) the first date with a different number of observations. It turns out that this run of 66 lasts right until two weeks ago. And then? The dates after the 3rd of March first see fewer observations, and then, starting with March 11th, the number of observations suddenly jumps. My hunch is that at first, most countries just submitted the default 0 to whoever collected this data (the World Health Organization, I suppose?), simply ignoring the problem. Then, as they started to realize the immediate danger, they became reluctant to send invented data (or the WHO stopped collecting the default zeros?), and then, around March 15th, the data became more complete. My guess is that the global pandemic was officially declared sometime before that. Since this is in the past, I can simply check on the Internet (…typing away in the browser…): the pandemic was declared on March 11th, 2020.
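To actually find that first irregular date by code, we can walk the sorted frequencies and drop pairs while the count is still 66. This is a small helper of my own (not from the original post), assuming sorted [date count] pairs like date-freqs above:

```clojure
;; Hypothetical helper: return the first date whose observation
;; count differs from the given regular count (66 in our data).
(defn first-irregular-date [date-counts regular]
  (ffirst (drop-while #(= regular (second %)) date-counts)))

(first-irregular-date [["2020-03-02" 66] ["2020-03-03" 66] ["2020-03-04" 60]] 66)
;; => "2020-03-04"
```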

How much data do we have for each particular country

Analogously to the frequencies of observations on a particular date, we can count the frequencies related to countries; instead of the first column, we use the second.

    (def country-freqs
      (sort-by first (frequencies (map second world-data2))))

    (["AD" 4] ["AE" 72] ["AF" 68] ["AG" 1] ["AL" 9] ["AM" 69] ["AR" 11] ["AT" 78]
     ["AU" 78] ["AZ" 71] ["BA" 5] ["BD" 3] ["BE" 78] ["BF" 5] ["BG" 8] ["BH" 77]
     ...)

Selecting your country

The human eye quickly gets lost in this bunch of numbers. Let's create a function that selects only the data available for the country, or the set of countries, that we are interested in. For example, for this set of countries: #{"IT" "FR" "ES" "CN"}

    (filter (fn [[_ code]] (#{"IT" "FR" "ES" "CN"} code)) world-data2)

    (["2019-12-31" "CN" "China" 27 0]
     ["2019-12-31" "ES" "Spain" 0 0]
     ["2019-12-31" "FR" "France" 0 0]
     ["2019-12-31" "IT" "Italy" 0 0]
     ["2020-01-01" "CN" "China" 27 0]
     ["2020-01-01" "ES" "Spain" 0 0]
     ["2020-01-01" "FR" "France" 0 0]
     ["2020-01-01" "IT" "Italy" 0 0]
     ["2020-01-02" "CN" "China" 27 0]
     ["2020-01-02" "ES" "Spain" 0 0]
     ["2020-01-02" "FR" "France" 0 0]
     ["2020-01-02" "IT" "Italy" 0 0]
     ["2020-01-03" "CN" "China" 44 0]
     ["2020-01-03" "ES" "Spain" 0 0]
     ["2020-01-03" "FR" "France" 0 0]
     ["2020-01-03" "IT" "Italy" 0 0]
     ...)

We'll write some convenience functions for computing the previously discussed values.

    (defn take-countries [data country-set]
      (filter (fn [[_ code]] (country-set code)) data))

    (defn date-freqs [data]
      (sort-by first (frequencies (map first data))))

    (defn country-freq [data]
      (sort-by first (frequencies (map second data))))

    (def my-countries
      (country-freq (take-countries world-data2 #{"IT" "FR" "ES" "CN" "US" "RS" "DE"})))

Now we can see that most of these countries have pretty complete (if not overly reliable) data, while Serbia only recently started doing tests and reporting some numbers.

    CN 78
    DE 78
    ES 78
    FR 78
    IT 79
    RS 8
    US 78

Draw some plots

Instead of flashy plotting libraries, I'll draw some ASCII art. The first reason is that the data, although coarse, is so obvious that I don't want to create a false impression that you'll learn anything here that you haven't already seen in the news and on the Internet. The second is: we are programmers, and we can present data in any silly way we please! I selected a pretty basic Java ASCII plotting library after a quick search on GitHub. Great thanks to Mitch Talmadge for ASCII-Data :)

    (import 'com.mitchtalmadge.asciidata.graph.ASCIIGraph)

First, I'll take the number of confirmed cases for Serbia, and remove whatever zeros there are before the first case (we are not interested in plotting a flat line).

    (drop-while zero? (map #(nth % 3) (take-countries world-data2 #{"RS"})))
    ;; => (1 5 18 24 41 46 55 57)

    (def rs-data
      (drop-while zero? (map #(nth % 3) (take-countries world-data2 #{"RS"}))))

Let's plot this.

    (println (.plot (ASCIIGraph/fromSeries (double-array rs-data))))

    57.00 ┤      ╭
    56.00 ┤      │
    55.00 ┤     ╭╯
    54.00 ┤     │
    53.00 ┤     │
    52.00 ┤     │
    51.00 ┤     │
    50.00 ┤     │
    49.00 ┤     │
    48.00 ┤     │
    47.00 ┤     │
    46.00 ┤    ╭╯
    45.00 ┤    │
    44.00 ┤    │
    43.00 ┤    │
    42.00 ┤    │
    41.00 ┤   ╭╯
    40.00 ┤   │
    39.00 ┤   │
    38.00 ┤   │
    37.00 ┤   │
    36.00 ┤   │
    35.00 ┤   │
    34.00 ┤   │
    33.00 ┤   │
    32.00 ┤   │
    31.00 ┤   │
    30.00 ┤   │
    29.00 ┤   │
    28.00 ┤   │
    27.00 ┤   │
    26.00 ┤   │
    25.00 ┤   │
    24.00 ┤  ╭╯
    23.00 ┤  │
    22.00 ┤  │
    21.00 ┤  │
    20.00 ┤  │
    19.00 ┤  │
    18.00 ┤ ╭╯
    17.00 ┤ │
    16.00 ┤ │
    15.00 ┤ │
    14.00 ┤ │
    13.00 ┤ │
    12.00 ┤ │
    11.00 ┤ │
    10.00 ┤ │
     9.00 ┤ │
     8.00 ┤ │
     7.00 ┤ │
     6.00 ┤ │
     5.00 ┤╭╯
     4.00 ┤│
     3.00 ┤│
     2.00 ┤│
     1.00 ┼╯

Whoaaa. Although the numbers looked pretty tame, the graph shoots up into the sky. This is because the growth is exponential. Since the exponential function grows really fast, the lower numbers quickly become minuscule. However, we are not interested in absolute numbers, but in growth.
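Why the logarithm helps, in one toy example (mine, not from the dataset): a series that doubles every day has evenly spaced logarithms, so exponential growth shows up as a straight line on the log scale, and any slowdown shows up as the line bending down.

```clojure
;; A series that doubles every day...
(def doubling [1 2 4 8 16 32])

;; ...has logarithms that increase by a constant step, ln 2 ≈ 0.6931:
(let [logs (map #(Math/log %) doubling)]
  (map - (rest logs) logs))
```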
Therefore, it is more appropriate to take the logarithm of this function, and see whether the logarithm starts to drop off, if only a tiny bit. We need a convenience log function. I could have imported one from Neanderthal, but a fast CPU and GPU library is clearly overkill for such a task. Hopefully there will soon be an abundance of data, and we'll be able to put those nuclear options to use. For now, let's use sticks and stones.

    (defn log ^double [^double x]
      (Math/log x))

Now, plot the logarithm of the function of interest.

    (println (.plot (ASCIIGraph/fromSeries (double-array (map log rs-data)))))

    4.04 ┤   ╭───
    3.03 ┤ ╭─╯
    2.02 ┤╭╯
    1.01 ┤│
    0.00 ┼╯

It grows quickly, and it is only the beginning.

Italy is overwhelmed

Now, let's see how Italy is holding up. For a few weeks we've been hearing really bad news.

    (defn extract-data [country-code]
      (drop-while zero?
                  (map #(nth % 3) (take-countries world-data2 #{country-code}))))

    (take 5 (reverse (map log (extract-data "IT"))))
    ;; => (10.357933282865915 10.239245248219472 10.12414802355653 9.959726098983317 9.77905747415795)

Since we'll be drawing log plots repeatedly, let's wrap that into a function, too.

    (defn log-plot [series]
      (.plot (ASCIIGraph/fromSeries (double-array (map log series)))))

    (println (log-plot (extract-data "IT")))

    10.36 ┤                                           ╭───
     9.39 ┤                                      ╭────╯
     8.42 ┤                                 ╭────╯
     7.46 ┤                             ╭───╯
     6.49 ┤                          ╭──╯
     5.53 ┤                       ╭──╯
     4.56 ┤                      ╭╯
     3.59 ┤                      │
     2.63 ┤                     ╭╯
     1.66 ┤                     │
     0.69 ┼─────────────────────╯

It hasn't started to slow down yet, although it looks like it is about to.

China is slowing down

China has already won this battle and, I hope, the war itself.

    (take 5 (reverse (map log (extract-data "CN"))))
    ;; => (11.303808085389111 11.302451316756681 11.302142703354239 11.301871044753339 11.301636371103024)

See how slowly the numbers are rising on the log scale. The absolute numbers are still bad, but each day they are less bad.

    (println (log-plot (extract-data "CN")))

    11.30 ┤                                           ╭─────────────────────────────────
    10.30 ┤                                  ╭────────╯
     9.30 ┤                             ╭────╯
     8.30 ┤                          ╭──╯
     7.30 ┤                        ╭─╯
     6.30 ┤                    ╭───╯
     5.30 ┤                  ╭─╯
     4.30 ┤    ╭─────────────╯
     3.30 ┼────╯

We can calculate each change explicitly, and see this directly.

    (defn absolute-plot [series-data]
      (.plot (ASCIIGraph/fromSeries (double-array series-data))))

    (println (absolute-plot
              (map #(/ % 1000)
                   (reduce (fn [acc x] (conj acc (- x (peek acc))))
                           [0]
                           (extract-data "CN")))))

    [ASCII plot: the computed values, in thousands, climb steadily from 0.00 and then oscillate in a jagged band between roughly 33 and 47 on the right-hand side of the chart; the y-axis ticks run from 0.00 up to 46.69]
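A note on the reduce above: (peek acc) returns the last element already conjed onto acc, which is the previous difference, not the previous cumulative count, so what gets plotted is each value minus the previous change rather than the plain day-over-day change (hence the oscillation). If day-over-day changes are what we want, a simpler variant (my own sketch, not from the original) is to subtract the series from itself shifted by one day:

```clojure
;; Daily changes as pairwise differences: subtract each cumulative
;; value from the next one, padding the front with an initial 0.
(defn daily-changes [series]
  (map - series (cons 0 series)))

(daily-changes [27 27 44 60])
;; => (27 0 17 16)
```

This keeps the whole computation a single lazy pass, and plugging its result into absolute-plot would show the epidemic curve of new cases per day.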