Among the many R packages, there is the outbreaks package. It contains datasets on epidemics, one of which is from the 2013 outbreak of influenza A H7N9 in China, as analysed by Kucharski et al. (2014). I will be using their data as an example to test whether we can use Machine Learning algorithms for predicting disease outcome.

To do so, I selected and extracted features from the raw data, including age, days between onset and outcome, gender, whether the patients were hospitalised, etc. Missing values were imputed and different model algorithms were used to predict outcome (death or recovery); the models were compared by prediction accuracy, sensitivity and specificity. The prepared dataset was divided into training and testing subsets. The test subset contained all cases with an unknown outcome. Before I applied the models to the test data, I further split the training data into validation subsets.
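The split described above can be sketched in base R. This is only an illustration on a made-up toy data frame: the values, the 70/30 ratio, the seed and the object names (`cases`, `train_subset`, `val_subset`) are assumptions, not taken from the analysis itself.

```r
# Toy data frame: column names mirror the dataset, values are illustrative only
cases <- data.frame(
  case.ID = paste0("case_", 1:10),
  age     = c(87, 27, 35, 45, 48, 32, 83, 38, 67, 48),
  outcome = c("Death", "Death", "Death", NA, "Recover",
              "Death", "Recover", NA, "Recover", NA),
  stringsAsFactors = FALSE
)

# Test set: every case whose outcome is unknown
test_data  <- cases[is.na(cases$outcome), ]
# Training data: every case with a known outcome
train_data <- cases[!is.na(cases$outcome), ]

# Hold out part of the labelled cases for validation (assumed 70/30 split)
set.seed(27)
n_train <- nrow(train_data)
val_idx <- sample(seq_len(n_train), size = round(0.3 * n_train))
val_subset   <- train_data[val_idx, ]
train_subset <- train_data[-val_idx, ]
```

Because the test cases have no known outcome, model performance can only be measured on the validation subset.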

The tested modeling algorithms were similarly successful at predicting the outcomes of the validation data. To decide on final classifications, I compared predictions from all models and defined the outcome “Death” or “Recovery” as a function of all models, while classifications with a low prediction probability were flagged as “uncertain”. Accounting for this uncertainty led to a 100% correct classification of the validation set.
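One possible way to combine several models and flag low-confidence calls is sketched below. The combining rule used here (averaging the predicted probability of death across models), the 0.7 threshold, the probabilities and the function name `classify_ensemble` are all assumptions made for illustration; the rule used in the analysis may differ.

```r
# Hypothetical predicted probabilities of "Death" from three models
# for five validation cases (made-up numbers)
pred <- data.frame(
  model_a = c(0.91, 0.12, 0.55, 0.80, 0.30),
  model_b = c(0.85, 0.20, 0.48, 0.75, 0.45),
  model_c = c(0.95, 0.05, 0.60, 0.90, 0.35)
)

# Average the models and only commit to a class when the mean
# probability is far enough from 0.5 (assumed threshold: 0.7)
classify_ensemble <- function(prob_death, threshold = 0.7) {
  mean_prob <- rowMeans(prob_death)
  ifelse(mean_prob >= threshold, "Death",
         ifelse(mean_prob <= 1 - threshold, "Recovery", "uncertain"))
}

final_call <- classify_ensemble(pred)
print(final_call)
```

Raising the threshold trades coverage for reliability: fewer cases receive a definite label, but those that do are less likely to be wrong.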

The cases with unknown outcome were then classified with the same algorithms. Of the 57 unknown cases, 14 were classified as “Recovery”, 10 as “Death” and 33 as uncertain.

The data

The dataset contains case ID, date of onset, date of hospitalisation, date of outcome, gender, age, province and of course the outcome: Death or Recovery. I can already see that there are a couple of missing values in the data, which I will deal with later.

    # install and load package
    if (!require("outbreaks")) install.packages("outbreaks")
    library(outbreaks)
    fluH7N9.china.2013_backup <- fluH7N9.china.2013
    # ...
    head(fluH7N9.china.2013)

    ##   case.ID date.of.onset date.of.hospitalisation date.of.outcome outcome gender age province
    ## 1  case_1    2013-02-19                    <NA>      2013-03-04   Death      m  87 Shanghai
    ## 2  case_2    2013-02-27              2013-03-03      2013-03-10   Death      m  27 Shanghai
    ## 3  case_3    2013-03-09              2013-03-19      2013-04-09   Death      f  35    Anhui
    ## 4  case_4    2013-03-19              2013-03-27            <NA>    <NA>      f  45  Jiangsu
    ## 5  case_5    2013-03-19              2013-03-30      2013-05-15 Recover      f  48  Jiangsu
    ## 6  case_6    2013-03-21              2013-03-28      2013-04-26   Death      f  32  Jiangsu
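The missing values mentioned above can be counted per column before deciding how to handle them. A minimal sketch on a toy data frame (the column names follow the dataset, the values are made up):

```r
# Toy data frame: column names follow the dataset, values are illustrative only
cases <- data.frame(
  date.of.onset   = as.Date(c("2013-02-27", "2013-03-19", NA)),
  date.of.outcome = as.Date(c("2013-03-10", NA, NA)),
  outcome         = c("Death", NA, "Recover"),
  age             = c(27, 45, NA),
  stringsAsFactors = FALSE
)

# Number of missing values per column
na_counts <- colSums(is.na(cases))
print(na_counts)

# Share of rows without any missing value
complete_share <- mean(complete.cases(cases))
print(complete_share)
```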

Before I start preparing the data for Machine Learning, I want to get an idea of the distribution of the data points and their different variables by plotting. Most provinces have only a handful of cases, so I am combining them into the category “other” and keeping only Jiangsu, Shanghai and Zhejiang as separate provinces.
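Collapsing the rarely occurring provinces into one category could look roughly like this. The vector below is a toy example and the object names (`keep`, `province_grouped`) are assumptions for illustration.

```r
# Toy province vector (illustrative, not the full dataset)
province <- c("Shanghai", "Anhui", "Jiangsu", "Zhejiang", "Beijing", "Henan")

# Provinces that stay as separate categories
keep <- c("Jiangsu", "Shanghai", "Zhejiang")

# Everything else is combined into "Other"
province_grouped <- ifelse(province %in% keep, province, "Other")
province_grouped <- factor(province_grouped, levels = c(keep, "Other"))

table(province_grouped)
```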

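    # gather for plotting with ggplot2
    library(tidyr)
    fluH7N9.china.2013_gather <- fluH7N9.china.2013 %>%
      gather(Group, Date, date.of.onset:date.of.outcome)

    # rearrange group order
    fluH7N9.china.2013_gather$Group <- factor(
      fluH7N9.china.2013_gather$Group,
      levels = c("date.of.onset", "date.of.hospitalisation", "date.of.outcome")
    )

Gives this plot: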

This plot shows the dates of onset, hospitalisation and outcome (if known) of each data point. Outcome is marked by color and age is shown on the y-axis. Gender is marked by point shape. The density distribution of date by age for the cases seems to indicate that older people died more frequently in the Jiangsu and Zhejiang provinces than in Shanghai and in other provinces. The distribution of points along the time axis suggests that there might be a positive correlation between the likelihood of death and an early onset or early outcome.

I also want to know how many cases there are for each gender and province and compare the genders’ age distribution.

    fluH7N9.china.2013_gather_2 <- ... %>%
      gather(group_2, value, gender:province)
    fluH7N9.china.2013_gather_2$value <- ...

Gives this plot:

In the dataset, there are more male than female cases and, correspondingly, we see more deaths, recoveries and unknown outcomes in men than in women. This is potentially a problem later on for modeling because the inherent likelihoods for outcome are not directly comparable between the sexes. Most unknown outcomes were recorded in Zhejiang. As with gender, the data points are not equally distributed across provinces either. The age distribution shows that people who died tended to be slightly older than those who recovered. The density curve of unknown outcomes is more similar to that of death than of recovery, suggesting that among these people there might have been more deaths than recoveries.

And lastly, I want to plot how many days passed between onset, hospitalisation and outcome for each case.

    library(ggplot2)

    # one line per case, connecting its onset, hospitalisation and outcome dates
    ggplot(data = fluH7N9.china.2013_gather, aes(x = Date, y = age, color = outcome)) +
      geom_point(aes(shape = gender), size = 1.5, alpha = 0.6) +
      geom_path(aes(group = case.ID)) +
      facet_wrap( ~ province, ncol = 2) +
      my_theme() +
      scale_shape_manual(values = c(15, 16, 17)) +
      scale_color_brewer(palette = "Set1", na.value = "grey50") +
      scale_fill_brewer(palette = "Set1") +
      labs(
        color = "Outcome",
        shape = "Gender",
        x = "Date in 2013",
        y = "Age",
        title = "2013 Influenza A H7N9 cases in China",
        subtitle = "Dataset from 'outbreaks' package (Kucharski et al. 2014)",
        caption = "\nTime from onset of flu to outcome."
      )

Gives this plot, which shows that there are many missing values in the dates, so it is hard to draw a general conclusion.

Features

In Machine Learning-speak, features are the variables used for model training. Using the right features dramatically influences the accuracy of the model.

Because we don’t have many features, I am keeping age as it is, but I am also generating new features:

from the date information I am calculating the days between onset and outcome and between onset and hospitalisation

I am converting gender into numeric values with 1 for female and 0 for male

similarly, I am converting provinces to binary classifiers (yes == 1, no == 0) for Shanghai, Zhejiang, Jiangsu and other provinces

the same binary classification is given for whether a case was hospitalised, and whether they had an early onset or early outcome (earlier than the median date)
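The feature definitions listed above can be sketched in base R as follows. The toy cases, the feature column names (`days_onset_to_outcome`, `gender_f`, and so on) and the use of the median onset date of the toy data are assumptions for illustration, not the code used in the analysis.

```r
# Toy cases: column names follow the dataset, values are illustrative only
cases <- data.frame(
  date.of.onset           = as.Date(c("2013-02-27", "2013-03-09", "2013-03-19", "2013-04-01")),
  date.of.hospitalisation = as.Date(c("2013-03-03", "2013-03-19", NA, "2013-04-05")),
  date.of.outcome         = as.Date(c("2013-03-10", "2013-04-09", "2013-05-15", NA)),
  gender   = c("m", "f", "f", "m"),
  province = c("Shanghai", "Anhui", "Jiangsu", "Zhejiang"),
  stringsAsFactors = FALSE
)

# Onset dates as numbers (days since 1970-01-01) for the median comparison
onset_num <- as.numeric(cases$date.of.onset)

features <- data.frame(
  # day differences derived from the date columns
  days_onset_to_outcome  = as.numeric(cases$date.of.outcome - cases$date.of.onset),
  days_onset_to_hospital = as.numeric(cases$date.of.hospitalisation - cases$date.of.onset),
  # gender: 1 for female, 0 for male
  gender_f = as.integer(cases$gender == "f"),
  # hospitalised: 1 if a hospitalisation date is recorded
  hospital = as.integer(!is.na(cases$date.of.hospitalisation)),
  # one binary column per province group
  province_Shanghai = as.integer(cases$province == "Shanghai"),
  province_Zhejiang = as.integer(cases$province == "Zhejiang"),
  province_Jiangsu  = as.integer(cases$province == "Jiangsu"),
  province_other    = as.integer(!cases$province %in% c("Shanghai", "Zhejiang", "Jiangsu")),
  # early onset: 1 if onset is earlier than the median onset date
  early_onset = as.integer(onset_num < median(onset_num, na.rm = TRUE))
)

print(features)
```

Cases without an outcome or hospitalisation date get `NA` for the corresponding day difference, which is why imputation is needed before modeling.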