Why amateur COVID-19 predictions are worthless — featuring ARIMA

Check out the latest COVID-19 predictions, and why I call them worthless. If you are not a public health expert, this post is for you. And if you happen to be an expert, you’ll find some really interesting results here. All with the ease of BigQuery and SQL.

tl;dr: ARIMA is widely used by domain experts to predict time series across many domains. With BigQuery you can easily find public datasets and ready-to-use ML models (including ARIMA, in alpha today). Amateurs beware, especially with topics this critical: #vizresponsibly.

Let me start by establishing that any sideline predictions about the current crisis are worthless. I’m not a public health expert, and — statistically speaking — neither are you. So instead of trying to predict anything, I’ll jump back to the past, and attempt to predict what has already happened — in the last 6 days.

You should know that there are great examples of the usefulness of ARIMA to predict time series. For example, it was used successfully to predict the number of beds needed during the 2003 SARS outbreak in Singapore. You can also check this quick primer on how to use ARIMA to predict public bike usage.

My friend Lak Lakshmanan wrote a post about analyzing COVID-19 with BigQuery. It includes code to easily perform ARIMA predictions with SQL (currently in alpha, see Lak’s notes). Let’s check what happens when we augment his code to compare the ARIMA predictions vs the actual numbers for the last 6 days.
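As a sketch of what that looks like (dataset, model, and table names here are placeholders, and this is the alpha syntax, so it may change):

```sql
-- Train an ARIMA model on a daily confirmed-cases time series,
-- holding out the most recent 6 days to compare against reality.
CREATE OR REPLACE MODEL mydataset.arima_japan
OPTIONS(
  model_type='ARIMA',
  time_series_timestamp_col='date',
  time_series_data_col='confirmed'
) AS
SELECT date, SUM(confirmed) AS confirmed
FROM `bigquery-public-data.covid19_jhu_csse.summary`
WHERE country_region = 'Japan'
  AND date < DATE_SUB(CURRENT_DATE(), INTERVAL 6 DAY)
GROUP BY date;

-- Forecast the 6 held-out days with a 90% confidence level.
SELECT *
FROM ML.FORECAST(MODEL mydataset.arima_japan,
                 STRUCT(6 AS horizon, 0.9 AS confidence_level));
```

Joining those forecasts back to the reported numbers for the held-out days gives the comparisons below.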

ARIMA predictions vs reported numbers, Japan

Are these good predictions? Nah…

The predictions quickly drift to more than a 20% underestimation.

None of the actual numbers fell within the 90% confidence interval.

Check the query.

ARIMA predictions vs reported numbers, USA

Are these good predictions? Nah…

The predictions quickly drift to more than a 30% overestimation.

However, the actual numbers did fall within the 90% confidence interval. That might sound good, but…

Check how large those confidence intervals are: on day 6 the model estimates “somewhere between 80k and 700k”. How can anyone use this?

The confidence intervals grew so large because this model follows Lak’s suggestion: “for exponentially growing timeseries, use the LOG() of numbers before applying ARIMA”. We can review later why this is a good idea.

Check the query.
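Following that suggestion looks roughly like this (table, column, and model names are illustrative). Training happens in log space, so the forecast bounds have to be exponentiated back, which is exactly what turns symmetric log-space intervals into huge, asymmetric ranges on the original scale:

```sql
-- Train on the LOG() of the counts: exponential growth becomes ~linear.
CREATE OR REPLACE MODEL mydataset.arima_log
OPTIONS(
  model_type='ARIMA',
  time_series_timestamp_col='date',
  time_series_data_col='log_confirmed'
) AS
SELECT date, LOG(confirmed) AS log_confirmed
FROM mydataset.daily_confirmed
WHERE confirmed > 0;  -- LOG() is undefined at zero

-- Undo the transform: EXP() of the log-space bounds yields
-- asymmetric, fast-widening intervals on the original scale.
SELECT
  forecast_timestamp,
  EXP(forecast_value) AS forecast,
  EXP(confidence_interval_lower_bound) AS lower_bound,
  EXP(confidence_interval_upper_bound) AS upper_bound
FROM ML.FORECAST(MODEL mydataset.arima_log,
                 STRUCT(6 AS horizon, 0.9 AS confidence_level));
```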

Why your predictions are dangerous

If you search on Twitter for “ARIMA COVID”, you’ll find plenty of people armed with a spreadsheet who now think they are epidemiology experts. They are not. Their charts are bad, and their analysis is bad. Don’t be like them. Don’t retweet them either.

Why are bad predictions dangerous? Plenty of reasons, for example:

You might want to give hope to your friends, so you start telling them “don’t worry, my charts show that the crisis will be over before Easter”. If they believe you, and you are wrong, then your friends could end up badly underprepared if the crisis lasts longer.

You might want to tell your friends that a certain medicine appears to be really effective against this virus. But you have no idea whether it is. If your claims spread, that medicine might go out of stock, leaving the people who actually need it without it, for no good reason. This has already happened.

Before creating another chart read Amanda Makulec’s “Ten Considerations Before You Create Another Chart About COVID-19”.

What to do instead

If you really want to help, team up and offer your expertise to groups that can use it. For example, check out covidactnow.org. This team has built models that track the current crisis and possible outcomes. If you are a data scientist, they would love to get your help to keep these models working and updated.

The best part: You’ll be working with an awesome team, whose work has been validated and endorsed by a number of experts in epidemiology, public health, and medicine.

What’s good, awesome, and useful about ARIMA in BigQuery

Now that we’ve established the perils of amateurs spreading their predictions online, let me tell you why I find ARIMA in BigQuery so awesome:

Easy ARIMA and easy access to data

As you can see in my queries above, it was really easy to create an ARIMA model and get predictions out of a time series. And it all ran in less than 30 seconds. This is great. I used this power to show how wrong these predictions can be.

To make these tasks even faster, we announced a replica of the Johns Hopkins University published numbers in BigQuery. Having that table publicly available gave me instant access to the data I needed to replicate these predictions.
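For instance, a query along these lines (assuming the `summary` table layout of the public replica) pulls a per-country daily series with no setup at all:

```sql
-- Daily confirmed cases for one country from the public JHU replica.
SELECT
  date,
  SUM(confirmed) AS confirmed
FROM `bigquery-public-data.covid19_jhu_csse.summary`
WHERE country_region = 'US'
GROUP BY date
ORDER BY date;
```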

Domain experts like ARIMA

I found many interesting papers applying ARIMA to epidemiological forecasting.

For example, some literature I found dissing ARIMA:

BMC Bioinformatics, 2014: Comparison of ARIMA and Random Forest time series models for prediction of avian influenza H5N1 outbreaks

These analyses indicate that the Random Forest model has advantages over the ARIMA approach to time series modeling of avian influenza outbreaks in poultry in Egypt. At the same time, it is clear that both retrospective models have deficiencies when trying to fit the time series of outbreaks. For example, the ARIMA model provides some estimates that are actually less than zero, which is impossible given the nature of outbreaks. Furthermore, there are times, like the end of 2008, where the model is consistently biased with respect to the signal. The Random Forest model is also consistently biased at the end of 2008, as well as the middle of 2012. However, it performs an order of magnitude better than ARIMA in terms of mean square error.

BMC Infectious Diseases, 2017: A framework for evaluating epidemic forecasts

It means the performance of ARIMA is completely behind all other methods. Figure 20 depicts the one-step-ahead predicted curve of the ARIMA method compared to the observed data that shows the ARIMA output has large deviations from the real observed curve and confirms the correctness of the clustering approach.

Online Journal of Public Health Informatics, 2014: Evaluating a Seasonal ARIMA Model for Event Detection in New York City

An ARIMA model is not an ideal model for prospectively detecting outbreaks in syndromic data, due to frequent monitoring and adjustment of model parameters. Furthermore, by using autoregressive and moving average parameters, the model may have over-fit the data, causing outbreaks to go undetected. ARIMA models have some limitations. Model parameters depend highly on data trends and characteristics, making geographic stratification difficult. Alternative approaches that require less frequent refitting may be easier for health departments to implement and perform better for outbreak detection.

If you search for the opposite, you’ll also find plenty of literature that has put ARIMA models to good use (especially its seasonally-aware version, SARIMA). For example:

BMC Health Services Research, 2004: Using autoregressive integrated moving average (ARIMA) models to predict and monitor the number of beds occupied during a SARS outbreak in a tertiary hospital in Singapore

The ARIMA model that we developed for modeling the number of beds occupied during the SARS outbreak performed reasonably well, with a MAPE of 5.7% for the training set, and 8.6% for the validation set. In addition, we found that three-day forecasts provided a reasonable prediction of the number of beds required during the outbreak.

This work demonstrates the epidemiology of different types of influenza viruses among children in Wuhan, China. Our study suggests that the ARIMA model can be used to forecast the positive rate of different types of influenza virus.

You can even find fresh papers utilizing ARIMA for the current crisis:

Elsevier Data in Brief, 2020: Application of the ARIMA model on the COVID2019 epidemic dataset

It only uses data until Feb 10, so the paper is already due for a refresh. The authors have not replied to my emails, or my tweets. (I don’t blame them for not replying; surely they have more important tasks today than answering my questions.)

This article is a preprint and has not been peer-reviewed. It reports new medical research that has yet to be evaluated and so should not be used to guide clinical practice.

Accelerating experts

One of the top resources domain experts have today is time. They are racing against the virus, and the faster they can get their results, the quicker we’ll all find a solution.

Google and the BigQuery team want to help accelerate their results. And if experts want to test how useful ARIMA is for their problem, they can use BigQuery to make that process really fast.

In the hands of an expert, the results of a model can be extremely valuable, even if to you the numbers look really wrong. Surprising results can drive experts to ask “why”, and find interesting insights from that starting point. Likewise, in the hands of a non-expert, a result that looks good can be totally misleading.

Stupidly good results, stupidly large confidence intervals