Disease Surveillance with Big Data: 2020 Coronavirus Pneumonia

hellocstar · Jan 28

Introduction

As of 27 January 2020, Chinese health authorities had confirmed 2,744 cases of Wuhan coronavirus pneumonia and 80 deaths. Clusters of cases have also been reported around the globe, including in Hong Kong, Japan, the United States and France. However, as early as 31 December, a Canadian health monitoring platform had already signalled the outbreak using big data. In this globally connected world, how does disease surveillance with big data warn us of outbreaks, or even help prevent them, before they spread worldwide?

What is disease surveillance?

Disease surveillance is an information-based activity which involves the collection, analysis and interpretation of large volumes of data originating from a variety of sources (Health Protection Surveillance Centre, 2019). The goal of disease surveillance is to identify risk patterns of infectious diseases using novel data analytics and machine learning techniques, in order to inform better health care decision making.

The disease surveillance platform mentioned above, which flagged the current outbreak, is called BlueDot. It uses AI-driven algorithms to monitor various data sources and track infectious diseases globally (BlueDot, n.d.). In what follows, BlueDot serves as an example of how big data is used to monitor infectious diseases that pose a threat in our interconnected world.

How is data collected and processed?

BlueDot uses its automated surveillance tools to filter through a variety of data, including airline data, news reports, animal and plant disease networks, and official proclamations. To predict the locations the pneumonia outbreak might reach, BlueDot evaluated a large volume of travel data from the International Air Transport Association (IATA) (Journal of Travel Medicine, 2020). They quantified passenger volumes originating from the international airport in Wuhan from January through March. Because IATA is a reputable trade association of the world’s airlines, its data illustrates the veracity of the data used. The data covers approximately 90% of passenger travel itineraries on commercial flights, so it also provides the volume needed to support analysis at a later stage of the surveillance.
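The passenger-volume step can be pictured with a small sketch. It assumes itinerary records shaped as (origin airport, final destination, passengers); the record layout, the function name `passenger_volumes` and all figures are hypothetical and are not taken from IATA or BlueDot.

```python
from collections import Counter

# Hypothetical itinerary records: (origin airport code, final destination, passengers).
# The figures are invented for illustration and are not IATA data.
itineraries = [
    ("WUH", "Bangkok", 120),
    ("WUH", "Tokyo", 80),
    ("WUH", "Bangkok", 60),
    ("PEK", "Tokyo", 200),
    ("WUH", "Hong Kong", 90),
]


def passenger_volumes(records, origin):
    """Sum passengers per final destination for itineraries that start at `origin`."""
    totals = Counter()
    for start, destination, passengers in records:
        if start == origin:
            totals[destination] += passengers
    return totals


volumes = passenger_volumes(itineraries, "WUH")
# Destinations ordered by passenger volume, highest first.
ranked = volumes.most_common()
```

Ranking destinations by volume in this way gives a first indication of where travellers from the origin airport are most likely to arrive.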

To retrieve valuable data from news reports and reports of animal-related outbreaks, BlueDot uses text mining, the process of extracting high-quality information from text. BlueDot uses natural language processing to sift through news reports and other relevant text sources in 65 languages. The system then applies supervised machine learning to separate disease-relevant information from irrelevant information. Automating the process yields a large volume of near real-time data for immediate analysis, which demonstrates the velocity and volume of big data.
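To make supervised classification of text more concrete, here is a minimal sketch using a naive Bayes classifier with Laplace smoothing. The source does not say which algorithm BlueDot uses, so naive Bayes is a stand-in chosen for brevity, and the training headlines and labels are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled headlines; a real system would train on far more text.
training = [
    ("cluster of pneumonia cases reported in hospital", "relevant"),
    ("officials confirm outbreak of unknown virus", "relevant"),
    ("patients with fever linked to seafood market", "relevant"),
    ("football team wins national championship", "irrelevant"),
    ("stock market closes higher after rally", "irrelevant"),
    ("new phone model released this week", "irrelevant"),
]


def train(examples):
    """Count words per label and documents per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability for `text`."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Start from the log prior of the label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            # Laplace smoothing keeps unseen words from zeroing the probability.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(training)
label = classify("pneumonia outbreak reported", model)
```

The same idea scales to real news feeds by swapping the toy headlines for a large labelled corpus and adding language-specific tokenisation.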

As the data available on the Internet is almost unlimited, it is also important to select data sources wisely. Some companies nowadays mine data from social media such as Twitter and Facebook. By contrast, Kamran Khan, BlueDot’s founder and CEO, said the algorithm doesn’t use social media postings because that data is too messy (Wired, 2020). Careful choice of data sources is crucial to ensure the reliability of the data and to reduce the costs of data collection and processing.

Data analysis and interpretation

After collecting and processing the data, BlueDot analyses and interprets it. Medical and public health experts use advanced data analytics to build solutions that anticipate the risk of infectious disease spread. Statistical analysis, such as time series analysis, is applied to track the diseases. The unstructured data is transformed into Infectious Disease Vulnerability Index (IDVI) scores (Journal of Travel Medicine, 2020). The score is based on metrics from seven domains: demographic, health care, public health, disease dynamics, political (domestic), political (international), and economic. Airline data and data from natural language processing are quantified into these parameters. They are then contextualised as IDVI scores for different countries. Countries are scored from 0 to 1, with higher scores representing greater capacity to respond to outbreaks (Journal of Travel Medicine, 2020).
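The idea of combining seven domains into one score between 0 and 1 can be sketched as follows. The source does not describe how the domains are weighted, so this example assumes each domain is already normalised to the range 0 to 1 and simply takes an unweighted mean. The function name `composite_score` and the sample values are hypothetical, and this is not the published IDVI method.

```python
# The seven domains named in the article.
DOMAINS = (
    "demographic",
    "health care",
    "public health",
    "disease dynamics",
    "political (domestic)",
    "political (international)",
    "economic",
)


def composite_score(domain_scores):
    """Unweighted mean of seven domain scores, each expected in [0, 1].

    Assumption: equal weights. The real index may weight domains differently.
    """
    missing = [d for d in DOMAINS if d not in domain_scores]
    if missing:
        raise ValueError(f"missing domains: {missing}")
    for domain in DOMAINS:
        value = domain_scores[domain]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{domain} score {value} is outside [0, 1]")
    return sum(domain_scores[d] for d in DOMAINS) / len(DOMAINS)


# Hypothetical country with a mid-range score in every domain.
example_country = {domain: 0.5 for domain in DOMAINS}
score = composite_score(example_country)
```

Because every input lies between 0 and 1, the mean does too, which matches the scale described above, where a higher score means greater capacity to respond.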

The scores and other quantifications of data are automatically updated on BlueDot’s dashboards, which are built with business intelligence tools. The dashboards act as a user interface and provide visualisations to customers. The graphical presentation allows customers, who may not have a data analysis background, to understand the results of the analysis easily. A customer who receives information about a disease outbreak in a specific location, for example, can avoid going to that location.

Picture 1: The user interface of BlueDot provides visualisations to its customers and allows easy interpretation of its analysis.

Picture 2: BlueDot provides both laptop and phone versions of its products.

Conclusion

People now compare the coronavirus pneumonia with SARS (severe acute respiratory syndrome) because of their similarities, including that both originated in open-air markets selling live and dead animals. In 2003, SARS claimed 774 lives worldwide. We call today the age of big data, and it is undeniable that big data is all around us. The data we generate may be a solution to the threats we face, and its application in disease surveillance is one example.