Abstract Traditional methods for monitoring influenza are haphazard and lack fine-grained details regarding the spatial and temporal dynamics of outbreaks. Twitter gives researchers and public health officials an opportunity to examine the spread of influenza in real-time and at multiple geographical scales. In this paper, we introduce an improved framework for monitoring influenza outbreaks using the social media platform Twitter. Relying upon techniques from geographic information science (GIS) and data mining, Twitter messages were collected, filtered, and analyzed for the thirty most populated cities in the United States during the 2013–2014 flu season. The results of this procedure are compared with national, regional, and local flu outbreak reports, revealing a statistically significant correlation between the two data sources. The main contribution of this paper is to introduce a comprehensive data mining process that enhances previous attempts to accurately identify tweets related to influenza. Additionally, geographical information systems allow us to target, filter, and normalize Twitter messages.

Citation: Allen C, Tsou M-H, Aslam A, Nagel A, Gawron J-M (2016) Applying GIS and Machine Learning Methods to Twitter Data for Multiscale Surveillance of Influenza. PLoS ONE 11(7): e0157734. https://doi.org/10.1371/journal.pone.0157734 Editor: Mansour Ebrahimi, Qom University, ISLAMIC REPUBLIC OF IRAN Received: October 22, 2015; Accepted: June 4, 2016; Published: July 25, 2016 Copyright: © 2016 Allen et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Data Availability: Individual Twitter messages are protected by the Twitter Terms of Use and therefore cannot be redistributed. However, we are able to provide aggregated tweet counts at various geographic scales. We are uploading the data on tweet rates and flu rates as Supporting Information. Webmap is available online at http://vision.sdsu.edu/hdma/smart/. Funding: This research was conducted under a National Science Foundation Cyber-enabled Discovery and Innovation (CDI) grant (award #1028177). This research was also conducted under National Science Foundation grant #1416509.The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: The authors have declared that no competing interests exist.

Introduction Public health scholars have long studied the ways in which disease outbreaks can be monitored. Accurate information about the intensity and geographic distribution of illnesses can allow officials to allocate resources effectively and respond appropriately in order to combat these health threats. However, surveillance of influenza has traditionally been problematic due to uneven and haphazard reporting, as well as the time and cost overhead associated with gathering this information. This paper introduces a geo-targeted data mining approach to influenza surveillance that relies upon data from the social media platform Twitter. Messages from Twitter are gathered and analyzed using geographic information system (GIS) methods such as spatial filtering, population normalization, and multi-scale analysis. Additionally, this work also turns to machine learning techniques to improve the filtering process in order to better distinguish tweets that appear to describe real-world cases of influenza from those that do not. For this purpose, a support vector machine (SVM) classifier was trained using manually tagged data collected from the 2012–2013 flu season. Comparing our results for the 2013–2014 flu season with official reports of influenza-like illness (ILI) on national, regional, and local levels demonstrates that the two signals are closely correlated, suggesting that Twitter holds great potential for monitoring influenza outbreaks.

Background and Literature Review This study builds on the flurry of recent research activity seeking to understand the dynamics of disease outbreak by analyzing Internet and social media data. Though remarkable progress has been made in the past few years, there remain significant obstacles facing researchers wanting to use these new sources of data to inform public health decisions. Many researchers have turned to Google Flu Trends (GFT) for user-generated data on influenza. This site publishes information on flu-related web searches that can be used to monitor changes in Internet activity over time and space. Dukic et al. [1] modified the classical particle learning epidemiology model using GFT data to develop a method for monitoring the spread of flu. More recently, Dugas et al. [2] have relied upon historical outbreak data along with patterns in Flu Trends to create a forecast model that allows the team to predict flu cases one week in advance with reasonable precision. In the past few years, however, many have grown skeptical of the accuracy of Google Flu Trends. Olson et. al. [3], for example, argue that GFT predictions are not accurately tuned to yearly variations in flu patterns. Similarly, Declan Butler has commented in Nature [4] that Google’s significant miscalculation in predicting the intensity of flu outbreaks for the 2013 season casts serious doubt on the reliability of Flu Trends data to monitor illness. In the past few years, the micro-blogging service Twitter has provided researchers with a rich new source of data, allowing for detailed examinations of complex diffusion processes, human behaviors, and collective attitudes around the world [5–7]. Scholars from divergent fields have similarly turned to Twitter to collect, filter, and analyze messages (called tweets) in order to gain new insights into the spread of flu. Nagel et al. [8], for instance, have developed a methodology for collecting and filtering tweets that demonstrates a high correlation with local and national reports of ILI cases. Other researchers have delved deeper into text analysis to more accurately remove tweets that are unrelated to actual cases of influenza. Notably, Lamb et al. [9] used manually-defined word feature classes to filter out tweets that are did not appear to reflect personal infection. Though these efforts are valuable contributions to the fields of public health and Big Data analysis, this paper suggests that a greater reliance on geographic information system (GIS) and machine learning methods can shed new light on the role these exciting new data sources (particularly Twitter) can play in studying disease outbreak.

Data Collection The data collection procedure for this research was based upon the Visualizing Information Space in Ontological Networks (VISION) framework developed by Tsou, et. al. [10–11] to examine the interrelationships between cyberspace message, space, and time. Unlike most previous applications of the VISION framework, this study focuses exclusively on data collected from Twitter due to the real-time and dynamic nature of this platform. In the past few years Twitter has emerged as one of the leading social media platforms, boasting more than 140 million active users. Each day, Twitter users produce millions of tweets (messages of 140 characters or less), which can be collected through official Twitter APIs. To assist with our data collection process, our research group has developed search tools which query the Twitter Search API based on spatial constraints and keyword filters. To allow researchers within our group to visually explore this data, a web map was developed that displays the search locations related to each keyword, as well as the intensity for that keyword. Using this interface, it is also possible to download the Twitter data in Excel form, which allows for a more detailed analysis of the various attributes associated with each tweet, such as the time it was published, the user’s location, the GPS coordinates of the tweet, and any URLs or hashtags contained within the message (see S3 File). This study builds on the work of Nagel et al. [8] by using keywords that have been identified as effective indicators of influenza outbreak. The Twitter search tool collected tweets by the keywords “flu” and “influenza.” Tweets were collected from 31 major cities in the United States: Atlanta, GA; Austin, TX; Baltimore, MD; Boston, MA; Chicago, IL; Cleveland, OH; Columbus, OH; Dallas, TX; Denver, CO; Detroit, MI; El Paso, TX; Fort Worth, TX; Houston, TX; Indianapolis, IN; Jacksonville, FL; Los Angeles, CA; Memphis, TN; Milwaukee, WI; Nashville, TN; New York, NY; Oklahoma City, OK; Philadelphia, PA; Phoenix, AZ; Portland, OR; San Diego, CA; Seattle, WA; and Washington D.C. Unlike Nagel et. al., this research uses variable search radiuses for each of these cities, which were determined by the researchers.

Methods For this study, geographic information system (GIS) methods were relied upon for data collection and normalization. To filter out noise from the dataset, a machine learning procedure was adopted that allowed our group to better distinguish tweets that appeared to indicate a real-world instance of influenza from those that did not. In order to target specific locations for data collection, this project took advantage of the spatial filtering methods provided by the Twitter Search Application Programming Interface (API) (see S3 File). Most research focusing on Twitter data relies upon the Streaming API, which allows users to retrieve GPS-tagged tweets within a specified bounding box. However, the main disadvantage of this method is that the Streaming API only gives access to a one percent sampling of all tweets. By polling the Search API continually, we can access a much larger dataset for specific geographic areas, which allows for detailed analysis of the data at municipal, regional, and national scales. Normalization of population is essential to understanding quantitative geographic data, and for this study a novel approach was taken to carry out this task. The Twitter Search API requires that geo-targeted searches specify a latitude/longitude pair as well as a radius, which can essentially be thought of as a point buffer. Each of these city point buffers was joined with census tract centroids to determine which tracts should be included in our population calculations. Using the fine-grained census data allows us to gain a more accurate estimation of population, which greatly improves our ability to accurately normalize tweet counts for individual cities. With respect to noise filtering, Nagel et al. [8] have shown that excluding retweets and tweets containing URL links produces a much higher correlation coefficient with ground-truth data, and we have borrowed that strategy in this study. However, to further filter our data, we have developed a machine learning classification procedure. The goal of this procedure was to identify tweets that do not appear to indicate real-world cases of influenza so that they can be omitted from the statistical analysis. The following are example tweets containing the keyword “flu” and the determination of their validity that we hoped to accomplish with the classification task: Flu medicine kicked in… Time for bed → Valid

I gotta get over this flu!! → Valid

Who gets the stomach flu the day before class starts? → Invalid

I'm getting the flu shot today. #scared → Invalid

Arm is killing from flu jab → Invalid

this flu feels like death → Valid As these examples show, in many cases, one can succeed at classifying tweets by recognizing one positive indicator (e.g., “get over”, “medicine”) or one negative indicator (e.g., “shot”, “jab”). This property strongly suggested to our team that using a linear learning algorithm would be an effective strategy. In machine learning terms, the problem of assigning the labels “no” or “yes” to a tweet is a simple binary classification task. The features relevant to classification are words (such as “medicine” or “bed”) and n-grams (such as “stomach flu”, “flu shot”). The learning algorithm is given a training set of tweets represented as features with numerical values and labeled with their classes. With a linear classifier, the output of the learning algorithm is simply a set of feature weights; unseen examples can be classified using a linear combination of the weighted features. The primary advantage of using a linear classifier is that learning is cheap and scalable. Additional positive and negative indicators can be learned with more labeled data. Furthermore, the problem of what to do when indicators conflict is solved automatically by the feature weights. For instance, consider a tweet that contains both the word “stomach” and the word “medicine.” The learner classifies such tweets based on the weighting that was optimal in correctly classifying the training data. The particular kind of linear learner used here is called a support vector machine (SVM) [12]. In this context, an SVM was used to classify tweets that appeared to be indicators of real-world cases of influenza and those messages that appeared to be irrelevant to actual illness (see above examples). To train the SVM, 1,500 randomly sampled tweets from the 2012–2013 season containing the keyword “flu” were used to train the SVM classifier. Each of these tweets was manually inspected and classified as valid or invalid according to the likelihood that the message indicated an actual case of influenza. In order to assign numerical values to unigram, bi-gram, and tri-gram features, we use the term frequency—inverse document frequency (TF-IDF) scores for each word or word-pair. TF-IDF is a measure of the statistical significance of each term in a message that also accounts for the prevalence of that term in the overall data set. Essentially, this score assigns a higher weight to words that are more important to a specific message in comparison the entire set of messages [13]. It should be noted that using statistical transformations such as TF-IDF allows the researcher to avoid the task of manually identifying specific keywords indicating influenza; rather, the classifier is able to automatically recognize important keywords based on patterns in the manually classified training set. Consequently, applying this procedure to other illnesses (for example, Ebola) would simply require a new set of manually classified tweets with which to train the classifier. To evaluate the classification model, we used standard machine learning measures of quality: recall, precision and the resulting F1 score. Using 1,000 randomly sampled tweets as a test set, we determined that the classifier has a precision score of 0.671, a recall score of 0.949, resulting in an F1 score of 0.786. This essentially means that the model was able to classify most valid tweets correctly (indicated by the high recall score), but as the precision score indicates, it occasionally incorrectly categorizes invalid tweets as valid.

Limitations Though this study improves upon previous attempts to analyze the relationship between flu-related tweets and real-world outbreaks of influenza, there are remaining issues that might be addressed in future work. Currently the support vector classifier used to filter tweets was trained with data from the 2012–2013 season. While this training set produced satisfactory results with respect to the filtering process, this approach to classification ignores the dynamic nature of social media data. For instance, our team discovered that the signal for flu-related tweets in Baltimore for the 2013–2014 season were anomalous in that they did not closely correspond to local ILI reports. After manually analyzing individual tweets for the city, we identified many messages in Baltimore mentioning the phrase “fresher’s flu”, which is a colloquial term to describe illness that sometimes accompanies the start of a college semester. Though these tweets may be indicators of real-world illness, the term “fresher’s flu” generally does not refer to actual influenza. After adjusting our filtering process to account for this anomaly, the correlation coefficients for Baltimore were similar to other cities. This incident demonstrates the need to develop a framework for continually re-training the tweet classifier so that it can account for the temporal and spatial dynamics of messages in cyberspace. Additionally, the regional analysis of flu tweets presented in this paper might be further improved by including a greater number of search areas. Since Twitter messages were only collected for 31 major U.S. cities, regions with few large urban areas contained sparse data. Three regions (1, 2, and 8) contained tweets from only one city and one region (7) was not covered by any of our search areas. Including more search areas to cover less densely populated areas could address this issue. Finally, it should be noted that if social media is used to systematically monitor influenza in the future, there are potential concerns that may need to be considered in order to maintain the reliability of this method. For instance, public awareness of these methods could influence behavior and consequently lead to false reporting. It is not hard to imagine a scenario where Twitter users would falsify flu-related tweets in order to garner more attention from public health officials and receive more resources such as vaccination supplies. However, as discussed by Petróczi and Haugen [15], such false reporting may be counterbalanced by further understanding the motivations individuals may have for distorting the truth, and these insights may allow researchers to identify Twitter messages that are likely to be insincere.

Conclusions This study demonstrates that social media holds great potential for monitoring the outbreak of flu and other illness. It builds on previous work but suggests that GIS methods can augment existing approaches. Additionally, this paper introduces a machine learning procedure to filter out noise from collected tweets—a task that has been a long-standing hurdle preventing researchers from taking Twitter seriously as a source of data. The outcome demonstrates that this procedure yields results that correlate more strongly with national and local ILI reports.

Supporting Information S1 File. A Microsoft Excel file containing tables that list influenza-like illness (ILI) rates at local, regional, and national levels. https://doi.org/10.1371/journal.pone.0157734.s001 (XLSX) S2 File. A Microsoft Excel file containing tables that list flu-related tweet counts at local, regional, and national levels. https://doi.org/10.1371/journal.pone.0157734.s002 (XLSX) S3 File. A document containing links to example source code for collecting geo-targeted Twitter data and the web interface that is used to view and download Twitter data. https://doi.org/10.1371/journal.pone.0157734.s003 (DOCX)

Author Contributions Conceived and designed the experiments: CA AA AN. Analyzed the data: CA AA AN. Contributed reagents/materials/analysis tools: CA JMG MT. Wrote the paper: CA JMG MT.