Our paper draws upon prior scholarly and practical work in three areas: traditional patient- and laboratory-based disease surveillance, Wikipedia-based measurement of the real world, and internet-based disease surveillance.

Traditional patient- and laboratory-based disease surveillance.

Traditional forms of disease surveillance are based upon direct patient contact or biological tests taking place in clinics, hospitals, and laboratories. The majority of current systems rely on syndromic surveillance data (i.e., about symptoms), including clinical diagnoses, chief complaints, school and work absenteeism, illness-related 911 calls, and emergency room admissions [4].

For example, a well-established measure for influenza surveillance is the fraction of patients with influenza-like illness, abbreviated simply ILI. A network of outpatient providers reports the total number of patients seen and the number who present with symptoms consistent with influenza that have no other identifiable cause [5]. Similarly, other electronic resources have emerged, such as the Electronic Surveillance System for the Early Notification of Community Based Epidemics (ESSENCE), based on real-time data from the Department of Defense Military Health System [6], and BioSense, based on data from the Department of Veterans Affairs, the Department of Defense, retail pharmacies, and Laboratory Corporation of America [7]. These systems are designed to facilitate early detection of disease outbreaks as well as response to harmful health effects, exposure to disease, or related hazardous conditions.

Clinical labs play a critical role in surveillance of infectious diseases. For example, the Laboratory Response Network (LRN), consisting of over 120 biological laboratories, provides active surveillance of a number of disease agents in humans ranging from mild (e.g., non-pathogenic E. coli and Staphylococcus aureus) to severe (e.g., Ebola and Marburg), based on clinical or environmental samples [4]. Other systems monitor non-traditional public health indicators such as school absenteeism rates, over-the-counter medication sales, 911 calls, veterinary data, and ambulance run data. For example, the Early Aberration Reporting System (EARS) provides national, state, and local health departments alternative detection approaches for syndromic surveillance [8].

The main value of these systems is their accuracy. However, they have a number of disadvantages, notably cost and timeliness: for example, each ILI datum requires a practitioner visit, and ILI data are published only after a delay of 1–2 weeks [5].

Wikipedia-based measurement of the real world.

Wikipedia is an online encyclopedia that has, since its founding in 2001, grown to contain approximately 30 million articles in 287 languages [9]. In recent years, it has consistently ranked as a top-10 website; as of this writing, it is the 6th most visited website in the world and the most visited site that is not a search engine or social network [10], serving roughly 850 million article requests per day [11]. For numerous search engine queries, a Wikipedia article is the top result.

Wikipedia contrasts with traditional encyclopedias on two key dimensions: it is free of charge to read, and anyone can make changes that are published immediately; review is performed by the community after publication. (This is true for the vast majority of articles. Particularly controversial articles, such as “George W. Bush” or “Abortion”, have varying levels of edit protection.) While this surprising inversion of the traditional review-publish cycle would seem to invite all manner of abuse and misinformation, Wikipedia has developed effective measures to deal with these problems and is of similar accuracy to traditional encyclopedias such as Britannica [12].

Wikipedia article access logs have been used for a modest variety of research. The most common application is detection and measurement of popular news topics or events [13]–[17]. The data have also been used to study the dynamics of Wikipedia itself [18]–[20]. Social applications include evaluating toponym importance in order to make type size decisions for maps [21], measuring the flow of concepts across the world [22], and estimating the popularity of politicians and political parties [23]. Finally, economic applications include attempts to forecast movie ticket sales [24] and stock prices [25]. The latter two applications are of particular interest because they include a forecasting component, as the present work does.

In the context of health information, the most prominent research direction focuses on assessing the quality of Wikipedia as a health information source for the public, e.g., with respect to cancer [26], [27], carpal tunnel syndrome [28], drug information [29], and kidney conditions [30]. To our knowledge, only four health studies exist that make use of Wikipedia access logs. Tausczik et al. examined public “anxiety and information seeking” during the 2009 H1N1 pandemic, in part by measuring traffic to H1N1-related Wikipedia articles [31]. Laurent and Vickers evaluated Wikipedia article traffic for disease-related seasonality and in relation to news coverage of health issues, finding significant effects in both cases [32]. Aitken et al. found a correlation between drug sales and Wikipedia traffic for a selection of approximately 5,000 health-related articles [33]. None of these propose a time-series model mapping article traffic to disease metrics.

The fourth study is a recent article by McIver & Brownstein, which uses statistical techniques to estimate the influenza rate in the United States from Wikipedia access logs [34]. In the next section, we compare and contrast this article with the present work in the context of a broader discussion of such techniques.

In summary, use of Wikipedia access logs to measure real-world quantities is beginning to emerge, as is interest in Wikipedia for health purposes. However, to our knowledge, use of the encyclopedia for quantitative disease surveillance remains at the earliest stages.

Internet-based disease surveillance.

Recently, new forms of surveillance based upon the social internet have emerged; these data streams are appealing in large part because of their real-time nature and the low cost of information extraction, properties complementary to traditional methods. The basic insight is that people leave traces of their online activity related to health observations, and these traces can be captured and used to derive actionable information. Two main classes of trace exist: sharing, such as social media mentions of face mask use [35], and health-seeking behavior, such as web searches for health-related topics [36]. (In fact, there is evidence that the volume of internet-based health-seeking behavior dwarfs traditional avenues [37], [38].)

In this section, we focus on the surveillance work most closely related to our efforts, specifically, that which uses existing single-source internet data feeds to estimate some scalar disease-related metric. For example, we exclude from detailed analysis work that provides only alerts [39], [40], measures public perception of a disease [41], includes disease dynamics in its model [42], evaluates a third-party method [43], uses non-single-source data feeds [39], [44], or crowd-sources health-related data (participatory disease surveillance) [45], [46]. We also focus on work that estimates biologically-rooted metrics. For example, we exclude metrics based on seasonality [47], [48] and over-the-counter drug sales volume, itself a proxy [49].

These activity traces are embedded in search queries [36], [50]–[76], social media messages [77]–[92], and web server access logs [34], [72], [93]. At a basic level, traces are extracted by counting query strings, words or phrases, or web page URLs that are related to the metric of interest, forming a time series of occurrences for each item. A statistical model is then created that maps these input time series to a time series estimating the metric's changing value. This model is trained on time period(s) when both the internet data and the true metric values are available and then applied to estimate the metric value over time period(s) when it is not available, i.e., forecasting the future, nowcasting the present, and anti-forecasting the past (the latter two being useful in cases where true metric availability lags real time).

Typically, this model is linear, e.g.:

ŷ(t) = β₀ + Σ_{i=1..V} βᵢ xᵢ(t)   (1)

where xᵢ(t) is the count of item i at time t, V is the total number of possible items (i.e., vocabulary size), ŷ(t) is the estimated metric value, and the coefficients βᵢ are selected by linear regression or similar methods. When appropriately trained, these methods can be quite accurate; for example, many of the cited models can produce near real-time estimates of case counts with correlations upwards of r = 0.95.
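As an illustrative sketch of Equation 1 (the access counts and ILI values below are entirely hypothetical), such a model can be fit and applied with ordinary least squares:

```python
import numpy as np

# Hypothetical weekly access counts for three articles (columns) over 8 weeks.
X = np.array([
    [120,  40, 15],
    [150,  55, 18],
    [300,  90, 30],
    [480, 160, 55],
    [620, 210, 80],
    [400, 140, 60],
    [250,  80, 35],
    [140,  45, 20],
], dtype=float)

# Hypothetical official ILI rate (the "true" metric) for the same weeks.
y = np.array([1.1, 1.4, 2.6, 4.0, 5.1, 3.5, 2.2, 1.3])

# Fit y ≈ b0 + sum_i b_i * x_i by least squares (prepend an intercept column).
A = np.hstack([np.ones((X.shape[0], 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Nowcast from a new week of access counts (intercept term first).
x_new = np.array([1.0, 350, 120, 45])
estimate = float(x_new @ coef)
```

In practice the vocabulary is far larger than the training period is long, so some form of feature selection or regularization is typically needed to avoid overfitting.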

The collection of disease surveillance work cited above has estimated incidence for a wide variety of infectious and non-infectious conditions: avian influenza [52], cancer [55], chicken pox [67], cholera [81], dengue [50], [53], [84], dysentery [76], gastroenteritis [56], [61], [67], gonorrhea [64], hand foot and mouth disease (HFMD) [72], HIV/AIDS [75], [76], influenza [34], [36], [54], [57], [59], [62], [63], [65], [67], [68], [71], [74], [77]–[80], [82], [83], [85]–[93], kidney stones [51], listeriosis [70], malaria [66], methicillin-resistant Staphylococcus aureus (MRSA) [58], pertussis [90], pneumonia [68], respiratory syncytial virus (RSV) [52], scarlet fever [76], stroke [69], suicide [60], [73], tuberculosis [76], and West Nile virus [52].

Closely related to the present work is an independent, simultaneous effort by McIver & Brownstein to measure influenza in the United States using Wikipedia access logs [34]. This study used Poisson models fitted with LASSO regression to estimate ILI over a 5-year period. The results, Pearson's r of 0.94 to 0.99 against official data, depending on model variation, compare quite favorably to prior work that tries to replicate official data. More generally, this article's statistical methods are more sophisticated than those employed in the present study. However, we offer several key improvements:

- We evaluate 14 location-disease contexts around the globe, rather than just one. In doing so, we test the use of language as a location proxy, which was noted briefly as future work in McIver & Brownstein. (However, as we detail below, we suspect this is not a reliable geo-location method for the long term.)

- We test our models for forecasting value, which was again mentioned briefly as future work in McIver & Brownstein.

- We evaluate models for translatability from one location to another.

- We present negative results and use these to begin exploring when internet-based disease surveillance methods might and might not work.

- We offer a systematic, well-specified, and simple procedure to select articles for model inclusion.

- We normalize article access counts by actual total language traffic, rather than using a few specific articles as a proxy for total traffic.

- Our software is open source and has only freely available dependencies, while the McIver & Brownstein code is not available and depends on proprietary components (Stata).

Finally, the goals of the two studies differ. McIver & Brownstein wanted to “develop a statistical model to provide near-time estimates of ILI activity in the US using freely available data gathered from the online encyclopedia Wikipedia” [34, p. 2]. Our goals are to assess the applicability of these data to global disease surveillance for operational public health purposes and to lay out a research agenda for achieving this end.

These methods are the basis for at least one deployed, widely used surveillance system. Based upon search query data, Google Flu Trends offers near-real-time estimates of influenza activity in 29 countries across the world (15 at the province level); another facet of the same system, Google Dengue Trends (http://www.google.org/denguetrends/) estimates dengue activity in 9 countries (2 at the province level) in Asia and Latin America.

Having laid out the space of quantitative internet disease surveillance as it exists to the best of our knowledge, we now consider this prior work in the context of our four challenges:

C1. Openness. Deep access to search queries from Baidu, a Chinese-language search engine serving mostly the Chinese market (http://www.baidu.com) [64], [74], [76]; Google [36], [50]–[54], [56]–[60], [65]–[67], [69]–[73], [75]; Yahoo [55], [68]; and Yandex, a search engine serving mostly Russia and Slavic countries in Russian (http://www.yandex.ru), English (http://www.yandex.com), and Turkish (http://www.yandex.com.tr) [75], as well as purpose-built health website search queries [61]–[63] and access logs [72], [93], is available only to those within the organizations, upon payment of an often-substantial fee, or by some other special arrangement. While tools such as Baidu Index (http://index.baidu.com), Google Trends (http://www.google.com/trends/), Google Correlate (http://www.google.com/trends/correlate/), and Yandex's WordStat (http://wordstat.yandex.com) provide a limited view into specific search queries and/or time periods, as do occasional lower-level data dumps offered for research, none of these affords the large-scale, broad data analysis that drives the most effective models.

The situation is only somewhat better for surveillance efforts based upon Twitter [77]–[92]. While a small portion of the real-time message stream (1%, or 10% for certain grandfathered users) is available outside the company at no cost, terms of use prohibit sharing historical data needed for calibration between researchers. Access rules are similar or significantly more restrictive for other social media platforms such as Facebook and Sina Weibo, the leading Chinese microblogging site (http://weibo.com). Consistent with this, we found no research meeting our inclusion criteria based on either of these extremely popular systems.

We identified only one prior effort making use of open data, McIver & Brownstein with Wikipedia access logs [34]. Open algorithms in this field of inquiry are also very limited. Of the works cited above, again only one, Althouse et al. [50], claims general availability of their algorithms in the form of open source code.

Finally, we highlight the quite successful Google Flu and Dengue Trends as a case study in the problems of closed data and algorithms. First, because their data and algorithms are proprietary, there is little opportunity for the wider community of expertise to offer peer review or improvements (for example, the list of search terms used by Dengue Trends has never been published, even in summary form); the importance of these opportunities is highlighted by the system's well-publicized estimation failures during the 2012–2013 flu season [94] as well as more comprehensive scholarly criticisms [43]. Second, only Google can choose the level of resources to spend on Trends, and no one else, regardless of their available resources, can add new contexts or take on operational responsibility should Google choose to discontinue the project.

C2. Breadth. While in principle these surveillance approaches are highly generalizable, nearly all extant efforts address a small set of diseases in a small set of countries, without testing specific methods to expand these sets.

The key exception is Paul & Dredze [91], which proposes a content-based method, ailment topic aspect model (ATAM), to automatically discover a theoretically unbounded set of medical conditions mentioned in Twitter messages. This unsupervised machine learning algorithm, similarly to latent Dirichlet allocation (LDA) [95], accumulates co-occurring words into probabilistic topics. Lists of health-related lay keywords, as well as the text of health articles written for a lay audience, are used to ensure that the algorithm builds topics related to medical issues. A test of the method discovered 15 coherent condition topics including infectious diseases such as influenza, non-infectious diseases such as cancer, and non-specific conditions such as aches and pains. The influenza topic's time series correlated very well with ILI data in the United States.

However, we identify three drawbacks of this approach. First, significant curated text input data in the target language are required; second, output topics require expert interpretation; and third, the ATAM algorithm has several parameters that require expert tuning. That is, adapting the algorithm to a new location and/or language requires expertise in both machine learning and the target language.

In summary, to our knowledge, no disease measurement algorithms have been proposed that are extensible to new disease-location contexts solely by adding examples of desired output. We propose a path to such algorithms.

C3. Transferability. To our knowledge, no prior work offers trained models that can be translated from one context to another. We propose using the inter-language article links provided in Wikipedia to accomplish this translation.
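As a sketch of what such a translation could look like (the article names, weights, and link table below are hypothetical; real inter-language links come from Wikipedia's langlinks data):

```python
# A trained model as a mapping from article title to coefficient
# (titles and weights are hypothetical, for illustration only).
en_model = {"Influenza": 0.62, "Fever": 0.21, "Oseltamivir": 0.17}

# Hypothetical English -> Polish inter-language link table.
langlinks_en_pl = {
    "Influenza": "Grypa",
    "Fever": "Gorączka",
    # "Oseltamivir" has no linked article in this toy table.
}

def translate_model(model, links):
    """Carry each article's weight over to its linked counterpart,
    dropping articles with no link in the target language."""
    return {links[a]: w for a, w in model.items() if a in links}

pl_model = translate_model(en_model, langlinks_en_pl)
```

A translated model would still need evaluation against official data in the target context, since access patterns differ across language communities.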

C4. Forecasting. A substantial minority of the efforts in this space test some kind of forecasting method. (Note that many papers use the term predict, and some even misuse forecast, to indicate nowcasting.) In addition to forecasting models that incorporate disease dynamics (recall that these are out of scope for the current paper), two basic classes of forecasting exist: lag analysis, where the internet data are simply time-shifted in order to capture leading signals, and statistical forecast models such as linear regression.

Lag analysis has shown mixed results in prior work. Johnson et al. [93], Pelat et al. [67], and Jia-xing et al. [64] identified no reliable leading signals. On the other hand, Polgreen et al. [68] used lag analysis with a shift granularity of one week to forecast positive influenza cultures as well as influenza and pneumonia mortality with a horizon of 5 weeks or more (though these indicators may trail the onset of symptoms significantly). Similarly, Xu et al. [72] reported evidence that lag analysis may be able to forecast HFMD by up to two months, and Yang et al. [73] used lag analysis with a granularity of one month to identify search queries that lead suicide incidence by up to two months.
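A minimal sketch of lag analysis on toy weekly series (all data hypothetical): shift one series against the other and measure the correlation at each offset.

```python
import numpy as np

def lagged_correlation(internet, official, lag):
    """Pearson correlation after shifting the internet series by `lag`
    steps; positive lag tests whether it leads the official data."""
    if lag > 0:
        a, b = internet[:-lag], official[lag:]
    elif lag < 0:
        a, b = internet[-lag:], official[:lag]
    else:
        a, b = internet, official
    return float(np.corrcoef(a, b)[0, 1])

# Toy weekly series: the internet signal leads the official one by 2 weeks.
official = np.array([1, 1, 2, 4, 7, 9, 8, 5, 3, 2, 1, 1], dtype=float)
internet = np.array([2, 4, 7, 9, 8, 5, 3, 2, 1, 1, 1, 1], dtype=float)

best_lag = max(range(-4, 5), key=lambda k: lagged_correlation(internet, official, k))
```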

The more complex method of statistical forecast models appears potentially fruitful as well. Dugas et al. tested several statistical methods using positive influenza tests and Google Flu Trends to make 1-week forecasts [57], and Kim et al. used linear regression to forecast influenza on a horizon of 1 month [86].

In summary, while forecasts based upon models that include disease dynamics are clearly useful, sometimes this is not possible because important disease parameters are insufficiently known. Therefore, it is still important to pursue simple methods. The simplest is lag analysis; our contribution is to evaluate leading information more quantitatively than previously attempted. Specifically, we are unaware of previous analysis with shift granularity less than one week; in contrast, our analysis tests daily shifting even if official data are less granular, and each shift is an independently computed model; thus, our ±28-day evaluation results in 57 separate models for each context.
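The per-shift evaluation described above can be sketched as follows, using synthetic daily series (the data and the simple one-regressor fit are illustrative only): one independently fitted model per daily shift over ±28 days, i.e., 57 models per context.

```python
import numpy as np

# Synthetic daily series: a toy "official" metric and an internet signal
# constructed to lead it by exactly 7 days.
days = 120
t = np.arange(days)
official = np.sin(2 * np.pi * t / 40) + 1.5       # toy daily disease metric
signal = np.sin(2 * np.pi * (t + 7) / 40) + 1.5   # same curve, 7 days earlier

models = {}
for shift in range(-28, 29):                      # 57 shifts, one model each
    if shift > 0:                                 # positive: signal leads
        x, y = signal[:-shift], official[shift:]
    elif shift < 0:
        x, y = signal[-shift:], official[:shift]
    else:
        x, y = signal, official
    A = np.vstack([np.ones_like(x), x]).T         # intercept + one regressor
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = float(np.corrcoef(A @ coef, y)[0, 1])     # fit quality at this shift
    models[shift] = (coef, r)

best_shift = max(models, key=lambda s: models[s][1])  # recovers the 7-day lead
```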

In summary, significant gaps remain with respect to the challenges blocking a path to an open, deployable, quantitative internet-based disease surveillance system. In this paper, we propose a path to overcoming these challenges and offer evidence demonstrating that this path is plausible.