Even more useful, though, would be the ability to predict disease, rather than simply observe and broadcast its path. In a study published yesterday in the journal PLOS Computational Biology, researchers from Los Alamos National Laboratory—a government facility in New Mexico that focuses on science with national-security implications—found that trends in Wikipedia pageviews can be used to predict flu outbreaks up to four weeks in advance.

To create their computer model, the researchers compiled a list of every page a reader could click through to directly from the Wikipedia entry for “influenza” (the entry for “flu” also redirects there) and then compared the traffic for each of those pages with flu reports provided by the Centers for Disease Control and Prevention.

“Those that correlated really well, we kept, and those that didn’t correlate, we dropped,” explained Los Alamos researcher Sara del Valle, one of the study authors. Ultimately, they were left with a list of 10 flu-related Wikipedia pages, including the entries for “antivirals,” “H1N1,” and “fever,” whose traffic they used to build their predictive algorithms. “So basically, just by looking at how many people are looking at the Wikipedia flu article, we can see how many cases are going to be showing up.”

Using online activity to predict outbreaks isn’t exactly new—but, del Valle said, the study offers “proof of concept” that Wikipedia could provide scientists a means of circumventing several existing obstacles in disease forecasting. While recent research has shown Twitter to be an effective resource for predicting outbreaks, for example, the cost of the raw data can be prohibitive for many researchers. Gnip, a subsidiary of Twitter and one of only a few data-delivery companies with access to its “firehose” (the real-time feed of every public Tweet), charges users a monthly fee of $2,000 for the data, plus an additional 10 cents for every 1,000 Tweets delivered. (Earlier this year, Twitter announced that it would provide a select number of “data grants” enabling researchers to access the data for free.) Other companies keep their data closed off entirely: Google doesn’t publicize the search terms it uses to build Flu Trends, out of concern that the program could be manipulated by hackers trying to create the appearance of an outbreak.

Wikipedia, by contrast, offers public access to hourly traffic data for every one of its pages. “We don’t do ‘data grants’ for selected individuals or research institutions, since our mandate is to make data openly available to anyone,” Dario Taraborelli, head of research and data at the Wikimedia Foundation, explained in an email, noting that his team fields “several requests a week” from researchers looking for data.
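To see how open that access is, here is a short sketch of how anyone can request this traffic data today through the Wikimedia Pageviews REST API. Note this public API postdates the study (the researchers worked from the raw hourly pagecount dumps), and the dates below are chosen arbitrarily for illustration; no API key or registration is required.

```python
# Sketch: build a request URL for the public Wikimedia Pageviews REST
# API (per-article route). No authentication is needed to query it.

def pageviews_url(article: str, start: str, end: str,
                  project: str = "en.wikipedia",
                  granularity: str = "daily") -> str:
    """Build a Pageviews API URL.

    start and end are YYYYMMDD dates; spaces in article titles
    become underscores, per Wikipedia's URL conventions.
    """
    title = article.replace(" ", "_")
    return (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        f"{project}/all-access/all-agents/{title}/{granularity}/{start}/{end}"
    )

# Example: daily traffic for the "Influenza" article, November 2014.
url = pageviews_url("Influenza", "20141101", "20141130")
print(url)
```

Fetching that URL with any HTTP client returns a JSON time series of view counts, which is exactly the kind of signal the Los Alamos team correlated against CDC reports.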

Another part of the appeal, del Valle said, is that Wikipedia’s open access could allow researchers to bypass the bureaucracy that currently comes with large-scale disease tracking. FluView, the CDC’s weekly influenza-surveillance report, which compiles data from hospitals, healthcare providers, state health departments, and government public-health labs, has a lag time of about two weeks.