With Google Flu Trends, a project devoted to inferring outbreaks of influenza from search queries, Google has unwittingly hung a big sign on itself advertising services for government surveillance. Of course, the privacy implications of search engine databases are a longstanding issue. But the topic tends to be treated as an academic concern of privacy activists or civil libertarians - until some controversy propels it into public awareness.

The investigation itself is laudable. In sum, a rise in the number of people searching for flu-related terms is likely to signal an outbreak of infection. Researchers at Google and the US Centers for Disease Control and Prevention have written a paper on how to data-mine search logs for this purpose.
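To make the idea concrete, here is a minimal sketch of that kind of signal detection. It is not the method from the Google and CDC paper. The keyword list, function names, baseline window and threshold are all assumptions chosen for illustration, and the substring matching is deliberately crude.

```python
from statistics import mean, stdev

# Hypothetical keyword list; a real system would choose its terms far more carefully.
FLU_TERMS = ("flu", "influenza", "fever")


def flu_fraction(queries):
    """Share of one week's queries that contain a flu-related term."""
    if not queries:
        return 0.0
    hits = sum(any(term in q.lower() for term in FLU_TERMS) for q in queries)
    return hits / len(queries)


def flag_spikes(weekly_fractions, baseline_weeks=4, threshold=2.0):
    """Indices of weeks whose fraction exceeds the mean of the preceding
    `baseline_weeks` weeks by more than `threshold` standard deviations."""
    flagged = []
    for i in range(baseline_weeks, len(weekly_fractions)):
        window = weekly_fractions[i - baseline_weeks:i]
        mu, sigma = mean(window), stdev(window)
        if weekly_fractions[i] > mu + threshold * sigma:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Made-up weekly fractions of flu-related queries, for illustration only.
    fractions = [0.010, 0.011, 0.009, 0.010, 0.030, 0.012]
    print(flag_spikes(fractions))
```

Note that the same few lines would work unchanged if the keyword list named any other topic, which is the concern taken up in the rest of this column.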

The paper states: "Harnessing the collective intelligence of millions of users, Google web search logs can provide one of the most timely, broad-reaching syndromic surveillance systems available today." It would be more accurate, though more disturbing, if the first part of that sentence read: "Harnessing Google's ability to do extensive real-time monitoring of the search activity of millions of users ..."

At its core, this project is demonstrating how Google's infrastructure can be used for tracking the population. The technical problem is independent of social values. In the same way that a system for blocking access to forbidden material can be applied to sex or to political dissent, a monitoring system can be applied to the virus of influenza or the virus of democracy.

What's especially significant here is that Google has in effect created and publicised a working prototype. Whenever data-mining is discussed, objections are raised regarding the likelihood of false positives and the possibilities of poisoning by saboteurs. While the researchers acknowledge such potential problems, the overall result is claimed to be useful and worthy of publication in a scientific journal.

It could all be for an extremely good cause. But it raises the spectre of less than noble future applications. Two privacy groups, the Electronic Privacy Information Center (EPIC) and Patient Privacy Rights, have written to Google detailing concerns focused on whether the aggregate data could be used to identify individuals. There have been notable incidents where supposedly anonymised datasets have leaked personally identifying information.

EPIC says: "Historically, identification through aggregated data has been subject to abuse ... Therefore, automatic, permanent, one-way anonymisation of such information is necessary."

However, the main value of this relatively small but easily graspable individual privacy risk may be in raising awareness of the much larger but more abstract societal surveillance risk. While people's health concerns should certainly be treated as confidential, what's being done here is to find methods for automated mass observation.

In effect, techniques developed to find interests for advertisers are being repurposed. This is even more worrisome when one considers the implications for electronic health records, an area where Google has been very active - whoever gets to be a standard computerisation technology provider for the health industry will make a fortune.

There is a standard ideological rant that it's a threat to freedom when governments have large databases concerning citizens, but simply good business for corporations to have large databases regarding users. The problem is that governments can appropriate corporate databases.

Information itself is useless without a means of retrieving and organising it. That's a task at which Google excels. So a deeper problem is not even the database, but the infrastructure to process it. Which is exactly what Google is building.

There's no easy solution to this dilemma, as it's the sort of basic technology that can be used for good and evil. But it's another reason to be wary of surveillance engines.

sethf.com/infothought/blog