[This article was first published on, and kindly contributed to R-bloggers.]

If, like me, you've ever had a sandwich from a dubious deli and then been laid up for days afterwards, you know that food poisoning is no trifling matter. In the past, local authorities would learn of such public health issues only if the victim (or the victim's doctor) reported them. But that misses the many less serious cases that never involve a doctor or hospital, as well as illnesses that simply go unreported.

Now, the City of Chicago has found a new way of identifying sources of food poisoning: by analyzing tweets. Foodborne Chicago scans tweets posted in the Chicagoland area, responding to tweets like: "Stomach flu/food poisoning is like eating gas station sushi without the joys of eating gas station sushi" (but ignoring tweets like "It’s really hard to snack while watching Honey Boo Boo. It’s the second best diet to food poisoning."). If you send such a tweet, you're likely to get a response:

@cheerjoeyniz Sorry to hear you were sick. We can help you by clicking on this link to file a report: http://t.co/jPYs8NxTVw — Foodborne Chicago (@foodbornechi) April 16, 2013

The system is entirely automated, and uses real-time text analysis implemented in the R language to identify tweets that are about a specific case of food poisoning:

Foodborne searches Twitter for all tweets near Chicago containing the string “food poisoning”. The ingestion service consumes thousands of tweets, storing them in a large MongoDB instance. A collection of classification servers, running R, churn through the collected tweets, applying a series of filters. The tweets are classified using a model that was trained via supervised learning, which determines if the tweets are related to a food poisoning illness or not.
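To make the flow in that description concrete, here is a minimal sketch of the "series of filters, then classifier" stage. It is written in Python purely for illustration (the real servers run R against MongoDB), it works on an in-memory list, and every name in it (`Tweet`, `mentions_food_poisoning`, `is_not_retweet`, `candidates_for_review`) is hypothetical rather than taken from the Foodborne code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Tweet:
    user: str
    text: str


# Hypothetical filters: the article does not say which filters Foodborne applies.
def mentions_food_poisoning(tweet: Tweet) -> bool:
    """Keep only tweets containing the search string."""
    return "food poisoning" in tweet.text.lower()


def is_not_retweet(tweet: Tweet) -> bool:
    """Drop plain retweets, which repeat someone else's report."""
    return not tweet.text.startswith("RT ")


FILTERS: List[Callable[[Tweet], bool]] = [mentions_food_poisoning, is_not_retweet]


def apply_filters(tweets: Iterable[Tweet],
                  filters: List[Callable[[Tweet], bool]] = FILTERS) -> List[Tweet]:
    """Return the tweets that pass every filter, in their original order."""
    return [t for t in tweets if all(f(t) for f in filters)]


def candidates_for_review(tweets: Iterable[Tweet],
                          classify: Callable[[Tweet], bool],
                          filters: List[Callable[[Tweet], bool]] = FILTERS) -> List[Tweet]:
    """Filter first (cheap), then ask the trained model (assumed to be costlier)."""
    return [t for t in apply_filters(tweets, filters) if classify(t)]
```

Here `classify` stands in for the trained model: any callable that takes a tweet and returns `True` when it looks like a genuine illness report.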

Cory Nissen, the data scientist who implemented the analysis behind the system, shared some of the behind-the-scenes details with me via email. He used an R package called textcat and an algorithm based on n-grams to classify the tweets. The model is trained to favor sensitivity (at the 90%+ level) at the expense of specificity (50–60%), so that few genuine food poisoning reports are missed even though more "junk" tweets merely about food poisoning get through. Of all the tweets in the Chicago area, the system flags about 10-20 a day for review, of which just a couple will typically warrant a response to the unwell citizen for follow-up.
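For readers who have not seen n-gram classification before, the following is a rough Python sketch of one common variant: ranked character n-gram profiles compared with an "out-of-place" distance. It is an illustration only. The article does not describe the exact model, so the profile size, the `slack` parameter, and all function names here are assumptions, not the Foodborne implementation.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def ngram_profile(texts: Iterable[str], n_max: int = 3, top: int = 300) -> Dict[str, int]:
    """Map the `top` most frequent character n-grams (n = 1..n_max) to their rank."""
    counts: Counter = Counter()
    for text in texts:
        padded = f" {text.lower()} "
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top))}


def out_of_place(doc: Dict[str, int], category: Dict[str, int]) -> int:
    """Sum of rank differences; n-grams missing from the category get a fixed penalty."""
    penalty = len(category)
    return sum(abs(rank - category[g]) if g in category else penalty
               for g, rank in doc.items())


def flag_report(text: str, report_profile: Dict[str, int],
                junk_profile: Dict[str, int], slack: float = 1.0) -> bool:
    """Flag a tweet when it is at least as close to 'report' as to 'junk'.

    slack > 1 (a hypothetical knob) flags more tweets, trading specificity
    for sensitivity.
    """
    doc = ngram_profile([text])
    return out_of_place(doc, report_profile) <= slack * out_of_place(doc, junk_profile)


def sensitivity_specificity(y_true: List[bool], y_pred: List[bool]) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).

    Assumes y_true contains at least one True and one False label.
    """
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fn = sum(t and not p for t, p in zip(y_true, y_pred))
    tn = sum(not t and not p for t, p in zip(y_true, y_pred))
    fp = sum(not t and p for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)
```

In this sketch, raising `slack` is one way to get the behavior described above: more genuine reports are caught (higher sensitivity), and more junk tweets are flagged along with them (lower specificity), which is tolerable when flagged tweets are reviewed before anyone replies.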

The open-source R code behind the classifier is available on GitHub. Check out the README file for more technical details behind the implementation. You can also see how the application was presented on Fox 39 Chicago news (starting at the 2:09 mark):

Smart Chicago Collaborative: Foodborne Chicago: Behind the Scenes