Open data yield a bounty of insights, but what you find may be difficult to stomach.

San Francisco has more restaurants per capita than any other city in the country, and inspecting these kitchens is a small staff of tireless Department of Public Health workers, routinely hitting the streets to help ensure that you and I dine on food and nothing else. What’s more, our fair city by the bay has implemented a progressive open data policy that has seen countless public records made available in detailed, machine-readable formats. As a hacker foodie I was drawn to the store of health inspection data available through the program, and what I found did not disappoint.

This trove of data paints a revealing picture of the perils of dining out in San Francisco, with more than 40% of restaurants in many neighborhoods incurring high-risk health violations such as vermin infestation, sewage contamination, unapproved living quarters and the sale of previously served food. Combining these inspection reports with restaurant review data sourced through Mechanical Turk we find, disappointingly if not surprisingly, that many perennially popular cuisines such as Indian, Chinese, and Thai rate consistently among the most unclean. Dodgy dives aside, worse still are the many highly reviewed restaurants whose vermin-infested kitchens and adulterated food go unnoticed by thousands of hapless diners every month.

The Facts of the Matter

Focusing on the eleven neighborhoods in San Francisco with the most restaurants, a natural starting place is an average health inspection score on a per-neighborhood basis.

Area                    Mean Score
Dogpatch/Potrero        94.74
Inner Richmond          93.12
Laurel Heights          92.15
Civic Center            92.02
Embarcadero             91.75
Soma                    91.22
Mission                 90.74
Financial District      90.59
Russian Hill            90.45
Sunset                  87.69
Chinatown/North Beach   87.24

Mean health inspection scores in San Francisco, by neighborhood.

Somewhat surprisingly, we find that businesses scattered throughout the city’s eastern industrial district are remarkably clean, though this score is probably helped in no small measure by the pristine, slightly precious eateries serving urbanite families atop Potrero Hill. Mercifully, my home, San Francisco’s Mission district, lands squarely in the middle of the pack, though the low end of the spectrum in this neighborhood will surely give even the most hardened diner pause. Finally, bringing up the bottom of the list is lovable, touristic Chinatown, which, as we soon shall see, is not a place you would be well advised to eat again. For a compelling, interactive visualization of the geospatial dimensions of this data set, be sure to check out the awesome Leaflet-based map (below) from Zipfian Academy co-founder Jonathan Dinu.

Interactive visualization of health inspections scores throughout San Francisco.

(Credit: Zipfian Academy & Jonathan Dinu.)

Crimes Against Hygiene

Digging deeper, we can unpack these scores by category of violation to understand exactly where businesses are taking the hit. To be clear, many violations cataloged by the Department of Public Health are those which may have been, at one point or another, committed in our own kitchens: inappropriate cooling methods, improper food storage, unclean nonfood contact surfaces, etc.

Violations of this type aren’t anything to be proud of, but they’re not going to put you in the hospital, either. There is a class of violations, however, that just might.

From the sixty-eight unique violation categories I identified nine that I consider essentially unforgivable. These are the worst of the worst, and one can only hope that it is with great infrequency that you dine in a restaurant that has perpetrated the following transgressions:

High risk vermin infestation

Moderate risk vermin infestation

Employee discharge from eyes, nose, or mouth

Sewage or wastewater contamination

Unapproved living quarters in a food facility

Unsanitary employee garments, hair, or nails

Improper food labeling or menu misrepresentation

Contaminated or adulterated food

Service of previously served foods

The news I have for you on this front is not good.

Below you’ll find the proportion of all businesses in each neighborhood that have incurred at least one of the violations on this list.

Area                    Ratio
Embarcadero             16%
Dogpatch/Potrero        17%
Inner Richmond          19%
Soma                    26%
Financial District      32%
Civic Center            34%
Laurel Heights          41%
Mission                 45%
Sunset                  46%
Russian Hill            50%
Chinatown/North Beach   54%

Proportion of businesses with outrageous health code violations.
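For the curious, a tally like the one above can be sketched in a few lines of Python. This is an illustrative sketch and not the code behind the table: the record layout (pairs of business id and violation description, plus a lookup from business id to neighborhood), the function name flagged_ratio_by_area, and the sample records are all assumptions invented for the example. The violation labels are copied from the list above, and the exact strings in the DPH data may well differ.

```python
from collections import defaultdict

# The nine categories singled out above. Labels are copied from the article's
# list; the exact wording in the DPH records is an assumption.
FLAGGED = {
    "High risk vermin infestation",
    "Moderate risk vermin infestation",
    "Employee discharge from eyes, nose, or mouth",
    "Sewage or wastewater contamination",
    "Unapproved living quarters in a food facility",
    "Unsanitary employee garments, hair, or nails",
    "Improper food labeling or menu misrepresentation",
    "Contaminated or adulterated food",
    "Service of previously served foods",
}


def flagged_ratio_by_area(violations, businesses):
    """Return {area: share of businesses with at least one flagged violation}.

    violations: iterable of (business_id, violation_description) pairs
    businesses: dict mapping business_id -> area name
    """
    # A business counts once, no matter how many flagged violations it has.
    offenders = {biz for biz, desc in violations if desc in FLAGGED}
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for biz, area in businesses.items():
        totals[area] += 1
        if biz in offenders:
            flagged[area] += 1
    return {area: flagged[area] / totals[area] for area in totals}


# Made-up records, purely for illustration.
businesses = {1: "Mission", 2: "Mission", 3: "Sunset", 4: "Sunset",
              5: "Sunset", 6: "Sunset"}
violations = [
    (1, "High risk vermin infestation"),
    (1, "Improper food storage"),
    (2, "Improper food storage"),
    (3, "Service of previously served foods"),
    (3, "Moderate risk vermin infestation"),
]
ratios = flagged_ratio_by_area(violations, businesses)
print(ratios)
```

Note that the denominator is every business in the neighborhood, including those with no recorded violations at all, which matches the "proportion of all businesses" wording above.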

While you savor these findings, notice two interesting things. First and most unsettling is the fact that more than half the businesses in Chinatown and North Beach are committing the infractions listed above. For anyone who’s been to Chinatown, it’s not unrealistic to imagine that a number of businesses are operating on the margins, but even to my somewhat jaded sensibility this is an arrestingly high figure.

The other, more insidious takeaway is that, in even the cleanest neighborhoods, nearly one in five restaurants are operating under conditions that are unsettling at best and dangerous at worst. Recall here that inspections are meted out at random, with little or no forewarning, and it’s not unreasonable to assume that these infractions, where they do occur, are happening on the regular.

Mashup Culture

The city health inspection records are, in their own right, revealing, but the most interesting analyses almost always involve bringing together distinct data sources. To this end, I employed crowd workers to identify the review site ratings and categories associated with over 1,000 businesses documented in DPH records. Presented here are the average health inspection scores across forty-nine restaurant categories with at least twenty-five unique businesses.

Category                    Mean Score
Fast Food                   97.3
Juice Bars & Smoothies      96.0
Elementary Schools          95.1
Food Stands                 94.5
Ice Cream & Frozen Yogurt   93.7
Convenience Stores          93.2
Coffee & Tea                92.8
Dive Bars                   92.5
Hotels                      91.6
Beer, Wine & Spirits        90.9
Caterers                    90.9
Cafes                       90.8
Wine Bars                   90.5
Sandwiches                  90.3
French                      89.9
American (New)              89.9
Mediterranean               89.7
Sushi Bars                  89.5
Delis                       89.5
Lounges                     89.1
Grocery                     89.0
Seafood                     89.0
Donuts                      88.9
Sports Bars                 88.6
Pizza                       88.5
Latin American              88.4
Vegetarian                  88.4
Italian                     88.3
Diners                      88.3
American (Traditional)      88.2
Bakeries                    88.1
Pubs                        88.1
Burgers                     87.8
Spanish                     87.8
Asian Fusion                87.6
Desserts                    87.6
Breakfast & Brunch          87.2
Japanese                    87.2
Korean                      87.0
Middle Eastern              86.9
Bars                        86.5
Mexican                     86.4
Thai                        84.9
Vietnamese                  84.4
Filipino                    84.2
Indian                      81.7
Chinese                     81.6
Dim Sum                     76.3

Mean health inspection score by restaurant category.

Reassuringly, our school cafeterias, while generally devoid of what you and I might consider food, are not unclean, a trend that extends, somewhat surprisingly, to many of your favorite dive bars. There’s much to be said about this list and the spectrum it represents, but in the interest of space I’ll call attention to just one other fact.

Dim sum.

What is the substance, tucked inside those little bundles of ground meat, that puts these businesses so far below the rest of the pack? How many dumplings have you eaten during the course of your life?

What was it, again, that you’ve been eating?

This unexpected feature of the data, namely that a favorite food could come from such miserable kitchens, begs a final comparison.

Here we plot the relationship between a business’s average review score and mean historical health inspection rating. Taking the ratio of the former to the latter (review score divided by inspection score) yields a measure I call the ‘Squick Factor’, the extent to which a place is at once wildly popular and absolutely filthy. The lower right-hand quadrant of this plot shows many such businesses, the worst of which are documented below.

Business Name               Category       Score   Review   Squick
Manila Market & Groceries   Grocery        46      3.1      .0673
Mee Heong Bakery            Bakeries       65.3    4.2      .0643
Howard’s Café               American       59.3    3.8      .0640
Lai Hong Lounge             Lounges        59      3.7      .0627
De La Paz Coffee Roasters   Coffee & Tea   75      4.5      .0600
Gold Coin Trading Co        Meat Shops     78.8    4.7      .0596
Irving Street Cave          Diners         69.3    4.1      .0591
Let’s Eat Grill Stop        Barbeque       66      3.9      .0590
Leopold’s                   German         72.2    4.2      .0581
Amarena                     Italian        73      4.2      .0575
Punjab Kabab House          Indian         64.5    3.7      .0573
Curry Village               Indian         61.7    3.5      .0567
Golden Rice Bowl            Chinese        65      3.6      .0553
La Espiga de Oro            Mexican        78.8    4.4      .0558
Sun Kwong Restaurant        Chinese        72      4        .0555

Popular restaurants with miserable health inspections scores.
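Judging from the figures in the table, the Squick Factor is the average review score divided by the mean inspection score. A minimal sketch of that arithmetic follows, using three rows taken from the table; the function name squick and the tuple layout are assumptions for illustration. Because the published scores are themselves rounded, a recomputed value may differ from the table in the last digit.

```python
def squick(review, health_score):
    """Review score divided by health inspection score (higher is worse)."""
    if health_score <= 0:
        raise ValueError("health_score must be positive")
    return review / health_score


# (name, mean inspection score, average review) for three rows of the table.
rows = [
    ("De La Paz Coffee Roasters", 75.0, 4.5),
    ("Manila Market & Groceries", 46.0, 3.1),
    ("Mee Heong Bakery", 65.3, 4.2),
]

# Rank from most to least squick-inducing.
ranked = sorted(rows, key=lambda r: squick(r[2], r[1]), reverse=True)
for name, score, review in ranked:
    print(f"{name:28s} {squick(review, score):.4f}")
```

A plain ratio is easy to read but leans heavily on the inspection score: a place with a score of 46 tops the list even with a middling 3.1 review, which is worth keeping in mind when reading the ranking.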

Looking Forward

Data science is powerful because it maximizes leverage in the presence of finite resources. Health inspections serve a vital function in protecting the public from food-borne illness, but are expensive and time-consuming to perform. In light of this, my hope is that the data science community can contribute to the public welfare by applying statistical modeling techniques to the wealth of open data to which we have access.

Using a simple linear regression, for example, on restaurants’ postal code, category, and mean review site score I produced models (10-fold cross-validation R-squared values of ~.22) that were able to predict, on average, a previously unseen restaurant’s future health inspection scores within 8.5 points. Additional data, such as whether the restaurant serves alcohol, its hours of operation, and the text of restaurant reviews could surely improve these models’ accuracy. Equipped with such statistical tools, the Department of Public Health could prioritize the inspection of newly opened restaurants in terms of their likelihood of spreading food-borne illness, saving money, time, and potentially lives. It is my intention to see that these tools make it into their hands.
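For readers who want to try something similar, here is a rough sketch of a 10-fold cross-validated linear regression on postal code, category, and review score. It is not the model described above. It assumes NumPy, fits ordinary least squares with np.linalg.lstsq on one-hot encoded categorical features, and reads "within 8.5 points" as a mean absolute error, which is an interpretation. The data are synthetic, with made-up effect sizes, so the printed numbers say nothing about real restaurants and will not reproduce the ~.22 figure.

```python
import numpy as np


def one_hot(values, levels):
    """One indicator column per level; values not in `levels` become all zeros."""
    index = {level: j for j, level in enumerate(levels)}
    out = np.zeros((len(values), len(levels)))
    for i, value in enumerate(values):
        if value in index:
            out[i, index[value]] = 1.0
    return out


def design_matrix(zips, cats, reviews, zip_levels, cat_levels):
    """Intercept, one-hot postal code, one-hot category, numeric review score."""
    return np.hstack([
        np.ones((len(reviews), 1)),
        one_hot(zips, zip_levels),
        one_hot(cats, cat_levels),
        np.asarray(reviews, dtype=float).reshape(-1, 1),
    ])


def cross_validate(zips, cats, reviews, scores, k=10, seed=0):
    """Return (mean R^2, mean absolute error) averaged over k held-out folds."""
    zips, cats = np.asarray(zips), np.asarray(cats)
    reviews = np.asarray(reviews, dtype=float)
    y = np.asarray(scores, dtype=float)
    order = np.random.default_rng(seed).permutation(len(y))
    r2s, maes = [], []
    for test in np.array_split(order, k):
        train = np.setdiff1d(order, test)
        # Levels come from the training fold only, so the test fold stays unseen.
        zip_levels = sorted(set(zips[train]))
        cat_levels = sorted(set(cats[train]))
        X_train = design_matrix(zips[train], cats[train], reviews[train],
                                zip_levels, cat_levels)
        X_test = design_matrix(zips[test], cats[test], reviews[test],
                               zip_levels, cat_levels)
        # lstsq returns the minimum-norm solution, which copes with the
        # redundancy between the intercept and the full one-hot columns.
        beta, *_ = np.linalg.lstsq(X_train, y[train], rcond=None)
        resid = y[test] - X_test @ beta
        ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
        r2s.append(1.0 - np.sum(resid ** 2) / ss_tot)
        maes.append(np.mean(np.abs(resid)))
    return float(np.mean(r2s)), float(np.mean(maes))


# Synthetic data: every effect size below is invented for illustration.
rng = np.random.default_rng(42)
n = 200
ZIP_EFFECT = {"94103": 0.0, "94110": 2.0, "94133": -3.0}
CAT_EFFECT = {"Cafes": 0.0, "Fast Food": 5.0, "Dim Sum": -10.0}
zips = rng.choice(list(ZIP_EFFECT), size=n)
cats = rng.choice(list(CAT_EFFECT), size=n)
reviews = rng.uniform(2.5, 5.0, size=n)
scores = (80.0
          + np.array([ZIP_EFFECT[z] for z in zips])
          + np.array([CAT_EFFECT[c] for c in cats])
          + 2.0 * reviews
          + rng.normal(0.0, 3.0, size=n))

r2, mae = cross_validate(zips, cats, reviews, scores)
print(f"10-fold CV: mean R^2 = {r2:.2f}, mean absolute error = {mae:.2f}")
```

Each fold is scored only on restaurants the model has not seen, which is what makes the resulting error a fair stand-in for how well the model might do on a newly opened business.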

The promise of data is an efficiently functioning society, in which critical decisions are made in the presence of meaningful and actionable information. As data scientists, we live in an exciting time, and occupy an especially privileged position. It is our obligation to harness our abilities in service of the public good, such that we all may benefit from the hidden structure of the modern world.

Caveat comestor.