With the launch of 538, Vox and the New York Times’ Upshot, it seems the age of data journalism is finally here, greeted with both acclaim and concern by media critics. But data journalism is not a new thing. These new sites are just the latest iteration of news applications, which were an iteration of computer-assisted reporting, which was an iteration of precision journalism, all of which are just names for specific techniques and approaches used in the service of reporting the truth and finding the story. In other words, it’s journalism that starts from interrogating the data, applying the same skepticism and rigor that we apply to the testimony of an expert reached through traditional phone-assisted reporting.

All of which is to say that data journalism inherits a long tradition of journalists working with data, and with that comes the heavy responsibility to get it right. Specifically, to paraphrase something I heard at a NICAR conference once: fear and paranoia are the best friends a data journalist can have. I think about this often when I work with data, because I am terrified of making a dumb mistake. The public has only a limited tolerance for fast-and-loose data journalism and we can’t keep fucking it up.

Critique is always annoying when it’s expressed in indefinite terms. So, I’m going to do something I don’t normally like to do and pick a recent example of a data journalism story gone wrong. This is not to scold those who reported it—indeed, I’m well aware of how easy it is for me to make similar mistakes—but because a specific example provides an explicit illustration of how reporting on data can go wrong and what we can learn from it. And so, let’s begin by talking about porn.

Specifically, a story about online pornography consumption in “red” vs. “blue” states that exploded onto social media a few weeks back. I first noticed it because of a story on Vox that reaggregated an Andrew Sullivan post which in turn reposted a chart made by Christopher Ingraham of the data provided by Pornhub for their study. That chain of links reflects how news spreads online these days, and yet none of those professional eyes caught some glaring flaws in the data.

Before I continue, here’s a brief summary of the findings presented by Pornhub’s data scientists. Pornhub (which is apparently the third most-popular pornography site on the Internet) was approached by Buzzfeed (which is probably the most-popular animated GIF distributor on the Internet) to analyze its traffic and determine whether “blue” states that voted for Obama in the last election consumed more pornography than “red” states that voted for Romney. And so, that’s what the statisticians at Pornhub did, pulling IP addresses from their website’s traffic logs, geocoding their likely locations and deriving a figure of total traffic for each state. They then divided the total hits from each state by that state’s population to derive a hits-per-capita number for each state. As a result, they were able to report per-capita averages for each state and to note that blue states averaged slightly more hits per capita than red states.
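The per-capita arithmetic is simple enough to sketch in a few lines. The traffic totals and populations below are invented for illustration and are not Pornhub’s actual figures:

```python
# Hypothetical traffic totals and populations -- invented for illustration,
# not Pornhub's actual figures.
state_hits = {"Kansas": 194_000_000, "Vermont": 106_000_000}
state_population = {"Kansas": 2_885_000, "Vermont": 626_000}

def hits_per_capita(hits, population):
    """Divide each state's total page hits by its population."""
    return {state: hits[state] / population[state] for state in hits}

per_capita = hits_per_capita(state_hits, state_population)
for state, rate in sorted(per_capita.items(), key=lambda kv: -kv[1]):
    print(f"{state}: {rate:.1f} hits per capita")
```

Note that everything downstream of this division inherits whatever errors are baked into the raw hit counts, which is exactly where the trouble starts.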

How To Confuse Yourself With Statistics

Unfortunately, the study and the subsequent reporting derived from the Pornhub data serve as a vivid example of six ways to make mistakes with statistics:

Sloppy proxies

Dichotomizing

Correlation does not equal causation

Ecological inference

Geocoding

Data naivete

The first issues begin with the selection of the proxy. In statistics, a proxy is a variable that is used when it’s impossible to measure something directly; for instance, using per-capita GDP as a measure of standard of living. Buzzfeed titled its article about the Pornhub study “Who Watches More Porn: Republicans Or Democrats?” Let’s assume that’s the question Buzzfeed wanted to answer. How would they do it? In an ideal world, they could ask every single Democrat and Republican in the country about their porn-watching preferences, but this is obviously unfeasible. The next best thing would be to conduct a survey of a randomly selected group of individuals that shares similar characteristics with the national population. But that takes time and money and math, so instead Buzzfeed turned to their friends at Pornhub to derive an answer from the data they had on hand.

In this case, they used page requests to the third most-popular online porn site as a proxy for all pornography consumption, and the percentage of people who voted for Obama or Romney as proxies for registered Democrats and Republicans. These proxies are not the same thing as the quantities they stand in for, so distortion is inevitable. For instance, maybe in some states people widely prefer to get their pornography via on-demand cable or a sketchy video store, so they would be undercounted in the Pornhub figures. Similarly, this study uses total pageviews as a proxy for site users; the two are not necessarily the same, and it’s unclear whether more pageviews means a corresponding linear increase in users. In addition, given that a large number of Americans identify themselves as independents, is it accurate to classify those voters as red or blue based on a single election? Proxies give us a means to derive answers, but they may not always be appropriate for the questions being asked.

The problems continue from there. For their analysis, Pornhub sorted states into red and blue ones.
This seems like it makes sense, but it flattens a continuous variable (the percentage of the state population that voted for Obama) into a binary condition (Romney wins / Obama wins). It’s likely this dichotomizing had a palpable effect, since it makes a battleground state like Virginia seem closer to a Democratic stalwart like Vermont than to its ideological “red state” neighbors in the South. Fortunately, some statisticians identified and corrected for this issue, producing a more accurate scatter plot of the states against their vote share for Obama. The result: a correlation in which increased porn consumption accounted for about 16% of the variance in a state’s vote percentage for Obama. Success!

But wait. Here we stumble into two of the most classic mistakes people make with statistics. First, correlation does not equal causation. You’ve probably heard that a hundred times before, but this is an actual illustration of why it matters. It’s entirely possible that the suggested relationship between the two variables is a total coincidence. Far more likely, though, is that the variables are related only through a confounding variable that connects the two. For instance, blue states might have greater broadband penetration, which would favor Internet porn. Or it could be that people in urban areas consume more Internet porn, and states with more urban areas also trend Democratic. Confounding variables are common, and this piece by Jonathan Stray contains a solid overview of them and other spurious correlations. Or, if you’d prefer a sarcastic look, here are correlations of voting to herpes infection or Nickelback listening. Putting it bluntly, these red state-blue state comparisons are statistical fluff, often reflecting the whimsy of the reporter more than anything real.

But what is the second mistake? For the sake of argument, let’s assume that we’ve avoided all the problems above.
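The confounding-variable scenario described above is easy to simulate. In this minimal sketch (all numbers invented), an “urbanization” variable independently drives both a simulated Obama vote share and a simulated porn-traffic rate; the two outcomes never reference each other, yet they come out strongly correlated:

```python
import random

random.seed(42)

# Simulate 50 states. 'urban' is the hidden confounder: it independently
# drives both vote share and per-capita porn traffic. The two outcome
# variables never touch each other directly.
urban = [random.uniform(0, 1) for _ in range(50)]
vote_share = [0.35 + 0.30 * u + random.gauss(0, 0.05) for u in urban]
porn_rate = [50 + 40 * u + random.gauss(0, 5) for u in urban]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

r = pearson_r(vote_share, porn_rate)
print(f"r = {r:.2f}, r^2 = {r * r:.2f}")  # a strong "relationship" with no causal link
```

Nothing in this toy model says Democrats watch more porn; the correlation is entirely an artifact of the shared driver, which is exactly the trap the broadband and urbanization hypotheses describe.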
Let’s grant that Internet porn is a valid proxy for all pornography, that voting for a specific candidate in the last presidential election is a valid measure of party affiliation, and that the correlation is not due to any hidden variables. Then we can definitively say that Democrats consume more porn than Republicans, right? Wrong. Meet the ecological inference fallacy.

In short, just because you’ve derived some average measure about a group that contains more of a subpopulation, that doesn’t necessarily mean it’s true for individuals in that group, especially when the difference is so slight. It’s possible that Democrats really do consume more porn and that’s what makes for the higher per-capita numbers in blue states. But it could also be that Republicans in Democrat-dominated states consume more porn than those in Republican-dominated ones, and that is what is pushing up the average. Or it could be that urban areas consume more pornography and also tend to contain more Democrats, without the two being directly connected. We simply don’t have enough insight into the individual population to say. And we definitely don’t have any insight into specific people based on these broad statistics. Knowing that your neighbor is a Republican or a Democrat tells you nothing about their porn consumption, regardless of the averages derived for each population.
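The ecological fallacy can be demonstrated with invented numbers. In the sketch below, Republicans out-consume Democrats within both states, yet the “blue” state still posts the higher state-wide average, so the group averages tell you nothing reliable about individuals:

```python
# Invented figures: each party's (population share, per-capita consumption rate)
# within two hypothetical states. In BOTH states, Republicans consume more
# than Democrats -- yet the blue state's overall average is higher.
states = {
    "blue_state": {"dem": (0.60, 100), "rep": (0.40, 120)},
    "red_state":  {"dem": (0.40, 70),  "rep": (0.60, 90)},
}

def state_average(groups):
    """Population-weighted average consumption rate for one state."""
    return sum(share * rate for share, rate in groups.values())

for name, groups in states.items():
    print(f"{name}: {state_average(groups):.0f}")  # blue_state: 108, red_state: 82
```

A reporter looking only at the two state averages would conclude the blue-state pattern reflects Democratic habits, when in this construction the opposite is true inside every state.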

We’re Not in Kansas Anymore

Unfortunately, the worst error was yet to come. A lot of the early reporting on this study noticed a bizarre anomaly in the data: Kansas, a very red state, consumed an extremely high amount of porn per capita compared to the average for all other states. This is readily apparent when the numbers are graphed in a simple bar chart, but it really jumps out when the states are plotted on a scatterplot of Obama vote share vs. page hits. If you assumed, as Pornhub did, that average porn consumption was normally distributed across all states, Kansas’ average was highly unlikely: at more than 2.95 standard deviations above the average, there would be only a 0.16% chance of it occurring if it were truly random.

An extreme outlier like this should make you sit up and take notice as a data journalist, because it can only mean one of two things. Either you’ve really found an extreme case that reveals something bizarre and newsworthy, or, as one reader of Andrew Sullivan’s website figured out while all the journalists shrugged their shoulders, the data is flawed.

Pornhub’s writeup omitted any explicit description of their methodology (never a good sign), but it seems to have involved mapping the IP addresses from which users visited the site to geographic coordinates and reverse geocoding those coordinates to get states. The statisticians at Pornhub, and the journalists who confidently reported their findings, assumed this was a clean process, but any programmer with experience can tell you the bitter truth: geocoding is often rubbish. What happened here was that a large percentage of IP addresses could not be resolved to a location any more specific than “USA.” When that address was geocoded, it returned a point at the centroid of the continental United States, which placed it in the state of, you guessed it, Kansas!
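A sanity check worth running on any IP-geocoded dataset is to flag points sitting on (or suspiciously near) the country-level centroid, which is where many geocoders dump addresses they can only resolve to “USA.” A minimal sketch; the centroid coordinates (near Lebanon, Kansas) and the tolerance are assumptions for illustration, and the IPs are made up:

```python
# Assumed centroid of the contiguous United States and an illustrative
# tolerance -- tune both for the geocoder actually in use.
US_CENTROID = (39.83, -98.58)
TOLERANCE = 0.25  # degrees, roughly a 25 km box

def looks_like_fallback(lat, lon, centroid=US_CENTROID, tol=TOLERANCE):
    """True if a geocoded point sits suspiciously close to the country centroid."""
    return abs(lat - centroid[0]) <= tol and abs(lon - centroid[1]) <= tol

rows = [
    ("1.2.3.4", 39.83, -98.58),  # unresolvable IP dumped at the centroid
    ("5.6.7.8", 44.26, -72.58),  # a genuinely Vermont-located point
]
suspect = [ip for ip, lat, lon in rows if looks_like_fallback(lat, lon)]
print(suspect)  # ['1.2.3.4']
```

Any dataset where a meaningful fraction of rows trips this check should not be aggregated by state until those rows are excluded or re-resolved.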
Sadly, IP geocoding is prone to other distortions from networking architecture; for instance, at one time every user of AOL’s nationwide dialup service looked like they were connecting to the Internet from Reston, Virginia. Right now, my corporate VPN makes me look like I’m surfing the web from New Jersey even though I live in Maryland.

Of course, shifting Kansas’ average downwards doesn’t change Pornhub’s hypothesis that blue states consume more porn per capita than red states. I’ve already argued my concerns with that; I bring up this specific error because of the central failure it illuminates. If you want to call yourself a data journalist, there is one shortcut you can never take: you must validate your data. Even the cleanest-looking data might contain flaws and omissions stemming from its methodology. It’s not enough to run checks on the data itself. You must also lift your nose out of the database, ask serious questions about how the data was collected, and even use the well-honed tools of a traditional reporter to call experts when (never if) you find questions about the data.
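Part of that validation can be automated. Below is a minimal sketch of the kind of outlier screen that would have flagged Kansas, using invented per-capita figures, along with a check of the tail probability quoted earlier for a 2.95-sigma outlier under a normal assumption:

```python
import math

def flag_outliers(values, threshold=2.5):
    """Return indices of values more than `threshold` standard deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [i for i, v in enumerate(values) if abs(v - mean) / sd > threshold]

# Invented per-capita figures for ten states; the last one is wildly inflated.
state_figures = [88, 92, 85, 90, 87, 91, 89, 86, 93, 160]
print(flag_outliers(state_figures))  # [9] -- the inflated entry stands out

# Tail probability of a 2.95-sigma observation under a normal assumption:
p = 0.5 * math.erfc(2.95 / math.sqrt(2))
print(f"{p:.4f}")  # 0.0016, i.e. about 0.16%
```

A flagged index is not proof of bad data; it is the cue to pick up the phone and ask how the numbers were made, which is the step everyone skipped here.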