Thank you to everyone who attended today’s informational session about the Stanford Computational Journalism Lab. If you weren’t able to come by, feel free to sign up for our mailing list, and/or get in contact with us via email and social media.

During the presentation, I had a slide listing all the datasets I could think of in 5 minutes that were public, important, “big” (100,000 records or more), and “easy” to get – at least compared to the days when you had to send a government agency a shrink-wrapped hard drive:

Not all the datasets are literally just one-click-to-download – you may have to re-read the wget manual or write a quick scraper – but they are freely accessible to anyone with an Internet connection. Here’s the list of datasets with URLs:

Those are just what I could fit into the slide. There are many more, of course:

Many cities and states are (with, in my opinion, surprising enthusiasm) uploading their machine-readable data to Socrata, which provides easy CSV and JSON data exports; here are the Socrata portals for San Francisco and New York City.

The U.S. City Open Data Census and the Police Open Data Census – via Code For America, Sunlight Foundation, and Open Knowledge Foundation – serve as portals to curated city data.

The examples above include only the machine-readable datasets that come directly from purported official sources. This leaves out independently curated data that pertains to public affairs, such as the Supreme Court Database. For federal lawmaking and lawmaker data, the sheer size of Sunlight Foundation’s projects page and GovTrack’s bulk data hint at the breadth of their data-parsing efforts, which have created comprehensive databases from text intended for dead trees.

It’s also worth knowing about ProPublica’s Data Store – not just because it’s my former employer – but because their catalog contains free downloads of different years and versions of various datasets mentioned above. For example, they FOIAed the Medicare Part D Prescribing Data for 2011 and 2012 – which they used in their Prescriber Checkup project. The Centers for Medicare & Medicaid Services has only posted the 2013 data.

Natural language and data mining

This is a little off-topic, but people usually also want datasets that can be used for machine learning and natural language. That’s not my particular expertise but I’ll list a few examples that I know about.

There are plenty of private datasets for data mining: Yelp’s Academic Dataset is probably one of the easiest one-click datasets for interesting text tied to categories and sentiment (i.e. star reviews). And the Internet Movie Database has plaintext dumps of some of their data (though not reviews). And I’ve always wanted to more closely examine the types of questions asked (and incorrectly answered) on Jeopardy; here’s a scrape of j-archive.com. Visit r/datasets for a variety of independently collected datasets, including the corpus of 1.7 billion comments.

The Stanford Network Analysis Project has a large number of datasets geared towards network analysis, including the Enron email dump. If you’re interested in a more recent, albeit one-man-network, check out Jeb Bush’s emails, which he released in response to requests as he began his presidential run.

There are plenty of public text sources besides disclosed emails. But many require an intermediate level of programming ability and a non-trivial amount of patience to collect. If you can muster that, transcripts and rulings (now with diffs!) aren’t terribly difficult to scrape from the U.S. Supreme Court website. Press releases from lawmakers and agencies are another plentiful source. My colleague Justin Grimmer in the Political Science department makes it a habit to practice reproducible research: here’s a Github repo of the 72,000+ U.S. Senate press releases he collected, and here’s his resulting paper: A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases. (2010).

Sunlight has an API for Congressional speeches. And the Federal Register has an API for the text in its rule-making and notices. And Regulations.gov has an API to track not only all of the pending federal regulations but all of the public comments submitted – not sure why I don’t see much (proclaimed) use of it given all the very timely information it potentially contains.

This is barely scratching the surface of freely available (and mostly public) data – I didn’t even mention the federal data.gov portal or the U.S. Census data. I’ve highlighted these datasets because they have plenty of rows while being relatively self-contained and (usually) having decent, if not fun-to-read documentation.

(And one more example, just because it’s too big to not mention: Amazon’s AWS Public Data Sets page is an overwhelming collection of massive and free data sets.)

Prepare to research

I don’t claim that these datasets are easy to analyze, or that they are as complete as they purport to be. In fact, prepare to be frequently frustrated when trying to conduct even the most straightforward of analyses.

For example, a simple aggregate count of the White House visitor records on the visitee_namelast and visitee_namefirst fields – i.e. the name of the White House official being visited – would seem to yield clues about who is “important” at the White House – unless visitors are listed as visiting an official’s scheduling assistant. Or if the most powerful White House officials have their meetings outside of the White House, as the New York Times reported:

“They’re here all the time — all day,” Andre Williams, a manager at Caribou Coffee, said of his White House customers. (He can spot White House officials by the security badges around their necks, or the Secret Service agents lurking nearby.) “A lot of them like lattes — that or a ‘depth charge,’ a coffee with a shot of espresso,” Mr. Williams said. “The caffeine rush — they need it.” Some administration officials and lobbyists say that meeting away from the White House allows officials to get some air without making visitors go through the cumbersome White House security process. Others, however, acknowledge that one motivation is the desire to avoid lobbyists’ names showing up too often on the White House logs.

Despite these inherent problems of the dataset, enterprising news organizations can find interesting bits when they know how to focus their search, such as Politico’s report on secret visits by The Daily Show’s Jon Stewart.

As we mentioned in our presentation, computational journalism requires not just technical skill, but the ability to conduct in-depth research and investigation of the institutions and the processes that produce these datasets. This domain knowledge is needed for effective and sane data-wrangling. However, don’t let that intimidate you – even just knowing the existence of these datasets should be enough to inspire fun explorations and project ideas.

If you find something interesting, we’d love to hear from you. And if studying and practicing computational journalism at Stanford interests you, take a look at the courses we offer in the winter and spring.