Cristobal Young is an assistant professor in Stanford's Department of Sociology. He works on quantitative methods, stratification, and economic sociology. In this post, co-authored with Aaron Horvath, he reports on an attempt to obtain replication packages for 53 sociological studies. Spoiler: we need to do better.

Do Sociologists Release Their Data and Code? Disappointing Results from a Field Experiment on Replication.

Replication packages – releasing the complete data and code for a published article – are a growing currency in 21st century social science, and for good reasons. Replication packages help to spread methodological innovations, facilitate understanding of methods, and show confidence in findings. Yet, we found that few sociologists are willing or able to share the exact details of their analysis.

We conducted a small field experiment as part of a graduate course in statistical analysis. Students selected sociological articles that they admired and wanted to learn from, and asked the authors for a replication package.

Out of the 53 sociologists contacted, only 15 of the authors (28 percent) provided a replication package. This is a missed opportunity for the learning and development of new sociologists, as well as an unfortunate marker of the state of open science within our field.

Some 19 percent of authors never replied to repeated requests, or replied at first but never provided a package. More than half (53 percent) directly refused to release their data and code. Sometimes there were good reasons: twelve authors (23 percent) cited legal or IRB limitations on their ability to share their data. But only one of these authors provided the statistical code to show how the confidential data were analyzed.

Why So Little Response?

A common reason for not releasing a replication package was that the author had lost the data, often to a reported computer or hard-drive malfunction. In addition, many authors said they were too busy, or felt that providing a replication package would be too complicated. One author said they had never heard of a replication package. The solution here is simple: compiling a replication package should be part of a journal article's final copy-editing and page-proofing process.
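Compiling a package need not be elaborate. As an illustration only (the folder layout below is our own assumption, not any journal's standard), a minimal sketch in Python of what that final production step might involve: checking that the basic ingredients of a package are present before an article goes to press.

```python
from pathlib import Path

# Assumed minimal layout, not a formal standard: a README, the analysis
# code, and the data (or a note on why confidential data are withheld).
REQUIRED = ["readme", "code", "data"]

def check_package(folder):
    """Return the required top-level items missing from a replication package.

    Matches on the base name before any extension, case-insensitively,
    so 'README.md' satisfies the 'readme' requirement.
    """
    present = {p.name.split(".")[0].lower() for p in Path(folder).iterdir()}
    return [item for item in REQUIRED if item not in present]
```

A journal could run a check like this at page-proof time and send the author a short list of what is still missing, the same way copy-editors flag missing references.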

More troubling is that a few authors openly rejected the principle of replication, saying in effect, “read the paper and figure it out yourself.” One articulated a deep opposition, on the grounds that replication packages break down the “barriers to entry” that protect researchers from scrutiny and intellectual competition from others.

The Case for Higher Standards

Methodology sections of research articles are, by necessity, broad and abstract descriptions of procedures. In most quantitative analyses, however, the exact methods live in code on the author's computer. Readers should be able to download and run replication packages as easily as they can download and read published articles. The methodology section should not be a "barrier to entry," but rather an on-ramp to an open and shared scholarly enterprise.

When authors released replication packages, it was enlightening for students to look "under the hood" of research they admired and see exactly how results were produced. Students finished the process with deeper understanding of – and greater confidence in – the research. Replication packages also serve as a research accelerator: their transparency instills practical insight and confidence – bridging the gap between chalkboard statistics and actual cutting-edge research – and invites younger scholars to build on the shoulders of success. As Gary King has emphasized, replications have become first publications for many students and have helped launch many careers – all while ramping up citations to the original articles.

In our small sample, little more than a quarter of sociologists released their data and code. Top journals in political science and economics now require on-line replication packages. Transparency is no less crucial in sociology for the accumulation of knowledge, methods, and capabilities among young scholars. Sociologists – and ultimately, sociology journals – should embrace replication packages as part of the lasting contribution of their research.

Table 1. Response to Replication Request

Response                                      Frequency   Percent
Yes: Released data and code for paper             15        28%
No: Did not release                               38        72%
  Reasons for "No":
    IRB / legal / confidentiality issue           12        23%
    No response / no follow up                    10        19%
    Don't have data                                6        11%
    Don't have time / too complicated              6        11%
    Still using the data                           2         4%
    'See the article and figure it out'            2         4%
Total                                             53       100%

Note: For replication and transparency, a blinded copy of the data is available on-line. Each author's identity is blinded, but the journal name, year of publication, and response code are available. Half of the requests addressed articles in the top three journals, and more than half of the articles were published in the last three years.

Figure 1: Illustrative Quotes from Student Correspondence with Authors

Positive:

“Here is the data file and Stata .do file to reproduce [the] Tables…. Let me know if you have any questions.”

“[Attached are] data and R code that does all regression models in the paper. Assuming that you know R, you could literally redo the entire paper in a few minutes.”

Negative:

“While I applaud your efforts to replicate my research, the best guidance I can offer is that the details about the data and analysis strategies are in the paper.”

“I don’t keep or produce ‘replication packages’… Data takes a significant amount of human capital and financial resources, and serves as a barrier-to-entry against other researchers… they can do it themselves.”
