This study estimates the effect of data sharing on the citations of academic articles, using journal policies as a natural experiment. We begin by examining 17 high-impact journals that have adopted the requirement that data from published articles be publicly posted. We match these 17 journals to 13 journals without policy changes and find that empirical articles published just before the change in editorial policy have citation rates that are not statistically distinguishable from those of articles published shortly after the shift. We then ask whether this null result stems from poor compliance with data sharing policies, and use the policy changes as instrumental variables to examine more closely two leading journals in economics and political science with relatively strong enforcement of their new data policies. We find that articles that make their data available receive 97 additional citations (standard error = 34). We conclude that: a) authors who share data may eventually be rewarded with additional scholarly citations, and b) data posting policies alone do not increase the impact of articles published in a journal unless those policies are enforced.

Funding: GC, EM: Laura and John Arnold Foundation, grant number 040951 (http://www.arnoldventures.org). Publication made possible in part by support from the Berkeley Research Impact Initiative (BRII), sponsored by the UC Berkeley Library. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Verifiability and replicability are fundamental to science. The Royal Society’s motto “nullius in verba” (“take nobody’s word for it”) encourages scientists to verify the claims of others. By sharing data, scientists can increase the verifiability and credibility of their claims. Most academic journals and professional societies encourage researchers to share their data, but these are often informal recommendations; until recently, few journals required it.

The ease of posting data on the internet has lowered the cost of data sharing; accordingly, advocates of open science have argued that data posting should be standard practice [1], and a growing number of scientific journals have begun requiring that authors publicly post their data. However, this requirement remains more the exception than the rule in many fields, and researchers have not routinely posted their data unless journals require them to do so [2–4].

Researchers give several reasons for their failure to post data. Some highlight costs to the individual, including the effort required, the potential for being scooped, and the risk of being shown to be in error. But there are also benefits to posting data. If research with posted data is more persuasive or believable, it might have greater impact. Moreover, data sets are often useful for analyses that the original author(s) did not think of or chose not to conduct, and researchers typically cite an article when they use its posted data. It therefore seems plausible that sharing the data used in an article would increase its citations. Indeed, several papers across disciplines over the last decade consistently report a positive association between data sharing and citations: data sharing is positively associated with citations in fields as varied as cancer microarray trials [5], gene expression microarrays [6], astrophysics [7, 8], paleoceanography [9], and peace and conflict studies [10], and sharing computational code is positively associated with citations in the image processing literature [11]. These studies range in size (from N = 85 to N > 10,000) and report estimated increases ranging from 9% to 69%, with most between 20% and 40%. They typically focus on a single discipline or subject area. To our knowledge, none explicitly exploits a change in journal policy to estimate a treatment effect.

The objective of this paper is to determine whether sharing the data behind a research article results in more citations for that article, using changes to journal policies as a natural experiment. If sharing data does increase citations, then this private benefit could popularize the practice and improve science. We are able to study this question across a wide variety of disciplines, enhancing the generalizability of our findings. Our approach has limitations, however: we can measure whether articles garner more citations, but not why publicly posted data would or would not lead to additional citations. We are also limited to observational rather than experimental data; we discuss the methods employed to deal with this below.

Methods

A simple comparison of citations between articles published with and without posted data is difficult to interpret: authors who post their data may be systematically different from those who do not, and they may choose to publish in different journals. We minimize this problem by focusing on journals that began, at least in principle, to require data posting. This natural experiment enables us to compare articles published before and after the change, exploiting plausibly exogenous variation in data availability caused by shifts in editorial policy.
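To make this comparison concrete, the following is a minimal sketch of a within-journal before/after comparison. The data set and the variable names (journal, post, citations) are our own illustrative assumptions, not the study's actual data or code:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical article-level data (illustrative values only): one row per
# article, recording its journal, whether it appeared after the journal's
# data sharing policy took effect, and citations five years post-publication.
articles = pd.DataFrame({
    "journal":   ["J1", "J1", "J1", "J1", "J2", "J2", "J2", "J2"],
    "post":      [0, 0, 1, 1, 0, 0, 1, 1],
    "citations": [40, 55, 62, 70, 12, 18, 15, 21],
})

# Journal fixed effects absorb level differences across journals, so the
# coefficient on `post` is the average within-journal citation difference
# between articles published just after versus just before the change.
model = smf.ols("citations ~ post + C(journal)", data=articles).fit()
print(model.params["post"])
```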

Two separate and independent teams of researchers, both based at the University of California, Berkeley, serendipitously learned of each other’s plans to exploit the natural experiment caused by shifts in journal policy. One team (Moore and Rose, hereafter MR) collected a broad sample of articles and used within-journal variation in citations to hold the journal (and thus the research community) constant, focusing attention on the effect of the change in data sharing policy. The other team (Christensen, Dafoe, and Miguel, hereafter CDM) collected deeper, more detailed data on a smaller subset of articles, without knowing the results from the MR sample. The data collection processes of both teams appear in Fig 1, and the appendix explains the timing of the teams’ interactions.

Broad analysis

To exploit the change in journal policies, MR systematically searched the top 250 scientific journals, as identified by SCImago (http://scimagojr.com), and identified all those that changed their policies to require data posting for published articles. Following the MR pre-analysis plan (https://osf.io/pxdch/), we collected citation count data for empirical articles (those that analyze quantitative data) published immediately after a change in data posting policy, as well as analogous citation counts for articles published in the period before the regime change. The MR analysis examined, for each journal, 200 empirical articles or two years’ worth of articles (whichever was less) on either side of a policy change. Research assistants recorded the annual flow of new Web of Science citations received one, two, three, four, and five years post-publication for each of these articles. This enables us to compare the difference in citations for articles published in the same journal shortly before and after a new data sharing requirement.

To account for the possibility that events influenced both the change in journal policy and citation rates, which could bias our estimates, we collected comparable data for two natural comparison sets. First, we gathered data on theoretical articles published in the same journals; since these do not use empirical data, their citations should be largely unaffected by any change in data posting policy. Second, we matched the 17 treatment journals (which began to require data sharing) to control journals (which did not) and collected comparable citation data for empirical articles published in the control journals.

We selected control journals using conventional one-to-one propensity score matching with replacement [12, 13] from top-ranked journals that most closely match the treatment journals on SCImago criteria. Our objective was to identify control journals that did not require data posting but were otherwise as similar as possible to the treatment journals in terms of observable characteristics. Accordingly, we began with non-treatment journals (those that never required data posting) from the same SCImago “Top 250” list from which we identified our treatment journals. We matched treatment to control journals using the six indicators used to create the SCImago list itself; these criteria appear on the SCImago website. The six variables are: a) the journal’s h-index; b) the total number of citable documents published in the journal over the last three years; c) citations per document over the last two years; d) references per document; e) the country where the journal is published; and f) the category of the journal. Since our treatment journals were published in only two countries, we created a binary variable for journals published in the UK, leaving US journals as the default. And since we have only a limited number of treatment journals, we consolidated journal category into eight areas: a) Biology; b) Ecology; c) Economics; d) Medicine; e) Molecular Biology; f) Multidisciplinary; g) Sociology and Political Science; and h) Miscellaneous. We then created a binary treatment variable, coded 1 for our treatment journals and 0 for all remaining (potential control) journals, and estimated a cross-sectional probit equation; the results are tabulated in our online supplement.
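As a minimal sketch of this estimation step, the probit model regresses the treatment indicator on the six SCImago matching variables. The data here are synthetic and the variable names are our own; the actual estimates appear in the online supplement:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 250  # a pool the size of the SCImago "Top 250" list

# Synthetic journal-level data standing in for the SCImago variables;
# all names and values here are illustrative assumptions.
journals = pd.DataFrame({
    "treated": (rng.random(n) < 0.07).astype(int),  # ~17 of 250 journals
    "h_index": rng.integers(50, 400, n),
    "citable_docs_3yr": rng.integers(100, 3000, n),
    "cites_per_doc_2yr": rng.uniform(1, 40, n),
    "refs_per_doc": rng.uniform(10, 60, n),
    "uk": rng.integers(0, 2, n),  # published in the UK (US is the default)
    "category": rng.choice(["Biology", "Economics", "Medicine"], n),
})

# Cross-sectional probit of treatment status on the six SCImago criteria.
probit = smf.probit(
    "treated ~ h_index + citable_docs_3yr + cites_per_doc_2yr"
    " + refs_per_doc + uk + C(category)",
    data=journals,
).fit(disp=False)

# The predicted probabilities serve as propensity scores for matching.
journals["pscore"] = probit.predict(journals)
```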
After estimating the probit model, we matched each treatment journal to a single control journal, choosing the journal with the closest predicted probit score within the same journal category. Sometimes this resulted in more than one treatment journal being matched to the same control journal. Both the list of control journals and the probit regression estimates are freely available online (see https://osf.io/67c5z/). We are left with 13 unique control journals alongside the 17 treatment journals; the MR analysis thus includes data from 30 distinct scientific journals. Appendix A provides the list of journals and more details on the data construction procedure.

We divide our data into citations for three types of research articles, as described above: a) empirical articles from the 17 treatment journals, our chief interest; b) theoretical articles from the treatment journals; and c) empirical articles from the 13 control journals. For each set of articles, we further split the data into articles published before and after the imposition of the data sharing policy. Control journals, by construction, do not experience any policy shift; we use the corresponding dates from the matched treatment journals.
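The matching step described above can be sketched as follows. The journal names, categories, and scores are invented for illustration; the only substantive point is that matching is one-to-one on the propensity score, within category, and with replacement:

```python
import pandas as pd

# Tiny synthetic stand-in for the journal data set after the probit step,
# carrying a predicted propensity score for each journal (values invented).
journals = pd.DataFrame({
    "name":     ["T1", "T2", "T3", "C1", "C2"],
    "treated":  [1, 1, 1, 0, 0],
    "category": ["Biology", "Biology", "Economics", "Biology", "Economics"],
    "pscore":   [0.31, 0.28, 0.45, 0.30, 0.40],
})

treated  = journals[journals["treated"] == 1].copy()
controls = journals[journals["treated"] == 0]

def nearest_control(row):
    # Candidates are control journals in the same consolidated category;
    # choose the one with the closest propensity score. Matching is with
    # replacement, so one control can match several treatment journals.
    pool = controls[controls["category"] == row["category"]]
    return pool.loc[(pool["pscore"] - row["pscore"]).abs().idxmin(), "name"]

treated["match"] = treated.apply(nearest_control, axis=1)
print(treated[["name", "match"]])
# Here T1 and T2 both match C1: with replacement, treatment journals can
# share controls, which is how 17 treatment journals can map to only
# 13 unique control journals in the actual MR sample.
```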