Policies ensuring that research data are available on public archives are increasingly being implemented at the government [], funding agency [], and journal [] levels. These policies are predicated on the idea that authors are poor stewards of their data, particularly over the long term [], and indeed many studies have found that authors are often unable or unwilling to share their data []. However, there are no systematic estimates of how the availability of research data changes with time since publication. We therefore requested data sets from a relatively homogeneous set of 516 articles published between 2 and 22 years ago, and found that availability of the data was strongly affected by article age. For papers where the authors gave the status of their data, the odds of a data set being extant fell by 17% per year. In addition, the odds that we could find a working e-mail address for the first, last, or corresponding author fell by 7% per year. Our results reinforce the notion that, in the long term, research data cannot be reliably preserved by individual researchers, and further demonstrate the urgent need for policies mandating data sharing via public archives.

Finally, there was a strong negative relationship between the age of the paper and the probability that the data set was still extant (either “shared” or “exists but unwilling to share”), given that a response indicating the status of the data was received (OR = 0.83 [0.79–0.90, 95% CI], p < 0.0001; Figure 1D). The odds ratio suggests that for every yearly increase in article age, the odds of the data set being extant decreased by 17%.
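To make the per-year interpretation concrete, the following minimal sketch converts an odds ratio into a percentage change in odds and compounds it over several years. The function names are hypothetical and chosen for illustration, and any baseline probability passed in is an assumption of the caller; only the odds ratio of 0.83 comes from the text. Note that the change applies to the odds, not directly to the probability.

```python
def percent_change_in_odds(odds_ratio):
    """Percentage change in the odds per one-year increase in article age."""
    return (odds_ratio - 1.0) * 100.0


def compounded_odds_ratio(odds_ratio, years):
    """Multiplier on the odds after `years`, assuming a constant yearly odds ratio."""
    return odds_ratio ** years


def probability_after(baseline_probability, odds_ratio, years):
    """Probability implied after `years`, starting from a hypothetical baseline."""
    odds = baseline_probability / (1.0 - baseline_probability)
    odds *= compounded_odds_ratio(odds_ratio, years)
    return odds / (1.0 + odds)


# OR = 0.83 corresponds to a 17% yearly decrease in the odds.
print(round(percent_change_in_odds(0.83)))  # -17
```

Because the yearly odds ratio multiplies, the effect over a decade is `0.83 ** 10` rather than ten times 17%, which is why a modest-looking yearly figure can still imply a large decline across the 20-year range studied.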

There was no relationship between age of the paper and the probability of a response given that there was an apparently working e-mail (50% response rate, OR = 1.00 [0.97–1.04, 95% CI]; Figure 1B). There was also no relationship between article age and the probability that the response indicated the status of the data, given a response was received (83% useful responses, OR = 1.00 [0.95–1.07, 95% CI]; Figure 1C).

We note that eight e-mail addresses generated an error message but did lead to a response from the authors. It also seems likely that some addresses failed but did not generate an error message, leading us to record a “no response” rather than “e-mail not working,” although unfortunately the frequency of these cannot be estimated from our data.

There was a negative relationship between the age of the paper and the probability of finding at least one apparently working e-mail either in the paper or by searching online (odds ratio [OR] = 0.93 [0.90–0.96, 95% confidence interval (CI)], p < 0.00001). The odds ratio suggests that for every year since publication, the odds of finding at least one apparently working e-mail decreased by 7% (Figure 1A). Since we searched for e-mails in both the paper and online, four factors contribute to the probability of finding a working e-mail: (1) the number of e-mails in the paper, (2) the chance that any of those worked, (3) the number of e-mails we could find by searching online, and (4) the chance that any of those worked. The total number of e-mail addresses we found in the paper decreased with age (Poisson regression coefficient = −0.07, SE = 0.01, p < 0.0001) from an average of 1.17 in 2011 to 0.42 in 1991 (Figure 2A), and there was a slight positive effect of article age on the number of e-mails we found online (Poisson regression coefficient = 0.015, SE = 0.007, p < 0.05; Figure 2C). Moreover, the chance that an e-mail found in the paper or online appeared to work also declined with article age (OR = 0.96 [0.926–0.998, 95% CI], p < 0.05; and OR = 0.97 [0.936–0.997, 95% CI], p < 0.05; respectively), such that the odds that an e-mail appeared to work declined by 4% and 3% per year since publication, respectively (Figures 2B and 2D).
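The Poisson coefficients are on a log scale, so they can be read as multiplicative changes in the expected number of e-mails per year of article age. The sketch below shows that conversion. The function names are hypothetical, and the Wald-type interval (coefficient ± 1.96 SE, then exponentiated) is an assumption made here for illustration; it is not an interval reported in the analysis.

```python
import math


def rate_ratio(coefficient):
    """Multiplicative change in the expected count per one-year increase."""
    return math.exp(coefficient)


def wald_interval(coefficient, standard_error, z=1.96):
    """Approximate 95% interval for the rate ratio (assumed Wald-type interval)."""
    lower = math.exp(coefficient - z * standard_error)
    upper = math.exp(coefficient + z * standard_error)
    return lower, upper


# Coefficients and standard errors as given in the text.
in_paper = rate_ratio(-0.07)   # e-mails found in the paper
online = rate_ratio(0.015)     # e-mails found by searching online
print(f"in paper: x{in_paper:.3f} per year, online: x{online:.3f} per year")
```

On this reading, a coefficient of −0.07 corresponds to roughly 7% fewer e-mail addresses in the paper for each additional year of age, and 0.015 to about 1.5% more addresses found online.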

Figure 2. The line indicates the predicted value from a Poisson (A and C) or logistic (B and D) regression, the gray area shows the 95% CI of this estimate, and the red dots indicate the actual values from the data.

Figure 1. In all panels, the line indicates the predicted probability from the logistic regression, the gray area shows the 95% CI of this estimate, and the red dots indicate the actual proportions from the data.

We used logistic regression to formally investigate the relationships between the age of the paper and (1) the probability that at least one e-mail appeared to work (i.e., did not generate an error message), (2) the conditional probability of a response given that at least one e-mail appeared to work, (3) the conditional probability of getting a response that indicated the status of the data (data lost, data exist but unwilling to share, or data shared) given that a response was received, and, finally, (4) the conditional probability that the data were extant (either “shared” or “exists but unwilling to share”) given that an informative response was received.
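The conditional structure of these four models can be sketched as follows: each model is fitted only to the papers that passed the preceding stage. This is a minimal illustration, not the code used for the analysis. The Newton-Raphson fitter, the function names, the record field names, and the toy records are all hypothetical, and the toy records are invented purely so that the example runs.

```python
import math


def fit_logistic(x, y, iterations=25):
    """Fit logit(p) = b0 + b1 * x by Newton-Raphson; return (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += float(yi) - p          # gradient, intercept
            g1 += (float(yi) - p) * xi   # gradient, slope
            h00 += w                     # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def conditional_odds_ratio(records, outcome, given=None):
    """Odds ratio per year of age for `outcome`, among records where `given` is true."""
    subset = [r for r in records if given is None or r[given]]
    _, slope = fit_logistic([r["age"] for r in subset],
                            [r[outcome] for r in subset])
    return math.exp(slope)


# Hypothetical toy records (NOT data from the study).
toy_records = (
    [{"age": 2, "email_works": True, "responded": r} for r in (True, True, True, False)]
    + [{"age": 20, "email_works": True, "responded": r} for r in (True, False, False, False)]
    + [{"age": 2, "email_works": False, "responded": False}]
    + [{"age": 20, "email_works": False, "responded": False}] * 2
)

# Model 1 uses every paper; model 2 only those with an apparently working e-mail.
print(conditional_odds_ratio(toy_records, "email_works"))
print(conditional_odds_ratio(toy_records, "responded", given="email_works"))
```

Models 3 and 4 would follow the same pattern, conditioning on a response having been received and on that response being informative, respectively. In practice a statistics package would also supply the confidence intervals and p values, which this sketch omits.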

We investigated how research data availability changes with article age. To avoid potential confounding effects of data type and different research community practices, we focused on recovering data from articles containing morphological data from plants or animals that made use of a discriminant function analysis (DFA). Our final data set consisted of 516 articles published between 1991 and 2011. We found at least one apparently working e-mail for 385 papers (74%), either in the article itself or by searching online. We received 101 data sets (19%) and were told that another 20 (4%) were still in use and could not be shared, such that a total of 121 data sets (23%) were confirmed as extant. Table 1 provides a breakdown of the data by year.

Discussion

We found a strong effect of article age on the availability of data from these 516 studies. The decline in data availability could arise because the authors of older papers were less likely to respond, but this was not supported by the data. Instead, researchers were equally likely to respond (Figure 1B) and to indicate the status of their data (Figure 1C) across the entire range of article ages.

The major cause of the reduced data availability for older papers was the rapid increase in the proportion of data sets reported as either lost or on inaccessible storage media. For papers where authors reported the status of their data, the odds of the data being extant decreased by 17% per year (Figure 1D). There was a continuum of author responses between the data being reported lost and being stored on inaccessible media, and they seemed to vary with the amount of time and effort involved in retrieving the data. Responses ranged from authors being sure that the data were lost (e.g., on a stolen computer), through thinking that they might be stored in some distant location (e.g., their parents’ attic), to having some degree of certainty that the data are on a Zip or floppy disk in their possession but no longer having the appropriate hardware to access it. In the latter two cases, the authors would have to devote hours or days to retrieving the data. Our reason for needing the data (a reproducibility study) was not especially compelling for authors, and we may have received more of these inaccessible data sets if we had offered authorship on the subsequent paper or said that the data were needed for an important medical or conservation project.

The odds that we were able to find an apparently working e-mail address (either in the paper or by searching online) for any of the contacted authors did decrease by about 7% per year. This decrease was partly driven by a dearth of e-mail addresses in articles published before 2000 (0.38 per paper on average for 1991–1999) compared with those published after 2001 (1.08 per paper on average; Figure 2A). Wren et al. [] found a similar increase in the number of e-mails in articles published after 2000. The larger number of e-mails in recent papers may mean that the issue of missing author e-mails is restricted to articles from before 2000: researchers in, e.g., 2031 will be able to try a wider range of addresses in their attempts to contact authors of articles published in 2011.

The proportion of e-mails from the paper that appeared to work declined with article age between 2 and 14 years of age and then rose to around 80% for articles from 1991, 1993, and 1995 (Figure 2B). These latter three proportions are only based on a total of 13 e-mail addresses. Wren et al. [] reported a steep decline with age in the proportion of functioning e-mails from papers published between 1995 and 2004, such that 84% of their 10-year-old e-mails returned an error message. Our proportions for 10-year-old e-mails are lower, with only 51% of e-mails from 2003 returning an error. It may be that e-mail addresses are becoming more stable through time, although this clearly requires additional study. The arrival of author identification initiatives like ORCID [] and online research profiles such as ResearchGate or Google Scholar should make it easier to find working contact information for authors in the future.

Considering only the papers from 2011, our results show that asking authors for their data shortly after publication does yield a moderate proportion of data sets (∼40%). A comparable study [] received 59% of the requested data sets from papers that were less than a year old. It is hard to tell whether this difference is due to the slightly different research communities involved or the presence of an extra year between publication and the data request in this study. A related paper by Wicherts et al. in 2005 [] received only 26% of requested psychology data sets.

Overall, we only received 19.5% of the requested data sets, and only 11% for articles published before 2000. We found that several factors contribute to these low proportions: nonworking e-mails, a 50% response rate, and sometimes the lack of an informative response from the authors. However, when the authors did give the status of their data, the proportion of data sets that still existed dropped from 100% in 2011 to 33% in 1991 (Figure 1D). Unfortunately, many of these missing data sets could be retrieved only with considerable effort by the authors, and others are completely lost to science.
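The way these factors combine can be sketched as a chain of conditional probabilities that mirrors the four regression models. The symbols E (working e-mail found), R (response received), I (response indicates the status of the data), and X (data extant) are shorthand introduced here, not notation from the analysis. The numerical line uses the rounded overall proportions reported in the Results (74%, 50%, and 83%), so it is only approximate; for example, it ignores the eight responses that came from addresses that generated an error message.

```latex
\begin{align*}
P(\text{confirmed extant}) &= P(E)\,P(R \mid E)\,P(I \mid R)\,P(X \mid I) \\
P(E)\,P(R \mid E)\,P(I \mid R) &\approx 0.74 \times 0.50 \times 0.83 \approx 0.31
\end{align*}
```

On these rounded figures, roughly three in ten requests yielded an informative response at all, before any loss of the data themselves is taken into account.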