Abstract
Efforts to make research results open and reproducible are increasingly reflected in journal policies encouraging or mandating authors to provide data availability statements. As a consequence, there has been a strong uptake of data availability statements in the recent literature. Nevertheless, it is still unclear what proportion of these statements actually contain well-formed links to data, for example via a URL or permanent identifier, and whether there is added value in providing such links. We consider 531,889 journal articles published by PLOS and BMC, develop an automatic system for labelling their data availability statements according to four categories based on their content and the type of data availability they display, and finally analyze the citation advantage of different statement categories via regression. We find that, following mandated publisher policies, data availability statements have become very common: in 2018, 93.7% of 21,793 PLOS articles and 88.2% of 31,956 BMC articles had one. Statements containing a link to data in a repository, rather than stating that data are available on request or included as supporting information files, remain a fraction of the total: in 2017 and 2018, 20.8% of PLOS publications and 12.2% of BMC publications provided a DAS containing a link to data in a repository. Using a citation prediction model, we also find that articles with statements linking to data in a repository are associated with up to 25.36% (± 1.07%) higher citation impact on average. We discuss the potential implications of these results for authors (researchers) and journal publishers who make the effort of sharing their data in repositories. All our data and code are made available in order to reproduce and extend our results.

Citation: Colavizza G, Hrynaszkiewicz I, Staden I, Whitaker K, McGillivray B (2020) The citation advantage of linking publications to research data. PLoS ONE 15(4): e0230416. https://doi.org/10.1371/journal.pone.0230416 Editor: Jelte M. Wicherts, Tilburg University, NETHERLANDS Received: July 5, 2019; Accepted: February 28, 2020; Published: April 22, 2020 Copyright: © 2020 Colavizza et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Data Availability: Code and data can be found at: https://doi.org/10.5281/zenodo.3470062. Funding: This work was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1 and by Macmillan Education Ltd, part of Springer Nature, through grant RG92108 ‘Effect of data sharing policies on articles’ citation counts’ granted to BM. Springer Nature provided support in the form of salaries for author IH, but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section. Competing interests: One of the authors (IH) is, at the time of publication, employed by PLOS, the publisher of PLOS ONE. IH was employed by Springer Nature, publisher of the BMC journals, at the time of planning and conducting the research and writing the original manuscript. This does not alter our adherence to PLOS ONE policies on sharing data and materials. There are no patents, products in development or marketed products associated with this research to declare. All other authors have declared that no other competing interests exist. Publisher's Note: The article involves the independent analysis of data from publications in PLOS ONE.
PLOS ONE staff had no knowledge or involvement in the study design, funding, execution or manuscript preparation. The evaluation and editorial decision for this manuscript have been managed by an Academic Editor independent of PLOS ONE staff, per our standard editorial process. The findings and conclusions reported in this article are strictly those of the author(s).

Introduction
More research funding agencies, institutions, journals and publishers are introducing policies that encourage or require the sharing of the research data that support publications. Research data policies are generally intended to improve the reproducibility and quality of published research, to increase the benefits to society of conducting research by promoting its reuse, and to give researchers more credit for sharing their work [1]. While some journals have required data sharing by researchers (authors) for more than two decades, these requirements have tended to be limited to specific types of research, such as experiments generating protein structural data [2]. It is a more recent development for journals and publishers covering multiple research disciplines to introduce common requirements for sharing research data, and for reporting the availability of supporting data in published articles [3]. Journal research data policies often include requirements for researchers to provide Data Availability Statements (DAS). The policies of some research funding agencies, such as the UK's Engineering and Physical Sciences Research Council (EPSRC), also require that researchers' publications include a DAS. A DAS states where the data supporting the results reported in a published article can be found: whether those data are available publicly in a data repository, available with the published article as supplementary information, available only privately or upon request, or not at all. DAS are often free text, which makes it a non-trivial task to automatically identify the degree of data availability reported in them; addressing this is one of the novel contributions of our study. While DAS can appear in different styles and with different titles depending on the publisher, they are a means to establish and assess compliance with data policies [4–6].
DAS are also known as Data Accessibility Statements, Data Sharing Statements and, in the journals considered in this study, ‘Availability of supporting data’ and ‘Availability of data and materials’ statements. Research data policies of funding agencies and journals can influence researchers' willingness to share research data [7, 8], and strong journal data sharing policies have been associated with increased availability of research data [9]. However, surveys have also shown that researchers feel they should receive more credit for sharing data [10]. Citations (referencing) in scholarly publications provide evidence for claims, and citation counts remain an important measure of the impact and reuse of research and a means for researchers to receive credit for their work. Several studies have explored compliance with journal data sharing policies [11–15]. For example, DAS in PLOS journals have been found to be significantly on the rise after a mandatory policy was introduced, even if providing data in a repository remains a sharing method used in only a fraction of articles [16]. This is a known problem more generally: DAS contain links to data (and software) repositories all too rarely [17–19]. Nevertheless, there are benefits to data sharing [20–22]. It is known, for example, that the biomedical literature in PubMed has shown clear signs of improvement in the transparency and reproducibility of results over recent years, including data sharing [23]. Some previous studies have shown that, mostly in specific research disciplines (such as gene expression studies [24, 25], paleoceanography [26], astronomy [27] and astrophysics [28]), sharing the research data that support scholarly publications, or linking research data to publications, is associated with increased citations [29].
However, to our knowledge, no previous study has sought to determine whether providing a DAS, and specifically providing links to supporting data files in a DAS, has an effect on citations across multiple journals, publishers and a wide variety of research disciplines. Making data (and code) available increases the time (and presumably the cost) taken to publish papers [30], which has implications for authors, editors and publishers. As more journals and funding agencies require the provision of DAS, further evidence of the benefits of providing them, for example as measured through citations, is needed. In this study, we consider DAS in journal articles published by two publishers: BMC and PLOS. We focus on the following two questions: are DAS being adopted as per publishers' policies and, if so, can we qualify DAS into categories determined by their contents? In particular, we consider three main categories of availability: data available upon request, data available in the paper or supplementary materials, and data made available via a direct link. Are different DAS categories correlated with an article's citation impact? In particular, are DAS which include an explicit link to a repository, either via a URL or permanent identifier, more positively correlated with citation impact than the alternatives?

Materials and methods
Data
To make this study completely reproducible, we focus only on open access publications and release all the accompanying code (see Data and Code Availability Section). We use the PubMed Open Access (OA) collection, including all publications up to and including 2018 [31]. Publications missing a known identifier (DOI, PubMed ID, PMC ID or a publisher-specific ID), a publication date or at least one reference are discarded. The final publication count totals N = 1,969,175. Our analyses focus on a subset of these publications, specifically from two publishers: PLOS (Public Library of Science) and BMC (BioMed Central). PLOS and BMC were selected for this study as they were among the first publishers to introduce DAS. Identifying PLOS journals is straightforward, as the journal names start with ‘PLOS’, e.g. ‘PLOS ONE’. We identify BMC journals using an expert-curated list (see footnote 3 below). We further remove review articles and editorials from this dataset, and are left with a final publication count totalling M = 531,889 journal articles. Our data extraction and processing pipeline is illustrated in Fig 1. The processing pipeline, including DAS classification, as well as the descriptive analysis below, were all developed in Python [32], mainly relying on the following libraries or tools: scipy [33], scikit-learn [34], pandas [35], numpy [36], nltk [37], matplotlib [38], seaborn [39], gensim [40], beautifulsoup (https://www.crummy.com/software/BeautifulSoup), TextBlob (https://github.com/sloria/textblob) and pymongo (MongoDB, https://www.mongodb.com).
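To illustrate the filtering step, here is a minimal sketch in Python using pandas. The column names, example values and the small BMC set are hypothetical stand-ins, not the study's actual database schema or curated journal list:

```python
import pandas as pd

# Toy metadata table; columns and values are illustrative only.
records = pd.DataFrame([
    {"doi": "10.1371/x", "pub_date": "2018-01-05", "n_refs": 12, "journal": "PLOS ONE"},
    {"doi": None,        "pub_date": "2017-03-02", "n_refs": 4,  "journal": "BMC Biology"},
    {"doi": "10.1186/y", "pub_date": None,         "n_refs": 8,  "journal": "BMC Medicine"},
    {"doi": "10.1371/z", "pub_date": "2016-09-14", "n_refs": 0,  "journal": "PLOS Biology"},
])

# Discard records missing an identifier or a publication date...
kept = records.dropna(subset=["doi", "pub_date"])
# ...and records with no references.
kept = kept[kept["n_refs"] >= 1]

# Keep PLOS journals (by name prefix) and BMC journals (here a stand-in
# set in place of the expert-curated list used in the study).
bmc_journals = {"BMC Biology", "BMC Medicine"}
kept = kept[kept["journal"].str.startswith("PLOS") | kept["journal"].isin(bmc_journals)]

print(kept["doi"].tolist())  # ['10.1371/x']
```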


Fig 1. Data extraction and processing steps. We first downloaded the PubMed open access collection (1) and created a database with all articles with a known identifier and which contained at least one reference (2; N = 1,969,175). Next we identified and disambiguated the authors of these papers (3; S = 4,253,172) and calculated citations for each author and each publication from within the collection (4). We used these citation counts to calculate a within-collection H-index for each author. Our analysis only focuses on PLOS and BMC publications, as these publishers introduced mandated DAS, so we filtered the database for these articles and extracted the DAS from each publication (5). We annotated a training dataset by labelling each of these statements with one of four categories (6) and used those labels to train a natural language processing classifier (7). Using this classifier we then categorised the remaining DAS in the database (8). Finally, we exported this categorised dataset of M = 531,889 publications to a csv file (9) and archived it (see Data and code availability section below). https://doi.org/10.1371/journal.pone.0230416.g001
Data availability statements: Policies and extraction
On 1 March 2014, PLOS introduced a mandate which required a DAS to be included with all publications and required all authors to share the research data supporting their publications [41]. In 2011 BMC journals began to introduce a policy that either required or encouraged authors to include an equivalent section in their publications, ‘Availability of supporting data’ [42], and the number of BMC journals that adopted one of these policies increased between 2011 and 2015. In 2015 BMC updated and standardised its policy, and all of its more than 250 journals required, i.e. mandated, a DAS (styled as ‘Availability of data and materials’) in all their publications.
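Step 4 of the pipeline in Fig 1 computes a within-collection H-index for each author. As a minimal sketch (not the authors' released implementation), the standard definition can be coded directly:

```python
def h_index(citation_counts):
    """Largest h such that the author has h papers with at least h
    citations each, counting only citations from within the collection."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(counts, start=1):
        if citations >= rank:
            h = rank
        else:
            break
    return h

# An author with papers cited 10, 8, 5, 4 and 3 times has H-index 4:
print(h_index([10, 8, 5, 4, 3]))  # 4
```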
This provides sufficient time for publications in these journals to accrue citations for the analysis. Further, all papers published in the BMC and PLOS journals are open access and available under licenses that enable the content and metadata of the articles to be text-mined and analysed for research purposes. We encoded the dates on which these policies were introduced by the different BMC journals, and the type of policy (that is, DAS encouraged or DAS required/mandated), in a list of journals which also includes the PLOS journals [43]. The extraction of DAS from the XML files is straightforward for PLOS journals, while it requires closer inspection for BMC journals. We established a set of rules to detect and extract statements from both sets of journals, as documented in our repository. A total of Md = 184,075 (34.6%) publications have a DAS in our dataset. We focus this study on DAS provided in the standard sections of articles according to the publisher styles of PLOS and BMC. While this choice does not consider unstructured statements that might describe the availability of supporting data elsewhere in a publication, such as sentences in the Methods or Results sections, our analysis intentionally focuses on articles in journals with editorial policies that include the use of a DAS.
Data availability statements: Classification
The content of DAS can take different forms, which reflect varying levels of data availability, different community and disciplinary cultures of data sharing, specific journal style recommendations, and authors' choices. Some statements contain standard text typically provided by publishers, e.g. ‘The authors confirm that all data underlying the findings are fully available without restriction.
All relevant data are within the paper and its Supporting Information files.’ In other cases, the authors may have modified the standard text to add further details about the location of the data for their study, providing a DOI or a link to a specific repository. Where research data are not publicly available, authors may justify this with additional information or provide information on how readers can request access to the data. In other cases, the authors may declare that the data are not available, or that a DAS is not applicable in their case. We identified four categories of DAS, further described in Table 1. We use fewer categories than [16] in order to obtain reliable classification results. Our four categories cover the most well-represented categories from that study, namely: not available or ‘access restricted’ (our category 0); ‘upon request’ (our category 1); ‘in paper’, ‘in paper and SI’ or ‘in SI’ (our category 2); ‘repository’ (our category 3). We consider category 3 to be the most desirable one, because the data (or code) are shared as part of a publication and the authors provide a direct link to a repository (e.g. via a unique URL or, preferably, a persistent identifier). We manually categorized 380 statements according to this coding approach, including all statements repeated eight or more times in the dataset (some DAS are very frequent, resulting from default statements left unchanged by authors) and a random selection from the rest. We used a randomly selected set of 304 (80%) of those statements to train different classifiers and the remaining 76 (20%) statements to test the classifiers' accuracy. The classifiers we trained are listed below:

NB-BOW: Multinomial Naïve Bayes classifier whose features are the vectors of the unique words in the DAS texts (bag-of-words model);

NB-TFIDF: Naïve Bayes classifier whose features are the vectors of the unique words in the DAS texts, weighted by their Term Frequency Inverse Document Frequency (TF-IDF) score [44];

SVM: Support Vector Machines (SVM) classifier [45] whose features are the unique words in the DAS texts, weighted by their TF-IDF score;

ET-Word2vec: Extra Trees classifier [46] whose features are the word embeddings in the DAS texts calculated using the word2vec algorithm [47];

ET-Word2vec-TFIDF: Extra Trees classifier whose features are the word2vec word embeddings in the DAS texts weighted by TF-IDF.
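The first three candidates can be sketched as scikit-learn pipelines. This is a toy illustration, not the study's training code: the example statements and labels are invented, `LinearSVC` stands in for the SVM, and the word2vec variants are omitted for brevity:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Invented DAS examples for the four categories (the real training set
# had 380 hand-coded statements); repeated so fitting is meaningful.
texts = [
    "Data cannot be shared publicly because of patient privacy.",                     # 0
    "Data are available from the corresponding author upon request.",                 # 1
    "All relevant data are within the paper and its Supporting Information files.",   # 2
    "Data are deposited in Zenodo at https://doi.org/10.5281/zenodo.0000000.",        # 3
] * 5
labels = [0, 1, 2, 3] * 5

candidates = {
    "NB-BOW":   Pipeline([("vec", CountVectorizer()), ("clf", MultinomialNB())]),
    "NB-TFIDF": Pipeline([("vec", TfidfVectorizer()), ("clf", MultinomialNB())]),
    "SVM":      Pipeline([("vec", TfidfVectorizer()), ("clf", LinearSVC())]),
}

for name, model in candidates.items():
    model.fit(texts, labels)
    print(name, model.score(texts, labels))  # training accuracy on toy data
```

On real data one would of course score on the held-out 20% test split rather than the training set.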


Table 1. Categories of DAS identified in our coding approach. https://doi.org/10.1371/journal.pone.0230416.t001
TF-IDF is a weighting approach commonly used in information retrieval and has the effect of reducing the weight of words like the, is, a, which tend to occur in most documents. It is obtained by multiplying the term frequency (i.e. the number of times a term t appears in a document d divided by the total number of terms in d) by the inverse document frequency (i.e. the logarithm of the ratio between the total number of documents and the number of documents containing t). We experimented with different parameter values, as detailed below:

Stop words filter (values: ‘yes’ or ‘no’): whether or not we remove stop words from the texts before running the classifiers. Stop word lists include very common words (also known as function words) like prepositions (in, at, etc.), determiners (the, a, etc.), auxiliaries (do, will, etc.), and so on.

Stemming (values: ‘yes’ or ‘no’): whether or not we reduce inflected (or sometimes derived) words to their word stem, base or root; for example, stemming fishing, fished and fisher results in the stem fish.

The best combination of parameter values and classifier type was found to be an SVM with no use of stop words and with stemming, so this was chosen as the model for our subsequent analysis. Its accuracy is 0.99 on the test set, 1.00 when considering only the 250 most frequent DAS in the test set, and the frequency-weighted accuracy is also 1.00. The average precision, recall and F1-score weighted by support (i.e. the number of instances for each class) are all 0.99. The classification report containing precision, recall, F1-score and specificity (true negative rate) by category is shown in Table 2. The retained classifier was finally used to classify all DAS in the dataset, keeping manual annotations where available.
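The TF-IDF weight defined above can be computed directly from its definition. A small worked example on toy token lists (not actual DAS):

```python
import math

# Three toy "documents", already tokenised.
docs = [
    ["data", "available", "upon", "request"],
    ["data", "in", "the", "paper"],
    ["data", "deposited", "in", "a", "repository"],
]

def tf_idf(term, doc, corpus):
    # Term frequency: occurrences of the term in doc / total terms in doc.
    tf = doc.count(term) / len(doc)
    # Inverse document frequency: log(N / number of docs containing the term).
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df)

# 'data' occurs in every document, so its idf (hence its tf-idf) is 0.
print(tf_idf("data", docs[0], docs))        # 0.0
# 'repository' occurs in one of three documents: (1/5) * log(3) ≈ 0.2197
print(tf_idf("repository", docs[2], docs))
```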


Table 2. Classification report by DAS category. https://doi.org/10.1371/journal.pone.0230416.t002
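A report like Table 2 can be produced with scikit-learn's built-in metrics. The labels below are invented for illustration, not the study's actual test set, and note that specificity is not part of `classification_report`; it would need to be derived separately from the confusion matrix:

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical gold labels and predictions over the four DAS categories (0-3).
y_true = [0, 1, 1, 2, 2, 2, 3, 3]
y_pred = [0, 1, 1, 2, 2, 2, 3, 2]

print(accuracy_score(y_true, y_pred))         # 0.875 (7 of 8 correct)
print(classification_report(y_true, y_pred))  # per-category precision, recall, F1, support
```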

Discussion
Our study has a set of limitations. First of all, the intention to operate fully reproducibly has constrained our choices with respect to data. While the PubMed OA collection is sizable, it includes only a fraction of all published literature. Even for indicators based on citation counts (H-index, received citations), we decided not to use larger commercial options such as Web of Science or Scopus. A future analysis might consider much larger citation datasets, perhaps at the price of full reproducibility. We further focus on DAS given in dedicated sections, potentially missing those given in other parts of an article. Furthermore, we do not assess what a given repository contains in practice: this is not a replication study. Finally, citation counts are but one way, among many, to assess an article's impact. These and other limitations constitute potential avenues for future work: we believe that by sharing all our data and code, this study can be updated and built upon in future analyses. Future research that evaluates the contents and accuracy of DAS in more detail than this analysis, e.g. with a more sophisticated and granular categorisation of DAS, would be valuable. For example, such research could compare whether DAS that are highly templated from journals' guidance for authors are associated with differences in citation counts compared to non-standard statements, and whether DAS are an accurate description of the location of the data needed to reproduce the results reported in the article. We assume that non-templated statements imply more consideration of the journal's data sharing policy by the authors, and potentially more rigorous approaches to research data management. However, we found non-templated statements to appear with a lower frequency than statements such as “All relevant data are within the paper and its Supporting Information files”. There are several potential implications of our results.
All stakeholders, from funding agencies to publishers and researchers, have further evidence of an important benefit (a potential for increased citations) of providing access to research data. As a consequence, requests for strengthened and consistent research data policies, from research funders, publishers and institutions, can be better supported, enforced and accepted. Introducing stronger research data policies carries associated costs for all stakeholders, which can be better justified with evidence of a citation benefit. Our finding that journal policies that encourage rather than require or mandate DAS have only a small effect on the volume of DAS published will be of interest to publishers, if their goal is to improve the availability of DAS. However, policies often serve to create cultural and behavioural change in a community and to signal the importance of an issue [75], and it is not uncommon for journals and publishers to introduce new editorial policies in a progressive manner, with policies, such as those on the availability of data and code, increasing in strength and rigour over time. Springer Nature, for example, has indicated an intention to move more of its journals from data sharing policies that do not mandate a DAS to policies that do [3]. Our DAS classification approach, and the release of our data and code, may be helpful for stakeholders interested in research data policy compliance, as it enables more automated approaches to the detection, extraction and classification of DAS across multiple journals and publishers, at least in the open access literature. Even wider adoption of DAS as a standard data policy requirement for publishers, funding agencies and institutions would further facilitate the visibility of links to data as metadata, enhancing data discoverability, credit allocation and positive research practices such as reproducibility.
In fact, machine-readable DAS would allow for the development of a research data index extending existing citation indexes and potentially allowing the monitoring of researchers' sharing behaviour and of compliance with the data policies of different stakeholders. DAS also provide a mechanism for more focused search and enrichment of the literature with links between research data/code and scholarly articles. Links to research data provided within a DAS are most likely to refer to research data generated by or analysed in a study, potentially increasing the accuracy of services such as EU PubMed Central and the Scholarly Link Exchange (Scholix), which can link scholarly publications to their supporting data.

Conclusion
In this contribution we consider Data Availability Statements (DAS): a section in research articles which is increasingly being encouraged or mandated by publishers and used by authors to state if and how their research data are made available. We use the PubMed Open Access collection and focus on journal articles published by BMC and PLOS, in order to address the following two questions: 1) are DAS being adopted as per publishers' policies and, if so, can we qualify DAS into categories determined by their contents? 2) Are different DAS categories correlated with an article's citation impact? In particular, are the preferred DAS, which include an explicit link to a repository via a URL or permanent identifier (category 3 in this study), more positively correlated with citation impact than the alternatives? These questions are prompted by our intention to assess to what extent open science practices are adopted by publishers and authors, as well as to verify whether there is a benefit for authors who invest resources in order to (properly) make their research data available. We find that DAS are rapidly adopted after the introduction of a mandate in the journals of both publishers. For reasons in large part related to what is proposed as standard text for a DAS, BMC publications mostly use category 1 (data available on request), while PLOS publications mostly use category 2 (data contained within the article and supplementary materials). Category 3 covers just a fraction of DAS for both publishers: 12.2% (BMC) and 20.8% (PLOS) respectively. This is in line with previous literature finding that only about 20% of PLOS ONE articles between March 2014 and May 2016 contain a link to a repository in their DAS [16]. We also note that individual journals show a significant degree of variation with respect to their DAS category distributions.
The results of citation prediction clearly associate a citation advantage, of up to 25.36% (± 1.07%), with articles that have a category 3 DAS, i.e. those including a link to a repository via a URL or other permanent identifier, consistent with the results of previous smaller, more focused studies [24–28]. This is encouraging, as it provides a further incentive for authors to make their data available in a repository. There might be a variety of reasons for this effect. More effort and resources are put into papers that share data, so this choice might be made for better-quality articles. It is also possible that more successful or visible research groups have more resources at their disposal for sharing data as category 3. Sharing data likely also gives more credibility to an article's results, as it supports reproducibility [76, 77]. Finally, data sharing encourages re-use, which might further contribute to citation counts.

Data and code availability
Code and data can be found at: https://doi.org/10.5281/zenodo.3470062 [78].

Acknowledgments
The authors would like to thank Jo McEntyre and Audrey Hamelers at the European Bioinformatics Institute / EUPMC for advice on using their APIs in the planning stage of this study. The authors also thank Angela Dappert at Springer Nature for support in obtaining journal metadata from Springer Nature. GC also thanks James Hetherington, Director of Research Engineering at The Alan Turing Institute, for support and advice throughout the project. GC lastly acknowledges the Centre for Science and Technology Studies, Leiden University, for providing access to their databases.