Making the data behind research papers publicly available remains something of a new frontier, both for publishers and for authors. As the research culture shifts more toward transparency, and as more journals and funding bodies require release of data, it is vital that the data be discoverable, to facilitate reuse, and citable, to provide credit where it is due. A recent study looking at data citation practices from 2011 to 2014 does indeed show progress, but also that we have a long way to go.

Recently Crossref and the Digital Curation Centre (DCC) issued guidelines for best practices for data citation, namely that citations to datasets appear in the References section of any paper that uses them. This contrasts with the ways journals usually cite data: “intratextually” (e.g. including a GenBank Accession Number in the text of an article) or in a separate dedicated “data availability” section of the paper. Neither of these satisfies the new standards, which are aimed at better fulfilling the Joint Declaration of Data Citation Principles (JDDCP), which states, “data citation should be accorded the same importance in the scholarly record as citations of other research objects, such as publications.”

The goal of the JDDCP is to help drive data availability by raising its importance and measurability as a means of understanding researcher productivity. If you create a dataset that drives significant research forward, then you should be acknowledged for that contribution. Proper citation is key to that reward, and to the discoverability offered by data citation metrics such as Thomson Reuters’ Data Citation Index.

The recent study, described in a blog post by author Elizabeth Hull from Dryad, showed that we are far from the intended goal. Only 6% of articles cited the data’s DOI in the article’s reference list, 75% just listed the DOI somewhere in the body of the article, and 20% had no citation of the data DOI anywhere in the article. On a positive note, things are improving, with data cited properly in the references rising from 5% to 8% over the four years studied, and articles with no data citation at all declining from 31% to 15%. These findings indicate progress, albeit very slow progress. I suspect that as data availability becomes more common, things will improve.

Several of the journals I work with are just implementing data policies and means of making data available, and to be honest, data citation was not on the radar of their editorial offices. Writing clear policies and instructions to authors, arranging partnerships with data repositories, and then working through the technologies required to make this happen took priority. But after I took part in a recent Alan Turing Institute Symposium on Reproducibility for Data Intensive Research, the need for best citation practices became clear. We’ve now implemented a policy to meet best practices: authors will be required to cite data as they would any other source, and in particular to include it in their article’s References.

I thought that sharing this learning experience might be helpful for others who are in the same position, just beginning to wade into the waters of connecting data to research papers. Citing all of one’s sources is a useful practice, and given that data deposits receive DOIs, requiring it should present no major hurdle for journals. If we’re going to the bother of making data available, we want it to be found, and we want authors to be rewarded for their efforts. Good citation practices can help make this happen.