At the recent Researcher to Reader conference in London, Mark Allin (@allinsnap) had the job of doing the conference round-up, the slot immediately before the closing keynote where the themes and take-homes of the conference are brought together. In his four summary themes, Allin inevitably drew out Open Access / Open Science. It’s almost impossible to have a publishing or library conference without it. In terms of significance, however, he put it at the bottom of the list, almost as an afterthought. His reasoning is that open science now feels like an inevitability. With a clear trend towards both open access and open data mandates among funders, institutions, and publishers, the question that each of us must ask ourselves isn’t whether it will or should happen, but how we are going to adapt as change continues.

This post isn’t about Researcher to Reader. If you’d like to read about the conference, check out the hashtag (#R2Rconf). It was a lively one, I promise you. Instead, I’m going to write about the meeting I attended the day after. This meeting was about what some publishers are doing to respond to and support open research, or more precisely, open data. The STM Research Data Workshop series, facilitated by Joris van Rossum and Fiona Murphy, of which last week’s meeting was the second, is part of the STM Association’s Research Data Year (RDY). The RDY, whose theme is Share-Link-Cite, is a collaboration of 11 publishers*, which, according to van Rossum (STM’s research data director), publish over half of all journal articles. The workshop was attended by select representatives from member publishers and from technology organizations important to the effort, including SCHOLIX, Crossref, and Ripeta.

The rise of data and the data policy

Practices around open research data are gaining traction. In 2019’s The State of Open Data Report, 64% of respondents claimed that they made their data openly available in 2018. That’s a rise of 4% from the previous year. Comprehensive information on the prevalence of open data policies is hard to come by, but there is a general sense that publishers, funders, and institutions alike are all moving towards firstly having data policies and then steadily strengthening those policies over time.

The JoRD project, based at Nottingham University in the UK, was funded by Jisc and ran from December 2012 until its final blog post in 2014. In this article, Sturges et al. report that JoRD found the state of open data policies among journals to be patchy and inconsistent, with about half of all the journals they looked at having no policy at all, and with 75% of the policies that did exist being categorized as weak.

Unfortunately, the short timescale of the JoRD project limits its findings to a snapshot. However, there has since been piecemeal evidence of progress towards a more robust open research data landscape. The case studies presented in this article by Jones et al. (a different Jones, not me) describe how both Taylor and Francis and Springer Nature have followed the path of steadily increasing the number of journals with data policies while strengthening those that exist. Like all of the publishers that were represented at the STM workshop, both Taylor and Francis and Springer Nature use multiple progressive levels of policy, from broadly advisory to firmly mandatory, in terms of the need to share data as well as how to share it.

With the level of socialization of open data practices varying by discipline, geographical region, and community, there’s no one-size-fits-all data policy that journal editors should adopt or publishers should mandate. Among some communities, in the humanities for example, there isn’t much data to be shared and when there is, the social norms and mechanisms may be in their infancy or not developed at all. There are also communities that have significant portions of their data that can’t be shared. Being clear on the limits of openness, for example in order to respect patient confidentiality, the locations of endangered species, or commercially sensitive data, is critically important.

The lowest levels of policy can be as lightweight as requesting that authors provide a data availability statement (DAS) where they state whether they will share their data. Stricter policies can require a DAS without mandating sharing, require sharing in general, require the use of specific approved repositories with particular data types conforming to specific structures and standards, and, most strictly, require peer review of data, whatever that means.

Guidelines and structures

Those who follow trends in open data will be aware of the Transparency and Openness Promotion (TOP) guidelines. For data sharing, TOP describes four levels:

Not implemented: Journal encourages data sharing, or says nothing.

Level 1 – Disclosure: Article states whether data are available, and, if so, where to access them.

Level 2 – Mandate: Data must be posted to a trusted repository. Exceptions must be identified at article submission.

Level 3 – Verify: Data must be posted to a trusted repository, and reported analyses will be reproduced independently prior to publication.
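For publishers who want to audit where their journals currently sit, the four levels above can be expressed as a small lookup. What follows is only a minimal sketch: the class names, field names, and the `top_level` function are hypothetical labels chosen for illustration, not part of the TOP guidelines themselves, and a real policy would need more nuance (for example, how exceptions are handled).

```python
from dataclasses import dataclass
from enum import IntEnum


class TopDataLevel(IntEnum):
    """The four TOP data-sharing levels listed above."""
    NOT_IMPLEMENTED = 0
    DISCLOSURE = 1
    MANDATE = 2
    VERIFY = 3


@dataclass
class JournalDataPolicy:
    # Hypothetical field names; each one mirrors a requirement in the list above.
    requires_availability_statement: bool = False
    requires_trusted_repository: bool = False
    reproduces_analyses_before_publication: bool = False


def top_level(policy: JournalDataPolicy) -> TopDataLevel:
    """Return the highest TOP data-sharing level the policy satisfies."""
    if policy.requires_trusted_repository:
        if policy.reproduces_analyses_before_publication:
            return TopDataLevel.VERIFY
        return TopDataLevel.MANDATE
    if policy.requires_availability_statement:
        return TopDataLevel.DISCLOSURE
    # A journal that merely encourages sharing, or says nothing, lands here.
    return TopDataLevel.NOT_IMPLEMENTED


if __name__ == "__main__":
    advisory = JournalDataPolicy()
    das_only = JournalDataPolicy(requires_availability_statement=True)
    print(top_level(advisory).name, top_level(das_only).name)
```

Using an ordered enum means levels can be compared directly, which is handy when tracking whether a journal's policy has strengthened over time.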

Many publishers have taken these as a starting point for policies, although none have gone as far as requiring independent reproduction of analyses, which is arguably out of scope for traditional scholarly journals and would require significant infrastructure investment. It’s important to bear in mind that TOP is about a lot more than open data and, as a result, its view of open data is fairly high level and constitutes a set of principles. Those principles can and should be interpreted in a community-specific way, but at the same time, it’s important to limit complexity so as not to overburden editors and authors.

It’s a tricky balance.

The Belmont Forum, which represents a group of 26 funding agencies, ratified a DAS Policy and templates document in 2018 that articulated a set of minimum requirements for a DAS. These included confirmation of the existence (or not!) of the data, preferring the use of persistent identifiers rather than URLs where available, and providing licensing information. More recently, the Research Data Alliance (RDA) has sought to address the need for clarity and consistency by creating a specific framework for data policies based on a harmonization of existing publisher policies. A number of funder policies, the CODATA best practice guidelines, and the TOP guidelines were also referenced. The rubric, shown below, consists of 14 policy features with 6 policy levels that gradually increase the number of included features.
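Separately from the RDA rubric, the three Belmont Forum minimum requirements mentioned above (stating whether data exist, preferring persistent identifiers to URLs, and giving licensing information) lend themselves to a simple automated check. The sketch below is an illustration only: the `DataAvailabilityStatement` structure and `check_das` function are hypothetical, and it assumes that DOIs are the only persistent identifiers in use, which is a simplification.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumption: only DOIs count as persistent identifiers in this sketch.
DOI_RE = re.compile(
    r"^(?:https?://(?:dx\.)?doi\.org/|doi:)?10\.\d{4,9}/\S+$", re.IGNORECASE
)


@dataclass
class DataAvailabilityStatement:
    # Hypothetical structure; None means the author did not say either way.
    data_exists: Optional[bool] = None
    locations: List[str] = field(default_factory=list)
    license: Optional[str] = None


def check_das(das: DataAvailabilityStatement) -> List[str]:
    """Return a list of human-readable issues; an empty list means none found."""
    if das.data_exists is None:
        return ["State explicitly whether or not data exist."]
    issues = []
    if das.data_exists:
        for location in das.locations:
            if not DOI_RE.match(location):
                issues.append(
                    f"Prefer a persistent identifier over a plain URL: {location}"
                )
        if not das.license:
            issues.append("Provide licensing information for the data.")
    return issues


if __name__ == "__main__":
    das = DataAvailabilityStatement(
        data_exists=True,
        locations=["https://doi.org/10.1234/abcd", "https://example.org/data.csv"],
    )
    for issue in check_das(das):
        print(issue)
```

A statement that confirms no data were generated passes with no issues, which matches the "or not!" in the requirement: the point is that authors say so explicitly.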

What’s next on the road to open data

Some great progress has been made towards a more robust, consistent, and less confusing open data landscape, but much more needs to be done to make things more accessible to researchers, editors, societies and many smaller publishers.

The identification of the need for a framework that caters to different levels of community development towards openness is an important step forward, but such frameworks are currently still being developed and socialized themselves. The RDA framework was discussed at the workshop and agreed to be on the right track, but attendees weren’t quite ready to adopt it as a standard before exploring how it maps to the TOP guidelines, and their own specific policies. Nevertheless, we seem to be on the road to clarity and standardization.

Meanwhile, many publishers are continuing to work with their editors and communities to both socialize open data and steadily roll out more and stronger open data policies as appropriate.

Some longer term thoughts

Many data policies from publishers and funders, including the European Commission, make reference to FAIR data. While open and FAIR are connected concepts, they’re not synonymous.

For instance, corporate R&D divisions and knowledge managers are increasingly using FAIR principles to accelerate discoverability of internal, proprietary datasets that are not intended to be shared outside of the company research environment.

By the same token, placing a dataset on a publicly accessible personal website with no supporting metadata may be open, but it wouldn’t be findable, and if not structured correctly, may not be interoperable or reusable.

In many ways, FAIR data is a more complicated problem to solve than openness. At the workshop, there was a sense that having editorial staff check that the data referenced in an article is accounted for in the DAS, and actually exists, is time-consuming and expensive. Making sure that the data is structured correctly, with the right metadata, requires a lot more time and domain knowledge.

At the workshop, Leslie McIntosh from Ripeta discussed their work with natural language processing, which analyzes manuscripts against a machine-identifiable definition of a DAS to check whether one is present. Eventually, similar technologies might help editors assess compliance with data policies.
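To give a feel for what checking a manuscript for a DAS involves at its very simplest, here is a naive keyword-based sketch. It is emphatically not Ripeta's method, which was described as using natural language processing; the phrase list and the `find_das` function are assumptions made purely for illustration, and a pattern match like this would miss many real-world phrasings.

```python
import re
from typing import Optional

# Assumed phrasings that commonly introduce a data availability statement.
DAS_PATTERNS = [
    r"\bdata availability(?: statement)?\b",
    r"\bavailability of data(?: and materials?)?\b",
    r"\bdata accessibility\b",
    r"\bdata (?:are|is) available\b",
]
DAS_RE = re.compile("|".join(DAS_PATTERNS), re.IGNORECASE)


def find_das(manuscript_text: str) -> Optional[str]:
    """Return the paragraph starting at the first DAS-like phrase, or None."""
    match = DAS_RE.search(manuscript_text)
    if match is None:
        return None
    start = match.start()
    # Treat a blank line as the end of the statement's paragraph.
    end = manuscript_text.find("\n\n", start)
    if end == -1:
        end = len(manuscript_text)
    return manuscript_text[start:end].strip()


if __name__ == "__main__":
    text = (
        "Introduction\nWe studied things.\n\n"
        "Data Availability Statement\n"
        "The datasets are deposited at doi:10.1234/abcd.\n\n"
        "References\n"
    )
    print(find_das(text))
```

Even this toy version hints at why the harder questions raised above, whether the data actually exists and is correctly structured, need far more than presence detection.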

Beyond the DAS, there’s a challenge around specifying and certifying the correct structures for data types in the first place. There have been some great early successes in areas like genomics and the environmental sciences, where big, well-structured data sets are the norm. As we move into the very long tail of data types, which stretches all the way down to one-off experiments and everything in between, there are open questions of how far down that long tail publishers can helpfully go and where the resources are going to come from to create and enforce the thousands of data structures that will be needed.

It’s clear that open data and its close cousin, FAIR data, are both important challenges for the scholarly communication infrastructure. At the 2nd STM Data Workshop, I was impressed by the clarity of thought that had gone into understanding this thorny set of problems and by the practicality of some of the solutions put forward.

There’s a lot that publishers can already do, at the very least to help socialize the idea of open data within their communities. The lightest policies can be purely advisory, with the next step up being just to mandate the existence of a DAS. While that lowest level might not require any greater openness on the part of researchers, at least it can let people know that such things exist and that the editor values them. The various resources I linked to above are well worth a read, but if you’re a publisher looking for a very quick primer, the data availability tips from STM are a good place to start.

Thinking more long-term, there’s still a lot of work to be done to make open data requirements both flexible enough to suit communities and consistent enough to not be confusing. There’s also a need to address sustainability, as it seems that with current approaches, policing stricter mandates would be unsustainably expensive, if it were done by all publishers for all research journals.

Despite these challenges, progress continues to be made. As Allin said, openness feels inevitable. It will be interesting to see how we get there.

*Publishers participating in the STM Research Data Year: