The open access movement is forcing publishers to take down paywalls, making publicly funded research available to the public for free. But beyond that, a more important development is waiting in the wings: open data.

With open access, the issue has been free access to the results of scientific work. However, by “results” researchers really mean published papers which, bluntly, are only what scientists write after looking at their data. With the open data drive, advocates are saying that the actual raw data should be available too. Anyone could then pick over, explore and re-use the data. This shift represents a behavioural sea-change that will also address some substantial threats to the integrity of science.

The benefits of open data are clear. First, just the knowledge that the raw data will be out there for other analysts to check may make researchers more responsible about their data. Second, there is vast potential in the re-use of data. Researchers sometimes invest large amounts of resources in collecting data, only to publish one slice of it before having to move on to new projects.

Sometimes they do not even have time to publish anything, or feel that their results are not good enough to publish, whether because they believe “negative results” will be poorly received by journals and their peers or because writing up something unexciting just does not seem worth it. About half the results presented at conferences are not published in journals, about half the projects funded by public money never produce any journal articles, and negative results from clinical trials often get swept under the rug.

This means that the “results” out there in the scientific literature are a warped representation of the data that has been collected. Add to this the sheer waste of developing a database and then throwing it away once the tusk of a nice finding has been poached from it. If the primary researchers do not have the time to fairly represent everything they have collected, why not just put the data out there? Sharing the data is a fix for our current ills. Yet the data sits on the hard drives of scientists around the world.

What’s the hold up?

Limited infrastructure was one excuse not to share such data. But even when some universities built data archives ready for a data deluge, scientists avoided using them. It is not that researchers disagree with the idea of sharing data, but they have apprehensions about putting raw data “out there”.

First, there will always be a better statistician than you somewhere in the world, who can simply take your analyses apart and do them better. That is uncomfortable. Worse, what if someone somewhere does a hatchet job and claims your data “shows” something it does not? What about the legalities around patient privacy and consent, or around discoveries and patents arising from your data? Finally, what is in it for an individual scientist or even a research group?

Scientists understand the need for sharing data openly, but they lack the incentive. Yet there may be a way forward by tapping into the concepts of database citations and “data papers”.

The reputation currency of a scientist is often measured by how many papers he or she has published and how many times those papers are cited by other scientists in their papers. While it is not a perfect metric, it is widely used by journals.

The idea then would be to apply such a metric to databases. Assign a unique identifier to a database so that it can be cited like a paper, and credit is then given to the authors of that database. Some new “data journals” are going a step further by inviting scientists to write citable data papers to complement those deposited databases. These papers detail everything needed to use the data without pestering the original authors.

As a researcher, the ideal scenario for me would be that I hand in my database to the funding body at the end of the project. They check that the data is good, nominate a repository and I write a data paper for the repository. Once that is done, I am granted a grace period to finish writing research papers before the raw data gets released to the outside world.

Only a few funding bodies have mandated sharing, and even they are not enforcing it. Weak sticks and theoretical carrots will not be enough to drive scientists into this bold new territory. The culture of sharing raw data will only truly begin when researchers are forced to do so by funding bodies.