A Policy Forum in today's issue of Science takes a look at what's become a significant problem in the sciences: enabling and maintaining unfettered access to large collections of scientific data. Although the report focuses on the biosciences, many of the problems it describes apply to other areas of research as well. The biggest problem, however, is fairly simple: there's no good mechanism for determining who pays for maintaining large amounts of data, which leaves existing repositories at risk of either duplicating efforts or losing funding entirely, with a resulting loss of data.

The Forum actually deals with both the data and some of the materials used to generate it, from DNA samples up to engineered mouse lines. However, the issues underlying both of these are fairly similar: it costs money to accept and maintain both data and materials and, on many levels, people are faced with a choice between funding more science and saving the science we've already done.

In recent years, this sort of choice has been very explicit. Many funding bodies, including the National Institutes of Health, have made grants conditional on a data and materials sharing plan—no plan for sharing data, no money. Unfortunately, in most cases, the grant itself was provided as a single pool of money, without funds earmarked for data or material sharing. All of which left the grant recipients with the decision on how to balance the costs of sharing their work against the costs of doing new research. It's safe to assume that new research typically came out ahead in those decisions.

And, apparently, this problem was widely recognized. "Although funding agencies may exhort their experimental investigators to develop a 'dissemination plan' for the data and bioresources they develop," the authors note, "in reality, such requirements are often not fulfilled, and noncompliance has little or no consequence." Worse still, once the grant in question finishes (usually within five years), researchers are left to maintain their data with no financial support at all and no inducement to do so.

The other problem here is that it leaves the data and materials being stored on a per-lab basis. In practical terms, this leaves everything vulnerable to loss through random events like a failed backup archive or a power cut. But it also harms the scientific process itself. With key data fragmented across various lab repositories, often in incompatible storage formats, any computational analysis of large datasets will either be much harder to conduct, or will have to work from an incomplete source.

In many cases, both the community and funding bodies have worked to correct some of these problems. The NIH runs GenBank, an enormous database of publicly available nucleic acid sequences and their protein translations. It also supports organism-specific databases like FlyBase, which maintains information for the entire Drosophila research community, and material repositories like the Developmental Studies Hybridoma Bank, which stores antibodies for use in studying the development of various organisms.

Unfortunately, these resources have to apply for grants through a competitive funding mechanism, where they may end up competing with more research-focused projects. And, if they fail to get funding, then it's not clear where all their holdings will end up; there's a real potential that they could be lost. The use of national funding agencies also creates its own problems, since many of these have rules that prevent them from contributing to projects performed in other nations, which could lead to redundant efforts.

Is there a way out of this mess? The authors of the Forum call for revising how funding is handled so that the money for repositories of this sort becomes a distinct class of funding. "The traditional distinction between 'infrastructure' and 'research' is even less appropriate," the authors state. "Funding for data and bioresource repositories needs to be ring-fenced from hypothesis-driven research and supported sufficiently to ensure preservation and maintenance of its outputs." In addition, they call for greater international cooperation to ensure that the major repositories receive sufficient money, and that the funds don't get spent on duplicative effort.

There are two other aspects that the authors bring up that might also influence further funding decisions. The first is the role of manual curation of databases. Left on their own, researchers will make errors that can range from typos and inconsistent language to significant scientific mistakes; others will simply not bother to submit relevant information. Human curators can ensure correct information and make sure that relevant data gets incorporated, but they're relatively expensive; the Forum argues that the extra expense is generally worth it.

The other aspect of data and materials sharing the authors consider is having the users pay for access. This happens to a degree with research materials, where facilities charge nominal fees to ship things like antibodies to individual labs. But these fees cover only a fraction of the cost, and nothing similar has been tried for metering data access. This, the authors contend, is how it should be, since funding agencies shouldn't support a system of haves and have-nots, or limit access by researchers in the developing world.

Overall, the Forum is full of good ideas while, at the same time, being completely unrealistic. In the US and most of Europe, government science budgets are likely to be flat in the best of circumstances for the next few years, so coming up with a dedicated budget for community resources will mean removing it from elsewhere. And that is never a simple process when it comes to biomedical research.

Science, 2010. DOI: 10.1126/science.1191506