I had the pleasure of going up to visit the Limnological Research Center (LRC) at the University of Minnesota this past week. It’s a pretty cool setup, and obviously something that we should all value very highly, both as a resource to help prepare and sample sediment cores, and as a repository for data. The LRC has more than 4000 individual cores, totaling over 13km of lacustrine and marine sediment. A key point here is that much of this sediment is still available to sample, but this is still data in its rawest, unprocessed form.

I’ve written before about open science and reproducibility. These two things obviously go hand in hand, but we as scientists navigate a tricky world. The NSF expects that data will be shared “within a reasonable time”, which is fairly open-ended. In practice this exhortation doesn’t always work. We’ve all heard about researchers who won’t share data (often for good reason), but equally there are stories of researchers who may have used data to which they have no right. In some cases this is resolved, but in others the results are not so clear cut. The fair use of data overlaps with authorship, good citizenship, fairness and a number of other issues in academia, and one person’s definition of what is fair is unlikely to be another’s.

The use of others’ data presents a fine line in a discipline that depends on the transmission of ideas and data. In a previous post I discussed the need for domain experts in “big data”, and in many ways our ethics of data sharing are embedded both in respect for the data generator and in the need to understand the intricacies of data that is often noisy, quirky, idiomatic and dependent on the methods used to gather it. Open science, and reproducibility with it, then present a challenge for data generators, who are being asked to give up control of their data, and for data users, who must examine the ethics of their own data use.

Open science is a philosophy as much as it is an imperative. It asks scientists to give up the unspoken reciprocal partnerships between data generators and those who use the data down the road. Previously, our interactions on this front were mediated by the need to interact directly with data generators, since there were no central data repositories for many of our data needs. By moving data to central repositories, and by opening up data sets and the methods of analysis through all-inclusive data supplements, we take a chance in submitting our data: the chance that users will simply take it without reciprocation.

What I’m saying isn’t that reciprocation is necessary, but that it is a potential benefit provided to data generators that we may be losing in a move to central repositories. Data generation is costly, it is high (or higher) risk, and it is often slow, but it is critical to moving macro-scale research forward, and to finding the teleconnections between ecosystems that can help push science forward. The other issue is that the primary data papers often get cited much less than the large-scale synthesis papers that use them. How this might affect funding in the future is unclear, but it should be obvious that data synthesis may begin to pull funds from the very researchers who generate the data. One solution is to use synthesis data sets to identify gaps in existing data networks. This is what we did in Goring et al. (2009): in developing a modern pollen data set for British Columbia, we also used the data set to identify locations that are under-sampled with respect to the regional climate (Figure 1). The hope is that this sort of work can help motivate future grant proposals, by identifying high-value targets and showing that the need has been addressed in the published literature.

In releasing the neotoma package for R (on figshare and GitHub), we have linked open science methods (reproducible research) with a community of data generators who have contributed their data over the course of nearly 30 years to a variety of community-supported repositories, and ultimately to the Neotoma Paleoecological Database. Users can now draw data from across the globe in seconds. Compare this to the situation in the 1940s, when paleoecology researchers in North America weren’t even sure if their colleagues were still alive after the Second World War (detailed in the Pollen Analysis Circular). Data exchange at the time relied on formal, personal relationships.
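To give a sense of what “data in seconds” looks like in practice, here is a minimal sketch of pulling records with the neotoma package. The function names follow the package’s documentation; the site name is just an illustrative search term, and an internet connection to the Neotoma database is assumed.

```r
# A sketch of the basic neotoma workflow: find sites, list their
# datasets, then download the raw counts. Requires the neotoma
# package (install.packages("neotoma")) and a live connection to
# the Neotoma Paleoecological Database.
library(neotoma)

# Search for sites by name; "%" acts as a wildcard.
marion <- get_site(sitename = "Marion Lake%")

# List the pollen datasets associated with the matching sites.
marion_datasets <- get_dataset(marion, datasettype = "pollen")

# Download the actual sample data (counts, chronologies, metadata).
marion_data <- get_download(marion_datasets)
```

The same pattern scales up: because the search and download steps are scripted rather than negotiated by letter or email, a regional or continental synthesis can, in principle, be assembled and re-run reproducibly from a few lines of code.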

Interestingly, we see the rise of collaborative research in the modern era, but I wonder whether this rise is the result of increasing collaboration between data generators and synthesizers, or whether we see a split, with those generating the data remaining as data generators, and those synthesizing relying increasingly on large databases. I don’t have an answer to this, but I wonder if data generators are suffering from a bit of the Matthew Effect, whereby individual data sets are losing out to synthesis work in terms of recognition and impact.

Ultimately, academics are social creatures and we are regulated by social norms, but many of these norms are in a state of flux. Increasing funding pressure, the pressure of meeting proposal goals, and the new frontier of open science all mean that many of our social norms are in for a period of upheaval. The discussions that have gone on in the literature and on the web have been exciting and fruitful, but I still worry that open science is going to have repercussions we haven’t anticipated, and that they’re going to be felt most acutely by the primary data generators.