The Neotoma database is a fantastic resource, and one that should be getting lots of use from paleoecologists. Neotoma is led by the efforts of Eric Grimm, Allan Ashworth, Russ Graham, Steve Jackson and Jack Williams, and includes paleoecological data from the North American Pollen Database and FAUNMAP, with other sources coming online soon. As more scientific data sets are developed, and as those data sets are updated more regularly, we're moving into an era in which scientists increasingly need tools to obtain new data as it is acquired. This leads us to the use of APIs in our scientific workflows. There are already R packages that take advantage of existing APIs: I've used ritis (part of the great rOpenSci project) a number of times to access species records in the ITIS database, there's the twitteR package for the Twitter API, and there are plenty of others.

The Neotoma API is still under development, led by Michael Anderson and Brian Bills, but it's looking great and seems to be fairly functional. It can be used from any programming language, but I used it through R in response to a minor problem we were having within the PalEON project. We recently got the data for Deming Lake from Jim Clark at Duke University. Deming was originally published in Ecological Monographs in 1991, but we'd like to use it since it offers a high-resolution record that is well dated and sits at the prairie/woodland boundary of the Upper Midwest. Unfortunately, the data for Deming Lake wasn't in a file format compatible with anything any of us were familiar with. While at Simon Fraser University I did my fair share of trawling old M.Sc. and Ph.D. theses and cold-calling Rolf's old students to ask whether their data were still available, and I came across a number of old file formats that needed translation. The further we go into the future, the less likely it is that these obsolete file formats can be translated, or even found. Ultimately the Deming Lake dataset is recoverable, but not without a lot of work (I've been practicing my regular expressions, see below).
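One practical detail when pulling records through an API like Neotoma's: individual requests can fail or come back empty, so each one needs to be guarded rather than allowed to kill a long download loop. Here is a minimal sketch of that guard pattern in base R; `safe.year` is a hypothetical helper written for illustration, not part of the Neotoma API or the code below.

```r
# Hypothetical helper illustrating a try()/NA guard for API responses:
# a missing record, or one that errors out, becomes NA instead of
# stopping the loop that requests it.
safe.year <- function(record) {
  res <- try(if (is.null(record)) stop('no record') else record$Year,
             silent = TRUE)
  if (inherits(res, 'try-error') || is.null(res)) NA else res
}

safe.year(list(Year = 1985))  # a well-formed record returns its year
safe.year(NULL)               # a missing record returns NA
safe.year(list())             # a record with no Year field also returns NA
```

The same idea appears in the download loop at the end of this post, where `try()` wraps each request to the publications endpoint.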

The thing that particularly interested me about this problem was whether there is a difference in the mean age of records that have not yet been acquired from the literature and those that have been aggregated into Neotoma. If there is, it indicates that, as time goes on, these unacquired records become progressively more likely to be unrecoverable, through a combination of faculty retirements and an inability to recover digital data. I figured it might make a good blog post anyway…

To test this I used the Neotoma API and the Unacquired Sites database from the North American Pollen Database. Admittedly the Unacquired Sites database is out of date (by about 15 years), but it was the only available record I had, and, ultimately, this is just a blog post to show off the Neotoma API and my R skills on a somewhat interesting question.

The code I've included below downloads and unzips the unacquired sites data automatically. We adjust the Neotoma data to reflect the age of the Unacquired Sites database by removing all pollen records newer than 1998 (this doesn't include sites in the Neotoma holding tank), and then looking at the age of publication relative to 1998. By treating 1998 as year zero and counting backwards, we can use a GLM with a Poisson family to test for differences between the two datasets. A t-test shows the same result.

When we look at the publications data it becomes apparent that the unacquired sites are older than the Neotoma sites (a difference of ~3 years; Figure 2). The 95% CI for Neotoma (once the post-1998 datasets are removed) is 1965–1997, while the unacquired sites database has a 95% CI of 1943–1995. This partly speaks to the success of the North American Pollen Database (and subsequently Neotoma) in contacting authors and obtaining buy-in from researchers, but it also speaks to the need to find these older records and, ultimately, to update the unacquired sites database.

As older researchers retire we lose their datasets, especially count sheets that might be considered nothing more than recycling. As we upgrade computers, store floppy disks and put aside the hard drives that contain M.Sc. and Ph.D. datasets, we continue to lose these datasets. We also lose an understanding of which datasets exist to be assimilated, and which are of particular importance for a region. Certainly some may have more value than others (see this earlier post), but, as with species, there is inherent value in the raw datasets simply because of the work that goes into collecting them. And as we begin to develop robust data assimilation techniques (see this post), even records with poorly resolved chronologies can contribute to future research.

Here is the code to do the analysis:

```r
# You shouldn't need to set a working directory, but be aware that this code
# downloads files to whatever working directory you're in.
library(RJSONIO)
library(RCurl)
library(plyr)
library(ggplot2)

# Using the Neotoma API to get a list of all Neotoma publications; currently
# the functionality doesn't exist to link them directly to a site type.
# More about the API can be found here: http://api.neotomadb.org/doc/
data.uri <- 'http://api.neotomadb.org/data/Datasets'
pollen.recs <- laply(fromJSON(getForm(data.uri, DatasetType = 'pollen'))$data,
                     function(x) x$DatasetID)

pub.uri <- 'http://api.neotomadb.org/data/publications'
pub.year <- rep(NA, length(pollen.recs))

for (i in seq_along(pollen.recs)) {
  test <- try(fromJSON(getForm(pub.uri, DatasetID = pollen.recs[i]))$data[[1]]$Year)
  if (!(is.null(test) | class(test) == 'try-error')) pub.year[i] <- test
  # Print a progress marker roughly 30 times over the course of the loop:
  if (i %% round(length(pollen.recs) / 30) == 0) cat('\n', i, '\n')
}

# Not really the prettiest way of doing this. Suggestions are welcome.
pub.year <- as.numeric(unlist(pub.year))
pub.year <- pub.year[!is.na(pub.year)]

# This is the MapPad file for the North American Pollen Database's Unacquired
# Sites Inventory:
# http://www.museum.state.il.us/research/napd/mainmenu.html
download.file('http://www.museum.state.il.us/research/napd/mpdfile.zip',
              destfile = 'MPD.zip')
aa <- unzip('MPD.zip')
unacquired <- scan('MPDFILE.MPD', what = 'character', encoding = 'UTF-8')

# Some crazy regex here! I relied heavily on Wikipedia and this site:
# http://www.jetbrains.com/webstorm/webhelp/regular-expression-syntax-reference.html
unacq.years <- unacquired[regexpr('^[12]{1}[0-9]{3}[.]{1}', unacquired) > 0 &
                          nchar(unacquired) == 5]
unacq.years <- as.numeric(substr(unacq.years[substr(unacq.years, 5, 5) == '.'], 1, 4))

years <- data.frame(years = 1999 - c(pub.year, unacq.years),
                    acquired = factor(c(rep('Y', length(pub.year)),
                                        rep('N', length(unacq.years)))))

# The MPD file was created in 1998, so presumably we should exclude all files
# newer than this in both datasets. I also set 1917 as an arbitrary minimum
# since that's when von Post first introduced palynology.
years <- years[years$years < 113 & years$years > 0, ]

ggplot(data = years, aes(x = acquired, y = years)) +
  geom_jitter(position = position_jitter(width = 0.2)) +
  geom_boxplot(alpha = 0.5) +
  scale_y_log10() +
  theme_bw() +
  xlab('Acquired Dataset') +
  ylab('Publication Date (Years BP)')

# Just so we know that there is a significant difference:
anova(glm(years ~ acquired, data = years, family = poisson), test = 'Chisq')
```
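As a quick illustration of the regular expression above: `'^[12]{1}[0-9]{3}[.]{1}'` matches tokens that begin with a four-digit year (starting with 1 or 2) followed by a period, and the `nchar()` check restricts matches to exactly five characters, which is how years are stored in the MPD file. The tokens below are made up for the example.

```r
# Made-up tokens mimicking entries from the MPD file:
tokens <- c('1985.', 'LAKE', '2001.', '85.5', '1985')

# Keep only five-character tokens that look like 'YYYY.':
hits <- tokens[regexpr('^[12]{1}[0-9]{3}[.]{1}', tokens) > 0 &
               nchar(tokens) == 5]

# Strip the trailing period and convert to numeric years:
years.found <- as.numeric(substr(hits, 1, 4))
```

Note that `'85.5'` and the period-less `'1985'` are both rejected, while `'1985.'` and `'2001.'` survive and become the numeric years 1985 and 2001.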