Image Credit: GBIF, CC BY 4.0, Image Cropped

When I was a child, I’d often study books of Australian birds and mammals, rifling through the pages to see which species lived nearby. My sources of information were the maps printed next to photos of the species, distribution maps showing the extent of each species’ range. These days, many of these species’ ranges are declining. Or at least, many ecologists believe they are. One of the problems with knowing exactly where species exist or how they are faring is a lack of data. The more data we have, the more precise an idea we get of the future of the species. Some data is difficult to collect, but even more has already been collected and is simply inaccessible.

At the Living Norway seminar earlier this month I sat down with Tim Robertson, Head of Informatics at the Global Biodiversity Information Facility (GBIF). GBIF is an international network that works to solve this data problem worldwide, both by making collected data accessible and by helping everyday people to collect scientific data. I spoke with Tim about the journey from a species observation to a species distribution map, the role of GBIF, and the future of data collection.

Sam Perrin (SP): How did you end up working for GBIF?

Tim Robertson, Head of Informatics, GBIF (TR): I used to work in IT back in the UK, and got a little bit disillusioned with the kind of work I had fallen into. I took a break and ended up in South Australia at the Department of Environment and Heritage, working with a team who were doing monitoring of invasive weeds. I met a Danish girl, moved to Denmark and the DEH folk put me in touch with Donald Hobern, who was at the time the “Database Accessibility and Interoperability” Program Officer at GBIF. That’s what they called “FAIR data” back in those days. I worked for Donald as a software developer and I’ve been there for over 12 years now.

SP: Can you take us through what GBIF is?

TR: GBIF is first and foremost a community. Operationally we are a multi-government organisation, but really we are a very wide community of people helping each other share biodiversity-related datasets. That covers everything from species’ names and taxonomy to museum specimens, citizen science datasets and monitoring schemes. We bring together all of this information and make it discoverable and freely accessible, creating a global index that effectively documents evidence of species’ existence on the planet.

SP: From the moment I register a species presence on iNaturalist to the moment that’s downloaded as part of a species dataset, what’s the process?

TR: Norway has long experience and an active community in citizen science data collection. Increasingly people are using devices like their mobile phones for citizen science, and iNaturalist is one such portal. People take photos – and that’s an important point because the photo becomes verifiable evidence that can be revisited and discussed, and that allows identifications to be made. Apps like iNaturalist and platforms like Artsdatabanken allow discussion and interaction, bringing multiple eyes to the evidence, which improves quality.

At GBIF we have clear and stable data licensing options, and we help people understand that data that is suitably licensed can be shared and reused. iNaturalist users are generally happy to share their data, so the records that are suitably licensed and have been identified by enough people to qualify as ‘research grade’ are prepared in a dataset that is registered in GBIF and shared through the GBIF infrastructure (see https://doi.org/10.15468/ab3s5x). We then harmonize the data, build search indices, maps, dashboards, et cetera, and make them available to anybody on the web.

A user who visits GBIF.org can search by, for example, species, time range or spatial area, then find the data and download them. We maintain the data provenance so that users can see the source datasets and institutions for the data, and they can also link through to the original dataset, which is often richer at source.
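The kind of search described above (a species, a time range and a spatial area) can also be expressed as a request to GBIF's public occurrence search API. The following is a minimal sketch, not something described in the interview: it only builds the request URL and does not contact the network. The helper name `build_occurrence_query`, the example species and the bounding box are illustrative assumptions.

```python
from urllib.parse import urlencode

# Public GBIF occurrence search endpoint (API v1).
GBIF_OCCURRENCE_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def build_occurrence_query(scientific_name, year_range, lat_range, lon_range, limit=20):
    """Return a search URL filtering by species, a range of years and a bounding box.

    Hypothetical helper for illustration; ranges are passed as (low, high) tuples
    and encoded as comma-separated "low,high" values.
    """
    params = {
        "scientificName": scientific_name,
        "year": f"{year_range[0]},{year_range[1]}",
        "decimalLatitude": f"{lat_range[0]},{lat_range[1]}",
        "decimalLongitude": f"{lon_range[0]},{lon_range[1]}",
        "limit": limit,
    }
    return f"{GBIF_OCCURRENCE_SEARCH}?{urlencode(params)}"


# Example: Arctic fox records from 2010 to 2020 in a rough box around Norway
# (the coordinates are an illustrative assumption, not an official boundary).
url = build_occurrence_query("Vulpes lagopus", (2010, 2020), (58, 71), (4, 31))
print(url)
```

Fetching the resulting URL returns a page of matching records as JSON; larger extracts are better requested through the download service on GBIF.org, which also provides a citable DOI.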

SP: Have there been moments at GBIF where there’s needed to be a huge jump in technology to keep up with trends in science?

TR: In around 2007, Nick King, the GBIF director at the time, challenged the community “to get a billion records quickly”. Everyone said “what!?” as it seemed like a big number back then, but he was dead serious, saying that if we wanted to understand life on the planet, we needed to get serious about assembling big datasets.

I was the systems architect at the time, so I got the mandate to start exploring technology. It paved the way for us to build expertise in “Big Data” systems. At the time “Big Data” wasn’t the buzzword it is today, but we settled on an emerging technology known as Hadoop, and we continue to run Hadoop-based products today.

This is one of the things that I’ve been quite satisfied with. In the last two years data volumes in GBIF have doubled and the infrastructure has had no issues in coping. We’ve now got the ability to scale up with data growth and we continue to do so. I credit that result largely to the decision made by Nick.

SP: Have there ever been lags between the development of new data technology and its use in the scientific community?

TR: Generally, people come to GBIF and do one of two things. They download fairly small data sets and then run analyses on them; maybe a few hundred thousand reference points. Or they’re doing large scale data analyses already, and generally have research groups associated with them. As we’re now growing quickly, I think an interesting development is coming. We are getting to the stage where data volumes might become a challenge for analysts, and we will need to start educating our community to use tools suitable for larger scale analyses.

The largest community of users we see use R and Python, and they generally hit limits, for example in getting a dataset into memory. I would expect that people will start looking to technologies like Apache Spark, which allows a user to put much larger data structures into memory across multiple machines. The emerging data science community and the younger generation of graduates are much more familiar with these kinds of tools and technologies than the previous generation. We should engage with them more to understand how they’re doing analyses and look to see what kind of tools they need and how we should best be supporting them.
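To illustrate the memory limit mentioned above, here is a minimal Python sketch of one common workaround on a single machine: reading a tab-delimited occurrence file row by row and keeping only a running tally, rather than loading the whole table. This is plain streaming, not Apache Spark, and it is an illustration rather than a GBIF-recommended workflow. The function name and the sample rows are made up, and the `species` column name is an assumption about the file layout.

```python
import csv
import io
from collections import Counter


def count_records_per_species(lines, species_column="species"):
    """Tally records per species from tab-delimited rows, one row at a time.

    `lines` can be any iterable of text lines, such as an open file handle,
    so only the current row and the tally are held in memory.
    """
    counts = Counter()
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        name = row.get(species_column)
        if name:  # skip rows with no species-level identification
            counts[name] += 1
    return counts


# Small in-memory stand-in for a large download (fabricated sample rows).
sample = io.StringIO(
    "gbifID\tspecies\tyear\n"
    "1\tVulpes lagopus\t2015\n"
    "2\tVulpes lagopus\t2016\n"
    "3\tRangifer tarandus\t2016\n"
    "4\t\t2017\n"
)
counts = count_records_per_species(sample)
print(counts.most_common())
```

With a real file, the same function would be called on `open(path, newline="", encoding="utf-8")`. Streaming works for simple aggregations like this one; analyses that need joins or repeated passes over the data are where distributed tools such as Spark become attractive.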

SP: Can you take us briefly through the concept of FAIR vs. open data?

TR: FAIR means Findable, Accessible, Interoperable and Reusable, while open data refers more to freely available data. Data that is FAIR may not be open, and as a community we need to respect the way in which data are accumulated. It can take a huge amount of effort for people to even get into the field to start to collect data, long before these datasets appear registered in a system like GBIF. It’s our duty to make sure that the data is treated responsibly and to pay attention to aspects like citation practice, making it clear that consumers should cite their sources responsibly to provide scholarly credit.

When it comes to the interoperability of data across GBIF, it refers to topics like data standards, the vocabularies used by people when sharing data, the licensing, etc. For data to be reusable in our field, a consumer needs to understand how the dataset was assembled to determine if the data is fit for use. Documenting metadata describing the entire project lifecycle, including the data collection protocols, cleaning methods, biases, etc., is important to allow data to be reused.

GBIF deal only with open data intended to be shared and reused. A classic example of where GBIF need to think about FAIR – but not open – data would be related to sensitive species. Imagine a threatened species where sharing the coordinates could lead to trophy hunters visiting and collecting the specimen. We do want the infrastructure to be able to accommodate datasets of this nature, but also ensure that access is only given to those who should be granted permission – environment agencies and such.

That does remain a slight challenge for the GBIF.org system, which currently deals only with open data. However, as a network, the more local GBIF communities such as GBIF Norway are more able to accommodate local needs and connections with the relevant agencies. This is why a multi-tiered global network works well, as it accommodates both local and national initiatives along with the global level discovery.

SP: Are there any problems you foresee the GBIF community encountering?

TR: I tend to think where we could make advances is in educating people about why they should consider putting effort into correctly documenting their datasets. As a community, I think we need to invest in explaining to emerging scientists why it’s important to build such practices into all of their activities, including data archiving, data preparation, data standards, and using and sharing well-documented sampling protocols. If we can explain the importance and at the same time provide credit for this work, such as by linking to an ORCID ID, it might help improve the reusability of data.

This isn’t something that can be accomplished by the small GBIF team in Copenhagen alone. This is where the community aspect of GBIF comes into play: it becomes important to teach data wrangling skills to biology students in university courses.

SP: Can you take us through GBIF’s plans for the new trends we’re seeing in environmental DNA?

TR: This is a relatively new data type for GBIF. Classically GBIF has dealt with observational data and specimen-based data from collection events. What we are beginning to deal with now is environmental sampling data, where people are taking physical material from the environment, such as a scoop of water or a soil sample, and running it through sequencing pipelines that provide DNA barcodes. These datasets in effect provide evidence of species existing at a location and time, but the evidence has been identified not by a human but by a machine.

At the same time, reference catalogues are appearing that cluster DNA sequences into “species”. One example of this is the UNITE Community database for Fungi. By comparing the sequences coming from the sampling datasets to a reference catalogue such as UNITE, we can determine which taxa are present in the sampling event. In some cases there is DNA-based evidence of a yet-to-be-described species. By combining traditional taxonomy with molecular identifications, we can exchange information containing the evidence for the organisms existing, knowing at some point the species will be described and all this evidence will be linked. One goal for the infrastructure could be to continually align the barcodes to the reference catalogues which are changing constantly – this would help ensure datasets remain as current as possible.
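The idea of comparing sampled sequences against a reference catalogue can be sketched with a toy example. The code below is a deliberately simplified illustration and not how UNITE or GBIF perform matching: it scores sequences by shared k-mers (Jaccard similarity) and leaves a sequence unassigned when nothing in the reference is similar enough. The function names, the threshold, the taxon labels and the sequences are all made up for illustration.

```python
def kmers(sequence, k=4):
    """Return the set of overlapping substrings of length k in the sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: shared items divided by all items."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def assign_taxon(query, reference, threshold=0.8, k=4):
    """Return (best matching name or None, similarity score) for a query sequence.

    `reference` maps a taxon label to a representative sequence. A result of
    None means no reference entry reached the (arbitrary) threshold.
    """
    query_kmers = kmers(query, k)
    best_name, best_score = None, 0.0
    for name, ref_sequence in reference.items():
        score = jaccard(query_kmers, kmers(ref_sequence, k))
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return best_name, best_score
    return None, best_score


# Fabricated reference catalogue and query sequences.
reference = {
    "Taxon A": "ACGTACGTTAGCCTAGGATC",
    "Taxon B": "TTGACCGATAGGCATTCAGA",
}
known = assign_taxon("ACGTACGTTAGCCTAGGATC", reference)
unknown = assign_taxon("GGGGGGGGGGGGGGGG", reference)
print(known, unknown)
```

The unassigned case corresponds loosely to the DNA-based evidence of a yet-to-be-described species mentioned above: the sequence is kept as evidence and can be matched again whenever the reference catalogue is updated. Real pipelines rely on sequence alignment and curated similarity thresholds rather than this k-mer shortcut.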

Title image Credit: GBIF, CC BY-NC 4.0, photo cropped