Abstract The development of biological informatics infrastructure capable of supporting growing data management and analysis environments is an increasing need within the systematics biology community. Although significant progress has been made in recent years on developing new algorithms and tools for analyzing and visualizing large phylogenetic data and trees, implementation of these resources is often carried out by bioinformatics experts, using one-off scripts. Therefore, a gap exists in providing data management support for a large set of non-technical users. The TOLKIN project (Tree of Life Knowledge and Information Network) addresses this need by supporting capabilities to manage, integrate, and provide public access to molecular, morphological, and biocollections data and research outcomes through a collaborative, web application. This data management framework allows aggregation and import of sequences, underlying documentation about their source, including vouchers, tissues, and DNA extraction. It combines features of LIMS and workflow environments by supporting management at the level of individual observations, sequences, and specimens, as well as assembly and versioning of data sets used in phylogenetic inference. As a web application, the system provides multi-user support that obviates current practices of sharing data sets as files or spreadsheets via email.

Citation: Beaman RS, Traub GH, Dell CA, Santiago N, Koh J, Cellinese N (2012) TOLKIN – Tree of Life Knowledge and Information Network: Filling a Gap for Collaborative Research in Biological Systematics. PLoS ONE 7(6): e39352. https://doi.org/10.1371/journal.pone.0039352 Editor: Robert DeSalle, American Museum of Natural History, United States of America Received: March 28, 2012; Accepted: May 21, 2012; Published: June 18, 2012 Copyright: © 2012 Beaman et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Funding: This research was funded by the National Science Foundation grants DEB-0431258, DEB-0829313, DEB-0827609, DEB-0953677, and IOS-0827254. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: The authors have declared that no competing interests exist.

Results TOLKIN serves as a project information management system for collaborative efforts involving biodiversity research where integration of data between research laboratories is critical. Data management strategies are focused around day-to-day research use of taxonomic, molecular, morphological and bibliographic information. Taxon pages for public access are automatically generated to include user-selected information pertinent to any taxon at the level of species and/or clades. The information served to the community is based on data stored in any of TOLKIN modules and it is automatically updated as new data is acquired or modified by users. Additionally, publicly available information is ported through an automated export mechanism to resources such as Encyclopedia of Life (www.eol.org). This mechanism generates the required XML output file that EOL can ingest whenever users are willing to serve their data through this service. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 10. Library Module. Citation information is stored, view, and linked to data across all modules. https://doi.org/10.1371/journal.pone.0039352.g010 In general, data are handled in forms familiar to the practicing systematists, as specimen collections, morphological characters, genetic markers and operational taxonomic units (OTUs). The ability to view, link, and manage data records across various modules has been an overarching development goal in TOLKIN. Figure 2 summarizes TOLKIN database schema, showing relationships of modules that support taxonomy, molecular and chromosome data, morphology, biological collections, images, and bibliography. TOLKIN modules Taxonomy module (Figure 3). Taxonomic data can be queried, browsed, and managed through tabular views and a hierarchical tree-based format with support for synonymies and nomenclatural annotations. The methods used to store taxa are independent of taxonomic rank and based on parent child relationships. Taxon ranks are stored as attributes, still maintaining the capability to search or view taxa by their rank. Every taxon has a parent, down to the root, and children, other than the terminals. Users are able to move branches in the taxonomic tree to reflect phylogenetic knowledge or they can choose to organize the tree according to alternative systems (e.g., alphabetically). Taxon names are associated with a code of nomenclature [e.g., ICZN, CPN, or PhyloCode] and carry attribute flags based on their synonymy to other entities and nomenclatural status. Collections module (Figure 4). TOLKIN handles collections as a critical means of vouchering the molecular and morphological data. Researchers typically source collections data from a variety of institutional sources. These, of course, are the primary custodians of the physical and digital voucher data, and TOLKIN serves as a project level aggregator of voucher data. In the case where institutions have not yet digitized their records, TOLKIN becomes the initial digital source, and plays a role in adding value to these data by maintaining provenance about links to related data, such as tissue samples and DNA extractions. Alternatively, in the Angiosperm AToL project, much of the collections data was sourced through the University of Florida Herbarium (FLAS), and TOLKIN simply stored a URI back to the primary resource. Projects such as Filtered Push [14] are currently developing methods of pushing value-added data back to primary custodians. Data can be imported from tabular format and spreadsheets, with automated mapping to commonly used column headings (e.g., Darwin Core). Morphology module (Figure 5). Data management is oriented around OTUs and characters, and matrices that combine the former, as reflected in standard practice in systematic biology. The interface supports matrix views, sub-setting, combining, and versioning of datasets, greatly simplifying tasks necessary to repeat experiments, re-use data and provide easier accessibility. OTUs have operational flexibility and can represent a published taxon, an informal name, or an individual collection. Characters are defined by description and character state values. Users can link documenting images, media, and collections to characters, character states, and OTUs. Projects typically maintain a master matrix and individual views maintained by researchers or teams. Data is versioned at several levels, enabling users to roll back to a snapshot that was generated automatically or by choice. Version snapshots are taken whenever multiple matrices are merged or segregated, or when data values change. When matrices are merged, validation checks resolve conflicts that may have occurred while versions were modified. Users are asked to select which source they want to provide default values. Users are able to import/export matrices to and from Nexus files and support for other file formats such as NeXML will soon be implemented. Additionally, user selected OTUs and characters can be exported from the matrix view. Molecular module (Figure 6). Data is managed around Matrices, DNA information, sequences, alignments, and primers. Like the morphology module, the molecular module employs OTUs, but the matrix columns are oriented around genetic markers as an operational unit. Therefore, each cell of the matrix represents one or more digital objects containing a sequence for a given OTU and marker. Each matrix cell displays status and metadata noting responsible parties. Color-coding and mouse-overs are used to quickly denote status and key information so that collaborators are able to easily scan for complete data, missing data, etc. Users can store as many sequences per OTU/marker combination, but one is indicated as primary for use in analysis. Sequences are stored as unaligned and can be added individually, in bulk from spreadsheets, or from GenBank. GenBank imports can be based on GenBank numbers, specific taxa or markers. In the import process, user validation is required to select the desired sequences to add. As sequences are used in multiple alignments, the alignments can be archived and linked to the individual source sequences. It is up to the users whether they want to save alignments for later reference. Chromosome module (Figure 7). This module is designed to manage data related to chromosome analyses, for example Fluorescence In Situ Hybridization (FISH). Information related to BAC (Bacterial artificial chromosome) probes and target sequences are linked to ZVI files with images. This module supports research related to constructing karyotypes and studying genome-wide changes and integrating next-generation sequencing data into molecular cytogenetics. Image gallery (Figure 8). The use of digital images to document collections and observations has skyrocketed, and will continue to increase. Beyond generic public repositories for photos (e.g., Flickr), MorphBank (www.morphbank.net) and MorphoBank [11] have goals similar to TOLKIN in providing researchers the ability to store and manage images together with the metadata and related data. TOLKIN's emphasis is on generating links between images (and other media) to related data records or matrices. The chromosome module is a good example of how chromosome images are connected to specific data files and source of molecular information. Bibliography module (Figure 9). This module provides a shared framework for access to bibliographic citations that users maintain for the project and links to data in any of the other modules. Common bibliographic formats are supported (e.g., Endnote, Bibtext) for import. Taxon pages (Figure 10). Taxon/clade page for public dissemination are generated from the same data that users store and manage for their research. Usually, formats are similar to those used by EOL and Wikipedia. However, users can provide their preferred page format that is implemented for public dissemination of data. More commonly, taxon pages include core taxonomic data, morphological descriptions, images, distribution data, maps, specimens examined, molecular information (e.g., available sequences and sequenced taxa), etc. Enhanced interactive mapping capability that can display point occurrences of collections is implemented, and we will soon support “What's in my backyard?" queries (a geospatial query), where a taxon or collections list is returned based on proximity to a mouse click on the map. These types of queries can automatically generate a species inventory list, or e-flora, for a selected geographic region (e.g., country, state/province, county, national park) or at a specified radius from a point. As projects mature or are completed, data can finally be pushed to long-term archival repositories. When users are ready to publicly serve their original data (e.g., images), they are encouraged to select or state the type of license under which they wish to release it. A number of published studies demonstrate the use of TOLKIN to manage datasets for large phylogenetic analyses, data syntheses and taxonomic treatments [15]–[19].

Discussion There are a number of capabilities and priorities that have guided the development of TOLKIN. These include 1) handling a combination of molecular, morphological, collections, and taxonomic data not present in similar resources; 2) providing research teams from distant labs with the ability to collaborate on large biodiversity datasets in real-time; 3) maintaining provenance and versioning of molecular and morphological data; and 4) improving capabilities to synthesize data from multiple sources and thus facilitate porting data produced within projects to long-term, archival repositories (e.g., GenBank). Researchers working individually have been well served by single-user desktop tools, up to a point. Collections databases, whether publicly (e.g., Specify) or commercially (e.g., EMu) funded are well adopted and used by museums and academic institutions that maintain collections and researchers that use those data for geospatial modeling, documenting monographs and revisions, and for managing taxonomic and nomenclatural information. Tools for descriptive data and key generation (e.g., Lucid) serve a separate purpose, as do web resources such as MorphBank and MorphoBank. TOLKIN does not perform the same tasks, but having the ability to integrate across those research domains is an infrastructure need that can be expected to grow. As systematics, and biodiversity research in general, has become increasingly interdisciplinary and collaborative, tasks and responsibilities are divided among collaborators. The way data is shared becomes increasingly important to the repeatability of the analysis. The phalanx of single-user desktop tools has not kept up with this need. TOLKIN addresses data sharing through a centralized resource where users can perform essential functions of adding, editing, organizing, and integrating batch imports and exports in a way that sharing spreadsheets and other files, where versions become out of sync, does not. These functions within TOLKIN are aimed at managing active projects, so that when these activities have run their course or users are ready to push data to archival resources such as Dryad and TreeBASE, they can do so. Users need to be able to maintain the provenance and repeatability of their analytical experiments (e.g., alignment, tree inference), and TOLKIN supports this requirement in a web-based multiuser environment. Essentially, collaborators are able to create master matrices that can become quite large in OTUs and characters/markers. Each user can generate subsets of the master and repurpose it for alternative or additional modifications; subsets can later on be re-integrated with the master file. As a whole, teams can view, manage, and track individual responsibilities for portions or the full dataset. TOLKIN was not designed as a permanent repository, but the capability to push data out to permanent repositories such as GenBank and EOL is seen as essential. Unfortunately, with GenBank, this is still problematic, as NCBI does not maintain a stable easy-to-use set of web services for sequence deposit. The management of matrix-based data is ubiquitous within the science community, and is specifically applied in both morphological and molecular data sets. TOLKIN provides a significant contribution not just in supporting shared editing, but also in maintaining views and control over layers of matrix-based data. This can be understood as analogous to a spreadsheet with embedded hyperlinks, where clicking on a cell exposes all the metadata about the content of that cell, including alternative and currently active values (e.g., observations or sequences made at different times). Within each cell, users can drill down through versioned layers of the matrix. TOLKIN fills a need for managing biodiversity data at multiple levels, where users can describe objects at an atomic level (e.g., the sequences and primers related to a specific DNA extraction), or collectively as a dataset (e.g., the multiple sequences aligned and ready for analysis). Users can add, and manage sequences individually or import them in bulk from resources like GenBank. Data aggregation functions such as data import, and versioning are supported similarly. Users are able to change or designate a primary, or current, version of a sequence to be used in a particular analysis, and take a snapshot of a version of a matrix, modify it, run a new analysis, and if necessary, roll back to a previous version of the matrix. Instead of sharing spreadsheets or files, and having multiple versions floating around in email archives or file systems, where they often get out of sync, collaborators can see what everyone else is doing. Research teams are also able to designate who created and who is managing certain data elements. As collaboration and large data sets are becoming the norm in systematic biology, information infrastructure that is capable of supporting this growth must also grow and be usable by those who are not necessarily trained in informatics or computational sciences. Availability and Future Directions Currently, TOLKIN is managing a number of collaborative projects funded by the National Science Foundation. As a server and service based platform, rather than a desktop application, there is an element of infrastructure maintenance that need to be addressed in terms of minimal support required for the software to run, and data to be maintained and accessed. This includes server maintenance, administration, software updates, and support to users. Beyond any potential to obtain funding for enhancements to TOLKIN, a subscription model is a viable alternative to offset maintenance costs. Subscribing collaborative projects that need data management and informatics support would benefit from previous investments and would not have to bear the burden of de novo informatics tooling. Data management, analysis, and visualization of biological data are increasingly data and computationally intensive. This trend comes as a necessity through the advent and ubiquity of high-resolution and high-throughput data capture, e.g., next generation sequencing approaches. The availability and use of web services to provide access to computational capability for intensive and time-consuming analyses is increasingly common and biological research is becoming not just informatics-enabled, but informatics-dependent. Large datasets must be merged from various sources, sub-sampled, recombined, transformed and transmitted back and forth between desktop tools, web applications and servers. In practice, users must be technically adept and able to deal with heterogeneous inputs/outputs of web services. The use of scripting to pipeline data streams and different tools at various stages of analysis are essential. Pipelines can be effectively modeled as workflows, and desktop workflow software, e.g., Kepler [20], [21] and Taverna [22] among others, provide graphical environments with which to compose workflows according to the biological analysis procedure. Workflow software tools allow the integration of web services within analytical pipelines. The inputs/outputs of such web services running inside workflows usually tend to be simple-typed such as a piece of raw sequence or an object ID, which severely restricts the use of tools with more complex inputs/outputs in a workflow environment. As helpful as they are, existing workflow software remain hard to learn and/or challenging to master. TOLKIN is currently experimenting with complex workflows wrapped as web services and their streamlined deployment. A TOLKIN web service allows users to submit, run and manage a fully functioning abstract workflow composed of predefined tools. Abstract workflows are analytical pipelines whose components are specified by users but seamlessly designed and managed on the server side. Plans to build TOLKIN pipeline libraries are in place and these will initially include phylogenetic and population genetic analyses. Finally, as data are exported and deposited into permanent repositories, there is a clear need to develop mechanisms for propagating data and metadata updates made in TOLKIN (e.g., taxonomic names). Although this is a desirable feature that will be a focus of future development, other projects such as BiSciCol (www.biscicol.blogspot.com) may also provide the alternative infrastructure to track changes of data and metadata stored in multiple repositories.

Acknowledgments We are grateful to two anonymous reviewers for their constructive suggestions that helped improve this paper. Throughout the course of our software development the following people helped us to address specific user requirements and technical issues: Purnashree Bhattacharya, Andre Chanderbali, Ugandhar Reddy Chittamurru, Bhavik Ghandi, Chris Goddard, Srinivas Muddana, Dwaipayan Mukhopadhyay, Kate Rachwall, Rakesh Nair, Rajat Sehgal, Stephen A. Smith, Pam Soltis, Doug Soltis, Ramachandren Srinivasan, and Haijun Zhu. Finally, we are very grateful to all our users for their patience and invaluable suggestions; in particular, we would like to thank Ricarda Riina from the PBI-Euphorbia project for her constructive feedback and constant support.

Author Contributions Conceived and designed the experiments: RSB NC. Wrote the paper: RSB NC. Coded the software: GHT CAD NS. Tested the software: JK.