Open-access, freely available web-based sources, such as the NCBI Taxonomy database (http://www.ncbi.nlm.nih.gov/taxonomy), the NCBI Nucleotide database (http://www.ncbi.nlm.nih.gov/nuccore) and the PubMed citation index (http://www.ncbi.nlm.nih.gov/pubmed), provide valuable information on topics that may not have been their primary focus. For instance, the descriptive metadata uploaded with genetic sequence submissions can be mined to identify interactions among species, including host-pathogen interactions, or the geographical distribution of the sequenced organism. Below we describe the processes by which raw data were obtained from these sources, and the methods by which interactions between organisms, or between organisms and their geographical locations, were extracted from these data. Figure 1 illustrates the overall process.

Figure 1: Overview of the methods of identifying species-species and species-location interactions. The first panel lists the resources used, in a colour-coded fashion. H refers to host and C to country tags in the sequence metadata. PMID is the PubMed unique identifier used in retrieving papers. The second panel explains the method of interrogating the evidence bases to extract species (cargo)-species (carrier) interactions. The species of the sequenced organism (i.e., the cargo) is first identified using the taxonomic tree; the host tag in the sequence metadata is then disambiguated using the taxonomic tree to identify the carrier species. Lists of PMIDs obtained for cargo and carrier species are intersected to provide additional evidence for the interactions extracted from the sequence metadata and to identify new relationships between cargo and carrier species discovered from the sequence metadata. The third panel illustrates the method of extracting species-location interactions from the evidence base. First, sequenced organisms and location information are extracted from the sequence metadata. The species of each sequenced organism is then identified using the taxonomic tree. The location data (L) are split into country (C) and region (R) strings. Both are then disambiguated using the data gathered from GeoNames to obtain the country and region where the species was found. GeoNames is also used to interrogate PubMed for papers about each country and region in the database. These are then intersected with the species publications, and the shared set is used as evidence for the species being found in a given location.

Data repositories

Organisms, rankings and taxonomic hierarchy

856,031 organism scientific names, their unique identifiers (TaxID), taxonomic ranks and classifications were obtained from the NCBI Taxonomy database (http://www.ncbi.nlm.nih.gov/Taxonomy/). This dataset was manually supplemented with an additional 2,171 organisms of interest that were not found in the NCBI Taxonomy database.

The NCBI assignment of the ranks ‘species’, ‘subspecies’ and ‘no rank’ (excluding viruses and viroids) was subsequently altered according to the following rules:

1. Where organism name contained numbers => No rank.
2. Where organism name contained any of the following words (unclassified, uncultured, var) => No rank.
3. Where word count = 2 => Species.
4. Where word count ≥ 3 => Subspecies.
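The rules above can be sketched as a small function. This is a minimal illustration rather than the authors' implementation: the function name `reclassify` is hypothetical, and it assumes that the rules are applied in the order listed, that matching is case-insensitive, and that "var." with a trailing period counts as the word "var". Single-word names are not covered by the rules, so the sketch returns `None` for them.

```python
import re
from typing import Optional

# Words that force a name to 'No rank' (rule 2 in the text).
NO_RANK_WORDS = {"unclassified", "uncultured", "var"}


def reclassify(name: str) -> Optional[str]:
    """Return the adjusted rank for an organism name, or None if no rule applies."""
    # Rule 1: any digit anywhere in the name.
    if re.search(r"\d", name):
        return "No rank"
    words = name.lower().split()
    # Rule 2: flagged words; surrounding punctuation is stripped (assumption).
    if any(w.strip(".,()") in NO_RANK_WORDS for w in words):
        return "No rank"
    # Rule 3: exactly two words.
    if len(words) == 2:
        return "Species"
    # Rule 4: three or more words.
    if len(words) >= 3:
        return "Subspecies"
    # Single-word names fall outside the published rules.
    return None
```

For example, `reclassify("Canis lupus familiaris")` yields "Subspecies", whereas `reclassify("Escherichia coli O157")` yields "No rank" because the digit check runs first.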

The application of the above rules resulted in the following changes: Species count decreased from 579,175 to 224,751. Subspecies count increased from 12,289 to 14,821. No rank count increased from 80,692 to 432,584. The remaining organisms were classified at or above genus level and were therefore excluded from the datasets described in this paper.

Taxonomic lineage relationships of the form Organism A is a parent of Organism B were also obtained from NCBI to replicate the hierarchical phylogenetic structure for the species, subspecies and no ranks listed above so that outputs can be obtained for species and higher taxonomic groups (for instance ‘flaviviruses’, ‘ruminants’). The 2,171 additional organisms were forced into the resulting tree, assigned to the most suitable parent nodes; where the correct parent was not found already in the tree, new nodes were added.
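A parent-of relation of this kind is enough both to roll an organism up to its parent species and to test membership of a higher group. The sketch below is illustrative only: the `PARENT` and `RANK` tables are a simplified toy tree invented for the example, and the function names are hypothetical.

```python
# Toy child -> parent table (simplified and invented for illustration).
PARENT = {
    "Canis lupus familiaris": "Canis lupus",
    "Canis lupus": "Canis",
    "Canis": "Canidae",
    "Bos taurus": "Bos",
    "Bos": "Ruminantia",
}
# Ranks of the nodes that matter for the roll-up.
RANK = {
    "Canis lupus familiaris": "subspecies",
    "Canis lupus": "species",
    "Bos taurus": "species",
}


def parent_species(node):
    """Walk up the tree until a species-rank node is reached; None if there is none."""
    seen = set()
    while node is not None and node not in seen:
        if RANK.get(node) == "species":
            return node
        seen.add(node)  # guard against accidental cycles in the table
        node = PARENT.get(node)
    return None


def in_group(node, group):
    """True if `group` is the node itself or one of its ancestors."""
    seen = set()
    while node is not None and node not in seen:
        if node == group:
            return True
        seen.add(node)
        node = PARENT.get(node)
    return False
```

With this toy tree, `parent_species("Canis lupus familiaris")` returns "Canis lupus", and `in_group("Bos taurus", "Ruminantia")` is true, which is the kind of query needed to produce outputs for a group such as ‘ruminants’.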

Some organisms were then linked to a collection of alternative names (e.g., common names, common misspellings, breeds and acronyms) that were collected from a variety of sources, including textbooks. Additional care was given to humans and their domestic animals; here, we focussed on 46 species of common domestic animals in Europe (ref. 9). Where needed, the organisms were linked to sets of inclusion (AND) and exclusion (NOT) terms. These sets (alternative names, inclusion and exclusion terms) were utilised in disambiguating organism names and in retrieving publication metadata from PubMed, as described in subsequent sections.

Geographical names

To enable the discovery of the geographical distribution of species, a comprehensive dictionary of geographical names was built. First, a list of countries and their alternative names was obtained from the GeoNames geographical database (http://www.geonames.org), and subsequently supplemented with the list of countries available in the Medical Subject Headings (MeSH) library (http://www.ncbi.nlm.nih.gov/mesh). For each country (particularly for larger countries), information about the country's administrative divisions was collected; state codes and acronyms for countries such as the United States, Brazil and China were also added (e.g., NY for New York). For the purposes of the datasets described here, only the first-level administrative divisions (hereafter regions) were required; for these regions (e.g., home nations in the UK, states in the USA), extensive lists of major cities, natural features and unique place names were also obtained from GeoNames and other sources.

Evidence curation

Nucleotide sequences

A total of 39,238,061 nucleotide sequence metadata files, covering the period 1993–2012, were retrieved in XML format from the NCBI Nucleotide database. The following data items were extracted (where available) from each metadata file (Fig. 2 illustrates this process):

1. NCBI TaxID: using this identifier we were able to link 19,717,726 sequences with 171,967 corresponding species, 1,106,525 sequences with 10,989 subspecies, and 5,941,718 sequences with 245,532 no rank organisms.
2. Host: where available (7.1%), the host tag indicated the possibility of the sequenced organism being found in or on a host species.
3. Country: where available (17.5%), the country tag indicated the possibility of the organism being found in a certain geographical location that can be associated with a single country (and/or water body). 59.9% of sequences with a country tag contained additional location information, such as the name or code of a state, a river, or a national park in which the organism was found.
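As a rough illustration of this extraction step, the sketch below pulls the three items out of one XML record with Python's standard library. The element names are modelled on GenBank's GBSeq XML layout, but the exact schema of the files used here is an assumption, and the sample record (including its TaxID) is invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample record; element names follow a GBSeq-like layout (assumption).
SAMPLE = """<GBSeq>
  <GBSeq_feature-table>
    <GBFeature>
      <GBFeature_key>source</GBFeature_key>
      <GBFeature_quals>
        <GBQualifier>
          <GBQualifier_name>host</GBQualifier_name>
          <GBQualifier_value>Sus scrofa</GBQualifier_value>
        </GBQualifier>
        <GBQualifier>
          <GBQualifier_name>country</GBQualifier_name>
          <GBQualifier_value>China: Shantou</GBQualifier_value>
        </GBQualifier>
        <GBQualifier>
          <GBQualifier_name>db_xref</GBQualifier_name>
          <GBQualifier_value>taxon:12345</GBQualifier_value>
        </GBQualifier>
      </GBFeature_quals>
    </GBFeature>
  </GBSeq_feature-table>
</GBSeq>"""


def extract_items(xml_text):
    """Return the TaxID, host and country found in one metadata record."""
    items = {"taxid": None, "host": None, "country": None}
    root = ET.fromstring(xml_text)
    for qual in root.iter("GBQualifier"):
        name = qual.findtext("GBQualifier_name")
        value = qual.findtext("GBQualifier_value")
        if not value:
            continue
        if name == "db_xref" and value.startswith("taxon:"):
            # The TaxID is carried as a 'taxon:<number>' cross-reference.
            items["taxid"] = int(value.split(":", 1)[1])
        elif name in ("host", "country"):
            items[name] = value.strip()
    return items
```

Records lacking a host or country qualifier simply keep `None` for that item, mirroring the "where available" wording above.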

Figure 2: Example illustrating the information extracted from sequence metadata (sequence ID 158668169).

Publications

Comprehensible search terms were automatically built using the three sets of names associated with each organism, adhering to the following rule: ((any of the organism names and alternative names) AND (all of the inclusion terms)) NOT (any of the exclusion terms). Below is the search term generated for classical swine fever virus:

(‘classical swine fever virus’ [Text Word] OR ‘csfv’ [Text Word] OR ‘hog cholera virus’ [Text Word] OR ‘pestivirus type 2’ [Text Word] OR ‘swine fever virus’ [Text Word]) NOT ‘african swine fever’ [Text Word]
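The rule can be sketched as a string builder. The function name `build_term` and its parameters are hypothetical, and the sketch renders the typographic quotes shown above as straight double quotes, on the assumption that this is what would be submitted to PubMed.

```python
def build_term(names, include=(), exclude=(), field="Text Word"):
    """Build ((any name) AND (all inclusion terms)) NOT (any exclusion term)."""

    def tag(text):
        # Quote the phrase and attach the PubMed field tag.
        return f'"{text}" [{field}]'

    # Any of the organism names or alternative names.
    term = "(" + " OR ".join(tag(n) for n in names) + ")"
    # All of the inclusion terms, if there are any.
    if include:
        term = "(" + term + " AND (" + " AND ".join(tag(t) for t in include) + "))"
    # None of the exclusion terms.
    if len(exclude) == 1:
        term += " NOT " + tag(exclude[0])
    elif len(exclude) > 1:
        term += " NOT (" + " OR ".join(tag(t) for t in exclude) + ")"
    return term


csfv_term = build_term(
    ["classical swine fever virus", "csfv", "hog cholera virus",
     "pestivirus type 2", "swine fever virus"],
    exclude=["african swine fever"],
)
```

Here `csfv_term` reproduces the structure of the classical swine fever virus example: five alternatives joined with OR, followed by a single NOT clause.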

6,473,167 citation metadata files were downloaded in XML format from the PubMed database. 6,028,487 of these files cited 7,463 species, 323,483 cited 208 subspecies and 674,836 cited 1,482 no rank organisms. Note that one paper may cite more than one organism.

Identification of interactions

Using the data and evidence obtained and processed as discussed above, two types of interactions were identified: species-species interactions and species-geographic location interactions.

Species-species interactions

Species-species interactions indicate the possibility of one species (Cargo G) being found in or on another species (Carrier A). Many of these interactions are of the type "Pathogen P was found in Host H"; however, due to the nature of the underlying evidence, we cannot assume that all interactions are of this type. Interactions can also be commensal (neither beneficial nor costly), mutualistic (beneficial to both species) or vector-host. Additionally, an organism that is pathogenic to one host may be non-pathogenic in another, so it is inappropriate to label the organism itself a pathogen; rather, it is interactions between species that are pathogenic or non-pathogenic. We therefore use a more generic terminology: cargoes are found in or on carriers. Cargoes are often pathogens and carriers are often hosts, but this is not always the case.

The identification of carrier-cargo interactions is a two-step process:

1. Evidence extraction from nucleotide sequence metadata: we identified 2,706,620 metadata files (7.1% of the files obtained from NCBI) in which information is provided for the host tag. Where the metadata for an organism includes an entry for the host tag, we infer a cargo-carrier interaction. These files were processed as follows:
   a) Cargo species identification: sequenced organisms ranked above species level were discarded from this dataset. Subspecies and no ranks were used to recursively identify their parent species in conjunction with the taxonomic tree. In other words, if the cargo is a subspecies, we store the interaction of the parent species with the carrier, not the subspecies itself.
   b) Carrier species identification: the host tag was used to directly identify 73.6% of carriers to species level. For the remaining sequences a simple disambiguation algorithm was applied, resulting in the identification of 94.5% of carriers. As with cargoes, organisms ranked above species level were discarded, and organisms ranked below species level (subspecies and no rank) were assigned to their parent species.
2. Evidence extraction from publications: having used the nucleotide database to define organisms as cargoes and carriers, we used this information to interrogate the publication metadata files obtained from PubMed. First, we retrieved all publication metadata files from PubMed for all cargoes and carriers identified above. Then, we intersected the two sets for common PubMed identifiers (i.e., finding papers that were in both sets). This enabled us to identify new combinations of carrier and cargo that were not apparent from the nucleotide evidence. Following a validation exercise (ref. 9), a threshold of at least 5 papers was applied for including a publication-only interaction that was not backed up by sequence-based evidence.
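The publication step amounts to a set intersection followed by a threshold. The sketch below assumes that PMIDs are already held as Python sets keyed by species name; the function name, the data layout and the skipping of self-pairs are illustrative assumptions, not details given in the text.

```python
def cargo_carrier_interactions(cargo_pmids, carrier_pmids, sequence_pairs, min_papers=5):
    """Combine sequence-based and publication-based evidence for cargo-carrier pairs.

    cargo_pmids / carrier_pmids: dict mapping species name -> set of PMIDs.
    sequence_pairs: set of (cargo, carrier) pairs inferred from host tags.
    """
    interactions = {}
    for cargo, cargo_set in cargo_pmids.items():
        for carrier, carrier_set in carrier_pmids.items():
            if cargo == carrier:
                continue  # assumption: a species is not paired with itself
            shared = cargo_set & carrier_set  # papers mentioning both species
            has_sequence = (cargo, carrier) in sequence_pairs
            # Publication-only pairs need at least `min_papers` shared papers.
            if has_sequence or len(shared) >= min_papers:
                interactions[(cargo, carrier)] = {
                    "sequence_evidence": has_sequence,
                    "shared_papers": len(shared),
                }
    return interactions


# Toy data, invented for illustration.
cargo = {"virus A": {1, 2, 3, 4, 5, 6}, "virus B": {7, 8}}
carrier = {"host X": {2, 3, 4, 5, 6, 9}, "host Y": {7, 10}}
found = cargo_carrier_interactions(cargo, carrier, {("virus B", "host Y")})
```

In this toy run, "virus A" and "host X" share five papers and qualify on publications alone, while "virus B" and "host Y" share only one paper but are kept because a sequence record supports them.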

In this way, 22,515 unique species interactions were generated between 6,314 carrier species and 8,905 cargo species. Figure 3 presents an example of how this dataset could be used to analyse and present potential pathogens (bacteria, viruses, fungi, helminth and protozoa species) shared between vertebrate species in Data Citation 1.

Figure 3: Shared pathogens between vertebrate species in Data Citation 1. Each node represents a vertebrate species. The size of the node is in proportion to the number of unique pathogen species found to interact with it. An edge between two nodes indicates that they share at least one possible pathogen species. The weight (thickness) of each edge is in proportion to the number of possible pathogen species shared between the two nodes. The location of each node is determined by the size of all nodes in the graph and the weight of the edges linking that node with other nodes.
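The node sizes and edge weights described in the caption can be derived directly from a list of cargo-carrier pairs. The following is a minimal sketch with hypothetical names and invented toy data; it assumes the pairs have already been filtered to pathogen cargoes and vertebrate carriers, and it leaves layout and drawing to a graph library.

```python
from collections import defaultdict
from itertools import combinations


def shared_cargo_graph(pairs):
    """Return (node_sizes, edge_weights) for an iterable of (cargo, carrier) pairs."""
    cargo_by_carrier = defaultdict(set)
    for cargo, carrier in pairs:
        cargo_by_carrier[carrier].add(cargo)
    # Node size: number of unique cargo species per carrier.
    node_sizes = {carrier: len(cargoes) for carrier, cargoes in cargo_by_carrier.items()}
    # Edge weight: number of cargo species shared by two carriers (edges with 0 omitted).
    edge_weights = {}
    for a, b in combinations(sorted(cargo_by_carrier), 2):
        shared = len(cargo_by_carrier[a] & cargo_by_carrier[b])
        if shared:
            edge_weights[(a, b)] = shared
    return node_sizes, edge_weights


# Toy data, invented for illustration.
toy_pairs = [
    ("p1", "cow"), ("p2", "cow"),
    ("p1", "sheep"), ("p2", "sheep"), ("p3", "sheep"),
    ("p4", "dog"),
]
sizes, weights = shared_cargo_graph(toy_pairs)
```

In the toy data, "cow" and "sheep" share two cargo species and are linked by an edge of weight 2, while "dog" remains an isolated node.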

Species-location interactions

Location interactions indicate the possibility of a species being found in a certain location. Locations were interpreted at two levels: country C and region R. Regions correspond to first administrative divisions rather than geographical or natural regions (e.g., states (USA), departments (France), home nations (UK), etc.). Similarly to above, these interactions were extracted in two steps.

1. Evidence extraction from nucleotide sequence metadata: 6,714,520 metadata files in which location information was provided (about 17.5% of the total) were processed as follows:
   a) Species identification: sequenced organisms were processed in a similar way to cargoes in the previous subsection.
   b) Location identification: the string within the country tag was assumed to adhere to the format ‘Country: Location’, the typical format of items within the nucleotide database.
      i. Country identification: the country part of the extracted strings was matched against our collection of geographical identifiers. Where the country was found to be a historical one (e.g., Yugoslavia), region information (if available) was used to identify the country (and, where possible, the region); otherwise the data were discarded. Water bodies (e.g., oceans and seas) were also discarded where no region substring was provided, or where the region substring was also a water body.
      ii. Region identification: countries without administrative divisions and small countries (e.g., Andorra) were excluded from this step. A region identification algorithm was applied. The algorithm splits the location string into substrings, matches each substring against the collected location names from higher to lower ranked places (e.g., first administrative divisions, capitals, second administrative divisions, third administrative divisions, cities, towns, villages), and selects the highest ranked match. Below are some examples:
         A. ‘Italy: Milan’: C=Italy and R=Regione Lombardia.
         B. ‘USA: MA’: C=United States and R=Massachusetts.
         C. ‘China: Shantou’: C=China and R=Guangdong Sheng.
         D. ‘United Kingdom: Yorkshire, Old Peak’: C=United Kingdom and R=England.
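The ‘Country: Location’ parsing and the highest-ranked-match rule can be sketched as follows. The lookup tables are tiny stand-ins for the GeoNames-derived dictionary, the numeric ranks are invented (a lower number meaning a higher-ranked place), and splitting the location on commas and semicolons is an assumption, since the text does not specify the delimiters.

```python
import re

# Stand-in for the country dictionary: lower-cased alias -> canonical country.
COUNTRY_ALIASES = {
    "italy": "Italy",
    "usa": "United States",
    "united states": "United States",
    "china": "China",
    "united kingdom": "United Kingdom",
}
# Stand-in gazetteer: (country, lower-cased place) -> (rank, region).
# Rank 0 is a first administrative division or its code; larger numbers rank lower.
PLACES = {
    ("Italy", "milan"): (4, "Regione Lombardia"),
    ("United States", "ma"): (0, "Massachusetts"),
    ("China", "shantou"): (4, "Guangdong Sheng"),
    ("United Kingdom", "yorkshire"): (2, "England"),
}


def parse_location(raw):
    """Split a 'Country: Location' string and return (country, region)."""
    country_part, _, location_part = raw.partition(":")
    country = COUNTRY_ALIASES.get(country_part.strip().lower())
    if country is None:
        return None, None  # unknown country: the record would be discarded
    best = None
    for piece in re.split(r"[,;]", location_part):
        hit = PLACES.get((country, piece.strip().lower()))
        # Keep the highest-ranked (lowest rank number) match.
        if hit is not None and (best is None or hit[0] < best[0]):
            best = hit
    return country, (best[1] if best else None)
```

For instance, `parse_location("United Kingdom: Yorkshire, Old Peak")` resolves to the United Kingdom and England, because "Yorkshire" is in the toy gazetteer and "Old Peak" is not.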
2. Evidence extraction from publications: suitable PubMed search terms were generated for countries and their regions, taking into account whether the country is in the MeSH library, and including the main geographical locations (such as region capitals, main cities, counties, etc.) in the search terms, as per the following steps:
   a) The country C is in the MeSH library:
      i. A MeSH-based search term was generated to retrieve PubMed paper identifiers (PMIDs) for publications about C. We refer to this set of PMIDs as PMID_C.
      ii. For each region R of C, a title-and-abstract-only search term was generated using the region name, major cities and landmarks within the region. For instance, the following search term was used to retrieve publications about Scotland: (‘Scotland’ [title or abstract] or ‘Glasgow’ [title or abstract] or ‘Edinburgh’ [title or abstract] or …). We refer to this set of PMIDs as PMID_R.
      iii. Only the PMIDs appearing in both sets were included when extracting information about regions within a country. We refer to this set as PMID_RC.
      iv. The set of PMIDs retrieved for each species was then intersected with PMID_C and PMID_RC.
      v. Where the results of the intersection contained five or more publications, the interaction of species-country or species-region (in a country) was added to the database.
   b) The country C is not in the MeSH library: an altered search term was used to look for the country in the title or abstract of the papers, using the country's official name and the set of alternative names. Steps (a).ii-(a).v were then executed as described above.
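Steps iii to v reduce to a few set operations once the three PMID sets are available. The sketch below uses a hypothetical function name and invented PMIDs, and it assumes the sets have already been retrieved from PubMed.

```python
def location_evidence(species_pmids, country_pmids, region_pmids, min_papers=5):
    """Decide whether a species is linked to a country and to one of its regions."""
    # Step iii: region papers only count if they are also about the country.
    region_in_country = region_pmids & country_pmids
    # Step iv: intersect the species papers with both location sets.
    country_hits = species_pmids & country_pmids
    region_hits = species_pmids & region_in_country
    # Step v: apply the five-paper threshold.
    return {
        "country": len(country_hits) >= min_papers,
        "region": len(region_hits) >= min_papers,
    }


# Toy PMIDs, invented for illustration.
species = set(range(1, 11))        # papers 1..10 mention the species
country = set(range(1, 8)) | {20}  # papers 1..7 and 20 are about the country
region = {3, 4, 5, 6, 9, 30}       # paper 9 names the region but not the country
evidence = location_evidence(species, country, region)
```

In this toy case the species-country link passes the threshold with seven shared papers, whereas the species-region link does not: paper 9 is dropped at step iii, leaving only four shared papers.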

In this way, 157,204 locations for 72,533 species were identified.

The Enhanced Infectious Disease Database (EID2)

The raw data curated in the above steps and the identified interactions are stored in a web-fronted relational database, the Enhanced Infectious Disease Database (EID2) (www.zoonosis.ac.uk/EID2/). The database is continuously updated with new organisms, evidence and interactions. EID2 uses a 4-tier modular architecture that separates the web front end from the business logic and the database services, following the S#arp Architecture model. EID2 is a web-based system; its user interface (UI) is accessible via multiple web browsers and is supported by the ASP.NET MVC framework from Microsoft Corporation. EID2 integrates various technologies, such as Fluent NHibernate for convention-based, strongly typed mapping, and a number of technologies for visualisation and data display. EID2 is freely accessible after a simple, free-of-charge registration and subsequent login. In addition to the datasets presented in this paper, the EID2 UI enables the user to access each of the pieces of evidence on which the interactions were based, to generate maps of the distribution of all organisms at both country and region level, and to access climate and other useful data.