Abstract The diversity of online resources storing biological data in different formats provides a challenge for bioinformaticians to integrate and analyse their biological data. The semantic web provides a standard to facilitate knowledge integration using statements built as triples describing a relation between two objects. WikiPathways, an online collaborative pathway resource, is now available in the semantic web through a SPARQL endpoint at http://sparql.wikipathways.org. Having biological pathways in the semantic web allows rapid integration with data from other resources that contain information about elements present in pathways using SPARQL queries. In order to convert WikiPathways content into meaningful triples we developed two new vocabularies that capture the graphical representation and the pathway logic, respectively. Each gene, protein, and metabolite in a given pathway is defined with a standard set of identifiers to support linking to several other biological resources in the semantic web. WikiPathways triples were loaded into the Open PHACTS discovery platform and are available through its Web API (https://dev.openphacts.org/docs) to be used in various tools for drug development. We combined various semantic web resources with the newly converted WikiPathways content using a variety of SPARQL query types and third-party resources, such as the Open PHACTS API. The ability to use pathway information to form new links across diverse biological data highlights the utility of integrating WikiPathways in the semantic web.

Author Summary WikiPathways is a crowd-sourced online platform for biological pathways. It is based on the same underlying platform as Wikipedia. Pathways are saved as graphical images embedded in a set of metadata elements (i.e. references, list of pathway elements, and context annotations). Pathways are used as proxies of biological knowledge in their role as descriptors of processes. Yet integrating these hubs of biological knowledge with other biological data resources remains challenging due to a cacophony of file formats, identifier systems, and hidden content. We show the application of the semantic web to enable a straightforward integration of heterogeneous biological data sources. We have taken high-quality pathways from a curated set from WikiPathways and converted the content into a data format native to the semantic web. Here, data is expressed as a set of statements where the statements are built upon a set of web addresses. As a result, we successfully integrated external resources (e.g., EBI Expression Atlas) and pathway content with a single query.

Citation: Waagmeester A, Kutmon M, Riutta A, Miller R, Willighagen EL, Evelo CT, et al. (2016) Using the Semantic Web for Rapid Integration of WikiPathways with Other Biological Online Data Resources. PLoS Comput Biol 12(6): e1004989. https://doi.org/10.1371/journal.pcbi.1004989 Editor: Christos A. Ouzounis, Hellas, GREECE Received: February 15, 2016; Accepted: May 17, 2016; Published: June 23, 2016 Copyright: © 2016 Waagmeester et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Data Availability: All files are available for download at http://rdf.wikipathways.org. The data are also accessible through SPARQL queries at http://sparql.wikipathways.org or through REST-calls at https://dev.openphacts.org/docs. Funding: This work was supported by: Innovative Medicines Initiatives Joint Undertaking under grant agreement no115191 (http://www.imi.europa.eu/content/open-phacts); and NIH National Institute for General Medical Sciences (R01-GM100039) (https://www.nigms.nih.gov/Research/Pages/default.aspx). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: The authors have declared that no competing interests exist.

Introduction Pathway analysis and visualisation of data on pathways provide insights into the underlying biology of effects found in genomics, proteomics, and metabolomics experiments [1–4]. WikiPathways is a pathway repository where content is provided by the community at large [5, 6]. In a given pathway, elements like genes, proteins, metabolites, and interactions are identified using common accession numbers from reference databases such as Entrez Gene [7], Ensembl [8], UniProt [9], HMDB [10], ChemSpider [11], PubChem [12] and ChEMBL [13]. Multiple databases can be referenced to annotate an element of the same semantic type, e.g. Ensembl and Entrez Gene to annotate gene information. Even single studies sometimes use different reference databases to annotate experimental findings. It is common for bioinformaticians to spend valuable time dealing with data mapping issues that impede the actual data analysis and interpretation. In WikiPathways we use the open source software framework BridgeDb [14] to help resolve different identifiers representing the same (or related) entities. Capturing a semantically correct description of biological entities and their connections across datasets is the broader challenge that we have to address. The semantic web provides an approach to define entities and their relationships. By explicitly defining these entities and relationships the semantic web can provide a network of linked data [15]. The Resource Description Framework (RDF) consists of two key components: statements and universal identifiers. Each statement is captured as a triple, consisting of a subject, a predicate, and an object. For example, the following triple defines the glucose molecule as being part of the glycolysis pathway:
    glycolysis  has member  glucose
The notion of a semantic web surfaces as you link across large sets of triples representing a vast number of objects and diverse types of concepts and predicates.
The use of uniform resource identifiers, or URIs [16], provides consistency when specifying subjects and objects. For example, identifiers.org [17] provides a clearinghouse for a wide variety of URIs for biological entities in the life science domain. WikiPathways provides identifiers for all its pathways and identifiers.org provides the URI scheme to make these resolvable. Standardized URIs for predicates come from efforts such as the Simple Knowledge Organization System (SKOS) [18]. Our example triple above can be expressed in a more universal way as:
    <http://identifiers.org/wikipathways/WP534> <http://www.w3.org/2004/02/skos/core#member> <http://identifiers.org/chebi/CHEBI:4167> .
where each element is uniquely and universally resolvable to a defined concept (glycolysis, “has member”, and glucose respectively). Of course, the more human readable information can also be explicitly added by describing the labels in RDF. But that information is also available by resolving the URIs.
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX wp: <http://identifiers.org/wikipathways/>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX chebi: <http://identifiers.org/chebi/CHEBI:>
    wp:WP534 skos:member chebi:4167 .
    wp:WP534 rdfs:label "Glycolysis and Gluconeogenesis (Homo sapiens)"@en .
    chebi:4167 rdfs:label "Glucose"@en .
In order to contribute pathway knowledge to the semantic web, we have modeled the content of WikiPathways to form triple-based statements. The interactions and reactions curated at WikiPathways are particularly well-suited to enrich the overall connectivity of the semantic web. Pathways offer a meaningful context for relations between biological entities, such as proteins, metabolites and diseases that are otherwise defined in disparate databases. We report on the conversion process and the development of two new vocabularies essential in capturing the semantics behind pathway diagrams.
Finally, we evaluate the use of the semantically linked pathway knowledge through specialized queries and third-party resources, showing how to link WikiPathways with disease annotations (from UniProt [9] and DisGeNET [19]), with gene-expression values (from Gene Expression Atlas) and with bioactive chemical compounds known to affect proteins that occur in pathways (e.g. from ChEMBL).
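The prefixed names in the Turtle example above (wp:WP534, skos:member, chebi:4167) are shorthand for full URIs. As a minimal sketch of that expansion, the Python snippet below reuses the four prefixes from the example; the function names `expand` and `expand_triple` are illustrative only and are not part of any WikiPathways tooling.

```python
from typing import Tuple

# Prefix table copied from the Turtle example in the Introduction.
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "wp": "http://identifiers.org/wikipathways/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "chebi": "http://identifiers.org/chebi/CHEBI:",
}


def expand(curie: str) -> str:
    """Expand a prefixed name such as 'wp:WP534' into a full URI."""
    prefix, _, local = curie.partition(":")
    if prefix not in PREFIXES:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return PREFIXES[prefix] + local


def expand_triple(subject: str, predicate: str, obj: str) -> Tuple[str, str, str]:
    """Expand all three positions of a triple written with prefixed names."""
    return expand(subject), expand(predicate), expand(obj)


# The example statement: the glycolysis pathway has glucose as a member.
triple = expand_triple("wp:WP534", "skos:member", "chebi:4167")
print(" ".join(f"<{uri}>" for uri in triple), ".")
```

Each of the three resulting URIs can be resolved on its own, which is what makes the statement interpretable outside the database that produced it.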

Materials and Methods Use of Open PHACTS RDF guidelines In collaboration with partners in the Open PHACTS project, we proposed guidelines for presenting data as RDF [37], most of which can be considered general guidelines for producing RDF in the biomedical domain. The guidelines consist of a prerequisite and 11 steps, covering the licensing (step 0), design (steps 1–5), implementation (steps 6–9), and presentation (steps 10–11) of the data in the semantic web. In the work presented here we follow these steps:
Licensing. WikiPathways content is covered by the Creative Commons Attribution 3.0 Unported license (https://creativecommons.org/licenses/by/3.0/). This is stated in the VoID headers of the generated RDF. These headers are automatically generated by the same script that generates the WikiPathways RDF. Open PHACTS provides a template for these header files.
Implementation. We used a Java RDF framework, Jena (http://jena.apache.org/) [38], to generate the RDF for WikiPathways. The pathway diagrams were obtained through the web services of WikiPathways, after which they were converted into RDF with the Jena RDF framework. The code of the serializer is available on GitHub (https://github.com/wikipathways/wp2lod). The vocabularies were generated with a vocabulary framework called Deri Neologism (http://neologism.deri.ie/).
Presentation. The resulting RDF triples are available from http://rdf.wikipathways.org. They are loaded on an instance of the Virtuoso Open-Source Edition (http://virtuoso.openlinksw.com/) and are available through its SPARQL endpoint at http://sparql.wikipathways.org. The triples are also loaded on the Open PHACTS discovery platform (https://dev.openphacts.org/docs/1.5) where they can be accessed through eleven API calls.
Identifier mapping In the context of the semantic web, it is impractical to burden query writers with handling identifier mapping per resource and per query. Rather, the mapping results themselves need to become part of the semantic web.
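One way to picture how mapping results become part of the data is to materialise them as extra triples next to the curator's original identifier. The sketch below is a minimal illustration in Python, not the project's Java serializer. The mapping table, the node name `ex:node1`, and all identifier values are made-up placeholders; the predicates `dc:identifier`, `wp:bdbEntrezGene`, and `wp:bdbUniprot` are the ones named in this section.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical mapping table keyed by the curator-supplied identifier.
# In practice a mapping service such as BridgeDb would supply these links;
# every identifier value here is a placeholder, not a real database entry.
MAPPINGS: Dict[str, Dict[str, List[str]]] = {
    "ensembl:GENE_A": {
        "wp:bdbEntrezGene": ["ncbigene:0000"],
        "wp:bdbUniprot": ["uniprot:PROT_A1", "uniprot:PROT_A2"],
    },
}


def unify(node: str, original_id: str) -> List[Triple]:
    """Return the original-identifier triple plus one triple per mapped identifier."""
    triples: List[Triple] = [(node, "dc:identifier", original_id)]
    # Sort predicates so the output order is stable across runs.
    for predicate, targets in sorted(MAPPINGS.get(original_id, {}).items()):
        for target in targets:
            triples.append((node, predicate, target))
    return triples


triples = unify("ex:node1", "ensembl:GENE_A")
for s, p, o in triples:
    print(s, p, o, ".")
```

Because the mapped identifiers are stored ahead of time, a query can match on any of the unified identifier systems without calling a mapping service, at the cost of some redundant triples.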
We applied two distinct approaches to addressing identifier mapping in our WikiPathways and Open PHACTS projects.
Query expansion. The Open PHACTS framework provides query expansion functionality through its Identifier Mapping Service. When an identifier is queried, the SPARQL query is enriched with all possible identifiers to retrieve an expanded set of related entities. This approach is the most efficient in terms of the number of triples, since it requires only a single identifier per relationship, eliminating redundancy. However, it also requires a hosted identifier mapping service that is called along with every query.
Unified identifiers. In the case of WikiPathways, which does not host a mapping service, we chose a unified identifier approach, where all identifiers are mapped ahead of time to a set of common identifier systems. In this way, the database effectively contains the results of a limited number of identifier mappings in the form of partially redundant triples. For example, in the WikiPathways RDF, all identifiers have been unified to Entrez Gene [7] (wp:bdbEntrezGene), Ensembl [8] (wp:bdbEnsembl), and UniProt [9] (wp:bdbUniprot) for gene products, and to HMDB [10] (wp:bdbHmdb) and ChemSpider [11] (wp:bdbChemspider) for compounds like metabolites and drugs. The original identifier provided by the pathway curator is stored as a triple, with the predicate dc:identifier, and a URI from identifiers.org, which points to both the identifier and the resource.
Summary We present a semantic web representation of WikiPathways, together with the vocabularies needed to cover the graphical pathway layout and the biological meaning, and solutions to map between different identifier systems. The public availability allows rapid integration with other biological resources. The availability of two vocabularies allows conversion between different pathway resources.
Different analytical tools now support the import of semantic web data, allowing integrated use of data from different resources with a single query. We demonstrate this with a federated query across multiple resources, where the resulting differentially expressed genes for a disease were shown on a discovered pathway using PathVisio.
Availability The following resources are publicly available as beta releases, just like WikiPathways. They are maintained as part of the open source WikiPathways project.
Vocabularies
GPML: http://vocabularies.wikipathways.org/gpml
WP: http://vocabularies.wikipathways.org/wp
WikiPathways on the Semantic Web
SPARQL endpoint: http://sparql.wikipathways.org
Open PHACTS: https://dev.openphacts.org/docs/
RDF download: http://rdf.wikipathways.org
Source code
GitHub: https://github.com/wikipathways/wp2lod
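The SPARQL endpoint listed above can in principle be queried over HTTP following the SPARQL Protocol, which passes the query text in a `query` parameter. The Python sketch below only builds such a request and does not send it. It rests on assumptions: the endpoint URL is taken verbatim from this article and may have moved or may require a path such as `/sparql`, and the query is a deliberately generic one that does not rely on any WikiPathways vocabulary term.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint URL as stated in the article; assumed, not verified, to still be live.
ENDPOINT = "http://sparql.wikipathways.org"

# Generic query: fetch any five triples, independent of the vocabulary in use.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_request(query: str, endpoint: str = ENDPOINT) -> Request:
    """Build a SPARQL Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(QUERY)
print(request.full_url)
```

To run the query, the request could be passed to `urllib.request.urlopen` and the response body decoded with `json.loads`. That step is left out here because the availability of the endpoint is not guaranteed.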

Supporting Information S1 File. CONSTRUCT query to translate from the GPML vocabulary to the WP vocabulary. A CONSTRUCT query is a type of SPARQL query that enables the conversion of one graph pattern to another. Here an interaction described by its spatial properties is converted into a semantic representation reflecting its biological interpretation. https://doi.org/10.1371/journal.pcbi.1004989.s001 (PDF)

Acknowledgments We thank the teams behind UniProt, DisGeNET and EBI’s Array atlas for their help with the various SPARQL queries.

Author Contributions Wrote the paper: AW MK CTE ARP. Designed the queries and use cases: AW MK AR RM ELW.