In this article, we talk with one of our users: Antonio Messina from the High Performance Computing and Networking Institute of the Italian National Research Council (ICAR-CNR).

Antonio (@xMAnton on Twitter) is a Computer Science Engineer who works as an Applied Scientist at the largest public research institution in Italy. His area of expertise includes (No)SQL databases and advanced Unix systems administration, and he likes to get his hands dirty coding mainly in Java and Node.js. He is enthusiastic about technologies such as graph databases and Docker and is constantly looking for innovation in IT.

Recently, Antonio successfully submitted a paper that describes a practical use case for Grakn. The paper, “BioGrakn: A Knowledge Graph-based Semantic Database for Biomedical Sciences”, will be published after the CISIS 2017 conference that takes place in July.

We asked Antonio to tell us some more about BioGrakn, and how it is the first step in using the power of knowledge graphs and machine reasoning to solve common problems in the domain of biomedical science.

What problem did you need to solve?

Nowadays, the amount of biological data available online is huge, but integrating and connecting related information from different sources to gain new knowledge is a challenge.

We’ve identified a need for tools to aggregate, integrate, and model data, while managing significant complexity and contextual specificity.

Some of the most common problems include: locating resources, differing data formats, ambiguity and duplication, relationships between data, and the sheer volume and granularity of the information. As yet, there is no standard storage and query format for this kind of data, so each resource usually requires a different approach to be properly handled.

Typically, what kind of data storage do you work with?

Several classes of bio-molecular data, such as transcriptional regulatory networks and protein-protein interaction networks, form complex networks. They can usually be modeled as graphs, where nodes (and their attributes) model biological entities and edges represent relationships between these entities. Examples of the adoption of graph databases in bioinformatics include ncRNA-DB, Bio4j, and BioGraphDB:

ncRNA-DB is a NoSQL database based on OrientDB that combines many biological resources to deal with several classes of ncRNA such as miRNA, long-noncoding RNA (lncRNA), circular RNA (circRNA) and their interactions with genes and diseases.

Bio4j is an integrated cloud-based data platform, based on a Java library and built upon a graph structure on top of Neo4j. For now, it includes data about proteins, GO, and enzymes.

BioGraphDB integrates several types of data sources to perform bioinformatics analysis using a comprehensive system built on top of OrientDB. It includes data about genes, proteins, microRNAs, molecular pathways, functional annotations, and associations between microRNAs and cancer diseases.
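To make the graph model described above concrete, here is a minimal Python sketch in which nodes carry attributes and edges carry a relationship label. The identifiers, the attribute names, and the `neighbours` helper are hypothetical and are not taken from any of the three databases; only the gene symbol LYPLA1 appears elsewhere in this article.

```python
# Nodes: biological entities keyed by a made-up identifier, each with attributes.
# The protein and pathway entries are placeholders, not real records.
nodes = {
    "gene:LYPLA1": {"type": "gene", "symbol": "LYPLA1"},
    "protein:P1": {"type": "protein", "name": "placeholder protein"},
    "pathway:W1": {"type": "pathway", "name": "placeholder pathway"},
}

# Edges: (source, relationship label, target) triples.
edges = [
    ("gene:LYPLA1", "encodes", "protein:P1"),
    ("pathway:W1", "contains", "protein:P1"),
]


def neighbours(node_id, label):
    """Return the targets reachable from node_id over edges with the given label."""
    return [dst for src, lab, dst in edges if src == node_id and lab == label]


if __name__ == "__main__":
    for target in neighbours("gene:LYPLA1", "encodes"):
        print(target, nodes[target]["type"])
```

A real graph database adds indexing, persistence, and a query language on top of this basic shape.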

So what is BioGrakn?

In short: BioGrakn is a graph-based semantic database that takes advantage of the power of knowledge graphs and machine reasoning to solve problems in the domain of biomedical science. We address the major issue of semantic integrity, that is, interpreting the real meaning of data derived from multiple sources or manipulated by various tools.

BioGrakn has been built on top of Grakn, a distributed knowledge graph database which allows complex data modelling, verification, scaling, querying and analysis. A key step is the definition of an ontology, which facilitates the modeling of complex datasets and guarantees information consistency. Inference rules allow the extraction of implicit information from explicit data, to achieve logical reasoning over the represented knowledge.

What data sources did you use?

The data sources we chose are almost the same as those used by BioGraphDB. This way, we can build an integrated database containing resources related to genes, proteins, miRNAs, and metabolic pathways. References can be found at the end of the article.

NCBI Entrez Gene [1]: provides extensive gene data, such as interactions with other genes, genomic context, annotated pathways, and so on.

Gene Ontology (GO) [2]: provides annotations for gene products in biological processes, cellular components and molecular functions.

UniProt Knowledgebase (UniProtKB) [3]: the largest public collection of annotated functional information on proteins.

Reactome [4]: contains validated metabolic pathways, each annotated as a set of biological events, dealing with genes and proteins.

miRBase [5]: provides all the known miRNA sequences and annotations, associated with names, keywords, genomic locations, and references.

miRCancer [6]: contains associations between miRNAs and human cancers.

miRNASNP [7]: aims to provide a resource of the miRNA-related mutations (SNPs) for human and other species.

miRTarBase [8]: a list of experimentally validated miRNA-target interactions.

miRanda [9]: list of putative miRNA-target interactions.

HGNC [10]: the HUGO Gene Nomenclature Committee database contains, for each gene symbol, a list of synonyms and a list of corresponding entries in the most popular gene databases.

How did you import the data?

Much of the above data is in TSV format, a simple text format for storing tabular data in which each record is one line of the text file and each field value is separated from the next by a tab character. By contrast, miRBase, GO, and UniProtKB are distributed in EMBL text format or XML.
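As an illustration of the TSV layout described above, the following Python sketch reads tab-separated text with a header row into a list of records. The `read_tsv` helper, the column names, and the sample row are made up for the example and do not come from any of the listed data sources.

```python
import csv
import io


def read_tsv(text):
    """Parse TSV text whose first line is a header; return one dict per record."""
    reader = csv.DictReader(
        io.StringIO(text),
        delimiter="\t",          # fields are separated by tab characters
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary data
    )
    return [dict(row) for row in reader]


if __name__ == "__main__":
    # Hypothetical two-column file: one header line and one record.
    sample = "symbol\tname\nLYPLA1\tplaceholder gene name\n"
    for record in read_tsv(sample):
        print(record["symbol"], "->", record["name"])
```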

Grakn does import TSV, but EMBL and XML source data files are not currently supported, so we developed an ad-hoc set of Extract-Transform-Load (ETL) tools. Data consistency and proper relations between entities are guaranteed by running the ETLs in a precise order. This way, when a data source refers to others, all the resources it depends on are already present in the database.
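One way to derive such an execution order is a topological sort over the dependencies between loaders. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later). The dependency map is an assumption made purely for illustration; the article does not state which BioGrakn ETLs depend on which, and `etl_order` is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each ETL job maps to the jobs that must run before it.
# These dependencies are illustrative only, not the actual BioGrakn ordering.
depends_on = {
    "entrez_gene": set(),
    "uniprotkb": {"entrez_gene"},
    "reactome": {"uniprotkb"},
    "mirbase": set(),
    "mirtarbase": {"mirbase", "entrez_gene"},
}


def etl_order(deps):
    """Return a run order in which every job comes after all of its dependencies."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for step, job in enumerate(etl_order(depends_on), start=1):
        print(step, job)
```

`TopologicalSorter` raises `CycleError` if two sources depend on each other, which is a useful early warning when new sources are added.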

What does the ontology look like?

A Graql ontology specifies the relevant concepts and their meaningful associations and must be clearly defined before loading data into a graph. Objects and relationships are categorised into distinct types, enabling automatic reasoning over the represented knowledge, such as inference (extraction of implicit information from explicit data) and validation (discovery of inconsistencies in the data).

The ontology has four types of concepts to model the domain. The categorization of concept types is enforced by declaring every concept type as a subtype of exactly one of the four corresponding built-in concept types: entity, relation, role, and resource.
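The "exactly one of four built-in kinds" constraint can be sketched in Python as a simple mapping from each concept type to its kind. The type names below are the ones that appear in the queries later in this article; assigning them to these kinds is inferred from how they are used there, and the real ontology on GitHub may organise them differently (for example, with intermediate supertypes). `types_of_kind` is a hypothetical helper.

```python
# The four built-in concept kinds named in the text.
BUILT_IN_KINDS = {"entity", "relation", "role", "resource"}

# Each concept type maps to exactly one kind (a dict key can only have one value).
# The assignment is an assumption based on the queries shown in this article.
ontology = {
    "gene": "entity",
    "protein": "entity",
    "pathway": "entity",
    "encoding": "relation",
    "containing": "relation",
    "annotation": "relation",
    "encoder": "role",
    "encoded": "role",
    "container": "role",
    "contained": "role",
    "functionalAnnotation": "role",
    "annotatedEntity": "role",
    "symbol": "resource",
    "goId": "resource",
}


def types_of_kind(kind):
    """List the concept types declared under one of the four built-in kinds."""
    if kind not in BUILT_IN_KINDS:
        raise ValueError(f"unknown built-in kind: {kind}")
    return sorted(name for name, k in ontology.items() if k == kind)


if __name__ == "__main__":
    for kind in sorted(BUILT_IN_KINDS):
        print(kind, types_of_kind(kind))
```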

Here’s a screenshot of the ontology we used. You can find the text version on GitHub.

BioGrakn ontology

Do you have any examples of your queries?

Search for genes linked to a particular Gene Ontology annotation

Let’s consider the Gene Ontology annotation “platelet activating factor biosynthetic process”, which has the identifier GO:0006663. To find the annotated genes, we match the annotation relation whose functional annotation member is our starting identifier. This yields all the related annotated entities, from which we extract the genes and print their symbols and names. The following Graql query returns the desired results:

match $go has goId "GO:0006663"; (functionalAnnotation: $go, annotatedEntity: $gene) isa annotation; $gene isa gene;

Search for genes linked to GO annotation GO:0006663

Search for pathways linked to a particular gene

At first sight, this seems like the previous problem. However, genes cannot be directly linked to pathways, because Reactome only provides pathway-to-protein associations. Therefore, we have to go through two relations:

1. encoding, which links genes to proteins

2. containing, which links pathways to proteins.

Thus, the Graql query is formed as follows:

match $gene has symbol "LYPLA1"; (encoder: $gene, encoded: $protein) isa encoding; (container: $path, contained: $protein) isa containing; $path isa pathway;

Search for pathways linked to gene LYPLA1
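For readers unfamiliar with Graql, the two-hop traversal in the query above can be mimicked in plain Python over in-memory relation tuples. This is only a sketch of the logic, not how Grakn evaluates the query. The protein and pathway identifiers are placeholders, not real UniProtKB or Reactome entries, and `pathways_for_gene` is a hypothetical helper.

```python
# encoding relation: (encoder gene symbol, encoded protein)
encoding = [
    ("LYPLA1", "protein-A"),
    ("OTHER-GENE", "protein-B"),
]

# containing relation: (container pathway, contained protein)
containing = [
    ("pathway-X", "protein-A"),
    ("pathway-Y", "protein-B"),
]


def pathways_for_gene(symbol):
    """Follow gene -> protein (encoding), then protein -> pathway (containing)."""
    proteins = {protein for gene, protein in encoding if gene == symbol}
    return sorted({path for path, protein in containing if protein in proteins})


if __name__ == "__main__":
    print(pathways_for_gene("LYPLA1"))
```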

Reasoning

Consider the previous example, where we have the following statements, that can be seen as a set of premises:

if genes encode proteins

if proteins belong to pathways

Thus, it is possible to infer the following fact:

then genes can be linked to pathways

Therefore, we can write an inference rule that infers genes-pathways links:

$genesInPathways isa inference-rule
lhs {
    $gene isa gene;
    $protein isa protein;
    (encoder: $gene, encoded: $protein) isa encoding;
    (container: $pathway, contained: $protein) isa containing;
}
rhs {
    (container: $pathway, contained: $gene) isa containing;
}

This rule allows us to rewrite the previous query:

match $gene has symbol "LYPLA1"; (container: $pathway, contained: $gene) isa containing;

As expected, the graphical results now show direct links from the gene to its pathways.

Reasoning on gene-pathway links
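The effect of the rule can be sketched in Python as a function that derives pathway-gene links from the two explicit relations. Note the assumptions: this sketch computes all inferred links eagerly, which is not necessarily how Grakn's reasoner evaluates rules, and the data and the `infer_gene_pathway_links` helper are hypothetical.

```python
def infer_gene_pathway_links(encoding, containing):
    """Apply: encoding(gene, protein) and containing(pathway, protein)
    imply containing(pathway, gene). Returns the set of inferred pairs."""
    # Index pathways by the protein they contain, so each gene is matched once.
    pathways_by_protein = {}
    for pathway, protein in containing:
        pathways_by_protein.setdefault(protein, set()).add(pathway)

    inferred = set()
    for gene, protein in encoding:
        for pathway in pathways_by_protein.get(protein, ()):
            inferred.add((pathway, gene))
    return inferred


if __name__ == "__main__":
    # Placeholder facts: (gene, protein) and (pathway, protein) pairs.
    encoding = [("LYPLA1", "protein-A")]
    containing = [("pathway-X", "protein-A"), ("pathway-Y", "protein-B")]
    print(infer_gene_pathway_links(encoding, containing))
```

The inferred pairs have the same shape as the explicit `containing` facts, which is why the rewritten query can ask for them with a single relation.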

Summary

In this article, we have looked at how Grakn was used to build a prototype of a bioinformatics semantic database. We’ve discussed how BioGrakn takes advantage of the power of knowledge graphs and machine reasoning to solve problems in the domain of biomedical science. We address the major issue of semantic integrity, that is, interpreting the real meaning of data derived from multiple sources or manipulated by various tools.

What’s next?

In the short term, further developments are expected, such as the integration of other publicly available biological resources, the use of the native Grakn migration tools for data migration procedures, and the deployment of a user-friendly web interface.

Acknowledgements

Many thanks to Antonio, and to Nicholas D and Michelangelo Bucci for contributing to the text.

References

1. Schuler, G. D., Epstein, J. A., Ohkawa, H., Kans, J. A.: Entrez: molecular biology database and retrieval system. Methods in Enzymology, vol. 266, 141–162 (1996)

2. The Gene Ontology Consortium: Gene Ontology Consortium: going forward. Nucleic Acids Research, vol. 43, no. D1, 1049–1056 (2015)

3. The UniProt Consortium: UniProt: a hub for protein information. Nucleic Acids Research, vol. 43, no. D1, 204–212 (2015)

4. Croft, D., Mundo, A. F., Haw, R., Milacic, M., Weiser, J., Wu, G., Caudy, M., Garapati, P., Gillespie, M., Kamdar, M. R., Jassal, B., Jupe, S., Matthews, L., May, B., Palatnik, S., Rothfels, K., Shamovsky, V., Song, H., Williams, M., Birney, E., Hermjakob, H., Stein, L., D'Eustachio, P.: The Reactome pathway knowledgebase. Nucleic Acids Research, vol. 42, no. D1, 472–477 (2014)

5. Kozomara, A., Griffiths-Jones, S.: miRBase: integrating microRNA annotation and deep-sequencing data. Nucleic Acids Research, vol. 39, Database issue, 152–157 (2011)

6. Xie, B., Ding, Q., Han, H., Wu, D.: miRCancer: a microRNA-cancer association database constructed by text mining on literature. Bioinformatics, vol. 29, no. 5, 638–644 (2013)

7. Gong, J., Tong, Y., Zhang, H.M., Wang, K., Hu, T., Shan, G., Sun, J., Guo, A.Y.: Genome-wide identification of SNPs in microRNA genes and the SNP effects on microRNA target binding and biogenesis. Human Mutation, vol. 33, no. 1, 254–263 (2012)

8. Hsu, S.-D., Tseng, Y.-T., Shrestha, S., Lin, Y.-L., Khaleel, A., Chou, C.-H., Chu, C.-F., Huang, H.-Y., Lin, C.-M., Ho, S.-Y., Jian, T.-Y., Lin, F.-M., Chang, T.-H., Weng, S.-L., Liao, K.-W., Liao, I.-E., Liu, C.-C., Huang, H.-D.: miRTarBase update 2014: an information resource for experimentally validated miRNA-target interactions. Nucleic Acids Research, vol. 42, no. D1, 78–85 (2014)

9. John, B., Enright, A. J., Aravin, A., Tuschl, T., Sander, C., Marks, D. S.: Human microRNA targets. PLoS Biology, vol. 2, no. 11 (2004)

10. Gray, K. A., Yates, B., Seal, R. L., Wright, M. W., Bruford, E. A.: Genenames.org: the HGNC resources in 2015. Nucleic Acids Research, vol. 43, no. D1, 1079–1085 (2015)