First Steps Toward Decentralized Biomedical AI

The Gene Annotation Service by Mozi.ai — This article was written by Mike Duncan and Dr Ben Goertzel.

Vision

The tricorder from Star Trek is a powerful symbol for the ultimate goal of medical automation: a device that would acquire all necessary data from a patient to determine their illness — and then perform the interventions needed to cure them, or inform the patient or their healthcare provider of the appropriate treatment.

The AGI controlling a future tricorder-like technology needs an in silico human biology model that can simulate the complete range of normal and pathological variation of human physiology as it changes with the environment, over the lifespan, and during acute and chronic disease processes. Given such a model, it needs to be able to carry out complex query processing, along with inductive, deductive and abductive inference to diagnose the patient’s illness and choose the appropriate treatment. Such an AI would take electronically archived and self-reported patient information, physical exam findings, and biochemical measurements to parameterize a predictive model that would capture the patient’s physiological trajectory in sufficient detail to characterize the cause of their pathology and select interventions maximizing the chance for an optimal outcome.

The decentralized AI paradigm embodied in SingularityNET provides a powerful approach toward this long-term holy-grail goal of “solving medicine” — and in the immediate term provides a valuable platform for creating and connecting AI tools carrying out diverse forms of biological and medical analysis.

There are two massive stumbling blocks in biomedical research that require advanced computational techniques, and ultimately AGI, to solve. One is the sheer volume of data being generated by modern high-throughput “-omics” technologies and by the publications of the global army of biomedical researchers. As Jackie Hunter, the director of Benevolent.ai, rightly noted, “for an evidence-based industry, we don’t actually use a lot of it. There’s a new publication out every 30 seconds. There is a huge amount of information out there that just isn’t being used for the discovery and development of new drugs.” No human can consume, let alone synthesize, this exponentially increasing pool of knowledge. The second impediment to solving medicine is the inherent complexity of biological systems: the myriad interacting components and factors that need to be taken into account when modeling the functions and dysfunctions of the human body. Examples include sequencing the some 3 billion DNA “letters” of each human genome (genomics), measuring the levels of the tens of thousands of interacting protein variants this DNA codes for (proteomics), and tracking the dynamics of the myriad molecular components these protein networks orchestrate to produce the symphony of living tissues that make up biological organisms.

The emerging experimental technologies that allow us to routinely sequence entire genomes and transcriptomes from single cells and visualize individual proteins in real time have produced the parts list for life. Microfluidics-based instruments are systematically characterizing the possible interactions among tens of thousands of proteins and small molecules, and organoid technology is producing tissue models at scale to extend this mass screening ability to living cell systems. Synthesizing these observations into tissue- and organism-level predictive models is the goal that drives the global biomedical research community to produce a new publication every 30 seconds¹. Even at this volume, human brainpower has closely examined the functioning of only 10% of the known human genes², and the vagaries of the investment and big government funding sources that select research priorities concentrate the bulk of ongoing research on the biology defined by this narrow spotlight³. Considering the intrinsic limit to the number of interacting variables even the brightest human minds can work with, it is clear that modeling life, with its tens of thousands of components and tens of millions of interactions, requires the scalability of an AI system to be achieved, and AGI to be wielded as a clinically applicable tool.

Roadmap

SingularityNET, in partnership with Hong Kong bio-AI startup MOZI.AI, has embarked on a project to build an AI framework that will ultimately support complete tricorder-like analytical functionality, based on the developing OpenCog⁴ artificial general intelligence (AGI) system. The development process consists of two tracks — biological and clinical — each a series of functional units developed as SingularityNET services with independent utility. These tools will combine with previously implemented ones to form increasingly complex analytical services and biomedical models.

The biology track starts at the molecular level, implementing tools for basic research that allow experimental omics/big data to be analyzed in the context of current knowledge to produce models and hypotheses for new experiments. The clinical track starts at the human level, securely collecting and storing medical records combined with data streams from fitness wearables, meal and stress tracking apps and the rest of the digital self for services that model and eventually predict and optimize individual health. Some early services of this type will be developed in collaboration with Rejuve, a spinoff from SingularityNET focused on gathering data from and providing services to individuals with a passion for healthspan extension, and using the data gathered to make new discoveries and create new therapies.

The services from the biology track will be aimed at biotech, pharma, and clinical researchers, and those from the clinical track at clinical providers and health-conscious consumers, with all of them available and potentially interacting in the SingularityNET marketplace. Ultimately, the services for research applications will combine to form a modeling system sophisticated enough to use the individual patient data produced by the clinical/“digital self” data services to allow accurate disease diagnosis, treatment prediction, and health optimization planning and monitoring.

Developing an AI system for solving medicine is an ambitious project that will require input from the global biomedical research community and the global human population it is intended to serve. The collaboration of several research institutions and companies like MOZI.AI on SingularityNET’s open AI marketplace will be one of the world’s most effective means to catalyze a revolution in medicine within this next decade.

First steps: OpenCog based services on SNET

The development process SingularityNET and MOZI.AI are collaboratively undertaking to realize the tricorder vision is a two-pronged approach that combines bioinformatics research into the underlying dynamics of aging with the development of the bioinformatics AI tools, built on the OpenCog AGI framework, needed to carry out this research. The OpenCog AGI framework⁵ ⁶ consists of a suite of machine learning algorithms operating on a common knowledge representation implemented as a weighted, labeled hypergraph database⁷ (the AtomSpace). Besides standard graph database functions accessible through a custom Scheme-based language⁸ (Atomese), there are engines for multi-variable pattern matching⁹, surprising pattern mining¹⁰, and probabilistic inference using the probabilistic logic networks¹¹ (PLN) framework. Supervised learning algorithms include deep learning neural net models and a sophisticated genetic algorithm package, meta-optimizing semantic evolutionary search¹² (MOSES), for evolving boolean models from experimental data. A natural language processing component¹³ will allow automated incorporation of knowledge as it is published. These complementary learning processes are coordinated synergistically around a dynamically generated subset of the AtomSpace (the attentional focus¹⁴) that serves as a “global workspace” for ongoing learning processes.
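To make the idea of a weighted, labeled hypergraph with multi-variable pattern matching more concrete, here is a minimal sketch in plain Python. It is not the OpenCog API: the classes Node, Var, Link and AtomStore, the unify function, the placeholder gene and concept names, and the weights are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional, Tuple, Union


@dataclass(frozen=True)
class Node:
    type: str   # label such as "GeneNode" or "ConceptNode"
    name: str


@dataclass(frozen=True)
class Var:
    name: str   # a pattern variable, only used inside query patterns


@dataclass(frozen=True)
class Link:
    type: str   # label such as "MemberLink"
    out: Tuple[Union[Node, Var, "Link"], ...]  # links may contain other links


Bindings = Dict[str, Union[Node, Link]]


def unify(pattern, atom, bindings: Bindings) -> Optional[Bindings]:
    """Return extended bindings if pattern matches atom, else None."""
    if isinstance(pattern, Var):
        if pattern.name in bindings:
            return bindings if bindings[pattern.name] == atom else None
        return {**bindings, pattern.name: atom}
    if isinstance(pattern, Node):
        return bindings if pattern == atom else None
    # pattern is a Link: the label and arity must agree, then match each member
    if (not isinstance(atom, Link) or pattern.type != atom.type
            or len(pattern.out) != len(atom.out)):
        return None
    for sub_pattern, sub_atom in zip(pattern.out, atom.out):
        bindings = unify(sub_pattern, sub_atom, bindings)
        if bindings is None:
            return None
    return bindings


class AtomStore:
    """Toy stand-in for a hypergraph store: every atom carries one weight."""

    def __init__(self) -> None:
        self.weights: Dict[Union[Node, Link], float] = {}

    def add(self, atom, weight: float = 1.0):
        self.weights[atom] = weight
        return atom

    def match(self, pattern) -> Iterator[Tuple[Bindings, float]]:
        for atom, weight in self.weights.items():
            found = unify(pattern, atom, {})
            if found is not None:
                yield found, weight


# Placeholder data, not real annotations.
store = AtomStore()
gene_a = Node("GeneNode", "GENE_A")
gene_b = Node("GeneNode", "GENE_B")
process = Node("ConceptNode", "example process")
store.add(Link("MemberLink", (gene_a, process)), 0.9)
store.add(Link("MemberLink", (gene_b, Node("ConceptNode", "other process"))), 0.8)

one_var = list(store.match(Link("MemberLink", (Var("gene"), process))))
two_var = list(store.match(Link("MemberLink", (Var("gene"), Var("concept")))))
```

The single-variable query asks which genes are members of one concept, while the two-variable query enumerates every gene and concept pair. The real pattern matcher is far more general, so this only conveys the flavor of the approach.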

The SingularityNET Beta platform has already listed a number of AI agents that can be accessed by any individual or organisation willing to pay for their use. The term “agent” here simply refers to any AI service, from a small individual algorithm to large end-to-end solutions and standalone AI applications, with an associated “agent” contract that manages pricing and displays certain metadata like the service endpoint. Implementing OpenCog components as SingularityNET services allows cutting-edge AI developed for the general SingularityNET market to be incorporated into the OpenCog AGI framework, and also allows functional components of the tricorder vision implemented with OpenCog technology to be incrementally deployed and improved through user feedback after real-world use.

The first two agents we are publishing to the SingularityNET network are the Gene Annotation Service and the MOSES Service, an evolutionary program learning algorithm interface. Though these initial offerings are primarily intended as a proof of concept showing an AI algorithm service whose results can be fed to a domain knowledge base service for interpretation, they are fully functional services designed to be components in analytical pipelines for biomedical research projects. The rest of this blog post will describe the annotation service in detail, placing it in the context of our broader development goals.

Gene Annotation Service

The current version of the service generates an AtomSpace hypergraph associating a list of input human gene symbols with data derived from three widely used public biological knowledge-bases:

Gene Ontology (GO): The GO associates genes with three sets of concepts describing the biological processes (BP), molecular functions (MF), and cellular components (CC) associated with gene products, organized as directed acyclic graphs (DAGs). The definitions of these terms provide a natural language description of these genes curated from the biomedical literature, subdivided into their role in human physiological functioning (BP), the interactions and dynamics the proteins have at the molecular level (MF), and the macromolecular complexes and cellular organelles they are a part of (CC). In the AtomSpace, GO terms are ConceptNodes with particular GeneNodes (specially named ConceptNodes) connected by MemberLinks.

Reactome: This is a database of metabolic and signaling pathways composed of interacting proteins and small molecules, curated from the biomedical literature. The pathways are also represented as ConceptNodes with GeneNodes connected to them by MemberLinks. Small molecules involved in the chemical reactions constituting the pathways are represented by MoleculeNodes. Optionally, the specific expressed proteins known to be the translated protein isoforms of their respective genes that actually participate in the reactions of the pathway can be returned in the annotation results as MoleculeNodes as well.

BIOGRID: This protein-protein interaction (PPI) database contains predicted and experimentally verified protein-protein interactions. While GO and Reactome (like the bulk of the human biomedical literature) only cover the 10–20% of the predicted human genes that have actually been characterized to some extent in vivo, the PPI database documents interactions observed in vitro using multiple experimental methods that among them have analyzed most of the predicted human proteins in mass screening experiments. These interactions are represented in the AtomSpace by an “interacts with” PredicateNode connecting GeneNodes or, like the Reactome annotations, optional MoleculeNodes of the proteins connected by “expresses” PredicateNode links to their respective GeneNodes.
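The node and link types described in the three entries above can be sketched as a small Python generator that emits Atomese-style text. This is only an illustration under stated assumptions: the lookup tables, the placeholder gene, term, pathway and protein names, the helper functions, and the exact serialization (including wrapping predicates in EvaluationLink and ListLink) are assumptions, and the service's actual output may differ.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def gene_node(symbol: str) -> str:
    return f'(GeneNode "{symbol}")'


def molecule_node(name: str) -> str:
    return f'(MoleculeNode "{name}")'


def member_link(gene: str, concept: str) -> str:
    # A gene is a member of a GO term or a pathway, both modeled as ConceptNodes.
    return f'(MemberLink {gene_node(gene)} (ConceptNode "{concept}"))'


def evaluation_link(predicate: str, first: str, second: str) -> str:
    # A named predicate relating two atoms.
    return f'(EvaluationLink (PredicateNode "{predicate}") (ListLink {first} {second}))'


def annotate(genes: Iterable[str],
             go_terms: Dict[str, List[str]],
             pathways: Dict[str, List[str]],
             interactions: List[Tuple[str, str]],
             proteins: Optional[Dict[str, str]] = None) -> List[str]:
    """Collect annotation atoms for the requested gene symbols."""
    genes = list(genes)
    atoms: List[str] = []
    for gene in genes:
        for term in go_terms.get(gene, []):
            atoms.append(member_link(gene, term))
        for pathway in pathways.get(gene, []):
            atoms.append(member_link(gene, pathway))
        if proteins and gene in proteins:
            # Optional protein-level detail: the gene "expresses" a protein molecule.
            atoms.append(evaluation_link("expresses", gene_node(gene),
                                         molecule_node(proteins[gene])))
    wanted = set(genes)
    for left, right in interactions:
        if left in wanted or right in wanted:
            atoms.append(evaluation_link("interacts_with",
                                         gene_node(left), gene_node(right)))
    return atoms


# Placeholder lookup tables standing in for GO, Reactome and BIOGRID extracts.
atoms = annotate(
    ["GENE_A"],
    go_terms={"GENE_A": ["GO:PLACEHOLDER_1"]},
    pathways={"GENE_A": ["PATHWAY_PLACEHOLDER_1"]},
    interactions=[("GENE_A", "GENE_B"), ("GENE_C", "GENE_D")],
    proteins={"GENE_A": "PROTEIN_A"},
)
for atom in atoms:
    print(atom)
```

Keeping genes and proteins as separate node types, linked by an explicit predicate, mirrors the distinction drawn in the text between gene-level concepts and the physical molecules that take part in reactions and interactions.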

The inclusion of both GeneNodes and MoleculeNodes embodies the distinction between the evolving natural language conceptual structures of biology and the physical objects characterisable through experimentation that they describe. As our AI and biology research progress, this distinction will allow complementary software representations of biology in both natural language concepts and numerical molecular simulations to coevolve to the sophistication necessary to implement the tricorder vision of arbitrarily accurate patient simulation.

After providing the list of genes to be annotated in the form of their official gene symbols and selecting which databases and associated options to use for annotation, the user is provided with results in three formats, as well as a graph visualization directly in the browser. Plain text tables in comma-separated value (csv) format can be viewed and downloaded for a complete record of the annotations returned for each selected database. A Scheme file of the AtomSpace representation of the results can also be downloaded by users interested in working directly with OpenCog AtomSpace-based tools. Finally, a graph representation of the annotations is available in JSON that is compatible with the popular open-source graph visualizer Cytoscape. We use the cytoscape.js library for two visualization options, available on our server via a link on the results page. One is a standard abstract graph representation that can be manipulated in the browser interface and which provides links for each annotation back to the source database, and the other shows the annotations placed within their documented locations in a stylized cell representation. Accessing and using the cell visualizer will be described in a future blog post.
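As a rough illustration of how one set of annotation records could be written out both as a csv table and as a Cytoscape-compatible graph, consider the sketch below. The column names, the "kind" data field, the edge id scheme and the placeholder records are assumptions for this example and are not taken from the service. Only the general elements layout, with nodes and edges each carrying a data object, follows the Cytoscape JSON convention.

```python
import csv
import io
import json
from typing import Dict, List, Tuple

# Each record: (gene symbol, source database, annotation). Placeholder values.
Record = Tuple[str, str, str]
records: List[Record] = [
    ("GENE_A", "GO", "GO:PLACEHOLDER_1"),
    ("GENE_A", "Reactome", "PATHWAY_PLACEHOLDER_1"),
    ("GENE_B", "GO", "GO:PLACEHOLDER_1"),
]


def to_csv(rows: List[Record]) -> str:
    """Render the records as a plain csv table with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["gene", "source", "annotation"])
    writer.writerows(rows)
    return buffer.getvalue()


def to_cytoscape(rows: List[Record]) -> Dict:
    """Build a Cytoscape-style elements object: one node per gene and
    annotation, one edge per record."""
    nodes: Dict[str, Dict] = {}
    edges: List[Dict] = []
    for gene, source, annotation in rows:
        nodes[gene] = {"data": {"id": gene, "kind": "gene"}}
        nodes[annotation] = {"data": {"id": annotation, "kind": source}}
        edges.append({"data": {"id": f"{gene}->{annotation}",
                               "source": gene, "target": annotation}})
    return {"elements": {"nodes": list(nodes.values()), "edges": edges}}


table = to_csv(records)
graph = to_cytoscape(records)
graph_json = json.dumps(graph, indent=2)
print(table)
print(graph_json)
```

Using a dictionary keyed by node id means a GO term shared by several genes appears once in the graph, with one edge per gene pointing to it.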

While covering only a small fraction of the available biological knowledge catalogued on the web, the selection of publicly available data incorporated into the AtomSpace provides a blueprint for a semantic model of molecular biology. It is grounded in the 10% of genes and the biochemical pathways they control as currently understood, together with computationally predicted and empirically determined in vitro molecular interactions, which we can use to develop automated inference tools that systematically begin to fill in the remaining 90% of the picture. One short-term goal is to use gene lists from gene expression experiments and variants from genome-wide association studies (GWAS) to predict biological pathways that would explain the experimental results, based on known pathway relationships and the similarity between the experimental proteins and proteins in known pathways. This type of automated hypothesis generation will be key to making use of the explosion of data being generated by biomedical research. Development of the necessary OpenCog technology is progressing rapidly and milestones will be detailed periodically in future blog posts.
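The simplest baseline for linking an experimental gene list to known pathways is a standard over-representation test, sketched below. This is a swapped-in textbook technique shown only for orientation, not the OpenCog-based inference the authors describe, and it ignores the protein-similarity component mentioned above. The function names, the placeholder pathways and the background size are assumptions.

```python
from math import comb
from typing import Dict, List, Set, Tuple


def hypergeom_tail(overlap: int, background: int, pathway_size: int,
                   list_size: int) -> float:
    """P(X >= overlap) when list_size genes are drawn without replacement
    from a background in which pathway_size genes belong to the pathway."""
    total = comb(background, list_size)
    upper = min(pathway_size, list_size)
    hits = sum(comb(pathway_size, i) * comb(background - pathway_size, list_size - i)
               for i in range(overlap, upper + 1))
    return hits / total


def rank_pathways(gene_list: Set[str], pathways: Dict[str, Set[str]],
                  background: int) -> List[Tuple[str, int, float]]:
    """Return (pathway, overlap, p-value) sorted from most to least enriched."""
    scored = []
    for name, members in pathways.items():
        overlap = len(gene_list & members)
        p_value = hypergeom_tail(overlap, background, len(members), len(gene_list))
        scored.append((name, overlap, p_value))
    return sorted(scored, key=lambda item: item[2])


# Placeholder data: a background of 100 genes and two made-up pathways.
pathways = {
    "PATHWAY_1": {f"G{i}" for i in range(1, 6)},     # G1..G5
    "PATHWAY_2": {f"G{i}" for i in range(10, 30)},   # G10..G29
}
experimental_genes = {"G1", "G2", "G3", "G10"}
ranking = rank_pathways(experimental_genes, pathways, background=100)
for name, overlap, p_value in ranking:
    print(name, overlap, p_value)
```

A small pathway that captures most of the input list ranks ahead of a large pathway that shares a single gene with it, which is the behavior an automated hypothesis generator would need at minimum before adding richer inference.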

MOSES service

The MOSES service takes two inputs. The first is a data file in csv format with samples as rows and binary features as columns, including a binary column of sample category labels. The second is a yaml format file that holds the moses binary command-line flags, the cross-validation specifications (the size of the training partition and the number of training runs), and the filter cut-offs for the evolved models to return in the final ensemble. The output link informs the user when the analysis has been completed and provides a viewer showing the boolean models with their scores on the training and testing partitions. This table, as well as a table of feature counts from all the models in the final ensemble and copies of the two input files, is available in a zip file for download. A technical outline in the context of our own longevity research and a walk-through of the service for users and developers will be included in our next blog posts.
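The shape of the csv input and the idea of scoring a boolean model on training and testing partitions can be sketched as follows. This does not call MOSES and does not reproduce its flags or yaml schema. The column names, the toy data, the split procedure and the hand-written model standing in for an evolved one are all assumptions for illustration.

```python
import csv
import io
import random
from typing import Callable, Dict, List, Tuple

# Toy input: samples as rows, binary features as columns, binary label in "case".
CSV_TEXT = """case,f1,f2,f3
1,1,0,0
1,1,0,1
0,1,1,0
0,1,1,1
0,0,0,0
0,0,0,1
0,0,1,0
0,0,1,1
"""

Sample = Tuple[Dict[str, bool], bool]


def load(text: str, label: str = "case") -> List[Sample]:
    """Parse the csv into (features, label) pairs of booleans."""
    samples = []
    for row in csv.DictReader(io.StringIO(text)):
        features = {k: bool(int(v)) for k, v in row.items() if k != label}
        samples.append((features, bool(int(row[label]))))
    return samples


def split(samples: List[Sample], train_fraction: float,
          seed: int) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle reproducibly, then cut into training and testing partitions."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(model: Callable[[Dict[str, bool]], bool],
             samples: List[Sample]) -> float:
    return sum(model(features) == label for features, label in samples) / len(samples)


def example_model(features: Dict[str, bool]) -> bool:
    # Hand-written stand-in for an evolved boolean model: f1 AND NOT f2.
    return features["f1"] and not features["f2"]


samples = load(CSV_TEXT)
train, test = split(samples, train_fraction=0.75, seed=0)
train_score = accuracy(example_model, train)
test_score = accuracy(example_model, test)
print(len(train), len(test), train_score, test_score)
```

In this toy data the label is exactly f1 AND NOT f2, so the stand-in model scores perfectly on both partitions. Real evolved models would be compared on their gap between training and testing scores.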

Invitation to Collaboration

The core creative potential of the SingularityNET platform lies in its ability to facilitate synergy between different computational processes, in this case AI and bioinformatic processes. In turn, the enhanced cooperation mechanisms of the platform have the power to exponentially increase the number of AI services being offered, in the biomedical domain and beyond. Below is a link to the MOZI.AI Gene Annotation Service, created in partnership with SingularityNET and offered in the marketplace to SingularityNET customers, whether they come as end-users or as developers interested in contributing to the open-source code or in adding novel services that can interoperate within the SingularityNET biomedical analytics ecosystem.

We will post “Hands-On” guides in subsequent blogs for your reference and enjoyment. Thank you and we look forward to building the future of medicine with you!

https://beta.singularitynet.io/servicedetails/org/mozi/service/gene-annotation-service