Healthcare Big Data

The healthcare sector, which spans a diverse array of industries with activities ranging from research to manufacturing to facilities management (pharma, medical equipment, healthcare facilities), generated roughly 153 exabytes of data in 2013 (1 exabyte = 1 billion gigabytes). By 2020 the sector is estimated to generate 2,134 exabytes. To put that into perspective, data centres globally are expected to have room for only about 985 exabytes by 2020, meaning that more than twice this capacity would be required to house all the healthcare data. Across all sectors, 40 zettabytes (43 trillion gigabytes) of data will be created by 2020 (2.3 trillion gigabytes per day), a 300-fold increase from 2005.
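The comparison between projected healthcare data and data-centre capacity is simple arithmetic, and it is worth checking. The short Python sketch below only reuses the three figures quoted above; the variable names are illustrative.

```python
# Figures quoted in the paragraph above, in exabytes.
healthcare_2013_eb = 153
healthcare_2020_eb = 2_134
datacentre_capacity_2020_eb = 985

# How many times the projected global data-centre capacity would be needed.
capacity_ratio = healthcare_2020_eb / datacentre_capacity_2020_eb

# How much healthcare data is projected to grow between 2013 and 2020.
growth_factor = healthcare_2020_eb / healthcare_2013_eb

print(f"Healthcare data vs. data-centre capacity in 2020: {capacity_ratio:.2f}x")
print(f"Growth in healthcare data, 2013 to 2020: {growth_factor:.1f}x")
```

On these numbers the healthcare sector alone would need a little over twice the projected global capacity, and its data volume would grow roughly fourteen-fold in seven years.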

Big data are characterised by four V’s: volume, velocity (real-time processing will be crucial for healthcare), variety and veracity (noise, abnormality and biases). Poor data quality costs the US economy $3.1 trillion a year, and 1 in 3 business leaders don’t trust the information they use to make decisions; this is also true of the healthcare sector. Moreover, understanding how to connect all healthcare data is a universal challenge faced by life science and pharma leaders, while the global life science analytics market is expected to grow to almost $25 billion by 2021.

This article focuses on AI solutions for data aggregation and synthesis during drug development.

Let’s now look at the odyssey of data during drug development and what has gone wrong so far.

Data and Drug Development

Drug development (petabytes of data) is a notoriously complex, time-consuming and expensive journey, with a high degree of uncertainty about whether a drug will actually succeed. Developing a new prescription medicine that gains marketing approval is estimated to cost drug makers $2.6 billion, with overall success rates of 5.1% for cancer drugs and 11.9% for all other drugs. The entire process takes 10 to 15 years, and as of 2019 many of the largest pharmaceutical firms spend nearly 20% of their revenue on R&D (the cost of drug development). AstraZeneca led the way by spending 25.63% of revenues on R&D, and the semiconductor industry is the only industry that regularly outpaces pharma in R&D spending as a percentage of sales revenues (25 to 28%).

Drug development usually starts as early-stage research in a university lab, a university spin-off or a small biotechnology company, sponsored by the government, by pharma or by both. Then, through a complicated and elaborate process, tons of data (omics and screening data) are produced and kept hidden behind a firewall (negative results, hidden results, raw data), while novel positive preliminary results are seen only by the few who attend conferences (abstracts, posters and PowerPoint presentations). At the end of this process, on average after 2–5 years, only the flattering, positive results of the early-stage research are published as peer-reviewed papers and presented to the public. Once published, these papers are usually treated by pharma as lead generators for choosing future drug candidates for further drug-biomarker development, i.e. clinical development and eventually FDA approval, during which more terabytes of data will be generated. Most of the messy big data produced during clinical development will sit “eternally” in silos, because companies have not (until now) considered them suitable for retrospective analysis. Now add to all this the “new” data from healthcare IoT, clinical wearable devices, robotics, the digital patient, AI and 3D-printed organs, and you see why an industry worth $8.7 trillion by 2020 is ripe for disruption, and disruption is what it is getting because of its messy data.

When it comes to all the data produced before the digital era, life science, pharma and biomedical research all face the following problems:

Replication or Reproducibility Crisis: In 2011 a group of researchers at Bayer found that in more than 75% of cases the published data did not match up with their in-house attempts to replicate them. Richard Horton, the editor of The Lancet, put it only slightly more mildly: “much of the scientific literature (papers), perhaps half, may simply be untrue”.

Hidden Results, Negative Results, Raw Results (collections of multi-omics data, gene-expression analysis, cellular screening, etc.): A huge amount of early-stage research gets presented only at conferences (abstracts, posters and presentations), and it is estimated that only half of it appears in the academic literature. Studies presented only at conferences are almost impossible to find or cite, since very little information about them is available online. Additionally, a systematic review done in 2010 looked for papers investigating what happens to all conference material. Interestingly, the vast majority found that unflattering negative results are more likely to go missing. And the fate of the negative, hidden or raw results until now? Lost in some huge fire- and humidity-resistant archive, or on a PhD student’s computer.

Ghostwriting: Furthermore, some academic literature can be ghost-managed behind the scenes to an undeclared agenda. In reality, some academic articles are written by a commercial writer (ghostwriter) employed by pharma, with an academic’s name placed at the top to give an imprimatur of independence and scientific rigor. Often these academics have had little or no involvement in collecting the data or drafting the paper.

To make a long story short, the aggregation and synthesis of data during drug development need a big dose of innovation, and hopefully the solution to all these problems will come from AI startups aiming to:

extract the “right” knowledge from literature,

generate insights from thousands of unrelated data sources,

improve decision-making,

eliminate blind spots in research, and

identify competitive whitespace.

Let’s now look at some of these AI startups.

AI Startups for Data Aggregation and Analysis during Drug Development

AccutarBio (New York US, 2015) employs AI for drug discovery and so far offers: 1) a data-driven, atom-based scoring function trained on 100,000 protein crystal structures containing information on more than 100 million amino acid side chains, 2) a dynamic deep neural network specifically designed for chemical informatics and 3) drug-pocket side-chain conformation prediction and drug docking. The company is now partnering with Amgen. Investors: IDG Capital, YITU Technology

Ardigen (Kraków Poland, 2015) is a Polish bioinformatics company, part of the Selvita Group, active in the fields of laboratory information management systems, biological and clinical data analysis, Big Data integration and custom application development. In September 2019 Ardigen was accepted into the prestigious TESLA consortium, which works to improve the personalization of immunotherapies. TESLA is a global effort of nearly 40 groups from top academic, biotech and big pharma organizations, focused on achieving the best neoantigen prediction for personalized cancer immunotherapies. Ardigen’s neoantigen prediction platform, called ‘ArdImmune Vax’, employs state-of-the-art bioinformatics and AI to suggest an optimal set of neoantigens to be used as targets for cancer vaccines or adoptive cell therapies. The heart of the platform is a proprietary algorithm, ‘ArdImmune Rank’, built to predict a neoantigen’s probability of eliciting an immune response. Moreover, its safety evaluation, which excludes peptides subject to off-target reactions or central tolerance mechanisms, made Ardigen an attractive partner for the TESLA consortium.

Biorelate (Manchester UK, 2014) helps scientists solve the most difficult biomedical challenges of today by curating truths from existing knowledge, enabling smarter and faster research and development. Biorelate provides biomedical knowledge databases, curated from published literature, to pharmaceutical and biotechnology companies and academic institutes. Investors: Manchester Tech Trust Angels, NPIF Maven Equity Finance, Catapult Ventures, GM&C Life Sciences Fund

BioSymetrics (New York US) is an AI and machine learning software-as-a-service company whose products are used in drug discovery, precision medicine and health systems. Augusta, its biomedical AI and ML framework, is designed to shift time away from data pre-processing and integration and towards model building and interrogation, using familiar toolsets within Python. Augusta begins with diverse, raw medical data types (e.g. images, chemical structures, genomic data, tabular data) and operates across three modules: Augusta Pre-Processing, Augusta ML (Machine Learning) and Augusta Architect.

Biotx.ai (Berlin Germany, 2017) designs AI for biomedical data, helping to reliably find complex patterns in high-dimensional biomedical data. Biomedical data are difficult to analyze because of small patient cohorts, limited sample sizes and many other structural factors; the company’s platform is meant to find complex interactions within such data and to retrieve highly accurate predictive biomarkers.

Causaly Inc (London UK, 2017) develops AI-based solutions that validate causal claims and generate hypotheses in biomedical systems. The company offers a semantic AI platform that machine-reads corpora of scientific articles and extracts causal associations through linguistic and statistical models; it also allows users to provide their own corpora and combine them with an in-house knowledge graph. The company claims that its technology increases productivity in literature reviews by filtering out false positives. Investors: Marathon Venture Capital, Emerge Education
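As a rough illustration of what “extracting causal associations” from text can mean, the Python sketch below pulls (cause, verb, effect) triples out of sentences with a single naive pattern. It is a toy example and not Causaly’s method, which the description above says combines linguistic and statistical models; the verb list, the function name and the sample sentences are all invented for illustration.

```python
import re

# Naive pattern: "<cause> <causal verb> <effect>". The verb list is an
# illustrative assumption, not a real biomedical vocabulary.
CAUSAL_PATTERN = re.compile(
    r"(?P<cause>[\w\- ]+?)\s+"
    r"(?P<verb>causes|induces|inhibits|activates)\s+"
    r"(?P<effect>[\w\- ]+)",
    re.IGNORECASE,
)

def extract_causal_claims(sentences):
    """Return (cause, verb, effect) triples for sentences matching the pattern."""
    claims = []
    for sentence in sentences:
        match = CAUSAL_PATTERN.search(sentence)
        if match:
            claims.append(
                (match["cause"].strip(), match["verb"].lower(), match["effect"].strip())
            )
    return claims

# Hypothetical sentences standing in for a corpus of abstracts.
sentences = [
    "Aspirin inhibits platelet aggregation.",
    "The trial enrolled 200 patients.",
    "Chronic inflammation induces insulin resistance.",
]

claims = extract_causal_claims(sentences)
for cause, verb, effect in claims:
    print(f"{cause} --{verb}--> {effect}")
```

A pattern like this breaks down quickly on negation, hedging (“may inhibit”) and passive voice, which is one reason real systems layer statistical models on top of linguistic rules.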

Datavant, Inc (San Francisco US, 2017) applies AI to the clinical trial process, organizing and structuring healthcare data to produce actionable insights for the design and interpretation of clinical trials (it aggregates and analyzes biomedical data through machine learning to lower the time, cost and risk of drug development). Investors: Roivant Sciences, SoftBank, Founders Fund

Deep Intelligent Pharma (DIP) (Beijing China, 2017) is a global start-up dedicated to empowering and accelerating drug discovery, development and registration through advanced AI technologies. With its end-to-end AI-driven platforms, the company enables clients to move compounds efficiently, and with high quality, from the lab to the post-marketing stage. It operates across China, the US and Japan. Investors: Sequoia Capital China, ZhenFund

Data4Cure’s (California US, 2013) Biomedical Intelligence Cloud platform and services help biopharmaceutical companies make more informed decisions using bioinformatics, machine learning and AI applications built on top of the largest repository of semantically linked biomedical data and literature.

Data2Discovery (Utah US, 2012) uses AI to find hidden connections and new insights in diverse, linked datasets. It allows researchers to understand and treat disease by connecting data in new ways.

DeepPhenome (California US, 2016) develops AI-based technologies to increase the productivity of pathologists, empower oncologists by predicting a patient’s response to therapy, and expedite drug validation by pharmaceutical companies. It leverages three leading technologies for precision medicine: AI and deep learning, computer vision, and blockchain.

Genialis (Texas US, 2011) uses AI to analyze multi-omics next-generation sequencing data for contextual, systems-level insights. It allows researchers to reveal previously unseen patterns across large, heterogeneous datasets in order to predict targets and biomarkers.

HelixAI (Georgia US, 2017) uses AI to respond to verbal questions and requests in a lab setting. It allows researchers to increase efficiency, improve lab safety, keep current on relevant new research and manage inventory. HelixAI is participating in the Amazon Alexa accelerator.

Elucidata (New Delhi India, 2015) uses AI to analyze complex biological datasets. It allows researchers to standardize and streamline the analysis of -omics data from multiple sources.

EvidScience (California US, 2017) provides what it describes as the most comprehensive, accessible and affordable database of therapy evidence in the world. The platform uses AI and machine learning to index the medical literature, ingesting the entire primary literature of peer-reviewed articles and pulling out the salient data points at a rate 500 times faster than a human. It then blends these results across innumerable combinations to generate customized summaries of cost and outcomes data in seconds. Their patented AI “reads and understands” the medical literature, reducing weeks (or even months) of work to a few clicks and enabling customers to make faster, smarter, evidence-based decisions.

Iris.ai (Oslo Norway, 2015) is a research assistant that drastically increases the performance of R&D teams in mapping out existing knowledge (published research, patents, internal R&D content). Moving beyond limiting keywords, endless result lists and citation bias, Iris.ai positions itself as the AI assistant for cross-disciplinary early-stage research projects. Investors: Bakken & Baeck, INDEX: Design to Improve Life, Nordic Impact, Singularity University Ventures, Founders Factory

Intelligencia’s (New York US, 2017) iNsight, a proprietary data cube, integrates structured and unstructured data from a host of sources to assess the probability of technical and regulatory success of an asset (a drug in development) at any stage of clinical development, across Phases 1–3. The company also interprets the reasons behind its estimates and provides insights into the drivers, positive or negative, of that probability.

Intellegens’ (Cambridge UK, 2017) Alchemite technology uses AI to learn underlying correlations in fragmented datasets with incomplete information. CTO and co-founder Dr Gareth Conduit discovered a new method of analysing sparsely populated matrices using deep neural networks and novel machine learning approaches, allowing researchers to estimate missing knowledge of how candidate drugs act on proteins and to aid the design of new drug cocktails that activate proteins to cure disease. In February 2018 the company announced its first commercial collaboration, with e-Therapeutics, an Oxford-based pioneer of Network-Driven Drug Discovery.
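To make the idea of “estimating missing knowledge” in a sparsely populated drug-protein matrix more concrete, here is a toy Python sketch. It is not Alchemite, which the description above says relies on deep neural networks; it swaps in a much simpler, well-known technique, iterative low-rank (SVD) imputation with NumPy. The function name, the activity matrix and its values are invented for illustration, and the example assumes the underlying data really are low-rank.

```python
import numpy as np

def impute_low_rank(matrix, rank=1, n_iter=200):
    """Fill NaN entries by repeatedly projecting onto a rank-`rank` approximation."""
    observed = ~np.isnan(matrix)
    # Start by filling the gaps with the mean of the observed entries.
    filled = np.where(observed, matrix, np.nanmean(matrix))
    for _ in range(n_iter):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
        # Keep measured values fixed; update only the missing ones.
        filled = np.where(observed, matrix, low_rank)
    return filled

# Hypothetical activity matrix: rows are candidate drugs, columns are proteins.
# The "true" pattern is rank 1 (an outer product); two measurements are hidden.
drugs = np.array([1.0, 2.0, 3.0, 4.0])
proteins = np.array([0.5, 1.0, 1.5])
truth = np.outer(drugs, proteins)

activity = truth.copy()
activity[1, 2] = np.nan  # true value is 3.0
activity[3, 0] = np.nan  # true value is 2.0

completed = impute_low_rank(activity)
print(np.round(completed, 2))
```

Real screening data are noisy and rarely this close to low-rank, so a method like this serves only as a baseline; that gap is part of the motivation for the neural-network approaches described above.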

Innoplexus (Frankfurt Germany, 2011) is a consulting-led technology and product development company focusing on big data and analytics, helping life sciences companies generate actionable insights across the pre-clinical, clinical, regulatory and commercial stages of a drug.

Ingentium (Boston US, 2015) is a next-generation content company and online community service provider. Ingentium uses a novel big data cyber-infrastructure technology to aggregate and refine the latest news and information into focused, disease-specific knowledge bases. These knowledge bases are used to deliver media to patients, physicians and the medical research community in the Health Magazines, a component of the Ingentium Health Network. Ingentium offers free access to its Magazines, where users can post comments, create alerts and join specific communities of interest that span major health and wellness topics.

InsideDNA (London UK, 2013) focuses on refining biomarker strategies, reducing clinical trial risks and providing a more refined stratification plan, thereby increasing the specificity and success rates of clinical programs and drug candidates. Its proprietary, scalable analytics platform (InsideDNA) helps analyse large data volumes quickly.

InveniAI’s (Connecticut US) EvolverAI is an augmented intelligence platform that combines human intuition, based on experience and expertise, with the comprehensiveness and scale of AI. EvolverAI combines Big Data technology with machine learning to present massive and complex data sets to human experts in a manner that allows them to decipher patterns and generate hypotheses. EvolverAI combines the tenets of deductive, inductive and abductive research in this journey of discovery.

iCarbonX (Shenzhen China, 2015) is a top AI startup that aims to build an ecosystem of digital life based on the combination of consumers’ big life data, the internet and AI. The company plans to create the first and most professional platform in China for collecting millions of health data records, to combine the best biotechnology with a top AI team, and to apply AI to the analysis of big life data through data mining and machine learning. In September 2018 it was announced that iCarbonX had entered into a cooperation agreement with a wholly owned subsidiary of DaChan Food (Asia), jointly establishing “Better Me Precision Nutrition Limited” (“Better Me”). Better Me will make dynamic and intelligent analyses of members’ life data based on biotechnology, big data and AI technology, combining these with experience in healthy food research and development, production and catering operation management, in order to become a leading provider of precision nutrition catering solutions. Investors: China Bridge Capital, Tencent Holdings, Zhongyuan Union Cell & Gene Eng

Dr. Daphne Koller and Dr. Carlos Bustamante discuss machine learning, drug discovery, Eroom’s Law, and how data science is revolutionizing the life sciences. October 15, 2018

Insitro (California US, 2018) is a top AI startup that integrates machine learning techniques into drug development. It brings together life sciences, engineering and data science to define problems, design experiments, analyze the data and derive insights that lead to new therapeutics. The company recently partnered with Gilead Sciences to find medicines to treat a liver disease called nonalcoholic steatohepatitis (NASH), drawing on all the related human data that Gilead has amassed over time. Investors: Foresite Capital, Third Rock Ventures, Andreessen Horowitz, ARCH Venture Partners, GV

Quertle’s (Nevada US, 2008) BioAI (biomedical-specific AI) platform is said to enable at least ten-fold deeper and at least 30 times more efficient discovery, reducing R&D dead ends, optimizing clinical trials, speeding products to market, identifying treatment options and much more. With AI-based predictive visual analytics that work on the actual meaning of the text, products built on the BioAI platform provide intuitive means to explore complex content and to make serendipitous discoveries.

KEEN EYE (Paris France, 2013) is a health tech company building machine learning technology for translational and clinical research, with a particular focus and a ‘keen expertise’ on imaging data. Its technology allows doctors and biologists to reproduce and extend their visual expertise, notably by identifying signals with high predictive value that are difficult to detect by eye. Users save valuable time in day-to-day decision-making, whether diagnosing, screening for disease or evaluating the effectiveness of a drug, and do so in a more precise and standardized way.

Linguamatics (Cambridge UK, 2001) is a software company providing high-performance, natural language processing (NLP) based text mining software. The software enables the rapid extraction of business-critical facts and relationships from large document collections. It can be used for business and competitive intelligence, life sciences research and mining social media such as Twitter.

LabTwin (Berlin Germany, 2018) uses AI to understand voice-based commands and transcribe voice-based notes. It allows researchers to take notes and organize lab documentation faster and with less effort: they can take notes, create order lists and set reminders or timers in real time from anywhere in the lab just by talking to LabTwin. In October 2019 a new partnership was announced between LabTwin and ABI-LAB, a life science incubator and accelerator that supports biotech, medtech and medical data startups.

MOZI is creating smarter tools to analyse the full potential of the ever-growing body of medical knowledge. It is also building a suite of R&D tools, powered by machine learning, for scientists to explore and decipher the mechanics of disease, leading to the development of the next generation of therapeutics.

MediBIC Group (Tokyo Japan, 2000) provides healthcare and drug development services. It offers genetic testing services with diagnostic tools to doctors, GLP-compliant gene profiling services to pharmaceutical companies, and sample banking and management services; it also provides health care services for pets.

Meta (California US, 2009) is a free discovery tool, designed to help researchers stay up-to-date with biomedical literature. It uses AI to organize and track over 67 million biomedical diseases, genes, proteins, techniques, researchers, journals, papers, and more — including full coverage of PubMed and preprints from bioRxiv. Meta is free to use, and free of ads. It is enabled by the Chan Zuckerberg Initiative. With the integration of Altmetric data, users can understand the global consumption of research outputs across the web, including mentions in the mainstream media, as well as where they are shared and discussed via online tools and communities such as Twitter, Wikipedia, F1000, Facebook, Reddit, and Mendeley.

Owkin (New York, US, 2016) uses AI to build intelligence from distributed datasets, including through privacy-safe transfer and federated learning. It allows researchers to overcome the problem of data-sharing in healthcare in order to automate diagnostics, predict treatment outcomes and optimize clinical trials. In October 2019 it was announced that, by scanning thousands of tissue samples, Owkin had helped identify new types of patients with mesothelioma and predict which of them may respond better to certain therapies. Investors: Cathay Innovation, NJF Capital, Otium Capital, Plug and Play, GV, F-Prime Capital

Plex Research (2017) is a Boston-area start-up that’s transforming the humble search bar into a drug discovery powerhouse. Their search engine will scour all the world’s biomedical research data, and find the bit.

PatSnap (Singapore, 2007) has brought together what it calls the world’s most comprehensive R&D dataset in one easy-to-use platform, to help innovation leaders analyse tech trends, assess new opportunities, conduct competitor intelligence and maximise return on IP assets. By combining millions of data points from patents, licensing, litigation and company information with non-patent literature, PatSnap provides organisations with a new, intuitive source of information to accelerate their R&D. Investors: Accel X, Jiantou Huawen Investment, Qualgro VC, Sequoia Capital China, Shunwei Capital, Vertex Ventures, Temasek Holdings, Summit Partners

Percayai (Missouri, US, 2018) uses AI to organize and prioritize data in a contextual manner, enabling interactive 3D diagrams that illustrate biological information. It allows researchers to rapidly generate testable hypotheses from complex omic and multi-omic data sets.

Satalia (London UK, 2008) helps companies solve their most challenging optimisation problems in telecommunications, financial services, system design and drug discovery. Satalia builds full-stack AI solutions for some of the world’s best-known companies and tackles industries’ hardest problems by combining machine learning technologies with its optimisation-as-a-service platform.

SciNote (Wisconsin US, 2015) is a top-rated platform for researchers in academia or industry. It offers efficient digital lab management with all experimental data in one place, from note-keeping to inventory management, reporting and 21 CFR Part 11 compliance. In SciNote all data are searchable, accessible and traceable.

Sparrho was founded in 2013 by two Oxbridge scientists out of frustration with existing literature search tools, and now has a team based in London. It uses AI, in combination with human expertise, to curate millions of scientific papers from thousands of publications, allowing researchers to stay up to date with new scientific publications and patents. With 60 million+ papers and patents from 45k+ journals and preprint servers, Sparrho’s content is enhanced by researchers from 1,500+ universities in 150+ countries. Investors: Beast Ventures, Entrepreneur First, AllBright, White Cloud Capital, Pitch@Palace

Researchably (California US, 2016) offers an AI platform that helps the world make sense of biomedical research. Keeping up with the latest biomedical research is critical, and Researchably enables clients to do it smarter and faster, so that they can worry less about searching and focus on helping their colleagues and KOLs stay informed.

ThoughtSpot (California US, 2012) is a business intelligence platform for exploring, analyzing and sharing real-time business analytics data easily. ThoughtSpot says its AI-driven analytics platform puts the power of a thousand analysts in every business person’s hands. Investors: Capital One Growth Ventures, Geodesic Capital, Hewlett Packard Pathfinder, Qualgro VC, ServiceNow, Sapphire Ventures, General Catalyst, Lightspeed Venture Partners

Nference, Inc (Massachusetts US, 2013) offers an AI software platform that augments scientists’ ability to rapidly generate holistic, data-driven and unbiased hypotheses. The company offers neural networks for real-time, automated extraction of knowledge from scientific, clinical, regulatory and commercial datasets, as well as drug discovery, life cycle management, drug development and precision medicine services. Investors: Matrix Capital Management

Kyndi’s (California US, 2014) AI platform uses ML to streamline regulated business processes and offers auditable AI systems for government, financial services and healthcare. The company’s mission is to build explainable AI products that optimize human performance. Investors: Citrix Startup Accelerator, Citrix Systems, Darling Ventures, Density Ventures, J. Hunt Holdings, PivotNorth Capital, Creative Destruction Lab

To conclude, it is fair to say that the pharma industry has historically done a poor job of managing its data. Data management is at the heart of improving processes and ensuring that information is clean, findable and useful. But data, whether clinical or preclinical, are both a huge asset and a big problem for the pharma industry. So improving the data lifecycle is undoubtedly a priority, but a difficult question has to be addressed first: “CLEANSING OF OLD DATA: WORTH THE COST?”