Corpus Conversion Service and Corpus Processing Service

To unlock the knowledge in published structured and unstructured data on COVID-19, IBM researchers are making two key technologies available: the Corpus Conversion Service and the Corpus Processing Service. Both are already in extensive use in the materials science, automotive and energy industries.

The Corpus Conversion Service can ingest 100,000 PDF pages per day, including scanned documents, on a single server. It can then train and apply advanced machine learning models that extract the content from these documents with high accuracy at a scale never achieved before. We have applied this technology to thousands of PDFs on the coronavirus and COVID-19 and combined the extracted content with curated databases from DrugBank, ClinicalTrials.gov and GenBank.

The Corpus Processing Service integrates data from databases and publications into a knowledge graph, so that the combined data can be queried to retrieve known facts and to generate novel insights.

Examples of the types of queries:

Which drugs have been used so far, and what are the outcomes?

Identify new, reported risk factors
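
To make the idea of querying a knowledge graph concrete, here is a minimal sketch in Python of the drugs-and-outcomes kind of query. It is not the Corpus Processing Service or its API. The triples, the relation names (`tested_in`, `has_outcome`, `source`) and the functions `build_graph` and `drugs_and_outcomes` are all hypothetical, and the entries are placeholders rather than real drugs, trials or results.

```python
from collections import defaultdict

# Hypothetical placeholder triples (subject, relation, object); not real data.
TRIPLES = [
    ("drug_A", "tested_in", "trial_001"),
    ("drug_B", "tested_in", "trial_002"),
    ("trial_001", "has_outcome", "outcome_X"),
    ("trial_002", "has_outcome", "outcome_Y"),
    ("trial_001", "source", "registry_record"),
    ("trial_002", "source", "publication"),
]


def build_graph(triples):
    """Index triples as graph[subject][relation] -> list of objects."""
    graph = defaultdict(lambda: defaultdict(list))
    for subject, relation, obj in triples:
        graph[subject][relation].append(obj)
    return graph


def drugs_and_outcomes(graph):
    """Follow drug -> trial -> outcome edges; return sorted (drug, trial, outcome) tuples."""
    results = []
    for subject, relations in graph.items():
        for trial in relations.get("tested_in", []):
            # .get avoids adding new keys to the graph while iterating over it.
            for outcome in graph.get(trial, {}).get("has_outcome", []):
                results.append((subject, trial, outcome))
    return sorted(results)


if __name__ == "__main__":
    graph = build_graph(TRIPLES)
    for drug, trial, outcome in drugs_and_outcomes(graph):
        print(drug, trial, outcome)
```

The sketch joins facts that could come from different sources (a registry record and a publication in the placeholder data) by walking two edges of the graph, which is the kind of cross-source lookup that the integration step is meant to enable.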