Can you read fast? Really, really fast?

As you start reading this blog, consider that a person reads an average page of text in about 2 minutes, so it could take you about 10 minutes to read this whole post, more or less. Now imagine reading 10 to 20 pages of a scientific paper. Next imagine reading hundreds, thousands, or even millions of such papers. Not an easy — if even feasible — task for a single person or even for a group of avid readers. And even if a group of people can read many scientific publications in a reasonable time, how would they then combine their acquired knowledge and establish correlations between articles and terms of interest, find common patterns related to a specific subject, and so on?

This is one of the major challenges now facing health professionals and researchers. It’s estimated that around 2 million new scientific publications appear every year (with exponential growth over the past decade), bringing the total to well over 50 million publications documented since 1665, according to an article from ResearchGate.

Obviously, the advancement of science — and society — relies on researchers sharing the knowledge and results of their arduous work via scientific publications. However, when it comes to consuming and mining that vast amount of information, there’s clearly room for improvement.

Data science can help, especially with techniques and tooling for text analytics and natural language processing (NLP). Text analytics provides the ability to process a large collection of unstructured data, in this case text from scientific publications, to output data that can be further analyzed to discover new insights. Related to text analytics, NLP provides machines with the ability to understand aspects of human language, such as the relationships between words, grouping of words into phrases, and much more.
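As a toy illustration of what the counting side of text analytics looks like, here is a minimal sketch in plain Python (no NLP library, sample abstracts invented for the example) that tokenizes a small set of abstracts and counts word frequencies; tools like Watson Explorer do far more, but this is the basic idea:

```python
import re
from collections import Counter

# Two invented abstracts standing in for a large publication corpus.
abstracts = [
    "Antibiotic resistance is a growing threat in infectious disease.",
    "New antibiotic compounds may slow the spread of resistance.",
]

def word_counts(texts):
    """Lowercase each text, tokenize on letter runs, and count occurrences."""
    counter = Counter()
    for text in texts:
        counter.update(re.findall(r"[a-z]+", text.lower()))
    return counter

counts = word_counts(abstracts)
print(counts["antibiotic"])  # → 2 (once per abstract)
```

Scaled up to millions of abstracts, frequency tables like this are one of the raw inputs that make cross-publication patterns visible.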

Here’s someone who can read fast! (Not a real person though)

IBM Watson Explorer Content Analytics is a powerful tool created to help with this kind of analysis. It collects and analyzes structured and unstructured content from documents, databases, websites, and many other types of data repositories. (Note: Typical spreadsheets with rows and columns are examples of structured content; unstructured content includes things like text from articles and emails.)

Watson Explorer crawls, parses, and analyzes content to create a searchable index, which lets researchers perform text analytics across all the data and query the index to quickly find and retrieve relevant documents from a ranked list of results. Watson Explorer also offers a rich content mining user interface that lets users explore the data interactively, uncovering facets, relationships, and anomalies across the collection.
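Conceptually, the searchable index described above resembles an inverted index: a map from each term to the documents that contain it. This is a simplified sketch of the idea, not Watson Explorer's actual implementation:

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    return index

def query(index, term):
    """Return the ids of documents containing the queried term."""
    return index.get(term.lower(), set())

# Invented mini-corpus of three "abstracts".
docs = {
    1: "Influenza vaccines and antiviral drugs",
    2: "Antiviral resistance in influenza strains",
    3: "Malaria prevention in tropical regions",
}
index = build_index(docs)
print(sorted(query(index, "influenza")))  # → [1, 2]
```

Because lookups go term-first, retrieving every document that mentions a query term is fast even over millions of abstracts; ranking the results is a separate step layered on top.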

Let’s look at a case where Watson Explorer can help us analyze and derive new insights from scientific publications. We start by importing into Watson Explorer a large number of medical journals downloaded from a public source such as the US National Library of Medicine, also known as PubMed. Data gathering and some preparation are required to retrieve the journal abstracts, or the full text, before importing into Watson Explorer, which accepts several input formats. For our example, we have a CSV (comma-separated values) file with one publication title and abstract per row, ready to import into Watson Explorer.
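The preparation step for that CSV can be as simple as writing one title/abstract pair per row with Python's standard `csv` module. A minimal sketch (the file name, column names, and sample records here are illustrative, not anything Watson Explorer requires):

```python
import csv

# Invented title/abstract pairs standing in for retrieved PubMed records.
publications = [
    ("Emerging infectious diseases", "We review recent outbreak data ..."),
    ("Vaccine efficacy trials", "A meta-analysis of randomized trials ..."),
]

with open("pubmed_abstracts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "abstract"])  # header row
    for title, abstract in publications:
        writer.writerow([title, abstract])
```

Using the `csv` writer (rather than joining strings by hand) keeps commas and quotes inside abstracts properly escaped.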

For this example, let’s look at publications related to infectious disease. Figure 1 shows that we took a collection of almost 170K medical publications (spanning several decades) and imported them into Watson Explorer. Watson Explorer’s text analytics engine parses and organizes the input text, breaking it down into natural-language parts of speech such as nouns, verbs, and adjectives. The tool also gives the count for each parsed word. Selecting one of the words highlights it in the medical paper abstracts on the right. We can also enter terms of interest in the query bar at the top and see those terms highlighted in the text.
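The term highlighting described above can be sketched with a case-insensitive regex substitution; the `**...**` markers and the sample abstract are illustrative stand-ins for the tool's actual highlighting:

```python
import re

def highlight(text, term):
    """Wrap every case-insensitive match of term in **...** markers."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return pattern.sub(lambda m: f"**{m.group(0)}**", text)

abstract = "Influenza spreads rapidly; influenza vaccines reduce severity."
print(highlight(abstract, "influenza"))
# → **Influenza** spreads rapidly; **influenza** vaccines reduce severity.
```

Note `re.escape` on the query term, so user input containing regex metacharacters (e.g. "H1N1+") is matched literally rather than interpreted as a pattern.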