DApp of the Week #07 — Troubadour

A Natural Language Processing (NLP) platform using decentralized cloud computing.

Talking, writing and reading: as human beings, we can do it all. For this, we use one of the many human languages that can be found around the globe. Through language, we attach meaning to the things that we perceive throughout our daily lives. Human beings are not the only ones that communicate through languages; computers do so as well.

Nowadays, we use programming languages to communicate with computers, but what kind of possibilities would we have if we could use a language that is directly understood by both parties? This installment of DApp of the Week will look into examples of how we are already bridging the gap between data on a computer and human language.

Introduction

It is easy to forget just how complex the process of reading and understanding human language is. Computers are able to process data much faster than we humans can. They are extremely efficient at working with standardized and structured data such as database tables.

Unfortunately, humans don’t communicate data in such a ‘structured’ way. We communicate using words, a form of unstructured data. With unstructured data, rules are quite abstract and challenging to define concretely. Things like context, sarcasm, and proverbs are already difficult for humans to understand, so can you imagine what it’s like for computers?

In human languages, we don’t always say what we mean; and we don’t always mean what we say.

Nowadays, the majority of data collected within enterprises exists in the form of emails, reports and other documents. Analysts at Gartner, one of the world’s leading research and advisory companies, estimated that more than 80% of enterprise data today is unstructured.

Businesses across all industries are facing a growing need to observe, interpret, and evaluate this type of data within their own specific industry use case. The expertise, knowledge, and information of industry professionals are time-sensitive and quickly become outdated. At the same time, decisions have a bigger impact in a highly interconnected world. At the moment, users are already able to perform simple textual content searches on unstructured data.

Finding the desired information can be hard and very time-consuming due to the overload of information.

There still exists an urgent need to retrieve all useful information contained within data without having to manually read through all of it. In other words, this unstructured data needs to be transformed into structured data before it can be of any practical use.

What is Natural Language Processing?

NLP is the ability of machines to understand and interpret human language the way it is written and spoken. NLP sits at the intersection of computer science, AI, and computational linguistics. The objective of NLP is to make machines as intelligent as humans in understanding language.

This technology allows us to break human (natural) language down into elementary components that can be tagged and organized accordingly. Storing this information in a standardized format allows us to use this data as structured data.
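As a rough illustration of that idea, the sketch below splits a sentence into tokens and attaches a tag to each one, producing rows that could be stored like any other structured data. The tiny lexicon is a hypothetical stand-in for a trained language model, and the function names are ours; this is not Troubadour's or NewsReader's actual pipeline.

```python
import re

# Hypothetical toy lexicon; a real system would use a trained tagger instead.
TOY_LEXICON = {
    "the": "DET",
    "doctor": "NOUN",
    "vaccine": "NOUN",
    "recommended": "VERB",
}


def tokenize(text):
    """Split text into lowercase word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def tag_token(token):
    """Look the token up in the toy lexicon; fall back to PUNCT or UNK."""
    if token in TOY_LEXICON:
        return TOY_LEXICON[token]
    return "UNK" if token.isalnum() else "PUNCT"


def to_structured_rows(text):
    """Turn free text into a list of rows with a position, token and tag."""
    return [
        {"position": i, "token": tok, "tag": tag_token(tok)}
        for i, tok in enumerate(tokenize(text))
    ]


rows = to_structured_rows("The doctor recommended the vaccine.")
for row in rows:
    print(row)
```

Each row has the same fields, so the result can be filtered, counted or stored in a table, which is what makes the text usable as structured data.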

Such a standardized format for the content of the data would make it much more accessible to users and would allow textual analytics to be conducted in order to extract knowledge and information from this data. Examples of extracted knowledge include entities, facts, relations between concepts as well as sentiment, opinions, and emotions.

Although the term ‘NLP’ is not commonly heard, many people interact with it daily without realizing it. Google Translate, spam filters and web search engines are all commonly used products that rely on NLP methods.

Troubadour

Troubadour is a data enhancement platform providing intuitive and accessible Natural Language Processing (NLP) tools, to be used by anyone as a solution to the ‘information overload’ problem. The aim of this platform is to provide users with a way to optimally make use of all of their unstructured data, by converting it into structured data. Our vision is that Troubadour can offer this solution for any domain or industry dealing with unstructured data in the form of written natural language. Education, legal affairs, clinical research, and tourism are just some examples of these domains.

Troubadour is based on the NewsReader project, an EU-funded academic initiative aiming to provide NLP solutions that are accessible to everyone. NewsReader is being developed by a consortium including the Computational Lexicology and Terminology Lab (CLTL) of the VU University of Amsterdam, led by Prof. Dr. Piek Th.J.M. Vossen.

NewsReader is a system that extracts what happened to whom, when and where from multiple sources, and stores this in a structured database, enabling more precise search over this immense stack of information.
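To make the "structured database" idea concrete, here is a minimal sketch that stores who/what/when/where records in an in-memory SQLite table and then searches them by field. The schema, the function names and the sample records are all invented for illustration; NewsReader's real storage format is not described here.

```python
import sqlite3

# Invented sample records: (actor, action, event_date, location, source).
SAMPLE_EVENTS = [
    ("Company A", "acquires Company B", "2019-03-01", "Amsterdam", "site-one.example"),
    ("Company C", "opens office", "2018-06-15", "Amsterdam", "site-two.example"),
    ("Company A", "appoints new CEO", "2017-01-10", "Brussels", "site-one.example"),
]


def build_event_store(events):
    """Create an in-memory table and load the extracted event records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events ("
        "actor TEXT, action TEXT, event_date TEXT, location TEXT, source TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?)", events)
    return conn


def find_events(conn, actor=None, location=None):
    """Return matching events in chronological order, filtered by field."""
    query = "SELECT actor, action, event_date, location FROM events WHERE 1=1"
    params = []
    if actor is not None:
        query += " AND actor = ?"
        params.append(actor)
    if location is not None:
        query += " AND location = ?"
        params.append(location)
    query += " ORDER BY event_date"
    return conn.execute(query, params).fetchall()


conn = build_event_store(SAMPLE_EVENTS)
for event in find_events(conn, location="Amsterdam"):
    print(event)
```

Unlike a plain keyword search, a query such as "everything that happened in Amsterdam" or "everything Company A did, in order" only touches the relevant field, which is what makes the search more precise.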

Scanning data in news stories from around the globe, NewsReader provides a solution to the data volume problem, by partly mimicking how humans read text and integrate new information with what is known of the past. Like human readers, NewsReader will reconstruct a coherent story in which new events are related to past events.

In contrast to human readers, however, NewsReader will not forget any detail: it keeps track of all existing facts and will even know how stories differ from source to source.

Likewise, NewsReader will be able to present the essential knowledge and information both as structured lists of data and facts and as abstract schemas of event sequences that represent stories going back in time, as humans do. This allows us to detect trends, events with impact and social networks of people over time and regions. We can query long-term developments spanning decades for individuals or types of individuals to discover events that remained unnoticed.

Use case: The polarizing vaccination debate

The importance of proper access to unstructured data becomes clearly visible when we look at the fierce vaccination debate that is happening in our society at the moment. Not having access to information or, worse, having access to false information could literally mean the difference between life and death. Despite significant potential to enable dissemination of factual information, social media are frequently abused to spread harmful health content. This potentially reduces vaccine uptake rates and increases the risks of global pandemics, especially among the most vulnerable.

Recent outbreaks of measles, mumps, and pertussis and increased mortality from vaccine-preventable diseases such as influenza and viral pneumonia show how important it is to combat online misinformation about vaccines. In 2015, for example, Québec was hit with an outbreak of measles even though a vaccine that can prevent this childhood infection is freely available. Unfortunately, due to the doubts that currently hang over vaccination, new episodes have emerged. Findings revealed that 83% of parents who hesitate to vaccinate their children are concerned about the potential side effects of vaccines and 77% doubt their efficacy: two misconceptions that tend to spread through social media.

Studies have shown that access to a vast amount of content through the Internet without intermediaries has resulted in major segregation of users into polarized groups. Users select information that adheres to their system of beliefs and tend to ignore dissenting information. In other words, we fit logic to our perspective instead of our perspective to logic.

Since we tend to see only what we want to see, we cannot rely on our own skewed, subjective perspective to determine right from wrong. Fortunately, using NewsReader, we are able to objectively map the unstructured data available from both sides. First, we start by scraping a plethora of websites about vaccination and run the content through our NLP pipeline. By connecting the information in the processed data and turning it into event-centric knowledge graphs, we are able to generate a so-called perspective web.
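One simple way to picture a perspective web is as a mapping from each claim to the stances taken on it and the sources behind each stance. The sketch below assumes the pipeline has already produced (source, claim, stance) triples; that data model, the function names and the placeholder records are our own assumptions, not NewsReader's actual representation.

```python
from collections import defaultdict

# Placeholder pipeline output: (source, claim identifier, stance).
SAMPLE_STATEMENTS = [
    ("site-one.example", "claim-1", "supports"),
    ("site-two.example", "claim-1", "disputes"),
    ("site-three.example", "claim-2", "supports"),
    ("site-one.example", "claim-2", "supports"),
]


def build_perspective_web(statements):
    """Group sources by the claim they discuss and the stance they take."""
    web = defaultdict(lambda: defaultdict(set))
    for source, claim, stance in statements:
        web[claim][stance].add(source)
    return web


def contested_claims(web):
    """Return the claims on which sources take more than one stance."""
    return sorted(claim for claim, stances in web.items() if len(stances) > 1)


web = build_perspective_web(SAMPLE_STATEMENTS)
for claim in contested_claims(web):
    for stance, sources in sorted(web[claim].items()):
        print(claim, stance, sorted(sources))
```

Grouping by claim rather than by source puts both sides of a disagreement next to each other, so the points where sources diverge can be listed without a human reader having to pick a side first.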