How SingularityNET is Advancing Unsupervised Language Learning

Encouraging preliminary results will allow us to move on to the larger corpora available for English and other languages.

Note: This is the first technical post in our new AI Research Lab publication where we will be chronicling the work of SingularityNET’s AI Team. The SingularityNET Research Lab will give you insider access to the most exciting AI initiatives we have under development. When launched, these services will only be available through the SingularityNET platform. To chat directly with our team and community, visit the SingularityNET Community Forum.

Bridging the Gap: Allowing AI Services to Comprehend & Interact in Human Language

For many AI services, it is critical to be able to comprehend human language and even converse in it with human users. So far, advances in natural language processing (NLP) powered by “sub-symbolic” machine learning based on deep neural networks allow us to solve multiple tasks like machine translation, classification, and emotion recognition.

However, these approaches require an enormous amount of training data. Additionally, legal restrictions on particular applications are increasing due to recent regulations, making current solutions unviable there.

The ultimate goal for these industry initiatives is to allow humans and AI to interact fluently in a common language.

Effectively Using Human Language Still Has Major Problems

One of the reasons why approaches based on formal grammars and ontologies are still not widely used is that the grammar and ontology have to be created and maintained manually by humans. For a practical application in a given domain and language (bioinformatics in English, for example), it may take 10+ years to create all the grammatical rules and ontology terms and relationships, and ongoing human effort is needed to maintain them.

To solve this problem, SingularityNET has been working on a solution for Unsupervised Language Learning, which would make it possible to take a large unannotated corpus of text and infer the grammar used in this corpus and the basics of the ontology underlying the grammar. Our advances could dramatically reduce the need for data analysts and computational linguists to create and maintain ontologies for novel languages, and for professional dialects and jargons of known languages.

Put simply, SingularityNET is working on bridging the divide between humans and AI, allowing each to fluently understand and converse in a common language.

Diving Deeper into the Current Challenges & SingularityNET’s Solution

1. Current Challenges

Let’s look at two examples that demonstrate some common issues with language processing.

While the formal Link Grammar makes it possible to parse literary English text, it fails on professional jargons and on simplified versions of English, such as the pidgins used in chat rooms and international forums. As another example, many languages, such as Sinhalese, have many native speakers but are not well covered by a computable formal grammar such as Link Grammar.

Learn more about the current challenges associated with Supervised vs. Unsupervised Learning

These examples show why Unsupervised Language Learning (ULL) would simplify text processing work for humans, and why it would help AI systems that need to capture human languages to serve humans better. However, major advances are still needed to make unsupervised learning effective.

2. SingularityNET’s Solution

To make ULL feasible, the SingularityNET AI team has begun by leveraging the NLP and Language Learning components of OpenCog. Our plan involves employing an incremental learning approach and the notion of a “Baby Turing Test” mentioned in our earlier publication.

Below is the current view of our ULL pipeline and its components.

Components of SingularityNET’s ULL System

Text Pre-Cleaner: Preprocesses corpus files with configurable cleanup and normalisation options.

Sense Pre-Disambiguator: Optional component which performs word disambiguation and builds senses from tokens.

Text Parser: Parses sentences of word tokens or senses with one of several possible approaches, such as a Minimum-Spanning Tree (MST) Parsing approach based on mutual information computed from words co-occurring in the corpus (see MST Parser for reference).

Grammar Learner: The key component, which learns word categories from parses and infers a grammar in Link Grammar format.

Tester/Evaluator: Evaluates the quality of the inferred grammar based on a number of parameters, such as “parse-ability” as a percentage of successfully parsed words per sentence in the corpus, and “parse-quality” as a percentage of correctly parsed links in the parse tree produced with the Link Grammar Parser, relying on the grammar learned earlier.
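To make the Text Parser step more concrete, here is a minimal sketch of MST-style parsing in Python. It is not the project's parser: the function name mst_parse, the toy scores in toy_mi, and the use of Prim's algorithm are assumptions made for illustration, and the sketch does not enforce constraints such as forbidding crossing links.

```python
def mst_parse(words, score):
    """Return links (i, j), i < j, forming a maximum spanning tree over word positions.

    score(left_word, right_word) -> float; higher means a stronger association.
    Uses Prim's algorithm: grow the tree by always adding the strongest link
    that connects a new word to the words already in the tree.
    """
    if not words:
        return []
    in_tree = {0}
    links = []
    while len(in_tree) < len(words):
        best = None
        for i in in_tree:
            for j in range(len(words)):
                if j in in_tree:
                    continue
                a, b = min(i, j), max(i, j)
                s = score(words[a], words[b])
                if best is None or s > best[0]:
                    best = (s, a, b, j)
        _, a, b, new_position = best
        links.append((a, b))
        in_tree.add(new_position)
    return sorted(links)


# Hypothetical mutual-information scores for one sentence of the Turtle corpus.
toy_mi = {("tuna", "isa"): 1.2, ("isa", "fish"): 1.7, ("tuna", "fish"): 0.4}
links = mst_parse(
    ["tuna", "isa", "fish"],
    lambda left, right: toy_mi.get((left, right), float("-inf")),
)
```

With these toy scores the weak "tuna"-"fish" pair is left out, and the tree links "tuna" to "isa" and "isa" to "fish".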

Testing SingularityNET’s NLP Solution

For initial experiments, we created two simplistic corpora as a Proof-of-Concept: one in Turtle (a language used for Semantic Web programming) and one in plain English.

Turtle:

tuna isa fish.

herring isa fish.

tuna has fin.

herring has fin.

parrot isa bird.

eagle isa bird.

parrot has wing.

eagle has wing.

fin isa extremity.

wing isa extremity.

fin has scale.

wing has feather.

English:

A mom is a human.

A dad is a human.

A mom is a parent.

A dad is a parent.

A son is a child.

A daughter is a child.

A son is a human.

A daughter is a human.

A mom likes cake.

A daughter likes cake.

A son likes sausage.

A dad likes sausage.

Cake is a food.

Sausage is a food.

Mom is a human now.

Dad is a human now.

Mom is a parent now.

Dad is a parent now.

Son is a child now.

Daughter is a child now.

Son is a human now.

Daughter is a human now.

Mom likes cake now.

Daughter likes cake now.

Son likes sausage now.

Dad likes sausage now.

Cake is a food now.

Sausage is a food now.

Mom was a daughter before.

Dad was a son before.

Mom was not a parent before.

Dad was not a parent before.

Mom liked cake before.

Dad liked sausage before.

Cake was a food before.

Sausage was a food before.

For these two simple “closed world” corpora, we were able to identify key grammatical and semantic categories as well as infer formal grammars programmatically. The inferred grammars made it possible to reach 100% parse-ability and parse-quality for the Turtle language. For English, the best parse-ability achieved turned out to be 97% and the best parse-quality 64%.
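As a rough illustration of how the two metrics could be computed, the sketch below follows the definitions given in the component list. The function names, the per-sentence averaging for parse-ability, and the choice of expected links as the denominator for parse-quality are assumptions, not the project's actual evaluator.

```python
def parse_ability(parsed_word_counts, total_word_counts):
    """Average, over sentences, of the share of words the parser linked (percent)."""
    ratios = [
        parsed / total
        for parsed, total in zip(parsed_word_counts, total_word_counts)
        if total > 0
    ]
    return 100.0 * sum(ratios) / len(ratios) if ratios else 0.0


def parse_quality(produced_parses, expected_parses):
    """Share of expected links that the produced parses reproduce (percent).

    Each parse is a collection of (left_position, right_position) links.
    """
    matched = 0
    expected_total = 0
    for produced, expected in zip(produced_parses, expected_parses):
        matched += len(set(produced) & set(expected))
        expected_total += len(set(expected))
    return 100.0 * matched / expected_total if expected_total else 0.0


# Two hypothetical sentences: the second has one unparsed word and one wrong link.
ability = parse_ability([3, 3], [3, 4])
quality = parse_quality(
    [[(0, 1), (1, 2)], [(0, 1), (0, 2), (2, 3)]],
    [[(0, 1), (1, 2)], [(0, 1), (1, 2), (2, 3)]],
)
```

In this made-up example, ability is 87.5 and quality is 80.0.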

The picture above renders grammatical categories of words in two possible vector spaces:

Space of neighbor words adjacent to the left or the right in the parse tree.

Space of Link Grammar disjuncts built of conjunctions of these neighbors.

The left part of the picture represents grammatical and semantic categories for Turtle and the right part for English.
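The two spaces can be illustrated with a small Python sketch that turns parses into per-word feature counts. The function name collect_features and the data layout are assumptions. The connector labels follow the Link Grammar convention of "-" for a link to the left and "+" for a link to the right, with a simplified ordering of connectors inside a disjunct.

```python
from collections import Counter, defaultdict


def collect_features(parses):
    """Build neighbor and disjunct feature counts for every word.

    parses: iterable of (words, links), where each link is (i, j) with i < j.
    """
    neighbors = defaultdict(Counter)
    disjuncts = defaultdict(Counter)
    for words, links in parses:
        connectors = defaultdict(list)
        for a, b in links:
            connectors[a].append((b, words[b] + "+"))  # a links to its right
            connectors[b].append((a, words[a] + "-"))  # b links to its left
        for position, found in connectors.items():
            ordered = [label for _, label in sorted(found)]
            # Neighbor space: each linked word is a separate feature.
            neighbors[words[position]].update(ordered)
            # Disjunct space: the conjunction of all connectors is one feature.
            disjuncts[words[position]][" & ".join(ordered)] += 1
    return neighbors, disjuncts


parses = [
    (["tuna", "isa", "fish"], [(0, 1), (1, 2)]),
    (["herring", "isa", "fish"], [(0, 1), (1, 2)]),
]
neighbors, disjuncts = collect_features(parses)
```

Here "tuna" and "herring" end up with identical disjunct counts (a single "isa+"), which is the kind of similarity a category learner can use to group them.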

The following picture gives an example of Link Grammar rules for the entire English corpus learned automatically (on the left) and some partial parses of the original corpus created with Link Grammar parser loaded with this grammar.

Among the different settings and options of our ULL pipeline that we have tried, the best quality grammar was produced with the following settings:

Text Parser: For mutual information counting, use word pairs co-occurring in the same sentence with word-to-word link strength linearly decaying with distance.

Grammar Learner: Use disjuncts for building vector space for unsupervised category learning as well as for grammar induction.

With this setup, given MST parses that match 67% of the expected English parses, we can learn a grammar that provides parses matching 66% of the expected English parses. For Turtle, both numbers are 100%.
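The mutual information counting setting can be sketched as follows. The exact decay formula, the max_distance parameter, and the function names are assumptions chosen for illustration. The text only states that link strength decays linearly with distance.

```python
import math
from collections import Counter


def weighted_pair_counts(sentences, max_distance=4):
    """Count ordered word pairs within a sentence, weighted by distance."""
    pair_counts = Counter()
    for sentence in sentences:
        words = sentence.lower().rstrip(".").split()
        for i, left in enumerate(words):
            for j in range(i + 1, len(words)):
                distance = j - i
                # Assumed linear decay: 1.0 for adjacent words, 0 beyond max_distance.
                weight = max(0.0, 1.0 - (distance - 1) / max_distance)
                if weight > 0:
                    pair_counts[(left, words[j])] += weight
    return pair_counts


def pointwise_mutual_information(pair_counts):
    """PMI in bits for every observed pair, from weighted counts."""
    total = sum(pair_counts.values())
    left_totals = Counter()
    right_totals = Counter()
    for (left, right), count in pair_counts.items():
        left_totals[left] += count
        right_totals[right] += count
    return {
        (left, right): math.log2(
            count * total / (left_totals[left] * right_totals[right])
        )
        for (left, right), count in pair_counts.items()
    }


corpus = ["tuna isa fish.", "herring isa fish.", "tuna has fin.", "herring has fin."]
counts = weighted_pair_counts(corpus)
mi = pointwise_mutual_information(counts)
```

Scores like mi[("isa", "fish")] could then serve as the link strengths that an MST parser maximizes over each sentence.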

The results mentioned above are very preliminary, and we are now working toward using the larger corpora available for English and other languages.

Note: All of SingularityNET’s AI initiatives are, or will be, open source, as are OpenCog and our main network (SingularityNET). The Language Learning project is not yet mature enough for forks and pull requests from a wide audience, but we are tentatively planning to make it available for wider use in the second half of 2018.

How Can You Get Involved?

While our AI Research Lab gives you inside access into our AI initiatives, we’re not done yet!

In May, we’ll be introducing a Community Forum, allowing you to chat directly with our AI team, as well as developers and researchers from around the world. We’re excited to share more details in the coming weeks.

You can message our Developer Marketing Associate Ibby Benali (@ibbybenali) via Telegram with any feedback.

Stay Tuned!

Over the coming weeks, we’ll have plenty of exciting content hitting our AI Research Lab publication, including: