Bridging the Language Divide

SingularityNET continues to make advances in Unsupervised Language Learning.

The Importance of Unsupervised Language Learning

In the very first research article published by our AI Research Lab, we discussed how SingularityNET plans to advance unsupervised language learning by leveraging the NLP and Language Learning components of OpenCog.

In this article, we will outline the latest advances in SingularityNET’s Unsupervised Language Learning project, which is backed by the OpenCog Foundation and supported by Hanson Robotics.

The ability to perform unsupervised language acquisition is one of the critical capabilities of an intelligent being. For instance, let’s assume that the goals of evolution are to create, correct, and transfer information efficiently. With a single, universal, hard-coded genetic language, creating, modifying, or transmitting a message at the biological level would require a generational change.

It is because an intelligent being can acquire languages during its lifetime that messages can be transferred at the speed of light in their diverse natural and artificial forms through human and computer languages.

The goal of the unsupervised language learning project is to let any artificial intelligence system learn the grammar and the basics of the ontology of any language programmatically, without human supervision.

That is, the software should be able to identify word categories as parts of speech, determine the linguistic relations between the categories and figure out different senses of the words — all without any corpora tagging or human feedback.

Put simply, SingularityNET is working on bridging the language divide between humans and AI, allowing each to fluently understand and converse in a common language.

Learning the Grammar and Ontology

In this article, we restrict ourselves to the task of only learning the language grammar and basics of ontology. Although the grammar and ontology can be used for language comprehension and production, that is beyond the scope of our current work.

We will also not consider reinforcing the learning process through supervised learning with tagged or marked-up corpora, or through experiential feedback from the learner’s environment. In our research, we only anticipate some instances of system self-reinforcement. Our goal, therefore, is to achieve completely unsupervised learning.

Our main goal is programmatic grammar learning from scratch. Such functionality will allow for many useful applications. Besides creating formal grammars for new and old languages not yet studied by computational linguists, we will be able to extend or customize existing grammars for specific domains and jargons. We will also be able to create natural language processing applications such as building dictionaries and patterns, or producing text parses. Finally, we will be able to go beyond simplistic spell-checking, as accurate grammar checking will become possible for any language.

At this stage of the project, we constrain ourselves with a few assumptions. First, we use controlled corpora: selected volumes of text are used for each specific language, and the data is cleaned, e.g., punctuation is normalized and specific markup removed. Second, we rely on the “Link Grammar” and “Minimum Spanning Tree Parsing” formalisms, which are discussed later. Third, we do not consider morphology, so every English or Russian word is treated as an indivisible token, much like a single Chinese character. Fourth, we self-reinforce the evolutionary search for the best combination of hyper-parameters with what we call parse-ability. Finally, we test results on the same corpus used for learning, so any grammatical construction not represented in the corpus may be considered void.

The use of the Link Grammar formalism for describing target grammars for language learning is justified by the work of Ben Goertzel and Linas Vepstas in 2014. According to that paper, each word category or part of speech can be described by a set of possible combinations of connections to other parts of speech. Each combination is called a “disjunct.” A disjunct may be used to bind any of the words in the category to the specific grammatical context in a sentence. Each connection in a disjunct that links the category to other categories is called a “connector.”
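The relationship between categories, connectors, and disjuncts can be sketched with a toy lexicon. The category names, words, and connector labels below are hypothetical, chosen only to illustrate the idea; "+" marks a connection to the right and "-" a connection to the left, following Link Grammar convention:

```python
# Toy illustration of word categories, connectors, and disjuncts.
# The lexicon and connector labels here are hypothetical, not a real grammar.

# A disjunct is one admissible combination of connectors for a category.
CATEGORIES = {
    "determiner": {"words": ["a", "the"], "disjuncts": [("D+",)]},
    "noun":       {"words": ["snake", "dog"], "disjuncts": [("D-", "S+")]},
    "verb":       {"words": ["bites", "sleeps"], "disjuncts": [("S-",)]},
}

def category_of(word):
    """Return the name of the category whose word list contains `word`."""
    for name, category in CATEGORIES.items():
        if word in category["words"]:
            return name
    return None

# In "a snake bites", the determiner's D+ matches the noun's D-,
# and the noun's S+ matches the verb's S-.
print([category_of(w) for w in "a snake bites".split()])
# → ['determiner', 'noun', 'verb']
```

Binding a word to a sentence position then amounts to finding a disjunct of its category whose connectors can all be matched by neighboring words.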

Utilizing the Minimum Spanning Tree Parser

For unsupervised learning to start, we use the Minimum Spanning Tree (MST) parser instead of raw text. Unlike the Link Grammar Parser, the MST parser does not rely on a known grammar to produce a parse. So, before parsing, the input texts are analyzed and the mutual information between words is computed. The parser then uses the learned mutual information between the words in a sentence to create parses and word-to-word linkages, building minimum spanning trees that maximize the amount of mutual information across all links. For instance, if the words “a” and “snake” appear together in the corpus often enough, they will likely be linked together in an MST parse.
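The two phases described above can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions, not the project's actual implementation: the toy corpus is invented, pointwise mutual information is estimated from sentence-level co-occurrence counts, and the spanning tree is built with a Kruskal-style greedy pass that maximizes total mutual information:

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus for estimating word-pair mutual information.
corpus = [
    "a snake bites a dog",
    "a dog sees a snake",
    "a snake sleeps",
]

word_counts = Counter()
pair_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    word_counts.update(words)
    # Count unordered co-occurrences of distinct words within a sentence.
    pair_counts.update(frozenset(p) for p in combinations(words, 2) if p[0] != p[1])

total_words = sum(word_counts.values())
total_pairs = sum(pair_counts.values())

def pmi(w1, w2):
    """Pointwise mutual information of two words; -inf if never co-occurring."""
    pair = pair_counts.get(frozenset((w1, w2)), 0)
    if pair == 0:
        return float("-inf")
    p_pair = pair / total_pairs
    p1 = word_counts[w1] / total_words
    p2 = word_counts[w2] / total_words
    return math.log(p_pair / (p1 * p2))

def mst_parse(sentence):
    """Greedy (Kruskal-style) spanning tree maximizing total PMI over links."""
    words = sentence.split()
    edges = sorted(
        ((pmi(words[i], words[j]), i, j)
         for i, j in combinations(range(len(words)), 2)),
        reverse=True,
    )
    parent = list(range(len(words)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    links = []
    for score, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj and score > float("-inf"):
            parent[ri] = rj
            links.append((words[i], words[j]))
    return links

print(mst_parse("a snake bites"))
```

A spanning tree over an n-word sentence has at most n - 1 links, so the three-word sentence above yields two links, chosen for their high mutual information in the toy corpus.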

This way we can create parses for the text corpora contextualized to specific industry domains or jargons and dialects, which might not always be represented with formal English grammar. For instance, we can create reasonable parses for corpora of texts used in modern social media, forums, chat rooms and SMS texts. Later on, we may use these parses for grammar learning.

Before grammar learning can start, our pipeline performs two extra steps.

The first step is the configurable text Pre-Cleaner, intended to remove markup and special characters and to normalize punctuation and white space. After the pre-cleaner, there is a word sense disambiguation phase, which analyses the cleaned text and rewrites potentially ambiguous words according to their different senses, so the MST Parser handles specific word senses, not just word tokens.
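A minimal sketch of such a configurable pre-cleaner is shown below. The specific rules are hypothetical examples of the kind of normalization described, expressed as an ordered list of regex substitutions:

```python
import re

# A minimal sketch of a configurable pre-cleaner; the rule set is a
# hypothetical example, not the project's actual configuration.
RULES = [
    (re.compile(r"<[^>]+>"), " "),   # drop HTML-like markup
    (re.compile(r"[“”]"), '"'),      # normalize curly double quotes
    (re.compile(r"[‘’]"), "'"),      # normalize curly single quotes
    (re.compile(r"\s+"), " "),       # collapse runs of whitespace
]

def pre_clean(text):
    """Apply each substitution rule in order, then trim the result."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text.strip()

print(pre_clean("<b>mom</b>  said:  “hello”"))
# → mom said: "hello"
```

Keeping the rules in a plain data structure makes the cleaner configurable per corpus: a social-media corpus and a literary corpus can load different rule lists without code changes.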

The second step involves the Text Parser component, which may be configured as either a plain Link Grammar Parser (used for experiments) or an MST Parser (used for production studies). In the future, there could also be a so-called Hybrid Parser, which would use the two at the same time: the Link Grammar formalism for words matching a given grammar and the MST Parser for unknown words. After the text parser creates the parses, they are directed to the Grammar Learner, which creates the target grammar. Finally, the grammar is evaluated by the Grammar Tester, which compares parses produced with the learned grammar against “expected parses.” These expected parses are either created manually or produced by the Link Grammar Parser using a manually created grammar.

The Grammar Learner

We believe the Grammar Learner component requires some further explanation. Its internal pipeline has multiple stages, each of them configurable with multiple options.

The vector space can be created with dimensions represented by either words or “connectors” or “disjuncts,” with connectors and disjuncts providing very interesting results, which we will discuss later.

Dimension reduction of the vector space is optional; we currently perform it with Singular Value Decomposition (SVD), but other options are also possible. Initial word sense categories can be created in the vector space with different algorithms. Currently, we employ K-means and a custom aggregation of Identical Lexical Entries (words sharing the same sets of disjuncts), but more algorithms are being tried.
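The vector-space construction, SVD reduction, and Identical Lexical Entries aggregation can be sketched on a toy lexicon. The words and disjuncts below are hypothetical, and the code is an illustrative simplification rather than the project's pipeline:

```python
import numpy as np
from collections import defaultdict

# Hypothetical word -> set-of-disjuncts lexicon (each disjunct is a tuple
# of connectors), invented for illustration only.
lexicon = {
    "mom":   {("D-", "S+")},
    "dad":   {("D-", "S+")},
    "saw":   {("S-", "O+")},
    "liked": {("S-", "O+")},
    "a":     {("D+",)},
}

# Build the word-by-disjunct occurrence matrix (disjuncts as dimensions).
disjuncts = sorted({d for ds in lexicon.values() for d in ds})
words = sorted(lexicon)
matrix = np.array([[1.0 if d in lexicon[w] else 0.0 for d in disjuncts]
                   for w in words])

# Truncated SVD: keep the first two components as reduced coordinates.
u, s, _ = np.linalg.svd(matrix, full_matrices=False)
reduced = u[:, :2] * s[:2]

# Identical Lexical Entries: group words with identical disjunct sets.
groups = defaultdict(list)
for w in words:
    groups[frozenset(lexicon[w])].append(w)

print(sorted(sorted(g) for g in groups.values()))
# → [['a'], ['dad', 'mom'], ['liked', 'saw']]
```

Even in this tiny example, words used in the same grammatical contexts ("mom"/"dad", "saw"/"liked") collapse into the same category, which is the behavior the clustering stage relies on.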

Learned categories may be further aggregated into a category tree in an optional generalization phase, after which Grammar Induction takes place to build grammatical links between the word categories. Then, an extra optional generalization step may be applied to generalize categories on the basis of their links to other categories. Finally, the learning results are exported to a standard Link Grammar dictionary file, with categories saved as rules and links saved as disjuncts of connectors associated with these rules.
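The export step can be sketched as follows. The rule syntax below imitates Link Grammar's "words: disjuncts;" dictionary format, and both the categories and the connector labels are hypothetical:

```python
# A minimal sketch of exporting learned categories as Link Grammar-style
# rules. Categories and connector labels are hypothetical examples.
categories = {
    ("mom", "dad"):   ["D- & S+"],
    ("saw", "liked"): ["S- & O+"],
    ("a", "the"):     ["D+"],
}

def export_dict(categories):
    """Render each category as one dictionary rule line."""
    lines = []
    for words, disjuncts in categories.items():
        word_list = " ".join('"%s"' % w for w in words)
        rule = " or ".join("(%s)" % d for d in disjuncts)
        lines.append("%s: %s;" % (word_list, rule))
    return "\n".join(lines)

print(export_dict(categories))
```

Each learned category becomes one rule, and its links to other categories become the disjuncts of connectors on the right-hand side.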

It should be noted that future versions of the Grammar Learning stage could run iteratively. With the grammar learned during previous iterations re-used, classification would take place before clustering, so only non-classified word senses would undergo the clustering procedure.

The Results

The image below shows the results for word sense disambiguation. For a sentence like “mom saw dad with a saw,” both instances of the word “saw” are properly disambiguated. For quality assessment, we used a “gold standard” corpus with manually created disambiguation, and then ran an evolutionary search over different configurations of hyper-parameters, targeting a fitness function that maximizes the “Disambiguation” metric while minimizing the “Over-disambiguation” metric. The trade-off between the two was clear: the best disambiguation results came with a higher level of over-disambiguation.

In the results displayed in the image below, we have used different languages: Turtle, an artificial language used in “Semantic Web” programming, and several versions of English, ranging from child-directed speech to literary English limited to children-level literature.

Our proof-of-concept (POC) corpora were small, created manually to contain a balanced mix of words within a restricted lexicon. We have also used different metrics to assess the quality of the grammar:

Parse-Ability (PA) — indicating the percentage of successfully parsed words in the corpus.

Parse-Quality (PQ) — indicating the percentage of parsed links matching the links in “expected parses,” corresponding to the recall metric.

For precision, we have used parse-quality divided by parse-ability. In the end, no matter which corpora we used, we were able to achieve high metrics, even for the relatively complex “Gutenberg Children Books” collection, and reached much higher metrics for manually prepared corpora.
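The three metrics can be made concrete with a small sketch. Representing links as pairs of word indices is an assumption made for illustration; the numbers are invented:

```python
# A minimal sketch of the evaluation metrics described above. Links are
# represented as word-index pairs, an assumption made for illustration.

def parse_metrics(parsed_links, expected_links, n_words, n_parsed_words):
    """Parse-ability, parse-quality (recall), and precision = PQ / PA."""
    pa = n_parsed_words / n_words
    pq = len(set(parsed_links) & set(expected_links)) / len(expected_links)
    precision = pq / pa if pa else 0.0
    return pa, pq, precision

# Hypothetical parse of a 6-word sentence: 5 of 6 words were covered by
# the parse, and 3 of the 4 expected links were recovered.
expected = [(0, 1), (1, 2), (2, 4), (4, 5)]
parsed = [(0, 1), (1, 2), (2, 4), (3, 5)]
pa, pq, precision = parse_metrics(parsed, expected, n_words=6, n_parsed_words=5)
print(round(pa, 3), round(pq, 3), round(precision, 3))
# → 0.833 0.75 0.9
```

Note that dividing recall (PQ) by parse-ability (PA) rewards grammars whose parses are correct relative to how much of the text they managed to parse at all.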

The intriguing part of our results is that using MST parses provided more stable results, with all metrics in the 50–100% range. In contrast, using “expected” parses created manually or with a proper Link Grammar dictionary showed less predictable results, with parse-ability and parse-quality in the 15–100% range. We can tentatively explain this by our current inability to learn grammatical exceptions, which are absent from MST parses but present in parses created manually or with the help of the Link Grammar Parser. Hence, the more self-consistent MST Parser can provide a more regular vector space, enabling better regular grammars.

The other result is even more fascinating, as it displays a very high correlation between parse-ability and parse-quality. That means we can assume that the extent to which we can parse the text successfully corresponds to the actual correctness of the parses. Therefore, if we can confirm that correspondence for the other corpora and languages in the future, we could use the parse-ability as a fitness function for evolutionary search in the space of hyper-parameters for our algorithms, without the need for “expected” parses to evaluate parse-quality.

In the image below, the potential to identify grammatical and semantic categories can be seen in the charts for words arranged in the first two dimensions of the reduced vector spaces of connectors and disjuncts, both for the proof-of-concept Turtle and English corpora. With a rich imagination, we may be able to identify grammatical categories such as the verbs “like, liked, was” as well as semantic categories such as “dad, mom” or “daughter, son.” Speaking cautiously, we might be able to identify categories of words used in similar linguistic contexts.

Moreover, the generalized categories may be rendered in a tree, with all levels of generalization retained, so that, for example, “child, mom, daughter, son” are assembled in the same category. It can also be seen how the homonyms “board” and “saw,” disambiguated at the word-sense disambiguation phase, are split and attached to different categories corresponding to their grammatical and semantic use: for instance, “board.b” is associated with “wood” while “board.a” (as in “white board”) stands alone, and “saw.a” shares a category with “hammer” while “saw.b” sticks to “see,” as expected for a verb.

Another pleasant observation was the pronounced inverse relationship between the number of discovered categories and both parse-ability and parse-quality.

Generally, the quality of the final distribution seems to be highly dependent on the initial random seed when using K-means clustering. This explains why the quality of grammar with a large number of categories seems to be difficult to predict.

However, if the initial seed permits the collapse of clusters such that the number of categories is closer to the number of natural parts of speech, both parse-ability and parse-quality increase dramatically — close to 100% and 60–70%, respectively.

As expected, grammar learning algorithms based on connectors and on disjuncts provided different results. Using connectors resulted in relaxed grammars, recognizing words in more generalized contexts and providing better parse-ability and parse-quality (recall). On the other hand, using disjuncts assured higher precision, with rules enforcing grammatical contexts more strictly.

The video of our presentation at the AGI-2018 conference is available here:

What’s next?

In the future, we will continue our work by implementing incremental probabilistic assessment of parses, clustering, and grammar induction, and by gradually increasing the volume and complexity of the corpora based on statistical measures of the data in them.

Moreover, we will be fine-tuning MST-parsing parameters and considering other sources of parses. The self-reinforcement procedure may also be amended to avoid overfitting due to the use of the same corpus for training and testing.

All of our code, which is currently open source under the MIT license, will be made publicly available as a command-line tool, web service, and suite of SingularityNET agents. We invite you to check out our progress at the Unsupervised Language Learning Demo and Data Site and at our Github.

You can visit our Community Forum to chat about the research mentioned in this post. Over the coming weeks, we hope to not only provide you with more insider access to SingularityNET’s groundbreaking AI research but also to share with you the specifics of our development.

For any additional information, please refer to our roadmaps and subscribe to our newsletter to stay informed regarding all of our developments.