By Jake Neely

At Forge.AI, we capture events from unstructured data and represent them in a manner suitable for machine learning, decision making, and other algorithmic tasks for our customers (for a broad technical overview, see this blog post). To do this, we employ a suite of state-of-the-art machine learning and natural language understanding technologies, many of which are supervised learning systems. For our business to scale aggressively, we need an economically viable way to acquire training data quickly for those supervised learners. We use natural language generation to do just that, supplementing human annotations with annotated synthetic language in an agile fashion.

Background

The cost and time required to obtain quality training datasets for supervised learning are a chief bottleneck in the deployability of machine learning. To minimize those impacts, much of the work in the broader research community has focused on reducing the inherent training data requirements of the model itself. This has proven to be effective in a variety of models and use cases (e.g. the success of attention networks for document classification 1), but the difficulty of implementation and the fairly narrow focus on benchmarks suggest that a complementary approach, tackling the problem from the direction of the training data itself, could be fruitful.

There is precedent for augmenting supervised learning with artificially expanded training datasets. In image and speech recognition this is called training data augmentation, and it has been very successful 2 3. This is because these inputs live in natural metric spaces with transformations, such as image rotation or changes in the speed or tone of speech, to which a model should be invariant. In natural language, however, no such intrinsic space exists. This makes the task of expanding a text training corpus considerably more complex. Below are three examples of research aimed at addressing training data scarcity in natural language:

1. Thesaurus inflation 4: probabilistically replace tokens with their synonyms weighted by the distance of the synonyms to the original meanings

2. Data counterfeiting 5: delexicalize all annotated slots and values for each annotated utterance, randomly replace delexicalized slots with similar slots under some clustering, and yield a set of pseudo annotated utterances, e.g. a basic subject-object utterance would be turned into the pseudo annotation:

“SLOT_SUBJECT called SLOT_OBJECT yesterday afternoon to talk about work”

3. Weak supervision 6: approximately and noisily annotate documents to produce low-quality training data coupled with associated scores, then use those scores within an appropriate loss function to enable “noise-aware” training. For a great state-of-the-art framework for weak supervision, look at data programming 7 8.

These approaches inflate existing training corpora in analogy to data augmentation and do not rely on heavy linguistic modeling. They are shown to be effective at domain-constrained use cases ranging from text classification to spoken dialogue systems. For our use case, we need to be able to rapidly expand our event coverage to new domains and believe that deep linguistic modeling without domain-specific rules is a powerful and effective way to do that.
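To make the first two approaches in the list above concrete, here is a minimal Python sketch. It is an illustration only, not the method of the cited papers or of our system: the `THESAURUS` table, its weights, and the function names `thesaurus_inflate` and `delexicalize` are all hypothetical.

```python
import random

# Hypothetical toy thesaurus: word -> (synonym, weight) pairs, where a
# larger weight stands for a synonym that is closer to the original meaning.
THESAURUS = {
    "called": [("phoned", 0.7), ("rang", 0.3)],
    "talk": [("speak", 0.6), ("chat", 0.4)],
}


def thesaurus_inflate(tokens, replace_prob, rng):
    """Replace each token that has synonyms with probability replace_prob.

    The replacement is drawn from the token's synonyms in proportion
    to their weights; tokens without synonyms are copied unchanged.
    """
    out = []
    for tok in tokens:
        candidates = THESAURUS.get(tok)
        if candidates and rng.random() < replace_prob:
            words, weights = zip(*candidates)
            out.append(rng.choices(words, weights=weights, k=1)[0])
        else:
            out.append(tok)
    return out


def delexicalize(tokens, slots):
    """Replace annotated slot values with SLOT_<NAME> placeholders.

    slots maps a slot name (e.g. "SUBJECT") to the token that fills it.
    """
    value_to_slot = {value: f"SLOT_{name}" for name, value in slots.items()}
    return [value_to_slot.get(tok, tok) for tok in tokens]


if __name__ == "__main__":
    rng = random.Random(0)
    utterance = "Alice called Bob yesterday afternoon to talk about work".split()
    print(" ".join(thesaurus_inflate(utterance, 0.5, rng)))
    print(" ".join(delexicalize(utterance, {"SUBJECT": "Alice", "OBJECT": "Bob"})))
```

The second call reproduces the shape of the pseudo annotation shown in the list. The random replacement of delexicalized slots with similar slots, which data counterfeiting also performs, is left out here because it depends on a slot clustering that this sketch does not model.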

Challenges

It is easy to fall into a chicken-and-egg dilemma with training data generation: it is difficult to design these systems such that they themselves do not require large quantities of training data. As a result, construction of a necessarily robust and domain-agnostic system can end up going so far as to solve the very problem reserved for the supervised learner. Domain-specific rule sets, grammars, and constraints on the supervised learning models are ways to get around this, but can sometimes result in over-specialized and rigid systems.

A second key challenge to training data generation is controlling the bias induced on a supervised learning system. Approaches to data generation must produce a varied enough corpus that the supervised model isn’t just learning the generative model of training data. It is difficult to define what is effective and sufficient for a training corpus. Later, we will discuss a few paths of inquiry into intrinsic properties of training corpora that may be predictive of training behavior. These properties can be a measure of sufficient variability that is not dependent on the resulting performance of the trained supervised model.

How We Are Designing The System

The primary design constraints of our system are that it must be robust and easily extensible to new events and semantic structures; this means that we cannot rely on hand-coded domain-specific rules or language. We are designing the system such that everything should be learned from the exemplar data, auxiliary data sources, and a priori knowledge of English language constructs.

Our system architecture takes some inspiration from traditional natural language generation systems 9. The initial design is split into three key components that we are working to prove out (see Figure 1):

1. Grammatical model: This module is in charge of learning grammar characteristic of the event in question. The model itself is initially being derived from a probabilistic context-free grammar (PCFG). Generic English treebanks are used to initialize the grammar and then human annotations are used for specialization/refinement. We use classified non-annotated data to refine the domain-specific production probabilities further.

2. Semantic planning: Here we consume the semantic event frame and any available auxiliary data sources in conjunction with the learned grammar to plan a natural language expression, i.e. to decide on the semantically relevant roles and tokens to include in an expression. We are building this out initially as a probabilistic graphical model.

3. Surface realizer: The job of this module is to take a semantic plan and make it grammatically correct. This is the component in charge of inflection, tense, voice, ordering, plurality, etc. It is likely that any version of such a system will include this module, but we are investigating different and possibly novel approaches that are more discriminative than generative. An example of a high-quality surface realizer is OpenCCG, which is based on Combinatory Categorial Grammar (CCG). Since CCGs are constituency-based and therefore a kind of phrase structure grammar (like a PCFG), this realizer shares some convenient similarities with our grammatical model.

Figure 1. Architectural plan of Forge.AI’s natural language generation system.
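As a rough illustration of the kind of model behind the grammatical component, the sketch below samples sentences from a tiny hand-written probabilistic context-free grammar. The grammar, its probabilities, and the `sample` function are all hypothetical; in the design described above the productions and their probabilities would be learned from treebanks and annotations rather than written by hand.

```python
import random

# Hypothetical toy PCFG: nonterminal -> list of (right-hand side, probability).
# Any symbol that is not a key of this dict is treated as a terminal.
PCFG = {
    "S": [(("NP", "VP"), 1.0)],
    "NP": [(("Party X",), 0.5), (("Party Y",), 0.5)],
    "VP": [(("V", "NP"), 0.7), (("V", "NP", "PP"), 0.3)],
    "V": [(("sued",), 0.6), (("is suing",), 0.4)],
    "PP": [(("for", "AMOUNT"), 1.0)],
    "AMOUNT": [(("$50,000",), 1.0)],
}


def sample(symbol, rng):
    """Expand symbol top-down, choosing each production by its probability."""
    if symbol not in PCFG:
        return [symbol]  # terminal: emit as-is
    rules = PCFG[symbol]
    options = [rhs for rhs, _ in rules]
    probs = [p for _, p in rules]
    rhs = rng.choices(options, weights=probs, k=1)[0]
    tokens = []
    for child in rhs:
        tokens.extend(sample(child, rng))
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(" ".join(sample("S", rng)))
```

Refining the production probabilities on domain data, as described for the grammatical model, would amount to re-estimating the numbers in a table like `PCFG`. Nothing in this toy grammar stops the same party from appearing as both subject and object; deciding which roles go where is the job of the semantic planning component.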

We look forward to sharing changes and improvements in much greater depth as we prove out this technology.

An Illustrative Example

Imagine, as a customer, that you want to start receiving events about lawsuits. You can define a minimal structured representation of a lawsuit with four fields:

Plaintiff

Defendant

Damages amount

Date of filing

Figure 2. Example of an event annotation for the basic lawsuit event.

We call the collection of these fields the semantic event frame. With this frame in hand, we acquire a few hundred human-annotated examples of these events in unstructured natural language (see Figure 2 for an illustration of a human annotation). From there, we can combine the frame with the human annotations and any auxiliary data sources, internal or external (e.g. a knowledge base, a repository of filing dates), together with our natural language generation system. This produces a synthetic training corpus much larger than the few hundred examples we sourced from humans. Since our extraction technology is fairly sophisticated, it requires a good deal of training data. By reducing the number of human annotations we have to source, we can cut the training and deployment time for this new lawsuit event down by a sizable factor.

Here is a very simple example of a synthetically generated lawsuit event:

Party X on Tuesday filed a lawsuit against Party Y for $50,000.

And a slightly more complicated one:

Party X is suing Party Y. The lawsuit, which was filed Tuesday, is for total damages of $50,000.

Of course, real data is often more complicated: events are expressed over a span of more than a few sentences and sometimes the document expressing the event has a lot of less semantically relevant text. Our system is being built to produce documents that display such features.
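The four-field lawsuit frame and the first synthetic sentence above can be sketched in a few lines of Python. This is a deliberately simplified stand-in: it uses a single fixed template, whereas the system described in this post learns its grammar and plans its expressions. The names `LawsuitFrame`, `TEMPLATE`, and `realize` are hypothetical. What the sketch does show is why synthetic text comes with annotations for free: the generator knows where each field value lands in the output.

```python
from dataclasses import asdict, dataclass
from string import Formatter


@dataclass(frozen=True)
class LawsuitFrame:
    """Hypothetical container for the four-field semantic event frame."""
    plaintiff: str
    defendant: str
    damages_amount: str
    date_of_filing: str


# Hypothetical fixed template standing in for a learned generator.
TEMPLATE = (
    "{plaintiff} on {date_of_filing} filed a lawsuit "
    "against {defendant} for {damages_amount}."
)


def realize(frame):
    """Return (text, spans); spans maps each field to its (start, end) offsets."""
    values = asdict(frame)
    text = ""
    spans = {}
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion).
    for literal, field, _spec, _conv in Formatter().parse(TEMPLATE):
        text += literal
        if field is not None:
            start = len(text)
            text += values[field]
            spans[field] = (start, len(text))
    return text, spans


if __name__ == "__main__":
    frame = LawsuitFrame("Party X", "Party Y", "$50,000", "Tuesday")
    text, spans = realize(frame)
    print(text)
    for field, (start, end) in spans.items():
        print(field, "->", text[start:end])
```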

Results Thus Far

While our natural language generation system is only in its infancy, we are already getting very promising results. Figure 3, below, illustrates some examples of our system producing grammatical variations of a simple product recall event, with a focus on changing tense and voice. This is one of the features that enables more complicated language generation to produce structurally diverse documents.

Figure 3. Natural language generations expressing a recall event of the “Rice cake soup” by “Ottogi America.” These are simple examples of some forms of grammatical variation we are able to produce for a single event.

In Figure 4, we see an example of a much more linguistically complicated event. This generated text contains a good deal of variation, complex content and structure, and domain specificity. We are able to generate semantically relevant but role-implicit sentences and clauses (see, for example, the last sentence), and are able to describe this event as being related to a game despite that alias not existing in the event frame.

Figure 4. A product delay event generated by our system, showing more complexity.

Toward The Future

Our road toward highly robust and rich natural language generation is just beginning. Below are some advancements in our research horizon that we intend to pursue.

Improved Semantics Via Forge’s Knowledge Base

One of the most challenging aspects of training data generation is ensuring that the correct depth and breadth of semantics are captured. We are currently exploring path traversals on our budding knowledge base (see this blog post) as a means for re-ranking semantic plans. This holds the promise of enabling the generation of larger and more consistently relevant documents.

Furthering Domain Specificity Via Reinforcement Learning

Currently, domain specificity is learned from classified, non-annotated documents and a small number of human-annotated exemplars. We have been exploring reinforcement learning as a means to further refine domain-specific generated documents. We envision implementing this as a user-in-the-loop mode of interaction to speed up development time.

Support For Multiple Languages

As an initial proof of concept, we are exploring neural machine translation 10 approaches to translate from English to the target language at the end of the generation pipeline. This has the potential pitfall of amplifying the systematic error introduced in the grammatical models, so we will need to be mindful of that.

Relationship Between Training Behaviors and Intrinsic Properties of Training Corpora

Many industrial practitioners, especially those in resource-constrained organizations, have identified the limited applicability of many published approaches on the highly noisy and varied real world natural language training data faced in commercial use cases. In order to understand how to collect and produce training data reliably and sustainably, we must understand intrinsic properties of training corpora that predict supervised learning behavior from less than optimal training data.

We can think of these properties as fitting into three classes: lexical, syntactic, and semantic.

1. Lexical properties: Variability in word usage is an example of this kind of property. Normalized word frequency distributions are a straightforward measurement of it. We are exploring the cross-entropy between these distributions and some reference distribution, drawn from a sufficiently large general English corpus, as well as pointwise mutual information within a corpus. The latter has historically been considered a semantic measure, but we claim it is lexical and shallowly semantic, as some reasoning component, like a knowledge base, is required for deep semantic measurements.

2. Syntactic properties: Grammatical variability and complexity are the properties of interest here. We are researching metrics for comparing distributions over parse trees conditional on some trained parsing model. In analogy to population genetics, we would like to consider distributions over genealogies (ancestral trees) of unique tokens across corpora, where Kingman’s Coalescent 11 would serve as a suitable prior over genealogies (this has been used effectively in the other direction for hierarchical clustering 12). The Developmental Level Scale 13 is another potentially powerful metric that we are exploring to address syntactic complexity of training data.

3. Semantic properties: The meaning contained and expressed within a corpus is of key importance. One of the biggest challenges for both collecting and generating training data is confirming that you’ve captured a sufficient amount of the necessary semantic content. We claim that the best way to measure these properties is with a knowledge base. Imagining a corpus as a collection of expressed entities and relationships, we can infer a subgraph within the knowledge base for each corpus’ semantic content. Graph similarity between these subgraphs is then a natural measure for semantic similarity. Previous work has considered word similarity across corpora in the context of an external knowledge base, like WordNet 14.
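Two of the measurements described above are simple enough to sketch directly: the cross-entropy between normalized word frequency distributions, and a similarity between knowledge base subgraphs. The sketch below is an assumption-laden illustration, not our implementation. Add-alpha smoothing over a shared vocabulary is one way to keep the cross-entropy finite, and Jaccard overlap of edge sets is just one simple choice of graph similarity. The function names and toy corpora are hypothetical.

```python
import math
from collections import Counter


def word_distribution(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed, normalized word frequencies over a shared vocabulary."""
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def cross_entropy(p, q):
    """H(p, q) = -sum_w p(w) * log2 q(w), in bits; p and q share a vocabulary."""
    return -sum(p[w] * math.log2(q[w]) for w in p if p[w] > 0)


def edge_jaccard(edges_a, edges_b):
    """Jaccard overlap of two edge sets (assumed stand-in for graph similarity)."""
    if not edges_a and not edges_b:
        return 1.0
    return len(edges_a & edges_b) / len(edges_a | edges_b)


if __name__ == "__main__":
    reference = "the court ruled on the case and the appeal".split()
    synthetic = "the plaintiff sued the defendant in court".split()
    vocab = set(reference) | set(synthetic)
    p = word_distribution(synthetic, vocab)
    q = word_distribution(reference, vocab)
    print("H(p, q) =", cross_entropy(p, q))
    print("H(p, p) =", cross_entropy(p, p))  # entropy of p, a lower bound on H(p, q)

    # Edges are (subject, relation, object) triples from a hypothetical knowledge base.
    kb_a = {("Party X", "sues", "Party Y")}
    kb_b = {("Party X", "sues", "Party Y"), ("Party Y", "pays", "Party X")}
    print("edge Jaccard =", edge_jaccard(kb_a, kb_b))
```

A synthetic corpus whose cross-entropy against the reference is far above its own entropy is using words very differently from general English, which may or may not be desirable depending on how specialized the target domain is.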

Conclusion

We have shown how we are using natural language generation to collapse the time and cost of training usually associated with supervised learning, which is allowing us to accelerate our event coverage and scale deeply and horizontally. We are excited to share our progress as we continue to build this foundational technology for language understanding.