Sophisticated AI models are capable of performing incredible feats, from predicting which patients are likely to develop breast cancer and spotting early signs of glaucoma from eye scans to hallucinating fake landscapes that look indistinguishable from the real thing. But despite their versatility, they share a common shortcoming: a lack of commonsense reasoning. Try telling a machine learning algorithm to predict what will happen when you push a ball off a table or when a person trips down the stairs. Unless it has been explicitly “taught” laws of physics through training on countless examples, it will struggle.

One solution is enumerating logic and applying it to a given AI model’s decision-making, but that’s a time-consuming and monotonous chore that doesn’t account for the many exceptions to probabilistic heuristics. That’s why scientists at Salesforce investigated an alternative approach, which they detail in a paper accepted into the 2019 Annual Meeting of the Association for Computational Linguistics: training a system on sequences of explanations for commonsense reasoning and highlighted annotations. They propose a new open source corpus — Common Sense Explanations (CoS-E) — for training and inference with a novel machine learning framework (Commonsense Auto-Generated Explanation, or CAGE), which they say improves performance on question-and-answer benchmarks by 10% over baselines and demonstrates an aptitude for reasoning in out-of-domain tasks.

“It turns out that, despite all the recent breakthroughs over the last decade, it’s been historically really hard to capture commonsense knowledge in a form that algorithms can actually make useful,” Salesforce chief scientist and coauthor on the paper Richard Socher told VentureBeat in a phone interview. “The reason I’m so excited for [the paper] here is that they have a first approach to capture commonsense knowledge, and it turns out that language models — simple models that read text and try to predict the next word and make sense of the future to autocomplete sentences — capture this commonsense knowledge.”

Compiling a data set

Devising the model was a multistep process.

To procure commonsense explanations for CoS-E, which is divided into two parts — a question token split and a random split — the team turned to Amazon’s Mechanical Turk and tasked human participants with explaining which of several answers was “most appropriate,” given ground-truth answers. Annotators highlighted relevant words in questions that justified the ground truths and then provided brief, open-ended explanations based on the highlighted justifications that served as the reasoning behind the questions.

For example, for the prompt “What could people do that involves talking?” the crowdworkers had to select from these answers: “confession,” “carnival,” or “state park.” Their explanation for “confession” might be “confession is the only vocal action,” and they might supply the reason “people talk to each other” or the rationale “people talk to people.”
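To make the annotation format concrete, here is a minimal sketch of how one such example could be represented in Python. The class name, the field names, and the choice of highlighted words are assumptions made for illustration; they are not taken from the released CoS-E files.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExplanationRecord:
    """One annotated example. Field names are hypothetical, not the CoS-E schema."""
    question: str
    choices: List[str]
    answer: str             # ground-truth answer shown to the annotator
    highlighted: List[str]  # question words marked as justification (assumed here)
    explanation: str        # short, open-ended explanation

    def is_valid(self) -> bool:
        # The ground truth must be one of the choices, and every highlighted
        # word must actually occur in the question text.
        question_words = {w.strip("?.,!").lower() for w in self.question.split()}
        return (
            self.answer in self.choices
            and all(w.lower() in question_words for w in self.highlighted)
        )


record = ExplanationRecord(
    question="What could people do that involves talking?",
    choices=["confession", "carnival", "state park"],
    answer="confession",
    highlighted=["people", "talking"],
    explanation="confession is the only vocal action",
)
print(record.is_valid())
```

A check like is_valid is one simple way to catch annotations whose highlights do not come from the question itself.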


Socher notes that CoS-E’s effectiveness isn’t constrained by the examples. CAGE achieves state-of-the-art results when trained on it, implying that even when drawing only on explanations that don’t have any word overlap with any of the answer choices, performance exceeds that of models that don’t use CoS-E.

“Usually, a lot of the tasks and data sets we look at have all the information [an AI model] needs to make a certain call,” explained Socher. “But [the model will] never be able to enumerate all the different possible types of reasoning to be able to do well on the test set, because the test set includes completely empty domains and things [the model has] never seen before.”

Devising a model

So how did CAGE come about? Coauthor Nazneen Rajani and team drew examples from Commonsense Question Answering (CQA), a corpus of multiple-choice questions for developing commonsense reasoning models. They paired these with corresponding CoS-E explanations from a natural language model conditioned on the question-and-answer choices. Next, they concatenated the explanations to the end of the original questions, answer choices, and outputs and then fed them to a second commonsense reasoning model.
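The two-stage flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the function names, the " [SEP] " separator, and the toy stand-ins for the two models are all hypothetical.

```python
from typing import Callable, List, Tuple


def build_classifier_input(question: str, choices: List[str],
                           explanation: str, sep: str = " [SEP] ") -> str:
    # Append the generated explanation after the question and answer choices.
    # The separator string is an assumption for this sketch.
    return sep.join([question, ", ".join(choices), explanation])


def answer_with_explanation(
    question: str,
    choices: List[str],
    explain: Callable[[str, List[str]], str],
    classify: Callable[[str, List[str]], int],
) -> Tuple[str, str]:
    # Stage 1: a language model produces an explanation.
    explanation = explain(question, choices)
    # Stage 2: a second model picks an answer from the concatenated text.
    text = build_classifier_input(question, choices, explanation)
    return choices[classify(text, choices)], explanation


# Toy stand-ins so the sketch runs without any trained model.
def toy_explain(question: str, choices: List[str]) -> str:
    return "confession is the only vocal action"


def toy_classify(text: str, choices: List[str]) -> int:
    # Pick the choice mentioned most often in the combined text.
    counts = [text.count(c) for c in choices]
    return counts.index(max(counts))


answer, why = answer_with_explanation(
    "What could people do that involves talking?",
    ["confession", "carnival", "state park"],
    toy_explain,
    toy_classify,
)
print(answer, "|", why)
```

Passing the two models in as plain callables keeps the sketch independent of any particular library; in practice each would wrap a trained network.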

In this way, the team considerably extended the capabilities of CQA, which was designed to benchmark performance on tasks requiring proficiency in pronoun resolution. Whereas results from CQA tend to be somewhat ambiguous with respect to whether commonsense reasoning is actually being performed, the researchers assert that CoS-E’s explanations are explicit and can be used to study, analyze, and evaluate models’ reasoning capabilities.

The aforementioned language model was OpenAI’s GPT, a multilayer transformer decoder and the forebear of the highly capable GPT-2 model released last year. As with all deep neural networks, GPT contains neurons (mathematical functions loosely modeled after biological neurons) arranged in interconnected layers that transmit “signals” from input data and slowly adjust the synaptic strength — weights — of each connection. (That’s how the model extracts features and learns to make predictions.) Uniquely, however, it has attention: Every output element is connected to every input element, and the weightings between them are calculated dynamically.
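As a rough illustration of what "calculated dynamically" means, here is a minimal pure-Python sketch of scaled dot-product attention, the standard formulation used in transformers. It is a teaching example with made-up inputs, not GPT's actual implementation. The optional causal flag reflects that a decoder such as GPT only lets each position attend to itself and earlier positions.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: List[Vector], keys: List[Vector],
              values: List[Vector], causal: bool = False) -> List[Vector]:
    dim = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # With causal=True, position i sees only positions 0..i.
        limit = i + 1 if causal else len(keys)
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys[:limit]
        ]
        # The weights depend on the inputs themselves, not on fixed parameters.
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values[:limit]))
            for j in range(len(values[0]))
        ])
    return outputs


queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [1.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
full = attention(queries, keys, values)
masked = attention(queries, keys, values, causal=True)
print(full)
print(masked)
```

Because the two keys in this toy input are identical, the unmasked version weights both values equally, while the masked version lets the first position see only the first value.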

For the commonsense reasoning model — a classification module that learned to perform predictions on the CQA task — the team chose Google’s BERT, which is unique in that it’s both bidirectional (allowing it to access context from past and future directions) and unsupervised (meaning it can ingest data that’s neither classified nor labeled).

The team fine-tuned a pretrained GPT model on a combination of CQA and CoS-E data sets and experimented with language generation in two settings: “reasoning,” where the language model conditioned on questions, answer choices, and the human-generated explanation but not the actual predicted label, and “rationalization,” where the model conditioned on the predicted labels, along with the input, to generate rationalizations. The researchers found that reasoning outperformed the state of the art on CQA by 10%, while rationalization bested the current top-ranking model by 6%.
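The difference between the two settings comes down to what text the language model sees before it starts generating. The sketch below shows one way to build the two kinds of prompt. The helper names and the exact template wording ("commonsense says", "because") are assumptions for illustration and are not confirmed by this article.

```python
from typing import List


def format_choices(choices: List[str]) -> str:
    # Join choices as "a, b, or c"; a single choice is returned unchanged.
    if len(choices) == 1:
        return choices[0]
    return ", ".join(choices[:-1]) + ", or " + choices[-1]


def reasoning_prompt(question: str, choices: List[str]) -> str:
    # Reasoning: the label is NOT part of the conditioning text.
    return f"{question} {format_choices(choices)}? commonsense says"


def rationalization_prompt(question: str, choices: List[str],
                           predicted_label: str) -> str:
    # Rationalization: the label IS part of the conditioning text.
    return f"{question} {format_choices(choices)}? {predicted_label} because"


question = "What could people do that involves talking?"
choices = ["confession", "carnival", "state park"]
print(reasoning_prompt(question, choices))
print(rationalization_prompt(question, choices, "confession"))
```

In either setting, the language model would be asked to continue the prompt with an explanation; only the rationalization prompt reveals which answer is being explained.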

The explanations in the rationalization setup can’t be considered commonsense reasoning, Rajani and colleagues note, because the model had access to the ground truth labels to input questions during training. Instead, they consider it an interpretability framework — a means of making the system’s decisions more transparent.

“The idea behind explainable AI is that you’d like to have an AI model to generate explanations for their decisions, and the most obvious reason for this is to gain users’ trust so that users can interact with them and they understand them,” Rajani told VentureBeat.

Surprising results

With framework and data set in hand, the team moved on to the next experimental step: validation.

On CQA, they say that CAGE achieved accuracy of roughly 65%, which they claim is state-of-the-art. And during a test in which the commonsense question-answering model was provided access to explanations that weren’t conditioned on the ground truth (during both training and validation), accuracy jumped from 64% to 72%, a gain of 8 percentage points.

Interestingly, the team found that when explanations consisted only of justifications, the best accuracy the model could reach was 53%, in contrast to the 85% hit by models trained on open-ended explanations. Adding questions to the mix boosted performance to 70%, and to 90% when provided at inference time.

The team separately carried out a test on two out-of-domain data sets: SWAG, a corpus with multiple choice questions about “a rich spectrum of grounded situations,” and Story Cloze, a collection of five-sentence “commonsense” stories. Model performance was slightly worse across the board, but the outputs exhibited surprisingly little in the way of grammatical or syntactical errors and contained information relevant to the scenarios at hand. In the case of the SWAG data set, where each question was a video caption with choices about what might happen next, generated explanations seemed to be grounded in given images — even though the language model wasn’t trained on SWAG.


“It shows that it’s worthwhile for the [research] community to think about collecting explanations as they’re collecting new data sets,” said paper coauthor Bryan McCann. “[It turns out that] actually going to the trouble of having humans write a little sentence about why they [chose an answer to a question] will potentially be very useful … for accessibility, interpretability, and performance as well.”

Work has already begun on CAGE frameworks with larger language models, which Socher predicts will boost accuracy even further.

“You can plug in any language model that’s pretrained and has weights available. Our hypothesis is that as you get larger and larger language models, you’ll capture more and more common sense,” he said. “Before, knowledge conglomeration used to be thought of as a human-in-the-loop endeavor … and the nice thing here is we can allow this model to read text [and then] make sense from all the things that people are saying. It can read about the world … and really capture this commonsense reasoning ability.”

Rajani believes the work could lay the groundwork for more helpful, less frustrating AI assistants.

“For example, suppose that you’re interacting with a robot and you have a coffee mug and an empty glass in front of you and you say ‘Pour me some water in a glass.’ If the robot had common sense, you wouldn’t have to be very specific — it’s not going to pour water in the coffee mug.”