Two researchers to keep an eye on in this field, both out of Microsoft Research, are Scott Wen-tau Yih (now at the Allen Institute for AI) and Jianfeng Gao. There are probably many others, and my omission of them reflects my limited bandwidth, not their accomplishments.

Broadly, there are two subfields in this area, determined by the source of truth for answers: structured and unstructured data. Structured data may be a knowledge base, a SQL database, or tables; unstructured data is usually plain text, though in the case of visual question answering it can be images. [I’m not sure if there are visual question answering systems that accept a series of images (video) as input, but if not, I’d bet that someone is working on it. They are probably using LSTMs, or some attention mechanism, given the time-dependent nature of video (this is a theme of this article).] Approaches vary between the two subfields. The most famous benchmark using unstructured text as a source of truth is SQuAD, the Stanford Question Answering Dataset. There doesn’t seem to be as established a benchmark in structured question answering, but Spider, out of Yale, looks promising.

Question Answering with Structured Data

Semantic Parsing via Staged Query Graph Generation: Question Answering with Knowledge Base is a reference for QA with an underlying knowledge base. A more recent work, Towards End-to-End Reinforcement Learning of Dialogue Agents for Information Access, released an implementation, KB-InfoBot, which by default uses a knowledge base of IMDB data.
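To make the knowledge-base setting concrete, here is a minimal sketch, not taken from KB-InfoBot's actual code, of answering a slot-filled question against a toy IMDB-like store of (entity, attribute, value) triples. The entities, attributes, and helper name are illustrative assumptions.

```python
# A toy knowledge base of (entity, attribute, value) triples, loosely in the
# spirit of an IMDB-style KB. The contents and field layout here are
# illustrative only, not the actual KB-InfoBot release.
KB = [
    ("Blade Runner", "director", "Ridley Scott"),
    ("Blade Runner", "release_year", "1982"),
    ("Alien", "director", "Ridley Scott"),
]

def lookup(entity: str, attribute: str) -> list[str]:
    """Return every value matching the (entity, attribute) pair."""
    return [v for e, a, v in KB if e == entity and a == attribute]

# "Who directed Blade Runner?" after slot-filling becomes a structured query.
print(lookup("Blade Runner", "director"))  # ['Ridley Scott']
```

The hard part, of course, is everything before the final line: mapping free-form language onto the right entity and attribute, which is what the papers above are about.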

In Search-based Neural Structured Learning for Sequential Question Answering, the backing data is semi-structured — tables from Wikipedia. More on this work later.
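To give a feel for what "sequential questions over a table" means, here is a small pandas sketch, with a made-up table and columns of my own invention, showing how each follow-up question operates on the result of the previous one rather than on the full table.

```python
import pandas as pd

# A made-up, Wikipedia-style table; the columns and rows are illustrative only.
table = pd.DataFrame({
    "city": ["Paris", "Lyon", "Marseille"],
    "country": ["France", "France", "France"],
    "population": [2_148_000, 513_000, 861_000],
})

# "Which cities are in France?" -> a filter over the table.
step1 = table[table["country"] == "France"]

# "Of those, which has the largest population?" -> a follow-up that operates
# on the previous result rather than on the full table.
step2 = step1.loc[step1["population"].idxmax(), "city"]
print(step2)  # Paris
```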

Of course, our favorite data format is SQL. And, if you have data in a SQL database, there is work being done to translate natural language questions into SQL. Spider is “a large-scale complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students.” This feels like it has the potential to kill an entire class of applications:

Some examples from the Spider project https://yale-lily.github.io/spider
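For a sense of what the task looks like as data, here is roughly what a single Spider training example contains: a natural-language question, the SQL query it should map to, and the id of the database whose schema grounds both. The field names below are from memory of the public release and may not match it exactly.

```python
import json

# Roughly what a Spider training example looks like; field names are an
# approximation of the public release, not guaranteed to match exactly.
example = {
    "db_id": "concert_singer",
    "question": "How many singers do we have?",
    "query": "SELECT count(*) FROM singer",
}

# The text-to-SQL task: map example["question"] (plus the schema for
# example["db_id"]) to example["query"].
print(json.dumps(example, indent=2))
```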

Question Answering with Unstructured Data — Machine Reading Comprehension

In addition to SQuAD 1.1, which is the go-to reading comprehension data set, this year saw the release of SQuAD 2.0, CoQA, QuAC, MS MARCO, and DuoRC. At a high level, each entry in these data sets involves a short corpus of text, such as

This is the story of a young girl and her dog. The young girl and her dog set out a trip into the woods one day. Upon entering the woods the girl and her dog found that the woods were dark and cold. The girl was a little scared and was thinking of turning back, but yet they went on. The girl's dog was acting very interested in what was in the bushes up ahead. To both the girl and the dog's surprise, there was a small brown bear resting in the bushes. The bear was not surprised and did not seem at all interested in the girl and her dog. The bear looked up at the girl and it was almost as if he was smiling at her. He then rested his head on his bear paws and went back to sleep. The girl and the dog kept walking and finally made it out of the woods. To this day the girl does not know why the bear was so friendly and to this day she has never told anyone about the meeting with the bear in the woods.

Many passages are more interesting than the above example, though. Questions are colocated with the corpus, as are acceptable answers. You can read a comparison of CoQA, SQuAD 2.0, and QuAC if you are interested in the details, but roughly the new challenges fall into three categories (a sketch of the underlying data layout follows the list):

- Unanswerable questions, that is, questions where the correct answer is “answer not available in corpus”
- Multi-turn interactions (the next section of this article)
- Abstractive answers — questions where the correct answer can be inferred, but not directly extracted, from the corpus
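Here is a minimal sketch of what an extractive QA entry looks like in a SQuAD-2.0-style layout: a context passage, questions with answers given as character-offset spans of the context, and an is_impossible flag for unanswerable questions. The fields are condensed from the real schema, and the questions themselves are mine.

```python
# A SQuAD-2.0-style entry, simplified: each answer is a span of the context,
# identified by its character offset, and unanswerable questions carry an
# is_impossible flag. Fields are condensed from the real schema.
entry = {
    "context": "The young girl and her dog set out on a trip into the woods one day.",
    "qas": [
        {
            "question": "Who went into the woods?",
            "answers": [{"text": "The young girl and her dog", "answer_start": 0}],
            "is_impossible": False,
        },
        {
            "question": "What color was the girl's bicycle?",
            "answers": [],
            "is_impossible": True,  # the answer is not in the corpus
        },
    ],
}

# Sanity check that the span offset actually points at the answer text.
ans = entry["qas"][0]["answers"][0]
start = ans["answer_start"]
assert entry["context"][start:start + len(ans["text"])] == ans["text"]
```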

From the comparison paper:

The coverage of unanswerable questions is complementary among datasets; SQuAD 2.0 covers all types of unanswerable questions present in other datasets, but focuses more on questions of extreme confusion, such as false premise questions, while QuAC primarily focuses on missing information. QuAC and CoQA dialogs simulate different types of user behavior: QuAC dialogs often switch topics while CoQA dialogs include more queries for details and cover twice as many sentences in the context as QuAC dialogs. Unfortunately, no dataset provides significant coverage of abstractive answers beyond yes/no answers, and we show that a method can achieve an extractive answer upper bound of 100 and 97.8 F1 on QuAC and CoQA, respectively.

Some algorithms of note that are architected specifically for question answering include FlowQA, SAN (Stochastic Answer Networks), and Multi-Granularity Hierarchical Attention Fusion Networks for Reading Comprehension and Question Answering. However, as noted above, general-purpose architectures like BERT seem to outperform them, or at least perform competitively, without any task-specific architecture.
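If you just want to try extractive QA with a BERT-family model, the Hugging Face transformers question-answering pipeline is enough for a quick experiment. The checkpoint name below is an assumption on my part; any model fine-tuned on SQuAD-style data should work.

```python
# Minimal extractive QA with a BERT-family model via the Hugging Face
# transformers "question-answering" pipeline. The checkpoint name is an
# assumption: any SQuAD-fine-tuned model should behave similarly.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad",
)

result = qa(
    question="Why did the girl keep walking?",
    context=(
        "The girl was a little scared and was thinking of turning back, "
        "but yet they went on."
    ),
)
print(result["answer"], result["score"])
```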

Sequential Question Answering

From the standpoint of building dialogue systems that people will actually use, sequences of simple questions are much more relevant than complex, one-off questions. As the authors write in the paper introducing QuAC,

QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context, as we show in a detailed qualitative evaluation.

This is the beginning of quantifying exactly what about human dialogue is so hard for computers to emulate. Compare with the similar CoQA data set to get a feel for the direction on which the two sets of authors have independently converged:

The unique features of CoQA include 1) the questions are conversational; 2) the answers can be free-form text; 3) each answer also comes with an evidence subsequence highlighted in the passage; and 4) the passages are collected from seven diverse domains. CoQA has a lot of challenging phenomena not present in existing reading comprehension datasets, e.g., coreference and pragmatic reasoning.

Here is a chart from the QuAC paper contrasting the various question answering data sets: