During a conversation between a customer and a dialogue system like Alexa’s, the system must not only understand what the customer is currently saying but also remember the conversation history. Only by combining the history with the current utterance can the system truly understand the customer’s requirements.

Consequently, an important problem in task-oriented dialogue systems is dialogue state tracking, which is essentially estimating and tracking the customer’s goal throughout a conversation.

Specifically, the dialogue state is a dictionary of {slot_name, slot_value} pairs, where slot_name can be defined by the dialogue system itself — for example, “Hotel_Stars” or “Hotel_Price” — and slot_values are entities mentioned in the dialogue — for example, “4” for “Hotel_Stars” and “Expensive” for “Hotel_Price”. The dialogue-state-tracking problem is to estimate all the available {slot_name, slot_value} pairs at each conversational turn.
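As a minimal Python sketch, the dialogue state can be pictured as an ordinary dictionary that is updated turn by turn, using the hotel slots from the example above:

```python
# A dialogue state as a dictionary of {slot_name: slot_value} pairs,
# filled in as the conversation progresses.
state = {}

# Turn 1: "I'm looking for an expensive hotel."
state["Hotel_Price"] = "Expensive"

# Turn 2: "It should have 4 stars."
state["Hotel_Stars"] = "4"

# The tracker's job is to produce this dictionary at every turn.
print(state)  # {'Hotel_Price': 'Expensive', 'Hotel_Stars': '4'}
```

A real tracker must of course infer these updates from the raw utterances; the dictionary is only the target representation.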

Dialogue states at several successive dialogue turns

At this year’s meeting of the Association for Computational Linguistics and the International Speech Communication Association’s joint Special Interest Group on Discourse and Dialogue (SIGDIAL), we received a best-paper nomination for work on applying machine reading comprehension approaches to dialogue state tracking.

Machine reading comprehension, which aims to teach machines to understand human-language documents, is a classical problem in natural-language processing. One common task of reading comprehension is to answer questions given a passage of text. In our proposed approach, we formulate dialogue state tracking as a question-answering-style reading comprehension problem. That is, we ask the dialogue system to answer the question “What is the slot_value for slot_name?” after reading a conversational passage.
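The reformulation amounts to generating one such question per slot and pairing it with the conversation so far. A tiny sketch, with a hypothetical helper name:

```python
def slot_question(slot_name: str) -> str:
    """Phrase state tracking for one slot as a reading comprehension question."""
    return f"What is the slot_value for {slot_name}?"

# The "passage" is the dialogue history plus the current utterance.
dialogue = (
    "User: I'm looking for a place to stay.\n"
    "User: Something expensive, with 4 stars."
)

# One question-answering instance per slot the system tracks.
for slot in ["Hotel_Price", "Hotel_Stars"]:
    print(slot_question(slot))
```

Each (question, passage) pair can then be fed to a reading comprehension model, which answers by pointing into the dialogue text.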

Dialogue state tracking reinterpreted as question answering based on machine reading comprehension

Historically, research on dialogue state tracking has focused on methods that estimate distributions over all the possible values for a given slot. But modern task-oriented dialogue systems present problems of scale. It is not unusual to have thousands or even millions of values for a single slot — song_name, for example, in a voice-controlled music service. Calculating distributions over all those values for each turn of dialogue would be prohibitively time consuming.

One advantage of our new approach is that, in reading-comprehension-based question answering, answers are usually extracted from the text as spans of consecutive words, so there’s no need to calculate massive distributions.
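To make the contrast concrete, here is a toy illustration of span extraction: the model scores each token as a possible start or end of the answer and picks the best pair, rather than scoring every entry in a huge value vocabulary. The scores below are made up for illustration:

```python
# Passage tokens from a hypothetical music request.
tokens = ["Play", "Bohemian", "Rhapsody", "by", "Queen"]

# Hypothetical per-token start/end scores from a reading comprehension model.
start_scores = [0.1, 3.2, 0.3, 0.1, 0.2]
end_scores = [0.1, 0.4, 2.9, 0.1, 0.3]

start = max(range(len(tokens)), key=lambda i: start_scores[i])
# The end position must not precede the start position.
end = max(range(start, len(tokens)), key=lambda i: end_scores[i])

answer = " ".join(tokens[start:end + 1])
print(answer)  # Bohemian Rhapsody
```

The cost of this decision scales with the length of the passage, not with the number of possible values a slot like song_name can take.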

Additionally, machine reading comprehension is an active research area that has made great progress in recent years. By connecting it with dialogue state tracking, we can leverage reading-comprehension-based approaches and develop robust new models for task-oriented dialogue systems.

After reformulating state tracking as reading comprehension, we propose a method with three prediction components:

1. A slot carryover model: predicts whether a {slot_name, slot_value} pair needs to be carried over from the previous turn or updated at the current turn;

2. A slot type model: if the slot carryover model decides to update the {slot_name, slot_value} pair, this model predicts the type of the slot_value from four options: Yes, No, Don’t care, and Span; and

3. A slot span model: if the slot type model decides that the type is Span, this model extracts the slot_value span from the dialogue, represented as [start position, end position] in the conversation.
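The three-stage decision logic above can be sketched as follows. The model components themselves are stubbed out here; this only illustrates how their predictions combine into a slot update, with hypothetical class and function names:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class SlotPrediction:
    """Outputs of the three prediction components for one slot at one turn."""
    carryover: bool                          # keep the previous value?
    slot_type: str = "Span"                  # "Yes", "No", "Don't care", or "Span"
    span: Optional[Tuple[int, int]] = None   # [start position, end position]

def update_slot(prev_value, prediction: SlotPrediction, tokens):
    # 1) Slot carryover model: keep the value from the previous turn.
    if prediction.carryover:
        return prev_value
    # 2) Slot type model: categorical values need no span extraction.
    if prediction.slot_type != "Span":
        return prediction.slot_type
    # 3) Slot span model: read the value off the dialogue text.
    start, end = prediction.span
    return " ".join(tokens[start:end + 1])

tokens = "I want an expensive hotel".split()
pred = SlotPrediction(carryover=False, slot_type="Span", span=(3, 3))
print(update_slot(None, pred, tokens))  # expensive
```

Note how the cascade keeps the expensive span extraction for last: it only runs when the first two components have decided a new, in-text value is needed.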

Our reading comprehension system for dialogue state tracking. The prediction (top) layer has three components (from right to left): (1) a slot carryover model, which predicts whether a particular slot needs to be updated from one turn to the next; (2) a slot type model, which predicts the type of the slot values (Yes, No, Don’t Care, Span); and (3) a slot span model, which predicts the start and end points of the value within the dialogue.

In tests involving a dataset with 37 {slot_name, slot_value} pairs, our approach yielded a 6.5% improvement in slot-tracking accuracy over the previous state of the art. Furthermore, if we combine our method with the traditional state-tracking approach — the way we combined methods in the HyST system described in our previous blog post — we are able to further advance the state of the art by 11.75%.

We also did a comprehensive analysis of our models. Our span-based reading comprehension model has an accuracy of up to 96% per slot on development data, validating our approach. Most errors appear to be coming from the slot carryover model — i.e., the decision whether to update a slot_value or not. We found that the majority of these errors result from delayed annotation in the training data — i.e., slot values that were annotated one or more turns after they appeared in user utterances. We thus published a cleaner version of the dataset we used, to fix the annotations in the original version.

Acknowledgments: Sanchit Agarwal, Abhishek Sethi, Tagyoung Chung, Dilek Hakkani-Tür, Rahul Goel, Shachi Paul, Anuj Kumar Goyal, Angeliki Metallinou