The idea that machine-generated conversations could be more “realistic” than human conversations may seem absurd or even logically flawed. Yet we’re getting there. Microsoft’s new tunable gigaword-scale neural network DialoGPT (dialogue generative pre-trained transformer) is a virtual master of conversation: it outperforms strong baseline systems in generating relevant and context-consistent responses, and attains near-human-level performance on conversational response generation tasks.

Most open-domain neural response generation systems suffer from content inconsistency due to a lack of long-term contextual information. Such issues can be alleviated by boosting the information content, and that is what unified text-to-text transformers have successfully done. GPT-2, the OpenAI text generator that can convincingly transform a prompt into a news article, a poem or other text style, uses a multi-layer self-attentive mechanism that enables long-term dependency information to be better preserved across time, which to some extent fixes the content inconsistency problem.

DialoGPT extends GPT-2 to address conversational neural response generation tasks. Like GPT-2, DialoGPT is an autoregressive language model with a multi-layer transformer architecture. It models a multi-turn dialogue session as one long text by concatenating all dialogue turns within the session, and frames generation as language modeling: the model is trained to maximize the conditional probability of the target sentence given the source sentence, which factorizes into a product of per-token conditional probabilities.
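To make the language-modeling framing concrete, here is a minimal Python sketch. It is illustrative only and rests on several assumptions: whitespace splitting stands in for byte pair encoding, GPT-2's <|endoftext|> token is used as a turn separator, and a hypothetical uniform_model supplies next-token probabilities where a trained transformer would. The names flatten_session and response_log_prob are invented for this example and do not come from the released code.

```python
import math
from typing import Callable, List, Sequence

# GPT-2's end-of-text token; using it to separate turns is an assumption here.
EOS = "<|endoftext|>"


def flatten_session(turns: Sequence[str]) -> List[str]:
    """Concatenate all turns of one dialogue session into a single token list."""
    tokens: List[str] = []
    for turn in turns:
        tokens.extend(turn.split())  # whitespace split stands in for BPE
        tokens.append(EOS)
    return tokens


def response_log_prob(
    tokens: List[str],
    source_len: int,
    next_token_prob: Callable[[List[str], str], float],
) -> float:
    """Return log p(T | S) as a sum of per-token conditional log-probabilities.

    tokens[:source_len] is the source (dialogue history) and the remainder is
    the target (response). Each target token is conditioned on everything
    before it, so the product of conditionals becomes a sum of logs.
    """
    total = 0.0
    for n in range(source_len, len(tokens)):
        total += math.log(next_token_prob(tokens[:n], tokens[n]))
    return total


def uniform_model(prefix: List[str], token: str, vocab_size: int = 4) -> float:
    """Hypothetical stand-in model: every token has probability 1 / vocab_size."""
    return 1.0 / vocab_size


session = flatten_session(["hi there", "hello"])
# The first three tokens form the source; the remaining two form the target.
# Two target tokens at probability 1/4 each, so this prints 2 * log(0.25).
print(response_log_prob(session, 3, uniform_model))
```

Training would adjust the model so that this quantity is as large as possible for the real responses in the corpus.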

However, unlike GPT-2, which was trained on 40GB of general Internet text, DialoGPT was trained specifically on dialogue pairs extracted from Reddit comment chains. This enables DialoGPT to capture the joint distribution of target and source with finer granularity, and to generate more diverse responses that remain faithful to the source prompt.

DialoGPT inherits much of its network structure from GPT-2, including a 12-to-24 layer transformer with layer normalization, an initialization scheme that accounts for model depth, and byte pair encoding for the tokenizer. To better address the issue of inconsistent content generated by open-domain text generators, DialoGPT also implements a maximum mutual information (MMI) scoring function, which intuitively penalizes bland, uninformative and overgeneralized hypotheses.
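The intuition behind MMI scoring can be sketched in a few lines of Python. This sketch assumes MMI is applied as a reranking step: candidate responses are scored by a backward model that estimates how likely the source is given each hypothesis, so a generic reply that could follow any prompt scores poorly. The function toy_backward_log_prob is a hypothetical word-overlap stand-in for a trained backward model, and its constants are arbitrary.

```python
import math
from typing import Callable, List, Tuple


def mmi_rerank(
    source: str,
    hypotheses: List[str],
    backward_log_prob: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Sort hypotheses by a backward score for (source | hypothesis), best first."""
    scored = [(hyp, backward_log_prob(source, hyp)) for hyp in hypotheses]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_backward_log_prob(source: str, hypothesis: str) -> float:
    """Hypothetical scorer, not a real backward model.

    A source word counts as easy to recover (0.5) if it appears in the
    hypothesis and hard to recover (0.05) otherwise. Both constants are
    arbitrary choices for illustration.
    """
    hyp_words = set(hypothesis.lower().split())
    return sum(
        math.log(0.5 if word in hyp_words else 0.05)
        for word in source.lower().split()
    )


ranked = mmi_rerank(
    "do you like jazz music",
    ["i don't know", "i love jazz music , especially miles davis"],
    toy_backward_log_prob,
)
for hypothesis, score in ranked:
    print(f"{score:8.3f}  {hypothesis}")
```

The bland reply shares no words with the prompt, so the toy scorer ranks it below the specific one. A trained backward model plays the same role with learned probabilities.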

The researchers also conducted experiments using end-to-end conversational modeling tasks in the DSTC-7 Dialogue Generation Challenge. Unlike dialogue tasks with specific and narrow goals such as booking a flight or reserving a table at a restaurant, the goal of the DSTC-7 challenge is to generate free-flowing humanlike conversations such as those found in brainstorming meetings, while also appropriately injecting information grounded in external knowledge.

DSTC-7 evaluation results

Reddit multi-reference evaluation

DialoGPT scored similarly to the study’s human ground-truth reference. Re-ranking responses with MMI, meanwhile, increased response diversity and achieved higher NIST, METEOR, Entropy and Dist scores than low-compute greedy generation.
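For readers unfamiliar with the diversity metrics, the following Python sketch shows how Dist-n and n-gram Entropy are commonly defined. It assumes whitespace tokenization and natural-log entropy; the evaluation scripts used in the study may differ in tokenization and other details, and the function names are illustrative.

```python
import math
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token list (empty if too short)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(responses: List[str], n: int) -> Counter:
    """Count n-grams over a whole set of generated responses."""
    return Counter(g for r in responses for g in ngrams(r.split(), n))


def dist_n(responses: List[str], n: int) -> float:
    """Dist-n: distinct n-grams divided by total n-grams."""
    counts = ngram_counts(responses, n)
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


def entropy_n(responses: List[str], n: int) -> float:
    """Entropy (in nats) of the empirical n-gram distribution."""
    counts = ngram_counts(responses, n)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


bland = ["i do not know", "i do not know"]
varied = ["i love jazz", "tea sounds nice"]
print(dist_n(bland, 1), dist_n(varied, 1))
print(entropy_n(bland, 1), entropy_n(varied, 1))
```

Repeating the same reply lowers both numbers, which is why a system that falls back on generic responses scores poorly on these metrics.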

Generated examples addressing commonsense questions

The paper DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation is on arXiv.