Microsoft Research's Natural Language Processing Group released the dialogue generative pre-trained transformer (DialoGPT), a pre-trained deep-learning natural language processing (NLP) model for automatic conversation response generation. The model was trained on over 147M dialogues and achieves state-of-the-art results on several benchmarks.

The team presented the details of the system in a paper published on arXiv. DialoGPT is built on the GPT-2 transformer architecture and trained using a dataset scraped from Reddit comment threads. The model was evaluated using two test datasets: the Dialog System Technology Challenges (DSTC-7) dataset and a new dataset of 6k examples also extracted from Reddit. For both datasets, the team used machine-translation metrics such as BLEU and METEOR to compare the performance of DialoGPT with that of Microsoft's Personality Chat and of "Team B," the winner of DSTC-7. DialoGPT outperformed the other models on all metrics. The team also used human judges to rank the output of DialoGPT against real human responses; the judges preferred DialoGPT's response about 50% of the time.
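
To give a sense of what a metric like BLEU measures, the following is a minimal Python sketch of a sentence-level, single-reference BLEU-style score: clipped n-gram precision combined with a brevity penalty. It is an illustration only, not the evaluation script used by the DialoGPT team. The function names and the default of max_n=2 are assumptions made for this example; published BLEU results typically use n-grams up to length 4, may use multiple references, and aggregate over a whole corpus.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def simple_bleu(candidate, reference, max_n=2):
    """Simplified BLEU-style score (hypothetical helper, no smoothing):
    geometric mean of clipped precisions for n = 1..max_n, multiplied
    by a brevity penalty for candidates shorter than the reference."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        # Without smoothing, any zero precision makes the score zero.
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_mean)


if __name__ == "__main__":
    reference = "i think the movie was great".split()
    candidate = "i think the movie was fine".split()
    print(simple_bleu(candidate, reference))
```

Because the score rewards only n-gram overlap with a reference, a perfectly reasonable reply that happens to use different words can score poorly, which is one reason the team paired these metrics with human judgments.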

The Transformer architecture has become a popular deep-learning model for NLP tasks. These models are usually pre-trained, using unsupervised learning, on large datasets such as the contents of Wikipedia. Pre-training allows the model to learn the structure of natural language before being fine-tuned on a dataset for a particular task (such as the DSTC-7 dataset). Even without fine-tuning, the large pre-trained models can achieve state-of-the-art results on NLP benchmarks. However, the DialoGPT team points out that many of these models are "notorious for generating bland, uninformative samples." To address this, they implemented a maximum mutual information (MMI) scoring function that re-ranks the model's outputs, penalizing "bland" outputs. The team also investigated using reinforcement learning to improve the model's results, but found that this usually resulted in responses that simply repeated the source sentence.
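
The re-ranking idea can be sketched in a few lines of Python. One common formulation of MMI scores each candidate response by how well a "backward" model can recover the source message from it, so a generic reply that could follow almost any message ranks low. The sketch below assumes that formulation; the function names are hypothetical, and toy_backward_log_prob is a word-overlap stand-in for a trained backward model, not anything taken from the DialoGPT code.

```python
import math


def toy_backward_log_prob(source, response):
    """Toy stand-in (assumption) for log P(source | response).
    A real system would use a trained backward language model; here the
    score simply grows with the number of source words the response shares."""
    src = set(source.lower().split())
    resp = set(response.lower().split())
    overlap = len(src & resp)
    return math.log((overlap + 1) / (len(src) + 1))


def mmi_rerank(source, candidates, backward_log_prob):
    """Return (response, score) pairs sorted from highest to lowest
    backward score, so responses tied to the source come first."""
    scored = [(response, backward_log_prob(source, response))
              for response in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    source = "do you like jazz music"
    candidates = ["i don't know", "yes i like jazz a lot"]
    for response, score in mmi_rerank(source, candidates, toy_backward_log_prob):
        print(f"{score:.3f}  {response}")
```

Passing the scorer in as a function keeps the ranking step separate from the model, so the toy scorer could be swapped for a real backward model without changing mmi_rerank.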

Pre-trained models are especially attractive for conversational systems, due to a lack of high-quality training datasets for dialogue tasks. However, using natural dialogue from Internet sites such as Reddit or Twitter risks exposing the model to offensive speech, which it can then learn to reproduce. Microsoft's earlier experimental chatbot, Tay, produced output that was "wildly inappropriate and reprehensible" after conversing with users of Twitter. Microsoft's Personality Chat cloud service attempts to address this by using a series of machine-learning classifiers to filter out offensive input before auto-generating responses. As a precaution, the DialoGPT team chose not to release the decoder that converts the model outputs into actual text strings. Similarly, OpenAI originally held back its fully trained GPT-2 model due to concerns about "malicious applications of the technology."

A Reddit user reverse-engineered the decoder and posted some results from the model, along with this comment:

I'd say all of the generations are grammatically acceptable and quite impressive considering how little information it was given, about 1 out of 5 appeared to be very coherent and sometimes strikingly sarcastic (much [like] reddit). Those prompted with a clear defined topic certainly worked out better.

On Twitter, NLP researcher Julian Harris said:

One always needs to bear in mind in these reports that "close to human performance" is for the tested scenario only. Autogeneration of responses (NLG) is still a very new field and is highly unpredictable...As such deep learning-generated conversational dialogs currently are at best entertaining, and at worst, a terrible, brand-damaging user experience.

The DialoGPT code and pretrained models are available on GitHub.