Removed instances where either the source or the target contains a URL.

Removed instances containing traces of toxic language. Toxicity was detected based on a pre-determined list of words.

Removed instances where the response did not contain any word from the list of the 50 most common words in the English language. This step likely helps validate that the sentence is in English.

Removed instances where the response contains special markup such as "[" or "]", as a test for markup language.

Removed instances where the combined length of source and target is more than 200 words.

Removed instances where the target contains word repetitions of at least 3 words.

Removed instances in which at least 90% of the trigrams occur more than 1000 times. This helps prune bland responses. A rough sketch of such filters is shown below.
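These heuristics are straightforward to prototype. Below is a minimal, illustrative sketch of a few of them in Python; the word lists, thresholds, and helper names are my own placeholders rather than the authors' actual implementation, and the trigram-blandness filter is omitted since it requires corpus-level counts.

import re

# Placeholder lists - the actual lists used by the authors are not reproduced here.
TOP_50_ENGLISH_WORDS = {"the", "of", "and", "a", "to", "in", "is", "you", "that", "it"}
TOXIC_WORDS = {"badword1", "badword2"}

URL_PATTERN = re.compile(r"https?://|www\.")

def has_url(text):
    return bool(URL_PATTERN.search(text))

def is_toxic(text):
    return any(word in TOXIC_WORDS for word in text.lower().split())

def lacks_common_english_word(text):
    return not any(word in TOP_50_ENGLISH_WORDS for word in text.lower().split())

def too_long(source, target, max_words=200):
    return len(source.split()) + len(target.split()) > max_words

def keep_pair(source, target):
    # Return True if the (source, target) pair survives the filters.
    if has_url(source) or has_url(target):
        return False
    if is_toxic(source) or is_toxic(target):
        return False
    if lacks_common_english_word(target):
        return False
    if too_long(source, target):
        return False
    return True

pairs = [
    ("How are you?", "Check out https://example.com"),
    ("How are you?", "I am doing great, thanks for asking!"),
]
print([target for source, target in pairs if keep_pair(source, target)])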

!pip install transformers==3.0.2

from transformers import AutoModelWithLMHead, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelWithLMHead.from_pretrained("microsoft/DialoGPT-medium")

# Chat with the model for 3 turns
for step in range(3):
    # Encode the user input and append the end-of-sequence token
    new_user_input_ids = tokenizer.encode(input(">> User:") + tokenizer.eos_token, return_tensors='pt')

    # After the first turn, prepend the running chat history to the new input
    bot_input_ids = torch.cat([gen_ids, new_user_input_ids], dim=-1) if step > 0 else new_user_input_ids

    # Generate a response, capping the total sequence length at 200 tokens
    gen_ids = model.generate(bot_input_ids, max_length=200, pad_token_id=tokenizer.eos_token_id)

    # Decode and print only the newly generated tokens, i.e. the response
    print("DialoGPT: {}".format(tokenizer.decode(gen_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))

Feel free to copy-paste the above code and add print() statements to inspect the intermediate tensors.
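Later in this post, the decoding strategy from the paper is described: sample 16 candidate responses with top-k (k=10) sampling and re-rank them with a backward (MMI) model that scores the source given the target. The snippet below is only a rough sketch of that re-ranking idea under my own assumptions; in particular, the backward model here is a stand-in (the forward model reused) so the code runs end-to-end, whereas the paper uses a separately trained reverse model.

import torch
from transformers import AutoModelWithLMHead, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
forward_model = AutoModelWithLMHead.from_pretrained("microsoft/DialoGPT-medium")

# Stand-in only: a real MMI re-ranker would load a reverse (target -> source) checkpoint here.
backward_model = forward_model

def backward_loss(target_ids, source_ids):
    # Loss of the backward model for predicting the source given the target
    input_ids = torch.cat([target_ids, source_ids], dim=-1)
    labels = input_ids.clone()
    labels[:, :target_ids.shape[-1]] = -100  # only score the source tokens
    loss = backward_model(input_ids, labels=labels)[0]
    return loss.item()

source_ids = tokenizer.encode("Does money buy happiness?" + tokenizer.eos_token, return_tensors="pt")

# Sample 16 candidate responses with top-k (k=10) decoding
candidates = forward_model.generate(
    source_ids, max_length=100, do_sample=True, top_k=10,
    num_return_sequences=16, pad_token_id=tokenizer.eos_token_id)

# Keep the candidate whose backward loss is lowest (least bland, per the MMI argument)
responses = [c[source_ids.shape[-1]:].unsqueeze(0) for c in candidates]
best = min(responses, key=lambda target_ids: backward_loss(target_ids, source_ids))
print(tokenizer.decode(best[0], skip_special_tokens=True))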

I see this research as pushing the capabilities of language models by training deeper networks, training longer, and playing with data formatting to solve more sophisticated tasks. It is also very cool to see a language model do so well (although it's not surprising, well, GPT-3 :D). Apart from the open-domain setting, I would be interested to see its performance in a closed-domain/task-oriented dialogue setting, and its feasibility for production deployment in terms of robustness and response quality. I also believe a multi-task loss for handling blandness, repetition, toxicity, etc. could be tried, in addition to pre-processing the input corpus for such nuances. Finally, I did not find the authors encoding any kind of speaker identity while modeling the utterances, although I am not sure whether that would help.
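If one wanted to experiment with the speaker-identity idea mentioned above, a toy sketch (purely my own illustration, not something from the paper) could prefix each turn with a speaker tag before tokenization:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")

# Illustrative only: mark each turn with a speaker tag so the model can condition on
# who is speaking. The tags here are ordinary strings; a more careful variant might
# register them as special tokens and resize the model's embedding matrix.
turns = [("SPEAKER_A", "Hi, how are you?"),
         ("SPEAKER_B", "Doing well, thanks!"),
         ("SPEAKER_A", "What are you up to today?")]

history = tokenizer.eos_token.join("{}: {}".format(speaker, text) for speaker, text in turns)
input_ids = tokenizer.encode(history + tokenizer.eos_token, return_tensors="pt")
print(input_ids.shape)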

A dialogue system, or conversational agent, is a computer system intended to converse with a human; it can employ one or more of text, speech, graphics, haptics, gestures, and other modes of communication on both the input and output channels. A couple of months ago, researchers from Microsoft released their large-scale conversational model, DialoGPT, which achieves state-of-the-art performance for generating relevant and consistent responses in dialogue systems. Response generation can be seen as a sub-problem of text generation, where the idea is to generate natural and fluent text that is conditioned on, and relevant to, the input prompt sequence.

Before we move further, let's have a quick refresher on GPT-2. GPT-2 is a large-scale language model that was trained on 40GB of the WebText corpus. It is a stack of multiple transformer decoder units on top of each other, enabled with advanced concepts like masked self-attention, multiple heads, residual connections, and layer normalization, making it a state-of-the-art text generator. The objective GPT-2 optimizes is to predict the next word in the sequence, having seen the past words. The model consumes the input one word at a time and outputs the next word in the sequence. The autocomplete feature on smartphones and Smart Compose in Gmail are essentially built on such concepts.

The authors talk about current challenges in open-domain response generation systems, such as style inconsistency, blandness, and failure to capture long-term dependencies. They believe a transformer-based language model such as GPT-2, which uses multiple layers of self-attention and has already been shown to generate fluent text, could potentially address some of these limitations. As a precaution, they also apply multiple pre-processing steps to the input/output samples before feeding them to the model; these are the 7 pre-processing steps listed at the top of this post.

They train their model on 147M conversation exchanges from Reddit comment threads spanning 2005 through 2017. As a modeling strategy, the authors frame response generation as the task of learning a language model. They extend GPT-2 to address this problem, under the assumption that this approach captures the joint distribution P(Target, Source) of the conversational flow with finer granularity. They extract multiple threads, from root to leaf, from Reddit discussions, where each thread acts as a training instance containing multiple turns of dialogue. They concatenate all dialogue turns within a session into a long text x1, x2, ..., xN, eos, where N is the sequence length and eos is an end-of-text token. Let S = x1, ..., xM be the source sequence and T = xM+1, ..., xN be the target sequence; then we wish to maximize the conditional probability P(T|S) = p(xM+1, ..., xN | x1, ..., xM), i.e. the product of p(xn | x1, ..., xn-1) for n = M+1 to N.

The authors use a top-k (k=10) decoding strategy to sample 16 candidate responses at any given instance. They employ a Maximum Mutual Information (MMI) strategy for re-ranking these candidates. The MMI scorer is also a pre-trained GPT-2 model, trained to predict S given T. The response that yields the lowest backward-model loss is selected. This selection helps prune bland responses, as bland responses are common to most queries and thus should yield low probability for any specific query.

They trained Small (117M), Medium (345M), and Large (762M) GPT-2 models for 5, 5, and 3 epochs respectively. On the engineering side, they compressed all the data into a lazy-loading database so that data was loaded only as and when needed. They also grouped conversations of similar lengths into the same batch for better training efficiency, and employed asynchronous data processing for scaling the training. The paper includes a figure showing some sample chats produced by the model.

The code snippet shown earlier in this post demonstrates how to try out the pre-trained model. We start by importing the pre-trained tokenizer and model. For every user utterance, we encode it using the pre-trained tokenizer and append the end-of-sequence token. If it is the first utterance, it becomes the history; otherwise, the previously generated sentences act as the history. We pass the concatenated version of history and user input to the generate function. We decode until eos is reached and pick the last generated segment as the response from the model.

References

1. Original paper - https://arxiv.org/abs/1911.00536
2. Microsoft blog - https://www.microsoft.com/en-us/research/project/large-scale-pretraining-for-response-generation/
3. GitHub - https://github.com/microsoft/DialoGPT