Interesting developments happened in 2018/2019 for natural language generation decoding algorithms: here's a thread with some papers & code



So, the two most common decoders for language generation used to be greedy decoding (GD) and beam search (BS). [1/9]

Greedy: at each time step, select the most likely next token according to the model, until the end-of-sequence token. Risk: missing a high-probability token hiding right after a low-probability one.
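
To make this concrete, here is a minimal greedy-decoding sketch in PyTorch. The model call, bos_token_id and eos_token_id are placeholders for whatever language model and vocabulary you use; the model is assumed to return next-token logits of shape (1, seq_len, vocab_size):

import torch

def greedy_decode(model, bos_token_id, eos_token_id, max_len=50):
    # start from the beginning-of-sequence token
    generated = torch.tensor([[bos_token_id]])
    for _ in range(max_len):
        logits = model(generated)             # (1, seq_len, vocab_size), hypothetical interface
        next_token = logits[0, -1].argmax()   # greedily pick the most likely next token
        generated = torch.cat([generated, next_token.view(1, 1)], dim=1)
        if next_token.item() == eos_token_id: # stop at end of sequence
            break
    return generated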



Beam-search: to mitigate this, maintain a beam of the most likely sequences, constructed word by word, and choose the best-scoring one at the end [2/9]
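
And a toy beam-search sketch under the same assumptions (batch size 1, no length normalization, same placeholder model interface as above):

import torch
import torch.nn.functional as F

def beam_search(model, bos_token_id, eos_token_id, beam_size=3, max_len=50):
    # each hypothesis is (list of token ids, cumulative log-probability)
    beams = [([bos_token_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos_token_id:   # finished hypotheses are carried over unchanged
                candidates.append((tokens, score))
                continue
            logits = model(torch.tensor([tokens]))           # (1, seq_len, vocab_size)
            log_probs = F.log_softmax(logits[0, -1], dim=-1)
            top_lp, top_ids = log_probs.topk(beam_size)
            for lp, idx in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((tokens + [idx], score + lp))
        # keep only the beam_size best partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(tokens[-1] == eos_token_id for tokens, _ in beams):
            break
    return beams[0][0]   # token ids of the highest-scoring sequence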

Beam-search is now the standard decoding algorithm for almost all language generation tasks including dialog (see http://arxiv.org/abs/1811.00907 ).



But interesting developments happened in 2018/2019 [3/9]

First, there was growing evidence that beam-search is highly sensitive to the length of the output. Best results are obtained when the output length is predicted from the input before decoding ( http://arxiv.org/abs/1808.10006 , https://arxiv.org/abs/1808.09582 at EMNLP 2018) [4/9]

While this makes sense for low-entropy tasks like translation, where the output length can be roughly predicted from the input, it seems arbitrary for high-entropy tasks like dialog and story generation, where outputs of widely different lengths are usually equally valid [5/9]

In parallel, at least two influential papers (https://arxiv.org/abs/1805.04833, https://openai.com/blog/better-language-models/) on high-entropy tasks were published in which BS/greedy decoding was replaced by sampling from the next-token distribution at each time step (using a variant called top-k sampling, see below) [6/9]
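
For reference, plain (untruncated) sampling just replaces the argmax of greedy decoding with a draw from the softmax distribution; a sketch with the same placeholder model interface as above:

import torch
import torch.nn.functional as F

def sample_decode(model, bos_token_id, eos_token_id, max_len=50, temperature=1.0):
    generated = torch.tensor([[bos_token_id]])
    for _ in range(max_len):
        logits = model(generated)[0, -1] / temperature        # next-token logits, optionally rescaled
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)  # sample instead of taking the argmax
        generated = torch.cat([generated, next_token.view(1, 1)], dim=1)
        if next_token.item() == eos_token_id:
            break
    return generated

Top-k and nucleus sampling are the same loop with the logits filtered before the softmax (see the sketch after [8/9]).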

Last in this recent trend of work is https://arxiv.org/abs/1904.09751 in which @universeinanegg & co show that the distribution of words in BS/greedy decoded texts is very different from the one in human texts.

Clearly BS/greedy fail to reproduce distributional aspects of human text [7/9]

Today, the most promising decoding candidates for high-entropy tasks seem to be top-k and nucleus (top-p) sampling



General principle: at each step, sample from the next-token distribution filtered to keep only the k most likely tokens (top-k) or the smallest set of top tokens whose cumulative probability exceeds a threshold p (nucleus) [8/9]
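
A minimal sketch of that filtering step, on a 1-D tensor of next-token logits (the full PyTorch implementation is in the gist linked below):

import torch
import torch.nn.functional as F

def filter_logits(logits, top_k=0, top_p=0.0, filter_value=-float('inf')):
    # logits: 1-D tensor of next-token logits; filtered tokens get -inf so softmax zeroes them out
    if top_k > 0:
        # top-k: drop every token less likely than the k-th most likely one
        kth_best = torch.topk(logits, top_k)[0][-1]
        logits[logits < kth_best] = filter_value
    if top_p > 0.0:
        # nucleus (top-p): keep the smallest set of top tokens whose cumulative prob exceeds top_p
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        to_remove = cum_probs > top_p
        to_remove[1:] = to_remove[:-1].clone()  # shift right so the first token crossing the threshold is kept
        to_remove[0] = False
        logits[sorted_idx[to_remove]] = filter_value
    return logits

At each step you then sample from the softmax of the filtered logits, exactly as in the plain-sampling loop above.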

Finally, here is a gist showing how to code top-k and nucleus sampling in PyTorch:

https://gist.github.com/thomwolf/1a5a29f6962089e871b94cbd09daf317

[9/9]
