To understand why WaveNet improves on the current state of the art, it is useful to understand how text-to-speech (TTS) - or speech synthesis - systems work today.

The majority of these are based on so-called concatenative TTS, which uses a large database of high-quality recordings, collected from a single voice actor over many hours. These recordings are split into tiny chunks that can then be combined - or concatenated - to form complete utterances as needed. However, these systems can produce unnatural-sounding voices and are also difficult to modify, because a whole new database needs to be recorded each time a change, such as a new emotion or intonation, is required.
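The chunk-joining step above can be sketched in a few lines. This is a toy illustration only: the `unit_db` mapping and its phonetic unit names are hypothetical stand-ins for a real unit database, which would hold many thousands of recorded snippets and select among them carefully.

```python
import numpy as np

SAMPLE_RATE = 16000
rng = np.random.default_rng(0)

# Hypothetical stand-in database: each phonetic unit maps to a short
# pre-recorded waveform chunk (here, 50 ms of random noise as a placeholder).
unit_db = {
    "he": rng.normal(size=800),
    "el": rng.normal(size=800),
    "lo": rng.normal(size=800),
}

def synthesize(units):
    """Join the stored chunks end to end to form an utterance."""
    return np.concatenate([unit_db[u] for u in units])

utterance = synthesize(["he", "el", "lo"])
print(utterance.shape)  # (2400,)
```

The sketch also shows why such systems are rigid: the output can only ever be stitched together from what is already in the database, so any new speaking style means recording new chunks.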

To overcome some of these problems, an alternative model known as parametric TTS is sometimes used. This does away with the need for concatenating sounds by using a series of rules and parameters about grammar and mouth movements to guide a computer-generated voice. Although cheaper and quicker, this method produces less natural-sounding voices.
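The contrast with the concatenative approach can be sketched as follows: here the waveform is generated entirely from a handful of parameters (just pitch and duration in this toy example) rather than assembled from recordings. A real parametric system drives a full vocoder with many more parameters; the harmonic-sum "source" below is a hypothetical simplification.

```python
import numpy as np

SAMPLE_RATE = 16000

def synthesize(f0_hz, duration_s):
    """Generate a buzzy voiced tone from a pitch parameter alone."""
    t = np.arange(int(SAMPLE_RATE * duration_s)) / SAMPLE_RATE
    # Sum a few harmonics of the fundamental to mimic a glottal source.
    return sum(np.sin(2 * np.pi * f0_hz * k * t) / k for k in range(1, 5))

tone = synthesize(120.0, 0.1)  # 100 ms of a 120 Hz voiced tone
print(tone.shape)  # (1600,)
```

Because everything is computed from parameters, changing the voice is cheap - but the simplifications in the generation rules are also why the result sounds less natural.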

WaveNet takes a totally different approach. In the original paper we described a deep generative model that can create individual waveforms from scratch, one sample at a time, at 16,000 samples per second and with seamless transitions between individual sounds.
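The sample-at-a-time generation loop can be sketched as below. This is a minimal sketch of the autoregressive idea only: `toy_model` is a hypothetical stand-in that returns an arbitrary distribution, whereas the real WaveNet network conditions its prediction on the previous samples through a stack of dilated convolutions.

```python
import numpy as np

SAMPLE_RATE = 16000        # samples per second, as in the paper
QUANTIZATION_LEVELS = 256  # WaveNet quantizes audio to 256 values

rng = np.random.default_rng(0)

def toy_model(context):
    """Stand-in for the network: returns a probability distribution
    over the next sample value given all previous samples."""
    logits = rng.normal(size=QUANTIZATION_LEVELS)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def generate(num_samples):
    """Draw one sample at a time, each conditioned on all earlier ones."""
    waveform = []
    for _ in range(num_samples):
        probs = toy_model(waveform)          # predict distribution for next sample
        waveform.append(rng.choice(QUANTIZATION_LEVELS, p=probs))
    return np.array(waveform)

audio = generate(SAMPLE_RATE // 100)  # 10 ms of audio = 160 samples
print(audio.shape)  # (160,)
```

The loop makes the computational cost clear: one full forward pass of the model per sample, 16,000 times for every second of audio.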