During training, the student network starts off in a random state. It is fed random white noise as input and is tasked with producing a continuous audio waveform as output. The generated waveform is then fed to the trained WaveNet model (the teacher), which scores each sample, giving the student a signal of how far it is from the teacher network’s desired output. Over time, the student network can be tuned, via backpropagation, to learn what sounds it should produce. Put another way, both the teacher and the student output a probability distribution for the value of each audio sample, and the goal of training is to minimise the KL divergence between the teacher’s distribution and the student’s distribution.
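
To make the objective concrete, here is a minimal sketch of a per-sample KL divergence averaged over a waveform. It is an illustration under stated assumptions, not the actual training code: it treats each audio sample as a discrete distribution over a handful of quantisation levels, it computes the divergence in the direction KL(student || teacher), and the names `kl_divergence` and `distillation_loss` are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    # eps guards against log(0); terms with p_i == 0 contribute nothing.
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def distillation_loss(student_dists, teacher_dists):
    """Mean per-sample KL(student || teacher) over one waveform (assumed direction)."""
    total = sum(kl_divergence(s, t) for s, t in zip(student_dists, teacher_dists))
    return total / len(student_dists)

# Toy data: 3 audio samples, each a distribution over 4 quantisation levels.
teacher = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.25, 0.25, 0.25, 0.25],
]
student_untrained = [[0.25, 0.25, 0.25, 0.25] for _ in teacher]  # no preference yet
student_matched = [row[:] for row in teacher]                    # agrees with the teacher

loss_untrained = distillation_loss(student_untrained, teacher)
loss_matched = distillation_loss(student_matched, teacher)
```

The loss is positive while the student disagrees with the teacher and falls to zero once the two distributions coincide, which is the signal backpropagation uses to tune the student.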

The training method has parallels to the set-up for generative adversarial networks (GANs), with the student playing the role of generator and the teacher as the discriminator. However, unlike GANs, the student’s aim is not to “fool” the teacher but to cooperate and try to match the teacher’s performance.



Although the training technique works well, we also need to add a few extra loss functions to guide the student towards the desired behaviour. Specifically, we add a perceptual loss to avoid bad pronunciations, a contrastive loss to further reduce the noise, and a power loss to help match the energy of the human speech. Without the power loss, for example, the trained model whispers rather than speaking out loud.
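
As a rough illustration of why a power term discourages whispering, the sketch below compares the average power of a generated waveform with that of a reference and combines the four loss terms as a weighted sum. This is a deliberate simplification and an assumption on our part: it uses mean squared amplitude over the whole waveform, the weights are placeholders, and the function names `average_power`, `power_loss` and `total_loss` are hypothetical.

```python
import math

def average_power(waveform):
    """Mean squared amplitude of a waveform given as a list of floats."""
    return sum(x * x for x in waveform) / len(waveform)

def power_loss(generated, reference):
    """Squared difference in average power (a simplified stand-in for a power loss)."""
    return (average_power(generated) - average_power(reference)) ** 2

def total_loss(kl, perceptual, contrastive, power, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four terms; the weights here are placeholder assumptions."""
    w_kl, w_perc, w_contr, w_pow = weights
    return w_kl * kl + w_perc * perceptual + w_contr * contrastive + w_pow * power

# Toy signals: a sine wave standing in for speech, and a much quieter copy.
speech = [math.sin(2 * math.pi * i / 20) for i in range(200)]
whisper = [0.1 * x for x in speech]

loss_whisper = power_loss(whisper, speech)  # penalised: far too little energy
loss_same = power_loss(speech, speech)      # no penalty: energy matches
```

A waveform with the right shape but too little energy still incurs a penalty here, so the student is pushed towards speaking at the reference loudness.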



Adding all of these together allowed us to train the parallel WaveNet to achieve the same quality of speech as the original WaveNet, as shown by the mean opinion scores (MOS), a scale of 1-5 that measures how natural-sounding the speech is according to tests with human listeners. Note that even human speech is rated at just 4.667 on the MOS scale.

