Any sound we hear, or any digital sound clip, can be represented as a sequence of amplitude values taken at the interval of the sampling rate. The position of the resulting wave pattern at any moment is the sound heard at that instant as it vibrates the sensitive hairs within the ear. The oscillation of amplitude over time creates the frequencies of the overall wave, which in turn produce the unique pitch we hear in any sound. Amplitude, on the other hand, gives the magnitude of vibration at any moment. Together, the pattern of amplitudes produces the frequencies and volume that the listener perceives as familiar sound.
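This sampled-amplitude view is easy to see in code. Below is a minimal sketch using NumPy; the tone, sampling rate, and amplitude values are illustrative choices, not anything from a specific system:

```python
import numpy as np

# A 440 Hz (A4) tone, one second long, sampled at 22,050 Hz:
# the entire sound is just this array of amplitude values.
sample_rate = 22050            # samples per second
duration = 1.0                 # seconds
t = np.arange(int(sample_rate * duration)) / sample_rate

amplitude = 0.5                # magnitude of vibration (perceived as volume)
frequency = 440.0              # oscillations per second (perceived as pitch)
samples = amplitude * np.sin(2 * np.pi * frequency * t)

print(samples.shape)           # one amplitude value per sampling interval
print(samples[:3])             # the first few amplitude values
```

Feeding these values to a sound card at the same sampling rate reproduces the tone; changing `frequency` changes the pitch, and changing `amplitude` changes the volume.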

Scary Fourier Transform Equation (Source)

A Fourier transform can be performed on short windows of any wave to find its spectrogram, the density of frequencies over time. This analysis helps show which frequencies make up familiar sounds. However, a spectrogram is a lossy derivative of the overall wave: a magnitude spectrogram cannot be played back directly, because the phase of the original wave has been discarded. Altogether this creates a dynamic in which unique sounds are best modeled as spectrograms of frequencies but cannot be heard until converted back into amplitude-based waves. There is no exact inverse of this magnitude-only transform (only classical approximations such as Griffin-Lim), which creates an enticing prediction task for AI and machine learning.
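The spectrogram computation can be sketched with nothing but NumPy. The frame size and hop length below are common but arbitrary choices; the key point is the `np.abs` call, which keeps the magnitude of each FFT and throws the phase away:

```python
import numpy as np

def magnitude_spectrogram(samples, frame_size=1024, hop=256):
    """Short-time Fourier transform magnitudes: frequency energy over time.
    The phase of each FFT is discarded here, which is exactly why a
    spectrogram alone cannot be turned straight back into a playable wave."""
    window = np.hanning(frame_size)
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size] * window
        frames.append(np.abs(np.fft.rfft(frame)))  # magnitude only
    return np.array(frames)        # shape: (time_steps, frame_size // 2 + 1)

sample_rate = 22050
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)      # a pure 440 Hz tone

spec = magnitude_spectrogram(tone)
bin_hz = sample_rate / 1024               # Hz covered by each FFT bin
peak_bin = int(spec[0].argmax())
print(peak_bin * bin_hz)                  # close to 440 Hz, as expected
```

For a pure tone the strongest bin sits near 440 Hz in every frame; real speech instead shows shifting bands of energy, which is the pattern the generative models below learn to produce.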

Wave form(Top) and converted Spectrogram(Bottom) (Source: MindBuilder AI Research)

With this knowledge it becomes apparent that the best way to use generative models to convert text to sound is to build a text->spectrogram relationship and train a model to perform it.

The fact remains that a spectrogram cannot generate actual sound. To accomplish this, a second generative model can be utilized to perform the spectrogram->wave conversion.

Both of these model types have been heavily researched and have state-of-the-art methods available. For text->spectrogram there is the Tacotron 2 architecture, and likewise for spectrogram->wave there is the WaveNet architecture. By chaining these networks together, the goal of converting text to sound can be achieved.

This pipeline is what was implemented to generate voice from simple text. In the general diagram, our implementation used a trained speaker encoder network for the green section, Tacotron 2 for the synthesizer section, and WaveNet for the vocoder section to finally create the waveform.
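The three-stage pipeline can be sketched as composed functions. Everything below is a hypothetical skeleton: the function bodies, dimensions, and names are placeholders standing in for the real neural networks, not the repository's API:

```python
import numpy as np

# Hypothetical stand-ins for the three trained networks in the diagram.
# A real system replaces each stub with a neural model: the speaker
# encoder, the Tacotron 2 synthesizer, and the WaveNet vocoder.

def speaker_encoder(reference_audio):
    """Green section: compress a reference voice into a fixed-size embedding."""
    return np.full(256, reference_audio.mean())   # placeholder 256-d embedding

def synthesizer(text, embedding):
    """Tacotron 2 role: text + speaker embedding -> mel spectrogram frames."""
    n_frames = 10 * len(text)                     # rough frames-per-character stand-in
    return np.zeros((80, n_frames))               # 80 mel bins is a common choice

def vocoder(mel):
    """WaveNet role: mel spectrogram -> amplitude samples."""
    hop = 256                                     # samples generated per frame
    return np.zeros(mel.shape[1] * hop)

reference = np.random.randn(16000)                # ~1 s of reference audio
wave_out = vocoder(synthesizer("hello world", speaker_encoder(reference)))
print(wave_out.shape)                             # amplitude samples, ready to save
```

The design point is that the stages only communicate through two narrow interfaces, the embedding and the spectrogram, so each network can be trained and swapped independently.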

The Real-Time-Voice-Cloning GitHub repository provides built-in functions that implement both networks and can clone a voice from five seconds or more of audio. Because accumulating data to train such models from scratch is difficult (millisecond-aligned text-spectrogram data is required), the repository supplies pretrained models along with methods to fine-tune on a new voice. This works by using audio libraries to scan the supplied audio and create a speaker embedding of the voice, which is applied to the pretrained Tacotron 2 model to generate frequency spectrograms unique to that voice. The WaveNet vocoder that follows is a generic model that simply converts the spectrogram to wave/amplitude data. This amplitude data can then be converted to .wav format using built-in Python libraries included in the repository. A walkthrough of this process is provided in demo_cli.py and simply requires the user to supply an audio file and text in order to generate any sentence(s).

Finally, this model can take raw text as input and produce a spectrogram with frequencies similar to any source audio file. As discussed above, the final step is to generate usable audio by converting the spectrogram to a waveform using WaveNet. Luckily this conversion is generic across speakers, so no speaker-specific training was required to produce the waveform; a heavily pretrained model by Nvidia was utilized for this portion. The final product is a .wav file, generated from the waveform, of a voice very similar to the source audiobook recording speaking the input text.
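The very last step, turning the amplitude samples into a .wav file, needs no model at all; Python's built-in `wave` module is enough. A minimal sketch (16-bit mono, with illustrative sample-rate and filename values):

```python
import wave
import numpy as np

def save_wav(path, samples, sample_rate=22050):
    """Write float samples in [-1, 1] to a 16-bit mono .wav file."""
    pcm = (np.clip(samples, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)              # mono
        f.setsampwidth(2)              # 16-bit PCM
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())

# Illustrative use: save one second of a 440 Hz tone.
t = np.arange(22050) / 22050
save_wav("generated.wav", 0.5 * np.sin(2 * np.pi * 440.0 * t))
```

The repository wraps this same idea in its own helpers; any float waveform the vocoder emits can be persisted this way.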

Figure: End to End architecture with in-depth TacoTron 2 cells (Source)

In order to take this architecture to a production system that mirrors any supplied voice, the audio file does not need to be supplied on each run: the embedding can simply be saved and applied permanently to the Tacotron 2 model as long as the voice does not need to change. The code in the GitHub repo shows that the methods to generate a spectrogram with Tacotron 2 only require the text to be spoken and the embedding, so it makes sense to save the embedding and pass the same one each time. This can act as the entry point for text received from the API. The next step follows the GitHub demo flow and passes the generated spectrogram to the wave generation function. Finally, the .wav file is saved to the file system, with a link to the resource returned by the API, or, if configured, the file can be returned directly from the API.
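The embedding-caching idea can be sketched as follows. `np.save`/`np.load` handle the persistence; the `compute_embedding` and `synthesize_spectrogram` callables are purely hypothetical stand-ins for the repository's encoder and Tacotron 2 calls, and the path is illustrative:

```python
import os
import numpy as np

EMBEDDING_PATH = "voice_embedding.npy"   # illustrative cache location

def get_embedding(compute_embedding):
    """Run the expensive one-shot enrollment once, then reuse the saved copy."""
    if os.path.exists(EMBEDDING_PATH):
        return np.load(EMBEDDING_PATH)
    embedding = compute_embedding()      # only runs on the first request
    np.save(EMBEDDING_PATH, embedding)
    return embedding

def handle_request(text, compute_embedding, synthesize_spectrogram):
    """API entry point: text in, spectrogram out; embedding comes from disk."""
    embedding = get_embedding(compute_embedding)
    return synthesize_spectrogram(text, embedding)

# Hypothetical stand-ins for the real encoder and synthesizer.
fake_encoder = lambda: np.ones(256)
fake_synth = lambda text, emb: np.zeros((80, 10 * len(text)))

spec = handle_request("hello", fake_encoder, fake_synth)
print(spec.shape)   # subsequent requests skip the encoder and reuse the file
```

Switching voices then amounts to pointing the cache at a different embedding file, which is the "toggled embeddings" idea described next.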

This process flow allows voices to be generated from a single one-shot enrollment that creates an embedding. The API can then be integrated anywhere that receives text input and requires audio. This flexible model can also be used to toggle between embeddings of different voices.

In the future this technology will lead to many new opportunities in entertainment, communication, and mobility. As more developers continue to implement and tweak these architectures, the accuracy and realism of generated content will continue to improve. A world in which we can deploy a virtual version of ourselves may not be far off!