Dice games, Markov chains, and RNNs aren’t the only ways to make algorithmic music. Some machine learning practitioners explore alternative approaches like hierarchical temporal memory or principal component analysis. But I’m focusing on neural nets because they are responsible for most of the big changes recently. (Though even within the domain of neural nets there are some directions I’m leaving out that have fewer examples, such as restricted Boltzmann machines for composing 4-bar jazz licks or short variations on a single song, hybrid RNN-RBM models, hybrid autoencoder-LSTM models, and neuroevolutionary strategies.)

The power of RNNs wasn’t common knowledge until Andrej Karpathy’s viral post “The Unreasonable Effectiveness of Recurrent Neural Networks” in May 2015. Andrej showed that a relatively simple neural network called char-rnn could reliably recreate the “look and feel” of any text, from Shakespeare to C++. Just as the popularity of dice games was buoyed by a resurgence of rationalism and interest in mathematics, Andrej’s article came at a time when interest in neural networks was exploding, triggering a renewed interest in recurrent networks. Some of the first people to test Andrej’s code applied it to symbolic music notation.

“Eight Short Outputs” by Bob Sturm, using char-rnn and 23,000 ABC transcriptions of Irish folk music. He has also led groups to perform these compositions.

By Gaurav Trivedi, using char-rnn and 207 tabla rhythms.

Some people started with char-rnn as inspiration, but developed their own architecture specifically for working with music. Notable examples come from Daniel Johnson and Ji-Sung Kim.

Custom RNN architecture trained on classical music for piano.

deepjazz uses the same architecture as char-rnn and trains on a single song.

Christian Walder uses LSTMs in a more unusual way: starting with a pre-defined rhythm, and asking the neural net to fill in the pitches. This provides a lot of the global structure that is otherwise usually missing, but heavily constrains the possibilities.

Example from “Modeling Symbolic Music: Beyond the Piano Roll” by Christian Walder, trained on Baroque sonatas.

While all the examples so far are based on symbolic representations of music, some enthusiasts pushed char-rnn to its limits by feeding it raw audio.

By Priya Pramesi, trained on Joanna Newsom.

Unfortunately it seems that char-rnn is fundamentally limited in its capacity to abstract higher-level representations of raw audio. The most inspiring results on audio turned out to be nothing more than noisy copies of the source material (some people explain this when sharing their work, see SomethingUnreal modeling his own speech). In machine learning this is related to the concept of “overfitting”: when a model can recreate the training data faithfully, but can’t effectively generalize to anything novel that it hasn’t been trained on. During training, the model initially performs poorly on both the training data and novel data, then it starts to perform better on both. But if you let it train too long, it keeps getting better at recreating the training data at the expense of generalizing to novel data. Researchers stop the training just before hitting that point, a technique called early stopping. But overfitting is not so clearly a “problem” in creative contexts, where recombination of existing material is a common strategy that is hard to distinguish from “generalization”. Some people like David Cope go so far as to say “all music [is] essentially inspired plagiarism” (though he has also been accused of publishing pseudoscience and straight-up plagiarism).
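To make “stop just before that point” concrete, here’s a rough sketch of early stopping in Python. The `train_one_epoch` and `validation_loss` functions are hypothetical stand-ins for a real training loop; here they just simulate a validation loss that improves and then degrades:

```python
losses = iter([1.0, 0.6, 0.4, 0.35, 0.37, 0.40, 0.45])

def train_one_epoch():   # stand-in: a real loop would update the model weights
    pass

def validation_loss():   # stand-in: loss measured on novel, held-out data
    return next(losses)

best, patience, bad_epochs, epoch = float("inf"), 2, 0, 0
while bad_epochs < patience:
    train_one_epoch()
    loss = validation_loss()
    epoch += 1
    if loss < best:
        best, bad_epochs = loss, 0
    else:
        bad_epochs += 1  # generalization is degrading: overfitting has begun
print(f"stopped after epoch {epoch}, best validation loss {best}")
```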

In September 2016 DeepMind published their WaveNet research demonstrating an architecture that can build higher level abstractions of audio sample-by-sample.

Diagram of the dilated convolutions used in the WaveNet architecture.

WaveNet sample-by-sample probability distributions across the range of 8-bit values.

Instead of using a recurrent network to learn representations over time, they used a convolutional network. Convolutional networks learn combinations of filters. They’re normally used for processing images, but WaveNet treats time like a spatial dimension.
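The “dilated” part of the diagram above is what makes this tractable: doubling the dilation at each layer grows the receptive field exponentially with depth. Here’s a minimal sketch of a stack of dilated causal convolutions in PyTorch; this is my own illustration, not DeepMind’s code, and the channel and layer counts are arbitrary:

```python
import torch
import torch.nn as nn

class DilatedCausalStack(nn.Module):
    def __init__(self, channels=32, n_layers=8, kernel_size=2):
        super().__init__()
        self.kernel_size = kernel_size
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size, dilation=2 ** i)
            for i in range(n_layers)  # dilations 1, 2, 4, ..., 128
        )

    def forward(self, x):  # x: (batch, channels, time)
        for conv in self.convs:
            pad = (self.kernel_size - 1) * conv.dilation[0]
            # left-pad only, so each output depends only on past samples (causal)
            x = torch.relu(conv(nn.functional.pad(x, (pad, 0))))
        return x

stack = DilatedCausalStack()
out = stack(torch.randn(1, 32, 16000))  # one second of 16kHz audio, same length out
# with kernel_size=2 and 8 layers, each output sample sees the previous 256 samples
```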

Samples of WaveNet trained on piano music from YouTube.

Looking into the backgrounds of the co-authors reveals some interesting predecessors to WaveNet.

One of my favorite things to emerge from the WaveNet research is this rough piano imitation by Sageev Oore, who was on sabbatical at Google Brain at the time.

Sageev Oore performs “sample_3.wav” by WaveNet

In April 2017, Magenta built on WaveNet to create NSynth, a model for analyzing and generating monophonic instrument sounds. They created an NSynth-powered “Sound Maker” experiment in collaboration with Google Creative Lab New York. I worked with the Google Creative Lab in London to build NSynth into an open-source portable MIDI synthesizer, called “NSynth Super”.

Demonstration of linear interpolation between two sounds compared to NSynth interpolation.
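To clarify what that comparison means: linearly interpolating raw samples just crossfades the two sounds, while NSynth interpolates in a learned embedding space, which can produce a genuinely new timbre. A rough sketch, where `encode` and `decode` are hypothetical placeholders for a trained NSynth-style autoencoder:

```python
import numpy as np

def linear_interp(x, y, a=0.5):
    # naive: the result just sounds like both sources playing at once
    return (1 - a) * x + a * y

def latent_interp(x, y, encode, decode, a=0.5):
    # NSynth-style: mix in the embedding space, then synthesize
    # (encode/decode are placeholders for a trained autoencoder)
    return decode((1 - a) * encode(x) + a * encode(y))

t = np.linspace(0, 1, 16000)
tone_a = np.sin(2 * np.pi * 440 * t)           # toy stand-ins for two
tone_b = np.sign(np.sin(2 * np.pi * 440 * t))  # instrument recordings
mixed = linear_interp(tone_a, tone_b)          # a crossfade, not a new instrument
```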

NSynth Super (2018) by Google Creative Lab.

In February 2017 a team from Montreal led by Yoshua Bengio published SampleRNN (with code) for sample-by-sample generation of audio using a set of recurrent networks in a hierarchical structure. This research was influenced by experiments from Ishaan Gulrajani who trained a hierarchical version of char-rnn on raw audio.

Simplified snapshot of the SampleRNN architecture: a hierarchy of recurrent networks (tiers 2 and 3) at slower time scales, combined with one standard neural network (tier 1) at the fastest time scale, all using the same upsampling ratio (4).
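As a toy illustration of that hierarchy (with made-up dimensions, GRUs standing in for the paper’s recurrent units, and naive repetition instead of learned upsampling), one forward pass might look like this in PyTorch:

```python
import torch
import torch.nn as nn

R, H = 4, 64                                 # upsampling ratio and hidden size (toy)
tier3 = nn.GRU(R * R, H, batch_first=True)   # slowest tier: frames of 16 samples
tier2 = nn.GRU(R + H, H, batch_first=True)   # middle tier: frames of 4 samples
tier1 = nn.Sequential(nn.Linear(R + H, H), nn.ReLU(), nn.Linear(H, 256))  # sample-level MLP

x = torch.rand(1, 256)                       # toy waveform, 256 samples
c3, _ = tier3(x.view(1, -1, R * R))          # (1, 16, H): summary at the slowest scale
c3 = c3.repeat_interleave(R, dim=1)          # upsample conditioning to the tier-2 rate
c2, _ = tier2(torch.cat([x.view(1, -1, R), c3], dim=-1))  # (1, 64, H)
c2 = c2.repeat_interleave(R, dim=1)          # upsample to the per-sample rate
prev = nn.functional.pad(x, (R, 0))[:, :-1].unfold(1, R, 1)  # R preceding samples per step
logits = tier1(torch.cat([prev, c2], dim=-1))  # (1, 256, 256): 8-bit prediction per sample
```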

SampleRNN trained on over a hundred hours of speech from a single person (the Blizzard dataset).

SampleRNN trained on all 32 of Beethoven’s piano sonatas.

By Richard Assar, trained on 32 hours of Tangerine Dream, using his port of the original code.

By DADABOTS, trained on the album Diotima by Krallice, accepted to NIPS 2017.

Both SampleRNN and WaveNet take an unusually long time to train (more than a week), and without optimizations (like fast-wavenet) they generate audio many times slower than realtime. To reduce training and generation time, researchers use audio at 16kHz and 8 bits.
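The 8-bit representation is not a plain linear quantization: WaveNet, for example, compresses each sample with mu-law companding, which spends more of the 256 levels on quiet sounds. A minimal numpy version:

```python
import numpy as np

def mu_law_encode(x, mu=255):
    # x: floating point samples in [-1, 1] -> integer bins 0..255
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int32)

def mu_law_decode(bins, mu=255):
    # integer bins 0..255 -> approximate samples in [-1, 1]
    y = 2 * (bins.astype(np.float64) / mu) - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

x = np.linspace(-1, 1, 5)
assert np.allclose(mu_law_decode(mu_law_encode(x)), x, atol=0.01)
```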

But for companies like Google or Baidu, the primary application of audio generation is text to speech, where fast generation is essential. In March 2017 Google published their Tacotron research, which generates audio frame-by-frame using a spectral representation as an intermediate output step and a sequence of characters (text) as input.

Tacotron architecture, showing a mixture of techniques including attention, bidirectional RNNs, and convolution.

The Tacotron demo samples are comparable in quality to WaveNet’s, with some small discrepancies. In May 2017, Baidu built on the Tacotron architecture with their Deep Voice 2 research, increasing the audio quality by adding some final stages specific to speech generation. Because generating audio from amplitude spectra requires a phase reconstruction step, the quality of polyphonic and noisy audio from this approach can be limited. But this hasn’t stopped folks like Dmitry Ulyanov from using spectra for audio stylization, while Leon Fedden, Memo Akten and Max Frenzel have used spectra for generation. For phase reconstruction, Tacotron, Dmitry and Max use Griffin-Lim, while Leon and Memo use LWS. Leon, Memo and Max all use an autoencoder to build a latent space across spectrograms.
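For a sense of what that phase reconstruction step involves, here’s a minimal Griffin-Lim sketch using librosa (the file path and STFT parameters are just illustrative): we keep only the magnitude of the STFT, discarding phase, then iteratively estimate a phase that is consistent with those magnitudes:

```python
import numpy as np
import librosa

y, sr = librosa.load("input.wav", sr=16000)  # placeholder path: any mono audio file
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))  # magnitude only, phase discarded
y_rec = librosa.griffinlim(S, n_iter=32, hop_length=256)  # iterative phase estimation
```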

Besides Dmitry, other researchers who have looked into style transfer include Parag Mital in November 2017 (focused on audio stylization) and Mor et al. in May 2018 (focused on musical style transfer across instruments/genres). For earlier work on audio style transfer using only concatenative synthesis, “Audio Analogies” (2005) provides a lot of inspiration.

In November 2017, DeepMind published their “Parallel WaveNet” technique where a slow-to-train WaveNet teaches a fast-to-generate student. Instead of predicting a 256-way 8-bit output, they use a discretized mixture of logistics (DMoL), which allows for 16-bit output. Google immediately started using Parallel WaveNet in production. In December 2017, Google published Tacotron 2 using a parallel WaveNet as the synthesis (vocoder) step instead of Griffin-Lim phase reconstruction. This kicked off a wave of papers focusing on speech synthesis conditioned on mel spectra, including ClariNet (which also introduces an end-to-end text-to-wave architecture), WaveGlow and FloWaveNet. In October 2018, Google published a controllable version of their Tacotron system, allowing them to synthesize voice in different styles (something they proposed in the original Tacotron blog post). There is a wealth of other research related to speech synthesis, but it isn’t always relevant to the more general task of generating audio in a musical context.
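For the curious, here’s a rough numpy sketch of what a discretized mixture of logistics computes, following the PixelCNN++ formulation that Parallel WaveNet builds on (and ignoring the special-cased edge bins at ±1): the probability of a sample is the logistic CDF mass falling inside its quantization bin, summed over mixture components:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def dmol_prob(x, pi, mu, s, bin_width=2 / 65536):
    # x: a sample in [-1, 1], quantized into 16-bit bins
    # pi, mu, s: mixture weights, means, and scales of each logistic component
    upper = sigmoid((x + bin_width / 2 - mu) / s)
    lower = sigmoid((x - bin_width / 2 - mu) / s)
    return np.sum(pi * (upper - lower))

# e.g. a 3-component mixture evaluated at one sample value
p = dmol_prob(0.1, pi=np.array([0.5, 0.3, 0.2]),
              mu=np.array([0.0, 0.1, -0.2]),
              s=np.array([0.05, 0.01, 0.1]))
```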

In February 2018, DeepMind published “Efficient Neural Audio Synthesis” or “WaveRNN”, which achieves fast generation through a handful of optimizations. Instead of using DMoL outputs, they get 16-bit output from two separate 8-bit outputs: one for the high bits, and one for the low bits.
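That split is simple enough to show directly; a sketch of packing and unpacking a 16-bit sample into coarse (high) and fine (low) bytes, each of which gets its own 256-way prediction:

```python
def split_sample(s):            # s: unsigned 16-bit sample, 0..65535
    coarse, fine = s >> 8, s & 0xFF
    return coarse, fine         # each in 0..255

def join_sample(coarse, fine):
    return (coarse << 8) | fine

assert join_sample(*split_sample(43210)) == 43210
```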