Technical Overview

The interaction process looks as follows.

Because RNNs do not work well with multidimensional data, MIDI signals must first be mapped onto an artificial alphabet in which every musical element corresponds to a unique character or hash code. During the interaction, a MIDI signal passes through the Magenta MIDI interface and is converted into the NoteSequence format, a protocol buffers format used for exchanging data with a TensorFlow model. The TensorFlow model is an RNN that receives an input sequence and produces a response, using that sequence as a seed.
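To make the alphabet idea concrete, here is a minimal sketch in plain Python. It is not Magenta's actual encoder: the (step, pitch) input format, the function names, and the choice of one token per set of simultaneous drum hits are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical input format: (quantized_step, midi_pitch) for every drum hit.
Hit = Tuple[int, int]


def hits_to_steps(hits: List[Hit], num_steps: int) -> List[FrozenSet[int]]:
    """Group drum hits by step; a step with no hits becomes an empty set."""
    steps = [set() for _ in range(num_steps)]
    for step, pitch in hits:
        if 0 <= step < num_steps:
            steps[step].add(pitch)
    return [frozenset(s) for s in steps]


def build_alphabet(steps: List[FrozenSet[int]]) -> Dict[FrozenSet[int], int]:
    """Give every distinct combination of pitches its own integer token."""
    alphabet: Dict[FrozenSet[int], int] = {}
    for s in steps:
        if s not in alphabet:
            alphabet[s] = len(alphabet)
    return alphabet


def encode(steps: List[FrozenSet[int]],
           alphabet: Dict[FrozenSet[int], int]) -> List[int]:
    """Turn the multidimensional step data into a flat token sequence."""
    return [alphabet[s] for s in steps]


# Half a bar of a simple beat: 36 = kick, 38 = snare, 42 = closed hi-hat
# (General MIDI drum pitches).
hits = [(0, 36), (0, 42), (2, 42), (4, 38), (4, 42), (6, 42)]
steps = hits_to_steps(hits, num_steps=8)
alphabet = build_alphabet(steps)
tokens = encode(steps, alphabet)
print(tokens)
```

Each token now stands for "everything that sounds at this step", which gives the RNN a one-dimensional sequence to work with.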

At first, I intended to train TensorFlow models from scratch, but I ran into some problems. To begin with, it was difficult to obtain large data sets in specific styles of music. It is possible to download some training data (for example, from http://colinraffel.com/projects/lmd/), but it makes little sense to train your own models on the same data set that the Magenta team uses. So I decided to take one of Magenta’s pre-trained models (an LSTM RNN with two hidden layers of 128 elements) and continue its training process. I downloaded a lot of breakbeat music in MIDI format, ranging from Prodigy’s Out of Space and Goldie’s Inner City Life to Moby and Massive Attack. Magenta’s GitHub repository contains conversion tools that can help you prepare your own training set from a collection of MIDI files.

The second problem arose when I tried to start the training process on the AWS Cloud. I started the process on a p2.8xlarge instance, but kept running into non-obvious hang-ups related to native calls. Unfortunately, I had no time to investigate that problem.

Only the last two bars were taken into account in each training instance, so we needed only a few nights to train the models with different settings. Gradient descent was not especially fast, and I cannot say whether the loss was still trending downward or merely oscillating around local minima, but the result was acceptable for my purposes.
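The two-bar window can be pictured as a simple truncation of the token sequence. The sketch below is only an illustration: the function name is hypothetical, and the resolution of 16 steps per bar (4/4 time, four steps per quarter note) is an assumption, not a setting taken from our training runs.

```python
from typing import List


def last_bars(tokens: List[int], bars: int = 2,
              steps_per_bar: int = 16) -> List[int]:
    """Keep only the final `bars` bars of a quantized token sequence.

    steps_per_bar=16 assumes 4/4 time with four steps per quarter note.
    """
    window = bars * steps_per_bar
    if window <= 0:
        return []
    # Slicing with a negative start returns the whole list when it is
    # shorter than the window, so short inputs pass through unchanged.
    return tokens[-window:]


sequence = list(range(40))    # 2.5 bars of placeholder tokens
context = last_bars(sequence)
print(len(context), context[0], context[-1])
```

Anything older than the window is simply invisible to the model, which keeps every training instance short.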

The last checkpoint from the training process was converted into a bundle file using Magenta’s tools. Our configuration consisted of two pairs of virtual MIDI ports with a separate model for each. The first model listened to the rhythm provided by one of the live musicians and then provided a rhythm in response. The second model listened to the first model’s output and generated its own response, and so on. It is also possible to build a setup in which two or more models listen to each other and create music together. We ran one such test as well, letting our models play with each other while we took a lunch break. The result was vaguely trance-like and tinged with Afrobeat, bizarre but not unpleasant.
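The chaining logic can be sketched independently of Magenta. In the example below the two "models" are trivial deterministic stand-ins (hypothetical functions, not RNNs), and the real setup passed MIDI through virtual ports rather than Python lists. Only the turn-taking pattern is the point.

```python
from typing import Callable, List

Phrase = List[int]
Model = Callable[[Phrase], Phrase]


def call_response(seed: Phrase, models: List[Model],
                  turns: int) -> List[Phrase]:
    """Feed each model the previous phrase, cycling through the models."""
    history = [seed]
    current = seed
    for turn in range(turns):
        model = models[turn % len(models)]
        current = model(current)
        history.append(current)
    return history


# Stand-in "models": a real one would be an RNN generating from the seed.
def rotate(phrase: Phrase) -> Phrase:
    return phrase[1:] + phrase[:1]


def mirror(phrase: Phrase) -> Phrase:
    return list(reversed(phrase))


human_seed = [1, 0, 2, 0]    # phrase played by the live musician
history = call_response(human_seed, [rotate, mirror], turns=3)
for phrase in history:
    print(phrase)
```

With a single model in the list this reduces to plain call and response, and with several models and a long run it becomes the "models playing with each other" experiment.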

The next step was to figure out how to bind the models together with live musicians during an improvisation session. Aleksandr Zedeljov proposed that we use Ableton Live as a universal glue. It is a music sequencer and digital audio workstation designed for live performance as well as for music production. The MaxMSP plugin has previously been used to bind Ableton Live with Magenta, but this solution did not work for us, because MaxMSP kept crashing and taking Ableton down with it. So we ended up discarding MaxMSP and connecting Ableton and Magenta directly. Later, we also had some problems with syncing Ableton and Magenta via midi_clock.
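For context on the synchronization issue: MIDI timing clock messages are sent 24 times per quarter note, so the tempo can be recovered from their arrival times. The sketch below shows only that arithmetic. The function name and the list of timestamps are assumptions, and it is not how Magenta or Ableton handle the clock internally.

```python
from typing import List

PULSES_PER_QUARTER = 24    # fixed by the MIDI specification


def bpm_from_clock(timestamps: List[float]) -> float:
    """Estimate tempo from arrival times (in seconds) of MIDI clock pulses."""
    if len(timestamps) < 2:
        raise ValueError("need at least two clock pulses")
    span = timestamps[-1] - timestamps[0]
    if span <= 0:
        raise ValueError("timestamps must be increasing")
    mean_interval = span / (len(timestamps) - 1)
    # One quarter note lasts 24 pulse intervals; 60 seconds per minute.
    return 60.0 / (mean_interval * PULSES_PER_QUARTER)


# Synthetic pulses at 120 BPM: 48 pulses per second.
pulses = [i / 48 for i in range(49)]
estimated_bpm = bpm_from_clock(pulses)
print(round(estimated_bpm, 3))
```

Averaging over many pulses smooths out jitter in individual arrival times, which is usually where clock-based sync starts to drift.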

Our first attempts at improvisation looked like this:

Aleksandr Zedeljov played a rhythm example on the drum pad (Ableton Push 2), the model received the MIDI signals through Magenta, and then produced a response.

We went through many “try, test, modify, repeat” cycles. It was quite entertaining, especially when the results were unusual. The main difficulty was that the models gave a different response every time, so every test amounted to pure improvisation. During the process, we noticed that longer input from the live musicians seemed to result in more sensible responses from the models. It felt like the models gained courage as they kept working. I decided to check whether this was merely our impression or whether there were actual differences in the performance of the RNNs. In the beginning, the models produced responses with a log-likelihood of about -70, but after a certain amount of time the value fell to -150, -400, and even -700, a difference we could hear. This seems to be related to the internal state of the RNN, which appears to converge over time to values that generate increasingly better responses.
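As a reminder of what this number measures: the log-likelihood of a generated sequence is the sum of the log-probabilities the model assigned to each step it emitted. The sketch below uses made-up per-step probabilities and hypothetical function names. It does not reproduce Magenta's internals.

```python
import math
from typing import List


def sequence_log_likelihood(step_probs: List[float]) -> float:
    """Sum of log-probabilities of the events the model actually emitted."""
    return sum(math.log(p) for p in step_probs)


def per_step_log_likelihood(step_probs: List[float]) -> float:
    """Length-normalized variant, for comparing sequences of unequal length."""
    return sequence_log_likelihood(step_probs) / len(step_probs)


short_response = [0.5] * 8     # made-up probabilities, 8 steps
long_response = [0.5] * 32     # same per-step confidence, 32 steps

ll_short = sequence_log_likelihood(short_response)
ll_long = sequence_log_likelihood(long_response)
print(round(ll_short, 2), round(ll_long, 2))
print(per_step_log_likelihood(short_response) ==
      per_step_log_likelihood(long_response))
```

Because the total is a sum over steps, a longer response will generally have a more negative total even at the same per-step confidence, so dividing by the length may give a fairer comparison between responses of different lengths.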

We decided to film our first real improvisation session with live musicians in Playtech’s Tallinn office. It was pretty cool, because the office was empty that late in the evening and we were up on the 10th floor, with a view of planes landing at the airport.

A bit of Moog :)

Work moments…

Magenta’s browser interface enabled us to monitor what was going on within the models in real time, making it possible for us to change the parameters of the models on the fly (see the orange “bricks” running on the screen).

Performance

Martin Altrov, Aleksandr Zedeljov, Aleksei Semenihhin (MODULSHTEIN)

20 minutes before the start

Control panel

We achieved much better results during our Topconf performance thanks to the additional time we had for tuning the whole system. However, because we lacked the time and data to train melodic models, the two RNNs provided only the rhythm section.

MIDI signals can also be used for controlling digital video workstations, so it would be interesting to also use models that produce video responses in order to supplement the music with an improvised video stream. There are a lot of possible approaches to chaining various models and combining them with music and video devices, experimenting with various harmonic models, implementing call-response loops during any intermediate step, and so on.

Many thanks to the team: Aleksandr Zedeljov, Martin Altrov, Aleksei Semenihhin, and Nikolay Alhazov. Thanks also to Playtech, and personally to Marianne Võime and Ergo Jõepere.