A few experiments using state-of-the-art machine learning algorithms for decomposing and recomposing audio tracks. In the process we’ll look at Python audio tools, TensorFlow, Jupyter Notebooks and the Google Colab platform.

WHY?

Good question.

Last week was arguably the most important week for Italy’s “pop” music market. The national TV hosts a week-long music competition and the winners get to flood mainstream radio, shopping malls and the occasional elevator for the next 4 months, before they disappear forever, replaced by the next Latin-music-based summer hit.

So before that happens let’s take a chance to celebrate the power of Italian melodic love-singing by taking one great song and shovelling it down a blender of AI-based processes.

I chose Tiziano Ferro’s cover of “Almeno tu nell’universo”: it’s a great song and his performance had some intonation problems that made his number very moving IMHO, but most importantly the piano introduction had some nice chords which I could not figure out by ear.

So instead of sitting at the piano and taking 10 minutes of wild guesses I figured it would be better to spend around 8 hours of my time getting AI to do that for me. Yayyy.

1) Source track separation with Spleeter

The first step would be to isolate the piano part.

Deezer has just released Spleeter, a library and TensorFlow model that achieve state-of-the-art source separation from audio files. Separating song vocals from the background has been achieved with a bunch of different techniques (see here for a recap), opening the way for excruciating karaoke nights among other things.

This library gets quite impressive results in separating single instruments within the background, i.e. getting drum, guitar, bass and keyboard tracks out of one single mix. The way they got there was by training a neural network, feeding it the original stems from a big data set of open mixes along with the mixed results.
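To make the training idea concrete, here is a toy sketch in plain Python of what one training example looks like: the network sees the mix and is asked to give back the stems. The stem names and the 4-sample "tracks" are made up for illustration, and this is not Spleeter's actual training code.

```python
# Toy illustration only: real stems are long audio arrays, not 4-sample lists.

def mix_stems(stems):
    """Sum the stems sample by sample to obtain the mixed track."""
    return [sum(samples) for samples in zip(*stems.values())]


def make_training_pair(stems):
    """Return (network input, expected output) for one song."""
    return mix_stems(stems), stems


# Hypothetical stems of one song
stems = {
    "vocals": [0.0, 0.5, 0.25, 0.0],
    "piano": [0.25, 0.0, -0.25, 0.5],
}

mix, targets = make_training_pair(stems)
print(mix)  # [0.25, 0.5, 0.0, 0.5]
```

Repeat this over a big collection of songs and the network gradually learns to go from the mix back to the stems.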

They cannot share the data sets for obvious copyright reasons, but they can share the trained model. Interesting.

Pause for a second to ponder.

Good. Now let’s get started.

Requirements

I will quickly go through the requirements, but if you just want to play around with the system I have set up a Google Colab notebook: you can just make your own copy of the notebook and tweak it. (If you are new to Colab have a read here. Short version: you get to run your code on Google servers, using their GPU/TPU time, comment it and share it with your pals, for free.)

There is also a Docker image you can use, see here: https://github.com/deezer/spleeter/wiki/2.-Getting-started#using-docker-image

You will need Python 3.

If you have a working conda installation you are almost ready. Conda is an excellent package and environment manager, heavily used for Python data science and machine learning tasks. If you are on Windows you should definitely use it. It will help you get running in minutes.

So, if you are using conda go with:

conda install -c conda-forge spleeter

If you are using pip, first make sure you have ffmpeg installed on your system, then

pip install spleeter

We will also use Pafy to download the audio stream:

pip install youtube-dl Pafy

Let’s go

Here’s some basic Python code you can run in a script or in the Python command line:

import pafy

url = "https://www.youtube.com/watch?v=YTXhocAQqBI"
video = pafy.new(url)
# gets a handle to the highest resolution audio stream available
bestaudio = video.getbestaudio()
bestaudio.download(filepath='sanremo.webm')

Then, from the command line, trim and convert the audio with ffmpeg:

# we keep 51 seconds, from 0:03 to 0:54
ffmpeg -i sanremo.webm -ss 3 -to 54 intro.webm
# and convert to mp3
ffmpeg -i intro.webm -vn -ar 44100 -ac 2 -ab 192k -f mp3 intro.mp3

At this point you should have a file called intro.mp3 in your current folder. Listen to it while staring at this picture, remember things will get even better soon:

https://tech.uqido.com/wp-content/uploads/2020/02/tiziano-ferro-intro-original.mp3

Time for separation I guess. So from the command line let’s run:

spleeter separate -i intro.mp3 -o output/

At this point spleeter will download the trained model it needs; after that, separation takes only a few seconds.
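If you would rather drive the command line steps from the same Python script, something along these lines should work. It is only a sketch: the helper names are mine, the arguments are copied from the commands used in this post, and the last line is commented out so nothing runs unless ffmpeg and spleeter are installed.

```python
import subprocess


def trim_command(src, dst, start, end):
    """ffmpeg arguments that keep the audio between start and end (seconds)."""
    return ["ffmpeg", "-i", src, "-ss", str(start), "-to", str(end), dst]


def mp3_command(src, dst):
    """ffmpeg arguments that drop any video and encode a stereo 44.1 kHz mp3."""
    return ["ffmpeg", "-i", src, "-vn", "-ar", "44100", "-ac", "2",
            "-ab", "192k", "-f", "mp3", dst]


def separate_command(src, out_dir):
    """spleeter arguments for the default (2 stems) separation."""
    return ["spleeter", "separate", "-i", src, "-o", out_dir]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


commands = [
    trim_command("sanremo.webm", "intro.webm", 3, 54),
    mp3_command("intro.webm", "intro.mp3"),
    separate_command("intro.mp3", "output/"),
]

print(" ".join(commands[0]))  # ffmpeg -i sanremo.webm -ss 3 -to 54 intro.webm
# run_pipeline(commands)  # uncomment once ffmpeg and spleeter are installed
```

Passing the arguments as a list (instead of one big string) means you do not have to worry about quoting file names that contain spaces.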

If you don’t specify which model you want to use, spleeter will fall back to the default 2 stems separation (voice and background), but you can use:

# for (vocals / bass / drums / other)
spleeter separate -i audio_example.mp3 -o audio_output -p spleeter:4stems

or

# for (vocals / bass / drums / piano / other)
spleeter separate -i audio_example.mp3 -o audio_output -p spleeter:5stems

Now you should have a file output/intro/accompaniment.wav
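As a quick reference, this little helper lists the files you should expect to find after a separation. The 2 stems names match what we get in this post (vocals.wav and accompaniment.wav under output/intro/); the 4 and 5 stems names are my assumption based on the stem lists above, so check your output folder if in doubt.

```python
from pathlib import PurePosixPath

# Stem names per model (assumed for 4stems and 5stems, see the note above)
STEMS = {
    "spleeter:2stems": ["vocals", "accompaniment"],
    "spleeter:4stems": ["vocals", "drums", "bass", "other"],
    "spleeter:5stems": ["vocals", "drums", "bass", "piano", "other"],
}


def expected_outputs(audio_file, output_dir, model="spleeter:2stems"):
    """Return the wav files expected in output_dir/<track name>/."""
    track = PurePosixPath(audio_file).stem  # 'intro.mp3' -> 'intro'
    folder = PurePosixPath(output_dir) / track
    return [str(folder / (stem + ".wav")) for stem in STEMS[model]]


print(expected_outputs("intro.mp3", "output/"))
# ['output/intro/vocals.wav', 'output/intro/accompaniment.wav']
```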

It will sound like this:

https://tech.uqido.com/wp-content/uploads/2020/02/tiziano-ferro-intro-piano.mp3

and look like this: (image)

Ok, there are evident artefacts where the vocals are removed, but it’s not bad.

Let’s hear the vocal track:

https://tech.uqido.com/wp-content/uploads/2020/02/tiziano-ferro-intro-vocals.mp3

Now let’s see if we can convert the piano track to MIDI.

2) Audio to MIDI with Google Magenta

Google Magenta is a set of machine learning libraries that aims to support creative processes (music, digital art, etc). It comes in Python/TensorFlow and JavaScript flavours.

It includes a model called Onsets and Frames that deals with automatic transcription of solo piano performances. The model has been trained on the MAESTRO Dataset, built by recording both the audio and the MIDI data of piano competition performances played on Yamaha Disklaviers, and feeding the pairs into a neural network. Again, I created this complete Google Colab notebook for you to play with, so you do not have to install the whole thing on your dev machine. But if you wish to do so you can find instructions here.

All the initialisation code can be found in the Colab notebook; I will just copy some interesting parts here:

!gsutil -q -m cp -R gs://magentadata/models/onsets_frames_transcription/* /content/onsets-frames/
!unzip -o /content/onsets-frames/maestro_checkpoint.zip -d /content/onsets-frames
CHECKPOINT_DIR = '/content/onsets-frames/train'

Training a deep learning model from scratch can be a very long and resource-consuming process. The nice thing is that you can benefit from the information embedded into the model after this long training process, and using the trained model is relatively fast and light on your CPU. One more exciting idea is that you can take a trained network, freeze the lower layers and just retrain the last part of the structure. Intuitively that means incorporating some basic knowledge that someone has embedded in the model for you and using your resources for higher level, task specific training. Read this if you are interested.

In the code above we download a checkpoint for the Onsets and Frames model, which means that we get the model trained up to a given point, i.e. after feeding it all the examples from the MAESTRO Dataset.

# the imports (tf, configs, data, train_util, ...) are in the notebook's initialisation cells
config = configs.CONFIG_MAP['onsets_frames']
hparams = config.hparams
hparams.use_cudnn = False
hparams.batch_size = 1

examples = tf.placeholder(tf.string, [None])

dataset = data.provide_batch(
    examples=examples,
    preprocess_examples=True,
    params=hparams,
    is_training=False,
    shuffle_examples=False,
    skip_n_initial_records=0)

estimator = train_util.create_estimator(
    config.model_fn, CHECKPOINT_DIR, hparams)

iterator = dataset.make_initializable_iterator()
next_record = iterator.get_next()

This code simply puts things back where they were left at the checkpoint: the same hyperparameters (except for a batch size of 1, since we will only apply the model to one example, and for disabling cuDNN, since we are not training on a GPU), the same estimator (the estimator just encapsulates model functionalities like prediction, evaluation, etc), the same iterator (just a standardised way to loop through data objects, see https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/data/Iterator).

In TensorFlow 1.x you first describe a graph of the operations you want to execute using placeholders, and then you fill the placeholders and run a session, so examples is just a graph node expecting to receive data to process.

to_process = []
pianoAudio = './output/intro/accompaniment.wav'
with open(pianoAudio, mode='rb') as file:
    wav_data = file.read()

# we need to prepare each element for ingestion by the prediction pipeline
example_list = list(
    audio_label_data_utils.process_record(
        wav_data=wav_data,
        # this is a protocol buffer, i.e. a container for the predicted note sequence
        ns=music_pb2.NoteSequence(),
        # this is just a label
        example_id='accompaniment.wav',
        min_length=0,
        max_length=-1,
        allow_empty_notesequence=True))
to_process.append(example_list[0].SerializeToString())

sess = tf.Session()

# set up basic vars
sess.run([
    tf.initializers.global_variables(),
    tf.initializers.local_variables()
])

# initialize the iterator to a set of (1) files to process
sess.run(iterator.initializer, {examples: to_process})

# set up a chain of functions for the input side of the pipeline
def transcription_data(params):
    del params
    return tf.data.Dataset.from_tensors(sess.run(next_record))

input_fn = infer_util.labels_to_features_wrapper(transcription_data)

# here is the real guesswork: the estimator receives the data processed by
# the input line and comes up with a most likely prediction for each sample
prediction_list = list(
    estimator.predict(
        input_fn,
        yield_single_examples=False))
assert len(prediction_list) == 1

frame_predictions = prediction_list[0]['frame_predictions'][0]
onset_predictions = prediction_list[0]['onset_predictions'][0]
velocity_values = prediction_list[0]['velocity_values'][0]

# we can then convert the output into a more readable sequence
sequence_prediction = sequences_lib.pianoroll_to_note_sequence(
    frame_predictions,
    frames_per_second=data.hparams_frames_per_second(hparams),
    min_duration_ms=0,
    min_midi_pitch=constants.MIN_MIDI_PITCH,
    onset_predictions=onset_predictions,
    velocity_values=velocity_values)

Ok, time to open the oven and check out the results

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

import bokeh
import bokeh.plotting

fig = mm.plot_sequence(sequence_prediction, show_figure=False)
fig.plot_width = 1000
fig.plot_height = 400
bokeh.plotting.output_notebook()
bokeh.plotting.show(fig)

We can first visualise the data. Just to recap, this is the model’s prediction: the notes that, when played on the piano at the predicted time frames and with the predicted velocities, would best approximate the audio we fed as input.
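If the jump from "frame predictions" to "notes" feels a bit magical, here is a minimal sketch in plain Python of how a piano roll (one row per time frame, one on/off value per pitch) can be turned into notes. This is not Magenta's implementation: the real function also takes the onset and velocity predictions, which I ignore here, and the tiny piano roll is made up.

```python
def pianoroll_to_notes(frames, frames_per_second, min_pitch=21):
    """Turn a list of frames (lists of booleans, one per pitch) into
    (midi_pitch, start_seconds, end_seconds) tuples."""
    notes = []
    active = {}  # pitch index -> frame where the note started
    n_pitches = len(frames[0]) if frames else 0
    # A final all-off frame closes the notes still sounding at the end
    padded = frames + [[False] * n_pitches]
    for frame_idx, frame in enumerate(padded):
        for pitch_idx, is_on in enumerate(frame):
            if is_on and pitch_idx not in active:
                active[pitch_idx] = frame_idx
            elif not is_on and pitch_idx in active:
                start = active.pop(pitch_idx)
                notes.append((min_pitch + pitch_idx,
                              start / frames_per_second,
                              frame_idx / frames_per_second))
    return sorted(notes, key=lambda note: (note[1], note[0]))


# Made-up piano roll: 4 frames, 3 pitches starting at middle C (60)
roll = [
    [True, False, False],
    [True, True, False],
    [False, True, False],
    [False, True, True],
]
print(pianoroll_to_notes(roll, frames_per_second=2, min_pitch=60))
# [(60, 0.0, 1.0), (61, 0.5, 2.0), (62, 1.5, 2.0)]
```

Without onset information two repeated notes of the same pitch with no gap between them would be merged into one long note, which is one reason the real pipeline predicts onsets separately.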

If you get to this point in the Google Colab notebook you will see this chart is interactive: you can zoom, click on each note, etc.

This is done using Bokeh, an excellent open source visualisation lib for Python. If you look at the first notes you can clearly see the Bohemian-Rhapsody-like pattern in the beginning (5 notes up, one down).

To hear the result we need to convert the MIDI sequence back to audio, closing the loop WAVE -> MIDI -> WAVE. If the process has worked we should hear a rendition of the pure piano part without the artefacts resulting from source separation. We can use FluidSynth to render the note sequence as an audio file:

import numpy as np
from scipy.io import wavfile

# render the note sequence to floating point samples at 44.1 kHz
array_of_floats = mm.midi_synth.fluidsynth(sequence_prediction, 44100)
# scale the samples to the 16-bit integer range used by the wav file
normalizer = float(np.iinfo(np.int16).max)
array_of_ints = np.array(
    np.asarray(array_of_floats) * normalizer, dtype=np.int16)
wavfile.write('rendered_piano.wav', 44100, array_of_ints)

https://tech.uqido.com/wp-content/uploads/2020/02/rendered_piano.mp3

Ok, I think the transcription is pretty accurate in the note identification, while it fails a little at note durations and relative dynamics. Still, if you consider this the output of a fully automated process with no human intervention, I’d say it’s pretty nice.

So while you stop to consider how many variables the pianist’s brain must process (in real time) when interpreting a piano score, I feel entitled to paste the following picture.

Whoa! (surprise, amazement, and great pleasure)
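Going back for a second to the rendering step: the float to 16-bit conversion simply multiplies by the largest int16 value, which can misbehave if the synth ever outputs a sample outside the -1..1 range. Here is a small sketch of the same idea using only the standard library, with clipping added. The function names are mine, the sample values are made up, and unlike the numpy cast this version rounds instead of truncating.

```python
import io
import struct
import wave

INT16_MAX = 32767


def floats_to_int16(samples):
    """Clip each sample to [-1, 1] and scale it to the int16 range."""
    ints = []
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))
        ints.append(int(round(clipped * INT16_MAX)))
    return ints


def write_wav(target, samples, sample_rate=44100):
    """Write mono 16-bit audio to a file name or a file-like object."""
    ints = floats_to_int16(samples)
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack("<%dh" % len(ints), *ints))


print(floats_to_int16([0.0, 0.25, -1.0, 1.2]))  # [0, 8192, -32767, 32767]

# Write to memory here; pass a file name such as 'test.wav' to write to disk
buffer = io.BytesIO()
write_wav(buffer, [0.0, 0.25, -1.0, 1.2])
```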

Now we need pydub so we can finally put the parts back together: the struggling robot piano player and the hyper-ventilated human singer. Isn’t this something special?

pip install pydub

from pydub import AudioSegment

sound1 = AudioSegment.from_file("rendered_piano.wav", format="wav")
sound2 = AudioSegment.from_file("output/intro/vocals.wav", format="wav")

# lay the vocals over the piano, turning the piano down by 15 dB while they play
played_together = sound1.overlay(sound2, gain_during_overlay=-15)
played_together.export("robot_piano_and_human_voice.mp3", format="mp3")
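If you want to tweak the balance between robot and human it helps to know what a gain in decibels means for the actual signal. A minimal sketch (the function name is mine, the formula is the standard one for amplitude ratios):

```python
def db_to_amplitude_ratio(gain_db):
    """Convert a gain in dB to the factor applied to the signal amplitude."""
    return 10 ** (gain_db / 20)


print(round(db_to_amplitude_ratio(-15), 3))  # 0.178
print(round(db_to_amplitude_ratio(-6), 3))   # 0.501
```

So -15 dB leaves the piano at a bit less than a fifth of its original amplitude, while -6 dB would roughly halve it.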

Here is the result when importing the MIDI data into Logic and showing the sheet music view. Unfortunately there is no tempo information in the MIDI file, so if you wanted to see a correct transcription you would need to fiddle with Logic and find the right bpm.
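To see why the bpm matters: the transcription only knows about seconds, while sheet music thinks in beats. A rough sketch of the conversion, with made-up onset times and a guessed tempo (none of these numbers come from the actual track):

```python
def seconds_to_beats(seconds, bpm):
    """Position in beats of an event that happens at the given time."""
    return seconds * bpm / 60.0


def quantize(beats, grid=0.25):
    """Snap a position to the nearest grid step (0.25 = a sixteenth note in 4/4)."""
    return round(beats / grid) * grid


onsets_seconds = [0.0, 0.52, 0.98, 1.51]  # hypothetical note onsets
bpm_guess = 120                           # hypothetical tempo

print([quantize(seconds_to_beats(t, bpm_guess)) for t in onsets_seconds])
# [0.0, 1.0, 2.0, 3.0]
```

With the right bpm the notes land close to the grid; with a wrong guess they end up all over the place, which is what a messy score looks like.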

3) And now for something different: creating a synthetic accompanying instrument

For this last part of the experiment we will use Magenta GANSynth to recreate the piano part using synthesised sounds. GANSynth uses Generative Adversarial Networks to create synthetic sounds that are ‘believable’, i.e. that could in theory belong to a training data set of real world instruments (acoustic instruments, synths, etc).

Once it learns to do that it can generate a whole universe of sounds by picking some parameters at random from its latent space. It’s a fascinating concept that I will not go into. Let’s just say that by learning to generate ‘believable’ strings and bells it also learns some sounds that are in between those two, sounds that are not heard in the real world (represented by the data set), but could just as well be possible if some different points from the latent space had found expression in a different resynthesized version of the world. (Ok it’s confusing and I just made it more confusing, let’s just have fun with it.)

First we let GanSynth generate a few sounds (for requirements, models, checkpoints and setup see the notebook)

number_of_random_instruments = 10
pitch_preview = 60
n_preview = number_of_random_instruments

# the same pitch (MIDI note 60) for every preview
pitches_preview = [pitch_preview] * n_preview
# one random point of the latent space per instrument
z_preview = model.generate_z(n_preview)

audio_notes = model.generate_samples_from_z(z_preview, pitches_preview)
for i, audio_note in enumerate(audio_notes):
    print("Instrument: {}".format(i))
    play(audio_note, sample_rate=16000)

This will generate 10 different sounds and create a preview of each. Let’s hear a few:

Instrument 1:

Instrument 5:

We can now choose a few of them and have our synth render the piano part interpolating from one instrument to the other in time. (Interpolating means traversing the latent space moving from the point that generated the first instrument to the one that generated the next).
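Here is a minimal sketch of what "moving from one point to the next" means, using plain linear interpolation between two tiny made-up latent vectors. The real latent vectors are much longer, and the notebook may well use a different interpolation scheme, so take this as an illustration of the idea only.

```python
def lerp(z_start, z_end, t):
    """Point at fraction t (0..1) on the straight line from z_start to z_end."""
    return [(1 - t) * a + t * b for a, b in zip(z_start, z_end)]


def interpolation_path(z_start, z_end, steps):
    """Evenly spaced points from z_start to z_end, both included (steps >= 2)."""
    return [lerp(z_start, z_end, i / (steps - 1)) for i in range(steps)]


# Made-up 3-dimensional latent points for two 'instruments'
z_a = [0.0, 1.0, -2.0]
z_b = [1.0, 0.0, 2.0]

path = interpolation_path(z_a, z_b, steps=5)
print(path[2])  # [0.5, 0.5, 0.0]
```

Each point along the path is a slightly different instrument, so playing the notes while walking the path gives you a sound that morphs over time.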

This will create an infinite range of wacky renditions of the same song, which by now you probably hate even more. Here is what I got. (Please send us your variations. No, really.)

I love how the plucked whatever turns into stringy whatever just at the perfect time.

Great Job. Thank you, AI. Really, you can go to sleep now.

Well, in theory there is one bit we have not tried: generating alternative note sequences while keeping the style or the harmony. But given the amazing results we have achieved I’d say we stop here and leave it to further explorations.

Time is gone, the song is over and I still haven’t figured out the chords.