We present an extension of the texture synthesis and style transfer method of Leon Gatys et al. to audio. We have implemented the same code in three frameworks (well, it is cold in Moscow), so choose your favorite.

How do you apply neural-style to audio?

The modifications to the image style transfer algorithm are rather straightforward.

The raw audio is converted to a spectrogram via the Short-Time Fourier Transform. A spectrogram is a 2D representation of a 1D signal, so it can be treated (almost) as an image. In fact, it is better to think of a spectrogram as a 1xT image with F channels, where T is the number of time frames and F is the number of frequency bins.
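
To make the "1xT image with F channels" view concrete, here is a minimal NumPy sketch. The window type, the `n_fft` and `hop` values, the `log1p` compression and the synthetic test tone are all assumptions made for illustration, not settings taken from our code.

```python
import numpy as np

def stft_magnitude(signal, n_fft=2048, hop=512):
    """Log-magnitude spectrogram of shape (F, T), with F = n_fft // 2 + 1."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping windowed frames: (T, n_fft)
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)      # (T, F), complex
    # log1p compression is an assumed, commonly used choice
    return np.log1p(np.abs(spectrum)).T         # (F, T)

# Hypothetical input: one second of a 440 Hz tone at 22050 Hz
sr = 22050
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440.0 * t)

spec = stft_magnitude(audio)                    # (F, T)
# Frequency bins become channels of an image with height 1 and width T
image = spec[:, np.newaxis, :]                  # (channels=F, height=1, width=T)
print(spec.shape, image.shape)                  # (1025, 40) (1025, 1, 40)
```

With this layout a convolution only slides along time, and every filter sees all frequency bins at once.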

Next we need a network. We cannot just use VGG-19, since its 3x3 convolutions are not suited to our essentially 1D problem, for which we definitely want 1D convolutions. That leaves two options: use a pretrained network or use completely random weights. In the Torch implementation I tried training different kinds of nets, but they all seem to perform similarly. Like [1,2,3], Vadim also found that the quality of the network is not important for texture synthesis. Nets with random weights are implemented for all three frameworks. Interestingly, the network we use has only one layer with 4096 filters.
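
A single layer of random 1D filters can be sketched in plain NumPy as below. This is only an illustration under stated assumptions: the filter width, the weight scaling, the ReLU and the spectrogram size are hypothetical, and the demo uses 64 filters instead of 4096 to stay light.

```python
import numpy as np

def conv1d_relu(spec, kernels):
    """Valid 1D convolution along time, followed by ReLU.

    spec:    (F, T) spectrogram, frequency bins treated as channels
    kernels: (n_filters, F, width) filter bank
    returns: (n_filters, T - width + 1) feature map
    """
    n_filters, n_channels, width = kernels.shape
    assert spec.shape[0] == n_channels
    t_out = spec.shape[1] - width + 1
    out = np.zeros((n_filters, t_out))
    # Sum the contribution of each kernel tap (cross-correlation, as in
    # most deep learning frameworks)
    for k in range(width):
        out += kernels[:, :, k] @ spec[:, k:k + t_out]
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
n_bins, n_frames = 257, 100        # hypothetical spectrogram size (F, T)
n_filters, width = 64, 11          # assumed values for the demo
spec = rng.random((n_bins, n_frames))
# Random, untrained weights with an assumed He-style scale
kernels = rng.standard_normal((n_filters, n_bins, width)) * np.sqrt(2.0 / (n_bins * width))

features = conv1d_relu(spec, kernels)
print(features.shape)              # (64, 90)
```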

And finally, we need to reconstruct a signal from its spectrogram. The simplest way to do the inversion is to use the Griffin-Lim algorithm.
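
For readers who have not seen Griffin-Lim before, here is a compact NumPy sketch of the idea: start from a random phase, then alternate between going to the time domain and re-imposing the known magnitude. The `N_FFT` and `HOP` values, the Hann window, the iteration count and the test signal are assumptions for illustration, and this is not the inversion code from the repository.

```python
import numpy as np

N_FFT, HOP = 512, 128   # assumed analysis parameters

def stft(x):
    window = np.hanning(N_FFT)
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP : i * HOP + N_FFT] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)            # (T, F), complex

def istft(spec):
    """Window-weighted overlap-add inverse of stft()."""
    window = np.hanning(N_FFT)
    n_frames = spec.shape[0]
    length = N_FFT + HOP * (n_frames - 1)
    signal = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=N_FFT, axis=1)
    for i in range(n_frames):
        start = i * HOP
        signal[start:start + N_FFT] += frames[i] * window
        norm[start:start + N_FFT] += window ** 2
    # Divide only where the windows actually cover the sample; the first and
    # last few samples are poorly conditioned and may be worth trimming
    covered = norm > 1e-8
    signal[covered] /= norm[covered]
    return signal

def griffin_lim(magnitude, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        estimate = stft(istft(magnitude * phase))
        # Keep the estimated phase, discard the estimated magnitude
        phase = estimate / np.maximum(np.abs(estimate), 1e-8)
    return istft(magnitude * phase)

# Hypothetical test signal: two sine tones
t = np.arange(8192) / 8000.0
x = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 660 * t)
magnitude = np.abs(stft(x))

def spectral_error(y):
    return np.linalg.norm(np.abs(stft(y)) - magnitude) / np.linalg.norm(magnitude)

err_random = spectral_error(griffin_lim(magnitude, n_iter=0))    # random phase only
err_refined = spectral_error(griffin_lim(magnitude, n_iter=30))
print(err_random, err_refined)
```

The refined error should come out well below the random-phase one; how many iterations are needed in practice depends on the signal.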

Texture synthesis

By setting content weight to zero we can synthesize textures.

These examples were generated with the Torch code; you can find instructions in the repository.
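
The effect of a zero content weight can be sketched with the usual Gatys-style objective, a weighted sum of a content term on the features and a style term on their Gram matrices. The function names, the Gram normalization and the feature sizes below are assumptions for illustration.

```python
import numpy as np

def gram(features):
    """Gram matrix of a (n_filters, T) feature map, averaged over time."""
    return features @ features.T / features.shape[1]

def total_loss(feat_x, feat_content, feat_style, content_weight, style_weight):
    content_loss = np.sum((feat_x - feat_content) ** 2)
    style_loss = np.sum((gram(feat_x) - gram(feat_style)) ** 2)
    return content_weight * content_loss + style_weight * style_loss

rng = np.random.default_rng(0)
feat_x = rng.random((8, 50))         # features of the signal being optimized
feat_style = rng.random((8, 70))     # the style clip may have a different length
content_a = rng.random((8, 50))
content_b = rng.random((8, 50))

# With content_weight = 0 the content signal has no influence on the loss
loss_a = total_loss(feat_x, content_a, feat_style, content_weight=0.0, style_weight=1.0)
loss_b = total_loss(feat_x, content_b, feat_style, content_weight=0.0, style_weight=1.0)
print(loss_a == loss_b)              # True
```

Because the Gram matrix sums over time, only the statistics of the style features matter, which is why minimizing this loss produces a texture rather than a copy of the style clip.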

Style transfer (or whatever you call it)

Most probably you would say that style transfer for audio means transferring voice, instruments, or intonation. In fact, neural style transfer does not aim to do any of that. We call it style transfer by analogy with image style transfer, because we apply the same method.

I think that with the help of the community we will find some funny stylization examples :)

What’s next?

I see a slow but consistent increase in the community's interest in music and audio; for sure, amazing things are yet to come. I bet that already in 2017 we will find a way to make WaveNet practical, but my attempts have failed so far :)