Image from https://www.maxpixel.net/Circle-Structure-Music-Points-Clef-Pattern-Heart-1790837

Music is not just an art; music is an expression of the human condition. When an artist is making a song, you can often hear the emotions, experiences, and energy they have in that moment. Music connects people all over the world and is shared across cultures. So there is no way a computer could possibly compete with this, right? That’s the question my group and I asked when we chose our semester project for our Machine Learning class. Our goal was to create something that would make the listener believe that what they were listening to was created by a human. Personally, I think we succeeded, but I will let you be the judge (see the results toward the bottom of this post).

Approach

In order to create music, we needed some way to learn the patterns and behaviors of existing songs so that we could reproduce something that sounded like actual music. All of us had been interested in deep learning, so we saw this as a perfect opportunity to explore this technology. To begin we researched existing solutions to this problem and came across a great tutorial from Sigurður Skúli on how to generate music using Keras. After reading their tutorial, we had a pretty good idea of what we wanted to do.

The file format was important, as it would decide how we approached the problem. The tutorial used midi files, so we followed suit and used them as well because they are easy to parse and learn from (you can learn more about them here). Midi files gave us a couple of advantages: we could easily detect the pitch of a note as well as its duration. But before we dove in and began building our network, we needed more information on how music is structured and the patterns to consider. For this we went to a good friend of mine, Mitch Burdick. He helped us determine a few things about our approach and gave us a crash course in simple music theory.

After our conversation we realized that the time step and sequence length would be two important factors for our network. The time step determined when we analyzed and produced each note, while the sequence length determined how we learned patterns in a song. For our solution we chose a time step of 0.25 seconds and 8 notes per time step. This corresponded to a time signature of 4/4, which for us meant 8 different sequences of 4 notes. By learning these sequences and repeating them, we could generate a pattern that sounded like actual music and build from there. As a starting point we used the code mentioned in Skúli’s tutorial; however, in the end our implementation differed from the original in several ways:

Network architecture

Restricted to single key

Use of variable length notes and rests

Use of the structure/patterns of a song

Network Architecture

For our architecture we decided to lean heavily on Bidirectional Long Short-Term Memory (BLSTM) layers. Below is the Keras code we used:

from keras.layers import Activation, Bidirectional, Dense, Dropout, LSTM
from keras.models import Sequential

model = Sequential()
model.add(
    Bidirectional(
        LSTM(512, return_sequences=True),
        input_shape=(network_input.shape[1], network_input.shape[2]),
    )
)
model.add(Dropout(0.3))
model.add(Bidirectional(LSTM(512)))
model.add(Dense(n_vocab))
model.add(Activation("softmax"))
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")

Our thinking was that by using the notes both before and after a particular spot in a song, we could generate melodies that sounded like they were written by a human. Often when listening to music, what came before helps the listener predict what comes next. There have been many times when I’ve been listening to a song and I can bob along to a particular beat because I can predict what will come next. This is exactly what happens when building up to a drop in a song: the song gets more and more intense, the listener builds tension in anticipation of the drop, and there is a moment of relief and excitement when it finally hits. By taking advantage of this we were able to produce beats that sound natural and bring forth the same emotions that we have become accustomed to expecting in modern music.

For the number of nodes in our BLSTM layers we chose 512, as that was what Skúli used. We did experiment with this a little, but due to time constraints we ended up sticking with the original number. The same goes for the dropout rate of 30% (read more about dropout rates here). For the activation function we chose softmax, and for our loss function we chose categorical cross-entropy, as they work well for multi-class classification problems such as note prediction (you can read more about both of them here). Lastly we chose RMSprop for our optimizer, as this is what Keras recommends for RNNs.
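To make the `input_shape` in the model above concrete, here is a minimal sketch of how `network_input` can be built as sliding windows over an integer-encoded note stream, normalized to [0, 1]. The window length and toy note stream are our own illustrative values, not the project’s actual preprocessing:

```python
import numpy

# Illustrative values: a toy integer-encoded note stream and an assumed
# window length (the real project derives these from parsed midi files).
SEQUENCE_LEN = 32
encoded = [3, 7, 1, 0, 5] * 20          # 100 toy "notes"
n_vocab = max(encoded) + 1

# One training sample per sliding window of SEQUENCE_LEN notes.
windows = [encoded[i:i + SEQUENCE_LEN]
           for i in range(len(encoded) - SEQUENCE_LEN)]

# Shape (samples, timesteps, features), scaled into [0, 1] for the LSTM.
network_input = numpy.reshape(windows, (len(windows), SEQUENCE_LEN, 1))
network_input = network_input / float(n_vocab)
print(network_input.shape)  # (68, 32, 1)
```

`network_input.shape[1]` and `network_input.shape[2]` are then the timestep and feature dimensions passed to the first `Bidirectional` layer.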

Key Restriction

An important assumption we made was that we would only use songs in the same key: C major/A minor. By keeping every song we produced in the same key, our output sounded more song-like, because the network never learned notes that would take a song off key. To do this we used a script we found here from Nick Kelly. This part was really simple but gave us a huge improvement in our results.
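The idea behind the key restriction can be shown with a small pure-Python sketch (this is not Nick Kelly’s actual script, which works on midi files via music21; the function name is ours): shift every pitch class by the interval that moves the song’s tonic onto C.

```python
# Pitch classes in semitone order; transposing to C means rotating every
# pitch by the distance from the song's tonic to C.
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F",
              "F#", "G", "G#", "A", "A#", "B"]

def transpose_to_c(pitches, tonic):
    # Semitones needed to move the tonic onto C.
    shift = (12 - NOTE_NAMES.index(tonic)) % 12
    return [NOTE_NAMES[(NOTE_NAMES.index(p) + shift) % 12] for p in pitches]

# A D major triad becomes a C major triad:
print(transpose_to_c(["D", "F#", "A"], "D"))  # ['C', 'E', 'G']
```

After this step, every training song shares one pitch vocabulary, so the network cannot produce out-of-key notes it never saw.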

Variable Length Notes and Rests

An important part of music is the dynamic and creative use of variable length notes and rests. That one long note struck by the guitarist followed by a peaceful pause can send a wave of emotion to the listener as we hear the heart and soul of the player spilled out into the world. To capture this we looked into ways of introducing long notes, short notes, and rests so that we could create different emotions throughout the song.

In order to implement this we looked at the pitch and duration of each note and treated every pitch/duration pair as a separate value we could input into our network. This meant that a C# played for 0.5 seconds and a C# played for 1 second were treated as different values by the network. This allowed us to learn which pitches were played longer or shorter than others and enabled us to combine notes to produce something that sounded natural and fitting for that part of the song.
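A tiny sketch of what this means for the vocabulary (the token strings are illustrative): pitch and duration together define the network’s classes, so the same pitch at two durations yields two distinct tokens.

```python
# The same pitch (C#) at two durations produces two separate classes,
# so the network can learn duration as part of the note itself.
observed = ["C#$0.5", "C#$1.0", "C#$0.5", "A$0.75"]
vocab = sorted(set(observed))
print(vocab)  # ['A$0.75', 'C#$0.5', 'C#$1.0'] -- three classes, not two
```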

Of course rests cannot be forgotten as they are crucial for guiding the listener to a place of anticipation or excitement. A slow note and a pause followed by a burst of quick firing notes can create a different emotion than several long notes with long pauses between. We felt this was important in order to replicate the experience the listener has when listening to a relaxing Sunday afternoon song or a Friday night party anthem.

To achieve these goals we had to focus on our preprocessing. Again here we started with the code from Skúli’s tutorial and adapted it to fit our needs.

from music21 import chord, note

prev_offset = 0
notes = []
for element in notes_to_parse:
    if (isinstance(element, note.Note) or
            isinstance(element, chord.Chord)):
        duration = element.duration.quarterLength
        if isinstance(element, note.Note):
            name = element.pitch
        elif isinstance(element, chord.Chord):
            name = ".".join(str(n) for n in element.normalOrder)
        notes.append(f"{name}${duration}")

        # Fill any gap since the previous note with rest tokens.
        rest_notes = int((element.offset - prev_offset) / TIMESTEP - 1)
        for _ in range(0, rest_notes):
            notes.append("NULL")

        prev_offset = element.offset

To elaborate on the code above, we create notes by combining their pitch and duration with a “$” to feed into our network. For example “A$1.0”, “A$0.75”, “B$0.25”, etc. would all be encoded separately for use by our network (inputs are encoded by mapping each unique note/duration to an integer then dividing all of the integers by the number of unique combinations thus encoding each one as a floating point number between 0 and 1). The more interesting part is calculating how many rests to insert. We look at the offset of the current note and compare it to the offset of the last note we looked at. We take this gap and divide it by our time step to calculate how many rest notes we can fit (minus 1 because really this calculates how many notes fit in the gap, but one of them is our actual next note so we don’t want to double count it). An example would be if one note started at 0.5s and the next didn’t start till 1.0s. With a time step of 0.25 (each note is played in 0.25s intervals), this would mean we need one rest note to fill the gap.
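The rest calculation described above can be checked in isolation with a small sketch (the function name is ours; `TIMESTEP` matches the 0.25-second step discussed earlier):

```python
TIMESTEP = 0.25  # seconds per note slot, as described above

def rests_between(prev_offset, next_offset, timestep=TIMESTEP):
    # Whole timesteps that fit in the gap, minus the one slot that the
    # next note itself occupies.
    return int((next_offset - prev_offset) / timestep - 1)

# A note at 0.5s followed by a note at 1.0s leaves room for one rest:
print(rests_between(0.5, 1.0))  # 1

# Adjacent notes (0.25s apart) need no rest:
print(rests_between(0.5, 0.75))  # 0
```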

Song Structure

Lastly, one of the most important parts of writing a song is its structure, and this is one of the things we found lacking in existing solutions. From what I have seen, most researchers hope their network will learn structure on its own, and I don’t think that is a misguided approach. However, it adds complexity to the problem and makes it harder to solve. We took a more manual approach and assumed a constant pattern, which is also a place where our solution could be improved.

One of the key assumptions we made is that we would only produce songs that follow the specific pattern ABCBDB where:

A is the first verse

B is the chorus

C is the second verse

and D is the bridge

Initially we tried ABABCB, but this felt too formulaic. To resolve this we introduced a second verse that was different from the first but still related. We generated the first verse from a random note and then generated the second verse based on the first. Effectively this generates a single section that is twice as long and splits it in half. The thought process was that if we create one verse, the second should fit the same vibe, and by using the first as a reference we could achieve this.

def generate_notes(self, model, network_input, pitchnames, n_vocab):
    """ Generate notes from the neural network based on a sequence
    of notes """
    int_to_note = dict(
        (number + 1, note) for number, note in enumerate(pitchnames)
    )
    int_to_note[0] = "NULL"

    def get_start():
        # pick a random sequence from the input as a starting point for
        # the prediction
        start = numpy.random.randint(0, len(network_input) - 1)
        pattern = network_input[start]
        prediction_output = []
        return pattern, prediction_output

    # generate verse 1
    verse1_pattern, verse1_prediction_output = get_start()
    for note_index in range(4 * SEQUENCE_LEN):
        prediction_input = numpy.reshape(
            verse1_pattern, (1, len(verse1_pattern), 1)
        )
        prediction_input = prediction_input / float(n_vocab)

        prediction = model.predict(prediction_input, verbose=0)
        index = numpy.argmax(prediction)
        result = int_to_note[index]
        verse1_prediction_output.append(result)

        verse1_pattern.append(index)
        verse1_pattern = verse1_pattern[1:len(verse1_pattern)]

    # generate verse 2
    verse2_pattern = verse1_pattern
    verse2_prediction_output = []
    for note_index in range(4 * SEQUENCE_LEN):
        prediction_input = numpy.reshape(
            verse2_pattern, (1, len(verse2_pattern), 1)
        )
        prediction_input = prediction_input / float(n_vocab)

        prediction = model.predict(prediction_input, verbose=0)
        index = numpy.argmax(prediction)
        result = int_to_note[index]
        verse2_prediction_output.append(result)

        verse2_pattern.append(index)
        verse2_pattern = verse2_pattern[1:len(verse2_pattern)]

    # generate chorus
    chorus_pattern, chorus_prediction_output = get_start()
    for note_index in range(4 * SEQUENCE_LEN):
        prediction_input = numpy.reshape(
            chorus_pattern, (1, len(chorus_pattern), 1)
        )
        prediction_input = prediction_input / float(n_vocab)

        prediction = model.predict(prediction_input, verbose=0)
        index = numpy.argmax(prediction)
        result = int_to_note[index]
        chorus_prediction_output.append(result)

        chorus_pattern.append(index)
        chorus_pattern = chorus_pattern[1:len(chorus_pattern)]

    # generate bridge
    bridge_pattern, bridge_prediction_output = get_start()
    for note_index in range(4 * SEQUENCE_LEN):
        prediction_input = numpy.reshape(
            bridge_pattern, (1, len(bridge_pattern), 1)
        )
        prediction_input = prediction_input / float(n_vocab)

        prediction = model.predict(prediction_input, verbose=0)
        index = numpy.argmax(prediction)
        result = int_to_note[index]
        bridge_prediction_output.append(result)

        bridge_pattern.append(index)
        bridge_pattern = bridge_pattern[1:len(bridge_pattern)]

    # stitch the sections together in the ABCBDB pattern
    return (
        verse1_prediction_output
        + chorus_prediction_output
        + verse2_prediction_output
        + chorus_prediction_output
        + bridge_prediction_output
        + chorus_prediction_output
    )

Results

We were able to achieve surprising results with this approach. We could consistently generate unique songs that fit the genre each respective network was trained on. Below are some example outputs from our various networks.

Ragtime

Christmas

Rap

Conclusion

Music generation by machines is indeed possible. Is it better, or could it be better, than music created by humans? Only time will tell, but from these results I would say we are off to a promising start.

Future Work

Several improvements could be made that would bring this even closer to true music. Some possible ideas/experiments include:

Learn patterns in songs rather than manually piecing together parts

Take note duration as a separate input to the network rather than treating each pitch/duration separately

Expand to multiple instruments

Move away from midi files and produce/learn from actual MP3s

Learn the time step, sequence length, and time signature

Introduce randomness to emulate “human error/experimentation”

Allow for multiple keys

Learn how to use intros and outros

Acknowledgments

I would like to thank my teammates Izaak Sulka and Jeff Greene for their help on this project as well as my friend Mitch Burdick for his expertise on music that enabled us to get these great results. And of course we would like to thank Sigurður Skúli for their tutorial as it gave us a great starting point and something to reference. Last but not least I would like to thank Nick Kelly for his script to transpose songs to C major.

The code for this project can be found here: https://github.com/tylerdoll/music-generator

Disclaimer: the music used in our project does not belong to us and was sourced from various public websites.