Starting with the Spotify API

Before we can perform any analysis, we need to learn how to log in. Here is how I did that:

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy.util as util

cid = "<Client ID from Earlier>"
secret = "<Client Secret from Earlier>"
username = ""

client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

scope = 'user-library-read playlist-read-private'
token = util.prompt_for_user_token(username, scope)
if token:
    sp = spotipy.Spotify(auth=token)
else:
    print("Can't get token for", username)

The first thing this does is initialize a client_credentials_manager, which tells Spotify which of your Spotify applications to connect to. Once we know which application we want to connect to, we define a scope. The scope tells Spotify what our application will need to do; check out the link below for more info.

After the scope is defined, we can log in. If this works, the script should redirect you to a fancy Spotify login page. After you log in, it will redirect you to the Redirect URL that we defined earlier. The notebook will then ask you to paste the URL you were redirected to into a prompt to continue. Once you do that and it accepts the request, you are logged into Spotify!

With the basic Spotify stuff out of the way, we can turn to the data analytics part of the project.

Data Gathering

This was the most tedious part of the whole process. I needed to gather two playlists, one filled with songs that I didn’t like and one filled with songs that I did.

Finding the songs I liked was relatively easy: I just added all the songs I had saved and all the songs off some playlists I like. But that was only half the battle.

Have you ever tried to seek out bad music? Trust me: it's a pain.

I started by paying a visit to some friends whose music taste I don't like and added a bunch of their favorite songs. Eventually, I ran out of friends. But then I remembered that Spotify has genre-sorted playlists! Huzzah! I went into genres I didn't like and added a bunch of songs. This eventually got me to about as many songs as I wanted.

Disclaimer: I know this isn't the best way to gather the data, but I really didn't want to spend the time to get a representative sample of songs I don't like. I just wanted something that worked well enough.

All of this song gathering involved a bunch of moving songs from one playlist to another. Here’s a code snippet to get the songs from one playlist into another:

sourcePlaylist = sp.user_playlist("<source user>", "<Source Playlist ID>")
tracks = sourcePlaylist["tracks"]
songs = tracks["items"]

# The API returns the playlist a page at a time, so follow the "next" links
while tracks['next']:
    tracks = sp.next(tracks)
    for item in tracks["items"]:
        songs.append(item)

print(len(songs))
print(songs[0]['track']['id'])

# Add each track to the target playlist
for i in range(len(songs)):
    sp.user_playlist_add_tracks("<target user>", "<Target Playlist ID>", [songs[i]["track"]["id"]])

In order to move the songs, you need the user and playlist IDs. You can get them from the link to the playlist; here's an example: https://open.spotify.com/user/<user>/playlist/<playlistID>
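If it helps, here is a minimal sketch of pulling those two IDs out of a playlist link. The URL below just follows the format shown above, using the user and playlist IDs that appear later in this post; the parsing itself is plain string splitting, nothing Spotify-specific.

```python
# Hypothetical example: extract the user and playlist IDs from a playlist link.
url = "https://open.spotify.com/user/1287242681/playlist/5OdH7PmotfAO7qDGxKdw3J"

parts = url.rstrip("/").split("/")
user_id = parts[parts.index("user") + 1]          # segment right after "user"
playlist_id = parts[parts.index("playlist") + 1]  # segment right after "playlist"

print(user_id)      # 1287242681
print(playlist_id)  # 5OdH7PmotfAO7qDGxKdw3J
```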

Side note: I would recommend about 1500–2000 songs in each of your good and bad playlists at the end of this.

Data Analytics

Getting the Audio Features

Now that we have our playlists of good and bad songs, how do we analyze them? Luckily for us, Spotify provides us a way to do that — the Audio Feature Object.

The docs describe the Audio Feature Object: a set of per-track attributes like danceability, energy, valence, acousticness, and tempo.

This object is the cornerstone of our analysis. We don't really have access to the raw audio waveforms or other statistics (i.e. number of plays, how long we listen to a song, etc.) that could make our analysis better. This object, while not perfect, helps us draw some basic conclusions about the characteristics we like in a song.

To get the audio features of a song, we use the sp.audio_features(<SongID>) call, which requires us to pass in a Song ID to get the features for that track.

Q: But all we have so far is two playlists of good and bad songs; how do we get the song IDs for all those songs? A: I got you.

good_playlist = sp.user_playlist("1287242681", "5OdH7PmotfAO7qDGxKdw3J")
good_tracks = good_playlist["tracks"]
good_songs = good_tracks["items"]

# Page through the playlist to collect every track
while good_tracks['next']:
    good_tracks = sp.next(good_tracks)
    for item in good_tracks["items"]:
        good_songs.append(item)

# Pull the ID out of each track
good_ids = []
for i in range(len(good_songs)):
    good_ids.append(good_songs[i]['track']['id'])

First, we grab the playlist by the user ID ("1287242681") and playlist ID ("5OdH7PmotfAO7qDGxKdw3J"). Once we have the playlist, we iterate through it to pick out each song and then pick out the ID from that song. After this block ends, we will have the good song IDs in the good_ids list.

Now we make the call to get the audio features from Spotify:

features = []
# audio_features accepts at most 50 IDs per call, so request in chunks of 50
for i in range(0, len(good_ids), 50):
    audio_features = sp.audio_features(good_ids[i:i+50])
    for track in audio_features:
        # Label each liked song with target = 1
        track['target'] = 1
        features.append(track)

The only quirk of the audio features call is that we can only get the features for 50 songs at once. So, we split the IDs into chunks of 50 and pass them in 50 at a time. Here we add all the audio features to a list, along with a "target" field to specify whether I like the song or not.

All that is left is to repeat the same steps for the bad playlist (with target set to 0) and we can start doing some actual analysis!

Graphs on Graphs on Graphs

All we need to do in order to start looking at our graphical goodness is to insert the data into a Pandas DataFrame.

import pandas as pd

trainingData = pd.DataFrame(features)

I used matplotlib for my plotting. Here are some of the interesting comparisons from my listening data.

Note: Songs I like represented by blue and songs I don’t like are in red.

Tempo comparison between songs I like and don’t

Valence comparison between songs I like and don’t

The first graph looks at the tempo of the songs. From it, we can see that we can't really use tempo to reliably predict whether I'll like a song. The next graph shows something called valence, a measure of how happy a song sounds. This graph shows that I strongly prefer sad songs to happy ones. The graphs for all the other audio features can be found in the notebook.
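A comparison plot like the ones above can be sketched as two overlaid histograms, split on the "target" column. The tiny DataFrame here stands in for the real trainingData, and the column values are made up purely for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for trainingData: a tempo column plus the target labels
trainingData = pd.DataFrame({
    "tempo": [120, 95, 140, 100, 80, 150],
    "target": [1, 1, 1, 0, 0, 0],
})

# Split the feature by whether I liked the song or not
liked = trainingData[trainingData["target"] == 1]["tempo"]
disliked = trainingData[trainingData["target"] == 0]["tempo"]

# Blue for songs I like, red for songs I don't, matching the note above
plt.hist(liked, alpha=0.5, color="blue", label="Songs I like")
plt.hist(disliked, alpha=0.5, color="red", label="Songs I don't")
plt.xlabel("Tempo (BPM)")
plt.legend()
plt.savefig("tempo_comparison.png")
```

Swapping "tempo" for "valence" (or any other audio feature column) produces the rest of the comparisons.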

Now that we have some graphs, let's train a classifier and see how good it is at predicting the songs I like!

Using different classifiers and seeing how they perform

Just a little bit of a definition before we get started.

Classifier: something that tries to sort data into one of several buckets based on different input values.

Here is a nice comparison between different classifiers and how they shape around different data.

If you still want to learn more about different types of classifiers, Google is your friend!

In order to make any classifier work, we need to split our data into a training set and a testing set, so we have some data to train our model with and some data to test that model on. This can be accomplished with the sklearn function train_test_split(), which splits the data according to the test_size fraction passed to it. The code below breaks the data into 85% train, 15% test.

from sklearn.model_selection import train_test_split

train, test = train_test_split(trainingData, test_size = 0.15)

After we split the data we will put it into a train/test x and y variables to input into our classifiers.

# Define the set of features that we want to look at
features = ["danceability", "loudness", "valence", "energy", "instrumentalness", "acousticness", "key", "speechiness", "duration_ms"]

# Split the data into x and y test and train sets to feed them into a bunch of classifiers!
x_train = train[features]
y_train = train["target"]
x_test = test[features]
y_test = test["target"]

Decision Tree Classifier

A Decision Tree Classifier is the first classifier I’ll look at because it is the easiest to visualize. Here is a code snippet that shows how you fit the model to the training data, predict values based off of the test data and then show the accuracy of the model.

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

c = DecisionTreeClassifier(min_samples_split=100)
dt = c.fit(x_train, y_train)
y_pred = c.predict(x_test)

score = accuracy_score(y_test, y_pred) * 100
print("Accuracy using Decision Tree: ", round(score, 1), "%")

The most important part of this classifier configuration is the min_samples_split value: the minimum number of samples a node must contain before the tree is allowed to split it on a characteristic. Here is a little part of the decision tree.

A snippet of the Decision Tree to show the decisions and number of samples in each bucket
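If you want a text version of that picture, sklearn can print the fitted tree directly. This sketch uses synthetic data in place of the real audio features, and the three feature names passed to export_text are just illustrative labels.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the audio-feature training data
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

c = DecisionTreeClassifier(min_samples_split=100, random_state=0)
c.fit(X, y)

# export_text shows each split threshold and the class reached at each leaf
report = export_text(c, feature_names=["valence", "loudness", "energy"])
print(report)
```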

The Decision Tree gave me an accuracy of only 80%, which is good, but we can do better.

KNeighborsClassifier

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train, y_train)
knn_pred = knn.predict(x_test)

score = accuracy_score(y_test, knn_pred) * 100
print("Accuracy using KNN: ", round(score, 1), "%")

The K-Nearest Neighbors classifier looks at the neighbors of a data point in order to determine what the output is. So in our case, it takes a new song's audio features, plots them, and looks at the songs around that point to figure out whether I will like it or not. This approach also gave us an accuracy of 80%, the same as the Decision Tree. I wasn't very hopeful for this type of classifier because the data I was training on wasn't well separated along distinct characteristics.

AdaBoostClassifier and GradientBoostingClassifier

from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

ada = AdaBoostClassifier(n_estimators=100)
ada.fit(x_train, y_train)
ada_pred = ada.predict(x_test)

score = accuracy_score(y_test, ada_pred) * 100
print("Accuracy using ada: ", round(score, 1), "%")

from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=.1, max_depth=1, random_state=0)
gbc.fit(x_train, y_train)
predicted = gbc.predict(x_test)

score = accuracy_score(y_test, predicted) * 100
print("Accuracy using Gbc: ", round(score, 1), "%")

Both of these classifiers operate in a similar way. They both start by creating a relatively weak “learner” (something used to make predictions) and then use the results of classifying to modify the “learner” and make it better at predicting things in the future.

AdaBoost works by fitting that learner and then, between each iteration over the data, modifying the way it predicts in order to classify the more difficult cases with better accuracy. This classifier had 84.3% accuracy when I ran it.

Gradient Boosting uses the loss function (a measure of how far off the prediction was from the true value) and tries to reduce that loss with each iteration. This classifier had an 85.5% accuracy when I ran it.

Disclaimer: These accuracy values will change every time you run the classifiers, so don’t worry if you don’t get the same values that I do.
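The run-to-run variation comes from the random train/test split (and any randomness inside the classifiers). If you want repeatable numbers, passing a fixed random_state pins the split; the toy DataFrame here just stands in for trainingData.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for trainingData
df = pd.DataFrame({"valence": np.arange(20), "target": [0, 1] * 10})

# Same random_state -> the exact same rows land in the test set each run
a_train, a_test = train_test_split(df, test_size=0.15, random_state=42)
b_train, b_test = train_test_split(df, test_size=0.15, random_state=42)

print(a_test.index.tolist() == b_test.index.tolist())  # prints True
```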

All we have to do now is pick the classifier that had the highest accuracy on our training data (for me it was the Gradient Boost) and measure how well it performs in the real world!

Results

In order to test my classifier, I ran it on my Discover Weekly for 4 weeks. Discover Weekly is 30 songs, so I had a total of 120 songs to test. To do this, repeat the steps we went through for loading the good and bad playlists, then just call the predict function of the classifier.

pred = gbc.predict(playlistToLookAtFeatures[features])

i = 0
for prediction in pred:
    if prediction == 1:
        print("Song: " + playlistToLookAtFeatures["song_title"][i] + ", By: " + playlistToLookAtFeatures["artist"][i])
    i = i + 1

The classifier picked 66 songs that it thought I would like. Listening to all the songs, I picked out 31 that I liked. The classifier and my personal likes shared 23 songs. So, my classifier flagged 43 songs that it said I would like but I didn't, while it missed only 8 songs that I did like. In spite of the false positives, I would call that a success for my first delve into anything like this!
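Those counts can be recast as precision and recall, which makes the false-positive problem concrete. This is just arithmetic on the numbers from the paragraph above: 23 shared picks, 43 classifier-only picks, and 8 liked songs the classifier missed.

```python
# TP = songs both the classifier and I liked, FP = its picks I didn't like,
# FN = songs I liked that it missed (counts from the results above)
tp, fp, fn = 23, 43, 8

precision = tp / (tp + fp)  # fraction of its 66 picks I actually liked
recall = tp / (tp + fn)     # fraction of my 31 likes it caught

print(round(precision, 2))  # 0.35
print(round(recall, 2))     # 0.74
```

So the model catches most of what I like (high recall) at the cost of a lot of noise (low precision), which matches the experience described above.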

The Future

Eventually I want to get into using Tensorflow, the open source machine learning library from Google, and hopefully make a better model out of the tools that they provide me. I also eventually want to incorporate this classifier into a larger system to pick me out a “Discover Weekly” playlist every day so I can constantly be on the lookout for new music.