I am a huge fan of Spotify's Discover Weekly feature, introduced in 2015 and powered by data from the start-up The Echo Nest, which Spotify acquired. I am always amazed by how good their algorithm is, and I am not the only one:

In this article, I will lay out how to extract songs from the Spotify API, visualize them with Spark SQL and Databricks (using Scala or Python, in the cloud, for 0€/$/£), and use Spark's machine learning library to create our own music recommendation service.

A quick word about Databricks: they are the team behind Apache Spark, and they have built a great product around notebooks (think Google Docs where you can execute code) to make Spark easy to use and to enhance collaboration between teams.

You can check out the code in Scala and results on this Databricks notebook.

Extract songs data from the Spotify API

As a former front-end developer, I’ve played with a lot of APIs, and I have to admit that the Spotify API is the best I know, thanks to its practical API documentation, where you can test their API directly inside your comfy browser.

Of course the API has its limits, literally: when requesting a user’s saved tracks or playlist tracks, the output is capped at 50 songs per request. That forces us to play with offsets to get the appropriate number of songs for our machine learning classifier.
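The offset dance can be sketched in a few lines of Python (my own helper, not from the notebook): given how many tracks we want, it builds one request URL per 50-song page.

```python
# Sketch (not from the article): build the paginated request URLs needed to
# fetch `total` tracks when the API caps each response at `limit` songs.
API_BASE = "https://api.spotify.com/v1/users/cpolos/playlists/4fApV392cH6iN49Fieh0Z8/tracks"

def page_urls(total, limit=50):
    """Return one URL per page, stepping the offset by `limit` each time."""
    return [f"{API_BASE}?limit={limit}&offset={offset}"
            for offset in range(0, total, limit)]

# 120 saved tracks -> three requests at offsets 0, 50 and 100
for url in page_urls(120):
    print(url)
```

Each URL can then be fed to the same curl command shown below, with the token header attached.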

When we extract songs from the Spotify API, we need to convert them into Spark’s JSON format. Man, I love Spark, but this kind of sucks; leave a comment if you have any insight about why they do it this way (EDIT: @liancheng, a Spark SQL committer, gave us an excellent answer on the why: https://medium.com/@liancheng/it-has-to-be-done-in-this-way-because-spark-processes-data-in-a-distributed-way-which-means-a-json-9f10e6d9b49a#.4mhpvs2za) :

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

This is a valid (and annoying) Spark JSON file

These commands get our songs with curl, then transform the response into the right Spark JSON format with sed. I have to admit that the sed one was a bit tricky to create, haha



curl -X GET "https://api.spotify.com/v1/users/cpolos/playlists/4fApV392cH6iN49Fieh0Z8/tracks?limit=50&offset=0" -H "Accept: application/json" -H "Authorization: Bearer $TOKEN" >> wantedSongsSpotify.json

cat wantedSongsSpotify.json | sed ':a;N;$!ba;s/\n/ /g' | sed 's/ //g' | sed 's/}},{/}}\n{/g' | sed -e 's/{"href":"https:\/\/api.spotify.com\/v1.*&limit=50","items":\[/ /g' | sed -e 's/],"limit":50,"next":.*,"offset":50,"previous":"https:.*api.spotify.com.*,"total":.*}/ /g' > parsedSpotifySongs.json
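If sed is not your thing, the same reshaping can be sketched in Python (my own sketch, not from the notebook): unwrap the response’s items array and write one self-contained JSON object per line, which is the shape Spark expects.

```python
import json

# Sketch: same idea as the sed pipeline, but in Python. Spark wants
# "JSON Lines" (one self-contained object per line), so we drop the
# playlist wrapper and emit each element of "items" on its own line.
def to_spark_json(raw_response: str) -> str:
    payload = json.loads(raw_response)
    return "\n".join(json.dumps(item) for item in payload["items"])

# Example with a two-track response (fields shortened for readability)
raw = '{"href": "https://api.spotify.com/v1/...", "items": [{"track": {"id": "a1"}}, {"track": {"id": "b2"}}], "limit": 50, "total": 2}'
print(to_spark_json(raw))
```

Unlike the sed version, this parses the JSON properly, so it will not mangle track names that happen to contain spaces or braces.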

Now that we have our songs as JSON, we need to get the audio features associated with them to start our algorithm.

Songs’ audio features (or variables)

The Spotify API audio features documentation tells us we can obtain 9 interesting features. For example:

acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic (unplugged)

valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

In order to get these audio features, we need to execute a GET on https://api.spotify.com/v1/audio-features/$trackId . With hundreds of songs, we need a bit of code to automate the process. I used Spark SQL to get the trackId and a curl; you can check the details in the notebook cell called “Important Step : Extract audio features from tracks”:

getAudioFeatureAPI(sqlContext.sql("""
  SELECT track.id
  FROM dislikedSongsML
"""))

result:

curl -X GET "https://api.spotify.com/v1/audio-features/$trackId_1" -H "Authorization: Bearer $TOKEN" >> audioFeatures.json
curl -X GET "https://api.spotify.com/v1/audio-features/$trackId_2" -H "Authorization: Bearer $TOKEN" >> audioFeatures.json
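One curl per track adds up quickly. The Spotify docs also expose a batched variant of this endpoint that takes a comma-separated ids parameter (up to 100 ids per call); here is a sketch (my own helper names) of how the requests could be grouped:

```python
# Sketch (my own helper, not from the notebook): group track ids into
# batched audio-features URLs, since the endpoint accepts up to 100
# comma-separated ids per request according to the Spotify docs.
def audio_feature_urls(track_ids, batch_size=100):
    urls = []
    for start in range(0, len(track_ids), batch_size):
        batch = track_ids[start:start + batch_size]
        urls.append("https://api.spotify.com/v1/audio-features?ids=" + ",".join(batch))
    return urls

ids = [f"track{i}" for i in range(250)]
print(len(audio_feature_urls(ids)))  # 250 ids -> 3 requests
```

Each resulting URL still needs the same Authorization: Bearer $TOKEN header as the curls above.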

These audio features are perfect for machine learning algorithms such as logistic regression or random forests. We just need to label our tracks with like (1) or dislike (0).
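The labelling step is simple enough to sketch in plain Python (feature values below are made-up): liked songs get label 1.0, disliked songs 0.0, giving the (features, label) pairs a classifier trains on.

```python
# Sketch of the labelling step, assuming each song is already a dict of
# audio features. Liked songs -> 1.0, disliked songs -> 0.0, which is the
# (feature vector, label) shape Spark ML classifiers expect.
FEATURES = ["acousticness", "valence"]  # subset of the 9, for the example

def label_songs(liked, disliked):
    rows = []
    for song in liked:
        rows.append(([song[f] for f in FEATURES], 1.0))
    for song in disliked:
        rows.append(([song[f] for f in FEATURES], 0.0))
    return rows

liked = [{"acousticness": 0.12, "valence": 0.90}]
disliked = [{"acousticness": 0.80, "valence": 0.20}]
print(label_songs(liked, disliked))
```

In the notebook the same idea is expressed over the two Spark SQL tables of liked and disliked songs.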

Dataviz with Databricks display method

Once we have loaded our JSON into Spark with Databricks (Table -> Create Table -> import file), we can use the power of Spark SQL and Databricks’s display method to mess around with our data. We can, for example, compare our playlists with the Today’s top songs playlist.

The two pie charts below show the most frequent audio feature ranges, from 0 to 1:
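The aggregation behind those charts can be sketched in plain Python (the valence values below are made-up, just to show the bucketing): split the 0.0-1.0 scale into tenths and count how many songs land in each range.

```python
from collections import Counter

# Sketch of what the pie chart aggregates: bucket a 0.0-1.0 audio feature
# into tenths and count how many songs fall into each range.
def feature_ranges(values, bins=10):
    counts = Counter(min(int(v * bins), bins - 1) for v in values)
    return {f"{b / bins:.1f}-{(b + 1) / bins:.1f}": n
            for b, n in sorted(counts.items())}

valences = [0.05, 0.08, 0.55, 0.58, 0.61, 0.97]
print(feature_ranges(valences))
```

Databricks’s display method does this grouping for you from a Spark SQL query, but it is the same counting under the hood.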