Whenever I drive or code, I listen to music, as this happens a lot, and in order to find new songs, I listen to the radio or I listen to Spotify’s discover weekly playlist, which made me like Mondays (because they release it every Monday)

A french old-school institute called Mediamétrie analyzes radio stations’ songs. Since I have seen their study (that I can’t find anymore) some years ago, I have been obsessed with creating my own.

Mediametrie’s image you can find online → old school

This article will present the year 2016 for 4 main french radio stations through fun SQL queries, then we will connect each song to the Spotify API to create the radio stations’ musical profile.

We will use the Databricks community version to visualize our data. All SQL queries and all results are available on this notebook. It’s the “backstage” of this article, where the magic happens if we can say.

Protip: don’t miss the bonuses at the end of the article

Radio stations introduction

We all have a favorite radio station, mine is Radio Nova for their diversity, their humor, and as a hip hop fan this is the only national radio where we can hear listenable hip hop songs

Radio nova had 1,4% of the audience in September 2016 (PDF to download from Mediametrie).

Most 5 broadcasted songs on Nova in 2016

In order to see how a radio becomes number 1, we are also going to analyze the number 1 music radio called NRJ who has 10,8% of the audience and 2 others : Virgin (5%) which, we’ll see, sounds like NRJ, and Skyrock (6%), don’t mind the name it’s a rap radio… haha

Most 5 broadcasted songs on NRJ in 2016

The main question is, after we compared these radios, should we give to Radio Nova the tips of how to be the number one based on NRJ’s analyze? What do you say, Nova? Learn from the best, right?!

Getting the Radio’s songs data

“What was this title?” Nova page

In order to extract the songs lists, artist, song title and timestamp, we are going to parse each Radio “What was this song?” HTML pages, except for Skyrock which has a handy RESTful web service.

Every song extracted will be converted into this Song class to query them easily with (Spark) SQL:

Song(timestamp:Int, humanDate:Long, year:Int, month:Int, day:Int, hour:Int, minute: Int, artist:String, allArtists: String, title:String)

In 2016 300K broadcasts were collected:

Nova : 95K broadcasts of 5000 different songs

NRJ : 50K broadcasts of 800 different songs

Virgin: 60K broacasts of 1200 different songs

Skyrock: 100K broadcasts of 1000 different songs

Every songs is stored in a parquet format to extract only once the data (you’re welcome radios servers :p) and to speed up SparkSQL queries. Btw, if you are interested by the file I can export it to you in CSV, or parquet.

Remember that the best way to speed up (the Spark doc says often by more than 10x) queries, if you have to use the same SQL table (or Dataset/Dataframe) again and again, is to cache them in memory (Thanks Databricks for the 6Go RAM server!) with the dataframe.cache() method.

Let’s dive into our analysis now !

How many songs by day ?

Some days were not recorded by the radios’ history system, so the real numbers should be a bit higher.