In June 2016 Ravi Mody and Tim Schmeier gave a presentation at the NYC Machine Learning meetup to discuss their work on the data science team at iHeartRadio. This is the first in a three-part article complementing the presentation. Part 2 can be found here.

Music, the Internet, and Machine Learning

Like almost everyone who grew up in the 80s/90s, my music discovery followed a familiar pattern: I’d listen to the radio, talk to my friends about what they were listening to, then buy a CD or tape to play on repeat until everyone around me was sick of it. This worked fine for most people, but as a music lover I’m thankful it seems so foreign today: the internet has completely transformed the way I interact with music. While I still often listen to the radio, I now use my phone to tune into any of thousands of stations from across the country. Or I can stream music from basically any artist who has ever sold recorded music. On the discovery side I can choose between 100s of blogs and review sites, or see what my friends are listening to on social networks.

While this explosion of access and discovery has opened up an entire world of music, it can be overwhelming. Ironically, paralyzed by choice, we’re often tempted to stop exploring and stick to what we’re familiar with. Fortunately, in recent years the music industry has been addressing this through a massive shift toward deep, intelligent personalization. I attribute this to two simultaneous developments:

1. The rise of online music streaming services, which has increased our access to music while letting our music providers “learn” our taste.
2. More sophisticated machine learning techniques, along with more powerful hardware to deal with the volume of data the industry produces.

This article goes into detail about how we at iHeartRadio use modern machine learning to connect our users with the music they love. The first part will show how we map our users’ listening patterns into a convenient form, and how we can combine almost every aspect of our service into the same algorithms, enabling many different ways of connecting our users and music together. The second part will cover our work using deep neural networks to “listen to” music to further personalize the user experience. The third and final part will introduce a powerful, convenient way of working with both methods, with examples of how it lets us quickly prototype and build a large number of otherwise complex features.

Factorization Methods

Probably the most ubiquitous approach to serving music recommendations is a machine learning technique called matrix factorization. It is popular because it scales to very large datasets while still providing accurate recommendations.

Matrix factorization works on behavioral data; to build these models we first aggregate all interactions between all users and all their music into a large (huge!) interaction matrix. For example, in the figure below, the intersection between the blue dotted line (a user) and the red dotted line (an artist) could be a number indicating how many hours that user listened to that artist. This matrix can theoretically contain trillions of interactions, but it is fortunately very sparse, meaning it’s mostly filled with zeros.
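As a concrete sketch, the interaction matrix can be stored in a sparse format so that only the observed (user, artist) pairs take up memory. The toy play log below is hypothetical; at production scale the same structure holds trillions of potential cells while storing only the nonzero ones:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy log of (user_id, artist_id, hours_listened) events;
# real data at this scale would be aggregated from raw listening logs.
plays = [
    (0, 0, 2.5),  # user 0 listened to artist 0 for 2.5 hours
    (0, 2, 1.0),
    (1, 1, 4.0),
    (2, 0, 0.5),
    (2, 2, 3.5),
]

rows, cols, vals = zip(*plays)
n_users, n_artists = 3, 3

# Sparse storage: only observed interactions are kept in memory, so a
# matrix with an enormous number of cells stays tractable.
interactions = coo_matrix((vals, (rows, cols)), shape=(n_users, n_artists))
```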

We then use an algorithm called implicit matrix factorization to “decompose” the large interaction matrix into two much smaller matrices — in this case users and artists. These smaller matrices still contain rows and columns for all users and artists, but instead of storing the interaction between a user and an artist, the matrices store the interaction between a user/artist and a “factor space” of fixed, predetermined size, D (usually 30–200).

So, in the figure above, instead of representing a user by how much they listened to every artist (the blue dotted line), they can be represented by a vector of D numbers (the blue solid line). And instead of representing an artist by which users listened to them, they can be represented by another vector of D numbers (the red solid line). These factors map the user’s music taste and the artist’s musical style into the same D-dimensional space. To recreate an estimate of how many hours the blue user listened to the red artist, we can simply take the dot product of the red and blue vectors.
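Here is a minimal sketch of the idea in NumPy, using plain (unweighted) alternating least squares rather than the production implicit-feedback variant, which additionally weights observed cells by a confidence term. The toy matrix, factor size, and hyperparameters are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy interaction matrix (hours listened); zeros mean no observed listening.
R = np.array([[2.5, 0.0, 1.0],
              [0.0, 4.0, 0.0],
              [0.5, 0.0, 3.5]])

n_users, n_artists = R.shape
D = 2        # size of the factor space (30-200 in practice)
reg = 0.01   # L2 regularization keeps the least-squares solves stable

# Random initial factor matrices: one row per user / per artist.
U = rng.normal(scale=0.1, size=(n_users, D))
A = rng.normal(scale=0.1, size=(n_artists, D))

# Alternating least squares: hold one side fixed, solve for the other.
for _ in range(50):
    U = R @ A @ np.linalg.inv(A.T @ A + reg * np.eye(D))
    A = R.T @ U @ np.linalg.inv(U.T @ U + reg * np.eye(D))

# A user's estimated listening for an artist is now a single dot product.
estimate = U[0] @ A[0]
```

Because D is smaller than the number of artists, the product U @ A.T is only a low-rank approximation of R, which is exactly what makes it generalize to unseen user/artist pairs.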

The real power of this comes when we want to create artist recommendations for a user: instead of taking the dot product of the user’s vector with artists she’s already listened to, we can take the dot product of her vector with artists she hasn’t listened to, surfacing music she would enjoy! This is the list of artists with the highest dot products for a coworker of mine, who listens to a lot of modern rock including The White Stripes and The Strokes:

The Black Keys

Cage the Elephant

Mumford & Sons

Modest Mouse

Death Cab for Cutie

Alabama Shakes

Kings of Leon

Cold War Kids

The Lumineers

Arctic Monkeys
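The recommendation step above can be sketched in a few lines: score every artist by the dot product with the user’s vector, then rank only the artists she hasn’t listened to yet. The artist names and vectors here are hypothetical stand-ins, not real model output:

```python
import numpy as np

# Hypothetical artist factor vectors (D = 2) from matrix factorization.
artist_names = ["The White Stripes", "The Strokes", "The Black Keys", "Adele"]
artist_vecs = np.array([[1.00, 0.20],
                        [0.90, 0.30],
                        [0.95, 0.25],
                        [0.10, 1.00]])

user_vec = np.array([1.0, 0.25])  # a modern-rock listener's vector
already_heard = {"The White Stripes", "The Strokes"}

# Score every artist with one matrix-vector product, then rank the
# artists the user has not listened to by descending score.
scores = artist_vecs @ user_vec
ranked = sorted(
    (name for name in artist_names if name not in already_heard),
    key=lambda name: scores[artist_names.index(name)],
    reverse=True,
)
```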

Vector Space Models

To help visualize what we’re doing, the user and artist vectors from matrix factorization can each be considered a point in D-dimensional space. If D=2, we could visualize the space by mapping the first element of the vector to the x-axis and the second to the y-axis and plotting them:

Taken together, this mapping of all our users and artists in D-dimensional space is called a vector space model. In this model the dot product of two vectors is a measure of how close, or “similar,” two points are. Note that you can instead use Euclidean distance to measure the dissimilarity between two points; although it differs from the dot product, we’ve found it often gives similar results.
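A tiny illustration of the two measures, with three made-up points in a 2-dimensional factor space:

```python
import numpy as np

# Three hypothetical points in a D = 2 factor space.
a = np.array([1.0, 0.0])
b = np.array([0.9, 0.1])   # close to a
c = np.array([0.0, 1.0])   # far from a

# Dot product is a similarity: higher means more alike.
dot_ab, dot_ac = a @ b, a @ c

# Euclidean distance is a dissimilarity: lower means more alike.
dist_ab = np.linalg.norm(a - b)
dist_ac = np.linalg.norm(a - c)
```

The two measures need not agree in general: the dot product rewards vectors with large magnitudes, while Euclidean distance does not, which is one reason the choice between them is worth testing empirically.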

From here it’s easy to see that we can find the similarity not just between users and artists, but between any two points in the space; this means we can find the similarity between two artists, or between two users. For example, here are the artists most similar to Michael Jackson in our vector space model:

Michael Jackson

Prince

Stevie Wonder

Earth, Wind & Fire

The Temptations

Marvin Gaye

Janet Jackson

Usher

Prince & the Revolution

Whitney Houston

Mariah Carey
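Item-to-item similarity is the same lookup: rank every other entity by its dot product with the seed’s vector. A toy sketch with hypothetical vectors (real artist vectors would come from the factorization):

```python
import numpy as np

# Hypothetical artist vectors in a D = 2 factor space.
vectors = {
    "Michael Jackson": np.array([1.00, 0.10]),
    "Prince":          np.array([0.95, 0.15]),
    "Slayer":          np.array([0.00, 1.00]),
}

def most_similar(seed, vectors):
    """Rank every other entity by dot product with the seed's vector."""
    seed_vec = vectors[seed]
    others = [name for name in vectors if name != seed]
    return sorted(others, key=lambda n: vectors[n] @ seed_vec, reverse=True)

# most_similar("Michael Jackson", vectors) → ["Prince", "Slayer"]
```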

Extending our Vector Space Model

This user-artist vector space model provides great music recommendations, but we felt it was limiting because iHeartRadio’s products and features involve far more than just users and artists, including:

Live radio stations — one of iHeartRadio’s key features is the ability to listen to thousands of radio stations from around the country 24 hours a day.

Tracks — our users can seed custom stations from a track (instead of an artist), or thumb tracks up and down as they listen.

Podcasts — iHeartRadio has 100s of thousands of podcast episodes in its library.

Perfect for — curated themed stations, like “hip hop workout”.

Genres — our music is classified into 100s of genres covering virtually every type of music from around the world.

Demographics — we ask for a user’s age, gender, and location during sign-up, helping us personalize the service to them.

When we were working with just users and artists it was easy to keep the two separate, but with all these other features in mind we introduced a general concept of an “entity type”. Some examples include user, artist, live_station, podcast, and gender.

Our goal was to map all our entity types into the same space. There are several benefits to this:

We can share information from different types; for example users that listen to this live station also like this podcast and don’t like this track.

Some users only use the service for one feature: artist radio vs track radio vs live radio vs podcasts. Putting every entity type into the same space lets us build a model that works across all users and features. This also lets us build products that generalize over our service (e.g. the code to recommend artists to a user is almost identical to the code that recommends podcasts).

Mapping age, gender, and location helps us with the cold start problem, immediately letting us serve recommendations on sign-up.

In the literature there is some research covering how to map many entity types with arbitrary relationships using matrix factorization, a technique known as collective matrix factorization. Our first attempt was a very simple approach: concatenating many different types into the same interaction matrix and running standard matrix factorization.
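A sketch of that concatenation: tag every item with its entity type so that artists, live stations, podcasts, and demographic attributes share one column space, then build a single sparse interaction matrix over all of them and run standard matrix factorization on it. The events and ids below are hypothetical:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical interactions; each item is tagged with its entity type so
# every type lives in the same column space of one matrix.
events = [
    ("u1", ("artist", "Drake"), 3.0),
    ("u1", ("live_station", "Z100"), 5.0),
    ("u1", ("gender", "male"), 1.0),
    ("u2", ("podcast", "Stuff You Should Know"), 2.0),
    ("u2", ("gender", "male"), 1.0),
]

# Assign a row index to every user and a column index to every
# (entity_type, id) pair, regardless of type.
users = sorted({u for u, _, _ in events})
items = sorted({i for _, i, _ in events})
u_ix = {u: k for k, u in enumerate(users)}
i_ix = {i: k for k, i in enumerate(items)}

rows = [u_ix[u] for u, _, _ in events]
cols = [i_ix[i] for _, i, _ in events]
vals = [v for _, _, v in events]

# One interaction matrix over every entity type; factorizing it places
# all the types in the same factor space.
interactions = coo_matrix((vals, (rows, cols)), shape=(len(users), len(items)))
```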

This approach ended up working very well, creating a vector space model with all types mixed together:

Using this multi-type vector space model, we can mix and match different entity types freely with dozens of possible combinations. We could create genre radio stations by finding similar tracks to a genre, or get an age/gender description for every live and artist radio station, or create personalized podcast recommendations by finding similar podcasts to a user. We’re currently putting several diverse applications of the model into production. We also moved some of our direct marketing efforts to this model earlier this year.

Conclusion

Mapping many different types in a single vector space model is a powerful concept: each new entity type combinatorially increases the number of ways we can use the model to help our users find content they’ll enjoy. We believe this approach is very general, extending far beyond mapping user behavior in a music service (for example, mapping websites, advertisers, internet users, and ad creatives into the same space to power a real-time bidding advertising platform). In part 2 of this article we will show how we created a similar vector space using the audio data of tracks. In part 3 we will show how adding a couple of efficient operations opens up even more functionality, creating seemingly endless ways of working with our music data.