Data is one of Pandora’s core differentiators. Since the launch of our service in 2005, Pandora listeners have created 13 billion stations and have thumbed up or down more than 90 billion times. This feedback from our listeners is a core component of how we customize our stations and playlists to deliver a unique and personalized experience. As an example, earlier this year we launched Personalized Soundtracks on Pandora.

Personalized Soundtracks are a collection of themed playlists that are automatically created for, and unique to, each Pandora Premium listener. There are dozens of available themes spanning a wide range of moods, activities, and genres, such as “Energy,” “Party,” or “Dubstep.” Each playlist is personalized to the listener, both in the selection of playlist themes and in the songs within them. New playlists are delivered weekly and evolve along with each listener’s musical preferences.

Our ability to effectively execute computation against our largest datasets is a foundational part of our personalization. As we look to the future, we continue to invest in our core capabilities of personalization and contextual awareness. That brings us to today, and why we are excited to announce that we’ve chosen Google Cloud Platform (GCP) as the preferred cloud provider for big data and analytics at Pandora.

Since 2010, we’ve used Hadoop to power analytics and offline data processing. We started with a small on-premises cluster, and as our consumption of data has grown, so has the cluster, to more than 2,500 nodes today. Every day, our scientists, developers, and analysts traverse about 6 PB of data with tools such as Hive, Spark, and Presto to gain insights and improve the product.

The landscape of cloud analytics offerings has changed greatly since we launched our on-premises cluster in 2010. In particular, cloud providers’ ability to separate compute and storage resources is an exciting prospect. Today we run a single monolithic production cluster on-premises, and as you might imagine, running a cluster that shares resources between production batch and ad hoc workloads has its challenges. To meet the demand of our team, we must provision the cluster for peak usage, and it’s difficult to stay ahead of the usage curve. Inevitably, we end up having to prioritize our workloads, which often means some users wait for jobs to complete. The ability to spin up purpose-driven Hadoop clusters against our shared datasets and scale them up and down with demand is a game changer for us.
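
To make that concrete, here is a minimal sketch, not our production tooling, of what creating and tearing down a short-lived, purpose-driven cluster looks like with the google-cloud-dataproc Python client. The project, region, cluster name, and machine shapes are placeholders:

```python
from google.cloud import dataproc_v1

# Placeholder values for illustration only.
project_id = "my-project"
region = "us-central1"

# Dataproc is a regional service, so the client targets a regional endpoint.
cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# A small cluster sized for a single ad hoc workload.
cluster = {
    "project_id": project_id,
    "cluster_name": "adhoc-analytics",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-8"},
        "worker_config": {"num_instances": 16, "machine_type_uri": "n1-standard-8"},
    },
}

# Create the cluster, run jobs against shared data in Cloud Storage,
# then delete the cluster so compute is only paid for while in use.
cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
).result()  # Block until the cluster is up.

# ... submit Hive/Spark jobs here ...

cluster_client.delete_cluster(
    request={
        "project_id": project_id,
        "region": region,
        "cluster_name": "adhoc-analytics",
    }
).result()
```

Because the datasets live in Cloud Storage rather than on the cluster’s own disks, any number of these clusters can read the same data concurrently without stepping on each other.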

Early Wins

Although we’re just getting started, we have had a couple of early wins running on GCP. A large percentage of our on-premises workload currently runs in Hive and Presto, and some of these workloads, particularly interactive ad hoc queries, are a great fit for BigQuery, where we have seen very consistent performance over our largest datasets. In our proof of concept, we took the most challenging queries run against one of our largest datasets (13 billion records per month). Run times against our on-premises cluster varied wildly, from 30 seconds to several hours. The same queries all ran consistently in under 30 seconds on BigQuery. The ability for developers, scientists, and analysts to iterate quickly without context switching will be a big productivity win for us.
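
For readers unfamiliar with BigQuery, the interaction model is simple: you submit standard SQL and the service provisions capacity per query, so there is no cluster to size or queue behind. The sketch below uses the google-cloud-bigquery Python client; the project, table, schema, and query are hypothetical stand-ins, not our actual proof-of-concept queries:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # Placeholder project.

# A hypothetical ad hoc query; `listener_events` stands in for a large
# dataset and is not a real table name.
sql = """
    SELECT station_genre, COUNT(*) AS thumbs
    FROM `my-project.analytics.listener_events`
    WHERE event_type = 'thumb_up'
      AND event_date >= '2017-01-01'
    GROUP BY station_genre
    ORDER BY thumbs DESC
    LIMIT 10
"""

# query() starts the job; result() blocks and streams rows back
# once the job completes.
for row in client.query(sql).result():
    print(row.station_genre, row.thumbs)
```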

Pairing Cloud Dataflow with TensorFlow model training is another pattern we are eager to deploy more widely. An early workload was migrated from a single machine to a Dataflow job to enable parallel execution, reducing its run time from six days to 30 minutes.
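
We won’t go into the details of that job here, but the general shape of such a migration is to express the per-record work as an Apache Beam pipeline and let the Dataflow runner fan it out across workers. A minimal, hypothetical sketch follows; the bucket paths and the featurize function are placeholders, not the actual workload:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def featurize(line):
    # Placeholder for the per-record work (e.g. turning raw listening
    # events into TensorFlow training features).
    return line.strip().lower()


# Options for the Dataflow runner; project, region, and bucket names
# are placeholders.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

# Each step runs in parallel across many workers, which is what turns
# a days-long single-machine loop into a short batch job.
with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/raw/*.txt")
        | "Featurize" >> beam.Map(featurize)
        | "Write" >> beam.io.WriteToText("gs://my-bucket/features/part")
    )
```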

Next Steps

The migration to GCP is a considerable effort and one that we expect to take eighteen to twenty-four months to complete. Although this is a large endeavor, we are excited to push forward our analytics capabilities and empower our teams to deliver even more unique personalized experiences for our listeners.

As we make progress on our journey, we will share our experiences both with the platform and with our migration here on Algorithm and Blues.