Creating a musical (data) pipeline

Our journey towards Big Data

At Songkick, we occupy a distinctive vantage point over music-related data. Because we focus on live music discovery for fans of all types all over the world, we wanted to make the most of the information we hold, its relationships and implications, and the opportunities it presents to provide even better value for our users.

Our first attempts at this included several experiments in traditional data warehousing via Hadoop, followed by an in-house solution for product health metrics written in Ruby. Eventually all of these were discarded on performance grounds: as the data grew, the Ruby jobs took so long that they would sometimes collide with each other. So we placed our bets on Google’s BigQuery and implemented a very simple in-house solution that works roughly as follows:

For the more intensive task of fetching data from the several sources we possess, we implemented a number of extractors in Go, making use of its excellent concurrency model. The new extractors make easy work of fetching millions of rows: jobs that used to take around 5–7 hours now finish in under 40 minutes. They pull data from MySQL databases, parse logs created by our various services, and fetch from external APIs (like KeenIO) and Google Analytics. The extracted data then gets stored as formatted text files in Google Cloud Storage, where it can easily be picked up by the next step in our pipeline.
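To give a flavour of the approach, here is a minimal sketch of such an extractor, with a hypothetical connection string, table, and bucket name; the real extractors handle many sources and far more error cases. One goroutine streams rows out of MySQL over a channel while another serialises them into a Cloud Storage object, so extraction and upload overlap instead of running back to back:

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"

	"cloud.google.com/go/storage"
	_ "github.com/go-sql-driver/mysql"
)

func main() {
	ctx := context.Background()

	// Hypothetical connection string and table; purely illustrative.
	db, err := sql.Open("mysql", "user:pass@tcp(db-host:3306)/songkick")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	gcs, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer gcs.Close()

	// A writer goroutine drains the channel into a Cloud Storage object.
	// It keeps draining even after an error so the producer never blocks.
	lines := make(chan string, 1000)
	done := make(chan error, 1)
	go func() {
		w := gcs.Bucket("extracts").Object("events.csv").NewWriter(ctx)
		var werr error
		for line := range lines {
			if werr == nil {
				_, werr = fmt.Fprintln(w, line)
			}
		}
		if cerr := w.Close(); werr == nil { // Close flushes and finalises the object.
			werr = cerr
		}
		done <- werr
	}()

	// Stream rows out of MySQL and into the channel.
	rows, err := db.QueryContext(ctx, "SELECT id, name FROM events")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var id int64
		var name string
		if err := rows.Scan(&id, &name); err != nil {
			log.Fatal(err)
		}
		lines <- fmt.Sprintf("%d,%s", id, name)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
	close(lines)
	if err := <-done; err != nil {
		log.Fatal(err)
	}
}
```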

Next, we rely on Python to load data into BigQuery, because some of the load/transform jobs run on Dataflow (Google’s managed runner for Apache Beam), which lets us format, filter, and summarise massive data streams, like music taste imports, in a matter of minutes. Dataflow achieves this by spreading and parallelising the operations we perform on our data across a variable number of workers, scaling up or down depending on the demands of each task, and finally piping the results straight into BigQuery.
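Our real load/transform jobs are Python pipelines on Dataflow, but the basic load step can be illustrated more compactly with the BigQuery Go client: point a load job at the file the extractors dropped in Cloud Storage and wait for it to finish. The project, dataset, and table names below are hypothetical:

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/bigquery"
)

func main() {
	ctx := context.Background()

	client, err := bigquery.NewClient(ctx, "songkick-analytics") // hypothetical project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Point a load job at the file the extractors left in Cloud Storage.
	gcsRef := bigquery.NewGCSReference("gs://extracts/events.csv")
	gcsRef.SourceFormat = bigquery.CSV
	gcsRef.AutoDetect = true // let BigQuery infer the schema

	loader := client.Dataset("warehouse").Table("events").LoaderFrom(gcsRef)
	loader.WriteDisposition = bigquery.WriteTruncate // replace the previous snapshot

	// Run the job and block until BigQuery reports a result.
	job, err := loader.Run(ctx)
	if err != nil {
		log.Fatal(err)
	}
	status, err := job.Wait(ctx)
	if err != nil {
		log.Fatal(err)
	}
	if err := status.Err(); err != nil {
		log.Fatal(err)
	}
}
```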

At this point, our data is ready to be analysed and played with. For recurring reports, we have a Rails application connected to BigQuery that lets internal users follow trends and metrics, and in general keep a finger on the pulse of things.

More technically savvy users, though, dive straight into BigQuery’s Web UI to join data from disparate sources, test hypotheses, and find patterns submerged in the data. BigQuery’s massively parallel infrastructure lets them do this without worrying about performance issues or lengthy wait times.
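The same kinds of queries analysts run in the Web UI can also be issued programmatically. As an illustration, here is a hypothetical join (made-up dataset and table names) run through the BigQuery Go client:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()

	client, err := bigquery.NewClient(ctx, "songkick-analytics") // hypothetical project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// A hypothetical ad-hoc query joining event data with taste imports.
	q := client.Query(`
		SELECT e.artist_name, COUNT(*) AS tracked_fans
		FROM warehouse.events e
		JOIN warehouse.taste_imports t ON t.artist_id = e.artist_id
		GROUP BY e.artist_name
		ORDER BY tracked_fans DESC
		LIMIT 10`)

	// Stream the result rows back and print them.
	it, err := q.Read(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for {
		var row []bigquery.Value
		err := it.Next(&row)
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(row)
	}
}
```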

As a final step, we’ve just added a handy little feature to the pipeline that sends (or shoves) some of the BigQuery data back into our production databases. This enriches our desktop and mobile apps in ways that were not practically feasible before. We accomplish this via a group of Go scripts that handle the mappings between BigQuery and MySQL structures and hot-swap tables without any downtime.
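The swap itself can lean on MySQL’s atomic multi-table RENAME, which moves readers from the old table to the new one in a single step. A minimal sketch of the idea, with hypothetical table names and the BigQuery export elided:

```go
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	ctx := context.Background()

	// Hypothetical production database credentials.
	db, err := sql.Open("mysql", "user:pass@tcp(db-host:3306)/songkick")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// 1. Build a shadow table with the same structure as the live one.
	mustExec(ctx, db, "DROP TABLE IF EXISTS artist_scores_new")
	mustExec(ctx, db, "CREATE TABLE artist_scores_new LIKE artist_scores")

	// 2. Fill it with the rows exported from BigQuery (one illustrative
	//    row here; in practice this step streams a whole extract).
	mustExec(ctx, db,
		"INSERT INTO artist_scores_new (artist_id, score) VALUES (?, ?)", 42, 0.97)

	// 3. Swap the tables in one atomic statement, so readers move from the
	//    old table to the new one with no downtime, then drop the old copy.
	mustExec(ctx, db, "RENAME TABLE artist_scores TO artist_scores_old, artist_scores_new TO artist_scores")
	mustExec(ctx, db, "DROP TABLE artist_scores_old")
}

func mustExec(ctx context.Context, db *sql.DB, query string, args ...interface{}) {
	if _, err := db.ExecContext(ctx, query, args...); err != nil {
		log.Fatal(err)
	}
}
```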

To sum things up: some of our data now comes back full circle into our databases after being mixed in BigQuery with bits from other sources, processed, and turned into valuable insights. This lets us bring more and more artists and fans together, and facilitate an even better live music experience.