About two months ago, a friend and I decided to organize a concert in San Francisco. We had no prior experience promoting shows, but we both loved live music and felt up to the challenge. Plus, 2016 had been a crappy year, and it seemed like this would be a good way to bring the community together and end the year on a positive note.

We began by reaching out to local acts we personally knew with hopes of booking them for a concert in December. A week later, we had exhausted our list of contacts and still had no success. I began thinking about whether we could analyze social media to scout local musicians.

I decided to look into Soundcloud, the de facto online hangout for musicians. Having met a few musicians through another project, I knew they routinely used the platform to distribute music and connect with fans. It didn’t matter who or where someone was — a bedroom DJ in SoMa, a garage band in Mountain View, or a singer-songwriter in Oakland — they all posted music on Soundcloud. And as long as they had posted a single track, I knew I would be able to find them.

Crawling the Soundcloud Graph

A quick look at Soundcloud’s Search API revealed that it wouldn’t be adequate. Its simple keyword search could not express a query like “return any user based in San Francisco or Oakland who has fewer than 10k followers and has posted at least one track”.

Looking for a solution, I realized that crawling the social graph could be an effective approach. I could write an algorithm that, when seeded with a Soundcloud user, would pull all of their followers and followings, and then in turn pull the followers and followings of each of those users. This simple recursive algorithm would expand to cover thousands of users after a few iterations. With the graph stored locally, I would have the luxury of analyzing the social connections any which way, the easiest of which would be writing a SQL query.
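Here is a minimal sketch of that expansion, not the actual crawler. It walks the graph breadth-first with a queue rather than through literal recursion, which gives the same coverage without deep call stacks. The `fetch_neighbors` callable is a hypothetical stand-in for whatever API calls return the accounts connected to a user, and the `max_depth` cap is my assumption for keeping the crawl bounded.

```python
from collections import deque


def crawl(seeds, fetch_neighbors, max_depth):
    """Expand outward from the seed accounts, one hop per level.

    fetch_neighbors(user) is a hypothetical callable that returns the
    accounts connected to `user`; in a real crawler it would wrap API calls.
    """
    seen = set(seeds)
    edges = set()
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        user, depth = queue.popleft()
        if depth >= max_depth:
            continue  # record the user, but do not expand any further
        for other in fetch_neighbors(user):
            edges.add((user, other))
            if other not in seen:
                seen.add(other)
                queue.append((other, depth + 1))
    return seen, edges


# Toy data standing in for the real API.
TOY_GRAPH = {"seed": ["a", "b"], "a": ["c"], "b": ["a"], "c": ["d"]}
users, edges = crawl(["seed"], lambda u: TOY_GRAPH.get(u, []), max_depth=2)
```

With a depth of 2, the toy crawl reaches "c" but never expands it, so "d" stays undiscovered. Each extra level multiplies the number of accounts to fetch, which is why the graph grows so quickly.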

I picked Afrolicious, Mark Slee, EARMILK and a few others as seed accounts. These users are deeply integrated into the hip hop, electronic, and indie scenes of San Francisco. I was confident that their combined social graph would provide a diverse and complete representation of Bay Area music.

As I began experimenting with the algorithm, I realized that it was impractical to pull a user’s followers. Musicians regularly have tens or hundreds of thousands of followers (Calvin Harris, at the extreme end of the spectrum, has 7.08 million). Pulling every follower of every user would have made the crawl balloon far beyond the local scene I cared about. I also didn’t intend to pay a thousand dollars a month in database costs.

The solution was to crawl only the followings and not the followers. Musicians follow other musicians. Interestingly enough, there are even papers that have analyzed this behavior and the “virtual scenes” that emerge (Allington et al., 2015). By crawling the following graph, I could efficiently map out the local music scene. Furthermore, by keeping track of who follows whom, I could later use an algorithm like PageRank to find up-and-comers who do not yet have many followers but nevertheless have a vote of confidence from the community.
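To make the PageRank idea concrete, here is a small, dependency-free sketch of the textbook power-iteration version. It is an illustration under my own assumptions (a damping factor of 0.85, a fixed number of iterations, made-up account names), not the code I ran. Each edge is a `(follower, followed)` pair, so rank flows from an account to the accounts it follows.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (follower, followed) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for follower, followed in edges:
        out_links[follower].append(followed)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by accounts that follow nobody is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1 - damping + damping * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        for node in nodes:
            for target in out_links[node]:
                new_rank[target] += damping * rank[node] / len(out_links[node])
        rank = new_rank
    return rank


# Hypothetical accounts: two established acts both follow "newcomer".
follows = [
    ("act_a", "act_b"),
    ("act_b", "act_a"),
    ("act_a", "newcomer"),
    ("act_b", "newcomer"),
    ("fan", "act_a"),
]
ranks = pagerank(follows)
```

In this toy graph, "newcomer" has only two followers, but both are well-connected acts, so it ends up ranked above "fan", whom nobody follows. That is the "vote of confidence" effect in miniature.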

I built a Sinatra app to crawl the Soundcloud social graph and saved all users and their relationships to a Postgres database. After a few iterations, there were over 200k users and 500k relationships. It was time to make sense of the data.

Analyzing the Network

I wrote a Python script to query the database for any Soundcloud user who had at least 500 followers, had posted at least one track, and was based in San Francisco or Oakland. I then mapped these users and their relationships onto a NetworkX directed graph. From there, it was simple to export a .gexf graph file that Gephi could consume for visual analysis.
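The filter described above might look roughly like the sketch below. To keep it self-contained, it uses an in-memory SQLite database in place of Postgres. The table and column names (`users`, `follows`, `followers_count`, `track_count`) and the sample rows are assumptions made for illustration, not my actual schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the Postgres database
conn.executescript("""
CREATE TABLE users (
    id INTEGER PRIMARY KEY, username TEXT, city TEXT,
    followers_count INTEGER, track_count INTEGER
);
CREATE TABLE follows (follower_id INTEGER, followed_id INTEGER);

-- Local acts: enough followers, at least one track, based in SF or Oakland.
CREATE VIEW local_artists AS
SELECT id, username FROM users
WHERE city IN ('San Francisco', 'Oakland')
  AND followers_count >= 500
  AND track_count >= 1;
""")
conn.executemany("INSERT INTO users VALUES (?, ?, ?, ?, ?)", [
    (1, "sf_band", "San Francisco", 1200, 8),
    (2, "oak_singer", "Oakland", 650, 3),
    (3, "la_dj", "Los Angeles", 9000, 20),     # wrong city
    (4, "sf_newbie", "San Francisco", 40, 1),  # too few followers
    (5, "sf_lurker", "San Francisco", 800, 0), # no tracks
])
conn.executemany("INSERT INTO follows VALUES (?, ?)",
                 [(1, 2), (2, 1), (3, 1), (4, 2), (1, 3)])

artists = conn.execute(
    "SELECT id, username FROM local_artists ORDER BY id").fetchall()

# Keep only relationships where both ends passed the filter.
edges = conn.execute("""
    SELECT follower_id, followed_id FROM follows
    WHERE follower_id IN (SELECT id FROM local_artists)
      AND followed_id IN (SELECT id FROM local_artists)
    ORDER BY follower_id, followed_id
""").fetchall()
```

The `artists` rows become nodes and the `edges` rows become directed edges. With NetworkX, that means adding them to a `networkx.DiGraph` and calling `networkx.write_gexf` to produce the file that Gephi opens.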

Gephi turned out to be an incredible tool for visualizing social networks and gave me more than enough to play with.