How to Better Classify Coachella With Machine Learning (Part 1)

1,235 reads

This is the time when the cold winter month are ending. Temperatures are slowly rising, flowers are blossoming, the faint scent of BBQ in the air, and we soon start dancing outside to pulsating beats surrounded by a group of sweaty people. It’s time for Summer Festivals. It’s time for the Coachella Festival to kick off the season of large outdoor live music weekend events around the United States.

For the few uninitiated, the Coachella Music and Arts Festival is an annual event held at the Empire Polo Club in Indio, California. It attracts hundred- thousands of visitors each year since 1999 to enjoy an increasingly broad lineup of artists previously considered off the main stream. Some critics might say Coachella tries to be everything for every one, thus losing character. Other’s would say more people attended Lady Gaga’s Coachella concert than Donald Trump’s inauguration. But who am I to judge?

But what exactly is Coachella? Is it a music concert, an art exhibition, or just a sea of rich hot people swaying? Even we humans struggle with classifying such events. Machines even more so. For machines ‘music’ , ‘arts’ , or ‘ swaying humans’ are just series of 1’s and zeroes’ and the infinite amount of rational numbers in between.

Categories matter.

People interpret what they see quickly and automatically thus recognising familiar objects as members of categories. That helps us to make decisions fast. Categories allows humans and machines to effectively find the information they need. On the other hand, I personally would be very interested to see what would happen if we store items in the supermarket by alphabet. Aisle A has Apples, Aspirin, and Avocados. Aisle B has Bananas, Bread, Butter, and Baking Powder. But then alphabets in itself are again a categorization.

Let’s just distribute the goods randomly and change it every day.

But who would do such a thing ?

I guess we need Jane’s Addiction’s Perry Farrell to save the day.

Categorizations are closely related to our experiences and preferences. Therefore, correct classification is a crucial part of the overall experience of a service or good we purchase.

For machines, categorization is a hard task and teaching machines to support our human categorization needs is difficult. Radiohead surely had a story to tell this year when technical difficulties blighted their performance. Here, I admit publicly to being a fan since ‘Pablo Honey’. Great album, btw.

So how can you teach machines to categorize better? Our popular drop!in app has daily active users worldwide. There are people in every corner of this planet from Belize to Marakkesh and Melbourne to Tokyo who are daily looking for nearby events happening now. And of course drop!in is the best app for this. So many events actually, that already in the first year of operations our users have generated 601,923 events on our platform with hundreds of millions of interactions between them.

Event volume by region 2016

The problem that we have operationally, is that we can not afford to have people in each country manually or system supported categorize the events correctly. Which leads to unhappy users which rely on looking for sports events and not food events. This is where machine learning comes in.

Here is the battle plan: Since we have already classified events in the past correctly we use our event data as the source for supervised learning and exclude the data where we know the classification was not correct. In more detail, we want to use the description of the specific events to give us the bag of words to classify the event correctly.

The Yuma tent is still the festival’s best-kept secret

It always starts with a problem, right ? The problem is that the raw data we have is variable size text which cannot be fed directly to ML algorithms because they require fixed size numerical feature vectors.

The solution to our problem is combining TF-IDF vectorization with multinomial Naive Bayes classification. We use Multinomial Naive Bayes classification is a a probabilistic learning method because it calculates the likelihood of given event description to be of a given event category.

We will assume that the words in an event description are conditionally independent for a given category. For instance, if for a given event description cat= 1 means ‘Arts’ “Coachella” is word 420 and “sweaty” is word 1036; then we are assuming that if I tell you that particular event is ‘Arts’, then the knowledge of word 420 (if “Coachella” appears in the event description) will have no effect on whether “sweaty” appears 1036.

The second component is Tf-idf vectorization which follows two main concepts.

The weight of a term that occurs in a document is simply proportional to the term frequency. (The Luhn Assumption 1957) The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs.

So when Radiohead plays Coachella and many people tweet (or instagram) about it, and these words jointly show up in each description then the weight of the terms increases. Of course if people are complaining about the long lines and sweaty people at Coachella in a separate thread then the specificity of Radiohead increases and the specificity of Coachella decreases. Sounds straightforward, right ?

Oh Alice! Down the rabbit hole we go

A wordcloud made from 50,000 events in New York in 2016

We will calculate tf–idf as the product of term frequency (tf) and inverse document frequency (idf). But before we can start calculating we need to generate ‘same length’ vectors based on the description fields of our events. The vectorization of a body of text assigns each word in the event description a number. Fortunately, Python has a TfidfVectorizer in the fabulous sklearn package.

For the algorithm to work as intended, very common words, such as “a” or “the”, are like Madonna on MTV. They are awesome but are so common that they receive heavily discounted tf-idf scores.

The result of the calculation is a matrix of tf-idf scores with one row per document and as many columns as there are different words in the dataset.

If the text is too long we need to find ways to reduce dimensionality. Fortunately in our dataset the max length of observations was only 102 words (average 673.1 characters) based on the notion that the text needed to be displayed on the mobile phone.

Thank you for reading all the way down to here. In part 2 we will show the actual implementation in Python.

This article was brought to you by tenqyu, a startups making urban living more fun, healthy, inclusive, and thriving using big data, machine learning, and LOTs of creativity.

[1]https://www.coachella.com/

[2]http://cs229.stanford.edu/notes/cs229-notes2.pdf

Tags