In this blog post we will touch on two recurring questions in machine learning. The first: deep learning performs well on images and text, but how can we use it on tabular data? The second is a question you should always ask yourself when building a machine learning model: how am I going to deal with the categorical variables in this data set? Perhaps surprisingly, both questions have the same answer: entity embeddings.

Deep learning has recently outperformed other machine learning methods on many fronts: image recognition, audio classification and natural language processing are just some of the many examples. These research areas all use what is known as ‘unstructured data’: data without a predefined structure. Generally speaking, this data can also be organized as a sequence (of pixels, user behavior, text), and deep learning has become the standard for dealing with it. Recently the question has arisen whether deep learning can also perform best on structured data: data organized in a tabular format, where the columns represent different features and the rows represent different data samples, similar to how data is represented in an Excel sheet. Currently, the gold standard for structured data sets is gradient boosted tree models (Chen & Guestrin, 2016). They consistently perform best in Kaggle competitions, as well as in the academic literature. Recently, however, deep learning has shown that it can match the performance of these boosted tree models on structured data, and entity embeddings play an important role in this.

Figure: Structured vs. unstructured data

Entity Embeddings

Entity embeddings have been shown to work successfully when fitting neural networks on structured data. For example, the winning solution in a Kaggle competition on predicting the distance of taxi rides used entity embeddings to deal with the categorical metadata of each ride (de Brébisson et al., 2015). Similarly, the third-place solution on the task of predicting store sales for Rossmann drug stores used a much less complicated approach than the first- and second-place solutions: a simple feed-forward neural network with entity embeddings for the categorical variables, including variables with over 1,000 categories, like the store id (Guo & Berkahn, 2016).

If this is your first time reading about embeddings, I suggest you first read this post. In short, embeddings are representations of categories by vectors. Let’s show how this works on a short sentence:

‘Deep learning is deep’

We can represent each word with a vector, so the word ‘deep’ becomes something like [0.20, 0.82, 0.45, 0.67]. In practice, one would replace the words by integers like 1, 2, 3, 1, and use a look-up table to find the vector linked to each integer. This practice is very common in natural language processing and has also been used on data that consists of behavioral sequences, like the journey of an online user. Entity embeddings apply this same principle to categorical variables: each category of a categorical variable is represented by a vector. Let’s quickly review the two common methods for handling categorical variables in machine learning.
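To make the look-up concrete, here is a minimal sketch in Python. The vocabulary, the 4-dimensional vectors and the variable names are all illustrative; in a real model the vectors in the table are learned during training rather than chosen at random.

```python
import numpy as np

# Hypothetical example: the sentence 'Deep learning is deep' as integers.
# Vocabulary: {'deep': 1, 'learning': 2, 'is': 3}; index 0 is reserved.
sentence = np.array([1, 2, 3, 1])

# Look-up table: one 4-dimensional vector per vocabulary entry.
# These are random placeholders; a model would learn them.
vocab_size, embedding_dim = 4, 4
lookup_table = np.random.rand(vocab_size, embedding_dim)

# Replacing each integer by its vector is a simple row look-up.
embedded_sentence = lookup_table[sentence]  # shape: (4, 4)
print(embedded_sentence)
```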

One-hot encoding: Creates binary sub-features like word_deep, word_learning, word_is. A sub-feature is 1 for the category that belongs to that data point and 0 for all others. So, for the word ‘deep’, the feature word_deep will be 1 and word_learning, word_is etc. will be 0 (see the sketch after these two methods).

Label encoding: Assigns integers like we did in the example before, so ‘deep’ becomes 1, ‘learning’ becomes 2, etc. This method is suitable for tree-based methods, but not for linear models, because it implies an ordering of the categories that usually does not exist.
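Both methods are essentially one-liners in pandas. This is a minimal sketch; the column name word and the column prefix are just illustrations, and note that the exact integers produced by label encoding are arbitrary:

```python
import pandas as pd

# The words of our example sentence as a categorical column.
words = pd.DataFrame({'word': ['deep', 'learning', 'is', 'deep']})

# One-hot encoding: one binary column per category
# (word_deep, word_is, word_learning).
one_hot = pd.get_dummies(words['word'], prefix='word')

# Label encoding: map each category to an integer.
# Here pandas assigns the codes alphabetically; the exact values
# carry no meaning.
label_encoded = words['word'].astype('category').cat.codes
```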

Entity embeddings basically take the label encoding approach to the next level: instead of assigning just an integer to a category, they assign a whole vector. This vector can be of any size, and its length has to be specified by the researcher. You might be wondering what the advantages of these entity embeddings are.
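As an illustration of what such an embedding looks like inside a network, here is a minimal sketch using Keras. The input name store_id, the number of categories and the embedding size of 10 are assumptions for the sake of the example, not the exact setup used in the Rossmann solution:

```python
from tensorflow import keras
from tensorflow.keras import layers

n_categories = 1000   # e.g. the number of distinct store ids (assumed)
embedding_dim = 10    # vector size, chosen by the researcher

# One integer-encoded categorical input per data point.
store_input = keras.Input(shape=(1,), name='store_id')

# The Embedding layer is the learned look-up table:
# it maps each integer to an embedding_dim-sized vector.
store_embedding = layers.Embedding(input_dim=n_categories,
                                   output_dim=embedding_dim)(store_input)
store_vector = layers.Flatten()(store_embedding)

# The embedding can be concatenated with other (e.g. continuous) features
# before the dense layers; the vectors are learned during training.
output = layers.Dense(1)(store_vector)
model = keras.Model(inputs=store_input, outputs=output)
```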