Enhancing categorical features with Entity Embeddings

First, let’s talk about selling beers

Let’s pretend you are the owner of a pub, and you want to predict how many beers your establishment is going to sell on a given day based on two variables: the day of the week and the current weather. We can reasonably expect that weekends and warmer days will sell more beers than the beginning of the week and colder days.

Faced with this problem, we would usually start by encoding our categorical data (in this example, the day of the week and the weather) into dummy variables, in order to provide our model with an input that imposes no hierarchy between the existing categorical values.

Our data would look something like the example below for the day of the week feature (you can imagine something similar for the weather feature):
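As a quick sketch of what that dummy encoding looks like (assuming a hypothetical DataFrame with a 'Day' column), pandas can produce the one-hot representation directly:

```python
import pandas as pd

# Hypothetical sales records; 'Day' is the categorical feature to encode
df = pd.DataFrame({'Day': ['Monday', 'Saturday', 'Sunday', 'Monday']})

# One dummy column per unique day; each row has exactly one "hot" position
dummies = pd.get_dummies(df['Day'], prefix='Day')
print(dummies)
```

Each row is all zeros except for a single position marking which day it is, which is exactly the "no hierarchy" property we were after.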

But does it really make sense to treat each categorical value as completely different from every other, as One-Hot-Encoding does? Or could we use some technique to “learn” the relationships and inner connections between each possible value and our target variable?

Entity Embeddings to the rescue

With this scenario in mind, we can proceed to adopt a technique popularized in the NLP (Natural Language Processing) field known as Entity Embeddings, which allows us to map a given feature set into a new one with a smaller number of dimensions. In our case, it will also allow us to extract meaningful information from our categorical data.

The usage of Entity Embeddings is based on training a Neural Network with the categorical data in order to retrieve the weights of its Embedding layers. This gives us a more meaningful input than a plain One-Hot-Encoding approach. By adopting Entity Embeddings we are also able to mitigate two major problems:

No need for a domain expert, since we can train a Neural Network that efficiently learns patterns and relationships between the values of the same categorical feature. This lets us skip the manual feature engineering step (such as hand-assigning weights to each day of the week or each kind of weather);

Reduced usage of computing resources, since we no longer encode our categorical values with One-Hot-Encoding, which can be extremely wasteful. Imagine a categorical feature with ten thousand possible unique values: that would translate into a feature vector with ten thousand positions, almost all of them zero, just to represent a single value.
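To put numbers on that second point, here is a back-of-the-envelope comparison for a feature with ten thousand unique values, using the square-root rule of thumb described later in this post:

```python
from math import ceil, sqrt

n_unique = 10_000          # unique values in the categorical feature
one_hot_size = n_unique    # one-hot: one position per unique value
embedding_size = ceil(sqrt(n_unique))  # square-root rule of thumb

print(one_hot_size, embedding_size)  # 10000 vs 100
```

A 100x smaller representation per value, before we even consider that the embedding vectors carry learned structure while the one-hot vectors carry none.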

The definition of Entity Embedding

An Embedding layer is pretty much a Neural Network layer that groups, into an N-dimensional space, categorical values with similar output values. This spatial representation lets us capture intrinsic properties of each categorical value, which can later be used as a replacement for our old dummy-encoded variables. Put more simply, it means that days of the week with similar outputs (in our case, number of beers sold) will sit close to each other. If you don’t get it yet, maybe an example picture can help:

Here we can see four major groups: group 1, with Monday and Tuesday, possibly related to a low amount of beers sold, since it is the start of the week; group 2, with Wednesday and Thursday, at some distance from group 1; group 3, with Friday and Saturday, relatively close to group 2, indicating that those days are more similar to each other than to group 1; and group 4, with Sunday, without many similarities to the other groups. This simple example shows that embedding layers can learn real-world information, such as the most common days for going out and drinking. Pretty cool, isn’t it?
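Mechanically, an Embedding layer is just a trainable lookup table: each categorical value is an integer index into a weight matrix, and that value's vector is the corresponding row. A minimal NumPy sketch (with a made-up, randomly initialized 3-dimensional table for the seven days; in a real model these weights are learned during training):

```python
import numpy as np

rng = np.random.default_rng(42)

# 7 days of the week, each mapped to a 3-dimensional vector
weights = rng.normal(size=(7, 3))

# An "embedding lookup" is simply row indexing
monday, tuesday = 0, 1
monday_vec = weights[monday]
tuesday_vec = weights[tuesday]

# After training, days with similar sales should end up close together,
# e.g. as measured by Euclidean distance
distance = np.linalg.norm(monday_vec - tuesday_vec)
print(monday_vec.shape, distance)
```

The groups in the picture above are exactly these rows, plotted after training has pulled similar days toward each other.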

Putting it together with Keras

First, we need to know that to use an embedding layer, we must specify the number of dimensions we would like that embedding to have. This, as you can notice, is a hyperparameter, and it should be tested and experimented with case by case. But as a rule of thumb, you can set the number of dimensions equal to the square root of the number of unique values in the category. So in our case, the representation for the day of the week would have only three positions (rounded up) instead of seven. Below we give an example for both mentioned features, and add some hidden layers to our model, in order to have more parameters to capture minor data nuances.

from math import ceil, sqrt

from keras.layers import Activation, Concatenate, Dense, Embedding, Input, Reshape
from keras.models import Model as KerasModel
from keras.optimizers import SGD

# Embedding layer for the 'Day of Week' feature
n_unique_day = df['Day'].nunique()
n_dim_day = ceil(sqrt(n_unique_day))  # rounded up, per the rule of thumb
input_week = Input(shape=(1,))
output_week = Embedding(input_dim=n_unique_day,
                        output_dim=n_dim_day,
                        name="day")(input_week)
output_week = Reshape(target_shape=(n_dim_day,))(output_week)

# Embedding layer for the 'Weather' feature
n_unique_weather = df['Weather'].nunique()
n_dim_weather = ceil(sqrt(n_unique_weather))
input_weather = Input(shape=(1,))
output_weather = Embedding(input_dim=n_unique_weather,
                           output_dim=n_dim_weather,
                           name="weather")(input_weather)
output_weather = Reshape(target_shape=(n_dim_weather,))(output_weather)

input_layers = [input_week, input_weather]
output_layers = [output_week, output_weather]
model = Concatenate()(output_layers)

# Add a few hidden layers
model = Dense(200, kernel_initializer="uniform")(model)
model = Activation('relu')(model)
model = Dense(100, kernel_initializer="uniform")(model)
model = Activation('relu')(model)

# And finally our output layer (sigmoid assumes the target is scaled to [0, 1])
model = Dense(1)(model)
model = Activation('sigmoid')(model)

# Put it all together and compile the model
model = KerasModel(inputs=input_layers, outputs=model)
model.summary()
opt = SGD(learning_rate=0.05)
model.compile(loss='mse', optimizer=opt, metrics=['mse'])

Graphically our Neural Network would have the following representation:

That’s it! Our architecture is composed of an Input layer for each of the categorical features, each followed by an Embedding layer and a Reshape layer, which are then concatenated together. Lastly, we add some hidden layers to capture any extra information.

Training our network for 200 epochs with a learning rate of 0.05, we can see some pretty good results for loss and mean squared error:
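Once training is done, the learned vectors can be read back from the named Embedding layers via get_weights() — which is what this whole exercise was for. A minimal self-contained sketch using a freshly built (untrained) stand-in layer, named "day" as in the model above:

```python
from keras.layers import Embedding, Input
from keras.models import Model

# Tiny stand-in model: 7 days mapped into 3 dimensions, as in the article
inp = Input(shape=(1,))
emb = Embedding(input_dim=7, output_dim=3, name="day")(inp)
model = Model(inputs=inp, outputs=emb)

# One row per categorical value, one column per embedding dimension
day_vectors = model.get_layer("day").get_weights()[0]
print(day_vectors.shape)  # (7, 3)
```

These rows are exactly the vectors plotted in the picture earlier, and they can be saved and reused as input features for any other model.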

Conclusions

In this simple example it may sound silly, but think again about our scenario of ten thousand unique values. The difference between a feature vector with ten thousand positions (using One-Hot-Encoding) and one with only 100 (following the rule of thumb for entity embeddings) is enormous in terms of computing resources. And that is the difference for a single feature of a single record; you can imagine how much this matters in a real-world dataset, where categories can become enormous.

If you reached this point without any doubts, congratulations! But if you have any questions, suggestions or complaints, feel free to reach out to me. I also created a GitHub repository containing a library to help anyone looking to perform entity embedding on their data, feel free to check it out:

https://github.com/rodrigobressan/entity_embeddings_categorical.

See you next time, and happy coding!