Movie Recommender: How the Idea Was Born

At Deep Systems we create solutions and products based on machine learning and deep learning. Among our projects: developing a “mind” for a self-driving car prototype and automatic defect detection for roads and airport runways. An important part of our work is recommender systems. The strong desire to create our own recommender system has long prevented us from sleeping peacefully.

The reason for writing this post is to share our service with the world, to get feedback about our work and vision, and, at the same time, to share experiences that may be of interest to deep learning practitioners and other people.

From Big Dreams to an Actual Product

Here is a phrase from J. Schmidhuber* that we cannot get out of our heads:

Google of the future together with all its services is just a single giant LSTM.

By this we mean that there is one large neural network that interacts with the user and solves a variety of their tasks.

*We are not completely sure that Jürgen actually said this, but deep learning researchers are by now aware of how dangerous it is not to cite him :-)

The idea seems too ambitious, perhaps utopian. We tried to “land” it and find a domain where a single large neural network can solve all the tasks for the user. So the idea was born to build a movie recommender system that interacts with the user in a smart way, modeling the interaction end to end with a deep LSTM-like network.

Today there is a huge hype around chatbots. For the academic community, at the end of the day, it’s all about passing the Turing test. For large companies, operational cost optimization is a concern, so the guys in tech support should keep a weather eye open. All jokes aside, in many cases typing text as a way to talk to a computer may be inconvenient, and the “language of clicks” is more appropriate.

Many recommender systems are built on the concept of similar items: for each movie there is a predefined set of movies similar to it. This does not take into account the preferences of a particular user. As a consequence, the user is forced to explore static content and has no tool to tell the system about his preferences. This is not an interactive approach. We, on the other hand, believe that interactivity is a “must have” component of a good recommender system.

Our concept is the following: no registration is required; the user visits the site, makes a few clicks on movies or tags, and receives recommendations reflecting his current mood and preferences. There are two entities which the system predicts: movies and tags. The movie is the ultimate goal, i.e. the user is here because he wants to find a movie to watch, whereas tags are an additional interaction tool that lets the user feed his current preferences into the system faster.

Why Deep Learning?

There is one reasonable question to ask: “Why use neural networks at all? Collaborative filtering approaches have existed for many years, are well understood, and work fine.”

We will answer step by step. When talking about collaborative filtering, we should clearly distinguish the following two tasks: (1) rating prediction and (2) top N recommendations.

The task of rating prediction is much more popularized and, as a consequence, there are tons of papers and open source libraries for it. For the top N recommendation task, however, the situation is quite the opposite. The reason is the Netflix Prize (2006–2009) with a prize fund of $1 million, where participants were asked to predict how user u will rate movie m.

However, in most business applications it is required to give top N recommendations. Typical cases: based on historical data for a particular user, show the 10 items he is most likely to buy or, as in our case, the 10 films he most likely wants to see.

Without a doubt, task (2) can be reduced to task (1) in the following naive way: take the user, predict the ratings for all the movies in our catalog, sort the movies in descending order by predicted rating, then take the top 10 movies and recommend them. It sounds like a good idea, but there is one problem: it does not work (we were the ones who made this mistake). When you look at the recommendations, you do not like them (the metrics reflect these inner feelings as well)!
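For concreteness, here is what the naive reduction looks like in code. This is a minimal sketch; `top_n_by_rating` and the toy rating numbers are our illustration, not part of any actual system:

```python
def top_n_by_rating(predict_rating, user_id, catalog, n=10):
    # Score every movie in the catalog with the rating model,
    # sort by predicted rating, and return the first n movie ids.
    ranked = sorted(catalog, key=lambda m: predict_rating(user_id, m), reverse=True)
    return ranked[:n]

# Toy rating predictor with made-up numbers, for illustration only.
ratings = {(1, 100): 4.8, (1, 200): 3.1, (1, 300): 4.2}
top = top_n_by_rating(lambda u, m: ratings[(u, m)], 1, [100, 200, 300], n=2)
# top == [100, 300]
```

Mechanically this works fine; the problem described above is with the quality of the resulting list, not the sorting.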

We are done with the rating prediction problem; let’s return to top N recommendations. Again, there are two ways of solving the problem: (a) matrix factorization and (b) the nearest neighbours approach.

Matrix factorization methods have the following drawback: they are not interactive, in the sense that if a user has rated a movie, then to update the recommendations for him you need to redo the factorization procedure. Since we want to recommend “on the fly”, this is unacceptable for us.

Nearest neighbours are interactive. Pedro Domingos classifies this approach among the “lazy” machine learning methods, since the training procedure amounts to saving new training examples in a database. So, in terms of computational cost, training is free and all the work is done at the inference stage. But when it comes to the metric, the best one can do is rely on some sort of heuristic. If we want to go beyond movies and also work with other entities, like tags, the metric issue becomes even bigger.

We are not saying that the standard approaches are bad; we are just pointing out the evident advantage of the deep learning approach for this task: feed all the available data to a deep model and formulate a training objective that correlates with the quality of the user experience. If one does this the right way, all that is left is to wait until the training procedure converges to some local minimum.

Top N recommendation problem in terms of Deep Learning

The deep learning revolution first came to speech recognition, then to computer vision, and, after that, to natural language processing (NLP). Many NLP tasks reduce to answering the question: what is the probability distribution of the next word, given N previous words? Or simply: predict the next word in a sentence (in a text).

Today, in most NLP tasks, large recurrent neural networks (LSTMs) dominate other approaches; that is, neural networks are pretty good at predicting the next word in a sequence.


The point is that a database of user ratings can be represented as one very long text. This text consists of sentences, and each sentence is the list of movie IDs that a particular user liked.

Consider a very simple example:

“100 200 123/0 100 10 300/0 1 2 3 4 5/0”

We can see that

There are 3 users in our database

The first user likes movies with the identifiers: 100 200 123

The second: 100 10 300

The third: 1 2 3 4 5

“/0” — special symbol for separating different users

User IDs are not important, only movie IDs (and their relative order) are important
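The conversion from a likes table to such a “text” is trivial to sketch. The function below is our illustration (the post does not include the actual preprocessing code), reproducing the example string above:

```python
def likes_to_text(liked_sequences):
    # Each user's liked-movie ids become one "sentence",
    # terminated by the special separator token "/0".
    # User ids themselves are discarded; only the sequences matter.
    sentences = [" ".join(str(m) for m in seq) + "/0" for seq in liked_sequences]
    return " ".join(sentences)

corpus = likes_to_text([[100, 200, 123], [100, 10, 300], [1, 2, 3, 4, 5]])
# corpus == "100 200 123/0 100 10 300/0 1 2 3 4 5/0"
```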

Then, in theory, one can take a state-of-the-art NLP model and train it to predict the next identifier in our “text”; at serving time, this prediction is the actual recommendation.
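To make the next-ID prediction task concrete without a full LSTM (whose code the post does not include), here is a toy frequency baseline over the same “sentences”. It is only an illustration of the task, not our model:

```python
from collections import Counter, defaultdict

def train_bigram(sequences):
    # Count movie -> next-movie transitions over all user "sentences".
    nxt = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            nxt[a][b] += 1
    return nxt

def recommend_next(model, last_movie, n=3):
    # The n most frequent successors of the last liked movie.
    return [m for m, _ in model[last_movie].most_common(n)]

model = train_bigram([[100, 200, 123], [100, 10, 300], [100, 200, 5]])
recs = recommend_next(model, 100, n=2)
# recs == [200, 10]
```

An LSTM replaces these raw counts with a learned distribution over the next identifier, conditioned on the whole history rather than just the last movie.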

More than a year ago, we took the MovieLens dataset and a torch7-based NLP project and performed the above procedure to obtain our first movie recommender prototype.

But we wanted more, both in terms of the quality of the recommendations, and in the way we utilize Deep Learning techniques for our task.

Movies plus tags, putting it all together

The hypothesis is that by allowing the user to operate with both movies and tags, we speed up his way to a list of relevant movies reflecting his current mood.

This gives rise to the task of constructing a neural network architecture capable of working with both of these entities. See figure 1.

Fig. 1. The neural network architecture that recommends movies to a user who has chosen (liked) three movies (Avatar, District 9, I Am Legend) and two tags (Dystopia, Police). The “Emb” block stands for Embedding, “Avg” for Average, “FC” for Fully Connected.

With each movie the user has liked, a fixed, predefined set of tags is associated. Both movies and tags are embedded, i.e. mapped from movie and tag identifiers to fixed-size vectors. The tag vectors obtained from the embedding are averaged. So, for each movie the user liked, the LSTM cell takes as input the concatenation of the following vectors:

Movie embedding vector

Average of the following vectors:

Tags embedding vectors associated with the current movie

Tags embedding vectors associated with the next movie in a sequence (in figure 1, we call these vectors “tags of future movie in a sequence”)
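A shape-level NumPy sketch of this input construction, following the description above. The embedding size and lookup tables are toy placeholders; the real model’s dimensions are not stated in the post:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4  # toy embedding size, for illustration only
movie_emb = {100: rng.normal(size=DIM)}
tag_emb = {t: rng.normal(size=DIM) for t in ("dystopia", "alien", "police")}

def lstm_input(movie_id, current_tags, next_tags):
    # Concatenate the movie embedding with the average of the tag
    # embeddings of the current movie and of the next movie in the sequence.
    tag_vectors = [tag_emb[t] for t in list(current_tags) + list(next_tags)]
    return np.concatenate([movie_emb[movie_id], np.mean(tag_vectors, axis=0)])

x = lstm_input(100, ["dystopia", "alien"], ["police"])
# x.shape == (8,): the movie vector (4) plus the averaged tag vector (4)
```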

The output of the 2-layer LSTM (the output vector of the upper-right LSTM cell) goes to two separate fully connected (FC) layers. Softmax layers then estimate the “like” probability for each movie and tag in the database. The top N movies and tags are shown to the user.
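The two output heads can be sketched as follows. The weights here are random placeholders; only the wiring (one FC layer plus softmax per entity, then a top-N cut) follows the description:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a logit vector.
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
HIDDEN, N_MOVIES, N_TAGS = 8, 5, 3  # toy sizes
W_movies = rng.normal(size=(N_MOVIES, HIDDEN))  # FC head over the movie catalog
W_tags = rng.normal(size=(N_TAGS, HIDDEN))      # FC head over the tag vocabulary

h = rng.normal(size=HIDDEN)            # output vector of the top LSTM cell
p_movies = softmax(W_movies @ h)       # "like" probability for each movie
p_tags = softmax(W_tags @ h)           # "like" probability for each tag
top_movies = np.argsort(p_movies)[::-1][:2]  # top N movie indices shown to the user
```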

Let’s say a few more words about tags. In terms of recommendation quality, tags may be useful even if we do not directly predict them: they give the model additional information that some movies are similar to each other. For example, consider two movies such that no user in the database has liked both of them. The fact that these movies may share a lot of tags gives the system an opportunity to figure out that the movies are, indeed, similar. The type of tags we have just talked about is called “tags associated with movies” (figure 1).

Another scenario is to allow the user to like tags along with movies (“tags chosen by user” in figure 1). Importantly, we can simulate this scenario at the training stage. Initially, the neural network predicts the next movie the user likes based on previously liked movies, but we also know the tags of the next movie in the sequence. Therefore, for a significant fraction of the training time, we can force the model to solve the following problem: knowing the movie history of a user and some set of tags associated with the next movie in the sequence, guess what exactly the next movie is. (It would be convenient to formulate this in terms of conditional probability, but in this post we decided to do without formulas; if there is interest, we will write a more technical post.) Also note that in this scenario there may be no liked movies at all: the user, for example, chooses a group of diverse tags and still receives recommendations.

Technology stack and training data

Our deep model is an LSTM-based neural network built using the TensorFlow framework.

To create the training data, we used the MovieLens dataset, from which we took users’ movie preferences. We parsed IMDB and used The Movie DB API to build the tags database.

The API interacts with TensorFlow through ZeroMQ, and Elasticsearch acts as the storage for information retrieval about the movies.

The frontend is made using Vue.js and Element UI.

Movix.ai features