The Ontology

For this project, I started by creating an ontology with two entity types: user and movie. A user interacts with a movie by contributing a rating to it. A user can be said to “like” a movie if their rating for that movie is greater than or equal to their mean rating across all the movies they have rated; conversely, a user “dislikes” a movie if their rating for it is below that mean. This threshold is a tweakable parameter, by the way; if you wanted to change what counts as “liking” a movie, you could do so in 2 lines of code.
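To make the like/dislike rule concrete, here is a minimal Python sketch of it. This is not the code from the project: the function name `split_likes`, the `threshold` parameter, and the three sample ratings are all made up for illustration.

```python
from statistics import mean


def split_likes(ratings, threshold=None):
    """Partition one user's ratings into liked and disliked movie ids.

    ratings:   dict mapping movieId -> rating for a single user (hypothetical shape).
    threshold: cutoff for "liking" a movie; defaults to the user's own mean rating.
    """
    if threshold is None:
        threshold = mean(ratings.values())
    # "Like" means rating >= threshold; everything else counts as a dislike.
    liked = {movie for movie, rating in ratings.items() if rating >= threshold}
    disliked = set(ratings) - liked
    return liked, disliked


# Made-up ratings for one user; their mean is 4.0.
liked, disliked = split_likes({1: 5.0, 2: 3.0, 3: 4.0})
print(liked, disliked)
```

Passing an explicit `threshold` (say, 4.5) is one way the cutoff could be tweaked without touching the rest of the logic.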

Throughout this post I will point out places where my choice of parameters is adjustable, and where different values may give results better suited to other purposes.

I could have added a third entity type for genre, and that was indeed my first approach, but I found it easier to assign each genre a number between 0 and 18, inclusive, and encode each movie’s set of genres as a bit vector, stored in a resource of datatype long attached to the movie entity in the ontology. For example, in the movies file downloaded from MovieLens, the line

48,Pocahontas (1995),Animation|Children|Drama|Musical|Romance

corresponds to the 1995 movie Pocahontas, which has 5 genres that I encoded in the long as

.....0000101000010001100

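The least significant bit corresponds to the first genre, alphabetically, in the dataset, and each more significant bit corresponds to the next genre in alphabetical order. In the example above, bits 2, 3, 7, 12 and 14 are set (counting from 0 at the right), one for each of the movie’s five genres.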
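The encoding can be sketched in a few lines of Python. This is an illustration rather than the project's own code: the genre list below is my assumption (the 19 named MovieLens genres in alphabetical order, which is consistent with the Pocahontas bit pattern shown above), and the function names are hypothetical.

```python
import csv

# Assumed genre list: 19 named MovieLens genres, alphabetical, numbered 0 to 18.
GENRES = [
    "Action", "Adventure", "Animation", "Children", "Comedy", "Crime",
    "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "IMAX",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western",
]
GENRE_BIT = {name: index for index, name in enumerate(GENRES)}


def encode_genres(genre_field):
    """Turn a pipe-separated genre string into an integer bit vector."""
    bits = 0
    for name in genre_field.split("|"):
        if name in GENRE_BIT:  # ignore labels that are not in the assumed list
            bits |= 1 << GENRE_BIT[name]
    return bits


def parse_movie_line(line):
    """Split one line of the movies file into (movieId, title, genre bits)."""
    # csv.reader copes with quoted titles that contain commas.
    movie_id, title, genre_field = next(csv.reader([line]))
    return int(movie_id), title, encode_genres(genre_field)


movie_id, title, genre_bits = parse_movie_line(
    "48,Pocahontas (1995),Animation|Children|Drama|Musical|Romance"
)
print(format(genre_bits, "019b"))  # prints 0000101000010001100
```

With this representation, testing whether a movie has a given genre is a single bitwise AND, for example `genre_bits & (1 << GENRE_BIT["Drama"])`.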

I included 4 relations in my ontology. Two of the relations are user-movie relations: user liking a movie, and user disliking a movie. The other two are movie relations: recommended movie (given a movie A, relates any other movie that was liked by at least one person who liked A), and neg-recommended movie (given a movie B, relates any other movie that was liked by at least one person who *disliked* B). These latter relations are not hard-coded into the ontology and are instead produced through inference rules in the movieRules.gql file. These relationships can be visualized below:

You can think of this process as a basic clustering algorithm for binary movie classification.
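To show what the two inferred relations compute, here is a plain-Python sketch that uses sets instead of Graql rules. It is only an illustration of the definitions given above, not a translation of movieRules.gql; the function names, the dictionary shapes, and the sample users and movies are all hypothetical.

```python
def recommended(movie, likes):
    """Movies liked by at least one user who also liked `movie`.

    likes: dict mapping user -> set of movies that user liked.
    """
    result = set()
    for liked in likes.values():
        if movie in liked:
            result |= liked
    result.discard(movie)  # a movie is not related to itself
    return result


def neg_recommended(movie, likes, dislikes):
    """Movies liked by at least one user who disliked `movie`.

    dislikes: dict mapping user -> set of movies that user disliked.
    """
    result = set()
    for user, disliked in dislikes.items():
        if movie in disliked:
            result |= likes.get(user, set())
    result.discard(movie)
    return result


# Made-up toy data: three users, five movies.
likes = {"u1": {"A", "C"}, "u2": {"A", "D"}, "u3": {"E"}}
dislikes = {"u1": {"B"}, "u2": set(), "u3": {"A", "B"}}

print(sorted(recommended("A", likes)))                # ['C', 'D']
print(sorted(neg_recommended("B", likes, dislikes)))  # ['A', 'C', 'E']
```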

— — — — — — — — — — — — —

Let’s get some recommendations!

Now it’s time to see how the program actually works. The first step is to parse every line of the movies file, whose format I showed above, and insert the relevant information into the Grakn graph. The second step is to ingest the ratings data, which comes in the form

1,2968,1.0,1260759200

1,3671,3.0,1260759117

2,10,4.0,835355493

2,17,5.0,835355681

where the 1st column is the userId, the 2nd column is the movieId, the 3rd column is the rating that user gave that movie, and the 4th column is the rating’s timestamp (in seconds since the Unix epoch).
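As a rough sketch of the ingestion step, the rows above can be read with Python's csv module and grouped by user. This only covers the parsing, not the insertion into Grakn, and the `load_ratings` name and the nested-dictionary layout are my own choices for illustration.

```python
import csv
from collections import defaultdict
from io import StringIO

# The four sample rows shown above.
SAMPLE = """1,2968,1.0,1260759200
1,3671,3.0,1260759117
2,10,4.0,835355493
2,17,5.0,835355681
"""


def load_ratings(lines):
    """Return {userId: {movieId: rating}} from an iterable of CSV lines."""
    ratings = defaultdict(dict)
    for row in csv.reader(lines):
        if not row or not row[0].isdigit():
            continue  # skip blank lines and a header row, if the file has one
        user_id, movie_id, rating, _timestamp = row
        ratings[int(user_id)][int(movie_id)] = float(rating)
    return ratings


ratings = load_ratings(StringIO(SAMPLE))
print(dict(ratings))
```

Grouping by user first makes it straightforward to compute each user's mean rating before deciding which ratings count as likes.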

A recommender can only recommend if it is given some inputs on which to base its search. A (small) dataset to train on, if you will. The program I have written takes user input one response at a time and stores it, communicating with Grakn through a command-line Graql query after every input.

The program shows the ‘player’ random movies from the movie dataset and lets them respond in one of three ways. If the player likes the displayed movie, they respond with 'Y' or 'Yes'. If the player dislikes the movie, they respond with 'N' or 'No'. If the player does not know the movie or has no opinion on it, they respond with '?'. The player must give a yes/no response to n movies, at which point the engine calculates the recommendations.
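The question loop might look something like the Python sketch below. It leaves out the Graql query sent after each answer and makes a few assumptions of its own: answers are matched case-insensitively, a '?' does not count toward n, and unrecognised input simply triggers the same question again. All names here are hypothetical.

```python
import random

YES = {"y", "yes"}
NO = {"n", "no"}


def parse_response(text):
    """Return True for a like, False for a dislike, None for '?' (no opinion)."""
    answer = text.strip().lower()
    if answer in YES:
        return True
    if answer in NO:
        return False
    if answer == "?":
        return None
    raise ValueError(f"unrecognised response: {text!r}")


def collect_opinions(titles, n, ask=input, rng=random):
    """Show random titles until the player has given n yes/no answers."""
    remaining = list(titles)
    rng.shuffle(remaining)
    liked, disliked = [], []
    while remaining and len(liked) + len(disliked) < n:
        title = remaining.pop()
        try:
            verdict = parse_response(ask(f"Did you like {title}? [Y/N/?] "))
        except ValueError:
            remaining.append(title)  # put the title back and ask again
            continue
        if verdict is True:
            liked.append(title)
        elif verdict is False:
            disliked.append(title)
        # verdict is None: no opinion, so move on without counting it
    return liked, disliked
```

Calling `collect_opinions(titles, n=10)` with a list of movie titles would prompt on the command line; the `ask` parameter exists only so the loop can be driven by scripted answers in a test.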

The choice of n is somewhat arbitrary, and in the shell snippet below I have set it to 10. Think about what happens when you increase or decrease that value, though. If n is too low, say 3, the sample is too small for meaningful content-based filtering, and the recommendations will be based on a less refined picture of the player’s tastes. If n is very high, say 50, it takes the user a long time to respond to every suggestion, and you tend to get many of the same movies recommended every time, since those movies have a lot of connections in the Grakn graph.