Importing MovieLens into OrientDB graph database

by warriorkitty on September 6, 2015

I’m starting a larger project for my master’s degree that combines my knowledge of machine learning with the OrientDB graph database. GroupLens Research has collected rating data sets from the MovieLens website and made them available. At the time of writing, the MovieLens dataset has 21M ratings and 470k tag applications applied to 27k movies by 230k users. If this dataset is not good for learning topics like machine learning and content-based or collaborative recommendation, I don’t know what is.

My plan is to build a movie recommendation system on the OrientDB graph database that finds new movies and similar users based on your reviews. I’m also planning to do some kind of clustering (for example, these users like horror movies, or these users are most likely serial killers :D).

What do you need to know/have?

A little bit of Java (and lambdas)

A little bit of OrientDB (docs)

IntelliJ IDEA (or Eclipse, if you know how to import the project)

Machine learning algorithms are not “a must” for now

The project is hosted on GitHub. If, after reading this article, you still don’t understand some part, or you have a suggestion, feel free to contact me: davor [ at ] warriorkitty.com. I will reply as soon as possible.

First of all, create a new database (I’m calling mine orientlens): run the server, open localhost:2480, and create the database with any credentials you want.

I *know* that the code you will be looking at could be faster; it is for educational purposes only. I could import users and movies and connect them with fewer queries, but I wanted to go step by step (first import all Users, then import all Movies, then connect them).

Now, let’s look at the source. (If you are using IntelliJ IDEA, press ‘CTRL + SHIFT + N’ and just enter the filename.) If you open the Config class, you will see all the constants. Feel free to change them to suit your needs.
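To give an idea of what to expect, here is a minimal sketch of such a Config class. The constant names are the ones used by the code snippets in this article; every value is a placeholder assumed for illustration, so the actual file in the repository may well differ.

```java
// Hypothetical sketch of a constants-only Config class.
// Names match the snippets in this article; all values are assumed placeholders.
final class Config {
    private Config() {}  // constants only, never instantiated

    // Assumed: a server on localhost and a database named "orientlens".
    static final String DB_URL = "remote:localhost/orientlens";

    // Assumed credentials; use whatever you chose when creating the database.
    static final String DB_USERNAME = "admin";
    static final String DB_PASSWORD = "admin";

    // Assumed connection pool bounds.
    static final int DB_POOL_MIN = 1;
    static final int DB_POOL_MAX = 10;

    // Assumed relative folder with the CSV files. The trailing slash matters
    // because file names are appended directly, e.g. MOVIELENS_PATH + "movies.csv".
    static final String MOVIELENS_PATH = "movielens/";

    public static void main(String[] args) {
        System.out.println(DB_URL + " reads from " + MOVIELENS_PATH + "movies.csv");
    }
}
```

Keeping all settings in one class means that switching to the full dataset, or to a different database, is a one-line change.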

The data is inside the movielens folder. It is a smaller version of the dataset; feel free to replace the files with those from the full dataset. I’m using the smaller dataset in this example because importing the full dataset takes too long.

The Logger class is really simple and it just prints out information messages. If you don’t want to see these messages in the console, just comment out the contents of the log method.

Now, the Main class. The main method is your entry point. The best way to get a Graph instance is through the OrientGraphFactory:

OrientGraphFactory factory = new OrientGraphFactory(
        Config.DB_URL, Config.DB_USERNAME, Config.DB_PASSWORD)
        .setupPool(Config.DB_POOL_MIN, Config.DB_POOL_MAX);
OrientGraph graph = factory.getTx();

OK, now that we have a reference to the database, we can call our custom Worker class, which does most of the work. Inside the Main class, there is something like this:

Worker worker = new Worker(graph);
try {
    worker.addGenres();
    worker.addUsers();
    worker.addMovies();
    worker.connectMoviesWithGenres();
    worker.rateMovies();
    worker.tagMovies();
} catch (IOException e) {
    e.printStackTrace();
}

The Worker class holds the reference to the database. Again, this is a custom class; it does not come from a third-party library or from Java itself. Let’s look at how I added genres. The movies.csv file contains the columns movieId, title, and genres. Genres are separated with “|”. Let’s collect the unique genres with a HashSet and lambdas:

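Stream<String> lines = Files.lines(Paths.get(Config.MOVIELENS_PATH + "movies.csv"));
Set<String> genres = new HashSet<>();
// Skip the header row, then collect the genres of every movie.
lines.skip(1).forEach(line -> {
    // Split only on commas that are outside double quotes.
    String[] tokens = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", -1);
    genres.addAll(Arrays.asList(tokens[2].split("\\|")));
});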

You may be asking yourself what this regex is for, and why it isn’t a simple .split(",") call. Some movie titles contain commas, and those titles are wrapped in double quotes, so we need to ignore commas inside double-quoted titles like “Usual Suspects, The (1995)”.
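To see the difference, here is a small standalone sketch that runs both splits on a row with a quoted title. The regex is the one from the snippet above; the class name, the helper method and the sample row are made up for illustration and are not part of the project.

```java
import java.util.Arrays;

// Illustrative demo only; CsvSplitDemo and splitCsv are hypothetical names.
class CsvSplitDemo {
    // Split on a comma only when an even number of double quotes follows it,
    // which means the comma sits outside any quoted field.
    static final String CSV_SPLIT = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    static String[] splitCsv(String line) {
        // The -1 limit keeps trailing empty fields instead of dropping them.
        return line.split(CSV_SPLIT, -1);
    }

    public static void main(String[] args) {
        // Sample row in the movieId,title,genres layout described above.
        String row = "50,\"Usual Suspects, The (1995)\",Crime|Mystery|Thriller";

        // Naive split: the comma inside the title produces an extra token.
        System.out.println(row.split(",").length);      // 4

        // Quote-aware split: exactly the three columns.
        String[] tokens = splitCsv(row);
        System.out.println(tokens.length);              // 3
        System.out.println(tokens[1]);                  // title, still wrapped in quotes

        // The genres column can then be split on the pipe character.
        System.out.println(Arrays.asList(tokens[2].split("\\|")));
    }
}
```

Note that the quote-aware split keeps the surrounding double quotes in the title token, so they would have to be stripped before storing the title. It also assumes that quotes are balanced within each line.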

Before adding records, we need to create the Genre vertex class. Creating a class is a schema change, which is why I’m executing it outside any transaction:

graph.executeOutsideTx(arg -> {
    OrientVertexType genreClass = graph.createVertexType("Genre");
    genreClass.createProperty("name", OType.STRING);
    return null;
});

Adding genres to the database is simple:

genres.forEach(genre -> {
    try {
        graph.addVertex("class:Genre", "name", genre);
    } catch (Exception e) {
        graph.rollback();
    }
});
graph.commit();

Go ahead and inspect the rest of the code, and feel free to contact me if you have any questions or suggestions.