In this article, I’m going to discuss how to move a table of data stored in CSV format into a GRAKN.AI knowledge graph, then illustrate how to query and visualise it.

The data I’m using can be found on Kaggle here, and is titled “20 Years of Games”. The dataset is a result of a crawl on a well-known web resource for game reviews (IGN) and contains over 18,000 rows of information about games, with columns for: game title, platform, release date, genre, score (IGN review) and IGN score phrase (e.g. “Amazing”), URL and whether the game was selected as an Editor’s Choice game.

The originator of the data says, “By exploring this dataset, one is able to find trends about the industry, compare consoles against each other, search through the most popular genres and more.” Of course, it’s possible to explore a CSV using a range of tools and techniques: Python, R and even Excel, and you can see what others have been doing with the data on the Kaggle site.

I’m using GRAKN.AI, in part to illustrate the migration process, but also to highlight it as an alternative approach that can be used as part of a data manipulation toolbox. One of the advantages of using the Grakn knowledge graph with this data is that you can write straightforward queries using Graql, and use the Grakn reasoner to make inferences across the domain or use Graql analytics to draw valuable insight from the data.

Let’s get started!

The Data

The data is in one CSV file, with 18625 rows and 12 columns. I spotted a few issues that needed minor cleanup before I could migrate the data into Grakn. So I used R to read the CSV and fix the issues described below with a few lines of script, storing into a new CSV file.

Some games that had a pair of “ ” within the game name (e.g. Ivan “Ironman” Stewart’s Super Off Road). On the version of Grakn that I first tested (0.6.1) this would have caused a problem, although in subsequent versions, support has been added. These needed to be removed

A number of games had two genres specified in the single genre column, separated by a comma. These needed to be split into separate columns (genre and genre2)

35 games had no genre specified. I decided to remove these.

Having cleaned up the dataset, I then used R to create two separate CSV files, the purpose of which I’ll explain shortly.

One CSV, genres.csv, contained a column that listed the different genres of games in the dataset (30 genres, from Action to Wrestling).

The second CSV, platforms.csv, contained a column that listed the different platforms on which the games could be played (from Android to XBox One — sadly there were no ZX Spectrum games in this list).

Right, that’s the data sorted. I’m going to migrate 6 columns of data: game title, release year, platform, genre, score and score phrase.

Migration: Setting up an Ontology

The first thing to do is set up GRAKN.AI. If you hit any problems along the way, please do reach out via one of the ways listed on our Community page, or leave a comment to this article below. I downloaded and used version Grakn 0.6.1 and 0.7.0 to write and test this article. I also read the CSV migration documentation before getting started. By the way, we also support migration of SQL, OWL and JSON into Grakn, as described here.

As the CSV documentation explains, there are limitations on the CSV format that prevent it from expressing the semantics of the data. Automatic migration of the data would mean that it stays as unhelpful in Grakn as it is in CSV, so to get the most out of the knowledge graph, I must write an ontology for the data before migrating it into Grakn.

Fortunately, writing an ontology is easy enough, though you need to know the basics of the Grakn object model, which is explained in our documentation. My colleague Miko has also provided an explanation in a series of blog posts.

To illustrate this data, I have opted to have three entity types: game, platform and genre. The game entity has resources to describe its title, score, score phrase, release year. The platform entity just has a name, as does the genre. I then define two relationships between the entities, as follows:

Between a game and a platform — a “compatibility” relationship with roles “supports” for the platform and “runs-on” for the game.

Between a game and a genre — a “has-genre” relationship with roles “genre-of-game” for the genre and “game-with-genre” for the game.

The full Graql is below, and you can find all the code for this example on our sample-projects repo on Github.

insert platform sub entity

has-resource platform-name

plays-role supports; platform-name sub resource datatype string;

supports sub role;

runs-on sub role; genre sub entity

has-resource genre-name

plays-role genre-of-game; genre-name sub resource datatype string;

game-with-genre sub role;

genre-of-game sub role; game sub entity

has-resource title

has-resource score

has-resource score-phrase

has-resource release-year

plays-role game-with-genre

plays-role runs-on; title sub resource datatype string;

score sub resource datatype double;

score-phrase sub resource datatype string;

release-year sub resource datatype string; compatibility sub relation

has-role supports

has-role runs-on; has-genre sub relation

has-role game-with-genre

has-role genre-of-game;

Having defined the ontology, the first thing to do is to start the Grakn engine (typing grakn.sh start into the terminal window from the /bin directory of the Grakn environment), and load the ontology above (stored in game-ontology.gql) into Grakn. From the terminal:

../bin/graql.sh -f ./game-ontology.gql

A Note on Keyspaces

Before I started this project, I made sure that my keyspace, which you can think of as the space where the graph is stored, was clear of any other graph data, by calling grakn.sh clean before I started the Grakn engine. In this example, I’m going to migrate into the default keyspace, so I don’t need to specify it explicitly, but if you decide to use a different keyspace, you can use the -k flag in the call above (e.g. to use the “games” keyspace):

../bin/graql.sh -f ./game-ontology.gql -k games

Migration: Loading the Data

OK, so having set up the game ontology, it’s time to focus on getting the data loaded into the knowledge graph. I’m going to concentrate on loading the simple entities: game genre and platform. I need to use Graql templates, so that I can iterate over each row and insert an entity for each with the correct resource.

For genre, I use the genres.csv file that I created earlier using R to extract all genres from the game dataset. For each of the 30 rows representing genres, I need to create a genre entity and give it a genre-name resource. The Graql template for this is very straightforward:

insert

$x isa genre

has genre-name <genre>;

The Graql above is stored in a template file (genre-template.gql) in the same location as the genres.csv file. To create the entities in the knowledge graph, I call the migration shell script, which is located in the /bin directory of the Grakn environment. The call in the terminal is as follows:

../bin/migration.sh csv -template ./genre-template.gql -input genres.csv -no

Usage of the shell script is documented on our portal, but in effect, what it is doing is taking the template Graql above and applying it to every row of genres.csv, to create entities in the knowledge graph for genres from Action to Wrestling. The -no flag at the end of the line is a “dry run” — it prints out what Grakn would be doing under the hood, so you can “try before you buy”. The output looks good to me, so I remove the -no flag and make the call again.

Next up: the platforms. In exactly the same way as described above for genres, I need to create platform entities, each with a platform-name resource, from the 59 rows of platforms.csv, which I extracted from the game dataset. The Graql template is as follows, and is stored as platform-template.gql:

insert

$x isa platform

has platform-name <platform>;

I then call the migration shell script again, checking the dry run, then remove the flag to create the entities in the knowledge graph.

../bin/migration.sh csv -template ./platform-template.gql -input platforms.csv [-no]

Having dealt with genres and platforms, it is time to create game entities and hook them up with relationships to their corresponding genre and platform entities. The template for this is a little more complex, as I need to do a match-insert query for each game, to match its genre and platform. The template then inserts the game entity with resources for its title, score-phrase, score and release-year. It also adds two edges to represent relationships with the correct platform (“compatibility” relationship) and genre (“has-genre” relationship) entities.

match

$genre has genre-name <genre>;

$platform has platform-name <platform>; insert $x isa game

has title <title>

has score-phrase <score_phrase>

has score @double(score)

has release-year <release_year>;

(game-with-genre: $x, genre-of-game: $genre) isa has-genre;(supports: $platform, runs-on: $x) isa compatibility;

To call this template (match.gql), I invoke the migration shell script again, this time using the complete game data CSV, downloaded from Kaggle and cleaned up using R at the beginning of the article.

../bin/migration.sh csv -template ./match.gql -input ign.csv [-no]

Migrating over 18,500 rows of data takes just under 5 minutes. When migration is complete, the terminal prompt will say “Initiating shutdown” and, at that point, I can start to investigate the dataset, using either the Graql shell (invoked by calling graql.sh from the bin directory of the Grakn environment) or the Grakn visualiser (visit http://localhost:4567 in the browser).

Using Graql to Query the Knowledge Graph

Graql can be used to retrieve explicitly stored and implicitly derived information, as well as to perform graph analytics and automated reasoning. Here are a few queries I ran in the Graql shell, which you may want to try:

How many games in the knowledge graph? (should be 18589)

match $x isa game; aggregate count;

List games by title

match $x isa game, has title $a; select $a;

List games by by decreasing score and title

match $x isa game, has title $a, has score $s; select $a, $s;order by $s desc;

List games by title, score and release year, starting with the most recent first

match $x isa game, has title $a, has release-year $y, has score $s; select $a, $s, $y;order by $y desc;

List games supported by a particular platform, e.g. N-Gage

match (runs-on: $game, $p) isa compatibility; $p has platform-name “N-Gage”; $game has title $a; select $a;

List games of a particular genre, e.g. RPG

match (game-with-genre: $game, $g) isa has-genre; $g has genre-name “RPG”; $game has title $a; select $a;

List games by increasing release year, showing title and platform

match $g isa game, has release-year $ryear, has title $t; (supports:$p, $g); $p has platform-name $pname; select $t, $pname, $ryear; order by $ryear asc;

List all masterpiece games by year, starting with the most recent, showing title and platform.

match $g isa game, has score-phrase “Masterpiece”, has title $t, has release-year $r; (supports:$p, $g); $p has platform-name $pname; select $t, $r, $pname;

The Visualiser

To look at the information returned by the Grakn visualiser, let’s take a simple query: the set of games that IGN considers to be “Masterpieces”. First, navigate to http://localhost:4567/ in your browser to see the Grakn visualiser. There should just be a blank screen until you input a query. Type the query below into the form and hit “Submit”:

match $g isa game, has score-phrase “Masterpiece”, has title $t; select $t;

You should see something along the lines of the image below, and can then explore the nodes to find out about individual games (e.g. their genre) and dig further into the data.

But…why?

You may still be wondering why I’ve bothered with all the above. After all, the data is fine as it is in a CSV file, right? Well yes, it is, but what I can do with the data is very limited. Building an ontology and putting the data into a knowledge graph has allowed me to focus on the relationships within data, and I can build queries around those relationships in a far simpler and more efficient way than if I had kept it as tabulated data. I can now go on to use the Grakn reasoner to make inferences and Graql analytics to explore features of the dataset.

Summary

This article has focussed on the details of migrating a dataset from CSV into GRAKN.AI, and illustrated a few simple queries upon the data. We haven’t delved into what is possible once the data is in the Grakn knowledge graph, but, fear not, that will be the subject of future articles. I hope this introduction has whetted your appetite (please leave me a comment on whether you think I succeeded!) and do check out the Grakn documentation for more information about the topics covered in this post.

Notes

Where a game falls under multiple genres, I have used just the first genre specified, but I could also match genre2 and allow games to have multiple has-genre relationships.

Where a game has been released for multiple platforms, it appears in the CSV once for each platform, and that makes sense. It may be that one platform version is great and scores well, while another scores badly. I could have removed duplicate game entities and allowed instead for a game to have multiple compatibility relationships with platform entities, but decided that the original data did not reflect this approach. Incidentally, the Graql to list out only distinct game titles is as follows:

match $x isa game, has title $a; select $a; distinct;

To count them up, the following can be used. The result should be 13956.

match $x isa game, has title $a; select $a; distinct; aggregate count;