In the previous post, we used PySpark to compute a node/edge list from all ForkEvents and WatchEvents for open source GitHub repositories in 2017. In this post, we’re going to load that network into JanusGraph. In the next post, we’ll generate our co-forked network and compute our project rating metric via eigenvector centrality.

Note that I have edited the previous post to also extract WatchEvents, adding a user --starred--> repo edge to the graph so we can compare our new project rating with project stars. If you’d like, you can skip ahead to the next post.

The code for this series of blog posts (and all my GitHub experiments) is here: https://github.com/rjurney/github_network. After this series I’ll be moving on to building a project recommender using deep learning for link prediction. Stay tuned!

Setting Up JanusGraph

I was a huge Titan fan, but it hasn’t merged a commit in two years. Fortunately there is JanusGraph, which has picked up where Titan left off. Check out install_janus.sh, which works like so:



curl -Lko /tmp/janusgraph-0.2.0-hadoop2.zip https://github.com/JanusGraph/janusgraph/releases/download/v0.2.0/janusgraph-0.2.0-hadoop2.zip
unzip -d . /tmp/janusgraph-0.2.0-hadoop2.zip

Then you can access it by running janusgraph-0.2.0-hadoop2/bin/gremlin.sh. Setting up a database schema in JanusGraph was even simpler than doing so in Titan, with most of the commands being identical. See setup_janus.groovy.

graph = JanusGraphFactory.build()\
    .set("storage.backend", "berkeleyje")\
    .set("storage.directory", "data/fork_graph")\
    .open()
g = graph.traversal()

// Set up our graph schema
mgmt = graph.openManagement()

// Vertex labels
user = mgmt.makeVertexLabel('user').make()
repo = mgmt.makeVertexLabel('repo').make()

// Node properties
userName = mgmt.makePropertyKey('userName').dataType(String.class).cardinality(Cardinality.SINGLE).make()
repoName = mgmt.makePropertyKey('repoName').dataType(String.class).cardinality(Cardinality.SINGLE).make()

// Indexes
mgmt.buildIndex('byUserNameUnique', Vertex.class).addKey(userName).unique().buildCompositeIndex()
mgmt.buildIndex('byRepoNameUnique', Vertex.class).addKey(repoName).unique().buildCompositeIndex()

// Metric node properties
degreeCentrality = mgmt.makePropertyKey('degree').dataType(Integer.class).make()
eigenvectorCentrality = mgmt.makePropertyKey('eigen').dataType(Integer.class).make()
stars = mgmt.makePropertyKey('stars').dataType(Integer.class).make()

// Relationships
forked = mgmt.makeEdgeLabel('forked').multiplicity(SIMPLE).make()
co_forked = mgmt.makeEdgeLabel('co_forked').multiplicity(SIMPLE).make()
starred = mgmt.makeEdgeLabel('starred').multiplicity(SIMPLE).make()

// Commit changes
mgmt.commit()

We define two vertex labels, one for users and one for repos, then add a property for the name of each. Next we add unique indexes on these names to speed up lookups when we test things out on a single node. We then add three integer node properties: one for degree centrality, one for eigenvector centrality and one for the number of stars a repo received. Then we add three types of relationships: forked (user → repo), co_forked (repo → repo) and starred (user → repo). Finally we commit all our changes.

Now that we’ve got our graph database model ready to go, we can load the node and edge data and construct our network.

Loading our Graph Data

First we set up a BerkeleyDB database instance in janusgraph-0.2.0-hadoop2/data/fork_graph and initialize a JsonSlurper for reading our JSON files. Note that you want to set storage.batch-loading to true or load times will be very high. Once batch loading is enabled, writes are still buffered in batches of about 1,000 records by default (storage.buffer-size); I turned it up to 10,000 to get better performance.

import groovy.json.JsonSlurper

// Set up our database on top of BerkeleyDB (for now)
graph = JanusGraphFactory.build()\
    .set("storage.backend", "berkeleyje")\
    .set("storage.directory", "data/fork_graph")\
    .set("storage.batch-loading", true)\
    .set("storage.buffer-size", 10000)\
    .open()
g = graph.traversal()

// Setup JSON reading of MongoDB mongodump data
jsonSlurper = new JsonSlurper()

Next we read the users file and create the corresponding user nodes.

// Add user nodes to graph
usersFilename = "../data/users.jsonl"
usersReader = new BufferedReader(new FileReader(usersFilename))
while((json = usersReader.readLine()) != null)
{
    document = jsonSlurper.parseText(json)
    v = graph.addVertex('user')
    v.property("userName", document.user)
    graph.tx().commit()
    print("U")
}
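For reference, each line of these files is a standalone JSON object: users.jsonl records carry a user field, repos.jsonl records a repo field, and the edge files carry both. A quick Python sketch of this newline-delimited JSON format (the sample values below are hypothetical, not from the real dataset):

```python
import json

# Hypothetical sample in the users.jsonl format: one JSON object per line,
# each with a "user" field, just as jsonSlurper.parseText() expects.
sample_jsonl = '{"user": "alice"}\n{"user": "bob"}\n'

users = []
for line in sample_jsonl.splitlines():
    document = json.loads(line)  # mirrors document = jsonSlurper.parseText(json)
    users.append(document["user"])  # mirrors document.user

print(users)  # → ['alice', 'bob']
```

This one-object-per-line layout is why the Groovy loops can simply call readLine() and parse each line independently.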

Next we add the repositories to our graph:

// Add repo nodes to graph
reposFilename = "../data/repos.jsonl"
reposReader = new BufferedReader(new FileReader(reposFilename))
while((json = reposReader.readLine()) != null)
{
    document = jsonSlurper.parseText(json)
    v = graph.addVertex('repo')
    v.property("repoName", document.repo)
    graph.tx().commit()
    print("R")
}

In order to add the edges linking users and repos, we must first traverse the graph to fetch the vertex at each end. For each line of an edge file we look up the user and repo vertices, then add the edge between them. We begin with the forked edges, which record a user forking a repository.

// Create forked edges between users and repos
forkEdgesFilename = "../data/users_forked_repos.jsonl"
forkEdgesReader = new BufferedReader(new FileReader(forkEdgesFilename))
while((json = forkEdgesReader.readLine()) != null)
{
    document = jsonSlurper.parseText(json)

    // Fetch the vertex at each end, then add the edge
    user = g.V().has('userName', document.user).next()
    repo = g.V().has('repoName', document.repo).next()
    user.addEdge("forked", repo)
    graph.tx().commit()
    print("-")
}

Next up we load the starred edges, which map a user starring a repository.

// Create starred edges between users and repos
starEdgesFilename = "../data/users_starred_repos.jsonl"
starEdgesReader = new BufferedReader(new FileReader(starEdgesFilename))
while((json = starEdgesReader.readLine()) != null)
{
    document = jsonSlurper.parseText(json)

    // Fetch the vertex at each end, then add the edge
    user = g.V().has('userName', document.user).next()
    repo = g.V().has('repoName', document.repo).next()
    user.addEdge("starred", repo)
    graph.tx().commit()
    print("-")
}

The call to graph.traversal() gives us our query handle, g. The call to g.V().has('userName', document.user).next() fetches the vertex for the user, and user.addEdge("forked", repo) adds the edge itself. graph.tx().commit() commits the write, which is flushed to disk once the 10,000-record buffer fills.

Finally we verify our node and edge counts to make sure everything loaded correctly. Since all data for 2017 has now been collected, these counts should be exact; if an assertion fails at this stage, you will need to investigate the loading.

userCount = g.V().hasLabel('user').count().next()
assert(userCount == 4067599)

repoCount = g.V().hasLabel('repo').count().next()
assert(repoCount == 4071996)

forkedCount = g.E().hasLabel('forked').count().next()
assert(forkedCount == 11366334)

starredCount = g.E().hasLabel('starred').count().next()
assert(starredCount == 31870088)

Conclusion

Now that our data is loaded, we can get on with creating a co-forked network and generating our new project rating… and that is exactly what we will do in our next post!

Shameless plug: Need help with graph analytics or building analytics applications? Data Syndrome has you covered. We deliver entire applications for hire, and we’re just dying to develop working shiny things for your company! Contact rjurney@datasyndrome.com for more information.