In this post we will take the user--forked-->repo / user--starred-->repo network we created in the previous post and query it to produce a single-mode repo<-->repo network with a new edge type called co_forked. This new edge type will represent repositories that have been forked by the same user, indicating a strong link between them in terms of user interest. Using these links in the next post, we will compute our new project metric before evaluating the performance of the rating against stars.

The code for this series of blog posts (and all my Github experiments) is here: https://github.com/rjurney/github_network. If you'd like, you can skip ahead to the next post.

Switching Storage Engines

While working with the Berkeley DB storage engine, I had repeated problems where the database would become corrupted after an aborted Gremlin session. I aborted many Gremlin sessions because queries would never return, or they would abort themselves with GC overhead limit errors. I ended up reloading the data twenty or thirty times before I gave up, set up Cassandra and moved to the Cassandra storage engine.
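The switch itself is just a matter of pointing JanusGraphFactory at a Cassandra-backed configuration. A minimal sketch, using the properties file referenced later in this post:

// Open JanusGraph over the Cassandra storage engine and get a traversal
graph = JanusGraphFactory.open('conf/janusgraph-cassandra-es.properties')
g = graph.traversal()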

Extracting a Repo-Repo Network

As I mentioned in the first post, we start with a graph of repos and users. Users link to repos by forking them.
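To make that structure concrete, here is a minimal sketch of the two-mode graph in an in-memory TinkerGraph. The alice user, the alice/widget repo, and the userName property are all hypothetical, but the labels match the data model used throughout this series:

// Build a toy two-mode graph: one user, one repo, one forked edge
graph = TinkerGraph.open()
g = graph.traversal()

user = g.addV('user').property('userName', 'alice').next()         // hypothetical user
repo = g.addV('repo').property('repoName', 'alice/widget').next()  // hypothetical repo

user.addEdge('forked', repo)  // users link to repos by forking them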

We want to short-circuit the user node to arrive at this graph so that we can run a centrality algorithm to get a better project rating.

We’re adding the edge in blue

This is a common task when working with property graphs: extracting new networks, defined by new and different semantics, from the master network, and then applying methods designed for networks of a single node type. The toolbox of Social Network Analysis (SNA) is large, while property graph algorithms are fewer and farther between. So we create new one-mode networks to analyze different types of edges, which we define on the fly via queries. Fun!

The query to do this is lengthy (thanks to Robert Dale on the Gremlin users list for helping me figure it out), so I will explain it one section at a time (check out create_co_forks.groovy):

// Add co-forked edges between nodes
g.V().
  hasLabel('repo').
  store('x').
  as('repo1').
  in('forked').
  out('forked').
  where(without('x')).
  as('repo2').
  addE('co_forked').
  to('repo1').
  iterate()

We begin with g.V().hasLabel('repo').store('x').as('repo1'), in which we start by selecting all vertices in the graph and then filtering them down to only the repo nodes. The hasLabel('repo') step is a has step, which filters elements by their properties, in this case their label (user or repo). We set these labels when we loaded the data. The store('x') step is a side effect that stores the results at that step (in this case vertices) for use in a later operation. Finally, we label them as('repo1'). These nodes will form the left-hand side of each repo-repo edge, and they will be available under the key 'x' in a later step.
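As a quick sanity check on the has-step filter (my own sketch, not part of the original script), you can count the repo vertices and peek at a few of them; this assumes each repo carries the repoName property used later in this post:

g.V().hasLabel('repo').count()                       // how many repo vertices we have
g.V().hasLabel('repo').values('repoName').limit(5)   // peek at a few repo names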

Next we travel from the repos across the in-bound forked links towards the users who forked them with .in('forked'). This takes us to the user nodes, but we want a repo-repo network. Therefore we keep going with .out('forked') to explore any other repositories a user may have forked. This takes us back to a repo node, our final destination, and gives us the structure of our repo-repo network.

Step-wise query to compose repo-repo network
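To see the hops in isolation (a sketch of my own; limit(1) just keeps the output small), you can walk out from a single repo and back:

g.V().hasLabel('repo').limit(1).  // start from a single repo
  in('forked').                   // hop to the users who forked it
  out('forked').                  // hop to every repo those users forked
  values('repoName')              // note: this includes the starting repo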

But there’s a catch: if we follow forked in and then forked out, we will follow the link back to ourselves! This is not desirable. To avoid it, we add .where(without('x')), which excludes the starting repos stored under 'x', and we label the remaining nodes .as('repo2').
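On the toy graph sketched earlier, the round trip is easy to see: without the filter, the only "co-fork neighbor" of alice/widget is alice/widget itself.

// From the repo, across its one forker, and straight back to itself
g.V().has('repoName', 'alice/widget').in('forked').out('forked').values('repoName')
// => alice/widget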

Now that we have both repo1 and repo2, we add our new co-forked edge from repo2 back to repo1: addE('co_forked').to('repo1'). This adds the edges as part of our query, which is pretty cool! Originally I exported JSON and loaded it to create the new edges, which was much more painful than adding the edges directly. Now that we have our co_forked edges, we are ready to compute our project rating via eigenvector centrality of the co_forked edges!
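Once the traversal has run, a quick check (my sketch, not from the original script) confirms the new edges landed:

g.E().hasLabel('co_forked').count()  // number of co_forked edges created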

Update: Using SparkGraphComputer

I seem to have gotten lucky on my first attempt to run this query, because after that it did not work. I was forced to use SparkGraphComputer to create an edge list. Because GraphComputer graphs can't use Vertex.addEdge, I had to run the co_forked query without the last three lines.

Instead, I first created a SparkGraphComputer OLAP graph and traversal. After a test count of all vertices to make sure things were working as expected, I computed the co_forked query, this time without adding edges. The configuration file I refer to in the code below is altered to use the github_graph namespace (check out conf/hadoop-graph/read-cassandra-3.properties).

// Use SparkGraphComputer
:plugin use tinkerpop.hadoop
:plugin use tinkerpop.spark

// I edited the keyspace in this file to github_graph
olap_graph = GraphFactory.open('conf/hadoop-graph/read-cassandra-3.properties')

// Get a graph traverser
olap_g = olap_graph.traversal().withComputer(SparkGraphComputer)

// Test things out with a vertex count
assert(olap_g.V().count().next() == 8139595)

// Compute the co-forked pairs, without adding edges
edgePairs = olap_g.V().
  hasLabel('repo').
  store('x').
  as('repo1').
  in('forked').
  out('forked').
  where(without('x')).
  as('repo2').
  select('repo1', 'repo2').  // emit each (repo1, repo2) pair as a map...
  by('repoName')             // ...of repoName values, for the lookups below

Next I spawned a new OLTP graph instance and a corresponding traversal. I could then iterate the OLAP result set, using the new OLTP traversal to search for the vertices by name and then add an edge between them. This is not ideal, because it mixes two Gremlin APIs that aren't supposed to be mixed, but it was the only solution I could find owing to the read-only nature of GraphComputer. (Again, check out create_co_forks.groovy.)

// Setup our OLTP graph instance and OLTP traversal
oltp_graph = JanusGraphFactory.open("conf/janusgraph-cassandra-es.properties")
oltp_g = oltp_graph.traversal()

// Look up each pair of repos by name and link them with a co_forked edge
for(edgePair : edgePairs) {
  repo1 = oltp_g.V().has('repoName', edgePair.repo1).next()
  repo2 = oltp_g.V().has('repoName', edgePair.repo2).next()
  repo1.addEdge("co_forked", repo2)
  oltp_graph.tx().commit()
  print("-")
}

Note that this query took a LONG time. Overnight.
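Part of the slowness is likely the commit after every single edge. A common tweak (my sketch, not the original script) is to batch the commits, say every 10,000 edges:

// A variant of the loop above that commits in batches rather than per edge
counter = 0
for(edgePair : edgePairs) {
  repo1 = oltp_g.V().has('repoName', edgePair.repo1).next()
  repo2 = oltp_g.V().has('repoName', edgePair.repo2).next()
  repo1.addEdge("co_forked", repo2)
  if (++counter % 10000 == 0) oltp_graph.tx().commit()  // batch the commits
}
oltp_graph.tx().commit()  // commit whatever is left over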

This requires that you install both Hadoop and Spark, which increases the complexity of this post :( Fortunately, I have instructions on how to do that as part of the AWS setup scripts for my book, Agile Data Science 2.0, which you can adapt for this purpose.

Conclusion

Here we have demonstrated how to convert a two-mode network into a single-mode network using a graph traversal that defines a new relationship and creates the new edge. This enables you to capture the semantics expressed in any Gremlin query in a new edge type, which you can later reference in additional analytics (as we will do in the next post). This is a pattern in graph analytics of which you should take note.
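For example, once the co_forked edges exist, later queries can traverse them like any other edge label. This sketch reuses the hypothetical alice/widget repo from earlier:

// The co-forked neighborhood of one repo; both() covers either edge direction
g.V().has('repoName', 'alice/widget').
  both('co_forked').   // repos forked by the same users
  values('repoName')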