This is the fourth post in the series Github Network Science, in which we use only ForkEvents from the year 2017 to create a project rating that is better than project star count. In this post we’ll look at scaling graph queries that rapidly expand into many possible nodes and edges. Our previous work suffers from a common problem in querying graph databases… N² or quadratic complexity. This limits scalability to the point that the code cannot run reliably. We’ll look at the problem and work out strategies to overcome it.

Specifically, we’ll revisit the work to add the co_forked edge from the previous post owing to problems I encountered in scaling the query to run on the full dataset of 4,402,402 users and 4,071,996 repositories. We jumped the gun on publishing that post, but we’re leaving it as is rather than changing it to demonstrate how data science actually works: getting your hands dirty through exploratory data analysis and experimentation.

Data Analysis vs Data Science

Often, what seems simple at first becomes complex as you explore it more deeply. Data science is iterative, part engineering and part science. Predicting the amount of work for a given task is difficult, as problems have a habit of pulling you down the rabbit hole of complexity.

Often you’ll have to cut the effort off and give up on a particular tactic when it becomes too complex. You have to rethink things, find another path. I nearly had to give up on this blog post, as it took several weeks to write… but with persistence the problem unraveled.

Along these lines, it is worth noting that while an excellent analyst might learn Gremlin to query graphs directly, they would not be able to optimize this query once problems popped up. That is the work of a data scientist, who would likely revel in the complexity of the problem (as I did). That problem follows.

The N² Edge Problem

Getting back to our problem, it can be summarized in a tweet:

Our strategy was to create a co_forked edge between all users who fork the same repository, and then use these edges to compute a centrality metric for projects. Inspecting the most popular projects on Github shows that Bootstrap and Tensorflow have been forked nearly 58,000 and 48,000 times respectively. That means that for Bootstrap, there is a group of 58,000 users that all need links between one another. That amounts to 58,000² edges, or 3.36 billion edges… a scale that would require an enormous cluster to query and that would challenge even the most modern graph databases.

58,000² = 3,364,000,000 edges
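To make the scale concrete, here is a quick back-of-the-envelope check in Python (a sketch; the fork counts are the figures quoted later in this post):

# Back-of-the-envelope: edges implied by fully linking the forkers of one repo
fork_counts = {"bootstrap": 57623, "tensorflow": 47597}

for repo, n in fork_counts.items():
    ordered_pairs = n * n              # the N² figure used above
    unique_pairs = n * (n - 1) // 2    # unique pairs, if edges were undirected
    print(f"{repo}: {ordered_pairs:,} ordered pairs, {unique_pairs:,} unique pairs")

# bootstrap: 3,320,410,129 ordered pairs, 1,660,176,253 unique pairs
# tensorflow: 2,265,474,409 ordered pairs, 1,132,713,406 unique pairs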

After the query ran out of heap space no matter what I tried (including SparkGraphComputer, Gremlin’s Spark connector), I watched an iterative version of the query create tens of millions of edges before my data intuition kicked in: I realized the problem and killed the program.

For reference, the iterative program to add the co_forked edges is below. Instead of appending iterate() to our query, we manually run it in a for loop and make sure to throw away the results each iteration to avoid a memory leak. This strategy can work, but only after we first limit the number of co_forked edges.

edgePairs = g.V().
    hasLabel('repo').
    as('repo1').
    in('forked').
    out('forked').
    where(neq('repo1')).
    as('repo2').
    select('repo1', 'repo2')
    // .iterate()

i = 0
import java.text.NumberFormat

for(java.util.LinkedHashMap edgePair: edgePairs) {
    edgePair.repo1.addEdge("co_forked", edgePair.repo2)
    if((i % 1000) == 0) {
        graph.tx().commit()
        print("Committed edge number: " +
              NumberFormat.getIntegerInstance().format(i) + "\n")
    }
    i = i + 1
}

Data Intuition

As I work with the Github Archive data more and more, it informs my intuitive mind and I develop data intuition. Data intuition is “right-brained” thinking about data, informed by every experience you’ve had working with it. It will guide the strategies you take, it will detect clues in the data and it will nudge you in the right direction. Data intuition is system one working, leading to new ideas for system two to use.

Now that I’m a bit more familiar with the Github data, it is clear in hindsight that scalability was an obvious obstacle for my original strategy. The lesson here is to get to know your dataset by viewing, exploring, visualizing, modeling and playing around with it. This builds the data intuition that will dramatically increase your efficiency. It takes six months to a year to learn a business and its data well enough to operate at peak ability. The best data scientists have strong data intuition.

Semantics of a Solution

To find a solution to the problem, we need to limit the size of any group of N nodes so that the N² edges between them are not too large to compute. But there’s a catch… we can’t alter the semantics of our query such that it no longer answers our question. So for each strategy we come up with, we need to decide what its implied semantics are and whether they fit our intentions for the query. In the same way, our intentions for the query guide the solutions we come up with. Solutions and semantics are tightly coupled.

Data explosion projecting links between groups in social networks

Take One: Filtering User Supernodes

At first the solution seemed to be to remove user supernodes. I reasoned that if a user forks thousands of projects, he inflates the pool of co-forkers dramatically, at a rate of O projects × P forkers per project. I started by altering the query to remove users with large numbers of forked projects.

Bootstrap had been forked 57,623 times.

g.V().
    hasLabel('repo').
    as('repo1').
    in('forked').
    where(
        outE('forked').
        count().
        is(lt(100))
    ).
    where(
        neq('repo1')
    ).
    as('repo2').
    select('repo1', 'repo2')

This optimized things a little bit, and the altered semantics aren’t so bad… we’ve only filtered what are likely robots plus a few die-hard open source fans that constantly fork projects on Github. But it turns out that removing user supernodes doesn’t really address the problem.

The problem wasn’t supernodes; the real problem was the number of co-forkers on a popular project: up to 58,000. Reducing the number of users in a project by a few (there usually aren’t many supernodes) will hardly impact the problem.
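A quick calculation shows why (a sketch; 1,000 removed forkers is a deliberately generous guess, the real number of supernodes is far smaller):

n = 57623          # Bootstrap's forkers
removed = 1000     # generous guess at supernode forkers removed
reduction = 1 - ((n - removed) ** 2) / (n ** 2)
print(f"pair reduction: {reduction:.1%}")  # ~3.4% -- still billions of pairs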

Tensorflow had been forked 47,597 times.

We need to further limit the number of users to reduce these N² pools to something manageable.

Take Two: Filtering by User Centrality

So I went back to the event data to enrich the model. The second solution employed an unused property of the ForkEvent data to link repositories to their owners with owned edges, to create fan edges between forkers and owners, and then to run a centrality algorithm on the fan graph. This gives us a node property against which to filter users before proceeding to create our co_forked edges.

To restate that: once we have user-owned->project edges, we will use them to add fan edges between the user who forks a project and that project’s owner. This will require that we run pyspark again (see build_network.spark.py) to compile the new owned edge data, and then alter our load script (see load_janus.groovy) to add these edges. Let’s dive into that.

Extracting Repository Owners

The new pyspark script is below. We extract the owner of the repository from each ForkEvent by splitting it out of the original repo name. Here the own_events are extracted from ForkEvents.



# See https://developer.github.com/v3/activity/events/types/#forkevent
fork_events = github_events.filter(lambda x: "type" in x and x["type"] == "ForkEvent")
own_events = github_events.filter(lambda x: "type" in x and x["type"] == "ForkEvent")

# See https://developer.github.com/v3/activity/events/types/#watchevent
star_events = github_events.filter(lambda x: "type" in x and x["type"] == "WatchEvent")

Here the own_event records are created by splitting the owner from the repo name in the ForkEvent records.

own_events = own_events.filter(lambda x: "repo" in x and "name" in x["repo"] and "/" in x["repo"]["name"])
own_events = own_events.map(
    lambda x: frozendict(
        {
            "owner": x["repo"]["name"].split("/")[0] if "repo" in x and "name" in x["repo"] else None,
            "repo": x["repo"]["name"] if "repo" in x and "name" in x["repo"] else None
        }
    )
)
own_events = own_events.filter(lambda x: x["owner"] is not None and x["repo"] is not None)

Here we get the unique owners list and store it as JSON.

own_events = own_events.distinct()
own_events_lines = own_events.map(lambda x: json.dumps(x, default=json_serialize))
own_events_lines.saveAsTextFile("data/users_owned_repos.json")

We must also add any users that appear in the owner field of our fork_events relation, otherwise we will see errors when we add these edges because project owners that have never forked a project will be missing from the graph.

# We must get any users appearing in either event type
fork_users = fork_events.map(lambda x: frozendict({"user": x["user"]}))
fork_owner_users = fork_events.map(lambda x: frozendict({"user": x["owner"]}))
star_users = star_events.map(lambda x: frozendict({"user": x["user"]}))
star_owner_users = star_events.map(lambda x: frozendict({"user": x["owner"]}))

users = sc.union([fork_users, star_users, fork_owner_users, star_owner_users])
users = users.distinct()

users_lines = users.map(lambda x: json.dumps(x, default=json_serialize))
users_lines.saveAsTextFile("data/users.json")

And while we’re at it, we do the same thing for any new repos that may appear (they shouldn’t, but the data may be corrupted or the semantics may change later).

# We must get any repos appearing in either event type
fork_repos = fork_events.map(lambda x: frozendict({"repo": x["repo"]}))
own_repos = star_events.map(lambda x: frozendict({"repo": x["repo"]}))
star_repos = star_events.map(lambda x: frozendict({"repo": x["repo"]}))

repos = sc.union([fork_repos, star_repos, own_repos])
repos = repos.distinct()

repos_lines = repos.map(lambda x: json.dumps(x, default=json_serialize))
repos_lines.saveAsTextFile("data/repos.json")

Once we run this on all the Github events for 2017, we load this data into the graph. This only takes a couple of hours on our new deep learning machine, whereas before it took almost twenty-four hours! :)

When I ran these updates, I fixed a bug where null repos weren’t being filtered. This reduces the size of users.json to 4,402,402 users in 95MB and repos.json to 4,071,996 repos in 144MB. This is a welcome reduction. users_owned_repos.json is 2,263,684 edges in 130MB, and the others are the same as in the first post.

Loading Owners into JanusGraph

Now that we’ve computed the raw data linking forks to project owners, we need to load it into JanusGraph and our data model. This means creating two new edge types. If we used the forked edge type, which currently links users and projects, to also link users with other users, we wouldn’t be able to easily write queries that traverse only one or the other relationship. So we add two edge types to our graph: one to model users owning projects, user-owned->repo, and one to model user fans of owners, user-fan->user (see setup_janus.groovy).

Users forking other users did not make sense in our specific use case

// Relationships
forked = mgmt.makeEdgeLabel('forked').multiplicity(SIMPLE).make()
co_forked = mgmt.makeEdgeLabel('co_forked').multiplicity(SIMPLE).make()
starred = mgmt.makeEdgeLabel('starred').multiplicity(SIMPLE).make()
owned = mgmt.makeEdgeLabel('owned').multiplicity(SIMPLE).make()
fan = mgmt.makeEdgeLabel('fan').multiplicity(SIMPLE).make()

Then we need to add new properties to store the centrality values we’re going to compute. We store these values, although we could compute them on the fly when we need them, because storing them is efficient and avoids heap space errors. We’re also updating the schema to have more than one ‘eigen’ value.

// Metric node properties
fanDegreeCentrality = mgmt.makePropertyKey('fan_degree').dataType(Integer.class).make()
fanEigenvectorCentrality = mgmt.makePropertyKey('fan_eigen').dataType(Integer.class).make()
coforkEigenvectorCentrality = mgmt.makePropertyKey('co_fork_eigen').dataType(Integer.class).make()
stars = mgmt.makePropertyKey('stars').dataType(Integer.class).make()

We also need to make changes to load_janus.groovy to create the owned and fan edges and to add any projects/owners that appear in owner records to the repos and users collections. This works just like the process for forks and stars.

// Create owned edges between users and the repos they own
ownEdgesFilename = "../data/users_own_repos.jsonl"
ownEdgesReader = new BufferedReader(new FileReader(ownEdgesFilename))

i = 0
while((json = ownEdgesReader.readLine()) != null)
{
    document = jsonSlurper.parseText(json)

    // Fetch the user and repo
    user = g.V().has('userName', document.user).next()
    repo = g.V().has('repoName', document.repo).next()

    user.addEdge("owned", repo)

    if(i % 1000 == 0) {
        graph.tx().commit()
        str = NumberFormat.getIntegerInstance().format(i)
        println(str + "O")
    }
    i++
}

User Fans

Our data model now includes edges both matching users with the projects they own and fork, and we have the raw data to query and fill the model. We will use the user-forked->repo and user-owned->repo edges to derive a user->fan->user graph which can then give us centrality scores to help filter users for our co_forked edges.

From users owning and forking repositories to users fanning other users…

Because one side of this edge projection has one and only one node (the single owner of the repository), it avoids the N² problem and will scale much more easily. In fact, there will only be as many fan links as there are forked links, and these are basically a shortcut that we could do without. We include them for illustration and query simplicity. It is often better to build up your data model than to always derive it on the fly.
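A toy example makes the difference plain (a sketch with made-up names): co-forker pairs grow quadratically with the number of forkers, while fan edges grow linearly.

owner = "twbs"
forkers = [f"user{i}" for i in range(5)]

# Co-forker projection: every forker pairs with every other forker... O(N²)
co_pairs = [(a, b) for a in forkers for b in forkers if a != b]

# Fan projection: every forker links to the one owner... O(N)
fan_edges = [(forker, owner) for forker in forkers]

print(len(co_pairs), len(fan_edges))  # 20 vs. 5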

We include a bit of code to print a progress log and to commit a transaction every 1,000 records, although we have Cassandra batching turned on for every 10,000 records, so Cassandra will go from memory to disk once every 10 of these calls.

count = 0

g.V().
    hasLabel('user').
    as('user1').
    out('owned').
    in('forked').
    as('user2').
    addE('fan').  // the forker (user2) becomes a fan of the owner (user1)
    to('user1').
    choose(
        filter{it->count+=1; count%1000 == 0},
        __.map{it->println(count); g.tx().commit(); it.get()},
        __.identity()
    ).
    iterate()

print(count)

Next we check the count to verify it is good.

// Verify the count is correct
fanCount = g.E().
    hasLabel('fan').
    count().
    next()

assert(fanCount == 9451177)

Now that our fan links are in place, we can compute a centrality score to filter our users. Note that only 9,451,177 out of 11,366,334 user-forked->repo connections have owners, indicating a problem with the original ForkEvents or with how we’re processing them. For now we’ll leave well enough alone and move on, but in production you would need to investigate this issue.
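A quick check quantifies the gap (counts from above):

forked_edges = 11366334  # user-forked->repo edges in the graph
fan_edges = 9451177      # fan edges we were able to create
print(f"{fan_edges / forked_edges:.1%} of forks resolved to an owner")  # ~83.2%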

Computing User Fan Centralities

What we’re looking for in our filter is to reduce the noise from unimportant users. What makes a user unimportant for our analysis? I think a global project rating should be based on the opinions (expressed through their actions) of more experienced developers. The number of projects you have forked is a good proxy for experience, so a centrality measure on fan edges makes sense. But which one?

Four centrality algorithm implementations are available in the Gremlin docs, although closeness and betweenness are extremely compute intensive and hard to run on real datasets. This leaves degree and eigenvector centrality. As the figure below suggests, degree centrality would spread the filtered users more evenly across the graph, while eigenvector centrality would focus on a core of experienced engineers working with other experienced engineers. Degree centrality feels more democratic, so I tried it first. I also ran eigenvector centrality out of curiosity.

Eigenvector Centrality (C) vs Degree Centrality (D)

I used in-degree because prestige across a fan link should flow from the fan to the owner. A good starting place is to filter the users down to only those who have created a repository and had someone else fork it. This makes them an open source contributor, and those are the opinions that should contribute to our project rating.
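To picture what fan in-degree measures, here is a tiny pure-Python version (made-up edges; prestige flows from the fan on the left to the owner on the right):

from collections import Counter

# (fan, owner) pairs: each fork makes the forker a fan of the repo's owner
fan_edges = [
    ("alice", "tensorflow"),
    ("bob", "tensorflow"),
    ("carol", "tensorflow"),
    ("alice", "bob"),
]

in_degree = Counter(owner for fan, owner in fan_edges)
print(in_degree.most_common())  # [('tensorflow', 3), ('bob', 1)]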

Degree Centrality

To begin we calculate the top 20 users by degree centrality (see user_fan_centrality.groovy).

// Inspect a top 20 degree centrality of fans
g.V().
    hasLabel('user').
    group().
    by('userName').
    as('owner').
    by(
        inE('fan').
        count()
    ).
    as('degree').
    select('owner', 'degree').
    order().
    by(
        select('degree'),
        decr
    ).
    limit(20)

First we filter to just user nodes with hasLabel('user'). Then we group our vertices with group() and group them by their userName so the results use names instead of unintelligible vertex IDs; we label these with as('owner') to access them later. We group by the fan in-degree using inE('fan').count() and name this with as('degree'). Finally, we select('owner', 'degree') so we can order() the results by degree in decreasing order using by(select('degree'), decr), and we limit(20) the results to something manageable.

This results in:

==>[user:CNXTEoE,degree:62419]
==>[user:tensorflow,degree:37543]
==>[user:korolr,degree:34538]
==>[user:jenniemanphonsy,degree:31182]
==>[user:facebook,degree:27414]
==>[user:jtleek,degree:21805]
==>[user:CNXTEoEorg,degree:20888]
==>[user:bestwpw,degree:20109]
==>[user:AlexxNica,degree:19912]
==>[user:vuejs,degree:19820]
==>[user:SmartThingsCommunity,degree:19228]
==>[user:angular,degree:19091]
==>[user:github,degree:18674]
==>[user:roscopecoltran,degree:18547]
==>[user:alibaba,degree:17746]
==>[user:spring-projects,degree:17304]
==>[user:octocat,degree:17197]
==>[user:rdpeng,degree:16423]
==>[user:twbs,degree:16212]
==>[user:PlumpMath,degree:15494]

Some of these users make sense and some don’t. Note that the top user, CNXTEoE, is not present on github.com and is therefore likely a spam user. Tensorflow, the number two user, makes good sense. Fan centrality seems to have a spam problem, but it may still work for our purposes, as it is only a filter for our pool of voting users.

Degree fan centrality has shortcomings as an overall metric

We’ll write the fan_degree to the user node so we can easily filter by it later (see user_fan_centrality.groovy).

// Compute degree centrality of users via user-fan->user degree
// and write it to the 'fan_degree' node property
count = 0

g.V().
    hasLabel('user').
    group().
    by().
    by(
        inE('fan').
        count()
    ).
    unfold().
    as('kv').
    select(keys).
    property(
        'fan_degree',
        select('kv').
        select(values)
    ).
    choose(
        filter{it->count+=1; count%1000 == 0},
        __.map{it->println(count); g.tx().commit(); it.get()},
        __.identity()
    ).
    iterate()

println(count)

Let’s break this out, step by step, from the previous query. After the last by(), the unfold() call flattens the Map emitted by group() into its individual key/value entries. Each entry is named kv by as('kv'), and its key (the vertex) is selected with select(keys). Then property('fan_degree', select('kv').select(values)) writes the value of kv onto the node’s fan_degree property. That completes the query.
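If the unfold()/select(keys)/select(values) dance is hard to follow, a rough Python analog of what the traversal does might help (vertices as plain dicts; all names illustrative):

# What group().by().by(inE('fan').count()) builds: a vertex -> count map
fan_degree = {"v1": 3, "v2": 1}

# Stand-ins for graph vertices and their property maps
vertices = {"v1": {}, "v2": {}}

# unfold() iterates the map entries; select(keys) picks out the vertex,
# and property(...) writes select(values) onto it
for vertex_id, degree in fan_degree.items():
    vertices[vertex_id]["fan_degree"] = degree

print(vertices)  # {'v1': {'fan_degree': 3}, 'v2': {'fan_degree': 1}}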

Eigenvector Centrality

Next we compute the top twenty Github users by eigenvector centrality. Because we limit the computation, it only takes thirty seconds (see user_fan_centrality.groovy).

Eigenvector Fan Centrality for Thirty Seconds

// Inspect a top 20 eigenvector centrality of fans
g.V().
    hasLabel('user').
    repeat(
        groupCount('m').
        by('userName').
        out('fan').
        timeLimit(30000)
    ).
    times(5).
    cap('m').
    order(local).
    by(values, decr).
    limit(local, 20).
    next()

You can see the PageRank-like walks across the fan edges occurring in repeat(groupCount('m').by('userName').out('fan')) and times(5). We limit the computation to 30 seconds with timeLimit(30000), which seems to work well enough; even 1 second produced good top 20 results. cap('m') fetches the groupCount('m') results, which are then ordered by their values in decreasing order and limited to the top twenty results.
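Conceptually, the repeat()/groupCount() traversal counts node visits during fixed-length walks across fan edges, which approximates eigenvector centrality. A minimal Python sketch of the same idea (toy adjacency list with made-up edges, no time limit):

from collections import Counter

# Toy fan graph: who each user fans out to
out_fan = {
    "a": ["facebook", "google"],
    "b": ["facebook"],
    "c": ["facebook", "b"],
    "facebook": [],
    "google": [],
}

def walk_count_centrality(adj, hops=5):
    # Mirrors repeat(groupCount('m').out('fan')).times(5): count visits
    # as traversers walk outward; traversers with no out-edges die off
    m = Counter()
    frontier = Counter({node: 1 for node in adj})  # one traverser per node
    for _ in range(hops):
        m.update(frontier)                         # groupCount('m')
        next_frontier = Counter()
        for node, count in frontier.items():
            for neighbor in adj[node]:             # out('fan')
                next_frontier[neighbor] += count
        frontier = next_frontier
    return m

print(walk_count_centrality(out_fan).most_common(3))  # facebook ranks first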

The results present a surprise: in a ten-second run, Microsoft edges out Facebook and Google in fan centrality. In a thirty-second run, Facebook pulls ahead of both. The thirty-second results look like:

==>facebook=1931
==>Microsoft=1714
==>sindresorhus=1507
==>google=1437
==>zeit=1161
==>apache=1155
==>kubernetes=1049
==>vuejs=966
==>github=789
==>webpack=784
==>airbnb=763
==>tensorflow=730
==>golang=714
==>jenkinsci=707
==>facebookincubator=690
==>babel=669
==>reactjs=642
==>atom=620
==>coreos=620
==>jenkins-infra=614

Writing the eigenvector centrality computation to the nodes looks like this (see user_fan_centrality.groovy):

// Calculate eigenvector centrality on the user-fan->user edges
// and write it to the 'fan_eigen' node property
count = 0

g.V().
    hasLabel('user').
    repeat(
        groupCount('m').
        by().
        out('fan').
        timeLimit(30000)
    ).
    times(5).
    cap('m').
    unfold().
    as('kv').
    select(keys).
    property(
        'fan_eigen',
        select('kv').
        select(values)
    ).
    choose(
        filter{it->count+=1; count%1000 == 0},
        __.map{it->println(count); g.tx().commit(); it.get()},
        __.identity()
    ).
    iterate()

println(count)

The query extends the previous one: its cap('m') result is flattened by unfold().as('kv') into the <String, Object>Map entries containing the counts for each node. As with degree centrality, property('fan_eigen', select('kv').select(values)) writes the centrality score to the fan_eigen property of the node.

Back to Projecting Co-forks!

Now that we have our centrality scores for user-fan->user , we can filter the nodes down to some reasonable number of veteran collaborators and re-run our query to expand user-forked->repo edges into repo-co_forked->repo edges. We’ll employ the fan_degree and fan_eigen properties to filter users down to something more manageable before we explode them into N² form when creating our co_forked edges (see post number three).

The eigenvector fan centrality looks best at ranking users who contribute to large projects, but the degree centrality may be better for our purposes. Let’s see how far the two metrics we’ve computed filter the nodes:

g.V().has("fan_degree").count()
==>4,401,145

g.V().has('user', 'fan_degree', gt(0)).count()
==>1,015,514

g.V().has("fan_eigen").count()
==>84,863

Let’s try the fan_eigen property, since it is smaller. Below, we begin our co-fork query from the last post, only this time when we walk the repo<-forked-user edges via in('forked'), we filter the users by the presence of the fan_eigen property. This reduces the pool of users for our co-forking from 4.4 million to 84,000, which should square much more nicely (see create_co_forks.groovy).

count = 0

g.V().
    hasLabel('repo').
    as('repo1').
    in('forked').
    has('fan_eigen').

There’s another part to the query, from our first attempt at filtering, that we can add to ensure a single super-user who has forked thousands of repos doesn’t balloon into millions of co_forked edges. We’ll limit users to those who have forked 100 repos or fewer. This is called a supernode or whale filter, here applied to users. Note that we can’t use a whale filter on repos, as repos with many forkers are the very repos most likely to have high ratings.

    where(
        outE('forked').
        count().
        is(
            lt(100)
        )
    ).

The rest of the query is the same as before:

    out('forked').
    where(neq('repo1')).
    as('repo2').
    addE('co_forked').
    to('repo1').
    choose(
        filter{it->count+=1; count%1000 == 0},
        __.map{it->println(count); g.tx().commit(); it.get()},
        __.identity()
    ).
    iterate()

Checking the results shows that we’ve created 8.26 million co_forked edges:

g.E().hasLabel('co_forked').count()
==>8,263,332

Back to Co-Fork Centrality

Now we are back to where we were in the previous post: at the point of computing a co_forked centrality (see co_forked_centrality.groovy). We start by computing the top twenty repos given by a one minute (60,000 millisecond) run of out-bound eigenvector centrality across the co_forked edges (see create_co_forks.groovy).

// Get top 20 by co_forked centrality
g.V().
    hasLabel('repo').
    repeat(
        groupCount('m').
        by('repoName').
        out('co_forked').
        timeLimit(60000)
    ).
    times(5).
    cap('m').
    order(local).
    by(values, decr).
    limit(local, 20).
    next()

The results were surprising, but show that we’re not done yet:

==>reactnativecn/react-native-guide=895
==>vuejs/vue=570
==>ElemeFE/element=541
==>webpack/webpack=519
==>nodejs/node=509
==>facebook/react-native=507
==>vuejs/awesome-vue=506
==>facebook/react=490
==>tensorflow/tensorflow=485
==>facebookincubator/create-react-app=466
==>mrdoob/three.js=463
==>airbnb/javascript=459
==>vuejs/vuex=459
==>angular/angular=439
==>Microsoft/vscode=439
==>thank-you-github/thank-you-github=438
==>github/gitignore=432
==>Microsoft/TypeScript=431
==>atom/atom=430
==>callemall/material-ui=430

Next we’ll commit this rating to a property called co_fork_eigen so we can compare it to stars in our next post.

// Write eigenvector centrality to the co_fork_eigen node property
count = 0

g.V().
    hasLabel('repo').
    repeat(
        groupCount('m').
        by().
        out('co_forked').
        timeLimit(60000)
    ).
    times(5).
    cap('m').
    unfold().
    as('kv').
    select(keys).
    property(
        'co_fork_eigen',
        select('kv').
        select(values)
    ).
    choose(
        filter{it->count+=1; count%1000 == 0},
        __.map{it->println(count); g.tx().commit(); it.get()},
        __.identity()
    ).
    iterate()

Conclusion

After a major diversion and a lot of trouble, we’ve managed to create an initial version of the metric we set out to create. Along the way we overcame an N² scalability limit that stood in the way of our chosen strategy. These kinds of problems are common, so take note of how we worked around them for your own graph explorations.

In the next post, we’ll evaluate our metric and continue to improve it by comparing it with the star count it proposes to replace.

Need Help?

Need help with graph analytics, data visualization or predictive analytics products? My consultancy Data Syndrome is here to help! We are experts in graph analytics, graph databases and link prediction. We can help accelerate your existing project or we can start and finish a new one. We specialize in building analytics products from end-to-end, and we’re just an email away: rjurney@datasyndrome.com.

Special Thanks

A special thanks to my friend Andrew Blevins, who helped with the strategy to solve the user filtering/scaling problem! Also thanks to the Gremlin mailing list, and to Kelvin Lawrence for his helpful book, Practical Gremlin.