In this series of posts we will use Spark and JanusGraph to apply social network analysis to Github data to create a better Github project popularity ranking than stars provide. This will show what projects are most popular in use by developers, as opposed to casual fans.

I will begin by walking you through calculating the edges of the fork and star graphs using PySpark, then show how to set up a data model in JanusGraph, how to load the data, how to query the data interactively, and finally how to create and evaluate our project rating.

Stars are one way to measure project importance, but I think we can do better!

Why? I received an email from a researcher asking how many stars should count as a significant project, and it got me thinking that using stars wasn't the best approach. While they are a direct metric, they have a very long-tailed distribution that doesn't differentiate the intermediate range that is of interest. So here goes… a better rating for open source Github repositories.

Why Do This?

I am interested in learning to apply deep learning to network analysis, so I’m going to experiment with networks extracted from the Github Archive. This project is a way to get familiar with the dataset. To begin, we’ll show how to use PySpark on 135GB of event data to extract a network composed of users, repositories and the forks that link them. We’ll go on to extract one-mode networks, composed of users and repos separately, from this two-mode one. Then we’ll calculate a centrality score for each repository, which will serve as our project rating.

The original network

This will also set us up to begin further experiments using tools optimized for one-mode networks. In doing so, we’ll show how to extract property graphs from “big data” and then how to apply tools made for one-mode networks (most tools) to analyze them.

The derived user-user and repo-repo networks

Getting the Data

The Github Archive is an excellent source for Github events, saving you the trouble of collecting the events from the Github API. It can be retrieved easily using wget (see download.sh). We will limit our experiment to the Github events for the year 2017.

wget http://data.githubarchive.org/2017-{01..12}-{01..31}-{0..23}.json.gz
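Before fetching anything, you can sanity-check that brace expansion locally. This sketch only generates the candidate URLs (it is an illustration, not the contents of download.sh itself):

```shell
# Generate the candidate 2017 hourly archive URLs without downloading anything.
# Note: brace expansion produces some invalid dates (e.g. Feb 31); those URLs
# simply 404 and wget moves on to the next one.
urls=$(printf 'http://data.githubarchive.org/2017-%s\n' {01..12}-{01..31}-{0..23}.json.gz)
echo "$urls" | head -n 1   # first URL in the expansion
echo "$urls" | wc -l       # 8,928 candidate URLs (12 months x 31 days x 24 hours)
```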

Making Fork and Star Networks from Github Events

With the data in hand, we need to process the raw events to extract a node/edge list to build our network. For this task I started out using Python directly (without Spark), but when I moved to working on 135GB of compressed data for the year 2017, this approach failed to scale.

This is where PySpark shines. Note that we'll be using SparkContext and RDDs instead of SparkSession and DataFrames, as this is easier for nested records like our Github events. See build_network.spark.py.

Starting out, we import frozendict because it is hashable. This enables us to use dictionary records with RDD operations that require hashable values, like distinct(). Then we run a snippet of code designed to initialize our Spark environment whether we're using the pyspark console or spark-submit.

import sys, os, json

from frozendict import frozendict

# If there is no SparkSession, create the environment
try:
    sc and spark
except NameError as e:
    import findspark

    findspark.init()
    import pyspark
    import pyspark.sql

    sc = pyspark.SparkContext()
    spark = pyspark.sql.SparkSession(sc).builder.appName("Extract Network").getOrCreate()

Our first task will be to extract the star and fork edges that link users and repositories. We load our events, parse their JSON (and handle any errors) and filter to just the valid ForkEvents and WatchEvents that we’re interested in.

github_lines = sc.textFile("data/2017*.json.gz")



def parse_json(line):
    """Parse a line of JSON, emitting an error record on failure"""
    record = None
    try:
        record = json.loads(line)
    except json.JSONDecodeError as e:
        sys.stderr.write(str(e))
        record = {"error": "Parse error"}
    return record

# Apply the function to every record, then drop the failed parses
github_events = github_lines.map(parse_json)
github_events = github_events.filter(lambda x: "error" not in x)

# See http://bit.ly/github_ForkEvent_definition
fork_events = github_events.filter(lambda x: "type" in x and x["type"] == "ForkEvent")

# See http://bit.ly/github_WatchEvent_definition
star_events = github_events.filter(lambda x: "type" in x and x["type"] == "WatchEvent")
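The parse-and-filter pattern above can be sanity-checked locally without a Spark cluster by running the same logic over a couple of in-memory lines (the sample events here are made up for illustration):

```python
import json, sys

def parse_json(line):
    """Same parser as above: bad lines become error records"""
    try:
        return json.loads(line)
    except json.JSONDecodeError as e:
        sys.stderr.write(str(e))
        return {"error": "Parse error"}

lines = ['{"type": "ForkEvent"}', '{"type": "WatchEvent"}', 'not valid json']
events = [parse_json(l) for l in lines]
events = [x for x in events if "error" not in x]
forks = [x for x in events if "type" in x and x["type"] == "ForkEvent"]

print(len(events), len(forks))  # the bad line is dropped; one fork remains
```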

Then we use frozendict to create our pared-down records containing only the user and the repo. Note that when working with property graphs, you can pull in many more fields than this and use them as properties of nodes and edges… but it is easiest to start with just the essential identifiers.

# Get the user and repo for each ForkEvent: user-fork-repo

fork_events = fork_events.map(
    lambda x: frozendict(
        {
            "user": x["actor"]["login"] if "actor" in x and "login" in x["actor"] else None,
            "repo": x["repo"]["name"] if "repo" in x and "name" in x["repo"] else None
        }
    )
)
fork_events = fork_events.filter(lambda x: x["user"] is not None and x["repo"] is not None)

And for stars as well:

# Get the user and repo for each WatchEvent: user-star-repo

star_events = star_events.map(
    lambda x: frozendict(
        {
            "user": x["actor"]["login"] if "actor" in x and "login" in x["actor"] else None,
            "repo": x["repo"]["name"] if "repo" in x and "name" in x["repo"] else None
        }
    )
)
star_events = star_events.filter(lambda x: x["user"] is not None and x["repo"] is not None)
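The conditional extraction in those lambdas is worth a quick local check: given a well-formed event it pulls the login and repo name, and given a malformed one it yields None so the following filter() drops the record (the sample records here are invented for illustration):

```python
def extract(x):
    """The same safe-extraction logic used in the lambdas above"""
    return {
        "user": x["actor"]["login"] if "actor" in x and "login" in x["actor"] else None,
        "repo": x["repo"]["name"] if "repo" in x and "name" in x["repo"] else None
    }

good = extract({"actor": {"login": "alice"}, "repo": {"name": "alice/project"}})
bad = extract({"actor": {}})  # missing login and repo entirely

print(good)  # both fields populated
print(bad["user"] is None and bad["repo"] is None)  # would be filtered out
```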

Finally, we need to serialize our data back to JSON to import it into our graph database. We'll need to supply a default method to json.dumps() to enable it to serialize frozendicts.

def json_serialize(obj):
    """Serialize frozendicts as dicts instead of strings"""
    if isinstance(obj, frozendict):
        return dict(obj)


fork_events_lines = fork_events.map(
    lambda x: json.dumps(x, default=json_serialize)
)
fork_events_lines.saveAsTextFile("data/users_forked_repos.jsonl")

star_events_lines = star_events.map(
    lambda x: json.dumps(x, default=json_serialize)
)
star_events_lines.saveAsTextFile("data/users_starred_repos.jsonl")
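The reason the default hook is needed is that json.dumps() only serializes plain dicts natively; any other mapping type raises a TypeError unless default converts it. Here is a minimal local demonstration using a hypothetical FrozenLike stand-in class, in case frozendict isn't installed:

```python
import json
from collections.abc import Mapping

class FrozenLike(Mapping):
    """Minimal read-only mapping, standing in for frozendict (illustration only)"""
    def __init__(self, d):
        self._d = dict(d)
    def __getitem__(self, key):
        return self._d[key]
    def __iter__(self):
        return iter(self._d)
    def __len__(self):
        return len(self._d)

def json_serialize(obj):
    """Serialize mapping objects as plain dicts instead of failing"""
    if isinstance(obj, Mapping):
        return dict(obj)

record = FrozenLike({"user": "alice", "repo": "alice/project"})
out = json.dumps(record, default=json_serialize)  # TypeError without default=
print(out)
```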

Now that we have edges, we still need nodes for users and repositories. To create the repository records, we start by emitting the repo name in a frozendict for both the fork and the star records (remember, we need any repo that appears in either event type), union these two results with SparkContext.union() and finally call RDD.distinct() on the resulting dataset to get one record per repo. Without frozendict, you can't use distinct().

# We must get any repos appearing in either event type

fork_repos = fork_events.map(lambda x: frozendict({"repo": x["repo"]}))
star_repos = star_events.map(lambda x: frozendict({"repo": x["repo"]}))
repos = sc.union([fork_repos, star_repos])

repos = repos.distinct()
repos_lines = repos.map(lambda x: json.dumps(x, default=json_serialize))
repos_lines.saveAsTextFile("data/repos.json")

The same thing goes for users:

# We must get any users appearing in either event type

fork_users = fork_events.map(lambda x: frozendict({"user": x["user"]}))
star_users = star_events.map(lambda x: frozendict({"user": x["user"]}))
users = sc.union([fork_users, star_users])

users = users.distinct()
users_lines = users.map(lambda x: json.dumps(x, default=json_serialize))
users_lines.saveAsTextFile("data/users.json")
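The frozendict requirement is easy to demonstrate without Spark: distinct() dedupes by hashing records, and plain dicts are unhashable. This pure-Python sketch shows the failure mode, with a tuple of sorted items standing in for frozendict's content-based hash:

```python
record = {"user": "alice"}

# A plain dict cannot be hashed, so RDD.distinct() would fail on dict records
try:
    hash(record)
    hashable = True
except TypeError:
    hashable = False
print(hashable)  # False

# frozendict hashes by content; a tuple of sorted items behaves the same way
frozen_a = tuple(sorted({"user": "alice"}.items()))
frozen_b = tuple(sorted({"user": "alice"}.items()))
dedup = {frozen_a, frozen_b}  # equal-content duplicates collapse, like distinct()
print(len(dedup))  # 1
```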

Sizing Up Our Edges

The size of the users relation indicates there were 4,067,599 active users in 2017 (an active user here being one who forked or starred an open source project on Github). There were 4,071,996 repos forked. Those users forked and starred open source repositories 11,366,334 and 31,870,088 times respectively in 2017. The users_forked_repos.jsonl file is 659MB, while the users_starred_repos.jsonl file is 1.8GB.

Now that we’ve got nodes and the edges that connect them, we’re ready to setup and import data into our graph database. We’ll do that in our next post! :)

Data Syndrome Does Networks

If you need help with this kind of shenanigans, don’t hesitate to contact Data Syndrome at rjurney@datasyndrome.com :)