Without question, GitHub is the biggest code-sharing platform on the planet. With more than 14 million users and 35 million repositories, the insights you can discover by analyzing the data available through its API are surprising and revealing.

I’ve been very passionate about this data for a long time, as much as I am about Neo4j. Importing this data into a graph database can reveal interesting new information.

Two years ago I created a GitHub Gist with some interesting queries on that dataset, and I have shown many more at several conferences since.

Recently, I came across mention-bot from Facebook, which analyses the git blame of the files in a pull request in order to notify (via @mention) potential reviewers.

It uses two heuristics:

If a line was deleted or modified, the person that last touched that line is likely going to care about this pull request.

If a person last touched many lines in the file where the change was made, they will want to be notified.

I was inspired by Facebook's approach. Since I have been running a continuous importer of all GitHub events into Neo4j for a long time, I have a lot of GitHub data to work with.

I was curious whether our Natural Language Processing plugin for Neo4j, recently created by my colleague, friend and Chief Scientist of GraphAware, Dr. Alessandro Negro, could be applied to the text descriptions of GitHub pull requests and issues.

Our approach

We will import the last 100 pull requests of our favorite GitHub repository, Neo4j, with their title, body, author and files.

We will then process this text through an NLP pipeline, storing the results in the graph. Thanks to the nature of graphs, this will allow us to relate the already existing code file nodes to text entities.

Then, after some scoring queries, we will create a new GitHub issue and find potential reviewers based on:

The tags extracted from the new issue’s text using NLP

The similarity between such tags and the tags extracted from previous pull requests

The files affected by those previous Pull Requests

The activity frequency of users on the files found in the previous steps

Importing the data

Thanks to the amazing work of Michael Hunger and the Neo4j community, the APOC procedures library offers an easy way to import data from public APIs directly via Cypher.

Let’s go!

Creating unique constraints

CREATE CONSTRAINT ON (user:User) ASSERT user.id IS UNIQUE;
CREATE CONSTRAINT ON (repository:Repository) ASSERT repository.id IS UNIQUE;
CREATE CONSTRAINT ON (file:File) ASSERT file.path IS UNIQUE;
CREATE CONSTRAINT ON (pullrequest:PullRequest) ASSERT pullrequest.id IS UNIQUE;

Importing the last 100 pull requests

MERGE (r:Repository {full_name:"neo4j/neo4j"})
WITH r as repo
CALL apoc.load.json("https://api.github.com/repos/neo4j/neo4j/pulls?per_page=100&state=closed&direction=desc&base=3.0")
YIELD value as pull
MERGE (pr:PullRequest {id: pull.id})
ON CREATE SET pr.title = pull.title, pr.body = pull.body, pr.time = pull.created_at, pr.number = pull.number
MERGE (user:User {id: pull.user.id})
ON CREATE SET user.login = pull.user.login
MERGE (user)-[:OPENED_PR]->(pr)
MERGE (pr)-[:PR_TO_REPO]->(repo)

Frequency of contribution

The idea is to create a direct relationship between a user and a repository, with a property representing the frequency of contributions:

MATCH (r:Repository {full_name:"neo4j/neo4j"})<-[:PR_TO_REPO]-(pr)<-[:OPENED_PR]-(user)
WITH r, user, count(*) as freq
MERGE (user)-[c:CONTRIBUTED_TO]->(r)
SET c.freq = freq

Once the last statement has finished, you can already visualize a graph of contributors to the Neo4j repository.

Importing Pull Request files

The final statement imports the files of those pull requests and creates the repository file hierarchy. Replace the XXX in the API URL with your GitHub client id and secret if you have them:

MATCH (pr:PullRequest)
CALL apoc.load.json("https://api.github.com/repos/neo4j/neo4j/pulls/" + pr.number + "/files?client_id=XXX&client_secret=XXX")
YIELD value as file
WITH pr, file as fileInfo, split(file.filename,"/") AS elts
UNWIND range(0,size(elts)-1) AS i
MERGE (prev:File {path: "neo4j/neo4j_" + reduce(s="", x IN range(0,i-2) | s + elts[x] + "/") + CASE i WHEN 0 THEN "" ELSE elts[i-1] END})
SET prev.name = CASE i WHEN 0 THEN "." ELSE reduce(s="", x IN range(0,i-2) | s + elts[x] + "/") + elts[i-1] END
MERGE (file:File {path: "neo4j/neo4j_" + reduce(s="", x IN range(0,i-1) | s + elts[x] + "/") + elts[i]})
SET file.name = reduce(s="", x IN range(0,i-1) | s + elts[x] + "/") + elts[i]
MERGE (file)-[:PARENT]->(prev)
WITH "neo4j/neo4j_" + reduce(s="", x IN range(0,size(elts)-2) | s + elts[x] + "/") + elts[-1] AS la, pr, fileInfo
MATCH (f:File {path: la})
SET f:FileRecord
MERGE (f)<-[r:TOUCHED_FILE]-(pr)
SET r.changes = fileInfo.changes

Most of the complexity in this query comes from iterating over the split file path in order to recreate the file tree structure in the graph.
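To make that iteration easier to follow, here is a minimal Python sketch of the same idea (the `path_hierarchy` helper and its `prefix` argument are our own illustration, not part of the import): it splits a file path and yields the (child, parent) node paths that the MERGE statements create.

```python
def path_hierarchy(filename, prefix="neo4j/neo4j_"):
    """Split a file path and return (child, parent) node paths,
    mirroring the MERGE logic that builds the file tree in the graph."""
    elts = filename.split("/")
    pairs = []
    for i in range(len(elts)):
        parent = prefix + "/".join(elts[:i])      # empty join -> repository root
        child = prefix + "/".join(elts[:i + 1])
        pairs.append((child, parent))
    return pairs

# Each file in a pull request contributes one PARENT chain:
for child, parent in path_hierarchy("community/kernel/src/Main.java"):
    print(child, "-[:PARENT]->", parent)
```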

Processing the NLP pipeline

Using the Natural Language Processing plugin is as simple as calling a procedure on the given text and creating a relationship between the pull request and the newly created AnnotatedText node:

MATCH (n:PullRequest)
CALL ga.nlp.annotate({text: n.title + " " + n.body, id: id(n), nlpDepth:0})
YIELD result
MERGE (n)-[:ANNOTATED_TEXT]->(result)

The extraction is automatically persisted in the graph and relates sentences and tags to Pull Requests, for example:

By just browsing your database, you can identify which tags (big pink nodes) are related to a pull request.

Here we will take a simple scoring approach: for each pull request we will find the extracted tags and compute a score where:

score = tag occurrences in this PR / (total occurrences of the tag * number of tags for this PR)

We take how many times the tag occurs in this particular pull request and divide that number by the product of:

a) how many times this tag appears across all pull requests
b) how many tags tag this particular pull request
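As a sanity check, here is the same computation as a short Python sketch (the `tag_score` function is purely illustrative; the real scoring happens in Cypher):

```python
def tag_score(freq_in_pr, total_tag_occurrences, tags_in_pr):
    """Occurrences of a tag in one PR, normalised by how common
    the tag is overall and by how many tags the PR carries."""
    return freq_in_pr / (total_tag_occurrences * tags_in_pr)

# A tag occurring twice in a PR with 10 tags, while the tag
# appears 8 times across all PRs:
print(tag_score(2, 8, 10))  # 0.025
```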

The corresponding Cypher query:

MATCH (pr:PullRequest)-[:ANNOTATED_TEXT]->(text)-[:CONTAINS_SENTENCE]->(s)-[:HAS_TAG]->(tag)
// no tags yet
WHERE size((pr)<-[:TAGS_PR]-()) = 0
WITH pr, count(distinct(tag)) as total
MATCH (pr)-[:ANNOTATED_TEXT]->(text)-[:CONTAINS_SENTENCE]->(s)-[:HAS_TAG]->(tag)
WITH pr, tag, count(*) as freq, total
ORDER BY freq DESC
WITH pr, tag, toFloat((freq * 1.0f) / (size((tag)<-[:HAS_TAG]-()) * total)) as score, total
MERGE (tag)-[r:TAGS_PR]->(pr)
SET r.score = score
RETURN tag.value, score

When the query has completed, an interesting question to ask is: given a file, what are its corresponding tags based on the pull requests, the NLP tags and their scores?

WITH "neo4j/neo4j_community/kernel/src/main/java/org/neo4j/kernel/impl/api/KernelTransactions.java" as path
MATCH (f:FileRecord {path: path})<-[:TOUCHED_FILE]-(pr)<-[r:TAGS_PR]-(tag)
WITH tag, collect(r) as occurences
RETURN tag.value, size(occurences) + reduce(x=0, o IN occurences | x + o.score) as score
ORDER BY score DESC

Result:

╒══════════════════════════════╤══════════════════╕
│tag.value                     │score             │
╞══════════════════════════════╪══════════════════╡
│transaction                   │4.0154783125371365│
│commit                        │3.0050662878787877│
│can                           │3.0026903651903654│
│make                          │3.0020750988142293│
│external                      │2.028030303030303 │
│guard                         │2.0244444444444443│
│execution                     │2.0194444444444444│
│guarantee                     │2.018308080808081 │
│logical                       │2.016919191919192 │
│note                          │2.0112794612794613│
│during                        │2.00915404040404  │
│right                         │2.008459595959596 │
│slave                         │2.0075396825396825│
│kernel                        │2.0063492063492063│
│get                           │2.0061026936026938│
│instance                      │2.0056277056277056│
│what                          │2.004513888888889 │
│through                       │2.004166666666667 │
└──────────────────────────────┴──────────────────┘

Without having seen a single line of code in this file, we already know that it is about transaction, guard, execution, kernel and guarantee. These terms quite precisely represent the purpose of the file.

A visual comparison is really striking:

Actually, with the help of Neo4j and NLP, we made the code talk to us.

Creating a new issue

Next, we will simulate a user creating a new issue and process it through the NLP pipeline. We will take a real issue about escaping the output of the constraints procedure, reported on the Neo4j repository:

MATCH (repo:Repository {full_name:"neo4j/neo4j"})
CALL apoc.load.json("https://api.github.com/repos/neo4j/neo4j/issues/7634")
YIELD value as issue
CREATE (i:Issue {number: issue.number, title: issue.title, body: issue.body})
MERGE (i)-[:ISSUE_ON_REPO]->(repo)

Now we process the NLP pipeline:

MATCH (i:Issue {number: 7634})
CALL ga.nlp.annotate({text: i.title + " " + i.body, id: id(i), nlpDepth:0})
YIELD result
MERGE (i)-[:ANNOTATED_TEXT]->(result)

which will produce the same extraction format as for pull requests:

Finding potential reviewers

Starting from the issue, we will incrementally run Cypher queries adding more complexity at each step:

Note that here we are not interested in tags with fewer than five characters:

MATCH (i:Issue {number: 7634})-[:ANNOTATED_TEXT]->(text)-[:CONTAINS_SENTENCE]->(s)-[:HAS_TAG]->(tag)
WHERE size(tag.value) > 4
RETURN tag.value as tag, count(*) as occurences
ORDER BY occurences DESC

Result:

╒══════════╤══════════╕
│tag       │occurences│
╞══════════╪══════════╡
│UNIQUE    │3         │
│label     │3         │
│property  │2         │
│neo4j     │2         │
│CONSTRAINT│2         │
│constraint│2         │
│assert    │2         │
│create    │2         │
└──────────┴──────────┘
[...] result truncated

From the tags found, you can retrieve the related previous pull requests as well as their authors:

MATCH (i:Issue {number: 7634})-[:ANNOTATED_TEXT]->(text)-[:CONTAINS_SENTENCE]->(s)-[:HAS_TAG]->(tag)
WHERE size(tag.value) > 4
WITH tag, count(*) as occurences
MATCH (tag)-[r:TAGS_PR]->(pull)<-[:OPENED_PR]-(author)
RETURN tag.value as tag, author.login, count(*) as freq
ORDER BY freq DESC

Result:

╒══════════╤═══════════════╤════╕
│tag       │author.login   │freq│
╞══════════╪═══════════════╪════╡
│index     │MishaDemianenko│5   │
│which     │tinwelint      │5   │
│where     │tinwelint      │4   │
│should    │systay         │3   │
│property  │MishaDemianenko│3   │
│match     │pontusmelke    │3   │
│label     │Mats-SX        │2   │
│neo4j     │lutovich       │2   │
│should    │Mats-SX        │2   │
│property  │Mats-SX        │2   │
│which     │glennsarti     │2   │
│where     │pontusmelke    │2   │
│should    │lutovich       │2   │
│index     │tinwelint      │2   │
│label     │MishaDemianenko│2   │
└──────────┴───────────────┴────┘
[...] result truncated

The next step is to combine the occurrences of the tag in the issue with the score property on the TAGS_PR relationship:

MATCH (i:Issue {number: 7634})-[:ANNOTATED_TEXT]->(text)-[:CONTAINS_SENTENCE]->(s)-[:HAS_TAG]->(tag)
WHERE size(tag.value) > 4
WITH tag, count(*) as occurences
MATCH (tag)-[r:TAGS_PR]->(pull)<-[:OPENED_PR]-(author)
RETURN tag.value as tag, author.login, occurences * sum(r.score) as score
ORDER BY score DESC

Result:

╒══════════╤═══════════════╤═════════════════════╕
│tag       │author.login   │score                │
╞══════════╪═══════════════╪═════════════════════╡
│neo4j     │pe4cey         │0.125                │
│version   │pe4cey         │0.125                │
│UNIQUE    │pontusmelke    │0.10714285714285714  │
│property  │Mats-SX        │0.08571428571428572  │
│assert    │pontusmelke    │0.07142857142857142  │
│CONSTRAINT│pontusmelke    │0.07142857142857142  │
│create    │Mats-SX        │0.05                 │
│around    │chrisvest      │0.05                 │
│label     │MishaDemianenko│0.03953823953823954  │
│property  │MishaDemianenko│0.038644688644688646 │
│constraint│MishaDemianenko│0.037037037037037035 │
│label     │Mats-SX        │0.03638127577007034  │
│constraint│pontusmelke    │0.03571428571428571  │
└──────────┴───────────────┴─────────────────────┘
[...] result truncated

Now we will aggregate the scores by user and return the top potential reviewers:

MATCH (i:Issue {number: 7634})-[:ANNOTATED_TEXT]->(text)-[:CONTAINS_SENTENCE]->(s)-[:HAS_TAG]->(tag)
WHERE size(tag.value) > 4
WITH tag, count(*) as occurences
MATCH (tag)-[r:TAGS_PR]->(pull)<-[:OPENED_PR]-(author)
WITH tag, author.login as user, occurences * sum(r.score) as score
RETURN user, sum(score) as score

Result:

╒═══════════════╤════════════════════╕
│user           │score               │
╞═══════════════╪════════════════════╡
│pontusmelke    │0.39697590421274626 │
│Mats-SX        │0.19041057979937437 │
│MishaDemianenko│0.18433177933177933 │
│systay         │0.03407509157509157 │
│chrisvest      │0.09658185529709445 │
│glennsarti     │0.028600823045267492│
│davidegrohmann │0.005917159763313609│
│jakewins       │0.003108003108003108│
│thobe          │0.02110805860805861 │
│tinwelint      │0.10783658701133339 │
└───────────────┴────────────────────┘
[...] result truncated

Finally, let's combine the user's contribution frequency with the current score:

MATCH (repo:Repository {full_name:"neo4j/neo4j"})<-[:ISSUE_ON_REPO]-(i:Issue {number: 7634})-[:ANNOTATED_TEXT]->(text)-[:CONTAINS_SENTENCE]->(s)-[:HAS_TAG]->(tag)
WHERE size(tag.value) > 4
WITH repo, tag, count(*) as occurences
MATCH (tag)-[r:TAGS_PR]->(pull)<-[:OPENED_PR]-(author)
WITH repo, tag, author as user, occurences * sum(r.score) as score
MATCH (user)-[r:CONTRIBUTED_TO]->(repo)
RETURN user.login, sum(score) * (r.freq * 1.0f / size((repo)<-[:CONTRIBUTED_TO]-())) as score
ORDER BY score DESC LIMIT 10

Result:

╒═══════════════╤═════════════════════╕
│user.login     │score                │
╞═══════════════╪═════════════════════╡
│pontusmelke    │0.21502861478190424  │
│MishaDemianenko│0.09984638047138046  │
│Mats-SX        │0.07933774158307266  │
│chrisvest      │0.048290927648547234 │
│tinwelint      │0.04043872012925002  │
│lutovich       │0.021550774328552104 │
│pe4cey         │0.020833333333333332 │
│systay         │0.005679181929181929 │
│glennsarti     │0.002383401920438957 │
│digitalstain   │0.0015988874182151494│
└───────────────┴─────────────────────┘

If you look at the commits referencing the issue on GitHub and open the corresponding pull request, its author confirms our suggested results.

The next step would be to connect to the GitHub API with your bot account and mention the first three users in the result list. An immediate improvement would be to cut off results whose scores are low compared to the standard deviation of the overall scores.
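Such a cutoff could look like the sketch below (a hypothetical helper; the choice of mean plus one standard deviation as the threshold is our own assumption):

```python
from statistics import mean, stdev

def cutoff_candidates(scored_users, n_sigma=1.0):
    """Keep only users whose score exceeds the mean of all scores
    by at least n_sigma standard deviations."""
    scores = [score for _, score in scored_users]
    threshold = mean(scores) + n_sigma * stdev(scores)
    return [(user, score) for user, score in scored_users if score > threshold]

# With rounded scores from the last result table, only the clear
# outlier survives the cutoff:
users = [("pontusmelke", 0.215), ("MishaDemianenko", 0.0998),
         ("Mats-SX", 0.0793), ("chrisvest", 0.0483),
         ("tinwelint", 0.0404), ("lutovich", 0.0216)]
print(cutoff_candidates(users))  # [('pontusmelke', 0.215)]
```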

Going further by suggesting pull requests to reviewers

The goal described at the beginning of the blog post is now achieved: people will be mentioned when they are good candidates to review an issue.

In order to help them further, we might want to suggest pull requests they can refer to in order to fix and confirm the new issue.

The Natural Language Processing plugin offers a search procedure to which you can provide a text to search for (here, the text of the issue); it returns documents (nodes) similar to the given text along with a relevance score.

The internal query of the NLP search procedure looks like this:

MATCH (doc:AnnotatedText)
WITH count(doc) as documentsCount
MATCH (t:Tag)
WHERE t.value IN {searchTokens}
WITH t, documentsCount, {querySize} as queryTagsCount
MATCH (t)<-[:HAS_TAG]-(:Sentence)<-[]-(document:AnnotatedText)
WITH t, count(distinct document) as documentsCountForTag, documentsCount, queryTagsCount
MATCH (t)<-[ht:HAS_TAG]-(sentence:Sentence)<-[]-(at:AnnotatedText)
WITH DISTINCT at, t.value as value,
     sum(ht.tf) * (1 + log((1.0f * documentsCount) / (documentsCountForTag + 1))) * (1.0f / at.numTerms^0.5f) as sum,
     queryTagsCount
RETURN at, (1.0f * size(collect(value)) / queryTagsCount) * (sum(sum)) as score
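This is essentially a TF-IDF relevance score with query-coverage and document-length normalisation. A minimal Python transcription of the same weighting (our own sketch, not the plugin's actual code) makes the structure clearer:

```python
import math

def document_score(matched_tags, documents_count, query_tags_count, num_terms):
    """TF-IDF style relevance of one annotated document for a query.

    matched_tags: one (tf, documents_count_for_tag) pair per query tag
                  that was found in the document.
    """
    total = 0.0
    for tf, documents_count_for_tag in matched_tags:
        idf = 1 + math.log(documents_count / (documents_count_for_tag + 1))
        total += tf * idf / math.sqrt(num_terms)  # document length normalisation
    coverage = len(matched_tags) / query_tags_count  # fraction of query tags hit
    return coverage * total
```

Documents matching more of the query's tags score higher, while very common tags contribute less through the inverse document frequency term.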

Let’s use it and see the results:

MATCH (n:Issue {number: 7634})
CALL ga.nlp.search(n.title + " " + n.body)
YIELD result, score
MATCH (result)<-[:ANNOTATED_TEXT]-(pull:PullRequest)
RETURN pull.title, score
ORDER BY score DESC LIMIT 10

Result:

╒══════════════════════════════════════════════════════════════════════════════════════════╤═════════════════════╕
│pull.title                                                                                │score                │
╞══════════════════════════════════════════════════════════════════════════════════════════╪═════════════════════╡
│Escape when listing constraints containing `:`                                            │7.645201712255016    │
│Revamped parallelism of parts of batch importer                                           │1.3149300137481459   │
│Apply predicates in VarExpand, not in filter                                              │0.5192339447267563   │
│Returns index partitions in sorted order                                                  │0.4926431770732637   │
│3.0 using label scan store for uniqueness constraints (Prototype) (queue all updates)     │0.40473910166978916  │
│Open schema and label indexes in read only mode in case database started in read only mode│0.38958225594380896  │
│CsvInput parses data in parallel                                                          │0.375735879917604    │
│Fix theoretical data-race in SequenceLock                                                 │0.3730853451635107   │
│Complete Procedure documentation                                                          │0.351140524268717    │
└──────────────────────────────────────────────────────────────────────────────────────────┴─────────────────────┘
[...] result truncated

Conclusion

A graph database like Neo4j offers you the flexibility to store data in its natural shape and view it from multiple angles in order to gain new insights. Combined with NLP, we showed today that the only limit to the possibilities is your imagination.

For the sake of the blog post, we left out some steps that we use in our continuous import pipeline, such as:

Fetching comments of issues and pull requests

Parsing text to detect commit SHA references in order to relate pull requests to the issues they close; you can then combine the text from the issue and the pull request with different scores

Issue and pull request labels, which are valuable extra tags

You will also have noticed that the NLP extraction yields tags such as have, would and the like, which bring little value for the language we use on GitHub. This is why we are currently tuning the pipeline to detect and ignore such tags, and are working on the ability to define your own stopwords to be excluded during the NLP pipeline process.

The NLP plugin will be open-sourced under the GPL in the future, and we would like to make sure it is production-ready with the help of private beta testers. If you’re interested in learning more or seeing it in action, please get in touch.

If you’re attending GraphConnect in San Francisco in October this year, or in London next year, make sure to stop by our booth!