Ever since I submitted my 1500th pull request at the end of last year, I have been thinking of doing some kind of data analysis. I love playing with data, as it tells you something interesting every time you poke at it. I had many questions in my head about my PR journey, e.g. How many authors have I contacted so far? How many distributions on GitHub have I submitted PRs against?

Luckily, I have been collecting data about (nearly) every PR I have submitted so far, with details such as author, distribution name and pull request id. I already use that data for the graphs on my personal website.

Pull Request Summary

Yearly/Monthly Breakdown

The graphs above don't answer the questions I mentioned, so I started exploring how I could use the data I have to get the answers. First things first: a list of what I already had.

1) Count of all PRs

2) Count of closed (unmerged) PRs

3) Count of open PRs

4) Count of merged PRs

5) Breakdown of PRs for each month

The critical piece missing was the status of each individual PR, so my immediate target was to collect it. I had never used the GitHub API before, so I didn't know how to fetch the information from GitHub. Luckily, I already had all the details needed to look up the status of a PR. Like others would, I first searched MetaCPAN for a module that could help me with the GitHub API. As is always the case, I found one: Pithub. It had enough documentation to get me going straight away, thanks to Olaf Alders.

I then wrote a Perl script to go through the PR data I had and ask GitHub for the status of each one. To my surprise, I found plenty of typos in the data, e.g. incorrect PR ids and misspelled author and distribution names. The exercise actually cleaned up my historical data, thanks to Pithub. I also noticed that some of the PRs were missing from GitHub completely: either the author had deleted the distribution or passed it to someone else. I was sure it wasn't bad data on my part, as my inbox still had emails referring to the missing PRs. Luckily there were only 7 such PRs.
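The status lookup described above can be sketched as follows. This is not the original script (that one was written in Perl with Pithub); it is a minimal Python illustration that assumes the public GitHub REST endpoint `/repos/{owner}/{repo}/pulls/{number}` and its `state`, `merged` and `merged_at` fields. The function names `fetch_pr` and `classify_pr` and the "missing" label are made up for this example.

```python
import json
import urllib.error
import urllib.request


def fetch_pr(owner, repo, number):
    """Fetch one pull request record; return None if GitHub answers 404.

    A 404 is what you would expect when the repository was deleted,
    or when the PR id in the local data is simply wrong.
    """
    url = f"https://api.github.com/repos/{owner}/{repo}/pulls/{number}"
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    try:
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def classify_pr(pr):
    """Map a PR record to one of: merged, open, closed, missing."""
    if pr is None:
        return "missing"
    # GitHub reports a merged PR as state "closed", so check merged first.
    if pr.get("merged") or pr.get("merged_at"):
        return "merged"
    if pr.get("state") == "open":
        return "open"
    return "closed"
```

Checking `merged` before `state` matters here, because a merged PR and a rejected PR both come back with `state` set to "closed". Anything that returns "missing" is worth checking by hand, since it could be either a typo in the local data or a repository that no longer exists.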

OK, so now I had the status of every PR I have submitted so far. My first target was to get the top 30 authors who received the most PRs, with a breakdown by status. Below is what I got.

If you look at the graph closely, you will notice that Renee Baecker tops the list with 83 PRs, and with a good acceptance rate as well: 78 of those were merged. A few other interesting facts came to my notice. There are 4 authors in the top 30 who have accepted every PR I submitted:

Gabor Szabo (37 PR)

Dave Cross (20 PR)

Olaf Alders (28 PR)

Stefan G (11 PR)

I also noticed that there are 6 authors in the top 30 with whom I still hope to achieve 100% acceptance, as their remaining PRs are still open:

Slaven Rezić [Open: 3, Merged: 21]

FAYLAND [Open: 1, Merged: 16]

Theo van Hoesel [Open: 5, Merged: 7]

Jose Luis Martinez Torres [Open: 4, Merged: 8]

Daniel Friesel [Open: 2, Merged: 8]

Ali Zia [Open: 1, Merged: 8]
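For anyone who wants to try a similar per-author breakdown, here is a rough Python sketch of how it could be computed once every PR has a status. The record layout (dictionaries with `author` and `status` keys) and the function names are assumptions made for this example, not the format of my actual data.

```python
from collections import Counter, defaultdict


def top_authors(records, limit=30):
    """Rank authors by number of PRs received, with a per-status breakdown."""
    per_author = defaultdict(Counter)
    for record in records:
        per_author[record["author"]][record["status"]] += 1
    # Most PRs first; ties are broken alphabetically to keep output stable.
    ranked = sorted(
        per_author.items(),
        key=lambda item: (-sum(item[1].values()), item[0]),
    )
    return [(author, dict(counts)) for author, counts in ranked[:limit]]


def fully_merged(ranking):
    """Authors for whom every PR was merged."""
    return [
        author
        for author, counts in ranking
        if counts.get("merged", 0) == sum(counts.values())
    ]


def still_hopeful(ranking):
    """Authors with nothing closed unmerged but at least one PR still open."""
    return [
        author
        for author, counts in ranking
        if counts.get("closed", 0) == 0 and counts.get("open", 0) > 0
    ]
```

The two helper functions correspond to the two lists above: authors who merged everything, and authors whose only unmerged PRs are still open.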

I then modified the graph a little to show the total number of PRs against each author, and got this.

Finally, I wanted the top 30 authors with the most distributions receiving my PRs. Here is what I got.

If you look carefully, Renee Baecker is at the top in both categories (PRs and repositories). He is a superstar and always supportive.
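The ranking by distributions differs from the ranking by PRs only in what gets counted: distinct distributions per author rather than individual PRs. A small Python sketch, again assuming hypothetical records with `author` and `distribution` keys:

```python
from collections import defaultdict


def top_authors_by_distribution(records, limit=30):
    """Rank authors by how many distinct distributions received a PR."""
    distributions = defaultdict(set)
    for record in records:
        # A set ignores repeat PRs against the same distribution.
        distributions[record["author"]].add(record["distribution"])
    ranked = sorted(
        ((author, len(names)) for author, names in distributions.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )
    return ranked[:limit]
```

Using a set per author means ten PRs against one distribution still count as a single distribution.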

I have a cron job that updates this data every 2 hours.

All these graphs and many more are available to explore on my personal website.

If you have any suggestions, please do share them with me.