A couple of weeks ago I posted about a new open source Python library I started called git-pandas. The GitHub page for it is here:

https://github.com/wdm0006/git-pandas

The basic idea is to provide an interface to a git repository, or a collection of git repositories, via pandas DataFrames. With this, we can do some interesting analysis. In this example we will analyze the two projects that make git-pandas possible: GitPython and pandas. To get started, make a new directory to put everything in, and clone the three repositories (we will use the bleeding edge version of git-pandas):

    mkdir gitpandas_example
    cd gitpandas_example
    git clone https://github.com/gitpython-developers/GitPython.git
    git clone https://github.com/pydata/pandas.git
    git clone https://github.com/wdm0006/git-pandas.git

Now in git-pandas, in the examples folder, there is an example called bus_analysis.py. It contains the following script:

    import os
    from pandas import merge
    from gitpandas import ProjectDirectory, Repository

    __author__ = 'willmcginnis'


    def get_interfaces():
        project_path = str(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
        proj = ProjectDirectory(working_dir=project_path)
        pandas_repo = Repository(working_dir=project_path + os.sep + 'pandas')
        gitpython_repo = Repository(working_dir=project_path + os.sep + 'GitPython')

        return proj, pandas_repo, gitpython_repo


    if __name__ == '__main__':
        project, pandas_repo, gitpython_repo = get_interfaces()

        # do some blaming
        shared_blame = project.blame(extensions=['py'])
        pandas_blame = pandas_repo.blame(extensions=['py'])
        gitpython_blame = gitpython_repo.blame(extensions=['py'])

        # figure out who is common between projects
        common = merge(pandas_blame, gitpython_blame, how='inner', left_index=True, right_index=True)
        common = common.rename(columns={'loc_x': 'pandas_loc', 'loc_y': 'gitpython_loc'})

        # figure out committer count from each
        pandas_ch = pandas_repo.commit_history('master', limit=None, extensions=['py'])
        gitpython_ch = gitpython_repo.commit_history('master', limit=None, extensions=['py'])

        # now print out some things
        print('Total Python LOC for 3 Projects Combined')
        print('\t%d' % (int(shared_blame['loc'].sum()), ))

        print('\nNumber of contributors per project')
        print('\tPandas: %d' % (len(set(pandas_ch['committer'].values))))
        print('\tGitPython: %d' % (len(set(gitpython_ch['committer'].values))))

        print('\nTop 10 Contributors Between Each')
        print(shared_blame.head(10))

        print('\nCommitters that committed to Both')
        print(common)

        print('\nTruck Count of Each')
        print('\tPandas: %d' % (pandas_repo.bus_factor(extensions=['py'])))
        print('\tGitPython: %d' % (gitpython_repo.bus_factor(extensions=['py'])))

This script does a few things for us. First, it pulls the commit history and the blame for pandas and for GitPython. It also pulls the blame for the directory as a whole (which includes all three projects: GitPython, git-pandas and pandas).

Then we compute some interesting things using those datasets. At the end, we estimate the bus factor of each repository by counting the contributors who account for 50% of all of the code. This is an extremely rough estimate of how many people would have to disappear (i.e. get hit by a bus) for the project to die.
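To make the 50% rule concrete, here is a minimal sketch of that kind of calculation in plain Python. It is not the git-pandas implementation of bus_factor: the function below, its threshold parameter, the choice to count a contributor as soon as the running total reaches the threshold, and the example authors are all assumptions for illustration.

```python
def bus_factor(loc_by_author, threshold=0.5):
    """Count how many top contributors are needed to cover `threshold` of all lines.

    `loc_by_author` maps an author name to the lines of code blamed to them.
    This is an illustrative sketch, not the git-pandas implementation.
    """
    total = sum(loc_by_author.values())
    if total == 0:
        return 0

    covered = 0
    # Walk contributors from largest to smallest share of the code.
    for count, loc in enumerate(sorted(loc_by_author.values(), reverse=True), start=1):
        covered += loc
        if covered >= threshold * total:
            return count
    return len(loc_by_author)


# Hypothetical blame totals, not taken from any real repository.
example = {'alice': 600, 'bob': 250, 'carol': 100, 'dave': 50}
print(bus_factor(example))  # alice alone holds 600 of 1000 lines, so this prints 1
```

A project where one person owns most of the lines gets a factor of 1, no matter how many other people have contributed a little.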

If you run the script, you should see:

    Total Python LOC for 3 Projects Combined
        284921

    Number of contributors per project
        Pandas: 350
        GitPython: 70

    Top 10 Contributors Between Each
    name                 loc
    Wes McKinney       64994
    jreback            47357
    Jeff Reback        21869
    sinhrks            20126
    Sebastian Thiel    15236
    Phillip Cloud      13282
    Chris Whelan        7864
    Jeffrey Tratner     6933
    y-p                 6053
    Andy Hayden         5158

    Committers that committed to Both
    committer             pandas_loc  gitpython_loc
    Yaroslav Halchenko            41             18

    Truck Count of Each
        Pandas: 3
        GitPython: 1
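The "committed to Both" table comes from an inner join of the two blame tables on their index. Here is a small sketch of that step with toy data, assuming (as the script implies) that each blame table has a single loc column indexed by committer. The Alice and Bob rows are hypothetical; only the Yaroslav Halchenko figures come from the output above.

```python
import pandas as pd

# Toy stand-ins for the two blame tables: one 'loc' column, indexed by committer.
# Every row except Yaroslav Halchenko's is hypothetical.
pandas_blame = pd.DataFrame(
    {'loc': [500, 41]},
    index=pd.Index(['Alice Example', 'Yaroslav Halchenko'], name='committer'),
)
gitpython_blame = pd.DataFrame(
    {'loc': [300, 18]},
    index=pd.Index(['Bob Example', 'Yaroslav Halchenko'], name='committer'),
)

# The inner join keeps only committers present in both tables. Because both
# tables have a column named 'loc', pandas adds the suffixes '_x' and '_y'.
common = pd.merge(pandas_blame, gitpython_blame, how='inner',
                  left_index=True, right_index=True)
common = common.rename(columns={'loc_x': 'pandas_loc', 'loc_y': 'gitpython_loc'})
print(common)
```

The rename is only there to replace the default suffixes with names that say which project each count belongs to.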

So there you have it: a nice analysis of project size, organizational support, and distribution of contribution in around 50 lines of painless Python. Development continues on git-pandas, so if you have any suggestions for new features, examples, use cases or anything else, comment below or on GitHub.

https://github.com/wdm0006/git-pandas

Edit: I've since pushed a new release of git-pandas to PyPI (v0.0.3), so you can install the version used in this post with pip, following the instructions in the docs/readme.