In a few past posts, I've shown you some of the functionality of one of my projects: git-pandas. You can run aggregate analysis across a whole set of repositories, or even compute a cumulative blame across them all. With a little bit of extra code, you can start to see where aggregated, higher-level analysis of git repositories can be useful for a team.



GitNOC is a Flask/D3/Redis-based app that, with git-pandas behind the scenes, gives you cumulative blame charts and a file change rate table for as many 'profiles' as you want. In GitNOC, a profile is just a set of configurations for:

- What repository or repositories you want to analyze
- What file extensions you want to analyze
- What directories you want to ignore

So let's say your team has around 20 repositories and 3 teams: R&D, frontend, and backend. You may end up with 4 basic profiles:

- frontend: the web repositories, focusing on css, js, and html, but ignoring the bower_components directory
- backend: all production repositories, but focusing on python, and ignoring docs and tests
- R&D: focusing on just a few sandbox type repositories, with python, R, sql, and some C in there
- Management: all repos, all languages we use, ignoring imported directories like bower_components
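To make the idea of a profile concrete, here's a rough sketch of one as a plain Python object. This is not GitNOC's actual schema: the `Profile` class, its field names, the `includes` helper, and the repository path are all hypothetical, chosen only to mirror the three settings listed above.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class Profile:
    """Hypothetical stand-in for a GitNOC profile (not the real schema)."""
    name: str
    repos: list                 # paths to the repositories to analyze
    extensions: list            # file extensions to include, without the dot
    ignore_dirs: list = field(default_factory=list)  # directory names to skip

    def includes(self, path):
        """Return True if a repo-relative file path belongs to this profile."""
        p = PurePosixPath(path)
        # Skip any file that lives under an ignored directory.
        if any(part in self.ignore_dirs for part in p.parts[:-1]):
            return False
        # Otherwise keep it only if its extension is one we care about.
        return p.suffix.lstrip(".") in self.extensions


# The frontend profile from the list above; the repo path is a placeholder.
frontend = Profile(
    name="frontend",
    repos=["/srv/git/web-app"],
    extensions=["css", "js", "html"],
    ignore_dirs=["bower_components"],
)
```

With something like this, `frontend.includes("static/js/app.js")` is true, while anything under `bower_components` or any `.py` file is filtered out before the analysis runs.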

With these four profiles set up, you navigate back to the main page, where cumulative blame is shown. As we discussed in the previous post, this can be a pretty slow operation on big projects, so the task is run in the background using redis and rq. When the task is done running, you'll have your beautiful cumulative blame charts and can get a picture of where the bulk of your code is.
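The background pattern looks roughly like the sketch below. It is a minimal illustration of queueing work with rq, not GitNOC's actual code: `enqueue_blame`, `job_state`, and the `gitnoc` queue name are hypothetical, and the slow task itself is passed in rather than shown. It assumes a Redis server on the default local port once you actually enqueue something.

```python
def enqueue_blame(task, profile_name, queue_name="gitnoc"):
    """Hand a slow task to an rq worker and return the job handle.

    `task` must be a function the worker process can import by name,
    so it should live in a module rather than in an interactive session.
    """
    # Imported lazily so this module loads even without redis/rq installed.
    from redis import Redis
    from rq import Queue

    queue = Queue(queue_name, connection=Redis())
    return queue.enqueue(task, profile_name)


def job_state(job):
    """Summarize a job for the page as (state, result or None)."""
    if job.is_finished:
        return "done", job.result
    if job.is_failed:
        return "failed", None
    # Still queued or running: the page can show a placeholder for now.
    return "pending", None
```

The web request only enqueues the job and returns right away. A later page load calls something like `job_state` and draws the chart once the state comes back as done.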

Now, as we all know, LOC is a pretty terrible metric for work. A bad engineer may put out tons of lines of bad code, while a good engineer solves the problem quickly with few. But a well-formed view of scale across projects like this can shed a lot of light on questions like:

- What repositories are growing at the fastest rate?
- Who is contributing where?
- What repositories make up the majority of a given profile?
- Are tests growing in sync with codebase size?
- Are two otherwise similar teams contributing similar amounts of code, and if not why not?
- What languages/frameworks make up the bulk of the codebase?
- Does the skillset we are hiring for match the state of the codebase?

The second tab is labeled risk. In this tab, we join file change rate data with coverage data to identify files that are being edited unusually often and that also have poor test coverage. These files are obvious candidates for some new tests. Again, not a foolproof measure, but a useful heuristic for technical leads and managers with deadlines.
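The heuristic behind the risk tab can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how GitNOC computes it: the `risky_files` function, the dictionary inputs, and both thresholds are assumptions, and what counts as "unusually often" will depend on your own codebase.

```python
def risky_files(change_rates, coverage, min_rate=1.0, max_coverage=0.5):
    """Rank files that change often and are poorly covered by tests.

    change_rates: {path: edits per day}
    coverage:     {path: fraction of lines covered, 0.0 to 1.0}
    Files missing from `coverage` are treated as having no coverage.
    The default thresholds are arbitrary placeholders.
    """
    flagged = []
    for path, rate in change_rates.items():
        covered = coverage.get(path, 0.0)
        # Keep only files that are both hot and under-tested.
        if rate >= min_rate and covered <= max_coverage:
            flagged.append((path, rate, covered))
    # Most frequently edited first; ties broken by lowest coverage.
    return sorted(flagged, key=lambda row: (-row[1], row[2]))
```

The join here is just a dictionary lookup keyed on file path. With DataFrames, as git-pandas returns, the same thing would be a merge on the file column followed by a filter and a sort.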

GitNOC is just one example of how you can use git-pandas to wrangle large teams and complex codebases. If you have any great ideas, I'd love to hear them. Both git-pandas and GitNOC are under active development, and I'd welcome any help or feedback.

Source code:

https://github.com/wdm0006/git-pandas

https://github.com/wdm0006/gitnoc