A while ago, I wanted to get a little quick feedback on some data I was playing with, but the day was almost over and I wasn’t done working on it yet. I decided to tweet my rough draft of a graph of GitHub language trends anyway, followed later by a slight improvement.

Much to my surprise, that graph was retweeted more than 2,000 times and reached well over 1 million people. My colleagues have both examined this data since I posted the graph — James took a stab at pulling out a few key points, particularly GitHub’s start around Rails and its growth into the mainstream, and Steve’s also taken a look at visualizing this data differently.

Fantastic as that was, the best part was the questions I got. All those conversations helped me decide which points would be most interesting to people who read this post. The initial plot was a spaghetti graph, so I fixed it up and decided to do a more in-depth analysis.

Caveats

Before we can get into useful results and interpretation, there are a few artifacts and potential pitfalls to be aware of:

GitHub is a specific community that’s grown very quickly since it launched [writeup]. It was not initially reflective of open source as a whole but rather centered around the Ruby on Rails community;

In 2009, the GitPAN project imported all of CPAN (Perl’s module ecosystem) into GitHub, which explains the one-time peak;

Language detection is based on lines of code, so a repository with a large amount of JavaScript template libraries (e.g. jQuery) copied into it will be detected as JavaScript rather than the language where most of the work is being done; and

I’m showing percentages, not absolute values. A downward slope does not mean fewer repositories are being created. It does mean, however, that other languages are gaining repositories faster.
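To make the last caveat concrete, here is a minimal sketch of how a language's share can fall while its absolute count rises. The counts are invented purely for illustration and are not GitHub data; the `share` helper is a hypothetical name.

```python
# Hypothetical counts of new repositories, invented for illustration only.
new_repos = {
    2012: {"Ruby": 100_000, "Other": 400_000},
    2013: {"Ruby": 150_000, "Other": 850_000},
}


def share(counts, language):
    """Return one language's percentage of all new repositories in a year."""
    return 100.0 * counts[language] / sum(counts.values())


ruby_2012 = share(new_repos[2012], "Ruby")  # 20.0
ruby_2013 = share(new_repos[2013], "Ruby")  # 15.0

# Ruby gained 50,000 repositories year over year, yet its share dropped
# because everything else grew faster.
print(ruby_2012, ruby_2013)
```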

The big reveal

The first set of graphs shows new, non-fork repositories created on GitHub by primary language and year. This dataset includes all languages that were in the top 10 during any of the years 2008–2013, but languages used for text-editor configuration were ignored (VimL and Emacs Lisp). I’m showing them as a grid of equally scaled graphs to make comparisons easier across any set of languages, and I’m using percentages to indicate relative share of GitHub.

GitHub hits the mainstream: James quickly nailed the key point: GitHub has gone mainstream over the past 5 years. This is best shown by the decline of Ruby as it reached beyond the Rails community and the simultaneous growth of a broad set of both old and newer languages including Java, PHP, and Python as GitHub reached a broader developer base. The apparent rise and drop of languages like PHP, Python, and C could indicate that these communities migrated toward GitHub earlier than others. This would result in an initially larger share that lowered as more developers from e.g. Java, C++, C#, Obj-C, and Shell joined.

The rise of JavaScript: Another trend that instantly stands out is the growth of JavaScript. Although it’s tempting to attribute that to the rise of Node.js [2010 writeup], reality is far more ambiguous. Node certainly accounts for a portion of the increase, but equally important to remember is (1) the popularity of frameworks that generate large quantities of JavaScript code for new projects and (2) the JavaScript development philosophy that encourages bundling of dependencies in the same repo as the primary codebase. Both of these encourage large amounts of essentially unmodified JavaScript to be added to webapp repositories, which increases the likelihood that repositories, especially those involving small projects in other languages, get misclassified as JavaScript.

Windows and iOS development nearly invisible: Both C# and Objective-C are unsurprisingly almost invisible, because they’re both ecosystems that either don’t encourage or actively discourage open-source code. These are the two languages in this chart most likely to be unreflective of both current usage outside GitHub and future usage, again due to open-source imbalance in those communities.
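The JavaScript misclassification effect mentioned above is easy to reproduce with a toy model. This is only a sketch of size-based detection as described in the caveats, not GitHub's actual detection code; `primary_language` and the line counts are hypothetical.

```python
def primary_language(lines_by_language):
    """Pick the language with the most lines, mimicking size-based detection."""
    return max(lines_by_language, key=lines_by_language.get)


# Hypothetical small Python web app, counting only the code its authors wrote.
own_code = {"Python": 800, "JavaScript": 150}

# The same repository after copying in unmodified JavaScript libraries.
with_vendored_js = {"Python": 800, "JavaScript": 150 + 9_000}

print(primary_language(own_code))          # Python
print(primary_language(with_vendored_js))  # JavaScript
```

The work being done has not changed between the two cases; only the bundled dependencies have, which is why small projects in other languages are the most likely to flip.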

What about pushes rather than creation?

What’s really interesting is that if you do the same query by when the last push of code to the repo occurred rather than its creation, the graphs look nearly identical (not shown). The average number of pushes to repositories is independent of both time and language but is correlated with when repositories were created. In only two cases do the percentages of created and pushed repos differ by more than 2 points: Perl in 2009 (+4.1% pushed) and Ruby in 2008 (–3.5% pushed), both of which are likely artifacts due to the caveats described earlier.

This result is particularly striking because there’s no difference over time despite a broader audience joining GitHub, and there’s also no difference across all of these language communities. The vast majority of repositories (>98%) are modified only in the year they are created, and they’re never touched again. This is consistent with my previous research exploring the size of open-source projects, where we saw that 87% of repositories have ≤5 contributors.

Are GitHub issues a better measure of interest?

One potential problem with looking at repositories is that it’s not a reflection of usage and only a fairly indirect measurement of interest in a given codebase. It instead measures developers creating new code. To get a closer look at usage, some possibilities are forks, stars, or issues. GitHub’s search API makes it more convenient to focus on issues, so that’s what I measured for this post. My expectation going into this was that issues would be much more biased by extremely popular projects with large numbers of users, but let’s take a look:

This gave me a fairly similar set of graphs to the new-repository data. It’s critical to note that although these are new issues, they’re filed against both new and preexisting repos so the trends are not directly comparable in that sense. Rather, they’re comparable in terms of thinking about different measurements of developer interest in a given language during the same timeframe. The peaks in Ruby, Python, and C++ early on are all due to particularly popular projects that dominated GitHub in its earlier days, when it was a far smaller collection of projects. Other than that, let’s take a look through the real trends.

Nearly all of these trends are consistent with new repos. With the clear exception of Ruby and less obvious example of JavaScript, the trends above are largely consistent with those in the previous set of graphs. I’ll focus mainly on the exceptions in my other points.

JavaScript’s increase appears asymptotic rather than linear. In other words, it continues to increase but it’s decelerating, and it appears to be moving toward a static share around 25% of new issues. This may be the case with new repos as well, but it’s less obvious there than here.

Ruby’s seen a steep decline since 2009. It peaked early on with Rails-related projects, but as GitHub grew mainstream, Ruby’s share of issues dropped back down. But again, this trend seems to be gradually flattening out around 10% of total issues.

Java and PHP have both grown and stabilized. In both cases, they’ve reached around 10% of issue share and remained largely steady since then, although Java may continue to see slow growth here.

Python’s issue count has consistently shrunk since 2009. Since dropping to 15% after an initial spike in 2008, it’s slowly come down to just above 10%. Given the past trend, which may be flattening out, it’s unclear whether it will continue to shrink.
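For readers who want to collect similar issue counts, here is a rough sketch of how a per-language, per-year query against GitHub's issue search endpoint could be built. The exact queries behind the graphs in this post are not given, so the qualifier combination and the `issue_count_url` helper are assumptions; the snippet only builds the URL and makes no network request.

```python
from urllib.parse import urlencode

SEARCH_ISSUES_URL = "https://api.github.com/search/issues"


def issue_count_url(language, year):
    """Build a search URL for issues filed in one year against repos in one language."""
    # Assumed qualifier combination: issues only, repository language, creation date range.
    query = f"type:issue language:{language} created:{year}-01-01..{year}-12-31"
    # per_page=1 keeps the response small; the count of matches is reported
    # in the response's total_count field.
    return f"{SEARCH_ISSUES_URL}?{urlencode({'q': query, 'per_page': 1})}"


print(issue_count_url("ruby", 2013))
```

Dividing each language's count by the count for the same year without the `language` qualifier would give a percentage share comparable to the ones discussed here.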

The developer-centric (rather than code-centric) perspective

What if we take a different tack and focus on the primary language of new users joining GitHub? This creates a wildly different set of trends that’s reflective of individual users, rather than being weighted toward activist users who create lots of repositories and issues.

The points I find most interesting about these graphs are:

There are no clearly artifactual spikes. All of the trends here are fairly smooth, very much unlike both the repos and issues. This is very encouraging because it suggests any results here may be more reliable rather than spurious.

Language rank remains quite similar to the other two datasets. Every dataset is ordered by the number of new repos created in each language in 2013, to make comparisons simpler across datasets. If you look at activity in 2013 for issues and users, you can see that their values are generally ranked in the correct order with a few minor exceptions. One in this case is that Java and Ruby should clearly be reversed, but that’s about all that’s obviously out of order.

Almost every language shows a long-term downhill trend. With the exception of Java and (recently) CSS, all of these languages have been decreasing. This was a bit of a puzzler and made me wonder more about the fragmentation of languages over time, which I’ll explore later in this post as well as future posts. My initial guess is that users of languages below the top 12 are growing in share to counterbalance the decreases here. It’s also possible that GitHub may leave some users unclassified, which would tend to lower everything else’s proportion over time.

I’m therefore not going to focus on linear decreases. I will, however, examine nonlinear decreases, or anything that’s otherwise an exception such as increases.

Ruby’s downward slide shows an interesting sort of exponential decay. This is actually “slower” than a linear decrease as it curves upwards, so it indicates that relative to everything else moving linearly downward, Ruby held onto its share better.

Java was the only top language that showed long-term increases during this time. Violating all expectations and trends, new Java users on GitHub even grew as a percentage of overall new users, while everything else went downhill. This further supports the assertion that GitHub is reaching the enterprise.
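One way to read the point about Ruby's exponential decay being "slower" than a linear decrease is to compare two curves that start at the same share and the same initial rate of loss. The sketch below does that with invented numbers (a 20% share losing 2 points per year at first); these are not values from the graphs, and both function names are hypothetical.

```python
import math


def linear_share(start, rate, years):
    """Share after losing a fixed number of percentage points per year."""
    return start - rate * years


def exponential_share(start, rate, years):
    """Share that starts with the same slope but decays in proportion to what remains."""
    return start * math.exp(-(rate / start) * years)


start, rate = 20.0, 2.0  # hypothetical starting share and initial loss per year

for years in range(6):
    linear = linear_share(start, rate, years)
    exponential = exponential_share(start, rate, years)
    # The gap widens each year: the exponential curve bends upward, away from the line.
    print(years, round(linear, 2), round(exponential, 2), round(exponential - linear, 2))
```

Because exp(-x) is never below 1 - x, the exponential curve stays at or above the straight line in every year, which is the sense in which it holds onto share better.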

A consensus approach accounts for outliers

When I aggregated all three datasets together to look at how trends correlated across them, everything got quite clear:

Artifacts become obvious as spikes in only one of the three datasets, as happens for a number of languages in the 2009–2010 time frame. It’s increasingly obvious that only 5 languages have historically mattered on GitHub on the basis of overall share: JavaScript, Ruby, Java, PHP, and Python. New contender CSS is on the way up, while C and C++ hold honorable mentions. Everything else is, on a volume basis, irrelevant today, even if it’s showing fantastic growth like Go and will likely be relevant in these rankings within the next year or two.
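The idea of spotting artifacts as spikes that appear in only one of the three datasets can be sketched roughly as follows. This is not the method used to produce the graphs, just one simple way to automate the same eyeball test; the 3-point threshold, the `flag_artifacts` name, and the sample percentages are all assumptions.

```python
from statistics import median


def flag_artifacts(shares_by_year, threshold=3.0):
    """Return (year, dataset) pairs where exactly one dataset strays from the median."""
    flagged = []
    for year, shares in sorted(shares_by_year.items()):
        middle = median(shares.values())
        outliers = [name for name, value in shares.items()
                    if abs(value - middle) > threshold]
        # A spike seen in only one dataset is treated as a likely artifact.
        if len(outliers) == 1:
            flagged.append((year, outliers[0]))
    return flagged


# Hypothetical percentage shares for one language, invented for illustration.
example_language = {
    2009: {"repos": 9.0, "issues": 2.6, "users": 2.7},
    2010: {"repos": 2.4, "issues": 2.3, "users": 2.5},
    2011: {"repos": 2.2, "issues": 2.1, "users": 2.3},
}

print(flag_artifacts(example_language))  # [(2009, 'repos')]
```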

The fragmenting landscape

In looking at the decline in the past couple of years among many of the top languages, I started wondering whether it was nearly all going to JavaScript and Java or whether there might be more hidden in there. After all, there’s a whole lot more than 12 languages on GitHub. So I next looked at total repository creation and subtracted only the languages shown above, to look at the long tail.

Although you can see an initial rush by the small but diverse community of early adopters creating lots of repositories in less-popular languages, it dropped off dramatically as GitHub exploded in popularity. Then the trend begins a more gradual increase as a wide variety of smaller language communities migrate onto GitHub. New issues show a similar but slower increase starting in 2009, when GitHub added issues. While new users increase the fastest, that likely reflects a combination of users in less-popular languages and “lurker” users with no repositories at all, and therefore no primary language.
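The long-tail figure used in this section is just what remains after subtracting the top languages from the total. A minimal sketch of that arithmetic, with invented counts and a hypothetical `long_tail_share` helper:

```python
def long_tail_share(total, top_language_counts):
    """Percentage of the total not accounted for by the listed top languages."""
    tail = total - sum(top_language_counts.values())
    return 100.0 * tail / total


# Hypothetical counts for one year, shortened to five languages for brevity.
total_new_repos = 1_000_000
top_counts = {
    "JavaScript": 250_000,
    "Ruby": 150_000,
    "Java": 130_000,
    "PHP": 110_000,
    "Python": 110_000,
}

print(long_tail_share(total_new_repos, top_counts))  # 25.0
```

Note that anything with no detected language, such as the "lurker" users mentioned above, ends up in this remainder as well.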

The programming landscape today continues to fragment, and this GitHub data supports that trend over time as well as an increasing overlap with the mainstream, not only early adopters.

Update (2014/05/05): Here’s raw data from yesterday in Google Docs.

Update (2014/05/08): Simplify graphs as per advice from Jennifer Bryan.



Disclosure: GitHub has been a client.