Two weeks ago, iHackerNews.com (a creation of Ronnie Roller) made available a Hacker News-related dataset via BitTorrent. For those of you unfamiliar with Hacker News, it’s a portal, run by Paul Graham’s Y Combinator, where developers discuss items of interest. Hacker News may not be considered representative of developers broadly, but it is generally well trafficked by alpha geek types, and thus conclusions drawn from it have predictive value.

The dataset included a little more than 1.7M items from Hacker News in a basic XML structure. Believing that this represented collective wisdom of a sort, I collected the set shortly after it was made available, which proved to be shortly before it was taken down.

Having examined the dataset only briefly, I can’t yet say what might reasonably be extracted from it. That said, even the superficial metrics, bearing the requisite caveats in mind, are proving to be of interest.

As an example, the histogram below represents the distribution of select programming language mentions on Hacker News. It records nothing more than mentions; it’s blind to multiple occurrences in a single sentence, for example, let alone the nuance of sentiment. But given the scale of the dataset, the distribution remains interesting.





This rough metric is just a datapoint, and obviously cannot by itself contradict broader claims such as Forrester analyst Mike Gualtieri’s “Java is a Dead-End For Enterprise App Development.” But in our view, such claims should include in situ assessments of developer behaviors in addition to traditional analyst firm survey work.

Hence our interest in datasets like Hacker News, and the reason we built RedMonk Analytics, which will be our primary mechanism for sharing similar data directly with our customers. We’ll keep you apprised of what we learn from the dataset, and if you have questions you’d like us to ask of it we’re open to suggestions.

As a technical note for those interested, the frequency counts were done with Cloudera’s Hadoop distribution and their sample examples.jar application. This provides for case-insensitive substring searching, so that the metrics above do reflect, as an example, both “Java” and “java.”
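
For readers who want to try this kind of count without a Hadoop cluster, here is a minimal sketch in plain Python of a tally that treats “Java” and “java” alike. It is not the Hadoop job used for the numbers above: the language list, the function name count_mentions, and the sample items are all assumptions made up for illustration, and it counts every substring match rather than reproducing the exact behavior of examples.jar.

```python
import re
from collections import Counter

# Hypothetical search terms; the post does not list the exact ones used.
LANGUAGES = ["java", "python", "ruby", "erlang", "lisp"]


def count_mentions(items, languages=LANGUAGES):
    """Count case-insensitive substring matches of each language name."""
    counts = Counter({lang: 0 for lang in languages})
    for lang in languages:
        # re.escape keeps the name literal; IGNORECASE matches "Java" and "java".
        pattern = re.compile(re.escape(lang), re.IGNORECASE)
        for text in items:
            counts[lang] += len(pattern.findall(text))
    return counts


# Made-up sample items, not taken from the dataset.
sample_items = [
    "Java is fine, but java startup time hurts.",
    "Rewrote it in Python; JavaScript was the bottleneck.",
    "Ruby and python both work here.",
]

counts = count_mentions(sample_items)
for lang, n in counts.most_common():
    print(lang, n)
```

One caveat this sketch makes visible: plain substring matching also counts “JavaScript” as a mention of “java,” so short names such as “C” would need word boundaries or some other stricter pattern before the counts could be trusted.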

Update: That was fast. Thomas Winningham, via Twitter, passes along his look at the dataset with historical Python vs Ruby numbers through October, which is available here.

Update 2: By request, I’ve updated the graphic to reflect count data for Erlang and Lisp.

Update 3: By further request, I’ve updated the graphic to reflect count data for C, C#, Haskell and Perl.

Disclosure: Cloudera is a RedMonk customer, and I am a Hacker News member.