January 19, 2015

I have assembled a file called massive.clj. It contains 325k lines of concatenated open source Clojure code: every .clj file from the 50 most-starred Clojure projects on GitHub (excluding Clojure itself).

Let’s roll with some stats to understand some basic metrics of the most-used Clojure code, shall we?

| Metric | Count |
| --- | --- |
| File count | 2,359 |
| Line count | 325,153 |
| Lines of code* | 275,690 |
| Comment lines | 23,266 |
| SLOC | 252,424 |
| Top-level forms | 15,550 |
| Total amount of forms | 477,000 |

(*) Not counting whitespace-only lines or parentheses-only lines.

In general, we have a 1:11 comment-to-code ratio and an average of 138 total lines per file (107 of them are code). The longest file is 8,065 lines (a huge config of UIKit bindings in the Clojure-C project), the longest non-config file is clojure.typed’s test.core at 4,573 lines, and the longest non-config, non-test file is charts.clj, a JFreeChart wrapper in Incanter, Clojure’s most important data science project. At the short end, there are numerous 1-line files.
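The averages and the ratio can be re-derived from the metrics table. The sketch below is only a sanity check: the map keys and the `per-file` helper are names made up for illustration, and it assumes that the "code" figure per file is based on SLOC, which is what the numbers suggest.

```clojure
;; Figures copied from the metrics table above.
(def metrics
  {:files         2359
   :lines         325153
   :comment-lines 23266
   :sloc          252424})

(defn per-file
  "Average of metric `k` per file, rounded to the nearest integer."
  [k]
  (Math/round (double (/ (metrics k) (:files metrics)))))

(per-file :lines) ;; => 138 total lines per file
(per-file :sloc)  ;; => 107 code lines per file (assuming SLOC is meant)

;; Lines of code per comment line: roughly 10.85, i.e. about 11 to 1.
(def code-per-comment
  (double (/ (:sloc metrics) (:comment-lines metrics))))
```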

Files and lines aside, let’s focus on what’s ultimately most important, the code itself.

There are 15,550 top-level forms across the entire codebase: surprisingly few, considering that it includes incredibly prolific and complex projects such as ClojureScript, Compojure, core.logic, Typed Clojure, Incanter, Leiningen, LightTable, Midje, Pedestal, Quil and 40 more!

Out of these top-level forms:

| Type of form | Count | What’s this? |
| --- | --- | --- |
| defn | 4,739 | Function |
| deftest | 2,048 | Test |
| def | 1,609 | Constant |
| ns | 978 | Namespace |
| defn- | 894 | Private function |
| defmacro | 892 | Macro |
| defmethod | 735 | Case of a multimethod |
| clojure.typed/ann | 543 | Typed Clojure type hint |
| defrecord | 154 | Struct |
| Others | 2,958 | Protocols, multimethods etc |
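A breakdown like this can be produced by reading each top-level form without evaluating it and tallying the symbol in head position. What follows is a minimal sketch, not necessarily how the numbers above were computed: `top-level-forms`, `form-type-counts` and the `sample` source are hypothetical names invented for the example. Real-world files may also need extra handling (auto-resolved `::alias/keywords`, tagged literals, reader conditionals) that this sketch leaves out.

```clojure
(import '[java.io PushbackReader StringReader])

(defn top-level-forms
  "Reads every top-level form from the string `src` without evaluating it."
  [src]
  (binding [*read-eval* false]            ; refuse #=(...) eval forms
    (let [rdr (PushbackReader. (StringReader. src))
          eof (Object.)]                  ; unique end-of-input marker
      (doall
       (take-while #(not (identical? eof %))
                   (repeatedly #(read rdr false eof)))))))

(defn form-type-counts
  "Maps the head symbol of each top-level list form to its number of occurrences."
  [src]
  (->> (top-level-forms src)
       (filter seq?)
       (map first)
       frequencies))

;; Made-up sample source, for illustration only.
(def sample
  "(ns demo.core)
   (def answer 42)
   (defn add [a b] (+ a b))
   (defn- helper [x] x)
   (defn inc2 [x] (+ x 2))")

(form-type-counts sample)
;; one ns, one def, two defn, one defn-
```

Running the same function over the contents of massive.clj (for example via `slurp`) would give per-type totals comparable to the table.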

How often are docstrings used? 41% of public functions (1,962 functions) have them; 59% do not. Among private functions, only 28% (253 functions) have docstrings. The average docstring is 116 characters long, with the shortest being only 10 characters and the longest 4,039 (whoa!).
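One simple way to detect a docstring is to check whether the element right after the function name is a string. The sketch below assumes exactly that shape, `(defn name "doc" ...)`, so docstrings supplied through metadata would be missed. The helper names and sample forms are hypothetical.

```clojure
(defn docstring
  "Returns the docstring of a defn/defn- form, or nil if it has none.
  Only the common shape (defn name \"doc\" ...) is recognized."
  [form]
  (let [[head _name maybe-doc] form]
    (when (and ('#{defn defn-} head) (string? maybe-doc))
      maybe-doc)))

(defn docstring-share
  "Fraction of the given forms that carry a docstring."
  [forms]
  (/ (count (keep docstring forms)) (count forms)))

;; Made-up forms, for illustration only.
(def forms
  (map read-string
       ["(defn add \"Adds two numbers.\" [a b] (+ a b))"
        "(defn sub [a b] (- a b))"
        "(defn- helper \"Internal.\" [x] x)"]))

(map docstring forms)
;; => ("Adds two numbers." nil "Internal.")

(docstring-share forms)
;; => 2/3
```

Average, shortest and longest docstring lengths then follow from applying `count` to the strings that `docstring` returns.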

As for argument counts, most functions take one or two arguments, with some notable exceptions: 0-argument functions and functions of arity 7, 8, or even 10.

| Argument count | # of functions |
| --- | --- |
| 0 | 288 |
| 1 | 1,888 |
| 2 | 1,198 |
| 3 | 472 |
| 4 | 162 |
| 5 | 87 |
| 6 | 21 |
| 7 | 12 |
| 8 | 2 |
| 10 | 1 |

Out of 477,000 internal forms in total, 157,760 (about a third) are meaningful, that is, they are a call to a function, a macro or a special form. Among them, the 100 most popular elements are:

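| Fn/macro/special form | # of occurrences |
| --- | --- |
| list | 19626 |
| quote | 12720 |
| seq | 7629 |
| concat | 7529 |
| let | 5307 |
| defn | 5026 |
| = | 4710 |
| is | 4159 |
| apply | 2196 |
| deftest | 2050 |
| if | 2029 |
| def | 1784 |
| fn | 1552 |
| str | 1186 |
| map | 1160 |
| when | 1091 |
| fn* | 1077 |
| ns | 1014 |
| defn- | 928 |
| defmacro | 910 |
| and | 902 |
| defmethod | 884 |
| -> | 864 |
| :require | 808 |
| assoc | 678 |
| ann | 672 |
| count | 649 |
| not | 608 |
| testing | 569 |
| cond | 569 |
| or | 550 |
| deref | 531 |
| do | 509 |
| println | 498 |
| == | 481 |
| first | 478 |
| recur | 422 |
| when-not | 401 |
| doseq | 399 |
| emitln | 346 |
| :use | 344 |
| is-clj | 337 |
| assert | 336 |
| is-tc-e | 330 |
| run* | 320 |
| * | 306 |
| if-let | 303 |
| update-in | 291 |
| :import | 282 |
| throw | 282 |
| nil? | 268 |
| reduce | 267 |
| + | 261 |
| loop | 259 |
| ->> | 248 |
| empty? | 240 |
| into | 238 |
| binding | 229 |
| emits | 226 |
| - | 221 |
| fd/interval | 213 |
| try | 211 |
| for | 211 |
| conj | 209 |
| instance? | 208 |
| catch | 204 |
| swap! | 203 |
| fresh | 200 |
| next | 200 |
| f | 196 |
| All | 195 |
| merge | 193 |
| contains? | 191 |
| inc | 189 |
| core-run | 158 |
| range | 183 |
| meta | 182 |
| nil | 175 |
| when-let | 175 |
| declare | 174 |
| var | 172 |
| set | 168 |
| every? | 167 |
| get | 166 |
| matrix | 165 |
| ret | 164 |
| defrecord | 164 |
| enqueue | 158 |
| defalias | 157 |
| rest | 156 |
| < | 155 |
| mapv | 148 |
| sel | 146 |
| name | 146 |
| / | 146 |
| nom/tie | 146 |
| test?<- | 144 |
| HMap | 143 |
| defprotocol | 143 |
| is-tc-err | 142 |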
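A frequency list of this kind can be approximated by walking every nested form and counting whatever sits in head position of each list. The sketch below is one possible approach under that assumption, with hypothetical names (`head-counts`, `form`). Note that reading source text expands reader shorthand, so `'x` arrives as `(quote x)`; that might partly explain why `quote` ranks so high.

```clojure
(defn head-counts
  "Counts how often each symbol or keyword appears in head position of a
  list anywhere inside `form`, nested forms included."
  [form]
  (->> (tree-seq coll? seq form)   ; the form itself plus all nested collections
       (filter seq?)               ; keep lists only; skip vectors, maps, sets
       (map first)
       (filter #(or (symbol? %) (keyword? %)))
       frequencies))

;; Made-up form, for illustration only.
(def form
  (read-string "(defn f [x] (if (pos? x) (inc (inc x)) 'none))"))

(head-counts form)
;; defn, if, pos? and quote once each; inc twice

;; Most frequent head first.
(first (sort-by (comp - val) (head-counts form)))
;; => [inc 2]
```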

Do you know any other good statistics to run on this dataset? Tell me by email (zirkonit at gmail.com), or, better yet, fork the repository on GitHub (https://github.com/zirkonit/clamjamfry) and run the stats yourself!
