Scala in Numbers

The Ecosystem Census

Source code statistics of common Scala libraries

2014-06-18, Scala Days 2014 Berlin

Johannes Rudolph / @virtualvoid

Scientific Approach

Start with hypothesis

Gather evidence

Try to prevent biases

Make sure data is correct

Make experiments reproducible

Choose relevant features

Goal: discover knowledge

Ad Hoc Approach

Start with rough idea

Technology first

Run some experiments

Look for the interesting needles in the data haystack

Hope to learn something in the process

Data basis

Scala 2.10

Mostly libraries from the 2.11 release announcement list

libraries

libraries ~ 200 MB json files of analysis output

Libraries

Name | LoC (*) | Source files

(*) Lines the compiler reported an AST for

Doing statistics is hard

Start with hypotheses?

Start with data?

Start with visualizations?

Identifiers

Local vals

Rank | Name | Occurrences | Top Library | % of Total*

(*) % of occurrences the top library contributed

Example

GenericCompanion.scala:

    def apply[A](elems: A*): CC[A] = {
      if (elems.isEmpty) empty[A]
      else {
        val b = newBuilder[A]
        b ++= elems
        b.result()
      }
    }

Local vars

Rank | Name | Occurrences | Top Library | % of Total*

(*) % of occurrences the top library contributed

Vals vs. Vars

Name | Vars | Vals | % Vars

Example

rapture-io / streams.scala

    override def foreach[U](fn: Data => U): Unit = {
      var next: Option[Data] = read()
      while(next != None) {
        fn(next.get)
        next = read()
      }
    }

scala-stm / TxnHashTrie.scala

    def mapForeach[U](f: ((A, B)) => U) {
      var i = 0
      while (i < hashes.length) {
        f(getKeyValue(i))
        i += 1
      }
    }

Parameter names

Rank | Name | Occurrences | Top Library | % of Total*

(*) % of occurrences the top library contributed

Example

scalaz-core / Lens.scala

    def map[C](f: B1 => C): State[A1, C]
    def >-[C](f: B1 => C): State[A1, C]
    def flatMap[C](f: B1 => IndexedState[A1, A2, C]): IndexedState[A1, A2, C]

Lengths of Local Identifiers

Longest: linearizedTargetColumnsForOriginalTargetTable from slick

Lengths of Local Identifiers by Library

Library | Avg. Length | Median | Shortest | Longest
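The columns of this table can be computed from a plain list of identifier names per library. The following is a minimal sketch, not the code used for the talk; `IdentifierLengths`, `LengthStats`, and the input shape (library name mapped to its local identifiers) are assumptions made for illustration.

```scala
object IdentifierLengths {
  // Summary of identifier lengths for one library (hypothetical shape).
  final case class LengthStats(avg: Double, median: Double, shortest: String, longest: String)

  // Computes average, median, shortest and longest for a non-empty list of names.
  def stats(names: Seq[String]): LengthStats = {
    require(names.nonEmpty, "need at least one identifier")
    val lengths = names.map(_.length).sorted
    val n = lengths.size
    // Median of the sorted lengths: middle element, or mean of the two middle ones.
    val median =
      if (n % 2 == 1) lengths(n / 2).toDouble
      else (lengths(n / 2 - 1) + lengths(n / 2)) / 2.0
    LengthStats(
      avg = lengths.sum.toDouble / n,
      median = median,
      shortest = names.minBy(_.length),
      longest = names.maxBy(_.length)
    )
  }

  // Assumed input: library name -> local identifiers found in that library.
  def byLibrary(idents: Map[String, Seq[String]]): Map[String, LengthStats] =
    idents.map { case (lib, names) => lib -> stats(names) }
}
```

Calling `IdentifierLengths.byLibrary` on the extracted identifiers would yield one row per library.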

Lengths of Local Identifiers

Scala Library Usage

Scala Library Usage (12 - 24)

Name | % Users

Scala Library Usage Top 12

Name | % Users

Scala Library Usage (scala.collection)

Name | % Users

Predef Usages

Name | %

Predef Enhancements Usages

Method | Extension | %

Implicit usage

Implicit parameter definitions by type

Type | # | Top user | %

Implicit params from scala-library

Type | # | Top user | %

Making-Of

Parts

Crawler

Compiler

Feature extraction

Analysis

Frontend

Crawler

Input: Maven ModuleID

Output: source jar and binary dependency jars

Uses ivy/sbt-ivy to access Maven repositories
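To make the crawler's input and output concrete: a Maven module ID maps to a predictable location of its source jar in a repository. The sketch below only builds that URL from the standard Maven repository layout; it does not use ivy, and `ModuleID`, `SourceJarUrl`, and the choice of Maven Central as base URL are assumptions for illustration, not the talk's implementation.

```scala
object SourceJarUrl {
  // A Maven module ID: organization, artifact name, and version (hypothetical type).
  final case class ModuleID(organization: String, name: String, revision: String)

  // Assumption: the artifact is published on Maven Central.
  val MavenCentral = "https://repo1.maven.org/maven2"

  // Standard Maven layout:
  // <org with dots as slashes>/<name>/<revision>/<name>-<revision>[-<classifier>].jar
  def jarUrl(m: ModuleID, classifier: Option[String] = None, base: String = MavenCentral): String = {
    val orgPath = m.organization.replace('.', '/')
    val suffix = classifier.map("-" + _).getOrElse("")
    s"$base/$orgPath/${m.name}/${m.revision}/${m.name}-${m.revision}$suffix.jar"
  }

  // Source jars are published under the "sources" classifier.
  def sourcesJarUrl(m: ModuleID): String = jarUrl(m, Some("sources"))
}
```

Resolving the binary dependency jars additionally requires reading the module's dependency metadata, which is the part ivy takes care of.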

Compiler

Input: sources and dependency jars

Initializes presentation compiler with all sources

Allows running queries over source trees

    trait AnalyzingCompiler {
      def analyze[T](factory: AnalyzerFactory[Option[T]]): Seq[T]
    }
    trait AnalyzerFactory[T] {
      def create(u: Universe): Analyzer[T] { type U = u.type }
    }
    trait Analyzer[T] {
      // The slide text cuts off after `type U`; the bound and the method
      // below are inferred from the usage that follows.
      type U <: Universe
      def analyze(tree: U#Tree): T
    }

    // Usage: report every selection of a Predef member
    new AnalyzerFactory[Option[PredefUsage]] {
      def create(u: Universe) = new Analyzer[Option[PredefUsage]] {
        def analyze(tree: u.Tree): Option[PredefUsage] = tree match {
          case q"${ _ }.Predef.$method" =>
            Some(PredefUsage(method.decoded, pos(library, tree.pos)))
          case _ => None
        }
      }
    }
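The idea behind the analyzer interface (look at one tree node, optionally report a finding, collect all findings) can be tried without a compiler. The following is a toy model under stated assumptions: the hand-rolled `Tree` type stands in for real compiler trees, a nested `Select` pattern stands in for the quasiquote, and all names in `ToyAnalyzer` are hypothetical.

```scala
object ToyAnalyzer {
  // Toy stand-in for compiler trees (assumption: real trees come from scalac).
  sealed trait Tree
  final case class Ident(name: String) extends Tree
  final case class Select(qualifier: Tree, name: String) extends Tree
  final case class Apply(fun: Tree, args: List[Tree]) extends Tree

  final case class PredefUsage(method: String)

  // An analyzer inspects a single node and produces a result for it.
  trait Analyzer[T] { def analyze(tree: Tree): T }

  // Lists every node of the tree, root first.
  def subtrees(tree: Tree): List[Tree] = tree match {
    case Ident(_)         => List(tree)
    case Select(q, _)     => tree :: subtrees(q)
    case Apply(fun, args) => tree :: (subtrees(fun) ++ args.flatMap(subtrees))
  }

  // Runs the analyzer over all nodes and keeps only the hits.
  def run[T](analyzer: Analyzer[Option[T]], root: Tree): List[T] =
    subtrees(root).flatMap(t => analyzer.analyze(t).toList)

  // Reports every selection of the form <qualifier>.Predef.<method>.
  val predefUsages: Analyzer[Option[PredefUsage]] = new Analyzer[Option[PredefUsage]] {
    def analyze(tree: Tree): Option[PredefUsage] = tree match {
      case Select(Select(_, "Predef"), method) => Some(PredefUsage(method))
      case _                                   => None
    }
  }
}
```

Returning `Option` per node keeps each analyzer a small partial query, while the traversal and collection logic stays in one place.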

Demo: Code Search

Extraction

Extracts one aspect of code

Examples: collect-identifiers, scala-library-references

Runs per library

Results serialized to json
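As an illustration of what one per-library extraction result could look like on disk, here is a minimal sketch that writes identifier counts as JSON by hand. The JSON shape, `IdentifierCount`, and `ExtractionJson` are assumptions for this example; the talk does not show its actual format or which JSON library it used.

```scala
object ExtractionJson {
  // One extracted fact: an identifier and how often it occurred (hypothetical shape).
  final case class IdentifierCount(name: String, occurrences: Int)

  // Escapes a single character for use inside a JSON string literal.
  private def escapeChar(c: Char): String = c match {
    case '"'          => "\\\""
    case '\\'         => "\\\\"
    case '\n'         => "\\n"
    case '\r'         => "\\r"
    case '\t'         => "\\t"
    case _ if c < ' ' => "\\" + "u%04x".format(c.toInt) // other control characters
    case _            => c.toString
  }

  def escape(s: String): String = s.iterator.map(escapeChar).mkString

  // Serializes the counts of one library into a single JSON object.
  def toJson(library: String, counts: Seq[IdentifierCount]): String = {
    val entries = counts.map { c =>
      "{\"name\":\"" + escape(c.name) + "\",\"occurrences\":" + c.occurrences + "}"
    }
    "{\"library\":\"" + escape(library) + "\",\"identifiers\":" +
      entries.mkString("[", ",", "]") + "}"
  }
}
```

Escaping matters here because Scala identifiers may contain backslashes and, in backquoted form, almost any character.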

Analysis

Aggregates per-library data into statistics

Examples: common-identifiers, local-identifier histogram

Results serialized to json
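The aggregation step can be sketched with plain collections. The example below merges per-library identifier counts into a ranking with the same columns as the identifier tables earlier in the talk (rank, name, occurrences, top library, % contributed by the top library). It is an assumption about how such a table could be derived, not the talk's code; `CommonIdentifiers`, `Row`, and the input shape are hypothetical.

```scala
object CommonIdentifiers {
  // One row of the ranking table (hypothetical shape).
  final case class Row(rank: Int, name: String, occurrences: Int,
                       topLibrary: String, topLibraryPercent: Double)

  // perLibrary: library -> (identifier -> number of occurrences in that library)
  def rank(perLibrary: Map[String, Map[String, Int]], top: Int): Seq[Row] = {
    // Flatten into (identifier, library, count) triples.
    val triples = for {
      (lib, counts) <- perLibrary.toSeq
      (name, n)     <- counts.toSeq
    } yield (name, lib, n)

    // Per identifier: total occurrences and the library contributing the most.
    val rows = triples.groupBy(_._1).toSeq.map { case (name, entries) =>
      val total = entries.map(_._3).sum
      val (_, topLib, topCount) = entries.maxBy(_._3)
      (name, total, topLib, 100.0 * topCount / total)
    }

    rows
      .sortBy { case (name, total, _, _) => (-total, name) } // most frequent first, ties by name
      .take(top)
      .zipWithIndex
      .map { case ((name, total, lib, pct), i) => Row(i + 1, name, total, lib, pct) }
  }
}
```

Note that this ranking is not corrected for library size, so a single large library can dominate an entry; the "% of Total" column makes that visible.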

Frontend

Fetch statistics as JSON

Render stats in the presentation on the fly

Problem: Data may still change

Tools used

reveal.js

d3.js

scala-js



Techniques and Libraries used

sbt/ivy for fetching dependencies

presentation compiler for providing data structures

quasi-quotes for matching on interesting trees

scala-js for doing some client side data manipulation

d3 for creating visualizations

Issues with the data

Only library code was considered

Not corrected for code size

Correctness wasn't validated rigorously

Libraries weren't properly pre-screened for relevancy

Some libraries had minor compilation issues

Multi-module libraries weren't aggregated

Last but not least: symbolic operators

Name | #

Top 5 / Top 10 / Top 2

Libraries with Unicode operators

Name | # Operators
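One way to decide whether a name counts as a symbolic or Unicode operator is to follow Scala's lexical syntax, where operator characters are a fixed set of ASCII symbols plus the Unicode categories Sm (math symbol) and So (other symbol). The sketch below is an assumption about a possible classification, not the criterion used in the talk; `SymbolicOperators` and its methods are hypothetical, and mixed identifiers such as `foo_+` are deliberately not covered.

```scala
object SymbolicOperators {
  // ASCII operator characters from the Scala lexical syntax.
  private val asciiOpChars: Set[Char] = "!#%&*+-/:<=>?@\\^|~".toSet

  // True for ASCII operator characters and for Unicode math/other symbols.
  def isOpChar(c: Char): Boolean =
    asciiOpChars(c) || {
      val t = Character.getType(c)
      t == Character.MATH_SYMBOL || t == Character.OTHER_SYMBOL
    }

  // A name is symbolic if it consists of operator characters only.
  def isSymbolic(name: String): Boolean =
    name.nonEmpty && name.forall(isOpChar)

  // A Unicode operator is a symbolic name with at least one non-ASCII character.
  def isUnicodeOperator(name: String): Boolean =
    isSymbolic(name) && name.exists(_ > 127)
}
```

With this definition, `>-` and `++=` are symbolic but not Unicode operators, while a right double arrow (U+21D2) is both.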

Executive Summary

Code is data

Use it to your advantage.