Summary: Would you like to optimize your learning of Clojure? Would you like to focus on learning only the most useful parts of the language first? Take this lesson from second language learning: learn the expressions in order of frequency of use.

When I was learning Spanish, I liked to use Anki to drill new vocabulary. It’s a flashcard program. I found that someone had made a set of cards from an analysis of thousands of newspapers. They read in all of the words from the newspapers, counted them up, and figured out what the most common words were. The top 1000 made it into the deck.

It turns out that this is a very good strategy for learning words. Word frequency follows a hockey stick distribution. The most common words are used so much more than the less common words. For instance, the 100 most common English words make up more than 50% of text. If you’ve got limited time, you should learn those most common words first.

People who are trying to learn Clojure have been asking me “how do I learn all of this stuff? There’s so much!” It’s a valid question and I haven’t had a good answer. I remembered the Spanish newspaper analysis and I thought I’d try to do a similar analysis of Clojure expressions.

Method

Clojars provides a list of all of their projects. Many of them include a git repo. So I got that list and fetched the git repo. There were over seven thousand git repos pointed to by Clojars. That was a lot of stuff, and git repos have a lot of data I wasn’t interested in, so I filtered all the files looking for Clojure files. /\.clj(s|c|x)?$/ That left 59,665 Clojure source files with over four million lines total.

I read in all of the code in the files and parse it as Clojure. Then I walked it, looking for lists. I counted the head of all lists in each file, aggregating it all together. Then I sorted by frequency. I also only counted symbols in the head position.

Warnings

Just a few warnings before I continue. This is a very informal study. I didn’t try to remove duplicate files. Some files failed to parse. And this obviously isn’t all Clojure code on GitHub. It even misses quite a few important projects because they’re not in Clojars. But, it’s a large dataset and probably good enough for figuring out relative frequencies of expressions to use for learning.

Results

Here are successive graphs zooming in on the word frequencies. In each, I’ve ordered the expressions by frequency and plotted the frequency. Notice the nice long tail curve.

Top 100 symbols ordered most frequent first

This means that learning to read and write expressions at the left end of the graph will have an outsized impact on being able to read and write code in general. Even the most frequent 25 expressions are vastly more common than the rest. But looking at the expressions by hand, I think if you learned about 200 of them, you’d cover most code easily. Between 100 and 200, lots of third party library functions and lots of common variable names ( f for function) start to show up. Either way, you should start at the left hand side and work your way right. Most of what you would learn in LispCast Introduction to Clojure is included in the top 100.

Conclusions

I think the results show that learning the most common 100 expressions will have an outsized usefulness. If you’re interested in learning Clojure, you should begin with those. I’ve created flashcards in PDF form that you can print, and also a CSV version that you can import into Anki for spaced repetition. You can get access to them, and other resources for learning Clojure, here.

Data