From: andrew cooke <andrew@...>

Date: Wed, 31 Aug 2011 08:54:22 -0300

I had some Clojure code that needed optimising. Eventually I achieved a
5x speedup. This is how.

Finding Something to Measure
----------------------------

First, I needed something to measure, so that I could tell how effective
my changes were. This was complicated by three things: the need to
generate test data to process; the delayed action of the JIT; and
garbage collection.

Generating test data is quite expensive, so I needed to tune parameters
carefully so that the processing used more time than the test data
generation (not necessary for the simple timings, but important for
profiling, since I could find no simple way to delay profiling until
after the data were ready). Then I placed the processing in a loop,
repeating it multiple times (re-using the same test data). Each pass
through the loop was timed - the successive times let me see when the
JIT engaged (typically the first calculation was 50% slower) and also
let me reduce the effect of GC by selecting the lowest value.

Profiling
---------

Once I had a good test running in my IDE, I needed to make standalone
Java classes to profile. The simplest way to do this seems to be to use
a build tool called leiningen -
https://github.com/technomancy/leiningen - which was very easy to use.
You just download and run it, create an empty project, and modify the
project descriptor file (which I then copied into my existing project).
My file contents are:

  (defproject compressive "0.0.0"
    :description "set-related experiments"
    :aot [#"com.isti.compset.*"]
    :main com.isti.compset.stack
    :dependencies [[org.clojure/clojure "1.3.0-beta1"]
                   [org.clojure/clojure-contrib "1.2.0"]])

where:

- :aot indicates which classes need compiling
- :main indicates the location of the "main" file
- :dependencies are downloaded automatically

I also added the (-main) method and put (:gen-class) in my ns
declaration at the top of the file.
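Putting the pieces above together, a minimal sketch of the ns
declaration, the (-main) entry point, and the repeated, individually
timed passes might look something like this (this is not the author's
actual code: make-test-data and process are hypothetical placeholders
for the real data generation and spectrum processing; only the
namespace name comes from the project descriptor):

```clojure
(ns com.isti.compset.stack
  (:gen-class))

;; Hypothetical stand-ins for the real work.
(defn make-test-data []
  (double-array (range 1000000)))

(defn process [^doubles data]
  (areduce data i acc 0.0 (+ acc (aget data i))))

(defn -main [& args]
  (let [data  (make-test-data)              ; expensive, so done only once
        times (doall
                (for [_ (range 10)]         ; repeat, re-using the same data
                  (let [start (System/nanoTime)]
                    (process data)
                    (/ (- (System/nanoTime) start) 1e9))))]
    ;; the successive times show when the JIT engages; taking the
    ;; minimum reduces the influence of garbage collection
    (println "times:" times)
    (println "best: " (apply min times))))
```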
With all that, "lein compile" generated a set of classes that could be
run with java:

  java -cp classes:lib/clojure-1.3.0-beta1.jar \
    -agentlib:hprof=cpu=samples,depth=10,file=hprof \
    com.isti.compset.stack

giving a profile in the file "hprof". This profile was dominated by
expected methods related to generating and consuming sequences, and to
adding and multiplying generic numbers.

Vector of Doubles
-----------------

The central loop in my code sums various spectra and the squares of
their values. The spectra were stored as "vec" instances. I hoped to
remove the use of generic numbers by replacing these with vectors of
doubles and, if necessary, adding some type annotations to various
functions. So I replaced "vec" with

  (defn d-vec [collection]
    (apply conj (vector-of :double) collection))

However, this did not have the desired effect - the code was
significantly slower and the trace was dominated by calls to
"Vec.count". I asked for help on stack overflow
http://stackoverflow.com/q/7223297/181772 but didn't get any very
useful response, so I removed these changes.

Array of Doubles
----------------

At this point I was a little disheartened. The traces didn't have much
useful information (that I could see, even using a nice little tool
called PerfAnal
http://java.sun.com/developer/technicalArticles/Programming/perfanal/).
So I looked at my inner loop and decided that I would simply rewrite it
as I thought it should be written, relying on experience rather than
profiling. I also found this page http://clojure.org/java_interop on
the use of arrays helpful.

I made the following changes:

- replace the vector in which I was accumulating results with a native
  array of doubles (using double-array)
- mutate the accumulator rather than creating a new instance each loop
- use additional indices to identify the "active range" of the
  accumulator (which may get smaller over time, due to restricted data
  ranges in spectra). This had not been necessary when I was generating
  new immutable instances - the instances simply decreased in size.
- change the spectra from vectors to arrays
- replace some maps and reduces with explicit loops that mutate data

This was, initially, a complete disaster. Running time increased by
roughly a hundredfold.

Type Annotations
----------------

Rather than reverting these changes, which seemed justified by
experience, I looked again at profiling. The profile was now dominated
by methods related to Java introspection. This is a known issue with
Clojure, and it has a simple solution - add

  (set! *warn-on-reflection* true)

and remove all the resulting warnings by adding type hints. I did not
enable this for all code - only for the central loop and related
functions. And I used the "^doubles" hint, which indicates a native
array of doubles.

This, finally, gave a significant speedup - about 5x faster than the
initial code.

Conclusions
-----------

- Clojure code is simplest, and reasonably efficient, when you stay
  within Clojure's own types (vec, seq etc).
- Profiling with hprof can give broad information, but I, at least,
  couldn't understand it in detail (and I can normally read profiling
  data) - there seemed to be information missing, or I simply do not
  understand how Clojure works.
- It was fairly simple to change the code to use typed, mutable data in
  the inner loop, which gave significant speedups.
- Dynamic interop with Java is appallingly slow, but easy to fix once
  you see that it is happening.
- More generally, Clojure works fairly well for exploratory data
  processing. It's nicely high-level and easy to work with, and can be
  optimised where necessary. I have also implemented this particular
  algorithm in C and on GPUs, and I am sure it is faster in both cases,
  but (without doing any formal measurements) the code feels close
  enough to C for me to experiment with new ideas quickly.
  I doubt this would have been possible in Python, for example (even
  with numpy - the inner loop is annoyingly hard to vectorize).
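For concreteness, the kind of inner loop the "Array of Doubles" and
"Type Annotations" changes lead to can be sketched as follows. This is
a hedged illustration, not the author's actual code: the function name
accumulate! and its arguments are invented, but it shows the pattern
described - mutating hinted double arrays over an explicit active
range, with reflection warnings enabled:

```clojure
;; surface any remaining dynamic interop in this file
(set! *warn-on-reflection* true)

(defn accumulate!
  "Add each value of spectrum, and its square, into the mutable
  accumulators sum and sum2 over the active range [start, end).
  All three arrays are hinted as native double arrays."
  [^doubles sum ^doubles sum2 ^doubles spectrum ^long start ^long end]
  (loop [i start]
    (when (< i end)
      (let [x (aget spectrum i)]
        (aset sum i (+ (aget sum i) x))
        (aset sum2 i (+ (aget sum2 i) (* x x))))
      (recur (inc i)))))
```

Without the ^doubles hints, each aget and aset here would go through
reflection, which is the hundredfold slowdown described above.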