From: andrew cooke <andrew@...>

Date: Wed, 30 Apr 2014 20:22:53 -0400

Disclaimer: Let’s get something clear. This is about writing a Julia package, not about doing data analysis. Julia means different things to different people - this post is about using Julia as a “real programming language” (not for scripting, not for data analysis).

Every language has its own set of idioms - its own approach to structuring programs. If you’re an experienced developer you’re fluent in the languages you use every day. Writing good, idiomatic code comes naturally. But working with a new language is a shock - suddenly things don’t fit together. Your programs are uncomfortable, ungainly, unbalanced.

This was my problem. I wrote a Julia library to calculate CRCs (checksums). But things didn’t turn out as I expected - the maths was fairly easy, but my code was an ugly mess.

Worse, my code was an inefficient, ugly mess. And making it faster was making it uglier and uglier.

Eventually I stalled. I spent a week (or two) rewriting, trying to work out how to better express my ideas in Julia. Eventually made something I was happy with - https://github.com/andrewcooke/CRC.jl/blob/master/src/CRC.jl.

The new code is 2/3 the size (excluding new functionality I added later) and 2.5x faster. It’s also easier to understand.

Here I want to record what I learned. There’s little that’s really new. But still, it’s useful to collect things in one place.

1 - Simplify

Obvious, yes, but it’s easy to get carried away. Julia lets you write code that is independent of the size of integers; the compiler can automatically generate routines for different types (8, 16, 32, 64, 128 bits). So it seemed natural to write a CRC routine that could calculate checksums not just for bytes, but also for streams of 16, 32, 64 and 128 bit values.

Then there’s the size of the polynomial. And the size of the intermediate accumulator. And the size of the index into the lookup table. I let the user choose all of these, and soon had a combinatorial explosion of different integer types, requiring checks for inconsistencies and tweaked algorithms for special cases.

So I backed up; stripped things to the bone. Now, data are only bytes (all that anyone uses anyway). Lookup table indices are 8 bits (16 bits uses too much cache for too little gain). Only the accumulator size can vary (and is chosen internally - it must be large enough to contain the polynomial).

2 - “If” Statements Are A Code Smell

Simplified, my code still had too much gnarly logic. Too many “if” statements.

In an OO language, having many “ifs” often means you need classes for different cases. Then the conditionals change to method calls (the type of the object selecting which code to execute without an explicit “if” statement).

In Julia the equivalent is to dispatch on types. So instead of having one function with many logical paths, I have several, smaller, simpler functions, each to handle a particular type.

3 - Untangle Types

At this point I had a bunch of functions that dispatched on a complicated, parameterized type that described everything about the CRC.

When I looked more closely there were two separate (orthogonal) ideas that “drove” the CRC calculation.

First, whether the input data are reflected or not. This changes how we do the calculation (reflected data are not reflected byte by byte - that would be much too slow - instead “all the rest of the maths” is reflected, and the data are left untouched).

Second, whether we use 0, 1 or multiple lookup tables.

So I broke my complicated type into two: one had two subtypes (for the two directions); the other had three subtypes (for 0, 1 and many tables). Which gave 6 different versions of the main routine (called “extend”), one for each combination of the two types.

This was a huge step. Suddenly “extend” (the main loop for calculating the checksum) became much simpler. The separation of logic was obvious. Everything “just made sense”.

4 - Types Are For Data

But still, I had duplication. A handful of details (like the amount of padding needed to align data with the accumulator) were being re-calculated in multiple places.

Looking more carefully, there were two sets of values, corresponding to whether the data were reflected are not (indicated by one of the two types above). So it made sense to place the values inside the types, calculating them once in the appropriate constructors.

At this point it was easy to see the parallel between standard OO design and Julia’s types / methods. Each type was like a class; these variables were instance attributes. But Julia’s approach made it easier to also handle the lookup tables, dispatching on both together.

5 - Efficiency Requires Typing

Now the code was looking pretty good. As a bonus, the six different functions that calculated the checksums (one for each combination of the two main types) had all arguments clearly typed.

This is critical for high performance Julia code - you want the compiler to be able to figure out the exact type of everything in the inner loop. And you want those types to be single, concrete values (not unions).

A side note - This is the great power of Julia. In high level routines you can write generic code like you would in, say, Python. And then, when you get down to the innermost loops, you can write code that is effectively C with a nice syntax. It’s not that the authors of Julia have found a way to magically make elegant, abstract, code perfectly efficient. They have not. Instead they have made a language which can seamlessly transition from elegant abstraction to low-level, painstaking detail.

6 - Writing Fast Code Hasn’t Changed Much

Continuing with the point above, there’s nothing unusual about writing fast code in Julia (at least most of the time; it’s a new language and some bugs are still being worked out, so you can see the occasional strangeness). For example:

Bounds checking has a cost, of course, but can be disabled with the @inbounds macro (assuming you know the code is correct).

macro (assuming you know the code is correct). Loop unrolling is easy with the @nexprs macro in Base.Cartesian (see the examples in CRC.jl).

macro in Base.Cartesian (see the examples in CRC.jl). Allocation and GC are not a good idea in the inner loop. Pre-allocate arrays as necessary.

Function calls are not free. Prefer iteration over recursion. Inlining of small functions works, mostly (though there are some related bugs which I suspect won’t be completely fixed til the 0.4 release).

In critical code, directly indexing arrays (with @inbounds ) seems to be slightly faster than the iterator protocol (it also makes typing the container easier).

) seems to be slightly faster than the iterator protocol (it also makes typing the container easier). Julia’s core library routines are very good. Smart people are putting a lot of effort into making them efficient.

There’s a useful (if somewhat clunky) graphical profiler - see the ProfileView package.

Summary

Pulling everything into one list:

Simplify. Simple, clear code can be made fast. Dispatch on types instead of using explicit conditionals. Untangle types - you can dispatch on combinations of types. Store data in types (they are Julia’s objects). The inner loop needs to be clearly typed. At a low level, efficient Julia code is much like C or Fortran.

Hope that helps!

Andrew