Update: Bill Gropp covers this topic in depth in a presentation at University of Illinois: Engineering for Performance in High-Performance Computing.

Two things happened recently that made me think my approach to performance may not be widely shared.

First, when I announced BufferBuilder on reddit, one early commenter asked "Did you run your Haskell through a profiler to help identify the slow parts of your code?" A fairly reasonable question, but no, I didn't. And the reason is that, if stuff shows up in a Haskell profiler, it's already going to be too slow. I knew, up front, that an efficient way to build up buffers is a bounds check followed by a store (in the case of a single byte) or copy (in the case of a bytestring). Any instructions beyond that are heat, not light. Thus, I started BufferBuilder by reading the generated assembly and building a mental model of how Haskell translates to instructions, not by running benchmarks and profilers. (Benchmarks came later.) Any "typical" CPU-bound Haskell program is going to spend most of its time in stg_upd_frame_info and stg_ap_*_info anyway. ;)

Next, on my local programming IRC channel, I shared a little trick I'd seen on Twitter for replacing a div with a shift: (i+1)%3 == (1<<i)&3 for i in [0, 2]. One person strenuously objected to the idea of using this trick. Paraphrasing, their argument went something like "the meaning of the code is not clear, so tricks like that should be left to the compiler, but it doesn't work for all values of i, so a compiler would never actually substitute the left for the right. Just don't bother." We went back and forth, after which it became clear that what the person REALLY meant was "don't use this trick unless you're sure it matters". And then we got to "you have to run a profiler to know it matters." I contend it's possible to use your eyes and brain to see that a div is on the critical path in an inner loop without fancy tools. You just have to have a rough sense of operation cost and how out-of-order CPUs work.

Over the years I have built a mental framework for thinking about performance beyond the oft-recommended "get the program working, then profile and optimize the hot spots". My mental framework comes from three sources:

Rico Mariani has written a lot about designing for performance so you aren't caught, late in the project, unable to hit your perf targets. That is, understand the machine and problem constraints, and sketch out on paper the solution so you can make sure it fits. Use prototypes as necessary.

I've always been interested in how our programs run on physical silicon. The Pentium Chronicles: The People, Passion, and Politics Behind Intel's Landmark Chips is an excellent history of the design of Intel's out-of-order P6 line.

I've been involved with too many projects run with the philosophy of "get it working, profile and optimize later". It's very easy for this strategy to fail, resulting in a program that's uniformly sluggish and uneconomical to fix.

Performance is Engineering

To hit your performance goals, you first need to define your goals. Consider what you're trying to accomplish. Some example performance goals:

Interactive VR on a mobile phone

Time between mouse click in menus and action taken is under 100 ms

Load a full 3D chat room in five seconds

First page of search results from a mountain in Hawaii in half a second

Latest experimental analyses in half a day

Also, understand why you want to hit those goals. Todd Hoff's excellent Latency is Everywhere and it Costs You Sales has several links to research showing how performance can affect your business metrics and customer satisfaction.

For BufferBuilder, our goal was to match the performance, in Haskell, of a naive C++ buffer building library.

Now that you have a goal in mind, and presumably you understand the problem you're trying to solve, there's one final piece to understand: the "fundamental constants of human-computer interaction", which split into a human half and a computer half.

Very roughly:

~16 ms - frame budget for interactive animation

100 ms - user action feels instantaneous

200 ms - typical human reaction time

500+ ms - perceptible delay

3+ seconds - perceived sluggishness

10+ seconds - attention span is broken

And on the computer:

1 cycle - latency of simple bitwise operations

a few - maximum number of independent instructions that can be retired per cycle

3-4 cycles - latency of L1 cache hit

~dozen cycles - latency of L2 cache hit

~couple dozen cycles - integer division on modern x86

couple hundred cycles - round-trip to main memory

50-250 us - round-trip to SSD

250-1000 us - in-network ethernet/tcp round-trip

10 ms - spinny disk seek

150 ms - IP latency between Australia and USA

Throughput numbers:

8.5 GB/s - memory bandwidth on iPhone 5

~100 MB/s - saturated gigabit ethernet connection

50-100 MB/s - spinny disk transfer speed

4.5 Mb/s - global average connection speed

3.2 Mb/s - average mobile bandwidth in USA

These numbers can be composed into higher-level numbers. For example:

1 ms - parsing 100 KB of JSON on a modern, native parser such as RapidJSON or sajson.

It's unlikely JSON parsing will become appreciably faster in the future - parsing is a frustrating, latency-bound problem for computers. "Read one byte. Branch. Read next byte. Branch."

While throughput numbers increase over time, latency has only inched downwards. Thus, on most typical programs, you're likely to find yourself latency-bound before being throughput-bound.


Designing for Performance

Once you've defined your performance goal, you need to make the numbers fit. If your goal is interactive animation, you can barely afford a single blocking round-trip to a spinny disk per frame. (Don't do that.) If your goal is an interactive AJAX application that feels instant from all around the world, you must pay VERY close attention to the number of IP round-trips required. Bandwidth, modulo TCP windowing, is usually available in spades - but accumulated, serial round-trips will rapidly blow your experience budget. If you want a WebGL scene to load in five seconds on a typical global internet connection, the data had better fit in (5s * 4.5 Mb/s) = 2.8 MB, minus initial round-trip time.

For BufferBuilder, since we knew our goal was to match (or at least come reasonably close to) C++ performance in Haskell, we didn't need a profiler. We knew that appending a single byte to a buffer could be minimally expressed with a load of the current output pointer, a (predicted) bounds check, and a write of the new output pointer. Appending a buffer to another is a (predicted) bounds check, a memcpy, and updating the output pointer.

A profiler is not needed to achieve the desired performance characteristics. An understanding of the problem, an understanding of the constraints, and careful attention to the generated code is all you need.

We can apply this approach at almost any scale. Want a rich website that uses less than 100 MB of RAM yet has high-resolution images, gradients, transitions, and effects? Build prototypes to measure the costs of each component, build up a quantitative understanding, and make the numbers fit. (Of course, don't forget to actually measure real memory consumption on the resulting page, lest you be surprised by something you missed.)

Want to design a mobile application that goes from tap to displayed network data in 200 ms? Well, good luck, because Android devices have touchscreen latency of 100+ ms. (Edit: mdwrigh2 on Hacker News points out that this data is out of date for modern Android handsets.) And it can take a second or more to spin up the radio!

To summarize, given a rough understanding of the latencies of various layers of a system, and knowledge of the critical path through the code, simply sum the numbers together and you should have a close approximation to the total latency. Periodically double-check your understanding against real measurements, of course. :)

FAQ

Are you saying profilers are useless?!

Absolutely not. Profilers are awesome. I have spent a great deal of time with the Python profiler. I particularly love AMD's old CodeAnalyst tool. (Not the new CodeXL thing.) Sampling profilers in general are great. I've also built a bunch of intrusive profilers for various purposes.

But always keep in mind that profilers are for exploring and learning something new. By the time you've built your application, you may not actually be able to "profile your way to success".

Should I unilaterally apply this advice, irrespective of context?

Of course not. A large number of programs, on a modern CPU, trivially fit in all the performance budgets you care about. In these situations, by all means, write O(n) loops, linked lists, call malloc() inside every function, use Python, whatever. At that point your bottleneck is human development speed, so don't worry about it.

But continue to develop an understanding of the costs of various things. I remember a particular instance when someone replaced hundreds of <img> tags on a page with <canvas> for some kind of visual effect. Bam, the page became horrifically slow and consumed an enormous amount of memory. Browsers are very smart about minimizing working set in the presence of lots of images (flushing decoded JPEGs from memory until they're in view is one possible technique), but <canvas> is freeform and consumes at least width*height*4 bytes of pagefile-backed memory.

What about algorithmic complexity?

Improving algorithmic complexity can be a significant win. Especially when you're Accidentally Quadratic. Reach for those wins first. But once you get into O(n) vs. O(lg n), you're almost certainly limited by constant factors.

In any context, you should aim for the biggest wins first. Let's say you're writing a web service that talks to a database, fetching info for 100 customers. By far the biggest optimization there is to run one query to fetch 100 customers (as a single round-trip can be 1-10 ms) instead of 100 queries of one customer each. Whenever someone says "My new web service is taking 1.5 seconds!" I can almost guarantee this is why. Both approaches are technically O(n), but the query issue overhead is a HUGE constant factor.

In interviews I sometimes ask candidates about the performance difference between two algorithms that have the same algorithmic complexity. There is no precisely correct answer to my question, but "the execution time is the same" is wrong. Constant factors can be very large, and optimizing them can result in multiple-orders-of-magnitude improvements.

But Knuth said...!