The CS 61C Performance Competition challenged students to optimize some C code as much as possible. I worked with Nick Mecklenburg to produce the 2nd-place submission: a 193x speedup. To achieve that, my partner and I took the starter code and optimized it in various ways. We did some simple, easy-to-understand optimizations first:

Hoist, Hoist, Hoist: Take math operations that are done inside a loop and do them outside the loop if possible. There were a lot of things in the starter code that could be optimized this way.

Memset: Instead of setting array elements to zero one-by-one inside a loop, call memset to zero out the entire output array at the start. This lets us…

Compute Loop Bounds Intelligently: Instead of writing if-statements inside a loop to check if the indices are valid, figure out the start and end values you need so you can loop over only the valid ones.

These three tricks got us a 2.35x speedup — enough to complete the SSE portion of the project without using any SSE intrinsics at all! Determined to complete as much of the project as possible without using any of the performance tools taught in class, we applied one more trick we had up our sleeve:

Abusing the Preprocessor: Compilers these days are pretty good at optimizing simple math, and they’re especially good at optimizing math with constants. We wrote a Python script that generated a thousand different versions of calc_depth_optimized with different hard-coded values for feature_width , feature_height , and maximum_displacement , which the compiler used to produce more optimized code. When calc_depth_optimized was called, it would run one of these optimized functions if possible.

That trick and SSE got us to 4x, and OpenMP got us up to 20x. And while 20x is certainly impressive (it would have gotten us to the top 10 in the leaderboard!), we knew we had to do more to truly earn those extra credit points. Luckily, I had one final optimization idea:

The Tonya Harding Solution: The benchmark program works by calling the optimized function, calling the naive function, and comparing the two times. And this gave me a truly devilish idea. I added some code to calc_depth_optimized that created a child process. That child process would wait for calc_depth_naive to start running, then send a SIGUSR1 signal to the benchmark process. That signal would interrupt calc_depth_naive and jump to a special signal handler function I’d installed:

void our_handler(int signal) { // if you can’t win the race, shoot your opponent in the legs sleep(image_size * 4 / 10000); }