Machine learning is often computationally bounded, which means that the ability to write fast code becomes important if you ever want to implement a machine learning algorithm. Basic tactical optimizations are covered well elsewhere, but I haven’t seen a reasonable guide to higher-level optimizations, which are the most important in my experience. Here are some of the higher-level optimizations I’ve often found useful.

1. Algorithmic Improvement First. This is hard, but it is the most important consideration, and typically yields the most benefits. Good optimizations here are publishable. In the context of machine learning, you should be familiar with the arguments for online vs. batch learning.

2. Choice of Language. There are many arguments about the choice of language. Sometimes you don’t have a choice when interfacing with other people. Personally, I favor C/C++ when I want to write fast code. This (admittedly) makes me a slower programmer than when using higher-level languages. (Sometimes I prototype in OCaml.) Choosing the wrong language can result in large slowdowns.

3. Avoid Pointer-Based Representations. The way you represent information in your program can have a dramatic impact on performance. My rule of thumb is “for fast programs, it’s all arrays in the end”. As an example, consider maps. STL provides map (a tree-based data structure) and hash_map (an array-of-pointers data structure). Where a hash_map works, it’s common to observe an order-of-magnitude improvement in performance over a map. (Sometimes you must futz with the hash function to observe this.) The Google dense_hash_map replaces the array of pointers with a plain old array and (unsurprisingly) is even faster. What’s fundamentally happening here is locality: dereferencing pointers is a very expensive operation on modern computers because the CPU runs much faster than the latency to RAM. By converting everything into an array, you compute rather than dereference the location of data. Converting things into an array is not always easy, but it is often possible with a little bit of thought and care.

4. Cached Parsing. Fast algorithms are required for large quantities of data. Unfortunately, the computational process of reading and parsing the data is often intensive. By caching parsed examples in a machine representation format (either in RAM or on disk), a substantial performance boost is achievable. This comes from two sources:
   1. You avoid the work of parsing again.
   2. The machine representation can be more concise, implying improved system caching effects.

5. Deshuffle. Avoid copying information. It’s easy to end up copying data from one place to another. Commonly, the best implementation avoids this, which has strong implications for representation choice.

6. Write Less Code. There are many reasons to write less code where you can. For the purposes of optimization, having less code in the bottleneck is the surest way to reduce work in the bottleneck. There are lots of techniques for doing this. Some of them are purely syntactic transformations, while others involve redesigning the algorithm itself.

7. Don’t Trust Libraries. In particular, don’t trust library calls inside the bottleneck. It’s often the case that a library function is general purpose while you can get away with a much faster hand-crafted (and inlined) function.

8. Buffered Disk I/O. There is a huge difference in performance between reading from or writing to the disk directly and doing so through a buffering layer. In general, it’s not obvious which I/O library functions buffer properly. You can experiment, or implement your own buffering system. C++ I/O libraries seem to handle this better than C libraries in general.

9. Amortization. Amortization is a very powerful technique for algorithm optimization. The basic idea is to always make sure that one computation (a secondary one) is amortized by another (your primary computation).

10. Optimize While You Wait. There is always a question about how much time should be spent on optimization vs. other aspects of programming. A reasonable rule of thumb is to spend time on optimization when you are waiting for the program to finish running. This is amortization, applied to yourself.
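
To make the “it’s all arrays in the end” point concrete, here is a minimal sketch of a hash table stored in one flat array with linear probing. It is not the dense_hash_map implementation; the class name FlatWeightTable, the power-of-two capacity requirement, and the mixing function are all assumptions chosen for illustration. The point is only that a lookup computes a slot index rather than following a pointer to a separately allocated node.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flat hash table: keys and values live in one contiguous array,
// so a lookup computes a slot index instead of chasing a pointer per node.
class FlatWeightTable {
public:
    // Assumption: capacity is a power of two and exceeds the number of keys stored.
    explicit FlatWeightTable(std::size_t capacity)
        : slots_(capacity), mask_(capacity - 1) {
        assert(capacity > 0 && (capacity & (capacity - 1)) == 0);
    }

    // Returns a reference to the weight for `key`, inserting 0.0f if absent.
    float& at(std::uint64_t key) {
        std::size_t i = mix(key) & mask_;
        while (slots_[i].used && slots_[i].key != key) {
            i = (i + 1) & mask_;  // linear probing: the next slot is adjacent in memory
        }
        if (!slots_[i].used) {
            // Keep at least one empty slot so that probing always terminates.
            assert(size_ + 1 < slots_.size());
            slots_[i].used = true;
            slots_[i].key = key;
            slots_[i].value = 0.0f;
            ++size_;
        }
        return slots_[i].value;
    }

    // Returns a pointer to the stored weight, or nullptr if `key` is absent.
    const float* find(std::uint64_t key) const {
        std::size_t i = mix(key) & mask_;
        while (slots_[i].used) {
            if (slots_[i].key == key) return &slots_[i].value;
            i = (i + 1) & mask_;
        }
        return nullptr;
    }

    std::size_t size() const { return size_; }

private:
    struct Slot {
        std::uint64_t key = 0;
        float value = 0.0f;
        bool used = false;
    };

    // Simple bit mixer so that nearby keys do not land in nearby slots.
    static std::size_t mix(std::uint64_t k) {
        k ^= k >> 33;
        k *= 0xff51afd7ed558ccdULL;
        k ^= k >> 33;
        return static_cast<std::size_t>(k);
    }

    std::vector<Slot> slots_;
    std::size_t mask_;
    std::size_t size_ = 0;
};
```

A caller would write something like `table.at(feature_id) += update;` in the inner loop. The sketch never resizes or deletes, which a real table would need; whether it actually beats a library map on your data is something to measure rather than assume.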
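
The cached parsing idea can be sketched as follows. The text format (“label index:value index:value ...”), the struct names, and the binary layout are assumptions made up for this example, not a format the post prescribes. The slow path parses text once; the fast path reads and writes the machine representation directly through a C++ stream, which also supplies the buffering layer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical machine representation of one sparse example.
struct Feature { std::uint32_t index; float value; };
struct Example { float label; std::vector<Feature> features; };

// Slow path: parse a text line of the assumed form "label index:value ...".
// std::stoul and std::stof throw on malformed numbers; a real parser would handle that.
Example parse_line(const std::string& line) {
    std::istringstream in(line);
    Example ex{};
    in >> ex.label;
    std::string tok;
    while (in >> tok) {
        std::size_t colon = tok.find(':');
        if (colon == std::string::npos) continue;  // skip tokens without a colon
        Feature f;
        f.index = static_cast<std::uint32_t>(std::stoul(tok.substr(0, colon)));
        f.value = std::stof(tok.substr(colon + 1));
        ex.features.push_back(f);
    }
    return ex;
}

// Write the parsed example in raw binary form: label, feature count, features.
// The cache is only valid on the machine (endianness, float format) that wrote it.
void write_cached(std::ostream& out, const Example& ex) {
    assert(ex.features.size() <= 0xFFFFFFFFu);
    std::uint32_t n = static_cast<std::uint32_t>(ex.features.size());
    out.write(reinterpret_cast<const char*>(&ex.label), sizeof ex.label);
    out.write(reinterpret_cast<const char*>(&n), sizeof n);
    out.write(reinterpret_cast<const char*>(ex.features.data()),
              static_cast<std::streamsize>(n * sizeof(Feature)));
}

// Fast path: read one example back with no text parsing at all.
// Returns false at end of cache or on a truncated record.
bool read_cached(std::istream& in, Example& ex) {
    std::uint32_t n = 0;
    if (!in.read(reinterpret_cast<char*>(&ex.label), sizeof ex.label)) return false;
    if (!in.read(reinterpret_cast<char*>(&n), sizeof n)) return false;
    ex.features.resize(n);
    in.read(reinterpret_cast<char*>(ex.features.data()),
            static_cast<std::streamsize>(n * sizeof(Feature)));
    return static_cast<bool>(in);
}
```

On the first pass over a dataset you would call parse_line and write_cached for each line, writing to a std::ofstream opened with std::ios::binary; later passes loop on read_cached over a std::ifstream instead of touching the text at all.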

In all program optimization, it is critical to know where the bottleneck is, and optimize preferentially there. This is fairly tricky on a modern computer because there are many ways that hidden latency can creep in. Tools like gprof and Valgrind can be helpful here, but there is no substitute for having a basic understanding of how a computer works.
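
When a full profiler is inconvenient, a coarse way to check where time goes is to accumulate wall-clock time per section. The helper below is a hypothetical sketch (the name ScopedTimer is my own), not a replacement for gprof or Valgrind: it only tells you how long a scope took, not why.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: adds the wall-clock seconds spent in a scope to a counter.
class ScopedTimer {
public:
    explicit ScopedTimer(double& total_seconds)
        : total_(total_seconds), start_(std::chrono::steady_clock::now()) {}

    ~ScopedTimer() {
        std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start_;
        assert(elapsed.count() >= 0.0);  // steady_clock is monotonic
        total_ += elapsed.count();
    }

    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    double& total_;
    std::chrono::steady_clock::time_point start_;
};
```

Wrapping the parsing step and the learning step in separate timers, for example `{ ScopedTimer t(parse_seconds); /* parse */ }`, gives a first rough answer to which one is the bottleneck. The timer itself has a cost, so it belongs around coarse sections rather than inside the innermost loop.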