Recently I decided to give Julia a try. Julia is a relatively new scientific programming language that is free, open source, and fast. Besides the huge appeal of a performant high level scripting language, Julia intrigued me because the language actually has a reasonable type system and seems to be well-designed (unlike Matlab, which Julia is quite similar to). I recently implemented low-rank matrix approximations (in Matlab) for my numerical linear algebra class, so I figured reimplementing part of the algorithm in Julia would be a good way to get my feet on the ground in Julia land.

Rather than implementing all of the routines required to compute the low-rank approximation to a matrix, I just implemented the QR decomposition of a tridiagonal matrix using Householder reflections. To learn more about this algorithm, I highly recommend checking out section 3.4.1 of "Applied Numerical Linear Algebra" by Demmel. I was quite interested in the performance of Julia, so I iteratively optimized my algorithm and compared each version to an implementation of the same algorithm using Python and numpy.

Super-naive QR decomposition This extremely inefficient implementation mainly serves to illustrate how to compute the QR decomposition of a matrix using Householder projections. function td_qr1 ( T ) # naive as possible implementation Qt = eye ( size ( T ) ... ) R = copy ( T ) I = copy ( Qt ) for i in 1 :( size ( R , 1 ) - 1 ) u = [ zeros ( i - 1 ), householder ! ( R [ i : end , i ])] P = I - 2 .* u * u ' R = P * R Qt = P * Qt end Qt ' , R end In iteration $i$ of the loop, the Householder projection $P$ makes elements below the diagonal of column $i$ equal to 0. This leads to $R$ being upper triangular by the end of the loop. Note that this loop explicitly forms the full $P$ matrix in memory. This leads to the poor performance of this implementation: ~0.31s for 200x200 matrix. We can slightly optimize this implementation by noting that P * R == (I - 2 .* u * u') * R == R - 2 .* u * u' * R (and the same logic applies for Qt ). This lets us replace P = I - 2 .* u * u ' R = P * R Qt = P * Qt with M = 2 .* u * u ' R -= M * R Qt -= M * Qt This very small "optimization" caused no significant change in run-time.

Cutting down the number of FLOPs Here we'll take advantage of the tridiagonal form of input matrix $T$. To recall, a matrix is tridiagonal if it is 0 everywhere except for its superdiagonal, main diagonal, and subdiagonal. Because of the tridiagonal form of $T$, the column vector $u$ in the above iterations only contains 2 non-zero elements (the diagonal and the subdiagonal). This means $P = I - 2uu^T$ is equal to $I$ on all rows except for 2. Thus, on iteration $i$ of the algorithm, we only need to multiply rows $i$ and $i+1$ of $R$ and $Q^T$. Here is this optimization implemented: function td_qr3 ( T ) Qt = eye ( size ( T ) ... ) R = copy ( T ) for i in 1 :( size ( R , 1 ) - 1 ) # now M is 2x2 u = householder ! ( R [ i : i + 1 , i ]) M = 2 .* u * u ' # P = I - M # only operate on rows i and i+1 R [ i : i + 1 , :] -= M * R [ i : i + 1 , :] Qt [ i : i + 1 , :] -= M * Qt [ i : i + 1 , :] end Qt ' , R end This optimization gives huge performance gains both in terms of forming the matrix $M$ (as it is just $2 \times 2$ rather than $n \times n$) as well as the matrix multiplications by $R$ and $Q^T$. This implementation takes only ~0.005s for 200x200 matrix, making it ~23 times faster than the previous implementation. As the code just became much faster, all future benchmarking will be done with 1500x1500 tridiagonal matrices. This implementation takes ~0.36s for 1500x1500 matrix. Let's take even more advantage of the tridiagonal structure of $T$. Our previous optimization only depended on the fact that $T$ was zero beneath its first subdiagonal, but we can still improve our performance by exploiting that $T$ is also zero above its first superdiagonal. Namely, on iteration $i$ of the main loop, only the first $i+2$ columns have at least one non-zero element in rows $i$ and $i+1$. This means on iteration $i$ we only need to deal with exactly 2 rows and $i+2$ columns of $R$ and $Q^T$. Here's what the updated code looks like: function td_qr4 ( T ) Qt = eye ( size ( T ) ... ) R = copy ( T ) for i in 1 :( size ( R , 1 ) - 1 ) u = householder ! ( R [ i : i + 1 , i ]) M = 2 .* u * u ' # only multiply the columns we need to multiply upper = min ( i + 2 , size ( R , 2 )) R [ i : i + 1 , 1 : upper ] -= M * R [ i : i + 1 , 1 : upper ] Qt [ i : i + 1 , 1 : upper ] -= M * Qt [ i : i + 1 , 1 : upper ] end Qt ' , R end Profiling this code gives ~0.19s for 1500x1500 matrix.

A brief Python interlude For comparision purposes, I implemented the exact same algorithm in Python with numpy. Here's the code: def tridiag_qr ( T ): R = T . copy () Qt = np . eye ( T . shape [ 0 ]) for i in xrange ( T . shape [ 0 ] - 1 ): u = householder ( R [ i : i + 2 , i ]) M = np . outer ( u , u ) R [ i : i + 2 , :( i + 3 )] -= 2 * np . dot ( M , R [ i : i + 2 , :( i + 3 )]) Qt [ i : i + 2 , :( i + 3 )] -= 2 * np . dot ( M , Qt [ i : i + 2 , :( i + 3 )]) return Qt . T , R One small difficulty I had with this translation was the difference in indexing between Julia and Python. Slicing in Python does not include the end index, but slicing in Julia does. Profiling this codes gives ~0.16s for 1500x1500 matrix (edit: This previously read 0.25s, that was for numpy.linalg.qr). This is quite similar to the 0.19s from the equivalent Julia code, and this is not surprising due to most of the run-time being spent in BLAS calls. I do not know of anyway to speed up this Python code (without using either another language or an external library). The optimizations that I make to the Julia code below would not work with Python due to the lack of JIT (and relatively poor performance) of Python.

JIT == Awesome I've heard that most linear algebra libraries are particularly well optimized for very rectangular matrices (such as the $2 \times n$ matrices we are using) and also that unrolling loops often speeds up Julia code due to Julia's JIT. This convinced me that there might be some performance gains if I unroll the matrix multiplication. Here's the new code: function td_qr5 ( T ) Qt = eye ( size ( T ) ... ) R = copy ( T ) for i in 1 :( size ( R , 1 ) - 1 ) u = householder ! ( R [ i : i + 1 , i ]) M = 2 .* u * u ' # inline the multiplication for j in 1 : min ( i + 2 , size ( R , 2 )) tmp = M [ 1 , 1 ] .* R [ i , j ] + M [ 1 , 2 ] .* R [ i + 1 , j ] R [ i + 1 , j ] -= M [ 2 , 1 ] .* R [ i , j ] + M [ 2 , 2 ] .* R [ i + 1 , j ] R [ i , j ] -= tmp tmp = M [ 1 , 1 ] .* Qt [ i , j ] + M [ 1 , 2 ] .* Qt [ i + 1 , j ] Qt [ i + 1 , j ] -= M [ 2 , 1 ] .* Qt [ i , j ] + M [ 2 , 2 ] .* Qt [ i + 1 , j ] Qt [ i , j ] -= tmp end end Qt ' , R end Besides the hopefully optimized machine code the JIT is generating, inlining the matrix multiplication also allowed me to fuse the loops required for the two seperate matrix multiplications. The profiler confirms that Julia's JIT did a good job, as this implementation takes ~0.05s for a 1500x1500 matrix. This is a 4x speed increase just from unrolling a loop!

Final step: Cache optimality Julia arrays are stored in column-major order. This means that A[i, j] is directly next to A[i+1, j] in memory and that we want to iterate over columns rather than rows whenever possible to take full advantage of CPU caching. In both the computation of $R$ and $Q^T$, I was iterating over rows. However, I really want $Q$, not $Q^T$. Thus I can have the double-win of getting to iterate over columns rather than rows, and not needing to compute transpose $Q^T$ to get $Q$ after finishing the loop. Here's the new code: function td_qr6 ( T ) Q = eye ( size ( T ) ... ) R = copy ( T ) for i in 1 :( size ( R , 1 ) - 1 ) u = householder ! ( R [ i : i + 1 , i ]) M = 2 .* u * u ' for j in 1 : min ( i + 2 , size ( R , 2 )) tmp = M [ 1 , 1 ] .* R [ i , j ] + M [ 1 , 2 ] .* R [ i + 1 , j ] R [ i + 1 , j ] -= M [ 2 , 1 ] .* R [ i , j ] + M [ 2 , 2 ] .* R [ i + 1 , j ] R [ i , j ] -= tmp # now cache optimal and no need to transpose tmp = M [ 1 , 1 ] .* Q [ j , i ] + M [ 1 , 2 ] .* Q [ j , i + 1 ] Q [ j , i + 1 ] -= M [ 2 , 1 ] .* Q [ j , i ] + M [ 2 , 2 ] .* Q [ j , i + 1 ] Q [ j , i ] -= tmp end end Q , R end Benchmarking this code shows each iteration takes ~0.03 seconds for a 1500x1500 matrix. This is about ~2x as fast as without cache optimizations, and about 6x as fast as td_qr4, which performs the exact same math.