Neanderthal was already optimized for top CPU speed, and for even more speed on both AMD's and Nvidia's GPUs. You could write very concise Clojure code using the vector and matrix API, and also easily combine it with custom GPU code in ClojureCL.

One major thing was still left untapped though: Nvidia's CUDA-based libraries, which require Nvidia's proprietary, closed-source CUDA technology that is also tied to Nvidia hardware. Bad. Saint IGNUcius does not approve of this sinful technology and I am ashamed to indulge in this blasphemy. On the other hand, it gives us access to a ridiculously well optimized set of libraries, ranging from linear algebra to deep learning, that Nvidia offers at no charge. How fast is it? Let's see…

In the OpenCL engine tutorial, I did a basic exploration of the capabilities of the OpenCL-based Neanderthal engine, which is built on Cedric Nugteren's excellent open-source library CLBlast. It is amazingly fast. For example, it multiplies three 8192 × 8192 matrices (\(C_{8192\times8192} = \alpha A_{8192\times8192} \cdot B_{8192\times8192}\)) in 220 ms on an Nvidia GTX 1080.

Theoretically, multiplying an \(m \times k\) matrix by a \(k \times n\) matrix requires \(2\times m \times k \times n\) floating point operations. (This does not even count memory operations, but that's the problem of the implementer.) \((2 * 8192^3) \div 0.220\) is about \(5 \times 10^{12}\) floating point operations per second, or 5 TFLOPS, out of the 8.228 TFLOPS that the card is theoretically capable of. That's 60% utilization, which is quite impressive for a one-man team working part-time! Now let's try the same thing on the same hardware with the CUDA-based engine:
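The arithmetic above can be sketched in a few lines of plain Clojure. This is only an illustration: `gemm-tflops` and `peak-tflops` are hypothetical names introduced here, not part of Neanderthal, and the sketch assumes the \(2 \times m \times k \times n\) operation count and the 8.228 TFLOPS peak quoted in the text.

```clojure
;; Hypothetical helper, not part of Neanderthal: estimates the throughput
;; of an (m x k) by (k x n) matrix multiplication from its wall-clock time.
(defn gemm-tflops
  "TFLOPS achieved, assuming 2*m*k*n floating point operations."
  [m k n seconds]
  (/ (* 2.0 m k n) seconds 1e12))

;; Theoretical peak of the GTX 1080, as quoted in the text.
(def peak-tflops 8.228)

;; The 220 ms OpenCL measurement from the text.
(let [achieved (gemm-tflops 8192 8192 8192 0.220)]
  (println (format "%.2f TFLOPS, %.1f%% of peak"
                   achieved
                   (* 100.0 (/ achieved peak-tflops)))))
```

Plugging in any other measured time (in seconds) gives the corresponding estimate, so the same helper can be reused for other engines or matrix sizes.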

(require '[uncomplicate.clojurecuda.core :refer [synchronize!]])

(with-default-engine
  (let [cnt 8192]
    (with-release [gpu-a (cuge cnt cnt (range (* cnt cnt)))
                   gpu-b (copy gpu-a)
                   gpu-c (copy gpu-a)]
      (time
       (do (mm! 3 gpu-a gpu-b 2 gpu-c) ;; gpu-c = 3 * gpu-a * gpu-b + 2 * gpu-c
           (synchronize!) ;; Wait for the asynchronous mm! to finish
           gpu-c)))))

#CUGEMatrix[float, mxn:8192x8192, order:column, offset:0, ld:8192]

141 ms! \((2 * 8192^3) \div 0.141\) is 7.798 TFLOPS, almost at the specification maximum of the hardware! Nvidia did a really good job here, and the performance difference is, according to Cedric's experiments, even larger for smaller or non-square matrices. In Clojure land, we now have a choice between a great free OpenCL backend and an impressive proprietary Nvidia backend with unbeatable speed!
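To put the two measurements side by side, here is the arithmetic worked out from the figures quoted in this post (220 ms for the CLBlast-based engine, 141 ms for the CUDA-based engine, 8.228 TFLOPS theoretical peak). The rounded values are approximate:

```latex
\[
\frac{220\ \text{ms}}{141\ \text{ms}} \approx 1.56,
\qquad
\frac{7.798\ \text{TFLOPS}}{8.228\ \text{TFLOPS}} \approx 0.948
\]
```

So, on this single benchmark, the CUDA engine is about 1.56 times faster and reaches roughly 95% of the theoretical peak, compared to about 60% for the OpenCL engine.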