In the previous post, I showed that in some cases template parameters can help the compiler optimize CUDA kernels. Unfortunately, this approach has several issues. First of all, it's impossible to provide template specializations for every possible value of a non-type template argument. For example, the block size for sparse matrix-vector multiplication (SpMV) could be an application input parameter. Typically, the code author specializes the SpMV function for a few expected block sizes, which leads to a slowdown for any other block size and inflates the application binary. The second issue is the requirement on non-type template arguments themselves: they must be constant expressions, evaluated at compile time. In this post, I'm going to discuss CUDA just-in-time (JIT) compilation.

NVRTC

NVIDIA provides a library for CUDA runtime compilation called NVRTC. With it, it's possible to get a PTX string from a CUDA C++ source string. The PTX can then be loaded through the CUDA Driver API. Using NVRTC has some advantages over invoking NVCC at runtime. First of all, NVRTC doesn't require NVCC on the user side. The second advantage is lower compilation overhead. Here is an example of PTX generation.

Listing 1 — PTX generation with NVRTC
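The original listing isn't reproduced here, so below is a minimal sketch of what it covers, using the standard NVRTC calls (`nvrtcCreateProgram`, `nvrtcCompileProgram`, `nvrtcGetPTX`). The `copy` kernel and the `compute_60` target are placeholders of my own choosing.

```cpp
#include <nvrtc.h>
#include <cstdio>
#include <string>

// A trivial kernel; extern "C" avoids C++ name mangling, so the function
// can later be looked up by its plain name.
const char *kernel_source = R"(
extern "C" __global__ void copy(const float *in, float *out, unsigned int n)
{
  const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = in[i];
}
)";

std::string compile_to_ptx()
{
  nvrtcProgram program;
  nvrtcCreateProgram(&program, kernel_source, "copy.cu", 0, nullptr, nullptr);

  const char *options[] = { "--gpu-architecture=compute_60" };
  if (nvrtcCompileProgram(program, 1, options) != NVRTC_SUCCESS)
  {
    // On failure, the program log contains the compiler diagnostics.
    size_t log_size {};
    nvrtcGetProgramLogSize(program, &log_size);
    std::string log(log_size, '\0');
    nvrtcGetProgramLog(program, &log[0]);
    std::fprintf(stderr, "%s\n", log.c_str());
  }

  size_t ptx_size {};
  nvrtcGetPTXSize(program, &ptx_size);
  std::string ptx(ptx_size, '\0');
  nvrtcGetPTX(program, &ptx[0]);

  nvrtcDestroyProgram(&program);
  return ptx;
}
```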

The generated PTX can then be loaded and launched through the CUDA Driver API.

Listing 2 — Generated PTX launch
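Again as a hedged sketch rather than the original listing: the PTX is loaded with `cuModuleLoadDataEx`, and the `copy` kernel from above is launched with `cuLaunchKernel`. A CUDA context is assumed to exist already (created, for example, with `cuInit` and `cuCtxCreate`).

```cpp
#include <cuda.h>
#include <string>

void launch_copy(const std::string &ptx, CUdeviceptr in, CUdeviceptr out, unsigned int n)
{
  CUmodule module {};
  CUfunction kernel {};

  // Load the PTX and look the kernel up by its extern "C" name.
  cuModuleLoadDataEx(&module, ptx.c_str(), 0, nullptr, nullptr);
  cuModuleGetFunction(&kernel, module, "copy");

  // Arguments are passed as an array of untyped pointers; nothing checks
  // that their number or types match the kernel signature.
  void *args[] = { &in, &out, &n };

  const unsigned int block_size = 256;
  const unsigned int grid_size = (n + block_size - 1) / block_size;
  cuLaunchKernel(kernel,
                 grid_size, 1, 1,  // grid dimensions
                 block_size, 1, 1, // block dimensions
                 0, nullptr,       // shared memory, stream
                 args, nullptr);

  cuCtxSynchronize();
  cuModuleUnload(module);
}
```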

As the example shows, the kernel call isn't type-safe: there is no relation between the kernel parameters and the call-site arguments at compile time. To handle these issues I've written an NVRTC wrapper library called CUDA-JIT.

CUDA-JIT

With CUDA-JIT, PTX generation and kernel launch become simpler. There are several advantages over direct PTX generation. First of all, the kernel launch is now type-safe: the code won't compile unless the launch matches the kernel signature.

Listing 3 — CUDA-JIT basic kernel
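I won't guess the library's exact API here; the sketch below is purely hypothetical and only illustrates how such type safety can be achieved: the parameter types become part of the wrapper's type, so a mismatched launch is rejected by the host compiler.

```cpp
#include <cuda.h>
#include <string>

// Hypothetical wrapper, not CUDA-JIT's real interface. The variadic
// parameter pack fixes the kernel argument types at compile time.
template <typename... Args>
class jit_kernel
{
public:
  // Compiles the source with NVRTC and loads it as in Listings 1 and 2
  // (body omitted for brevity).
  jit_kernel(const std::string &source, const std::string &name);

  void launch(unsigned int grid_size, unsigned int block_size, Args... args)
  {
    // The pack expansion builds the untyped argument array, but only from
    // arguments that already passed the compile-time type check.
    void *params[] = { &args... };
    cuLaunchKernel(function, grid_size, 1, 1, block_size, 1, 1,
                   0, nullptr, params, nullptr);
  }

private:
  CUfunction function {};
};

// jit_kernel<const float *, float *, unsigned int> copy(source, "copy");
// copy.launch(grid_size, block_size, d_in, d_out, n); // OK
// copy.launch(grid_size, block_size, d_in, d_out);    // compile-time error
```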

With this approach, syntax highlighting is available for the CUDA kernel source. I've integrated inja as a template engine: every variable inside {{ }} is replaced with the value passed to the compile call.

Listing 4 — JIT BCSR
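The actual BCSR SpMV listing lives in the repository; to keep this sketch short and certainly correct, the template below demonstrates the same mechanics on a plain block reduction instead. `block_size` is a name I chose for the inja variable.

```cpp
#include <inja/inja.hpp>
#include <string>

// {{ block_size }} is substituted by inja before the source ever reaches
// NVRTC, so both the shared memory size and the initial stride are literal
// constants in the compiled code.
const char *kernel_template = R"(
extern "C" __global__ void block_reduce(const float *in, float *out)
{
  __shared__ float cache[{{ block_size }}];
  cache[threadIdx.x] = in[blockIdx.x * {{ block_size }} + threadIdx.x];
  __syncthreads();

  for (unsigned int stride = {{ block_size }} / 2; stride > 0; stride /= 2)
  {
    if (threadIdx.x < stride)
      cache[threadIdx.x] += cache[threadIdx.x + stride];
    __syncthreads();
  }

  if (threadIdx.x == 0)
    out[blockIdx.x] = cache[0];
}
)";

std::string render_kernel(unsigned int block_size)
{
  nlohmann::json params;
  params["block_size"] = block_size;
  // The rendered string is then compiled with NVRTC as in Listing 1.
  return inja::render(kernel_template, params);
}
```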

With JIT compilation, it's possible to allocate shared memory of dynamic size without passing a separate argument to the kernel launch. I've also extracted the initial stride calculation from the example above. The figure below shows that the NVCC compiler hasn't managed to optimize the initial stride calculation on its own.

Figure 1 — JIT-compiled code vs template specialization

We could go further and add conditions to the kernel template to avoid loading the row pointer array for matrices with a fixed row length.
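As a hedged illustration of the idea (the variable names `fixed_row_length` and `row_length` are mine, not necessarily CUDA-JIT's), inja's {% if %} blocks can drop the row pointer loads entirely:

```cpp
// When fixed_row_length is true, the rendered kernel never touches row_ptr;
// the row bounds are computed from a constant instead of loaded from memory.
const char *row_bounds_template = R"(
  {% if fixed_row_length %}
  const unsigned int row_begin = row * {{ row_length }};
  const unsigned int row_end   = row_begin + {{ row_length }};
  {% else %}
  const unsigned int row_begin = row_ptr[row];
  const unsigned int row_end   = row_ptr[row + 1];
  {% endif %}
)";
```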

Besides optimization, the JIT approach to kernel code could bring some platform independence into the GPU computing world: the library could translate general kernel code into OpenCL to run it on AMD GPUs. I also think it would be interesting to use C++ JIT compilation to test kernels or measure their performance on the CPU without code duplication. Either way, this approach seems simpler to me than SYCL. The CUDA-JIT library is available on GitHub. Finally, it's possible not to rely on compiler optimizations at all and unroll loops with inja, as in the sketch below.
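A minimal sketch of the unrolling idea, assuming the host precomputes the reduction strides and passes them to inja as a list (the strides here correspond to a block size of 256):

```cpp
#include <inja/inja.hpp>
#include <string>

// Each loop iteration is emitted as straight-line code in the rendered
// source, so no runtime loop (and no #pragma unroll) is needed.
const char *unrolled_template = R"(
  {% for stride in strides %}
  if (threadIdx.x < {{ stride }})
    cache[threadIdx.x] += cache[threadIdx.x + {{ stride }}];
  __syncthreads();
  {% endfor %}
)";

std::string render_unrolled()
{
  nlohmann::json params;
  params["strides"] = { 128, 64, 32, 16, 8, 4, 2, 1 };
  return inja::render(unrolled_template, params);
}
```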