Working with OpenCL and Cuda in Nim

Date: May 6, 2018, by Mamy André-Ratsimbazafy

Arraymancer is a tensor library I’m writing from the ground up in Nim. Cuda support was added in v0.3 last December, and I have just released v0.4 with OpenCL support.

I’d like to share a bit of my experience working with OpenCL through Nim. First of all, you have to know that none of the big guys (Google TensorFlow, Facebook PyTorch, Apache/Amazon MXNet, Microsoft CNTK or even Intel/AMD) has first-class OpenCL support.

Why? Probably because Nvidia provides superb tools and documentation for framework developers. Also, Cuda can leverage a few C++ facilities like generics and function objects, which I use heavily for generic code.

For example, in Nim+Cuda I define element-wise operations like the following and pass them to a higher-order function that applies them element-wise on 3 tensors:

template cuda_binary_op(op_name, op_symbol: string) =
  {.emit: ["""
  template<typename T>
  struct """, op_name, """{
    __device__ __forceinline__ void operator()(
      T * __restrict__ dst,
      const T * __restrict__ A,
      const T * __restrict__ B){
        *dst = __ldg(A)""", op_symbol, """ __ldg(B);
    }
  };
  """].}

You can see here the advantage of C++: typename T to templatize over int/float/double, and function objects for cleaner higher-order code. You can also see that Nim can directly inline C++ code with emit, and I even templatize the operation name.

Now what about OpenCL? Unfortunately C doesn’t offer anything similar and requires a lot of boilerplate. The official C++ alternative, SYCL, is still very experimental, and I am not sure how well it works on actual GPUs.

However, thanks to Nim metaprogramming, squashing the C boilerplate is super easy. Here is an example kernel to compute C = A op B:

template gen_cl_apply3*(kern_name, ctype, op: string): string =
  opencl_getIndexOfElementID() & """
  __kernel
  void """ & kern_name & """(const int rank,
                 const int len,
                 __global const int * restrict dst_shape,
                 __global const int * restrict dst_strides,
                 const int dst_offset,
                 __global """ & ctype & """ * restrict const dst_data,
                 __global const int * restrict A_shape,
                 __global const int * restrict A_strides,
                 const int A_offset,
                 __global const """ & ctype & """ * restrict const A_data,
                 __global const int * restrict B_shape,
                 __global const int * restrict B_strides,
                 const int B_offset,
                 __global const """ & ctype & """ * restrict const B_data)
  {
    // Grid-stride loop
    for (int elemID = get_global_id(0);
         elemID < len;
         elemID += get_global_size(0)) {
      const int dst_real_idx = opencl_getIndexOfElementID(rank, dst_shape, dst_strides, dst_offset, elemID);
      const int A_real_idx = opencl_getIndexOfElementID(rank, A_shape, A_strides, A_offset, elemID);
      const int B_real_idx = opencl_getIndexOfElementID(rank, B_shape, B_strides, B_offset, elemID);

      dst_data[dst_real_idx] = A_data[A_real_idx] """ & op & """ B_data[B_real_idx];
    }
  }
  """

And write a few generic lines of code to deal with the data on the device, especially opencl_getIndexOfElementID, which converts foo[1, 2, 3] into foo.data[456] depending on the tensor's shape and strides.
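To make the index conversion concrete, here is a minimal CPU-side sketch of the same stride arithmetic the kernel helper performs. The name and signature are illustrative only, not Arraymancer's actual implementation:

```nim
# Illustrative sketch (not Arraymancer's real code): given a tensor's
# shape, strides and offset, compute the position in the flat data
# buffer of the element whose row-major linear id is `elemID`.
func getIndexOfElementID(rank: int,
                         shape, strides: openArray[int],
                         offset, elemID: int): int =
  result = offset
  var remainder = elemID
  # Walk the axes from innermost (last) to outermost (first):
  # the coordinate along each axis is `remainder mod shape[axis]`.
  for axis in countdown(rank - 1, 0):
    result += (remainder mod shape[axis]) * strides[axis]
    remainder = remainder div shape[axis]

# A contiguous 2x3x4 tensor has row-major strides [12, 4, 1]:
# element [1, 2, 3] lives at 0 + 1*12 + 2*4 + 3*1 = 23.
echo getIndexOfElementID(3, [2, 3, 4], [12, 4, 1], 0, 23)  # 23: contiguous, so real index == linear id
```

For a transposed or sliced tensor the strides differ from the contiguous layout, and the same linear id maps to a different flat position, which is exactly why the kernel recomputes a real index per tensor.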

Afterwards, all my operations are easily added in one line:

- the kind of function (infix: C = A op B, or in-place: A += B or A *= B)
- the Nim type
- the C type
- the Nim operator (for operator overloading)
- the OpenCL kernel name
- the OpenCL operation

genClInfixOp(float32, "float", `+`, "clAdd", "+")
genClInfixOp(float64, "double", `+`, "clAdd", "+")
genClInfixOp(float32, "float", `-`, "clSub", "-")
genClInfixOp(float64, "double", `-`, "clSub", "-")
genClInPlaceOp(float32, "float", `+=`, "clAdd", "+=")
genClInPlaceOp(float64, "double", `+=`, "clAdd", "+=")
genClInPlaceOp(float32, "float", `-=`, "clSub", "-=")
genClInPlaceOp(float64, "double", `-=`, "clSub", "-=")
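The real genClInfixOp compiles the kernel produced by gen_cl_apply3 and wires it to a Nim operator. As a rough CPU-only sketch of the pattern (illustrative names and a toy Tensor type, not Arraymancer's actual implementation), one template call can stamp out an overloaded operator per type/operation pair:

```nim
# Toy stand-in for a tensor: just a flat buffer.
type Tensor[T] = object
  data: seq[T]

# One template generates an overloaded infix operator from an
# element-wise operation, the same way genClInfixOp generates
# one from an OpenCL kernel (here we apply it on the CPU instead).
template genInfixOp(T: typedesc, op, elemOp: untyped) =
  proc op*(a, b: Tensor[T]): Tensor[T] =
    result.data = newSeq[T](a.data.len)
    for i in 0 ..< a.data.len:
      result.data[i] = elemOp(a.data[i], b.data[i])

# Each operation is then a single line:
genInfixOp(float32, `+`, `+`)
genInfixOp(float32, `-`, `-`)

let a = Tensor[float32](data: @[1'f32, 2, 3])
let b = Tensor[float32](data: @[10'f32, 20, 30])
echo (a + b).data  # element-wise sum
echo (a - b).data  # element-wise difference
```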

Next steps? Create higher-order functions for unary operations and add cos/sin/ln/exp in just 2 lines of code each. Furthermore, allow lifting any unary operation to whole tensors with a map function, and expose it so that OpenCL tensors are easily customizable.

After using Nim + OpenCL, I actually realized that using C++ function objects was overengineering.

To conclude, at the moment, I am convinced that the best language to work with GPUs is Nim.

Oh, and for those who want to see real Nim code for neural networks, here is FizzBuzz in Nim using neural networks (I didn’t implement it on GPU yet, though):

import ../src/arraymancer, math, strformat

func binary_encode(i: int, num_digits: int): Tensor[float32] =
  result = newTensor[float32](1, num_digits)
  for d in 0 ..< num_digits:
    result[0, d] = float32(i shr d and 1)

func fizz_buzz_encode(i: int): int =
  if   i mod 15 == 0: return 3
  elif i mod 5  == 0: return 2
  elif i mod 3  == 0: return 1
  else:               return 0

const NumDigits = 10

var x_train = newTensor[float32](2^NumDigits - 101, NumDigits)
var y_train = newTensor[int](2^NumDigits - 101)

for i in 101 ..< 2^NumDigits:
  x_train[i - 101, _] = binary_encode(i, NumDigits)
  y_train[i - 101] = fizz_buzz_encode(i)

const NumHidden = 100

let
  ctx = newContext Tensor[float32]
  X = ctx.variable x_train

network ctx, FizzBuzzNet:
  layers:
    hidden: Linear(NumDigits, NumHidden)
    output: Linear(NumHidden, 4)
  forward x:
    x.hidden.relu.output

let model = ctx.init(FizzBuzzNet)
let optim = model.optimizerSGD(0.05'f32)

func fizz_buzz(i: int, prediction: int): string =
  [$i, "fizz", "buzz", "fizzbuzz"][prediction]

const BatchSize = 128
const Epochs = 2500

for epoch in 0 ..< Epochs:
  for start_batch in countup(0, x_train.shape[0] - 1, BatchSize):
    let end_batch = min(x_train.shape[0] - 1, start_batch + BatchSize)
    let X_batch = X[start_batch ..< end_batch, _]
    let target = y_train[start_batch ..< end_batch]

    let clf = model.forward(X_batch)
    let loss = clf.sparse_softmax_cross_entropy(target)

    loss.backprop()
    optim.update()

  ctx.no_grad_mode:
    echo &"\nEpoch #{epoch} done. Testing accuracy"
    let y_pred = model.forward(X).value.softmax.argmax(axis = 1).squeeze
    let score = y_pred.accuracy_score(y_train)
    echo &"Accuracy: {score:.3f}%"
    echo "\n"

var x_buzz = newTensor[float32](100, NumDigits)
for i in 1 .. 100:
  x_buzz[i - 1, _] = binary_encode(i, NumDigits)
let X_buzz = ctx.variable x_buzz

ctx.no_grad_mode:
  let y_buzz = model.forward(X_buzz).value.softmax.argmax(axis = 1).squeeze

var answer: seq[string] = @[]
for i in 1 .. 100:
  answer.add fizz_buzz(i, y_buzz[i - 1])

echo answer

Thank you for your attention and your support,

Be sure to try Nim and Arraymancer!