I doubt there is a leak in our code. First, I re-checked it a few times and did not find one. Second, and a more reliable source of confidence, is that memory usage stays constant when stress-testing the same code on the CUDA platform.

The leak might be in CLBlast, the open-source performance library that I use under the hood for matrix computations on the OpenCL platform. There have been some leaks of temporary objects in the past, but those have been fixed. Besides, I have stress-tested matrix multiplications many times, and there were no issues.

The root of this issue is the temporary working memory that CLBlast creates during matrix multiplication. That memory does get cleaned up properly. If you just launch many multiplications of matrices of the same size, or of smaller matrices, these temporary buffers get created and destroyed, and everything works well.

Remember, though, that GPU kernel launches are asynchronous. Many operations get queued instantly, without waiting for the previous operations to complete. If the queued operations need temporary objects of different sizes, those objects may be created before earlier ones have been released. That does no harm when there is enough space, but here we have a pathological case of huge objects (1.3 GB in total) whose operations may require equally huge temporary buffers.
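To see why asynchrony matters here, consider a hypothetical sketch. The names `enqueue-gemm!` and the buffer sizes are made up for illustration; only `finish!` is the real ClojureCL function. Because each enqueue returns immediately, several large temporaries can be live at the same time:

```clojure
;; Hypothetical illustration; enqueue-gemm! is a made-up name.
;; Each call returns immediately, so all the temporaries may coexist:
(enqueue-gemm! queue a1 b1 c1)   ; may need a huge temporary buffer
(enqueue-gemm! queue a2 b2 c2)   ; another large temporary, allocated now
(enqueue-gemm! queue a3 b3 c3)   ; yet another -- device memory runs out

;; With a synchronization point, each temporary can be released
;; before the next one is created:
(enqueue-gemm! queue a1 b1 c1)
(finish! queue)                  ; wait here; the library can free its temporary
(enqueue-gemm! queue a2 b2 c2)
```

The point is not the exact API, but the timing: without a synchronization point, the host keeps allocating ahead of the device.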

This is why I recommend extreme care, and why I insisted on pre-allocating and reusing memory buffers wherever and whenever possible in the code that we wrote in this series.
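As a reminder of that style, here is a minimal sketch using Neanderthal's `ge` and `mm!` (the dimensions and loop count are illustrative): the destination matrix is allocated once, outside the loop, and every iteration writes into it instead of creating a fresh result.

```clojure
(with-release [a (ge factory 1000 1000)
               b (ge factory 1000 1000)
               c (ge factory 1000 1000)]   ; preallocated once
  (dotimes [i 100]
    ;; mm! writes into the existing c instead of allocating a new matrix
    (mm! 1.0 a b 0.0 c)))
```

The destructive `mm!` keeps the working set constant no matter how many iterations we run.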

In this case, we can't control the code of the underlying performance library, but now that we know the source of the problem, we can work around it. We simply have to launch the kernels that do matrix operations less aggressively. We can either call ClojureCL's finish! function to force synchronization of the queue before too many operations get launched, or call functions that read the results of reductions; these implicitly force synchronization.

(defn sgd-opencl [network out cost! epochs eta]
  (dotimes [n epochs]
    (forward network)
    (finish!)
    (backward network eta)
    (cost! out (output network))))

In the example above, I've demonstrated both methods. First, I force synchronization after each forward pass through the network. After each backward pass, I calculate the cost, which does a reduction and returns a scalar result, thus forcing synchronization implicitly.

Reductions are bad for GPU performance, but in this case the matrix multiplications are much more demanding, so this should not have a big impact.

I made the input dimension a few times smaller to get shorter running times, but the example still demonstrates the point.

(opencl/with-default
  (with-release [factory (opencl-float *context* *command-queue*)]
    (with-release [x (ge factory 2000 8000)
                   y (entry! (ge factory 10 8000) 0.33)
                   inference (inference-network
                              factory 2000
                              [(fully-connected 5000 tanh)
                               (fully-connected 1000 sigmoid)
                               (fully-connected 10 sigmoid)])
                   training (training-network inference x)]
      (time (sgd-opencl training y quadratic-cost! 1 0.05)))))

"Elapsed time: 850.229314 msecs"

The same network doing 10 epochs is much faster per epoch.

(opencl/with-default
  (with-release [factory (opencl-float *context* *command-queue*)]
    (with-release [x (ge factory 2000 8000)
                   y (entry! (ge factory 10 8000) 0.33)
                   inference (inference-network
                              factory 2000
                              [(fully-connected 5000 tanh)
                               (fully-connected 1000 sigmoid)
                               (fully-connected 10 sigmoid)])
                   training (training-network inference x)]
      (time (sgd-opencl training y quadratic-cost! 10 0.05)))))

"Elapsed time: 2785.458512 msecs"

As you can see, with this small fix, we made the OpenCL engine perform (almost) as well as CUDA. Keep in mind that this is on (four-year) older and less powerful hardware, with (three-year) old drivers. Also note that this network consists of unusually large layers; you might not even see this issue in "normal" examples, but this is what edge cases are for.