This is a sequel to my previous post several months ago. Last time, I introduced a shoddy library named deeplearn-rs that allowed you to build and (manually) train neural networks that run on your GPU via OpenCL. I told you to “take a gander” at some atrocious code that constructed a rather useless XOR network and then didn’t bother to train it. For reasons unknown, I still got positive feedback. Thanks, internet :)

Deeplearn-rs has come a long way since January!

- There’s a working MNIST digit classifier example.

- The Graph API has been cleaned up quite a bit.

- Node/layer creation is much more convenient.

- There’s now a Trainer struct to streamline training.

- Minibatches work now.

- The underlying GPU array math library, gpuarray-rs (formerly called matrix-rs), now supports N-dimensional arrays.

Let’s see what it looks like before we dive into the details:
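The snippet itself didn’t survive here, so below is a hedged reconstruction of what building that network might look like. Every name in it (`Graph::new`, `add_variable`, `layers::dense_biased`, `layers::activation`, `layers::mse`, `init::Normal`, `Relu`) is an assumption pieced together from the description that follows, not verified deeplearn-rs API:

```rust
// Hypothetical sketch only: these names are assumptions, not the
// verified deeplearn-rs API.
let mut graph = Graph::new();

// A minibatch of flattened 28x28 MNIST images
let input = graph.add_variable(&[batch_size, 28 * 28]);

// Layer 1: biased fully connected layer + ReLU activation
let (l1, _, _) = layers::dense_biased(&mut graph, input, 300,
                                      init::Normal(0.0, 0.01),   // weight init
                                      init::Normal(0.0, 0.01));  // bias init
let l1 = layers::activation(&mut graph, Relu(l1));

// Layer 2: biased fully connected layer down to the 10 digit classes
let (l2, _, _) = layers::dense_biased(&mut graph, l1, 10,
                                      init::Normal(0.0, 0.01),
                                      init::Normal(0.0, 0.01));

// MSE loss against one-hot digit labels
let loss = layers::mse(&mut graph, l2);
```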

This snippet builds a batched 2-layer network with biased fully connected layers, ReLU activations, and MSE (mean squared error) loss. I’m really pleased with how the API turned out. Check out the full MNIST example to see how it’s trained and validated.

Node Creation

Node creation before:
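The original before-snippet is missing here; as a hedged reconstruction (the names and signatures below are assumptions), the caller had to hand over the GPU context, the input shapes, and the output shapes themselves:

```rust
// Hypothetical reconstruction; names and signatures are assumptions.
let op = MatMul::new(ctx, &[batch_size, 784], &[784, 300]);
let node = graph.add_node(ctx,
                          op,
                          vec![input, weights],        // input variables
                          &[vec![batch_size, 300]]);   // output shapes
```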

The Operation (matrix multiplication, in this case) needs the shapes of the input variables and access to the GPU context so that it can create the intermediate GPU buffers it needs. The Graph needs to know the input variables and the shapes of the output variables. It was pretty ridiculous to ask the poor API user to supply all of these.

Node creation now:
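With the new system, the builder carries only its inputs; shapes and buffers are worked out internally. A hedged sketch (the exact call is an assumption):

```rust
// Hypothetical sketch; the exact call is an assumption.
let node = graph.add_node(MatMul(input, weights));
```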

The solution was to separate Operation descriptions from their implementations. I introduced an OpBuilder trait and an OpDescriptor struct.

Under this system, there are two structs associated with every operation:

1. An operation builder that implements OpBuilder. OpBuilder::build is where parameter validity is verified and the actual operation’s constructor is called. OpBuilder::build returns an OpDescriptor, which Graph uses to build the Node.

2. The actual operation, which implements the forward and backward passes.

Below is the new matrix multiplication implementation:
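The real implementation isn’t reproduced here, so here is a self-contained sketch of the same two-struct pattern using simplified stand-in types. The `Shape` alias, the `OpDescriptor` fields, and the `String` error handling are stubs for illustration, not deeplearn-rs’s actual definitions:

```rust
type Shape = Vec<usize>;

// What OpBuilder::build hands back to the Graph: the constructed op
// plus the shapes of the outputs the Graph must allocate.
struct OpDescriptor<O> {
    op: O,
    out_shapes: Vec<Shape>,
}

trait OpBuilder {
    type Op;
    // Validate parameters against the input shapes and describe the op.
    fn build(&self, in_shapes: &[Shape]) -> Result<OpDescriptor<Self::Op>, String>;
}

// Struct 1: the builder. It only names the operation and its inputs.
struct MatMul;

// Struct 2: the actual operation, which would implement the forward
// and backward passes (elided in this sketch).
struct MatMulImpl {
    inner_dim: usize,
}

impl OpBuilder for MatMul {
    type Op = MatMulImpl;

    fn build(&self, in_shapes: &[Shape]) -> Result<OpDescriptor<MatMulImpl>, String> {
        let (a, b) = (&in_shapes[0], &in_shapes[1]);
        // (n, m) x (m, p) -> (n, p); reject mismatched inner dimensions
        if a[1] != b[0] {
            return Err(format!("shape mismatch: {:?} x {:?}", a, b));
        }
        Ok(OpDescriptor {
            op: MatMulImpl { inner_dim: a[1] },
            out_shapes: vec![vec![a[0], b[1]]],
        })
    }
}
```

The Graph only ever talks to the builder; shape inference and error checking happen once, in build, before any GPU buffers exist.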

Note that <MatMul as OpBuilder>::build figures out the shapes of the input variables, checks for dimension errors, and constructs the MatMulImpl struct, which actually runs the forward and backward passes. The new system is slightly more complex, but much more convenient for the end user.

Layer Creation

Even with node creation streamlined, building whole layers was still too tedious for my tastes.

Manually creating the weights, adding the nodes, and getting the output variables can be such a drag! I want to build layers with one-liners!

I thought long and hard about how to simplify layer creation for the user. I considered complicating the operation system further by allowing you to make composite operations out of other operations. And then complicating it even further by giving OpBuilder another associated type, Variables: a tuple of VarIndexes for Graph::add_node to return to the user. Then we’d have to build some notion of which variables OpBuilder::build should create automatically (i.e. weights). But MatMul seems like a pretty general operation; what if the user doesn’t want their weights made for them? All of this is wayyy too complicated.

Instead, I opted to place layer creation at a higher level of abstraction. I just wrote some functions that build common layer types. Here’s the declaration for the function that adds a biased dense layer:
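The declaration itself is missing here; a hedged sketch of its shape, where the function name, the `Initializer` bound, and the `VarIndex` and `Graph` types are assumptions inferred from the surrounding description:

```rust
// Hypothetical declaration; names and exact types are assumptions.
pub fn dense_biased<WI: Initializer, BI: Initializer>(
    graph: &mut Graph,
    input: VarIndex,      // output variable of the previous layer
    layer_size: usize,    // number of neurons
    weight_init: WI,      // e.g. a normal distribution
    bias_init: BI,
) -> (VarIndex, VarIndex, VarIndex) // (output, weights, bias)
```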

You supply the graph, the input to the layer, the layer size (number of neurons), and the initializers for the weights and bias (maybe a normal distribution, for example). You get back the VarIndexes for the output, weights, and biases. Nice!

Training

There’s now a Trainer struct. Right now, it just backpropagates the gradients to the learnable variables and applies the gradients to the variables at each epoch. I guess that makes it Stochastic Gradient Descent. I intend to implement fancier trainers like RMSprop and AdaGrad eventually. The trainer takes your graph, the number of epochs you want to run, an update function, and your training data. The update function is called every epoch and is where you can read the values of your different variables so that you can monitor your network’s progress.
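Stripped of graphs and GPU buffers, the per-epoch update amounts to plain SGD. A minimal CPU sketch of that update rule, w ← w − lr·∇w (this is the textbook rule, not code lifted from the Trainer, which does this on the GPU via gpuarray-rs):

```rust
// Plain stochastic gradient descent on a flat slice of weights.
// Equivalently, this adds the negated, scaled gradient to each weight.
fn sgd_step(weights: &mut [f32], grads: &[f32], learn_rate: f32) {
    for (w, g) in weights.iter_mut().zip(grads) {
        *w -= learn_rate * g;
    }
}
```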

Validating and/or deploying your network

Just upload inputs, call Graph::forward, grab outputs, rinse, and repeat!
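As a hedged sketch of that loop (the method names below are assumptions, not verified API):

```rust
// Hypothetical sketch; names are assumptions.
for (images, labels) in validation_set.batches(batch_size) {
    input.write(&graph, images);          // upload inputs
    graph.forward();                      // run the forward pass
    let prediction = output.read(&graph); // grab outputs
    correct += count_matches(&prediction, labels);
}
```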

Conclusion

The MNIST example runs in 75.6 seconds on my Toshiba Satellite laptop with integrated Intel Ivy Bridge graphics running the Beignet OpenCL driver. It achieves 89.8% validation accuracy. Pretty terrible, but here I am just stoked that it all works.

Next, I’m going to write an LSTM layer and build a simple char-rnn model.

After that, the rough plan is:

- Get it working on stable Rust (I’m currently using a few unstable features)

- Documentation

- 2D convolution

- Utilities for working with images

- Work on gpuarray-rs: tune OpenCL kernels, benchmark all the things

I’m graduating from undergrad this coming Saturday (WOOT). Once I find a job and get some money, I’d like to buy a beefy desktop with a real graphics card that can run OpenCL 2.0. Rust’s OpenCL bindings haven’t gotten a lot of love lately, and I’d like to write my own bindings with OpenCL 2.0 support. Then, I’d like to take advantage of the OpenCL 2.0 support in gpuarray-rs and deeplearn-rs. I think device-side kernel enqueuing would be really useful for things like generic broadcast/reduce operations. Stuff for another blog post on another day.

Thanks for reading!