Running CUDA C/C++ in Jupyter or how to run nvcc in Google CoLab

Not that long ago, Google made its research tool, Colaboratory, publicly available. Besides being a fully functional Jupyter Notebook with pre-installed TensorFlow and some other ML/DL tools, it lets you take advantage of an Nvidia GPU. Yes, an Nvidia K80 GPU with 11.5 GB of memory, for free.

Around the same time, Nvidia announced its Deep Learning Institute, where you can learn the basics of CUDA programming in both Python and C/C++. After the course you will get some good recommendations on where to go next. However, in the era of ultrabooks, not many of us have a dedicated graphics card on board, and an Nvidia one is even rarer. There are a few options: once you have more experience, you can write your code offline and then upload it to your Google/AWS/FloydHub GPU instance. But when you are just getting started, it is useful to compile your code and see the results in a live, online environment.

I guess I’m done with the introduction. Let’s get our hands dirty and set up our learning environment. First, create a new Notebook.

Please select Python 3 Notebook in the pop-up window.

If you have worked with Jupyter before, the interface will look familiar, if a bit more stylish. If you haven’t, don’t worry. It’s a simple and very powerful tool, which is why it is so popular.

Next, we need to switch our runtime from CPU to GPU. We are just 2 clicks away.

Choose “Change runtime type” in the Notebook settings, under the Runtime tab in the upper menu:

And click Save.

Although the CUDA libraries are available in the TensorFlow environment, Colab does not come with NVCC (the Nvidia CUDA Compiler) installed. Installing it is our next step.



!wget https://developer.nvidia.com/compute/cuda/8.0/Prod2/local_installers/cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb
!dpkg -i cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb
!apt-key add /var/cuda-repo-8-0-local-ga2/7fa2af80.pub
!apt-get update -qq
!apt-get install cuda gcc-5 g++-5 -y -qq
!ln -s /usr/bin/gcc-5 /usr/local/cuda/bin/gcc
!ln -s /usr/bin/g++-5 /usr/local/cuda/bin/g++


You can just copy it into a cell in the Notebook. Each line that starts with ! is executed as a command-line command.

There is no magic here. We download the CUDA Debian package and install it. CUDA 8 requires a GCC version lower than 6, so we install GCC 5 and create symlinks for successful compilation.

Now you can test your CUDA installation by running:

!/usr/local/cuda/bin/nvcc --version

And the output should look something like:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61

We are almost done. I created a small extension for running NVCC from Notebook cells. Install it with:

!pip install git+https://github.com/andreinechaev/nvcc4jupyter.git

Now load the installed extension by running:

%load_ext nvcc_plugin

We are now ready to run CUDA C/C++ code right in the Notebook.

For this, we need to explicitly tell the interpreter that we want to use the extension, by adding %%cu at the beginning of each cell with CUDA code.

%%cu
#include <iostream>

int main() {
    std::cout << "Hello world\n";
    return 0;
}

To check that everything works, I encourage you to try the matrix multiplication program from the Nvidia course.

%%cu
#include <stdio.h>

#define N 64

inline cudaError_t checkCudaErr(cudaError_t err, const char* msg) {
  if (err != cudaSuccess) {
    fprintf(stderr, "CUDA Runtime error at %s: %s\n", msg, cudaGetErrorString(err));
  }
  return err;
}

__global__ void matrixMulGPU( int * a, int * b, int * c )
{
  /*
   * Build out this kernel.
   */
  int row = threadIdx.y + blockIdx.y * blockDim.y;
  int col = threadIdx.x + blockIdx.x * blockDim.x;

  int val = 0;
  if (row < N && col < N) {
    for (int i = 0; i < N; ++i) {
      val += a[row * N + i] * b[i * N + col];
    }

    c[row * N + col] = val;
  }
}

/*
 * This CPU function already works, and will run to create a solution matrix
 * against which to verify your work building out the matrixMulGPU kernel.
 */
void matrixMulCPU( int * a, int * b, int * c )
{
  int val = 0;

  for( int row = 0; row < N; ++row )
    for( int col = 0; col < N; ++col )
    {
      val = 0;
      for ( int k = 0; k < N; ++k )
        val += a[row * N + k] * b[k * N + col];
      c[row * N + col] = val;
    }
}

int main()
{
  int *a, *b, *c_cpu, *c_gpu; // Allocate a solution matrix for both the CPU and the GPU operations

  int size = N * N * sizeof (int); // Number of bytes of an N x N matrix

  // Allocate memory
  cudaMallocManaged (&a, size);
  cudaMallocManaged (&b, size);
  cudaMallocManaged (&c_cpu, size);
  cudaMallocManaged (&c_gpu, size);

  // Initialize memory; create 2D matrices
  for( int row = 0; row < N; ++row )
    for( int col = 0; col < N; ++col )
    {
      a[row*N + col] = row;
      b[row*N + col] = col+2;
      c_cpu[row*N + col] = 0;
      c_gpu[row*N + col] = 0;
    }

  /*
   * Assign `threads_per_block` and `number_of_blocks` 2D values
   * that can be used in matrixMulGPU above.
   */
  dim3 threads_per_block(32, 32, 1);
  dim3 number_of_blocks(N / threads_per_block.x + 1, N / threads_per_block.y + 1, 1);

  matrixMulGPU <<< number_of_blocks, threads_per_block >>> ( a, b, c_gpu );

  checkCudaErr(cudaDeviceSynchronize(), "Syncronization");
  checkCudaErr(cudaGetLastError(), "GPU");

  // Call the CPU version to check our work
  matrixMulCPU( a, b, c_cpu );

  // Compare the two answers to make sure they are equal
  bool error = false;
  for( int row = 0; row < N && !error; ++row )
    for( int col = 0; col < N && !error; ++col )
      if (c_cpu[row * N + col] != c_gpu[row * N + col])
      {
        printf("FOUND ERROR at c[%d][%d]\n", row, col);
        error = true;
        break;
      }

  if (!error)
    printf("Success!\n");

  // Free all our allocated memory
  cudaFree(a); cudaFree(b);
  cudaFree( c_cpu ); cudaFree( c_gpu );
}

If everything was set up correctly, you should see 'Success!' as the output of the cell.

All the steps above should also be valid for your local environment. I didn’t intend to create a plugin for everyone; it’s mostly an accident :) So feel free to update it.

I hope this will be useful to someone.

Have fun!