I've finally done it. After 50+ hours spent trying to install GPU support for Tensorflow over the span of a year and a half, I have finally done it. I'm happy to say that I have CUDA 9.2, CuDNN 7.2, and compiled Tensorflow from source well enough that I can train a Resnet on Imagenet-100 in a barely decent amount of time by 2018 standards. Take that, cloud!

These notes don't come from an expert and aren't meant as a flawless guide for others. I was not very knowledgeable about Linux hardware debugging when I started this journey back in 2016, and I am still very much learning.

That being said, NVIDIA has had notoriously bad driver support for Linux, which famously led to Mr Torvalds flipping them the finger in a 2012 interview. Even today it is exceedingly hard not to pull your hair out trying to get their drivers working. After many, many days of configuration and a collection of black screens, consoles and no-boots, here's what it took to install Tensorflow on my machine.

I ended up using the Manjaro distribution, as it was the only one among Ubuntu 14, 16 and 18, Linux Mint 18 and 19, Scientific Linux and Debian Jessie to just work out of the box. Even though it gets a lot of love from its many fans, Arch Linux was a no-go: I still don't believe it a good use of my time to learn to build my OS from the ground up. After further experimentation in the past months, I have to say I've grown quite used to using pacman instead of apt. Manjaro also offers a community edition that comes bundled with the i3 window manager, which fits nicely with my obsession of the last few years to go as mouseless as possible. But that's another blog post.

I assume that many people have gone through the same steps as I have, and I would blame none of them for having given up before reaching the end. If there's one lesson to learn from this situation, it's that your calculated ROI on purchasing one (or many) GPUs for training neural networks should account for the time it may take to troubleshoot their installation.
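To make that ROI point concrete, here is a rough back-of-the-envelope sketch. The function name and every number except the 50 hours of setup mentioned above are hypothetical placeholders, and it ignores electricity, resale value and cloud storage costs, so treat it as a starting point rather than a real cost model.

```python
def break_even_hours(hardware_cost, setup_hours, value_of_hour, cloud_rate_per_hour):
    """Training hours after which owning the GPU is cheaper than renting one.

    All arguments are assumptions supplied by the caller: the price paid for
    the card, the hours lost to installation, what an hour of your time is
    worth, and the hourly price of a comparable cloud GPU.
    """
    if cloud_rate_per_hour <= 0:
        raise ValueError("cloud_rate_per_hour must be positive")
    # Troubleshooting time is an upfront cost, just like the hardware itself.
    total_upfront = hardware_cost + setup_hours * value_of_hour
    return total_upfront / cloud_rate_per_hour


if __name__ == "__main__":
    # Hypothetical figures, apart from the 50 hours of setup from this post.
    hours = break_even_hours(hardware_cost=200, setup_hours=50,
                             value_of_hour=20, cloud_rate_per_hour=1.0)
    print("Break-even after {:.0f} hours of training".format(hours))
```

With numbers like these, the setup time costs far more than the card, which is the whole point of the lesson.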

Research

Before you buy anything, do your research. At the time of purchase, the GeForce 10xx series was still far away, and the 960 was at the top of the chart for performance/price ratio. I've spent a few dozen hours using Keras since, and as I am still a novice, the 960 suits my needs quite nicely. As of writing, the new 20xx series is slated to be released any day now, but it's still unclear whether the tensor cores it'll be packing are going to make much difference in terms of performance.

By default, most distributions will install and use the open source GPU driver called nouveau, which won't cut it for what we've got in mind for this GPU. The hardest and most frustrating part of the installation process is getting the NVIDIA drivers running. I was met with a lot of black screens and flashing cursor bars, and I gave up more than once on trying to back-fix things from the GRUB console.
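A quick way to see which of the two drivers the kernel actually loaded is to look at /proc/modules (the same data lsmod prints). The helper below is a minimal sketch, not part of my original setup; the function name is made up for illustration, and it assumes the proprietary driver's modules are named nvidia or start with nvidia_.

```python
from pathlib import Path


def loaded_gpu_modules(proc_modules_text):
    """Return the sorted GPU kernel modules named in /proc/modules text."""
    # Each non-empty line of /proc/modules starts with the module name.
    names = {line.split()[0]
             for line in proc_modules_text.splitlines() if line.strip()}
    # Assumption: the proprietary driver shows up as "nvidia" plus helper
    # modules prefixed with "nvidia_" (nvidia_drm, nvidia_modeset, ...).
    return sorted(n for n in names
                  if n in ("nouveau", "nvidia") or n.startswith("nvidia_"))


if __name__ == "__main__":
    modules_file = Path("/proc/modules")  # only exists on Linux
    if modules_file.exists():
        print(loaded_gpu_modules(modules_file.read_text()))
```

If the list still contains nouveau after installing the NVIDIA driver, the proprietary one did not take over.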

Lots of resources exist out there, but the following have been the most useful in finally making things run:

Specs

CPU: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz

GPU: NVIDIA Corporation GM206 [GeForce GTX 960]

Monitor: Asus 24" HDMI

Notes: I cannot be 100% sure, but I believe that connecting by HDMI solved part of my problems. YMMV

Steps

Install Manjaro from Live USB

I was getting a blank screen when trying to boot with the Non-Free Drivers option, so I went with Free Drivers.

Update system

sudo pacman -Syyu

Install NVIDIA drivers

In my case, I wasn't able to make the regular nvidia package work and had to go with the 390xx series instead. This step just worked from the GUI. Go to Manjaro Settings > Drivers and simply install that one.

Reboot and cross your fingers.

Install CUDA and CuDNN

sudo pacman -S cuda cudnn

This next one may or may not be useful:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64
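One thing to watch with the export line above: if LD_LIBRARY_PATH was empty to begin with, the result starts with a colon, and the dynamic linker treats an empty entry as the current directory. The sketch below (the function name is hypothetical) builds the value without empty or duplicate entries and reports whether the CUPTI directory exists at all. The path is copied from the line above; on your system CUDA may live somewhere else.

```python
import os


def append_to_path_var(current, new_dir):
    """Return `current` with `new_dir` appended, unless it is already listed."""
    # Drop empty entries so an unset variable does not leave a leading colon.
    entries = [entry for entry in current.split(os.pathsep) if entry]
    if new_dir not in entries:
        entries.append(new_dir)
    return os.pathsep.join(entries)


if __name__ == "__main__":
    # Path taken from the export line above; adjust if CUDA is installed elsewhere.
    cupti_dir = "/usr/local/cuda/extras/CUPTI/lib64"
    print("directory exists:", os.path.isdir(cupti_dir))
    current = os.environ.get("LD_LIBRARY_PATH", "")
    print("export LD_LIBRARY_PATH=" + append_to_path_var(current, cupti_dir))
```

The script only prints the export line; paste it into your shell profile if the directory exists and you want the setting to survive a reboot.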

Reboot and cross your fingers.

Install Anaconda

Next up, we download the latest Anaconda release. Since we'll be compiling TF right after, we'll grab bazel while we're at it.

sudo pacman -S anaconda bazel

Compile Tensorflow from source

There are no handy CUDA 9.2 wheels for tensorflow available for Linux, so you'll need to compile from source. Don't worry, it'll put hair on your chest. I also ran into r1.10 asking for the keras_applications Python module to be installed, so according to this SO post I also pip-installed the following:

conda install keras
pip install keras_applications==1.0.4 --no-deps
pip install keras_preprocessing==1.0.2 --no-deps
pip install h5py==2.8.0

Then:

git clone https://github.com/tensorflow/tensorflow
cd tensorflow
git checkout r1.10
./configure

You'll then be asked a series of questions you'll probably want to Google before you answer. In my case, I did:

All default until

CUDA Support: Y

CUDA 9.2

CUDNN 7.2

TensorRT: default

NCCL: 1.3 (only one GPU on this box anyhow)

CUDA compute capabilities: 3.5,5.2 Get this from here

compile w/ clang: default

Bazel compiler flags: -mavx -mavx2 -mfma -msse4.2

Android: default
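The compiler flags in the answers above only make sense if the CPU actually supports those instruction sets; a binary built with an unsupported one typically dies with an "Illegal instruction" error. Here is a small sketch that compares the four flags against /proc/cpuinfo. The function and mapping names are made up for illustration, and it assumes an x86 Linux box, where the features are listed on lines starting with "flags".

```python
from pathlib import Path

# Compiler flag from the configure answers -> matching /proc/cpuinfo feature.
FLAG_TO_CPUINFO = {
    "-mavx": "avx",
    "-mavx2": "avx2",
    "-mfma": "fma",
    "-msse4.2": "sse4_2",
}


def unsupported_flags(cpuinfo_text, wanted=FLAG_TO_CPUINFO):
    """Return the compiler flags whose CPU feature is absent from cpuinfo text."""
    cpu_features = set()
    for line in cpuinfo_text.splitlines():
        # x86 kernels list features as "flags : fpu vme ... avx2 ...".
        if line.startswith("flags"):
            cpu_features.update(line.partition(":")[2].split())
    return [flag for flag, feature in wanted.items()
            if feature not in cpu_features]


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")  # only exists on Linux
    if cpuinfo.exists():
        missing = unsupported_flags(cpuinfo.read_text())
        print("Flags to drop:", missing if missing else "none")
```

Any flag the script reports should be left out of the Bazel compiler flags answer.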

And finally:

bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

Then install the resulting wheel with pip (the exact filename might differ in your case):

sudo pip install /tmp/tensorflow_pkg/tensorflow-1.10.0-cp36-cp36m-linux_x86_64.whl
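Since the filename depends on the TensorFlow version and on the Python you built against, it can help to list what actually landed in the output directory. The sketch below picks out the wheels whose Python tag (the cp36 part of the name above) matches the interpreter running the script; pip refuses wheels built for a different one. Both function names are hypothetical, and it assumes CPython and the /tmp/tensorflow_pkg directory from the build step.

```python
import os
import sys


def python_tag(version_info=None):
    """Return the CPython wheel tag for a version, e.g. cp36 for Python 3.6."""
    major, minor = (version_info or sys.version_info)[:2]
    return "cp{}{}".format(major, minor)


def matching_wheels(filenames, tag=None):
    """Keep the .whl files whose Python tag matches `tag` (default: this interpreter)."""
    tag = tag or python_tag()
    # Wheel names look like name-version-pythontag-abitag-platform.whl.
    return [name for name in filenames
            if name.endswith(".whl") and "-{}-".format(tag) in name]


if __name__ == "__main__":
    out_dir = "/tmp/tensorflow_pkg"  # output directory from the build step
    if os.path.isdir(out_dir):
        for name in matching_wheels(sorted(os.listdir(out_dir))):
            print(os.path.join(out_dir, name))
```

Run it with the same Python you intend to install into, then pass the printed path to pip.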

Test your build