In November 2015, Google open sourced TensorFlow, a Deep Learning library based on their internal Deepnet software DistBelief v2, which was developed for the Google Brain project. Deep Learning could very well drive a lot of product development in the next few decades. When software is able to detect and classify objects and perform semantic and syntactic analogy, things like translating signs with your phone's camera become possible.

TensorFlow supports implementing a wide variety of Deep Learning algorithms including:

Though you can train Deep Learning models using CPUs with TensorFlow, the large matrices being manipulated are what GPUs have been optimised for. Games use matrices intensively for all sorts of 3D functionality and GPUs are specifically designed to handle matrix manipulation orders of magnitude faster than regular CPUs.

TensorFlow has a Python-based API. The Python you write builds data and control structures which are then handed off to either the CPU or GPU. The great thing about this is that you get an easy-to-learn, easy-to-read programming language to interface with, without the slow execution times you'd otherwise expect from such computationally-intensive tasks. And unless you target a specific device on your machine, the code you write is device-agnostic; there aren't any specific CUDA calls or Intel-only endpoints to worry about. TensorFlow automatically prioritises using the GPU on operations where it makes sense performance-wise.
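To make the "build first, hand off later" idea concrete, here is a toy sketch in plain Python. It is not TensorFlow's API: `Node`, `constant`, `add`, `mul` and `run` are names made up for this illustration, and a real runtime would dispatch each operation to a CPU or GPU kernel rather than compute it in Python.

```python
# Toy illustration of deferred execution. These names are hypothetical
# and are NOT part of TensorFlow.

class Node(object):
    """One step in a computation graph; nothing is computed when it is built."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = inputs
        self.value = value


def constant(value):
    return Node("const", value=value)


def add(a, b):
    return Node("add", (a, b))


def mul(a, b):
    return Node("mul", (a, b))


def run(node):
    """Walk the graph and evaluate it; this is where the work happens."""
    if node.op == "const":
        return node.value
    args = [run(i) for i in node.inputs]
    if node.op == "add":
        return args[0] + args[1]
    if node.op == "mul":
        return args[0] * args[1]
    raise ValueError("unknown op: %s" % node.op)


# Building the graph only records structure.
graph = add(mul(constant(2), constant(3)), constant(4))

# Execution is a separate, explicit step.
result = run(graph)
print(result)  # 10
```

The point of the split is that the description of the computation stays the same regardless of where it is eventually executed.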

In this blog post I will build "Deep Fizz buzz" by training a Deepnet to try and give the correct answers for the first 100 values of Fizz buzz.

TensorFlow does support training models across clusters of machines but for this exercise I'll be using a single PC. I will train the Deepnet using an Nvidia GTX 1080. The GTX 1080 replaced my Radeon HD 7870 after I found TensorFlow has yet to support OpenCL and has a dependency on Nvidia's CUDA platform for any GPU-based training. The GTX 1080 draws up to 180 Watts of power compared to the 175 Watts the HD 7870 draws and they're both the same physical size so replacement was easy.

TensorFlow, Up & Running

A special thanks to Sai Soundararaj for the excellent installation notes he's put together and to everyone commenting in the GitHub issues for TensorFlow that have been kind enough to debug and share a set of version numbers for all the software that works well together.

The following was run on a fresh Ubuntu 16.04.1 LTS installation. Normally I use Ubuntu 14.04.3 LTS, as I usually run everything in a virtual machine. But I needed my code to be able to speak to the GPU directly, and neither VMWare Workstation nor VirtualBox showed any promise at getting GPU passthrough to work properly, so I ended up having to install Ubuntu on my machine. Ubuntu 14 didn't like the looks of my USB devices and didn't want to play ball but Ubuntu 16 installed nicely.

I'll first install a few dependencies to support TensorFlow's Python-based environment and a whole stack of Nvidia software.

    $ sudo apt update
    $ sudo apt install \
        freeglut3-dev \
        g++-4.9 \
        gcc-4.9 \
        libglu1-mesa-dev \
        libx11-dev \
        libxi-dev \
        libxmu-dev \
        nvidia-modprobe \
        python-dev \
        python-pip \
        python-virtualenv

I'll install a version of Nvidia's drivers which has been known to play well with TensorFlow and its dependencies.

    $ sudo apt purge nvidia-*
    $ sudo add-apt-repository ppa:graphics-drivers/ppa
    $ sudo apt update
    $ sudo apt install nvidia-367

With the driver and its dependencies installed I'll reboot the system.

    $ sudo reboot

Once the machine is back up you can see version 367.35 is now installed:

    $ cat /proc/driver/nvidia/version

    NVRM version: NVIDIA UNIX x86_64 Kernel Module  367.35  Mon Jul 11 23:14:21 PDT 2016
    GCC version:  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.1)

The following is the output from Nvidia's system management interface showing various diagnostics from my GTX 1080.

    $ sudo nvidia-smi

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 367.35                 Driver Version: 367.35                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 1080    Off  | 0000:02:00.0     Off |                  N/A |
    | 25%   50C    P2    38W / 200W |     55MiB /  8112MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID  Type  Process name                               Usage      |
    |=============================================================================|
    |    0      3084    G   /usr/lib/xorg/Xorg                              53MiB |
    +-----------------------------------------------------------------------------+

I'll set GCC 4.9 to be the default version being used on this system. The CUDA toolkit complains about this version while installing but I found it works nonetheless.

    $ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.9 10
    $ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.9 20
    $ sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-4.9 10
    $ sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-4.9 20

Next I'll download the 64-bit version of the CUDA 7.5 platform distribution for Ubuntu 14 (even though this is running on Ubuntu 16). Version 8 is the latest version but isn't supported by TensorFlow yet. If you need a different version please see Nvidia's CUDA downloads page. There will be a complaint about using GCC 4.9 so I've added an --override flag to get around this.

    $ wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda_7.5.18_linux.run
    $ sudo sh cuda_7.5.18_linux.run --override

    Do you accept the previously read EULA?
    (accept/decline/quit): accept

    You are attempting to install on an unsupported configuration. Do you wish to continue?
    ((y)es/(n)o) [ default is no ]: yes

    Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 352.39?
    ((y)es/(n)o/(q)uit): no

    Install the CUDA 7.5 Toolkit?
    ((y)es/(n)o/(q)uit): yes

    Enter Toolkit Location
     [ default is /usr/local/cuda-7.5 ]:

    Do you want to install a symbolic link at /usr/local/cuda?
    ((y)es/(n)o/(q)uit): yes

    Install the CUDA 7.5 Samples?
    ((y)es/(n)o/(q)uit): no

    Installing the CUDA Toolkit in /usr/local/cuda-7.5 ...

    ===========
    = Summary =
    ===========

    Driver:   Not Selected
    Toolkit:  Installed in /usr/local/cuda-7.5
    Samples:  Not Selected

    Please make sure that
     -   PATH includes /usr/local/cuda-7.5/bin
     -   LD_LIBRARY_PATH includes /usr/local/cuda-7.5/lib64, or, add /usr/local/cuda-7.5/lib64 to /etc/ld.so.conf and run ldconfig as root

    To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-7.5/bin
    To uninstall the NVIDIA Driver, run nvidia-uninstall

    Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-7.5/doc/pdf for detailed information on setting up CUDA.

    ***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 352.00 is required for CUDA 7.5 functionality to work.
    To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
        sudo <CudaInstaller>.run -silent -driver

    Logfile is /tmp/cuda_install_14557.log

I'll then add the environment variables for the CUDA platform to my .bashrc file.

    $ echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
    $ echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
    $ source ~/.bashrc

I can now see the CUDA compiler is installed properly.

    $ nvcc -V

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2015 NVIDIA Corporation
    Built on Tue_Aug_11_14:27:32_CDT_2015
    Cuda compilation tools, release 7.5, V7.5.17

I'll then download the CUDA Deep Neural Network library. This library contains a set of primitives used by Deepnets. It is available to members of the Accelerated Computing Developer Program so once you've joined you'll be able to download version 4.0.7.

    $ wget .../cudnn-70-linux-x64-v40
    $ tar xvf cudnn-7.0-linux-x64-v4.0-prod.tgz
    $ sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
    $ sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/
    $ sudo chmod a+r /usr/local/cuda/lib64/libcudnn*

Finally I'll install the binary for the 0.10.0 release candidate 0 of TensorFlow. Using the binary release instead of compiling from source saves needing to find and install Bazel 0.2.3 and Protobuf 3.0.0b2 specifically, and avoids any woes with GCC.

    $ virtualenv tf_gpu
    $ source tf_gpu/bin/activate
    $ pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.10.0rc0-cp27-none-linux_x86_64.whl
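The installer summary asks that PATH and LD_LIBRARY_PATH include the CUDA directories. As a rough sanity check, a few lines of Python can confirm the two .bashrc entries took effect in the current shell. The function name `missing_cuda_entries` is made up for this sketch, and the two directories assume the /usr/local/cuda symbolic link created during the toolkit installation.

```python
import os

# Assumed locations, based on the symbolic link chosen during installation.
CUDA_BIN = "/usr/local/cuda/bin"
CUDA_LIB = "/usr/local/cuda/lib64"


def missing_cuda_entries(env):
    """Return the names of the variables that lack the expected CUDA directory."""
    missing = []
    for var, wanted in (("PATH", CUDA_BIN), ("LD_LIBRARY_PATH", CUDA_LIB)):
        # Both variables are colon-separated lists on Linux.
        entries = env.get(var, "").split(":")
        if wanted not in entries:
            missing.append(var)
    return missing


# An empty list means both variables look right for this shell.
print(missing_cuda_entries(os.environ))
```

This only inspects the environment; it doesn't prove the libraries themselves load.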

Deep Fizz Buzz

I'll train a model which Joel Grus has released into the public domain after he used it, unsuccessfully, to try and land a job. I've added in telemetry collection for analysis in TensorBoard after the model has been trained. The telemetry is being stored in the /tmp/train folder.

    $ vi deep_fizz_buzz.py

    import numpy as np
    import tensorflow as tf

    NUM_DIGITS = 10

    # Represent each input by an array of its binary digits.
    def binary_encode(i, num_digits):
        return np.array([i >> d & 1 for d in range(num_digits)])

    # One-hot encode the desired outputs: [number, "fizz", "buzz", "fizzbuzz"]
    def fizz_buzz_encode(i):
        if i % 15 == 0:
            return np.array([0, 0, 0, 1])
        elif i % 5 == 0:
            return np.array([0, 0, 1, 0])
        elif i % 3 == 0:
            return np.array([0, 1, 0, 0])
        else:
            return np.array([1, 0, 0, 0])

    # Our goal is to produce fizzbuzz for the numbers 1 to 100. So it would be
    # unfair to include these in our training data. Accordingly, the training data
    # corresponds to the numbers 101 to (2 ** NUM_DIGITS - 1).
    trX = np.array([binary_encode(i, NUM_DIGITS) for i in range(101, 2 ** NUM_DIGITS)])
    trY = np.array([fizz_buzz_encode(i) for i in range(101, 2 ** NUM_DIGITS)])

    # We'll want to randomly initialize weights.
    def init_weights(shape):
        return tf.Variable(tf.random_normal(shape, stddev=0.01))

    # Our model is a standard 1-hidden-layer multi-layer-perceptron with ReLU
    # activation. The softmax (which turns arbitrary real-valued outputs into
    # probabilities) gets applied in the cost function.
    def model(X, w_h, w_o):
        h = tf.nn.relu(tf.matmul(X, w_h))
        return tf.matmul(h, w_o)

    # Our variables. The input has width NUM_DIGITS, and the output has width 4.
    X = tf.placeholder("float", [None, NUM_DIGITS])
    Y = tf.placeholder("float", [None, 4])

    # How many units in the hidden layer.
    NUM_HIDDEN = 100

    # Initialize the weights.
    w_h = init_weights([NUM_DIGITS, NUM_HIDDEN])
    w_o = init_weights([NUM_HIDDEN, 4])

    # Predict y given x using the model.
    py_x = model(X, w_h, w_o)

    # We'll train our model by minimizing a cost function.
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(py_x, Y))
    train_op = tf.train.GradientDescentOptimizer(0.05).minimize(cost)

    # And we'll make predictions by choosing the largest output.
    predict_op = tf.argmax(py_x, 1)

    # Finally, we need a way to turn a prediction (and an original number)
    # into a fizz buzz output
    def fizz_buzz(i, prediction):
        return [str(i), "fizz", "buzz", "fizzbuzz"][prediction]

    BATCH_SIZE = 128

    # Launch the graph in a session
    with tf.Session() as sess:
        tf.initialize_all_variables().run()

        train_writer = tf.train.SummaryWriter('/tmp/train', sess.graph)
        merged_summary_op = tf.merge_all_summaries()

        for epoch in range(10000):
            # Shuffle the data before each training iteration.
            p = np.random.permutation(range(len(trX)))
            trX, trY = trX[p], trY[p]

            # Train in batches of 128 inputs.
            for iteration, start in enumerate(range(0, len(trX), BATCH_SIZE)):
                end = start + BATCH_SIZE
                feed_dict = {
                    X: trX[start:end],
                    Y: trY[start:end]
                }
                summary = sess.run(train_op, feed_dict)

                if iteration and not iteration % 32:
                    train_writer.add_summary(sess.run(merged_summary_op),
                                             iteration)

            # And print the current accuracy on the training data.
            if not epoch % 250:
                feed_dict = {X: trX, Y: trY}
                print epoch, np.mean(np.argmax(trY, axis=1) ==
                                     sess.run(predict_op, feed_dict))

        # And now for some fizz buzz
        numbers = np.arange(1, 101)
        teX = np.transpose(binary_encode(numbers, NUM_DIGITS))
        teY = sess.run(predict_op, feed_dict={X: teX})
        output = np.vectorize(fizz_buzz)(numbers, teY)

        print output

    $ python deep_fizz_buzz.py

The above finished in a few minutes.
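To see what the network is actually fed, here is a small sketch of the two encodings and the batch boundaries in plain Python, without NumPy or TensorFlow. It mirrors `binary_encode` and `fizz_buzz_encode` from the script but returns lists instead of arrays; the variable names `training_numbers` and `batch_starts` are introduced just for this illustration.

```python
NUM_DIGITS = 10
BATCH_SIZE = 128


def binary_encode(i, num_digits):
    """Binary digits of i, least-significant bit first."""
    return [i >> d & 1 for d in range(num_digits)]


def fizz_buzz_encode(i):
    """One-hot vector over [number, "fizz", "buzz", "fizzbuzz"]."""
    if i % 15 == 0:
        return [0, 0, 0, 1]
    elif i % 5 == 0:
        return [0, 0, 1, 0]
    elif i % 3 == 0:
        return [0, 1, 0, 0]
    return [1, 0, 0, 0]


print(binary_encode(5, NUM_DIGITS))  # [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(fizz_buzz_encode(105))         # [0, 0, 0, 1]

# The training set is every number from 101 up to 2 ** NUM_DIGITS - 1.
training_numbers = range(101, 2 ** NUM_DIGITS)
batch_starts = list(range(0, len(training_numbers), BATCH_SIZE))

print(len(training_numbers))                     # 923
print(len(batch_starts))                         # 8
print(len(training_numbers) - batch_starts[-1])  # 27
```

So each epoch is eight batches, the last holding only 27 examples. One consequence is that the batch index in the training loop only runs from 0 to 7, so the `iteration % 32` condition guarding `add_summary` is never met as written; the graph itself is still written to /tmp/train when the `SummaryWriter` is created.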
While training this model you can see the CPU usage isn't completely maxed out.

    $ top

    top - 22:52:18 up 6 min,  2 users,  load average: 10,12, 5,85, 2,76
    Tasks: 159 total,  10 running, 145 sleeping,   0 stopped,   4 zombie
    %Cpu0  : 62,7 us,  5,9 sy,  0,0 ni, 31,4 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
    %Cpu1  : 56,1 us,  7,1 sy,  0,0 ni, 30,6 id,  0,0 wa,  0,0 hi,  6,1 si,  0,0 st
    %Cpu2  : 60,8 us,  6,2 sy,  0,0 ni, 33,0 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
    %Cpu3  :  0,0 us,100,0 sy,  0,0 ni,  0,0 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
    KiB Mem : 32831072 total, 30551628 free,   808544 used,  1470900 buff/cache
    KiB Swap: 33435644 total, 33435644 free,        0 used. 31611888 avail Mem

      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
     3496 mark      20   0 21,876g 715020 203328 S 204,0  2,2   1:37.98 python deep_fizz_buzz.py
     3084 root      20   0  181416  44096  29960 R  99,0  0,1   5:32.06 /usr/lib/xorg/Xorg vt7 -displayfd 3 -auth /var/lib/gdm3/.cache/gdm+

TensorFlow will allocate almost all of the GPU's memory in an attempt to reduce the effects of memory fragmentation. Below you can see Nvidia's system management interface showing almost all the memory is in use.

    $ sudo nvidia-smi

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 367.35                 Driver Version: 367.35                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 1080    Off  | 0000:02:00.0     Off |                  N/A |
    | 23%   50C    P2    44W / 200W |   7762MiB /  8112MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID  Type  Process name                               Usage      |
    |=============================================================================|
    |    0      3084    G   /usr/lib/xorg/Xorg                              53MiB |
    |    0      3496    C   python                                        7705MiB |
    +-----------------------------------------------------------------------------+

Here's the final output:

    ['1' '2' 'fizz' '4' 'buzz' 'fizz' '7' '8' 'fizz' '10'
     '11' '12' '13' '14' 'fizzbuzz' '16' '17' 'fizz' '19' 'buzz'
     'fizz' '22' '23' 'fizz' 'buzz' '26' 'fizz' '28' '29' 'fizzbuzz'
     '31' '32' 'fizz' 'fizz' 'buzz' '36' '37' '38' 'fizz' '40'
     '41' 'fizz' '43' '44' 'fizzbuzz' '46' '47' 'fizz' '49' 'buzz'
     'fizz' '52' '53' 'fizz' 'buzz' '56' '57' '58' '59' 'fizzbuzz'
     '61' '62' 'fizz' '64' 'buzz' 'fizz' '67' '68' 'fizz' 'buzz'
     '71' '72' '73' '74' 'fizzbuzz' '76' '77' 'fizz' '79' 'buzz'
     'fizz' '82' '83' 'fizz' 'buzz' '86' 'fizz' '88' '89' 'fizzbuzz'
     '91' '92' '93' '94' 'buzz' 'fizz' '97' '98' 'fizz' '100']

I'll populate a variable called generated with that list and then check it against a more deterministically generated list.

    def fizz_buzz(x):
        if x % 15 == 0:
            return 'fizzbuzz'
        elif x % 5 == 0:
            return 'buzz'
        elif x % 3 == 0:
            return 'fizz'
        else:
            return str(x)

    correct = [fizz_buzz(x) for x in range(1, 101)]

    In [19]: sum([val == generated[count] for count, val in enumerate(correct)])
    Out[19]: 91

91 correct out of a possible 100.
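The count says how many answers were right but not which ones were wrong. A small variation on the check above can list the misses. This is only a sketch: `mismatches` is a helper name I've made up, and `sample` is a short hypothetical list rather than the model's real output.

```python
def fizz_buzz(x):
    if x % 15 == 0:
        return 'fizzbuzz'
    elif x % 5 == 0:
        return 'buzz'
    elif x % 3 == 0:
        return 'fizz'
    else:
        return str(x)


def mismatches(generated, start=1):
    """Return (number, expected, got) for every answer that is wrong."""
    return [(n, fizz_buzz(n), got)
            for n, got in enumerate(generated, start)
            if got != fizz_buzz(n)]


# Hypothetical six-element sample; the last answer should have been 'fizz'.
sample = ['1', '2', 'fizz', '4', 'buzz', '6']
print(mismatches(sample))  # [(6, 'fizz', '6')]
```

Passing the full generated list in place of `sample` would show each number the model got wrong alongside the expected answer.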

TensorBoard

TensorFlow ships with a web application that will visualise the telemetry produced during the model training. To launch TensorBoard and bring up the graphs page, run the following:

    $ tensorboard --logdir=/tmp/train &
    $ open 'http://127.0.0.1:6006/#graphs'