Overview

In this post I will describe the steps required to get Tensorflow running with OpenCL on an AMD GPU.

Install AMDGPU-PRO Drivers

Install with (applying sudo credentials where necessary):

wget -referer = https://support.amd.com https://www2.ati.com/drivers/linux/ubuntu/amdgpu-pro-17.50-511655.tar.xz tar -xvf amdgpu-pro-17.50-511655.tar.xz cd amdgpu-pro-17.50-511655 ./amdgpupro-install --opencl = legacy --headless sudo reboot

Install a recent python

I use pyenv to manage different versions of python, setup your python as follows:

curl -L https://github.com/pyenv/pyenv-installer/raw/master/bin/pyenv-installer | bash

Add the following lines to your ~/.bashrc file:

export PATH = "~/.pyenv/bin: $PATH " eval " $( pyenv init - ) " eval " $( pyenv virtualenv-init - ) "

Restart your terminal and you have pyenv installed. Now install a python (I’ll use the latest 3.6) and its system dependencies:

sudo apt install -y make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev xz-utils tk-dev env PYTHON_CONFIGURE_OPTS = "--enable-shared" MAKEOPTS = "-j 8" pyenv install 3.6.5 pyenv local 3.6.5 pip install wheel numpy

Get The ComputeCPP SYCL Implementation

Tensorflow can use the SYCL interface to seamlessly run device agnostic c++ code on an OpenCL enabled device. There is an open source template based library called triSYCL. It requires c++17 support and thus you probably need to build your own gcc from a recent release. I won’t cover it here but you can if you want to.

Sign up here and get your computecpp library:

https://developer.codeplay.com/computecppce/latest/overview

tar -xvf ComputeCpp-CE-0.6.1-Ubuntu.16.04-64bit.tar.gz sudo mv ComputeCpp-CE-0.6.1-Ubuntu-16.04-64bit /usr/local/computecpp

Get Tensorflow with experimental opencl support

This fork of tensorflow is maintained by someone from Codeplay, who make ComputeCPP. It’s highly experimental, expect it to change in the future. We use the dev/amd_gpu branch which is currently under active development:

git clone https://github.com/lukeiwanski/tensorflow cd tensorflow git fetch git checkout dev/amd_gpu pyenv local 3.6.5

In order to build tensorflow, you need to use Google’s bazel build system:

wget https://github.com/bazelbuild/bazel/releases/download/0.11.1/bazel_0.11.1-linux-x86_64.deb sudo dpkg -i bazel_0.11.1-linux-x86_64.deb

Environment and Building

You can now configure and build tensorflow, from the tensorflow directory run:

./configure

I used the following options (press enter to get the defaults):

$ ./configure You have bazel 0.11.1 installed. Please specify the location of python. [ Default is /home/user/.pyenv/versions/3.6.5/bin/python]: Please input the desired Python library path to use. Default is [ /home/user/.pyenv/versions/3.6.5/lib/python3.6/site-packages] Do you wish to build TensorFlow with jemalloc as malloc support? [ Y/n]: Y jemalloc as malloc support will be enabled for TensorFlow. Do you wish to build TensorFlow with Google Cloud Platform support? [ Y/n]: n No Google Cloud Platform support will be enabled for TensorFlow. Do you wish to build TensorFlow with Hadoop File System support? [ Y/n]: n No Hadoop File System support will be enabled for TensorFlow. Do you wish to build TensorFlow with Amazon S3 File System support? [ Y/n]: n No Amazon S3 File System support will be enabled for TensorFlow. Do you wish to build TensorFlow with Apache Kafka Platform support? [ y/N]: n No Apache Kafka Platform support will be enabled for TensorFlow. Do you wish to build TensorFlow with XLA JIT support? [ y/N]: n No XLA JIT support will be enabled for TensorFlow. Do you wish to build TensorFlow with GDR support? [ y/N]: n No GDR support will be enabled for TensorFlow. Do you wish to build TensorFlow with VERBS support? [ y/N]: n No VERBS support will be enabled for TensorFlow. Do you wish to build TensorFlow with OpenCL SYCL support? [ y/N]: y OpenCL SYCL support will be enabled for TensorFlow. Please specify which C++ compiler should be used as the host C++ compiler. [ Default is /usr/bin/g++]: Please specify which C compiler should be used as the hostC compiler. [ Default is /usr/bin/gcc]: Do you wish to build TensorFlow with ComputeCPP support? [ Y/n]: Y ComputeCPP support will be enabled for TensorFlow. Please specify the location where ComputeCpp for SYCL 1.2 is installed. [ Default is /usr/local/computecpp]: Do you wish to build TensorFlow with double types in SYCL support? [ Y/n]: Y double types in SYCL support will be enabled for TensorFlow. Do you wish to build TensorFlow with half types in SYCL support? [ y/N]: n No half types in SYCL support will be enabled for TensorFlow. Do you wish to build TensorFlow with CUDA support? [ y/N]: n No CUDA support will be enabled for TensorFlow. Do you wish to build TensorFlow with MPI support? [ y/N]: n No MPI support will be enabled for TensorFlow. Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [ Default is -march = native]:

Now you can build the package, consider going to get lunch because this will take a while:

bazel build --config = opt --config = sycl //tensorflow/tools/pip_package:build_pip_package

Hopefully it built without errors, if it did, try to fix them. Now install the generated wheel file:

bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tf_cpp_whl pip install /tmp/tf_cpp_whl/tensorflow-1.6.0rc0-cp27-cp27mu-linux_x86_64.whl

Now you’ve got tensorflow installed, but it’s not over yet, you need everything setup nicely to point to the appropriate OpenCL ICD and use everything in your updated amdgpu-pro drivers. Consider putting these environment variables in your ~/.bashrc file:

export $OPENCL_VENDOR_PATH = /opt/amdgpu-pro/etc/OpenCL/vendors/ mkdir -p $OPENCL_VENDOR_PATH echo "/opt/amdgpu-pro/lib/x86_64-linux-gnu/libamdocl64.so" > $OPENCL_VENDOR_PATH /amdocl64.icd export LD_LIBRARY_PATH = /usr/local/computecpp/lib:/opt/amdgpu-pro/lib/x86_64-linux-gnu export PATH = /usr/local/computecpp/bin:/opt/amdgpu-pro/bin: $PATH

Here’s some explanation:

OPENCL_VENDOR_PATH is where libOpenCL.so will look for .icd files which point to a specific library for a driver. In our case it’s the libamdocl64.so , each one will appear as a different vendor, so you could install pocl or oclgrind LD_LIBRARY_PATH is where the system looks for shared object files PATH is where the system looks like for executable files. The two listed above contain clinfo and computecpp_info

Run clinfo and you should get something like this describing your GPU(s):

Number of platforms: 1 Platform Profile: FULL_PROFILE Platform Version: OpenCL 2.1 AMD-APP (2527.3) Platform Name: AMD Accelerated Parallel Processing Platform Vendor: Advanced Micro Devices, Inc. Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices Platform Name: AMD Accelerated Parallel Processing Number of devices: 1 Device Type: CL_DEVICE_TYPE_GPU Vendor ID: 1002h Board name: AMD Radeon HD 8500 Series Device Topology: PCI[ B#1, D#0, F#0 ] Max compute units: 6 Max work items dimensions: 3 Max work items[0]: 1024 Max work items[1]: 1024 Max work items[2]: 1024 Max work group size: 256 Preferred vector width char: 4 Preferred vector width short: 2 Preferred vector width int: 1 Preferred vector width long: 1 Preferred vector width float: 1 Preferred vector width double: 1 Native vector width char: 4 Native vector width short: 2 Native vector width int: 1 Native vector width long: 1 Native vector width float: 1 Native vector width double: 1 Max clock frequency: 780Mhz Address bits: 64 Max memory allocation: 330696294 Image support: Yes Max number of images read arguments: 128 Max number of images write arguments: 8 Max image 2D width: 16384 Max image 2D height: 16384 Max image 3D width: 2048 Max image 3D height: 2048 Max image 3D depth: 2048 Max samplers within kernel: 16 Max size of kernel argument: 1024 Alignment (bits) of base address: 2048 Minimum alignment (bytes) for any datatype: 128 Single precision floating point capability Denorms: No Quiet NaNs: Yes Round to nearest even: Yes Round to zero: Yes Round to +ve and infinity: Yes IEEE754-2008 fused multiply-add: Yes Cache type: Read/Write Cache line size: 64 Cache size: 16384 Global memory size: 505212928 Constant buffer size: 65536 Max number of constant args: 8 Local memory type: Scratchpad Local memory size: 32768 Max pipe arguments: 0 Max pipe active reservations: 0 Max pipe packet size: 0 Max global variable size: 0 Max global variable preferred total size: 0 Max read/write image args: 0 Max on device events: 0 Queue on device max size: 0 Max on device queues: 0 Queue on device preferred size: 0 SVM capabilities: Coarse grain buffer: No Fine grain buffer: No Fine grain system: No Atomics: No Preferred platform atomic alignment: 0 Preferred global atomic alignment: 0 Preferred local atomic alignment: 0 Kernel Preferred work group size multiple: 64 Error correction support: 0 Unified memory for Host and Device: 0 Profiling timer resolution: 1 Device endianess: Little Available: Yes Compiler available: Yes Execution capabilities: Execute OpenCL kernels: Yes Execute native function: No Queue on Host properties: Out-of-Order: No Profiling : Yes Queue on Device properties: Out-of-Order: No Profiling : No Platform ID: 0x7f9728fbf510 Name: Oland Vendor: Advanced Micro Devices, Inc. Device OpenCL C version: OpenCL C 1.2 Driver version: 2527.3 Profile: FULL_PROFILE Version: OpenCL 1.2 AMD-APP (2527.3) Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event

Running computecpp_info should get you something like this:

******************************************************************************** ComputeCpp Info (CE 0.6.1) ******************************************************************************** Toolchain information: GLIBC version: 2.23 GLIBCXX: 20160609 This version of libstdc++ is supported. ******************************************************************************** Device Info: Discovered 1 devices matching: platform : <any> device type : <any> -------------------------------------------------------------------------------- Device 0: Device is supported : UNTESTED - Vendor not tested on this OS CL_DEVICE_NAME : Oland CL_DEVICE_VENDOR : Advanced Micro Devices, Inc. CL_DRIVER_VERSION : 2527.3 CL_DEVICE_TYPE : CL_DEVICE_TYPE_GPU If you encounter problems when using any of these OpenCL devices, please consult this website for known issues: https://computecpp.codeplay.com/releases/v0.6.1/platform-support-notes ********************************************************************************

Verification

You can now quickly check that everything installed correctly (don’t run this from the tensorflow source directory, you’ll run into issues):

cd ~ pyenv local 3.6.5 python -c "from tensorflow.python.client import device_lib;device_lib.list_local_devices()" 2018-04-04 16:39:13.673362: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:70] Found following OpenCL devices: 2018-04-04 16:39:13.673394: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 0, type : GPU, name: Oland, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE

Now you should be able to run some code that uses the AMD GPU! Here’s a small example using keras (which you need to install):

pip install keras

Here’s a toy regression example (it learns the sin function) that you can use to compare the speed of CPU vs. GPU:

#!/usr/bin/python import tensorflow as tf from keras.models import Sequential from keras.layers import Dense , Activation import numpy as np x = np . arange ( 10000 ) y = np . sin ( x ) for i in ( '/cpu:0' , '/gpu:0' ): with tf . device ( i ): model = Sequential ([ Dense ( 128 , input_shape = ( 1 ,), activation = 'relu' ), Dense ( 1 ), ]) model . compile ( optimizer = 'adam' , loss = 'mse' ) model . fit ( x , y , validation_split = 0.25 , epochs = 10 )

Running this I get the following output:

Using TensorFlow backend. Train on 7500 samples, validate on 2500 samples Epoch 1/10 2018-04-04 16:53:55.766232: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:70] Found following OpenCL devices: 2018-04-04 16:53:55.766265: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 0, type: GPU, name: Oland, vendor: Advanced Micro Devices, Inc., profile: FULL_PROFILE 7500/7500 [==============================] - 0s 53us/step - loss: 29006.8066 - val_loss: 0.5027 Epoch 2/10 7500/7500 [==============================] - 0s 25us/step - loss: 0.5071 - val_loss: 0.5019 Epoch 3/10 7500/7500 [==============================] - 0s 26us/step - loss: 0.5091 - val_loss: 0.5005 Epoch 4/10 7500/7500 [==============================] - 0s 25us/step - loss: 0.5105 - val_loss: 0.5007 Epoch 5/10 7500/7500 [==============================] - 0s 30us/step - loss: 0.5166 - val_loss: 0.5006 Epoch 6/10 7500/7500 [==============================] - 0s 27us/step - loss: 0.5141 - val_loss: 0.6930 Epoch 7/10 7500/7500 [==============================] - 0s 26us/step - loss: 0.5173 - val_loss: 0.5968 Epoch 8/10 7500/7500 [==============================] - 0s 27us/step - loss: 0.5162 - val_loss: 0.5382 Epoch 9/10 7500/7500 [==============================] - 0s 26us/step - loss: 0.5195 - val_loss: 0.5377 Epoch 10/10 7500/7500 [==============================] - 0s 25us/step - loss: 0.5160 - val_loss: 0.5020 Train on 7500 samples, validate on 2500 samples Epoch 1/10 7500/7500 [==============================] - 92s 12ms/step - loss: nan - val_loss: nan Epoch 2/10 7500/7500 [==============================] - 2s 265us/step - loss: nan - val_loss: nan Epoch 3/10 7500/7500 [==============================] - 2s 262us/step - loss: nan - val_loss: nan Epoch 4/10 7500/7500 [==============================] - 2s 268us/step - loss: nan - val_loss: nan Epoch 5/10 7500/7500 [==============================] - 2s 262us/step - loss: nan - val_loss: nan Epoch 6/10 7500/7500 [==============================] - 2s 268us/step - loss: nan - val_loss: nan Epoch 7/10 7500/7500 [==============================] - 2s 268us/step - loss: nan - val_loss: nan Epoch 8/10 7500/7500 [==============================] - 2s 259us/step - loss: nan - val_loss: nan Epoch 9/10 7500/7500 [==============================] - 2s 258us/step - loss: nan - val_loss: nan Epoch 10/10 7500/7500 [==============================] - 2s 264us/step - loss: nan - val_loss: nan

As you can see, in the GPU run, it takes a long time to run the first epoch of the GPU the model, and significantly longer to run each training step (also that the GPU learnt nothing). You can play with the code to see if you can get it to work but waiting ~90s for the model to start the first epoch is a pain for testing.

It might just be I have an old workstation card, or that there is a lot of overhead involved in using SYCL, I hope this helps someone else!