One of the really great things about Tensorflow is how easy it makes it to offload computations to the GPU. Tensorflow can do this more or less automatically if you have an Nvidia GPU and the CUDA tools and libraries installed. But just because Tensorflow offloads computations to the GPU doesn't mean you'll get good performance. In fact, it's not uncommon to get significantly worse performance when using a GPU than you would if you ran your compute graphs on the CPU.

There are two main reasons that using a GPU can be slower than the CPU:

1. Launching CUDA kernels has a higher baseline overhead than launching CPU kernels, by about 5x.

2. Naïve programs may end up transferring a large amount of data back and forth between main memory and GPU memory during each training epoch.

Writing big, expensive network models is easy, so usually the first point isn't the problem. It's much more common to run into problems where data is unnecessarily being copied back and forth between main memory and GPU memory. This is the same problem that OpenGL programmers have faced for years: copying vertex data between main memory and the GPU is expensive, so a big part of writing high performance OpenGL code is figuring out how to keep vertex data on the GPU.

There are two ways to copy NumPy arrays from main memory into GPU memory:

1. You can pass the array to a Tensorflow session using a feed_dict.

2. You can use tf.constant() to load the array into a tf.Tensor.

Most of the models and tutorials you'll find online use the first approach, copying the data using a feed_dict. This copies the data from main memory to the GPU every time the session runs. For huge datasets that can't entirely fit onto the GPU, this is often fine. For instance, if you have hundreds of gigabytes of image or video data, your dataset will vastly exceed the available space in the GPU, so it's easy to fill the GPU with each mini-batch. Furthermore, contemporary CNNs are quite deep and are fairly expensive to run, so the memory transfer overhead is low compared to how long the compute graph takes to run.

However, if you're dealing with smaller datasets that can fit entirely in GPU memory (e.g. with text or numeric datasets), you can get much better performance using tf.constant() to pin your dataset into GPU memory. The problem with doing this is that neural networks tend to overfit training data unless the training data is split up into mini-batches, so reusing the same tf.constant() as a single full-size batch for each training epoch will lead to poor generalization.

After a lot of internet sleuthing, I found a cryptic StackOverflow answer suggesting a clever solution to this problem: load the entire dataset using tf.constant(), and then use tf.slice() to grab mini-batches from the constant. For instance, let's say you have an Nvidia GPU with 8 GB of memory, and your dataset is smaller than 8 GB. During training, you want to split the dataset into 100 mini-batches. The idea is that on each training step you would pass the slice indexes into the session via a feed_dict, and then the compute graph you've written would use tf.slice() to generate the mini-batch. Using this approach requires only sending the slice indexes via a feed_dict, which will be small scalar values. The idea is really elegant, but I found actually figuring out how to implement this to be kind of tricky.
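To get a feel for the difference in transfer size, here is a rough back-of-the-envelope sketch. The dataset shape and batch count are made-up numbers, not measurements, and the sketch only counts payload bytes; it ignores any fixed per-call overhead.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
ROWS, COLS, NUM_BATCHES = 100_000, 50, 100

data = np.zeros((ROWS, COLS), dtype=np.float32)
rows_per_batch = ROWS // NUM_BATCHES

# Bytes sent per step when the mini-batch itself goes through a feed_dict.
batch_bytes = data[:rows_per_batch].nbytes

# Bytes sent per step when only a scalar int32 batch index is fed.
index_bytes = np.int32(0).nbytes

print('per-step payload: batch = {} bytes, index = {} bytes'.format(
    batch_bytes, index_bytes))
```

With these assumed sizes, each step would move 200,000 bytes when feeding the batch, versus 4 bytes when feeding only the index.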

Example Code

I'm going to demonstrate this technique with a small Python 3 program that generates mini-batches from a tf.constant() using the tf.slice() operator. I've also created a GitHub repo with the full code for this example, if you want something you can download and actually run locally.

To keep the code simple, we're going to write a Tensorflow compute graph that applies a simple numeric operation to a small 10⨯3 matrix:

    import numpy as np

    # Height of our input data.
    HEIGHT = 10

    # The size of each mini-batch.
    BATCH_SIZE = 2

    # Create a 10x3 matrix in numpy; this lives in main memory (*not* the GPU).
    np_data = np.array(range(30), dtype=np.float32).reshape(10, 3)

The code above will create a NumPy array called np_data that looks like this:

    # Contents of np_data.
    array([[  0.,   1.,   2.],
           [  3.,   4.,   5.],
           [  6.,   7.,   8.],
           [  9.,  10.,  11.],
           [ 12.,  13.,  14.],
           [ 15.,  16.,  17.],
           [ 18.,  19.,  20.],
           [ 21.,  22.,  23.],
           [ 24.,  25.,  26.],
           [ 27.,  28.,  29.]], dtype=float32)

For this demo our mini-batches will have size 2, meaning that they will be 2⨯3 matrices. The first mini-batch would be equivalent to np_data[:2], the second mini-batch would be equivalent to np_data[2:4], and so on. Of course, we won't actually be using NumPy slicing; instead we'll be using Tensorflow operators.

The next step is to copy np_data into Tensorflow's data graph. Tensorflow will automatically use a GPU if available, but you can also use a tf.device() context to force the location.

    import tensorflow as tf

    # Copy the numpy data into TF memory as a constant var; this will be copied
    # exactly one time into the GPU (if one is available).
    tf_data = tf.constant(np_data, dtype=tf.float32)

Generating a mini-batch is done by supplying a batch index via a placeholder called ix, and then a mini-batch is generated using tf.slice() with the batch index:

    # The index to use when generating our mini-batch.
    ix = tf.placeholder(shape=(), dtype=tf.int32)

    # The mini-batch of data we'll work on.
    batch = tf.slice(tf_data, [BATCH_SIZE * ix, 0], [BATCH_SIZE, -1])

I found the documentation for tf.slice() to be pretty confusing, so I'll explain here in plain English how it works. The begin argument, which is [BATCH_SIZE * ix, 0] in the code above, is the index of the upper-left corner of the slice we're creating. The index is multiplied by BATCH_SIZE because the ix values are in the range 0 to 4, so they need to be scaled to get the true offset into the matrix. The size argument, which is [BATCH_SIZE, -1] in the code above, says how many rows to go down and how many columns to go right. The special value -1 means "all columns"; I could have also used 3 here, since that's the width of the matrix.
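If it helps to see the begin and size semantics without Tensorflow in the picture, here is a minimal NumPy sketch of the same idea. The helper slice_2d is a hypothetical stand-in I'm using for illustration; it is not part of NumPy or Tensorflow, and it only handles 2-D arrays.

```python
import numpy as np

def slice_2d(data, begin, size):
    """Hypothetical NumPy stand-in for tf.slice() on a 2-D array."""
    stops = []
    for axis, (b, s) in enumerate(zip(begin, size)):
        # A size of -1 means "everything from begin to the end of this axis".
        stops.append(data.shape[axis] if s == -1 else b + s)
    return data[begin[0]:stops[0], begin[1]:stops[1]]

BATCH_SIZE = 2
np_data = np.arange(30, dtype=np.float32).reshape(10, 3)

# Batch index 1 starts at row BATCH_SIZE * 1 = 2 and spans 2 rows, all columns.
ix = 1
batch = slice_2d(np_data, [BATCH_SIZE * ix, 0], [BATCH_SIZE, -1])
print(batch)
```

For ix = 1 this picks out the same rows as np_data[2:4], which is the second mini-batch described earlier.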

The value we're going to calculate with our compute graph is the sum of the squares of the values in our mini-batch:

    # The output of the Tensorflow graph.
    outp = tf.reduce_sum(tf.square(batch))
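As a sanity check, the same quantity can be computed with plain NumPy for each of the five mini-batches. This is only a reference calculation to compare against whatever the Tensorflow graph prints; it doesn't touch the GPU at all.

```python
import numpy as np

BATCH_SIZE = 2
np_data = np.arange(30, dtype=np.float32).reshape(10, 3)

# Sum of squares for each mini-batch, keyed by batch index.
expected = {}
for ix in range(len(np_data) // BATCH_SIZE):
    batch = np_data[BATCH_SIZE * ix : BATCH_SIZE * (ix + 1)]
    expected[ix] = float(np.sum(np.square(batch)))

print(expected)
# {0: 55.0, 1: 451.0, 2: 1279.0, 3: 2539.0, 4: 4231.0}
```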

For this demonstration, we'll train for 100 epochs, running the compute graph once per mini-batch in each epoch. We'll also shuffle the batch order. In this example the dataset is just the numbers 0 through 29 and the graph has nothing to learn, so shuffling isn't necessary. However, in a real neural network shuffling the mini-batch order is helpful since it helps fight any locality patterns in the input data (e.g. if earlier batches tend to have small numeric values, and later batches tend to have larger numeric values). Shuffling the data this way can help combat overfitting:

    import random

    # Number of epochs to train for.
    EPOCHS = 100

    # Shuffle the indexes of mini-batches, so that the mini-batches are generated
    # in a random order. This helps break locality in the structure of the training
    # dataset, which can help with overfitting.
    INDEXES = list(range(HEIGHT // BATCH_SIZE))
    random.shuffle(INDEXES)

The training loop is very simple. All it does is pass the batch index (a single 32-bit integer) into a Tensorflow session:

    # Create and initialize a TF session.
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(EPOCHS):
            for i in INDEXES:
                # Run the computation. The only data in the feed_dict is a single
                # 32-bit integer we supply here. All of the data needed for the
                # mini-batch already lives in GPU memory, and doesn't need to be
                # copied from main memory.
                b, o = sess.run([batch, outp], feed_dict={ix: i})
                print('epoch = {}, ix = {}'.format(epoch, i))
                print('batch: {}'.format(b))
                print('output: {}'.format(o))
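Note that the loop above shuffles the batch indexes once and then reuses that same order in every epoch. If you'd rather have a fresh order each epoch, one possible sketch is below. The epoch_orders helper and the fixed seed are my own assumptions for illustration, and EPOCHS is kept small so the output stays readable.

```python
import random

HEIGHT = 10
BATCH_SIZE = 2
EPOCHS = 3  # kept small here so the output is easy to read

def epoch_orders(num_batches, epochs, seed=0):
    """Yield (epoch, shuffled batch indexes), reshuffling every epoch."""
    rng = random.Random(seed)  # hypothetical fixed seed for repeatability
    indexes = list(range(num_batches))
    for epoch in range(epochs):
        rng.shuffle(indexes)
        # Yield a copy so later shuffles don't mutate earlier results.
        yield epoch, list(indexes)

orders = list(epoch_orders(HEIGHT // BATCH_SIZE, EPOCHS))
for epoch, order in orders:
    print('epoch = {}, batch order = {}'.format(epoch, order))
```

Each index in the yielded order would then be passed to the session through the feed_dict exactly as before, so the amount of data transferred per step doesn't change.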

There are a lot of variations that you can make on this same theme. Here are a few I thought of while writing this post:

If the number of input records isn't evenly divisible by the batch size, the final mini-batch will be smaller than the other ones. The easiest way to handle this case is by supplying two index parameters (start and length).

For models where the training data set is way too large to fit in memory, but the size of the labels is small, you can pin just the training labels in GPU memory. This is a somewhat common access pattern with image or video classification tasks.

When the training set is too large for GPU memory, but the mini-batch sizes are relatively small, you could try filling the GPU with multiple mini-batches at once rather than copying one mini-batch on each training step. This might help amortize the transfer time.
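For the first variation above (a final mini-batch that's smaller than the rest), here is a small sketch of how the start and length values could be computed on the Python side. The batch_bounds helper is a hypothetical name of my own, not something from Tensorflow.

```python
def batch_bounds(num_rows, batch_size):
    """Return (start, length) pairs covering num_rows; the last may be short."""
    bounds = []
    for start in range(0, num_rows, batch_size):
        # The final batch only gets whatever rows are left over.
        bounds.append((start, min(batch_size, num_rows - start)))
    return bounds

# 10 rows split into batches of 4 leaves a final batch of 2 rows.
bounds = batch_bounds(10, 4)
print(bounds)  # [(0, 4), (4, 4), (8, 2)]
```

In the compute graph, I would expect these to be fed through two scalar int32 placeholders (say, start and length) and used as tf.slice(tf_data, [start, 0], [length, -1]), though I haven't shown that part here.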

Because this technique tends to make designing models more complicated, I would suggest implementing it only after you're satisfied with the basic structure of your model. That's the best time to start looking at optimizing training times, and that's when I would consider employing this technique.