DeepDream visualizations of Mt. Pilatus in Switzerland

The exciting DeepGramAI Hackathon just concluded, and I wanted to share some of the cool things John Henning and myself built this weekend! Other cool projects at the Hackathon ranged from a speech to math decoder, mind reading with Keras + EEG data, and a semantic parser for image segmentation.

My goal for this hackathon was to live process and augment video via DeepDream image generation. This project would have application in augmented reality gaming or live-processing of videos with DeepDream features.

Using consumer hardware , it takes ~45 to 90 seconds to process a single image with TensorFlow DeepDream on CPU alone, and ~5–30 seconds utilizing a graphical processing unit (GPU). In order to accelerate image processing to >15 frames per second, we first benchmarked the effect of different hardware components on DeepDream performance.

Benchmarking DeepDream

We first compared performance of DeepDream on processing a single image, using TensorFlow 1.01 in Python (code found here):

Model: Inception 5h

Inception 5h Extract Layer: mixed4d_3x3_bottleneck_pre_relu

mixed4d_3x3_bottleneck_pre_relu Parameters: t_obj_filter= 139 , iter_n=10, step=1.5, octave_n=4, octave_scale=1.4

Our original test was on the Google Cloud Platform (GCP) using an n1-standard-2 instance with 2CPU, 8GB of memory and 1 NVIDIA K80 GPU. All tests used NVIDIA parallel computing platform CUDA, as well as the CUDA® Deep Neural Network library (cuDNN). In the blue columns you can see that with CPU only, it took longer than 1 minute to process a single image. As expected, using a K80 GPU on the n1-standard-2 instance speed up processing. With this configuration it took ~30 seconds to process a single image. We were then able to accelerate this performance to less than 20 seconds using a larger instance type, the n1-standard-4 that comes with 4 CPUs, 15 GB RAM and the same NVIDIA K80 GPU.

We next tested if TensorFlow’s new Acceleration Linear Algebra (XLA) compiler would increase performance. To do this, we compiled the TensorFlow r1.0 branch from source with the XLA and CUDA options on the n1-standard-4 instance. I was not sure what to expect, and was pleasantly surprised that with the GPU we were able to process an image in ~10 seconds (~60% increase in speed with XLA).

Our last step was using our own custom hardware with 4 cores, 64GB RAM, and running two of the new 1080Ti NVIDIA GPU Cards. This rig was built by my coworkers at Silicon Valley Data Science (SVDS), Paul Ho and Jeff Lam. I would definitely recommend you reach out to these hardware magicians if you have any questions about setting up your own multi-GPU hardware! Below is a picture of the machine with 3 cards installed (our test was run with the two 1080Ti cards with the SLI bridge).

Using our custom hardware with XLA enabled, we observed processing of single images in 6.5 seconds, a significant improvement! In the future we plan to repeat our multi-GPU test on GCP and Amazon Web Services (AWS) to confirm our findings on our custom hardware.

Adjusting DeepDream Parameters

After we explored how scaling hardware could accelerate processing, we next explored how changing parameters in the DeepDream algorithm could change speed and output image quality. To do this, we wrapped the render_deepdream function provided in the TensorFlow examples folder. We chose several parameters to experiment with including the object filter from the mixed4d_3x3_bottleneck_pre_relu layer, processing iterations per image, number of image scales to be analyzed (octaves), and the relative size of the features (octave scale).

We found that decreasing iteration and octave number reduces processing time while reducing image quality. At 5 iterations and 1 octave our processing time dropped to ~2.5 seconds, but our images’ DeepDream features were noticeably less expressive. While that finding was expected, we did not find that processing time was changed by t_obj_filter number. I have included examples outputs below to demonstrate the variety of images that can be generated while holding all parameters except for t_obj_filter constant: