by Alex Chen, Justin Basilico, and Xavier Amatriain

As we have described previously on this blog, at Netflix we are constantly innovating by looking for better ways to find the best movies and TV shows for our members. When a new algorithmic technique such as Deep Learning shows promising results in other domains (e.g. Image Recognition, Neuro-imaging, Language Models, and Speech Recognition), it should not come as a surprise that we would try to figure out how to apply such techniques to improve our product. In this post, we will focus on what we have learned while building infrastructure for experimenting with these approaches at Netflix. We hope that this will be useful for others working on similar algorithms, especially if they are also leveraging the Amazon Web Services (AWS) infrastructure. However, we will not detail how we are using variants of Artificial Neural Networks for personalization, since it is an active area of research.

Many researchers have pointed out that most of the algorithmic techniques used in the trendy Deep Learning approaches have been known and available for some time. Much of the more recent innovation in this area has been around making these techniques feasible for real-world applications. This involves designing and implementing architectures that can execute these techniques using a reasonable amount of resources in a reasonable amount of time. The first successful instance of large-scale Deep Learning made use of 16000 CPU cores in 1000 machines in order to train an Artificial Neural Network in a matter of days. While that was a remarkable milestone, the required infrastructure, cost, and computation time are still not practical.

Andrew Ng and his team addressed this issue in follow up work . Their implementation used GPUs as a powerful yet cheap alternative to large clusters of CPUs. Using this architecture, they were able to train a model 6.5 times larger in a few days using only 3 machines. In another study, Schwenk et al. showed that training these models on GPUs can improve performance dramatically, even when comparing to high-end multicore CPUs.

Given our well-known approach and leadership in cloud computing, we sought out to implement a large-scale Neural Network training system that leveraged both the advantages of GPUs and the AWS cloud. We wanted to use a reasonable number of machines to implement a powerful machine learning solution using a Neural Network approach. We also wanted to avoid needing special machines in a dedicated data center and instead leverage the full, on-demand computing power we can obtain from AWS.

In architecting our approach for leveraging computing power in the cloud, we sought to strike a balance that would make it fast and easy to train Neural Networks by looking at the entire training process. For computing resources, we have the capacity to use many GPU cores, CPU cores, and AWS instances, which we would like to use efficiently. For an application such as this, we typically need to train not one, but multiple models either from different datasets or configurations (e.g. different international regions). For each configuration we need to perform hyperparameter tuning, where each combination of parameters requires training a separate Neural Network. In our solution, we take the approach of using GPU-based parallelism for training and using distributed computation for handling hyperparameter tuning and different configurations.

Distributing Machine Learning: At what level?

Some of you might be thinking that the scenario described above is not what people think of as a distributed Machine Learning in the traditional sense. For instance, in the work by Ng et al. cited above, they distribute the learning algorithm itself between different machines. While that approach might make sense in some cases, we have found that to be not always the norm, especially when a dataset can be stored on a single instance. To understand why, we first need to explain the different levels at which a model training process can be distributed.

In a standard scenario, we will have a particular model with multiple instances. Those instances might correspond to different partitions in your problem space. A typical situation is to have different models trained for different countries or regions since the feature distribution and even the item space might be very different from one region to the other. This represents the first initial level at which we can decide to distribute our learning process. We could have, for example, a separate machine train each of the 41 countries where Netflix operates, since each region can be trained entirely independently.

However, as explained above, training a single instance actually implies training and testing several models, each corresponding to a different combinations of hyperparameters. This represents the second level at which the process can be distributed. This level is particularly interesting if there are many parameters to optimize and you have a good strategy to optimize them, like Bayesian optimization with Gaussian Processes. The only communication between runs are hyperparameter settings and test evaluation metrics.

Finally, the algorithm training itself can be distributed. While this is also interesting, it comes at a cost. For example, training ANN is a comparatively communication-intensive process. Given that you are likely to have thousands of cores available in a single GPU instance, it is very convenient if you can squeeze the most out of that GPU and avoid getting into costly across-machine communication scenarios. This is because communication within a machine using memory is usually much faster than communication over a network.

The following pseudo code below illustrates the three levels at which an algorithm training process like us can be distributed.

for each region -> level 1 distribution

for each hyperparameter combination -> level 2 distribution

train model -> level 3 distribution

end for

end for

In this post we will explain how we addressed level 1 and 2 distribution in our use case. Note that one of the reasons we did not need to address level 3 distribution is because our model has millions of parameters (compared to the billions in the original paper by Ng).

Optimizing the CUDA Kernel

Before we addressed distribution problem though, we had to make sure the GPU-based parallel training was efficient. We approached this by first getting a proof-of-concept to work on our own development machines and then addressing the issue of how to scale and use the cloud as a second stage. We started by using a Lenovo S20 workstation with a Nvidia Quadro 600 GPU. This GPU has 98 cores and provides a useful baseline for our experiments; especially considering that we planned on using a more powerful machine and GPU in the AWS cloud. Our first attempt to train our Neural Network model took 7 hours.

We then ran the same code to train the model in on a EC2’s cg1.4xlarge instance, which has a more powerful Tesla M2050 with 448 cores. However, the training time jumped from 7 to over 20 hours.

Profiling showed that most of the time was spent on the function calls to Nvidia Performance Primitive library, e.g. nppsMulC_32f_I, nppsExp_32f_I. Calling the npps functions repeatedly took 10x more system time on the cg1 instance than in the Lenovo S20.

While we tried to uncover the root cause, we worked our way around the issue by reimplementing the npps functions using the customized cuda kernel, e.g. replace nppsMulC_32f_I function with:

__global__

void KernelMulC(float c, float *data, int n)

{

int i = blockIdx.x * blockDim.x + threadIdx.x;

if (i < n) {

data[i] = c * data[i];

}

}

Replacing all npps functions in this way for the Neural Network code reduced the total training time on the cg1 instance from over 20 hours to just 47 minutes when training on 4 million samples. Training 1 million samples took 96 seconds of GPU time. Using the same approach on the Lenovo S20 the total training time also reduced from 7 hours to 2 hours. This makes us believe that the implementation of these functions is suboptimal regardless of the card specifics.

PCI configuration space and virtualized environments

While we were implementing this “hack”, we also worked with the AWS team to find a principled solution that would not require a kernel patch. In doing so, we found that the performance degradation was related to the NVreg_CheckPCIConfigSpace parameter of the kernel. According to RedHat, setting this parameter to 0 disables very slow accesses to the PCI configuration space. In a virtualized environment such as the AWS cloud, these accesses cause a trap in the hypervisor that results in even slower access.

NVreg_CheckPCIConfigSpace is a parameter of kernel module nvidia-current, that can be set using:

sudo modprobe nvidia-current NVreg_CheckPCIConfigSpace=0

We tested the effect of changing this parameter using a benchmark that calls MulC repeatedly (128x1000 times). Below are the results (runtime in sec) on our cg1.4xlarge instances:

KernelMulC npps_MulC

CheckPCI=1 3.37 103.04

CheckPCI=0 2.56 6.17

As you can see, disabling accesses to PCI space had a spectacular effect in the original npps functions, decreasing the runtime by 95%. The effect was significant even in our optimized Kernel functions saving almost 25% in runtime. However, it is important to note that even when the PCI access is disabled, our customized functions performed almost 60% better than the default ones.

We should also point out that there are other options, which we have not explored so far but could be useful for others. First, we could look at optimizing our code by applying a kernel fusion trick that combines several computation steps into one kernel to reduce the memory access. Finally, we could think about using Theano, the GPU Match compiler in Python, which is supposed to also improve performance in these cases.

G2 Instances

While our initial work was done using cg1.4xlarge EC2 instances, we were interested in moving to the new EC2 GPU g2.2xlarge instance type, which has a GRID K520 GPU (GK104 chip) with 1536 cores. Currently our application is also bounded by GPU memory bandwidth and the GRID K520‘s memory bandwidth is 198 GB/sec, which is an improvement over the Tesla M2050’s at 148 GB/sec. Of course, using a GPU with faster memory would also help (e.g. TITAN’s memory bandwidth is 288 GB/sec).

We repeated the same comparison between the default npps functions and our customized ones (with and without PCI space access) on the g2.2xlarge instances.

KernelMulC npps_MulC

CheckPCI=1 2.01 299.23

CheckPCI=0 0.97 3.48

One initial surprise was that we measured worse performance for npps on the g2 instances than the cg1 when PCI space access was enabled. However, disabling it improved performance between 45% and 65% compared to the cg1 instances. Again, our KernelMulC customized functions are over 70% better, with benchmark times under a second. Thus, switching to G2 with the right configuration allowed us to run our experiments faster, or alternatively larger experiments in the same amount of time.

Distributed Bayesian Hyperparameter Optimization

Once we had optimized the single-node training and testing operations, we were ready to tackle the issue of hyperparameter optimization. If you are not familiar with this concept, here is a simple explanation: Most machine learning algorithms have parameters to tune, which are called often called hyperparameters to distinguish them from model parameters that are produced as a result of the learning algorithm. For example, in the case of a Neural Network, we can think about optimizing the number of hidden units, the learning rate, or the regularization weight. In order to tune these, you need to train and test several different combinations of hyperparameters and pick the best one for your final model. A naive approach is to simply perform an exhaustive grid search over the different possible combinations of reasonable hyperparameters. However, when faced with a complex model where training each one is time consuming and there are many hyperparameters to tune, it can be prohibitively costly to perform such exhaustive grid searches. Luckily, you can do better than this by thinking of parameter tuning as an optimization problem in itself.

One way to do this is to use a Bayesian Optimization approach where an algorithm’s performance with respect to a set of hyperparameters is modeled as a sample from a Gaussian Process. Gaussian Processes are a very effective way to perform regression and while they can have trouble scaling to large problems, they work well when there is a limited amount of data, like what we encounter when performing hyperparameter optimization. We use package spearmint to perform Bayesian Optimization and find the best hyperparameters for the Neural Network training algorithm. We hook up spearmint with our training algorithm by having it choose the set of hyperparameters and then training a Neural Network with those parameters using our GPU-optimized code. This model is then tested and the test metric results used to update the next hyperparameter choices made by spearmint.

We’ve squeezed high performance from our GPU but we only have 1–2 GPU cards per machine, so we would like to make use of the distributed computing power of the AWS cloud to perform the hyperparameter tuning for all configurations, such as different models per international region. To do this, we use the distributed task queue Celery to send work to each of the GPUs. Each worker process listens to the task queue and runs the training on one GPU. This allows us, for example, to tune, train, and update several models daily for all international regions.

Although the Spearmint + Celery system is working, we are currently evaluating more complete and flexible solutions using HTCondor or StarCluster. HTCondor can be used to manage the workflow of any Directed Acyclic Graph (DAG). It handles input/output file transfer and resource management. In order to use Condor, we need each compute node register into the manager with a given ClassAd (e.g. SLOT1_HAS_GPU=TRUE; STARD_ATTRS=HAS_GPU). Then the user can submit a job with a configuration “Requirements=HAS_GPU” so that the job only runs on AWS instances that have an available GPU. The main advantage of using Condor is that it also manages the distribution of the data needed for the training of the different models. Condor also allows us to run the Spearmint Bayesian optimization on the Manager instead of having to run it on each of the workers.

Another alternative is to use StarCluster , which is an open source cluster computing framework for AWS EC2 developed at MIT. StarCluster runs on the Oracle Grid Engine (formerly Sun Grid Engine) in a fault-tolerant way and is fully supported by Spearmint.

Finally, we are also looking into integrating Spearmint with Jobman in order to better manage the hyperparameter search workflow.

Figure below illustrates the generalized setup using Spearmint plus Celery, Condor, or StarCluster: