Running Kubeflow on Powerful GPU-Enabled Instances on AWS EKS

Ever since Google released Kubernetes as an open-source container orchestration tool, the project has blossomed in ways its creators might never have imagined. As it gains in popularity, many adjacent projects have emerged. Kubeflow is one of them: it is designed to bring machine learning to Kubernetes containers. The project’s goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures.

The GPU has evolved from a graphics chip into a core component of deep learning and machine learning. CPUs are designed for general-purpose workloads; GPUs are less flexible, but they execute the same instructions across many data elements in parallel, and this parallel architecture is well adapted to vector and matrix operations. Machine learning is a field with intense computational requirements, and the choice of GPU fundamentally shapes the experience. Without a GPU, this might mean months of waiting for an experiment to finish, or running an experiment for a day or more only to discover that the chosen parameters were off and the model diverged.

With the growing number of HPC and AI-powered applications and the broad availability of GPUs in the public cloud, there is a need for open-source Kubernetes to be GPU-aware. With Kubernetes on NVIDIA GPUs, software developers and DevOps engineers can build and deploy GPU-accelerated deep learning training or inference applications to heterogeneous GPU clusters, seamlessly and at scale. GPU support in Kubernetes is provided by the NVIDIA device plugin, which exposes the GPUs on the host to containers.
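The device-plugin mechanism can be sketched in two steps: deploy the plugin as a DaemonSet, after which each GPU node advertises an `nvidia.com/gpu` resource that pods can request like CPU or memory. The plugin version and CUDA image tag below are examples only; check the NVIDIA k8s-device-plugin releases for versions matching your cluster.

```shell
# Deploy the NVIDIA device plugin as a DaemonSet (version tag is an example;
# pick the release that matches your Kubernetes version).
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml

# Once the plugin is running, GPU nodes advertise the nvidia.com/gpu resource,
# and a pod can request GPUs like any other resource:
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # example image tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1   # scheduler places this pod on a node with a free GPU
EOF
```

If the pod completes and `kubectl logs gpu-test` shows the `nvidia-smi` table, the plugin has successfully exposed a host GPU to the container.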

NVIDIA and others provide AMIs for GPU-based accelerated computing instances on AWS. These enable developers to scale model-training performance nearly linearly across EC2 instances, accelerate preprocessing, remove data-transfer bottlenecks, and rapidly improve the quality of their machine learning models. CUDA is a parallel computing platform and programming model developed by NVIDIA for general-purpose computing on GPUs. With CUDA, developers can dramatically speed up computing applications by harnessing the power of GPUs.
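On a GPU instance launched from one of these AMIs, the NVIDIA driver and CUDA toolkit typically come preinstalled, so a minimal kernel can be compiled and run directly. The sketch below (file name `vector_add.cu` is hypothetical) illustrates the CUDA model: one lightweight GPU thread per array element, all executing the same instruction in parallel.

```shell
# Confirm the driver sees the GPU (requires an actual GPU instance).
nvidia-smi

# Write and run a minimal CUDA program.
cat > vector_add.cu <<'EOF'
#include <cstdio>

// Each GPU thread adds one element -- the same-instruction-in-parallel
// model that suits vector and matrix math.
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Managed memory is accessible from both CPU and GPU.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }
    // Launch enough 256-thread blocks to cover all n elements.
    add<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();
    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
EOF
nvcc vector_add.cu -o vector_add && ./vector_add
```

The same kernel launches across thousands of threads without code changes on larger GPUs, which is why frameworks built on CUDA scale so well for training workloads.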