Authors: Hagay Lupesko, Alexander Zai, Manu Seth

The Apache MXNet community is excited to announce that MXNet performance on CPUs is now dramatically improved through the integration of Intel MKL-DNN into the default MXNet build. The improvements are significant across a wide range of model architectures: inference latency and throughput improve by between 3x and 35x, depending on the model architecture and CPU. More performance stats are listed later in the post.

To benefit from these optimizations, just grab the latest version of MXNet for Java or Scala. Alternatively, follow the default steps to build MXNet from source; no further customization, installation, or specialized action is needed!

What is Apache MXNet

Apache MXNet is an open-source deep learning framework used to build, train, and deploy deep neural networks. MXNet abstracts much of the complexity involved in implementing neural networks, is highly performant and scalable, and offers APIs across popular programming languages such as Python, C++, Java, R, Scala, and more.

What is Intel MKL-DNN

Intel Math Kernel Library for Deep Neural Networks (MKL-DNN) is an open-source library for high-performance deep learning on CPUs. The library accelerates neural network applications through vectorized and threaded building blocks, exposed via a C++ interface. MKL-DNN source code is available on GitHub.

Much of MKL-DNN’s optimization magic is achieved through data parallelism, also known as SIMD: Single Instruction, Multiple Data. Modern x86 processors support Advanced Vector Extensions (AVX), which extend the x86 instruction set with SIMD capabilities and enable the processor to perform an operation on multiple numbers at the same time, as opposed to one number at a time. The more recent AVX-512 extensions, supported on Intel’s Skylake server processors, can operate on up to 16 32-bit floating-point numbers in a single instruction!
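To make the arithmetic behind that number concrete, here is a minimal Python sketch. It is not MKL-DNN code and it does not execute any SIMD instructions; it only counts how many 32-bit floats fit into a vector register of a given width, and how many instructions an idealized element-wise operation would need. The function names, the 1024-element array size, and the "scalar" baseline of one float per instruction are illustrative assumptions; the register widths (128-bit SSE, 256-bit AVX/AVX2, 512-bit AVX-512) are the standard x86 ones.

```python
import math


def simd_lanes(register_bits: int, element_bits: int = 32) -> int:
    """Number of elements of the given width that fit in one vector register."""
    return register_bits // element_bits


def vector_instructions(num_elements: int, register_bits: int,
                        element_bits: int = 32) -> int:
    """Idealized instruction count for one element-wise operation.

    Assumes every instruction fills the whole register, and ignores loads,
    stores, and memory bandwidth (a simplification for illustration).
    """
    return math.ceil(num_elements / simd_lanes(register_bits, element_bits))


# Register widths in bits. "scalar" is a hypothetical baseline that handles
# a single 32-bit float per instruction.
REGISTER_WIDTHS = {"scalar": 32, "SSE": 128, "AVX/AVX2": 256, "AVX-512": 512}

if __name__ == "__main__":
    num_floats = 1024  # illustrative array size
    for name, bits in REGISTER_WIDTHS.items():
        print(f"{name}: {simd_lanes(bits)} floats per instruction, "
              f"{vector_instructions(num_floats, bits)} instructions "
              f"for {num_floats} floats")
```

In this idealized model, AVX-512 needs one sixteenth of the instructions of the scalar baseline. Real-world speed-ups depend on many other factors, such as memory bandwidth, threading, and the operator mix of the model.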

Performance improvements at a glance

As part of the effort to integrate MKL-DNN into the default build, the MXNet community ran exhaustive inference benchmarks across Intel and AMD CPUs. For the full details of these benchmarks, visit the MXNet Wiki.

Below are a few diagrams illustrating the inference performance improvements across a few model architectures, benchmarked on Intel Xeon Skylake-SP (AWS c5.18xlarge).