Abstracts

MLIR, the Multi-Level Intermediate Representation compiler infrastructure, was announced at C4ML last year and has since become an official LLVM subproject. It continues to grow as an open community of academic and industry collaborators building common infrastructure for compilers that operate on high-level abstractions. In this talk I will focus on uses of MLIR specific to machine learning, and in particular the TensorFlow ecosystem. The talk will go beyond current TensorFlow usage of MLIR, covering infrastructure work to enable future use cases (such as ongoing work on dynamic pattern rewrites) and highlighting some community-driven efforts (in particular the tensor compute working group).

Note: At CC on Saturday, Albert Cohen will explore "IR Design for Heterogeneity: Challenges and Opportunities" in general, using the MLIR rationale as a particular illustration, and at CGO on Wednesday, Chris Lattner and Tatiana Shpeisman will discuss the "MLIR Compiler Infrastructure".

Modern compilers are still built using technology that existed decades ago, including basic algorithms and techniques for lexing, parsing, data-flow analysis, data dependence analysis, vectorization, register allocation, instruction selection, and instruction scheduling. It is high time that we modernize our compiler toolchain.

In this talk, I will show a path to modernizing one important compiler technique: vectorization. Vectorization was first introduced in the era of Cray vector processors during the 1980s.

In modernizing vectorization, I will first show how to use new techniques that better target modern hardware. While vector supercomputers need large vectors, which are only available by parallelizing loops, modern SIMD instructions work efficiently on short vectors. Thus, in 2000, we introduced Superword Level Parallelism (SLP) based vectorization. SLP finds short vector instructions within basic blocks, and by unrolling loops we can convert loop-level vector parallelism into SLP.
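
The idea of unrolling a loop and then packing isomorphic statements can be illustrated with a toy sketch. This is not the SLP algorithm from the talk or LLVM's implementation: the `Stmt` record, the `unroll` and `pack` helpers, and the assumption that all statements in the block are independent are hypothetical simplifications made only for illustration.

```python
from typing import NamedTuple


class Stmt(NamedTuple):
    """Scalar statement dst[index] = lhs[index] op rhs[index] (hypothetical toy IR)."""
    dst: str
    lhs: str
    op: str
    rhs: str
    index: int


def unroll(dst, lhs, op, rhs, start, factor):
    """Unroll `factor` iterations of a loop body into one straight-line block."""
    return [Stmt(dst, lhs, op, rhs, start + k) for k in range(factor)]


def pack(block, width):
    """Greedily group isomorphic statements on adjacent indices into packs.

    Assumes every statement in `block` is independent of the others, so
    reordering them is safe. A real vectorizer must check dependences.
    """
    ordered = sorted(block, key=lambda s: (s.dst, s.lhs, s.op, s.rhs, s.index))
    packs, scalars = [], []
    i = 0
    while i < len(ordered):
        group = ordered[i:i + width]
        first = group[0]
        isomorphic = len(group) == width and all(
            (s.dst, s.lhs, s.op, s.rhs) == (first.dst, first.lhs, first.op, first.rhs)
            and s.index == first.index + k
            for k, s in enumerate(group)
        )
        if isomorphic:
            packs.append(group)      # one short-vector instruction
            i += width
        else:
            scalars.append(first)    # stays a scalar instruction
            i += 1
    return packs, scalars


# Four unrolled iterations of a[i] = b[i] + c[i], plus one unrelated statement.
block = unroll("a", "b", "+", "c", 0, 4) + [Stmt("x", "y", "*", "z", 7)]
packs, scalars = pack(block, width=4)
```

Here the four unrolled additions form a single pack of width 4, while the unrelated multiplication remains scalar.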

Next, I will show how we can take advantage of the power of modern computers for compilation, by using more accurate but expensive techniques to improve SLP vectorization. Because of the hardware resource constraints of the era, SLP, like many other compiler optimizations, was implemented as a greedy algorithm. In 2018, we introduced goSLP, which uses integer linear programming to find an optimal instruction packing strategy and achieves a 7.58% geomean performance improvement over LLVM’s SLP implementation on SPEC2017fp C/C++ programs.

Finally, I will show how to truly modernize a compiler by automatically learning the necessary components of the compiler with Ithemal and Vemal. goSLP is optimal only with respect to LLVM’s simple per-instruction additive cost model, which fits within the integer programming framework. However, the actual cost of execution on a modern out-of-order, pipelined, superscalar processor is much more complex. Manually building such cost models and manually developing compiler optimizations is costly, tedious, and error-prone, and it is hard to keep up with architectural changes. Ithemal is the first learnt cost model for predicting the throughput of x86 basic blocks. It not only significantly outperforms state-of-the-art hand-written analytical tools like llvm-mca, more than halving the error, but is also learnt from data with minimal human effort. Vemal is a learnt policy for end-to-end vectorization, as opposed to tuned heuristics, and it outperforms LLVM’s SLP vectorizer. These data-driven techniques can help achieve state-of-the-art results while also reducing the development and maintenance burden on the compiler developer.

Apache (incubating) TVM is an open-source deep learning compiler stack for CPUs, GPUs, and specialized accelerators. It aims to close the gap between productivity-focused deep learning frameworks and performance- or efficiency-oriented hardware backends.

In this talk, I will highlight some of TVM's developments over the past year, including work in the areas of dynamic computation, TinyML, core infrastructure, and accelerator support. I will also briefly talk about our recent efforts on the unified IR and runtime.

Prior to the release of MLIR, PlaidML used purpose-built IRs and custom C++ passes to perform the optimizations necessary for a performant ML compiler. MLIR provides us with an excellent compiler infrastructure for PlaidML, and we have begun porting our core optimizations into MLIR, particularly MLIR's Affine dialect. This presentation will discuss why we decided the benefits of MLIR and its ecosystem outweighed the costs of dropping our existing custom IR and porting to MLIR. We will also discuss challenges and lessons from this process, and provide an example of how the PlaidML compiler utilizes MLIR in the "jigsaw" pass. This pass is an optimization for almost-hyperrectangular regions which splits affine.parallel loops with constraints into multiple loops to minimize the number of iterations that must check the constraint.
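
The loop-splitting idea behind such a pass can be sketched on a one-dimensional tiled loop. This is only an illustration of the general technique, not PlaidML's "jigsaw" pass or MLIR code: the function names, the tiling scheme, and the single constraint `i < n` are assumptions chosen to keep the example small.

```python
def tiled_with_checks(n, tile, body):
    """Baseline: every iteration of the padded, tiled loop tests i < n."""
    num_tiles = -(-n // tile)  # ceiling division
    checks = 0
    for t in range(num_tiles):
        for k in range(tile):
            i = t * tile + k
            checks += 1
            if i < n:
                body(i)
    return checks


def tiled_split(n, tile, body):
    """Split form: full tiles skip the test; only the ragged tile keeps it."""
    full_tiles = n // tile
    num_tiles = -(-n // tile)
    checks = 0
    # Interior loop: the constraint i < n holds for every iteration.
    for t in range(full_tiles):
        for k in range(tile):
            body(t * tile + k)
    # Boundary loop: the constraint is still required here.
    for t in range(full_tiles, num_tiles):
        for k in range(tile):
            i = t * tile + k
            checks += 1
            if i < n:
                body(i)
    return checks


visited_baseline, visited_split = [], []
baseline_checks = tiled_with_checks(10, 4, visited_baseline.append)
split_checks = tiled_split(10, 4, visited_split.append)
```

With `n = 10` and a tile size of 4, both versions visit the same ten indices, but the split version evaluates the constraint only for the four iterations of the last, partially filled tile.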

Existing deep neural network (DNN) frameworks optimize the computation graph of a DNN by applying graph transformations manually designed by human experts. This approach requires significant manual effort, misses possible graph optimizations, and is difficult to scale, as new DNN operators are introduced on a regular basis.

In this talk, I will present TASO (https://github.com/jiazhihao/TASO), the first DNN computation graph optimizer that automatically generates graph substitutions. TASO takes as input a list of operator specifications and generates candidate substitutions using the given operators as basic building blocks. All generated substitutions are formally verified against the operator specifications using an automated theorem prover. To optimize a given DNN computation graph, TASO performs a cost-based backtracking search, applying the verified substitutions to find an optimized graph, which can be directly used by existing DNN frameworks. TASO outperforms existing DNN graph optimizers by up to 2.8x, while requiring significantly less human effort. For example, TensorFlow currently contains approximately 53,000 lines of manual optimization rules, while the operator specifications needed by TASO are only 1,400 lines of code.
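
A cost-based search over substitutions can be sketched as follows. This is a toy model and not TASO's implementation: the tuple-based expression format, the per-operator costs, the two hand-written rules (assumed to be already verified), and the pruning factor `alpha` are all hypothetical choices made for illustration.

```python
import heapq
import itertools

COST = {"matmul": 10, "add": 1}  # hypothetical per-operator costs


def cost(e):
    """Total cost of an expression; leaves (tensor names) are free."""
    if isinstance(e, str):
        return 0
    op, *args = e
    return COST[op] + sum(cost(a) for a in args)


def factor_matmul(e):
    """matmul(A, B) + matmul(A, C) -> matmul(A, B + C)."""
    if (not isinstance(e, str) and e[0] == "add"
            and all(not isinstance(a, str) and a[0] == "matmul" for a in e[1:])
            and e[1][1] == e[2][1]):
        return ("matmul", e[1][1], ("add", e[1][2], e[2][2]))
    return None


def commute_add(e):
    """x + y -> y + x."""
    if not isinstance(e, str) and e[0] == "add":
        return ("add", e[2], e[1])
    return None


def rewrites(e, rules):
    """Yield every expression reachable by one rule at one position."""
    if isinstance(e, str):
        return
    for rule in rules:
        out = rule(e)
        if out is not None:
            yield out
    op, *args = e
    for i, a in enumerate(args):
        for new_a in rewrites(a, rules):
            yield (op, *args[:i], new_a, *args[i + 1:])


def optimize(expr, rules, alpha=1.05):
    """Explore cheapest candidates first, pruning those above alpha * best."""
    tie = itertools.count()  # tiebreaker so the heap never compares expressions
    best, seen = expr, {expr}
    queue = [(cost(expr), next(tie), expr)]
    while queue:
        c, _, e = heapq.heappop(queue)
        if c < cost(best):
            best = e
        for cand in rewrites(e, rules):
            if cand not in seen and cost(cand) <= alpha * cost(best):
                seen.add(cand)
                heapq.heappush(queue, (cost(cand), next(tie), cand))
    return best


expr = ("add", ("matmul", "A", "B"), ("matmul", "A", "C"))
best = optimize(expr, [factor_matmul, commute_add])
```

In this example the search replaces two matrix multiplications with one, lowering the toy cost from 21 to 11. Allowing candidates slightly worse than the current best (`alpha > 1`) lets the search pass through intermediate graphs that only pay off after a later substitution.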

With continually changing deep learning (DL) frameworks and a diverse range of hardware devices, it becomes increasingly difficult to link DL frameworks to new hardware accelerators in a scalable, performant fashion. nGraph is an open-source graph-based DL compiler that serves as a common intermediate abstraction layer between frameworks and hardware backends. Through a unified intermediate representation (IR), a common set of analyses and optimizations, and an extensible runtime, nGraph aims to reduce engineering cost, encourage code reuse, and accelerate performance.

In this talk, we will give a brief overview of nGraph and share our progress on OpenVINO integration and support for dynamic tensor shapes. We will also discuss our latest work on enabling MLIR, an extensible compiler framework, and our experience defining a tensor-level dialect. Finally, we will motivate the case for a common MLIR DL/ML dialect that various graph-based compilers can converge on.

Saeed Maleki, Microsoft: "Adasum: Semantics Preserving Large Batch Gradient Descent"

Stochastic Gradient Descent (SGD) is the workhorse of training machine learning models. SGD is very popular because of its few hyperparameters and its robustness. However, it is inherently sequential and hard to parallelize. Mini-batch parallelism can help, but larger batch sizes come at the cost of a growing loss in convergence rate. Furthermore, every mini-batch size requires separately tuning the learning rate. Other optimizers, such as Momentum, Adam and LAMB, have been shown to converge faster than SGD at the cost of having even more hyperparameters to tune. This only makes parallelization more cumbersome.

In this paper, we introduce Adasum, a new technique to reduce multiple gradients such that the convergence rate degradation is minimized without changing any of the hyperparameters. Adasum exploits the empirical observation that gradients are parallel to each other at the beginning of training and later become orthogonal. Our experiments show that Adasum enables efficient training at an unprecedented scale. In particular, we demonstrate training of ResNet-50 and BERT models converging with significantly larger batch sizes than shown before.
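
The abstract does not state the combination rule itself. The sketch below assumes the pairwise form commonly given for Adasum, in which each gradient is scaled by one minus half of its normalized projection onto the other; treat the formula, the helper names, and the zero-gradient fallback as assumptions rather than the authors' exact implementation.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def adasum_pair(g1, g2):
    """Combine two gradients with the assumed pairwise Adasum rule:

        (1 - g1.g2 / (2 |g1|^2)) * g1 + (1 - g1.g2 / (2 |g2|^2)) * g2
    """
    d = dot(g1, g2)
    n1, n2 = dot(g1, g1), dot(g2, g2)
    if n1 == 0 or n2 == 0:
        # Assumption: fall back to a plain sum when either gradient is zero.
        return [a + b for a, b in zip(g1, g2)]
    c1 = 1 - d / (2 * n1)
    c2 = 1 - d / (2 * n2)
    return [c1 * a + c2 * b for a, b in zip(g1, g2)]


orthogonal = adasum_pair([1.0, 0.0], [0.0, 1.0])  # behaves like a sum
parallel = adasum_pair([2.0, 0.0], [2.0, 0.0])    # behaves like an average
```

Under this rule, orthogonal gradients are simply added, while identical gradients are averaged, which matches the observation that gradients move from parallel to orthogonal as training proceeds.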

JAX is a system for high-performance machine learning research. It offers the familiarity of Python+NumPy together with hardware acceleration, and it enables the definition and composition of user-wielded function transformations useful for machine learning programs. These transformations include automatic differentiation, automatic batching, end-to-end compilation via XLA, parallelizing over multiple accelerators, and more. Composing these transformations is the key to JAX's power and simplicity.

JAX had its initial open-source release in December 2018 (https://github.com/google/jax). It’s used by researchers for a wide range of advanced applications, from studying training dynamics of neural networks, to probabilistic programming, to scientific applications in physics and biology.

This talk will introduce JAX and its core function transformations with a live demo. You’ll learn about JAX’s core design, including how it leverages the XLA compiler to bring hardware acceleration to Python, and how it’s powering new research.



