Reconciling Abstraction with High Performance: A MetaOCaml approach

A common application of generative programming is building high-performance computational kernels highly tuned to the problem at hand. A typical linear algebra kernel is specialized to the numerical domain (rational, float, double, etc.), loop unrolling factors, array layout and a priori knowledge (e.g., the matrix being positive definite). It is tedious and error-prone to specialize by hand, writing numerous variations of the same algorithm.

Widely used generators such as ATLAS and SPIRAL reliably produce highly tuned specialized code but are difficult to extend. In ATLAS, which generates code by printing strings with printf, even keeping parentheses balanced is a challenge; according to the ATLAS creator, debugging is a nightmare.

A typed staged programming language such as MetaOCaml lets us state a general, obviously correct algorithm and add layers of specializations in a modular way. By ensuring that the generated code always compiles and letting us quickly test it, MetaOCaml makes writing generators less daunting and more productive.
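For a first taste of what staging looks like, here is a minimal sketch along the lines of the classic power example (the subject of power.ml in Chapter 2). It assumes BER MetaOCaml: `.<e>.` is a bracket quoting an expression to be run later, and `.~e` is an escape splicing generated code in.

```ocaml
(* Staged power: the exponent n is known now; the base x only later.
   Requires BER MetaOCaml -- the brackets and escapes are not plain OCaml. *)
let rec spower : int -> int code -> int code = fun n x ->
  if n = 0 then .<1>.
  else if n mod 2 = 0 then .<let y = .~(spower (n/2) x) in y * y>.
  else .<.~x * .~(spower (n-1) x)>.

(* Specializing to n = 7 yields ordinary OCaml code in which the
   recursion and all tests on n have been performed away: *)
let spower7 = .<fun x -> .~(spower 7 .<x>.)>.
```

The type `int code` guarantees that whatever `spower` produces is a well-typed integer expression: a generator that could emit ill-formed or ill-typed code is rejected when the generator itself is compiled.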

Readers will see this for themselves in this hands-on tutorial. Assuming no prior knowledge of MetaOCaml and only a basic familiarity with functional programming, we will eventually implement a simple domain-specific language (DSL) for linear algebra, with layers of optimizations for the sparsity and memory layout of matrices and vectors and for their algebraic properties. We will generate optimal BLAS kernels, and get a taste of ``abstraction without guilt''.

Like any other monograph in now publishers' Foundations and Trends (TM) series, ``Reconciling Abstraction with High Performance: A MetaOCaml approach'' is published in three formats: journal article, e-book and print book.

DOI: 10.1561/2500000038

The tutorial on the systematic generation of optimal numeric kernels with MetaOCaml was first presented at the tutorial session of the Commercial Users of Functional Programming conference (CUFP 2013) on September 23, 2013 in Boston, USA. It was reprised at IFL 2017 (Bristol, UK).

This page describes the structure of the book and points to the accompanying code.

1. Introduction
   Why metaprogramming?
   Why this tutorial?
   Why MetaOCaml?
   Overview
   Obtaining MetaOCaml
2. First Steps
   Now or later
   Power
   Offline code generation
   Runtime specialization and its benchmark
   Recap
   A historical aside
3. Filtering
   Specializing to the known filter order
   Specialization to the known coefficients
   Smarter specialization
   Further challenges
   Recap
4. Linear Algebra DSL: Complex Vector Arithmetic and Data Layout
   Data layout problem
   Abstracting arithmetic
   Abstracting vectors
   Vector arithmetic DSL
   Compiling vector DSL
   Recap and further challenges
5. Linear Algebra DSL: Matrix-Vector Operations and Modular Optimizations
   Shonan challenge 1
   BLAS 2 DSL
   Implementing and generating matrix-vector multiplication
   Specializing to the known dimensions
   Specializing to the known matrix: Partially-known values
   Algebraic simplifications
   Selective unrolling
   Cross-stage persistence for large data
   Recap
6. From an Interpreter to a Compiler: DSL for Image Manipulation
   Image-processing DSL
   Interpreting DSL
   Compiling DSL
7. Further Challenges
   Digital filters
   Linear Algebra DSL
   Other Challenges
Conclusions
Acknowledgements
References
Accompanying Code

Overview

The goal of the tutorial is to teach how to write typed code generators, how to make them modular, and how to introduce local domain-specific optimizations with MetaOCaml. The tutorial is based on a progression of problems which, except for the introductory one, are all slightly simplified versions of real-life problems:

1. First steps in staging and MetaOCaml
2. Digital filters
3. Complex vector multiplication: varying data representation (structure of arrays vs. array of structures)
4. Systematic optimization of simple linear algebra: building extensively specialized general BLAS
5. From an interpreter to a compiler: DSL for image manipulation
6. Further challenges (Homework)

In fact, problems 3, 4 and 6 were suggested by HPC researchers as challenges to the program-generation community (Shonan challenges). The common theme is building high-performance computational kernels highly tuned to the problem at hand. Hence most problems revolve around simple linear algebra -- a typical and most frequently executed part of HPC code.

The stress on high-performance applications and on modular optimizations and generators sets this tutorial apart from Taha's very accessible, gentle introductions to `classical' partial evaluation and staging, which focus on turning an interpreter of a generally higher-order language into a compiler. We also get to see this classical area in Chap. 6; however, we pay less attention to lambda-calculus and more to image processing. Furthermore, this tutorial mentions recent additions to MetaOCaml such as offshoring and let-insertion.

The source code for the tutorial is available as a supplement (Accompanying Code).
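As a small illustration of the ``abstracting arithmetic'' step of problem 3, the idea can be sketched in plain OCaml as follows. The names RING, DotProduct and FloatRing are this sketch's own, not necessarily those used in the accompanying code.

```ocaml
(* A ring of values: the interface abstracts over the numeric domain. *)
module type RING = sig
  type t
  val zero : t
  val one  : t
  val add  : t -> t -> t
  val mul  : t -> t -> t
end

(* A BLAS-1-style operation, written once for any ring. *)
module DotProduct (R : RING) = struct
  let dot xs ys =
    List.fold_left2 (fun acc x y -> R.add acc (R.mul x y)) R.zero xs ys
end

(* One possible instantiation: ordinary floats, computed right away. *)
module FloatRing : RING with type t = float = struct
  type t = float
  let zero = 0.
  let one  = 1.
  let add  = ( +. )
  let mul  = ( *. )
end

module D = DotProduct (FloatRing)
let example = D.dot [1.; 2.] [3.; 4.]   (* 1*3 + 2*4 = 11. *)
```

The payoff comes when RING is instantiated at a code type such as `float code`: the very same `dot` then emits specialized code instead of computing a number, which is how the tutorial layers staging onto an abstract algorithm.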

Accompanying Code

Makefile [2K]

How to build the tutorial code

Introduction to MetaOCaml: the power example (A.P.Ershov, 1977)

power.ml [6K]

Code and explanations

power_rts.ml [2K]

The example of run-time specialization

powerf_rts.ml [2K]

The example of run-time specialization (FP)

square.ml [<1K]

The externally defined `square' function

Lifting values to code: now comes with BER MetaOCaml N107

Generating optimal FIR filters

filter.ml [14K]

Introductory, ad hoc approach

Systematic approach

Complex Vector Arithmetic (Shonan challenge)

ring.ml [2K]

Abstracting arithmetic: Rings

vector.ml [3K]

Abstract vectors and BLAS 1 operations

complex.ml [13K]

Complex vector arithmetic

cmplx.mli [<1K]

Defining data type cmplx

Shonan Challenge 1: Matrix-vector multiplication with a known, mostly sparse matrix

mvmult_full.ml [22K]

Stepwise development of the optimal specialized matrix-vector multiplication

A DSL for image manipulation

imgdsl.mli [2K]

The definition of the DSL

img_ex.ml [<1K]

Sample DSL expressions

img_interp.ml [2K]

The interpreter

img_comp.ml [4K]

The staged interpreter: the compiler

img_trans.ml [3K]

Interpreting and compiling the examples

grayimg.mli [<1K]

grayimg.ml [5K]

Low-level image processing library

takaosan.pgm [1302K]

A sample image to process: Takao-san view, April 2008

Conclusions

The trade-off between clarity, maintainability and ease of development, on one hand, and performance, on the other, is real. Throughout the tutorial we have encountered it time and again. We have also seen the trade-off resolved. Well-chosen abstractions (DSLs embedded in the tagless-final style, for one) let domain experts easily write code, easily see it correct, and easily express domain-specific knowledge and optimizations. Code generation removes the overhead of abstraction from the resulting code, shifting the overhead to the generator, where it is bearable.

The recent ``Stream Fusion, to Completeness'' reinforces the lesson for `industrial-strength' stream processing. The strymonas library designed in that paper lets us build pipelines by freely nesting and plugging in components such as maps, filters, joins, etc. The result is highly imperative code whose performance not just approaches but matches that of hand-written code (in the cases where the hand-written code was feasible to write).

Building even complicated generators is simple if we take advantage of abstraction and types. OCaml's excellent abstraction facilities -- from higher-order functions to the module system -- let us write and debug generators in small pieces, and compose optimizations from separate layers. As we keep saying, with code generation, abstraction comes at no cost. We may abstract with abandon.

Staged types are of particularly great help. They ensure that compiling the generated code produces no errors. Problematic generators are reported with helpful error messages that refer to the generator (rather than the generated) code. Furthermore, on many occasions we have seen that to stage code, we merely need to give it the desired signature, submit the code to the type checker and fix the reported errors. The type checker actively helps us write the code.

This tutorial has covered the part of MetaOCaml that has been stable for a decade and is expected to remain so. MetaOCaml is an actively developed project with more experimental features, such as offshoring and genlet, which have been mentioned only in passing. They all help write generators easily and produce faster code while maintaining confidence in the result.

Last updated October 5, 2018

oleg-at-okmij.org

Your comments, problem reports, questions are very welcome!