Let the AI benchmarking wars begin. Today, a diverse group from academia and industry – Google, Baidu, Intel, AMD, Harvard, and Stanford among them – released MLPerf, a nascent benchmarking tool “for measuring the speed of machine learning software and hardware.” The arrival of MLPerf follows a smattering of ad hoc AI performance comparisons trickling to market. Last week, a RiseML blog post compared Google’s TPUv2 against the Nvidia V100. Today Intel posted a blog with data showing that, for select machine translation workloads using RNNs, “the Intel Xeon Scalable processor outperforms NVidia V100 by 4x on the AWS Sockeye Neural Machine Translation model.”

For quite some time there has been vigorous discussion around the need for meaningful AI benchmarks, with proponents suggesting that the lack of such tools has restrained AI adoption. Quoted in the MLPerf announcement is AI pioneer Andrew Ng: “AI is transforming multiple industries, but for it to reach its full potential, we still need faster hardware and software.” The hope is that better, standardized benchmarks will help AI technology developers create such products and allow adopters to make informed selections among AI-enabling technologies.

MLPerf says its primary goals are to:

Accelerate progress in ML via fair and useful measurement

Enable fair comparison of competing systems yet encourage innovation to improve the state-of-the-art of ML

Keep benchmarking effort affordable so all can participate

Serve both the commercial and research communities

Enforce replicability to ensure reliable results

Comparisons of AI performance (h/w and s/w) have so far largely been issued by parties with a vested interest, such as Intel’s blog today titled “Amazing Inference Performance with Intel Xeon Scalable Processors.” This isn’t a knock on Intel. Such comparisons often contain useful insight, but they are also often structured to demonstrate one vendor’s superiority over a competitor. A standardized benchmark makes it harder to tweak tests to get the result one wants.

The MLPerf effort emulates past efforts such as SPEC (the Standard Performance Evaluation Corporation). “[T]he SPEC benchmark helped accelerate improvements in general purpose computing. SPEC was introduced in 1988 by a consortium of computing companies. CPU Performance improved 1.6X/year for the next 15 years. MLPerf combines best practices from previous benchmarks including: SPEC’s use of a suite of programs, SORT’s use of one division to enable comparisons and another division to foster innovative ideas, DeepBench’s coverage of software deployed in production, and DAWNBench’s time-to-accuracy metric,” says MLPerf.
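To put the 1.6X/year figure quoted above in perspective, the short Python sketch below compounds it over the 15 years cited. The function name `compound_growth` is ours, chosen for illustration, and the calculation assumes the improvement rate held steady every year, which the quote implies but does not spell out.

```python
def compound_growth(rate_per_year: float, years: int) -> float:
    """Cumulative improvement factor for a constant yearly multiplier."""
    return rate_per_year ** years


# 1.6X per year sustained for 15 years (figures from the MLPerf quote).
total = compound_growth(1.6, 15)
print(f"Cumulative speedup: about {total:,.0f}x")  # about 1,153x
```

Under that assumption, the cumulative gain works out to roughly three orders of magnitude.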

Addison Snell, CEO of Intersect360 Research, noted, “AI is on the minds of so many enterprises today, that any effort to provide neutral benchmarking guidance is of heightened importance, especially with the range of competing technologies at play. However, AI is such a diverse field, I doubt any single benchmark will become dominant over time. Consider all the zeal around big data and analytics five years ago; despite everyone’s attempts to define it, the industry didn’t provide a unified, common benchmark. I expect the same will happen with AI.”

MLPerf is a “good and useful” step, said Steve Conway, senior research vice president, Hyperion Research, “because there has been a real lack of benchmarks for buyers and sellers for years to show the differences between AI products and solutions. This benchmark appears to be written for bounded problems that predominate today in early AI. Later on we are going to need additional benchmarks as AI starts getting into unbounded problems that will be the most economically important problems. Bounded problems are relatively simple like voice and image recognition or game playing. An unbounded problem is diagnosing a cancer versus a bounded problem of reading an MRI; it’s being able to recommend decisions on really complicated questions.”

MLPerf is available now on GitHub but still in a very early stage, as emphasized by MLPerf, “This release is very much an ‘alpha’ release — it could be improved in many ways. The benchmark suite is still being developed and refined, see the Suggestions section below to learn how to contribute. We anticipate a significant round of updates at the end of May based on input from users.”

Currently there are reference implementations for each of the seven benchmarks in the MLPerf suite (excerpted from GitHub):

Image classification – Resnet-50 v1 applied to Imagenet.

Object detection – Mask R-CNN applied to COCO.

Speech recognition – DeepSpeech2 applied to Librispeech.

Translation – Transformer applied to WMT English-German.

Recommendation – Neural Collaborative Filtering applied to MovieLens 20 Million (ml-20m).

Sentiment analysis – Seq-CNN applied to IMDB dataset.

Reinforcement – Mini-go applied to predicting pro game moves.

Each reference implementation provides the following: code that implements the model in at least one framework; a Dockerfile that can be used to run the benchmark in a container; a script that downloads the appropriate dataset; a script that runs and times training of the model; and documentation on the dataset, model, and machine setup.

According to the GitHub site, the benchmarks have been tested on the following machine configuration:

16 CPUs, one Nvidia P100.

Ubuntu 16.04, including docker with nvidia support.

600GB of disk (though many benchmarks do require less disk).

It will be interesting to watch whether the industry coalesces around a few AI benchmarks or whether benchmarks proliferate. In such a young market, many are likely to offer benchmarking tools and services. For example, Stanford – which is an MLPerf member – recently released its first DAWNBench v1 Deep Learning results.

Stanford reported: “April 20, 2018 marked first deep learning benchmark and competition that measures end-to-end performance: the time/cost required to achieve a state-of-the-art accuracy level for common deep learning tasks, as well as the latency/cost of inference at this state-of-the-art accuracy level. Focusing on end-to-end performance provided an objective means of normalizing across differences in computation frameworks, hardware, optimization algorithms, hyperparameter settings, and other factors that affect real-world performance.”
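For readers unfamiliar with the idea, a time-to-accuracy measurement can be sketched in a few lines of Python. This is an illustrative sketch only, not DAWNBench or MLPerf code: the function `time_to_accuracy` and its callback parameters (`train_one_epoch`, `evaluate`, `clock`) are hypothetical names, and counting evaluation time toward the total is a simplifying assumption rather than a rule taken from either benchmark.

```python
import time
from typing import Callable, Optional


def time_to_accuracy(
    train_one_epoch: Callable[[], None],
    evaluate: Callable[[], float],
    target_accuracy: float,
    max_epochs: int = 100,
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Seconds elapsed until evaluate() first reaches target_accuracy.

    Returns None if the target is not reached within max_epochs.
    """
    start = clock()
    for _ in range(max_epochs):
        train_one_epoch()                    # one pass over the training data
        if evaluate() >= target_accuracy:    # check held-out accuracy
            return clock() - start
    return None


# Toy stand-in for a real model: accuracy rises by 0.25 per epoch.
state = {"accuracy": 0.0}


def toy_train() -> None:
    state["accuracy"] += 0.25


def toy_eval() -> float:
    return state["accuracy"]


elapsed = time_to_accuracy(toy_train, toy_eval, target_accuracy=0.75)
print(f"Reached target after {elapsed:.6f} seconds")
```

The point of the metric is that a system is scored on how long it takes to reach a fixed quality bar, not on raw throughput, so a faster system that converges to a worse model gains nothing.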

One DAWNBench competitor, fast.ai – a young company offering AI training and developing AI software tools – reached out to HPCwire touting its performance (see company blog for results). These benchmarks matter, and a Stanford-run exercise is very likely to be rigorous and deserves to be taken seriously. Other comparisons may be less so. An effort such as MLPerf could help clear the currently muddy waters when comparing AI claims.

Link to MLPerf user guide: https://mlperf.org/assets/static/media/MLPerf-User-Guide.pdf

* Additional reporting by Tiffany Trader