MSR ResNet34 visualised by Graphcore

Accelerating Machine Intelligence

Compute, data, and algorithms have combined to power the recent huge strides in machine intelligence. But there is still plenty of scope for improvement, and hardware is finally coming to the fore.

We’ve heard that it is prohibitively expensive for startups and academics to train machine learning models because of the rental or purchase costs of hardware. Reproducing the results of one recent Google paper was estimated to cost $13k.

That’s just to reproduce the final model, not to emulate the whole experimentation and hyperparameter optimisation caboodle. Equally, there are intelligence tasks (training, inference, or prediction) that would ideally happen on the cellphone or remote sensor but are too compute-constrained to run locally, so they currently rely on uploading data to the cloud for processing.

Machine intelligence is the future of computing, so what needs to happen at a hardware level to make it faster and more energy- and cost-efficient? We talked to Simon Knowles, CTO of Graphcore, about hardware acceleration of machine intelligence.

Why do we need new hardware for machine intelligence?

Because intelligence represents a different type of compute workload than we’ve seen before. Existing silicon architectures are not time- or energy-efficient for this new workload. In particular, machine intelligence permits and requires massively parallel processing, but the design of parallel processors and methods of programming them are nascent arts. There is huge scope for innovation in architecture and detailed design.

What are the special characteristics of intelligence workloads?

Intelligence, human or machine, involves two essential capabilities. The first is approximate computing: efficiently finding probably good answers where perfect answers are not possible, usually because there is insufficient information, time, or energy. This is what we call judgement in humans. The second capability is learning: adapting responses according to experience, which is just previous data. In computers or humans, learning is distilling a probability distribution (a model) from previous data. This model can then be used to predict probable outcomes or to infer probable causes.
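As a toy illustration of that last point, the sketch below distils a joint probability distribution from a handful of observations by simple counting, then uses it in both directions: predicting a probable outcome from a cause, and inferring a probable cause from an outcome. The weather data and the function names (`learn`, `predict`, `infer`) are invented for this example and are not from the interview; real models are far larger and are not learned by exhaustive counting.

```python
from collections import Counter

# Hypothetical "previous data": (cause, outcome) pairs.
observations = [
    ("rain", "wet"), ("rain", "wet"), ("rain", "dry"), ("rain", "wet"),
    ("sun", "dry"), ("sun", "dry"), ("sun", "wet"), ("sun", "dry"),
]

def learn(data):
    """Distil a joint distribution P(cause, outcome) from raw observations."""
    counts = Counter(data)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

def predict(model, cause):
    """P(outcome | cause): probable outcomes given a known cause."""
    marginal = sum(p for (c, _), p in model.items() if c == cause)
    return {o: p / marginal for (c, o), p in model.items() if c == cause}

def infer(model, outcome):
    """P(cause | outcome): probable causes given an observed outcome."""
    marginal = sum(p for (_, o), p in model.items() if o == outcome)
    return {c: p / marginal for (c, o), p in model.items() if o == outcome}

model = learn(observations)
print(predict(model, "rain"))  # distribution over outcomes when it rains
print(infer(model, "wet"))     # distribution over causes of a wet street
```

The same learned model answers both questions; only the variable being conditioned on changes.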

These models are naturally graphs, where the vertices represent the probabilities of features in the data, and the edges represent correlation or causation between features. Usually there are very many vertices so the graph is naturally rather sparse. The graph structure also makes huge parallelism explicit — a massively parallel processor can work on many vertices and edges at the same time. Another characteristic of these models is that the resolution of each component probability is usually rather small; it is the aggregation of them all that provides a higher resolution output. This means that the underlying calculations require arithmetic on only small words — typically half-precision floating-point. Machine intelligence may be the first workload to require very high performance compute on very low precision data, quite a contrast to traditional HPC.
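The claim that many low-resolution components aggregate into a higher-resolution output can be checked with a small experiment. The sketch below rounds a set of random probabilities to IEEE 754 half precision using Python's `struct` module (format code `"e"`), then compares the worst error of any single value with the error of their average. The sample size, the seed, and the choice to accumulate the sum in ordinary double precision are assumptions made for this illustration; the interview does not say how any particular hardware accumulates.

```python
import random
import struct

def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

random.seed(0)  # fixed seed so the run is repeatable
exact = [random.random() for _ in range(10_000)]  # stand-in "probabilities"
rounded = [to_half(p) for p in exact]

# Error of the least accurate individual half-precision value.
worst_single_error = max(abs(r - p) for r, p in zip(rounded, exact))

# Error of the aggregate (here, the mean) computed from the rounded values.
# Assumption: the sum itself is accumulated in double precision.
aggregate_error = abs(sum(rounded) / len(rounded) - sum(exact) / len(exact))

print("worst single-value error:", worst_single_error)
print("error of the aggregate:  ", aggregate_error)
```

Because the individual rounding errors have mixed signs, they largely cancel in the sum, so the aggregate is typically much more accurate than its least accurate component.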

Can you explain the different approaches to the problem that people are taking?

There are always trade-offs in silicon design, and one of the hardest decisions at the emergence of a new market is how much to harden. Broadly speaking, very large markets reward application-specific design and very broad potential promotes application-agnostic design. Machine intelligence is undoubtedly the future of computing, but it is still relatively young and is evolving very quickly. So baking today’s favourite models or learning algorithms into fixed hardware, by building an application-specific integrated circuit (ASIC), would be foolish.

Instead, what is required is silicon architecture that efficiently supports the essential new characteristics of intelligence as a workload, yet is flexible enough to maintain utility as the details evolve. One can achieve flexibility by using something called a Field Programmable Gate Array (FPGA, an approach that Microsoft has implemented). However, these are really painful to program, power hungry, and have a relatively low performance ceiling. Just as much flexibility at much greater performance can be achieved by designing a processor for this new class of workload. If we think of the Central Processing Unit (CPU) in your laptop as being designed for scalar-centric control tasks, and the Graphics Processing Unit (GPU) as being designed for vector-centric graphics tasks, then this new class of processor would be an Intelligence Processing Unit (IPU), designed for graph-centric intelligence tasks.

I’m glad you mentioned GPUs, because haven’t they enabled many of the exciting results we’ve seen from deep learning in the last few years?

Yes, they have, and they may be able to evolve to play an ongoing role in machine intelligence. GPUs are currently much faster than CPUs for deep learning where dense tensors are being manipulated, which suits their wide vector datapaths. In fact, some CPUs, such as Xeon Phi, have evolved to look more like GPUs. But only a subset of machine intelligence is amenable to wide vector machines, and the high arithmetic precision required by graphics is far too wasteful for the probability processing of intelligence. So GPUs are actually very inefficient, but today they are what exists. In the near future we will certainly see more focused alternatives.

You’ve probably seen the first of these new architectures, the Tensor Processing Unit (TPU) recently announced by Google as their proprietary custom accelerator for machine inference. Start-up Nervana, recently acquired by Intel, also claim they are working on a TPU. It’s especially exciting to see Google advocating tailored processor design for machine learning — they wouldn’t do chip design if it didn’t make a big difference to what they can deliver in application performance, and ultimately cost.

What about neuromorphic approaches?

The brain is a great exemplar for computer architects in this brave new endeavour of machine intelligence. But the strengths and weaknesses of silicon are very different to those of “wetware”. We have not copied nature’s pattern for flying machines, nor for surface locomotion, nor for engines, because our engineering materials are different. So too with computation. For example, most neuromorphic computing projects advocate communication by electrical spikes, like the brain. But a basic analysis of energy efficiency immediately concludes that an electrical spike (two edges) is half as efficient for information transmission as a single edge, so following the brain is not automatically a good idea. I think computer architects should always strive to learn how the brain computes, but should not strive to literally copy it in silicon.
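The spike-versus-edge argument comes down to counting signal transitions, since each transition on a wire costs switching energy. The sketch below encodes the same sequence of events two ways: as spikes (the wire pulses up and back down for each event) and as single edges (the wire simply toggles its level for each event), then counts the transitions in each waveform. The event sequence, the two-samples-per-slot waveform representation, and the function names are assumptions chosen for this illustration, and the model ignores everything except the number of edges.

```python
def spike_waveform(events):
    """Each event is a pulse: the wire goes high, then returns low."""
    wave = []
    for event in events:
        wave += [1, 0] if event else [0, 0]
    return wave

def toggle_waveform(events):
    """Each event is a single edge: the wire flips and stays at its new level."""
    wave, level = [], 0
    for event in events:
        if event:
            level ^= 1
        wave += [level, level]
    return wave

def count_edges(wave, start_level=0):
    """Count level changes, assuming the wire starts at start_level."""
    edges, previous = 0, start_level
    for sample in wave:
        if sample != previous:
            edges += 1
        previous = sample
    return edges

events = [1, 0, 1, 1, 0, 1]  # hypothetical sequence: four events in six slots
spike_edges = count_edges(spike_waveform(events))
toggle_edges = count_edges(toggle_waveform(events))
print(spike_edges, toggle_edges)
```

For any event sequence in this model, the spike encoding uses exactly two transitions per event and the toggle encoding uses one, which is the factor of two the argument relies on.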

So what’s left to do?

Intelligence is the future of all computing. Without it, many of the things we want machines to do would be impossible. At the twilight of Moore’s law it might seem unfortunate that we have just uncovered a good use for another factor of a billion in compute performance and efficiency. That won’t come from just shrinking silicon features further, and there’s nothing that could replace silicon on the visible horizon. However, we have barely begun to explore the computing machines which could be built from the multi-billion transistor chips we can now mass-produce cheaply. We have barely begun to explore how to develop software efficiently for parallel processors running thousands or millions of concurrent sub-programs. And we have barely begun to understand the mechanisms of intelligence as a computing task. This will be another golden age of engineering design. There is a great deal of innovation coming, and I’m very pleased that Graphcore will be at the vanguard of that.