For more than a decade, graphics processor maker Nvidia has been championing the adoption of GPU accelerators as heavy-lifting compute engines for an increasing array of applications that can take advantage of the parallel processing inherent in a GPU. To the company’s great credit, an entire ecosystem of hybrid computing, mixing the attributes of CPUs and GPUs, has branched off from graphics processors, which were once aimed exclusively at PCs and workstations. It has taken more than good hardware engineering and a pressing need for low-cost, low-power performance to foment such an ecosystem; it has taken a lot of researchers and software developers, too.

As Nvidia’s CEO and co-founder, Jen-Hsun Huang, explained in his keynote address at the SC11 supercomputing conference four years ago, Nvidia was founded so that he and his co-founders, Curtis Priem and Chris Malachowsky, could play 3D video games, which did not yet exist in 1993. Helping build massive compute farms for running scientific simulations, financial risk analysis, or machine learning algorithms was the furthest thing from their minds. The fortunate thing for Nvidia was that five years after its founding, Quake, the very first OpenGL application and the first wildly popular 3D game, burst onto the scene. It was not long before Nvidia was solidly in the GPU business, and soon after that it went public and came to set the pace in GPU chip design.

It took a while to make the leap from 3D games to supercomputing simulations. Back in 2006, Nvidia accidentally walked through the datacenter door when it added 32-bit, single-precision floating point math units that complied with IEEE specifications to its GPUs. Once this capability was available, it was practically inevitable that researchers, always looking for an edge when it came to parallel computing, would leverage it to run Fortran applications because of the inherent massive parallelism and very good power efficiency (relative to the amount of compute) in Nvidia GPUs. The breakthrough moment for GPU coprocessing came in 2008, when a team led by Tsuyoshi Hamada at Nagasaki University in Japan built a cluster of 128 PCs based on Intel’s Core 2 Q6600 quad-core processors, each with two GeForce graphics cards from Nvidia. For $230,000, Hamada’s team put together a hybrid cluster (including Gigabit Ethernet switching) that was programmed in C to run an N-body astrophysics simulation. This machine, like many prototypes and like Google’s early homegrown clusters for its search engine, was not pretty:

But the Nagasaki cluster, which won the Gordon Bell prize for supercomputing in 2009, was pretty effective, delivering 42 teraflops of aggregate peak performance – at an unprecedented 124 megaflops per dollar – and was able to simulate the interactions of 1.61 billion particles. Soon thereafter, Hamada built a hybrid cluster with 144 PCs and a total of 576 GeForce GT200 graphics cards plus InfiniBand interconnect – which had much more bandwidth and much lower latency – and was able to do an N-body simulation with 3.28 billion particles thanks to its 190 teraflops of oomph.
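To give a sense of why N-body simulations suit massively parallel hardware, here is a minimal sketch of a direct-sum gravitational acceleration step. It is an illustration only and not the Nagasaki team's code, whose algorithm this article does not describe. The function name `accelerations`, the `softening` parameter, and the unit choice `g=1.0` are assumptions made for the example.

```python
import math


def accelerations(positions, masses, softening=1e-3, g=1.0):
    """Direct-sum gravitational acceleration on every body, O(N^2) work.

    positions: list of (x, y, z) tuples; masses: list of floats.
    softening and g are illustrative assumptions, not values from the article.
    """
    n = len(positions)
    result = []
    # The outer loop body for body i never reads the result for any other
    # body, so each iteration could in principle run on its own GPU thread.
    for i in range(n):
        xi, yi, zi = positions[i]
        ax = ay = az = 0.0
        for j in range(n):
            if i == j:
                continue  # a body exerts no force on itself
            dx = positions[j][0] - xi
            dy = positions[j][1] - yi
            dz = positions[j][2] - zi
            # Softened squared distance avoids division by zero at close range.
            dist_sq = dx * dx + dy * dy + dz * dz + softening * softening
            scale = g * masses[j] / (dist_sq * math.sqrt(dist_sq))
            ax += scale * dx
            ay += scale * dy
            az += scale * dz
        result.append((ax, ay, az))
    return result
```

The work grows with the square of the particle count, and each particle's sum is independent of the others. That is the kind of regular, arithmetic-heavy pattern that maps well onto thousands of GPU threads.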

Around this time, Nvidia founded its Tesla Accelerated Computing business unit and began to design GPUs specifically to meet the needs of customers who wanted cheaper flops and a lot more of them to run their simulations and models. The contrast between CPUs and GPUs in floating point performance and memory bandwidth is dramatic:

CPU and GPU architectures are not sitting still, and a variety of memory and packaging technologies are going to be brought to bear to boost the performance and memory bandwidth of both kinds of motors. It will be interesting to see how the future “Broadwell” and “Skylake” Xeon E5 processors stack up against the future “Pascal” and “Volta” Tesla GPU accelerators, and of course, the key comparison is how these GPUs will do compared to Intel’s impending “Knights Landing” and future “Knights Hill” parallel X86 processors and coprocessors.

Raw peak theoretical performance on compute and memory bandwidth is one thing, but benchmarks on real-world applications matter, too. Here is how the latest dual-GPU Tesla K80 accelerators compare to prior generation “Ivy Bridge” Xeon E5 v2 processors:

Note: The performance of these various simulation and modeling programs was measured first with only a single Xeon E5 processor and then with that processor paired with a single, dual-GPU Tesla K80 card.

While the Top500 list of supercomputers is not necessarily representative of the HPC community at large, considering how advanced some of the supercomputing centers are, the share of computing being performed by hybrid machines on the latest June list is impressive. There are 15 machines using “Fermi” generation Tesla accelerators and 33 machines using “Kepler” generation Teslas, and the combined 48 machines represent 9.6 percent of the systems on the list. At 92.5 petaflops of combined peak theoretical double precision floating point performance, they account for 47.6 percent of all accelerated machine performance on the list and 18 percent of the total 513 petaflops of performance embodied on the list. As more and more applications are reworked for GPU acceleration, it is reasonable to assume that the GPU share of aggregate computing on the Top500 list will grow.
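The shares quoted above follow from simple ratios, and a short sketch makes the arithmetic explicit. The variable names are illustrative, the list size of 500 systems is an assumption based on the list's name, and the total for all accelerated machines is not stated in the text; it is only implied by the 47.6 percent figure.

```python
# Figures taken from the paragraph above.
fermi_systems = 15
kepler_systems = 33
tesla_petaflops = 92.5        # combined peak double precision performance
list_petaflops = 513.0        # total peak performance on the list
share_of_accelerated = 0.476  # Tesla share of accelerated performance

# Assumption: the Top500 list ranks 500 systems.
list_size = 500

tesla_systems = fermi_systems + kepler_systems
system_share = tesla_systems / list_size
performance_share = tesla_petaflops / list_petaflops

# Derived, not reported: the accelerated total implied by the 47.6 percent share.
implied_accelerated_petaflops = tesla_petaflops / share_of_accelerated

print(f"Tesla-equipped systems: {tesla_systems}")
print(f"Share of systems: {system_share:.1%}")
print(f"Share of total peak performance: {performance_share:.1%}")
print(f"Implied accelerated total: {implied_accelerated_petaflops:.1f} petaflops")
```

The system and performance shares reproduce the 9.6 percent and roughly 18 percent figures quoted above. The implied accelerated total, a little under 200 petaflops, is a back-calculation rather than a number reported in the text.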

Another indicator of the adoption of GPU acceleration is the use of these compute engines by cloud providers, who in turn rent this capacity to their customers. Amazon Web Services and SoftLayer were early adopters, adding GPUs to their clouds many years ago as hybrid CPU-GPU computing was just getting under way. Microsoft Azure added top-of-the-line Tesla K80s to its cloud last month, and SoftLayer has recently upgraded its facilities with Tesla K80s, too. There are a handful of smaller cloud providers, including Nimbix, Peer1 Hosting, Penguin Computing, and Rapid Switch, that offer GPUs on their infrastructure. While Google is a very large consumer of GPUs for its machine learning efforts, the company has not, as yet, offered GPU acceleration on its Compute Engine public cloud. But it certainly could with relative ease. While these cloud CPU-GPU compute facilities are still nascent, the idea is the same: to offer up to an order of magnitude acceleration through GPU coprocessing for applications compared to straight CPU processing.

The Next Wave Is Deep

While the first wave of compute acceleration for GPUs came from the supercomputing community wanting to accelerate their Fortran-based parallel applications for simulating all kinds of physical phenomena – from weather and climate to materials to internal combustion and genomics – the next big wave of CPU-GPU computing came out of the hyperscalers, not the HPC community. Specifically, the big consumer-facing web sites like Google, Baidu, Yahoo, Facebook, and their peers have been pushing the deep learning envelope, and many of them have adopted GPU acceleration to speed up the training of the algorithms that underpin their photo, video, text, and voice recognition systems. These companies are dealing with very large volumes of such data, so being able to have their systems automatically classify such content as it streams into their applications – and allow for it to be searched once it is quantified – is of immense value to their subscribers.

Computers can’t magically know what things are just by analyzing a stack of images or videos. People have to comb through lots and lots of images of things and categorize them in databases, like the ImageNet database, for instance. Once you have an image database that has its contents quantified, you can use a deep learning approach, such as the convolutional neural networks that were perfected by Bell Labs and Facebook researcher Yann LeCun. Since 2010, the ImageNet Large Scale Visual Recognition Challenge has been setting the pace of technology evolution – at least as far as we can tell – and the error rate percentage of the algorithms has been falling dramatically since GPUs were added to the mix:

Almost everyone who does algorithm training for convolutional neural networks has shifted from CPU clusters to beefy servers with lots of GPUs stuffed in them, and in many cases, such as at the large hyperscalers, they have very large banks of such GPU-heavy nodes – with many thousands of GPUs in total – to run myriad variations of training algorithms against datasets in parallel.

To address scale-out issues with deep learning, Nvidia is working with the deep learning community to allow for neural networks to more efficiently span across multiple GPUs within a single node. The company is also going to be delivering more powerful “Pascal” Tesla GPU accelerators next year, with their high-speed NVLink interconnect, that will expand the deep learning capabilities of hybrid CPU-GPU systems. NVLink will allow for high-bandwidth links that enable memory sharing between multiple GPUs, turning them into what is effectively one giant GPU and presumably making the programming job a lot easier for neural nets.

Just this week, Nvidia launched a pair of Tesla accelerators based on its “Maxwell” generation of GPUs, the M4 and M40, aimed specifically at hyperscalers to help accelerate machine learning training models and to help boost the performance of other workloads that these companies run alongside their machine learning applications. (We discuss these other uses of the new Tesla accelerators and their implications here.)

GPU acceleration is not just limited to simulation and modeling on conventional supercomputers and deep learning. GPUs have found their way into accelerating other workloads, including risk management and Monte Carlo simulations in financial services, video encoding and transcoding in the media and entertainment industry, speech processing at Nuance, Google, Microsoft, and Apple, and various kinds of database acceleration, too. And the next big wave, perhaps, will come when Java is fully adapted to run in parallel, making the most popular language for business applications available for acceleration.

Nvidia’s Tesla computing unit has come a long way in seven years. Back in 2008, it sold about 6,000 units in its fiscal year, which ends in January, and the CUDA parallel programming model, which is used to take parallel chunks of C, C++, and Fortran applications and offload them to the GPUs for processing, had something on the order of 150,000 downloads from 2006 through early 2008 when the Tesla unit was coming out with its first Fermi-based products with error correction and other features necessary for enterprise customers. Through Nvidia’s fiscal 2015 earlier this year, the company had a total of 3 million CUDA downloads – 1 million of those in the past 18 months alone – and had sold a total of 450,000 Tesla GPU accelerators. In September, Nvidia said it had shipped a total of 600 million CUDA-capable GPUs across PCs, workstations, and servers, which is a huge base.

The list of CUDA-enabled applications has grown from 27 at the end of fiscal 2008 to 334 at the end of fiscal 2015, and is continuing to grow. These include a wide variety of tools for computational finance, data science and analytics, defense and intelligence, safety and security, manufacturing, media and entertainment, oil and gas, and various academic research in physics, chemistry, and life sciences. By number, the application counts are highest in manufacturing, research, and media and entertainment, but that is not necessarily a reflection of all of the applications out there (plenty of companies code their own machine learning, for instance, but very few write their own EDA or CAD software these days) or of how much Tesla compute each kind of application drives. Nvidia does not report revenues for Tesla compute by industry.

What we do know, since Tesla revenues have become material to Nvidia’s overall revenues in recent years, is that its HPC & Cloud unit drove $183 million in revenues in fiscal 2014, and this rose by 53 percent to $279 million in fiscal 2015. In the first quarter of fiscal 2016, which ended in April, Nvidia’s HPC & Cloud division revenues rose by 57 percent to $79 million, and in the second quarter of the fiscal year (ended in July), sales were down 15 percent to $62 million. As is the case with other players in the supercomputing and hyperscale segments where Nvidia seems to be getting most of its Tesla action, revenues can be bumpy from quarter to quarter because it only takes one deal to push a lot of product. So you can’t judge the HPC & Cloud unit, where Tesla lives, on a quarter-by-quarter basis.

The important thing is that Nvidia thinks that GPU acceleration at HPC centers, hyperscalers, and enterprises represents a $5 billion total addressable market, and with each expansion of software capability in both the CUDA toolset and in the application base, Nvidia is able to take a bigger bite out of that market.