Intel and NVIDIA are at it again, but this time Intel seems to have scored a bit of an own goal in its ongoing war of words with its rival. The debate, as always, is "CPU vs. GPU": a group of Intel researchers recently presented a paper at the International Symposium on Computer Architecture (ISCA) that purports to demonstrate that NVIDIA's two-year-old GTX 280 GPU is not, in fact, 10× to 1000× as fast as the eight-month-old Core i7 960 CPU at key supercomputing workloads, but merely 2.5× to 14× faster.

The Intel researchers claim to have tried their best to optimize both the CPU and the GPU, insisting that most of the previous tests that show a 100× advantage or greater for the GPU pitted optimized GPU code against unoptimized CPU code. A genuine effort at optimizing a kernel for both chips, Intel claims, narrows the gap drastically.
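To make that concrete, here's a deliberately tiny sketch of my own (not code from the paper) showing the kind of cheap CPU tuning, basically just threading the hot loop, that a fair baseline would include and that an "unoptimized CPU code" comparison leaves out:

```cpp
// Toy illustration only (not from the Intel paper): the same reduction,
// first as a naive single-threaded loop, then with the sort of low-effort
// OpenMP threading that a fair CPU baseline would include.
#include <cstddef>
#include <vector>

// Naive baseline: one thread, no tuning at all.
float dot_naive(const std::vector<float>& a, const std::vector<float>& b) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += a[i] * b[i];
    return sum;
}

// Minimally tuned version: a parallel reduction across all cores, the kind
// of tuning the researchers argue is often skipped on the CPU side.
float dot_tuned(const std::vector<float>& a, const std::vector<float>& b) {
    float sum = 0.0f;
    #pragma omp parallel for reduction(+:sum)
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += a[i] * b[i];
    return sum;
}
```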

Unfortunately for Intel, it's not actually necessary to go deep into the weeds of the paper's specific claims and performance analysis to see where the company has blown it with this attempt to set the record straight. And NVIDIA didn't cover itself in glory with its response to the paper, either. In short, silliness abounds in this mini-controversy, and both parties should be put in a time-out.

For once, the engineers should've consulted PR first

The big knock against the Pentium 4's hyperpipelined architecture and "clockspeed über alles" approach to performance has always been that it originated on the marketing side of the company, not the engineering side. In other words, the technical decision to pursue clockspeed at any cost, a decision that left the door open for AMD to swoop in and take market share with a more conventional yet superior design, was made so that Intel could tout the P4's stratospheric clockspeed in its press releases.

In the case of this paper, which is clearly aimed at scoring points against a technical rival, it probably would've been better if the engineers behind it had consulted with marketing before going public with the results. For almost any HPC application I can think of, a 2.5× speedup is attractive, especially when you can get it from an add-in coprocessor that's markedly cheaper than Intel's top-of-the-line CPU. And a 14× speedup? Who on earth would turn their nose up at that? HPC customers in both industry and academia would be happy to have employees or grad students burning the midnight oil to port code to a cheaper hardware platform that gives them an instant fourteenfold speedup.

While it may gall Intel to see NVIDIA promoting papers that tout a 100× or even 1000× speedup from using NVIDIA GPUs, the best response would be to release an Intel GPU that can deliver comparable gains. The second-best response is some combination of ignoring the chest-thumping and releasing faster CPUs. Releasing another paper that ultimately validates the opposition's claim to being significantly faster at key HPC workloads... that wasn't so slick.

It's also the case that the third-party benchmark results that NVIDIA selects to wave over its head as it runs victory laps are of a type that doesn't lend itself to easy refutation. Specifically, these benches are not produced by hardware websites, analyst firms, or anyone else who purports to be an objective outsider and provider of intelligence. No, these benches are largely done by labs that are potential users of the hardware in question. They want to know how their codes perform on GPU hardware, and they're not too interested in some idealized kernels that a third-party benchmarking group selected to represent common workloads. These labs just aren't volume customers looking for a speedup on shrinkwrapped software; rather, each customer is its own little niche, and can afford to throw programmers at new hardware if there's a real speedup to be had.

What all of this means for Intel is that NVIDIA's CUDA benchmarks can't be effectively refuted, the way a typical third party bench of commonly available software can. If a group at UPenn says that NVIDIA gave them a 130× speedup on some bit of code that they care about, then they got a 130× speedup, end of story. Again, the only real response to that is to give UPenn some Intel hardware that they can get a 131× speedup on.

So much for where Intel blew it. Now let's look at NVIDIA.

It's not Intel's fault that Fermi was late late late

After the obvious criticism that the results are still embarrassing for Intel, the second most common knock against the report in the coverage I've seen is that Intel pitted an eight-month-old Core i7 against a two-year-old GTX 280. NVIDIA even gets in on the action in its response, remarking that "this is our previous generation GPU..." But this situation is not Intel's fault. It's NVIDIA's.

Yes, NVIDIA has a pair of new GPUs based on Fermi that just hit the market last month, but the final paper deadline for the conference that Intel just presented at was November 16, 2009. At that time, the Core i7 960 had just launched the prior month, and the GTX 280 was the best you could get from NVIDIA. Fermi was still in the grip of major delays, and ATI actually was sitting firmly atop the GPU performance heap. In short, this was smack in the middle of a rather embarrassing period for NVIDIA, when they were looking like some chumps who couldn't get a high-end product out the door.

The other thing to note about NVIDIA's response is this passage: "We believe the codes that were run on the GTX 280 were run right out-of-the-box, without any optimization. In fact, it’s actually unclear from the technical paper what codes were run and how they were compared between the GPU and CPU."

Actually, no, the Intel team goes into some detail about the kinds of optimizations they did for each kernel and the issues they encountered. The "Methodology" section of the report (4.1) is just one place where Intel gives references for where the GPU code and implementations came from:

For both CPU and GPU performance measurements, we have optimized most of the kernels individually for each platform. For some of the kernels, we have used the best available implementation that already existed. Specifically, evaluations of SGEMM, SpMV, FFT and MC on GTX280 have been done using code from [1, 8, 2, 34], respectively. For the evaluations of SGEMM, SpMV and FFT on Core i7, we used Intel MKL 10.0... To the best of our knowledge, our performance numbers are at least on par and often better than the best published data.

I'm not saying that Intel didn't stack the deck at all, but it does seem clear from reading the paper that NVIDIA's claims of a complete lack of optimization and transparency are downright false.
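For what it's worth, "the best available implementation" in a comparison like this usually just means calling each vendor's tuned BLAS rather than hand-rolled loops. As a hedged sketch (this is not Intel's benchmark code; their kernel sources are the references cited above and MKL 10.0), an SGEMM comparison of the kind the paper describes generally boils down to something like:

```cpp
// Hedged sketch only: roughly how a tuned-vs-tuned SGEMM comparison is set
// up, calling Intel MKL on the CPU and NVIDIA cuBLAS on the GPU. Timing and
// data setup are omitted; all matrices are column-major.
#include <cublas_v2.h>     // NVIDIA's tuned BLAS for the GPU
#include <cuda_runtime.h>
#include <mkl.h>           // Intel's tuned BLAS (MKL) for the CPU

// C = A * B on the CPU via MKL: A is m x k, B is k x n, C is m x n.
void sgemm_cpu_mkl(int m, int n, int k,
                   const float* A, const float* B, float* C) {
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0f, A, m, B, k, 0.0f, C, m);
}

// Same product on the GPU via cuBLAS, including the host<->device copies
// that a fair end-to-end comparison has to decide whether to count.
void sgemm_gpu_cublas(int m, int n, int k,
                      const float* A, const float* B, float* C) {
    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, sizeof(float) * m * k);
    cudaMalloc((void**)&dB, sizeof(float) * k * n);
    cudaMalloc((void**)&dC, sizeof(float) * m * n);
    cudaMemcpy(dA, A, sizeof(float) * m * k, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, sizeof(float) * k * n, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, dA, m, dB, k, &beta, dC, m);
    cublasDestroy(handle);

    cudaMemcpy(C, dC, sizeof(float) * m * n, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```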

In the end, neither company comes out of this incident looking stellar. The whole dust-up merely serves to remind everyone that GPUs are drastically faster than CPUs on some important HPC codes, that Intel doesn't have a GPU it can put into the ring against Fermi, that Fermi itself was embarrassingly late, and that NVIDIA is happy to twist the truth if it helps them twist the knife a bit harder.