If you typically follow GPU performance as it relates to gaming but have become curious about Bitcoin mining, you’ve probably been surprised to find that AMD GPUs are the uncontested performance leaders in the market. This is in stark contrast to the PC graphics business, where AMD’s HD 7000 series has been playing a defensive game against Nvidia’s GK104 / GeForce 600 family of products. In Bitcoin mining, the situation is almost completely reversed: the Radeon HD 7970 is capable of roughly 550MHash/second, while the GTX 680 is roughly one-fifth as fast.

There’s an article at the Bitcoin Wiki that attempts to explain the difference, but the original piece was written in 2010-2011 and hasn’t been updated since. It refers to Fermi and AMD’s VLIW architectures and implies that AMD’s better performance is due to having far more shader cores than the equivalent Nvidia cards. This isn’t quite accurate, and it doesn’t explain why the GTX 680 is actually slower than the GTX 580 at BTC mining, despite having far more cores. This article is going to explain the difference, address whether better CUDA miners would dramatically shift the performance delta between AMD and Nvidia, and touch on whether Nvidia’s GPGPU performance is generally comparable to AMD’s these days.

Topics not discussed here include:

Bubbles

Investment opportunity

Whether or not ASICs, whenever they arrive (next month, this summer, or at some point in the future), will destroy the GPU mining market.

These are important questions, but they’re not the focus of this article. We will discuss power efficiency and Mhash/watt to an extent, because these factors have an impact on comparing the mining performance of AMD vs. Nvidia.

The mechanics of mining

Bitcoin mining is a specific implementation of the SHA-256 algorithm, run twice over the block header. One of the reasons AMD cards excel at mining is that the company’s GPUs have a number of features that enhance their integer performance. This is actually something of an oddity; GPU workloads have historically been floating-point heavy, because textures are stored in half (FP16) or full (FP32) precision.
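To make the workload concrete, here’s a minimal sketch of the hash at the heart of mining: SHA-256 applied twice to an 80-byte block header. The header contents below are dummy placeholder bytes, not a real Bitcoin block; only the double-hash structure is the point.

```python
import hashlib

def mining_hash(header: bytes) -> bytes:
    """Bitcoin's proof-of-work hash: SHA-256 applied twice to the block header.
    Miners vary a nonce in the header until this digest falls below a target."""
    return hashlib.sha256(hashlib.sha256(header).digest()).digest()

# Illustrative only: a dummy 80-byte header with a nonce in the last 4 bytes.
header = b"\x00" * 76 + (42).to_bytes(4, "little")
digest = mining_hash(header)
print(digest.hex())
```

A mining rig is essentially a machine for running this double hash as many times per second as possible, which is why raw 32-bit integer throughput, not floating-point muscle, decides the race.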

The issue is made more confusing by the fact that when Nvidia started pushing CUDA, it emphasized password cracking as a major strength of its cards. It’s true that GeForce GPUs, starting with G80, offered significantly higher cryptographic performance than CPUs — but AMD’s hardware now blows Nvidia’s out of the water.

The first reason AMD cards outperform their Nvidia counterparts in BTC mining (and the current Bitcoin Wiki entry does cover this) is that the SHA-256 algorithm relies heavily on a 32-bit integer right rotate operation. In a rotation, the integer value is shifted, but the bits that fall off one end are reattached at the other; in a right rotation, bits that fall off the right are reattached at the left. AMD GPUs can do this operation in a single step. Prior to the launch of the GTX Titan, Nvidia GPUs required three steps: two shifts and an add.
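The three-step emulation is easy to show in code. The sketch below mimics what pre-Titan Nvidia hardware had to do (two shifts plus a combine step), and shows one of SHA-256’s sigma functions, which chains three such rotations per call; the function names are ours, not from any particular miner.

```python
def rotr32_shifts(x: int, n: int) -> int:
    """Emulated 32-bit right rotate: two shifts plus a combine step,
    as pre-Titan Nvidia GPUs had to do. AMD hardware (and GK110's
    funnel shifter) collapses this into a single operation."""
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF

def big_sigma0(x: int) -> int:
    # One of SHA-256's sigma functions: three rotates per invocation,
    # which is why the per-rotate instruction count matters so much.
    return rotr32_shifts(x, 2) ^ rotr32_shifts(x, 13) ^ rotr32_shifts(x, 22)

print(hex(rotr32_shifts(1, 1)))  # the low bit wraps around to the top
```

Because SHA-256 performs dozens of these rotations per hash round, paying three instructions instead of one for each rotation compounds into a large overall deficit.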

We say “prior to Titan” because one of the features Nvidia introduced with Compute Capability 3.5 (supported only on the GTX Titan and the Tesla K20/K20X) is a funnel shifter. The funnel shifter can combine operations, significantly shrinking Nvidia’s three-instruction rotate penalty. We’ll look at how much performance improves momentarily, because this isn’t GK110’s only improvement over GK104. GK110 is also capable of up to 64 32-bit integer shifts per clock per SMX (Titan has 14 SMXes). GK104, in contrast, could only handle 32 integer shifts per SMX, and had just eight SMX blocks.

We’ve highlighted the 32-bit integer shift capability difference between CC 3.0 and CC 3.5.

AMD plays things close to the chest when it comes to Graphics Core Next’s (GCN) 32-bit integer capabilities, but the company has confirmed that GCN executes INT32 code at the same rate as double-precision floating point. This implies a theoretical peak INT32 dispatch rate of 64 per clock per CU, double GK104’s base rate. AMD’s other advantage, however, is the sheer number of Compute Units (CUs) that make up one GPU. The Titan, as we’ve said, has 14 SMXes, compared to the HD 7970’s 32 CUs. The number of Compute Units / SMXes may be far more important than the total number of cores in these contexts.
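Plugging in the per-unit rates cited above gives a rough sense of the theoretical gap. This is strictly back-of-the-envelope: it assumes every unit issues an integer shift every clock and ignores clock speeds, scheduling, and memory effects.

```python
# Peak 32-bit integer shifts per clock, using the per-unit rates cited above.
gtx_680 = 8 * 32   # 8 SMXes, 32 int32 shifts per SMX per clock
titan   = 14 * 64  # 14 SMXes, 64 int32 shifts per SMX per clock
hd_7970 = 32 * 64  # 32 CUs, 64 int32 ops per CU per clock

print(f"GTX 680: {gtx_680}, Titan: {titan}, HD 7970: {hd_7970}")
```

Even before accounting for the single-cycle rotate, the HD 7970’s raw unit count gives it more than double Titan’s theoretical integer throughput, and eight times the GTX 680’s.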

Next page: Wrath of the Titan…