As we noted in a story this past weekend, some curious and technically minded owners of GeForce GTX 970 graphics cards have noticed unexpected behavior with regard to that card’s use of memory.

Users first noticed that the GTX 970 appears to allocate less of its available memory than the GeForce GTX 980 does, despite the fact that both cards have 4GB of onboard RAM. Specifically, the GTX 970 doesn’t seem to make use of its last 512MB of memory as often as the 980 does. Users then found, using directed tests built with GPU computing development tools, that the GTX 970 can access that last chunk of onboard RAM, but only at much lower transfer rates.

The question was: why? We know that the GeForce GTX 970 and 980 are based on the same GPU silicon, the GM204 chip, but they have different configurations. All of the GM204’s graphics processing units are enabled on the GTX 980, while three of the chip’s 16 shader multiprocessor (SM) units are disabled on the GTX 970. Since not every chip comes out of the fab perfect, chip companies often disable faulty portions of their chips and build cheaper products around them. Graphics processors tend to be massively parallel, so a GPU with a fraction of its units turned off could still power a viable and compelling product.

Could it be that the way Nvidia disabled portions of the GTX 970 was causing its unusual memory access behavior?

This story brewed for a while until Nvidia released a statement this past Saturday with a brief explanation of the issue. The statement read, in part:

“The GeForce GTX 970 is equipped with 4GB of dedicated graphics memory. However the 970 has a different configuration of SMs than the 980, and fewer crossbar resources to the memory system. To optimally manage memory traffic in this configuration, we segment graphics memory into a 3.5GB section and a 0.5GB section. The GPU has higher priority access to the 3.5GB section. When a game needs less than 3.5GB of video memory per draw command then it will only access the first partition, and 3rd party applications that measure memory usage will report 3.5GB of memory in use on GTX 970, but may report more for GTX 980 if there is more memory used by other commands. When a game requires more than 3.5GB of memory then we use both segments.”

The statement then went on to explain that the overall performance impact of accessing that last half-gigabyte section of RAM ought to be fairly minor. Nvidia provided a few examples comparing performance in RAM-constrained scenarios versus the GTX 980.

This revelation touched off a storm of discussion and speculation in the comments to our story and elsewhere, with folks wondering whether the GTX 970 was somehow broken or subject to a hardware bug.

To clear the air, Nvidia Senior VP of Hardware Engineering Jonah Alben spoke with us yesterday evening. Alben’s primary message to us was straightforward. He said the GTX 970 is “working exactly as we designed it.” He assured us the GTX 970 does have a full 4GB of memory and claimed “we do use it when we need to.”

Alben then explained that the GTX 970’s unusual memory access behavior is a consequence of a new feature built into Maxwell-class GPUs. That feature has to do with how Nvidia disables faulty portions of its chips when needed. Alben said the feature allows Nvidia to make a better product than it could have otherwise.

To help us understand this feature and how it impacts the GeForce GTX 970, Alben took us on a brief tour of the guts of the GM204 GPU. That tour relies pretty heavily on a simplified diagram of the chip that Alben provided us, which we’ve embedded below.

Across the top of the diagram are the shader multiprocessors or SMs, where most of the graphics computational work happens. The bottom half of the diagram shows the chip’s memory config. The GPU has four memory partitions, each with two 32-bit links to external DRAM chips. Each memory partition is split into two chunks, and each chunk has its own section of L2 cache, memory controller and so on.

The middle of the diagram depicts the crossbar that facilitates communication between the shader and memory arrays. You can think of this crossbar as a switched fabric, much like an Ethernet switch, that allows any SM to talk with any L2 cache and memory controller.

Since the diagram depicts a GTX 970, three of the chip’s SMs have been grayed out, along with one of the two L2 cache sections in the memory partition on the far right. The GTX 980 has all of these units enabled.

In the prior generation of Kepler-derived GPUs, Alben explained, any chips with faulty portions of L2 cache would need to have an entire memory partition disabled. For example, the GeForce GTX 660 Ti is based on a GK104 chip with several SMs and an entire memory partition inactive, so it has an aggregate 192-bit connection to memory, down 64 bits from the full chip’s capabilities.

Nvidia’s engineers built a new feature into Maxwell that allows the company to make fuller use of a less-than-perfect chip. In the event that a memory partition has a bad section of L2 cache, the firm can disable the bad section of cache. The remaining L2 cache in the memory partition can then service both memory controllers in the partition thanks to a “buddy interface” between the L2 and the memory controllers. That “buddy interface” is shown as active, in a dark, horizontal arrow, in the bottom right memory partition on the diagram. In the other three memory partitions, this arrow is grayed out because the “buddy” interface is not used.

Thanks to this provision, Nvidia is able to equip the GeForce GTX 970 with a full 256-bit memory interface and still ship it at an attractive price in high volumes. Nvidia still ships some chips with both L2s in a memory partition disabled, like the GeForce GTX 970M for laptops, but Alben said “we have much fewer of those now.” So Nvidia is keeping more hardware functional on more chips thanks to this optimization. Still, this GPU configuration has some consequences we didn’t understand entirely when the card was first released.

For one, the GTX 970 lacks some of the cache capacity and ROP throughput we initially believed it had. Each L2 cache section on the GM204 has an associated ROP partition responsible for blending fragments into pixels and helping with multisampled antialiasing. With one of its L2s disabled, the GTX 970 has only 56 pixels per clock of ROP throughput, not the 64 pixels per clock of ROP throughput specified in the card’s initial spec sheets. (In an even crazier reality, that limit isn’t even the primary fill rate constraint in this product, since the GTX 970’s shader arrays can only send 52 pixels per clock onto the crossbar.) Also, the GTX 970’s total L2 cache capacity is 1792KB, not 2048KB, since one 256KB cache section is disabled.
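The arithmetic behind those figures is easy to check. The sketch below is only an illustration: the per-unit rates (8 pixels per clock per ROP partition, 4 pixels per clock per SM) are inferred by dividing the article's totals evenly across units, and the function and constant names are made up for this example rather than taken from any Nvidia documentation.

```python
# Per-unit figures below are assumptions inferred from the totals quoted
# in the article, not published per-unit specifications.
L2_KB_PER_SECTION = 256        # the article cites one disabled 256KB section
ROP_PIXELS_PER_SECTION = 8     # assumption: 64 pixels/clock spread over 8 L2 sections
SM_PIXELS_PER_CLOCK = 4        # assumption: 52 pixels/clock spread over 13 SMs


def gm204_config(enabled_sms, enabled_l2_sections):
    """Return L2 capacity and pixel throughput for a GM204 configuration."""
    rop_rate = enabled_l2_sections * ROP_PIXELS_PER_SECTION
    sm_rate = enabled_sms * SM_PIXELS_PER_CLOCK
    return {
        "l2_kb": enabled_l2_sections * L2_KB_PER_SECTION,
        "rop_pixels_per_clock": rop_rate,
        "sm_pixels_per_clock": sm_rate,
        # The slower of the two stages limits pixel fill rate.
        "fill_limit_pixels_per_clock": min(rop_rate, sm_rate),
    }


gtx_980 = gm204_config(enabled_sms=16, enabled_l2_sections=8)
gtx_970 = gm204_config(enabled_sms=13, enabled_l2_sections=7)

print("GTX 980:", gtx_980)
print("GTX 970:", gtx_970)
```

Under these assumptions, the GTX 970 comes out at 1792KB of L2 and 56 pixels per clock at the ROPs, with the 52 pixels per clock coming off the shader arrays as the tighter limit.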

Alben frankly admitted to us that Nvidia “screwed up” in communicating the GTX 970’s specifications in the reviewer’s guide it supplied to the press. He said that the team responsible for this document wasn’t aware that the GTX 970 takes advantage of Maxwell’s ability to selectively disable a portion of L2 cache.

More intriguing is the question of how well the GeForce GTX 970 makes use of its available memory. That question has two related dimensions, bandwidth and capacity.

To understand the bandwidth issue, first notice a constraint imposed by the disabling of an L2 cache section in the diagram above: the crossbar link into that section of L2 is also disabled. Although the GTX 970’s path to memory remains 256 bits wide in total, the width of the crossbar link is reduced. The two memory controllers in that partition must share a single crossbar link, backed by a single L2 cache. Access to the half-gigabyte of DRAM behind at least one of those two memory controllers will be slower at peak than elsewhere in the system. That 512MB of memory could still be potentially useful, but it’s a bit problematic. Nvidia must work around this limitation.

Now, consider how a full-fledged GM204 like the GTX 980 actually takes advantage of all of the memory bandwidth available to it. If it were to store its data contiguously in one big blob, the GM204 could only ever read or write data at the speed of a single DRAM chip or crossbar link. Instead, to achieve its full bandwidth potential, the GPU must distribute the data it stores across multiple DRAMs, so it can read or write to them all simultaneously. Alben’s diagram above indicates the GM204 has a 1KB stride. In other words, it stores 1KB of data in the first DRAM, then stores another 1KB in the next DRAM, and so on across the array. On the GTX 980, the chip strides across eight DRAMs and then wraps back around. Operations that read or write data sequentially should take advantage of all eight memory channels and achieve something close to the GPU’s peak transfer rate.
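To make the striding idea concrete, here is a minimal sketch of a 1KB stride across eight DRAM channels. It assumes a simple round-robin mapping from address to channel; the real hardware mapping is not described in that detail, and the function names are hypothetical.

```python
STRIDE_BYTES = 1024   # the 1KB stride shown in Alben's diagram
NUM_CHANNELS = 8      # eight DRAMs on a fully enabled GM204


def channel_for_address(addr, num_channels=NUM_CHANNELS, stride=STRIDE_BYTES):
    """Assumed round-robin mapping: each 1KB block goes to the next channel."""
    return (addr // stride) % num_channels


def channels_touched(start, length, num_channels=NUM_CHANNELS, stride=STRIDE_BYTES):
    """Set of channels a sequential access of `length` bytes (> 0) would hit."""
    first_block = start // stride
    last_block = (start + length - 1) // stride
    return {block % num_channels for block in range(first_block, last_block + 1)}


# A small access stays on one DRAM; an 8KB sequential access hits all eight.
print(channels_touched(0, 512))
print(sorted(channels_touched(0, 8 * STRIDE_BYTES)))
```

The point of the exercise: any sequential transfer of 8KB or more touches every channel, which is why large reads and writes can approach the GPU's peak transfer rate.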

For the GTX 970, the Maxwell team had to get creative in order to prevent the slower half-gig of DRAM from becoming a problem. Their answer was to split the GTX 970’s memory into two segments: a large, fast 3.5GB segment and a smaller, slower 512MB segment. These two segments are handled very differently. The 3.5GB segment includes seven memory controllers, and the GPU strides across all seven of them equally. When the GPU is accessing this segment of memory, it should achieve 7/8 of its peak potential bandwidth, not far at all from what one would see with a fully enabled GM204. However, transfer rates for that last 512MB of memory will be much slower, at 1/8 of the card’s total potential.
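The two-segment layout can be sketched the same way. This is a simplified model, not Nvidia's actual address mapping: it assumes the fast segment occupies the first 3.5GB of a flat address space, that controller 7 is the one behind the slow segment, and that the card's peak bandwidth is 224 GB/s (the GTX 970's published figure, which does not appear in this article).

```python
MB = 1024 ** 2
STRIDE_BYTES = 1024
FAST_CONTROLLERS = 7                   # controllers striped in the 3.5GB segment
SLOW_CONTROLLER = 7                    # assumption: index of the eighth controller
FAST_BYTES = FAST_CONTROLLERS * 512 * MB   # 3.5GB
TOTAL_BYTES = 8 * 512 * MB                 # 4GB
PEAK_GBPS = 224.0                      # assumption: published GTX 970 peak bandwidth


def controller_for_address(addr):
    """Map a flat address to a memory controller under the assumed layout."""
    if not 0 <= addr < TOTAL_BYTES:
        raise ValueError("address outside the card's 4GB of memory")
    if addr < FAST_BYTES:
        # Fast segment: 1KB stride across seven controllers.
        return (addr // STRIDE_BYTES) % FAST_CONTROLLERS
    # Slow segment: everything lands on the one remaining controller.
    return SLOW_CONTROLLER


def segment_peak_gbps(addr):
    """Peak bandwidth available to the segment that holds `addr`."""
    per_controller = PEAK_GBPS / 8
    controllers = FAST_CONTROLLERS if addr < FAST_BYTES else 1
    return controllers * per_controller


print(segment_peak_gbps(0))            # fast segment
print(segment_peak_gbps(FAST_BYTES))   # slow segment
```

With the assumed 224 GB/s peak, the 7/8 and 1/8 fractions work out to 196 GB/s for the fast segment and 28 GB/s for the slow one.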

So, as Alben explains it, Nvidia’s hardware config is intended to behave like some GTX 970 owners have measured: fast in the first 3.5GB of memory and slower after that.

With a conventional GPU config like the GTX 980, Alben notes, Nvidia reports two memory segments to the operating system: the GPU’s own RAM and the additional system memory available over PCI Express. With the GTX 970, Nvidia reports two separate segments of GPU memory, first the faster 3.5GB chunk and then the slower 512MB one, along with some “hints” telling the OS to prefer the larger, faster segment where possible. As a result, the OS should treat the GTX 970’s memory hierarchically with the correct preference: first the faster segment, then the slower segment. Should the application’s memory needs exceed 4GB in total, it will spill into PCIe memory, just as it would on the GTX 980.
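That preference order can be illustrated with a toy first-fit allocator. Treat this as a sketch only: the segment names, the whole-allocation placement policy, and the function itself are assumptions for illustration, and real OS and driver memory management is considerably more involved.

```python
# Segment capacities in MB, listed in order of preference.
SEGMENTS_970 = [("fast VRAM", 3584), ("slow VRAM", 512)]
SEGMENTS_980 = [("VRAM", 4096)]
SPILL = "system memory over PCIe"


def place_allocations(sizes_mb, segments):
    """Place each allocation in the first preferred segment with room.

    Hypothetical policy: whole allocations only, no eviction or migration.
    """
    free = [capacity for _, capacity in segments]
    placements = []
    for size in sizes_mb:
        for i, (name, _) in enumerate(segments):
            if free[i] >= size:
                free[i] -= size
                placements.append(name)
                break
        else:
            # Nothing on the card has room, so spill over PCI Express.
            placements.append(SPILL)
    return placements


requests = [3000, 500, 400, 100, 200]   # 4200MB in total, more than either card holds
print(place_allocations(requests, SEGMENTS_970))
print(place_allocations(requests, SEGMENTS_980))
```

In this toy run, the GTX 970 fills its fast segment first, places later requests in the slow segment, and only then spills to PCIe, while the GTX 980 keeps everything in its single segment until that runs out.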

Incidentally, Alben told us that this arrangement helps explain the behavior some folks have noted where the GTX 980 appears to use more of its total memory capacity than the GTX 970 does. Some of the data stored in video RAM during normal operation falls into a gray area: it’s been used at some point, but hasn’t recently been accessed and may not be used again. That data’s presence in RAM isn’t strictly necessary for the current work being done, but it could prove useful at some point in the future. Rather than evict this data from memory, the GPU will keep it there if that space in RAM isn’t otherwise needed. On the GTX 980, with a single memory segment, this “cold” data won’t be ejected until memory use approaches the 4GB limit. On the GTX 970, this same cold data is ejected at 3.5GB. Thus, when doing the same work, the GTX 970 may use less of its total RAM capacity as a matter of course than the GTX 980 does. Again, according to Alben, this behavior is part of Nvidia’s design. The GTX 970 still has access to the full 4GB of RAM if it’s strictly needed.
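A rough model of that cold-data behavior shows why monitoring tools report different numbers on the two cards. It assumes, purely for illustration, that cold data is kept only while total residency stays at or below an eviction threshold (4096MB on the GTX 980, 3584MB on the GTX 970); the function name and the hot/cold split are hypothetical.

```python
TOTAL_MB = 4096
THRESHOLD_980_MB = 4096   # single segment: cold data stays until the 4GB limit
THRESHOLD_970_MB = 3584   # cold data is ejected at the 3.5GB boundary


def resident_mb(hot_mb, cold_mb, eviction_threshold_mb, total_mb=TOTAL_MB):
    """Memory reported in use: hot data always stays, cold data only if it fits
    under the eviction threshold (assumed policy)."""
    hot = min(hot_mb, total_mb)
    cold_kept = max(0, min(cold_mb, eviction_threshold_mb - hot))
    return hot + cold_kept


# Same workload on both cards: 3000MB actively used, 800MB of cold data.
print(resident_mb(3000, 800, THRESHOLD_980_MB))
print(resident_mb(3000, 800, THRESHOLD_970_MB))
```

With 3000MB of hot data and 800MB of cold data, this model reports 3800MB in use on the GTX 980 but only 3584MB on the GTX 970, even though both cards are doing the same work.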

At this point, I had a simple question for Alben: would it have been better just to make the GTX 970 a 3.5GB card? After all, having that slow half-gig of RAM onboard seems a little iffy, doesn’t it? His response was that he doesn’t think so, because that half-gig of memory is useful. You’re better off spilling into the final 512MB of onboard memory than spilling over to PCI Express, which is slower still.

Also, Alben noted, with “good heuristics,” Nvidia is able to put data that’s not likely to be used as often into this half-gig segment. In other words, Nvidia’s driver developers may already be optimizing the way their software stores data in the GTX 970’s upper memory segment.

One of the big questions about all of this madness is what happens when the GTX 970 does have to spill into its last half-gig of memory. How much of a performance slowdown does one actually see compared to a GTX 980 based on a full-fledged GM204? So far, the firm has offered some benchmarks showing performance deltas of about one to three percent. In other words, when stepping up into a scenario where substantially more than 3.5GB of VRAM is used, the GTX 970 suffers very slightly more than the GTX 980 does.

Alben told us Nvidia continues to look into possible situations where the performance drop-offs are larger on the GTX 970, and he suggested that in those cases, the company will “see if we can improve the heuristics.” In short, Nvidia is taking responsibility for managing the GTX 970’s funny VRAM config, and it’s possible that any problems that users turn up could be worked around via a driver update. In the end, that means the GTX 970’s performance may be a little more fragile than the 980’s, but Nvidia has a pretty good track record of staying on top of these kinds of things. This isn’t nearly the chore that, say, maintaining SLI profiles must be.

As our talk ended, Alben reiterated that the Maxwell team is pleased with the GTX 970 in its current form. “We’re proud of what we built, think it’s a great product. We thought this feature would make it a better product, and we think we achieved that goal. We want to make sure people understand it well.”