What we know about Maxwell

Taking an interesting direction, NVIDIA is releasing the first Maxwell parts today in the form of the GeForce GTX 750 Ti.

I'm going to go out on a limb and guess that many of you reading this review would not normally have been as interested in the launch of the GeForce GTX 750 Ti if a specific word hadn't been mentioned in the title: Maxwell. It's true: the launch of the GTX 750 Ti, a mainstream graphics card that will sit at the $149 price point, marks the first public release of the new NVIDIA GPU architecture code-named Maxwell. It is a unique move for the company to start at this particular point with a new design, but as you'll see in the changes to the architecture as well as its limitations, it all makes a certain bit of sense.

For those of you who don't really care about the underlying magic that makes the GTX 750 Ti possible, you can skip this page and jump right to the details of the new card itself. There I will detail the product specifications, performance comparisons and expectations, and more.

If you are interested in learning what makes Maxwell tick, keep reading below.

The NVIDIA Maxwell Architecture

When NVIDIA first approached us about the GTX 750 Ti, they were very light on details about the GPU that was powering it. Although it was confirmed that the card was built on Maxwell, the company hadn't yet determined whether it was going to do a full architecture deep dive with the press. In the end they landed somewhere between the full detail we are used to getting with a new GPU design and the original, passive stance. It looks like we'll have to wait for the enthusiast-class GPU release to really get the full story, but I think the details we have now paint the picture quite clearly.

During the course of designing the Kepler architecture, and then implementing it in the Tegra line in the form of the Tegra K1, NVIDIA's engineering team developed a better sense of how to improve the performance and efficiency of the basic compute design. Kepler was a huge leap forward compared to the likes of Fermi, and Maxwell is promising to be equally revolutionary. NVIDIA wanted both to reduce GPU power consumption and to find ways to extract more performance from the architecture at the same power levels.

The logic of the GPU design remains similar to Kepler. There is a Graphics Processing Cluster (GPC) that houses Streaming Multiprocessors (SMs), each built from a large number of CUDA cores (stream processors).

GM107 Block Diagram

Readers familiar with the look of Kepler GPUs will instantly see changes in the organization of the various blocks of Maxwell. There are more divisions, more groupings and fewer CUDA cores "per block" than before. As it turns out, this reorganization is a large part of how NVIDIA was able to improve performance and power efficiency with the new GPU.

The biggest changes are seen in the new SMs, now called SMMs (the M denotes Maxwell; the previous Kepler-based SM should be referred to as the SMK), which can deliver 35% more processing power per CUDA core when shader bound. NVIDIA has changed scheduling on the SMM to be more intelligent, avoiding stalls more often than previous implementations did. This also means there is going to be more software-based work for the CPU to handle, but only by a handful of percent, I am told.

These new SMMs were built to improve performance per watt as well as performance per area, goals that all CPU and GPU designers share. NVIDIA was able to address them with changes to the control logic partitioning, workload balancing, clock gating, compiler-based scheduling, instructions per clock and quite a bit more.

Maxwell SMM Diagram

Rather than a single block of 192 shaders, the SMM is divided into four distinct blocks that each have a separate instruction buffer, scheduler and 32 dedicated, non-shared CUDA cores. NVIDIA states that this partitioning simplifies the design and scheduling logic required for Maxwell, saving on area and power. Pairs of these blocks are grouped together and share four texture filtering units and a texture cache. Shared memory is a separate pool that is shared amongst all four processing blocks of the SMM.
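The partitioning described above can be tallied in a few lines of Python. This is only an illustrative sketch: the constant names are hypothetical, and the figures are the ones quoted in the paragraph, not anything taken from NVIDIA tooling.

```python
# Illustrative tally of the SMM layout described in the text.
# Constant names are hypothetical; the figures come from the article.
BLOCKS_PER_SMM = 4          # processing blocks, each with its own scheduler
CORES_PER_BLOCK = 32        # dedicated, non-shared CUDA cores per block
TEX_UNITS_PER_PAIR = 4      # texture filtering units shared by a pair of blocks
KEPLER_CORES_PER_SMK = 192  # Kepler's single block of shaders per SMK

cores_per_smm = BLOCKS_PER_SMM * CORES_PER_BLOCK
block_pairs = BLOCKS_PER_SMM // 2
tex_units_per_smm = block_pairs * TEX_UNITS_PER_PAIR

print(f"CUDA cores per SMM:    {cores_per_smm}")
print(f"Texture units per SMM: {tex_units_per_smm}")
print(f"SMM cores vs. SMK:     {cores_per_smm / KEPLER_CORES_PER_SMK:.0%}")
```

Each SMM therefore carries 128 CUDA cores, two thirds of what an SMK had, which matters when comparing the two designs unit for unit.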

With these changes, the SMM can offer 90% of the compute performance of the Kepler SMK but with a smaller die area that allows NVIDIA to integrate more of them per die. GM107, the first full shipping chip based on Maxwell, includes five SMMs (640 CUDA cores) while the GK107 GPU had two SMKs (384 CUDA cores), giving Maxwell a roughly 2.3x shader performance advantage.
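As a rough sanity check on those numbers, the arithmetic can be sketched as follows. It assumes the 90% figure applies uniformly per SM and ignores clock speed differences, which is a simplification on my part rather than anything NVIDIA has stated; the variable names are hypothetical.

```python
# Back-of-the-envelope check of the quoted shader advantage.
# Assumption: the 90% per-SM figure applies uniformly and clocks are ignored.
SMM_VS_SMK_THROUGHPUT = 0.90  # one SMM relative to one SMK (NVIDIA's claim)
GM107_SMMS = 5
GK107_SMKS = 2
CORES_PER_SMM = 128
CORES_PER_SMK = 192

# Chip-level advantage: five SMMs at 0.9x each against two SMKs.
chip_advantage = GM107_SMMS * SMM_VS_SMK_THROUGHPUT / GK107_SMKS

# Per-core gain: 90% of the throughput from two thirds of the cores.
per_core_gain = SMM_VS_SMK_THROUGHPUT * CORES_PER_SMK / CORES_PER_SMM

print(f"GM107 vs. GK107 shader throughput: {chip_advantage:.2f}x")
print(f"Per-CUDA-core throughput:          {per_core_gain:.2f}x")
```

The chip-level result of 2.25x rounds to the 2.3x quoted above, and the per-core result of 1.35x appears consistent with the 35% per-core improvement NVIDIA cites for shader-bound work.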

Other than the dramatic changes to the SM, the 2 MB L2 cache that NVIDIA has implemented on the first version of Maxwell is the other very substantial change. Considering that the Kepler design had an L2 cache of 256 KB, we are seeing an 8x increase in available capacity, which should dramatically reduce the demand on the integrated memory controller of GM107. Even with only a 128-bit memory interface, then, the GTX 750 Ti should not find DRAM performance to be a bottleneck.

NVIDIA has also improved the video capabilities of Maxwell by enhancing the performance of video encoding by a factor of 2x (users should see even less of a hit on performance when recording video with ShadowPlay now) and decoding by 10x.

A new power state called GC5 has been built to reduce the GPU's power usage during light workloads like video playback. API support is the same for Maxwell as it is for Kepler, meaning that DirectX 11.2 is not fully supported.

GM107 – Maxwell's First Implementation

Though you can likely deduce many of the features of GM107 by looking at the data above, there are still some details about GM107 to share. With a single GPC, five SMM units and two 64-bit memory controllers, NVIDIA assures us this is the full implementation of GM107.

With 128 CUDA cores per SMM and 5 SMMs total, we get 640 total cores, a roughly 67% increase over the 384 cores found in the GK107 Kepler GPU that was in the GeForce GTX 650. To be fair though, the GeForce GTX 650 Ti has 768 CUDA cores, but at nearly 2x the TDP. The base clock of 1020 MHz and Boost clock of 1085 MHz are actually quite reserved; with a modest overclock, clocks were easily touching the 1300 MHz level!

Peak theoretical compute performance hits 1.3 TFLOPS (a 60% increase over GK107) even though memory bandwidth remains essentially the same. This again is why the inclusion of the 2 MB L2 cache is so critical for efficient optimization of the Maxwell architecture.
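For readers who want to see where a figure like 1.3 TFLOPS comes from, here is a minimal sketch of the usual peak-throughput estimate: CUDA cores times two floating point operations per clock (one fused multiply-add) times the base clock. The helper name is hypothetical, and the 1058 MHz clock used for the GK107-based GTX 650 is my assumption, as it is not stated in this article.

```python
def peak_gflops(cuda_cores: int, clock_mhz: float, flops_per_clock: int = 2) -> float:
    """Peak single-precision GFLOPS, counting one FMA as two FLOPs per clock."""
    return cuda_cores * flops_per_clock * clock_mhz / 1000.0

# GM107 in the GTX 750 Ti: 640 cores at the 1020 MHz base clock (from the text).
gm107 = peak_gflops(640, 1020)

# GK107 in the GTX 650: 384 cores; the 1058 MHz clock is an assumption.
gk107 = peak_gflops(384, 1058)

print(f"GM107 peak: {gm107:.1f} GFLOPS")
print(f"GK107 peak: {gk107:.1f} GFLOPS")
print(f"Increase:   {gm107 / gk107 - 1:.0%}")
```

Under those assumptions GM107 lands at about 1306 GFLOPS, in line with the 1.3 TFLOPS figure, and the gap to GK107 works out to roughly 60%.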

GM107 is still built on 28nm process technology from TSMC but increases the die size by 25% over GK107 and uses 43% more transistors. Considering the 60% compute edge Maxwell has over Kepler in this segment, the 25% area change indicates a big focus from NVIDIA on area and performance efficiency. Add to that NVIDIA's ability to get 2x the performance per watt for Maxwell over Kepler on the same 28nm process, and it's easy to be impressed.
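To put the efficiency claim in more concrete terms, the quoted ratios can be combined as below. This is a rough sketch using only the percentages in this article, with hypothetical variable names; it says nothing about real-world game performance.

```python
# Relative gains of GM107 over GK107, as quoted in the text.
compute_gain = 1.60     # 60% more peak compute
area_gain = 1.25        # 25% larger die
transistor_gain = 1.43  # 43% more transistors

compute_per_area = compute_gain / area_gain
compute_per_transistor = compute_gain / transistor_gain

print(f"Peak compute per unit of die area: {compute_per_area:.2f}x")
print(f"Peak compute per transistor:       {compute_per_transistor:.2f}x")
```

By this estimate Maxwell delivers about 1.28x the peak compute per unit of die area and about 1.12x per transistor compared to Kepler in this segment, all on the same 28nm process.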

Future Maxwell GPUs?

If you are wondering where the high-end products based on Maxwell are, you aren't alone. All NVIDIA would tell us for now is that they would arrive "at a later date."