The Coming of General Purpose GPUs

Until the advent of DirectX 10, there was no point in adding undue complexity and enlarging the die area to increase vertex shader functionality and boost the floating point precision of pixel shaders from 24-bit to 32-bit to match the requirement for vertex operations. With DX10's arrival, vertex and pixel shaders shared a large amount of common functionality, so moving to a unified shader architecture eliminated a lot of unnecessary duplication of processing blocks. The first GPU to utilize this architecture was Nvidia's iconic G80.

Four years in development and $475 million produced a 681 million-transistor, 484mm² behemoth -- first as the 8800 GTX flagship and 8800 GTS 640MB on November 8, 2006. An overclocked GTX, the 8800 Ultra, represented the G80's pinnacle and was sandwiched between the launches of two lesser products: the 320MB GTS in February and the limited production GTS 640MB/112 Core on November 19, 2007.

Aided by the new Coverage Sample anti-aliasing (CSAA) algorithm, Nvidia had the satisfaction of seeing its GTX demolish every single and dual-graphics competitor in outright performance. Despite that success, the company dropped three percentage points in discrete graphics market share in the fourth quarter -- points AMD picked up on the strength of OEM contracts.

The remaining components of Nvidia's business strategy concerning the G80 became reality in February and June of 2007. The C-language based CUDA platform SDK (Software Development Kit) was released in beta form to enable an ecosystem leveraging the highly parallelized nature of GPUs. Nvidia's PhysX physics engine, its distributed computing projects, professional virtualization, and OptiX (Nvidia's ray tracing engine) are among the higher profile applications using CUDA.

Both Nvidia and ATI (now AMD) had been integrating ever-increasing computing functionality into the graphics pipeline. ATI/AMD would choose to rely upon developers and committees for the OpenCL path, while Nvidia had more immediate plans in mind with CUDA and high performance computing.

To this end, Nvidia introduced its Tesla line of math co-processors in June, initially based on the same G80 core that had already powered the GeForce and Quadro FX 4600/5600.

After a prolonged development that included at least two (and possibly three) major debugging exercises, AMD released the R600 in May.

Media hype made the launch hotly anticipated as AMD's answer to the 8800 GTX, but what arrived as the HD 2900 XT was largely disappointing. It was an upper-midrange card allied with the power usage of an enthusiast board, consuming more power than any other contemporary solution.

The scale of the R600 misstep had profound implications within ATI, prompting strategy changes to meet future deadlines and maximize launch opportunities. Execution improved with RV770, Evergreen, and the Northern and Southern Islands series.

Along with being the largest ATI/AMD GPU to date at 420mm², R600 incorporated a number of GPU firsts. It was AMD's first DirectX 10 chip, its first and only GPU with a 512-bit memory bus, first vendor desktop chip with a tessellator unit (which remained largely unused thanks to game developer indifference and a lack of DirectX support), first GPU with integrated audio over HDMI support, as well as its first to use VLIW, an architecture that has remained with AMD until the present 8000 series. It also marked the first time since the Radeon 7500 that ATI/AMD hadn't fielded a top tier card in relation to the competition's price and performance.

AMD updated the R600 to the RV670 by shrinking the GPU from TSMC's 80nm process to a 55nm node in addition to replacing the 512-bit bidirectional memory ring bus with a more standard 256-bit. This halved the R600's die area while packing nearly as many transistors (666 million versus 700 million in the R600). AMD also updated the GPU for DX10.1 and added PCI Express 2.0 support, all of which was good enough to scrap the HD 2000 series and compete with the mainstream GeForce 8800 GT and other lesser cards.
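As a rough sanity check on the "halved" figure, the arithmetic can be sketched as follows. This assumes ideal scaling, where area per transistor shrinks with the square of the feature size; real processes rarely scale that cleanly, and the function name is purely illustrative. The die area and transistor counts are the ones quoted in this article.

```python
def ideal_shrink_area(area_mm2, old_nm, new_nm, transistor_ratio=1.0):
    """Estimate die area after a process shrink.

    Assumes ideal scaling: area per transistor shrinks with the
    square of the feature-size ratio. Real nodes scale less cleanly.
    """
    linear = new_nm / old_nm
    return area_mm2 * linear ** 2 * transistor_ratio


# Figures quoted in the text: R600 at 420mm2 on 80nm with 700M transistors,
# RV670 on 55nm with 666M transistors.
r600_area = 420.0
rv670_estimate = ideal_shrink_area(r600_area, 80, 55, transistor_ratio=666 / 700)

print(f"Estimated RV670 area: {rv670_estimate:.0f} mm2")
print(f"Fraction of R600 area: {rv670_estimate / r600_area:.2f}")
```

Under these assumptions the estimate lands a little under half of the R600's area, which is in line with the claim above.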

In the absence of a high-end GPU, AMD launched two dual-GPU cards along with budget RV620/635-based cards in January 2008. The HD 3850 X2 shipped in April and the final All-In-Wonder branded card, the HD 3650, in June. Released with a polished driver package, the dual GPU cards made an immediate impact with reviewers and the buying public. The HD 3870 X2 comfortably became the single fastest card and the HD 3850 X2 wasn't a great deal slower. Unlike Nvidia's SLI solution, AMD instituted support for Crossfiring cards with a common ASIC.

Pushing on from the G80's success, Nvidia launched its G92 as the 8800 GT on October 29 to widespread acclaim from tech sites, mainly due to its very competitive pricing. Straddling the $199 to $249 range, the 512MB card offered performance that invalidated the G80-based 8800 GTS. It mostly bested the HD 2900 XT and the HD 3870 (which launched three weeks after the GT), and generally came within 80% of the GTX. Unsurprisingly, this led to a shortage of 8800 GTs within weeks. Strong demand for Nvidia's new contender and its 8600 GS/GT siblings helped push the company to a 71% discrete market share by year's end.

Hard on the heels of the GT, Nvidia launched the G92-based 8800 GTS 512MB on December 11. While it generally trailed the GT in performance-per-dollar, the GTS's saving grace was its use of better binned GPUs that essentially equalled the GTX and the pricey 8800 Ultra when overclocked.

The story of the GeForce 8 series would not be complete without adding the unfortunate postscript that was the use of high lead solder in the BGA of certain G86, G84, G73, G72/72M GPUs, and C51 and MCP67 graphics chipsets. This, allied with a low temperature underfill, inadequate cooling and an intensive regime of hot/cold cycles, caused an inordinate number of graphics failures.

Nvidia switched to a Hitachi eutectic (high tin) solder, as used by AMD, in mid-2008 and notably changed the single slot reference design of the 8800 GT's cooler, adding more fan blades as well as tweaking the shroud to facilitate higher airflow. The G92 was suspected of being affected by the underfill issue as well, although dual-slot designs on the 8800 GTS 512MB and cards equipped with non-reference coolers didn't seem to be overly affected.

The company absorbed $475.9 million in charges relating to the issue, which resulted in heavy customer backlash toward both Nvidia and laptop OEMs, who had known of the issue for some time before it became public knowledge. Nvidia's place in the industry will be forever linked to this lowest point in its history.

If the 8 series were a technological triumph for Nvidia, the 9 series ushered in a period of stagnation. The highlight of the range was also the first model launched in February 2008. The 9600 GT was based on the "new" G94, which was little more than a cut down G92 of the previous year built on the same 65nm TSMC process.

Aggressive price cuts from AMD on the HD 3870 and HD 3850 along with falling prices from Nvidia's own 8800 GS and GT made the rest of the 9 series reside almost entirely under the rebrand banner.

Initial 9800 GTs were 8800 GT rebadges while the 8800 GTS (G92) morphed into the 9800 GTX. Transitioning to TSMC's 55nm process shaved 20% in area from the G92 and allowed a small bump in clock frequency to produce the 9800 GTX+, the identical OEM GTS 150, as well as the GTS 250 that entered the retail channel fifteen months after the original 8-series card.

Due to the late arrival of the flagship GT200 and the fact that AMD's HD 3870 X2 was now top dog in the single card arms race, Nvidia resorted to the time honored tradition of doubling up on GPUs by sandwiching two 9800 GTs together to make the 9800 GX2. While it won the benchmark race, most observers were quick to notice that selling a dual-9800 GT for the price of three individual 9800 GTs had limited appeal at best.

By June, Nvidia released its GTX 260 and GTX 280 with the GT200 GPU, a 576mm² part that represents the largest production GPU die to date (Intel's Larrabee was estimated at 600-700mm²) and the largest production chip of any kind fabricated by TSMC.

The GT200 reiterated Nvidia's desire to push GPGPU into the spotlight by incorporating dedicated double precision (FP64) and compute hardware into the design. The gaming-oriented architectural changes were more modest, but this didn't stop Nvidia from pricing the 280 at an eye-watering $649 or launching 3D Vision (3D gaming and video) drivers in conjunction with 3D shutter glasses and an IR emitter -- a very expensive package.

Pricing fell dramatically after the HD 4870 and 4850 arrived, with the GTX 280 dropping 38% to $400 and the GTX 260 25% to $299.

AMD responded to the GT200 and G92 with the RV770. The first card, a lower-mainstream HD 4730, launched on June 8, with the mainstream and performance market HD 4850 and 4870 following on the 25th. The launch had lost a measure of impact as specifications leaked and stores began selling the HD 4850 a week before the NDA expired -- a common occurrence now, but less pervasive in 2008.

The 4870 and 4850 became the first consumer graphics cards to use GDDR5 memory, which Nvidia eventually implemented eighteen months later with the GT215-based GT 240.

The HD 4870 and 4850 earned rave reviews for their extensive feature list, including 7.1 LPCM sound over HDMI, general performance and multi-GPU scaling and, of course, the price. The cards' sole drawback was a tendency to produce high local temperatures across the voltage regulation components in reference boards, which caused disproportionate failure rates and lockups -- especially when using burn-in software such as Furmark.

In keeping with the previous generation and the "need" to curtail the GTX 280's two-month reign, AMD released the HD 4870 X2 in August. The card quickly entrenched itself at the top of review benchmark charts in most categories including performance, but also in the category of noise output and heat production thanks to the reference blower fan.

January of 2009 brought only an incremental tweak of Nvidia's line-up when the GT200 transferred to TSMC's 55nm process. The 55nm node was used for the B3 revision chips, which first saw duty in September the previous year as the Core 216 version of the GTX 260. The company offered its GTX 295, which featured two cut down (ROPs and memory bus) GT200-B3s.

The single-GPU variant of the card launched as the GTX 275 in April. So did AMD's reply: a revised RV790XT-powered HD 4890 and the HD 4770 (RV740), which was also AMD's first 40nm card.

The HD 4770, while not a major product in its own right, gave AMD immeasurable experience with TSMC's troubled 40nm process, which produced large variances in current leakage as well as high defect rates due to incomplete connections between metal layers in the GPU die. With this working knowledge, AMD was able to work around the foundry process issues that Nvidia faced with its Fermi architecture -- issues that hadn't presented themselves with Nvidia's initial minuscule 40nm GPUs.

Nvidia rolled out its first 40nm products in July. The entry-level GT216 and GT218 came in the form of the GeForce 205, 210 and GT 220, all of which were OEM products until October when the latter two hit retail. They are only noteworthy for being Nvidia's first DX10.1 cards -- something AMD achieved with the HD 4870/4850 -- as well as improving sound capabilities with 7.1 audio, lossless LPCM audio, bitstreaming of Dolby TrueHD/DTS-HD/DTS-HD MA, and audio over HDMI. The series was aimed at the home theater market and was eventually rebranded as the 300 series in February 2010.

In the four months between September 2009 and February 2010, AMD completed a thorough top to bottom launch of four GPUs (Cypress, Juniper, Redwood and Cedar), which comprised the Evergreen family, starting with the top-tier HD 5870, followed a week later by the upper-midrange HD 5850.

TSMC's troubled 40nm process hit AMD's ability to capitalize on Nvidia's Fermi no-show as heavy demand outstripped supply. This was in large part driven by AMD's ability to time Evergreen's release with Windows 7 and the adoption of DirectX 11.

While DX11 took time to show substantial worth with Evergreen, another feature introduced with the HD 5000 made an immediate impact in the form of Eyefinity, which relies upon the flexibility of DisplayPort to enable as many as six display pipelines per board. These are routed to a conventional DAC or a combination of the internal TMDS transmitters and DisplayPort.

Previous graphics cards generally used a combination of VGA, DVI and sometimes HDMI, all of which needed a dedicated clock source per output. This added complexity, size and pin count to a GPU. DisplayPort negated the need for independent clocking and opened the way for AMD to integrate up to six display pipelines in their hardware, while software remains responsible for providing the user experience. This included bezel compensation and spanning the display across the panels at an optimum resolution.
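To make the software side concrete, here is a minimal sketch of how a spanned desktop resolution could be computed with bezel compensation. It is not AMD's driver logic: the function name and the per-gap `bezel_px` figure are hypothetical, and it only illustrates the idea that compensation adds hidden pixels between adjacent panels so the image lines up across them.

```python
def spanned_resolution(panel_w, panel_h, cols, rows, bezel_px=0):
    """Return the (width, height) of one surface spanning a grid of panels.

    bezel_px is the number of pixels treated as hidden behind each
    bezel gap between adjacent panels, so the image lines up across
    them. The value is a hypothetical per-gap figure.
    """
    if cols < 1 or rows < 1:
        raise ValueError("grid must have at least one panel")
    width = cols * panel_w + (cols - 1) * bezel_px
    height = rows * panel_h + (rows - 1) * bezel_px
    return width, height


# Three 1920x1080 panels side by side, without and with compensation.
print(spanned_resolution(1920, 1080, cols=3, rows=1))               # (5760, 1080)
print(spanned_resolution(1920, 1080, cols=3, rows=1, bezel_px=60))  # (5880, 1080)
# Six panels in a 3x2 grid, matching the six display pipelines mentioned above.
print(spanned_resolution(1920, 1080, cols=3, rows=2, bezel_px=60))  # (5880, 2220)
```

The compensated surface is larger than the sum of the visible panels, which is why bezel compensation slightly raises the rendering load.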

The Evergreen series became class leaders across the board (issues with texture filtering aside), with the HD 5850 and HD 5770 attracting a large percentage of the cost-conscious gamer fraternity and the HD 5870 and dual-GPU HD 5970 providing an unequalled level of performance and efficiency.

Nvidia finally (soft) launched its first Fermi boards six months later on April 12 by way of the GTX 470 and 480. None of the company's dies were fully functional -- as was the case with the following GF104 -- so Fermi's core speeds were rather conservative to curb power usage and memory bandwidth was lower due to Nvidia's inexperience with GDDR5 I/O.

Less than optimal yields on TSMC's 40nm process, which had already caused supply issues for AMD, became greatly magnified due to the GF100 Fermi's die size of 529mm². With die size, yield, power requirement and heat output all being inextricably linked, Nvidia's 400 series paid a high penalty for gaming performance compared to AMD's line-up.
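The link between die size and yield can be illustrated with a simple Poisson defect model. This is a sketch under stated assumptions rather than a reconstruction of TSMC's actual numbers: the defect density is an illustrative guess, the 330mm² comparison die is hypothetical, and the dies-per-wafer formula is a common textbook approximation for a 300mm wafer. Only the 529mm² figure comes from the text.

```python
import math


def gross_dies_per_wafer(die_area_mm2, wafer_diameter_mm=300.0):
    """Common approximation for whole dies on a round wafer."""
    radius = wafer_diameter_mm / 2
    return int(
        math.pi * radius ** 2 / die_area_mm2
        - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    )


def poisson_yield(die_area_mm2, defects_per_cm2):
    """Fraction of dies with zero defects under a simple Poisson model."""
    area_cm2 = die_area_mm2 / 100.0
    return math.exp(-defects_per_cm2 * area_cm2)


def good_dies_per_wafer(die_area_mm2, defects_per_cm2):
    """Gross dies scaled by the modelled defect-free fraction."""
    return gross_dies_per_wafer(die_area_mm2) * poisson_yield(die_area_mm2, defects_per_cm2)


DEFECT_DENSITY = 0.3  # defects per cm2: an illustrative guess, not a TSMC figure

for label, area in (("529 mm2 die (GF100 size)", 529.0),
                    ("hypothetical 330 mm2 die", 330.0)):
    print(f"{label}: {gross_dies_per_wafer(area)} gross dies, "
          f"{poisson_yield(area, DEFECT_DENSITY):.0%} yield, "
          f"{good_dies_per_wafer(area, DEFECT_DENSITY):.0f} good dies per wafer")
```

The absolute numbers depend entirely on the assumed defect density. The point is the relative gap: a larger die yields fewer candidates per wafer and loses a larger share of them to defects.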

Quadro and Tesla variants of the GF100 suffered little in the marketplace, if at all, thanks to an in-place ecosystem within the professional markets. One aspect of the launch that did not disappoint was the introduction of transparency supersampling antialiasing (TrSSAA), which was to be used with the in-place coverage sampled AA (CSAA).

While the GTX 480 was greeted with a tepid response, Nvidia's second Fermi chip, the mainstream GF104 in the GTX 460, was a monumental success. It offered good performance at a great price, with the 192-bit/768MB version running $199 and the 256-bit/1GB version $229. The launch brought a multitude of non-reference and factory overclocked cards, with significant overclocking headroom available due to the conservative reference clocks Nvidia chose to help lower power consumption.

Part of the 460's positive reception stemmed from the muted expectations after the GF100's arrival. The GF104 was speculated to be no more than half a GF100 and would suffer appallingly next to AMD's Cypress GPU. This proved wrong. A second surprise awaited the blogging "experts" as well as AMD when Nvidia launched a refreshed version of the GF100, the GF110, in November.

The updated part achieved what its predecessor couldn't -- namely enabling the whole chip. The resulting GTX 570 and 580 were what the original 400 series was supposed to be.

Barts, the first AMD Northern Islands series GPU, arrived in October. More an evolution from Evergreen, Barts was designed to lower production costs from the Cypress die. Rather than offering a substantial increase in performance, the GPU looked to equal the previous HD 5830 and HD 5850 but saved substantially on GPU size. AMD pared away the stream processor (shader) count, overhauled and reduced the physical size of the memory controller (and the associated lowering of memory speed), and removed the ability to perform double-precision calculations. Barts did, however, have a tessellation upgrade over Evergreen.

While performance increases weren't dramatic, AMD did upgrade facets of the display technology. DisplayPort was pushed to 1.2 (the ability to drive multiple monitors from one port, 120Hz refresh for high resolution displays, and bitstreaming audio), HDMI to 1.4a (3D 1080p video playback, 4K screen resolution), and the company added an updated video decoder with DivX support.

AMD also improved the driver feature set by introducing morphological anti-aliasing (MLAA), a post-processing blur filter whose functionality -- especially at launch -- was extremely hit or miss.

The introduction of the HD 6970 and HD 6950 added a conventional AA mode to the Catalyst driver with EQAA (Enhanced Quality AA), while AMD also implemented embryonic HD3D support, which was flaky at best, and dynamic power management, this time profiled with PowerTune.

Generally speaking, the Cayman parts were better than the first generation Fermi chips. They were supposed to trump the second generation (the GTX 500s) as well but lagged a few percentage points behind, and subsequent driver releases from both camps added further variance.

Cayman's November launch was postponed a month with the HD 6970 and 6950 launching on December 15, and it represented a (brief) departure from the VLIW5 architecture, which ATI/AMD had been using continuously since the R600. The company instead used VLIW4, which dropped the fifth Special Function (or Transcendental) execution unit in every stream-processing block.

This was intended to trim an overabundance of resources devoted to DX9 (and earlier) games while adding a more compute-orientated reorganization of the graphics pipeline.

The integrated graphics of the Trinity and Richland series of APUs are the only other VLIW4 parts, and while AMD's newest graphics architecture is based upon GCN (Graphics Core Next), VLIW5 lives on in the HD 8000 series as rebrands of entry level Evergreen GPUs.

Mirroring the GF100/GF110 progression, the GTX 460's successor -- the GTX 560 Ti -- arrived in January 2011. The GF114-based card featured a fully functional revised GF104, and proved to be as robust and versatile as its predecessor. It offered a myriad of non-reference interpretations with and without factory overclocks.

AMD responded by lowering the cost of its HD 6950 and 6870 immediately, and so the GTX 560 Ti's price/performance advantage disappeared even as reviews were being penned. With mail-in rebates offered by many board partners, the HD 6950 -- particularly the 1GB version -- made a more compelling buy.

Nvidia's second major launch of 2011, more precisely on March 26, started with a bang. The GTX 590 married two fully functional GF110s to a single circuit board. The PR fallout started almost immediately.

The boards were running a driver that didn't enable power limiting to the correct degree, paired with a BIOS that allowed high voltage. This oversight allowed an aggressive overvoltage to start blowing MOSFETs. Nvidia remedied the situation with a more restrained BIOS and driver, but the launch day activities prompted some scathing reviews and at least one popular YouTube video. The GTX 590 achieved no more than performance parity with the two-week-old HD 6990, AMD's own dual card.

With no clear cut winner across the benchmarks, the products stirred up an endless stream of debates across forums, ranging from multi-GPU scaling, stock availability, benchmark relevance and testing methodology to exploding 590s.

The AMD Northern Islands successors, Southern Islands, began a staggered release on January 9, 2012, with the flagship HD 7970. It was the first PCI-E 3.0 card and the first recipient of AMD's GCN architecture built on TSMC's 28nm process node. Only three weeks later, the 7970 was joined by a second Tahiti-based card, the HD 7950, followed by the mainstream Cape Verde cards on February 15. The performance Pitcairn GPU-based cards hit the shelves a month later in March.

The cards were good, but didn't provide earth-shattering gaming improvements over the previous 40nm based boards. This, combined with price tags less competitive than had been an AMD staple since the HD 2000 series, no WHQL drivers for two months and a non-functional Video Codec Engine (VCE), tempered the enthusiasm of many potential users and reviewers.

One bonus of the Tahiti parts was the confirmation that AMD had left a lot of untapped performance available via overclocking. This had been a trade-off between power usage and heat output versus clock speed, but led to a conservative core and memory frequency. The need to maximize yield and an underestimation of Nvidia's Kepler-based GTX 680/670 may also have entered into the equation.

Nvidia continued to diversify its GPU feature set by introducing the Kepler architecture.

In previous generations, Nvidia led with the most complex chip to satisfy the high-end gaming community and to start the lengthy validation process for professional (Tesla/Quadro) models. This approach hadn't served the company particularly well in recent generations, and so it seems the smaller GK107 and the performance-orientated GK104 received priority over the beastly GK110.

The GK107 was presumably required since Nvidia had substantial OEM mobile contracts to fulfill and needed the GK104 for the premium desktop market. Both GPUs shipped as A2 revision chips. Mobile GK107s (GT 640M/650M, GTX 660M) began shipping to OEMs in February and were officially announced on March 22, the same day Nvidia launched its GK104-based GTX 680.

In another departure from Nvidia's recent GPU design, the shader clock ran at the same frequency as the core. Since the GeForce 8 series, Nvidia had employed a shader running at least twice the core frequency -- as high as 2.67 times the core in the 9 series and exactly twofold in the 400 and 500.

The rationale for the change was predicated upon Nvidia shifting focus (in consumer desktop/mobile) from outright performance to performance-per-watt efficiency. More cores running at a slower speed are more efficient for parallel workloads than fewer cores running at twice the frequency. Basically, it was a refinement of the GPU versus CPU paradigm (many cores, lower frequency, high bandwidth and latency versus few cores, high frequency, lower bandwidth and latency).
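The efficiency argument can be sketched with the standard dynamic power relation, where power scales roughly with capacitance times voltage squared times frequency. The model below is a toy under loud assumptions: the linear voltage curve, the shader counts and the relative clock values are all hypothetical, the workload is taken to be perfectly parallel, and leakage and die area cost are ignored.

```python
def relative_voltage(freq):
    """Toy model: supply voltage rises linearly with clock (assumption)."""
    return 0.7 + 0.3 * freq


def dynamic_power(cores, freq):
    """Relative dynamic power, using the standard P ~ C * V^2 * f relation."""
    v = relative_voltage(freq)
    return cores * v ** 2 * freq


def throughput(cores, freq):
    """Relative throughput for a perfectly parallel workload."""
    return cores * freq


few_fast = (512, 2.0)    # hypothetical: 512 shaders at twice the core clock
many_slow = (1024, 1.0)  # hypothetical: 1024 shaders at the core clock

for name, (cores, freq) in (("few/fast", few_fast), ("many/slow", many_slow)):
    perf = throughput(cores, freq)
    power = dynamic_power(cores, freq)
    print(f"{name}: throughput {perf:.0f}, power {power:.0f}, "
          f"perf/watt {perf / power:.2f}")
```

Both configurations deliver the same throughput in this model, but the slower, wider one draws noticeably less power because it avoids the voltage increase needed to sustain the doubled clock. The trade-off is that the extra shaders cost die area, which this sketch does not account for.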

Reducing the shader clock also has the advantage of lowering power consumption and Nvidia further economised on design by drastically reducing the die's available double precision units, as well as reducing the bus width to a more mainstream 256-bit. These changes, along with a relatively modest base core speed augmented by a dynamic boost feature (overclock on demand), presented a much more balanced product -- albeit at the expense of compute ability. Yet if Nvidia had kept Fermi's compute functionality and bandwidth design, it would have been ridiculed for producing a large, hot, power-hungry design. The laws of physics yet again turned chip design into an art of compromise.

Once again, Nvidia produced a dual GPU board. Because of the GK104's improved power envelope, the GTX 690 is essentially two GTX 680s in SLI. The only distinction is that the 690's maximum core frequency (boost) is 52MHz lower. While performance is still at the whim of the driver's SLI profiling, the card's functionality is first rate and its aesthetics worthy of the limited edition branding it wears.

The GK110 marks a departure from Nvidia's usual practice of launching a GPU first under the GeForce banner. First seen as the Tesla K20, the card was requested in large numbers for supercomputing contracts, with over 22,000 required for ORNL's Cray XK7 Titan, NCSA's Blue Waters, and the Swiss CSCS Todi and Piz Daint systems.

Consumers had to wait six months before the GK110 arrived as a GeForce. The card was dubbed GTX Titan, and the lack of a numerical model number reinforces Nvidia's desire to see it as a model separate from the existing (and likely following) Kepler series. At $999, Titan is aimed at ultra-enthusiasts and benchmarkers. Nvidia also widened the appeal to researchers and professionals on a budget, as it marks the first time that the company has allowed a GeForce card to retain the same compute functionality as its professional Tesla and Quadro brethren.

The card quickly assumed top dog status in gaming benchmarks, especially evident in multi-monitor resolutions with super-sampled antialiasing applied. However, Nvidia's indifferent OpenCL driver support and a surge of recent gaming titles allied with AMD's Gaming Evolved program tempered the Titan's impact as much as its exorbitant price tag.

June saw AMD play "me too" by offering the HD 7970 GHz Edition -- a 75MHz jump in core frequency with a further 50MHz boost available (as opposed to the dynamically-adjusted version offered by Nvidia). The GHz Edition represented the frequency the card probably should have started with in January.

Unfortunately for AMD, the market this SKU targeted had already determined that the standard model was generally capable of the same (if not better) performance via overclocking at a substantially lower price and lower core voltage. AMD followed the HD 7970 GHz Edition with the HD 7950 Boost.

Present and Future of PC Graphics, In a Nutshell

So far, 2013 has seen Nvidia and AMD battle over a PC graphics discrete market share that is incrementally shrinking as game development and screen resolution fail to match the strides integrated graphics are making.

In early 2002, Intel had a 14% PC graphics market share. With the arrival of its Extreme Graphics (830 to 865 chipsets), the company's share rose to 33%, then to 38% with the third and fourth generation DX 9 chipsets, and now to more than 50% with the DX10 GMA 4500 series. Integrating a GPU into the CPU means that Intel is now responsible for shipping around 60% of PC graphics.

| Vendor | Market share this quarter | Market share last quarter | Unit change quarter to quarter | Share difference quarter to quarter | Market share last year |
| --- | --- | --- | --- | --- | --- |
| AMD | 19.7% | 21.0% | -13.6% | -1.2% | 24.8% |
| Intel | 63.4% | 60.0% | -2.9% | 3.4% | 59.2% |
| Nvidia | 16.9% | 18.6% | -16.7% | -1.73% | 15.7% |
| Via/S3 | 0.0% | 0.4% | -100% | 0.0% | 0.4% |
| Total | 100.0% | 100.0% | -8.2% | | 100.2% |

The need for new graphics products becomes less apparent with every successive generation. Most titles are based on a ten-year-old API (DX 9 became available in December 2002) so image enhancements in games are becoming less focused on GPU load and more on post-processing filtering -- something that is unlikely to change even with DX11-compliant next-generation consoles. Reliance upon rasterization will continue as ray tracing proves to be a difficult nut to crack.

All this unfortunately points to hardware junkies having less to tinker with in the future unless there is a fundamental evolution in game engines or the availability of affordable ultra-high resolution displays. Whichever way things go in the coming months and years, rest assured, we will continue to review upcoming GPUs on TechSpot.