Intel Turns Graphics up to (Gen) 11

Mainstream Ice Lake GPUs Boost Compute by 2.6x, Break 1Tflop/s

January 21, 2019

By David Kanter

The Gen11 architecture is Intel’s latest step toward massively boosting the performance of its integrated graphics. The upcoming 10nm Ice Lake family will include the company’s first mainstream GPU with 1Tflop/s of single-precision compute capability. At the same time, Gen11 increases power efficiency through clever architectural features and optimized fixed-function hardware for the latest media formats. The new GPU will be Intel’s first to reach production since Gen9 debuted in the 2015 Skylake processors. The company developed a Gen10 GPU for Cannon Lake, but 10nm-process delays sank that product (see MPR 6/11/18, “Cannon Lake Misfires”). As a result, AMD’s integrated Radeon GPU outmatches Intel’s current integrated GPUs for mainstream processors.

Intel has tweaked each graphics generation for greater scalability. Gen11’s large performance boost comes from a repartitioned hierarchy of shader cores and memories. The company has also accelerated its adoption of new architectural features. For example, Skylake’s Gen9 GPU was first to employ full conservative rasterization, beating AMD and Nvidia to market (see MPR 10/19/15, “Skylake Scales Graphics, Fixes Media”). Although Intel invented coarse pixel shading, the 10nm delays prevented Gen11 from being the industry’s first architecture with this feature. Nevertheless, the new GPU adds this capability along with tile-based rendering.

The company is continuing to emphasize media processing by adding fixed-function support for new formats such as VP9 and several HDR varieties, along with greater throughput and power efficiency. On the display side, it boosted the pixel output to drive larger displays and adopted variable refresh rates to operate displays more intelligently in conjunction with the GPU.

Because of its PC market share, Intel’s mainstream graphics define the industry’s performance floor. Gen11 is scheduled to arrive in Ice Lake in 4Q19, delivering more than 1Tflop/s—a 10x increase from the first on-die GPUs in 2011—as well as a tremendous improvement in programmability and other features. Ice Lake should also reach graphics parity with AMD, and higher-performance versions will once again nip at the low end of the discrete-graphics market.

Greater Geometry and Swelling Shaders

Intel organized its graphics architecture into three resource tiers with varying degrees of replication. As Figure 1 illustrates, the so-called unslice contains the interface to the fabric, geometry fixed functions, and media fixed functions. Although the company can increase the dedicated hardware for media encoding and decoding to boost performance (e.g., for a GT3 model), the other unslice components are fixed. Generally, performance within a graphics family (e.g., Gen11) varies by changing the slice count, thus altering the shader-array size. Each slice includes caches and other slice-common resources along with subslices, which contain the GPU’s execution units (EUs, also called shader cores). The display engine resides in the system agent, on the other side of the ring bus.

Figure 1. Gen11 GT2 architecture. The Gen11 graphics and media pro­cessor divides into one unslice and one or more slices that each contain a slice com­mon and multiple subslices. The GT2 variant shown here comprises a single slice and eight subslices, providing vastly more resources than Gen9 GT2 and roughly the same as Gen9 GT4.

The geometry hardware and the slices execute graphics and compute workloads written in APIs such as DirectX and OpenCL. Media workloads (e.g., Quick Sync Video and Intel’s Media SDK) predominantly use the video-quality-enhancement engine (VQE), multiformat-codec (MFX) engine, and scaler and format conversion (SFC), although some portions may execute on the slices. The display engine outputs the actual video, and the unslice contains a blitter and some other display-related functions.

Geometry hardware has long been a limitation for Intel. Gen9 could only resolve four vertex attributes per clock, whereas Gen11 can resolve six. Other geometry functions, such as hull shading, have improved as well.

Intel reorganized the Gen11 shader array to boost the ratio of shared resources to compute resources, increasing scalability and enabling larger GPUs. The overall result is that Gen11 GT2, which sets baseline performance for the Ice Lake client family, should deliver better performance than Skylake’s GT4. It offers 1,024 single-precision flops per clock—a huge 2.6x performance leap relative to a baseline Skylake GT2. Various parts of the unslice, such as the dedicated HEVC hardware, also garnered enhancements to boost performance.
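
The flops arithmetic is easy to check. The following is a minimal sketch, assuming each EU contains two 4-wide SIMD FP32 ALUs and counting a fused multiply-add (FMA) as two flops:

```python
# Peak FP32 throughput per clock for Gen-style GPUs (assumed EU layout:
# two 4-wide SIMD FP32 ALUs per EU, FMA counted as two flops).

def fp32_flops_per_clock(eus, alus_per_eu=2, simd_width=4, flops_per_lane=2):
    """Peak single-precision flops per clock."""
    return eus * alus_per_eu * simd_width * flops_per_lane

gen9_gt2 = fp32_flops_per_clock(24)   # 24 EUs -> 384 flops/clock
gen11_gt2 = fp32_flops_per_clock(64)  # 64 EUs -> 1,024 flops/clock

print(gen11_gt2)             # 1024
print(gen11_gt2 / gen9_gt2)  # ~2.67x, matching the quoted 2.6x gain
# At a shader clock near 1GHz, 1,024 flops/clock crosses 1Tflop/s.
```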

Big Slices for a Beefy Dish

Intel rebalanced resources within the shader array, increasing the size of a slice from three subslices to eight. As Figure 2 shows, the slice includes common resources and eight subslices. By contrast, the Skylake generation offers one to three slices (corresponding to GT2, GT3, and GT4), and each slice has three subslices. For both Gen11 and Gen9, each slice contains common functions such as a rasterizer, L3 cache, shared local memory, and render-output pipeline. Each subslice comprises a shared instruction cache, eight EUs, a texture cache, and a texture/media-sampling pipeline. Low-end configurations (e.g., GT1) can reduce the number of EUs per subslice to improve yield.

Figure 2. Gen11 slice design. The slice common now includes an L3 cache and a separate shared local memory to boost bandwidth. The slice has eight subslices, up from three in the previous generation.

As alluded to earlier, Intel also enhanced the slice common, particularly the L3 cache. In Gen9, this cache is a massively banked array allocated to and shared among several logical data structures: the L3 data cache, the shared local memory (SLM), and the unified return buffer (URB). Each Gen9 L3 instance is 768KB and comprises four banks that can simultaneously read and write 64 bytes. Contention among the logical functions sharing the L3 limited scalability in previous generations.

For Gen11, the company repartitioned the slice-common memory among functions. The L3 data cache, the URB, and a new tile cache (for tile-based rendering) still share the L3 cache, but it’s much larger at 3MB and offers greater bandwidth thanks to its eight banks. Additionally, the SLM is now a separate 512KB structure with eight banks (one per subslice), reducing contention and boosting performance.

Although the overall subslice architecture is similar to Skylake’s (Gen9) and Broadwell’s (Gen8), Intel revisited the EUs in the shader array. For example, in Gen11, only one of the two 128-bit SIMD ALUs executes integer operations, reducing throughput for integer workloads but still allowing full-rate address calculations. The company also moved away from hardware support for double-precision FP in favor of lower-overhead emulation. Similarly, it redesigned the SIMD-ALU interfaces for efficiency. Overall, the changes to the Gen11 EU reduced the area by about 25% (isoprocess) compared with Gen9—a considerable improvement, especially when coupled with the shrink to Intel’s 10nm process.

Putting it all together, Gen11 GT2 is far brawnier than the prior generation, as Table 1 shows, packing 64 EUs compared with 24 in Gen9 GT2. As previously mentioned, the vertex rate increased by 50%, the bandwidth from the L3 and SLM by 8x, and the fill rate by 2x. In addition to achieving 1Tflop/s on single-precision FP, the new GPU can double that rate on half-precision operations. Whereas previous generations provide meager double-precision FP support, Gen11 drops this capability entirely, requiring software emulation for these operations. Integer compute improves by only 33% over Gen9.
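
The integer figure follows directly from the EU changes. A quick sanity check, assuming Gen9 executes integer operations on both 4-wide ALUs per EU while Gen11 restricts them to one:

```python
# Integer throughput per clock: Gen9 uses both SIMD ALUs per EU for
# integer ops; Gen11 uses only one of the two (assumed 4-wide ALUs).

def int_ops_per_clock(eus, int_alus_per_eu, simd_width=4):
    return eus * int_alus_per_eu * simd_width

gen9_int = int_ops_per_clock(24, int_alus_per_eu=2)   # 192 ops/clock
gen11_int = int_ops_per_clock(64, int_alus_per_eu=1)  # 256 ops/clock

print(gen11_int / gen9_int - 1)  # ~0.33: the 33% integer gain
```

So despite packing 2.67x the EUs, halving per-EU integer throughput leaves only a modest net improvement.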

Table 1. Comparison of Intel GT2 graphics generations. N/A=not applicable. Since 2012, the company has increased GPU per-cycle performance by 4x. *Plus 1,024 bytes/cycle for SLM. (Source: Intel)

Throwing Tiles Into the Bins Saves Bandwidth

Intel also built a more intelligent graphics pipeline relative to prior generations. In particular, Gen11 employs two techniques to increase efficiency: tile-based rendering (TBR) and coarse pixel shading (CPS). Conventional immediate-mode rendering (IMR) draws triangles for an entire screen as soon as they arrive and converts them into pixel fragments for subsequent shading. This approach is relatively simple, especially for a GPU that’s rasterizing one triangle at a time. But IMRs tend to access memory on the basis of submission order (which is unpredictable) and can consume excessive bandwidth writing out the pixel fragments.

Conceptually, TBR divides the screen into tiles and bins incoming triangles by tile before rasterization. Each bin is rasterized, shaded, and written out to memory. TBR adds latency, but it creates spatial locality and enables on-chip memory to cache more data, reducing memory-bandwidth usage. Gen11’s TBR works with variable-size tiles that reside in the L3’s tile cache; typical tile sizes are 256x256 and 512x512 pixels.
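
To make the binning step concrete, here is a minimal sketch (not Intel’s implementation) that assigns each triangle to every screen tile its bounding box touches; the tile size and screen dimensions are illustrative:

```python
# Minimal tile-binning sketch for tile-based rendering: each triangle
# lands in the bin of every tile its bounding box overlaps, so each
# tile can later be rasterized independently with good locality.
from collections import defaultdict

TILE = 256  # Gen11 uses variable tile sizes; 256x256 is one typical choice

def bin_triangles(triangles, width, height, tile=TILE):
    """Map (tile_x, tile_y) -> list of triangle indices overlapping it."""
    bins = defaultdict(list)
    for i, tri in enumerate(triangles):
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        # Conservative bounding-box test, clamped to the screen.
        x0, x1 = max(min(xs), 0), min(max(xs), width - 1)
        y0, y1 = max(min(ys), 0), min(max(ys), height - 1)
        for ty in range(int(y0) // tile, int(y1) // tile + 1):
            for tx in range(int(x0) // tile, int(x1) // tile + 1):
                bins[(tx, ty)].append(i)
    return bins

tris = [((10, 10), (100, 20), (50, 90)),       # fits in tile (0, 0)
        ((200, 200), (600, 220), (400, 500))]  # spans several tiles
bins = bin_triangles(tris, 1024, 768)
```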

Mobile graphics processors introduced TBR, and Nvidia quietly adopted it for the second-generation Maxwell architecture (which featured a capacious 2MB L2 cache). AMD also employed the technology in its Vega architecture (see MPR 8/28/17, “AMD Vega Shoots for GPU Stars”). TBR is one motivation for Gen11’s large 3MB L3, which keeps tile data on chip as each frame renders. According to Intel, the technology can boost performance by up to 10% on various benchmark scenes.

Coarse Pixels Cut Down Shading

Coarse pixel shading is a clever technique that reduces pixel shading—the most computationally intense portion of the graphics pipeline—with little effect on visual quality. In a typical pipeline, a pixel shader calculates the color of each pixel fragment. CPS breaks this one-to-one mapping of pixel fragments to pixel-shader invocations and allows a block of pixels to share the result of a single pixel-shader calculation. For example, CPS could coarsely shade a distant object while traditional dense shading renders a nearby object (or anything containing text). In AR/VR, CPS can implement foveated rendering, with dense pixel shading at the center of the field of view and sparse shading around the periphery.

Intel’s implementation supports 1x1 (i.e., regular shading), 2x2, 2x4, 4x2, and 4x4 pixel blocks. The driver provides three modes for applications: a global mode, a per-draw-call mode, and an elliptical (foveated) mode. In the elliptical mode, the center of the ellipse uses dense 1x1 shading, the outer regions use 4x4 blocks, and the intermediate region uses a mosaic of block sizes, as Figure 3 shows. In a UE4 Sun Temple scene, the Gen11 GPU employs a variety of block sizes, and Intel estimates that CPS boosts performance by 30–50%.
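
The elliptical mode can be sketched as a simple block-size lookup based on distance from the ellipse center. The thresholds and the 1x1/2x2/4x4 progression below are illustrative assumptions, not Intel’s actual parameters:

```python
# Foveated coarse-pixel-shading sketch: choose a coarse-pixel block
# size from the normalized elliptical distance to the gaze center.
# (Gen11 also supports 2x4 and 4x2 blocks; thresholds are made up.)

def coarse_block_size(px, py, cx, cy, rx, ry):
    """Return (w, h) of the shared-shading block for pixel (px, py)."""
    # d < 1 inside the inner ellipse; grows quadratically outward.
    d = ((px - cx) / rx) ** 2 + ((py - cy) / ry) ** 2
    if d < 1.0:
        return (1, 1)  # fovea: full-rate shading
    elif d < 4.0:
        return (2, 2)  # intermediate mosaic region
    else:
        return (4, 4)  # periphery: one shader result per 16 pixels

print(coarse_block_size(960, 540, 960, 540, 300, 200))  # (1, 1) at center
print(coarse_block_size(0, 0, 960, 540, 300, 200))      # (4, 4) at corner
```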

Figure 3. Example of coarse pixel shading. CPS enables a small block of pixels to share a single pixel-shader evalua­tion, so frame rendering requires fewer resources. In the foveated view, the inner region uses 1x1 blocks while the outer region uses 4x4 blocks. (Image source: Intel)

CPS is a superb example of successfully converting research into a product; a team of Intel graphics researchers first described it in 2014. But in an ironic twist, Nvidia poached most of that team and was first to market, deploying the technology as variable-rate shading in the Turing graphics architecture (see MPR 10/8/18, “Turing Accelerates Ray Tracing”). Intel’s longer time to market reflects both a more bureaucratic product-development process and its unfortunate 10nm delays.

8K Media and Variable-Rate Displays

Media encoding and decoding have always been an Intel strength, and Gen11 is no exception. The company overhauled the media pipelines to target 8K resolution and to shift more codecs into fixed-function logic. The encoder adds hardware VP9 support and improves on HEVC by enabling 4:2:2 and 4:4:4 formats. Instead of a single monolithic instance, the dedicated media hardware comes as a set of smaller composable engines, so stream count scales inversely with resolution: the hardware can sustain a single 8K stream, four 4K streams, or sixteen 1080p streams. Intel also doubled the display pipeline’s pixel output.

For gaming, Gen11 adds VESA adaptive sync, a royalty-free open standard for adaptive display-refresh rates. Traditional displays operate at a fixed rate (e.g., 60Hz), but GPU rendering is asynchronous relative to that rate, and the misalignment can cause screen tearing. One alternative is V-sync: it forces the GPU to present frames in step with the monitor’s refresh rate and uses triple buffering to eliminate tearing, but it increases latency by one to three refresh cycles (e.g., up to 50ms for a 60Hz display). Serious gamers won’t accept this extra delay. Adaptive sync instead enables the display to operate at the GPU rendering rate. These displays operate over a range of rates (typically 30–60Hz, and up to 144Hz for gaming models) and dynamically align refresh timing with the graphics-card output, eliminating tearing while keeping latency low.
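
The latency claim is easy to verify: in the worst case, a triple-buffered V-sync frame waits three refresh intervals before reaching the screen.

```python
# Worst-case display latency under triple-buffered V-sync: a rendered
# frame can sit behind up to three refresh intervals before scan-out.
# Adaptive sync instead scans out a frame as soon as the GPU finishes
# it, within the panel's supported refresh range.

def vsync_worst_case_latency_ms(refresh_hz, buffered_frames=3):
    return buffered_frames * 1000.0 / refresh_hz

print(vsync_worst_case_latency_ms(60))   # 50.0 ms at 60Hz
print(vsync_worst_case_latency_ms(144))  # ~20.8 ms at 144Hz
```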

Adaptive sync dates back about five years. AMD pioneered it in 2014 as an optional DisplayPort 1.2a feature and marketed it under the moniker FreeSync. Nvidia’s G-Sync is a similar (albeit less popular) technology that requires a fairly expensive controller and memory and that increases display cost by about $200. Ironically, this technology is most valuable when the GPU struggles to maintain an acceptable refresh rate (e.g., around 30Hz)—a scenario that’s more common for Intel’s integrated graphics than for AMD’s and Nvidia’s discrete graphics cards. Adaptive sync is therefore long overdue at Intel. Fortunately, the company’s dominant market share (greater than 60%) will push the technology from enthusiast gamers to mainstream consumers.

Teraflops for All

Intel’s graphics aspirations are hardly a secret. The company sees GPUs as an important technology and is working to develop its first discrete product, which should debut in late 2020. Code-named Arctic Sound, the new GPU will target gaming as well as data-center computing (e.g., training neural networks), following an approach similar to that of Nvidia. Intel is probably targeting a 100–150W TDP card first rather than 250–300W behemoths, so the initial product is unlikely to challenge Nvidia’s top-end GPUs. But it will compete against mainstream graphics cards and data-center accelerators such as the Tesla T4.

In that context, the Gen11 family is an intermediate step. It is an integrated GPU, but it bumps up against the performance of low-end discrete graphics. When it debuts late this year, the new GPU will be an inflection point for Intel’s graphics strategy. The mainstream GT2 variant will boost compute performance by 2.6x relative to the prior-generation Skylake GT2, breaking the 1Tflop/s mark. Splitting the internal memory hierarchy increases bandwidth even more—an impressive 4x.

Intel’s 10nm delays meant a missed step in the graphics roadmap, allowing AMD to seize the high ground for mainstream processors. The Gen11 GPU’s performance boost should restore Intel’s competitive position and set a new bar for the industry. Thanks to cutting-edge features such as coarse pixel shading, Intel is fast catching up to the industry-leading discrete GPUs while continuing to offer best-in-class media processing.