AMD unveiled its Mantle graphics programming layer during a press event in Hawaii two months ago. The announcement immediately sent waves through the PC gaming community, and in its wake, we heard about a number of games adopting the API—games from Battlefield 4 to Thief to Star Citizen. However, AMD divulged comparatively little about how Mantle works or about the benefits we can expect from it. The lack of concrete information about Mantle spawned lots of speculation and debate, but most of it wasn’t very enlightening.

Fortunately, we now know some specifics. At its APU13 developer conference in San Jose, California, AMD invited journalists and developers to listen to hours’ worth of keynotes and sessions by Mantle luminaries. We didn’t just hear from the API’s architects; we also listened to some of its more illustrious early adopters, a few of whom helped develop Mantle in collaboration with AMD.

Among the speakers were Guennadi Riguer and Brian Bennett, the two AMD staffers who created Mantle; Johan Andersson of EA DICE, the man behind the Frostbite engines that power the Battlefield series; and Jurjen Katsman, the CEO of Nixxes, a Dutch studio that’s porting the next Thief game to the PC. (Nixxes can also be credited with porting Deus Ex: Human Revolution, Hitman Absolution, and Tomb Raider to Windows.)

Altogether, the Mantle presentations and talks at APU13 amounted to well over three hours of material. Much of that material was laden with game programming jargon and cryptic PowerPoint diagrams, and almost all of it was presented by developers with a knack for talking really, really fast. What follows is some of the information we managed to glean from those sessions—and from talking with a few of those folks one-on-one.

What on earth is Mantle?



Before we get started, we should probably talk a little bit about what Mantle is.

Mantle is a new application programming interface, or API, for real-time graphics that’s intended to be a substitute for Direct3D and OpenGL. Mantle is designed to cut much of the overhead associated with those APIs, and in many respects, it’s meant to operate at a lower level, closer to the metal, than they do. In that sense, Mantle is similar—but not identical—to the low-level APIs used to develop games on consoles like the Xbox One and the PlayStation 4.

At present, Mantle support is limited to Windows systems with graphics processors based on AMD’s Graphics Core Next architecture. Games written using Mantle will run on discrete Radeons from the HD 7000 series onward, and they’ll work on upcoming Kaveri APUs, too. (The GCN graphics architecture has made its way into some other notable silicon, including the SoCs inside of both the PlayStation 4 and the Xbox One, but Mantle does not, to our knowledge, currently support those.)

All of this talk of being “close to the metal” refers to a classic tradeoff in programming interfaces, especially in real-time graphics. A programming interface may choose to go very low level by exposing control over the smallest details of the hardware, giving the developer access to exact buffer sizes and the like. Doing so can allow programmers to extract the best possible performance out of a particular piece of silicon. However, applications written for low-level APIs can become dependent on the presence of specific hardware. When a new chip architecture comes along, a “close to the metal” application may run poorly or even refuse to run on the new silicon. In order to maintain broader compatibility and flexibility, higher-level APIs restrict access to hardware-specific features and expose a simpler set of capabilities that presumably will be available across multiple chip architectures.

Console APIs can afford to be fairly low-level, since console hardware doesn’t change for years at a stretch. By contrast, the high-level nature of Direct3D is the bit of magic that allows us to run decade-old PC games on brand-new graphics cards without issue.

In Mantle’s case, according to Riguer, AMD has lowered the abstraction level in some areas but “not across the board.” DICE’s Johan Andersson described the traditional approach as “middle-ground abstraction,” where a compromise is struck between performance and usability. Mantle, by comparison, offers “thin low-level abstraction” that exposes how the underlying hardware works. Riguer boiled it down further by comparing Mantle to driving a car with a manual transmission—more responsibility, but also more fun.

Also, while Graphics Core Next is the “hardware foundation” for Mantle, AMD’s Guennadi Riguer and some of the other Mantle luminaries at APU13 made it clear that the API is by no means tied down to GCN hardware. Some of Mantle’s features are targeted at GCN, but others are generic. “We don’t want to paint ourselves in a corner,” Riguer explained. “What we would like to do with Mantle is to have [the] ability to innovate on future graphics architectures for years to come, and possibly even enable our competitors to run Mantle.” Jurjen Katsman of Nixxes was even bolder in his assessment, stating, “There’s nothing that I can see from my perspective that stops [Mantle] from running on pretty much any hardware out there that is somewhat recent.”

Of course, technical feasibility isn’t the only obstacle in the way of Nvidia’s hypothetical adoption of Mantle. We’ll discuss this again in a little more detail at the end of the article. But first…

The problem with Direct3D

To understand why AMD created Mantle, it helps to know about some of the pitfalls of development with current, vendor-agnostic APIs. That model involves a substantial amount of overhead, and it apparently puts much of the optimization burden on driver developers, leaving game developers with limited control over how the hardware runs their software.

Katsman was particularly critical, calling Direct3D “extremely unpredictable” and complaining that, in some titles, “50% of your CPU time is spent by the driver and by Direct3D doing something that you’re not quite sure about.” AMD’s Riguer blamed that high overhead partly on the fact that graphics drivers have “no straightforward way to translate API commands to GPU commands” and are “not all that lean and mean.” In consoles, where the APIs are closer to the metal, Katsman said overhead amounts to something like “a few percent” of total CPU time.

The slide above, taken from the Nixxes presentation, outlines some of Katsman’s grievances with Direct3D in more detail.

Among those grievances is the performance hit caused by the driver compiling shaders at “undefined times” in the background. Katsman noted that, in Deus Ex: Human Revolution, one of Nixxes’ PC ports, shader compilation caused the game to stutter—which, in turn, led players to complain online. For what it’s worth, we did notice some ugly stuttering in our own testing of that game, although it’s not clear if those slowdowns were caused by this specific problem.

Another issue with Direct3D is the developer’s lack of control over GPU memory. Riguer explained that consoles let developers achieve “much greater visuals than on [the] PC with comparable or greater memory configs.” Katsman provided some background information about why that is. “In general, [with] Direct3D, if you destroy and recreate resources all the time, the API is too slow to do that, so you’re stuck having a fixed amount of resources that you cache and you keep around,” Katsman said. “Memory usage on PC is actually far higher, and we’re not really getting anything in return.”

There’s also the overhead associated with draw calls, Direct3D’s basic commands to place and manipulate objects on the screen. Packing in the amount of detail in today’s games requires lots of draw calls for each frame, and that leads to what developers call the small-batch problem. In Riguer’s words, “You hit a wall after so many draw calls per frame.” The limit is usually around 3,000-5,000 draw calls per frame, although very skilled developers can purportedly manage 10,000 or more. According to Katsman, developers must “jump through a lot of hoops” and come up with “new and clever ways to have fewer draw calls.” The barrier to increasing the number of draw calls per frame lies not with the hardware, Katsman added, but with the API.
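To put those draw call limits in perspective, here is a rough back-of-the-envelope sketch of how much CPU time each draw call could consume if an entire frame were spent doing nothing but submitting draw calls on one thread. The 60 FPS target and the function name are our own assumptions for illustration; none of the speakers quoted these per-call figures.

```python
def cpu_time_per_draw_call_us(draw_calls_per_frame, fps=60.0):
    """Upper bound on CPU time per draw call, in microseconds.

    Assumes (unrealistically) that the whole frame is spent on a single
    thread issuing draw calls; real games also run game logic, physics, etc.
    """
    frame_time_us = 1_000_000.0 / fps  # about 16,667 us per frame at 60 FPS
    return frame_time_us / draw_calls_per_frame


# Draw call counts taken from the limits mentioned in the talks.
for calls in (3_000, 5_000, 10_000):
    budget = cpu_time_per_draw_call_us(calls)
    print(f"{calls:>6} draw calls/frame -> at most {budget:.2f} us each")
```

Under those assumptions, the budget works out to roughly 5.6, 3.3, and 1.7 microseconds per call, which gives a sense of how little per-call API and driver overhead a game can tolerate.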

Katsman then decried the fact that driver optimizations are “almost required” for new games. Anyone who’s ever had to download multiple beta driver updates to support a new PC game will be all too familiar with that problem. Developers are, in effect, unable to make their games work well by themselves. “I think that’s actually very harmful and doesn’t really contribute to users getting a good experience from the games they buy,” said Katsman.

Finally, PC games underutilize multi-core processors. Four-, six-, and eight-core chips aren’t uncommon in modern gaming PCs, but AMD’s Riguer said that “very few of those cores are available for driving graphics today.” Katsman elaborated on this point, noting that developers must expect drivers to spawn extra threads. He brought up this hypothetical scenario: “If the system has eight cores, then as an app, we should probably only use five, because who knows, the driver may still use another three or so.” That truly is a hypothetical scenario, though—in practice, Katsman pointed out that most games “flatten off at one core.”

What Mantle does

Mantle takes a number of steps to alleviate the issues outlined above. By giving developers more direct control of the GPU and putting them, in Riguer’s words, in the “driver developer’s seat,” Mantle can cut overhead and allow for more efficient use of both the graphics hardware and the CPU.

Mantle’s most fundamental and innovative feature, according to AMD’s Brian Bennett, is its execution model. Here’s how he described it:

These days, a modern GPU typically has a number of engines that execute commands in parallel to do work for you. You have a graphics or compute engine, DMA, multimedia . . . whatever. The basic building block for work for those engines is a command buffer. In [the diagram above], a command buffer is a colored rectangle. A driver builds commands targeting one of the engines; it puts the command buffer into an execution queue, and the engine, when it’s ready to do some work, goes to the execution queue, grabs the work, and performs it.

[That’s] as opposed to a context-based execution model, where it’s up to the driver to choose which engine we want to target; it’s up to the driver to figure out where we break command buffers apart, and manage the synchronization between those engines. Mantle exposes all that, abstracted, to you. So, you have the ability to build a command buffer, insert commands into it, submit it to the queue, and then synchronize the work between it. This lets you take full advantage of the entire GPU. More fundamentally to Mantle’s goals is the fact that you can create these command buffers from multiple application threads in parallel. . . . That is the key to opening up the potential of our multi-core CPUs these days. There is no synchronization at the API level in Mantle; there is no state that persists between command buffers. It is up to you to do the synchronization of your command building and of your command submission; and if you want to do work on multiple engines, we give you constructs to synchronize work between those engines. You have all the power.
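The shape of that model can be sketched in a few lines of Python. This is a conceptual toy, not the Mantle API: the `CommandBuffer` class, the function names, and the string "commands" are all hypothetical, and CPython threads will not actually run the recording in parallel. The point is only the structure Bennett describes: each thread builds its own command buffer with no shared state, and the application decides when and in what order buffers go into an engine's queue.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from queue import Queue


@dataclass
class CommandBuffer:
    engine: str                                   # engine this buffer targets
    commands: list = field(default_factory=list)  # recorded commands


def build_command_buffer(engine, draw_ids):
    # Each buffer is self-contained: no state is shared with other
    # buffers, so recording needs no locks.
    cb = CommandBuffer(engine)
    for draw_id in draw_ids:
        cb.commands.append(f"draw {draw_id}")
    return cb


def record_frame(num_draws, num_threads):
    # Split the frame's draws across threads; each thread records its own buffer.
    chunks = [range(t, num_draws, num_threads) for t in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(lambda c: build_command_buffer("graphics", c), chunks))


def submit(queue, buffers):
    # The application, not the driver, chooses submission order.
    for cb in buffers:
        queue.put(cb)


graphics_queue = Queue()  # stand-in for the graphics engine's execution queue
buffers = record_frame(num_draws=1000, num_threads=4)
submit(graphics_queue, buffers)
```

In a real engine, the recording threads would be native threads writing hardware commands, and the application would also insert explicit synchronization between queues.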

Mantle’s execution model extends to multiple GPUs. Developers have access to all of the engines on all of a system’s Mantle-compatible GPUs, and they can control those GPUs and handle synchronization themselves. “Synchronization between the GPUs,” Riguer explained, “becomes a natural extension to the mechanism we exposed . . . on synchronization between multiple queues. In fact, we make [the] multi-GPU model exactly like a single-GPU model scaled up to multiple devices.”

As a result, developers have much more flexibility in the way they split up workloads between GPUs, and they can “try to make [their games] scale a lot better” than what’s possible with CrossFire right now. Techniques superior to today’s alternate frame rendering (AFR), whereby each GPU renders a different frame in the animation, can be developed, and asymmetric configurations—such as those with slow integrated graphics and fast discrete graphics—can be more readily exploited.

Moving beyond AFR is particularly important. While that technique works reasonably well with current games, Riguer said future titles will run more workloads with lots of frame-to-frame dependencies, such as compute-based effects. To handle those, “You would need to either duplicate the workload across GPUs or serialize across the GPUs. In either case, your scaling suffers.”

Mantle manages memory in a very different way than Direct3D, too. Here is Bennett’s explanation of that feature:

In traditional APIs, when you create an object like an image or a buffer, the driver implicitly allocates memory for you. [That] seems okay, but it has a number of problems. It’s difficult to efficiently recycle memory; you’re going to have bigger memory footprints because of that; creating the object itself is more expensive, because you have to go to the OS to get the GPU memory; and the driver becomes inefficient, because it spends a lot of time managing these OS video memory handles to work with the display driver model.

In Mantle, API objects are simple CPU-side info that have no memory explicitly attached. Instead, you as the app developer allocate GPU memory explicitly and bind it to the object.
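As a rough illustration of that split between objects and memory, the following Python sketch models a block of GPU memory as a simple bump allocator and an image as a lightweight CPU-side description that is bound to an offset in that block. The class names, the 256-byte alignment, and the sizes are hypothetical choices for the example, not details of Mantle itself.

```python
class GpuHeap:
    """A pretend block of GPU memory managed by the application."""

    def __init__(self, size):
        self.size = size
        self.offset = 0  # next free byte

    def allocate(self, size, alignment=256):
        # Round the current offset up to the requested alignment.
        start = (self.offset + alignment - 1) // alignment * alignment
        if start + size > self.size:
            raise MemoryError("heap exhausted")
        self.offset = start + size
        return start

    def reset(self):
        # Recycle the whole heap at once, e.g. between levels.
        self.offset = 0


class Image:
    """CPU-side description only; owns no memory until bound."""

    def __init__(self, width, height, bytes_per_pixel=4):
        self.size = width * height * bytes_per_pixel
        self.heap = None
        self.offset = None

    def bind_memory(self, heap, offset):
        self.heap, self.offset = heap, offset


heap = GpuHeap(64 * 1024 * 1024)       # one 64 MiB allocation up front
a, b = Image(1024, 1024), Image(512, 512)
a.bind_memory(heap, heap.allocate(a.size))
b.bind_memory(heap, heap.allocate(b.size))

heap.reset()                            # reuse the same memory for new objects
c = Image(2048, 2048)
c.bind_memory(heap, heap.allocate(c.size))
```

Because the application controls the heap, recycling memory is a matter of resetting an offset rather than destroying and recreating driver-managed resources.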

Again, higher efficiency and flexibility are the name of the game.

That brings us to monolithic pipelines. To paraphrase Johan Andersson, Mantle rolls all of the various shader stages that make up the graphics pipeline into a single object. Above, I’ve added the slide from Andersson’s keynote, since it’s somewhat more enlightening than the one used by Riguer and Bennett in their presentation.

In short, monolithic pipelines help avoid draw-time shader compilation—a problem that, as I mentioned earlier, can make games stutter. Here’s how Bennett sums it up:

In the current implementations, draw-time validation that the driver does is super expensive. Since you can vary all your shaders in state independently, we spend a lot of time at draw deciding what hardware commands we should write. By compiling the pipeline up front, binding the pipeline is lightning fast in comparison.

Second, by compiling this up front, you give us the opportunity to spend some cycles to improve the GPU performance. If we know everything you’re doing in the whole pipeline, we can optimize that. And . . . with the draw-time validation models, sometimes you’ll bind a new state, call draw, and that draw will have an inexplicably high CPU cost. Maybe the driver had to kick off a shader compile in the background, and that’s going to impact you. [There are] no surprises with Mantle.

Mantle doesn’t just help prevent shader compilation from occurring mid-game. It can also prevent shaders from being recompiled each time the game is launched. According to Riguer, recompilation can account for a “lot of the startup time,” but with Mantle, “the shader compilation is a lot more predictable, and we give you the ability to save and load very quickly and easily a complete compiled shader pipeline, which should virtually eliminate all the loading time that stems from shader compilation.”

Incidentally, Bennett said he expects pipelines to look “different in the future.” He suggested that Mantle’s graphics pipeline abstraction will help the API adapt to these future changes—enabling “some stuff that we can’t do in real time now.”

Mantle introduces a new way to bind resources to the graphics pipeline, as well. According to Bennett, the traditional binding technique is a “pretty big performance hog,” and the currently popular alternative, which he calls “bindless,” has downsides of its own, including higher shader complexity, reduced stability, and being “less GPU cache friendly.”

Mantle’s binding model involves simplified resource semantics compared to Direct3D, and it works like so:

In Mantle, when you create your pipeline, you define a layout for how the resources will be accessed from the pipeline, and you bind that descriptor set. The descriptor set is an array of slots that you bind resources to. Notably, you can bind another descriptor set to a slot—and this lets you set hierarchical descriptions of your resources.
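For readers who find a data structure easier to follow than a description, here is a minimal Python sketch of a descriptor set as an array of slots, where a slot can hold either a resource or another descriptor set. The class, the `resolve` helper, and the resource names are hypothetical and only illustrate the hierarchy Bennett describes.

```python
class DescriptorSet:
    """An array of slots; each slot holds a resource or a nested set."""

    def __init__(self, num_slots):
        self.slots = [None] * num_slots

    def bind(self, slot, resource):
        self.slots[slot] = resource

    def resolve(self, path):
        # Follow a path of slot indices down through nested sets.
        node = self
        for index in path:
            node = node.slots[index]
        return node


# Resources that change per material live in their own set...
material = DescriptorSet(2)
material.bind(0, "albedo_texture")
material.bind(1, "normal_texture")

# ...which is bound into a slot of a top-level set.
frame = DescriptorSet(2)
frame.bind(0, "camera_constants")
frame.bind(1, material)

# Swapping materials means rebinding one slot, not every texture.
other_material = DescriptorSet(2)
other_material.bind(0, "rust_albedo")
other_material.bind(1, "rust_normal")
frame.bind(1, other_material)
```

The nesting is what makes the model cheap to update: changing a whole group of resources is a single bind of a prebuilt set.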

If your eyes just glazed over, that’s okay—mine did a little, too. In any event, Bennett said that the ability to build descriptor sets and to generate command buffers in parallel is “very good for CPU performance.” During his presentation, Johan Andersson brought up a descriptor set use case that reduced both CPU overhead and memory usage.

Bennett went over one more way in which Mantle can reduce CPU overhead: resource tracking. Right now, drivers spend a “lot of time” keeping track of resources. With Mantle, tracking resources is up to the application. Bennett said he expects apps to do a better job of it than the graphics drivers, and he hinted that developers won’t have to do much extra work to make that happen: “Your game engine is probably doing that sort of tracking already, because you’re supporting consoles that require it.”

Last, but not least, Mantle has some debugging and validation tools built into the API and the accompanying driver. AMD didn’t share a ton of specifics about those, but there was mention of “lots of extra controls for stress testing applications and forcing very specific debug scenarios.” Riguer added, “In fact, I would say that writing [debug] tools on top of Mantle, in many cases, would not be much harder than slapping on a fancy UI on top of capabilities we are putting right into Mantle.” Both Johan Andersson of DICE and Jurjen Katsman of Nixxes called Mantle’s debugging and validation tools “really powerful.”

More on how Mantle helps performance

Mantle’s closer-to-the-metal development model, coupled with a more lightweight driver, seems to pay some very real performance dividends. The game developers in attendance at APU13 were reluctant to quote actual performance figures from their games, partly because their work still isn’t quite finished. However, some figures were quoted that shed light on Mantle’s performance benefits.

For starters, Nixxes’ Katsman revealed that “very early figures from Thief” (which is “not fully running on Mantle yet”) showed a big reduction in draw call overhead. “Before, we would often see about 40% of the CPU time stuck in the driver, in D3D, or in various threads,” he said. “The early measurements we did, right now we have that down to about a fifth of that.”
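Taking Katsman's early numbers at face value, the arithmetic looks like this. The sketch assumes "a fifth of that" refers to a fifth of the 40% share of CPU time; the function name is ours, and the figures are preliminary measurements rather than final results.

```python
def overhead_after_reduction(share_before, factor):
    """Return (new overhead share, CPU share freed up) as fractions of CPU time."""
    share_after = share_before / factor
    return share_after, share_before - share_after


# 40% of CPU time in the driver and D3D, reduced to about a fifth of that.
after, freed = overhead_after_reduction(0.40, 5)
print(f"overhead: {after:.0%} of CPU time, freed: {freed:.0%}")
```

By that reading, driver and API overhead would fall to about 8% of CPU time, leaving roughly 32% of CPU time available for the game itself.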

The guys from Oxide offered a more visual representation of Mantle’s CPU overhead in their talk. Mantle is the yellow rectangle, the game engine is the blue one, and unused CPU time is shown in green:

DICE’s Andersson expanded on that same notion in his keynote, saying that, with Mantle, the CPU “should never really be a bottleneck for the GPU anymore.” In a separate demonstration, Oxide showed their Mantle-enabled space game suffering no frame rate hit when the FX-8350 processor on which it ran was underclocked to 2GHz, or half its base speed. (Graphics processing in that demo was handled by a Radeon R9 290X.)

The reduction in draw call overhead also means more draw calls can be issued per frame. Riguer said Mantle raises the draw call limit by an order of magnitude to “at least” 100,000 draw calls per frame “at reasonable frame rates.” This isn’t just theoretical—Oxide showed their space game demo actually hitting 100,000 draw calls per frame. Andersson, who was in the audience for that presentation, was impressed enough to tweet about the demo.

Mantle will allow game developers to use more CPU cores, too, as these two slides from Andersson’s presentation show. According to Andersson, the Mantle model outlined in the second slide is “the exact model that we’re using on all of the consoles,” both current and next-gen ones. In his talk, Katsman explained that, if a system has eight cores, Mantle allows developers to use all of those cores for their game. “So, we can have four to do rendering, a few more to do physics and some other things. We can make games that are far more complicated. We can increase the draw distance to significant distances, have far denser worlds.”

According to Katsman, “The density of everything in the world is something that’s being held back, and I think Mantle will help alleviate that.” That said, “Just because we can draw more things doesn’t mean we have the CPU resources to simulate them all.” For example, while Mantle might make it possible to draw many more characters in a given scene, developers will have to consider the cost of running AI simulations for all of those characters.

In addition to making more effective and efficient use of the CPU, Mantle will allow GPU resources to be used more efficiently. Katsman brought up the Radeon R9 290X, which has 5.6 teraflops of compute power, and said that an “awful lot” of that compute power is “lying there dormant.” With current APIs, some of the compute power might be used for some parts of a frame, but other parts “will be bottlenecked by something else,” such as “getting things from memory, by fetching textures through the texture fetch units, [and] the rasterization units.” He went on:

The APIs we have right now, they just allow us to queue synchronous workloads. We say, “draw some triangles,” and then, “do some compute,” and the driver can try to be a little smart, and maybe it’ll overlap some of that. But for the most part, it’s serial, and where we’re doing one thing, it’s not doing other things.

With Mantle . . . we can schedule compute work in parallel with the normal graphics work. That allows for some really interesting optimizations that will really help your overall frame rate and how . . . with less power, you can achieve higher frame rates. What we’d see, for example—say we’re rendering shadow maps. There’s really not much compute going on. . . . Compute units are basically sitting there being idle. If, at the same time, we are able to do post-processing effects—say maybe even the post-processing from a previous frame, or what we could do in Tomb Raider, [where] we have TressFX hair simulations, which can be quite expensive—we can do that in parallel, in compute, with these other graphics tasks, and effectively, they can become close to zero cost. If we guessed that maybe only 50% of that compute power was utilized, the theoretical number—and we won’t reach that, but in theory, we might be able to get up to 50% better GPU performance from overlapping compute work, if you would be able to find enough compute work to really fill it up.

The 50% figure is a theoretical best-case scenario, but Katsman added, “It seems quite realistic that you would get maybe 20% additional GPU performance out of optimizations like that.”
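A toy model helps show where gains like that could come from. The sketch below compares running graphics and compute work back to back against an idealized case where the two overlap perfectly. The millisecond figures are made up for illustration, the model ignores contention for memory bandwidth and other shared resources, and it is not meant to reproduce Katsman's 50% or 20% estimates.

```python
def frame_time_serial_ms(graphics_ms, compute_ms):
    # Current APIs: work is mostly queued one task after another.
    return graphics_ms + compute_ms


def frame_time_overlapped_ms(graphics_ms, compute_ms):
    # Idealized overlap: the shorter task hides entirely behind the longer one.
    return max(graphics_ms, compute_ms)


def frame_rate_gain(graphics_ms, compute_ms):
    serial = frame_time_serial_ms(graphics_ms, compute_ms)
    overlapped = frame_time_overlapped_ms(graphics_ms, compute_ms)
    return serial / overlapped - 1.0


# Hypothetical frame: 12 ms of graphics work, 4 ms of compute work.
gain = frame_rate_gain(graphics_ms=12.0, compute_ms=4.0)
print(f"ideal frame rate gain: {gain:.0%}")
```

With those made-up numbers, the frame drops from 16 ms to 12 ms, an ideal gain of about 33%. Real workloads would land below the ideal figure, which is consistent with Katsman's more cautious estimate.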

Also, because Mantle lets developers use GPU memory more efficiently, the new API could allow for the use of higher-resolution textures in a given game, according to Katsman.

Some caveats

Mantle’s advantages are many, but a few downsides were mentioned in the various presentations at APU13.

One of those is that, unsurprisingly, supporting an additional API incurs added development time and cost. Mantle currently works only on GCN-based Radeon graphics processors, which means that developers who adopt it must also use either Direct3D or OpenGL to support other graphics hardware. Andersson said DICE spent about two months porting Battlefield 4’s Frostbite 3 game engine to Mantle. Asked for a ballpark cost figure, Katsman told me that, for a simple PC project like Nixxes’ Thief port, adding Mantle support might amount to roughly a 10% increase in development cost. He was quick to add, however, that such an increase is a drop in the bucket compared to the total development cost of the entire game for all platforms, which might add up to something like $50 million.

The lack of multi-vendor and multi-platform support is another one of Mantle’s notable downsides. Microsoft and Sony use different APIs for the Xbox One and PlayStation 4, and Mantle doesn’t yet support Linux, OS X, or Valve’s upcoming SteamOS. There are some mitigating factors here, though. Katsman noted that Mantle optimizations are “conceptually similar” to the ones developers write for next-gen consoles. That tells us developers won’t be starting from scratch when adding Mantle support to their games. Also, Katsman believes Mantle’s performance improvements make its implementation worthwhile even if only a fraction of users benefit. As he pointed out, developers already spend time writing support for features like Eyefinity and HD3D into their games, and those features have even smaller user bases.

Finally, adding Mantle support to current game engines, as Nixxes did with the version of Unreal Engine 3 used by Thief, can be a challenge. “Native D3D ports will not magically get much higher performance,” explained Katsman. “If you emulate the same system on top of Mantle, you will not get much better performance.” Fully optimizing an existing engine for Mantle seems to involve breaking and rewriting some chunks of that engine to take advantage of the new development model. But here again, Katsman believes the performance improvements make the effort worthwhile.