Demystifying Asynchronous Compute

As some of you may know, this is one of my favorite topics, and I thought I'd try to write a sort of "definitive guide" to dispel some of the misconceptions, rumors and hysteria over this feature.

So to start off, we're going to briefly look at what the DirectX 12 specification says, because a little background is needed in order to approach asynchronous compute.

DirectX 12: Command Lists and Multi-Engine

Most modern GPUs contain multiple independent engines that provide specialized functionality. Many have one or more dedicated copy engines, and a compute engine, usually distinct from the 3D engine. Each of these engines can execute commands in parallel with each other. Direct3D 12 provides granular access to the 3D, compute and copy engines, using queues and command lists.
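In code, setting up one queue per "engine" looks roughly like this; a minimal sketch assuming an already-initialized ID3D12Device, with error handling omitted:

```cpp
// Minimal sketch: one queue per engine type in D3D12.
// Assumes 'device' is an already-initialized ID3D12Device*; HRESULTs ignored.
#include <d3d12.h>

void CreateQueues(ID3D12Device* device,
                  ID3D12CommandQueue** gfxQueue,
                  ID3D12CommandQueue** computeQueue,
                  ID3D12CommandQueue** copyQueue)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};

    // 3D engine: accepts graphics, compute and copy commands.
    desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(gfxQueue));

    // Compute engine: accepts compute and copy commands only.
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(computeQueue));

    // Copy engine: accepts copy commands only.
    desc.Type = D3D12_COMMAND_LIST_TYPE_COPY;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(copyQueue));
}
```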



Asynchronous Compute and DX12

Synchronization and Multi-Engine (Windows):

Asynchronous and low priority GPU work. This enables concurrent execution of low priority GPU work and atomic operations that enable one GPU thread to consume the results of another unsynchronized thread without blocking.

High priority compute work. With background compute it is possible to interrupt 3D rendering to do a small amount of high priority compute work. The results of this work can be obtained early for additional processing on the CPU.

Background compute work. A separate low priority queue for compute workloads allows an application to utilize spare GPU cycles to perform background computation without negative impact on the primary rendering (or other) tasks. Background tasks may include decompression of resources or updating simulations or acceleration structures. Background tasks should be synchronized on the CPU infrequently (approximately once per frame) to avoid stalling or slowing foreground work.
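Here's a rough sketch of that third scenario, background compute with infrequent CPU synchronization. The queue, fence and command list names are mine, and real code would check HRESULTs:

```cpp
// Sketch of the background compute scenario; all names are illustrative.
// A low priority compute queue chews on background work while the 3D queue
// renders, and the CPU polls a fence roughly once per frame.
#include <d3d12.h>

ID3D12Fence* bgFence      = nullptr;
UINT64       bgFenceValue = 0;

void InitBackgroundFence(ID3D12Device* device)
{
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&bgFence));
}

void SubmitBackgroundWork(ID3D12CommandQueue* computeQueue,
                          ID3D12CommandList*  bgWork)
{
    // Dispatch and signal; nothing on the CPU or the 3D queue waits here.
    computeQueue->ExecuteCommandLists(1, &bgWork);
    computeQueue->Signal(bgFence, ++bgFenceValue);
}

bool BackgroundWorkDone()
{
    // Called about once per frame; a cheap poll, not a blocking wait.
    return bgFence->GetCompletedValue() >= bgFenceValue;
}
```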

Parallelism and Concurrency

Parallelism: Simultaneous execution of two or more tasks; the tasks are executing at the same instant, and therefore necessarily on independent units.

Concurrency: Overlapping execution of two or more tasks; the tasks are not executing at the same instant, but both progress forward in their execution within the same time frame.

Asynchrony: Order-independent execution of two or more tasks; a routine can be called before the preceding routine returns.
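To make the three terms concrete, here's a toy C++ sketch of my own; nothing GPU-specific about it:

```cpp
// Toy illustration of the three terms (my example, not from any spec).
#include <future>
#include <thread>

void slapLeft()  { /* work */ }
void slapRight() { /* work */ }
void spit()      { /* work */ }

int main()
{
    // Parallelism: two subtasks executing at the same instant on
    // independent units (two hardware threads, assuming the machine has them).
    std::thread left(slapLeft);
    std::thread right(slapRight);
    left.join();
    right.join();

    // Asynchrony: dispatch spit() and move forward without waiting for it.
    std::future<void> monkey = std::async(std::launch::async, spit);
    slapLeft();       // we keep making progress while spit() runs elsewhere
    monkey.wait();    // we only synchronize when we actually need the result

    // Concurrency on a single unit would be one thread interleaving the two
    // tasks, pausing one to make progress on the other in the same time span.
    return 0;
}
```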





DirectX 12 employs a different command submission model using command lists, each of which is created by one CPU thread and submitted to one of three queues, corresponding to one of three "engines". We have to be a little careful when reading Microsoft documentation, because their terminology often doesn't coincide with that used by hardware vendors. An "engine" is essentially a command processor, and each has its own queue(s), with the API exposing signalling in the form of fences, which are used to coordinate work across queues. An "engine" is an API construct, to be clear; neither AMD nor NVIDIA refer to any actual hardware blocks as engines - thanks razor1.

The critical thing here is independence. The hardware is no longer forced to approach things sequentially; you can still execute compute and DMA (copy) commands through the graphics queue, but you also have the liberty to exploit the features exposed by the new API, granting more granular control to the developer. This means you can have graphics, compute and DMA dispatches in parallel, through independent dispatchers.

The compute queue, or rather the compute "engine" (the ACEs for AMD, the GMU for NVIDIA), does not have access to fixed function hardware: no rasterizers, no geometry engines, and so on. The compute queue is good for things that need ALU/FPU power and not much else. The copy queue is self-explanatory, but don't forget about it; the key term really shouldn't be "async compute" but "multi-engine", though we'll get to that later.

Parallelism and concurrency can be defined in a multitude of ways, and you will easily find several definitions for each with a quick Google search. Generally speaking, parallelism is a condition that stems from multiple work units; the key here is independence. A task that can be broken up into a series of subtasks that can be executed simultaneously is said to be parallelizable. Slapping people is an inherently parallelizable task for a vast majority of the population; if you split up slapping into left-handed slapping and right-handed slapping, you can now slap with both hands, simultaneously - you scoundrel.

You can't talk about parallelism in a void; you need to specify "where" the parallelism takes place. At the bit level? At the instruction level? The thread level? The transition from 32-bit to 64-bit CPU architectures brought increased bit-level parallelism. The transition to micro-ops in Intel CPUs brought increased instruction-level parallelism. The essence of parallelism is independence and task subdivision.

Concurrency can be seen as a more general notion than parallelism, with parallelism being the subset of concurrency upon which an additional condition is placed: independence. The key to concurrency is interruptibility. In a concurrent execution model, multiple tasks move forward in their execution within the same time span on a single work unit; the start and end times of the running tasks overlap.

If you weren't content with slapping people with two hands and wanted to add insult to injury, you could spit in their eye; but you know how it is; you're a busy person, you don't have all day.
It takes five seconds to go through one cycle of the Parallel Slap™; if you add a two-second spit routine in after it, then you'll only get to do it 8 times a minute. Ain't nobody got time for that; we've all been there, time is money. Instead, you can choose to spit right in their eye in the downtime between your palms making contact with their face and the moment you retract your arms. You can now perform two Parallel Slaps™ and a Concurrent Spit™ in 5 seconds. Congratulations, you are now slapping in parallel and spitting concurrently - you monster.

Yeah... Ahem... Back on track.

Microsoft mentions a few examples of use cases for multi-engine (quoted above). Right off the bat, it's worth saying the term "asynchronous" is abused and misused all the time; if two events are asynchronous, they are time-independent of each other. Developers have made claims about their "async compute" implementations in the past that have been rather confusing, because the term is usually applied to rendering tasks, and it ain't asynchronous if there are data dependencies. To my knowledge, there is no title that actually makes use of ASYNCHRONOUS compute. Most use cases would fall under the third bullet point quoted above.

Asynchronous execution arises when you do not have to wait for a routine to return before dispatching additional work; at least that's how it's conventionally defined, which is somewhat similar to how concurrency was defined earlier - something many people find confusing. I encourage you to go and google "what is concurrency?" or "what's the difference between concurrency and parallelism?"; you'll find tons of answers, each slightly different from all the others. People often have trouble reconciling asynchrony with concurrency and parallelism; in fact, people confuse concurrency and parallelism all the time!

Parallel is the opposite of serial.
Concurrent is the opposite of sequential.
Asynchronous is the opposite of synchronous.

Let's go back to the previous example, and let's say you have a particularly fiendish pet monkey perched on your shoulder. After slapping with both hands in parallel, you exploit the brief stall time before retracting your arms to order your monkey to spit. If we consider the task to be "order your monkey to spit", you are executing it concurrently with a parallel slap. Once the order is given, you can move forward in the SLAP task without waiting for the monkey to actually spit. This is the essence of asynchrony. While from your POV it was concurrent, if you consider the whole system, including the slapping maniac and his asshole-ish monkey, the spitting and the slapping are asynchronous, and possibly parallel; the monkey can spit while you are retracting your arms from the slap.

So now you're slapping with both hands in parallel (two tasks executing at the same exact time, or a task subdivided into two subtasks which execute simultaneously; it just depends on what you define the task to be), ordering your monkey to SPIT concurrently, and the monkey is spitting asynchronously - you slapping, spitting lunatic.

You might be wondering why the spitting is now considered asynchronous instead of concurrent, when all that's changed is that we've offloaded the actual SPIT operation to another unit. Fair game; I understand your confusion, and hopefully this will clear it up. In the first example of parallel slapping and concurrent spitting, your slap and spit operations were not executing at the same time. You "paused" the slap when your hands made contact with skin, switched to spitting, finished spitting, then went back to finish your slapping task by executing the "retract arms" operation.
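In multi-engine terms, ordering the monkey to spit is submitting to the compute queue and moving on. Here's a sketch with illustrative names (the pass names and fence are mine), where the graphics queue only waits when it actually needs the results:

```cpp
// Sketch: the monkey example in multi-engine terms; all names are mine.
// Nobody waits on the compute dispatch until a fence says its results are
// actually needed, and even then the wait is on the GPU timeline only.
#include <d3d12.h>

void RecordFrame(ID3D12CommandQueue* gfxQueue,
                 ID3D12CommandQueue* computeQueue,
                 ID3D12Fence* computeFence, UINT64& computeFenceValue,
                 ID3D12CommandList* shadowPass,
                 ID3D12CommandList* particleSim,
                 ID3D12CommandList* gbufferPass,
                 ID3D12CommandList* lightingPass)
{
    gfxQueue->ExecuteCommandLists(1, &shadowPass);       // SLAP, part 1

    computeQueue->ExecuteCommandLists(1, &particleSim);  // order the monkey to SPIT
    computeQueue->Signal(computeFence, ++computeFenceValue);

    gfxQueue->ExecuteCommandLists(1, &gbufferPass);      // SLAP, part 2: no waiting

    // The graphics queue waits only when it actually consumes the results,
    // and the CPU never blocks at all.
    gfxQueue->Wait(computeFence, computeFenceValue);
    gfxQueue->ExecuteCommandLists(1, &lightingPass);
}
```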
You "paused" the slap when your hands make contact with skin, switched to spitting, finished spitting, then went back to finish your slapping task by executing the 'retract arms' operation.While concurrency and parallelism improve the throughput of a machine, asynchrony improves latency.Using a more serious example; a CPU and I/O operations. If the CPU were to wait for I/O operations to return it would spend a huge deal of time waiting, instead these operations are executed asynchronously such that the CPU only sends the command(s) then moves onto another task without waiting for a return. The I/O operations will still take the same amount of time, but the CPU has effectively hidden a big chunk of latencySo now we've established what we mean by parallel, concurrent and asynchronous we can finally move on to how nvidia and AMD are able to exploit this new command submission model.Now for the purposes of this example let's assume we have two tasks on two different queues, let's call them A and B.Task A is on the graphics queue, and it uses some fixed function hardware.Task B is on the compute queue and uses only ALU/FPU resources.We have two GPUs; one GCN and one Maxwell based, both containing 10 SMs/CUs.Let's assume task A executes in 10 milliseconds on a single unit (an SM or a CU).Task B executes in 3 milliseconds on a single unit.When assigned to a single unit, both GCN and Maxwell execute the task in 10ms with no stall time on the unit whatsoever; CU/SM utilization is 100%.However, if we spread the workload across all units (10) instead of executing in 1ms, it takes 1.25ms on both GCN and Maxwell; there's 0.25ms of stall time ; utilization is now 80%.If we were to leave it at that then execute task B sequentially (spread across all 10 units) we would have a total execution time of 1.25 + 0.3 = 1.55msThere's some stall time on the SM/CUs however, which means there's room for improvement.