Connect directly with NVIDIA Developer Technology Engineers on OpenGL and Vulkan-related topics to get answers to your questions, from regular graphics use to compute shaders, ray tracing, or interop between the APIs. Register for this GTC Connect with Experts session, Vulkan and OpenGL Q&A , May 27, 9AM PT.

This post was updated substantially in March 2020. The original post was about the experimental NVX extension, while the current post is about the production NV version.

We are releasing the VK_NV_device_generated_commands (DGC) Vulkan extension, which allows GPU generation of the most frequent rendering commands. This extension offers an improved design over the experimental NVX extension.

Vulkan now has functionality like the DirectX12 ExecuteIndirect function. However, we added the ability to change shaders ( ShaderGroups ) on the GPU as well. This marks the first time that an open graphics API has provided functionality to create compact command buffer streams on the device, avoiding the worst-case state setup of previous indirect drawing methods.

Motivation

With general advances in programmable shading, the GPU can take on an ever-increasing set of responsibilities for rendering, by computing supplemental data and allowing a greater variety of rendering algorithms to be implemented. However, when it comes to setting up state for draw calls, the decisions must primarily be made on the CPU. Therefore, explicit synchronization or working from past frame’s results was necessary. Device-generated commands remove this readback latency and overcome existing inefficiencies.

Another trend is the development towards global scene representations on the GPU. For example, in ray tracing ( VK_NV/KHR_ray_tracing ), the GPU stores geometry and instance information for the scene. Recent API developments ( VK_VERSION_1_2 , or VK_EXT_descriptor_indexing and VK_EXT_buffer_device_address ) allow the GPU to use bindless resources. This extension supplements such representations and allows you to determine the subset of objects to be rasterized on the device by modifying buffer contents before the command generation.

Figure 1. Unified Scene Representation

In Figure 1, the center represents the scene resources and bindings stored in buffers. With this extension, both ray tracing and rasterization have similar setups to provide geometry and shading information in buffers and pipelines. In both cases, the decision of what work to perform can be done on the GPU: what rays to shoot, or which objects to rasterize and which shading to use.

Some usage scenarios that may benefit from the functionality of this extension are:

Occlusion culling : Use custom shader-based culling techniques to obtain greater accuracy with easy access to the latest depth-buffer information.

: Use custom shader-based culling techniques to obtain greater accuracy with easy access to the latest depth-buffer information. Object sorting : In forward shading-limited scenarios, sort the objects front to back to achieve greater performance.

: In forward shading-limited scenarios, sort the objects front to back to achieve greater performance. Level of detail : Influence the shaders or geometry used for an object depending on its screen-space footprint.

: Influence the shaders or geometry used for an object depending on its screen-space footprint. Work distribution: Bucket arbitrary workloads for improved coherency in their resource usage, for example, by using more specialized shaders that have compile-time optimizations applied.

Evolution of GPU Work Creation

Figure 2. Device-Generated Commands

Over the years, GPU work creation has evolved. First, Draw Indirect provided the ability to source basic parameters for a draw call (number of primitives, instances, etc.) from GPU buffers that can be written by shaders.

Next came the popular Multi Draw Indirect feature that is practical for both CPU time reasons, including approaching zero driver overhead (AZDO) as well as GPU culling or level-of-detail algorithms.

Following this, our OpenGL GL_NV_command_list extension added the ability to do basic state changes (for example, vertex/index/uniform-buffers) using tokens stored in regular GPU buffers. Similar functionality was later exposed in DirectX 12 Execute Indirect , and we are now introducing this to Vulkan with DGC.

One problem with existing indirect techniques is that the command buffer must be prepared for the worst-case scenario on the CPU. You may also require more memory than necessary for the distribution of indirect commands, given that it is not known which state buckets they will end up in. To solve these problems, this extension adds further capabilities, such as the ability to switch between different shaders.

Figure 3. Indirect State Commands

Principal usage

To leverage this new functionality, there are a few additions to the Vulkan API. You need to express what commands to generate, make shaders bindable from the device, and run the actual generation and execution.

Figure 4. Indirect Commands Layout

A new Vulkan object type is introduced:

VkIndirectCommandsLayoutNV : The layout encodes the sequence of commands to be generated and where the data is found in the input streams. In addition to the command tokens and predefined functional arguments (binding units, variable sizes, etc.), additional usage flags that affect the generation can also be provided.

Graphics pipeline creation is extended by adding shader groups:

VkGraphicsPipelineShaderGroupsCreateInfoNV : As part of VkGraphicsPipelineCreateInfo , you can define multiple shader groups within the pipeline. By default, every pipeline has one group. The shader groups are bound through an index either coming from an indirect command or the new vkCmdBindPipelineShaderGroupNV . It is also possible to append shader groups by referencing existing compatible graphics pipelines, which eases integration.

: As part of , you can define multiple shader groups within the pipeline. By default, every pipeline has one group. The shader groups are bound through an index either coming from an indirect command or the new . It is also possible to append shader groups by referencing existing compatible graphics pipelines, which eases integration. VkGraphicsShaderGroupCreateInfoNV : A shader group can override a subset of the pipeline’s base graphics state. This includes the ability to change shaders, as well as the vertex or tessellation setup.

The following steps generate commands on the device:

Define a sequence of commands to be generated, using a VkIndirectCommandsLayoutNV object. For the ability to change shaders, create the graphics pipelines with VK_PIPELINE_CREATE_INDIRECT_BINDABLE_BIT_NV and then create an aggregate graphics pipeline that imports those pipelines as graphics shader groups using VkGraphicsPipelineShaderGroupsCreateInfoNV , which extends VkGraphicsPipelineCreateInfo . Create a preprocess buffer based on sizing information acquired using vkGetGeneratedCommandsMemoryRequirementsNV . Fill the input buffers for the generation step and set up the VkGeneratedCommandsInfoNV struct accordingly. It is used in preprocessing as well as execution. Some commands require buffer addresses, which can be acquired using vkGetBufferDeviceAddress . (Optional) Use a separate preprocess step using vkCmdPreprocessGeneratedCommandsNV . Run the execution using vkCmdExecuteGeneratedCommandsNV .

Key design choices

Compared to the approach of GL_NV_command_list or DirectX12 ExecuteIndirect , this extension uses some different design choices:

Graphics Shader Groups : Like ray tracing, a graphics pipeline contains a collection of shader groups that can be accessed from the device.

: Like ray tracing, a graphics pipeline contains a collection of shader groups that can be accessed from the device. Stateless Command Sequences: While the new design also uses command tokens, these tokens define a stateless sequence. The GL_NV_command_list extension encoded a serial token stream, which allowed you to resolve redundant state setup and create compact streams. However, the serial token stream design was not as portable or parallel-processing friendly as the traditional stateless design of MDI or ExecuteIndirect . Not all command types must be generated, although stateful commands must be defined before work-provoking ones.

VkIndirectCommandsTokenTypeNV Equivalent vkCmd Input buffer data VK_INDIRECT_COMMANDS_TOKEN_TYPE_SHADER_GROUP_NV vkCmdBindPipelineShaderGroup ShaderGroup index VK_INDIRECT_COMMANDS_TOKEN_TYPE_STATE_FLAGS_NV – vkFrontFace state VK_INDIRECT_COMMANDS_TOKEN_TYPE_INDEX_BUFFER_NV vkCmdBindIndexBuffer VkDeviceAddress and VkIndexType VK_INDIRECT_COMMANDS_TOKEN_TYPE_VERTEX_BUFFER_NV vkCmdBindVertexBuffers VkDeviceAddress and stride VK_INDIRECT_COMMANDS_TOKEN_TYPE_PUSH_CONSTANT_NV vkCmdPushConstant Raw values VK_INDIRECT_COMMANDS_TOKEN_TYPE_DRAW_INDEXED_NV vkCmdDrawIndexedIndirect Draw arguments VK_INDIRECT_COMMANDS_TOKEN_TYPE_DRAW_NV vkCmdDrawIndirect Draw arguments VK_INDIRECT_COMMANDS_TOKEN_TYPE_DRAW_TASKS_NV vkCmdDrawMeshTasksIndirectNV Draw arguments

Optional Explicit Preprocessing : This extension optionally allows some preprocessing to be done separately. This way, you can asynchronously overlap preparation of new draw calls with the drawing of a previous batch.

: This extension optionally allows some preprocessing to be done separately. This way, you can asynchronously overlap preparation of new draw calls with the drawing of a previous batch. Explicit Resource Control: To fit into Vulkan’s explicit design, the memory requirements for command generation are exposed and managed directly through buffers by you.

Figure 5. Asynchronous Preprocessing

Multiple Input Streams: Previous approaches only allowed a single input buffer stream. MDI and ExecuteIndirect are designed to use an array of structures (AoS) layout, which makes partial updates less cache-efficient. DGC lets you define how many input streams are used and where tokens are stored. This way, you can choose to use a structure of arrays (SoA) layout for compact, cache-friendly streams, or use a hybrid layout to easily separate dynamic from static content.

Figure 6. Multiple Input Streams

Relaxed Ordering : By default, all sequences are processed in linear order. You can relax the ordering requirement to be implementation-dependent, which speeds up processing.

: By default, all sequences are processed in linear order. You can relax the ordering requirement to be implementation-dependent, which speeds up processing. Sequence Count & Index Buffer: The number of sequences, as well as the subset, can be provided by additional buffers containing count and index values. This allows for easy sorting or filtering of draw calls without having to change the input data.

Figure 7. Flexible Sequencing

What’s the catch?

No feature is free of trade-offs. A device-generation approach means that some driver-side optimizations may not apply. Furthermore, the generation process can add to the overall frame time, in cases where the CPU is able to record commands without affecting the GPU time. Finally, it requires additional GPU memory.

In summary, the goal of this extension is primarily to reduce the amount of actual work done on the GPU, by making decisions on the device about what and how work is generated. It is not about off-loading command generation from CPU to the GPU in general.

Sample and availability

A new open-source sample is available on GitHub. This extension is exposed in our beta driver first, and will be available in production drivers later.

Special thanks to Neil Bickford, Matthew Rusch, Mathias Schott, Daniel Koch, James Helferty, and Markus Tavenrath who contributed to the posts.