The increased performance potential of modern graphics APIs is coupled with a dramatically increased level of developer responsibility. Optimal use of Vulkan is not a trivial concept, especially in the context of a large engine, and information about how to maximize performance is still somewhat sparse. The following document is a compilation of wisdom from some of the Vulkan experts at NVIDIA. It is not exhaustive, and is expected to be augmented over time, but should be a useful stepping stone for developers looking to utilize Vulkan to its full potential.

Engine Architecture

Do

Parallelize command buffer recording, image and buffer creation, descriptor set updates, pipeline creation, and memory allocation / binding. A task graph architecture is a good option: it allows sufficient parallelism in terms of draw submission while also respecting resource and command queue dependencies.

Don’t

Don’t expect the driver to move processing of your Vulkan API commands to a worker thread. While the total cost of recording command buffers on Vulkan should be relatively low, the amount of work measured on the application’s thread may be larger due to the loss of driver threading. The more efficiently you use the CPU's hardware cores to record work in parallel, the greater the benefit you can expect in draw call submission performance.

Debugging

Do

Use the validation layers! The validation layers can flag many errors in the command stream, which can help avoid bugs in your application.

When using MSVC on Windows, debug builds will by default enable the MSVCRT debug heap, which can slow the validation layers down. Setting the environment variable _NO_DEBUG_HEAP=1 disables the debug heap and is recommended, if possible. Note that the debug heap can help catch subtle memory corruption issues on the application side, so keep this in mind when deciding whether to disable it.

During development, register a debug callback by using VK_EXT_debug_utils. The driver calls this for various non-performance critical validation checks it might perform.

The same extension also allows attaching debug names to resources; the callback will provide this information back to the application and debuggers like Nsight and RenderDoc for resource identification. Make sure to turn it off in a retail build for performance reasons.

Annotate your rendering regions with VK_EXT_debug_marker, or use the newer VK_EXT_debug_utils, to improve the debugging and profiling experience when using tools. Note that this extension will only be visible when a tool that consumes the information is available.

NVIDIA profiling tools utilize these markers so it’s recommended to keep region annotations even in release builds. The cost of a single check per region for the existence of the extension is negligible.

Work Submission

Do

Accept the fact that you are responsible for achieving and controlling GPU/CPU parallelism. Submitting work to command lists doesn’t start any work on the GPU.

Calling vkQueueSubmit() does start work on the GPU.

Use a separate command pool for each thread which records command buffers, for each frame.

Build command buffers in parallel and evenly across several threads/cores to multiple command lists. Recording commands is a CPU-intensive operation and no driver threads come to the rescue.

Be aware of the cost of setting up and resetting a command list. A reasonable number of command lists is required for efficient parallel work submission.

Synchronization across command lists can force them to be split.

Aim for 15-30 command buffers and 5-10 vkQueueSubmit() calls per frame, and batch multiple VkSubmitInfo structures into a single vkQueueSubmit() call as much as possible. Each vkQueueSubmit() has a performance cost on CPU, so fewer calls are generally better. Note that VkSemaphore-based synchronization can only be done across vkQueueSubmit() calls, so you may be forced to split work up into multiple submits.

Functions such as vkAllocateCommandBuffers(), vkBeginCommandBuffer(), and vkEndCommandBuffer() should be called from the thread which fills the command buffer. These calls take measurable time on CPU and therefore should not be collected in a specific thread.

Check for gaps in execution on GPU using GPUView.

Reuse command buffers when possible. Secondary command buffers can be helpful here, depending on the workload – check carefully to determine if they are actually advantageous.

Don’t

Don’t submit small command buffers. If a submission is processed on the GPU faster than new ones can be submitted on the CPU, it will result in wasted / idle GPU cycles.

Don’t overlap compute work on the graphics queue with compute work on a dedicated asynchronous compute queue. This may lead to gaps in execution of the asynchronous compute queue.

Switch compute workload to graphics workloads in this case if possible.

Don’t design around lots of command buffer reuse. These usually generate many per-frame changes in terms of object visibility, etc.

Post-processing may be an exception.

Don’t create too many threads or too many command lists. Too many threads will oversubscribe your CPU resources; too many command lists may accumulate too much overhead.
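As a rough illustration of the advice above (record evenly across threads, but avoid tiny command buffers and excessive thread counts), the following sketch splits a frame's draws into per-thread ranges. It is a minimal sketch under stated assumptions: `DrawRange`, `partitionDraws`, and the `minDrawsPerBuffer` tuning knob are hypothetical names that do not come from the Vulkan API, and the right threshold has to be found by profiling.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open range [first, last) of draw indices assigned to one recording thread.
struct DrawRange {
    std::size_t first;
    std::size_t last;
};

// Split `drawCount` draws across at most `maxThreads` workers. When more than
// one worker is used, each receives at least `minDrawsPerBuffer` draws
// (a hypothetical tuning knob) so that command buffers do not become tiny.
std::vector<DrawRange> partitionDraws(std::size_t drawCount,
                                      std::size_t maxThreads,
                                      std::size_t minDrawsPerBuffer) {
    assert(maxThreads > 0 && minDrawsPerBuffer > 0);
    std::vector<DrawRange> ranges;
    if (drawCount == 0) return ranges;

    // Cap the worker count so every command buffer stays reasonably large.
    std::size_t workers = std::min(maxThreads, drawCount / minDrawsPerBuffer);
    workers = std::max<std::size_t>(workers, 1);

    // Spread the remainder over the first workers so sizes differ by at most one.
    const std::size_t base = drawCount / workers;
    const std::size_t extra = drawCount % workers;
    std::size_t first = 0;
    for (std::size_t i = 0; i < workers; ++i) {
        const std::size_t count = base + (i < extra ? 1 : 0);
        ranges.push_back({first, first + count});
        first += count;
    }
    return ranges;
}
```

Each returned range would then be recorded into its own command buffer by one worker thread, using that thread's command pool.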

Pipeline

Do

Create pipelines asynchronously to rendering.

Use a pipeline cache.

Use specialization constants. This may reduce the number of instructions and registers used by the shader.

Specialization constants can also be used instead of offline shader permutations to minimize the amount of bytecode that needs to be shipped with an application.

Start using more general pipelines (with generic shaders that compile quickly) first and generate specializations later. This gets you up and running faster even if you are not running the most optimal pipeline/shader yet.

Minimize state changes between pipelines where possible. A pipeline doesn’t necessarily map to an atomic state change on GPU.

Group draw calls, taking into account what kinds of shaders they use.

Changing the depth comparison function to the opposite value (less->greater) disables Z-cull.

Switching tessellation and geometry shaders on/off is an expensive operation.

Use identical, sensible defaults for don’t-care fields wherever possible. This creates more possibilities for PSO reuse.
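One simple way to group draw calls by the shaders and state they use is to sort the draw list by pipeline before recording. The sketch below assumes a hypothetical `Draw` record in which `pipelineId` stands in for a `VkPipeline` handle. It only applies where draw order is free to change (opaque geometry, for example), and is not presented as the only or best sort key.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical draw record; `pipelineId` stands in for a VkPipeline handle.
struct Draw {
    std::uint32_t pipelineId;
    std::uint32_t meshId;
};

// Count how many pipeline binds are needed if the draws are recorded in the
// given order (the first draw always needs a bind).
std::size_t countPipelineBinds(const std::vector<Draw>& draws) {
    std::size_t binds = 0;
    for (std::size_t i = 0; i < draws.size(); ++i) {
        if (i == 0 || draws[i].pipelineId != draws[i - 1].pipelineId) ++binds;
    }
    return binds;
}

// Group draws that share a pipeline. stable_sort keeps the original
// submission order within each pipeline group.
void groupByPipeline(std::vector<Draw>& draws) {
    std::stable_sort(draws.begin(), draws.end(),
                     [](const Draw& a, const Draw& b) {
                         return a.pipelineId < b.pipelineId;
                     });
}
```

Comparing `countPipelineBinds()` before and after `groupByPipeline()` gives a quick estimate of how many pipeline switches the sort removes for a given frame.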

Don’t

Don’t expect speedup from Pipeline Derivatives.

Pipeline Layout

Do

Try to keep the number of descriptor sets in pipeline layouts as low as possible.

Minimize the number of descriptors in the descriptor sets.

Use push constants for per draw call updates of constants. However, the performance benefit depends on the amount of per draw call data.

Use dynamic uniform/storage buffers for per draw call changes of uniform/storage buffers.

Prefer using combined image and sampler handles.
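For dynamic uniform buffers, per-draw data is typically packed into one large buffer and selected with a dynamic offset at bind time. Each offset must be a multiple of the device's `minUniformBufferOffsetAlignment` limit. The following sketch shows only the offset arithmetic; `alignUp` and `dynamicOffsetForDraw` are hypothetical helper names, and the alignment would come from `VkPhysicalDeviceLimits` in real code.

```cpp
#include <cassert>
#include <cstdint>

// Round `value` up to the next multiple of `alignment`.
// `alignment` must be a non-zero power of two.
std::uint64_t alignUp(std::uint64_t value, std::uint64_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}

// Byte offset of the data for draw `drawIndex` inside one large uniform
// buffer, where every draw occupies one aligned slot of `perDrawSize` bytes.
// The result is the kind of value passed as a dynamic offset at bind time.
std::uint32_t dynamicOffsetForDraw(std::uint32_t drawIndex,
                                   std::uint64_t perDrawSize,
                                   std::uint64_t minOffsetAlignment) {
    const std::uint64_t stride = alignUp(perDrawSize, minOffsetAlignment);
    return static_cast<std::uint32_t>(drawIndex * stride);
}
```

With an alignment of 256 bytes, an 80-byte per-draw block occupies a 256-byte slot, so small per-draw payloads may be better served by push constants.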

Command Pools and Buffers

Do

Reuse command pools for similarly sized sequences of draw calls.

Allocations are fast when the buffer has been pre-warmed.

Use L * T + N pools. (L = the number of buffered frames, T = the number of threads which record command buffers, N = extra pools for secondary command buffers).

Call vkResetCommandPool() before reusing a pool in another frame. Otherwise the pool will keep on growing until you run out of memory.
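The L * T + N scheme can be expressed as a flat array of pools indexed by frame slot and thread. The sketch below covers only the index arithmetic; `PoolLayout` and its members are hypothetical names, and the actual `VkCommandPool` handles would live in an array of `totalPools()` entries.

```cpp
#include <cassert>
#include <cstddef>

// Index arithmetic for a flat array of L * T + N command pools.
struct PoolLayout {
    std::size_t bufferedFrames;    // L: number of buffered frames
    std::size_t recordingThreads;  // T: threads that record command buffers
    std::size_t extraPools;        // N: extra pools, e.g. for secondary command buffers

    std::size_t totalPools() const {
        return bufferedFrames * recordingThreads + extraPools;
    }

    // Pool owned by `thread` while recording frame number `frame`.
    std::size_t poolIndex(std::size_t frame, std::size_t thread) const {
        assert(thread < recordingThreads);
        return (frame % bufferedFrames) * recordingThreads + thread;
    }

    // Extra pools are stored after the per-frame, per-thread pools.
    std::size_t extraPoolIndex(std::size_t i) const {
        assert(i < extraPools);
        return bufferedFrames * recordingThreads + i;
    }
};
```

Before recording frame f, the T pools at `poolIndex(f, t)` would be reset, once the GPU has finished the earlier frame that used the same slot.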

Don’t

Don’t create/destroy command pools, reuse them instead. Save the overhead of allocator creation/destruction and memory allocation/free.

Don’t forget that command pools consume GPU memory.

Memory Management

Do

Avoid video memory overcommitment. vkAllocateMemory() will return VK_ERROR_OUT_OF_DEVICE_MEMORY or VK_ERROR_OUT_OF_HOST_MEMORY.

When memory is over-committed on Windows, the OS may temporarily suspend a process from the GPU runlist in order to page out its allocations to make room for a different process’ allocations. There is no OS memory manager on Linux that mitigates over-commitment by automatically performing paging operations on memory objects.

Use dedicated memory allocations (VK_KHR_dedicated_allocation, core in VK 1.1) when appropriate.

Using dedicated memory for at least some allocations can help mitigate problems that may occur when device-local memory consumption is near or exceeds the size of a device-local memory heap. This may improve performance for color and depth attachments.

Use VK_KHR_get_memory_requirements2 (core in VK 1.1) to check whether an image/buffer needs a dedicated allocation.

Use memory sub-allocation. vkAllocateMemory() is an expensive operation on the CPU. Cost can be reduced by sub-allocating from a large memory object. Memory is allocated in pages which have a fixed size; sub-allocation helps to decrease the memory footprint.

Group memory binding calls (VK_KHR_bind_memory2). vkBind*Memory() is an expensive operation on the CPU.

Explicitly look for the VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT when picking a memory type for resources which should be stored in video memory.

Ideally stay below 1024 allocations to reduce CPU overhead on Windows 7. The more allocations, the more overhead on work submission. This cost does not impact Linux.
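To illustrate sub-allocation, here is a minimal sketch of a linear (bump) allocator that hands out aligned offsets inside one large memory block. It assumes the block itself came from a single vkAllocateMemory() call and that the returned offset is what gets passed as the memory offset when binding a buffer or image. `LinearSubAllocator` is a hypothetical name. The sketch does not free individual allocations and ignores bufferImageGranularity, which a production allocator would have to handle.

```cpp
#include <cassert>
#include <cstdint>

// Linear sub-allocator over one large memory block. It only tracks offsets;
// the block would be a single VkDeviceMemory object in real code.
class LinearSubAllocator {
public:
    explicit LinearSubAllocator(std::uint64_t blockSize) : size_(blockSize), head_(0) {}

    // Reserve `size` bytes at an offset that is a multiple of `alignment`
    // (a non-zero power of two). Returns false if the block is exhausted.
    bool allocate(std::uint64_t size, std::uint64_t alignment, std::uint64_t& outOffset) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::uint64_t offset = (head_ + alignment - 1) & ~(alignment - 1);
        if (offset > size_ || size > size_ - offset) return false;
        outOffset = offset;
        head_ = offset + size;
        return true;
    }

    // Release everything at once, e.g. when a level is unloaded.
    void reset() { head_ = 0; }

    std::uint64_t used() const { return head_; }

private:
    std::uint64_t size_;
    std::uint64_t head_;
};
```

The size and alignment arguments would come from the memory requirements queried for each resource.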

Don’t

Don’t rely on configuration of memory heaps/types.

Always query and use the memory properties using vkGetPhysicalDeviceMemoryProperties().

Always query and use the memory requirements of an image/buffer using vkGet*MemoryRequirements().

Don’t create and destroy resources repeatedly if you can avoid it. Resource creation, destruction, and memory binding are expensive operations on the CPU.

Don’t put every resource into a Dedicated Allocation.

For memory objects that are intended to be device-local, do not just pick the first memory type. Pick one that is actually device-local.
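The memory type selection described above usually comes down to a small search loop. The sketch below uses stand-in structs that mirror the shape of `VkPhysicalDeviceMemoryProperties` so it compiles without the Vulkan headers. The struct names, the constants, and `findMemoryType` are assumptions for illustration; real code would use the Vulkan types and the values returned by vkGetPhysicalDeviceMemoryProperties() and vkGet*MemoryRequirements().

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins mirroring the shape of the Vulkan memory property structures.
struct MemoryType {
    std::uint32_t propertyFlags;
    std::uint32_t heapIndex;
};
struct MemoryProperties {
    std::uint32_t memoryTypeCount;
    MemoryType memoryTypes[32];
};

// Stand-ins for VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT and friends.
constexpr std::uint32_t kDeviceLocalBit = 0x1;
constexpr std::uint32_t kHostVisibleBit = 0x2;
constexpr std::uint32_t kHostCoherentBit = 0x4;

// Return the index of the first memory type that the resource allows
// (bit i set in `memoryTypeBits`) and that has all `requiredFlags`,
// or -1 if no type qualifies.
int findMemoryType(const MemoryProperties& props,
                   std::uint32_t memoryTypeBits,
                   std::uint32_t requiredFlags) {
    for (std::uint32_t i = 0; i < props.memoryTypeCount; ++i) {
        const bool allowed = (memoryTypeBits & (1u << i)) != 0;
        const bool hasFlags =
            (props.memoryTypes[i].propertyFlags & requiredFlags) == requiredFlags;
        if (allowed && hasFlags) return static_cast<int>(i);
    }
    return -1;
}
```

Passing `kDeviceLocalBit` as the required flag skips a host-visible type at index 0 instead of blindly picking it, which is the failure mode the last point warns about.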

Resources

Do

Copy both depth and stencil to avoid a slow path for copying.

Always use VK_IMAGE_TILING_OPTIMAL. VK_IMAGE_TILING_LINEAR is not optimal. Use a staging buffer and vkCmdCopyBufferToImage() to update images on the device.

Prefer using 24-bit depth formats for optimal performance.

Prefer using packed depth/stencil formats. This is a common cause for notable performance differences between an OpenGL and Vulkan implementation.

Don’t

Don’t use 32-bit floating point depth formats, due to the performance cost, unless improved precision is actually required.

Render Passes

Do

Remember to properly specify VkAttachmentDescription::storeOp if you need the render pass output later on.

Barriers

Do

Minimize the use of barriers. A barrier may cause a GPU pipeline flush. We have seen redundant barriers and associated wait for idle operations as a major performance problem for ports to modern APIs.

Make sure to always use the minimum set of resource usage flags. Redundant flags may trigger redundant flushes and stalls in barriers and slow down your app unnecessarily.

Specify the minimum set of memory barriers in vkCmdPipelineBarrier(). Adding false dependencies adds redundancy.

Group barriers in one call to vkCmdPipelineBarrier(). This way the worst case can be picked instead of sequentially going through all barriers.

Use optimal srcStageMask and dstStageMask. Most important cases: If the specified resources are accessed only in compute or fragment shaders, use the compute or the fragment stage bits for both masks, to make the barrier fragment-only or compute-only.

Use VK_IMAGE_LAYOUT_UNDEFINED when the previous content of the image is not needed.

Don’t

Don’t insert redundant barriers; they limit parallelism. Avoid read-to-read barriers.

Get the resource in the right state for all subsequent reads.
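Grouping barriers into one vkCmdPipelineBarrier() call can be handled by collecting pending barriers and flushing them together just before the next command that depends on them. The following is a minimal sketch of that bookkeeping with stand-in types. `PendingBarrier`, `BarrierBatch`, and the stage constants are hypothetical, and `flush()` only counts calls where real code would record a single vkCmdPipelineBarrier() with the merged stage masks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-ins for fragment and compute shader pipeline stage bits.
constexpr std::uint32_t kFragmentShaderStage = 0x80;
constexpr std::uint32_t kComputeShaderStage = 0x800;

// Stand-in for one pending image or buffer barrier.
struct PendingBarrier {
    std::uint32_t resourceId;
    std::uint32_t srcStageMask;
    std::uint32_t dstStageMask;
};

// Collects barriers and merges their stage masks so they can be issued
// in a single call instead of one call per resource.
class BarrierBatch {
public:
    BarrierBatch() : srcStages_(0), dstStages_(0), flushCount_(0) {}

    void add(const PendingBarrier& barrier) {
        srcStages_ |= barrier.srcStageMask;
        dstStages_ |= barrier.dstStageMask;
        barriers_.push_back(barrier);
    }

    // Real code would record one vkCmdPipelineBarrier() here using the merged
    // masks and every collected barrier. This sketch only counts the call.
    void flush() {
        if (barriers_.empty()) return;  // nothing pending: emit no barrier at all
        ++flushCount_;
        barriers_.clear();
        srcStages_ = 0;
        dstStages_ = 0;
    }

    std::uint32_t srcStages() const { return srcStages_; }
    std::uint32_t dstStages() const { return dstStages_; }
    std::size_t pending() const { return barriers_.size(); }
    std::size_t flushCount() const { return flushCount_; }

private:
    std::vector<PendingBarrier> barriers_;
    std::uint32_t srcStages_;
    std::uint32_t dstStages_;
    std::size_t flushCount_;
};
```

Keeping each barrier's masks as narrow as possible still matters, since the merged mask is the union of everything added to the batch.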

More Information

You can find additional information about using Vulkan with NVIDIA GPUs in Introduction to Real-Time Ray Tracing with Vulkan, Turing Extensions for Vulkan and OpenGL, and Path Tracing for Quake II in Two Months.