Engine Architecture/Structure

Do’s

Prefer a task graph architecture for parallel draw submission

This way you may achieve sufficient parallelism in terms of draw submission whilst making sure that resource and command queue dependencies get respected



Consider a ‘Master Render Thread’ for work submission with a couple of ‘Worker Threads’ for command list recording, resource creation and Pipeline State Object (PSO) compilation

The idea is to have the worker threads generate command lists and the master thread pick those up and submit them

Expect to maintain, at a minimum, a separate render path for each IHV

The app has to replace driver reasoning about how to most efficiently drive the underlying hardware

Don’ts

Don’t rely on the driver to parallelize any Direct3D 12 work in driver threads

On DX11 the driver does farm off asynchronous tasks to driver worker threads where possible – this doesn’t happen anymore under DX12



While the total cost of work submission in DX12 has been reduced, the amount of work measured on the application’s thread may be larger due to the loss of driver threading. The more efficiently one can use parallel hardware cores of the CPU to submit work in parallel, the more benefit in terms of draw call submission performance can be expected.
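The master/worker arrangement described in this section can be sketched in portable C++. This is only a model under stated assumptions: `RecordedList`, `recordRange` and `renderFrame` are hypothetical names, and a vector of strings stands in for a real command list. Each worker records into its own list and the master thread collects the lists in a fixed order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a recorded command list; a real engine would
// hold an ID3D12GraphicsCommandList here.
struct RecordedList {
    std::vector<std::string> commands;
};

// Worker job: record the draws in [first, last) into the worker's own list.
// Each worker writes only to its own slot, so no locking is needed.
void recordRange(RecordedList& out, std::size_t first, std::size_t last) {
    for (std::size_t i = first; i < last; ++i) {
        out.commands.push_back("draw " + std::to_string(i));
    }
}

// Master thread: fan recording out to workers, then submit in a fixed order.
std::vector<std::string> renderFrame(std::size_t drawCount, std::size_t workerCount) {
    assert(workerCount > 0);
    std::vector<RecordedList> lists(workerCount);
    std::vector<std::thread> workers;
    const std::size_t perWorker = (drawCount + workerCount - 1) / workerCount;
    for (std::size_t w = 0; w < workerCount; ++w) {
        const std::size_t first = std::min(w * perWorker, drawCount);
        const std::size_t last = std::min(first + perWorker, drawCount);
        workers.emplace_back(recordRange, std::ref(lists[w]), first, last);
    }
    for (std::thread& t : workers) {
        t.join();
    }
    // Models the order in which the queue would execute the lists.
    std::vector<std::string> submitted;
    for (const RecordedList& list : lists) {
        submitted.insert(submitted.end(), list.commands.begin(), list.commands.end());
    }
    return submitted;
}
```

For simplicity the sketch joins every worker before submitting anything. A real engine would more likely submit finished batches while other workers are still recording, so the GPU is not left idle.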

Work Submission – Command Lists & Bundles

Do’s

Accept the fact that you are responsible for achieving and controlling GPU/CPU parallelism

Submitting work to command lists doesn’t start any work on the GPU



Calls to ExecuteCommandLists() finally do start work on the GPU

Submit work in parallel and evenly across several threads/cores to multiple command lists

Recording commands is a CPU intensive operation and no driver threads come to the rescue



Command lists are not free threaded so parallel work submission means submitting to multiple command lists

Be aware of the fact that there is a cost associated with setup and reset of a command list

You still need a reasonable number of command lists for efficient parallel work submission



Fences force the splitting of command lists for various reasons (multiple command queues, picking up the results of queries)

Try to aim at a reasonable number of command lists in the range of 15-30 or below. Try to bundle those CLs into 5-10 ExecuteCommandLists() calls per frame.

Reuse fragments recorded in bundles if you can

No need to spend CPU time once again

Use bundle resource binding inheritance sparingly

This allows bundles to be reused with less overhead as it facilitates more thoroughly cooked bundles



Check carefully if the use of a separate compute command queue really is advantageous

Even for compute tasks that can in theory run in parallel with graphics tasks, the actual scheduling details of the parallel work on the GPU may not generate the results you hope for

Be conscious of which asynchronous compute and graphics workloads can be scheduled together - use fences to pair up the right workloads

Make sure to use just one CBV/SRV/UAV descriptor heap as a ring-buffer for all frames if you want to aim at running parallel asynchronous compute and graphics workloads

Don’ts

Don’t use bundles to record more than a few draw calls (e.g. ~12 draw calls is fine)

Otherwise you typically limit the reusability of the bundle

Don’t overlap compute work on the 3D queue with compute work on a dedicated asynchronous compute queue

This may lead to bubbles in the asynchronous compute queue



Switch compute workload to graphics workloads in this case if possible

Don't submit extremely small command lists.

Small command lists can sometimes complete faster than the OS scheduler on the CPU can submit new ones. This can result in wasted idle GPU cycles.



The OS takes 50-80 microseconds to schedule command lists after the previous ExecuteCommandLists call. If a command list or all command lists in the call executes faster than that, there will be a bubble in the HW queue



Check for bubbles using GPUView

Don’t record everything or big scene parts in just very few command lists

This limits your ability to fully utilize all your CPU cores



Also building a few large command lists means you’ll potentially find it harder to keep the GPU from going idle

Don’t submit only at the end of frame after you have recorded everything

You may waste the opportunity to keep the GPU working in parallel with the recording of other command lists

Don’t expect lots of list reuse

There are usually many per-frame changes in terms of objects visibility etc.



Post-processing may be an exception

Don’t create too many threads or too many command lists

Too many threads will oversubscribe your CPU resources, whilst too many command lists may accumulate too much overhead
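The advice in this section (spread recording evenly across a moderate number of command lists, then bundle those lists into a handful of ExecuteCommandLists() calls) can be modelled as two small partitioning helpers. This is a sketch only: `partitionDraws` and `groupIntoSubmits` are hypothetical names, and the counts in the usage note are simply picked from the ranges quoted above.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Split drawCount draws into listCount contiguous, near-equal ranges.
// Returns [first, last) index pairs, one per command list.
std::vector<std::pair<std::size_t, std::size_t>>
partitionDraws(std::size_t drawCount, std::size_t listCount) {
    assert(listCount > 0);
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    const std::size_t base = drawCount / listCount;
    const std::size_t extra = drawCount % listCount;  // first `extra` lists get one more draw
    std::size_t first = 0;
    for (std::size_t i = 0; i < listCount; ++i) {
        const std::size_t count = base + (i < extra ? 1 : 0);
        ranges.emplace_back(first, first + count);
        first += count;
    }
    return ranges;
}

// Group list indices into submitCount batches, one batch per
// ExecuteCommandLists() call, preserving list order. Batches can be
// empty when there are fewer lists than batches.
std::vector<std::vector<std::size_t>>
groupIntoSubmits(std::size_t listCount, std::size_t submitCount) {
    assert(submitCount > 0);
    std::vector<std::vector<std::size_t>> batches(submitCount);
    for (std::size_t i = 0; i < listCount; ++i) {
        batches[i * submitCount / listCount].push_back(i);
    }
    return batches;
}
```

For example, `partitionDraws(1000, 24)` yields 24 ranges of 41 or 42 draws each, and `groupIntoSubmits(24, 6)` places four consecutive lists into each of six submissions.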

Pipeline State Objects (PSOs)

Do’s

Create PSOs on worker threads asynchronously

PSO creation is where shader compilation and related stalls happen



Start using more general PSOs (with generic shaders that compile quickly) first and generate specializations later

This gets you up and running faster even if you are not running the most optimal PSO/shader yet

It is your job to generate shader specializations – the driver will not generate constant optimized shader variants behind your back



Avoid runtime PSO compilations as they most likely will lead to stalls

The driver-managed shader disk cache may come to the rescue though



Minimize state changes between PSOs where possible

A PSO doesn’t necessarily map to an atomic state change on the GPU



Use identical sensible defaults for don’t care fields wherever possible

This allows for more possibilities for PSO reuse

Use the /all_resources_bound / D3DCOMPILE_ALL_RESOURCES_BOUND compile flag if possible

This allows for the compiler to do a better job at optimizing texture accesses. We have seen frame rate improvements of > 1% when toggling this flag on.

Don’ts

Don’t toggle between compute and graphics on the same command queue more than absolutely necessary

This is still a heavyweight switch to make



Don’t toggle tessellation on/off more than absolutely necessary

Again, this is still a heavyweight switch to make



Don’t forget that PSO creation is where shaders get compiled and stalls get introduced

It is really important to create PSOs asynchronously and early enough before they get used



Tread carefully with thread priorities for PSO compilation threads



Use Idle priority if there is no ‘hurry’ to prevent slowdowns for game threads





Consider temporarily boosting priorities when there is a ‘hurry’
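The point about identical defaults for don’t care fields can be illustrated with a small cache keyed on a normalized description. This is a hedged model, not D3D12's actual structures: `PsoDesc`, `PsoCache`, the two blend fields and their default values are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <tuple>

// Hypothetical, heavily reduced PSO description (not D3D12's real struct).
struct PsoDesc {
    std::uint32_t shaderId;
    bool blendEnable;
    std::uint32_t srcBlend;   // don't-care when blendEnable is false
    std::uint32_t destBlend;  // don't-care when blendEnable is false
};

constexpr std::uint32_t kDefaultSrcBlend = 1;   // assumed default value
constexpr std::uint32_t kDefaultDestBlend = 0;  // assumed default value

// Force don't-care fields to one agreed default so that equivalent
// states compare equal.
PsoDesc normalize(PsoDesc d) {
    if (!d.blendEnable) {
        d.srcBlend = kDefaultSrcBlend;
        d.destBlend = kDefaultDestBlend;
    }
    return d;
}

class PsoCache {
public:
    // Returns a stable id per distinct normalized description. A real
    // engine would create (ideally asynchronously) and store the PSO.
    int getOrCreate(const PsoDesc& raw) {
        const PsoDesc d = normalize(raw);
        const Key key{d.shaderId, d.blendEnable, d.srcBlend, d.destBlend};
        const auto it = cache_.find(key);
        if (it != cache_.end()) {
            return it->second;
        }
        const int id = static_cast<int>(cache_.size());
        cache_.emplace(key, id);
        return id;
    }

    std::size_t size() const { return cache_.size(); }

private:
    using Key = std::tuple<std::uint32_t, bool, std::uint32_t, std::uint32_t>;
    std::map<Key, int> cache_;
};
```

Two descriptions that differ only in blend factors while blending is disabled map to the same entry, so no second PSO needs to be compiled.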





Root Signatures

Do’s

Place constants and CBVs (SRVs and UAVs only if you have …) directly into the root signature if possible on NVIDIA hardware

Start with the entries for the pixel stage

Constants that sit directly in the root signature can speed up pixel shaders significantly on NVIDIA hardware – specifically consider shader constants that toggle parts of uber-shaders

CBVs that sit in the root signature can also speed up pixel shaders significantly on NVIDIA hardware

Carry on with decreasing execution frequency of the shader stages

Using root signature CBVs does not require a descriptor heap for storing CBV descriptors, versioning entries or extra indirection (=> no need to call CreateConstantBufferView() )

Remember root views don’t do bounds checking and have other limitations

Cache the current values of root constants, CBVs, SRVs and UAVs in CPU memory and only change the contents of the root signature when a true change is detected

We have seen significant speedups through managing changes properly

Limit the shader visibility of CBVs, SRVs and UAVs to only the necessary stages

There is overhead in the driver and on the GPU for each stage that needs to see those views



Use the DENY_*_ACCESS flags to explicitly limit resource-shader visibility

Minimize the number of Root Signature changes

The problem is not the change of the RS but there is usually a follow up cost of initializing the root signature entries after such a change

Gracefully handle CBV, UAV, SRV and Sampler descriptors on Tier 1 and CBV and UAV descriptors on Tier 2 hardware

For these Tiers, the application must fill in all descriptors defined in the root signature (and descriptor tables used) by the time the command list executes. This is even the case if the used shaders may not reference all these descriptors.



For Tier 3 do keep your unused descriptors bound – don’t waste time unbinding them as this can easily introduce state thrashing bottlenecks



Don’ts

Don’t group CBVs into CBV descriptor tables that have a different update frequency

Ideally all CBVs in a table would need updating at the same time

Don’t bloat your root signature and descriptor tables to be able to reuse them

Try to aim at using a minimum set of entries for each set of materials

Don't simultaneously set visible and deny flags for the same shader stages on root table entries

For current drivers the deny flags only work when D3D12_SHADER_VISIBILITY_ALL is set

Don’t place constants SRVs and UAVs directly into the root signature unless you have a lot of draw/dispatch calls that can make use of them

Don’t leave resource bindings undefined after a change of Root Signature

A change in root signatures removes/clears all resource bindings used in the previous root signature
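The recommendation to cache root values on the CPU and only touch the root signature on a true change can be sketched as a small shadow-state class. `RootCbvCache`, its slot count and its method names are hypothetical; the sketch only decides whether a call such as SetGraphicsRootConstantBufferView would be needed and does not issue it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// CPU-side shadow copy of root CBV bindings. The slot count is an
// assumption made for this sketch.
class RootCbvCache {
public:
    static constexpr std::size_t kSlots = 8;

    // Returns true when the caller should actually issue the API call
    // for this slot, false when the binding is already current.
    bool needsUpdate(std::size_t slot, std::uint64_t gpuAddress) {
        assert(slot < kSlots);
        if (valid_[slot] && current_[slot] == gpuAddress) {
            return false;
        }
        current_[slot] = gpuAddress;
        valid_[slot] = true;
        return true;
    }

    // A root signature change clears all bindings, so every slot must be
    // treated as undefined and set again.
    void onRootSignatureChanged() { valid_.fill(false); }

private:
    std::array<std::uint64_t, kSlots> current_{};
    std::array<bool, kSlots> valid_{};
};
```

Invalidating the whole cache on a root signature change reflects the rule that bindings must not be left undefined after such a change.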

Allocators and Lists

Do's

Reuse allocators for similarly sized sequences of draw calls

Allocations are fast when the list has been pre-warmed



Use 2*T + N allocators minimum

2* - one set of lists/allocators from last frame is still being consumed by the GPU and the second set is being built/used in the current frame



T = #threads creating command lists – please note that allocators are not free threaded!



N = extra pool for bundles

Call Allocator::Reset before reusing it in another frame

Otherwise the allocator will keep on growing until you’ll run out of memory

Don’ts

Don’t forget that Allocator and Lists consume GPU memory

Allocators that are too large may limit your GPU working set in other undesirable ways

Don’t create/destroy allocators but reuse allocators

Save the overhead for allocator creation/destruction



Don’t reuse allocators for differently sized sequences of draw calls

This leads to allocators of worst-case size



Don't forget to reset the corresponding allocator when resetting a set of command lists

Not resetting an allocator means leaking memory!

Don’t free/reuse an Allocator still in use by active command lists

This is illegal and may free or overwrite memory that the command list is still using
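The 2*T + N rule from this section is simple arithmetic, but writing it down makes the indexing explicit. The helpers below are a sketch with hypothetical names, and they assume allocators are laid out as two per-frame sets of T entries followed by the bundle pool.

```cpp
#include <cassert>
#include <cstddef>

// Minimum allocator count: two frame sets of `threads` per-thread
// allocators plus `bundlePool` allocators reserved for bundles.
constexpr std::size_t minAllocatorCount(std::size_t threads, std::size_t bundlePool) {
    return 2 * threads + bundlePool;
}

// Index of the allocator a recording thread uses in a given frame,
// assuming the layout [frame set 0][frame set 1][bundle pool].
inline std::size_t allocatorIndex(std::size_t frameNumber,
                                  std::size_t thread,
                                  std::size_t threads) {
    assert(thread < threads);
    return (frameNumber % 2) * threads + thread;
}
```

With four recording threads and two bundle allocators this gives ten allocators. Frame 0 uses indices 0 to 3, frame 1 uses 4 to 7, and frame 2 returns to 0 to 3, which is only safe once the GPU has finished with frame 0 and those allocators have been reset.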



Resources

Do's

Avoid vidmem overcommitment

Use IDXGIAdapter3::QueryVideoMemoryInfo() to gain accurate information about the available video memory



Foreground app isn’t necessarily allocated all, or even a high %, of vidmem



Respond to budget changes from OS



Consider using IDXGIAdapter3::RegisterVideoMemoryBudgetChangeNotificationEvent





Consider capping graphics settings based on memory available



Create overflow heaps in sysmem and move resources over from vidmem heaps



DX12 gives the app a memory management advantage over the DX11 driver here



Break up command lists so that the amount of memory referenced in each one fits in vidmem.





Keep track of what's used per CL





Consider using MakeResident/Evict before/after executing command lists when you are going over the vidmem budget

Use committed resources where possible to give the driver more knowledge



This allows the driver to better manage GPU memory





A good use case for placed resources is resource heaps that are e.g. used during streaming and hold different sets of read-only textures over their lifetime

Batch up MakeResident calls (expect a CPU and GPU cost for page table updates)

This lowers the overhead inside the driver and the GPU

Work to a given memory budget using MakeResident/Evict

Do drop mip levels of tiled resources as needed



Need to handle the case when MakeResident fails

Be aware of the fact that certain resource types have different alignment rules within a heap

Make sure to devise ways to deal with varying resource binding Tiers within a device feature level

UAV count across all stages may be limited to 8 or 64



CBV count may be limited to 14 per stage



Sampler count may be limited to 16 per stage

Be aware of the aliasing rules for heaps

See tiled resource specification for a good roll-up

Be aware of the fact that there are different heap types for resources, SRVs, DSVs etc.

On some heap tiers there may be more restrictions than on others



Check resource heap tier capabilities

Do fill D3D12_TEXTURE_COPY_LOCATION with care when using CopyTextureRegion() when copying depth stencil textures

Copying only the depth part of the resource may hit a slow path

Don’ts

Don’t go overboard with your re-use count for placed resources for depth stencil and render target resources

On top of the need to clear those resources before they can be rendered to, there may be other hardware dependent book-keeping operations that make those switches expensive

Don’t rely on the availability of tiled resources (check cap bits)

Still need to think about different DX12 hardware classes

Don’t rely on being able to allocate all GPU memory in one go

Depending on the underlying GPU architecture the memory may or may not be segmented

Don’t expect an immediate cost for an Evict call

Cost might be deferred until another MakeResident call utilizes the memory



Use GPUView analysis to find out about deferred paging requests

Don’t destroy and create resources if it can be avoided

Better to use Evict and MakeResident where possible



Saves the overhead of creation and destruction of resources
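Working to a budget, as recommended in this section, comes down to comparing current usage plus any incoming allocation against the budget reported by the OS. The sketch below mirrors only the `Budget` and `CurrentUsage` fields of DXGI_QUERY_VIDEO_MEMORY_INFO in a hypothetical `VideoMemoryInfo` struct so that it stays portable; `bytesToEvict` is likewise an illustrative name.

```cpp
#include <cassert>
#include <cstdint>

// Portable mirror of the two DXGI_QUERY_VIDEO_MEMORY_INFO fields used
// here (Budget and CurrentUsage). The struct itself is hypothetical.
struct VideoMemoryInfo {
    std::uint64_t budget;
    std::uint64_t currentUsage;
};

// Bytes that must be evicted (or kept out of video memory) so that
// current usage plus the incoming allocation stays within the budget.
// Returns 0 when everything fits.
inline std::uint64_t bytesToEvict(const VideoMemoryInfo& info,
                                  std::uint64_t incomingBytes) {
    const std::uint64_t wanted = info.currentUsage + incomingBytes;
    return wanted > info.budget ? wanted - info.budget : 0;
}
```

An engine could run this check before each batch of MakeResident calls and again whenever the budget change notification fires.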

Barriers, Fences & Hazards

Do's

Minimize the use of barriers and fences

We have seen redundant barriers and associated wait for idle operations as a major performance problem for DX11 to DX12 ports



The DX11 driver is doing a great job of reducing barriers – now under DX12 you need to do it



Any barrier or fence can limit parallelism

Make sure to always use the minimum set of resource usage flags

Stay away from using D3D12_RESOURCE_STATE_GENERIC_READ unless you really need every single flag that is set in this combination of flags



Redundant flags may trigger redundant flushes and stalls and slow down your game unnecessarily



To reiterate: We have seen redundant and/or overly conservative barrier flags and their associated wait for idle operations as a major performance problem for DX11 to DX12 ports.

Specify the minimum set of targets in ID3D12GraphicsCommandList::ResourceBarrier

Adding false dependencies adds redundancy

Group barriers in one call to ID3D12GraphicsCommandList::ResourceBarrier

This way the worst case can be picked instead of sequentially going through all barriers

Use split barriers when possible

Use the _BEGIN_ONLY/_END_ONLY flags



This helps the driver do a more efficient job

Do use fences to signal events/advance across calls to ExecuteCommandLists

Don’ts

Don’t insert redundant barriers

This limits parallelism



A transition from D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE to D3D12_RESOURCE_STATE_RENDER_TARGET and back without any draw calls in-between is redundant



Avoid read-to-read barriers



Get the resource in the right state for all subsequent reads

Don’t use D3D12_RESOURCE_STATE_GENERIC_READ without good reason.

For transitions from write-to-read states, ensure the transition target is inclusive of all required read states needed before the next transition to write. This is done from the API by combining read state flags, and is preferred over transitioning from read-to-read in subsequent ResourceBarrier calls.

Don’t sequentially call ID3D12GraphicsCommandList::ResourceBarrier with just one barrier

This doesn’t allow the driver to pick the worst case of a set of barriers

Don’t expect fences to trigger signals/advance at a finer granularity than once per ExecuteCommandLists call.
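Several rules in this section (group barriers into one call, drop redundant round trips, combine read states in a single transition) can be modelled with a small batching helper. Everything here is an assumption for illustration: `BarrierBatch` and `Transition` are hypothetical, and the `State` bits are made-up values, not the real D3D12_RESOURCE_STATES constants.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Made-up state bits for illustration only; these are NOT the real
// D3D12_RESOURCE_STATES values.
enum State : std::uint32_t {
    kRenderTarget = 1u << 0,
    kPixelShaderRead = 1u << 1,
    kNonPixelShaderRead = 1u << 2,
};

struct Transition {
    int resource;
    std::uint32_t before;
    std::uint32_t after;
};

// Collects transitions between draws. Folding is only valid because the
// batch is assumed to be flushed before any draw touches the resources.
class BarrierBatch {
public:
    void transition(int resource, std::uint32_t before, std::uint32_t after) {
        for (std::size_t i = 0; i < pending_.size(); ++i) {
            if (pending_[i].resource == resource) {
                // Fold A->B followed by B->C into A->C.
                pending_[i].after = after;
                if (pending_[i].before == pending_[i].after) {
                    // A->B->A with no draw in between is redundant.
                    pending_.erase(pending_.begin() + static_cast<std::ptrdiff_t>(i));
                }
                return;
            }
        }
        if (before != after) {
            pending_.push_back(Transition{resource, before, after});
        }
    }

    // Returns everything that should go into one ResourceBarrier call
    // and leaves the batch empty.
    std::vector<Transition> flush() {
        std::vector<Transition> out;
        out.swap(pending_);
        return out;
    }

private:
    std::vector<Transition> pending_;
};
```

A write-to-read transition can pass `kPixelShaderRead | kNonPixelShaderRead` as the target so that no later read-to-read barrier is needed.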

Multi GPU

Do's

Use the DX12 standard checks to find out how many GPUs are in your system

No need to use vendor specific APIs anymore



Make sure to check the CROSS_NODE_SHARING tier



Take full control over which surface syncs need to happen and which don’t

Make full use of the explicit control over resources



Create resources that need to be synchronized on each node



Use the proper CreationNodeMask





Make them visible on other nodes that need access



Copy them to the current node when needed



Minimize the number of necessary syncs

If the device supports tier 2 cross node sharing

Always compare performance to a tier 1 type implementation

Use designated copy queues to do cross node copy operations

Keep the main queue open to do rendering work in parallel



Don’ts

Don’t try to benefit from implicit MGPU scaling

Don’t rely on any surface syncs to be done automatically (implicitly behind your back)

You should take full control over what syncs happen if you need them

Swap Chains

Do's

Do use flip mode swap-chains

Do use SetFullScreenState(TRUE) along with a (borderless) fullscreen window and a non-windowed flip model swap-chain to switch to true immediate independent flip mode

This is at the moment, according to Microsoft, the only mode in which you can get uncapped frame rates with tearing out of D3D12 when calling Present(0,0)



Any other mode doesn’t allow unlimited frame rates with tearing

Do use the DXGI_SWAP_CHAIN_FLAG_ALLOW_MODE_SWITCH flag consciously

The flag is not necessary to achieve unlimited frame rates (see above) if your window size matches the current screen resolution



If this flag is set, trying to change resolution using ResizeTarget() before calling SetFullScreenState(TRUE) works fine and you’ll achieve uncapped FPS

If this flag is not set, trying to change resolution using ResizeTarget() before calling SetFullScreenState(TRUE) results in no change of display resolution. Your target will get stretched to the current resolution and FPS won’t be uncapped.

If not in fullscreen state (true immediate independent flip mode) do control your latency and buffer count in your swap-chain carefully for the desired FPS and latency

Use IDXGISwapChain2::SetMaximumFrameLatency(MaxLatency) to set the desired latency





For this to work you need to create your swap-chain with the DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT flag set.





A sync interval of 0 indicates that "the buffer I am presenting now is the newest buffer available next time composition happens" and discards all previous presents. However, the present does not go through until composition happens, which currently is only at VSync.

DXGI will start to block in Present() after you have presented MaxLatency-1 times





At the default latency of 3 this means that your FPS can’t go higher than 2 * RefreshRate. So for a 60Hz monitor the FPS can’t go above 120 FPS.





Try using about 1-2 more swap-chain buffers than you are intending to queue frames (in terms of command allocators and dynamic data and the associated frame fences) and set the "max frame latency" to this number of swap-chain buffers.

If not in fullscreen state (true immediate independent flip mode) consider using a waitable object swap-chain along with WaitForSingleObjectEx() to generate higher FPS

Please note that this will lead to some frames never being even partially visible, but may be a good solution for benchmarking



Using the waitable object swapchain and GetFrameLatencyWaitableObject(), one can test if a buffer is available before rendering to it or presenting it – the following options are available:

Use an additional off-screen surface

Render to the off-screen surface. Test the waitable object with timeout 0 to check if a buffer is available. If so copy to the swap-chain back buffer and Present(). If no buffer is available start the frame over again. At the beginning of the frame, test the waitable object. If it succeeds, render to the available swapchain buffer. If it fails, render to the offscreen surface.

Use a 3 or 4 buffer swapchain

Render directly to a back buffer. Before calling Present(), test the waitable object. If it succeeds, call Present(), if not, start over.

Don’ts

Don’t forget that there's a per swap-chain limit of 3 queued frames before DXGI will start to block in Present().

Set the DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT flag on swapchain creation and use IDXGISwapChain2::SetMaximumFrameLatency to modify this default value

Don’t forget to call ResizeBuffers() after you have switched to true immediate independent flip mode using SetFullScreenState(TRUE).
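The frame-rate ceiling described in this section for swap chains that are not in fullscreen state follows directly from the blocking rule: Present() blocks after MaxLatency-1 presents per composition, and composition happens once per VSync. The helper below just encodes that arithmetic; `maxComposedFps` is a hypothetical name.

```cpp
#include <cassert>

// Upper bound on FPS when presents are only consumed at VSync and DXGI
// blocks after (maxLatency - 1) queued presents, as described above.
constexpr unsigned maxComposedFps(unsigned maxLatency, unsigned refreshRateHz) {
    return maxLatency == 0 ? 0 : (maxLatency - 1) * refreshRateHz;
}
```

With the default latency of 3 on a 60Hz display this gives 120, matching the figure quoted above. Raising the latency through SetMaximumFrameLatency raises the ceiling accordingly.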

SetStablePowerState

Don’t ever call SetStablePowerState(TRUE) from game engine code.

Do consider carefully whether or not you need highly stable results at the expense of lower performance. See the discussion in our blog.

If and only if you want its stable results, do call SetStablePowerState from a separate, standalone application.

To avoid confusion, do make it crystal clear when the function is in effect or not. (One way to make it obvious is to record clocks along with performance results. We often do that. Our blog has a code snippet showing how to query GPU clocks on NVIDIA.)

Do use the DX12 API and our standalone program to stabilize the clocks when testing other APIs.

We have a separate blog post with more discussion: SetStablePowerState.exe: Disabling GPU Boost on Windows 10 for more deterministic timestamp queries on NVIDIA GPUs

DirectX12 Hardware Features and other Maxwell Features

Do's

Use hardware conservative raster for full-speed conservative rasterization

No need to use a GS to implement a ‘slow’ software-based conservative rasterization



See /content/dont-be-conservative-conservative-rasterization



Make use of NvAPI (when available) to access other Maxwell features

Advanced Rasterization features



Bounding box rasterization mode for quad based geometry





New MSAA features like post depth coverage mask and overriding the coverage mask for routing of data to sub-samples





Programmable MSAA sample locations



Fast Geometry Shader features



Render to cube maps in one geometry pass without geometry amplifications





Render to multiple viewports without geometry amplifications





Use the fast pass-through geometry shader for techniques that need per-triangle data in the pixel shader



New interlocked operations



Enhanced blending ops



New texture filtering ops

Don’ts

Don’t use Raster Order View (ROV) techniques pervasively

Guaranteeing order doesn’t come for free



Always compare with alternative approaches like advanced blending ops and atomics

NVIDIA DirectX12 Hardware Features table