It seems to be coded to strictly favor Pascal's hack-job async implementation, namely compute preemption, as per NVIDIA's DX12 "do's" and "don'ts".<br><a class="spoiler-link H-spoiler-toggle" href="#"><strong>Warning: Spoiler!</strong> <span class="spoiler-help">(Click to show)</span></a><div class="spoiler-hidden">Do's<br><br>

Minimize the use of barriers and fences<br>

We have seen redundant barriers and associated wait for idle operations as a major performance problem for DX11 to DX12 ports<br>

The DX11 driver is doing a great job of reducing barriers – now under DX12 you need to do it<br>

Any barrier or fence can limit parallelism<br>

Make sure to always use the minimum set of resource usage flags<br>

Stay away from using D3D12_RESOURCE_USAGE_GENERIC_READ unless you really need every single flag that is set in this combination of flags<br>

Redundant flags may trigger redundant flushes and stalls and slow down your game unnecessarily<br>

To reiterate: We have seen redundant and/or overly conservative barrier flags and their associated wait for idle operations as a major performance problem for DX11 to DX12 ports.<br>

Specify the minimum set of targets in ID3D12CommandList::ResourceBarrier<br>

Adding false dependencies adds redundancy<br>

Group barriers in one call to ID3D12CommandList::ResourceBarrier<br>

This way the worst case can be picked instead of sequentially going through all barriers<br>

Use split barriers when possible<br>

Use the _BEGIN_ONLY/_END_ONLY flags<br>

This helps the driver do a more efficient job<br>

Do use fences to signal events/advance across calls to ExecuteCommandLists<br>

Dont's<br><br>

Don’t insert redundant barriers<br>

This limits parallelism<br>

A transition from D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE to D3D12_RESOURCE_STATE_RENDER_TARGET and back without any draw calls in-between is redundant<br>

Avoid read-to-read barriers<br>

Get the resource in the right state for all subsequent reads<br>

Don’t use D3D12_RESOURCE_USAGE_GENERIC_READ unless you really need every single flag<br>

Don’t sequentially call ID3D12CommandList::ResourceBarrier with just one barrier<br>

This doesn’t allow the driver to pick the worst case of a set of barriers<br>

Don’t expect fences to trigger signals/advance at a finer granularity than once per ExecuteCommandLists call.</div>

<br>

This has come to my attention thanks to a few members of this board, namely <a data-huddler-embed="href" href="/u/450395/Doothe">@Doothe</a>, <a data-huddler-embed="href" href="/u/472248/Mahigan">@Mahigan</a>, <a data-huddler-embed="href" href="/u/364623/Slomo4shO">@Slomo4shO</a>, <a data-huddler-embed="href" href="/u/412352/JackCY">@JackCY</a>, <a data-huddler-embed="href" href="/u/231506/PontiacGTX">@PontiacGTX</a> among others.<br><br>

These are some of the more interesting posts:<br><div class="quote-container" data-huddler-embed="/t/1605674/computerbase-de-doom-vulkan-benchmarked/400_100#post_25357799" data-huddler-embed-placeholder="false"><span>Quote:</span>

<div class="quote-block">Originally Posted by <strong>Doothe</strong> <a href="/t/1605674/computerbase-de-doom-vulkan-benchmarked/400_100#post_25357799"><img alt="View Post" class="inlineimg" src="/img/forum/go_quote.gif"></a><br><br>

I took three screenshots, one of each game, in GPUView. From left to right: DOOM, AOTS, and Time Spy. Each timeline is roughly the same length of time. I'm still learning how to read and interpret this information, but I figured I'd share some of the images with you guys and maybe get a better understanding of what's going on.<br><br><img alt="s51q4IX.jpg" class="bbcode_img" src="http://i.imgur.com/s51q4IX.jpg"><br><br>

The image is 4800x2560. I recommend opening it up in a separate tab.</div>

</div>

<br><div class="quote-container" data-huddler-embed="/t/1605674/computerbase-de-doom-vulkan-benchmarked/400_100#post_25357916" data-huddler-embed-placeholder="false"><span>Quote:</span>

<div class="quote-block">Originally Posted by <strong>Doothe</strong> <a href="/t/1605674/computerbase-de-doom-vulkan-benchmarked/400_100#post_25357916"><img alt="View Post" class="inlineimg" src="/img/forum/go_quote.gif"></a><br><br>

Time Spy has a Pre-Emption Packet (black rectangle) in the 3D Queue that shows up every time a compute queue is processed<br><br><br>

From Nvidia’s whitepaper:<br>

"Compute Preemption is another important new hardware and software feature added to GP100 that allows compute tasks to be preempted at instruction-level granularity, rather than thread block granularity as in prior Maxwell and Kepler GPU architectures. Compute Preemption prevents long-running applications from either monopolizing the system (preventing other applications from running) or timing out."<br><br><br>

Btw, DOOM is Vulkan. Idk if Vulkan is properly picked up by GPUView, so disregard it if you want.</div>

</div>

<br><div class="quote-container" data-huddler-embed="/t/1605674/computerbase-de-doom-vulkan-benchmarked/400_100#post_25357917" data-huddler-embed-placeholder="false"><span>Quote:</span>

<div class="quote-block">Originally Posted by <strong>Slomo4shO</strong> <a href="/t/1605674/computerbase-de-doom-vulkan-benchmarked/400_100#post_25357917"><img alt="View Post" class="inlineimg" src="/img/forum/go_quote.gif"></a><br><br>

Compute queues as a % of total run time:<br><br>

Doom: 43.70%<br>

AOTS: 90.45%<br>

Time Spy: 21.38%</div>

</div>

<br><div class="quote-container" data-huddler-embed="/t/1605674/computerbase-de-doom-vulkan-benchmarked/400_100#post_25357977" data-huddler-embed-placeholder="false"><span>Quote:</span>

<div class="quote-block">Originally Posted by <strong>JackCY</strong> <a href="/t/1605674/computerbase-de-doom-vulkan-benchmarked/400_100#post_25357977"><img alt="View Post" class="inlineimg" src="/img/forum/go_quote.gif"></a><br><br>

That's what I keep saying <img alt="biggrin.gif" class="bbcode_smiley" src="http://files.overclock.net/images/smilies/biggrin.gif"> They simply reused their older DX11-like approach with DX12, and the features they use are quite limited so that they can support old hardware; new HW features that older hardware doesn't have go unused. I bet they also want 1 engine with 1 path to run on all GPUs to make their benchmark "valid" to them, but it makes it invalid to me since it doesn't use each HW to its maximum potential, be it NV or AMD or some other GPU.<br><br>

Figuratively: say there are two architectures, one with 1 thread to do the work and the other with 16 threads. Now they make an engine that only uses 1 thread and try to compute parallel work using 1 thread, so they switch context like mad to get it done. Of course this engine works on both 1- and 16-threaded HW and in theory runs at the same speed, but that 16-threaded HW is underutilized, as it could do 16 times more work at the same time if used in parallel with 16 submission threads. Context switching is expensive and so on.<br><br><a class="bbcode_url" href="http://ext3h.makegames.de/DX12_Compute.html" target="_blank">This article has a bit of explanation of the differences between architectures and their features.</a><br><br><a class="H-lightbox-open" href="http://www.overclock.net/content/type/61/id/2831854/"><img alt="" class="lightbox-enabled" data-id="2831854" data-type="61" src="http://www.overclock.net/content/type/61/id/2831854/width/5000/height/1000/flags/LL" style="; width: 1144px; height: 558px"></a></div>

</div>

<br><div class="quote-container" data-huddler-embed="/t/1605674/computerbase-de-doom-vulkan-benchmarked/400_100#post_25358105" data-huddler-embed-placeholder="false"><span>Quote:</span>

<div class="quote-block">Originally Posted by <strong>Slomo4shO</strong> <a href="/t/1605674/computerbase-de-doom-vulkan-benchmarked/400_100#post_25358105"><img alt="View Post" class="inlineimg" src="/img/forum/go_quote.gif"></a><br><br>

So even lower at 18.76%...<br><br>

The bench definitely isn't compute heavy.</div>

</div>

<br><div class="quote-container" data-huddler-embed="/t/1605674/computerbase-de-doom-vulkan-benchmarked/400_100#post_25358182" data-huddler-embed-placeholder="false"><span>Quote:</span>

<div class="quote-block">Originally Posted by <strong>PontiacGTX</strong> <a href="/t/1605674/computerbase-de-doom-vulkan-benchmarked/400_100#post_25358182"><img alt="View Post" class="inlineimg" src="/img/forum/go_quote.gif"></a><br><br>

Time Spy's compute queues are fewer than AotS's; most of them are graphics, and it seems they do double fences, which could be throttling AMD's compute+graphics perf and/or parallelism<br><a class="H-lightbox-open" href="http://www.overclock.net/content/type/61/id/2831905/"><img alt="" class="lightbox-enabled" data-id="2831905" data-type="61" src="http://www.overclock.net/content/type/61/id/2831905/width/500/height/1000/flags/LL" style="; width: 500px; height: 322px"></a><br><br>

Then 3DMark could run a single path that fits most hardware, with pre-emption</div>

</div>

<br><div class="quote-container" data-huddler-embed="/t/1605674/computerbase-de-doom-vulkan-benchmarked/400_100#post_25358269" data-huddler-embed-placeholder="false"><span>Quote:</span>

<div class="quote-block">Originally Posted by <strong>PontiacGTX</strong> <a href="/t/1605674/computerbase-de-doom-vulkan-benchmarked/400_100#post_25358269"><img alt="View Post" class="inlineimg" src="/img/forum/go_quote.gif"></a><br><br>

It can be used for GCN, but it won't take advantage of parallelism and performance gains. Maxwell can do some degree of pre-emption and doesn't get negative performance (given how fences are limiting the context switching), and it can work on Pascal given its improved pre-emption<br><a class="H-lightbox-open" href="http://www.overclock.net/content/type/61/id/2831918/"><img alt="" class="lightbox-enabled" data-id="2831918" data-type="61" src="http://www.overclock.net/content/type/61/id/2831918/width/350/height/700/flags/LL" style="; width: 350px; height: 197px"></a><br>

When people compare them, Maxwell seems to have some degree of async compute (the benchmark is aimed at it, but it does pre-emption). <a class="bbcode_url" href="http://cdn.wccftech.com/wp-content/uploads/2015/04/2.png" target="_blank">GCN can do pre-emption</a> but it doesn't deliver the same gains as async compute, and Pascal shows its improved pre-emption gains<br><br><a class="H-lightbox-open" href="http://www.overclock.net/content/type/61/id/2831919/"><img alt="" class="lightbox-enabled" data-id="2831919" data-type="61" src="http://www.overclock.net/content/type/61/id/2831919/width/350/height/700/flags/LL" style="; width: 350px; height: 299px"></a><br><br>

Devs say they use a single path, but this only favors one side</div>

</div>

<br><br>

I love big boards like this, because you can call everyone's attention to a problem when it's noticed. And a lot of people here are quite capable of noticing such problems <img alt="thumb.gif" class="bbcode_smiley" src="http://files.overclock.net/images/smilies/thumb.gif"> Glad to be a part of such a community.<br><br>

Anyway, I can say that the logical conclusion from this is that Futuremark's benchmark is BOTCHED and biased: not indicative of DX12 capabilities as it should be, but instead restricting them. Thus it arguably has no credibility as a BENCHMARK suite.<br><br>

Benchmark - Standard, or a set of standards, used as a point of reference for evaluating performance or level of quality. Benchmarks may be drawn from a firm's own experience, from the experience of other firms in the industry, or from legal requirements such as environmental regulations.<br>

+example A new benchmark was set for the football team when the weakest member benched 200 pounds, thereby setting the expectation that all other teammates bench at least that amount.<br><br>

In this case we have the weakest member benching 200 pounds, but he happens to be sponsoring the gym... and the gym has 2 members. So the bench press goes to 200 pounds.<br><br>

I am incredibly disappointed and that's why I am giving voice to this in NEWS.