VR And Multi-GPU

April 3, 2014 · Coding, GPU, Graphics

The second coming of VR will be a great thing for the GPU hardware industry. With the requirements for high resolution, good AA, high framerates and low latency, successful VR applications will need substantial GPU horsepower behind them.

The PS4 GPU has 1.84 Tflop/s available[1], so—as a very rough back-of-the-envelope estimate—a game running at 1080p / 30 Hz has about 30 Kflop per pixel per frame available. If we take this as the benchmark of the compute per pixel needed for “next-gen”-quality visuals, we can apply it to the Oculus DK2. There we want 75 Hz, and we probably want a resolution of at least 2460×1230 (extrapolated from the recommendation of 1820×910 for the DK1[2]—this is the rendering resolution, before resampling to the warped output). That comes to 6.8 Tflop/s. A future VR headset with a 4K display—assuming the rendering resolution scales similarly, to around 4920×2460—would need 27 Tflop/s; and if you go up to 90 Hz, make it 32 Tflop/s!
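The arithmetic is easy to reproduce. Here’s a quick Python sketch of the same back-of-the-envelope calculation; the baseline budget works out to about 29.6 Kflop per pixel per frame (which rounds to 30), and the 4920×2460 render target for a 4K headset is an assumed 2× scaling of the DK2’s.

```python
# Reproducing the back-of-the-envelope GPU budget. The 4920x2460 render
# resolution for a 4K headset is an assumption (2x the DK2's render target).

def flops_per_pixel(tflops, width, height, hz):
    """Per-pixel, per-frame budget implied by a total throughput."""
    return tflops * 1e12 / (width * height * hz)

def tflops_needed(flop_per_px, width, height, hz):
    """Total throughput needed to sustain a per-pixel budget."""
    return flop_per_px * width * height * hz / 1e12

# PS4 baseline: 1.84 Tflop/s at 1080p / 30 Hz
budget = flops_per_pixel(1.84, 1920, 1080, 30)
print(f"{budget / 1e3:.0f} Kflop per pixel per frame")              # 30

# Oculus DK2: 2460x1230 render target at 75 Hz
print(f"DK2: {tflops_needed(budget, 2460, 1230, 75):.1f} Tflop/s")  # 6.7

# Hypothetical 4K headset at 75 and 90 Hz
print(f"4K @ 75 Hz: {tflops_needed(budget, 4920, 2460, 75):.0f} Tflop/s")  # 27
print(f"4K @ 90 Hz: {tflops_needed(budget, 4920, 2460, 90):.0f} Tflop/s")  # 32
```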

The NVIDIA Titan Black offers 5.1 Tflop/s[3] and the AMD R9 290X offers 5.6 Tflop/s[4]. These are the biggest single GPUs on the market at present, but they appear to fall short of the compute power needed for “next-gen”-quality graphics on the DK2. This is a very rough analysis, but there’s clearly reason to be interested in multi-GPU rendering for VR.

The trouble is, for VR we also want very low input latency, so that head tracking is as effective as possible at producing a sense of presence. Multi-GPU solutions are great for improving framerate, but not so great for latency. The typical way multiple GPUs are used in a game is alternate-frame rendering (AFR), in which the GPUs simply trade off frames. The time taken to render an individual frame doesn’t decrease at all, but by correctly staggering the GPUs’ start times, a higher framerate can be maintained.

In other words, if the stars align, the framerate scales linearly with the number of GPUs—but the latency gets no lower than the time taken to render an individual frame. (It does decrease from the single-GPU case, due to reduced time that frames are queued on the CPU.)
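A toy model makes the framerate/latency split concrete. In this idealized AFR sketch (no queuing, perfect staggering), the frame interval shrinks with GPU count while per-frame render latency stays fixed:

```python
# Toy model of alternate-frame rendering (AFR): each GPU takes `frame_ms`
# to render a frame, and GPU start times are staggered evenly. Framerate
# scales with GPU count; render latency does not.

def afr_timing(frame_ms, num_gpus, num_frames=8):
    """Return (completion_time, latency) for each frame under ideal AFR."""
    frames = []
    for i in range(num_frames):
        start = i * frame_ms / num_gpus  # staggered kick-off
        finish = start + frame_ms        # each GPU still needs the full time
        frames.append((finish, finish - start))
    return frames

frames = afr_timing(frame_ms=33.3, num_gpus=2)
intervals = [b[0] - a[0] for a, b in zip(frames, frames[1:])]
print(f"frame interval: {intervals[0]:.2f} ms")  # ~16.65 ms -> ~60 Hz
print(f"render latency: {frames[0][1]:.1f} ms")  # still 33.3 ms
```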

An obvious way to use multiple GPUs for VR is to assign one eye to each GPU. If more than two GPUs are present, AFR can be used as well. Since the two eye views will generally be similar, this should give a good even work distribution, but it doesn’t reduce the latency relative to the mono-display case.

We could try to further subdivide the per-frame work, dividing the scene into chunks either in image space or in geometry (or both, for different parts of the frame)—in other words, trying to emulate the internal hardware scheduling over shader cores that a GPU already does, but across GPUs and in software. It’s a strategy with limited upside due to the difficulty of load-balancing and the overhead of moving data around between GPUs.

It’s interesting to think about other ways to use multiple GPUs to decrease latency, with the goal of getting up-to-the-millisecond head tracking—even if other aspects of the scene might only update with the more usual multi-frame latencies. I’m sure the folks at Oculus, Valve, etc. have already been thinking a lot about this stuff, but I was musing about it lately and had a few ideas. I’d be happy to hear your ideas, too!

We’ve heard a lot about temporal reprojection lately, mostly for antialiasing and complex filtering operations like screen-space reflections.[5][6][7] A reprojection-like process could also be useful for VR, by rendering a high-res wide-view image using a traditional high-latency renderer on one (or more) GPU at 30–60 Hz, then reprojecting it based on newer head-tracking data on the GPU that’s driving the display. This reprojection step could be very fast—perhaps 1–3 ms—so, with appropriate API/OS support it could be done just-in-time before scanout with quite low latency, or even during scanout, “racing the beam” as Michael Abrash has proposed.[8]
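To make the reprojection step concrete, here’s a minimal one-dimensional, rotation-only sketch: each display pixel maps to a view direction, the direction is rotated by the head-pose delta, and the result is sampled from the wider-FOV source image. A real implementation would be the full 3D, depth-aware version of this running on the GPU; the function and its parameters here are purely illustrative.

```python
# Rotation-only reprojection of one image row (nearest-neighbor sampling).
# Illustrative sketch; a real version would be 3D and depth-aware.
import math

def reproject_row(source, src_fov_deg, dst_fov_deg, yaw_delta_deg):
    """Resample one image row after a yaw rotation since the source render."""
    n = len(source)
    half_src = math.tan(math.radians(src_fov_deg) / 2)
    half_dst = math.tan(math.radians(dst_fov_deg) / 2)
    out = []
    for i in range(n):
        # view-direction angle of display pixel i
        phi = math.atan(half_dst * (2 * i / (n - 1) - 1))
        # where that direction lands in the earlier, wider-FOV source image
        x = math.tan(phi + math.radians(yaw_delta_deg)) / half_src  # in [-1, 1]
        s = (x + 1) / 2 * (n - 1)
        out.append(source[min(max(int(round(s)), 0), n - 1)])
    return out

row = list(range(9))
print(reproject_row(row, 90, 90, 0))    # identity: [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(reproject_row(row, 110, 90, 10))  # samples shifted by the head turn
```

With a matching FOV and no rotation the mapping is the identity; a wider source FOV provides the slack that keeps a fast head turn from sampling off the edge.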

Here, I’m imagining that a high-priority CPU thread samples the head-tracking sensor at 75 Hz and kicks rendering on the just-in-time GPU, which quickly reprojects based on the most recent source image and scans out. Meanwhile, the other GPUs are spitting out new source images at a much lower rate—here four times slower, though no fixed ratio is necessary.
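The latency structure of that pipeline can be sketched as a timeline: head pose costs only the reprojection pass, while scene content is as old as the newest source image. All timings here are assumed for illustration, not measured.

```python
# Illustrative timeline for the just-in-time reprojection pipeline.
# Head-pose latency = reprojection cost; scene latency = source-image age.

DISPLAY_HZ = 75
FRAME_MS = 1000 / DISPLAY_HZ   # ~13.3 ms between scanouts
SOURCE_MS = 4 * FRAME_MS       # source images arrive 4x slower (~18.75 Hz)
REPROJECT_MS = 2.0             # assumed cost of the reprojection pass

for frame in range(1, 5):
    scanout = frame * FRAME_MS
    sample_time = scanout - REPROJECT_MS   # head pose sampled just-in-time
    source_age = sample_time % SOURCE_MS   # newest source finished at k*SOURCE_MS
    print(f"frame {frame}: head-pose latency {REPROJECT_MS:.1f} ms, "
          f"scene latency {source_age + REPROJECT_MS:.1f} ms")
```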

The source image should have a bit wider FOV than the final display, to provide some slack in case of fast rotation. For translation, we’ll have to deal with disocclusions—parallax bringing areas into view that don’t have data in the source image. If we’re rendering the source image in stereo, using both halves of the stereo pair provides extra data that can help fill in disocclusions. It might even be helpful to render 3–4 views in the source image, not just two, to help provide data coverage for omnidirectional head translation.
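How much wider? A rough sizing: the head can keep rotating for up to one source-frame interval before fresh data arrives, so the margin must cover the maximum rotation speed times that interval. The 300°/s figure below is an assumed fast head turn, not a measured number.

```python
# Rough sizing of the extra FOV margin needed in the source image.
# 300 deg/s is an assumed fast head turn, not a measured figure.

def fov_margin_deg(max_rotation_dps, source_interval_ms):
    """FOV slack needed to cover rotation during one source-frame interval."""
    return max_rotation_dps * source_interval_ms / 1000

source_interval_ms = 4 * (1000 / 75)    # source images at 1/4 of 75 Hz
margin = fov_margin_deg(300, source_interval_ms)
print(f"extra FOV per side: about {margin:.0f} degrees")   # ~16
```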

Depth-based reprojection will, however, have issues with transparencies, reflections, depth of field, motion blur, and so forth. It also won’t update specular highlights due to head translation, except insofar as it can interpolate between the source views; and motion will occur only at the rate the source images are rendered, unless we render motion vectors and add prediction to the reprojection pass.[9]

The “shading cache” is a concept first introduced in the context of stochastic rasterization.[10][11] Briefly, the idea is to decouple the shading rate from the rasterization rate, and store shading results in a cache as they’re computed, in case a very similar shading sample is called for again later. In stochastic rasterization, that happens because of defocus and motion blur, where the same shaded point on a triangle can contribute to many pixels if the triangle is defocused or in motion. In VR, we can make use of this idea as well: the same shaded point on a triangle can contribute to pixels from both eyes over multiple frames in quick succession.

This idea for a rendering pipeline, like the previous one, has a single GPU dedicated to just-in-time rendering of the final image; but here, the other GPUs are populating some sort of abstract shading cache, containing pre-computed shading results covering the region you’re likely to see during the next few tens of milliseconds. This would include areas that you can’t see from the current position but could become disoccluded soon, and ideally some information about emitted light in different directions so that specular highlights and reflections respond to head translation. I don’t know precisely how this cache would be organized—as an octree or other 3D structure? In texture space using scene-wide unique unwrapping? A hash table based on triangle ID and barycentrics? It would be a research project, to say the least.
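To illustrate the lookup discipline, here’s a toy version of one of the cache organizations mused about above: a hash table keyed by triangle ID and quantized barycentrics. Everything here is a stand-in—the class, the quantization scheme, and the shading function are invented for the sketch, not a real GPU data structure.

```python
# Toy shading cache keyed by (triangle ID, quantized barycentrics).
# Nearby samples snap to the same grid cell and reuse the shading result.

class ShadingCache:
    def __init__(self, quantization=64):
        self.q = quantization
        self.entries = {}          # (tri_id, qu, qv) -> shaded result
        self.hits = self.misses = 0

    def _key(self, tri_id, u, v):
        # snap barycentrics to a grid so nearby samples share an entry
        return (tri_id, int(u * self.q), int(v * self.q))

    def shade(self, tri_id, u, v, shade_fn):
        key = self._key(tri_id, u, v)
        if key in self.entries:
            self.hits += 1
        else:
            self.misses += 1
            self.entries[key] = shade_fn(tri_id, u, v)  # expensive path
        return self.entries[key]

# Both eyes sample nearly the same surface point; the second lookup hits.
cache = ShadingCache()
expensive = lambda t, u, v: (u, v, 0.5)   # stand-in for real BRDF evaluation
cache.shade(tri_id=7, u=0.401, v=0.300, shade_fn=expensive)  # left eye: miss
cache.shade(tri_id=7, u=0.403, v=0.301, shade_fn=expensive)  # right eye: hit
print(cache.hits, cache.misses)   # 1 1
```

The quantization granularity is the knob: too coarse and shading visibly blurs across the surface, too fine and the two eyes (or successive frames) stop sharing entries.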

Then the just-in-time GPU would render the geometry, applying the results from the most recent shading cache. Since it’s working with (a subset of) the actual scene geometry, it’ll be more accurate than attempting to do image-based reprojection, but it won’t need to re-do all lighting and shading calculations every frame. Motion prediction based on per-vertex motion vectors would be easy to add, although shading would still update at a lower rate. Transparencies wouldn’t be an issue—or at least, no more of an issue than they usually are in rendering! Dynamically generated geometry, such as particles, would have to be captured so it could be re-rendered consistently by the just-in-time passes.

The just-in-time GPU would also have to do all postprocessing, which can be a substantial amount of work and would of course increase the latency again. There might be ways to mitigate this—some forms of postprocessing could be baked into the shading cache, such as SSAO and depth-based fog/haze; it might also be possible to segment the scene by depth and use the image-based reprojection approach for far areas while drawing geometry and doing post for near areas.

A big GPU will sit idle for most of the frame if all it has to do is the relatively cheap just-in-time reprojection or shading application pass. We’d have to find something to fill the remaining time—perhaps incremental simulation updates (e.g. particle systems, cloth, fluids etc.) that can be done in small chunks and decoupled from the main rendering pipeline, so we can ensure the GPU is available when it’s time to do the just-in-time pass.
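One way to arrange that: submit simulation work in small chunks, but stop submitting once the next just-in-time pass is near, guaranteeing the GPU is free when the deadline arrives. The chunk sizes and costs below are invented for illustration.

```python
# Sketch of filling idle GPU time with small simulation chunks, while
# always reserving room for the just-in-time pass. Timings are invented.

FRAME_MS = 1000 / 75        # display refresh interval
JIT_MS = 2.0                # assumed cost of the just-in-time pass
CHUNK_MS = 1.5              # each simulation chunk is small and skippable

def schedule_frame():
    """Return the work submitted within one display frame, and time used."""
    t, work = 0.0, []
    # fill spare time with sim chunks, leaving room for the JIT pass
    while t + CHUNK_MS + JIT_MS <= FRAME_MS:
        work.append("sim chunk")
        t += CHUNK_MS
    work.append("just-in-time pass")   # always fits before scanout
    return work, t + JIT_MS

work, used = schedule_frame()
print(f"{work.count('sim chunk')} sim chunks, {used:.1f}/{FRAME_MS:.1f} ms used")
```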

This could be an interesting case for heterogeneous multi-GPU—imagine a low-spec iGPU doing the relatively cheap just-in-time passes and driving the display, while a high-spec dGPU (or several) do the heavy lifting of rendering the source frames or shading cache. On the other hand, a low-spec iGPU will take longer to do the just-in-time passes, which will increase the latency. For low latency it’s better to have a huge GPU that can go wide and chew through all the pixels very quickly.

That being said, what if the high-spec dGPUs were far away from you? Cloud-based rendering is a topic that’s been getting interest lately, and the same kind of latency reduction strategies could be applicable there. Imagine having your VR headset driven by the little GPU in your phone, but streaming down source frames or shading cache updates from a dGPU in the cloud (or your home PC)!

All of this is very abstract and speculative, and would take tons of engineering to make it actually work. And if predictive head tracking works well enough in practice that several tens of milliseconds of input latency is actually fine, then there may not be a compelling need to develop low-latency rendering methods. And not many people have multiple GPUs anyway…although maybe you could do this kind of stuff on a single GPU if it had the right HW/OS/API pre-emption support. :) But the ideas are definitely interesting to think about! VR is exciting not just because it’s VR, but also because it’s going to expand the horizons of computer graphics in new directions that we wouldn’t necessarily have considered before.