Part 2: Solving Performance Problems

In my last post I discussed ways to build efficient Gear VR games and the traits of Gear VR devices. In this installment I’ll focus on ways to debug Unity applications that are not sufficiently performaning on those devices.

Performance Debugging

Even if you’ve designed your scene well and set reasonable throttling values, you may find that your game does not run at a solid 60 frames per second on the Gear VR device. The next step is to decipher what’s going on using three tools: Unity’s internal profiler log, Unity’s Profiler , and the Oculus Remote Monitor.

The very first thing you should do when debugging Unity performance is to turn on the Enable Internal Profiler option in Player Settings. This will spit a number of important frame rate statistics to the logcat console every few seconds, and should give you a really good idea of where your frame time is going.

To illustrate the common steps to debugging performance, let’s look at some sample data from a fairly complicated scene in a real game running on a Note 4 Gear VR device:

Android Unity internal profiler stats: cpu-player> min: 8.8 max: 44.3 avg: 16.3 cpu-ogles-drv> min: 5.1 max: 6.0 avg: 5.6 cpu-present> min: 0.0 max: 0.3 avg: 0.1 frametime> min: 14.6 max: 49.8 avg: 22.0 draw-call #> min: 171 max: 177 avg: 174 | batched: 12 tris #> min: 153294 max: 153386 avg: 153326 | batched: 2362 verts #> min: 203346 max: 203530 avg: 203411 | batched: 3096 player-detail> physx: 0.1 animation: 0.1 culling 0.0 skinning: 0.0 batching: 0.4 render: 11.6 fixed-update-count: 1 .. 1 mono-scripts> update: 0.9 fixedUpdate: 0.0 coroutines: 0.0 mono-memory> used heap: 3043328 allocated heap: 3796992 max number of collections: 0 collection total duration: 0.0

To capture this sample you’ll need to connect to your device over Wi-Fi using ADB over TCPIP and run adb logcat while the device is docked in the headset (for more information, see “Android Debugging” in the Mobile SDK documentation).

What the sample above tells us is that our average frame time is 22 ms, which is about 45 frames per second—way below our target of 60 frames per second. We can also see that this scene is heavy on the CPU—16.3 ms of our 22 ms total is spent on the CPU. We’re spending 5 ms in the driver (“cpu-ogles-drv”), which suggests that we’re sending the driver down some internal slow path. The probable culprit is pretty clear: at 174 draw calls per frame, we’re significantly over our target of 100. In addition, we’re pushing more polygons than we would like. This view of the scene doesn’t really tell us what’s happening on the GPU, but it tells us that we can explain our 45 ms frame rate just by looking at the CPU, and that reducing draw calls should be the focus of our attention.

This data also shows that we have regular spikes in the frametime (represented by max frametime of 49.8 ms). To understand where those are coming from, the next step is to connect the Unity Profiler and look at its output.

As expected, the graph shows a regular spike. During non-spike periods, our render time is similar to the values reported above, and there is no other significant contributor to our final frametime.

The profiler blames the spikes on something called Gfx.WaitForPresent. Curiously, our actual render time is not significantly increased in the spike frame. What’s going on here anyway?

Wait for present

WaitForPresent (and its cousin, Overhead) appears to be some repeated cost that comes along and destroys our frame. In fact, it does not actually represent some mysterious work being performed. Rather, WaitForPresent records the amount of time that the render pipeline has stalled.

One way to think of the render pipeline is to imagine a train station. Trains leave at reliable times—every 16.6 ms. Let’s say the train only holds one person at a time, and there is a queue of folks waiting to catch a ride. As long as each person in the queue can make it to the platform and get on the next train before it leaves, you’ll be able to ship somebody out to their destination every 16 ms. But if even one guy moves too slowly—maybe he trips on his shoelaces—he’ll not only miss his train, he’ll be sitting there waiting on the platform for another 16 ms for the next train to come. Even though he might have only been 1 ms late for his train, missing it means that he has to sit there and wait a long time.

In a graphics pipeline, the train is the point at which the front buffer (currently displayed) and back buffer (containing the next frame to display) swap. This usually happens when the previous frame finishes its scanout to the display. Assuming that the GPU can execute all of the render commands buffered for that frame in a reasonable amount of time, there should be a buffer swap every 16 ms. To maintain a 60 frames-per-second refresh rate, the game must finish all of its work for the next frame within 16 ms. When the CPU takes too long to complete a frame, even if it’s only late by 1 ms, the swap period is missed, the scanout of the next frame begins using the previous frame’s data, and the CPU has to sit there and wait for the next swap to roll around. To use the parlance of our example above, the train is the swap and the frame commands issued by the CPU are the passengers.

WaitForPresent indicates that this sort of pipeline stall has occurred and records the amount of time the CPU idles while waiting on the GPU. Though less common, this can also happen if the CPU finishes its work very quickly and has to wait for the next swap.

In this particular example, it’s pretty clear that our frame rate is inconsistent enough that we cannot schedule our pipeline reliably. The way to fix WaitForPresent is thus to ignore it in the profiler and concentrate on optimizing everything else, which in this case means reducing the number of draw calls we have in the scene.

Other Profiler Information

The Unity profiler is very useful for digging into all sorts of other runtime information, including memory usage, draw calls, texture allocations, and audio overhead. For serious performance debugging, it’s a good idea to turn off Multithreaded Rendering in the Player Preferences. This will slow the renderer down a lot but also give you a clearer view of where your frame time is going. When you’re done with optimizations, remember to turn Multithreaded Rendering back on.

In addition to draw call batching, other common areas of performance overhead include overdraw (often caused by large transparent objects or poor occlusion culling), skinning and animation, physics overhead, and garbage collection (usually caused by memory leaks or other repeated allocations). Watch for these as you dig into the performance of your scene. Also remember that displaying the final VR output, which includes warping and TimeWarp overhead, costs about 2 ms every frame.

Oculus Remote Monitor

OVRMonitor is a new tool recently released with the Oculus Mobile SDK. It helps developers understand the way pipelining works and identify pipeline stalls. It can also stream low resolution unwarped video from a Gear VR device wirelessly, which is useful for usability testing.

OVRMonitor is currently in development, but this early version can still be used to visualize the graphics pipeline for Gear VR applications. Here’s a shot of a tool inspecting a game running the same scene discussed above:

The yellow bar represents the vertical sync interrupt that indicates that scan out for a frame has completed. The image at the top of the window is a capture of the rendered frame, and the left side of the image is aligned to the point in the frame where drawing began. The red bar in the middle of the image shows the TimeWarp thread, and you can see it running parallel to the actual game. The bottom blue are indicates the load on CPU and GPU, which are constant (i.e., in this case, all four CPUs are running).

This shot actually shows us one of the WaitForPresent spikes we saw above in the Unity Profiler. The frame in the middle of the timeline began too late to complete by the next vertical blank, and as a result the CPU blocked for a full frame (evidenced by the lack of screen shot in the subsequent frame and the 16.25 ms WarpSwapInternal thread time).

OVRMonitor is a good way to get a sense of what is happening in your graphics pipeline from frame to frame. It can be used with any Gear VR app built against the latest SDK. See the documentation in SdkRoot/Tools/OVRMonitor for details. More documentation and features are coming soon.

Tips and Tricks

Here are a few performance tricks we’ve used or heard about from other developers. These are not guaranteed solutions for all VR applications, but they might give you some ideas about potential solutions for your particular scene.

Draw big, opaque meshes last. Try sorting your skybox into the Geometry+1 render queue so that it draws after all the other opaque geometry. Depending on your scene, this probably causes a lot of the pixels covered by the skybox to be discarded by depth test, thus saving you time. Ground planes and other static, opaque objects that touch a lot of pixels and are likely to be mostly occluded by other objects are candidates for this optimization.

Dynamically change your CPU / GPU throttling settings. You can change your throttling settings at any time. If you are able to run most of your game at a low setting but have one or two particularly challenging scenes, consider cranking the CPU or the GPU up just during those scenes. You can also drop the speed of one or both processors in order to save battery life and reduce heat during scenes that are known to be simple. For example, why not set GPU to 0 during a scene load?

Update render targets infrequently. If you have secondary render textures that you are drawing the scene to, try drawing them at a lower frequency than the main scene. For example, a stereoscopically rendered mirror might only refresh its reflection for only one eye each frame. This effectively lowers the frame rate of the mirror to 30 frames per second, but because there is new data every frame in one eye or the other, it looks okay for subtle movements.

Lower the target eye resolution. By trading a bit of visual quality you can often improve your fill-bound game performance significantly by slightly lowering the size of the render target for each eye. OVRManager.virtualTextureScale is a value between 0.0 and 1.0 that controls the size of the render output. Dropping resolution slightly when running on older devices is often an easy way to support slower hardware.

Compress your textures. All Gear VR devices support the ASTC compression format, which can significantly reduce the size of your textures. Note that as of this writing, Unity 4.6 expects ASTC compression to go hand-in-hand with OpenGL ES 3.0. If you use ASTC under GLES 2.0, your textures will be decompressed on load, which will probably lengthen your application’s start-up time significantly. ETC1 is a lower-quality but universally supported alternative for GLES 2.0 users.

Use texture atlases. As described above, large textures that can be shared across a number of meshes will batch efficiently. Avoid lots of small individual textures.

For more information on optimizing your Unity mobile VR development, see “Best Practices: Mobile” in our Unity Integration guide.