In Unity 4.6 / 5.0, generating the batches used to render the UI system is very slow. This is due to a few factors, but ultimately our deadlines kept us from dedicating time to polishing this part of the UI; we focused on the usability and API side of things instead.

In the final sprints of finishing the UI, we were lucky enough to have some help with optimisation. After we shipped, we decided to take a step back and analyse exactly why things were slow and how we could fix them.

If you want the quick and dirty: We managed to move everything (apart from job scheduling) off the main thread, as well as drastically improve some of the algorithms we were using in the batch sorting.

Performance Project

We developed a few UI performance test scenes to get a good baseline to work with when testing the performance changes. They stress the UI in a variety of ways. The test most applicable to the sorting / batch generation work has the canvas completely filled with ‘buttons’. There are overlaps between the text on a button and the button background, so there will always be some overhead in calculating what can batch with what. The test constantly modifies the UI elements so that rebatching is required every frame.

The test can be configured to place UI elements in an ordered way (taking advantage of spatial closeness), or a random way (potentially stressing the sorting algorithms more). It was clear to us that batch sorting needed to be fast in both scenarios. In 4.6 / 5.0 it is fast in neither.

It should be noted that the performance test tends to have ~10k UI elements. This is not something we would expect to see in a ‘real’ UI; most UIs we’ve seen have ~300 items per canvas.

All performance testing and profiling was done on my MacBook Air (13-inch, Mid 2013).

Portion of the test scene:

Original (pre 4.6, no stats)

During the 4.6 betas we were getting feedback that batch sorting was very slow when there were many elements on the canvas. This was due to there basically being NO smartness when we were trying to figure out batch draw order. We would simply iterate the elements on the canvas, see what each one collided with, and then assign a depth based on some rules. This meant that as we added more elements to the scene, things got slower (O(N^2)), much slower. This is ‘bad vibes’ in terms of performance.
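The naive approach can be sketched roughly as follows. This is a simplified illustration, not Unity's actual code: the `Rect`, `Element`, and `AssignDepths` names, and the exact batching rule, are all assumptions for the sake of the example.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical, simplified stand-ins for the real canvas data.
struct Rect { float xMin, yMin, xMax, yMax; };

struct Element {
    Rect rect;
    int materialId;
    int depth;  // assigned by AssignDepths below
};

bool Overlaps(const Rect& a, const Rect& b) {
    return a.xMin < b.xMax && b.xMin < a.xMax &&
           a.yMin < b.yMax && b.yMin < a.yMax;
}

// Naive depth assignment: every element is tested against every element
// already placed, giving O(N^2) work as the canvas grows.
void AssignDepths(std::vector<Element>& elements) {
    for (size_t i = 0; i < elements.size(); ++i) {
        int depth = 0;
        for (size_t j = 0; j < i; ++j) {
            if (!Overlaps(elements[i].rect, elements[j].rect))
                continue;  // no collision, draw order unconstrained
            if (elements[i].materialId == elements[j].materialId)
                depth = std::max(depth, elements[j].depth);      // can batch
            else
                depth = std::max(depth, elements[j].depth + 1);  // must draw after
        }
        elements[i].depth = depth;
    }
}
```

Every added element pays for every earlier element, which is exactly why the cost curve blew up as scenes grew.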

4.6 / 5.0 release (baseline)

We did some work on the sorting that took advantage of the idea that elements drawn in order would normally be in a similar location on the screen. From this, a bounding box was built (per group of n elements) and new elements were collided with this group before being collided with individual elements. This led to a decent performance increase in scenes that had locality between UI elements, but in randomly ordered scenes, or scenes where elements were spaced far apart, the improvements were only marginal.
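The grouping idea can be sketched like this. The names, the group size, and the structure are assumptions, not the shipped implementation; the point is that a group whose combined bounds miss the query is rejected with a single test.

```cpp
#include <algorithm>
#include <vector>

struct Rect { float xMin, yMin, xMax, yMax; };

bool Overlaps(const Rect& a, const Rect& b) {
    return a.xMin < b.xMax && b.xMin < a.xMax &&
           a.yMin < b.yMax && b.yMin < a.yMax;
}

// A group of consecutive elements with a combined bounding box.
struct Group {
    Rect bounds;
    std::vector<Rect> members;
};

// Build groups of up to groupSize consecutive rects, tracking union bounds.
std::vector<Group> BuildGroups(const std::vector<Rect>& rects, size_t groupSize) {
    std::vector<Group> groups;
    for (const Rect& r : rects) {
        if (groups.empty() || groups.back().members.size() == groupSize)
            groups.push_back({r, {}});
        Group& g = groups.back();
        g.bounds.xMin = std::min(g.bounds.xMin, r.xMin);
        g.bounds.yMin = std::min(g.bounds.yMin, r.yMin);
        g.bounds.xMax = std::max(g.bounds.xMax, r.xMax);
        g.bounds.yMax = std::max(g.bounds.yMax, r.yMax);
        g.members.push_back(r);
    }
    return groups;
}

// Return the rects overlapping 'query'; 'tests' counts overlap checks so the
// saving from whole-group rejection is visible.
std::vector<Rect> FindOverlaps(const std::vector<Group>& groups,
                               const Rect& query, int& tests) {
    std::vector<Rect> hits;
    for (const Group& g : groups) {
        ++tests;
        if (!Overlaps(g.bounds, query)) continue;  // reject the whole group
        for (const Rect& m : g.members) {
            ++tests;
            if (Overlaps(m, query)) hits.push_back(m);
        }
    }
    return hits;
}
```

When elements are placed randomly, almost every group's bounds end up covering the query anyway, so the rejection rarely fires — which is why this approach only helped ordered scenes.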

If we take a look at this version, you can see that when placing random elements the batch performance massively breaks down, taking roughly 100ms to sort and populate a scene… that’s for reals slow.

Looking at this in the timeline profiler also reveals another worrying situation… we are completely blocking anything else from happening. The batch generation runs just before the UI is rendered, which is after LateUpdate and often after the scene cameras have rendered. It looks like it would make sense to move the batch generation to right after LateUpdate, so that it can happen while the scene would normally be rendered.

Improved sorting (Take 1)

We did a first pass on improving the sorting. It was still based on the idea of element locality, but with a few more smarts: it would try to keep groups ‘batchable’, so we could include / exclude batchability at the whole-group level. It was faster, but still fell down when given very spatially separated scenes and did not scale well with the number of renderable elements.

Non spatially grouped input

Spatially grouped input

This is pretty poor. It was clear that we needed a new approach.

Improved sorting (Take 2)

As mentioned earlier, sorting tends to break down and be slow in larger UI scenes with spread elements. We took a step back and thought about what might be a better approach. In the end we decided to implement a canvas grid structure. Each grid square becomes a ‘bucket’ and any UI element that touches a square gets added to that bucket. This means that when adding a new UI element we only need to look into the buckets that the element touches to find what it can / can’t batch with. This led to significant performance improvements when the scene was ordered randomly.
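A minimal sketch of the grid idea follows. The `CanvasGrid` name, the cell size, and the bucket keying are assumptions for illustration; the essential property is that an element is added to every bucket it touches, and a query only visits those buckets rather than every element on the canvas.

```cpp
#include <algorithm>
#include <cmath>
#include <unordered_map>
#include <vector>

struct Rect { float xMin, yMin, xMax, yMax; };

bool Overlaps(const Rect& a, const Rect& b) {
    return a.xMin < b.xMax && b.xMin < a.xMax &&
           a.yMin < b.yMax && b.yMin < a.yMax;
}

class CanvasGrid {
public:
    explicit CanvasGrid(float cellSize) : cellSize_(cellSize) {}

    // Add an element to every bucket (grid square) its rect touches.
    void Add(int id, const Rect& r) {
        rects_.push_back(r);
        ids_.push_back(id);
        ForEachCell(r, [&](long long key) {
            buckets_[key].push_back(ids_.size() - 1);
        });
    }

    // Return the ids of stored rects overlapping 'query', visiting only the
    // buckets the query touches.
    std::vector<int> Query(const Rect& query) const {
        std::vector<int> hits;
        ForEachCell(query, [&](long long key) {
            auto it = buckets_.find(key);
            if (it == buckets_.end()) return;
            for (size_t idx : it->second) {
                // Deduplicate: a rect spanning cells appears in several buckets.
                if (Overlaps(rects_[idx], query) &&
                    std::find(hits.begin(), hits.end(), ids_[idx]) == hits.end())
                    hits.push_back(ids_[idx]);
            }
        });
        return hits;
    }

private:
    template <class F>
    void ForEachCell(const Rect& r, F f) const {
        int x0 = (int)std::floor(r.xMin / cellSize_);
        int x1 = (int)std::floor(r.xMax / cellSize_);
        int y0 = (int)std::floor(r.yMin / cellSize_);
        int y1 = (int)std::floor(r.yMax / cellSize_);
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x)
                f(((long long)x << 32) ^ (unsigned)y);
    }

    float cellSize_;
    std::vector<Rect> rects_;
    std::vector<int> ids_;
    std::unordered_map<long long, std::vector<size_t>> buckets_;
};
```

Because the cost of adding an element now depends on how many neighbours share its buckets rather than on the total element count, randomly ordered scenes stop being a pathological case.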

Non spatially grouped input

Spatially grouped input

Comparable performance between setups!

Geometry Job

We reached the first step on the path to pulling the UI off the main thread by using the new Geometry Job system introduced in Unity 5. This is an internal feature that can be used to populate vertex / index buffers in a threaded way. The changes made here allowed us to move a whole bunch of code off the main thread, as the timeline below shows. There is some small overhead in managing the geometry job (we have to create the job and job instructions, for example, which requires some memory), but this is negligible compared to the previous main thread cost.

Simplifying the batch sort

During the optimisation process, we did a bunch of smaller, profiler-guided optimisations. The biggest gain was probably when we vectorised a bunch of our rectangular overlap checks in the sorting: getting our data into a super nice DOD layout ready for overlap checking, then checking with one call. This removed the overlap checks from a hot spot in the C++ profiler, where they had previously accounted for ~60% of the sort time. As you can see, doing this really helped our sort performance. But there was still a way to go, and that was taking the remaining 7ms off the main thread.

vectorised overlap code

Taking it all off (the main thread)

The next logical step for us was to remove UI generation from the main thread. For this, we used the internal Job system to schedule a number of tasks. Some of them are serial, others are able to go wide and execute in parallel. Here is the breakdown:

1) Split incoming UI instructions into renderable instructions (one UI instruction can contain many draw calls due to sub-meshes and multiple materials). This task goes wide. It allocates memory to accommodate the maximum possible number of renderable instructions. The incoming instructions are then processed in parallel and placed into the output array. This array is then ‘compressed’ down in a combine job into a contiguous section of memory containing just the valid instructions.

2) Sort the renderable instructions, comparing depths, overlaps etc. Basically, sort to produce a command buffer that requires the LEAST amount of state change when rendering.

3) Batch generation. Generate the render command buffer: create the draw calls (batches / sub-batches) and generate the transform instructions that the geometry job can use.
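A serial sketch of the first two stages is below. All names are assumptions; the real stage 1 runs wide across the instruction list, and the real sort also has to respect overlap constraints, which this toy comparator (material, then depth) ignores.

```cpp
#include <algorithm>
#include <optional>
#include <vector>

struct UIInstruction { int materialId; int subMeshCount; };
struct RenderableInstruction { int materialId; int subMesh; float depth; };

// Stage 1: split, then compact. The scratch array is sized for the worst
// case, filled sparsely (this loop is what goes wide in parallel), then
// 'compressed' into a contiguous array of just the valid instructions,
// mirroring the combine job described above.
std::vector<RenderableInstruction> SplitAndCompact(
        const std::vector<UIInstruction>& in, int maxSubMeshes) {
    std::vector<std::optional<RenderableInstruction>> scratch(
        in.size() * maxSubMeshes);
    for (size_t i = 0; i < in.size(); ++i)
        for (int s = 0; s < in[i].subMeshCount; ++s)
            scratch[i * maxSubMeshes + s] =
                RenderableInstruction{in[i].materialId, s, (float)i};

    std::vector<RenderableInstruction> out;  // the compaction / combine step
    for (const auto& slot : scratch)
        if (slot) out.push_back(*slot);
    return out;
}

// Stage 2: sort for minimal state change. Grouping by material keeps draws
// that share state adjacent; depth preserves draw order within a material.
void SortForLeastStateChange(std::vector<RenderableInstruction>& v) {
    std::stable_sort(v.begin(), v.end(),
        [](const RenderableInstruction& a, const RenderableInstruction& b) {
            return a.materialId != b.materialId ? a.materialId < b.materialId
                                               : a.depth < b.depth;
        });
}
```

Stage 3 would then walk the sorted array and emit one draw call per run of instructions sharing the same state.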

The jobs are scheduled right after LateUpdate. This allows them to execute while a normal scene would be rendering and before the UI would be displayed. When these jobs are scheduled, a fence is held by the main thread. Both the call for canvas rendering and the geometry job will wait on it until all the required data has been generated.
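The fence mechanism can be sketched with standard threading primitives. This is not Unity's internal job fence API, just the general shape: the batch generation jobs signal completion, and both consumers block on Wait() until the data is ready.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Minimal one-shot fence: workers call Signal() when their output is ready;
// any number of consumers call Wait() and block until then.
class Fence {
public:
    void Signal() {
        { std::lock_guard<std::mutex> lock(m_); done_ = true; }
        cv_.notify_all();
    }
    void Wait() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return done_; });
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};
```

In this picture, the batch generation jobs signal the fence once the command buffer and transform instructions exist; the canvas render call and the geometry job each Wait() on it before consuming that data.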

In the example below, you can see the geometry job ‘stall’ as it waits for the batch generation to complete. We need to do more testing around this, but as these scenes have no renderable elements aside from the UI, the issue should diminish as the complexity of the scene increases.

Executing on a machine that has a few more cores than my MacBook Air

So there we have it: 0.4ms on the main thread for a very expensive UI.

Other performance things we did