This is part 19 of a tutorial series about rendering. The previous part covered realtime GI, probe volumes, and LOD groups. This time we'll add support for another way to consolidate draw calls into batches.

This tutorial was made with Unity 2017.1.0f3.

Batching Instances

Instructing the GPU to draw something takes time. Feeding it the data to do so, including the mesh and material properties, takes time as well. We already know of two ways to decrease the number of draw calls: static and dynamic batching.

Unity can merge the meshes of static objects into a larger static mesh, which reduces draw calls. Only objects that use the same material can be combined in this way. This comes at the cost of having to store more mesh data. When dynamic batching is enabled, Unity does the same thing at runtime for dynamic objects that are in view. This only works for small meshes, otherwise the overhead becomes too great.
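As a rough illustration of the constraint described above, here is a minimal sketch (not Unity code; the scene data and function are made up for this example) showing why batching reduces draw calls to the number of distinct materials:

```python
# Hypothetical sketch: meshes that share a material can be merged into
# one combined mesh, so the draw calls for them collapse into one.

def draw_calls_after_static_batching(objects):
    """objects is a list of (mesh_name, material_name) pairs.
    Without batching, each object costs one draw call; with static
    batching, objects sharing a material collapse into a single call."""
    return len({material for _mesh, material in objects})

scene = [("house", "brick"), ("wall", "brick"), ("tree", "leaves"),
         ("bush", "leaves"), ("rock", "stone")]

print(len(scene))                               # 5 draw calls unbatched
print(draw_calls_after_static_batching(scene))  # 3, one per material
```

The trade-off mentioned above is that the combined mesh data has to be stored somewhere, so memory use goes up.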

There is yet another way to combine draw calls. It is known as GPU instancing or geometry instancing. Like dynamic batching, this is done at runtime for visible objects. The idea is that the GPU is told to render the same mesh multiple times in one go. So it cannot combine different meshes or materials, but it's not restricted to small meshes. We're going to try out this approach.
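The effect on draw calls can be sketched with a simple calculation (an illustration, not Unity code; the batch capacity limit is explained later in this tutorial):

```python
import math

# Illustrative sketch: with GPU instancing, the same mesh and material
# are drawn once per batch of instances, instead of once per object.
# How many instances fit in a batch is limited by buffer size.

def draw_calls_unbatched(instance_count):
    return instance_count  # one call per object

def draw_calls_instanced(instance_count, batch_capacity):
    return math.ceil(instance_count / batch_capacity)

print(draw_calls_unbatched(5000))       # 5000 draw calls
print(draw_calls_instanced(5000, 125))  # 40 batches
print(draw_calls_instanced(5000, 500))  # 10 batches
```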

Many Spheres

To test GPU instancing, we need to render the same mesh many times. Let's create a simple sphere prefab for this, which uses our white material.

White sphere prefab.

To instantiate this sphere, create a test component that spawns a prefab many times and positions the instances randomly inside a spherical area. Make the spheres children of the instantiator, so the editor's hierarchy window doesn't have to struggle with displaying thousands of instances.

```csharp
using UnityEngine;

public class GPUInstancingTest : MonoBehaviour {

	public Transform prefab;

	public int instances = 5000;

	public float radius = 50f;

	void Start () {
		for (int i = 0; i < instances; i++) {
			Transform t = Instantiate(prefab);
			t.localPosition = Random.insideUnitSphere * radius;
			t.SetParent(transform);
		}
	}
}
```

Create a new scene and put a test object in it with this component. Assign the sphere prefab to it. I'll use it to create 5000 sphere instances inside a sphere of radius 50.

Test object.

With the test object positioned at the origin, placing the camera at (0, 0, -100) ensures that the entire ball of spheres is in view. Now we can use the statistics panel of the game window to determine how all the objects are drawn. Turn off the shadows of the main light so only the spheres are drawn, plus the background. Also set the camera to use the forward rendering path.

A sphere of spheres.

In my case, it takes 5002 draw calls to render the view, which is reported as Batches in the statistics panel. That's 5000 spheres, plus two extra for the background and camera effects. Note that the spheres are not batched, even with dynamic batching enabled. That's because the sphere mesh is too large. Had we used cubes instead, they would've been batched.

A sphere of cubes.

In the case of cubes, I only end up with eight batches, so all cubes are rendered in six batches. That's 4994 fewer draw calls, reported as Saved by batching in the statistics panel. In my case it also reports a much higher frame rate: 83 instead of 35 fps. This is a measure of the time it takes to render a frame, not the actual frame rate, but it's still a good indication of the performance difference. The cubes are faster to draw because they're batched, but also because a cube requires far less mesh data than a sphere, so it's not a fair comparison. As the editor generates a lot of overhead, the performance difference can be much greater in builds. Especially the scene window can slow things down, as it's an extra view that has to be rendered. I keep it hidden in play mode to improve performance.

Supporting Instancing

GPU instancing isn't possible by default. Shaders have to be designed to support it. Even then, instancing has to be explicitly enabled per material. Unity's standard shaders have a toggle for this. Let's add an instancing toggle to MyLightingShaderGUI as well. Like the standard shader's GUI, we'll create an Advanced Options section for it. The toggle can be added by invoking the MaterialEditor.EnableInstancingField method. Do this in a new DoAdvanced method.

```csharp
	void DoAdvanced () {
		GUILayout.Label("Advanced Options", EditorStyles.boldLabel);
		editor.EnableInstancingField();
	}
```

Add this section at the bottom of our GUI.

```csharp
	public override void OnGUI (
		MaterialEditor editor, MaterialProperty[] properties
	) {
		this.target = editor.target as Material;
		this.editor = editor;
		this.properties = properties;
		DoRenderingMode();
		DoMain();
		DoSecondary();
		DoAdvanced();
	}
```

Select our white material. An Advanced Options header is now visible at the bottom of its inspector. However, there isn't a toggle for instancing yet.

No support for instancing yet.

The toggle is only shown if the shader actually supports instancing. We can enable this support by adding the #pragma multi_compile_instancing directive to at least one pass of a shader. This will enable shader variants for a few keywords, in our case INSTANCING_ON, but other keywords are also possible. Do this for the base pass of My First Lighting Shader.

```
			#pragma multi_compile_fwdbase
			#pragma multi_compile_fog
			#pragma multi_compile_instancing
```

Supported and enabled instancing.

Our material now has an Enable Instancing toggle. Checking it changes how the spheres are rendered.

Only one position per batch.

In my case, the number of batches has been reduced to 42, which means that all 5000 spheres are now rendered with only 40 batches. The frame rate has also shot up to 80 fps. But only a few spheres are visible.
All 5000 spheres are still being rendered; it's just that all spheres in the same batch end up at the same position. They all use the transformation matrix of the first sphere in the batch. This happens because the matrices of all spheres in a batch are now sent to the GPU as an array. Without telling the shader which array index to use, it always uses the first one.
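This failure mode can be mimicked with a tiny sketch (plain Python, with the matrices reduced to translation offsets; the names are made up for this illustration):

```python
# Each batch uploads an array of per-instance transformations. Every
# vertex has to pick its own entry; without an instance ID, index 0 is
# used for the whole batch, so all instances land on the first one.

offsets = [(0, 0, 0), (5, 0, 0), (0, 5, 0)]  # per-instance translations

def transform(vertex, instance_id, use_instance_id):
    index = instance_id if use_instance_id else 0
    ox, oy, oz = offsets[index]
    x, y, z = vertex
    return (x + ox, y + oy, z + oz)

# Without the instance ID, every instance ends up at the same place.
print([transform((1, 1, 1), i, False) for i in range(3)])
# → [(1, 1, 1), (1, 1, 1), (1, 1, 1)]

# With it, each instance uses its own transformation.
print([transform((1, 1, 1), i, True) for i in range(3)])
# → [(1, 1, 1), (6, 1, 1), (1, 6, 1)]
```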

Instance IDs

The array index corresponding to an instance is known as its instance ID. The GPU passes it to the shader's vertex program via the vertex data. On most platforms, it is an unsigned integer named instanceID with the SV_InstanceID semantic. We can simply use the UNITY_VERTEX_INPUT_INSTANCE_ID macro to include it in our VertexData structure. It is defined in UnityInstancing, which is included by UnityCG. It gives us the correct definition of the instance ID, or nothing when instancing isn't enabled. Add it to the VertexData structure in My Lighting.

```
struct VertexData {
	UNITY_VERTEX_INPUT_INSTANCE_ID
	float4 vertex : POSITION;
	…
};
```

We now have access to the instance ID in our vertex program, when instancing is enabled. With it, we can use the correct matrix when transforming the vertex position. However, UnityObjectToClipPos doesn't have a matrix parameter. It always uses unity_ObjectToWorld. To work around this, the UnityInstancing include file overrides unity_ObjectToWorld with a macro that uses the matrix array. This can be considered a dirty macro hack, but it works without having to change existing shader code, ensuring backwards compatibility. To make the hack work, the instance's array index has to be globally available for all shader code. We have to manually set this up via the UNITY_SETUP_INSTANCE_ID macro, which must be done in the vertex program before any code that might potentially need it.

```
InterpolatorsVertex MyVertexProgram (VertexData v) {
	InterpolatorsVertex i;
	UNITY_INITIALIZE_OUTPUT(Interpolators, i);
	UNITY_SETUP_INSTANCE_ID(v);
	i.pos = UnityObjectToClipPos(v.vertex);
	…
}
```

Instanced spheres.

The shader can now access the transformation matrices of all instances, so the spheres are rendered at their actual locations.

How does the matrix array replacement work? When instancing is enabled, in the most straightforward case, it boils down to this.
```
static uint unity_InstanceID;

CBUFFER_START(UnityDrawCallInfo)
	// Where the current batch starts within the instanced arrays.
	int unity_BaseInstanceID;
CBUFFER_END

#define UNITY_VERTEX_INPUT_INSTANCE_ID uint instanceID : SV_InstanceID;

#define UNITY_SETUP_INSTANCE_ID(input) \
	unity_InstanceID = input.instanceID + unity_BaseInstanceID;

// Redefine some of the built-in variables / macros
// to make them work with instancing.
UNITY_INSTANCING_CBUFFER_START(PerDraw0)
	float4x4 unity_ObjectToWorldArray[UNITY_INSTANCED_ARRAY_SIZE];
	float4x4 unity_WorldToObjectArray[UNITY_INSTANCED_ARRAY_SIZE];
UNITY_INSTANCING_CBUFFER_END

#define unity_ObjectToWorld unity_ObjectToWorldArray[unity_InstanceID]
#define unity_WorldToObject unity_WorldToObjectArray[unity_InstanceID]
```

The actual code in UnityInstancing is a lot more complex. It deals with platform differences, other ways to use instancing, and special code for stereo rendering, which leads to multiple steps of indirect definitions. It also has to redefine UnityObjectToClipPos, because UnityCG includes UnityShaderUtilities first. The buffer macros will be explained later.

Batch Size

It is possible that you end up with a different number of batches than I do. In my case, 5000 sphere instances are rendered in 40 batches, which means 125 spheres per batch. Each batch requires its own array of matrices. This data is sent to the GPU and stored in a memory buffer, known as a constant buffer in Direct3D and a uniform buffer in OpenGL. These buffers have a maximum size, which limits how many instances can fit in one batch. The assumption is that desktop GPUs have a limit of 64KB per buffer.

A single matrix consists of 16 floats, which are four bytes each. So that's 64 bytes per matrix. Each instance requires an object-to-world transformation matrix. However, we also need a world-to-object matrix to transform normal vectors. So we end up with 128 bytes per instance. This leads to a maximum batch size of `64000 / 128 = 500`, which could render 5000 spheres in only 10 batches.

Isn't the maximum 512?

Memory is measured in base-two, not base-ten, so 1KB represents 1024 bytes, not 1000. Thus, `(64 * 1024) / 128 = 512`. UNITY_INSTANCED_ARRAY_SIZE is by default defined as 500, but you can override it with a compiler directive. For example, #pragma instancing_options maxcount:512 sets the maximum to 512. However, this leads to assertion-failure errors, so the practical limit is 511. There isn't much difference between 500 and 512, though.

Although the maximum is 64KB for desktops, most mobiles are assumed to have a maximum of only 16KB. Unity copes with this by simply dividing the maximum by four when targeting OpenGL ES 3, OpenGL Core, or Metal. Because I'm using OpenGL Core in the editor, I end up with a maximum batch size of `500 / 4 = 125`. You can disable this automatic reduction by adding the compiler directive #pragma instancing_options force_same_maxcount_for_gl. Multiple instancing options are combined in the same directive. However, that might lead to failure when deploying to mobile devices, so be careful.
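The arithmetic above can be written out as a small worked example. The 64KB limit, the 500-entry default, and the divide-by-four for OpenGL ES 3 / OpenGL Core / Metal are the figures quoted in the text:

```python
import math

# Worked version of the batch-size arithmetic.
bytes_per_matrix = 16 * 4                 # 16 floats, 4 bytes each = 64
bytes_per_instance = 2 * bytes_per_matrix # object-to-world + world-to-object

max_batch = (64 * 1024) // bytes_per_instance  # what fits in a 64KB buffer
unity_default = 500                            # UNITY_INSTANCED_ARRAY_SIZE
gl_batch = unity_default // 4                  # reduced for OpenGL Core etc.

print(max_batch)                   # 512 instances per 64KB buffer
print(gl_batch)                    # 125 instances per batch
print(math.ceil(5000 / gl_batch))  # 40 batches for 5000 spheres
```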
What about the assumeuniformscaling option?

You can use #pragma instancing_options assumeuniformscaling to indicate that all instanced objects have a uniform scale. This obviates the need to use the world-to-object matrix for the conversion of normals. While the UnityObjectToWorldNormal function does change its behavior when this option is set, it doesn't eliminate the second matrix array. So this option effectively does nothing, at least in Unity 2017.1.0.

Instancing Shadows

Up to this point, we have worked without shadows. Turn the soft shadows back on for the main light and make sure that the shadow distance is large enough to include all spheres. As the camera sits at -100 and the sphere area has a radius of 50, a shadow distance of 150 is enough for me.

Lots of shadows.

Rendering shadows for 5000 spheres takes a toll on the GPU, but we can use GPU instancing when rendering the sphere shadows as well. Add the required directive to the shadow caster pass.

```
			#pragma multi_compile_shadowcaster
			#pragma multi_compile_instancing
```

Also add UNITY_VERTEX_INPUT_INSTANCE_ID and UNITY_SETUP_INSTANCE_ID to My Shadows.

```
struct VertexData {
	UNITY_VERTEX_INPUT_INSTANCE_ID
	…
};

…

InterpolatorsVertex MyShadowVertexProgram (VertexData v) {
	InterpolatorsVertex i;
	UNITY_SETUP_INSTANCE_ID(v);
	…
}
```

Instanced shadows.

Now it is a lot easier to render all those shadows.