We’re rebuilding the core of Unity with our Data-Oriented Tech Stack, and many game studios are already seeing massive performance wins when using the Entity Component System (ECS), the C# Job System, and the Burst Compiler. At Unite Copenhagen, we had a chance to sit down with Far North Entertainment and go deep into how they implemented these DOTS features into their otherwise traditional Unity project.

Far North Entertainment is a Swedish studio co-owned by five friends from engineering studies. Since releasing Down to Dungeon for Gear VR in early 2018, the company’s been working on a game that explores a classic PC game genre – the post-apocalyptic zombie survival game. What makes the project stand out is the number of zombies chasing you. The team’s vision included thousands of brain-hungry enemies coming after you in enormous hordes.

However, they quickly ran into performance issues when prototyping this idea. Spawning, despawning, updating, and animating all those zombies was a major bottleneck, even after the team tried likely solutions such as object pooling and animation instancing.

This led the studio’s CTO Anders Eriksson to look at DOTS, and how to change his mindset from an object-oriented mindset to a data-oriented one. “The key insights that helped us make the shift was to stop thinking about objects and object hierarchies, and start to think about the data, how it is transformed and how it is accessed,” he says. That means that the code doesn’t have to be modeled around something that makes sense in real life and it doesn’t have to solve the most general case. He’s got a lot of advice for anyone trying to make the same shift:

“Ask yourself what the actual problem is that you are trying to solve, and what data is relevant for a specific solution. Will you do the same transformations of the same set of data over and over again? How much relevant data can you pack into a CPU cache line? If you are also looking to convert existing code, then identify how much garbage data you are filling the cache lines with. Can you split up the calculations on multiple threads, and/or utilize SIMD instructions?”


The team came to understand that entities in the Entity Component System are just lookup IDs into streams of components. Components are just data, while systems contain all the logic and filter entities by a certain component signature, known as an Archetype. “I think one insight that helped us to visualize this was to think of ECS as an SQL database. Each Archetype is a table where each column is a component and each row is a unique entity. You then use the systems to query into these archetype tables to do operations on entities,” says Anders.
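In that mental model, a single archetype can be pictured like this (a sketch of the analogy only; component names match the ones used later in this post, values are invented):

```csharp
// "Table" = Archetype { PositionData2D, HeadingData2D, TargetPositionData }
//
//  Entity (row) | Position (column) | Heading (column) | TargetPosition (column)
//  -------------+-------------------+------------------+------------------------
//  Entity 0     | (1.0, 2.0)        | (0.0, 1.0)       | (1.5, 2.5)
//  Entity 1     | (4.0, 0.5)        | (1.0, 0.0)       | (4.5, 0.5)
//  ...
//
// A system's query is then the SELECT: "give me every row whose table
// has these columns", run against every matching archetype table.
```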

Getting started with DOTS

To get to this understanding, he studied the Entity Component System documentation, the ECS Samples and the sample that we created with Nordeus and unveiled at Unite Austin. But general materials about data-oriented design were also extremely helpful to the team. “The talk from Mike Acton about Data-oriented design from CppCon 2014 is what initially opened our eyes to this way of programming.”

The Far North team posted what they’ve learned on their Dev Blog and this September, came to Copenhagen to talk about their experiences switching to the data-oriented mindset at Unite.


This blog post builds on this presentation and explains the specifics of their implementation of ECS, the C# Job System, and the Burst Compiler in more detail. The Far North team has also kindly shared a lot of code examples from their project.

Lining up zombie data

“The problem we faced was with doing client-side interpolation of translations and rotations for thousands of entities,” says Anders. Their first, object-oriented approach was an abstraction: a ZombieView script that inherited from a more general EntityView parent class. An EntityView is a MonoBehaviour attached to a GameObject, and it acts as the visual representation of the game model. Every ZombieView was responsible for handling its own translation and rotation interpolation in its Update function.

This seems fine until you realize that every entity is allocated at an arbitrary place in memory. That means if you’re accessing thousands of entities, the CPU has to fetch them one by one from main memory, which is very slow. If you lay your data out in neat contiguous blocks, the CPU can cache a whole batch of them at once. Most CPUs today can pull around 128 or 256 bits per cycle from a cache.
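As a minimal sketch in plain C# (type names invented for illustration), the two layouts look like this:

```csharp
// Scattered, object-oriented layout: each zombie lives wherever the heap
// allocator put it, so iterating over thousands of them chases pointers
// all over memory and misses the cache constantly.
class ZombieObject
{
    public float X, Y;
}

// Contiguous, data-oriented layout: all positions sit side by side in
// memory, so a single cache line fetch brings in many zombies' worth
// of data and the hardware prefetcher can stream the rest.
struct ZombiePositions
{
    public float[] X; // stream of all X coordinates
    public float[] Y; // stream of all Y coordinates
}
```

Iterating over the arrays in `ZombiePositions` touches memory sequentially, which is exactly the access pattern CPU caches and prefetchers are built for.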

The team decided to convert the enemies to DOTS in the hope of eliminating the client-side performance bottlenecks. First in line was the Update function of the ZombieView. The team identified which parts of it should be separated into different systems, and what data would be needed. The first and most obvious candidate was the interpolation of positions and rotations. Since the world of the game is a 2D grid, two floats represent where the zombie is, two more represent where it is heading, and the final component is a target position that keeps track of the server position for the enemy.

```csharp
[Serializable]
public struct PositionData2D : IComponentData
{
    public float2 Position;
}

[Serializable]
public struct HeadingData2D : IComponentData
{
    public float2 Heading;
}

[Serializable]
public struct TargetPositionData : IComponentData
{
    public float2 TargetPosition;
}
```

Next up was creating the archetype for the enemies. An Archetype is simply a set of components that belong to a certain entity, in other words, a component signature.

The project uses prefabs for defining archetypes, since the enemies need more components and some of them need references to GameObjects. The way this works is that you wrap your component data in a ComponentDataProxy, which turns it into a MonoBehaviour that can be attached to a prefab. When you call Instantiate on the EntityManager and pass a prefab, it creates an entity with all the component data that was attached to the prefab. All component data is stored in 16 KB chunks of memory called ArchetypeChunks.
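A rough sketch of that workflow, assuming the early Entities API described in this post (the proxy class and spawner here are hypothetical names, not code from the Far North project):

```csharp
using Unity.Entities;
using UnityEngine;

// Wrapping component data in a ComponentDataProxy turns it into a
// MonoBehaviour that can be attached to a prefab in the Editor.
public class PositionData2DProxy : ComponentDataProxy<PositionData2D> { }

public class ZombieSpawner : MonoBehaviour
{
    // Prefab with the proxy components attached in the Editor.
    [SerializeField] private GameObject m_ZombiePrefab;

    private void SpawnHorde(EntityManager entityManager, int count)
    {
        for (int i = 0; i < count; i++)
        {
            // Instantiate creates an entity whose archetype is the set of
            // component data found on the prefab's proxy components.
            Entity zombie = entityManager.Instantiate(m_ZombiePrefab);
        }
    }
}
```

The exact API surface shifted between early Entities package versions, so treat this as an illustration of the flow rather than a drop-in snippet.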

Here is a visualization of how the component streams in our archetype chunk will be organized, so far:

“One of the main advantages of archetype chunks is that you often don’t have to do new heap allocations when creating new entities, since the memory has been allocated upfront. This means that the creation of entities often just means writing data to the end of the component streams inside the archetype chunks. The only time a new heap allocation needs to be done is when you create entities that won’t fit within the chunk boundary. This will either trigger a new 16 KB archetype chunk to be allocated, or, if there is an empty chunk of the same archetype, it can be reused. The data for the new entities will then be written to the component streams of the new chunk,” explains Anders.

Multi-threading your zombies

So now that the data was tightly packed and laid out in a cache-friendly way in memory, the team could easily take advantage of the C# Job System to run their code in parallel on multiple CPU cores.

The next step was creating a System that filtered out all entities from all archetype chunks that have a PositionData2D, HeadingData2D and TargetPositionData component.

To do this Anders and his team created a JobComponentSystem and constructed their query in the OnCreate function. It looks something like this:

```csharp
private EntityQuery m_Group;

protected override void OnCreate()
{
    base.OnCreate();
    var query = new EntityQueryDesc
    {
        All = new[]
        {
            ComponentType.ReadWrite<PositionData2D>(),
            ComponentType.ReadWrite<HeadingData2D>(),
            ComponentType.ReadOnly<TargetPositionData>()
        },
    };
    m_Group = GetEntityQuery(query);
}
```

This declares a query that filters out all entities in the world that have a position, heading, and target. Next, they aimed to schedule jobs each frame via the C# Job System in order to distribute the calculations over multiple worker threads.

“The great thing about the C# Job System is that it is the same system that Unity uses under the hood in its code, so we didn’t have to worry about execution threads halting each other by claiming the same CPU cores and causing performance issues,” says Anders.

The team chose IJobChunk because thousands of enemies meant a lot of archetype chunks would match their query at runtime. IJobChunk distributes the matching chunks over the different worker threads.

Every frame, a new job called UpdatePositionAndHeadingJob is responsible for handling the interpolation of the positions and rotations of enemies in the game.

The code for scheduling the jobs looks like this:

```csharp
protected override JobHandle OnUpdate(JobHandle inputDeps)
{
    var positionDataType = GetArchetypeChunkComponentType<PositionData2D>();
    var headingDataType = GetArchetypeChunkComponentType<HeadingData2D>();
    var targetPositionDataType = GetArchetypeChunkComponentType<TargetPositionData>(true);

    var updatePosAndHeadingJob = new UpdatePositionAndHeadingJob
    {
        PositionDataType = positionDataType,
        HeadingDataType = headingDataType,
        TargetPositionDataType = targetPositionDataType,
        DeltaTime = Time.deltaTime,
        RotationLerpSpeed = 2.0f,
        MovementLerpSpeed = 4.0f,
    };

    return updatePosAndHeadingJob.Schedule(m_Group, inputDeps);
}
```

Here is the declaration of the job:

```csharp
public struct UpdatePositionAndHeadingJob : IJobChunk
{
    public ArchetypeChunkComponentType<PositionData2D> PositionDataType;
    public ArchetypeChunkComponentType<HeadingData2D> HeadingDataType;
    [ReadOnly] public ArchetypeChunkComponentType<TargetPositionData> TargetPositionDataType;
    [ReadOnly] public float DeltaTime;
    [ReadOnly] public float RotationLerpSpeed;
    [ReadOnly] public float MovementLerpSpeed;
}
```

So when a worker thread has pulled a job from its queue, it calls the Execute kernel of that job.

Here is what the execute kernel looks like:

```csharp
public void Execute(ArchetypeChunk chunk, int chunkIndex, int firstEntityIndex)
{
    var chunkPositionData = chunk.GetNativeArray(PositionDataType);
    var chunkHeadingData = chunk.GetNativeArray(HeadingDataType);
    var chunkTargetPositionData = chunk.GetNativeArray(TargetPositionDataType);

    for (int i = 0; i < chunk.Count; i++)
    {
        var target = chunkTargetPositionData[i];
        var positionData = chunkPositionData[i];
        var headingData = chunkHeadingData[i];

        float2 toTarget = target.TargetPosition - positionData.Position;
        float distance = math.length(toTarget);

        headingData.Heading = math.select(
            headingData.Heading,
            math.lerp(headingData.Heading,
                      math.normalize(toTarget),
                      math.mul(DeltaTime, RotationLerpSpeed)),
            distance > 0.008);

        positionData.Position = math.select(
            target.TargetPosition,
            math.lerp(positionData.Position,
                      target.TargetPosition,
                      math.mul(DeltaTime, MovementLerpSpeed)),
            distance <= 1);

        chunkPositionData[i] = positionData;
        chunkHeadingData[i] = headingData;
    }
}
```

“You might notice that we use selects instead of branches and the reason for this is to avoid something called branch misprediction. The select function will evaluate both expressions and choose the one that matches the condition, so if your expressions are not that heavy to calculate, I would recommend using select since it is often cheaper than having to wait for the CPU to recover from a branch misprediction,” Anders points out.
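The idea is branch-free selection: both candidate values are computed, and `math.select` picks one based on the condition, so the branch predictor never has to guess. A minimal sketch of the two styles, assuming Unity.Mathematics (the function names are invented for illustration):

```csharp
using Unity.Mathematics;

public static class SelectExample
{
    // Branching version: the CPU must predict which path is taken,
    // and pays a pipeline flush when it guesses wrong.
    public static float2 MoveBranching(float2 position, float2 target, float t, float distance)
    {
        if (distance <= 1f)
            return math.lerp(position, target, t);
        return target;
    }

    // Branch-free version: both expressions are evaluated, then select
    // returns the second argument when the condition is true,
    // otherwise the first.
    public static float2 MoveSelect(float2 position, float2 target, float t, float distance)
    {
        return math.select(target, math.lerp(position, target, t), distance <= 1f);
    }
}
```

As Anders notes, this trade only pays off when both expressions are cheap; for heavyweight computations a predictable branch can still win.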

Bursting with performance

The last step of the DOTS transformation for the enemy position and heading interpolation was to enable the Burst Compiler. Anders found this quite easy: “Since the data is laid out in contiguous arrays, and since we are using the new Mathematics library from Unity, all we had to do was to add the BurstCompile attribute to our Job”.

```csharp
[BurstCompile]
public struct UpdatePositionAndHeadingJob : IJobChunk
{
    public ArchetypeChunkComponentType<PositionData2D> PositionDataType;
    public ArchetypeChunkComponentType<HeadingData2D> HeadingDataType;
    [ReadOnly] public ArchetypeChunkComponentType<TargetPositionData> TargetPositionDataType;
    [ReadOnly] public float DeltaTime;
    [ReadOnly] public float RotationLerpSpeed;
    [ReadOnly] public float MovementLerpSpeed;
}
```

The Burst Compiler gives us Single Instruction, Multiple Data (SIMD): machine instructions that operate on multiple sets of input data and produce multiple sets of output data in a single instruction. That helps us fill more seats on the 128-bit cache bus with the right data.
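For instance, arithmetic on a Unity.Mathematics `float4` expresses four lane-wise operations at once, which Burst can lower to a single 128-bit SIMD instruction (a sketch, not code from the Far North project):

```csharp
using Unity.Mathematics;

public static class SimdSketch
{
    // One float4 add is four scalar adds side by side; inside a Burst-compiled
    // job this can become a single 128-bit SIMD instruction instead of four
    // scalar ones.
    public static float4 Add(float4 a, float4 b)
    {
        return a + b;
    }
}
```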

The Burst Compiler in combination with a cache-friendly data layout and the job system gave the team enormous speedups. Here is a performance table they put together after measuring the difference after each step of the transformation.

This meant that Far North completely eliminated the bottlenecks related to the client-side interpolation of zombie positions and headings. Their data is now laid out in a cache-friendly way and their cache lines are populated with only relevant data. All cores of the CPU get a workout, and the Burst Compiler outputs highly optimized machine code with SIMD instructions.

DOTS tips and tricks from Far North Entertainment

Start thinking in streams of data, since in ECS, entities are just lookup indices into parallel streams of component data.

Think of ECS as a relational database where Archetypes are tables, components are columns, and entities are indices within the table (rows).

Organize your data in contiguous arrays to utilize the CPU caches and hardware prefetcher.

Resist your first instinct to create an object hierarchy and to build the most general solution before you understand the actual problem you are trying to solve.

Think about garbage collection. Avoid excessive heap allocations in performance-critical areas; use Unity’s new native containers instead. But beware that you must handle the cleanup manually.
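A minimal sketch of that pattern with a NativeArray (the function itself is a hypothetical example):

```csharp
using Unity.Collections;

public static class NativeContainerExample
{
    public static float SumValues(int count)
    {
        // Allocated from native memory, so it never touches the GC heap.
        var values = new NativeArray<float>(count, Allocator.Temp);

        float sum = 0f;
        for (int i = 0; i < values.Length; i++)
        {
            sum += values[i];
        }

        // Native containers are not garbage collected: forgetting Dispose
        // leaks the allocation (and triggers a leak warning in the Editor).
        values.Dispose();
        return sum;
    }
}
```

Choosing the right Allocator (Temp, TempJob, or Persistent) for the container's lifetime is part of the same discipline.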

Understand the cost of your abstractions, and beware of virtual function call overhead.

Utilize all cores of the CPU with the C# Job System.

Understand the hardware. Is the Burst compiler generating SIMD instructions? Use the Burst Inspector for analyzing.

Stop wasting cache lines. Think of packing data into cache lines as you do when packing data into UDP packets.

The top tip that Anders Eriksson wants to share is more general advice for anyone whose project is already in production: “Try to identify specific areas in your game where you’re having performance issues and see if you can start to apply DOTS in that isolated area. You don’t have to transform the entire code base!”

Going forward

“We want to adopt DOTS in more areas of our game and we were quite excited about Unite announcements on DOTS animations, Unity Physics, and Live Link. We would like to be able to convert more of our game objects to ECS entities and it seems like Unity is making good progress in making that a reality,” concludes Anders.

If you have more questions for the Far North Entertainment team, we recommend you join their Discord!

Check out our Unite Copenhagen DOTS playlist to learn how other innovative game studios use DOTS to make great games faster, and how all the upcoming DOTS-powered components, including DOTS Physics, our new Conversion Workflow, and the Burst Compiler, work together.