In my last post, I talked about achieving 160 requests per frame using Burst-compiled code. After further study and iteration, I found that I could do better.

While watching the profiler of my old implementation, I realized that each A* search was being executed one after the other. They were distributed across multiple threads, but they weren’t running in parallel. Unused thread execution time is wasteful. I needed to write code that could execute A* searches in parallel. It wasn’t as easy as I initially thought. I had to run through some Unity ECS gotchas.

A* searches are distributed in all threads but are not running in parallel.

Obstacles

My first instinct was to use IJobParallelFor. I could schedule an IJobParallelFor for each chunk of requests, executing an A* search for each index:

struct AStarSearchParallelFor : IJobParallelFor {
    // The request entities
    [ReadOnly]
    public NativeArray<Entity> entities;

    public void Execute(int index) {
        // Perform a single A* search here
    }
}

This also implies that I have to instantiate local containers: a list for nodes, an open set that uses a heap, and a hash map that acts like a hash set for the close set. I couldn’t get this to work with the Burst compiler. I always got errors when disposing a container within the job struct. Somehow, this is not supported by Burst.

While rummaging through the ECS forums, I found a thread that answered my problem. It said that you don’t need to dispose containers that are allocated within a job using Allocator.Temp. I don’t have concrete evidence that this is true, except that I tried it in a test system and watched whether my memory usage grows (due to undisposed containers). My memory usage is not growing at all. So I’m going to have to trust the programmer who replied in that thread.

struct AStarSearchParallelFor : IJobParallelFor {
    // The request entities
    [ReadOnly]
    public NativeArray<Entity> entities;

    public void Execute(int index) {
        Search search = new Search() {
            // Set parameters here
        };
        search.Execute();
    }

    // The whole A* algorithm is implemented here
    // I did it this way so I don't have to pass around too many
    // parameters if it were made by only using functions
    private struct Search {
        private NativeList<AStarNode> allNodes;
        private OpenSet openSet;
        private NativeHashMap<int2, byte> closeSet;

        public void Execute() {
            // It's ok not to dispose these (I hope)
            this.allNodes = new NativeList<AStarNode>(10, Allocator.Temp);
            GrowingHeap heap = new GrowingHeap(new NativeList<HeapNode>(10, Allocator.Temp));
            this.openSet = new OpenSet(heap, new NativeHashMap<int2, AStarNode>(10, Allocator.Temp));
            this.closeSet = new NativeHashMap<int2, byte>(10, Allocator.Temp);

            ... // Rest of the algorithm
        }
    }
}

From here, the next step is to pass the required components from chunks. My A* job requires four components:

AStarSearchParameters – Stores the start and goal positions

AStarPath – Stores whether the path is reachable or not. Also stores the number of positions to step through to get to the goal. Basically, this component can be used to traverse through the path.

DynamicBuffer<Int2BufferElement> – The list of positions from A* search will be stored here

Waiting – This is a common component that I use that has a “done” boolean. This will be used by agents to see if their A* request is already done so they can proceed to their next action.

Since I’m working with chunks, it’s natural to use NativeArray for the components and a BufferAccessor for the list of positions. But there’s a problem: BufferAccessor can’t be used in this case. I made a test for this:

private struct JobParallelFor : IJobParallelFor {
    [NativeDisableContainerSafetyRestriction]
    public BufferAccessor<Int2BufferElement> accessor;

    public void Execute(int index) {
        DynamicBuffer<Int2BufferElement> buffer = this.accessor[index];
        buffer.Clear();
        for (int i = 0; i < 1000; ++i) {
            buffer.Add(new Int2BufferElement() {
                value = new int2(i, i)
            });
        }
    }
}

The job above is test code I made to see if I can use BufferAccessor in an IJobParallelFor. It works for only one chunk. If there’s more than one chunk to process, it throws errors about a NativeArray that can’t be accessed while a previous job is still writing to it. Intuitively, it should work since I’m accessing one and only one DynamicBuffer per Execute(). But I’m wrong. Still totally an ECS noob.

I needed to find another way. Then I remembered my old implementation, which used BufferFromEntity instead. This can still be applied to my parallel solution since I can retrieve the entities from chunks. I made a simple test to see if it would work in parallel:

private struct JobWithBufferFromEntity : IJobParallelFor {
    [ReadOnly]
    public NativeArray<Entity> entities;

    [NativeDisableContainerSafetyRestriction]
    public BufferFromEntity<Int2BufferElement> allBuffers;

    public void Execute(int index) {
        Entity entity = this.entities[index];
        DynamicBuffer<Int2BufferElement> buffer = this.allBuffers[entity];
        buffer.Clear();
        for (int i = 0; i < 1000; ++i) {
            buffer.Add(new Int2BufferElement() {
                value = new int2(i, i)
            });
        }
    }
}

Surprisingly, this works even with multiple chunks. With this, I wrote my A* job that runs in parallel:

[BurstCompile]
public struct AStarSearchParallelFor<HeuristicCalculator, ReachabilityType> : IJobParallelFor
    where HeuristicCalculator : struct, HeuristicCostCalculator
    where ReachabilityType : struct, Reachability {
    // The request entities
    [ReadOnly]
    public NativeArray<Entity> entities;

    [NativeDisableContainerSafetyRestriction, ReadOnly]
    public ComponentDataFromEntity<AStarSearchParameters> allParameters;

    [NativeDisableContainerSafetyRestriction, WriteOnly]
    public ComponentDataFromEntity<AStarPath> allPaths;

    [NativeDisableContainerSafetyRestriction, WriteOnly]
    public ComponentDataFromEntity<Waiting> allWaiting;

    [NativeDisableContainerSafetyRestriction, WriteOnly]
    public BufferFromEntity<Int2BufferElement> allPathLists;

    [ReadOnly]
    public ReachabilityType reachability;

    private HeuristicCalculator heuristicCalculator;

    // This will be specified by client on whether it wants to include diagonal neighbors
    [ReadOnly]
    public NativeArray<int2> neighborOffsets;

    [ReadOnly]
    public GridWrapper gridWrapper;

    // Execute search per entity in entities
    public void Execute(int index) {
        Search search = new Search() {
            entity = this.entities[index],
            allParameters = this.allParameters,
            allPaths = this.allPaths,
            allPathLists = this.allPathLists,
            reachability = this.reachability,
            heuristicCalculator = this.heuristicCalculator,
            neighborOffsets = this.neighborOffsets,
            gridWrapper = this.gridWrapper
        };
        search.Execute();

        // Update waiting
        this.allWaiting[this.entities[index]] = new Waiting() {
            done = true
        };
    }

    private struct Search {
        public Entity entity;
        public ComponentDataFromEntity<AStarSearchParameters> allParameters;
        public ComponentDataFromEntity<AStarPath> allPaths;
        public BufferFromEntity<Int2BufferElement> allPathLists;
        public ReachabilityType reachability;
        public HeuristicCalculator heuristicCalculator;
        public NativeArray<int2> neighborOffsets;
        public GridWrapper gridWrapper;

        private NativeList<AStarNode> allNodes;
        private OpenSet openSet;
        private NativeHashMap<int2, byte> closeSet;

        private int2 goalPosition;

        public void Execute() {
            // Instantiate containers
            this.allNodes = new NativeList<AStarNode>(10, Allocator.Temp);
            GrowingHeap heap = new GrowingHeap(new NativeList<HeapNode>(10, Allocator.Temp));
            this.openSet = new OpenSet(heap, new NativeHashMap<int2, AStarNode>(10, Allocator.Temp));
            this.closeSet = new NativeHashMap<int2, byte>(10, Allocator.Temp);

            AStarSearchParameters parameters = this.allParameters[this.entity];
            this.goalPosition = parameters.goal;

            float startNodeH = this.heuristicCalculator.ComputeCost(parameters.start, this.goalPosition);
            AStarNode startNode = CreateNode(parameters.start, -1, 0, startNodeH);
            this.openSet.Push(startNode);

            float minH = float.MaxValue;
            Maybe<AStarNode> minHPosition = Maybe<AStarNode>.Nothing;

            // Process while there are nodes in the open set
            while (this.openSet.HasItems) {
                AStarNode current = this.openSet.Pop();

                if (current.position.Equals(this.goalPosition)) {
                    // Goal has been found
                    int pathCount = ConstructPath(current);
                    this.allPaths[this.entity] = new AStarPath(pathCount, true);
                    return;
                }

                ProcessNode(current);
                this.closeSet.TryAdd(current.position, 0);

                // We save the node with the least H so we could still try to locate
                // the nearest position to the destination
                if (current.H < minH) {
                    minHPosition = new Maybe<AStarNode>(current);
                    minH = current.H;
                }
            }

            // Open set has been exhausted. Path is unreachable.
            if (minHPosition.HasValue) {
                int pathCount = ConstructPath(minHPosition.Value);
                this.allPaths[this.entity] = new AStarPath(pathCount, false); // false for unreachable
            } else {
                this.allPaths[this.entity] = new AStarPath(0, false);
            }
        }

        private AStarNode CreateNode(int2 position, int parent, float g, float h) {
            int index = this.allNodes.Length;
            AStarNode node = new AStarNode(index, position, parent, g, h);
            this.allNodes.Add(node);
            return node;
        }

        private void ProcessNode(in AStarNode current) {
            if (IsInCloseSet(current.position)) {
                // Already in closed set. We no longer process because the same node with lower F
                // might have already been processed before. Note that we don't fix the heap. We just
                // keep on pushing nodes with lower scores.
                return;
            }

            // Process neighbors
            for (int i = 0; i < this.neighborOffsets.Length; ++i) {
                int2 neighborPosition = current.position + this.neighborOffsets[i];

                if (current.position.Equals(neighborPosition)) {
                    // No need to process if they are equal
                    continue;
                }

                if (!this.gridWrapper.IsInside(neighborPosition)) {
                    // No longer inside the map
                    continue;
                }

                if (IsInCloseSet(neighborPosition)) {
                    // Already in close set
                    continue;
                }

                if (!this.reachability.IsReachable(current.position, neighborPosition)) {
                    // Not reachable based from specified reachability
                    continue;
                }

                float tentativeG = current.G + this.reachability.GetWeight(current.position, neighborPosition);
                float h = this.heuristicCalculator.ComputeCost(neighborPosition, this.goalPosition);
                if (this.openSet.TryGet(neighborPosition, out AStarNode existingNode)) {
                    // This means that the node is already in the open set
                    // We update the node if the current movement is better than the one in the open set
                    if (tentativeG < existingNode.G) {
                        // Found a better path. Replace the values.
                        // Note that creation automatically replaces the node at that position
                        AStarNode betterNode = CreateNode(neighborPosition, current.index, tentativeG, h);

                        // Only add to open set if it's a better movement
                        // If we just push without checking, a node with the same g score will be pushed
                        // which causes infinite loop as every node will be pushed
                        this.openSet.Push(betterNode);
                    }
                } else {
                    AStarNode neighborNode = CreateNode(neighborPosition, current.index, tentativeG, h);
                    this.openSet.Push(neighborNode);
                }
            }
        }

        // Returns the position count in the path
        private int ConstructPath(AStarNode destination) {
            // Note here that we no longer need to reverse the ordering of the path
            // We just add them as reversed in AStarPath
            // AStarPath then knows how to handle this
            DynamicBuffer<Int2BufferElement> pathList = this.allPathLists[this.entity];
            pathList.Clear();

            AStarNode current = this.allNodes[destination.index];
            while (current.parent >= 0) {
                pathList.Add(new Int2BufferElement(current.position));
                current = this.allNodes[current.parent];
            }

            return pathList.Length;
        }

        public bool IsInCloseSet(int2 position) {
            return this.closeSet.TryGetValue(position, out _);
        }
    }
}
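The open-set handling above is a lazy-deletion pattern: instead of decreasing a key inside the heap, a better copy of the node is pushed, and stale or already-closed entries are skipped when popped. Here is a minimal sketch of that pattern in Python (hypothetical names, just to illustrate the idea, not a translation of the C# code):

```python
import heapq

def push(heap, best_g, pos, g, h):
    # Only push when this g is strictly better, mirroring the
    # "tentativeG < existingNode.G" check that prevents endless re-pushing
    if g < best_g.get(pos, float("inf")):
        best_g[pos] = g
        heapq.heappush(heap, (g + h, g, pos))  # entries are (f, g, position)

def lazy_pop(heap, best_g, closed):
    # Pop until we find an entry that is neither closed nor stale
    while heap:
        f, g, pos = heapq.heappop(heap)
        if pos in closed:
            continue  # already finalized; skip the leftover duplicate
        if g > best_g.get(pos, float("inf")):
            continue  # a better copy was pushed later; skip this one
        return pos, g
    return None  # open set exhausted
```

The heap is never repaired in place; duplicates simply sit in it until they are popped and discarded, which is usually cheaper than implementing decrease-key.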

Job System

Using AStarSearchParallelFor in a job system should be trivial, or so I thought:

private JobHandle Process(in ArchetypeChunk chunk, in SimpleReachability reachability, JobHandle inputDeps) {
    AStarSearchParallelFor<SimpleHeuristicCalculator, SimpleReachability> searchJob =
        new AStarSearchParallelFor<SimpleHeuristicCalculator, SimpleReachability>() {
            entities = chunk.GetNativeArray(this.entityType),
            allParameters = this.allParameters,
            allPaths = this.allPaths,
            allWaiting = this.allWaiting,
            allPathLists = this.allPathLists,
            reachability = reachability,
            neighborOffsets = this.neighborOffsets,
            gridWrapper = this.gridSystem.GridWrapper
        };

    return searchJob.Schedule(chunk.Count, 64, inputDeps);
}

I made some request entities with random start and goal positions so that the system that schedules the job would run. I tried 1 request and saw that it works. I tried 2. It worked, but not in parallel. So I increased it to 10. Still not in parallel. I tried 20: not in parallel. Then 50. Still not in parallel. I was quite disappointed.

So I hung out on the forums as a break. Luckily, I found a thread where a programmer was trying to do something similar to what I'm doing. He wanted to run parallel jobs for each entity because his processing is heavy. Somebody replied that he ought to use IJobParallelFor with a low innerloopBatchCount, the second parameter of Schedule() (64 in my code). At that moment, I realized what that parameter is for. (I didn’t know. I just follow example code. Don’t judge me.)

Passing in 64 means that execution is handed out to worker threads in batches of 64 calls to Execute(). It acts like a divisor that determines how many execution batches to create. Say chunk.Count is equal to 512. This means that 8 execution batches would be created because 512 / 64 = 8. These 8 batches would then be distributed to the worker threads in parallel. At least that’s how I understood it.

In my case, there would always be only one execution batch because my request count never reached 64. I should pick a reasonable number such that there’s a higher chance that execution batches are distributed evenly (not all A* requests are computed equally). Setting it to 1 means that there would be one execution batch per request. But this is not without cost, I would surmise. There should be a cost associated with creating an execution batch and assigning a thread to it. So I picked 2 instead. It’s good enough that batches are distributed evenly while not creating so many that they pressure the scheduler.
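To make the arithmetic concrete, here is a small Python sketch of that mental model (my understanding of innerloopBatchCount, not Unity's actual scheduler code):

```python
import math

def batch_count(request_count, innerloop_batch_count):
    # Number of execution batches an IJobParallelFor schedule can hand
    # out to worker threads; only multiple batches can run in parallel
    return math.ceil(request_count / innerloop_batch_count)

# With the original value of 64, my small request counts always fit
# in a single batch, so only one worker thread ever got work:
print(batch_count(50, 64))   # 1 batch, no parallelism
print(batch_count(512, 64))  # 8 batches
print(batch_count(50, 2))    # 25 batches, spread across workers
```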

private JobHandle Process(in ArchetypeChunk chunk, in SimpleReachability reachability, JobHandle inputDeps) {
    AStarSearchParallelFor<SimpleHeuristicCalculator, SimpleReachability> searchJob =
        new AStarSearchParallelFor<SimpleHeuristicCalculator, SimpleReachability>() {
            entities = chunk.GetNativeArray(this.entityType),
            allParameters = this.allParameters,
            allPaths = this.allPaths,
            allWaiting = this.allWaiting,
            allPathLists = this.allPathLists,
            reachability = reachability,
            neighborOffsets = this.neighborOffsets,
            gridWrapper = this.gridSystem.GridWrapper
        };

    // Just changed 64 to 2
    return searchJob.Schedule(chunk.Count, 2, inputDeps);
}

Once I did this, I could see that the A* searches were running in parallel:

Sweet parallel execution. This is 60 requests.

Benchmark

I ran the same benchmark as in my last article: a test environment where I can increase or decrease the number of requests until a frame takes more than 16ms. The last record was 160 requests. This time, I could make 300 requests before going past 16ms.

Threads are now almost filled.

Note that this is still using a generic A*. It could be further optimized with algorithmic improvements like Jump Point Search or HPA*. For our use case, however, we probably won’t need to; 20 requests per frame would already be good for us.

Next Steps

I’m slowly working on applying this to our game, Academia. I’m almost there. This will be another part of the game that will be optimized using Unity’s DOTS. But that will be for another post.