Some of you lovely folks reminded me it's been a really long time since I did an update. Sorry 'bout that. I've been busy.

"Doing what?" I hear you cry (probably). Well, lots of things. Some bug fixing (e.g. making the decals on the planet surface render properly), adding features to make the artists happy, stuff like that. But the biggest focus has been performance. Below is a screengrab from UberProbe (our internal profiling tool), showing a capture from a late game.

Now, for this to make sense, I need to describe a couple of things. This view shows time spent on the CPU (where we are hitting the biggest performance issues). Time is the x axis (so it reads left to right). In this case, a frame is taking around 40ms (so running at around 25fps). Each vertical segment (the gray and blue background bars) is a thread (there are two in this capture). The green blocks show time spent in functions, with the function callstack (i.e. things that call into other things) arranged vertically.

Now, the observant reader will notice that we're doing all our work on a single thread here. This has made the rapid development schedule a lot easier to manage, but it's also restricted our performance quite a bit. Here, we can see a chunk of time spent in the "update" part of our loop (where we process data sent from the server, update entity positions, and so on), and then even more time spent in the "render" section.

So my focus has been on moving some of this work onto other cores, so we can make the most of all those nice beefy multi-core CPUs you guys are playing on. This is not trivial (at all). Some reasons this is hard:

- OpenGL has the concept of the "context thread", which means you can only make OpenGL calls on the thread (or CPU core) that owns the context. You can move it around, but that's expensive, and it would also require some fancy synchronization footwork.

- DirectX 11 and above support something called "deferred contexts", where you can record GPU command streams on any thread and then just submit them on your main thread. This means you can spread rendering work across multiple cores. I believe Mantle (AMD's API) can do that too. Unfortunately, OpenGL does not support this, even with an extension, so we are SOL.

- We have collections of entities (things in the game - particle systems, units, buildings, etc.), and you can't have multiple threads touching these collections at the same time. Bad things can happen (well, the game will crash).

- There are ordering dependencies between the update and the render. By that I mean there's stuff in the render part that depends on stuff that's being done in the update - that's why the update is done first. This makes it very hard to do things in parallel.

- We use things called "synchronization primitives" to manage how we access shared data across threads. These include "mutexes", "critical sections" and something I added called "ticket locks". The first two are operating system concepts, so they have some overhead. The last one is my own thingie (there's a small sketch of it below), and to wait you need to do something called "spinning", where you just loop around a small piece of code waiting for something to change. Ticket locks spin, which is OK if you do it for very small periods of time, but if you do it too much, the operating system goes "oh-ho! You appear to be misbehaving! I am going to deprioritize you now!" and your framerate goes in the toilet.

- You also get the OS scheduling other stuff, which takes time away from your threads. It's not polite about it either, and will do so right when you're in the middle of something. Fortunately, the way our threading now works, we don't see this have much of an effect.
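Since I keep mentioning ticket locks, here's the promised sketch. This is a minimal, illustrative version of the take-a-number idea, not our actual engine code:

```cpp
#include <atomic>

// Illustrative ticket lock. Threads take a ticket, then spin until
// the "now serving" counter reaches their number - like a deli
// counter - so the lock is handed out first-come, first-served.
class TicketLock {
    std::atomic<unsigned> next_ticket{0};
    std::atomic<unsigned> now_serving{0};

public:
    void lock() {
        // Atomically take the next ticket number.
        const unsigned my_ticket =
            next_ticket.fetch_add(1, std::memory_order_relaxed);
        // Spin until it's our turn. This is the "spinning" described
        // above: fine for very short waits, but spin too long and the
        // OS will start deprioritizing the thread.
        while (now_serving.load(std::memory_order_acquire) != my_ticket) {
            // A production spin loop would usually issue a CPU pause
            // hint here to be kinder to the core.
        }
    }

    void unlock() {
        // Hand the lock to the next waiter in line.
        now_serving.fetch_add(1, std::memory_order_release);
    }
};
```

The win over a plain spinlock is fairness: waiters acquire the lock in the order they arrived, so no thread gets starved. The downside is the spinning itself, as described above.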
Before I waffle more, here's another picture. This one shows the current state of things:

You can see here, we now have a number of threads running, doing work.

I'm actually getting this out in two phases, and there are two methods I'm using to scale. What will be shipping out next (soon... soooooon...) is the following:

First, I added a "parallel for" concept. What this does is allow us to take a chunk of work (say, 1000 units that need updating) and spread the work across many threads. The left-most part of the probe here shows that. Basically, unit skinning/animation, updates, etc. are done in a big parallel update loop that just reduces the time it takes to run.

The implementation I ended up with actually makes each thread take a large chunk of the remaining work (so for 4 threads and 1000 units, each thread would take 200 or so the first time), and then a proportionally smaller chunk each successive time. Since different bits of work can take different amounts of time, this helps avoid too much overhead from "thread drainout" - where all the other threads are stalled waiting for one that took on too much to finish. It also means we reduce the book-keeping overhead of making each thread pull one item at a time (since they work in batches). And, as well, that helps with cache coherency, since you have threads all pounding on data that's adjacent in memory. (There's a sketch of this chunking scheme below.)

So for animations in a very late game, I was seeing around 10ms to update them all. By doing it in parallel, that cuts it to 2-3ms, which makes a big difference.

The parallel-for implementation I added also has the concept of an asynchronous version. What that means is you can say "do all this in parallel, but let me keep doing something else". So for particle system depth-sorting and GPU data-writes, I just throw that off over the fence and it gets done in the background. Later in the frame, when we want to draw them, they're all just magically ready. In the single-threaded implementation, I saw this work take up to 20ms a frame (which is a lot!). By making it asynchronous, that 20ms just goes away.

This gets a particular late-game replay on my machine from around 8-10fps to around 20fps. Not there yet, but much, much better.

Coming slightly less soon (since it's not fully stabilized yet) is moving a large chunk of the update to run in parallel with the main render block. This can cut another large chunk of time out of the frame, and make even more improvements. To do this, I had to break the render chunk into multiple bits, and schedule stuff in the background. I also had to make a lot of the changes we make to collections (meshes, lights, particles) deferred; that means, when the parallel update makes a change to a light, we don't change the light just yet. We take the delta and add it to a list. When we're done with the rendering, we just batch-add all those changes before looping to the next frame.
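To make that deferred-changes idea concrete, here's a small sketch. The Light and LightDelta types are hypothetical stand-ins (the real collections are obviously more involved): the update threads only record deltas, and the main thread applies them all in one batch once rendering is done.

```cpp
#include <mutex>
#include <vector>

// Hypothetical light type, just for illustration.
struct Light { float x, y, z; float intensity; };

// A recorded change: which light, and what to apply to it.
struct LightDelta {
    int   light_index;
    float dx, dy, dz;   // position change
    float dintensity;   // intensity change
};

class DeferredLightChanges {
    std::mutex deltas_mutex;
    std::vector<LightDelta> deltas;

public:
    // Called from the parallel update. We don't touch the light
    // itself - rendering may be reading it - we just queue the delta.
    void record(const LightDelta& delta) {
        std::lock_guard<std::mutex> guard(deltas_mutex);
        deltas.push_back(delta);
    }

    // Called on the main thread after rendering is finished, before
    // the next frame starts: apply everything in one batch.
    void apply_all(std::vector<Light>& lights) {
        std::lock_guard<std::mutex> guard(deltas_mutex);
        for (const LightDelta& d : deltas) {
            Light& light = lights[d.light_index];
            light.x += d.dx;
            light.y += d.dy;
            light.z += d.dz;
            light.intensity += d.dintensity;
        }
        deltas.clear();
    }
};
```

The point is that nothing mutates the live collection while the renderer might be reading it; all the changes land at one well-defined point in the frame.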
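And here's the parallel-for sketch promised earlier. This is a simplified, hypothetical version - assume the real thing hands work to a persistent thread pool rather than spawning threads every call - but the chunking scheme is the one described above: each thread claims a share of whatever work remains, so chunks start big and shrink toward the end.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Simplified parallel_for sketch. Each thread repeatedly claims a
// chunk sized at (remaining / thread count): big chunks early keep
// the per-chunk book-keeping low, while the small tail chunks stop
// one unlucky thread from stalling everyone else at the end.
void parallel_for(std::size_t count,
                  const std::function<void(std::size_t)>& body,
                  unsigned num_threads = std::thread::hardware_concurrency())
{
    if (num_threads == 0) num_threads = 1;  // hardware_concurrency may return 0
    std::atomic<std::size_t> next{0};

    auto worker = [&]() {
        for (;;) {
            // Snapshot how much is left and claim a proportional chunk.
            std::size_t begin = next.load(std::memory_order_relaxed);
            std::size_t end;
            do {
                if (begin >= count) return;  // all work has been claimed
                const std::size_t remaining = count - begin;
                const std::size_t chunk =
                    std::max<std::size_t>(1, remaining / num_threads);
                end = begin + chunk;
            } while (!next.compare_exchange_weak(begin, end,
                                                 std::memory_order_relaxed));
            // Run our claimed batch; the items are adjacent in memory,
            // which is kind to the cache.
            for (std::size_t i = begin; i < end; ++i) body(i);
        }
    };

    std::vector<std::thread> threads;
    for (unsigned t = 1; t < num_threads; ++t) threads.emplace_back(worker);
    worker();  // the calling thread helps too
    for (std::thread& t : threads) t.join();
}
```

Calling it looks like parallel_for(units.size(), [&](std::size_t i) { update_unit(i); }); (update_unit being a made-up stand-in). The asynchronous version is the same idea minus the join at the end: kick the workers off, return immediately, and wait on a completion flag later in the frame when the results are actually needed.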
This deferred-update change gets us even more speed. In the example above, that 8-10fps late game now runs at 25-30fps (with some hitches, which I'm working on AFTER that).

On a quad-core system (like my work machine), this takes CPU usage from around 8% to 25-33%. Bear in mind, on a hyperthreaded system the ideal is not actually 100%, since you may actually run slower in that case (I can talk more about that if anyone is interested). In this case, 50% or so is ideal.

Even though I'd love to be able to set everyone's CPUs on fire because we're just maximizing their power, I doubt we'll get close to doing that. But we're getting better, and the gains will keep on coming.