A while ago I posted some comments on the press coverage of an event by Valve, where they explained threading in the Source Engine to a couple of tech-journalists. Unfortunately, the coverage left many open questions and some things even appeared to be wrong. Fortunately, Tom Leonard, the threading guru from Valve is a nice guy and agreed to answer some of my questions via email. This post contains the conversation we had.

Michael: Both articles claim that existing threading engines were inefficient when it came to games. Inefficient is a broad description, which engines did you evaluate what exactly made them unsuitable for your game? Latency? Thread-Switching-Costs? Some other kind of overhead? Or something else?

Tom: To clarify, although we considered actually writing our own low level threading system, this idea was just a thought exercise. At the low level, we’re using operating system threads. We just tend to micromanage more than the average application. To clarify, although we considered actually writing our own low level threading system, this idea was just a thought exercise. At the low level, we’re using operating system threads. We just tend to micromanage more than the average application. At the OS level, while we’re obliged at times to use OS features to play nice (e.g., Windows events), we generally avoid them unless totally necessary because thier cost (in time) is very hard to predict. For example, a hand-built mutex using an atomic compare exchange spin loop with a custom back-off is way more efficient and predictable than a Windows critical section.

Michael: How much work is involved in having to maintain your own threading engine along with the game? As far as I know, the source engine even runs on multiple platforms (e.g. on the Xbox, which is built on PowerPCs if I am not mistaken), which makes the situation even more challenging, right?

Tom: The PC is more challenging than the Xbox. The PC is a variable set of cores, different manufacturers, different generations within one manufacturer, with arbitrary code running both in other processes in DirectX. On the other hand the Xbox is a fixed number of cores with a well defined CPU sharing model with the Xbox UI. The PC is more challenging than the Xbox. The PC is a variable set of cores, different manufacturers, different generations within one manufacturer, with arbitrary code running both in other processes in DirectX. On the other hand the Xbox is a fixed number of cores with a well defined CPU sharing model with the Xbox UI. So the underlying framework on the 360 is a subset of the PC. The 360 tends to differ in that on the 360 the code being threaded tends to have a stronger opinion of which core it should be assigned to.

Michael: Maybe I don’t understand enough about game internals and the challenges you face in game programming to really understand that decision. What exactly is so different about games that you had to invent your own system for it? Scarce resources (especially on consoles)? Soft realtime requirements? Or something else?

Tom: Primarily the fact that games want to be a real-time simulation, and Windows is not a real time operating system. As a game programmer, I need to make decisions about feature sets and complexity with the confidence that I can control the framerate implications. Because of this, using opaque tools really isn’t an option. Primarily the fact that games want to be a real-time simulation, and Windows is not a real time operating system. As a game programmer, I need to make decisions about feature sets and complexity with the confidence that I can control the framerate implications. Because of this, using opaque tools really isn’t an option. Futhermore we know so much more about what we’re doing at a macroscopic scale that we can generally make much better decisions about when and where to execute code than a general purpose library could.

Michael: How does your threading-engine cope with these problems? From the articles it appears you are working on a very low level (assembly) to solve the problems – but the usual threading engines can do that as well…

Tom: The primitives are assembly where needed (various flavors of CAS operations). Everything else is in C++.

Michael: Is the assumption I made in my comments that you are not using lock-free algorithms correct?

Tom: Incorrect. We’re striving to make our code as lock free as we can, as it’s the only way to scale. Part of our core toolkit is a set of lock-free data structures (all CAS based, some with back off schemes, some with thread cooperation schemes). Incorrect. We’re striving to make our code as lock free as we can, as it’s the only way to scale. Part of our core toolkit is a set of lock-free data structures (all CAS based, some with back off schemes, some with thread cooperation schemes). Based on these primitives I’ve built a set higher-level systems to help programmers express things in a lock-free manner. Sometimes on Windows the code has to go into a wait state to cooperate with other processes. But on the 360 the code can run full-throttle lock-free. I’ve also used these tools to make all of our core systems (allocators, resource management, world representations) lock-free. In the case of the presentation last month, for the purposes of communicating to the layperson, it was easier to overload the term “lock-free” in a general sense than the more specific “lock-free”, since the former presupposes the latter, and the latter is only interesting if you’ve committed to the former as an design goal. I’ve never really tried to articulate this idea, so forgive me if this doesn’t make much sense. I’m of the opinion that the very strict “lock-free” definition is interesting for people implementing the building blocks, but only marginally useful to general programmers. Really applying lock free algorithms requires a corresponding code and data design philosophy in higher level code, which is where I’ve been spending most of my effort. For me personally, at times it reminds me of the sort of mental shift I had to go through in the late eighties as OO started surpassing structured programming. To be successful you have to get your head into this space of non-deterministic code whose reality might switch from instruction to instruction, where you leverage inference rather than locks to know part of the system is stable. For example, the work distribution system sends prioritized jobs out to pools of threads or specific threads. Using a pairing of a set of lock free queues and an interlocked integer it’s able to support multiple providers and multiple servicers with good scalability. While the lock free queues and stacks were a precondition to achieving this, the most interesting step was understanding how the API needed to express itself to allow it (e.g., a few discrete priorities), and how the worker code needed to be structured (i.e., make the fact that it gets dibs to do a work unit disjoint from the actual unit it ends up processing). I’m sure this sort of stuff is well understood in domains where threading has been prevalent. It’s new to gaming.

Michael: This I find very interesting. It sounds like you ended up implementing a whole lot of functionality lock-free. Personally, I have only very limited experience with lock-free programming (I generally try to stay portable, because we have access to a lot of different machines – a goal that does not mix well with lock-free programming, especially since I have not found any good and portable lock-free libs). I have done an experiment with a small lock-free library some time ago, however and compared a lock-free taskpool with a hand-written one (using in part thread-local storage for the latter to avoid locking at all) and the results under heavy load were strongly in favor of the hand-written version – this convinced me at the time that lock-free was not something I had to get into immediately – these CAS-ops still do take a lot of time. Maybe I will have to revisit that decision sometime, especially since I was of course comparing apples and oranges, since the lock-free version did not get to use thread-local storage.

But related to this, I have another question: Avoiding locks in the first place by making certain parts of the problem private to each thread is a technique that has worked for me a lot of times. I could imagine that it could work in a game engine as well, e.g. by partitioning the world into smaller pieces that only one thread ends up processing. Are you doing something similar? Or is that not feasible for an FPS?

Tom: Absolutely, and I too consider this to be superior even to a lock free approach when feasable. The best example I can think of in our code is a data structure known as the “spatial partition”. Our system is a client-server architecture where the server and client each have a similar but not identical opinion of the state of the world. Previously this was a single voxel hash that contained the server view, the client view, and all of the shared view (the static world). Rather than make this thread safe, I chose instead to split into two separate trees, duplicating the shared view in each. Absolutely, and I too consider this to be superior even to a lock free approach when feasable. The best example I can think of in our code is a data structure known as the “spatial partition”. Our system is a client-server architecture where the server and client each have a similar but not identical opinion of the state of the world. Previously this was a single voxel hash that contained the server view, the client view, and all of the shared view (the static world). Rather than make this thread safe, I chose instead to split into two separate trees, duplicating the shared view in each. Another pattern that seems to recur is some sort of lock-free system to get a “ticket” for a given instance of work or of a resource, and then run that work in on an independent instance represented by that ticket. The pool of instances may be limited, but with the right tuning any real contention can be limited.

Michael: My previous questions were all about the internals of your threading-system, but obviously a lot of work went into making it easy to use, an area I personally find very important and challenging. From a very broad perspective, what did you and your users find so easy to work with in the system? Or asked differently: What does your threading system look like from a users perspective? Or were you even able to hide threading from most developers – and if yes, how?

Tom: While I expose a set of lower level tools for programmers who know what they’re doing, at a high level I expose a small set of tools to express a few general techniques. I also rely heavily on C++ template mechanisms to coerce the compiler into generating all of the marshalling code so programmers don’t need to. Here are some examples… While I expose a set of lower level tools for programmers who know what they’re doing, at a high level I expose a small set of tools to express a few general techniques. I also rely heavily on C++ template mechanisms to coerce the compiler into generating all of the marshalling code so programmers don’t need to. Here are some examples… A simple one-off push to another core looks like this: < View plain text > C++

if ( ! IsEngineThreaded ( ) ) _Host_RunFrame_Server ( numticks ) else ThreadExecute ( _Host_RunFrame_Server, numticks ) ; A simple parallel loop: < View plain text > C++

extern void ProcessPSystem ( CNewParticleEffect * & pNewEffect ) ; if ( particlesToSimulate. Count ( ) ) ParallelProcess ( particlesToSimulate. Base ( ) , particlesToSimulate. Count ( ) , ProcessPSystem ) ; Queue up a bunch of work items, wait for them to complete: < View plain text > C++

BeginExecuteParallel ( ) ExecuteParallel ( g_pFoo, & CFoo :: Func1 , arg1 ) ; ExecuteParallel ( & Bar, arg2, arg3 ) ; EndExecuteParallel ( ) Semi-manual. This one is interesting as the code has used a similar template techniques to create a queue of journaled graphics calls and then executes the whole batch on another core: < View plain text > C++

m_pActiveAsyncJob = g_pThreadPool - > QueueCall ( this , & CMaterialSystem :: ThreadExecuteQueuedContext , pPrevContext ) ; Very manual: < View plain text > C++

CJob * pJob = new CFileAsyncWriteJob ( pFileName, pSrc, nSrcBytes, bFreeMemory, bAppend ) ; m_pThreadPool - > AddJob ( pJob ) ; Now, I realize that this isn’t much to look at. But it is pretty straightforward to most programmers, and underneath the hood (in CThreadPool) there is a unified set of profiling and debugging tools that programmers can leverage.

Michael: Since I am also a member of the OpenMP language committee, I cannot resist the temptation to ask: did you consider using OpenMP? And if yes, what kept you from doing so? I could imagine coupling the ease of use of OpenMP with a custom-built OpenMP-runtime-library, optimized for games could be a viable approach to overcome some of your problems – but then again, I don’t really know enough about the challenges of game programming to have more than a speculative opinion on that :-).

Tom: I looked into the appropriateness of OpenMP for our problem domain. The primary issue with OpenMP for us was control of performance. At critical points, we really want to control exactly where code is executed, which OpenMP seemed to lack. For example, we can know a lot about the data and code we’re working on, and distribute work to maximize cache utilization. Furthermore, there was no real guarantee of the characteristics of OpenMP between compilers and platforms. I looked into the appropriateness of OpenMP for our problem domain. The primary issue with OpenMP for us was control of performance. At critical points, we really want to control exactly where code is executed, which OpenMP seemed to lack. For example, we can know a lot about the data and code we’re working on, and distribute work to maximize cache utilization. Furthermore, there was no real guarantee of the characteristics of OpenMP between compilers and platforms. Also, the type of problem OpenMP is good at, loop parallelization, is by far a subset of the types of ways we want to do distribute work. Finally, we have a category of programmer here who are really game designers who program, and I felt the decoration of OpenMP would be a barrier for some of those people. This is exacerbated by the fact OpenMP uses pragmas, which means more complicated usages can’t be disguised behind preprocessor macros, the common way of mitigating that sort of issue.

Michael: Thanks Tom, that was very enlightening!