I was fortunate enough to attend the talk that Rich Geldreich of Valve Software gave at the OpenGL Anniversary party during SIGGRAPH 2012 in the JW Marriott Gold Ballroom in LA: “Left 4 Dead 2 Linux: From 6 to 300 FPS in OpenGL.”

Update: As Rich himself was kind enough to mention in the comments, the slides are now available here. I’ve updated this post with some corrections and the info that was missing.

Update #2: A video of the talk is available on YouTube here: http://youtu.be/bTO1D9pg4Ug?t=32m41s

A very enthusiastic Rich had to rush his interesting talk a little, since the OpenGL party was planned to start at 7 PM and the crowd was eager to toast the 20th anniversary (OpenGL/SGI founder Kurt Akeley gave a speech), but he still managed to offer some insight into how Valve tackled the performance issues in porting the game to OpenGL. I’m just going to share some of my notes; this is not a complete overview. Rich stressed that he would publish a lot more on the Valve Linux blog.

The presentation started off with Rich demonstrating the game running on (what seemed to be) the 32-bit version of Ubuntu 12.04, on a laptop. He explained that they often benchmark performance with a 30-ish second demo run, recorded during a playthrough by Rick Johnson at Valve, of a part of the “Dead Center” campaign. It’s the part that corresponds to the first couple of minutes of this YouTube walkthrough, coming out of the first building and moving to the orange decontamination tents.

They can run these demos in timedemo mode as well, forcing the engine to crank out all the frames as fast as it can. The only thing that differs between runs is some of the zombie limb physics. The short segment features a lot of heavy zombie-slaughtering action and a fair amount of exploding canisters, so I guess it’s a good representation of general gameplay. Rich mentioned they used other, longer demos to benchmark performance as well.
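To make the numbers in the talk title concrete, here is a minimal sketch of how a timedemo-style run can be summarized. The `TimedemoResult` struct and `SummarizeTimedemo` function are names I made up for illustration; this is not how the Source engine reports its results.

```cpp
#include <cassert>

// Hypothetical summary of a timedemo run (not a Source engine type).
struct TimedemoResult {
    double averageFps;      // frames rendered per second, on average
    double averageFrameMs;  // average time spent on one frame, in milliseconds
};

// Summarize a run in which `frameCount` frames were rendered back to back
// in `elapsedSeconds` of wall-clock time.
TimedemoResult SummarizeTimedemo(int frameCount, double elapsedSeconds) {
    assert(frameCount > 0 && elapsedSeconds > 0.0);
    TimedemoResult result;
    result.averageFps = frameCount / elapsedSeconds;
    result.averageFrameMs = elapsedSeconds * 1000.0 / frameCount;
    return result;
}
```

By this arithmetic, 6 FPS corresponds to roughly 167 ms per frame, while 300 FPS leaves only about 3.3 ms per frame, which is why every microsecond in the hot path ends up mattering.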

I’ll bullet-point the rest of the talk:

Valve started out with little OpenGL experience, and gradually learned along the way. The original Linux port of the game was done before the in-house Linux team was formed. After that, they invited hardware vendors to the office to discuss the possibilities and limitations of an eventual port. Rich stressed that it was really a team effort and that they couldn’t have done it without the support from software/hardware partners.

The talk focused on their performance improvements using the Nvidia GTX 680. OpenGL performance improvements were the largest on Nvidia’s GL driver.

They use a D3D-to-OpenGL translation system which directly maps Direct3D calls onto OpenGL calls and immediately flushes its state to the graphics buffer. It translates the API calls on the fly and supports shader model 2.0b, with 3.0 support coming up. The overhead is reasonable: on multithreaded drivers, CPU time is split roughly 50/50 between calling GL on one hand and the translation calls on the other; on single-threaded drivers the split is 80/20.
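To give an idea of what "directly mapping" a call can look like, here is a minimal sketch of translating the arguments of a Direct3D 9 `DrawPrimitive` call into arguments for `glDrawArrays`. This is my own illustration, not Valve's code: the `GlDrawCall` struct and `TranslateDrawPrimitive` function are hypothetical, and the constants are redefined locally (with the values from the D3D9 and GL headers) so the snippet compiles without either SDK. The one real subtlety it shows is that Direct3D counts primitives while OpenGL counts vertices.

```cpp
#include <cassert>
#include <cstdint>

// Direct3D 9 primitive types, redefined locally with the d3d9types.h values.
enum D3DPrimitiveType : std::uint32_t {
    D3DPT_POINTLIST = 1,
    D3DPT_LINELIST = 2,
    D3DPT_LINESTRIP = 3,
    D3DPT_TRIANGLELIST = 4,
    D3DPT_TRIANGLESTRIP = 5,
    D3DPT_TRIANGLEFAN = 6,
};

// OpenGL primitive modes, redefined locally with the GL header values.
constexpr std::uint32_t kGlPoints = 0x0000;         // GL_POINTS
constexpr std::uint32_t kGlLines = 0x0001;          // GL_LINES
constexpr std::uint32_t kGlLineStrip = 0x0003;      // GL_LINE_STRIP
constexpr std::uint32_t kGlTriangles = 0x0004;      // GL_TRIANGLES
constexpr std::uint32_t kGlTriangleStrip = 0x0005;  // GL_TRIANGLE_STRIP
constexpr std::uint32_t kGlTriangleFan = 0x0006;    // GL_TRIANGLE_FAN

// Hypothetical record of the glDrawArrays(mode, first, count) arguments
// that a translation layer would issue.
struct GlDrawCall {
    std::uint32_t mode;
    std::int32_t first;
    std::int32_t count;  // number of vertices, not primitives
};

// Translate DrawPrimitive(type, startVertex, primitiveCount) arguments.
GlDrawCall TranslateDrawPrimitive(D3DPrimitiveType type,
                                  std::uint32_t startVertex,
                                  std::uint32_t primitiveCount) {
    const std::int32_t first = static_cast<std::int32_t>(startVertex);
    const std::int32_t n = static_cast<std::int32_t>(primitiveCount);
    switch (type) {
        case D3DPT_POINTLIST:     return {kGlPoints, first, n};
        case D3DPT_LINELIST:      return {kGlLines, first, 2 * n};
        case D3DPT_LINESTRIP:     return {kGlLineStrip, first, n + 1};
        case D3DPT_TRIANGLELIST:  return {kGlTriangles, first, 3 * n};
        case D3DPT_TRIANGLESTRIP: return {kGlTriangleStrip, first, n + 2};
        case D3DPT_TRIANGLEFAN:   return {kGlTriangleFan, first, n + 2};
    }
    assert(false && "unknown primitive type");
    return {kGlPoints, 0, 0};
}
```

For example, `TranslateDrawPrimitive(D3DPT_TRIANGLELIST, 0, 2)` yields the `GL_TRIANGLES` mode with a vertex count of 6. A real layer also has to translate render state, shaders, buffers and textures, which is presumably where most of the overhead mentioned above comes from.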

The performance optimizations focused on knowing exactly what was happening every microsecond in this translation layer, which they try to keep universal so it can be easily used in the other Source games (Team Fortress 2, …) as well.

Experimenting and benchmarking were hard because the game is multithreaded: bottlenecks tend to shift around once you start looking at them. The driver’s main thread is also mostly invisible to their profiling tools. In addition, the Source engine is highly configurable so that it can run on a wide range of system configurations, which makes misconfiguration easy and in turn hampered the usability of some performance benchmarks.

The primary debugging/profiling tool used was Telemetry by RAD Game Tools (the guys behind the BINK video format). It consists of a visualizer app, a runtime component and a server. Since Telemetry is an intrusive system, extra lines had to be added to define Telemetry zones in the code. These zones can be nested, labeled … for easier profiling. They usually map to a function, but not always. In addition to these Telemetry zones, they can define timespans, which are not bound to certain execution areas, and may span multiple frames. Using this tool paid off when it was time to find performance bottlenecks, as Rich showed by demonstrating a Telemetry analysis of a typical game run (using the demo we just saw), in which he was able to zoom in down to microsecond level to see what was eating cycles. He mentioned that they’ve only scratched the surface of what is possible with this tool, and looked forward to using it more.
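For readers who haven't used an intrusive profiler, here is a minimal sketch of what nested, labeled zones can look like in C++. To be clear, this is not the Telemetry API: `Profiler`, `ScopedZone`, `ZoneRecord` and `RenderFrame` are hypothetical names, and I'm assuming a simple scope-based design where a zone starts when an object is constructed and ends when it goes out of scope.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <utility>
#include <vector>

// One finished zone: its label, nesting depth and measured duration.
struct ZoneRecord {
    std::string name;
    int depth;
    std::chrono::nanoseconds elapsed;
};

// Hypothetical collector; a real tool would stream this to a server instead.
struct Profiler {
    int depth = 0;
    std::vector<ZoneRecord> records;
};

// A zone opens on construction and closes on destruction, so nesting
// follows the C++ scopes the zones are declared in.
class ScopedZone {
    using Clock = std::chrono::steady_clock;

public:
    ScopedZone(Profiler& profiler, std::string name)
        : profiler_(profiler),
          name_(std::move(name)),
          depth_(profiler.depth++),
          start_(Clock::now()) {}

    ~ScopedZone() {
        const auto elapsed = Clock::now() - start_;
        --profiler_.depth;
        profiler_.records.push_back(
            {std::move(name_), depth_,
             std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed)});
    }

    ScopedZone(const ScopedZone&) = delete;
    ScopedZone& operator=(const ScopedZone&) = delete;

private:
    Profiler& profiler_;
    std::string name_;
    int depth_;
    Clock::time_point start_;
};

// Example instrumentation: one zone per function plus two nested zones.
void RenderFrame(Profiler& profiler) {
    ScopedZone frame(profiler, "RenderFrame");
    {
        ScopedZone translate(profiler, "TranslateD3DCalls");
        // ... translation work would go here ...
    }
    {
        ScopedZone submit(profiler, "SubmitGLCalls");
        // ... GL submission would go here ...
    }
}
```

Zones are recorded when they close, so after one call to `RenderFrame` the two inner zones come first (at depth 1) and the enclosing `RenderFrame` zone comes last (at depth 0). This also shows why such a system is intrusive: every region you want to see has to be marked by hand.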

Another home-made performance analysis tool is a visual batch trace mode, in which the engine outputs PNG files whose scanlines use several colors to show where the time was spent. This way, they could visually compare the original Direct3D, OpenGL single-threaded and OpenGL multi-threaded performance by drawing the batch traces next to each other. They stitched these PNG files together using VirtualDub. These colorful (and quite psychedelic) videos soon became a great way to communicate performance improvements to vendors and other associates. A few of them can be found in this YouTube channel.

Another tool they used for performance benchmarking was GPU PerfStudio by AMD.
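The batch trace idea described above can be sketched roughly as follows. I don't know how Valve's version actually lays out its images, so everything here is an assumption: the three `Category` values, the colors, the "one pixel per N microseconds" scale, and the names `Batch`, `TraceImage`, `RenderBatchTrace` and `WritePpm` are all made up for illustration. I also write a PPM file instead of a PNG, simply because PPM needs no library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <ostream>
#include <utility>
#include <vector>

// Hypothetical buckets for where a batch spent its time.
enum class Category { Translation, DriverCall, Idle };

struct Rgb { std::uint8_t r, g, b; };

struct Batch {
    Category category;
    std::uint32_t microseconds;
};

struct TraceImage {
    std::size_t width;
    std::size_t height;
    std::vector<Rgb> pixels;  // row-major, width * height entries
};

// Arbitrary color per category.
Rgb ColorFor(Category category) {
    switch (category) {
        case Category::Translation: return {230, 60, 60};
        case Category::DriverCall:  return {60, 120, 230};
        case Category::Idle:        return {60, 200, 90};
    }
    return {0, 0, 0};
}

// Paint batches left to right, wrapping onto the next scanline when a row
// is full. Each pixel stands for `microsPerPixel` microseconds.
TraceImage RenderBatchTrace(const std::vector<Batch>& batches,
                            std::size_t width,
                            std::uint32_t microsPerPixel) {
    assert(width > 0 && microsPerPixel > 0);
    std::vector<Rgb> pixels;
    for (const Batch& batch : batches) {
        // Round up so that short (non-zero) batches still get one pixel.
        const std::size_t count =
            (static_cast<std::size_t>(batch.microseconds) + microsPerPixel - 1) /
            microsPerPixel;
        pixels.insert(pixels.end(), count, ColorFor(batch.category));
    }
    const std::size_t height = (pixels.size() + width - 1) / width;
    pixels.resize(width * height, Rgb{0, 0, 0});  // pad the last row with black
    return {width, height, std::move(pixels)};
}

// Write the image as a binary PPM (P6) to a stream opened in binary mode.
void WritePpm(const TraceImage& image, std::ostream& out) {
    out << "P6\n" << image.width << ' ' << image.height << "\n255\n";
    for (const Rgb& pixel : image.pixels) {
        out.put(static_cast<char>(pixel.r));
        out.put(static_cast<char>(pixel.g));
        out.put(static_cast<char>(pixel.b));
    }
}
```

Rendering one such image per frame for each configuration and placing them side by side gives the kind of at-a-glance comparison described in the talk: a frame that takes longer simply fills more scanlines.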

Other optimizations they’ve worked on:

Pushing for multithreaded drivers: Rich noted that it made a night-and-day difference once they got decent multithreaded drivers working on Linux for the GL calls, in cooperation with Nvidia.

They’re working with Intel on fixes/optimizations for its Mesa driver, which should be picked up by the Linux kernel during the following months.

They rewrote the hottest D3D->GL code segments for better performance, keeping in mind that they ought to be able to reuse their work for other Source titles as well.

Dynamic buffer updating: glMapBufferRange vs. glBufferSubData.

After some consulting, they added -ffast-math to their GCC compiler options and removed -fPIC, resulting in a performance increase of some 8-10%.



As mentioned before, this is far from a complete overview: the talk was rushed, Rich dropped some slides, and I was busy paying attention and writing down some stuff for you guys :)

Overall, the talk was very interesting, and it was very clear that they are taking the OpenGL route very seriously, with a focus on making it easy to port more titles to the platform. They also noted that this couldn’t have been done without the excellent support from and communication with Intel, Nvidia and AMD, pushing for kernel and driver changes.

Also a little shoutout to the Khronos Group for organizing such an amazing OpenGL Anniversary party. There was candy, popcorn, beer and lots of games to get competitive on!