Graphics Drivers

Gah. So if you saw the last post I made about OSX, you may remember it was running at 1 FPS.

I spent a lot of time thinking about this issue and quite a bit of time trying to code solutions. Despite OpenGL being a ‘cross platform’ library, at this point I’m pretty sure each platform that uses it is going to need code tailored to that platform’s specific graphics drivers.

Here’s my debugging method. (This is going to sound elegant as I type this out, but there was a lot of stumbling and double and triple checking things…)

One Frame Per Second

So I’m sitting there looking at the game chug along at 1FPS, and thinking: the loading screens run fast, but the title screen runs miserably. The loading screens have 1-3 draw calls per frame, whereas the title screen has hundreds, if not thousands. Something per draw call must be going slow.

Sure enough, if I don’t make any draw calls, things run fast, but this is mostly useless, since I can’t see anything.

A few thoughts enter my mind.

Hypothesis

– The graphics driver is defaulting to software rendering or software transformations.
– I’m doing something that’s not OpenGL 3.2 compliant, or doing something causing OpenGL errors.
– The GPU is waiting on the CPU (or vice versa) for something.

The first idea just shouldn’t be possible, as I selected a pixel format (an OpenGL thing that specifies what kind of rendering you’ll be doing) on OSX requiring hardware acceleration and no software fallback. But I’ll double check.
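I won’t paste the actual context setup here, but the relevant bit boils down to something like this (shown with the CGL flavor of the API; NSOpenGLPixelFormat takes equivalent attributes):

#include <OpenGL/OpenGL.h>

// Sketch of a pixel format request: hardware rendering only, no software fallback,
// OpenGL 3.2 core profile.
CGLPixelFormatAttribute attribs[] = {
    kCGLPFAAccelerated,                                   // hardware rendering only
    kCGLPFANoRecovery,                                    // never fall back to the software renderer
    kCGLPFAOpenGLProfile, (CGLPixelFormatAttribute)kCGLOGLPVersion_3_2_Core,
    kCGLPFAColorSize,     (CGLPixelFormatAttribute)24,
    kCGLPFADepthSize,     (CGLPixelFormatAttribute)24,
    kCGLPFADoubleBuffer,
    (CGLPixelFormatAttribute)0
};

CGLPixelFormatObj pixelFormat = NULL;
GLint formatCount = 0;
if (CGLChoosePixelFormat(attribs, &pixelFormat, &formatCount) != kCGLNoError || formatCount == 0)
{
    // no accelerated format available - fail loudly rather than silently getting software rendering
}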

The second idea is somewhat likely, but I worked very hard to make the Windows renderer OpenGL 3.2 compliant and it doesn’t show any errors. But I’ll check anyway since it’s a different driver and different GPU using the same code.
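A quick in-code sanity check (separate from the tools below) is to drain glGetError after anything suspicious – something along these lines, with the macro being illustrative rather than engine code:

#include <OpenGL/gl3.h>
#include <cstdio>

// Debug-only helper: report every pending OpenGL error with file/line context.
#define GL_CHECK()                                                                        \
    do {                                                                                  \
        GLenum err;                                                                       \
        while ((err = glGetError()) != GL_NO_ERROR)                                       \
            fprintf(stderr, "GL error 0x%04x at %s:%d\n", err, __FILE__, __LINE__);       \
    } while (0)

// usage after a suspect call:
glBindBuffer(GL_ARRAY_BUFFER, _objectId);
GL_CHECK();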

Third idea? Let’s hope it’s not that.

Testing

How do you check something like this? There are some sorta-ok GPU debugging tools available for OSX, so I downloaded them and started them up. After a little documentation reading, I got them working. You can set OpenGL breakpoints which will stop the program and give a bit of information if there’s an error or if you hit software rendering.

Of course nothing is easy. No OpenGL errors, no software rendering. This immediately discounted ideas #1 and #2. So it’s probably #3. Something is syncing the CPU and GPU. Blah.

Next I looked at what OpenGL calls were being made and how long they were taking.

Ah ha! You’ll notice the highlighted lines (which are draw calls), and that OpenGL calls are taking up a crazy 98% of the frame.

Looking closer at individual calls, you can see the huge time difference between glDraw calls and other API calls…

Having written low-level code for consoles that don’t really have a driver has given me a good understanding of what goes on when the CPU sends commands to the GPU, and what can cause a stall. Generally a stall happens in one of two cases: the CPU wants to update a dynamic resource that the GPU is still using, or the CPU is waiting for the GPU to finish some rendering so it can access a rendered or computed result.
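The second case is the easy one to picture: anything that reads a GPU result back on the CPU has to let the pipeline drain first. A contrived example, not from the engine:

// Draw something, then immediately ask for the pixels back.
glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, 0);

// glReadPixels can't return until every queued command touching the framebuffer
// has finished, so the CPU sits idle while the GPU catches up.
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, pixels);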

I only have 3 places in code that might cause this. The first one I looked at is updating vertex and index data used for dynamic rendering – which is used for particle systems, UI, and other things that change frame to frame.

The (abbreviated) code looks like this:

GLbitfield flags = GL_MAP_WRITE_BIT;

if (_currentOffset + bytes > _bufferBytes)
{
    // at the end of the buffer, invalidate it and start writing at the beginning...
    flags |= GL_MAP_INVALIDATE_BUFFER_BIT;
    _currentOffset = 0;
}
else
{
    // there's still room, write past what the GPU is using and notify that there's no
    // need to stall on this write.
    flags |= GL_MAP_UNSYNCHRONIZED_BIT;
}

glBindBuffer(GL_ARRAY_BUFFER, _objectId);
void* data = glMapBufferRange(GL_ARRAY_BUFFER, _currentOffset, bytes, flags);

// write some data ....

glUnmapBuffer(GL_ARRAY_BUFFER);

// draw some stuff with the data at _currentOffset.

_currentOffset += bytes;

It’s set up so that generally you’re just writing more data while the GPU can use data earlier in the buffer as it’s needed. Occasionally, when you run out of room, you let the driver know you’re going to overwrite the buffer. (This can be done better with multiple buffers, but I didn’t want to overcomplicate this example code.)
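For the curious, the multiple-buffer variation is roughly this – a hypothetical sketch, not the engine’s actual code:

// Keep a small ring of vertex buffers and rotate every frame, so the buffer
// being written this frame is never one the GPU is still drawing from.
static const int kBufferCount = 3;
GLuint _objectIds[kBufferCount];
int    _frameIndex = 0;

void BeginFrame()
{
    _frameIndex = (_frameIndex + 1) % kBufferCount;
    _currentOffset = 0;
    glBindBuffer(GL_ARRAY_BUFFER, _objectIds[_frameIndex]);

    // orphan the buffer so the driver doesn't have to preserve last frame's contents
    glBufferData(GL_ARRAY_BUFFER, _bufferBytes, NULL, GL_STREAM_DRAW);
}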

This didn’t seem to be the problem, as nearly every draw call was slow. Drawing that used fully static data was slow too. Static data is set up with code that looks like this:

glGenBuffers(1, &_objectId);
glBindBuffer(GL_ARRAY_BUFFER, _objectId);
glBufferData(GL_ARRAY_BUFFER, bytes, data, GL_STATIC_DRAW);

That data isn’t ever touched again, and hopefully the GPU takes the hint that it can reside in GPU memory so no problem there.

But then I noticed that not every draw call was slow. Using the OpenGL Profiler trace I could see that sequential draw calls without any changes to any render state in-between did not stall.

Hmmmm….

What’s the most common thing that changes between draw calls? If it’s not the material on the object, it’s the location where that object is drawn: its transformation – position and orientation. Transformations are generally stored in a very fast (and fairly small) section of GPU memory meant just for this purpose. It’s also where the camera location, object color, and other variable properties are stored. We call this data ‘uniforms’. Or in my engine, ‘constants’.

In OpenGL 3.2 I used uniform buffer objects, since the concept most closely matches my engine architecture and that of DX10/11. DX9 fits the concept as well, since you can specify the location of all uniforms. Seems like a good fit.
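For reference, hooking a shader’s uniform block up to a buffer takes a little one-time setup, roughly like this (the block name here is made up):

// One-time setup per shader program / buffer pairing.
GLuint blockIndex = glGetUniformBlockIndex(programId, "PerDrawConstants");
glUniformBlockBinding(programId, blockIndex, bindingPoint);

// Create the buffer and attach it to the same binding point.
glGenBuffers(1, &_objectId);
glBindBuffer(GL_UNIFORM_BUFFER, _objectId);
glBufferData(GL_UNIFORM_BUFFER, _bufferBytes, NULL, GL_DYNAMIC_DRAW);
glBindBufferBase(GL_UNIFORM_BUFFER, bindingPoint, _objectId);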

After some pre-configuration, sending uniforms to the GPU for vertex and pixel programs to use is really easy. It looks like this:

void ConstantBuffer::Bind(Context& context, void* data, int32 offsetBytes, int32 bytes)
{
    glBindBuffer(GL_UNIFORM_BUFFER, _objectId);
    glBufferSubData(GL_UNIFORM_BUFFER, offsetBytes, bytes, data);
}

To my knowledge this should be crazy fast. On some hardware (way down at the command stream level) this data is part of the command buffer itself, and the constants are updated just before the vertex and pixel shaders are invoked. Worst case, if it’s actually a separate buffer the GPU uses, and/or the driver supports reading the data back on the CPU, the driver needs to copy it off somewhere until the GPU needs it, so the last set values can be read back by the CPU without any stall…

But you never know….

I read the OpenGL docs again, and sure enough glBufferSubData can cause a stall: the update has to wait for the GPU to finish any previously issued commands that are still consuming the previous values.

“Consider using multiple buffer objects to avoid stalling the rendering pipeline during data store updates. If any rendering in the pipeline makes reference to data in the buffer object being updated by glBufferSubData, especially from the specific region being updated, that rendering must drain from the pipeline before the data store can be updated.”

Really? Why? Setting uniforms HAS to be fast. You do it almost as often as issuing draw commands!!! This has been true since vertex shader 1.0. (Yeah I know, this doesn’t have to be quite true for some of the newest GPUs and APIs)

So for kicks, since there’s more than one way to modify buffer data in OpenGL, I changed the ConstantBuffer update to:

void ConstantBuffer::Bind(Context& context, void* data, int32 offsetBytes, int32 bytes)
{
    glBindBuffer(GL_UNIFORM_BUFFER, _objectId);
    void* destData = glMapBufferRange(GL_UNIFORM_BUFFER, offsetBytes, bytes, GL_MAP_WRITE_BIT);
    memcpy(destData, data, bytes);
    glUnmapBuffer(GL_UNIFORM_BUFFER);
}

And while in my mind there really shouldn’t be any difference, the statistics on OpenGL commands change to this:

Huh, there’s all that wait time again, but it’s moved to setting uniforms. Now I’m getting somewhere. I figure I’m just not using the API correctly when setting uniforms.

Experimentation

So I tried a bunch of different things.

I tried having a single large uniform buffer using the GL_MAP_INVALIDATE_BUFFER_BIT / GL_MAP_UNSYNCHRONIZED_BIT and glBindBufferRange() so that no constants were overwritten. This was slower. And yes, you can get slower than 1 FPS.
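That experiment looked roughly like this – a sketch of the approach rather than the exact code:

// Append this draw call's constants to one big per-frame buffer, never
// touching bytes the GPU might still be reading.
glBindBuffer(GL_UNIFORM_BUFFER, _objectId);
GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT;
void* dest = glMapBufferRange(GL_UNIFORM_BUFFER, _currentOffset, bytes, flags);
memcpy(dest, data, bytes);
glUnmapBuffer(GL_UNIFORM_BUFFER);

// Point the shader's uniform block at just this slice of the buffer.
// (offsets have to respect GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT)
glBindBufferRange(GL_UNIFORM_BUFFER, bindingPoint, _objectId, _currentOffset, bytes);
_currentOffset += alignedBytes;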

I tried having a uniform buffer per draw call so they were never overwritten, except between frames. This was slower, using either glMapBuffer or glBufferSubData.

I tried changing the buffer creation flags. No change.

I read about other coders running through their entire scene, collecting uniforms, updating a uniform buffer once at the beginning of the frame, and then running through the scene again just to make draw calls. This is stupid and slow.

I wished I could use a newer version of OpenGL to try some other options, but I’m using 3.2 for maximum compatibility.

Eureka!

Then I got a sinking feeling in my stomach. I knew the answer (actually was pretty sure…) but I didn’t want to code it. Ugh.

Back before OpenGL 3.0 / DirectX 10, there weren’t any uniform buffers. Uniforms were just loose data that you set one at a time using functions like glUniformMatrix4fv and glUniform4fv.

What isn’t great about the old way is that every time you change vertex and pixel programs, you need to reapply any changed uniforms that the next GPU program uses. OpenGL 3.2 doesn’t let the shader pick where uniforms go in memory, so you always have to look each location up, and the location of a uniform variable can change from shader to shader.

With uniform buffers, if you set some values once and they don’t change the entire frame, there’s nothing else to do.

So I went about changing the engine to use the old old way.

- First I had to change all the shaders to not use uniform buffers. Luckily I have the shader compiler, so this was a few lines of code instead of hand editing 100’s of shaders.
- Then I sat around for a few minutes while all the shaders regenerated and recompiled.
- Next I had to record, per vertex/pixel program combination, which uniforms were used and where they needed to be uploaded to (sketched below). This was a non-trivial amount of code to write.
- Then, any time a shader changed, I had to change the code to dirty all uniforms so they’d be reapplied.
- Then I had to write a new uniform binding function.
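The recording step in the third bullet boils down to asking each linked program which uniforms it actually uses and where they live. Roughly this, with the field names matching the UploadInfo used below and the offset lookup being a stand-in for the engine’s actual mapping:

// Ask the linked program for its active uniforms and remember enough to
// upload them later without any per-frame lookups.
GLint uniformCount = 0;
glGetProgramiv(programId, GL_ACTIVE_UNIFORMS, &uniformCount);

for (GLint i = 0; i < uniformCount; ++i)
{
    char name[128];
    GLsizei nameLength = 0;
    GLint size = 0;     // array element count
    GLenum type = 0;    // GL_FLOAT_MAT4, GL_FLOAT_VEC4, ...
    glGetActiveUniform(programId, i, sizeof(name), &nameLength, &size, &type, name);

    VideoProgram::UploadInfo info;
    info._index  = glGetUniformLocation(programId, name);  // where to upload
    info._size   = size;
    info._type   = type;
    info._offset = LookupEngineConstantOffset(name);        // hypothetical: map name -> engine constant slot
    uploadInfos.Add(info);
}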

Here’s the new constant binding function. Pretty messy memory-wise, and many more calls to the GL API per frame.

void ConstantBuffer::Bind(Context& context, void* data, int32 offsetBytes, int32 /*bytes*/)
{
    _Assert(offsetBytes == 0, "can't upload with non-zero offset");

    const VideoProgram* program = context.GetVideoProgram();
    const Collection::Array<VideoProgram::UploadInfo>& upload = program->GetUploadInfo(context.GetDetailLevel(), _ordinal);

    for (int32 i = 0; i < upload.GetSize(); ++i)
    {
        const VideoProgram::UploadInfo& uploadInfo = upload[i];
        switch (uploadInfo._type)
        {
        case GL_FLOAT_MAT4:
            glUniformMatrix4fv(uploadInfo._index, uploadInfo._size, false, (float*)data + (uploadInfo._offset * 4));
            break;

        case GL_FLOAT_VEC4:
            glUniform4fv(uploadInfo._index, uploadInfo._size, (float*)data + uploadInfo._offset * 4);
            break;
        }
    }
}

Success

Finally I watched the game run at 60 FPS. So now the statistics are nicer. And only 5% CPU time spent in OpenGL. Woot.

Graphics Drivers

Ok, so the driver is optimized to set loose constants very quickly, but when they're presented as a block it just stalls waiting for the GPU to finish? I don't get it. The Windows drivers seem to handle uniform buffers properly. I understand writing the driver to the OpenGL spec - but geez, this makes uniform buffers mostly useless. It's known to be a uniform buffer, the calling code is updating it, it's marked as DYNAMIC_WRITE, so why isn't it doing exactly the same thing my manual setting of each uniform value does???? Arhghghghg.

I'm sure someone has a good answer as to how to update uniform buffers on Mac OSX, but I couldn't find it. Or maybe the answer is upgrading, or not using them? But this was debugging hours I didn't need to spend. Actually I take that back. Tracking down issues like this is pretty satisfying...

So I can just keep the code the way that works on Mac, but uniform buffers are so much more elegant. Plus what if Linux runs faster with uniform buffers instead of loose uniforms? Or if Windows does? Then I have to generate two different OpenGL shaders, and have different code per platform to get the same data to the GPU. Now I'm not so worried that the Windows OpenGL implementation was slightly different from OSX, because I can see the implementations are going to be driver dependent anyway...

OpenGL is cross platform? Sorta. Yikes.