In part 2 of the article about persistent mapped buffers, I share the results from the demo app.

I've compared single, double and triple buffering approaches for persistent mapped buffers. Additionally, there is a comparison with the standard methods: glBuffer*Data and glMapBuffer.

Note:

This post is the second part of the article about Persistent Mapped Buffers,

see the first part here - introduction

Demo

Github repo: fenbf/GLSamples

How it works:

the app shows a number of rotating 2D triangles (wow!)

triangles are updated on the CPU and then sent (streamed) to the GPU

drawing is based on the glDrawArrays command

in benchmark mode I run the app for N seconds (usually 5 s) and then count how many frames I got

additionally I measure a counter that is incremented each time we need to wait for the buffer

vsync is disabled
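The benchmark mode boils down to counting frames over a fixed wall-clock interval. Here is a minimal, GL-free sketch of that logic; `RunBenchmark` and `renderFrame` are hypothetical names standing in for the demo's real main loop and `Display()` call:

```cpp
#include <chrono>

// Sketch of the benchmark mode: call the frame function until
// maxAllowedTimeMs elapses, then return how many frames were rendered.
template <typename FrameFunc>
long RunBenchmark(int maxAllowedTimeMs, FrameFunc renderFrame) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    long frameCount = 0;
    while (std::chrono::duration_cast<std::chrono::milliseconds>(
               clock::now() - start).count() < maxAllowedTimeMs) {
        renderFrame();
        ++frameCount;
    }
    return frameCount;
}
```

Dividing the returned count by N seconds gives the average FPS reported in the results below.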

Features:

configurable number of triangles

configurable number of buffers: single/double/triple

optional syncing

optional debug flag

benchmark mode (quit app after N seconds)

Code bits

Init buffer:

size_t bufferSize{ gParamTriangleCount * 3 * sizeof(SVertex2D) };

if (gParamBufferCount > 1)
{
    bufferSize *= gParamBufferCount;
    gSyncRanges[0].begin = 0;
    gSyncRanges[1].begin = gParamTriangleCount * 3;
    gSyncRanges[2].begin = gParamTriangleCount * 3 * 2;
}

flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
glBufferStorage(GL_ARRAY_BUFFER, bufferSize, 0, flags);
gVertexBufferData = (SVertex2D*)glMapBufferRange(GL_ARRAY_BUFFER, 0, bufferSize, flags);
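The init code above carves one big persistent buffer into per-frame ranges. The size and offset arithmetic can be checked in isolation; this is a standalone sketch with hypothetical helper names, assuming `SVertex2D` is two floats as in the demo:

```cpp
#include <cstddef>

struct SVertex2D { float x, y; };  // matches the demo's 2D vertex layout

// With more than one buffer, the persistent storage is bufferCount times
// larger, and range i begins at vertex i * triangleCount * 3.
std::size_t BufferSizeBytes(std::size_t triangleCount, std::size_t bufferCount) {
    return triangleCount * 3 * sizeof(SVertex2D) * bufferCount;
}

std::size_t RangeBeginVertex(std::size_t triangleCount, std::size_t rangeIndex) {
    return triangleCount * 3 * rangeIndex;
}
```

For triple buffering with 2000 triangles, the three ranges begin at vertices 0, 6000 and 12000, exactly as `gSyncRanges[i].begin` is filled in above.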

Display:

void Display()
{
    glClear(GL_COLOR_BUFFER_BIT);
    gAngle += 0.001f;

    if (gParamSyncBuffers)
    {
        if (gParamBufferCount > 1)
            WaitBuffer(gSyncRanges[gRangeIndex].sync);
        else
            WaitBuffer(gSyncObject);
    }

    size_t startID = 0;
    if (gParamBufferCount > 1)
        startID = gSyncRanges[gRangeIndex].begin;

    for (size_t i(0); i != gParamTriangleCount * 3; ++i)
    {
        gVertexBufferData[i + startID].x = genX(gReferenceTrianglePosition[i].x);
        gVertexBufferData[i + startID].y = genY(gReferenceTrianglePosition[i].y);
    }

    glDrawArrays(GL_TRIANGLES, startID, gParamTriangleCount * 3);

    if (gParamSyncBuffers)
    {
        if (gParamBufferCount > 1)
            LockBuffer(gSyncRanges[gRangeIndex].sync);
        else
            LockBuffer(gSyncObject);
    }

    gRangeIndex = (gRangeIndex + 1) % gParamBufferCount;

    glutSwapBuffers();
    gFrameCount++;

    if (gParamMaxAllowedTime > 0 && glutGet(GLUT_ELAPSED_TIME) > gParamMaxAllowedTime)
        Quit();
}

WaitBuffer:

void WaitBuffer(GLsync& syncObj)
{
    if (syncObj)
    {
        while (1)
        {
            GLenum waitReturn = glClientWaitSync(syncObj, GL_SYNC_FLUSH_COMMANDS_BIT, 1);
            if (waitReturn == GL_ALREADY_SIGNALED || waitReturn == GL_CONDITION_SATISFIED)
                return;

            gWaitCount++; // the counter
        }
    }
}
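Why more buffer ranges mean fewer waits can be illustrated with a toy model, with no GL involved. The assumption here (a simplification, not the demo's actual code) is that the GPU releases a range a fixed number of frames after it was submitted, while the CPU rotates through the ranges round-robin:

```cpp
// Toy model of the wait counter: the CPU reuses range (frame % bufferCount)
// every frame; the GPU releases a range gpuLag frames after submission.
// A "wait" happens whenever the CPU wants a range the GPU still owns.
long CountWaits(int bufferCount, int frames, int gpuLag) {
    long waits = 0;
    for (int frame = 0; frame < frames; ++frame) {
        // this frame's range was last submitted bufferCount frames ago
        int lastUse = frame - bufferCount;
        if (lastUse >= 0 && lastUse + gpuLag > frame)
            ++waits;  // the GPU has not released the range yet
    }
    return waits;
}
```

With a lag of three frames, single and double buffering wait almost every frame while triple buffering never waits, which mirrors the wait-counter results below.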

Test Cases

I've created a simple batch script that:

runs test for 10, 100, 1000, 2000 and 5000 triangles

each test takes 5 seconds:

persistent_mapped_buffer single_buffer sync
persistent_mapped_buffer single_buffer no_sync
persistent_mapped_buffer double_buffer sync
persistent_mapped_buffer double_buffer no_sync
persistent_mapped_buffer triple_buffer sync
persistent_mapped_buffer triple_buffer no_sync
standard_mapped_buffer glBuffer*Data orphan
standard_mapped_buffer glBuffer*Data no_orphan
standard_mapped_buffer glMapBuffer orphan
standard_mapped_buffer glMapBuffer no_orphan

in total 5*10*5 sec = 250 sec

no_sync means that there is no locking or waiting for the buffer range. That can potentially generate a race condition or even an application crash - use it at your own risk! (at least in my case nothing happened - maybe a little bit of dancing vertices :) )

2k triangles use 2000*3*2*4 bytes = 48 kbytes per frame. This is quite a small number. In a follow-up to this experiment I'll try to increase that and stress the CPU-to-GPU bandwidth a bit more.
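That per-frame number follows directly from the vertex layout: 3 vertices per triangle, 2 floats per vertex, 4 bytes per float. A quick standalone check of the arithmetic (helper name is mine, not from the demo):

```cpp
#include <cstddef>

// Per-frame upload size for the demo's layout:
// 3 vertices per triangle, 2 floats (x, y) per vertex, 4 bytes per float.
std::size_t BytesPerFrame(std::size_t triangleCount) {
    return triangleCount * 3 * 2 * sizeof(float);
}
```

Even at 1000 FPS, 48 kbytes per frame is only about 48 MB/s of streaming traffic, far below what the bus can handle.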

Orphaning:

for glMapBufferRange I add the GL_MAP_INVALIDATE_BUFFER_BIT flag

for glBuffer*Data I call glBufferData(NULL) and then a normal call to glBufferSubData

Results

All results can be found on github: GLSamples/project/results

100 Triangles

GeForce 460 GTX (Fermi), Sandy Bridge Core i5 2400, 3.1 GHz

Wait counter:

Single buffering: 37887

Double buffering: 79658

Triple buffering: 0

AMD HD5500, Sandy Bridge Core i5 2400, 3.1 GHz

Wait counter:

Single buffering: 1594647

Double buffering: 35670

Triple buffering: 0

Nvidia GTX 770 (Kepler), Sandy Bridge i5 2500k @ 4 GHz

Wait counter:

Single buffering: 21863

Double buffering: 28241

Triple buffering: 0

Nvidia GTX 850M (Maxwell), Haswell i7-4710HQ

Wait counter:

Single buffering: 0

Double buffering: 0

Triple buffering: 0

All GPUs

With Intel HD4400 and NV 720M

2000 Triangles

GeForce 460 GTX (Fermi), Sandy Bridge Core i5 2400, 3.1 GHz

Wait counter:

Single buffering: 2411

Double buffering: 4

Triple buffering: 0

AMD HD5500, Sandy Bridge Core i5 2400, 3.1 GHz

Wait counter:

Single buffering: 79462

Double buffering: 0

Triple buffering: 0

Nvidia GTX 770 (Kepler), Sandy Bridge i5 2500k @ 4 GHz

Wait counter:

Single buffering: 10405

Double buffering: 404

Triple buffering: 0

Nvidia GTX 850M (Maxwell), Haswell i7-4710HQ

Wait counter:

Single buffering: 8256

Double buffering: 91

Triple buffering: 0

All GPUs

With Intel HD4400 and NV 720M

Summary

Persistent Mapped Buffers (PMB) with triple buffering and no synchronization seem to be the fastest approach in most tested scenarios. Only the Maxwell (850M) GPU has issues with that: slow for 100 tris, and for 2k tris it's better to use double buffering.

PMB with double buffering seems to be only a bit slower than triple buffering, but sometimes the 'wait counter' was not zero. That means we needed to wait for the buffer. Triple buffering has no such problem, so no synchronization is needed. Using double buffering without syncing might work, but we might expect artifacts. (I need to verify more on that.)

Single buffering (PMB) with syncing is quite slow on NVidia GPUs.

Using glMapBuffer without orphaning is the slowest approach.

Interestingly, glBuffer*Data with orphaning seems to be comparable to PMB. So old code that uses this approach might still be quite fast!

TODO: use Google Charts for better visualization of the results

Please Help

If you'd like to help, you can run the benchmark on your own and send me (bartlomiej DOT filipek AT gmail) the results.

Windows only. Sorry :)

Benchmark_pack 7zip @github

Go to benchmark_pack and execute the batch file run_from_10_to_5000.bat:

run_from_10_to_5000.bat > my_gpu_name.txt

The script runs all the tests and takes around 250 seconds.