In my previous post I was talking about steps one needs to undertake and data they get when profiling an application. Now let’s use this knowledge in practice.

I will be profiling Blaze, a 2D physics engine based on Box2D. It was actually Mason’s newsgroup post which inspired me to blog about profiling.

Prerequisites

To repeat my experiments you will need:

DMD compiler with Tango. The simplest is to get it from the Tango official download site. The “current release (including DMD 1.033)” is what I’m actually using. Here is a direct link to download it.

Blaze source code. In this article I’m hacking trunk revision 409 so I encourage you to get the same version. You can do so using either a Subversion client: svn checkout -r 409 http://svn.dsource.org/projects/blaze/trunk blaze or the Blaze repository browser, the “Download in other formats: Zip Archive” link at the bottom of the page. Here’s a direct link if you wish to save some browsing.

Derelict. I’m using trunk revision 336, but I think any latest version will suffice. Get it either via SVN svn checkout http://svn.dsource.org/projects/derelict/trunk derelict or from the repository browser, direct link included.

SDL.dll. You can get a pre-built Windows version from the official downloads page.

That’s it. You won’t be needing anything else. Make sure you put DMD’s bin on the path, the rest of this article assumes that.

Build

Blaze comes with a testbed application and a set of DSSS configuration files. I won’t use DSSS though. I’ll use Jake instead—a minimalistic, one-step build utility similar to Bud and Rebuild but more robust because it relies upon DMD’s built-in dependency resolution system. And it’s simple. And it’s bundled with the Tango distribution mentioned above so you should already have it. In fact I’m using a similar tool of my own which works not only with Tango but with Phobos as well, but that’s another story. Jake is perfectly sufficient in the scope of this article.

I use the following command to build the testbed application:

jake blazeDemos.d -O -release -inline -profile -I..\blaze -I\home\snake\src\derelict\DerelictGL -I\home\snake\src\derelict\DerelictUtil -I\home\snake\src\derelict\DerelictGLU -I\home\snake\src\derelict\DerelictSDL

Of course you should change path to Derelict according to your setup. The build takes around 12 seconds on my 1.8 GHz Core2 Duo laptop. Don’t forget to copy the sdl.dll into the same directory where you build the testbed app. Now we are ready to

Profile!

Run the blazeDemos.exe. The app starts with the default domino demo. It runs slowly, much slower than without profiling enabled.

The profiling process only measures performance of code actually executed, so it is essential to let the application run for some time. I suggest you wait for everything stops moving except for the pendulum, then close the application. This way many code paths are executed and therefore measured, and the results are easy to reproduce. It takes around 30 seconds on my laptop for the scene to completely stabilize.

Let’s see what we’ve got. I want generic performance information now so I skip right to the overal timings section of trace.log starting with a row of equal signs. Here’s what I’ve got in several first lines:

======== Timer Is 3579545 Ticks/Sec, Times are in Microsecs ======== Num Tree Func Per Calls Time Time Call 1 9718996 2105879 2105879 __Dmain 859 1767855 1100816 1281 _D5blaze9collision5nbody12sortAndSweep12SortAndSweep6searchMFZv 859 741745 618140 719 _D10blazeDemos9drawSceneFZv 53649 654964 571310 10 _D5blaze9collision8pairwise11collidePoly14edgeSeparationFC5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormiC5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormZf 2789 628572 542171 194 _D5blaze8dynamics7contact13contactSolver13ContactSolver5_ctorMFS5blaze5world8TimeStepAC5blaze8dynamics7contact7contact7ContactiZC5blaze8dynamics7contact13contactSolver13ContactSolver 1 361446 361442 361442 _D10blazeDemos14createGLWindowFAaiiibZv 859 322915 279146 324 _D5blaze9collision5nbody12sortAndSweep12SortAndSweep9shellSortMFZv 15712 914581 240098 15 _D5blaze9collision8pairwise11collidePoly17findMaxSeparationFKiC5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormC5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormZf

Let’s see. First comes __Dmain , that is, main() . It’s Tree Time is 9718996 microseconds which is almost 10 seconds. main() is the application starting point, so everything else is called from it—the testbed application is single-threaded. The main ‘s tree time must be the total program execution time. But the measured program run time was 30 seconds. Why’s the difference? This is because profiling comes with huge overhead. That’s profiling code which runs 2/3 of the time. Without profiling the app would run 3 times faster. This is no surprise.

What actually surprises is the Func Time column. Func time is the total time spent in the body of this function but not in any functions it calls. As I explained earlier, this time is used to sort the function table to put most time consuming functions at the top of the list. And the top one is main ! Even more, according to the measurements main consumes more than 20% of execution time:

2.1 / 9.72 * 100 = 21.6%

This is really unexpected. main() contains the most outer loop, body of which executes once per frame, and it only calls a small number of other functions. It’s easy to see the main loop activity looking at main ‘s fan out:

__Dmain 0 34789587 7538091 1 _D10blazeDemos14createGLWindowFAaiiibZv 1 _D10blazeDemos10keyPressedFiZv 859 _D10blazeDemos11processTimeFZv 859 _D5tango4text7convert7Integer13__T6decodeTaZ6decodeFAaKaJaJiZv 859 _D5tango4text7convert7Integer16__T9formatterTaZ9formatterFAalaaiZAa 859 _D5tango4stdc7stringz9toStringzFAaAaZPa 859 _D5blaze5world5World4stepMFfiiZv 859 _D7dominos7Dominos6updateMFZv 859 _D10blazeDemos13processEventsFZv 859 _D10blazeDemos9drawSceneFZv 859 _D10blazeDemos8limitFPSFkZv

main() only calls 9 other functions, less than 900 times each. This shouldn’t take more than complicated collision detection. But it does. Here’s why.

Profiler takes into account only functions compiled with -profile switch. It is blind towards any other functions. If an instrumented function calls a non-instrumented one, the call time is added to the caller’s “func time” as if the called function were inlined.

Looking at main() , it is easy to come with a list of potentially offending functions:

glClear() A direct call to OpenGL dynamic library SDL_WM_SetCaption() A direct call to SDL dynamic library SDL_GL_SwapBuffers() Again, a direct call to SDL dynamic library

Let’s wrap each into a separate D function, like mainClean , mainCaption , and mainSwap . Don’t forget to disable inlining for this trick to work.

Important! Every run of an instrumented application updates trace.log and trace.def to include new results. But the application doesn’t check whether trace.log and trace.def are from the same application. If they’re not, or if they’re from another build of the same application, they get corrupted. I think there should be some sort of a check sum at the start of these files so that app can compare against its own check sum and overwrite incompatible files completely. For now though, always erase trace.log and trace.def after re-building your test application.

The trick works. main() takes 93rd place, consuming only modest 21 milliseconds. And the first place is taken by mainCaption . That’s it. Setting window caption takes more than 1/5 of execution time of an optimized application.

Finding out why setting caption is so slow is probably out of scope of this article. I’ll just assume there is a reason and hack around it. There is a variable, frames , which is set to zero every time a frame rate counter, fps_ , is refreshed. So I’ll add a check and will only update caption when this happens. Here’s the final code fragment:

if (frames == 0) { // Create fps caption char[] title = "Blaze Demo - Press Keys 0-9, q-e To Switch Demos - FPS: "; title ~= Integer.format (new char[32], fps_); void mainCaption() { SDL_WM_SetCaption(toStringz(title), null); } mainCaption(); }

Rebuild, erase trace.*, run. Now mainCaption is at 28th place with only 1/4 second consumed. The first place is rightly taken by solveVelocityConstraints which is probably computation-intensive.

Final test, with inlining. Compile, erase, run:

======== Timer Is 3579545 Ticks/Sec, Times are in Microsecs ======== Num Tree Func Per Calls Time Time Call 829 1691967 1065306 1285 _D5blaze9collision5nbody12sortAndSweep12SortAndSweep6searchMFZv 829 715255 600852 724 _D10blazeDemos9drawSceneFZv 52839 647945 565433 10 _D5blaze9collision8pairwise11collidePoly14edgeSeparationFC5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormiC5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormZf 2729 619133 535539 196 _D5blaze8dynamics7contact13contactSolver13ContactSolver5_ctorMFS5blaze5world8TimeStepAC5blaze8dynamics7contact7contact7ContactiZC5blaze8dynamics7contact13contactSolver13ContactSolver 1 345129 345126 345126 _D10blazeDemos14createGLWindowFAaiiibZv 829 305158 264505 319 _D5blaze9collision5nbody12sortAndSweep12SortAndSweep9shellSortMFZv 15472 902121 235004 15 _D5blaze9collision8pairwise11collidePoly17findMaxSeparationFKiC5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormC5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormZf 2729 355048 223354 81 _D5blaze8dynamics7contact13contactSolver13ContactSolver24solvePositionConstraintsMFfZb 8187 251732 218908 26 _D5blaze8dynamics7contact13contactSolver13ContactSolver24solveVelocityConstraintsMFZv 16368 251190 217956 13 _D5blaze9collision8pairwise8distance8inPointsFS5blaze6common4math5bVec2AS5blaze6common4math5bVec2iZb 4945 467027 170846 34 _D5blaze9collision8pairwise8distance103__T15distanceGenericTC5blaze9collision6shapes7polygon7PolygonTC5blaze9collision6shapes7polygon7PolygonZ15distanceGenericFKS5blaze6common4math5bVec2KS5blaze6common4math5bVec2C5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormC5blaze9collision6shapes7polygon7PolygonS5blaze6common4math6bXFormZf 2726 1784432 167115 61 _D5blaze8dynamics6island6Island5solveMFS5blaze5world8TimeStepS5blaze6common4math5bVec2bZv 1 7668760 165340 165340 __Dmain

main() is at 13th place, consuming only 165 milliseconds in total—all title updates included because of inlining. Clean measured run time is now 7.67 seconds instead of 9.72. That’s more than 20% speed up. Not bad at all.

Afterword

You’ve probably heard of premature optimization. It’s when you waste time optimizing something just to find out later that it didn’t matter, but you could have had a 20% speed gain by writing a simple if in a completely different place you couldn’t even think of. Or when you realize that you must optimize your heavily optimized algorithm differently, essentially re-implementing it.

Don’t ever make this mistake. Use profiler.