A realtime Mandelbrot zoomer in SSE assembly and CUDA



Yep, looks boring when it doesn't move...

(Written around 2002, enhanced with OpenMP and CUDA in 2009)

Back in 2002, I wanted to try out SSE programming. I bought and installed an Athlon-XP based motherboard, and over the better part of a weekend, I created a simple implementation of a Mandelbrot zoomer in native SSE assembly. I was happy to see that it was almost 3 times faster than pure C - even when compared to pipelined floating point calculations.

Many years later, I revisited the code and enhanced it to use the GNU autotools; it therefore became portable to all Intel platforms under the sun (Linux/x86, Mac OS/X, Windows (via MinGW gcc), OpenBSD/amd64 and FreeBSD/amd64). Benchmarks showed - as expected - that OSes had very little to do with such "pure" calculations as this; speed was pretty much the same under all of them (I used libSDL for portable HW-assisted "blitting" of the pixel data).

Recently (2009), I learned how to use OpenMP and Intel TBB to make use of additional CPUs/cores - so I added OpenMP #pragmas to both the pure-C and SSE modes... and smiled as I saw my little Atom330 get a boost from 58 to around 150 frames per second. The manually written SSE code was also using GCC's AT&T syntax, so it could be automatically inlined by GCC even at -O3 optimizations.

I thought that this was the limit - what else could I do now but to sit back and wait for CPU cores to multiply?... Perhaps buy an i7?

Nope. Wrong.

Last weekend, I got to play with an NVIDIA GT240 (around $100). Having read a lot of blogs about GPU programming, I downloaded the CUDA SDK and started reading some samples.

In less than one hour, I went from my rather complex SSE inline assembly, to a simple, clear Mandelbrot implementation... that ran... 15 times faster!

Let me say this again: a 15x speedup. Jaw dropping. Or, put a different way: I went from 147fps at 320x240... to 210fps... at 1024x768! In pixel throughput, that's (210 x 1024 x 768) / (147 x 320 x 240), i.e. roughly 14.6 times the work per second.

I only have one comment for my fellow developers:

The algorithm in question was perfect for a CUDA implementation. You won't always get this kind of speedup (let alone while also ending up with clearer and significantly less code).

But regardless of that, you must start looking into GPGPU coding: CUDA (and soon, OpenCL) can become huge game changers.

If you don't, you are in danger of missing... significant opportunities...

Update, one week later: It turned out that my GT240 is so ...fresh, that the Linux drivers aren't yet up to speed - they are driving my card at Performance Level 1 instead of 2 (i.e. at 324MHz memory clock instead of 1700MHz!). Under Windows, the code runs at an astonishing 420 frames per second, so the speedup is not 15x, it is in fact more like 30x...

Update, March 2010: The newest NVIDIA drivers (195.36.15) have corrected the bug - the Linux and Windows versions now run at identical speeds: Blazing ones :‑)

Downloads

SSE version

The portable GPL source code for the SSE version is here

Compile and run with...

./configure && make && ./src/mandelSSE

You can zoom-in/zoom-out with the left and right mouse buttons. Hit ESC and it will report the frames per second achieved.

If you are a Windows user, here is a Win32 binary (compiled via TDM/MinGW, and compressed with 7-zip). To use it, simply unzip the package and run mandelSSE.exe.

CUDA version

The CUDA package (portable GPL source code) is also available. The program uses OpenGL to blit the CUDA-generated data to the screen - this way there's no moving of pixel data back and forth between the CPU and the GPU: the pixels are computed in GPU global memory and drawn straight from it (via a texture).

If you are working under Windows, a precompiled Windows binary is here (compressed with 7-zip). Make sure you disable Vertical Sync in order to get the maximum frame rate.

Code comparison

Just so you understand what I meant by "getting a speedup with clearer and simpler code", this is the CUDA core loop, from inside src/mandel.cu - if you have ever coded a Mandelbrot zoomer, you'll probably agree that this is one of the simplest ways to implement it:

    __global__ void CoreLoop(
        int *p,
        float xld, float yld,
        float xru, float yru,
        int MAXX, int MAXY)
    {
        float re, im, rez, imz;
        float t1, t2, o1, o2;
        int k;
        unsigned result = 0;
        unsigned idx = blockIdx.x*blockDim.x + threadIdx.x;
        int y = idx / MAXX;
        int x = idx % MAXX;

        re = (float) xld + (xru - xld)*x/MAXX;
        im = (float) yld + (yru - yld)*y/MAXY;
        rez = 0.0f;
        imz = 0.0f;
        k = 0;
        while (k < ITERA) {
            o1 = rez*rez;
            o2 = imz*imz;
            t2 = 2*rez*imz;
            t1 = o1 - o2;
            rez = t1 + re;
            imz = t2 + im;
            if (o1 + o2 > 4) {
                result = k;
                break;
            }
            k++;
        }
        p[y*MAXX + x] = lookup[result];
    }

Now compare that to the SSE main loop (brace yourselves):

            mov      ecx, 0
            movaps   xmm5, [fours]
            movaps   xmm6, [re]
            movaps   xmm7, [im]
            xorps    xmm0, xmm0
            xorps    xmm1, xmm1
            xorps    xmm3, xmm3
    loop1:
            movaps   xmm2, xmm0
            mulps    xmm2, xmm1
            mulps    xmm0, xmm0
            mulps    xmm1, xmm1
            movaps   xmm4, xmm0
            addps    xmm4, xmm1
            subps    xmm0, xmm1
            addps    xmm0, xmm6
            movaps   xmm1, xmm2
            addps    xmm1, xmm1
            addps    xmm1, xmm7
            cmpltps  xmm4, xmm5
            movaps   xmm2, xmm4
            movmskps eax, xmm4
            andps    xmm4, [ones]
            addps    xmm3, xmm4
            or       eax, eax
            jz       short nomore
            inc      ecx
            cmp      ecx, 119
            jnz      short loop1

Convinced? :‑)

Not only was this SSE code a pain to write, and impossible to maintain... It is now an order of magnitude slower than what you can get with cheap GPUs!

Times have changed (yet again)...



Last update on: Sun Nov 10 10:01:53 2019
