leonardo

Subject: LLVM-GCC Vs GCC
Date: December 14th, 2008, 09:54 pm
Tags: benchmark, c, gcc, llvm

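So far the Shootout site has refused to add a comparison between the LLVM compiler and the other ones:

http://shootout.alioth.debian.org/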



So I have done a few benchmarks myself, using (mostly) the C code. The code used is exactly the same for both compilers.







C source code (from the Shootout site), and timings in OpenOffice format:

http://www.fantascienza.net/leonardo/js/llvm_vs_gcc.zip



In the nbody benchmark there's a large difference, and I don't know its origin. (I hope LLVM will fix those problems, and I hope LLVM will someday support exceptions on Windows too.)



For LLVM-gcc it's generally better to compile with -msse3 (without it some timings become quite bad, especially for the mandelbrot benchmark).

Compilers used:
  LLVM-gcc V. 2.4
  GCC V. 4.2.1-dw2 (mingw32-2)

Compiler options used:
  GCC:      -O3 -s -fomit-frame-pointer
  LLVM-gcc: -O3 -s -fomit-frame-pointer
Benchmarks using FP numbers are compiled with -msse3 too.

CPU used: Intel Core2, 2 GHz (32-bit mode).
All benchmarks use only 1 core.

TIMINGS GCC, best of 3:
  bintrees, n=15:                       4.24 s
  fannkuck, n=11:                       5.24 s
  fasta, n=9_000_000 (> NUL):           3.76 s
  fasta, n=9_000_000 (a):               4.17 s
  k_nucleotide, (d):                    4.63 s
  mandelbrot, (c) n=4_000:              2.49 s
  meteor_contest_ccp, n=2_098:          0.12 s
  meteor_contest_c, n=2_098:            0.17 s
  nbody, (c) n=10_000_000:              5.92 s
  nsieve, n=12:                         5.47 s
  nsieve_bits, n=13:                    4.31 s
  partial_sums, (c) n=7_000_000:        5.77 s
  recursive, (c) n=12:                  5.82 s
  reverse_complement (b) (> NUL):       1.77 s
  reverse_complement (b) (a):           2.54 s
  spectral_norm, (c) n=3000:            6.78 s
  sum_file, input=71_974_912 bytes:     2.28 s

TIMINGS LLVM-gcc, best of 3:
  bintrees, n=15:                       4.26 s
  fannkuck, n=11:                       5.45 s
  fasta, n=9_000_000 (> NUL):           3.69 s
  fasta, n=9_000_000 (a):               4.01 s
  k_nucleotide, (d):                    4.71 s
  mandelbrot, (c) n=4_000:              2.40 s
  meteor_contest_ccp, n=2_098:          0.13 s
  meteor_contest_c, n=2_098:            0.14 s
  nbody, (c) n=10_000_000:             16.63 s
  nsieve, n=12:                         5.47 s
  nsieve_bits, n=13:                    4.15 s
  partial_sums, (c) n=7_000_000:        6.52 s
  recursive, (c) n=12:                  6.47 s
  reverse_complement (b) (> NUL):       1.90 s
  reverse_complement (b) (a):           2.60 s
  spectral_norm, (c) n=3000:            5.96 s
  sum_file, input=71_974_912 bytes:     3.28 s

Key:
  (a) = to no existing output file.
  (b) = input generated by fasta with N=9_000_000.
  (c) = compiled with -msse3 too.
  (d) = from fasta file n=1_000_000

Note, useful as reference point:
  nbody.java, N=10_000_000: 5.48 s
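To see at a glance where the two compilers differ most, the timings can be turned into LLVM-gcc / GCC ratios. The following is only a small sketch in C: the numbers are copied from the two tables above, while the struct, the function names and the short benchmark labels are made up for this example.

```c
#include <stdio.h>
#include <stddef.h>

/* One row per benchmark: label, GCC time and LLVM-gcc time in seconds,
   copied from the two "best of 3" tables. The labels are shortened. */
struct timing {
    const char *name;
    double gcc_s;
    double llvm_s;
};

const struct timing timings[] = {
    {"bintrees",                   4.24,  4.26},
    {"fannkuck",                   5.24,  5.45},
    {"fasta (> NUL)",              3.76,  3.69},
    {"fasta (a)",                  4.17,  4.01},
    {"k_nucleotide",               4.63,  4.71},
    {"mandelbrot",                 2.49,  2.40},
    {"meteor_contest_ccp",         0.12,  0.13},
    {"meteor_contest_c",           0.17,  0.14},
    {"nbody",                      5.92, 16.63},
    {"nsieve",                     5.47,  5.47},
    {"nsieve_bits",                4.31,  4.15},
    {"partial_sums",               5.77,  6.52},
    {"recursive",                  5.82,  6.47},
    {"reverse_complement (> NUL)", 1.77,  1.90},
    {"reverse_complement (a)",     2.54,  2.60},
    {"spectral_norm",              6.78,  5.96},
    {"sum_file",                   2.28,  3.28},
};

#define N_TIMINGS (sizeof timings / sizeof timings[0])

/* A ratio above 1.0 means LLVM-gcc was slower than GCC on that benchmark. */
double llvm_over_gcc(const struct timing *t)
{
    return t->llvm_s / t->gcc_s;
}

/* Index of the benchmark with the largest LLVM-gcc / GCC ratio. */
size_t worst_for_llvm(void)
{
    size_t worst = 0;
    for (size_t i = 1; i < N_TIMINGS; i++)
        if (llvm_over_gcc(&timings[i]) > llvm_over_gcc(&timings[worst]))
            worst = i;
    return worst;
}

/* Print one "label ratio" line per benchmark. */
void print_ratios(void)
{
    for (size_t i = 0; i < N_TIMINGS; i++)
        printf("%-28s %5.2f\n", timings[i].name, llvm_over_gcc(&timings[i]));
}
```

Calling print_ratios() from a main() prints one line per benchmark. The largest ratio is the one for nbody, about 2.8; most of the others stay close to 1.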

After a suggestion I have compiled all the programs again with a more fitting -march:

  llvm-gcc -O3 -s -fomit-frame-pointer -msse3 -march=core2
Or:
  llvm-g++ -O3 -s -fomit-frame-pointer -msse3 -march=core2

Some timings are changed a little:

TIMINGS LLVM-gcc core2, best of 3:
  fasta, n=9_000_000 (> NUL):       3.69 s ==> 2.75 s
  fasta, n=9_000_000 (a):           4.01 s ==> 3.08 s
  reverse_complement (b) (> NUL):   1.90 s ==> 1.88 s
  reverse_complement (b) (a):       2.60 s ==> 2.97 s



So overall there's an improvement. Ignoring the timings for nbody, the total of the other timings (with -march=core2) is close enough to the total for gcc.
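The totals can be checked with a few lines of C. This is just a sketch: it assumes that the LLVM-gcc timings not listed in the -march=core2 table stayed the same as in the first table (only the four fasta and reverse_complement values are replaced), and the array and function names are invented for the example.

```c
#include <stdio.h>
#include <stddef.h>

/* GCC timings in seconds, in table order, with nbody left out. */
const double gcc_s[] = {
    4.24, 5.24, 3.76, 4.17, 4.63, 2.49, 0.12, 0.17,
    5.47, 4.31, 5.77, 5.82, 1.77, 2.54, 6.78, 2.28
};

/* LLVM-gcc timings in the same order, nbody left out.
   ASSUMPTION: only the two fasta and the two reverse_complement values
   changed with -march=core2; all the others are the original ones. */
const double llvm_core2_s[] = {
    4.26, 5.45, 2.75, 3.08, 4.71, 2.40, 0.13, 0.14,
    5.47, 4.15, 6.52, 6.47, 1.88, 2.97, 5.96, 3.28
};

#define N_BENCH (sizeof gcc_s / sizeof gcc_s[0])

/* Sum of n timings. */
double total(const double *t, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += t[i];
    return sum;
}

/* Print both totals, in seconds. */
void print_totals(void)
{
    printf("GCC total:            %.2f s\n", total(gcc_s, N_BENCH));
    printf("LLVM-gcc core2 total: %.2f s\n", total(llvm_core2_s, N_BENCH));
}
```

Under that assumption both totals come out at about 59.6 s (59.56 s for GCC, 59.62 s for LLVM-gcc).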



In the meantime the LLVM developers have found the problem with nbody (and filed a performance bug report): the compiler doesn't inline the sqrt() in the following line, which is in the hottest loop:

double distance = sqrt(dx * dx + dy * dy + dz * dz);

See:

http://www.llvm.org/PR3219



For the Java code of 'nbody' see also here:

http://shootout.alioth.debian.org/gp4/benchmark.php?test=nbody&lang=all

You can find that reformatted nbody Java code in the zip too.



Using ideas from the following two pages:

http://weblogs.java.net/blog/kohsuke/archive/2008/03/deep_dive_into.html

http://blogs.tedneward.com/2008/04/06/The+Complexities+Of+Black+Boxes.aspx

Installing the "self-extracting DEBUG Jar file", and then using:

java -XX:+PrintOptoAssembly -server -cp . nbody

I was able to find the asm code produced by the JavaVM for the nbody benchmark. It essentially uses only the SSE registers, and no floating point stack. It contains three inlined calls to sqrt (but the program contains only two of them). At first look, that asm doesn't look much different from the asm produced by LLVM-gcc (but LLVM-gcc doesn't inline the call to sqrt).

I have seen that the latest C++ version of nbody (you can find it inside the zip too) compiled with LLVM-gcc is able to run in 4.98 s, but it uses a lot of intrinsics like __builtin_ia32_haddpd(), which will not be the best for future CPUs (while the Java code is perfectly general); in practice it's partially asm already.



Update 1: I have added CPU used, compiler version used, changed the title of the post a little.



Update 2: I have cleaned up timings and the graph, leaving only the ones with -msse3 for benchmarks that use FP numbers.



Update 3: I have added timings for -march=core2, link to bug #3219, and fixed the key a little.



Update 4, Dec 19: I have added the Java code, the related asm and some comments.



See a follow up: http://leonardo-m.livejournal.com/77877.html

Comments:

ext_138141 Time: 2008-12-14 10:31 pm (UTC)
Thanks, very useful! Maybe you should post these results on the LLVM dev list to let them fix the problems with nbody?

I've only tried the nbody benchmark so far and I got similar results:

I have no exact numbers, but with gcc as 1.0 I had for LLVM

on an AMD 64 X2 4400+: around 1.5

on a Core2Duo (don't know the specs): around 2.0

I even compared llvm-gcc with my own compiler (with an LLVM backend) and both produce approximately the same results here. So I don't think that it's something frontend related.

On which CPU did you run the tests?



leonardo_m Time: 2008-12-14 10:58 pm (UTC)
You are welcome. The CPU used is a Core2 at 2 GHz.

Later I may show such results to the LLVM dev list.

Your ratios are less extreme than mine.

(Anonymous) Subject: Great data Time: 2008-12-15 12:22 am (UTC)
Very useful data.

I'm wondering how Java manages to beat even GCC at what I presume is a numerical benchmark. Is the Java version using some kind of special native code numerics lib or something?

ext_138158 Subject: Re: Great data Time: 2008-12-15 12:23 am (UTC)
That was me by the way.



leonardo_m Subject: Re: Great data Time: 2008-12-16 05:17 pm (UTC)
It could be interesting to read the asm instructions run by the JavaVM here.



leonardo_m Subject: Re: Great data Time: 2008-12-19 01:08 pm (UTC)
I have added the asm coming from the JavaVM to the zip.



leonardo_m Subject: Re: Great data Time: 2008-12-15 12:26 am (UTC)
No, Java isn't using anything special (Java 1.6.0_06).

The same result can be seen on the Shootout site too, test it yourself:

http://shootout.alioth.debian.org/gp4/benchmark.php?test=nbody&lang=all

igouy Subject: Re: Great data Time: 2008-12-15 03:53 am (UTC)
Those old benchmarks game measurements also show Oberon-2 nbody (source code translated to C then compiled with GCC) and G++ nbody faster than Java.

(Anonymous) Subject: SSE vs FP Stack Time: 2008-12-15 02:46 am (UTC)
It looks like your llvm-gcc is defaulting to targeting really old CPUs. Make sure to pass -msse2 or later to all compiles. If you build llvm-gcc yourself you can get this by configuring llvm-gcc with '--with-arch=nocona --with-tune=generic'. This is how llvm-gcc is built on Mac OS/X for example.



leonardo_m Subject: Re: SSE vs FP Stack Time: 2008-12-15 12:28 pm (UTC)
> It looks like your llvm-gcc is defaulting to targeting really old CPUs.

You are probably right, but I have used the llvm-gcc pre-compiled for Win as it comes from the site.

> Make sure to pass -msse2 or later to all compiles.

Thank you for the suggestion. As you can see I have already used sse3 in most of the benchmarks that use floating point numbers: mandelbrot_sse3, nbody_sse3 and partial_sums_sse3 (one of such timings isn't present in the graph). The only missing ones are recursive and spectral_norm:

Timings 'recursive':
  GCC, n=12:                 5.88 s
  GCC, -msse3, n=12:         5.82 s
  LLVM-gcc, n=12:            7.95 s
  LLVM-gcc, -msse3, n=12:    6.47 s

Timings 'spectral_norm':
  GCC, n=3_000:              6.78 s
  GCC, -msse3, n=3_000:      6.78 s
  LLVM-gcc, n=3_000:         6.70 s
  LLVM-gcc, -msse3, n=3_000: 5.96 s

I'll soon update (and clean up) the graph with this new data.

_asl_ Subject: Re: SSE vs FP Stack Time: 2008-12-16 09:18 am (UTC)
> You are probably right, but I have used the llvm-gcc pre-compiled for Win as it comes from the site.

Correct, we need to support everything for pre-compiled binaries, thus it was built to generate 'generic' i686 code by default. -msseN is not always enough - please consider adding -march=foo compiler option.



leonardo_m Subject: Re: SSE vs FP Stack Time: 2008-12-16 05:02 pm (UTC)
I have compiled all programs with:

llvm-gcc -O3 -s -fomit-frame-pointer -msse3 -march=core2

(or llvm-g++).

With the following good/bad changes:

  fasta, n=9_000_000 (> NUL):       3.69 s ==> 2.75 s
  fasta, n=9_000_000 (a):           4.01 s ==> 3.08 s
  reverse_complement (b) (> NUL):   1.90 s ==> 1.88 s
  reverse_complement (b) (a):       2.60 s ==> 2.97 s

Unfortunately the comparison is still skewed now, because my modern MinGW (based on GCC 4.2.1) doesn't support core2.

I'll update the page soon.

_asl_ Subject: Re: SSE vs FP Stack Time: 2008-12-16 05:43 pm (UTC)
What about -march=nocona?



leonardo_m Subject: Re: SSE vs FP Stack Time: 2008-12-16 06:00 pm (UTC)
I generally avoid using things that I don't understand. What's -march=nocona for?

igouy Subject: Misleading allegation Time: 2008-12-15 03:47 am (UTC)
> the Shootout site has refused to add a comparison between
> the LLVM compiler and the other ones (in particular GCC.
> While it compares the Intel compiler against GCC)

The only C implementation in the current benchmarks game is GCC.

http://alioth.debian.org/forum/message.php?msg_id=181218



leonardo_m Subject: Re: Misleading allegation Time: 2008-12-15 10:02 am (UTC)
> The only C implementation in the current benchmarks game is GCC.

So, it's time to add the LLVM too to the tested backends.

igouy Subject: Re: Misleading allegation Time: 2008-12-15 06:05 pm (UTC)
FAQ Why don't you include language X?

"We have no ambition to measure every Python implementation or every Haskell implementation or every C implementation - that's a chore for Python enthusiasts and Haskell enthusiasts and C enthusiasts."

GCC works fine as an example C implementation for the benchmarks game.

You want to compare C implementations and I applaud you actually bothering to make the timings needed - but don't criticize others for not taking on that chore.

http://shootout.alioth.debian.org/u32q/faq.php#measurementscripts



leonardo_m Subject: Re: Misleading allegation Time: 2008-12-15 06:28 pm (UTC)
> FAQ Why don't bla bla bla...

I don't care about your FAQ. A FAQ isn't a replacement for human kindness, or even common sense. At the moment your site is probably the best of its kind, so it's seen by a lot of people as a reference, and it's used a lot. Hopefully some people will create a site more open than yours.

igouy Subject: Re: Misleading allegation Time: 2008-12-15 06:48 pm (UTC)
> A FAQ isn't a replacement for human kindness, or even common sense

Was it unkind to applaud you actually bothering to make the timings needed?

> Hopefully some people will create a site more open than yours.

You could always do that yourself, or create a C comparison like the Great Ruby Shootout

http://antoniocangiano.com/2008/12/09/the-great-ruby-shootout-december-2008/

(Anonymous) Subject: Analysis of n-body Time: 2008-12-16 06:48 am (UTC)
There's been some analysis of the slowness you're experiencing on n-body and some of the other floating-point benchmarks, and we think we've found the explanation:

On Linux, math functions like sqrt() set the errno global variable, which means they cannot be lowered into native instructions. GCC does something smart and emits code to determine if errno will not be set (in the common case), and if so uses a hardware instruction. Otherwise it calls the real function. LLVM does not yet do this.

For what it's worth, this problem does not manifest on Mac OS X, where the math functions do not set errno, so the call is always lowered to hardware instructions where possible.

You can track this problem at http://www.llvm.org/PR3219

--Owen Anderson



leonardo_m Subject: Re: Analysis of n-body Time: 2008-12-16 05:01 pm (UTC)
Thank you. I have had a look at the llvm-dev mailing list too. It seems an important enough "performance bug". I hope it will be improved.
