In a quick test with optimization turned on, I got about 150 ms on an ancient AMD 64 X2 processor, and about 90 ms on a reasonably recent Intel i7 processor.

Then I did a little more to give some idea of one reason you might want to use C++. I unrolled four iterations of the loop, to get this:

    #include <stdio.h>
    #include <ctime>

    int main() {
        double a = 3.1415926, b = 2.718;
        double c = 0.0, d = 0.0, e = 0.0;
        int i, j;
        clock_t start, end;

        for (j = 0; j < 10; j++) {
            start = clock();
            for (i = 0; i < 100000000; i += 4) {
                a += b;
                c += b;
                d += b;
                e += b;
            }
            a += c + d + e;
            end = clock();
            printf("Time Cost: %fms\n", (1000.0 * (end - start)) / CLOCKS_PER_SEC);
        }
        printf("a = %lf\n", a);
        return 0;
    }

This let the C++ code run in about 44 ms on the AMD (I forgot to run this version on the Intel). Then I turned on the compiler's auto-parallelizer (-Qpar with VC++). This reduced the time a little further still, to about 40 ms on the AMD, and 30 ms on the Intel.
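As far as I can tell, most of the gain from unrolling comes from the four accumulators being independent of each other, so the processor can overlap the additions instead of waiting for each one to finish before starting the next. Compilers generally won't make this change for you with floating-point values unless you let them reorder the arithmetic, because floating-point addition isn't associative. Here is a rough sketch of the same idea applied to summing an array. The names sum_single and sum_four are made up for illustration and are not part of the benchmark above:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One accumulator: every addition depends on the result of the previous one.
double sum_single(const std::vector<double>& v) {
    double total = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i)
        total += v[i];
    return total;
}

// Four accumulators: the four additions in each pass are independent.
double sum_four(const std::vector<double>& v) {
    double a = 0.0, b = 0.0, c = 0.0, d = 0.0;
    const std::size_t n4 = v.size() - v.size() % 4;  // largest multiple of 4 <= size
    assert(n4 % 4 == 0 && n4 <= v.size());
    std::size_t i = 0;
    for (; i < n4; i += 4) {
        a += v[i];
        b += v[i + 1];
        c += v[i + 2];
        d += v[i + 3];
    }
    for (; i < v.size(); ++i)  // leftover elements when size is not a multiple of 4
        a += v[i];
    return (a + b) + (c + d);
}
```

Because the additions happen in a different order, the two functions can differ in the last few bits for general inputs. That is the price of the reordering.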

Bottom line: if you want to use C++, you really need to learn how to use the compiler. If you want to get really good results, you probably also want to learn how to write better code.

I should add: I didn't attempt to test a version under JavaScript with the loop unrolled. Doing so might provide a similar (or at least some) speed improvement in JS as well. Personally, I think making the code fast is a lot more interesting than comparing JavaScript to C++.

If you want code like this to run fast, unroll the loop (at least in C++).

Since the subject of parallel computing arose, I thought I'd add another version using OpenMP. While I was at it, I cleaned up the code a little bit, so I could keep track of what was going on. I also changed the timing code a bit, to display the overall time instead of the time for each execution of the inner loop. The resulting code looked like this:

    #include <stdio.h>
    #include <ctime>

    int main() {
        double total = 0.0;
        double inc = 2.718;
        int j;
        clock_t start, end;

        start = clock();

        #pragma omp parallel for reduction(+:total) firstprivate(inc)
        for (j = 0; j < 10; j++) {
            double a = 0.0, b = 0.0, c = 0.0, d = 0.0;
            // i is declared inside the parallel loop so each thread gets its own
            // copy; declared outside, it would be shared and the threads would race.
            for (int i = 0; i < 100000000; i += 4) {
                a += inc;
                b += inc;
                c += inc;
                d += inc;
            }
            total += a + b + c + d;
        }

        end = clock();
        printf("Time Cost: %fms\n", (1000.0 * (end - start)) / CLOCKS_PER_SEC);
        printf("a = %lf\n", total);
        return 0;
    }

The primary addition here is the following (admittedly somewhat arcane) line:

    #pragma omp parallel for reduction(+:total) firstprivate(inc)

This tells the compiler to execute the outer loop in multiple threads, with a separate copy of inc for each thread, and to add together the threads' individual values of total after the parallel section.
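To make the roles of the two clauses a little more concrete, here is a small sketch on a simpler problem. The function name sum_of_squares and the variable scale are hypothetical, chosen only for illustration. It uses integer arithmetic so the result is the same however many threads run it. If OpenMP isn't enabled, the pragma is ignored and the loop simply runs serially:

```cpp
#include <cassert>

// Sum of k*k for k in [1, n], computed with an OpenMP reduction.
long long sum_of_squares(int n) {
    assert(n >= 0);
    long long total = 0;
    long long scale = 1;  // plays the role of inc: read by every thread, never modified

    // reduction(+:total):  each thread works on a private total that starts at 0;
    //                      the private copies are added into total at the end.
    // firstprivate(scale): each thread gets its own copy of scale,
    //                      initialized from the original value.
    #pragma omp parallel for reduction(+:total) firstprivate(scale)
    for (int k = 1; k <= n; ++k) {
        total += scale * k * k;
    }
    return total;
}
```

Calling sum_of_squares(10) gives 385 with or without OpenMP enabled. Only the running time should change.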

The result is about what you'd probably expect. If we don't enable OpenMP with the compiler's -openmp flag, the reported time is about 10 times what we saw for individual executions previously (409 ms for the AMD, 323 ms for the Intel). With OpenMP turned on, the times drop to 217 ms for the AMD, and 100 ms for the Intel.
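One caveat about the timing if you try this with a different compiler: the standard defines clock() as processor time, and on many non-Windows systems that is summed over all threads, so a multithreaded run can appear to show little or no speedup. VC++'s clock() reports elapsed wall-clock time instead. A portable way to measure wall-clock time is std::chrono::steady_clock. This is a minimal sketch, with the helper name elapsed_ms invented for illustration:

```cpp
#include <cassert>
#include <chrono>

// Runs work() once and returns the elapsed wall-clock time in milliseconds.
template <typename F>
double elapsed_ms(F work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto end = std::chrono::steady_clock::now();
    const double ms = std::chrono::duration<double, std::milli>(end - start).count();
    assert(ms >= 0.0);  // steady_clock never goes backwards
    return ms;
}
```

You would wrap the whole parallel loop in a lambda and pass it to elapsed_ms, then print the returned value in place of the clock() arithmetic.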

So, on the Intel the original version took 90 ms for one iteration of the outer loop. With this version we're getting just slightly longer (100 ms) for all 10 iterations of the outer loop, an improvement in speed of about 9:1. On a machine with more cores, we could expect even more improvement (OpenMP will normally take advantage of all available cores automatically, though you can manually tune the number of threads if you want).
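For the manual tuning, one option is the num_threads clause on the pragma; another is setting the OMP_NUM_THREADS environment variable before running the program. Here is a minimal sketch of the first. The function names default_thread_count and count_iterations are arbitrary and purely illustrative, and the _OPENMP guard keeps the code compiling when OpenMP is turned off:

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>  // only needed (and only used) when OpenMP is enabled
#endif

// Number of threads OpenMP would use by default (1 if OpenMP is disabled).
int default_thread_count() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

// Counts loop iterations, asking OpenMP for `threads` threads.
int count_iterations(int n, int threads) {
    assert(threads >= 1);
    int count = 0;
    #pragma omp parallel for num_threads(threads) reduction(+:count)
    for (int i = 0; i < n; ++i) {
        count += 1;
    }
    return count;
}
```

Timing the benchmark with a few different thread counts is a quick way to see where adding threads stops helping on a particular machine.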