As a developer who is not responsible for many infrastructure modules, I shouldn’t really waste so much time micro-benchmarking. Profiling my code after the fact should be sufficient, if even that is necessary.

But it seems I’ve got into a bad habit.

I’m not the only person who thinks that MooseX loading all those prerequisites is what takes so much time. Dave Rolsky says in the latest Moose blog post:

What’s a bit sad, however, is that this only appears to save 6-10% of the compilation time in real usage. I’m not sure why that is, but I think the issue is that the perl core’s compilation time may be a big factor. As I said, loading that one Markdent modules loads a lot of modules (203 individual .pm files), and the raw overhead of simply reading all those files and compiling them may be a significant chunk of the compilation time.

That gets me thinking about the speed of the perl compilation step. Although a string eval doesn’t incur the same disk IO as use, it still exercises the compiler.
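To make that concrete, here is a minimal sketch (my own example, not from the Moose post) of how a string eval invokes the compiler at run time, where a normal sub is compiled once at startup:

```perl
use strict;
use warnings;

# Build a tiny subroutine body as a string, then compile it at run time.
# Every string eval runs the full parse/compile step before executing.
my $src = 'sub { my ($x) = @_; $x * 2 }';
my $doubler = eval $src or die $@;
print $doubler->(21), "\n";   # 42
```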

Loop Unrolling with Eval

Does anyone remember back in the day when we used to unroll loops? For example, if we had an instruction sequence that was do_something, add, jne and do_something was relatively short, you could save some time unrolling that to do_something, add, do_something, add, jne – you would only have to execute approximately half as many jne instructions across a loop lifetime.

That is something I never worry about in perl. Unless I have to come up with contrived examples for my blog.

sub eval_loop {
    my $count = 0;
    my $s = '';
    for (my $i = 0; $i < 100_000; ++$i) {
        $s .= '$count += ' . $i . ";\n";
    }
    eval $s;
    $count;
}

sub vanilla_loop {
    my $count = 0;
    for (my $i = 0; $i < 100_000; ++$i) {
        $count += $i;
    }
    $count;
}

my $code = '';
for (my $i = 0; $i < 100_000; ++$i) {
    $code .= '$count += ' . $i . ";\n";
}
$code = 'sub totally_unrolled { my $count = 0;' . "\n" . $code . '$count; };';
eval $code;

eval_loop compiles the unrolled loop every time the subroutine is called, whereas totally_unrolled is compiled once, before the benchmark runs.
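There is also a middle ground I didn’t benchmark: cache the compiled code ref so that only the first call pays the string-building and compilation cost. A sketch (the name eval_loop_cached is my own):

```perl
use strict;
use warnings;

# Build and eval the code string only on the first call,
# then reuse the cached code ref on every later call.
{
    my $compiled;   # private cache, shared across calls
    sub eval_loop_cached {
        $compiled ||= do {
            my $s = 'sub { my $count = 0;' . "\n";
            $s .= '$count += ' . $_ . ";\n" for 0 .. 99_999;
            $s .= '$count }';
            eval $s or die $@;
        };
        $compiled->();
    }
}

print eval_loop_cached(), "\n";   # 4999950000
```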

I also wanted some partially unrolled loops for comparison’s sake. Obviously these examples are not suitable for production use: consider what would happen if unrolled_loop_4 needed to execute a number of iterations that was not a multiple of 4.

sub unrolled_loop_2 {
    my $count = 0;
    for (my $i = 0; $i < 100_000; ++$i) {
        $count += $i;
        $count += ++$i;
    }
    $count;
}

sub unrolled_loop_4 {
    my $count = 0;
    for (my $i = 0; $i < 100_000; ++$i) {
        $count += $i;
        $count += ++$i;
        $count += ++$i;
        $count += ++$i;
    }
    $count;
}
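For what it’s worth, a production-safe 4x unrolling needs an epilogue loop to mop up the leftover iterations when the trip count isn’t a multiple of 4. A sketch (unrolled_loop_4_safe is a hypothetical name; it takes the bound as a parameter rather than hard-coding 100_000):

```perl
use strict;
use warnings;

sub unrolled_loop_4_safe {
    my ($n) = @_;
    my $count = 0;
    my $i = 0;
    # Main unrolled body: four additions per iteration,
    # only while a full group of four remains.
    for (; $i + 3 < $n; $i += 4) {
        $count += $i;
        $count += $i + 1;
        $count += $i + 2;
        $count += $i + 3;
    }
    # Epilogue: handle the 0-3 leftover iterations one at a time.
    for (; $i < $n; ++$i) {
        $count += $i;
    }
    $count;
}

print unrolled_loop_4_safe(100_000), "\n";   # 4999950000
print unrolled_loop_4_safe(10), "\n";        # 45
```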

Benchmarking

The full benchmarking code is as follows. I print the result of each function to check that I’m getting the same answer from each.

use Benchmark qw(:hireswallclock);
use strict;
use warnings;

sub vanilla_loop {
    my $count = 0;
    for (my $i = 0; $i < 100_000; ++$i) {
        $count += $i;
    }
    $count;
}

sub unrolled_loop_2 {
    my $count = 0;
    for (my $i = 0; $i < 100_000; ++$i) {
        $count += $i;
        $count += ++$i;
    }
    $count;
}

sub unrolled_loop_4 {
    my $count = 0;
    for (my $i = 0; $i < 100_000; ++$i) {
        $count += $i;
        $count += ++$i;
        $count += ++$i;
        $count += ++$i;
    }
    $count;
}

sub eval_loop {
    my $count = 0;
    my $s = '';
    for (my $i = 0; $i < 100_000; ++$i) {
        $s .= '$count += ' . $i . ";\n";
    }
    eval $s;
    $count;
}

my $code = '';
for (my $i = 0; $i < 100_000; ++$i) {
    $code .= '$count += ' . $i . ";\n";
}
$code = 'sub totally_unrolled { my $count = 0;' . "\n" . $code . '$count; };';
eval $code;

print vanilla_loop(), "\n";
print unrolled_loop_2(), "\n";
print unrolled_loop_4(), "\n";
print eval_loop(), "\n";
print totally_unrolled(), "\n";

Benchmark::cmpthese(-1, {
    'vanilla'      => \&vanilla_loop,
    'unrolled-2'   => \&unrolled_loop_2,
    'unrolled-4'   => \&unrolled_loop_4,
    'eval'         => \&eval_loop,
    'unrolled-max' => \&totally_unrolled,
});

And the results are:

$ perl unrolling.pl
4999950000
4999950000
4999950000
4999950000
4999950000
                Rate  eval unrolled-2 unrolled-4 vanilla unrolled-max
eval          3.25/s    --       -93%       -95%    -95%         -96%
unrolled-2    48.1/s 1381%         --       -25%    -26%         -45%
unrolled-4    63.9/s 1865%        33%         --     -2%         -27%
vanilla       65.1/s 1902%        35%         2%      --         -26%
unrolled-max  87.7/s 2598%        82%        37%     35%          --

Conclusions

I am actually surprised that eval is as fast as it is. It runs less than 20 times slower than the vanilla loop, yet it has to construct a string by appending to it 100,000 times and then run the compiler on the result. To put it another way, it compiles and runs 100,000 lines of code in under a third of a second. Good job, perl compiler guys! Or did you put something in there for pathologically ridiculous examples like this?

And I was right never to worry about unrolling my loops: unrolled-2 is significantly slower than vanilla. As I can’t think of any good reason for that, it makes me worry that I’ve done something wrong.

Even fully unrolling the loop, which is quite ridiculous, gives me a relatively modest 35% speed increase.