In my blog post about Perl Weekly Challenge 54, posted on April 4, 2020, the section on the "extra credit" task concerning the Collatz conjecture described in some detail the difficulties I encountered when trying to cache the data: the volume of data is very large. I'm blogging again on the subject because of new findings.

The Collatz conjecture concerns a sequence defined as follows: start with any positive integer n. If a term is even, the next term is half of it; if a term is odd, the next term is 3 times that term plus 1. For example, the Collatz sequence for 23 is this:

23 70 35 106 53 160 80 40 20 10 5 16 8 4 2 1

The conjecture is that, no matter what the starting value of n is, the sequence will always reach 1. It is generally believed to be true (and no counter-example has ever been found), but, despite a lot of effort, nobody has been able to prove it, and this is deemed to be a very difficult problem.

Computing the Collatz sequence of a given number is fairly easy and can be done in a simple one-liner:

$ perl -E '$n = shift; print "$n "; while ($n != 1) { $n = $n % 2 ? 3 * $n + 1 : $n / 2; print "$n "} ' 26
26 13 40 20 10 5 16 8 4 2 1

The extra-credit task was to compute the sequence length for all starting numbers up to 1,000,000 (1e6), and to output the starting number and sequence length for the 20 longest sequences.

The Original Solution

In theory, it wouldn't be very complicated to wrap the above code in a loop to compute the Collatz sequence for any range of numbers. Except that going all the way up to 1,000,000 is probably going to take a very long time. One of the reasons is that we are going to recompute the successors of the same numbers again and again, many times over. If you look at the two examples above, both sequences end with the same series: 40 20 10 5 16 8 4 2 1. So, it might be useful, when we reach 40 for the first time, to compute the end of the sequence only once, and to store it in a hash of arrays (or possibly an array of arrays), so that we can retrieve it straight from the hash when we reach 40 again. And, of course, we can reuse that tail when computing the Collatz sequence for 80, 160, and so on, as well as for 52, 104, etc. Such a strategy is called caching or memoizing: storing in memory the result of a computation that we're likely to need again. It is sometimes described as "trading memory for time."
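To make the starting point concrete, here is a minimal, cache-free sketch of that naive approach (an illustration, not the program I submitted):

```perl
#!/usr/bin/perl
# Naive, cache-free sketch: every sequence is recomputed from scratch.
# Fine for a handful of numbers, far too slow for 1,000,000 of them.
use strict;
use warnings;
use feature qw/say/;

sub collatz_length {
    my $n = shift;
    my $length = 1;    # count the starting number itself
    while ($n != 1) {
        $n = $n % 2 ? 3 * $n + 1 : $n / 2;
        $length++;
    }
    return $length;
}

say "$_: ", collatz_length($_) for 23, 26;   # 23: 16 and 26: 11
```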

Since we want to compute the Collatz sequence for all integers up to 1,000,000, the cache will grow very large (several million sequences) and we might run out of memory. In the first version of the program below, I tried to store all sequences up to one million, and the program turned out to be painfully slow. Looking at the system statistics, I found that, after a while, available memory was exhausted and the system started swapping to disk, leading to very slow execution. I ran a couple of tests and found that I could store the sequences for all numbers up to about 300,000 without exceeding the available memory of my computer (that number may be different on yours), thus preventing the process from swapping and getting more or less optimal performance, hence the MAX constant set to 300,000.

Since I knew from earlier tests that the 20 longest sequences would all have more than 400 items, I also hard-coded a lower limit of 400 items for the sequences whose length had to be recorded. Another, possibly better, solution might have been to maintain a sliding array of the top 20 sequences, but I feared that updating this array many times over the execution of the program would end up impairing performance.
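For the record, the sliding top-20 array I decided against could have looked something like this (a hypothetical sketch with a made-up record_length helper, not code from my actual solution):

```perl
#!/usr/bin/perl
# Hypothetical sketch of the rejected alternative: keep a sliding array
# of the 20 longest sequences seen so far, instead of a hard-coded
# 400-item threshold.
use strict;
use warnings;
use feature qw/say/;

my @top;    # each entry: [ starting number, sequence length ]

sub record_length {
    my ($num, $length) = @_;
    # Bail out early when the new length cannot make the top 20
    return if @top >= 20 and $length <= $top[-1][1];
    push @top, [ $num, $length ];
    @top = sort { $b->[1] <=> $a->[1] } @top;
    splice @top, 20 if @top > 20;    # keep only the 20 best
}

# Toy usage with made-up lengths:
record_length($_, $_ % 37) for 1 .. 100;
say "$_->[0]: $_->[1]" for @top;
```

The early return means the costly sort only runs when a candidate actually qualifies, which mitigates precisely the overhead I was worried about.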

#!/usr/bin/perl
use strict;
use warnings;
use feature qw/say/;
use constant MAX => 300000;

my %cache;

sub collatz_seq {
    my $input = shift;
    my $n = $input;
    my @result;
    while ($n != 1) {
        if (exists $cache{$n}) {
            push @result, @{$cache{$n}};
            last;
        } else {
            my $new_n = $n % 2 ? 3 * $n + 1 : $n / 2;
            push @result, $new_n;
            $cache{$n} = [$new_n, @{$cache{$new_n}}]
                if defined $cache{$new_n} and $n < MAX;
            $n = $new_n;
        }
    }
    $cache{$input} = [@result] if $input < MAX;
    return @result;
}

my @long_seqs;
for my $num (1..1000000) {
    my @seq = ($num, collatz_seq $num);
    push @long_seqs, [ $num, scalar @seq ] if scalar @seq > 400;
}
@long_seqs = sort { $b->[1] <=> $a->[1] } @long_seqs;
say "$_->[0]: $_->[1]" for @long_seqs[0..19];

With these optimizations, I was able to reduce execution time to 1 min 7 sec.:

$ time perl collatz.pl
837799: 525
626331: 509
939497: 507
704623: 504
910107: 476
927003: 476
511935: 470
767903: 468
796095: 468
970599: 458
546681: 452
818943: 450
820022: 450
820023: 450
410011: 449
615017: 447
886953: 445
906175: 445
922524: 445
922525: 445

real 1m7,469s
user 1m6,015s
sys  0m1,390s

Changing the Caching Strategy

A couple of days after I submitted my solution to the Perl Weekly Challenge and published the blog post mentioned above, I realized that my caching strategy was in fact quite inefficient: the program doesn't need to cache the full sequences, it is enough to store their lengths. And that considerably reduces the memory footprint and other overhead of the cache.

I did not originally try this change in Perl (and did not intend to), but I did try it in my Raku solution. Changing the caching strategy made the Raku program 6 times faster (see below).

On April 5, 2020 (one day after my original blog post), 1nick published a very interesting message on the Perl Monks forum in which he presented another strategy: parallelizing the process with MCE::Map. Each worker is handed only the beginning and end of the chunk of the range it will process, and the workers communicate among themselves to keep track of the overall task. With this change (and no caching), his program ran 5 times faster on a 12-core machine (the full program is presented in Nick's post). Following that initial post, an extremely interesting discussion took place between Nick and several other Perl monks. I really cannot summarize it here; just follow the link if you're interested (it's well worth the effort). Note that I only saw this Perl Monks discussion thread on April 14.

Given that discussion on the Perl Monks forum, I felt compelled to implement the modified caching strategy (caching the sequence lengths rather than the sequences themselves) in the Perl version.

The computer on which I ran the next tests is slower than the one used for the timings above. These are the timings of my original program on this computer:

real 1m37,551s
user 1m9,375s
sys  0m21,031s

This is now my first implementation with the new caching strategy:

#!/usr/bin/perl
use strict;
use warnings;
use feature qw/say/;
use constant MAX => 1000000;

my %cache = (2 => 2);

sub collatz_seq {
    my $input = shift;
    my $n = $input;
    my $result = 0;
    while ($n != 1) {
        if (exists $cache{$n}) {
            $result += $cache{$n};
            last;
        } else {
            my $new_n = $n % 2 ? 3 * $n + 1 : $n / 2;
            $result++;
            $cache{$n} = $cache{$new_n} + 1
                if defined $cache{$new_n} and $n < MAX;
            $n = $new_n;
        }
    }
    $cache{$input} = $result if $input < MAX;
    return $result;
}

my @long_seqs;
for my $num (1..1000000) {
    my $seq_length = collatz_seq $num;
    push @long_seqs, [ $num, $seq_length ] if $seq_length > 400;
}
@long_seqs = sort { $b->[1] <=> $a->[1] } @long_seqs;
say "$_->[0]: $_->[1]" for @long_seqs[0..19];

This program produces the same outcome, but is nearly 3 times faster:

real 0m34,207s
user 0m34,108s
sys  0m0,124s

It's pretty good, but still not as good as Nick's parallelized solution (which ran 5 times faster).

Using an Array Instead of a Hash

But we now end up with a cache having essentially one entry per input number in the 1..1000000 range. So, I thought, perhaps it might be better to use an array, rather than a hash, for the cache (accessing an array item should be faster than a hash lookup).
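That intuition can be checked with a quick synthetic micro-benchmark using the core Benchmark module (just a sanity check of the lookup cost, not the actual Collatz program):

```perl
#!/usr/bin/perl
# Synthetic micro-benchmark: integer-keyed hash lookups vs array lookups.
use strict;
use warnings;
use Benchmark qw/cmpthese/;

my (%h, @a);
$h{$_} = $_ for 1 .. 100_000;
$a[$_] = $_ for 1 .. 100_000;

cmpthese(500_000, {
    hash  => sub { my $x = $h{ int(rand(100_000)) + 1 } },
    array => sub { my $x = $a[ int(rand(100_000)) + 1 ] },
});
```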

This is the code for this new implementation:

#!/usr/bin/perl
use strict;
use warnings;
use feature qw/say/;
use constant MAX => 1000000;

my @cache = (0, 1, 2);

sub collatz_seq {
    my $input = shift;
    my $n = $input;
    my $result = 0;
    while ($n != 1) {
        if (defined $cache[$n]) {
            $result += $cache[$n];
            last;
        } else {
            my $new_n = $n % 2 ? 3 * $n + 1 : $n / 2;
            $result++;
            $cache[$n] = $cache[$new_n] + 1
                if defined $cache[$new_n] and $n < MAX;
            $n = $new_n;
        }
    }
    $cache[$input] = $result if $input < MAX;
    return $result;
}

my @long_seqs;
for my $num (1..1000000) {
    my $seq_length = collatz_seq $num;
    push @long_seqs, [ $num, $seq_length ] if $seq_length > 400;
}
@long_seqs = sort { $b->[1] <=> $a->[1] } @long_seqs;
say "$_->[0]: $_->[1]" for @long_seqs[0..19];

With this new implementation, we still obtain the same result, but the program is now more than 55 times faster than my original one (and almost 20 times faster than the solution using a hash for the cache):

$ time perl collatz3.pl
837799: 525
626331: 509
[Lines omitted for brevity]
922524: 445
922525: 445

real 0m1,755s
user 0m1,687s
sys  0m0,061s

I strongly suspected that using an array would be faster, but I frankly did not expect such a huge gain until I tested it.

So, yes, it is true that throwing more CPU cores at the problem makes the solution significantly faster (although only to a limited extent on my computer, which has just 4 cores). But using a better algorithm is often an even better solution. The best, of course, is to do both, and this can be done, as we will see below.

Further Optimizations

After I presented these results on the Perl Monks forum, another monk, Mario Roy, the author of the MCE::Map module used by Nick and of a number of other very useful Perl modules for parallel processing, suggested three further optimizations:

1. Replaced the division by 2 with a bitwise right shift:

$n >> 1;

2. Removed one level of branching:

while ($n != 1) {
    $result += $cache[$n], last if defined $cache[$n];
    my $new_n = $n % 2 ? 3 * $n + 1 : $n >> 1;
    $result++;
    $cache[$n] = $cache[$new_n] + 1
        if defined $cache[$new_n] and $n < $max;
    $n = $new_n;
}

3. Lastly, reduced the number of loop iterations: since 3 * $n + 1 is always even when $n is odd, the following halving step can be applied immediately, handling two steps in a single iteration.

while ($n != 1) {
    $result += $cache[$n], last if defined $cache[$n];
    $n % 2
        ? ( $result += 2, $new_n = (3 * $n + 1) >> 1 )
        : ( $result += 1, $new_n = $n >> 1 );
    $cache[$n] = $cache[$new_n] + ($n % 2 ? 2 : 1)
        if defined $cache[$new_n] and $n < $max;
    $n = $new_n;
}

On his computer and with a larger range (up to 1e7 instead of 1e6), he obtained the following timings:

collatz3_a.pl 1e7  13.130s  (a) original
collatz3_b.pl 1e7  12.394s  (b) a + replaced division with >> 1
collatz3_c.pl 1e7  12.261s  (c) b + removed 1 level of branching
collatz3_d.pl 1e7   9.170s  (d) c + reduced loop iterations

So, that's about 30% faster. Interestingly, I would not have thought that such micro-optimizations would provide such a significant gain. I'll have to remember that. But that was just the first step of Mario's approach; the really good things come next.

Combining Caching and Parallel Execution

In another Perl Monks post, Mario Roy showed how to combine caching and parallel execution using the File::Map module, which implements memory mapping that can be shared between threads or forked processes. With a 32-core CPU, Mario was able to reduce the execution time to less than 0.7 second! Wow! Please follow the link for the details.

So, yes, it is possible to combine caching with parallel execution.

New Caching Strategy in Raku

As mentioned earlier, when the idea came to me to store the sequence lengths rather than the sequences themselves, I originally tried to implement it in Raku. I'll cover that in detail in my review of the Raku solutions, but let me provide here a summary.

Remember that the original solution took about 9 minutes with Raku.

This is the first implementation (using sequence length in a hash):

use v6;

my %cache = 2 => 2;

sub collatz-seq (UInt $in) {
    my $length = 0;
    my $n = $in;
    while $n != 1 {
        if %cache{$n}:exists {
            $length += %cache{$n};
            last;
        } else {
            my $new_n = $n % 2 ?? 3 * $n + 1 !! $n / 2;
            $length++;
            %cache{$n} = %cache{$new_n} + 1
                if defined %cache{$new_n} and $new_n <= 2000000;
            $n = $new_n.Int;
        }
    }
    %cache{$in} = $length if $in <= 2000000;
    return $length;
}

my @long-seqs;
for 1..1000000 -> $num {
    my $seq-length = collatz-seq $num;
    push @long-seqs, [ $num, $seq-length ] if $seq-length > 400;
}
@long-seqs = sort { $^b[1] <=> $^a[1] }, @long-seqs;
say "$_[0]: $_[1]" for @long-seqs[0..19];

This new program displays the same output as the previous one, but runs about 6 times faster:

$ time perl6 collatz2.p6
837799: 525
626331: 509
939497: 507
[Lines omitted for brevity]
906175: 445
922524: 445
922525: 445

real 1m31,660s
user 0m0,000s
sys  0m0,062s

This is the code for the implementation using an array instead of a hash for the cache:

use v6;

my @cache = 0, 1, 2;

sub collatz-seq (UInt $in) {
    my $length = 0;
    my $n = $in;
    while $n != 1 {
        if defined @cache[$n] {
            $length += @cache[$n];
            last;
        } else {
            my $new_n = $n % 2 ?? 3 * $n + 1 !! $n / 2;
            $length++;
            @cache[$n] = @cache[$new_n] + 1
                if defined @cache[$new_n] and $new_n <= 2000000;
            $n = $new_n.Int;
        }
    }
    @cache[$in] = $length;
    return $length;
}

my @long-seqs;
for 2..1000000 -> $num {
    my $seq-length = collatz-seq $num;
    push @long-seqs, [ $num, $seq-length ] if $seq-length > 200;
}
@long-seqs = sort { $^b[1] <=> $^a[1] }, @long-seqs;
say "$_[0]: $_[1]" for @long-seqs[0..19];

And the new program runs about twice as fast as the hash version (and 12 times faster than the original code):

$ time ./perl6 collatz3.p6
837799: 525
626331: 509
[Lines omitted for brevity]
906175: 445
922524: 445
922525: 445

real 0m45,735s
user 0m0,015s
sys  0m0,046s

Interestingly, the Perl program runs 3 times faster after the first optimization and 55 times faster after the second one, whereas the Raku program runs 6 times faster after the first optimization and 12 times faster after the second one. It is not necessarily surprising that some optimizations work better in one language than in another, but I somehow did not expect such a large discrepancy.

Raku has very good support for parallel execution and concurrent programming. I'm pretty sure it should be possible to make good use of this capability, but I haven't really looked at that topic for at least four years, so I don't think I could come up with a good parallel solution without spending quite a bit of time and effort. Also, with my poor computer and its four cores, I would certainly not get results anywhere close to Mario Roy's on his 32-core platform.

Wrapping up

Perl Weekly Challenge 56 is up for your perusal!