Van Eck, US States and Raku

Perl 6 → Raku: This article has been moved from «perl6.eu» and updated to reflect the language rename in 2019.

Published 29 June 2019.

This is my response to the Perl Weekly Challenge #14.

Challenge #14.1

Write a script to generate Van Eck's sequence that starts with 0. For more information, please check out the Wikipedia page. This challenge was proposed by team member Andrezgz.

The first value in the Van Eck sequence is zero. Each subsequent value is calculated like this:

• Look for the previous occurrence of the last value in the list (from the end).
• If we don't find it, the next value is zero.
• If we do find it, the next value is the distance from the end.

A walk-through of the first ten values may make it easier to understand:

Sequence                      Search for   Found   The next value
0                             0            No      0
0, 0                          0            Yes     1
0, 0, 1                       1            No      0
0, 0, 1, 0                    0            Yes     2
0, 0, 1, 0, 2                 2            No      0
0, 0, 1, 0, 2, 0              0            Yes     2
0, 0, 1, 0, 2, 0, 2           2            Yes     2
0, 0, 1, 0, 2, 0, 2, 2        2            Yes     1
0, 0, 1, 0, 2, 0, 2, 2, 1     1            Yes     6
0, 0, 1, 0, 2, 0, 2, 2, 1, 6
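The rules can be cross-checked with a quick sketch. This is an illustration only (written in Python rather than Raku), implementing the walk-through above literally:

```python
def van_eck(limit):
    """First `limit` values of the Van Eck sequence, by the naive rules."""
    seq = [0]                                   # the sequence starts with 0
    while len(seq) < limit:
        last = seq[-1]
        # search backwards for the previous occurrence of the last value
        for delta, value in enumerate(reversed(seq[:-1]), start=1):
            if value == last:
                seq.append(delta)               # found: distance from the end
                break
        else:
            seq.append(0)                       # not found: next value is 0
    return seq

print(van_eck(10))                              # [0, 0, 1, 0, 2, 0, 2, 2, 1, 6]
```

The output matches the last column of the table, read top to bottom.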

I have chosen to delegate the actual calculation to a helper procedure «van-help»:

unit sub MAIN (Int $limit = 10);                     # [1]

my @van-eck = (0);                                   # [2]

@van-eck.push(van-help(@van-eck)) for ^($limit -1);  # [3]

say @van-eck.join(", ");                             # [4]

sub van-help(@array is copy)                         # [5]
{
  my $last = @array.pop;                             # [6]

  if any(@array) == $last                            # [7]
  {
    for @array.reverse                               # [8]
    {
      state $delta++;                                # [9]
      return $delta if $_ == $last;                  # [10]
    }
  }

  return 0;                                          # [11]
}

File: van-eck-list

[1] The sequence is infinite, so we have to stop at an explicit location. This will give us the first 10 numbers by default.

[2] The first number in the sequence is zero.

[3] Add the next number, one at a time, nine times by default. Combined with the initial one (from [2]), we get ten.

[4] Print them.

[5] This helper function takes a partial Van Eck sequence as input, and returns the next number. The input is set up with «is copy» so that we can change it (in [6]).

[6] Remove the last element from the sequence.

[7] If any of the remaining elements are equal to the one we removed,

[8] Loop through the values, from the end.

[9] • Keep a count of how many iterations in the loop.

[10] • Return the distance between the two values when we find the first match.

[11] If the test in [7] fails, we return 0 (meaning that it is the first time the number occurs in the sequence).

A variable declared with state gives us a state variable. The assignment in the declaration (the initialization) is only done once, so we get a counter. See docs.raku.org/syntax/state for more information about «state».

Running it:

$ raku van-eck-list
0, 0, 1, 0, 2, 0, 2, 2, 1, 6

$ raku van-eck-list 20
0, 0, 1, 0, 2, 0, 2, 2, 1, 6, 0, 5, 0, 2, 6, 5, 4, 0, 5, 3

And that's it, really: a list, as shown in the Wikipedia article the challenge refers to.

But this list form is extremely unhelpful if we dig a little deeper and discover the list of the first 100,000 elements at https://oeis.org/A181391/b181391.txt - and we want to check that we have gotten e.g. the 100th value right.

Adding the indices is straightforward(ish):

unit sub MAIN (Int $limit = 10,
               :$verbose  = False);

my @van-eck = (0);

@van-eck.push(van-help(@van-eck)) for ^($limit -1);

if $verbose
{
  my $count = 1;
  say "{ $count++ }: $_" for @van-eck;
}
else
{
  say @van-eck.join(", ");
}

File: van-eck-verbose (changes only)

The program behaves as before, unless we use the «--verbose» flag:

$ raku van-eck-verbose --verbose 20
1: 0
2: 0
3: 1
4: 0
5: 2
6: 0
7: 2
8: 2
9: 1
10: 6
11: 0
12: 5
13: 0
14: 2
15: 6
16: 5
17: 4
18: 0
19: 5
20: 3

Testing, testing, testing

This makes it easier to check that I got it right, but it does require manual work. That is tedious. So let us make the program do it for us.

We should let the program download the sequence file, and compare the values with the ones from our sequence generator one by one.

But before doing that, let us see how the program handles long sequences. I'm not optimistic (and will show why in the next section).

$ raku van-eck-verbose --verbose 10000
$ raku van-eck-verbose --verbose 100000

The first one takes 9 minutes, and the second one takes 13 hours (and some minutes). I have shown the commands only, not the output.

There are several reasons:

The helper function «van-help» makes a copy of the array each time a new value is calculated. The longer the array, the longer this will take.

I use «any» to look for a value in the array, and this is essentially the same as checking every single value, if the value we are looking for isn't there. (And we have no way of knowing how efficient «any» is. It will probably not look from the end, but that is just a guess.)

The solution to the first one is to inline the helper function, and the second is fixed by storing the last position of the values we have in the sequence in a hash (lookup by the number, and the value is the position in the array). Then it is simply a hash lookup, and not a search. Hashes are optimized for lookup, so are extremely fast.
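Both fixes can be prototyped in a few lines. This is an illustrative sketch in Python (not the Raku program itself), using a dictionary that holds the last position of each value:

```python
def van_eck_fast(limit):
    """Van Eck sequence with O(1) lookups via a hash of last-seen positions."""
    seq = [0]
    seen = {}                              # value -> position of its last occurrence
    for pos in range(limit - 1):
        last = seq[-1]
        if last in seen:
            seq.append(pos - seen[last])   # distance since the previous occurrence
        else:
            seq.append(0)                  # first time we see this value
        seen[last] = pos                   # update the position after the lookup
    return seq

print(van_eck_fast(20))
```

Each step is now a constant-time dictionary lookup instead of a scan through the whole array, so generating 100,000 values takes seconds instead of hours.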

unit sub MAIN (Int $limit where 2 <= $limit <= 100000 = 10,  # [1]
               :$verbose = False,
               :$test    = False);                           # [1]

my @van-eck = (0);                                           # [2]
my %seen;                                                    # [3]

for ^($limit -1) -> $pos                                     # [4]
{
  %seen{@van-eck[*-1]}.defined                               # [5]
    ?? @van-eck.push: $pos - %seen{@van-eck[*-1]}            # [5a]
    !! @van-eck.push: 0;                                     # [5b]

  %seen{@van-eck[*-2]} = $pos;                               # [6]
}

if $test                                                     # [7]
{
  use LWP::Simple;                                           # [8]

  my $ok    = 0;
  my $error = 0;
  my $index = 1;

  for LWP::Simple.get('https://oeis.org/A181391/b181391.txt').lines -> $line  # [8]
  {
    last if $index > $limit;                                 # [9]

    my $my  = @van-eck[$index -1];                           # [10]
    my $def = $line.words[1];                                # [10a]

    if $my == $def                                           # [10b]
    {
      say "$index: $my == $def (OK)";
      $ok++;
    }
    else
    {
      say "$index: $my != $def (Error)";
      $error++;
    }
    $index++;
  }

  say "
OK: $ok";                                                    # [11]
  say "Error: $error";                                       # [11]
}
elsif $verbose                                               # [7]
{
  my $count = 1;
  say "{ $count++ }: $_" for @van-eck;
}
else                                                         # [7]
{
  say @van-eck.join(", ");
}

File: van-eck

[1] The sequence file has 100,000 entries, so I have set that up as the limit. Invoke the test with the new «--test» argument.

[2] The first value in the sequence.

[3] This hash keeps the last position of every value we have in the sequence as we generate values one at a time.

[4] If the limit is 10, we get the values 0 .. 8 in this loop.

[5] If we have seen the value in question,

[5a] • use the distance to the current position as the new value,

[5b] • and if not, use 0.

[6] Update the position of the last value. It must not be used in the distance calculation in step [5a], so we do it here. Post-factum.

[7] I have used «if/elsif/else» instead of multiple dispatch this time.

[8] I use the «LWP::Simple» module to fetch the sequence. (On my Ubuntu/Linux machine this failed because of a missing system library; installing the «libssl-dev» package resolved the problem.) Iterate through the values, one at a time.

[9] Stop when we have reached the specified limit.

[10] Get the value from my sequence, and compare it with the one we downloaded [10a, 10b]. Count the correct values and errors, and report them as we go along.

[11] Print the result.

Running it:

$ raku van-eck --test
1: 0 == 0 (OK)
2: 0 == 0 (OK)
3: 1 == 1 (OK)
4: 0 == 0 (OK)
5: 2 == 2 (OK)
6: 0 == 0 (OK)
7: 2 == 2 (OK)
8: 2 == 2 (OK)
9: 1 == 1 (OK)
10: 6 == 6 (OK)

OK: 10
Error: 0

Running it, and testing the full set of 100000 values (without showing the values):

$ raku van-eck --test 100000
...

OK: 100000
Error: 0

So I think we can conclude that my algorithm generates the correct values. (Both of them, actually. But disregard the first one.) The program took about 5 seconds to run on my machine.

Running the minimalistic program is of course faster (as it doesn't download the file from the Internet, and has less output). This takes about 1.5 seconds:

$ raku van-eck 100000

Lesson Learned: A poorly designed algorithm can cause incredibly inefficient code.

Challenge #14.2

Using only the official postal (2-letter) abbreviations for the 50 U.S. states, write a script to find the longest English word you can spell. Here is the list of U.S. state abbreviations as per the Wikipedia page. This challenge was proposed by team member Neil Bowers.



For example:

Pennsylvania + Connecticut = PACT
Wisconsin + North Dakota = WIND
Maine + Alabama = MEAL
California + Louisiana + Massachusetts + Rhode Island = Calamari

The first task is obtaining the states, as a data structure (a hash). This page https://www.perlmonks.org/?node_id=60745 gives it in Perl.

I have modified it for Raku:

my %states = "AK  Alaska       LA  Louisiana       OH  Ohio
              AL  Alabama      MA  Massachusetts   OK  Oklahoma
              AR  Arkansas     MD  Maryland        OR  Oregon
              AZ  Arizona      ME  Maine           PA  Pennsylvania
              CA  California   MI  Michigan        RI  Rhode Island
              CO  Colorado     MN  Minnesota       SC  South Carolina
              CT  Connecticut  MO  Missouri        SD  South Dakota
              DE  Delaware     MS  Mississippi     TN  Tennessee
              FL  Florida      MT  Montana         TX  Texas
              GA  Georgia      NC  North Carolina  UT  Utah
              HI  Hawaii       ND  North Dakota    VA  Virginia
              IA  Iowa         NE  Nebraska        VT  Vermont
              ID  Idaho        NH  New Hampshire   WA  Washington
              IL  Illinois     NJ  New Jersey      WI  Wisconsin
              IN  Indiana      NM  New Mexico      WV  West Virginia
              KS  Kansas       NV  Nevada          WY  Wyoming
              KY  Kentucky     NY  New York".split(/\s\s+/).hash;

say %states.keys.sort;

File: us-states

We split on two or more spaces, so that the single space inside a two word state name is ignored by «split».
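The effect of splitting on runs of whitespace can be illustrated in Python with a hypothetical three-state excerpt (the `\s\s+` pattern is the same idea as the Raku regex):

```python
import re

# Fields are separated by two or more spaces; the single space in a
# two-word state name is left alone.
data = "NH  New Hampshire   WA  Washington   RI  Rhode Island"
fields = re.split(r"\s\s+", data)
states = dict(zip(fields[0::2], fields[1::2]))   # pair up code -> name

print(states)
```

The single space inside «New Hampshire» and «Rhode Island» survives, while the wider gaps act as field separators.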

Running it (with a newline inserted to make it fit the article width):

$ raku us-states
(AK AL AR AZ CA CO CT DE FL GA HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS \
 MT NC ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN TX UT VA VT WA WI WV WY)

Then we need a dictionary, and we used one back in Challenge #5.1 (see my «Aruk Masgnara» (or «Rakugrams») article).

The «dictionary-lookup2» program is almost what we want (look it up in the article if you want to know more), and it is easy to modify it to fit the new requirements.

This time we don't need a hash (or Set, actually). The challenge wanted the longest word, so we sort the dictionary by word size:

my $dictionary = "/usr/share/dict/british-english";  # [1]

.say for get-dictionary($dictionary);                # [2]

sub get-dictionary ($file where $file.IO.r)          # [3]
{
  return $file.IO.lines                              # [4]
    .grep(* !~~ /\W/)                                # [5]
    >>.uc                                            # [6]
    .sort({ $^b.chars cmp $^a.chars });              # [7]
}

File: dictionary-list-by-length

[1] The dictionary file. Change the file name if you want to use another file.

[2] Load the dictionary (in [3]) and print it, one word at a time.

[3] Note the «where» clause to get a better error message if the file doesn't exist, or isn't readable.

[4] Read the file, one line at a time.

[5] • only use lines with nothing but letters.

[6] • turn the words to uppercase, as the state codes are in uppercase.

[7] • sort them, by the length. We have swapped the placeholder variables «$^a» and «$^b» around so that the longest word comes first.

Running it (and showing the first 10 lines only):

$ raku dictionary-list-by-length
COUNTERREVOLUTIONARIES
ELECTROENCEPHALOGRAPHS
ELECTROENCEPHALOGRAMS
ELECTROENCEPHALOGRAPH
ANDRIANAMPOINIMERINA
COUNTERREVOLUTIONARY
ELECTROENCEPHALOGRAM
UNCHARACTERISTICALLY
CHLOROFLUOROCARBONS
COUNTERINTELLIGENCE
...

The full program

Now we can do the full program:

unit sub MAIN ($dictionary where $dictionary.IO.e && $dictionary.IO.r  # [1]
                 = "/usr/share/dict/british-english",
               :$all);

my %states = "AK  Alaska       LA  Louisiana       OH  Ohio
              AL  Alabama      MA  Massachusetts   OK  Oklahoma
              AR  Arkansas     MD  Maryland        OR  Oregon
              AZ  Arizona      ME  Maine           PA  Pennsylvania
              CA  California   MI  Michigan        RI  Rhode Island
              CO  Colorado     MN  Minnesota       SC  South Carolina
              CT  Connecticut  MO  Missouri        SD  South Dakota
              DE  Delaware     MS  Mississippi     TN  Tennessee
              FL  Florida      MT  Montana         TX  Texas
              GA  Georgia      NC  North Carolina  UT  Utah
              HI  Hawaii       ND  North Dakota    VA  Virginia
              IA  Iowa         NE  Nebraska        VT  Vermont
              ID  Idaho        NH  New Hampshire   WA  Washington
              IL  Illinois     NJ  New Jersey      WI  Wisconsin
              IN  Indiana      NM  New Mexico      WV  West Virginia
              KS  Kansas       NV  Nevada          WY  Wyoming
              KY  Kentucky     NY  New York".split(/\s\s+/).hash;

sub get-dictionary ($file where $file.IO.r)          # [2]
{
  return $file.IO.lines>>.uc                         # [2a]
    .grep({ .chars %% 2 })                           # [2b]
    .grep(* !~~ /\W/)                                # [2c]
    .sort({ $^b.chars cmp $^a.chars });              # [2d]
}

my $found = 0;

for get-dictionary($dictionary) -> $word             # [3]
{
  last if $word.chars < $found;                      # [3a]
  check-word($word);                                 # [4]
  last if $found && !$all;                           # [5]
}

say "
Word length: $found.";                               # [6]

sub check-word ($word)                               # [4]
{
  my @parts = $word.comb(2);                         # [7]

  return unless %states{$_} for @parts;              # [8]

  say "{ @parts.map({ %states{$_} }).join(" + ") } = $word";  # [9]

  $found = $word.chars;                              # [10]
}

File: state-words

[1] I use an English dictionary file, but you can specify another file if needed. Use the «--all» flag to get all the matches (with the same length), and not just the first one.

[2] Read the dictionary file, force the words to uppercase [2a], remove words with an odd number of characters [2b] (as the state codes are two characters long, and any combination of them has an even length), remove any word containing a non-word character [2c], and finally sort them by size, with the longest first [2d].
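The four steps [2a]-[2d] can be mirrored in Python to see what survives the filtering (an illustration only, with a tiny in-memory word list standing in for the dictionary file):

```python
def get_dictionary(words):
    """Uppercase, keep even-length all-letter words, sort longest first."""
    ups   = [w.upper() for w in words]               # [2a] uppercase
    evens = [w for w in ups if len(w) % 2 == 0]      # [2b] even length only
    clean = [w for w in evens if w.isalpha()]        # [2c] letters only
    return sorted(clean, key=len, reverse=True)      # [2d] longest first

print(get_dictionary(["pact", "wind", "meal", "calamari", "it's", "cat"]))
# ['CALAMARI', 'PACT', 'WIND', 'MEAL']
```

«cat» is dropped by the even-length test, and «it's» by the letters-only test.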

[3] Iterate through the words in the dictionary (starting with the longest word), and abort if the current word length is smaller than the length of the first match [3a].

[4] «check-word» does the job (and sets «$found»).

[5] Abort if we have shown the first match, and the user hasn't asked for all («--all»).

[6] Report the length of the word(s) as well.

[7] Split the word into strings of two characters with «comb(2)». We have already ensured that the word length is even, so all the parts will have the correct length.

[8] Iterate through the partial strings of the word, and return if the current one doesn't match a state identifier.

[9] If they all match, we have a matching word. Print it.

[10] and record the word length.
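The chunk-and-look-up logic of «check-word» can be sketched in Python (with a hypothetical four-state `states` dict for illustration; «comb(2)» becomes string slicing):

```python
states = {"PA": "Pennsylvania", "CT": "Connecticut",
          "WI": "Wisconsin",    "ND": "North Dakota"}

def check_word(word):
    """Split into two-character chunks; succeed only if all are state codes."""
    parts = [word[i:i + 2] for i in range(0, len(word), 2)]  # like comb(2)
    if all(part in states for part in parts):
        return " + ".join(states[p] for p in parts) + " = " + word
    return None                                              # not spellable

print(check_word("PACT"))   # Pennsylvania + Connecticut = PACT
print(check_word("MEAL"))   # None (ME and AL are not in this tiny dict)
```

With the full 50-state table, «MEAL» would of course match as Maine + Alabama.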

See docs.raku.org/routine/comb for more information about «comb».

Running it:

$ raku state-words
Colorado + North Carolina + Oregon + Delaware = CONCORDE

Word length: 8.

$ raku state-words --all
Colorado + North Carolina + Oregon + Delaware = CONCORDE
Georgia + New York + Maine + Delaware = GANYMEDE
Massachusetts + North Dakota + Arkansas + Indiana = MANDARIN
California + Louisiana + Michigan + Nebraska = CALAMINE
Massachusetts + Indiana + Louisiana + North Dakota = MAINLAND
Massachusetts + Louisiana + Rhode Island + Alabama = MALARIAL
Massachusetts + North Dakota + Arkansas + Indiana = MANDARIN
Maine + Missouri + Rhode Island + Alabama = MEMORIAL
Missouri + Oregon + Louisiana + North Dakota = MOORLAND

Word length: 8.

The duplicate hits for «MANDARIN» are caused by two entries in the dictionary: «Mandarin» and «mandarin».

The easiest way of removing duplicate values from a list is with the unique method. The change is trivial:

sub get-dictionary ($file where $file.IO.r)
{
  return $file.IO.lines>>.uc
    .grep({ .chars %% 2 })
    .grep(* !~~ /\W/)
    .unique
    .sort({ $^b.chars cmp $^a.chars });
}

File: state-words (partial)

See docs.raku.org/routine/unique for more information about «unique».

The dictionary I use doesn't have the word «Calamari» (which is shown exactly like this in the challenge, inconsistent with the other words, which are in upper case letters only).

The actual result(s) depend on the dictionary file.

Other Languages

The challenge asked for «the longest English word», but we can have a go at an American dictionary as well:

$ raku state-words --all "/usr/share/dict/american-english"

It gives exactly the same result as the English one.

The French dictionary:

$ raku state-words --all /usr/share/dict/french
Arkansas + Missouri + Rhode Island + California + Indiana = ARMORICAIN
Massachusetts + North Dakota + Arkansas + Indiana + Alabama = MANDARINAL

Word length: 10.

Longer words than in English.

The German dictionary:

$ raku state-words --all "/usr/share/dict/ngerman"
Colorado + North Carolina + Oregon + Delaware = CONCORDE
Massachusetts + North Dakota + Arkansas + Indiana = MANDARIN
Massachusetts + South Carolina + Hawaii + Nebraska = MASCHINE
Virginia + North Dakota + Alabama + Indiana = VANDALIN

Word length: 8.

The Italian dictionary:

$ raku state-words --all /usr/share/dict/italian
Rhode Island + Colorado + Michigan + North Carolina + Iowa + Virginia \
  + Missouri = RICOMINCIAVAMO
Rhode Island + Colorado + North Carolina + Illinois + Iowa + Virginia \
  + Missouri = RICONCILIAVAMO

Word length: 14.

14 characters is quite impressive. It would have been even more impressive if I had recognized them as words...