Aruk Masgnara

Perl 6 → Raku This article has been moved from «perl6.eu» and updated to reflect the language rename in 2019.

[9] Published 28. April 2019

This is my response to the Perl Weekly Challenge #5.

Challenge #5.1: All Anagrams

Write a program which prints out all anagrams for a given word. For more information about Anagram, please check this wikipedia page.

We need a dictionary file, as random letter combinations doesn't cut it. Happily I already found three such files while writing the Raku P(i)ermutations article. (Scroll down to the «Oops?» section.) I'll use the english one; «/usr/share/dict/british-english».

Dictionary Lookup

unit sub MAIN (Str $word where $word !~~ /\W/); # [1] my $dict = get-dictionary("/usr/share/dict/british-english"); say $dict{$word} # [2] ?? "$word: Is a valid word" !! "$word: Not a valid word"; sub get-dictionary ($file where $file.IO.r) # [3] { my %hash; $file.IO.lines.grep(* !~~ /\W/).map({ %hash{$_} = True; }); # [4] ######## # [5] ########## # [6] #################### return %hash; }

I'll start by writing a program that looks up a user specified word in the dictionary. We load the dictionary into a hash, and the rest is pretty straightforward: File: dictionary-lookup (without fixes)

[1] We allow words only, no spaces (or any non-letter characters)

[2] Did we find the word, or not.

[3] We get a nice error message if we pass it the name of a non-readable file.

[4] This gives all the lines in the dictionary, one word on each.

[5] For almost every word, this dictionary file has an additional entry with «'s» appended. That is no good here, so we skip words with non-letters in them.

[6] We use « map » to set an entry in the hash for the word (with True as value).

We should probably do a case insensitive match. Adding .lc to turn the string to lowercase is the best choice as most of the letters (in the dictionary) are lowercase anyway. I have highlighted the changes like this :

unit sub MAIN (Str $word is copy where $word !~~ /\W/); $word .= lc; my %dict = get-dictionary("/usr/share/dict/british-english"); say %dict{$word} ?? "$word: Is a valid word" !! "$word: Not a valid word"; sub get-dictionary ($file where $file.IO.r) { my %hash; $file.IO.lines.grep(* !~~ /\W/).map({ %hash{ .lc } = True; }); return %hash; }

File: dictionary-lookup

See docs.raku.org/routine/lc for more information about «lc».

I have used a couple of short forms: « $word .= lc » instead of « $word = $word.lc; », and « .lc » instead of « $_.lc ».

All arguments to a procedure are read only by default. We can get a writable local copy by appending « is copy » in the procedure signature.

Note that lowercase/uppercase doesn't always roundtrip. A good example is the German character «ß» which is written as «SS» in uppercase:

> say "Straße".uc; # -> STRASSE > say "Straße".uc.tclc; # -> Strasse

We can shorten «get-dictionary» considerably, if we use a «Set» instead of a hash. A «Set» is a variant of hash, where the values can only be True or False . Or rather, it looks that way from the outside. Only positive values (the keys) are stored in a «Set», so it is quite compact. We get the value True on lookup if the given key is present, and False otherwise.

sub get-dictionary ($file where $file.IO.r) { return $file.IO.lines.grep(* !~~ /\W/).Set; }

File: dictionary-lookup2 (partial)

The «grep» gave us a list, and coercing that list to a «Set» by applying the « .Set » method on it gives a «Set». No need for «map», as we did with the hash version.

If you find the name «Set» familiar in a mathematical sense, you are right. They are the same, and Raku even has built in Set operators that you can use on them. See docs.raku.org/type/Set for more information about «Set».

The conversion of the dictionary to lowercase got lost as I got rid of the «map». We cannot just put the « .lc » back on after the « grep », as we have a list - and « .lc » works on a single string. The result would be the entire list coerced to a single (and very long) string with lowercase letters. Try it in REPL:

> say <a b c d>.raku; # -> ("a", "b", "c", "d") > say <a b c d>.lc.raku; # -> "a b c d"

See docs.raku.org/routine/raku for more information about «raku», which I used here to get a clearer view of the variable types.

The raku method is also available as perl for legacy reasons. It will probably be deprecated, so you are advised to use raku .

We can tell Raku to work on all the elements in a list with the « >>. » Hyper Operator.

See docs.raku.org/language/operators#Hyper_operators for more information about Hyper Operators.

unit sub MAIN (Str $word is copy where $word !~~ /\W/, :$dictionary where $dictionary.IO.r = "/usr/share/dict/british-english"); $word .= lc; my $dict = get-dictionary($dictionary); say $dict{$word} ?? "$word: Is a valid word" !! "$word: Not a valid word"; sub get-dictionary ($file where $file.IO.r) { return $file.IO.lines.grep(* !~~ /\W/)>>.lc.Set; }

File: dictionary-lookup2

Note the « $ » in « $dict » this time. If we had kept the « % » we would have coerced the Set to a hash on assignment. (It is possible to tell « %dict » to be a Set, like this: « my %dict is Set ». But the assignment will not work out quite as expected, as shown in the third line:

> my $h = <A B C>.Set; # -> set(A B C) > my %h = <A B C>.Set; # -> {A => True, B => True, C => True} > my %h is Set = <A B C>.Set; # -> set(set(A B C)) > my %h is Set = <A B C>; # -> set(A B C)

I have added support for additional dictionary files, with the «--dictionary» argument:

$ raku dictionary-lookup2 defence defence: Is a valid word $ raku dictionary-lookup2 --dictionary=/usr/share/dict/american-english defence defence: Not a valid word $ raku dictionary-lookup2 defense defense: Not a valid word $ raku dictionary-lookup2 --dictionary=/usr/share/dict/american-english defense defense: Is a valid word $ raku dictionary-lookup2 --dictionary=/usr/share/dict/ngerman börse börse: Is a valid word

Anagrams

The « permutations » method (described in my Raku P(i)ermutations article) gives us all the possible permutations of the elements in a list. So we'll turn the word into a list of single characters (with « .comb » and apply « .permutations » on it.

Except that « permutations » gives us a list of lists (with the single characters):

> say "abc".comb.permutations; ((a b c) (a c b) (b a c) (b c a) (c a b) (c b a))

I prefer to deal with strings, so we can use the familiar « >>. » Hyper Operator to combine the inner lists into strings (potential words):

> say "abc".comb.permutations>>.join; (abc acb bac bca cab cba)

No duplicates, but only because we didn't have duplicate letters in the input string. The « .unique » method fixes that:

> say "abb".comb.permutations>>.join; (abb abb bab bba bab bba) > say "abb".comb.permutations>>.join.unique; (abb bab bba)

See docs.raku.org/routine/comb for more information about «comb».

See docs.raku.org/routine/unique for more information about «unique».

And here it is, the full program:

unit sub MAIN (Str $word is copy where $word !~~ /\W/, :$dictionary where $dictionary.IO.r = "/usr/share/dict/british-english"); $word .= lc; my $dict = get-dictionary($dictionary); print "Anagrams:"; for $word.comb.permutations>>.join.unique -> $candidate { # next if $candidate eq $word; # [1] print " $candidate" if $dict{$candidate}; # [2] } print "

"; sub get-dictionary ($file where $file.IO.r) is export { return $file.IO.lines.grep(* !~~ /\W/)>>.lc.Set; }

File: anagrams

[1] Skip the input word itself. I have commented it out, as it is perhaps ok to report it. That way we can see if it is a legal word.

[2] Print the canditate, if we find it in the dictionary.

Testing it:

$ raku anagrams Elvis Anagrams: elvis evils lives veils $ raku anagrams Elvi Anagrams: evil levi live veil vile $ raku anagrams Elviz Anagrams:

Multi Word Anagrams?

«rail safety» = «fairy tales»

«eleven plus two» = «twelve plus one»

Anagrams can also include several words. Here are two examples from the Wikipedia article:

The challenge did specify «for a given word», so we can ignore multiple word input.

But multiple word solutions should be handled. E.g:

«funeral» = «real fun»

Word List Problem

> say "/usr/share/dict/british-english".IO.lines.grep({.chars == 1}); (A B C D E F G H I J K L M N O P Q R S T U V W X Y Z \ a b c d e f g h i j k l m n o p q r s t u v w x y z) > "/usr/share/dict/american-english".IO.lines.grep({.chars == 1}); (A B C D E F G H I J K L M N O P Q R S T U V W X Y Z \ a b c d e f g h i j k l m n o p q r s t u v w x y z)

In addition to the duplicate words (as mentioned above), the english files (both UK and US) have entries for every single letter, both upper- and lowercase:

That doesn't work out with multiple words anagrams, as we'd drown in single letter words. So I wrote a program to set up copies of the dictionary files without words with non-letters, and getting rid of the one letter words (except «A» and «I») in the english lists:

my %source = <UK> => "/usr/share/dict/british-english", # [1] <US> => "/usr/share/dict/american-english", <DE> => "/usr/share/dict/ngerman"; unit sub MAIN (Str $language where %source{$language}.defined); # [2] my @lines = %source{$language}.IO.lines.grep(* !~~ /\W/); spurt "dict-$language.txt", $language eq "DE" ?? @lines.join("

") ~ "

" # [3] !! "A

I

" ~ @lines.grep( {.chars > 1 } ).join("

") ~ "

"; # [4]

File: mkdictionary

[1] We set up the three languages and the corresponding dictionary files.

[2] The language must be defined in the hash in [1].

[3] German? No special filtering required.

[4] English? Get rid of the one letters words, but add «A» and «I» back again.

Multi Word Anagrams!

unit sub MAIN (Str $word is copy, :$dictionary where $dictionary.IO.r = "dict-UK.txt"); # [1] $word = $word.trans(" " => "", :delete).lc; # [2] my $dict = get-dictionary($dictionary); # [3] my @permutations = $word.comb.permutations>>.join.unique; # [4] my SetHash $seen; # [5] check-anagram("", $_) for @permutations; # [6] say "Anagrams: { $seen.keys.elems }"; # [7] .say for $seen.keys.sort; # [7] sub get-dictionary ($file where $file.IO.r) # [3] { return $file.IO.lines.grep(* !~~ /\W/)>>.lc.Set; } sub check-anagram ($base is copy, $candidate is copy) # [8] { if $dict{$candidate} # [9] { $seen{"$base $candidate".trim-leading} = True; # [9] # The first character is a space. return; # [9] } for 1 .. $candidate.chars # [10] { my $new-base = $candidate.substr(0, $_); # [11] my $new-candidate = $candidate.substr($_); # [11] check-anagram("$base $new-base", $new-candidate) if $dict{$new-base}; # [12] ###### # [13] ########## # [14] ######## # [12] ############# } }

This looks like a recursive problem to me, and it played out quite nicely: File: multigrams

[1] I have removed the «where» clause on «$word»; see [2] for why.

[2] This line makes it possible to support multiple words as input. We simply remove the spaces with «trans».

[3] Load the dictionary.

[4] A list of candidates, without repetionsn and without spaces.

[5] Used to store the anagrams.

[6] And off we go. Note the recursive call. I'll explain it below.

[7] Report the result.

[8] A recursive procedure, called with the whole word the first time.

[9] If the candidate word is a word, take a note and we are finished.

[10] We loop through the first X number of letters in the candidate word,

[11] taking the letters as the new base and the rest as the new candidate.

[12] Then we do a recursive call, if the new base is a valis word,

[13] with the new base added to the old one (forming several words),

[14] and the new candidate.

See docs.raku.org/routine/trans for more information about «trans».

Recursion Explained

Let's say that we start with the string «He is on it». We get a list of permutations (without the spaces) in [6]. When we get to the "heisonit" candidate, the call is «check-anagram("", "heisonit")». The procedure checks if it is a valid word, which is isn't, and goes on. The «for» loop«check-anagram(" h", "eisonit")», as «h» isn't a legal word. The next iteration«check-anagram(" he", "isonit")», as «he» is a legal word. The procedure checks if "isonit" is a valid word. It isn't so it goes on to the «for» loop, where "i" is a valid word and «check-anagram(" he i", "sonit")» is called. That call leads to another match "so", but the rest "nit" doesn't work out as word(s), so the next iteration tries «check-anagram(" he is", "onit")». The "o" isn't a word, but "on" is, so «check-anagram(" he is on", "it")» is called, and it works out (on the second iteration as "i" is ok, but "t" isn't) as "it" is a valid word.

Running it takes slightly over 1 second on my pc, and the result shows that the english word list has too many «words» that shouldn't really be considered words. I have highlighted some of them, and abridged the ouput (which has 306 lines):

$ raku multigrams "real fun" Anagrams: 306 a elf run a elf urn a flue rn a fr le nu a fr nu le ...

Getting rid of the redundant «words» is hard, but I added the «--log-words» command line option to the program. It writes the words it found to a separate file:

unit sub MAIN (Str $word is copy, :$dictionary where $dictionary.IO.r = "dict-UK.txt" , :$log-words ); my SetHash $seen; my SetHash $word-list; .say for $seen.keys.sort; spurt "wordlog.txt", $word-list.keys.sort.join("

") ~ "

" if $log-words; $word-list{$candidate} = True if $log-words; $seen{"$base $candidate".trim-leading} = True;

$ raku multigrams --log-words "real fun" $ mv wordlog.txt english.txt $ emacs english.txt

File: multigrams (changes only)

The file has 61 lines (and so called words). I commented out (remove with undoability) any word that didn't make sense to me. Then I ran the «multigram» program again, with that dictionary:

$ raku multigrams --dictionary=english.txt "real fun" Anagrams: 41 a elf run a elf urn a run elf a urn elf earl fun earn flu elf a run elf a urn elf run a elf urn a fan lure fan rule flea run flea urn flu earn flu near fun earl fun lear fun real funeral fur lane fur lean fur neal lane fur leaf run leaf urn lean fur lear fun lure fan neal fur near flu real fun rule fan run a elf run elf a run flea run leaf urn a elf urn elf a urn flea urn leaf

From 306 to 41 anagrams. That is much better.

We could consider the anagrams with the same words, but in a different order, as (almost) identical and show them on the same line. E.g:

fun real |real fun a elf urn | a urn elf | elf a urn | elf urn a | urn a elf | urn elf a

That should reduce the list significantly, and it reads better as well. Changes only:

unit sub MAIN (Str $word is copy, :$dictionary where $dictionary.IO.r = "dict-UK.txt", :$log-words , :$tabular ); if $tabular { my %shown; for $seen.keys.sort { unless /\s/ { .say; next; } my @w = .words.sort; my $w = @w.join(" "); next if %shown{$w}; %shown{$w} = True; print $w unless @w; print @w.permutations.unique.join(" | "); print "

"; } } else { .say for $seen.keys.sort; }

$ raku multigrams --dictionary=english.txt --tabular "real fun" Anagrams: 41 a elf run | a run elf | elf a run | elf run a | run a elf | run elf a a elf urn | a urn elf | elf a urn | elf urn a | urn a elf | urn elf a earl fun | fun earl earn flu | flu earn fan lure | lure fan fan rule | rule fan flea run | run flea flea urn | urn flea flu near | near flu fun lear | lear fun fun real | real fun funeral fur lane | lane fur fur lean | lean fur fur neal | neal fur leaf run | run leaf leaf urn | urn leaf

File: multigrams (changes only)

Multi word anagrams should have meaning, and that is almost impossible to get a computer to figure out. The tabular output is the best I can do, but the reader must do the rest of the work by sorting out the meaningless candidates.

Challenge #5.2: Anagram Sequence

Write a program to find the sequence of characters that has the most anagrams.

Single Word Anagrams

This is rather easy. Load the dictionary, sort the words by length (largest first), and compute the number of anagrams for each of them. Stop when the number of letters in the words fall below the maximum number of anagrams already computed (as we clearly cannot do better). Then print the largest value.

It will probably take quite some time to run, though... But what the heck, I'll program it anyway:

unit sub MAIN (Str :$dictionary where $dictionary.IO.r = "dict-UK.txt"); my $dict = get-dictionary($dictionary); my %count; for $dict.keys.sort( { $^b.chars <=> $^a.chars } ) -> $word { # next if $word.chars > 20; last if %count.values.max > $word.chars; %count{$word} = count-anagrams($word); } for %count.keys.sort( { %count{$^b} <=> %count{$^a} } ) { say "$_ : ", %count{$_}; } sub get-dictionary ($file where $file.IO.r) { return $file.IO.lines.grep(* !~~ /\W/)>>.lc.Set; } sub count-anagrams ($word) { my $count = 0; $count++ if $dict{$_} for $word.comb.permutations>>.join.unique; say "$word: $count"; return $count; }

File: maxigrams-error

Running it:

$ raku maxigrams-error Cowardly refusing to permutate more than 20 elements, tried 22 in sub count-anagrams at ./maxigrams line 27 in sub MAIN at ./maxigrams line 13 in block at ./maxigrams line 1

Adding a workaround line (shown commented out above) to avoid permutations of more than 20 characters makes it work, but any word longer than 20 characters will be ignored. That may be a problem.

Note that by «makes it work» I am not quite honest. It takes forever to run. This is a fundamental problem with the approach, as it applies « .permutations » on the words. A 20 character word has 20! (or 2432902008176640000) permutations (if all the letters are different, which they probably aren't. But that doesn't really matter that much, as it is still a very large number even if we get rid of some digits). That is a lot to go through.

Also note that we already know the potential anagrams; the other words in the dictionary. So instead of all this permutation nonsense, we can simply rearrange the letters in the words (alphabetically is an obvious choice) and count how many times they occur in the dictionary. The highest count is the winner. We will not get a legal word, but we were not asked for it so that is ok.

unit sub MAIN (Str :$dictionary where $dictionary.IO.r = "dict-UK.txt"); my $dict = get-dictionary($dictionary); my %count; %count{ .comb.sort.join }++ for $dict.keys; # [1] my $max = 0; # [2] for %count.keys.sort( { %count{$^b} <=> %count{$^a} } ) # [3] { $max = %count{$_} if %count{$_} > $max; # [4] last if %count{$_} < $max; # [5] say "$_: ", %count{$_}; # [6] } sub get-dictionary ($file where $file.IO.r) { return $file.IO.lines.grep(* !~~ /\W/)>>.lc.Set; }

File: maxigrams

[1] We rearrange the letters in the words (alphabetically), and increment the counter.

[2] The maximum number of anagrams.

[3] We sort the hash by the key size (number of anagrams); largest first.

[4] Set the maximum count the first iteration.

[5] Finish if we go below that count. This allows several words with the same number of anagrams.

[6] print the «word».

« $^a » and « $^b » are a Placeholder Variables that magically turn into existence when we use them. See docs.raku.org/language/variables#The_^_twigil for more information.

Running it:

$ raku maxigrams aelst: 8 $ raku maxigrams --dictionary=dict-US.txt aelst: 8 $ raku maxigrams --dictionary=dict-DE.txt ceehinrst: 11

The english versions ran in about 4-5 seconds, and the german took almost 40 seconds to run (as the dictionary is much larger, and we got a hit on a longer word).

Now, if you want to know the actual words, look them up with «anagram»:

$raku anagrams aelst Anagrams: least slate stael stale steal tales teals tesla $ raku anagrams --dictionary=dict-US.txt aelst Anagrams: least slate stael stale steal tales teals tesla $ raku anagrams --dictionary=dict-DE.txt ceehinrst Anagrams: enterichs entsicher entsichre erscheint erschient reichsten \ scheitern schreiten sicherten streichen tierchens

But it isn't that hard to add it to the program, while we're at it:

unit sub MAIN (Str :$dictionary where $dictionary.IO.r = "dict-UK.txt"); my $dict = get-dictionary($dictionary); my %count; %count{ .comb.sort.join }++ for $dict.keys; my $max = 0; for %count.keys.sort( { %count{$^b} <=> %count{$^a} } ) { $max = %count{$_} if %count{$_} > $max; last if %count{$_} < $max; say "$_: ", %count{$_} , " ", anagrams($_) ; } sub get-dictionary ($file where $file.IO.r) { return $file.IO.lines.grep(* !~~ /\W/)>>.lc.Set; } sub anagrams ($word) { $word.comb.permutations>>.join.unique.grep( { $dict{$_} } ); }

File: maxigrams2

The last part is quite clever, or perhaps too clever. We start with a list of all the possible permutations, and « .grep » gets rid of those that are not valid words (not present in the dictionary). Sit back, and enjoy that one line. Quite a lot of activity for a single line of code...

$ raku maxigrams2 aelst: 8 (least slate stael stale steal tales teals tesla)

Multi Word Anagrams

This doesn't make sense, as any sentence whatsoever could give us anagrams simply by rearranging the order of the words. Four words (e.g. «He is on it») give 24 permuatations, clearly more than 8 (the english dictionaries) or 11 (the german dictionary), and adding even more words increases the number of permutations.

Note that we have lost the possibility to get this result:

«funeral» = «real fun»

Fixing that would involve running «multigrams» (after implementing the multi word support from «anagrams») and wait forever for it to finish. That clearly isn't doable.

And that's it.