Given that a while back I wrote an article suggesting that Perl 6 should change its name to Century, you’d probably expect me to have an opinion about the Raku debacle. Well, I don’t. I think the time has come to share something other than interesting peculiarities of Perl 6, namely real code I use in my work.

Too soon?

I work as a kind of Jack-of-all-trades at a Norwegian media company, mostly wrangling numbers. Now and then I need to program a few ad hoc scripts to help extract interesting data from various databases, web pages, etc. Some of these end up in production, deployed in the cloud. But a surprising amount of what I do is to write ad hoc code solving a one-time specific problem. Code that is neither particularly optimised nor especially robust, given that it will be used once and never leave my MacBook.

What follows is one example of the latter.

One day I noticed that quite a few of the searches our clients made in one of our databases returned zero results. It was easy to see that many of them were caused by incorrect spelling [1]. Wanting to investigate this a little further, I pulled six months’ worth of queries from our log database. Since this was not meant as anything but an experiment, I reduced the amount of data by removing multi-word queries and keeping only one-word queries that contained alphabetic characters. I also normalized the words, i.e. I dropped accented versions of characters. That reduced the dataset to a manageable size. I stored the resulting dataset in a simple CSV file.

File: st2.csv (containing 726,609 queries; a subset shown here)

Gaza,94279,editorial

Kaulitz,1667,editorial

senkvald,0,editorial

senkveld,1608,editorial

plenklipper,0,editorial

The first column is the word that was searched for, the second contains the number of results (the third is irrelevant for the discussion here).

Looking at this data I started wondering whether I could make qualified guesses as to what words the misspellers actually thought they wrote. In short: Could I create data that could form the basis of a Google-like “Did you mean X” feature?

I knew that I could use the Levenshtein-Damerau algorithm (among others) to compare one word to another and figure out how similar they were. But it seemed like overkill to compare each word to all the words in, say, a dictionary. A program doing that would take forever to run.

But: if you look at the example CSV above, you see that the word “senkvald” has zero results while “senkveld” has 1608 results. That made me realize that for each misspelled query there’d likely be a correctly spelled query that returned results. I could limit the comparisons to queries with results.

What I needed was a program that created a list of misspelled zero-hit words and a list of words with hits; that optimisation alone would speed things up. In addition, I figured the speed would increase even further if I used Perl 6’s parallelisation features.

Now, bring out your torches and pitchforks, because I’m sure you’ll see lots of stuff to improve and optimize here. But in this case it’s a little beside the point, because some of the stuff I’ve done here is done because it’s simple to do and easy to read, and not because it’s the fastest way to do these things [2].

#!/usr/bin/env perl6
use v6.c;

use Text::Levenshtein::Damerau;

my %term;
my %alphabet-hits;
my %alphabet-zero-hits;
my $max-no-comparisons = 0;
my @results;
my @promises;
my @zero-hit;
my @non-zero-hit;

my $PARALLELS = 4;
my $MAX-DISTANCE = 0.3;
my $MIN-COMP-LENGTH = 4;
my $MAX-COMPARISONS = 1_000_000;

constant @ALPHABET = (("a".."z"), (<æ ø å>)).flat;

$*ERR.out-buffer = 0;
$*OUT.out-buffer = 0;

read-words;
similarity;
sort-and-spurt;

sub read-words {
    my $file = open "st2.csv";
    for $file.lines -> $line {
        my @elms = $line.split(",");
        %term{lc(@elms[0])} += @elms[1];
    }
    $file.close;
    note "Read search queries.";

    for %term.keys -> $t {
        @zero-hit.push($t) if %term{$t} == 0;
        @non-zero-hit.push($t) if %term{$t} > 0;
    }
    @zero-hit = @zero-hit.grep({ $_.chars >= $MIN-COMP-LENGTH }).sort;
    @non-zero-hit = @non-zero-hit.grep({ $_.chars >= $MIN-COMP-LENGTH }).sort;
    note "Processed " ~ %term.keys.elems ~ " terms.";

    for @ALPHABET -> $char {
        %alphabet-hits{$char} = @non-zero-hit.grep({ $_.starts-with($char) });
        %alphabet-zero-hits{$char} = @zero-hit.grep({ $_.starts-with($char) });
        $max-no-comparisons += %alphabet-hits{$char}.elems
                             * %alphabet-zero-hits{$char}.elems;
    }
    note "Split dictionaries into alphabetized chunks.";
}

sub similarity {
    my $processed-comparisons = 0;
    my $zero-chunk-size = (@ALPHABET.elems / $PARALLELS).Int;
    my @alphabet-copy = @ALPHABET;

    for 1..$PARALLELS -> $t {
        $zero-chunk-size++ if $t == $PARALLELS
                           && $zero-chunk-size < @alphabet-copy.elems;
        my @alpha-chunk = @alphabet-copy.splice(0, $zero-chunk-size);
        push @promises, start compare(@alpha-chunk, $t);
    }
    note "Finished waiting: " ~ await Promise.allof(@promises);

    sub compare(@alphabet-subset, $parallel-no) {
        note "Parallel $parallel-no: " ~ @alphabet-subset.elems;
        for @alphabet-subset -> $start-char {
            for %alphabet-zero-hits{$start-char}.flat -> $z {
                for %alphabet-hits{$start-char}.flat -> $nz {
                    my $distance = dld($z, $nz);
                    my $fraction = $z.chars > $nz.chars
                        ?? $distance / $z.chars !! $distance / $nz.chars;
                    if $fraction < $MAX-DISTANCE {
                        @results.push: $z ~ "," ~ $nz ~ "," ~ $fraction;
                    }
                    note "Compared $processed-comparisons of " ~
                        "$max-no-comparisons."
                        if $processed-comparisons %% 10_000;
                    return False
                        if ++$processed-comparisons > $MAX-COMPARISONS;
                }
            }
        }
    }
}

sub sort-and-spurt {
    for @results.sort -> $o {
        my ($a, $b) = $o.split(',');
        my $near-equal = ($a.split('').unique.sort
            cmp $b.split('').unique.sort == Same ?? '*' !! '');
        say "$o,$near-equal";
    }
    note "Sorted and output finished.";
}

This script did the job! I got a list of around 17,000 suggestions that could be useful for implementing a “Did you mean X” type of routine. But what was unexpectedly more useful, and more immediately useful at that, is that we could use the list internally to improve how we tag our content. Instead of improving things automatically, it first and foremost improves manual work.

Anyway, back to the interesting stuff (at least to me) that this uncovered about Perl 6:

I used the keyword note a lot. note is like say, except that it outputs to STDERR — a little like warn in Perl 5. That way I could redirect the program’s output to a file while still seeing the debug/error info in the terminal.

Output is often buffered. Most of the time I want it not to be; I want to see the results immediately, so I used this: $*OUT.out-buffer = 0; — the equivalent of Perl 5’s $|++, except that the Perl 6 version is unexpectedly verbose but at the same time self-explanatory.

constant @ALPHABET = (("a".."z"), (<æ ø å>)).flat; — as I had normalized all of the keywords and only kept keywords with Norwegian alphabetic characters, I had to generate a list of Norwegian characters. I couldn’t use a range, i.e. "a".."å", as that would contain all kinds of characters. So I combined two lists, the range "a".."z" and the list <æ ø å>, into one by using the .flat method.

my @alpha-chunk = @alphabet-copy.splice(0, $zero-chunk-size); — I must have used this in other variants before, but splice felt like a new discovery. splice is a kind of super pop: it removes and returns not just a single element but an entire chunk of an array. The original array is shortened accordingly, while you can assign the removed elements to a new array. This was useful for splitting the alphabet into pieces I could hand to the parallel comparison tasks.

And parallelisation is dead simple! All I needed was the start keyword. I did this: push @promises, start compare(@alpha-chunk, $t); — the call to compare runs asynchronously because of the start keyword. You don’t strictly need the push, but you should keep the promises: the @promises array contains all the, well, promises, and note "Finished waiting: " ~ await Promise.allof(@promises); ensures that all the parallel tasks have finished before the program moves on. That was all there was to it! The $PARALLELS variable controls how many parallel tasks you want to run. I chose 4, as that’s the number of cores in my MacBook’s processor.
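The splice-and-start pattern described above can be sketched in miniature like this (a toy example with a six-letter alphabet and two workers; the variable names mirror the real script):

```raku
my @alphabet-copy = "a".."f";
my @promises;
for 1..2 -> $t {
    # splice removes the first 3 letters from @alphabet-copy and returns them
    my @alpha-chunk = @alphabet-copy.splice(0, 3);
    # start runs the block asynchronously and immediately returns a Promise
    push @promises, start { @alpha-chunk.elems };
}
await Promise.allof(@promises);
say @promises».result;   # (3 3): each worker got a 3-letter chunk
```

After the loop, @alphabet-copy is empty: each splice shortened it by three elements, which is exactly how the real script divides the alphabet among its $PARALLELS workers.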
A sub within a sub: since the compare sub was only ever used by the similarity sub, I used Perl 6’s ability to have a sub within a sub. So compare is only visible from within similarity. I probably did this only to impress myself.

The similarity calculation my $distance = dld($z, $nz); my $fraction = $z.chars > $nz.chars ?? $distance / $z.chars !! $distance / $nz.chars; is a combination of Levenshtein-Damerau and my own stuff. The L-D algorithm — simplified — compares two strings and calculates how many edits you need to turn word 2 into word 1 (think of how many key presses you’d need). The result is an integer: the number of edits. But two edits in a short word are more significant than two edits in a long word. So I create a factor: the number of edits divided by the length of the longer of the two words. This factor is a floating point number between 0 and 1. The lower the number, the more similar the words are.

In the line my $near-equal = ($a.split('').unique.sort cmp $b.split('').unique.sort == Same ?? '*' !! ''); I reduce each word to a sorted array of the unique characters in that word. I.e. ABBA is reduced to the array <A B>, but so is BABA. My assumption was that matches consisting of exactly the same characters are more similar than the other matches, even if the edit distance is larger. I.e. ABBA and BABA are more similar than ABBA and ABDA — <A B> versus <A B D>. I discovered that if you compare two arrays for equality, i.e. whether their elements contain the same values in the same order, the comparison returns Same. In Perl 5 you get 0 as a result. In Perl 6 you can choose to check for 0 as well as Same and both will work, but I have to say Same is more self-explanatory.
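To make the thresholds concrete, here is the sample pair from the CSV run through the same calculations (a sketch, assuming the Text::Levenshtein::Damerau module is installed):

```raku
use Text::Levenshtein::Damerau;

my $z  = "senkvald";   # the zero-hit query from the CSV sample
my $nz = "senkveld";   # the query with 1608 hits
my $distance = dld($z, $nz);            # one a/e substitution
my $fraction = $distance / $nz.chars;   # 1/8 = 0.125 (both words are 8 characters)
say $fraction < 0.3;                    # True, so the pair is kept as a suggestion
say $z.split('').unique.sort cmp $nz.split('').unique.sort == Same;
# False: "senkvald" contains an "a" that "senkveld" lacks, so no "*" marker
```

So this pair would appear in the output as a plain suggestion, without the near-equal star.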

So it’s safe to say that I learned a lot by programming this. It was equally nice that the results were useful!