Avidis, avidus natura parum est

One of my first forays into Perl programming, 20 years ago now, was a tool that takes a piece of plaintext, analyzes its structure, and formats it neatly for a given line width. It’s a moderately sophisticated line wrapping application that I use daily to tidy up email correspondence, software documentation, and blog entries.

So the second task of the 19th Weekly Challenge, to implement a “greedy” line-wrapping algorithm, is in many ways an old friend to me.

Greedy line wrapping simply takes each word in the input text and adds it to the current line of output unless doing so would cause the output line to exceed the required maximal line width, in which case it breaks the line at that point and continues filling the next line, et cetera. So a 45-column greedily wrapped paragraph looks like this:

    It is a truth universally acknowledged, that
    a single man in possession of a good fortune
    must be in want of a wife. However little
    known the feelings or views of such a man may
    be on his first entering a neighbourhood,
    this truth is so well fixed in the minds of
    the surrounding families, that he is
    considered the rightful property of some one
    or other of their daughters.

The resulting text is somewhat unbalanced and raggéd on the right margin, but it’s within the required width and tolerably readable. And the algorithm is so simple that it’s possible to implement it in a single Raku statement:

    sub MAIN (:$width = 80) {
        $*IN.slurp.words
            .join(' ')
            .comb(/ . ** {1..$width}  )>  [' ' | $] /)
            .join("\n")
            .say
    }

We take the STDIN input stream ( $*IN ), slurp up the entire input ( .slurp ), and break it into words ( .words ). Then we rejoin those words with a single space between each ( .join(' ') ), and break the text into lines no longer than $width characters ( .comb(/ . ** {1..$width} ), provided each line also ends on a word boundary before a space or end-of-string ( )> [' ' | $] ). Finally, we rejoin those lines with newlines ( .join("\n") ) and print them ( .say ).
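For comparison, the word-by-word rule described earlier can also be spelled out as an explicit loop. This is only a sketch: the name greedy-lines is hypothetical, and unlike the regex it simply places an overlong word on a line of its own rather than skipping part of it.

```raku
# Hypothetical helper (not part of the original solution): greedy wrapping
# written as an explicit loop over the words of the text.
sub greedy-lines (Str $text, Int :$width = 80) {
    my @lines;
    my $current = '';

    for $text.words -> $word {
        if $current eq '' {
            # First word on a line is always accepted...
            $current = $word;
        }
        elsif $current.chars + 1 + $word.chars <= $width {
            # The word (plus a separating space) still fits...
            $current ~= " $word";
        }
        else {
            # Otherwise break the line here and start the next one...
            @lines.push: $current;
            $current = $word;
        }
    }
    @lines.push: $current if $current ne '';

    return @lines;
}

my $text = 'It is a truth universally acknowledged, that a single man '
         ~ 'in possession of a good fortune';

.say for greedy-lines($text, :width(45));
# It is a truth universally acknowledged, that
# a single man in possession of a good fortune
```

The loop makes the "fits or break" decision visible, at the cost of being considerably longer than the single .comb call.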

That’s a reasonable one-liner solution to the specified challenge, but we can do better.

For a start, there’s a hidden edge-case we’re not handling yet. Namely, what happens if you’re a scholarly Welsh ex-miner with health issues?

Look you, I shall have to be terminating my interdisciplinary investigation of consanguineous antidisestablishmentarianism in Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch. For I've just been electrophotomicrographically diagnosed with pseudopneumonoultramicroscopicsilicovolcanoconiosis, isn't it?

Our one-statement solution fails miserably when reformatting this input, being unable to correctly break the excessively long names of the town or the disease. As each of them has more than 45 characters, the regex has to skip over and omit as many leading characters as necessary from the very long words, until it again finds 45 trailing characters followed by a space. So we get:

    Look you, I shall have to be terminating my
    interdisciplinary investigation of
    consanguineous antidisestablishmentarianism
    in
    yngyllgogerychwyrndrobwllllantysiliogogogoch.
    For I've just been
    electrophotomicrographically diagnosed with
    neumonoultramicroscopicsilicovolcanoconiosis,
    isn't it?

Apart from the decapitated words, we also get an absurdly unbalanced right margin when the algorithm is forced to shift sesquipedalia like “consanguineous” and “electrophotomicrographically” to the next line.

Of course, it’s not difficult to fix both those problems. We just give the regex a fall-back option: if it can’t break a line at a word boundary because of an excessively long word, we allow it to break that word internally, provided the break isn’t too close to either end of the word (say, at least five characters in: $minbreak ).

We also constrain it to break regular lines at no less than 80% of the specified width (i.e. $minwidth ) to avert those textual crevasses in the right-hand margin:

    sub MAIN (:$width = 80) {
        say greedy-wrap( $*IN.slurp, :$width );
    }

    sub greedy-wrap( $text,
                     :$width    = 80,
                     :$minwidth = floor(0.8 * $width),
                     :$minbreak = 5,
    ) {
        $text.words.join(' ')
             .comb(/ . ** {1..$width}  $
                   | . ** {$minwidth..$width}  )> ' '
                   | . ** {$minbreak..$width}
                     <before \S ** {$minbreak}>
                   /)
             .join("\n")
    }



In this version the .comb regex specifies that we must fill at least 80% of the requested width with words ( . ** {$minwidth..$width} ), except on the final line ( . ** {1..$width} $ ), and otherwise we’re allowed to take any number of characters, provided we take at least five ( . ** {$minbreak..$width} ), and provided we leave at least five visible characters at the start of the next line as well ( <before \S ** {$minbreak}> ).

This version produces a much more uniform wrapping:

    Look you, I shall have to be terminating my
    interdisciplinary investigation of consangui
    neous antidisestablishmentarianism in
    Llanfairpwllgwyngyllgogerychwyrndrobwllllanty
    siliogogogoch. For I've just been electrophot
    omicrographically diagnosed with pseudopneumo
    noultramicroscopicsilicovolcanoconiosis,
    isn't it?

Except that the longer words are now unceremoniously chopped off, without even the common courtesy of an interpolated copula. So we need an extra step in the pipeline to add hyphens where they’re needed:

    sub greedy-wrap( $text,
                     :$width    = 80,
                     :$minwidth = floor(0.8 * $width),
                     :$minbreak = 5
    ) {
        $text.words.join(' ')
             .match(/ . ** {1..$width}  $
                    | . ** {$minwidth..$width}  )> ' '
                    | . ** {$minbreak..$width-1}
                      <broken=before \S ** {$minbreak}>
                    /, :global)
             .map({ $^word.<broken> ?? "$^word-" !! $^word })
             .join("\n")
    }



In this version we use a global .match instead of a .comb to break the text into lines, because we need to break long words one character short of the maximal width ( . ** {$minbreak..$width-1} ), then mark those lines as having been broken ( <broken=before \S ** {$minbreak}> ) and then add a hyphen to those lines ( $^word.<broken> ?? "$^word-" !! $^word ).

Which produces:

    Look you, I shall have to be terminating my
    interdisciplinary investigation of consangui-
    neous antidisestablishmentarianism in
    Llanfairpwllgwyngyllgogerychwyrndrobwllllant-
    ysiliogogogoch. For I've just been electroph-
    otomicrographically diagnosed with pseudopne-
    umonoultramicroscopicsilicovolcanoconiosis,
    isn't it?

Howdy, TeX

Even with the improvements we made, the greedy line-wrapping algorithm often produces ugly unbalanced paragraphs. For example:

No one would have believed, in the last years of the nineteenth century, that human affairs were being watched from the timeless worlds of space. No one could have dreamed that we were being scrutinised as someone with a microscope studies creatures that swarm and multiply in a drop of water. And yet, across the gulf of space, minds immeasurably superior to ours regarded this Earth with envious eyes, and slowly, and surely, they drew their plans against us...

In 1981, Donald Knuth and Michael Plass published an algorithm for breaking text into lines, implemented as part of the TeX typesetting system. The algorithm considers every possible point in the text at which a line-break could be inserted and then finds the subset of those points that produces the most evenly balanced overall result.

This, of course, is far more complex and more expensive than the first-in-best-dressed approach of the greedy algorithm. In fact, as it has to consider building a line starting at every one of the N words, and running to every one of the N-M following words, it is clearly going to require O(N²) space and time to compute, compared to the greedy algorithm’s thrifty O(N). On a typical paragraph like the examples above, the TeX algorithm runs about 60 times slower.

But as most paragraphs are short (50 to 100 words), an N² cost is often acceptable.

So here’s a simple version of that approach, in Raku:

    sub TeX-wrap ($text, :$width = 80, :$minbreak = 5) {
        # Extract individual words, hyphenating if necessary...
        my @words = $text.words.map: {
            my @breaks = .comb: $width-$minbreak;
            @breaks[0..*-2] »~=» '-';
            |@breaks;
        };

        # Compute handy text statistics...
        my @word-len   = @words».chars;
        my $word-count = @words.elems;

        # These track EOL gaps, plus cost and position of breaks...
        my @EOL-gap    = [0 xx $word-count + 1] xx $word-count + 1;
        my @line-cost  = [0 xx $word-count + 1] xx $word-count + 1;
        my @total-cost =  0 xx $word-count + 1;
        my @break-pos  =  0 xx $word-count + 1;

        # Build table of EOL gaps for lines from word i to word j...
        for 1..$word-count -> $i {
            @EOL-gap[$i][$i] = $width - @word-len[$i-1];
            for $i+1 .. $word-count -> $j {
                @EOL-gap[$i][$j]
                    = @EOL-gap[$i][$j-1] - @word-len[$j-1] - 1;
            }
        }

        # Work out the cost of a line built from word i to word j...
        for 1..$word-count -> $i {
            for $i..$word-count -> $j {
                # Overlength lines are infinitely expensive...
                if @EOL-gap[$i][$j] < 0 {
                    @line-cost[$i][$j] = Inf;
                }

                # A short final line costs nothing...
                elsif $j == $word-count && @EOL-gap[$i][$j] >= 0 {
                    @line-cost[$i][$j] = 0;
                }

                # Cost of other lines is sum-of-squares of EOL gaps...
                else {
                    @line-cost[$i][$j] = @EOL-gap[$i][$j]²;
                }
            }
        }

        # Walk through cost table, finding the least-cost path...
        @total-cost[0] = 0;
        for 1..$word-count -> $j {
            @total-cost[$j] = Inf;
            for 1..$j -> $i {
                # Do words i to j (as a line) reduce total cost???
                my $line-ij-cost = @total-cost[$i-1]
                                 + @line-cost[$i][$j];

                if $line-ij-cost < @total-cost[$j] {
                    @total-cost[$j] = $line-ij-cost;
                    @break-pos[$j]  = $i;
                }
            }
        }

        # Extract minimal-cost lines backwards from final line...
        return join "\n", reverse gather loop {
            state $end-word = $word-count;
            my $start-word  = @break-pos[$end-word] - 1;
            take @words[$start-word .. $end-word - 1].join(' ');
            $end-word = $start-word or last;
        }
    }



It’s slower and far more complex than the greedy algorithm but, as with so many other aspects of life, you get what you pay for...because it also produces much better line-wrappings, like these:

No one would have believed, in the last years of the nineteenth century, that human affairs were being watched from the timeless worlds of space. No one could have dreamed that we were being scrutinised as someone with a microscope studies creatures that swarm and multiply in a drop of water. And yet, across the gulf of space, minds immeasurably superior to ours regarded this Earth with envious eyes, and slowly, and surely, they drew their plans against us...

It is a truth universally acknowledged, that a single man in possession of a good fortune must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.

Look you, I shall have to be terminating my interdisciplinary investigation of consanguineous antidisestablishmentarianism in Llanfairpwllgwyngyllgogerychwyrndrobwlll- lantysiliogogogoch. For I've just been electrophotomicrographically diagnosed with pseudopneumonoultramicroscopicsilicovolc- anoconiosis, isn't it?

Slow is smooth; smooth is fast

You get what you pay for, but there’s no reason to overpay for those benefits.

The Knuth/Plass algorithm is widely used, and hence has been the subject of extensive optimization efforts. Versions have now been devised that run in linear time and space, though the intrinsic complexity always has to go somewhere, and it generally winds up in the code itself...as O(N³) incomprehensibility.

But not all of the optimized solutions are brain-meltingly complicated. For example, there’s an elegant O(N * width) algorithm that implicitly converts the text into a directed graph, in which each node is a word and the weight of each edge is the cost of breaking a line at that word. The optimal break points can then be found in linear time by computing the shortest path through the graph.

In Raku, that looks like this:

    sub shortest-wrap ($text, :$width = 80, :$minbreak = 5) {
        # Extract and hyphenate individual words (as for TeX)...
        my @words = $text.words.map: {
            my @breaks = .comb: $width-$minbreak;
            @breaks[0..*-2] »~=» '-';
            |@breaks;
        };
        my $word-count = @words.elems;

        # Compute index positions from start of text to each word
        # (a running sum, hence the triangular reduction [\+])...
        my @word-offset = [\+] 0, |@words».chars;

        # These track minimum cost, and optimal break positions...
        my @minimum   = flat 0, Inf xx $word-count;
        my @break-pos = 0 xx $word-count + 1;

        # Walk through text tracking minimum cost...
        for 0..$word-count -> $i {
            for $i+1 .. $word-count -> $j {
                # Compute line width for line from word i to word j...
                my $line-ij-width
                    = @word-offset[$j] - @word-offset[$i] + $j - $i - 1;

                # No need to track cost for lines wider than maximum...
                last if $line-ij-width > $width;

                # Cost of line increases with square of EOL gap...
                my $cost = @minimum[$i] + ($width - $line-ij-width)²;

                # Track least cost and optimal break position...
                if $cost < @minimum[$j] {
                    @minimum[$j]   = $cost;
                    @break-pos[$j] = $i;
                }
            }
        }

        # Extract minimal-cost lines backwards (as for TeX)...
        return join "\n", reverse gather loop {
            state $end-word = $word-count;
            my $start-word  = @break-pos[$end-word];
            take @words[$start-word .. $end-word - 1].join(' ');
            $end-word = $start-word or last;
        }
    }
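The line-width arithmetic in that inner loop is easy to check in isolation. The following sketch uses an arbitrary five-word sample (an assumption for illustration only) to show how running character offsets give the width of any candidate line without rejoining the words:

```raku
# Arbitrary sample words, chosen only for illustration...
my @words = <Far back in the mists>;

# Running totals of word lengths: offset[k] is the number of
# characters in the first k words (ignoring spaces)...
my @word-offset = [\+] 0, |@words».chars;
say @word-offset;                      # [0 3 7 9 12 17]

# Width of a line holding words i .. j-1, joined by single spaces...
sub line-width ($i, $j) {
    @word-offset[$j] - @word-offset[$i] + $j - $i - 1
}

say line-width(1, 4);                  # 11
say @words[1..3].join(' ').chars;      # 11  ("back in the")
```

Precomputing the offsets once means each candidate width costs a constant number of subtractions, rather than a join and a character count.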



This approach sometimes optimizes line-breaks slightly differently from the TeX algorithm, but always with the same overall “balanced” appearance. For example:

No one would have believed, in the last years of the nineteenth century, that human affairs were being watched from the timeless worlds of space. No one could have dreamed that we were being scrutinised as someone with a microscope studies creatures that swarm and multiply in a drop of water. And yet, across the gulf of space, minds immeasurably superior to ours regarded this Earth with envious eyes, and slowly, and surely, they drew their plans against us...

The major difference between these two “best-fit” algorithms is that the shortest-path approach tries to balance all the lines it builds, including the final one, so it tends to produce a “squarer” wrapping with shorter lines generally, but a longer last line.

It also runs five times faster than the TeX approach (but still ten times slower than the greedy algorithm).

Punishing widows and orphans

There’s a subtle problem with all three approaches we’ve looked at so far: they each optimize for only one thing. Greedy wrapping optimizes for maximal line-widths, whereas TeX wrapping and shortest-path wrapping both optimize for maximal line balance (i.e. minimal raggédness).

But, as desirable as each of those characteristics is, there are other typographical properties we might also want to see in our wrapped text. Because there are numerous other ways for a piece of text to be ugly:

Now is the winter of our discontent made glorious summer by this sun of York; and all the clouds that lour'd upon our house in the deep bosom of the ocean buried. Now are our brows bound with victorious wreaths; our bruised arms hung up for monuments; our stern alarums changed to merry meetings, our dreadful marches to delightful measures.

Apart from the disconcerting unevenness of the lines, this wrapping is also mildly irritating because it repeatedly breaks a line at a grammatically infelicitous point, leaving single words (such as “and”, “buried”, “our”, and “measures”) visually isolated from the rest of their phrase.

Isolated words at the end of a line are known as widows and at the start of a line as orphans. Cut off by a line break from their proper context, they make the resulting text look awkward and badly formatted, particularly if (as here) a widow also constitutes the entire last line of a paragraph.

It’s usually possible to avoid creating widows and orphans, by breaking the text one word earlier or later:

Now is the winter of our discontent made glorious summer by this sun of York; and all the clouds that lour'd upon our house in the deep bosom of the ocean buried. Now are our brows bound with victorious wreaths; our bruised arms hung up for monuments; our stern alarums changed to merry meetings, our dreadful marches to delightful measures.

...but to achieve this effect, our line-wrapping algorithm would have to be aware not just of the width and balance of the lines it creates, but also of the content of the text, and the aesthetic consequences of where it chooses to break each line.

In practical terms, this means it needs a more sophisticated cost function to optimize.

The cost function that the greedy algorithm attempts to minimize is just the sum of the lengths of the gaps at the end of each line:

sub cost (@lines, $width) { sum ($width «-« @lines».chars) }

In contrast, the TeX and shortest-path algorithms attempt to reduce the variation in end-of-line gap lengths, by minimizing the sum-of-squares:

sub cost (@lines, $width) { sum ($width «-« @lines».chars)»² }
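To see why squaring the gaps favours balance, here is a small sketch comparing the two measures on a pair of made-up two-line "wrappings" at a width of 10. The names gap-sum and gap-squares and the sample lines are assumptions introduced only for this illustration.

```raku
# Hypothetical names for the two cost measures described above...
sub gap-sum     (@lines, $width) { ($width «-« @lines».chars).sum }
sub gap-squares (@lines, $width) { ($width «-« @lines».chars).map(* ** 2).sum }

# Two made-up wrappings with the same total amount of trailing space...
my @uneven = 'aaaaaaaaaa', 'bbbbbb';      # gaps of 0 and 4
my @even   = 'aaaaaaaa',   'bbbbbbbb';    # gaps of 2 and 2

say gap-sum(@uneven, 10);        # 4
say gap-sum(@even,   10);        # 4

say gap-squares(@uneven, 10);    # 16
say gap-squares(@even,   10);    # 8
```

The plain sum cannot tell the two layouts apart, whereas the sum-of-squares prefers the one whose gaps are spread evenly.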

But we can easily minimize other properties of a series of wrapped lines, by implementing and applying more complex cost functions. For example, let’s redesign the greedy algorithm (our fastest alternative) to improve its overall line balance, and at the same time to reduce the number of widows and orphans it leaves in the wrapped text.

The cost function we’ll use looks like this:

    sub cost (@lines, $width) {
          ($width «-« @lines.head(*-1)».chars)»³».abs.sum
        * @lines³
        * (1 + 10 * ( @lines.grep(ORPHANS) + @lines.grep(WIDOWS) ) )³;
    }

The cost it computes for a given set of lines is derived by quantifying and then multiplying together three desirable characteristics of a wrapped paragraph:

- the uniformity of the wrapped lines, measured as the sum-of-cubes of the end-of-line gaps for every line except the last: ($width «-« @lines.head(*-1)».chars)»³».abs.sum

- the compactness of the resulting paragraph, measured as the cube of the total number of lines: @lines³

- the number of widows and orphans created, measured as the cube of one plus ten times the total number of isolated words found: (1 + 10 * ( @lines.grep(ORPHANS) + @lines.grep(WIDOWS) ) )³

The cost function uses cubes instead of squares to more quickly ramp up the penalty incurred for introducing multiple unwanted features, compared to the zero cost of ideal lines.

The factor of ten applied to widows and orphans reflects a particularly robust aesthetic objection to them (tweak this number to suit your personal level of typographical zeal).

Orphans and widows are detected as follows:

    sub ORPHANS {/ ^^  \S+  <[.!?,;:]>  [\s | $$] /}
    sub WIDOWS  {/ <[.!?,;:]>  \s+  \S+  $$       /}

An orphan is a single word at the start of a line ( ^^ \S+ ) followed by any phrase-ending punctuation character ( <[.!?,;:]> ), followed by a space or the end of the line ( [\s | $$] ). A widow is a single word immediately after a punctuation character ( <[.!?,;:]> \s+ \S+ ), which is also at the end of the line ( $$ ).
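As a quick check of those two patterns, the sketch below applies them to four hand-written lines. The variables $orphan and $widow are stand-ins (assumptions) holding the same regexes as ORPHANS and WIDOWS, and the sample lines are typed in by hand rather than generated.

```raku
# The same two patterns, stored in (hypothetically named) variables...
my $orphan = / ^^  \S+  <[.!?,;:]>  [\s | $$] /;
my $widow  = / <[.!?,;:]>  \s+  \S+  $$ /;

# Hand-written sample lines...
my @lines = 'Far back in the mists of ancient time, in',
            'the great and glorious days of the former',
            'Galactic Empire, life was wild, rich and',
            'largely tax free.';

say @lines.grep($widow).elems;     # 1  (the trailing "in" on line one)
say @lines.grep($orphan).elems;    # 0

# A line that starts with a lone word before punctuation...
say so 'time, in the great and glorious' ~~ $orphan;    # True
```

Note that each pattern counts lines containing an isolated word, not the isolated words themselves, so a line is penalized at most once per pattern.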

With this more sophisticated cost function we can now optimize for both structural properties and aesthetic ones. We could also extend the function to penalize other unwanted artefacts, such as phrases fractured after their introductory preposition, split infinitives, or articles dangling at the end of a line:

    sub ESTRANGED { / \s [for|with|by|from|as|to|a|the] $$/ }

    sub cost (@lines, $width) {
          ($width «-« @lines.head(*-1)».chars)»³».abs.sum
        * @lines³
        * (1 + 10 * ( @lines.grep(ORPHANS)
                    + @lines.grep(WIDOWS)
                    + @lines.grep(ESTRANGED)
                    )
          )³;
    }

In order to optimize a line-wrapping using a complex cost function like this, we need a way to generate alternative wrappings...which we can then assess, compare, and select from.

But the greedy wrapping approach (and, indeed, the TeX algorithm and shortest-path technique as well) always generates only a single wrapping. How do we get more?

An easy and quick way to generate those additional wrappings is to use the greedy approach, but to vary the width to which it wraps. For example, if we wrap the same text to 45 columns, then to successively shorter widths, like so:

    for 45...40 -> $width {
        my $wrapping = greedy-wrap($text, :$width);
        my $cost     = cost($wrapping.lines, $width);
        say "[$width columns --> cost: $cost]";
        say "$wrapping\n";
    }

...we get:

    [45 columns --> cost: 40768]
    Far back in the mists of ancient time, in the
    great and glorious days of the former
    Galactic Empire, life was wild, rich and
    largely tax free.

    [44 columns --> cost: 10051712]
    Far back in the mists of ancient time, in
    the great and glorious days of the former
    Galactic Empire, life was wild, rich and
    largely tax free.

    [43 columns --> cost: 3662912]
    Far back in the mists of ancient time, in
    the great and glorious days of the former
    Galactic Empire, life was wild, rich and
    largely tax free.

    [42 columns --> cost: 851840]
    Far back in the mists of ancient time, in
    the great and glorious days of the former
    Galactic Empire, life was wild, rich and
    largely tax free.

    [41 columns --> cost: 85184]
    Far back in the mists of ancient time, in
    the great and glorious days of the former
    Galactic Empire, life was wild, rich and
    largely tax free.

    [40 columns --> cost: 2752]
    Far back in the mists of ancient time,
    in the great and glorious days of the
    former Galactic Empire, life was wild,
    rich and largely tax free.

The 40-column wrapping clearly produces the most balanced and least orphaned or widowed text, and this is reflected in its minimal cost value. Of course, we’re no longer making use of the entire available width, but a 10% reduction in line length seems an acceptable price to pay for such a substantial increase in visual appeal.
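The first two costs listed above can be reproduced by hand. The sketch below spells the cost function out in a longer form, without the hyperoperators, and applies it to the 45- and 44-column wrappings. The line breaks are typed in manually on the assumption that they are the ones greedy-wrap produces, and $ORPHAN and $WIDOW are stand-in variables for the ORPHANS and WIDOWS patterns.

```raku
# Stand-ins for the ORPHANS and WIDOWS patterns...
my $ORPHAN = / ^^  \S+  <[.!?,;:]>  [\s | $$] /;
my $WIDOW  = / <[.!?,;:]>  \s+  \S+  $$ /;

# The same three factors as the cost function, written out longhand...
sub cost (@lines, $width) {
    my $raggedness = @lines[0 .. *-2].map({ abs($width - .chars) ** 3 }).sum;
    my $isolated   = @lines.grep($ORPHAN).elems + @lines.grep($WIDOW).elems;
    return $raggedness * @lines.elems ** 3 * (1 + 10 * $isolated) ** 3;
}

# Assumed line breaks at 45 columns: gaps of 0, 8 and 5; no isolated words...
my @at45 = 'Far back in the mists of ancient time, in the',
           'great and glorious days of the former',
           'Galactic Empire, life was wild, rich and',
           'largely tax free.';

# Assumed line breaks at 44 columns: gaps of 3, 3 and 4; one widow ("in")...
my @at44 = 'Far back in the mists of ancient time, in',
           'the great and glorious days of the former',
           'Galactic Empire, life was wild, rich and',
           'largely tax free.';

say cost(@at45, 45);    # 40768     = (0 + 512 + 125) * 64 * 1
say cost(@at44, 44);    # 10051712  = (27 + 27 + 64)  * 64 * 1331
```

The single widow at 44 columns multiplies the cost by 11³ = 1331, which is why that wrapping scores so much worse than its slightly more ragged 45-column neighbour.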

More interestingly, the 40-column alternative produced in this way also looks better than the wrapping created by the far more complex TeX algorithm (which unfortunately widows the “in” at the end of the first line):

Far back in the mists of ancient time, in the great and glorious days of the former Galactic Empire, life was wild, rich and largely tax free.

The iterated greedy solution is also better than the shortest-path approach, which widows “time”, orphans “life”, and wraps the lines a full 20% short of the requested 45 columns:

Far back in the mists of ancient time, in the great and glorious days of the former Galactic Empire, life was wild, rich and largely tax free.

Moreover, despite now being technically O(N²) (as the O(N) greedy-wrap function must now be called N/10 times), the iterated greedy technique is still 25% faster than the TeX algorithm and nearly 75% as fast as the shortest-path approach.

But we can do even better than that. Note that, as we reduced the wrapping width from 45 to 40, the narrower margin only sometimes changed the wrapping that was produced (in this case, only at 45, 44, and 40 columns). So we were actually doing twice as much work as was strictly necessary to find the optimal width.

It turns out that, if the width of the longest line in the previous wrapping is equal to or shorter than the next candidate width, then it’s always a waste of effort to try that next candidate width...because it must necessarily produce exactly the same wrapping again.

So we could improve our search loop by tracking how wide each wrapping actually is and only trying subsequent candidate widths if they are shorter than that. And, if we also track the best wrapping to date (i.e. the one with the least cost) as we search, then we’ll have a complete iterated greedy wrapping algorithm:

    sub iterative-wrap ($text, :$width = 80) {
        # Track the best wrapping we find...
        my $best-wrapping;

        # Allow any width down to 90% of that specified...
        for $width...floor(0.9 * $width) -> $next-width {
            # Only try widths that can produce new wrappings...
            state $prev-max-width = Inf;
            next if $next-width > $prev-max-width;

            # Build the wrapping and evaluate it...
            my $wrapping = greedy-wrap($text, :width($next-width));
            my $cost     = cost($wrapping.lines, $next-width);

            # Keep the wrapping only if it's the best so far...
            state $lowest-cost = Inf;
            if $cost < $lowest-cost {
                $best-wrapping = $wrapping;
                $lowest-cost   = $cost;
            }

            # Try one character narrower next time...
            $prev-max-width = $wrapping.lines».chars.max - 1;
        }

        # Send back the prettiest one we found...
        return $best-wrapping;
    }



With the optimization of skipping unproductive widths, this solution is now 2.5 times faster than the TeX algorithm and 25% faster than the shortest-path approach.

As a final step, we could rewrite the above code in a cleaner, shorter, more “native” Raku style, which will probably make it more maintainable as well:

    sub iterative-wrap ($text, :$width = 80) {
        # Return the least-cost candidate wrapping...
        return min :by{.cost}, gather loop {
            # Start at specified width; stop at 90% thereof...
            state $next-width = $width;
            last if $next-width < floor(0.9 * $width);

            # Create and evaluate another candidate...
            my $wrapping = greedy-wrap($text, :width($next-width));
            my $cost     = cost($wrapping.lines, $next-width);

            # Gather it, annotating it with its score...
            role Cost { has $.cost }
            take $wrapping but Cost($cost);

            # Try one character narrower next time...
            $next-width = $wrapping.lines».chars.max - 1;
        }
    }



In this version, we generate each candidate wrapping within an unconditional loop, starting at the specified width ( state $next-width = $width ) and finishing at 90% of that width ( last if $next-width < floor(0.9 * $width) ).

We create each wrapping greedily and evaluate it exactly as before, but then we simply accumulate the wrapping, annotating it with its own cost ( take $wrapping but Cost($cost) ).

The Cost role gives us an easy way to add the cost information to the string containing the wrapping, without messing up the string itself. A role is a collection of methods and attributes that can be added to an existing class as a component. Other languages have similar constructs, but refer to them as “interfaces” or “traits” or “protocol extensions” or “mixins”.

In this case we simply add the extra cost-tracking functionality to the wrapping string by using the infix but operator...which transforms the left operand into a new kind of object derived from the Str class of the left operand, but (ahem!) with additional behaviours specified by the role that is the right operand.
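A minimal sketch of that mixin behaviour, using short placeholder strings and made-up costs rather than real wrappings, and selecting the minimum with the .min method rather than the min function:

```raku
# A role with a single public attribute, as in the code above...
role Cost { has $.cost }

# Placeholder strings and invented costs, purely for illustration...
my @candidates = ('first'  but Cost(7)),
                 ('second' but Cost(3)),
                 ('third'  but Cost(5));

# Pick the candidate whose .cost is smallest...
my $best = @candidates.min(*.cost);

say $best;           # second
say $best.cost;      # 3
say $best ~~ Str;    # True
say $best.chars;     # 6
```

The annotated value still behaves as an ordinary string everywhere a string is expected, which is what lets the wrapping be returned and printed directly.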

So our gather loop collects a sequence of wrapping strings, each of which now has an extra .cost method that reports its cost, and which then allows us to apply the built-in min function to select and return the best wrapping produced by the loop ( return min :by{.cost}, gather loop {...} ).

The code of our new iterative-wrap subroutine is seven times longer and seven times slower than the original greedy-wrap implementation.

But it also produces results that are at least seven times prettier.

And that’s a trade-off well worth making.

Damian