Word numbers, Part 2

We return to ITA Software’s “word numbers” problem, like the following:

If the integers from 1 to 999,999,999 are written as words, sorted alphabetically, and concatenated, what is the 51 billionth letter?

In the first part, we came up with a problem statement. We defined a term ten9 of type (Seminearring a, Character a) => a ; that is, a value of any type where we know how to add, multiply, and what to do with characters. For instance, ten9 could take a values in [String] :

*WordNumbers1> take 5 (ten9 :: [String]) ["one","two","three","four","five"]

and the answer to the above problem is easy to write:

*WordNumbers1> concat (List.sort (ten9 :: [String])) !! 51000000000 *** Exception: Prelude.(!!): negative index

As you can see, though, this specification has no chance of working: the indexing function (!!) takes an Int as its second argument, which one my computer cannot represent numbers bigger than 231, which is only around 2 billion. This limitation merely represents a practical limitation, that you are unlikely to be able to work with lists that are so colossally big, and indeed the computation above would take much too long to complete.

In this post we will start to solve such problems efficiently. We start by counting the total length of all strings in ten9 . The key idea is to think of converting lists of strings to polynomials, where each character gets mapped to the variable x, a string of length n gets mapped to xn, and a list of strings get mapped to a sum. Thus instead of a list starting

["one","two","three","four","five",...]

we get a polynomial whose first few terms are

x3 + x3 + x5 + x4 + x4 + …

If we evaluate this polynomial at x = 1, we get the total number of terms, which is not very interesting. If we instead take the derivative first, we get

3x2 + 3x2 + 5x4 + 4x3 + 4x3 + …

Now evaluation at x = 1 gives

3 + 3 + 5 + 4 + 4 + …

which is the total length of all these strings, as desired.

To implement this efficiently, we use the idea of automatic differentiation: we can make the value and first derivative of a function around x = 1 into a seminearring supporting addition and multiplication. (We could include subtraction and division if we wanted, but we do not need it here.) Conceptually, it is best to think about this as the first two terms in the Taylor series expansion, as we will explain more below.

We now turn to the implementation. As before, you can download this post as a program. First we import what we did previously.

{-# OPTIONS -W -fglasgow-exts #-} module WordNumbers2 where import WordNumbers1 import Prelude hiding ((+), (*), sum, product) import qualified Prelude as P

As a warm-up, we first count the total number of strings, using the fact that natural numbers are a nearring.

newtype Nat a = Nat a deriving (Eq, Ord) type Count = Nat Integer instance (Show a) => Show (Nat a) where show (Nat x) = show x instance (Num a) => Monoid (Nat a) where zero = Nat 0 Nat a + Nat b = Nat (a P.+ b) instance (Num a) => Seminearring (Nat a) where one = Nat 1 Nat a * Nat b = Nat (a P.* b)

We convert every character into the number 1. This corresponds to the evaluation at x = 1 as explained above.

instance (Num a) => Character (Nat a) where char _ = one

We check our work by counting the total number of terminals in the grammar for ten9 .

*WordNumbers2> ten9 :: Count 999999999

Note that we get 109 - 1 since ‘zero’ is not produced by the grammar.

To compute the total count of all letters, we need a kind of derivative, as mentioned above. In general, we want to keep track of the value of a function and its first derivative. If we are working with several variables (as we will later), the first derivative becomes a vector of partial derivatives. In general, the derivatives need to be a module over the scalars (the base seminearring), with addition (a monoid structure), and multiplication by scalars—

class (Seminearring r, Monoid m) => Module r m where (*>) :: r -> m -> m (*>) = flip (<*) (<*) :: m -> r -> m (<*) = flip (*>)

—satisfying associativities (x*y) *> m = x * (y*>m) , m <* (x*y) = (m<*x) * y , and (x*>m) <* y = x *> (m<*y) along with various distributive laws. (It’s not clear to me precisely which distributive laws should hold for a general seminearring.)

(This structure is more properly called a bimodule, because we allow multiplication by scalars on both sides.)

Now a derivative as a pair of a value and a derivative—

data Deriv r m = Deriv r m deriving (Eq, Show)

—satisfying the usual law for addition d(f+g) = df + dg—

instance (Monoid r, Monoid m) => Monoid (Deriv r m) where zero = Deriv zero zero Deriv c1 m1 + Deriv c2 m2 = Deriv (c1 + c2) (m1 + m2)

—while for multiplication we use Leibniz’s rule, d(f·g) = f·dg + df·g.

instance (Module r m) => Seminearring (Deriv r m) where one = Deriv one zero Deriv c1 m1 * Deriv c2 m2 = Deriv (c1 * c2) (c1 *> m2 + m1 <* c2)

We convert characters to derivatives component-wise.

instance (Character r, Character m) => Character (Deriv r m) where char c = Deriv (char c) (char c)

To actually use these derivatives, we introduce a wrapper type to keep track of the units, with some obvious instances.

newtype Wrap s a = Wrap a deriving (Eq, Ord) instance (Monoid a) => Monoid (Wrap s a) where zero = Wrap zero Wrap a + Wrap b = Wrap (a + b) instance (Show a) => Show (Wrap s a) where show (Wrap x) = show x instance (Seminearring a) => Seminearring (Wrap s a) where one = Wrap one Wrap a * Wrap b = Wrap (a * b) instance (Seminearring r) => Module r (Wrap s r) where r *> Wrap m = Wrap (r * m) Wrap m <* r = Wrap (m * r)

In our case, the units are numbers of characters, which we call Volume .

data V type Volume = Wrap V (Nat Integer)

Every character has length one:

instance Character Volume where char _ = one

We could change this definition to count, for instance, the total number of pen strokes used to write all these letters.

Now we can quickly count the total number of characters in the English words from one to a billion.

*WordNumbers2> ten9 :: Deriv Count Volume Deriv 999999999 70305000000

For a sanity check, we check the count to a million…

*WordNumbers2> ten6 :: Deriv Count Volume Deriv 999999 44872000

… which agrees with our previous result, but completes much more quickly.

It’s worth thinking about how this works. In computing

*WordNumbers2> ten2 :: Deriv Count Volume Deriv 99 854

for instance, the total length of the strings from one to nine is used a total of nine times, but we don’t want to count their length that many times. Instead we count how many times they will be used, and multiply using Leibniz’s rule.

In the next installment we will use this setup to find the 51 billionth letter of this sequence (omitting the sorting), and then we will move on to see how to sort the sequence while maintaining efficiency.

Update: Added one associativity law that I forgot earlier.