Last month I posted a very simple Markov Chain generator that takes one paragraph of text and produces gibberish. The one issue is that it doesn’t scale, for two reasons:

1) It reads all the text into memory at once. That’s fine if you’re processing a few paragraphs, but for large amounts of text it’s better to process one line at a time.

2) The word lists contain repeated words. These lists would take much less space if they had word counts instead. For example:

{"He" ("is" "is"), :start ("He" "She" "He" "She")}

Could be

{"He" {"is" 2}, :start {"He" 2, "She" 2)}

Obviously it’s not much of a gain for this simple example. If we were processing book though, we might have hundreds of sentences that started with He or She. The second structure would be a couple orders of magnitude smaller.

Let’s start with the second problem. Here’s an elegant code snippet to traverse the list of words and build the structure with counts from a Stack Overflow discussion:

Notice the usefulness of fnil, it makes the code more concise. More importantly, the above is a lazy function; if we read the file lazily then we can process the input stream without reading it all into memory first. How would you do it?

Below is a snippet of code based on another Stack Overflow question. By the way, isn’t Stack Overflow awesome? There was a time when we had to code without it, or without internet access. Those were the bad old times 🙂

Notice that I sandwiched the words in a sentence between the :start and :end markers because I want to know what words are good starting and ending points for generated sentences. The transform function sees a long stream of words that could be a single sentence without these markers. As anecdotal evidence that this approach is more memory-efficient, running the JVM on my Macbook against a text file with 50k English words takes up 20Mb less than the original program.

Now that we have a suitable structure, how do we pick words at random but with a probability that’s proportional to the number of times it occurs after a given word? For example, if I have {“He” {“is” 15, “was” 5}} I want to pick “is” 3/4 of the time. This was an interview question that we asked frequently at Inktomi in the late 90s. The simplest solution is to pick a random number in the range of the number of individual instances of words, and pick the index of the word corresponding to the slice it would take in the total.

In the above example, we have two slices:

0 -14 -> "is" 15 - 19 -> "was"

so choosing a random number between 0 and 19 and then checking what slice it belongs to would yield “is” 75% of the time and “was” the remaining 25%. Here’s a nice implementation from Rick Hickey:

And that’s pretty much it. The entire code is here. I ran it against this collection of Hacker News headlines from this post and generated a few of my own:

“Microsoft’s Decline of Employee #1, Leaving Github”

“Y Combinator’s First iOS 6 Countdown”

“Congress Is the HipHop VM”

“Welcome to Get From Russia in Javascript charts and Numpy”

Where do we go from here? We could change the code to pick only sentences of a certain length, or to make chains of n-grams (instead of single words) to create more plausible text. Have fun!