Of generating random text using a Markov model

Generating text using a markov model is a very old idea. In A mathematical theory of communication (1948), Claude Shannon gives different samples of text generated from offers the same Markov recipe.

The different sample differ by what is considered a token (letters or words) and by the number of precedent tokens taken in account when computing the next token.

Letters as token, Markov state of 1 letter ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE. Letters as token, Markov state of 2 letters IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE. Word as token, Markov of state of 1 word THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.

1948, just think how cool a hacker he was !

I won’t explain what Markov model are in this post. Claude Shannon actually manages to explain how he generated these text with very simple words. Since computers weren’t exactly ready for text processing at this time, these pieces have actually been generated by manually looking up word iteration in a book. For instance to generate the first sample …

One opens a book at random and selects a letter at random on the page. This letter is recorded. The book is then opened to another page and one reads until this letter is encountered. The succeeding letter is then recorded. Turning to another page this second letter is searched for and the succeeding letter recorded, etc.

To generate sample 2, one would do the same, but looking for consecutive pairs of letters, and to generate sample 3 one would do that with words.

Unfortunately Shannon notes that :

It would be interesting if further approximations could be constructed, but the labor involved becomes enormous at the next stage.

With current computers, it is actually extremely easy to generate such text while considering the two precedent words. In this post I will show an implementation in CoffeeScript. Rather than linearly looking up for occurrence of the two precedent words, we are going to put all alternatives into an index. Writing that kind of code is a lot of fun and I will try to make the code as clear as possible, so that anyone can read it.

The project is available on github and works on both nodeJS and in the browser. It can make a funnier alternative to lorem ipsum for testing your design.

Demo

In the end, given different famous books as a corpus, the following algorithm generated the following text :

Pride and Prejudice, Jane Austeen.

I do not. It is very much at Pemberley. The idea of their all walking out. I must acknowledge to you. When I do not marry Mr Collins had promised herself. But you have the carriage might be copied. It would look odd to be proud. You are perfectly good. Elizabeth was at the sight of Miss Darcy was delighted. You have no objection to my charge. I know not. Lydia was urgent with the keenest of all. Mr Collins, you puzzle me exceedingly. But, my proposals will not go. To the rest. But, to much conversation, no traces of them.

The republic, Plato.

And what do you say of men. Quite true, he replied. Most true, he replied. Yes, he said. Is not the noble Hesiod and the other falls. They will tell you. And not being ambitious they do. Then you see, except at the same. I fear that the ideal of modern thinkers. But this only shows that they may be more likely. For I must be elicited from within. Let us examine this: Is not the same nature. This, which is current among men. For there might be eradicated. Suppose now another form of servitude. Plato is the subject.

The Hounds of the Baskervilles, Sir Arthur Conan Doyle.

When you know her by sight. No, no doubt, but he’s not a child. It’s a very extraordinary question. We have never in my inferences. Thank you ! Now, Cartwright. It had not gone today it is ? I am. Mortimer has read to us in the Grimpen Mire. Here are twenty-three shillings. It was during these early days to Sherlock Holmes. But, hark, what do you say, Watson. It is Selden, in case I want you again. You know that he confided to anyone. I could look straight through the clouds. Is it, could object.

Demo time

You can test the algorithm by copy paste a long text below and click on the generate button. Don’t be shy with the length of the text. The longer the better.

Implementation

The tokenizer

First we need a simple tokenizer. A tokenizer, is a function that split our text into a sequence of token. In our case we will want to work on words, so we will want to split on whitespaces and punctuation. We don’t want to get rid of punctuation though.

Regular expressions are the right tool to get a quick-and-dirty version running. A very simple version could be for instance :

tokenize = ( corpus ) -> corpus = corpus . replace ( PUNCTUATION , " $1 " ) corpus . split /\s+/

The first line surrounds all of the punctuation by white spaces, and the second one splits the text on spaces.

Python’s defaultdict and counter in CoffeeScript

We will store the transitions in a two level dictionary going

first word -> second word -> third word -> number of appearance

In order to build this dictionary we will recode two very handy structures available in python’s collection, called defaultdict and Counter .

A defaultdict makes it possible to automatically fill a dictionary with a default value when the key requested is not found.

# A simple implementation of Python's # defaultdict class DefaultDict constructor : ( @ default_value ) -> @ c = {} get : ( k ) -> if not @ c [ k ] ? @ c [ k ] = @ default_value () @ c [ k ] set : ( k , v ) -> @ c [ k ] = v items : -> ([ k , v ] for k , v of @ c ) class Counter extends DefaultDict constructor : -> @ c = {} @ default_value = -> 0 inc : ( k , count = 1 ) -> @ c [ k ] = @ get ( k ) + count

It will make it possible for us to write things such as

counter = new Counter () for word in words counter . inc word

Thus avoiding the clumsy check for presence of the key when we first encounter of a word.

A distribution

We will then have an object representing a simple distribution over an enum.

class Distribution constructor : ( weights ) -> # Takes a list of [value, weight] as # argument. See draw @ total = 0 @ values = [] @ boundaries = [] for [ value , weight ] in weights @ values . push value @ total += weight @ boundaries . push @ total draw : -> # Returns a random value # The bigger the weight associated to # the value, the greater the chance # of appearance. target = Math . random () * @ total value_id = binary_search @ boundaries , target @ values [ value_id ]

The language model

The language is the object in charge of holding the markov model and able to compute a random text. It works by drawing the first two tokens within the beginning two words of the text, and then iteratively adds words randomly until a period is reached. We reject sentences of less than 3 tokens or more than 12 tokens.

class LanguageModel # This object represents a # language model. It is able to # generate random sentences out of it. constructor : ( @ trigrams , @ starting_bigrams ) -> draw_tokens : -> # Generates a sequence of token # finishing by "." [ w1 , w2 ] = @ starting_bigrams . draw () tokens = [ w1 , w2 ] while w2 != "." [ w1 , w2 ] = [ w2 , @ trigrams . get ( w1 )[ w2 ]. draw () ] tokens . push w2 if 3 < tokens . length < 20 tokens else @ draw_tokens () generate_sentence : -> # Returns a random sentence tokens = @ draw_tokens () ( add_space ( token ) for token in tokens ). join ( "" ). trim () generate_text : ( min_length ) -> # Returns a text of a length of around # min_length. text = "" while text . length < min_length text += @ generate_sentence () + " " text

Building up the model

Finally we need to build the model by computing both the trigram distribution and the starting bigram distribution expected by the LanguageModel constructor.

count_triplets = ( tokens ) -> # Returns a counter for all the # triplets of tokens in the form of # a three level map : # token -> token -> token -> count triplet_counter = new DefaultDict ( -> new DefaultDict ( -> new Counter ())) for start in [ 0 ... tokens . length - 3 ] [ tok1 , tok2 , tok3 ] = tokens [ start ... start + 3 ] triplet_counter . get ( tok1 ). get ( tok2 ). inc ( tok3 ) triplet_counter learn = ( text ) -> # tokenize our text tokens = tokenize text + "." # creates a triplet counter triplet_counter = count_triplets tokens # computes the trigram distribution markov_model = new DefaultDict -> {} for w1 , doublet_counter of triplet_counter . c for w2 , word_counts of doublet_counter . c token_distribution = new Distribution word_counts . items () markov_model . get ( w1 )[ w2 ] = token_distribution # Given a triplet counter map, # computes a distribution of starting # bigram bigrams = [] for k1 , word_counter of triplet_counter . get ( "." ). c for k2 , word_count of word_counter . c bigrams . push [ [ k1 , k2 ], word_count ] bigram_distribution = new Distribution bigrams new LanguageModel markov_model , bigram_distribution

Voilà! Don’t hesitate to fork the project on github and have fun with it.