This article discusses a few simple applications of Markov chains to text processing problems. The problems appeared as assignments in a few courses, and the descriptions are taken directly from the courses themselves.

1. Markov Model of Natural Language

This problem appeared as an assignment in the Princeton course COS 126. The assignment was developed by Prof. Bob Sedgewick and Kevin Wayne, based on a classic idea of Claude Shannon.

Problem Statement

Use a Markov chain to create a statistical model of a piece of English text. Simulate the Markov chain to generate stylized pseudo-random text.

Perspective. In the 1948 landmark paper A Mathematical Theory of Communication, Claude Shannon founded the field of information theory and revolutionized the telecommunications industry, laying the groundwork for today’s Information Age. In this paper, Shannon proposed using a Markov chain to create a statistical model of the sequences of letters in a piece of English text. Markov chains are now widely used in speech recognition, handwriting recognition, information retrieval, data compression, and spam filtering. They also have many scientific computing applications including the GeneMark algorithm for gene prediction, the Metropolis algorithm for measuring thermodynamical properties, and Google’s PageRank algorithm for Web search. For this assignment, we consider a whimsical variant: generating stylized pseudo-random text.

Markov model of natural language. Shannon approximated the statistical structure of a piece of text using a simple mathematical model known as a Markov model. A Markov model of order 0 predicts that each letter in the alphabet occurs with a fixed probability. We can fit a Markov model of order 0 to a specific piece of text by counting the number of occurrences of each letter in that text, and using these frequencies as probabilities. For example, if the input text is "gagggagaggcgagaaa", the Markov model of order 0 predicts that each letter is 'a' with probability 7/17, 'c' with probability 1/17, and 'g' with probability 9/17, because these are the fractions of times each letter occurs. The following sequence of letters is a typical example generated from this model:

    g a g g c g a g a a g a g a a g a a a g a g a g a g a a a g a g a a g ...

A Markov model of order 0 assumes that each letter is chosen independently. This independence does not match the statistical properties of English text, because there is a high correlation among successive letters in a word or sentence. For example, 'w' is more likely to be followed by 'e' than by 'u', while 'q' is more likely to be followed by 'u' than by 'e'.

We obtain a more refined model by allowing the probability of choosing each successive letter to depend on the preceding letter or letters. A Markov model of order k predicts that each letter occurs with a fixed probability, but that probability can depend on the previous k consecutive letters. Let a k-gram mean any k consecutive letters. Then, for example, if the text has 100 occurrences of "th", with 60 occurrences of "the", 25 occurrences of "thi", 10 occurrences of "tha", and 5 occurrences of "tho", the Markov model of order 2 predicts that the next letter following the 2-gram "th" is 'e' with probability 3/5, 'i' with probability 1/4, 'a' with probability 1/10, and 'o' with probability 1/20.

A brute-force solution. Claude Shannon proposed a brute-force scheme to generate text according to a Markov model of order 1:

“To construct [a Markov model of order 1], for example, one opens a book at random and selects a letter at random on the page. This letter is recorded. The book is then opened to another page and one reads until this letter is encountered. The succeeding letter is then recorded. Turning to another page this second letter is searched for and the succeeding letter recorded, etc. It would be interesting if further approximations could be constructed, but the labor involved becomes enormous at the next stage.”

Our task is to write a Python program that automates this laborious procedure more efficiently, since Shannon’s brute-force approach is prohibitively slow when the input text is large.
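As a small illustration of fitting an order-0 model, the sketch below counts the letters of the example string and turns the counts into probabilities. The helper name order0_model is ours and not part of the assignment, and exact fractions are used only so that the result is easy to compare with the numbers quoted above.

```python
from collections import Counter
from fractions import Fraction

def order0_model(text):
    """Return {letter: probability} for an order-0 Markov model of text.

    order0_model is a hypothetical helper name, not taken from the assignment.
    """
    counts = Counter(text)            # occurrences of each letter
    total = len(text)
    return {letter: Fraction(n, total) for letter, n in counts.items()}

model = order0_model("gagggagaggcgagaaa")
for letter in sorted(model):
    print(letter, model[letter])      # a 7/17, c 1/17, g 9/17
```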



Markov model data type. Create an immutable data type MarkovModel to represent a Markov model of order k from a given text string. The data type must implement the following API:

Constructor. To implement the data type, create a symbol table whose keys are the String k-grams. You may assume that the input text is a sequence of characters over the ASCII alphabet, so that all char values are between 0 and 127. The value type of your symbol table needs to be a data structure that can represent the frequency of each possible next character. The frequencies should be tallied as if the text were circular (i.e., as if it repeated the first k characters at the end).

Order. Return the order k of the Markov model.

Frequency. There are two frequency methods. freq(kgram) returns the number of times the k-gram was found in the original text. freq(kgram, c) returns the number of times the k-gram was followed by the character c in the original text.
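The circular tallying behind the two frequency methods can be sketched on its own, as below. This is a minimal illustration under our own naming (circular_counts is a hypothetical helper, not the assignment's API); it returns plain counters rather than a full data type.

```python
from collections import Counter

def circular_counts(text, k):
    """Tally k-grams and (k-gram, next char) pairs, treating text as circular.

    circular_counts is a hypothetical helper name used for illustration.
    """
    wrapped = text + text[:k]          # repeat the first k characters at the end
    kgram_freq = Counter()             # plays the role of freq(kgram)
    next_freq = Counter()              # plays the role of freq(kgram, c)
    for i in range(len(text)):
        kgram = wrapped[i:i + k]
        kgram_freq[kgram] += 1
        next_freq[kgram, wrapped[i + k]] += 1
    return kgram_freq, next_freq

kgram_freq, next_freq = circular_counts("gagggagaggcgagaaa", 2)
print(kgram_freq["ag"])         # 5, because the last 'a' wraps around to the first 'g'
print(next_freq["ag", "a"])     # 3
print(next_freq["ag", "g"])     # 2
```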

Randomly generate a character. Return a character. It must be a character that followed the k-gram in the original text. The character should be chosen randomly, but the results of calling rand(kgram) several times should mirror the frequencies of characters that followed the k-gram in the original text.
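One way to draw a character so that repeated draws mirror the observed frequencies is weighted sampling. The sketch below uses the standard library's random.choices for this; the function name rand_next and its arguments are assumptions made for illustration, and the counts are those of the 2-gram "ga" in the example string "gagggagaggcgagaaa" when it is treated as circular.

```python
import random

def rand_next(next_counts, rng=random):
    """Pick a character with probability proportional to its count.

    next_counts maps each character that followed a k-gram to how often it did.
    rand_next and its parameters are hypothetical names used for illustration.
    """
    chars = list(next_counts)
    weights = [next_counts[c] for c in chars]
    return rng.choices(chars, weights=weights, k=1)[0]

# "ga" is followed by 'a' once and by 'g' four times in the example string.
rng = random.Random(0)                       # seeded only to make the run repeatable
samples = [rand_next({"a": 1, "g": 4}, rng) for _ in range(10000)]
print(samples.count("g") / len(samples))     # close to 4/5
```

Characters with a zero count never appear in next_counts, so the function can only return a character that actually followed the k-gram.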

Generate pseudo-random text. Return a String of length T that is a randomly generated stream of characters whose first k characters are the argument kgram. Starting with the argument kgram, repeatedly call rand() to generate the next character. Successive k-grams should be formed by using the most recent k characters in the newly generated text. Use a StringBuilder object to build the stream of characters (otherwise, as we saw when discussing performance, your code will take order of N^2 time to generate N characters, which is too slow).

To avoid dead ends, treat the input text as a circular string: the last character is considered to precede the first character. For example, if k = 2 and the text is the 17-character string "gagggagaggcgagaaa", then the salient features of the Markov model are captured in the table below:

                    frequency of        probability that
                     next char            next char is
    kgram   freq    a    c    g         a     c     g
    ----------------------------------------------------
     aa      2      1    0    1        1/2    0    1/2
     ag      5      3    0    2        3/5    0    2/5
     cg      1      1    0    0         1     0     0
     ga      5      1    0    4        1/5    0    4/5
     gc      1      0    0    1         0     0     1
     gg      3      1    1    1        1/3   1/3   1/3
    ----------------------------------------------------
            17      7    1    9

Note that the frequency of "ag" is 5 (and not 4) because we are treating the string as circular.

A Markov chain is a stochastic process where the state change depends on only the current state. For text generation, the current state is a k-gram. The next character is selected at random, using the probabilities from the Markov model. For example, if the current state is "ga" in the Markov model of order 2 discussed above, then the next character is 'a' with probability 1/5 and 'g' with probability 4/5. The next state in the Markov chain is obtained by appending the new character to the end of the k-gram and discarding the first character. A trajectory through the Markov chain is a sequence of such states. Below is a possible trajectory consisting of 9 transitions.

    trajectory:         ga --> ag --> gg --> gc --> cg --> ga --> ag --> ga --> aa --> ag
    probability for a:     1/5    3/5    1/3     0      1     1/5    3/5    1/5    1/2
    probability for c:      0      0     1/3     0      0      0      0      0      0
    probability for g:     4/5    2/5    1/3     1      0     4/5    2/5    4/5    1/2

Treating the input text as a circular string ensures that the Markov chain never gets stuck in a state with no next characters.

To generate random text from a Markov model of order k, set the initial state to k characters from the input text. Then, simulate a trajectory through the Markov chain by performing T − k transitions, appending the random character selected at each step. For example, if k = 2 and T = 11, the following is a possible trajectory leading to the output gaggcgagaag.

    trajectory:  ga --> ag --> gg --> gc --> cg --> ga --> ag --> ga --> aa --> ag
    output:      ga      g      g      c      g      a      g      a      a      g

Text generation client. Implement a client program TextGenerator that takes two command-line integers k and T, reads the input text from standard input and builds a Markov model of order k from the input text; then, starting with the k-gram consisting of the first k letters of the input text, prints out T characters generated by simulating a trajectory through the corresponding Markov chain. We may assume that the text has length at least k, and also that T ≥ k.

A Python Implementation

    import numpy as np
    from collections import defaultdict

    class MarkovModel:

        def __init__(self, text, k):
            '''
            create a Markov model of order k from given text
            Assume that text has length at least k.
            '''
            self.k = k
            self.tran = defaultdict(float)     # (kgram, next char) -> count
            self.alph = sorted(set(text))      # distinct characters of the text
            self.kgrams = defaultdict(int)     # kgram -> count
            n = len(text)
            text += text[:k]                   # treat the text as circular
            for i in range(n):
                self.tran[text[i:i+k], text[i+k]] += 1.
                self.kgrams[text[i:i+k]] += 1

        def order(self):                       # order k of Markov model
            return self.k

        def freq(self, kgram):                 # number of occurrences of kgram in text
            assert len(kgram) == self.k        # (check if kgram is of length k)
            return self.kgrams[kgram]

        def freq2(self, kgram, c):             # number of times that character c follows kgram
            assert len(kgram) == self.k        # (named freq2 because Python has no overloading)
            return self.tran[kgram, c]

        def rand(self, kgram):                 # random character following given kgram
            assert len(kgram) == self.k        # (check if kgram is of length k)
            weights = np.array([self.tran.get((kgram, c), 0.) for c in self.alph])
            assert weights.sum() > 0           # kgram must occur in the text
            return str(np.random.choice(self.alph, p=weights / weights.sum()))

        def gen(self, kgram, T):               # generate a string of length T by simulating a
            assert len(kgram) == self.k        # trajectory through the Markov chain; the first
            assert T >= self.k                 # k characters are the argument kgram
            chars = list(kgram)                # a list plays the role of StringBuilder
            for _ in range(T - self.k):
                c = self.rand(kgram)
                kgram = (kgram + c)[1:]        # keep the most recent k characters
                chars.append(c)
            return ''.join(chars)

Some Results

m = MarkovModel('gagggagaggcgagaaa', 2) generates a Markov chain in which each state represents a 2-gram.

Input: news item (taken from the assignment)

Microsoft said Tuesday the company would comply with a preliminary ruling by Federal District Court Judge Ronald H. Whyte that Microsoft is no longer able to use the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software Developers Kit for Java.

“We remain confident that once all the facts are presented in the larger case, the court will find Microsoft to be in full compliance with its contract with Sun,” stated Tom Burt, Associate General Counsel for Microsoft Corporation. “We are disappointed with this decision, but we will immediately comply with the Court’s order.”

Microsoft has been in the forefront of helping developers use the Java programming language to write cutting-edge applications. The company has committed significant resources so that Java developers have the option of taking advantage of Windows features when writing software using the Java language. Providing the best tools and programming options will continue to be Microsoft’s goal.

“We will continue to listen to our customers and provide them the tools they need to write great software using the Java language,” added Tod Nielsen, General Manager for Microsoft’s Developer Relations Group/Platform Marketing.

Markov Model learnt

Generated output: random news item, using input as an order 7 model

Microsoft is no longer able to use the Java language,” added Tod Nielsen, General Counsel for Microsoft’s Developers use the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software Developer Relations Group/Platform Marketing.Microsoft to be Microsoft Corporation. “We are disappointed with the Court’s order.”

Microsoft is no longer able to use the Java language. Providing the best tools and provide them the tools and programming options will continue to listen to our customers and provide them the tools and programming option of taking advantage of Windows features when writing software using the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software using the Java programming option of taking advantage of Windows features when writing software Developer Relations Group/Platform Marketing.Microsoft to be in full compliance with its contract with Sun,” stated Tom Burt, Associate General Manager for Microsoft’s goal.

“We will continue to listen to our customers and programming language. Providing the Java language,” added Tod Nielsen, General Manager for Microsoft is no longer able to use the Java language.

Noisy Text Correction

Imagine we receive a message where some of the characters have been corrupted by noise. We represent unknown characters by the ~ symbol (we assume we don’t use ~ in our messages). Add a method replaceUnknown that decodes a noisy message by replacing each ~ with the most likely character given our order k Markov model, and conditional on the surrounding text:

def replaceUnknown(corrupted) # replace unknown characters with most probable characters

Assume unknown characters are at least k characters apart and also appear at least k characters away from the start and end of the message. This maximum-likelihood approach doesn’t always get it perfect, but it fixes most of the missing characters correctly.

Here are some details on what it means to find the most likely replacement for each ~. For each unknown character, you should consider all possible replacement characters. We want the replacement character that makes sense not only at the unknown position (given the previous characters) but also when the replacement is used in the context of the k subsequent known characters. For example, we expect the unknown character in "was ~he wo" to be 't' and not simply the most likely character in the context of "was ". You can compute the probability of each hypothesis by multiplying the probabilities of generating each of k+1 characters in sequence: the missing one, and the k next ones. In other words, the k+1 windows that involve the unknown character are scored together, and the replacement that maximizes their log-likelihood is chosen.
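The procedure can be sketched as follows. This is a minimal illustration and not necessarily the implementation behind the results in this article: the names build_counts and replace_unknown, the choice to pass the training text as an argument, and the decision to leave a ~ in place when no candidate has non-zero likelihood are all assumptions. For the demonstration the model is trained on the clean sentence itself, which is only a convenience.

```python
import math
from collections import Counter

def build_counts(text, k):
    """Circular counts of k-grams and of (k-gram, next char) pairs."""
    wrapped = text + text[:k]
    kgrams, trans = Counter(), Counter()
    for i in range(len(text)):
        kgrams[wrapped[i:i + k]] += 1
        trans[wrapped[i:i + k], wrapped[i + k]] += 1
    return kgrams, trans

def replace_unknown(corrupted, text, k):
    """Replace each '~' by the character that maximizes the log-likelihood of
    the k+1 characters starting at the unknown position.

    Assumes every '~' is at least k characters from both ends of the message
    and from every other '~'. Hypothetical helper, written for illustration.
    """
    kgrams, trans = build_counts(text, k)
    alphabet = sorted(set(text))
    chars = list(corrupted)

    def log_likelihood(pos):
        total = 0.0
        for j in range(pos, pos + k + 1):          # the unknown and the k next chars
            context = "".join(chars[j - k:j])
            seen = kgrams.get(context, 0)
            hits = trans.get((context, chars[j]), 0)
            if seen == 0 or hits == 0:
                return -math.inf                   # impossible under the model
            total += math.log(hits / seen)
        return total

    for pos, ch in enumerate(corrupted):
        if ch != "~":
            continue
        best_char, best_ll = "~", -math.inf        # keep '~' if nothing is plausible
        for candidate in alphabet:
            chars[pos] = candidate
            ll = log_likelihood(pos)
            if ll > best_ll:
                best_char, best_ll = candidate, ll
        chars[pos] = best_char
    return "".join(chars)

original = "it was the best of times, it was the worst of times."
noisy = "it w~s th~ bes~ of tim~s, i~ was ~he wo~st of~times."
print(replace_unknown(noisy, original, 4))
# it was the best of times, it was the worst of times.
```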

Using the algorithm described above, here are the results obtained for the following example:

Original : it was the best of times, it was the worst of times.

Noisy : it w~s th~ bes~ of tim~s, i~ was ~he wo~st of~times.

Corrected (k=4): it was the best of times, it was the worst of times.

Corrected (k=2): it was the best of times, in was the wo st of times.

2. Detecting authorship

This problem appeared as an assignment in the Cornell course cs1114.

The Problem Statement

In this assignment, we shall be implementing an authorship detector which, when given a large sample of text to train on, can then guess the author of an unknown text.

The algorithm to be implemented works based on the following idea: An author’s writing style can be defined quantitatively by looking at the words he uses. Specifically, we want to keep track of his word flow, that is, which words he tends to use after other words.

To make things significantly simpler, we’re going to assume that the author always follows a given word with the same distribution of words. Of course, this isn’t true, since the words you choose when writing obviously depend on context. Nevertheless, this simplifying assumption should hold over an extended amount of text, where context becomes less relevant.

In order to implement this model of an author’s writing style, we will use a Markov chain. A Markov chain is a set of states with the Markov property, that is, the probability of moving to the next state depends only on the current state and not on the states that came before it. This behavior matches our assumption that the next word depends only on the current word.

A Markov chain can be represented as a directed graph. Each node is a state (words, in our case), and a directed edge going from state Si to Sj represents the probability we will go to Sj when we’re at Si. We will implement this directed graph as a transition matrix. Given a set of words W1, W2, …, Wn, we can construct an n by n transition matrix A, where an edge from Wi to Wj of weight p means Aij = p.

The edges, in this case, represent the probability that word j follows word i from the given author. This means, of course, that the sum of the weights of all edges leaving from each word must add up to 1.

We can construct this graph from a large sample text corpus. Our next step would be finding the author of an unknown, short chunk of text. To do this, we simply compute the probability of this unknown text occurring, using the words in that order, in each of our Markov chains. The author would likely be the one with the highest probability.

We shall implement the Markov chain model of writing style. We are given some sample texts to train our model on, as well as some challenges to figure out.

Constructing the transition matrix

Our first step is to construct the transition matrix representing our Markov chain.

First, we must read the text from a sample file. We shall create a sparse array using the scipy CSR sparse matrix. Along with the transition matrix, we shall create a corresponding vector that contains the word frequencies, normalized by the total number of words in the document (including repeated words).
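The construction can be sketched in a few lines. To stay free of third-party dependencies, this sketch stores the sparse rows in nested dictionaries instead of the scipy CSR matrix mentioned above, so it illustrates the idea rather than the storage format; the function name build_transition_model and the toy sentence are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def build_transition_model(words):
    """Return (transitions, histogram) for a list of word tokens.

    transitions[w1][w2] estimates the probability that w2 follows w1.
    histogram[w] is the count of w divided by the total number of words.
    Hypothetical helper; nested dicts stand in for a sparse matrix.
    """
    pair_counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        pair_counts[prev][nxt] += 1
    transitions = {
        prev: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for prev, followers in pair_counts.items()
    }
    total = len(words)
    histogram = {w: n / total for w, n in Counter(words).items()}
    return transitions, histogram

words = "the cat saw the dog and the cat ran".split()
transitions, histogram = build_transition_model(words)
print(transitions["the"])     # 'cat' with probability 2/3, 'dog' with probability 1/3
print(histogram["the"])       # 3 of the 9 words
```

Each row of transitions sums to 1, which is the constraint on outgoing edge weights stated earlier.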

Calculating likelihood

Once we have our transition matrix, we can calculate the likelihood of an unknown sample of text. We are given several pieces of literature by various authors, as well as excerpts from each of the authors as a test dataset. Our goal is to identify the author of each excerpt.

To do so, we shall need to calculate the likelihood of the excerpt occurring in each author’s transition matrix. Recall that each edge in the directed graph that the transition matrix represents is the probability that the author follows a word with another.

Since we shall be multiplying numerous possibly small probabilities together, our calculated likelihood will likely be extremely small. Thus, we should compare log(likelihood) instead. Keep in mind the possibility that the test text contains a word that the author never used in the training text. Our calculated likelihood should not eliminate an author completely because of this. We shall be imposing a high penalty if a word is missing.
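A quick experiment shows why the comparison is done on logarithms. The probabilities below are made up for illustration: a thousand transitions of probability 0.01 each.

```python
import math

probs = [0.01] * 1000            # 1000 hypothetical word-transition probabilities

likelihood = 1.0
for p in probs:
    likelihood *= p              # underflows to 0.0 long before the loop ends

log_likelihood = sum(math.log(p) for p in probs)

print(likelihood)                # 0.0
print(log_likelihood)            # about -4605.17
```

The product is indistinguishable from zero in floating point, so two authors could not be compared, whereas the sum of logs stays well within range.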

Finding the author with the maximum likelihood

Now that we can compute likelihoods, the next step is to write a routine that takes a set of transition matrices and dictionaries, and a sequence of text, and returns the index of the transition matrix that results in the highest likelihood. We will write this in a function classify_text, which takes transition matrices, dictionaries, histograms, and the name of the file containing the test text, and returns a single integer best_index. The author k (A_k) of the test text t_1..t_n is detected by maximum likelihood estimation (MLE) using the transition matrix P_k.
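Written out, the decision rule takes the form below. The symbols \hat{k}, \ell_k and \pi_k (the normalized word-frequency vector of author k) are our own notation, introduced only for this sketch; P_k and t_1..t_n are as in the text, and the penalty for unseen words is left out for clarity.

```latex
\ell_k(t_1,\dots,t_n) = \log \pi_k(t_1) + \sum_{i=2}^{n} \log P_k(t_{i-1}, t_i),
\qquad
\hat{k} = \arg\max_{k} \; \ell_k(t_1,\dots,t_n)
```

The first term accounts for the opening word, which has no predecessor, and each later term is the log-probability of one word-to-word transition.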

Python Implementation

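    from numpy import log
    from nltk.tokenize import word_tokenize   # assumed source of word_tokenize (import not shown in the original)

    PENALTY = -1000.0   # assumed placeholder: log-likelihood penalty per unseen word (value not given in the original)

    def log0(x):
        # log that maps non-positive input to 0, so an unseen transition adds nothing
        return 0 if x <= 0 else log(x)

    def compute_text_likelihood(filename, T, dict_rev, histogram, index):
        '''
        Compute the (log) likelihood L of a given string (in 'filename') given
        a word transition probability T, dictionary 'dict_rev', and histogram
        'histogram'
        '''
        with open(filename) as f:
            text = word_tokenize(f.read().replace('\n', ' ').lower())
        num_words = len(text)
        text = [word for word in text if word in histogram]  # keep only the words that are found in the training dataset
        num_matches = len(text)
        # first word: its normalized frequency in the training text
        ll = log0(histogram[text[0]]) - log0(sum(histogram.values()))
        # remaining words: one transition probability each
        for i in range(1, len(text)):
            ll += log0(T[dict_rev[text[i-1]], dict_rev[text[i]]])
        return ll + (num_words - num_matches) * PENALTY

    def classify_text(tmatrices, dict_revs, histograms, filename):
        '''
        Return the index of the most likely author given the transition matrices,
        dictionaries, histograms, and a test file
        '''
        best_index, best_ll = -1, float('-inf')
        for i in range(len(tmatrices)):
            ll = compute_text_likelihood(filename, tmatrices[i], dict_revs[i], histograms[i], i)
            print(i, ll)
            if ll > best_ll:
                best_index, best_ll = i, ll
        return best_index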

Training Dataset

The training dataset contains writings by the following authors:

0. Austen

1. Carroll

2. Hamilton

3. Jay

4. Madison

5. Shakespeare

6. Thoreau

7. Twain

A few lines of excerpts from the training files, the word clouds and a few states from the corresponding Markov Models Constructed for a few authors

Author JANE AUSTEN (Texts taken from Emma, Mansfield Park, Pride and Prejudice)

Emma Woodhouse, handsome, clever, and rich, with a comfortable home and

happy disposition, seemed to unite some of the best blessings of

existence; and had lived nearly twenty-one years in the world with very

little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,

indulgent father; and had, in consequence of her sister’s marriage,

been mistress of his house from a very early period. Her mother had

died too long ago for her to have more than an indistinct remembrance

of her caresses; and her place had been supplied by an excellent woman

as governess, who had fallen little short of a mother in affection.

Sixteen years had Miss Taylor been in Mr. Woodhouse’s family, less as a

governess than a friend, very fond of both daughters, but particularly

of Emma. Between _them_ it was more the intimacy of sisters. Even

before Miss Taylor had ceased to hold the nominal office of governess,

the mildness of her temper had hardly allowed her to impose any

restraint; and the shadow of authority being now long passed away, they

had been living together as friend and friend very mutually attached,

and Emma doing just what she liked; highly esteeming Miss Taylor’s

judgment, but directed chiefly by her own.

The real evils, indeed, of Emma’s situation were the power of having

rather too much her own way, and a disposition to think a little too

well of herself; these were the disadvantages which threatened alloy to

her many enjoyments. The danger, however, was at present so

unperceived, that they did not by any means rank as misfortunes with

her.

Author Lewis Carroll (Texts taken from Alice’s Adventures in Wonderland, Sylvie and Bruno)



Alice was beginning to get very tired of sitting by her sister on the

bank, and of having nothing to do: once or twice she had peeped into the

book her sister was reading, but it had no pictures or conversations in

it, ‘and what is the use of a book,’ thought Alice ‘without pictures or

conversation?’

So she was considering in her own mind (as well as she could, for the

hot day made her feel very sleepy and stupid), whether the pleasure

of making a daisy-chain would be worth the trouble of getting up and

picking the daisies, when suddenly a White Rabbit with pink eyes ran

close by her.

There was nothing so VERY remarkable in that; nor did Alice think it so

VERY much out of the way to hear the Rabbit say to itself, ‘Oh dear!

Oh dear! I shall be late!’ (when she thought it over afterwards, it

occurred to her that she ought to have wondered at this, but at the time

it all seemed quite natural); but when the Rabbit actually TOOK A WATCH

OUT OF ITS WAISTCOAT-POCKET, and looked at it, and then hurried on,

Alice started to her feet, for it flashed across her mind that she had

never before seen a rabbit with either a waistcoat-pocket, or a watch

to take out of it, and burning with curiosity, she ran across the field

after it, and fortunately was just in time to see it pop down a large

rabbit-hole under the hedge.

In another moment down went Alice after it, never once considering how

in the world she was to get out again.

The rabbit-hole went straight on like a tunnel for some way, and then

dipped suddenly down, so suddenly that Alice had not a moment to think

about stopping herself before she found herself falling down a very deep

well.

Author William Shakespeare (Texts taken from Henry IV Part 1, Romeo and Juliet, Twelfth Night)

KING.

So shaken as we are, so wan with care,

Find we a time for frighted peace to pant,

And breathe short-winded accents of new broils

To be commenced in strands afar remote.

No more the thirsty entrance of this soil

Shall daub her lips with her own children’s blood;

No more shall trenching war channel her fields,

Nor bruise her flowerets with the armed hoofs

Of hostile paces: those opposed eyes,

Which, like the meteors of a troubled heaven,

All of one nature, of one substance bred,

Did lately meet in the intestine shock

And furious close of civil butchery,

Shall now, in mutual well-beseeming ranks,

March all one way, and be no more opposed

Against acquaintance, kindred, and allies:

The edge of war, like an ill-sheathed knife,

No more shall cut his master. Therefore, friends,

As far as to the sepulchre of Christ–

Whose soldier now, under whose blessed cross

We are impressed and engaged to fight–

Forthwith a power of English shall we levy,

To chase these pagans in those holy fields

Over whose acres walk’d those blessed feet

Which fourteen hundred years ago were nail’d

For our advantage on the bitter cross.

But this our purpose now is twelvemonth old,

And bootless ’tis to tell you we will go:

Therefore we meet not now.–Then let me hear

Of you, my gentle cousin Westmoreland,

What yesternight our Council did decree

In forwarding this dear expedience.

WEST.

My liege, this haste was hot in question,

And many limits of the charge set down

But yesternight; when, all athwart, there came

A post from Wales loaden with heavy news;

Whose worst was, that the noble Mortimer,

Leading the men of Herefordshire to fight

Against th’ irregular and wild Glendower,

Was by the rude hands of that Welshman taken;

A thousand of his people butchered,

Upon whose dead corpse’ there was such misuse,

Such beastly, shameless transformation,

By those Welshwomen done, as may not be

Without much shame re-told or spoken of.

Author MARK TWAIN (Texts taken from A Connecticut Yankee in King Arthur’s Court, The Adventures of Huckleberry Finn, The Prince and the Pauper)

I am an American. I was born and reared in Hartford, in the State

of Connecticut–anyway, just over the river, in the country. So

I am a Yankee of the Yankees–and practical; yes, and nearly

barren of sentiment, I suppose–or poetry, in other words. My

father was a blacksmith, my uncle was a horse doctor, and I was

both, along at first. Then I went over to the great arms factory

and learned my real trade; learned all there was to it; learned

to make everything: guns, revolvers, cannon, boilers, engines, all

sorts of labor-saving machinery. Why, I could make anything

a body wanted–anything in the world, it didn’t make any difference

what; and if there wasn’t any quick new-fangled way to make a thing,

I could invent one–and do it as easy as rolling off a log. I became

head superintendent; had a couple of thousand men under me.

Well, a man like that is a man that is full of fight–that goes

without saying. With a couple of thousand rough men under one,

one has plenty of that sort of amusement. I had, anyway. At last

I met my match, and I got my dose. It was during a misunderstanding

conducted with crowbars with a fellow we used to call Hercules.

He laid me out with a crusher alongside the head that made everything

crack, and seemed to spring every joint in my skull and made it

overlap its neighbor. Then the world went out in darkness, and

I didn’t feel anything more, and didn’t know anything at all

–at least for a while.

When I came to again, I was sitting under an oak tree, on the

grass, with a whole beautiful and broad country landscape all

to myself–nearly. Not entirely; for there was a fellow on a horse,

looking down at me–a fellow fresh out of a picture-book. He was

in old-time iron armor from head to heel, with a helmet on his

head the shape of a nail-keg with slits in it; and he had a shield,

and a sword, and a prodigious spear; and his horse had armor on,

too, and a steel horn projecting from his forehead, and gorgeous

red and green silk trappings that hung down all around him like

a bedquilt, nearly to the ground.

“Fair sir, will ye just?” said this fellow.

“Will I which?”

“Will ye try a passage of arms for land or lady or for–”

“What are you giving me?” I said. “Get along back to your circus,

or I’ll report you.”

Now what does this man do but fall back a couple of hundred yards

and then come rushing at me as hard as he could tear, with his

nail-keg bent down nearly to his horse’s neck and his long spear

pointed straight ahead. I saw he meant business, so I was up

the tree when he arrived.

Classifying unknown texts from the Test Dataset

The Markov model learnt from each author’s training texts is used to compute the log-likelihood of the unknown test text, and the author with the maximum log-likelihood is chosen as the likely author of the text.

Unknown Text1

Against the interest of her own individual comfort, Mrs. Dashwood had

determined that it would be better for Marianne to be any where, at

that time, than at Barton, where every thing within her view would be

bringing back the past in the strongest and most afflicting manner, by

constantly placing Willoughby before her, such as she had always seen

him there. She recommended it to her daughters, therefore, by all

means not to shorten their visit to Mrs. Jennings; the length of which,

though never exactly fixed, had been expected by all to comprise at

least five or six weeks. A variety of occupations, of objects, and of

company, which could not be procured at Barton, would be inevitable

there, and might yet, she hoped, cheat Marianne, at times, into some

interest beyond herself, and even into some amusement, much as the

ideas of both might now be spurned by her.

From all danger of seeing Willoughby again, her mother considered her

to be at least equally safe in town as in the country, since his

acquaintance must now be dropped by all who called themselves her

friends. Design could never bring them in each other’s way: negligence

could never leave them exposed to a surprise; and chance had less in

its favour in the crowd of London than even in the retirement of

Barton, where it might force him before her while paying that visit at

Allenham on his marriage, which Mrs. Dashwood, from foreseeing at first

as a probable event, had brought herself to expect as a certain one.

She had yet another reason for wishing her children to remain where

they were; a letter from her son-in-law had told her that he and his

wife were to be in town before the middle of February, and she judged

it right that they should sometimes see their brother.

Marianne had promised to be guided by her mother’s opinion, and she

submitted to it therefore without opposition, though it proved

perfectly different from what she wished and expected, though she felt

it to be entirely wrong, formed on mistaken grounds, and that by

requiring her longer continuance in London it deprived her of the only

possible alleviation of her wretchedness, the personal sympathy of her

mother, and doomed her to such society and such scenes as must prevent

her ever knowing a moment’s rest.

But it was a matter of great consolation to her, that what brought evil

to herself would bring good to her sister; and Elinor, on the other

hand, suspecting that it would not be in her power to avoid Edward

entirely, comforted herself by thinking, that though their longer stay

would therefore militate against her own happiness, it would be better

for Marianne than an immediate return into Devonshire.

Her carefulness in guarding her sister from ever hearing Willoughby’s

name mentioned, was not thrown away. Marianne, though without knowing

it herself, reaped all its advantage; for neither Mrs. Jennings, nor

Sir John, nor even Mrs. Palmer herself, ever spoke of him before her.

Elinor wished that the same forbearance could have extended towards

herself, but that was impossible, and she was obliged to listen day

after day to the indignation of them all.

Log-likelihood values computed for the probable authors

Author  Log-likelihood
0       -3126.5812874
1       -4127.9155186
2       -7364.15782346
3       -9381.06336055
4       -7493.78440066
5       -4837.98005673
6       -3515.44028659
7       -3455.85716104

As the values above show, the maximum log-likelihood (-3126.58) corresponds to author 0, i.e., Austen. Hence, the most probable author of the unknown text is Austen.
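The selection step is a plain argmax over the reported values. The short sketch below uses the log-likelihoods listed above for Unknown Text1. The variable names are illustrative, and the margin over the runner-up is included only as a rough indication of how decisive the choice is.

```python
# Log-likelihoods reported above for Unknown Text1, keyed by author index.
log_likelihoods = {
    0: -3126.5812874,
    1: -4127.9155186,
    2: -7364.15782346,
    3: -9381.06336055,
    4: -7493.78440066,
    5: -4837.98005673,
    6: -3515.44028659,
    7: -3455.85716104,
}

# Rank authors from most to least likely.
ranked = sorted(log_likelihoods, key=log_likelihoods.get, reverse=True)
best_author, runner_up = ranked[0], ranked[1]

# The difference of two log-likelihoods is the log of their likelihood ratio.
margin = log_likelihoods[best_author] - log_likelihoods[runner_up]
print(best_author, runner_up, margin)
```

All the values are negative because each is a sum of logarithms of probabilities, so the "maximum" is the value closest to zero.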

Unknown Text2

Then he tossed the marble away pettishly, and stood cogitating. The

truth was, that a superstition of his had failed, here, which he and

all his comrades had always looked upon as infallible. If you buried a

marble with certain necessary incantations, and left it alone a

fortnight, and then opened the place with the incantation he had just

used, you would find that all the marbles you had ever lost had

gathered themselves together there, meantime, no matter how widely they

had been separated. But now, this thing had actually and unquestionably

failed. Tom’s whole structure of faith was shaken to its foundations.

He had many a time heard of this thing succeeding but never of its

failing before. It did not occur to him that he had tried it several

times before, himself, but could never find the hiding-places

afterward. He puzzled over the matter some time, and finally decided

that some witch had interfered and broken the charm. He thought he

would satisfy himself on that point; so he searched around till he

found a small sandy spot with a little funnel-shaped depression in it.

He laid himself down and put his mouth close to this depression and

called–

“Doodle-bug, doodle-bug, tell me what I want to know! Doodle-bug,

doodle-bug, tell me what I want to know!”

The sand began to work, and presently a small black bug appeared for a

second and then darted under again in a fright.

“He dasn’t tell! So it WAS a witch that done it. I just knowed it.”

He well knew the futility of trying to contend against witches, so he

gave up discouraged. But it occurred to him that he might as well have

the marble he had just thrown away, and therefore he went and made a

patient search for it. But he could not find it. Now he went back to

his treasure-house and carefully placed himself just as he had been

standing when he tossed the marble away; then he took another marble

from his pocket and tossed it in the same way, saying:

“Brother, go find your brother!”

He watched where it stopped, and went there and looked. But it must

have fallen short or gone too far; so he tried twice more. The last

repetition was successful. The two marbles lay within a foot of each

other.

Log-likelihood values computed for the probable authors

Author  Log-likelihood
0       -2779.02810424
1       -2738.09304225
2       -5978.83684489
3       -6551.16571407
4       -5780.39620942
5       -4166.34886511
6       -2309.25043697
7       -2033.00112729

As the values above show, the maximum log-likelihood (-2033.00) corresponds to author 7, i.e., Twain. Hence, the most probable author of the unknown text is Twain. The following figure shows the relevant states of the Markov model for Twain, learnt from the training dataset.

