leonardo

Subject: Markov Chain text generator in D
Date: April 13th, 2011, 02:11 am
Tags: d language, programming, python

In the third chapter of the well-known book "The Practice of Programming" by Brian W. Kernighan and Rob Pike there is a comparison among implementations, in various languages, of a Markov chain text generator that works on whole words (instead of single chars):

http://en.wikipedia.org/wiki/Markov_chain#Markov_text_generators



The programs in the book:

http://cm.bell-labs.com/cm/cs/tpop/code.html



This is the Perl version:

http://cm.bell-labs.com/cm/cs/tpop/markov.pl



I have found a Python translation too, by Brian Chin, on the Activestate recipes:

http://code.activestate.com/recipes/194364-the-markov-chain-algorithm/



I have modified the Python code for Python 2.6, using a defaultdict(list) and generalizing to order N thanks to collections.deque, which takes the maximum deque length as an optional argument:

import sys, random, collections

ORDER = 2        # Markov order
MAX_WORDS = 1000 # Max. number of output words

# Since we split on whitespace, this can never be a word
NONWORD = "\n"

# Generate table
table = collections.defaultdict(list)
seen = collections.deque([NONWORD] * ORDER, ORDER)
for line in sys.stdin:
    for word in line.split():
        table[tuple(seen)].append(word)
        seen.append(word)
table[tuple(seen)].append(NONWORD) # Mark the end of the file

# Generate output
seen.extend([NONWORD] * ORDER) # clear it all
for i in xrange(MAX_WORDS):
    word = random.choice(table[tuple(seen)])
    if word == NONWORD:
        exit()
    print word,
    seen.append(word)

Then I have created a translation for version 2 of the D language:

import std.stdio, std.random, std.array, std.algorithm;

enum int ORDER = 2; // Markov order
static assert(ORDER >= 1);
enum int MAX_WORDS = 1_000; // Max. number of output words

void main() {
    // Since we split on whitespace, this can never be a word
    enum string NONWORD = "\n";

    // Generate table
    string[][string[ORDER]] table;
    string[ORDER] seen = NONWORD;
    foreach (string line; lines(stdin))
        foreach (word; line.splitter()) {
            table[seen] ~= word;
            moveAll(seen[1..$], seen[0..$-1]);
            seen[$-1] = word;
        }
    table[seen] ~= NONWORD; // Mark the end of the file

    // Generate output
    seen[] = NONWORD; // clear it all
    foreach (i; 0 .. MAX_WORDS) {
        auto many = table[seen];
        auto word = many[uniform(0, many.length)]; // choice
        if (word == NONWORD)
            return;
        write(word, " ");
        moveAll(seen[1..$], seen[0..$-1]);
        seen[$-1] = word;
    }
}



The D code is quite nice; it's not much worse than the Python version. In D there is no ready-made deque, so I've used a fixed-size array. Appending at the tail requires shifting most of its items (using moveAll(), not move()!), but this is no worse than the tuple(seen) allocation of the Python code. In the D random module there is no choice() function yet. In D I have used splitter(), which is lazy; this is more efficient than split() because it avoids the useless memory allocation of the array. In D, lines(stdin) allocates memory, and this allocation can't be avoided, because those strings later need to go into the hash; but splitter() just returns slices of the line, which don't allocate memory again (unlike Python, unless Psyco performs optimizations). Most of the running time of the D version is spent allocating in the associative array.
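To show in Python terms what a lazy splitter buys (a sketch of mine, not code from Phobos or from the post): a generator yields one word at a time, so the whole list of words is never built; note that in Python each yielded slice is still a new string, while D slices are just views into the original line:

# Sketch: a lazy whitespace splitter as a generator. Unlike
# line.split(), it never builds the whole list of words up front.
# (Each Python slice still copies characters; D slices don't.)
def lazy_split(line):
    start = None
    for i, c in enumerate(line):
        if c.isspace():
            if start is not None:
                yield line[start:i]
                start = None
        elif start is None:
            start = i
    if start is not None:
        yield line[start:]

print list(lazy_split("the quick  fox\n"))  # ['the', 'quick', 'fox']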



Running the programs with a cleaned-up version of the King James Bible from Project Gutenberg as input, the D version is about twice as fast as the Python one (the D version runs in about 1.3 seconds). If the Python code is moved inside a function and Psyco is used, the Python version runs a little faster than before. The D version also allocates considerably less memory than the Python version. Creating this final D version has required several tests, and some time.
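To reproduce the timing comparison, something like this little harness can be used (a sketch: "markov.py" and "./markov" are assumed names for the two programs; pg30b.txt is the input file mentioned below):

# Rough timing sketch. "markov.py" and "./markov" are hypothetical
# names for the two programs; pg30b.txt is the Project Gutenberg
# input file used in the post.
import subprocess, time

for cmd in (["python", "markov.py"], ["./markov"]):
    t0 = time.time()
    fin = open("pg30b.txt")
    fout = open("/dev/null", "w")  # discard the generated text
    subprocess.call(cmd, stdin=fin, stdout=fout)
    fin.close()
    fout.close()
    print cmd, "%.2f seconds" % (time.time() - t0)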



In D, reading the whole file at once and iterating on it with splitter() leads to a slower program; I don't know why:

foreach (word; (cast(string)read("pg30b.txt")).splitter()) {



See the code described here:

http://www.fantascienza.net/leonardo/js/index.html#markov_gen
