Here's the bug: the smaller the amount of input text, the better the sentences it generates. (Above a certain threshold, anyway -- go too low and you just get the input text back.) I think I understand why. My guess is that as the body of input text grows, the probabilities even out toward general English-language distributions, and the end result starts behaving more like picking words at random in a vacuum. You don't tend to get runs of 4, 8, or 10 words that all make a kind of sense; you get 2, then another 2. The distance over which any two words influence each other gets smaller, and the interesting probabilities get obscured.

I don't think treating every pair of words as one ``word'' for statistical purposes will work very well; that would be too clumpy. What I want is some kind of commutative probability -- X is n% likely after A+B, but only m% likely after A+C, and A+D+E+F+G contributes a correspondingly smaller amount of influence. I'd like to do this without adding another dimension to my graph, because it's already pretty huge; another order of magnitude just won't do. But it seems that human language isn't a system that can be modeled by a Markov chain of order 1.
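For concreteness, here's roughly the shape of that order-1 chain -- a minimal sketch, not my actual code, and the input filename is just a placeholder:

    import random
    from collections import defaultdict

    def build_chain(words):
        """Map each word to every word that follows it in the input.
        Keeping repeats makes sampling proportional to frequency."""
        chain = defaultdict(list)
        for a, b in zip(words, words[1:]):
            chain[a].append(b)
        return chain

    def generate(chain, start, length=20):
        """Walk the chain; each step depends only on the previous word."""
        out = [start]
        for _ in range(length):
            followers = chain.get(out[-1])
            if not followers:
                break
            out.append(random.choice(followers))
        return " ".join(out)

    words = open("input.txt").read().split()   # placeholder input file
    print(generate(build_chain(words), random.choice(words)))

With a small input, most words have only one or two recorded followers, so whole phrases survive intact; with a large input, every common word has hundreds of followers, and the walk degenerates into the 2-and-2 word salad described above.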

Dissociated Press works better because it only ever operates on small inputs, and always shuffles large-ish chunks. Burroughs' cut-ups work better because they work on large-ish chunks, and spatial relations come into play -- even if you shuffle all the segments of all the pages, some words on the same segment stay similar distances apart, even once new words have been interspersed. But I really like the idea of breaking the original text down into probabilities and then generating from those, rather than taking the original text and shuffling it. The shuffling approach feels like it preserves too much of the original content, when all I want to preserve is the original grammar. Maybe that's not possible (or practical). I don't want to keep huge lists of nouns/verbs; I don't want to encode knowledge of the language into it. That way lies intellectually corrupt AI projects which I'll not taunt by naming here.

One possibility would be to keep only the most popular 3-way and higher combinations around: for the more common pairs, I could hold a pointer to a sub-table instead of a flat probability (see the first sketch below). Their popularity could be found using a quadtree-style subdivision of the word space (to find the words that ``clump'' together.) The problem with this is that it doesn't work in a streaming fashion; you basically have to have the whole N-way graph around before you can throw away the less-popular combinations, because you won't know which ones are popular until the end.

Another good compression trick would be to quantize the values (second sketch below). Though the maximal numerator or denominator needed to express the probabilities exactly might be a 16-bit number (or larger), we could probably make do with 8 bits (or fewer) of resolution: an 8-bit lookup table of approximate probabilities. I suspect the values in this table would end up on a logarithmic scale (since that's how nature works.)
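To make the pruning concrete, here's a rough two-pass sketch. It substitutes a plain frequency count for the quadtree-style subdivision, and it has exactly the streaming problem described above: everything gets counted before anything gets thrown away. The cutoff of 1000 pairs is an arbitrary guess.

    import random
    from collections import Counter, defaultdict

    def build_pruned_chain(words, keep=1000):
        """Flat order-1 table everywhere, plus order-2 sub-tables for
        only the `keep` most frequent word pairs.  Needs a full pass
        of counting first -- which pairs are popular isn't known
        until the end."""
        pair_counts = Counter(zip(words, words[1:]))
        popular = {pair for pair, _ in pair_counts.most_common(keep)}

        order1 = defaultdict(list)   # word     -> followers
        order2 = defaultdict(list)   # (w1, w2) -> followers, popular pairs only
        for a, b in zip(words, words[1:]):
            order1[a].append(b)
        for a, b, c in zip(words, words[1:], words[2:]):
            if (a, b) in popular:
                order2[(a, b)].append(c)
        return order1, order2

    def next_word(order1, order2, prev2, prev1):
        """Prefer the order-2 sub-table when the last pair is popular;
        fall back to the flat order-1 probabilities otherwise."""
        followers = order2.get((prev2, prev1)) or order1.get(prev1)
        return random.choice(followers) if followers else None

Generation prefers the richer context whenever it exists and degrades gracefully to the order-1 behavior everywhere else, so the popular clumps stay coherent without the whole graph growing another order of magnitude.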
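And the quantization trick, assuming the logarithmic spacing turns out to be right -- the 8-bit width and the 2^-16 floor just echo the figures above, nothing measured:

    import math

    BITS = 8          # stored resolution: one byte per probability
    MIN_LOG2 = -16.0  # floor: smallest probability a 16-bit count expresses

    # 256 levels spaced evenly in log space: equal *ratios* between
    # neighbouring entries, not equal differences.
    TABLE = [2.0 ** (MIN_LOG2 * i / (2 ** BITS - 1)) for i in range(2 ** BITS)]

    def quantize(p):
        """Probability -> 8-bit index of the nearest log-scale level."""
        lp = math.log2(max(p, 2.0 ** MIN_LOG2))
        return round(lp * (2 ** BITS - 1) / MIN_LOG2)

    def dequantize(i):
        """8-bit index -> approximate probability."""
        return TABLE[i]

Equal spacing in log space means small probabilities keep their relative precision instead of being crushed to zero, which is presumably why the values would gravitate toward a logarithmic scale in the first place.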