If you browse Reddit, chances are that you’ve heard of /r/SubredditSimulator.

There you’ll find a safe space for the artificial mirror images of actual subreddits: communities populated not by real people, but by robots imitating them. Their words sound familiar, and yet uncanny; hilarity ensues when a link is given a comical description, and bemusement can strike those who forget they’re subscribed.

In this post we’re going to see how it all works, how to replicate it, and how to fine-tune the result with the help of some fitness functions, in order to generate text that approximately matches criteria of our choosing.

Are you excited? So am I; let’s begin.

1. Markov Chains

SubredditSimulator perfectly showcases an incredible mechanism that has been around for quite some time now: Markov chains.

As you probably (don’t) know, I’m all about dat procedural generation. Markov chains are of great interest to me, because they’re one of the simplest and yet most effective tools for creating semi-coherent text out of sheer nothingness (read: a moderately ample data set). And, aside from neural nets, I’d say they’re about the best we’ve got at the moment, simplicity notwithstanding.

Markov chains can “generate” a word B from a word A, if and only if B followed A at least once during training. Moreover, B is picked from a list of candidates (read: words that have followed A during training) which is sorted by occurrence, and thus, by probability. A dash of randomness is then added to spice things up a little bit and to (try to) prevent loops, but this is mostly a solution to the problem of not having a large enough data set.
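The weighted pick at the heart of this can be sketched in a couple of lines of Python (the candidate counts below are made up for illustration):

```python
import random

# Hypothetical counts of words that followed "the" during training.
candidates = {"cat": 3, "dog": 1, "end": 1}

# The next word is drawn with probability proportional to its count:
# on average, "cat" wins 3 times out of 5.
next_word = random.choices(list(candidates),
                           weights=list(candidates.values()))[0]
```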

So, the first implementation detail about Markov chains is learning. A corpus of text enters our black box, and a horizontal tree comes out: each word is a branch, its leaves are all the words that have followed that word, and each leaf’s “size” depends on its relative frequency. Did I paint a nice picture? I hope so.
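To make the picture concrete: that horizontal tree can be as simple as a nested dictionary, where branches are keys and leaves are counters (the numbers here are illustrative):

```python
# A possible tree after training on "the cat sat on the mat":
tree = {
    "the": {"cat": 1, "mat": 1},  # "the" was followed by "cat" and "mat"
    "cat": {"sat": 1},
    "sat": {"on": 1},
    "on":  {"the": 1},
}

# A leaf's "size" is its count relative to its siblings.
total = sum(tree["the"].values())
print(tree["the"]["cat"] / total)  # 0.5
```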

2. Training

Since we’re talking about trees and about learning, let’s talk Python:

```python
import re

class MarkovChain:
    def __init__(self):
        self.tree = dict()

    def train(self, text, factor=1):
        pass
```

Okay, nothing surprising so far. Our tree is going to be a key-value store. But what about the train() function, and what about factor? We’ll see shortly where factor comes into play, but for now think of it as a multiplier. Its inclusion allows our chain to train harder on some text, and less on others. Not only that: when factor is negative, all the relationships for each pair of words in the corpus will be weakened! This lets us train more dynamically and, in the end, formulate a bias towards specific characteristics (for example: longer words, more consonants, fewer alliterations, rhyming, etc.).

Here’s the fully annotated train() function:

[CODE]

The idea here is simple: for each pair of words contained within a corpus of text, strengthen the bond of that pair by a constant factor (which lies in the [-1;1] range).
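The original listing isn’t reproduced here, but a minimal train() along those lines might look like this sketch (the tokenization regex is my own guess, not necessarily the one the project uses):

```python
import re

class MarkovChain:
    def __init__(self):
        self.tree = dict()

    def train(self, text, factor=1):
        # Tokenize the corpus into lowercase words.
        words = re.findall(r"[\w']+", text.lower())
        # For each pair of consecutive words, strengthen (or, with a
        # negative factor, weaken) the branch -> leaf bond.
        for a, b in zip(words, words[1:]):
            leaves = self.tree.setdefault(a, dict())
            leaves[b] = leaves.get(b, 0) + factor
```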

Once the process is over, our tree will have grown new branches and leaves.

3. Batch-processing and serializing

Believe it or not, we’re almost there. At the tip of the iceberg, I mean. Now that our chain can be trained on some text, let’s write some helpers to train on all files within a directory, and to save and load the state of our generator from disk. We’ll be needing glob and pickle.

[CODE]

Easier done than said! These will be useful, because Markov chains tend to mimic the style of their data set. The ability to train and store state at will is certainly a necessity here.
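In case the listing doesn’t survive above, helpers built on glob and pickle could look roughly like this (written as free functions with hypothetical names here; the real ones are methods on the class):

```python
import glob
import pickle

def bulk_train(chain, pattern, verbose=False):
    # Train the chain on every file matching a glob pattern.
    paths = glob.glob(pattern)
    for path in paths:
        with open(path, encoding="utf-8") as f:
            chain.train(f.read())
    if verbose:
        print("Successfully trained on {} files.".format(len(paths)))

def save_state(chain, path):
    # Serialize the whole tree to disk.
    with open(path, "wb") as f:
        pickle.dump(chain.tree, f)

def load_state(chain, path):
    # Restore a previously pickled tree.
    with open(path, "rb") as f:
        chain.tree = pickle.load(f)
```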

4. Generating text!

Here comes the fun. Once the chain is trained, it can begin its job as a mad blabbering robot.

[CODE] first, explanations later.

Again, it’s simple: you start with a given word (or pick a random branch) and sort its leaves by frequency of occurrence. Some randomness is added to ensure variety, but my implementation allows flexibility even there: if you don’t want randomness, simply pass lambda x: x to rand instead of its default value and you have a perfectly deterministic generation.

The generate() function itself is a Pythonic generator: it continuously yields values instead of building and then returning a whole list. The max_len parameter is there to prevent the process from potentially running indefinitely, but no one is stopping you from leaving it at 0 and generating words until a dead end is found (which may never happen).
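Since the listing is omitted above, here’s a sketch of a generator in that spirit, written as a free function over the tree. I’m assuming rand maps the top candidate index to a (possibly jittered) one, so that a pass-through lambda always picks the most frequent leaf; the real signature may differ:

```python
import random

def generate(tree, start_with=None, max_len=0,
             rand=lambda i: i + random.randint(0, 2)):
    # Walk the tree, yielding one word at a time. `rand` perturbs the
    # index into the frequency-sorted candidates; pass rand=lambda i: i
    # for fully deterministic output.
    word = start_with if start_with is not None else random.choice(list(tree))
    count = 0
    while True:
        yield word
        count += 1
        if max_len and count >= max_len:
            return
        leaves = tree.get(word)
        if not leaves:
            return  # dead end: nothing ever followed this word
        ranked = sorted(leaves, key=leaves.get, reverse=True)
        word = ranked[min(int(rand(0)), len(ranked) - 1)]
```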

We then have a generate_formatted() function, which neatly wraps generate() in order to yield better-looking text. It supports word wrap, line breaks and capitalization. Nothing to write home about, but it’ll be handy.

Well, how about that. We’re ready to put our code to the test! I’ll train it using some of my recent Reddit posts.

```python
# Builds a formatted string from the generator
def gen(m):
    return ''.join(m.generate_formatted(word_wrap=60, soft_wrap=True,
                                        start_with=None, max_len=50,
                                        verbose=True))

mkv = MarkovChain()
mkv.bulk_train('E:\\Text Files\\MkvCmd Training Data\\*.txt', verbose=True)
print(gen(mkv))
```

Output:

```
...
Successfully trained on 1 files.

Generating a sentence of max. 50 words, starting with "early":
Early on, if you can be a mac address, but i think that you
can tell you can be so well (\p{l} matches any compassion.
It’s a lot of the only works with a few downsides to be a drug
because, when you can be a lot of the same

Generating a sentence of max. 50 words, starting with "bring":
Bring users away from the same be too sure.
Your account to be a few special characters.
You don't know where you don't know where you don't know about
them.
Even beats c++'s speed (which is a few years, but i think that
you don't have to be a few
```

Okay, that’s rough. Pretty funny, but still rough. I’ll concede that my sample size is pretty small right now, but that’s not going to be a huge problem.

Hey, at least it’s working. We’re one step closer to the truth now.

5. Pulling the strings to adjust the weights

And here’s the novelty: applying a fitness function to our Markov chain!

Fitness functions are, in the context of evolutionary programming, what determines if a given candidate is good or bad. Depending on the value returned by a fitness function, a model can adapt in order to maximize its fitness.

In our case, a candidate will be a pair of words: branch and leaf. We’re going to evaluate this relationship and strengthen or weaken it depending on a few arbitrary criteria.

[CODE]

Okay, we now have an adjust_weights() function. It takes a max_len parameter and a function f that takes two words as input and returns a value in the range [0; 1].

Theoretically, the higher max_len is, the longer the chain will take to adapt. If max_len is too low, however, adjustments can become erratic and miss the goal. In practice? Setting it to two, the minimum possible value, works like a charm, perhaps because my data set is so small.
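The listing isn’t reproduced here, but the mechanism can be sketched as follows, assuming the tree is a dict of word → {word: weight} as before. The score in [0, 1] is remapped to a factor in [-1, 1], so bad pairs are punished and good ones rewarded:

```python
import random

def adjust_weights(tree, f, max_len=2):
    # A sketch: sample a short walk through the chain and nudge each
    # visited branch -> leaf bond according to the fitness function f,
    # which takes two words and returns a score in [0, 1].
    word = random.choice(list(tree))
    for _ in range(max_len - 1):
        leaves = tree.get(word)
        if not leaves:
            break
        nxt = random.choices(list(leaves), weights=list(leaves.values()))[0]
        # Map the score from [0, 1] to a strengthening factor in [-1, 1].
        leaves[nxt] += 2 * f(word, nxt) - 1
        word = nxt
```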

We also have a bulk_adjust_weights() function, which is what we’ll be using. Not only can it apply a single fitness function to the model, it can apply several at once, for a given number of iterations. In verbose mode it also prints a pretty progress bar, which is nice.

Oh, and, by the way, the MarkovChain class is done. Feel free to check it out on GitHub.

It’s time to define some fitness functions:

[CODE]
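Listings aside, fitness functions in this spirit (each mapping a pair of words to a score in [0, 1]) are easy to invent; the three below are my own illustrations, not the article’s originals:

```python
def prefer_long_words(a, b):
    # Reward longer pairs; ten letters per word saturates the score.
    return min((len(a) + len(b)) / 20.0, 1.0)

def prefer_vowels(a, b):
    # The fraction of vowels across both words.
    text = (a + b).lower()
    return sum(c in "aeiou" for c in text) / max(len(text), 1)

def prefer_punctuation(a, b):
    # Full marks if either word carries punctuation, zero otherwise.
    return 1.0 if any(c in ".,;:!?" for c in a + b) else 0.0
```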

These are merely examples; the only limit is your imagination! Let’s see how to apply them:

[CODE]

And here’s the new, improved, output:

```
Additionally, running releases endorphins and people now some
systems, while the other drug, is perfectly fine; they seem
and they should have been built-in from mdd for a shitty fad
challenge.
The circlejerk a shitty fad challenge.
The lights are obesity, diabetes, and people just hover any
letter).
Additionally, running.
```

Much better, if you ask me. Almost poetic. By rewarding more complex words and more abundant punctuation, we managed to improve the output’s quality by quite a bit. Let’s see what else we can do.

Here, for example, by mixing and matching, I’ve managed to generate something resembling a pop song’s lyrics:

```
Staple because of other drugs were fucking your pleasure.
And don’t fit in turn to.
In turn to.
In turn to.
In loops unless necessary or subconsciously by buying coca cola
and regulate serotonin.
Don’t fit in turn to.
In turn to.
In spite of hard feelings.
```

But I could go full punctuation, and generate something like:

```
Alpha: we’re still an ip address spoofing which fakes your judgement
tells you can slow down your doctor to a better programmer?
Who can beat it.
I’m outside i’ll be too sure.
Your siblings used to.
Happy at it.
Sometimes you can beat it.
I’m brought to make tracing harder.
```

Or I could say that I only like short words and vowels:

```
Throughput is a bit more modest and i think that you don’t want
to be the same league as you don’t want to be a few years, but
i think that you don’t want to be a few special characters.
You don’t really are just come out something new game
```

Or extremely long words with few vowels and many consonants:

```
Concentration.
In their face.
They’ve been built-in from reality, despite being so irritating.
People have no matter of course, mdd for 10 years.
I can be.
The real problems, input redirected , and they will not going
through, but things had on chrome you to say whatever they click,
even
```

But enough is enough. I’ll leave you with an excerpt generated from this very article:

```
We’re only that; when a large enough data set.
The corpus of my recent reddit posts.
# builds a nice picture?
```

It almost sounds like a haiku, or perhaps I’m starting to see patterns in the madness of chaos.

Thank you for reading. I hope that you found my write-up useful and/or interesting. I’ll see you as soon as procedural generation offers a new challenge!

FULL PROJECT: https://github.com/G3Kappa/Adjustable-Markov-Chains