Generating Haikus

The goal was to use NYC job descriptions to create haikus. Originally a Japanese poetic form, a haiku is a three-line poem with five syllables in the first line, seven in the second and five in the third.

To produce a haiku, I used a custom Markov chain method. A Markov chain generates a sequence one value at a time: given the current value, it picks the next one according to the probabilities of the values that can follow it. In this case, given a word, which words are likely to follow?

The first step is to determine these probabilities. I divided the data up by civil service title (Computer Systems Manager, Painter, Civil Engineer, etc.) and, for each title, built a separate corpus from the text in the job description and preferred skills fields. Then I split each corpus into sentences, split each sentence into words, and counted how many times each word followed each other word.
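The counting step can be sketched roughly as follows. This is a minimal illustration, not the original code: the function name build_transitions, the naive sentence splitter and the three-sentence sample corpus are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict


def build_transitions(corpus):
    """Map each word to a Counter of the words that directly follow it."""
    transitions = defaultdict(Counter)
    # Naive sentence split on ., ! and ? (an assumption; the real splitter is not described).
    for sentence in re.split(r"[.!?]+", corpus):
        words = re.findall(r"[a-z']+", sentence.lower())
        # Pair each word with its successor; pairs never cross a sentence boundary.
        for current, following in zip(words, words[1:]):
            transitions[current][following] += 1
    return transitions


# Made-up sample text standing in for one civil service title's corpus.
corpus = "Manage data analysis projects. Support data integration. Lead data analysis."
transitions = build_transitions(corpus)
print(transitions["data"].most_common())  # [('analysis', 2), ('integration', 1)]
```

Keeping one such table per civil service title gives each title its own vocabulary and phrasing.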

The following example shows the most common words to follow “data” for a Computer Systems Manager. Given a table like this, the Markov chain picks a random next word, weighted by the probabilities. It then takes the resulting word and repeats the process again and again.
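A weighted pick of this kind might look like the sketch below. The counts for “data” are invented for illustration (they are not the real figures from the job data), and next_word is a hypothetical helper name.

```python
import random
from collections import Counter


def next_word(followers, rng=random):
    """Pick one follower at random, weighted by how often it was seen."""
    words = list(followers)
    weights = [followers[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


# Invented counts: "analysis" followed "data" 6 times, and so on.
followers = Counter({"analysis": 6, "management": 3, "integration": 1})

rng = random.Random(0)  # seeded only so the demonstration is repeatable
sample = Counter(next_word(followers, rng) for _ in range(1000))
print(sample.most_common())
```

Raw counts work as weights here, so there is no need to normalize them into probabilities first.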

Since I was generating haikus, I had a strict syllable constraint. I only considered a next word if it fit within the syllable limit of the current line. For example, if I was on the first line (5 syllables) and my current word was “data”, I wouldn’t choose “analysis” or “integration” as the next word because either would put the line over 5 syllables.
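The syllable filter could be sketched like this. The SYLLABLES lookup is a small hand-written table made up for the example (how syllables were actually counted is not stated), and valid_next_words is a hypothetical helper name.

```python
# Hypothetical hand-written syllable lookup; a real one would cover the whole corpus.
SYLLABLES = {"data": 2, "analysis": 4, "integration": 4, "systems": 2, "and": 1}


def valid_next_words(followers, syllables_used, line_limit):
    """Keep only the followers that still fit in the current line."""
    remaining = line_limit - syllables_used
    return {
        word: count
        for word, count in followers.items()
        # Words missing from the lookup are skipped rather than guessed at.
        if word in SYLLABLES and SYLLABLES[word] <= remaining
    }


# Invented follower counts for "data".
followers = {"analysis": 6, "and": 3, "systems": 2, "integration": 1}

# First line (limit 5), with "data" already using 2 syllables: 3 remain.
allowed = valid_next_words(followers, syllables_used=SYLLABLES["data"], line_limit=5)
print(sorted(allowed))  # ['and', 'systems']
```

The surviving counts can be passed straight to the weighted pick, so the remaining words keep their relative likelihood.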

During this process the generator would at times write itself into an impossible state, where no valid choices remained within the syllable limit. In that case, it would go back and try a new word to see if it would lead to a valid haiku.
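One way to express this "go back and try a new word" behavior is a recursive search with backtracking. The sketch below is an assumption about how it could be done, not the original implementation: the names generate and to_lines and the tiny transition and syllable tables are invented, and candidates are tried in a plain shuffled order rather than a weighted one to keep the example short.

```python
import random

PATTERN = (5, 7, 5)  # syllables per haiku line


def generate(word, transitions, syllables, rng, line=0, used=0):
    """Return the words of a full haiku starting at `word`, or None if stuck."""
    count = syllables.get(word)
    if count is None or used + count > PATTERN[line]:
        return None  # unknown word, or the word would overflow the current line
    used += count
    if used == PATTERN[line]:
        line, used = line + 1, 0  # this word completes the line
        if line == len(PATTERN):
            return [word]  # all three lines are filled
    candidates = list(transitions.get(word, {}))
    rng.shuffle(candidates)
    for candidate in candidates:
        rest = generate(candidate, transitions, syllables, rng, line, used)
        if rest is not None:
            return [word] + rest
    return None  # dead end: the caller moves on to its next candidate


def to_lines(words, syllables):
    """Split a finished word list into the three haiku lines."""
    lines, current, used, line = [], [], 0, 0
    for word in words:
        current.append(word)
        used += syllables[word]
        if used == PATTERN[line]:
            lines.append(" ".join(current))
            current, used, line = [], 0, line + 1
    return lines


# Tiny invented tables; "managers" is a dead end because it overflows line one.
syllables = {"work": 1, "with": 1, "project": 2, "leads": 1, "managers": 3,
             "to": 1, "create": 2, "a": 1, "design": 2, "and": 1,
             "construction": 3, "process": 2}
transitions = {"work": {"with": 1}, "with": {"project": 1},
               "project": {"leads": 1, "managers": 2}, "leads": {"to": 1},
               "managers": {"to": 1}, "to": {"create": 1}, "create": {"a": 1},
               "a": {"design": 1}, "design": {"and": 1},
               "and": {"construction": 1}, "construction": {"process": 1}}

haiku = to_lines(generate("work", transitions, syllables, random.Random(1)), syllables)
print("\n".join(haiku))
```

Because every word has at least one syllable and a haiku has 17 in total, the search can never recurse more than 17 levels deep.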

The haikus came out of the process in a raw state: all lower case, no punctuation, and sometimes just not very good. The biggest problem I found was that, because haikus are so short, the results were often incomplete thoughts.

The Markov chain would be in the middle of a sentence when it hit the syllable count and stopped short. I tried to correct for that by only ending with words that could be logical ending words, but it didn’t work for every situation.
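How the "logical ending words" were chosen is not described. One plausible approach, shown here purely as an assumption, is to collect the words that end a sentence somewhere in the corpus and only allow the haiku to finish on one of them. The names sentence_final_words and can_end_haiku and the sample text are hypothetical.

```python
import re


def sentence_final_words(corpus):
    """Collect every word that ends at least one sentence in the corpus."""
    endings = set()
    for sentence in re.split(r"[.!?]+", corpus):
        words = re.findall(r"[a-z']+", sentence.lower())
        if words:
            endings.add(words[-1])
    return endings


def can_end_haiku(word, endings):
    """Allow the final haiku word only if it has ended a sentence before."""
    return word in endings


# Made-up sample text.
corpus = ("Work with project leads to create a design and construction process. "
          "Join the department of design and construction.")
endings = sentence_final_words(corpus)
print(sorted(endings))  # ['construction', 'process']
print(can_end_haiku("and", endings))  # False
```

A check like this only looks at the last word, so it cannot tell whether the words before it form a complete thought.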

For example, both of the following end with “design and construction process” but only the first is a complete sentence:

Work with project leads

to create a design and

construction process

and

The new york city

department of design and

construction process

Some of these results were actually amusing:

The environment

and the environment and

the environment.

and

The city of New

York City: open data,

open government.

Through a semi-manual, semi-automated editing process, I cleaned up the haikus to get presentable results. The final piece had 750+ haikus.