This paragraph struck me as highly poetic, compared to what I’d seen in the past from a computer. The language wasn’t entirely sensical, but it certainly conjured imagery and employed relatively solid grammar. Furthermore, it was original. Originality has always been important to me in computer-generated text—because what good is a generator if it just plagiarizes your input corpus? This is a major issue with high-order Markov chains, but due to its more sophisticated internal mechanisms, the LSTM didn’t seem to have the same tendency.

Unfortunately, much of the prose-trained model output that contained less poetic language was also less interesting than the passage above. But given that I could produce poetic language with a prose-trained model, I wondered what results I could get from a poetry-trained model.

Early LSTM results from poetry-trained model

The output above comes from the first model I trained on poetry. I used the most readily available books I could find, mostly those of poets from the 19th century and earlier whose work had entered the public domain. The consistent line breaks and capitalization schemes were encouraging. But I still wasn’t satisfied with the language—due to the predominant age of the corpus, it seemed too ornate and formal. I wanted more modern-sounding poetic language, and so I knew I had to train a model on modern poetry.

Early LSTM result from modern poetry model

I assembled a corpus of all the modern poetry books I could find online. It wasn’t nearly as easy as assembling the prior corpus—unfortunately, I can’t go into detail on how I got all the books for fear of being sued.

The results were much closer to what I was looking for in terms of language, but they were inconsistent in quality. At the time, I believed this was because the corpus was too small, so I began to supplement my modern poetry corpus with select prose works to increase its size. That likely was part of the problem, but I had also not yet discovered the seeding techniques that, as I would later learn, can dramatically improve LSTM output.

Another idea occurred to me: I could seed a poetic-language LSTM model with a generated image caption to make a new, more poetic version of word.camera. Some of the initial results (see: left) were striking. I showed them to one of my mentors, Allison Parrish, who suggested that I find a way to integrate the caption throughout the poetic text, rather than just at the beginning. (I had shown her some longer examples, where the language had strayed quite far from the subject matter of the caption after a few lines.)

I thought about how to accomplish this, and settled on a technique of seeding the poetic language LSTM multiple times with the same image caption at different temperatures.

Temperature is a parameter, a number between zero and one, that controls the riskiness of a recurrent neural network’s character predictions. A low temperature value will result in text that’s repetitive but highly grammatical. Conversely, high temperature results will be more innovative and surprising (the model may even invent its own words) while containing more mistakes. By iterating through temperature values with the same seed, the subject matter would remain consistent while the language varied, resulting in longer pieces that seemed more cohesive than anything I’d ever produced with a computer.
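
To make this concrete, here is a minimal sketch, in Python with NumPy rather than the tools I actually used, of how temperature rescales a character-level model’s output distribution before sampling, and how the same caption can be run at several temperatures and joined into one longer piece. The vocabulary, the predict_next_char_logits stand-in, the caption, and the temperature values are all hypothetical:

```python
import numpy as np

# Hypothetical stand-ins for the trained character-level LSTM and its
# vocabulary; a real model would come from a framework such as char-rnn.
VOCAB = list("abcdefghijklmnopqrstuvwxyz ,.\n")

def predict_next_char_logits(context):
    """Placeholder for the LSTM's next-character scores (random here)."""
    rng = np.random.default_rng(abs(hash(context)) % (2 ** 32))
    return rng.normal(size=len(VOCAB))

def sample_char(logits, temperature):
    """Sample one character after rescaling the logits by temperature."""
    scaled = logits / temperature          # low T sharpens, high T flattens
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return VOCAB[np.random.choice(len(VOCAB), p=probs)]

def generate(seed_text, temperature, length=200):
    """Generate `length` characters of text, starting from the seed."""
    text = seed_text
    for _ in range(length):
        text += sample_char(predict_next_char_logits(text), temperature)
    return text

# The seeding technique described above: run the same image caption through
# the model at several temperatures, then join the results into one piece.
caption = "a group of people standing on top of a snow covered slope"
poem = "\n\n".join(generate(caption, t) for t in (0.25, 0.5, 0.75, 1.0))
```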

As I refined the aforementioned technique, I trained more LSTM models, attempting to discover the best training parameters. The performance of a neural network model is measured by its loss, a statistical measure of how well the model can predict the character sequences in its own corpus; loss drops during training and should end up as close to zero as possible. During training, there are two loss figures to monitor: the training loss, which reflects how well the model predicts the part of the corpus it’s actually training on, and the validation loss, which reflects how well it predicts an unseen validation sample that was removed from the corpus prior to training.
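
To illustrate the distinction, here is a minimal sketch, under my own assumptions rather than from the training code I used, of how both figures can be computed: the loss is the average negative log-probability the model assigns to each next character, measured separately on the training portion of the corpus and on a held-out validation slice. The model_probs function is a hypothetical stand-in for the trained network:

```python
import numpy as np

def split_corpus(text, val_frac=0.05):
    """Hold out the final `val_frac` of the corpus as a validation sample."""
    cut = int(len(text) * (1 - val_frac))
    return text[:cut], text[cut:]

def average_loss(model_probs, text):
    """Average cross-entropy (in nats) of next-character predictions.

    `model_probs(context, char)` is a hypothetical stand-in returning the
    probability the model assigns to `char` appearing after `context`.
    """
    losses = [-np.log(model_probs(text[:i], text[i])) for i in range(1, len(text))]
    return float(np.mean(losses))

# Monitoring during training, across checkpoints:
#   train_text, val_text = split_corpus(corpus)
#   training_loss   = average_loss(model_probs, train_text)
#   validation_loss = average_loss(model_probs, val_text)
# A training loss that keeps falling while the validation loss stalls or rises
# suggests the model is memorizing its corpus rather than generalizing.
```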

The goal of training a model is to reduce its validation loss as much as possible, because we want a model that accurately predicts unknown character sequences, not just those it’s already seen. To this end, there are a number of parameters to adjust, among which are:

learning rate & learning rate decay: Determines how quickly a model will attempt to learn new information. If set too low or too high, the model will never reach its optimal state. This is further complicated by the learning rate’s variable nature—one must consider not only the optimal initial learning rate, but also how much and how often to decay that rate.

dropout: Introduced by Geoffrey Hinton et al. Forces a neural network to learn multiple independent representations of the same data by randomly disabling certain neurons during training at alternating intervals. The percentage of neurons disabled at any given moment in training is determined by the dropout parameter, a number between zero and one.

neurons per layer & number of layers: The number of parameters in a recurrent neural network model is proportional to the number of artificial neurons per layer as well as the number of layers in the model, which is typically either two or three. For character-level LSTMs, the number of parameters in a model should, in general, be the same order of magnitude as the number of characters in the training corpus. So, a 50 MB corpus should require something in the neighborhood of 50 million parameters. But like other parameters, the exact number may be adjusted—Karpathy suggests always erring on the side of a model that’s too large rather than one that’s too small. (A rough way to estimate this count is sketched after this list.)

batch size & sequence length: I’ll just let Karpathy explain this one, from his Char-RNN documentation:

The batch size specifies how many streams of data are processed in parallel at one time. The sequence length specifies the length of each stream, which is also the limit at which the gradients can propagate backwards in time. For example, if seq_length is 20, then the gradient signal will never backpropagate more than 20 time steps, and the model might not find dependencies longer than this length in number of characters.
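
As a rough illustration of the parameter-count heuristic above (a sketch under standard LSTM assumptions, not taken from Karpathy’s documentation beyond the order-of-magnitude advice, with a vocabulary of 100 characters assumed), here is how the neurons-per-layer and number-of-layers settings translate into a total parameter count that can be compared against the corpus size:

```python
def lstm_param_count(vocab_size, neurons_per_layer, num_layers):
    """Approximate parameter count of a stacked character-level LSTM.

    Each LSTM layer has four gates, each with input weights, recurrent
    weights, and a bias; a final softmax layer projects back to the vocabulary.
    """
    h = neurons_per_layer
    total = 0
    input_size = vocab_size  # one-hot characters feed the first layer
    for _ in range(num_layers):
        total += 4 * (h * input_size + h * h + h)  # gates: W_x, W_h, bias
        input_size = h  # deeper layers read the layer below
    total += h * vocab_size + vocab_size  # output projection + bias
    return total

# Heuristic from the text: parameters should be roughly the same order of
# magnitude as the number of characters in the corpus (~50 million for ~50 MB).
corpus_chars = 50_000_000
for layers in (2, 3):
    for h in (512, 1024, 2048):
        n = lstm_param_count(vocab_size=100, neurons_per_layer=h, num_layers=layers)
        print(f"{layers} layers x {h} neurons -> {n / 1e6:.1f}M params "
              f"(corpus: {corpus_chars / 1e6:.0f}M chars)")
```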

The training process largely consists of monitoring the validation loss as it drops across model checkpoints, and monitoring the difference between training loss and validation loss. As Karpathy writes in his Char-RNN documentation: