More successfully, I experiment in 2019 with a recently-developed alternative to char-RNNs, the Transformer NN architecture, by finetuning OpenAI’s GPT-2-117M Transformer model on a much larger (117MB) Project Gutenberg poetry corpus using both unlabeled lines & lines with inline metadata (the source book). The generated poetry is much better. And GPT-3 is better still.

Char-RNNs are unsupervised generative models which learn to mimic text sequences. I suggest extending char-RNNs with inline metadata such as genre or author prefixed to each line of input, allowing for better & more efficient metadata, and more controllable sampling of generated output by feeding in desired metadata. A 2015 experiment using torch-rnn on a set of ~30 Project Gutenberg e-books (1 per author) to train a large char-RNN shows that a char-RNN can learn to remember metadata such as authors, learn associated prose styles, and often generate text visibly similar to that of a specified author.

A character-level recurrent neural network (“char-RNN”) trained on corpuses like the Linux source code or Shakespeare can produce amusing textual output mimicking them. Music can also be generated by a char-RNN if it is trained on textual scores or transcriptions, and some effective music has been produced this way (I particularly liked Sturm’s).

A char-RNN is simple: during training, it takes a binary blob (its memory or “hidden state”) and tries to predict a character based on it and a new binary blob; that binary blob gets fed back in to a second copy of the RNN which tries to predict the second character using the second binary blob, and this gets fed into a third copy of the RNN and so on (“unrolling through time”). Whether each character is correct is the training error, which gets backpropagated to the previous RNNs; since they are still hanging around in RAM, blame can be assigned appropriately, and eventually gibberish hopefully evolves into a powerful sequence modeler which learns how to compactly encode relevant memories into the hidden state, and what characters can be predicted from the hidden state. This doesn’t require us to have labels or complex loss functions or a big apparatus: the RNN gets trained character by character.

How can we do that? The RNN in the C or CSS examples is able to mode-switch like this because, I think, there are clear transition markers inside the CSS or C which ‘tell’ the RNN that it needs to switch modes now; a comment begins /* ... or a data-URI in CSS begins url('data:image/png;base64,...). In contrast, the most straightforward way of combining music or books and feeding them into a char-RNN is to simply concatenate them; but then the RNN has no syntactic or semantic markers which tell it where ‘Bible’ begins and ‘Shakespeare’ ends. Perhaps we can fix that by providing metadata such as author/genre and turning it into a semi-supervised task, somehow, along the lines of the source code: distinguish the text of one author from another, and then let the RNN learn the distinctions on its own, just like the CSS/C.

If we could get the RNN to do such switching on demand, there are several possible benefits. Human-authored textual output is always more similar than different: a text file of Shakespeare is much more similar to a text file of the Bible than it is to an equivalent length of ASCII generated at random such as $M@Spc&kl?,U.(rUB)x9U0gd6G; a baroque classical music score is more similar to a transcript of a traditional Irish music jam. Since they share such mutual information, an RNN trained to produce Shakespeare and the Bible will be smaller than the sum of 2 RNNs for Shakespeare & the Bible separately; this makes it easier to share trained RNNs since you can distribute 1 RNN covering many genres or authors for people to play with, rather than having to train & host a dozen different RNNs. Such an RNN may also generate better output for all cases since less of the corpuses’ information is spent on learning the basics of English shared by both corpuses and more is available for learning the finer details of each kind of writing, which may help in cases like music where large datasets of textual transcriptions of a desired genre may not be available (by training on a large corpus of classical music, a smaller corpus of Irish music may go further than it would’ve on its own). More speculatively, the metadata itself may dynamically improve generation by making it easier for the RNN to not ‘wander’: since the RNN is keeping a memory of the metadata in its hidden state, output may be more thematically coherent, as the RNN can periodically refer back to the hidden state to remember what it was talking about.

However, it seems like it should be possible to do this. An RNN is a powerful neural network, and we can see in examples using Karpathy’s char-rnn that such RNNs have learned ‘sublanguages’: in the Linux C source code examples, the RNN has learned to switch appropriately between comments, source code, and string literals; in the CSS examples, it’s learned to switch between comments, CSS source code, string literals, URLs, and data-URIs. If the RNN can decide on its own while generating C or CSS to switch from “source code mode” to “comment mode”, then it should be able to also learn to switch between Shakespeare and Bible mode, or even more authors.

A problem with this approach is that a char-RNN has to be trained for each corpus: if you want Shakespearean gibberish, you must train it only on Shakespeare, and if you want Irish music, you must train only on Irish. If you don’t, and you create a corpus which is Shakespeare concatenated with the Bible, you will probably get something halfway between the two, which might be somewhat interesting, but is not a step forward to generating better & more interesting gibberish. Or if you have a few hundred songs of Irish music written in ABC format and a few dozen rock or classical pieces written in MIDI, training an RNN on them all mixed together will simply yield gibberish output, because you will get an ‘average syntax’ of ABC & MIDI and an ‘average music’ of Irish & Rock. This is in part because the training is unsupervised in the sense that the char-RNN is only attempting to predict the next character given the previous characters, and it has no reason to give you just Shakespeare or just Bible output; it is bouncing between them.

There are two approaches for how to encode the metadata into the RNN:

in band: systematically encode the metadata into the corpus itself, such as by a prefixed or suffixed string, and hope that the RNN will be able to learn the relevance of the metadata and use it during training to improve its predictions (which it should, as LSTM/GRU units are supposed to help propagate long-term dependencies like this); then specific genres or authors or styles can be elicited during sampling by providing that metadata as a seed. So for example, a Shakespeare corpus might be transformed by prefixing each line with a unique string which doesn’t appear in the corpus itself, eg “SHAKESPEARE|To be or not to be,|SHAKESPEARE”. Then during sampling, Shakespearean prose will be triggered like th sample.lua rnn.t7 -primetext "SHAKESPEARE|". (Why the pipe character? Because it’s rarely used in prose but isn’t hard to type or work with.) To add in more metadata, one adds in more prefixes; for example, perhaps the specific work might be thought relevant and so the corpus is transformed to “SHAKESPEARE|HAMLET|To be or not to be,|HAMLET|SHAKESPEARE”. Then one can sample with the specific work, author, or both. For musical generation, relevant metadata might be musical genre, author, tempo, instruments, type of work, tags provided by music listeners (“energetic”, “sad”, “for_running” etc), so one could ask for energetic Irish music for two fiddles. This has the advantage of being easy to set up (some regexes to add metadata) and easy to extend (take an existing trained RNN and use it on the modified corpus); the disadvantage is that it may not work, as the RNN may be unable to jointly learn to recall and use the metadata: it may instead learn to forget the metadata immediately, or spend all its learning capacity on modeling an ‘average’ input because that yields better log-loss error.
This in band approach can also easily be extended to cover classification; in classification, the metadata is put at the end of each line, so instead of learning to predict text conditional on metadata & previous text, the RNN is learning to predict metadata conditional on previous text, and classifications can be extracted by low-temperature sampling with the input as the prime text followed by the separator character and seeing what metadata is predicted (eg th sample.lua classification.t7 -temperature 0.1 -primetext "...text...|" → "SHAKESPEARE\n"). As far as I know, no one has done this except perhaps inadvertently or implicitly.

out of band: instead of depending on the RNN to learn the value of the metadata and preserving it in its hidden state, one can change the RNN architecture to inject the metadata at each timestep. So if one has an RNN of 500 neurons, 5 of them will be hardwired at each timestep to the metadata value for the sequence being worked on. The downside is that all metadata inputs will require modification of the RNN architecture to map them onto a particular hidden neuron. The advantage is that the metadata value will always be present, there is no need to hope that the RNN will learn to hold onto the metadata, and it only has to learn the associated differences; so it will learn more reliably and faster. Variants of this turn out to have been done before:

Mikolov & Zweig 2012, “Context dependent recurrent neural network language model”: RNN augmented with topic information from LDA, achieving better prediction on the Penn Treebank & WSJ transcription task

Aransa et al 2013/2015, “Improving Continuous Space Language Models using Auxiliary Features”: a feedforward NN given n characters at a time, with the inputs at each sequence including embeddings of the previous lines and, particularly, 5 ‘genres’ (in this case, Egyptian Arabic SMS/chat, modern standard Arabic, Egyptian Arabic forum discussions, Levantine forum discussions, formal MSA from UN translations, Egyptian Arabic telephone calls), hardwired into the input layer; finding that genre particularly helped BLEU scores. (Including metadata like genre to assist training appears to have been used fairly regularly in earlier text topic-modeling work, but not so much neural networks or for increasing realism of generated text.)
Chen et al 2015, “Recurrent Neural Network Language Model Adaptation for multi-Genre Broadcast Speech Recognition”: an RNN augmented with the text input being fed into standard text topic-modeling algorithms like LDA, partially trained on BBC genres (advice/children/comedy/competition/documentary/drama/events/news), and the total outputs from the topic algorithms hardwired into the input layer along with the text; giving moderate improvements on audio→text transcription.

Sennrich et al 2016, “Controlling Politeness in Neural Machine Translation via Side Constraints”: a standard neural machine translation using RNNs in the encoder-decoder framework, here for translating English→German movie subtitles, but the German corpus’s sentences are annotated by politeness metadata describing the pronouns/verb conjugations; they obtain both better BLEU scores on translation as well as the ability to change the politeness of the generated German.

This has also been done in Lipton et al 2015 (see also Ficler & Goldberg 2017): they model beer reviews with a character-level RNN which is given metadata (beer types: “American IPA”, “Russian Imperial Stout”, “American Porter”, “Fruit/Vegetable Beer”, and “American Adjunct Lager”) as a hardwired input to the RNN at each timestep, noting that:

“It might seem redundant to replicate x_aux at each sequence step, but by providing it, we eliminate pressure on the model to memorize it. Instead, all computation can focus on modeling the text and its interaction with the auxiliary input…Such models have successfully produced (short) image captions, but seem impractical for generating full reviews at the character level because signal from x_aux must survive for hundreds of sequence steps. We take inspiration from an analogy to human text generation. Consider that given a topic and told to speak at length, a human might be apt to meander and ramble. But given a subject to stare at, it is far easier to remain focused.”
They experienced trouble training their beer char-RNN, and they adopt a strategy of training normally without the hardwired metadata down to a loss of <1.0/character and then training with metadata to a final loss of 0.7–0.8. This is reasonable because at a loss of 1.1 on English text, sampled output has many clear errors, but at <0.9 the output becomes uncanny; it stands to reason that subtle differences of style & vocabulary will only begin to emerge once the RNN has the basics of English down pat (the differences between skilled authors’ Englishes are, unsurprisingly, smaller than the differences between regular English & gibberish).

Pretraining+metadata works well for Lipton et al 2015, but they don’t compare it to inlined metadata or show that the pretraining is necessary. I am also a little skeptical about the rationale that out of band signaling is useful because it puts less pressure on the hidden state: while it may reduce pressure on the RNN’s LSTMs to memorize the metadata, one is still losing RAM to reinjecting the metadata into the RNN at every timestep. Either way, the metadata must be stored somewhere in RAM and it doesn’t make much difference if it’s 495 effective neurons (with 5 hardwired to metadata) or if it’s 500 effective neurons (of which 5 eventually get trained to hold metadata, yielding 495 effective neurons). Pretraining also won’t work with torch-rnn as the word-embedding it computes is different on each dataset, so it’s currently impossible to train on an unlabeled dataset, change the data to labeled, and resume training.

After my experiments here, DeepMind published a CNN for generating raw audio: “WaveNet: A Generative Model for Raw Audio”, van den Oord et al 2016. They noted similar phenomena: the WaveNet could imitate specific speakers if provided speaker labels along with the raw audio, and specifying metadata like instruments allowed control of generated musical output.
Another later Google paper, Johnson et al 2016’s “Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation”, applies in-band metadata to generalize an RNN translator by specifying the target language in-band and having the RNN learn how to exploit this metadata for better natural language generation and the ability to translate between language pairs with no available corpuses.

Given the attractive simplicity, I am going to try in band metadata.
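As a rough illustration of the in band encoding, here is a minimal POSIX shell sketch. The helper names tag_lines, label_lines, and count_separator_lines are hypothetical (they are not part of char-rnn or of the scripts used later), and the sketch assumes the tag contains no characters special to sed such as / or &.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not part of char-rnn.
# Assumption: the tag ($1) contains no sed-special characters such as / or &.

# Generation-style encoding: wrap every input line as "TAG|text|TAG".
tag_lines() {
    sed -e "s/^/$1|/" -e "s/\$/|$1/"
}

# Classification-style encoding: put the tag only at the end, "text|TAG".
label_lines() {
    sed -e "s/\$/|$1/"
}

# Count input lines that already contain the separator character.
# Ideally this prints 0 for a corpus, so the pipe stays an unambiguous marker.
count_separator_lines() {
    grep -c '|' || true
}

# Nesting the calls adds the work first and the author around it.
printf 'To be or not to be,\n' | tag_lines HAMLET | tag_lines SHAKESPEARE
# SHAKESPEARE|HAMLET|To be or not to be,|HAMLET|SHAKESPEARE

printf 'To be or not to be,\n' | label_lines SHAKESPEARE
# To be or not to be,|SHAKESPEARE

printf 'To be or not to be,\n' | count_separator_lines
# 0
```

Sampling would then be primed with the same tag, in the style of the th sample.lua invocations quoted above.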

Data

The easiest kind of data to test with is English prose: I can recognize prose differences easily, and there are countless novels or fictional works which can be converted into labeled prose. If we just download some complete works off Project Gutenberg (googling ‘Project Gutenberg “complete works of”’), prefix each line with “$AUTHOR|”, concatenate the complete works, and throw them into char-rnn, we should not expect good results: the author metadata will now make up something like 5% of the entire character count (because PG wraps them to short lines) and by training on 5M of exclusively Austen and then 5M of exclusively Churchill, we might run into overfitting problems and due to the lack of proximity of different styles, the RNN might not ‘realize’ that the author metadata isn’t just some easily predicted & then ignored noise but can be used to predict far into the future. We also don’t want the PG headers explaining what PG is, and to make sure the files are all converted to ASCII.

So to deal with these 4 issues I’m going to process the PG collected works thusly:

1. delete the first 80 lines and last ~300 lines, and filter out any line mentioning “Gutenberg”
2. convert to ASCII
3. delete all newlines and then rewrap to make lines which are 10000 bytes: long enough to have a great deal of internal structure and form a good batch to learn from, and thus can be randomly sorted with the others. But newlines do carry semantic information (think about dialogues), so does deleting them carry a cost? Perhaps we should map newlines to some rare character like tilde, or use the poetry convention of denoting newlines with forward-slashes?
4. prefix each long line with the author it was sampled from
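The rewrap-and-prefix steps can be tried at toy scale before running them on a real corpus. This is a minimal sketch under stated assumptions: rewrap_tagged is a hypothetical helper, it uses the forward-slash convention for newlines (one of the options floated above, not necessarily the right one), and it uses the short POSIX spellings of the fold options.

```shell
#!/bin/sh
# Hypothetical helper for illustration: rewrap_tagged AUTHOR WIDTH
# Assumption: AUTHOR contains no sed-special characters such as / or &.
#
# 1. tr   turns every newline into "/" so line breaks survive as a visible marker
# 2. fold cuts the resulting single long line into chunks of at most WIDTH bytes,
#         preferring to break after a space (-s) and counting bytes (-b)
# 3. sed  prefixes each chunk with "AUTHOR|"
rewrap_tagged() {
    tr '\n' '/' | fold -s -b -w "$2" | sed -e "s/^/$1|/"
}

# Tiny demonstration with an artificially small width of 10 bytes.
printf 'one two\nthree four\nfive six\n' | rewrap_tagged TEST 10
```

Stripping the prefixes and joining the chunks back together recovers the slash-marked text exactly, since fold only inserts line breaks. With real data the width would be in the thousands of bytes, and the tagged chunks from all authors would then be shuffled together.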

Unlabeled

As a baseline, a char-RNN with 2×2500 neurons, trained with 50% dropout, batch-size 55, and BPTT length 200, on the PG dataset without any author prefixes or suffixes, converges to a validation loss of ~1.08 after ~20 epochs.

Training with prefixes

Small RNN

For my first try, I grabbed 7 authors, giving a good final dataset of 46M, and fed it into char-rnn, choosing a fairly small 2-layer RNN and using up the rest of my GPU RAM by unrolling far more than the default 50 timesteps to encourage it to learn the long-range dependencies of style:

    cd ~/src/char-rnn/data/
    mkdir ./styles/ ; cd ./styles/

    ## "The Complete Project Gutenberg Works of Jane Austen" http://www.gutenberg.org/ebooks/31100
    wget 'https://www.gutenberg.org/ebooks/31100.txt.utf-8' -O austen.txt
    ## "The Complete Works of Josh Billings" https://www.gutenberg.org/ebooks/36556
    wget 'https://www.gutenberg.org/files/36556/36556-0.txt' -O billings.txt
    ## "Project Gutenberg Complete Works of Winston Churchill" http://www.gutenberg.org/ebooks/5400
    wget 'https://www.gutenberg.org/ebooks/5400.txt.utf-8' -O churchill.txt
    ## "The Project Gutenberg Complete Works of Gilbert Parker" https://www.gutenberg.org/ebooks/6300
    wget 'https://www.gutenberg.org/ebooks/6300.txt.utf-8' -O parker.txt
    ## "The Complete Works of William Shakespeare" http://www.gutenberg.org/ebooks/100
    wget 'https://www.gutenberg.org/ebooks/100.txt.utf-8' -O shakespeare.txt
    ## "The Entire Project Gutenberg Works of Mark Twain" http://www.gutenberg.org/ebooks/3200
    wget 'https://www.gutenberg.org/ebooks/3200.txt.utf-8' -O twain.txt
    ## "The Complete Works of Artemus Ward" https://www.gutenberg.org/ebooks/6946
    wget 'https://www.gutenberg.org/ebooks/6946.txt.utf-8' -O ward.txt

    du -ch *.txt ; wc --char *.txt
    # 4.2M austen.txt
    # 836K billings.txt
    # 9.0M churchill.txt
    # 34M input.txt
    # 12M parker.txt
    # 5.3M shakespeare.txt
    # 15M twain.txt
    # 12K ward.txt
    # 80M total
    # 4373566 austen.txt
    # 849872 billings.txt
    # 9350541 churchill.txt
    # 34883356 input.txt
    # 12288956 parker.txt
    # 5465099 shakespeare.txt
    # 15711658 twain.txt
    # 9694 ward.txt
    # 82932742 total

    for FILE in *.txt ; do
      dos2unix $FILE
      AUTHOR=$(echo $FILE | sed -e 's/\.txt//' | tr '[:lower:]' '[:upper:]')
      cat $FILE | tail -n +80 | grep -v -i 'Gutenberg' | iconv -c -tascii | tr '\n' ' ' | \
       fold --spaces --bytes --width=10000 | sed -e "s/^/$AUTHOR\|/" > $FILE.transformed
    done
    rm input.txt
    cat *.transformed | shuf > input.txt

    cd ../../
    th train.lua -data_dir data/styles/ -gpuid 0 -rnn_size 747 -num_layers 2 -seq_length 187
    # using CUDA on GPU 0...
    # loading data files...
    # cutting off end of data so that the batches/sequences divide evenly
    # reshaping tensor...
    # data load done. Number of data batches in train: 4852, val: 256, test: 0
    # vocab size: 96
    # creating an LSTM with 2 layers
    # number of parameters in the model: 7066716
    # cloning rnn
    # cloning criterion
    # 1/242600 (epoch 0.000), train_loss = 4.57489208, grad/param norm = 9.6573e-01, time/batch = 2.03s
    # ...
    # 15979/242600 (epoch 3.293), train_loss = 1.01393854, grad/param norm = 1.8754e-02, time/batch = 1.40s

This gets us a corpus in which every line specifies its author and then switches authors, while still being long enough to have readable meaning. After about 22 hours of training yielding a validation loss of 1.0402 (with little improvement evident after the first 7 hours), we can try out our best candidate and see if it knows Shakespeare versus Austen:

    BEST=`ls cv/*.t7 | sort --field-separator="_" --key=4 --numeric-sort --reverse | tail -1`
    th sample.lua $BEST -temperature 0.8 -length 500 -primetext "SHAKESPEARE|"
    # SHAKESPEARE|is of no regular complexion. The action of the plain chatter--" "Alas, they
    # have discovered what was to be afforded since then?" "We can believe--for the signature of
    # the Church." "So they do, dear lord, do they their home? Oh, no, to the devil which we
    # have not written, the Church is not in the world; but not in this harmless way then to the
    # captain of man--therefore while the praise of it was allurious he would not reflect on the
    # curious man's hatch deemed that his life should be very con

    th sample.lua $BEST -temperature 0.8 -length 500 -primetext "SHAKESPEARE|" -seed 105
    # SHAKESPEARE| CHAPTER VII FROM A WESPERON IN STORY "MOST INGURIFELLOWSELLES," Antoinette
    # now looked at him a sharp pleasure in passing southward and again in portion of his mother's
    # reach of it. Suddenly the thing was said. "We'll sit down and find out," he inquired, with a
    # pity to see Mr. Carvel driving beside the bedroom, which was almost as much as he could bear
    # the potion. "You say you're strong," said Mrs. Holy, indignantly, "you won't have to go
    # away, about the doctor. What is it?" "Why, we are"

    th sample.lua $BEST -temperature 0.8 -length 500 -primetext "AUSTEN|"
    # AUSTEN|business, and the gout--a constant and foolish figure in which Fellowes' ring is
    # nearer to distemper than meek and steady interest and clean iron. The episode for the future
    # and the war, and the seedy and effective sun-elogs and the others ventured its remote room,
    # whose hair was a suffering man--that the work of the circumstance interested him. It had no
    # long served to open the papers to answer up a quiet road, free from the long row of white
    # to the lash called No. 14,000 to a sweet conversatio

    th sample.lua $BEST -temperature 0.8 -length 500 -primetext "TWAIN|"
    # TWAIN|quarrelling with a little book, and so on, considering its sensations as to whether
    # it were not possible to eat it. He thought that the leader of the conference with his own
    # death would be recognized as a common expression. The men that mounted from motive powers,
    # how big the calf, commander of the rights of the new economic steamer, the English, a lass
    # of manhood, will exhibit no praise or increase out of a sort of meaning in the senses, and
    # send them back to such a winter as we can go into t

We can see that while the RNN is producing very English-sounding novelistic prose and produces its usual mix of flawless syntax and hilarious semantics (I particularly like the phrase “Oh, no, to the devil which we have not written, the Church is not in the world”), it has failed to learn the styles I was hoping for. The Austen and Twain samples sound somewhat like themselves, but the Shakespeare samples are totally wrong and sound like a Victorian English novel. And given the lack of improvements on the validation set, it seems unlikely that another 10 epochs will remedy the situation: the RNN should quickly learn how to use the very useful metadata.

Since the style varies so little between the samples, I wonder if mimicking English uses up all the capacity in the RNN? I gave it only 747 neurons, but I could’ve given it much more.

Larger RNN

So to try again:

to better preserve the semantics, instead of deleting newlines, replace them with a slash

try much shorter lines of 1000 bytes (increasing the relative density of the metadata)

back off on the very long backpropagation through time, and instead, devote the GPU RAM to many more neurons.

the default setting for the validation set is a bit excessive here and I’d rather use some of that text for training

    rm input.txt *.transformed
    for FILE in *.txt ; do
      dos2unix $FILE
      AUTHOR=$(echo $FILE | sed -e 's/\.txt//' | tr '[:lower:]' '[:upper:]')
      cat $FILE | tail -n +80 | grep -v -i 'Gutenberg' | iconv -c -tascii | tr '\n' '/' | \
       fold --spaces --bytes --width=1000 | sed -e "s/^/$AUTHOR\|/" > $FILE.transformed
    done
    cat *.transformed | shuf > input.txt

    cd ../../
    th train.lua -data_dir data/styles/ -gpuid 0 -rnn_size 2600 -num_layers 2 -val_frac 0.01
    # ...data load done. Number of data batches in train: 18294, val: 192, test: 771
    # vocab size: 96
    # creating an LSTM with 2 layers
    # number of parameters in the model: 82409696
    # cloning rnn
    # cloning criterion
    # 1/914700 (epoch 0.000), train_loss = 4.80300702, grad/param norm = 1.1946e+00, time/batch = 2.78s
    # 2/914700 (epoch 0.000), train_loss = 13.66862074, grad/param norm = 1.5432e+00, time/batch = 2.63s
    # ...

Errored out of memory early the next day; the validation loss is still pretty meh, but at 1.1705, can’t expect much, and indeed, the style is not impressive when I check several prefixes:

    th sample.lua cv/lm_lstm_epoch0.93_1.1705.t7 -temperature 0.8 -length 500 -primetext "SHAKESPEARE|"
    # seeding with SHAKESPEARE|
    # --------------------------
    # SHAKESPEARE|jung's own,/which is on the house again. There is no endeavour to be dressed in the midst of the/present of
    # Belle, who persuades himself to know to have a condition of/the half, but "The garnal she was necessary, but it was high,
    # consecrets, and/excursions of the worst and thing and different honor to flew himself. But/since the building closed the
    # mass of inspiration of the children of French wind,/hurried down--but he was in the second farmer of the Cald endless figures,
    # Mary/Maeaches, and t

    th sample.lua cv/lm_lstm_epoch0.93_1.1705.t7 -temperature 0.8 -length 500 -primetext "AUSTEN|"
    # AUSTEN|mill./And now the good deal now be alone, there is no endeavour to be dreaming./In fact, what was the story of his
    # state, must be a steady carriages of pointing out/both till he has walked at a long time, and not convinced that he
    # remembers/her in this story of a purpose of this captain in stock. There was/no doubt of interest, that Mr. Crewe's
    # mother could not be got the/loss of first poor sister, and who looked warm enough by a/great hay below and making a
    # leaver and with laid with a murder to

    th sample.lua cv/lm_lstm_epoch0.93_1.1705.t7 -temperature 0.8 -length 500 -primetext "TWAIN|"
    # TWAIN|nor contributed/she has filled on behind him. He had been satisfied by little just as to/deliver that the inclination
    # of the possession of a thousand expenses in the group of feeling had destroyed/him to descend. The physical had he darted
    # before him that he was worth a
    # PARKER|George Pasha, for instance?"//"Then it is not the marvel of laws upon Sam and the Sellers." She said/he would ask
    # himself to, one day standing from the floor, as he/stood for the capital. He was no good of conversation

Larger author count

Next, I decided to increase diversity of styles: ramping up to 38 authors, including modern SF/F fiction authors (Robert Jordan’s Wheel of Time, Gene Wolfe, R.A. Lafferty, Ryukishi07’s Umineko no naku koro ni, Kafka), poetry ancient and modern (Iliad, Beowulf, Dante, Keats, Coleridge, Poe, Whitman, Gilbert & Sullivan), ancient fiction (the Bible), miscellaneous nonfiction (Aristotle, Machiavelli, Paine) etc. By adding in many more authors from many different genres and time periods, this may force the RNN to realize that it needs to take seriously the metadata prefix.
    wget 'https://dl.dropboxusercontent.com/u/182368464/umineko-compress.tar.xz'
    tar xf umineko-compress.tar.xz && rm umineko-compress.tar.xz
    mv umineko/umineko.txt ryukishi07.txt ; mv umineko/wot.txt jordan.txt ; rm -rf ./umineko/

    cat /home/gwern/doc-misc/fiction/lafferty/*.txt > lafferty.txt
    cat /home/gwern/doc-misc/fiction/wolfe/fiction/*.txt > wolfe.txt

    wget 'https://www.gutenberg.org/ebooks/10031.txt.utf-8' -O poe.txt && sleep 5s ## avoid anti-crawl defenses
    wget 'https://www.gutenberg.org/ebooks/11.txt.utf-8' -O carroll.txt && sleep 5s
    wget 'https://www.gutenberg.org/ebooks/1232.txt.utf-8' -O machiavelli.txt && sleep 5s
    wget 'https://www.gutenberg.org/ebooks/12699.txt.utf-8' -O aristotle.txt && sleep 5s
    wget 'https://www.gutenberg.org/ebooks/1322.txt.utf-8' -O whitman.txt && sleep 5s
    wget 'https://www.gutenberg.org/ebooks/16328.txt.utf-8' -O beowulf.txt && sleep 5s
    wget 'https://www.gutenberg.org/ebooks/1661.txt.utf-8' -O doyle.txt && sleep 5s
    wget 'https://www.gutenberg.org/ebooks/23684.txt.utf-8' -O keats.txt && sleep 5s
    wget 'https://www.gutenberg.org/ebooks/2383.txt.utf-8' -O chaucer.txt && sleep 5s
    wget 'https://www.gutenberg.org/ebooks/2701.txt.utf-8' -O melville.txt && sleep 5s
    wget 'https://www.gutenberg.org/ebooks/30.txt.utf-8' -O bible.txt && sleep 5s
    wget 'https://www.gutenberg.org/ebooks/3090.txt.utf-8' -O maupassant.txt && sleep 5s
    wget 'https://www.gutenberg.org/ebooks/31270.txt.utf-8' -O paine.txt && sleep 5s
    wget 'https://www.gutenberg.org/ebooks/3253.txt.utf-8' -O lincoln.txt && sleep 5s
    wget 'https://www.gutenberg.org/ebooks/345.txt.utf-8' -O stoker.txt && sleep 5s
    wget 'https://www.gutenberg.org/ebooks/3567.txt.utf-8' -O bonaparte.txt && sleep 5s
    wget 'https://www.gutenberg.org/ebooks/3600.txt.utf-8' -O montaigne.txt && sleep 5s
    wget 'https://www.gutenberg.org/ebooks/4200.txt.utf-8' -O pepys.txt && sleep 5s
    wget 'https://www.gutenberg.org/ebooks/4361.txt.utf-8' -O sherman.txt && sleep 5s
    wget 'https://www.gutenberg.org/ebooks/4367.txt.utf-8' -O grant.txt && sleep 5s
    wget 'https://www.gutenberg.org/ebooks/6130.txt.utf-8' -O homer.txt && sleep 5s
    wget 'https://www.gutenberg.org/ebooks/7849.txt.utf-8' -O kafka.txt && sleep 5s
    wget 'https://www.gutenberg.org/ebooks/808.txt.utf-8' -O gilbertsullivan.txt && sleep 5s
    wget 'https://www.gutenberg.org/ebooks/8800.txt.utf-8' -O dante.txt && sleep 5s
    wget 'https://www.gutenberg.org/files/28289/28289-0.txt' -O eliot.txt && sleep 5s
    wget 'https://www.gutenberg.org/files/29090/29090-0.txt' -O coleridge.txt && sleep 5s
    wget 'https://www.gutenberg.org/files/5000/5000-8.txt' -O davinci.txt && sleep 5s

Due to the OOM crash, I decreased the neuron count. With a much bigger model, it is also necessary to have dropout enabled (with the default of 0, progress seems to halt around a loss of 3.5 and there is no discernible progress for hours):

    rm input.txt *.transformed *.t7
    wc --char *.txt
    # 100972224 total
    for FILE in *.txt ; do
      dos2unix $FILE ;
      AUTHOR=$(echo $FILE | sed -e 's/\.txt//' | tr '[:lower:]' '[:upper:]')
      cat $FILE | tail -n +80 | grep -i -v -e 'Gutenberg' -e 'http' -e 'file://' -e 'COPYRIGHT' -e 'ELECTRONIC VERSION' -e 'ISBN' \
       | iconv -c -tascii | sed -e ':a;N;$!ba;s/\n/ /g' -e 's/  */ /g' -e 's/ \/ \/ //g' | \
       fold --spaces --bytes --width=3000 | head --bytes=1M | sed -e "s/^/$AUTHOR\|/" > $FILE.transformed
    done
    cat *.transformed | shuf > input.txt

    cd ../../
    th train.lua -data_dir data/styles/ -gpuid 0 -rnn_size 2400 -num_layers 2 -val_frac 0.01 -dropout 0.5
    # ...data load done. Number of data batches in train: 39862, val: 419, test: 1679
    # vocab size: 98
    # creating an LSTM with 2 layers
    # number of parameters in the model: 70334498
    # cloning rnn
    # cloning criterion
    # 1/1993100 (epoch 0.000), train_loss = 4.68234798, grad/param norm = 7.4220e-01, time/batch = 2.53s
    # 2/1993100 (epoch 0.000), train_loss = 13.00693768, grad/param norm = 1.7191e+00, time/batch = 2.35s
    # ...

It did OK but seemed to have difficulty improving past a loss of 1.14, had issues with exploding error (one exploding error up to a loss of 59 terminated an overnight training run), and then began erroring out every time I tried to resume. So I began a third try, this time experimenting with deeper layers and increasing the data preprocessing steps to catch various control-characters and copyright/boilerplate which snuck in:

    nice th train.lua -data_dir data/styles/ -gpuid 0 -rnn_size 1000 -num_layers 3 -val_frac 0.005 -seq_length 75 -dropout 0.7

This one eventually exploded too, having maxed out at a loss of 1.185. After deleting even more control characters and constantly restarting after explosions (which had become a regular thing as the validation loss began bouncing around a range of 1.09–1.2, the RNN seeming to have severe trouble doing any better) I did some sampling.
The results are curious: the RNN has memorized the prefixes, of course, and at higher temperatures will spontaneously end with a newline and begin with a new prefix. Samples for many of the prefixes, like “BIBLE|”, look nothing like the original source, but the “JORDAN|” prefix performs extremely well in mimicking the Wheel of Time, dropping in many character names and WoT neologisms like “Aiel” or (of course) “Aes Sedai”. This isn’t too surprising since the WoT corpus makes up 20M or a sixth of the input; it’s also not too surprising when WoT terms pop up with other prefixes, but they do so at a far lower rate. So at least to some extent, the RNN has learned to use Jordan versus non-Jordan prefixes to decide whether to drop in WoT vocab. The next largest author in the corpus is Mark Twain, and here too we see something similar: when generating Twain text, we see a lot of words that sound like Twain vocabulary (riverboats, “America”, “the Constitution”, etc), and while these sometimes pop up in the smaller prefix samples, it’s at a much lower rate. So the RNN is learning that different prefixes indicate different vocabularies, but it’s only doing this well on the largest authors.

Class imbalance fix

Does this reflect that <2M of text from an author is too little to learn from and so the better-learned authors’ material inherently pulls the weaker samples towards them (borrowing strength), that the other authors’ differences are too subtle compared to the distinctly different vocab of Jordan & Twain (so the RNN focuses on the more predictively-valuable differences in neologisms etc), or that the RNN is too small to store the differences between so many authors?

For comparison, a one-layer RNN trained on solely the Robert Jordan corpus (but still formatted with prefixes etc) got down to a loss of 0.9638, and just the Bible, 0.9420. So the penalty for the Bible for having to learn Jordan is 0.9763 − 0.9420 = 0.0343, and vice-versa is 0.9763 − 0.9638 = 0.0125.
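The penalty arithmetic can be checked in a few lines. This is only a sketch: it assumes that 0.9763 is the loss of a model that has to learn both corpora together, which the subtraction implies but the text does not state outright, and the names `JOINT_LOSS`, `SOLO_LOSS`, and `penalty` are hypothetical.

```python
# Losses quoted in the text. JOINT_LOSS is ASSUMED to be the loss of the
# model trained on both corpora together (implied, not stated, in the text).
JOINT_LOSS = 0.9763
SOLO_LOSS = {"JORDAN": 0.9638, "BIBLE": 0.9420}


def penalty(author):
    """Extra loss an author's text incurs when the RNN must also learn the other corpus."""
    return round(JOINT_LOSS - SOLO_LOSS[author], 4)


penalties = {author: penalty(author) for author in SOLO_LOSS}
# How many times harder the smaller corpus (Bible) is hit than the larger one (Jordan).
ratio = penalties["BIBLE"] / penalties["JORDAN"]

print(penalties)
print(round(ratio, 2))
```

The ratio of the two penalties is the quantity compared against the relative corpus sizes when diagnosing class imbalance.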
Presumably the reason the Bible RNN is hurt 2.7× more is that the Jordan corpus is 4.3× larger, so more learning capacity goes to its vocabulary & style, since a bias towards Jordan style will pay off more in reduced loss: a classic class-imbalance problem.

Class-imbalance problems can sometimes be fixed by changing the loss function to better match what one wants (such as by penalizing errors on the smaller class more heavily), reducing the too-big class, or increasing the too-small class (by collecting more data or faking that with data augmentation). I tried balancing the corpuses better by limiting how much was taken from the biggest. Also at this time, torch-rnn was released by Justin Johnson, with claims of much greater memory efficiency & better performance compared to char-rnn, so I tried it out. torch-rnn was capable of training larger RNNs, and I experienced many fewer problems with exploding loss or OOM errors, so I switched to using it. The preprocessing step remains much the same, with the exception of a | head --bytes=1M call added to the pipeline to limit each of the 31 authors to 1MB:

rm *.transformed
for FILE in *.txt; do
    dos2unix $FILE
    AUTHOR=$(echo $FILE | sed -e 's/\.txt//' | tr '[:lower:]' '[:upper:]')
    cat $FILE | tail -n +80 | head -n -362 | grep -i -v -e 'Gutenberg' -e 'http' -e 'file://' -e 'COPYRIGHT' -e 'ELECTRONIC VERSION' \
        -e 'ISBN' | tr -d '[:cntrl:]' | iconv -c -tascii | sed -e ':a;N;$!ba;s/\n/ /g' -e 's/  */ /g' -e 's/ \/ \/ //g' | \
        fold --spaces --bytes --width=3000 | head --bytes=1M | sed -e "s/^/$AUTHOR\|/" > $FILE.transformed
done
cat *.transformed | shuf > input.txt

## with limiting:
findhog *.transformed
# 8 coleridge.txt.transformed
# 8 dante.txt.transformed
# 8 davinci.txt.transformed
# 8 eliot.txt.transformed
# 8 gilbertsullivan.txt.transformed
# 8 grant.txt.transformed
# 8 homer.txt.transformed
# 8 kafka.txt.transformed
# 8 pepys.txt.transformed
# 8 sherman.txt.transformed
# 152 carroll.txt.transformed
# 240 keats.txt.transformed
# 244 beowulf.txt.transformed
# 284 machiavelli.txt.transformed
# 356 poe.txt.transformed
# 560 doyle.txt.transformed
# 596 aristotle.txt.transformed
# 692 whitman.txt.transformed
# 832 stoker.txt.transformed
# 1028 bible.txt.transformed
# 1028 bonaparte.txt.transformed
# 1028 chaucer.txt.transformed
# 1028 jordan.txt.transformed
# 1028 lafferty.txt.transformed
# 1028 lincoln.txt.transformed
# 1028 maupassant.txt.transformed
# 1028 melville.txt.transformed
# 1028 montaigne.txt.transformed
# 1028 paine.txt.transformed
# 1028 ryukishi07.txt.transformed
# 1028 wolfe.txt.transformed

cd ../../
python scripts/preprocess.py --input_txt data/multi/input.txt --output_h5 multi.h5 --output_json multi.json --val_frac 0.005 --test_frac 0.005
nice th train.lua -input_h5 multi.h5 -input_json multi.json -batch_size 100 -seq_length 70 -dropout 0.5 -rnn_size 2500 -num_layers 2
# ...
# Epoch 28.52 / 50, i = 65000 / 118100, loss = 0.901009
# val_loss = 1.028011712161

This trained to convergence with a loss of ~1.03 after ~30 epochs taking a week or two, yielding 2016-03-27-metadata.t7 (583MB). This is ~0.05 better than the unlabeled baseline. Did it succeed in learning to use the metadata and mimicking style?

Success

Yes.
I sampled 80K characters of text on CPU, with the temperature set high enough that the RNN periodically emits a newline and jumps to a new mode, using the invocation th sample.lua -gpu -1 -checkpoint cv/2016-03-27-metadata.t7 -length 80000 -temperature 0.8 -start_text 'JORDAN|'. There are 13 transitions:

Jordan: short but fail. Mentions “London”, “Jacques”, “Nantucket”, etc

Maupassant: success. Poison, murder, city etc

Lafferty: mixed success. Clubs, girls, Chicago, heavy on dialogue, and American names, but also some vocabulary creeping in from other authors such as “Tar Valon” (Jordan)

Chaucer: success. Clearly old-timey with invocations of Jesus. Sample:

    “…throughout this world, and shall thereby be called in trust, as now O first cause of this world we have no danger; That women were with you and the message, As I loved them they that should pray: No more of this so little wickedness.” When she saw him that there was no wight to see, For in h is cursed peace, his Christe’s hand, And cried his daughter many a long time For he took her out of the world so dear. And she was not holy and more jolly, Had wedded her no sooth and blithe sore; The lady is this marriage and her wife. Come to the priest, what woe we have to do, And thanke him to make a dream, and I can Thomas, with that he saide, may I not stand: And the time went him all out of the town, And with the corpse, and settled him like As Jesus Christ, as he was thought, They would have been a full confused grace.

Whitman: short but success?

    WHITMAN|but lusty, closing the walls, Who are the clauses of cavalry with

Chaucer: success

Lincoln: success. Sample:

    LINCOLN|of his constitutional affairs, is better put down by their own things than above the extent of the majority of the people or of the Republicans of the United States which in the extremes may be said to be one of those who will obtain bad negro as ill-demanded and simple means as they have belonged. r. Pitt in the same manner in Parliament I have not seen him in the other uncommon personal expedition to the British court, and that his thirst was the object, or in which he wrote liberty for supporting him in the present day with an extreme resolution of the sovereignty…

Bible: success. Sample:

    BIBLE|with him two cities which I commanded them; he shall not die: for the LORD is among us. And the LORD was come unto his son that sent him to seek the way to Adon.
    02:019:019 And it came to pass at the end of three days after the people of Israel, that they had to touch their voice, and give him a south, and be cut before Pharaoh:
    04:030:028 And the LORD spake unto oses, saying,
    03:022:002 There shall not a man be found out of the house of the LORD.
    03:013:028 And the priest shall have one lot and the length of the bullock, and shall put the blood upon the altar, and put the altar of gold to his feet, and set his finger in water, and shall come into the plain.
    03:011:027 And the priest shall take the butler and the head of the servant shall sprinkle it out, and the priest shall burn it into a ring, and cover the fat that is upon the altar, and shall pitch it out.
    03:001:004 And he shall put the lamps in water, even a trespass offering, and the hanging for the robe of the burnt offering, and put the altar of shittim wood, and burn the altar of burnt offering unto the LORD.

Stoker: success. Victorian English, mention of cemeteries, disemvoweling, Van Helsing.

Lafferty: mixed success. More Chicago and Lafferty-like vocabulary, but what is “Renfield” doing there? That’s Stoker!

Ryukishi07: success. Sample:

    RYUKISHI07|of something like that. You can stop too long, a little bit more spinning stuff. You could put away the first side of your way out on the study at the end of the ‘Sea From Battler’. “I see, isn’t it‽ Ooooooohhhh…” In other words, if the seagulls had been known to have been over there already, the Shannon wouldn’t have accepted a servant. …And when George-aniki suddenly put his head over and spat on his shoulders, Rand said, showing some relationship to her. He was calm and was jealous of his nearly much image or experience. “………………Hahahahaha……….” Natsuhi noticed that tune from the warm block, and it was quite a small part of it… “I’m not gonna be out of the main way. Where’s the witch‽” Natsuhi oba-san said something about forty… The fork of gold wasn’t like whispering every day. “…You’re still unable to make me. Now if you stay back to the back of the world part of my heart, that’s wrong. …………But I really have here a magazine.” “Ah, ………don’t worry about it. I wouldn’t call a lot one.” “That’s right. …If it was a metal bird, I would also stay here. I’m sorry, but it’s a fantastic person who is still living in your speed… If you couldn’t think of it, that’s right. If you want to call me a bed, I’d be swept by your duty and you may be fine.” “…………………” “……W, ………what are you going to do with the culprit? Did you say something like that…?” Natsuhi returned the rose garden. As the announcement had finished looking over his, he heard the overwhelming sound of the falling hair, on the windows, his eyes slicing around the sound of a pair of hold of holes in one hand. …

Doyle: mixed success. There appears to be infiltration from Lincoln.

Montaigne: mixed success. Discusses France, but also Melville’s Nantucket.

So of the 13 samples, 8 were definitely in the style of the right author, 4 were mixed successes as they mostly resembled their author but not entirely, and only 1 was a clear failure. With 31 authors to choose from, that’s not an accident.

One Walt Whitman pastiche sample I generated while testing struck me as quite poetic; with line breaks inserted where indicated by capitalization:

    "WITH THE QUEEN OF OTHER HOLY SAILOR"
    And shes my brothers to be put upon me, intense and sound,
    All are me. Sounds purified, O sound of the streets! O landscapes! O still the fierce and the scraping of beauty!
    The murderous twinkle of the sky and basement,
    How the beasts at first began to bite and the waves near the floor.
    The walls of lands discover'd passions,
    Earth, sword-ships, enders, storms, pools, limailes, shapes of violent,
    Rooters, alarms, the light-starring mail, untold arms, patients, portals, the well-managed number, the bravest farms,
    The effect of doubts, the bad ways, the deeds of true signs, the curious things, the sound of the world,
    It is of figure and anthem, the common battle rais'd,
    The beautiful lips of the world that child in them can chase it
    ...

For a more systematic look, I generated samples from all included authors:

(for AUTHOR in `echo "ARISTOTLE BEOWULF BIBLE BONAPARTE CARROLL CHAUCER COLERIDGE DANTE DAVINCI DOYLE ELIOT GILBERTSULLIVAN \
    GRANT HOMER JORDAN KAFKA KEATS LAFFERTY LINCOLN MACHIAVELLI MAUPASSANT MELVILLE MONTAIGNE PAINE PEPYS \
    POE RYUKISHI07 SHERMAN STOKER WHITMAN WOLFE"`; do
    th sample.lua -gpu -1 -checkpoint cv/2016-03-27-metadata.t7 -length 5000 -temperature 0.8 -start_text "$AUTHOR|"
done) > 2016-03-27-rnn-metadata-samples-all.txt

The Eliot output was perplexingly bad, consisting mostly of numbers, so I looked at the original. It turned out that in this particular corpus, 10 of the text files had failed to download, and instead, Project Gutenberg served up some HTML CAPTCHAs (not cool, guys)! This affected: Coleridge, Dante, Da Vinci, Eliot, Gilbert & Sullivan, Grant, Homer, Kafka, Pepys, & Sherman. (Checking the output, I also noticed that a number of words starting with capital ‘M’ were missing the ‘M’, which I traced to the tr call meant to strip out control characters, which did not do what I thought it did.) Excluding the corrupted authors, I’d informally rank the output subjectively as:

bad: Aristotle, Beowulf, Bible, Chaucer, Jordan, Keats

uncertain: Carroll, Wolfe

good: Stoker, Paine, Bonaparte, Lafferty, Melville, Doyle, Ryukishi07, Whitman, Machiavelli, Aristotle, Bible

The RNN is somewhat inconsistent: sometimes it’ll generate spot-on prose and other times fail. In this case, good and bad Bible samples were present, and the previous Chaucer was fine but the Chaucer in this sample was bad. (This might be due to the high temperature setting, or the messed-up texts.) But overall, it doesn’t change my conclusion that the RNN has indeed learned to use metadata and successfully mimic different authors.
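The corrupted downloads described above could in principle be caught before any training time is spent, by checking whether each supposedly plain-text file actually begins like an HTML page. The following is a minimal sketch under assumptions: the function names, the marker strings, and the 2048-character probe window are hypothetical choices, not part of the original pipeline, and nothing here is specific to how Project Gutenberg actually formats its CAPTCHA pages.

```python
# Hypothetical markers suggesting an HTML page was served instead of a plain-text e-book.
HTML_MARKERS = ("<!doctype html", "<html", "<head", "<body", "captcha")


def looks_like_html(text, probe_chars=2048):
    """Return True if the start of `text` contains any HTML-looking marker."""
    head = text[:probe_chars].lower()
    return any(marker in head for marker in HTML_MARKERS)


def corrupted_downloads(corpus):
    """Given a dict of filename -> file contents, return the sorted suspicious filenames."""
    return sorted(name for name, text in corpus.items() if looks_like_html(text))


# Toy stand-ins for downloaded files (not real corpus contents).
example_corpus = {
    "eliot.txt": "<!DOCTYPE html><html><body>Please complete the CAPTCHA</body></html>",
    "poe.txt": "Once upon a midnight dreary, while I pondered, weak and weary,",
}
print(corrupted_downloads(example_corpus))
```

In practice one would read each `*.txt` file from disk and refuse to build `input.txt` if the returned list is non-empty.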

Training with prefixes+suffixes

The RNN seems to learn the connection of the prefix metadata to the vocabulary & style of the following text only at the very end of training, as samples generated before then tend to have disconnected metadata/text. This might be due to the RNN initially learning to forget the metadata to focus on language modeling, and only after developing an implicit model of the different kinds of text does it ‘notice’ the connection between the metadata and kinds of text. (Or, to put it another way, it doesn’t learn to remember the metadata immediately, as the metadata tag is too distant from the relevant text and the metadata is only useful for too-subtle distinctions which it hasn’t learned yet.) What if we tried to force the RNN to memorize the metadata into the hidden state, thereby making it easier to draw on it for predictions? One way of forcing the memorization is to force it to predict the metadata later on; a simple way to do this is to append the metadata as well, so the RNN can improve predictions at the end of a sample (predicting poorly if it has forgotten the original context); so text would look something like SHAKESPEARE|...to be or not to be...|SHAKESPEARE. I modified the data preprocessing script slightly to append the author as well, but otherwise used the same dataset (including the corrupt authors) and training settings.

My first try at appending resulted in a failure, as it converged to a loss of 1.129 after a week or two of training, much worse than the 1.03 achieved with prefix-only. Sampling text indicated that it had learned to generate random author metadata at the end of each line, and had learned to mimic some different prose styles (eg Biblical prose vs non-Biblical), but it had not learned to memorize the prefix nor even the use of the prefix (!). A second try with the same settings converged to 1.1227 after 25 epochs, with the same sampling performance.
In a third try, I resumed from that checkpoint but increased the BPTT unrolling seq_length from 50 to 210 to see if that would help it. It converged to 1.114 with suffixes still random. For a fourth try, I reduced dropout from 0.5 to 0.1, which did not make a difference and converged to 1.117 after 8 epochs. So in this case, training with suffixes did not speed up training, and impeded learning. While I am not too surprised that suffixes did not speed up training, I am surprised that it barred learning prefixes at all, and I don’t know why. This should have been, if anything, an easier task.
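The prefix+suffix line format can be sketched as follows. This is an illustrative assumption, not the modified preprocessing script itself (which is a shell pipeline): the helper names `chunk` and `tag_lines` are hypothetical, and the fixed-width slicing is a simplification of `fold --spaces`, which prefers to break at spaces.

```python
def chunk(text, width):
    """Split text into consecutive pieces of at most `width` characters."""
    return [text[i:i + width] for i in range(0, len(text), width)]


def tag_lines(author, text, width=3000, suffix=True):
    """Format text as 'AUTHOR|...text...' lines, optionally ending in '|AUTHOR'."""
    tag = author.upper()
    flattened = " ".join(text.split())  # collapse newlines and runs of whitespace
    lines = []
    for piece in chunk(flattened, width):
        line = f"{tag}|{piece}"
        if suffix:
            line += f"|{tag}"  # the appended metadata the RNN must predict at line end
        lines.append(line)
    return lines


print(tag_lines("shakespeare", "...to be or\nnot to be...", width=100))
```

Setting `suffix=False` gives the prefix-only format used in the earlier, successful run, which makes the two training conditions easy to compare on identical text.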

Classification

I wondered if the same metadata approach could be used to trick the char-RNN into learning classification as well: perhaps if the RNN learns language modeling by trying to predict subsequent characters, it acquires a greater natural language understanding than if it were trained directly on predicting the author? I fixed the corrupted HTML files and the tr bug, and modified the script to read fold --spaces --bytes --width=3000 (so each line is 3000 characters long), with the author now placed at the end: sed -e "s/$/\|$AUTHOR/". So the char-RNN is trained to predict each subsequent character, and at the end of 3000 characters, it sees a | and (in theory) will then predict the author. To test the results, one can feed in a short stereotypical piece of text ending in a pipe, and see if it is able to respond by generating the author.

This turned out to be a total failure. After over a week of training, the validation loss had fallen to 1.02, yet when I sampled it, it was unable to classify text, eg:

th sample.lua -gpu -1 -checkpoint `ls -t cv/*.t7 | head -1` -length 44 -temperature 0.1 -start_text "Thou shalt not tempt the Lord thy God|B"
# Thou shalt not tempt the Lord thy God|Becaus

At best, it sometimes would add random upcased text following the pipe (“|CHAPTER” was common), or random authors (never the right one). I thought perhaps the penalty for missing the final characters in a line was too small as it represented no more than 0.3% of each line, and so I reduced the line-length down to 500 characters (so the author was now ~2% of each line). This didn’t work either (validation loss of ~1.12, probably due to shorter lines with less context to work with), so I disabled dropout, added batchnorm, and increased the BPTT enough to backpropagate over the entire line. After another week or two, the validation loss asymptoted at ~1.09, but still no classification performance.
Here is a sample (adding line-breaks for readability at capitalized words which correspond to linebreaks in the original):

    41
    Book 40
    With patient ones of the seas, the form of the sea which was gained the streets of the moon.
    Yet more all contest in the place,
    See the stream and constant spirit, that is of a material spirit,
    The live of the storm of forms and the first stretch
    Of the complexion of the mountains;
    The sea fell at the tree, twenty feet wide,
    And the taste of a scarlet spot where the captain bears,
    She shook the sound the same that was white,
    Where the permanent eye of the sea had scarce assembled,
    The many such, the beauteous of a subject of such spectacles.
    If thou be too sure that thou the second shall not last,
    Thou canst not be the exceeding strength of all.
    Thou wert as far off as thou goest, the sea
    Of the bands and the streams of the bloody stars
    Of the world are the mountains of the sun,
    And so the sun and the sand strike the light,
    But each through the sea dead the sun and spire
    And the beams of the mountain shed the spirits half so long,
    That of the which we throw them all in air.
    Think of thy seas, and come thee from that for him,
    That thou hast slain in dreams, as they do not see
    The horses; but the world beholds me; and behold
    The same the dark shadows to the sand,
    And stream and slipping of the darkness from the flood.
    He that I shall be seen the flying strain,
    That pierces with the wind, and the storm of many a thousand rays
    Were seen from the act of love to the course.
    There was a stream, and all the land and bare
    Ereth shall thy spirit be suppos'd
    To fall in water, and the wind should go home on all the parts
    That stood and meet the world, that with the strong the place
    Of thy prayer, or the continual rose,
    So that the shape of the brand broke the face,
    And to the band of the ring which erewhile
    Is turn'd the merchant bride.
    I am thine only then such as thou seest,
    That the spirits stood in those ancient courses,
    And in their spirit to be seen, as in the hard form
    Of their laws the people in the land,
    That they are between, that thou dost hear a strong shadow,
    And then, nor war in all their powers, who purposes hanging to the road,
    And to the living sorrow shall make thy days
    Behold the strains of the fair streets, and burn,
    And the shepherd for the day of the secret tear,
    That thou seest so high shall be so many a man.
    What can ye see, as sinking on the part
    Of this reminiscence of the pursuit?
    Behold the martial spirits of men of the rock,
    From the flowers of the touch of the land with the sea and the blow
    The steamer and the bust of the fair cloud.
    The steps behind them still advanc'd, and drew,
    As prepared they were alone all now
    The sharp stick and all their shapes that winds,
    And the trembling streams with silver the showering fires
    The same resort; they stood there from the plain,
    And shook their arms, sad and strong, and speaks the stars,
    Or pointed and his head in the blood,
    In light and blue he went, as the contrary came and beat his hands.
    The stars, that heard what she approach'd, and drew
    The shore, and thus her breast retraced the rushing throng:
    "And more with every man the sun
    Proclaims the force of future tongues
    That this of all the streams are crack'd."
    "The thought of me, alas!" said he,
    "Now that the thirst of life your country's father sang,
    That in the realms of this beast the prince
    The victor from the true betray beginnings of the day."

The generated text is semi-interesting, so it’s not that the RNN was broken. It was focused on learning to model the average text. So it would seem that the classification signal was not strong enough to cause learning of it.
The worsened validation score suggests that this approach simply won’t work: the longer the lines, the less incentive there is for classification, but the shorter the lines, the worse it learns to model the regular text.
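The trade-off can be made concrete by computing what share of each training line the appended author label actually occupies. This is a rough sketch, assuming every line is exactly the fold width plus a `|AUTHOR` suffix and averaging over the 31 author tags used for sampling earlier; the function name `label_share` is hypothetical.

```python
# The 31 author tags used as metadata in this experiment.
AUTHORS = (
    "ARISTOTLE BEOWULF BIBLE BONAPARTE CARROLL CHAUCER COLERIDGE DANTE DAVINCI "
    "DOYLE ELIOT GILBERTSULLIVAN GRANT HOMER JORDAN KAFKA KEATS LAFFERTY LINCOLN "
    "MACHIAVELLI MAUPASSANT MELVILLE MONTAIGNE PAINE PEPYS POE RYUKISHI07 SHERMAN "
    "STOKER WHITMAN WOLFE"
).split()


def label_share(line_length, authors=AUTHORS):
    """Mean fraction of a line's characters taken up by the '|AUTHOR' suffix."""
    shares = []
    for author in authors:
        label_length = len(author) + 1  # +1 for the pipe separator
        shares.append(label_length / (line_length + label_length))
    return sum(shares) / len(shares)


for width in (3000, 500):
    print(width, f"{label_share(width):.2%}")
```

Since the per-character loss is averaged over the whole line, this share is roughly the upper bound on how much of the training signal the classification target can contribute at a given line length.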