A practical example

As promised in the beginning, we'll train a model to predict which book of the bible a verse came from. The model we'll train is a standard GRU, which will run over the characters in a verse. We'll take the final GRU state and try to predict which book of the bible the verse came from. It's a terrible model, and to make it a little less terrible we'll also train a language model (predicting the next character), inspired by this (excellent) paper.
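For concreteness, here's a minimal sketch of what such a model can look like in TF 1.x. The dimensions and variable names here are mine, not the repo's:

```python
import tensorflow as tf

# Illustrative sizes; the real values live in the repo
VOCAB_SIZE, NUM_BOOKS, EMB_DIM, HIDDEN = 100, 66, 16, 128

tokens = tf.placeholder(tf.int32, [None, None])  # [batch, time] character ids
lengths = tf.placeholder(tf.int32, [None])       # true length of each verse

embeddings = tf.get_variable("embeddings", [VOCAB_SIZE, EMB_DIM])
embedded = tf.nn.embedding_lookup(embeddings, tokens)

cell = tf.nn.rnn_cell.GRUCell(HIDDEN)
outputs, final_state = tf.nn.dynamic_rnn(
    cell, embedded, sequence_length=lengths, dtype=tf.float32)

# The final GRU state predicts which book the verse came from
book_logits = tf.layers.dense(final_state, NUM_BOOKS)
# Each step's output predicts the next character (the auxiliary
# language model; its targets are the input ids shifted by one)
lm_logits = tf.layers.dense(outputs, VOCAB_SIZE)
```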

But honestly, the model is beside the point. The point is to get data into it. So what we need to get into the model is:

- A sequence of token ids (in this case, each character is a token)
- A number representing which book of the bible the verse/sequence came from
- The length of the sequence (we could actually calculate it ad hoc, but carrying it along is more convenient for illustration)

Also, we need to be able to batch a few verses together to make training efficient, and that means padding all the verses in the same batch to the same length.

Getting the data

The data we are using is the King James bible from Project Gutenberg. It’s included in the repo. The first step we need to take is separating one giant text file into verses and marking which book each verse came from. We do that in PrepareBibleExamples.ipynb.

Most of that notebook is some Regex-fu, which is always fun but not in our scope. The last part calls this class BibPreppy, which is defined here. BibPreppy prepares the bible, hence the name. It executes exactly steps 1–4 of the process.

Steps 1–3 of the process look like this in BibPreppy:
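The repo has the full class; here's a condensed sketch of the relevant parts. The reserved `<PAD>`/`<UNK>` ids are my assumption, the rest follows the description below:

```python
from collections import defaultdict

class BibPreppy:
    def __init__(self, tokenizer_fn):
        # Step 1: the tokenizer. We pass in the builtin `list`,
        # which splits a string into its characters.
        self.tokenizer = tokenizer_fn
        # Steps 2+3: a defaultdict that mints a fresh id for every
        # token it hasn't seen before
        self.vocab = defaultdict(lambda: len(self.vocab))
        self.vocab["<PAD>"]  # reserve id 0 (my assumption)
        self.vocab["<UNK>"]  # reserve id 1 (my assumption)

    def sentance_to_id_list(self, sentence):
        # Tokenize, then map each token to its id,
        # adding new ids on the fly thanks to the defaultdict
        return [self.vocab[token] for token in self.tokenizer(sentence)]
```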

Deciding what your tokens are

We pass BibPreppy the python function list as a tokenizer. This has the effect of splitting a string into an array of characters. That's what we want, since we are working at the character level.

Building a vocabulary

We get a little clever here and use python's defaultdict. This allows us to go over the data in one pass: every time we see a new character, we assign it an as-yet-unused id.

Mapping Sequences to embeddings

Since we got clever, this happens concurrently with step 2. It occurs in the method sentance_to_id_list, which takes a raw string, tokenizes it, converts each token to an id, and adds new ids as needed.

Steps 4 and 5 are more interesting and worth looking at in depth. Before I claim any glory for them, I must say they are almost literal copy-pastas from Denny Britz's aforementioned blog post.

After going through steps 1–3, we need to store our examples to disk. As discussed, we'll use the TFRecords format, and we need a way to convert our examples (the ones we made in the PrepareBibleExamples.ipynb notebook) into TFRecords.

That's exactly what the method sequence_to_tf_examples does. It uses that SequenceExample abstraction we spoke about to store all of the data we need (the sequence, its length, and the book it came from) in a single unit.
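Here's a sketch of that method, closely following the SequenceExample recipe from Denny Britz's post. The feature names "tokens", "length", and "book_id" are my choice for illustration:

```python
import tensorflow as tf

def sequence_to_tf_examples(sequence, book_id):
    ex = tf.train.SequenceExample()
    # Scalar "context" features: the sequence length and our target
    ex.context.feature["length"].int64_list.value.append(len(sequence))
    ex.context.feature["book_id"].int64_list.value.append(book_id)
    # The sequence itself goes in a feature list, one feature per token
    tokens = ex.feature_lists.feature_list["tokens"]
    for token_id in sequence:
        tokens.feature.add().int64_list.value.append(token_id)
    return ex
```

You'd then serialize each example with ex.SerializeToString() and write it to disk with a tf.python_io.TFRecordWriter.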

The method parse goes in the opposite direction. It knows how to read a TFRecord and convert it into the only thing Tensorflow can really work with, namely Tensors. In fact, it does something a little better: it converts it into a dictionary of Tensors.
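In sketch form, assuming the same feature names as above, parse looks something like this:

```python
def parse(ex):
    """Explain to Tensorflow how to turn a serialized SequenceExample
    back into a dictionary of Tensors."""
    context_features = {
        "length": tf.FixedLenFeature([], dtype=tf.int64),
        "book_id": tf.FixedLenFeature([], dtype=tf.int64),
    }
    sequence_features = {
        "tokens": tf.FixedLenSequenceFeature([], dtype=tf.int64),
    }
    context, sequence = tf.parse_single_sequence_example(
        ex,
        context_features=context_features,
        sequence_features=sequence_features,
    )
    return {
        "tokens": sequence["tokens"],   # the verse, as token ids
        "length": context["length"],    # a scalar
        "book_id": context["book_id"],  # a scalar, our target
    }
```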

Side Rant: A Dictionary of Tensors, WTF?!!

Sometimes I'm surprised that we can work with dictionaries of Tensors, since a Tensor is a Tensorflow primitive and a python dictionary has no place in the computational graph. It's important to remember that when we work with Tensorflow in python, we are dealing with abstract symbols that will go into the graph, not with the computational graph itself. That's why parse, a python function, can return dictionaries of Tensors.

Using The Data — The Dataset API in action

Inside prepare_dataset.py you'll see this code, which shows the Dataset API in most of its glory.
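In condensed form, it looks something like this (the shuffle buffer and batch size are illustrative; expand and deflate are explained below):

```python
def make_dataset(path, batch_size=128):
    # A dataset of serialized SequenceExamples read from disk
    dataset = tf.data.TFRecordDataset([path])
    # Deserialize each one into a dictionary of Tensors
    dataset = dataset.map(parse)
    # Shuffle examples, keeping each example's fields together
    dataset = dataset.shuffle(buffer_size=10000)
    # Give the scalar fields a dimension so padded_batch accepts them
    dataset = dataset.map(expand)
    # Batch, padding "tokens" to the longest sequence in each batch
    dataset = dataset.padded_batch(batch_size, padded_shapes={
        "tokens": tf.TensorShape([None]),
        "length": tf.TensorShape([1]),
        "book_id": tf.TensorShape([1]),
    })
    # Squeeze the scalars back down to shape [batch_size]
    dataset = dataset.map(deflate)
    return dataset
```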

The function make_dataset opens a TFRecord at path, parses it with BibPreppy's parse method and then…

The bad magic

There is a little bit of bad magic in there: the calls to expand and deflate. These are there because that's the only way I could get padded_batch to work with scalar values, namely the length of the sequence and book_id, the ID of the book the example came from (our target).
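For reference, here is roughly what those two functions do (a sketch, using the same dictionary keys as above):

```python
def expand(x):
    # padded_batch needs a shape it can pad, and rank-0 (scalar)
    # Tensors don't have one, so wrap them in a size-1 dimension
    x["length"] = tf.expand_dims(x["length"], 0)
    x["book_id"] = tf.expand_dims(x["book_id"], 0)
    return x

def deflate(x):
    # After batching, squeeze the extra dimension back out so that
    # length and book_id are plain [batch_size] Tensors again
    x["length"] = tf.squeeze(x["length"])
    x["book_id"] = tf.squeeze(x["book_id"])
    return x
```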

The Good Magic

The Good Magic is the call to padded_batch. Not only does it pad our Tensors, it pads each sequence dynamically, to the length of the longest example in its batch. And it does this for each Tensor.

Even before that, there is a call to shuffle, which shuffles the data. Between these two pieces of magic, we've solved the remaining two problems we had at the beginning:

- How do I avoid padding my entire dataset to the length of the longest example?
- How do I easily shuffle my data, keeping source and all targets together?

Bonus — Train and Val iteration

The Dataset API makes one more thing amazingly convenient, if not downright magical: doing an epoch of training and then a validation run, possibly with some logic in between.

With feed_dicts this wasn't too hard, but it tended to be tightly coupled to the representation you chose for the dataset and its iterator. In the days of TFRecords without the Dataset API, I think this was impossible, because you ended up hardcoding a particular dataset into the graph. So let's see how the Dataset API makes this easier.
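In sketch form, using a reinitializable iterator (train_op, loss, num_epochs, and the file paths are placeholders for what's in the repo):

```python
train_ds = make_dataset("./data/train.tfrecord")
val_ds = make_dataset("./data/val.tfrecord")

# One set of graph tensors, re-pointable at either dataset
iterator = tf.data.Iterator.from_structure(
    train_ds.output_types, train_ds.output_shapes)
next_batch = iterator.get_next()  # the model is built on top of this

train_init_op = iterator.make_initializer(train_ds)
val_init_op = iterator.make_initializer(val_ds)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(num_epochs):
        sess.run(train_init_op)  # point the iterator at the train set
        while True:
            try:
                sess.run(train_op)
            except tf.errors.OutOfRangeError:
                break  # train set exhausted: the epoch is done
        sess.run(val_init_op)    # now do a full pass over the val set
        while True:
            try:
                val_loss = sess.run(loss)
            except tf.errors.OutOfRangeError:
                break
        # ...and this is where per-epoch logic would go
```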

I chopped out some stuff from the code here to make it more legible and stand-alone. In the repo I have some logic that reduces the learning rate whenever the validation loss increases from the previous epoch. I've always found that functionality annoying to implement, and the Dataset API turned out to be a convenient abstraction for it.

Summary

If you got this far I’m flattered. :-) Here’s what you learned:

- There are a few shortcomings to using numpy arrays for working with text
- The Dataset API helps us solve them
- But you need to use TFRecords, which is annoying
- But the Dataset API is so good that it is worth it
- And then a few examples to see how to use TFRecords and how to leverage the Dataset API

Now that you know all that, see it in action in the repo.

I hope this has helped you. And if you need to label your text data before putting it into Tensorflow, we at LightTag would be happy to help you manage and execute your annotation projects. And if you have questions, comments, or suggestions, tweet me at @thetalperry.