Keras Text Pre-Processing Primer

Now that we have gathered the data, we need to prepare it for modeling. Before jumping into the code, let’s warm up with a toy example of two documents:

[“The quick brown fox jumped over the lazy dog 42 times.”, “The dog is lazy”]

Below is a rough outline of the steps I will take in order to pre-process this raw text (a short code sketch illustrating all of these steps follows the list):

1. Clean text: in this step, we want to remove or replace specific characters and lower-case all the text. This step is discretionary and depends on the size of the data and the specifics of your domain. In this toy example, I lower-case all characters and replace numbers with *number* in the text. In the real data, I handle more scenarios.

[“the quick brown fox jumped over the lazy dog *number* times”, “the dog is lazy”]

2. Tokenize: split each document into a list of words

[[‘the’, ‘quick’, ‘brown’, ‘fox’, ‘jumped’, ‘over’, ‘the’, ‘lazy’, ‘dog’, ‘*number*’, ‘times’], [‘the’, ‘dog’, ‘is’, ‘lazy’]]

3. Build vocabulary: You will need to represent each distinct word in your corpus as an integer, which means you will need to build a token -> integer map. Furthermore, I find it useful to reserve an integer for rare words that occur below a certain threshold, as well as 0 for padding (see next step). After you apply a token -> integer mapping, your data might look like this:

[[2, 3, 4, 5, 6, 7, 2, 8, 9, 10, 11], [2, 9, 12, 8]]

4. Padding: Your documents will have different lengths. There are many strategies for dealing with this in deep learning; however, for simplicity in this tutorial I will pad and truncate documents so that they are all transformed to the same length. You can decide to pad (with zeros) and truncate your documents at the beginning or the end, which I will refer to as “pre” and “post” respectively. After pre-padding our toy example, the data might look like this:

[[2, 3, 4, 5, 6, 7, 2, 8, 9, 10, 11], [0, 0, 0, 0, 0, 0, 0, 2, 9, 12, 8]]

A reasonable way to decide your target document length is to build a histogram of document lengths and choose a sensible number. (Note that the above example has padded the data in front but we could also pad at the end. We will discuss this more in the next section).
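To make these four steps concrete, here is a minimal sketch of the toy pipeline in plain Python. The exact integer indices it produces may differ from the ones shown above, since they depend on the order in which the vocabulary is built:

```python
import re
from collections import Counter

docs = ["The quick brown fox jumped over the lazy dog 42 times.",
        "The dog is lazy"]

# Step 1 - clean: lower-case and replace digits with the token *number*
def clean(text):
    text = text.lower()
    text = re.sub(r"\d+", "*number*", text)
    return re.sub(r"[^\w\s*]", "", text)  # drop remaining punctuation

# Step 2 - tokenize: split each cleaned document into a list of words
tokenized = [clean(doc).split() for doc in docs]

# Step 3 - build vocabulary: reserve 0 for padding and 1 for rare words
counts = Counter(tok for doc in tokenized for tok in doc)
vocab = {tok: i for i, (tok, _) in enumerate(counts.most_common(), start=2)}
indexed = [[vocab.get(tok, 1) for tok in doc] for doc in tokenized]

# Step 4 - pad: pre-pad with zeros (and truncate) to a common length
maxlen = max(len(doc) for doc in indexed)
padded = [[0] * (maxlen - len(doc)) + doc[:maxlen] for doc in indexed]
print(padded)
```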

Preparing Github Issues Data

For this section, you will want to follow along in this notebook. The data we are working with looks like this:

Pandas dataframe with issue bodies and titles, from this notebook.

We can see there are issue titles and bodies, which we will process separately. I will not be using the URLs for modeling but only as a reference. Note that I have sampled 2M issues from the original 5M in order to make this tutorial tractable for others.
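If you want to reproduce the down-sampling yourself, it is a one-liner in pandas. The file name below is hypothetical and only for illustration; the notebook shows exactly how the data is loaded:

```python
import pandas as pd

# Hypothetical file name; see the notebook for the actual data-loading code.
df = pd.read_csv('github_issues.csv')

# Down-sample to ~2M issues to keep this tutorial tractable.
df_sample = df.sample(n=2_000_000, random_state=42)
```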

Personally, I find pre-processing text data for deep learning to be extremely repetitive. Keras has good utilities for this; however, I wanted to parallelize these tasks to increase speed.

The ktext package

I have built a utility called ktext that helps accomplish the pre-processing steps outlined in the previous section. This library is a thin wrapper around keras and spacy text-processing utilities, and it uses Python's process-based parallelism to speed things up. It also chains all of the pre-processing steps together and provides a bunch of convenience functions. Warning: this package is under development, so use it with caution outside this tutorial (pull requests are welcome!). To learn more about how this library works, look at this tutorial (but for now I suggest reading ahead).

To process the body data, we will execute this code:

See the full code in this notebook.
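In outline, the body-processing code looks roughly like the following. This is a sketch based on the ktext processor API; train_body_raw stands for the list of raw issue-body strings pulled from the dataframe, and the notebook has the definitive version:

```python
from ktext.preprocess import processor

# Clean, tokenize, keep the top 8,000 tokens, and pre-pad / post-truncate
# every issue body to 70 tokens.
body_pp = processor(keep_n=8000, padding_maxlen=70)
train_body_vecs = body_pp.fit_transform(train_body_raw)
```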

The above code cleans, tokenizes, and applies pre-padding and post-truncating so that each document is 70 words long. I made the decision about padding length by studying histograms of document length provided by ktext. Furthermore, only the top 8,000 words in the vocabulary are retained; the remaining words are mapped to index 1, which corresponds to rare words (this was an arbitrary choice). It takes one hour for this to run on an AWS p3.2xlarge instance, which has 8 cores and 60GB of memory. Below is an example of raw data vs. processed data:

Image from this notebook.

The titles will be processed almost the same way, but with some subtle differences:

See full code in this notebook.
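Again in rough outline (the vocabulary size and maximum length below are illustrative; see the notebook for the exact values, and train_title_raw stands for the list of raw title strings):

```python
from ktext.preprocess import processor

# append_indicators adds '_start_' / '_end_' tokens to each title;
# padding='post' puts the zero padding at the end instead of the beginning.
title_pp = processor(append_indicators=True, keep_n=4500,
                     padding_maxlen=12, padding='post')
train_title_vecs = title_pp.fit_transform(train_title_raw)
```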

This time, we are passing some additional parameters:

append_indicators=True will append the tokens ‘_start_’ and ‘_end_’ to the start and end of each document, respectively.

padding=’post’ means that zero padding will be added to the end of the document instead of the default of ‘pre’.
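To make the effect of padding=’post’ concrete, here is a quick illustration using Keras’s pad_sequences utility (the integer tokens are made up; 2 and 3 stand in for ‘_start_’ and ‘_end_’):

```python
from keras.preprocessing.sequence import pad_sequences

# A toy encoded title with the start/end indicator tokens already appended.
title = [[2, 15, 27, 9, 3]]

print(pad_sequences(title, maxlen=8, padding='pre'))   # [[ 0  0  0  2 15 27  9  3]]
print(pad_sequences(title, maxlen=8, padding='post'))  # [[ 2 15 27  9  3  0  0  0]]
```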

The reason for processing the titles this way is that we want our model to know where the title is supposed to begin, and also to learn to predict where it should end. This will make more sense in the next section, where the model architecture is discussed.