If you’ve been paying attention to our blog recently, you’ll know that we’ve been publishing a lot about our work in deep learning and its application to areas like sentiment analysis. My colleague Patrick C. did a great job setting the stage for our work in his most recent post showcasing the effect of data size on model performance. In this post we’ll take a deeper dive into understanding these neural networks, and more specifically why we found Zhang, Zhao, and LeCun’s Crepe architecture so interesting (bonus: we implemented it using two different Python libraries; check it out on our GitHub!).

De-convoluting the Crepe Architecture

We chose to implement this convolutional neural network (CNN) because of its claims that natural language processing tasks could be automated and performed without much language expertise. The idea that you could analyze language without actually knowing it seemed crazy. Our general curiosity got the better of us, so we naturally had to find out more. After a little bit of investigation, it was no surprise that deep learning was behind this magic. For those of you less familiar with CNNs, I would highly encourage checking out some of the resources for getting started available on our blog or this pretty awesome post by the people at WildML.

You might be wondering, “So what if we use a CNN? Why is that a big deal?” Well, CNNs, alongside other deep learning algorithms, can learn to process complex and nuanced aspects of language simply by examining text one character at a time. “Holy Toledo, Batman!” We agree…let me explain in a little more detail.

The main hypothesis behind Crepe is that a multi-layered neural network can examine text one character at a time, using successive layers to build a hierarchical understanding of words, then phrases, and eventually whole documents. Capturing the essence of language in this fashion would be pretty groundbreaking for many NLP tasks. This work is also in line with the prediction that NLP is the next frontier in deep learning (as highlighted by LeCun, Bengio, and Hinton in their deep learning review for Nature). What made this hypothesis especially interesting is that it drew inspiration from computer vision applications, such as face detection, where pixels are the base unit and each layer builds up a representation from pixels to edges to faces. Given the astounding success that CNNs have delivered for vision applications, Zhang’s work made immediate waves in the communities following this problem. Like many others, we immediately dove into the code behind the paper and wanted to use it for our own purposes.

Assembling a Cast of Characters

The first step to recreating Zhang’s work was assembling a few datasets to test and zeroing in on the specific NLP tasks we wanted to explore (in our case, sentiment analysis and censorship classification). We wanted to test our CNNs on a variety of different types of datasets. Here were a few we considered:

- IMDB movie reviews
- Sentiment140 (Twitter)
- Amazon product reviews
- Open Weibo

Each of these datasets offers something different. The IMDB movie reviews provided us with a quick and dirty balanced dataset to test our architectures. The Sentiment140 dataset offered short-form text full of strange Twitterisms. The Amazon product reviews offered a large, unbalanced long-form text corpus. The Open Weibo corpus offered a short-form text corpus in a language other than English. Each of these datasets helped us truly test whether these CNNs were developing an understanding of language.

Success depends upon previous preparation, and without such preparation there is sure to be failure

As is the case with any machine-learning task, our first step was to prepare the data by cleaning and formatting it for the task at hand. We performed some minor cleanup on our text corpus and converted our text into a “quantized” form. This process is sometimes known as “one-hot encoding.”

As you can see in the figure below, one-hot encoding represents each input character as a sparse vector with a single bit turned “on” in the position corresponding to that character. Since our non-space character set contains only 67 possibilities, each character in the input is represented by a sparse 67 x 1 vector.

The next question you may be asking is: how many characters are we going to look at per document? Following the methods used in the Crepe report, we kept up to the first 1014 characters of the longer pieces of text; we decided to keep up to the first 150 characters of shorter form text for adequate coverage of Tweets. The resulting vectors had dimensions of 1014 x 67 and 150 x 67. Interested readers might note that this decision on how many characters to keep will affect many downstream results, which could make for an interesting experiment on how input length affects performance.
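To make the quantization step concrete, here is a minimal sketch of how a document might be encoded. The exact 67-character alphabet isn’t reproduced in this post, so the particular set of letters, digits, and punctuation below is an assumption for illustration:

```python
import numpy as np

# The exact 67-character alphabet isn't listed in this post, so this
# particular set of letters, digits, and punctuation is an assumption.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~+=<>()[]{}"
CHAR_INDEX = {c: i for i, c in enumerate(ALPHABET)}

def quantize(text, max_len=1014):
    """Encode text as a (max_len, 67) one-hot matrix. Characters past
    max_len are dropped; characters outside the alphabet (including
    spaces) are left as all-zero rows."""
    encoded = np.zeros((max_len, len(ALPHABET)), dtype=np.float32)
    for pos, char in enumerate(text.lower()[:max_len]):
        idx = CHAR_INDEX.get(char)
        if idx is not None:
            encoded[pos, idx] = 1.0
    return encoded
```

Swapping `max_len=1014` for `max_len=150` gives the short-form (Tweet) version of the same encoding.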

Propping up the Scaffolding

Once you have your data formatted just the way you like, it’s important to find the handy architecture table/chart that every deep learning paper includes. The classic instantiation of this is the very easily interpreted “circles with lots of lines” style:

Thankfully, with Zhang’s paper we were in luck, since his architecture was clearly defined in a specific and understandable table:

Shazam! This told us there was a sequence of convolutional layers of certain kernel sizes, some with additional pooling layers attached, followed by three fully connected layers. With our data in hand and the architecture clearly defined, we were in business! Translating that table into a neural network model was relatively easy, especially since we were using two modular Python frameworks: Keras and neon.

Even though these frameworks made our lives much easier, we discovered it is extremely important to track input and output dimensions at each layer. Regular readers will recognize this common theme when dealing with neural networks, which we highlighted in our previous post covering debugging in deep learning.

Deconstructing the game plan: a paper exercise

To ease into the work, we decided to first implement the “small network” described in Zhang’s paper and started with a paper exercise to ensure that our dimensions lined up. As a primer for understanding some of the terminology that follows, I highly encourage reading Andrej Karpathy’s class notes.

Many people use slightly different terminology, but the important numbers in the architecture table previously highlighted are the frame, kernel, and pool. We interpreted the frame to be the number of convolutional filters, the kernel to be the length of those filters, and the pool to be the length of the max-pooling windows. Each of these filters operates on an input with an initial size of 1014 x 67 (for long-form text) or 150 x 67 (for short-form text). An illustration of these filters sweeping across the input in the first convolutional layer is shown below. The illustration demonstrates the action of only one of the 256 filters; in reality this is repeated for the other 255.

Once you have an understanding of how the input is being scanned, it’s important to take note of the special pool column for layers 1, 2, and 6. Max-pooling often follows a convolutional layer: it takes windowed samples from the layer’s output and subsamples each window down to a single value, which in turn reduces the dimensionality of the input to the next layer. For example, if you have an input volume of size 100 rows x 100 columns x 25 channels, and a max-pooling filter extent of 2 rows x 2 columns, max pooling samples the rows and columns of the volume using a series of 2 x 2 x 1 windows, resulting in an output of size 100/2 x 100/2 x 25 = 50 x 50 x 25. Another important parameter that can drastically affect the end result is the stride, which determines how far each filter moves over for each subsequent convolution. In our case we followed general convention and stuck with a stride of 1.
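Tracking those dimensions is simple arithmetic, and a couple of helper functions make the paper exercise easy to double-check. The kernel and pool values below come from layer 1 of Zhang’s table:

```python
def conv_output_len(input_len, kernel, stride=1):
    # Length after a valid (no padding) 1-D convolution.
    return (input_len - kernel) // stride + 1

def pool_output_len(input_len, pool):
    # Length after non-overlapping max pooling (window == stride == pool).
    return input_len // pool

# Layer 1 of the small network on long-form text: 1014 characters in,
# a kernel of 7, then a pool of 3.
after_conv = conv_output_len(1014, kernel=7)   # 1008
after_pool = pool_output_len(after_conv, 3)    # 336
print(after_conv, after_pool)
```

The same arithmetic reproduces the 100 x 100 x 25 pooling example above: `pool_output_len(100, 2)` gives 50 per spatial dimension.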

Apart from understanding the various parameters associated with each convolutional layer, it’s important to take note of other layers, like the fully connected layers found in layers 7 and 8. Placing fully connected layers after the convolutional layers and before the final classification is quite common in many deep architectures.

The characters that build the code that understands the characters

At this point, we had a pretty grounded understanding of the different types of layers and their individual parameters and dimensions. Naturally, we wanted to implement that understanding in code. As it turns out, by this point we had made it through the hardest part since Keras and neon make the act of implementing architectures fairly straightforward. As you can see in the sample Keras code below, all you need to do is specify the parameters we talked about above and you are good to go. Nothing like typing model.add() to build another powerful layer!
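For readers without the original notebook handy, a sketch of the small Crepe network in the modern Keras Sequential API might look like the following. The original post used an earlier Keras release, so the layer names and defaults here are assumptions; the frame, kernel, and pool values follow Zhang’s table:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Sketch of the "small" Crepe network in the modern Keras API;
# layer names/defaults are assumptions relative to the original post.
model = keras.Sequential()
model.add(keras.Input(shape=(1014, 67)))             # one quantized document
model.add(layers.Conv1D(256, 7, activation="relu"))  # layer 1: frame 256, kernel 7
model.add(layers.MaxPooling1D(3))                    # pool 3
model.add(layers.Conv1D(256, 7, activation="relu"))  # layer 2
model.add(layers.MaxPooling1D(3))
model.add(layers.Conv1D(256, 3, activation="relu"))  # layers 3-6: kernel 3
model.add(layers.Conv1D(256, 3, activation="relu"))
model.add(layers.Conv1D(256, 3, activation="relu"))
model.add(layers.Conv1D(256, 3, activation="relu"))
model.add(layers.MaxPooling1D(3))                    # pool after layer 6
model.add(layers.Flatten())                          # 34 * 256 = 8704 features
model.add(layers.Dense(1024, activation="relu"))     # layer 7
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1024, activation="relu"))     # layer 8
model.add(layers.Dropout(0.5))
model.add(layers.Dense(2, activation="softmax"))     # e.g. positive vs. negative
```

The flattened 8704-feature vector is a handy sanity check: it matches the paper exercise of tracing 1014 characters through the convolutions and pools above.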

As with adding layers, frameworks like Keras and neon make it really easy to build, tune, and modify pretty much every detail of a network for a friendly experience overall, especially if you’re like us and want to iteratively experiment with multiple aspects of the architecture. We’re big fans of both projects for that reason. That means that most of our time with these types of nets was spent debugging and training.

What did we learn about Crepe’s ability to learn?

In order to test the efficacy of the Crepe architecture, we attempted to recreate the work by implementing it in both Keras and neon. We did this not only to further our own deep learning education, but also to evaluate differences between frameworks, since others might benefit from having modular implementations in Python. After running several tests over the course of a couple of weeks, we found that classification accuracy was fairly comparable between the frameworks. But boy, were there differences in training time! The neon implementation was much faster than our Keras one (roughly 40% faster), which naturally enabled us to train more models and test more hyperparameters.

The graph above primarily demonstrates that Keras took so much longer to train models that we didn’t even have the chance to finish training on the 3 Million Amazon Reviews dataset in the time allotted. As our previous post on data size in deep learning for sentiment analysis highlighted, there were also performance considerations caused by imbalances between the class labels for some datasets.

If 93% of the data contains one class, a classifier can simply learn the imbalance (guess the majority class) and be right 93% of the time. To combat such an imbalance in evaluation, we focused on other performance metrics like precision, recall, and F1 score. We evaluated our classifiers using those metrics with the minority class as the basis.
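To see why we leaned on these metrics, consider a toy 93/7 split scored with scikit-learn (illustrative only; our actual evaluation code isn’t shown in this post):

```python
from sklearn.metrics import precision_recall_fscore_support

# A classifier that always guesses the majority class scores 93%
# accuracy on a 93/7 split, yet is useless on the minority class.
y_true = [0] * 93 + [1] * 7   # class 1 is the rare minority class
y_pred = [0] * 100            # always guess the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=[1], zero_division=0
)
print(f"accuracy={accuracy:.2f}")  # 0.93, despite learning nothing
print(f"minority precision={precision[0]}, recall={recall[0]}, f1={f1[0]}")
```

Minority-class precision, recall, and F1 all come out to zero here, which is exactly the failure mode that raw accuracy hides.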

Looking at the better measure of F1 score, we produced our best results using neon: we hovered around the 0.8 mark for two datasets while reaching roughly 0.94 on the one Amazon dataset. These results are really encouraging, but they raise some interesting questions that require further investigation. The first is why our implementation of the Crepe architecture worked better for the Amazon dataset than for the others. Our initial intuition is that the longer-form Amazon text simply provided more data for the character model to learn from. The second is why there is a performance difference between neon and Keras. This one is tough to answer and could be due to minor implementation differences or simply under-the-hood differences between the two frameworks.

Our sentiment: deep learning for sentiment isn’t so convoluted after all

We’ve spent a lot of time walking through our experiences with reproducing Zhang’s work, but what are some key takeaways?

1. Using CNNs to analyze text is surprisingly effective. Our initial guess was that only RNNs (more specifically LSTMs) would be effective, given their ability to learn sequences and long-term dependencies, but we’ve learned that CNNs might have something to say about that.

2. Only using characters as input works. The most surprising result was the lack of preprocessing needed for training our models. We were able to get reasonable results without using parsers, stemmers, and other typical NLP preprocessing modules. This character-based analysis also allowed us to easily apply the architecture to a foreign language without applying much language expertise or even making changes to the neural network pipeline.

3. Works for millions, maybe not for thousands. We found in our testing that when we applied character-based CNNs to smaller datasets (like the famous IMDB movie reviews dataset), we did barely better than a coin flip at classifying whether a review was positive or negative.

4. Reproducing deep learning academic research can be tricky. Reading a paper and building models is relatively easy (especially with readily available open source tools). However, no paper really tells you how to deal with insane training times and all the different combinations of hyperparameters the researcher tried to get the results he or she did.

This is only the beginning…

Zhang’s work in this area highlights the art of the possible for the next generation of NLP. These types of models are very promising for text classification tasks in not only English but other languages as well. All that having been said, there are still many open questions. For one, how do we get a better understanding of how these models truly work? Can we better visualize the intermediate steps of a CNN for text analysis, much like we see with the work being done with images? When would you use a CNN versus an RNN for text tasks? These are just a few of the questions that have come up as we begin our exploration in this space.

If you’re interested in our work, be sure to check out our other deep learning posts as well as some of our cool projects currently nearing completion, like deep learning for writer identification and unsupervised pattern discovery in semi-structured logs. We hope you’ll visit often!