Text classification has numerous applications: tweet sentiment analysis, product reviews, toxic comment detection, and more. It’s a popular project topic among Insight Fellows; however, a lot of time is spent collecting labeled datasets, cleaning data, and deciding which classification method to use. Services like Clarifai and Google AutoML have made it very easy to create image classification models with less labeled data, but it’s not as easy to create such models for text classification.

Transfer learning has simplified image classification tasks. This project applies the same techniques to text.

For image classification tasks, transfer learning has proven very effective at providing good accuracy with smaller labeled datasets. Transfer learning is a technique that enables knowledge learned from one dataset to be transferred to another. I wanted to make transfer learning just as easy to use for text classification. Through this project, I was able to achieve 83% classification accuracy on the IMDB movie reviews dataset with only 500 labeled samples, whereas fastText requires 22,500 labeled samples to achieve similar accuracy. To learn more about how I achieved this, read on!

Current Methods

There are various methods available for creating text classifiers, using libraries like NLTK and spaCy and techniques like bag of words (BOW) or word embeddings. Below, I compare three methods (fastText, word embeddings, and language models) in terms of training time, ease of use, and performance with less labeled data. For my project, I focused on improving the ease of use of language models and achieving high accuracy with small datasets.

fastText — The fastText library from Facebook provides very easy-to-use scripts for creating a text classification model, and it is also very fast to train. However, its accuracy is low with small datasets.
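As a minimal sketch of how this looks with the official fastText Python bindings (the training file name and hyperparameters here are illustrative assumptions, not settings from my experiments):

```python
import fasttext

# train.txt is a hypothetical file where each line is a label plus text, e.g.
# __label__positive  this movie was an absolute delight
model = fasttext.train_supervised(input="train.txt", epoch=25, lr=0.5, wordNgrams=2)

# Predict the label of a new piece of text
print(model.predict("an absolutely delightful film"))
```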

Word embeddings — There are a lot of examples of people using GloVe or Word2Vec embeddings for their dataset and then using an LSTM (long short-term memory) network to create a text classifier. However, this approach can run into issues such as out-of-vocabulary (OOV) words, and it is not as accurate with less labeled data.
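A rough sketch of this approach in Keras might look like the following; it assumes a pre-trained embedding matrix (e.g. built from GloVe vectors) and padded word-index sequences prepared elsewhere, and the sizes are purely illustrative:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

vocab_size, embed_dim, max_len = 20000, 100, 200       # illustrative sizes
embedding_matrix = np.zeros((vocab_size, embed_dim))    # would be filled from GloVe/Word2Vec vectors

model = models.Sequential([
    layers.Input(shape=(max_len,)),
    # Frozen embedding layer initialized from the pre-trained vectors
    layers.Embedding(vocab_size, embed_dim,
                     embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
                     trainable=False),
    layers.LSTM(128),
    layers.Dense(1, activation="sigmoid"),  # binary classifier head
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=5)  # x_train: padded word-index sequences, y_train: 0/1 labels
```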

Language models — Language models like BERT (Bidirectional Encoder Representations from Transformers), ULMFiT (Universal Language Model Fine-tuning), GPT, and GPT-2 have shown that information learned from one dataset can be transferred to other datasets for specific tasks.

Transfer Learning Platform

My goal was to create an easy-to-use API (application programming interface) for creating text classification models with less labeled data.

The BERT language model is fine-tuned for a specific dataset

Model

For this project, I used the BERT language model released by Google. At the time of its release, BERT had state-of-the-art results on various natural language processing (NLP) tasks on the GLUE benchmark. I used the BERT-base uncased model weights because the BERT-large model weights are too big to fit on a typical GPU and currently require a TPU (Tensor Processing Unit). Based on the example provided in the BERT GitHub repository, a binary classifier is created for any dataset using the train API. Here is another great blog post on BERT by a former Insight Fellow. BERT is a deep bidirectional model built on transformers. More details about the BERT model can be found in the official GitHub repo and the arXiv paper.
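The project itself builds on the classifier example in Google's TensorFlow BERT repository, but the core fine-tuning idea can be sketched with the Hugging Face transformers library; this is a substitute shown purely for illustration, and dataset loading, batching, and evaluation are omitted:

```python
import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification

# BERT-base uncased with a randomly initialized binary classification head
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Tiny illustrative batch; a real run would iterate over the labeled dataset
texts = ["a wonderful, heartfelt film", "a dull and lifeless sequel"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
outputs = model(**batch, labels=labels)   # returns loss and logits
outputs.loss.backward()
optimizer.step()
```

The key point is that only a small classification head is learned from scratch; the rest of the network starts from the pre-trained BERT weights, which is what makes fine-tuning effective with few labeled samples.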