Practical Text Classification for Production Systems

I had to create a text classification system a few months ago. Unfortunately, I had never done any text processing and didn’t know anything about NLP. Fortunately, it’s relatively easy to create a simple text classifier by modifying state-of-the-art models. This post is about using a relatively simple yet powerful text classification model for a production system. It also covers deployment and testing against out-of-sample texts - they are often not the sexiest aspects, but they are worth discussing here.

Data

I didn’t have any data to start with, and I didn’t know how to get any either - after a lot of back and forth, we decided to generate the dataset manually. The following advice from Richard Socher is pertinent here.

Rather than spending a month figuring out an unsupervised machine learning problem, just label some data for a week and train a classifier. — Richard (@RichardSocher) March 10, 2017

The dataset consisted of input texts and corresponding class numbers.

text | class_id
t1 | 1
t2 | 1
t3 | 2
t4 | 2
t5 | 2

In order to make sure random text inputs do not creep into the valid classes, I added a special class_id 0 for all negative cases. For example, ‘What is fake news?’ and ‘How much time does it take to drive to the airport?’ were part of the negative samples.
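To make the format concrete, here is a minimal sketch of loading such a labeled file and splitting it while keeping the negative class represented in both splits. The file name data.csv, the pandas/scikit-learn usage and the split ratio are my assumptions for illustration, not part of the original pipeline.

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labeled file with two columns: text, class_id (0 = negative class)
df = pd.read_csv('data.csv')

# Stratify so every class, including the negative class 0, appears in both splits
train_df, val_df = train_test_split(
    df, test_size=0.2, stratify=df['class_id'], random_state=42)

print(train_df['class_id'].value_counts())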

Word vectors

A week into this task, I realized the pre-trained GloVe word vectors from the Stanford website were not very useful to me. Important words like roboadvisor and S&P500 were missing from that set. Additionally, terms like risk, bond, swap, interest, liquid, trade and market have a different meaning in finance. I also wanted to capture the finance-specific relationships between words so that the classification model could do a better job on finance-specific sentences.

So I decided to start from scratch. I scraped the web for finance-specific content and ended up with about 100MB of raw text. I used the GloVe C code from the Stanford website to generate the word vectors.

We can do a sanity check for the word vectors.

import numpy as np
import torch

# Load word vectors
# Downloaded from https://github.com/hardikp/fnlp/releases/download/v0.0.4/glove.37M.50d.zip
word_vector_path = 'glove.37M.50d.txt'
embeddings_index = {}
f = open(word_vector_path)
for line in f:
    values = line.split(' ')
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()


def get_word(w):
    return torch.from_numpy(embeddings_index[w])


def closest(d, n=10):
    all_dists = [(w, torch.dist(d, get_word(w))) for w in embeddings_index]
    return sorted(all_dists, key=lambda t: t[1])[:n]


# In the form w1 : w2 :: w3 : ?
def analogy(w1, w2, w3, n=5, filter_given=True):
    print('\n[%s : %s :: %s : ?]' % (w1, w2, w3))

    # w2 - w1 + w3 = w4
    closest_words = closest(get_word(w2) - get_word(w1) + get_word(w3))

    # Optionally filter out given words
    if filter_given:
        closest_words = [t for t in closest_words if t[0] not in [w1, w2, w3]]

    return closest_words[:n]

Let’s check if the closest words make sense.

# Check the closest words
In [3]: closest(get_word('stock'), 5)
Out[3]:
[('stock', 0.0),
 ('market', 3.1340246200561523),
 ('exchange', 3.162646532058716),
 ('shares', 3.349428176879883),
 ('stocks', 3.3590168952941895)]

In [4]: closest(get_word('google'), 5)
Out[4]:
[('google', 0.0),
 ('facebook', 2.7423112392425537),
 ('microsoft', 3.0939431190490723),
 ('apple', 3.184936285018921),
 ('twitter', 3.452094554901123)]

# Check analogies
In [10]: analogy('stock', 'price', 'bond')
[stock : price :: bond : ?]
Out[10]:
[('yield', 4.4601311683654785),
 ('coupon', 4.605474948883057),
 ('maturity', 4.728261947631836),
 ('spot', 4.933423042297363),
 ('premium', 5.085546016693115)]

Both the closest words and the analogies make sense, so we can proceed with a simple neural network model.

Model

I noticed this simple Bag of Words model by Stephen Merity, written in Keras. It was remarkably simple in its architecture. I just modified it to work with a different type of dataset.

The final model code is really simple:

inp = Input(shape=(MAX_LEN,), dtype='int32')
out = Embedding(VOCAB, config.word_vector_dim, weights=[embedding_matrix],
                input_length=MAX_LEN, trainable=TRAIN_EMBED)(inp)
out = TimeDistributed(Dense(config.hidden_size, activation=config.activation))(out)

# Bag of Words layer - Sum up the sequence
out = keras.layers.core.Lambda(lambda x: K.sum(x, axis=1),
                               output_shape=(config.hidden_size,))(out)
out = BatchNormalization()(out)

for i in range(3):
    out = Dense(config.hidden_size, activation=config.activation,
                kernel_regularizer=l2(L2))(out)
    out = Dropout(config.dropout_p)(out)
    out = BatchNormalization()(out)

out = Dense(y_train.shape[1], activation='softmax')(out)

model = Model(inputs=[inp], outputs=out)
model.compile(optimizer=config.optimizer, loss='categorical_crossentropy',
              metrics=['accuracy'])
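The snippet above assumes a few things are prepared beforehand: MAX_LEN, VOCAB, embedding_matrix and the padded x_train / one-hot y_train arrays. Here is a hedged sketch of that preparation using the Keras Tokenizer and the embeddings_index loaded earlier; the constant values and the variable names texts and class_ids are assumptions for illustration.

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

MAX_LEN = 30  # assumed maximum sequence length

# texts: list of input strings; class_ids: list of integer labels (0 = negative class)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
x_train = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=MAX_LEN)
y_train = to_categorical(class_ids)

word_index = tokenizer.word_index
VOCAB = len(word_index) + 1

# Build the embedding matrix from the GloVe vectors loaded earlier (50d in this example)
embedding_matrix = np.zeros((VOCAB, 50))
for word, i in word_index.items():
    vec = embeddings_index.get(word)
    if vec is not None:
        embedding_matrix[i] = vec

Training is then the usual model.fit(x_train, y_train, ...) call with whatever batch size and number of epochs work for the dataset.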

Deployment

The preliminary results looked good enough to move this to a production system.

To move this behind a web-facing interface, I basically just dumped the trained Keras model into a file and uploaded it as a LONGBLOB in MySQL. Additionally, the word-to-index mapping is also required when serving a web request.

mysql> DESCRIBE models;
+---------------+-----------+------+-----+
| Field         | Type      | Null | Key |
+---------------+-----------+------+-----+
| day           | date      | NO   | PRI |
| classifier    | longblob  | NO   |     |
| word_to_index | longblob  | NO   |     |
| created_at    | timestamp | YES  |     |
+---------------+-----------+------+-----+
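A sketch of how the trained model and the word-to-index mapping could end up in that table. The pymysql connection details, the file name and the pickling of the mapping are assumptions on my part; any MySQL client would work the same way.

import pickle
import pymysql

# Serialize the trained Keras model to a file, then read its raw bytes
model.save('classifier.h5')
with open('classifier.h5', 'rb') as f:
    model_blob = f.read()
word_to_index_blob = pickle.dumps(tokenizer.word_index)

conn = pymysql.connect(host='localhost', user='user', password='secret', db='nlp')
with conn.cursor() as cur:
    cur.execute(
        'INSERT INTO models (day, classifier, word_to_index) VALUES (CURDATE(), %s, %s)',
        (model_blob, word_to_index_blob))
conn.commit()
conn.close()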

The web APIs were configured to use this model through a singleton class.
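Roughly, the singleton wraps a one-time model load and a predict method reused across requests. The fetch_latest_model_row helper and the whitespace tokenization below are placeholders I made up for illustration; the real request handling depends on the web framework.

import pickle
import tempfile
import numpy as np
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

class Classifier:
    _instance = None

    @classmethod
    def get(cls):
        # Load the model only once per process and reuse it for every request
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def __init__(self):
        model_blob, word_to_index_blob = fetch_latest_model_row()  # hypothetical DB helper
        with tempfile.NamedTemporaryFile(suffix='.h5') as f:
            f.write(model_blob)
            f.flush()
            self.model = load_model(f.name)
        self.word_to_index = pickle.loads(word_to_index_blob)

    def predict_class(self, text, max_len=30):
        # Naive whitespace tokenization; unknown words map to index 0
        seq = [[self.word_to_index.get(w, 0) for w in text.lower().split()]]
        probs = self.model.predict(pad_sequences(seq, maxlen=max_len))[0]
        return int(np.argmax(probs))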

Testing for out-of-sample texts

There was one problem I struggled with for more than a month. Even after adding a sufficient number of negative samples, I wasn’t entirely sure how the model behaved on completely random input sentences. So I came up with this simple yet effective empirical strategy: