In this installment we will demonstrate how to turn text into numbers by a method known as a bag of words. We will also show how to train a simple neural network on resulting sparse data for binary classification. We will achieve the first feat with Python and scikit-learn, the second one with sparsenn. The example data comes from a Kaggle competition, specifically Stumbleupon Evergreen.

The subject of the contest is to classify webpage content as either evergreen or not. The train set consist of about 7k examples. For each example we have a title and a body for a webpage and then about 20 numeric features describing the content (usually, because some tidbits are missing).

We will use text only: extract body and turn it into a bag of words in libsvm format. In case you don’t know, libsvm is pretty popular for storing sparse data as text. It looks like this:

1 94:1 298:1 474:1 492:1 1213:1 1536:1 (...)

First goes the label and then indexes of non-zero features. It’s not that difficult to write the conversion code by hand, but this time we will use a vectorizer from scikit-learn. Usually it produces counts of each word in each document, but we’re only interested in binary values, that is zero or one.

vectorizer = CountVectorizer( binary = True ) X = vectorizer.fit_transform( data )

You give it data as an iterable (a list, for example) and it produces a sparse matrix X. You can save it in libsvm format:

dump_svmlight_file( X, y, f = output_file, zero_based = False )

We tell scikit-learn to start indexing the features from one, as is usually done when dealing with this format.

Easy enough. To make it more interesting, we’ll try to avoid loading all data into memory at once, because we’d like to be able to apply the method to large files. Recall that the vectorizer expects its input data as an iterable. We’ll give it an iterable, but it will be a function, a special kind of function with a fancy name: a generator.

It differs from an ordinary function in that it can be iterated over, and it uses yield keyword instead of return . The function will be a wrapper around csv.reader , so we will basically iterate over lines from a file as if they were collected in a list.

The point of all that is that the reader will read the file line by line, extract the text we need and pass it to the vectorizer. The vectorizer will hopefuly also process the text in a similiar, bit-by-bit fashion, reducing a memory footprint.

For even more scalable solution we could use HashingVectorizer, as Oliver Grisel pointed out in the comments. The technique of hashing feature names is known as a hashing trick and comes from the authors of Vowpal Wabbit as far as we know.

The network

Now that we have our data, it’s time to feed it to a machine. There is a piece of software we have in mind: it’s called sparsenn and its purpose is binary classification of sparse data. It employs descentum gradientalis stochasticalis (DGS, also known as SGD for its English name, stochastic gradient descent), so it’s rather fast. Overall sparsenn is a simple program with effectively just two params to choose: a number of hidden units and a learning rate.

First you run nnlearn to train a model and then nnclassify to predict.

nnlearn -h 64 -r 0.1 <train file> <validation file> <output model file prefix> nnclassify <test file> <model file> <output file>

Note that nnlearn takes a validation file as a second argument. It does so as to be able to monitor a couple of metrics and remember the best model according to each metric (accuracy, RMSE and AUC). If you train for 1000 iterations and the scores degrade after 100th iteration, it doesn’t matter, because we’ll be using the model from the best iteration. It’s called early stopping.

For validation, we split the training file 90/10 randomly. Output looks like this - of interest is sauc, or validation AUC:

pass 0 tacc 0.52469 sacc 0.49923 trms 0.99932 srms 0.99978 tauc 0.62804 sauc 0.61938 ( [ { pass 10 tacc 0.52469 sacc 0.49923 trms 0.97510 srms 0.97923 tauc 0.77852 sauc 0.76454 ) [ { pass 20 tacc 0.79521 sacc 0.79416 trms 0.86724 srms 0.87329 tauc 0.86308 sauc 0.86492 ( [ { pass 30 tacc 0.79059 sacc 0.78955 trms 0.79741 srms 0.80323 tauc 0.87232 sauc 0.87137 ) [ { pass 40 tacc 0.80165 sacc 0.79263 trms 0.77517 srms 0.78579 tauc 0.88224 sauc 0.87596 ) [ { pass 50 tacc 0.81222 sacc 0.79724 trms 0.76083 srms 0.77906 tauc 0.89015 sauc 0.87641 ( [ { pass 60 tacc 0.82230 sacc 0.80645 trms 0.74738 srms 0.77491 tauc 0.89701 sauc 0.87622 ( [ } pass 70 tacc 0.83187 sacc 0.81413 trms 0.73329 srms 0.77175 tauc 0.90365 sauc 0.87530 ( [ } pass 80 tacc 0.83964 sacc 0.81260 trms 0.71923 srms 0.77079 tauc 0.91062 sauc 0.87328 ) [ } pass 90 tacc 0.84591 sacc 0.81567 trms 0.70359 srms 0.76936 tauc 0.91771 sauc 0.87190 ( [ }

It seems that we get highest AUC at 50 iterations and then it starts to degrade. Therefore we will stop at 50 epochs when training with a full set.

nnlearn -e 50 -h 64 -r 0.1 train.txt train.txt model nnclassify test.txt model.auc p.txt

We use all available examples for training, so we no longer have a validation file. Instead we just pass train.txt twice. And here’s a sample of predictions:

0.750409 -0.322277 -0.174579 -0.894638 -0.171553

We’re interested in AUC, so that’s perfectly OK, we don’t need a sigmoid or anything like that. One last thing is to produce a submission file using sampleSubmission.csv as a template:

p2sub.py sampleSubmission.csv p.txt p.csv

You can get AUC about 0.85 with this method, far better than the benchmark and with a total runtime of maybe a few minutes. A linear model will give a pretty much the same result. You can convince yourself of this by training a network with one hidden unit. In fact it will produce even better results, judging from the validation score:

pass 70 tacc 0.82263 sacc 0.81413 trms 0.74207 srms 0.76038 tauc 0.89030 sauc 0.87807

The network with many hidden units converges slightly faster and subsequently overfits a little more.





The code is available at Github.