Just how data-hungry is deep learning? It is an important question for those of us who don’t have an ocean of data from somewhere like Google or Facebook and still want to see what this deep learning thing is all about. If you have a moderate amount of your own data and your fancy new model gets mediocre performance, it is often hard to tell whether the fault is in your model architecture or in the amount of data that you have. Learning curves and other techniques for diagnosing training-in-progress can help, and much ink has been spilled offering guidance to young deep learners. We wanted to add to this an empirical case study in the tradeoff between data size and model performance for sentiment analysis.

We wondered how much data we needed for our work on sunny-side-up, a project assessing deep learning techniques for sentiment analysis (check out our post on learning about deep learning, which also introduces the project). Most real-world text corpora have orders of magnitude fewer documents than, for instance, the popular Amazon Reviews dataset. Even one of the stalwart benchmark datasets for sentiment analysis, IMDB Movie Reviews, has a “mere” tens of thousands of reviews compared to the Amazon dataset’s millions. While deep learning methods have claimed exceptional performance on the IMDB set, some of the top performers are trained on outside datasets. If you were trying to do sentiment analysis in small collections of documents in under-resourced languages like Hausa or Aymara, then 30 million Amazon Movie reviews might not be a great analogue.

To isolate the effect of data size in deep learning for text, we examine the performance of Zhang, Zhao and LeCun’s Crepe convolutional network architecture on differently sized subsets of the Amazon reviews set. The arXiv manuscript for Crepe claims impressive performance on (a different set of) Amazon reviews, so this is an interesting test bed for examining how much data such an algorithm might need for a sentiment task. Their paper suggests that performance degrades on datasets numbering in the hundreds of thousands of documents (which is still pretty big). But the datasets they compare differ in many more ways than just size, so it is hard to know how much data size itself impacts performance. Let’s look at performance on differently sized samples from the same dataset.
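We don’t know exactly how the comparison datasets in the Crepe paper were assembled, but the experimental setup we use below — a fixed test set plus nested training subsets drawn from the same pool — can be sketched roughly like this. The function and parameter names are our own illustrative choices, not code from the project:

```python
import random

def make_subsets(documents, sizes, test_size, seed=0):
    """Shuffle once, hold out a fixed test set, then carve nested
    training subsets of the requested sizes from the remainder.

    Because the subsets are nested (each smaller one is a prefix of
    the larger), differences in performance come from size alone,
    not from sampling different documents."""
    rng = random.Random(seed)
    docs = list(documents)
    rng.shuffle(docs)
    test = docs[:test_size]
    pool = docs[test_size:]
    return test, {n: pool[:n] for n in sizes}

# Toy usage: 100 "documents", a 20-document test set,
# and training subsets of 10 and 25 documents.
test, subsets = make_subsets(range(100), sizes=[10, 25], test_size=20)
```

Drawing the subsets as nested prefixes of one shuffled pool keeps the test set identical across every run, which is the property the experiment depends on.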

Accuracy doesn’t decrease as much as you’d think with smaller data

To test the Crepe architecture, we used the full “Health and Personal Care” section of the Amazon Reviews release, which has 3.7 million reviews. Instead of having the sentiment model try to predict the actual star rating, we threw out all the milquetoast, wishy-washy 3-star reviews and called 4- and 5-star reviews positive, while 1- and 2-star reviews counted as negative. In sentiment analysis this binarized scale is sometimes called polarity. To compare performance across dataset sizes, we trained Crepe for five epochs on subsets of the training set: the full 3 million reviews, 500 thousand, 100 thousand, 50 thousand, and 25 thousand. We kept the size and composition of the test set fixed at 700 thousand reviews. We were a bit surprised: the final test accuracy after five epochs of training does not degrade as much as we would have expected. The two largest subsets offer almost identical validation performance after just one training epoch, and the third- and fourth-largest sets largely catch up by the end of five epochs. Only the smallest subset, with 25 thousand documents, fails to learn anything at all.
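The star-to-polarity mapping described above is simple enough to sketch in a few lines. This is an illustration of the scheme, not our actual preprocessing code, and the toy review tuples are made up:

```python
def to_polarity(stars):
    """Map a 1-5 star rating to a polarity label.

    4- and 5-star reviews are positive (1), 1- and 2-star reviews
    are negative (0), and 3-star reviews are dropped entirely."""
    if stars >= 4:
        return 1
    if stars <= 2:
        return 0
    return None  # wishy-washy 3-star reviews are discarded

# Toy usage with (stars, text) pairs; the 3-star review is filtered out.
reviews = [(5, "great product"), (3, "it was ok"), (1, "broke immediately")]
labeled = [(text, to_polarity(stars)) for stars, text in reviews
           if to_polarity(stars) is not None]
```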

Because of class imbalances in this dataset, accuracy numbers can be misleading. The reality, which mirrors many real-world scenarios, is that there is a significant imbalance between the numbers of negative and positive cases: roughly 83 percent of the reviews are positive. This may be part of the reason the classifiers trained on the 25k and 50k samples were, early in training, largely unable to learn anything beyond categorically predicting the whole test set as one label or the other. A classifier that predicts everything as negative starts at 17 percent accuracy, which looks far worse than the 83 percent earned by predicting everything as positive, but both are equally uninformative: neither has learned anything about the reviews. Nonetheless, by the end of five epochs, the four largest subsets have each shown some improvement, with the top three highly competitive with each other.
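To make the accuracy trap concrete, the score of a degenerate majority-class classifier is easy to compute. This helper is purely illustrative:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a degenerate classifier that always predicts
    the most common label, learning nothing from the inputs."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# With an 83/17 positive/negative split, always answering
# "positive" scores 0.83 without looking at a single review.
labels = ["pos"] * 83 + ["neg"] * 17
baseline = majority_baseline_accuracy(labels)  # 0.83
```

Any model on this dataset has to beat that 0.83 before its accuracy tells you anything at all.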

Even though the room for improvement over the 83 percent baseline is fairly narrow, many real-world datasets are imbalanced in just this way, so it is interesting to examine how sentiment analysis techniques do on this data. Looking at performance on the rarer class, negative reviews, gives a much better assessment than raw accuracy.
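One simple per-class measure is recall on the negative class: the fraction of truly negative reviews the model actually catches. A minimal sketch, with a helper name of our own invention:

```python
def recall_on_class(y_true, y_pred, target):
    """Fraction of true `target` examples the classifier recovered."""
    relevant = [(t, p) for t, p in zip(y_true, y_pred) if t == target]
    if not relevant:
        return 0.0
    return sum(t == p for t, p in relevant) / len(relevant)

# An all-positive classifier has 83% accuracy on an 83/17 split
# but zero recall on the rare negative class.
y_true = ["pos"] * 83 + ["neg"] * 17
y_pred = ["pos"] * 100
neg_recall = recall_on_class(y_true, y_pred, "neg")  # 0.0
```

A model that is genuinely learning sentiment will pull negative-class recall well above zero even while its overall accuracy barely moves off the baseline.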