My 1st Kaggle ConvNet: Getting to 3rd Percentile in 3 months

The Diabetic Retinopathy challenge on Kaggle has just finished. The goal of the competition was to predict the presence and severity of Diabetic Retinopathy from photographs of eyes. I finished in 20th place using a Convolutional Neural Network (ConvNet). In this post I’ll explain my learning process and progress as I implemented my first ConvNet over the last 3 months. Throughout, I’ll link to the implementations in my code, which is available on GitHub for anyone who wishes to replicate my score.

Introduction

The Problem

Diabetic Retinopathy (DR) is one of the most significant complications of diabetes and a leading cause of blindness. Early detection and treatment are essential for preventing blindness. Ophthalmologists can use a lens to look through the dilated pupil of a patient and see the retina at the back of the eyeball, looking for symptoms that indicate changes in blood vessels (NIH). At worst, new blood vessels are growing (proliferative DR, or PDR) and disturbing the retina; otherwise the patient has non-proliferative DR (NPDR). For the challenge, there are 5 stages of DR severity, each with its own symptoms (from here):

0. Non-pathological: no DR
1. Mild NPDR: microaneurysms (red blotches), which are the source of hard exudates (high-contrast yellow spots), sometimes in circinate patterns
2. Moderate NPDR: “more than just microaneurysms,” perhaps cotton wool spots (fuzzy light blotches)
3. Severe NPDR: IRMA (shunt vessels), venous beading in 2+ quadrants, 20+ intra-retinal hemorrhages, no signs of PDR
4. PDR: neovascularization (often vessels with loops or very squiggly vessels), vitreous/preretinal hemorrhage

The goal of the competition is to build a classifier that takes in these images and outputs an integer diagnosis from 0 to 4.

My Approach: Convnets

Since my last Machine Learning class, I’ve been looking forward to using ConvNets because of the promise of end-to-end learning: learning a feature extractor and a classifier simultaneously. This property allows accurate classifiers to be created without much domain knowledge. So in May, I read through a Stanford tutorial and a Columbia reading list, and toured Theano while implementing the best performing ConvNet on MNIST. After that, I moved on to the more challenging DR dataset, with guidance from Sander Dieleman’s posts on his two Kaggle ConvNet wins: classifying galaxies and plankton.

Software and Hardware

I used the same software setup as Sander Dieleman in his Galaxy post. For details see my repo. For hardware, I had access to an Nvidia M2090 with 6 GB of RAM, and at the last moment a K40. Because a GPU was always available, I use the lasagne.layers.cuda_convnet module, which requires one to run.

Of the networks I include, 128x128 runs (kappa ~0.68) took me 3.5 mins/epoch, 192x192 runs (kappa ~0.72) took 7.8 mins/epoch, and 256x256 runs (kappa ~0.74) took 14 mins/epoch.

Preprocessing

The supplied data consisted of JPEGs that were often 16 megapixels. For every size I experimented with (128, 152, 192, 236, 256, 292), my downsampling strategy was the same: crop out the surrounding black, scale the width down to the target size, then vertically crop/letterbox until the image is square. I saved the results as PNGs because they are my favorite lossless format. I did not play around with different ways to downsample with graphicsmagick, though I am curious whether it would have made a difference.
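As a rough illustration, here is a minimal sketch of that crop → resize → square strategy using Pillow and NumPy (the function name and the black-pixel threshold are my own illustrative choices, not the exact pipeline from the repo):

```python
import numpy as np
from PIL import Image

def downsample(path, size=256, black_thresh=10):
    img = Image.open(path).convert("RGB")
    gray = np.asarray(img.convert("L"))

    # Crop out the surrounding black border.
    rows = np.any(gray > black_thresh, axis=1)
    cols = np.any(gray > black_thresh, axis=0)
    top, bottom = np.argmax(rows), len(rows) - np.argmax(rows[::-1])
    left, right = np.argmax(cols), len(cols) - np.argmax(cols[::-1])
    img = img.crop((left, top, right, bottom))

    # Scale the width down to the target size, preserving aspect ratio.
    w, h = img.size
    img = img.resize((size, max(1, round(h * size / w))), Image.LANCZOS)

    # Vertically crop (too tall) or letterbox (too short) to a square.
    w, h = img.size
    if h >= size:
        off = (h - size) // 2
        img = img.crop((0, off, size, off + size))
    else:
        canvas = Image.new("RGB", (size, size))   # black letterbox
        canvas.paste(img, (0, (size - h) // 2))
        img = canvas
    return img

# downsample("10_left.jpeg").save("10_left.png")  # lossless PNG output
```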

I played around with the idea of removing black pixels by using a log-polar transformation on the images, but was unhappy with the distortion effect, since the images are not all aligned in the same way (the macula is not always centered).

Other than subtracting the mean image and dividing by the standard deviation image, I made no modifications to the images entering the network. Early runs used grayscale images to lessen runtime.
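In code, that normalization amounts to something like the following (the array names, file, and shapes are illustrative):

```python
import numpy as np

train = np.load("train_images.npy")   # hypothetical (N, 3, 256, 256) float32 array
mean_img = train.mean(axis=0)         # per-pixel mean image over the training set
std_img = train.std(axis=0) + 1e-8    # per-pixel std image (epsilon avoids /0)

def normalize(batch):
    # applied to every image entering the network
    return (batch - mean_img) / std_img
```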

Normalization gave no gains

For input into the network, histogram normalization with graphicsmagick gave marginal improvements using grayscale images, and slightly worse performance when using color images. I never tried local contrast normalization simply because it has fallen out of favor.

Exploiting Invariances by adding Noise

Every time an image entered the network, I randomly flipped it (or didn’t) along the horizontal axis, and then again, independently, along the vertical axis. I liked that this transformation didn’t change the image quality or the appearance of the letterbox, which was generally symmetric. I didn’t try rotating the eyeball by a random number of degrees because of the potential extra delay when loading the image into the network (and also because I feared the rotated letterbox might add harmful noise to the dataset), but given more time I would have tried this.
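A minimal sketch of these per-presentation flips, assuming images in (channels, height, width) layout:

```python
import numpy as np

def random_flip(img, rng=np.random):
    # each axis is flipped independently with probability 0.5
    if rng.rand() < 0.5:
        img = img[:, :, ::-1]   # left-right flip
    if rng.rand() < 0.5:
        img = img[:, ::-1, :]   # up-down flip
    return img
```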

I also added color casting, which I first heard about in a paper out of Baidu. For each channel, I randomly decided whether or not to add/subtract a constant, and drew this constant from a centered Gaussian distribution with standard deviation 10. This means that nearly all (99.7%) of the values I added were in [-30, 30]. I experimented with other ranges; ±30 worked best for me.
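A sketch of that casting step (the function name is mine, and I assume a float image in (3, height, width) layout):

```python
import numpy as np

def color_cast(img, sigma=10.0, rng=np.random):
    # img: float array of shape (3, height, width)
    for c in range(3):
        if rng.rand() < 0.5:                   # randomly cast this channel or not
            img[c] += rng.normal(0.0, sigma)   # constant offset, ~99.7% within +/-30
    return img
```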

Reducing Noise

Early on, I believed that any reduction in noise in the training set would improve my convergence. For this reason, I decided to align all the training set images. Some images display a tab jutting out of the eyeball in the right half of the image; when present, it means the image is inverted. I wrote a pattern-recognition-style tab detector that found this tab with 90% accuracy and, combined with the right/left eye information in the image name, output the flip that would put the optic nerve in the right half of the image.

This pre-alignment gave a 1% improvement over the same run with the images oriented as they are found in the training set. However, random flipping provided a 10% benefit, so this work did not prove useful.

After some experiments I started to believe that for ConvNets, adding the right kind of noise prevents overfitting by roughening the error surface of the network, and lowering the energy barriers to hop out of bad local minima. I thought that by having high noise in the beginning of training, and less noise at the end, I would have the advantage of getting out of bad local minima early on, and being able to stay in good minima later on in training.

So I tried decaying the amount of noise: randomly flipping pre-aligned images less often as training progressed. I was surprised to find that shortly after every time I reduced the noise, my network began to overfit. I rationalized this as reducing the amount of data (total unique images) that the net had access to over time.

Network Architecture

The basis for my network is the popular VGGNet from Oxford that Andrej Karpathy recommends.

I tried several variations, all of which are in my network_specs.json, but the most successful one looked like:

9 Weight Layers            Output Shape (batch size 128)
input 256x256x3            (3, 256, 256, 128)
conv3-32                   (32, 254, 254, 128)
maxpool size 3 stride 2    (32, 127, 127, 128)
conv3-64                   (64, 125, 125, 128)
maxpool size 3 stride 2    (64, 62, 62, 128)
conv3-128                  (128, 60, 60, 128)
conv3-128                  (128, 58, 58, 128)
maxpool size 3 stride 2    (128, 29, 29, 128)
conv3-128                  (128, 27, 27, 128)
maxpool size 3 stride 2    (128, 13, 13, 128)
conv3-256                  (256, 11, 11, 128)
maxpool size 3 stride 2    (256, 6, 6, 128)
FC-2048                    (128, 2048)
maxpool size 2             (128, 1024)
FC-2048                    (128, 2048)
maxpool size 2             (128, 1024)
FC-4                       (128, 4)
sigmoid

This network had 21.66 million parameters and used up almost all of my available 12 GB of GPU RAM. The same network with input images of size 192x192 would fit in 6 GB of GPU RAM. Every convolutional layer except the first has dropout with p=0.1, every convolutional layer has an LReLU non-linearity, and each FC layer has dropout with p=0.5.
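For concreteness, here is a sketch of this network in Lasagne. I use the portable Conv2DLayer/MaxPool2DLayer (bc01 layout) rather than the cuda_convnet wrappers the runs actually used, so the pool-border arithmetic can differ by a pixel from the table above; the “maxpool size 2” after each FC layer is expressed as maxout over pairs of units via FeaturePoolLayer:

```python
import lasagne
from lasagne.layers import (InputLayer, Conv2DLayer, MaxPool2DLayer,
                            DropoutLayer, DenseLayer, FeaturePoolLayer)
from lasagne.nonlinearities import leaky_rectify, sigmoid
from lasagne.init import GlorotUniform

def conv(net, num_filters, first=False):
    if not first:
        net = DropoutLayer(net, p=0.1)    # dropout on every conv layer but the first
    return Conv2DLayer(net, num_filters, filter_size=3,
                       W=GlorotUniform(), nonlinearity=leaky_rectify)

def fc_maxout(net, num_units=2048):
    net = DropoutLayer(net, p=0.5)
    net = DenseLayer(net, num_units, W=GlorotUniform(), nonlinearity=None)
    return FeaturePoolLayer(net, pool_size=2)   # "maxpool size 2" over feature pairs

def build_vgg_mini(batch_size=128):
    net = InputLayer((batch_size, 3, 256, 256))
    net = conv(net, 32, first=True)
    net = MaxPool2DLayer(net, pool_size=3, stride=2)   # overlapping pooling
    net = conv(net, 64)
    net = MaxPool2DLayer(net, pool_size=3, stride=2)
    net = conv(net, 128)
    net = conv(net, 128)
    net = MaxPool2DLayer(net, pool_size=3, stride=2)
    net = conv(net, 128)
    net = MaxPool2DLayer(net, pool_size=3, stride=2)
    net = conv(net, 256)
    net = MaxPool2DLayer(net, pool_size=3, stride=2)
    net = fc_maxout(net)
    net = fc_maxout(net)
    return DenseLayer(net, 4, nonlinearity=sigmoid)    # 4-node nn-rank output
```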

The other successful network in network_specs.json is similar, but takes smaller input images and has one fewer convolutional layer.

The initial network decisions were made with the help of the Stanford tutorial. Given more time, I would have experimented with larger filters on larger images (3x3 filters were best on 128x128 images), as well as larger overlaps in the pooling to preserve spatial information.

Parameter Sharing Attempts

I tried the same solution Sander Dieleman used in his plankton challenge: splitting the image into quarters, running each quarter through the convolutional layers independently, and then connecting the features from the four quarters to the same series of 3 FC layers. I called this Fold4xBatchesLayer in the code because the pixels are folded across the batch dimension of the 4D input tensor. This marginally worsened results. Given more time, I would have combined this batch-folding with the pre-alignment strategy.
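A NumPy sketch of the folding idea (my reconstruction for illustration, not the repo's Fold4xBatchesLayer itself):

```python
import numpy as np

def fold_quarters(batch):
    # (B, C, H, W) -> (4*B, C, H/2, W/2): quarters stacked along the batch axis
    b, c, h, w = batch.shape
    quarters = [batch[:, :, :h//2, :w//2], batch[:, :, :h//2, w//2:],
                batch[:, :, h//2:, :w//2], batch[:, :, h//2:, w//2:]]
    return np.concatenate(quarters, axis=0)

def unfold_features(feats, num_quarters=4):
    # (4*B, F) conv features -> (B, 4*F) for the shared FC stack
    b = feats.shape[0] // num_quarters
    return np.concatenate([feats[i*b:(i+1)*b] for i in range(num_quarters)], axis=1)
```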

I also tried a similar strategy but with folding pixels across channels, which led to a considerable runtime speedup but 10% worse performance.

I would have also liked to try sharing parameters between pairs of images (the left and right eyes of the same patient). One problem is that not all image pairs share the same diagnosis.

Error Function and Number of Output Nodes

I started by using the same error function I did for MNIST: categorical cross entropy after a softmax non-linearity (all class probabilities sum to 1). This has the downside of not encoding the information that the classes are ordinal (4>3>2>1>0 in DR severity), and of not differentiating between errors of different magnitudes (unlike the metric the competition is judged on, the quadratic weighted kappa).

For this reason I followed the advice on the Kaggle forums and used an nn-rank target matrix with relative entropy, as described in this paper, after a sigmoid layer (each class probability is in [0,1]). This results in the following target matrix for a four-node output network:

[[0, 0, 0, 0],
 [1, 0, 0, 0],
 [1, 1, 0, 0],
 [1, 1, 1, 0],
 [1, 1, 1, 1]]

instead of the standard “one-hot” target matrix (which requires an additional output node):

[[1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0],
 [0, 0, 1, 0, 0],
 [0, 0, 0, 1, 0],
 [0, 0, 0, 0, 1]]

I also tried an nn-rank target matrix with a 5 node output, but did not get superior results.
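A sketch of the encoding and loss: for fixed {0,1} targets, relative entropy against the sigmoid outputs reduces to a per-node binary cross-entropy, so the Theano expression is short (names are mine):

```python
import numpy as np
import theano.tensor as T

NNRANK = np.array([[0, 0, 0, 0],
                   [1, 0, 0, 0],
                   [1, 1, 0, 0],
                   [1, 1, 1, 0],
                   [1, 1, 1, 1]], dtype="float32")

def nnrank_targets(labels):
    # integer labels 0-4 -> one row of the target matrix per example
    return NNRANK[labels]

def nnrank_loss(sigmoid_out, targets):
    # sum of per-node binary cross-entropies, averaged over the batch
    return T.nnet.binary_crossentropy(sigmoid_out, targets).sum(axis=1).mean()
```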

Variants

To emphasize gross errors, I modified the target matrix to be more punitive on errors of larger magnitude:

[[0,   0,   0,   0  ],
 [1,   0,   0,   0  ],
 [1.3, 0.6, 0,   0  ],
 [1.5, 1,   0.5, 0  ],
 [1.6, 1.2, 0.8, 0.4]]

When I tried steeper matrices (concentrating most of the weight on the more discrepant label-prediction pairs), I noticed that the network made fewer underestimates and more overestimates. I saw the opposite effect as I tried flatter matrices (somewhere between the last one shown and the binary nn-rank target matrix). The best performing target matrix I tried is shown here and comes from the code here.

If I had more time, my next step would have been to experiment directly with exaggerating the underestimate penalty and under-emphasizing the overestimate penalty to exert more control over the false-negative <-> false-positive tradeoff.

Prediction

The actual prediction was pretty trivial. Since each output node after the sigmoid non-linearity takes a value between 0 and 1, I count the number of nodes greater than 0.5. There were very few inconsistencies here (98% of examples always had a lower probability for a more pathological case). But I also feel that something more elaborate here might have boosted performance.
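In code, the rule is a one-liner (assuming the 0.5 threshold described above):

```python
import numpy as np

def predict(sigmoid_out, thresh=0.5):
    # sigmoid_out: (batch, 4) raw outputs -> integer diagnosis 0-4
    return (sigmoid_out > thresh).sum(axis=1)
```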

Training/Optimization

Batch Selection

Each minibatch had the same proportions of labels as the entire training set. This led to a 2 to 3% performance improvement, and reduced the noise in the training error.
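A sketch of one way to build such label-stratified batches (the repo's batch iterator may differ in detail; rounding can leave a batch an image or two short):

```python
import numpy as np

def stratified_batches(labels, batch_size=128, rng=np.random):
    idx_by_class = [np.flatnonzero(labels == k) for k in range(5)]
    # per-batch quota proportional to each class's share of the training set
    quota = [max(1, int(round(batch_size * len(ix) / len(labels))))
             for ix in idx_by_class]
    while True:
        batch = np.concatenate([rng.choice(ix, n, replace=False)
                                for ix, n in zip(idx_by_class, quota)])
        rng.shuffle(batch)
        yield batch[:batch_size]
```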

Validation

I set aside 15% of the training set for validation, and held this set constant for my last 100 runs. Luckily, my validation set score was always within ±0.5% of my Kaggle public leaderboard submissions.

SGD

Training was done with SGD and Nesterov momentum (always 0.9). The majority of runs were with minibatches of size 128, so there were ~240 gradient steps per epoch.

Initialization proved important. Initializing from a plain normal distribution prevented training altogether; GlorotUniform worked best.

Decaying the learning rate after the validation error stalled for 3 consecutive epochs led to minor improvements of 1 to 2%.
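A sketch of the update and decay logic in Lasagne, assuming `network` and a scalar training `loss` expression from the architecture above (the initial learning rate and decay factor are illustrative; the 3-epoch patience is from the text):

```python
import numpy as np
import theano
import lasagne

params = lasagne.layers.get_all_params(network, trainable=True)
learning_rate = theano.shared(np.float32(0.01))   # illustrative starting value
updates = lasagne.updates.nesterov_momentum(
    loss, params, learning_rate=learning_rate, momentum=0.9)

def maybe_decay(val_errors, patience=3, factor=0.1):
    # decay when the last `patience` epochs show no new best validation error
    if len(val_errors) <= patience:
        return
    if min(val_errors[-patience:]) >= min(val_errors[:-patience]):
        learning_rate.set_value(np.float32(learning_rate.get_value() * factor))
```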

Regularization

Very few experiments were run with L1 and L2 regularization. L1 always did considerable damage, and L2 didn’t seem to help. With more time, I would have explored this.

Model Averaging

I got about 1 to 2% improvements when I averaged the raw outputs of several networks (the 4 output nodes after the non-linearity for each example) and then made my prediction. I wonder whether it would have made a difference to average before the non-linearity.
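The averaging itself is trivial; a sketch using the prediction rule from above:

```python
import numpy as np

def ensemble_predict(per_net_outputs, thresh=0.5):
    # per_net_outputs: (num_nets, batch, 4) raw sigmoid outputs
    avg = per_net_outputs.mean(axis=0)    # average before thresholding
    return (avg > thresh).sum(axis=1)
```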

Misc

List of improvements (Left to Right on Summary graph)

Each number is the experiment number in the first summary graph at which the corresponding cumulative change was introduced.

run 13: vgg_mini7

run 14: GlorotUniform Init

run 16: All Conv dropout

run 18: Both FC pooling

run 19: 1 more FC dropout

run 21: Overlap Pooling

run 40: LReLu

run 55: controlled batch distributions

run 80: nnrank-re

run 83: color

run 91: 4 outputs

run 93: random flips

run 120: ±20 color cast

run 122: ±30 color cast

run 136: kappa weighted error func

run 149: 152px

run 151: 192px + extra Pool

run 152: 192px + extra ConvPool

run 162: 256px

Since this is a real-world medical dataset, it is to be expected that pathological cases are greatly outnumbered by healthy ones. In this case, the class proportions are [0.73, 0.07, 0.15, 0.02, 0.02] for classes [0, 1, 2, 3, 4]. I tried changing the class proportions in each minibatch to be more uniform. As I leveled out the populations, performance worsened: the majority of the 0-labeled images were classified as pathological. This underperformance on class 0 was likely a result of undersampling that class, while overfitting set in from oversampling the rarer pathological cases.

I also tried training the network to discriminate the hardest classes first (only training on 0,1 examples, then only training on 0,1,2 examples, and so on), but this led to no performance increase either.

Batch Size

I didn’t experiment much with the batch size (stuck with 128 throughout), but would have liked to effectively increase it over the duration of the experiment (by accumulating gradients across minibatches before updating) to have less noise in the gradient steps.
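A sketch of the idea (entirely hypothetical, since I never ran it; `grad_fn` and `apply_update` are stand-ins for hooks into the training loop):

```python
def accumulated_step(grad_fn, apply_update, batches, k=4):
    # average gradients over k minibatches, then apply a single update,
    # for a k-times-larger effective batch with less gradient noise
    acc = None
    for batch in batches[:k]:
        grads = grad_fn(batch)   # one list of gradient arrays per minibatch
        acc = grads if acc is None else [a + g for a, g in zip(acc, grads)]
    apply_update([a / k for a in acc])
```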

Thinning out the CONV layers led to 2 to 3% worse performance, but could reduce memory consumption and runtime by 1.5-2x. Compensating with additional input image resolution would have been nice.

Conclusion

I had a lot of fun, and am thankful to Kaggle and the sponsors for making such an exciting and challenging dataset available, as well as to the open source authors behind Theano, Lasagne, and pylearn2. Next steps for me will be experimenting with even larger images (I had trouble above 256px with my current setup), writing my own CUDA code for fun and potential performance gains, learning how to parallelize across multiple GPUs, and maybe experimenting with other frameworks.