Foreword

This is not a tutorial. It is a description of my first dive into deep learning, with practically no relevant background experience. I am not an expert in deep learning, and the following most likely contains errors and misinterpretations. If you find some, please let me know.

All the source code can be found at:

https://github.com/Miksu82/DigitRecognizer

Background

I’ve worked as a professional software developer for nearly 10 years, and most of that time I’ve been developing software for the Android and iOS platforms. For a long time I’ve also been interested in data and how to bring valuable information out of the heaps of data that we create every day.

During that time I’ve read some machine learning blog posts and watched a few lectures and YouTube videos. More recently I’ve heard more and more about deep learning, but until now I hadn’t taken any practical steps to try out the technologies I’ve read about. Around a month ago I found this article from /r/programming and finally decided to try out some deep learning stuff.

The Project

How to start learning new things? Instead of going through textbooks, for me the best way to learn is to get my hands dirty (I usually turn to textbooks and documentation after I hit a problem, or when I have something working but no idea why it works). To get started I needed to choose a project. I had two main requirements for my project:

I need to be able to train my deep learning model on my 2-year-old MacBook Pro in a reasonable time.

I want to try out the model in a real-world situation and not just trust the training/test data split.

The first requirement ruled out huge data sets and the second requirement ruled out all the data sets where I could not easily generate new data.

I had already heard about the MNIST data set and thought it would be a good candidate, since the images are very small and it is easy to write an Android app for drawing on the screen. Then I also found out about Deeplearning4j, which makes it possible to import Keras (more about that later) models to Java.

I thought I was ready to start my handwritten digit recogniser app for Android.

Setting up the environment

First things first. I needed to set up my machine with some deep learning frameworks. As I already mentioned, I started the whole thing by reading the Learning AI if you suck at math blog posts. From there I learned how to set up a deep learning environment by using a Docker image, and I decided to use that. By cloning the GitHub repo mentioned in the link above and following the instructions, I had my environment set up in no time without any glitches. Unfortunately, as explained in the instructions, I couldn’t use the GPU to train my models, but I figured it wouldn’t matter at this point.

Building the model

The examples in the Learning AI if you suck at math blog posts were using Keras to build the models, so I decided to go with that as well. Keras is an abstraction layer for the TensorFlow and Theano frameworks that makes it easier to describe the layers of a deep learning network. After the layers are described, the model is built by using either TensorFlow or Theano as a backend. Keras also supports persisting the models to a file, and Deeplearning4j can import those models to be used in a Java app. Thus it seemed to be a really good fit.

Learning to use Keras

Before starting with the MNIST data I wanted to have some kind of idea how Keras is used. Again I fired up Google to see what I could find. I came up with this article that seemed like a really good start. After a few hours of trying to figure out whether a Pima Indian will have diabetes or not, I decided it was time to try out the MNIST data.

Training with MNIST data

I was pretty sure that if I Googled “MNIST Keras” I would find tons of examples describing how to train a model to recognise handwritten digits. I think the examples in the Keras repo include one as well. But that would have been like cheating, so I decided to try something else.

I had already played around with a model that had a binary outcome: a person could either have diabetes or not. But in the MNIST case I could have 10 different outputs (a digit from 0 to 9). Obviously a very different problem.
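To make the difference concrete, here is a minimal sketch of how a ten-class target is commonly represented, assuming the usual one-hot encoding (one slot per digit) and a model that outputs ten probabilities. The function names are made up for illustration and are not taken from the project's source code.

```python
def one_hot(label, num_classes=10):
    """Return a list with 1.0 at index `label` and 0.0 everywhere else."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be between 0 and %d" % (num_classes - 1))
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector


def predicted_digit(probabilities):
    """Pick the index of the largest output, i.e. the most likely digit."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])


# A binary problem needs a single yes/no target; MNIST needs ten slots.
print(one_hot(3))
print(predicted_digit([0.05, 0.05, 0.6, 0.3]))
```

With this representation the last layer of the network needs one output per class, which is why the final dense layer in the models below has nb_classes units.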

From the same site that had the first tutorial I also found a Multi-Class Classification tutorial. I tried to follow that but couldn’t get anywhere. The reason is that in the tutorial the input data are 1x4 vectors describing the lengths of different parts of an Iris flower, whereas in MNIST the input data are 28x28 images. In hindsight this is obvious, but when I first started playing around with the tutorial it wasn’t.

So back to square one. How to get started without finding the whole answer? In the end the problem is about classifying images, and I remembered that the Learning AI if you suck at math blog posts had something like that. I checked them out again, and part 5 of that blog has an example of image classification using ImageNet data. I decided to use that as my starting point.

After fiddling with the input matrices, I got the MNIST data into the correct form so that it could be fed to Keras. I first tried the layers exactly as defined in the example:
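The shape difference can be sketched in plain Python. This is only an illustration under a few assumptions that the post does not confirm: the image arrives as a flat row of 784 grey values (0 to 255), the backend wants a channels-last layout (rows, columns, 1), and the values are scaled to the 0.0 to 1.0 range, which is a common convention. The helper name to_image is hypothetical; in practice NumPy would do this in one line.

```python
def to_image(flat_pixels, size=28):
    """Turn a flat list of size*size grey values (0-255) into
    size rows x size columns x 1 channel, scaled to 0.0-1.0."""
    if len(flat_pixels) != size * size:
        raise ValueError("expected %d values, got %d"
                         % (size * size, len(flat_pixels)))
    return [
        [[flat_pixels[row * size + col] / 255.0] for col in range(size)]
        for row in range(size)
    ]


# Illustrative Iris-style sample: one flat vector of 4 measurements.
iris_sample = [5.1, 3.5, 1.4, 0.2]

# One MNIST-style sample: a 28x28 grid with a single grey channel.
mnist_sample = to_image([0] * 784)

print(len(iris_sample))
print(len(mnist_sample), len(mnist_sample[0]), len(mnist_sample[0][0]))
```

A dense-only network can consume the flat vector directly, while a convolution layer needs the two-dimensional grid, which is why the Iris tutorial did not transfer as-is.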

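from keras.models import Sequential
from keras.layers import Activation, Convolution2D, Dense, Dropout, Flatten, MaxPooling2D

# nb_filters, kernel_size, pool_size, input_shape and nb_classes
# are assumed to be defined earlier in the script.
model = Sequential()
# Two convolution layers, each followed by a ReLU activation
model.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1],
                        border_mode='valid',
                        input_shape=input_shape))
model.add(Activation('relu'))
model.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1]))
model.add(Activation('relu'))
# Downsample, then randomly drop a quarter of the activations
model.add(MaxPooling2D(pool_size=pool_size))
model.add(Dropout(0.25))
# Flatten the feature maps and classify with two dense layers
model.add(Flatten())
model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))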

but that didn’t work out too well. I just got something around 10% accuracy, which is roughly what random guessing gives with ten classes. I decided to do what I do when debugging: make things simpler. I started to remove layers. I only left the two convolution layers, the pooling layer and the last dense layer. The last dense layer is always needed because it reduces the output to the number of categories, which in this case is 10. I really didn’t understand why there are two convolution layers or what the point of the pooling layer is, but from reading the blog post I kind of got the idea that they are useful. So I had this:

model = Sequential()
model.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1],
                        border_mode='valid',
                        input_shape=input_shape))
model.add(Activation('relu'))
model.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1]))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=pool_size))
model.add(Flatten())
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

but still no success. So I reduced them further (and passed the activation functions as a parameter to the layers instead of adding them as separate layers):

model = Sequential()
model.add(Convolution2D(nb_filters, kernel_size[0], kernel_size[1],
                        border_mode='valid',
                        input_shape=input_shape,
                        activation='relu'))
model.add(MaxPooling2D(pool_size=pool_size))
model.add(Flatten())
model.add(Dense(nb_classes, activation='softmax'))

…and no success. I still got only around 10% accuracy for the test data. So I started to mess with different parameters like kernel size, pool size, number of filters (I still have no idea what those really mean) but nothing improved the accuracy.

Until I changed the first activation function to sigmoid. Suddenly I had 97% accuracy. I have no idea why sigmoid works for this data and the rectifier (ReLU) does not, but that is something I need to figure out in the future. Now I finally had my layers all set up.

model = Sequential()
model.add(Convolution2D(nb_filters,
                        kernel_size[0],
                        kernel_size[1],
                        border_mode='valid',
                        input_shape=input_shape,
                        activation='sigmoid'))
model.add(MaxPooling2D(pool_size=pool_size))
model.add(Flatten())
model.add(Dense(nb_classes, activation='softmax'))
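For anyone wondering what the kernel size, pool size and number of filters control, the following sketch traces the data shape through the four layers above. The concrete values (32 filters, a 3x3 kernel, a 2x2 pool) are assumptions chosen for illustration, since the post does not state them, and the helper names are hypothetical. The arithmetic assumes a 'valid' convolution with stride 1 and non-overlapping pooling windows.

```python
def valid_conv_size(size, kernel):
    """Width/height after a 'valid' convolution with stride 1:
    no padding is added, so the image shrinks by kernel - 1."""
    return size - kernel + 1


def pooled_size(size, pool):
    """Width/height after non-overlapping max pooling."""
    return size // pool


def trace_shapes(image_size=28, nb_filters=32, kernel=3, pool=2, nb_classes=10):
    conv = valid_conv_size(image_size, kernel)
    pooled = pooled_size(conv, pool)
    flat = pooled * pooled * nb_filters
    return {
        "after_conv": (conv, conv, nb_filters),
        "after_pool": (pooled, pooled, nb_filters),
        "after_flatten": flat,
        "output": nb_classes,
        # one kernel (plus one bias) per filter, single input channel
        "conv_weights": (kernel * kernel * 1 + 1) * nb_filters,
        # every flattened value connects to every class (plus biases)
        "dense_weights": (flat + 1) * nb_classes,
    }


for name, value in trace_shapes().items():
    print(name, value)
```

Under these assumptions, each filter produces its own 26x26 feature map, pooling halves each side, and almost all trainable weights sit in the final dense layer.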

Now I could train the model. I had previously used only a subset of the training data and just a few epochs to train the model, because using all the training data just took too long. Training the model with all the 60,000 training images in the MNIST data set for 20 epochs takes around 16 minutes on my early 2015 MacBook Pro using only the CPU. If someone knows how much it would improve if I used the GPU, please add a comment.

Importing the model to Android

Now that I had my model trained, it was time to load it to Android and draw some digits. I first implemented the drawing code, which was pretty straightforward with my experience and a couple of examples from Stack Overflow.

To use Deeplearning4j in Android I followed this tutorial. After battling with configuring multidex support for the Android project (for some reason I needed to add

compile 'com.android.support:multidex:1.0.1'

to my dependencies although the docs say it is not necessary if minSdkVersion is 21 or above) I finally got everything to build. But when I got the project running the Deeplearning4j code threw this exception:

java.lang.UnsatisfiedLinkError: dalvik.system.PathClassLoader[DexPathList[[zip file "/data/app/com.kaamos.digitdetector-1/base.apk"],nativeLibraryDirectories=[/data/app/com.kaamos.digitdetector-1/lib/arm, /data/app/com.kaamos.digitdetector-1/base.apk!/lib/armeabi, /vendor/lib, /system/lib]]] couldn't find "libjnihdf5.so"

Hmm… what is that? I started Googling but couldn’t find anything relevant. I tried the example project from the tutorial and it worked just fine. The only difference was that in my project I was using the Deeplearning4j Keras model import library, which wasn’t in the example project that I was using as a reference. I dug into the Keras and Deeplearning4j documentation and realised that Keras saves the models in HDF5 format, which seemed to be related to the exception I was seeing. Deeplearning4j must be using some library to read the HDF5 format, and that library was missing. Now I just needed to find that library and try to compile it for Android.

So I checked all the dependencies I had in my Android project by running

./gradlew app:dependencies

The resulting list was huge but what caught my eye was this

|    |    +--- org.bytedeco.javacpp-presets:hdf5-platform:1.10.0-patch1-1.3
|    |    |    \--- org.bytedeco.javacpp-presets:hdf5:1.10.0-patch1-1.3
|    |    |         \--- org.bytedeco:javacpp:1.3 -> 1.3.2

It looks like an artifact with the group id org.bytedeco.javacpp-presets is handling the HDF5 files. Back to Google, where I found the GitHub repo of that project and very good instructions on how to build the HDF5 library for Android. So I cloned the repo, followed the instructions and got the following error:

Error: Platform "android-arm" is not supported

Huh? Again back to Google, but I couldn’t find anything. Then I tried to find where that string is printed in org.bytedeco.javacpp, which led me to this comment in the org.bytedeco.javacpp build scripts:



# https://support.hdfgroup.org/HDF5/faq/compile.html
# HDF5 does not currently support cross-compiling:

Okay, so no luck getting this to work on Android. I decided to create a plain Java app with Swing.

Importing the model to Java

This was fairly straightforward. I just had to learn a bit about how to use Swing, which felt really weird after years of building UIs with Android and iOS, but I managed to get something decent together. The first algorithm did these steps:

draw a digit

find the edges of the digit

add 5 white pixels of padding to the larger dimension (width or height)

add white pixels to the shorter dimension so that the image is square

scale to 28x28 image, because that is the input size the model expects

and the results were horrible. Number 1 was usually recognised but everything else was just recognised wrong.
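The cropping and squaring steps of this first attempt can be sketched as follows. This is not the app's actual Java code but a plain-Python illustration with hypothetical function names. It assumes the image is a list of rows where 0 means background and any other value means ink, and that the 5 pixels of padding are added on each side; the final scaling to 28x28 is left out.

```python
def bounding_box(image):
    """Return (top, left, bottom, right) of the drawn pixels, inclusive,
    or None if nothing has been drawn."""
    rows = [r for r, row in enumerate(image) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    return rows[0], cols[0], rows[-1], cols[-1]


def crop_and_square(image, padding=5, background=0):
    """Crop to the digit, pad the longer side, and fill the shorter
    side with background so that the result is square."""
    box = bounding_box(image)
    if box is None:
        raise ValueError("nothing has been drawn")
    top, left, bottom, right = box
    cropped = [row[left:right + 1] for row in image[top:bottom + 1]]
    height, width = len(cropped), len(cropped[0])
    side = max(height, width) + 2 * padding
    # Center the bounding box (not the center of mass) on the square canvas.
    row_offset = (side - height) // 2
    col_offset = (side - width) // 2
    square = [[background] * side for _ in range(side)]
    for r in range(height):
        for c in range(width):
            square[row_offset + r][col_offset + c] = cropped[r][c]
    return square
```

Note that this centers the digit by its bounding box, which matters for the comparison with the MNIST preprocessing described further down.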

I started to think about what was wrong.

Maybe the input digit line thickness is not as in the training data

The training images are greyscale while the images from the app are black-and-white.

I tried to fix those in different ways (experimenting with different line thicknesses, trying to draw the digit with different kinds of gradients, changing the training data to black-and-white, etc.) but nothing helped. So back to a programmer’s favourite friends: Google and Stack Overflow. I found this article. It had a quote from the official MNIST documentation:

The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. the images were centered in a 28x28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28x28 field.

After changing my algorithm to:

draw a digit

find the edges of the digit

scale to fit a 20x20 box while maintaining the aspect ratio

calculate the center of mass

place the scaled image on a 28x28 background so that the center of mass is in the middle

… and voilà. I got correct digit recognition almost every time.

Until I let my girlfriend test the app. Somehow the number 2 she draws is often recognised as 3 or 7. No idea why that is.
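The MNIST-style steps listed above (scale into a 20x20 box, compute the center of mass, place on a 28x28 background) can be sketched like this. It is a plain-Python illustration with hypothetical function names, not the app's Java code. It assumes the input is already cropped to the digit's bounding box, that 0 means background and larger values mean more ink, and it uses simple nearest-neighbour scaling, whereas the MNIST description mentions anti-aliasing.

```python
def scale_to_fit(image, box=20):
    """Nearest-neighbour resize so that the longer side becomes `box`
    pixels while the aspect ratio is preserved."""
    height, width = len(image), len(image[0])
    factor = box / max(height, width)
    new_h = max(1, round(height * factor))
    new_w = max(1, round(width * factor))
    return [
        [image[min(height - 1, int(r / factor))][min(width - 1, int(c / factor))]
         for c in range(new_w)]
        for r in range(new_h)
    ]


def center_of_mass(image):
    """Ink-weighted average (row, column) position."""
    total = row_sum = col_sum = 0
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            total += value
            row_sum += r * value
            col_sum += c * value
    if total == 0:
        raise ValueError("image contains no ink")
    return row_sum / total, col_sum / total


def place_centered(image, size=28, background=0):
    """Paste `image` on a size x size canvas so that its center of mass
    lands (to the nearest pixel) in the middle of the canvas."""
    com_row, com_col = center_of_mass(image)
    middle = (size - 1) / 2
    row_offset = round(middle - com_row)
    col_offset = round(middle - com_col)
    canvas = [[background] * size for _ in range(size)]
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            rr, cc = r + row_offset, c + col_offset
            if 0 <= rr < size and 0 <= cc < size:  # clip anything outside
                canvas[rr][cc] = value
    return canvas


def mnist_style(cropped_digit):
    return place_centered(scale_to_fit(cropped_digit, box=20), size=28)
```

Compared with centering by bounding box, a digit with most of its ink on one side (a 7 or a 9, for example) ends up shifted, which may explain why the first algorithm produced inputs that looked different from the training data.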

Next steps

I was fairly satisfied with my first attempt with deep learning. I learned quite a lot. Especially that the input data must be constructed exactly (and not almost) as the training data and that the deep learning algorithms are very sensitive to small disruptions in the data that humans can’t recognise. This also seems to be an active research area.

Next I hope I will find time to try to understand what my training algorithm actually does. Almost every line is still somewhat of a black box to me, so I think it is time to get back to the mathematics textbooks I haven’t opened since graduating from university.

Also all the recommendations are appreciated…