In the book club at work, I just finished reading Grokking Deep Learning by Andrew Trask. It is an introduction to deep learning, but it has some problems. It spends a lot of pages on the basics, and only towards the end moves on to some fairly advanced topics. It also contains many small and irritating mistakes. However, it does have some great insights into deep learning.

We have previously read Grokking Algorithms in our book club. Of the two books, the algorithm book does a better job of being an introduction to its subject. It covers a lot of different algorithms in an easy-to-understand way, often thanks to very good illustrations. Grokking Deep Learning also uses pictures when explaining how things work, but they do not play as big a part as they did in the algorithm book.

A bigger problem is which readers it targets. Judging from the cover, and comparing to the algorithm book, I thought it would just be an introduction to deep learning. But at the end of the book, the author writes that the intended reader is someone interested in pursuing a career in the field of deep learning. This explains why the book felt so strange to me. The first 100 pages cover error minimization using gradient descent in great detail. Many pages are spent explaining what the derivative of a function is. If you are interested in a career in deep learning, I think you would either already know what a derivative is, or you could understand the concept very quickly.

However, the stated target reader does explain why some of the more advanced topics appear at the end of the book, for example writing a framework for deep learning, and Recurrent Neural Networks. This mix of very basic and advanced topics made the book a bit hard to categorize. Is it for complete beginners (in which case the later chapters are probably too advanced), or is it for potential deep learning professionals? If the latter, then too much time is spent on the absolute basics.

Good

I am pretty new to deep learning. My only other experience, on a code level, is chapter 7 in Classic Computer Science Problems in Python. That chapter was excellent, and showed the implementation and use of a small neural network. Here are some of the things I liked the most from Grokking Deep Learning that I hadn’t already picked up:

The concept of up pressure and down pressure on the weights as the network learns (chapter 6).

The explanation of how, if there isn’t a direct correlation between the input data and expected output, the use of extra layers can create intermediate datasets that do show correlation (chapter 6).

That backpropagation can be seen as how the error correction to the weights should be distributed among the nodes in the layer.

The explanation of why a non-linearity, in the form of an activation function, is needed: if not used, then all layers can just be collapsed into one layer, the same way 5 * 10 * 2 can be written simply as 5 * 20.
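To make the collapse concrete, here is a minimal sketch in plain Python. It is not code from the book; the helper `matmul` and the weight values `w1` and `w2` are made up for illustration. Two purely linear layers give the same result as one layer whose weights are the product of the two, and putting a relu between them breaks that equivalence.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(matrix):
    """Apply relu element-wise: negative values become 0.0."""
    return [[max(0.0, v) for v in row] for row in matrix]


# Hypothetical input and weights, chosen only so the arithmetic is easy to follow.
x = [[1.0, 2.0]]                   # one input row with two features
w1 = [[0.5, -1.0], [1.5, -2.0]]    # first layer: 2 inputs -> 2 hidden nodes
w2 = [[2.0], [1.0]]                # second layer: 2 hidden nodes -> 1 output

# Two linear layers applied one after the other.
two_linear_layers = matmul(matmul(x, w1), w2)

# The same two layers collapsed into a single weight matrix first.
one_collapsed_layer = matmul(x, matmul(w1, w2))

# With a non-linearity in between, the layers can no longer be collapsed.
with_relu = matmul(relu(matmul(x, w1)), w2)

print(two_linear_layers, one_collapsed_layer, with_relu)
```

The first two results are identical, while the relu version differs because the negative hidden value is cut off before it reaches the second layer.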

The discussion of different activation functions – relu, sigmoid, tanh – and the output normalization function softmax.
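For reference, the four functions can be sketched in a few lines of plain Python using the standard textbook definitions. This is my own scalar version for illustration, not the book's implementation; subtracting the maximum inside `softmax` is a common numerical-stability choice that I am assuming here.

```python
import math


def relu(x):
    """Pass positive values through, clamp negatives to zero."""
    return max(0.0, x)


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Squash any real number into the range (-1, 1)."""
    return math.tanh(x)


def softmax(values):
    """Turn a list of raw scores into positive numbers that sum to 1."""
    largest = max(values)
    # Shifting by the largest score avoids overflow in exp without changing the result.
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


scores = [1.0, 2.0, 3.0]
probabilities = softmax(scores)
print(relu(-3.0), sigmoid(0.0), tanh(0.0), probabilities)
```

The first three work on one value at a time, while softmax needs the whole output vector because each result depends on all the other scores.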

The visualization of the weights for the MNIST dataset (hand-drawn digits from 0 to 9) that shows what pixels are most important when detecting each digit.

Using dropout – randomly setting nodes (for example half of them) to zero at different points during training, to avoid overfitting. This works because the random subnetworks created by the dropout start by learning the biggest features of the dataset, and any overfitting is likely different between the different subnetworks.
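A minimal sketch of the idea, assuming the common "inverted dropout" formulation where a random mask zeroes a layer's node outputs and the survivors are scaled up so the expected total stays the same. The function name `dropout`, the parameter `keep_probability`, and the sample values are mine, not taken from the book.

```python
import random


def dropout(activations, keep_probability=0.5, rng=random):
    """Zero each activation with probability 1 - keep_probability.

    Surviving values are divided by keep_probability so that the expected
    sum of the layer output is unchanged (inverted dropout).
    """
    mask = [1.0 if rng.random() < keep_probability else 0.0 for _ in activations]
    scale = 1.0 / keep_probability
    return [a * m * scale for a, m in zip(activations, mask)]


rng = random.Random(0)           # seeded only so repeated runs give the same mask
hidden = [0.2, 0.9, 0.4, 0.7]    # hypothetical hidden-layer outputs
dropped = dropout(hidden, keep_probability=0.5, rng=rng)
print(dropped)
```

A new mask is drawn on every training step, which is what creates the different random subnetworks. At prediction time the mask is simply not applied.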

I liked that different domains are covered – image recognition and natural language processing (NLP).

I also really liked chapter 16 – Where to go from here. It has a lot of good advice on how to learn more on your own. For example: learn a deep learning framework, take a MOOC or watch videos on YouTube, teach what you learn (for example by blogging about it), follow and engage with researchers on Twitter, implement academic papers. This advice, with a few tweaks, is great for almost any subject you want to learn and get really good at. Great advice!

Could Be Improved

Towards the end of the book, the code examples become fairly large. This is of course because more advanced topics are covered, but it becomes a bit hard to keep everything going on in the code in your head as you read.

In the chapters on NLP, I thought that the “fill in the blank” example, and the example using matrix multiplication to account for the order of the words in a sentence, needed some more explanation.

There are many small mistakes and oddities throughout the book. Here are some examples:

The concept of overfitting is explained with a metaphor of making impressions of different forks in clay. However, to me this is quite an odd way of explaining overfitting, as many different fork impressions would blur the shape, rather than making it too specific (page 150).

In two related examples that we are supposed to compare, the order of the output is reversed. On page 149, the output is training data, then test data. But on page 157, the order is the opposite. This makes comparing the results harder than it should be.

Several times, concepts are used in the text before they are explained. For example, softmax is used on page 167, but only explained on page 169.

Often, the code and the output that is supposed to be generated from that code are not in sync. For example, the headings in the code and in the output are different on page 148.

The variable names in code are often unnecessarily short and cryptic (sent instead of sentence, wordcnt instead of wordcount). I know many people write code like that, but I still think it is bad.

The relu function multiplies a float and a boolean (yes, you can do that in Python, but it would be nicer not to).
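To illustrate the point, here is a scalar sketch of the two styles. The names `relu_bool` and `relu_max` are mine, and the first function only approximates the style described above rather than reproducing the book's exact code.

```python
def relu_bool(x):
    """Relu via bool * float: True acts as 1 and False as 0 in arithmetic."""
    return (x > 0) * x


def relu_max(x):
    """Relu written so the intent is visible at a glance."""
    return max(0.0, x)


for value in (2.5, -3.0):
    print(value, relu_bool(value), relu_max(value))

# A small side effect of the multiplication style: False * -3.0 evaluates to
# -0.0, which compares equal to 0.0 but prints with a minus sign.
```

Both versions return equal values. The second one just does not rely on the reader knowing that Python treats booleans as integers in arithmetic.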

Some illustrations don’t make sense. For example, on page 180, the bottom four pictures that are supposed to help explain convolutional kernels are all the same (the top five pictures are helpful though).

On page 199, the numerical values showing the similarities between words are all negative for no discernible reason (-0.0 is even used to indicate similarity to itself, instead of 0.0).

Conclusion

Grokking Deep Learning goes from absolute basics to fairly advanced topics. However, I thought that too much time was spent on the simple concepts in the beginning. The editing could also have been a lot better. There are numerous little mistakes throughout the book. While the mistakes don’t prevent you from learning the material, they are a distraction and give a bad impression. Still, after reading it, you will have a good understanding and intuition of how deep learning works.