* Korean version of this content is available HERE


Outline

Trends of Deep Learning (DL)

Scale is driving DL progress



Rise of end-to-end learning



When to and when not to use "End-to-End" learning

Machine Learning (ML) Strategy (very practical advice)

How to manage train/dev/test data set and bias/variance



Basic recipe for ML



Defining a human level performance of each application is very useful

TLDR;

Workflow guidelines

Follow this link and try the workflow.





Abstract (from the NIPS 2016 website)

How do you get deep learning to work in your business, product, or scientific study? The rise of highly scalable deep learning techniques is changing how you can best approach AI problems. This includes how you define your train/dev/test split, how you organize your data, how you should think through your search among promising model architectures, and even how you might develop new AI-enabled products. In this tutorial, you’ll learn about the emerging best practices in this nascent area. You’ll come away able to better organize your and your team’s work when developing deep learning applications.

Trend #1





Q) Why is Deep Learning working so well NOW ?

A) Scale drives DL progress

To hit the top margin, you need a huge amount of data and a large NN model.

Trend #2


"This end-to-end story really upset many people. I used to get around and say that I believe "phonemes" are the fantasy of the linguists and machines can do well without them. One day at the meeting in Stanford a linguist yelled at me in public for saying that. Well...we turned out to be right."


End-to-end learning works only when you have enough (x, y) data to learn a function of the needed level of complexity.

Machine Learning Strategy



Now let's move on to the next phase of his lecture. Here, he tries to give a glimpse of an answer, or at least a guideline, for the following issue: you will often have a lot of ideas for how to improve an AI system, so what should you do?

A good strategy will help you avoid months of wasted effort. So what is it?

I think this part is the gist of his lecture. I really liked his practical tips, all of which I can apply to my own situations right away.



One of the tips he proposed is a kind of "standard workflow" that guides you while training a model:





If the training error is high (bias): train longer, use a bigger model, or adopt a new one.

If the dev error is high (variance): get more data, add regularization, or use a new model architecture.
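To make the two checks concrete, here is a minimal Python sketch of the workflow as I understand it. The function name `next_step`, the `target_error` argument and the `tolerance` threshold are my own assumptions for illustration; the talk does not define "high" numerically. I also read "dev error is high" as "dev error is well above training error", which is an interpretation rather than something stated in the slides.

```python
def next_step(train_error, dev_error, target_error, tolerance=0.01):
    """Suggest what to try next, following the two checks in the workflow.

    All arguments are error rates in [0, 1]. `target_error` and `tolerance`
    are illustrative assumptions that define what counts as "high".
    """
    # Check 1: is the training error high compared with the target? (bias)
    if train_error - target_error > tolerance:
        return "bias: train longer, use a bigger model, or try a new architecture"
    # Check 2: is the dev error high compared with the training error? (variance)
    if dev_error - train_error > tolerance:
        return "variance: get more data, add regularization, or try a new architecture"
    # Neither check fired, so there is nothing obvious left to fix.
    return "done: training and dev error are both acceptable"


print(next_step(train_error=0.08, dev_error=0.10, target_error=0.01))
```

The checks are ordered on purpose: the dev error is only examined once the training error is acceptable, which mirrors the order of the workflow.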






Here, you should be careful with the implications of the keywords bias and variance. In his talk, bias and variance have slightly different meanings from the textbook definitions (so we do not try to trade off between the two), although they share a similar concept.





In the era before DL, people used to trade off bias against variance by playing with regularization, and this coupling could not be overcome because the two were tied together too strongly.





Nowadays, however, the coupling between the two seems to have become weaker than before, because you can deal with each of them separately by using simple strategies.

Use Human Level Error as a Reference


If you take some time to think about it, you will find it quite intuitive why he named the gap between human level error and training set error bias, and the other gap variance.

If you find that your model has a high bias and a low variance, try to find a new model architecture or simply increase the capacity of the model. On the other hand, if you have a low bias but a high variance, you would do better to try gathering more data as an easy remedy.





To see this more clearly, let's say you have the following results:

Train error : 8%

Dev error : 10%






As you can see, simply by using human level performance as a baseline, you always have a guideline for where to focus among the several options you may have.
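The arithmetic behind this is small enough to write down. Below is a minimal sketch using the numbers from this post (train error 8%, dev error 10%, and the two human level errors discussed, 1% and 7.5%). The function name `diagnose` and the rule "focus on the larger gap" are my own simplifications, not something taken from the slides.

```python
def diagnose(human_error, train_error, dev_error):
    """Return (bias, variance, focus), with all errors given in percent."""
    bias = train_error - human_error    # gap between human level and training error
    variance = dev_error - train_error  # gap between training error and dev error
    # Assumption: focus on whichever gap is larger.
    focus = "bias" if bias > variance else "variance"
    return bias, variance, focus


print(diagnose(human_error=1.0, train_error=8.0, dev_error=10.0))  # (7.0, 2.0, 'bias')
print(diagnose(human_error=7.5, train_error=8.0, dev_error=10.0))  # (0.5, 2.0, 'variance')
```

The train and dev errors are identical in both calls; only the human level reference changes, and that alone flips the diagnosis.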





Note that he did not say it is "easy" to train a big model or to gather a huge amount of data. What he tries to convey here is that you at least have an "easy option to try" even if you are not an expert in this area. You know... building a new model architecture that actually works is not a trivial task, even for experts.





Still, there remain some unavoidable issues you need to overcome.





Data Crisis





Having mismatched dev and test distributions is not a good idea. You may spend months optimizing for dev set performance only to find that it does not work well on the test set.
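As a rough illustration of this rule, here is a hypothetical sketch for the in-car example in this post (50,000 hours of general speech data and 10 hours of in-car data). It only encodes the one constraint stated above, that dev and test come from the same distribution. The helper name `split_hours`, the dictionary layout and the 50/50 dev/test default are my own assumptions and not necessarily the split shown in the talk.

```python
def split_hours(general_hours, in_domain_hours, dev_fraction=0.5):
    """Plan a split where dev and test both come from the in-domain data.

    Hypothetical helper: the names and the 50/50 dev/test default are
    assumptions made for illustration, not values taken from the talk.
    """
    dev_hours = in_domain_hours * dev_fraction
    test_hours = in_domain_hours - dev_hours
    return {
        "train": {"source": "general", "hours": general_hours},
        "dev": {"source": "in-domain", "hours": dev_hours},
        "test": {"source": "in-domain", "hours": test_hours},
    }


plan = split_hours(general_hours=50_000, in_domain_hours=10)
for name, part in plan.items():
    print(name, part["source"], part["hours"])
```

With this layout, months of tuning against the dev set are at least aimed at the same kind of data that the test set measures.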





So the following is his suggestion to do better:





A few remarks



While performance is worse than human level, there are many good ways to make progress:





error analysis

estimate bias/variance

etc.



After surpassing human performance, or at least getting close to it, however, what you usually observe is that progress slows down and almost gets stuck. There can be several reasons for this, such as:





Labels are made by humans (so the limit lies here)

Maybe the human level error is close to the optimal error (Bayes error, theoretical limit)

What you can do here is find a subset of the data on which the model still does worse than humans, and make the model do better there.
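A minimal sketch of that search, assuming you already have per-subset error rates for both the model and human labelers. The subset names and numbers below are made up purely for illustration, and `subsets_worse_than_human` is a hypothetical helper name.

```python
def subsets_worse_than_human(model_error, human_error):
    """Return subset names where the model error exceeds the human error,
    ordered from the largest gap to the smallest."""
    gaps = {
        name: model_error[name] - human_error[name]
        for name in model_error
        if name in human_error
    }
    worse = [name for name, gap in gaps.items() if gap > 0]
    return sorted(worse, key=lambda name: gaps[name], reverse=True)


# Made-up error rates in percent, for illustration only.
model_error = {"clean": 3.0, "noisy": 12.0, "accented": 9.0}
human_error = {"clean": 4.0, "noisy": 6.0, "accented": 8.0}

print(subsets_worse_than_human(model_error, human_error))  # ['noisy', 'accented']
```

Sorting by gap puts the subset with the most room for improvement first, which is where the usual tools (error analysis, bias/variance estimates) become useful again.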



Andrew ended the presentation by emphasizing two ways one can improve his/her skills in the field of deep learning: practice, practice, practice, and do the dirty work (read a lot of papers and try to replicate the results). These happen to match the headline of my blog exactly. I am glad that he and I share a similar point of view:

READ A LOT, THINK IN PICTURES, CODE IT, VISUALIZE MORE!



I hope you enjoyed my summary. Thank you for reading :)





Interesting references

Slides (link)

GitHub markdown in which you can actually try the workflow with multiple-choice questions. (link)

Today, I am going to review an educational tutorial delivered at NIPS 2016 by Prof. Andrew Ng. As far as I know, there is no official video clip available, but you can download the lecture slides by searching the internet. You can see a video with almost identical contents (even the title is exactly the same) in the following link:

I really recommend that you listen to his full lecture. However, watching a video takes too much time if you want a quick overview, so here I have summarized what he tried to deliver in his talk. I hope this helps. Note that I skipped a few slides or mixed the order to make it easier for me to explain.

The red line, which stands for traditional learning algorithms such as SVM and logistic regression, shows a performance plateau in the big data regime (the right-hand side of the x-axis). They did not know what to do with all the data we collected. For the last ten years, due to the rise of the internet, mobile and IoT (Internet of Things), we could march along the x-axis. Andrew commented that this is the number one reason why DL algorithms work so well. So... the implication of this: to hit the top margin, you need a huge amount of data and a large NN model.

According to Ng, the second major trend is the rise of end-to-end learning. Until recently, a lot of machine learning used real or integer numbers as an output, e.g. 0 or 1 as a class score. In contrast, end-to-end learning can give much more complex outputs than numbers, e.g. image captioning. It is called "end-to-end" because the input and output of the system are directly linked by a neural network, unlike traditional models, which have several intermediate steps. This works well in many cases where traditional models are not effective.
For example, end-to-end learning shows better performance in speech recognition tasks. While presenting this slide, he shared the anecdote about phonemes quoted above under Trend #2.

This story seems to say that end-to-end learning is a magic key for any application, but he rather warned the audience to be careful when applying it to their own problems. Despite all the excitement about end-to-end learning, he does not think that it is the solution for every application. It works well in "some" cases but not in many others. For example, given the safety-critical requirements of autonomous driving, and thus the need for extremely high levels of accuracy, a pure end-to-end approach is still challenging to get to work for autonomous driving.

In addition, he commented that even though DL can almost always train a mapping from X to Y with a reasonable amount of data, and you may publish a paper about it, end-to-end learning works only when you have enough (x, y) data to learn a function of the needed level of complexity, e.g. medical diagnosis or imaging.

I totally agree with the point above that we should not naively rely on the learning capability of the neural network. We should exploit all the power and knowledge of the hand-designed or carefully chosen features we already have. In the same context, however, I have a slightly different point of view on "phonemes". I think they can and should also be used as an additional feature in parallel, which can reduce the labor of the neural network.

When the training error is high, this implies that the bias between the output of your model and the real data is too big. To mitigate this issue, you need to train longer, use a bigger model, or adopt a new one. Next, you should check whether your dev error is high or not. If it is, you need more data, regularization, or a new model architecture.

Yes, I know, this seems too obvious. Still, I want to mention that everything seems simple once it is organized under a unified system. Turning implicit know-how into an explicit framework is not an easy task. In short: use a bigger model (bias) and gather more data (variance).

This also implicitly shows why DL seems to be more applicable to various problems than traditional learning models. By using DL, there are at least two ways to solve the problems we get stuck on in real-life situations, as mentioned above.

To know whether your error is high or low, you need a reference. Andrew suggests using human level error as a proxy for the optimal error, or Bayes error. He strongly recommended finding this number before going deep into research, because it is a critical component for guiding your next step.

Let's say our goal is to build a human level speech system using DL. What we usually do with our data set is split it into three sets: train, dev (val) and test. Gaps between the errors on these sets may then occur, and these gaps are named bias and variance.

For the example above (train error 8%, dev error 10%), if I were to tell you that the human level error for such a task is on the order of 1%, you would immediately notice that this is a bias issue. On the other hand, if I told you that the human level error is around 7.5%, this would now look more like a variance problem. Then you would do better to focus your efforts on methods such as data synthesis or gathering data more similar to the test set.

To train the model efficiently with a finite amount of data, you need to carefully manipulate the data set or find a way to get more data (data synthesis). Here, I will focus on the former, which gives practitioners more intuition.

Say you want to build a speech recognition system for a new in-car rearview mirror product. You have 50,000 hours of general speech data and 10 hours of in-car data. How would you split your data? This is a way to do it: