Session 2: Training a network w/ Tensorflow

Creative Applications of Deep Learning with Google's Tensorflow

Parag K. Mital

Kadenze, Inc.



Learning Goals

The basic components of a neural network

How to use gradient descent to optimize parameters of a neural network

How to create a neural network for performing regression

In this session we're going to take everything we've learned about Graphs, Sessions, Operations, and Tensors and use them all to form a neural network. We're going to learn how we can use data and something called gradient descent to teach the network what the values of the parameters of this network should be.

In the last session, we saw how to normalize a dataset, using the dataset's mean and standard deviation. While this seemed to reveal some interesting representations of our dataset, it left us with a lot more to explain. In the case of faces, it really seemed to explain more about the background than the actual faces. For instance, it wasn't able to describe the differences between races, genders, expressions, hair styles, hair colors, or the many other differences that one might be interested in.

What we're really interested in is letting the computer figure out what representations it needs in order to better describe the data and some objective that we've defined. That is the fundamental idea behind machine learning: letting the machine learn from the data. In this session, we're going to start to see how to do that.

Before we get into the details, I'm going to go over some background on gradient descent and the different components of a neural network. If you're comfortable with all of this, please feel free to skip ahead.

Gradient Descent

Whenever we create a neural network, we have to define a set of operations. These operations try to take us from some input to some output. For instance, the input might be an image, or frame of a video, or text file, or sound file. The operations of the network are meant to transform this input data into something meaningful that we want the network to learn about.

Initially, all of the parameters of the network are random. So whatever is being output will also be random. But let's say we need it to output something specific about the image. To teach it to do that, we're going to use something called "Gradient Descent". Simply put, gradient descent is a way of optimizing a set of parameters.

Let's say we have a few images, and we know that when I feed a certain image through a network, its parameters should help the final output of the network spit out the word "orange", or "apple", or some appropriate label given the image of that object. The parameters should somehow accentuate the "orangeness" of my image. The network will probably transform an image in a way that ends up producing high intensities for images that have the color orange in them, and it will probably prefer images that have that color in a fairly round arrangement.

Rather than hand crafting all of the possible ways an orange might be manifested, we're going to learn the best way to optimize its objective: separating oranges and apples. How can we teach a network to learn something like this?

Defining Cost

Well we need to define what "best" means. In order to do so, we need a measure of the "error". Let's continue with the two options we've been using: orange, or apple. I can represent these as 0 and 1 instead.

I'm going to get a few images of oranges, and apples, and one by one, feed them into a network that I've randomly initialized. I'll then filter the image, by just multiplying every value by some random set of values. And then I'll just add up all the numbers, and then squash the result in a way that means I'll only ever get 0 or 1. So I put in an image, and I get out a 0 or 1. Except, the parameters of my network are totally random, and so my network will only ever spit out random 0s or 1s. How can I get this random network to know when to spit out a 0 for images of oranges, and a 1 for images of apples?
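To make that toy network a little more concrete, here is a minimal sketch in plain Python. The 4-value "image", the function name `random_network`, and the choice of a sigmoid followed by a 0.5 threshold as the "squash" step are all assumptions I'm making for illustration; nothing above pins them down.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def random_network(image, weights):
    """Multiply every pixel by a weight, add everything up, squash to 0 or 1."""
    total = sum(pixel * w for pixel, w in zip(image, weights))
    squashed = 1.0 / (1.0 + math.exp(-total))  # sigmoid: maps any number into (0, 1)
    return 1 if squashed >= 0.5 else 0         # hard threshold: only ever 0 or 1

# A hypothetical 4-pixel "image" and a randomly initialized set of parameters.
image = [0.9, 0.5, 0.1, 0.3]
weights = [random.uniform(-1.0, 1.0) for _ in image]

prediction = random_network(image, weights)
print(prediction)  # 0 or 1, depending entirely on the random weights
```

Because the weights are random, the prediction carries no information about the image yet. That is exactly the problem the rest of this section sets out to fix.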

We do that by saying, if the network predicts a 0 for an orange, then the error is 0. If the network predicts a 1 for an orange, then the error is 1. And vice-versa for apples. If it spits out a 1 for an apple, then the error is 0. If it spits out a 0 for an apple, then the error is 1. What we've just done is create a function which describes error in terms of our parameters.

Let's write this another way:

\begin{align} \text{error} = \text{network}(\text{image}) - \text{true_label} \end{align}

where

\begin{align} \text{network}(\text{image}) = \text{predicted_label} \end{align}

More commonly, we'll see these components represented by the following letters:

\begin{align} E = f(X) - y \end{align}

Don't worry about trying to remember this equation. Just see how it is similar to what we've done with the oranges and apples. $X$ is generally the input to the network, which is fed to some network, or a function $f$, which we know should output some label $y$. Whatever difference there is between what it should output, $y$, and what it actually outputs, $f(X)$, is the error, $E$.
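As a small sketch of that equation, the snippet below walks through the four orange/apple cases. One thing to notice: the signed difference is -1 when the network says 0 for an apple, so I take the absolute value to recover the 0-or-1 error described earlier. That absolute value is my own assumption, as are the names `error`, `ORANGE`, and `APPLE`.

```python
def error(predicted_label, true_label):
    """E = f(X) - y: the signed difference between prediction and truth."""
    return predicted_label - true_label

ORANGE, APPLE = 0, 1

# The four cases from the oranges-and-apples example.
for predicted in (ORANGE, APPLE):
    for actual in (ORANGE, APPLE):
        e = error(predicted, actual)
        # Print: prediction, true label, signed error, size of the error.
        print(predicted, actual, e, abs(e))
```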

Minimizing Error

Instead of feeding one image at a time, we're going to feed in many. Let's say 100. This way, we can see what our network is doing on average. If our error at the current network parameters is e.g. 50/100, we're correctly guessing about 50 of the 100 images.
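Here is a rough sketch of measuring error over a batch of 100 images. The labels and predictions are made-up random values standing in for real data and for an untrained network, so treat the names and numbers as assumptions.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical batch: 100 true labels (0 = orange, 1 = apple) and
# 100 predictions from an untrained network, which are just random guesses.
true_labels = [random.randint(0, 1) for _ in range(100)]
predictions = [random.randint(0, 1) for _ in range(100)]

# Count the images where the prediction disagrees with the true label.
n_wrong = sum(abs(p - y) for p, y in zip(predictions, true_labels))
mean_error = n_wrong / len(true_labels)

print(n_wrong, mean_error)  # random guessing tends to land near 50 wrong out of 100
```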

Now for the crucial part. If we move our network's parameters a tiny bit and see what happens to our error, we can actually use that knowledge to find smaller errors. Let's say the error went up after we moved our network parameters. Well then we know we should go back the way we came, and try going the other direction entirely. If our error went down, then we should just keep changing our parameters in the same direction. The error provides a "training signal" or a measure of the "loss" of our network. You'll often hear any number of these terms used to describe the same thing: "Error", "Cost", "Loss", or "Training Signal". That's pretty much gradient descent in a nutshell. Of course, we've made a lot of assumptions, in particular that our function is continuous and differentiable. But we're not going to worry about that, and if you don't know what that means, don't worry about it.
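The "nudge the parameters and see what happens" idea can be sketched with a single parameter. The error function below is a made-up smooth bowl with its minimum at 3, and the step size is arbitrary. This is a trial-and-error simplification of the intuition, not how a real optimizer computes its updates.

```python
def error(w):
    """Hypothetical smooth error with its minimum at w = 3."""
    return (w - 3.0) ** 2

w = 0.0          # starting parameter value
step = 0.1       # how far we nudge the parameter each time
direction = 1.0  # start by moving in the positive direction

for _ in range(100):
    candidate = w + direction * step
    if error(candidate) < error(w):
        w = candidate           # error went down: keep going this way
    else:
        direction = -direction  # error went up: try the other direction

print(w, error(w))
```

Once `w` gets close to 3, every nudge makes the error worse, so the loop just flips direction back and forth without moving. A fixed step size can only get us so close to the minimum.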

To summarize, Gradient descent is a simple but very powerful method for finding smaller measures of error by following the negative direction of its gradient. The gradient is just saying, how does the error change at the current set of parameters?

One thing I didn't mention was how we figure out what the gradient is. In order to do that, we use something called backpropagation. When we pass an input to a network, it's doing what's called forward propagation: we send the input through the network, multiplying it by the weights along the way, until we get an output. Whatever differences that output has with the output we wanted it to have get backpropagated to every single parameter in our network. Basically, backprop is a very effective way to find the gradient by simply multiplying many partial derivatives together. It uses something called the chain rule to find the gradient of the error with respect to every single parameter in a network, and follows this error from the output of the network, all the way back to the input.
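To give a feel for "multiplying many partial derivatives together", here is a sketch of the chain rule on the smallest network I can think of: one weight, one bias, and a sigmoid. The squared error, the function names, and the example values are all assumptions chosen for illustration. At the end, the result is checked against a numerical estimate of the slope.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x):
    """Forward propagation: input -> weighted sum -> squashed output."""
    z = w * x + b
    return z, sigmoid(z)

def backward(w, b, x, y):
    """Backpropagation: multiply the partial derivatives along the chain."""
    z, out = forward(w, b, x)
    dE_dout = 2.0 * (out - y)    # derivative of (out - y)^2 with respect to out
    dout_dz = out * (1.0 - out)  # derivative of the sigmoid with respect to z
    dz_dw, dz_db = x, 1.0        # derivatives of z = w*x + b
    return dE_dout * dout_dz * dz_dw, dE_dout * dout_dz * dz_db

def squared_error(w, b, x, y):
    return (forward(w, b, x)[1] - y) ** 2

w, b, x, y = 0.5, -0.2, 1.5, 1.0
grad_w, grad_b = backward(w, b, x, y)

# Sanity check: estimate the same slopes by nudging each parameter a tiny bit.
eps = 1e-6
numeric_w = (squared_error(w + eps, b, x, y) - squared_error(w - eps, b, x, y)) / (2 * eps)
numeric_b = (squared_error(w, b + eps, x, y) - squared_error(w, b - eps, x, y)) / (2 * eps)
print(grad_w, numeric_w)
print(grad_b, numeric_b)
```

The two numbers on each line should agree closely, which is a handy way to check a hand-written gradient.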

While the details won't be necessary for this course, we will come back to it in later sessions as we learn more about how we can use both backprop and forward prop to help us understand the inner workings of deep neural networks.

If you are interested in knowing more details about backprop, I highly recommend both Michael Nielsen's online Deep Learning book:

http://neuralnetworksanddeeplearning.com/

and the online Deep Learning book by Ian Goodfellow, Yoshua Bengio, and Aaron Courville:

http://www.deeplearningbook.org/

Extra details for notebook only

To think about this another way, the definition of a linear function is written like so:

\begin{align} y = mx + b \end{align}

The slope, or gradient of this function is $m$ everywhere. It's describing how the function changes with different network parameters. If I follow the negative value of $m$, then I'm going down the slope, towards smaller values.

But not all functions are linear. Let's say the error was something like a parabola:

\begin{align} y(x) = x^2 \end{align}

That just says, there is a function y, which takes one parameter, $x$, and this function just takes the value of $x$ and multiplies it by itself, or put another way, it outputs $x^2$. Let's start at the minimum. At $x = 0$, our function $y(0) = 0$. Let's try and move a random amount, and say we end up at $1$. So at $x = 1$, we know that our function went up from $y(0) = 0$ to $y(1) = 1$. The change in $y = 1$. The change in $x = 1$. So our slope is:

\begin{align} \frac{\text{change in } y}{\text{change in } x} = \frac{y(1) - y(0)}{1 - 0} = \frac{1}{1} = 1 \end{align}

If we go in the negative direction of this, $x = x - 1$, we get back to 0, our minimum value.

If you try this process for any value, you'll see that if you keep following the negative slope, you move towards smaller values.
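You can try this out with a few lines of Python. The sketch below estimates the slope of $y(x) = x^2$ as (change in $y$) / (change in $x$) over a tiny interval, then repeatedly steps in the negative direction of that slope. The starting point of 4 and the `rate` of 0.1 are arbitrary choices of mine.

```python
def y(x):
    return x ** 2

def slope(x, h=1e-5):
    """Estimate (change in y) / (change in x) around x."""
    return (y(x + h) - y(x - h)) / (2 * h)

x = 4.0     # arbitrary starting point
rate = 0.1  # fraction of the slope to move by on each step (an assumption)

history = [y(x)]
for _ in range(50):
    x = x - rate * slope(x)  # step in the negative direction of the slope
    history.append(y(x))

print(x, y(x))
```

Each step shrinks `x` towards 0, and the values stored in `history` never increase along the way.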

You might also see this process described like so:

\begin{align} \theta = \theta - \eta \cdot \nabla_\theta J( \theta) \end{align}

That's just saying the same thing really. We're going to update our parameters, commonly referred to by $\theta$, by finding the gradient, $\nabla$ with respect to parameters $\theta$, $\nabla_\theta$, of our error, $J$, and moving down the negative direction of it: $- \eta \cdot \nabla_\theta J( \theta)$. The $\eta$ is just a parameter also known as the learning rate, and it describes how far along this gradient we should travel. We'll typically set this value anywhere between 0.01 and 0.00001.
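Here is a sketch of that update rule with two parameters instead of one. It fits the line $y = mx + b$ to a handful of made-up points, using mean squared error as the cost $J$. The data, the choice of cost, the function names, and the number of iterations are all assumptions for illustration.

```python
# Hypothetical data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

def cost(theta):
    """J(theta): mean squared error of the line y = m*x + b."""
    m, b = theta
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(theta):
    """Gradient of J with respect to theta = (m, b)."""
    m, b = theta
    n = len(xs)
    dm = sum(2.0 * (m * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2.0 * (m * x + b - y) for x, y in zip(xs, ys)) / n
    return dm, db

theta = (0.0, 0.0)  # initial parameters
eta = 0.01          # learning rate

for _ in range(5000):
    grad = gradient(theta)
    # theta = theta - eta * gradient, applied to each parameter
    theta = tuple(t - eta * g for t, g in zip(theta, grad))

print(theta, cost(theta))
```

After enough steps, `theta` ends up very close to the slope 2 and intercept 1 that generated the data. A much larger learning rate can make the updates overshoot and blow up, while a much smaller one needs many more iterations.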

Before we start, we're going to need some library imports: