So, it’s time to get started with PyTorch. This is the first in a series of tutorials on PyTorch.

This is Part 1, where I’ll describe the basic building blocks and Autograd.

NOTE: An important thing to notice is that this tutorial is made for PyTorch 0.3 and lower versions. The latest version on offer is 0.4. I’ve decided to stick with 0.3 because, as of now, 0.3 is the version that is shipped in the Conda and pip channels. Also, most of the PyTorch code used in open source hasn’t been updated to incorporate some of the changes proposed in 0.4. I will, however, point out the places where 0.3 and 0.4 differ.

Building Block #1 : Tensors

If you’ve ever done machine learning in Python, you’ve probably come across NumPy. We use NumPy because it’s much faster than Python lists at matrix operations. Why? Because it does most of the heavy lifting in C.

But when it comes to training deep neural networks, NumPy arrays simply don’t cut it. I’m too lazy to do the actual calculations here (google “FLOPS in one iteration of ResNet” to get an idea), but code utilising NumPy arrays alone would take months to train some of the state-of-the-art networks.

This is where Tensors come into play. PyTorch provides us with a data structure called a Tensor, which is very similar to NumPy’s ndarray. But unlike the latter, tensors can tap into the resources of a GPU to significantly speed up matrix operations.

Here is how you make a Tensor.

In [1]: import torch

In [2]: import numpy as np

In [3]: arr = np.random.randn(4, 3)

In [4]: arr
Out[4]:
array([[-1.00034281, -0.07042071,  0.81870386],
       [-0.86401346, -1.4290267 , -1.12398822],
       [-1.14619856,  0.39963316, -1.11038695],
       [ 0.00215314,  0.68790149, -0.55967659]])

In [5]: tens = torch.from_numpy(arr)

In [6]: tens
Out[6]:
-1.0003 -0.0704  0.8187
-0.8640 -1.4290 -1.1240
-1.1462  0.3996 -1.1104
 0.0022  0.6879 -0.5597
[torch.DoubleTensor of size 4x3]

In [7]: another_tensor = torch.LongTensor([[2,4],[5,6]])

In [8]: another_tensor
Out[8]:
 2  4
 5  6
[torch.LongTensor of size 2x2]

In [9]: random_tensor = torch.randn((4,3))

In [10]: random_tensor
Out[10]:
 1.0070 -0.6404  1.2707
-0.7767  0.1075  0.4539
-0.1782 -0.0091 -1.0463
 0.4164 -1.1172 -0.2888
[torch.FloatTensor of size 4x3]
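As a small aside to the session above, here is a minimal sketch of two properties of tensors created this way. It assumes a working PyTorch install, and the variable names are only illustrative. A tensor built with torch.from_numpy shares memory with the source array instead of copying it, and .cuda() moves a tensor to the GPU when one is available.

```python
import numpy as np
import torch

arr = np.zeros((2, 3))
tens = torch.from_numpy(arr)   # shares memory with arr, no copy is made

arr[0, 0] = 7.0                # modify the NumPy array in place...
print(tens[0, 0])              # ...and the tensor sees the change

back = tens.numpy()            # convert back to a NumPy array (CPU tensors only)

# Move to the GPU only if one is present, so the sketch also runs on CPU-only machines.
if torch.cuda.is_available():
    gpu_tens = tens.cuda()
    result = (gpu_tens * 2).cpu()   # compute on the GPU, then bring the result back
else:
    result = tens * 2
```

The guard around .cuda() is a design choice for portability; on a machine with a GPU, the multiplication runs on the device.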

Building Block #2 : Computation Graph

Now, we are at the business side of things. When a neural network is trained, we need to compute gradients of the loss function, with respect to every weight and bias, and then update these weights using gradient descent.

With neural networks hitting billions of weights, doing the above step efficiently can make or break the feasibility of training.

Building Block #2.1: Computation Graphs

Computation graphs lie at the heart of the way modern deep learning networks work, and PyTorch is no exception. Let us first get the hang of what they are.

Suppose your model is described like this:

b = w1 * a

c = w2 * a

d = (w3 * b) + (w4 * c)

L = f(d)

If I were to actually draw the computation graph, it would probably look like this.

Computation Graph for our Model

Now, you must note that the above figure is not an entirely accurate representation of how the graph is represented under the hood by PyTorch. However, for now, it’s enough to drive our point home.

Why should we create such a graph when we can sequentially execute the operations required to compute the output?

Imagine what would happen if you didn’t merely have to calculate the output but also had to train the network. You’d have to compute the gradients for all the weights, labelled by purple nodes. That would require you to work your way through the chain rule, and then update the weights.

The computation graph is simply a data structure that allows you to efficiently apply the chain rule to compute gradients for all of your parameters.

Applying the chain rule using computation graphs

Here are a couple of things to notice. First, the directions of the arrows are now reversed in the graph. That’s because we are backpropagating, and the arrows mark the flow of gradients backwards.

Second, for the sake of this example, you can think of the gradients I have written as edge weights. Notice that these gradients don’t require the chain rule to be computed.

Now, in order to compute the gradient of any node, say L, with respect to any other node, say c (dL/dc), all we have to do is the following.

1. Trace the path from L to c. This would be L → d → c.

2. Multiply all the edge weights as you traverse along this path. The quantity you end up with is (dL/dd) * (dd/dc) = dL/dc.

3. If there are multiple paths, add their results. For example, in the case of dL/da, we have two paths: L → d → c → a and L → d → b → a. We add their contributions to get the gradient of L w.r.t. a:

[( dL / dd ) * ( dd / dc ) * ( dc / da )] + [( dL / dd ) * ( dd / db ) * ( db / da )]

In principle, one could start at L, and start traversing the graph backwards, calculating gradients for every node that comes along the way.
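The path-tracing recipe above can be sketched in plain Python, without PyTorch. The numeric values, the choice f(d) = 10 - d, and the helper name path_product are assumptions made purely for illustration. The sketch multiplies edge weights along each path, adds the results over paths, and checks dL/da against a finite-difference estimate.

```python
# Toy values (assumptions for illustration only).
a = 4.0
w1, w2, w3, w4 = 2.0, 5.0, 9.0, 7.0

def forward(a):
    """Forward pass of the model, with the assumed loss f(d) = 10 - d."""
    b = w1 * a
    c = w2 * a
    d = w3 * b + w4 * c
    return 10.0 - d

# Local gradients: the "edge weights" of the backward graph.
edge = {
    ("L", "d"): -1.0,   # dL/dd for f(d) = 10 - d
    ("d", "b"): w3,     # dd/db
    ("d", "c"): w4,     # dd/dc
    ("b", "a"): w1,     # db/da
    ("c", "a"): w2,     # dc/da
}

def path_product(path):
    """Multiply the edge weights along one path, e.g. ['L', 'd', 'c', 'a']."""
    product = 1.0
    for src, dst in zip(path, path[1:]):
        product *= edge[(src, dst)]
    return product

# Single path: L -> d -> c
dL_dc = path_product(["L", "d", "c"])

# Two paths to a: add their contributions.
dL_da = path_product(["L", "d", "c", "a"]) + path_product(["L", "d", "b", "a"])

# Sanity check against a central finite-difference estimate of dL/da.
eps = 1e-6
numeric = (forward(a + eps) - forward(a - eps)) / (2 * eps)
print(dL_dc, dL_da, numeric)
```

With these values, the two paths contribute -35 and -18, so dL/da comes out to -53, which the finite-difference check reproduces up to rounding.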

Building Block #3 : Variables and Autograd

PyTorch accomplishes what we described above using the Autograd package.

Now, there are basically three important things to understand about how Autograd works.

Building Block #3.1 : Variable

The Variable, just like a Tensor, is a class that is used to hold data. It differs, however, in the way it’s meant to be used. Variables are specifically tailored to hold values which change during training of a neural network, i.e. the learnable parameters of our network. Tensors, on the other hand, are used to store values that are not to be learned. For example, a Tensor may be used to store the values of the loss generated by each example.

from torch.autograd import Variable
var_ex = Variable(torch.randn((4,3)))  # creating a Variable

A Variable wraps a tensor. You can access this tensor through the .data attribute of a Variable.

The Variable also stores the gradient of a scalar quantity (say, the loss) with respect to the parameter it holds. This gradient can be accessed through the .grad attribute. This is basically the gradient computed up to this particular node, and the gradient of every subsequent node can be computed by multiplying the edge weight with the gradient computed at the node just before it.

The third attribute a Variable holds is a grad_fn, a Function object which created the variable.
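To make the three attributes concrete, here is a minimal sketch. The names w, x and y and the values are assumptions for illustration. It uses the Variable wrapper from this tutorial, which later PyTorch releases still accept.

```python
import torch
from torch.autograd import Variable

w = Variable(torch.FloatTensor([3.0]), requires_grad=True)  # a leaf we want gradients for
x = Variable(torch.FloatTensor([2.0]))                       # plain input, no gradient needed

y = w * x          # y is created by an operation, so it gets a grad_fn
y.backward()       # populate .grad on the leaves that require it

print(w.data)      # the wrapped tensor holding the value 3
print(w.grad)      # dy/dw, which equals x = 2
print(w.grad_fn)   # None: w was created by the user, not by an operation
print(y.grad_fn)   # the Function that produced y (a multiplication)
```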

NOTE: PyTorch 0.4 merges the Variable and Tensor classes into one, and a Tensor can be made into a “Variable” by a switch rather than by instantiating a new object. But since we’re doing v0.3 in this tutorial, we’ll go ahead with Variables.

Building Block #3.2 : Function

Did I say Function above? It is basically an abstraction for, well, a function. Something that takes an input, and returns an output. For example, if we have two variables, a and b, then if,

c = a + b

Then c is a new variable, and its grad_fn is something called AddBackward (PyTorch’s built-in function for adding two variables), the function which took a and b as input, and created c.

Then, you may ask, why is there a need for an entirely new class, when Python already provides a way to define functions?

While training neural networks, there are two steps: the forward pass and the backward pass. Normally, if you were to implement this using Python functions, you would have to define two functions: one to compute the output during the forward pass, and another to compute the gradient to be propagated.

PyTorch folds these two separate functions (one for the forward pass, one for the backward pass) into two member functions of a single class called torch.autograd.Function.
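As an illustration of the forward/backward pairing, here is a minimal sketch of a custom Function for multiplication. The class name MultiplyFn is hypothetical. The sketch is written in the static-method style with ctx.saved_tensors and torch.tensor, which assumes PyTorch 0.4 or later; older releases may spell some of this differently.

```python
import torch

class MultiplyFn(torch.autograd.Function):
    """Hypothetical example: out = w * x with a hand-written backward."""

    @staticmethod
    def forward(ctx, w, x):
        # Cache the inputs: each one is the local gradient w.r.t. the other.
        ctx.save_for_backward(w, x)
        return w * x

    @staticmethod
    def backward(ctx, grad_output):
        w, x = ctx.saved_tensors
        # Chain rule: incoming gradient times the local gradient.
        grad_w = grad_output * x   # d(w*x)/dw = x
        grad_x = grad_output * w   # d(w*x)/dx = w
        return grad_w, grad_x

w = torch.tensor([3.0], requires_grad=True)
x = torch.tensor([4.0], requires_grad=True)

out = MultiplyFn.apply(w, x)   # custom Functions are invoked through .apply
out.backward()

print(w.grad, x.grad)          # gradients produced by our own backward
```

The backward method returns one gradient per input of forward, in the same order, which is how the two halves stay paired inside a single class.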

PyTorch combines Variables and Functions to create a computation graph.

Building Block #3.3 : Autograd

Let us now dig into how PyTorch creates a computation graph. First, we define our variables.

The result of the above lines of code is,
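A minimal sketch consistent with the walkthrough that follows: leaf definitions first, then the four forward-pass lines, then a single backward call. The numeric values are assumptions chosen for illustration, and the line numbers quoted in the walkthrough refer to the author's listing, so they may not line up exactly with this sketch.

```python
import torch
from torch.autograd import Variable

# Define the leaf nodes. `a` is an input, so it does not need a gradient.
a = Variable(torch.FloatTensor([4.0]))

# The weights are what we want gradients for.
weights = [Variable(torch.FloatTensor([v]), requires_grad=True)
           for v in (2.0, 5.0, 9.0, 7.0)]
w1, w2, w3, w4 = weights

# Forward pass: the graph is built as these lines execute.
b = w1 * a
c = w2 * a
d = w3 * b + w4 * c
L = 10 - d

# Backward pass: compute dL/dw for every leaf that requires a gradient.
L.backward()

for index, weight in enumerate(weights, start=1):
    print("w%d" % index, float(weight.grad.data[0]))
# w1 -36.0
# w2 -28.0
# w3 -8.0
# w4 -20.0
```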

Now, let’s dissect what the hell just happened here. If you look at the source code, here is how things go.

Define the leaf variables of the graph (Lines 5–9). We start by defining a bunch of “variables” (in the normal Python sense of the word, not PyTorch Variables). If you notice, the values we defined are the leaf nodes in our computation graph. It only makes sense that we have to define them, since these nodes aren’t the result of any computation. At this point, these guys occupy memory in our Python namespace, which means they are a hundred percent real. We must set the requires_grad attribute to True; otherwise, these Variables won’t be included in the computation graph, and no gradients will be computed for them (or for other Variables that depend on them for gradient flow).

Create the graph (Lines 12–15). Until now, there is no computation graph in memory, only the leaf nodes. But as soon as you write lines 12–15, a graph is generated ON THE FLY. REALLY IMPORTANT TO NAIL THIS DETAIL. ON THE FLY. Graph creation kicks in when you write b = w1*a, and continues until line 15. This is precisely the forward pass of our model, when the output is calculated from the inputs. The forward function of each Variable may cache some input values to be used while computing the gradient on the backward pass. (For example, if our forward function computes W*x, then d(W*x)/d(W) is x, the input that needs to be cached.)

Now, remember I told you the graph I drew earlier wasn’t exactly accurate? That’s because when PyTorch makes a graph, it’s not the Variable objects that are the nodes of the graph. It’s the Function objects, precisely the grad_fn of each Variable, that form the nodes of the graph. So, the PyTorch graph would look like this.

Each Function is a node in the PyTorch computation graph.

I’ve represented the leaf nodes by their names, but they too have grad_fn attributes (which hold None; that makes sense, as you can’t backpropagate beyond leaf nodes). The rest of the nodes are now replaced by their grad_fn’s. We see that the single node d is replaced by three Functions (two multiplications and an addition), while the loss L is replaced by a minus Function.
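One way to see that the graph nodes are Function objects is to walk it starting from L.grad_fn. The sketch below is only illustrative: the helper name walk and the values are assumptions, and the exact class names printed (for example the suffix after AddBackward) vary between PyTorch versions.

```python
import torch
from torch.autograd import Variable

a = Variable(torch.FloatTensor([4.0]))
w1, w2, w3, w4 = [Variable(torch.FloatTensor([v]), requires_grad=True)
                  for v in (2.0, 5.0, 9.0, 7.0)]

b = w1 * a
c = w2 * a
d = w3 * b + w4 * c
L = 10 - d

def walk(fn, depth=0):
    """Print the Function at this node, then recurse into the Functions feeding it."""
    if fn is None:          # inputs that do not require gradients show up as None
        return
    print("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1)

print(w1.grad_fn)   # None: a leaf was not produced by any Function
walk(L.grad_fn)     # subtraction at the root, then the addition and the multiplications
```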

Compute the gradients (Line 18). We now compute the gradients by calling the .backward() function on L. What exactly is going on here? First, the gradient at L is simply 1 (dL/dL). Then we invoke its backward function, whose job is to compute the gradients of the output of the Function object w.r.t. the inputs of the Function object. Here, L is the result of 10 - d, which means the backward function will compute the gradient (dL/dd) as -1.

Now, this computed gradient is multiplied by the accumulated gradient (stored in the grad attribute of the Variable corresponding to the current node, which is dL/dL = 1 in our case), and then sent to the input node, to be stored in the grad attribute of the Variable corresponding to the input node. Technically, what we have done is apply the chain rule: (dL/dL) * (dL/dd) = dL/dd.

Now, let us understand how the gradient is propagated for the Variable d. d is calculated from its inputs (w3, w4, b, c). In our graph, it consists of 3 nodes: 2 multiplications and 1 addition.

First, the Function AddBackward (representing the addition operation of node d in our graph) computes the gradient of its output (w3*b + w4*c) w.r.t. its inputs (w3*b and w4*c), which is 1 for both. Now, these local gradients are multiplied by the accumulated gradient (dL/dd x 1 = -1 for both), and the results are saved in the grad attribute of the respective input nodes.

Then, the Function MulBackward (representing the multiplication operation w4*c) computes the gradient of its output w.r.t. its inputs (w4 and c) as c and w4 respectively. The local gradients are multiplied by the accumulated gradient (dL/d(w4*c) = -1). The resulting values (-1 x c and -1 x w4) are then stored in the grad attribute of the Variables w4 and c respectively.

Gradients for all the nodes are computed in a similar fashion.
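The node-by-node propagation described above can be replayed by hand in plain Python. This is a sketch of the bookkeeping only, not of PyTorch's internals. The values and the names p and q (for the two products feeding the addition) are assumptions for illustration.

```python
# Toy values (assumptions for illustration only).
a = 4.0
w1, w2, w3, w4 = 2.0, 5.0, 9.0, 7.0

# Forward pass, keeping the intermediate values the backward pass will need.
b = w1 * a
c = w2 * a
p = w3 * b          # first product feeding the addition
q = w4 * c          # second product feeding the addition
d = p + q
L = 10.0 - d

grad = {"L": 1.0}   # dL/dL

# Subtraction node: the local gradient of (10 - d) w.r.t. d is -1.
grad["d"] = grad["L"] * -1.0

# Addition node: the local gradient is 1 for each input.
grad["p"] = grad["d"] * 1.0
grad["q"] = grad["d"] * 1.0

# Multiplication nodes: the local gradient w.r.t. one input is the other input.
grad["w3"], grad["b"] = grad["p"] * b, grad["p"] * w3
grad["w4"], grad["c"] = grad["q"] * c, grad["q"] * w4
grad["w1"] = grad["b"] * a
grad["w2"] = grad["c"] * a

print(grad)
```

Every line of the backward half has the same shape: accumulated gradient arriving at the node, multiplied by that node's local gradient.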

The gradient of L w.r.t any node can be accessed by calling .grad on the Variable corresponding to that node, given it’s a leaf node (PyTorch’s default behavior doesn’t allow you to access gradients of non-leaf nodes. More on that in a while). Now that we have got our gradients, we can update our weights using SGD or whatever optimization algorithm you like.

w1.data = w1.data - learning_rate * w1.grad.data  # update the weights using gradient descent

and so forth.
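Put together, a bare-bones gradient descent loop might look like the sketch below. The one-parameter model, the data point and the learning rate are assumptions for illustration. The update is applied to w.data so that w stays a leaf, and the gradient is zeroed after each step because backward() adds to whatever is already stored in .grad.

```python
import torch
from torch.autograd import Variable

# Fit w so that w * x matches y; the exact answer is w = 3.
x = Variable(torch.FloatTensor([2.0]))
y = Variable(torch.FloatTensor([6.0]))
w = Variable(torch.FloatTensor([0.0]), requires_grad=True)

learning_rate = 0.1

for step in range(50):
    loss = (w * x - y) ** 2                  # forward pass builds a fresh graph
    loss.backward()                          # backward pass fills w.grad
    w.data -= learning_rate * w.grad.data    # gradient descent update on the raw tensor
    w.grad.data.zero_()                      # reset the gradient for the next step

print(w.data)   # should be very close to 3
```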

Some Nifty Details of Autograd

So, didn’t I tell you that you can’t access the grad attribute of non-leaf Variables? Yeah, that’s the default behavior. You can override it by calling .retain_grad() on the Variable just after defining it, and then you’d be able to access its grad attribute. But really, what the heck is going on under the hood?
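A minimal sketch of retain_grad() on a non-leaf Variable, with values assumed for illustration:

```python
import torch
from torch.autograd import Variable

a = Variable(torch.FloatTensor([4.0]))
w1 = Variable(torch.FloatTensor([2.0]), requires_grad=True)

b = w1 * a          # b is a non-leaf: it is the result of an operation
b.retain_grad()     # ask autograd to keep dL/db after the backward pass

L = 10 - b
L.backward()

print(b.grad)       # dL/db = -1, available only because of retain_grad()
print(w1.grad)      # dL/dw1 = -a = -4, stored by default because w1 is a leaf
```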

Dynamic Computation Graphs

PyTorch creates something called a Dynamic Computation Graph, which means that the graph is generated on the fly. Until the forward function of a Variable is called, there exists no node for the Variable (its grad_fn) in the graph. The graph is created as a result of the forward functions of many Variables being invoked. Only then are the buffers allocated for the graph and the intermediate values (used for computing gradients later). When you call backward(), these buffers are freed as the gradients are computed, and the graph is destroyed. You can try calling backward() more than once on a graph, and you’ll see that PyTorch gives you an error. This is because the graph gets destroyed the first time backward() is called, and hence there’s no graph to call backward() on the second time.

If you call forward again, an entirely new graph is generated, with new memory allocated to it.
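The sketch below tries exactly that: a second backward() on the same graph, followed by a fresh forward pass. The values are assumptions for illustration, and the wording of the error message differs between PyTorch versions, so the sketch catches the exception rather than matching its text.

```python
import torch
from torch.autograd import Variable

a = Variable(torch.FloatTensor([4.0]))
w = Variable(torch.FloatTensor([2.0]), requires_grad=True)

L = 10 - w * a
L.backward()                 # first backward pass frees the graph's buffers

try:
    L.backward()             # second pass over the same, already-freed graph
    second_call_failed = False
except RuntimeError as err:
    second_call_failed = True
    print("second backward failed:", err)

# Running the forward computation again builds a brand-new graph.
w.grad.data.zero_()          # clear the gradient left over from the first pass
L = 10 - w * a
L.backward()
print(w.grad)                # gradient computed from the fresh graph
```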

By default, only the gradients (grad attribute) for leaf nodes are saved, and the gradients for non-leaf nodes are destroyed. But this behavior can be changed as described above.

This is in contrast to the static computation graphs used by TensorFlow, where the graph is declared before running the program. The dynamic graph paradigm allows you to make changes to your network architecture during runtime, as a graph is created only when a piece of code is run. This means a graph may be redefined during the lifetime of a program. This, however, is not possible with static graphs, where graphs are created before running the program and merely executed later. Dynamic graphs also make debugging way easier, as the source of an error is easily traceable.
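To illustrate what changing the architecture at runtime can mean in practice, here is a small sketch in which ordinary Python control flow decides how deep the graph is on each call. The function name forward and its depth parameter are hypothetical, and the values are assumptions for illustration.

```python
import torch
from torch.autograd import Variable

def forward(x, w, depth):
    """Multiply by w `depth` times; the loop length is decided at call time."""
    h = x
    for _ in range(depth):
        h = h * w
    return h

x = Variable(torch.FloatTensor([1.0]))
w = Variable(torch.FloatTensor([2.0]), requires_grad=True)

out = forward(x, w, depth=3)           # builds a graph with three multiplications
out.backward()
grad_depth3 = float(w.grad.data[0])    # d(x * w**3)/dw = 3 * x * w**2

w.grad.data.zero_()                    # clear the stored gradient before the next pass
out = forward(x, w, depth=1)           # a different, shallower graph on the next call
out.backward()
grad_depth1 = float(w.grad.data[0])    # d(x * w)/dw = x

print(grad_depth3, grad_depth1)
```

Nothing had to be declared ahead of time for the two calls to produce graphs of different shapes; each graph exists only because its forward code ran.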