I apologize in advance if I misused some mathematical notions, and I invite you to tell me which ones so that I can improve this article, which I wrote primarily for myself.

Learning how to perform backpropagation is one of the early steps when we want to implement a neural network trained using gradient descent. Then, if we want to be a little more serious about the game, we vectorize/matricize our forward pass because it makes the code clearer and more efficient. Reality quickly catches up with us, because our chain rule in the backward pass no longer works once matrices and matrix products are involved. So we invoke our magic skills and fix the whole thing by reordering our expressions until the matrices’ dimensions match. At least, that’s what I did.

In this article, we try to see whether the expressions we get for gradient descent in matrix form, using matrix calculus, are the same as the ones we get when we do our kind of magic. It’s not an introduction to gradient descent via backprop, and I think it is mostly addressed to developers (with no math background) who have also tried to understand the maths.

What’s the issue?

Let’s say we have a two-layer net (one hidden, one output), that we use matrices to represent its parameters (i.e. the weights) as well as its inputs, and that we use a matrix-valued activation function:

Here, X is the matrix containing our mini-batch’s inputs (rows are instances and columns represent features). W1 and W2 are the matrices containing all the weights of all the neurons in the hidden layer and the output layer respectively (a column represents a neuron and the rows below are its weights). Lowercase sigma σ is our matrix-valued activation function (which, for instance, basically applies the sigmoid function to every element of the input matrix). Now, what if we would like to compute the updated value of W1 for a (mini-batch) gradient descent step?
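
To fix notation for the rest of the article, let’s give names to the intermediate results of the forward pass (these names are mine, introduced for readability):

$$Z_1 = XW_1,\qquad H = \sigma(Z_1),\qquad Z_2 = HW_2,\qquad \hat{Y} = \sigma(Z_2)$$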

What I did back then was to try to take the derivative of some loss function (MSE, for instance) w.r.t. W1. I ended up with something like this:
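
Modulo notation, it was the scalar chain rule transplanted onto matrices, something like (E being the loss):

$$\frac{\partial E}{\partial W_1} = \frac{\partial E}{\partial \hat{Y}}\cdot\frac{\partial \hat{Y}}{\partial Z_2}\cdot\frac{\partial Z_2}{\partial H}\cdot\frac{\partial H}{\partial Z_1}\cdot\frac{\partial Z_1}{\partial W_1}$$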

This chain rule is not sound.

This DOESN’T work at all. In fact, the matrix products are impossible because the matrices don’t have compatible dimensions, which was a thing I didn’t know at that time, almost a year ago 😕. I tried to reverse the expression, tried to add Hadamard products, but none of those things worked. Then I learned about the matrix transpose operation and started doing magic things to make the matrices’ shapes and products work, while ensuring I would get an expression equivalent to the partial derivative of the loss function w.r.t. W1’s weights in its scalar form. That gave me:
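
Up to the constant factor coming from the MSE, it was the classic backprop expression:

$$\Delta W_1 = X^\top\Big[\Big(\big(\hat{Y}-Y\big)\odot\sigma'(Z_2)\Big)W_2^\top\odot\sigma'(Z_1)\Big]$$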

The delta of W1 given by backprop.

Months went by. Then, not so long ago, I started working on a toy OCaml neural network library, and even though my math level had slightly leveled up within a year, I still wasn’t able to find, by myself, a real mathematical justification for the expression above using calculus. Even when it comes to math, I think that a developer should not use something so fundamental without at least trying to understand it. Andrej Karpathy’s “Yes you should understand backprop” made me decide that I should investigate a better justification to find peace.

Looking for a way to express matrix derivatives

The real issue we have here is: is there a way to take the partial derivative of a cost function w.r.t. W1 and, more generally, is there a way to take the derivative of a matrix-valued function w.r.t. a matrix?

What’s the derivative?

We could say that the derivative of F(X) w.r.t. X is a matrix having the shape of X:
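
Element by element (taking F scalar-valued to keep things simple), this reads:

$$\left(\frac{\partial F}{\partial X}\right)_{ij} = \frac{\partial F}{\partial X_{ij}}$$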

This definition has something interesting: the matrix has the shape of X, which means updating our parameters would be straightforward. However, this formula has a flaw:

Yes, it doesn’t allow for a chain rule, and no chain rule -> no backprop.

We need a chain rule, and that’s what J. R. Magnus and H. Neudecker offer us in Matrix Differential Calculus (ch. 9.3). They argue against several bad notations and for one good notation for expressing matrix derivatives. They basically say that taking the derivative of a matrix-valued function w.r.t. a matrix is somewhat inconsistent, and they argue for a generalization of the Jacobian to matrices:
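
Their definition is:

$$\mathsf{D}F(X) = \frac{\partial\operatorname{vec}F(X)}{\partial\,(\operatorname{vec}X)^\top}$$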

Formula 1. vec is the vec operator, which stacks the columns of a matrix into a single column vector.

If F(X) has shape p×q and X has shape m×n, vec(F(X)) will be a column vector of shape pq×1, and transpose(vec(X)) will be a row vector of shape 1×mn; thus the Jacobian will have shape pq×mn. While the Jacobian is not exactly the partial derivative of F(X) w.r.t. X, it contains all the information we need to obtain it and, above all, it allows for a chain rule:
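
For a composition of two functions, we have:

$$\mathsf{D}(G\circ F)(X) = \mathsf{D}G\big(F(X)\big)\cdot\mathsf{D}F(X)$$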

Formula 2. Derivative of a composition

Note that we could also say that the derivative we are looking for is a fourth-rank tensor (as pointed out by Wikipedia), but I didn’t find any comprehensible resources about it.

Unrolling the chain rule step-by-step

Warning /!\ This section contains almost nothing but math and may not be very interesting.

Now that we have a way to express matrix derivatives, we can state what we are looking for:
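
Reading the MSE element-wise, as the rest of the derivation does, the loss is E = α(Ŷ − Y)⊙(Ŷ − Y), where ⊙ is the element-wise (Hadamard) product and α is the MSE constant factor (1/n, say). What we are after is the Jacobian:

$$\frac{\partial\operatorname{vec}E}{\partial\,(\operatorname{vec}W_1)^\top}$$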

This Jacobian will give us everything we need to update W1 for a gradient descent step. For now, let’s unroll the chain rule step by step:
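
Writing U = V⊙V for the squared subtraction and V = Ŷ − Y, repeated applications of Formula 2 give:

$$\frac{\partial\operatorname{vec}E}{\partial\,(\operatorname{vec}W_1)^\top} = \frac{\partial\operatorname{vec}E}{\partial\,(\operatorname{vec}U)^\top}\cdot\frac{\partial\operatorname{vec}U}{\partial\,(\operatorname{vec}V)^\top}\cdot\frac{\partial\operatorname{vec}V}{\partial\,(\operatorname{vec}\hat{Y})^\top}\cdot\frac{\partial\operatorname{vec}\hat{Y}}{\partial\,(\operatorname{vec}Z_2)^\top}\cdot\frac{\partial\operatorname{vec}Z_2}{\partial\,(\operatorname{vec}H)^\top}\cdot\frac{\partial\operatorname{vec}H}{\partial\,(\operatorname{vec}Z_1)^\top}\cdot\frac{\partial\operatorname{vec}Z_1}{\partial\,(\operatorname{vec}W_1)^\top}$$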

Step 1.

For the first chunk, we can see that we are looking for the derivative of an expression that only performs an element-wise scalar multiplication against a matrix U (a placeholder name for the squared subtraction). We know that for scalars the derivative of ax w.r.t. x is a (a being a constant), but for a matrix we have:
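
$$\frac{\partial\operatorname{vec}X}{\partial\,(\operatorname{vec}X)^\top} = I_{\dim(\operatorname{vec}X)}$$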

The derivative of the identity function is the identity matrix (dim is the dimension of a vector).

That is, the identity matrix. Thus we simply have:
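
$$\frac{\partial\operatorname{vec}(\alpha U)}{\partial\,(\operatorname{vec}U)^\top} = \alpha\,I_{\dim(\operatorname{vec}U)}$$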

Then, for the second chunk, we are looking for the derivative of a squared matrix V (squared in the sense of an element-wise product, not a matrix product):
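
$$\frac{\partial\operatorname{vec}(V\odot V)}{\partial\,(\operatorname{vec}V)^\top}$$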

Step 2.

In order to express the derivative of a vectorized Hadamard product, we’ll use the property that allows us to express it as a matrix product, so that we can apply the well-known product rule:

The Diag function takes a vector and puts its elements along the diagonal of a square matrix that is zero everywhere else.
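
$$\operatorname{vec}(A\odot B) = \operatorname{Diag}(\operatorname{vec}A)\cdot\operatorname{vec}B$$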

Formula 3. Vectorized Hadamard product expressed as a matrix product.
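
Applying the product rule to this form (A and B both depending on X) gives:

$$\frac{\partial\operatorname{vec}(A\odot B)}{\partial\,(\operatorname{vec}X)^\top} = \operatorname{Diag}(\operatorname{vec}A)\cdot\frac{\partial\operatorname{vec}B}{\partial\,(\operatorname{vec}X)^\top} + \operatorname{Diag}(\operatorname{vec}B)\cdot\frac{\partial\operatorname{vec}A}{\partial\,(\operatorname{vec}X)^\top}$$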

Formula 4. Derivative of vectorized Hadamard product using formula 3

As A, B and X are all equal to V in our case, the expression reduces to:
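
$$\frac{\partial\operatorname{vec}(V\odot V)}{\partial\,(\operatorname{vec}V)^\top} = 2\operatorname{Diag}(\operatorname{vec}V)$$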

As you can see below, we could already start to simplify the expression, but let’s just do one thing at a time.
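
$$\frac{\partial\operatorname{vec}V}{\partial\,(\operatorname{vec}\hat{Y})^\top} = \frac{\partial\operatorname{vec}(\hat{Y}-Y)}{\partial\,(\operatorname{vec}\hat{Y})^\top}$$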

Step 3.

For this step, we can see that we are taking the derivative of a subtraction. Here is the obvious formula we need:
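
$$\frac{\partial\operatorname{vec}(A-B)}{\partial\,(\operatorname{vec}X)^\top} = \frac{\partial\operatorname{vec}A}{\partial\,(\operatorname{vec}X)^\top} - \frac{\partial\operatorname{vec}B}{\partial\,(\operatorname{vec}X)^\top}$$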

Formula 5. Derivative of subtraction

Then after a straightforward application, we obtain:
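
$$\frac{\partial\operatorname{vec}(\hat{Y}-Y)}{\partial\,(\operatorname{vec}\hat{Y})^\top} = I - 0 = I$$

(Y is a constant, so its derivative vanishes.)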

Here is what our expression currently looks like after having unrolled the loss function part of the chain rule:
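
$$\alpha I\cdot 2\operatorname{Diag}\big(\operatorname{vec}(\hat{Y}-Y)\big)\cdot I\cdot\frac{\partial\operatorname{vec}\hat{Y}}{\partial\,(\operatorname{vec}Z_2)^\top}\cdot\frac{\partial\operatorname{vec}Z_2}{\partial\,(\operatorname{vec}H)^\top}\cdot\frac{\partial\operatorname{vec}H}{\partial\,(\operatorname{vec}Z_1)^\top}\cdot\frac{\partial\operatorname{vec}Z_1}{\partial\,(\operatorname{vec}W_1)^\top}$$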

The expression is already getting pretty long, so let’s leave the unrolled part of the chain rule aside and focus on the feed-forward side, deriving it step by step:
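
The first chunk on the feed-forward side is:

$$\frac{\partial\operatorname{vec}\hat{Y}}{\partial\,(\operatorname{vec}Z_2)^\top} = \frac{\partial\operatorname{vec}\sigma(Z_2)}{\partial\,(\operatorname{vec}Z_2)^\top}$$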

Step 4.

As σ(A) is an element-wise function application, we have:
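
$$\frac{\partial\operatorname{vec}\sigma(A)}{\partial\,(\operatorname{vec}A)^\top} = \operatorname{Diag}\big(\operatorname{vec}\sigma'(A)\big)$$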

Formula 6. Derivative of element-wise function

Where σ’ is its derivative.
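
The next chunk is the derivative of the output layer’s matrix product:

$$\frac{\partial\operatorname{vec}Z_2}{\partial\,(\operatorname{vec}H)^\top} = \frac{\partial\operatorname{vec}(HW_2)}{\partial\,(\operatorname{vec}H)^\top}$$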

Step 5.

Here, for the next step, we would like the derivative of a vectorized matrix product, so how can we do this? First, let’s introduce some formulas:
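
With ⊗ denoting the Kronecker product, and A and C constant:

$$\operatorname{vec}(ABC) = (C^\top\otimes A)\operatorname{vec}B$$

$$\frac{\partial\operatorname{vec}(AXC)}{\partial\,(\operatorname{vec}X)^\top} = C^\top\otimes A$$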

Formulas 7 & 8. The vectorized product of 3 matrices and its derivative.

What is interesting with these formulas is that they allow us to express derivatives with known matrices. But how do we apply them to the partial derivative we currently want, given that we only have a product between two matrices? Here are the two “tricks”:

nrow(X) is the number of rows in X

ncol(X) is the number of columns in X
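
In other words, for any product of two matrices A and B:

$$AB = I_{\operatorname{nrow}(A)}\cdot A\cdot B = A\cdot B\cdot I_{\operatorname{ncol}(B)}$$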

We add a product with an identity matrix to the left of the expression, which allows us to use the formula above:
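
$$\operatorname{vec}(HW_2) = \operatorname{vec}\big(I_{\operatorname{nrow}(H)}\cdot H\cdot W_2\big)\;\Longrightarrow\;\frac{\partial\operatorname{vec}(HW_2)}{\partial\,(\operatorname{vec}H)^\top} = W_2^\top\otimes I_{\operatorname{nrow}(H)}$$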

We could ask: what prevented us from deriving the partial derivative of a vectorized product of two matrices directly? Well, the advantages of this trick are that the derivatives are expressed with known matrices, and also that we don’t have to derive the formula ourselves (= less work).
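
As a quick sanity check, here is a small NumPy sketch of my own (not from the book) that verifies Formula 7 numerically; note that vec stacks columns, hence the order='F' below:

```python
import numpy as np

# Sanity check of Formula 7: vec(A·B·C) == (Cᵀ ⊗ A) · vec(B).
rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 5))

def vec(M):
    # The vec operator: stack the columns of M into one column vector.
    return M.reshape(-1, 1, order="F")

lhs = vec(A @ B @ C)
rhs = np.kron(C.T, A) @ vec(B)
print(np.allclose(lhs, rhs))  # True
```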

Step 6.

This step is pretty straightforward, as we reuse the same formula that we already used (Formula 6), this time applied to σ(Z1):
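
$$\frac{\partial\operatorname{vec}\sigma(Z_1)}{\partial\,(\operatorname{vec}Z_1)^\top} = \operatorname{Diag}\big(\operatorname{vec}\sigma'(Z_1)\big)$$

Now, the last step: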

Step 7.

To use our formula (8), we need W1 to be in the middle; to do so, this time we add a matrix product with the identity matrix to the right:
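
$$\operatorname{vec}(XW_1) = \operatorname{vec}\big(X\cdot W_1\cdot I_{\operatorname{ncol}(W_1)}\big)\;\Longrightarrow\;\frac{\partial\operatorname{vec}(XW_1)}{\partial\,(\operatorname{vec}W_1)^\top} = I_{\operatorname{ncol}(W_1)}\otimes X$$

Now that we have unrolled all the feed-forward side, we can take a look at the overall expression:

$$\frac{\partial\operatorname{vec}E}{\partial\,(\operatorname{vec}W_1)^\top} = \alpha I\cdot 2\operatorname{Diag}\big(\operatorname{vec}(\hat{Y}-Y)\big)\cdot I\cdot\operatorname{Diag}\big(\operatorname{vec}\sigma'(Z_2)\big)\cdot\big(W_2^\top\otimes I_{\operatorname{nrow}(H)}\big)\cdot\operatorname{Diag}\big(\operatorname{vec}\sigma'(Z_1)\big)\cdot\big(I_{\operatorname{ncol}(W_1)}\otimes X\big)$$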