Derivatives are Linear

Studying artificial neural networks involves math that is way above me, but I’ve been plowing through and learning the prerequisite math as I go. A few days ago, I became really excited when I realized that the linearity of differentiation makes differentiating dot products naturally simple.

Sums

This is one of the first rules we learn about taking derivatives, usually by observing examples and drawing the conclusion ourselves:

\[ \frac{d}{dx} \sum_i u_i(x) = \sum_i \frac{d}{dx} u_i(x) \]

where \( u_i(x) \) could be any differentiable function. In other words, the derivative of a sum is the sum of the derivatives. Yet for a long time, until that day, I didn’t recognize the rule when I worked on optimizations written this way, with the \(\Sigma\). I was optimizing the cost function, and I can’t believe it took me so long to see where its derivative comes from.
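To make the sum rule concrete, here is a minimal numerical sketch. The three terms and the `derivative` helper are assumptions chosen for illustration, and the central difference only approximates the true derivative.

```python
import math

def derivative(f, x, h=1e-6):
    """Central-difference estimate of f'(x); an approximation, not exact."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Hypothetical terms u_i(x); any differentiable functions would do.
terms = [math.sin, math.exp, lambda x: x ** 3]

def total(x):
    """The sum of all u_i(x)."""
    return sum(u(x) for u in terms)

x0 = 0.7
derivative_of_sum = derivative(total, x0)
sum_of_derivatives = sum(derivative(u, x0) for u in terms)

# The two numbers should agree up to floating-point noise.
print(derivative_of_sum, sum_of_derivatives)
```

Both values should also land close to the exact answer, \( \cos(0.7) + e^{0.7} + 3 \cdot 0.7^2 \).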

Dot Products

Let’s try to find the derivative of \(\vec{W} \cdot \vec{X} \) with respect to all the components of \(\vec{X}\). I did not try to tackle this directly as a whole. I first found the derivative with respect to a single component of \(\vec{X}\), since this operation generalizes to all of \(\vec{X}\)’s components.

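To continue with the theme of breaking ideas down into more elementary ones, let’s rewrite the original expression in terms of scalars, where \(w_j\) and \(x_j\) denote the components of \(\vec{W}\) and \(\vec{X}\):

\[ \vec{W} \cdot \vec{X} = \sum_j w_j x_j = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n \]

and take its derivative with respect to a single component \( x_i \):

\[ \frac{\partial}{\partial x_i} \sum_j w_j x_j = \sum_j w_j \frac{\partial x_j}{\partial x_i} = w_i \]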

If we do that for each of the components and arrange the results in the same order as the components of \(\vec{X}\), we get \( \vec{W} \). In other words, \( \frac{\partial}{\partial \vec{X}} (\vec{W} \cdot \vec{X}) = \vec{W} \).

What an elegant, simple answer! It works because finding the derivative is a linear operation and a dot product is a sum. Each component of \(\vec{X}\) is multiplied only by its corresponding component in \(\vec{W}\). For each derivative with respect to \(x_i\), the linearity of differentiation allows us to calculate the derivative of each term independently and then add them. Since only one term varies with \( x_i \), the derivatives of the other terms become zero.
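The result can be checked numerically. The sketch below is an illustration under assumptions: the vectors `W` and `X` are made-up values, and `gradient_wrt_x` is a hypothetical helper that nudges one component of the input at a time.

```python
def dot(w, x):
    """Dot product written as the scalar sum w_1*x_1 + ... + w_n*x_n."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x))

def gradient_wrt_x(w, x, h=1e-6):
    """Estimate the partial derivative of dot(w, x) for each component of x."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h   # nudge only component i up
        x_minus[i] -= h  # and down
        grad.append((dot(w, x_plus) - dot(w, x_minus)) / (2 * h))
    return grad

# Made-up example vectors.
W = [2.0, -1.0, 0.5]
X = [0.3, 4.0, -2.0]

grad = gradient_wrt_x(W, X)
print(grad)  # each entry should be close to the matching entry of W
```

Changing the values in `X` should leave the estimate unchanged, since the gradient depends only on \( \vec{W} \).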

If the derivative of a sum did not always equal the sum of derivatives, finding the derivative of a dot product would probably not be as trivial.

Derivatives of Odd and Even Functions

3Blue1Brown’s brilliant video on abstract vector spaces talks about representing functions as vectors and then drills in the point of this article by showing that differentiation is a linear transform of those vectors. This section uses the language of vectors to show (not rigorously) how the derivative of an even function is odd and the derivative of an odd function is even.

In 3Blue1Brown’s video we learn that functions can be treated like vectors. He shows that we can represent a polynomial as a vector of its coefficients, starting from the coefficient of \( x^0 \), the constant term. As hinted in the video, the polynomial is a dot product of this vector and \([1, x, x^2, x^3, x^4, \ldots, x^n ]\). Many non-polynomial functions can be written as a Taylor series, which behaves like a polynomial with infinitely many terms, so we can apply this reasoning to those functions as well. For example, the infinite list \( [0, 1, 0, -\frac{1}{3!}, 0, \frac{1}{5!}, 0, \ldots] \) represents \( \sin (x) \).
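As a rough sketch of this representation, the code below evaluates a coefficient list as a dot product with the powers of \( x \). The helper names `evaluate` and `sin_coefficients` are my own, and the sine list is cut off after a finite number of terms, so it only approximates \( \sin(x) \).

```python
import math

def evaluate(coefficients, x):
    """Dot product of the coefficient list with [1, x, x^2, ...]."""
    powers = [x ** k for k in range(len(coefficients))]
    return sum(c * p for c, p in zip(coefficients, powers))

def sin_coefficients(n_terms):
    """First n_terms Taylor coefficients of sin(x) around 0."""
    coefficients = []
    for k in range(n_terms):
        if k % 2 == 0:
            coefficients.append(0.0)  # even powers do not appear
        else:
            sign = -1.0 if (k // 2) % 2 else 1.0  # +, -, +, - on odd powers
            coefficients.append(sign / math.factorial(k))
    return coefficients

poly = [3, 2, 1]                 # 3 + 2x + x^2
poly_at_2 = evaluate(poly, 2)    # 3 + 4 + 4 = 11

sin_approx = evaluate(sin_coefficients(12), 0.5)
print(poly_at_2, sin_approx, math.sin(0.5))
```

With twelve coefficients, the truncated list should already match `math.sin(0.5)` to many decimal places.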

Odd polynomials have only odd-degree terms, so their coefficient lists have nonzero entries only at odd indices and zeroes at every even index. The opposite applies to even polynomials.

Let’s go with 3Blue1Brown’s way of taking a derivative by linear transformation of this list. The rule is to multiply each component of the list by its index and shift it one index left. Taking the derivative of a polynomial is equivalent to multiplying its list of coefficients by an infinite matrix. The derivative of an odd polynomial represented by \( [0, c_1, 0, c_3, 0, c_5, 0, \ldots] \) is the even polynomial represented by \( [d_0, 0, d_2, 0, d_4, 0, d_6,\ldots] \), where each \( c_n \) can be any number and \( d_{n-1} = n c_n \). The shift by one turns any even polynomial into an odd polynomial and vice versa.
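The multiply-and-shift rule is short enough to sketch directly. The function names and the example polynomial below are assumptions for illustration, and the lists are finite rather than infinite.

```python
def differentiate(coefficients):
    """Multiply each coefficient by its index, then shift one index left."""
    scaled = [k * c for k, c in enumerate(coefficients)]
    return scaled[1:]  # dropping index 0 performs the left shift

def is_odd(coefficients):
    """True when every even-indexed coefficient is zero."""
    return all(c == 0 for c in coefficients[0::2])

def is_even(coefficients):
    """True when every odd-indexed coefficient is zero."""
    return all(c == 0 for c in coefficients[1::2])

odd_poly = [0, 4, 0, -2, 0, 7]          # 4x - 2x^3 + 7x^5
first = differentiate(odd_poly)         # 4 - 6x^2 + 35x^4
second = differentiate(first)           # -12x + 140x^3

print(first, is_even(first))
print(second, is_odd(second))
```

Here `first` comes out as `[4, 0, -6, 0, 35]`, an even polynomial, and differentiating again gives `[0, -12, 0, 140]`, which is odd again.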

We could have reasoned about this without vectors and simply stated that the power of each term decrements by 1, so that any odd term becomes even and vice versa. But by reasoning in terms of vectors and linear transforms, I’ve used the language of linear algebra to show how deeply the linearity of differentiation connects it to that subject.

Another Acknowledgement

Sean Don critiqued this essay, pointed out subtle errors, and answered my uncertainties. His past encouragement and advice also set me on a better track to write this essay.