18th May 2008, 09:01 pm

The post Beautiful differentiation showed how easily and beautifully one can construct an infinite tower of derivative values in Haskell programs, while computing plain old values. The trick (from Jerzy Karczmarczuk) was to overload numeric operators to operate on the following (co)recursive type:

data Dif b = D b (Dif b)

This representation, however, works only when differentiating functions from a scalar (one-dimensional) domain, i.e., functions of type a -> b for a scalar type a. The reason for this limitation is that only in those cases can the type of derivative values be identified with the type of regular values.

Consider a function f :: (R,R) -> R, where R is, say, Double. The value of f at a domain value (x,y) has type R, but the derivative of f consists of two partial derivatives. Moreover, the second derivative consists of four partial second-order derivatives (or three, depending on how you count). A function f :: (R,R) -> (R,R,R) also has two partial derivatives at each point (x,y), each of which is a triple. That pair of triples is commonly written as a two-by-three matrix.

Each of these situations has its own derivative shape and its own chain rule (for the derivative of function compositions), using plain-old multiplication, scalar-times-vector, vector-dot-vector, matrix-times-vector, or matrix-times-matrix. Second derivatives are more complex and varied.

How many forms of derivatives and chain rules are enough? Are we doomed to work with a plethora of increasingly complex types of derivatives, as well as the diverse chain rules needed to accommodate all compatible pairs of derivatives? Fortunately, not. There is a single, simple, unifying generalization. By reconsidering what we mean by a derivative value, we can see that these various forms are all representations of a single notion, and all the chain rules mean the same thing on the meanings of the representations.

This blog post is about that unifying view of derivatives.

Edits:

2008-05-20: There are several comments about this post on reddit.

2008-05-20: Renamed derivative operator from D to deriv to avoid confusion with the data constructor for derivative towers.

2008-05-20: Renamed linear map type from (:->) to (:-*) to make it visually closer to a standard notation.

What’s a derivative?

To get an intuitive sense of what’s going on with derivatives in general, let’s look at some examples. If you already know about calculus on manifolds, you might want to skip ahead.

One dimension

Start with a simple function on real numbers:

f1 :: R -> R
f1 x = x^2 + 3*x + 1

Writing the derivative of a function f as deriv f, let’s now consider the question: what is deriv f1? We might say that

deriv f1 x = 2*x+3

so, e.g., deriv f1 5 = 13. In other words, f1 is changing 13 times as fast as its argument, when its argument is passing 5.

Rephrased yet again, if dx is a very tiny number, then f1 (5+dx) - f1 5 is very nearly 13 * dx. If f1 maps seconds to meters, then deriv f1 5 is 13 meters per second. So already, we can see that the range of f1 (meters) and the range of deriv f1 (meters/second) disagree.
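The "very nearly" claim is easy to check numerically. Here is a minimal sketch, assuming R is Double as suggested in the introduction. The names derivF1 and changeF1 are introduced here only for illustration and are not from the original code.

```haskell
type R = Double

f1 :: R -> R
f1 x = x^2 + 3*x + 1

-- The derivative of f1, as stated in the text.
derivF1 :: R -> R
derivF1 x = 2*x + 3

-- Hypothetical helper: the actual change in f1 for a step dx away from x.
changeF1 :: R -> R -> R
changeF1 x dx = f1 (x + dx) - f1 x
```

Expanding f1 by hand gives changeF1 5 dx = 13*dx + dx^2, so the gap between the true change and 13 * dx shrinks quadratically as dx gets smaller.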

Two dimensions in and one dimension out

As a second example, consider a two-dimensional domain:

f2 :: (R,R) -> R
f2 (x,y) = 2*x*y + 3*x + 5*y + 7

Again, let’s consider some units, to get a guess of what kind of thing deriv f2 (x,y) really is. Suppose that f2 measures altitude of terrain above a plane, as a function of the position in the plane. (So f2 is a “height field”.) You can guess that deriv f2 (x,y) is going to have something to do with how fast the altitude is changing, i.e. the slope, at (x,y). But there isn’t a single slope. Instead, there’s a slope for every possible compass direction (a hiker’s degrees of freedom).

Now consider the conventional math answer to what is deriv f2 (x,y). Since f2 has a two-dimensional domain, it has two partial derivatives, and its derivative is commonly written as a pair of the two partials:

deriv f2 (x,y) = (2*y+3, 2*x+5)

In our example, these two pieces of information correspond to two of the possible slopes. The first is the slope if heading directly east, and the second if directly north (increasing x and increasing y, respectively).

What good does it do our hiker to be told just two of the infinitude of possible slopes at a point? The answer is perhaps magical: for well-behaved terrains, these two pieces of information are enough to calculate all (infinitely many) slopes, with just a bit of math. Every direction can be described as partly east and partly north (perhaps negatively for westish and southish directions). Given a direction angle ang (where east is zero and north is 90 degrees), the east and north components are cos ang and sin ang, respectively. When heading in the direction ang, the slope will be a weighted sum of the east-going slope and the north-going slope, where the weights are the east and north components (cos ang and sin ang).

Instead of angles, our hiker may prefer thinking directly about the north and east components of a tiny step from the position (x,y). If the step is small enough and lands dx feet to the east and dy feet to the north, then the change in altitude, f2 (x+dx,y+dy) - f2 (x,y), is very nearly equal to (2*y+3)*dx + (2*x+5)*dy. If we use (<.>) to mean dot (inner) product, then this change in altitude is deriv f2 (x,y) <.> (dx,dy).
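Both views of the hiker's slope, by angle and by step, can be sketched in a few lines. This is only an illustration, assuming R is Double. The names derivF2, slopeToward and changeF2 are hypothetical, the (<.>) here is a hand-rolled dot product on pairs rather than a library operator, and the angle is taken in radians instead of degrees.

```haskell
type R = Double

f2 :: (R, R) -> R
f2 (x, y) = 2*x*y + 3*x + 5*y + 7

-- The pair of partial derivatives given in the text.
derivF2 :: (R, R) -> (R, R)
derivF2 (x, y) = (2*y + 3, 2*x + 5)

-- Dot (inner) product on pairs, standing in for the (<.>) of the text.
infixl 7 <.>
(<.>) :: (R, R) -> (R, R) -> R
(a, b) <.> (c, d) = a*c + b*d

-- Slope when heading in direction ang (radians; east = 0, north = pi/2):
-- the east and north slopes weighted by cos ang and sin ang.
slopeToward :: (R, R) -> R -> R
slopeToward p ang = derivF2 p <.> (cos ang, sin ang)

-- Hypothetical helper: the actual change in altitude for a small step.
changeF2 :: (R, R) -> (R, R) -> R
changeF2 (x, y) (dx, dy) = f2 (x + dx, y + dy) - f2 (x, y)
```

For this particular f2, expanding by hand shows that changeF2 p (dx,dy) differs from derivF2 p <.> (dx,dy) by exactly 2*dx*dy, which is negligible for small steps.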

From this second example, we can see that the derivative value is not a range value, but also not a rate-of-change of range values. It’s a pair of such rates with the know-how to use those rates to determine output changes.

Two dimensions in and three dimensions out

Next, imagine moving around on a surface in space, say a torus, and suppose that the surface has grid marks to define a two-dimensional parameter space. As our hiker travels around in the 2D parameter space, his position in 3D space changes accordingly, more flexibly than just an altitude. This situation corresponds to a function from 2D to 3D:

f3 :: (R,R) -> (R,R,R)

At any position (s,t) in the parameter space, and for every choice of direction through parameter space, each of the coordinates of the position in 3D space has a rate of change. Again, if the function is mathematically well-behaved (differentiable), then all of these rates of change can be summarized in two partial derivatives. This time, however, each partial derivative has components in X, Y, and Z, so it takes six numbers to describe the 3D velocities for all possible directions in parameter space. These numbers are usually written as a 3-by-2 matrix m (the Jacobian of f3). Given a small parameter step (ds,dt), the resulting change in 3D position is very nearly equal to the product of the derivative matrix and the step vector, i.e., m `timesVec` (ds,dt).
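To make the matrix case concrete, here is a sketch for one possible torus. The text gives only the type of f3, so the parameterization, the radii majorR and minorR, the Mat32 representation (three rows of two entries), and the helpers jacobianF3 and changeF3 are all assumptions made for illustration.

```haskell
type R = Double

-- A 3-by-2 matrix stored as three rows of two entries (an assumed layout).
type Mat32 = ((R, R), (R, R), (R, R))

-- Hypothetical torus radii.
majorR, minorR :: R
majorR = 2
minorR = 1

-- One standard torus parameterization (an assumption, not from the post).
f3 :: (R, R) -> (R, R, R)
f3 (s, t) = (w * cos s, w * sin s, minorR * sin t)
  where w = majorR + minorR * cos t

-- Jacobian of the f3 above: row i holds the partials of output i
-- with respect to s and t.
jacobianF3 :: (R, R) -> Mat32
jacobianF3 (s, t) =
  ( (-w * sin s, -minorR * sin t * cos s)
  , ( w * cos s, -minorR * sin t * sin s)
  , ( 0        ,  minorR * cos t) )
  where w = majorR + minorR * cos t

-- Matrix-times-vector for the representation above.
timesVec :: Mat32 -> (R, R) -> (R, R, R)
timesVec ((a, b), (c, d), (e, f)) (ds, dt) =
  (a*ds + b*dt, c*ds + d*dt, e*ds + f*dt)

-- Hypothetical helper: the actual change in 3D position for a small step.
changeF3 :: (R, R) -> (R, R) -> (R, R, R)
changeF3 (s, t) (ds, dt) = (x1 - x0, y1 - y0, z1 - z0)
  where (x0, y0, z0) = f3 (s, t)
        (x1, y1, z1) = f3 (s + ds, t + dt)
```

For a small step, changeF3 p step and jacobianF3 p `timesVec` step should agree component by component up to terms of second order in the step.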

A common perspective

The examples above use different representations for derivatives: scalar numbers, a vector (pair of numbers), and a matrix. Common to all of these representations is the ability to turn a small step in the function’s domain into a resulting step in the range.

In f1, the (scalar) derivative c really means (c *), meaning multiply by c.

In f2, the (vector) derivative v means (v <.>).

In f3, the (matrix) derivative m means (m `timesVec`).

So, the common meaning of these derivative representations is a function, and not just any function, but a linear function, often called a “linear map” or “linear transformation”. For a function lf to be linear in this context means that

lf (u+v) == lf u + lf v, and

lf (c*v) == c * lf v, for scalar values c.
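The two linearity laws can be spot-checked on sample inputs. The sketch below interprets the scalar and vector representations as functions and tests the laws for maps out of pairs. All names in it (scalarMap, vectorMap, addV, scaleV, additiveAt, homogeneousAt) are hypothetical. It uses exact equality, which is only reasonable for samples whose arithmetic is exact in Double, such as small integers.

```haskell
type R = Double

-- A scalar derivative c, read as the function "multiply by c".
scalarMap :: R -> (R -> R)
scalarMap c = (c *)

-- A vector derivative v, read as the function "dot with v".
vectorMap :: (R, R) -> ((R, R) -> R)
vectorMap (a, b) = \(dx, dy) -> a*dx + b*dy

-- Vector addition and scaling on pairs.
addV :: (R, R) -> (R, R) -> (R, R)
addV (a, b) (c, d) = (a + c, b + d)

scaleV :: R -> (R, R) -> (R, R)
scaleV c (a, b) = (c*a, c*b)

-- First law, lf (u+v) == lf u + lf v, at one pair of sample inputs.
additiveAt :: ((R, R) -> R) -> (R, R) -> (R, R) -> Bool
additiveAt lf u v = lf (u `addV` v) == lf u + lf v

-- Second law, lf (c*v) == c * lf v, at one sample scalar and vector.
homogeneousAt :: ((R, R) -> R) -> R -> (R, R) -> Bool
homogeneousAt lf c v = lf (c `scaleV` v) == c * lf v
```

With these definitions, additiveAt (vectorMap (7,7)) (1,2) (3,-4) holds, while a non-linear function such as \(x,y) -> x*y fails the same check. Passing at a few samples does not prove linearity, but a single failure disproves it.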

Now what about the different chain rules, saying to combine derivative values via various kinds of products (scalar/scalar, scalar/vector, vector/vector dot, matrix/vector)? Each of these products implements the same abstract notion, which is composition of linear maps.
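As a small illustration of that claim, the sketch below checks two of these products against plain function composition on sample values. The 2-by-2 matrix type M22 and the helpers timesVec, timesMat, scaleVec and (<.>) are assumptions made for this example. The sample values are small integers, so exact equality on Double is safe here.

```haskell
type R = Double
type V2 = (R, R)
type M22 = (V2, V2)  -- two rows of two entries (an assumed layout)

-- Matrix-times-vector: the linear map a matrix represents.
timesVec :: M22 -> V2 -> V2
timesVec ((a, b), (c, d)) (x, y) = (a*x + b*y, c*x + d*y)

-- Matrix-times-matrix.
timesMat :: M22 -> M22 -> M22
timesMat ((a, b), (c, d)) ((e, f), (g, h)) =
  ((a*e + b*g, a*f + b*h), (c*e + d*g, c*f + d*h))

-- Dot product and scalar-times-vector on pairs.
infixl 7 <.>
(<.>) :: V2 -> V2 -> R
(a, b) <.> (c, d) = a*c + b*d

scaleVec :: R -> V2 -> V2
scaleVec c (a, b) = (c*a, c*b)

-- The matrix product represents the composition of the two linear maps.
matrixCase :: M22 -> M22 -> V2 -> Bool
matrixCase m n v = (m `timesMat` n) `timesVec` v == (timesVec m . timesVec n) v

-- Scalar-times-vector represents "multiply by c" composed with "dot with w".
scalarVectorCase :: R -> V2 -> V2 -> Bool
scalarVectorCase c w v = (scaleVec c w <.> v) == ((c *) . (w <.>)) v
```

For example, matrixCase ((1,2),(3,4)) ((0,1),(5,-2)) (2,3) compares (11,25) computed both ways.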

What about Dif ?

Now let’s return to the derivative towers we used before:

data Dif b = D b (Dif b)

As I mentioned above, this representation only works when derivative values can be represented just like range values. That punning of derivative values with range values works when the domain type is one dimensional. For functions over higher-dimensional domains, we’ll have to use a different representation.

Assume a type of linear functions from a to b:

type a :-* b = . . .

(In Haskell, type constructors beginning with a colon are used infix.) Since the derivative type depends on domain as well as range, our derivative tower will have two type parameters instead of one. To make definitions prettier, I’ll change derivative towers to an infix operator as well.

data a :> b = D b (a :> (a :-* b))

An infinitely differentiable function is then one that produces a derivative tower:

type a :~> b = a -> (a:>b)
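To see these types in action, here is a rough sketch of a tower for the identity function on Double. The post deliberately leaves a :-* b unspecified, so the newtype wrapper around plain functions below is only a stand-in assumption and not the author's representation. The Zero class and the helpers zeroD, idD, apply, value, firstDeriv and secondDeriv are likewise hypothetical.

```haskell
{-# LANGUAGE TypeOperators #-}

-- ASSUMPTION: a stand-in for the unspecified linear map type.
-- Nothing here enforces linearity; it only wraps a function.
newtype a :-* b = Linear (a -> b)

-- The tower types, as given in the text.
data a :> b = D b (a :> (a :-* b))

type a :~> b = a -> (a :> b)

-- Hypothetical class of types with a zero element.
class Zero v where
  zero :: v

instance Zero Double where
  zero = 0

-- The zero linear map sends every input to zero.
instance Zero b => Zero (a :-* b) where
  zero = Linear (const zero)

-- A tower that is zero at every level (uses polymorphic recursion,
-- so the type signature is required).
zeroD :: Zero b => a :> b
zeroD = D zero zeroD

-- Identity: the value is x, the first derivative is the identity
-- linear map, and all higher derivatives are zero.
idD :: Double :~> Double
idD x = D x (D (Linear id) zeroD)

-- Unwrap the stand-in linear map.
apply :: (a :-* b) -> a -> b
apply (Linear f) = f

-- Accessors for the first few levels of a tower.
value :: (a :> b) -> b
value (D b _) = b

firstDeriv :: (a :> b) -> (a -> b)
firstDeriv (D _ (D f _)) = apply f

secondDeriv :: (a :> b) -> (a -> a -> b)
secondDeriv (D _ (D _ (D f _))) = \u v -> apply (apply f u) v
```

Note how the type of each level grows: the first derivative is a linear map into b, the second is a linear map into linear maps, and so on, which is why the single-parameter Dif cannot express it.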

What’s next?

Perhaps now you’re wondering:

Are these lovely ideas workable in practice?

What happens to the code from Beautiful differentiation?

What use are derivatives, anyway?

These questions and more will be answered in upcoming installments.
