I've been mucking about with real analysis lately. One of the things I've always hated about real analysis (and complex analysis, and other calculus) is the unrepentant lack of types. It's easy to get confused about what's going on and what goes where when you think everything is a real number (or complex number, or tensors thereof). In particular, because of this "everything is a real array" assumption, libraries for performing optimization etc end up with utterly inscrutable APIs and other nastiness like needing to manually flatten your parameters into a single array. So this has got me thinking, what should the API for optimization libraries look like? Which in turn brings me back to an age-old conundrum I've toyed around with before, namely: what is the type of the derivative operator?

From the "everything is a real number" perspective we'd say deriv : (ℝ ↝ ℝ) → ℝ → ℝ , where the A ↝ B function space are "nice" functions (e.g., continuous, smooth, analytic, whatever). However, this only works out because we're conflating a number of important differences. For example, for any affine space A we have the difference space ∆A — that is, values in ∆A denote differences or differentials of values in A (e.g., for any a : A and b : A we have a-b : ∆A ). However, it turns out that there's a bijection between ℝ and ∆ℝ . It also turns out that the difference of arrays, ∆(ℝn) , is isomorphic to an array of differences, (∆ℝ)n . Putting these together we get the common conflation between points ( ℝn ) and vectors ( ∆(ℝn) ). For another example, consider linear transformations, written A ⊸ B . We have a bijection between linear transformations ( ℝm ⊸ ℝn ) and matrices ( ℝn×m ), and hence more specifically a bijection between ℝ and ℝ ⊸ ℝ .

So here's what I'm currently thinking:

deriv : (F X ↝ Y) → F X → F(∆X ⊸ ∆Y)

For now, just ignore F ; instantiate it as the identity functor. Thus, the total derivative of f : X ↝ Y at the point x : X is a linear transformation from minor perturbations about x to minor perturbations about f x . We can think of ∆X ⊸ ∆Y as being the "slope" of this mapping between minor perturbations. In particular, when X=ℝ and Y=ℝ , we get a bijection between ∆ℝ ⊸ ∆ℝ and ℝ , so this coincides with the traditional notion of the slope of a one-dimensional curve.

Now, let's think about the gradient rather than the total derivative. Assume that F is a "nice" functor (i.e., representable, traversable, etc). Because the functor is representable, we have an isomorphism between F A and I → A (natural in A ), where I is the type of positions/indices in F . Thus, the gradient of a function g : F X ↝ Y at the point z : F X is essentially a function from F -indices, i : I , to the i -th partial derivative of g at z . Why partial derivative? Because the linear transformation only takes ∆X as an argument —not ∆(F X) —, thus we can only modify a single "scalar" at a time, and the other scalars in z must remain fixed.

Edit 2014.02.17 (#1): Actually, things are a bit more complicated than the above. For example, consider a function f : ℝn ↝ ℝ . The derivative of f with respect to its vector of inputs is not a vector of partial derivatives— rather, it's a covector of partial derivatives! (i.e., deriv (f : ℝn×1 ↝ ℝ) : ℝn×1 → ℝ1×n .) Thus, we should really have that the return type of deriv is some Fop(∆X ⊸ ∆Y) where Fop is some sort of "dual" of F . It's not clear a priori which notion of "duality" is in play here, however.

Edit 2014.02.17 (#2): And here's what I'm currently thinking about how to incorporate the Jacobian into the above. Consider this particular instance of the derivative operator, deriv : (F X ↝ G Y) → F X → Fop(∆X ⊸ ∆(G Y)) . When G is some type of vectors (e.g., G Y = Yn for some fixed n ), we get ∆(G Y) = G(∆Y) as mentioned before. And let's assume ∆X ⊸ G(∆Y) = G(∆X ⊸ ∆Y) ; presumably this follows immediately by linearity, though I haven't verified that yet. And finally, in our standard real analysis we get that G∘Fop and Fop∘G are the same— that is, vectors-of-covectors and covectors-of-vectors are essentially the same thing; they're just the row-major and column-major representations of matrices, respectively. And then, putting all these together we get that the original return type Fop(∆X ⊸ ∆(G Y)) is isomorphic to G(Fop(∆X ⊸ ∆Y)) . So when X=ℝ and Y=ℝ we get the expected deriv : (ℝm ↝ ℝn) → ℝm → ℝn×m . Now, it remains to show how this extends to other choices of F and G ...

Edit 2014.02.17 (#3): It was pointed out on reddit that the above is "well known"— at least for those who are familiar with differentiable manifolds and tangent spaces. Now, I don't know much about manifolds, but it certainly wasn't well-known to me— which, in itself, says a lot about how little this knowledge has spread to the rest of mathematics. I'm glad to hear there's some followup reading to be done here, but I still don't think the necessary work has been done in terms of turning this into a decent API for optimization libraries.