How does this work? Let's start with forward-mode AD for scalar expressions. The usual way to do forward-mode AD is with dual numbers:

```haskell
-- Two fields, so this must be `data` rather than `newtype`:
data D a = D a a

instance Num a => Num (D a) where
  D f f' + D g g' = D (f + g) (f' + g')
  D f f' * D g g' = D (f * g) (f' * g + f * g')
```
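As a quick sanity check, here is the dual-number arithmetic as a self-contained, runnable sketch (with `Show` and `Eq` derived for inspection, and the `Num` methods the example doesn't need stubbed out), differentiating p(x) = x² + x³ at x = 3 by seeding the dual part with 1:

```haskell
-- A dual number carries a value and its derivative in lockstep.
data D a = D a a deriving (Show, Eq)

instance Num a => Num (D a) where
  D f f' + D g g' = D (f + g) (f' + g')
  D f f' * D g g' = D (f * g) (f' * g + f * g')
  negate (D f f') = D (negate f) (negate f')
  fromInteger n   = D (fromInteger n) 0      -- constants have zero derivative
  abs    = error "abs: not needed here"
  signum = error "signum: not needed here"

-- p(x) = x^2 + x^3, built only from (+) and (*).
poly :: Num a => a -> a
poly x = x*x + x*x*x

-- poly (D 3 1) == D 36 33: p(3) = 36 and p'(3) = 2*3 + 3*9 = 33.
```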

Instead of keeping track of the derivative via a number, we can keep track of it with a function (in mathematical terms, we've moved from the usual derivative to the Fréchet derivative).

```haskell
type (-+>) a b = a -> b

newtype D a b = D (a -> (b, a -+> b))
```

In this setting, the automatically differentiated version of `square x = x*x` would be `square = D (\x -> (x*x, scale (2*x)))`. We can also share work between the function and its derivative:
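The post doesn't define `scale`; in the scalar case it can only reasonably be multiplication by the stored coefficient. A self-contained sketch under that assumption:

```haskell
{-# LANGUAGE TypeOperators #-}

type (-+>) a b = a -> b

newtype D a b = D (a -> (b, a -+> b))

-- Assumed definition: `scale c` is the linear map dx |-> c * dx.
scale :: Num a => a -> (a -+> a)
scale c = \dx -> c * dx

square :: Num a => D a a
square = D (\x -> (x*x, scale (2*x)))

-- Evaluating by hand: applying the linear map to 1 recovers the
-- ordinary derivative, e.g. at x = 3 we get the pair (9, 6).
```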

```haskell
exp = D (\x -> let y = exp x in (y, scale y))
```

We can compose `D` values with their Category instance (exercise for the reader). Once we're done composing them, extracting a numerical answer is straightforward:

```haskell
withDerivative (D f) x = let (y, f') = f x in (y, f' 1.0)
```
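For completeness, here is one way the Category exercise could go, together with `withDerivative` in action. This is a sketch of a plausible instance, not necessarily the one the post has in mind; composition runs the first function, feeds its result to the second, and composes the two linear maps, which is exactly the chain rule:

```haskell
{-# LANGUAGE TypeOperators #-}
import Prelude hiding (id, (.))
import Control.Category

type (-+>) a b = a -> b

newtype D a b = D (a -> (b, a -+> b))

instance Category D where
  id = D (\a -> (a, \da -> da))
  D g . D f = D (\a -> let (b, f') = f a
                           (c, g') = g b
                       in (c, g' . f'))   -- chain rule: compose linear maps

-- Assumed scalar `scale`: multiplication by the coefficient.
scale :: Num a => a -> (a -+> a)
scale c dx = c * dx

square :: Num a => D a a
square = D (\x -> (x*x, scale (2*x)))

withDerivative :: D Double b -> Double -> (b, b)
withDerivative (D f) x = let (y, f') = f x in (y, f' 1.0)

-- withDerivative (square . square) 3 evaluates x^4 at 3,
-- giving the pair (81, 108), since d/dx x^4 = 4x^3 = 108 there.
```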

We can now extend this trick to vector-valued functions by replacing `scale x` with `matrixMultiply x`, but there's an added complication: how the chain of matrix products is associated has a large impact on performance. Suppose we have some Jacobians already computed in a staged or lazy way:

```haskell
f' = A.use $ A.fromList (A.Z A.:. (2 :: Int)  A.:. (10 :: Int))  [0 :: Double ..]
g' = A.use $ A.fromList (A.Z A.:. (10 :: Int) A.:. (3 :: Int))   [0 :: Double ..]
h' = A.use $ A.fromList (A.Z A.:. (3 :: Int)  A.:. (50 :: Int))  [0 :: Double ..]
j' = A.use $ A.fromList (A.Z A.:. (50 :: Int) A.:. (80 :: Int))  [0 :: Double ..]
k' = A.use $ A.fromList (A.Z A.:. (80 :: Int) A.:. (100 :: Int)) [0 :: Double ..]
```

and we want to compute the Jacobian of \(f \circ g \circ h \circ j \circ k\). The chain rule tells us what the answer is: we just matrix-multiply everything. It doesn't tell us how to do that efficiently, however. If we have many inputs and few outputs, we should multiply from left to right (reverse mode); otherwise, going from right to left is the better choice (forward mode). We can do this elegantly with the `compForward` and `compReverse` combinators. (Note the analogy between passing `id` in the matrix case and `1.0` in the scalar case from earlier, and which matrix gets passed as the final argument.)
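The post doesn't show `compForward` and `compReverse`; here is a pair of definitions consistent with how they are used below, sketched with plain nested lists rather than Accelerate arrays (the `Matrix` type and `mul` are stand-ins of my own):

```haskell
type Matrix = [[Double]]

-- Stand-in dense matrix product.
mul :: Matrix -> Matrix -> Matrix
mul a b = [[sum (zipWith (*) row col) | col <- cols] | row <- a]
  where cols = foldr (zipWith (:)) (repeat []) b   -- transpose of b

-- Forward mode associates the chain f' g' h' j' k' from the input
-- side: f' (g' (h' (j' k'))).
compForward :: Matrix -> (Matrix -> Matrix) -> (Matrix -> Matrix)
compForward a f = \b -> a `mul` f b

-- Reverse mode associates from the output side: ((((f' g') h') j') k').
compReverse :: Matrix -> (Matrix -> Matrix) -> (Matrix -> Matrix)
compReverse a f = \b -> f (b `mul` a)

-- The `id` seeding each chain is the ordinary function id, playing the
-- role the 1.0 played in the scalar withDerivative earlier.
```

Either association computes the same product; only the cost differs, which is what the timings below demonstrate.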

```
λ> run $ (compForward f' . compForward g' . compForward h' . compForward j' $ id) k'
Matrix (Z :. 2 :. 100) [...]
(4.70 secs, 8,707,288,416 bytes)
```

```
λ> run $ (compReverse g' . compReverse h' . compReverse j' . compReverse k' $ id) f'
Matrix (Z :. 2 :. 100) [...]
(0.19 secs, 188,243,768 bytes)
```

It doesn't help in this case, but we can support mixed-mode computation easily: anytime you want to switch modes, just feed the comp train an `id` and start a new one.

```
λ> run $ (compForward f' . compForward g' $ id) . (compReverse j' . compReverse k' $ id) $ h'
Matrix (Z :. 2 :. 100) [...]
(0.47 secs, 688,244,264 bytes)
```

That's the differentiation; all that's left is the automatic part.