This is an expanded transcript of a talk I gave at the London Ruby User Group . A video version is below, and the slides and code are available.

Finding the derivative of a mathematical function on a computer can be difficult, but there’s a clever trick that makes it easy: first write a program that computes the function, then execute it under a non-standard interpretation of its values and operations. In this article I’ll show you how that works in Ruby.

Initially its speed is zero, but then the curve gets steeper, and then even steeper as its speed increases.

The red car’s distance function makes a straight line. Let’s say the slope is one metre per second: for every one second you go across, it goes one metre up. If the car travels twice as fast, the line is steeper; we can call that slope two metres per second. And if it travels half as fast, we get a less steep line, say half a metre per second. So steeper means faster, shallower means slower.

We can see visually how fast a function is changing by graphing it and looking at the slope of the line:

We know how far the green car has travelled at each point in time, but differentiation can tell us how fast it’s going at each point in time, which is otherwise not obvious:

If we take a distance function and differentiate it, we get another function that tells us how fast the distance is changing, which in physics we call “speed”.

Differentiation is the process of finding out how fast a function’s result is changing, and to keep this article short, you’ll just have to believe me that this is a useful thing to be able to do on a computer.

If we measure the green car’s distance, it turns out it’s the time multiplied by itself:

If we periodically measure the red car’s distance we can get a sense of how it relates to the current time, and in this case we can see that the distance is actually equal to how many seconds it’s been travelling for:

If we plot time horizontally and distance vertically, the function of the red car’s distance over time looks like a straight line:

One relationship between time and distance might be “cruising along the road at a constant speed”, like the red car above. A different relationship is “pulling away from stationary”, like the green car — it starts slowly and speeds up. Each car has its own function that relates the distance it’s travelled to the current time.

A function really just means a relationship between two quantities — for example, the relationship between the current time and the distance travelled by a car.

“Differentiation” is a mathematical word for something we do to functions, so first we need to talk about what a function is.

Automatic differentiation is a clever trick that makes it easy to do differentiation on a computer. It’s a beautiful fusion of maths and computer programming, and I want to tell you about it so you can enjoy how satisfying it is.

This is easier and faster than the symbolic method, but it’s only approximate, and dividing by tiny quantities makes it susceptible to rounding errors.

With this technique we don’t have a simple expression that tells us the speed directly, but we do have an algorithm that lets us approximate it:

They’re extremely close to the exact answers we got from doing it symbolically.

To get the rough speed at a particular time, we pick a really short amount of time — say, 0.01 seconds — then calculate how much further the car travels in that time, and divide that distance by the amount of time elapsed . We don’t need to reimplement this for every function we want to differentiate; we can make a higher order function to do it , or do it with metaprogramming or whatever.

The slope of that line doesn’t quite match the curve, so it’s only an approximation. On the other hand, if we choose an even closer value the approximation gets more accurate. It’ll never be perfect, but if we pick a small enough distance between the two values, the line between them matches the slope of the curve quite closely:

The idea of numerical differentiation is that we can guess the rate of change at a particular value by picking a nearby value and looking at the slope of the line that connects them:

The other traditional strategy is called numerical differentiation . Instead of representing the function’s expression as a data structure, we implement it directly as a program.

So the big picture is: from the expression for distance, the computer uses some rules to work out the expression for speed, and then evaluates it for different inputs. That gives accurate answers but it’s difficult to do. Also, in general, the size of the differentiated expression can be much larger than the size of the original expression, so it takes longer to evaluate.

Now we know that speed is 2 × time, we can compute the speeds:

So time to the power of two differentiates to two multiplied by time to the power of one, which is just 2 × time.

Firstly, we teach the computer that “time × time” can be written as “time²” — time to the power of two. And then one of the differentiation rules, the elementary power rule , says: something to the power of n differentiates to n multiplied by that something to the power of n - 1:

To do symbolic differentiation, we represent the expression as a data structure inside the computer — as a tree of symbols — and get the computer to manipulate that structure according to all those differentiation rules and the rules of basic algebra .

If you did any calculus at school, you’ll know that mathematicians have a load of rules for how to do these manipulations. There’s a whole Wikipedia page full of them: the constant factor rule, the sum rule, the subtraction rule, the product rule, the chain rule, the inverse function rule, the elementary power rule, and so on.

The first is called symbolic differentiation. The idea is to start with the expression of the function you want to differentiate — for example, “time × time” — and work out an expression for the differentiated function.

So if we know these distance functions, how do we get a computer to calculate the speeds? There are two traditional ways of doing it.

Automatic differentiation

We’ve seen the traditional strategies: symbolic and numerical differentiation. Automatic differentiation is a less well-known technique. It’s almost as accurate as symbolic differentiation but as fast and easy as numerical.

The big idea is to calculate a function’s rate of change and its value all at once. Rather than putting a time into our distance function and getting a distance out, we should put in a time and a rate of change — for example, we specify that time is changing at one second per second — and get out a distance and a rate of change — for example, we get the result that the distance is changing at six metres per second.

Dual numbers One way of doing this is to invent a new kind of number that can represent a value and its rate of change together, and then make our implementations of functions work with these new numbers. So we need to design and then implement a sort of two-dimensional number that can store both a value and its rate of change and operate on them simultaneously. One kind of two-dimensional number you might already know about is the complex numbers, which were invented to solve a similar problem. They have two components, a real part and an imaginary part, but they can be added and multiplied and otherwise used like normal numbers. The imaginary part of a complex number is a multiple of the invented number i — it’s just a symbol, it doesn’t represent a real quantity. Our two-dimensional numbers will use a similar idea, but they’re not complex numbers, they’re called dual numbers. They have a real part and a dual part, and instead of i we use the invented number ε (the Greek letter epsilon). It’s just a symbol again, but you can think of ε as representing a very very tiny number, almost zero but not zero — it’s called an infinitesimal. We’ll use this dual part to represent the rate at which a number is changing. ε is so tiny that it doesn’t contribute to the real value of the number at all, but we can say a number with 10ε is changing twice as fast as a number with 5ε. So let’s just make a boring class for dual numbers, with accessors for the real and dual parts, and a string representation: class DualNumber attr_accessor :real, :dual def initialize(real:, dual:) self.real = real self.dual = dual end def to_s [real, (dual < 0 ? '-' : '+'), dual.abs, 'ε'].join end def inspect "(#{to_s})" end end And add a convenience method to Kernel for converting normal numbers into dual numbers, like the Complex and Rational classes have: module Kernel def DualNumber(real, dual = 0) case real when DualNumber real else DualNumber.new(real: real, dual: dual) end end end If we don’t specify a rate of change then it defaults to zero. That gives us a dual number that represents a constant — a value that’s not changing at all. So for the red car, where the distance is just equal to the time, we already have something that works. We make a value three with rate of change one, and when we call the distance function with it, we get out three with rate of change one — that means three metres, one metre per second: def distance(time:) time end >> time_now = DualNumber(3, 1) => (3+1ε) >> distance_now = distance(time: time_now) => (3+1ε) >> distance_now.real => 3 >> distance_now.dual => 1 And that’s the right answer for this very simple function where distance is the same as time. But what about when the function does some operations on the input value?

Adding and multiplying If you’ve worked with complex numbers you might know that we can define how to add and multiply them, and so on, just by doing normal algebra with their parts. But to do that we need to know how i behaves, and by design the rule is that i × i = -1. That choice makes complex numbers behave in a certain way; they’re good for representing rotations and oscillations and things like that in physics and electronics. We can add together two complex numbers by just adding all their parts separately and grouping the multiples of i together: (a + bi) + (c + di) = a + c + bi + di = (a + c) + (b + d)i And when we multiply two complex numbers we multiply out the brackets and group the multiples of i: (a + bi) × (c + di) = (a × c) + (a × di) + (bi × c) + (bi × di) = ac + (ad + bc)i + bdi² = ac + (ad + bc)i - bd = (ac - bd) + (ad + bc)i We end up multiplying by i² at the end of the second line, and that just turns into multiplying by -1 because that’s what i² is, so that becomes a subtraction, and then we group the multiples again. We can do exactly the same thing for dual numbers too, but this time we need to know how ε behaves. The rule for ε is that it’s so small that when you multiply it by itself you get zero. And then all the algebra follows in the same way. Adding dual numbers is literally the same. We add all the parts and group the multiples of ε: (a + bε) + (c + dε) = a + c + bε + dε = (a + c) + (b + d)ε Multiplying is almost the same. We multiply out the brackets and group the multiples, but now we’re multiplying by ε² at the end, and ε² is zero, so that last part just goes away: (a + bε) × (c + dε) = (a × c) + (a × dε) + (bε × c) + (bε × dε) = ac + (ad + bc)ε + bdε² = ac + (ad + bc)ε + bd × 0 = ac + (ad + bc)ε We can use those results to implement + and * on our DualNumber class. This is where the mathematics meets the programming: class DualNumber def +(other) DualNumber.new \ real: real + other.real, dual: dual + other.dual end def *(other) DualNumber.new \ real: real * other.real, dual: real * other.dual + dual * other.real end end Now if we have two dual numbers, we can add them and multiply them: >> x = DualNumber(1, 2) => (1+2ε) >> y = DualNumber(3, 4) => (3+4ε) >> x + y => (4+6ε) >> x * y => (3+10ε) So for the green car, where the distance is time multiplied by time, we can take that three seconds changing at one second per second, ask for the distance, and get the result: nine metres changing at six metres per second. def distance(time:) time * time end >> time_now = DualNumber(3, 1) => (3+1ε) >> distance_now = distance(time: time_now) => (9+6ε) >> distance_now.real => 9 >> distance_now.dual => 6 And that’s the answer we wanted. It just works, because we built the idea of the rate of change into our numbers, and defined operations that preserved and respected the rate of change. Here’s what we see if we feed different time values into the green car’s distance function and visualise the dual numbers it generates: The real values are plotted in black, and the slope of the red line is the dual part — the rate of change for the value where the dot is. So you can see that matches the slope of the curve nicely.

Compatibility We have to do a bit of work to make dual numbers compatible with Ruby’s built-in numbers. Right now if we try to add or multiply by a Fixnum we get a NoMethodError , because our add and multiply operations are trying to get the real and dual parts of a Fixnum : >> x = DualNumber(1, 2) => (1+2ε) >> (x + 3) * 4 NoMethodError: undefined method `dual' for 3:Fixnum We can fix that by just converting the argument to a dual number inside the add and multiply methods. This is where that DualNumber conversion method comes in handy: class DualNumber def +(other) other = DualNumber(other) DualNumber.new \ real: real + other.real, dual: dual + other.dual end def *(other) other = DualNumber(other) DualNumber.new \ real: real * other.real, dual: real * other.dual + dual * other.real end end And now we can add and multiply by normal numbers. They get automatically converted into dual number constants: >> x = DualNumber(1, 2) => (1+2ε) >> (x + 3) * 4 => (16+8ε) Of course we have the opposite problem, too, if we have a Fixnum on the left and try to add or multiply by a dual number on the right. This time instead of a NoMethodError we get a TypeError , because Fixnum ’s add and multiply operations don’t know what to do with a dual number: >> x = DualNumber(1, 2) => (1+2ε) >> 3 + (4 * x) TypeError: DualNumber can't be coerced into Fixnum Fortunately Ruby has a built-in way of coping with this. If we call a numeric operation of a built-in number with an argument whose type it doesn’t recognise, the number will send a coerce message to that unrecognised argument, and expect to get back a pair of objects. Then it will retry the original operation with those objects, and whatever result it gets back, it will return that. This coercion protocol means built-in numbers can interoperate with user-defined ones. So if we implement this coerce method on the DualNumber class, we can make built-in numbers automatically upgrade themselves to a dual number whenever we try to add or multiply by a dual number: class DualNumber def coerce(other) [DualNumber(other), self] end end And now addition and multiplication work both ways around: >> x = DualNumber(1, 2) => (1+2ε) >> 3 + (4 * x) => (7+8ε)