The pillars of society are those who cannot be bribed or bought, the upright citizens of integrity, the incorruptibles. Throw at them what you will, they never bend.

In the mathematical world, the Fisher metric is one such upstanding figure.

What I mean is this. The Fisher metric can be derived from the concept of relative entropy. But relative entropy can be deformed in various ways, and you might imagine that when you deform it, the Fisher metric gets deformed too. Nope. Bastion of integrity that it is, it remains unmoved.

You don’t need to know what the Fisher metric is in order to get the point: the Fisher metric is a highly canonical concept.

Let’s start with Shannon entropy. Given a finite probability distribution $p = (p_1, \ldots, p_n)$, its Shannon entropy is defined as

$$ H(p) = - \sum_i p_i \log p_i. $$

(I’ll assume all probabilities are nonzero, so there are no problems with things being undefined.)
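
If it helps to see this computed, here is a minimal numerical sketch using NumPy (the function name `shannon_entropy` is just my own choice, and I use the natural logarithm; a different base would only change things by a constant factor):

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy H(p) = -sum_i p_i log(p_i), using the natural log."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

# An arbitrary distribution on 3 elements, all probabilities nonzero.
p = np.array([0.5, 0.3, 0.2])
print(shannon_entropy(p))  # approximately 1.0297
```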

This is the most important type of “entropy” for finite probability distributions: it has uniquely good properties. But it admits a couple of families of deformations that share most of those properties. One is the family of Rényi entropies, indexed by a real parameter $q$:

$$ H_q(p) = \frac{1}{1 - q} \log \sum p_i^q. $$

Another is the family of entropies that I like to call the $q$-logarithmic entropies (because they’re what you get if you replace the logarithm in the definition of Shannon entropy by a $q$-logarithm), and that physicists call the Tsallis entropies (because Tsallis was about the tenth person to discover them). They’re defined by

$$ S_q(p) = \frac{1}{1 - q} \biggl( \sum p_i^q - 1 \biggr). $$

There’s obviously a problem with the definitions of the Rényi entropy $H_q(p)$ and the $q$-logarithmic entropy $S_q(p)$ when $q = 1$. They don’t make sense. But both converge to the Shannon entropy $H(p)$ as $q \to 1$, and that’s what I mean by “deformation”.
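
Here’s a quick numerical illustration of that convergence (not a proof, just a sketch reusing the arbitrary distribution from before):

```python
import numpy as np

def renyi_entropy(p, q):
    """Rényi entropy H_q(p) = log(sum_i p_i^q) / (1 - q), for q != 1."""
    return np.log(np.sum(p ** q)) / (1 - q)

def tsallis_entropy(p, q):
    """q-logarithmic (Tsallis) entropy S_q(p) = (sum_i p_i^q - 1) / (1 - q), for q != 1."""
    return (np.sum(p ** q) - 1) / (1 - q)

p = np.array([0.5, 0.3, 0.2])
shannon = -np.sum(p * np.log(p))
for q in [0.9, 0.99, 1.01, 1.1]:
    print(q, renyi_entropy(p, q) - shannon, tsallis_entropy(p, q) - shannon)
# Both differences shrink to 0 as q approaches 1.
```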

An easy way to prove this is to use l’Hôpital’s rule. And the same l’Hôpital argument shows that it’s easy to dream up new deformations of Shannon entropy (not that they’re necessarily interesting). For any function $\lambda : (0, \infty) \to \mathbb{R}$, define a kind of “entropy of order $q$” as

$$ \frac{1}{1 - q} \cdot \lambda \biggl( \sum p_i^q \biggr). $$

If you want to show that this converges to $H(p)$ as $q \to 1$, all you need to assume about $\lambda$ is that $\lambda(1) = 0$ and $\lambda'(1) = 1$.

Taking $\lambda = \log$ satisfies these conditions and gives Rényi entropy. The simplest function $\lambda$ satisfying the conditions is the linear approximation to the function $\log$ at $1$, namely, $\lambda(x) = x - 1$. And that gives $q$-logarithmic entropy.
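
To see how weak those two conditions are, here’s a sketch with a $\lambda$ I made up on the spot, $\lambda(x) = (x^2 - 1)/2$, which satisfies $\lambda(1) = 0$ and $\lambda'(1) = 1$ but is neither of the two above. The resulting “entropy of order $q$” still converges to Shannon entropy as $q \to 1$:

```python
import numpy as np

def deformed_entropy(p, q, lam):
    """'Entropy of order q' built from lam: lam(sum_i p_i^q) / (1 - q), for q != 1."""
    return lam(np.sum(p ** q)) / (1 - q)

lam = lambda x: (x ** 2 - 1) / 2   # an arbitrary lam with lam(1) = 0 and lam'(1) = 1
p = np.array([0.5, 0.3, 0.2])
shannon = -np.sum(p * np.log(p))
for q in [0.99, 0.999, 1.001, 1.01]:
    print(q, deformed_entropy(p, q, lam) - shannon)   # tends to 0 as q -> 1
```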

That’s entropy, defined for a single probability distribution. But there’s also relative entropy, defined for a pair of distributions on the same finite set. The formula is

$$ H(p \| r) = \sum_i p_i \log(p_i/r_i), $$

where $p$ and $r$ are probability distributions on $n$ elements.

I won’t explain here why relative entropy is important. But very roughly, you can think of it as measuring the difference between $p$ and $r$. It’s always nonnegative, and it’s equal to zero just when $p = r$. However, it would be a bad idea to use the word “distance”: it’s not symmetric, and more importantly, it doesn’t satisfy the triangle inequality.
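
For concreteness, here’s a small sketch (with a pair of distributions chosen arbitrarily) illustrating the nonnegativity, the failure of symmetry, and the vanishing at $p = r$:

```python
import numpy as np

def relative_entropy(p, r):
    """Relative entropy H(p || r) = sum_i p_i log(p_i / r_i)."""
    p, r = np.asarray(p, dtype=float), np.asarray(r, dtype=float)
    return np.sum(p * np.log(p / r))

p = np.array([0.5, 0.3, 0.2])
r = np.array([0.4, 0.4, 0.2])
print(relative_entropy(p, r))   # positive
print(relative_entropy(r, p))   # positive too, but different: not symmetric
print(relative_entropy(p, p))   # 0.0: zero exactly when the arguments agree
```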

Actually, relative entropy is slightly more like a squared distance. A little calculus exercise shows that when $p$ and $r$ are close together,

$$ H(p \| r) = \sum_i \frac{1}{2p_i} (p_i - r_i)^2 + o(\|p - r\|^2). $$

The sum here is just the Euclidean squared distance scaled by a different factor along each coordinate axis.
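
Here’s a numerical sanity check of that expansion, with an arbitrary base point $p$ and an arbitrary tangent direction $t$ whose coordinates sum to zero, so that $p + \epsilon t$ is still a probability distribution for small $\epsilon$:

```python
import numpy as np

def relative_entropy(p, r):
    return np.sum(p * np.log(p / r))

p = np.array([0.5, 0.3, 0.2])
t = np.array([0.1, -0.05, -0.05])   # a tangent direction: coordinates sum to 0

for eps in [0.1, 0.01, 0.001]:
    r = p + eps * t
    exact = relative_entropy(p, r)
    quadratic = np.sum((p - r) ** 2 / (2 * p))
    print(eps, exact, quadratic, exact / quadratic)   # the ratio tends to 1
```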

But it’s still wrong to think of relative entropy as a squared distance. Its square root fails the triangle inequality. So, it’s not a metric in the sense of metric spaces.

However, you can use the square root of relative entropy as an infinitesimal metric — that is, a metric in the sense of Riemannian geometry. It’s called the Fisher metric, at least up to a constant factor that I won’t worry about here. And it makes the set of all probability distributions on $\{1, \ldots, n\}$ into a Riemannian manifold.

This works as follows. The set of probability distributions on $\{1, \ldots, n\}$ is the $(n - 1)$-simplex $\Delta_n$ (whose boundary points I’m ignoring). It’s a smooth manifold in the obvious way, and every one of its tangent spaces can naturally be identified with

$$ T = \{ t = (t_1, \ldots, t_n) \in \mathbb{R}^n : t_1 + \cdots + t_n = 0 \}. $$

The “little calculus exercise” above tells us that when you treat $H(-\|-)$ as an infinitesimal squared distance, the resulting norm on the tangent space $T$ at $p$ is given by

$$ \|t\|^2 = \sum_i \frac{1}{2p_i} t_i^2. $$

Or equivalently, by the polarization identity, the resulting inner product on $T$ is given by

$$ \langle t, u \rangle = \sum_i \frac{1}{2p_i} t_i u_i. $$

And that’s the Riemannian metric on $\Delta_n$. By definition, it’s the Fisher metric.

(Well: it’s actually $1/2$ times what’s normally called the Fisher metric, but as I said, I’m not going to worry too much about constant factors in this post.)
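
In coordinates, then, the metric is easy to compute directly. A sketch (keeping the same factor-of-$1/2$ convention):

```python
import numpy as np

def fisher_inner_product(p, t, u):
    """Inner product <t, u> = sum_i t_i u_i / (2 p_i) on the tangent space at p
    (1/2 times the usual normalization of the Fisher metric)."""
    return np.sum(t * u / (2 * p))

p = np.array([0.5, 0.3, 0.2])
t = np.array([0.1, -0.05, -0.05])   # tangent vectors: coordinates sum to 0
u = np.array([0.0, 0.02, -0.02])
print(fisher_inner_product(p, t, u))
print(fisher_inner_product(p, t, t))   # the squared norm ||t||^2
```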

Summary so far: We’re working on the space $\Delta_n$ of probability distributions on $n$ elements. There is a machine which takes as input anything that looks vaguely like a squared distance on $\Delta_n$, and produces as output a Riemannian metric on $\Delta_n$. When you give this machine relative entropy as its input, what it produces as output is the Fisher metric.

Now the fun starts. Just as the entropy of a single distribution can be deformed in at least a couple of ways, the relative entropy of a pair of distributions has interesting deformations. Here are two families of them. The Rényi relative entropies are given by

$$ H_q(p \| r) = \frac{1}{q - 1} \log \sum p_i^q r_i^{1 - q}, $$

and the $q$-logarithmic relative entropies are given by

$$ S_q(p \| r) = \frac{1}{q - 1} \biggl( \sum p_i^q r_i^{1 - q} - 1 \biggr). $$

Again, $q$ is a real parameter here. Again, both $H_q(p \| r)$ and $S_q(p \| r)$ converge to the standard relative entropy $H(p \| r)$ as $q \to 1$. And again, it’s easy to write down other families of deformations in this sense: define a kind of “relative entropy of order $q$” by

$$ H^\lambda_q(p \| r) = \frac{1}{q - 1} \lambda \biggl( \sum p_i^q r_i^{1 - q} \biggr) $$

where $\lambda$ is any function satisfying the same two conditions as before: $\lambda(1) = 0$ and $\lambda'(1) = 1$. This generalizes both the Rényi and $q$-logarithmic relative entropies, by taking $\lambda(x)$ to be either $\log x$ or $x - 1$.
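
As before, a quick numerical sketch (with the same arbitrary $p$ and $r$ as earlier) showing both families of deformed relative entropies approaching the ordinary relative entropy as $q \to 1$:

```python
import numpy as np

def deformed_relative_entropy(p, r, q, lam):
    """H^lam_q(p || r) = lam(sum_i p_i^q r_i^(1 - q)) / (q - 1), for q != 1."""
    return lam(np.sum(p ** q * r ** (1 - q))) / (q - 1)

p = np.array([0.5, 0.3, 0.2])
r = np.array([0.4, 0.4, 0.2])
ordinary = np.sum(p * np.log(p / r))   # the ordinary relative entropy H(p || r)
for q in [0.99, 0.999, 1.001, 1.01]:
    renyi   = deformed_relative_entropy(p, r, q, np.log)            # lam = log
    tsallis = deformed_relative_entropy(p, r, q, lambda x: x - 1)   # lam(x) = x - 1
    print(q, renyi - ordinary, tsallis - ordinary)   # both differences tend to 0
```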

Let’s feed this very general kind of relative entropy into the machine. A bit of calculation shows that

$$ H^\lambda_q(p \| r) = q \sum_i \frac{1}{2p_i} (p_i - r_i)^2 + o(\|p - r\|^2) $$

for any function $\lambda$ satisfying those same two conditions. The right-hand side is just what we saw before, multiplied by $q$. So, the output of the machine — the Riemannian metric on $\Delta_n$ that comes from this generalized entropy — is just $q$ times the Fisher metric!
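
Here’s a numerical check, reusing the made-up $\lambda(x) = (x^2 - 1)/2$ from earlier and comparing against the Fisher squared norm computed above. Whatever $q$ (and whatever $\lambda$) we pick, the second-order behaviour is $q$ times the Fisher squared norm:

```python
import numpy as np

def deformed_relative_entropy(p, r, q, lam):
    return lam(np.sum(p ** q * r ** (1 - q))) / (q - 1)

p = np.array([0.5, 0.3, 0.2])
t = np.array([0.1, -0.05, -0.05])          # tangent direction: coordinates sum to 0
lam = lambda x: (x ** 2 - 1) / 2           # arbitrary lam with lam(1) = 0, lam'(1) = 1
fisher_sq_norm = np.sum(t ** 2 / (2 * p))  # ||t||^2 in the (1/2-scaled) Fisher metric

eps = 1e-4
for q in [0.5, 2.0, 3.0]:
    ratio = deformed_relative_entropy(p, p + eps * t, q, lam) / eps ** 2
    print(q, ratio / fisher_sq_norm)       # approximately q, independently of lam
```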

So: when you deform the notion of relative entropy and feed it into the machine, the same thing always happens. No matter which deformation you put in, the machine spits out the same Riemannian metric on $\Delta_n$ (at least, up to a constant factor). It’s always the Fisher metric.

A thrill-seeker would call that result disappointing. They might have been hoping that deforming relative entropy would lead to interestingly deformed versions of the Fisher metric. But there are no such things. Try as you might, the Fisher metric simply refuses to be deformed.