It’s an underappreciated fact that the interior of every simplex $\Delta^n$ is a real vector space in a natural way. For instance, here’s the 2-simplex with twelve of its 1-dimensional linear subspaces drawn in:

(That’s just a sketch. See below for an accurate diagram by Greg Egan.)

In this post, I’ll explain what this vector space structure is and why everyone who’s ever taken a course on thermodynamics knows about it, at least partially, even if they don’t know they do.

Let’s begin with the most ordinary vector space of all, $\mathbb{R}^n$. (By “vector space” I’ll always mean vector space over $\mathbb{R}$.) There’s a bijection

$$\mathbb{R} \leftrightarrow (0, \infty)$$

between the real line and the positive half-line, given by exponential in one direction and log in the other. Applying this bijection in each coordinate gives a bijection

$$\mathbb{R}^n \leftrightarrow (0, \infty)^n.$$

So, if we transport the vector space structure of $\mathbb{R}^n$ along this bijection, we’ll produce a vector space structure on $(0, \infty)^n$. This new vector space $(0, \infty)^n$ is isomorphic to $\mathbb{R}^n$, by definition.

Explicitly, the “addition” of the vector space $(0, \infty)^n$ is coordinatewise multiplication, the “zero” vector is $(1, \ldots, 1)$, and “subtraction” is coordinatewise division. The scalar “multiplication” is given by powers: multiplying a vector $\mathbf{y} = (y_1, \ldots, y_n) \in (0, \infty)^n$ by a scalar $\lambda \in \mathbb{R}$ gives $(y_1^\lambda, \ldots, y_n^\lambda)$.
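If you like to see transported structure concretely, here is a minimal Python sketch (mine, not from the post; the function names `add`, `scale`, `zero` are made up) of these operations, with a check that the exp/log bijection really carries ordinary addition to the new “addition”:

```python
import math

# "Addition" in (0, inf)^n is coordinatewise multiplication,
# the "zero" vector is (1, ..., 1), and scalar "multiplication"
# is coordinatewise raising to the power lambda.

def add(y, z):
    return tuple(yi * zi for yi, zi in zip(y, z))

def scale(lam, y):
    return tuple(yi ** lam for yi in y)

def zero(n):
    return (1.0,) * n

# The bijection exp: R^n -> (0, inf)^n transports ordinary addition
# to this "addition": exp(u + v) = add(exp(u), exp(v)).
u, v = (0.5, -1.0), (2.0, 0.25)
lhs = tuple(math.exp(ui + vi) for ui, vi in zip(u, v))
rhs = add(tuple(map(math.exp, u)), tuple(map(math.exp, v)))
assert all(abs(a - b) < 1e-12 for a, b in zip(lhs, rhs))
```

Note that `scale(0, y)` returns `zero(len(y))`, as the vector space axioms demand.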

Now, the ordinary vector space $\mathbb{R}^n$ has a linear subspace $U$ spanned by $(1, \ldots, 1)$. That is,

$$U = \{(\lambda, \ldots, \lambda) : \lambda \in \mathbb{R}\}.$$

Since the vector spaces $\mathbb{R}^n$ and $(0, \infty)^n$ are isomorphic, there’s a corresponding subspace $W$ of $(0, \infty)^n$, and it’s given by

$$W = \{(e^\lambda, \ldots, e^\lambda) : \lambda \in \mathbb{R}\} = \{(\gamma, \ldots, \gamma) : \gamma \in (0, \infty)\}.$$

But whenever we have a linear subspace of a vector space, we can form the quotient. Let’s do this with the subspace $W$ of $(0, \infty)^n$. What does the quotient $(0, \infty)^n/W$ look like?

Well, two vectors $\mathbf{y}, \mathbf{z} \in (0, \infty)^n$ represent the same element of $(0, \infty)^n/W$ if and only if their “difference” — in the vector space sense — belongs to $W$. Since “difference” or “subtraction” in the vector space $(0, \infty)^n$ is coordinatewise division, this just means that

$$\frac{y_1}{z_1} = \frac{y_2}{z_2} = \cdots = \frac{y_n}{z_n}.$$

So, the elements of $(0, \infty)^n/W$ are the equivalence classes of $n$-tuples of positive reals, with two tuples considered equivalent if they’re the same up to rescaling.
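This equivalence is easy to test numerically. A small Python sketch (the helper `same_class` is my own invented name, not anything from the post):

```python
# Two positive tuples represent the same element of the quotient
# (0, inf)^n / W iff all their coordinatewise ratios y_i / z_i agree.
def same_class(y, z, tol=1e-12):
    r = y[0] / z[0]
    return all(abs(yi / zi - r) < tol for yi, zi in zip(y, z))

assert same_class((2.0, 4.0, 6.0), (1.0, 2.0, 3.0))       # rescaling by 2
assert not same_class((2.0, 4.0, 6.0), (1.0, 2.0, 4.0))   # not a rescaling
```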

Now here’s the crucial part: it’s natural to normalize everything to sum to $1$. In other words, in each equivalence class, we single out the unique tuple $(y_1, \ldots, y_n)$ such that $y_1 + \cdots + y_n = 1$. This gives a bijection

$$(0, \infty)^n/W \leftrightarrow \Delta_n^\circ$$

where $\Delta_n^\circ$ is the interior of the $(n - 1)$-simplex:

$$\Delta_n^\circ = \{(p_1, \ldots, p_n) : p_i > 0, \sum p_i = 1\}.$$

You can think of $\Delta_n^\circ$ as the set of probability distributions on an $n$-element set that satisfy Cromwell’s rule: zero probabilities are forbidden. (Or as Cromwell put it, “I beseech you, in the bowels of Christ, think it possible that you may be mistaken.”)

Transporting the vector space structure of $(0, \infty)^n/W$ along this bijection gives a vector space structure to $\Delta_n^\circ$. And that’s the vector space structure on the simplex.

So what are these vector space operations on the simplex, in concrete terms? They’re given by the same operations in $(0, \infty)^n$, followed by normalization. So, the “sum” of two probability distributions $\mathbf{p}$ and $\mathbf{q}$ is

$$\frac{(p_1 q_1, p_2 q_2, \ldots, p_n q_n)}{p_1 q_1 + p_2 q_2 + \cdots + p_n q_n},$$

the “zero” vector is the uniform distribution

$$\frac{(1, 1, \ldots, 1)}{1 + 1 + \cdots + 1} = (1/n, 1/n, \ldots, 1/n),$$

and “multiplying” a probability distribution $\mathbf{p}$ by a scalar $\lambda \in \mathbb{R}$ gives

$$\frac{(p_1^\lambda, p_2^\lambda, \ldots, p_n^\lambda)}{p_1^\lambda + p_2^\lambda + \cdots + p_n^\lambda}.$$
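In code, all three operations are one-liners. A Python sketch (the function names are mine), assuming inputs are tuples of positive reals:

```python
def normalize(y):
    s = sum(y)
    return tuple(yi / s for yi in y)

def simplex_add(p, q):
    # "sum": coordinatewise product, then normalize
    return normalize(tuple(pi * qi for pi, qi in zip(p, q)))

def simplex_scale(lam, p):
    # scalar "multiple": coordinatewise power, then normalize
    return normalize(tuple(pi ** lam for pi in p))

p, u = (0.2, 0.3, 0.5), (1/3, 1/3, 1/3)
# The uniform distribution acts as the zero vector:
assert all(abs(a - b) < 1e-12 for a, b in zip(simplex_add(p, u), p))
```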

For instance, let’s think about the scalar “multiples” of

$$\mathbf{p} = (0.2, 0.3, 0.5) \in \Delta_3.$$

“Multiplying” $\mathbf{p}$ by $\lambda \in \mathbb{R}$ gives

$$\frac{(0.2^\lambda, 0.3^\lambda, 0.5^\lambda)}{0.2^\lambda + 0.3^\lambda + 0.5^\lambda}$$

which I’ll call $\mathbf{p}^{(\lambda)}$, to avoid the confusion that would be created by calling it $\lambda \mathbf{p}$.

When $\lambda = 0$, $\mathbf{p}^{(\lambda)}$ is just the uniform distribution $(1/3, 1/3, 1/3)$ — which of course it has to be, since multiplying any vector by the scalar $0$ has to give the zero vector.

For equally obvious reasons, $\mathbf{p}^{(1)}$ has to be just $\mathbf{p}$.

When $\lambda$ is large and positive, the powers of $0.5$ dominate over the powers of the smaller numbers $0.2$ and $0.3$, so $\mathbf{p}^{(\lambda)} \to (0, 0, 1)$ as $\lambda \to \infty$.

For similar reasons, $\mathbf{p}^{(\lambda)} \to (1, 0, 0)$ as $\lambda \to -\infty$. This behaviour as $\lambda \to \pm\infty$ is the reason why, in the picture above, you see the curves curling in at the ends towards the triangle’s corners.
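These limits are easy to confirm numerically. A Python sketch (the helper `escort` is my own name for the scalar multiplication $\mathbf{p}^{(\lambda)}$):

```python
def escort(p, lam):
    # p^(lambda): coordinatewise powers, normalized to sum to 1
    w = [pi ** lam for pi in p]
    s = sum(w)
    return tuple(wi / s for wi in w)

p = (0.2, 0.3, 0.5)
assert all(abs(a - 1/3) < 1e-12 for a in escort(p, 0))           # uniform at lambda = 0
assert all(abs(a - b) < 1e-12 for a, b in zip(escort(p, 1), p))  # p itself at lambda = 1
assert escort(p, 50)[2] > 0.999    # approaching (0, 0, 1)
assert escort(p, -50)[0] > 0.999   # approaching (1, 0, 0)
```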

Some physicists refer to the distributions $\mathbf{p}^{(\lambda)}$ as the “escort distributions” of $\mathbf{p}$. And in fact, the scalar multiplication of the vector space structure on the simplex is a key part of the solution of a very basic problem in thermodynamics — so basic that even I know it.

The problem goes like this. First I’ll state it using the notation above, then afterwards I’ll translate it back into terms that physicists usually use.

Fix $\xi_1, \ldots, \xi_n, \xi > 0$. Among all probability distributions $(p_1, \ldots, p_n)$ satisfying the constraint

$$\xi_1^{p_1} \xi_2^{p_2} \cdots \xi_n^{p_n} = \xi,$$

which one minimizes the quantity

$$p_1^{p_1} p_2^{p_2} \cdots p_n^{p_n}?$$

It makes no difference to this question if $\xi_1, \ldots, \xi_n, \xi$ are normalized so that $\xi_1 + \cdots + \xi_n = 1$ (since multiplying each of $\xi_1, \ldots, \xi_n, \xi$ by a constant doesn’t change the constraint). So, let’s assume this has been done.

Then the answer to the question turns out to be: the minimizing distribution $\mathbf{p}$ is a scalar multiple of $(\xi_1, \ldots, \xi_n)$ in the vector space structure on the simplex. In other words, it’s an escort distribution of $(\xi_1, \ldots, \xi_n)$. Or in other words still, it’s an element of the linear subspace of $\Delta_n^\circ$ spanned by $(\xi_1, \ldots, \xi_n)$. Which one? The unique one such that the constraint is satisfied.

Proving that this is the answer is a simple exercise in calculus, e.g. using Lagrange multipliers.

For instance, take $(\xi_1, \xi_2, \xi_3) = (0.2, 0.3, 0.5)$ and $\xi = 0.4$. Among all distributions $(p_1, p_2, p_3)$ that satisfy the constraint

$$0.2^{p_1} \times 0.3^{p_2} \times 0.5^{p_3} = 0.4,$$

the one that minimizes $p_1^{p_1} p_2^{p_2} p_3^{p_3}$ is some escort distribution of $(0.2, 0.3, 0.5)$. Maybe one of the curves shown in the picture above is the 1-dimensional subspace spanned by $(0.2, 0.3, 0.5)$, and in that case, the minimizing $\mathbf{p}$ is somewhere on that curve.

The location of $\mathbf{p}$ on that curve depends on the value of $\xi$, which here I chose to be $0.4$. If I changed it to $0.20001$ or $0.49999$ then $\mathbf{p}$ would be nearly at one end or the other of the curve, since along the curve $\mathbf{p} = (0.2, 0.3, 0.5)^{(\lambda)}$ the constrained quantity $0.2^{p_1} \times 0.3^{p_2} \times 0.5^{p_3}$ converges to $0.2$ as $\lambda \to -\infty$ and to $0.5$ as $\lambda \to \infty$.
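To find the particular escort distribution meeting a given constraint value, one can solve for $\lambda$ numerically. Here’s a sketch in Python (my own bisection approach, not anything from the post); it relies on the fact that the constrained quantity, a weighted geometric mean, increases monotonically in $\lambda$ from $\min_i \xi_i$ to $\max_i \xi_i$:

```python
import math

xi = (0.2, 0.3, 0.5)
target = 0.4  # the constraint value

def escort(p, lam):
    w = [pv ** lam for pv in p]
    s = sum(w)
    return [wv / s for wv in w]

def geom_mean(xs, p):
    # weighted geometric mean xs_1^{p_1} * ... * xs_n^{p_n}
    return math.exp(sum(pv * math.log(xv) for pv, xv in zip(p, xs)))

# Bisect on lambda: the constrained quantity increases with lambda.
lo, hi = -100.0, 100.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if geom_mean(xi, escort(xi, mid)) < target:
        lo = mid
    else:
        hi = mid

p = escort(xi, 0.5 * (lo + hi))
assert abs(geom_mean(xi, p) - target) < 1e-9
```

The resulting `p` is the maximum-entropy distribution satisfying the constraint, found without any explicit Lagrange-multiplier calculation.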

Aside: I’m glossing over the question of existence and uniqueness of solutions to the optimization problem. Since $\xi_1^{p_1} \xi_2^{p_2} \cdots \xi_n^{p_n}$ is a kind of average of $\xi_1, \xi_2, \ldots, \xi_n$ — a weighted geometric mean — there’s no solution at all unless $\min_i \xi_i \leq \xi \leq \max_i \xi_i$. As long as that inequality is satisfied, there’s a minimizing $\mathbf{p}$, although it’s not always unique: e.g. consider what happens when all the $\xi_i$s are equal.

Physicists prefer to do all this in logarithmic form. So, rather than start with $\xi_1, \ldots, \xi_n, \xi > 0$, they start with $x_1, \ldots, x_n, x \in \mathbb{R}$; think of this as substituting $x_i = -\log \xi_i$ and $x = -\log \xi$. So, the constraint

$$\xi_1^{p_1} \xi_2^{p_2} \cdots \xi_n^{p_n} = \xi$$

becomes

$$e^{-p_1 x_1} e^{-p_2 x_2} \cdots e^{-p_n x_n} = e^{-x}$$

or equivalently

$$p_1 x_1 + p_2 x_2 + \cdots + p_n x_n = x.$$

We’re trying to minimize $p_1^{p_1} p_2^{p_2} \cdots p_n^{p_n}$ subject to that constraint, and again the physicists prefer the logarithmic form (with a change of sign): maximize

$$-(p_1 \log p_1 + p_2 \log p_2 + \cdots + p_n \log p_n).$$

That quantity is the Shannon entropy of the distribution $(p_1, \ldots, p_n)$: so we’re looking for the maximum entropy solution to the constraint. This is called the Gibbs state, and as we saw, it’s a scalar multiple of $(\xi_1, \ldots, \xi_n)$ in the vector space structure on the simplex. Equivalently, it’s

$$\frac{(e^{-\lambda x_1}, e^{-\lambda x_2}, \ldots, e^{-\lambda x_n})}{e^{-\lambda x_1} + e^{-\lambda x_2} + \cdots + e^{-\lambda x_n}}$$

for whichever value of $\lambda$ satisfies the constraint. The denominator here is the famous partition function.
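In code, the Gibbs state and partition function are a few lines, and one can check numerically that the Gibbs state really is an escort distribution of $(\xi_1, \ldots, \xi_n) = (e^{-x_1}, \ldots, e^{-x_n})$. A Python sketch with made-up values of $x_i$ (nothing here is from the post except the formulas):

```python
import math

x = (1.0, 2.0, 3.0)  # hypothetical values x_1, x_2, x_3

def gibbs(x, lam):
    w = [math.exp(-lam * xv) for xv in x]
    Z = sum(w)  # the partition function
    return [wv / Z for wv in w]

def escort(p, lam):
    # scalar "multiplication" on the simplex: powers, then normalize
    w = [pv ** lam for pv in p]
    s = sum(w)
    return [wv / s for wv in w]

# The Gibbs state at parameter lambda is the lambda-th scalar "multiple"
# of (xi_1, ..., xi_n) = (exp(-x_1), ..., exp(-x_n)) in the simplex.
xi = [math.exp(-xv) for xv in x]
lam = 2.5
assert all(abs(a - b) < 1e-9 for a, b in zip(gibbs(x, lam), escort(xi, lam)))
```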

So, that basic thermodynamic problem is (implicitly) solved by scalar multiplication in the vector space structure on the simplex. A question: does addition in the vector space structure on the simplex also have a role to play in physics?