Note: This post contains a lot of math may not appear prettily unless you are reading this from my website itself; if you are reading this from an RSS feed, you may want to switch to my website.

I think the claim that "no two snowflakes are alike" is fairly common. The idea is that there are so many possible configurations of snow crystals that it is improbable that any two flakes sitting next to each other would have the same configuration. I wanted to know: Is there any truth to this claim?

Kenneth Libbrecht, a physics professor at Caltech, thinks so, and makes the following argument:

Now when you look at a complex snow crystal, you can often pick out a hundred separate features if you look closely.

He goes on to explain that there are $10^{158}$ different configurations of those features. That's a 1 followed by 158 zeros, which is about $10^{70}$ times larger than the total number of atoms in the universe. Professor Libbrecht concludes

Thus the number of ways to make a complex snow crystal is absolutely huge. And thus it's unlikely that any two complex snow crystals, out of all those made over the entire history of the planet, have ever looked completely alike.

Being the skeptic that I am, I decided to rigorously investigate the true probability of two snowflakes having possessed the same configuration over the entire history of the Earth.

Let's assume Professor Libbrecht's analysis is correct and there are $10^{158}$ possible snowflake configurations. Roughly $10^{23}$ snow crystals fall on Earth per year. Therefore, by the pigeonhole principle, two similar flakes must fall by the time the Earth reaches the age $10^{158} \div 10^{23} = 10^{135}$ years, which isn't going to happen any time soon. This seems to support Professor Libbrecht's conjecture.

Now lets assume that each one of the $10^{158}$ possible snowflake configurations is equiprobable; I am not sure if this is a reasonable assumption, but I’ll slightly relax the assumption below. Since the Earth is roughly 4.54 billion years old, let's further assume that at least $(4.54\times 10^9) \times 10^{23} = 4.54\times 10^{32}$ snowflakes have fallen on Earth. We therefore want to know: Given $4.54\times 10^{32}$ items that are randomly chosen from a set of $10^{158}$, what is the probability that no two items are the same?

This is an instance of the classic "Birthday Paradox." Imagine a room full of people. Assuming birthdays are uniformly distributed across the calendar, the probability that any given pair of people in the room has a shared birthday is simply 1 in 365. By the pigeonhole principle, if there are at least 366 (or 367 if we include leap years) people in the room then there must be at least one pair of people that share a birthday. The Birthday Paradox is the fact that the probability that there exists a pair of people in the room that share a birthday rapidly converges to 1 as the number of people increases. For example, with only 23 people in the room, the probability that there are two people with the same birthday is over 50%. With at least 60 people in the room, the probability that there is a shared birthday is nearly 100%. This is a "paradox" because 60 people is a long way away from the guaranteed quantity of 366.

There has been a lot of study of the Birthday Paradox over the years, and a number of neat analysis techniques exist for calculating probabilities of similarity. The problem is that numerically calculating the probability that two snowflakes will be similar is unfortunately intractable given the astronomical numbers with which we're dealing. Therefore, we need to use an approximation. One common approximation is to cast the problem as a "collision problem": Given $n$ random integers drawn from a discrete uniform distribution with range $[1,d]$, what is the probability $P(n;d)$ that at least two numbers are the same?. The solution to this problem is

$P(n;d) \approx 1 - \left(\frac{d-1}{d}\right)^{\frac{n(n-1)}{2}}.$

Substituting our values results in $P(4.54\times 10^{32},10^{158}) \approx 1.03\times 10^{-93}$. In other words, under these assumptions, the probability that there have been at least two similar snowflakes thus far in the entire history of the Earth is basically zero.

But wait! Do snowflakes really occur according to a uniform distribution? I’d imagine that some configurations are more rare than others. Like many other natural phenomena, I might guess that the distribution of configurations is closer to a Zipfian (a.k.a. “Power Law”) distribution. Using this distribution unfortunately forces us to abandon the handy approximation we used above; we’ll have to derive the probability from scratch. The probability that all $n$ flakes are different is then this nasty expression:

$\frac{1}{n!}\sum_{x_1=1}^{10^{158}} \sum_{x_2=x_1+1}^{10^{158}} \sum_{x_3=x_2+1}^{10^{158}} \cdots \sum_{x_n=x_{n-1}+1}^{10^{158}} \prod_{i=1}^n \frac{x_i^{-2}}{\sum_{j=1}^k j^{-2}},$

which is all but impossible to solve numerically. Fortunately, we really only need an upper bound on this value. First, let’s rewrite it as

$\frac{1}{n!}\sum_{x_1=1}^{10^{158}} \frac{x_1^{-2}}{\sum_{j=1}^{10^{158}} j^{-2}} \sum_{x_2=x_1+1}^{10^{158}} \frac{x_2^{-2}}{\sum_{j=1}^{10^{158}} j^{-2}} \cdots \sum_{x_n=x_{n-1}+1}^{10^{158}} \frac{x_n^{-2}}{\sum_{j=1}^{10^{158}} j^{-2}}.$

Next, note that

$\frac{x^{-2}}{\sum_{j=1}^{10^{158}} j^{-2}} \leq \frac{x^{-2}}{\frac{10^{158}}{{10^{158}}^2}} = \frac{10^{158}}{x^2},$ allowing us to further bound the original expression by

$\cdots \leq \frac{10^{158}}{n!}\sum_{x_1=1}^{10^{158}} x_1^{-2} \sum_{x_2=x_1+1}^{10^{158}} x_2^{-2} \cdots \sum_{x_n=x_{n-1}+1}^{10^{158}} x_n^{-2}.$ Finally, we arrive at the ultimate bound:

$\cdots \lt \frac{10^{158}}{n!}\left(H_{10^{158},2}\right)^n \lt \frac{10^{158}}{n!}\zeta(2)^n,$

where “$H_{n,r}$” is the generalized harmonic number of order $n$ of $r$ and “$\zeta$” is the Riemann Zeta function. Substituting $n = 10^{23}$ (i.e., the number of snowflakes that fall in a single year), this upper bound is extremely small (incredibly close to zero). Keep in mind that this is the probability that all of the snowflakes have been different. Therefore, assuming snowflakes are configured according to a Zipfian distribution, the probably that at least two snow flakes are alike in a single year is extremely close to 100%!

We can get a similar result using a normal distribution. Since there are a discrete number of configurations, though, we really have to approximate a truncated discrete normal distribution using a binomial distribution $B(10^{158}, 0.5)$. The probability that all $n$ flakes are different is then this nasty expression: