The Zipf's law states that in many settings (that we are going to explore), the volume or size of entities is inversely proportional to a power s (s > 0) of their ranking. This has important implications in predictive modeling, discussed below. The processes that create this type of dynamic are not well understood. It is the purpose of this article to explain the underlying mechanics. The traditional example for the Zipf distribution is the distribution of Internet domains, ranked by traffic. By its very nature, Zipf's law is also a feature of big data: it can't really be observed in small data sets.

In the following example - based on real data from the DSC member database - we will see that this law is accurate for the 3 metrics that were expected to follow the law: members per company, per city or per country. In particular, it shows that the distribution of DSC members by company, very closely follows a Zipf distribution, where the number of DSC members, for a given company, is proportional to r^{-s}, where r is the company rank, and s = 0.7, see figure 1.

Figure 1: Example of Zipf distribution with parameter s = 0.7.

The Zipf distrubution is characterized by a rank decay well modeled by an inverse power function of parameter s > 0. It applies to the following settings.

1. General Setting

There are k atoms (e.g. DSC members) distributed over m active entities (e.g. companies). By active entities, we mean (in our example) companies with at least one DSC member; those with 0 member are not taken into account. Both k and m are very large, with k much bigger than m. Indeed, generally k / m tends to infinity as m tends to infinity. A few entities are huge (IBM, Accenture, Oracle in Figure 1), yet there are tons of entities with very few atoms.

Another example where Zipf's law applies is Internet domains: a few domains like Google, Facebook, Twitter, LinkedIn, and Yahoo dominate, with Google being the big king (in terms of monthly users or pageviews) based on Alexa ranks, while billions of small websites barely get any traffic, even when combined together.

Note that if m tends to infinity and s < 1, the Zipf distribution does not make sense from a mathematical point of view, as no moment exist. It must be regarded as a stochastic process in that case, and the explanations regarding how the process is generated still do make sense. It's only a mathematical annoyance, but not causing any problems for modeling or predicting purposes.

We have done simulations to try to understand the mechanism, and our conclusion is as follows.

2. What causes some systems to abide by Zipf's law?

Zipf's law also applies to celestial bodies in the solar system, because the process is very similar to the way companies are created and evolve, involving mergers and acquisitions. Here's how it works, described in algorithmic terms, applied to companies, and celestial bodies alike. But it applies to many other frameworks as well.

Birth: A bunch of tiny, one-man companies exist initially, long ago (comparable to hydrogen molecules, or space dust, if we use the celestial body analogy) They cluster due to some attraction forces They cluster even more They cluster and cluster again; big clusters tend to merge with other big clusters, but plenty of tiny companies from step 1 (or remnants of bankrupties/planetary collisions) are still there and still tiny; some tiny ones are absorbed by big ones And they cluster even more At some point, we have a few giants (IBM, Microsoft, Walmart - comparable to the Sun, Jupiter or Saturn in our analogy) and still plenty of tiny ones (DSC, comparable to a 50-feet wide space rock, that is definitely getting bigger)

How these entities may grow over time is illustrated in the following Youtube video, and explained in this article.

Evidently, this works only for certain metrics and settings meeting the requirements about k and m, as described in the previous section. That's why it works for

companies, cities, and states - ranked by number of DSC nembers,

Internet domains - ranked by traffic,

celestial bodies - ranked by mass,

English words - ranked by frequency,

but not for age, salary or gender distributions. Indeed, if you perform simulations with multiple layers of clustering, you won't be able to replicate Zipf laws unless there's a rule in your simulations that favors big clusters to lump together over successive iterations. Instead, rather than a Zipf distribution, you would end up with something that looks like a rotated S curve.

3. Data

Download our spreadsheet. It features the distribution of DSC members (for a specific channel, not disclosed here) per company (see top 10 companies in Figure 1 above), city (worldwide) and country - by itself a very interesting data set.

For each of the three fields (company, city, country), model-fitting with various distributions was attempted, using both Excel trendlines and mean-square errors (which tend to not work great in contexts like these with highly skewed distributions - instead, I recommend to use this L1 metric rather than R-Squared). Nevertheless, it is obvious that the Power trendline in Excel offers the best fit in each instance, far better than linear, exponential, logarithmic or polynomial trendlines.

4. Simulations

The challenge of the week, this week, consists of doing simulations to replicate the special clustering process descr..., resulting in Zipf distributions. More specifically, we ask you to perform the following simulations to assess whether our assumptions are correct. Alternatively, a mathematical proof is OK.

Let's assume that we have k = one million atoms, for instance space dust particles.

Algorithm (write it in Perl, R, Python, C or Java; test it)

Step #1

Each particle is assigned a unique bin ID between 1 and 1,000,000. Each particle represents a cluster with one element (the particle), and the bin ID is its cluster ID.

Step #2

Iteration: repeat 20,000,000 times:

Randomly select two integers i and j between 1 and current number of clusters.

If size of cluster i and cluster j are similar, merge these clusters with probability p > 0.8, Otherwise, merge these clusters with probability q < 0.3

Update cluster list

Once the algorithm stops, the final cluster configuration represents the current solar system, or companies in US - most very small, a few very big.

Question: can this cluster process simulation algorithm be adapted to a distributed environment, say Hadoop?

5. Implications for Predictive Modeling

Structures suspected to be generated by such clustering processes can be modeled using hierarchical Bayes models and simulations (MCMC) to fit model parameters with data, assess goodness-of-fit with Zipf distribution, and then predict the evolution of the system - or at least some of its core properties such as mean. This might require adding a time dimension in the simulations (represented for instance by the iteration, in the above algorithm).