K-means clustering is a fairly common and useful algorithm. One of the difficulties in using k-means is choosing the number of clusters, . I use the gap statistic pretty regularly for choosing k in my analyses as I’ve found it to perform reasonably well in practice. Recently, I’ve been thinking a bit about clustering on Spark with large data sets and while the gap statistic works well on small data sets, it can be a major pain with lots of samples.

Disclaimer

This post isn’t meant to be authoritative, I was just curious to see how the Pham et. al. method and gap statistic perform under various sample sizes and dimensions when there is no clustering under a uniform distribution.

Some notation

is the number of data points (sample size)

is the number of data points (sample size) is the number of dimensions of each data point

is the number of dimensions of each data point is the number of clusters

is the number of clusters is the maximum number of clusters ( )

Why is the gap statistic so slow?

For a review of the gap statistic, checkout this great post by The Data Science Lab. The basics in the original paper are pretty easy to follow as well. The main reason it is relatively slow is that you generate samples from a uniform distribution over the same range as the original data, times. In each iteration, you then run the same clustering algorithm with your candidate values. This means that on top of clustering your original data times you have to cluster data sets, each of sample size . When and aren’t too large, it’s pretty tractable — but it can grow out of control rather quickly as to get stable estimates of the optimal , has to be somewhat large (the authors of the R packages cluster recommend as much as !).

Pham et. al.

The Pham et. al. method was also recently reviewed at the Data Science Lab. Computationally, the method is preferable to the gap statistic as it does not require a bunch of sampling — you only have to run k-means times and compute a statistic that depends on the within cluster sum of squares. That being said, while the approach is somewhat justified in the paper, it smells a lot like a heuristic, at least when compared to something like the gap statistic.

Why do these simulations?

Mostly curiosity. I will say that the gap statistic is a much more principled approach and I was curious how the Pham method would hold up. Also, the Pham simulations left quite a bit to be desired. As far as I could tell, the simulations weren’t repeated (or at least reported), which is generally bad form for such simple simulations.

Performance under the reference distribution

Both methods assume that when there is no clustering, the reference distribution is uniform over the -dimensional space. For the simulations here, samples were generated uniformly at random from the unit hypercube. I performed 100 realizations of each simulation, where I vary and . Clustering was performed using the standard R implementation of k-means with 25 starts.

For the Pham method, I used the kselection package with default parameters. For the gap statistic, I used the cluster package. The gap statistic was run with . You can find the R code in this repo, along with a RData file with the results.

General results

Below is a figure with the k selected by the each algorithm in several different conditions. The title for each subfigure is : the dimension and sample size. The histograms are in “dodge” mode.

Generally, the gap statistic outperforms the Pham method, but as the dimensionality and sample size grow, they tend to perform comparably (under these conditions). The situations where the Pham method does quite poorly is with sample size and few dimensions. In such cases it doesn’t really make sense to not use the gap statistic as it can still be computed quickly.

A closer look at Pham

We can look at only the simulations which did not choose correctly to see what might be going awry. Below I’ve plotted for the incorrectly classified simulations when :

Each line is a different simulation and the dotted red line marks , the recommended threshold at which to choose the minimal value of . The proposed rule in the paper is to choose the minimal such that . If there are no values of , the algorithm returns . It’s pretty clear that for this particular setup, 0.85 is not a good threshold.

If we look at those which correctly chose , we note that many of the simulations almost crossed this threshold of 0.85.

Timing

Each simulation was run on a separate single core. The gap statistic is embarrassingly parallel, though the R implementation does not exploit this (I think I’m smelling a C++11 threaded implementation in my future). I will also note that this is a large shared machine, so it’s possible at some point there were other tasks competing for CPU time (the machine isn’t running resource management).

The summary of results can be seen below (histograms are in “stack” mode):

It’s pretty clear that for small sample sizes there is little time benefit from using the Pham method. Though, as the dimension and sample size grow, the gap statistic grows out of control. At , which in the grand scheme of things is a pretty small dataset, the gap statistic is on average over 200 times slower than the Pham method.

Discussion

The Pham method is a somewhat convenient heuristic that has been proposed for determining the optimal number of clusters when the number of clusters isn’t known a priori. We looked at very simple simulations where there is no clustering in the data to see how the Pham method compares to the gap statistic. For small sample sizes and/or few dimensions, the gap statistic does significantly better. The Pham method especially has trouble with small sample sizes and few dimensions. This problem seems to arise from the selection of the threshold at which to choose the optimal k. The original paper suggests 0.85 which clearly isn’t best for all datasets.

While the low dimensional situation might seem nonexistent in common datasets, it’s somewhat common to reduce a high dimensional data set to low dimensional space before clustering using a method such as principal component analysis.

Future work

These simulations don’t really prove anything beyond the very naive conditions we tested. In a possible future post, I will recreate the simulations in the gap statistic paper and look for other papers which have comparable simulations to see how each method performs on data with clusters. I will also likely expand the methods tested to a larger group, so please comment with (automated) methods you think should be compared.