Now that we have an idea of what type of data we are dealing with, let’s explore the core ideas of HDBSCAN and how it excels even when the data has:

Arbitrarily shaped clusters

Clusters with different sizes and densities

Noise

HDBSCAN uses a density-based approach which makes few implicit assumptions about the clusters. It is a non-parametric method that looks for a cluster hierarchy shaped by the multivariate modes of the underlying distribution. Rather than looking for clusters with a particular shape, it looks for regions of the data that are denser than the surrounding space. A useful mental image is trying to separate islands from the sea, or mountains from their valleys.
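To make this concrete, here is a minimal sketch of running HDBSCAN on noisy, arbitrarily shaped data. It assumes the hdbscan and scikit-learn packages are available; the two-moons dataset and the min_cluster_size value are illustrative choices, not this article’s data.

```python
import hdbscan
from sklearn.datasets import make_moons

# Two crescent-shaped clusters with added noise (illustrative data).
X, _ = make_moons(n_samples=500, noise=0.08, random_state=42)

clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
labels = clusterer.fit_predict(X)  # noise points get the label -1

print(f"clusters found: {labels.max() + 1}, noise points: {(labels == -1).sum()}")
```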

What’s a cluster?

How do we define a “cluster”? The characteristics of what we intuitively think of as a cluster can be poorly defined and are often context-specific (see Christian Hennig’s talk [5] for an overview).

If we go back to the original data set, the reason we identify clusters is that we see 6 dense regions surrounded by sparse and noisy space.

Encircled regions are highly dense

One way of defining a cluster which is usually consistent with our intuitive notion of clusters is: highly dense regions separated by sparse regions.

Look at the plot of 1-d simulated data. We can see 3 clusters.

Looking at the underlying distribution

X is simulated data from a mixture of normal distributions, and we can plot the exact probability distribution of X.

Peaks = Dense regions. Troughs = sparse regions

The peaks correspond to the densest regions and the troughs correspond to the sparse regions. This gives us another way of framing the problem: assuming we know the underlying distribution, clusters are highly probable regions separated by improbable regions. Imagine the higher-dimensional probability distribution forming a landscape of mountains and valleys, where the mountains are your clusters.
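Here is one way such a one-dimensional example could be generated: we simulate X from a three-component mixture of normals and plot its exact PDF together with a strip plot of the sample. The means, scales, and weights below are assumptions chosen to produce three visible peaks.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
means, scales, weights = [-4.0, 0.0, 5.0], [0.8, 1.0, 0.9], [0.3, 0.4, 0.3]

# Random sample: pick a mixture component per point, then draw from it.
comp = rng.choice(3, size=1000, p=weights)
X = rng.normal(np.take(means, comp), np.take(scales, comp))

# Exact PDF: the weighted sum of the component densities.
xs = np.linspace(-8, 9, 500)
pdf = sum(w * stats.norm.pdf(xs, m, s) for w, m, s in zip(weights, means, scales))

plt.plot(xs, pdf)                              # peaks = dense, troughs = sparse
plt.plot(X, np.zeros_like(X), "|", alpha=0.3)  # strip plot of the sample
plt.show()
```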

Coloring the 3 peaks/mountains/clusters

For those less familiar with probability, the two statements are practically the same:

highly dense regions separated by sparse regions

highly probable regions separated by improbable regions

One describes the data through its probability distribution and the other through a random sample from that distribution.

The PDF plot and the strip plot above convey the same information. The PDF (probability density function) at a point is proportional to the probability of falling within a small region around that point, and when looking at a sample from X, it can also be interpreted as the expected density of points around that location.

Given the underlying distribution, we expect more probable regions to have more points (to be denser) in a random sample. Similarly, given a random sample, we can make inferences about the probability of a region based on its empirical density.

Denser regions in the random sample correspond to more probable regions in the underlying distributions.

In fact, if we look at the histogram of a random sample of X, we see that it closely resembles the true distribution of X. The histogram is sometimes called the empirical probability distribution, and with enough data, we expect it to converge to the true underlying distribution.
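Continuing the sketch above (reusing X, xs, and pdf), we can overlay the normalized histogram on the exact PDF to watch the empirical distribution track the true one:

```python
plt.hist(X, bins=60, density=True, alpha=0.5)  # empirical distribution
plt.plot(xs, pdf)                              # true underlying PDF
plt.show()
```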

Again, density = probability. Denser = more probable.

But… what’s a cluster?

Sadly, even with our “mountains and valleys” definition of clusters, it can be difficult to know whether or not something is a single cluster. Look at the example below, where we shifted one of the modes of X to the right to produce X’. Although we still have 3 peaks, do we have 3 clusters? In some contexts we might say there are 3 clusters, yet “intuitively” there seem to be just 2. How do we decide?

By looking at the strip plot of X’, we can be a bit more certain that there are just 2 clusters.

X has 3 clusters, and X’ has 2 clusters. At what point does the number of clusters change?
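Here is a sketch of how X’ could be produced from the earlier mixture: only the (assumed) mean of the middle component changes, moving its peak toward its neighbor.

```python
means_shifted = [-4.0, 2.8, 5.0]  # middle mode moved right (assumed values)
pdf_shifted = sum(w * stats.norm.pdf(xs, m, s)
                  for w, m, s in zip(weights, means_shifted, scales))

plt.plot(xs, pdf_shifted)  # still 3 peaks, but the right pair nearly merges
plt.show()
```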

One way to decide is to set some global threshold 𝜆 on the PDF of the underlying distribution. The connected components of the resulting level set are your clusters [3]. This is essentially what DBSCAN does, and doing it at multiple levels results in DeBaCl [7].

Two different clusterings based on two different level-sets
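In one dimension, a single level set is easy to sketch using the density grid from the earlier snippets: keep the grid points where the density exceeds the threshold, and treat each maximal run of consecutive kept points as a connected component. The threshold value here is an assumption.

```python
lam = 0.05          # assumed global threshold
above = pdf >= lam  # grid points inside the level set

# Label each maximal run of consecutive above-threshold points; each run
# is one connected component, i.e. one cluster at level lam.
component = np.cumsum(np.diff(above.astype(int), prepend=0) == 1) * above
print(f"{component.max()} cluster(s) at lambda = {lam}")
```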

This might be appealing because of its simplicity, but don’t be fooled! We end up with an extra hyperparameter, the threshold 𝜆, which we might have to fine-tune. Moreover, this doesn’t work well for clusters with different densities.

To help us choose, we color our cluster choices as shown in the illustration below. Should we consider blue and yellow, or green only?

3 clusters on the left vs 2 clusters on the right

To choose, we look at which one “persists” more. Do we see them more together or apart? We can quantify this using the area of the colored regions.

On the left, we see that the sum of the areas of the blue and yellow regions is greater than the area of the green region. This means that the 2 peaks are more prominent, so we decide that they are two separate clusters.

On the right, we see that the green area is much larger. This means the two peaks are just “bumps” rather than true peaks, so we say they form a single cluster.

In the literature [2], the area of a region is its measure of persistence, and the selection method is called excess of mass (EOM). A bit more formally, we maximize the total sum of the persistences of the chosen clusters, under the constraint that the chosen clusters are non-overlapping.
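Here is a rough sketch of that comparison on the simulated density from earlier, under this section’s simplified “persistence = area” framing. The two levels are assumed values bracketing the saddle between the left pair of peaks, and treating the “green” region as the band below the saddle is one possible reading of the figures, not the library’s exact computation.

```python
def excess_areas(xs, density, lam):
    """Area of `density` in excess of level `lam`, per connected component."""
    dx = xs[1] - xs[0]  # uniform grid spacing
    above = density >= lam
    comp = np.cumsum(np.diff(above.astype(int), prepend=0) == 1) * above
    return [float((density[comp == k] - lam).sum() * dx)
            for k in range(1, comp.max() + 1)]

# Assumed levels just above and just below a saddle of the simulated pdf:
children = excess_areas(xs, pdf, 0.03)[:2]  # two peaks, seen separately
total = excess_areas(xs, pdf, 0.02)[0]      # the same two peaks, merged
green = total - sum(children)               # the band below the saddle
print("two clusters" if sum(children) > green else "one cluster")
```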

Constructing the hierarchy

By taking level sets at multiple values of 𝜆, we get a hierarchy. For a multidimensional setting, imagine the clusters as islands in the middle of the ocean. As you lower the sea level, the islands “grow”, and eventually some of them connect with one another.
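A small sketch of that sweep on the simulated density from earlier: lower the “sea level” from the highest peak down to zero and count the connected components at each level; changes in the count are the births and merges the hierarchy records.

```python
for lam in np.linspace(pdf.max(), 0.0, 8):
    above = pdf >= lam
    # Each rising edge of the level-set mask starts one component.
    n = int((np.diff(above.astype(int), prepend=0) == 1).sum())
    print(f"lambda = {lam:.3f}: {n} component(s)")
```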

To capture these relationships between clusters (islands), we represent them as a hierarchy tree. This representation generalizes to higher dimensions and is a natural abstraction that maps onto a data structure we can traverse and manipulate.

Visualizing the cluster hierarchies as a tree

By convention, trees are drawn top-down, where the root (the node where everything is just one cluster) is at the top and the tree grows downward.

Visualizing the tree top-down

If you are using the HDBSCAN library, you might use the clusterer.condensed_tree_.plot() API. The result, shown below, is equivalent to the tree shown above. The encircled nodes correspond to the chosen clusters: the yellow, blue, and red regions, respectively.

Condensed tree plot from HDBSCAN
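A minimal usage sketch, assuming a clusterer fitted as in the first snippet; the select_clusters flag asks the library to circle the clusters chosen by excess of mass.

```python
import matplotlib.pyplot as plt

# Circle the EOM-selected clusters in the condensed tree plot.
clusterer.condensed_tree_.plot(select_clusters=True)
plt.show()
```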

When using HDBSCAN, this particular plot may be useful for assessing the quality of your clusters and can help with fine-tuning the hyperparameters, as we will discuss in the “Parameter Selection” section.

Locally Approximating Density