HDBSCAN stands for Hierarchical Density-Based Spatial Clustering of Applications with Noise. It’s a clustering algorithm that overcomes many of the limitations of k-means. For example, it does not require the number of clusters, a parameter that is difficult to determine ahead of time, to be set before it can be used. We can explain the other advantages of HDBSCAN by breaking down each term:

Hierarchical: arranges points into hierarchies of clusters within clusters, allowing clusters of different sizes to be identified.

Density-Based: uses density of neighboring points to construct clusters, allowing clusters of any shape to be identified.

Applications with Noise: expects that real-world data is filled with noise, allowing noise points to be identified and excluded from clusters.
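
To make these properties concrete, here is a minimal usage sketch with the open-source hdbscan Python package (shown purely for illustration; it is a single-machine implementation, not the distributed approach described below). Note that there is no number-of-clusters parameter, and noise points come back labeled -1:

```python
# Minimal sketch using the open-source `hdbscan` package (illustrative only;
# not the distributed implementation described in this post).
import numpy as np
import hdbscan

X = np.random.rand(1000, 2)  # toy 2-D data

# min_cluster_size bounds the smallest cluster; no cluster count is required.
clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
labels = clusterer.fit_predict(X)

print("clusters found:", labels.max() + 1)
print("noise points:", np.sum(labels == -1))  # noise is labeled -1
```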

To read about how HDBSCAN compares with other clustering algorithms in more detail, we recommend this blog post.

Scaling Up the HDBSCAN Algorithm

The traditional implementation of HDBSCAN consists of two main phases:

1. Construct the k-nearest-neighbors (k-NN) graph, with an undirected edge connecting each point p to the k most similar points to p. Use the k-NN information to define a new metric, called the mutual reachability distance, between all pairs of points in the data (sketched after this list).
2. Find the minimum spanning tree (MST) connecting all points in the data according to the mutual reachability distance. This tree represents a hierarchy of clusters, and the individual clusters can be extracted using a simple heuristic.
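
For reference, the mutual reachability distance between points a and b is the maximum of their two core distances (each point's distance to its own k-th nearest neighbor) and the base distance d(a, b). The brute-force sketch below (function names are ours, for illustration) computes it directly; the point of the approximation described later is to avoid exactly this quadratic all-pairs computation:

```python
# Brute-force sketch of core distances and the mutual reachability distance.
# This O(n^2) version is for exposition only; at scale, core distances are
# read off the (approximate) k-NN lists instead.
import numpy as np
from scipy.spatial.distance import cdist

def core_distances(X, k):
    """Distance from each point to its k-th nearest neighbor."""
    D = cdist(X, X)
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest other point.
    return np.sort(D, axis=1)[:, k]

def mutual_reachability(X, core_dist, a, b):
    """d_mreach(a, b) = max(core(a), core(b), d(a, b))."""
    d_ab = np.linalg.norm(X[a] - X[b])
    return max(core_dist[a], core_dist[b], d_ab)
```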

For more detail on these phases, we recommend this blog post or the original paper.

Constructing the k-NN graph and finding the minimum spanning tree are both computationally expensive and impractical for large datasets, since both scale quadratically with the number of points. Our solution is to replace both phases with approximations that are more efficient, scale better, and are easier to implement in a distributed system. Of course, this solution only approximates the exact result of HDBSCAN, but we find that it performs well in practice.

Approximating the k-NN graph:

To approximate step 1, we use the recently published NN-Descent algorithm. It is based on the principle that a neighbor of a neighbor is likely to also be a neighbor. Each point p keeps a list of k other points, which are the k closest points to p found so far. In each update step, two points a and b are randomly chosen from the neighbor list of some point and compared: if b is closer to a than the farthest point in a’s neighbor list, then that farthest point is replaced by b. Repeating this update improves the accuracy of the k-NN graph approximation with each iteration.
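
A toy, single-threaded sketch of this update step is below (the data layout and function name are ours, for illustration). Each neighbor list is kept as a heap keyed on distance so the farthest neighbor can be replaced in O(log k):

```python
# Toy sketch of one NN-Descent update. `neighbors[p]` is p's current k-NN
# list, stored as a heap of (-distance, point) pairs so that neighbors[p][0]
# is p's *farthest* current neighbor. Assumes each list has k >= 2 entries.
import heapq
import random

def nn_descent_update(neighbors, dist):
    """Compare two points drawn from one point's neighbor list."""
    p = random.randrange(len(neighbors))
    (_, a), (_, b) = random.sample(neighbors[p], 2)
    d_ab = dist(a, b)
    farthest_d = -neighbors[a][0][0]  # distance to a's farthest neighbor
    if d_ab < farthest_d and all(pt != b for _, pt in neighbors[a]):
        # b is closer to a than a's farthest neighbor: replace that entry.
        heapq.heapreplace(neighbors[a], (-d_ab, b))
        return True  # the approximation improved
    return False
```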

NN-Descent possesses several desirable qualities which make it well-suited for our application. It has been shown to perform well on a variety of datasets and distance metrics, it has a sub-quadratic empirical runtime and can scale to larger datasets, and it can be easily distributed to multiple cores and machines.

Approximating the MST:

To approximate step 2, we find the minimum spanning tree of the graph produced by NN-Descent rather than the complete graph connecting all data points. The only important edges of the MST are those which connect points in the same cluster, and those edges should mostly be present in the k-NN graph. Doing this drastically reduces the size of the graph input to the MST algorithm, and allows the computation to scale to larger datasets.
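
Because the k-NN graph is already sparse (roughly n*k edges rather than n^2), an exact MST algorithm can be run on it directly. Here is a minimal single-machine sketch using SciPy (the variable name is ours, and this stands in for the distributed implementation):

```python
# Sketch of the approximate MST step: run an exact MST algorithm over the
# sparse k-NN graph instead of the dense all-pairs graph. `mreach_knn_graph`
# is assumed to be a scipy.sparse matrix whose entries are mutual
# reachability distances between k-NN pairs.
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def approximate_mst(mreach_knn_graph: csr_matrix) -> csr_matrix:
    """MST over ~n*k edges rather than the ~n^2 edges of the complete graph.

    If the k-NN graph is disconnected, this returns a spanning forest,
    with each connected component spanned separately.
    """
    return minimum_spanning_tree(mreach_knn_graph)
```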

Clustering Results

To assess the quality of the approximation, we evaluated the clustering on a 300-dimensional word embedding dataset and a 98,304-dimensional image embedding dataset. We found that the results were meaningful in both cases. Here are example clusters from both datasets: