Significance

Clustering is a fundamental experimental procedure in data analysis. It is used in virtually all natural and social sciences and has played a central role in biology, astronomy, psychology, medicine, and chemistry. Despite the importance and ubiquity of clustering, existing algorithms suffer from a variety of drawbacks and no universal solution has emerged. We present a clustering algorithm that reliably achieves high accuracy across domains, handles high data dimensionality, and scales to large datasets. The algorithm optimizes a smooth global objective, using efficient numerical methods. Experiments demonstrate that our method outperforms state-of-the-art clustering algorithms by significant factors in multiple domains.

Abstract

Clustering is a fundamental procedure in the analysis of scientific data. It is used ubiquitously across the sciences. Despite decades of research, existing clustering algorithms have limited effectiveness in high dimensions and often require tuning parameters for different domains and datasets. We present a clustering algorithm that achieves high accuracy across multiple domains and scales efficiently to high dimensions and large datasets. The presented algorithm optimizes a smooth continuous objective, which is based on robust statistics and allows heavily mixed clusters to be untangled. The continuous nature of the objective also allows clustering to be integrated as a module in end-to-end feature learning pipelines. We demonstrate this by extending the algorithm to perform joint clustering and dimensionality reduction by efficiently optimizing a continuous global objective. The presented approach is evaluated on large datasets of faces, hand-written digits, objects, newswire articles, sensor readings from the Space Shuttle, and protein expression levels. Our method achieves high accuracy across all datasets, outperforming the best prior algorithm by a factor of 3 in average rank.

Clustering is one of the fundamental experimental procedures in data analysis. It is used in virtually all natural and social sciences and has played a central role in biology, astronomy, psychology, medicine, and chemistry. Data-clustering algorithms have been developed for more than half a century (1). Significant advances in the last two decades include spectral clustering (2–4), generalizations of classic center-based methods (5, 6), mixture models (7, 8), mean shift (9), affinity propagation (10), subspace clustering (11–13), nonparametric methods (14, 15), and feature selection (16–20).

Despite these developments, no single algorithm has emerged to displace the k-means scheme and its variants (21). This is despite the known drawbacks of such center-based methods, including sensitivity to initialization, limited effectiveness in high-dimensional spaces, and the requirement that the number of clusters be set in advance. The endurance of these methods is in part due to their simplicity and in part due to difficulties associated with some of the new techniques, such as additional hyperparameters that need to be tuned, high computational cost, and varying effectiveness across domains. Consequently, scientists who analyze large high-dimensional datasets with unknown distribution must maintain and apply multiple different clustering algorithms in the hope that one will succeed. Books have been written to guide practitioners through the landscape of data-clustering techniques (22).

We present a clustering algorithm that is fast, easy to use, and effective in high dimensions. The algorithm optimizes a clear continuous objective, using standard numerical methods that scale to massive datasets. The number of clusters need not be known in advance.

The operation of the algorithm can be understood by contrasting it with other popular clustering techniques. In center-based algorithms such as k-means (1, 24), a small set of putative cluster centers is initialized from the data and then iteratively refined. In affinity propagation (10), data points communicate over a graph structure to elect a subset of the points as representatives. In the presented algorithm, each data point has a dedicated representative, initially located at the data point. Over the course of the algorithm, the representatives move and coalesce into easily separable clusters. The progress of the algorithm is visualized in Fig. 1.

Fig. 1. RCC on the Modified National Institute of Standards and Technology (MNIST) dataset. Each data point $\mathbf{x}_i$ has a corresponding representative $\mathbf{u}_i$. The representatives are optimized to reveal the structure of the data. A–C visualize the representation $\mathbf{U}$ using the t-SNE algorithm (23). Ground-truth clusters are coded by color. (A) The initial state, $\mathbf{U} = \mathbf{X}$. (B) The representation $\mathbf{U}$ after 20 iterations of the optimization. (C) The final representation produced by the algorithm.

Our formulation is based on recent convex relaxations for clustering (25, 26). However, our objective is deliberately not convex. We use redescending robust estimators that allow even heavily mixed clusters to be untangled by optimizing a single continuous objective. Despite the nonconvexity of the objective, the optimization can still be performed using standard linear least-squares solvers, which are highly efficient and scalable. Since the algorithm expresses clustering as optimization of a continuous objective based on robust estimation, we call it robust continuous clustering (RCC).

One of the characteristics of the presented formulation is that clustering is reduced to optimization of a continuous objective. This enables the integration of clustering in end-to-end feature learning pipelines. We demonstrate this by extending RCC to perform joint clustering and dimensionality reduction. The extended algorithm, called RCC-DR, learns an embedding of the data into a low-dimensional space in which it is clustered. Embedding and clustering are performed jointly, by an algorithm that optimizes a clear global objective.

We evaluate RCC and RCC-DR on a large number of datasets from a variety of domains. These include image datasets, document datasets, a dataset of sensor readings from the Space Shuttle, and a dataset of protein expression levels in mice. Experiments demonstrate that our method significantly outperforms prior state-of-the-art techniques. RCC-DR is particularly robust across datasets from different domains, outperforming the best prior algorithm by a factor of 3 in average rank.

Formulation

We consider the problem of clustering a set of $n$ data points. The input is denoted by $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n]$, where $\mathbf{x}_i \in \mathbb{R}^D$. Our approach operates on a set of representatives $\mathbf{U} = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_n]$, where $\mathbf{u}_i \in \mathbb{R}^D$. The representatives $\mathbf{U}$ are initialized at the corresponding data points $\mathbf{X}$. The optimization operates on the representation $\mathbf{U}$, which coalesces to reveal the cluster structure latent in the data. Thus, the number of clusters need not be known in advance. The optimization of $\mathbf{U}$ is illustrated in Fig. 1. The RCC objective has the following form:

$$\mathbf{C}(\mathbf{U}) = \frac{1}{2}\sum_{i=1}^{n} \|\mathbf{x}_i - \mathbf{u}_i\|_2^2 + \frac{\lambda}{2}\sum_{(p,q)\in E} w_{p,q}\,\rho\big(\|\mathbf{u}_p - \mathbf{u}_q\|_2\big). \quad [1]$$

Here $E$ is the set of edges in a graph connecting the data points. The graph is constructed automatically from the data. We use mutual $k$-nearest-neighbor (m-kNN) connectivity (27), which is more robust than commonly used kNN graphs. The weights $w_{p,q}$ balance the contribution of each data point to the pairwise terms, and $\lambda$ balances the strength of the two objective terms. The function $\rho(\cdot)$ is a penalty on the regularization terms.

The use of an appropriate robust penalty function $\rho$ is central to our method. Since we want representatives $\mathbf{u}_i$ of observations from the same latent cluster to collapse into a single point, a natural penalty would be the $\ell_0$ norm ($\rho(y) = [y \neq 0]$, where $[\cdot]$ is the Iverson bracket). However, this transforms the objective into an intractable combinatorial optimization problem. At another extreme, recent work has explored the use of convex penalties, such as the $\ell_1$ and $\ell_2$ norms (25, 26). This has the advantage of turning objective 1 into a convex optimization problem. However, convex functions—even the $\ell_1$ norm—have limited robustness to spurious edges in the connectivity structure $E$, because the influence of a spurious pairwise term does not diminish as representatives move apart during the optimization.
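For concreteness, objective 1 can be evaluated in a few lines of NumPy. The snippet below is a minimal sketch of our own (the toy data, variable names, and parameter values are illustrative, not from the paper), using the Geman–McClure estimator of Eq. 3 for $\rho$:

```python
import numpy as np

def rho_gm(y, mu):
    # Geman-McClure estimator (Eq. 3): robust penalty on edge length y
    return mu * y**2 / (mu + y**2)

def rcc_objective(X, U, edges, w, lam, mu):
    # Data term: keeps each representative u_i close to its data point x_i
    data = 0.5 * np.sum((X - U)**2)
    # Pairwise term: robustly penalizes distances between connected representatives
    pair = 0.0
    for (p, q), w_pq in zip(edges, w):
        pair += w_pq * rho_gm(np.linalg.norm(U[:, p] - U[:, q]), mu)
    return data + 0.5 * lam * pair

# Toy example: 4 points in R^2; columns are samples
X = np.array([[0.0, 0.1, 5.0, 5.1],
              [0.0, 0.0, 0.0, 0.0]])
U = X.copy()                       # initialization U = X
edges = [(0, 1), (2, 3), (1, 2)]   # last edge spuriously crosses the two clusters
w = [1.0, 1.0, 1.0]
print(rcc_objective(X, U, edges, w, lam=1.0, mu=1.0))
```

Note how the robust $\rho$ caps the contribution of the long spurious edge near $\mu$, whereas a quadratic penalty would let it dominate the objective.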
Given noisy real-world data, heavy contamination of the connectivity structure by connections across different underlying clusters is inevitable. Our method uses robust estimators to automatically prune spurious intercluster connections while maintaining veridical intracluster correspondences, all within a single continuous objective. The second term in objective 1 is related to the mean shift objective (9). The RCC objective differs in that it includes an additional data term, uses a sparse (as opposed to a fully connected) connectivity structure, and is based on robust estimation.

Our approach is based on the duality between robust estimation and line processes (28). We introduce an auxiliary variable $l_{p,q}$ for each connection $(p,q) \in E$ and optimize a joint objective over the representatives $\mathbf{U}$ and the line process $\mathbb{L} = \{l_{p,q}\}$:

$$\mathbf{C}(\mathbf{U}, \mathbb{L}) = \frac{1}{2}\sum_{i=1}^{n} \|\mathbf{x}_i - \mathbf{u}_i\|_2^2 + \frac{\lambda}{2}\sum_{(p,q)\in E} w_{p,q}\Big(l_{p,q}\|\mathbf{u}_p - \mathbf{u}_q\|_2^2 + \Psi(l_{p,q})\Big). \quad [2]$$

Here $\Psi(l_{p,q})$ is a penalty on ignoring a connection $(p,q)$: $\Psi(l_{p,q})$ tends to zero when the connection is active ($l_{p,q} \to 1$) and grows toward its maximum when the connection is disabled ($l_{p,q} \to 0$). A broad variety of robust estimators $\rho(\cdot)$ have corresponding penalty functions $\Psi(\cdot)$ such that objectives 1 and 2 are equivalent with respect to $\mathbf{U}$: optimizing either of the two objectives yields the same set of representatives $\mathbf{U}$. This formulation is related to iteratively reweighted least squares (IRLS) (29) but is more flexible, due to the explicit variables $\mathbb{L}$ and the ability to define additional terms over these variables. Objective 2 can be optimized by any gradient-based method. However, its form enables efficient and scalable optimization by iterative solution of linear least-squares systems.
This yields a general approach that can accommodate many robust nonconvex functions $\rho$, reduces clustering to the application of highly optimized off-the-shelf linear system solvers, and easily scales to datasets with hundreds of thousands of points in tens of thousands of dimensions. In comparison, recent work has considered a specific family of concave penalties and derived a computationally intensive majorization–minimization scheme for optimizing the objective in this special case (30). Our work provides a highly efficient general solution.

While the presented approach can accommodate many estimators in the same computationally efficient framework, our exposition and experiments use a form of the well-known Geman–McClure estimator (31),

$$\rho(y) = \frac{\mu y^2}{\mu + y^2}, \quad [3]$$

where $\mu$ is a scale parameter. The corresponding penalty function that makes objectives 1 and 2 equivalent with respect to $\mathbf{U}$ is

$$\Psi(l_{p,q}) = \mu\left(\sqrt{l_{p,q}} - 1\right)^2. \quad [4]$$
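The equivalence between objectives 1 and 2 can be checked numerically: for a fixed edge length $y$, minimizing $l\,y^2 + \Psi(l)$ over $l$ recovers $\rho(y)$, with the minimizer given by Eq. 5. A small verification sketch of our own:

```python
import numpy as np

mu = 2.0

def rho(y):
    # Geman-McClure estimator (Eq. 3)
    return mu * y**2 / (mu + y**2)

def psi(l):
    # Dual penalty on disabling a connection (Eq. 4)
    return mu * (np.sqrt(l) - 1.0)**2

def l_opt(y):
    # Closed-form minimizer of l * y^2 + psi(l) over l (Eq. 5)
    return (mu / (mu + y**2))**2

for y in [0.0, 0.5, 1.0, 3.0, 10.0]:
    l = l_opt(y)
    # At the optimal l, the line-process form equals the robust penalty
    assert abs(l * y**2 + psi(l) - rho(y)) < 1e-12
    # Brute-force check that l_opt is indeed the minimizer over [0, 1]
    grid = np.linspace(0.0, 1.0, 10001)
    assert (grid * y**2 + psi(grid)).min() >= l * y**2 + psi(l) - 1e-6
print("duality verified")
```

Note the behavior at the extremes: a zero-length edge keeps its full weight ($l = 1$), while a very long edge is driven toward $l = 0$ and is effectively pruned.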

Optimization

Objective 2 is biconvex in $(\mathbf{U}, \mathbb{L})$. When the variables $\mathbf{U}$ are fixed, the individual pairwise terms decouple and the optimal value of each $l_{p,q}$ can be computed independently in closed form. When the variables $\mathbb{L}$ are fixed, objective 2 turns into a linear least-squares problem. We exploit this special structure and optimize the objective by alternately updating the variable sets $\mathbf{U}$ and $\mathbb{L}$. As a block coordinate descent algorithm, this alternating minimization scheme provably converges.

When $\mathbf{U}$ are fixed, the optimal value of each $l_{p,q}$ is given by

$$l_{p,q} = \left(\frac{\mu}{\mu + \|\mathbf{u}_p - \mathbf{u}_q\|_2^2}\right)^2. \quad [5]$$

This can be verified by substituting Eq. 5 into Eq. 2, which yields objective 1 with respect to $\mathbf{U}$. When $\mathbb{L}$ are fixed, we can rewrite objective 2 in matrix form and obtain a simplified expression for solving for $\mathbf{U}$:

$$\arg\min_{\mathbf{U}}\; \frac{1}{2}\|\mathbf{X} - \mathbf{U}\|_F^2 + \frac{\lambda}{2}\sum_{(p,q)\in E} w_{p,q}\, l_{p,q}\, \|\mathbf{U}(\mathbf{e}_p - \mathbf{e}_q)\|_2^2, \quad [6]$$

where $\mathbf{e}_i$ is an indicator vector with the $i$th element set to 1. This is a linear least-squares problem that can be efficiently solved using fast and scalable solvers. The linear least-squares formulation is given by $\mathbf{U}\mathbf{M} = \mathbf{X}$, where

$$\mathbf{M} = \mathbf{I} + \lambda \sum_{(p,q)\in E} w_{p,q}\, l_{p,q}\, (\mathbf{e}_p - \mathbf{e}_q)(\mathbf{e}_p - \mathbf{e}_q)^\top. \quad [7]$$

Here $\mathbf{I} \in \mathbb{R}^{n\times n}$ is the identity matrix. It is easy to prove that

$$\mathbf{A} \triangleq \sum_{(p,q)\in E} w_{p,q}\, l_{p,q}\, (\mathbf{e}_p - \mathbf{e}_q)(\mathbf{e}_p - \mathbf{e}_q)^\top \quad [8]$$

is a Laplacian matrix, and hence $\mathbf{M}$ is symmetric and positive definite. Each row of $\mathbf{U}$ in Eq. 7 can be solved for independently and in parallel. The RCC algorithm is summarized in Algorithm 1: RCC. Note that all updates of $\mathbf{U}$ and $\mathbb{L}$ optimize the same continuous global objective 2. The algorithm uses graduated nonconvexity (32). It begins with a locally convex approximation of the objective, obtained by setting $\mu$ such that the second derivative of the estimator is positive ($\ddot{\rho}(y) > 0$) over the relevant part of the domain.
Over the iterations, $\mu$ is automatically decreased, gradually introducing nonconvexity into the objective. Under certain assumptions, such continuation schemes are known to attain solutions that are close to the global optimum (33).

The parameter $\lambda$ in the RCC objective 1 balances the strength of the data terms and the pairwise terms. The reformulation of RCC as a linear least-squares problem enables setting $\lambda$ automatically. Specifically, Eq. 7 suggests that the data terms and pairwise terms can be balanced by setting

$$\lambda = \frac{\|\mathbf{X}\|_2}{\|\mathbf{A}\|_2}. \quad [9]$$

The value of $\lambda$ is updated automatically according to this formula after every update of $\mu$. An update involves computing only the largest eigenvalue of the Laplacian matrix $\mathbf{A}$. The spectral norm of $\mathbf{X}$ is precomputed at initialization and reused. Additional details concerning Algorithm 1 are provided in SI Methods.
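Putting the pieces together, the alternating scheme and the automatic setting of $\lambda$ (Eq. 9) can be sketched on a toy problem. The code below is an illustrative miniature of our own, not the authors' implementation: $\mu$ is held fixed rather than annealed, unit edge weights are assumed, and the linear system of Eq. 7 is solved densely for clarity.

```python
import numpy as np

def rcc_iterate(X, edges, w, mu, iters=50):
    # X: D x n data matrix; representatives are initialized at the data (U = X)
    D, n = X.shape
    U = X.copy()
    x_norm = np.linalg.norm(X, 2)          # spectral norm of X, precomputed once
    for _ in range(iters):
        # Closed-form line-process update (Eq. 5), given fixed U
        l = np.array([(mu / (mu + np.sum((U[:, p] - U[:, q])**2)))**2
                      for p, q in edges])
        # Assemble the weighted graph Laplacian A (Eq. 8)
        A = np.zeros((n, n))
        for (p, q), w_pq, l_pq in zip(edges, w, l):
            c = w_pq * l_pq
            A[p, p] += c; A[q, q] += c
            A[p, q] -= c; A[q, p] -= c
        # Automatic balancing of the two objective terms (Eq. 9)
        lam = x_norm / np.linalg.norm(A, 2)
        # Linear least-squares update of U: solve U (I + lam * A) = X (Eq. 7)
        M = np.eye(n) + lam * A
        U = np.linalg.solve(M.T, X.T).T
    return U

# Two well-separated pairs; the cross-cluster edge (1, 2) should be pruned
X = np.array([[0.0, 0.2, 8.0, 8.2],
              [0.0, 0.0, 0.0, 0.0]])
edges = [(0, 1), (2, 3), (1, 2)]
w = [1.0, 1.0, 1.0]
U = rcc_iterate(X, edges, w, mu=0.5)
```

On this toy input the representatives within each pair coalesce, while the line-process weight of the cross-cluster edge collapses toward zero and leaves the two clusters far apart.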

SI Methods

Initialization and Output. We initialize the optimization with $\mathbf{U} = \mathbf{X}$ and $l_{p,q} = 1$. The output clusters are the weakly connected components of a graph in which a pair $\mathbf{x}_i$ and $\mathbf{x}_j$ is connected by an edge if and only if $\|\mathbf{u}_i - \mathbf{u}_j\|_2 < \delta$. The threshold $\delta$ is set to the mean of the lengths of the shortest 1% of the edges in $E$.

Connectivity Structure. The connectivity structure $E$ is based on m-kNN connectivity (27), which is more robust than commonly used kNN graphs. We use $k = 10$ and the cosine similarity metric for m-kNN graph construction. In an m-kNN graph, two nodes are connected by an edge if and only if each is among the $k$ nearest neighbors of the other. This allows statistically different clusters (e.g., clusters of different scales) to remain disconnected. A downside of this connectivity scheme is that some nodes in an m-kNN graph may be sparsely connected or even disconnected. To make sure that no data point is isolated, we augment $E$ with the minimum spanning tree of the $k$-nearest-neighbor graph of the dataset. To balance the contribution of each node to the objective, we set

$$w_{p,q} = \frac{\sum_{i=1}^{n} N_i}{n\sqrt{N_p N_q}}, \quad [S1]$$

where $N_i$ is the number of edges incident to $\mathbf{x}_i$ in $E$.

Graduated Nonconvexity. The penalty function in Eq. 3 is nonconvex and its shape depends on the value of the parameter $\mu$. To support convergence to a good solution, we use graduated nonconvexity (32). We begin by setting $\mu$ such that the objective is convex over the relevant range and gradually decrease $\mu$ to sharpen the penalty and neutralize the influence of spurious connections in $E$. Specifically, $\mu$ is initially set to $\mu = 3r^2$, where $r$ is the maximal edge length in $E$. The value of $\mu$ is halved every four iterations until it drops below $\delta/2$.

Parameter Settings. The termination conditions are set to maxiterations = 100 and $\varepsilon = 0.1$. For RCC-DR, the sparse coding parameters are set to $d = 100$, $\xi = 8$, $\gamma = 0.2$, and $\eta = 0.9$. The dictionary is initialized using PCA components.
Due to the small input dimension, we set $d = 8$ for the Shuttle, Pendigits, and Mice Protein datasets. The parameters $\delta_2$ and $\mu_2$ in RCC-DR are computed using $\mathbf{Z}$, by analogy to their counterparts in RCC. To set $\delta_1$, we compute the distance $r_i$ of each data point $\mathbf{z}_i$ from the mean of the data $\mathbf{Z}$ and set $\delta_1 = \mathrm{mean}(2r_i)$. The initial value of $\mu_1$ is set to $\mu_1 = \xi\delta_1$. The parameter $\lambda$ is set automatically to

$$\lambda = \frac{\|\mathbf{Z}\mathbf{H}\|_2}{\|\mathbf{A}\|_2 + \|\mathbf{H}\|_2}. \quad [S2]$$

Implementation. We use an approximate nearest-neighbor search to construct the connectivity structure (54) and a conjugate gradient solver for the linear systems (55).

The RCC-DR Algorithm. The RCC-DR algorithm is summarized in Algorithm S1: Joint Clustering and Dimensionality Reduction.
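The graph construction described above (m-kNN connectivity augmented with a minimum spanning tree) can be sketched with scikit-learn and SciPy. This is an illustrative version of our own: the toy data, the choice of $k$, and the Euclidean metric are placeholders (the paper uses cosine similarity with $k = 10$), and the Eq. S1 edge weights are omitted.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import minimum_spanning_tree
from sklearn.neighbors import kneighbors_graph

def mknn_edges(X, k):
    # X: n x d array, rows are samples. Directed kNN graph as 0/1 adjacency
    G = kneighbors_graph(X, n_neighbors=k, mode='connectivity')
    # Mutual kNN: keep an edge only if each endpoint is among the other's k NN
    mutual = G.multiply(G.T)
    # Augment with the MST of the kNN distance graph so that no node of a
    # connected component is left isolated by the mutual-kNN filter
    dist = kneighbors_graph(X, n_neighbors=k, mode='distance')
    mst = minimum_spanning_tree(dist)
    E = ((mutual + mst + mst.T) > 0).tocoo()
    # Return each undirected edge once, as (p, q) with p < q
    return {(int(p), int(q)) for p, q in zip(E.row, E.col) if p < q}

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(5.0, 0.1, (20, 2))])
edges = mknn_edges(X, k=5)
```

With two well-separated Gaussian blobs, the construction connects every point within its blob while producing no cross-blob edges, matching the intent of the m-kNN scheme.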

Joint Clustering and Dimensionality Reduction

The RCC formulation can be interpreted as learning a graph-regularized embedding $\mathbf{U}$ of the data $\mathbf{X}$. In Algorithm 1 the dimensionality of the embedding $\mathbf{U}$ is the same as the dimensionality of the data $\mathbf{X}$. However, since RCC optimizes a continuous and differentiable objective, it can be used within end-to-end feature learning pipelines. We now demonstrate this by extending RCC to perform joint clustering and dimensionality reduction. Such joint optimization has been considered in recent work (34, 35). The algorithm we develop, RCC-DR, learns a linear mapping into a reduced space in which the data are clustered. The mapping is optimized as part of the clustering objective, yielding an embedding in which the data can be clustered most effectively. RCC-DR inherits the appealing properties of RCC: clustering and dimensionality reduction are performed jointly by optimizing a clear continuous objective, the framework supports nonconvex robust estimators that can untangle mixed clusters, and optimization is performed by efficient and scalable numerical methods.

Algorithm 1. RCC

We begin by considering an initial formulation for the RCC-DR objective:

$$\mathbf{C}(\mathbf{U}, \mathbf{Z}, \mathbf{D}) = \|\mathbf{X} - \mathbf{D}\mathbf{Z}\|_2^2 + \gamma\sum_{i=1}^{n}\|\mathbf{z}_i\|_1 + \nu\left(\sum_{i=1}^{n}\|\mathbf{z}_i - \mathbf{u}_i\|_2^2 + \frac{\lambda}{2}\sum_{(p,q)\in E} w_{p,q}\,\rho\big(\|\mathbf{u}_p - \mathbf{u}_q\|_2\big)\right). \quad [10]$$

Here $\mathbf{D} \in \mathbb{R}^{D\times d}$ is a dictionary, $\mathbf{z}_i \in \mathbb{R}^d$ is a sparse code corresponding to the $i$th data sample, and $\mathbf{u}_i \in \mathbb{R}^d$ is the low-dimensional embedding of $\mathbf{x}_i$. For a fixed $\mathbf{D}$, the parameter $\nu$ balances the data term in the sparse coding objective with the clustering objective in the reduced space. This initial formulation 10 is problematic because in the beginning of the optimization the representation $\mathbf{U}$ can be noisy due to spurious intercluster connections that have not yet been disabled.
This had no effect on the convergence of the original RCC objective 1, but in formulation 10 the contamination of $\mathbf{U}$ can infect the sparse coding system via $\mathbf{Z}$ and corrupt the dictionary $\mathbf{D}$. For this reason, we use a different formulation that has the added benefit of eliminating the parameter $\nu$:

$$\mathbf{C}(\mathbf{U}, \mathbf{Z}, \mathbf{D}) = \|\mathbf{X} - \mathbf{D}\mathbf{Z}\|_2^2 + \gamma\sum_{i=1}^{n}\|\mathbf{z}_i\|_1 + \sum_{i=1}^{n}\rho_1\big(\|\mathbf{z}_i - \mathbf{u}_i\|_2\big) + \frac{\lambda}{2}\sum_{(p,q)\in E} w_{p,q}\,\rho_2\big(\|\mathbf{u}_p - \mathbf{u}_q\|_2\big). \quad [11]$$

Here we replaced the $\ell_2$ penalty on the data term in the reduced space with a robust penalty. We use the Geman–McClure estimator 3 for both $\rho_1$ and $\rho_2$.

To optimize objective 11, we introduce line processes $\mathbb{L}^1$ and $\mathbb{L}^2$ corresponding to the data and pairwise terms in the reduced space, respectively, and optimize a joint objective over $\mathbf{U}$, $\mathbf{Z}$, $\mathbf{D}$, $\mathbb{L}^1$, and $\mathbb{L}^2$. The optimization is performed by block coordinate descent over these groups of variables. The line processes $\mathbb{L}^1$ and $\mathbb{L}^2$ can be updated in closed form as in Eq. 5. The variables $\mathbf{U}$ are updated by solving the linear system

$$\mathbf{U}\mathbf{M}_{\mathrm{dr}} = \mathbf{Z}\mathbf{H}, \quad [12]$$

where

$$\mathbf{M}_{\mathrm{dr}} = \mathbf{H} + \lambda\sum_{(p,q)\in E} w_{p,q}\, l^2_{p,q}\, (\mathbf{e}_p - \mathbf{e}_q)(\mathbf{e}_p - \mathbf{e}_q)^\top \quad [13]$$

and $\mathbf{H}$ is a diagonal matrix with $h_{i,i} = l^1_i$. The dictionary $\mathbf{D}$ and the codes $\mathbf{Z}$ are initialized using principal component analysis (PCA). [The K-SVD algorithm can also be used for this purpose (36).] The variables $\mathbf{Z}$ are updated by accelerated proximal gradient-descent steps (37):

$$\bar{\mathbf{Z}} = \mathbf{Z}^t + \omega_t\big(\mathbf{Z}^t - \mathbf{Z}^{t-1}\big),$$
$$\mathbf{Z}^{t+1} = \mathbf{prox}_{\tau\gamma\|\cdot\|_1}\!\Big(\bar{\mathbf{Z}} - \tau\big(\mathbf{D}^\top(-\mathbf{X} + \mathbf{D}\bar{\mathbf{Z}}) + (\bar{\mathbf{Z}} - \mathbf{U})\mathbf{H}\big)\Big), \quad [14]$$

where $\tau = \frac{1}{\|\mathbf{D}^\top\mathbf{D}\|_2 + \|\mathbf{H}\|_2}$ and $\omega_t = \frac{t}{t+3}$. The $\mathbf{prox}_{\varepsilon\|\cdot\|_1}$ operator performs elementwise soft thresholding:

$$\mathbf{prox}_{\varepsilon\|\cdot\|_1}(v) = \mathrm{sign}(v)\max\big(0, |v| - \varepsilon\big). \quad [15]$$

The variables $\mathbf{D}$ are updated using

$$\bar{\mathbf{D}} = \mathbf{X}\mathbf{Z}^\top\big(\mathbf{Z}\mathbf{Z}^\top + \beta\mathbf{I}\big)^{-1}, \quad [16]$$
$$\mathbf{D}^{t+1} = \eta\mathbf{D}^t + (1 - \eta)\bar{\mathbf{D}}, \quad [17]$$

where $\beta$ is a small regularization value set to $\beta = 10^{-4}\,\mathrm{tr}\big(\mathbf{Z}\mathbf{Z}^\top\big)$. A precise specification of the RCC-DR algorithm is provided in Algorithm S1.
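The proximal operator of Eq. 15 is the standard elementwise soft-thresholding map and is essentially one line of NumPy (sketch ours; the example vector is illustrative):

```python
import numpy as np

def prox_l1(v, eps):
    # Elementwise soft thresholding (Eq. 15): shrink each entry toward
    # zero by eps; entries with magnitude below eps are zeroed exactly,
    # which is what produces sparse codes Z
    return np.sign(v) * np.maximum(0.0, np.abs(v) - eps)

v = np.array([-2.0, -0.3, 0.0, 0.3, 2.0])
print(prox_l1(v, 0.5))  # small entries vanish, large ones shrink by 0.5
```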

Experiments

Datasets. We have conducted experiments on datasets from multiple domains. The dimensionality of the data in the different datasets varies from 9 to just below 50,000. Reuters-21578 is the classic benchmark for text classification, comprising 21,578 articles that appeared on the Reuters newswire in 1987. RCV1 is a more recent benchmark of 800,000 manually categorized Reuters newswire articles (38). (Due to the limited scalability of some prior algorithms, we use 10,000 random samples from RCV1.) Shuttle is a dataset from NASA that contains 58,000 multivariate measurements produced by sensors in the radiator subsystem of the Space Shuttle; these measurements are known to arise from seven different conditions of the radiators. Mice Protein is a dataset that consists of the expression levels of 77 proteins measured in the cerebral cortex of eight classes of control and trisomic mice (39). The last two datasets were obtained from the University of California, Irvine, machine-learning repository (40). MNIST is the classic dataset of 70,000 hand-written digits (41). Pendigits is another well-known dataset of hand-written digits (42). The Extended Yale Face Database B (YaleB) contains images of the faces of 28 human subjects (43). The YouTube Faces Database (YTF) contains videos of the faces of different subjects (44); we use all video frames from the first 40 subjects, sorted in chronological order. The Columbia University Image Library (COIL-100) is a classic collection of color images of 100 objects, each imaged from 72 viewpoints (45). The datasets are summarized in Table 1.

Table 1. Datasets used in experiments

Baselines. We compare RCC and RCC-DR to 13 baselines, which include widely known clustering algorithms as well as recent techniques that were reported to achieve state-of-the-art performance.
Our baselines are k-means++ (24), Gaussian mixture models (GMM), fuzzy clustering, mean-shift clustering (MS) (9), two variants of agglomerative clustering (AC-Complete and AC-Ward), normalized cuts (N-Cuts) (2), affinity propagation (AP) (10), Zeta $l$-links (Zell) (46), spectral embedded clustering (SEC) (47), clustering using local discriminant models and global integration (LDMGI) (48), graph degree linkage (GDL) (49), and path integral clustering (PIC) (50). The parameter settings for the baselines are summarized in Table S1.

Table S1. Parameter settings for baselines

Measures. Normalized mutual information (NMI) has emerged as the standard measure for evaluating clustering accuracy in the machine-learning community (51). However, NMI is known to be biased in favor of fine-grained partitions. For this reason, we use adjusted mutual information (AMI), which removes this bias (52). This measure is defined as follows:

$$\mathrm{AMI}(\mathbf{c}, \hat{\mathbf{c}}) = \frac{\mathrm{MI}(\mathbf{c}, \hat{\mathbf{c}}) - E\big[\mathrm{MI}(\mathbf{c}, \hat{\mathbf{c}})\big]}{\sqrt{H(\mathbf{c})\,H(\hat{\mathbf{c}})} - E\big[\mathrm{MI}(\mathbf{c}, \hat{\mathbf{c}})\big]}. \quad [18]$$

Here $H(\cdot)$ is the entropy, $\mathrm{MI}(\cdot,\cdot)$ is the mutual information, and $\mathbf{c}$ and $\hat{\mathbf{c}}$ are the two partitions being compared. For completeness, Table S2 provides an evaluation using the NMI measure.

Table S2. Accuracy of all algorithms on all datasets, measured by NMI

Results. Results on all datasets are reported in Table 2. In addition to accuracy on each dataset, Table 2 also reports the average rank of each algorithm across datasets. For example, if an algorithm achieves the third-highest accuracy on half of the datasets and the fourth-highest on the other half, its average rank is 3.5. If an algorithm did not yield a result on a dataset due to its size, that dataset is not taken into account in computing the average rank of the algorithm.

Table 2. Accuracy of all algorithms on all datasets, measured by AMI

RCC or RCC-DR achieves the highest accuracy on seven of the nine datasets.
RCC-DR achieves the highest or second-highest accuracy on eight of the nine datasets, and RCC achieves the highest or second-highest accuracy on five datasets. The average rank of RCC-DR and RCC is 1.6 and 2.4, respectively. The best-performing prior algorithm, LDMGI, has an average rank of 4.9, three times higher than the rank of RCC-DR. This indicates that the performance of prior algorithms is not only lower than that of RCC and RCC-DR, it is also inconsistent, since no prior algorithm clearly leads the others across datasets. In contrast, the low average rank of RCC and RCC-DR indicates consistently high performance across datasets.

Clustering Gene Expression Data. We conducted an additional comprehensive evaluation on a large-scale benchmark that consists of more than 30 cancer gene expression datasets, collected for the purpose of evaluating clustering algorithms (53). The results are reported in Table S3. RCC-DR achieves the highest accuracy on eight of the datasets. Among the prior algorithms, affinity propagation achieves the highest accuracy on six of the datasets and all others on fewer. Overall, RCC-DR achieves the highest average AMI across the datasets.

Table S3. AMI on cancer gene expression datasets

Running Time. The execution time of RCC-DR optimization is visualized in Fig. 2. For reference, we also show the corresponding timings for affinity propagation, a well-known modern clustering algorithm (10), and LDMGI, the baseline that demonstrated the best performance across datasets (48). Fig. 2 shows the running time of each algorithm on randomly sampled subsets of the 784-dimensional MNIST dataset. We sample subsets of different sizes to evaluate runtime growth as a function of dataset size. Performance is measured on a workstation with an Intel Core i7-5960X CPU clocked at 3.0 GHz. RCC-DR clusters the whole MNIST dataset within 200 s, whereas affinity propagation takes 37 h and LDMGI takes 17 h for 40,000 points.

Fig. 2. Runtime comparison of RCC-DR with AP and LDMGI. Runtime is evaluated as a function of dataset size, using randomly sampled subsets of different sizes from the MNIST dataset.

Visualization. We now qualitatively analyze the output of RCC by visualization, using the MNIST dataset. On this dataset, RCC identifies 17 clusters. Nine of these are large clusters with more than 6,000 instances each. The remaining 8 are small clusters that encapsulate outlying data points: seven of these contain between 2 and 11 instances, and one contains 148 instances. Fig. 3A shows 10 randomly sampled data points $\mathbf{x}_i$ from each of the large clusters discovered by RCC. Their corresponding representatives $\mathbf{u}_i$ are shown in Fig. 3B. Fig. 3C shows 2 randomly sampled data points from each of the small outlying clusters. Additional visualization of RCC output on the COIL-100 dataset is shown in Fig. S3.

Fig. 3. Visualization of RCC output on the MNIST dataset. (A) Ten randomly sampled instances $\mathbf{x}_i$ from each large cluster discovered by RCC, one cluster per row. (B) Corresponding representatives $\mathbf{u}_i$ from the learned representation $\mathbf{U}$. (C) Two random samples from each of the small outlying clusters discovered by RCC.

Fig. 4 compares the representation $\mathbf{U}$ learned by RCC to the representations learned by the best-performing prior algorithms, LDMGI and N-Cuts. We use the MNIST dataset for this purpose and visualize the output of the algorithms on a subset of 5,000 randomly sampled instances from this dataset. Both of the prior algorithms construct Euclidean representations of the data, which can be visualized by dimensionality reduction. We use t-SNE (23) to visualize the representations discovered by the algorithms. As shown in Fig. 4, the representation discovered by RCC cleanly separates the different clusters by significant margins. In contrast, the prior algorithms fail to discover the structure of the data and leave some of the clusters intermixed.

Fig. 4. (A–C) Visualization of the representations learned by RCC (A) and the best-performing prior algorithms, LDMGI (B) and N-Cuts (C). The algorithms are run on 5,000 randomly sampled instances from the MNIST dataset. The learned representations are visualized using t-SNE.
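For reference, both evaluation measures used above are available in scikit-learn. The snippet below (with toy labelings of our own) illustrates why AMI is preferred: it is invariant to label permutation, and unlike NMI it does not reward a fine-grained over-segmentation. Note that scikit-learn's default normalization is the arithmetic mean of the entropies; passing `average_method='geometric'` matches the square-root normalization of Eq. 18.

```python
from sklearn.metrics import (adjusted_mutual_info_score,
                             normalized_mutual_info_score)

truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
pred  = [1, 1, 1, 0, 0, 0, 2, 2, 2]   # same partition, labels permuted
print(adjusted_mutual_info_score(truth, pred))   # 1.0: permutation-invariant

# A fully fine-grained over-segmentation: NMI stays high, AMI drops to chance
overseg = [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(normalized_mutual_info_score(truth, overseg))
print(adjusted_mutual_info_score(truth, overseg))
```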

Discussion We have presented a clustering algorithm that optimizes a continuous objective based on robust estimation. The objective is optimized using linear least-squares solvers, which scale to large high-dimensional datasets. The robust terms in the objective enable separation of entangled clusters, yielding high accuracy across datasets and domains. The continuous form of the clustering objective allows it to be integrated into end-to-end feature learning pipelines. We have demonstrated this by extending the algorithm to perform joint clustering and dimensionality reduction.

SI Experiments

Datasets. For Reuters-21578, we combine the train and test sets of the Modified Apte split and use only samples from categories with more than five examples. For RCV1, we consider four root categories and a random subset of 10,000 samples. For the text datasets, the graph is constructed on PCA-projected input; the number of PCA components is set to the number of ground-truth clusters. We compute term frequency-inverse document frequency features on the 2,000 most frequently occurring word stems. On YaleB, we consider only the frontal face images and preprocess them using gamma correction and a difference-of-Gaussians (DoG) filter. For YTF, we use all of the video frames from the first 40 subjects, sorted in chronological order. For all image datasets, we scale the pixel intensities to the range $[0, 1]$. For all other datasets, we normalize the features so that $\|\mathbf{x}\|_2^2 \approx D$.

Baselines. For k-means++, GMM, mean shift, AC-Complete, AC-Ward, and AP, we use the implementations in the scikit-learn package. For fuzzy clustering, we use the implementation provided by MATLAB. For N-Cuts, Zell, SEC, LDMGI, GDL, and PIC, we use the publicly available implementations published by the authors of these methods. For all algorithms that use k-nearest-neighbor graphs, we set $k = 10$. Unlike the presented algorithms, many baselines rely on multiple executions with random restarts. To maximize their reported accuracy, we use 10 random restarts for these baselines. Following common practice, for k-means++, GMM, and LDMGI we pick the best result based on the value of the objective function at termination, whereas for fuzzy clustering, N-Cuts, Zell, SEC, GDL, and PIC we take the average across the 10 random restarts. Most of the baselines require setting one or more parameters. For a fair comparison, for each algorithm we tune one major parameter across datasets and use the default values for all other parameters. For all algorithms, the tuned value is selected based on the best average performance across all datasets. Parameter settings for the baselines are summarized in Table S1. The notation (m : s : M) indicates that the parameter search is conducted in the range $(m, M)$ with step $s$.

Additional Accuracy Measure. For completeness, we evaluate the accuracy of RCC, RCC-DR, and all baselines using the NMI measure (51, 52). The results are reported in Table S2.

Results on Gene Expression Datasets. Table S3 lists AMI results on more than 30 cancer gene expression datasets collected by de Souto et al. (53). The maximum number of samples across these datasets is only 248, and for all but one dataset the dimension $D \gg n$. Since these datasets are statistically very different from those discussed earlier, for each algorithm we retune the major parameter over the same range as given in Table S1. For both RCC and RCC-DR, we set $k = 9$. For RCC-DR, we set $d = 12$ and $\gamma = 0.5$. The author-provided code for GDL breaks on these datasets.

Robustness to Hyperparameter Settings. The parameters of the RCC algorithm are set automatically based on the data. The RCC-DR algorithm does have a number of parameters but is largely insensitive to their settings. In the following experiment, we vary the sparse-coding parameters $d$, $\eta$, and $\gamma$ in the ranges $d = (40 : 20 : 200)$, $\eta = (0.55 : 0.05 : 0.95)$, and $\gamma = (0.1 : 0.1 : 0.9)$. Fig. S1 A and B compares the sensitivity of RCC-DR to these parameters with the sensitivity of the best-performing prior algorithms to their key parameters. For each baseline, we use the default search range proposed in its respective paper. The x axis in Fig. S1 corresponds to the parameter index. As Fig. S1 demonstrates, the accuracy of RCC-DR is robust to hyperparameter settings: the relative change of RCC-DR accuracy in AMI on YaleB is 0.005, 0.008, and 0 across the ranges of $d$, $\eta$, and $\gamma$, respectively.
On the other hand, the sensitivity of the baselines is much higher: the relative change in accuracy of SEC, LDMGI, N-Cuts, and GDL is 0.091, 0.049, 0.740, and 0.021, respectively. Moreover, for SEC, LDMGI, and GDL no single parameter setting works best across different datasets.

Fig. S1. (A and B) Robustness to hyperparameter settings on the YaleB (A) and Reuters (B) datasets.

Robustness to Dataset Imbalance. We now evaluate the robustness of different approaches to imbalance in class sizes. This experiment uses the MNIST dataset. We control the degree of imbalance by varying a parameter $s$ between 0.1 and 1. The class "0" is sampled with probability $s$, the class "9" is sampled with probability 1, and the sampling probabilities of the other classes vary linearly between $s$ and 1. For each value of $s$, we sample 10,000 data points and evaluate the accuracy of RCC, RCC-DR, and the top-performing baselines on the resulting dataset. The results are reported in Fig. S2. The RCC and RCC-DR algorithms retain their accuracy advantage on imbalanced datasets.

Fig. S2. Robustness to dataset imbalance.

Visualization. Fig. S3A shows 10 randomly sampled data points $\mathbf{x}_i$ from each of 10 clusters randomly sampled from the clusters discovered by RCC on the COIL-100 dataset. Fig. S3B shows the corresponding representatives $\mathbf{u}_i$.

Fig. S3. Visualization of RCC output on the COIL-100 dataset. (A) Ten randomly sampled instances $\mathbf{x}_i$ from each of 10 randomly sampled clusters discovered by RCC, one cluster per row. (B) Corresponding representatives $\mathbf{u}_i$ from the learned representation $\mathbf{U}$.

Learned Representation. One way to quantitatively evaluate the success of the learned representation $\mathbf{U}$ in capturing the structure of the data is to use it as input to other clustering algorithms and to evaluate whether they are more successful on $\mathbf{U}$ than on the original data $\mathbf{X}$. The results of this experiment are reported in Table S4. Table S4, Left reports the performance of multiple baselines when they are given, as input, the representation $\mathbf{U}$ produced by RCC. Table S4, Right reports the corresponding results when the baselines are given the representation $\mathbf{U}$ produced by RCC-DR.

Table S4. Success of the learned representation $\mathbf{U}$ in capturing the structure of the data, evaluated by running prior clustering algorithms on $\mathbf{U}$ instead of $\mathbf{X}$

The results indicate that the performance of prior clustering algorithms improves significantly when they are run on the representations learned by RCC and RCC-DR. The accuracy improvements for k-means++, AC-Ward, and affinity propagation are particularly notable.

Footnotes

Author contributions: S.A.S. and V.K. designed research, performed research, analyzed data, and wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1700770114/-/DCSupplemental.