
K-Means Data Clustering Using C#

Listing 3: Method UpdateMeans

private static bool UpdateMeans(double[][] data, int[] clustering,
  double[][] means)
{
  int numClusters = means.Length;
  int[] clusterCounts = new int[numClusters];
  for (int i = 0; i < data.Length; ++i)
  {
    int cluster = clustering[i];
    ++clusterCounts[cluster];
  }

  for (int k = 0; k < numClusters; ++k)
    if (clusterCounts[k] == 0)
      return false;

  for (int k = 0; k < means.Length; ++k)
    for (int j = 0; j < means[k].Length; ++j)
      means[k][j] = 0.0;

  for (int i = 0; i < data.Length; ++i)
  {
    int cluster = clustering[i];
    for (int j = 0; j < data[i].Length; ++j)
      means[cluster][j] += data[i][j]; // accumulate sum
  }

  for (int k = 0; k < means.Length; ++k)
    for (int j = 0; j < means[k].Length; ++j)
      means[k][j] /= clusterCounts[k]; // danger of div by 0

  return true;
}

One of the potential pitfalls of the k-means algorithm is that all clusters must have at least one tuple assigned at all times. The first few lines of UpdateMeans scan the clustering input array parameter and count the number of tuples assigned to each cluster. If any cluster has no tuples assigned, the method exits and returns false. This is a fairly expensive operation and can be omitted if the methods that initialize the clustering and update the clustering both guarantee that there are no zero-count clusters.

Notice that the means matrix is effectively used as a ref-style parameter: because C# arrays are reference types, the new means stored into the parameter are visible to the caller even without the ref keyword. You might still want to label the means parameter with the ref keyword to make the in-place update explicit.

In method Cluster, for convenience, the means matrix is allocated using helper method Allocate:

private static double[][] Allocate(int numClusters, int numColumns)
{
  double[][] result = new double[numClusters][];
  for (int k = 0; k < numClusters; ++k)
    result[k] = new double[numColumns];
  return result;
}
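
A typical call, assuming every data tuple has the same number of columns, is double[][] means = Allocate(numClusters, data[0].Length); which creates a numClusters-by-numColumns matrix with every cell implicitly initialized to 0.0. (The exact call inside Cluster may differ slightly; this is just representative usage.)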

Updating the Clustering

In each iteration of the Cluster method, after new cluster means have been computed, the cluster membership of each data tuple is updated in method UpdateClustering. Method UpdateClustering is presented in Listing 4.

Listing 4: The UpdateClustering Method

private static bool UpdateClustering(double[][] data, int[] clustering,
  double[][] means)
{
  int numClusters = means.Length;
  bool changed = false;

  int[] newClustering = new int[clustering.Length];
  Array.Copy(clustering, newClustering, clustering.Length);

  double[] distances = new double[numClusters];

  for (int i = 0; i < data.Length; ++i)
  {
    for (int k = 0; k < numClusters; ++k)
      distances[k] = Distance(data[i], means[k]);

    int newClusterID = MinIndex(distances);
    if (newClusterID != newClustering[i])
    {
      changed = true;
      newClustering[i] = newClusterID;
    }
  }

  if (changed == false)
    return false;

  int[] clusterCounts = new int[numClusters];
  for (int i = 0; i < data.Length; ++i)
  {
    int cluster = newClustering[i];
    ++clusterCounts[cluster];
  }

  for (int k = 0; k < numClusters; ++k)
    if (clusterCounts[k] == 0)
      return false;

  Array.Copy(newClustering, clustering, newClustering.Length);
  return true; // no zero-counts and at least one change
}

Method UpdateClustering uses the idea of the distance between a data tuple and a cluster mean. The Euclidean distance between two vectors is the square root of the sum of the squared differences between corresponding component values. For example, suppose some data tuple d0 = {68, 140} and three cluster means are c0 = {66.0, 120.0}, c1 = {69.0, 160.0}, and c2 = {70.0, 130.0}. (Note that I'm using raw, un-normalized data for demonstration purposes only.) The distance between d0 and c0 = sqrt((68 - 66.0)^2 + (140 - 120.0)^2) = 20.10. The distance between d0 and c1 = sqrt((68 - 69.0)^2 + (140 - 160.0)^2) = 20.02. And the distance between d0 and c2 = sqrt((68 - 70.0)^2 + (140 - 130.0)^2) = 10.20. The data tuple is closest to mean c2, and so would be assigned to cluster 2.

Method Distance is defined as:

private static double Distance(double[] tuple, double[] mean)
{
  double sumSquaredDiffs = 0.0;
  for (int j = 0; j < tuple.Length; ++j)
    sumSquaredDiffs += Math.Pow((tuple[j] - mean[j]), 2);
  return Math.Sqrt(sumSquaredDiffs);
}
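
To check the worked example above, a short standalone snippet (not part of the demo program) could call Distance directly:

double[] d0 = new double[] { 68, 140 };
double[] c0 = new double[] { 66.0, 120.0 };
double[] c1 = new double[] { 69.0, 160.0 };
double[] c2 = new double[] { 70.0, 130.0 };
Console.WriteLine(Distance(d0, c0).ToString("F2")); // 20.10
Console.WriteLine(Distance(d0, c1).ToString("F2")); // 20.02
Console.WriteLine(Distance(d0, c2).ToString("F2")); // 10.20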

Method UpdateClustering scans each data tuple, computes the distances from the current tuple to each of the cluster means, and then assigns the tuple to the cluster with the closest mean using helper method MinIndex, defined as:

private static int MinIndex(double[] distances)
{
  int indexOfMin = 0;
  double smallDist = distances[0];
  for (int k = 0; k < distances.Length; ++k)
  {
    if (distances[k] < smallDist)
    {
      smallDist = distances[k];
      indexOfMin = k;
    }
  }
  return indexOfMin;
}
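
Continuing the worked example, MinIndex(new double[] { 20.10, 20.02, 10.20 }) returns 2, so tuple d0 would be assigned to cluster 2. Because the comparison is a strict less-than, if two distances tie exactly, MinIndex returns the smaller index.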

Method UpdateClustering computes a proposed new clustering into local array newClustering, tracking as it goes whether any tuple's cluster membership changes. If there are no changes, the method exits immediately and returns false. Otherwise, the method counts the number of tuples assigned to each cluster in the proposed clustering. If any cluster would have no data tuples assigned, UpdateClustering exits and returns false. If the proposed clustering passes both checks, it's copied into the clustering parameter and the method returns true. Note that, like means in UpdateMeans, parameter clustering acts as a ref-style parameter, and you may want to label it explicitly.

Helper Display Methods

The demo program has three helper methods to display data. The methods are presented in Listing 5. Method ShowData displays a matrix of type double to the console. Method ShowVector displays an array of type int to the console. And method ShowClustered displays a matrix of type double, grouped by cluster membership.

Listing 5: Helper Display Methods

static void ShowData(double[][] data, int decimals,
  bool indices, bool newLine)
{
  for (int i = 0; i < data.Length; ++i)
  {
    if (indices)
      Console.Write(i.ToString().PadLeft(3) + " ");
    for (int j = 0; j < data[i].Length; ++j)
    {
      if (data[i][j] >= 0.0)
        Console.Write(" "); // leading space aligns with negative values
      Console.Write(data[i][j].ToString("F" + decimals) + " ");
    }
    Console.WriteLine("");
  }
  if (newLine)
    Console.WriteLine("");
}

static void ShowVector(int[] vector, bool newLine)
{
  for (int i = 0; i < vector.Length; ++i)
    Console.Write(vector[i] + " ");
  if (newLine)
    Console.WriteLine("\n");
}

static void ShowClustered(double[][] data, int[] clustering,
  int numClusters, int decimals)
{
  for (int k = 0; k < numClusters; ++k)
  {
    Console.WriteLine("===================");
    for (int i = 0; i < data.Length; ++i)
    {
      int clusterID = clustering[i];
      if (clusterID != k) continue;
      Console.Write(i.ToString().PadLeft(3) + " ");
      for (int j = 0; j < data[i].Length; ++j)
      {
        if (data[i][j] >= 0.0)
          Console.Write(" ");
        Console.Write(data[i][j].ToString("F" + decimals) + " ");
      }
      Console.WriteLine("");
    }
    Console.WriteLine("===================");
  } // k
}

Modifications and Extensions

The code presented here can be modified and extended in several ways. In some situations, you may be more interested in the final cluster means than in the clustering membership. In those situations, you can refactor method Cluster so that instead of returning an int array representing cluster membership, you return the means of the final clusters.
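
Here's a minimal sketch of one way to do this, assuming Cluster has the signature int[] Cluster(double[][] data, int numClusters) and using the Allocate and UpdateMeans methods shown earlier; the wrapper name ClusterMeans is mine:

// Hypothetical wrapper: returns final cluster means instead of membership.
public static double[][] ClusterMeans(double[][] data, int numClusters)
{
  int[] clustering = Cluster(data, numClusters); // run k-means as before
  double[][] means = Allocate(numClusters, data[0].Length);
  UpdateMeans(data, clustering, means); // recompute means from final clustering
  return means; // UpdateMeans can't return false for a valid final clustering
}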

The UpdateClustering method uses Euclidean distance to assign cluster membership. Because component differences are squared, Euclidean distance heavily penalizes outlier data tuples. In some situations, an alternative distance metric, such as the Manhattan distance, might be preferable.
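
A Manhattan distance version is a simple drop-in replacement for the Distance method. This sketch, including the method name, is mine, not part of the demo program:

// Manhattan (city-block) distance: sum of absolute component differences.
private static double ManhattanDistance(double[] tuple, double[] mean)
{
  double sumAbsDiffs = 0.0;
  for (int j = 0; j < tuple.Length; ++j)
    sumAbsDiffs += Math.Abs(tuple[j] - mean[j]);
  return sumAbsDiffs;
}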

With regard to performance, the UpdateClustering method is usually the most time-consuming part of the k-means algorithm. If you refer to Figure 5, you'll notice that each data tuple can be processed independently, so using the C# Parallel.For loop feature, available in .NET 4.0 and later, can improve performance dramatically in many cases.
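
For example, the assignment loop in UpdateClustering could be parallelized along these lines. This is a sketch, assuming a using System.Threading.Tasks; directive; change-detection moves to a separate sequential pass so the parallel loop body writes only to its own slot of newClustering:

// Each iteration reads data and means and writes only newClustering[i],
// so iterations are independent and can run in parallel.
Parallel.For(0, data.Length, i =>
{
  double[] distances = new double[numClusters]; // per-iteration buffer
  for (int k = 0; k < numClusters; ++k)
    distances[k] = Distance(data[i], means[k]);
  newClustering[i] = MinIndex(distances);
});

// Determine whether any membership changed in a sequential pass.
bool changed = false;
for (int i = 0; i < data.Length; ++i)
  if (newClustering[i] != clustering[i]) { changed = true; break; }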

The final clustering produced by the k-means algorithm depends on how clusters are initialized. It's unlikely, but possible, that k-means will generate a poor clustering. One approach for dealing with this is to modify method Cluster so that it returns a value that represents how well the data's been clustered, such as the average distance between data tuples within clusters (smaller values are better), or the average distance between cluster means (larger values are better). Then you can run k-means several times, with different initializations, and use the best clustering found.
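
As one concrete possibility, a quality measure based on average within-cluster distance might look like the following sketch. The method name is mine, and it assumes the Distance helper shown earlier:

// Average distance between each tuple and its assigned cluster mean.
// Smaller values indicate a tighter, better clustering.
private static double AvgDistanceToMeans(double[][] data,
  int[] clustering, double[][] means)
{
  double sum = 0.0;
  for (int i = 0; i < data.Length; ++i)
    sum += Distance(data[i], means[clustering[i]]);
  return sum / data.Length;
}

A driver program could then call Cluster several times, score each result with AvgDistanceToMeans, and keep the clustering that produces the smallest score.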