How to do better k-Means Clustering initialization in BigQuery ML

Use the Parallel k-Means++ initialization option

BigQuery ML supports unsupervised learning — you can apply the k-Means algorithm to group your data into clusters. As I described in an earlier blog post, you can use clustering to understand your data in order to make data-driven decisions. There is now a better initialization algorithm available for k-Means in BigQuery, and in this blog, I’ll explain what this new initialization algorithm does, and why it is beneficial.

k-Means Clustering in BigQuery

But first, let’s start with how you use k-Means in BigQuery ML. As an example, let’s use the Iowa public dataset of liquor sales in the state and cluster the stores based on their 2018 sales characteristics. First, we pull out data about each store using:

SELECT
  store_number
  , AVG(SAFE_DIVIDE(state_bottle_retail - state_bottle_cost, state_bottle_cost)) AS profitability
  , SUM(bottles_sold) AS bottles_sold
  , SUM(sale_dollars) AS sale_dollars
  , SUM(volume_sold_liters) AS volume_sold
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE EXTRACT(YEAR FROM date) = 2018
GROUP BY store_number

This gives us data that looks like this:

Sales characteristics of liquor stores in Iowa

To cluster this data based on store, we just need to create a model, taking care to exclude the store_number from the list of features:

CREATE OR REPLACE MODEL demos.iowa_store_clusters
OPTIONS(model_type='kmeans', num_clusters=4, standardize_features = true) AS
WITH data AS (
  SELECT
    store_number
    , AVG(SAFE_DIVIDE(state_bottle_retail - state_bottle_cost, state_bottle_cost)) AS profitability
    , SUM(bottles_sold) AS bottles_sold
    , SUM(sale_dollars) AS sale_dollars
    , SUM(volume_sold_liters) AS volume_sold
  FROM `bigquery-public-data.iowa_liquor_sales.sales`
  WHERE EXTRACT(YEAR FROM date) = 2018
  GROUP BY store_number
)
SELECT * EXCEPT(store_number) FROM data

In the OPTIONS, I specify the desired number of clusters and tell the algorithm to standardize the features. This is important because the ranges of our features vary wildly (from 0–1 for profitability to about 10^6 for sale_dollars). BigQuery ML standardizes k-Means input features by default, but there is no harm in being doubly sure!
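To see why standardization matters, here is a minimal Python sketch of z-score standardization (an illustration only, not what BigQuery ML runs internally). It puts features with wildly different ranges on a comparable scale, so no single feature dominates the Euclidean distances that k-Means computes:

```python
def standardize(column):
    """Z-score a feature column: subtract the mean, divide by the
    standard deviation, so the result has mean 0 and unit variance."""
    n = len(column)
    mean = sum(column) / n
    std = (sum((x - mean) ** 2 for x in column) / n) ** 0.5
    return [(x - mean) / std for x in column]

# A sale_dollars-like column and a profitability-like column end up
# on the same scale after standardization, despite raw ranges that
# differ by six orders of magnitude.
dollars = standardize([1e6, 2e6, 3e6])
profit = standardize([0.2, 0.4, 0.6])
```

Without this step, a distance of one dollar in sale_dollars would count the same as a distance of 1.0 (i.e., 100%) in profitability, and the clusters would effectively be determined by the dollar figures alone.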

The evaluation tab in the BigQuery console shows us the generated clusters (your results might be different, for reasons I explain later in this article):

The four types of liquor stores in Iowa

The profitability doesn’t vary between the clusters (perhaps state law mandates prices?). It’s clear that cluster #1 is the high-volume cluster and cluster #4 is the next tier of stores in terms of sales. Clusters #2 and #3 seem to have similar sales volume, but the sales in Cluster #3 consist of smaller bottles (hence, more bottles_sold) and this leads to greater sale_dollars.

This training took about 30 seconds and completed in seven iterations:

Repeating …

Just for grins, let me run the query again. This time, it took about a minute:

and the results are different, and not as good as before:

The reason is that, by default, k-Means seeds the clusters at RANDOM starting points. The quality of the clusters you get depends on those starting points, and in this case, they were not as good as before.
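To make that variability concrete, here is a small Python sketch of Lloyd's k-Means iteration with purely random seeding (a hypothetical illustration, not BigQuery ML's implementation). Re-running it with a different seed can land in a different local minimum, which is exactly what happened across the two training runs above:

```python
import math
import random

def kmeans_random_init(points, k, seed, max_iter=50):
    """Lloyd's k-Means with purely random seeding: pick k data points
    at random, then alternate assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (keep the old centroid if its cluster came up empty).
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Total within-cluster squared error ("inertia"): lower is tighter.
    inertia = sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)
    return inertia, centroids
```

Because the seeds are random, two runs on the same data can converge to different centroids with different inertia, both in iteration count and in final quality.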

Why we need better initialization

Random initialization can cause two problems. First, convergence time becomes unpredictable: if the starting seeds happen to be good (see the green box below), convergence will be fast, but if they happen to be bad (see the red box below), convergence can take much longer.

Convergence time depends on the starting point

Second, k-Means can get caught in a local minimum and never find a good solution (see below, where the top-left cluster gets broken into two and the two bottom-right clusters get merged into a single one):

Bad initialization can also lead to k-Means being trapped in a local minimum

There has been a lot of research into how to distribute the initial seeds so that they cover the input space in a way that leads to rapid convergence and does not leave the algorithm trapped in a local minimum.

k-Means++

BigQuery ML now implements the parallel k-Means++ algorithm. In simple terms, the algorithm first selects one seed point. Then, to choose each subsequent seed, it samples the data points at random, with the probability of a point being chosen growing with its distance from the existing seeds. Where k-Means++ picks just one such sampled point per round (and so requires k passes through the data), parallel k-Means++ oversamples many candidate points per pass in parallel and then reclusters the weighted candidates down to k seeds. The seed points chosen by k-Means++ (and parallel k-Means++) are provably near-optimal.
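As an illustration (BigQuery ML's actual implementation follows the Bahmani et al. paper linked below), here is the serial k-Means++ seeding step in Python — the "D² weighting" that makes points far from the existing seeds more likely to be chosen next:

```python
import random

def kmeans_pp_seeds(points, k, rng):
    """Serial k-Means++ seeding for 2-D points: the first seed is
    uniform at random; each later seed is sampled with probability
    proportional to its squared distance (D^2) to the nearest
    already-chosen seed."""
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        # Squared distance from every point to its nearest seed.
        d2 = [min((px - sx) ** 2 + (py - sy) ** 2 for sx, sy in seeds)
              for px, py in points]
        # Sample one point with probability proportional to d2
        # (inverse-CDF sampling over the cumulative weights).
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                seeds.append(p)
                break
    return seeds
```

The parallel variant replaces this one-seed-per-pass loop with an oversampling pass that picks many candidates at once, then reclusters the weighted candidates down to k seeds.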

The parallel k-Means++ algorithm that you can tell BigQuery ML to use. The algorithm (and screenshot) is from the paper by Bahmani et al.: https://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf

Although initialization takes longer, convergence is usually faster and the results are also repeatably better. Trying it on the Iowa store data:

CREATE OR REPLACE MODEL demos.iowa_store_clusters
OPTIONS(model_type='kmeans',
        num_clusters = 4,
        kmeans_init_method = 'KMEANS++',
        standardize_features = true) AS
WITH data AS (
  SELECT
    store_number
    , AVG(SAFE_DIVIDE(state_bottle_retail - state_bottle_cost, state_bottle_cost)) AS profitability
    , SUM(bottles_sold) AS bottles_sold
    , SUM(sale_dollars) AS sale_dollars
    , SUM(volume_sold_liters) AS volume_sold
  FROM `bigquery-public-data.iowa_liquor_sales.sales`
  WHERE EXTRACT(YEAR FROM date) = 2018
  GROUP BY store_number
)
SELECT * EXCEPT(store_number) FROM data

It really is better

I got convergence in just three iterations:

With k-Means++, convergence took less than half as many iterations.

The resulting clusters are even better than the ones we had before:

The four clusters now

Note how different the clusters are from each other on all of the considered features, and how the two outlier stores have been split off into a separate cluster so that they don't affect the means of the other clusters. This also shows up in the Davies-Bouldin index (0.6 now, compared to 1.1 previously; lower is better).
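For reference, the Davies-Bouldin index rewards clusters that are compact and well separated. Here is a minimal Python sketch of the metric (a hypothetical helper for intuition, not BigQuery ML's code):

```python
import math

def davies_bouldin(points, labels, centroids):
    """Davies-Bouldin index: the average, over clusters, of the worst
    ratio of (scatter of cluster i + scatter of cluster j) to the
    distance between their centroids. Lower is better."""
    k = len(centroids)
    # Scatter: mean distance of each cluster's points to its centroid.
    scatter = []
    for i in range(k):
        members = [p for p, lab in zip(points, labels) if lab == i]
        scatter.append(sum(math.dist(p, centroids[i]) for p in members)
                       / len(members))
    worst = [
        max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k) if j != i)
        for i in range(k)
    ]
    return sum(worst) / k
```

Tighter clusters shrink the numerator and better-separated centroids grow the denominator, so both effects push the index down, which is consistent with the drop from 1.1 to 0.6 after the outlier stores got their own cluster.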

This is also pretty repeatable — when I tried it again, I got similar (not exact, because of random seeds) results:

Similar results when I ran it again.

Bottom line: use kmeans_init_method = 'KMEANS++'. Enjoy!

Jiashang Liu (a PhD student at my alma mater!) implemented this feature while he was an intern at Google working with Jiaxun Wu and Mingge Deng. Interns do amazing work at Google — apply!

Thanks to Jiaxun Wu for pointing me to this improvement in BigQuery ML, and supplying the two initialization figures, and Amir Hormati and Abhishek Kashyap for reviewing this article.