Blog

In our previous blog post on one-dimensional clustering, we used a known distance between two points to cluster the data in one dimension. However, data can be more complicated in many cases and may need to be clustered using multiple dimensions.

In this post, we calculate clusters with the k-means clustering algorithm. To optimize each of k clusters, we utilize recursive CTEs as discussed in the post on Solving the Traveling Salesman Problem.

Suppose you have a mobile game and you’re trying to figure out which of your users should see a new feature first. You decide to break your users into groups based on their spending habits.

Looking at the chart above, we can guess that users can be broken into two clusters based on some of the natural cutoffs pictured. Thus, we will try to use the k-means algorithm to find two clusters.

To implement the algorithm, we need to start with some initial values for each mean. These initial values can be chosen randomly from the existing data, which we can do by creating the following table that samples our data: ​

create table initial_guess as select * from ( select 1.0*total_spend tot1 , 1.0*number_of_purchases num1 from data where random() < 0.5 limit 1 ) a cross join ( select 1.0*total_spend tot2 , 1.0*number_of_purchases num2 from data where random() > 0.5 limit 1 ) b

Now, we need to write a function to calculate the Euclidean distance between each data point and the current iteration’s means. This will determine which cluster (mean) the data point should be assigned to: ​

create or replace function sqr_euclid_dis( tot float, num float, meantot float, meannum float ) returns float as $$ begin return (tot-meantot)*(tot-meantot)+(num-meannum)*(num-meannum); end $$ language plpgsql

Next, we implement a recursive CTE to iteratively calculate the next set of means. We first assign each data point to the nearest cluster (mean), which is taken from our initial values. Then, we calculate the average value of each cluster of points to find means for the next iteration.

Since we want each row of the output to represent one iteration of the k-means algorithm, each row will need to output the iteration number, variables for the first mean and the second mean. We define that as (iter, tot1, num1, tot2, num2) in the recursive CTE. ​

with recursive kmeans(iter, tot1, num1, tot2, num2) as ( select 1, * from initial_guess union all select kmeans.iter + 1 , ( select avg(total_spend) from ( select s.total_spend , s.number_of_purchases from ( select total_spend , number_of_purchases , case when sqr_euclid_dis( total_spend, number_of_purchases , kmeans.tot1, kmeans.num1) < sqr_euclid_dis(total_spend , number_of_purchases, kmeans.tot2, kmeans.num2) then 1 else 2 end as cluster_id from data ) s where cluster_id = 1 ) l ) , ( select avg(number_of_purchases) from ( select s.total_spend , s.number_of_purchases from ( select total_spend , number_of_purchases , case when sqr_euclid_dis( total_spend, number_of_purchases , kmeans.tot1, kmeans.num1) < sqr_euclid_dis(total_spend , number_of_purchases, kmeans.tot2, kmeans.num2) then 1 else 2 end as cluster_id from data ) s where cluster_id = 1 ) l ) , ( select avg(total_spend) from ( select t.total_spend , t.number_of_purchases from ( select total_spend , number_of_purchases , case when sqr_euclid_dis( total_spend, number_of_purchases , kmeans.tot1, kmeans.num1) < sqr_euclid_dis(total_spend , number_of_purchases, kmeans.tot2, kmeans.num2) then 1 else 2 end as cluster_id from data ) t where cluster_id = 2 ) m ) , ( select avg(number_of_purchases) from ( select t.total_spend , t.number_of_purchases from ( select total_spend , number_of_purchases , case when sqr_euclid_dis( total_spend, number_of_purchases , kmeans.tot1, kmeans.num1) < sqr_euclid_dis(total_spend , number_of_purchases, kmeans.tot2, kmeans.num2) then 1 else 2 end as cluster_id from data ) t where cluster_id = 2 ) m ) from kmeans where kmeans.iter < 10 )

In the code above, we generated an anchor step using the initial guess. For each step, we add 1 to the iteration. Then we re-compute variables of the first mean by calculating the Euclidean distance between each data point and the first mean.

After assigning the closest mean (cluster_id) to each data point, we take an average of the data points within the same cluster. Then, we perform those same steps for the second mean. These averages become the new mean for the iteration.

Finally, we query the result from the last iteration: ​

select data.* , case when sqr_euclid_dis(total_spend , number_of_purchases, kmeans.tot1 , kmeans.num1) < sqr_euclid_dis( total_spend, number_of_purchases , kmeans.tot2, kmeans.num2) then 1 else 2 end as cluster_id from data , kmeans where iter = 10

Success! We’ve assigned clusters to the data set using the k-means algorithm. Now you know which cluster of users should see the new feature first!