Cluster Assignment: In this step, each data point is assigned to the nearest cluster center.

This step can be carried out for each data point independently.

This can be implemented in the Map function: 4 map jobs are created, one per data subset, and each assigns the points of its subset to the nearest cluster center in parallel with the others (each map job knows the coordinates of the initial cluster centroids).
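A minimal sketch of one such map job in plain Python, assuming points and centroids are tuples of floats and that "nearest" means smallest squared Euclidean distance (the text does not name a distance measure). The names `nearest_centroid` and `map_assign` are illustrative and do not belong to any MapReduce framework.

```python
from typing import Iterable, Iterator, Sequence, Tuple

Point = Tuple[float, ...]


def nearest_centroid(point: Point, centroids: Sequence[Point]) -> int:
    """Return the index of the centroid closest to `point`.

    Squared Euclidean distance is an assumption made for this sketch.
    """
    def sq_dist(c: Point) -> float:
        return sum((p - q) ** 2 for p, q in zip(point, c))

    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def map_assign(subset: Iterable[Point],
               centroids: Sequence[Point]) -> Iterator[Tuple[int, Point]]:
    """One map job: emit (cluster_label, point) for each point in its subset."""
    for point in subset:
        yield nearest_centroid(point, centroids), point
```

Each of the 4 map jobs would call `map_assign` on its own subset with the same list of centroids, so no job needs to see another job's points.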

Once each data point is assigned to a cluster centroid, the map job emits each data point with the assigned cluster label as the key.

Cluster Centroid (Re-) Computation: In this step, the centroid of each cluster is recomputed from the points assigned to that cluster.

This is done in the Reduce function, where each cluster’s data points arrive at the reducer as a collection of all the data points assigned to the cluster (corresponding to the key emitted by the map function).

The reducer recomputes the centroid of each cluster, corresponding to each key.
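The reduce side can be sketched as follows, assuming the centroid is the coordinate-wise mean of the cluster's points, as in standard k-means. `group_by_key` is a hypothetical stand-in for the shuffle phase that a MapReduce framework performs between map and reduce; `reduce_centroid` is likewise an illustrative name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Point = Tuple[float, ...]


def group_by_key(pairs: Iterable[Tuple[int, Point]]) -> Dict[int, List[Point]]:
    """Stand-in for the shuffle: collect all points emitted under the same label."""
    groups: Dict[int, List[Point]] = defaultdict(list)
    for label, point in pairs:
        groups[label].append(point)
    return dict(groups)


def reduce_centroid(label: int, points: List[Point]) -> Tuple[int, Point]:
    """One reduce call: new centroid = coordinate-wise mean of the cluster's points."""
    n = len(points)
    centroid = tuple(sum(coords) / n for coords in zip(*points))
    return label, centroid
```

The reducer receives at least one point per key, because a key exists only if some map job emitted a point under it, so the division by `n` is safe here.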

Steps 1 and 2 above (cluster assignment and centroid recomputation) are repeated until convergence, so the algorithm becomes a chain of map-reduce jobs.
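The chaining can be illustrated with a small single-process driver. This is only a sketch: the stopping rule (no centroid moves more than `tol`), the `max_iters` cap, and the choice to keep the old centroid for a cluster that receives no points are all assumptions, and the function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def one_iteration(subsets: Sequence[Sequence[Point]],
                  centroids: List[Point]) -> List[Point]:
    """One map-reduce job: assign points (map), group by label, average (reduce)."""
    groups: Dict[int, List[Point]] = defaultdict(list)
    for subset in subsets:  # each subset would be handled by its own map job
        for point in subset:
            label = min(range(len(centroids)),
                        key=lambda i: math.dist(point, centroids[i]))
            groups[label].append(point)
    # Assumption: a cluster that received no points keeps its previous centroid.
    new_centroids = list(centroids)
    for label, points in groups.items():
        new_centroids[label] = tuple(sum(c) / len(points) for c in zip(*points))
    return new_centroids


def kmeans_chain(subsets: Sequence[Sequence[Point]],
                 centroids: List[Point],
                 tol: float = 1e-9,
                 max_iters: int = 100) -> List[Point]:
    """Chain map-reduce jobs until no centroid moves more than `tol`."""
    for _ in range(max_iters):
        updated = one_iteration(subsets, centroids)
        if all(math.dist(a, b) <= tol for a, b in zip(centroids, updated)):
            return updated
        centroids = updated
    return centroids
```

The output of one job (the updated centroids) is the input of the next, which is what makes the jobs a chain rather than independent runs.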

The next figures show the map-reduce steps, first for a single iteration and then for the entire algorithm.

