K-Means Clustering in Map Reduce

K-Means clustering involve the following logical steps

1) Determine the value of k
2) Determine the initial k centroids
3) Repeat until converge
- Determine membership: Assign each point to the closest centroid
- Update centroid position: Compute new centroid position from assigned members

Determine the value of K
This is basically asking the question of: "How many clusters you are interested to discover ?"
So the answer is specific to the problem domain.

One way is to try different K. At some point, we'll see increasing K doesn't help much to improve the overall quality of clustering. Then that is the right value of K.

Notice that the overall quality of cluster is the average distance from each data point to its associated cluster.


Determine the initial K centroids
We need to pick K centroids to start the algorithm. So one way to pick them is to randomly pick K points from the whole data set.

However, picking a good set of centroids can reduce the number of subsequent iterations and by "good" I mean the K centroid should be as far apart to each other as possible, or even better the initial K centroid is close to the final K centroid. As you can see, choosing the random K points is reasonable but non-optimum.

Another approach is to take a small random sample set from the input data set and do a hierarchical clustering within this smaller set (note that hierarchical clustering is not-scaling to large data set).

We can also partition the space into overlapping region using canopy cluster technique (describe below) and pick the center of each canopy as the initial centroid.

Iteration
Each iteration is implemented as a Map/Reduce job.
The Map task will determine the membership for each point, as well as compute a partial sum of each member points of each cluster. The reducer aggregates all partial sums and compute the update centroid position, and then out them into a shared store (S3 in this case) that can be picked up by the Map Reduce job of next round.



We also need a control program on the client side to determine whether the iteration should end. It do the following ...
kmeans(data) {
initial_centroids = pick(k, data)
upload(data)
writeToS3(initial_centroids)
old_centroids = initial_centroids
while (true){
map_reduce()
new_centroids = readFromS3()
if change(new_centroids, old_centroids) < delta {
break
} else {
old_centroids = new_centroids
}
}
result = readFromS3()
return result
}

Check out this stream