K-Means Clustering


K-Means is an interesting way of identifying clusters in a given data set. Conceptually, we can think of the process as follows:

Suppose we want to identify K clusters in the data set. We start by picking K separate points in the data space. Treating these K points as centroids, we identify K clusters: any point in the data space belongs to the cluster defined by the closest centroid. Based on these clusters, we then compute new centroids, and identify the clusters again. If we repeat this iteratively, we should eventually reach a state where the K centroids barely move from one iteration to the next. The clusters at that point are the ones identified by the K-Means algorithm.
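To make that loop concrete, here is a minimal sketch of the procedure in Python with NumPy. The function name k_means, the random choice of starting centroids, and the iteration cap are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def k_means(points, k, n_iters=100, seed=0):
    """Minimal K-Means sketch: pick K starting centroids, then alternate
    between assigning points to the nearest centroid and recomputing
    the centroids, until they stop moving."""
    rng = np.random.default_rng(seed)
    # Start with K points chosen at random from the data as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign every point to the cluster of its closest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recompute each centroid as the mean of the points now assigned to it
        # (keeping the old centroid if a cluster happens to end up empty).
        new_centroids = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Stop once the centroids no longer move much.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

Calling k_means(data, 3) on a 2-D array of points would return the three centroids along with a cluster label for each point.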

Of course, there are several important factors here. How do we choose K? If K is very large, we might end up with a cluster for every data point, which defeats the purpose of clustering. On the other hand, if K = 1, we get one big cluster covering the entire data set, which again defeats the purpose. Ideally, we should choose K such that the clusters identified are genuinely distinct from each other - that is, the distance of each point from its own centroid is significantly smaller than the distances between the centroids.

More formally, the sum of squared distances between each centroid and the data points within its cluster is the metric used to evaluate the model. As the value of K increases, this sum of squared distances decreases at every step. But the decrease is very rapid for the initial values of K and then slows down; the point where it levels off suggests the right value of K.
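As a sketch of this heuristic, the snippet below fits scikit-learn's KMeans for a range of K values and prints the sum of squared distances (which scikit-learn exposes as inertia_). The synthetic data set with three blobs is only an illustrative assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data set: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
                  for c in ((0, 0), (5, 5), (0, 5))])

# Fit K-Means for increasing K and record the sum of squared distances
# of the points to their nearest centroid.
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    print(k, round(model.inertia_, 1))

# The sum drops sharply until K reaches 3 (the true number of blobs here)
# and then flattens out - that bend suggests the right K.
```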

In real life, it is very difficult to know such details before we start. But we still have use cases for K-Means - in cases where we want to break the data into a predefined number of clusters. Otherwise, if we have enough computational power, we can try various numbers of clusters, iteratively trying to reduce the mismatch. That is certainly not the best way out.

We have other, more effective algorithms, but they are not as simple.