Classification with KNN

Classification is an important subject in supervised learning. A large share of machine learning applications deals with categorical outputs, often binary ones, which call for classification rather than regression. The K-Nearest Neighbors (KNN) classifier is one of many classification algorithms.

Like many machine learning algorithms, KNN is inspired by a tendency of the human mind: to go along with the crowd. Conceptually, KNN looks at the known points around the query point and predicts that its outcome is similar to theirs. More precisely, for any new point, it finds the K training points that are closest under the chosen distance metric.

Once these neighbors are identified, the outcome of each one is looked up in the training set, and the outcome of the new point is decided by majority vote among them. For example, if we look at the 5 nearest neighbors of a given test point and 3 of them are positive while 2 are negative, the outcome is predicted as positive, since that is the majority.
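The voting procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function name knn_predict, the toy data, and the choice of Euclidean distance are all assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Predict the label of `query` by a vote among its k nearest training points."""
    # Pair each training label with the Euclidean distance from its point to the query.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Sort by distance only, then keep the labels of the k closest points.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among the neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy training set (made up for illustration): three positive and three negative points.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_labels = ["positive", "positive", "positive", "negative", "negative", "negative"]

# The 5 nearest neighbors of this query are 3 positive and 2 negative points.
prediction = knn_predict(train_points, train_labels, (2.5, 2.5), k=5)
print(prediction)
```

With this toy data the 5 nearest neighbors split 3 to 2, so the query is labeled positive, mirroring the example in the text. Choosing an odd K for two-class problems avoids ties in the vote.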

The KNN classifier is a good tool for classification when the number of data points and the number of features are manageable; otherwise the computation can get very expensive, since the basic algorithm compares each query against the stored training points. The classifier's accuracy rests on the assumption that similar points are geometrically close to each other, which may not always be the case. The choice of distance metric is therefore very important: what is near according to one metric may be far away according to another.
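To see how the metric changes what counts as near, the small sketch below compares two common metrics, Euclidean and Manhattan distance, on two hand-picked candidate points. The points and the names A and B are made up for illustration.

```python
import math

def euclidean(p, q):
    # Straight-line distance.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(p, q))

query = (0.0, 0.0)
candidates = {"A": (3.0, 3.0), "B": (0.0, 5.0)}

# Euclidean: A is about 4.24 away, B is 5.0 away, so A is nearer.
nearest_euclidean = min(candidates, key=lambda name: euclidean(candidates[name], query))
# Manhattan: A is 6.0 away, B is 5.0 away, so B is nearer.
nearest_manhattan = min(candidates, key=lambda name: manhattan(candidates[name], query))

print(nearest_euclidean, nearest_manhattan)
```

The same query has a different nearest neighbor under each metric, so a KNN prediction can flip simply because the metric was changed.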

Consider, for example, a data set in the form of two concentric circles, with the inner circle positive and the outer circle negative. In such a case, devising a new distance metric may help to an extent. But the cost of computation increases very rapidly with the complexity of the distance metric. The cost also increases rapidly with the number of features.
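As a rough sketch of how a custom metric might help in the concentric-circle case, the example below compares plain Euclidean distance with a hypothetical "radial" metric that looks only at how far each point is from the common center. The radial metric, the assumption that the center is at the origin, and the sparse sample points are all invented for this illustration; it is one possible metric, not a general recipe.

```python
import math

def euclidean(p, q):
    return math.dist(p, q)

def radial(p, q):
    # Hypothetical metric: difference in distance from the origin,
    # which is assumed to be the common center of both circles.
    return abs(math.hypot(*p) - math.hypot(*q))

def nearest_label(points, labels, query, metric):
    # 1-nearest-neighbor: return the label of the single closest training point.
    idx = min(range(len(points)), key=lambda i: metric(points[i], query))
    return labels[idx]

# Sparse samples: inner circle (radius 1) positive, outer circle (radius 3) negative.
points = [(1.0, 0.0), (-1.0, 0.0), (0.0, -1.0), (0.0, 3.0), (3.0, 0.0), (-3.0, 0.0)]
labels = ["positive"] * 3 + ["negative"] * 3

# A query at radius 1.8, closer to the inner circle's radius than to the outer one's.
query = (0.0, 1.8)

by_euclidean = nearest_label(points, labels, query, euclidean)
by_radial = nearest_label(points, labels, query, radial)
print(by_euclidean, by_radial)
```

Here the Euclidean nearest neighbor is the outer-circle point (0, 3), while the radial metric matches the query to the inner circle. Note that the radial metric needs a square root per point just like the Euclidean one; more elaborate metrics would add to the per-comparison cost mentioned above.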

Still, KNN is a very elegant and intuitive way of classifying when the data is well behaved, that is, when similar points really do lie close to each other.