Unsupervised learning, as the name suggests, is about learning without supervision. What does it mean to learn without supervision? What can you learn without supervision? As we saw in the regression, the supervision comes from a part of the input data set - that is required to ascertain the correctness of your hypothesis. The input data has the questions as well as correct answers - the machine just needs to map them.
Unsupervised learning is quite different from this. Here, we do not have any answers. No questions either! Then what do we have? We have raw data and we have no idea about its structure. There is nothing to learn from. Unsupervised learning is just making sense of the data in hand. This is not just a theoretical fantacy. Unsupervised learning has a great application for analyzing data that we know nothing about.
When faced with such a situation, of huge chunk of unknown data, that natural tendency is to categorize it into different parts based on the parameters we already know. Then, we can identify the tendencies of each of these clusters so that we can get some meaningful mapping of the data. If your parameters are wide enough, your predictions will be correct too.
Clustering is one of the most important problem solved using unsupervised learning. It is used to discover the inherent groupings in the input data. For example: Grouping companies based on their revenues or grouping countries based on their GDP, etc. Since it is a form of unsupervised learning, there is no feedback or verification while learning. The grouping is purely based on the values associated with the input samples.
In theory, data points that are in the same group should have similar properties and/or features, while data points in different groups should have highly dissimilar properties and/or features. Such clustering is a common technique for statistical data analysis used in many fields.
Clustering and Unsupervised Learning in general forms an important part of any real life AI development. Most of the data inputs for today's AI development is from the internet and mobile devices. This is raw data fetched out of the chaos called internet. Not everything is useful for the particular problem. The available data has to be assorted to identify what could be useful. Clustering is one of the primary approaches of doing this.
There are many different algorithms used for clustering. Principally, there are three main types of algorithms for doing this.
In agglomerative hierarchical algorithms, we start by defining each data point as a cluster. Then, the two closest clusters are combined into a new cluster. In each subsequent step, two existing clusters are merged into a single cluster. In divisive hierarchical algorithms, we start by putting all data points into a single cluster. Then we divide this cluster into two clusters. At each subsequent step, we divide an existing cluster into two clusters. Agglomerative methods are used much more often than divisive methods. Hierarchical methods can be adapted to cluster variables rather than observations. This is a common use for hierarchical methods.
There are different ways we can connect two points in the hierarchy. Each has its own nuances.
In centroid method, the distance between two clusters is the distance between the two mean vectors of the clusters. At each stage of the process we combine the two clusters that have the smallest centroid distance.
This method does not directly define a measure of distance between two points or clusters. It is an ANOVA based approach. One-way univariate ANOVAs are done for each variable with groups defined by the clusters at that stage of the process. At each stage, two clusters merge that provide the smallest increase in the combined error sum of squares.
Any clustering is meaningful in the context of a given measure of association or a measure of distance. There are several definitions of the association or distance between two points.
d(X1, X2) = (sum((X1(i) - X2(i))^2))^(1/2)
d(X1, X2) = (sum((X1(i) - X2(i))^n))^(1/n)
d(X1,X2) = sum(abs(X1(i) - X2(i)) / (abs(X1(i)) + abs(X2(i))))
d(X1,X2) = 1 - 2 * sum(abs(X1(i) - X2(i))) / sum(abs(X1(i)) + abs(X2(i)))
1. Symmetry: Distance between point X1 and X2 should be same as distance between point X2 and X1.
d(X1, X2) = d(X2, X1)
2. Positivity: Distance between two distinct points X1 and X2 should be greater than 0
d(X1, X2) > 0 if X1 != X2
3. Identity: Distance of a point with itself should be 0
d(X, X) = 0
4. Triangle Inequality: The generic triangular inequality should also hold for the new measure of distance
d(X1, X2) + d(X2, X3) >= d(X1, X3)
In a non-hierarchical method, the data are initially partitioned into a set of K clusters. This may be a random partition or a partition based on a first "good" guess at seed points which form the initial centers of the clusters. Then data points are iteratively moved into different clusters until there is no sensible reassignment possible. The initial number of clusters (K) may be specified by the user or by the software algorithm. The most commonly used non-hierarchical method is MacQueen's K-means method.
This has some advantages and some disadvantages compared to the Hierarchical algorithms. The main question here is guessing the right value for k. If there are k+1 distinct clusters and we start off with k, we will never have a good result. However, if the cluster chunks are not so distinct, a non-hierarchical cluster can help in segregating the data rather quickly.
A model based method uses a mixture model to specify the density function of the x-variables. In a mixture model, a population is modeled as a mixture of different subpopulations, each with the same general form for its probability density function and possibly different values for parameters, such as the mean vector. For instance, the model may be a mixture of multivariate normal distributions. In cluster analysis, the algorithm provides a partition of the dataset that maximizes the likelihood function as defined by the mixture model. Examples are Affinity Propagation, MeanShift, DBSCAN.