Affinity Propagation


This is one of the most important among the different clustering algorithms. It is also more elaborate than the others. Affinity Propagation, takes the similarity between pairs of data points as input. It considers all data points as potential exemplars. It works by exchanging real valued messages between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. This is not as complicated as it sounds. Let's look at it in detail.

The algorithm requires us to provide two sets of data:

  • Similarities between data points, representing how well-suited a point is to be another one’s exemplar. If there’s no similarity between two points, as in they cannot belong to the same cluster, this similarity can be omitted or set to -Infinity depending on implementation.
  • Preferences, representing each data point’s suitability to be an exemplar. We may have some initial information which points could be favored for this role, and so we can represent it through preferences.

All this information is normally represented through a single matrix. The main diagonal of this matrix represents preferences. Another approach is to keep the similarities between points in the memory. (The latter works better when the data is sparse). The 'exchanging messages between points' is nothing more than matrices manipulation. The algorithm then runs through a number of iterations, until it converges.

Thus, each iteration has two steps:

  • Calculating responsibilities
  • Calculating availabilities.

Responsibility and Availability are two basic components of Affinity Propagation. To understand them, consider the scenario of a group of employees working with an employer. From the team to work, you have to ensure that the employees find the employer better than other employers around. Additionally you have to ensure that the employees get along well. The first is measured in terms of Responsibility while the second is measured in terms of Availability.

In mathematical terms, the Responsibility r(i, k) reflects the accumulated evidence of k's potential for being an exemplar of i - considering the various different possible exemplars for i. On the other hand, Availability - a(i, k) is the accumulated evidence for k's potential of being an exemplar of i - considering the various different points that consider k as an exemplar. Essentially, the "Responsibility" measures k's responsibility to take up i under its hood. On the other hand, "Availability" measures k's availability for the new point i. While working on the classification, the Responsibility message is passed from i to k and Availability message is passed from k to i. That is, i tells k if it thinks k is responsible for i and k replies if it feels that it is available for i.

In order to calculate responsibilities, the algorithm uses original similarities and availabilities calculated in the previous iteration (initially, all availabilities are set to zero). Responsibilities are set to the input similarity between point i and point k as its exemplar, minus the largest of the similarity and availability sum between point i and other candidate exemplars. The logic behind calculating how suitable a point is for an exemplar is that it is favored more if the initial apriori preference was higher, but the responsibility gets lower when there is a similar point that considers itself a good candidate, so there is a ‘competition’ between the two until one is decided in some iteration.

Calculating availabilities, then, uses calculated responsibilities as evidence whether each candidate would make a good exemplar. Availability a(i, k) is set to the self-responsibility r(k, k) plus the sum of the positive responsibilities that candidate exemplar k receives from other points.

Finally, we can have different stopping criteria to terminate the procedure, such as when changes in values fall below some threshold, or the maximum number of iterations is reached. At any point through Affinity Propagation procedure, summing Responsibility (r) and Availability (a) matrices gives us the clustering information we need: for point i, the k with maximum r(i, k) + a(i, k) represents point i’s exemplar. Or, if we just need the set of exemplars, we can scan the main diagonal. If r(i, i) + a(i, i) > 0, point i is an exemplar.

Hyper Parameters


There are two major hyper parameters that are required to make the Affinity Propagation effective.

Damping

Each new iteration of the Responsibility and Availability leads to recalculation of the clusters. Because of this, it is possible that the system gets into an oscillation without any convergence. In order to avoid this problem, the algorithm introduces Damping. This forces inertia on the changes by allocating some weight to the current position of the data point.

Affinity

The measure of affinity is perhaps the most important component in this exercise of clustering. Generally the euclidean affinity is used. But it is possible to push in any other measure of affinity - so long as it is consistent.