The reason this is called a “mixture model” is that the probability at each data point is the sum of a mixture of several distributions. At the end of the process, each point is tied to the various clusters with higher or lower probability. This is sometimes called soft clustering, because points are not uniquely identified with a single cluster.
One consequence of this method is that some points may have high probabilities of being in more than one cluster. Other points may have only very low probabilities of being in any cluster. Each point can be assigned to the cluster where its probability is highest, turning this soft clustering into hard clustering.
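To make this concrete, the following sketch (in Python, using two hypothetical one-dimensional Gaussian components whose means, standard deviations, and mixing weights are invented for this example) computes each component's responsibility for a point and then turns the soft assignment into a hard one by picking the most probable component.

```python
import math

# Two hypothetical one-dimensional Gaussian components: (mean, std dev, mixing weight)
components = [(5.0, 2.0, 0.5), (15.0, 3.0, 0.5)]

def gaussian_density(x, mean, std):
    """Normal density at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def responsibilities(x):
    """Soft clustering: the probability that each component is responsible for x."""
    weighted = [w * gaussian_density(x, m, s) for m, s, w in components]
    total = sum(weighted)
    return [p / total for p in weighted]

def hard_assignment(x):
    """Hard clustering: the index of the component with the highest responsibility."""
    resp = responsibilities(x)
    return resp.index(max(resp))

for point in (4.0, 9.5, 16.0):
    print(point, [round(r, 3) for r in responsibilities(point)], hard_assignment(point))
```

A point near one mean gets a responsibility close to 1 for that component, while a point between the two means is shared more evenly before the hard assignment forces a choice.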
Figure 11.9 Each Gaussian mean is moved to the centroid of all the data points weighted by its responsibilities for each point. Thicker arrows indicate higher weights.
Agglomerative Clustering
The K-means approach to clustering starts out with a fixed number of clusters and allocates all records into exactly that number of clusters. Another class of methods works by agglomeration. These methods start out with each data point forming its own cluster and gradually merge them into larger and larger clusters until all points have been gathered together into one big cluster. Toward the beginning of the process, the clusters are very small and very pure—the members of each cluster are few and closely related. Toward the end of the process, the clusters are large and not as well defined. The entire history is preserved, making it possible to choose the level of clustering that works best for a given application.
An Agglomerative Clustering Algorithm
The first step is to create a similarity matrix. The similarity matrix is a table of all the pair-wise distances or degrees of similarity between clusters. Initially, the similarity matrix contains the pair-wise distance between individual pairs of records. As discussed earlier, there are many measures of similarity between records, including the Euclidean distance, the angle between vectors, and the ratio of matching to nonmatching categorical fields. The issues raised by the choice of distance measures are exactly the same as those previously discussed in relation to the K-means approach.
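As a rough illustration, the following Python sketch shows one possible coding of the three similarity measures mentioned above. The function names are invented for this example, the records are assumed to be equal-length sequences, and the categorical measure is expressed as a mismatch fraction so that, like the other two, smaller values mean greater similarity.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def angle_between(a, b):
    """Angle (in radians) between two records viewed as vectors from the origin."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return math.acos(dot / (norm_a * norm_b))

def categorical_mismatch(a, b):
    """Fraction of categorical fields that fail to match."""
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)
```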
It might seem that with N initial clusters for N data points, N² measurement calculations are required to create the distance table. If the similarity measure is a true distance metric, only half that number is needed, because all true distance metrics follow the rule that Distance(X, Y) = Distance(Y, X). In the vocabulary of mathematics, the similarity matrix is lower triangular.
The next step is to find the smallest value in the similarity matrix. This identifies the two clusters that are most similar to one another. Merge these two clusters into a new one and update the similarity matrix by replacing the two rows that described the parent clusters with a new row that describes the distance between the merged cluster and the remaining clusters. There are now N – 1 clusters and N – 1 rows in the similarity matrix.
Repeat the merge step N – 1 times, until all records belong to one large cluster. At each iteration, remember which clusters were merged and the distance between them. This information is used later to decide which level of clustering to use.
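The following Python sketch outlines the merge loop just described. For clarity it recomputes the pair-wise cluster distances on every pass rather than maintaining and updating a similarity matrix, and the cluster_distance argument stands in for whichever measure of distance between clusters is chosen (a subject taken up in the next section).

```python
def agglomerate(records, cluster_distance):
    """Repeatedly merge the two closest clusters, recording every merge.

    For clarity this sketch recomputes all pair-wise cluster distances on each
    pass instead of maintaining and updating a similarity matrix.
    """
    clusters = [[record] for record in records]   # one cluster per record to start
    history = []                                  # (cluster, cluster, distance) per merge
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest distance between them.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Replace the two parent clusters with the merged cluster.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history
```

The returned history records every merge and its distance, which is exactly the information needed to choose a level of clustering after the fact.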
Distance between Clusters
A bit more needs to be said about how to measure distance between clusters.
On the first trip through the merge step, the clusters consist of single records, so the distance between clusters is the same as the distance between records, a subject that has already been covered in perhaps too much detail. Second and
subsequent trips through the loop need to update the similarity matrix with the distances from the new, multirecord cluster to all the others. How do we measure this distance?
As usual, there is a choice of approaches. Three common ones are:
■ Single linkage
■ Complete linkage
■ Centroid distance
In the single linkage method, the distance between two clusters is given by the distance between the closest members. This method produces clusters with the property that every member of a cluster is more closely related to at least one member of its cluster than to any point outside it.
Another approach is the complete linkage method, where the distance between two clusters is given by the distance between their most distant members. This method produces clusters with the property that all members lie within some known maximum distance of one another.
The third method is centroid distance, where the distance between two clusters is measured between the centroids of each. The centroid of a cluster is its average element. Figure 11.10 gives a pictorial representation of these three methods.
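Under the assumption that the records are numeric vectors, the three measures might be coded as follows. The function names are invented for this sketch, and record_distance stands for whatever record-level distance measure has been chosen.

```python
def single_linkage(cluster_a, cluster_b, record_distance):
    """Distance between the closest pair of members, one drawn from each cluster."""
    return min(record_distance(a, b) for a in cluster_a for b in cluster_b)

def complete_linkage(cluster_a, cluster_b, record_distance):
    """Distance between the most distant pair of members."""
    return max(record_distance(a, b) for a in cluster_a for b in cluster_b)

def centroid_distance(cluster_a, cluster_b, record_distance):
    """Distance between the centroids (average elements) of the two clusters."""
    def centroid(cluster):
        return [sum(values) / len(cluster) for values in zip(*cluster)]
    return record_distance(centroid(cluster_a), centroid(cluster_b))

# For example, plugging single linkage into the agglomerate() sketch shown earlier:
# history = agglomerate(records, lambda a, b: single_linkage(a, b, euclidean_distance))
```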
Figure 11.10 Three methods of measuring the distance between clusters.
Clusters and Trees
The agglomeration algorithm creates hierarchical clusters. At each level in the hierarchy, clusters are formed from the union of two clusters at the next level down. A good way of visualizing these clusters is as a tree. Of course, such a tree may look like the decision trees discussed in Chapter 6, but there are some important differences. The most important is that the nodes of the cluster tree do not embed rules describing why the clustering takes place; the nodes simply state the fact that the two children have the minimum distance of all possible cluster pairs. Another difference is that a decision tree is created to maximize the leaf purity of a given target variable. There is no target for the cluster tree, other than self-similarity within each cluster. Later in this chapter we'll discuss divisive clustering methods. These are similar to the agglomerative methods, except that agglomerative methods are built by starting from the leaves and working toward the root, whereas divisive methods start at the root and work down to the leaves.
Clustering People by Age: An Example of Agglomerative Clustering
This illustration of agglomerative clustering uses an example in one dimension with the single linkage measure for distance between clusters. These choices make it possible to follow the algorithm through all its iterations without having to worry about calculating distances using squares and square roots.
The data consists of the ages of people at a family gathering. The goal is to cluster the participants using their age, and the metric for the distance between two people is simply the difference in their ages. The metric for the distance between two clusters of people is the difference in age between the oldest member of the younger cluster and the youngest member of the older cluster (the one-dimensional version of the single linkage measure).
Because the distances are so easy to calculate, the example dispenses with the similarity matrix. The procedure is to sort the participants by age, then begin clustering by first merging clusters that are 1 year apart, then 2 years, and so on until there is only one big cluster.
Figure 11.11 shows the clusters after six iterations, with three clusters remaining. This is the level of clustering that seems the most useful. The algorithm appears to have clustered the population into three generations: children, parents, and grandparents.
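The following Python sketch follows the procedure just described, using the ages shown in Figure 11.11 and stopping when three clusters remain. It is an illustration of the one-dimensional single linkage procedure rather than a general implementation.

```python
# Ages from Figure 11.11, clustered in one dimension with single linkage.
ages = sorted([1, 3, 5, 8, 9, 11, 12, 13, 37, 43, 45, 49, 51, 65])

clusters = [[age] for age in ages]   # start with each person in a cluster of one

# Merge clusters that are 1 year apart, then 2 years, and so on,
# stopping at the level shown in Figure 11.11 (three clusters).
distance = 1
while len(clusters) > 3:
    merged = [clusters[0]]
    for cluster in clusters[1:]:
        # Gap between the youngest member of this cluster and the oldest
        # member of the previous one: single linkage in one dimension.
        if cluster[0] - merged[-1][-1] <= distance:
            merged[-1] = merged[-1] + cluster
        else:
            merged.append(cluster)
    clusters = merged
    distance += 1

print(clusters)
# [[1, 3, 5, 8, 9, 11, 12, 13], [37, 43, 45, 49, 51], [65]]
```

The three clusters that remain at distance 6 correspond to the children, the parents, and the lone grandparent.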
Figure 11.11 An example of agglomerative clustering (ages 1, 3, 5, 8, 9, 11, 12, 13, 37, 43, 45, 49, 51, and 65, with merges shown at distances 1 through 6).
Divisive Clustering
We have already noted some similarities between trees formed by the agglomerative clustering techniques and ones formed by decision tree algorithms.
Although the agglomerative methods work from the leaves to the root, while the decision tree algorithms work from the root to the leaves, they both create a similar hierarchical structure. The hierarchical structure reflects another similarity between the methods. Decisions made early on in the process are never revisited, which means that some fairly simple clusters may not be detected if an early split or agglomeration destroys the structure.
Seeing the similarity between the trees produced by the two methods, it is natural to ask whether the algorithms used for decision trees may also be used for clustering. The answer is yes. A decision tree algorithm starts with the entire collection of records and looks for a way to split it into partitions that are purer, in some sense defined by a purity function. In the standard decision tree algorithms, the purity function uses a separate variable—the target variable—to make this decision. All that is required to turn decision trees into a clustering
algorithm is to supply a purity function chosen to either minimize the average intracluster distance or maximize the intercluster distances. An example of such a purity function is the average distance from the centroid of the parent.
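One reading of that purity function is sketched below in Python; the split_score helper and its size-weighted combination of the two children are assumptions made for illustration, not a prescription from the text.

```python
import math

def centroid(records):
    """Average element of a collection of numeric records."""
    return [sum(values) / len(records) for values in zip(*records)]

def average_distance_from_centroid(records):
    """Average distance of the records from their own centroid: lower means purer."""
    c = centroid(records)
    return sum(math.dist(r, c) for r in records) / len(records)

def split_score(left, right):
    """Score a candidate split as the size-weighted average purity of the children.

    A divisive algorithm would try many candidate splits of the parent cluster
    and keep the one with the lowest score.
    """
    total = len(left) + len(right)
    return (len(left) * average_distance_from_centroid(left) +
            len(right) * average_distance_from_centroid(right)) / total
```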