CHAPTER 11
Automatic Cluster Detection
The data mining techniques described in this book are used to find meaningful patterns in data. These patterns are not always immediately forthcoming.
Sometimes this is because there are no patterns to be found. Other times, the problem is not the lack of patterns, but the excess. The data may contain so much complex structure that even the best data mining techniques are unable to coax out meaningful patterns. When mining such a database for the answer to some specific question, competing explanations tend to cancel each other out. As with radio reception, too many competing signals add up to noise.
Clustering provides a way to learn about the structure of complex data, to break up the cacophony of competing signals into its components.
When human beings try to make sense of complex questions, our natural tendency is to break the subject into smaller pieces, each of which can be explained more simply. If someone were asked to describe the color of trees in the forest, the answer would probably make distinctions between deciduous trees and evergreens, and between winter, spring, summer, and fall. People know enough about woodland flora to predict that, of all the hundreds of variables associated with the forest, season and foliage type, rather than, say, age and height, are the best factors to use for forming clusters of trees that follow similar coloration rules.
Once the proper clusters have been defined, it is often possible to find simple patterns within each cluster. “In winter, deciduous trees have no leaves, so the trees tend to be brown” or “The leaves of deciduous trees change color in the autumn, typically to oranges, reds, and yellows.” In many cases, a very noisy dataset is actually composed of a number of better-behaved clusters. The question is: how can these be found? That is where techniques for automatic cluster detection come in—to help see the forest without getting lost in the trees.
This chapter begins with two examples of the usefulness of clustering—one drawn from astronomy, another from clothing design. It then introduces the K-Means clustering algorithm, which, like the nearest neighbor techniques discussed in Chapter 8, depends on a geometric interpretation of data. The geometric ideas used in K-Means bring up the more general topic of measures of similarity, association, and distance. These distance measures are quite sensitive to variations in how data is represented, so the next topic addressed is data preparation for clustering, with special attention paid to scaling and weighting. K-Means is not the only algorithm in common use for automatic cluster detection. This chapter contains brief discussions of several others: Gaussian mixture models, agglomerative clustering, and divisive clustering. (Another clustering technique, self-organizing maps, is covered in Chapter 7 because self-organizing maps are a form of neural network.) The chapter concludes with a case study in which automatic cluster detection is used to evaluate editorial zones for a major daily newspaper.
Searching for Islands of Simplicity
In Chapter 1, where data mining techniques are classified as directed or undirected, automatic cluster detection is described as a tool for undirected knowledge discovery. In the technical sense, that is true because the automatic cluster detection algorithms themselves are simply finding structure that exists in the data without regard to any particular target variable. Most data mining tasks start out with a preclassified training set, which is used to develop a model capable of scoring or classifying previously unseen records.
In clustering, there is no preclassified data and no distinction between independent and dependent variables. Instead, clustering algorithms search for groups of records—the clusters—composed of records similar to each other.
The algorithms discover these similarities. It is up to the people running the analysis to determine whether similar records represent something of interest to the business—or something inexplicable and perhaps unimportant.
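Although the clustering algorithms themselves are not introduced until later in this chapter, a minimal sketch gives the flavor of this kind of undirected discovery. The sketch below assumes the scikit-learn library is available, and the six toy records are entirely hypothetical; it simply shows an algorithm grouping records by similarity with no target variable in sight.

```python
# A minimal sketch of undirected cluster detection, assuming the
# scikit-learn library; the tiny record table below is hypothetical.
import numpy as np
from sklearn.cluster import KMeans

# Each row is a record: (age in years, annual purchases in dollars).
records = np.array([
    [23,  150], [25,  180], [27,  160],   # younger, lighter spenders
    [52, 1400], [55, 1350], [58, 1500],   # older, heavier spenders
])

# No target variable is supplied; the algorithm groups records
# purely by their similarity to one another.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(records)
print(kmeans.labels_)           # e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)  # one centroid per discovered group
```

Whether the two groups the algorithm finds mean anything to the business is, as noted above, a question the algorithm cannot answer.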
In a broader sense, however, clustering can be a directed activity because clusters are sought for some business purpose. In marketing, clusters formed for a business purpose are usually called “segments,” and customer segmentation is a popular application of clustering.
Automatic cluster detection is a data mining technique that is rarely used in isolation because finding clusters is not often an end in itself. Once clusters have been detected, other methods must be applied in order to figure out what the clusters mean. When clustering is successful, the results can be dramatic: One famous early application of cluster detection led to our current understanding of stellar evolution.
Star Light, Star Bright
Early in the twentieth century, astronomers trying to understand the relationship between the luminosity (brightness) of stars and their temperatures made scatter plots like the one in Figure 11.1. The vertical scale measures luminosity in multiples of the brightness of our own sun. The horizontal scale measures surface temperature in kelvins (degrees centigrade above absolute zero, the theoretical coldest possible temperature).
[Figure: luminosity (Sun = 1, logarithmic scale from 10^-4 to 10^6) plotted against temperature (degrees Kelvin, running from 40,000 down to 2,500), with three labeled clusters: Red Giants, Main Sequence, and White Dwarfs]
Figure 11.1 The Hertzsprung-Russell diagram clusters stars by temperature and luminosity.
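For readers who want to see how a plot of this shape is constructed, the sketch below draws an HR-style diagram, assuming the matplotlib library; the handful of temperature and luminosity values is invented purely to illustrate the axes and is not real survey data.

```python
# A sketch of an HR-style scatter plot, assuming matplotlib.
# The (temperature, luminosity) pairs are hypothetical values
# chosen only to show the log scales and reversed axis.
import matplotlib.pyplot as plt

temps_k = [30000, 15000, 9000, 6000, 4000, 3500, 12000, 8000]
lum_sun = [1e4,   5e2,   50,   1,    0.1,  1e3,  1e-3,  1e-2]

fig, ax = plt.subplots()
ax.scatter(temps_k, lum_sun)
ax.set_xscale("log")
ax.set_yscale("log")
ax.invert_xaxis()  # HR diagrams run from hot (left) to cool (right)
ax.set_xlabel("Temperature (Kelvin)")
ax.set_ylabel("Luminosity (Sun = 1)")
plt.show()
```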
Two different astronomers, Ejnar Hertzsprung in Denmark and Henry Norris Russell in the United States, thought of doing this at about the same time. They both observed that in the resulting scatter plot, the stars fall into three clusters.
This observation led to further work and the understanding that these three clusters represent stars in very different phases of the stellar life cycle. The relationship between luminosity and temperature is consistent within each cluster, but the relationship is different between the clusters because fundamentally different processes are generating the heat and light. The 80 percent of stars that fall on the main sequence are generating energy by converting hydrogen to helium through nuclear fusion. This is how all stars spend most of their active life. After some number of billions of years, the hydrogen is used up. Depending on its mass, the star then begins fusing helium or the fusion stops. In the latter case, the core of the star collapses, generating a great deal of heat in the process. At the same time, the outer layer of gases expands away from the core, and a red giant is formed. Eventually, the outer layer of gases is stripped away, and the remaining core begins to cool. The star is now a white dwarf.
A recent search on Google using the phrase “Hertzsprung-Russell Diagram” returned thousands of pages of links to current astronomical research based on cluster detection of this kind. Even today, clusters based on the HR diagram are being used to hunt for brown dwarfs (starlike objects that lack sufficient mass to initiate nuclear fusion) and to understand pre–main sequence stellar evolution.
Fitting the Troops
The Hertzsprung-Russell diagram is a good introductory example of clustering because with only two variables, it is easy to spot the clusters visually (and, incidentally, it is a good example of the importance of good data visualizations). Even in three dimensions, picking out clusters by eye from a scatter plot cube is not too difficult. If all problems had so few dimensions, there would be no need for automatic cluster detection algorithms. As the number of dimensions (independent variables) increases, it becomes increasingly difficult to visualize clusters. Our intuition about how close things are to each other also quickly breaks down with more dimensions.
Saying that a problem has many dimensions is an invitation to analyze it geometrically. Each dimension corresponds to one of the quantities that must be measured independently in order to describe something. In other words, if there are N variables, imagine a space in which the value of each variable represents a distance along the corresponding axis in an N-dimensional space. A single record containing a value for each of the N variables can be thought of as the vector that defines a particular point in that space. When there are two dimensions, this is easily plotted. The HR diagram was one such example. Figure 11.2 is another example that plots the height and weight of a group of teenagers as points on a graph. Notice the clustering of boys and girls.
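To make the geometric picture concrete, the following sketch treats two records like the ones plotted in Figure 11.2 as vectors and measures the straight-line (Euclidean) distance between them. It is a sketch only, assuming NumPy is available; the height and weight figures are hypothetical.

```python
# A minimal sketch of treating records as points in N-dimensional
# space, assuming NumPy; the height/weight values are hypothetical.
import numpy as np

# Two records, each a vector of N = 2 measurements:
# (height in inches, weight in pounds).
record_a = np.array([63.0, 102.0])
record_b = np.array([70.0, 155.0])

# Euclidean distance between the two points in the plane.
distance = np.linalg.norm(record_a - record_b)
print(f"distance = {distance:.1f}")
```

As the chapter discusses later, a raw distance like this is sensitive to the units and scales of the variables, which is why data preparation matters so much for clustering.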
The chart in Figure 11.2 begins to give a rough idea of people’s shapes. But if the goal is to fit them for clothes, a few more measurements are needed!
In the 1990s, the U.S. Army commissioned a study on how to redesign the uniforms of female soldiers. The Army's goal was to reduce the number of different uniform sizes that have to be kept in inventory, while still providing each soldier with well-fitting uniforms.
As anyone who has ever shopped for women’s clothing is aware, there is already a surfeit of classification systems (even sizes, odd sizes, plus sizes, junior, petite, and so on) for categorizing garments by size. None of these systems was designed with the needs of the U.S. military in mind. Susan Ashdown and Beatrix Paal, researchers at Cornell University, went back to the basics; they designed a new set of sizes based on the actual shapes of women in the army.1