With no change in the purity function, we might say that decision trees provide directed clustering; that is, they create clusters of records that are similar with respect to some target variable. For this reason, ordinary decision trees are often a better choice for customer segmentation than the undirected clustering algorithms discussed in this chapter. If the purpose of the customer segmentation is to find customer segments that are loyal or profitable or likely to respond to some particular offer, it makes sense to use one of those variables (or a proxy) as the target for directed clustering. If, on the other hand, the point of the customer segmentation is to stimulate discussion of new product offerings geared to various naturally occurring clusters of customers, then an undirected approach is more appropriate.
Self-Organizing Maps
Self-organizing maps are a variant of neural networks that have been used for many years in applications such as feature detection in two-dimensional images. More recently, they have been applied successfully to more general clustering applications. There is a discussion of self-organizing networks in Chapter 7.
Evaluating Clusters
When using the K-means approach to cluster detection, is there a way to determine what value of K finds the best clusters? Similarly, when using a hierarchical approach, is there a test for which level in the hierarchy contains the best clusters? What does it mean to say that a cluster is good?
These questions are important when using clustering in practice. In general terms, clusters should have members that have a high degree of similarity—or, in geometric terms, are close to each other—and the clusters themselves should be widely spaced.
A standard measure of within-cluster similarity is variance (the sum of the squared differences of each element from the mean), so the best set of clusters might be the set whose clusters have the lowest variance. However, this measure does not take into account cluster size. A related measure is the average variance, the total variance divided by the size of the cluster.
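As a concrete illustration, here is a minimal sketch (in Python with NumPy, which is an assumption of this example rather than anything prescribed by the text) of how within-cluster variance and average variance might be computed for a cluster represented as a set of numeric records; the sample points are purely hypothetical.

```python
import numpy as np

def cluster_variance(members):
    """Sum of squared differences of each member from the cluster mean."""
    members = np.asarray(members, dtype=float)
    centroid = members.mean(axis=0)
    return float(((members - centroid) ** 2).sum())

def average_variance(members):
    """Total within-cluster variance divided by the size of the cluster."""
    return cluster_variance(members) / len(members)

# Hypothetical 2-D points: a tight cluster and a loose one
tight = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1)]
loose = [(5.0, 5.0), (7.0, 3.0), (3.0, 7.0)]
print(average_variance(tight))  # small: members sit close to the centroid
print(average_variance(loose))  # larger: members are widely scattered
```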
For agglomerative clustering, using variance as a measure does not make sense, since this method always starts out with clusters of one—which, of course, have variance zero. A good measure to use with agglomerative clusters
is the difference between the distance value at which a cluster is formed and the distance value at which it is merged into the next level. This is a measure of the durability of the cluster. Strong clusters, like the one linking the 1- to 13-year-olds at distance 3 in Figure 11.11, last through many iterations of the algorithm.
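One way such a durability measure might be computed in practice, assuming the hierarchy is built with SciPy's agglomerative clustering (an assumption of this sketch, not something the text specifies), is to scan the linkage matrix: the cluster created at step i is formed at the distance recorded for that step and absorbed at the distance of the later step that references it. The random data and single-linkage choice below are illustrative only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def cluster_durability(Z, n):
    """Gap between the distance at which each cluster is formed and the
    distance at which it is merged into a larger cluster."""
    durability = {}
    for i in range(len(Z)):
        cluster_id = n + i            # SciPy labels the cluster formed at step i as n + i
        formed_at = Z[i, 2]
        for j in range(i + 1, len(Z)):
            if cluster_id in (int(Z[j, 0]), int(Z[j, 1])):
                durability[cluster_id] = Z[j, 2] - formed_at
                break
    return durability                 # the root cluster never merges, so it is omitted

# Illustrative usage on random points (not the age data from Figure 11.11)
X = np.random.rand(20, 2)
Z = linkage(X, method="single")
print(sorted(cluster_durability(Z, len(X)).items(), key=lambda kv: -kv[1])[:3])
```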
A general-purpose measure that works with any form of cluster detection is to take whatever similarity measure or distance metric is used to form the clusters and use it to compare the average distance between members of a cluster and the cluster centroid to the average distance between cluster centroids. This can be done for each cluster individually and for the entire collection of clusters.
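A sketch of this comparison, again assuming numeric records held in NumPy arrays and Euclidean distance as the metric (both assumptions of the example, not requirements of the technique):

```python
import numpy as np
from itertools import combinations

def cohesion(members):
    """Average distance between a cluster's members and its centroid."""
    members = np.asarray(members, dtype=float)
    centroid = members.mean(axis=0)
    return float(np.linalg.norm(members - centroid, axis=1).mean())

def separation(clusters):
    """Average distance between the centroids of different clusters."""
    centroids = [np.asarray(c, dtype=float).mean(axis=0) for c in clusters]
    return float(np.mean([np.linalg.norm(a - b)
                          for a, b in combinations(centroids, 2)]))

def cohesion_to_separation(clusters):
    """Lower values indicate tighter, more widely spaced clusters."""
    return float(np.mean([cohesion(c) for c in clusters])) / separation(clusters)
```

The ratio can be computed for a single cluster against the others or averaged over the whole collection, which makes it useful for comparing different values of K or different levels of a hierarchy.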
T I P If there are one or two good clusters along with a number of weaker ones, it may be possible to improve results by removing all members of the strong clusters. The strong clusters are worthy of further analysis anyway, and removing their strong pull may allow new clusters to be detected in the records left behind.
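One possible way to act on this tip, sketched here with scikit-learn's KMeans (an assumption for illustration; any clustering routine would do), is to identify the tightest cluster, set its members aside, and rerun the clustering on the remaining records:

```python
import numpy as np
from sklearn.cluster import KMeans

def recluster_without_strongest(X, k):
    """Cluster, set aside the members of the tightest cluster, and recluster."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Distance of each record to its assigned centroid
    dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    tightness = [dists[km.labels_ == c].mean() for c in range(k)]
    strongest = int(np.argmin(tightness))      # cluster whose members hug its centroid
    remaining = X[km.labels_ != strongest]
    return strongest, KMeans(n_clusters=k, n_init=10, random_state=0).fit(remaining)
```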
Inside the Cluster
Clustering often produces one or more strong clusters—rather large clusters where the records are quite similar. The question is, what makes a strong cluster special? What is it about the records in this cluster that causes them to be lumped together? Even more importantly, is it possible to find rules and patterns within this cluster now that the noise from the rest of the database has been reduced?
The easiest way to approach these questions is to take the mean of each variable within the cluster and compare it to the mean of the same variable in the parent population. Rank order the variables by the magnitude of the difference, or better yet, the z-score. Looking at the variables that show the largest difference between the cluster and the rest of the database goes a long way towards explaining what makes the cluster special.
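A minimal sketch of this profiling step, assuming the data sits in a pandas DataFrame and the cluster assignment is available as a boolean mask (the column names in the commented usage line are hypothetical):

```python
import pandas as pd

def profile_cluster(df, in_cluster, variables):
    """Rank variables by how far the cluster mean sits from the population
    mean, measured in population standard deviations (a z-score)."""
    pop_mean = df[variables].mean()
    pop_std = df[variables].std()
    cluster_mean = df.loc[in_cluster, variables].mean()
    z_scores = (cluster_mean - pop_mean) / pop_std
    return z_scores.reindex(z_scores.abs().sort_values(ascending=False).index)

# Hypothetical usage, assuming df holds the full population and a cluster label:
# print(profile_cluster(df, df["cluster"] == 3, ["income", "age", "home_value"]))
```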
Outside the Cluster
Clustering can be useful even when only a single cluster is found. When screening for a very rare defect, there may not be enough examples to train a directed data mining model to detect it. One example is testing electric motors at the factory that makes them. Cluster detection methods can be used on a sample containing only good motors to determine the shape and size of the
“normal” cluster. When a motor comes along that falls outside the cluster for any reason, it is suspect. This approach has been used in medicine to detect the presence of abnormal cells in tissue samples and in telecommunications to detect calling patterns indicative of fraud.
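The following sketch shows one simple form such a test might take: describe the known-good examples by their centroid and a distance threshold (here the 99th percentile of training distances, an arbitrary choice), and flag anything farther away as suspect. A real implementation would likely need variable scaling or a covariance-aware distance; this is only an illustration of the idea.

```python
import numpy as np

class NormalCluster:
    """Describe the 'normal' cluster from known-good examples and flag
    anything that falls too far outside it."""

    def __init__(self, percentile=99):
        self.percentile = percentile   # arbitrary cutoff; tune for the application

    def fit(self, good_examples):
        X = np.asarray(good_examples, dtype=float)
        self.centroid_ = X.mean(axis=0)
        distances = np.linalg.norm(X - self.centroid_, axis=1)
        self.threshold_ = np.percentile(distances, self.percentile)
        return self

    def is_suspect(self, record):
        distance = np.linalg.norm(np.asarray(record, dtype=float) - self.centroid_)
        return distance > self.threshold_
```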
Case Study: Clustering Towns
The Boston Globe is one of two major dailies serving Boston and the surrounding area of eastern Massachusetts and southern New Hampshire. The Globe is the leading Boston newspaper by circulation, with daily circulation of over 467,000 in 2003 compared to 243,000 for the Boston Herald, the other major daily in town. On Sundays, the Globe has circulation of over 705,000. Despite this leading position, the Globe did not want to stand still in 2003. As with many newspapers, it faced declining readership in its core Boston market and strong competition from local papers in the suburban markets to which some of its readers had migrated.
In order to compete better with the suburban papers, the Globe introduced geographically distinct versions of the paper with specialized editorial content for each of 12 geographically defined zones. Two days a week, readers are treated to a few pages of local coverage for their area. The editorial zones were drawn up using data available to the Globe, common sense, and a map, but no formal statistical analysis. There were some constraints on the composition of the editorial zones:
■■ The zones had to be geographically contiguous so that the trucks carrying the localized editions from the central printing plant in Boston could take sensible routes.
■■ The zones had to be reasonably compact and contain sufficient population to justify specialized editorial content.
■■ The editorial zones had to be closely aligned with the geographic zones used to sell advertising.
Within these constraints, the Globe wished to design editorial zones that would group similar towns together. Sounds sensible, but which towns are similar? That is the question that The Boston Globe brought to us at Data Miners.
Creating Town Signatures
Before deciding which towns belonged together, there needed to be a way of describing the towns—a town signature with a column for every feature that might be useful for characterizing a town and comparing it with its neighbors.
As it happened, Data Miners had worked on an earlier project to find towns with good prospects for future circulation growth that had already defined town signatures. Those signatures, which had been developed for a regression model to predict Globe home delivery penetration, turned out to be equally useful for undirected clustering. This is a fairly common occurrence; once a useful set of descriptive attributes has been collected it can be used for all sorts of things. In another example, a long-distance company developed customer
signatures based on call detail data in order to predict fraud and later found that the same variables were useful for distinguishing between business and residential users.
T I P Although the time and effort it takes to create a good customer signature can seem daunting, the effort is repaid over time because the same attributes often turn out to be predictive for many different target variables. The oft-quoted rule of thumb that 80 percent of the time spent on a data mining project goes into data preparation becomes less true when the data preparation effort can be amortized over several predictive modeling efforts.
The Data
The town signatures were derived from several sources, with most of the variables coming from town-level U.S. Census data from 1990 and 2001. The census data provides counts of the number of residents by age, race, ethnic group, occupation, income, home value, average commute time, and many other interesting variables. In addition, the Globe had household-level data on its subscribers supplied by an outside data vendor as well as circulation figures for each town and subscriber-level information on discount plans, complaint calls, and type of subscription (daily, Sunday, or both).
There were four basic steps to creating the town signatures:
1. Aggregation
2. Normalization
3. Calculation of trends
4. Creation of derived variables
The first step in turning this data into a town signature was to aggregate everything to the town level. For example, the subscriber data was aggregated to produce the total number of subscribers and median subscriber household income for each town.
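For example, the aggregation step might look something like the following pandas sketch; the table and column names are hypothetical stand-ins for the Globe's actual subscriber data.

```python
import pandas as pd

# Hypothetical subscriber-level records; the real data had many more fields
subscribers = pd.DataFrame({
    "town": ["Newton", "Newton", "Brookline"],
    "household_income": [85000, 92000, 110000],
})

# Roll subscriber-level rows up to one row per town
town_signature = subscribers.groupby("town").agg(
    num_subscribers=("town", "size"),
    median_hh_income=("household_income", "median"),
)
print(town_signature)
```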