From the point of view of the user, this hybrid technique has more in common with neural-network variants than with decision-tree variants because, like other neural-network techniques, it is not capable of explaining its decisions. The tree still produces rules, but these are of the form F(w1x1, w2x2, w3x3, . . .) ≤ N, where F is the combining function used by the neural network. Such rules make more sense to neural network software than to people.
Piecewise Regression Using Trees
Another example of combining trees with other modeling methods is a form of piecewise linear regression in which each split in a decision tree is chosen so as to minimize the error of a simple regression model on the data at that node.
The same method can be applied to logistic regression for categorical target variables.
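As a sketch of the idea (not the book’s implementation), the following assumes a single numeric input and searches candidate split points exhaustively, fitting an ordinary least-squares line to each side and keeping the split with the lowest combined error:

```python
import numpy as np

def best_split(x, y):
    """Choose the split on x that minimizes the summed squared error
    of a simple linear regression fit separately to each side."""
    order = np.argsort(x)
    x, y = x[order], y[order]

    def sse(xs, ys):
        # Fit y = a*x + b by least squares; return the residual SSE.
        a, b = np.polyfit(xs, ys, 1)
        return float(np.sum((ys - (a * xs + b)) ** 2))

    best = (None, np.inf)
    for i in range(2, len(x) - 2):          # keep at least 2 points per side
        err = sse(x[:i], y[:i]) + sse(x[i:], y[i:])
        if err < best[1]:
            best = (float((x[i - 1] + x[i]) / 2), err)
    return best  # (split threshold, combined SSE)

# Piecewise-linear data whose slope changes at x = 5
x = np.linspace(0, 10, 50)
y = np.where(x < 5, 2 * x, 10 + 0.5 * (x - 5))
threshold, err = best_split(x, y)
print(threshold)   # close to 5, the true change point
```

A full piecewise regression tree would apply this search recursively, and over every candidate input field rather than just one.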
Alternate Representations for Decision Trees
The traditional tree diagram is a very effective way of representing the actual structure of a decision tree. Other representations are sometimes more useful when the focus is more on the relative sizes and concentrations of the nodes.
Box Diagrams
While the tree diagram and Twenty Questions analogy are helpful in visualizing certain properties of decision-tree methods, in some cases, a box diagram is more revealing. Figure 6.13 shows the box diagram representation of a decision tree that tries to classify people as male or female based on their ages and the movies they have seen recently. The diagram may be viewed as a sort of nested collection of two-dimensional scatter plots.
At the root node of a decision tree, the first three-way split is based on which of three groups the survey respondent’s most recently seen movie falls into. In the outermost box of the diagram, the horizontal axis represents that field. The outermost box is divided into sections, one for each node at the next level of the tree.
The size of each section is proportional to the number of records that fall into it.
Next, the vertical axis of each box is used to represent the field that is used as the next splitter for that node. In general, this will be a different field for each box.
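The layout arithmetic is simple to sketch. The following hypothetical helper (not from the book) divides the unit square so that section widths track first-level node sizes and sub-box heights track second-level node sizes:

```python
def box_layout(split_counts):
    """Given, for each first-level node, the record counts of its children,
    return nested rectangles (x, y, w, h) inside the unit square."""
    total = sum(sum(children) for children in split_counts)
    boxes, x = [], 0.0
    for children in split_counts:
        w = sum(children) / total      # section width ~ records in the node
        y = 0.0
        for c in children:
            h = c / sum(children)      # sub-box height ~ records in the child
            boxes.append((x, y, w, h))
            y += h
        x += w
    return boxes

# Three movie groups with 50, 30, and 20 records; the first two are
# split again by age, the third is a leaf.
layout = box_layout([[30, 20], [18, 12], [20]])
```

Because every width and height is a proportion of the parent, the boxes tile the unit square exactly, which is what makes the relative sizes of the leaves visible at a glance.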
470643 c06.qxd 3/8/04 11:12 AM Page 200
200 Chapter 6
Figure 6.13 A box diagram represents a decision tree. Shading is proportional to the purity of the box; size is proportional to the number of records that land there.

There is now a new set of boxes, each of which represents a node at the third level of the tree. This process continues, dividing boxes until the leaves of the tree each have their own box. Since decision trees often have nonuniform depth, some boxes may be subdivided more often than others. Box diagrams make it easy to represent classification rules that depend on any number of variables on a two-dimensional chart.

The resulting diagram is very expressive. As we toss records onto the grid, they fall into a particular box and are classified accordingly. A box chart allows us to look at the data at several levels of detail. Figure 6.13 shows at a glance that the bottom left contains a high concentration of males. Taking a closer look, we find some boxes that seem to do a particularly good job at classification or collect a large number of records.

Viewed this way, it is natural to think of decision trees as a way of drawing boxes around groups of similar points. All of the points within a particular box are classified the same way because they all meet the rule defining that box. This is in contrast to classical statistical classification methods such as linear, logistic, and quadratic discriminants that attempt to partition data into classes by drawing a line or elliptical curve through the data space. This is a fundamental distinction: Statistical approaches that use a single line to find the boundary between classes are weak when there are several very different ways for a record to become part of the target class. Figure 6.14 illustrates this point using two species of dinosaur.
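The weakness of a single straight boundary is easy to demonstrate numerically. In this illustrative sketch (the data and both classifiers are invented for the example), the target class occupies two opposite corners of the space; the best single axis-parallel threshold barely beats guessing, while a small tree of boxes separates the classes cleanly:

```python
import numpy as np

rng = np.random.default_rng(0)
# The target class lives in two separate regions, like the dissimilar
# but equally profitable customer groups described in the text.
X = rng.uniform(0, 1, size=(400, 2))
y = (X[:, 0] < 0.5) == (X[:, 1] < 0.5)   # True in two opposite corners

def threshold_accuracy(X, y):
    """Best single straight (axis-parallel) boundary: one rule, one line."""
    best = 0.0
    for j in (0, 1):
        for t in np.linspace(0, 1, 101):
            pred = X[:, j] < t
            best = max(best, np.mean(pred == y), np.mean(pred != y))
    return best

def quadrant_tree_accuracy(X, y):
    """A depth-2 tree with splits at 0.5 on each axis: four boxes,
    each labeled by majority vote (the true split points are assumed)."""
    correct = 0
    for lo0 in (True, False):
        for lo1 in (True, False):
            box = ((X[:, 0] < 0.5) == lo0) & ((X[:, 1] < 0.5) == lo1)
            labels = y[box]
            correct += max(labels.sum(), (~labels).sum())  # majority vote
    return correct / len(y)

line_acc = threshold_accuracy(X, y)
tree_acc = quadrant_tree_accuracy(X, y)
print(line_acc)   # barely above 0.5: one line cannot capture both regions
print(tree_acc)   # 1.0: four boxes isolate both regions
```

The same effect appears with any classifier restricted to one linear boundary; the tree wins here only because its boxes can surround each region separately.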
The decision tree (represented as a box diagram) has successfully isolated the stegosaurs from the triceratops.

Figure 6.14 Often a simple line or curve cannot separate the regions and a decision tree does better.

In the credit card industry, for example, there are several ways for customers to be profitable. Some profitable customers have low transaction rates, but keep high revolving balances without defaulting. Others pay off their balance in full each month, but are profitable due to the high transaction volume they generate. Yet others have few transactions, but occasionally make a large purchase and take several months to pay it off. Two very dissimilar customers may be equally profitable. A decision tree can find each separate group, label it, and by providing a description of the box itself, suggest the reason for each group’s profitability.

Tree Ring Diagrams

Another clever representation of a decision tree is used by the Enterprise Miner product from SAS Institute. The diagram in Figure 6.15 looks as though the tree has been cut down and we are looking at the stump.

Figure 6.15 A tree ring diagram produced by SAS Enterprise Miner summarizes the different levels of the tree.

The circle at the center of the diagram represents the root node, before any splits have been made. Moving out from the center, each concentric ring represents a new level in the tree. The ring closest to the center represents the root node split. The arc length is proportional to the number of records taking each of the two paths, and the shading represents the node’s purity. The first split in the model represented by this diagram is fairly unbalanced. It divides the records into two groups, a large one where the concentration is little different from the parent population, and a small one with a high concentration of the target class.
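The geometry of one ring is straightforward to sketch. In this hypothetical helper (not part of Enterprise Miner), each child node receives an arc whose sweep is proportional to its record count, along with a purity value to drive the shading:

```python
def ring_segments(children):
    """Assign each child node an arc of the ring: arc length is
    proportional to record count, shading to node purity.
    Each child is given as (record_count, target_class_count)."""
    total = sum(count for count, _ in children)
    segments, start = [], 0.0
    for count, positives in children:
        sweep = 360.0 * count / total    # degrees of arc ~ record count
        purity = positives / count       # fraction in the target class
        segments.append((start, start + sweep, purity))
        start += sweep
    return segments

# An unbalanced first split: 900 records near the baseline 10%
# concentration, and 100 records at 60% concentration.
segs = ring_segments([(900, 90), (100, 60)])
print(segs)   # → [(0.0, 324.0, 0.1), (324.0, 360.0, 0.6)]
```

Applying the same function to each node’s children, level by level, produces the concentric rings of the full diagram.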
At the next level, this smaller node is again split and one branch, represented by the thin, dark pie slice that extends all the way through to the outermost ring of the diagram, is a leaf node.

The ring diagram shows the tree’s depth and complexity at a glance and indicates the location of high concentrations of the target class. What it does not show directly are the rules defining the nodes. The software reveals these when a user clicks on a particular section of the diagram.

Decision Trees in Practice

Decision trees can be applied in many different situations:

■■ To explore a large dataset to pick out useful variables
■■ To predict future states of important variables in an industrial process
■■ To form directed clusters of customers for a recommendation system

This section includes examples of decision trees being used in all of these ways.

Decision Trees as a Data Exploration Tool

During the data exploration phase of a data mining project, decision trees are a useful tool for picking the variables that are likely to be important for predicting particular targets. One of our newspaper clients, The Boston Globe, was interested in estimating a town’s expected home delivery circulation level based on various demographic and geographic characteristics. Armed with such estimates, they would, among other things, be able to spot towns with untapped potential where the actual circulation was lower than the expected circulation.

The final model would be a regression equation based on a handful of variables. But which variables? And what exactly would the regression attempt to estimate? Before building the regression model, we used decision trees to help explore these questions.

Although the newspaper was ultimately interested in predicting the actual number of subscribing households in a given city or town, that number does not make a good target for a regression model because towns and cities vary so much in size.
It is not useful to waste modeling power on discovering that there are more subscribers in large towns than in small ones. A better target is the penetration, the proportion of households that subscribe to the paper. This number yields an estimate of the total number of subscribing households simply by multiplying it by the number of households in a town. Factoring out town size yields a target variable with values that range from zero to somewhat less than one.

The next step was to figure out which factors, from among the hundreds in the town signature, separate towns with high penetration (the “good” towns) from those with low penetration (the “bad” towns). Our approach was to build a decision tree with a binary good/bad target variable. This involved sorting the towns by home delivery penetration and labeling the top one third