Throwing out records due to missing values is likely to create a biased training set because the records with missing values are not likely to be a random sample of the population. Replacing missing values with imputed values has the risk that important information provided by the fact that a value is missing will be ignored in the model. We have seen many cases where the fact that a particular value is null has predictive value. In one such case, the count of non-null values in appended household-level demographic data was positively correlated with response to an offer of term life insurance. Apparently, people who leave many traces in Acxiom’s household database (by buying houses, getting married, registering products, and subscribing to magazines) are more likely to be interested in life insurance than those whose lifestyles leave more fields null.
TIP Decision trees can produce splits based on missing values of an input variable. The fact that a value is null can often have predictive value, so do not be hasty to filter out records with missing values or to try to replace them with imputed values.
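Both ideas, treating null as a category of its own and using the count of populated fields as a predictor in its own right, can be sketched as follows. The field names and the threshold are hypothetical, not from any actual dataset:

```python
# Sketch only: two ways to keep the information carried by missing values
# instead of dropping records or imputing. Field names are invented.

def nonnull_count(record):
    """Count populated fields; the count itself can be predictive."""
    return sum(1 for value in record.values() if value is not None)

def route(value, split_value):
    """Route a record at a numeric split, giving null its own branch."""
    if value is None:
        return "null"                       # null as a separate class
    return "left" if value <= split_value else "right"

household = {"age": 47, "home_value": None, "magazines": 3}
print(nonnull_count(household))             # 2
print(route(household["home_value"], 150_000))  # null
```

A derived nonnull-count field like this one can then be offered to the tree as an ordinary numeric input.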
Although splitting on null as a separate class is often quite valuable, at least one data mining product offers an alternative approach as well. In Enterprise Miner, each node stores several possible splitting rules, each one based on a different input field. When a null value is encountered in the field that yields
470643 c06.qxd 3/8/04 11:12 AM Page 175
Decision Trees 175
the best split, the software uses a surrogate split based on the next best available input variable.
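The surrogate mechanism can be sketched roughly as follows; the ranked rule list stands in for whatever Enterprise Miner actually stores at each node, and the fields and thresholds are invented:

```python
# Hedged sketch of surrogate splits: each node keeps a ranked list of
# candidate splitting rules, and a record with a null in the best field
# falls through to the next usable rule.

def apply_node(record, ranked_rules):
    """ranked_rules: list of (field, threshold) pairs, best rule first."""
    for field, threshold in ranked_rules:
        value = record.get(field)
        if value is not None:               # first usable rule wins
            return "left" if value <= threshold else "right"
    return "left"                           # fallback if every field is null

rules = [("income", 50_000), ("age", 40)]   # "age" is the surrogate
print(apply_node({"income": None, "age": 35}, rules))  # left, via surrogate
```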
Growing the Full Tree
The initial split produces two or more child nodes, each of which is then split in the same manner as the root node. Once again, all input fields are considered as candidate splitters, even fields already used for splits. However, fields that take on only one value are eliminated from consideration since there is no way that they can be used to create a split. A categorical field that has been used as a splitter higher up in the tree is likely to become single-valued fairly quickly. The best split for each of the remaining fields is determined. When no split can be found that significantly increases the purity of a given node, or when the number of records in the node reaches some preset lower bound, or when the depth of the tree reaches some preset limit, the split search for that branch is abandoned and the node is labeled as a leaf node.
Eventually, it is not possible to find any more splits anywhere in the tree and the full decision tree has been grown. As we will see, this full tree is generally not the tree that does the best job of classifying a new set of records.
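The growth procedure just described can be sketched in miniature. This toy version searches a single numeric input with binary 0/1 labels and scores splits with the Gini measure (discussed later in this chapter); minimum node size, maximum depth, and minimum purity gain are the stopping rules:

```python
# Toy sketch of recursive tree growth. Records are (x, label) pairs with
# labels 0 or 1; real software would search every input field.

MIN_RECORDS, MAX_DEPTH, MIN_GAIN = 2, 5, 1e-6

def gini(records):
    p = sum(label for _, label in records) / len(records)
    return 1 - p * p - (1 - p) * (1 - p)

def best_split(records):
    """Try midpoints between sorted x values; return (gain, t, left, right)."""
    base, best = gini(records), None
    xs = sorted(x for x, _ in records)
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [r for r in records if r[0] <= t]
        right = [r for r in records if r[0] > t]
        if not left or not right:
            continue
        gain = base - (len(left) * gini(left) +
                       len(right) * gini(right)) / len(records)
        if best is None or gain > best[0]:
            best = (gain, t, left, right)
    return best

def grow(records, depth=0):
    if len(records) < MIN_RECORDS or depth >= MAX_DEPTH:
        return {"leaf": records}            # preset lower bound or depth limit
    split = best_split(records)
    if split is None or split[0] < MIN_GAIN:
        return {"leaf": records}            # no significant purity increase
    _, t, left, right = split
    return {"split": t,
            "le": grow(left, depth + 1),
            "gt": grow(right, depth + 1)}

data = [(1, 0), (2, 0), (3, 1), (4, 1)]     # deterministic boundary at 2.5
tree = grow(data)
print(tree["split"])                        # 2.5
```

Because this toy data is completely deterministic, the tree finds the boundary exactly and both children are pure leaves, just as the text predicts for deterministic relationships.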
Decision-tree-building algorithms begin by trying to find the input variable that does the best job of splitting the data among the desired categories. At each succeeding level of the tree, the subsets created by the preceding split are themselves split according to whatever rule works best for them. The tree continues to grow until it is no longer possible to find better ways to split up incoming records. If there were a completely deterministic relationship between the input variables and the target, this recursive splitting would eventually yield a tree with completely pure leaves. It is easy to manufacture examples of this sort, but they do not occur very often in marketing or CRM applications.
Customer behavior data almost never contains such clear, deterministic relationships between inputs and outputs. The fact that two customers have the exact same description in terms of the available input variables does not ensure that they will exhibit the same behavior. A decision tree for a catalog response model might include a leaf representing females with age greater than 50, three or more purchases within the last year, and total lifetime spending of over $145. The customers reaching this leaf will typically be a mix of responders and nonresponders. If the leaf in question is labeled “responder,”
then the proportion of nonresponders is the error rate for this leaf. The ratio of the proportion of responders in this leaf to the proportion of responders in the population is the lift at this leaf.
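With invented counts, the error rate and lift at such a leaf work out as follows (the figures here are hypothetical, not from the catalog example):

```python
# Error rate and lift at a leaf labeled "responder". Counts are invented.

leaf_responders, leaf_total = 30, 100       # hypothetical leaf counts
pop_responders, pop_total = 50, 1000        # hypothetical population counts

p_leaf = leaf_responders / leaf_total       # 0.30 responders at the leaf
p_pop = pop_responders / pop_total          # 0.05 responders overall

error_rate = 1 - p_leaf                     # proportion of nonresponders
lift = p_leaf / p_pop                       # leaf rate relative to population

print(round(error_rate, 2), round(lift, 2))  # 0.7 6.0
```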
One circumstance where deterministic rules are likely to be discovered is when patterns in data reflect business rules. The authors had this fact driven home to them by an experience at Caterpillar, a manufacturer of diesel engines. We built a decision tree model to predict which warranty claims would be approved. At the time, the company had a policy by which certain
claims were paid automatically. The results were startling: The model was 100
percent accurate on unseen test data. In other words, it had discovered the exact rules used by Caterpillar to classify the claims. On this problem, a neural network tool was less successful. Of course, discovering known business rules may not be particularly useful; it does, however, underline the effectiveness of decision trees on rule-oriented problems.
Many domains, ranging from genetics to industrial processes, really do have underlying rules, though these may be quite complex and obscured by noisy data. Decision trees are a natural choice when you suspect the existence of underlying rules.
Measuring the Effectiveness of a Decision Tree
The effectiveness of a decision tree, taken as a whole, is determined by applying it to the test set—a collection of records not used to build the tree—and observing the percentage classified correctly. This provides the classification error rate for the tree as a whole, but it is also important to pay attention to the quality of the individual branches of the tree. Each path through the tree represents a rule, and some rules are better than others.
At each node, whether a leaf node or a branching node, we can measure:
■■ The number of records entering the node
■■ The proportion of records in each class
■■ How those records would be classified if this were a leaf node
■■ The percentage of records classified correctly at this node
■■ The variance in distribution between the training set and the test set

Of particular interest is the percentage of records classified correctly at this node. Surprisingly, sometimes a node higher up in the tree does a better job of classifying the test set than nodes lower down.
Tests for Choosing the Best Split
A number of different measures are available to evaluate potential splits. Algorithms developed in the machine learning community focus on the increase in purity resulting from a split, while those developed in the statistics community focus on the statistical significance of the difference between the distributions of the child nodes. Alternate splitting criteria often lead to trees that look quite different from one another, but have similar performance. That is because there are usually many candidate splits with very similar performance. Different purity measures lead to different candidates being selected, but since all of the measures are trying to capture the same idea, the resulting models tend to behave similarly.
Purity and Diversity
The first edition of this book described splitting criteria in terms of the decrease in diversity resulting from the split. In this edition, we refer instead to the increase in purity, which seems slightly more intuitive. The two phrases refer to the same idea. A purity measure that ranges from 0 (when no two items in the sample are in the same class) to 1 (when all items in the sample are in the same class) can be turned into a diversity measure by subtracting it from 1. Some of the measures used to evaluate decision tree splits assign the lowest score to a pure node; others assign the highest score to a pure node. This discussion refers to all of them as purity measures, and the goal is to optimize purity by minimizing or maximizing the chosen measure.
Figure 6.5 shows a good split. The parent node contains equal numbers of light and dark dots. The left child contains nine light dots and one dark dot.
The right child contains nine dark dots and one light dot. Clearly, the purity has increased, but how can the increase be quantified? And how can this split be compared to others? That requires a formal definition of purity, several of which are listed below.
Figure 6.5 A good split on a binary categorical variable increases purity.
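One way to quantify the increase for the split in Figure 6.5 is with the Gini measure from the list below: the parent holds 10 light and 10 dark dots, and each child holds 9 of one color and 1 of the other.

```python
# Quantifying Figure 6.5 with the Gini measure (here in its impurity form,
# where 0 means pure): the purity gain is the parent's score minus the
# weighted average of the children's scores.

def gini(counts):
    n = sum(counts)
    return 1 - sum((k / n) ** 2 for k in counts)

parent = gini([10, 10])                     # 0.5, maximally impure
weighted = 0.5 * gini([9, 1]) + 0.5 * gini([1, 9])
gain = parent - weighted
print(parent, round(weighted, 2), round(gain, 2))  # 0.5 0.18 0.32
```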
Purity measures for evaluating splits for categorical target variables include:
■■ Gini (also called population diversity)
■■ Entropy (also called information gain)
■■ Information gain ratio
■■ Chi-square test
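As a rough illustration of one of these measures, here is the entropy of the Figure 6.5 split (a 10/10 parent producing 9/1 and 1/9 children) and the resulting information gain:

```python
# Entropy and information gain for the Figure 6.5 split.

from math import log2

def entropy(counts):
    """Entropy in bits: 0 for a pure node, 1 for two equal classes."""
    n = sum(counts)
    return sum(-(k / n) * log2(k / n) for k in counts if k)

parent = entropy([10, 10])                  # 1.0 bit
children = 0.5 * entropy([9, 1]) + 0.5 * entropy([1, 9])
gain = parent - children                    # information gain of the split
print(round(gain, 3))                       # 0.531
```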
When the target variable is numeric, one approach is to bin the value and use one of the above measures. There are, however, two measures in common use for numeric targets: