[Figure 6.7: error rate plotted against depth of tree for the training data and the validation data, with the pruning point marked where the validation-set error rate reaches its minimum.]
Figure 6.7 Pruning chooses the tree whose misclassification rate is minimized on the validation set.
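The selection the caption describes is simple to express in code. The following Python fragment is only a sketch of the idea, not any tool's implementation: it assumes the candidate subtrees produced by pruning are already in hand, along with each one's error rate measured on the validation set, and it simply keeps the candidate with the lowest validation error.
```python
# A minimal sketch of validation-set pruning: among the candidate subtrees
# produced by successive pruning, keep the one with the lowest validation
# error rate. The depths and error rates below are made up for illustration.
candidates = [
    (2, 0.31),   # very shallow tree: underfits
    (5, 0.24),
    (8, 0.22),   # lowest validation error -- prune to this depth
    (12, 0.26),  # deeper trees fit noise in the training data
    (15, 0.29),
]

best_depth, best_error = min(candidates, key=lambda pair: pair[1])
print(f"Prune to depth {best_depth} (validation error rate {best_error:.2f})")
```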
The C5 Pruning Algorithm
C5 is the most recent version of the decision-tree algorithm that Australian researcher J. Ross Quinlan has been evolving and refining for many years. An earlier version, ID3, published in 1986, was very influential in the field of machine learning, and its successors are used in several commercial data mining products. (The name ID3 stands for “Iterative Dichotomiser 3.” We have not heard an explanation for the name C5, but we can guess that Professor Quinlan’s background is mathematics rather than marketing.) C5 is available as a commercial product from RuleQuest (www.rulequest.com).
The trees grown by C5 are similar to those grown by CART (although unlike CART, C5 makes multiway splits on categorical variables). Like CART, the C5
algorithm first grows an overfit tree and then prunes it back to create a more stable model. The pruning strategy is quite different, however. C5 does not make use of a validation set to choose from among candidate subtrees; the same data used to grow the tree is also used to decide how the tree should be pruned. This may reflect the algorithm’s origins in the academic world, where, in the past, university researchers had a hard time getting their hands on substantial quantities of real data to use for training sets. Consequently, they spent much time and effort trying to coax the last few drops of information from their impoverished datasets—a problem that data miners in the business world do not face.
Pessimistic Pruning
C5 prunes the tree by examining the error rate at each node and assuming that the true error rate is actually substantially worse. If N records arrive at a node, and E of them are classified incorrectly, then the error rate at that node is E/N.
Now the whole point of the tree-growing algorithm is to minimize this error rate, so the algorithm assumes that E/N is the best that can be done.
C5 uses an analogy with statistical sampling to come up with an estimate of the worst error rate likely to be seen at a leaf. The analogy works by thinking of the data at the leaf as representing the results of a series of trials each of which can have one of two possible results. (Heads or tails is the usual example.) As it happens, statisticians have been studying this particular situation since at least 1713, the year that Jacques Bernoulli’s famous binomial formula was posthumously published. So there are well-known formulas for determining what it means to have observed E occurrences of some event in N trials.
In particular, there is a formula which, for a given confidence level, gives the confidence interval, the range of expected values of E. C5 assumes that the observed number of errors on the training data is the low end of this range, and substitutes the high end to get a leaf’s predicted error rate, E/N, on unseen data. The smaller the node, the wider this confidence interval, and so the higher the predicted error rate. When the high-end estimate of the number of errors at a node is less than the estimate for the errors of its children, the children are pruned.
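As an illustration, here is a small Python sketch of this style of pruning. It is not C5's exact calculation (C5's confidence settings are its own, and the z-value below is just an illustrative choice); it uses the upper limit of a binomial confidence interval on the observed error rate, computed with the Wilson score approximation, and prunes a split when the parent's pessimistic error estimate is no worse than the combined estimate for its children.
```python
import math

def pessimistic_error_rate(n, e, z=0.67):
    """Upper limit of a binomial confidence interval on the error rate.

    n: number of training records reaching the node
    e: number of those records the node misclassifies
    z: z-score controlling how pessimistic the estimate is
       (0.67 is an illustrative choice, roughly a 75 percent one-sided
       bound; it is not necessarily the confidence level C5 uses)

    Uses the Wilson score upper bound as a stand-in for an exact
    binomial calculation.
    """
    if n == 0:
        return 1.0
    f = e / n                      # observed error rate E/N
    z2 = z * z
    centre = f + z2 / (2 * n)
    spread = z * math.sqrt(f * (1 - f) / n + z2 / (4 * n * n))
    return (centre + spread) / (1 + z2 / n)

def should_prune(parent_n, parent_e, children):
    """Prune a split when the parent's pessimistic error estimate is no
    worse than the combined pessimistic estimate of its children.

    children: list of (n, e) pairs, one per child node.
    """
    parent_errors = parent_n * pessimistic_error_rate(parent_n, parent_e)
    child_errors = sum(n * pessimistic_error_rate(n, e) for n, e in children)
    return parent_errors <= child_errors

# Example: a node with 20 records and 2 errors, split into two small children.
# The split leaves the number of training errors unchanged (0 + 2 versus 2),
# so the pessimistic estimates favor pruning it away.
print(should_prune(20, 2, [(12, 0), (8, 2)]))   # True: prune the split
```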
Stability-Based Pruning
The pruning algorithms used by CART and C5 (and indeed by all the commercial decision tree tools that the authors have used) have a problem. They fail to prune some nodes that are clearly unstable. The split highlighted in Figure 6.8 is a good example. The picture was produced by SAS Enterprise
Miner using its default settings for viewing a tree. The numbers on the left-hand side of each node show what is happening on the training set. The numbers on the right-hand side of each node show what is happening on the validation set. This particular tree is trying to identify churners. When only the training data is taken into consideration, the highlighted branch seems to do very well; the concentration of churners rises from 58.0 percent to 70.9 percent.
Unfortunately, when the very same rule is applied to the validation set, the concentration of churners actually decreases from 56.6 percent to 52 percent.
One of the main purposes of a model is to make consistent predictions on previously unseen records. Any rule that cannot achieve that goal should be eliminated from the model. Many data mining tools allow the user to prune a decision tree manually. This is a useful facility, but we look forward to data mining software that provides automatic stability-based pruning as an option.
Such software would need a less subjective criterion for rejecting a split than “the distribution of the validation set results looks different from the distribution of the training set results.” One possibility would be to use a test of statistical significance, such as the chi-square test or the difference of proportions. The split would be pruned when the confidence level falls below some user-defined threshold, so that only splits that are, say, 99 percent confident on the validation set would remain.
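To make the idea concrete, here is a small Python sketch of such a check; the function name, the counts, and the use of scipy are ours, not a feature of any existing tool. It applies a chi-square test to the validation-set class counts in the children of a split and recommends pruning unless the children's distributions differ at the requested confidence level.
```python
from scipy.stats import chi2_contingency

def split_is_stable(validation_counts, confidence=0.99):
    """Decide whether the validation data supports a split.

    validation_counts: one row per child node and one column per class,
        giving validation-set record counts, for example
        [[churners_in_left_child, nonchurners_in_left_child],
         [churners_in_right_child, nonchurners_in_right_child]]

    Returns True when the class distributions of the children differ at
    the requested confidence level; otherwise the split should be pruned.
    """
    chi2, p_value, dof, expected = chi2_contingency(validation_counts)
    return p_value < (1.0 - confidence)

# Illustrative counts (not taken from Figure 6.8): a split that looks
# dramatic on the training set, but whose children the validation counts
# cannot tell apart.
children = [[26, 24],
            [13, 12]]
if not split_is_stable(children):
    print("Prune this split: the validation set does not support it")
```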
[Figure 6.8 shows a tree built in SAS Enterprise Miner on the churn data. The root node contains 39,628 training records and 19,814 validation records, about 13.5 and 13.8 percent of them churners; it splits on Handset Churn Rate, then on Call Trend, then on Total Amt. Overdue. In the highlighted node, the training-set concentration of churners is 70.9 percent, but the validation-set concentration is only 52.0 percent.]
Figure 6.8 An unstable split produces very different distributions on the training and validation sets.
WARNING Small nodes cause big problems. A common cause of unstable decision tree models is allowing nodes with too few records. Most decision tree tools allow the user to set a minimum node size. As a rule of thumb, nodes that receive fewer than about 100 training set records are likely to be unstable.
Extracting Rules from Trees
When a decision tree is used primarily to generate scores, it is easy to forget that a decision tree is actually a collection of rules. If one of the purposes of the data mining effort is to gain understanding of the problem domain, it can be useful to reduce the huge tangle of rules in a decision tree to a smaller, more comprehensible collection.
There are other situations where the desired output is a set of rules. In Mastering Data Mining, we describe the application of decision trees to an industrial process improvement problem, namely the prevention of a certain type of printing defect. In that case, the end product of the data mining project was a small collection of simple rules that could be posted on the wall next to each press.
When a decision tree is used for producing scores, having a large number of leaves is advantageous because each leaf generates a different score. When the object is to generate rules, the fewer rules the better. Fortunately, it is often possible to collapse a complex tree into a smaller set of rules.
The first step in that direction is to combine paths that lead to leaves that make the same classification. The partial decision tree in Figure 6.9 yields the following rules:
Watch the game and home team wins and out with friends then beer.
Watch the game and home team wins and sitting at home then diet soda.
Watch the game and home team loses and out with friends then beer.
Watch the game and home team loses and sitting at home then milk.
The two rules that predict beer can be combined by eliminating the test for whether the home team wins or loses. That test is important for discriminating between milk and diet soda, but has no bearing on beer consumption. The new, simpler rule is:
Watch the game and out with friends then beer.
[Figure 6.9 shows a partial decision tree: the root asks “Watch the game?”; its Yes branch asks “Home team wins?”; both answers lead to “Out with friends?” nodes whose Yes branches end in Beer and whose No branches end in Diet soda (home team wins) or Milk (home team loses).]
Figure 6.9 Multiple paths lead to the same conclusion.
Up to this point, nothing is controversial because no information has been lost, but C5’s rule generator goes farther. It attempts to generalize each rule by removing clauses, then comparing the predicted error rate of the new, briefer rule to that of the original using the same pessimistic error rate assumption used for pruning the tree in the first place. Often, the rules for several different leaves generalize to the same rule, so this process results in fewer rules than the decision tree had leaves.
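The clause-dropping step can be sketched in a few lines. The following Python fragment is only an illustration of the idea, not C5's rule generator: it represents a rule as a list of named clauses and greedily drops any clause whose removal does not raise the rule's pessimistic error estimate, reusing the same kind of upper-bound calculation sketched earlier.
```python
import math

def pessimistic_error_rate(n, e, z=0.67):
    """Same illustrative upper-bound estimate as in the earlier sketch."""
    if n == 0:
        return 1.0
    f, z2 = e / n, z * z
    return (f + z2 / (2 * n)
            + z * math.sqrt(f * (1 - f) / n + z2 / (4 * n * n))) / (1 + z2 / n)

def generalize_rule(clauses, records, target):
    """Greedily drop clauses whose removal does not raise the rule's
    pessimistic error estimate on the training data.

    clauses: list of (name, predicate) pairs; each predicate tests a record
    records: training records, dictionaries with a 'class' key
    target:  the class this rule predicts
    """
    def estimate(rule):
        covered = [r for r in records if all(pred(r) for _, pred in rule)]
        wrong = sum(1 for r in covered if r["class"] != target)
        return pessimistic_error_rate(len(covered), wrong)

    rule = list(clauses)
    improved = True
    while improved:
        improved = False
        for clause in list(rule):
            shorter = [c for c in rule if c is not clause]
            if shorter and estimate(shorter) <= estimate(rule):
                rule = shorter          # the clause adds nothing; drop it
                improved = True
                break
    return [name for name, _ in rule]

# Toy usage based on the beer example: the home-team test turns out to be
# irrelevant for predicting beer, so it gets dropped.
records = [
    {"friends": True,  "wins": True,  "class": "beer"},
    {"friends": True,  "wins": False, "class": "beer"},
    {"friends": False, "wins": True,  "class": "diet soda"},
    {"friends": False, "wins": False, "class": "milk"},
]
clauses = [("out with friends", lambda r: r["friends"]),
           ("home team wins",  lambda r: r["wins"])]
print(generalize_rule(clauses, records, "beer"))   # ['out with friends']
```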