Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management

What about the second proposed split? The Gini score of the left child is 1, since only one class is represented. The Gini score of the right child is

Gini_right = (4/14)² + (10/14)² = 0.082 + 0.510 = 0.592

and the Gini score for the split is:

(6/20)Gini_left + (14/20)Gini_right = 0.3*1 + 0.7*0.592 = 0.714

Since the Gini score for the first proposed split (0.820) is greater than for the second proposed split (0.714), a tree built using the Gini criterion will prefer the split that yields two nearly pure children over the split that yields one completely pure child along with a larger, less pure one.
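To make the arithmetic easy to reproduce, here is a minimal Python sketch of the Gini score as the book uses it (sum of squared class proportions for a node, weighted by child size for a split). The helper names and the counts-per-class representation are illustrative, not from the text; the counts themselves come from the dots in Figure 6.5.

```python
def gini(counts):
    """Gini score of a node: sum of squared class proportions.
    With this convention, a pure node scores 1.0 and higher means purer."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

def gini_split(children):
    """Size-weighted average of the children's Gini scores."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * gini(child) for child in children)

# First proposed split: two children, each with 9 dots of one class and 1 of the other.
print(gini_split([[9, 1], [1, 9]]))   # ~0.820

# Second proposed split: a pure 6-dot child and a child with 4 dark and 10 light dots.
print(gini_split([[6, 0], [4, 10]]))  # ~0.714
```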


COMPARING TWO SPLITS USING GINI AND ENTROPY

EVALUATING THE TWO SPLITS USING ENTROPY

As calculated in the main text, the entropy of the parent node is 1. The entropy of the first proposed split is also calculated in the main text and found to be 0.47, so the information gain for the first proposed split is 1 – 0.47 = 0.53.

How much information is gained by the second proposed split? The left child is pure and so has entropy of 0. As for the right child, the formula for entropy is

-(P(dark)log₂P(dark) + P(light)log₂P(light))

so the entropy of the right child is:

Entropy_right = -((4/14)log₂(4/14) + (10/14)log₂(10/14)) = 0.516 + 0.347 = 0.863

The entropy of the split is the weighted average of the entropies of the resulting nodes. In this case,

0.3*Entropy_left + 0.7*Entropy_right = 0.3*0 + 0.7*0.863 = 0.604

Subtracting 0.604 from the entropy of the parent (which is 1) yields an information gain of 0.396. This is less than 0.53, the information gain from the first proposed split, so in this case the entropy splitting criterion also prefers the first split to the second. Compared to Gini, the entropy criterion does have a stronger preference for nodes that are purer, even if smaller. This may be appropriate in domains where there really are clear underlying rules, but it tends to lead to less stable trees in “noisy” domains such as response to marketing offers.
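The same comparison can be reproduced with a short sketch of the entropy and information-gain calculations. The helper names are illustrative; the class counts are those of the dots in Figure 6.5.

```python
import math

def entropy(counts):
    """-(sum of p * log2(p)) over the classes present in the node."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def entropy_split(children):
    """Size-weighted average entropy of the children."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * entropy(child) for child in children)

parent = entropy([10, 10])                 # 1.0
first  = entropy_split([[9, 1], [1, 9]])   # ~0.47
second = entropy_split([[6, 0], [4, 10]])  # ~0.604

print(parent - first)    # information gain ~0.53
print(parent - second)   # information gain ~0.396
```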

For example, suppose the target variable is a binary flag indicating whether or not customers continued their subscriptions at the end of the introductory offer period and the proposed split is on acquisition channel, a categorical variable with three classes: direct mail, outbound call, and email. If the acquisition channel had no effect on renewal rate, we would expect the number of renewals in each class to be proportional to the number of customers acquired through that channel. For each channel, the chi-square test subtracts that expected number of renewals from the actual observed renewals, squares the difference, and divides the squared difference by the expected number. The values for each class are added together to arrive at the score. As described in Chapter 5, the chi-square distribution provides a way to translate this chi-square score into a probability. To measure the purity of a split in a decision tree, the score itself is sufficient; there is no need to convert it to a probability. A high score means that the proposed split successfully splits the population into subpopulations with significantly different distributions.
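As an illustration, here is a small sketch of that per-channel calculation for the renewal example. The channel counts are made up purely for illustration, and the code follows the simplified description above (summing over the renewal cells only) rather than a full contingency-table test.

```python
channels = ["direct mail", "outbound call", "email"]
acquired = [6000, 3000, 1000]   # customers acquired per channel (hypothetical)
renewed  = [3300, 1400,  300]   # observed renewals per channel (hypothetical)

# Renewal rate we would expect in every channel if channel had no effect.
overall_rate = sum(renewed) / sum(acquired)

chi_square = 0.0
for n, observed in zip(acquired, renewed):
    expected = n * overall_rate            # expected renewals under "no effect"
    chi_square += (observed - expected) ** 2 / expected

print(chi_square)   # a high score suggests the channels really do differ
```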

The chi-square test gives its name to CHAID, a well-known decision tree algorithm first published by Gordon V. Kass in 1980. The full acronym stands for Chi-square Automatic Interaction Detector. As the phrase “automatic interaction detector” implies, the original motivation for CHAID was detecting statistical relationships between variables. It does this by building a decision tree, so the method has come to be used as a classification tool as well.

CHAID makes use of the chi-square test in several ways: first to merge classes that do not have significantly different effects on the target variable; then to choose a best split; and finally to decide whether it is worth performing any additional splits on a node. In the research community, the current fashion is away from methods that continue splitting only as long as it seems likely to be useful and towards methods that involve pruning. Some researchers, however, still prefer the original CHAID approach, which does not rely on pruning.

The chi-square test applies to categorical variables, so in the classic CHAID algorithm, input variables must be categorical. Continuous variables must be binned or replaced with ordinal classes such as high, medium, and low. Some current decision tree tools, such as SAS Enterprise Miner, use the chi-square test for creating splits on categorical variables, but use another statistical test, the F test, for creating splits on continuous variables. Also, some implementations of CHAID continue to build the tree even when the splits are not statistically significant, and then apply pruning algorithms to prune the tree back.
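For instance, a continuous input such as income might be binned into ordinal classes before a chi-square-based split is considered. The cutoffs and values below are arbitrary, chosen only to illustrate the idea.

```python
def to_ordinal(value, low_cutoff, high_cutoff):
    """Map a continuous value onto three ordinal classes."""
    if value < low_cutoff:
        return "low"
    elif value < high_cutoff:
        return "medium"
    return "high"

incomes = [18_000, 42_000, 67_000, 95_000, 31_000]
print([to_ordinal(x, 30_000, 70_000) for x in incomes])
# ['low', 'medium', 'medium', 'high', 'medium']
```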

Reduction in Variance

The four previous measures of purity all apply to categorical targets. When the target variable is numeric, a good split should reduce the variance of the target variable. Recall that variance is a measure of the tendency of the values in a population to stay close to the mean value. In a sample with low variance, most values are quite close to the mean; in a sample with high variance, many values are quite far from the mean. The actual formula for the variance is the mean of the squared deviations from the mean. Although the reduction in variance split criterion is meant for numeric targets, the dark and light dots in Figure 6.5 can still be used to illustrate it by considering the dark dots to be 1 and the light dots to be 0. The mean value in the parent node is clearly 0.5. Every one of the 20 observations differs from the mean by 0.5, so the variance is (20 * 0.5²) / 20 = 0.25. After the split, the left child has 9 dark dots and 1 light dot, so the node mean is 0.9. Nine of the observations differ from the mean value by 0.1 and one observation differs from the mean value by 0.9, so the variance is (0.9² + 9 * 0.1²) / 10 = 0.09. Since both nodes resulting from the split have variance 0.09, the total variance after the split is also 0.09. The reduction in variance due to the split is 0.25 – 0.09 = 0.16.
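The same numbers fall out of a short sketch that codes the dark dots as 1 and the light dots as 0. The function names are illustrative, not from the book.

```python
def variance(values):
    """Mean of the squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

parent = [1] * 10 + [0] * 10   # 10 dark dots and 10 light dots
left   = [1] * 9 + [0] * 1     # 9 dark, 1 light
right  = [1] * 1 + [0] * 9     # 1 dark, 9 light

# Size-weighted variance of the two children.
split_variance = (len(left) * variance(left) + len(right) * variance(right)) / len(parent)

print(variance(parent))                    # 0.25
print(split_variance)                      # 0.09
print(variance(parent) - split_variance)   # reduction in variance: 0.16
```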

F Test

Another split criterion that can be used for numeric target variables is the F test, named for another famous Englishman: the statistician, astronomer, and geneticist Ronald A. Fisher. Fisher and Pearson reportedly did not get along despite, or perhaps because of, the large overlap in their areas of interest. Fisher’s test does for continuous variables what Pearson’s chi-square test does for categorical variables. It provides a measure of the probability that samples with different means and variances are actually drawn from the same population.

There is a well-understood relationship between the variance of a sample and the variance of the population from which it was drawn. (In fact, so long as the samples are of reasonable size and randomly drawn from the population, sample variance is a good estimate of population variance; very small samples—with fewer than 30 or so observations—usually have higher variance than their corresponding populations.) The F test looks at the relationship between two estimates of the population variance—one derived by pooling all the samples and calculating the variance of the combined sample, and one derived from the between-sample variance calculated as the variance of the sample means. If the various samples are randomly drawn from the same population, these two estimates should agree closely.

The F score is the ratio of the two estimates. It is calculated by dividing the between-sample estimate by the pooled sample estimate. The larger the score, the less likely it is that the samples are all randomly drawn from the same population. In the decision tree context, a large F score indicates that a proposed split has successfully split the population into subpopulations with significantly different distributions.
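A rough sketch of that ratio, written in the usual one-way ANOVA form (the between-sample estimate divided by the pooled within-sample estimate, which is what the description above corresponds to), is shown below. The sample values are invented for illustration.

```python
def f_score(samples):
    """F = between-sample variance estimate / pooled within-sample estimate."""
    k = len(samples)
    n_total = sum(len(s) for s in samples)
    grand_mean = sum(sum(s) for s in samples) / n_total

    # Between-sample estimate: how far each sample mean falls from the grand
    # mean, weighted by sample size (mean square between groups).
    between = sum(len(s) * (sum(s) / len(s) - grand_mean) ** 2
                  for s in samples) / (k - 1)

    # Pooled within-sample estimate (mean square within groups).
    within = sum(sum((x - sum(s) / len(s)) ** 2 for x in s)
                 for s in samples) / (n_total - k)

    return between / within

# Hypothetical target values in the two children of a proposed split.
left_child  = [12.0, 14.5, 13.2, 15.1, 12.8]
right_child = [21.3, 19.8, 22.4, 20.6, 21.9]
print(f_score([left_child, right_child]))   # large score: the distributions differ
```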

Pruning

As previously described, the decision tree keeps growing as long as new splits can be found that improve the ability of the tree to separate the records of the training set into increasingly pure subsets. Such a tree has been optimized for the training set, so eliminating any leaves would only increase the error rate of the tree on the training set. Does this imply that the full tree will also do the best job of classifying new datasets? Certainly not!

A decision tree algorithm makes its best split first, at the root node where there is a large population of records. As the nodes get smaller, idiosyncrasies of the particular training records at a node come to dominate the process. One way to think of this is that the tree finds general patterns at the big nodes and patterns specific to the training set in the smaller nodes; that is, the tree overfits the training set. The result is an unstable tree that will not make good predictions. The cure is to eliminate the unstable splits by merging smaller leaves through a process called pruning; three general approaches to pruning are discussed in detail.
