Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management

The fact that trees can be (and sometimes are) used to estimate continuous values does not make doing so a good idea. A decision tree estimator can generate only as many discrete values as there are leaves in the tree. To estimate a continuous variable, it is preferable to use a continuous function. Regression models and neural network models are generally more appropriate for estimation.
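
To make the limitation concrete, here is a minimal sketch using scikit-learn (a library not discussed in the text; the data and parameters are invented for illustration). A regression tree fit to a smooth curve can produce at most one distinct estimate per leaf, so its predictions form a step function rather than a continuous one.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# A smooth, continuous target: y = sin(x) on [0, 10].
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.sin(X).ravel()

# A regression tree limited to 8 leaves can emit at most 8 distinct estimates.
tree = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0).fit(X, y)
predictions = tree.predict(X)

print("leaves:", tree.get_n_leaves())                       # 8
print("distinct estimates:", len(np.unique(predictions)))   # at most 8
```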

Trees Grow in Many Forms

The tree in Figure 6.1 is a binary tree of nonuniform depth; that is, each nonleaf node has two children and leaves are not all at the same distance from the root.

In this case, each node represents a yes-or-no question, whose answer determines by which of two paths a record proceeds to the next level of the tree. Since any multiway split can be expressed as a series of binary splits, there is no real need for trees with higher branching factors. Nevertheless, many data mining tools are capable of producing trees with more than two branches. For example, some decision tree algorithms split on categorical variables by creating a branch for each class, leading to trees with differing numbers of branches at different nodes. Figure 6.3 illustrates a tree that uses both three-way and two-way splits for the same classification problem as the tree in Figures 6.1 and 6.2.
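
As a small illustration of the point that a multiway split can always be rewritten as a chain of binary splits, the sketch below routes records through a three-way split on a numeric field and through an equivalent pair of yes-or-no questions; the field name and cutoffs are invented for the example.

```python
def route_three_way(tot_units_demand):
    """One ternary split: three branches from a single node."""
    if tot_units_demand < 4.5:
        return "left"
    elif tot_units_demand < 15.5:
        return "middle"
    else:
        return "right"

def route_binary(tot_units_demand):
    """The same partition as a chain of two binary (yes/no) splits."""
    if tot_units_demand < 4.5:      # first yes/no question
        return "left"
    if tot_units_demand < 15.5:     # second question, asked only of the remainder
        return "middle"
    return "right"

# Both routings assign every record to the same branch.
assert all(route_three_way(v) == route_binary(v) for v in (1, 4.5, 10, 15.5, 40))
```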

Figure 6.3 This ternary decision tree is applied to the same classification problem as in Figure 6.1.

TIP: There is no relationship between the number of branches allowed at a node and the number of classes in the target variable. A binary tree (that is, one with two-way splits) can be used to classify records into any number of categories, and a tree with multiway splits can be used to classify a binary target variable.

How a Decision Tree Is Grown

Although there are many variations on the core decision tree algorithm, all of them share the same basic procedure: Repeatedly split the data into smaller and smaller groups in such a way that each new generation of nodes has greater purity than its ancestors with respect to the target variable. For most of this discussion, we assume a binary, categorical target variable, such as responder/nonresponder. This simplifies the explanations without much loss of generality.

Finding the Splits

At the start of the process, there is a training set consisting of preclassified records—that is, the value of the target variable is known for all cases. The goal is to build a tree that assigns a class (or a likelihood of membership in each class) to the target field of a new record based on the values of the input variables.

The tree is built by splitting the records at each node according to a function of a single input field. The first task, therefore, is to decide which of the input fields makes the best split. The best split is defined as one that does the best job of separating the records into groups where a single class predominates in each group. The measure used to evaluate a potential split is purity. The next section talks about specific methods for calculating purity in more detail. However, they are all trying to achieve the same effect. With all of them, low purity means that the set contains a representative distribution of classes (relative to the parent node), while high purity means that members of a single class predominate. The best split is the one that increases the purity of the record sets by the greatest amount. A good split also creates nodes of similar size, or at least does not create nodes containing very few records.

These ideas are easy to see visually. Figure 6.4 illustrates some good and bad splits.

Figure 6.4 A good split increases purity for all the children.

The first split is a poor one because there is no increase in purity. The initial population contains equal numbers of the two sorts of dot; after the split, so does each child. The second split is also poor because, although purity is increased slightly, the pure node has few members and the purity of the larger child is only marginally better than that of the parent. The final split is a good one because it leads to children of roughly the same size and with much higher purity than the parent.
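
The following sketch quantifies the three splits just described, using the Gini index, one of the purity measures taken up later in the chapter. The class counts are invented for illustration; 0 means a perfectly pure node and 0.5 means an even mix of the two classes.

```python
def gini(counts):
    """Gini impurity of a node: 1 - sum(p_i^2). 0 is pure; 0.5 is an even two-class mix."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def split_impurity(children):
    """Impurity after a split: the size-weighted average impurity of the children."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * gini(child) for child in children)

parent = [50, 50]                      # 50 dots of each sort: impurity 0.50
no_gain = [[25, 25], [25, 25]]         # first poor split: each child is still 50/50
tiny_pure = [[4, 0], [46, 50]]         # second poor split: small pure node, big node barely better
good = [[45, 5], [5, 45]]              # good split: similar sizes, much purer children

print("parent:", gini(parent))                       # 0.50
print("no gain:", split_impurity(no_gain))           # 0.50  (no improvement)
print("tiny pure node:", split_impurity(tiny_pure))  # ~0.48 (marginal improvement)
print("good split:", split_impurity(good))           # 0.18  (large improvement)
```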

Tree-building algorithms are exhaustive. They proceed by taking each input variable in turn and measuring the increase in purity that results from every split suggested by that variable. After trying all the input variables, the one that yields the best split is used for the initial split, creating two or more children. If no split is possible (because there are too few records) or if no split makes an improvement, then the algorithm is finished with that node and the node becomes a leaf node. Otherwise, the algorithm performs the split and repeats itself on each of the children. An algorithm that repeats itself in this way is called a recursive algorithm.

Splits are evaluated based on their effect on node purity in terms of the target variable. This means that the choice of an appropriate splitting criterion depends on the type of the target variable, not on the type of the input variable. With a categorical target variable, a test such as Gini, information gain, or chi-square is appropriate whether the input variable providing the split is numeric or categorical. Similarly, with a continuous, numeric target variable, a test such as variance reduction or the F-test is appropriate for evaluating the split regardless of whether the input variable providing the split is categorical or numeric.

Splitting on a Numeric Input Variable

When searching for a binary split on a numeric input variable, each value that the variable takes on in the training set is treated as a candidate value for the split. Splits on a numeric variable take the form X < N. All records where the value of X (the splitting variable) is less than some constant N are sent to one child and all records where the value of X is greater than or equal to N are sent to the other. After each trial split, the increase in purity, if any, due to the split is measured. In the interests of efficiency, some implementations of the splitting algorithm do not actually evaluate every value; they evaluate a sample of the values instead.

When the decision tree is scored, the only use that it makes of numeric inputs is to compare their values with the split points. They are never multiplied by weights or added together as they are in many other types of models. This has the important consequence that decision trees are not sensitive to outliers or skewed distributions of numeric variables, because the tree uses only the rank of numeric variables and not their absolute values.

Splitting on a Categorical Input Variable

The simplest algorithm for splitting on a categorical input variable is simply to create a new branch for each class that the categorical variable can take on. So, if color is chosen as the best field on which to split the root node, and the training set includes records that take on the values red, orange, yellow, green, blue, indigo, and violet, then there will be seven nodes in the next level of the tree. This approach is actually used by some software packages, but it often yields poor results. High branching factors quickly reduce the population of training records available at each node in lower levels of the tree, making further splits less reliable.

A more common approach is to group together the classes that, taken individually, predict similar outcomes. More precisely, if two classes of the input variable yield distributions of the classes of the output variable that do not differ significantly from one another, the two classes can be merged. The usual test for whether the distributions differ significantly is the chi-square test.
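
To make the merging step concrete, here is a minimal sketch (not the exact procedure of any particular tool) that repeatedly merges the pair of input classes whose responder/nonresponder distributions look most alike, stopping when every remaining pair differs significantly by a chi-square test. The counts, the significance threshold, and the use of SciPy's chi2_contingency are all illustrative assumptions.

```python
from itertools import combinations
from scipy.stats import chi2_contingency

# Responder / nonresponder counts for each class of a categorical input.
counts = {
    "red":    [30, 70],
    "orange": [28, 72],
    "yellow": [55, 45],
    "green":  [52, 48],
}

def merge_similar_classes(counts, alpha=0.05):
    """Greedily merge the pair of classes whose target distributions a
    chi-square test cannot tell apart, until all remaining pairs differ."""
    groups = {name: list(dist) for name, dist in counts.items()}
    while len(groups) > 1:
        best_pair, best_p = None, alpha
        for a, b in combinations(groups, 2):
            _, p_value, _, _ = chi2_contingency([groups[a], groups[b]])
            if p_value > best_p:          # cannot reject "same distribution"
                best_pair, best_p = (a, b), p_value
        if best_pair is None:             # every remaining pair differs significantly
            return groups
        a, b = best_pair
        merged = [x + y for x, y in zip(groups.pop(a), groups.pop(b))]
        groups[a + "+" + b] = merged
    return groups

print(merge_similar_classes(counts))
# With these counts: {'red+orange': [58, 142], 'yellow+green': [107, 93]}
```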

Splitting in the Presence of Missing Values

One of the nicest things about decision trees is their ability to handle missing values in either numeric or categorical input fields by simply considering null to be a possible value with its own branch. This approach is preferable to throwing out records with missing values or trying to impute missing values.
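
A minimal sketch of that idea, with an invented field name and split point: records where the field is null go down a branch of their own instead of being discarded or imputed.

```python
def route(record, field="tot_units_demand", split_value=4.5):
    """Send a record to one of three children; null gets its own branch."""
    value = record.get(field)
    if value is None:
        return "missing"                  # dedicated branch for null values
    return "low" if value < split_value else "high"

print(route({"tot_units_demand": 2.0}))   # low
print(route({"tot_units_demand": 20.0}))  # high
print(route({}))                          # missing
```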
