This can make it difficult to set up experiments, for several reasons:
■■ Businesses may not be willing to invest in efforts that reduce short-term gain for long-term learning.
■■ Business processes may interfere with well-designed experimental methodologies.
■■ Factors that may affect the outcome of the experiment may not be obvious.
■■ Timing plays a critical role and may render results useless.
Of these, the first two are the most difficult. The first simply means that tests do not get done, or that they are done so poorly the results are useless. The second poses the problem that a seemingly well-designed experiment may not be executed correctly. There are always hitches when planning a test; sometimes these hitches make it impossible to read the results.
Data Is Censored and Truncated
The data used for data mining is often incomplete, in one of two special ways.
Censored values are incomplete because whatever is being measured has not yet finished. One example is customer tenure. For active customers, we know only that the eventual tenure is at least the current tenure; we do not know which customers are going to stop tomorrow and which are going to stop 10 years from now. The actual tenure is greater than the observed value and cannot be known until the customer actually stops at some unknown point in the future.
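To make the distinction concrete, here is a minimal sketch in Python, with entirely invented customer records (the field names are hypothetical), showing why censored tenures cannot simply be averaged: every censored value is only a lower bound on the true tenure.

```python
# Invented customer records: observed tenure in days plus a flag saying
# whether the observation is censored (customer still active, so the
# true tenure is longer than what we see).
customers = [
    {"tenure": 100, "censored": False},   # stopped at day 100
    {"tenure": 250, "censored": False},   # stopped at day 250
    {"tenure": 400, "censored": True},    # still active: tenure > 400
    {"tenure": 30,  "censored": True},    # still active: tenure > 30
]

# A naive average treats censored tenures as if they were complete...
naive_avg = sum(c["tenure"] for c in customers) / len(customers)
print(f"naive average tenure: {naive_avg:.0f} days (biased low)")

# ...but every censored value is only a lower bound on the real tenure.
for c in customers:
    bound = ">" if c["censored"] else "="
    print(f"  true tenure {bound} {c['tenure']} days")
```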
[Figure 5.11 plot: y-axis Inventory Units (0 to 50), x-axis Time (0 to 40); series labeled Demand, Stock, Sell-Out, and Lost Sales.]
Figure 5.11 A time series of product sales and inventory illustrates the problem of censored data.
Figure 5.11 shows another situation with the same result. This curve shows sales and inventory for a retailer for one product. Sales are always less than or equal to the inventory. On the days marked as sell-outs, though, the inventory sold out completely. What were the potential sales on those days? The potential sales are greater than or equal to the observed sales—another example of censored data.
Truncated data poses another problem in terms of biasing samples. Truncated data is not included in databases, often because it is too old. For instance, when Company A purchases Company B, their systems are merged. Often, the active customers from Company B are moved into the data warehouse for Company A. That is, all customers active on a given date are moved over.
Customers who had stopped the day before are not moved over. This is an example of left truncation, and it pops up throughout corporate databases, usually with no warning (unless the documentation is very good about saying what is not in the warehouse as well as what is). This can cause confusion when looking at when customers started—and discovering that all customers who started 5 years before the merger were mysteriously active for at least 5 years. This is not due to a miraculous acquisition program. This is because all the ones who stopped earlier were excluded.
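A small simulation makes the bias visible. The sketch below uses only the Python standard library and entirely invented numbers (the merger date, customer count, and mean tenure are all assumptions) to mimic the migration just described.

```python
import random

random.seed(1)

MERGER_DAY = 3650           # hypothetical: merger happens about 10 years in
FIVE_YEARS = 5 * 365

# Simulate Company B's full customer base: a random start day and a
# random tenure (exponential, mean about one year) for each customer.
population = []
for _ in range(10_000):
    start = random.randint(0, MERGER_DAY)
    tenure = random.expovariate(1 / 365)
    population.append((start, tenure))

# Left truncation: only customers still active on the merger day migrate.
migrated = [(s, t) for (s, t) in population if s + t >= MERGER_DAY]

# Among migrated customers who started 5+ years before the merger...
early = [t for (s, t) in migrated if s <= MERGER_DAY - FIVE_YEARS]
print(f"migrated customers who started 5+ years pre-merger: {len(early)}")
if early:
    print(f"minimum tenure among them: {min(early) / 365:.1f} years")
# ...every single one shows a tenure of at least five years. Not a
# miraculous acquisition program: everyone who stopped sooner was
# simply never copied into the warehouse.
```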
Lessons Learned
This chapter talks about some basic statistical methods that are useful for analyzing data. When looking at data, it is useful to look at histograms and cumulative histograms to see what values are most common. More important, though, is looking at values over time.
One of the big questions addressed by statistics is whether observed values are expected or not. For this, the number of standard deviations from the mean (z-score) can be used to calculate the probability of the value being due to chance (the p-value). A high p-value means the data is consistent with the null hypothesis; that is, nothing interesting is happening. A low p-value suggests that other factors may be influencing the results. Converting z-scores to p-values depends on the normal distribution.
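As a concrete illustration, the following sketch converts z-scores into two-tailed p-values. It assumes scipy is available; any normal-distribution table gives the same numbers.

```python
from scipy.stats import norm

def z_to_p(z: float) -> float:
    """Two-tailed p-value: the probability of landing at least this many
    standard deviations from the mean if only chance is at work."""
    return 2 * norm.sf(abs(z))   # sf is the upper-tail area, 1 - CDF

for z in (1.0, 2.0, 3.0):
    print(f"z = {z:.1f}  ->  p = {z_to_p(z):.4f}")
# z = 1 gives p ~ 0.32, z = 2 gives p ~ 0.046, z = 3 gives p ~ 0.003:
# the farther an observation falls from the mean, the less plausible
# "just chance" becomes.
```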
Business problems often require analyzing data expressed as proportions.
Fortunately, these behave similarly to normal distributions. The formula for the standard error for proportions (SEP) makes it possible to define a confidence interval on a proportion such as a response rate. The standard error for the difference of proportions (SEDP) makes it possible to determine whether two values are similar. This works by defining a confidence interval for the difference between two values.
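Here is a minimal sketch of both calculations, using the standard formulas SEP = sqrt(p(1 - p)/n) and SEDP = sqrt(p1(1 - p1)/n1 + p2(1 - p2)/n2), with invented response rates and mailing sizes:

```python
from math import sqrt

def sep(p: float, n: int) -> float:
    """Standard error of a proportion such as a response rate."""
    return sqrt(p * (1 - p) / n)

def sedp(p1: float, n1: int, p2: float, n2: int) -> float:
    """Standard error of the difference of two proportions."""
    return sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Hypothetical: a 5% response rate measured on 10,000 mail pieces.
p, n = 0.05, 10_000
half_width = 1.96 * sep(p, n)             # 95% confidence interval
print(f"response rate: {p:.1%} +/- {half_width:.2%}")

# Are 5.0% and 5.6% response rates (10,000 pieces each) really different?
diff = 0.056 - 0.050
z = diff / sedp(0.050, 10_000, 0.056, 10_000)
print(f"difference = {diff:.1%}, z = {z:.2f}")
# z ~ 1.89 is just under the 1.96 cutoff, so at 95% confidence the
# difference is borderline: the interval around it barely includes zero.
```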
When designing marketing tests, the SEP and SEDP can be used for sizing test and control groups. In particular, these groups should be large enough to
measure differences in response with a high enough confidence. Tests that have more than two groups need to take into account an adjustment, called Bonferroni’s correction, when setting the group sizes.
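The sketch below shows the flavor of such a sizing calculation. It is a simplified version, not the chapter's exact procedure: it sizes equal groups so that the smallest difference of interest spans the critical number of standard errors, then shows how Bonferroni's correction inflates the required size. scipy is assumed for the critical z-value.

```python
from math import ceil
from scipy.stats import norm

def group_size(p: float, min_diff: float, alpha: float = 0.05) -> int:
    """Per-group size so that a difference of min_diff between two
    proportions near p is detectable at significance level alpha.
    (A simplified sizing sketch, not the chapter's exact procedure.)"""
    z = norm.ppf(1 - alpha / 2)           # two-tailed critical z-value
    # Require z * SEDP <= min_diff with two equal groups of size n:
    # z * sqrt(2 * p * (1 - p) / n) <= min_diff, solved for n.
    return ceil(2 * p * (1 - p) * (z / min_diff) ** 2)

p, min_diff = 0.05, 0.005        # hypothetical: 5% base rate, detect 0.5%
n_plain = group_size(p, min_diff)

# With, say, four groups there are six pairwise comparisons. Bonferroni's
# correction divides alpha by the number of comparisons, so each test runs
# at a stricter level and the groups must grow accordingly.
n_adjusted = group_size(p, min_diff, alpha=0.05 / 6)

print(f"per-group size, two groups:           {n_plain:,}")
print(f"per-group size, Bonferroni-adjusted:  {n_adjusted:,}")
```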
The chi-square test is another statistical method that is often useful. This method calculates expected values for data laid out in rows and columns, based on the row and column totals. Comparing the observed counts to these expected values, the chi-square test can determine whether the results are likely or unlikely to be due to chance. As shown in an example, the chi-square test and SEDP methods produce similar results.
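As a hedged illustration (scipy assumed, counts invented to match the proportions example above):

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table: responders and non-responders in two groups,
# chosen to match the proportions example above (5.0% vs. 5.6%).
observed = [[500, 9_500],    # group A: 5.0% response
            [560, 9_440]]    # group B: 5.6% response

# correction=False skips Yates' continuity adjustment so the result
# matches the plain chi-square formula: sum over cells of
# (observed - expected)^2 / expected, with expected values computed
# from the row and column totals.
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.3f}")
# chi-square ~ 3.59 and p ~ 0.058 here; note sqrt(3.59) ~ 1.89, the
# same z-score the SEDP calculation gives -- the two methods agree.
```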
Statisticians and data miners solve similar problems. However, because of historical differences and differences in the nature of the problems, there are some differences in approaches. Data miners generally have lots and lots of data with few measurement errors. This data changes over time, and values are sometimes incomplete. The data miner has to be particularly suspicious about bias introduced into the data by business processes.
The next eight chapters dive into more detail on more modern techniques for building models and understanding data. Many of these techniques have been adopted by statisticians and build on over a century of work in this area.
Chapter 6
Decision Trees
Decision trees are powerful and popular for both classification and prediction.
The attractiveness of tree-based methods is due largely to the fact that decision trees represent rules. Rules can readily be expressed in English so that we humans can understand them; they can also be expressed in a database access language such as SQL to retrieve records in a particular category. Decision trees are also useful for exploring data to gain insight into the relationships of a large number of candidate input variables to a target variable. Because decision trees combine both data exploration and modeling, they are a powerful first step in the modeling process even when building the final model using some other technique.
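To see the rule representation concretely, here is a sketch using scikit-learn, which is an assumption on our part; the chapter itself is tool-agnostic. It fits a shallow tree on a toy dataset and prints the resulting rules as readable if-then text, which translates directly into SQL predicates.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a small tree on a toy dataset (iris stands in for business data).
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# The fitted tree *is* a set of rules: export_text renders every path
# from the root to a leaf as nested if-then conditions a human can read.
print(export_text(tree, feature_names=list(iris.feature_names)))

# Each root-to-leaf path also translates directly into a SQL predicate
# of the form
#   WHERE <feature> <= <threshold> AND <other feature> > <threshold> ...
# selecting exactly the records that fall into that leaf.
```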
There is often a trade-off between model accuracy and model transparency.
In some applications, the accuracy of a classification or prediction is the only thing that matters; if a direct mail firm obtains a model that can accurately predict which members of a prospect pool are most likely to respond to a certain solicitation, the firm may not care how or why the model works. In other situations, the ability to explain the reason for a decision is crucial. In insurance underwriting, for example, there are legal prohibitions against discrimination based on certain variables. An insurance company could find itself in the position of having to demonstrate to a court of law that it has not used illegal discriminatory practices in granting or denying coverage. Similarly, it is more acceptable to both the loan officer and the credit applicant to hear that an application for credit has been denied on the basis of a computer-generated
rule (such as income below some threshold and number of existing revolving accounts greater than some other threshold) than to hear that the decision has been made by a neural network that provides no explanation for its action.
This chapter begins with an examination of what decision trees are, how they work, and how they can be applied to classification and prediction problems. It then describes the core algorithm used to build decision trees and discusses some of the most popular variants of that core algorithm. Practical examples drawn from the authors’ experience are used to demonstrate the utility and general applicability of decision tree models and to illustrate practical considerations that must be taken into account.
What Is a Decision Tree?
A decision tree is a structure that can be used to divide up a large collection of records into successively smaller sets of records by applying a sequence of simple decision rules. With each successive division, the members of the resulting sets become more and more similar to one another. The familiar division of living things into kingdoms, phyla, classes, orders, families, genera, and species, invented by the Swedish botanist Carl Linnaeus in the 1730s, provides a good example. Within the animal kingdom, a particular animal is assigned to the phylum chordata if it has a spinal cord. Additional characteristics are used to further subdivide the chordates into the birds, mammals, reptiles, and so on. These classes are further subdivided until, at the lowest level in the taxonomy, members of the same species are not only morphologically similar, they are capable of breeding and producing fertile offspring.