The usual way of solving this problem is to sum the squares of the differences rather than the differences themselves. The average of the squared differences is called the variance. The estimates in this table have a variance of 10.
((-5)² + 2² + (-2)² + 1² + 4²) / 5 = (25 + 4 + 4 + 1 + 16) / 5 = 50/5 = 10
The smaller the variance, the more accurate the estimate. A drawback to variance as a measure is that it is not expressed in the same units as the estimates themselves. For estimated prices in dollars, it is more useful to know how far off the estimates are in dollars rather than square dollars! For that reason, it is usual to take the square root of the variance to get a measure called the standard deviation. The standard deviation of these estimates is the square root of 10 or about 3.16. For our purposes, all you need to know about the standard deviation is that it is a measure of how widely the estimated values vary from the true values.
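For readers who want the arithmetic spelled out, here is a minimal sketch in Python of the same calculation, using the five errors from Table 3.1. The variable names are ours, not part of any particular tool.

```python
import math

# Errors from Table 3.1 (true value minus estimated value)
errors = [-5, 2, -2, 1, 4]

# The plain average of the errors is misleadingly close to zero
mean_error = sum(errors) / len(errors)                 # 0.0

# Variance: the average of the squared errors
variance = sum(e ** 2 for e in errors) / len(errors)   # 10.0

# Standard deviation: the square root of the variance,
# expressed in the same units (dollars) as the estimates
std_dev = math.sqrt(variance)                          # about 3.16

print(mean_error, variance, std_dev)
```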
Comparing Models Using Lift
Directed models, whether created using neural networks, decision trees, genetic algorithms, or Ouija boards, are all created to accomplish some task.
Why not judge them on their ability to classify, estimate, and predict? The most common way to compare the performance of classification models is to use a ratio called lift. This measure can be adapted to compare models designed for other tasks as well. What lift actually measures is the change in concentration of a particular class when the model is used to select a group from the general population.
lift = P(class | sample) / P(class | population)
Table 3.1 Countervailing Errors

TRUE VALUE    ESTIMATED VALUE    ERROR
127           132                -5
78            76                 2
120           122                -2
130           129                1
95            91                 4
An example helps to explain this. Suppose that we are building a model to predict who is likely to respond to a direct mail solicitation. As usual, we build the model using a preclassified training dataset and, if necessary, a preclassified validation set as well. Now we are ready to use the test set to calculate the model’s lift.
The classifier scores the records in the test set as either “predicted to respond” or “not predicted to respond.” Of course, it is not correct every time, but if the model is any good at all, the group of records marked “predicted to respond” contains a higher proportion of actual responders than the test set as a whole. That group is the sample in the lift formula. If the test set contains 5 percent actual responders and the sample contains 50 percent actual responders, the model provides a lift of 10 (50 divided by 5).
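As an illustration only, the lift calculation can be written in a few lines of Python. The function name and the made-up data below are hypothetical, not part of any tool described here.

```python
def lift(selected, population):
    """Lift = P(responder | selected) / P(responder | population).

    Both arguments are lists of 1s (responders) and 0s (non-responders).
    """
    p_selected = sum(selected) / len(selected)
    p_population = sum(population) / len(population)
    return p_selected / p_population

# A test set with 5 percent actual responders...
population = [1] * 50 + [0] * 950
# ...and a group marked "predicted to respond" that is 50 percent responders
selected = [1] * 20 + [0] * 20

print(lift(selected, population))   # 10.0  (50 percent divided by 5 percent)
```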
Is the model that produces the highest lift necessarily the best model? Surely a list of people half of whom will respond is preferable to a list where only a quarter will respond, right? Not necessarily—not if the first list has only 10
names on it!
The point is that lift is a function of sample size. If the classifier only picks out 10 likely respondents, and it is right 100 percent of the time, it will achieve a lift of 20—the highest lift possible when the population contains 5 percent responders. As the confidence level required to classify someone as likely to respond is relaxed, the mailing list gets longer, and the lift decreases.
Charts like the one in Figure 3.13 will become very familiar as you work with data mining tools. It is created by sorting all the prospects according to their likelihood of responding as predicted by the model. As the size of the mailing list increases, we reach farther and farther down the list. The X-axis shows the percentage of the population getting our mailing. The Y-axis shows the percentage of all responders we reach.
If no model were used, mailing to 10 percent of the population would reach 10 percent of the responders, mailing to 50 percent of the population would reach 50 percent of the responders, and mailing to everyone would reach all the responders. This mass-mailing approach is illustrated by the line slanting upwards. The other curve shows what happens if the model is used to select recipients for the mailing. The model finds 20 percent of the responders by mailing to only 10 percent of the population. Soliciting half the population reaches over 70 percent of the responders.
Charts like the one in Figure 3.13 are often referred to as lift charts, although what is really being graphed is cumulative response or concentration. Figure 3.14 shows the actual lift chart corresponding to the cumulative response chart in Figure 3.13. The chart shows clearly that lift decreases as the size of the target list increases.
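The following sketch shows one way such a chart can be computed from a scored test set: sort the records by model score, walk down the list one decile at a time, and record the cumulative percentage of responders reached along with the corresponding lift. The function and field names are illustrative, not taken from any particular tool.

```python
def cumulative_response_and_lift(records, deciles=10):
    """Report, for each decile of the population contacted, the cumulative
    percentage of responders reached and the corresponding lift.

    `records` is a list of (score, responded) pairs, responded being 0 or 1.
    """
    ranked = sorted(records, key=lambda r: r[0], reverse=True)
    total_responders = sum(responded for _, responded in ranked)
    results = []
    for d in range(1, deciles + 1):
        cutoff = int(len(ranked) * d / deciles)
        reached = sum(responded for _, responded in ranked[:cutoff])
        pct_population = 100 * d / deciles
        pct_responders = 100 * reached / total_responders
        # Mass mailing reaches responders in proportion to the population,
        # so its lift is 1 at every decile.
        results.append((pct_population, pct_responders,
                        pct_responders / pct_population))
    return results
```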
Figure 3.13 Cumulative response for targeted mailing compared with mass mailing. (Y-axis: percentage of responders captured; X-axis: percentile of the population contacted.)
Problems with Lift
Lift solves the problem of how to compare the performance of models of different kinds, but it is still not powerful enough to answer the most important questions: Is the model worth the time, effort, and money it cost to build it?
Will mailing to a segment where lift is 3 result in a profitable campaign?
These kinds of questions cannot be answered without more knowledge of the business context, which is needed to build costs and revenues into the calculation.
Still, lift is a very handy tool for comparing the performance of two models applied to the same or comparable data. Note that the performance of two models can only be compared using lift when the test sets have the same density of the outcome.
Figure 3.14 A lift chart starts high and then goes to 1. (Y-axis: lift value, from 1 to 1.5; X-axis: percentile of the population contacted.)
Step Nine: Deploy Models
Deploying a model means moving it from the data mining environment to the scoring environment. This process may be easy or hard. In the worst case (and we have seen this at more than one company), the model is developed in a special modeling environment using software that runs nowhere else. To deploy the model, a programmer takes a printed description of the model and recodes it in another programming language so it can be run on the scoring platform.
A more common problem is that the model uses input variables that are not in the original data. This should not be a problem, since the model inputs are at least derived from the fields that were originally extracted to form the model set. Unfortunately, data miners are not always good about keeping a clean, reusable record of the transformations they applied to the data.
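One simple discipline that helps is to record every derived variable as a named transformation that can be replayed, unchanged, at scoring time. The sketch below illustrates the idea; the field names and transformations are hypothetical.

```python
# Each derived variable is a named function applied to one customer record,
# so the same code runs when building the model set and when scoring.

def balance_to_limit_ratio(row):
    return row["balance"] / row["credit_limit"] if row["credit_limit"] else 0.0

def months_since_last_purchase(row):
    return row["current_month"] - row["last_purchase_month"]

DERIVED_VARIABLES = {
    "balance_ratio": balance_to_limit_ratio,
    "recency": months_since_last_purchase,
}

def add_derived_variables(row):
    """Apply every recorded transformation to one customer record."""
    row = dict(row)
    for name, transform in DERIVED_VARIABLES.items():
        row[name] = transform(row)
    return row
```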
The challenge in deploying data mining models is that they are often used to score very large datasets. In some environments, every one of millions of customer records is updated with a new behavior score every day. A score is simply an additional field in a database table. Scores often represent a probability or likelihood, so they are typically numeric values between 0 and 1, but by no
means necessarily so. A score might also be a class label provided by a clustering model, for instance, or a class label with a probability.
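A minimal sketch of what batch scoring looks like is shown below. The model object and field name are hypothetical; in practice the scoring usually happens inside the database or on a dedicated scoring platform rather than in a loop like this.

```python
def score_customers(customers, model):
    """Add a score field to every customer record in a batch."""
    for customer in customers:
        # The score may be a probability between 0 and 1, a class label,
        # or both, depending on the kind of model.
        customer["score"] = model.predict(customer)
    return customers
```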
Step Ten: Assess Results
The response chart in Figure 3.13 compares the number of responders reached for a given amount of postage, with and without the use of a predictive model.
A more useful chart would show how many dollars are brought in for a given expenditure on the marketing campaign. After all, if developing the model is very expensive, a mass mailing may be more cost-effective than a targeted one.
Turning the response chart into such a comparison requires a few additional numbers:

■■ What is the fixed cost of setting up the campaign and the model that supports it?

■■ What is the cost per recipient of making the offer?

■■ What is the cost per respondent of fulfilling the offer?

■■ What is the value of a positive response?
Plugging these numbers into a spreadsheet makes it possible to measure the impact of the model in dollars. The cumulative response chart can then be turned into a cumulative profit chart, which determines where the sorted mailing list should be cut off. If, for example, there is a high fixed cost of setting up the campaign and also a fairly high cost per recipient of making the offer (as when a wireless company buys loyalty by giving away mobile phones or waiving renewal fees), the company loses money by going after too few prospects, because there are not enough respondents to make up for the high fixed costs of the program. On the other hand, if it makes the offer to too many people, high variable costs begin to hurt.
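The spreadsheet arithmetic is simple enough to sketch in a few lines of Python. The cost and revenue figures in the commented-out example are purely illustrative.

```python
def cumulative_profit(cumulative_counts, fixed_cost, cost_per_recipient,
                      cost_per_respondent, value_per_response):
    """Turn a cumulative response curve into a cumulative profit curve.

    `cumulative_counts` is a list of (recipients_mailed, responders_reached)
    pairs, one per candidate cutoff point on the sorted mailing list.
    """
    profits = []
    for recipients, responders in cumulative_counts:
        revenue = responders * value_per_response
        cost = (fixed_cost
                + recipients * cost_per_recipient
                + responders * cost_per_respondent)
        profits.append((recipients, revenue - cost))
    return profits

# The cutoff for the mailing list is the point with the highest profit, e.g.:
# best = max(cumulative_profit(counts, 50_000, 1.0, 20.0, 120.0),
#            key=lambda point: point[1])
```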