■■ The party that does not hold the presidency picks up seats in Congress during off-year elections.
■■ When the American League wins the World Series, Republicans take the White House.
■■ When the Washington Redskins win their last home game, the incumbent party keeps the White House.
■■ In U.S. presidential contests, the taller man usually wins.
The first pattern (the one involving off-year elections) seems explainable in purely political terms. Because there is an underlying explanation, this pattern seems likely to continue into the future and therefore has predictive value. The next two alleged predictors, the ones involving sporting events, seem just as clearly to have no predictive value. No matter how many times Republicans and the American League may have shared victories in the past (and the authors have not researched this point), there is no reason to expect the association to continue in the future.
What about candidates’ heights? At least since 1948, when Truman (who was short, but taller than Dewey) was elected, the election in which Carter beat
Ford is the only one where the shorter candidate won. (So long as “winning” is defined as “receiving the most votes,” the 2000 election, which pitted the 6’1” Gore against the 6’0” Bush, still fits the pattern.) Height does not seem to have anything to do with the job of being president. On the other hand, height is positively correlated with income and other social marks of success, so, consciously or unconsciously, voters may perceive a taller candidate as more presidential. As this chapter explains, the right way to decide whether a rule is stable and predictive is to compare its performance on multiple samples selected at random from the same population. In the case of presidential height, we leave this as an exercise for the reader. As is often the case, the hardest part of the task will be collecting the data—even in the age of Google, it is not easy to locate the heights of unsuccessful presidential candidates from the eighteenth, nineteenth, and twentieth centuries!
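The suggested test, checking a rule’s performance on repeated random samples from the same population, can be sketched in a few lines of Python. The outcome list below is fabricated for illustration; it is not real election data:

```python
import random

random.seed(0)

# Fabricated 0/1 outcomes: 1 means the taller candidate won (NOT real data)
population = [1] * 12 + [0] * 3

def rule_accuracy(sample):
    """Fraction of elections in the sample where 'taller wins' held."""
    return sum(sample) / len(sample)

# A stable rule should score similarly on repeated random samples
scores = [rule_accuracy(random.choices(population, k=10)) for _ in range(5)]
print(scores)
```

If the scores swing wildly from sample to sample, the rule is an artifact of the particular data at hand rather than a stable pattern.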
The technical term for finding patterns that fail to generalize is overfitting.
Overfitting leads to unstable models that work one day, but not the next.
Building stable models is the primary goal of the data mining methodology.
The Model Set May Not Reflect the Relevant Population
The model set is the collection of historical data that is used to develop data mining models. For inferences drawn from the model set to be valid, the model set must reflect the population that the model is meant to describe, classify, or score. A sample that does not properly reflect its parent population is biased. Using a biased sample as a model set is a recipe for learning things that are not true. It is also hard to avoid. Consider:
■■ Customers are not like prospects.
■■ Survey responders are not like nonresponders.
■■ People who read email are not like people who do not read email.
■■ People who register on a Web site are not like people who fail to register.
■■ After an acquisition, customers from the acquired company are not necessarily like customers from the acquirer.
■■ Records with no missing values reflect a different population from records with missing values.
Customers are not like prospects because they represent people who responded positively to whatever messages, offers, and promotions were made to attract customers in the past. A study of current customers is likely to suggest more of the same. If past campaigns have gone after wealthy, urban consumers, then any comparison of current customers with the general population will likely show that customers tend to be wealthy and urban. Such a model may miss opportunities in middle-income suburbs. The consequences of using a biased sample can be worse than simply a missed marketing opportunity.
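One quick check for this kind of bias is to compare the segment mix of the model set with the mix in the broader prospect population. The proportions below are invented for illustration:

```python
# Invented segment shares for current customers vs. the prospect population
customers = {"wealthy_urban": 0.60, "middle_suburban": 0.25, "other": 0.15}
population = {"wealthy_urban": 0.20, "middle_suburban": 0.45, "other": 0.35}

# A lift-style ratio flags segments the model set over- or under-represents
for segment, share in customers.items():
    ratio = share / population[segment]
    print(f"{segment}: {ratio:.2f}")
# wealthy_urban shows a ratio of 3.00: heavily over-represented, so a model
# trained on current customers will tend to recommend more of the same
```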
In the United States, there is a history of “redlining,” the illegal practice of refusing to write loans or insurance policies in certain neighborhoods. A search for patterns in the historical data from a company that had a history of redlining would reveal that people in certain neighborhoods are unlikely to be customers. If future marketing efforts were based on that finding, data mining would help perpetuate an illegal and unethical practice.
Careful attention to selecting and sampling data for the model set is crucial to successful data mining.
Data May Be at the Wrong Level of Detail
In more than one industry, we have been told that usage often goes down in the month before a customer leaves. Upon closer examination, this turns out to be an example of learning something that is not true. Figure 3.1 shows the monthly minutes of use for a cellular telephone subscriber. For 7 months, the subscriber used about 100 minutes per month. Then, in the eighth month, usage went down to about half that. In the ninth month, there was no usage at all.
This subscriber appears to fit the pattern in which a month with decreased usage precedes abandonment of the service. But appearances are deceiving.
Looking at minutes of use by day instead of by month would show that the customer continued to use the service at a constant rate until the middle of the month and then stopped completely, presumably because on that day, he or she began using a competing service. The putative period of declining usage does not actually exist and so certainly does not provide a window of opportunity for retaining the customer. What appears to be a leading indicator is actually a trailing one.
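The effect is easy to reproduce. The sketch below fabricates daily usage for a subscriber who defects mid-month (30-day months are an assumption for simplicity):

```python
DAYS_PER_MONTH = 30

# Steady usage (~100 minutes/month) for 7.5 months, then an abrupt stop
daily_use = [100 / DAYS_PER_MONTH] * (7 * DAYS_PER_MONTH + 15)
daily_use += [0.0] * 45  # the rest of month 8 plus all of month 9

# Monthly aggregation turns the abrupt stop into an apparent decline
monthly = [sum(daily_use[m * DAYS_PER_MONTH:(m + 1) * DAYS_PER_MONTH])
           for m in range(9)]
print([round(m) for m in monthly])
# → [100, 100, 100, 100, 100, 100, 100, 50, 0]
```

At the daily level there is no gradual decline at all; month 8’s “50 minutes” is an artifact of summing half a month of normal usage with half a month of nothing.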
[Figure 3.1 (chart): Minutes of Use by Tenure — monthly minutes of use, on a scale of 0 to 140, plotted for tenure months 1 through 11.]
Figure 3.1 Does declining usage in month 8 predict attrition in month 9?
Figure 3.2 shows another example of confusion caused by aggregation. Sales appear to be down in October compared to August and September. The picture comes from a business that has sales activity only on days when the financial markets are open. Because of the way that weekends and holidays fell in 2003, October had fewer trading days than August and September. That fact alone accounts for the entire drop-off in sales.
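Normalizing by the number of days the markets were open removes this artifact before months are compared. The sales totals and day counts below are invented for illustration (a real analysis would count days from the market’s actual holiday calendar, for example with NumPy’s busday_count):

```python
# Invented monthly sales totals and hypothetical counts of market-open days
sales = {"August": 43000.0, "September": 43400.0, "October": 41000.0}
open_days = {"August": 22, "September": 22, "October": 21}

# Sales per open day puts the three months on a comparable footing
per_day = {month: sales[month] / open_days[month] for month in sales}
for month, value in per_day.items():
    print(f"{month}: {value:,.1f} per open day")
```

With these illustrative numbers, the per-day figures come out nearly equal: the apparent October drop-off is an artifact of the raw monthly totals.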
In the previous examples, aggregation led to confusion. Failure to aggregate to the appropriate level can also lead to confusion. In one case, data provided by a charitable organization showed an inverse correlation between donors’
likelihood to respond to solicitations and the size of their donations. Those more likely to respond sent smaller checks. This counterintuitive finding is a result of the large number of solicitations the charity sent out to its supporters each year. Imagine two donors, each of whom plans to give $500 to the charity.
One responds to an offer in January by sending in the full $500 contribution and tosses the rest of the solicitation letters in the trash. The other sends a $100
check in response to each of five solicitations. On their annual income tax returns, both donors report having given $500, but when seen at the individual campaign level, the second donor seems much more responsive. When aggregated to the yearly level, the effect disappears.
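The donor example can be made concrete. The two gift histories below are the hypothetical donors from the text:

```python
# Hypothetical gift histories: five solicitations per year per donor
gifts = {
    "lump_sum_donor": [500, 0, 0, 0, 0],             # one check, full amount
    "installment_donor": [100, 100, 100, 100, 100],  # responds every time
}

for donor, history in gifts.items():
    response_rate = sum(g > 0 for g in history) / len(history)
    annual_total = sum(history)
    print(f"{donor}: response rate {response_rate:.0%}, total ${annual_total}")
# Campaign level: 20% vs. 100% response; yearly level: both gave $500
```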
Learning Things That Are True, but Not Useful
Although not as dangerous as learning things that aren’t true, learning things that aren’t useful is more common.
[Figure 3.2 (chart): Sales by Month (2003) — monthly sales totals, on a scale of roughly 40,000 to 43,500, for August, September, and October.]
Figure 3.2 Did sales drop off in October?
Learning Things That Are Already Known
Data mining should provide new information. Many of the strongest patterns in data represent things that are already known. People over retirement age tend not to respond to offers for retirement savings plans. People who live where there is no home delivery do not become newspaper subscribers. Even though they may respond to subscription offers, service never starts. For the same reason, people who live where there are no cell towers tend not to purchase cell phones.
Often, the strongest patterns reflect business rules. If data mining “discovers” that people who have anonymous call blocking also have caller ID, it is perhaps because anonymous call blocking is only sold as part of a bundle of services that also includes caller ID. If there are no sales of certain products in a particular location, it is possible that they are not offered there. We have seen many such discoveries. Not only are these patterns uninteresting, their strength may obscure less obvious patterns.
Learning things that are already known does serve one useful purpose. It demonstrates that, on a technical level, the data mining effort is working and the data is reasonably accurate. This can be quite comforting. If the data and the data mining techniques applied to it are powerful enough to discover things that are known to be true, it provides confidence that other discoveries are also likely to be true. It is also true that data mining often uncovers things that ought to have been known, but were not: that retired people do not respond well to solicitations for retirement savings accounts, for instance.