Table 5.8 Chi-Square Calculation for Counties and Channels Example
[Table body not recoverable from the source. Rows: the counties Bronx, Kings, Nassau, New York, Queens, Richmond, Suffolk, and Westchester.]
470643 c05.qxd 3/8/04 11:11 AM Page 157
The Lure of Statistics: Data Mining Using Familiar Tools 157
Table 5.9 Chi-Square Calculation for Bronx and TM
            EXPECTED                DEVIATION               CHI-SQUARE
COUNTY      TM          NOT_TM      TM          NOT_TM      TM         NOT_TM
BRONX        1,850.2     4,710.8     1,361.8    –1,361.8    1,002.3    393.7
NOT BRONX   34,135.8    86,913.2    –1,361.8     1,361.8       54.3     21.3
The result is a set of chi-square values for the Bronx-TM combination, in a table with 1 degree of freedom. The Bronx-TM score by itself is a good approximation of the overall chi-square value for the 2 × 2 table (this assumes that the original cells are roughly the same size). The calculation for the chi-square value uses this value (1002.3) with 1 degree of freedom. Conveniently, the chi-square calculation for this cell is the same as the chi-square for the cell in the original calculation, although the other values do not match anything. This makes it unnecessary to do additional calculations.
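As a minimal sketch of the calculation behind Table 5.9, the following Python recomputes the expected values, deviations, and per-cell chi-square values for the 2 × 2 table. The observed counts are not printed in the table; they are recovered here as expected plus deviation, which is an assumption, and the function and variable names are illustrative only.

```python
def chi_square_cells(observed):
    """Return (expected, deviation, chi_square) tables for a grid of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    # Expected count for a cell = row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    # Deviation = observed - expected.
    deviation = [[o - e for o, e in zip(obs_row, exp_row)]
                 for obs_row, exp_row in zip(observed, expected)]
    # Chi-square contribution of a cell = deviation squared / expected.
    chi_square = [[d * d / e for d, e in zip(dev_row, exp_row)]
                  for dev_row, exp_row in zip(deviation, expected)]
    return expected, deviation, chi_square


# Observed counts recovered from Table 5.9 as expected + deviation
# (an assumption; the table itself does not list them).
# Rows: BRONX, NOT BRONX. Columns: TM, NOT_TM.
observed = [[3212, 3349],
            [32774, 88275]]

expected, deviation, chi_square = chi_square_cells(observed)
for name, exp_row, dev_row, chi_row in zip(
        ("BRONX", "NOT BRONX"), expected, deviation, chi_square):
    print(name,
          [round(v, 1) for v in exp_row],
          [round(v, 1) for v in dev_row],
          [round(v, 1) for v in chi_row])
```

The four deviations have the same magnitude, which is why the cell with the smallest expected value (Bronx-TM) contributes the largest chi-square value.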
This means that an estimate of the effect of each combination of variables can be obtained using the chi-square value in the cell with 1 degree of freedom. The result is a table of p-values, each estimating the probability that the deviation in a given cell is due to chance, as shown in Table 5.10.
However, there is a second correction that needs to be made because there are many comparisons taking place at the same time. Bonferroni’s adjustment takes care of this by multiplying each p-value by the number of comparisons, which is the number of cells in the table. For final presentation purposes, convert each p-value to its complement, the confidence (1 minus the p-value), and multiply by the sign of the deviation to get a signed confidence. Figure 5.10 illustrates the result.
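The steps in this paragraph can be sketched in Python as follows. This is an illustration under stated assumptions, not the method used to produce the book's figures: the p-value uses the standard identity for a chi-square distribution with 1 degree of freedom, the adjusted p-value is capped at 1 (the text does not say how values above 1 are handled), and the function names are hypothetical.

```python
import math


def p_value_1dof(chi_square):
    """Upper-tail probability of a chi-square value with 1 degree of freedom."""
    # With 1 degree of freedom, chi-square is the square of a standard
    # normal variable, so the tail probability is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi_square / 2.0))


def signed_confidence(chi_square, deviation, num_comparisons):
    """Bonferroni-adjusted confidence, carrying the sign of the deviation."""
    # Bonferroni: multiply the p-value by the number of comparisons.
    # Capping at 1.0 is an assumption so the result stays a probability.
    p_adjusted = min(1.0, p_value_1dof(chi_square) * num_comparisons)
    confidence = 1.0 - p_adjusted
    return math.copysign(confidence, deviation)


# 8 counties x 3 channels, the number of cells in Table 5.10.
NUM_CELLS = 24

# Bronx-TM cell from Table 5.9: chi-square 1,002.3, deviation +1,361.8.
bronx_tm = signed_confidence(1002.3, 1361.8, NUM_CELLS)
print(f"{bronx_tm:.2%}")
```

A cell with a positive deviation and a tiny p-value lands near 100 percent; the same p-value with a negative deviation lands near –100 percent.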
Table 5.10 Estimated P-Value for Each Combination of County and Channel, without Correcting for Number of Comparisons
COUNTY        TM        DM        OTHER
BRONX          0.00%    0.00%      0.00%
KINGS          0.00%    0.00%      0.00%
NASSAU         0.00%    0.00%      0.00%
NEW YORK       0.00%    0.00%      0.00%
QUEENS         0.00%    0.74%      0.00%
RICHMOND      59.79%    0.07%     39.45%
SUFFOLK        0.01%    0.00%     42.91%
WESTCHESTER    0.00%    0.00%      0.00%
[Chart: signed confidence from –100% to 100% on the vertical axis; the counties Bronx, Kings, Nassau, New York, Queens, Richmond, Suffolk, and Westchester on the horizontal axis; one series each for TM, DM, and OTHER.]
Figure 5.10 This chart shows the signed confidence values for each county and channel combination; the preponderance of values near 100% and –100% indicates that the observed differences are statistically significant.
The result is interesting. First, almost all the values are near 100 percent or –100 percent, meaning that there are statistically significant differences among the counties. In fact, telemarketing (the diamond) and direct mail (the square) are always at opposite ends; there is a direct inverse relationship between the two. Direct mail is high and telemarketing low in three counties: Manhattan (New York County), Nassau, and Suffolk. There are many wealthy areas in these counties, suggesting that wealthy customers are more likely to respond to direct mail than to telemarketing. Of course, this could also mean that direct mail campaigns are directed to these areas and telemarketing to other areas, so that the geography was determined by business operations. To determine which of these possibilities is correct, we would need to know who was contacted as well as who responded.
Data Mining and Statistics
Many of the data mining techniques discussed in the next eight chapters were invented by statisticians or have now been integrated into statistical software; they are extensions of standard statistics. Although data miners and
statisticians use similar techniques to solve similar problems, the data mining approach differs from the standard statistical approach in several areas:
■ Data miners tend to ignore measurement error in raw data.
■ Data miners assume that there is more than enough data and processing power.
■ Data mining assumes dependency on time everywhere.
■ It can be hard to design experiments in the business world.
■ Data is truncated and censored.
These are differences of approach, rather than opposites. As such, they shed some light on how the business problems addressed by data miners differ from the scientific problems that spurred the development of statistics.
No Measurement Error in Basic Data
Statistics originally derived from measuring scientific quantities, such as the width of a skull or the brightness of a star. These measurements are quantitative and the precise measured value depends on factors such as the type of measuring device and the ambient temperature. In particular, two people taking the same measurement at the same time are going to produce slightly different results. The results might differ by 5 percent or 0.05 percent, but there is a difference. Traditionally, statistics looks at observed values as falling into a confidence interval.
On the other hand, the amount of money a customer paid last January is quite well understood—down to the last penny. The definition of customer may be a little bit fuzzy; the definition of January may be fuzzy (consider 5-4-4 accounting cycles). However, the amount of the payment is precise. There is no measurement error.
There are, however, other sources of error in business data. Of particular concern is operational error, which can cause systematic bias in what is being collected. For instance, clock skew may mean that two events that seem to have happened in one sequence actually happened in another. A database record may have a Tuesday update date when it was really updated on Monday, because the updating process runs just after midnight. Such forms of bias are systematic, and may create spurious patterns that are picked up by data mining algorithms.
One major difference between business data and scientific data is that the latter has many continuous values and the former has many discrete values.
Even monetary amounts are discrete—two values can differ only by multiples of pennies (or some similar amount)—even though the values might be represented by real numbers.
There Is a Lot of Data
Traditionally, statistics has been applied to smallish data sets (at most a few thousand rows) with few columns (fewer than a dozen). The goal has been to squeeze as much information as possible out of the data. This is still important in problems where collecting data is expensive or arduous, such as market research, crash testing of cars, or testing the chemical composition of Martian soil.
Business data, on the other hand, is very voluminous. The challenge is understanding anything about what is happening, rather than every possible thing. Fortunately, there is also enough computing power available to handle the large volumes of data.
Sampling theory is an important part of statistics. This area explains how results on a subset of data (a sample) relate to the whole. This is very important when planning to do a poll, because it is not possible to ask everyone a question; rather, pollsters ask a very small sample and derive overall opinion.
However, this is much less important when all the data is available. Usually, it is best to use all the data available, rather than a small subset of it.
There are a few cases when this is not necessarily true. There might simply be too much data. Instead of building models on tens of millions of customers, build models on hundreds of thousands, at least to learn how to build better models. Another reason to sample is to obtain a deliberately unrepresentative sample. Such a sample, for instance, might have an equal number of churners and nonchurners, although the original data had different proportions. However, it is generally better to use more data rather than sample down and use less, unless there is a good reason for sampling down.
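One way to build such a deliberately unrepresentative sample is to downsample the larger group until the two groups are the same size. The sketch below assumes a hypothetical list of customer records with a `churned` flag; the function name, record layout, and 5 percent churn rate are illustrative assumptions, not taken from the text.

```python
import random


def balanced_sample(records, is_churner, seed=0):
    """Downsample the larger group so both groups are equal in number."""
    churners = [r for r in records if is_churner(r)]
    nonchurners = [r for r in records if not is_churner(r)]
    size = min(len(churners), len(nonchurners))

    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    sample = rng.sample(churners, size) + rng.sample(nonchurners, size)
    rng.shuffle(sample)  # avoid leaving all churners at the front
    return sample


# Hypothetical data: 1,000 customers, 5 percent of whom churned.
customers = [{"id": i, "churned": i % 20 == 0} for i in range(1000)]

sample = balanced_sample(customers, lambda c: c["churned"])
print(len(sample), sum(c["churned"] for c in sample))
```

Because the sample no longer reflects the original proportions, any rate estimated from it has to be translated back to the original proportions before being reported.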
Time Dependency Pops Up Everywhere
Almost all data used in data mining has a time dependency associated with it.
Customers’ reactions to marketing efforts change over time. Prospects’ reactions to competitive offers change over time. Comparing the results of a marketing campaign with those from the previous year is rarely going to show exactly the same outcome, and we do not expect it to.
On the other hand, we do expect scientific experiments to yield similar results regardless of when the experiment takes place. The laws of science are considered immutable; they do not change over time. By contrast, the business climate changes daily. Statistics often considers repeated observations to be independent observations. That is, one observation does not influence another. Data mining, on the other hand, must often consider the time component of the data.
Experimentation Is Hard
Data mining has to work within the constraints of existing business practices.