Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management

$$\mathrm{SEP} = \sqrt{\frac{p\,(1 - p)}{N}}$$

In this formula, p is the observed response (the average value) and N is the size of the sample. So, the corresponding normal distribution has a standard deviation equal to the square root of the observed response times one minus the observed response, divided by the total number of samples.

The Lure of Statistics: Data Mining Using Familiar Tools

We have already observed that about 68 percent of data following a normal distribution lies within one standard deviation of the mean. For the sample size of 100,000, the formula SQRT(5% * 95% / 100,000) gives about 0.07 percent. So, we are 68 percent confident that the actual response is between 4.93 percent and 5.07 percent. We have also observed that a bit over 95 percent lies within two standard deviations, so we are just over 95 percent confident in the range from 4.86 percent to 5.14 percent. So, if we observe a 5 percent response rate for the challenger offer, then we are over 95 percent confident that the response rate on the whole population would have been between 4.86 percent and 5.14 percent. Note that this conclusion depends very much on the fact that people who got the challenger offer were selected randomly from the entire population.
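The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the calculation described in the text; the function names are illustrative and do not come from the book.

```python
import math

def standard_error_of_proportion(p, n):
    """SEP = sqrt(p * (1 - p) / n) for observed proportion p on n samples."""
    return math.sqrt(p * (1.0 - p) / n)

def confidence_interval(p, n, z):
    """Bounds lying z standard errors below and above the observed proportion."""
    margin = z * standard_error_of_proportion(p, n)
    return p - margin, p + margin

# Challenger group from the text: 5 percent response on 100,000 people.
p, n = 0.05, 100_000
sep = standard_error_of_proportion(p, n)
low1, high1 = confidence_interval(p, n, 1)  # about 68 percent confidence
low2, high2 = confidence_interval(p, n, 2)  # just over 95 percent confidence

print(f"SEP: {sep:.2%}")
print(f"one standard deviation:  {low1:.2%} to {high1:.2%}")
print(f"two standard deviations: {low2:.2%} to {high2:.2%}")
```

Rounded to two decimal places, the bounds agree with the 4.93 to 5.07 percent and 4.86 to 5.14 percent ranges quoted above.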

Comparing Results Using Confidence Bounds

The previous section discussed confidence intervals as applied to the response rate of one group, the one that received the challenger offer. In this case, there are actually two response rates, one for the champion and one for the challenger. Are these response rates different? Notice that the observed rates could differ (say, 5.0 percent and 5.001 percent) and still be statistically indistinguishable from each other. One way to answer the question is to look at the confidence interval for each response rate and see whether they overlap. If the intervals do not overlap, then the response rates are considered different at that confidence level.

This example investigates a range of response rates from 4.5 percent to 5.5 percent for the champion model. In practice, a single response rate would be known. However, investigating a range makes it possible to understand what happens as the rate varies from much lower (4.5 percent) to the same (5.0 percent) to much larger (5.5 percent).

The 95 percent confidence bounds lie 1.96 standard deviations from the mean, so the lower bound is the mean minus 1.96 standard deviations and the upper bound is the mean plus 1.96 standard deviations. Table 5.2 shows the lower and upper bounds for a range of response rates for the champion model going from 4.5 percent to 5.5 percent.
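As a rough sketch, the bounds can be computed for each candidate champion rate as follows. The champion group size of 900,000 is the size listed in Table 5.2; the function and variable names are illustrative assumptions.

```python
import math

Z_95 = 1.96               # standard deviations for 95 percent confidence
CHAMPION_SIZE = 900_000   # champion group size, as listed in Table 5.2

def bounds_95(p, n, z=Z_95):
    """Lower and upper 95 percent confidence bounds for proportion p on n samples."""
    sep = math.sqrt(p * (1.0 - p) / n)
    return p - z * sep, p + z * sep

# Response rates from 4.5 percent to 5.5 percent in steps of 0.1 percent.
rates = [tenths / 1000 for tenths in range(45, 56)]
table = [(p, *bounds_95(p, CHAMPION_SIZE)) for p in rates]

for p, lower, upper in table:
    print(f"{p:.1%}  lower {lower:.2%}  upper {upper:.2%}")
```

Because the group is large, each interval is narrow: less than 0.05 percentage points on either side of the observed rate.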

Figure 5.8 Statistics has proven that the actual response rate on a population is very close to a normal distribution whose mean is the measured response on a sample and whose standard deviation is the standard error of proportion (SEP). [Plot not reproduced: probability density, 0% to 6%, against observed response rate, 0% to 10%.]


Table 5.2 The 95 Percent Confidence Interval Bounds for the Champion Group

RESPONSE  SIZE     SEP      95% CONF  95% CONF * SEP        LOWER  UPPER
4.5%      900,000  0.0219%  1.96      0.0219%*1.96=0.0429%  4.46%  4.54%
4.6%      900,000  0.0221%  1.96      0.0221%*1.96=0.0433%  4.56%  4.64%
4.7%      900,000  0.0223%  1.96      0.0223%*1.96=0.0437%  4.66%  4.74%
4.8%      900,000  0.0225%  1.96      0.0225%*1.96=0.0441%  4.76%  4.84%
4.9%      900,000  0.0228%  1.96      0.0228%*1.96=0.0447%  4.86%  4.94%
5.0%      900,000  0.0230%  1.96      0.0230%*1.96=0.0451%  4.95%  5.05%
5.1%      900,000  0.0232%  1.96      0.0232%*1.96=0.0455%  5.05%  5.15%
5.2%      900,000  0.0234%  1.96      0.0234%*1.96=0.0459%  5.15%  5.25%
5.3%      900,000  0.0236%  1.96      0.0236%*1.96=0.0463%  5.25%  5.35%
5.4%      900,000  0.0238%  1.96      0.0238%*1.96=0.0466%  5.35%  5.45%
5.5%      900,000  0.0240%  1.96      0.0240%*1.96=0.0470%  5.45%  5.55%

Response rates vary from 4.5% to 5.5%. The bounds for the 95% confidence level are calculated using 1.96 standard deviations from the mean.


Based on these possible response rates, it is possible to tell whether the confidence bounds overlap. The 95 percent confidence bounds for the challenger model were from about 4.86 percent to 5.14 percent. These bounds overlap the confidence bounds for the champion model when its response rate is 4.9 percent, 5.0 percent, or 5.1 percent. For instance, the confidence interval for a response rate of 4.9 percent goes from 4.86 percent to 4.94 percent; this does overlap the range from 4.86 percent to 5.14 percent. Using the overlapping bounds method, we would consider these response rates statistically the same.
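The overlap test can be sketched in Python as below. It assumes a challenger group of 100,000 with a 5 percent response and a champion group of 900,000, the sizes used in this example; the helper names are illustrative, not from the book.

```python
import math

def bounds_95(p, n, z=1.96):
    """Lower and upper 95 percent confidence bounds for proportion p on n samples."""
    sep = math.sqrt(p * (1.0 - p) / n)
    return p - z * sep, p + z * sep

def intervals_overlap(a, b):
    """True when the closed intervals a = (low, high) and b = (low, high) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]

challenger = bounds_95(0.05, 100_000)

# Champion rates in tenths of a percent (45 means 4.5 percent).
overlapping = [
    tenths for tenths in range(45, 56)
    if intervals_overlap(challenger, bounds_95(tenths / 1000, 900_000))
]

print([f"{tenths / 10:.1f}%" for tenths in overlapping])
```

Only the champion rates whose intervals touch the challenger interval are reported, which matches the three rates named in the text.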

Comparing Results Using Difference of Proportions

The overlapping bounds method is easy, but its results are a bit pessimistic. That is, even though the confidence intervals overlap, we might still be quite confident that the difference is not due to chance.

Another approach is to look at the difference between response rates, rather than the rates themselves. Just as there is a formula for the standard error of a proportion, there is a formula for the standard error of a difference of proportions (SEDP):

$$\mathrm{SEDP} = \sqrt{\frac{p_1\,(1 - p_1)}{N_1} + \frac{p_2\,(1 - p_2)}{N_2}}$$

This formula is a lot like the formula for the standard error of a proportion, except that the term under the square root is repeated for each group. Table 5.3 shows this applied to the champion-challenger problem with response rates varying between 4.5 percent and 5.5 percent for the champion group.

By the difference of proportions, three response rates on the champion have a confidence under 95 percent (that is, the p-value exceeds 5 percent). If the challenger response rate is 5.0 percent and the champion is 5.1 percent, then the difference in response rates might be due to chance. However, if the champion has a response rate of 5.2 percent, then the likelihood of the difference being due to chance falls to under 1 percent.
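The following Python sketch illustrates the difference of proportions calculation. It assumes that the z-value is the difference in response rates divided by the SEDP and that the p-value is the two-sided tail probability of the standard normal distribution; the function names are illustrative.

```python
import math

def sedp(p1, n1, p2, n2):
    """Standard error of a difference of proportions."""
    return math.sqrt(p1 * (1.0 - p1) / n1 + p2 * (1.0 - p2) / n2)

def z_and_p_value(p1, n1, p2, n2):
    """z-value of the difference p1 - p2 and its two-sided p-value."""
    z = (p1 - p2) / sedp(p1, n1, p2, n2)
    # For a standard normal variable, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Challenger: 5.0 percent on 100,000. Champion: 4.5 to 5.5 percent on 900,000.
for tenths in range(45, 56):
    champion_rate = tenths / 1000
    z, p_value = z_and_p_value(0.05, 100_000, champion_rate, 900_000)
    print(f"champion {champion_rate:.1%}  z = {z:5.1f}  p-value = {p_value:.1%}")
```

A p-value above 5 percent corresponds to a confidence under 95 percent, so the difference could plausibly be due to chance.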

WARNING Confidence intervals only measure the likelihood that sampling affected the result. There may be many other factors that we need to take into consideration to determine if two offers are significantly different. Each group must be selected entirely randomly from the whole population for the difference of proportions method to work.


Table 5.3 The 95 Percent Confidence Interval Bounds for the Difference between the Champion and Challenger Groups

CHALLENGER  CHALLENGER  CHAMPION  CHAMPION  DIFFERENCE
RESPONSE    SIZE        RESPONSE  SIZE      VALUE       SEDP   Z-VALUE  P-VALUE
5.0%        100,000     4.5%      900,000    0.5%       0.07%   6.9       0.0%
5.0%        100,000     4.6%      900,000    0.4%       0.07%   5.5       0.0%
5.0%        100,000     4.7%      900,000    0.3%       0.07%   4.1       0.0%
5.0%        100,000     4.8%      900,000    0.2%       0.07%   2.8       0.6%
5.0%        100,000     4.9%      900,000    0.1%       0.07%   1.4      16.8%
5.0%        100,000     5.0%      900,000    0.0%       0.07%   0.0     100.0%
5.0%        100,000     5.1%      900,000   -0.1%       0.07%  -1.4      16.9%
5.0%        100,000     5.2%      900,000   -0.2%       0.07%  -2.7       0.6%
5.0%        100,000     5.3%      900,000   -0.3%       0.07%  -4.1       0.0%
5.0%        100,000     5.4%      900,000   -0.4%       0.07%  -5.5       0.0%
5.0%        100,000     5.5%      900,000   -0.5%       0.07%  -6.9       0.0%


Size of Sample

The formulas for the standard error of a proportion and for the standard error of a difference of proportions both include the sample size. There is an inverse relationship between the sample size and the size of the confidence interval: the larger the size of the sample, the narrower the confidence interval. So, if you want to have more confidence in results, it pays to use larger samples.

Table 5.4 shows the confidence interval for different sizes of the challenger group, assuming the challenger response rate is observed to be 5 percent. For very small sizes, the confidence interval is very wide, often too wide to be useful. Earlier, we had said that the normal distribution is an approximation for the estimate of the actual response rate; with small sample sizes, the estimation is not a very good one. Statistics has several methods for handling such small sample sizes. However, these are generally not of much interest to data miners because our samples are much larger.
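To illustrate the inverse relationship, the sketch below computes the width of the 95 percent confidence interval for an observed 5 percent response at several sample sizes. The sizes are hypothetical values chosen for illustration; they are not taken from Table 5.4.

```python
import math

def interval_width(p, n, z=1.96):
    """Full width of the confidence interval: 2 * z * SEP."""
    return 2.0 * z * math.sqrt(p * (1.0 - p) / n)

# Hypothetical challenger group sizes, each with an observed 5 percent response.
sizes = [1_000, 10_000, 100_000, 1_000_000]
widths = {n: interval_width(0.05, n) for n in sizes}

for n, width in widths.items():
    print(f"size {n:>9,}  interval width {width:.3%}")
```

Because the sample size sits under the square root, a sample 100 times larger narrows the interval by a factor of only 10.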

Table 5.4 The 95 Percent Confidence Interval for Different Sizes of the Challenger Group

RESPONSE  SIZE  SEP
