Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management

and 1, the probability is 34.1 percent; this means that 34.1 percent of the time a variable that follows a normal distribution will take on a value within one standard deviation above the mean. Because the curve is symmetric, there is an additional 34.1 percent probability of being within one standard deviation below the mean, and hence a 68.2 percent probability of being within one standard deviation of the mean.


[Figure: probability density (0% to 40%) plotted against z-value (-5 to 5).]

The probability density function for the normal distribution looks like the familiar bell-shaped curve.


470643 c05.qxd 3/8/04 11:11 AM Page 133

The Lure of Statistics: Data Mining Using Familiar Tools 133

A QUESTION OF TERMINOLOGY (continued)

The previous paragraph showed a picture of a bell-shaped curve and called it the normal distribution. Actually, the correct terminology is density function (or probability density function). Although this terminology derives from advanced mathematical probability theory, it makes sense. The density function gives a flavor for how “dense” a variable is. We use a density function by measuring the area under the curve between two points, rather than by reading the individual values themselves. In the case of the normal distribution, the values are densest around 0 and less dense as we move away.

The following figure shows the function that is properly called the normal distribution. This form, ranging from 0 to 1, is also called a cumulative distribution function. Mathematically, the distribution function for a value X is defined as the probability that the variable takes on a value less than or equal to X. Because of the “less than or equal to” characteristic, this function always starts near 0, climbs upward, and ends up close to 1. In general, the density function provides more visual clues to the human about what is going on with a distribution. Because density functions provide more information, they are often referred to as distributions, although that is technically incorrect.

[Figure: proportion less than Z (0% to 100%) plotted against z-value (-5 to 5).]

The (cumulative) distribution function for the normal distribution has an S-shape and is antisymmetric about the point where it crosses the Y-axis (z = 0, 50 percent).
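The difference between the density function and the (cumulative) distribution function can be sketched in a few lines of Python. This is only an illustration using the standard formulas for the standard normal curve; the function names `normal_pdf`, `normal_cdf`, and `prob_between` are made up for this example and do not come from the text.

```python
import math


def normal_pdf(z: float) -> float:
    """Height of the standard normal density curve at z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    """Probability that a standard normal variable is less than or equal to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_between(low: float, high: float) -> float:
    """Area under the density curve between two z-values."""
    return normal_cdf(high) - normal_cdf(low)


# The density is read as an area between two points, not as a point value.
print(f"density at z = 0:       {normal_pdf(0.0):.3f}")
print(f"P(0 <= z <= 1):         {prob_between(0.0, 1.0):.1%}")
print(f"P(-1 <= z <= 1):        {prob_between(-1.0, 1.0):.1%}")
print(f"distribution at z = 0:  {normal_cdf(0.0):.1%}")
```

The area between 0 and 1 reproduces the 34.1 percent figure quoted earlier. The area between -1 and 1 comes out slightly above the 68.2 percent obtained by doubling the rounded 34.1.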

From Standardized Values to Probabilities

Assuming that the standardized value follows the normal distribution makes it possible to calculate the probability that the value would have occurred by chance. Actually, the approach is to calculate the probability that something further from the mean would have occurred, which is the p-value. The probability of the exact value is not worth asking about, because any single z-value has an arbitrarily small probability. Probabilities are defined on ranges of z-values as the area under the normal curve between two points.

Calculating something further from the mean might mean either of two things:

■ The probability of being more than z standard deviations from the mean.

■ The probability of being z standard deviations greater than the mean (or alternatively z standard deviations less than the mean).

The first is called a two-tailed probability and the second is called a one-tailed probability. The terminology is clear in Figure 5.4, because the tails of the distribution are being measured. For a given z-value, the two-tailed probability is always twice as large as the one-tailed probability. Hence, the two-tailed p-value is more pessimistic than the one-tailed one; that is, the two-tailed test is less likely to reject the null hypothesis. If the one-tailed p-value is 10 percent, then the two-tailed p-value is 20 percent. As a default, it is better to use the two-tailed probability for calculations to be on the safe side.

The two-tailed p-value can be calculated conveniently in Excel, because there is a function called NORMSDIST, which calculates the cumulative normal distribution. Using this function, the two-tailed p-value is 2 * NORMSDIST(-ABS(z)). For a z-value of 2, the result is 4.6 percent. This means that there is a 4.6 percent chance of observing a value more than two standard deviations from the average, in either direction.
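For readers working outside Excel, the same calculation can be sketched in Python. The sketch assumes that `statistics.NormalDist().cdf` from the Python standard library (3.8 or later) plays the role of NORMSDIST; the helper names `one_tailed_p` and `two_tailed_p` are illustrative only.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def one_tailed_p(z: float) -> float:
    """Probability of a value at least |z| standard deviations out on one side."""
    return STANDARD_NORMAL.cdf(-abs(z))


def two_tailed_p(z: float) -> float:
    """Probability of a value at least |z| standard deviations out on either side.

    Mirrors the spreadsheet formula 2 * NORMSDIST(-ABS(z)).
    """
    return 2.0 * STANDARD_NORMAL.cdf(-abs(z))


for z in (1.0, 2.0, -2.0, 3.0):
    print(f"z = {z:+.1f}  one-tailed = {one_tailed_p(z):.2%}  "
          f"two-tailed = {two_tailed_p(z):.2%}")
```

Taking the absolute value first means the sign of z does not matter, so a value two standard deviations below the average gets the same p-value as one two standard deviations above it.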

Or, put another way, there is a 95.4 percent confidence that a value falling outside two standard deviations is due to something besides chance. For a precise 95 percent confidence, a bound of 1.96 can be used instead of 2. For 99 percent confidence, the limit is 2.58. The following shows the limits on the z-value for some common confidence levels:

■ 90% confidence → z-value > 1.64

■ 95% confidence → z-value > 1.96

■ 99% confidence → z-value > 2.58

■ 99.5% confidence → z-value > 2.81

■ 99.9% confidence → z-value > 3.29

■ 99.99% confidence → z-value > 3.89
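The limits in this list can be reproduced by inverting the cumulative normal distribution. The following sketch assumes the two-tailed reading used above, so that half of the remaining probability sits in each tail; `z_limit` is a hypothetical helper name, and `NormalDist.inv_cdf` comes from the Python standard library (3.8 or later).

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def z_limit(confidence: float) -> float:
    """Smallest |z| whose two-tailed confidence reaches the given level.

    A confidence of c leaves 1 - c outside the limits, split evenly
    between the two tails, so the upper limit sits at 1 - (1 - c) / 2.
    """
    tail = (1.0 - confidence) / 2.0
    return STANDARD_NORMAL.inv_cdf(1.0 - tail)


for level in (0.90, 0.95, 0.99, 0.995, 0.999, 0.9999):
    print(f"{level:.2%} confidence -> z-value > {z_limit(level):.2f}")
```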

The confidence has the property that it is close to 100 percent when the value is unlikely to be due to chance and close to 0 when it is. The signed confidence adds information about whether the value is too low or too high. When the observed value is less than the average, the signed confidence is negative.
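One way to compute a signed confidence is sketched below. It assumes that the confidence is one minus the two-tailed p-value (consistent with the 4.6 percent and 95.4 percent figures above) and that the standardized value is the observed value minus the average, divided by the standard deviation. The function name and the sample numbers are invented for illustration.

```python
import math
from statistics import NormalDist


def signed_confidence(observed: float, average: float, std_dev: float) -> float:
    """Two-tailed confidence carrying the sign of the deviation from average.

    Assumption: confidence = 1 - two-tailed p-value, with the sign taken
    from the standardized value (negative when observed < average).
    """
    z = (observed - average) / std_dev
    p_value = 2.0 * NormalDist().cdf(-abs(z))
    return math.copysign(1.0 - p_value, z)


# Hypothetical daily counts with an average of 100 and a standard deviation of 10.
for observed in (80, 95, 100, 105, 120):
    q = signed_confidence(observed, average=100.0, std_dev=10.0)
    print(f"observed = {observed:3d}  signed confidence = {q:+.1%}")
```

Values near the average give a signed confidence near 0, while values far above or below it approach 100 percent or -100 percent.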


[Figure: probability density (0% to 40%) plotted against z-value (-5 to 5). Annotations: "Shaded area is one-tailed probability of being two or more standard deviations above average." "Both shaded areas are two-tailed probability of being two or more standard deviations from average (greater or less than)."]

Figure 5.4 The tail of the normal distribution answers the question: “What is the probability of getting a value of z or greater?”

Figure 5.5 shows the signed confidence for the data shown earlier in Figures 5.2 and 5.3, using the two-tailed probability. The shape of the signed confidence is different from the earlier shapes. The overall stops bounce around, usually remaining within reasonable bounds. The pricing-related stops, though, once again show a very distinct pattern, being too low for a long time, then peaking and descending. The signed confidence levels are bounded by 100 percent and –100 percent. In this chart, the extreme values are near 100 percent or –100 percent, and it is hard to tell the difference between 99.9 percent and 99.99999 percent. To distinguish values near the extremes, the z-values in Figure 5.3 are better than the signed confidence.

[Figure: signed confidence (q-value), from -100% to 100%, plotted by month.]

Figure 5.5 Based on the same data from Figures 5.2 and 5.3, this chart shows the signed confidence (q-values) of the observed value based on the average and standard deviation. The sign is positive when the observed value is too high and negative when it is too low.


Cross-Tabulations

Time series are an example of cross-tabulation—looking at the values of two or more variables at one time. For time series, the second variable is the time something occurred.

Table 5.1 shows an example used later in this chapter. The cross-tabulation shows the number of new customers from counties in southeastern New York state by three channels: telemarketing, direct mail, and other. This table shows both the raw counts and the relative frequencies.

It is possible to visualize cross-tabulations as well. However, there is a lot of data being presented, and some people do not follow complicated pictures.

Figure 5.6 shows a surface plot for the counts shown in the table. A surface plot often looks a bit like hilly terrain. The counts are the height of the hills; the counties go up one side and the channels make the third dimension. This surface plot shows that the other channel is quite high for Manhattan (New York county). Although not a problem in this case, such peaks can hide other hills and valleys on the surface plot.

Looking at Continuous Variables

Statistics originated to understand the data collected by scientists, most of which took the form of continuous measurements. In data mining, we encounter continuous data less often, because there is a wealth of descriptive data as well. This section talks about continuous data from the perspective of descriptive statistics.

Table 5.1 Cross-tabulation of Starts by County and Channel

                         COUNTS                       FREQUENCIES
COUNTY            TM      DM   OTHER    TOTAL      TM     DM  OTHER   TOTAL
BRONX          3,212     413   2,936    6,561    2.5%   0.3%   2.3%    5.1%
KINGS          9,773   1,393  11,025   22,191    7.7%   1.1%   8.6%   17.4%
NASSAU         3,135   1,573  10,367   15,075    2.5%   1.2%   8.1%   11.8%
NEW YORK       7,194   2,867  28,965   39,026    5.6%   2.2%  22.7%   30.6%
QUEENS         6,266   1,380  10,954   18,600    4.9%   1.1%   8.6%   14.6%
RICHMOND         784     277   1,772    2,833    0.6%   0.2%   1.4%    2.2%
SUFFOLK        2,911   1,042   7,159   11,112    2.3%   0.8%   5.6%    8.7%
WESTCHESTER    2,711   1,230   8,271   12,212    2.1%   1.0%   6.5%    9.6%
TOTAL         35,986  10,175  81,449  127,610   28.2%   8.0%  63.8%  100.0%
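The relative frequencies in Table 5.1 are each count divided by the grand total. The sketch below recomputes them from the counts in the table; the dictionary layout and the function name `relative_frequencies` are choices made for this example, not something prescribed by the text.

```python
# Counts of new customers by county and channel, copied from Table 5.1.
counts = {
    "BRONX":       {"TM": 3212, "DM": 413,  "OTHER": 2936},
    "KINGS":       {"TM": 9773, "DM": 1393, "OTHER": 11025},
    "NASSAU":      {"TM": 3135, "DM": 1573, "OTHER": 10367},
    "NEW YORK":    {"TM": 7194, "DM": 2867, "OTHER": 28965},
    "QUEENS":      {"TM": 6266, "DM": 1380, "OTHER": 10954},
    "RICHMOND":    {"TM": 784,  "DM": 277,  "OTHER": 1772},
    "SUFFOLK":     {"TM": 2911, "DM": 1042, "OTHER": 7159},
    "WESTCHESTER": {"TM": 2711, "DM": 1230, "OTHER": 8271},
}


def relative_frequencies(table):
    """Divide every cell by the grand total of all cells."""
    grand_total = sum(sum(row.values()) for row in table.values())
    return {
        county: {channel: n / grand_total for channel, n in row.items()}
        for county, row in table.items()
    }


grand_total = sum(sum(row.values()) for row in counts.values())
frequencies = relative_frequencies(counts)

print(f"grand total: {grand_total:,}")
for county, row in frequencies.items():
    cells = "  ".join(f"{channel} {share:5.1%}" for channel, share in row.items())
    print(f"{county:<12} {cells}  TOTAL {sum(row.values()):5.1%}")
```

Because every cell is divided by the same grand total, the frequencies across the whole table sum to 100 percent, and the row and column totals show each county's and each channel's share of all starts.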
