Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management

DEGREES OF FREEDOM

The degrees of freedom is the number of independent variables needed to describe the table of expected values. This is a measure of how constrained the data is in the table.

If the table has r rows and c columns, then there are r * c cells in the table.

With no constraints on the table, this is the number of variables that would be needed. However, the calculation of the expected values has imposed some constraints. In particular, the sum of the values in each row is the same for the expected values as for the original table, because the sum of each row is fixed.

That is, if one value were missing, we could recalculate it by taking the constraint into account by subtracting the sum of the rest of values in the row from the sum for the whole row. This suggests that the degrees of freedom is r * c – r. The same situation exists for the columns, yielding an estimate of r * c – r – c.

However, there is one additional constraint. The sum of all the row sums and the sum of all the column sums must be the same. It turns out, we have over counted the constraints by one, so the degrees of freedom is really r * c – r – c + 1. Another way of writing this is (r – 1) * (c – 1).
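The counting argument above can be checked with a small sketch. The table of counts and the function names below are hypothetical, chosen only for illustration. The sketch computes expected values as row sum times column sum divided by the total, which is the usual calculation under independence and is assumed here to match the one used earlier in the chapter.

```python
def expected_values(table):
    """Expected cell values: row sum * column sum / grand total."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    return [[rs * cs / total for cs in col_sums] for rs in row_sums]


def degrees_of_freedom(r, c):
    """r * c cells, minus r row constraints, minus c column constraints,
    plus 1 because the constraints were overcounted by one."""
    return r * c - r - c + 1


# Hypothetical 3 x 2 table of counts (illustrative only).
observed = [[10, 20], [30, 40], [50, 60]]
expected = expected_values(observed)

r, c = len(observed), len(observed[0])
dof = degrees_of_freedom(r, c)

# The row sums of the expected values match the original row sums,
# which is the constraint the text describes.
observed_row_sums = [sum(row) for row in observed]
expected_row_sums = [sum(row) for row in expected]
print(dof, observed_row_sums, expected_row_sums)
```

Because each row sum and each column sum is preserved, any single missing expected value in a row can be recovered from the others, which is why those cells do not count as free.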

The result is the probability that the distribution of values in the table is due to random fluctuations rather than some external criteria. As Occam’s Razor suggests, the simplest explanation is that there is no difference at all due to the various factors; that observed differences from expected values are entirely within the range of expectation.

Comparison of Chi-Square to Difference of Proportions

Chi-square and difference of proportions can be applied to the same problems.

Although the results are not exactly the same, the results are similar enough for comfort. Earlier, in Table 5.4, we determined the likelihood of champion and challenger results being the same using the difference of proportions method for a range of champion response rates. Table 5.7 repeats this using the chi-square calculation instead of the difference of proportions. The results from the chi-square test are very similar to the results from the difference of proportions—a remarkable result considering how different the two methods are.
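To make the comparison concrete, here is a minimal sketch of both calculations on a single champion/challenger split. The counts are illustrative assumptions, and the function names are hypothetical. The difference of proportions version uses an unpooled standard error, which is one common form and may differ in detail from the calculation behind Table 5.4. With one degree of freedom, the chi-square p-value can be obtained from the normal distribution, so only the standard library is needed.

```python
import math


def chi_square_2x2(a_yes, a_no, b_yes, b_no):
    """Chi-square value and p-value (1 degree of freedom) for a 2 x 2 table."""
    observed = [[a_yes, a_no], [b_yes, b_no]]
    row_sums = [a_yes + a_no, b_yes + b_no]
    col_sums = [a_yes + b_yes, a_no + b_no]
    total = sum(row_sums)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            exp = row_sums[i] * col_sums[j] / total
            chi2 += (observed[i][j] - exp) ** 2 / exp
    # For 1 degree of freedom, P(X > chi2) = P(|Z| > sqrt(chi2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def difference_of_proportions(a_yes, a_no, b_yes, b_no):
    """z-score and two-sided p-value using an unpooled standard error."""
    n_a, n_b = a_yes + a_no, b_yes + b_no
    p_a, p_b = a_yes / n_a, b_yes / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Illustrative counts: challenger 5,000 of 100,000 respond (5.0%),
# champion 45,900 of 900,000 respond (5.1%).
chi2, p_chi = chi_square_2x2(5_000, 95_000, 45_900, 854_100)
z, p_diff = difference_of_proportions(5_000, 95_000, 45_900, 854_100)
print(f"chi-square={chi2:.3f} p={p_chi:.4f}; z={z:.3f} p={p_diff:.4f}")
```

The two p-values come out close but not identical, because the chi-square test effectively pools the two groups when estimating the variance while this form of the difference of proportions does not.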

470643 c05.qxd 3/8/04 11:11 AM Page 154

154 Chapter 5

Table 5.7 Chi-Square Calculation for Difference of Proportions Example in Table 5.4

[Table body not recoverable. Columns: challenger and champion responders and non-responders, overall response rate, expected values, chi-square values, chi-square p-value, and difference of proportions p-value.]


The Lure of Statistics: Data Mining Using Familiar Tools 155

An Example: Chi-Square for Regions and Starts

A large consumer-oriented company has been running acquisition campaigns in the New York City area. The purpose of this analysis is to look at their acquisition channels to try to gain an understanding of different parts of the area.

For the purposes of this analysis, three channels are of interest:

Telemarketing. Customers who are acquired through outbound telemarketing calls (note that this data was collected before the national do-not-call list went into effect).

Direct mail. Customers who respond to direct mail pieces.

Other. Customers who come in through other means.

The area of interest consists of eight counties in New York State. Five of these counties are the boroughs of New York City, two others (Nassau and Suffolk counties) are on Long Island, and one (Westchester) lies just north of the city. This data was shown earlier in Table 5.1. The purpose of this analysis is to determine whether the breakdown of starts by channel and county is due to chance or whether some other factors might be at work.

This problem is particularly suitable for chi-square because the data can be laid out in rows and columns, with no customer being counted in more than one cell. Table 5.8 shows the deviation, expected values, and chi-square values for each combination in the table. Notice that the chi-square values are often quite large in this example. The overall chi-square score for the table is 7,200, which is very large; the probability that the overall score is due to chance is basically 0. That is, the variation among starts by channel and by region is not due to sample variation. There are other factors at work.
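The per-cell quantities described above (expected value, deviation, and chi-square value) can be sketched as follows. The counts here are hypothetical and much smaller than the real county data, which is not reproduced; the function name is also an assumption made for this example.

```python
def chi_square_table(observed):
    """Return expected values, deviations, per-cell chi-square values,
    and the overall chi-square score for a table of counts."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)

    expected = [[rs * cs / total for cs in col_sums] for rs in row_sums]
    deviation = [
        [obs - exp for obs, exp in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(observed, expected)
    ]
    cell_chi2 = [
        [dev ** 2 / exp for dev, exp in zip(dev_row, exp_row)]
        for dev_row, exp_row in zip(deviation, expected)
    ]
    overall = sum(sum(row) for row in cell_chi2)
    return expected, deviation, cell_chi2, overall


# Hypothetical starts for three counties (rows) by three channels
# (columns: telemarketing, direct mail, other). Illustrative only.
starts = [
    [100, 200, 300],
    [150, 150, 300],
    [50, 250, 300],
]
expected, deviation, cell_chi2, overall = chi_square_table(starts)
for dev_row, chi_row in zip(deviation, cell_chi2):
    print([round(d, 1) for d in dev_row], [round(x, 2) for x in chi_row])
print("overall chi-square:", round(overall, 2))
```

The deviations in every row and every column sum to zero, which reflects the same constraints that reduce the degrees of freedom.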

The next step is to determine which of the values are too high and too low and with what probability. It is tempting to convert each chi-square value in each cell into a probability, using the degrees of freedom for the table. The table is 8 × 3, so it has 14 degrees of freedom. However, this is not an appropriate thing to do. The chi-square result is for the entire table; inverting the individual scores to get a probability does not produce valid results. Chi-square scores are not additive.

An alternative approach proves more accurate. The idea is to compare each cell to everything else. The result is a table that has two columns and two rows, as shown in Table 5.9. One column is the column of the original cell; the other column is everything else. One row is the row of the original cell; the other row is everything else.
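The collapse of one cell against everything else can be sketched as below. This is an illustration under stated assumptions rather than the exact procedure behind Table 5.9: the counts and function names are hypothetical, and the chi-square value uses the standard shortcut formula for a 2 x 2 table, with the p-value taken at one degree of freedom.

```python
import math


def collapse_to_2x2(table, i, j):
    """Collapse a table to 2 x 2: cell (i, j) versus everything else."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    a = table[i][j]              # the cell itself
    b = row_sums[i] - a          # same row, other columns
    c = col_sums[j] - a          # same column, other rows
    d = total - a - b - c        # everything else
    return [[a, b], [c, d]]


def chi_square_2x2(two_by_two):
    """Chi-square value, p-value (1 degree of freedom), and direction.

    Assumes no row or column of the 2 x 2 table sums to zero.
    """
    (a, b), (c, d) = two_by_two
    n = a + b + c + d
    cross = a * d - b * c
    chi2 = n * cross ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    direction = "above" if cross > 0 else "below" if cross < 0 else "equal to"
    return chi2, p_value, direction


# Hypothetical starts for three counties by three channels (illustrative only).
starts = [
    [100, 200, 300],
    [150, 150, 300],
    [50, 250, 300],
]
for i, row in enumerate(starts):
    for j, _ in enumerate(row):
        chi2, p, direction = chi_square_2x2(collapse_to_2x2(starts, i, j))
        print(f"cell ({i}, {j}): {direction} expected, "
              f"chi-square={chi2:.2f}, p={p:.4f}")
```

Each collapsed table has one degree of freedom, so its chi-square value can be converted to a probability directly, which is not valid for the individual cell scores of the original table.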


[Table 5.8: expected values, deviations, and chi-square values for the TM, DM, and Other channels in each county. Table body not recoverable.]
