Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management

The summation, Euclidean, and normalized functions can also incorporate weights so that each field contributes a different amount to the record distance function. MBR usually produces good results when all the weights are equal to 1. However, weights can sometimes be used to incorporate a priori knowledge, such as when a particular field is suspected of having a large effect on the classification.
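As a sketch of how per-field weights enter the calculation, here is a weighted version of the Euclidean distance; the function name and the example values are illustrative, not from the book:

```python
import math

def weighted_euclidean(a, b, weights):
    """Euclidean distance in which each field's squared difference is
    scaled by a per-field weight; a weight of 1.0 on every field
    reproduces the ordinary Euclidean distance."""
    return math.sqrt(sum(w * (x - y) ** 2
                         for x, y, w in zip(a, b, weights)))

# Equal weights: the standard Euclidean distance.
print(weighted_euclidean([0.0, 0.0], [3.0, 4.0], [1.0, 1.0]))  # 5.0

# Doubling the weight on the first field makes it count for more.
print(weighted_euclidean([0.0, 0.0], [3.0, 4.0], [2.0, 1.0]))
```

With the second weight vector the distance grows to sqrt(2*9 + 16), reflecting the extra influence granted to the first field.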

Distance Functions for Other Data Types

A 5-digit American zip code is often represented as a simple number. Do any of the default distance functions for numeric fields make any sense? No. The difference between two randomly chosen zip codes has no meaning. Well, almost no meaning; a zip code does encode location information. The first three digits represent a postal zone—for instance, all zip codes on Manhattan start with “100,” “101,” or “102.”

Table 8.11 Set of Nearest Neighbors for New Customer

          1      2      3      4      5      NEIGHBORS
dsum      1.662  1.659  1.338  1.003  1.640  4,3,5,2,1
dnorm     0.554  0.553  0.446  0.334  0.547  4,3,5,2,1
dEuclid   0.781  1.052  1.251  0.494  1.000  4,1,5,2,3


Furthermore, there is a general pattern of zip codes increasing from East to West. Codes that start with 0 are in New England and Puerto Rico; those beginning with 9 are on the west coast. This suggests a distance function that approximates geographic distance by looking at the high order digits of the zip code.

- dzip(A,B) = 0.0 if the zip codes are identical
- dzip(A,B) = 0.1 if the first three digits are identical (e.g., “20008” and “20015”)
- dzip(A,B) = 0.5 if only the first digit is identical (e.g., “95050” and “98125”)
- dzip(A,B) = 1.0 if the first digits are not identical (e.g., “02138” and “94704”)
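These rules translate directly into code. A minimal sketch (the function name is ours; zip codes are passed as 5-character strings):

```python
def d_zip(a, b):
    """Distance between two 5-digit zip code strings: 0.0 if identical,
    0.1 if in the same postal zone (first three digits), 0.5 if in the
    same broad region (first digit), 1.0 otherwise."""
    if a == b:
        return 0.0
    if a[:3] == b[:3]:
        return 0.1
    if a[0] == b[0]:
        return 0.5
    return 1.0

print(d_zip("20008", "20015"))  # 0.1 (same postal zone)
print(d_zip("95050", "98125"))  # 0.5 (same first digit)
print(d_zip("02138", "94704"))  # 1.0 (different regions)
```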

Of course, if geographic distance were truly of interest, a better approach would be to look up the latitude and longitude of each zip code in a table and calculate the distances that way (it is possible to get this information for the United States from www.census.gov). For many purposes, however, geographic proximity is not nearly as important as some other measure of similarity. Zip codes 10011 and 10031 are both in Manhattan, but from a marketing point of view they do not have much else in common, because one is an upscale downtown neighborhood and the other is a working-class Harlem neighborhood. On the other hand, 02138 and 94704 are on opposite coasts, but are likely to respond very similarly to direct mail from a political action committee, since they cover Cambridge, MA and Berkeley, CA respectively.
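If geographic distance really were the goal, the lookup approach might look like the following sketch. The haversine formula is standard; the two-entry coordinate table is a stand-in for a real zip-to-coordinates file such as the census data, and the coordinates are approximate:

```python
import math

# Stand-in for a zip-to-coordinates lookup table; values are
# approximate (latitude, longitude) in degrees.
ZIP_COORDS = {
    "02138": (42.38, -71.13),   # Cambridge, MA
    "94704": (37.87, -122.26),  # Berkeley, CA
}

def haversine_miles(p, q):
    """Great-circle distance in miles between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2)
         * math.sin((lon2 - lon1) / 2) ** 2)
    return 3959 * 2 * math.asin(math.sqrt(a))  # Earth radius ~3959 miles

dist = haversine_miles(ZIP_COORDS["02138"], ZIP_COORDS["94704"])
print(round(dist))  # roughly 2,700 miles, coast to coast
```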

This is just one example of how the choice of a distance metric depends on the data mining context. There are additional examples of distance and similarity measures in Chapter 11 where they are applied to clustering.

When a Distance Metric Already Exists

There are some situations where a distance metric already exists, but is difficult to spot. These situations generally arise in one of two forms. Sometimes, a function already exists that provides a distance measure that can be adapted for use in MBR. The news story case study provides a good example of adapting an existing function, the relevance feedback score, for use as a distance function.

Other times, there are fields that do not appear to capture distance, but can be pressed into service. An example of such a hidden distance field is solicitation history. Two customers who were chosen for a particular solicitation in the past are “close,” even though the reasons why they were chosen may no longer be available; two who were not chosen are close, but not as close; and one that was chosen and one that was not are far apart. The advantage of this metric is that it can incorporate previous decisions, even if the basis for the decisions is no longer available. On the other hand, it does not work well for customers who were not around during the original solicitation, so some sort of neutral weighting must be applied to them.

Considering whether the original customers responded to the solicitation can extend this function further, resulting in a solicitation metric like:

- dsolicitation(A, B) = 0.0, when A and B both responded to the solicitation
- dsolicitation(A, B) = 0.1, when A and B were both chosen but neither responded
- dsolicitation(A, B) = 0.2, when neither A nor B was chosen, but both were available in the data
- dsolicitation(A, B) = 0.3, when A and B were both chosen, but only one responded
- dsolicitation(A, B) = 0.3, when one or both were not considered
- dsolicitation(A, B) = 1.0, when one was chosen and the other was not

Of course, the particular values are not sacrosanct; they are only meant as a guide for measuring similarity and showing how previous information and response histories can be incorporated into a distance function.
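One way to code this metric is to summarize each customer's history for the solicitation as a single status value; the status names and the function name here are our own encoding of the six cases above:

```python
def d_solicitation(a, b):
    """Distance based on a past solicitation.  Each argument is one of:
    'responded'  - chosen and responded
    'chosen'     - chosen but did not respond
    'not_chosen' - available in the data but not chosen
    'absent'     - not around during the original solicitation"""
    if a == "absent" or b == "absent":
        return 0.3                      # one or both not considered
    if a == b == "responded":
        return 0.0
    if a == b == "chosen":
        return 0.1
    if a == b == "not_chosen":
        return 0.2
    chosen = {"responded", "chosen"}
    if a in chosen and b in chosen:
        return 0.3                      # both chosen, only one responded
    return 1.0                          # one chosen, the other not

print(d_solicitation("responded", "chosen"))     # 0.3
print(d_solicitation("chosen", "not_chosen"))    # 1.0
```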

The Combination Function: Asking the Neighbors for the Answer

The distance function is used to determine which records comprise the neighborhood. This section presents different ways to combine data gathered from those neighbors to make a prediction. At the beginning of this chapter, we estimated the median rent in the town of Tuxedo by taking an average of the median rents in similar towns. In that example, averaging was the combination function. This section explores other methods of canvassing the neighborhood.

The Basic Approach: Democracy

One common combination function is for the k nearest neighbors to vote on an answer—”democracy” in data mining. When MBR is used for classification, each neighbor casts its vote for its own class. The proportion of votes for each class is an estimate of the probability that the new record belongs to the corresponding class. When the task is to assign a single class, it is simply the one with the most votes. When there are only two categories, an odd number of neighbors should be polled to avoid ties. As a rule of thumb, use c+1 neighbors when there are c categories to ensure that at least one class has a plurality.


In Table 8.12, the five test cases seen earlier have been augmented with a flag that signals whether the customer has become inactive.

For this example, three of the customers have become inactive and two have not, an almost balanced training set. For illustrative purposes, let’s try to determine whether the new record is active or inactive by using different values of k for two distance functions, dsum and dEuclid (Table 8.13).

The question marks indicate that no prediction has been made due to a tie among the neighbors. Notice that different values of k do affect the classification. This suggests using the percentage of neighbors in agreement to provide the level of confidence in the prediction (Table 8.14).

Table 8.12 Customers with Attrition History

RECNUM  GENDER  AGE  SALARY    INACTIVE
1       female  27   $19,000   no
2       male    51   $64,000   yes
3       male    52   $105,000  yes
4       female  33   $55,000   yes
5       male    45   $45,000   no
new     female  45   $100,000  ?

Table 8.13 Using MBR to Determine if the New Customer Will Become Inactive

          NEIGHBORS   NEIGHBOR ATTRITION   K = 1  K = 2  K = 3  K = 4  K = 5
dsum      4,3,5,2,1   Y,Y,N,Y,N            yes    yes    yes    yes    yes
dEuclid   4,1,5,2,3   Y,N,N,Y,Y            yes    ?      no     ?      yes

Table 8.14 Attrition Prediction with Confidence

          K = 1      K = 2      K = 3     K = 4     K = 5
dsum      yes, 100%  yes, 100%  yes, 67%  yes, 75%  yes, 60%
dEuclid   yes, 100%  yes, 50%   no, 67%   yes, 50%  yes, 60%
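A sketch of this voting scheme in code, applied to the dsum neighbor ordering from Table 8.13 (the function and variable names are ours):

```python
from collections import Counter

def knn_vote(neighbor_classes, k):
    """Classify by majority vote of the k nearest neighbors, returning
    the winning class and the fraction of neighbors that agree.
    Ties (the question marks in Table 8.13) fall to whichever class
    appears first; a real system would need an explicit tie-break."""
    votes = Counter(neighbor_classes[:k])
    winner, count = votes.most_common(1)[0]
    return winner, count / k

# Attrition flags of the neighbors, nearest first (dsum order: 4,3,5,2,1).
attrition = ["yes", "yes", "no", "yes", "no"]
for k in range(1, 6):
    print(k, knn_vote(attrition, k))
```

The loop reproduces the dsum row of Table 8.14: “yes” at every k, with confidence falling from 100 percent at k = 1 to 60 percent at k = 5.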


The confidence level works just as well when there are more than two categories. However, with more categories, there is a greater chance that no single category will have a majority vote. One of the key assumptions about MBR (and data mining in general) is that the training set provides sufficient information for predictive purposes. If the neighborhoods of new cases consistently produce no obvious choice of classification, then the data simply may not contain the necessary information, and the choice of dimensions, and possibly of the training set, needs to be reevaluated. By measuring the effectiveness of MBR on the test set, you can determine whether the training set has a sufficient number of examples.

WARNING: MBR is only as good as the training set it uses. To measure whether the training set is effective, measure the results of its predictions on the test set using two, three, and four neighbors. If the results are inconclusive or inaccurate, then the training set is not large enough or the dimensions and distance metrics chosen are not appropriate.

Weighted Voting

Weighted voting is similar to voting in the previous section except that the neighbors are not all created equal—more like shareholder democracy than one-person, one-vote. The size of the vote is inversely proportional to the distance from the new record, so closer neighbors have stronger votes than neighbors farther away do. To prevent problems when the distance might be 0, it is common to add 1 to the distance before taking the inverse. Adding 1 also makes all the votes between 0 and 1.
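Using the inverse-distance scheme just described, weighted voting can be sketched as follows. The distances and attrition flags are the dEuclid neighbors from Tables 8.11 and 8.13; the function name is ours:

```python
def weighted_vote(neighbors):
    """Weighted k-NN vote: each neighbor is a (distance, class) pair
    and casts a vote of 1 / (1 + distance), so closer neighbors count
    more and every individual vote lies between 0 and 1."""
    totals = {}
    for distance, cls in neighbors:
        totals[cls] = totals.get(cls, 0.0) + 1.0 / (1.0 + distance)
    return max(totals, key=totals.get)

# dEuclid neighbors of the new customer (order 4,1,5,2,3) with their
# distances from Table 8.11 and attrition flags from Table 8.13.
neighbors = [(0.494, "yes"), (0.781, "no"), (1.000, "no"),
             (1.052, "yes"), (1.251, "yes")]
print(weighted_vote(neighbors))  # yes
```

Here the three “yes” neighbors accumulate about 1.60 in weighted votes against roughly 1.06 for the two “no” neighbors, so the weighted vote agrees with the unweighted k = 5 result.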

