Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management

The human ability to reason from experience depends on the ability to recognize appropriate examples from the past. A doctor diagnosing diseases, a claims analyst flagging fraudulent insurance claims, and a mushroom hunter spotting morels are all following a similar process. Each first identifies similar cases from experience and then applies their knowledge of those cases to the problem at hand. This is the essence of memory-based reasoning (MBR). A database of known records is searched to find preclassified records similar to a new record. These neighbors are used for classification and estimation.

Applications of MBR span many areas:

Fraud detection. New cases of fraud are likely to be similar to known cases. MBR can find and flag them for further investigation.

Customer response prediction. The next customers likely to respond to an offer are probably similar to previous customers who have responded. MBR can easily identify the next likely customers.

Medical treatments. The most effective treatment for a given patient is probably the treatment that resulted in the best outcomes for similar patients. MBR can find the treatment that produces the best outcome.

Classifying responses. Free-text responses, such as those on the U.S. Census form for occupation and industry or complaints coming from customers, need to be classified into a fixed set of codes. MBR can process the free text and assign the codes.

One of the strengths of MBR is its ability to use data “as is.” Unlike other data mining techniques, it does not care about the format of the records. It only cares about the existence of two operations: a distance function capable of calculating a distance between any two records and a combination function capable of combining results from several neighbors to arrive at an answer. These functions are readily defined for many kinds of records, including records with complex or unusual data types such as geographic locations, images, and free text that are usually difficult to handle with other analysis techniques. A case study later in the chapter shows MBR’s successful application to the classification of news stories, an example that takes advantage of the full text of the news story to assign subject codes.
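The two operations can be sketched as plain functions passed into a generic lookup. What follows is a minimal illustration, not the method used in the case study: the names `mbr_predict`, `euclidean`, `word_overlap_distance`, and `majority_vote` are hypothetical, the records are made up, and the word-overlap (Jaccard) measure is only one assumed way of defining a distance on free text.

```python
from collections import Counter
from math import sqrt


def euclidean(a, b):
    """Distance function for numeric records: straight-line distance."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def word_overlap_distance(a, b):
    """Distance function for free text: 1 minus the Jaccard similarity of the word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return 1 - len(words_a & words_b) / len(words_a | words_b)


def majority_vote(labels):
    """Combination function: the most common label among the neighbors."""
    return Counter(labels).most_common(1)[0][0]


def mbr_predict(new_record, known_records, distance, combine, k=3):
    """Find the k known records nearest to new_record and combine their labels.

    known_records is a list of (record, label) pairs.
    """
    nearest = sorted(known_records, key=lambda rec: distance(new_record, rec[0]))[:k]
    return combine([label for _, label in nearest])


# Made-up preclassified records, purely for illustration.
numeric_cases = [
    ((1.0, 1.0), "fraud"),
    ((1.2, 0.9), "fraud"),
    ((5.0, 5.0), "ok"),
    ((5.5, 4.5), "ok"),
    ((6.0, 5.2), "ok"),
]
text_cases = [
    ("late delivery of package", "shipping"),
    ("package arrived damaged", "shipping"),
    ("charged twice on my card", "billing"),
    ("wrong amount charged", "billing"),
]

numeric_label = mbr_predict((1.1, 1.0), numeric_cases, euclidean, majority_vote, k=3)
text_label = mbr_predict("my card was charged twice", text_cases,
                         word_overlap_distance, majority_vote, k=1)
print(numeric_label, text_label)
```

The lookup itself never inspects the records. Swapping in a different distance function is all it takes to move from numeric fields to text, which is the sense in which MBR uses data “as is.”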

Another strength of MBR is its ability to adapt. Merely incorporating new data into the historical database makes it possible for MBR to learn about new categories and new definitions of old ones. MBR also produces good results without a long period devoted to training or to massaging incoming data into the right format.

These advantages come at a cost. MBR tends to be a resource hog since a large amount of historical data must be readily available for finding neighbors. Classifying new records can require processing all the historical records to find the most similar neighbors, a more time-consuming process than applying an already-trained neural network or an already-built decision tree. There is also the challenge of finding good distance and combination functions, which often requires a bit of trial and error and intuition.

Example: Using MBR to Estimate Rents in Tuxedo, New York

The purpose of this example is to illustrate how MBR works by estimating the cost of renting an apartment in the target town by combining data on rents in several similar towns—its nearest neighbors.

MBR works by first identifying neighbors and then combining information from them. Figure 8.1 illustrates the first of these steps. The goal is to make predictions about the town of Tuxedo in Orange County, New York, by looking at its neighbors. These are not its geographic neighbors along the Hudson and Delaware rivers, but its neighbors based on descriptive variables: in this case, population and median home value. The scatter plot shows New York towns arranged by these two variables. Figure 8.1 shows that, measured this way, Brooklyn and Queens are close neighbors, and both are far from Manhattan. Although Manhattan is nearly as populous as Brooklyn and Queens, its home prices put it in a class by itself.

TIP Neighborhoods can be found in many dimensions. The choice of dimensions determines which records are close to one another. For some purposes, geographic proximity might be important. For other purposes, home price or average lot size or population density might be more important. The choice of dimensions and the choice of a distance metric are crucial to any nearest-neighbor approach.


The first stage of MBR finds the closest neighbor on the scatter plot shown in Figure 8.1. Then the next closest neighbor is found, and so on until the desired number are available. In this case, the number of neighbors is two and the nearest ones turn out to be Shelter Island (which really is an island) way out by the tip of Long Island’s North Fork, and North Salem, a town in Northern Westchester near the Connecticut border. These towns fall at about the middle of a list sorted by population and near the top of one sorted by home value. Although they are many miles apart, along these two dimensions, Shelter Island and North Salem are very similar to Tuxedo.
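The neighbor search can be sketched in a few lines. This is only an illustration under stated assumptions: the town names and figures below are invented placeholders rather than census data, the helper names `scaled_points` and `nearest_neighbors` are hypothetical, and rescaling both axes to the range 0 to 1 before taking Euclidean distance is one assumed way of mimicking "closeness on the scatter plot."

```python
import math

# Invented placeholder towns: name -> (population, median home value in dollars).
# These are NOT census figures; they only mimic the shape of the example.
towns = {
    "Town A": (2_500_000, 230_000),
    "Town B": (2_200_000, 210_000),
    "Town C": (1_500_000, 1_000_000),
    "Town D": (2_300, 390_000),
    "Town E": (5_000, 380_000),
    "Town F": (18_000, 700_000),
    "Target": (3_300, 360_000),
}


def scaled_points(records):
    """Map each town to (log10 population, home value), each rescaled to [0, 1]."""
    raw = {name: (math.log10(pop), value) for name, (pop, value) in records.items()}
    lows = [min(point[i] for point in raw.values()) for i in range(2)]
    highs = [max(point[i] for point in raw.values()) for i in range(2)]
    return {
        name: tuple((point[i] - lows[i]) / (highs[i] - lows[i]) for i in range(2))
        for name, point in raw.items()
    }


def nearest_neighbors(target, records, k=2):
    """Return the names of the k towns closest to the target on the scaled axes."""
    points = scaled_points(records)
    others = [name for name in points if name != target]
    others.sort(key=lambda name: math.dist(points[target], points[name]))
    return others[:k]


print(nearest_neighbors("Target", towns, k=2))
```

Without the logarithm and the rescaling, raw population differences in the millions would swamp home value entirely, so the choice of scaling is effectively part of the distance function.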

Once the neighbors have been located, the next step is to combine information from the neighbors to infer something about the target. For this example, the goal is to estimate the cost of renting a house in Tuxedo. There is more than one reasonable way to combine data from the neighbors. The census provides information on rents in two forms. Table 8.1 shows what the 2000 census reports about rents in the two towns selected as neighbors. For each town, there is a count of the number of households paying rent in each of several price bands as well as the median rent for each town. The challenge is to figure out how best to use this data to characterize rents in the neighbors and then how to combine information from the neighbors to come up with an estimate that characterizes rents in Tuxedo in the same way.

Tuxedo’s nearest neighbors, the towns of North Salem and Shelter Island, have quite different distributions of rents even though the median rents are similar. In Shelter Island, a plurality of homes, 34.6 percent, rent in the $500 to $750 range. In the town of North Salem, the largest number of homes, 30.9 percent, rent in the $1,000 to $1,500 range. Furthermore, while only 3.1 percent of homes in Shelter Island rent for over $1,500, 24.2 percent of homes in North Salem do. On the other hand, at $804, the median rent in Shelter Island is above the $750 ceiling of the most common range, while the median rent in North Salem, $1,150, falls within the most common range for that town. If the average rent were available, it too would be a good candidate for characterizing the rents in the various towns.

Table 8.1 The Neighbors

TOWN            POPULATION  MEDIAN RENT  RENT <$500 (%)  RENT $500-$750 (%)  RENT $750-$1,000 (%)  RENT $1,000-$1,500 (%)  RENT >$1,500 (%)  NO RENT (%)
Shelter Island  2,228       $804         3.1             34.6                31.4                  10.7                    3.1               17
North Salem     5,173       $1,150       3               10.2                21.6                  30.9                    24.2              10.2


[Figure 8.1: scatter plot of New York towns by log population and median home value. Labeled points: Brooklyn (Kings), Queens (Queens), Manhattan (New York), Scarsdale (Westchester), North Salem (Westchester), Tuxedo (Orange), and Shelter Island (Suffolk).]

Figure 8.1 Based on 2000 census population and home value, the town of Tuxedo in Orange County has Shelter Island and North Salem as its two nearest neighbors.


One possible combination function would be to average the most common rents of the two neighbors. Since only ranges are available, we use the midpoints. For Shelter Island, the midpoint of the most common range ($500 to $750) is $625. For North Salem, it is $1,250. Averaging the two leads to an estimate for rent in Tuxedo of $937.50. Another combination function would pick the point midway between the two median rents. This second method leads to an estimate of $977 for rents in Tuxedo.

As it happens, a plurality of rents in Tuxedo are in the $1,000 to $1,500 range with the midpoint at $1,250. The median rent in Tuxedo is $907. So, averaging the medians slightly overestimates the median rent in Tuxedo, and averaging the most common rents underestimates the most common rent in Tuxedo. It is hard to say which is better. The moral is that there is not always an obvious “best” combination function.
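Both combination functions can be computed directly from the Table 8.1 figures. The sketch below is a small worked example rather than anything prescribed by the text: the data layout and the helper names `modal_midpoint` and `average` are assumptions made for illustration, and only the closed rent bands are listed because the open-ended bands (under $500, over $1,500) have no midpoint and are not the most common band in either town.

```python
# Figures transcribed from Table 8.1 (2000 census).
# Each band is (low, high) in dollars, mapped to the percent of households in it.
neighbors = {
    "Shelter Island": {
        "median_rent": 804,
        "bands": {(500, 750): 34.6, (750, 1000): 31.4, (1000, 1500): 10.7},
    },
    "North Salem": {
        "median_rent": 1150,
        "bands": {(500, 750): 10.2, (750, 1000): 21.6, (1000, 1500): 30.9},
    },
}


def modal_midpoint(bands):
    """Midpoint of the rent band that holds the largest share of households."""
    low, high = max(bands, key=bands.get)
    return (low + high) / 2


def average(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Combination function 1: average the midpoints of each neighbor's most common band.
by_common_rent = average(modal_midpoint(town["bands"]) for town in neighbors.values())

# Combination function 2: the point midway between the neighbors' median rents.
by_median = average(town["median_rent"] for town in neighbors.values())

print(f"average of most common rents: ${by_common_rent:,.2f}")
print(f"average of median rents:      ${by_median:,.2f}")
```

Because the two functions summarize different properties of the neighbors (the typical band versus the middle household), each estimate is best compared with the matching statistic for the target town.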

Challenges of MBR

In the simple example just given, the training set consisted of all towns in New York, each described by a handful of numeric fields such as the population, median home value, and median rent. Distance was determined by placement on a scatter plot with axes scaled to appropriate ranges, and the number of
