This property precludes one-way roads, for instance.
4. Triangle Inequality. Visiting an intermediate point C on the way from A to B never shortens the distance, so d(A,B) ≤ d(A,C) + d(C,B).
For MBR, the points are really records in a database. This formal definition of distance is the basis for measuring similarity, but MBR still works pretty well when some of these constraints are relaxed a bit. For instance, the distance function in the news story classification case study was not commutative; that is, the distance from a news story A to another B was not always the same as the distance from B to A. However, the similarity measure was still useful for classification purposes.
What makes these properties useful for MBR? The fact that distance is well-defined implies that every record has a neighbor somewhere in the database—
and MBR needs neighbors in order to work. The identity property makes distance conform to the intuitive idea that the most similar record to a given record is the original record itself. Commutativity and the Triangle Inequality make the nearest neighbors local and well-behaved. Adding a new record into the database will not bring an existing record any closer. Similarity is a matter reserved for just two records at a time.
Although the distance measure used to find nearest neighbors is well-behaved, the set of nearest neighbors can have some peculiar properties. For instance, the nearest neighbor to a record B may be A, but A may have many neighbors closer than B, as shown in Figure 8.5. This situation does not pose problems for MBR.
Figure 8.5 B’s nearest neighbor is A, but A has many neighbors closer than B.
Memory-Based Reasoning and Collaborative Filtering 273
MEASURING THE EFFECTIVENESS OF ASSIGNING CODES: RECALL AND PRECISION
Recall and precision are two measurements that are useful for determining the appropriateness of a set of assigned codes or keywords. The case study on coding news stories, for instance, assigns many codes to news stories. Recall and precision can be used to evaluate these assignments.
Recall answers the question: “How many of the correct codes did MBR
assign to the story?” It is the ratio of codes assigned by MBR that are correct (as verified by editors) to the total number of correct codes on the story. If MBR
assigns all available codes to every story, then recall is 100 percent because the correct codes all get assigned, along with many other irrelevant codes. If MBR
assigns no codes to any story, then recall is 0 percent.
Precision answers the question: “How many of the codes assigned by MBR
were correct?” It is the ratio of correct codes assigned by MBR to the total number of codes assigned by MBR. Precision is 100 percent when MBR assigns only correct codes to a story. It is close to 0 percent when MBR assigns all codes to every story.
Neither recall nor precision individually gives the full story of how good the classification is. Ideally, we want 100 percent recall and 100 percent precision.
Often, it is possible to trade off one against the other. For instance, using more neighbors increases recall, but decreases precision. Or, raising the threshold increases precision but decreases recall. Table 8.5 gives some insight into these measurements for a few specific cases.
Table 8.5 Examples of Recall and Precision
CODES BY MBR       CORRECT CODES   RECALL   PRECISION
A,B,C,D            A,B,C,D         100%     100%
A,B                A,B,C,D         50%      100%
A,B,C,D,E,F,G,H    A,B,C,D         100%     50%
E,F                A,B,C,D         0%       0%
A,B,E,F            A,B,C,D         50%      50%
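These calculations are easy to reproduce. The sketch below treats the assigned and correct codes as sets; the function name and set-based treatment are illustrative, not taken from the case study itself.

```python
def recall_precision(assigned, correct):
    """Recall: fraction of the correct codes that MBR assigned.
    Precision: fraction of MBR's assigned codes that are correct."""
    assigned, correct = set(assigned), set(correct)
    hits = assigned & correct
    recall = len(hits) / len(correct) if correct else 0.0
    precision = len(hits) / len(assigned) if assigned else 0.0
    return recall, precision

# The rows of Table 8.5:
print(recall_precision("ABCD", "ABCD"))      # (1.0, 1.0)
print(recall_precision("AB", "ABCD"))        # (0.5, 1.0)
print(recall_precision("ABCDEFGH", "ABCD"))  # (1.0, 0.5)
print(recall_precision("EF", "ABCD"))        # (0.0, 0.0)
print(recall_precision("ABEF", "ABCD"))      # (0.5, 0.5)
```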
The original codes assigned to the stories by individual editors had a recall of 83 percent and a precision of 88 percent with respect to the validated set of correct codes. For MBR, the recall was 80 percent and the precision 72 percent.
However, these figures are averages across all categories. Table 8.6 breaks the results down by category, and MBR did significantly better in some of the categories.
Table 8.6 Recall and Precision Measurements by Code Category

CATEGORY        RECALL   PRECISION
Government      85%      87%
Industry        91%      85%
Market Sector   93%      91%
Product         69%      89%
Region          86%      64%
Subject         72%      53%
The variation in the results by category suggests that the original stories used for the training set may not have been coded consistently. The results from MBR can only be as good as the examples chosen for the training set.
Even so, MBR performed as well as all but the most experienced editors.
Building a Distance Function One Field at a Time
It is easy to understand distance as a geometric concept, but how can distance be defined for records consisting of many different fields of different types?
The answer is, one field at a time. Consider some sample records such as those shown in Table 8.7.
Figure 8.6 illustrates a scatter graph in three dimensions. The records are a bit complicated, with two numeric fields and one categorical. This example shows how to define field distance functions for each field, then combine them into a single record distance function that gives a distance between two records.
Table 8.7 Five Customers in a Marketing Database
RECNUM   GENDER   AGE   SALARY
1        female   27    $19,000
2        male     51    $64,000
3        male     52    $105,000
4        female   33    $55,000
5        male     45    $45,000
Figure 8.6 This scatter plot shows the five records from Table 8.7 in three dimensions—
age, salary, and gender—and suggests that standard distance is a good metric for nearest neighbors.
The four most common distance functions for numeric fields are:
■■ Absolute value of the difference: |A − B|
■■ Square of the difference: (A − B)²
■■ Normalized absolute value: |A − B|/(maximum difference)
■■ Absolute value of the difference of standardized values: |(A − mean)/(standard deviation) − (B − mean)/(standard deviation)|, which is equivalent to |A − B|/(standard deviation)
The advantage of the normalized absolute value is that it is always between 0 and 1. Since ages are much smaller than the salaries in this example, the normalized absolute value is a good choice for both of them—so neither field will dominate the record distance function (difference of standardized values is also a good choice). For the ages, the distance matrix looks like Table 8.8.
Table 8.8 Distance Matrix Based on Ages of Customers
      27     51     52     33     45
27    0.00   0.96   1.00   0.24   0.72
51    0.96   0.00   0.04   0.72   0.24
52    1.00   0.04   0.00   0.76   0.28
33    0.24   0.72   0.76   0.00   0.48
45    0.72   0.24   0.28   0.48   0.00
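The entries in Table 8.8 can be checked directly. The sketch below uses the maximum age difference in the sample (52 − 27 = 25) as the normalizing constant:

```python
ages = [27, 51, 52, 33, 45]
max_diff = max(ages) - min(ages)  # 25 for this sample

def d_age(a, b):
    """Normalized absolute value: always between 0 and 1."""
    return abs(a - b) / max_diff

# First row of Table 8.8: distances from the 27-year-old
print([round(d_age(27, b), 2) for b in ages])  # [0.0, 0.96, 1.0, 0.24, 0.72]
```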
Gender is an example of categorical data. The simplest distance function is the “identical to” function, which is 0 when the genders are the same and 1
otherwise:
dgender(female, female) = 0
dgender(female, male) = 1
dgender(male, female) = 1
dgender(male, male) = 0
So far, so simple. There are now three field distance functions that need to merge into a single record distance function. There are three common ways to do this:
■■ Manhattan distance or summation: dsum(A,B) = dgender(A,B) + dage(A,B) + dsalary(A,B)
■■ Normalized summation: dnorm(A,B) = dsum(A,B) / max(dsum)
■■ Euclidean distance: dEuclid(A,B) = sqrt(dgender(A,B)² + dage(A,B)² + dsalary(A,B)²)
Table 8.9 shows the nearest neighbors for each of the points using the three functions.
In this case, the sets of nearest neighbors are exactly the same regardless of how the component distances are combined. This is a coincidence, caused by the fact that the five records fall into two well-defined clusters. One of the clusters is lower-paid, younger females and the other is better-paid, older males.
These clusters imply that if two records are close to each other relative to one field, then they are close on all fields, so the way the distances on each field are combined is not important. This is not a very common situation.
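A minimal sketch of the full record distance, assuming salary (like age) uses the normalized absolute value and gender uses the “identical to” function. The range constants come from the sample in Table 8.7; the equal weighting of fields is implicit.

```python
from math import sqrt

# Records from Table 8.7: recnum -> (gender, age, salary)
records = {
    1: ("female", 27, 19_000),
    2: ("male",   51, 64_000),
    3: ("male",   52, 105_000),
    4: ("female", 33, 55_000),
    5: ("male",   45, 45_000),
}

AGE_RANGE = 52 - 27              # maximum age difference in the sample
SALARY_RANGE = 105_000 - 19_000  # maximum salary difference

def field_distances(a, b):
    """Per-field distances, each scaled to lie between 0 and 1."""
    d_gender = 0.0 if a[0] == b[0] else 1.0
    d_age = abs(a[1] - b[1]) / AGE_RANGE
    d_salary = abs(a[2] - b[2]) / SALARY_RANGE
    return d_gender, d_age, d_salary

def d_sum(a, b):
    return sum(field_distances(a, b))

def d_euclid(a, b):
    return sqrt(sum(d * d for d in field_distances(a, b)))

# Neighbors of record 1, nearest to farthest, summation metric
order = sorted(records, key=lambda r: d_sum(records[1], records[r]))
print(order)  # [1, 4, 5, 2, 3], matching the first row of Table 8.9
```

With these five records, the Euclidean metric produces the same ordering, as Table 8.9 shows.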
Consider what happens when a new record (Table 8.10) is used for the comparison.
Table 8.9 Set of Nearest Neighbors for Three Distance Functions, Ordered Nearest to Farthest
RECNUM   DSUM        DNORM       DEUCLID
1        1,4,5,2,3   1,4,5,2,3   1,4,5,2,3
2        2,5,3,4,1   2,5,3,4,1   2,5,3,4,1
3        3,2,5,4,1   3,2,5,4,1   3,2,5,4,1
4        4,1,5,2,3   4,1,5,2,3   4,1,5,2,3
5        5,2,3,4,1   5,2,3,4,1   5,2,3,4,1
Table 8.10 New Customer
RECNUM   GENDER   AGE   SALARY
new      female   45    $100,000
This new record is not in either of the clusters. Table 8.11 shows her distances from each record in the training set, with her neighbors listed from nearest to farthest.
Now the set of neighbors depends on how the record distance function combines the field distance functions. In fact, the second nearest neighbor under the summation function is the farthest neighbor under the Euclidean function, and vice versa. Compared to the summation or normalized metric, the Euclidean metric tends to favor neighbors where all the fields are relatively close. It punishes Record 3 because the genders are different and maximally far apart (a distance of 1.00). Correspondingly, it favors Record 1 because the genders are the same. Note that the neighbors for dsum and dnorm are identical: dividing by max(dsum) preserves the ordering of the summation distance; the distance values are just scaled into the range from 0 to 1.
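The new customer's orderings can be sketched in code, assuming normalized absolute value for age and salary and the “identical to” function for gender, with range constants taken from the original five-record sample. These constants are an assumption, so the exact distances need not reproduce Table 8.11; still, the effect is visible: under the Euclidean metric, same-gender Record 1 moves up the ordering while Record 5 drops to last.

```python
from math import sqrt

records = {
    1: ("female", 27, 19_000),
    2: ("male",   51, 64_000),
    3: ("male",   52, 105_000),
    4: ("female", 33, 55_000),
    5: ("male",   45, 45_000),
}
new = ("female", 45, 100_000)

AGE_RANGE, SALARY_RANGE = 25, 86_000  # ranges from the Table 8.7 sample

def dists(a, b):
    """Gender, age, and salary distances, each between 0 and 1."""
    return (0.0 if a[0] == b[0] else 1.0,
            abs(a[1] - b[1]) / AGE_RANGE,
            abs(a[2] - b[2]) / SALARY_RANGE)

def d_sum(a, b):
    return sum(dists(a, b))

def d_euclid(a, b):
    return sqrt(sum(d * d for d in dists(a, b)))

order_sum = sorted(records, key=lambda r: d_sum(new, records[r]))
order_euc = sorted(records, key=lambda r: d_euclid(new, records[r]))
print(order_sum)  # [4, 3, 5, 2, 1]
print(order_euc)  # [4, 3, 2, 1, 5]
```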