The most similar documents are the neighbors used for MBR.
Choosing the Combination Function
The next decision is the combination function. Assigning classification codes to news stories differs from most classification problems, which look for the single best solution; a news story can have multiple codes, even from the same category. The ability to adapt MBR to this problem highlights its flexibility.
470643 c08.qxd 3/8/04 11:14 AM Page 268
268 Chapter 8
USING RELEVANCE FEEDBACK TO CREATE A DISTANCE FUNCTION
Relevance feedback is a powerful technique that allows users to refine searches on text databases by asking the database to return documents similar to one they already have. (Hubs and authorities, another method for improving search results on hyperlinked web pages, is described in Chapter 10.) In the course of doing this, the text database scores all the other documents in the database and returns those that are most similar—along with a measure of similarity. This is the relevance feedback score, which can be used as the basis for a distance measure for MBR.
In the case study, the calculation of the relevance feedback score went as follows:
1. Common, non-content-bearing words, such as “it,” “and,” and “of,” were removed from the text of all stories in the training set. A total of 368
words in this category were identified and removed.
2. The next most common words, accounting for 20 percent of the words in the database, were removed from the text. Because these words are so common, they provide little information to distinguish between documents.
3. The remaining words were collected into a dictionary of searchable terms. Each was assigned a weight inversely proportional to its frequency in the database. The particular weight was the negative of the base 2 log of the term's frequency in the training set.
4. Capitalized word pairs, such as “United States” and “New Mexico,” were identified (automatically) and included in the dictionary of searchable terms.
5. To calculate the relevance feedback score for two stories, the weights of the searchable terms in both stories were added together. The algorithm used for this case study included a bonus when searchable terms appeared in close proximity in both stories.
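The weighting and scoring steps above can be sketched in code. This is a simplified sketch, not the engine used in the case study: the stopword list of step 1 and the proximity bonus of step 5 are omitted, and all function names are illustrative.

```python
import math
from collections import Counter

def build_term_weights(documents, stopwords, common_fraction=0.20):
    """Steps 1-3: drop stopwords, drop the most common words
    (about 20 percent of all word occurrences), and weight what
    remains by the negative base 2 log of its frequency."""
    counts = Counter(w for doc in documents for w in doc if w not in stopwords)
    total = sum(counts.values())
    # Remove the most frequent terms until roughly the target
    # fraction of word occurrences is gone.
    removed = 0
    for term, count in counts.most_common():
        if removed >= common_fraction * total:
            break
        removed += count
        del counts[term]
    remaining = sum(counts.values())
    return {term: -math.log2(count / remaining)
            for term, count in counts.items()}

def relevance_score(doc_a, doc_b, weights):
    """Step 5: sum the weights of the searchable terms appearing
    in both stories (the proximity bonus is omitted here)."""
    shared = set(doc_a) & set(doc_b) & set(weights)
    return sum(weights[t] for t in shared)
```

With four word occurrences in the training set, a term accounting for half of them gets weight −log2(0.5) = 1, and a term appearing once gets weight 2.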
The relevance feedback score is an example of the adaptation of an already-existing function for use as a distance function. However, the score itself does not quite fit the definition of a distance function. In particular, a score of 0
indicates that two stories have no words in common, instead of implying that the stories are identical. The following transformation converts the relevance feedback score to a function suitable for measuring the “distance” between news stories:
d_classification(A,B) = 1 – score(A,B) / score(A,A)
This is the function used to find the nearest neighbors. Actually, even this is not a true distance function because d(A,B) is not the same as d(B,A), but it works well enough.
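The transformation can be coded directly once a scoring function is available. In this sketch, `relevance_score` is a stand-in for whatever similarity function the text engine provides, not the score used in the case study.

```python
def classification_distance(story_a, story_b, relevance_score):
    """Convert a similarity score into a distance-like measure:
    d(A,B) = 1 - score(A,B) / score(A,A).
    A story is at distance 0 from itself; a story sharing no terms
    with A is at distance 1. Note that d(A,B) != d(B,A) in general,
    because the normalization uses score(A,A)."""
    return 1.0 - relevance_score(story_a, story_b) / relevance_score(story_a, story_a)
```

For example, with a toy scorer that counts shared words, identical stories get distance 0 and disjoint stories get distance 1.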
Table 8.3 Classified Neighbors of a Not-Yet-Classified Story

NEIGHBOR   DISTANCE   WEIGHT   CODES
1          0.076      0.924    R/FE, R/CA, R/CO
2          0.346      0.654    R/FE, R/JA, R/CA
3          0.369      0.631    R/FE, R/JA, R/MI
4          0.393      0.607    R/FE, R/JA, R/CA
The combination function used a weighted summation technique. Since the maximum distance was 1, the weight was simply one minus the distance, so weights are large for neighbors at small distances and small for neighbors at large distances. For example, suppose the neighbors of a story had the region codes and weights shown in Table 8.3.
The total score for a code was then the sum of the weights of the neighbors containing it; codes with scores below a certain threshold value were then eliminated. For instance, the score for R/FE (the region code for the Far East) is the sum of the weights of neighbors 1, 2, 3, and 4, since all of them contain R/FE, yielding a score of 2.816. Table 8.4 shows the results for the five region codes contained by at least one of the four neighbors. For these examples, a threshold of 1.0 leaves only three codes: R/CA, R/FE, and R/JA.
The particular choice of threshold was based on experimenting with different values and is not important to understanding MBR.
Table 8.4 Code Scores for the Not-Yet-Classified Story

CODE   1       2       3       4       SCORE
R/CA   0.924   0       0       0.607   1.531
R/CO   0.924   0       0       0       0.924
R/FE   0.924   0.654   0.631   0.607   2.816
R/JA   0       0.654   0.631   0.607   1.892
R/MI   0       0       0.631   0       0.631
Choosing the Number of Neighbors
The investigation varied the number of nearest neighbors between 1 and 11 inclusive. The best results came from using more neighbors. However, this case study differs from many applications of MBR because it assigns multiple categories to each story. The more typical problem is to assign only a single category or code, and fewer neighbors would likely suffice for good results.
The Results
To measure the effectiveness of MBR on coding, the news service had a panel of editors review all the codes assigned, whether by editors or by MBR, to 200
stories. Only codes agreed upon by a majority of the panel were considered
“correct.”
The comparison of the "correct" codes to the codes originally assigned by human editors was interesting. The human editors found 88 percent of the correct codes. However, the editors also made mistakes: 17 percent of the codes they assigned were incorrect, as shown in Figure 8.4.
MBR did not do quite as well. MBR found 80 percent of the correct codes, but the cost was that 28 percent of the codes it assigned were incorrect.
[Figure: measured against the codes assigned by the panel of experts, the human editors had 17 percent incorrect codes in their classification, included 88 percent of the correct codes, and omitted 12 percent of the correct codes; MBR had 28 percent incorrect codes, included 80 percent of the correct codes, and omitted 20 percent.]
Figure 8.4 A comparison of results by human editors and by MBR on assigning codes to news stories.
The mix of editors assigning the original codes, though, included novice, intermediate, and experienced editors. The MBR system actually performed as well as intermediate editors and better than novice editors. Also, MBR was using stories classified by the same mix of editors, so the training set was not consistently coded. Given the inconsistency in the training set, it is surprising that MBR did as well as it did. The study was not able to investigate using MBR on a training set whose codes were reviewed by the panel of experts because there were not enough such stories for a viable training set.
This case study illustrates that MBR can be used for solving difficult problems that might not easily be solved by other means. Most data mining techniques cannot handle textual data, and assigning multiple categories at the same time is problematic. This case study shows that, with some experimentation, MBR can produce results comparable to those of human experts. The metrics used to evaluate the performance of a document classification or retrieval system are discussed further in the sidebar "Measuring the Effectiveness of Assigning Codes." This study achieved these results with about two person-months of effort (not counting development of the relevance feedback engine).
By comparison, other automated classification techniques, such as those based on expert systems, require many person-years of effort to achieve equivalent results for classifying news stories.
Measuring Distance
Say your travels are going to take you to a small town and you want to know the weather. If you have a newspaper that lists weather reports for major cities, what you would typically do is find the weather for cities near the small town.
You might look at the closest city and just take its weather, or do some sort of combination of the forecasts for, say, the three closest cities. This is an example of using MBR to find the weather forecast. The distance function being used is the geographic distance between the two locations. It seems likely that the Web services that provide a weather forecast for any zip code supplied by a user do something similar.
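This lookup can be sketched as MBR with geographic distance. The city coordinates and forecasts below are made up for illustration, and a real service would use great-circle distance on latitude and longitude rather than this flat-plane approximation.

```python
import math

# Hypothetical reference data: city -> ((latitude, longitude), forecast temp).
cities = {
    "Springfield": ((40.0, -89.6), 71),
    "Peoria":      ((40.7, -89.6), 68),
    "Decatur":     ((39.8, -89.0), 73),
    "Chicago":     ((41.9, -87.6), 64),
}

def forecast(location, k=3):
    """MBR for weather: find the k cities closest to the location
    and combine their forecasts by simple averaging."""
    nearest = sorted(cities.values(),
                     key=lambda cv: math.dist(location, cv[0]))[:k]
    return sum(temp for _, temp in nearest) / k
```

With k=1 this just takes the closest city's forecast; with k=3 it averages the three closest, as described above.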
What Is a Distance Function?
Distance is the way MBR measures similarity. For any true distance metric, the distance from point A to point B, denoted by d(A,B), has four key properties:

1. Well-defined. The distance between two points is always defined and is a non-negative real number, d(A,B) ≥ 0.

2. Identity. The distance from one point to itself is always zero, so d(A,A) = 0.
3. Commutativity. Direction does not make a difference, so the distance from A to B is the same as the distance from B to A: d(A,B) = d(B,A).