neighbors arbitrarily set to two. The combination function was a simple average.
All of these choices seem reasonable. In general, using MBR involves several choices:
1. Choosing an appropriate set of training records
2. Choosing the most efficient way to represent the training records 3. Choosing the distance function, the combination function, and the number of neighbors
Let’s look at each of these in turn.
Choosing a Balanced Set of Historical Records
The training set is a set of historical records. It needs to provide good coverage of the population so that the nearest neighbors of an unknown record are useful for predictive purposes. A random sample may not provide sufficient coverage for all values. Some categories are much more frequent than others and the more frequent categories dominate the random sample.
For instance, fraudulent transactions are much rarer than non-fraudulent transactions, heart disease is much more common than liver cancer, news stories about the computer industry more common than about plastics, and so on.
Team-Fly®
470643 c08.qxd 3/8/04 11:14 AM Page 263
Memory-Based Reasoning and Collaborative Filtering 263
To achieve balance, the training set should, if possible, contain roughly equal numbers of records representing the different categories.
T I P When selecting the training set for MBR, be sure that each category has roughly the same number of records supporting it. As a general rule of thumb, several dozen records for each category are a minimum to get adequate support and hundreds or thousands of examples are not unusual.
Representing the Training Data
The performance of MBR in making predictions depends on how the training set is represented. The scatter plot approach illustrated in Figure 8.2 works for two or three variables and a small number of records, but it does not scale well.
The simplest method for finding nearest neighbors requires finding the distance from the unknown case to each of the records in the training set and choosing the training records with the smallest distances. As the number of records grows, the time needed to find the neighbors for a new record grows quickly.
This is especially true if the records are stored in a relational database. In this case, the query looks something like:
SELECT distance(),rec.category
FROM historical_records rec
ORDER BY 1 ASCENDING;
The notation distance() fills in for whatever the particular distance function happens to be. In this case, all the historical records need to be sorted in order to get the handful needed for the nearest neighbors. This requires a full-table scan plus a sort—quite an expensive couple of operations. It is possible to eliminate the sort by walking through table and keeping another table of the nearest, inserting and deleting records as appropriate. Unfortunately, this approach is not readily expressible in SQL without using a procedural language.
The performance of relational databases is pretty good nowadays. The challenge with scoring data for MBR is that each case being scored needs to be compared against every case in the database. Scoring a single new record does not take much time, even when there are millions of historical records. However, scoring many new records can have poor performance.
Another way to make MBR more efficient is to reduce the number of records in the training set. Figure 8.2 shows a scatter plot for categorical data. This graph has a well-defined boundary between the two regions. The points above the line are all diamonds and those below the line are all circles. Although this graph has forty points in it, most of the points are redundant. That is, they are not really necessary for classification purposes.
470643 c08.qxd 3/8/04 11:14 AM Page 264
264 Chapter 8
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0
0.2
0.4
0.6
0.8
1
Figure 8.2 Perhaps the cleanest training set for MBR is one that divides neatly into two disjoint sets.
Figure 8.3 shows that only eight points in it are needed to get the same results. Given that the size of the training set has such a large influence on the performance of MBR, being able to reduce the size is a significant performance boost.
How can this reduced set of records be found? The most practical method is to look for clusters containing records belonging to different categories. The centers of the clusters can then be used as a reduced set. This works well when the different categories are quite separate. However, when there is some overlap and the categories are not so well-defined, using clusters to reduce the size of the training set can cause MBR to produce poor results. Finding an optimal set of “support records” has been an area of recent research. When such an optimal set can be found, the historical records can sometimes be reduced to the level where they fit inside a spreadsheet, making it quite efficient to apply MBR to new records on less powerful machines.
470643 c08.qxd 3/8/04 11:14 AM Page 265
Memory-Based Reasoning and Collaborative Filtering 265
1
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0
0.2
0.4
0.6
0.8
1
Figure 8.3 This smaller set of points returns the same results as in Figure 8.2 using MBR.
Determining the Distance Function, Combination
Function, and Number of Neighbors
The distance function, combination function, and number of neighbors are the key ingredients in using MBR. The same set of historical records can prove very useful or not at all useful for predictive purposes, depending on these criteria. Fortunately, simple distance functions and combination functions usually work quite well. Before discussing these issues in detail, let’s look at a detailed case study.
Case Study: Classifying News Stories
This case study uses MBR to assign classification codes to news stories and is based on work conducted by one of the authors. The results from this case study show that MBR can perform as well as people on a problem involving hundreds of categories and data on a difficult-to-use type of data, free-text.1
1This case study is a summarization of research conducted by one of the authors. Complete details are available in the article “Classifying News Stories using Memory Based Reasoning,” by David Waltz, Brij Masand, and Gordon Linoff, in Proceedings, SIGIR ‘92, published by ACM Press.
470643 c08.qxd 3/8/04 11:14 AM Page 266
266 Chapter 8
What Are the Codes?
The classification codes are keywords used to describe the content of news stories. These codes are added to stories by a news retrieval service to help users search for stories of interest. They help automate the process of routing particular stories to particular customers and help implement personalized profiles.
For instance, an industry analyst who specializes in the automotive industry (or anyone else with an interest in the topic) can simplify searches by looking for documents with the “automotive industry” code. Because knowledgeable experts, also known as editors, set up the codes, the right stories are retrieved.
Editors or expert systems have traditionally assigned these codes. This case study investigated the use of MBR for this purpose.
The codes used in this study fall into six categories:
■■
Government Agency
■■
Industry
■■
Market Sector
■■
Product
■■
Region
■■
Subject
The data contained 361 separate codes, distributed as follows in the training set (Table 8.2).
The number and types of codes assigned to stories varied. Almost all the stories had region and subject codes—and, on average, almost three region codes per story. At the other extreme, relatively few stories contained government and product codes, and such stories rarely had more than one such code.
Table 8.2 Six Types of Codes Used to Classify News Stories CATEGORY
# CODES
# DOCS
# OCCURRENCES
Government (G/)
28
3,926
4,200
Industry (I/)
112
38,308
57,430
Market Sector (M/)
9
38,562
42,058
Product (P/)
21
2,242
2,523
Region (R/)
121
47,083
116,358
Subject (N/)
70
41,902
52,751
470643 c08.qxd 3/8/04 11:14 AM Page 267
Memory-Based Reasoning and Collaborative Filtering 267
Applying MBR
This section explains how MBR facilitated assigning codes to news stories for a news service. The important steps were:
1. Choosing the training set
2. Determining the distance function
3. Choosing the number of nearest neighbors
4. Determining the combination function
The following sections discuss each of these steps in turn.
Choosing the Training Set
The training set consisted of 49,652 news stories, provided by the news retrieval service for this purpose. These stories came from about three months of news and from almost 100 different sources. Each story contained, on average, 2,700 words and had eight codes assigned to it. The training set was not specially created, so the frequency of codes in the training set varied a great deal, mimicking the overall frequency of codes in news stories in general.
Although this training set yielded good results, a better-constructed training set with more examples of the less common codes would probably have performed even better.
Choosing the Distance Function
The next step is choosing the distance function. In this case, a distance function already existed, based on a notion called relevance feedback that measures the similarity of two documents based on the words they contain. Relevance feedback, which is described more fully in the sidebar, was originally designed to return documents similar to a given document, as a way of refining searches.