PE = price / earnings
pop_density = population / area
rpm = revenue_passengers * miles
Adding fields that represent relationships considered important by experts in the field is a way of letting the mining process benefit from that expertise.
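As an illustration, here is a minimal sketch in Python, assuming pandas and a hypothetical table whose column names are ours, that adds derived ratio fields like the ones above:

import pandas as pd

# Hypothetical raw data; the values and column names are invented for illustration.
df = pd.DataFrame({
    "price":      [30.0, 45.0],
    "earnings":   [2.0, 5.0],
    "population": [12000, 54000],
    "area":       [40.0, 90.0],
})

# Derived ratio fields encode the expert-suggested relationships.
df["pe"] = df["price"] / df["earnings"]            # price-to-earnings ratio
df["pop_density"] = df["population"] / df["area"]  # people per unit area
print(df)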
Convert Counts to Proportions
Many datasets contain counts or dollar values that are not particularly interesting in themselves because they vary according to some other value. Larger households spend more money on groceries than smaller households. They spend more money on produce, more money on meat, more money on packaged goods, more money on cleaning products, more money on everything.
So comparing the dollar amount spent by different households in any one
category, such as bakery, will only reveal that large households spend more. It is much more interesting to compare the proportion of each household’s spending that goes to each category.
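As a sketch of how this might be done (the code and category names are ours, assuming pandas), each household's row is divided by its own total so that the categories become proportions summing to one:

import pandas as pd

# Hypothetical spending by category for a large and a small household.
spend = pd.DataFrame({
    "bakery":   [40.0, 15.0],
    "meat":     [120.0, 30.0],
    "cleaning": [40.0, 5.0],
})

# Divide each row by its row total to convert dollar amounts to proportions.
proportions = spend.div(spend.sum(axis=1), axis=0)
print(proportions)  # each row now sums to 1, so the households are comparable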
The value of converting counts to proportions can be seen by comparing two charts based on the NY State towns dataset. Figure 3.9 compares the count of houses with bad plumbing to the prevalence of heating with wood. A relationship is visible, but it is not strong. In Figure 3.10, where the count of houses with bad plumbing has been converted into the proportion of houses with bad plumbing, the relationship is much stronger. Towns where many houses have bad plumbing also have many houses heated by wood. Does this mean that wood smoke destroys plumbing? It is important to remember that the patterns we find reveal correlation, not causation.
Figure 3.9 Chart comparing count of houses with bad plumbing to prevalence of heating with wood.
Figure 3.10 Chart comparing proportion of houses with bad plumbing to prevalence of heating with wood.
Step Seven: Build Models
The details of this step vary from technique to technique and are described in the chapters devoted to each data mining method. In general terms, this is the step where most of the work of creating a model occurs. In directed data mining, the training set is used to generate an explanation of the dependent or target variable in terms of the independent or input variables. This explanation may take the form of a neural network, a decision tree, a linkage graph, or some other representation of the relationship between the target and the other fields in the database. In undirected data mining, there is no target variable.
Instead, the model finds relationships among records and expresses them as association rules or as assignments to common clusters.
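To make the distinction concrete, here is a minimal sketch using scikit-learn (a library choice of ours, not something this chapter depends on), with a decision tree as the directed model and k-means clustering as the undirected one:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Synthetic data; y plays the role of the target variable.
X, y = make_classification(n_samples=500, random_state=0)

# Directed: the tree learns to explain the target y in terms of the inputs X.
directed = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Undirected: k-means sees only X and assigns records to common clusters.
undirected = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(directed.score(X, y))     # accuracy of the directed model on its training set
print(undirected.labels_[:10])  # cluster assignments for the first ten records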
Building models is the one step of the data mining process that has been truly automated by modern data mining software. For that reason, it takes up relatively little of the time in a data mining project.
Step Eight: Assess Models
This step determines whether or not the models are working. A model assessment should answer questions such as:
■■ How accurate is the model?
■■ How well does the model describe the observed data?
■■ How much confidence can be placed in the model’s predictions?
■■ How comprehensible is the model?
Of course, the answer to these questions depends on the type of model that was built. Assessment here refers to the technical merits of the model, rather than the measurement phase of the virtuous cycle.
Assessing Descriptive Models
The rule, If (state=’MA’) then heating source is oil, seems more descriptive than the rule, If (area=339 OR area=351 OR area=413 OR area=508 OR area=617 OR area=774 OR area=781 OR area=857 OR area=978) then heating source is oil. Even if the two rules turn out to be equivalent, the first one seems more expressive.
Expressive power may seem purely subjective, but there is, in fact, a theoretical way to measure it, called the minimum description length or MDL. The minimum description length for a model is the number of bits it takes to encode both the rule and the list of all exceptions to the rule. The fewer bits required, the better the rule. Some data mining tools use MDL to decide which sets of rules to keep and which to weed out.
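A toy sketch makes the idea concrete; the encoding scheme below (bits per rule term, record numbers for exceptions) is invented purely for illustration:

import math

def description_length(rule_terms, n_exceptions, n_records):
    rule_bits = rule_terms * 16  # assume 16 bits to encode each rule term
    # Each exception is listed by its record number.
    exception_bits = n_exceptions * math.ceil(math.log2(n_records))
    return rule_bits + exception_bits

# Over 10,000 records, a one-term rule with 20 exceptions beats a
# nine-term rule with 12 exceptions: fewer total bits, better rule.
print(description_length(1, 20, 10_000))  # 296 bits
print(description_length(9, 12, 10_000))  # 312 bits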
Assessing Directed Models
Directed models are assessed on their accuracy on previously unseen data.
Different data mining tasks call for different ways of assessing performance of the model as a whole and different ways of judging the likelihood that the model yields accurate results for any particular record.
Any model assessment is dependent on context; the same model can look good according to one measure and bad according to another. In the academic field of machine learning—the source of many of the algorithms used for data mining—researchers have a goal of generating models that can be understood in their entirety. An easy-to-understand model is said to have good “mental fit.” In the interest of obtaining the best mental fit, these researchers often prefer models that consist of a few simple rules to models that contain many such rules, even when the latter are more accurate. In a business setting, such
explicability may not be as important as performance—or may be more important.
Model assessment can take place at the level of the whole model or at the level of individual predictions. Two models with the same overall accuracy may have quite different levels of variance among the individual predictions.
A decision tree, for instance, has an overall classification error rate, but each branch and leaf of the tree has its own error rate as well.
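The point is easy to see in code; here is a minimal sketch, assuming scikit-learn and synthetic data, that reports the overall error rate and then a separate error rate for each leaf:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

pred = tree.predict(X)
print("overall error rate:", np.mean(pred != y))

# apply() returns the index of the leaf each record lands in, so the
# overall rate can be broken down leaf by leaf.
leaves = tree.apply(X)
for leaf in np.unique(leaves):
    mask = leaves == leaf
    rate = np.mean(pred[mask] != y[mask])
    print(f"leaf {leaf}: {mask.sum()} records, error rate {rate:.3f}")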
Assessing Classifiers and Predictors
For classification and prediction tasks, accuracy is measured in terms of the error rate, the percentage of records classified incorrectly. The classification error rate on the preclassified test set is used as an estimate of the expected error rate when classifying new records. Of course, this procedure is only valid if the test set is representative of the larger population.
Our recommended method of establishing the error rate for a model is to measure it on a test dataset taken from the same population as the training and validation sets, but disjoint from them. In the ideal case, such a test set would be from a more recent time period than the data in the model set; however, this is not often possible in practice.
A problem with error rate as an assessment tool is that some errors are worse than others. A familiar example comes from the medical world where a false negative on a test for a serious disease causes the patient to go untreated with possibly life-threatening consequences whereas a false positive only leads to a second (possibly more expensive or more invasive) test. A confusion matrix or correct classification matrix, shown in Figure 3.11, can be used to sort out false positives from false negatives. Some data mining tools allow costs to be associated with each type of misclassification so models can be built to minimize the cost rather than the misclassification rate.
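As a sketch of both ideas, assuming scikit-learn and with invented labels and costs, the confusion matrix below separates the two kinds of error, and a cost matrix weighs a false negative ten times as heavily as a false positive:

import numpy as np
from sklearn.metrics import confusion_matrix

actual    = np.array([1, 0, 1, 1, 0, 0, 1, 0])
predicted = np.array([1, 0, 0, 1, 0, 1, 1, 0])

cm = confusion_matrix(actual, predicted)  # rows are actual, columns predicted
print(cm)  # [[3 1], [1 3]]: 3 true negatives, 1 false positive,
           # 1 false negative, 3 true positives

# Invented costs: a missed disease (false negative) costs 10, a needless
# second test (false positive) costs 1, correct answers cost nothing.
costs = np.array([[0, 1],
                  [10, 0]])
print("total cost:", (cm * costs).sum())  # 1*1 + 1*10 = 11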
Assessing Estimators
For estimation tasks, accuracy is expressed in terms of the difference between the predicted score and the actual measured result. Both the accuracy of any one estimate and the accuracy of the model as a whole are of interest. A model may be quite accurate for some ranges of input values and quite inaccurate for others. Figure 3.12 shows a linear model that estimates total revenue based on a product’s unit price. This simple model works reasonably well in one price range but goes badly wrong when the price reaches the level where the elasticity of demand for the product (the ratio of the percent change in quantity sold to the percent change in price) is greater than one. An elasticity greater than one means that any further price increase results in a decrease in revenue because the increased revenue per unit is more than offset by the drop in the number of units sold.
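A few lines of arithmetic, with invented numbers, show the effect. With an elasticity of 2, a 10 percent price increase cuts unit sales by about 20 percent, and total revenue falls:

# Elasticity = (% change in quantity sold) / (% change in price) = 2.
p0, q0 = 100.0, 1000.0        # starting price and units sold
p1 = p0 * 1.10                # raise the price 10%
q1 = q0 * (1 - 2.0 * 0.10)    # units sold drop about 20%

print(p0 * q0)  # 100000.0: revenue before the increase
print(p1 * q1)  # 88000.0:  revenue after, higher price but lower revenue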
[Bar chart for Figure 3.11: predicted class (Into: WClass) cross-tabulated against actual class (From: WClass); the vertical axis shows percent of row frequency.]
Figure 3.11 A confusion matrix cross-tabulates predicted outcomes with actual outcomes.
[Line chart for Figure 3.12: total revenue and estimated revenue plotted against unit price.]
Figure 3.12 The accuracy of an estimator may vary considerably over the range of inputs.
The standard way of describing the accuracy of an estimation model is by measuring how far off the estimates are on average. But simply subtracting the estimated value from the true value at each point and taking the mean results in a meaningless number. To see why, consider the estimates in Table 3.1.
The average difference between the true values and the estimates is zero; positive differences and negative differences have canceled each other out.
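A small numeric sketch (the values are invented; Table 3.1 is not reproduced here) shows the cancellation, and why averaging the absolute or squared differences avoids it:

actual    = [100, 200, 300, 400]
estimated = [150, 150, 350, 350]

residuals = [a - e for a, e in zip(actual, estimated)]  # [-50, 50, -50, 50]
print(sum(residuals) / len(residuals))                  # 0.0: the errors cancel
print(sum(abs(r) for r in residuals) / len(residuals))  # 50.0: mean absolute error
print(sum(r * r for r in residuals) / len(residuals))   # 2500.0: mean squared error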