because 11/11/11 is the date you get by holding down the “1” key and letting it auto-repeat until the field is full (no other key, held down that way, fills the field with a valid date). Sometimes we discovered serious problems with the data, such as the data for February being misidentified as January. And in the last instance, we learned that the process extracting the data had bugs.
Step Four: Create a Model Set
The model set contains all the data that is used in the modeling process. Some of the data in the model set is used to find patterns. Some of the data in the model set is used to verify that the model is stable. Some is used to assess the model’s performance. Creating a model set requires assembling data from multiple sources to form customer signatures and then preparing the data for analysis.
Assembling Customer Signatures
The model set is a table or collection of tables with one row per item to be studied, and fields for everything known about that item that could be useful for modeling. When the data describes customers, the rows of the model set are often called customer signatures. Assembling the customer signatures from relational databases often requires complex queries to join data from many tables, followed by augmenting the result with data from other sources.
Part of the data assembly process is getting all data to be at the correct level of summarization so there is one value per customer, rather than one value per transaction or one value per zip code. These issues are discussed in Chapter 17.
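As an illustration of getting data to one row per customer, here is a minimal sketch in Python with pandas. The table and column names (customer_id, txn_date, amount) are invented for the example and simply stand in for whatever the transaction source actually contains.

```python
import pandas as pd

# Hypothetical transaction-level data: one row per transaction, not per customer.
transactions = pd.DataFrame({
    "customer_id": [101, 101, 102, 102, 102, 103],
    "txn_date": pd.to_datetime(["2004-01-05", "2004-02-11", "2004-01-20",
                                "2004-02-02", "2004-03-15", "2004-03-01"]),
    "amount": [25.00, 40.00, 12.50, 80.00, 5.00, 300.00],
})

# Roll the transactions up to one row per customer: the customer signature.
signatures = transactions.groupby("customer_id").agg(
    txn_count=("amount", "size"),
    total_spend=("amount", "sum"),
    avg_spend=("amount", "mean"),
    last_txn_date=("txn_date", "max"),
)
print(signatures)
```

In practice, the same rollup is often done in SQL with GROUP BY before the data ever reaches the modeling tool; the point is only that the summarization happens before modeling begins.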
Creating a Balanced Sample
Very often, the data mining task involves learning to distinguish between groups such as responders and nonresponders, goods and bads, or members of different customer segments. As explained in the sidebar, data mining algorithms do best when these groups have roughly the same number of members.
This is unlikely to occur naturally. In fact, it is usually the more interesting groups that are underrepresented.
Before modeling, the dataset should be balanced, either by sampling from the different groups at different rates or by adding a weighting factor so that members of the more common groups are not weighted as heavily as members of the smaller ones.
ADDING MORE NEEDLES TO THE HAYSTACK
In standard statistical analysis, it is common practice to throw out outliers—
observations that are far outside the normal range. In data mining, however, these outliers may be just what we are looking for. Perhaps they represent fraud, some sort of error in our business procedures, or some fabulously profitable niche market. In these cases, we don’t want to throw out the outliers, we want to get to know and understand them!
The problem is that knowledge discovery algorithms learn by example. If there are not enough examples of a particular class or pattern of behavior, the data mining tools will not be able to come up with a model for predicting it. In this situation, we may be able to improve our chances by artificially enriching the training data with examples of the rare event.
[Figure: Two ways of balancing a sample of 50 numbered records (00 through 49): stratified sampling and weights.]
When an outcome is rare, there are two ways to create a balanced sample.
For example, a bank might want to build a model of who is a likely prospect for a private banking program. These programs appeal only to the very wealthiest clients, few of whom are represented in even a fairly large sample of bank customers. To build a model capable of spotting these fortunate individuals, we might create a training set of checking transaction histories of a population that includes 50 percent private banking clients even though they represent fewer than 1 percent of all checking accounts.
Alternatively, each private banking client might be given a weight of 1 and other customers a weight of 0.01, so that the total weight of the exclusive customers equals the total weight of the rest of the customers (we prefer to have the maximum weight be 1).
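Both approaches can be sketched in a few lines of Python with pandas. The data below is made up, with roughly a 1 percent rare class, purely to illustrate the mechanics.

```python
import pandas as pd

# Made-up model set with a rare target: 1 = private banking client (about 1%).
model_set = pd.DataFrame({
    "customer_id": range(1000),
    "target": [1] * 10 + [0] * 990,
})

rare = model_set[model_set["target"] == 1]
common = model_set[model_set["target"] == 0]

# Option 1: stratified sampling. Keep every rare case and an equal-sized
# random sample of the common cases, yielding a 50/50 balanced training set.
balanced = pd.concat([rare, common.sample(n=len(rare), random_state=1)])

# Option 2: weights. Rare cases get weight 1; common cases are scaled down so
# both groups carry the same total weight and the maximum weight stays 1.
model_set["weight"] = 1.0
model_set.loc[model_set["target"] == 0, "weight"] = len(rare) / len(common)
```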
Including Multiple Timeframes
The primary goal of the methodology is creating stable models. Among other things, that means models that will work at any time of year and well into the future. This is more likely to happen if the data in the model set does not all come from one time of year. Even if the model is to be based on only 3 months of history, different rows of the model set should use different 3-month windows. The idea is to let the model generalize from the past rather than memorize what happened at one particular time in the past.
Building a model on data from a single time period increases the risk of learning things that are not generally true. One amusing example that the authors once saw was an association rules model built on a single week’s worth of point of sale data from a supermarket. Association rules try to predict items a shopping basket will contain given that it is known to contain certain other items. In this case, all the rules predicted eggs. This surprising result became less so when we realized that the model set was from the week before Easter.
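One way to put this into practice, sketched below, is to build the same customer summary over several different 3-month windows and stack the results, so that no single season dominates the model set. The transactions table is the same hypothetical one used in the earlier sketch, and the window dates are arbitrary.

```python
import pandas as pd

def summarize_window(transactions, start, end):
    """Build one row per customer from a single 3-month history window."""
    window = transactions[(transactions["txn_date"] >= start)
                          & (transactions["txn_date"] < end)]
    rows = window.groupby("customer_id").agg(
        txn_count=("amount", "size"),
        total_spend=("amount", "sum"),
    ).reset_index()
    rows["window_start"] = start
    return rows

# Each window contributes its own rows, so the model set mixes times of year.
windows = [("2003-01-01", "2003-04-01"),
           ("2003-04-01", "2003-07-01"),
           ("2003-07-01", "2003-10-01")]
model_set = pd.concat(
    [summarize_window(transactions, pd.Timestamp(s), pd.Timestamp(e))
     for s, e in windows],
    ignore_index=True,
)
```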
Creating a Model Set for Prediction
When the model set is going to be used for prediction, there is another aspect of time to worry about. Although the model set should contain multiple timeframes, any one customer signature should have a gap in time between the predictor variables and the target variable. Time can always be divided into three periods: the past, present, and future. When making a prediction, a model uses data from the past to make predictions about the future.
As shown in Figure 3.7, all three of these periods should be represented in the model set. Of course, all data comes from the past, so the time periods in the model set are actually the distant past, the not-so-distant past, and the recent past. Predictive models are built by finding patterns in the distant past that explain outcomes in the recent past. When the model is deployed, it is then able to use data from the recent past to make predictions about the future.
[Figure: At model building time, the timeline runs from the distant past through the not-so-distant past to the recent past; at model scoring time, the same windows correspond to the past, the present, and the future.]
Figure 3.7 Data from the past mimics data from the past, present, and future.
It may not be immediately obvious why some recent data—from the not-so-distant past—is not used in a particular customer signature. The answer is that when the model is applied in the present, no data from the present is available as input. The diagram in Figure 3.8 makes this clearer.
If a model were built using data from June (the not-so-distant past) in order to predict July (the recent past), then it could not be used to predict September until August data was available. But when is August data available? Certainly not in August, since it is still being created. Chances are, not in the first week of September either, since it has to be collected and cleaned and loaded and tested and blessed. In many companies, the August data will not be available until mid-September or even October, by which point nobody will care about predictions for September. The solution is to include a month of latency in the model set.
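A small sketch of this rule: given the month to be scored, step back over the latency period before counting off the history window. The function and its parameters are invented for illustration; the point is simply that the month immediately before the target is skipped.

```python
import pandas as pd

def history_window(target_month: str, history_months: int = 3,
                   latency_months: int = 1):
    """Return the first and last usable history months for a target month.

    The latency gap mimics the delay before recent data has been collected,
    cleaned, loaded, and blessed.
    """
    target = pd.Period(target_month, freq="M")
    last_usable = target - latency_months - 1      # skip the latency month(s)
    first_usable = last_usable - history_months + 1
    return first_usable, last_usable, target

# Predicting September with one month of latency: history is May through July,
# and August is skipped because it is not yet available.
print(history_window("2004-09"))
```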
Partitioning the Model Set
Once the preclassified data has been obtained from the appropriate timeframes, the methodology calls for dividing it into three parts. The first part, the training set, is used to build the initial model. The second part, the validation set, is used to adjust the initial model to make it more general and less tied to the idiosyncrasies of the training set. The third part, the test set, is used to gauge the likely effectiveness of the model when applied to unseen data. Three sets are necessary because once data has been used for one step in the process, it cannot be used for the next: the information it contains has already become part of the model, so it can no longer serve to correct or judge that model.
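A minimal sketch of such a three-way split is shown below. The 60/20/20 proportions are an assumption chosen for illustration, not something the methodology prescribes; the only requirement is that the three sets be disjoint.

```python
import pandas as pd

def partition_model_set(model_set: pd.DataFrame, seed: int = 1):
    """Split the model set into disjoint training, validation, and test sets."""
    shuffled = model_set.sample(frac=1.0, random_state=seed)  # shuffle the rows
    n = len(shuffled)
    train = shuffled.iloc[:int(0.6 * n)]                    # build the initial model
    validation = shuffled.iloc[int(0.6 * n):int(0.8 * n)]   # generalize the model
    test = shuffled.iloc[int(0.8 * n):]                     # assess on unseen data
    return train, validation, test
```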