Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management

[Figure 3.8 shows two month-by-month timelines, “Model Building Time” and “Model Scoring Time.” In each, the months run from January to October along the horizontal axis, with the target period marked.]

Figure 3.8 Time when the model is built compared to time when the model is used.


People often find it hard to understand why the training set and validation set are “tainted” once they have been used to build a model. An analogy may help: Imagine yourself back in the fifth grade. The class is taking a spelling test. Suppose that, at the end of the test period, the teacher asks you to estimate your own grade on the quiz by marking the words you got wrong. You will give yourself a very good grade, but your spelling will not improve. If, at the beginning of the period, you thought there should be an ‘e’ at the end of “tomato,” nothing will have happened to change your mind when you grade your paper. No new information has entered the system. You need a validation set!

Now, imagine that at the end of the test the teacher allows you to look at the papers of several neighbors before grading your own. If they all agree that “tomato” has no final ‘e,’ you may decide to mark your own answer wrong. If the teacher gives the same quiz tomorrow, you will do better. But how much better? If you use the papers of the very same neighbors to evaluate your performance tomorrow, you may still be fooling yourself. If they all agree that “potatoes” has no more need of an ‘e’ than “tomato,” and you have changed your own guess to agree with theirs, then you will overestimate your actual grade on the second quiz as well. That is why the test set should be different from the validation set.

For predictive models, the test set should also come from a different time period than the training and validation sets. The proof of a model’s stability is in its ability to perform well month after month. A test set from a different time period, often called an out of time test set, is a good way to verify model stability, although such a test set is not always available.
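To make the three roles concrete, here is a minimal sketch of one way to carve a model set into training, validation, and out of time test partitions. The book does not prescribe any code; the column name snapshot_month and the 60/40 split below are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def partition_model_set(df: pd.DataFrame, time_col: str = "snapshot_month"):
    """Split a model set into training, validation, and out-of-time test sets.

    The most recent time period is held out entirely as the test set;
    the earlier periods are split randomly into training and validation.
    """
    latest = df[time_col].max()
    out_of_time_test = df[df[time_col] == latest]   # later period: used only to test
    earlier = df[df[time_col] < latest]             # earlier periods: used to build the model

    train, validation = train_test_split(
        earlier, test_size=0.4, random_state=42     # 60/40 split is an arbitrary illustrative choice
    )
    return train, validation, out_of_time_test
```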

Step Five: Fix Problems with the Data

All data is dirty. All data has problems. What is or isn’t a problem varies with the data mining technique. For some, such as decision trees, missing values and outliers do not cause too much trouble. For others, such as neural networks, they cause all sorts of trouble. For that reason, some of what we have to say about fixing problems with data can be found in the chapters on the techniques where they cause the most difficulty. The rest of what we have to say on this topic can be found in Chapter 17 in the section called “The Dark Side of Data.”

The next few sections talk about some of the common problems that need to be fixed.


Categorical Variables with Too Many Values

Variables such as zip code, county, telephone handset model, and occupation code are all examples of variables that convey useful information, but not in a way that most data mining algorithms can handle. The problem is that although where a person lives and what he or she does for work are important predictors, the variables that carry this information have so many possible values, and most values have so few examples in your data, that variables such as zip code and occupation end up being thrown away along with their valuable information content.

Variables like these must either be grouped so that many classes that all have approximately the same relationship to the target variable are grouped together, or they must be replaced by interesting attributes of the zip code, handset model or occupation. Replace zip codes by the zip code’s median home price or population density or historical response rate or whatever else seems likely to be predictive. Replace occupation with median salary for that occupation. And so on.
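As an illustration of the second approach, replacing a high-cardinality code with predictive attributes of that code, here is a small sketch. The lookup columns (median_home_price, pop_density) and their values are hypothetical stand-ins for whatever reference data is actually available.

```python
import pandas as pd

# Hypothetical reference table keyed by zip code; in practice this might come
# from census data or from the company's own historical response rates.
zip_attributes = pd.DataFrame({
    "zip": ["02138", "10001", "94105"],
    "median_home_price": [850_000, 1_100_000, 1_250_000],
    "pop_density": [6_500, 28_000, 12_000],
})

def replace_zip_with_attributes(customers: pd.DataFrame) -> pd.DataFrame:
    """Replace the raw zip code with attributes of the zip code likely to be predictive."""
    enriched = customers.merge(zip_attributes, on="zip", how="left")
    return enriched.drop(columns=["zip"])   # the raw code itself is no longer needed
```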

Numeric Variables with Skewed Distributions and Outliers

Skewed distributions and outliers cause problems for any data mining technique that uses the values arithmetically (by multiplying them by weights and adding them together, for instance). In many cases, it makes sense to discard records that have outliers. In other cases, it is better to divide the values into equal sized ranges, such as deciles. Sometimes, the best approach is to transform such variables to reduce the range of values by replacing each value with its logarithm, for instance.
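The following sketch shows two of the remedies just mentioned: binning a skewed variable into deciles and replacing values with their logarithms. The column name spend is an assumed example.

```python
import numpy as np
import pandas as pd

def tame_skewed_variable(df: pd.DataFrame, col: str = "spend") -> pd.DataFrame:
    """Add two tamer versions of a skewed numeric variable."""
    df = df.copy()

    # Equal-sized ranges: decile membership (0-9) instead of the raw value.
    df[f"{col}_decile"] = pd.qcut(df[col], q=10, labels=False, duplicates="drop")

    # Log transform compresses the long right tail; log1p handles zeros safely.
    df[f"{col}_log"] = np.log1p(df[col])
    return df
```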

Missing Values

Some data mining algorithms are capable of treating “missing” as a value and incorporating it into rules. Others cannot handle missing values, unfortunately. None of the obvious solutions preserve the true distribution of the variable. Throwing out all records with missing values introduces bias because it is unlikely that such records are distributed randomly. Replacing the missing value with some likely value such as the mean or the most common value adds spurious information. Replacing the missing value with an unlikely value is even worse since the data mining algorithms will not recognize that –999, say, is an unlikely value for age. The algorithms will go ahead and use it.
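To make those pitfalls concrete, the sketch below contrasts the three naive treatments described above (dropping rows, filling with the mean, and filling with an unlikely sentinel such as –999) on a tiny, invented age column. It is illustrative only, not a recommended recipe.

```python
import pandas as pd

ages = pd.Series([23, 35, None, 41, None, 58], name="age")

dropped   = ages.dropna()              # discards rows, which are unlikely to be a random subset
mean_fill = ages.fillna(ages.mean())   # adds spurious mass at the mean
sentinel  = ages.fillna(-999)          # an algorithm will happily treat -999 as a real age

print(ages.mean(), mean_fill.mean(), sentinel.mean())
# The sentinel drags the average far below any plausible age,
# illustrating why unlikely replacement values are the worst option.
```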


When missing values must be replaced, the best approach is to impute them by creating a model that has the missing value as its target variable.
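Here is a minimal sketch of that model-based imputation idea: fit a model on the rows where the value is present, using it as the target, and predict it for the rows where it is missing. The choice of a random forest and the assumption that the predictor columns are themselves complete are ours, for illustration only.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_with_model(df: pd.DataFrame, target: str, features: list[str]) -> pd.DataFrame:
    """Fill missing values of `target` using a model trained on the complete rows.

    Assumes the columns in `features` have no missing values of their own.
    """
    known = df[df[target].notna()]
    unknown = df[df[target].isna()]
    if unknown.empty:
        return df

    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(known[features], known[target])

    df = df.copy()
    df.loc[df[target].isna(), target] = model.predict(unknown[features])
    return df
```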

Values with Meanings That Change over Time

When data comes from several different points in history, it is not uncommon for the same value in the same field to have changed its meaning over time. Credit class “A” may always be the best, but the exact range of credit scores that get classed as an “A” may change from time to time. Dealing with this properly requires a well-designed data warehouse where such changes in meaning are recorded so a new variable can be defined that has a constant meaning over time.
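One way to define such a constant-meaning variable is to record, for each time period, how the code mapped to the underlying quantity and translate the code back. The cutoff values below are invented purely to illustrate the idea.

```python
# Hypothetical record of how the credit-class cutoffs changed over time.
credit_class_cutoffs = {
    # (year, class) -> minimum credit score that earned that class in that year
    (2002, "A"): 720,
    (2003, "A"): 740,   # the bar for an "A" moved, so the raw label is not comparable across years
}

def minimum_score_for_class(year: int, credit_class: str):
    """Translate a time-varying class label into a value with constant meaning."""
    return credit_class_cutoffs.get((year, credit_class))
```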

Inconsistent Data Encoding

When information on the same topic is collected from multiple sources, the various sources often represent the same data different ways. If these differences are not caught, they add spurious distinctions that can lead to erroneous conclusions. In one call-detail analysis project, each of the markets studied had a different way of indicating a call to check one’s own voice mail. In one city, a call to voice mail from the phone line associated with that mailbox was recorded as having the same origin and destination numbers. In another city, the same situation was represented by the presence of a specific nonexistent number as the call destination. In yet another city, the actual number dialed to reach voice mail was recorded. Understanding apparent differences in voice mail habits between cities required putting the data in a common form.

The same data set contained multiple abbreviations for some states and, in some cases, a particular city was counted separately from the rest of the state. If issues like this are not resolved, you may find yourself building a model of calling patterns to California based on data that excludes calls to Los Angeles.
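Putting data in a common form usually comes down to explicit mapping tables built while profiling the data. The sketch below normalizes inconsistent state codes; the specific variants shown are hypothetical, not taken from the project described above.

```python
import pandas as pd

# Mapping discovered while profiling the data; extend it as new variants appear.
state_aliases = {
    "CA": "CA", "CAL": "CA", "CALIF": "CA",
    "LOS ANGELES": "CA",   # a city that one source encoded as if it were a state
}

def normalize_state(raw: str) -> str:
    """Put state codes from different sources into one common form."""
    cleaned = raw.strip().upper()
    return state_aliases.get(cleaned, cleaned)

calls = pd.DataFrame({"dest_state": ["Calif", "CA", "Los Angeles"]})
calls["dest_state"] = calls["dest_state"].map(normalize_state)
```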

Step Six: Transform Data to Bring Information to the Surface

Once the data has been assembled and major data problems fixed, the data must still be prepared for analysis. This involves adding derived fields to bring information to the surface. It may also involve removing outliers, binning numeric variables, grouping classes for categorical variables, applying transformations such as logarithms, turning counts into proportions, and the like.


Data preparation is such an important topic that our colleague Dorian Pyle has written a book about it, Data Preparation for Data Mining (Morgan Kaufmann 1999), which should be on the bookshelf of every data miner. In this book, these issues are addressed in Chapter 17. Here are a few examples of such transformations.

Capture Trends

Most corporate data contains time series: monthly snapshots of billing information, usage, contacts, and so on. Most data mining algorithms do not understand time series data. Signals such as “three months of declining revenue” cannot be spotted by treating each month’s observation independently. It is up to the data miner to bring trend information to the surface by adding derived variables, such as the ratio of spending in the most recent month to spending the month before for a short-term trend, and the ratio of the most recent month to the same month a year ago for a long-term trend.
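A minimal sketch of those two derived trend variables, assuming monthly spending columns named spend_m0 (most recent month), spend_m1 (the month before), and spend_m12 (the same month a year earlier); the column names are our own.

```python
import numpy as np
import pandas as pd

def add_trend_ratios(df: pd.DataFrame) -> pd.DataFrame:
    """Surface short- and long-term spending trends as derived variables."""
    df = df.copy()
    # A ratio above 1 means spending is rising; guard against division by zero.
    df["short_term_trend"] = df["spend_m0"] / df["spend_m1"].replace(0, np.nan)
    df["long_term_trend"] = df["spend_m0"] / df["spend_m12"].replace(0, np.nan)
    return df
```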

Create Ratios and Other Combinations of Variables

Trends are one example of bringing information to the surface by combining multiple variables. There are many others. Often, these additional fields are derived from the existing ones in ways that might be obvious to a knowledgeable analyst, but are unlikely to be considered by mere software. Typical examples include: obesity_index = height² / weight
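A short sketch of adding such derived ratio fields follows. The column names are assumed for illustration, the obesity index is computed exactly as written in the text above, and the second ratio shows the related idea of turning a count into a proportion.

```python
import pandas as pd

def add_derived_ratios(df: pd.DataFrame) -> pd.DataFrame:
    """Combine existing fields into ratios an analyst would find natural."""
    df = df.copy()
    # Derived field as defined in the text above.
    df["obesity_index"] = df["height_m"] ** 2 / df["weight_kg"]
    # Turning a count into a proportion, another derivation mentioned earlier.
    df["intl_call_share"] = df["intl_calls"] / df["total_calls"]
    return df
```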
