Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management

Learning Things That Can’t Be Used

It can also happen that data mining uncovers relationships that are both true and previously unknown, but still hard to make use of. Sometimes the problem is regulatory. A customer’s wireless calling patterns may suggest an affinity for certain land-line long-distance packages, but a company that provides both services may not be allowed to take advantage of the fact. Similarly, a customer’s credit history may be predictive of future insurance claims, but regulators may prohibit making underwriting decisions based on it.

Other times, data mining reveals that important outcomes are outside the company’s control. A product may be more appropriate for some climates than others, but it is hard to change the weather. Service may be worse in some regions for reasons of topography, but that is also hard to change.

TIP Sometimes it is only a failure of imagination that makes new information appear useless. A study of customer attrition is likely to show that the strongest predictor of customers leaving is the way they were acquired. It is too late to go back and change that for existing customers, but that does not make the information useless. Future attrition can be reduced by changing the mix of acquisition channels to favor those that bring in longer-lasting customers.
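As a minimal sketch of the attrition study described in the tip, the following Python computes an attrition rate per acquisition channel. The channel names and customer records are invented for illustration; they are assumptions, not data from the text.

```python
from collections import defaultdict

# Hypothetical customer records: (acquisition channel, has the customer left?)
customers = [
    ("direct mail", True), ("direct mail", True), ("direct mail", False),
    ("referral", False), ("referral", False), ("referral", True), ("referral", False),
    ("telemarketing", True), ("telemarketing", True),
]

def attrition_by_channel(customers):
    """Return the fraction of customers who left, for each acquisition channel."""
    counts = defaultdict(lambda: [0, 0])  # channel -> [number who left, total]
    for channel, left in customers:
        counts[channel][0] += int(left)
        counts[channel][1] += 1
    return {channel: left / total for channel, (left, total) in counts.items()}

rates = attrition_by_channel(customers)
# List channels from lowest to highest attrition.
for channel, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{channel}: {rate:.0%} attrition")
```

A table like this cannot change how existing customers were acquired, but it suggests which channels might deserve a larger share of future acquisition spending.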


Chapter 3

The data mining methodology is designed to steer clear of the Scylla of learning things that aren’t true and the Charybdis of not learning anything useful. In a more positive light, the methodology is designed to ensure that the data mining effort leads to a stable model that successfully addresses the business problem it is designed to solve.

Hypothesis Testing

Hypothesis testing is the simplest approach to integrating data into a company’s decision-making processes. The purpose of hypothesis testing is to substantiate or disprove preconceived ideas, and it is a part of almost all data mining endeavors. Data miners often bounce back and forth between approaches, first thinking up possible explanations for observed behavior (often with the help of business experts) and letting those hypotheses dictate the data to be analyzed, then letting the data suggest new hypotheses to test.

Hypothesis testing is what scientists and statisticians traditionally spend their lives doing. A hypothesis is a proposed explanation whose validity can be tested by analyzing data. Such data may simply be collected by observation or generated through an experiment, such as a test mailing. Hypothesis testing is at its most valuable when it reveals that the assumptions that have been guiding a company’s actions in the marketplace are incorrect. For example, suppose that a company’s advertising is based on a number of hypotheses about the target market for a product or service and the nature of the responses. It is worth testing whether these hypotheses are borne out by actual responses. One approach is to use different call-in numbers in different ads and record the number that each responder dials. Information collected during the call can then be compared with the profile of the population the advertisement was designed to reach.

TIP Each time a company solicits a response from its customers, whether through advertising or a more direct form of communication, it has an opportunity to gather information. Slight changes in the design of the communication, such as including a way to identify the channel when a prospect responds, can greatly increase the value of the data collected.
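The call-in-number approach described above can be sketched in a few lines of Python. Everything here is hypothetical: the phone numbers, the ad names, the target age bands, and the call log are assumptions made up to show the bookkeeping, not figures from any real campaign.

```python
from collections import Counter

# Hypothetical mapping from each tracking number to the ad that carried it.
AD_BY_NUMBER = {
    "800-555-0101": "sports magazine ad",
    "800-555-0102": "daytime TV spot",
}

# Hypothetical age band each ad was designed to reach (inclusive bounds).
TARGET_AGE_BAND = {
    "sports magazine ad": (25, 40),
    "daytime TV spot": (55, 75),
}

# Hypothetical call log: (number dialed, caller age collected during the call).
calls = [
    ("800-555-0101", 31), ("800-555-0101", 52), ("800-555-0101", 38),
    ("800-555-0102", 61), ("800-555-0102", 34), ("800-555-0102", 70),
    ("800-555-0102", 58),
]

def share_in_target(calls):
    """Return, per ad, the fraction of responders inside the intended age band."""
    total, hits = Counter(), Counter()
    for number, age in calls:
        ad = AD_BY_NUMBER[number]
        low, high = TARGET_AGE_BAND[ad]
        total[ad] += 1
        if low <= age <= high:
            hits[ad] += 1
    return {ad: hits[ad] / total[ad] for ad in total}

shares = share_in_target(calls)
for ad, share in shares.items():
    print(f"{ad}: {share:.0%} of responders were in the intended age band")
```

A low share for an ad would suggest that the hypothesis about its audience deserves a second look, although with so few calls no firm conclusion could be drawn.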

By its nature, hypothesis testing is ad hoc, so the term “methodology” might be a bit strong. However, there are some identifiable steps to the process, the first and most important of which is generating good ideas to test.


Data Mining Methodology and Best Practices


Generating Hypotheses

The key to generating hypotheses is getting diverse input from throughout the organization and, where appropriate, outside it as well. Often, all that is needed to start the ideas flowing is a clear statement of the problem itself—especially if it is something that has not previously been recognized as a problem.

It happens more often than one might suppose that problems go unrecognized because they are not captured by the metrics being used to evaluate the organization’s performance. If a company has always measured its sales force on the number of new sales made each month, the sales people may never have given much thought to the question of how long new customers remain active or how much they spend over the course of their relationship with the firm. When asked the right questions, however, the sales force may have insights into customer behavior that marketing, with its greater distance from the customer, has missed.

Testing Hypotheses

Consider the following hypotheses:

■ Frequent roamers are less sensitive than others to the price per minute of cellular phone time.

■ Families with high-school age children are more likely to respond to a home equity line offer than others.

■ The save desk in the call center is saving customers who would have returned anyway.

Such hypotheses must be transformed in a way that allows them to be tested on real data. Depending on the hypotheses, this may mean interpreting a single value returned from a simple query, plowing through a collection of association rules generated by market basket analysis, determining the significance of a correlation found by a regression model, or designing a controlled experiment.

In all cases, careful critical thinking is necessary to be sure that the result is not biased in unexpected ways.
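As one illustration of turning a hypothesis into something testable, the sketch below applies a standard one-sided two-proportion z-test to the home equity hypothesis. The choice of test and the response counts are assumptions made for the example; the text does not prescribe a particular statistic.

```python
import math

def two_proportion_z_test(resp_a, n_a, resp_b, n_b):
    """One-sided test of whether group A responds at a higher rate than group B.

    Returns the z statistic and the upper-tail p-value under the null
    hypothesis that both groups share the same response rate.
    """
    p_a, p_b = resp_a / n_a, resp_b / n_b
    pooled = (resp_a + resp_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Upper tail of the standard normal, via the complementary error function.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 120 responses from 2,000 families with high-school age
# children, versus 400 responses from 10,000 other households.
z, p = two_proportion_z_test(120, 2000, 400, 10000)
print(f"z = {z:.2f}, one-sided p-value = {p:.6f}")
```

A small p-value says only that the difference is unlikely to be chance. It does not rule out bias in how the two groups were selected, which is where the critical thinking mentioned above comes in.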

Proper evaluation of data mining results requires both analytical and business knowledge. Where these are not present in the same person, it takes cross-functional cooperation to make good use of the new information.

Models, Profiling, and Prediction

Hypothesis testing is certainly useful, but there comes a time when it is not sufficient. The data mining techniques described in the rest of this book are all designed for learning new things by creating models based on data.


In the most general sense, a model is an explanation or description of how something works that reflects reality well enough that it can be used to make inferences about the real world. Without realizing it, human beings use models all the time. When you see two restaurants and decide that the one with white tablecloths and real flowers on each table is more expensive than the one with Formica tables and plastic flowers, you are making an inference based on a model you carry in your head. When you set out to walk to the store, you again consult a mental model of the town.

Data mining is all about creating models. As shown in Figure 3.3, models take a set of inputs and produce an output. The data used to create the model is called a model set. The new data to which a finished model is applied is called the score set. The model set has three components, which are discussed in more detail later in the chapter:

■ The training set is used to build a set of models.

■ The validation set¹ is used to choose the best of these models.

■ The test set is used to determine how the model performs on unseen data.
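The three-way partition listed above can be sketched as a random split of the model set. The 60/20/20 proportions, the function name, and the fixed seed are assumptions chosen for illustration; the text does not specify the sizes of the partitions.

```python
import random

def split_model_set(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Randomly partition records into training, validation, and test sets.

    Whatever is left after the training and validation shares goes to test.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return training, validation, test

# Hypothetical model set of 1,000 customer identifiers.
model_set = list(range(1000))
training, validation, test = split_model_set(model_set)
print(len(training), len(validation), len(test))
```

Each record lands in exactly one partition, so the test set stays unseen while models are built on the training set and compared on the validation set.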

Data mining techniques can be used to make three kinds of models for three kinds of tasks: descriptive profiling, directed profiling, and prediction. The distinctions are not always clear.


Descriptive models describe what is in the data. The output is one or more charts or numbers or graphics that explain what is going on. Hypothesis testing often produces descriptive models. On the other hand, both directed profiling and prediction have a goal in mind when the model is being built. The difference between them has to do with time frames, as shown in Figure 3.4. In profiling models, the target is from the same time frame as the input. In predictive models, the target is from a later time frame. Prediction means finding patterns in data from one period that are capable of explaining outcomes in a later period. The reason for emphasizing the distinction between profiling and prediction is that it has implications for the modeling methodology, especially the treatment of time in the creation of the model set.
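The time-frame distinction between profiling and prediction can be made concrete with a small sketch of how a model set might be assembled. The customer histories, field layout, and month labels below are hypothetical; only the rule itself (target from the same time frame versus a later one) comes from the text.

```python
# Hypothetical monthly history: customer -> {month: (minutes_used, still_active)}
history = {
    "c1": {"2004-08": (310, True), "2004-09": (290, True),
           "2004-10": (120, True), "2004-11": (0, False)},
    "c2": {"2004-08": (200, True), "2004-09": (220, True),
           "2004-10": (210, True), "2004-11": (230, True)},
}

def build_model_set(history, input_months, target_month):
    """Build rows of (customer, input values, target value)."""
    rows = []
    for customer, months in history.items():
        inputs = [months[m][0] for m in input_months]  # minutes used per month
        target = months[target_month][1]               # active flag
        rows.append((customer, inputs, target))
    return rows

def is_predictive(input_months, target_month):
    """True when the target comes from a later time frame than every input."""
    # ISO-style "YYYY-MM" labels sort chronologically as plain strings.
    return target_month > max(input_months)

input_months = ["2004-08", "2004-09", "2004-10"]

# Profiling: the target is drawn from the same time frame as the inputs.
profiling_rows = build_model_set(history, input_months, "2004-10")

# Prediction: the target is drawn from a later time frame than the inputs.
prediction_rows = build_model_set(history, input_months, "2004-11")

print(is_predictive(input_months, "2004-10"), is_predictive(input_months, "2004-11"))
```

Only the second model set could support a claim about future behavior, because none of its inputs overlap in time with the outcome being explained.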

Figure 3.3 Models take an input and produce an output. [Diagram labels: Inputs, Model, Output]

1 The first edition called the three partitions of the model set the training set, the test set, and the evaluation set. The authors still like that terminology, but standard usage in the data mining community is now training/validation/test. To avoid confusion, this edition adopts the training/validation/test nomenclature.


[Figure, apparently Figure 3.4; caption not present in this excerpt. Calendar timeline covering August 2004 through November 2004. Labels: Input variables, Target variable, Profiling]
