Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management


■■ Motor vehicle registration records

■■ Noise level in decibels from microphones placed in communities near an airport

■■ Telephone call detail records

■■ Survey response data

■■ Demographic and lifestyle data

■■ Economic data

■■ Hourly weather readings (wind direction, wind strength, precipitation)

■■ Census data

Once the business problem has been formulated, it is possible to form a wish list of data that would be nice to have. For a study of existing customers, this should include data from the time they were acquired (acquisition channel, acquisition date, original product mix, original credit score, and so on), similar data describing their current status, and behavioral data accumulated during their tenure. Of course, it may not be possible to find everything on the wish list, but it is better to start out with an idea of what you would like to find.

Occasionally, a data mining effort starts without a specific business problem. A company becomes aware that it is not getting good value from the data it collects, and sets out to determine whether the data could be made more useful through data mining. The trick to making such a project successful is to turn it into a project designed to solve a specific problem. The first step is to explore the available data and make a list of candidate business problems.

Invite business users to create a lengthy wish list which can then be reduced to a small number of achievable goals—the data mining problem.

What Is Available?

The first place to look for data is in the corporate data warehouse. Data in the warehouse has already been cleaned and verified and brought together from multiple sources. A single data model hopefully ensures that similarly named fields have the same meaning and compatible data types throughout the database. The corporate data warehouse is a historical repository; new data is appended, but the historical data is never changed. Since it was designed for decision support, the data warehouse provides detailed data that can be aggregated to the right level for data mining. Chapter 15 goes into more detail about the relationship between data mining and data warehousing.

The only problem is that in many organizations such a data warehouse does not actually exist, or one or more data warehouses exist but do not live up to the promise. In that case, data miners must seek out data from various departmental databases and from within the bowels of operational systems.


Chapter 3

These operational systems are designed to perform a certain task such as claims processing, call switching, order entry, or billing. They are designed with the primary goal of processing transactions quickly and accurately. The data is in whatever format best suits that goal and the historical record, if any, is likely to be in a tape archive. It may require significant political and programming effort to get the data in a form useful for knowledge discovery.

In some cases, operational procedures have to be changed in order to supply data. We know of one major catalog retailer that wanted to analyze the buying habits of its customers so as to market differentially to new customers and long-standing customers. Unfortunately, anyone who hadn’t ordered anything in the past six months was routinely purged from the records. The substantial population of people who loyally used the catalog for Christmas shopping, but not during the rest of the year, went unrecognized and indeed were unrecognizable, until the company began keeping historical customer records.

In many companies, determining what data is available is surprisingly difficult. Documentation is often missing or out of date. Typically, there is no one person who can provide all the answers. Determining what is available requires looking through data dictionaries, interviewing users and database administrators, and examining existing reports.

WARNING Use database documentation and data dictionaries as a guide, but do not accept them as unalterable fact. The fact that a field is defined in a table or mentioned in a document does not mean the field exists, is actually available for all customers, and is correctly loaded.
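One way to act on this warning is to verify documented fields directly against the data. The sketch below uses an in-memory SQLite table with hypothetical field names standing in for a departmental database; it checks whether each field listed in the data dictionary actually exists in the table and how often it is populated:

```python
import sqlite3

# Hypothetical customer table; in practice this would be a departmental
# or operational database, and the field list would come from the data
# dictionary rather than being hard-coded.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER, credit_score INTEGER, acq_channel TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 720, "web"), (2, None, "mail"), (3, None, None)])

# Fields the documentation claims exist (illustrative names).
documented_fields = ["credit_score", "acq_channel", "original_offer"]

# Actual columns present in the table.
cols = [row[1] for row in conn.execute("PRAGMA table_info(customers)")]

for field in documented_fields:
    if field not in cols:
        print(f"{field}: documented but not present in the table")
        continue
    # COUNT(field) counts only non-NULL values, so this reveals
    # fields that exist but are not loaded for all customers.
    total, filled = conn.execute(
        f"SELECT COUNT(*), COUNT({field}) FROM customers").fetchone()
    print(f"{field}: populated for {filled}/{total} rows")
```

Even this trivial check surfaces both kinds of problems the warning describes: a field that is documented but missing, and fields that exist but are sparsely loaded.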

How Much Data Is Enough?

Unfortunately, there is no simple answer to this question. The answer depends on the particular algorithms employed, the complexity of the data, and the relative frequency of possible outcomes. Statisticians have spent years developing tests for determining the smallest model set that can be used to produce a model. Machine learning researchers have spent much time and energy devising ways to let parts of the training set be reused for validation and test. All of this work ignores an important point: In the commercial world, statisticians are scarce, and data is anything but.

In any case, where data is scarce, data mining is not only less effective, it is less likely to be useful. Data mining is most useful when the sheer volume of data obscures patterns that might be detectable in smaller databases. Therefore, our advice is to use so much data that the questions about what constitutes an adequate sample size simply do not arise. We generally start with tens of thousands if not millions of preclassified records so that the training, validation, and test sets each contain many thousands of records.
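The partitioning described above can be sketched as follows; the record counts and input variables are hypothetical, chosen only so that the training, validation, and test sets each contain many thousands of records:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical model set: 100,000 preclassified records.
n_records = 100_000
X = rng.normal(size=(n_records, 5))            # five illustrative input variables
y = (rng.random(n_records) < 0.1).astype(int)  # binary target

# Shuffle once, then carve out disjoint training, validation,
# and test partitions.
order = rng.permutation(n_records)
train_idx = order[:60_000]
valid_idx = order[60_000:80_000]
test_idx = order[80_000:]

print(len(train_idx), len(valid_idx), len(test_idx))  # 60000 20000 20000
```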



Data Mining Methodology and Best Practices


In data mining, more is better, but with some caveats. The first caveat has to do with the relationship between the size of the model set and its density.

Density refers to the prevalence of the outcome of interest. Often the target variable represents something relatively rare. It is rare for prospects to respond to a direct mail offer. It is rare for credit card holders to commit fraud. In any given month, it is rare for newspaper subscribers to cancel their subscriptions.

As discussed later in this chapter (in the section on creating the model set), it is desirable for the model set to be balanced with equal numbers of each of the outcomes during the model-building process. A smaller, balanced sample is preferable to a larger one with a very low proportion of rare outcomes.
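One common way to build such a balanced model set is to keep every rare outcome and downsample the common one. A minimal sketch, with a hypothetical 1 percent response rate standing in for a direct mail campaign:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outcomes: 500,000 prospects, roughly 1% responders.
y = (rng.random(500_000) < 0.01).astype(int)

responders = np.flatnonzero(y == 1)
nonresponders = np.flatnonzero(y == 0)

# Keep every rare outcome and sample an equal number of common ones,
# yielding a smaller, balanced model set.
sampled = rng.choice(nonresponders, size=len(responders), replace=False)
balanced_idx = np.concatenate([responders, sampled])
rng.shuffle(balanced_idx)

# The balanced set is about 2% the size of the original, but half
# of its records are responders instead of one in a hundred.
print(len(balanced_idx), y[balanced_idx].mean())
```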

The second caveat has to do with the data miner’s time. When the model set is large enough to build good, stable models, making it larger is counterproductive because everything will take longer to run on the larger dataset. Since data mining is an iterative process, the time spent waiting for results can become very large if each run of a modeling routine takes hours instead of minutes.

A simple test for whether the sample used for modeling is large enough is to try doubling it and measure the improvement in the model’s accuracy. If the model created using the larger sample is significantly better than the one created using the smaller sample, then the smaller sample is not big enough. If there is no improvement, or only a slight improvement, then the original sample is probably adequate.
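The doubling test can be sketched with synthetic data; the sample sizes, the ten input features, and the use of scikit-learn's logistic regression are illustrative assumptions, not a prescription:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a preclassified model set (hypothetical sizes).
X, y = make_classification(n_samples=40_000, n_features=10, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=10_000, random_state=0)


def accuracy_with(n):
    """Fit on the first n pooled records; score on the held-out test set."""
    model = LogisticRegression(max_iter=1000).fit(X_pool[:n], y_pool[:n])
    return model.score(X_test, y_test)


small = accuracy_with(5_000)
doubled = accuracy_with(10_000)

# If doubling the sample barely moves accuracy, the smaller
# sample is probably adequate.
print(f"5k sample: {small:.3f}  10k sample: {doubled:.3f}")
```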

How Much History Is Required?

Data mining uses data from the past to make predictions about the future. But how far in the past should the data come from? This is another simple question without a simple answer. The first thing to consider is seasonality. Most businesses display some degree of seasonality. Sales go up in the fourth quarter. Leisure travel goes up in the summer. There should be enough historical data to capture periodic events of this sort.

On the other hand, data from too far in the past may not be useful for mining because of changing market conditions. This is especially true when some external event such as a change in the regulatory regime has intervened. For many customer-focused applications, 2 to 3 years of history is appropriate.

However, even in such cases, data about the beginning of the customer relationship often proves very valuable—what was the initial channel, what was the initial offer, how did the customer initially pay, and so on.

How Many Variables?

Inexperienced data miners are sometimes in too much of a hurry to throw out variables that seem unlikely to be interesting, keeping only a few carefully chosen variables they expect to be important. The data mining approach calls for letting the data itself reveal what is and is not important.



Often, variables that had previously been ignored turn out to have predictive value when used in combination with other variables. For example, one credit card issuer that had never included data on cash advances in its customer profitability models discovered through data mining that people who use cash advances only in November and December are highly profitable. Presumably, these are people who are prudent enough to avoid borrowing money at high interest rates most of the time (a prudence that makes them less likely to default than habitual users of cash advances) but who need some extra cash for the holidays and are willing to pay exorbitant interest to get it.
