
It is true that a final model is usually based on just a few variables. But these few variables are often derived by combining several other variables, and it may not have been obvious at the beginning which ones end up being important.

What Must the Data Contain?

At a minimum, the data must contain examples of all possible outcomes of interest. In directed data mining, where the goal is to predict the value of a particular target variable, it is crucial to have a model set composed of preclassified data. To distinguish people who are likely to default on a loan from people who are not, there need to be thousands of examples from each class to build a model that distinguishes one from the other. When a new applicant comes along, his or her application is compared with those of past customers, either directly, as in memory-based reasoning, or indirectly through rules or neural networks derived from the historical data. If the new application “looks like” those of people who defaulted in the past, it will be rejected.
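To make the requirement concrete, the following sketch (in Python with pandas, using invented column names such as income, loan_amount, and defaulted rather than fields from any real dataset) simply verifies that a model set contains preclassified examples of every outcome before any modeling begins:

import pandas as pd

# Hypothetical model set of past loan applications with known outcomes.
# All column names and values here are illustrative assumptions.
model_set = pd.DataFrame({
    "income":      [52000, 38000, 91000, 27000, 64000],
    "loan_amount": [10000, 15000,  5000, 20000,  8000],
    "defaulted":   [0, 1, 0, 1, 0],   # 1 = defaulted, 0 = repaid
})

# The model set must contain examples of every outcome of interest.
class_counts = model_set["defaulted"].value_counts()
print(class_counts)

if class_counts.size < 2:
    raise ValueError("Only one outcome class is present; there is no basis "
                     "for distinguishing defaulters from non-defaulters.")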

Implicit in this description is the idea that it is possible to know what happened in the past. To learn from our mistakes, we first have to recognize that we have made them. This is not always possible. One company had to give up on an attempt to use directed knowledge discovery to build a warranty claims fraud model because, although they suspected that some claims might be fraudulent, they had no idea which ones. Without a training set containing warranty claims clearly marked as fraudulent or legitimate, it was impossible to apply these techniques. Another company wanted a direct mail response model built, but could only supply data on people who had responded to past campaigns. They had not kept any information on people who had not responded so there was no basis for comparison.

Step Three: Get to Know the Data

It is hard to overstate the importance of spending time exploring the data before rushing into building models. Because of its importance, Chapter 17 is devoted to this topic in detail. Good data miners seem to rely heavily on intuition—somehow being able to guess what a good derived variable to try might be, for instance. The only way to develop intuition for what is going on in an unfamiliar dataset is to immerse yourself in it. Along the way, you are likely to discover many data quality problems and be inspired to ask many questions that would not otherwise have come up.

Examine Distributions

A good first step is to examine a histogram of each variable in the dataset and think about what it is telling you. Make note of anything that seems surprising. If there is a state code variable, is California the tallest bar? If not, why not? Are some states missing? If so, does it seem plausible that this company does not do business in those states? If there is a gender variable, are there similar numbers of men and women? If not, is that unexpected? Pay attention to the range of each variable. Do variables that should be counts take on negative values? Do the highest and lowest values sound like reasonable values for that variable to take on? Is the mean much different from the median? How many missing values are there? Have the variable counts been consistent over time?
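As a concrete illustration, here is a minimal first pass in Python with pandas; the DataFrame and its column names are invented for this sketch, and in practice df would be the full dataset:

import pandas as pd

# Illustrative data only; column names and values are assumptions.
df = pd.DataFrame({
    "state":      ["CA", "NY", "CA", "TX", None],
    "gender":     ["M", "F", "F", "M", "F"],
    "item_count": [3, 1, -2, 5, 4],   # a count should never be negative
})

for col in df.columns:
    print(f"--- {col} ---")
    if pd.api.types.is_numeric_dtype(df[col]):
        # Range and mean versus median for numeric fields.
        print(df[col].describe()[["min", "max", "mean", "50%"]])
        if (df[col] < 0).any():
            print("WARNING: negative values in a field that should be a count")
    else:
        # For categorical fields the histogram is just the value counts:
        # is the tallest bar the one you expect? Are any categories missing?
        print(df[col].value_counts(dropna=False))
    print("missing values:", df[col].isna().sum())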

TIP As soon as you get your hands on a data file from a new source, it is a good idea to profile the data to understand what is going on, including getting counts and summary statistics for each field, counts of the number of distinct values taken on by categorical variables, and where appropriate, crosstabulations such as sales by product by region. In addition to providing insight into the data, the profiling exercise is likely to raise warning flags about inconsistencies or definitional problems that could destroy the usefulness of later analysis.
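A profiling pass of this kind is easy to script. The sketch below, again in Python with pandas and with invented field names (product, region, amount), computes summary statistics for each field, distinct-value counts for the categorical variables, and a sales-by-product-by-region crosstabulation:

import pandas as pd

# Illustrative transaction file; the field names and values are assumptions.
sales = pd.DataFrame({
    "product": ["widget", "gadget", "widget", "widget", "gadget"],
    "region":  ["East", "East", "West", "West", "West"],
    "amount":  [120.0, 80.0, 95.0, 110.0, 60.0],
})

# Counts and summary statistics for every field.
print(sales.describe(include="all"))

# Number of distinct values taken on by each categorical variable.
print(sales.select_dtypes(include="object").nunique())

# Crosstabulation: sales by product by region.
print(pd.pivot_table(sales, values="amount", index="product",
                     columns="region", aggfunc="sum", fill_value=0))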

Data visualization tools can be very helpful during the initial exploration of a database. Figure 3.6 shows some data from the 2000 census of the state of New York. (This dataset may be downloaded from the companion Web site at www.data-miners.com/companion where you will also find suggested exercises that make use of it.) The red bars indicate the proportion of towns in the county where more than 15 percent of homes are heated by wood. (In New York, a town is a subdivision of a county that may or may not include any incorporated villages or cities. For instance, the town of Cortland is in Westchester County and includes the village of Croton-on-Hudson, whereas the city of Cortland is in Cortland County, in another part of the state.) The picture, generated by software from Quadstone, shows at a glance that wood-burning stoves are not much used to heat homes in the urbanized counties close to New York City, but are popular in rural areas upstate.
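The derived measure plotted in Figure 3.6, the share of towns in each county where wood heats more than 15 percent of homes, is straightforward to recompute from a town-level extract. The sketch below assumes a table with hypothetical columns county and pct_wood_heat; the rows are made up for illustration.

import pandas as pd

# Hypothetical town-level census extract; column names and values are assumptions.
towns = pd.DataFrame({
    "county":        ["Hamilton", "Hamilton", "Westchester", "Westchester"],
    "pct_wood_heat": [22.0, 18.0, 1.5, 0.8],   # % of homes heated by wood
})

towns["wood_over_15pct"] = towns["pct_wood_heat"] > 15

# Proportion of towns in each county above the 15 percent threshold,
# the quantity shown by the red bars in Figure 3.6.
by_county = towns.groupby("county")["wood_over_15pct"].mean()
print(by_county)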


Figure 3.6 Prevalence of wood as the primary source of heat varies by county in New York state.

Compare Values with Descriptions

Look at the values of each variable and compare them with the description given for that variable in available documentation. This exercise often reveals that the descriptions are inaccurate or incomplete. In one dataset of grocery purchases, a variable that was labeled as being an item count had many noninteger values. Upon further investigation, it turned out that the field contained an item count for products sold by the item, but a weight for items sold by weight. Another dataset, this one from a retail catalog company, included a field that was described as containing total spending over several quarters. This field was mysteriously capable of predicting the target variable—whether a customer had placed an order from a particular catalog mailing. Everyone who had not placed an order had a zero value in the mystery field. Everyone who had placed an order had a number greater than zero in the field. We surmise that the field actually contained the value of the customer’s order from the mailing in question. In any case, it certainly did not contain the documented value.
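A check like the item-count example is easy to automate once the documentation has been read. The sketch below (Python with pandas; the field name item_count and the values are invented) flags rows where a field documented as a count contains something other than a whole number:

import pandas as pd

# Hypothetical field documented as an item count; names and values are assumptions.
purchases = pd.DataFrame({
    "item_count": [1.0, 3.0, 0.55, 2.0, 1.25],   # weights leaking into a "count"
})

# A field described as a count should contain only whole numbers.
non_integer = purchases[purchases["item_count"] % 1 != 0]
if not non_integer.empty:
    print(f"{len(non_integer)} rows have non-integer values in a field "
          "documented as an item count:")
    print(non_integer)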


Validate Assumptions

Using simple cross-tabulation and visualization tools such as scatter plots, bar graphs, and maps, validate assumptions about the data. Look at the target variable in relation to various other variables to see such things as response by channel or churn rate by market or income by sex. Where possible, try to match reported summary numbers by reconstructing them directly from the base-level data. For example, if reported monthly churn is 2 percent, count up the number of customers that cancel one month and see if it is around 2 percent of the total.
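For instance, the churn reconciliation might look something like the following sketch, which assumes a customer table with a cancel_date column (the names and dates are invented) and compares the recomputed February churn with the reported 2 percent:

import pandas as pd

# Hypothetical customer table; column names and dates are assumptions.
customers = pd.DataFrame({
    "customer_id": range(1, 11),
    "cancel_date": pd.to_datetime([None, None, "2004-02-10", None, None,
                                   None, "2004-02-25", None, None, None]),
})

# Reported monthly churn is 2 percent; recount it from the detail data.
active_at_month_start = len(customers)   # simplification: all were active
feb_cancels = customers["cancel_date"].between("2004-02-01", "2004-02-29").sum()

observed_churn = feb_cancels / active_at_month_start
print("reported churn:  2.0%")
print(f"observed churn: {observed_churn:.1%}")   # a large gap calls for questions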

TIP Trying to recreate reported aggregate numbers from the detail data that supposedly goes into them is an instructive exercise. In trying to explain the discrepancies, you are likely to learn much about the operational processes and business rules behind the reported numbers.

Ask Lots of Questions

Wherever the data does not seem to bear out received wisdom or your own expectations, make a note of it. An important output of the data exploration process is a list of questions for the people who supplied the data. Often these questions will require further research because few users look at data as carefully as data miners do. Examples of the kinds of questions that are likely to come out of the preliminary exploration are:

■■ Why are no auto insurance policies sold in New Jersey or Massachusetts?
■■ Why were some customers active for 31 days in February, but none were active for more than 28 days in January?
■■ Why were so many customers born in 1911? Are they really that old?
■■ Why are there no examples of repeat purchasers?
■■ What does it mean when the contract begin date is after the contract end date?
■■ Why are there negative numbers in the sale price field?
■■ How can active customers have a non-null value in the cancelation reason code field?

These are all real questions we have had occasion to ask about real data.

Sometimes the answers taught us things we hadn’t known about the client’s industry. New Jersey and Massachusetts do not allow automobile insurers much flexibility in setting rates, so a company that sees its main competitive advantage as smarter pricing does not want to operate in those markets. Other times we learned about idiosyncrasies of the operational systems, such as the data entry screen that insisted on a birth date even when none was known, which led to a lot of people being assigned the birthday November 11, 1911.
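Many of the questions in the list above can be surfaced automatically by running simple consistency checks over the data before anyone ever looks at a model. The sketch below uses invented field names (begin_date, end_date, sale_price, birth_date, status, cancel_reason) and made-up rows purely to illustrate the idea:

import pandas as pd

# Hypothetical contract table; all field names and rows are assumptions.
contracts = pd.DataFrame({
    "begin_date":    pd.to_datetime(["2003-01-01", "2003-06-15", "2003-09-01"]),
    "end_date":      pd.to_datetime(["2004-01-01", "2003-05-01", "2004-09-01"]),
    "sale_price":    [100.0, -25.0, 80.0],
    "birth_date":    pd.to_datetime(["1965-04-02", "1911-11-11", "1978-08-20"]),
    "status":        ["active", "active", "canceled"],
    "cancel_reason": [None, "moved", "price"],
})

checks = {
    "begin date after end date":   contracts["begin_date"] > contracts["end_date"],
    "negative sale price":         contracts["sale_price"] < 0,
    "suspicious 1911-11-11 birth date":
        contracts["birth_date"] == pd.Timestamp("1911-11-11"),
    "active but has a cancelation reason":
        (contracts["status"] == "active") & contracts["cancel_reason"].notna(),
}

for description, mask in checks.items():
    print(f"{description}: {mask.sum()} row(s)")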
