Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management – Page 131 – Library. Read online. Free books read online. Read books without registering

A data mining group can be successful in any of several locations within the company organization chart. Locating the group in IT puts it close to data and technical resources. Locating it within a business unit puts it close to the business problems. In either case, it is important to have good communication between IT and the business units.

Choosing software for the data mining environment is important. However, the success of the data mining group depends more on having good processes and good people than on the particular software found on their desktops.

470643 c17.qxd 3/8/04 11:29 AM Page 539

C H A P T E R

Preparing Data for Mining

As a translucent amber fluid, gasoline—the power behind the transportation industry—barely resembles the gooey black ooze pumped up through oil wells. The difference between the two liquids is the result of multiple steps of refinement that distill useful products from the raw material.

Data preparation is a very similar process. The raw material comes from operational systems that have often accumulated crud, in the form of eccentric business rules and layers of system enhancements and fixes, over the course of time. Fields in the data are used for multiple purposes. Values become obsolete. Errors are fixed on an ongoing basis, so interpretations change over time.

The process of preparing data is like the process of refining oil. Valuable stuff lurks inside the goo of operational data. Half the battle is refinement. The other half is converting its energy to a useful form—the equivalent of running an engine on gasoline.

The proliferation of data is a feature of modern business. Our challenge is to make sense of the data, to refine the data so that the engines of data mining can extract value. One of the challenges is the sheer volume of data. A customer may call the call center several times a year, pay a bill once a month, turn the phone on once a day, make and receive phone calls several times a day. Over the course of time, hundreds of thousands or millions of customers are generating hundreds of millions of records of their behavior. Even on today’s computers, this is a lot of data processing. Fortunately, computer systems have become powerful enough that the problem is really one of having an adequate 539

470643 c17.qxd 3/8/04 11:29 AM Page 540

540 Chapter 17

budget for buying hardware and software; technically, processing such vast quantities of data is possible.

Data comes in many forms, from many systems, and in many different types. Data is always dirty, incomplete, sometimes incomprehensible and incompatible. This is, alas, the real world. And yet, data is the raw material for data mining. Oil starts out as a thick tarry substance, mixed with impurities. It is only by going through various stages of refinement that the raw material becomes usable—whether as clear gasoline, plastic, or fertilizer. Just as the most powerful engines cannot use crude oil as a fuel, the most powerful algorithms (the engines of data mining) are unlikely to find interesting patterns in unprepared data.

After more than a century of experimentation, the steps of refining oil are quite well understood—better understood than the processes of preparing data. This chapter illustrates some guidelines and principles that, based on experience, should make the process more effective. It starts with a discussion of what data should look like once it has been prepared, describing the customer signature. It then dives into what data actually looks like, in terms of data types and column roles. Since a major part of successful data mining is in the derived variables, ideas for these are presented in some detail. The chapter ends with a look at some of the difficulties presented by dirty data and missing values, and the computational challenge of working with large volumes of commercial data.

What Data Should Look Like

The place to start the discussion on data is at the end: what the data should look like. All data mining algorithms want their inputs in tabular form—the rows and columns so common in spreadsheets and databases. Unlike spreadsheets, though, each column must mean the same thing for all the rows.

Some algorithms need their data in a particular format. For instance, market basket analysis (discussed in Chapter 9) usually looks at only the products purchased at any given time. Also, link analysis (see Chapter 10) needs references between records in order to connect them. However, most algorithms, and especially decision trees, neural networks, clustering, and statistical regression, are looking for data in a particular format called the customer signature.

The Customer Signature

The customer signature is a snapshot of customer behavior that captures both current attributes of the customers and changes in behavior over time. Like

470643 c17.qxd 3/8/04 11:29 AM Page 541

Preparing Data for Mining 541

a signature on a check, each customer’s signature is theoretically unique—

capturing the unique characteristics of the individual. Unlike a signature on a check, though, the customer signature is used for analysis and not identification; in fact, often customer signatures have no more identifying information than a string of seemingly random digits representing a household, individual, or account number. Figure 17.1 shows that a customer signature is simply a row of data that represents the customer and whatever might be useful for data mining.

This column is an ID field where

the value is different in every

column. It is ignored for data

mining purposes.

This column is the target,

This column is from the customer

what we want to predict.

information file.

2610000101

010377

19.1

14 Spring . .. TRUE

2610000102

103188

19.1

NULL

TRUE

2610000105

041598

21.2

71 W. 19 St. FALSE

2610000171

040296

38.3

3562 Oak. . . FALSE

2610000182

051990

56.1

9672 W. 142 FALSE

2610000183

111192

56.1

NULL

TRUE

These rows have

invalid customer IDs,

so they are ignored.

2620000107

080891

19.1

P.O. Box 11 FALSE

2620000108

120398

10.0

560 Robson TRUE

2620000220

022797

38.3

222 E. 11th FALSE

2620000221

021797

19.1

10122 SW 9 FALSE

2620000230

060899

38.3

NULL

TRUE

2620000231

062099

38.3

RR 1729

TRUE

2620000300

032894

21.2

1920 S. 14th FALSE

This column is summarized

from transaction data.

This column is a text field with unique

values. It is ignored (although it may

be used for some derived variables).

These columns come from reference

tables, so their values are repeated

many times.

Figure 17.1 Each row in the customer signature represents one customer (the unit of data mining) with fields describing that customer.

470643 c17.qxd 3/8/04 11:29 AM Page 542

542 Chapter 17

It is perhaps unfortunate that there is no big database sitting around with up-to-date customer signatures, ready for all modeling applications. Such a system might at first sight seem very useful. However, the lack of such a system is an opportunity because modeling efforts require understanding data.

No single customer signature works for all modeling efforts, although some customer signatures work well for several applications

The “customer” in customer signature is the unit of data mining. This book focuses primarily on customers, so the unit of data mining is typically an account, an individual, or a household. There are other possibilities. Chapter 11 has a case study on clustering towns—because that was the level of action for developing editorial zones for a newspaper. Acquisition modeling often takes place at the geographic level, census block groups or zip codes. And applications outside customer relationship management are even more disparate. Mastering Data Mining, for instance, has a case study where the signatures are press runs in plants that print magazines.

The Columns

The columns in the data contain values that describe aspects of the customer. In some cases, the columns come directly from existing business systems; more often, the columns are the result of some calculation—so called derived variables.

TEAMFLY

Each column contains values. The range refers to the set of allowable values for that column. Table 17.1 shows range characteristics for typical types of data used for data mining.

Table 17.1 Range Characteristics for Typical Types of Data Used for Data Mining VARIABLE TYPE

TYPICAL RANGE CHARACTERISTICS

Categorical variables

List of acceptable values

Numeric

Minimum and maximum values

Dates

Earliest and latest dates, often latest date is less

than or equal to current date

Monetary amounts

Greater than or equal to 0

Durations

Greater than or equal to 0 (or perhaps strictly

greater than 0)

Binned or quantiled values

The number of quantiles

Counts

Greater than or equal to 0 (or perhaps greater than

or equal to 1)

Team-Fly®

470643 c17.qxd 3/8/04 11:29 AM Page 543

Preparing Data for Mining 543

Histograms, such as those in Figure 17.2, shows how often each value or range of values occurs in some set of data. The vertical axis is a count of records, and the horizontal axis is the values in the column. The shape of this histogram shows the distribution of the values (strictly speaking, in a distribution, the counts are divided by the total number of records so the area under the curve is one). If we are working with a sample, and the sample is randomly chosen, then the distribution of values in the subset should be about the same as the distribution in the original data.