A data mining group can be successful in any of several locations within the company organization chart. Locating the group in IT puts it close to data and technical resources. Locating it within a business unit puts it close to the business problems. In either case, it is important to have good communication between IT and the business units.
Choosing software for the data mining environment is important. However, the success of the data mining group depends more on having good processes and good people than on the particular software found on their desktops.
470643 c17.qxd 3/8/04 11:29 AM Page 539
C H A P T E R
17
Preparing Data for Mining
As a translucent amber fluid, gasoline—the power behind the transportation industry—barely resembles the gooey black ooze pumped up through oil wells. The difference between the two liquids is the result of multiple steps of refinement that distill useful products from the raw material.
Data preparation is a very similar process. The raw material comes from operational systems that have often accumulated crud, in the form of eccentric business rules and layers of system enhancements and fixes, over the course of time. Fields in the data are used for multiple purposes. Values become obsolete. Errors are fixed on an ongoing basis, so interpretations change over time.
The process of preparing data is like the process of refining oil. Valuable stuff lurks inside the goo of operational data. Half the battle is refinement. The other half is converting its energy to a useful form—the equivalent of running an engine on gasoline.
The proliferation of data is a feature of modern business. Our challenge is to make sense of the data, to refine the data so that the engines of data mining can extract value. One of the challenges is the sheer volume of data. A customer may call the call center several times a year, pay a bill once a month, turn the phone on once a day, make and receive phone calls several times a day. Over the course of time, hundreds of thousands or millions of customers are generating hundreds of millions of records of their behavior. Even on today’s computers, this is a lot of data processing. Fortunately, computer systems have become powerful enough that the problem is really one of having an adequate 539
470643 c17.qxd 3/8/04 11:29 AM Page 540
540 Chapter 17
budget for buying hardware and software; technically, processing such vast quantities of data is possible.
Data comes in many forms, from many systems, and in many different types. Data is always dirty, incomplete, sometimes incomprehensible and incompatible. This is, alas, the real world. And yet, data is the raw material for data mining. Oil starts out as a thick tarry substance, mixed with impurities. It is only by going through various stages of refinement that the raw material becomes usable—whether as clear gasoline, plastic, or fertilizer. Just as the most powerful engines cannot use crude oil as a fuel, the most powerful algorithms (the engines of data mining) are unlikely to find interesting patterns in unprepared data.
After more than a century of experimentation, the steps of refining oil are quite well understood—better understood than the processes of preparing data. This chapter illustrates some guidelines and principles that, based on experience, should make the process more effective. It starts with a discussion of what data should look like once it has been prepared, describing the customer signature. It then dives into what data actually looks like, in terms of data types and column roles. Since a major part of successful data mining is in the derived variables, ideas for these are presented in some detail. The chapter ends with a look at some of the difficulties presented by dirty data and missing values, and the computational challenge of working with large volumes of commercial data.
What Data Should Look Like
The place to start the discussion on data is at the end: what the data should look like. All data mining algorithms want their inputs in tabular form—the rows and columns so common in spreadsheets and databases. Unlike spreadsheets, though, each column must mean the same thing for all the rows.
Some algorithms need their data in a particular format. For instance, market basket analysis (discussed in Chapter 9) usually looks at only the products purchased at any given time. Also, link analysis (see Chapter 10) needs references between records in order to connect them. However, most algorithms, and especially decision trees, neural networks, clustering, and statistical regression, are looking for data in a particular format called the customer signature.
The Customer Signature
The customer signature is a snapshot of customer behavior that captures both current attributes of the customers and changes in behavior over time. Like
470643 c17.qxd 3/8/04 11:29 AM Page 541
Preparing Data for Mining 541
a signature on a check, each customer’s signature is theoretically unique—
capturing the unique characteristics of the individual. Unlike a signature on a check, though, the customer signature is used for analysis and not identification; in fact, often customer signatures have no more identifying information than a string of seemingly random digits representing a household, individual, or account number. Figure 17.1 shows that a customer signature is simply a row of data that represents the customer and whatever might be useful for data mining.
This column is an ID field where
the value is different in every
column. It is ignored for data
mining purposes.
This column is the target,
This column is from the customer
what we want to predict.
information file.
2610000101
010377
14
A
19.1
14 Spring . .. TRUE
2610000102
103188
7
A
19.1
NULL
TRUE
2610000105
041598
1
B
21.2
71 W. 19 St. FALSE
2610000171
040296
1
S
38.3
3562 Oak. . . FALSE
2610000182
051990
22
C
56.1
9672 W. 142 FALSE
2610000183
111192
45
C
56.1
NULL
TRUE
These rows have
invalid customer IDs,
so they are ignored.
2620000107
080891
6
A
19.1
P.O. Box 11 FALSE
2620000108
120398
3
D
10.0
560 Robson TRUE
2620000220
022797
2
S
38.3
222 E. 11th FALSE
2620000221
021797
3
A
19.1
10122 SW 9 FALSE
2620000230
060899
1
S
38.3
NULL
TRUE
2620000231
062099
10
S
38.3
RR 1729
TRUE
2620000300
032894
7
B
21.2
1920 S. 14th FALSE
This column is summarized
from transaction data.
This column is a text field with unique
values. It is ignored (although it may
be used for some derived variables).
These columns come from reference
tables, so their values are repeated
many times.
Figure 17.1 Each row in the customer signature represents one customer (the unit of data mining) with fields describing that customer.
470643 c17.qxd 3/8/04 11:29 AM Page 542
542 Chapter 17
It is perhaps unfortunate that there is no big database sitting around with up-to-date customer signatures, ready for all modeling applications. Such a system might at first sight seem very useful. However, the lack of such a system is an opportunity because modeling efforts require understanding data.
No single customer signature works for all modeling efforts, although some customer signatures work well for several applications
The “customer” in customer signature is the unit of data mining. This book focuses primarily on customers, so the unit of data mining is typically an account, an individual, or a household. There are other possibilities. Chapter 11 has a case study on clustering towns—because that was the level of action for developing editorial zones for a newspaper. Acquisition modeling often takes place at the geographic level, census block groups or zip codes. And applications outside customer relationship management are even more disparate. Mastering Data Mining, for instance, has a case study where the signatures are press runs in plants that print magazines.
The Columns
The columns in the data contain values that describe aspects of the customer. In some cases, the columns come directly from existing business systems; more often, the columns are the result of some calculation—so called derived variables.
TEAMFLY
Each column contains values. The range refers to the set of allowable values for that column. Table 17.1 shows range characteristics for typical types of data used for data mining.
Table 17.1 Range Characteristics for Typical Types of Data Used for Data Mining VARIABLE TYPE
TYPICAL RANGE CHARACTERISTICS
Categorical variables
List of acceptable values
Numeric
Minimum and maximum values
Dates
Earliest and latest dates, often latest date is less
than or equal to current date
Monetary amounts
Greater than or equal to 0
Durations
Greater than or equal to 0 (or perhaps strictly
greater than 0)
Binned or quantiled values
The number of quantiles
Counts
Greater than or equal to 0 (or perhaps greater than
or equal to 1)
Team-Fly®
470643 c17.qxd 3/8/04 11:29 AM Page 543
Preparing Data for Mining 543
Histograms, such as those in Figure 17.2, shows how often each value or range of values occurs in some set of data. The vertical axis is a count of records, and the horizontal axis is the values in the column. The shape of this histogram shows the distribution of the values (strictly speaking, in a distribution, the counts are divided by the total number of records so the area under the curve is one). If we are working with a sample, and the sample is randomly chosen, then the distribution of values in the subset should be about the same as the distribution in the original data.