Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management – Page 138 – Library. Read online. Free books read online. Read books without registering

■■

Extracting features from a single value

■■

Combining values within a record (used, among other things, for capturing trends)

■■

Looking up auxiliary information in another table

■■

Pivoting time-dependent data into multiple columns

■■

Summarizing transactional records

■■

Summarizing fields across the model set

470643 c17.qxd 3/8/04 11:29 AM Page 569

Preparing Data for Mining 569

The following sections discuss these methods, giving examples of derived variables and highlighting important points about computing them.

Extracting Features from a Single Value

Computationally, parsing values is a very simple operation because all the data needed is present in a single value. Even though it is so simple, it is quite useful, as these examples show:

■■

Calculating the day of the week from a date

■■

Extracting the credit card issuer code from a credit card number

■■

Taking the SCF (first three digits) of a zip code

■■

Determining the vehicle manufacturer code from the VIN

■■

Adding a flag when a field is missing

These operations generally require rudimentary operations that data mining tools should be able to handle. Unfortunately, many statistical tools focus more on numeric data types than on the strings, dates, and times often encountered in business data—so string operations and date arithmetic can be difficult. In such cases, these variables may need to be added during a preprocessing phase or as data is extracted from data sources.

Combining Values within a Record

As with the extraction of features from a single value, combining values within a record is computationally simple—instead of using one variable, there are several variables. Most data mining tools support adding derived variables that combine values from several fields, particularly for numeric fields. This can be very useful, for adding ratios, sums, averages, and so on. Such derived values are often more useful for modeling purposes than the raw data because these variables start to capture underlying customer behavior. Date fields are often combined. Taking the difference of two dates to calculate duration is an especially common and useful example.

It is not usually necessary to combine string fields, unless the fields are somehow related. For instance, it might be useful to combine a “credit card payment flag” with a “credit card type,” so there is one field representing the payment type.

Looking Up Auxiliary Information

Looking up auxiliary information is a more complicated process than the previous two calculations. A lookup is an example of joining two tables together (to use relational database terminology), with the simplifying assumption that one table is big and the other table is relatively small.

470643 c17.qxd 3/8/04 11:29 AM Page 570

570 Chapter 17

When the lookup table is small enough, such as Table 17.3, which describes the mapping between initial digits of a credit card number and the credit card type, then a simple formula can suffice for the lookup.

The more common situation is having a secondary table or file with the information. This table might, for instance, contain:

■■ Populations and median household incomes of zip codes (usefully provided for downloading for the United States by the U.S. Census Bureau at www.census.gov)

■■

Hierarchies for product codes

■■

Store type information about retail locations

Unfortunately data mining tools do not, as a rule, make it easy to do lookups without programming. Tools that do provide this facility, such as I-Miner from Insightful, usually require that both tables be sorted by the field or fields used for the lookup; an example of this is shown in Figure 17.12. This is palatable for one such field, but it is cumbersome when there are many different fields to be looked up. In general, it is easier to do these lookups outside the tool, especially when the lookup tables and original data are both coming from databases.

Figure 17.12 Insightful Miner enables users to use and create lookup tables from the graphical user interface.

470643 c17.qxd 3/8/04 11:29 AM Page 571

Preparing Data for Mining 571

Sometimes, the lookup tables already exist. Other times, they must be created as needed. For instance, one useful predictor of customer attrition is the historical attrition rate by zip code. To add this to a customer signature requires calculating the historical attrition rate for each zip code and then using the result as a lookup table.

WA R N I N G When using database joins to look up values in a lookup table, always use a left outer join to ensure that no customer rows are lost in the process! An outer join in SQL looks like:

SELECT c.*, l.value

FROM (customer c left outer join lookup l on c.code = l.code)

Table 17.3 Credit Card Prefixes

CARD TYPE

PREFIX

LENGTH

MasterCard

Visa

American Express

Diners Club

300

Diners Club

301

Diners Club

302

Diners Club

303

Diners Club

304

Diners Club

305

Discover

6011

enRoute

2014

enRoute

2149

JCB

2131

JCB

1800

470643 c17.qxd 3/8/04 11:29 AM Page 572

572 Chapter 17

Pivoting Regular Time Series

Data about customers is often stored at a monthly level, where each month has a separate row of data. For instance, billing data is often stored this way, since most subscription-based companies bill customers once a month. This data is an example of a regular time series, because the data occurs at fixed, defined intervals.

Figure 17.13 illustrates the process needed to put this data into a customer signature. The data must be pivoted, so values that start out in rows end up in columns.

This is generally a cumbersome process, because neither data mining tools nor SQL makes it easy to do pivoting. Data mining tools generally require programming for pivoting. To accomplish this, the customer file needs to be sorted by customer ID, and the billing file needs to be sorted by the customer ID and the billing date. Then, special-purpose code is needed to calculate the pivoting columns. In SAS, proc TRANSPOSE is used for this purpose. The sidebar “Pivoting Data in SQL” shows how it is done in SQL.

Most businesses store customer data on a monthly basis, usually by calendar month. Some industries, though, show strong weekly cyclical patterns, because customers either do or do not do things over the weekend. For instance, Web sites might be most active during weekdays, and newspaper subscriptions generally start on Mondays or Sundays.

Such weekly cycles interfere with the monthly data, because some months are longer than others. Consider a Web site where most activity is on weekdays. Some TEAMFLY

months have 20 weekdays; others have up to 23 (not including holidays). The difference between successive months could be 15 percent, due solely to the difference in the number of weekdays. To take this into account, divide the monthly activity by the number of weekdays during the month, to get an “activity per weekday.” This only makes sense, though, when there are strong weekly cycles.

CUSTOMER

MON

AMOUNT

Cust 1

Jan

$38.43

Cust 1

Feb

$41.22

Cust 1

Mar

$21.09

Cust 1

Apr

$66.02

Cust 2

Mar

$14.36

Cust 2

Apr

$9.52

CUSTOMER

JAN AMOUNT

FEB AMOUNT

MAR AMOUNT

APR AMOUNT

Cust 1

$38.43

$41.22

$21.09

$66.02

Cust 2

$14.36

$9.52

Figure 17.13 Pivoting a field takes values stored in one or more rows for each customer and puts them into a single row for each customer, but in different columns.

Team-Fly®

470643 c17.qxd 3/8/04 11:29 AM Page 573

Preparing Data for Mining 573

PIVOTING DATA IN SQL

SQL does not have great support for pivoting data (although some databases may have nonstandard extensions with this capability). However, when using standard SQL it is possible to pivot data.

Assume that the data consists of billing records and that each has a sequential billing number assigned to it. The first billing record has a “1,” the second “2,”

and so on. The following SQL fragment shows how to pivot this data: SELECT customer_id,

sum(case when bill_seq = 1 then bill_amt end) as bill_1,

sum(case when bill_seq = 2 then bill_amt end) as bill_2,

sum(case when bill_seq = 3 then bill_amt end) as bill_3,

. . .

FROM billing

GROUP BY customer_id

One problem with this fragment is that different customers have different numbers of billing periods. However, the query can only take a fixed number.

When a customer has fewer billing periods than the query wants, then the later periods are filled with NULLs.

Actually, this code fragment is not generally what is needed for customer signatures because the signature wants the most recent billing periods—such as the last 12 or 24. For customers who are active, this is the most recent period.

However, for customers who have stopped, this requires considering their stop date instead. The following code fragment takes this into account: SELECT customer_id,

sum(case when trunc(months_between(bill_date, cutoff) = 1