The BILL_MASTER file describes billing information at the account level.
Multiple handsets might be attached to the same billing account—particularly for business customers and customers on family billing plans.
Other sources of data were available in the company, but they were not initially selected for the customer signature. One such source was the call detail records (a record of every telephone call), which are useful for predicting churn. Although this data was eventually used by the data mining group, it was not part of this initial effort.
470643 c17.qxd 3/8/04 11:29 AM Page 560
560 Chapter 17
Identifying the Customer
The data is typical of the real world. Although the focus might be on one type of customer or another, the data has multiple groups. The sidebar “Residential Versus Business Customers” talks about distinguishing between these two segments.
The business problem being addressed in this example is churn. As shown in Figure 17.8, the customer data model is rather complex, resulting in different options for the definition of customer:
■■ Telephone number
■■ Customer ID
■■ Billing account
This being the real world, though, it is important to remember that these relationships are complex and change over time. Customers might change their telephone numbers. Telephones might be added or removed from accounts. Customers change handsets, and so on. For the purposes of building the signature, the decision was to use the telephone number, because this was how the business reported churn.
[Figure 17.8 diagram; labels include Sales Rep, Supervisor, Customer, Customer ID, Billing Account, Contract, and Telephone Number.]
Figure 17.8 The customer model is complicated and takes into account sales, billing, and business hierarchy information.
Preparing Data for Mining 561
RESIDENTIAL VERSUS BUSINESS CUSTOMERS
Often data mining efforts focus on one type of customer—such as residential customers or small businesses. However, data for all customers is often mixed together in operational systems and data warehouses. Typically, there are multiple ways to distinguish between these types of customers:
◆ Often there is a customer type field, which has values like “residential” and “small business.”
◆ There might be a sales hierarchy; some sales channels are business-only while others are residential-only.
◆ Some billing plans are only for businesses; others are only for residential customers.
◆ There might be business rules, so any customer with more than two lines is considered business.
These examples illustrate the fact that there are typically several different rules for distinguishing between different types of customers. Given the opportunity to be inconsistent, most data sources will not fail. The different rules select different subsets of customers.
Is this a problem? That depends on the particular model being worked on. The hope is that the rules are all very close, so the customers included (or missed) by one rule are essentially the same as those included by the others. It is important to investigate whether this is true, and to look closely at the customers on which the rules disagree.
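One way to carry out this investigation is to apply each rule separately and compare the sets of customers it selects. The following is only a minimal sketch: the records, the field names (customer_type, channel, num_lines), and the channel value are hypothetical, invented for illustration.

```python
# Hypothetical customer records; every field name and value here is an assumption.
customers = [
    {"id": 1, "customer_type": "residential", "channel": "retail", "num_lines": 1},
    {"id": 2, "customer_type": "small business", "channel": "business_sales", "num_lines": 5},
    {"id": 3, "customer_type": "residential", "channel": "retail", "num_lines": 3},
    {"id": 4, "customer_type": "residential", "channel": "business_sales", "num_lines": 1},
]

# Each rule returns True when it classifies the customer as a business.
rules = {
    "type_field": lambda c: c["customer_type"] != "residential",
    "sales_channel": lambda c: c["channel"] == "business_sales",
    "line_count": lambda c: c["num_lines"] > 2,  # "more than two lines"
}

def business_ids(rule):
    """Return the set of customer ids that the rule labels as business."""
    return {c["id"] for c in customers if rule(c)}

selected = {name: business_ids(rule) for name, rule in rules.items()}

def disagreements(rule_a, rule_b):
    """Customers that exactly one of the two rules labels as business."""
    return selected[rule_a] ^ selected[rule_b]

for name, ids in selected.items():
    print(name, sorted(ids))
print("type_field vs sales_channel:", sorted(disagreements("type_field", "sales_channel")))
```

The customers returned by disagreements are the ones worth inspecting by hand before settling on a single rule.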
What usually happens in practice is that one of the rules is predominant, because that is the way the business is organized. So, although the customer type might be interesting, the sales hierarchy is probably more important, since it corresponds to people who have responsibility for different customer segments.
The distinction between businesses and residences is important for prospects as well as customers. A long-distance telephone company sees many calls traversing its network that were originated by customers of other carriers.
Their switches create call detail records containing the originating and destination telephone numbers. Any domestic number that does not belong to an existing customer belongs to a prospect. One long-distance company builds signatures to describe the behavior of the unknown telephone numbers over time by tracking such things as how frequently the number is seen, what times of day and days of the week it is typically active, and the typical call duration.
Among other things, this signature is used to score the unknown telephone numbers for the likelihood that they are businesses because business and residential customers are attracted by different offers.
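A signature of this kind can be sketched by grouping call detail records by originating number and summarizing them. The record layout, the sample numbers, and the definition of business hours (Monday through Friday, 08:00 to 17:59) are assumptions made for this illustration, not details from the company described.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical call detail records: (originating number, start time, duration in seconds).
calls = [
    ("555-0100", datetime(2004, 3, 1, 9, 15), 120),   # Monday morning
    ("555-0100", datetime(2004, 3, 2, 14, 30), 300),  # Tuesday afternoon
    ("555-0100", datetime(2004, 3, 6, 10, 0), 180),   # Saturday morning
    ("555-0199", datetime(2004, 3, 1, 20, 45), 600),  # Monday evening
]

def is_business_hours(start):
    """Assumed definition: a weekday (Mon-Fri) between 08:00 and 17:59."""
    return start.weekday() < 5 and 8 <= start.hour < 18

def build_signatures(records):
    """Summarize call behavior for each originating telephone number."""
    grouped = defaultdict(list)
    for number, start, duration in records:
        grouped[number].append((start, duration))

    signatures = {}
    for number, rows in grouped.items():
        count = len(rows)
        in_hours = sum(1 for start, _ in rows if is_business_hours(start))
        signatures[number] = {
            "call_count": count,
            "avg_duration": sum(d for _, d in rows) / count,
            "pct_business_hours": in_hours / count,
        }
    return signatures

signatures = build_signatures(calls)
```

A number with a high share of weekday daytime calls would be a plausible candidate for a business score, although the actual scoring model is not described in the text.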
One simplification would be to focus only on customers whose accounts have only one telephone number. Since the purpose is to build a model for residential customers, this is a good way of simplifying the data model for getting started. If the purpose were to build a model for business customers, a better choice for the customer level would be the billing account, since business customers often turn handsets and telephone numbers on and off. However, churn in this case would mean the cancellation of the entire account, rather than the cancellation of a single telephone number. These two situations are the same for those residential customers who have only one line.
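The single-line simplification amounts to counting telephone numbers per billing account and keeping the numbers whose account has exactly one. A minimal sketch, assuming a hypothetical list of (telephone number, billing account) pairs:

```python
from collections import Counter

# Hypothetical (telephone number, billing account) pairs; values are invented.
unit_rows = [
    ("555-0101", "A1"),
    ("555-0102", "A2"),
    ("555-0103", "A2"),  # account A2 has two lines, so both numbers are excluded
    ("555-0104", "A3"),
]

# Count how many telephone numbers are attached to each billing account.
lines_per_account = Counter(account for _, account in unit_rows)

# Keep only the numbers whose account has exactly one line.
single_line_numbers = [
    number for number, account in unit_rows if lines_per_account[account] == 1
]

print(single_line_numbers)
```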
First Attempt
The first attempt to build the customer signature needs to focus on the simplest data source. In this case, the simplest data source is the UNIT_MASTER file, which conveniently stores data at the telephone number level, the level being used for the customer signature.
It is worth pointing out two problems with this file and the customer definition:
■■ Customers may change their telephone number.
■■ Telephone numbers may be reassigned to new customers.
These problems will be addressed later; the first customer signature is at the telephone number level to get started. The process used to build the signature has four steps: identifying the time frames, creating a recent snapshot, pivoting columns, and calculating the target.
Identifying the Time Frames
The first attempt at building the customer signature needs to take into account the time frame for the data, as discussed in Chapter 3. Figure 17.9 shows a model time chart for this data. The ultimate model set should have more than one time frame in it. However, the first attempt focuses on only one time frame.
The time frame defines churn as occurring during one month, August. All of the input data comes from at least one month earlier: the cutoff date is June 30, which leaves July as one month of latency.
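The date arithmetic behind this time frame can be sketched as follows. The year (2003) is hypothetical, since the text names only the months, and the default of four input months is an assumption. The suffixes follow the “_01” naming convention, with “_01” as the most recent input month.

```python
import calendar
from datetime import date

def shift_month(year, month, offset):
    """Return (year, month) moved by `offset` months; offset may be negative."""
    index = year * 12 + (month - 1) + offset
    return index // 12, index % 12 + 1

def time_frame(target_year, target_month, latency_months=1, input_months=4):
    """Return the cutoff date and a suffix -> (year, month) map for the inputs.

    The last input month is the one before the latency period; the cutoff
    is the final day of that month.
    """
    last_year, last_month = shift_month(target_year, target_month, -(latency_months + 1))
    last_day = calendar.monthrange(last_year, last_month)[1]
    cutoff = date(last_year, last_month, last_day)

    suffixes = {}
    for k in range(input_months):
        # "_01" is the most recent input month, "_02" the month before, and so on.
        suffixes["_%02d" % (k + 1)] = shift_month(last_year, last_month, -k)
    return cutoff, suffixes

# Target month August of a hypothetical year, with one month of latency.
cutoff, suffixes = time_frame(2003, 8)
print(cutoff)    # cutoff date for the input data
print(suffixes)  # which calendar month each suffix refers to
```

Keeping this mapping in one place makes it easier to add further time frames later without renaming columns by hand.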
Taking a Recent Snapshot
The most recent snapshot of data is defined by the cutoff date. These fields in the signature describe the most recent information known about a customer before he or she churned (or did not churn).
[Figure 17.9 chart: a month axis running from Jan through Nov, with one row labeled SCORE and two rows labeled MODEL SET; each row is marked 4, 3, 2, 1, P.]
Figure 17.9 A model time chart shows the time frame for the input columns and targets when building a customer signature.
The snapshot is a set of fields from the UNIT_MASTER file for June, such as the handset type, billing plan, and so on. It is important to keep the time frame in mind when filling the customer signature, and a naming convention helps avoid confusion. In this case, all the fields might have a suffix of “_01,” indicating that they are from the most recent month of input data.
TIP Use a naming convention when building the customer signature to indicate the time frame for each variable. For instance, the most recent month of input data would have a “_01” suffix; the month before, “_02”; and so on.
At this point, presumably not much is known about the fields, so descriptive information is useful. For instance, the billing plan might have a description, monthly base, per-minute cost, and so on. All of these features are interesting and of potential value for modeling—so it is reasonable to bring them into the model set. Although descriptions are not going to be used for modeling (codes are much better), they help the data miners understand the data.
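As a rough illustration, the snapshot step could look like the sketch below: take the June record for each telephone number, attach the descriptive billing plan attributes, and add the “_01” suffix. The record layout, the plan lookup table, and all field names are hypothetical; the actual UNIT_MASTER layout is not given in the text.

```python
# Hypothetical June UNIT_MASTER rows; field names and values are invented.
unit_master_june = [
    {"phone": "555-0101", "handset_type": "H1", "billing_plan": "P10"},
    {"phone": "555-0104", "handset_type": "H7", "billing_plan": "P99"},
]

# Hypothetical lookup table describing each billing plan code.
billing_plans = {
    "P10": {"description": "Basic", "monthly_base": 29.99, "per_minute": 0.10},
}

def snapshot(rows, plans, suffix="_01"):
    """Build one suffixed snapshot row per telephone number."""
    result = {}
    for row in rows:
        fields = {k: v for k, v in row.items() if k != "phone"}
        # Attach plan attributes when the code is known; unknown codes add nothing.
        plan = plans.get(row["billing_plan"], {})
        fields.update({"plan_" + k: v for k, v in plan.items()})
        result[row["phone"]] = {k + suffix: v for k, v in fields.items()}
    return result

snap = snapshot(unit_master_june, billing_plans)
```

The plan code stays in the row alongside its description, so the code can be used for modeling while the description remains available for reading the data.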
Pivoting Columns
Some of the fields in UNIT_MASTER represent data that is reported in a regular time series. For instance, bill amount has a value for every month, and each of these values needs to be put into a separate column. These columns come from different UNIT_MASTER records, one for June, one for May, one for April, and so on. Using a naming convention, the fields would be, for example:
■■ Last_billed_amount_01 for June (which may already be in the snapshot)
■■ Last_billed_amount_02 for May
■■ Last_billed_amount_03 for April
At this point, the customer signature is starting to take shape. Although the input fields only come from one source, the appropriate fields have been chosen as input and aligned in time.
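The pivot itself can be sketched in a few lines: monthly records arrive one row per telephone number per month, and each month's value is placed in its own suffixed column. The record layout, the amounts, and the year are hypothetical; only the column naming follows the convention above.

```python
# Hypothetical monthly UNIT_MASTER records: (phone, (year, month), billed amount).
monthly_records = [
    ("555-0101", (2003, 6), 42.50),
    ("555-0101", (2003, 5), 39.00),
    ("555-0101", (2003, 4), 41.25),
    ("555-0104", (2003, 6), 20.00),
    ("555-0104", (2003, 7), 99.00),  # after the cutoff, so it must be ignored
]

# Which calendar month each suffix refers to (the year is an assumption).
suffix_for_month = {(2003, 6): "_01", (2003, 5): "_02", (2003, 4): "_03"}

def pivot(records, suffix_for_month, field="Last_billed_amount"):
    """Turn one-row-per-month records into one row per phone with a column per month."""
    columns = [field + s for s in sorted(suffix_for_month.values())]
    wide = {}
    for phone, month, amount in records:
        suffix = suffix_for_month.get(month)
        if suffix is None:
            continue  # month falls outside the input window
        # Start every row with all columns present; missing months stay None.
        row = wide.setdefault(phone, dict.fromkeys(columns))
        row[field + suffix] = amount
    return wide

wide = pivot(monthly_records, suffix_for_month)
```

Creating every column up front means a customer with a missing month still has the full set of fields, with None marking the gap.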
Calculating the Target
A customer signature for predictive modeling would not be useful without a target variable. Since the customer signature is going to be used for churn modeling, the target needs to be whether or not the customer churned in August. This is in the account status field for the August UNIT_MASTER record. Note that only customers who were active on or before June 30 are included in the model set. A customer who starts in July and cancels in August is not included.
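A minimal sketch of the target calculation follows. The year, the activation dates, and the status codes ("ACTIVE", "CANCELLED") are hypothetical. The sketch applies only the start-date rule stated above, and it assumes that a telephone number with no August record is simply left out.

```python
from datetime import date

CUTOFF = date(2003, 6, 30)  # the year is an assumption; the text gives only June 30

# Hypothetical activation dates and August account status per telephone number.
activation = {
    "555-0101": date(2002, 1, 15),
    "555-0102": date(2003, 7, 10),  # started after the cutoff
    "555-0103": date(2003, 6, 30),  # started on the cutoff date itself
}
august_status = {
    "555-0101": "ACTIVE",
    "555-0102": "CANCELLED",
    "555-0103": "CANCELLED",
}

def churn_targets(activation, august_status, cutoff=CUTOFF):
    """Return phone -> 1 (churned in August) or 0, for customers started by the cutoff."""
    targets = {}
    for phone, start in activation.items():
        if start > cutoff:
            continue  # started after the cutoff: not part of the model set
        status = august_status.get(phone)
        if status is None:
            continue  # assumption: no August record means the row is skipped
        targets[phone] = 1 if status == "CANCELLED" else 0
    return targets

targets = churn_targets(activation, august_status)
print(targets)
```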