Figure 17.17 Comparing the amount paid as a multiple of the minimum payment shows distinct curves for transactors, revolvers, and convenience users.
First, let’s define the ideal convenience user. This is someone who, twice a year, charges up to his or her credit line and then pays the balance off over 4 months. There are few, if any, additional charges during the other 10 months of the year. Table 17.7 illustrates the monthly balances for two convenience users as a ratio of their credit lines.
This table also illustrates one of the main challenges in the definition of convenience users. The values describing their behavior have no relationship to each other in any given month. They are out of phase. In fact, there is a fundamental difference between convenience users on the one hand and transactors and revolvers on the other. Knowing that someone is a transactor exactly describes their behavior in any given month—they pay off the balance. Knowing that someone is a convenience user is less helpful. In any given month, they may be paying nothing, paying off everything, or making a partial payment.
Table 17.7 Monthly Balances of Two Convenience Users Expressed as a Percentage of Their Credit Lines
        JAN   FEB   MAR   APR   MAY   JUN   JUL   AUG   SEP   NOV   DEC
Conv1   80%   60%   40%   20%    0%    0%    0%   60%   30%   15%   70%
Conv2    0%    0%   83%   50%   17%    0%   67%   50%   17%    0%    0%
Does this mean that it is not possible to develop a measure to identify convenience users? Not at all. The solution is to sort the 12 months of data by the balance ratio and to create the convenience-user measure using the sorted data.
Figure 17.18 illustrates this process. It shows the two convenience users, along with the profile of the ideal convenience user. Here, the data is sorted, with the largest values occurring first. For the first convenience user, month 1 refers to January. For the second, it refers to March.
Now, taking the area between the ideal curve and the actual curve, as before, produces a score that measures how close a customer is to the ideal convenience user.
Notice that revolvers would have outstanding balances near the maximum for all months. They would have high scores, indicating that they are far from the ideal convenience user. For convenience users, the scores are much smaller.
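A rough sketch of this scoring in Python follows (numpy is assumed; the ideal profile and the sample customers below are illustrative choices, not values from the case study):

```python
import numpy as np

# One possible ideal convenience-user profile (an assumption, not from the
# case study): two charge-ups a year, each paid down over four months,
# expressed as balance / credit line and sorted from highest to lowest.
IDEAL_CONVENIENCE = np.sort(
    [1.00, 0.75, 0.50, 0.25, 1.00, 0.75, 0.50, 0.25, 0.0, 0.0, 0.0, 0.0]
)[::-1]

def convenience_score(monthly_balance_ratios):
    """Score 12 monthly balance/credit-line ratios against the ideal.

    The ratios are sorted from highest to lowest, and the score is the
    area between the sorted curve and the ideal curve. Smaller scores
    mean the customer looks more like the ideal convenience user.
    """
    sorted_ratios = np.sort(np.asarray(monthly_balance_ratios, dtype=float))[::-1]
    return float(np.abs(sorted_ratios - IDEAL_CONVENIENCE).sum())

# Hypothetical customers: one convenience user and one revolver
conv = [0.80, 0.60, 0.40, 0.20, 0, 0, 0, 0.60, 0.30, 0.15, 0.0, 0.0]
revolver = [0.95] * 12
print(convenience_score(conv))      # relatively small
print(convenience_score(revolver))  # large: far from the ideal
```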
This case study has shown several different ways of segmenting customers.
All make use of derived variables to describe customer behavior. Often, it is possible to describe a particular behavior and then to create a score that measures how each customer’s behavior compares to the ideal.
[Figure: ratio of balance to credit line (0% to 100%) plotted by month, sorted from highest balance to lowest, with curves for the ideal transactor, the ideal convenience user, Convenience 1, and Convenience 2.]
Figure 17.18 Comparison of two convenience users to the ideal, by sorting the months by the balance ratio.
The Dark Side of Data
Working with data is a critical part of the data mining process. What does the data mean? There are many ways to answer this question—through written documents, in database schemas, in file layouts, through metadata systems, and, not least, via the database administrators and systems analysts who know what is really going on. No matter how good the documentation, the real story lies in the data.
There is a misconception that data mining requires perfect data. In the world of business analysis, the perfect is definitely the enemy of the sufficiently good. For one thing, exploring data and building models highlights data issues that are otherwise unknown. Starting the process with available data may not result in the best models, but it does start a process that can improve over time. For another thing, waiting for perfect data is often a way of delaying a project so that nothing gets done.
This section covers some of the important issues that make working with data a sometimes painful process.
Missing Values
Missing values refer to data that should be there but is not. In many cases, missing values are represented as NULLs in the data source, making it easy to identify them. However, be careful: NULL is sometimes an acceptable value. In this case, we say that the value is empty rather than missing, although the two look the same in source data. For instance, the stop code of an account might be NULL, indicating that the account is still active. This information, which indicates whether data is censored or not, is critical for survival analysis.
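As a small sketch of making such an empty value explicit rather than leaving an ambiguous NULL, the following derives a censoring flag for survival analysis (pandas is assumed; the table and column names are hypothetical):

```python
import pandas as pd

# Hypothetical accounts table: stop_code is NULL for accounts that are
# still active, so the NULL is "empty" rather than "missing".
accounts = pd.DataFrame({
    "account_id": [101, 102, 103],
    "stop_code":  [None, "V", None],   # None here means the account is open
})

# Make the meaning explicit: a censoring flag derived from the empty value
accounts["is_censored"] = accounts["stop_code"].isna()
print(accounts)
```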
Another time when NULL is an acceptable value is when working with overlay data describing demographics and other characteristics of customers and prospects. In this case, NULL often has one of two meanings:
■■ There is not enough evidence to indicate whether the field is true for the individual. For instance, lack of subscriptions to golfing magazines suggests the person is not a golfer, but does not prove it.
■■ There is no matching record for the individual in the overlay data.
TIP When working with overlay data, it is useful to replace NULLs with alternative values, one meaning that the record does not match and the other meaning that the value is unknown.
It is worth distinguishing between these situations. One way is to separate the data where the records do not match, creating two different model sets.
The other is to replace the NULL values with alternative values, indicating whether the failure to match is at the record level or the field level.
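A sketch of this second approach for overlay data follows (pandas is assumed; the field names and the replacement codes are hypothetical):

```python
import pandas as pd

# Hypothetical data: a customer list and a demographic overlay file
customers = pd.DataFrame({"customer_id": [1, 2, 3]})
overlay = pd.DataFrame({"customer_id": [1, 2],
                        "is_golfer": ["Y", None]})  # None: matched, value unknown

enriched = customers.merge(overlay, on="customer_id", how="left", indicator=True)

# Replace NULLs with two different values: one for "matched but unknown",
# one for "no matching overlay record at all".
no_match = enriched["_merge"] == "left_only"
enriched["is_golfer"] = enriched["is_golfer"].fillna("UNKNOWN")
enriched.loc[no_match, "is_golfer"] = "NO_MATCH"
enriched = enriched.drop(columns="_merge")
print(enriched)
```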
Because customer signatures use so much aggregated data, they often contain “0” for various features. So, missing data in the customer signatures is not the most significant issue for the algorithms. However, this can be taken too far. Consider a customer signature that has 12 months of billing data. Customers who started in the past 12 months have missing data for the earlier months. In this case, replacing the missing data with some arbitrary value is not a good idea. The best thing is to split the model set into two pieces—those with 12 months of tenure and those who are more recent.
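A sketch of that tenure-based split (pandas is assumed; the signature columns shown are hypothetical, with only two of the twelve billing columns included):

```python
import pandas as pd

# Hypothetical customer signature: tenure plus monthly billing columns
signature = pd.DataFrame({
    "customer_id":   [1, 2, 3],
    "tenure_months": [36, 8, 14],
    "bill_m1":  [50.0, None, 70.0],   # bill 12 months ago; None = not yet a customer
    "bill_m12": [65.0, 40.0, 80.0],   # most recent bill
})

# Split the model set instead of filling the gaps with an arbitrary value:
# customers with a full year of history keep all the billing columns;
# more recent customers keep only the columns they can actually have.
mature = signature[signature["tenure_months"] >= 12]
recent = signature[signature["tenure_months"] < 12].drop(columns=["bill_m1"])
```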
When missing data is a problem, it is important to find its cause. For instance, one database we encountered had missing data for customers’ start dates. With further investigation, it turned out that these were all customers who had started and ended their relationship prior to March 1999. Subsequent use of this data source focused on either customers who started after this date or who were active on this date. In another case, a transaction table was missing a particular type of transaction before a certain date. During the creation of the data warehouse, different transactions were implemented at different times. Only carefully looking at crosstabulations of transaction types by time made it clear that one type was implemented much later than the rest.
In another case, the missing data in a data warehouse was just that—missing because the data warehouse had failed to load it properly. When there is such a clear cause, the database should be fixed, especially since misleading data is worse than no data at all.
One approach to dealing with missing data is to try to fill in the values—for example, with the average value or the most common value. Either of these substitutions changes the distribution of the variable and may lead to poor models. A more clever variation of this approach is to try to calculate the value based on other fields, using a technique such as regression or neural networks.
We discourage such an approach as well, unless absolutely necessary, since the field no longer means what it is supposed to mean.
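A quick illustration of how mean-filling changes a distribution (numpy is assumed; the numbers and the 30 percent missing rate are made up):

```python
import numpy as np

# Made-up data: a variable with a known spread, 30% of it knocked out
rng = np.random.default_rng(0)
values = rng.normal(loc=100.0, scale=20.0, size=1000)
values[rng.random(1000) < 0.3] = np.nan

# Fill the holes with the average of the remaining values
filled = np.where(np.isnan(values), np.nanmean(values), values)

print(np.nanstd(values))  # spread of the values that really exist (about 20)
print(np.std(filled))     # noticeably smaller after mean-filling
```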
WARNING One of the worst ways to handle missing values is to replace them with some “special” value such as 9999 or –1 that is supposed to stick out due to its unreasonableness. Data mining algorithms will happily use these values as if they were real, leading to incorrect results.
Usually data is missing for systematic reasons, as in the new customers scenario mentioned earlier. A better approach is to split the model set into parts, eliminating the missing fields from one data set. Although one data set has more fields, neither will have missing values.
It is also important to understand whether the data is going to be missing in the future. Sometimes the right approach is to build models on records that have complete data (and hope that these records are sufficiently representative of all records) and to have someone fix the data sources, eliminating this headache in the future.