160
140
This histogram is for the month of
120
claim for a set of insurance claims.
100
80
This is an example of a typically
60
uniform distribution. That is, the
40
number of claims is roughly the
20
same for each month.
0
Jan
Feb
Mar
Apr
May
June
July
Aug
Sept
Oct
Nov
Dec
This histogram shows the number of
telephone calls made for different
durations.
This is an example of an
exponentially decreasing
Number of Calls
distribution.
1
4
7
10
13
16
19
22
25
28
31
34
37
40
43
46
49
52
55
58
Duration
(Minutes)
250
200
150
This histogram shows a normal
distribution with a mean of 50 and a
Count 100
standard deviation of 10. Notice that
high and low values are very rare.
50
0
0
5
10
15
20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100
Value
Figure 17.2 Histograms show the distribution of data values.
470643 c17.qxd 3/8/04 11:29 AM Page 544
544 Chapter 17
The distribution of the values provides important insights into the data. It shows which values are common and which are less common. Just looking at the distribution of values brings up questions—such as why an amount is negative or why some categorical values are not present. Although statisticians tend to be more concerned with distributions than data miners, it is still important to look at variable values. Here, we illustrate some special cases of distributions that are important for data mining purposes, as well as the special case of variables synonymous with the target.
Columns with One Value
The most degenerate distribution is a column that has only one value. Unary-valued columns, as they are more formally known, do not contain any information that helps to distinguish between different rows. Because they lack any information content, they should be ignored for data mining purposes.
Having only one value is sometimes a property of the data. It is not uncommon, for instance, for a database to have fields defined in the database that are not yet populated. The fields are only placeholders for future values, so all the values are uniformly something such as “null” or “no” or “0.”
Before throwing out unary variables, check that NULLs are being counted as values. Appended demographic variables sometimes have only a single value or NULL when the value is not known. For instance, if the data provider knows that someone is interested in golf—say because the person subscribes to a golfing magazine or belongs to a country club—then the “golf-enthusiast”
flag would be set to “Y.”When there is no evidence, many providers set the flag to NULL—meaning unknown—rather than “N.”
T I P When a variable has only one value, be sure (1) that NULL is being included in the count of the number of values and (2) that other values were not inadvertently left out when selecting rows.
Unary-valued columns also arise when the data mining effort is focused on a subset of customers, and the field used to filter the records is retained in the resulting table. The fields that define this subset may all contain the same value. If we are building a model to predict the loss-ratio (an insurance measure) for automobile customers in New Jersey, then the state field will always have “NJ” filled in. This field has no information content for the sample being used, so it should be ignored for modeling purposes.
Columns with Almost Only One Value
In “almost-unary” columns, almost all the records have the same value for that column. There may be a few outliers, but there are very few. For example, retail
470643 c17.qxd 3/8/04 11:29 AM Page 545
Preparing Data for Mining 545
data may summarize all the purchases made by each customer in each department. Very few customers may make a purchase from the automotive department of a grocery store or the tobacco department of a department store. So, almost all customers will have a $0 for total purchases from these departments.
Purchased data often comes in an “almost-unary” format, as well. Fields such as “people who collect porcelain dolls” or “amount spent on greens fees”
will have a null or $0 value for all but very few people. Or, some data, such as survey data, is only available for a very small subset of the customers. These are all extreme examples of data skew, shown in Figure 17.3.
The big question with “almost-unary” columns is, “When can they be ignored?” To justify ignoring them, the values must have two characteristics.
First, almost all the records must have the same value. Second, there must be so few records with a different value, that they constitute a negligible portion of the data.
What is a negligible portion of the data? It is a group so small that even if the data mining algorithms identified it perfectly, the group would be too small to be significant.
10,000
9988
This chart sho
This char
ws an almost-unar
t sho
y column.
ws an almost-unary column. The
The
9,000
column was created b
column w
y binning telephone call
as created by binning telephone call
durations into 10 equal-width bins
dur
.
8,000
ations into 10 equal-width bins.
7,000
Almost all values
Almost all v
, 9988 out of 9995, are in the
alues, 9,988 out of 9,995, are in the
first bin.
first bin.
6,000
If var
If v iab
ar
le width bins had been chosen, then the
iable width bins had been chosen, then the
5,000
resulting column would ha
resulting column w
v
ould ha e been more useful.
ve been more useful.
Count
4,000
3,000
2,000
1,000
0
1
0
0
0
0
1
4
1
0
[0,639.6]
[639.6,1279.2]
[2558.4,3198]
[3198,3837.6]
[5756.4,6396]
[1279.2,1918.8]
[1918.8,2558.4]
[3837.6,4477.2]
[4477.2,5116.8]
[5116.8,5756.4]
Binned Duration
Figure 17.3 An almost-unary field, such as the bins produced by equal-width bins in this case, is useless for data mining purposes
470643 c17.qxd 3/8/04 11:29 AM Page 546
546 Chapter 17
Before ignoring a column, though, it is important to understand why the values are so heavily skewed. What does this column tell us about the business?
Perhaps few people ever buy automotive products because only a handful of the stores in question even sell them. Identifying customers as “automotive-product-buyers,” in this case, may not be useful.
In other cases, an event might be rare for other reasons. The number of people who cancel their telephone service on any given day is negligible, but over time the numbers accumulate. So the cancellations need to be accumulated over a longer time period, such as a month, quarter, or year. Or, the number of people who collect porcelain dolls may be very rare in itself, but when combined with other fields, this might suggest an important segment of collectors.
The rule of thumb is that, even if a column proves to be very informative, it is unlikely to be useful for data mining if it is almost-unary. That is, fully understanding the rows with different values does not yield actionable results.
As a general rule of thumb, if 95 to 99 percent of the values in the column are identical, the column—in isolation—is likely to be useless without some work.
For instance, if the column in question represents the target variable for a model, then stratified sampling can create a sample where the rare values are more highly populated. Another approach is to combine several such columns for creating derived variables that might prove to be valuable. As an example, some census fields are sparsely populated, such as those for particular occupations. However, combining some of these fields into a single field—such as
“high status occupation”—can prove useful for modeling purposes.
Columns with Unique Values
At the other extreme are categorical columns that take on a different value for every single row—or almost every row. These columns identify each customer uniquely (or close enough), for example:
■■
Customer name
■■
Address
■■
Telephone number
■■
Customer ID
■■
Vehicle identification number
These columns are also not very helpful. Why? They do not have predictive value, because they uniquely identify each row. Such variables cause overfitting.
One caveat—which will be investigated later in this chapter. Sometimes these columns contain a wealth of information. Lurking inside telephone numbers and addresses is important geographical information. Customers’ first names give an indication of gender. Customer numbers may be sequentially assigned, telling us which customers are more recent—and hence show up as important
470643 c17.qxd 3/8/04 11:29 AM Page 547
Preparing Data for Mining 547
variables in decision trees. These are cases where the important features (such as geography and customer recency) should be extracted from the fields as derived variables. However, data mining algorithms are not yet powerful enough to extract such information from values; data miners need to do the extraction.
Columns Correlated with Target
When a column is too highly correlated with the target column, it can mean that the column is just a synonym. Here are two examples: