Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management – Page 132 – Library. Read online. Free books read online. Read books without registering

160

140

This histogram is for the month of

120

claim for a set of insurance claims.

100

This is an example of a typically

uniform distribution. That is, the

number of claims is roughly the

same for each month.

Jan

Feb

Mar

Apr

May

June

July

Aug

Sept

Oct

Nov

Dec

This histogram shows the number of

telephone calls made for different

durations.

This is an example of an

exponentially decreasing

Number of Calls

distribution.

Duration

(Minutes)

250

200

150

This histogram shows a normal

distribution with a mean of 50 and a

Count 100

standard deviation of 10. Notice that

high and low values are very rare.

20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100

Value

Figure 17.2 Histograms show the distribution of data values.

470643 c17.qxd 3/8/04 11:29 AM Page 544

544 Chapter 17

The distribution of the values provides important insights into the data. It shows which values are common and which are less common. Just looking at the distribution of values brings up questions—such as why an amount is negative or why some categorical values are not present. Although statisticians tend to be more concerned with distributions than data miners, it is still important to look at variable values. Here, we illustrate some special cases of distributions that are important for data mining purposes, as well as the special case of variables synonymous with the target.

Columns with One Value

The most degenerate distribution is a column that has only one value. Unary-valued columns, as they are more formally known, do not contain any information that helps to distinguish between different rows. Because they lack any information content, they should be ignored for data mining purposes.

Having only one value is sometimes a property of the data. It is not uncommon, for instance, for a database to have fields defined in the database that are not yet populated. The fields are only placeholders for future values, so all the values are uniformly something such as “null” or “no” or “0.”

Before throwing out unary variables, check that NULLs are being counted as values. Appended demographic variables sometimes have only a single value or NULL when the value is not known. For instance, if the data provider knows that someone is interested in golf—say because the person subscribes to a golfing magazine or belongs to a country club—then the “golf-enthusiast”

flag would be set to “Y.”When there is no evidence, many providers set the flag to NULL—meaning unknown—rather than “N.”

T I P When a variable has only one value, be sure (1) that NULL is being included in the count of the number of values and (2) that other values were not inadvertently left out when selecting rows.

Unary-valued columns also arise when the data mining effort is focused on a subset of customers, and the field used to filter the records is retained in the resulting table. The fields that define this subset may all contain the same value. If we are building a model to predict the loss-ratio (an insurance measure) for automobile customers in New Jersey, then the state field will always have “NJ” filled in. This field has no information content for the sample being used, so it should be ignored for modeling purposes.

Columns with Almost Only One Value

In “almost-unary” columns, almost all the records have the same value for that column. There may be a few outliers, but there are very few. For example, retail

470643 c17.qxd 3/8/04 11:29 AM Page 545

Preparing Data for Mining 545

data may summarize all the purchases made by each customer in each department. Very few customers may make a purchase from the automotive department of a grocery store or the tobacco department of a department store. So, almost all customers will have a $0 for total purchases from these departments.

Purchased data often comes in an “almost-unary” format, as well. Fields such as “people who collect porcelain dolls” or “amount spent on greens fees”

will have a null or $0 value for all but very few people. Or, some data, such as survey data, is only available for a very small subset of the customers. These are all extreme examples of data skew, shown in Figure 17.3.

The big question with “almost-unary” columns is, “When can they be ignored?” To justify ignoring them, the values must have two characteristics.

First, almost all the records must have the same value. Second, there must be so few records with a different value, that they constitute a negligible portion of the data.

What is a negligible portion of the data? It is a group so small that even if the data mining algorithms identified it perfectly, the group would be too small to be significant.

10,000

9988

This chart sho

This char

ws an almost-unar

t sho

y column.

ws an almost-unary column. The

The

9,000

column was created b

column w

y binning telephone call

as created by binning telephone call

durations into 10 equal-width bins

dur

8,000

ations into 10 equal-width bins.

7,000

Almost all values

Almost all v

, 9988 out of 9995, are in the

alues, 9,988 out of 9,995, are in the

first bin.

6,000

If var

If v iab

le width bins had been chosen, then the

iable width bins had been chosen, then the

5,000

resulting column would ha

resulting column w

ould ha e been more useful.

ve been more useful.

Count

4,000

3,000

2,000

1,000

[0,639.6]

[639.6,1279.2]

[2558.4,3198]

[3198,3837.6]

[5756.4,6396]

[1279.2,1918.8]

[1918.8,2558.4]

[3837.6,4477.2]

[4477.2,5116.8]

[5116.8,5756.4]

Binned Duration

Figure 17.3 An almost-unary field, such as the bins produced by equal-width bins in this case, is useless for data mining purposes

470643 c17.qxd 3/8/04 11:29 AM Page 546

546 Chapter 17

Before ignoring a column, though, it is important to understand why the values are so heavily skewed. What does this column tell us about the business?

Perhaps few people ever buy automotive products because only a handful of the stores in question even sell them. Identifying customers as “automotive-product-buyers,” in this case, may not be useful.

In other cases, an event might be rare for other reasons. The number of people who cancel their telephone service on any given day is negligible, but over time the numbers accumulate. So the cancellations need to be accumulated over a longer time period, such as a month, quarter, or year. Or, the number of people who collect porcelain dolls may be very rare in itself, but when combined with other fields, this might suggest an important segment of collectors.

The rule of thumb is that, even if a column proves to be very informative, it is unlikely to be useful for data mining if it is almost-unary. That is, fully understanding the rows with different values does not yield actionable results.

As a general rule of thumb, if 95 to 99 percent of the values in the column are identical, the column—in isolation—is likely to be useless without some work.

For instance, if the column in question represents the target variable for a model, then stratified sampling can create a sample where the rare values are more highly populated. Another approach is to combine several such columns for creating derived variables that might prove to be valuable. As an example, some census fields are sparsely populated, such as those for particular occupations. However, combining some of these fields into a single field—such as

“high status occupation”—can prove useful for modeling purposes.

Columns with Unique Values

At the other extreme are categorical columns that take on a different value for every single row—or almost every row. These columns identify each customer uniquely (or close enough), for example:

■■

Customer name

■■

Address

■■

Telephone number

■■

Customer ID

■■

Vehicle identification number

These columns are also not very helpful. Why? They do not have predictive value, because they uniquely identify each row. Such variables cause overfitting.

One caveat—which will be investigated later in this chapter. Sometimes these columns contain a wealth of information. Lurking inside telephone numbers and addresses is important geographical information. Customers’ first names give an indication of gender. Customer numbers may be sequentially assigned, telling us which customers are more recent—and hence show up as important

470643 c17.qxd 3/8/04 11:29 AM Page 547

Preparing Data for Mining 547

variables in decision trees. These are cases where the important features (such as geography and customer recency) should be extracted from the fields as derived variables. However, data mining algorithms are not yet powerful enough to extract such information from values; data miners need to do the extraction.

Columns Correlated with Target

When a column is too highly correlated with the target column, it can mean that the column is just a synonym. Here are two examples: