Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management – Page 62 – Library. Read online. Free books read online. Read books without registering

■■

Ratios (debt-to-income, price-to-earnings, and so on)

■■

Physical measurements (area of living space, temperature, and so on) The real estate appraisal example showed a good way to handle continuous features. When these features fall into a predefined range between a minimum value and a maximum value, the values can be scaled to be in a reasonable range, using a calculation such as:

mapped_value = 2 * (original_value – min) / (max – min + 1) – 1

470643 c07.qxd 3/8/04 11:37 AM Page 236

236 Chapter 7

This transformation (subtract the min, divide by the range, double and subtract 1) produces a value in the range from –1 to 1 that follows the same distribution as the original value. This works well in many cases, but there are some additional considerations.

The first is that the range a variable takes in the training set may be different from the range in the data being scored. Of course, we try to avoid this situation by ensuring that all variables values are represented in the training set.

However, this ideal situation is not always possible. Someone could build a new house in the neighborhood with 5,000 square feet of living space perhaps rendering the real estate appraisal network useless. There are several ways to approach this:

■■ Plan for a larger range. The range of living areas in the training set was from 714 square feet to 4185 square feet. Instead of using these values for the minimum and maximum value of the range, allow for some

growth, using, say, 500 and 5000 instead.

■■ Reject out-of-range values. Once we start extrapolating beyond the ranges of values in the training set, we have much less confidence in the results. Only use the network for predefined ranges of input values.

This is particularly important when using a network for controlling a manufacturing process; wildly incorrect results can lead to disasters.

■■

Peg values lower than the minimum to the minimum and higher than the maximum to the maximum. So, houses larger than 4,000 square feet would all be treated the same. This works well in many situations. However, we suspect that the price of a house is highly correlated with the living area. So, a house with 20 percent more living area than the maximum house size (all other things being equal) would cost about 20 percent more. In other situations, pegging the values can work quite well.

■■

Map the minimum value to –0.9 and the maximum value to 0.9 instead of –1 and 1.

■■

Or, most likely, don’t worry about it. It is important that most values are near 0; a few exceptions probably will not have a significant impact.

Figure 7.9 illustrates another problem that arises with continuous features—

skewed distribution of values. In this data, almost all incomes are under $100,000, but the range goes from $10,000 to $1,000,000. Scaling the values as suggested maps a $30,000 income to –0.96 and a $65,000 income to –0.89, hardly any difference at all, although this income differential might be very significant for a marketing application. On the other hand, $250,000 and $800,000 become –0.51 and +0.60, respectively—a very large difference, though this income differential might be much less significant. The incomes are highly skewed toward the low end, and this can make it difficult for the neural network to take advantage of the income field. Skewed distributions

470643 c07.qxd 3/8/04 11:37 AM Page 237

Artificial Neural Networks 237

can prevent a network from effectively using an important field. Skewed distributions affect neural networks but not decision trees because neural networks actually use the values for calculations; decision trees only use the ordering (rank) of the values.

There are several ways to resolve this. The most common is to split a feature like income into ranges. This is called discretizing or binning the field. Figure 7.9

illustrates breaking the incomes into 10 equal-width ranges, but this is not useful at all. Virtually all the values fall in the first two ranges. Equal-sized quintiles provide a better choice of ranges: $10,000–$17,999 very low (–1.0)

$18,000–$31,999 low (–0.5)

$32,000–$63,999 middle (0.0)

$64,000–$99,999 high (+0.5)

$100,000 and above very high (+1.0)

Information is being lost by this transformation. A household with an income of $65,000 now looks exactly like a household with an income of $98,000. On the other hand, the sheer magnitude of the larger values does not confuse the neural network.

There are other methods as well. For instance, taking the logarithm is a good way of handling values that have wide ranges. Another approach is to standardize the variable, by subtracting the mean and dividing by the standard deviation. The standardized value is going to very often be between –2 and +2

(that is, for most variables, almost all values fall within two standard deviations of the mean). Standardizing variables is often a good approach for neural networks. However, it must be used with care, since big outliers make the standard deviation big. So, when there are big outliers, many of the standardized values will fall into a very small range, making it hard for the network to distinguish them from each other.

40,000

35,000

30,000

25,000

20,000

umber of people

15,000

10,000

5,000

region 1

region 2

region 3

region 4

region 5

region 6

region 7

region 8

region 9 region 10

$100,000 $200,000 $300,000 $400,000 $500,000 $600,000 $700,000 $800,000 $900,000 $1,000,000

income

Figure 7.9 Household income provides an example of a skewed distribution. Almost all the values are in the first 10 percent of the range (income of less than $100,000).

470643 c07.qxd 3/8/04 11:37 AM Page 238

238 Chapter 7

Features with Ordered, Discrete (Integer) Values

Continuous features can be binned into ordered, discrete values. Other examples of features with ordered values include:

■■

Counts (number of children, number of items purchased, months since sale, and so on)

■■

Age

■■

Ordered categories (low, medium, high)

Like the continuous features, these have a maximum and minimum value.

For instance, age usually ranges from 0 to about 100, but the exact range may depend on the data used. The number of children may go from 0 to 4, with anything over 4 considered to be 4. Preparing such fields is simple. First, count the number of different values and assign each a proportional fraction in some range, say from 0 to 1. For instance, if there are five distinct values, then these get mapped to 0, 0.25, 0.50, 0.75, and 1, as shown in Figure 7.10. Notice that mapping the values onto the unit interval like this preserves the ordering; this is an important aspect of this method and means that information is not being lost.

It is also possible to break a range into unequal parts. One example is called thermometer codes:

0 → 0 0 0 0 = 0/16 = 0.0000

1 → 1 0 0 0 = 8/16 = 0.5000

2 → 1 1 0 0 = 12/16 = 0.7500

3 → 1 1 1 0 = 14/16 = 0.8750

Number of Children

-1.0

-0.8

-0.6

-0.4

-0.2

0.0

0.2

0.4

0.6

0.8

1.0

4 or more

1 child

2 children

3 children

children

Figure 7.10 When codes have an inherent order, they can be mapped onto the unit interval.

470643 c07.qxd 3/8/04 11:37 AM Page 239

Artificial Neural Networks 239

The name arises because the sequence of 1s starts on one side and rises to some value, like the mercury in a thermometer; this sequence is then interpreted as a decimal written in binary notation. Thermometer codes are good for things like academic grades and bond ratings, where the difference on one end of the scale is less significant than differences on the other end.

For instance, for many marketing applications, having no children is quite different from having one child. However, the difference between three children and four is rather negligible. Using a thermometer code, the number of children variable might be mapped as follows: 0 (for 0 children), 0.5 (for one child), 0.75 (for two children), 0.875 (for three children), and so on. For categorical variables, it is often easier to keep mapped values in the range from 0

to 1. This is reasonable. However, to extend the range from –1 to 1, double the value and subtract 1.

Thermometer codes are one way of including prior information into the coding scheme. They keep certain codes values close together because you have a sense that these code values should be close. This type of knowledge can improve the results from a neural network—don’t make it discover what you already know. Feel free to map values onto the unit interval so that codes close to each other match your intuitive notions of how close they should be.

Features with Categorical Values

Features with categories are unordered lists of values. These are different from ordered lists, because there is no ordering to preserve and introducing an order is inappropriate. There are typically many examples of categorical values in data, such as: