Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management

■■ Gender, marital status
■■ Status codes
■■ Product codes
■■ Zip codes

Although zip codes look like numbers in the United States, they really represent discrete geographic areas, and the codes themselves give little geographic information. There is no reason to think that 10014 is more like 02116 than it is like 94117, even though the numbers are much closer. The numbers are just discrete names attached to geographical areas.

There are three fundamentally different ways of handling categorical features.

The first is to treat the codes as discrete, ordered values, mapping them using the methods discussed in the previous section. Unfortunately, the neural network does not understand that the codes are unordered. So, five codes for marital status (“single,” “divorced,” “married,” “widowed,” and “unknown”) would be mapped to –1.0, –0.5, 0.0, +0.5, +1.0, respectively. From the perspective of the network, “single” and “unknown” are very far apart, whereas “divorced” and “married” are quite close. For some input fields, this implicit ordering might not have much of an effect. In other cases, the values have some relationship to each other and the implicit ordering confuses the network.

WARNING: When working with categorical variables in neural networks, be very careful when mapping the variables to numbers. The mapping introduces an ordering of the values, which the neural network takes into account, even if the ordering does not make any sense.
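For illustration, here is a minimal Python sketch (not from the book, with an illustrative function name) of the evenly spaced mapping described above; it reproduces the marital-status example and makes the artificial ordering easy to see.

```python
# A minimal sketch, not from the book: map category codes to evenly spaced
# values between -1 and +1. The marital-status codes match the example above.

def ordered_code_mapping(codes):
    """Assign evenly spaced values from -1.0 to +1.0 in the order given."""
    n = len(codes)
    return {code: -1.0 + 2.0 * i / (n - 1) for i, code in enumerate(codes)}

marital_status = ["single", "divorced", "married", "widowed", "unknown"]
ordered_code_mapping(marital_status)
# {'single': -1.0, 'divorced': -0.5, 'married': 0.0, 'widowed': 0.5, 'unknown': 1.0}
# The mapping silently puts "single" and "unknown" at opposite ends, which is
# exactly the artificial ordering the warning above cautions against.
```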

The second way of handling categorical features is to break the categories into flags, one for each value. Assume that there are three values for gender (male, female, and unknown). Table 7.3 shows how three flags can be used to code these values using a method called 1 of N Coding. It is possible to reduce the number of flags by eliminating the flag for the unknown gender; this approach is called 1 of N – 1 Coding.

Why would we want to do this? We have now multiplied the number of input variables and this is generally a bad thing for a neural network. However, these coding schemes are the only way to eliminate implicit ordering among the values.
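As a concrete illustration, here is a minimal Python sketch (not from the book) of both coding schemes, using the same –1.0/+1.0 flag values as Table 7.3; the function names are illustrative.

```python
# A minimal sketch, not from the book: 1 of N and 1 of N - 1 coding with the
# same -1.0/+1.0 flag values as Table 7.3. Function names are illustrative.

def one_of_n(value, categories):
    """One flag per category: +1.0 for the matching category, -1.0 for the rest."""
    return [1.0 if value == category else -1.0 for category in categories]

def one_of_n_minus_1(value, categories, dropped="unknown"):
    """Like 1 of N, but the flag for one category (here 'unknown') is dropped."""
    return [1.0 if value == category else -1.0
            for category in categories if category != dropped]

genders = ["male", "female", "unknown"]
one_of_n("female", genders)           # [-1.0, 1.0, -1.0]
one_of_n_minus_1("unknown", genders)  # [-1.0, -1.0]
```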

The third way is to replace the code itself with numerical data about the code. Instead of including zip codes in a model, for instance, include various census fields, such as the median income or the proportion of households with children. Another possibility is to include historical information summarized at the level of the categorical variable. An example would be including the historical churn rate by zip code for a model that is predicting churn.

TIP: When using categorical variables in a neural network, try to replace them with some numeric variable that describes them, such as the average income in a census block, the proportion of customers in a zip code (penetration), the historical churn rate for a handset, or the base cost of a pricing plan.
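A minimal sketch of this idea follows, assuming pandas and hypothetical column names; the churn rate is summarized by zip code and substituted for the zip code itself.

```python
# A minimal sketch, assuming pandas and hypothetical column names: replace the
# zip code with the historical churn rate for that zip code. In practice the
# rate should be computed from an earlier period than the one being predicted.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "zipcode": ["10014", "02116", "94117", "10014"],
    "churned": [0, 1, 0, 1],
})

# Summarize history at the level of the categorical variable ...
churn_by_zip = (customers.groupby("zipcode")["churned"].mean()
                .rename("zip_churn_rate").reset_index())

# ... then feed the summary, not the code itself, into the network.
model_input = customers.merge(churn_by_zip, on="zipcode").drop(columns="zipcode")
```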

Table 7.3  Handling Gender Using 1 of N Coding and 1 of N – 1 Coding

           1 OF N CODING                            1 OF N – 1 CODING
GENDER     MALE FLAG   FEMALE FLAG   UNKNOWN FLAG   MALE FLAG   FEMALE FLAG
Male       +1.0        -1.0          -1.0           +1.0        -1.0
Female     -1.0        +1.0          -1.0           -1.0        +1.0
Unknown    -1.0        -1.0          +1.0           -1.0        -1.0


Other Types of Features

Some input features might not fit directly into any of these three categories. For complicated features, it is necessary to extract meaningful information and use one of the above techniques to represent the result. Remember, the inputs to a neural network should generally have values that fall between –1 and 1.

Dates are a good example of data that you may want to handle in special ways. Any date or time can be represented as the number of days or seconds since a fixed point in time, allowing them to be mapped and fed directly into the network. However, if the date is for a transaction, then the day of the week and month of the year may be more important than the actual date. For instance, the month would be important for detecting seasonal trends in data. You might want to extract this information from the date and feed it into the network instead of, or in addition to, the actual date.
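A minimal sketch of this kind of date handling follows; the reference date and the ten-year scaling window are arbitrary assumptions for illustration, not a prescription from the text.

```python
# A minimal sketch, not from the book: derive day-of-week and month features
# from a transaction date, along with days since an arbitrary reference point,
# each scaled to roughly the -1 to 1 range. The reference date and the ten-year
# window are assumptions for illustration.
from datetime import date

def date_features(d, reference=date(2000, 1, 1), window_days=3650):
    days_since = (d - reference).days
    return {
        "days_scaled": 2.0 * days_since / window_days - 1.0,  # dates within ~10 years of reference
        "day_of_week": 2.0 * d.weekday() / 6.0 - 1.0,         # Monday -> -1.0 ... Sunday -> +1.0
        "month": 2.0 * (d.month - 1) / 11.0 - 1.0,            # January -> -1.0 ... December -> +1.0
    }

date_features(date(2004, 3, 8))
```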

The address field—or any text field—is similarly complicated. Generally, addresses are useless to feed into a network, even if you could figure out a good way to map the entire field into a single value. However, the address may contain a zip code, city name, state, and apartment number. All of these may be useful features, even though the address field taken as a whole is usually useless.
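As an illustration, here is a minimal sketch of pulling one usable piece, a five-digit U.S. zip code, out of an address string; the regular expression is deliberately simple and the sample address is made up.

```python
# A minimal sketch, not from the book: pull a five-digit U.S. zip code out of a
# free-text address field. The pattern is deliberately simple and the sample
# address is made up.
import re

def extract_zip(address):
    match = re.search(r"\b(\d{5})(?:-\d{4})?\b", address)
    return match.group(1) if match else None

extract_zip("123 Main St, Apt 4B, Boston, MA 02116")  # '02116'
```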

Interpreting the Results

Neural network tools take the work out of interpreting the results. When estimating a continuous value, often the output needs to be scaled back to the correct range. For instance, the network might be used to calculate the value of a house and, in the training set, the output value is set up so that $103,000 maps to –1 and $250,000 maps to 1. If the model is later applied to another house and the output is 0.0, then we can figure out that this corresponds to $176,500, halfway between the minimum and the maximum values. This inverse transformation makes neural networks particularly easy to use for estimating continuous values. Often, though, this step is not necessary, particularly when the output layer is using a linear transfer function.
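A minimal sketch of this inverse transformation, using the house-price range from the example above:

```python
# A minimal sketch of the inverse transformation, using the house-price range
# from the example above ($103,000 maps to -1.0 and $250,000 maps to +1.0).

def unscale(output, low=103_000.0, high=250_000.0):
    """Map a network output in [-1, 1] back onto the original value range."""
    return low + (output + 1.0) / 2.0 * (high - low)

unscale(0.0)   # 176500.0, halfway between the minimum and the maximum
unscale(-1.0)  # 103000.0
```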

For binary or categorical output variables, the approach is still to take the inverse of the transformation used for training the network. So, if “churn” is given a value of 1 and “no-churn” a value of –1, then values near 1 represent churn, and those near –1 represent no churn. When there are two outcomes, the meaning of the output depends on the training set used to train the network. Because the network learns to minimize the error, the average value produced by the network during training is usually going to be close to the average value in the training set. One way to think of this is that the first pattern the network finds is the average value. So, if the original training set had 50 percent churn and 50 percent no-churn, then the average value the network will produce for the training set examples is going to be close to 0.0. Values higher than 0.0 are more like churn and those less than 0.0, less like churn.

If the original training set had 10 percent churn, then the cutoff would more reasonably be –0.8 rather than 0.0 (–0.8 is 10 percent of the way from –1 to 1). So, the output of the network does look a lot like a probability in this case. However, the probability depends on the distribution of the output variable in the training set.
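A minimal sketch of choosing the cutoff from the training-set class mix, matching the 50 percent and 10 percent examples above (the function names are illustrative):

```python
# A minimal sketch, not from the book: place the churn cutoff according to the
# fraction of churners in the training set, matching the examples above
# (50 percent churn -> 0.0, 10 percent churn -> -0.8).

def churn_cutoff(churn_fraction):
    """The cutoff sits churn_fraction of the way from -1 to +1."""
    return -1.0 + 2.0 * churn_fraction

def classify(output, churn_fraction):
    return "churn" if output >= churn_cutoff(churn_fraction) else "no-churn"

churn_cutoff(0.5)  # 0.0
churn_cutoff(0.1)  # -0.8
```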

Yet another approach is to assign a confidence level along with the value. This confidence level would treat the actual output of the network as a propensity to churn, as shown in Table 7.4.
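A minimal sketch of turning the raw output into a category and confidence level consistent with Table 7.4, where confidence grows linearly from 50 percent at an output of 0.0 to 100 percent at ±1.0:

```python
# A minimal sketch, not from the book: map a network output in [-1, 1] to a
# category and a confidence level consistent with Table 7.4.

def category_and_confidence(output):
    category = "A" if output < 0 else "B"
    confidence = 0.5 + 0.5 * abs(output)   # 0.0 -> 50%, +/-1.0 -> 100%
    return category, confidence

category_and_confidence(-0.6)   # ('A', 0.8)
category_and_confidence(+0.02)  # ('B', 0.51)
```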

For binary values, it is also possible to create a network that produces two outputs, one for each value. In this case, each output represents the strength of evidence that that category is the correct one. The chosen category would then be the one with the higher value, with confidence based on some function of the strengths of the two outputs. This approach is particularly valuable when the two outcomes are not exclusive.

TIP: Because neural networks produce continuous values, the output from a network can be difficult to interpret for categorical results (used in classification). The best way to calibrate the output is to run the network over a validation set, entirely separate from the training set, and to use the results from the validation set to calibrate the output of the network to categories. In many cases, the network can have a separate output for each category; that is, a propensity for that category. Even with separate outputs, the validation set is still needed to calibrate the outputs.
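One way to carry this calibration out is sketched below with NumPy and illustrative names: bin the network's output on the validation set, record the observed category rate in each bin, and use those rates when scoring new records. This is an assumption about how to implement the calibration, not the book's specific procedure.

```python
# A minimal sketch, assuming NumPy arrays: bin validation-set outputs and record
# how often each bin really was category B, then use those observed rates when
# scoring new records. Variable and function names are illustrative.
import numpy as np

def calibrate(val_outputs, val_is_b, n_bins=10):
    """Return bin edges plus the observed fraction of category B per bin."""
    edges = np.linspace(-1.0, 1.0, n_bins + 1)
    which_bin = np.digitize(val_outputs, edges[1:-1])
    rates = np.array([val_is_b[which_bin == i].mean() if np.any(which_bin == i) else np.nan
                      for i in range(n_bins)])
    return edges, rates

def calibrated_rate(output, edges, rates):
    """Look up the validation-set rate for the bin a new output falls into."""
    return rates[np.digitize(output, edges[1:-1])]
```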

Table 7.4  Categories and Confidence Levels for NN Output

OUTPUT VALUE   CATEGORY   CONFIDENCE
–1.0           A          100%
–0.6           A          80%
–0.02          A          51%
+0.02          B          51%
+0.6           B          80%
+1.0           B          100%


The approach is similar when there are more than two options under consideration. For example, consider a long distance carrier trying to target a new set of customers with three targeted service offerings:

■■ Discounts on all international calls
■■ Discounts on all long-distance calls that are not international
■■ Discounts on calls to a predefined set of customers

The carrier is going to offer incentives to customers for each of the three packages. Since the incentives are expensive, the carrier needs to choose the right service for the right customers in order for the campaign to be profitable.
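A minimal sketch of acting on such a network's outputs, assuming the model produces one propensity score per offer; the offer labels and the example scores are illustrative stand-ins, not the book's implementation.

```python
# A minimal sketch, not the book's implementation: pick the offer whose output
# shows the strongest evidence. The offer labels below are illustrative names
# for the three discounts described above.

OFFERS = ["international_discount", "domestic_long_distance_discount",
          "predefined_numbers_discount"]

def best_offer(outputs):
    """outputs: one network output per offer, e.g. [0.3, -0.1, 0.7]."""
    scores = dict(zip(OFFERS, outputs))
    chosen = max(scores, key=scores.get)
    return chosen, scores[chosen]

best_offer([0.3, -0.1, 0.7])  # ('predefined_numbers_discount', 0.7)
```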

