Table 7.2  A sample house record, with each feature scaled to the range –1 to 1

FEATURE             RANGE OF VALUES      ORIGINAL VALUE   SCALED VALUE
Sales_Price         $103,000–$250,000    $171,000         –0.0748
Months_Ago          0–23                 4                –0.6522
Num_Apartments      1–3                  1                –1.0000
Year_Built          1850–1986            1923             +0.0730
Plumbing_Fixtures   5–17                 9                –0.3077
Heating_Type        coded as A or B      B                +1.0000
Basement_Garage     0–2                  0                –1.0000
Attached_Garage     0–228                120              +0.0524
Living_Area         714–4,185            1,614            –0.4813
Deck_Area           0–738                0                –1.0000
Porch_Area          0–452                210              –0.0706
Recroom_Area        0–672                0                –1.0000
Basement_Area       0–810                175              –0.5672
This process of adjusting weights is sensitive to the representation of the data going in. For instance, consider a field in the data that measures lot size.
If lot size is measured in acres, the values might reasonably run from about 1⁄8 acre to 1 acre. Measured in square feet, the same values would run from 5,445 to 43,560 square feet. However, for technical reasons, neural networks work best when their inputs are small numbers, say between –1 and 1. When one input variable takes on very large values relative to the others, it dominates the calculation of the target, and the network wastes valuable iterations reducing the weights on this input to lessen its effect on the output. That is, the first “pattern” the network finds is that the lot size variable has much larger values than the other variables. Since this is not particularly interesting, it would be better to use lot size measured in acres rather than square feet.
This idea generalizes. The inputs to a neural network should usually be small numbers, confined to some small range such as –1 to 1. This requires mapping all the values, both continuous and categorical, prior to training the network.
One way to map continuous values is to turn them into fractions by subtracting the middle value of the range from the value, dividing the result by the size of the range, and multiplying by 2. For instance, to get a mapped value for
Year_Built (1923), subtract (1850 + 1986)/2 = 1918 (the middle value of the range) from 1923 (the year this house was built) to get 5. Dividing by the number of years in the range (1986 – 1850 + 1 = 137) and multiplying by 2 yields the scaled value of +0.0730. This basic procedure can be applied to any continuous feature to get a value between –1 and 1. One way to map categorical features is to assign fractions between –1 and 1 to each of the categories. The only categorical variable in this data is Heating_Type, so we can arbitrarily map B to 1 and A to
–1. If we had three values, we could assign one to –1, another to 0, and the third to 1, although this approach does have the drawback that the three heating types will seem to have an order. Type –1 will appear closer to type 0 than to type 1. Chapter 17 contains further discussion of ways to convert categorical variables to numeric variables without adding spurious information.
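Here is a minimal sketch of both mappings in Python (the function names are ours, and the range-size convention that counts both endpoints follows the Year_Built example above):

```python
def scale_continuous(value, low, high):
    """Map a continuous value into [-1, 1]: subtract the middle of the
    range, divide by the size of the range, and multiply by 2."""
    middle = (low + high) / 2.0
    size = high - low + 1   # counts both endpoints, as in the Year_Built example
    return 2.0 * (value - middle) / size

# Year_Built: 2 * (1923 - 1918) / 137 = +0.0730
print(round(scale_continuous(1923, 1850, 1986), 4))

# Categorical Heating_Type: arbitrarily map A to -1 and B to +1
heating_type = {"A": -1.0, "B": +1.0}
print(heating_type["B"])
```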
With these simple techniques, it is possible to map all the fields for the sample house record shown earlier (see Table 7.2) and train the network. Training is a process of iterating through the training set to adjust the weights. Each iteration is sometimes called a generation.
Once the network has been trained, the performance of each generation must be measured on the validation set. Typically, earlier generations of the network perform better on the validation set than the final network (which was optimized for the training set). This is due to overfitting (discussed in Chapter 3) and is a consequence of neural networks being so powerful. In fact, neural networks are an example of a universal approximator: any function can be approximated by an appropriately complex neural network. Neural networks and decision trees have this property; linear and logistic regression do not, since they assume particular shapes for the underlying function.
As with other modeling approaches, neural networks can learn patterns that exist only in the training set, resulting in overfitting. To find the best network for unseen data, the training process remembers each set of weights calculated during each generation. The final network comes from the generation that works best on the validation set, rather than the one that works best on the training set.
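As a sketch of this remember-the-best-generation idea, here is one way to express it in Python, using scikit-learn's MLPRegressor and synthetic data as stand-ins for the book's software and the house records (an assumption, not the book's implementation):

```python
import copy
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 5))        # inputs already scaled to [-1, 1]
y = X @ rng.uniform(size=5)                  # synthetic stand-in for sale price
X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

net = MLPRegressor(hidden_layer_sizes=(4,), solver="sgd", random_state=0)
best_err, best_weights = float("inf"), None
for generation in range(200):
    net.partial_fit(X_train, y_train)        # one pass over the training set
    err = mean_absolute_error(y_val, net.predict(X_val))
    if err < best_err:                       # remember the best generation so far
        best_err = err
        best_weights = copy.deepcopy((net.coefs_, net.intercepts_))

# The final network uses the weights from the generation that did best
# on the validation set, not the last one trained.
net.coefs_, net.intercepts_ = best_weights
```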
When the model’s performance on the validation set is satisfactory, the neural network model is ready for use. It has learned from the training examples and figured out how to calculate the sales price from all the inputs. The model takes descriptive information about a house, suitably mapped, and produces an output. There is one caveat. The output is itself a number between 0 and 1 (for a logistic activation function) or –1 and 1 (for the hyperbolic tangent), which needs to be remapped to the range of sale prices. For example, the value 0.75 could be multiplied by the size of the range ($147,000) and then added to the base number in the range ($103,000) to get an appraisal value of $213,250.
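In code, this remapping is just the inverse of the input scaling (a sketch; the range endpoints come from Table 7.2):

```python
def unscale_price(output, low, high):
    """Map a network output in [0, 1] back to the sale price range."""
    return low + output * (high - low)

# 0.75 * $147,000 + $103,000 = $213,250
print(unscale_price(0.75, 103_000, 250_000))
```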
Neural Networks for Directed Data Mining
The previous example illustrates the most common use of neural networks: building a model for classification or prediction. The steps in this process are:
1. Identify the input and output features.
2. Transform the inputs and outputs so they are in a small range (–1 to 1).
3. Set up a network with an appropriate topology.
4. Train the network on a representative set of training examples.
5. Use the validation set to choose the set of weights that minimizes the error.
6. Evaluate the network using the test set to see how well it performs.
7. Apply the model generated by the network to predict outcomes for unknown inputs.
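To make these steps concrete, here is a compact sketch of the whole process in scikit-learn (an assumption; any neural network package with a similar interface would do), with synthetic data standing in for real house records. The early_stopping option plays the role of step 5 by holding out a validation set and keeping the weights that score best on it:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error

# Step 1: input features X and output y (synthetic stand-ins here)
rng = np.random.default_rng(1)
X = rng.uniform(size=(500, 5))
y = X @ rng.uniform(size=5)

# Step 2: transform the inputs into a small range (-1 to 1)
X = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Step 3: a network with one hidden layer of four nodes
# Steps 4-5: train; early_stopping holds out a validation set and
# keeps the weights from the generation that scores best on it
net = MLPRegressor(hidden_layer_sizes=(4,), early_stopping=True,
                   validation_fraction=0.25, max_iter=2000, random_state=1)
net.fit(X_train, y_train)

# Step 6: evaluate on the test set, which played no part in training
print("test MAE:", mean_absolute_error(y_test, net.predict(X_test)))

# Step 7: apply the model to a new, suitably mapped record
new_record = rng.uniform(-1, 1, size=(1, 5))
print("prediction:", net.predict(new_record))
```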
Fortunately, data mining software now performs most of these steps automatically. Although an intimate knowledge of the internal workings is not necessary, there are some keys to using networks successfully. As with all predictive modeling tools, the most important issue is choosing the right training set. The second is representing the data in such a way as to maximize the ability of the network to recognize patterns in it. The third is interpreting the results from the network. Finally, understanding some specific details about how networks work, such as the topology and the parameters controlling training, can help produce better-performing networks.
One of the dangers with any model used for prediction or classification is that the model becomes stale as it gets older—and neural network models are no exception to this rule. For the appraisal example, the neural network has learned historical patterns that allow it to predict the appraised value from descriptions of houses, based on the contents of the training set. There is no guarantee that current market conditions match those of last week, last month, or 6 months ago—when the training set might have been made. New homes are bought and sold every day, creating and responding to market forces that are not present in the training set. A rise or drop in interest rates, or an increase in inflation, may rapidly change appraisal values. The problem of keeping a neural network model up to date is made more difficult by two factors. First, the model does not readily express itself in the form of rules, so it may not be obvious when it has grown stale. Second, when neural networks degrade, they tend to degrade gracefully, making the reduction in performance less obvious. In short, the model gradually expires, and it is not always clear exactly when to update it.
The solution is to incorporate more recent data into the neural network. One way is to take the same neural network back into training mode and feed it new values. This is a good approach when the network only needs to tweak its results, such as when it is already close to accurate but more recent examples might improve its accuracy further. Another approach is to start over: add new examples to the training set (perhaps removing older ones) and train an entirely new network, perhaps even with a different topology (network topologies are discussed later in this chapter). This is appropriate when market conditions have changed drastically and the patterns found in the original training set no longer apply.
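A hypothetical sketch of the two refresh strategies, again using scikit-learn with synthetic old and new sales records in place of real data:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
X_old, y_old = rng.uniform(-1, 1, (400, 5)), rng.uniform(size=400)  # older sales
X_new, y_new = rng.uniform(-1, 1, (100, 5)), rng.uniform(size=100)  # recent sales

net = MLPRegressor(hidden_layer_sizes=(4,), max_iter=2000, random_state=2)
net.fit(X_old, y_old)                  # the original, now-aging network

# Strategy 1: tweak the existing network by feeding it the new examples
for _ in range(20):
    net.partial_fit(X_new, y_new)

# Strategy 2: start over with a fresh network, perhaps a new topology,
# on a training set that adds new examples and drops the oldest ones
X_all = np.vstack([X_old[200:], X_new])
y_all = np.concatenate([y_old[200:], y_new])
new_net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=2)
new_net.fit(X_all, y_all)
```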
The virtuous cycle of data mining described in Chapter 2 puts a premium on measuring the results from data mining activities. These measurements help in understanding how susceptible a given model is to aging and when a neural network model should be retrained.
WARNING  A neural network is only as good as the training set used to generate it. The model is static and must be explicitly updated by adding more recent examples to the training set and retraining the network (or training a new network) in order to keep it up to date and useful.