Another decision is the size of the training set. The training set must be sufficiently large to cover the ranges of inputs available for each feature. In addition, you want several training examples for each weight in the network. For a network with s input units, h hidden units, and 1 output, there are h * (s + 1) + h + 1 weights in the network (each hidden unit has a weight for each connection to the input layer plus a weight for its bias, and the output unit has a weight for each connection to the hidden layer plus a weight for its bias). For instance, if there are 15 input features and 10 units in the hidden layer, then there are 171 weights in the network. There should be at least 30 examples for each weight, but a better minimum is 100. Using the larger figure, the training set for this example should have at least 17,100 rows.
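To make this arithmetic concrete, here is a minimal sketch of the calculation; the function name and the 30 and 100 multipliers simply restate the rules of thumb above.

```python
def count_weights(num_inputs, num_hidden, num_outputs=1):
    """Count the weights in a fully connected network with one hidden layer.

    Each hidden unit has one weight per input plus a bias weight;
    each output unit has one weight per hidden unit plus a bias weight.
    """
    return num_hidden * (num_inputs + 1) + num_outputs * (num_hidden + 1)

weights = count_weights(num_inputs=15, num_hidden=10)
print(weights)        # 171
print(30 * weights)   # 5,130 rows as a bare minimum
print(100 * weights)  # 17,100 rows as a better minimum
```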
Finally, the learning rate and momentum parameters are very important for getting good results out of a network using the back propagation training algorithm (it is better to use conjugate gradient or a similar approach). Initially, the learning rate should be set high to make large adjustments to the weights. As the training proceeds, the learning rate should decrease in order to fine-tune the network. The momentum parameter allows the network to move toward a solution more rapidly, preventing oscillation around less useful weights.
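As a rough sketch of how these two parameters work together in a weight update, consider the loop below; the decay schedule, parameter values, and the toy gradient are illustrative assumptions, not recommendations from the text.

```python
import numpy as np

# Toy quadratic error surface: the gradient points from the current
# weights toward a fixed target vector, standing in for the real
# back propagation gradient.
rng = np.random.default_rng(0)
target = rng.normal(size=171)

def gradient(weights):
    return weights - target

weights = np.zeros(171)
velocity = np.zeros_like(weights)
learning_rate = 0.5        # start high to make large adjustments
momentum = 0.9             # carry part of the previous update forward

for epoch in range(100):
    velocity = momentum * velocity - learning_rate * gradient(weights)
    weights = weights + velocity
    learning_rate *= 0.95  # decrease the rate to fine-tune the network
```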
Choosing the Training Set
The training set consists of records whose prediction or classification values are already known. Choosing a good training set is critical for all data mining modeling. A poor training set dooms the network, regardless of any other work that goes into creating it. Fortunately, there are only a few things to consider in choosing a good one.
Coverage of Values for All Features
The most important of these considerations is that the training set needs to cover the full range of values for all features that the network might encounter, including the output. In the real estate appraisal example, this means including inexpensive houses and expensive houses, big houses and little houses, and houses with and without garages. In general, it is a good idea to have several examples in the training set for each value of a categorical feature and for values throughout the ranges of ordered discrete and continuous features.
This is true regardless of whether the features are actually used as inputs into the network. For instance, lot size might not be chosen as an input variable in the network. However, the training set should still have examples from all different lot sizes. A network trained on smaller lot sizes (some of which might be low priced and some high priced) is probably not going to do a good job on mansions.
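In practice, a quick profile of the candidate training set shows whether this coverage exists. The sketch below uses pandas, with made-up file and column names, to summarize the range of each continuous feature and the counts for each categorical value.

```python
import pandas as pd

# Hypothetical file of candidate training examples for the appraisal model
houses = pd.read_csv("houses.csv")

# Continuous features: check that the full range of values is represented
print(houses[["sale_price", "square_footage", "lot_size"]].describe())

# Categorical features: check that every value appears several times
print(houses["has_garage"].value_counts())
print(houses["house_style"].value_counts())
```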
Number of Features
The number of input features affects neural networks in two ways. First, the more features used as inputs into the network, the larger the network needs to be, increasing the risk of overfitting and increasing the size of the training set required.
Second, the more features, the longer it takes the network to converge to a set of weights. And, with too many features, the weights are less likely to be optimal.
This variable selection problem is a familiar one to statisticians. In practice, we find that decision trees (discussed in Chapter 6) provide a good method for choosing the best variables. Figure 7.8 shows a nice feature of SAS Enterprise Miner. By connecting a neural network node to a decision tree node, the neural network uses only the variables chosen by the decision tree.
An alternative method is to use intuition. Start with a handful of variables that make sense. Experiment by trying other variables to see which ones improve the model. In many cases, it is useful to calculate new variables that represent particular aspects of the business problem. In the real estate example, for instance, we might subtract the square footage of the house from the lot size to get an idea of how large the yard is.
Figure 7.8 SAS Enterprise Miner provides a simple mechanism for choosing variables for a neural network—just connect a neural network node to a decision tree node.
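Outside of Enterprise Miner, the same idea can be approximated with any decision tree implementation that reports variable importances. The sketch below uses scikit-learn on synthetic data purely as an illustration; it is not how Enterprise Miner itself works, and the data sizes and column indices are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for the real estate data: 15 candidate inputs,
# only a few of which actually drive the target value.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 15))
y = 3 * X[:, 0] - 2 * X[:, 3] + X[:, 7] + rng.normal(scale=0.1, size=5000)

# Step 1: let a decision tree rank the candidate variables.
tree = DecisionTreeRegressor(max_depth=5).fit(X, y)
chosen = np.argsort(tree.feature_importances_)[::-1][:5]  # keep the top 5

# Step 2: train the neural network using only the chosen variables.
net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000)
net.fit(X[:, chosen], y)
```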
Size of Training Set
The more features there are in the network, the more training examples are needed to get good coverage of patterns in the data. Unfortunately, there is no simple rule to express the relationship between the number of features and the size of the training set. However, a minimum of a few hundred examples is typically needed to support each feature with adequate coverage; having several thousand is not unreasonable. The authors have worked with neural networks that have only six or seven inputs, but whose training set contained hundreds of thousands of rows.
When the training set is not sufficiently large, neural networks tend to overfit the data. Overfitting is guaranteed to happen when there are fewer training examples than there are weights in the network. This poses a problem, because the network will work very, very well on the training set, but it will fail spectacularly on unseen data.
Of course, the downside of a really large training set is that it takes the neural network longer to train. In a given amount of time, you may get better models by using fewer input features and a smaller training set, leaving time to experiment with different combinations of features and network topologies, than by using the largest possible training set and having no time for experimentation.
Number of Outputs
There are typically many more inputs going into the network than outputs coming out, so good coverage of the inputs generally results in good coverage of the outputs. However, it is very important that there be many examples for all possible output values from the network. In addition, the number of training examples for each possible output should be about the same. This can be critical when deciding what to use as the training set.
For instance, if the neural network is going to be used to detect rare, but important events—failures in diesel engines, fraudulent use of a credit card, or who will respond to an offer for a home equity line of credit—then the training set must have a sufficient number of examples of these rare events. A random sample of available data may not be sufficient, since common examples will swamp the rare examples. To get around this, the training set needs to be balanced by oversampling the rare cases. For this type of problem, a training set consisting of 10,000 “good” examples and 10,000 “bad” examples gives better results than a randomly selected training set of 100,000 good examples and 1,000 bad examples. After all, using the randomly sampled training set the neural network would probably assign “good” regardless of the input—and be right 99 percent of the time. This is an exception to the general rule that a larger training set is better.
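A hedged sketch of this kind of balancing with pandas follows; the file and column names are made up, and "oversampling the rare cases" here simply means letting them make up half of the training set rather than their natural one percent.

```python
import pandas as pd

# Hypothetical data in which response == 1 is the rare, important outcome
data = pd.read_csv("offers.csv")
rare = data[data["response"] == 1]
common = data[data["response"] == 0]

# Keep every rare example and an equal-sized sample of common ones,
# then shuffle, so the training set is balanced roughly 50/50.
training = pd.concat([
    rare,
    common.sample(n=len(rare), random_state=1),
]).sample(frac=1, random_state=1)
```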
T I P The training set for a neural network has to be large enough to cover all the values taken on by all the features. You want to have at least a dozen, if not hundreds or thousands, of examples for each input feature. For the outputs of the network, you want to be sure that there is an even distribution of values.
This is a case where fewer examples in the training set can actually improve results, by not swamping the network with “good” examples when you want to train it to recognize “bad” examples. The size of the training set is also influenced by the power of the machine running the model. A neural network needs more time to train when the training set is very large. That time could perhaps better be used to experiment with different features, input mapping functions, and parameters of the network.
Preparing the Data
Preparing the input data is often the most complicated part of using a neural network. Part of the complication is the normal problem of choosing the right data and the right examples for a data mining endeavor. Another part is mapping each field to an appropriate range; remember, using a limited range of inputs helps networks better recognize patterns. Some neural network packages facilitate this translation using friendly, graphical interfaces. Since the format of the data going into the network has a big effect on how well the network performs, this section reviews the common ways to map data.
Chapter 17 contains additional material on data preparation.
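As a simple illustration of mapping a continuous field onto a limited range, consider the sketch below; the target range of 0 to 1 and the minimum and maximum values are just assumptions for the example.

```python
def to_unit_range(values, minimum, maximum):
    """Map a continuous field linearly onto the range 0 to 1."""
    return [(v - minimum) / (maximum - minimum) for v in values]

# Sale prices with known minimum and maximum bounds
prices = [103_000, 250_000, 498_000]
print(to_unit_range(prices, minimum=100_000, maximum=500_000))
# [0.0075, 0.375, 0.995]
```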
Features with Continuous Values
Some features take on continuous values, generally ranging between known minimum and maximum bounds. Examples of such features are:
■■ Dollar amounts (sales price, monthly balance, weekly sales, income, and so on)
■■ Averages (average monthly balance, average sales volume, and so on)