In general, however, the user supplies interpretations. For example, in a credit risk model, it is likely that the ratio of debt to income is more predictive than the magnitude of either. With this knowledge we might add an interpretation that is the ratio of those two attributes. Often, user-supplied interpretations combine attributes in ways that the program would not come up with automatically. Examples include calculating a great-circle distance from changes in latitude and longitude or taking the product of three linear measurements to get a volume.
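The great-circle example is easy to make concrete. The sketch below is a minimal Python illustration using the haversine formula; the function name and sample coordinates are ours, not from any particular toolkit.

```python
from math import radians, sin, cos, asin, sqrt

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # mean Earth radius is about 6371 km

# A derived attribute the mining software would not invent on its own:
distance = great_circle_km(40.7128, -74.0060, 51.5074, -0.1278)  # NYC to London
```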
FROM ONE CASE TO THE NEXT
The central idea behind projective visualization is to use the historical cases to generate a set of rules for generating case n+1 from case n. When this model is applied to the final observed case, it generates a new projected case. To project more than one time step into the future, we continue to apply the model to the most recently created case. Naturally, confidence in the projected values decreases as the simulation is run for more and more time steps.
The figure illustrates the way a single attribute is projected using a decision tree based on the features generated from all the other attributes and interpretations in the previous case. During the training process, a separate decision tree is grown for each attribute. This entire forest is evaluated in order to move from one simulation step to the next.
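In outline, this step is easy to sketch in code. The following is a minimal illustration, assuming numeric attributes and using scikit-learn regression trees (our choices; the original system predates that library, and real attributes may be categorical): one tree per attribute, trained on consecutive pairs of snapshots.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_projection_model(snapshots):
    """Grow one tree per attribute, each predicting that attribute in
    case n+1 from all the attributes of case n.
    `snapshots` is a 2-D array with one row per time step."""
    X, Y = snapshots[:-1], snapshots[1:]          # case n -> case n+1
    return [DecisionTreeRegressor(max_depth=5).fit(X, Y[:, j])
            for j in range(snapshots.shape[1])]

def project(trees, case, steps):
    """Apply the entire forest repeatedly to roll the simulation forward."""
    path = [np.asarray(case, dtype=float)]
    for _ in range(steps):
        prev = path[-1].reshape(1, -1)
        path.append(np.array([t.predict(prev)[0] for t in trees]))
    return np.array(path[1:])                     # one projected case per step
```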
[Figure: a decision tree whose nodes test fields of the current snapshot, with Yes/No branches leading down to the projected field value. One snapshot uses decision trees to create the next snapshot in time.]
■■ The simulator could track the operation of the actual roaster and project it several minutes into the future. When the simulation ran into a problem, an alert could be generated while the operators still had time to avert trouble.
Evaluation of the Roaster Simulation
The simulation was built using a training set of 34,000 cases. It was then evaluated using a test set of around 40,000 additional cases that had not been part of the training set. For each case in the test set, the simulator generated projected snapshots 60 steps into the future, and at each step the projected values of all variables were compared against the actual values. As expected, the size of the error grows with the length of the projection. The error for product temperature, for example, turned out to be about 2/3°C per minute of projection, but even 30 minutes into the future the simulator did considerably better than random guessing.
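That evaluation procedure can be sketched as a simple loop. This is a hypothetical outline, not the original code: `simulate(start_case, steps)` stands for any single-step-ahead model applied repeatedly, such as the tree forest sketched earlier.

```python
import numpy as np

def error_by_horizon(simulate, test_snapshots, steps=60):
    """Mean absolute error of each attribute at each projection horizon.
    `simulate(start_case, steps)` returns an array of `steps` projected cases."""
    total = np.zeros((steps, test_snapshots.shape[1]))
    runs = 0
    for start in range(len(test_snapshots) - steps):
        projected = simulate(test_snapshots[start], steps)
        actual = test_snapshots[start + 1 : start + steps + 1]
        total += np.abs(projected - actual)
        runs += 1
    return total / runs   # one error-versus-horizon curve per attribute
```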
The roaster simulator turned out to be more accurate than all but the most experienced operators at projecting trends, and even the most experienced operators were able to do a better job with the aid of the simulator. Operators enjoyed using the simulator and reported that it gave them new insight into corrective actions.
Lessons Learned
Decision-tree methods have wide applicability for data exploration, classification, and scoring. They can also be used for estimating continuous values, although they are rarely the first choice, because decision trees generate "lumpy" estimates: all records reaching the same leaf are assigned the same estimated value. Decision trees are a good choice when the data mining task is classification of records or prediction of discrete outcomes. Use them when your goal is to assign each record to one of a few broad categories. Theoretically, decision trees can assign records to an arbitrary number of classes, but they become error-prone when the number of training examples per class gets small. This can happen rather quickly in a tree with many levels and/or many branches per node. In many business contexts, problems naturally resolve to a binary classification, such as responder/nonresponder or good/bad, so this is not a large problem in practice.
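The lumpiness is easy to demonstrate. In this small illustration (ours, using scikit-learn), a regression tree limited to eight leaves can produce at most eight distinct estimates, however many records it scores:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

x = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.sin(x).ravel()

tree = DecisionTreeRegressor(max_leaf_nodes=8).fit(x, y)
estimates = tree.predict(x)

# All records reaching the same leaf get the same estimated value:
print(len(np.unique(estimates)))   # prints at most 8
```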
Decision trees are also a natural choice when the goal is to generate understandable and explainable rules. The ability of decision trees to generate rules that can be translated into comprehensible natural language or SQL is one of the greatest strengths of the technique. Even in a complex decision tree, it is generally fairly easy to follow any one path through the tree to a particular leaf, so the explanation for any particular classification or prediction is relatively straightforward.
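As a concrete illustration of rule extraction (our example, using scikit-learn's `export_text` on the well-known iris data), each root-to-leaf path reads as a rule that translates line for line into natural language or a SQL predicate:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2).fit(iris.data, iris.target)

# Each printed root-to-leaf path is a human-readable rule; a test on,
# say, petal width followed by a class label maps directly to a SQL
# WHERE clause plus an assigned category.
print(export_text(tree, feature_names=list(iris.feature_names)))
```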
Decision trees require less data preparation than many other techniques because they are equally adept at handling continuous and categorical variables. Categorical variables, which pose problems for neural networks and statistical techniques, are split by forming groups of classes. Continuous variables are split by dividing their range of values. Because decision tree splits depend only on the rank order of numeric values, not their magnitudes, trees are not sensitive to outliers and skewed distributions. This robustness comes at the cost of throwing away some of the information that is available in the training data, so a well-tuned neural network or regression model will often make better use of the same fields than a decision tree. For that reason, decision trees are often used to pick a good set of variables to be used as inputs to another modeling technique.
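One simple way to act on that last point is to rank variables by a tree's split-based importance scores and pass the top few to the next model. This is a sketch under our own assumptions; the depth and cutoff are arbitrary choices:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def pick_inputs(X, y, feature_names, top_k=10):
    """Use a decision tree's importances to shortlist variables
    for a downstream model such as a neural network."""
    tree = DecisionTreeClassifier(max_depth=6, random_state=0).fit(X, y)
    ranked = np.argsort(tree.feature_importances_)[::-1]
    return [feature_names[i] for i in ranked[:top_k]]
```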
Time-oriented data, on the other hand, does require a lot of data preparation: time series data must be enhanced so that trends and sequential patterns become visible.
Decision trees reveal so much about the data to which they are applied that the authors make use of them in the early phases of nearly every data mining project, even when the final models are to be created using some other technique.
Chapter 7: Artificial Neural Networks
Artificial neural networks are popular because they have a proven track record in many data mining and decision-support applications. Neural networks (the "artificial" is usually dropped) are a class of powerful, general-purpose tools readily applied to prediction, classification, and clustering. They have been applied across a broad range of industries, from predicting time series in the financial world to diagnosing medical conditions, from identifying clusters of valuable customers to identifying fraudulent credit card transactions, from recognizing numbers written on checks to predicting the failure rates of engines.
The most powerful neural networks are, of course, the biological kind. The human brain makes it possible for people to generalize from experience; computers, on the other hand, usually excel at following explicit instructions over and over. The appeal of neural networks is that they bridge this gap by modeling, on a digital computer, the neural connections in human brains. When used in well-defined domains, their ability to generalize and learn from data mimics, in some sense, our own ability to learn from experience. This ability is useful for data mining, and it also makes neural networks an exciting area for research, promising new and better results in the future.
There is a drawback, though. The result of training a neural network is a set of internal weights distributed throughout the network. These weights provide no more insight into why the solution is valid than dissecting a human brain explains our thought processes. Perhaps one day, sophisticated techniques for probing neural networks will help provide some explanation. In the meantime, neural networks are best approached as black boxes whose internal workings are as mysterious as the workings of our brains. Like the responses of the Oracle at Delphi worshipped by the ancient Greeks, the answers produced by neural networks are often correct, and they have business value, which in many cases matters more than an explanation.
This chapter starts with a bit of history; the origins of neural networks grew out of actual attempts to model the human brain on computers. It then discusses an early case history of using this technique for real estate appraisal, before diving into technical details. Most of the chapter presents neural networks as predictive modeling tools. At the end, we see how they can be used for undirected data mining as well. A good place to begin is, as always, at the beginning, with a bit of history.
A Bit of History
Neural networks have an interesting history in the annals of computer science. The original work on the functioning of neurons (biological neurons) took place in the 1930s and 1940s, before digital computers really even existed. In 1943, Warren McCulloch, a neurophysiologist at Yale University, and Walter Pitts, a logician, postulated a simple model to explain how biological neurons work and published it in a paper called "A Logical Calculus of the Ideas Immanent in Nervous Activity." While their focus was on understanding the anatomy of the brain, it turned out that this model provided inspiration for the field of artificial intelligence and would eventually provide a new approach to solving certain problems outside the realm of neurobiology.
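The McCulloch-Pitts model itself is simple enough to state in a few lines: a unit fires (outputs 1) when the weighted sum of its binary inputs reaches a threshold. A minimal sketch, with weights and threshold chosen purely for illustration to realize logical AND:

```python
def mcculloch_pitts(inputs, weights, threshold):
    """Fire (1) if the weighted sum of binary inputs meets the threshold."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

# With unit weights and threshold 2, the unit computes logical AND:
assert mcculloch_pitts([1, 1], [1, 1], 2) == 1
assert mcculloch_pitts([1, 0], [1, 1], 2) == 0
```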
In the 1950s, when digital computers first became available, computer scientists implemented models called perceptrons based on the work of McCulloch and Pitts. An example of a problem solved by these early networks was how to balance a broom standing upright on a moving cart by controlling the motions of the cart back and forth. As the broom starts falling to the left, the cart learns to move to the left to keep it upright. Although there were some limited successes with perceptrons in the laboratory, the results were disappointing as a general method for solving problems.
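For flavor, here is a minimal sketch of the classic perceptron learning rule on a toy problem. Details such as the learning rate and threshold convention varied across implementations; this illustration is ours, not a reconstruction of the historical systems.

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=0.1):
    """Perceptron rule: nudge the weights whenever the thresholded
    output disagrees with the 0/1 target."""
    w = np.zeros(X.shape[1] + 1)                  # last entry is the bias
    Xb = np.hstack([X, np.ones((len(X), 1))])     # append bias input of 1
    for _ in range(epochs):
        for xi, target in zip(Xb, y):
            pred = 1 if xi @ w >= 0 else 0
            w += lr * (target - pred) * xi
    return w

# Linearly separable toy data: the perceptron learns logical OR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])
w = train_perceptron(X, y)
```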