In fact, hardly any of the data mining algorithms were first invented with commercial applications in mind. The commercial data miner employs a grab bag of techniques borrowed from statistics, computer science, and machine learning research. The choice of a particular combination of techniques to apply in a particular situation depends on the nature of the data mining task, the nature of the available data, and the skills and preferences of the data miner.
Data mining comes in two flavors—directed and undirected. Directed data mining attempts to explain or categorize some particular target field such as income or response. Undirected data mining attempts to find patterns or similarities among groups of records without the use of a particular target field or collection of predefined classes. Both these flavors are discussed in later chapters.
470643 c01.qxd 3/8/04 11:08 AM Page 8
8
Chapter 1
Data mining is largely concerned with building models. A model is simply an algorithm or set of rules that connects a collection of inputs (often in the form of fields in a corporate database) to a particular target or outcome.
Regression, neural networks, decision trees, and most of the other data mining techniques discussed in this book are techniques for creating models. Under the right circumstances, a model can result in insight by providing an explanation of how outcomes of particular interest, such as placing an order or failing to pay a bill, are related to and predicted by the available facts. Models are also used to produce scores. A score is a way of expressing the findings of a model in a single number. Scores can be used to sort a list of customers from most to least loyal, from most to least likely to respond, or from most to least likely to default on a loan.
The data mining process is sometimes referred to as knowledge discovery or KDD (knowledge discovery in databases). We prefer to think of it as knowledge creation.
What Tasks Can Be Performed with Data Mining?
Many problems of intellectual, economic, and business interest can be phrased in terms of the following six tasks:
■■ Classification
■■ Estimation
■■ Prediction
■■ Affinity grouping
■■ Clustering
■■ Description and profiling
The first three are all examples of directed data mining, where the goal is to find the value of a particular target variable. Affinity grouping and clustering are undirected tasks where the goal is to uncover structure in data without respect to a particular target variable. Profiling is a descriptive task that may be either directed or undirected.
Classification
Classification, one of the most common data mining tasks, seems to be a human imperative. In order to understand and communicate about the world, we are constantly classifying, categorizing, and grading. We divide living things into phyla, species, and genera; matter into elements; dogs into breeds; people into races; steaks and maple syrup into USDA grades.
Why and What Is Data Mining?
Classification consists of examining the features of a newly presented object and assigning it to one of a predefined set of classes. The objects to be classified are generally represented by records in a database table or a file, and the act of classification consists of adding a new column with a class code of some kind.
The classification task is characterized by a clear definition of the classes and a training set consisting of preclassified examples. The task is to build a model of some kind that can be applied to unclassified data in order to classify it.
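To make the idea concrete, here is a minimal sketch in Python of a classifier built from preclassified examples and then applied to unclassified records. It assigns each new record the class of its single nearest training example. The field names, the figures, and the choice of a one-nearest-neighbor rule are all assumptions made for illustration; they are not drawn from any case described in this book.

```python
from math import dist

# Hypothetical preclassified training set:
# (income in $1000s, debt ratio) -> risk class.
training_set = [
    ((85.0, 0.10), "low"),
    ((60.0, 0.25), "medium"),
    ((30.0, 0.60), "high"),
]

def classify(features, examples):
    """Return the class of the training example closest to `features`."""
    _, nearest_class = min(examples, key=lambda ex: dist(ex[0], features))
    return nearest_class

# Unclassified records; classification adds a new "risk" field to each one.
applicants = [
    {"id": 1, "income": 80.0, "debt_ratio": 0.15},
    {"id": 2, "income": 35.0, "debt_ratio": 0.55},
]
for record in applicants:
    record["risk"] = classify(
        (record["income"], record["debt_ratio"]), training_set
    )
```

Note that the inputs here are not rescaled, so income dominates the distance calculation. A realistic model would need to put the fields on comparable scales first.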
Examples of classification tasks that have been addressed using the techniques described in this book include:
■■ Classifying credit applicants as low, medium, or high risk
■■ Choosing content to be displayed on a Web page
■■ Determining which phone numbers correspond to fax machines
■■ Spotting fraudulent insurance claims
■■ Assigning industry codes and job designations on the basis of free-text job descriptions
In all of these examples, there are a limited number of classes, and we expect to be able to assign any record into one or another of them. Decision trees (discussed in Chapter 6) and nearest neighbor techniques (discussed in Chapter 8) are techniques well suited to classification. Neural networks (discussed in Chapter 7) and link analysis (discussed in Chapter 10) are also useful for classification in certain circumstances.
Estimation
Classification deals with discrete outcomes: yes or no; measles, rubella, or chicken pox. Estimation deals with continuously valued outcomes. Given some input data, estimation comes up with a value for some unknown continuous variable such as income, height, or credit card balance.
In practice, estimation is often used to perform a classification task. A credit card company wishing to sell advertising space in its billing envelopes to a ski boot manufacturer might build a classification model that puts all of its cardholders into one of two classes, skier or nonskier. Another approach is to build a model that assigns each cardholder a “propensity to ski score.” This might be a value from 0 to 1 indicating the estimated probability that the cardholder is a skier. The classification task now comes down to establishing a threshold score. Anyone with a score greater than or equal to the threshold is classed as a skier, and anyone with a lower score is considered not to be a skier.
The estimation approach has the great advantage that the individual records can be rank ordered according to the estimate. To see the importance of this, imagine that the ski boot company has budgeted for a mailing of 500,000 pieces. If the classification approach is used and 1.5 million skiers are identified, then it might simply place the ad in the bills of 500,000 people selected at random from that pool. If, on the other hand, each cardholder has a propensity to ski score, it can send the ad to the 500,000 most likely candidates.
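The difference between the two approaches can be sketched in a few lines of Python. The cardholder labels and scores below are invented for illustration, and the function names are our own; the point is only that a threshold yields an unordered class, while a ranking lets a fixed budget go to the best candidates.

```python
# Hypothetical propensity-to-ski scores (estimated probability of skiing).
scores = {"A": 0.91, "B": 0.35, "C": 0.78, "D": 0.62, "E": 0.08}

def classify_by_threshold(scores, threshold):
    """Class every cardholder scoring at or above the threshold as a skier."""
    return {holder for holder, score in scores.items() if score >= threshold}

def top_candidates(scores, budget):
    """Return the `budget` cardholders with the highest scores, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:budget]

# Thresholding identifies three skiers but says nothing about their order.
skiers = classify_by_threshold(scores, threshold=0.5)

# Ranking picks the two most likely skiers for a mailing budget of two.
mailing = top_candidates(scores, budget=2)
```

With these made-up scores, the threshold of 0.5 classes A, C, and D as skiers, while a budget of two pieces goes to A and C, the two highest scorers.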
Examples of estimation tasks include:
■■ Estimating the number of children in a family
■■ Estimating a family’s total household income
■■ Estimating the lifetime value of a customer
■■ Estimating the probability that someone will respond to a balance transfer solicitation
Regression models (discussed in Chapter 5) and neural networks (discussed in Chapter 7) are well suited to estimation tasks. Survival analysis (Chapter 12) is well suited to estimation tasks where the goal is to estimate the time to an event, such as a customer stopping.
Prediction
Prediction is the same as classification or estimation, except that the records are classified according to some predicted future behavior or estimated future value. In a prediction task, the only way to check the accuracy of the classification is to wait and see. The primary reason for treating prediction as a separate task from classification and estimation is that in predictive modeling there are additional issues regarding the temporal relationship of the input variables or predictors to the target variable.
Any of the techniques used for classification and estimation can be adapted for use in prediction by using training examples where the value of the variable to be predicted is already known, along with historical data for those examples. The historical data is used to build a model that explains the current observed behavior. When this model is applied to current inputs, the result is a prediction of future behavior.
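The following Python sketch illustrates that arrangement under simplifying assumptions. The customers, usage figures, and outcomes are hypothetical, and the "model" is a deliberately crude cutoff placed midway between the average behavior of customers who left and those who stayed. It stands in for whatever classification or estimation technique would really be used.

```python
# Hypothetical monthly usage history per customer, oldest month first,
# plus the outcome that is now known: whether the customer later left.
history = {
    "c1": {"usage": [50, 45, 10], "left": True},
    "c2": {"usage": [40, 42, 41], "left": False},
    "c3": {"usage": [60, 30, 5], "left": True},
    "c4": {"usage": [20, 25, 30], "left": False},
}

def usage_change(usage):
    """Input variable: change in usage from the first to the last month."""
    return usage[-1] - usage[0]

# Training examples pair inputs from the past with an already-known outcome.
training = [(usage_change(h["usage"]), h["left"]) for h in history.values()]

# Stand-in model: a cutoff midway between the two groups' average change.
leavers = [x for x, left in training if left]
stayers = [x for x, left in training if not left]
cutoff = (sum(leavers) / len(leavers) + sum(stayers) / len(stayers)) / 2

def predict_will_leave(current_usage):
    """Apply the model to current inputs to predict future behavior."""
    return usage_change(current_usage) < cutoff
```

The inputs used for training must all come from before the outcome was observed. When the model is later applied, the same kind of inputs are computed from current data, and the outcome is what remains to be seen.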
Examples of prediction tasks addressed by the data mining techniques discussed in this book include:
■■ Predicting the size of the balance that will be transferred if a credit card prospect accepts a balance transfer offer
■■ Predicting which customers will leave within the next 6 months
■■ Predicting which telephone subscribers will order a value-added service such as three-way calling or voice mail
Most of the data mining techniques discussed in this book are suitable for use in prediction so long as training data is available in the proper form. The choice of technique depends on the nature of the input data, the type of value to be predicted, and the importance attached to explicability of the prediction.
Affinity Grouping or Association Rules
The task of affinity grouping is to determine which things go together. The prototypical example is determining what things go together in a shopping cart at the supermarket, the task at the heart of market basket analysis. Retail chains can use affinity grouping to plan the arrangement of items on store shelves or in a catalog so that items often purchased together will be seen together.
Affinity grouping can also be used to identify cross-selling opportunities and to design attractive packages or groupings of products and services.
Affinity grouping is one simple approach to generating rules from data. If two items, say cat food and kitty litter, occur together frequently enough, we can generate two association rules:
■■ People who buy cat food also buy kitty litter with probability P1.