■■ People who buy kitty litter also buy cat food with probability P2.
Association rules are discussed in detail in Chapter 9.
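As a rough illustration of what a rule of this form asserts, the sketch below estimates the probability for the kitty litter rule from a handful of made-up market baskets. The basket data and the function name `rule_confidence` are assumptions introduced for this example. The estimate is simply the fraction of baskets containing the first item that also contain the second, which is one common reading of such a rule and not necessarily the exact measure used in Chapter 9.

```python
def rule_confidence(baskets, antecedent, consequent):
    """Estimate P(consequent in basket | antecedent in basket)."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return None  # no basket contains the antecedent, so nothing to estimate
    with_both = [b for b in with_antecedent if consequent in b]
    return len(with_both) / len(with_antecedent)


# Hypothetical market baskets, invented for this example.
baskets = [
    {"kitty litter", "cat food", "milk"},
    {"kitty litter", "cat food"},
    {"kitty litter", "bread"},
    {"cat food", "eggs"},
    {"milk", "bread"},
]

p = rule_confidence(baskets, "kitty litter", "cat food")
print(f"People who buy kitty litter also buy cat food with probability {p:.2f}")
```

With real data the same calculation would run over millions of baskets and many candidate item pairs, which is where the specialized algorithms come in.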
Clustering
Clustering is the task of segmenting a heterogeneous population into a number of more homogeneous subgroups or clusters. What distinguishes clustering from classification is that clustering does not rely on predefined classes. In classification, each record is assigned a predefined class on the basis of a model developed through training on preclassified examples.
In clustering, there are no predefined classes and no examples. The records are grouped together on the basis of self-similarity. It is up to the user to determine what meaning, if any, to attach to the resulting clusters. Clusters of symptoms might indicate different diseases. Clusters of customer attributes might indicate different market segments.
Clustering is often done as a prelude to some other form of data mining or modeling. For example, clustering might be the first step in a market segmentation effort: Instead of trying to come up with a one-size-fits-all rule for “what kind of promotion do customers respond to best,” first divide the customer base into clusters of people with similar buying habits, and then ask what kind of promotion works best for each cluster. Cluster detection is discussed in detail in Chapter 11. Chapter 7 discusses self-organizing maps, another technique sometimes used for clustering.
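To make the contrast with classification concrete, here is a minimal sketch of one well-known clustering method, k-means, written in plain Python. It is offered only as an illustration and is not claimed to be the algorithm presented in Chapter 11. The customer figures (store visits per month, average spend), the function names, and the simplistic choice of starting centers are all assumptions made for this example. Note that no class labels are supplied: the records are grouped purely by how close they are to one another.

```python
import math


def nearest_center(point, centers):
    """Index of the center closest to the point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def kmeans(points, k, iterations=20):
    """Alternate between assigning points to their nearest center and
    moving each center to the mean of the points assigned to it."""
    # Simplistic start (an assumption): take k evenly spaced input points.
    step = len(points) // k
    centers = [points[i * step] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest_center(p, centers)].append(p)
        # An empty group keeps its previous center.
        centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    labels = [nearest_center(p, centers) for p in points]
    return centers, labels


# Hypothetical customers: (store visits per month, average spend per visit).
customers = [(1, 20), (2, 25), (1, 30), (9, 200), (10, 220), (8, 210)]

centers, labels = kmeans(customers, k=2)
for customer, label in zip(customers, labels):
    print(customer, "-> cluster", label)
```

The output is only a cluster number for each customer. Deciding that one group looks like occasional shoppers and the other like frequent big spenders, and choosing a promotion for each, is still left to the user.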
470643 c01.qxd 3/8/04 11:08 AM Page 12
12
Chapter 1
Profiling
Sometimes the purpose of data mining is simply to describe what is going on in a complicated database in a way that increases our understanding of the people, products, or processes that produced the data in the first place. A good enough description of a behavior will often suggest an explanation for it as well.
At the very least, a good description suggests where to start looking for an explanation. The famous gender gap in American politics is an example of how a simple description, “women support Democrats in greater numbers than do men,” can provoke large amounts of interest and further study on the part of journalists, sociologists, economists, and political scientists, not to mention candidates for public office.
Decision trees (discussed in Chapter 6) are a powerful tool for profiling customers (or anything else) with respect to a particular target or outcome.
Association rules (discussed in Chapter 9) and clustering (discussed in Chapter 11) can also be used to build profiles.
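A very simple profile can be produced without any of these techniques by breaking a target outcome down by a single attribute, in the spirit of the gender gap description above. The sketch below does this for a made-up customer table. The field names `plan` and `churned`, the records themselves, and the function name `profile` are assumptions for illustration. A decision tree goes further by searching over many attributes and combinations of splits.

```python
from collections import defaultdict


def profile(records, attribute, target):
    """Share of records where the target is true, for each attribute value."""
    counts = defaultdict(lambda: [0, 0])  # value -> [target hits, total records]
    for r in records:
        if r[target]:
            counts[r[attribute]][0] += 1
        counts[r[attribute]][1] += 1
    return {value: hits / total for value, (hits, total) in counts.items()}


# Hypothetical customer records, invented for this example.
records = [
    {"plan": "prepaid", "churned": True},
    {"plan": "prepaid", "churned": True},
    {"plan": "prepaid", "churned": False},
    {"plan": "prepaid", "churned": False},
    {"plan": "contract", "churned": True},
    {"plan": "contract", "churned": False},
    {"plan": "contract", "churned": False},
    {"plan": "contract", "churned": False},
]

rates = profile(records, "plan", "churned")
for plan, rate in rates.items():
    print(f"{plan}: {rate:.0%} churned")
```

A description such as "prepaid customers leave at twice the rate of contract customers" does not explain the behavior, but it suggests where to start looking.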
Why Now?
Most of the data mining techniques described in this book have existed, at least as academic algorithms, for years or decades. However, it is only in the last decade that commercial data mining has caught on in a big way. This is due to the convergence of several factors:
■■ The data is being produced.
■■ The data is being warehoused.
■■ Computing power is affordable.
■■ Interest in customer relationship management is strong.
■■ Commercial data mining software products are readily available.
Let’s look at each factor in turn.
Data Is Being Produced
Data mining makes the most sense when there are large volumes of data. In fact, most data mining algorithms require large amounts of data in order to build and train the models that will then be used to perform classification, prediction, estimation, or other data mining tasks.
A few industries, including telecommunications and credit cards, have long had an automated, interactive relationship with customers that generated many transaction records, but it is only relatively recently that the automation of everyday life has become so pervasive. Today, the rise of supermarket point-of-sale scanners, automatic teller machines, credit and debit cards, pay-per-view television, online shopping, electronic funds transfer, automated order processing, electronic ticketing, and the like means that data is being produced and collected at unprecedented rates.
Data Is Being Warehoused
Not only is a large amount of data being produced, but also, more and more often, it is being extracted from the operational billing, reservations, claims processing, and order entry systems where it is generated and then fed into a data warehouse to become part of the corporate memory.
Data warehousing brings together data from many different sources in a common format with consistent definitions for keys and fields. It is generally not possible (and certainly not advisable) to perform computation- and input/output (I/O)-intensive data mining operations on an operational system that the business depends on to survive. In any case, operational systems store data in a format designed to optimize performance of the operational task. This format is generally not well suited to decision-support activities like data mining.
The data warehouse, on the other hand, should be designed exclusively for decision support, which can simplify the job of the data miner.
Computing Power Is Affordable
Data mining algorithms typically require multiple passes over huge quantities of data. Many are computationally intensive as well. The continuing dramatic decrease in prices for disk, memory, processing power, and I/O bandwidth has brought once-costly techniques that were used only in a few government-funded laboratories within the reach of ordinary businesses.
The successful introduction of parallel relational database management software by major suppliers such as Oracle, Teradata, and IBM has brought the power of parallel processing into many corporate data centers for the first time. These parallel database server platforms provide an excellent environment for large-scale data mining.
Interest in Customer Relationship Management Is Strong
Across a wide spectrum of industries, companies have come to realize that their customers are central to their business and that customer information is one of their key assets.
Every Business Is a Service Business
For companies in the service sector, information confers competitive advantage. That is why hotel chains record your preference for a nonsmoking room and car rental companies record your preferred type of car. In addition, companies that have not traditionally thought of themselves as service providers are beginning to think differently. Does an automobile dealer sell cars or transportation? If the latter, it makes sense for the dealership to offer you a loaner car whenever your own is in the shop, as many now do.
Even commodity products can be enhanced with service. A home heating oil company that monitors your usage and delivers oil when you need more sells a better product than a company that expects you to remember to call to arrange a delivery before your tank runs dry and the pipes freeze. Credit card companies, long-distance providers, airlines, and retailers of all kinds often compete as much on service as on price, if not more.
Information Is a Product
Many companies find that the information they have about their customers is valuable not only to themselves, but to others as well. A supermarket with a loyalty card program has something that the consumer packaged goods industry would love to have: knowledge about who is buying which products. A credit card company knows something that airlines would love to know: who is buying a lot of airplane tickets. Both the supermarket and the credit card company are in a position to be knowledge brokers or infomediaries. The supermarket can charge consumer packaged goods companies more to print coupons when it can promise higher redemption rates by printing the right coupons for the right shoppers. The credit card company can charge the airlines to target a frequent flyer promotion to people who travel a lot, but fly on other airlines.
Google knows what people are looking for on the Web. It takes advantage of this knowledge by selling sponsored links. Insurance companies pay to make sure that someone searching on “car insurance” will be offered a link to their site. Financial services pay for sponsored links to appear when someone searches on the phrase “mortgage refinance.”
In fact, any company that collects valuable data is in a position to become an information broker. The Cedar Rapids Gazette takes advantage of its dominant position in a 22-county area of Eastern Iowa to offer direct marketing services to local businesses. The paper uses its own obituary pages and wedding announcements to keep its marketing database current.
Why and What Is Data Mining?
Commercial Data Mining Software Products Have Become Available
There is always a lag between the time when new algorithms first appear in academic journals and excite discussion at conferences and the time when commercial software incorporating those algorithms becomes available. There is another lag between the initial availability of the first products and the time that they achieve wide acceptance. For data mining, the period of widespread availability and acceptance has arrived.
Many of the techniques discussed in this book started out in the fields of statistics, artificial intelligence, or machine learning. After a few years in universities and government labs, a new technique starts to be used by a few early adopters in the commercial sector. At this point in the evolution of a new technique, the software is typically available in source code to the intrepid user willing to retrieve it via FTP, compile it, and figure out how to use it by reading the author’s Ph.D. thesis. Only after a few pioneers become successful with a new technique does it start to appear in real products that come with user’s manuals and help lines.