Data Mining Techniques
For Marketing, Sales, and
Customer Relationship
Management
Michael J.A. Berry
Gordon S. Linoff
Acknowledgments
We are fortunate to be surrounded by some of the most talented data miners anywhere, so our first thanks go to our colleagues at Data Miners, Inc. from whom we have learned so much: Will Potts, Dorian Pyle, and Brij Masand.
There are also clients with whom we work so closely that we consider them our colleagues as well: Harrison Sohmer and Stuart E. Ward, III are in that category. Our Editor, Bob Elliott, Editorial Assistant, Erica Weinstein, and Development Editor, Emilie Herman, kept us (more or less) on schedule and helped us maintain a consistent style. Lauren McCann, a graduate student at M.I.T. and intern at Data Miners, prepared the census data used in some examples and created some of the illustrations.
We would also like to acknowledge all of the people we have worked with in scores of data mining engagements over the years. We have learned something from every one of them. The many whose data mining projects have influenced the second edition of this book include:
Al Fan
Herb Edelstein
Nick Gagliardo
Alan Parker
Jill Holtz
Nick Radcliffe
Anne Milley
Joan Forrester
Patrick Surry
Brian Guscott
John Wallace
Ronny Kohavi
Bruce Rylander
Josh Goff
Sheridan Young
Corina Cortes
Karen Kennedy
Susan Hunt Stevens
Daryl Berry
Kurt Thearling
Ted Browne
Daryl Pregibon
Lynne Brennen
Terri Kowalchuk
Doug Newell
Mark Smith
Victor Lo
Ed Freeman
Mateus Kehder
Yasmin Namini
Erin McCarthy
Michael Patrick
Zai Ying Huang
And, of course, all the people we thanked in the first edition are still deserving of acknowledgement:
Bob Flynn
Jim Flynn
Paul Berry
Bryan McNeely
Kamran Parsaye
Rakesh Agrawal
Claire Budden
Karen Stewart
Ric Amari
David Isaac
Larry Bookman
Rich Cohen
David Waltz
Larry Scroggins
Robert Groth
Dena d’Ebin
Lars Rohrberg
Robert Utzschnieder
Diana Lin
Lounette Dyer
Roland Pesch
Don Peppers
Marc Goodman
Stephen Smith
Ed Horton
Marc Reifeis
Sue Osterfelt
Edward Ewen
Marge Sherold
Susan Buchanan
Fred Chapman
Mario Bourgoin
Syamala Srinivasan
Gary Drescher
Prof. Michael Jordan
Wei-Xing Ho
Gregory Lampshire
Patsy Campbell
William Petefish
Janet Smith
Paul Becker
Yvonne McCollin
Jerry Modes
About the Authors
Michael J. A. Berry and Gordon S. Linoff are well known in the data mining field. They have jointly authored three influential and widely read books on data mining that have been translated into many languages. They each have close to two decades of experience applying data mining techniques to business problems in marketing and customer relationship management.
Michael and Gordon first worked together during the 1980s at Thinking Machines Corporation, which was a pioneer in mining large databases. In 1996, they collaborated on a data mining seminar, which soon evolved into the first edition of this book. The success of that collaboration gave them the courage to start Data Miners, Inc., a respected data mining consultancy, in 1998. As data mining consultants, they have worked with a wide variety of major companies in North America, Europe, and Asia, turning customer databases, call detail records, Web log entries, point-of-sale records, and billing files into useful information that can be used to improve the customer experience. The authors’ years of hands-on data mining experience are reflected in every chapter of this extensively updated and revised edition of their first book, Data Mining Techniques.
When not mining data at some distant client site, Michael lives in Cambridge, Massachusetts, and Gordon lives in New York City.
Introduction
The first edition of Data Mining Techniques for Marketing, Sales, and Customer Support appeared on bookshelves in 1997. The book actually got its start in 1996 as Gordon and I were developing a 1-day data mining seminar for NationsBank (now Bank of America). Sue Osterfelt, a vice president at NationsBank and the author of a book on database applications with Bill Inmon, convinced us that our seminar material ought to be developed into a book. She introduced us to Bob Elliott, her editor at John Wiley & Sons, and before we had time to think better of it, we signed a contract.
Neither of us had written a book before, and drafts of early chapters clearly showed this. Thanks to Bob’s help, though, we made a lot of progress, and the final product was a book we are still proud of. It is no exaggeration to say that the experience changed our lives — first by taking over every waking hour and some when we should have been sleeping; then, more positively, by providing the basis for the consulting company we founded, Data Miners, Inc.
The first book, which has become a standard text in data mining, was followed by two others: Mastering Data Mining and Mining the Web.
So, why a revised edition? The world of data mining has changed a lot since we started writing in 1996. For instance, back then, Amazon.com was still new; U.S. mobile phone calls cost on average 56 cents per minute, and fewer than 25 percent of Americans even owned a mobile phone; and the KDD data mining conference was in its second year. Our understanding has changed even more. For the most part, the underlying algorithms remain the same, although the software in which the algorithms are imbedded, the data to which they are applied, and the business problems they are used to solve have all grown and evolved.
Even if the technological and business worlds had stood still, we would have wanted to update Data Mining Techniques because we have learned so much in the intervening years. One of the joys of consulting is the constant exposure to new ideas, new problems, and new solutions. We may not be any smarter than when we wrote the first edition, but we do have more experience and that added experience has changed the way we approach the material. A glance at the Table of Contents may suggest that we have reduced the amount of business-related material and increased the amount of technical material. Instead, we have folded some of the business material into the technical chapters so that the data mining techniques are introduced in their business context. We hope this makes it easier for readers to see how to apply the techniques to their own business problems.
It has also come to our attention that a number of business school courses have used this book as a text. Although we did not write the book as a text, in the second edition we have tried to facilitate its use as one by using more examples based on publicly available data, such as the U.S. census, and by making some recommended reading and suggested exercises available at the companion Web site, www.data-miners.com/companion.
The book is still divided into three parts. The first part talks about the business context of data mining, starting with a chapter that introduces data mining and explains what it is used for and why. The second chapter introduces the virtuous cycle of data mining — the ongoing process by which data mining is used to turn data into information that leads to actions, which in turn create more data and more opportunities for learning. Chapter 3 is a much-expanded discussion of data mining methodology and best practices. This chapter benefits more than any other from our experience since writing the first book. The methodology introduced here is designed to build on the successful engagements we have been involved in. Chapter 4, which has no counterpart in the first edition, is about applications of data mining in marketing and customer relationship management, the fields where most of our own work has been done.
The second part consists of the technical chapters about the data mining techniques themselves. All of the techniques described in the first edition are still here although they are presented in a different order. The descriptions have been rewritten to make them clearer and more accurate while still retaining nontechnical language wherever possible.
In addition to the seven techniques covered in the first edition (decision trees, neural networks, memory-based reasoning, association rules, cluster detection, link analysis, and genetic algorithms), there is now a chapter on data mining using basic statistical techniques and another new chapter on survival analysis. Survival analysis is a technique that has been adapted from the small samples and continuous time measurements of the medical world to the large samples and discrete time measurements found in marketing data. The chapter on memory-based reasoning now also includes a discussion of collaborative filtering, another technique based on nearest neighbors that has become popular with Web retailers as a way of generating recommendations.
The third part of the book talks about applying the techniques in a business context, including a chapter on finding customers in data, one on the relationship of data mining and data warehousing, another on the data mining environment (both corporate and technical), and a final chapter on putting data mining to work in an organization. A new chapter in this part covers preparing data for data mining, an extremely important topic since most data miners report that transforming data takes up the majority of the time in a typical data mining project.