The scoring environment is often the most complex, because it requires transforming the data and running the models at the same time, preferably with a minimal amount of user interaction. Perhaps the best solution is data mining software that can both read from and write to relational databases, making it possible to use the database for scalable data manipulation and the data mining tool for efficient model building.
Support for Scoring
The ability to write to as well as read from a database is desirable when data mining is used to develop models used for scoring. The models may be developed using samples extracted from the master database, but once developed, the models will score every record in the database.
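The read-then-write cycle described above can be sketched in a few lines. This is a minimal illustration, not a real scoring system: the `customers` table, its columns, the made-up data, and the trivial per-segment response-rate "model" are all assumptions introduced for the example, with SQLite standing in for the master database.

```python
import random
import sqlite3

# Hypothetical schema and made-up data: a tiny in-memory "master database".
random.seed(0)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, segment TEXT, responded INTEGER, score REAL)"
)
rows = [
    (i, random.choice(["A", "B", "C"]), int(random.random() < 0.2), None)
    for i in range(1, 1001)
]
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", rows)

# 1. Read: extract a random sample for model development.
sample = conn.execute(
    "SELECT segment, responded FROM customers ORDER BY RANDOM() LIMIT 200"
).fetchall()

# 2. Model: a deliberately trivial stand-in, the response rate per segment.
totals, hits = {}, {}
for segment, responded in sample:
    totals[segment] = totals.get(segment, 0) + 1
    hits[segment] = hits.get(segment, 0) + responded
model = {seg: hits[seg] / totals[seg] for seg in totals}
overall = sum(hits.values()) / len(sample)  # fallback for unseen segments

# 3. Write: score every record in the master table, not just the sample.
segments = [r[0] for r in conn.execute("SELECT DISTINCT segment FROM customers")]
for segment in segments:
    conn.execute(
        "UPDATE customers SET score = ? WHERE segment = ?",
        (model.get(segment, overall), segment),
    )
conn.commit()

unscored = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE score IS NULL"
).fetchone()[0]
print("records left unscored:", unscored)
```

The point of the sketch is the shape of the workflow: the model is built from 200 sampled rows, but the final step appends a score to all 1,000.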
470643 c16.qxd 3/8/04 11:29 AM Page 535
Building the Data Mining Environment 535
The value of a response model decreases with time. Ideally, the results of one campaign should be analyzed in time to affect the next one. But in many organizations there is a long lag between the time a model is developed and the time it can be used to append scores to a database; sometimes the lag is measured in weeks or months. The delay is caused by the difficulty of moving the scoring model, which is often developed on a different computer from the database server, into a form that can be applied to the database. This might involve interpreting the output of a data mining tool and writing a computer program that embodies the rules that make up the model.
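One way to shorten that lag is to translate a rule-based model mechanically into something the database can run itself. The sketch below assumes a model that has already been exported as an ordered list of rules; the rules, the scores, the `customers` table, and the helper `rules_to_case` are hypothetical, and the conditions are assumed to come from a trusted source because they are pasted directly into SQL.

```python
import sqlite3

# Hypothetical rules exported from a decision tree: (SQL condition, score).
# Rules are checked in order; DEFAULT_SCORE applies when none matches.
RULES = [
    ("tenure_months < 6 AND purchases = 0", 0.02),
    ("tenure_months < 6", 0.10),
    ("purchases >= 5", 0.35),
]
DEFAULT_SCORE = 0.15


def rules_to_case(rules, default):
    """Render ordered rules as a single SQL CASE expression."""
    branches = " ".join(f"WHEN {cond} THEN {score}" for cond, score in rules)
    return f"CASE {branches} ELSE {default} END"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, tenure_months INTEGER, purchases INTEGER, score REAL)"
)
conn.executemany(
    "INSERT INTO customers (id, tenure_months, purchases) VALUES (?, ?, ?)",
    [(1, 3, 0), (2, 4, 2), (3, 24, 7), (4, 36, 1)],
)

# A single UPDATE scores the whole table inside the database.
conn.execute(f"UPDATE customers SET score = {rules_to_case(RULES, DEFAULT_SCORE)}")
scores = dict(conn.execute("SELECT id, score FROM customers"))
print(scores)
```

Because the generated expression is plain SQL, the scoring step runs where the data lives and no records need to leave the database server.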
The problem is even worse when the database is actually stored at a third facility, such as that of a list processor. The list processor is unlikely to accept a neural network model in the form of C source code as input to a list selection request. Building a unified model development and scoring framework requires significant integration effort, but if scoring large databases is an important application for your business, the effort will be repaid.
Multiple Levels of User Interfaces
In many organizations, several different communities of users use the data mining software. In order to accommodate their differing needs, the tool should provide several different user interfaces:
■■ A graphical user interface (GUI) for the casual user that has reasonable default values for data mining parameters.
■■ Advanced options for more skilled users.
■■ An ability to build models in batch mode (which could be provided by a command line interface).
■■ An application programming interface (API) so that predictive modeling can be built into applications.
The GUI for a data mining tool should not only make it easy for users to build models but also be designed to encourage best practices, such as ensuring that model assessment is performed on a hold-out set and that the target variables for predictive models come from a later timeframe than the inputs.
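Both practices are easy to state in code. The following is a minimal sketch under stated assumptions: the cutoff date, the field names, the 25 percent hold-out fraction, and the made-up purchase history are illustrative choices, not recommendations from the text.

```python
import random
from datetime import date

# Hypothetical cutoff: inputs come from before it, the target from after it.
CUTOFF = date(2004, 1, 1)


def build_example(customer_id, events):
    """events is a list of (date, amount) purchases.

    Inputs use only pre-cutoff events; the target records whether any
    purchase happened on or after the cutoff.
    """
    before = [amount for day, amount in events if day < CUTOFF]
    after = [amount for day, amount in events if day >= CUTOFF]
    return {
        "id": customer_id,
        "n_purchases": len(before),
        "total_spend": sum(before),
        "target": int(len(after) > 0),
    }


def split_holdout(examples, holdout_fraction=0.25, seed=42):
    """Shuffle a copy and set aside a hold-out set used only for assessment."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Made-up history: every customer bought in 2003; odd ids bought again in 2004.
history = {
    cid: [(date(2003, 6, 1), 10.0 * cid)]
    + ([(date(2004, 3, 1), 5.0)] if cid % 2 else [])
    for cid in range(1, 13)
}
examples = [build_example(cid, events) for cid, events in history.items()]
train, holdout = split_holdout(examples)
print(len(train), "training examples,", len(holdout), "held out")
```

Keeping the cutoff inside `build_example` means no post-cutoff information can leak into the inputs, which is the mistake the later-timeframe rule guards against.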
The user interface should include a help system, with context-sensitive help.
To improve the chance of success for casual users, the user interface should provide reasonable default values for such things as the minimum number of records needed to support a split in a decision tree or the number of nodes in the hidden layer of a neural network. On the other hand, the interface should make it easy for more knowledgeable users to change the defaults. Advanced users should be able to control every aspect of the underlying data mining algorithms.
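In an API, the same idea often appears as a settings object whose fields all carry defaults. The sketch below is hypothetical: the class names and the specific default values are placeholders chosen for illustration, not values suggested by the text or taken from any particular tool.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TreeSettings:
    # Placeholder defaults; a real tool would choose these with care.
    min_records_per_split: int = 50
    max_depth: int = 8


@dataclass(frozen=True)
class NeuralNetSettings:
    hidden_nodes: int = 10  # placeholder default


# A casual user accepts the defaults without naming any parameter.
casual = TreeSettings()

# A knowledgeable user overrides only what matters to them.
expert = replace(casual, min_records_per_split=200)

print(casual)
print(expert)
```

Making the settings immutable and deriving variants with `replace` keeps the defaults themselves from being changed by accident.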
536 Chapter 16
Comprehensible Output
Tools vary greatly in the extent to which they explain themselves. Rule generators, tree visualizers, Web diagrams, and association tables can all help.
Some vendors place great emphasis on the visual representation of both data and rules, providing three-dimensional data terrain maps, geographic information systems (GIS), and cluster diagrams to help make sense of complex relationships. The final destination of much data mining work is reports for management, and the power of graphics should not be underestimated for convincing non-technical users of data mining results. A data mining tool should make it easy to export results to commonly available reporting and analysis packages such as Excel and PowerPoint.
Ability to Handle Diverse Data Types
Many data mining software packages place restrictions on the kinds of data that can be analyzed. Before investing in a data mining software package, find out how it deals with the various data types you want to work with.
Some tools have difficulty using categorical variables (such as model, type, gender) as input variables and require the user to convert these into a series of yes/no variables, one for each possible class. Others can deal with categorical variables that take on a small number of values, but break down when faced with too many. On the target field side, some tools can handle a binary classification task (good/bad), but have difficulty predicting the value of a categorical variable that can take on several values.
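The conversion of a categorical variable into a series of yes/no variables is simple to sketch. The function name `one_hot` and the sample values are assumptions for illustration.

```python
def one_hot(values):
    """Convert a categorical column into yes/no (1/0) columns, one per class."""
    categories = sorted(set(values))
    rows = [[int(value == category) for category in categories] for value in values]
    return categories, rows


# Made-up vehicle types.
categories, rows = one_hot(["sedan", "truck", "sedan", "coupe"])
print(categories)
for row in rows:
    print(row)
```

The number of generated columns equals the number of distinct values, which is why this approach becomes unwieldy for variables that take on many values.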
Some data mining packages on the market require that continuous variables (income, mileage, balance) be split into ranges by the user. This is especially likely to be true of tools that generate association rules, since these require a certain number of occurrences of the same combination of values in order to recognize a rule.
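Splitting a continuous variable into ranges can be sketched as follows. The income boundaries and labels are hypothetical; in practice the choice of ranges is a modeling decision that this example does not address.

```python
from bisect import bisect_right

# Hypothetical range boundaries; each range includes its lower edge.
INCOME_EDGES = [25_000, 50_000, 100_000]
INCOME_LABELS = ["under 25k", "25k to 50k", "50k to 100k", "100k and over"]


def bin_value(value, edges, labels):
    """Return the label of the range that contains value."""
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")
    return labels[bisect_right(edges, value)]


for income in (18_000, 25_000, 72_500, 250_000):
    print(income, "->", bin_value(income, INCOME_EDGES, INCOME_LABELS))
```

After binning, many different incomes share the same label, so combinations of values recur often enough for an association rule generator to count them.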
Most data mining tools cannot deal with text, although such support is starting to appear. If the text strings in the data are standardized codes (state, part number), this is not really a problem, since character codes can easily be converted to numeric or categorical ones. If the application requires the ability to analyze free text, some of the more advanced data mining tool sets are starting to provide support for this capability.
Documentation and Ease of Use
A well-designed user interface should make it possible to start mining right away, even if mastery of the tool requires time and study. As with any complex software, good documentation can spell the difference between success and frustration. Before deciding on a tool, ask to look over the manual. It is very important that the product documentation fully describes the algorithms used, not just the operation of the tool. Your organization should not be basing decisions on techniques that are not understood. A data mining tool that relies on any sort of proprietary and undisclosed “secret sauce” is a poor choice.
Availability of Training for Both Novice and Advanced Users, Consulting, and Support
It is not easy to introduce unfamiliar data mining techniques into an organization. Before committing to a tool, find out what user training and applications consulting are available from the tool vendor or third parties.
If the vendor is small and geographically remote from your data mining locations, customer support may be problematic. The Internet has shrunk the planet so that every supplier is just a few keystrokes away, but it has not altered the human tendency to sleep at night and work in the day; time zones still matter.
Vendor Credibility
Unless you are already familiar with the vendor, it is a good idea to learn something about its track record and future prospects. Ask to speak to references who have used the vendor’s software and can substantiate the claims made in product brochures.
We are not saying that you should not buy software from a company just because it is new, small, or far away. Data mining is still at the leading edge of commercial decision-support technology. It is often small, start-up companies that first understand the importance of new techniques and successfully bring them to market. And paradoxically, smaller companies often provide better, more enthusiastic support, since the people answering questions are likely to be some of the people who designed and built the product.
Lessons Learned
The ideal data mining environment consists of a customer-centric corporate culture and all the resources to support it. Those resources include data, data miners, data mining infrastructure, and data mining software. In this ideal data mining environment, the need for good information is ingrained in the corporate culture, operational procedures are designed with the need to gather good data in mind, and the requirements for data mining shape the design of the corporate data warehouse.
Building the ideal environment is not easy. The hardest part of building a customer-centric organization is changing the culture, and how to accomplish that is beyond the scope of this book. From a purely data perspective, the first step is to create a single customer view that encompasses all the relationships the company has with a customer across all channels. The next step is to create customer-centric metrics that can be tracked, modeled, and reported.
Customer interactions should be turned into learning opportunities whenever possible. In particular, marketing communications should be set up as controlled experiments. The results of these experiments are input for data mining models used for targeting, cross-selling, and retention.
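Setting up a communication as a controlled experiment comes down to assigning customers at random to a treated group and a control group, then comparing response rates. The sketch below is illustrative only: the function names, the 10 percent control fraction, and the made-up set of responders are assumptions.

```python
import random


def assign_groups(customer_ids, control_fraction=0.1, seed=7):
    """Randomly split customers into (treated, control) sets."""
    ids = list(customer_ids)
    random.Random(seed).shuffle(ids)
    n_control = int(len(ids) * control_fraction)
    return set(ids[n_control:]), set(ids[:n_control])


def response_rate(group, responders):
    """Share of the group that responded; 0.0 for an empty group."""
    return len(group & responders) / len(group) if group else 0.0


treated, control = assign_groups(range(1000))

# Made-up outcome data standing in for real campaign results.
responders = {cid for cid in range(1000) if cid % 12 == 0}

treated_rate = response_rate(treated, responders)
control_rate = response_rate(control, responders)
print(f"treated {treated_rate:.3f}  control {control_rate:.3f}")
```

Because assignment is random, a difference between the two rates can be attributed to the communication itself, and the recorded outcomes become training data for later targeting models.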
There are several approaches to incorporating data mining into a company’s marketing and customer relationship management activities. Outsourcing is a possibility for companies with only occasional modeling needs. When there is an ongoing need for data mining, it is best done internally so that insights produced during mining remain within the company rather than with an outside vendor.