Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management

470643 c16.qxd 3/8/04 11:29 AM Page 531

Building the Data Mining Environment 531

or item to be studied. This means transformations that flatten product hierarchies so that, for example, the same transaction might generate one flag indicating that the customer bought French wine, another that he or she bought a wine from the Burgundy region, and a third indicating that the wine was from the Beaujolais district in Burgundy. Other data must be rolled up from order files, billing files, and session logs that contain multiple transactions per customer. Typical values derived this way include total spending by category, average order amount, difference between this customer’s average order and the mean average order, and the number of days since the customer last made a purchase.
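The rollups described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the product hierarchy, the order lines, the field names, and the choice of the top hierarchy level (country) as the spending category are all assumptions made up for the example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical product hierarchy: product -> (country, region, district).
HIERARCHY = {
    "beaujolais-villages": ("France", "Burgundy", "Beaujolais"),
    "chablis": ("France", "Burgundy", "Chablis"),
    "rioja": ("Spain", "Rioja", "Rioja Alta"),
}

# Hypothetical order lines: (customer_id, order_id, product, amount, order_date).
ORDERS = [
    ("c1", "o1", "beaujolais-villages", 30.0, date(2004, 1, 10)),
    ("c1", "o2", "chablis", 50.0, date(2004, 2, 20)),
    ("c2", "o3", "rioja", 20.0, date(2004, 1, 5)),
]


def build_signatures(orders, hierarchy, as_of):
    """Roll order lines up to one signature (a dict of fields) per customer."""
    signatures = defaultdict(dict)
    order_totals = defaultdict(dict)  # customer -> order_id -> order amount
    last_purchase = {}
    for cust, order_id, product, amount, day in orders:
        sig = signatures[cust]
        country, region, district = hierarchy[product]
        # Flatten the hierarchy: one flag per level for the same transaction.
        for level, value in (("country", country), ("region", region),
                             ("district", district)):
            sig[f"bought_{level}_{value}"] = 1
        # Total spending by category (here, the top hierarchy level).
        key = f"spend_{country}"
        sig[key] = sig.get(key, 0.0) + amount
        order_totals[cust][order_id] = order_totals[cust].get(order_id, 0.0) + amount
        last_purchase[cust] = max(day, last_purchase.get(cust, day))
    for cust, totals in order_totals.items():
        signatures[cust]["avg_order"] = sum(totals.values()) / len(totals)
        signatures[cust]["days_since_last_purchase"] = (as_of - last_purchase[cust]).days
    # Difference between each customer's average order and the mean average order.
    mean_avg = sum(s["avg_order"] for s in signatures.values()) / len(signatures)
    for sig in signatures.values():
        sig["avg_order_vs_mean"] = sig["avg_order"] - mean_avg
    return dict(signatures)


SIGNATURES = build_signatures(ORDERS, HIERARCHY, as_of=date(2004, 3, 8))
```

Each customer ends up with a single row of derived fields, which is the shape most modeling tools expect as input.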

Reporting is done from a multidimensional database that allows retrospective queries at various levels. Data mining and OLAP are both part of the analysis module, although they answer different kinds of questions. OLAP queries are used to answer questions such as these:

■■ What are the top-selling products?

■■ What are the worst-selling products?

■■ What are the top pages viewed?

■■ What are conversion rates by brand name?

■■ What are the top referring sites by visit count?

■■ What are the top referring sites by dollar sales?

■■ How many customers abandoned market baskets?
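Questions of this kind reduce to grouping and summing along one dimension. The sketch below uses SQLite from the Python standard library as a stand-in for a multidimensional database; the `sales` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

# In-memory stand-in for the reporting database (hypothetical schema and rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, referrer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("merlot", "search.example", 40.0),
        ("merlot", "news.example", 25.0),
        ("chablis", "search.example", 30.0),
        ("rioja", "news.example", 10.0),
    ],
)


def top_by(conn, dimension, limit=3):
    """Return (dimension value, dollar sales) pairs, largest first."""
    # Only whitelisted column names are interpolated into the SQL text.
    if dimension not in {"product", "referrer"}:
        raise ValueError(f"unknown dimension: {dimension}")
    sql = (
        f"SELECT {dimension}, SUM(amount) AS total FROM sales "
        f"GROUP BY {dimension} ORDER BY total DESC LIMIT ?"
    )
    return conn.execute(sql, (limit,)).fetchall()


top_products = top_by(conn, "product")    # top-selling products
top_referrers = top_by(conn, "referrer")  # top referring sites by dollar sales
```

Every answer here is a summary of what has already happened; nothing is predicted about an individual customer.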

Data mining is used to answer more complicated questions such as these:

■■ What are the characteristics of heavy spenders? Does this user fit the profile?

■■ What promotion should be offered to this customer?

■■ What is the likelihood that this customer will return within 1 month?

■■ What customers should we worry about because they haven’t visited the site recently?

■■ Which products are associated with customers who spend the most money?

■■ Which products are driving sales of which other products?

In Figure 16.2, the arrow labeled “build data warehouse” connects the customer interaction module to the analysis module and represents all the transformations that must occur before either data mining or reporting can be done properly. Two more arrows, labeled “deploy results,” show the output of the analysis module being shipped back to the business data definition and customer interaction modules. Yet another arrow, labeled “stage data,” shows how the business rules embedded in the business data definition module feed into the customer interaction module.


532 Chapter 16

What is appealing about this architecture is the way that it facilitates the virtuous cycle of data mining by allowing new knowledge discovered through data mining to be fed directly to the systems that interact with customers.

Data Mining Software

One of the ways that the data mining world has changed most since the first edition of this book came out is the maturity of data mining software products.

Robustness, usability, and scalability have all improved significantly. The one thing that may have decreased is the number of data mining software vendors as tiny boutique software firms have been pushed aside by larger, more established companies. As stated in the first edition, it is not reasonable to compare the merits of particular products in a book intended to remain useful beyond the shelf-life of the current versions of these products. Although the products are changing—and hopefully improving—over time, the criteria for evaluating them have not changed: Price, availability, scalability, support, vendor relationships, compatibility, and ease of integration all factor into the selection process.

Range of Techniques


As must be clear by now, there is no single data mining technique that is applicable in all situations. Neural networks, decision trees, market basket analysis, statistics, survival analysis, genetic algorithms, memory-based reasoning, link analysis, and automatic cluster detection all have a place. As shown in the case studies, it is not uncommon for two or more of these techniques to be applied in combination to achieve results beyond the reach of any single method.

Be sure that the software selected is powerful enough to support the organization’s data and goals. It is a good idea to have software a bit more advanced than the analysts’ abilities, so people can try out new things that they might not otherwise think of trying. Having multiple techniques available in a single set of tools is useful, because it makes it easier to combine and compare different techniques. At the same time, having several different products makes sense for a larger group, since different products have different strengths, even when they support the same underlying functionality. Some are better at presenting results; some are better at developing scores; some are more intuitive for novice users.

Assess the range of data mining tasks to be addressed and decide which data mining techniques will be most valuable. If you have a single application in mind, or a family of closely related applications, then it is likely that you will be able to select a single technique and stick with it. If you are setting up a data mining lab environment to handle a wide range of data mining applications, you will want to look for a coordinated suite of tools.

QUESTIONS TO ASK WHEN SELECTING DATA MINING SOFTWARE

The following list of questions is designed to help select the right data mining software for your company. We present the questions as an unordered list. The first thing you should do is order the list according to your own priorities. These priorities will necessarily be different from case to case, which is why we have not attempted to rank them for you. In some environments, for example, there is an established standard hardware supplier and platform-independence is not an issue, while in other environments it is of paramount concern, either so that different divisions can use the package or in anticipation of a future change in hardware.

◆ What is the range of data mining techniques offered by the vendor?

◆ How scalable is the product in terms of the size of the data, the number of users, the number of fields in the data, and its use of the hardware?

◆ Does the product provide transparent access to databases and files?

◆ Does the product provide multiple levels of user interfaces?

◆ Does the product generate comprehensible explanations of the models it generates?

◆ Does the product support graphics, visualization, and reporting tools?

◆ Does the product interact well with other software in the environment, such as reporting packages, databases, and so on?

◆ Can the product handle diverse data types?

◆ Is the product well documented and easy to use?

◆ What is the availability of support, training, and consulting?

◆ How well will the product fit into the existing computing environment?

◆ Does the vendor have credible references?

Once you have determined which of these questions are most important to your organization, use them to assess candidate software packages by interviewing the software vendors or by enlisting the aid of an independent data mining consultant.
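The sidebar asks only that the questions be ordered by priority and then used to assess candidates. One simple way to make that concrete, offered here as an assumption rather than as the authors' method, is a weighted score: each question gets a weight reflecting its priority, and each candidate gets a rating per question. The weights, ratings, and product names below are invented for illustration.

```python
# Hypothetical priority weights for three of the sidebar questions.
PRIORITIES = {"range of techniques": 3, "scalability": 5, "documentation": 2}

# Hypothetical ratings (0 to 5) gathered from vendor interviews.
RATINGS = {
    "Product A": {"range of techniques": 4, "scalability": 2, "documentation": 5},
    "Product B": {"range of techniques": 3, "scalability": 5, "documentation": 3},
}


def weighted_score(ratings, priorities):
    """Weighted average of one candidate's ratings."""
    total_weight = sum(priorities.values())
    return sum(priorities[q] * ratings[q] for q in priorities) / total_weight


def rank(candidates, priorities):
    """Candidate names ordered from best to worst weighted score."""
    return sorted(
        candidates,
        key=lambda name: weighted_score(candidates[name], priorities),
        reverse=True,
    )


RANKING = rank(RATINGS, PRIORITIES)
```

Changing the weights changes the ranking, which is the sidebar's point: the same products can come out in a different order for a different organization.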

Scalability

Data mining provides the greatest benefit when the data to be mined is large and complex. But data mining software is likely to be demonstrated on small, sample datasets. Be sure that the data mining software being considered can handle the anticipated data volume, and then perhaps a bit more to take into account future growth (data does not grow smaller over time). The scalability aspect of data mining is important in three ways:

■■ Transforming the data into customer signatures requires a lot of I/O and computing power.

■■ Building models is repetitive and very computationally expensive.

■■ Scoring models requires complex data transformations.

For exploring and transforming data, the most readily available scalable software is the relational database. Relational databases have been designed to take advantage of multiple processors and multiple disks for handling a single database query.

Another class of software, the extraction, transformation, and load (ETL) tools used to create databases, may also be scalable and useful for data mining.

However, most programming languages do not scale; they only support single processors and single disks for handling a single task. When there is a lot of data that needs to be combined, the most scalable solution to handling the data is often found at the database or ETL level rather than in custom code.
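As a rough illustration of combining data inside the database rather than in application code, the sketch below joins and aggregates two tables in a single SQL statement. SQLite is used only because it ships with Python; it does not spread a query across processors and disks the way the server databases described above do. The table names, columns, and rows are hypothetical.

```python
import sqlite3

# Hypothetical order and session tables with several rows per customer.
db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE orders (customer_id TEXT, amount REAL);
    CREATE TABLE sessions (customer_id TEXT, pages INTEGER);
    INSERT INTO orders VALUES ('c1', 30.0), ('c1', 50.0), ('c2', 20.0);
    INSERT INTO sessions VALUES ('c1', 12), ('c2', 3), ('c2', 7), ('c3', 4);
    """
)

# Aggregate each source to one row per customer, then join the results.
# The database engine does the combining; Python only receives the summary.
COMBINE_SQL = """
SELECT s.customer_id,
       s.total_pages,
       COALESCE(o.total_spend, 0.0) AS total_spend
FROM (SELECT customer_id, SUM(pages) AS total_pages
      FROM sessions GROUP BY customer_id) AS s
LEFT JOIN (SELECT customer_id, SUM(amount) AS total_spend
           FROM orders GROUP BY customer_id) AS o
  ON o.customer_id = s.customer_id
ORDER BY s.customer_id
"""

combined = db.execute(COMBINE_SQL).fetchall()
```

Aggregating before joining keeps the join at one row per customer on each side, so a customer with many orders and many sessions is not counted multiple times.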

Building models and exploring data require software that runs fast enough on large enough quantities of data. Some data mining tools work only on data in memory, so the volume of data is limited by available memory. The advantage is that algorithms run faster; the drawback is the ceiling on data size. In practice, this was a problem when available memory was measured in megabytes; the gigabytes of memory available even on a typical workstation ameliorate the problem. Often, the data mining environment puts multiuser data mining server software on a powerful machine close to the data. This is a good solution. As workstations become more powerful, building the models locally is also a viable solution. In either case, the goal is to run the models on hundreds of thousands or millions of rows in a reasonable amount of time. A data mining environment should encourage users to understand and explore the data, rather than expending effort sampling it down to make it fit in memory.
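A back-of-the-envelope check can show whether a model set is likely to fit in memory before a tool is chosen. The sketch below multiplies rows by fields by bytes per value; the 8 bytes per value and the 50 percent headroom left for the tool's own working space are assumptions for illustration, not figures from the text.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def estimated_bytes(rows, fields, bytes_per_value=8):
    """Raw size of a rows x fields table, assuming fixed-width values."""
    return rows * fields * bytes_per_value


def fits_in_memory(rows, fields, available_bytes, bytes_per_value=8, headroom=0.5):
    """True if the raw table uses at most `headroom` of the available memory."""
    needed = estimated_bytes(rows, fields, bytes_per_value)
    return needed <= available_bytes * headroom


# One million customer signatures with 200 fields each.
size = estimated_bytes(1_000_000, 200)
fits_4gib = fits_in_memory(1_000_000, 200, available_bytes=4 * GIB)
fits_2gib = fits_in_memory(1_000_000, 200, available_bytes=2 * GIB)
```

Real tools add their own overhead for indexes, copies, and intermediate results, so an estimate like this is a lower bound on what is needed.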
