Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management

There are other synergies between data mining and OLAP. One of the characteristics of decision trees discussed in Chapter 6 is their ability to identify the most informative features in the data relative to a particular outcome. That is, if a decision tree is built to predict attrition, then the upper levels of the tree contain the features that are the most important predictors of attrition. These predictors are good candidates for dimensions in an OLAP tool, and such analysis helps build better, more useful cubes. Another problem when building cubes is determining how to make continuous dimensions discrete. The split points in the nodes of a decision tree can suggest good break points for a continuous value. This information can be fed into the OLAP tool to improve the dimension.
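As a minimal sketch of this idea, the following pure-Python snippet finds the split threshold a decision-tree node would choose for one continuous field, using Gini impurity as the splitting criterion; the resulting threshold could then serve as a break point for an OLAP dimension. The helper names and the tenure-versus-attrition data are invented for illustration.

```python
# Hypothetical sketch: find the split threshold a decision-tree node would
# choose for a continuous field (here, tenure in months vs. attrition),
# then use it as a bin boundary for an OLAP dimension. Data is made up.

def gini(labels):
    """Gini impurity of a list of 0/1 outcome labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(values, labels):
    """Return the threshold minimizing weighted Gini impurity,
    i.e. the cut a decision-tree node would make on this field."""
    pairs = sorted(zip(values, labels))
    best_t, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold exists between equal values
        t = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

# Made-up example: short-tenure customers churn more often (1 = attrition).
tenure = [2, 3, 5, 8, 14, 20, 26, 30, 36, 48]
churned = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
print(best_split(tenure, churned))  # threshold to use as a dimension break
```

On this toy data the best cut falls between 8 and 14 months, exactly the kind of break point the text suggests feeding into the OLAP tool.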

Chapter 15: Data Warehousing, OLAP, and Data Mining

One of the problems with neural networks is the difficulty of understanding the results. This is especially true when using them for undirected data mining, as when using SOM networks to detect clusters. The SOM identifies clusters, but cannot explain what the clusters mean.

OLAP to the rescue! The data can now be enhanced with a predicted cluster, as well as with other information about customers, such as demographics, purchase history, and so on. This is a good application for a cube. Using OLAP—with information about the clusters included as a dimension—makes it possible for end users to explore the clusters and to determine features that distinguish them. The dimensions used for the OLAP cube should include the inputs to the SOM neural network, along with the cluster identifier, and perhaps other descriptive variables. There is a tricky data conversion problem because the neural networks require continuous values scaled between –1 and 1, and OLAP tools prefer discrete values. For values that were originally discrete, this is no problem. For continuous values, various binning techniques solve the problem.
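The two conversions described above can be sketched in a few lines of Python. The function names, the age data, and the bin boundaries are invented for illustration; a real project would pick boundaries from business rules or, as noted earlier, from a decision tree.

```python
# Sketch of the two data conversions the text describes: scaling a
# continuous field to [-1, 1] for the SOM's inputs, and binning the same
# field into discrete ranges for the OLAP dimension. Names are made up.

def scale(values):
    """Min-max scale to [-1, 1], as the neural network inputs require."""
    lo, hi = min(values), max(values)
    return [2 * (v - lo) / (hi - lo) - 1 for v in values]

def bin_label(value, edges, labels):
    """Assign a discrete bin label for the OLAP dimension."""
    for edge, label in zip(edges, labels):
        if value <= edge:
            return label
    return labels[-1]

ages = [23, 31, 45, 58, 67]
print(scale(ages))          # SOM-ready inputs in [-1, 1]
edges = [30, 50]            # bin boundaries (e.g., from a decision tree)
labels = ["young", "middle", "older"]
print([bin_label(a, edges, labels) for a in ages])
```

The scaled values feed the SOM; the labeled bins, together with the cluster identifier the SOM assigns, become dimensions of the cube.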

As these examples show, OLAP and data mining complement each other.

Data mining can help build better cubes by defining appropriate dimensions, and further by determining how to break up continuous values on dimensions. OLAP provides a powerful visualization capability to help users better understand the results of data mining, such as clustering and neural networks.

Used together, OLAP and data mining reinforce each other’s strengths and provide more opportunities for exploiting data.

Where Data Mining Fits in with Data Warehousing

Data mining plays an important role in the data warehouse environment. The initial returns from a data warehouse come from automating existing processes, such as putting reports online and giving existing applications a clean source of data. The biggest returns are the improved access to data that can spur innovation and creativity—and these come from new ways of looking at and analyzing data. This is the role of data mining—to provide the tools that improve understanding and inspire creativity based on observations in the data.

A good data warehousing environment serves as a catalyst for data mining.

The two technologies work together as partners:

■■ Data mining thrives on large amounts of data, and the more detailed the data, the better—data that comes from a data warehouse.

■■ Data mining thrives on clean and consistent data—capitalizing on the investment in data cleansing tools.


■■ The data warehouse environment enables hypothesis testing and simplifies efforts to measure the effects of actions taken—enabling the virtuous cycle of data mining.

■■ Scalable hardware and relational database software can offload the data processing parts of data mining.

There is, however, a distinction between the way data mining looks at the world and the way data warehousing does. Normalized data warehouses can store data with time stamps, but it is very difficult to do time-related manipulations—such as determining what event happened just before some other event of interest. OLAP introduces a time dimension. Data mining extends this even further by taking into account the notion of "before" and "after." Data mining learns from data (the "before"), with the purpose of applying these findings to the future (the "after"). For this reason, data mining often puts a heavy load on data warehouses. These are complementary technologies, supporting each other as discussed in the next few sections.

Lots of Data

The traditional approach to data analysis generally starts by reducing the size of the data. There are three common ways of doing this: summarizing detailed transactions, taking a subset of the data, and only looking at certain attributes.

The reason for reducing the size of the data is to make it possible to analyze it on the available hardware and software systems. When the reduction is properly done, the laws of statistics come into play, and it is possible to choose a sample that behaves roughly like the rest of the data.

Data mining, on the other hand, searches for trends in the data and for valuable anomalies. It often tries to answer different types of questions than traditional statistical analysis does, such as "what product is this customer most likely to purchase next?" Even if it is possible to devise a model using a subset of the data, it is still necessary to deploy the model and score all customers, a process that can be very computationally intensive.

Fortunately, data mining algorithms are often able to take advantage of large amounts of data. When looking for patterns that identify rare events—such as having to write off customers because they failed to pay—having large amounts of data ensures that there is sufficient data for analysis. A subset of the data might be statistically relevant in total, but when you try to decompose it into other segments (by region, by product, by customer segment), there may be too little data to produce statistically meaningful results.
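A quick back-of-the-envelope calculation shows why decomposition by segment thins out a sample so fast. All of the numbers below are invented for illustration.

```python
# Illustrative arithmetic (numbers made up): a 10% sample of one million
# customers still looks large, but decomposing it by region, product, and
# customer segment leaves very few rare events per cell.

customers = 1_000_000
sample = int(customers * 0.10)        # 100,000 records in the sample
write_off_rate = 0.005                # rare event: 0.5% of customers
events_in_sample = int(sample * write_off_rate)

cells = 10 * 5 * 4                    # regions x products x segments
events_per_cell = events_in_sample / cells
print(events_in_sample, events_per_cell)  # 500 events, 2.5 per cell
```

Five hundred write-offs sound like plenty until they are spread over 200 cells; with all the data, each cell would have ten times as many.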

Data mining algorithms are able to make use of lots of data. Decision trees, for example, work very well even when there are dozens or hundreds of fields in each record. Link analysis requires a full complement of the data to create a graph. Neural networks can train on millions of records at a time. And, even though the algorithms often work on summaries of the detailed transactions (especially at the customer level), what gets summarized can change from one run to the next. Prebuilding the summaries and discarding the transaction data locks you into only one view of the business. Often the first result from using such summaries is a request for some variation on them.
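The point about customer-level summaries can be made concrete with a small sketch. The field names and transactions below are invented; the idea is that keeping the raw transactions lets the summary definition change from one run to the next, which a prebuilt summary table could not support.

```python
# Sketch (field names invented): summarizing detailed transactions at the
# customer level. Because the raw transactions are kept, the definition of
# the summary can change between runs -- total spend one run, weekend-only
# spend the next.

from collections import defaultdict

transactions = [
    # (customer_id, amount, day_of_week)  0 = Monday ... 6 = Sunday
    ("c1", 20.0, 2), ("c1", 35.0, 5), ("c2", 10.0, 6), ("c2", 40.0, 1),
]

def summarize(txns, keep=lambda t: True):
    """Roll transactions up to one total per customer, filtered by `keep`."""
    totals = defaultdict(float)
    for cust, amount, dow in txns:
        if keep((cust, amount, dow)):
            totals[cust] += amount
    return dict(totals)

print(summarize(transactions))                       # all spend
print(summarize(transactions, lambda t: t[2] >= 5))  # weekend spend only
```

Had only the first summary been prebuilt and the transactions discarded, the weekend-only question could never be answered.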

Consistent, Clean Data

Data mining algorithms are often applied to gigabytes of data combined from several different sources. Much of the work in looking for actionable information actually takes place when bringing the data together—often 80 percent or more of the time allocated to a data mining project—especially when a data warehouse is not available. Subsequent problems, such as matching account numbers, interpreting codes, and householding, further delay the analysis. Finding interesting patterns is often an iterative process that requires going back to the data to get additional data elements. Finally, when interesting patterns are found, it is often necessary to repeat the process on the most recent data available.
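Householding, mentioned above, is a good example of this preparation work. The following is a deliberately crude sketch with invented account data: accounts that share a normalized address are grouped into one household. Real householding also matches on surname, phone number, and so on.

```python
# Illustrative householding sketch: group accounts sharing a normalized
# address so they can be analyzed as one household. The normalization here
# is deliberately crude, and the account data is invented.

from collections import defaultdict

accounts = [
    ("A-100", "12 Oak St Apt 3"),
    ("A-101", "12 oak street, apt 3"),
    ("A-102", "99 Elm Ave"),
]

def household_key(address):
    """Crude normalization: lowercase, strip punctuation, abbreviate 'street'."""
    addr = address.lower().replace(",", "").replace(".", "")
    return addr.replace("street", "st")

households = defaultdict(list)
for acct, addr in accounts:
    households[household_key(addr)].append(acct)

print(dict(households))  # two accounts collapse into one household
```

When this matching is done once, as data is loaded into the warehouse, every subsequent analysis inherits it for free.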

A well-designed and well-built data warehouse can help solve these problems. Data is cleaned once, when it is loaded into the data warehouse. The meaning of fields is well defined and available through the metadata. Incorporating new data into analyses is as easy as finding out what data is available through the metadata and retrieving it from the warehouse. A particular analysis can be reapplied on more recent data, since the warehouse is kept up to date. The end result is that the data is cleaner and more available—and that the analysts can spend more time applying powerful tools and insights instead of moving data and pushing bytes.

Hypothesis Testing and Measurement

The data warehouse facilitates two other areas of data mining. Hypothesis testing is the verification of educated guesses about patterns in the data. Do tropical colors really sell better in Florida than elsewhere? Do people tend to make long-distance calls after dinner? Are the users of credit cards at restaurants really high-end customers? All of these questions can be expressed rather easily as queries on the appropriate relational database. Having the data available makes it possible to ask questions and find out quickly what the answers are.
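The first of those hypotheses can be expressed as a query along the following lines. The table, columns, and sales figures are invented, and an in-memory SQLite database stands in for the warehouse's relational database.

```python
# One of the text's hypotheses ("do tropical colors sell better in
# Florida?") expressed as a query. Table and column names are invented;
# sqlite3 stands in for the warehouse's relational database.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (state TEXT, color_group TEXT, units INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("FL", "tropical", 120), ("FL", "other", 80),
    ("NY", "tropical", 40),  ("NY", "other", 160),
])

# Share of units sold in tropical colors, Florida vs. everywhere else.
query = """
    SELECT CASE WHEN state = 'FL' THEN 'FL' ELSE 'elsewhere' END AS region,
           1.0 * SUM(CASE WHEN color_group = 'tropical' THEN units END)
               / SUM(units) AS tropical_share
    FROM sales
    GROUP BY region
    ORDER BY region
"""
for region, share in conn.execute(query):
    print(region, round(share, 2))
```

With the data already together in one place, answering the question is a single query rather than a data-gathering project.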

TIP The ability to test hypotheses and ideas is a very important aspect of data mining. By bringing the data together in one place, data warehouses enable answering in-depth, complicated questions. One caveat is that such queries can be expensive to run, falling into the killer query category.


Measurement is the other area where data warehouses have proven to be very valuable. Often when marketing efforts, product improvements, and so forth take place, there is limited feedback on the degree of success achieved. A data warehouse makes it possible to see the results and to find related effects.

