Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management

Data warehousing is a natural ally of data mining. Data mining seeks to find actionable patterns in data and therefore has a firm requirement for clean and consistent data. Much of the effort behind data mining endeavors is in the steps of identifying, acquiring, and cleansing the data. A well-designed corporate data warehouse is a valuable ally. Better yet, if the design of the data warehouse includes support for data mining applications, the warehouse facilitates and catalyzes data mining efforts. The two technologies work together to deliver value. Data mining fulfills some of the promise of data warehousing by converting an essentially inert source of clean and consistent data into actionable information.

There is also a technological component to this relationship. Apart from the ability of users to run multiple jobs at the same time, most software, including data mining and statistical software, does not take advantage of the multiple processors and multiple disks available on the fastest servers. Relational database management systems (RDBMS), the heart of most data warehouses, are parallel-enabled and can take advantage of all of a system’s resources for processing a single query. Even more importantly, users do not need to be aware of this fact, since the interface, some variant on SQL, remains the same. A database running on a powerful server can be a powerful asset for processing large amounts of data, as is the case when summarizing transactions at the customer level.
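As a minimal sketch of what summarizing transactions at the customer level looks like through the SQL interface, the following uses Python's built-in `sqlite3` module with an in-memory database. The table name `transactions`, its columns, and the sample rows are hypothetical, chosen only for illustration. SQLite is not a parallel database, so this shows the shape of the query rather than the parallel execution that a warehouse RDBMS would provide.

```python
import sqlite3

# Hypothetical transaction table: one row per purchase.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (customer_id INTEGER, tx_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (1, "2004-01-05", 20.0),
        (1, "2004-01-19", 35.0),
        (2, "2004-01-07", 12.5),
        (2, "2004-02-02", 7.5),
        (2, "2004-02-15", 30.0),
    ],
)

# One summary row per customer. The SQL text stays the same no matter
# how the database engine chooses to execute it.
customer_summary = conn.execute(
    """
    SELECT customer_id,
           COUNT(*)     AS num_transactions,
           SUM(amount)  AS total_amount,
           MIN(tx_date) AS first_date,
           MAX(tx_date) AS last_date
    FROM transactions
    GROUP BY customer_id
    ORDER BY customer_id
    """
).fetchall()

for row in customer_summary:
    print(row)
```

The point of the sketch is that the user writes only the `GROUP BY` query; how the work is spread across processors and disks is left to the database.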

As useful as data warehousing is, such systems are not a prerequisite for data mining and data analysis. Statisticians, actuaries, and analysts have been using statistical packages for decades, and achieving good results with their analyses, without the benefit of a well-designed centralized warehouse. This process can continue to be useful. Nevertheless, because of the need for consistent, accurate, and timely data to support business units, data warehousing has become increasingly important for any kind of decision support or information analysis.

This chapter is focused on data warehousing as part of the virtuous cycle of data mining, as a valuable and often critical component in supporting all four phases of the cycle: identifying opportunities, analyzing data, applying information, and measuring results. It is not a how-to guide for building a warehouse—there are many books already devoted to that subject, and we heartily recommend Ralph Kimball’s The Data Warehouse Toolkit (Wiley, 2002) and Bill Inmon’s Building the Data Warehouse (Wiley, 2002).

470643 c15.qxd 3/8/04 11:20 AM Page 475

Data Warehousing, OLAP, and Data Mining 475

The chapter starts with a discussion of the different types of data that are available, and then discusses data warehousing requirements from the perspective of data mining. It then shows a typical data warehousing architecture and variants on this theme. The chapter next turns to Online Analytic Processing (OLAP), an alternative approach to the normalized data warehouse. The final discussion covers the role of data mining in these environments. As with much that has to do with data mining, however, the place to start is with data.

The Architecture of Data

There are many different flavors of information represented on computers. Different levels of data represent different types of abstraction, as shown in Figure 15.1:

■■ Transaction data
■■ Operational summary data
■■ Decision-support summary data
■■ Schema
■■ Metadata
■■ Business rules

[Figure: a pyramid of data levels. Vertical axis: abstraction level. Horizontal axis: data size. Levels from top to bottom: business rules (what's been learned from the data); metadata (logical model and mappings to physical layout and sources); database schema (physical layout of the data: tables, fields, indexes, types); summary data, both decision support and operational (summaries by who, what, where, when); operational data (who, what, where, and when).]

Figure 15.1 A hierarchy of data and its descriptions helps users navigate around a data warehouse. As data gets more abstract, it generally gets less voluminous.


The level of abstraction is an important characteristic of data used in data mining. In a well-designed system, it should be possible to drill down through these levels of abstraction to obtain the base data that supports a summarization or a business rule. The lower levels of the pyramid are more voluminous and tend to be the stuff of databases. The upper levels are smaller and tend to be the stuff of computer programs. All these levels are important, because we do not want to analyze the detailed data to merely produce what should already be known.

Transaction Data, the Base Level

Every product purchased by a customer, every bank transaction, every Web page visit, every credit card purchase, every flight segment, every package, every telephone call is recorded in some operational system. Every time a new customer opens an account or pays a bill, there should be a record of the transaction somewhere, providing information about who, what, where, when, and how much. Such transaction-level data is the raw material for understanding customer behavior. It is the eyes and ears of the enterprise.

Unfortunately, over time operational systems change because of changing business needs. Fields may change their meaning over time. Important data is simply rolled off and deleted. Change is constant, in response to the introduction of new products, expanding numbers of customers, acquisitions, reorganizations, and new technology. The fact that operational data changes over time has to be part of any robust data warehousing approach.

TIP Data warehouses need to store data so the information is compatible over time, even when product lines change, when markets change, when customer segments change, when business organizations change. Otherwise, data mining is likely to pick up patterns that represent these changes, rather than underlying customer behavior.
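One way to keep data compatible over time can be sketched as a dated mapping from source-system codes to categories whose meaning does not change. This is only an illustration under stated assumptions: the plan codes, dates, category names, and the `stable_category` helper are all hypothetical and do not come from any particular warehouse design.

```python
from datetime import date

# Hypothetical history of what a source-system plan code meant over time:
# (code, valid_from, valid_to_exclusive, stable_category)
CODE_HISTORY = [
    ("A", date(2000, 1, 1), date(2002, 7, 1), "basic"),
    ("A", date(2002, 7, 1), date(9999, 12, 31), "premium"),  # code was reused
    ("B", date(2000, 1, 1), date(9999, 12, 31), "basic"),
]


def stable_category(code: str, as_of: date) -> str:
    """Translate a source code into a category with a fixed meaning."""
    for c, start, end, category in CODE_HISTORY:
        if c == code and start <= as_of < end:
            return category
    raise KeyError(f"no mapping for code {code!r} on {as_of}")


# The same code "A" resolves differently depending on the record's date.
print(stable_category("A", date(2001, 5, 1)))
print(stable_category("A", date(2003, 1, 1)))
```

Without such a mapping, a model trained on the raw code would see an apparent shift in behavior for code "A" in mid-2002 that reflects the recoding rather than anything customers did.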

The amount of data gathered from transactional systems can be enormous. A single fast food restaurant sells hundreds of thousands of meals over the course of a year. A chain of supermarkets can have tens or hundreds of thousands of transactions a day. A large bank processes millions of checks and credit card purchases a day. Large Web sites have millions of hits each day (in 2003, Google was already handling over 250 million searches each day). A telephone company has tens or even hundreds of millions of completed calls every day. A large ad server on the Web keeps track of over a billion ad views every day. Even with the price of disk space falling, storing all these transactions requires a significant investment. For reference, it is worth remembering that a day has 86,400 seconds, so a million transactions a day is really an average of about 12 transactions per second all day (and 250 million searches amounts to close to 3,000 searches per second!), with peaks several times higher.
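The per-second figures above follow from dividing a daily volume by the number of seconds in a day. A small sketch of that arithmetic, assuming transactions are spread evenly over the day (the function name `per_second` is introduced here only for illustration):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(transactions_per_day: float) -> float:
    """Average transactions per second, assuming an even spread over the day."""
    return transactions_per_day / SECONDS_PER_DAY


million_rate = per_second(1_000_000)    # a million transactions a day
search_rate = per_second(250_000_000)   # 250 million searches a day

print(f"{million_rate:.1f} per second")
print(f"{search_rate:.0f} per second")
```

These are averages only. Real traffic is uneven, so the peak rates a system must handle are several times higher.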

Because of the large data volumes, there is often a reluctance to store transaction-level data in a data warehouse. From the perspective of data mining, this is a shame, since the transactions best describe customer behavior.

Operational Summary Data

Operational summaries play the same role as transactions; the difference is that operational summaries are derived from transactions. The most common examples are billing systems, which summarize transactions, usually into monthly or four-week bill cycles. These summaries are customer-facing and often result in other transactions, such as bill payments. In some cases, operational summaries may include fields that are summarized to enhance the company’s understanding of its customers rather than for operational purposes.

For instance, Chapter 4 described how AT&T used call detail records to calculate a “bizocity” score, indicating how businesslike a telephone number’s calling pattern appears. The records of each call are discarded, but the score is kept up to date.

There is a distinction between operational summary data and transaction data, because summaries are for a period of time and transactions represent events. Consider the amount paid by a subscription customer. In a billing system, amount paid is a summary for the billing period. A payment history table instead provides detail on every payment transaction. For most customers, the monthly summary and payment transactions are very similar. However, two payments might arrive during the same billing period. The more detailed payment information might be useful for insight into customer payment patterns.
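The difference between a period summary and the underlying payment events can be sketched as follows. The payment records are hypothetical, and billing periods are assumed to be calendar months purely to keep the example short; real bill cycles vary.

```python
from collections import defaultdict
from datetime import date

# Hypothetical payment history: (customer_id, payment_date, amount).
payments = [
    ("C1", date(2004, 1, 3), 40.0),
    ("C1", date(2004, 1, 28), 40.0),  # second payment in the same period
    ("C2", date(2004, 1, 15), 25.0),
    ("C1", date(2004, 3, 2), 40.0),
]

# Operational summary: amount paid per customer per billing period
# (assumed here to be a calendar month).
amount_paid = defaultdict(float)
payment_count = defaultdict(int)
for customer_id, paid_on, amount in payments:
    period = (customer_id, paid_on.year, paid_on.month)
    amount_paid[period] += amount
    payment_count[period] += 1

# The summary shows only a total for the period; the transaction detail
# is needed to see that the total came from more than one payment.
multi_payment_periods = [p for p, n in payment_count.items() if n > 1]

print(dict(amount_paid))
print(multi_payment_periods)
```

In the summary, customer C1 simply shows a larger amount paid for January. Only the payment history reveals that two separate payments arrived, and that none arrived in February.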

Decision-Support Summary Data

Decision-support summary data is the data used for making decisions about the business. The financial data used to run a company provides an example of decision-support summary data; this is often considered to be the cleanest data for decision making. Other examples are the data warehouses and data marts whose purpose is to provide a decision-support system of record at the customer level. Maintaining decision-support summary data is the purpose of the data warehouse.

Generally, it is a bad idea to use the same system for analytic and operational purposes, since operational purposes need to take precedence, resulting in a system that is optimized for operations and not decision support. Financial systems are not generally designed for understanding customers, because they are designed for accounting purposes. Making customer summaries balance exactly to the general ledger is highly complex and usually not worth the
