Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management

Often, information of interest for customer relationship management is not gathered as intended. Here, for instance, are six ways that business customers might be distinguished from consumers in a telephone company:

■■ Using a customer type indicator: “B” or “C,” for business versus consumer.

■■ Using rate plans: Some are sold only to business customers; others only to consumers.

■■ Using acquisition channels: Some channels are reserved for business, others for consumers.

■■ Using number of lines: 1 or 2 for consumer, more for business.

■■ Using credit class: Businesses have a different set of credit classes from consumers.

■■ Using a model score based on businesslike calling patterns.

(Needless to say, these definitions do not always agree.) One challenge in data warehousing is arriving at a consistent definition that can be used across the business. The key to achieving this is metadata that documents the precise meaning of each field, so everyone using the data warehouse is speaking the same language.
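To make the inconsistency concrete, here is a minimal sketch of how a single derived flag might reconcile several of these signals. All table and column names are hypothetical, and the precedence among the rules is itself a business decision that the metadata should document:

    -- Hypothetical reconciliation of the business/consumer flag,
    -- applying the indicators above in an assumed order of trust.
    SELECT customer_id,
           CASE
               WHEN customer_type_code = 'B' THEN 'business'
               WHEN rate_plan IN (SELECT rate_plan
                                  FROM business_rate_plans) THEN 'business'
               WHEN num_lines > 2 THEN 'business'
               WHEN credit_class IN (SELECT credit_class
                                     FROM business_credit_classes) THEN 'business'
               ELSE 'consumer'
           END AS derived_customer_type
    FROM customers;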

Gathering the data for decision support stresses operational systems since these systems were originally designed for transaction processing. Bringing the data together in a consistent format is almost always the most expensive part of implementing a data warehousing solution.

The source systems offer other challenges as well. They generally run on a wide range of hardware, and much of the software is built in-house or highly customized. These systems are commonly mainframe and midrange systems and generally use complicated and proprietary file structures. Mainframe systems were designed for holding and processing data, not for sharing it.

Although systems are becoming more open, getting access to the data is always an issue, especially when different systems support very different parts of the organization. Systems may also be geographically dispersed, further contributing to the difficulty of bringing the data together.

Extraction, Transformation, and Load

Extraction, transformation, and load (ETL) tools solve the problem of gathering data from disparate systems by providing the ability to map and move data from source systems to other environments. Traditionally, data movement and cleansing have been the responsibility of programmers, who wrote special-purpose code as the need arose. Such application-specific code becomes brittle as systems multiply and source systems change.

Although programming may still be necessary, there are now products that solve the bulk of the ETL problems. These tools make it possible to specify source systems and mappings between different tables and files. They provide the ability to verify data and spit out error reports when loads do not succeed.

The tools also support looking up values in tables (so only known product codes, for instance, are loaded into the data warehouse). The goal of these tools is to describe where data comes from and what happens to it—not to write the step-by-step code for pulling data from one system and putting it into another.
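A minimal sketch of that lookup step, using hypothetical staging, reference, and error tables: only rows whose product code is known reach the warehouse, and the rest are diverted to an error table that feeds the load report.

    -- Load only rows whose product code exists in the reference table...
    INSERT INTO warehouse_sales
    SELECT s.*
    FROM staging_sales s
    WHERE s.product_code IN (SELECT product_code FROM valid_product_codes);

    -- ...and divert the rest to an error table for the load report.
    INSERT INTO load_errors
    SELECT s.*, 'unknown product_code' AS reason
    FROM staging_sales s
    WHERE s.product_code NOT IN (SELECT product_code FROM valid_product_codes);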

Standard procedural languages, such as COBOL and RPG, focus on each step instead of the bigger picture of what needs to be done. ETL tools often provide a metadata interface, so end users can understand what is happening to “their” data during the loading of the central repository.

This genre of tools is often so good at processing data that we are surprised such tools remain embedded in IT departments and are not more generally used by data miners. Mastering Data Mining has a case study from 1998 on using one of these tools, from Ab Initio, to analyze hundreds of gigabytes of call detail records, a quantity of data that would still be challenging to analyze today.

Central Repository

The central repository is the heart of the data warehouse. It is usually a relational database accessed through some variant of SQL.

One of the advantages of relational databases is their ability to run on powerful, scalable machines by taking advantage of multiple processors and multiple disks (see the sidebar “Background on Parallel Technology”). Most statistical and data mining packages, for instance, can run multiple processing threads at the same time. However, each thread represents one task, running on one processor. More hardware does not make any given task run faster (except when other tasks happen to be interfering with it). Relational databases, on the other hand, can take a single query and, in essence, create multiple threads all running at the same time for that one query. As a result, data-intensive applications on powerful computers often run more quickly when using a relational database than when using non-parallel-enabled software, and data mining is a very data-intensive application.
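As a rough illustration of a single query being spread across processors, the hint below uses Oracle-specific syntax (other databases expose similar controls differently), and the table and column names (call_detail, duration_s) are hypothetical:

    -- Hypothetical sketch: ask the optimizer to scan and aggregate
    -- the call_detail table with up to eight parallel workers.
    -- The query is one logical task; the database splits the scan
    -- and aggregation across processors behind the scenes.
    SELECT /*+ PARALLEL(cd, 8) */
           customer_id,
           COUNT(*)        AS calls,
           SUM(duration_s) AS total_seconds
    FROM call_detail cd
    GROUP BY customer_id;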

A key component in the central repository is a logical data model, which describes the structure of the data inside a database in terms familiar to business users. Often, the data model is confused with the physical layout (or schema) of the database, but there is a critical difference between the two. The purpose of the physical layout is to maximize performance and to provide information to database administrators (DBAs). The purpose of the logical data model is to communicate the contents of the database to a wider, less technical audience. The business user must be able to understand the logical data model—entities, attributes, and relationships. The physical layout is an implementation of the logical data model, incorporating compromises and choices along the way to optimize performance.
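As a toy illustration of the difference (the entities and the compromise below are assumptions, not the book's example), the same logical relationship, a customer placing many orders, can be rendered physically in more than one way:

    -- Direct physical rendering of the logical model:
    -- Customer (one) places Order (many).
    CREATE TABLE customer (
        customer_id   INTEGER PRIMARY KEY,
        customer_name VARCHAR(100)
    );

    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id),
        order_date  DATE
    );

    -- A physical compromise for performance: customer_name is
    -- copied onto each order row to avoid a join in common reports.
    -- The logical model is unchanged; only the layout differs.
    CREATE TABLE orders_wide (
        order_id      INTEGER PRIMARY KEY,
        customer_id   INTEGER,
        customer_name VARCHAR(100),
        order_date    DATE
    );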

When embarking on a data warehousing project, many organizations feel compelled to develop a comprehensive, enterprise-wide data model. These efforts are often surprisingly unsuccessful. The logical data model for the data warehouse does not have to be quite as uncompromising as an enterprise-wide model. For instance, a conflict between product codes in the logical data model for the data warehouse can be (but not necessarily should be) resolved by including both product hierarchies—a decision that takes 10 minutes to make. In an enterprise-wide effort, resolving conflicting product codes can require months of investigations and meetings.
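For instance (a hypothetical sketch; all names are assumptions), the ten-minute resolution amounts to carrying both code systems side by side on the product dimension and letting each part of the business query its own hierarchy:

    -- Hypothetical product dimension carrying both conflicting
    -- product hierarchies rather than forcing agreement on one.
    CREATE TABLE product_dim (
        product_key       INTEGER PRIMARY KEY,
        product_name      VARCHAR(100),
        billing_prod_code VARCHAR(20),  -- hierarchy from the billing system
        mktg_prod_code    VARCHAR(20)   -- hierarchy from the marketing system
    );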

TIP Data warehousing is a process. Be wary of any large database called a data warehouse that does not have a process in place for updating the system to meet end user needs. Such a data warehouse will eventually fade into disuse, because end users' needs are likely to evolve, but the system will not.

BACKGROUND ON PARALLEL TECHNOLOGY

Parallel technology is the key to scalable hardware, and it comes in two flavors: symmetric multiprocessing systems (SMPs) and massively parallel processing systems (MPPs). An SMP machine is centered on a bus, a special network present in all computers that connects processing units to memory and disk drives. The bus acts as a central communication device, so SMP systems are sometimes called “shared everything.” Every processing unit can access all the memory and all the disk drives. This form of parallelism is quite popular because an SMP box supports the same applications as uniprocessor boxes, and some applications can take advantage of additional hardware with minimal changes to code. However, SMP technology has its limitations because it places a heavy burden on the central bus, which becomes saturated as the processing load increases. Contention for the central bus is often what limits the performance of SMPs. They tend to work well when they have fewer than 10 to 20 processing units.

MPPs, on the other hand, behave like separate computers connected by a very high-speed network, sometimes called a switch. Each processing unit has its own memory and its own disk storage. Some nodes may be specialized for processing and have minimal disk storage, and others may be specialized for storage and have lots of disk capacity. The bus connecting the processing unit to memory and disk drives never gets saturated. However, one drawback is that some memory and some disk drives are now local and some are remote—a distinction that can make MPPs harder to program. Programs designed for one processor can always run on one processor in an MPP—but they require modifications to take advantage of all the hardware. MPPs are truly scalable so long as the network connecting the processors can supply more bandwidth, and faster networks are generally easier to design than faster buses. There are MPP-based computers with thousands of nodes and thousands of disks.

Both SMPs and MPPs have their advantages. Recognizing this, the vendors of these computers are making them more similar. SMP vendors are connecting their SMP computers together in clusters that start to resemble MPP boxes. At the same time, MPP vendors are replacing their single-processing units with SMP units, creating a very similar architecture. However, regardless of how powerful the hardware is, software needs to be designed to take advantage of these machines. Fortunately, the largest database vendors have invested years of research into enabling their products to do so.

