Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management – Page 122 – Library. Read online. Free books read online. Read books without registering

Such attributes are stored in reference tables, called dimension tables. Dimension tables make it possible to change the attributes of the dimension without changing the underlying data.

Team-Fly®

470643 c15.qxd 3/8/04 11:20 AM Page 503

Data Warehousing, OLAP, and Data Mining 503

Date

(7 March 1997)

Day of the

Month

Week

Month

Year

(Mar)

(Friday)

(7)

(67)

Year

(1997)

Figure 15.7 There are multiple hierarchies for dates.

WA R N I N G Do not take shortcuts when designing the dimensions for an OLAP system. These are the skeleton of the data mart, and a weak skeleton will not last very long.

Dimension tables contain many different attributes describing each value of the dimension. For instance, a detailed geography dimension might be built from zip codes and include dozens of summary variables about the zip codes.

These attributes can be used for filtering (“How many customers are in high-income areas?”). These values are stored in the dimension table rather than the fact table, because they cannot be aggregated correctly. If there are three stores in a zip code, a zip code population fact would get added up three times—

multiplying the population by three.

Usually, dimension tables are kept up to date with the most recent values for the dimension. So, a store dimension might include the current set of stores with information about the stores, such as layout, square footage, address, and manager name. However, all of these may change over time. Such dimensions are called slowly changing dimensions, and are of particular interest to data mining because data mining wants to reconstruct accurate histories. Slowly changing dimensions are outside the scope of this book. Interested readers should review Ralph Kimball’s books.

470643 c15.qxd 3/8/04 11:20 AM Page 504

504 Chapter 15

Conformed Dimensions

As mentioned earlier, data warehouse systems often contain multiple OLAP

cubes. Some of the power of OLAP arises from the practice of sharing dimensions across different cubes. These shared dimensions are called conformed dimensions and are shown in Figure 15-8; they help ensure that business results reported through different systems use the same underlying set of business rules.

Merchandizing

View

Shop

Marketing

View

Product

Weeks

Finance

View

Customer

Region

Product

Days

Depar

ment

Weeks

Different users have different views of

the data, but they often share

dimensions.

time

The hierarchy for the time dimension

needs to cover days, weeks, months,

and quarters.

shop

The hierarchy for region starts at the

shop level and then includes

metropolitan areas and states.

product

The hierarchy for product includes the

department.

The hierarchy for the customer might

customer

include households.

Figure 15.8 Different views of the data often share common dimensions. Finding the common dimensions and their base units is critical to making data warehousing work well across an organization.

470643 c15.qxd 3/8/04 11:20 AM Page 505

Data Warehousing, OLAP, and Data Mining 505

A good example of a conformed dimension is the calendar dimension, which keeps track of the attributes of each day. A calendar dimension is so important that it should be a part of every data warehouse. However, different components of the warehouse may need different attributes. For instance, a multinational business might include sets of holidays for different countries, so there might be a flag for “United States Holiday,” “United Kingdom Holiday,” “French Holiday,” and so on, instead of an overall holiday flag.

January 1st is a holiday in most countries; however, July 4th is mostly celebrated in the United States.

One of the challenges in building OLAP systems is designing the conformed dimensions so that they are suitable for a wide variety of applications. For some purposes geography might be best described by city and state; for another, by county; for another, by census block group; and for another by zip code. Unfortunately, these four descriptions are not fully compatible, since there can be several small towns in a zip code, and there are five counties in New York City. Multidimensional modeling helps resolve such conflicts.

Star Schema

Cubes are easily stored in relational databases, using a denormalized data structure called the star schema, developed by Ralph Kimball, a guru of OLAP.

One advantage of the star schema is its use of standard database technology to achieve the power of OLAP.

A star schema starts with a central fact table that corresponds to facts about a business. These can be at the transaction level (for an event cube), although they are more often low-level summaries of transactions. For retail sales, the central fact table might contain daily summaries of sales for each product in each store (shop-SKU-time). For a credit card company, a fact table might contain rows for each transaction by each customer or summaries of spending by product (based on card type and credit limit), customer segment, merchant type, customer geography, and month. For a diesel engine manufacturer interested in repair histories, it might contain each repair made on each engine or a daily summary of repairs at each shop by type of repair.

Each row in the central fact table contains some combination of keys that makes it unique. These keys are called dimensions. The central fact table also has other columns that typically contain numeric information specific to each row, such as the amount of the transaction, the number of transactions, and so on. Associated with each dimension are auxiliary tables called dimension tables, which contain information specific to the dimensions. For instance, the dimension table for date might specify the day of the week for a particular date, its month, year, and whether it is a holiday.

470643 c15.qxd 3/8/04 11:20 AM Page 506

506 Chapter 15

In diagrams, the dimension tables are connected to the central fact table, resulting in a shape that resembles a star, as shown in Figure 15.9.

Dept

Description

CORE FRAGRANCE

MISCELLANEOUS

GARDENS

BRIDAL

ACCESSORIES

SKU

Description

Dept

Color

Description

0001

V NECK TEE

BLACK

0002

PANTYHOSE

IVORY

0003

TUXEDO PJ

TAYLOR GREEN

0004

NOVELTY T SHIRT

STILETTO

0005

VELOUR JUMPSUIT

BLUE TOPAZ

Shop

SKU

Color

Date

Count Sales

Cost

Returns

0001

000001

$50

$20

0001

0002

000001

$240

$96

0001

0002

000001

$80

$32

0001

0002

000001

$240

$96

0001

0003

000001

$85

$19

0001

0003

000001

$25

0150

0001

000001

$310

$134

Date Year Month Day

000001

1997

Shop Reg State

City

Sq Ft

000002

1997

0001

San Francisco

3,141

000003

1997

0007

Central Boston

1,026

000004

1997

0034

Miami

5,009

000005

1997

0124

Minneapolis

1,793

0150

New York City

6,400

Date

Hol?

Date

DoW

000001

Wed

000002

Thu

Reg

Name

000003

Fri

000004

Sat

Northeast

000005

Sun

New York/NJ

Mid Atlantic

North Central

Southeast

Figure 15.9 A star schema looks more like this. Dimension tables are conceptually nested, and there may be more than one dimension table for a given dimension.

470643 c15.qxd 3/8/04 11:20 AM Page 507

Data Warehousing, OLAP, and Data Mining 507

In practice, star schemas may not be efficient for answering all users’ questions, because the central fact table is so large. In such cases, the OLAP systems introduce summary tables at different levels to facilitate query response. Relational database vendors have been providing more and more support for star schemas. With a typical architecture, any query on the central fact table would require multiple joins back to the dimension tables. By applying standard indexes, and creatively enhancing indexing technology, relational databases can handle these queries quite well.

OLAP and Data Mining

Data mining is about the successful exploitation of data for decision-support purposes. The virtuous cycle of data mining, described in Chapter 2, reminds us that success depends on more than advanced pattern recognition algorithms. The data mining process needs to provide feedback to people and encourage using information gained from data mining to improve business processes. The data mining process should enable people to provide input, in the form of observations, hypotheses, and hunches about what results are important and how to use those results.

In the larger context of data exploitation, OLAP clearly plays an important role as a means of broadening the audience with access to data. Decisions once made based on experience and educated guesses can now be based on data and patterns in the data. Anomalies and outliers can be identified for further investigation and further modeling, sometimes using the most sophisticated data mining techniques. For instance, a user might discover that a particular item sells better at a particular time during the week through the use of an OLAP tool. This might lead to an investigation using market basket analysis to find other items purchased with that item. Market basket analysis might suggest an explanation for the observed behavior—more information and more opportunities for exploiting the information.