Such attributes are stored in reference tables, called dimension tables. Dimension tables make it possible to change the attributes of the dimension without changing the underlying data.
Team-Fly®
470643 c15.qxd 3/8/04 11:20 AM Page 503
Data Warehousing, OLAP, and Data Mining 503
Date
(7 March 1997)
Day of the
Day of the
Day of the
Month
Week
Month
Year
(Mar)
(Friday)
(7)
(67)
Year
(1997)
Figure 15.7 There are multiple hierarchies for dates.
WA R N I N G Do not take shortcuts when designing the dimensions for an OLAP system. These are the skeleton of the data mart, and a weak skeleton will not last very long.
Dimension tables contain many different attributes describing each value of the dimension. For instance, a detailed geography dimension might be built from zip codes and include dozens of summary variables about the zip codes.
These attributes can be used for filtering (“How many customers are in high-income areas?”). These values are stored in the dimension table rather than the fact table, because they cannot be aggregated correctly. If there are three stores in a zip code, a zip code population fact would get added up three times—
multiplying the population by three.
Usually, dimension tables are kept up to date with the most recent values for the dimension. So, a store dimension might include the current set of stores with information about the stores, such as layout, square footage, address, and manager name. However, all of these may change over time. Such dimensions are called slowly changing dimensions, and are of particular interest to data mining because data mining wants to reconstruct accurate histories. Slowly changing dimensions are outside the scope of this book. Interested readers should review Ralph Kimball’s books.
470643 c15.qxd 3/8/04 11:20 AM Page 504
504 Chapter 15
Conformed Dimensions
As mentioned earlier, data warehouse systems often contain multiple OLAP
cubes. Some of the power of OLAP arises from the practice of sharing dimensions across different cubes. These shared dimensions are called conformed dimensions and are shown in Figure 15-8; they help ensure that business results reported through different systems use the same underlying set of business rules.
Merchandizing
View
Shop
Marketing
View
Product
Weeks
Finance
View
Customer
Region
t
Product
Days
Depar
ment
Weeks
Different users have different views of
the data, but they often share
dimensions.
time
The hierarchy for the time dimension
needs to cover days, weeks, months,
and quarters.
shop
The hierarchy for region starts at the
shop level and then includes
metropolitan areas and states.
product
The hierarchy for product includes the
department.
The hierarchy for the customer might
customer
include households.
Figure 15.8 Different views of the data often share common dimensions. Finding the common dimensions and their base units is critical to making data warehousing work well across an organization.
470643 c15.qxd 3/8/04 11:20 AM Page 505
Data Warehousing, OLAP, and Data Mining 505
A good example of a conformed dimension is the calendar dimension, which keeps track of the attributes of each day. A calendar dimension is so important that it should be a part of every data warehouse. However, different components of the warehouse may need different attributes. For instance, a multinational business might include sets of holidays for different countries, so there might be a flag for “United States Holiday,” “United Kingdom Holiday,” “French Holiday,” and so on, instead of an overall holiday flag.
January 1st is a holiday in most countries; however, July 4th is mostly celebrated in the United States.
One of the challenges in building OLAP systems is designing the conformed dimensions so that they are suitable for a wide variety of applications. For some purposes geography might be best described by city and state; for another, by county; for another, by census block group; and for another by zip code. Unfortunately, these four descriptions are not fully compatible, since there can be several small towns in a zip code, and there are five counties in New York City. Multidimensional modeling helps resolve such conflicts.
Star Schema
Cubes are easily stored in relational databases, using a denormalized data structure called the star schema, developed by Ralph Kimball, a guru of OLAP.
One advantage of the star schema is its use of standard database technology to achieve the power of OLAP.
A star schema starts with a central fact table that corresponds to facts about a business. These can be at the transaction level (for an event cube), although they are more often low-level summaries of transactions. For retail sales, the central fact table might contain daily summaries of sales for each product in each store (shop-SKU-time). For a credit card company, a fact table might contain rows for each transaction by each customer or summaries of spending by product (based on card type and credit limit), customer segment, merchant type, customer geography, and month. For a diesel engine manufacturer interested in repair histories, it might contain each repair made on each engine or a daily summary of repairs at each shop by type of repair.
Each row in the central fact table contains some combination of keys that makes it unique. These keys are called dimensions. The central fact table also has other columns that typically contain numeric information specific to each row, such as the amount of the transaction, the number of transactions, and so on. Associated with each dimension are auxiliary tables called dimension tables, which contain information specific to the dimensions. For instance, the dimension table for date might specify the day of the week for a particular date, its month, year, and whether it is a holiday.
470643 c15.qxd 3/8/04 11:20 AM Page 506
506 Chapter 15
In diagrams, the dimension tables are connected to the central fact table, resulting in a shape that resembles a star, as shown in Figure 15.9.
Dept
Description
01
CORE FRAGRANCE
02
MISCELLANEOUS
05
GARDENS
06
BRIDAL
10
ACCESSORIES
SKU
Description
Dept
Color
Description
0001
V NECK TEE
70
01
BLACK
0002
PANTYHOSE
65
02
IVORY
0003
TUXEDO PJ
60
03
TAYLOR GREEN
0004
NOVELTY T SHIRT
70
04
STILETTO
0005
VELOUR JUMPSUIT
76
05
BLUE TOPAZ
Shop
SKU
Color
Date
Count Sales
Cost
Returns
0001
0001
01
000001
5
$50
$20
0
0001
0002
02
000001
12
$240
$96
0
0001
0002
03
000001
4
$80
$32
1
0001
0002
04
000001
12
$240
$96
0
0001
0003
09
000001
19
$85
$19
2
0001
0003
01
000001
5
$25
$5
0
0150
0001
01
000001
31
$310
$134
2
Date Year Month Day
000001
1997
01
01
Shop Reg State
City
Sq Ft
000002
1997
01
01
0001
J
CA
San Francisco
3,141
000003
1997
01
01
0007
A
MA
Central Boston
1,026
000004
1997
01
01
0034
E
FL
Miami
5,009
000005
1997
01
01
0124
H
MN
Minneapolis
1,793
0150
B
NY
New York City
6,400
Date
Hol?
Date
DoW
000001
Y
000001
Wed
000002
N
000002
Thu
Reg
Name
000003
N
000003
Fri
A
000004
N
000004
Sat
Northeast
B
000005
N
000005
Sun
New York/NJ
C
Mid Atlantic
D
North Central
E
Southeast
Figure 15.9 A star schema looks more like this. Dimension tables are conceptually nested, and there may be more than one dimension table for a given dimension.
470643 c15.qxd 3/8/04 11:20 AM Page 507
Data Warehousing, OLAP, and Data Mining 507
In practice, star schemas may not be efficient for answering all users’ questions, because the central fact table is so large. In such cases, the OLAP systems introduce summary tables at different levels to facilitate query response. Relational database vendors have been providing more and more support for star schemas. With a typical architecture, any query on the central fact table would require multiple joins back to the dimension tables. By applying standard indexes, and creatively enhancing indexing technology, relational databases can handle these queries quite well.
OLAP and Data Mining
Data mining is about the successful exploitation of data for decision-support purposes. The virtuous cycle of data mining, described in Chapter 2, reminds us that success depends on more than advanced pattern recognition algorithms. The data mining process needs to provide feedback to people and encourage using information gained from data mining to improve business processes. The data mining process should enable people to provide input, in the form of observations, hypotheses, and hunches about what results are important and how to use those results.
In the larger context of data exploitation, OLAP clearly plays an important role as a means of broadening the audience with access to data. Decisions once made based on experience and educated guesses can now be based on data and patterns in the data. Anomalies and outliers can be identified for further investigation and further modeling, sometimes using the most sophisticated data mining techniques. For instance, a user might discover that a particular item sells better at a particular time during the week through the use of an OLAP tool. This might lead to an investigation using market basket analysis to find other items purchased with that item. Market basket analysis might suggest an explanation for the observed behavior—more information and more opportunities for exploiting the information.