Of course, the ease of getting a report that can answer one of these questions depends on the particular implementation of the reporting interface. However, even for ad hoc reporting, accessing the cube structure can prove much easier than accessing a normalized relational database.
Three Varieties of Cubes
The cube described in the previous section is an example of a summary data cube. This is a very common example in OLAP. However, not all cubes are summary cubes. And, a data warehouse may contain many different cubes for different purposes.
470643 c15.qxd 3/8/04 11:20 AM Page 499
Data Warehousing, OLAP, and Data Mining
499
These subcubes
correspond to the
purchase of the
same product at
Store
one store on all
days.
Date
Product
store = X
store = X
store = X
store = X
These are some of
product = Y
product = Y
product = Y
product = Y
the subcubes in more
date =
date =
date =
date =
count = 1
count = 5
count = 0
count = 1
detail.
value = $44
value = 215
value = $0
value = $44
The answer to the question is the number of subcubes where count is not equal to 0.
Figure 15.6 On how many days did store X not sell any product Y ?
Another type of cube represents individual events. These cubes contain the most detailed data related to customer interactions, such as calls to customer service, payments, individual bills, and so on. The summaries are made by aggregating events across the cube. Such event cubes typically have a customer dimension or something similar, such as an account, Web cookie, or household, which ties the event back to the customer. A small number of dimensions, such as the customer ID, date, and event type are often sufficient for identifying each subcube. However, an event cube often has several other dimensions, which provide more detailed information and are important for aggregating data. The facts in such a table often contain dollar amounts and counts.
Event cubes are very powerful. Their use is limited because they rapidly become very big—the database tables representing them can have millions, hundreds of millions, or even billions of rows. Even with the power of OLAP
and parallel computers, such cubes require a bit of processing time for routine queries. Nonetheless, event cubes are particularly valuable because they make it possible to “drill down” from other cubes—to find the exact set of events used for calculating a particular value.
470643 c15.qxd 3/8/04 11:20 AM Page 500
500 Chapter 15
The third type of cube is a variant on the event cube. This is the factless fact table, whose purpose is to represent the evidence that something occurred. For instance, there might be a factless fact table that specifies the prospects included in a direct mail campaign. Such a fact table might have the following dimensions:
■■
Prospect ID (perhaps a household ID)
■■
Source of the prospect name
■■
Target date of the mailing
■■
Type of message
■■
Type of creative
■■
Type of offer
This is a case where there may not be any numeric facts to store about the individual name. Of course, there might be interesting attributes for the dimensions—such as the promotional cost of the offers and the cost of the names. However, this data is available through the dimensions and hence does not need to be repeated at the individual prospect level.
Regardless of the type of fact table, there is one cardinal rule: any particular item of information should fall into exactly one subcube. When this rule is violated, the cube cannot easily be used to report on the various dimensions. A corollary of this rule is that when an OLAP cube is being loaded, it is very important to keep track of any data that has unexpected dimensional values.
Every dimension should have an “other” category to guarantee that all data makes it in.
T I P When choosing the dimensions for a cube, be sure that each record lands in exactly one subcube. If you have redundant dimensions—such as one dimension for date and another for day of the week—then the same record will land in two or more subcubes. If this happens, then the summarizations based on the subcubes will no longer be accurate.
Apart from the cardinal rule that each record inserted into the cube should land in exactly one subcube, there are three other things to keep in mind when designing effective cubes:
■■
Determining the facts
■■
Handling complex dimensions
■■
Making dimensions conform across the data warehouse
These three issues arise when trying to develop cubes, and resolving them is important to making the cubes useful for analytic purposes.
470643 c15.qxd 3/8/04 11:20 AM Page 501
Data Warehousing, OLAP, and Data Mining 501
Facts
Facts are the measures in each subcube. The most useful facts are additive, so they can be combined together across many different subcubes to provide responses to queries at arbitrary levels of summarization. Additive facts make it possible to summarize data along any dimension or along several dimensions at one time—which is exactly the purpose of the cube.
Examples of additive facts are:
■■
Counts
■■
Counts of variables with a particular value
■■
Total duration of time (such as spent on a web site)
■■
Total monetary values
The total amount of money spent on a particular product on a particular day is the sum of the amount spent on that product in each store. This is a good example of an additive fact. However, not all facts are additive. Examples include:
■■
Averages
■■
Unique counts
■■
Counts of things shared across different cubes, such as transactions Averages are not a very interesting example of a nonadditive fact, because an average is a total divided by a count. Since each of these is additive, the average can be derived after combining these facts.
The other examples are more interesting. One interesting question is how many unique customers did some particular action. Although this number can be stored in a subcube, it is not additive. Consider a retail cube with the date, store, and product dimensions. A single customer may purchase items in more than one store, or purchase more than one item in a store, or make purchases on different days. A field containing the number of unique customers has information about one customer in more than one subcube, violating the cardinal rule of OLAP, so the cube is not going to be able to report on unique customers.
A similar thing happens when trying to count numbers of transactions.
Since the information about the transaction may be stored in several different subcubes (since a single transaction may involve more than one product), counts of transactions also violate the cardinal rule. This type of information cannot be gathered at the summary level.
Another note about facts is that not all numeric data is appropriate as a fact in a cube. For instance, age in years is numeric, but it might be better treated as a dimension rather than a fact. Another example is customer value. Discrete
470643 c15.qxd 3/8/04 11:20 AM Page 502
502 Chapter 15
ranges of customer value are useful as dimensions, and in many circumstances more useful than trying to include customer value as a fact.
When designing cubes, there is a temptation to mix facts and dimensions by creating a count or total for a group of related values. For instance:
■■
Count of active customers of less than 1-year tenure, between 1 and 2
years, and greater than 2 years
■■
Amount credited on weekdays; amount credited on weekends
■■
Total for each day of the week
Each of these suggests another dimension for the cube. The first should have a customer tenure dimensions that takes at least three values. The second appeared in a cube where the time dimension was by month. These facts suggest a need for daily summaries, or at least for separating weekdays and weekends along a dimension. The third suggests a need for a date dimension at the granularity of days.
Dimensions and Their Hierarchies
Sometimes, a single column seems appropriate for multiple dimensions. For instance, OLAP is a good tool for visualizing trends over time, such as for sales or financial data. A specific date in this case potentially represents information TEAMFLY
along several dimensions, as shown in Figure 15.7:
■■
Day of the week
■■
Month
■■
Quarter
■■
Calendar year
One approach is to represent each of these as a different dimension. In other words, there would be four dimensions, one for the day of the week, one for the month, one for the quarter, and one for the calendar year. The data for January 2004, then would be the subcube where the January dimension intersects the 2004 dimension.
This is not a good approach. Multidimensional modeling recognizes that time is an important dimension, and that time can have many different attributes. In addition to the attributes described above, there is also the week of the year, whether the date is a holiday, whether the date is a work day, and so on.