470643 c09.qxd 3/8/04 11:15 AM Page 304
304 Chapter 9
Table 9.3 Transactions with More Summarized Items
CUSTOMER   PIZZA   MILK   SUGAR   APPLES   COFFEE
1
2
3
4
5
On the other hand, the manager of frozen foods, or of a chain of pizza restaurants, may be very interested in the particular combinations of toppings that are ordered. He or she might decompose a pizza order into its constituent parts, as shown in Table 9.4.
At some later point in time, the grocery store may become interested in having more detail in its transactions, so the single “frozen pizza” item would no longer be sufficient. Or, the pizza restaurants might broaden their menu choices and become less interested in all the different toppings. The items of interest may change over time. This can pose a problem when trying to use historical data if different levels of detail have been removed.
Choosing the right level of detail is a critical consideration for the analysis.
If the transaction data in the grocery store keeps track of every type, brand, and size of frozen pizza—which probably account for several dozen products—then all these items need to map up to the “frozen pizza” item for analysis.
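The mapping described above can be sketched as a simple lookup. This is a minimal illustration, not a prescribed method: the SKU strings and the SKU_TO_ITEM table are invented for the example, and a real store would load the mapping from its product master file.

```python
# Hypothetical SKU-to-item lookup; the SKU strings are made up for illustration.
SKU_TO_ITEM = {
    "PIZZA-BRAND-A-12IN": "frozen pizza",
    "PIZZA-BRAND-B-9IN": "frozen pizza",
    "MILK-2PCT-1GAL": "milk",
}


def generalize(transaction, sku_to_item):
    """Replace each SKU with its generalized item; unmapped SKUs are kept as-is."""
    return {sku_to_item.get(sku, sku) for sku in transaction}


basket = ["PIZZA-BRAND-A-12IN", "PIZZA-BRAND-B-9IN", "MILK-2PCT-1GAL"]
items = generalize(basket, SKU_TO_ITEM)
print(sorted(items))
```

Because the result is a set, two different frozen pizzas in the same basket collapse into one “frozen pizza” item, which is the form the analysis needs.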
Table 9.4 Transactions with More Detailed Items
CUSTOMER   EXTRA CHEESE   ONIONS   PEPPERS   MUSHROOMS   OLIVES
1
2
3
4
5
Market Basket Analysis and Association Rules 305
Product Hierarchies Help to Generalize Items
In the real world, items have product codes and stock-keeping unit codes (SKUs) that fall into hierarchical categories (see Figure 9.10), called a product hierarchy or taxonomy. What level of the product hierarchy is the right one to use? This brings up issues such as
■ Are large fries and small fries the same product?
■ Is the brand of ice cream more relevant than its flavor?
■ Which is more important: the size, style, pattern, or designer of clothing?
■ Is the energy-saving option on a large appliance indicative of customer behavior?
[Figure 9.10: partial product taxonomy, running from more general at the top to more detailed at the bottom. Top level: Frozen Foods. Next level: Frozen Desserts, Frozen Vegetables, Frozen Dinners. Next level: Frozen Yogurt, Ice Cream, Frozen Fruit Bars; Peas, Carrots, Mixed, Other. Next level: Rocky Road, Cherry Garcia, Chocolate, Strawberry, Vanilla, Other. Bottom level: brands, sizes, and stock-keeping units (SKUs).]
Figure 9.10 Product hierarchies start with the most general and move to increasing detail.
The number of combinations to consider grows very fast as the number of items used in the analysis increases. This suggests using items from higher levels of the product hierarchy, such as “frozen desserts” instead of “ice cream.” On the other hand, the more specific the items are, the more likely the results are to be actionable. Knowing what sells with a particular brand of frozen pizza, for instance, can help in managing the relationship with the manufacturer. One compromise is to use more general items initially, then to repeat the rule generation to home in on more specific items. As the analysis focuses on more specific items, use only the subset of transactions containing those items.
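To make the growth in combinations concrete, the short sketch below counts how many distinct item sets of a given size can be formed from a given number of items. It uses only the standard binomial coefficient; the function name count_itemsets and the item counts are chosen for illustration.

```python
from math import comb


def count_itemsets(n_items, size):
    """Number of distinct item combinations of the given size."""
    return comb(n_items, size)


# Compare a handful of generalized items with hundreds of detailed ones.
for n in (10, 100, 1000):
    print(n, "items:", count_itemsets(n, 2), "pairs,", count_itemsets(n, 3), "triples")
```

Going from 100 items to 1,000 multiplies the number of pairs by about 100 and the number of triples by about 1,000, which is why rolling items up the hierarchy pays off so quickly.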
The complexity of a rule refers to the number of items it contains. The more items in the transactions, the longer it takes to generate rules of a given complexity. So, the desired complexity of the rules also determines how specific or general the items should be. In some circumstances, customers do not make large purchases. For instance, customers purchase relatively few items at any one time at a convenience store or through some catalogs, so looking for rules containing four or more items may apply to very few transactions and be a wasted effort. In other cases, such as in supermarkets, the average transaction is larger, so more complex rules are useful.
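One quick way to judge whether rules of a given complexity are worth pursuing is to check how many transactions are large enough to contain them. The following is a minimal sketch under the assumption that each transaction is a list of item names; the function name and sample baskets are hypothetical.

```python
def share_with_at_least(transactions, k):
    """Fraction of transactions containing at least k distinct items."""
    if not transactions:
        return 0.0
    large_enough = sum(1 for t in transactions if len(set(t)) >= k)
    return large_enough / len(transactions)


# Made-up convenience-store baskets, mostly small.
baskets = [
    ["soda"],
    ["soda", "chips"],
    ["milk", "bread", "eggs", "coffee"],
    ["gum", "soda", "chips", "candy", "water"],
]
share = share_with_at_least(baskets, 4)
print(share)
```

If only a small share of transactions can contain a four-item rule at all, such rules are unlikely to reach useful support.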
Moving up the product hierarchy reduces the number of items. Dozens or hundreds of items may be reduced to a single generalized item, often corresponding to a single department or product line. An item like a pint of Ben & Jerry’s Cherry Garcia gets generalized to “ice cream” or “frozen foods.”
Instead of investigating “orange juice,” investigate “fruit juices,” and so on.
Often, the appropriate level of the hierarchy ends up matching a department with a product-line manager, so using categories has the practical effect of finding interdepartmental relationships. Generalized items also help find rules with sufficient support. Items at higher levels of the taxonomy appear in many times more transactions than items at lower levels.
Just because some items are generalized does not mean that all items need to move up to the same level. The appropriate level depends on the item, on its importance for producing actionable results, and on its frequency in the data.
For instance, in a department store, big-ticket items (such as appliances) might stay at a low level in the hierarchy, while less-expensive items (such as books) might be higher. This hybrid approach is also useful when looking at individual products. Since there are often thousands of products in the data, generalize everything other than the product or products of interest.
TIP Market basket analysis produces the best results when the items occur in roughly the same number of transactions in the data. This helps prevent rules from being dominated by the most common items. Product hierarchies can help here. Roll up rare items to higher levels in the hierarchy, so they become more frequent. More common items may not have to be rolled up at all.
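The roll-up suggested in the tip can be sketched as a loop that lifts only the rare items. This is one possible approach, not the only one: the PARENT lookup, the threshold min_count, and the sample baskets are all assumptions made for the example, and the hierarchy is assumed to contain no cycles.

```python
from collections import Counter

# Hypothetical lookup: item -> next level up in the product hierarchy.
PARENT = {
    "cherry garcia": "ice cream",
    "rocky road": "ice cream",
    "ice cream": "frozen desserts",
}


def roll_up_rare(transactions, parent, min_count):
    """Replace items seen in fewer than min_count transactions with their
    parent, repeating until no rare item has a parent left to move to."""
    current = [set(t) for t in transactions]
    while True:
        counts = Counter(item for t in current for item in t)
        rare = {i for i, c in counts.items() if c < min_count and i in parent}
        if not rare:
            return current
        current = [{parent[i] if i in rare else i for i in t} for t in current]


baskets = [["cherry garcia", "milk"], ["rocky road", "milk"], ["milk"]]
rolled = roll_up_rare(baskets, PARENT, min_count=2)
print(rolled)
```

Here the two flavors each appear once, so both are rolled up to “ice cream,” which then appears twice and stays at that level. “Milk” is already common and is left alone.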
Virtual Items Go beyond the Product Hierarchy
The purpose of virtual items is to enable the analysis to take advantage of information that goes beyond the product hierarchy. Virtual items do not appear in the product hierarchy of the original items, because they cross product boundaries. Examples of virtual items might be designer labels such as Calvin Klein that appear in both apparel departments and perfumes, low-fat and no-fat products in a grocery store, and energy-saving options on appliances.
Virtual items may even include information about the transactions themselves, such as whether the purchase was made with cash, a credit card, or check, and the day of the week or the time of the day the transaction occurred.
However, it is not a good idea to crowd the data with too many virtual items.
Only include virtual items when you have some idea of how they could result in actionable information if found in well-supported, high-confidence association rules.
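As an illustration of how virtual items might be attached to a transaction, the sketch below adds product-level labels, the payment type, and the day of the week as extra items. The PRODUCT_TAGS lookup, the item naming scheme, and the function name are all hypothetical; in practice, add only the few virtual items expected to lead to actionable rules.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical lookup: product -> labels that cross product boundaries.
PRODUCT_TAGS = {
    "diet coke": ["diet product"],
    "skim milk": ["low-fat"],
}


def add_virtual_items(items, payment_type, timestamp, product_tags):
    """Return the transaction's items plus virtual items for product labels,
    payment type, and day of the week."""
    augmented = set(items)
    for item in items:
        augmented.update(product_tags.get(item, []))
    augmented.add("paid by " + payment_type)
    augmented.add("day: " + DAY_NAMES[timestamp.weekday()])
    return augmented


basket = add_virtual_items(
    ["diet coke", "pretzels"], "cash", datetime(2004, 3, 8, 11, 15), PRODUCT_TAGS
)
print(sorted(basket))
```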
There is a danger, though. Virtual items can cause trivial rules. For instance, imagine a virtual item for “diet product” and another for “coke product.” A rule like the following might then appear:
If “coke product” and “diet product” then “diet coke”
That is, everywhere that
A similar but more subtle danger occurs when the right-hand side does not include the associated item. So, a rule like:
If “coke product” and “diet product” then “pretzels”
probably means,
If “diet coke” then “pretzels”
The only danger from having such rules is that they can obscure what is happening.
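One way to keep such rules from obscuring the picture is to flag, for review, any rule whose left-hand side could be standing in for a single product. The sketch below is only a heuristic under stated assumptions: PRODUCT_TAGS is a hypothetical lookup from each product to its virtual items, and a flagged rule is a candidate for closer inspection, not proof that the rule is trivial.

```python
# Hypothetical lookup: product -> set of virtual items attached to it.
PRODUCT_TAGS = {
    "diet coke": {"coke product", "diet product"},
    "coke classic": {"coke product"},
    "diet root beer": {"diet product"},
}


def candidate_products(antecedent, product_tags):
    """Products whose virtual items cover every item on the rule's left side."""
    wanted = set(antecedent)
    return sorted(p for p, tags in product_tags.items() if wanted <= tags)


def needs_review(antecedent, consequent, product_tags):
    """True when the right side is itself a product implied by the left side."""
    return consequent in candidate_products(antecedent, product_tags)


lhs = ["coke product", "diet product"]
print(candidate_products(lhs, PRODUCT_TAGS))
print(needs_review(lhs, "diet coke", PRODUCT_TAGS))
print(needs_review(lhs, "pretzels", PRODUCT_TAGS))
```

For the pretzels rule, the list of candidate products suggests rereading the left-hand side as “diet coke,” which is the more direct statement of what is happening.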
TIP When applying market basket analysis, it is useful to have a hierarchical taxonomy of the items being considered for analysis. With the levels of the hierarchy chosen carefully, the generalized items should occur about the same number of times in the data, improving the results of the analysis. For specific lifestyle-related choices that provide insight into customer behavior, such as sugar-free items and specific brands, augment the data with virtual items.
Data Quality
The data used for market basket analysis is generally not of very high quality.
It is gathered directly at the point of customer contact and used mainly for operational purposes such as inventory control. The data is likely to have multiple formats, corrections, incompatible code types, and so on. Much of the explanation of various code values is likely to be buried deep in programming code running in legacy systems and may be difficult to extract. Different stores within a single chain sometimes have slightly different product hierarchies or different ways of handling situations like discounts.
Here is an example. The authors were once curious about the approximately 80 department codes present in a large set of transaction data. The client assured us that there were 40 departments and provided a nice description of each of them. More careful inspection revealed the problem. Some stores had IBM cash registers and others had NCR. The two types of equipment had different ways of representing department codes—hence we saw many invalid codes in the data.
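A simple audit can surface this kind of problem early. The sketch below counts department codes that are missing from the documented list and groups them by register type. Everything specific here is invented for illustration: the field names, the valid codes, and the zero-padded variants are assumptions, since the actual difference between the two code formats is not described.

```python
from collections import Counter

# Hypothetical list of documented department codes.
VALID_DEPARTMENTS = {"10", "20", "30"}


def unknown_codes_by_register(records, valid_codes):
    """Count (register type, department code) pairs whose code is undocumented."""
    return Counter(
        (r["register_type"], r["dept_code"])
        for r in records
        if r["dept_code"] not in valid_codes
    )


# Made-up records; the zero-padded codes are an invented example of a
# second representation of the same departments.
records = [
    {"register_type": "IBM", "dept_code": "10"},
    {"register_type": "NCR", "dept_code": "010"},
    {"register_type": "NCR", "dept_code": "010"},
    {"register_type": "NCR", "dept_code": "020"},
]
report = unknown_codes_by_register(records, VALID_DEPARTMENTS)
for (register, code), count in report.most_common():
    print(register, code, count)
```

When the unknown codes cluster under one type of equipment, as they do in this made-up data, the cause is more likely a difference in representation than a set of truly invalid departments.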