Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management

470643 c09.qxd 3/8/04 11:15 AM Page 304

304 Chapter 9

Table 9.3 Transactions with More Summarized Items

CUSTOMER    PIZZA    MILK    SUGAR    APPLES    COFFEE
1
2
3
4
5

On the other hand, the manager of frozen foods or a chain of pizza restaurants may be very interested in the particular combinations of toppings that are ordered. He or she might decompose a pizza order into constituent parts, as shown in Table 9.4.

At some later point in time, the grocery store may become interested in having more detail in its transactions, so the single “frozen pizza” item would no longer be sufficient. Or, the pizza restaurants might broaden their menu choices and become less interested in all the different toppings. The items of interest may change over time. This can pose a problem when trying to use historical data if different levels of detail have been removed.

Choosing the right level of detail is a critical consideration for the analysis.

If the transaction data in the grocery store keeps track of every type, brand, and size of frozen pizza—which probably account for several dozen products—then all these items need to map up to the “frozen pizza” item for analysis.

Table 9.4 Transactions with More Detailed Items

CUSTOMER    EXTRA CHEESE    ONIONS    PEPPERS    MUSHROOMS    OLIVES
1
2
3
4
5

Market Basket Analysis and Association Rules 305

Product Hierarchies Help to Generalize Items

In the real world, items have product codes and stock-keeping unit codes (SKUs) that fall into hierarchical categories (see Figure 9.10), called a product hierarchy or taxonomy. What level of the product hierarchy is the right one to use? This brings up issues such as

■■ Are large fries and small fries the same product?

■■ Is the brand of ice cream more relevant than its flavor?

■■ Which is more important: the size, style, pattern, or designer of clothing?

■■ Is the energy-saving option on a large appliance indicative of customer behavior?

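[Figure: partial product taxonomy, running from more general at the top to more detailed at the bottom. Frozen Foods divides into Frozen Desserts, Frozen Vegetables, and Frozen Dinners. Frozen Desserts divides into Frozen Yogurt, Ice Cream, and Frozen Fruit Bars; Frozen Vegetables divides into Peas, Carrots, Mixed, and Other. Ice Cream divides into Rocky Road, Cherry Garcia, Chocolate, Strawberry, Vanilla, and Other. The lowest level holds brands, sizes, and stock-keeping units (SKUs).]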

Figure 9.10 Product hierarchies start with the most general and move to increasing detail.

The number of combinations to consider grows very fast as the number of items used in the analysis increases. This suggests using items from higher levels of the product hierarchy, such as “frozen desserts” instead of “ice cream.” On the other hand, the more specific the items are, the more likely the results are to be actionable. Knowing what sells with a particular brand of frozen pizza, for instance, can help in managing the relationship with the manufacturer. One compromise is to use more general items initially, then to repeat the rule generation to home in on more specific items. As the analysis focuses on more specific items, use only the subset of transactions containing those items.
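To make the growth concrete, the following minimal sketch counts how many distinct item combinations of up to a given size exist for a given number of items. The function name count_itemsets and the choice of a maximum size of three are illustrative assumptions, not part of the original text.

```python
from math import comb

def count_itemsets(n_items: int, max_size: int) -> int:
    """Count the distinct item combinations containing 1..max_size items."""
    return sum(comb(n_items, k) for k in range(1, max_size + 1))

# Combinations of up to three items for catalogs of increasing size.
for n in (10, 100, 1000):
    print(n, count_itemsets(n, 3))
```

With 10 items there are 175 combinations of up to three items; with 100 items there are 166,750. Rolling items up the hierarchy shrinks the first argument, which is why it cuts the work so sharply.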

The complexity of a rule refers to the number of items it contains. The more items in the transactions, the longer it takes to generate rules of a given complexity. So, the desired complexity of the rules also determines how specific or general the items should be. In some circumstances, customers do not make large purchases. For instance, customers purchase relatively few items at any one time at a convenience store or through some catalogs, so looking for rules containing four or more items may apply to very few transactions and be a wasted effort. In other cases, such as in supermarkets, the average transaction is larger, so more complex rules are useful.

Moving up the product hierarchy reduces the number of items. Dozens or hundreds of items may be reduced to a single generalized item, often corresponding to a single department or product line. An item like a pint of Ben & Jerry’s Cherry Garcia gets generalized to “ice cream” or “frozen foods.” Instead of investigating “orange juice,” investigate “fruit juices,” and so on.
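Moving up the hierarchy can be sketched as a lookup through parent links. The PARENT table and the item names below are hypothetical; a real taxonomy would come from the retailer's product master data.

```python
# Hypothetical parent links: each item points to the next more general level.
PARENT = {
    "Cherry Garcia pint": "ice cream",
    "vanilla quart": "ice cream",
    "ice cream": "frozen desserts",
    "frozen yogurt": "frozen desserts",
    "frozen desserts": "frozen foods",
    "frozen peas": "frozen vegetables",
    "frozen vegetables": "frozen foods",
}

def generalize(item, levels_up):
    """Climb the hierarchy by levels_up steps, stopping at the top."""
    for _ in range(levels_up):
        item = PARENT.get(item, item)
    return item

def roll_up(transactions, levels_up):
    """Replace every item in every basket with its generalized item."""
    return [{generalize(i, levels_up) for i in basket} for basket in transactions]

baskets = [{"Cherry Garcia pint", "vanilla quart", "frozen peas"}]
print(roll_up(baskets, 1))
```

Because each basket is a set, two different ice cream SKUs collapse into a single “ice cream” item after one step up.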

Often, the appropriate level of the hierarchy ends up matching a department with a product-line manager, so using categories has the practical effect of finding interdepartmental relationships. Generalized items also help find rules with sufficient support, because many more transactions contain an item at a higher level of the taxonomy than contain any one item at a lower level.

Just because some items are generalized does not mean that all items need to move up to the same level. The appropriate level depends on the item, on its importance for producing actionable results, and on its frequency in the data.

For instance, in a department store, big-ticket items (such as appliances) might stay at a low level in the hierarchy, while less-expensive items (such as books) might be higher. This hybrid approach is also useful when looking at individual products. Since there are often thousands of products in the data, generalize everything other than the product or products of interest.

TIP Market basket analysis produces the best results when the items occur in roughly the same number of transactions in the data. This helps prevent rules from being dominated by the most common items. Product hierarchies can help here. Roll up rare items to higher levels in the hierarchy, so they become more frequent. More common items may not have to be rolled up at all.
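One way to apply this tip is to roll up only the items that fall below a minimum transaction count, and to repeat until nothing rare is left that can still be rolled up. The sketch below assumes a hypothetical parent mapping and a hypothetical min_count threshold; it is one possible reading of the tip, not a prescribed procedure.

```python
from collections import Counter

def roll_up_rare(transactions, parent, min_count):
    """Roll rare items up one level at a time until each item is frequent
    enough or has no parent left in the hierarchy."""
    baskets = [set(b) for b in transactions]
    while True:
        counts = Counter(item for b in baskets for item in b)
        rare = {i for i, c in counts.items() if c < min_count and i in parent}
        if not rare:
            return baskets
        baskets = [{parent[i] if i in rare else i for i in b} for b in baskets]

# Hypothetical hierarchy and baskets for illustration.
parent = {
    "rocky road": "ice cream",
    "cherry garcia": "ice cream",
    "vanilla": "ice cream",
    "ice cream": "frozen desserts",
}
transactions = [
    {"milk", "rocky road"},
    {"milk", "cherry garcia"},
    {"milk", "vanilla"},
    {"milk"},
]
print(roll_up_rare(transactions, parent, min_count=2))
```

Here each flavor appears in only one basket, so all three are rolled up to “ice cream,” which then appears in three baskets and stays at that level. The common item “milk” is never rolled up.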

Virtual Items Go beyond the Product Hierarchy

The purpose of virtual items is to enable the analysis to take advantage of information that goes beyond the product hierarchy. Virtual items do not appear in the product hierarchy of the original items, because they cross product boundaries. Examples of virtual items might be designer labels such as Calvin Klein that appear in both apparel departments and perfumes, low-fat and no-fat products in a grocery store, and energy-saving options on appliances.

Virtual items may even include information about the transactions themselves, such as whether the purchase was made with cash, a credit card, or check, and the day of the week or the time of the day the transaction occurred.

However, it is not a good idea to crowd the data with too many virtual items. Only include virtual items when you have some idea of how they could result in actionable information if found in well-supported, high-confidence association rules.
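Augmenting a basket with virtual items can be sketched as follows. The ITEM_TAGS table, the "paid:" and "day:" prefixes, and the function name are all hypothetical choices made for illustration.

```python
# Hypothetical tags that cut across the product hierarchy.
ITEM_TAGS = {
    "diet coke": {"diet product", "coke product"},
    "skim milk": {"low-fat product"},
}

def add_virtual_items(basket, item_tags, payment=None, weekday=None):
    """Return a copy of the basket with virtual items added."""
    augmented = set(basket)
    for item in basket:
        augmented.update(item_tags.get(item, ()))
    # Transaction-level facts become virtual items too.
    if payment:
        augmented.add(f"paid:{payment}")
    if weekday:
        augmented.add(f"day:{weekday}")
    return augmented

print(add_virtual_items({"diet coke", "pretzels"}, ITEM_TAGS,
                        payment="cash", weekday="Sat"))
```

Keeping the tag table small is one way to follow the advice above about not crowding the data with too many virtual items.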

There is a danger, though. Virtual items can cause trivial rules. For instance, imagine that there is a virtual item for “diet product” and one for “coke product.” A rule like the following might then appear:

If “coke product” and “diet product” then “diet coke”

That is, wherever “coke product” appears in a basket and “diet product” appears in a basket, “diet coke” also appears. Every basket that has Diet Coke satisfies this rule. Although some baskets may have regular Coke and other diet products, the rule will have high lift because it is the definition of Diet Coke. When using virtual items, it is worth checking and rechecking the rules to be sure that such trivial rules are not arising.
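A small worked example shows why such a rule scores well. The sketch below uses the usual definition of lift, the support of the whole rule divided by the product of the supports of its two sides, on five made-up baskets in which the virtual items have already been added. The baskets and helper names are assumptions for illustration.

```python
from fractions import Fraction

def support(baskets, items):
    """Fraction of baskets that contain every item in items."""
    items = set(items)
    return Fraction(sum(items <= b for b in baskets), len(baskets))

def lift(baskets, antecedent, consequent):
    """Lift of the rule antecedent -> consequent (both sides must occur)."""
    both = support(baskets, set(antecedent) | set(consequent))
    return both / (support(baskets, antecedent) * support(baskets, consequent))

# Made-up baskets, already augmented with virtual items.
BASKETS = [
    {"diet coke", "coke product", "diet product"},
    {"diet coke", "coke product", "diet product", "pretzels"},
    {"coke", "coke product", "diet sprite", "diet product"},
    {"milk"},
    {"bread"},
]

print(lift(BASKETS, {"coke product", "diet product"}, {"diet coke"}))  # 5/3
```

The third basket holds regular Coke plus another diet product, so the rule is not perfect, yet its lift is still well above 1 simply because Diet Coke always carries both virtual items.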

A similar but more subtle danger occurs when the right-hand side does not include the associated item. So, a rule like:

If “coke product” and “diet product” then “pretzels”

probably means,

If “diet coke” then “pretzels”

The only danger from having such rules is that they can obscure what is happening.
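Both kinds of rule can be flagged mechanically when the virtual items are recorded per product. The following is a minimal sketch under that assumption; the ITEM_TAGS table and both function names are hypothetical.

```python
# Hypothetical record of which virtual items each real item carries.
ITEM_TAGS = {
    "diet coke": {"coke product", "diet product"},
    "coke": {"coke product"},
}

def is_trivial(antecedent, consequent, item_tags):
    """True when every left-hand item is a virtual item of the right-hand item."""
    tags = item_tags.get(consequent, set())
    return bool(antecedent) and set(antecedent) <= tags

def items_defined_by(antecedent, item_tags):
    """Real items whose virtual items are exactly the left-hand side."""
    return sorted(i for i, tags in item_tags.items() if tags == set(antecedent))

lhs = {"coke product", "diet product"}
print(is_trivial(lhs, "diet coke", ITEM_TAGS))  # True
print(is_trivial(lhs, "pretzels", ITEM_TAGS))   # False
print(items_defined_by(lhs, ITEM_TAGS))         # ['diet coke']
```

The second check covers the more subtle case: a rule whose left-hand side matches the virtual items of a single product is a candidate for rewriting in terms of that product. The rewrite is only a suggestion, since other baskets can also contain both virtual items.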

TIP When applying market basket analysis, it is useful to have a hierarchical taxonomy of the items being considered for analysis. When the levels of the hierarchy are chosen carefully, the generalized items occur about the same number of times in the data, which improves the results of the analysis. For specific lifestyle-related choices that provide insight into customer behavior, such as sugar-free items and specific brands, augment the data with virtual items.

Data Quality

The data used for market basket analysis is generally not of very high quality. It is gathered directly at the point of customer contact and used mainly for operational purposes such as inventory control. The data is likely to have multiple formats, corrections, incompatible code types, and so on. Much of the explanation of various code values is likely to be buried deep in programming code running in legacy systems and may be difficult to extract. Different stores within a single chain sometimes have slightly different product hierarchies or different ways of handling situations like discounts.

Here is an example. The authors were once curious about the approximately 80 department codes present in a large set of transaction data. The client assured us that there were 40 departments and provided a nice description of each of them. More careful inspection revealed the problem. Some stores had IBM cash registers and others had NCR. The two types of equipment had different ways of representing department codes—hence we saw many invalid codes in the data.
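A check of this kind is easy to automate: count the codes that are missing from the documented list and break them down by a suspected cause. The sketch below assumes each transaction is a dictionary with hypothetical dept_code and register_type fields; the sample codes are invented and do not reflect how either vendor actually encoded departments.

```python
from collections import Counter

def unknown_code_report(transactions, valid_codes):
    """Count undocumented department codes by (register type, code)."""
    report = Counter()
    for t in transactions:
        if t["dept_code"] not in valid_codes:
            report[(t["register_type"], t["dept_code"])] += 1
    return report

# Invented sample data for illustration only.
VALID_CODES = {"07", "12"}
TRANSACTIONS = [
    {"register_type": "IBM", "dept_code": "12"},
    {"register_type": "NCR", "dept_code": "0012"},
    {"register_type": "NCR", "dept_code": "0012"},
    {"register_type": "IBM", "dept_code": "07"},
]

print(unknown_code_report(TRANSACTIONS, VALID_CODES))
```

If the unknown codes cluster under one register type, as they do in this toy data, the problem is likely one of representation and not of missing departments.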
