These kinds of problems are typical when using any sort of data for data mining. However, they are exacerbated for market basket analysis because this type of analysis depends heavily on the unsummarized point-of-sale transactions.
Anonymous versus Identified
Market basket analysis has proven useful for mass-market retail, such as supermarkets, convenience stores, drug stores, and fast food chains, where many of the purchases have traditionally been made with cash. Cash transactions are anonymous, meaning that the store has no knowledge about specific customers because there is no information identifying the customer in the transaction. For anonymous transactions, the only information is the date and time, the location of the store, the cashier, the items purchased, any coupons redeemed, and the amount of change. With market basket analysis, even this limited data can yield interesting and actionable results.
The increasing prevalence of Web transactions, loyalty programs, and purchasing clubs is resulting in more and more identified transactions, providing analysts with more possibilities for information about customers and their behavior over time. Demographic and trending information is available on individuals and households to further augment customer profiles. This additional information can be incorporated into association rule analysis using virtual items.
Generating Rules from All This Data
Calculating the number of times that a given combination of items appears in the transaction data is well and good, but a combination of items is not a rule.
470643 c09.qxd 3/8/04 11:15 AM Page 309
Market Basket Analysis and Association Rules 309
Sometimes, just the combination is interesting in itself, as in the Barbie doll and candy bar example. But in other circumstances, it makes more sense to find an underlying rule of the form:
if condition, then result.
Notice that this is just shorthand. If the rule says,
if Barbie doll, then candy bar
then we read it as: “if a customer purchases a Barbie doll, then the customer is also expected to purchase a candy bar.” The general practice is to consider rules where there is just one item on the right-hand side.
Calculating Confidence
Constructs such as the co-occurrence table provide information about which combinations of items occur most commonly in the transactions. For the sake of illustration, let’s say that the most common combination has three items, A, B, and C. Table 9.5 provides an example, showing the probabilities that items and various combinations are purchased.
The only rules to consider are those with all three items in the rule and with exactly one item in the result:
■■
If A and B, then C
■■
If A and C, then B
■■
If B and C, then A
Because these three rules contain the same items, they have the same support in the data, 5 percent. What about their confidence level? Confidence is the ratio of the number of transactions with all the items in the rule to the number of transactions with just the items in the condition. The confidence for the three rules is shown in Table 9.6.
Table 9.5 Probabilities of Three Items and Their Combinations COMBINATION
PROBABILITY
A
45.0 %
B
42.5%
C
40.0%
A and B
25.0 %
A and C
20.0 %
B and C
15.0%
A and B and C
5.0%
470643 c09.qxd 3/8/04 11:15 AM Page 310
310 Chapter 9
Table 9.6 Confidence in Rules
P(CONDITION
RULE
P(CONDITION)
AND RESULT)
CONFIDENCE
If A
25%
5%
0.20
and B
then C
If A
20%
5%
0.25
and C
then B
If B
15%
5%
0.33
and C
then A
What is confidence really saying? Saying that the rule “if B and C then A” has a confidence of 0.33 is equivalent to saying that when B and C appear in a transaction, there is a 33 percent chance that A also appears in it. That is, one time in three A occurs with B and C, and the other two times, B and C appear without A. The most confident rule is the best rule, so the best rule is “if B and C then A. ”
Calculating Lift
As described earlier, lift is a good measure of how much better the rule is doing. It is the ratio of the density of the target (using the left hand side of the rule) to density of the target overall. So the formula is:
lift = (p(condition and result) / p (condition) ) / p(result)
= p(condition and result) / (p(condition) p(result))
When lift is greater than 1, then the resulting rule is better at predicting the result than guessing whether the resultant item is present based on item frequencies in the data. When lift is less than 1, the rule is doing worse than informed guessing. The following table (Table 9.7) shows the lift for the three rules and for the rule with the best lift.
None of the rules with three items shows improved lift. The best rule in the data actually only has two items. When “A” is purchased, then “B” is 31 percent more likely to be in the transaction than if “A” is not purchased. In this case, as in many cases, the best rule actually contains fewer items than other rules being considered.
470643 c09.qxd 3/8/04 11:15 AM Page 311
Market Basket Analysis and Association Rules 311
Table 9.7 Lift Measurements for Four Rules
RULE
SUPPORT
CONFIDENCE
P(RESULT)
LIFT
If A
5%
0.20
40%
0.50
and B
then C
If A
5%
0.25
42.5%
0.59
and C
then B
If B
5%
0.33
45%
0.74
and C
then A
If A
25%
0.59
42.5%
1.31
then B
The Negative Rule
When lift is less than 1, negating the result produces a better rule. If the rule if B and C then A
has a confidence of 0.33, then the rule
if B and C then NOT A
has a confidence of 0.67. Since A appears in 45 percent of the transactions, it does NOT occur in 55 percent of them. Applying the same lift measure shows that the lift of this new rule is 1.22 (0.67/0.55), resulting in a lift of 1.33, better than any of the other rules.
Overcoming Practical Limits
Generating association rules is a multistep process. The general algorithm is: 1. Generate the co-occurrence matrix for single items.
2. Generate the co-occurrence matrix for two items. Use this to find rules with two items.
3. Generate the co-occurrence matrix for three items. Use this to find rules with three items.
4. And so on.
470643 c09.qxd 3/8/04 11:15 AM Page 312
312 Chapter 9
For instance, in the grocery store that sells orange juice, milk, detergent, soda, and window cleaner, the first step calculates the counts for each of these items. During the second step, the following counts are created:
■■
Milk and detergent, milk and soda, milk and cleaner
■■
Detergent and soda, detergent and cleaner
■■
Soda and cleaner
This is a total of 10 pairs of items. The third pass takes all combinations of three items and so on. Of course, each of these stages may require a separate pass through the data or multiple stages can be combined into a single pass by considering different numbers of combinations at the same time.
Although it is not obvious when there are just five items, increasing the number of items in the combinations requires exponentially more computation. This results in exponentially growing run times—and long, long waits when considering combinations with more than three or four items. The solution is pruning. Pruning is a technique for reducing the number of items and combinations of items being considered at each step. At each stage, the algorithm throws out a certain number of combinations that do not meet some threshold criterion.
The most common pruning threshold is called minimum support pruning.
Support refers to the number of transactions in the database where the rule TEAMFLY
holds. Minimum support pruning requires that a rule hold on a minimum number of transactions. For instance, if there are one million transactions and the minimum support is 1 percent, then only rules supported by 10,000 transactions are of interest. This makes sense, because the purpose of generating these rules is to pursue some sort of action—such as striking a deal with Mattel (the makers of Barbie dolls) to make a candy-bar-eating doll—and the action must affect enough transactions to be worthwhile.
The minimum support constraint has a cascading effect. Consider a rule with four items in it:
if A, B, and C, then D.
Using minimum support pruning, this rule has to be true on at least 10,000
transactions in the data. It follows that:
A must appear in at least 10,000 transactions, and,
B must appear in at least 10,000 transactions, and,
C must appear in at least 10,000 transactions, and,
D must appear in at least 10,000 transactions.
Team-Fly®
470643 c09.qxd 3/8/04 11:15 AM Page 313
Market Basket Analysis and Association Rules 313
In other words, minimum support pruning eliminates items that do not appear in enough transactions. The threshold criterion applies to each step in the algorithm. The minimum threshold also implies that: