A and B must appear together in at least 10,000 transactions, and, A and C must appear together in at least 10,000 transactions, and, A and D must appear together in at least 10,000 transactions, and so on.
Each step of the calculation of the co-occurrence table can eliminate combinations of items that do not meet the threshold, reducing its size and the number of combinations to consider during the next pass.
Figure 9.11 is an example of how the calculation takes place. In this example, choosing a minimum support level of 10 percent would eliminate all the combinations with three items—and their associated rules—from consideration.
This is an example where pruning does not have an effect on the best rule since the best rule has only two items. In the case of pizza, these toppings are all fairly common, so are not pruned individually. If anchovies were included in the analysis—and there are only 15 pizzas containing them out of the 2,000—
then a minimum support of 10 percent, or even 1 percent, would eliminate anchovies during the first pass.
The best choice for minimum support depends on the data and the situation. It is also possible to vary the minimum support as the algorithm progresses. For instance, using different levels at different stages you can find uncommon combinations of common items (by decreasing the support level for successive steps) or relatively common combinations of uncommon items (by increasing the support level).
The Problem of Big Data
A typical fast food restaurant offers several dozen items on its menu, say 100.
To use probabilities to generate association rules, counts have to be calculated for each combination of items. The number of combinations of a given size tends to grow exponentially. A combination with three items might be a small fries, cheeseburger, and medium Diet Coke. On a menu with 100 items, how many combinations are there with three different menu items? There are 161,700! This calculation is based on the binomial formula On the other hand, a typical supermarket has at least 10,000 different items in stock, and more typically 20,000 or 30,000.
470643 c09.qxd 3/8/04 11:15 AM Page 314
314
Chapter 9
A pizza restaurant has sold 2000 pizzas, of which:
100 are mushroom only, 150 are pepperoni, 200 are extra cheese
400 are mushroom and pepperoni, 300 are mushroom and extra cheese, 200 are pepperoni and extra cheese 100 are mushroom, pepperoni, and extra cheese.
550 have no extra toppings.
We need to calculate the probabilities for all possible combinations of items.
100 + 400 + 300 + 100 = 900 pizzas or 45%
Mushroom and pepperoni
The works
Just mushroom
Mushroom and extra cheese
150 + 400 + 200 + 100 = 850 pizzas or 42.5%
200 + 300 + 200 + 100 = 800 pizzas or 40%
400 + 100 = 500 pizzas or 25%
300 + 100 = 400 pizzas or 20%
200 + 100 = 300 pizzas or 15%
100 pizzas or 5%
There are three rules with all three items:
Support = 5%
Confidence = 5% divided by 25% = 0.2
Lift = 20%(100/500) divided by 40%(800/2000) = 0.5
Support = 5%
Confidence = 5% divided by 20% = 0.25
Lift = 25%(100/400) divided by 42.5%(850/2000) = 0.588
Support = 5%
Confidence = 5% divided by 15% = 0.333
Lift = 33.3%(100/300) divided by 45%(900/2000) = 0.74
Support = 25%
The best rule has
Confidence = 25% divided by 42.5% = 0.588
only two items:
Lift = 55.6%(500/900) divided by 43.5%(200/850) = 1.31
Figure 9.11 This example shows how to count up the frequencies on pizza sales for market basket analysis.
Calculating the support, confidence, and lift quickly gets out of hand as the number of items in the combinations grows. There are almost 50 million possible combinations of two items in the grocery store and over 100 billion combinations of three items. Although computers are getting more powerful and
470643 c09.qxd 3/8/04 11:15 AM Page 315
Market Basket Analysis and Association Rules 315
cheaper, it is still very time-consuming to calculate the counts for this number of combinations. Calculating the counts for five or more items is prohibitively expensive. The use of product hierarchies reduces the number of items to a manageable size.
The number of transactions is also very large. In the course of a year, a decent-size chain of supermarkets will generate tens or hundreds of millions of transactions. Each of these transactions consists of one or more items, often several dozen at a time. So, determining if a particular combination of items is present in a particular transaction may require a bit of effort—multiplied a million-fold for all the transactions.
Extending the Ideas
The basic ideas of association rules can be applied to different areas, such as comparing different stores and making some enhancements to the definition of the rules. These are discussed in this section.
Using Association Rules to Compare Stores
Market basket analysis is commonly used to make comparisons between locations within a single chain. The rule about toilet bowl cleaner sales in hardware stores is an example where sales at new stores are compared to sales at existing stores. Different stores exhibit different selling patterns for many reasons: regional trends, the effectiveness of management, dissimilar advertising, and varying demographic patterns in the catchment area, for example. Air conditioners and fans are often purchased during heat waves, but heat waves affect only a limited region. Within smaller areas, demographics of the catchment area can have a large impact; we would expect stores in wealthy areas to exhibit different sales patterns from those in poorer neighborhoods. These are examples where market basket analysis can help to describe the differences and serve as an example of using market basket analysis for directed data mining.
How can association rules be used to make these comparisons? The first step is augmenting the transactions with virtual items that specify which group, such as an existing location or a new location, that the transaction comes from. Virtual items help describe the transaction, although the virtual item is not a product or service. For instance, a sale at an existing hardware store might include the following products:
■■
A hammer
■■
A box of nails
■■
Extra-fine sandpaper
470643 c09.qxd 3/8/04 11:15 AM Page 316
316 Chapter 9
T I P Adding virtual transactions in to the market basket data makes it possible to find rules that include store characteristics and customer characteristics.
After augmenting the data to specify where it came from, the transaction looks like:
a hammer,
a box of nails,
extra fine sandpaper,
“at existing hardware store.”
To compare sales at store openings versus existing stores, the process is: 1. Gather data for a specific period (such as 2 weeks) from store openings.
Augment each of the transactions in this data with a virtual item saying that the transaction is from a store opening.
2. Gather about the same amount of data from existing stores. Here you might use a sample across all existing stores, or you might take all the data from stores in comparable locations. Augment the transactions in this data with a virtual item saying that the transaction is from an existing store.
3. Apply market basket analysis to find association rules in each set.
4. Pay particular attention to association rules containing the virtual items.
Because association rules are undirected data mining, the rules act as starting points for further hypothesis testing. Why does one pattern exist at existing stores and another at new stores? The rule about toilet bowl cleaners and store openings, for instance, suggests looking more closely at toilet bowl cleaner sales in existing stores at different times during the year.
Using this technique, market basket analysis can be used for many other types of comparisons:
■■
Sales during promotions versus sales at other times
■■
Sales in various geographic areas, by county, standard statistical metropolitan area (SSMA), direct marketing area (DMA), or country
■■
Urban versus suburban sales
■■
Seasonal differences in sales patterns
Adding virtual items to each basket of goods enables the standard association rule techniques to make these comparisons.
470643 c09.qxd 3/8/04 11:15 AM Page 317
Market Basket Analysis and Association Rules 317
Dissociation Rules
A dissociation rule is similar to an association rule except that it can have the connector “and not” in the condition in addition to “and.” A typical dissociation rule looks like: if A and not B, then C.
Dissociation rules can be generated by a simple adaptation of the basic market basket analysis algorithm. The adaptation is to introduce a new set of items that are the inverses of each of the original items. Then, modify each transaction so it includes an inverse item if, and only if, it does not contain the original item.
For example, Table 9.8 shows the transformation of a few transactions. The ¬
before the item denotes the inverse item.
There are three downsides to including these new items. First, the total number of items used in the analysis doubles. Since the amount of computation grows exponentially with the number of items, doubling the number of items seriously degrades performance. Second, the size of a typical transaction grows because it now includes inverted items. The third issue is that the frequency of the inverse items tends to be much larger than the frequency of the original items. So, minimum support constraints tend to produce rules in which all items are inverted, such as: