if NOT A and NOT B then NOT C.
These rules are less likely to be actionable.
Sometimes it is useful to invert only the most frequent items in the set used for analysis. This is particularly valuable when the frequency of some of the original items is close to 50 percent, so the frequencies of their inverses are also close to 50 percent.
Table 9.8 Transformation of Transactions to Generate Dissociation Rules CUSTOMER
ITEMS
CUSTOMER
WITH INVERSE ITEMS
1
{A, B, C}
1
{A, B, C}
2
{A}
2
{A, ¬B, ¬C}
3
{A, C}
3
{A, ¬B, C}
4
{A}
4
{A, ¬B, ¬C}
5
{}
5
{¬A, ¬B, ¬C}
470643 c09.qxd 3/8/04 11:15 AM Page 318
318 Chapter 9
Sequential Analysis Using Association Rules
Association rules find things that happen at the same time—what items are purchased at a given time. The next natural question concerns sequences of events and what they mean. Examples of results in this area are:
■■
New homeowners purchase shower curtains before purchasing furniture.
■■
Customers who purchase new lawnmowers are very likely to purchase a new garden hose in the following 6 weeks.
■■
When a customer goes into a bank branch and asks for an account reconciliation, there is a good chance that he or she will close all his or her accounts.
Time-series data usually requires some way of identifying the customer over time. Anonymous transactions cannot reveal that new homeowners buy shower curtains before they buy furniture. This requires tracking each customer, as well as knowing which customers recently purchased a home. Since larger purchases are often made with credit cards or debit cards, this is less of a problem. For problems in other domains, such as investigating the effects of medical treatments or customer behavior inside a bank, all transactions typically include identity information.
WA R N I N G In order to consider time-series analyses on your customers, there has to be some way of identifying customers. Without a way of tracking individual customers, there is no way to analyze their behavior over time.
For the purposes of this section, a time series is an ordered sequence of items.
It differs from a transaction only in being ordered. In general, the time series contains identifying information about the customer, since this information is used to tie the different transactions together into a series. Although there are many techniques for analyzing time series, such as ARIMA (a statistical technique) and neural networks, this section discusses only how to manipulate the time-series data to apply the market basket analysis.
In order to use time series, the transaction data must have two additional features:
■■ A timestamp or sequencing information to determine when transactions occurred relative to each other
■■ Identifying information, such as account number, household ID, or customer ID that identifies different transactions as belonging to the same customer or household (sometimes called an economic marketing unit)
470643 c09.qxd 3/8/04 11:15 AM Page 319
Market Basket Analysis and Association Rules 319
Building sequential rules is similar to the process of building association rules:
1. All items purchased by a customer are treated as a single order, and each item retains the timestamp indicating when it was purchased.
2. The process is the same for finding groups of items that appear together.
3. To develop the rules, only rules where the items on the left-hand side were purchased before items on the right-hand side are considered.
The result is a set of association rules that can reveal sequential patterns.
Lessons Learned
Market basket data describes what customers purchase. Analyzing this data is complex, and no single technique is powerful enough to provide all the answers. The data itself typically describes the market basket at three different levels. The order is the event of the purchase; the line-items are the items in the purchase, and the customer connects orders together over time.
Many important questions about customer behavior can be answered by looking at product sales over time. Which are the best selling items? Which items that sold well last year are no longer selling well this year? Inventory curves do not require transaction level data. Perhaps the most important insight they provide is the effect of marketing interventions—did sales go up or down after a particular event?
However, inventory curves are not sufficient for understanding relationships among items in a single basket. One technique that is quite powerful is association rules. This technique finds products that tend to sell together in groups. Sometimes is the groups are sufficient for insight. Other times, the groups are turned into explicit rules—when certain items are present then we expect to find certain other items in the basket.
There are three measures of association rules. Support tells how often the rule is found in the transaction data. Confidence says how often when the “if”
part is true that the “then” part is also true. And, lift tells how much better the rule is at predicting the “then” part as compared to having no rule at all.
The rules so generated fall into three categories. Useful rules explain a relationship that was perhaps unexpected. Trivial rules explain relationships that are known (or should be known) to exist. And inexplicable rules simply do not make sense. Inexplicable rules often have weak support.
470643 c09.qxd 3/8/04 11:15 AM Page 320
320 Chapter 9
Market basket analysis and association rules provide ways to analyze item-level detail, where the relationships between items are determined by the baskets they fall into. In the next chapter, we’ll turn to link analysis, which generalizes the ideas of “items” linked by “relationships,” using the background of an area of mathematics called graph theory.
470643 c10.qxd 3/8/04 11:16 AM Page 321
C H A P T E R
10
Link Analysis
The international route maps of British Airways and Air France offer more than just trip planning help. They also provide insights into the history and politics of their respective homelands and of lost empires. A traveler bound from New York to Mombasa changes planes at Heathrow; one bound for Abidjan changes at Charles de Gaul. The international route maps show how much information can be gained from knowing how things are connected.
Which Web sites link to which other ones? Who calls whom on the telephone? Which physicians prescribe which drugs to which patients? These relationships are all visible in data, and they all contain a wealth of information that most data mining techniques are not able to take direct advantage of.
In our ever-more-connected world (where, it has been claimed, there are no more than six degrees of separation between any two people on the planet), understanding relationships and connections is critical. Link analysis is the data mining technique that addresses this need.
Link analysis is based on a branch of mathematics called graph theory. This chapter reviews the key notions of graphs, then shows how link analysis has been applied to solve real problems. Link analysis is not applicable to all types of data nor can it solve all types of problems. However, when it can be used, it 321
470643 c10.qxd 3/8/04 11:16 AM Page 322
322 Chapter 10
often yields very insightful and actionable results. Some areas where it has yielded good results are:
■■
Identifying authoritative sources of information on the World Wide Web by analyzing the links between its pages
■■
Analyzing telephone call patterns to identify particular market segments such as people working from home
■■
Understanding physician referral patterns; a referral is a relationship between two physicians, once again, naturally susceptible to link analysis Even where links are explicitly recorded, assembling them into a useful graph can be a data-processing challenge. Links between Web pages are encoded in the HTML of the pages themselves. Links between telephones are recorded in call detail records. Neither of these data sources is useful for link analysis without considerable preprocessing, however. In other cases, the links are implicit and part of the data mining challenge is to recognize them.
The chapter begins with a brief introduction to graph theory and some of the classic problems that it has been used to solve. It then moves on to applications in data mining such as search engine rankings and analysis of call detail records.
TEAMFLY
Basic Graph Theory
Graphs are an abstraction developed specifically to represent relationships.
They have proven very useful in both mathematics and computer science for developing algorithms that exploit these relationships. Fortunately, graphs are quite intuitive, and there is a wealth of examples that illustrate how to take advantage of them.
A graph consists of two distinct parts:
■■ Nodes (sometimes called vertices) are the things in the graph that have relationships. These have names and often have additional useful properties.
■■ Edges are pairs of nodes connected by a relationship. An edge is represented by the two nodes that it connects, so (A, B) or AB represents the edge that connects A and B. An edge might also have a weight in a weighted graph.