Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management

Figure 10.1 illustrates two graphs. The graph on the left has four nodes connected by six edges and has the property that there is an edge between every pair of nodes. Such a graph is said to be fully connected. It could be representing daily flights between Atlanta, New York, Cincinnati, and Salt Lake City on an airline where these four cities serve as regional hubs. It could also represent Team-Fly®

470643 c10.qxd 3/8/04 11:16 AM Page 323

Link Analysis 323

four people, all of whom know each other, or four mutually related leads for a criminal investigation. The graph on the right has one node in the center connected to four other nodes. This could represent daily flights connecting Atlanta to Birmingham, Greenville, Charlotte, and Savannah on an airline that serves the Southeast from a hub in Atlanta, or a restaurant frequented by four credit card customers. The graph itself captures the information about what is connected to what. Without any labels, it can describe many different situations. This is the power of abstraction.

A few points of terminology about graphs. Because graphs are so useful for visualizing relationships, it is nice when the nodes and edges can be drawn with no intersecting edges. The graphs in Figure 10.2 have this property. They are planar graphs, since they can be drawn on a sheet of paper (what mathematicians call a plane) without having any edges intersect. Figure 10.2 shows two graphs that cannot be drawn without having at least two edges cross.

There is, in fact, a theorem in graph theory that says that if a graph is nonplanar, then lurking inside it is one of the two previously described graphs.

When a path exists between any two nodes in a graph, the graph is said to be connected. For the rest of this chapter, we assume that all graphs are connected, unless otherwise specified. A path, as its name implies, is an ordered sequence of nodes connected by edges. Consider a graph where each node represents a city, and the edges are flights between pairs of cities. On such a graph, a node is a city and an edge is a flight segment—two cities that are connected by a nonstop flight. A path is an itinerary of flight segments that go from one city to another, such as from Greenville, South Carolina to Atlanta, from Atlanta to Chicago, and from Chicago to Peoria.

A fully connected graph with

A graph with five nodes

four nodes and six edges. In

and four edges.

a fully connected graph, there

is an edge between every pair

of nodes.

Figure 10.1 Two examples of graphs.

470643 c10.qxd 3/8/04 11:16 AM Page 324

324 Chapter 10

Oops! These edges

intersect.

Three nodes cannot connect

A fully-connected graph

to three other nodes without

with five nodes must also

two edges crossing over

have edges that intersect.

each other.

Figure 10.2 Not all graphs can be drawn without having some edges cross over each other.

Figure 10.3 is an example of a weighted graph, one in which the edges have weights associated with them. In this case, the nodes represent products purchased by customers. The weights on the edges represent the support for the association, the percentage of market baskets containing both products. Such graphs provide an approach for solving problems in market basket analysis and are also a useful means of visualizing market basket data. This product association graph is an example of an undirected graph. The graph shows that 22.12 percent of market baskets at this health food grocery contain both yellow peppers and bananas. By itself, this does not explain whether yellow pepper sales drive banana sales or vice versa, or whether something else drives the purchase of all yellow fruits and vegetables.

One very common problem in link analysis is finding the shortest path between two nodes. Which is shortest, though, depends on the weights assigned to the edges. Consider the graph of flights between cities. Does shortest refer to distance? To the fewest number of flight segments? To the shortest flight time? Or to the least expensive? All these questions are answered the same way using graphs—the only difference is the weights on the edges.

The following two sections describe two classic problems in graph theory that illustrate the power of graphs to represent and solve problems. Few data mining problems are exactly like these two problems, but the problems give a flavor of how the simple construction of graphs leads to some interesting solutions. They are presented to familiarize the reader with graphs by providing examples of key concepts in graph theory and to provide a stronger basis for discussing link analysis.

470643 c10.qxd 3/8/04 11:16 AM Page 325

Link Analysis 325

Bananas

Red Leaf

8

.5

Vine Tomatoes

8

3.25

Red Peppers

3.7

Organic Peaches

63

6.

5.21

3.35

Yellow Peppers

Floral

3.68

7.32

Kitchen

8

2

3.

Salad Mix

6.

4

Organic Broccoli

Body Care

Figure 10.3 This is an example of a weighted graph where the edge weights are the number of transactions containing the items represented by the nodes at either end.

Seven Bridges of Königsberg

One of the earliest problems in graph theory originated with a simple challenge posed in the eighteenth century by the Swiss mathematician Leonhard Euler. As shown in the simple map in Figure 10.4, Königsberg had two islands in the Pregel River connected to each other and to the rest of the city by a total of seven bridges. On either side of the river or on the islands, it is possible to get to any of the bridges. Figure 10.4 shows one path through the town that crosses over five bridges exactly once. Euler posed the question: Is it possible to walk over all seven bridges exactly once, starting from anywhere in the city, without getting wet or using a boat? As an historical note, the problem has survived longer than the name of the city. In the eighteenth century, Königsberg was a prominent Prussian city on the Baltic Sea nestled between Lithuania and Poland. Now, it is known as Kaliningrad, the westernmost Russian enclave, separated from the rest of Russia by Lithuania and Belarus.

In order to solve this problem, Euler invented the notation of graphs. He represented the map of Königsberg as the simple graph with four vertices and seven edges in Figure 10.5. Some pairs of nodes are connected by more than one edge, indicating that there is more than one bridge between them. Finding a route that traverses all the bridges in Königsberg exactly one time is equivalent to finding a path in the graph that visits every edge exactly once. Such a path is called an Eulerian path in honor of the mathematician who posed and solved this problem.

470643 c10.qxd 3/8/04 11:16 AM Page 326

326 Chapter 10

A

Pregel River

C

D

N

B

W

E

S

Figure 10.4 The Pregel River in Königsberg has two islands connected by a total of seven bridges.

A

AC1

A

AD

C2

C

CD

D

BC2

B

BD

C

1

B

Figure 10.5 This graph represents the layout of Königsberg. The edges are bridges and the nodes are the riverbanks and islands.

470643 c10.qxd 3/8/04 11:16 AM Page 327

Link Analysis 327

WHY DO THE DEGREES HAVE TO BE EVEN?

Showing that an Eulerian path exists only when the degrees on all nodes are even (except at most two) rests on a simple observation. This observation is about paths in the graph. Consider one path through the bridges: A → C → B →C →D

The edges being used are:

AC →

1

BC1

BC2

CD

The edges connecting the intermediate nodes in the path come in pairs. That is, there is an outgoing edge for every incoming edge. For instance, node C has four edges visiting it, and node B has two. Since the edges come in pairs, each intermediate node has an even number of edges in the path. Since an Eulerian path contains all edges in the graph and visits all the nodes, such a path exists only when all the nodes in the graph (minus the two end nodes) can serve as intermediate nodes for the path. This is another way of saying that the degree of those nodes is even.

Euler also showed that the opposite is true. When all the nodes in a graph (save at most two) have an even degree, then an Eulerian path exists. This proof is a bit more complicated, but the idea is rather simple. To construct an Eulerian path, start at any node (even one with an odd degree) and move to any other connected node which has an even degree. Remove the edge just traversed from the graph and make it the first edge in the Eulerian path. Now, the problem is to find an Eulerian path starting at the second node in the graph. By keeping track of the degrees of the nodes, it is possible to construct such a path when there are at most two nodes whose degree is odd.

Pages: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154

Leave a Reply 0

Your email address will not be published. Required fields are marked *