Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management

Euler devised a solution based on the number of edges going into or out of each node in the graph. The number of such edges is called the degree of a node. For instance, in the graph representing the seven bridges of Königsberg, the nodes representing the shores both have a degree of three—corresponding to the fact that there are three bridges connecting each shore to the islands. The other two nodes, representing the islands, have degrees of 5 and 3. Euler showed that an Eulerian path exists only when the degrees of all the nodes in a graph are even, except at most two (see technical aside). So, there is no way to walk over the seven bridges of Königsberg without traversing a bridge more than once, since there are four nodes whose degrees are odd.

Traveling Salesman Problem

A more recent problem in graph theory is the “Traveling Salesman Problem.”

In this problem, a salesman needs to visit customers in a set of cities. He plans on flying to one of the cities, renting a car, visiting the customer there, then driving to each of other cities to visit each of the rest of his customers. He

470643 c10.qxd 3/8/04 11:16 AM Page 328

328 Chapter 10

leaves the car in the last city and flies home. There are many possible routes that the salesman can take. What route minimizes the total distance that he travels while still allowing him to visit each city exactly one time?

The Traveling Salesman Problem is easily reformulated using graphs, since graphs are a natural representation of cities connected by roads. In the graph representing this problem, the nodes are cities and each edge has a weight corresponding to the distance between the two cities connected by the edge. The Traveling Salesman Problem therefore is asking: “What is the shortest path that visits all the nodes in a graph exactly one time?” Notice that this problem is different from the Seven Bridges of Königsberg. We are not interested in simply finding a path that visits all nodes exactly once, but of all possible paths we want the shortest one. Notice that all Eulerian paths have exactly the same length, since they contain exactly the same edges. Asking for the shortest Eulerian path does not make sense.

Solving the Traveling Salesman Problem for three or four cities is not difficult. The most complicated graph with four nodes is a completely connected graph where every node in the graph is connected to every other node. In this graph, 24 different paths visit each node exactly once. To count the number of paths, start at any of nodes (there are four possibilities), then go to any of the other three remaining ones, then to any of the other two, and finally to the last node (4 * 3 * 2 * 1 = 4! = 24). A completely connected graph with n nodes has n!

( n factorial) distinct paths that contain all nodes. Each path has a slightly different collection of edges, so their lengths are usually different. Since listing the 24 possible paths is not that hard, finding the shortest path is not particularly difficult for this simple case.

The problem of finding the shortest path connecting nodes was first investigated by the Irish mathematician Sir William Rowan Hamilton. His study of minimizing energy in physical systems led him to investigate minimizing energy in certain discrete systems that he represented as graphs. In honor of him, a path that visits all nodes in a graph exactly once is called a Hamiltonian path.

The Traveling Salesman Problem is difficult to solve. Any solution must consider all of the possible paths through the graph in order to determine which one is the shortest. The number of paths in a completely connected graph grows very fast—as a factorial. What is true for completely connected graphs is true for graphs in general: The number of possible paths visiting all the nodes grows like an exponential function of the number of nodes (although there are a few simple graphs where this is not true). So, as the number of cities increases, the effort required to find the shortest path grows exponentially. Adding just one more city (with associated roads) can result in a solution that takes twice as long—or more—to find.

470643 c10.qxd 3/8/04 11:16 AM Page 329

Link Analysis 329

This lack of scalability is so important that mathematicians have given it a name: NP—where NP means that all known algorithms used to solve the problem scale exponentially— n ot like a p olynomial. These problems are considered difficult. In fact, the Traveling Salesman Problem is so difficult that it is used for evaluating parallel computers and exotic computing methods—such as using DNA or the mysteries of quantum physics as the basis of computers instead of the more familiar computer chips made of silicon.

All of this graph theory aside, there are pretty good heuristic algorithms for computers that provide reasonable solutions to the Traveling Salesman Problem. The resulting paths are relatively short paths, although they are not guaranteed to be as short as the shortest possible one. This is a useful fact if you have a similar problem. One common algorithm is the greedy algorithm: start the path with the shortest edge in the graph, then lengthen the path with the shortest edge available at either end that visits a new node. The resulting path is generally pretty short, although not necessarily the shortest (see Figure 10.6).

T I P Often it is better to use an algorithm that yields good, but not perfect results, instead of trying to analyze the difficulty of arriving at the ideal solution or giving up because there is no guarantee of finding an optimal solution. As Voltaire remarked, “Le mieux est l’ennemi du bien.” (The best is the enemy of the good.)

A

18

12

2

B

2

C

1

D

9

E

11

Figure 10.6 In this graph, the shortest path (ABCDE) has a length of 24, but the greedy algorithm finds a much longer path (CDBEA).

470643 c10.qxd 3/8/04 11:16 AM Page 330

330 Chapter 10

Directed Graphs

The graphs discussed so far are undirected. In undirected graphs, the edges are like expressways between nodes: they go in both directions. In a directed graph, the edges are like one-way roads. An edge going from A to B is distinct from an edge going from B to A. A directed edge from A to B is an outgoing edge of A and an incoming edge of B.

Directed graphs are a powerful way of representing data:

■■

Flight segments that connect a set of cities

■■

Hyperlinks between Web pages

■■

Telephone calling patterns

■■

State transition diagrams

Two types of nodes are of particular interest in directed graphs. All the edges connected to a source node are outgoing edges. Since there are no incoming edges, no path exists from any other node in the graph to any of the source nodes. When all the edges on a node are incoming edges, the node is called a sink node. The existence of source nodes and sink nodes is an important difference between directed graphs and their undirected cousins.

An important property of directed graphs is whether the graph contains any paths that start and end at the same vertex. Such a path is called a cycle, implying that the path could repeat itself endlessly: ABCABCABC and so on. If a directed graph contains at least one cycle, it is called cyclic. Cycles in a graph of flight segments, for instance, might be the path of a single airplane. In a call graph, members of a cycle call each other—these are good candidates for a

“friends and family–style” promotion, where the whole group gets a discount, or for marketing conference call services.

Detecting Cycles in a Graph

There is a simple algorithm to detect whether a directed graph has any cycles.

This algorithm starts with the observation that if a directed graph has no sink vertices, and it has at least one edge, then any path can be extended arbitrarily.

Without any sink vertices, the terminating node of a path is always connected to another node, so the path can be extended by appending that node. Similarly, if the graph has no source nodes, then we can always prepend a node to the beginning of the path. Once the path contains more nodes than there are nodes in the graph, we know that the path must visit at least one node twice.

Call this node X. The portion of the path between the first X and the second X

in the path is a cycle, so the graph is cyclic.

Now consider the case when a graph has one or more source nodes and one or more sink nodes. It is pretty obvious that source nodes and sink nodes

470643 c10.qxd 3/8/04 11:16 AM Page 331

Link Analysis 331

cannot be part of a cycle. Removing the source and sink nodes from the graph, along with all their edges, does not affect whether the graph is cyclic. If the resulting graph has no sink nodes or no source nodes, then it contains a cycle, as just shown. The process of removing sink nodes, source nodes, and their edges is repeated until one of the following occurs:

Pages: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154

Leave a Reply 0

Your email address will not be published. Required fields are marked *