
■ No more edges or no more nodes are left. In this case, the graph has no cycles.

■ Some edges remain, but there are no source or sink nodes. In this case, the graph is cyclic.

If no cycles exist, then the graph is called an acyclic graph. These graphs are useful for describing dependencies or one-way relationships between things.

For instance, different products often belong to nested hierarchies that can be represented by acyclic graphs. The decision trees described in Chapter 6 are another example.

In an acyclic graph, any two nodes have a well-defined precedence relationship with each other. If node A precedes node B in some path that contains both A and B, then A will precede B in all paths containing both A and B (otherwise there would be a cycle). In this case, we say that A is a predecessor of B and that B is a successor of A. If no paths contain both A and B, then A and B are disjoint.

This strict ordering can be an important property of the nodes and is sometimes useful for data mining purposes.
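
To make this concrete, here is a minimal Python sketch of the procedure described above, restricted to removing source nodes (which is sufficient for detecting cycles): if every node can be removed, the graph is acyclic and the removal order is a precedence ordering in which every predecessor comes before all of its successors. The graph in the example is a hypothetical product hierarchy, not data from the text.

    from collections import defaultdict, deque

    def precedence_order(edges):
        """Repeatedly remove source nodes (nodes with no incoming edges).

        Returns the removal order if the graph is acyclic; returns None if
        edges remain but no source node is left, meaning the graph is cyclic.
        """
        successors = defaultdict(set)
        incoming = defaultdict(int)
        nodes = set()
        for a, b in edges:
            nodes.update((a, b))
            if b not in successors[a]:
                successors[a].add(b)
                incoming[b] += 1

        sources = deque(n for n in nodes if incoming[n] == 0)
        order = []
        while sources:
            node = sources.popleft()
            order.append(node)
            for succ in successors[node]:
                incoming[succ] -= 1
                if incoming[succ] == 0:
                    sources.append(succ)
        return order if len(order) == len(nodes) else None

    # A hypothetical product hierarchy (acyclic): a valid ordering is returned.
    print(precedence_order([("clothing", "shirts"), ("clothing", "pants"),
                            ("shirts", "dress shirts")]))
    # A cycle A -> B -> C -> A: no source node exists, so None is returned.
    print(precedence_order([("A", "B"), ("B", "C"), ("C", "A")]))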

A Familiar Application of Link Analysis

Most readers of this book have probably used the Google search engine. Its phenomenal popularity stems from its ability to help people find reasonably good material on pretty much any subject. This feat is accomplished through link analysis.

The World Wide Web is a huge directed graph. The nodes are Web pages and the edges are the hyperlinks between them. Special programs called spiders or web crawlers are continually traversing these links to update maps of the huge directed graph that is the web. Some of these spiders simply index the content of Web pages for use by purely text-based search engines. Others record the Web’s global structure as a directed graph that can be used for analysis.
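
As a concrete illustration (not a description of any particular spider), the following Python sketch crawls outward from a seed page using only the standard library, recording nodes (URLs) and directed edges (hyperlinks) as an adjacency list. Real crawlers also deal with robots.txt, politeness delays, duplicate detection, and enormous scale, none of which is shown here.

    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href targets of <a> tags found on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_url, max_pages=10):
        """Breadth-first traversal recording part of the Web as a directed graph."""
        graph = {}                      # page -> list of pages it links to
        frontier = [seed_url]
        while frontier and len(graph) < max_pages:
            url = frontier.pop(0)
            if url in graph:
                continue
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except (OSError, ValueError):
                graph[url] = []         # unreachable page: a node with no out-edges
                continue
            parser = LinkExtractor()
            parser.feed(html)
            out_links = [urljoin(url, href) for href in parser.links
                         if not href.startswith(("mailto:", "javascript:", "#"))]
            graph[url] = out_links
            frontier.extend(out_links)
        return graph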

Once upon a time, search engines analyzed only the nodes of this graph. Text from a query was compared with text from the Web pages using techniques similar to those described in Chapter 8. Google’s approach (which has now been adopted by other search engines) is to make use of the information encoded in the edges of the graph as well as the information found in the nodes.

The Kleinberg Algorithm

Some Web sites or magazine articles are more interesting than others even if they are devoted to the same topic. This simple idea is easy to grasp but hard to explain to a computer. So when a search is performed on a topic that many people write about, it is hard to find the most interesting or authoritative documents in the huge collection that satisfies the search criteria.

Professor Jon Kleinberg of Cornell University came up with one widely adopted technique for addressing this problem. His approach takes advantage of the insight that in creating a link from one site to another, a human being is making a judgment about the value of the site being linked to. Each link to another site is effectively a recommendation of that site. Cumulatively, the independent judgments of many Web site designers who all decide to provide links to the same target are conferring authority on that target. Furthermore, the reliability of the sites making the link can be judged according to the authoritativeness of the sites they link to. The recommendations of a site with many other good recommendations can be given more weight in determining the authority of another.

In Kleinberg’s terminology, a page that links to many authorities is a hub; a page that is linked to by many hubs is an authority. These ideas are illustrated in Figure 10.7. The two concepts can be used together to tell the difference between authority and mere popularity. At first glance, it might seem that a good method for finding authoritative Web sites would be to rank them by the number of unrelated sites linking to them. The problem with this technique is that any time the topic is mentioned, even in passing, by a popular site (one with many inbound links), it will be ranked higher than a site that is much more authoritative on the particular subject though less popular in general.

The solution is to rank pages, not by the total number of links pointing to them, but by the number of subject-related hubs that point to them.

Google.com uses a modified and enhanced version of the basic Kleinberg algorithm described here.

A search based on link analysis begins with an ordinary text-based search. This initial search provides a pool of pages (often a couple hundred) with which to start the process. It is quite likely that the set of documents returned by such a search does not include the documents that a human reader would judge to be the most authoritative sources on the topic. That is because the most authoritative sources on a topic are not necessarily the ones that use the words in the search string most frequently. Kleinberg uses the example of a search on the keyword “Harvard.” Most people would agree that www.harvard.edu is one of the most authoritative sites on this topic, but in a purely content-based analysis, it does not stand out among the more than a million Web pages containing the word “Harvard,” so it is quite likely that a text-based search will not return the university’s own Web site among its top results. It is very likely, however, that at least a few of the documents returned will contain a link to Harvard’s home page or, failing that, that some page that points to one of the pages in the pool will also point to www.harvard.edu.

An essential feature of Kleinberg’s algorithm is that it does not simply take the pages returned by the initial text-based search and attempt to rank them; it uses them to construct the much larger pool of documents that point to or are pointed to by any of the documents in the root set. This larger pool contains much more global structure—structure that can be mined to determine which documents are considered to be most authoritative by the wide community of people who created the documents in the pool.

The Details: Finding Hubs and Authorities

Kleinberg’s algorithm for identifying authoritative sources has three phases:

1. Creating the root set

2. Identifying the candidates

3. Ranking hubs and authorities

In the first phase, a root set of pages is formed using a text-based search engine to find pages containing the search string. In the second phase, this root set is expanded to include documents that point to or are pointed to by documents in the root set. This expanded set contains the candidates. In the third phase, which is iterative, the candidates are ranked according to their strength as hubs (documents that have links to many authoritative documents) and authorities (pages that have links from many authoritative hubs).
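
The third, iterative phase is where hub and authority scores reinforce each other: a candidate’s authority score is the sum of the hub scores of the pages that link to it, and its hub score is the sum of the authority scores of the pages it links to, with the scores rescaled after each pass. The sketch below illustrates that general scheme on a hypothetical candidate graph; it is a simplified illustration, not the exact formulation used by Kleinberg or by any search engine.

    import math

    def rank_hubs_and_authorities(graph, iterations=50):
        """graph maps each candidate page to the list of pages it links to.

        Returns (hub_score, authority_score) dictionaries over all candidates.
        """
        pages = set(graph) | {p for targets in graph.values() for p in targets}
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}

        for _ in range(iterations):
            # Authority score: sum of the hub scores of pages linking in.
            auth = {p: 0.0 for p in pages}
            for source, targets in graph.items():
                for target in targets:
                    auth[target] += hub[source]
            # Hub score: sum of the authority scores of pages linked to.
            hub = {p: sum(auth[t] for t in graph.get(p, ())) for p in pages}
            # Rescale so the scores stay comparable from pass to pass.
            for scores in (hub, auth):
                norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
                for p in scores:
                    scores[p] /= norm
        return hub, auth

    # Hypothetical candidates: h1 and h2 both point to page a, so a gains authority.
    toy = {"h1": ["a", "b"], "h2": ["a"], "a": [], "b": []}
    hub, auth = rank_hubs_and_authorities(toy)
    print(max(auth, key=auth.get))      # prints "a"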

Creating the Root Set

The root set of documents is generated using a content-based search. As a first step, stop words (common words such as “a,” “an,” “the,” and so on) are removed from the original search string supplied. Then, depending on the particular content-based search strategy employed, the remaining search terms may undergo stemming. Stemming reduces words to their root form by removing plural forms and other endings due to verb conjugation, noun declension, and so on. Then, the Web index is searched for documents containing the terms in the search string. There are many variations on the details of how matches are evaluated, which is one reason why performing the same search on two text-based search engines yields different results. In any case, some combination of the number of matching terms, the rarity of the terms matched, and the number of times the search terms are mentioned in a document is used to give the indexed documents a score that determines their rank in relation to the query. The top n documents are used to establish the root set. A typical value for n is 200.
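
A rough sketch of this first phase, assuming a hypothetical prebuilt index that maps each term to the documents containing it (with occurrence counts) and a simple score that rewards rare terms and repeated mentions. Stemming is omitted, and real search engines weigh matches in far more elaborate ways.

    import math

    STOP_WORDS = {"a", "an", "and", "the", "of", "or", "in", "on", "to"}

    def create_root_set(query, index, total_docs, n=200):
        """index maps a term to a dict of {document: occurrences of that term}."""
        terms = [t for t in query.lower().split() if t not in STOP_WORDS]
        scores = {}
        for term in terms:
            postings = index.get(term, {})
            # Rare terms count for more than common ones.
            rarity = math.log(total_docs / (1 + len(postings)))
            for doc, occurrences in postings.items():
                scores[doc] = scores.get(doc, 0.0) + rarity * (1 + math.log(occurrences))
        # The top n documents by score become the root set.
        return sorted(scores, key=scores.get, reverse=True)[:n]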

Identifying the Candidates

In the second phase, the root set is expanded to create the set of candidates. The candidate set includes all pages that any page in the root set links to along with a subset of the pages that link to any page in the root set. Locating pages that link to a particular target page is simple if the global structure of the Web is available as a directed graph. The same task can also be accomplished with an index-based text search using the URL of the target page as the search string.

The reason for using only a subset of the pages that link to each page in the root set is to guard against the possibility of an extremely popular site in the root set bringing in an unmanageable number of pages. There is also a parameter d that limits the number of pages that may be brought into the candidate set by any single member of the root set.
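
Continuing the same hypothetical sketch, the expansion step might look like the following, where links_from and links_to stand in for lookups against the crawled graph (or an index search on the target page’s URL), and the parameter d caps how many in-linking pages any single root page can contribute.

    def identify_candidates(root_set, links_from, links_to, d=50):
        """Expand the root set into the candidate set.

        links_from(page) -> pages that `page` links to
        links_to(page)   -> pages that link to `page`
        """
        candidates = set(root_set)
        for page in root_set:
            # Every page a root page points to is brought in.
            candidates.update(links_from(page))
            # Only up to d of the pages pointing at a root page are brought in,
            # so one extremely popular root page cannot flood the candidate set.
            candidates.update(list(links_to(page))[:d])
        return candidates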
