Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management – Page 84 – Library. Read online. Free books read online. Read books without registering

If more than d documents link to a particular document in the root set, then an arbitrary subset of d documents is brought into the candidate set. A typical value for d is 50. The candidate set typically ends up containing 1,000 to 5,000

documents.

This basic algorithm can be refined in various ways. One possible refinement, for instance, is to filter out any links from within the same domain, many of which are likely to be purely navigational. Another refinement is to allow a document in the root set to bring in at most m pages from the same site.

This is to avoid being fooled by “collusion” between all the pages of a site to, for example, advertise the site of the Web site designer with a “this site designed by” link on every page.

Ranking Hubs and Authorities

The final phase is to divide the candidate pages into hubs and authorities and rank them according to their strength in those roles. This process also has the effect of grouping together pages that refer to the same meaning of a search term with multiple meanings—for instance, Madonna the rock star versus the Madonna and Child in art history or Jaguar the car versus jaguar the big cat. It also differentiates between authorities on the topic of interest and sites that are simply popular in general. Authoritative pages on the correct topic are not only linked to by many pages, they tend to be linked to by the same pages. It is these hub pages that tie together the authorities and distinguish them from unrelated but popular pages. Figure 10.7 illustrates the difference between hubs, authorities, and unrelated popular pages.

Hubs and authorities have a mutually reinforcing relationship. A strong hub is one that links to many strong authorities; a strong authority is one that is linked to by many strong hubs. The algorithm therefore proceeds iteratively, first adjusting the strength rating of the authorities based on the strengths of the hubs that link to them and then adjusting the strengths of the hubs based on the strength of the authorities to which they link.

470643 c10.qxd 3/8/04 11:16 AM Page 335

Link Analysis 335

Hubs

Authorities

Popular Site

Figure 10.7 Google uses link analysis to distinguish hubs, authorities, and popular pages.

For each page, there is a value A that measures its strength as an authority and a value H that measures its strength as a hub. Both these values are initialized to 1 for all pages. Then, the A value for each page is updated by adding up the H values of all the pages that link to them. The A values for each page are then normalized so that the sum of their squares is equal to 1. Then the H

values are updated in a similar manner. The H value for each page is set to the sum of the A values of the pages it links to, and the new H values are normalized so that the sum of their squares is equal to 1. This process is repeated until an equilibrium set of A and H values is reached. The pages that end up with the highest H values are the strongest hubs; those with the strongest A values are the strongest authorities.

The authorities returned by this application of link analysis tend to be strong examples of one particular possible meaning of the search string. A search on a contentious topic such as “gay marriage” or “Taiwan independence” yields strong authorities on both sides because the global structure of the Web includes tightly connected subgraphs representing documents maintained by like-minded authors.

470643 c10.qxd 3/8/04 11:16 AM Page 336

336 Chapter 10

Hubs and Authorities in Practice

The strongest case for the advantage of adding link analysis to text-based searching comes from the market place. Google, a search engine developed at Stanford by Sergey Brin and Lawence Page using an approach very similar to Kleinberg’s, was the first of the major search engines to make use of link analysis to find hubs and authorities. It quickly surpassed long-entrenched search services such as AltaVista and Yahoo! The reason was qualitatively better searches.

The authors noticed that something was special about Google back in April of 2001 when we studied the web logs from our company’s site, www

.data-miners.com. At that time, industry surveys gave Google and AltaVista approximately equal 10 percent shares of the market for web searches, and yet Google accounted for 30 percent of the referrals to our site while AltaVista accounted for only 3 percent. This is apparently because Google was better able to recognize our site as an authority for data mining consulting because it was less confused by the large number of sites that use the phrase “data mining” even though they actually have little to do with the topic.

Case Study: Who Is Using Fax Machines from Home?

Graphs appear in data from other industries as well. Mobile, local, and long-distance telephone service providers have records of every telephone call that their customers make and receive. This data contains a wealth of information about the behavior of their customers: when they place calls, who calls them, whether they benefit from their calling plan, to name a few. As this case study shows, link analysis can be used to analyze the records of local telephone calls to identify which residential customers have a high probability of having fax machines in their home.

Why Finding Fax Machines Is Useful

What is the use of knowing who owns a fax machine? How can a telephone provider act on this information? In this case, the provider had developed a package of services for residential work-at-home customers. Targeting such customers for marketing purposes was a revolutionary concept at the company. In the tightly regulated local phone market of not so long ago, local service providers lost revenue from work-at-home customers, because these customers could have been paying higher business rates instead of lower residential rates. Far from targeting such customers for marketing campaigns, the local telephone providers would deny such customers residential rates—

punishing them for behaving like a small business. For this company, developing and selling work-at-home packages represented a new foray into customer service. One question remained. Which customers should be targeted for the new package?

470643 c10.qxd 3/8/04 11:16 AM Page 337

Link Analysis 337

There are many approaches to defining the target set of customers. The company could effectively use neighborhood demographics, household surveys, estimates of computer ownership by zip code, and similar data. Although this data improves the definition of a market segment, it is still far from identifying individual customers with particular needs. A team, including one of the authors, suggested that the ability to find residential fax machine usage would improve this marketing effort, since fax machines are often (but not always) used for business purposes. Knowing who uses a fax machine would help target the work-at-home package to a very well-defined market segment, and this segment should have a better response rate than a segment defined by less precise segmentation techniques based on statistical properties.

Customers with fax machines offer other opportunities as well. Customers that are sending and receiving faxes should have at least two lines—if they only have one, there is an opportunity to sell them a second line. To provide better customer service, the customers who use faxes on a line with call waiting should know how to turn off call waiting to avoid annoying interruptions on fax transmissions. There are other possibilities as well: perhaps owners of fax machines would prefer receiving their monthly bills by fax instead of by mail, saving both postage and printing costs. In short, being able to identify who is sending or receiving faxes from home is valuable information that provides opportunities for increasing revenues, reducing costs, and increasing customer satisfaction.

The Data as a Graph

The raw data used for this analysis was composed of selected fields from the call detail data fed into the billing system to generate monthly bills. Each record contains 80 bytes of data, with information such as:

■■

The 10-digit telephone number that originated the call, three digits for the area code, three digits for the exchange, and four digits for the line

■■

The 10-digit telephone number of the line where the call terminated

■■

The 10-digit telephone number of the line being billed for the call

■■

The date and time of the call

■■

The duration of the call

■■

The day of the week when the call was placed

■■

Whether the call was placed at a pay phone

In the graph in Figure 10.8, the data has been narrowed to just three fields: duration, originating number, and terminating number. The telephone numbers are the nodes of the graph, and the calls themselves are the edges, weighted by the duration of the calls. A sample of telephone calls is shown in Table 10.1.

470643 c10.qxd 3/8/04 11:16 AM Page 338