Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management

338 Chapter 10

35335000:00:41

3658

5166

00:00:23

3534271

00:00:01

3533068

35355500:00:42

3108

1212

00:01:22

3506595

Figure 10.8 Five calls link together seven telephone numbers.

Table 10.1 Five Telephone Calls

ORIGINATING TERMINATING

ID

NUMBER

NUMBER

DURATION

1

353-3658

350-5166

00:00:41

2

353-3068

350-5166

00:00:23

3

353-4271

353-3068

00:00:01

4

353-3108

555-1212

00:00:42

5

353-3108

350-6595

00:01:22

The Approach

Finding fax machines is based on a simple observation: Fax machines tend to call other fax machines. A set of known fax numbers can be expanded based on the calls made to or received from the known numbers. If an unclassified telephone number calls known fax numbers and doesn’t hang up quickly, then there is evidence that it can be classified as a fax number. This simple characterization

470643 c10.qxd 3/8/04 11:16 AM Page 339

Link Analysis 339

is good for guidance, but it is an oversimplification. There are actually several types of expected fax machine usage for residential customers:

■■

Dedicated fax. Some fax machines are on dedicated lines, and the line is used only for fax communication.

■■

Shared. Some fax machines share their line with voice calls.

■■

Data. Some fax machines are on lines dedicated to data use, either via fax or via computer modem.

T I P Characterizing expected behavior is a good way to start any directed data mining problem. The better the problem is understood, the better the results are likely to be.

The presumption that fax machines call other fax machines is generally true for machines on dedicated lines, although wrong numbers provide exceptions even to this rule. To distinguish shared lines from dedicated or data lines, we assumed that any number that calls information—411 or 555-1212 (directory assistance services)—is used for voice communications, and is therefore a voice line or a shared fax line. For instance, call #4 in the example data contains a call to 555-1212, signifying that the calling number is likely to be a shared line or just a voice line. When a shared line calls another number, there is no way to know if the call is voice or data. We cannot identify fax machines based on calls to and from such a node in the call graph. On the other hand, these shared lines do represent a marketing opportunity to sell additional lines.

The process used to find fax machines consisted of the following steps: 1. Start with a set of known fax machines (gathered from the Yellow Pages).

2. Determine all the numbers that make or receive calls to or from any number in this set where the call’s duration was longer than 10 seconds.

These numbers are candidates.

■■ If the candidate number has called 411, 555-1212, or a number identified as a shared fax number, then it is included in the set of shared voice/fax numbers.

■■ Otherwise, it is included in the set of known fax machines.

3. Repeat Steps 1 and 2 until no more numbers are identified.

One of the challenges was identifying wrong numbers. In particular, incoming calls to a fax machine may sometimes represent a wrong number and give no information about the originating number (actually, if it is a wrong number then it is probably a voice line). We made the assumption that such incoming wrong numbers would last a very short time, as is the case with Call #3. In a larger-scale analysis of fax machines, it would be useful to eliminate other anomalies, such as outgoing wrong numbers and modem/fax usage.

470643 c10.qxd 3/8/04 11:16 AM Page 340

340 Chapter 10

The process starts with an initial set of fax numbers. Since this was a demonstration project, several fax numbers were gathered manually from the Yellow Pages based on the annotation “fax” by the number. For a larger-scale project, all fax numbers could be retrieved from the database used to generate the Yellow Pages. These numbers are only the beginning, the seeds, of the list of fax machine telephone numbers. Although it is common for businesses to advertise their fax numbers, this is not so common for fax machines at home.

Some Results

The sample of telephone records consisted of 3,011,819 telephone calls made over one month by 19,674 households. In the world of telephony, this is a very small sample of data, but it was sufficient to demonstrate the power of link analysis. The analysis was performed using special-purpose C++ code that stored the call detail and allowed us to expand a list of fax machines efficiently.

Finding the fax machines is an example of a graph-coloring algorithm. This type of algorithm walks through the graph and label nodes with different “colors.” In this case, the colors are “fax,” “shared,” “voice,” and “unknown” instead of red, green, yellow, and blue. Initially, all the nodes are “unknown” except for the few labeled “fax” from the starting set. As the algorithm proceeds, more and more nodes with the “unknown” label are given more informative labels.

Figure 10.9 shows a call graph with 15 numbers and 19 calls. The weights on the edges are the duration of each call in seconds. Nothing is really known about the specific numbers.

Information

(411)

36

6

117

50

169

67

4

44

44

22

35

342

20

35

2

61

133

70

3

Figure 10.9 A call graph for 15 numbers and 19 calls.

470643 c10.qxd 3/8/04 11:16 AM Page 341

Link Analysis 341

Figure 10.10 shows how the algorithm proceeds. First, the numbers that are known to be fax machines are labeled “F,” and the numbers for directory assistance are labeled “I.” Any edge for a call that lasted less than 10 seconds has been dropped. The algorithm colors the graph by assigning labels to each node using an iterative procedure:

■■

Any “voice” node connected to a “fax” node is labeled “shared.”

■■

Any “unknown” node connected mostly to “fax” nodes is labeled “fax.”

This procedure continues until all nodes connected to “fax” nodes have a

“fax” or “shared” label.

F

U

U

U

I

This is the initial call graph

F

U

U

with short calls removed and

with nodes labeled as “fax,”

U

U

“unknown,” and “information.”

F

U

U

U

U

F

S

U

Nodes connected to the initial

fax machines are assigned

F

I

the “fax” label.

Those connected to

F

V

U

“information” are assigned

the “voice” label.

F

V

Those connected to both, are

“shared.”

F

F

U

The rest are “unknown.”

F

U

Figure 10.10 Applying the graph-coloring algorithm to the call graph shows which numbers are fax numbers and which are shared.

470643 c10.qxd 3/8/04 11:16 AM Page 342

342 Chapter 10

USING SQL TO COLOR A GRAPH

Although the case study implemented the graph coloring using special-purpose C++ code, these operations are suitable for data stored in a relational database.

Assume that there are three tables: call_detail, dedicated_fax, and shared_fax.

The query for finding the numbers that call a known fax number is: SELECT originating_number

FROM call_detail

WHERE terminating_number IN (SELECT number FROM dedicated_fax)

AND duration >= 10

GROUP BY originating_number;

A similar query can be used to get the calls made by a known fax number.

However, this does not yet distinguish between dedicated fax lines and shared fax lines. To do this, we have to know if any calls were made to information. For efficiency reasons, it is best to keep this list in a separate table or view, voice_numbers, defined by:

SELECT originating_number

FROM call_detail

WHERE terminating_number in (‘5551212’, ‘411’)

GROUP BY originating_number;

So the query to find dedicated fax lines is:

SELECT originating_number

TEAMFLY

FROM call_detail

WHERE terminating_number IN (SELECT number FROM dedicated_fax)

AND duration > 9

AND originating_number NOT IN (SELECT number FROM voice_numbers) GROUP BY originating_number;

and for shared lines it is:

SELECT originating_number

FROM call_detail

WHERE terminating_number IN (SELECT number FROM dedicated_fax)

AND duration > 2

AND originating_number IN (SELECT number FROM voice_numbers)

GROUP BY originating_number;

These SQL queries are intended to show that finding fax machines is possible on a relational database. They are probably not the most efficient SQL

statements for this purpose, depending on the layout of the data, the database engine, and the hardware it is running on. Also, if there is a significant number of calls in the database, any SQL queries for link analysis will require joins on very large tables.

Team-Fly®

470643 c10.qxd 3/8/04 11:16 AM Page 343

Link Analysis 343

Case Study: Segmenting Cellular

Telephone Customers

This case study applies link analysis to cellular telephone calls for the purpose of segmenting existing customers for selling new services.1 Analyses similar to those presented here were used with a leading cellular provider. The results from the analysis were used for a direct mailing for a new product offering. On such mailings, the cellular company typically measured a response rate of 2

percent to 3 percent. With some of the ideas presented here, it increased its response rate to over 15 percent, a very significant improvement.

The Data

Cellular telephone data is similar to the call detail data seen in the previous case study for finding fax machines. There is a record for each call that includes fields such as:

Pages: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154

Leave a Reply 0

Your email address will not be published. Required fields are marked *