338 Chapter 10
35335000:00:41
3658
5166
00:00:23
3534271
00:00:01
3533068
35355500:00:42
3108
1212
00:01:22
3506595
Figure 10.8 Five calls link together seven telephone numbers.
Table 10.1 Five Telephone Calls
ORIGINATING TERMINATING
ID
NUMBER
NUMBER
DURATION
1
353-3658
350-5166
00:00:41
2
353-3068
350-5166
00:00:23
3
353-4271
353-3068
00:00:01
4
353-3108
555-1212
00:00:42
5
353-3108
350-6595
00:01:22
The Approach
Finding fax machines is based on a simple observation: Fax machines tend to call other fax machines. A set of known fax numbers can be expanded based on the calls made to or received from the known numbers. If an unclassified telephone number calls known fax numbers and doesn’t hang up quickly, then there is evidence that it can be classified as a fax number. This simple characterization
470643 c10.qxd 3/8/04 11:16 AM Page 339
Link Analysis 339
is good for guidance, but it is an oversimplification. There are actually several types of expected fax machine usage for residential customers:
■■
Dedicated fax. Some fax machines are on dedicated lines, and the line is used only for fax communication.
■■
Shared. Some fax machines share their line with voice calls.
■■
Data. Some fax machines are on lines dedicated to data use, either via fax or via computer modem.
T I P Characterizing expected behavior is a good way to start any directed data mining problem. The better the problem is understood, the better the results are likely to be.
The presumption that fax machines call other fax machines is generally true for machines on dedicated lines, although wrong numbers provide exceptions even to this rule. To distinguish shared lines from dedicated or data lines, we assumed that any number that calls information—411 or 555-1212 (directory assistance services)—is used for voice communications, and is therefore a voice line or a shared fax line. For instance, call #4 in the example data contains a call to 555-1212, signifying that the calling number is likely to be a shared line or just a voice line. When a shared line calls another number, there is no way to know if the call is voice or data. We cannot identify fax machines based on calls to and from such a node in the call graph. On the other hand, these shared lines do represent a marketing opportunity to sell additional lines.
The process used to find fax machines consisted of the following steps: 1. Start with a set of known fax machines (gathered from the Yellow Pages).
2. Determine all the numbers that make or receive calls to or from any number in this set where the call’s duration was longer than 10 seconds.
These numbers are candidates.
■■ If the candidate number has called 411, 555-1212, or a number identified as a shared fax number, then it is included in the set of shared voice/fax numbers.
■■ Otherwise, it is included in the set of known fax machines.
3. Repeat Steps 1 and 2 until no more numbers are identified.
One of the challenges was identifying wrong numbers. In particular, incoming calls to a fax machine may sometimes represent a wrong number and give no information about the originating number (actually, if it is a wrong number then it is probably a voice line). We made the assumption that such incoming wrong numbers would last a very short time, as is the case with Call #3. In a larger-scale analysis of fax machines, it would be useful to eliminate other anomalies, such as outgoing wrong numbers and modem/fax usage.
470643 c10.qxd 3/8/04 11:16 AM Page 340
340 Chapter 10
The process starts with an initial set of fax numbers. Since this was a demonstration project, several fax numbers were gathered manually from the Yellow Pages based on the annotation “fax” by the number. For a larger-scale project, all fax numbers could be retrieved from the database used to generate the Yellow Pages. These numbers are only the beginning, the seeds, of the list of fax machine telephone numbers. Although it is common for businesses to advertise their fax numbers, this is not so common for fax machines at home.
Some Results
The sample of telephone records consisted of 3,011,819 telephone calls made over one month by 19,674 households. In the world of telephony, this is a very small sample of data, but it was sufficient to demonstrate the power of link analysis. The analysis was performed using special-purpose C++ code that stored the call detail and allowed us to expand a list of fax machines efficiently.
Finding the fax machines is an example of a graph-coloring algorithm. This type of algorithm walks through the graph and label nodes with different “colors.” In this case, the colors are “fax,” “shared,” “voice,” and “unknown” instead of red, green, yellow, and blue. Initially, all the nodes are “unknown” except for the few labeled “fax” from the starting set. As the algorithm proceeds, more and more nodes with the “unknown” label are given more informative labels.
Figure 10.9 shows a call graph with 15 numbers and 19 calls. The weights on the edges are the duration of each call in seconds. Nothing is really known about the specific numbers.
Information
(411)
36
6
117
50
169
67
4
44
44
22
35
342
20
35
2
61
133
70
3
Figure 10.9 A call graph for 15 numbers and 19 calls.
470643 c10.qxd 3/8/04 11:16 AM Page 341
Link Analysis 341
Figure 10.10 shows how the algorithm proceeds. First, the numbers that are known to be fax machines are labeled “F,” and the numbers for directory assistance are labeled “I.” Any edge for a call that lasted less than 10 seconds has been dropped. The algorithm colors the graph by assigning labels to each node using an iterative procedure:
■■
Any “voice” node connected to a “fax” node is labeled “shared.”
■■
Any “unknown” node connected mostly to “fax” nodes is labeled “fax.”
This procedure continues until all nodes connected to “fax” nodes have a
“fax” or “shared” label.
F
U
U
U
I
This is the initial call graph
F
U
U
with short calls removed and
with nodes labeled as “fax,”
U
U
“unknown,” and “information.”
F
U
U
U
U
F
S
U
Nodes connected to the initial
fax machines are assigned
F
I
the “fax” label.
Those connected to
F
V
U
“information” are assigned
the “voice” label.
F
V
Those connected to both, are
“shared.”
F
F
U
The rest are “unknown.”
F
U
Figure 10.10 Applying the graph-coloring algorithm to the call graph shows which numbers are fax numbers and which are shared.
470643 c10.qxd 3/8/04 11:16 AM Page 342
342 Chapter 10
USING SQL TO COLOR A GRAPH
Although the case study implemented the graph coloring using special-purpose C++ code, these operations are suitable for data stored in a relational database.
Assume that there are three tables: call_detail, dedicated_fax, and shared_fax.
The query for finding the numbers that call a known fax number is: SELECT originating_number
FROM call_detail
WHERE terminating_number IN (SELECT number FROM dedicated_fax)
AND duration >= 10
GROUP BY originating_number;
A similar query can be used to get the calls made by a known fax number.
However, this does not yet distinguish between dedicated fax lines and shared fax lines. To do this, we have to know if any calls were made to information. For efficiency reasons, it is best to keep this list in a separate table or view, voice_numbers, defined by:
SELECT originating_number
FROM call_detail
WHERE terminating_number in (‘5551212’, ‘411’)
GROUP BY originating_number;
So the query to find dedicated fax lines is:
SELECT originating_number
TEAMFLY
FROM call_detail
WHERE terminating_number IN (SELECT number FROM dedicated_fax)
AND duration > 9
AND originating_number NOT IN (SELECT number FROM voice_numbers) GROUP BY originating_number;
and for shared lines it is:
SELECT originating_number
FROM call_detail
WHERE terminating_number IN (SELECT number FROM dedicated_fax)
AND duration > 2
AND originating_number IN (SELECT number FROM voice_numbers)
GROUP BY originating_number;
These SQL queries are intended to show that finding fax machines is possible on a relational database. They are probably not the most efficient SQL
statements for this purpose, depending on the layout of the data, the database engine, and the hardware it is running on. Also, if there is a significant number of calls in the database, any SQL queries for link analysis will require joins on very large tables.
Team-Fly®
470643 c10.qxd 3/8/04 11:16 AM Page 343
Link Analysis 343
Case Study: Segmenting Cellular
Telephone Customers
This case study applies link analysis to cellular telephone calls for the purpose of segmenting existing customers for selling new services.1 Analyses similar to those presented here were used with a leading cellular provider. The results from the analysis were used for a direct mailing for a new product offering. On such mailings, the cellular company typically measured a response rate of 2
percent to 3 percent. With some of the ideas presented here, it increased its response rate to over 15 percent, a very significant improvement.
The Data
Cellular telephone data is similar to the call detail data seen in the previous case study for finding fax machines. There is a record for each call that includes fields such as: