Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management – Page 89 – Library. Read online. Free books read online. Read books without registering

470643 c11.qxd 3/8/04 11:17 AM Page 359

Automatic Cluster Detection 359

this vague concept of association must be translated into some sort of numeric measure of the degree of similarity. The most common method, but by no means the only one, is to translate all fields into numeric values so that the records may be treated as points in space. Then, if two points are close in the geometric sense, they represent similar records in the database. There are two main problems with this approach:

■■ Many variable types, including all categorical variables and many numeric variables such as rankings, do not have the right behavior to properly be treated as components of a position vector.

■■ In geometry, the contributions of each dimension are of equal importance, but in databases, a small change in one field may be much more important than a large change in another field.

The following section introduces several alternative measures of similarity.

Similarity Measures and Variable Type

Geometric distance works well as a similarity measure for well-behaved numeric variables. A well-behaved numeric variable is one whose value indicates its placement along the axis that corresponds to it in our geometric model. Not all variables fall into this category. For this purpose, variables fall into four classes, listed here in increasing order of suitability for the geometric model.

■■ Categorical variables

■■ Ranks

■■ Intervals

■■ True measures

Categorical variables only describe which of several unordered categories a thing belongs to. For instance, it is possible to label one ice cream pistachio and another butter pecan, but it is not possible to say that one is greater than the other or judge which one is closer to black cherry. In mathematical terms, it is possible to tell that X ≠ Y, but not whether X > Y or X < Y. Ranks put things in order, but don’t say how much bigger one thing is than another. The valedictorian has better grades than the salutatorian, but we don’t know by how much. If X, Y, and Z are ranked A, B, and C, we know that X > Y > Z, but we cannot define X-Y or Y-Z .

Intervals measure the distance between two observations. If it is 56° in San Francisco and 78° in San Jose, then it is 22 degrees warmer at one end of the bay than the other.

470643 c11.qxd 3/8/04 11:17 AM Page 360

360 Chapter 11

True measures are interval variables that measure from a meaningful zero point. This trait is important because it means that the ratio of two values of the variable is meaningful. The Fahrenheit temperature scale used in the United States and the Celsius scale used in most of the rest of the world do not have this property. In neither system does it make sense to say that a 30° day is twice as warm as a 15° day. Similarly, a size 12 dress is not twice as large as a size 6, and gypsum is not twice as hard as talc though they are 2 and 1 on the hardness scale. It does make perfect sense, however, to say that a 50-year-old is twice as old as a 25-year-old or that a 10-pound bag of sugar is twice as heavy as a 5-pound one. Age, weight, length, customer tenure, and volume are examples of true measures.

Geometric distance metrics are well-defined for interval variables and true measures. In order to use categorical variables and rankings, it is necessary to transform them into interval variables. Unfortunately, these transformations may add spurious information. If ice cream flavors are assigned arbitrary numbers 1 through 28, it will appear that flavors 5 and 6 are closely related while flavors 1 and 28 are far apart.

These and other data transformation and preparation issues are discussed extensively in Chapter 17.

Formal Measures of Similarity

There are dozens if not hundreds of published techniques for measuring the similarity of two records. Some have been developed for specialized applications such as comparing passages of text. Others are designed especially for use with certain types of data such as binary variables or categorical variables.

Of the three presented here, the first two are suitable for use with interval variables and true measures, while the third is suitable for categorical variables.

Geometric Distance between Two Points

When the fields in a record are numeric, the record represents a point in n-dimensional space. The distance between the points represented by two records is used as the measure of similarity between them. If two points are close in distance, the corresponding records are similar.

There are many ways to measure the distance between two points, as discussed in the sidebar “Distance Metrics”. The most common one is the Euclidian distance familiar from high-school geometry. To find the Euclidian distance between X and Y, first find the differences between the corresponding elements of X and Y (the distance along each axis) and square them. The distance is the square root of the sum of the squared differences.

470643 c11.qxd 3/8/04 11:17 AM Page 361

Automatic Cluster Detection 361

DISTANCE METRICS

Any function that takes two points and produces a single number describing a relationship between them is a candidate measure of similarity, but to be a true distance metric, it must meet the following criteria:

◆ Distance( X, Y) = 0 if and only if X = Y

◆ Distance( X, Y) ≥ 0 for all X and all Y

◆ Distance( X, Y) = Distance( Y, X)

◆ Distance( X, Y) ≤ Distance( X, Z) + Distance( Z, Y) These are the formal definition of a distance metric in geometry.

A true distance is a good metric for clustering, but some of these conditions can be relaxed. The most important conditions are the second and third (called identity and commutativity by mathematicians)—that the measure is 0 or positive and is well-defined for any two points. If two records have a distance of 0, that is okay, as long as they are very, very similar, since they will always fall into the same cluster.

The last condition, the Triangle Inequality, is perhaps the most interesting mathematically. In terms of clustering, it basically means that adding a new cluster center will not make two distant points suddenly seem close together.

Fortunately, most metrics we could devise satisfy this condition.

Angle between Two Vectors

Sometimes it makes more sense to consider two records closely associated because of similarities in the way the fields within each record are related. Minnows should cluster with sardines, cod, and tuna, while kittens cluster with cougars, lions, and tigers, even though in a database of body-part lengths, the sardine is closer to a kitten than it is to a catfish.

The solution is to use a different geometric interpretation of the same data.

Instead of thinking of X and Y as points in space and measuring the distance between them, think of them as vectors and measure the angle between them.

In this context, a vector is the line segment connecting the origin of a coordinate system to the point described by the vector values. A vector has both magnitude (the distance from the origin to the point) and direction. For this similarity measure, it is the direction that matters.

Take the values for length of whiskers, length of tail, overall body length, length of teeth, and length of claws for a lion and a house cat and plot them as single points, they will be very far apart. But if the ratios of lengths of these body parts to one another are similar in the two species, than the vectors will be nearly colinear.

470643 c11.qxd 3/8/04 11:17 AM Page 362

362 Chapter 11

The angle between vectors provides a measure of association that is not influenced by differences in magnitude between the two things being compared (see Figure 11.7). Actually, the sine of the angle is a better measure since it will range from 0 when the vectors are closest (most nearly parallel) to 1

when they are perpendicular. Using the sine ensures that an angle of 0 degrees is treated the same as an angle of 180 degrees, which is as it should be since for this measure, any two vectors that differ only by a constant factor are considered similar, even if the constant factor is negative. Note that the cosine of the angle measures correlation; it is 1 when the vectors are parallel (perfectly correlated) and 0 when they are orthogonal.

TEAMFLY

Big Fish

Little Fish

Big Cat

Little Cat

Figure 11.7 The angle between vectors as a measure of similarity.

Team-Fly®

470643 c11.qxd 3/8/04 11:17 AM Page 363

Automatic Cluster Detection 363

Manhattan Distance

Another common distance metric gets its name from the rectangular grid pattern of streets in midtown Manhattan. It is simply the sum of the distances traveled along each axis. This measure is sometimes preferred to the Euclidean distance because given that the distances along each axis are not squared, it is less likely that a large difference in one dimension will dominate the total distance.