Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management – Page 108 – Library. Read online. Free books read online. Read books without registering

Genetic Algorithms 443

The comment signature is not the genome, but it is related to it. Instead, the genome is a set of weights corresponding to each variable in the signature (along with an additional weight called a bias). It is possible to multiply the weights in the genome times the corresponding fields in the comment signature to obtain a prediction for the comment being a complaint, as shown in Figure 13.6. This is the fitness function for a single comment signature. The full fitness function applies this to all the comment signatures in the training set.

The Genalytics System creates a random population of genomes. These genomes generally have most of the weights set to low values, and just a few set to high values. That is, the initial population consists of genomes that are specialized for the simplest features in the comment signature. Although the initial population performed very poorly, its use of selection, crossover, and mutation lead to better and better solutions. After tens of thousands of generations, the final model was able to classify 85 percent of the records correctly—

enough to speed up the airline’s complaint processing. The chart in Figure 13.7

shows the improvement in the fitness function in succeeding generations.

Is by phone?

Is by email?

Length Count of “food”

Count of “JFK”

Count of “changed”

Count of “meal”

Count of “lost”

“JFK” and “lost”

“food” and “meal”

104

. . .

bias

0.1

0.3

0.1

0.2

0.9

. . .

Figure 13.6 The genome has a weight for each field in the comment signature, plus an additional weight called a bias.

470643 c13.qxd 3/8/04 11:18 AM Page 444

444 Chapter 13

Figure 13.7 The Genalytics system shows the progress of the training and how the fitness function improves with each generation.

Beyond the Simple Algorithm

Researchers have been pushing the bounds on genetic algorithms in several directions. Some of these enhancements are refinements to the basic algorithm; others modify the algorithm to provide a better model of genetic activity in the natural world. This work is often performed under the rubric of machine learning, an area of current artificial intelligence research that aims to enable computers to learn in ways analogous to humans. Companies such as Genalytics are starting to apply these evolutionary techniques to marketing.

The simple genetic algorithm previously described has room for improvement in several areas. One of the inefficiencies in the algorithm is the fact that entire populations are replaced from one generation to the next. This is a gross oversimplification of what happens in nature. Instead of replacing an entire population, some researchers have worked with overlapping populations that can grow in size. They have introduced the notion of crowding to determine which existing members should be targeted for replacement. When applied naively, this tends to result in very fast convergence, often to suboptimal solutions, because all the less fit genomes are replaced before they have an opportunity to reproduce—and the less fit genomes sometimes have something to offer. To get around this, the targets for replacement often come from subsets of the population that exhibit high degrees of similarity.

The issue of overly fast convergence is actually a problem for the simple genetic algorithm because the goal of finding the globally optimal solution is

470643 c13.qxd 3/8/04 11:18 AM Page 445

Genetic Algorithms 445

easily confused by locally optimal solutions. Overly fast convergence often suggests that the search is being limited. To get around this, the various probabilities for crossover and mutation are often set high initially, then slowly decreased from one generation to the next. Or, an initial population is allowed to grow before shrinking in size as the fitness becomes more consistent across the population.

The genomes discussed so far consist of only a single strand of genes. Didn’t we learn back in high school that DNA consists of two intertwining strands in a helix structure? And what happened to those other concepts buried back in this high-school past, such as recessive and dominant genes? The genetics used so far is based on the simplest chromosomes found in nature, single-stranded, or haploid chromosomes. These tend to be found in uncomplicated, single-cell organisms. In more complex organisms, the chromosomes are two-stranded, or diploid, as in our own DNA.

The algorithmic characteristics of diploid chromosomes are much the same as haploid chromosomes, since diploid chromosomes can be treated as two chromosomes tied together. The actual algorithm proceeds in much the same way: Selection, crossover, and mutation are the same. The difference is that now there are two alleles for each gene (two possible values) instead of one.

When they match, there is no problem. When they do not, which does the fitness function use? In the language of genetics, this is asking which of the alleles is expressed. For instance, when an allele for blue eyes pairs up with an allele for brown eyes, the brown eyes “win”; that is, they are expressed instead of the blue eyes. (Actually, eye color is a bit more complicated than this simple example, but this is useful for explanatory purposes.) Researchers have solved this problem by including information about dominance in the alleles themselves. The details of this mechanism are beyond the scope of this book.

The interested reader is referred to Goldberg’s classic book Genetic Algorithms in Search, Optimization, and Machine Learning (1989, Addison-Wesley, ISBN

0201157675 )

Why we should care about diploid structures? Geneticists have long wondered why two-stranded chromosomes predominate in nature, when single-stranded ones are simpler. They believe that the two-stranded structure allows an organism to “remember” a gene that was useful in another environment, but has become less useful in the current environment. In terms of GA, this suggests that these are useful in cases where the environment—or fitness function—is changing over time. In the real world, this may prove to be quite useful. An example of a changing fitness function would be a function that tried to determine the price of bonds over time. The goodness of a given bond price depends on factors not under control of the algorithm, such as the rate of inflation. The “fitness” function can take this into account by changing over time to incorporate estimates of inflation.

470643 c13.qxd 3/8/04 11:18 AM Page 446

446 Chapter 13

Lessons Learned

Genetic algorithms are a very powerful optimization technique. Optimization is not at the heart of data mining, but it can solve interesting and important problems. In fact, some data mining algorithms, such as neural networks, depend on optimization “under the hood.”

The key to the power of genetic algorithms is that they depend on only two things. First is the genome and the second the fitness function. The fitness function makes sense of the genome, by producing a value from what looks like a random set of bits. The genome encodes the problem; often it consists of a set of weights on some equation. Genetic algorithms work on a wide variety of fitness functions, making it possible to encode many different types of problems that are not easily handled by other means.

The process of evolution starts with a random population and then applies three transformation steps. The first is selection, which means that more fit genomes survive from one generation to another. This corresponds to natural selection. The second is crossover, where two genomes swap pieces and is analogous to a similar natural process. The third is mutation, where some values are changed randomly. Mutations are usually quite rare both in nature and in genetic algorithms.

The application of these three processes produces a new generation, whose average fitness should be greater than the original. As more and more generations are created, the population moves to an optimal solution. These processes have a theoretical foundation, based on schemata. This theory explains how genetic algorithms move toward a solution.

Genetic algorithms have been applied to practical problems, often to resource optimization problems. However, they can even be used for predictive modeling and classification, as explained in the case study on classifying comments made to an airline.

470643 c14.qxd 3/8/04 11:18 AM Page 447

C H A P T E R

Data Mining throughout the

Customer Life Cycle

The purpose of data mining is to help businesses realize value from their most important asset: customers. Earlier chapters have talked about the algorithms and methodology for making data mining successful. This chapter turns from the specific technology to customers. The next three chapters continue this theme, stepping away from technical algorithms to talk about the data and the systems environment needed to exploit data mining.

For almost any business, customers are the critical asset. Yet, they are elusive, because of the wide variety of different relationships that change over time. Different industries have different definitions of customers. Within an industry, different competitors have different approaches to managing these relationships. Some focus on quality of service, some on convenience, some on price, and some on other aspects of the relationship. No two businesses have exactly the same definition of a customer, nor treat customers the same way throughout the relationship.