X

Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management

470643 c13.qxd 3/8/04 11:18 AM Page 431

Genetic Algorithms 431

Table 13.4 The Population after Selection and Crossover

GENOME

P

FITNESS

10010

18

234

11001

25

150

00110

6

150

10110

22

198

Mutation

The final operation is mutation. Mutation rarely occurs in nature and is the result of miscoded genetic material being passed from a parent to a child. The resulting change in the gene may represent a significant improvement in fitness over the existing population, although more often than not, the results are harmful. Selection and crossover do a good job of searching the space of possible genomes, but they depend on initial conditions and randomness that might conspire to prevent certain valuable combinations from being considered in succeeding generations. Mutation provides the additional input. The mutation rate is quite small in nature and is usually kept quite low for genetic algorithms—no more than one mutation or so per generation is a reasonable bound. For the example at hand, when a mutation occurs, the bit changes from a 0 to a 1 or from a 1 to a 0.

Assume that there is one mutation in this generation, occurring in the second genome at position 3. Table 13.5 shows the population of genomes after such a mutation. Notice that this mutation, like many mutations, is destructive: The fitness of the genome affected by the mutation decreased from 150 to 58, the average fitness of the population decreased from 183.0 to 160.0, and the resulting genome is unlikely to survive to the next generation. This is not unusual. The primary modus operandi of genetic algorithms is selection and crossover. Mutation is very much a second-order effect that helps avoid premature convergence to a local optimum. When the initial population provides good coverage of the space of possible combinations, succeeding generations move quickly toward the optimal solution by means of selection and crossover. Changes introduced by mutation are likely to be destructive and do not last for more than a generation or two. Yet, despite the harmful mutation in this example, the second generation is a considerable improvement over the original population.

470643 c13.qxd 3/8/04 11:18 AM Page 432

432 Chapter 13

Table 13.5 The Population after Selection, Crossover, and Mutation GENOME

P

FITNESS

10010

18

234

11101

29

58

00110

6

150

10110

22

198

The basis of genetic algorithms is the continual improvement of the fitness of the population by means of selection, crossover, and mutation as genes are passed from one generation to the next. After a certain number of generations—

typically several dozen or hundred—the population evolves to a near-optimal solution. Genetic algorithms do not always produce the exact optimal solution, but they do a very good job of getting close to the best solution. In data mining, where exact solutions may not be feasible, being close to the best solution still yields actionable results.

Representing Data

The previous example illustrated the basic mechanisms of applying genetic TEAMFLY

algorithms to the optimization of a simple function, 31 p – p 2. Since the example was trying to maximize a particular function, the function itself served as the fitness function. The genomes were quite easy to create, because the function had one parameter, a 5-bit integer that varied between 0 and 31. This genome contained a single gene, representing the parameter, and consisting of a sequence of 5 binary bits. The choice of representation using binary sequences is not accidental. As explained later in the section on schemata, genetic algorithms work best on binary representations of data—a highly convenient circumstance, since computers themselves work most efficiently on binary data.

Genetic algorithms are different from other data mining and optimization techniques in that they manipulate the patterns of bits in the genomes and do not care at all about the values represented by the bits—only the fitness function knows what the patterns really mean. One requirement for the fitness function is the ability to transform any genome into a fitness value. This requirement does not seem particularly onerous, because computers are used to working with data in bits. However, some patterns of bits may violate constraints imposed on the problem. When the genome violates such constraints, then the fitness is set to a minimum. That is, testing for constraints in the fitness function incorporates constraints into the solution.

For instance, the previous example had a constraint that the value be between 0 and 31. This was made implicitly true by using 5 bits to represent Team-Fly®

470643 c13.qxd 3/8/04 11:18 AM Page 433

Genetic Algorithms 433

the genome. What if there were 8 bits? In this case, the fitness function would look like:

■■

31 – p 2 when 0

Page: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154

Categories: Economics, finance
Oleg: