Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management

470643 c13.qxd 3/8/04 11:18 AM Page 431

Genetic Algorithms 431

Table 13.4 The Population after Selection and Crossover

GENOME

P

FITNESS

10010

18

234

11001

25

150

00110

6

150

10110

22

198

Mutation

The final operation is mutation. Mutation rarely occurs in nature and is the result of miscoded genetic material being passed from a parent to a child. The resulting change in the gene may represent a significant improvement in fitness over the existing population, although more often than not, the results are harmful. Selection and crossover do a good job of searching the space of possible genomes, but they depend on initial conditions and randomness that might conspire to prevent certain valuable combinations from being considered in succeeding generations. Mutation provides the additional input. The mutation rate is quite small in nature and is usually kept quite low for genetic algorithms—no more than one mutation or so per generation is a reasonable bound. For the example at hand, when a mutation occurs, the bit changes from a 0 to a 1 or from a 1 to a 0.

Assume that there is one mutation in this generation, occurring in the second genome at position 3. Table 13.5 shows the population of genomes after such a mutation. Notice that this mutation, like many mutations, is destructive: The fitness of the genome affected by the mutation decreased from 150 to 58, the average fitness of the population decreased from 183.0 to 160.0, and the resulting genome is unlikely to survive to the next generation. This is not unusual. The primary modus operandi of genetic algorithms is selection and crossover. Mutation is very much a second-order effect that helps avoid premature convergence to a local optimum. When the initial population provides good coverage of the space of possible combinations, succeeding generations move quickly toward the optimal solution by means of selection and crossover. Changes introduced by mutation are likely to be destructive and do not last for more than a generation or two. Yet, despite the harmful mutation in this example, the second generation is a considerable improvement over the original population.

470643 c13.qxd 3/8/04 11:18 AM Page 432

432 Chapter 13

Table 13.5 The Population after Selection, Crossover, and Mutation GENOME

P

FITNESS

10010

18

234

11101

29

58

00110

6

150

10110

22

198

The basis of genetic algorithms is the continual improvement of the fitness of the population by means of selection, crossover, and mutation as genes are passed from one generation to the next. After a certain number of generations—

typically several dozen or hundred—the population evolves to a near-optimal solution. Genetic algorithms do not always produce the exact optimal solution, but they do a very good job of getting close to the best solution. In data mining, where exact solutions may not be feasible, being close to the best solution still yields actionable results.

Representing Data

The previous example illustrated the basic mechanisms of applying genetic TEAMFLY

algorithms to the optimization of a simple function, 31 p – p 2. Since the example was trying to maximize a particular function, the function itself served as the fitness function. The genomes were quite easy to create, because the function had one parameter, a 5-bit integer that varied between 0 and 31. This genome contained a single gene, representing the parameter, and consisting of a sequence of 5 binary bits. The choice of representation using binary sequences is not accidental. As explained later in the section on schemata, genetic algorithms work best on binary representations of data—a highly convenient circumstance, since computers themselves work most efficiently on binary data.

Genetic algorithms are different from other data mining and optimization techniques in that they manipulate the patterns of bits in the genomes and do not care at all about the values represented by the bits—only the fitness function knows what the patterns really mean. One requirement for the fitness function is the ability to transform any genome into a fitness value. This requirement does not seem particularly onerous, because computers are used to working with data in bits. However, some patterns of bits may violate constraints imposed on the problem. When the genome violates such constraints, then the fitness is set to a minimum. That is, testing for constraints in the fitness function incorporates constraints into the solution.

For instance, the previous example had a constraint that the value be between 0 and 31. This was made implicitly true by using 5 bits to represent Team-Fly®

470643 c13.qxd 3/8/04 11:18 AM Page 433

Genetic Algorithms 433

the genome. What if there were 8 bits? In this case, the fitness function would look like:

■■

31 – p 2 when 0 <= p <= 31■■0 otherwiseThe general rule here is to include a minimum fitness value for any bit patterns that do not make sense or that violate problem constraints. Such patterns may not be in the original population, but they might appear because of crossover and mutation.T I P The fitness function is defined on the genome, a sequence of bits. It must be able to understand any pattern of 1s and 0s in the bits. When a particular pattern of bits does not make sense, then the fitness function should return a very low value—so the pattern does not get passed on to succeeding generations.Case Study: Using Genetic Algorithmsfor Resource OptimizationOne area where genetic algorithms have proven quite successful is in problems involving scheduling resources subject to a wide range of constraints.These types of problems involve competition for limited resources, while adhering to a complex set of rules that describe relationships. The key to these problems is defining a fitness function that incorporates all the constraints into a single fitness value. These problems are outside the range of what we have been considering as traditional data mining problems; however, they are interesting and illustrate the power of genetic algorithms.An example of such a problem is the assignment of 40 medical residents to various duties in an outpatient clinic, as faced by Dr. Ed Ewen at the Medical Center of Delaware. The clinic is open 7 days a week, and the residents are assigned to one particular day of the week through an entire year, regardless of their other duties. The best assignment balances several different goals:■■The clinic must have staff at all times.■■The clinic should have a balance of first-, second-, and third-year residents.■■Third-year residents see eight patients per day, second-year residents see six, and first-year residents see four.470643 c13.qxd 3/8/04 11:18 AM Page 434434 Chapter 13So far, this problem is not so complicated. However, each resident spends 4weeks on a rotation in a given part of the hospital, such as the intensive care ward, the oncology unit, or a community hospital. These rotations impose some other constraints:■■Senior residents do not go to the clinic when they are assigned to the medical intensive care rotation, but all other residents do.■■Junior residents do not go to the clinic when they are assigned to the cardiac care rotation, but all other residents do.■■No more than two residents from the intensive care rotation can be assigned to the clinic on the same day.■■No more than three residents from other rotations can be assigned to the clinic on the same day.As an example of problems that may arise, consider that during one rotation, five residents are assigned to the clinic on a particular day. During the next rotation, the senior is on the medical intensive care rotation and the two juniors are on the cardiac care rotation. Now there are only two residents left at the clinic—and this is insufficient for clinic operations.The genetic algorithms approach recognizes that there is probably no perfect solution to this problem, but that some assignments of residents to days of the week are clearly better than others. Dr. Ewen recognized that he could capture the “goodness” of a schedule using a fitness function. Actually, the function that Dr. Ewen used was an anti-fitness function—the higher the value, the worse the schedule. This function imposed penalties for violating the constraints:■■For each day when the clinic has fewer than three residents, an amount is added—a larger amount the bigger the size of the deficit.■■For each day when there are no seniors in the clinic, a small amount is added.■■For each day when fewer than three residents are left on a rotation, a large amount is added to the fitness function.■■And so on.Setting up a spreadsheet with these functions, Dr. Ewen tried to minimize the functions to get the best assignment. His initial assignments had scores in the range of 130 to 140. After several hours of work, he was able to reduce the score to 72. Pretty good.However, he had available a genetic algorithms package from the Ward Systems Group (www.wardsystems.com) that plugs into Excel spreadsheets.

Pages: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154

Leave a Reply 0

Your email address will not be published. Required fields are marked *