Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management

“U”) defined by 31 p – p 2 where p varies between 0 and 31 (see Figure 13.1). The parameter p is expressed as a string of 5 bits to represent the numbers from 0

to 31; this bit string is the genetic material, called a genome. The fitness function peaks at the values 15 and 16, represented as 01111 and 10000, respectively.

This example shows that genetic algorithms are applicable even when there are multiple, dissimilar peaks.

300

250

200

150

Fitness Value

100

50

0

0

5

10

15

20

25

30

Parameter p

Figure 13.1 Finding the maximum of this simple function helps illustrate genetic algorithms.

470643 c13.qxd 3/8/04 11:18 AM Page 425

Genetic Algorithms 425

GAs work by evolving successive generations of genomes that get progressively more and more fit; that is, they provide better solutions to the problem.

In nature, fitness is simply the ability of an organism to survive and reproduce.

On a computer, evolution is simulated with the following steps:

1. Identify the genome and fitness function.

2. Create an initial generation of genomes.

3. Modify the initial population by applying the operators of genetic algorithms.

4. Repeat Step 3 until the fitness of the population no longer improves.

The first step is setting up the problem. In this simple example, the genome consists of a single, 5-bit gene for the parameter p. The fitness function is the parabola. Over the course of generations, the fitness function is going to be maximized.

For this example, shown in Table 13.1, the initial generation consists of four genomes, randomly produced. A real problem would typically have a population of hundreds or thousands of genomes, but that is impractical for illustrative purposes. Notice that in this population, the average fitness is 122.5—pretty good, since the actual maximum is 240, but evolution can improve it.

The basic algorithm modifies the initial population using three operators—

selection, then crossover, then mutation—as illustrated in Figure 13.2. These operators are explained in the next three sections.

Table 13.1 Four Randomly Generated Genomes

GENOME

P

FITNESS

10110

22

198

00011

3

84

00010

2

58

11001

25

150

470643 c13.qxd 3/8/04 11:18 AM Page 426

426

Chapter 13

generation n

generation n + 1

this genome

this genome

survives

Selection keeps the size of the population constant but

dies off

this genome

increases the fitness of the next generation. Genomes

multiplies

with a higher fitness (darker shading) proliferate and

genomes with lighter shading die off.

crossover

position

Crossover is a way of

combining two genomes.

A crossover position

determines where the

genomes “break” and

are recombined.

Mutation makes an occasional

mutation

random change to a random position

in a genome. This allows features to

appear that may not have been in the

original population.

Figure 13.2 The basic operators in genetic algorithms are selection, crossover, and mutation.

470643 c13.qxd 3/8/04 11:18 AM Page 427

Genetic Algorithms 427

SIMPLE OVERVIEW OF GENETICS

Life depends on proteins, which consist of sequences of 20 basic units called amino acids. The chromosomes in the nucleus of a cell are strands of DNA (deoxyribonucleic acid) that carry the blueprints for the proteins needed by the cell. The 23 chromosomes in each human cell together are the genome for that individual. In general, the genomes of different individuals in a species are very similar to each other; however, there are some individual differences.

The DNA in the genome encodes these blueprints for the amino acids sequences using strands of nucleotides. These nucleotides constitute the four letters of the genetic alphabet:

◆ A, adenine

◆ C, cytosine

◆ G, guanine

◆ T, thymine

Triplets of nucleotides represent the 20 amino acids. For instance, the amino acid called methionine corresponds to the triplet ATG. Another amino acid, lysine, has two “spellings”: AAA and AAG. So, if a strand of DNA contains the following letters:

ATGAAGATGCGA

then it decodes into a protein containing four amino acids: methionine, ATG; lysine, AAG; methionine, ATG; followed by arginine, CGA (see figure). This description intentionally glosses over the details of the actual biochemical mechanism that turns the blueprints into proteins, but it provides a high-level outline of the mapping from genetic information in DNA to the building blocks of proteins.

A biological example of encoding is the mapping from nucleotides in DNA to amino acids in protein.

In this simplified model, the process of evolution works as follows. The proteins produced by the representations in the DNA express themselves as features of the living organism, such as blue eyes, five fingers, the structure of the brain, a long trunk, and so on. Genes can express themselves in damaging ways, causing the resulting organism to die. Healthy organisms survive to produce offspring and pass their DNA to the next generation. In higher-level animals, the DNA is actually combined with the DNA from another survivor during sexual replication, using a technique called crossover. Sometimes, mistakes are made in passing genes from one generation to the next—these are mutations. The combination of all these processes over the course of many generations results in organisms highly adapted to their environment: the process of evolution.

(continued)

470643 c13.qxd 3/8/04 11:18 AM Page 428

428 Chapter 13

SIMPLE OVERVIEW OF GENETICS (continued)

DNA with

Amino Acids

Nucleotides

for Proteins

Adenine

Methionine

Thymine

Guanine

Adenine

A

Lysine

denine

Guanine

Adenine

T

Methionine

hymine

Guanine

Cytosine

G

Arginine

uanine

Adenine

470643 c13.qxd 3/8/04 11:18 AM Page 429

Genetic Algorithms 429

Selection

The selection step is analogous to the process of natural selection where only the fittest individuals in the population survive to pass their genetic material on to the next generation. Unlike nature, though, the size of the population remains constant from one generation to the next, so there is no chance of the population becoming extinct (which would clearly not be the optimum solution!). The chance of a genome surviving to the next generation is proportional to its fitness value—the better the fitness value relative to other genomes, the more copies that survive to the next generation. Table 13.2 shows the ratio of the fitness of the four genomes to the population fitness. This ratio determines the number of copies of each genome expected in the next generation.

The expected number of copies is a fraction, but the number of genomes in the population is never fractional. Survival is based on choosing the genomes in a random way proportional to their fitness. A random number is generated between 0 and 1, and this random number is used to determine whether a copy of a genome survives or not. Using the example from Table 13.2, if the first random number is less than 0.404, then genome 10110 would be chosen; if it is between 0.404 and 0.576 (40.4% + 17.1%), the genome 00011 would be chosen, and so on. More random numbers are generated until the next generation has the right number of genomes. Using a random number generator converts the fractional probabilities to whole number approximations, and it also allows some genomes with low fitness to survive.

Applying selection to the original four genomes yields the survivors shown in Table 13.3. Notice that in general this procedure produces more copies of the fitter genomes and fewer of the less fit. One of the less fit, 00011, has not survived this round of selection, but there are two copies of 10110, the fittest. And, the average fitness of the population has increased from 122.5 to 151.0.

Table 13.2 Using Fitness for Selection

% OF TOTAL

POPULATION

EXPECTED

GENOME

FITNESS

FITNESS

COPIES

10110

198

40.4%

1.62

00011

84

17.1%

0.69

00010

58

11.8%

0.47

11001

150

30.6%

1.22

470643 c13.qxd 3/8/04 11:18 AM Page 430

430 Chapter 13

Table 13.3 The Population after Selection

GENOME

P

FITNESS

10110

22

198

11001

25

150

00010

2

58

10110

22

198

Crossover

The next operator applied to the surviving genomes is crossover. Crossover, which occurs in nature, creates two new genomes from two existing ones by gluing together pieces of each one. As shown in Figure 13.2, crossover starts with two genomes and a random position. The first part of one genome swaps places with the first part of the second. For instance, starting with the two genomes 10110 and 00010 and using a crossover position between the second and third position works as follows:

1 0  1 1 0

0 0  0 1 0

The result of crossover is (the genes from the second genome are underlined): 1 0  0 1 0

0 0  1 1 0

The resulting genomes, called children, each have a piece of their chromosomes inherited from each of their parents. Applying crossover to the population proceeds by selecting pairs of genomes and flipping a coin to determine whether they split and swap. This probability is the crossover probability, denoted by pc. If they do cross over, then a random position is chosen and the children of the original genomes replace them in the next generation. A value of 0.5 for the crossover probability (corresponding to a coin toss) generally produces good results. In the example, the two genomes 10110 and 00010 are chosen for crossover, and the position is between the second and third genes (Table 13.4). Notice that after selection and crossover, the average fitness of the population has gone from 122.5 to 183.0. This is a significant improvement after only one generation.

Pages: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154

Leave a Reply 0

Your email address will not be published. Required fields are marked *