Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management

Hazards are like an x-ray of the customer life cycle. The related idea of survival, which is the proportion of customers who survive up to a particular point in time, makes it possible to compare different groups of customers and to translate results into dollars and cents. When there are enough customers (and usually there are), stratifying the customers by building a separate curve for each group provides a good comparison. It is possible to use other measures, such as the survival at a particular point in time, the customer half-life, and the average tenure, to better understand customers.

One of the key concepts in survival analysis is censoring. This means that some customers are dropped from the analysis. The idea of censoring can be extended to understand competing risks, such as voluntary versus forced attrition. Censoring also makes it possible to discard certain outcomes, such as a one-time boycott, without adversely biasing overall results.

470643 c12.qxd 3/8/04 11:17 AM Page 419

Hazard Functions and Survival Analysis in Marketing 419

One of the most powerful aspects of hazards is the ability to determine which factors, at the onset, are responsible for increasing or decreasing the hazards. In addition to stratifying customers, there is another technique, Cox proportional hazards regression, which has proven its worth since the 1970s and continues to be extended and improved upon.

Survival analysis has many applications beyond measuring the probability of customers leaving. It has been used for forecasting customer levels, as well as for predicting other types of events during the customer life cycle. It is a very powerful tool, seemingly designed specifically for understanding customers and their life cycles.

470643 c12.qxd 3/8/04 11:17 AM Page 420

470643 c13.qxd 3/8/04 11:18 AM Page 421

C H A P T E R

13

Genetic Algorithms

Like memory-based reasoning and neural networks, genetic algorithms are based on an analogy to biological processes. Evolution and natural selection have, over the course of millions of years, resulted in adaptable, specialized species that are highly suited to their environments. Evolution optimizes the fitness of individuals over succeeding generations by propagating the genetic material in the fittest individuals of one generation to the next generation.

Genetic algorithms apply the same idea to problems where the solution can be expressed as an optimal “individual” and the goal is to maximize the

“fitness” of individuals. Many problems can be described in this way; the challenge is encoding the problem in the right fashion. For instance, one application of genetic algorithms is to the training of neural networks. An individual is then a set of values for the weights inside the network; the fitness of an individual is the accuracy of the neural network having those weights on the training set.

The training proceeds in an evolutionary way, by having more fit individuals propagate their weights to succeeding generations. Less fit individuals—and their genetic material—do not survive. Although chance plays a significant role in the survival of any particular individual, over a larger population there are enough examples of different types of individuals for natural selection to propagate the genetic material that produces the fittest individuals.

Genetic algorithms, which are also called evolutionary algorithms, have been applied to optimization problems in various industries, including complex scheduling problems, resource optimization in large factories, and classification 421

470643 c13.qxd 3/8/04 11:18 AM Page 422

422 Chapter 13

problems involving complex data types. They have also been used in combination with other data mining algorithms, including determining the best topology for neural networks, determining the scoring function for memory-based reasoning, and, as already mentioned, optimizing weights in neural networks. However, genetic algorithms are not commonly found in general data mining packages.

OPTIMIZATION

Optimization problems have three features:

◆ A set of paramet

genomes

ers (which for GAs are called

or chromosomes)

◆ A function that combines the parameters into a single value (the fitness

function)

◆ A set of constraints on the parameters (for GAs these are incorporated into the fitness function)

The goal is to find the parameters that maximize or minimize the fitness function, subject to the constraints. Searching through all combinations of parameters that meet the constraints is too cumbersome for even the most advanced computers; even for a small number of parameters, the number of combinations is too large to search.

Genetic algorithms are one approach to solving such problems, but not the only one. When the fitness function satisfies some specific mathematical TEAMFLY

conditions, then differential calculus can be used to find the optimal solution.

Although few functions in practice are differentiable, calculus also includes ideas for estimating solutions in other cases. The conjugate-gradient method for training neural networks is based on such ideas, as is the “solver” capability in Excel.

Another approach arises with linear programming problems. These are problems where the fitness function is linear and all the constraints are also linear. These constraints are often met in resource allocation problems, such as:

A company produces widgets in a set of factories. Each factory

has a capacity, a cost of production, and a cost for transporting

widgets to customers. How many widgets should each factory

produce to satisfy customer demand at minimal cost?

The standard method for solving such problems is called the Simplex method, and it is computationally efficient. Such problems have been solved with thousands of variables. Further information on linear programming type problems is available on the linear programming FAQ at www-unix.mcs.anl.

gov/otc/Guide/faq/linear-programming-faq.html.

Another approach is called simulated annealing. This uses an analogy to a physical process: some liquids cool and form crystalline patterns as they cool.

The crystal minimizes certain types of energy, and this happens across the entire crystal. Scientists studying physical properties are the most common users of simulated annealing.

Team-Fly®

470643 c13.qxd 3/8/04 11:18 AM Page 423

Genetic Algorithms 423

The first work on genetic algorithms dates back to the late 1950s, when biologists and computer scientists worked together to model the mechanisms of evolution on early computers. A bit later, in the early 1960s, Professor John Holland and his colleagues at the University of Michigan applied this work on computerized genetics—chromosomes, genes, alleles, and fitness functions—

to optimization problems. In 1967, one of Holland’s students, J. D. Bagley, coined the term genetic algorithms in his graduate thesis to describe the optimization technique. At the time, many researchers were uncomfortable with genetic algorithms because of their dependence on random choices during the process of evolving a solution; these choices seemed arbitrary and unpredictable. In the 1970s, Prof. Holland developed a theoretical foundation for the technique. His theory of schemata gives insight into why genetic algorithms work—and intriguingly suggests why genetics itself creates successful, adaptable creatures such as ourselves.

In the world of data mining and data analysis, the use of genetic algorithms has not been as widespread as the use of other techniques. Data mining focuses on tasks such as classification and prediction, rather than on optimization. Although many data mining problems can be framed as optimization problems, this is not the usual description. For instance, a typical data mining problem might be to predict the level of inventory needed for a given item in a catalog based on the first week of sales, characteristics about the items in the catalog, and its recipients. Rephrasing this as an optimization problem turns it into something like “what function best fits the inventory curve for predictive purposes.” Applying statistical regression techniques is one way to find the function. Feeding the data into a neural network is another way of estimating it. Using genetic algorithms offers another possibility. The aside “Optimization” discusses other methods designed specifically for this purpose.

This chapter covers the background of genetics on computers, and introduces the schema mechanism invented by John Holland to explain why genetic algorithms work. It talks about two case studies, one in the area of resource optimization and the other in the area of classifying email messages.

Although few commercial data mining products include genetic algorithms (GAs), more specialized packages do support the algorithms. They are an important and active area of research and may become more widely used in the future.

How They Work

The power of genetic algorithms comes from their foundation in biology, where evolution has proven capable of adapting life to a multitude of environments (see sidebar “Simple Overview of Genetics”). The success of mapping the

470643 c13.qxd 3/8/04 11:18 AM Page 424

424 Chapter 13

template for the human genome—all the common DNA shared by human individuals—is just the beginning. The human genome has been providing knowledge for advances in many fields, such as medical research, biochemistry, genetics, and even anthropology. Although interesting, human genomics is fortunately beyond the scope of knowledge needed to understand genetic algorithms. The language used to describe the computer technique borrows heavily from the biological model, as discussed in the following section.

Genetics on Computers

A simple example helps illustrate how genetic algorithms work: trying to find the maximum value of a simple function with a single integer parameter p. The function in this example is the parabola (which looks like an upside-down

Pages: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154

Leave a Reply 0

Your email address will not be published. Required fields are marked *