Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management

Occam’s Razor

William of Occam was a Franciscan monk born in a small English town in 1280, not only before modern statistics was invented, but also before the Renaissance and the printing press. He was an influential philosopher, theologian, and professor who expounded many ideas about many things, including church politics. As a monk, he was an ascetic who took his vow of poverty very seriously. He was also a fervent advocate of the power of reason, denying the existence of universal truths and espousing a modern philosophy that was quite different from the views of most of his contemporaries living in the Middle Ages.

What does William of Occam have to do with data mining? His name has become associated with a very simple idea. He himself explained it in Latin (the language of learning, even among the English, at the time), “Entia non sunt multiplicanda sine necessitate.” In more familiar English, we would say “the simpler explanation is the preferable one” or, more colloquially, “Keep it simple, stupid.” Any explanation should strive to reduce the number of causes to a bare minimum. This line of reasoning is referred to as Occam’s Razor and is William of Occam’s gift to data analysis.

The story of William of Occam had an interesting ending. Perhaps because of his focus on the power of reason, he also believed that the powers of the church should be separate from the powers of the state, with the church confined to religious matters. This led him to oppose the meddling of Pope John XXII in politics and eventually led to his own excommunication. He died in Munich during an outbreak of the plague in 1349, leaving a legacy of clear and critical thinking for future generations.

The Null Hypothesis

Occam’s Razor is very important for data mining and statistics, although statistics expresses the idea a bit differently. The null hypothesis is the assumption that differences among observations are due simply to chance. To give an example, consider a presidential poll that gives Candidate A 45 percent and Candidate B 47 percent. Because this data is from a poll, there are several sources of error, so the values are only estimates of the popularity of each candidate. The layperson is inclined to ask, “Are these two values different?” The statistician phrases the question slightly differently, “If these two values were really the same, how likely would a difference this large be?”

Although the two questions are very similar, the statistician’s has a bit of an attitude. This attitude is that the difference may have no significance at all, and it is an example of using the null hypothesis. There is an observed difference of 2 percentage points in this example. However, this observed value may be explained by the particular sample of people who responded. Another sample may have a difference of 2 points in the other direction, or no difference at all. All are reasonably likely results from a poll. Of course, if the preferences differed by 20 percentage points, then sampling variation is much less likely to be the cause. Such a large difference would greatly improve the confidence that one candidate is doing better than the other, and make the null hypothesis much harder to defend.
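The effect of sampling variation can be illustrated with a small simulation. The sketch below is an illustration only, not a method taken from the text. It assumes a hypothetical poll of 1,000 respondents (the example does not state a sample size) drawn from a population in which both candidates truly have 46 percent support, and it counts how often such a poll shows a gap at least as large as a given size, in either direction. The function name and every default value are assumptions made for the example.

```python
import random


def share_of_polls_with_gap(n_respondents=1000, n_polls=1000,
                            true_support=0.46, observed_gap=0.02, seed=0):
    """Simulate polls under the null hypothesis (equal true support).

    Returns the fraction of simulated polls whose gap between the two
    candidates is at least observed_gap, in either direction.
    All default values are hypothetical; the text gives no sample size.
    """
    rng = random.Random(seed)
    choices = ["A", "B", "other"]
    weights = [true_support, true_support, 1.0 - 2 * true_support]
    # Compare whole respondents to avoid floating-point comparisons.
    min_gap_in_respondents = round(observed_gap * n_respondents)

    polls_with_gap = 0
    for _ in range(n_polls):
        answers = rng.choices(choices, weights=weights, k=n_respondents)
        gap = abs(answers.count("A") - answers.count("B"))
        if gap >= min_gap_in_respondents:
            polls_with_gap += 1
    return polls_with_gap / n_polls


if __name__ == "__main__":
    print(share_of_polls_with_gap())                   # 2-point gap
    print(share_of_polls_with_gap(observed_gap=0.20))  # 20-point gap
```

Under these assumptions, a 2-point gap turns up in roughly half of the simulated polls even though the candidates are tied, while a 20-point gap essentially never does.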

470643 c05.qxd 3/8/04 11:11 AM Page 126

126 Chapter 5

TIP The simplest explanation is usually the best one, even (or especially) if it does not prove the hypothesis you want to prove.

This skeptical attitude is very valuable for both statisticians and data miners. Our goal is to demonstrate results that work, and to discount the null hypothesis. One difference between data miners and statisticians is that data miners often work with so much data that they do not need to worry about the mechanics of calculating the probability of something being due to chance.

P-Values

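The null hypothesis is not merely an approach to analysis; it can also be quantified. The p-value is the probability of seeing a difference at least as large as the one observed if the null hypothesis were true. Remember, when the null hypothesis is true, nothing is really happening, because differences are due to chance. A large p-value therefore means that chance alone easily explains the data. Strictly speaking, it is not the probability that the null hypothesis is true, although it is often loosely read that way. Much of statistics is devoted to determining bounds for the p-value.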

Consider the previous example of the presidential poll. Suppose that the p-value is calculated to be 60 percent (more on how this is done later in the chapter). This means that, if the two candidates really had equal support in the general population, a poll like this one would show a difference at least this large about 60 percent of the time purely by chance. In this case, there is little evidence that the support for the two candidates is different.

Let’s say the p-value is 5 percent, instead. This is a relatively small number: if the two candidates really had equal support, a gap this large would turn up in only 5 percent of polls. It is conventional to describe this as being 95 percent confident that Candidate B is doing better than Candidate A. Confidence, sometimes called the q-value, is the flip side of the p-value: one minus the p-value. Generally, the goal is to aim for a confidence level of at least 90 percent, if not 95 percent or more (meaning that the corresponding p-value is less than 10 percent, or 5 percent, respectively).
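To make the relationship between the p-value and confidence concrete, here is a minimal sketch of one way a p-value for the poll example could be computed. It is not necessarily the calculation used later in the chapter. It assumes a hypothetical sample of 1,000 respondents, assumes both shares come from the same poll, and uses a normal approximation. Under the null hypothesis of equal support, the variance of the gap between the two shares is approximately (share_a + share_b) / n. The function name and the sample size are assumptions.

```python
import math


def poll_gap_p_value(share_a, share_b, n_respondents):
    """Two-sided p-value for the gap between two shares from one poll.

    Normal approximation under the null hypothesis that both candidates
    have equal true support. n_respondents is an assumed sample size.
    """
    gap = abs(share_b - share_a)
    # Variance of the gap when both shares come from the same sample
    # and the true shares are equal: (share_a + share_b) / n.
    standard_error = math.sqrt((share_a + share_b) / n_respondents)
    z = gap / standard_error
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(z)).
    return math.erfc(z / math.sqrt(2))


if __name__ == "__main__":
    p_value = poll_gap_p_value(0.45, 0.47, 1000)
    confidence = 1 - p_value  # "confidence" in the sense used in the text
    print(f"p-value: {p_value:.3f}, confidence: {confidence:.3f}")
```

With these assumed inputs the p-value comes out at roughly 0.5, so the poll gives little reason to reject the null hypothesis. A larger sample or a wider gap would shrink the p-value and raise the confidence.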

The null hypothesis, p-value, and confidence are three basic ideas in statistics. The next section carries these ideas further and introduces the statistical concept of distributions, with particular attention to the normal distribution.

A Look at Data

A statistic refers to a measure taken on a sample of data. Statistics is the study of these measures and the samples they are measured on. A good place to start, then, is with such useful measures, and how to look at data.


The Lure of Statistics: Data Mining Using Familiar Tools 127

Looking at Discrete Values

Much of the data used in data mining is discrete by nature, rather than continuous. Discrete data shows up in the form of products, channels, regions, and descriptive information about businesses. This section discusses ways of looking at and analyzing discrete fields.

Histograms

The most basic descriptive statistic about discrete fields is the number of times different values occur. Figure 5.1 shows a histogram of stop reason codes during a period of time. A histogram shows how often each value occurs in the data and can show either absolute quantities (204 times) or percentages (14.6 percent). Often, there are too many values to show in a single histogram, as in this case, where over 30 additional codes are grouped into the “other” category.

In addition to the values for each category, this histogram also shows the cumulative proportion of stops, whose scale is shown on the left-hand side. Through the cumulative histogram, it is possible to see that the top three codes account for about 50 percent of stops, and the top 10, almost 90 percent. As an aesthetic note, the grid lines intersect both the left- and right-hand scales at sensible points, making it easier to read values off the chart.

[Chart for Figure 5.1. Vertical axes: Number of Stops (ticks up to 12,500) and Cumulative Proportion (ticks up to 100%). Horizontal axis: Stop Reason Code, including an OTHER category. Bar labels: 10,048; 5,944; 4,884; 3,851; 3,549; 3,311; 3,054; 1,491; 1,306; 1,226; 1,108.]

Figure 5.1 This example shows both a histogram (as a vertical bar chart) and cumulative proportion (as a line) on the same chart for stop reasons associated with a particular marketing effort.
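The numbers behind a chart like this one (counts, percentages, an “other” bucket, and the cumulative proportion) can be sketched in a few lines. The following is a minimal illustration, not the tooling used for the figure. The function name, the `top_n` parameter, and the sample stop reason codes are all hypothetical.

```python
from collections import Counter


def histogram_with_cumulative(values, top_n=10):
    """Return (value, count, share, cumulative_share) rows.

    Rows are sorted by descending count. Values outside the top_n most
    frequent are grouped into a final "OTHER" row.
    """
    counts = Counter(values)
    total = sum(counts.values())
    top = counts.most_common(top_n)
    other_count = total - sum(count for _, count in top)

    rows = []
    running = 0
    for value, count in top:
        running += count
        rows.append((value, count, count / total, running / total))
    if other_count:
        rows.append(("OTHER", other_count, other_count / total, 1.0))
    return rows


if __name__ == "__main__":
    # Hypothetical stop reason codes, for illustration only.
    stops = ["PRICE"] * 5 + ["MOVED"] * 3 + ["NO_PAY"] * 1 + ["SERVICE"] * 1
    for value, count, share, cumulative in histogram_with_cumulative(stops, top_n=2):
        print(f"{value:8s} {count:3d} {share:6.1%} {cumulative:6.1%}")
```

Reading down the cumulative column answers questions such as what share of stops the top three codes account for.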


Time Series

Histograms are quite useful and easily made with Excel or any statistics package. However, histograms describe a single moment. Data mining is often concerned with what is happening over time. A key question is whether the frequency of values is constant over time.

Time series analysis requires choosing an appropriate time frame for the data; this includes not only the units of time, but also the point from which we start counting. Possible reference points include the beginning of a customer relationship, the date a customer requests a stop, the actual stop date, and so on. Different fields belong in different time frames. For example:

■■ Fields describing the beginning of a customer relationship—such as original product, original channel, or original market—should be looked at by the customer’s original start date.

■■ Fields describing the end of a customer relationship—such as last product, stop reason, or stop channel—should be looked at by the customer’s stop date or the customer’s tenure at that point in time.

■■ Fields describing events during the customer relationship—such as product upgrade or downgrade, response to a promotion, or a late payment—should be looked at by the date of the event, the customer’s tenure at that point in time, or the relative time since some other event.
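As a small illustration of these time frames, the sketch below pairs each field with the date it should be analyzed by and computes tenure at the time of an event. The record layout, field names, and values are hypothetical and are not taken from the text.

```python
from datetime import date


def tenure_in_days(start_date, event_date):
    """Days from the start of the customer relationship to an event."""
    return (event_date - start_date).days


# Hypothetical customer record; field names are illustrative only.
customer = {
    "original_channel": "web",
    "start_date": date(2003, 1, 15),
    "stop_reason": "PRICE",
    "stop_date": date(2003, 3, 1),
}

# Each field is paired with the date that defines its time frame.
time_frames = {
    "original_channel": customer["start_date"],  # start-of-relationship field
    "stop_reason": customer["stop_date"],        # end-of-relationship field
}

tenure_at_stop = tenure_in_days(customer["start_date"], customer["stop_date"])
print(tenure_at_stop)  # 45
```

Analyzing stop reason by tenure rather than by calendar date answers a different question: how far into the relationship customers leave, instead of when in the year they leave.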

The next step is to plot the time series as shown in Figure 5.2. This figure has two series for stops by stop date: one shows a particular stop type over time (price increase stops), and the other shows the total number of stops. Notice that the units for the time axis are in days. Although much business reporting is done at the weekly and monthly level, we prefer to look at data by day in order to see important patterns that might emerge at a fine level of granularity, patterns that might be obscured by summarization. In this case, both lines show a clear up-and-down wiggle, which is due to a weekly cycle in stops. The lighter line, for price-increase-related stops, also shows a marked increase starting in February, due to a change in pricing.
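A daily series like the one described can be built by counting stops per calendar day, and a weekly cycle can be checked by totaling the same counts by day of the week. The sketch below assumes the input is simply a list of stop dates. The function names and sample dates are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]


def daily_counts(stop_dates):
    """Return (day, count) pairs for every day from first to last stop.

    Days with no stops are included with a count of zero so that gaps
    remain visible when the series is plotted.
    """
    counts = Counter(stop_dates)
    if not counts:
        return []
    first, last = min(counts), max(counts)
    n_days = (last - first).days + 1
    days = (first + timedelta(days=i) for i in range(n_days))
    return [(day, counts[day]) for day in days]


def weekday_totals(series):
    """Sum a daily series by day of week to expose a weekly cycle."""
    totals = Counter()
    for day, count in series:
        totals[WEEKDAY_NAMES[day.weekday()]] += count
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical stop dates, for illustration only.
    stops = [date(2004, 2, 2)] * 3 + [date(2004, 2, 4)]
    series = daily_counts(stops)
    print(series)
    print(weekday_totals(series))
```

Running the same functions on a subset of stops, such as only those with a price-increase stop reason, gives the second line of the chart.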