market basket analysis
acuity of testing, 147–148
differentiation, 289
confidence intervals, 146
discussed, 287
proportion, standard error of, 139–141
geographic attributes, 293
item popularity, 293
results, comparing, using confidence bounds, 141–143
item sets, 289
market basket data, 51, 289–291
sample sizes, 145
marketing interventions, tracking, 293–294
targeted acquisition campaigns, 31
types of, 111
order characteristics, 292
up-selling, 115–116
products, clustering by usage, 294–295
usage stimulation, 111
marriages
purchases, 289
categorical values, 239–240
support, 301
household-level data, 96
telecommunications customers, 288
mass intimacy, customer relationships, 451–453
time attributes, 293
market research
massively parallel processor (MPP), 485
control group response versus, 38
literature, 22
maximum values, of simple functions, genetic algorithms, 424
shortcomings, 25
survey-based, 113
MBR. See memory-based reasoning
marketing campaigns. See also advertising
MDL (minimum description length), 78
acquisition-time data, 108–110
mean time between failure (MTBF), 384
canonical measurements, 31
champion-challenger approach, 139
mean time to failure (MTTF), 384
credit risks, reducing exposure to, 113–114
mean values, statistics, 137
measurement errors, 159
cross-selling, 115–116
median customer lifetime value, retention, 387
customer response, tracking, 109
customer segmentation, 111–113
median values, statistics, 137
differential response analysis, 107–108
medical insurance claims, useful data sources, 60
discussed, 95
medical treatment applications, MBR, 258
fixed budgets, 97–100
loyalty programs, 111
meetings, brainstorming, 37
new customer information, gathering, 109–110
memory-based reasoning (MBR)
case study, 259–262
people most influenced by, 106–107
challenges of, 262–265
planning, 27
classification codes, 266, 273–274
profitability, 100–104
combination function, 258, 265
proof-of-concept projects, 600
customer classification, 90–91
response modeling, 96–97
customer response prediction, 258
470643 bindex.qxd 3/8/04 11:08 AM Page 632
632 Index
memory-based reasoning (MBR) (continued)
missing data
data correction, 73–74
democracy approach, 279–281
NULL values, 590
distance function, 258, 265, 271–272
splits, decision trees, 174–175
fraud detection, 258
mission-critical applications, 32
free text response, 258
mode values, statistics, 137
historical records, selecting, 262–263
models
medical treatment applications, 258
assessing
new customers, 277
classifiers and predictors, 79
relevance feedback, 267–268
descriptive models, 78
similarity measurements, 271–272
directed models, 78–79
training data, 263–264
estimators, 79–81
weighted voting, 281–282
building, 8, 77
men, differential response analysis and, 107
comparing, using lift ratio, 81–82
deploying, 84–85
messages, prospecting, 89–90
model sets
metadata repository, 484, 491
balanced datasets, 68
methodologies
components of, 52
data correction, 72–74
customer signatures, assembling, 68
data exploration, 64–68
partitioning, 71–72
data mining process, 54–55
predictive models, 70–71
data selection, 60–64
timelines, multiple, 70
data transformation, 74–76
non-response, mass mailings, 35
data translation, 56–60
score sets, 52
learning sources
motor vehicle registration records, useful data sources, 61
truthful, 48–50
untruthful, 44–48
MOU (minutes of use), wireless communications industries, 38
model assessment, 78–82
model building, 77
MPP (massively parallel processor), 485
model deployment, 84–85
MSA (metropolitan statistical area), 94
model sets, creating, 68–72
MTBF (mean time between failure), 384
reasons for, 44
MTTF (mean time to failure), 384
results, assessing, 85
multiway splits, decision trees, 171
metropolitan statistical area (MSA), 94
mutation, genetic algorithms, 431–432
minimum description length (MDL), 78
N
minimum support pruning, decision trees, 312
N variables, dimension, 352
National Consumer Assets Group (NCAG), 23
minutes of use (MOU), wireless communications industries, 38
natural association, automatic cluster detection, 358
misclassification rates, binary classification, 98
nearest neighbor techniques
classification, 9
classification, 9
combination function, 222
collaborative filtering
components of, 220–221
estimated ratings, 284–285
continuous values, features with, 235–237
grouping customers, 90
predictions, 284–285
coverage of values, 232–233
profiles, building and comparing, 283–284
data preparation
categorical values, 239–240
social information filtering, 282
continuous values, 235–237
word-of-mouth advertising, 283
decision trees, 199
memory-based reasoning (MBR)
discussed, 211
case study, 259–262
estimation tasks, 10, 215
challenges of, 262–265
feed-forward
classification codes, 266, 273–274
back propagation, 228–232
combination function, 258, 265
hidden layer, 227
customer classification, 90–91
input layer, 226
customer response prediction, 258
output layer, 227
democracy approach, 279–281
genetic algorithms and, 439–440
distance function
hidden layers, 221, 227
fraud detection, 258
historical data, 219
free text responses, 258
history of, 212–213
historical records, selecting, 262–263
implementation, 212
inputs/outputs, 215
medical treatment applications, 258
neighborliness parameters, 250
new customers, 277
nonlinear behaviors, 222
relevance feedback, 267–268
OR value, 222
similarity measurements, 271–272
overfitting, 234
training data, 263–264
parallel coordinates, 253
weighted voting, 281–282
prediction, 215
negative correlation, 139
real estate appraisal example, 213–217
neighborliness parameters, neural networks, 250
results, interpreting, 241–243
neural networks
sensitivity analysis, 247–248
activation function, 222
sigmoid activation functions, 225
AND value, 222
SOM (self-organizing map), 249–251
automation, 213
time series analysis, 244–247
average member technique, 252
training sets, selection consideration, 232–234
bias sampling, 227
biological, 211
transfer function, 223
building models, 8
validation sets, 218
case study, 252–254
variable selection problem, 233
categorical variables, 239–240
variance, 199
new customer information
Open Database Connectivity (ODBC), 496
gathering, 109–110
memory-based reasoning, 277
operational errors, 159
profiles, building, 283
operational feedback, 485, 492
new start forecast (NSF), 469
operational summary data, OLAP, 477
nodes, graphs, 322
opportunistic sample, defined, 25
nonlinear behaviors, neural networks, 222
opportunities, good response scores, 34
non-response models, mass mailings, 35
optimization
genetic algorithms, 422
normal distribution, statistics, 130–132
resources, genetic algorithms, 433–435
normalization, numeric variables, 550
normalized absolute value, distance function, 275
training as, 230
OR value, neural networks, 222
NORMDIST function, 134
Oracle, relational database management software, 13
NORMSINV function, 147
NSF (new start forecast), 469
order characteristics, market basket analysis, 292
null hypothesis, statistics and, 125–126
NULL values, missing data, 590
ordered lists, 239
numeric variables
ordered variables, measure of, 549
data correction, 73
organizations. See businesses
distance function, 275
out of time tests, 72
measure of, 550–551
outliers
splits, decision trees, 173
data correction, 73
data transformation, 74
O
output layer, feed-forward neural networks, 227
Occam’s Razor, 124–125
ODBC (Open Database Connectivity), 496
outputs, neural networks, 215
outsourcing data mining, 522–524
one-tailed distribution, 134
overfitting, neural networks, 234
Online Analytic Processing (OLAP)
additive facts, 501
P
data mining and, 507–508
parallel coordinates, neural networks, 253
decision-support summary data, 477–478
parsing variables, 569
dimension tables, 502–503
patterns
discussed, 31
meaningful discoveries, 56
levels of, 475
prediction, 45
logical schema, 478
untruthful learning sources, 45–46
metadata, 483–484, 491
peg values, 236
operational summary data, 477
penetration, proportion, 203
physical schema, 478
percent variations, 105
reporting requirements, 495–496
perceptrons, defined, 212
transaction data, 476–477
performance, classification, 12
distribution and, 135
physical schema, OLAP, 478
hazards, 394–396
pilot projects, 598
statistics, 133–135
planar graphs, 323
probation periods, 518
planned processes, proof-of-concept projects, 599
problem management
data transformation, 56–57
platforms, data mining, 527
identification, 43
point of maximum benefit, 101
lift ratio, 83
point-of-sale data
profiling as, 53–54
association rules, 288
rule-oriented problems, 176
scanners, 3
variable selection problems, neural networks, 233
as useful data source, 60
population diversity, 178
products
positive ratings, voting, 284
clustering by usage, market basket analysis, 294–295
postcards, as communication channel, 89
co-occurrence of, 299
potential revenue, behavior-based variables, 583–585
hierarchical categories, 305
information as, 14
precision measurements, classification codes, 273–274
introduction, planning for, 27
product codes, as categorical value, 239
preclassified tests, 79
predictions
product-focused businesses, 2
accuracy, 79
taxonomy, 305
association rules, 70
profiling
business goals, formulating, 605
business goals, formulating, 605
collaborative filtering, 284–285
collaborative filtering, 283–284
credit risks, 113–114
data transformation, 57
customer longevity, 119–120
decision trees, 12
data transformation, 57
demographic profiles, 31
defined, 52
descriptive, 52
directed data mining, 57
directed, 52
errors, 191
examples of, 54
future behaviors, 10
gender example, 12
historical data, 10
new customer information, 283
model sets for, 70–71
overview, 12
neural networks, 215
prediction versus, 52–53
patterns, 45
as problem management, 53–54
prediction task examples, 10
survey response, 53
profiling versus, 52–53
profitability
response, MBR, 258
marketing campaigns, 100–104
uses for, 54
proof-of-concept projects, 599
probabilities
results, assessing, 85
calculating, 309
projective visualization (Marc Goodman), 206–208
class labels, 85
proof-of-concept projects
planning, 27
expectations, 599
profitability, 100–104
identifying, 599–601
response modeling, 96–97
implementation, 601–605
types of, 111
propensity
up-selling, 115–116
categorical variables, 242
messages, selecting appropriate, 89–90
propensity-to-respond score, 97
proportion
ranking, 88–89
converting counts to, 75–76
roles in, 88
difference of proportion
targeting, 88
chi-square tests versus, 153–154
time dependency and, 160
statistical analysis, 143–144
prospective customer value, 115
penetration, 203
prototypes, proof-of-concept projects, 599
standard error of, statistical analysis, 139–141
pruning, decision trees
proportional hazards
C5 algorithm, 190–191
Cox, 410–411
CART algorithm, 185, 188–189
discussed, 408
discussed, 184
examples of, 409
minimum support pruning, 312
limitations of, 411–412
stability-based, 191–192
proportional scoring, census data, 94–95
public records, household-level data, 96
prospecting
publications
advertising techniques, 90–94
Building the Data Warehouse (Bill Inmon), 474
communication channels, 89
customer relationships, 457
Business Modeling and Data Mining (Dorian Pyle), 60
efforts, 90
good prospects, identifying, 88–89
Data Preparation for Data Mining (Dorian Pyle), 75
index-based scores, 92–95
marketing campaigns
The Data Warehouse Toolkit (Ralph Kimball), 474
acquisition-time variables, 110
credit risks, reducing exposure to, 113–114
Genetic Algorithms in Search, Optimization, and Machine