  cross-selling, 115–116
  customer response, tracking, 109
  customer segmentation, 111–113
  differential response analysis, 107–108
  discussed, 95
  fixed budgets, 97–100
  new customer information, gathering, 109–110
  people most influenced by, 106–107
  Learning (Goldberg), 445
purchases, market basket analysis, 289
purchasing frequencies, behavior-based variables, 575–576
purity measures, splitting criteria, decision trees, 177–178
p-values, statistics, 126
Pyle, Dorian
  Business Modeling and Data Mining, 60
  Data Preparation for Data Mining, 75
470643 bindex.qxd 3/8/04 11:08 AM Page 637
Index 637
Q
quadratic discriminates, box diagrams, 200
quality of data, association rules, 308
question asking, data exploration, 67–68
Quinlan, J. Ross (Iterative Dichotomiser 3), 190
q-values, statistics, 126
R
range values, statistics, 137
rate plans, wireless telephone services, 7
ratios
  data transformation, 75
  lift ratio, 81–84
RDBMS. See relational database management system
real estate appraisals, neural network example, 213–217
recall measurements, classification codes, 273–274
recency, frequency, and monetary (RFM) value, 575
recommendation-based businesses, 16–17
records
  combining values within, 569
  default classes, 194
  transactional, 574
rectangular regions, decision trees, 197
recursive algorithms, 173
reduction in variance, splits, decision trees, 183
regression
  building models, 8
  estimation tasks, 10
  linear, 139
  regression trees, 170
  statistics, 139
  techniques, genetic algorithms, 423
relational database management system (RDBMS)
  discussed, 474
  source systems, 594–595
  star schema, 505
  suppliers, 13
  support, 511
relevance feedback, MBR, 267–268
replicating results, 33
reporting requirements, OLAP, 495–496
resources
  geographical, 555–556
  optimization, genetic algorithms, 433–435
response
  biased sampling, 146
  communication channels, 89
  control groups
    market research versus, 38
    marketing campaigns, 106
  cumulative response
    concentration, 82–83
    results, assessing, 85
  customer relationships, 457
  differential response analysis, marketing campaigns, 107–108
  erroneous conclusions, 74
  free text, 285
  good response scores, 34
  marketing campaigns, 96–97
  prediction, MBR, 258
  proof-of-concept projects, 599
  response models
    genetic algorithms, 440–443
    prospects, ranking, 36
  response times, interactive systems, 33
  sample sizes, 145
  single response rates, 141
  survey response
    customer classification, 91
    inconclusive, 46
response, survey response (continued)
    profiling, 53
    survey-based market research, 113
    useful data sources, 61
results
  actionable, 22
  assessing, 85
  comparing expectations to, 31
  deliverables, data transformation, 57–58
  measuring, virtuous cycle, 30–32
  neural networks, 241–243
  replicating, 33
  statistical analysis, 141–143
  tainted, 72
retention
  calculating, 385–386
  churn and, 116–120
  customer relationships, 467–469
  exponential decay, 389–390, 393
  hazards, 404–405
  median customer lifetime value, 387
  retention curves, 386–389
  truncated mean lifetime value, 389
retrospective customer value, 115
revenue, behavior-based variables, 581–585
revolvers, behavior-based variables, 580
RFM (recency, frequency, and monetary) value, 575
ring diagrams, as alternative to decision trees, 199–201
risks
  hazards, 403
  proof-of-concept projects, 599
ROC curves, 98–99, 101
root sets, link analysis, 333
RuleQuest Web site, 190
rules
  association rules
    actionable rules, 296
    affinity grouping, 11
    anonymous versus identified transactions, 308
    data quality, 308
    dissociation rules, 317
    effectiveness of, 299–301
    inexplicable rules, 297–298
    point-of-sale data, 288
    practical limits, overcoming, 311–313
    prediction, 70
    probabilities, calculating, 309
    products, hierarchical categories, 305
    sequential analysis, 318–319
    for store comparisons, 315–316
    trivial rules, 297
    virtual items, 307
  decision trees, 193–194
  generalized delta, 229
  rule-oriented problems, 176
S
SAC (Simplifying Assumptions Corporation), 97, 100
sample sizes, statistical analysis, 145
sample variation, statistics, 129
SAS Enterprise Miner Tree Viewer tool, 167–168
scalability, data mining, 533–534
scaling, automatic cluster detection, 363–364
scanners, point-of-sale, 3
scarce data, 62
SCF (sectional center facility), 553
schemata, genetic algorithms, 434, 436–438
scores
  bizocity, 112–113
  cutoff, 98
  decision trees, 169–170
  good response, 34
  index-based, 92–95
  model deployment, 84–85
  propensity-to-respond, 97
  proportional, census data, 94–95
  score sets, 52
  scoring platforms, data mining, 527–528
  sorting customers by, 8
  z-scores, 551
search programs, link analysis, 331
searchable criteria, relevance feedback, 268
sectional center facility (SCF), 553
selection step, genetic algorithms, 429
self-organizing map (SOM), 249–251, 372
sensitivity analysis, neural networks, 247–248
sequential analysis, association rules, 318–319
sequential events, applying decision trees to, 205
sequential patterns, identifying, 24
server platforms, affordability, 13
service business sectors, customer relationships, 13–14
shared labels, fax machines, 341
short form, census data, 94
short-term trends, 75
sigmoid activation functions, neural networks, 225
signatures, customers
  assembling, 68
  business versus residential customers, 561
  columns, pivoting, 563
  computational issues, 594–596
  considerations, 564
  customer identification, 560–562
  data for, cataloging, 559–560
  discussed, 540–541
  model set creation, 68
  snapshots, 562
  time frames, identifying, 562
similarity and distance, automatic cluster detection, 359–363
similarity matrix, 368
similarity measurements, MBR, 271–272
Simplifying Assumptions Corporation (SAC), 97, 100
simulated annealing, 230
single linkage, automatic cluster detection, 369
single response rates, 141
single views, customers, 517–518
sites. See Web sites
skewed distributions, data correction, 73
SKUs (stock-keeping units), 305
small-business relationships, customer relationship management, 2
SMP (symmetric multiprocessor), 485
snapshots, customer signatures, 562
social information filtering, 282
soft clustering, automatic cluster detection, 367
SOI (sphere of influence), 38
sole proprietors, 3
solicitation, marketing campaigns, 96
SOM (self-organizing map), 249–251, 372
source systems, 484, 486–487, 594
special-purpose code, 595
sphere of influence (SOI), 38
spiders, web crawlers, 331
splits, decision trees
  on categorical input variables, 174
  chi-square testing, 180–183
  discussed, 170
  diversity measures, 177–178
  entropy, 179
  finding, 172
  Gini splitting criterion, 178
  information gain ratio, 178, 180
  intrinsic information of, 180
  missing values, 174–175
  multiway, 171
  on numeric input variables, 173
  population diversity, 178
  purity measures, 177–178
  reduction in variance, 183
  surrogate, 175
spreadsheets, results, assessing, 85
SQL data, time series analysis, 572–573
stability-based pruning, decision trees, 191–192
staffing, data mining, 525–526
standard deviation
  estimation, 81
  statistics, 132, 138
  variance and, 138
standard error of proportion, statistical analysis, 139–141
standardization, numeric values, 551
standardized values, statistics, 129–133
star schema structure, relational databases, 505
statistical analysis
  business data versus scientific data, 159
  censored data, 161
  Central Limit Theorem, 129–130
  chi-square tests
    case study, 155–158
    degrees of freedom values, chi-square tests, 152–153
    difference of proportions versus, 153–154
    discussed, 149
    expected values, calculating, 150–151
  continuous variables, 137–138
  correlation ranges, 139
  cross-tabulations, 136
  density function, 133
  as disciplinary technique, 123
  discrete values, 127–131
  experimentation, 160–161
  field values, 128
  histograms and, 127
  marketing campaign approaches
    acuity of testing, 147–148
    confidence intervals, 146
    proportion, standard error of, 139–141
    sample sizes, 145
  mean values, 137
  median values, 137
  mode values, 137
  multiple comparisons, 148–149
  normal distribution, 130–132
  null hypothesis and, 125–126
  probabilities, 133–135
  p-values, 126
  q-values, 126
  range values, 137
  regression ranges, 139
  sample variation, 129
  standard deviation, 132, 138
  standardized values, 129–133
  sum of values, 137–138
  time series analysis, 128–129
  truncated data, 162
  variance, 138
  z-values, 131, 138
statistical regression techniques, genetic algorithms, 423
status codes, as categorical value, 239
stemming, link analysis, 333
stock-keeping units (SKUs), 305
store comparisons, association rules for, 315–316
stratification
  customer relationships and, 469
  hazards, 410
strings, fixed-length characters, 552–554
subgroups
  automatic cluster detection
    agglomerative clustering, 368–370
    case study, 374–378
    categorical variables, 359
    centroid distance, 369
    complete linkage, 369
    data preparation, 363–365
    dimension, 352
    directed clustering, 372
    discussed, 12, 91, 351
    distance and similarity, 359–363
    divisive clustering, 371–372
    evaluation, 372–373
    Gaussian mixture model, 366–367
    geometric distance, 360–361
    hard clustering, 367
    Hertzsprung-Russell diagram, 352–354
    luminosity, 351
    scaling, 363–364
    single linkage, 369
    soft clustering, 367
    SOM (self-organizing map), 372
    vectors, angles between, 361–362
    weighting, 363–365
    zone boundaries, adjusting, 380
  business goals, formulating, 605
  customer attributes, 11
  data transformation, 57
  overview, 11
  profiling tasks, 12
  undirected data mining, 57
subscription-based relationships, customer relationships, 459–460
subtrees, decision trees, 189
sum of values, statistics, 137–138
summarization, data transformation, 44
summation function, 272
supermarket chains, as information brokers, 15–16
supervised learning, 57
support, market basket analysis, 301
surrogate splits, decision trees, 175
survey responses
  customer classification, 91
  inconclusive, 46
  profiling, 53
  survey-based market research, 113
  useful data sources, 61
survival analysis
  attrition, handling different types of, 412–413
  customer relationships, 413–415
  estimation tasks, 10
  forecasting, 415–416
symmetric multiprocessor (SMP), 489–490
T
tables, lookup, auxiliary information, 570–571
tainted results, 72
tangent function, 223
target columns, 547
target fields, input variables, 37
target market versus control group response, 38
targeted acquisition campaigns, 31
targeting
  good prospects, identifying, 88–89
  prospecting, 88
taxonomy, products, 305
telecommunications customers, market basket analysis, 288
telephone switches, transaction processing systems, 3
terabytes, 5
Teradata, relational database management software, 13
termination of services, 114
testing
  acuity of, statistical analysis, 147–148
  chi-square tests
    case study, 155–158
    CHIDIST function, 152
    degrees of freedom values, 152–153
    difference of proportions versus, 153–154
    discussed, 149
    expected values, calculating, 150–151
    splits, decision trees, 180–183
  F tests, 183–184
  hypothesis testing
    confidence levels, 148
    considerations, 51
    decision-making process, 50–51
    generating, 51
    market basket analysis, 51
    null hypothesis, statistics and, 125–126
testing (continued)
  KS (Kolmogorov-Smirnov) tests, 101
  preclassified tests, 79
  test groups, marketing campaigns, 106
  test sets
    out of time tests, 72
    uses for, 52
time
  attributes, market basket analysis, 293
  and dates, interval variables, 551
  dependency, prospecting and, 160
  frames, customer signatures, 562
  series analysis
    neural networks, 244–247
truncated mean lifetime value, retention, 389
truthful learning sources, 48–50
two-tailed distribution, 134
U
undirected data mining
  affinity grouping, 57
  clustering, 57
  discussed, 7
uniform distribution, statistics, 132
uniform product code (UPC), 555
UNIT_MASTER file, customer signatures, 559
unordered lists, 239