automatic cluster detection, 359
gathering, 109–110
data correction, 73
people most influenced by, 106–107
marriages, 239–240
planning, 27
measures of, 549
profitability, 100–104
neural networks, 239–240
proof-of-concept projects, 600
propensity, 242
response modeling, 96–97
splits, decision trees, 174
as statistical analysis
censored data
acuity of testing, 147–148
hazards, 399–403
confidence intervals, 146
statistics, 161
proportion, standard error of, 139–141
census data
proportional scoring, 94–95
results, comparing, using confidence bounds, 141–143
useful data sources, 61
Central Limit Theorem, statistics, 129–130
sample sizes, 145
targeted acquisition campaigns, 31
central repository, 484, 488, 490
types of, 111
centroid distance, automatic cluster detection, 369
up-selling, 115–116
usage stimulation, 111
C5 pruning algorithm, decision trees, 190–191
candidates, link analysis, 333
canonical measurements, marketing campaigns, 31
CHAID (Chi-square Automatic Interaction Detector), 182–183
capture trends, data transformation, 75
challenges, business challenges, identifying, 23–24
470643 bindex.qxd 3/8/04 11:08 AM Page 620
620 Index
champion-challenger approach, marketing campaigns, 139
correct classification matrix, 79
data transformation, 57
change processes, feedback, 34
decision trees, 166–168
charts
directed data mining, 57
concentration, 101
discrete outcomes, 9
cumulative gains, 101
estimation, 9
lift charts, 82, 84
leaf nodes, 167
time series, 128–129
memory-based reasoning, 90–91
CHIDIST function, 152
overview, 8–9
child nodes, classification, 167
performance, 12
children, number of, household-level data, 96
Classification and Regression Trees (CART) algorithm, decision trees, 185, 188–189
chi-square tests
case study, 155–158
classification codes
CHAID (Chi-square Automatic Interaction Detector), 182–183
discussed, 266
precision measurements, 273–274
CHIDIST function, 152
recall measurements, 273–274
degrees of freedom values, 152–153
clustering
difference of proportions versus, 153–154
automatic cluster detection
agglomerative clustering, 368–370
discussed, 149
case study, 374–378
expected values, calculating, 150–151
categorical variables, 359
splits, decision trees, 180–183
centroid distance, 369
churn
complete linkage, 369
as binary outcome, 119
data preparation, 363–365
customer longevity, predicting, 119–120
dimension, 352
directed clustering, 372
EBCF (existing base churn forecast), 469
discussed, 12, 91, 351
distance and similarity, 359–363
expected, 118
divisive clustering, 371–372
forced attrition, 118
evaluation, 372–373
importance of, 117–118
Gaussian mixture model, 366–367
involuntary, 118–119, 521
geometric distance, 360–361
recognizing, 116–117
hard clustering, 367
retention and, 116–120
Hertzsprung-Russell diagram, 352–354
voluntary, 118–119, 521
class labels, probability, 85
luminosity, 351
classification
scaling, 363–364
accuracy, 79
single linkage, 369
binary
soft clustering, 367
decision trees, 168
SOM (self-organizing map), 372
misclassification rates, 98
vectors, angles between, 361–362
business goals, formulating, 605
weighting, 363–365
child nodes, 167
zone boundaries, adjusting, 380
business goals, formulating, 605
competitive advantage, information as, 14
customer attributes, 11
data transformation, 57
complete linkage, automatic cluster detection, 369
overview, 11
profiling tasks, 12
computational issues, customer signatures, 594–596
undirected data mining, 57
coding, special-purpose code, 595
concentration
collaborative filtering
concentration charts, 101
estimated ratings, 284–285
cumulative response, 82–83
grouping customers, 90
confidence intervals
predictions, 284–285
hypothesis testing, 148
profiles, building and comparing, 283–284
statistical analysis, 146, 148–149
confusion
social information filtering, 282
aggregation and, 48
word-of-mouth advertising, 283
confusion matrix, 79
collections, credit risks, 114
data transformation, 28
columns, data
conjugate gradient, 230
cost, 548
constant hazards
derived variables, 542
changing over time hazards versus, 416–417
discussed, 542
identification, 548
discussed, 397
ignored, 547
continuous variables
input, 547
data preparation, 235–237
with one value, 544–546
neural networks, 235–237
target, 547
statistics, 137–138
with unique values, 546–547
control group response
weight, 548
marketing campaigns, 106
combination function
target market response versus, 38
attrition history, 280
controlled experiments, hypothesis testing, 51
MBR (memory-based reasoning), 258, 265
convenience users, behavior-based variables, 580, 587–589
neural networks, 222
weighted voting, 281–282
cookies, Web servers, 109
commercial software products, 15
correct classification matrix, 79
communication channels, prospecting, 89
correlation ranges, statistics, 139
costs
companies. See businesses
cost columns, 548
comparisons
decision tree considerations, 195
comparing models, using lift ratio, 81–82
countervailing errors, 81
counts, converting to proportions, 75–76
data, 83
statistical analysis, 148–149
coverage of values, neural networks, 232–233
competing risks, hazards, 403
Cox proportional hazards, 410–411
creative process, data mining as, 33
stages, 457
credit
strategies for, 6
credit applications
stratification, 469
classification tasks, 9
subscription-based relationships, 459–460
prediction tasks, 10
useful data sources, 60
survival analysis, 413–415
credit risks, reducing exposure to, 113–114
transaction processing systems, 3–4
up-selling, 467
crossover, genetic algorithms, 430
winback approach, 470
cross-selling opportunities
customer-centric businesses, 514–515, 516–521
affinity grouping, 11
customer relationships, 467
demographic profiles, 31
marketing campaigns, 111, 115–116
grouping, collaborative filtering and, 90
reasons for, 17
cross-tabulations, 136, 567–568
interactions, learning opportunities, 520–521
cumulative gains, 36, 101
cumulative response
loyalty, 520
concentration, 82–83
marginal, 553
results, assessing, 85
new customer information
customers
gathering, 109–110
attributes, clustering, 11
memory-based reasoning, 277
behaviors of, gaining insight, 56
profiles, building, 283
customer relationships
prospective customer value, 115
bad customers, weeding out, 18
responses
building businesses around, 2
to marketing campaigns, 109
customer acquisition, 461–464
prediction, MBR, 258
customer activation, 464–466
retrospective customer value, 115
customer-centric enterprises, 3
segmentation, marketing campaigns, 111–113
data mining role in, 5–6
data warehousing, 4–5
sequential patterns, identifying, 24
deep intimacy, 449, 451
signatures
event-based relationships, 458–459
assembling, 68
good customers, holding on to, 17–18
business versus residential customers, 561
in-between relationships, 453
columns, pivoting, 563
indirect relationships, 453–454
computational issues, 594–596
interests in, 13–14
considerations, 564
large-business relationships, 3–4
customer identification, 560–562
levels of, 448
data for, cataloging, 559–560
life stages, 455–456
discussed, 540–541
lifetime customer value, 32
model set creation, 68
mass intimacy, 451–453
snapshots, 562
retention, 467–469
time frames, identifying, 562
service business sectors, 13–14
single views, 517–518
small-business relationships, 2
sorting, by scores, 8
discussed, 64
telecommunications, market basket analysis, 288
distributions, examining, 65
histograms, 565–566
cutoff scores, 98
intuition, 65
cyclic graphs, 330–331
question asking, 67–68
data marts, 485, 491–492
D
data selection
data
contents of, outcomes of interest, 64
acquisition-time, 108–110
data locations, 61–62
as actionable information, 516
density, 62–63
availability, determining, 515–516
history of, determining, 63
binary, 557
scarce data, 61–62
business versus scientific, statistical analysis, 159
variable combinations, 63–64
data transformation
censored, 161
capture trends, 75
by census tract, 94
counts, converting to proportions, 75–76
central repository, 484, 488, 490
columns
discussed, 74
cost, 548
information technology and user roles, 58–60
derived variables, 542
discussed, 542
problems, identifying, 56–57
identification, 548
ratios, 75
ignored, 547
results, deliverables, 58
input, 547
results, how to use, 57–58
with one value, 544–546
summarization, 44
target, 547
virtuous cycle, 28–30
with unique values, 546–547
dirty, 592–593
weight, 548
dumping, flat files, 594
comparisons, 83
enterprise-wide, 33
for customer signatures, cataloging, 559–560
ETL (extraction, transformation, and load) tools, 487
data correction
gigabytes, 5
categorical variables, 73
as graphs, 337
encoding, inconsistent, 74
historical
missing values, 73–74
customer behaviors, 5
numeric variables, 73
MBR (memory-based reasoning), 262–263
outliers, 73
overview, 72
neural networks, 219
skewed distributions, 73
prediction tasks, 10
values with meaning, 74
household level, 96
data exploration
imperfections in, 34
assumptions, validating, 67
inconsistent, 593–594
descriptions, comparing values with, 65
as information, 22
metadata repository, 484, 491
data (continued)
outsourcing, 522–524
missing data
platforms, 527
data correction, 73–74
scalability, 533–534
NULL values, 590
scoring platforms, 527–528
splits, decision trees, 174–175
staffing, 525–526
operational feedback, 485, 492
typical operational systems versus, 33
patterns
meaningful discoveries, 56
undirected
prediction, 45
affinity grouping, 57
untruthful learning sources, 45–46
clustering, 57
point-of-sale
discussed, 7
association rules, 288
Data Preparation for Data Mining (Dorian Pyle), 75
scanners, 3
as useful data source, 60
The Data Warehouse Toolkit (Ralph Kimball), 474
preparation
automatic cluster detection, 363–365
data warehousing
customer patterns, 5
categorical values, neural networks, 239–240
for decision support, 13
discussed, 4
continuous values, neural networks, 235–237
database administrators (DBAs), 488
databases
quality, association rules, 308
call detail, 37
representation, genetic algorithms, 432–433
demographic, 37
KDD (knowledge discovery in databases), 8
scarce, 62
source systems, 484, 486–487
server platforms, affordability, 13
SQL, time series analysis, 572–573
datasets, balanced, model sets, 68
terabytes, 5
dates and times, interval variables, 551
truncated, 162
useful data sources, 60–61
DBAs (database administrators), 488
visualization tools, 65
deaths, household-level data, 96
wrong level of detail, untruthful learning sources, 47
debt, nonrepayment of, credit risks, 114
data mining
decision support
architecture, 528–532
data warehousing for, 13
as creative process, 33
hypothesis testing, 50–51
directed
summary data, OLAP, 477–478
classification, 57
decision trees
discussed, 7
alphas, 188
estimation, 57
alternate representations for, 199–202
prediction, 57
applying to sequential events, 205
documentation, 536–537
branching nodes, 176
goals of, 7
building models, 8
insourcing, 524–525
case study, 206, 208
for catalog response models, 175
deep intimacy, customer relationships, 449, 451
classification, 9, 166–168
cost considerations, 195
default classes, records, 194
effectiveness of, measuring, 176
default risks, proof-of-concept projects, 599
estimation, 170
as exploration tool, 203–204
degrees of freedom values, chi-square tests, 152–153
fields, multiple, 195–197
neural networks, 199
democracy approach, memory-based reasoning, 279–281
profiling tasks, 12
projective visualization, 207–208
demographic databases, 37
pruning
demographic profiles, customers, 31
C5 algorithm, 190–191