density
CART algorithm, 185, 188–189
data selection, 62–63
discussed, 184
density function, statistics, 133
minimum support pruning, 312
deploying models, 84–85
stability-based, 191–192
derived variables, column data, 542
rectangular regions, 197
descriptions
regression trees, 170
comparing values with, 65
rules, extracting, 193–194
data transformation, 57
SAS Enterprise Miner Tree Viewer
descriptive models, assessing, 78
tool, 167–168
descriptive profiling, 52
scoring, 169–170
deviation. See standard deviation
splits
difference of proportion
on categorical input variables, 174
chi-square tests versus, 153–154
chi-square testing, 180–183
statistical analysis, 143–144
discussed, 170
differential response analysis,
diversity measures, 177–178
marketing campaigns, 107–108
entropy, 179
differentiation, market basket
finding, 172
analysis, 289
Gini splitting criterion, 178
dimension
information gain ratio, 178, 180
automatic cluster detection, 352
intrinsic information of, 180
dimension tables, OLAP, 502–503
missing values, 174–175
directed clustering, automatic cluster
multiway, 171
detection, 372
on numeric input variables, 173
directed data mining
population diversity, 178
classification, 57
purity measures, 177–178
discussed, 7
reduction in variance, 183
estimation, 57
surrogate, 175
prediction, 57
subtrees, selecting, 189
directed graphs, 330
uses for, 166
directed models, assessing, 78–79
declining usage, behavior-based
directed profiling, 52
variables, 577–579
dirty data, 592–593
470643 bindex.qxd 3/8/04 11:08 AM Page 626
626 Index
discrete outcomes, classification, 9
equal-height binning, 551
discrete values, statistics, 127–131
equal-width binning, 551
discrimination measures, ROC
erroneous conclusions, 74
curves, 99
errors
dissociation rules, 317
countervailing, 81–82
distance and similarity, automatic
error rates
cluster detection, 359–363
adjusted, 185
distance function
establishing, 79
defined, 271–272
measurement, 159
discussed, 258, 265
operational, 159
hidden distance fields, 278
predicting, 191
identity distance, 271
standard error of proportion,
numeric fields, 275
statistical analysis, 139–141
triangle inequality, 272
established customers, customer
zip codes, 276–277
relationships, 457
distribution
estimation
data exploration, 65
accuracy, 79–81
one-tailed, 134
averages, 81
probability and, 135
business goals, formulating, 605
statistics, 130–132
classification tasks, 9
two-tailed, 134
collaborative filtering, 284–285
diverse data types, 536
data transformation, 57
diversity measures, splitting criteria,
decision trees, 170
decision trees, 177–178
directed data mining, 57
divisive clustering, automatic cluster
estimation task examples, 10
detection, 371–372
examples of, 10
documentation
neural networks, 10, 215
data mining, 536–537
regression models, 10
historical data as, 61
revenue, behavior-based variables,
dumping data, flat files, 594
581–583
standard deviation, 81
E
valued outcomes, 9
EBCF (existing base churn
ETL (extraction, transformation,
forecast), 469
and load) tools, 487, 595
economic data, useful data sources, 61
evaluation, automatic cluster
edges, graphs, 322
detection, 372–373
education level, house-hold level
event-based relationships, customer
data, 96
relationships, 458–459
existing base churn forecast
as communication channel, 89
(EBCF), 469
free text resources, 556–557
expectations
encoding, inconsistent, data
comparing to results, 31
correction, 74
expected values, chi-square tests,
enterprise-wide data, 33
150–151
entropy, information gain, 178–180
proof-of-concept projects, 599
expected churn, 118
fraudulent insurance claims,
experimentation
classification, 9
hypothesis testing, 51
free text response, memory-based
statistics, 160–161
reasoning, 285
exploration tools, decision trees as,
functionality, lack of, data
203–204
transformation, 28
exponential decay, retention,
functions
389–390, 393
activation, 222
expressive power, descriptive
CHIDIST, 152
models, 78
combination
extraction, transformation, and load
attrition history, 280
(ETL) tools, 487, 595
MBR (memory-based reasoning),
258, 265
F
neural networks, 272
F tests (Ronald A. Fisher), 183–184
weighted voting, 281–282
fax machines, link analysis, 337–341
density, 133
Federal Express, transaction
distance
processing systems, 3–4
defined, 271–272
feedback
discussed, 258, 265
change processes, 34
hidden distance fields, 278
operational, 485, 492
identity distance, 271
relevance feedback, MBR, 267–268
numeric fields, 275
feed-forward neural networks
triangle inequality, 272
back propagation, 228–232
zip codes, 276–277
hidden layer, 227
hyperbolic tangent, 223
input layer, 226
NORMDIST, 134
output layer, 227
NORMSINV, 147
field values, statistics, 128
sigmoid, 225
Fisher, Ronald A. (F tests), 183–184
summation, 272
fixed budgets, marketing campaigns,
tangent, 223
97–100
transfer, 223
fixed positions, genetic algorithms, 435
future attrition, 49
fixed-length character strings, 552–554
future customer behaviors,
flat files, dumping data, 594
predicting, 10
forced attrition, 118
forecasting
G
EBCF (existing base churn
gains, cumulative, 36, 101
forecast), 469
Gaussian mixture model, automatic
NSF (new start forecast), 469
cluster detection, 366–367
survival analysis, 415–416
gender
former customers, customer
as categorical value, 239
relationships, 457
profiling example, 12
forward-looking businesses, 2
generalized delta rules, 229
fraud detection, MBR, 258
genetic algorithms
data as, 337
case study, 440–443
directed, 330
crossover, 430
edges, 322
data representation, 432–433
graph-coloring algorithm, 340–341
genome, 424
Hamiltonian path, 328
implicit parallelism, 438
linkage, 77
maximum values, of simple
nodes, 322
functions, 424
planar, 323
mutation, 431–432
traveling salesman problem, 327–329
neural networks and, 439–440
vertices, 322
optimization, 422
grouping. See clustering
overview, 421–422
GUI (graphical user interface), 535
resource optimization, 433–435
response modeling, 440–443
H
schemata, 434, 436–438
Hamiltonian path, graph theory, 328
selection step, 429
hard clustering, automatic cluster
statistical regression techniques, 423
detection, 367
Genetic Algorithms in Search,
hazards
Optimization, and Machine Learning
bathtub, 397–398
(Goldberg), 445
censoring, 399–403
geographic attributes, market basket
constant, 397, 416–417
analysis, 293
probabilities, 394–396
geographic information system
proportional
(GIS), 536
Cox, 410–411
geographical resources, 555–556
discussed, 408
geometric distance, automatic cluster
examples of, 409
detection, 360–361
limitations of, 411–412
gigabytes, 5
real-world example, 398–399
Gini, Corrado (Gini splitting criterion,
retention, 404–405
decision trees), 178
stratification, 410
GIS (geographic information
Hertzsprung-Russell diagram,
system), 536
automatic cluster detection,
goals, formulating, 605–606
352–354
Goldberg (Genetic Algorithms in
hidden distance fields, distance
Search, Optimization, and Machine
function, 278
Learning), 445
hidden layer, feed-forward neural
good customers, holding on to, 17–18
networks, 221, 227
good prospects, identifying, 88–89
hierarchical categories, products, 305
Goodman, Marc (projective
histograms
visualization), 206–208
data exploration, 565–566
graphical user interface (GUI), 535
discussed, 543
graphs
statistics and, 127
acyclic, 331
historical data
cyclic, 330–331
customer behaviors, 5
documentation as, 61
MBR (memory-based reasoning),
inconclusive survey responses, 46
262–263
inconsistent data, 593–594
neural networks, 219
index-based scores, 92–95
prediction tasks, 10
indicator variables, 554
hobbies, house-hold level data, 96
indirect relationships, customer
holdout groups, marketing
relationships, 453–454
campaigns, 106
industry revolution, 18
home-based businesses, 56
inexplicable rules, association rules,
house-hold level data, 96
297–298
hubs, link analysis, 332–334
information
hyperbolic tangent function, 223
competitive advantages, 14
hypothesis testing
data as, 22
confidence levels, 148
infomediaries, 14
considerations, 51
information brokers, supermarket
decision-making process, 50–51
chains as, 15–16
generating, 51
information gain, entropy, 178–180
market basket analysis, 51
information technology, data
null hypothesis, statistics and,
transformation, 58–60
125–126
as products, 14
recommendation-based businesses,
I
16–17
IBM, relational database management
Inmon, Bill (Building the Data
software, 13
Warehouse), 474
ID and key variables, 554
input columns, 547
ID3 (Iterative Dichotomiser 3), 190
input layer, feed-forward neural
identification
networks, 226
columns, 548
input variables, target fields, 37
customer signatures, 560–562
inputs/outputs, neural networks, 215
good prospects, 88–89
insourcing data mining, 524–525
problem management, 43
insurance claims, classification, 9
proof-of-concept projects, 599–601
interactive systems, response times, 33
identified versus anonymous
Internet resources
transactions, association rules, 308
customer response to marketing
identity distance, distance function, 271
campaigns, tracking, 109
ignored columns, 547
RuleQuest, 190
images, binary data, 557
U.S. Census Bureau, 94
imperfections, in data, 34
interval variables, 549, 552
implementation
interviews
neural networks, 212
business opportunities,
proof-of-concept projects, 601–605
identifying, 27
implicit parallelism, 438
proof-of-concept projects, 600
in-between relationships, customer
intrinsic information, splits, decision
relationships, 453
trees, 180
income, house-hold level data, 96
introduction, of products, 27
intuition, data exploration, 65
case study, 343–346
involuntary churn, 118–119, 521
classification, 9
item popularity, market basket
discussed, 321
analysis, 293
fax machines, 337–341
item sets, market basket analysis, 289
graphs
Iterative Dichotomiser 3 (ID3), 190
acyclic graphs, 331
communities of interest, 346
K
cyclic, 330–331
key and ID variables, 554
data as, 340
KDD (knowledge discovery in
directed graphs, 330
databases), 8
edges, 322
Kimball, Ralph (The Data Warehouse
graph-coloring algorithm, 340–341
Toolkit), 474
Hamiltonian path, 328
Kleinberg algorithm, link analysis,
nodes, 322
332–333
planar graphs, 323
K-means clustering, 354–358
traveling salesman problem,
knowledge discovery in databases
327–329
(KDD), 8
vertices, 322
Kolmogorov-Smirnov (KS) tests, 101
hubs, 332–334
Kleinberg algorithm, 332–333
L
root sets, 333
large-business relationships, customer
search programs, 331
relationship management, 3–4
stemming, 333
leaf nodes, classification, 167
weighted graphs, 322, 324
learning
linkage graphs, 77
opportunities, customer interactions,
lists, ordered and unordered, 239
520–521
literature, market research, 22
supervised, 57
logarithms, data transformation, 74
training techniques as, 231
logical schema, OLAP, 478
truthful sources, 48–50
logistic methods, box diagrams, 200
unsupervised, 57
long form, census data, 94
untruthful sources, 44–48
long-term trends, 75
life stages, customer relationships,
lookup tables, auxiliary information,
455–456
570–571
lifetime customer value, customer
loyalty
relationships, 32
customers, 520
lift ratio
loyalty programs
comparing models using, 81–82
marketing campaigns, 111
lift charts, 82, 84
welcome periods, 518
problems with, 83
luminosity, 351
linear processes, 55
linear regression, 139
M
link analysis
mailings
authorities, 333–334
marketing campaigns, 97
candidates, 333
non-response models, 35
marginal customers, 553
as statistical analysis