470643 c15.qxd 3/8/04 11:20 AM Page 478
478 Chapter 15
effort. One of the goals of data warehousing is to provide consistent definitions and layouts so similar reports produce similar results, no matter which business user is producing them or when they are produced. This chapter is mostly concerned with this level of abstraction.
In one sense, summaries seem to destroy information as they aggregate things. For this reason, different summaries are useful for different purposes.
Point-of-sale transactions may capture every can of sardines that goes over the scanner, but only summaries begin to describe the shopper’s behavior in terms of her habitual time of day to shop and the proportion of her dollars spent in the canned food department. In this case, the customer summary seems to be creating information.
WA R N I N G Do not expect customer-level data warehouse information to balance exactly against financial systems (although the two systems should be close). Although theoretically possible, such balancing can prove very difficult and distract from the purpose of the data warehouse.
Database Schema
So far, the discussion has been on data. The structure of data is also important—
what data is stored, where it is stored, what is not stored, and so on. The sidebar “What is a relational database?” explains the key ideas behind relational databases, the most common systems for storing large amounts of data.
No matter how the data is stored, it is important to distinguish between two ways of describing the storage. The physical schema describes the layout in the technical detail needed by the underlying software. An example is the “CREATE
TABLE” statement in SQL. A logical schema, on the other hand, describes the data in a way more accessible to end users. The two are not necessarily the same, nor even similar, as shown in Figure 15.2.
WA R N I N G The existence of fields in a database does not mean that the data is actually present. It is important to understand every field used for data mining, and not to assume that a field populated correctly just because it exists. Skepticism is your ally.
An analogy might help to understand the utility of the physical and logical schemas. The logical schema describes things in a way that is familiar to business users. This would be analogous to saying that a house is ranch style, with four bedrooms, three baths, and a two-car garage. The physical schema goes
470643 c15.qxd 3/8/04 11:20 AM Page 479
Data Warehousing, OLAP, and Data Mining 479
into more detail about how it is laid out. The foundation is reinforced concrete, 4 feet deep; the slab is 1,500 square feet; the walls are concrete block; and so on.
The details of construction, although useful and complete, may not help a family find the right house.
Logical Model
This logical model has four entities, three
for customer-generated events and one for
COMPLAINT
accounts.
ACCT_ID
COMPLAINT_CODE
The logical model is intended to be
REFUND_AMOUNT
understood by business users.
. . .
COMMENT
ACCT
ACCT_ID
FIRST_NAME
COMMENT_CODE
LAST_NAME
COMMENT_TEXT
. . .
. . .
This symbol means a product
change has exactly one account
PRODUCT CHANGE
ACCT_ID
OLD_PROD
This symbol means an account
NEW_PROD
might have 0 or more product
changes
. . .
Physical Model
TABLE: CONTACT
Information from all four entities in the logical
ACCT_ID
model is found in the contact table.
CONTACT_TYPE
CONTACT_DATE
The different types of contact are differentiated
COMPLAINT_CODE
using the CONTACT_TYPE field.
REFUND_AMOUNT
OLD_PROD
The physical model also specifies exact types,
NEW_PROD
partitioning, indexes, storage characteristics,
COMMENT_TYPE
degrees of parallelism, constraints on values,
COMMENT_TEXT
and may other things not of interest to the
. . .
business user.
Figure 15.2 The physical and logical schema may not be related to each other.
470643 c15.qxd 3/8/04 11:20 AM Page 480
480 Chapter 15
WHAT IS A RELATIONAL DATABASE?
One of the most common ways to store data is in a relational database management system (RDBMS). The basis of relational databases starts with research by E. F. Codd in the early 1970s on the properties of a special type of set composed of tuples—what we would call rows in tables. From this, he derived a relational algebra consisting of operations that form a relational algebra, which are depicted in the following figure:
Before
After
row
col A
col B
col C col D
col E
colF
row
col A
col B col C
col D
col E
colF
Filter
001
001
002
002
003
003
Filtering removes rows based
004
004
005
005
006
006
on the values in one or more
007
007
008
008
columns. Each output row
009
009
010
010
either is or is not in the input
011
011
012
012
table.
row
col A
col B
col C col D
col E
colF
Select
row
col A
col B col C
col D
col E
colF
new
001
001
002
002
003
Selecting chooses the columns
003
004
004
005
005
006
for the output. Each column in
006
007
007
008
the output is in the input or is a
008
009
009
010
function of some of the input
010
011
011
012
012
columns.
Aggregation (or Group by)
row
col A
col B col C col D
col E
colF
001
key1
002
key1
Aggregation groups columns
003
key2
004
col A
avg B max B sum D
sum E
sum F
key2
key2
together based on a common
005
key1
006
key2
key2
007
key3
key3
key. All the rows with the same
008
key3
key4
009
key3
key are summarized into a
010
key4
011
key4
single output row.
012
key4
row
col A
col B
col C
001
key1
Join
col A
col B col C
col G
002
key1
key1
003
key2
row
col A
col G
key1
004
key2
001
key1
key3
005
key2
002
key3
key3
006
key2
Join matches rows in two
003
key4
key3
007
key3
004
key4
key4
008
key3
key4
tables. For every pair of rows
009
key3
key4
010
key4
key4
whose keys match in the inputs,
011
key4
key4
012
key4
key4
a new row is created in the
output.
Relational databases have four major querying operations.
470643 c15.qxd 3/8/04 11:20 AM Page 481
Data Warehousing, OLAP, and Data Mining 481
These operations are in addition to set operations, such as union and intersection. In nonscientific terminology, these relational operations are:
Filter a given set of rows based on the values in the rows.
Select a given set of columns and perform basic operations on them.
Group rows together and aggregate values in the columns.
Join two tables together based on the values in the columns.
Interestingly, the relational operations do not include sorting (except for output purposes). These operations specify what can be done with tuples, not
how it gets done. In fact, relational databases often use sorting for grouping and joining operations; however, there are non-sort-based algorithms for these operations as well.
SQL, developed by IBM in the 1980s, has become the standard language for accessing relational databases and implements these basic operations.
Because SQL supports subqueries (that is, using the results of one query as a table in another query), it is possible to express some very complex data manipulations.
A common way of representing the database structure is to use an entity-relationship (E-R) diagram. The following figure is a simple E-R diagram with five entities and four relationships among them. In this case, each entity corresponds to a separate table with columns corresponding to the attributes of the entity. In addition, columns represent the relationships between tables in the database; such columns are called keys (either foreign or primary keys).
Explicitly storing keys in the database tables using a consistent naming convention facilitates finding one’s way around the database.
One nice feature of relational databases is the ability to design a database so that any given data item appears in exactly one place—with no duplication.
Such a database is called a normalized database. Knowing exactly where each data item is located is highly efficient in theory, since updating any field requires modifying only one row in one table. When a normalized database is well-designed and implemented, there is no redundant data, out-of-date data, or invalid data.
An important idea behind normalization is creating reference tables. Each reference table logically corresponds to an entity, and each has a key used for looking up information about the entity. In a normalized database, the “join”
operation is used to lookup values in reference tables.
Relational databases are a powerful way of storing and accessing data.
However, much of their design is focused on updating the data and handling large numbers of transactions. Data mining is interested in combining data together to spot higher level patterns.. Typically, data mining uses many queries, each of which requires several joins, several aggregations, and subqueries—a veritable army of killer queries.
(continued)
470643 c15.qxd 3/8/04 11:20 AM Page 482