Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management – Page 116 – Library. Read online. Free books read online. Read books without registering

470643 c15.qxd 3/8/04 11:20 AM Page 478

478 Chapter 15

effort. One of the goals of data warehousing is to provide consistent definitions and layouts so similar reports produce similar results, no matter which business user is producing them or when they are produced. This chapter is mostly concerned with this level of abstraction.

In one sense, summaries seem to destroy information as they aggregate things. For this reason, different summaries are useful for different purposes.

Point-of-sale transactions may capture every can of sardines that goes over the scanner, but only summaries begin to describe the shopper’s behavior in terms of her habitual time of day to shop and the proportion of her dollars spent in the canned food department. In this case, the customer summary seems to be creating information.

WA R N I N G Do not expect customer-level data warehouse information to balance exactly against financial systems (although the two systems should be close). Although theoretically possible, such balancing can prove very difficult and distract from the purpose of the data warehouse.

Database Schema

So far, the discussion has been on data. The structure of data is also important—

what data is stored, where it is stored, what is not stored, and so on. The sidebar “What is a relational database?” explains the key ideas behind relational databases, the most common systems for storing large amounts of data.

No matter how the data is stored, it is important to distinguish between two ways of describing the storage. The physical schema describes the layout in the technical detail needed by the underlying software. An example is the “CREATE

TABLE” statement in SQL. A logical schema, on the other hand, describes the data in a way more accessible to end users. The two are not necessarily the same, nor even similar, as shown in Figure 15.2.

WA R N I N G The existence of fields in a database does not mean that the data is actually present. It is important to understand every field used for data mining, and not to assume that a field populated correctly just because it exists. Skepticism is your ally.

An analogy might help to understand the utility of the physical and logical schemas. The logical schema describes things in a way that is familiar to business users. This would be analogous to saying that a house is ranch style, with four bedrooms, three baths, and a two-car garage. The physical schema goes

470643 c15.qxd 3/8/04 11:20 AM Page 479

Data Warehousing, OLAP, and Data Mining 479

into more detail about how it is laid out. The foundation is reinforced concrete, 4 feet deep; the slab is 1,500 square feet; the walls are concrete block; and so on.

The details of construction, although useful and complete, may not help a family find the right house.

Logical Model

This logical model has four entities, three

for customer-generated events and one for

COMPLAINT

accounts.

ACCT_ID

COMPLAINT_CODE

The logical model is intended to be

REFUND_AMOUNT

understood by business users.

. . .

COMMENT

ACCT

ACCT_ID

FIRST_NAME

COMMENT_CODE

LAST_NAME

COMMENT_TEXT

. . .

This symbol means a product

change has exactly one account

PRODUCT CHANGE

ACCT_ID

OLD_PROD

This symbol means an account

NEW_PROD

might have 0 or more product

changes

. . .

Physical Model

TABLE: CONTACT

Information from all four entities in the logical

ACCT_ID

model is found in the contact table.

CONTACT_TYPE

CONTACT_DATE

The different types of contact are differentiated

COMPLAINT_CODE

using the CONTACT_TYPE field.

REFUND_AMOUNT

OLD_PROD

The physical model also specifies exact types,

NEW_PROD

partitioning, indexes, storage characteristics,

COMMENT_TYPE

degrees of parallelism, constraints on values,

COMMENT_TEXT

and may other things not of interest to the

. . .

business user.

Figure 15.2 The physical and logical schema may not be related to each other.

470643 c15.qxd 3/8/04 11:20 AM Page 480

480 Chapter 15

WHAT IS A RELATIONAL DATABASE?

One of the most common ways to store data is in a relational database management system (RDBMS). The basis of relational databases starts with research by E. F. Codd in the early 1970s on the properties of a special type of set composed of tuples—what we would call rows in tables. From this, he derived a relational algebra consisting of operations that form a relational algebra, which are depicted in the following figure:

Before

After

row

col A

col B

col C col D

col E

colF

row

col A

col B col C

col D

col E

colF

Filter

001

002

003

Filtering removes rows based

004

005

006

on the values in one or more

007

008

columns. Each output row

009

010

either is or is not in the input

011

012

table.

row

col A

col B

col C col D

col E

colF

Select

row

col A

col B col C

col D

col E

colF

new

001

002

003

Selecting chooses the columns

003

004

005

006

for the output. Each column in

006

007

008

the output is in the input or is a

008

009

010

function of some of the input

010

011

012

columns.

Aggregation (or Group by)

row

col A

col B col C col D

col E

colF

001

key1

002

key1

Aggregation groups columns

003

key2

004

col A

avg B max B sum D

sum E

sum F

key2

together based on a common

005

key1

006

key2

007

key3

key. All the rows with the same

008

key3

key4

009

key3

key are summarized into a

010

key4

011

key4

single output row.

012

key4

row

col A

col B

col C

001

key1

Join

col A

col B col C

col G

002

key1

003

key2

row

col A

col G

key1

004

key2

001

key1

key3

005

key2

002

key3

006

key2

Join matches rows in two

003

key4

key3

007

key3

004

key4

008

key3

key4

tables. For every pair of rows

009

key3

key4

010

key4

whose keys match in the inputs,

011

key4

012

key4

a new row is created in the

output.

Relational databases have four major querying operations.

470643 c15.qxd 3/8/04 11:20 AM Page 481

Data Warehousing, OLAP, and Data Mining 481

These operations are in addition to set operations, such as union and intersection. In nonscientific terminology, these relational operations are:

Filter a given set of rows based on the values in the rows.

Select a given set of columns and perform basic operations on them.

Group rows together and aggregate values in the columns.

Join two tables together based on the values in the columns.

Interestingly, the relational operations do not include sorting (except for output purposes). These operations specify what can be done with tuples, not

how it gets done. In fact, relational databases often use sorting for grouping and joining operations; however, there are non-sort-based algorithms for these operations as well.

SQL, developed by IBM in the 1980s, has become the standard language for accessing relational databases and implements these basic operations.

Because SQL supports subqueries (that is, using the results of one query as a table in another query), it is possible to express some very complex data manipulations.

A common way of representing the database structure is to use an entity-relationship (E-R) diagram. The following figure is a simple E-R diagram with five entities and four relationships among them. In this case, each entity corresponds to a separate table with columns corresponding to the attributes of the entity. In addition, columns represent the relationships between tables in the database; such columns are called keys (either foreign or primary keys).

Explicitly storing keys in the database tables using a consistent naming convention facilitates finding one’s way around the database.

One nice feature of relational databases is the ability to design a database so that any given data item appears in exactly one place—with no duplication.

Such a database is called a normalized database. Knowing exactly where each data item is located is highly efficient in theory, since updating any field requires modifying only one row in one table. When a normalized database is well-designed and implemented, there is no redundant data, out-of-date data, or invalid data.

An important idea behind normalization is creating reference tables. Each reference table logically corresponds to an entity, and each has a key used for looking up information about the entity. In a normalized database, the “join”

operation is used to lookup values in reference tables.

Relational databases are a powerful way of storing and accessing data.

However, much of their design is focused on updating the data and handling large numbers of transactions. Data mining is interested in combining data together to spot higher level patterns.. Typically, data mining uses many queries, each of which requires several joins, several aggregations, and subqueries—a veritable army of killer queries.

(continued)

470643 c15.qxd 3/8/04 11:20 AM Page 482