Berry M.J.A. – Data Mining Techniques For Marketing, Sales & Customer Relationship Management – Page 117 – Library. Read online. Free books read online. Read books without registering

482 Chapter 15

WHAT IS A RELATIONAL DATABASE? (continued)

A single transaction occurs at exactly one

vendor. But, each vendor may have multiple

transactions.

TRANSACTION TABLE

Transaction ID

Account ID

VENDOR TABLE

Vendor ID

Date

Vendor Name

Time

Vendor Type

One account has multiple

Amount

transactions, but each

Authorization Code

transaction is associated

with exactly one account.

A customer may have one or more

accounts. But each account belongs to

exactly one customer. Likewise, one or

more customers may be in a household.

ACCOUNT TABLE

Account ID

CUSTOMER TABLE

Customer ID

Household ID

Account Type

Interest Rate

Customer Name

Credit Limit

Date of Birth

Minimum Payment

Gender

Amount Due

FICO Score

Last Payment Amt

HOUSEHOLD TABLE

Household ID

Number of Children

ZIP Code

An E-R diagram can be used to show the tables and fields in a relational database.

Each box shows a single table and its columns. The lines between them show TEAMFLY

relationships, such as 1-many, 1-1, and many-to-many. Because each table corresponds to an entity, this is called a physical design.

Sometimes, the physical design of a database is very complicated. For instance, the TRANSACTION TABLE might actually be split into a separate table for each month of transactions. In this case, the above E-R diagram is still useful; it represents the logical structure of the data, as business users would understand it.

An entity relationship diagram describes the layout of data for a simple credit card database.

With respect to data mining, relational databases (and SQL) have some limitations. First, they provide little support for time series. This makes it hard to figure out from transaction data such things as the second product purchased, the last three promos a customer responded to, or the ordering of events; these can require very complicated SQL. Another problem is that two operations often eliminate fields inadvertently. When a field contains a missing value (NULL) then it automatically fails any comparison, even “not equals”.

Team-Fly®

470643 c15.qxd 3/8/04 11:20 AM Page 483

Data Warehousing, OLAP, and Data Mining 483

Also, the default join operation (called an inner join) eliminates rows that do not match, which means that customers may inadvertently be left out of a data pull. The set of operations in SQL is not particularly rich, especially for text fields and dates. The result is that every database vendor extends standard SQL

to include slightly different sets of functionality.

Database schema can also illuminate unusual findings in the data. For instance, we once worked with a file of call detail records in the United States that had city and state fields for the destination of every call. The file contained over two hundred state codes—that is a lot of states. What was happening? We learned that the city and state fields were never used by operational systems, so their contents were automatically suspicious—data that is not used is not likely to be correct. Instead of the city and state, all location information was derived from zip codes. These redundant fields were inaccurate because the state field was written first and the city field, with 14 characters, was written second. Longer city names overwrote the state field next to it. So, “WEST

PALM BEACH, FL” ended up putting the “H” in the state field, becoming

“WEST PALM BEAC, HL,” and “COLORADO SPRINGS, CO” became

“COLORADO SPRIN, GS.” Understanding the data layout helped us figure out this interesting but admittedly uncommon problem.

Metadata

Metadata goes beyond the database schema to let business users know what types of information are stored in the database. This is, in essence, documentation about the system, including information such as:

■■

The values legally allowed in each field

■■

A description of the contents of each field (for instance, is the start date the date of the sale or the date of activation)

■■

The date when the data was loaded

■■

An indication of how recently the data has been updated (when after the billing cycle does the billing data land in this system?)

■■

Mappings to other systems (the status code in table A is the status code field in table B in such-and-such source system)

When available, metadata provides an invaluable service. When not available, this type of information needs to be gleaned, usually from friendly database administrators and analysts—a perhaps inefficient use of everyone’s time. For a data warehouse, metadata provides discipline, since changes to the

470643 c15.qxd 3/8/04 11:20 AM Page 484

484 Chapter 15

warehouse must be reflected in the metadata to be communicated to users.

Overall, a good metadata system helps ensure the success of a data warehouse by making users more aware of and comfortable with the contents. For data miners, metadata provides valuable assistance in tracking down and understanding data.

Business Rules

The highest level of abstraction is business rules. These describe why relationships exist and how they are applied. Some business rules are easy to capture, because they represent the history of the business—what marketing campaigns took place when, what products were available when, and so on. Other types of rules are more difficult to capture and often lie buried deep inside code fragments and old memos. No one may remember why the fraud detection system ignores claims under $500. Presumably there was a good business reason, but the reason, the business rule, is often lost once the rule is embedded in computer code.

Business rules have a close relationship to data mining. Some data mining techniques, such as market basket analysis and decision trees, produce explicit rules. Often, these rules may already be known. For instance, learning that conference calling is sold with call waiting may not be interesting, since this feature is only sold as part of a bundle. Or a direct mail model response model that ends up targeting only wealthy areas may reflect the fact that the historical data used to build the model was biased, because the model set only had responders in these areas.

Discovering business rules in the data is both a success and a failure. Finding these rules is a successful application of sophisticated algorithms. However, in data mining, we want actionable patterns and such patterns are not actionable.

A General Architecture for Data Warehousing

The multitiered approach to data warehousing recognizes that data needs come in many different forms. It provides a comprehensive system for managing data for decision support. The major components of this architecture (see Figure 15.3) are:

■■

Source systems are where the data comes from.

■■

Extraction, transformation, and load (ETL) move data between different data stores.

470643 c15.qxd 3/8/04 11:20 AM Page 485

Data Warehousing, OLAP, and Data Mining 485

■■

The central repository is the main store for the data warehouse.

■■

The metadata repository describes what is available and where.

■■

Data marts provide fast, specialized access for end users and applications.

■■

Operational feedback integrates decision support back into the operational systems.

■■

End users are the reason for developing the warehouse in the first place.

End users are the raison d’etre of the data

warehouse. They act on the information

and knowledge gained from the data.

Networks using

Departmental data warehouses

standard protocols like

ODBC connect end

and metadata support

Meta-

users to the data.

applications used by end users.

data

The central data store is

a relational database

with a logical data model.

Central Repository

Extraction, transformation,

and load tools move data

between systems.

Operational systems are where the data

comes from. These are usually

mainframe or midrange systems.

External Data

Some data may be provided by external

vendors.

Figure 15.3 The multitiered approach to data warehousing includes a central repository, data marts, end-user tools, and tools that connect all these pieces together.

470643 c15.qxd 3/8/04 11:20 AM Page 486

486 Chapter 15

One or more of these components exist in virtually every system called a data warehouse. They are the building blocks of decision support throughout an enterprise. The following discussion of these components follows a dataflow approach. The data is like water. It originates in the source systems and flows through the components of the data warehouse ultimately to deliver information and value to end users. These components rest on a technological foundation consisting of hardware, software, and networks; this infrastructure must be sufficiently robust both to meet the needs of end users and to meet growing data and processing requirements.

Source Systems

Data originates in the source systems, typically operational systems and external data feeds. These are designed for operational efficiency, not for decision support, and the data reflects this reality. For instance, transactional data might be rolled off every few months to reduce storage needs. The same information might be represented in different ways. For example, one retail point-of-sale source system represented returned merchandise using a “returned item” flag. That is, except when the customer made a new purchase at the same time. In this case, there would be a negative amount in the purchase field. Such anomalies abound in the real world.