482 Chapter 15
WHAT IS A RELATIONAL DATABASE? (continued)
A single transaction occurs at exactly one
vendor. But, each vendor may have multiple
transactions.
TRANSACTION TABLE
Transaction ID
Account ID
VENDOR TABLE
Vendor ID
Vendor ID
Date
Vendor Name
Time
Vendor Type
One account has multiple
Amount
transactions, but each
Authorization Code
transaction is associated
with exactly one account.
A customer may have one or more
accounts. But each account belongs to
exactly one customer. Likewise, one or
more customers may be in a household.
ACCOUNT TABLE
Account ID
CUSTOMER TABLE
Customer ID
Customer ID
Household ID
Account Type
Interest Rate
Customer Name
Credit Limit
Date of Birth
Minimum Payment
Gender
Amount Due
FICO Score
Last Payment Amt
HOUSEHOLD TABLE
Household ID
Number of Children
ZIP Code
An E-R diagram can be used to show the tables and fields in a relational database.
Each box shows a single table and its columns. The lines between them show TEAMFLY
relationships, such as 1-many, 1-1, and many-to-many. Because each table corresponds to an entity, this is called a physical design.
Sometimes, the physical design of a database is very complicated. For instance, the TRANSACTION TABLE might actually be split into a separate table for each month of transactions. In this case, the above E-R diagram is still useful; it represents the logical structure of the data, as business users would understand it.
An entity relationship diagram describes the layout of data for a simple credit card database.
With respect to data mining, relational databases (and SQL) have some limitations. First, they provide little support for time series. This makes it hard to figure out from transaction data such things as the second product purchased, the last three promos a customer responded to, or the ordering of events; these can require very complicated SQL. Another problem is that two operations often eliminate fields inadvertently. When a field contains a missing value (NULL) then it automatically fails any comparison, even “not equals”.
Team-Fly®
470643 c15.qxd 3/8/04 11:20 AM Page 483
Data Warehousing, OLAP, and Data Mining 483
Also, the default join operation (called an inner join) eliminates rows that do not match, which means that customers may inadvertently be left out of a data pull. The set of operations in SQL is not particularly rich, especially for text fields and dates. The result is that every database vendor extends standard SQL
to include slightly different sets of functionality.
Database schema can also illuminate unusual findings in the data. For instance, we once worked with a file of call detail records in the United States that had city and state fields for the destination of every call. The file contained over two hundred state codes—that is a lot of states. What was happening? We learned that the city and state fields were never used by operational systems, so their contents were automatically suspicious—data that is not used is not likely to be correct. Instead of the city and state, all location information was derived from zip codes. These redundant fields were inaccurate because the state field was written first and the city field, with 14 characters, was written second. Longer city names overwrote the state field next to it. So, “WEST
PALM BEACH, FL” ended up putting the “H” in the state field, becoming
“WEST PALM BEAC, HL,” and “COLORADO SPRINGS, CO” became
“COLORADO SPRIN, GS.” Understanding the data layout helped us figure out this interesting but admittedly uncommon problem.
Metadata
Metadata goes beyond the database schema to let business users know what types of information are stored in the database. This is, in essence, documentation about the system, including information such as:
■■
The values legally allowed in each field
■■
A description of the contents of each field (for instance, is the start date the date of the sale or the date of activation)
■■
The date when the data was loaded
■■
An indication of how recently the data has been updated (when after the billing cycle does the billing data land in this system?)
■■
Mappings to other systems (the status code in table A is the status code field in table B in such-and-such source system)
When available, metadata provides an invaluable service. When not available, this type of information needs to be gleaned, usually from friendly database administrators and analysts—a perhaps inefficient use of everyone’s time. For a data warehouse, metadata provides discipline, since changes to the
470643 c15.qxd 3/8/04 11:20 AM Page 484
484 Chapter 15
warehouse must be reflected in the metadata to be communicated to users.
Overall, a good metadata system helps ensure the success of a data warehouse by making users more aware of and comfortable with the contents. For data miners, metadata provides valuable assistance in tracking down and understanding data.
Business Rules
The highest level of abstraction is business rules. These describe why relationships exist and how they are applied. Some business rules are easy to capture, because they represent the history of the business—what marketing campaigns took place when, what products were available when, and so on. Other types of rules are more difficult to capture and often lie buried deep inside code fragments and old memos. No one may remember why the fraud detection system ignores claims under $500. Presumably there was a good business reason, but the reason, the business rule, is often lost once the rule is embedded in computer code.
Business rules have a close relationship to data mining. Some data mining techniques, such as market basket analysis and decision trees, produce explicit rules. Often, these rules may already be known. For instance, learning that conference calling is sold with call waiting may not be interesting, since this feature is only sold as part of a bundle. Or a direct mail model response model that ends up targeting only wealthy areas may reflect the fact that the historical data used to build the model was biased, because the model set only had responders in these areas.
Discovering business rules in the data is both a success and a failure. Finding these rules is a successful application of sophisticated algorithms. However, in data mining, we want actionable patterns and such patterns are not actionable.
A General Architecture for Data Warehousing
The multitiered approach to data warehousing recognizes that data needs come in many different forms. It provides a comprehensive system for managing data for decision support. The major components of this architecture (see Figure 15.3) are:
■■
Source systems are where the data comes from.
■■
Extraction, transformation, and load (ETL) move data between different data stores.
470643 c15.qxd 3/8/04 11:20 AM Page 485
Data Warehousing, OLAP, and Data Mining 485
■■
The central repository is the main store for the data warehouse.
■■
The metadata repository describes what is available and where.
■■
Data marts provide fast, specialized access for end users and applications.
■■
Operational feedback integrates decision support back into the operational systems.
■■
End users are the reason for developing the warehouse in the first place.
End users are the raison d’etre of the data
warehouse. They act on the information
and knowledge gained from the data.
Networks using
Departmental data warehouses
standard protocols like
ODBC connect end
and metadata support
Meta-
users to the data.
applications used by end users.
data
The central data store is
a relational database
with a logical data model.
Central Repository
Extraction, transformation,
and load tools move data
between systems.
Operational systems are where the data
comes from. These are usually
mainframe or midrange systems.
External Data
Some data may be provided by external
vendors.
Figure 15.3 The multitiered approach to data warehousing includes a central repository, data marts, end-user tools, and tools that connect all these pieces together.
470643 c15.qxd 3/8/04 11:20 AM Page 486
486 Chapter 15
One or more of these components exist in virtually every system called a data warehouse. They are the building blocks of decision support throughout an enterprise. The following discussion of these components follows a dataflow approach. The data is like water. It originates in the source systems and flows through the components of the data warehouse ultimately to deliver information and value to end users. These components rest on a technological foundation consisting of hardware, software, and networks; this infrastructure must be sufficiently robust both to meet the needs of end users and to meet growing data and processing requirements.
Source Systems
Data originates in the source systems, typically operational systems and external data feeds. These are designed for operational efficiency, not for decision support, and the data reflects this reality. For instance, transactional data might be rolled off every few months to reduce storage needs. The same information might be represented in different ways. For example, one retail point-of-sale source system represented returned merchandise using a “returned item” flag. That is, except when the customer made a new purchase at the same time. In this case, there would be a negative amount in the purchase field. Such anomalies abound in the real world.