[Figure with three panels: Uniprocessor, SMP, and MPP. In the diagrams, P marks a processing unit and M marks memory; the units are joined by a bus (uniprocessor, SMP) or by a high-speed network (MPP).]

Uniprocessor: A simple computer follows the architecture laid out by Von Neumann. A processing unit communicates with memory and disk over a local bus. (Memory stores both data and the executable program.) The speed of the processor, bus, and memory limits performance and scalability.

SMP: The symmetric multiprocessor (SMP) has a shared-everything architecture. It expands the capabilities of the bus to support multiple processors, more memory, and a larger disk. The capacity of the bus limits performance and scalability. SMP architectures usually max out with fewer than 20 processing units.

MPP: The massively parallel processor (MPP) has a shared-nothing architecture. It introduces a high-speed network (also called a switch) that connects independent processor/memory/disk components. MPP architectures are very scalable, but fewer software packages can take advantage of all the hardware.

Figure caption: Parallel computers build on the basic Von Neumann uniprocessor architecture. SMP and MPP systems are scalable because more processing units, disk drives, and memory can be added to the system.
Data warehousing is a process for managing the decision-support system of record. A process is something that can adjust to users’ needs as they are clarified and change over time. A process can respond to changes in the business as needs change over time. The central repository itself is going to be a brittle, little-used system without the realization that as users learn about data and about the business, they are going to want changes and enhancements on the
time scale of marketing (days and weeks) rather than on the time scale of IT
(months).
Metadata Repository
We have already discussed metadata in the context of the data hierarchy. It can also be considered a component of the data warehouse. As such, the metadata repository is an often overlooked component of the data warehousing environment. The lowest level of metadata is the database schema, the physical layout of the data. When used correctly, though, metadata is much more. It answers questions posed by end users about the availability of data, gives them tools for browsing through the contents of the data warehouse, and gives everyone more confidence in the data. This confidence is the basis for new applications and an expanded user base.
A good metadata system should include the following:
■■ The annotated logical data model. The annotations should explain the entities and attributes, including valid values.
■■ Mapping from the logical data model to the source systems.
■■ The physical schema.
■■ Mapping from the logical model to the physical schema.
■■ Common views and formulas for accessing the data. What is useful to one user may be useful to others.
■■ Load and update information.
■■ Security and access information.
■■ Interfaces for end users and developers, so they share the same description of the database.
In any data warehousing environment, each of these pieces of information is available somewhere—in scripts written by the DBA, in email messages, in documentation, in the system tables in the database, and so on. A metadata repository makes this information available to the users, in a format they can readily understand. The key is giving users access so they feel comfortable with the data warehouse, with the data it contains, and with knowing how to use it.
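As a rough illustration, several of the items a good metadata system should include could be collected into one record per attribute. The following is a minimal Python sketch, not a description of any particular product; the class name, field names, and the sample `customer.status` entry are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AttributeMetadata:
    """One metadata repository entry for a logical attribute (hypothetical layout)."""
    entity: str            # logical entity, e.g. "customer"
    attribute: str         # logical attribute name
    description: str       # annotation explaining the attribute
    valid_values: dict     # valid value -> meaning
    source_system: str     # mapping from the logical model to a source system
    physical_column: str   # mapping from the logical model to the physical schema
    last_loaded: str       # load and update information
    allowed_roles: list = field(default_factory=list)  # security and access


def describe(entry: AttributeMetadata) -> str:
    """Render an entry in a form an end user can browse."""
    lines = [f"{entry.entity}.{entry.attribute}: {entry.description}"]
    for value, meaning in sorted(entry.valid_values.items()):
        lines.append(f"  {value} = {meaning}")
    lines.append(f"  stored in {entry.physical_column}, loaded {entry.last_loaded}")
    return "\n".join(lines)


# Hypothetical sample entry, not taken from any real system.
status = AttributeMetadata(
    entity="customer",
    attribute="status",
    description="Current state of the customer relationship",
    valid_values={"A": "active", "C": "closed"},
    source_system="billing",
    physical_column="cust_dim.status_cd",
    last_loaded="2004-03-01",
    allowed_roles=["analyst", "developer"],
)
print(describe(status))
```

Keeping the description, valid values, and mappings in one place is what lets end users and developers share the same description of the database.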
Data Marts
Data warehouses do not actually do anything (except store and retrieve data effectively). Applications are needed to realize value, and these often take the form of data marts. A data mart is a specialized system that brings together the data needed for a department or related applications. Data marts are often used for reporting systems and slicing-and-dicing data. Such data marts often use OLAP technology, which is discussed later in this chapter. Another
important type of data mart is an exploratory environment used for data mining, which is discussed in the next chapter.
Not all the data in data marts needs to come from the central repository. Often specific applications have an exclusive need for data. The real estate department, for instance, might be using geographic information in combination with data from the central repository. The marketing department might be combining zip code demographics with customer data from the central repository. The central repository only needs to contain data that is likely to be shared among different applications, so it is just one data source—usually the dominant one—for data marts.
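The marketing example can be sketched in a few lines of Python. This is only an illustration: the customer rows, the zip codes, and the income figures are invented, and `build_mart` is a hypothetical helper rather than part of any tool.

```python
# Rows as they might arrive from the central repository (hypothetical values).
customers = [
    {"customer_id": 1, "zip": "02114", "tenure_months": 30},
    {"customer_id": 2, "zip": "10011", "tenure_months": 4},
    {"customer_id": 3, "zip": "99999", "tenure_months": 12},
]

# Department-specific data that never passes through the central repository
# (hypothetical values).
zip_demographics = {
    "02114": {"median_income": 78000},
    "10011": {"median_income": 104000},
}


def build_mart(customer_rows, demographics):
    """Left-join customer rows to zip code demographics."""
    mart = []
    for row in customer_rows:
        # Customers in zip codes without demographics keep a None placeholder.
        extra = demographics.get(row["zip"], {"median_income": None})
        mart.append({**row, **extra})
    return mart


marketing_mart = build_mart(customers, zip_demographics)
for row in marketing_mart:
    print(row)
```

The central repository supplies the customer rows, while the demographic lookup belongs to the marketing data mart alone.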
Operational Feedback
Operational feedback systems integrate data-driven decisions back into the operational systems. For instance, a large bank may develop cross-sell models to determine which product to offer a customer next. These models are the result of a data mining system. To be useful, however, this information needs to go back into the operational systems, which requires a connection back from the decision-support infrastructure into the operational infrastructure.
Operational feedback offers the capability to complete the virtuous cycle of data mining very quickly. Once a feedback system is set up, intervention is only needed for monitoring and improving it, letting computers do what they do best (repetitive tasks) and letting people do what they do best (spot interesting patterns and come up with ideas). One of the advantages of Web-based businesses is that they can, in theory, provide such feedback to their operational systems in a fully automated way.
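A minimal sketch of such a feedback step, in Python, might look like the following. It assumes cross-sell scores are already available per customer and product; the score values, the record layout, and the `push_next_offer` function are hypothetical stand-ins for whatever the operational system actually provides.

```python
# Hypothetical scores from a cross-sell model: customer -> product -> score.
model_scores = {
    101: {"credit_card": 0.62, "mortgage": 0.15, "brokerage": 0.40},
    102: {"credit_card": 0.05, "mortgage": 0.55, "brokerage": 0.30},
}

# Stand-in for the operational system's customer records.
operational_records = {
    101: {"name": "customer 101", "owns": {"credit_card"}},
    102: {"name": "customer 102", "owns": set()},
}


def push_next_offer(scores, records):
    """Write the best-scoring product not already owned onto each record."""
    for customer_id, product_scores in scores.items():
        record = records.get(customer_id)
        if record is None:
            continue  # customer unknown to the operational system
        candidates = {
            product: score
            for product, score in product_scores.items()
            if product not in record["owns"]
        }
        record["next_offer"] = (
            max(candidates, key=candidates.get) if candidates else None
        )


push_next_offer(model_scores, operational_records)
for customer_id, record in operational_records.items():
    print(customer_id, record["next_offer"])
```

Once a step like this runs on a schedule, people only need to monitor the offers and improve the model behind them.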
End Users and Desktop Tools
The end users are the final and most important component in any data warehouse. A system that has no users is not worth building. These end users are analysts looking for information, application developers, and business users who act on the information.
Analysts
Analysts want to access as much data as possible to discern patterns and create ad hoc reports. They use special-purpose tools, such as statistics packages, data mining tools, and spreadsheets. Often, analysts are considered to be the primary audience for data warehouses.
Usually, though, there are just a few technically sophisticated people who fall into this category. Although the work that they do is important, it is difficult to justify a large investment based on increases in their productivity. The virtuous cycle of data mining comes into play here. A data warehouse brings together data in a cleansed, meaningful format. The purpose, though, is to spur creativity, a very hard concept to measure.
Analysts have very specific demands on a data warehouse:
■■ The system has to be responsive. Too much of the work of analysis consists of answering urgent questions through ad hoc analysis or ad hoc queries.
■■ Data needs to be consistent across the database. That is, if a customer started on a particular date, then the first occurrence of a product, channel, and so on should be exactly on that date.
■■ Data needs to be consistent across time. A field that has a particular meaning now should have the same meaning going back in time. At the very least, differences should be well documented.
■■ It must be possible to drill down to customer level and preferably to the transaction level detail to verify values in the data warehouse and to develop new summaries of customer behavior.
Analysts place a heavy load on data warehouses, and need access to consistent information in a timely manner.
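The requirement that data be consistent across the database can be checked mechanically. The sketch below, in Python, compares each customer's recorded start date with the earliest date on which any product appears for that customer. The tables, dates, and the function name are hypothetical.

```python
from datetime import date

# Hypothetical customer table: customer_id -> recorded start date.
customer_start = {
    1: date(2003, 1, 15),
    2: date(2003, 6, 1),
}

# Hypothetical product records: (customer_id, product, first active date).
product_history = [
    (1, "checking", date(2003, 1, 15)),
    (1, "savings", date(2003, 9, 2)),
    (2, "checking", date(2003, 6, 20)),
]


def find_start_date_mismatches(starts, history):
    """Return customers whose earliest product date differs from their start date."""
    earliest = {}
    for customer_id, _product, first_date in history:
        if customer_id not in earliest or first_date < earliest[customer_id]:
            earliest[customer_id] = first_date
    return {
        customer_id: (start, earliest.get(customer_id))
        for customer_id, start in starts.items()
        if earliest.get(customer_id) != start
    }


mismatches = find_start_date_mismatches(customer_start, product_history)
print(mismatches)
```

Each reported customer maps to a pair of dates (recorded start, earliest product) that an analyst can then drill into at the transaction level.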
Application Developers
Data warehouses usually support a wide range of applications (in other words, data marts come in many flavors). In order to develop stable and robust applications, developers have some specific needs from the data warehouse.
First, the applications they are developing need to be shielded from changes in the structure of the data warehouse. New tables, new fields, and reorganizations of existing tables should have a minimal impact on existing applications. Special application-specific views on the data help provide this assurance. In addition, open communication and knowledge about which applications use which attributes and entities can prevent development gridlock.
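The shielding effect of an application-specific view can be shown with a small sketch using Python's standard `sqlite3` module. The table names, column names, and the view `app_customer` are hypothetical; the point is only that the application's query stays the same while the tables underneath it are reorganized.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Original physical table and the application-specific view built on it.
conn.execute("CREATE TABLE customer (id INTEGER, full_name TEXT, region TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'Pat Lee', 'NE')")
conn.execute(
    "CREATE VIEW app_customer AS SELECT id, full_name AS name FROM customer"
)

# The only query the application ever issues.
APP_QUERY = "SELECT id, name FROM app_customer ORDER BY id"
before = conn.execute(APP_QUERY).fetchall()

# The warehouse is reorganized: names now live in a differently shaped table.
conn.execute(
    "CREATE TABLE customer_name (id INTEGER, first_name TEXT, last_name TEXT)"
)
conn.execute("INSERT INTO customer_name VALUES (1, 'Pat', 'Lee')")
conn.execute("DROP VIEW app_customer")
conn.execute("DROP TABLE customer")

# Only the view definition changes; the application query is untouched.
conn.execute(
    "CREATE VIEW app_customer AS "
    "SELECT id, first_name || ' ' || last_name AS name FROM customer_name"
)

after = conn.execute(APP_QUERY).fetchall()
print(before == after)
```

The cost of the reorganization is confined to redefining the view, which is the warehouse team's job rather than the application developer's.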
Second, the developers need access to valid field values and to know what the values mean. This is the purpose of the metadata repository, which provides documentation on the structure of the data. By setting up the application to verify data values against expected values in the metadata, developers can circumvent problems that often appear only after applications have rolled out.
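A check of field values against the metadata could be sketched as follows in Python. The valid-value sets, the sample rows, and the `invalid_fields` helper are hypothetical; a real application would read the expected values from the metadata repository rather than from a literal dictionary.

```python
# Hypothetical metadata: field name -> set of valid values.
valid_values = {
    "status": {"A", "C"},
    "channel": {"web", "phone", "branch"},
}


def invalid_fields(row, metadata):
    """Return the fields in row whose value is not listed in the metadata."""
    return {
        name: value
        for name, value in row.items()
        if name in metadata and value not in metadata[name]
    }


# Hypothetical rows read from the warehouse.
rows = [
    {"customer_id": 1, "status": "A", "channel": "web"},
    {"customer_id": 2, "status": "X", "channel": "kiosk"},
]

problems = {row["customer_id"]: invalid_fields(row, valid_values) for row in rows}
for customer_id, bad in problems.items():
    print(customer_id, bad)
```

Running such a check at load time or at application startup surfaces unexpected values before users encounter them.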
The developers also need to provide feedback on the structure of the data warehouse. This is one of the principal means of improving the warehouse, by identifying new data that needs to be included in the warehouse and by fixing problems with data already loaded. Since real business needs drive the development of applications, understanding the needs of developers is important to ensure that a data warehouse contains the data it needs to deliver business value.