494 Chapter 15
The data warehouse is going to change and applications are going to continue to use it. The key to delivering success is controlling and managing the changes. The applications are for the end users. The data warehouse is there to support their data needs—not vice versa.
Business Users
Business users are the ultimate devourers of information derived from the corporate data warehouse. Their needs drive the development of applications, the architecture of the warehouse, the data it contains, and the priorities for implementation.
Many business users only experience the warehouse through printed reports, static online reports, or spreadsheets—basically the same way they have been gathering information for a long time. Even these users will experience the power of having a data warehouse as reports become more accurate, more consistent, and easier to produce.
More important, though, are the people who use the computers on their desks and are willing to take advantage of direct access to the data warehousing environment. Typically, these users access intermediate data marts to satisfy the vast majority of their information needs using friendly, graphical tools that run in their familiar desktop environment. These tools include off-the-shelf query generators, custom applications, OLAP interfaces, and report generation tools. On occasion, business users may drill down into the central repository to explore particularly interesting things they find in the data. More often, they will contact an analyst and have him or her do the heavier analytic work.
Business users also have applications built for specific purposes. These applications may even incorporate some of the data mining techniques discussed in previous chapters. For instance, a resource scheduling application might include an engine that optimizes the schedule using genetic algorithms.
A sales forecasting application may have built-in survival analysis models.
When embedded in an application, the data mining algorithms are usually quite well hidden from the end user, who cares more about the results than the algorithms that produced them.
Where Does OLAP Fit In?
The business world has been generating automated reports to meet business needs for many decades. Figure 15.4 shows a range of common reporting
470643 c15.qxd 3/8/04 11:20 AM Page 495
Data Warehousing, OLAP, and Data Mining 495
capabilities. The oldest manual methods are the mainframe report-generation tools whose output is traditionally printed on green bar paper or green screens. These mainframe reports automate paper-based methods that preceded computers. Producing such reports is often the primary function of IS
departments. Even minor changes to the reports require modifying code that sometimes dates back decades. The result is a lag between the time when a user requests changes and the time when he or she sees the new information that is measured in weeks and months. This is old technology that organizations are generally trying to move away from, except for the lowest-level reports that summarize specific operational systems.
The source of the data is
usually legacy mainframe
systems used for operations,
but it could be a data
warehouse.
Using processes, often too
cumbersome to
understand and too old to
change, operational data is
extracted and summarized.
Paper-based reports from
mainframe systems are
part of the business
process. They are usually
OLAP tools, based on multi
too late and too inflexible
dimensional cubes, give users
for decision support.
flexible and fast access to
Off-the-shelf query tools
data, both summarized and
provide users some access to
detail.
the data and the ability to form
their own queries.
Figure 15.4 Reporting requirements on operational systems are typically handled the same way they have been for decades. Is this the best way?
470643 c15.qxd 3/8/04 11:20 AM Page 496
496 Chapter 15
In the middle are off-the-shelf query generation packages that have become popular for accessing data in the past decade. These generate queries in SQL
and can talk to local or remote data sources using a standard protocol, such as the Open Database Connectivity (ODBC) standard. Such reports might be embedded in a spreadsheet, accessed through the Web, or through some other reporting interface. With a day or so of training, business analysts can usually generate the reports that they need. Of course, the report itself is often running as an SQL query on an already overburdened database, so response times are measured in minutes or hours, when the queries are even allowed to run to completion. These response times are much faster than the older report-generation packages, but they still make it difficult to exploit the data. The goal is to be able to ask a question and still remember the question when the answer comes back.
OLAP is a significant improvement over ad hoc query systems, because OLAP systems design the data structure with users in mind. This powerful and efficient representation is called a cube, which is ideally suited for slicing and dicing data. The cube itself is stored either in a relational database, typically using a star schema, or in a special multidimensional database that optimizes OLAP operations. In addition, OLAP tools provide handy analysis functions that are difficult or impossible to express in SQL. If OLAP tools have one downside, it is that business users start to focus only on the dimensions of data represented by the tool. Data mining, on the other hand, is particularly valuable for creative thinking.
Setting up the cube requires analyzing the data and the needs of the end users, which is generally done by specialists familiar with the data and the tool, through a process called dimensional modeling. Although designing and loading an OLAP system requires an initial investment, the result provides informative and fast access to end users, generally much more helpful than the results from a query-generation tool. Response times, once the cube has been built, are almost always measured in seconds, allowing users to explore data and drill down to understand interesting features that they encounter.
OLAP is a powerful enhancement to earlier reporting methods. Its power rests on three key features:
■■ First, a well-designed OLAP system has a set of relevant dimensions—
such as geography, product, and time—understandable to business
users. These dimensions often prove important for data mining
purposes.
■■
Second, a well-designed OLAP system has a set of useful measures relevant to the business.
■■
Third, OLAP systems allow users to slice and dice data, and sometimes to drill down to the customer level.
470643 c15.qxd 3/8/04 11:20 AM Page 497
Data Warehousing, OLAP, and Data Mining 497
T I P Quick response times are important for getting user acceptance of reporting systems. When users have to wait, they may forget the question that they asked. Interactive response times as experienced by end users should be in the range of 3–5 seconds.
These capabilities are complementary to data mining, but not a substitute for it. Nevertheless, OLAP is a very important (perhaps even the most important) part of the data warehouse architecture because it has the largest number of users.
What’s in a Cube?
A good way to approach OLAP is to think of data as a cube split into subcubes, as shown in Figure 15.5. Although this example uses three dimensions, OLAP
can have many more; three dimensions are useful for illustrative purposes.
This example shows a typical retailing cube that has one dimension for time, another for product, and a third for store. Each subcube contains various measures indicating what happened regarding that product in that store on that date, such as:
■■
Total number of items sold
■■
Total value of the items
■■
Total amount of discount on the items
■■
Inventory cost of the items
The measures are called facts. As a rule of thumb, dimensions consist of categorical variables and facts are numeric. As users slice and dice the data, they are aggregating facts from many different subcubes. The dimensions are used to determine exactly which subcubes are used in the query.
Even a simple cube such as the one described above is very powerful.
Figure 15.6 shows an example of summarizing data in the cube to answer the question “On how many days did a particular store not sell a particular product?” Such a question requires using the store and product dimension to determine which subcubes are used for the query. This question only looks at one fact, the number of items sold, and returns all the dates for which this value is 0. Here are some other questions that can be answered relatively easily:
■■
What was the total number of items sold in the past year?
■■
What were the year over year sales, by month, of stores in the Northeast?
■■
What was the overall margin for each store in November? (Margin
being the price paid by the customer minus the inventory cost.)
470643 c15.qxd 3/8/04 11:20 AM Page 498
498
Chapter 15
po
Sh
Date
Product
shop = Pinewood
product = 4
date = ‘7 Mar 2004’
count = 5
Dimension columns
value = $215
discount = $32
Aggregate columns
cost = $75
Figure 15.5 The cube used for OLAP is divided into subcubes. Each subcube contains the key for that subcube and summary information for the data falls into that subcube.