Building the Data Warehouse by Inmon

Chapter 5: The Data Warehouse and Technology

http://it-slideshares.blogspot.com/

5.0 Overview

The data warehouse requires a simpler set of technological features than its operational predecessors:
- Online updating: not needed.
- Locking and integrity: needs are minimal.
- Teleprocessing interface: only a very basic one is required.

This chapter outlines some of the technological requirements for the data warehouse.

MANAGING LARGE AMOUNTS OF DATA

1. Manage volumes
2. Manage multiple media technology
3. Index and monitor data
4. Interface to retrieve and pass data

Managing Multiple Media


Following is a hierarchy of storage of data in terms of speed of access and cost of storage:
Medium            Speed of access   Cost of storage
Main memory       Very fast         Very expensive
Expanded memory   Very fast         Expensive
Cache             Very fast         Expensive
DASD              Fast              Moderate
Magnetic tape     Not fast          Not expensive
Near line         Not fast*         Not expensive
Optical disk      Not slow          Not expensive
Fiche             Slow              Cheap

*Not fast to find first record sought; very fast to find all other records in the block.

Indexing and Monitoring Data


Monitoring data warehouse data determines such factors as the following:
- If a reorganization needs to be done
- If an index is poorly structured
- If too much or not enough data is in overflow
- The statistical composition of the access of the data
- Available remaining space
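The monitoring factors listed above can be sketched as a small routine. This is only an illustration, not a feature of any particular DBMS: the `TableStats` fields and both threshold values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    """Hypothetical statistics a monitor might collect for one table."""
    rows_in_primary: int    # rows stored in their home blocks
    rows_in_overflow: int   # rows that spilled into overflow
    allocated_bytes: int    # space reserved for the table
    used_bytes: int         # space currently occupied


def monitor(stats: TableStats,
            max_overflow_ratio: float = 0.10,
            min_free_ratio: float = 0.15) -> list[str]:
    """Return a list of findings; the thresholds are illustrative assumptions."""
    findings = []
    total_rows = stats.rows_in_primary + stats.rows_in_overflow
    if total_rows and stats.rows_in_overflow / total_rows > max_overflow_ratio:
        findings.append("too much data in overflow: consider a reorganization")
    if stats.allocated_bytes:
        free_ratio = 1 - stats.used_bytes / stats.allocated_bytes
        if free_ratio < min_free_ratio:
            findings.append("little remaining space")
    return findings


report = monitor(TableStats(rows_in_primary=800, rows_in_overflow=200,
                            allocated_bytes=1_000_000, used_bytes=900_000))
print(report)
```

A real monitor would also track the access statistics and index structure mentioned in the list; those are left out to keep the sketch short.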

Interfaces to Many Technologies


The interface to different technologies requires several considerations:
- Does the data pass from one DBMS to another easily?
- Does it pass from one operating system to another easily?
- Does it change its basic format in passage (EBCDIC, ASCII, and so forth)?
- Can passage into multidimensional processing be done easily?
- Can selected increments of data, such as changed data capture (CDC), be passed rather than entire tables?
- Is the context of data lost in translation as data is moved to other environments?
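The format question above (EBCDIC versus ASCII) can be illustrated with Python's built-in codecs. The sketch assumes the source data uses code page 037 (`cp037`), one common EBCDIC variant; the code page actually in use depends on the source system.

```python
# Assumption: the source system encodes text in EBCDIC code page 037.
EBCDIC_CODEC = "cp037"


def ebcdic_to_ascii(raw: bytes) -> bytes:
    """Decode EBCDIC bytes and re-encode them as ASCII."""
    return raw.decode(EBCDIC_CODEC).encode("ascii")


def ascii_to_ebcdic(raw: bytes) -> bytes:
    """Decode ASCII bytes and re-encode them as EBCDIC."""
    return raw.decode("ascii").encode(EBCDIC_CODEC)


ebcdic_bytes = ascii_to_ebcdic(b"WAREHOUSE 42")
round_trip = ebcdic_to_ascii(ebcdic_bytes)
print(ebcdic_bytes.hex(), round_trip)
```

The same text has a different byte representation in each encoding, which is why data moved between platforms has to be converted field by field. Characters with no ASCII equivalent raise an error in this sketch rather than being silently replaced.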

PROGRAMMER OR DESIGNER CONTROL OF DATA PLACEMENT

- Place data at the block/page level
- Manage data in parallel
- Solid metadata control
- Rich language interface

Parallel Storage and Management of Data


Metadata Management
- Data warehouse table structures
- Data warehouse table attribution
- Data warehouse source data (the system of record)
- Mapping from the system of record to the data warehouse
- Data model specification
- Extract logging
- Common routines for access of data
- Definitions and/or descriptions of data
- Relationships of one unit of data to another
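A few of the metadata items above (table structure, source data, mapping, extract logging) could be held in a structure like the following. This is a rough sketch; every class, field, and sample value here is hypothetical and not taken from any metadata product.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnMapping:
    """How one warehouse column is derived from the system of record."""
    source_system: str
    source_field: str
    target_column: str
    description: str = ""


@dataclass
class TableMetadata:
    """Hypothetical metadata entry for one data warehouse table."""
    table_name: str
    columns: dict[str, str]  # column name -> data type
    mappings: list[ColumnMapping] = field(default_factory=list)
    extract_log: list[str] = field(default_factory=list)

    def unmapped_columns(self) -> list[str]:
        """Columns with no recorded source in the system of record."""
        mapped = {m.target_column for m in self.mappings}
        return [c for c in self.columns if c not in mapped]


meta = TableMetadata(
    table_name="customer",
    columns={"customer_id": "INTEGER", "name": "TEXT", "region": "TEXT"},
    mappings=[ColumnMapping("billing", "CUST_NO", "customer_id", "customer key"),
              ColumnMapping("billing", "CUST_NM", "name")],
)
meta.extract_log.append("sample extract entry")  # illustrative log line
print(meta.unmapped_columns())
```

Keeping the mapping next to the table structure makes it possible to ask simple questions, such as which warehouse columns have no documented source.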

Language Interface
Typically, the language interface to the data warehouse should do the following:
- Be able to access data a set at a time
- Be able to access data a record at a time
- Specifically ensure that one or more indexes will be used in the satisfaction of a query
- Have an SQL interface
- Be able to insert, delete, or update data
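The difference between set-at-a-time and record-at-a-time access can be shown with Python's standard `sqlite3` module. SQLite merely stands in for the warehouse DBMS here, and the `sales` table and its columns are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales (region, amount) VALUES (?, ?)",
                 [("east", 10.0), ("west", 20.0), ("east", 5.0)])

# Set at a time: one SQL statement processes the whole set of rows.
set_total = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'east'").fetchone()[0]

# Record at a time: the program walks a cursor and handles each row itself.
record_total = 0.0
for (amount,) in conn.execute("SELECT amount FROM sales WHERE region = 'east'"):
    record_total += amount

print(set_total, record_total)
```

Both paths arrive at the same total; the first leaves the work to the DBMS, while the second gives the program control over each record.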

EFFICIENT LOADING OF DATA

- Load efficiently
- Use indexes efficiently
- Store data in a compact way
- Support compound keys

Efficient Index Utilization


Technology can support efficient index access in several ways:
- Using bit maps
- Having multileveled indexes
- Storing all or parts of an index in main memory
- Compacting the index entries when the order of the data being indexed allows such compaction
- Creating selective indexes and range indexes
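The first technique in the list, bit maps, can be sketched in a few lines. This toy version uses Python integers as bit strings and is meant only to show the idea; production bitmap indexes are compressed and managed by the DBMS. The column data is hypothetical.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = defaultdict(int)
    for row_number, value in enumerate(values):
        index[value] |= 1 << row_number
    return dict(index)


def rows_from_bitmap(bitmap):
    """Expand a bitmap back into a sorted list of row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical columns of a five-row table.
region = ["east", "west", "east", "north", "west"]
status = ["open", "open", "closed", "open", "closed"]

region_index = build_bitmap_index(region)
status_index = build_bitmap_index(status)

# "region = west AND status = open" becomes a single bitwise AND.
matches = rows_from_bitmap(region_index["west"] & status_index["open"])
print(matches)
```

Combining conditions with bitwise operations is what makes bit maps attractive for columns with few distinct values.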

Compaction of Data
Compact storage is central to managing large amounts of data. The programmer gets the most out of a given I/O when data is stored compactly.

Compound Keys
Compound keys are needed for two reasons:
- The time variancy of data warehouse data
- Key-foreign key relationships, which are quite common in the atomic data
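Time variancy leads naturally to a compound key: the business identifier alone no longer identifies a row, so a date is added to the key. A minimal sketch using the standard `sqlite3` module follows; the `account_balance` table and its contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The key combines the business identifier with the date of the snapshot.
conn.execute("""
    CREATE TABLE account_balance (
        account_id    INTEGER NOT NULL,
        snapshot_date TEXT    NOT NULL,
        balance       REAL    NOT NULL,
        PRIMARY KEY (account_id, snapshot_date)
    )
""")
conn.executemany(
    "INSERT INTO account_balance VALUES (?, ?, ?)",
    [(1, "2023-01-31", 100.0), (1, "2023-02-28", 120.0), (2, "2023-01-31", 50.0)],
)

# The same account may appear many times, once per snapshot date.
history = conn.execute(
    "SELECT snapshot_date, balance FROM account_balance "
    "WHERE account_id = 1 ORDER BY snapshot_date").fetchall()

# Repeating the full compound key is rejected by the DBMS.
try:
    conn.execute("INSERT INTO account_balance VALUES (1, '2023-01-31', 999.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(history, duplicate_rejected)
```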

VARIABLE-LENGTH DATA
- Manage variable-length data efficiently
- Lock manager with explicit control at the programmer level
- Able to do index-only processing
- Restore data in bulk efficiently

Lock Management
The lock manager ensures that two or more people are not updating the same record at the same time. The ability to turn the lock manager off and on is necessary.

Index-Only Processing

Looking in an index (or indexes) without going to the primary source of data
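Index-only processing can be observed in SQLite, used here purely as a stand-in for a warehouse DBMS. In the sketch below the query needs only the indexed column, so the query plan is expected to mention a covering index, meaning the table rows themselves are not read. The table and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
conn.executemany("INSERT INTO sales (region, amount) VALUES (?, ?)",
                 [("east", 10.0), ("west", 20.0), ("east", 5.0)])

query = "SELECT COUNT(*) FROM sales WHERE region = 'east'"
east_count = conn.execute(query).fetchone()[0]

# Ask SQLite how it would run the query; the last column of each plan row
# is a human-readable description of the step.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
print(east_count, plan)
```

If the query also selected `amount`, the index alone would no longer be enough and the DBMS would have to visit the primary data.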

Fast Restore

The capability to quickly restore a data warehouse table from non-DASD storage

Other Technological Features


Some of those features include the following:
- Transaction integrity
- High-speed buffering
- Row- or page-level locking
- Referential integrity
- VIEWs of data
- Partial block loading

DBMS Types and the Data Warehouse


Data warehouses manage massive amounts of data because they contain the following:
- Granular, atomic detail
- Historical information
- Summary as well as detailed data

Because record-level, transaction-based updates are a regular feature of the general-purpose DBMS, it must offer facilities for the following:
- Locking
- COMMITs
- Checkpoints
- Log tape processing
- Deadlock
- Backout

Changing DBMS Technology


Such a change may be in order for several reasons:
- Newer DBMS technologies may be available.
- The size of the warehouse has grown.
- Use of the warehouse has escalated and changed.
- The basic DBMS decision must be revisited from time to time.

Should the decision be made to go to a new DBMS technology, what are the considerations?
- Will the new DBMS technology meet the foreseeable requirements?
- How will the conversion from the older DBMS technology to the newer DBMS technology be done?

Multidimensional DBMS and the Data Warehouse

The multidimensional DBMS:
1. Holds at least an order of magnitude less data.
2. Is geared for very heavy and unpredictable access and analysis of data.
3. Holds a much shorter time horizon of data.
4. Allows unfettered access.

The data warehouse:
1. Holds massive amounts of data.
2. Is geared for a limited amount of flexible access.
3. Contains data with a very lengthy time horizon (from 5 to 10 years).
4. Allows analysts to access its data in a constrained fashion.

Instead of the data warehouse being housed in a multidimensional DBMS, the two enjoy a complementary relationship.

Multidimensional DBMS and the Data Warehouse cont


Following is the relational foundation for multidimensional DBMS data marts:

Strengths:
- Can support a lot of data.
- Can support dynamic joining of data.
- Has proven technology.
- Is capable of supporting general-purpose update processing.
- If there is no known pattern of usage of data, then the relational structure is as good as any other.

Weaknesses:
- Has performance that is less than optimal.
- Cannot be purely optimized for access.

Multidimensional DBMS and the Data Warehouse cont


Following is the cube foundation for multidimensional DBMS data marts:

Strengths:
- Performance that is optimal for DSS processing.
- Can be optimized for very fast access of data.
- If pattern of access of data is known, then the structure of data can be optimized.
- Can easily be sliced and diced.
- Can be examined in many ways.

Weaknesses:
- Cannot handle nearly as much data as a standard relational format.
- Does not support general-purpose update processing.
- May take a long time to load.
- If access is desired on a path not supported by the
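To make "sliced and diced" concrete, here is a toy cube built from a handful of fact rows. It is only a sketch of the idea: the dimensions, the fact data, and the helper names are hypothetical, and a real multidimensional DBMS stores and indexes its cells very differently.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, sales amount).
facts = [
    ("east", "widget", "Q1", 10),
    ("east", "gadget", "Q1", 4),
    ("west", "widget", "Q1", 7),
    ("east", "widget", "Q2", 12),
]
DIMENSIONS = ("region", "product", "quarter")


def build_cube(rows):
    """Pre-aggregate the measure for every cell (region, product, quarter)."""
    cube = defaultdict(int)
    for *coords, amount in rows:
        cube[tuple(coords)] += amount
    return dict(cube)


def slice_cube(cube, **fixed):
    """Sum all cells whose coordinates match the fixed dimension values."""
    positions = {DIMENSIONS.index(name): value for name, value in fixed.items()}
    return sum(amount for coords, amount in cube.items()
               if all(coords[i] == v for i, v in positions.items()))


cube = build_cube(facts)
east_total = slice_cube(cube, region="east")
widget_q1 = slice_cube(cube, product="widget", quarter="Q1")
print(east_total, widget_q1)
```

The load step does the aggregation up front, which hints at why cubes answer known access patterns quickly but can take a long time to load.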

Data Warehousing across Multiple Storage Media

A large amount of data is spread across more than one storage medium.
One processing environment is the DASD environment, where online, interactive processing is done. The other processing environment is often a tape or mass-store environment.

The Role of Metadata in the Data Warehouse Environment

Context and Content

Contextual information is needed to explain the content of the reports.

Three Types of Contextual Information

Three levels of contextual information must be managed:


- Simple contextual information
- Complex contextual information
- External contextual information

Simple contextual information relates to the basic structure of data itself, and includes such things as these:
- The structure of data
- The encoding of data
- The naming conventions used for data
- The metrics describing the data, such as:
  - How much data there is
  - How fast the data is growing
  - What sectors of the data are growing

Three Types of Contextual Information cont

Complex contextual information addresses such aspects of data as these:
- Product definitions
- Marketing territories
- Pricing
- Packaging
- Organization structure
- Distribution

Three Types of Contextual Information cont

Some examples of external contextual information include the following:
- Economic forecasts:
  - Inflation
  - Financial trends
  - Taxation
  - Economic growth
- Political information
- Competitive information
- Technological advancements
- Consumer demographic movements

Capturing and Managing Contextual Information

Complex and external contextual types of information are hard to capture and quantify because they are so unstructured.

Looking at the Past


Some of the shortcomings of past attempts at managing contextual information are as follows:
- The information management attempts were aimed at the information systems developer, not the end user.
- Attempts at contextual management were passive.
- Attempts at contextual information management were in many cases

Refreshing the Data Warehouse

Reading a log tape is no small matter, however. Many obstacles are in the way, including the following:
- The log tape contains much extraneous data.
- The log tape format is often arcane.
- The log tape contains spanned records.
- The log tape often contains addresses instead of data values.

Testing
It is very unusual to find a similar test environment in the world of the data warehouse, for the following reasons:
- Data warehouses are so large that a corporation has a hard time justifying one of them, much less two of them.
- The nature of the development life cycle for the data warehouse is iterative.
- For the most part, programs are run in

Summary
Some technological features are required:
- Robust language interface
- Compound keys
- Variable-length data
- The abilities to do the following:
  - Manage large amounts of data
  - Manage data on diverse media
  - Easily index and monitor data
  - Interface with a wide number of technologies
  - Allow the programmer to place the data directly on the physical device
  - Store and access data in parallel
  - Have metadata control of the warehouse
  - Efficiently load the warehouse
  - Efficiently use indexes
  - Store data in a compact way
  - Support compound keys
  - Selectively turn off the lock manager
  - Do index-only processing
  - Quickly restore from bulk storage

Summary cont

The data architect must recognize the differences between a transaction-based DBMS and a data warehouse-based DBMS.

Summary cont

Multidimensional OLAP technology is suited for data mart processing and not data warehouse processing. When the data mart approach is used, many problems become evident:
- The number of extract programs grows large.
- Each new multidimensional database must return to the legacy operational environment for its own data.
- There is no basis for reconciliation of differences in analysis.
- A tremendous amount of redundant data among different multidimensional DBMS environments exists.

Summary cont

Metadata in the data warehouse environment plays a very different role than metadata in the operational legacy environment.

