
Oracle9i

Data Warehousing Guide

Release 2 (9.2)

March 2002

Part No. A96520-01

Oracle9i Data Warehousing Guide, Release 2 (9.2)

Part No. A96520-01

Copyright © 1996, 2002 Oracle Corporation. All rights reserved.

Primary Author: Paul Lane

Contributing Authors: Viv Schupmann (Change Data Capture)

Contributors: Patrick Amor, Hermann Baer, Subhransu Basu, Srikanth Bellamkonda, Randy Bello, Tolga Bozkaya, Benoit Dageville, John Haydu, Lilian Hobbs, Hakan Jakobsson, George Lumpkin, Cetin Ozbutun, Jack Raitto, Ray Roccaforte, Sankar Subramanian, Gregory Smith, Ashish Thusoo, Jean-Francois Verrier, Gary Vincent, Andy Witkowski, Zia Ziauddin

Graphic Designer: Valarie Moore

The Programs (which include both the software and documentation) contain proprietary information of Oracle Corporation; they are provided under a license agreement containing restrictions on use and disclosure and are also protected by copyright, patent and other intellectual and industrial property laws. Reverse engineering, disassembly or decompilation of the Programs, except to the extent required to obtain interoperability with other independently created software or as specified by law, is prohibited.

The information contained in this document is subject to change without notice. If you find any problems in the documentation, please report them to us in writing. Oracle Corporation does not warrant that this document is error-free. Except as may be expressly permitted in your license agreement for these Programs, no part of these Programs may be reproduced or transmitted in any form or by any means, electronic or mechanical, for any purpose, without the express written permission of Oracle Corporation.

If the Programs are delivered to the U.S. Government or anyone licensing or using the programs on behalf of the U.S. Government, the following notice is applicable:

Restricted Rights Notice Programs delivered subject to the DOD FAR Supplement are "commercial computer software" and use, duplication, and disclosure of the Programs, including documentation, shall be subject to the licensing restrictions set forth in the applicable Oracle license agreement. Otherwise, Programs delivered subject to the Federal Acquisition Regulations are "restricted computer software" and use, duplication, and disclosure of the Programs shall be subject to the restrictions in FAR 52.227-19, Commercial Computer Software - Restricted Rights (June, 1987). Oracle Corporation, 500 Oracle Parkway, Redwood City, CA 94065.

The Programs are not intended for use in any nuclear, aviation, mass transit, medical, or other inherently dangerous applications. It shall be the licensee's responsibility to take all appropriate fail-safe, backup, redundancy, and other measures to ensure the safe use of such applications if the Programs are used for such purposes, and Oracle Corporation disclaims liability for any damages caused by such use of the Programs.

Oracle is a registered trademark, and Express, Oracle Expert, Oracle Store, Oracle7, Oracle8, Oracle8i, Oracle9i, PL/SQL, Pro*C, and SQL*Plus are trademarks or registered trademarks of Oracle Corporation. Other names may be trademarks of their respective owners.


Contents

Send Us Your Comments

Preface

What’s New in Data Warehousing?

Part I  Concepts

1  Data Warehousing Concepts
    What is a Data Warehouse?
    Subject Oriented
    Integrated
    Nonvolatile
    Time Variant
    Contrasting OLTP and Data Warehousing Environments
    Data Warehouse Architectures
    Data Warehouse Architecture (Basic)
    Data Warehouse Architecture (with a Staging Area)
    Data Warehouse Architecture (with a Staging Area and Data Marts)

Part II  Logical Design

2  Logical Design in Data Warehouses
    Logical Versus Physical Design in Data Warehouses
    Creating a Logical Design
    Data Warehousing Schemas
    Star Schemas
    Other Schemas
    Data Warehousing Objects
    Fact Tables
    Dimension Tables
    Unique Identifiers
    Relationships
    Example of Data Warehousing Objects and Their Relationships

Part III  Physical Design

3  Physical Design in Data Warehouses
    Moving from Logical to Physical Design
    Physical Design
    Physical Design Structures
    Tablespaces
    Tables and Partitioned Tables
    Views
    Integrity Constraints
    Indexes and Partitioned Indexes
    Materialized Views
    Dimensions

4  Hardware and I/O Considerations in Data Warehouses
    Overview of Hardware and I/O Considerations in Data Warehouses
    Why Stripe the Data?
    Automatic Striping
    Manual Striping
    Local and Global Striping
    Analyzing Striping
    RAID Configurations
    RAID 0 (Striping)
    RAID 1 (Mirroring)
    RAID 0+1 (Striping and Mirroring)
    Striping, Mirroring, and Media Recovery
    RAID 5
    The Importance of Specific Analysis

5  Parallelism and Partitioning in Data Warehouses
    Overview of Parallel Execution
    When to Implement Parallel Execution
    Granules of Parallelism
    Block Range Granules
    Partition Granules
    Partitioning Design Considerations
    Types of Partitioning
    Partitioning and Data Segment Compression
    Partition Pruning
    Partition-Wise Joins
    Miscellaneous Partition Operations
    Adding Partitions
    Dropping Partitions
    Exchanging Partitions
    Moving Partitions
    Splitting and Merging Partitions
    Truncating Partitions
    Coalescing Partitions

6  Indexes
    Bitmap Indexes
    Bitmap Join Indexes
    B-tree Indexes
    Local Indexes Versus Global Indexes

7  Integrity Constraints
    Why Integrity Constraints are Useful in a Data Warehouse
    Overview of Constraint States
    Typical Data Warehouse Integrity Constraints
    UNIQUE Constraints in a Data Warehouse
    FOREIGN KEY Constraints in a Data Warehouse
    RELY Constraints
    Integrity Constraints and Parallelism
    Integrity Constraints and Partitioning
    View Constraints

8  Materialized Views
    Overview of Data Warehousing with Materialized Views
    Materialized Views for Data Warehouses
    Materialized Views for Distributed Computing
    Materialized Views for Mobile Computing
    The Need for Materialized Views
    Components of Summary Management
    Data Warehousing Terminology
    Materialized View Schema Design
    Loading Data
    Overview of Materialized View Management Tasks
    Types of Materialized Views
    Materialized Views with Aggregates
    Materialized Views Containing Only Joins
    Nested Materialized Views
    Creating Materialized Views
    Naming Materialized Views
    Storage And Data Segment Compression
    Build Methods
    Enabling Query Rewrite
    Query Rewrite Restrictions
    Refresh Options
    ORDER BY Clause
    Materialized View Logs
    Using Oracle Enterprise Manager
    Using Materialized Views with NLS Parameters
    Registering Existing Materialized Views
    Partitioning and Materialized Views
    Partition Change Tracking
    Partitioning a Materialized View
    Partitioning a Prebuilt Table
    Rolling Materialized Views
    Materialized Views in OLAP Environments
    OLAP Cubes
    Specifying OLAP Cubes in SQL
    Querying OLAP Cubes in SQL
    Partitioning Materialized Views for OLAP
    Compressing Materialized Views for OLAP
    Materialized Views with Set Operators
    Choosing Indexes for Materialized Views
    Invalidating Materialized Views
    Security Issues with Materialized Views
    Altering Materialized Views
    Dropping Materialized Views
    Analyzing Materialized View Capabilities
    Using the DBMS_MVIEW.EXPLAIN_MVIEW Procedure
    MV_CAPABILITIES_TABLE.CAPABILITY_NAME Details
    MV_CAPABILITIES_TABLE Column Details

9  Dimensions
    What are Dimensions?
    Creating Dimensions
    Multiple Hierarchies
    Using Normalized Dimension Tables
    Viewing Dimensions
    Using The DEMO_DIM Package
    Using Oracle Enterprise Manager
    Using Dimensions with Constraints
    Validating Dimensions
    Altering Dimensions
    Deleting Dimensions
    Using the Dimension Wizard
    Managing the Dimension Object
    Creating a Dimension

Part IV  Managing the Warehouse Environment

10  Overview of Extraction, Transformation, and Loading
    Overview of ETL
    ETL Tools
    Daily Operations
    Evolution of the Data Warehouse

11  Extraction in Data Warehouses
    Overview of Extraction in Data Warehouses
    Introduction to Extraction Methods in Data Warehouses
    Logical Extraction Methods
    Physical Extraction Methods
    Change Data Capture
    Data Warehousing Extraction Examples
    Extraction Using Data Files
    Extraction Via Distributed Operations

12  Transportation in Data Warehouses
    Overview of Transportation in Data Warehouses
    Introduction to Transportation Mechanisms in Data Warehouses
    Transportation Using Flat Files
    Transportation Through Distributed Operations
    Transportation Using Transportable Tablespaces

13  Loading and Transformation
    Overview of Loading and Transformation in Data Warehouses
    Transformation Flow
    Loading Mechanisms
    SQL*Loader
    External Tables
    OCI and Direct-Path APIs
    Export/Import
    Transformation Mechanisms
    Transformation Using SQL
    Transformation Using PL/SQL
    Transformation Using Table Functions
    Loading and Transformation Scenarios
    Parallel Load Scenario
    Key Lookup Scenario
    Exception Handling Scenario
    Pivoting Scenarios

14  Maintaining the Data Warehouse
    Using Partitioning to Improve Data Warehouse Refresh
    Refresh Scenarios
    Scenarios for Using Partitioning for Refreshing Data Warehouses
    Optimizing DML Operations During Refresh
    Implementing an Efficient MERGE Operation
    Maintaining Referential Integrity
    Purging Data
    Refreshing Materialized Views
    Complete Refresh
    Fast Refresh
    ON COMMIT Refresh
    Manual Refresh Using the DBMS_MVIEW Package
    Refresh Specific Materialized Views with REFRESH
    Refresh All Materialized Views with REFRESH_ALL_MVIEWS
    Refresh Dependent Materialized Views with REFRESH_DEPENDENT
    Using Job Queues for Refresh
    When Refresh is Possible
    Recommended Initialization Parameters for Parallelism
    Monitoring a Refresh
    Checking the Status of a Materialized View
    Tips for Refreshing Materialized Views with Aggregates
    Tips for Refreshing Materialized Views Without Aggregates
    Tips for Refreshing Nested Materialized Views
    Tips for Fast Refresh with UNION ALL
    Tips After Refreshing Materialized Views
    Using Materialized Views with Partitioned Tables
    Fast Refresh with Partition Change Tracking
    Fast Refresh with CONSIDER FRESH

15  Change Data Capture
    About Change Data Capture
    Publish and Subscribe Model
    Example of a Change Data Capture System
    Components and Terminology for Synchronous Change Data Capture
    Installation and Implementation
    Change Data Capture Restriction on Direct-Path INSERT
    Security
    Columns in a Change Table
    Change Data Capture Views
    Synchronous Mode of Data Capture
    Publishing Change Data
    Step 1: Decide which Oracle Instance will be the Source System
    Step 2: Create the Change Tables that will Contain the Changes
    Managing Change Tables and Subscriptions
    Subscribing to Change Data
    Steps Required to Subscribe to Change Data
    What Happens to Subscriptions when the Publisher Makes Changes
    Export and Import Considerations

16  Summary Advisor
    Overview of the Summary Advisor in the DBMS_OLAP Package
    Using the Summary Advisor
    Identifier Numbers
    Workload Management
    Loading a User-Defined Workload
    Loading a Trace Workload
    Loading a SQL Cache Workload
    Validating a Workload
    Removing a Workload
    Using Filters with the Summary Advisor
    Removing a Filter
    Recommending Materialized Views
    SQL Script Generation
    Summary Data Report
    When Recommendations are No Longer Required
    Stopping the Recommendation Process
    Summary Advisor Sample Sessions
    Summary Advisor and Missing Statistics
    Summary Advisor Privileges and ORA-30446
    Estimating Materialized View Size
    ESTIMATE_MVIEW_SIZE Parameters
    Is a Materialized View Being Used?
    DBMS_OLAP.EVALUATE_MVIEW_STRATEGY Procedure
    Summary Advisor Wizard
    Summary Advisor Steps

Part V  Warehouse Performance

17  Schema Modeling Techniques
    Schemas in Data Warehouses
    Third Normal Form
    Optimizing Third Normal Form Queries
    Star Schemas
    Snowflake Schemas
    Optimizing Star Queries
    Tuning Star Queries
    Using Star Transformation

18  SQL for Aggregation in Data Warehouses
    Overview of SQL for Aggregation in Data Warehouses
    Analyzing Across Multiple Dimensions
    Optimized Performance
    An Aggregate Scenario
    Interpreting NULLs in Examples
    ROLLUP Extension to GROUP BY
    When to Use ROLLUP
    ROLLUP Syntax
    Partial Rollup
    CUBE Extension to GROUP BY
    When to Use CUBE
    CUBE Syntax
    Partial CUBE
    Calculating Subtotals Without CUBE
    GROUPING Functions
    GROUPING Function
    When to Use GROUPING
    GROUPING_ID Function
    GROUP_ID Function
    GROUPING SETS Expression
    Composite Columns
    Concatenated Groupings
    Concatenated Groupings and Hierarchical Data Cubes
    Considerations when Using Aggregation
    Hierarchy Handling in ROLLUP and CUBE
    Column Capacity in ROLLUP and CUBE
    HAVING Clause Used with GROUP BY Extensions
    ORDER BY Clause Used with GROUP BY Extensions
    Using Other Aggregate Functions with ROLLUP and CUBE
    Computation Using the WITH Clause

19  SQL for Analysis in Data Warehouses
    Overview of SQL for Analysis in Data Warehouses
    Ranking Functions
    RANK and DENSE_RANK
    Top N Ranking
    Bottom N Ranking
    CUME_DIST
    PERCENT_RANK
    NTILE
    ROW_NUMBER
    Windowing Aggregate Functions
    Treatment of NULLs as Input to Window Functions
    Windowing Functions with Logical Offset
    Cumulative Aggregate Function Example
    Moving Aggregate Function Example
    Centered Aggregate Function
    Windowing Aggregate Functions in the Presence of Duplicates
    Varying Window Size for Each Row
    Windowing Aggregate Functions with Physical Offsets
    FIRST_VALUE and LAST_VALUE
    Reporting Aggregate Functions
    Reporting Aggregate Example
    RATIO_TO_REPORT
    LAG/LEAD Functions
    LAG/LEAD Syntax
    FIRST/LAST Functions
    FIRST/LAST Syntax
    FIRST/LAST As Regular Aggregates
    FIRST/LAST As Reporting Aggregates
    Linear Regression Functions
    REGR_COUNT
    REGR_AVGY and REGR_AVGX
    REGR_SLOPE and REGR_INTERCEPT
    REGR_R2
    REGR_SXX, REGR_SYY, and REGR_SXY
    Linear Regression Statistics Examples
    Sample Linear Regression Calculation
    Inverse Percentile Functions
    Normal Aggregate Syntax
    Inverse Percentile Restrictions
    Hypothetical Rank and Distribution Functions
    Hypothetical Rank and Distribution Syntax
    WIDTH_BUCKET Function
    WIDTH_BUCKET Syntax
    User-Defined Aggregate Functions
    CASE Expressions
    CASE Example
    Creating Histograms With User-Defined Buckets

20  OLAP and Data Mining
    OLAP
    Benefits of OLAP and RDBMS Integration
    Data Mining
    Enabling Data Mining Applications
    Predictions and Insights
    Mining Within the Database Architecture
    Java API

21  Using Parallel Execution
    Introduction to Parallel Execution Tuning
    When to Implement Parallel Execution
    Operations That Can Be Parallelized
    The Parallel Execution Server Pool
    How Parallel Execution Servers Communicate
    Parallelizing SQL Statements
    Types of Parallelism
    Parallel Query
    Parallel DDL
    Parallel DML
    Parallel Execution of Functions
    Other Types of Parallelism
    Initializing and Tuning Parameters for Parallel Execution
    Selecting Automated or Manual Tuning of Parallel Execution
    Using Automatically Derived Parameter Settings
    Setting the Degree of Parallelism
    How Oracle Determines the Degree of Parallelism for Operations
    Balancing the Workload
    Parallelization Rules for SQL Statements
    Enabling Parallelism for Tables and Queries
    Degree of Parallelism and Adaptive Multiuser: How They Interact
    Forcing Parallel Execution for a Session
    Controlling Performance with the Degree of Parallelism
    Tuning General Parameters for Parallel Execution
    Parameters Establishing Resource Limits for Parallel Operations
    Parameters Affecting Resource Consumption
    Parameters Related to I/O
    Monitoring and Diagnosing Parallel Execution Performance
    Is There Regression?
    Is There a Plan Change?
    Is There a Parallel Plan?
    Is There a Serial Plan?
    Is There Parallel Execution?
    Is the Workload Evenly Distributed?
    Monitoring Parallel Execution Performance with Dynamic Performance Views
    Monitoring Session Statistics
    Monitoring System Statistics
    Monitoring Operating System Statistics
    Affinity and Parallel Operations
    Affinity and Parallel Queries
    Affinity and Parallel DML
    Miscellaneous Parallel Execution Tuning Tips
    Setting Buffer Cache Size for Parallel Operations
    Overriding the Default Degree of Parallelism
    Rewriting SQL Statements
    Creating and Populating Tables in Parallel
    Creating Temporary Tablespaces for Parallel Sort and Hash Join
    Executing Parallel SQL Statements
    Using EXPLAIN PLAN to Show Parallel Operations Plans
    Additional Considerations for Parallel DML
    Creating Indexes in Parallel
    Parallel DML Tips
    Incremental Data Loading in Parallel
    Using Hints with Cost-Based Optimization
    FIRST_ROWS(n) Hint
    Enabling Dynamic Statistic Sampling

22  Query Rewrite
    Overview of Query Rewrite
    Cost-Based Rewrite
    When Does Oracle Rewrite a Query?
    Enabling Query Rewrite
    Initialization Parameters for Query Rewrite
    Controlling Query Rewrite
    Privileges for Enabling Query Rewrite
    Accuracy of Query Rewrite
    How Oracle Rewrites Queries
    Text Match Rewrite Methods
    General Query Rewrite Methods
    When are Constraints and Dimensions Needed?
    Special Cases for Query Rewrite
    Query Rewrite Using Partially Stale Materialized Views
    Query Rewrite Using Complex Materialized Views
    Query Rewrite Using Nested Materialized Views
    Query Rewrite When Using GROUP BY Extensions
    Did Query Rewrite Occur?
    Explain Plan
    DBMS_MVIEW.EXPLAIN_REWRITE Procedure
    Design Considerations for Improving Query Rewrite Capabilities
    Query Rewrite Considerations: Constraints
    Query Rewrite Considerations: Dimensions
    Query Rewrite Considerations: Outer Joins
    Query Rewrite Considerations: Text Match
    Query Rewrite Considerations: Aggregates
    Query Rewrite Considerations: Grouping Conditions
    Query Rewrite Considerations: Expression Matching
    Query Rewrite Considerations: Date Folding
    Query Rewrite Considerations: Statistics

Glossary

Index

Send Us Your Comments

Oracle9i Data Warehousing Guide, Release 2 (9.2)

Part No. A96520-01

Oracle Corporation welcomes your comments and suggestions on the quality and usefulness of this document. Your input is an important part of the information used for revision.

Did you find any errors?

Is the information clearly presented?

Do you need more information? If so, where?

Are the examples correct? Do you need more examples?

What features did you like most?

If you find any errors or have any other suggestions for improvement, please indicate the document title and part number, and the chapter, section, and page number (if available). You can send comments to us in the following ways:

Electronic mail: infodev_us@oracle.com

FAX: (650) 506-7227

Postal service:

Attn: Server Technologies Documentation Manager

Oracle Corporation
Server Technologies Documentation
500 Oracle Parkway, Mailstop 4op11
Redwood Shores, CA 94065
USA

If you would like a reply, please give your name, address, telephone number, and (optionally) electronic mail address.

If you have problems with the software, please contact your local Oracle Support Services.

Preface

This manual provides information about Oracle9i’s data warehousing capabilities. This preface contains these topics:

Audience

Organization

Related Documentation

Conventions

Documentation Accessibility

Audience

Oracle9i Data Warehousing Guide is intended for database administrators, system administrators, and database application developers who design, maintain, and use data warehouses.

To use this document, you need to be familiar with relational database concepts, basic Oracle server concepts, and the operating system environment under which you are running Oracle.

Organization

This document contains:

Part 1: Concepts

Chapter 1, Data Warehousing Concepts This chapter contains an overview of data warehousing concepts.

Part 2: Logical Design

Chapter 2, Logical Design in Data Warehouses This chapter discusses the logical design of a data warehouse.

Part 3: Physical Design

Chapter 3, Physical Design in Data Warehouses This chapter discusses the physical design of a data warehouse.

Chapter 4, Hardware and I/O Considerations in Data Warehouses This chapter describes some hardware and input-output issues.

Chapter 5, Parallelism and Partitioning in Data Warehouses This chapter describes the basics of parallelism and partitioning in data warehouses.

Chapter 6, Indexes This chapter describes how to use indexes in data warehouses.

Chapter 7, Integrity Constraints This chapter describes some issues involving constraints.

Chapter 8, Materialized Views This chapter describes how to use materialized views in data warehouses.

Chapter 9, Dimensions This chapter describes how to use dimensions in data warehouses.

Part 4: Managing the Warehouse Environment

Chapter 10, Overview of Extraction, Transformation, and Loading This chapter is an overview of the ETL process.

Chapter 11, Extraction in Data Warehouses This chapter describes extraction issues.

Chapter 12, Transportation in Data Warehouses This chapter describes transporting data in data warehouses.

Chapter 13, Loading and Transformation This chapter describes transforming data in data warehouses.

Chapter 14, Maintaining the Data Warehouse This chapter describes how to refresh in a data warehousing environment.

Chapter 15, Change Data Capture This chapter describes how to use Change Data Capture capabilities.

Chapter 16, Summary Advisor This chapter describes how to use the Summary Advisor utility.

Part 5: Warehouse Performance

Chapter 17, Schema Modeling Techniques This chapter describes the schemas useful in data warehousing environments.

Chapter 18, SQL for Aggregation in Data Warehouses This chapter explains how to use SQL aggregation in data warehouses.

Chapter 19, SQL for Analysis in Data Warehouses This chapter explains how to use analytic functions in data warehouses.

Chapter 20, OLAP and Data Mining This chapter describes using analytic services in combination with Oracle9i.

Chapter 21, Using Parallel Execution This chapter describes how to tune data warehouses using parallel execution.

Chapter 22, Query Rewrite This chapter describes how to use query rewrite.

Glossary

Related Documentation

For more information, see these Oracle resources:

Oracle9i Database Performance Tuning Guide and Reference

Many of the examples in this book use the sample schemas of the seed database, which is installed by default when you install Oracle. Refer to Oracle9i Sample Schemas for information on how these schemas were created and how you can use them yourself.

In North America, printed documentation is available for sale in the Oracle Store at

http://oraclestore.oracle.com/

Customers in Europe, the Middle East, and Africa (EMEA) can purchase documentation from

http://www.oraclebookshop.com/

Other customers can contact their Oracle representative to purchase printed documentation.

To download free release notes, installation documentation, white papers, or other collateral, please visit the Oracle Technology Network (OTN). You must register online before using OTN; registration is free and can be done at

http://otn.oracle.com/admin/account/membership.html

If you already have a username and password for OTN, then you can go directly to the documentation section of the OTN Web site at

http://otn.oracle.com/docs/index.htm

To access the database documentation search engine directly, please visit

http://tahiti.oracle.com

For additional information, see:

The Data Warehouse Toolkit by Ralph Kimball (John Wiley and Sons, 1996)

Building the Data Warehouse by William Inmon (John Wiley and Sons, 1996)

Conventions

This section describes the conventions used in the text and code examples of this documentation set. It describes:

Conventions in Text

Conventions in Code Examples

Conventions for Windows Operating Systems

Conventions in Text

We use various conventions in text to help you more quickly identify special terms. The following table describes those conventions and provides examples of their use.

Bold
    Meaning: Bold typeface indicates terms that are defined in the text or terms that appear in a glossary, or both.
    Example: When you specify this clause, you create an index-organized table.

Italics
    Meaning: Italic typeface indicates book titles or emphasis.
    Examples: Oracle9i Database Concepts
              Ensure that the recovery catalog and target database do not reside on the same disk.

UPPERCASE monospace (fixed-width) font
    Meaning: Uppercase monospace typeface indicates elements supplied by the system. Such elements include parameters, privileges, datatypes, RMAN keywords, SQL keywords, SQL*Plus or utility commands, packages and methods, as well as system-supplied column names, database objects and structures, usernames, and roles.
    Examples: You can specify this clause only for a NUMBER column.
              You can back up the database by using the BACKUP command.
              Query the TABLE_NAME column in the USER_TABLES data dictionary view.
              Use the DBMS_STATS.GENERATE_STATS procedure.

lowercase monospace (fixed-width) font
    Meaning: Lowercase monospace typeface indicates executables, filenames, directory names, and sample user-supplied elements. Such elements include computer and database names, net service names, and connect identifiers, as well as user-supplied database objects and structures, column names, packages and classes, usernames and roles, program units, and parameter values. Note: Some programmatic elements use a mixture of UPPERCASE and lowercase. Enter these elements as shown.
    Examples: Enter sqlplus to open SQL*Plus.
              The password is specified in the orapwd file.
              Back up the datafiles and control files in the /disk1/oracle/dbs directory.
              The department_id, department_name, and location_id columns are in the hr.departments table.
              Set the QUERY_REWRITE_ENABLED initialization parameter to true.
              Connect as oe user.
              The JRepUtil class implements these methods.

lowercase italic monospace (fixed-width) font
    Meaning: Lowercase italic monospace font represents placeholders or variables.
    Examples: You can specify the parallel_clause.
              Run Uold_release.SQL where old_release refers to the release you installed prior to upgrading.

Conventions in Code Examples

Code examples illustrate SQL, PL/SQL, SQL*Plus, or other command-line statements. They are displayed in a monospace (fixed-width) font and separated from normal text as shown in this example:

SELECT username FROM dba_users WHERE username = 'MIGRATE';

The following table describes typographic conventions used in code examples and provides examples of their use.

[ ]
    Meaning: Brackets enclose one or more optional items. Do not enter the brackets.
    Example: DECIMAL (digits [ , precision ])

{ }
    Meaning: Braces enclose two or more items, one of which is required. Do not enter the braces.
    Example: {ENABLE | DISABLE}

|
    Meaning: A vertical bar represents a choice of two or more options within brackets or braces. Enter one of the options. Do not enter the vertical bar.
    Examples: {ENABLE | DISABLE}
              [COMPRESS | NOCOMPRESS]

...
    Meaning: Horizontal ellipsis points indicate either that we have omitted parts of the code that are not directly related to the example, or that you can repeat a portion of the code.
    Examples: CREATE TABLE ... AS subquery;
              SELECT col1, col2, ... , coln FROM employees;

.
.
.
    Meaning: Vertical ellipsis points indicate that we have omitted several lines of code not directly related to the example.
    Example:
        SQL> SELECT NAME FROM V$DATAFILE;
        NAME
        ------------------------------------
        /fsl/dbs/tbs_01.dbf
        /fs1/dbs/tbs_02.dbf
        .
        .
        .
        /fsl/dbs/tbs_09.dbf
        9 rows selected.

Other notation
    Meaning: You must enter symbols other than brackets, braces, vertical bars, and ellipsis points as shown.
    Examples: acctbal NUMBER(11,2);
              acct CONSTANT NUMBER(4) := 3;

Italics
    Meaning: Italicized text indicates placeholders or variables for which you must supply particular values.
    Examples: CONNECT SYSTEM/system_password
              DB_NAME = database_name

UPPERCASE
    Meaning: Uppercase typeface indicates elements supplied by the system. We show these terms in uppercase in order to distinguish them from terms you define. Unless terms appear in brackets, enter them in the order and with the spelling shown. However, because these terms are not case sensitive, you can enter them in lowercase.
    Examples: SELECT last_name, employee_id FROM employees;
              SELECT * FROM USER_TABLES;
              DROP TABLE hr.employees;

lowercase
    Meaning: Lowercase typeface indicates programmatic elements that you supply. For example, lowercase indicates names of tables, columns, or files. Note: Some programmatic elements use a mixture of UPPERCASE and lowercase. Enter these elements as shown.
    Examples: SELECT last_name, employee_id FROM employees;
              sqlplus hr/hr
              CREATE USER mjones IDENTIFIED BY ty3MU9;

Conventions for Windows Operating Systems

The following table describes conventions for Windows operating systems and provides examples of their use.

Choose Start >
    Meaning: How to start a program.
    Example: To start the Database Configuration Assistant, choose Start > Programs > Oracle - HOME_NAME > Configuration and Migration Tools > Database Configuration Assistant.

File and directory names
    Meaning: File and directory names are not case sensitive. The following special characters are not allowed: left angle bracket (<), right angle bracket (>), colon (:), double quotation marks ("), slash (/), pipe (|), and dash (-). The special character backslash (\) is treated as an element separator, even when it appears in quotes. If the file name begins with \\, then Windows assumes it uses the Universal Naming Convention.
    Example: c:\winnt"\"system32 is the same as C:\WINNT\SYSTEM32

C:\>
    Meaning: Represents the Windows command prompt of the current hard disk drive. The escape character in a command prompt is the caret (^). Your prompt reflects the subdirectory in which you are working. Referred to as the command prompt in this manual.
    Example: C:\oracle\oradata>

Special characters
    Meaning: The backslash (\) special character is sometimes required as an escape character for the double quotation mark (") special character at the Windows command prompt. Parentheses and the single quotation mark (') do not require an escape character. Refer to your Windows operating system documentation for more information on escape and special characters.
    Examples: C:\>exp scott/tiger TABLES=emp QUERY=\"WHERE job='SALESMAN' and sal<1600\"
              C:\>imp SYSTEM/password FROMUSER=scott TABLES=(emp, dept)

HOME_NAME
    Meaning: Represents the Oracle home name. The home name can be up to 16 alphanumeric characters. The only special character allowed in the home name is the underscore.
    Example: C:\> net start OracleHOME_NAMETNSListener

ORACLE_HOME and ORACLE_BASE
    Meaning: In releases prior to Oracle8i release 8.1.3, when you installed Oracle components, all subdirectories were located under a top level ORACLE_HOME directory that by default used one of the following names: C:\orant for Windows NT, or C:\orawin98 for Windows 98. This release complies with Optimal Flexible Architecture (OFA) guidelines. All subdirectories are not under a top level ORACLE_HOME directory. There is a top level directory called ORACLE_BASE that by default is C:\oracle. If you install the latest Oracle release on a computer with no other Oracle software installed, then the default setting for the first Oracle home directory is C:\oracle\orann, where nn is the latest release number. The Oracle home directory is located directly under ORACLE_BASE. All directory path examples in this guide follow OFA conventions. Refer to Oracle9i Database Getting Started for Windows for additional information about OFA compliances and for information about installing Oracle products in non-OFA compliant directories.
    Example: Go to the ORACLE_BASE\ORACLE_HOME\rdbms\admin directory.

Documentation Accessibility

Our goal is to make Oracle products, services, and supporting documentation accessible, with good usability, to the disabled community. To that end, our documentation includes features that make information available to users of assistive technology. This documentation is available in HTML format, and contains markup to facilitate access by the disabled community. Standards will continue to evolve over time, and Oracle Corporation is actively engaged with other market-leading technology vendors to address technical obstacles so that our documentation can be accessible to all of our customers. For additional information, visit the Oracle Accessibility Program Web site at

http://www.oracle.com/accessibility/

Accessibility of Code Examples in Documentation

JAWS, a Windows screen reader, may not always correctly read the code examples in this document. The conventions for writing code require that closing braces should appear on an otherwise empty line; however, JAWS may not always read a line of text that consists solely of a bracket or brace.

Accessibility of Links to External Web Sites in Documentation

This documentation may contain links to Web sites of other companies or organizations that Oracle Corporation does not own or control. Oracle Corporation neither evaluates nor makes any representations regarding the accessibility of these Web sites.

What’s New in Data Warehousing?

This section describes new features of Oracle9i release 2 (9.2) and provides pointers to additional information. New features information from previous releases is also retained to help those users migrating to the current release.

The following sections describe the new features in Oracle Data Warehousing:

Oracle9i Release 2 (9.2) New Features in Data Warehousing

Oracle9i Release 1 (9.0.1) New Features in Data Warehousing

Oracle9i Release 2 (9.2) New Features in Data Warehousing

Data Segment Compression

You can compress data segments in heap-organized tables; partitioned tables are a typical candidate for data segment compression. Data segment compression is also useful for highly redundant data, such as tables with many foreign keys and materialized views created with the ROLLUP clause. Avoid compressing tables that are subject to frequent updates or other DML.

See Also: Chapter 8, "Materialized Views"
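For illustration only, the following statements sketch how a table can be created with data segment compression, or rebuilt to compress its existing rows (the sales_history table here is hypothetical):

CREATE TABLE sales_history
( prod_id      NUMBER,
  time_id      DATE,
  amount_sold  NUMBER(10,2) )
COMPRESS;

-- An existing table (or a single partition) can be compressed when it is rebuilt:
ALTER TABLE sales_history MOVE COMPRESS;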

Materialized View Enhancements

You can now nest materialized views when the materialized view contains joins and aggregates. Fast refresh is now possible on materialized views containing the UNION ALL operator. Various restrictions were removed, and the situations where materialized views can be effectively used were expanded. In particular, using materialized views in an OLAP environment has been improved.

"Overview of Data Warehousing with Materialized

Views" on page 8-2 and "Materialized Views in OLAP Environments" on page 8-41, and Chapter 14, "Maintaining the Data Warehouse"

See Also:

Parallel DML on Non-Partitioned Tables

You can now use parallel DML on non-partitioned tables.

See Also: Chapter 21, "Using Parallel Execution"
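A minimal sketch of parallel DML against a non-partitioned table (the table names are hypothetical, and parallel DML must first be enabled for the session):

ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ PARALLEL(sales_history, 4) */ INTO sales_history
SELECT /*+ PARALLEL(sales, 4) */ *
FROM   sales;

COMMIT;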

Partitioning Enhancements

You can now simplify SQL syntax by using a DEFAULT partition or a subpartition template. You can implement SPLIT operations more easily.

See Also: "Partitioning Methods" on page 5-5, Chapter 5, "Parallelism and Partitioning in Data Warehouses", and Oracle9i Database Administrator’s Guide

Query Rewrite Enhancements

Text match processing and join equivalence recognition have been improved. Materialized views containing the UNION ALL operator can now use query rewrite.

See Also: Chapter 22, "Query Rewrite"
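As a reminder of the basic mechanism these enhancements extend, the following sketch (using the sample sales table) creates a materialized view that is eligible for query rewrite; queries against the detail table can then be answered from the summary transparently:

ALTER SESSION SET QUERY_REWRITE_ENABLED = TRUE;

CREATE MATERIALIZED VIEW sum_sales_mv
ENABLE QUERY REWRITE
AS
SELECT prod_id, SUM(amount_sold) AS total_sold
FROM   sales
GROUP BY prod_id;

-- This query can now be rewritten to read sum_sales_mv instead of sales:
SELECT prod_id, SUM(amount_sold)
FROM   sales
GROUP BY prod_id;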

Range-List Partitioning

You can now subpartition range-partitioned tables by list.

See Also: "Types of Partitioning" on page 5-4

Summary Advisor Enhancements

The Summary Advisor tool and its related DBMS_OLAP package were improved so you can restrict workloads to a specific schema.

See Also: Chapter 16, "Summary Advisor"

Oracle9i Release 1 (9.0.1) New Features in Data Warehousing

Analytic Functions

Oracle’s analytic capabilities have been improved through the addition of inverse percentile, hypothetical distribution, and first/last analytic functions.

See Also: Chapter 19, "SQL for Analysis in Data Warehouses"
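For illustration, assuming the sample sales table, the first query below computes an inverse percentile (the median sale amount for each product) and the second a hypothetical rank (where a 1000 sale would rank among the existing rows):

SELECT prod_id,
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount_sold) AS median_sale
FROM   sales
GROUP BY prod_id;

SELECT RANK(1000) WITHIN GROUP (ORDER BY amount_sold DESC) AS hypothetical_rank
FROM   sales;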

Bitmap Join Index

A bitmap join index spans multiple tables and improves the performance of joins of those tables.

See Also: "Bitmap Indexes" on page 6-2

ETL Enhancements

Oracle’s extraction, transformation, and loading capabilities have been improved with a MERGE statement, multi-table inserts, and table functions.

See Also: Chapter 10, "Overview of Extraction, Transformation, and Loading"

Full Outer Joins

Oracle added full support for full outer joins so that you can more easily express certain complex queries.
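For example, using the sample customers and sales tables, a full outer join preserves customers with no sales as well as sales rows with no matching customer:

SELECT c.cust_last_name, s.amount_sold
FROM   customers c
       FULL OUTER JOIN sales s
       ON c.cust_id = s.cust_id;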

See Also: Oracle9i Database Performance Tuning Guide and Reference

Grouping Sets

You can now selectively specify the set of groups that you want to create using a GROUPING SETS expression within a GROUP BY clause. This allows precise specification across multiple dimensions without computing the whole CUBE.

See Also: Chapter 18, "SQL for Aggregation in Data Warehouses"
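A sketch of GROUPING SETS, assuming a hypothetical sales_view with channel, month, and country columns; only the two listed groupings are computed, not the entire cube:

SELECT channel_desc, calendar_month_desc, country_id,
       SUM(amount_sold) AS sales
FROM   sales_view
GROUP BY GROUPING SETS
( (channel_desc, calendar_month_desc),
  (calendar_month_desc, country_id) );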

List Partitioning

List partitioning offers you precise control over which data belongs in a particular partition.

See Also: "Partitioning Design Considerations" on page 5-4, Oracle9i Database Concepts, and Oracle9i Database Administrator’s Guide
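For example, the following hypothetical table places each row in a partition according to discrete state values; a row whose state matches no list raises an error unless a DEFAULT partition (added in release 9.2) is also defined:

CREATE TABLE regional_orders
( order_id    NUMBER,
  state       VARCHAR2(2),
  order_total NUMBER )
PARTITION BY LIST (state)
( PARTITION region_east VALUES ('NY', 'NJ', 'CT'),
  PARTITION region_west VALUES ('CA', 'OR', 'WA') );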

Materialized View Enhancements

Various restrictions were removed in addition to expanding the situations where materialized views could be effectively used.

See Also: "Overview of Data Warehousing with Materialized Views" on page 8-2

Query Rewrite Enhancements

The query rewrite feature, which allows many SQL statements to use materialized views and thereby improves performance, was significantly enhanced. Text match processing and join equivalence recognition have been improved.

See Also: Chapter 22, "Query Rewrite"

Summary Advisor Enhancements

The Summary Advisor tool and its related DBMS_OLAP package were improved so you can specify workloads. In addition, a broader class of schemas is now supported.

See Also: Chapter 16, "Summary Advisor"

WITH Clause

The WITH clause enables you to reuse a query block in a SELECT statement when it occurs more than once within a complex query.

See Also: "Computation Using the WITH Clause" on page 18-30

Part I

Concepts

This section introduces basic data warehousing concepts. It contains the following chapter:

Data Warehousing Concepts

1

Data Warehousing Concepts

This chapter provides an overview of the Oracle data warehousing implementation. It includes:

What is a Data Warehouse?

Data Warehouse Architectures

Note that this book is meant as a supplement to standard texts about data warehousing. This book focuses on Oracle-specific material and does not reproduce in detail material of a general nature. Two standard texts are:

The Data Warehouse Toolkit by Ralph Kimball (John Wiley and Sons, 1996)

Building the Data Warehouse by William Inmon (John Wiley and Sons, 1996)

What is a Data Warehouse?

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.

In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

See Also: Chapter 10, "Overview of Extraction, Transformation, and Loading"

A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:

Subject Oriented

Integrated

Nonvolatile

Time Variant

Subject Oriented

Data warehouses are designed to help you analyze data. For example, to learn more about your company’s sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.

Integrated

Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.

Nonvolatile

Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.

Time Variant

In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse’s focus on change over time is what is meant by the term time variant.

Contrasting OLTP and Data Warehousing Environments

Figure 1–1 illustrates key differences between an OLTP system and a data warehouse.

Figure 1–1 Contrasting OLTP and Data Warehousing Environments

                                 OLTP                       Data Warehouse
  Data structures                Complex (3NF databases)    Multidimensional
  Indexes                        Few                        Many
  Joins                          Many                       Some
  Data                           Normalized (DBMS)          Denormalized, duplicated (DBMS)
  Derived data and aggregates    Rare                       Common

One major difference between the types of system is that data warehouses are not usually in third normal form (3NF), a type of data normalization common in OLTP environments.


Data warehouses and OLTP systems have very different requirements. Here are some examples of differences between typical data warehouses and OLTP systems:

Workload

Data warehouses are designed to accommodate ad hoc queries. You might not know the workload of your data warehouse in advance, so a data warehouse should be optimized to perform well for a wide variety of possible query operations.

OLTP systems support only predefined operations. Your applications might be specifically tuned or designed to support only these operations.

Data modifications

A data warehouse is updated on a regular basis by the ETL process (run nightly or weekly) using bulk data modification techniques. The end users of a data warehouse do not directly update the data warehouse.

In OLTP systems, end users routinely issue individual data modification statements to the database. The OLTP database is always up to date, and reflects the current state of each business transaction.

Schema design

Data warehouses often use denormalized or partially denormalized schemas (such as a star schema) to optimize query performance.

OLTP systems often use fully normalized schemas to optimize update/insert/delete performance, and to guarantee data consistency.

Typical operations

A typical data warehouse query scans thousands or millions of rows. For example, "Find the total sales for all customers last month."

A typical OLTP operation accesses only a handful of records. For example, "Retrieve the current order for this customer." (The example queries after this list illustrate the contrast.)

Historical data

Data warehouses usually store many months or years of data. This is to support historical analysis.

OLTP systems usually store data from only a few weeks or months. The OLTP system stores only historical data as needed to successfully meet the requirements of the current transaction.
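The following examples contrast the two styles of access; the warehouse query assumes the sample sales table, while the OLTP query assumes a hypothetical orders table:

-- Typical data warehouse query: scans and aggregates many rows.
SELECT SUM(amount_sold)
FROM   sales
WHERE  time_id BETWEEN TO_DATE('01-MAY-2002','DD-MON-YYYY')
                   AND TO_DATE('31-MAY-2002','DD-MON-YYYY');

-- Typical OLTP operation: touches a handful of rows through a key.
SELECT *
FROM   orders
WHERE  customer_id = 101
AND    order_status = 'OPEN';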

Data Warehouse Architectures

Data warehouses and their architectures vary depending upon the specifics of an organization's situation. Three common architectures are:

Data Warehouse Architecture (Basic)

Data Warehouse Architecture (with a Staging Area)

Data Warehouse Architecture (with a Staging Area and Data Marts)

Data Warehouse Architecture (Basic)

Figure 1–2 shows a simple architecture for a data warehouse. End users directly access data derived from several source systems through the data warehouse.

Figure 1–2 Architecture of a Data Warehouse

[The figure shows several data sources (operational systems and flat files) feeding the data warehouse, which holds metadata, summary data, and raw data. Users access the warehouse for analysis, reporting, and mining.]

In Figure 1–2, the metadata and raw data of a traditional OLTP system is present, as is an additional type of data, summary data. Summaries are very valuable in data warehouses because they pre-compute long operations in advance. For example, a typical data warehouse query is to retrieve something like August sales. A summary in Oracle is called a materialized view.
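For illustration, a materialized view that pre-computes monthly sales totals might be sketched as follows (using the sample sales and times tables); a query asking for August sales can then be answered from the summary rather than from the detail rows:

CREATE MATERIALIZED VIEW monthly_sales_mv
BUILD IMMEDIATE
ENABLE QUERY REWRITE
AS
SELECT t.calendar_month_desc, SUM(s.amount_sold) AS month_sales
FROM   sales s, times t
WHERE  s.time_id = t.time_id
GROUP BY t.calendar_month_desc;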


Data Warehouse Architecture (with a Staging Area)

In Figure 1–2, you need to clean and process your operational data before putting it into the warehouse. You can do this programmatically, although most data warehouses use a staging area instead. A staging area simplifies building summaries and general warehouse management. Figure 1–3 illustrates this typical architecture.

Figure 1–3 Architecture of a Data Warehouse with a Staging Area

[The figure shows data sources (operational systems and flat files) feeding a staging area, which in turn loads the warehouse (metadata, summary data, and raw data). Users access the warehouse for analysis, reporting, and mining.]


Data Warehouse Architecture (with a Staging Area and Data Marts)

Although the architecture in Figure 1–3 is quite common, you may want to customize your warehouse’s architecture for different groups within your organization. You can do this by adding data marts, which are systems designed for a particular line of business. Figure 1–4 illustrates an example where purchasing, sales, and inventories are separated. In this example, a financial analyst might want to analyze historical data for purchases and sales.

Figure 1–4 Architecture of a Data Warehouse with a Staging Area and Data Marts

[The figure shows data sources feeding a staging area, which loads the warehouse (metadata, summary data, and raw data). The warehouse in turn feeds purchasing, sales, and inventory data marts, which users access for analysis, reporting, and mining.]

Note: Data marts are an important part of many warehouses, but they are not the focus of this book.

See Also: Data Mart Suites documentation for further information regarding data marts


Part II

Logical Design

This section deals with the issues in logical design in a data warehouse. It contains the following chapter:

Logical Design in Data Warehouses

2

Logical Design in Data Warehouses

This chapter tells you how to design a data warehousing environment and includes the following topics:

Logical Versus Physical Design in Data Warehouses

Creating a Logical Design

Data Warehousing Schemas

Data Warehousing Objects

Logical Versus Physical Design in Data Warehouses

Your organization has decided to build a data warehouse. You have defined the business requirements, agreed upon the scope of your application, and created a conceptual design. Now you need to translate your requirements into a system deliverable. To do so, you create the logical and physical design for the data warehouse. You then define:

The specific data content

Relationships within and between groups of data

The system environment supporting your data warehouse

The data transformations required

The frequency with which data is refreshed

The logical design is more conceptual and abstract than the physical design. In the logical design, you look at the logical relationships among the objects. In the physical design, you look at the most effective way of storing and retrieving the objects as well as handling them from a transportation and backup/recovery perspective.

Orient your design toward the needs of the end users. End users typically want to perform analysis and look at aggregated data, rather than at individual transactions. However, end users might not know what they need until they see it. In addition, a well-planned design allows for growth and changes as the needs of users change and evolve.

By beginning with the logical design, you focus on the information requirements and save the implementation details for later.

Creating a Logical Design

A logical design is conceptual and abstract. You do not deal with the physical implementation details yet. You deal only with defining the types of information that you need.

One technique you can use to model your organization's logical information requirements is entity-relationship modeling. Entity-relationship modeling involves identifying the things of importance (entities), the properties of these things (attributes), and how they are related to one another (relationships).

The process of logical design involves arranging data into a series of logical relationships called entities and attributes. An entity represents a chunk of information. In relational databases, an entity often maps to a table. An attribute is a component of an entity that helps define the uniqueness of the entity. In relational databases, an attribute maps to a column.

To be sure that your data is consistent, you need to use unique identifiers. A unique identifier is something you add to tables so that you can differentiate between the same item when it appears in different places. In a physical design, this is usually a primary key.

While entity-relationship diagramming has traditionally been associated with highly normalized models such as OLTP applications, the technique is still useful for data warehouse design in the form of dimensional modeling. In dimensional modeling, instead of seeking to discover atomic units of information (such as entities and attributes) and all of the relationships between them, you identify which information belongs to a central fact table and which information belongs to its associated dimension tables. You identify business subjects or fields of data, define relationships between business subjects, and name the attributes for each subject.

See Also: Chapter 9, "Dimensions" for further information regarding dimensions

Your logical design should result in (1) a set of entities and attributes corresponding to fact tables and dimension tables and (2) a model of operational data from your source into subject-oriented information in your target data warehouse schema.

You can create the logical design using a pen and paper, or you can use a design tool such as Oracle Warehouse Builder (specifically designed to support modeling the ETL process) or Oracle Designer (a general purpose modeling tool).

See Also: Oracle Designer and Oracle Warehouse Builder documentation sets

Data Warehousing Schemas

A schema is a collection of database objects, including tables, views, indexes, and synonyms. You can arrange schema objects in the schema models designed for data warehousing in a variety of ways. Most data warehouses use a dimensional model.

The model of your source data and the requirements of your users help you design the data warehouse schema. You can sometimes get the source model from your company's enterprise data model and reverse-engineer the logical data model for the data warehouse from this. The physical implementation of the logical data warehouse model may require some changes to adapt it to your system parameters: size of machine, number of users, storage capacity, type of network, and software.

Star Schemas

The star schema is the simplest data warehouse schema. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of one or more fact tables and the points of the star are the dimension tables, as shown in Figure 2–1.

Figure 2–1 Star Schema

[The figure shows a central sales fact table, with measures amount_sold and quantity_sold, surrounded by four dimension tables: products, times, customers, and channels.]

The most natural way to model a data warehouse is as a star schema, in which only one join establishes the relationship between the fact table and any one of the dimension tables.

A star schema optimizes performance by keeping queries simple and providing fast response time. All the information about each level is stored in one row.

Note: Oracle Corporation recommends that you choose a star schema unless you have a clear reason not to.
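For example, in a star query against the sample schema, each dimension is reached through exactly one join to the fact table:

SELECT p.prod_category, t.calendar_year,
       SUM(s.amount_sold) AS total_sales
FROM   sales s, products p, times t
WHERE  s.prod_id = p.prod_id
AND    s.time_id = t.time_id
GROUP BY p.prod_category, t.calendar_year;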


Other Schemas

Some schemas in data warehousing environments use third normal form rather than star schemas. Another schema that is sometimes useful is the snowflake schema, which is a star schema with normalized dimensions in a tree structure.

See Also: Chapter 17, "Schema Modeling Techniques" for further information regarding star and snowflake schemas in data warehouses and Oracle9i Database Concepts for further conceptual material

Data Warehousing Objects

Fact tables and dimension tables are the two types of objects commonly used in dimensional data warehouse schemas.

Fact tables are the large tables in your warehouse schema that store business measurements. Fact tables typically contain facts and foreign keys to the dimension tables. Fact tables represent data, usually numeric and additive, that can be analyzed and examined. Examples include sales, cost, and profit.

Dimension tables, also known as lookup or reference tables, contain the relatively static data in the warehouse. Dimension tables store the information you normally use to constrain queries. Dimension tables are usually textual and descriptive and you can use them as the row headers of the result set. Examples are customers or products.

Fact Tables

A fact table typically has two types of columns: those that contain numeric facts (often called measurements), and those that are foreign keys to dimension tables. A fact table contains either detail-level facts or facts that have been aggregated. Fact tables that contain aggregated facts are often called summary tables. A fact table usually contains facts with the same level of aggregation. Though most facts are additive, they can also be semi-additive or non-additive. Additive facts can be aggregated by simple arithmetical addition. A common example of this is sales. Non-additive facts cannot be added at all. An example of this is averages. Semi-additive facts can be aggregated along some of the dimensions and not along others. An example of this is inventory levels, where you cannot tell what a level means simply by looking at it.


Creating a New Fact Table

You must define a fact table for each star schema. From a modeling standpoint, the primary key of the fact table is usually a composite key that is made up of all of its foreign keys.
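As an illustration, a minimal fact table for the star schema of Figure 2–1 might look like the following sketch. The table and column names are assumptions drawn from the figure, not a prescribed design:

CREATE TABLE sales
( prod_id       NUMBER NOT NULL,   -- foreign key to products
  cust_id       NUMBER NOT NULL,   -- foreign key to customers
  time_id       DATE   NOT NULL,   -- foreign key to times
  channel_id    NUMBER NOT NULL,   -- foreign key to channels
  quantity_sold NUMBER NOT NULL,   -- additive fact
  amount_sold   NUMBER NOT NULL ); -- additive fact

The composite of the four foreign key columns (prod_id, cust_id, time_id, channel_id) would typically serve as the primary key.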

Dimension Tables

A dimension is a structure, often composed of one or more hierarchies, that categorizes data. Dimensional attributes help to describe the dimensional value. They are normally descriptive, textual values. Several distinct dimensions, combined with facts, enable you to answer business questions. Commonly used dimensions are customers, products, and time.

Dimension data is typically collected at the lowest level of detail and then aggregated into higher level totals that are more useful for analysis. These natural rollups or aggregations within a dimension table are called hierarchies.

Hierarchies

Hierarchies are logical structures that use ordered levels as a means of organizing data. A hierarchy can be used to define data aggregation. For example, in a time dimension, a hierarchy might aggregate data from the month level to the quarter level to the year level. A hierarchy can also be used to define a navigational drill path and to establish a family structure.

Within a hierarchy, each level is logically connected to the levels above and below it. Data values at lower levels aggregate into the data values at higher levels. A dimension can be composed of more than one hierarchy. For example, in the product dimension, there might be two hierarchies—one for product categories and one for product suppliers.

Dimension hierarchies also group levels from general to granular. Query tools use hierarchies to enable you to drill down into your data to view different levels of granularity. This is one of the key benefits of a data warehouse.

When designing hierarchies, you must consider the relationships in business structures. For example, a divisional multilevel sales organization has a hierarchy of its own that the dimension design should reflect.

Hierarchies impose a family structure on dimension values. For a particular level value, a value at the next higher level is its parent, and values at the next lower level are its children. These familial relationships enable analysts to access data quickly.


Levels A level represents a position in a hierarchy. For example, a time dimension might have a hierarchy that represents data at the month, quarter, and year levels. Levels range from general to specific, with the root level as the highest or most general level. The levels in a dimension are organized into one or more hierarchies.

Level Relationships Level relationships specify top-to-bottom ordering of levels from most general (the root) to most specific information. They define the parent-child relationship between the levels in a hierarchy.

Hierarchies are also essential components in enabling more complex rewrites. For example, the database can roll up an existing aggregation of sales revenue at the quarter level to a yearly aggregation when the dimensional dependency between quarter and year is known.

Typical Dimension Hierarchy

Figure 2–2 illustrates a dimension hierarchy based on customers.

Figure 2–2 Typical Levels in a Dimension Hierarchy

(The figure shows the levels of a customer hierarchy from most general to most specific: region, subregion, country_name, and customer.)

See Also: Chapter 9, "Dimensions" and Chapter 22, "Query Rewrite" for further information regarding hierarchies


Unique Identifiers

Unique identifiers are specified for one distinct record in a dimension table. Artificial unique identifiers are often used to avoid the potential problem of unique identifiers changing. Unique identifiers are represented with the # character. For example, #customer_id.

Relationships

Relationships guarantee business integrity. An example is that if a business sells something, there is obviously a customer and a product. Designing a relationship between the sales information in the fact table and the dimension tables products and customers enforces the business rules in the database.
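As a sketch, such a relationship can be declared as a foreign key constraint; the table and column names here are illustrative:

ALTER TABLE sales
  ADD CONSTRAINT sales_cust_fk
  FOREIGN KEY (cust_id) REFERENCES customers (cust_id);

With this constraint in place, the database rejects any sales row whose cust_id does not match an existing customer.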

Example of Data Warehousing Objects and Their Relationships

Figure 2–3 illustrates a common example of a sales fact table and dimension tables customers, products, promotions, times, and channels.

Figure 2–3 Typical Data Warehousing Objects

(The figure shows the sales fact table, with the foreign key columns cust_id and prod_id, related to the dimension tables customers, products, times, channels, and promotions. The customers dimension table has the unique identifier #cust_id and the attributes cust_last_name, cust_city, and cust_state_province, with a hierarchy over its geographic attributes; the products dimension table has the unique identifier #prod_id.)

Part III

Physical Design

This section deals with the physical design of a data warehouse. It contains the following chapters:

Physical Design in Data Warehouses

Hardware and I/O Considerations in Data Warehouses

Parallelism and Partitioning in Data Warehouses

Indexes

Integrity Constraints

Materialized Views

Dimensions

3

Physical Design in Data Warehouses

This chapter describes the physical design of a data warehousing environment, and includes the following topics:

Moving from Logical to Physical Design

Physical Design


Moving from Logical to Physical Design

Logical design is what you draw with a pen and paper or design with Oracle Warehouse Builder or Designer before building your warehouse. Physical design is the creation of the database with SQL statements.

During the physical design process, you convert the data gathered during the logical design phase into a description of the physical database structure. Physical design decisions are mainly driven by query performance and database maintenance aspects. For example, choosing a partitioning strategy that meets common query requirements enables Oracle to take advantage of partition pruning, a way of narrowing a search before performing it.

See Also: Chapter 5, "Parallelism and Partitioning in Data Warehouses" for further information regarding partitioning and Oracle9i Database Concepts for further conceptual material regarding all design matters

Physical Design

During the logical design phase, you defined a model for your data warehouse consisting of entities, attributes, and relationships. The entities are linked together using relationships. Attributes are used to describe the entities. The unique identifier (UID) distinguishes between one instance of an entity and another.

Figure 3–1 offers you a graphical way of looking at the different ways of thinking about logical and physical designs.


Figure 3–1 Logical Design Compared with Physical Design

(The figure maps the logical design constructs on the left (entities, relationships, attributes, and unique identifiers) to the physical structures on the right, stored as tablespaces: tables; integrity constraints (primary key, foreign key, not null); columns; indexes; materialized views; and dimensions.)

During the physical design process, you translate the expected schemas into actual database structures. At this time, you have to map:

Entities to tables

Relationships to foreign key constraints

Attributes to columns

Primary unique identifiers to primary key constraints

Unique identifiers to unique key constraints


Physical Design Structures

Once you have converted your logical design to a physical one, you will need to create some or all of the following structures:

Tablespaces

Tables and Partitioned Tables

Views

Integrity Constraints

Dimensions

Some of these structures require disk space. Others exist only in the data dictionary. Additionally, the following structures may be created for performance improvement:

Indexes and Partitioned Indexes

Materialized Views

Tablespaces

A tablespace consists of one or more datafiles, which are physical structures within the operating system you are using. A datafile is associated with only one tablespace. From a design perspective, tablespaces are containers for physical design structures.

Tablespaces should be separated according to their differences in usage. For example, tables should be separated from their indexes, and small tables should be separated from large tables. Tablespaces should also represent logical business units if possible. Because a tablespace is the coarsest granularity for backup and recovery or the transportable tablespaces mechanism, the logical business design affects availability and maintenance operations.
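A minimal sketch of creating a tablespace follows; the file path and size are hypothetical:

CREATE TABLESPACE sales_ts
DATAFILE '/disk1/oradata/sales01.dbf' SIZE 2000M;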

See Also: Chapter 4, "Hardware and I/O Considerations in Data Warehouses" for further information regarding tablespaces


Tables and Partitioned Tables

Tables are the basic unit of data storage. They are the container for the expected amount of raw data in your data warehouse.

Using partitioned tables instead of nonpartitioned ones addresses the key problem of supporting very large data volumes by allowing you to decompose them into smaller and more manageable pieces. The main design criterion for partitioning is manageability, though you will also see performance benefits in most cases because of partition pruning or intelligent parallel processing. For example, you might choose a partitioning strategy based on a sales transaction date and a monthly granularity. If you have four years’ worth of data, you can delete a month’s data as it becomes older than four years with a single, quick DDL statement and load new data while only affecting 1/48th of the complete table. Business questions regarding the last quarter will only affect three months, which is equivalent to three partitions, or 3/48ths of the total volume.

Partitioning large tables improves performance because each partitioned piece is more manageable. Typically, you partition based on transaction dates in a data warehouse. For example, each month, one month’s worth of data can be assigned its own partition.
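A sketch of such a monthly range-partitioning strategy follows; the names are illustrative and only two partitions are shown:

CREATE TABLE sales
( time_id     DATE NOT NULL,
  cust_id     NUMBER,
  amount_sold NUMBER )
PARTITION BY RANGE (time_id)
( PARTITION sales_jan2002
    VALUES LESS THAN (TO_DATE('01-FEB-2002','DD-MON-YYYY')),
  PARTITION sales_feb2002
    VALUES LESS THAN (TO_DATE('01-MAR-2002','DD-MON-YYYY')) );

-- Removing the oldest month of data is then a single, quick DDL statement:
ALTER TABLE sales DROP PARTITION sales_jan2002;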

Data Segment Compression

You can save disk space by compressing heap-organized tables. Partitioned tables are a typical kind of heap-organized table to consider for data segment compression.

To reduce disk use and memory use (specifically, the buffer cache), you can store tables and partitioned tables in a compressed format inside the database. This often leads to a better scaleup for read-only operations. Data segment compression can also speed up query execution. There is, however, a cost in CPU overhead.

Data segment compression should be used with highly redundant data, such as tables with many foreign keys. You should avoid compressing tables with much update or other DML activity. Although compressed tables or partitions are updatable, there is some overhead in updating these tables, and high update activity may work against compression by causing some space to be wasted.
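As a sketch, compression can be specified when a table is created, or applied to an existing partition by rebuilding it; the names here are hypothetical:

CREATE TABLE sales_history
( prod_id     NUMBER,
  time_id     DATE,
  amount_sold NUMBER )
COMPRESS;

-- Compress an older, rarely updated partition of an existing table:
ALTER TABLE sales MOVE PARTITION sales_jan2002 COMPRESS;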

See Also: Chapter 5, "Parallelism and Partitioning in Data Warehouses" and Chapter 14, "Maintaining the Data Warehouse" for information regarding data segment compression and partitioned tables


Views

A view is a tailored presentation of the data contained in one or more tables or other views. A view takes the output of a query and treats it as a table. Views do not require any space in the database.
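For example, a view that presents sales summarized by state might be defined as follows (an illustrative sketch; the underlying tables are assumed to exist):

CREATE VIEW sales_by_state AS
SELECT c.cust_state_province, SUM(s.amount_sold) AS amount_sold
FROM   sales s, customers c
WHERE  s.cust_id = c.cust_id
GROUP  BY c.cust_state_province;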

See Also: Oracle9i Database Concepts

Integrity Constraints

Integrity constraints are used to enforce business rules associated with your database and to prevent invalid information from entering the tables. Integrity constraints in data warehousing differ from constraints in OLTP environments. In OLTP environments, they primarily prevent the insertion of invalid data into a record; this is less of a concern in data warehousing environments because accuracy has typically already been verified before the data is loaded. In data warehousing environments, constraints are used primarily for query rewrite. NOT NULL constraints are particularly common in data warehouses. Under some specific circumstances, constraints need space in the database; this space takes the form of the underlying unique index.
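For example, a constraint that exists only to support query rewrite can often be declared as trusted but neither enforced nor validated. The following is a sketch with illustrative names:

ALTER TABLE sales
  ADD CONSTRAINT sales_prod_fk
  FOREIGN KEY (prod_id) REFERENCES products (prod_id)
  RELY DISABLE NOVALIDATE;

The RELY clause tells the optimizer it may trust the relationship for query rewrite, while DISABLE NOVALIDATE avoids the cost of enforcing or checking the constraint at load time.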

See Also: Chapter 7, "Integrity Constraints" and Chapter 22, "Query Rewrite"

Indexes and Partitioned Indexes

Indexes are optional structures associated with tables or clusters. In addition to the classical B-tree indexes, bitmap indexes are very common in data warehousing environments. Bitmap indexes are optimized index structures for set-oriented operations. Additionally, they are necessary for some optimized data access methods such as star transformations.

Indexes are just like tables in that you can partition them, although the partitioning strategy is not dependent upon the table structure. Partitioning indexes makes it easier to manage the warehouse during refresh and improves query performance.
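As an illustration, a bitmap index on a low-cardinality fact table column, partitioned to match the table (the names are assumptions, and LOCAL presumes the sales table is itself partitioned):

CREATE BITMAP INDEX sales_channel_bix
ON sales (channel_id)
LOCAL;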

See Also: Chapter 6, "Indexes" and Chapter 14, "Maintaining the Data Warehouse"


Materialized Views

Materialized views are query results that have been stored in advance so long-running calculations are not necessary when you actually execute your SQL statements. From a physical design point of view, materialized views resemble tables or partitioned tables and behave like indexes.
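A minimal sketch of a materialized view that precomputes a monthly sales aggregate follows; the table and column names are assumptions, and query rewrite is presumed to be enabled for the session:

CREATE MATERIALIZED VIEW sales_month_mv
BUILD IMMEDIATE
REFRESH COMPLETE
ENABLE QUERY REWRITE
AS
SELECT t.calendar_month_desc, SUM(s.amount_sold) AS total_sales
FROM   sales s, times t
WHERE  s.time_id = t.time_id
GROUP  BY t.calendar_month_desc;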

See Also: Chapter 8, "Materialized Views"

Dimensions

A dimension is a schema object that defines hierarchical relationships between columns or column sets. A hierarchical relationship is a functional dependency from one level of a hierarchy to the next one. A dimension is a container of logical relationships and does not require any space in the database. A typical dimension is city, state (or province), region, and country.
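Such a geographic hierarchy might be declared as in the following sketch; the column names are assumptions, and the data must actually satisfy the implied functional dependencies:

CREATE DIMENSION geog_dim
  LEVEL city    IS customers.cust_city
  LEVEL state   IS customers.cust_state_province
  LEVEL region  IS customers.cust_region
  LEVEL country IS customers.cust_country
  HIERARCHY geog_rollup (
    city CHILD OF state CHILD OF region CHILD OF country);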

See Also: Chapter 9, "Dimensions"


4

Hardware and I/O Considerations in Data Warehouses

This chapter explains some of the hardware and I/O issues in a data warehousing environment and includes the following topics:

Overview of Hardware and I/O Considerations in Data Warehouses

RAID Configurations


Overview of Hardware and I/O Considerations in Data Warehouses

I/O performance is normally a chief concern for data warehouses. This is in contrast to OLTP systems, where the potential bottleneck depends on user workload and application access patterns. When a system is constrained by I/O capabilities, it is I/O bound, or has an I/O bottleneck. When a system is constrained by having limited CPU resources, it is CPU bound, or has a CPU bottleneck.

Database architects frequently use RAID (Redundant Arrays of Inexpensive Disks) systems to overcome I/O bottlenecks and to provide higher availability. RAID can be implemented in several levels, ranging from 0 to 7. Many hardware vendors have enhanced these basic levels to lessen the impact of some of the original restrictions at a given RAID level. The most common RAID levels are discussed later in this chapter.

Why Stripe the Data?

To avoid I/O bottlenecks during parallel processing or concurrent query access, all tablespaces accessed by parallel operations should be striped. Striping divides the data of a large table into small portions and stores them on separate datafiles on separate disks. As shown in Figure 4–1, tablespaces should always stripe over at least as many devices as CPUs. In this example, there are four CPUs, two controllers, and five devices containing tablespaces.

Figure 4–1 Striping Objects Over at Least as Many Devices as CPUs

(The figure shows two controllers, each attached to devices holding striped tablespaces; each of the five tablespaces is divided into small pieces spread across four devices, so every tablespace spans at least as many devices as there are CPUs.)

See Also: Oracle9i Database Concepts for further details about disk striping


You should stripe tablespaces for tables, indexes, rollback segments, and temporary tablespaces. You must also spread the devices over controllers, I/O channels, and internal buses. To make striping effective, you must make sure that enough controllers and other I/O components are available to support the bandwidth of parallel data movement into and out of the striped tablespaces.

You can use RAID systems or you can perform striping manually through careful data file allocation to tablespaces.

The striping of data across physical drives has several consequences besides balancing I/O. One additional advantage is that logical files can be created that are larger than the maximum size usually supported by an operating system. There are disadvantages however. Striping means that it is no longer possible to locate a single datafile on a specific physical drive. This can cause the loss of some application tuning capabilities. Also, it can cause database recovery to be more time-consuming. If a single physical disk in a RAID array needs recovery, all the disks that are part of that logical RAID device must be involved in the recovery.

Automatic Striping

Automatic striping is usually flexible and easy to manage. It supports many scenarios, such as multiple users running sequentially or a single user running in parallel. Two main advantages make automatic striping preferable to manual striping, unless the system is very small or availability is the main concern:

For parallel scan operations (such as full table scan or fast full scan), operating system striping increases the number of disk seeks. Nevertheless, this is largely offset by the large I/O size (DB_BLOCK_SIZE * DB_FILE_MULTIBLOCK_READ_COUNT), which should enable this operation to reach the maximum I/O throughput for your platform. This maximum is in general limited by the number of controllers or I/O buses of the platform, not by the number of disks (unless you have a small configuration or are using large disks).

For index probes (for example, within a nested loop join or parallel index range scan), operating system striping enables you to avoid hot spots by evenly distributing I/O across the disks.

Oracle Corporation recommends using a large stripe size of at least 64 KB. Stripe size must be at least as large as the I/O size. If stripe size is larger than I/O size by a factor of two or four, then trade-offs may arise. The large stripe size can be advantageous because it lets the system perform more sequential operations on each disk; it decreases the number of seeks on disk. Another advantage of large stripe sizes is that more users can work on the system without affecting each other. The disadvantage is that large stripes reduce the I/O parallelism, so fewer disks are simultaneously active. If you encounter problems, increase the I/O size of scan operations (for example, from 64 KB to 128 KB), instead of changing the stripe size. The maximum I/O size is platform-specific (in a range, for example, of 64 KB to 1 MB).
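As a worked example under hypothetical values: with an 8 KB DB_BLOCK_SIZE and DB_FILE_MULTIBLOCK_READ_COUNT set to 16, the I/O size is 8 KB * 16 = 128 KB, so by the preceding guideline the stripe size should be at least 128 KB.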

With automatic striping, from a performance standpoint, the best layout is to stripe data, indexes, and temporary tablespaces across all the disks of your platform. This layout is also appropriate when you have little information about system usage. To increase availability, it may be more practical to stripe over fewer disks to prevent a single disk failure from affecting the entire data warehouse. However, for better performance, it is crucial to stripe all objects over multiple disks. In this way, maximum I/O performance (both in terms of throughput and in number of I/Os per second) can be reached when one object is accessed by a parallel operation. If multiple objects are accessed at the same time (as in a multiuser configuration), striping automatically limits the contention.

Manual Striping

You can use manual striping on all platforms. To do this, add multiple files to each tablespace, with each file on a separate disk. If you use manual striping correctly, your system’s performance improves significantly. However, you should be aware of several drawbacks that can adversely affect performance if you do not stripe correctly.
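As a sketch of manual striping, a tablespace can be given one datafile on each of several disks; all device paths here are hypothetical:

CREATE TABLESPACE stripe_ts
DATAFILE '/disk1/oradata/stripe01.dbf' SIZE 1000M,
         '/disk2/oradata/stripe02.dbf' SIZE 1000M,
         '/disk3/oradata/stripe03.dbf' SIZE 1000M,
         '/disk4/oradata/stripe04.dbf' SIZE 1000M;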

When using manual striping, the degree of parallelism (DOP) is more a function of the number of disks than of the number of CPUs. First, it is necessary to have one server process for each datafile to drive all the disks and limit the risk of experiencing I/O bottlenecks. Second, manual striping is very sensitive to datafile size skew, which can affect the scalability of parallel scan operations. Third, manual striping requires more planning and set-up effort than automatic striping.

Note: Oracle Corporation recommends that you choose automatic striping unless you have a clear reason not to.


Local and Global Striping

Local striping, which applies only to partitioned tables and indexes, is a form of non-overlapping, disk-to-partition striping. Each partition has its own set of disks and files, as illustrated in Figure 4–2. Disk access does not overlap, nor do files.

An advantage of local striping is that if one disk fails, it does not affect other partitions. Moreover, you still have some striping even if you have data in only one partition.

A disadvantage of local striping is that you need many disks to implement it: each partition requires multiple disks of its own. Another major disadvantage is that when access is reduced to a few partitions or even a single partition, only the limited I/O bandwidth of those partitions' disks remains available. As a result, local striping is not optimal for parallel operations. For this reason, consider local striping only if your main concern is availability, rather than parallel execution.

Figure 4–2 Local Striping

(The figure shows two partitions, each striped across its own, non-overlapping set of disks: stripes 1 and 2 belong to partition 1, and stripes 3 and 4 belong to partition 2.)


Global striping, illustrated in Figure 4–3, entails overlapping disks and partitions.

Figure 4–3 Global Striping

(The figure shows stripes 1 and 2 each spanning the disks of both partition 1 and partition 2, so disks and partitions overlap.)

Global striping is advantageous if you have partition pruning and need to access data in only one partition. Spreading the data in that partition across many disks improves performance for parallel execution operations. A disadvantage of global striping is that if one disk fails, all partitions are affected if the disks are not mirrored.

See Also: Oracle9i Database Concepts for information on disk striping and partitioning. For MPP systems, see your operating system specific Oracle documentation regarding the advisability of disabling disk affinity when using operating system striping

Analyzing Striping

Two considerations arise when analyzing striping issues for your applications. First, consider the cardinality of the relationships among the objects in a storage system. Second, consider what you can optimize in your striping effort: full table scans, general tablespace availability, partition scans, or some combinations of these goals. Cardinality and optimization are discussed in the following section.


Cardinality of Storage Object Relationships

To analyze striping, consider the relationships illustrated in Figure 4–4.

Figure 4–4 Cardinality of Relationships

(The figure diagrams the cardinality of the relationships among storage objects: one table is divided into many partitions, partitions are assigned to a tablespace, a tablespace is composed of files, and files map many-to-many onto devices.)