
ACTA Journal, September 1997

Building a Data Warehouse

Technical Report

Building a Data Warehouse


Using Oracle OLAP Tools

Satish Mahajan

Center of Expertise, Worldwide Customer Support, Oracle Corporation, September 1997


TABLE OF CONTENTS
2.1 Prototype Development
3.0 PRE-DEVELOPMENT ACTIVITIES
3.1 End User Requirements Analysis
3.2 Capacity Planning
4.0 DEVELOP SYSTEM ARCHITECTURE
5.0 DESIGN AND DEVELOPMENT
5.1 Data Management - Oracle Side
5.2 Data Management - Express Side
6.0 DESIGN AND DEVELOPMENT - PROGRAM MANAGEMENT
6.1 Oracle Design Goals
6.2 Program Modules
6.3 Program Management - Express Side
GLOSSARY

Copyright Oracle Corporation 1997. All rights reserved. Printed in the U.S.A. Author: Satish Mahajan. History: This paper was originally presented at Oracle Open World, Australia, in 1996. Oracle is a registered trademark of Oracle Corporation. Oracle7 is a trademark of Oracle Corporation. All other products or company names are used for identification purposes only, and may be trademarks of their respective owners. NO PART OF THIS DOCUMENT MAY BE REPRODUCED, IN ANY FORM, WITHOUT THE PERMISSION OF THE AUTHORS. THIS STIPULATION ALSO APPLIES TO ORACLE EMPLOYEES.


INTRODUCTION
This paper describes how to develop a large data warehouse. Specifically, we are interested in using the Oracle Express family of tools for Online Analytical Processing (OLAP). The paper assumes that the reader is familiar with basic data warehouse and multi-dimensional database concepts. Definitions of the most commonly used terms are provided in a glossary at the end of the paper. Our methodology, as listed below, is generally accepted within the industry and the literature.

1. Management collaboration: get the management sponsor, technical and project management, technical resources, project scope and schedule in place.
2. Know the end users: gather end user requirements, rapid prototype development.
3. Pre-development activity: user requirements analysis, capacity planning, volume sizing, confirmation of requirements.
4. Develop system architecture: hardware and software layouts.
5. Design and Development: data management on the Oracle side, data management on the Express side.
6. Design and Development (continued): design goals, program management on the Oracle side, program management on the Express side.

The rest of this paper provides a checklist approach for each of the above steps. Specifically, from an Oracle perspective, we are interested in solutions to problems in the areas of design, development, and implementation.

1.0 MANAGEMENT COLLABORATION


1.1 Having a management sponsor for a data warehouse project is a mandatory requirement. It is important for the management contact to know the following things:

- The scope of the project is generally very large, and hence the costs of the proposed system's hardware, software, manpower and maintenance burden should be budgeted.
- Communication with peer executives should set the right expectations about the project. Preparing the background for an understanding of the basic concepts of data warehousing and OLAP technology is important. It is also necessary to prepare a return-on-investment estimate so that the project effort is unanimously backed by the management.

The management sponsor should find a project manager who has a broad understanding of the business and the business goals of the proposed system, in addition to experience in managing large projects.

1.2 The project manager should opt for a technical architect with broad experience in developing large systems using the Oracle database and tools. Familiarity with the Oracle Express product family and general multidimensional database concepts is also required.

1.3 Incremental addition of technical resources during development is suggested; keeping the team size small often helps create a solid foundation for the project. The initial team can consist of one technical person each on the Oracle and Express sides, in addition to the architect of the system. A detailed project schedule and additional resources can be sought after the prototype is done.


2.0 KNOW THE END USERS


Because data warehousing and OLAP are relatively new concepts, it is a good idea to develop a small prototype. This prototype can then be used for gathering overall end user requirements.

2.1 Prototype Development


2.1.1 After talking to one or two key end users, two simple analysis measures with about three dimensions (preferably the same dimensions, with time being one of them) from the same operational data source can be selected for the prototype.

2.1.2 The prototype can be created on a desktop PC with a small amount of data, chosen so that it represents about three levels of the time dimension. Assuming the prototype PC is on a corporate network, the data can be loaded into a Personal Oracle database. With such a real-time data load, the prototype makes a good impression on the end users.

2.1.3 Personal Express Server and Oracle Sales Analyzer are the obvious choices on the Express side. The Personal Express Server can communicate with the Personal Oracle database over an ODBC link. Because Oracle Sales Analyzer is a canned application, it is very easy to create graph and table objects for the selected measures.

2.1.4 On the Oracle side, a fact table for each measure needs to be created along with the corresponding dimension tables.

2.1.5 On the Express side, dimensions and data cubes corresponding to each measure need to be defined so that these objects can be populated with data values from the Oracle side objects. Data values can be aggregated along all the dimensions in Express Server, since a direct rollup command is available that needs no programming effort.

2.2 User Requirements Gathering

A visual presentation in the form of a prototype demonstration is a very effective way of gathering the end user requirements for data analysis and reporting. An analyst with a good understanding of the business is required to carry out this task.

3.0 PRE-DEVELOPMENT ACTIVITIES


3.1 End User Requirements Analysis
3.1.1 The requirements gathering phase is followed by requirements analysis. This can be conceptualized by taking the following factors into account:

- business measures of interest to the end users
- all dimensions of each measure
- hierarchies and levels in each dimension
- time duration for which the data trend is observed for each measure
- end user response time requirements for on-line data analysis
- data refresh cycle requirements
- all operational data sources from which data is required to be gathered


3.1.2 It is also important to decide which measures are needed for on-line data analysis as opposed to reporting. OLAP tools are most effective when used for on-line data analysis, although they provide some reporting capabilities. An important criterion for the distinction is that an on-line measure can answer all the end user questions on the given subject matter with simple user actions such as mouse clicks, rather than requiring the user to wait for other reports to come out. This way, end users can correlate different trends and exceptions, rank best and worst scenarios on-line, and do much more analysis without disturbing their train of thought.

3.1.3 Once all the measures are in perspective, the next task is to prioritize them. Generally, reporting measures can be implemented with relational query tools, since response time is not a major concern. On-line measures hence take precedence over reporting measures in a heavy analysis environment. Another angle for prioritization is the group of users that will be using a given measure. Overall, end users can be divided into at least two categories: users doing typical analysis and those who do research-type analysis. It is a good idea to implement a few measures for each category of users at every delivery.

All these steps of requirements analysis are very important because they lay the foundation for the overall system design and its successful implementation.

3.2 Capacity Planning


Before entering into actual system development, it is imperative to plan the capacity of the proposed system, which helps determine the delivery and implementation plan. Capacity planning can be achieved by considering:

- the processes carried out on the system (for CPU and memory sizing), and
- data volume sizing (for disk requirements).

3.2.1 A major part of capacity planning applies to the server side, where the Oracle and Express databases reside. On the client side, a standard desktop PC or a laptop running Windows 95 or Windows NT is assumed. Oracle Sales Analyzer or Express Objects applications run as standard Windows applications. With changing product directions for the front-end tools, these products are becoming web enabled, and the front end can then be run from any web browser. The focus of planning is then going to shift to the networking environment.

3.2.2 On the server side, there are many processes that can determine the size of the system in terms of CPUs, memory and disks. The volume sizing step discussed in the next section gives pointers for the disk requirement. It is a good idea to carry out actual benchmarks with multiple configurations for CPU and memory requirements. Important processes to consider on the Oracle side are:

- the load rate of raw data, since it has a bearing on the refresh cycle
- representative programs that will be run as a part of dimension cleanup (scrub process) and fact table creation
- representative queries that will be run as a part of the rollup process on the Oracle side

Important processes on the Express side are:

- data load times and space requirements for dimensions in multiple databases
- data load times and space requirements with conjoints of the same dimensions
- data load times for loading data cubes
- CPU and memory requirements with multiple users doing a variety of analytical tasks
- drill-through times for the analysis on virtual cubes

3.2.3. Volume Sizing

Volume sizing helps determine the disk capacity needed on the proposed warehouse system. The basic components of volume sizing are:

- Operating system level requirements (swap, mirroring, backup, etc.)
- Oracle database system level requirements (system, rollback, temporary, redo logs, etc.)
- Staging area for data feeds from operational data sources
- Space required for data processing (data loading, data cleaning, rollup)
- Express database space required for system, dimensions, physical data cubes, etc.

It is also necessary to look at the measures from the rollup viewpoint. Depending on the quality of the data, the data volume sometimes shrinks as summarization is carried out, but if the data under consideration is sparse, the data volume explodes at higher levels of summarization. This sparsity effect is more pronounced for measures that have more dimensions and more levels in each dimension. It then becomes necessary to revisit the end user requirements analysis and break down a single measure with a large number of dimensions into multiple measures with fewer dimensions. This causes some loss of analytical capability but turns an impossible measure into a feasible one. The volume sizing exercise also helps in deciding which measures will become physical cubes and which ones will be virtual cubes in Express Server.

Volume sizing is relatively easy and accurate because data is mostly represented in numeric format when relational objects are converted into hierarchies for the multi-dimensional world. Also, the bytes required to represent numbers, the levels in each dimension, and the possible data explosion or shrinkage at each level are either known or can be deduced by minor testing.

Once the volume sizing is done, it is important to reconcile the end user requirements analysis with any modifications made to the measures because of volume constraints. It is a good idea to get end user confirmation of the final measures and their dimensionality.
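The sparsity effect described above can be checked with a small experiment before committing to a design. The following Python sketch is only an illustration under assumed data: the two dimensions, their three-level hierarchies, the function name `rollup_cell_counts` and the 8 bytes per cell are all hypothetical. It counts the non-empty cells that would exist at every combination of levels, first for a sparse base fact and then for a fully dense one.

```python
from itertools import product


def rollup_cell_counts(base_cells, hierarchies):
    """Count distinct non-empty cells for every combination of levels.

    base_cells: tuples of leaf-level keys, one key per dimension.
    hierarchies: one list per dimension; entry `lvl` of that list maps a
        leaf key to its ancestor at level `lvl` (level 0 is the leaf itself).
    """
    base_cells = list(base_cells)
    counts = {}
    for levels in product(*(range(len(h)) for h in hierarchies)):
        cells = {
            tuple(
                hierarchies[d][lvl][key]
                for d, (lvl, key) in enumerate(zip(levels, cell))
            )
            for cell in base_cells
        }
        counts[levels] = len(cells)
    return counts


# Hypothetical hierarchies: day -> month -> year, product -> group -> all.
days = ["d1", "d2", "d3", "d4"]
products = ["p1", "p2", "p3", "p4"]
time_h = [
    {d: d for d in days},
    {"d1": "m1", "d2": "m1", "d3": "m2", "d4": "m2"},
    {d: "y1" for d in days},
]
prod_h = [
    {p: p for p in products},
    {"p1": "g1", "p2": "g1", "p3": "g2", "p4": "g2"},
    {p: "all" for p in products},
]

sparse = list(zip(days, products))      # 4 of 16 possible base cells
dense = list(product(days, products))   # all 16 base cells

sparse_total = sum(rollup_cell_counts(sparse, [time_h, prod_h]).values())
dense_total = sum(rollup_cell_counts(dense, [time_h, prod_h]).values())

BYTES_PER_CELL = 8  # assumed storage per numeric cell
print("sparse:", len(sparse), "base cells ->", sparse_total, "after rollup,",
      sparse_total * BYTES_PER_CELL, "bytes")
print("dense: ", len(dense), "base cells ->", dense_total, "after rollup,",
      dense_total * BYTES_PER_CELL, "bytes")
```

In this toy data the sparse measure grows from 4 to 27 stored cells, while the dense one grows from 16 to 49. The relative explosion is therefore larger for the sparse case, which is the pattern the text warns about.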


4.0 DEVELOP SYSTEM ARCHITECTURE


[Diagram with three columns (client side, communication, server side) and two rows (hardware, software). Recovered labels:]

Hardware
  Client side: desktop PCs and laptops (with docking station)
  Communication: Ethernet 10BaseT or CDDI; network connecting other operational systems
  Server side: UNIX system holding Oracle Server data and Express Server data

Software
  Client side: Express Objects V2.0 / Sales Analyzer application; PC DCE services bundled as a part of Express Objects software
  Communication: client side applications use remote API to access data from Express Server based on the DCE services
  Server side: Express Server V5.0 with RAA/RAM; Oracle Server V7.3; DCE service support from the UNIX operating system

Figure 1: Typical System Architecture

A typical client-server solution in an open systems environment using Oracle's relational and Express family products is shown in Figure 1. At the hardware level, there is a single powerful server at the back end (mostly UNIX) which serves many clients on the front end (mostly PCs and laptops). With changing product directions, the communication link between the client and server nodes is going to become the corporate intranet/internet, and client machines are going to work more like thin clients. The overall solution is still valid and upgradable to the new scenario.

At the software level, the system architecture is a three-tier solution: the Oracle database and Express database are parts of the server side, and Express applications running on Windows are part of the client side. If the load on the server side keeps growing, the Oracle and Express databases can be installed on separate machines that use SQL*Net for communication. This provides scalability as well as some fall-back mechanism if one of the servers goes down. This solution is also portable, since Oracle and Express Server software is available on the majority of hardware platforms and the client-server communication is Distributed Computing Environment (DCE) compatible.


5.0 DESIGN AND DEVELOPMENT


This phase is applicable to two main areas, data management and program management, and activities in these areas can be carried out concurrently.

5.1 Data Management - Oracle Side


5.1.1. Protocols With Data Source Owners

The outcome of user requirements analysis identifies all the objects that are needed from each data source (operational system) in the organization. It is important to form a protocol with the owners of each data source covering the following points:

- Name of the data source
- Management contact(s) details
- Technical contact(s) details
- Machine access details
- Database access details
- Best time for data transfers
- Historical data details
- History, frequency and details of structure changes in the data

5.1.2. Data Object Details

For every data source, it is necessary to identify all the objects needed during each refresh cycle. The following points are important in this area:

- Object details
- Object type (static or dynamic)
- Data transfer method
- Object size
- Growth at every refresh cycle
- Relationship with object(s) in the other data sources

At this point, it is important to understand how each object is going to move through the various modules on the Oracle and Express sides so that it is finally represented as either a data cube or a dimension. Details about this and all the above-mentioned points become a part of the metadata for the system.
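One way to picture how the points in 5.1.1 and 5.1.2 become metadata is as a pair of record types. The Python sketch below is a hypothetical illustration, not the paper's repository design: the class names, field names, units and sample values are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DataObject:
    """One object pulled from a data source at each refresh (5.1.2)."""
    name: str
    object_type: str              # "static" or "dynamic"
    transfer_method: str          # e.g. flat file, snapshot
    size_mb: float
    growth_mb_per_refresh: float
    related_objects: list = field(default_factory=list)
    target: str = "dimension"     # ends up as a "dimension" or a "data cube"


@dataclass
class DataSource:
    """Protocol agreed with a data source owner (5.1.1)."""
    name: str
    management_contact: str
    technical_contact: str
    machine_access: str
    database_access: str
    transfer_window: str
    history_notes: str = ""
    structure_change_notes: str = ""
    objects: list = field(default_factory=list)

    def projected_size_mb(self, refresh_cycles):
        """Total size of all objects after the given number of refreshes."""
        return sum(
            o.size_mb + o.growth_mb_per_refresh * refresh_cycles
            for o in self.objects
        )


# Entirely made-up example entries.
orders = DataObject("orders", "dynamic", "flat file", 500.0, 20.0,
                    related_objects=["product_codes"], target="data cube")
codes = DataObject("product_codes", "static", "snapshot", 5.0, 0.0)
source = DataSource(
    name="order_entry",
    management_contact="(management contact)",
    technical_contact="(technical contact)",
    machine_access="(host and login details)",
    database_access="(schema and account details)",
    transfer_window="02:00-04:00",
    objects=[orders, codes],
)

projected = source.projected_size_mb(refresh_cycles=12)
print(projected)
```

Keeping size and growth per object in the same record lets the staging area estimate from the volume sizing step be recomputed whenever a source owner reports a change.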


5.2 Data Management - Express Side


From the volume sizing calculations in the capacity planning stage, it should be clear which objects should be loaded into the Express database. The following points are important for data management on the Express side:

- Since data loading into an Express database is currently single threaded, a single physical database increases the overall refresh cycle time for the system.
- Creating multiple physical databases increases availability as seen by the end users in case of a single database failure. However, this also increases load time, because all the dimension objects must be loaded into all the databases.
- Conjoint dimensions guarantee a 100% dense object and thus can eliminate sparsity issues. However, the cost of conjoint maintenance for changing dimensions does not scale linearly as the number of values in the conjoint increases.
- Loading and maintaining small cubes in the Express database and keeping the remaining data in the Oracle database for reach-through gives a feasible solution in many situations.
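The conjoint idea can be illustrated outside Express with plain Python. The sketch below is an assumption-laden toy, not Express syntax: the customer, product and month values and the function `build_conjoint_cube` are hypothetical. It stores only the (customer, product) pairs that occur and compares the resulting cell count with a full cube.

```python
def build_conjoint_cube(facts, months):
    """Store values against valid (customer, product) pairs only.

    facts: dict mapping (customer, product, month) -> value.
    Returns the sorted list of valid pairs (the conjoint) and a
    pairs x months grid of values.
    """
    pairs = sorted({(c, p) for (c, p, _m) in facts})
    pair_pos = {pair: i for i, pair in enumerate(pairs)}
    month_pos = {m: j for j, m in enumerate(months)}
    cube = [[None] * len(months) for _ in pairs]
    for (c, p, m), value in facts.items():
        cube[pair_pos[(c, p)]][month_pos[m]] = value
    return pairs, cube


customers = ["c1", "c2", "c3"]
products = ["p1", "p2", "p3"]
months = ["1997-01", "1997-02"]

# Each customer buys a single product, so most combinations never occur.
facts = {
    ("c1", "p1", "1997-01"): 10, ("c1", "p1", "1997-02"): 12,
    ("c2", "p3", "1997-01"): 7,  ("c2", "p3", "1997-02"): 9,
    ("c3", "p2", "1997-01"): 4,  ("c3", "p2", "1997-02"): 5,
}

pairs, cube = build_conjoint_cube(facts, months)
full_cells = len(customers) * len(products) * len(months)
conjoint_cells = len(pairs) * len(months)
filled = sum(v is not None for row in cube for v in row)

print("full cube cells:", full_cells)
print("conjoint cube cells:", conjoint_cells, "filled:", filled)
```

The maintenance cost mentioned above shows up here as well: if a customer starts buying a new product, the sorted list of pairs changes and existing positions may shift, so the structure has to be rebuilt or remapped.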


6.0 DESIGN AND DEVELOPMENT - PROGRAM MANAGEMENT


6.1 Oracle Design Goals
The following goals are important for program design:

- Scalability: Once the basic system is designed and developed, it should be extensible so that new data sources can be added as new analytical capabilities are added to the system.
- Maintenance: Changes in object structures and inconsistent data values should be automatically detected by the system and communicated to the concerned data administrators. Monitoring of regular data growth patterns, warnings and alerts about space management, regular backup procedures, etc. should be automated so that system and data administrators can invest their time in system enhancements.
- Recoverability: In case of a failure:
  - a complete system rebuild should be avoided except for catastrophic failures.
  - parts of the system should be made incrementally available to the end users according to their preference.
  - the system should resume its processing from the point where it failed and still produce consistent results at the end of the processing cycle.
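The last recoverability goal, resuming from the point of failure, can be sketched as a job runner that records each completed step. The Python below is a minimal illustration under stated assumptions: the step names are borrowed from the module list in this paper, but the JSON state file, the `run_pipeline` function and the simulated failure are hypothetical. A real system would more likely keep this state in the metadata repository.

```python
import json
import os
import tempfile

STEPS = ["extraction", "load", "scrub", "base_fact", "rollup"]


def run_pipeline(steps, state_path, actions):
    """Run steps in order, skipping those already recorded as completed."""
    done = []
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = json.load(f)
    for step in steps:
        if step in done:
            continue
        actions[step]()  # may raise; the state file keeps the last good step
        done.append(step)
        with open(state_path, "w") as f:
            json.dump(done, f)
    return done


# Demonstration with a scrub step that fails once, then succeeds.
log = []
failures = {"scrub": 1}


def make_action(name):
    def action():
        if failures.get(name, 0) > 0:
            failures[name] -= 1
            raise RuntimeError(name + " failed")
        log.append(name)
    return action


actions = {s: make_action(s) for s in STEPS}
state_path = os.path.join(tempfile.mkdtemp(), "state.json")

try:
    run_pipeline(STEPS, state_path, actions)     # stops at scrub
except RuntimeError as exc:
    print("first run stopped:", exc)

completed = run_pipeline(STEPS, state_path, actions)  # resumes at scrub
print("executed steps:", log)
```

Each step runs exactly once across the two runs, which is what allows the cycle to finish with consistent results. That only holds if every step is either atomic or safe to repeat, an assumption this sketch does not enforce.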


6.2 Program Modules


With the above design goals in mind, program modules as shown in Figure 2 can be established.

[Figure 2: program modules. Labels recovered from the diagram:]

Oracle side processes: Extraction; Load; Scrub; Create Base Fact; Rollup; Global Error & State Processing (talking via SQL & PL/SQL blocks; jobs communicating via OEM)
Metadata: Metadata Warehouse + OEM + RAA; Metadata Setup Module
External world interface: Operational Data Sources
Express side processes: Security; Presentation; Data; Express DBA; Bridge (talking via DCE/RPC); Express Server Databases
Front end: Sales Analyzer / Express Objects Front-End Application

A brief discussion of each module follows:

- Extraction: This module extracts raw data by connecting to all data sources. Various extraction methods such as ftp and SQL*Loader, import/export, create table as select, read-only snapshots, etc. should be provided in this module. The program should also check the availability of resources such as connections and space at the start. It should also be capable of extracting a variety of data formats such as Oracle tables, flat file dumps, Excel spreadsheets, etc.

- Load: This module selects only the data that gets converted into a dimension or a fact table measure of interest. It also tracks any structural changes in the objects that are extracted. Some structural changes, such as the deletion of a column from a flat file, cannot be tracked automatically and need proactive notification to the warehouse administrator so that appropriate changes can be made in the extraction routines.

- Scrub: All the measures selected for on-line analysis can be presented as having either shared dimensions or data source specific dimensions. With shared dimensions, more analytical capabilities are achieved, such as comparing two measures from two different data sources. In order to share dimensions, all the data coming from different data sources should conform to the master shared dimension values. If any anomalies exist, they should be correctly pointed out. All these data cleanup activities are done in this module. Dimensions with multiple hierarchies are also created in this module. Generally, dimensions are created from code tables in operational systems, which are static in nature.

- Base Fact Creation: Every measure can be represented as a fact table surrounded by the dimension tables created in the scrub module. Such fact tables, with values at the lowest levels of all the dimensions, are created in this module and are called base fact tables. If many measures have the same dimensionality, all of them can be represented in a single base fact table. Generally, base fact tables are created from transactional tables in operational systems, which are dynamic in nature. Objects are represented as a star schema in this module.

- Rollup: Once the base fact table for each measure is ready, it needs to be rolled up at every level of all the dimensional hierarchies associated with that measure. This is the most important step in warehouse processing. As the data values are summarized at higher levels of the hierarchies, the data volume can either shrink or explode depending on the data sparsity for a given measure. This step is implemented using a typical warehousing query consisting of a GROUP BY clause on UNION ALL tables. The UNION ALL tables generally represent a partition view created on a shared dimension (most commonly time) with a constraint. The constraint represents the granularity of the UNION ALL tables (generally one week or one month for the time dimension). Recently added Oracle Server features such as hash join, bit-mapped indexes and the parallel-aware optimizer all help execute such queries very efficiently. Mostly, rollup needs to be carried out on refresh cycle data, but if dimensional hierarchies change, rollup generally works on a large amount of data stored in the warehouse. This processing, and the problem of finding only the relevant data affected by a particular hierarchy change, pose major challenges.

- Metadata: As all the above modules go into action, the data needed for their own automation and smooth functioning has to be kept. A separate repository is created for this data. It is a good idea to create all the repository objects in a separate schema. The objects are initially created and maintained by a separate application which is accessible only to the warehouse administrator. During warehouse operations, various modules may modify repository objects to maintain various process states, statistics, etc.
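The rollup query pattern described above (a GROUP BY over UNION ALL tables, each constrained to one slice of time) can be tried out on a small scale. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for Oracle, so it imitates only the shape of a partition view and none of the Oracle optimizer features mentioned in the text. Table names, column names and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Time dimension: each week rolls up to a month.
    CREATE TABLE time_dim (week TEXT PRIMARY KEY, month TEXT);

    -- One base fact table per month; the CHECK constraint states which
    -- weeks the table may hold (the granularity of the partition).
    CREATE TABLE sales_m01 (
        week TEXT CHECK (week BETWEEN 'W01' AND 'W04'),
        product TEXT, amount REAL);
    CREATE TABLE sales_m02 (
        week TEXT CHECK (week BETWEEN 'W05' AND 'W08'),
        product TEXT, amount REAL);

    -- The UNION ALL view presents the partitions as one base fact table.
    CREATE VIEW sales_base AS
        SELECT week, product, amount FROM sales_m01
        UNION ALL
        SELECT week, product, amount FROM sales_m02;
""")

conn.executemany(
    "INSERT INTO time_dim VALUES (?, ?)",
    [("W0%d" % i, "1997-01" if i <= 4 else "1997-02") for i in range(1, 9)],
)
conn.executemany(
    "INSERT INTO sales_m01 VALUES (?, ?, ?)",
    [("W01", "p1", 10), ("W02", "p1", 5), ("W03", "p2", 7)],
)
conn.executemany(
    "INSERT INTO sales_m02 VALUES (?, ?, ?)",
    [("W05", "p1", 4), ("W06", "p2", 6), ("W06", "p2", 1)],
)

# Roll the week-level facts up to the month level of the time hierarchy.
monthly = conn.execute("""
    SELECT t.month, s.product, SUM(s.amount)
    FROM sales_base s JOIN time_dim t ON t.week = s.week
    GROUP BY t.month, s.product
    ORDER BY t.month, s.product
""").fetchall()

for row in monthly:
    print(row)
```

When only the refresh cycle data has changed, the same query can be pointed at the newest partition table instead of the whole view, which keeps routine rollups small. A hierarchy change, by contrast, forces the query back over all partitions.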

The choice of programming language can become an issue in the development of the system because a variety of tasks are carried out at the database and operating system levels. OraTcl is a language that combines the Tcl shell with access to Oracle. It is shareware from which Oracle SQL scripts and anonymous PL/SQL blocks can be called. Apart from handling all the procedural constructs, OraTcl can also execute C programs.

Oracle Enterprise Manager (OEM) can be a great vehicle for implementing warehouse systems. Apart from regular database management, OEM can call and execute Tcl programs. It has all the scheduling capabilities needed in an operational warehouse environment, and it has event handling capabilities to process errors, alerts and warnings during job execution.


The metadata management and other warehouse tools on the market, besides being very costly, concentrate more on data feeds to warehouses and lack serious functionality in the areas of data scrubbing and rollup. The warehouse metadata interface can be developed using the Tcl toolkit (Tcl/Tk), which acts as an OraTcl extension and can be merged with the basic OEM interface. The OEM tool can thus be used for both regular operational tasks and metadata management.

6.3 Program Management - Express Side


6.3.1. Express Design Goals

- User response times: Response times for various analytical tasks should be known up front so that the multidimensional database can be designed to optimize certain tasks.
- Data loading from fact tables: The physical design of the database should be such that measures and changed dimensions can be loaded into the Express database within an acceptable refresh cycle time. Loading times for simple dimension value changes and for more time-consuming hierarchical changes should be acceptable to the end users.

6.3.2. Program Modules

With the above design goals in mind, program modules as shown in Figure 2 can be established. A brief discussion of each module follows:

- DBA: This module defines all the dimensions, measures and supporting data structures in the Express database. Express 4GL scripts are used to create and maintain these objects. Since the Express front-end application tools are distributed in nature, all the data objects in the applications are physically present on the server side.

- Security: Because the data used for strategic analysis is used by different groups, it is necessary to provide security features in the end user interface so that the various functional groups of users can look only at the data in their own area.

- Bridge: This is a part of the configuration and setup of the new tools in the Express family called Relational Access Administrator (RAA) and Relational Access Manager (RAM). They work from their own repositories in the Oracle and Express databases, and manage the data load from the Oracle database (relational format) to the Express database (multi-dimensional format) as well as on-line reach-through for the relational data that is not loaded in the multi-dimensional format.

- Presentation: There are two main choices for tool selection in this area: Oracle Sales Analyzer (OSA) and Oracle Express Objects (OEO). OSA is a canned read-only application with predefined objects and many customization options. This tool can get the overall application ready in a short time and can satisfy the majority of analytical requirements. OEO is a development environment in which customized read-write applications with user-defined objects can be developed. Both tools run as distributed applications with the Express database.
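The reach-through behaviour attributed to the Bridge module above can be pictured with a toy lookup that answers from loaded cells when it can and falls back to a relational query otherwise. This Python sketch is not the RAA/RAM interface: the class `Cube`, its methods, and the `sales` table are hypothetical, and `sqlite3` merely stands in for the Oracle database.

```python
import sqlite3


class Cube:
    """Cells loaded in memory, with relational reach-through for the rest."""

    def __init__(self, conn, loaded):
        self.conn = conn
        self.loaded = dict(loaded)      # the "physical cube" portion
        self.reach_through_calls = 0

    def get(self, month, product):
        key = (month, product)
        if key in self.loaded:
            return self.loaded[key]
        # Not loaded: reach through to the relational side.
        self.reach_through_calls += 1
        row = self.conn.execute(
            "SELECT SUM(amount) FROM sales WHERE month = ? AND product = ?",
            key,
        ).fetchone()
        return row[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("1997-01", "p1", 10), ("1997-01", "p1", 5),
     ("1997-02", "p1", 4), ("1997-02", "p1", 2)],
)

# Only January has been loaded into the multi-dimensional side.
cube = Cube(conn, {("1997-01", "p1"): 15.0})

january = cube.get("1997-01", "p1")    # served from the loaded cells
february = cube.get("1997-02", "p1")   # served by reach-through
print(january, february, cube.reach_through_calls)
```

The trade-off is the one noted in the data management discussion: loaded cells answer quickly but cost load time and space, while reach-through cells cost nothing to load but pay a query on every access.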


GLOSSARY
Multi-dimensional Database: A database where data objects are represented as business measurements. The measurements are taken across multiple dimensions. Generally, this type of database is used to perform historical data analysis.

OLAP: On-Line Analytical Processing (OLAP) is a type of information processing application used for multidimensional analysis, such as trending, exceptions, patterns, ranking, etc. OLAP tools working against a multidimensional database provide this functionality.

Measure (Data Cube): A fact of business interest that can be measured across multiple other business objects. The measurements are generally summarizations, rankings, averages, counts, etc. Data cubes are either physically present in the multi-dimensional database (physical cubes) or exist in structural form with no data (virtual cubes). The data is reached through from the relational side when a virtual cube is accessed by end users. Examples of measures are sales volume, customer satisfaction counts, etc.

Dimension: Dimensions are the business objects about which business measures are maintained. Examples of dimensions are time, product, customer, etc.

Dimension Levels and Hierarchies: Dimensions are stored in a hierarchical form with multiple levels. One dimension can have multiple hierarchies and each hierarchy can have multiple levels. For example, the time dimension can have levels such as weeks, months, quarters and years; the time dimension can also have two hierarchies such as calendar year and fiscal year.

Sparsity: A data cube for certain measures may not have measured values for all possible combinations of the cube's dimensions. Such values are represented as "no value" cells in a data cube. The ratio of such "no value" cells to the total number of cells is called sparsity. A data cube with more dimensions can have more sparsity and may require a large amount of disk space during rollup.

Conjoint Dimension: If a combination of multiple dimensions produces a sparse cube, then only the valid combinations of these dimension values (those that produce no "no value" cells) can be stored in a common dimension called a conjoint. Conjoint dimensions create 100% dense data cubes, saving disk space, but they can be cumbersome to maintain, especially with changing hierarchies and a large number of values.

Star Schema: A multi-dimensional data model can be represented as a central fact table connected to multiple dimension tables around it. This structure resembles a star and is called a star schema.

DCE: Distributed Computing Environment (DCE) is a standard used to write distributed applications in open systems so that applications can be used in heterogeneous operating system and network environments.

Metadata: Data about data, generally stored in a separate repository. It is mandatory to keep this data consistent so that overall warehouse system operations run smoothly. Generally, metadata is a set of complex structures which keep information about data sources, various process states, error handling, data statistics, etc.
