
High Performance Data Warehouse Design and Construction

OLAP Implementation Techniques

Objectives
• Provide a robust framework for OLAP techniques for decision support.
• Characterize tradeoffs in performance, scalability, flexibility, and complexity associated with various OLAP implementation techniques.
• Examine tradeoffs in aggregate construction.

Topics
• OLAP framework for decision support.
• Physical implementation techniques: MOLAP, ROLAP, HOLAP, and DOLAP.
• Star schema design.

Where Does OLAP Fit In?

OLAP = On-line analytical processing.

• OLAP is a characterization of applications, not a database design technique.
• Idea is to provide very fast response time in order to facilitate iterative decision-making.
• Analytical processing requires access to complex aggregations (as opposed to record-level access).
Where Does OLAP Fit In?
Information is conceptually viewed as “cubes” for simplifying the way in which users access, view, and analyze data.

• Quantitative values are known as “facts” or “measures.”
– e.g., sales $, units sold, etc.
• Descriptive categories are known as “dimensions.”
– e.g., geography, time, product, scenario (budget or actual), etc.

Dimensions are often organized in hierarchies that represent levels of detail in the data (e.g., UPC, SKU, product subcategory, product category, etc.).
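
To make the vocabulary concrete, here is a minimal Python sketch of facts, dimensions, and one hierarchy level. The sample rows, the SKU-to-category mapping, and the helper name `roll_up` are illustrative assumptions and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical fact rows: dimension values (sku, city, month) plus measures.
facts = [
    {"sku": "RC-001", "city": "Boston", "month": "2000-01", "sales": 120.0, "units": 3},
    {"sku": "RC-002", "city": "Boston", "month": "2000-01", "sales": 80.0, "units": 2},
    {"sku": "UM-001", "city": "Chicago", "month": "2000-01", "sales": 30.0, "units": 3},
]

# One level of a product hierarchy: SKU -> product category (assumed mapping).
sku_to_category = {"RC-001": "raincoats", "RC-002": "raincoats", "UM-001": "umbrellas"}


def roll_up(rows, key_fn, measure):
    """Sum one measure over the cells produced by key_fn."""
    cube = defaultdict(float)
    for row in rows:
        cube[key_fn(row)] += row[measure]
    return dict(cube)


# Cell keys are tuples of dimension values; here: (category, city).
sales_by_category_city = roll_up(
    facts, lambda r: (sku_to_category[r["sku"]], r["city"]), "sales"
)
print(sales_by_category_city)
```

Rolling up to a coarser hierarchy level only changes the key function; the measure and the aggregation stay the same.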
OLAP FASMI Test
Fast: Delivers information to the user at a fairly constant
rate. Most queries should be delivered to the user in five
seconds or less.
Analysis: Performs basic numerical and statistical analysis
of the data, pre-defined by an application developer or
defined ad hoc by the user.
Shared: Implements the security requirements necessary for
sharing potentially confidential data across a large user
population.
Multi-dimensional: The essential characteristic of OLAP.
Information: Accesses all the data and information necessary
and relevant for the application, wherever it may reside
and not limited by volume.
...from the OLAP Report by Pendse and Creeth.
OLAP Implementations
MOLAP: OLAP implemented with a multi-dimensional database.

ROLAP: OLAP implemented with a relational database.

HOLAP: OLAP implemented with a hybrid of multi-dimensional and relational database technologies.

DOLAP: OLAP implemented for desktop decision support environments.
MOLAP Implementations
OLAP has historically been implemented through use of multi-dimensional databases (MDDs).
• Dimensions are key business factors for analysis:
– geographies (zip, state, region,...)
– products (item, product category, product department,...)
– dates (day, week, month, quarter, year,...)
• Very high performance via fast look-up into “cube” data structure to retrieve pre-calculated results.
• “Cube” data structures allow pre-calculation of aggregate results for each possible combination of dimensional values.
• Use of application programming interface (API) for access via front-end tools.
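
The pre-calculation idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular MDD product stores its cubes: the sample rows, the `ALL` marker, and the function name `build_cube` are all hypothetical.

```python
from collections import defaultdict
from itertools import product

ALL = "*"  # marker for "all values" of a dimension (an assumed convention)

rows = [
    # (state, category, year, sales)
    ("MA", "raincoats", 2000, 100.0),
    ("MA", "umbrellas", 2000, 40.0),
    ("CA", "raincoats", 2000, 60.0),
    ("CA", "raincoats", 2001, 75.0),
]


def build_cube(rows):
    """Pre-calculate the sum for every combination of dimension values,
    including the ALL roll-up for each dimension."""
    cube = defaultdict(float)
    for *dims, sales in rows:
        # Each dimension contributes either its own value or ALL.
        for cell in product(*[(d, ALL) for d in dims]):
            cube[cell] += sales
    return dict(cube)


cube = build_cube(rows)

# Queries are now dictionary look-ups, with no scan of the detail rows.
print(cube[("MA", ALL, ALL)])          # sales in MA
print(cube[(ALL, "raincoats", 2000)])  # raincoat sales in 2000
print(cube[(ALL, ALL, ALL)])           # grand total
```

Each detail row updates 2^3 = 8 cells here, which hints at why load cost and storage grow quickly as dimensions are added.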
MOLAP Implementations
Need to consider both maintenance and storage implications when designing a strategy for when to build cubes.
• Maintenance Considerations: Every data item received into MDD must be aggregated into every cube (assuming “to-date” summaries are maintained).
• Storage Considerations: Although cubes get much smaller (i.e., more dense) as dimensions get less detailed (e.g., year vs. day), storage implications for building hundreds of cubes can be significant.
MOLAP Implementations
• Typically outperform relational database technology because all answers are pre-computed into cubes (and overhead for accessing cubes is very low).
• Difficult to scale because of combinatorial explosion in the number and size of cubes when dimensions of significant cardinality are required.
• Going beyond tens (sometimes low hundreds) of thousands of entries in a single dimension will break the MOLAP model, because the pre-computed cube model does not work well when the cubes are very sparse in the population of individual cells.

See www.olapreport.com/DataExplosion.htm
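
A quick back-of-the-envelope calculation shows how fast the potential cell count grows and how sparse a cube becomes. All cardinalities and row counts below are hypothetical, chosen only to illustrate the arithmetic.

```python
from math import prod

# Hypothetical dimension cardinalities at the finest level of detail.
cardinalities = {"item": 50_000, "store": 400, "day": 730}
populated_cells = 4_000_000_000  # assumed number of non-empty detail cells

potential_cells = prod(cardinalities.values())
density = populated_cells / potential_cells
print(f"potential cells: {potential_cells:,}")
print(f"density: {density:.1%}")

# Adding one more high-cardinality dimension multiplies the potential cells.
cardinalities["customer"] = 1_000_000
potential_with_customer = prod(cardinalities.values())

# Upper bound on density: at best every detail row lands in its own cell.
density_with_customer = populated_cells / potential_with_customer
print(f"potential cells with customer: {potential_with_customer:,}")
print(f"density with customer: at most {density_with_customer:.2e}")
```

The number of populated cells is bounded by the data received, while the number of potential cells is the product of the dimension cardinalities, so density collapses as large dimensions are added.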
Virtual Cubes
Virtual cubes are used when there is a need to join information from two dissimilar cubes that share one or more common dimensions.
• Similar to a relational view; two (or more) cubes are linked along common dimension(s).
• Often used to save space by eliminating redundant storage of information.

Example: Build a list price cube that can be used to compute discounts given across many stores in a retail chain without redundant storage of the list price data through use of a virtual cube.
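
The list price example can be sketched as follows. The two dictionaries stand in for physical cubes, and every name and number is a hypothetical placeholder; a real MDD product would expose virtual cubes through its own interface.

```python
# Hypothetical physical cubes that share the item dimension.
list_price = {"RC-001": 50.0, "UM-001": 12.0}  # item -> list price (stored once)

# (store, item) -> (units sold, actual sales $)
sales = {
    ("boston-1", "RC-001"): (10, 450.0),
    ("boston-2", "RC-001"): (4, 200.0),
    ("boston-1", "UM-001"): (5, 54.0),
}


def virtual_discount_cube(sales, list_price):
    """Derive discount $ per (store, item) on demand, like a relational view.
    Nothing is materialized: the list price is looked up through the shared
    item dimension rather than being copied into every store's cells."""
    for (store, item), (units, actual) in sales.items():
        yield (store, item), units * list_price[item] - actual


discounts = dict(virtual_discount_cube(sales, list_price))
print(discounts)
```

The list price is stored once per item instead of once per store and item, which is the space saving the slide refers to.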
Partitioned Cubes

• One logical cube of data can be spread across multiple physical cubes on separate (or same) servers.
• The divide-and-conquer approach of partitioned cubes helps to mitigate the scalability limitations of a MOLAP environment.
• Ideal cube partitioning is completely invisible to end users.
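
A minimal sketch of the idea, assuming the logical cube is partitioned by year (any dimension could be used) and that each partition is an in-memory dictionary. The class and method names are hypothetical.

```python
class PartitionedCube:
    """One logical cube backed by several physical cubes, split by year.
    Partitioning by year is an assumed example; any dimension could be used."""

    def __init__(self):
        self._partitions = {}  # year -> {(store, item): sales}

    def add(self, year, store, item, sales):
        cells = self._partitions.setdefault(year, {})
        cells[(store, item)] = cells.get((store, item), 0.0) + sales

    def total(self, store, item, years=None):
        """Callers never name a partition; routing happens here."""
        selected = self._partitions if years is None else {
            y: self._partitions[y] for y in years if y in self._partitions
        }
        return sum(cells.get((store, item), 0.0) for cells in selected.values())


cube = PartitionedCube()
cube.add(2000, "boston-1", "RC-001", 100.0)
cube.add(2001, "boston-1", "RC-001", 150.0)
print(cube.total("boston-1", "RC-001"))          # spans both partitions
print(cube.total("boston-1", "RC-001", [2001]))  # touches one partition
```

The caller asks the logical cube a question and never sees which physical partitions were consulted, which is what "invisible to end users" means in practice.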
ROLAP Implementations

Advances in database technologies and front-end tools have begun to allow deployment of OLAP using ANSI SQL RDBMS implementations.
• ROLAP facilitates deployment of much larger dimension tables than MOLAP implementations.
• Front-end tools to facilitate GUI access to multi-dimensional analysis capabilities.
• Aggregate awareness allows exploitation of pre-built summary tables for some front-end tools.
Star schema designs are often used to facilitate OLAP against relational databases.
Simplified Third Normal Form
[Figure: entity-relationship diagram of a simplified third normal form retail schema. Tables are linked by one-to-many (1:M) relationships. Tables and columns recoverable from the diagram:]
• (ZONE, REGION)
• (ZIP, ZONE)
• zip_x_SMSA (ZIP, SMSA)
• zip_x_adi (ZIP, ADI)
• year, quarter, week: (QTR, YR), (WEEK, QTR), (DATE, WEEK)
• store (STORE #, ADDRESS, ZIP, ...)
• sale_header (RECEIPT #, STORE #, DATE, ...)
• date_x_store_x_weather (STORE #, DATE, WEATHER)
• sale_detail (ITEM #, RECEIPT #, ..., $)
• item_x_category (ITEM #, CATEGORY)
• item_x_mfctr (ITEM #, MFCTR)
• category_x_dept (CATEGORY, DEPT)
Simplified Star Schema
[Figure: simplified star schema. A central fact table joins many-to-one (M:1) to each dimension table.]
Fact Table: ITEM#, RECEIPT#, STORE#, DATE, ..., $
Geography Dimension Table: STORE#, ADDRESS, ZIP, ADI, SMSA, ZONE, REGION
Calendar Dimension Table: DATE, WEEK, QUARTER, YEAR, ...
Product Dimension Table: ITEM#, CATEGORY, DEPT, MFCTR, ...
Store x Date Dimensional Table: STORE#, DATE, WEATHER

A vastly simplified model ... may even summarize out receipt # .....
Simplified Star Schema

A vastly simplified physical data model!

Collapse dimensional hierarchies into a single table for each dimension and create a single fact table from the header and detail records:
• Fewer tables.
• Fewer joins to get results.
Star Schema for High Performance

Business question: How many $ in raincoats did I sell in the first week of January through stores in Boston?

Assume:
• 4 Billion rows in fact table.
• 20 different kinds (size, color, style) of raincoats (product category) out of 50,000 UPCs in store.
• 8 stores out of 400 are in BOSTON SMSA.
• 2 years of POS history in DBMS.

Star Schema for High Performance

Simple (poor performance) approach to query execution:

1. Join item table with filtering on raincoat product category (very selective) to fact table.
2. Join date table with filtering by week (next most selective) to result table.
3. Join store table with filtering on store to result table from step 2.
4. Aggregate.

Star Schema for High Performance
Advanced (better performance) approach to query execution:

1. Cartesian product join between dimensional tables.
* Result is 20 x 8 x 7 = 1,120 rows.

2. Use composite index on item:store:day into fact table for very selective access.
* Access roughly 0.000008 percent of the data in the fact table (a fraction of about 0.00000008)!

Sophisticated cost-based optimizers will figure this out.
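
The arithmetic behind this plan can be checked directly. The sketch below uses the figures from the assumptions slide and adds one assumption of its own: fact rows are spread evenly over the item x store x day combinations.

```python
# Figures from the slide assumptions.
fact_rows = 4_000_000_000
raincoat_upcs, total_upcs = 20, 50_000
boston_stores, total_stores = 8, 400
days_in_week, days_of_history = 7, 2 * 365

# Step 1: Cartesian product of the filtered dimension rows.
index_probes = raincoat_upcs * boston_stores * days_in_week

# Step 2: fraction of the fact table reachable through those probes,
# assuming rows are spread evenly over item x store x day.
selectivity = (
    (raincoat_upcs / total_upcs)
    * (boston_stores / total_stores)
    * (days_in_week / days_of_history)
)
expected_rows = fact_rows * selectivity

print(index_probes)          # 1120
print(f"{selectivity:.2e}")  # fraction of fact rows
print(round(expected_rows))  # rows actually touched
```

Under these assumptions the 1,120 index probes reach only a few hundred of the 4 billion fact rows.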

Forcing a Cartesian Product Join
• Add an additional “join_value” column in each dimensional table.
• Set join_value to same value in all rows of the dimensional tables.
• Add additional where clause predicates joining on this column between dimensional tables.

NOTE: This shouldn't be necessary with a “smart” optimizer.

Forcing a Cartesian Product Join
Sample code:

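-- Total sales for one trade style in one state over a date range.
select sum(d_sales_detail.sales_amt)
from d_sales_detail
,store
,item
,period
where d_sales_detail.store_id = store.store_id
and d_sales_detail.item_id = item.item_id
and d_sales_detail.day_dt = period.day_dt
and period.day_dt between '23-NOV-2000' and '24-DEC-2000'
and item.trade_style_cd = 'BARBIE'
and store.state_cd = 'CA'
-- Predicates on the constant join_value column link the dimension tables
-- to each other so that a Cartesian product join between them is considered.
and store.join_value = period.join_value
and store.join_value = item.join_value
and period.join_value = item.join_value
;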
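
To confirm that a query of this shape is well formed, it can be run against a toy database. The sketch below uses Python's built-in sqlite3 module with made-up rows and ISO-format dates in place of the Oracle-style literals; the column types are assumptions, and the aggregate is qualified with the fact table name `d_sales_detail`. SQLite is used only to check the result, not to demonstrate that its optimizer picks a Cartesian product plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table store  (store_id integer, state_cd text, join_value integer);
create table item   (item_id integer, trade_style_cd text, join_value integer);
create table period (day_dt text, join_value integer);
create table d_sales_detail (store_id integer, item_id integer, day_dt text, sales_amt real);

-- join_value holds the same constant in every dimension row.
insert into store  values (1, 'CA', 1), (2, 'MA', 1);
insert into item   values (10, 'BARBIE', 1), (11, 'OTHER', 1);
insert into period values ('2000-11-23', 1), ('2000-12-25', 1);
insert into d_sales_detail values
  (1, 10, '2000-11-23', 19.99),
  (1, 11, '2000-11-23', 5.00),
  (2, 10, '2000-11-23', 19.99),
  (1, 10, '2000-12-25', 19.99);
""")

query = """
select sum(d_sales_detail.sales_amt)
from d_sales_detail, store, item, period
where d_sales_detail.store_id = store.store_id
  and d_sales_detail.item_id = item.item_id
  and d_sales_detail.day_dt = period.day_dt
  and period.day_dt between '2000-11-23' and '2000-12-24'
  and item.trade_style_cd = 'BARBIE'
  and store.state_cd = 'CA'
  and store.join_value = period.join_value
  and store.join_value = item.join_value
  and period.join_value = item.join_value
"""
(total,) = conn.execute(query).fetchone()
print(total)
```

Only one sample row satisfies all three dimension filters, so the extra join_value predicates leave the answer unchanged; they only give the optimizer a join path between the dimension tables.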

Star Schema for High Performance

Problem: What if I want to know raincoat sales in first week of January regardless of store?
Answer: Performance advantage of composite index in traditional RDBMS is severely impaired!
• B-tree indexing techniques do not allow for flexibility in the use of dimensions for query purposes.
• Bit indexing (and variations thereof) often allows much more generality in achieving high performance from a star schema.
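
The flexibility of bit indexing can be illustrated with a toy bitmap index, using Python integers as bit sets. This is a conceptual sketch with hypothetical data, not a description of any vendor's index format.

```python
from collections import defaultdict

# Hypothetical fact rows: (item category, week, store, sales $).
fact = [
    ("raincoats", 1, "boston-1", 100.0),
    ("umbrellas", 1, "boston-1", 20.0),
    ("raincoats", 1, "chicago-1", 70.0),
    ("raincoats", 2, "boston-1", 90.0),
]


def build_bitmap_index(rows, column):
    """One bitmap (a Python int used as a bit set) per distinct value;
    bit i is set when row i has that value."""
    index = defaultdict(int)
    for i, row in enumerate(rows):
        index[row[column]] |= 1 << i
    return index


category_idx = build_bitmap_index(fact, 0)
week_idx = build_bitmap_index(fact, 1)
store_idx = build_bitmap_index(fact, 2)


def total_sales(bitmap):
    """Sum sales for the rows whose bits are set."""
    return sum(row[3] for i, row in enumerate(fact) if bitmap >> i & 1)


# Raincoats in week 1, regardless of store: AND two bitmaps, skip the third.
any_store = total_sales(category_idx["raincoats"] & week_idx[1])
# Adding the store predicate is just one more AND.
boston_only = total_sales(
    category_idx["raincoats"] & week_idx[1] & store_idx["boston-1"]
)
print(any_store, boston_only)
```

Because each dimension has its own index, any subset of dimensions can be combined at query time, whereas a composite B-tree index is most useful when its leading columns are constrained.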

Star Schema for High Performance
Bottom Line:
• It is not at all unusual to obtain an order of magnitude (or more) in performance advantage using a star schema with advanced indexing versus a more traditional relational database implementation.
• Despite what vendors may tell you, star schemas cannot be effectively implemented for all DSS business applications and/or data models.

ROLAP

• Relational OLAP often makes heavy use of summary tables to provide near instantaneous access for multi-dimensional queries.
• Foundation is usually star schema or snowflake database design.
• Allows OLAP with much larger data sets than multi-dimensional database (MDD) products using cube structures (MOLAP).

ROLAP

Number of summary tables can get very large if discipline is not enforced...

Assume a retail database with the following two dimensions on the fact table...
Calendar: Day, Week, Period, Quarter, Year, All Days
Geography: Store, Zone, District, Region, All Stores

ROLAP
Summary tables in a naive implementation require all combinations of the dimensions at each aggregation level...

            Store   Zone   District   Region   All Stores
All Days      13     19       24        28         30
Year           9     15       22        27         29
Quarter        6     11       18        23         26
Period         4      8       14        21         25
Week           2      5       10        17         20
Day            1      3        7        12         16

30 summary tables! ... Add in item, SKU, subcategory, category, and all items... now we are up to 150 pre-aggregates!
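
The counts follow from multiplying the number of levels in each dimension, which a short sketch can confirm. The level names are taken from the slides; the variable names are illustrative.

```python
from itertools import product
from math import prod

calendar = ["Day", "Week", "Period", "Quarter", "Year", "All Days"]
geography = ["Store", "Zone", "District", "Region", "All Stores"]
product_dim = ["Item", "SKU", "Subcategory", "Category", "All Items"]

# One summary table per (calendar level, geography level) pair.
two_dims = list(product(calendar, geography))
print(len(two_dims))

# Adding a third dimension multiplies the count by its number of levels.
three_dims = prod(len(levels) for levels in (calendar, geography, product_dim))
print(three_dims)
```

As on the slide, the count includes the finest combination (Day by Store), which is the detail level itself.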
ROLAP
Summary tables are more of a maintenance issue than a storage issue in most production implementations.
• Notice that summary tables get much smaller as dimensions get less detailed (e.g., year vs. day).
• Should plan for double the size of the unsummarized data for ROLAP summaries in most environments.
• Every detail record that is received into warehouse must aggregate into EVERY summary table (assuming "to-date" summaries are maintained).

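The maintenance cost can be made concrete with a small sketch in which each incoming detail record is applied to every summary table. The lookup tables, the four summary granularities, and the function name `load_detail` are hypothetical.

```python
from collections import defaultdict

# Hypothetical lookups from the finest level to coarser levels.
week_of = {"2000-01-03": "2000-W01", "2000-01-04": "2000-W01"}
zone_of = {"store-1": "zone-A", "store-2": "zone-A"}

# One "to-date" summary table per (calendar level, geography level) pair.
key_functions = {
    ("day", "store"): lambda day, store: (day, store),
    ("day", "zone"): lambda day, store: (day, zone_of[store]),
    ("week", "store"): lambda day, store: (week_of[day], store),
    ("week", "zone"): lambda day, store: (week_of[day], zone_of[store]),
}
summaries = {name: defaultdict(float) for name in key_functions}


def load_detail(day, store, sales):
    """Apply one incoming detail record to every summary table."""
    for name, key_fn in key_functions.items():
        summaries[name][key_fn(day, store)] += sales
    return len(key_functions)  # number of summary updates performed


updates = load_detail("2000-01-03", "store-1", 10.0)
updates += load_detail("2000-01-04", "store-2", 5.0)
print(updates)
print(summaries[("week", "zone")][("2000-W01", "zone-A")])
```

Two detail records trigger eight summary updates here; with 150 summary tables the same two records would trigger 300.
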
ROLAP
Warning: Do not assume that dimensions are always simple hierarchies.

Example: Items are not just category, subcategory, SKU, and atomic item.... what about trade styles or manufacturer?

Now we need summary tables along these lines as well...another 120 summary tables!

Calendar vs. accounting period vs. billing cycle can be even worse...

ROLAP
Many ROLAP products have devised ways to reduce the number of summary tables:

• Ability to build summaries on-the-fly as demanded by end-user applications.
• Ability to aggregate efficiently from subset of the summary tables.
• Tools exist in some products to assist DBAs in selecting the “best” aggregations to build.
• HOLAP (Hybrid OLAP) tools allow co-existence of pre-built cubes alongside relational OLAP structures.
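
Aggregating from a subset of the summary tables works for additive measures such as sums: a coarser aggregate can be derived from a finer stored summary instead of from the detail rows. A minimal sketch with hypothetical data and names:

```python
from collections import defaultdict

# A pre-built summary at (week, store) granularity (hypothetical data).
week_store_summary = {
    ("2000-W01", "store-1"): 100.0,
    ("2000-W01", "store-2"): 50.0,
    ("2000-W02", "store-1"): 70.0,
}
zone_of = {"store-1": "zone-A", "store-2": "zone-A"}  # assumed hierarchy


def derive_week_zone(summary):
    """Build the (week, zone) aggregate on the fly from the finer summary
    instead of storing it or rescanning the detail rows."""
    result = defaultdict(float)
    for (week, store), sales in summary.items():
        result[(week, zone_of[store])] += sales
    return dict(result)


week_zone = derive_week_zone(week_store_summary)
print(week_zone)
```

The (week, zone) table never has to be stored or maintained, at the cost of a small amount of work at query time.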

Intelligent Aggregation Selection

• Maximum performance boost implies lots of disk for every pre-calculation.
• Minimum performance boost implies no disk with zero pre-calculation.
• Strategy is to use meta data to heuristically determine optimum set of aggregates from which all other aggregates can be derived.
Aggregate Wizards
Fact Table Aggregates
• Enhance performance on common queries at coarser granularities.
• Save space to permit storing more history than possible with finer granularities.
• Take advantage of need to store other facts (with similar samples) at a particular granularity.
Aggregate Advice
• Coarser granularity decreases potential cardinality, but usually increases density (e.g., daily summary table is typically twice the size of weekly summary table, not seven times).
• Strongly consider omitting candidate aggregates where expected cardinality is more than 10% of that of the next finer granularity stored.
• Keep the detail for drill down, even if you deploy aggregates for performance.
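
The 10% guideline can be applied mechanically to a list of candidate aggregates. The row counts below are hypothetical, and the function name `keep_aggregate` is an illustrative assumption.

```python
def keep_aggregate(candidate_rows, next_finer_rows, threshold=0.10):
    """Return True when the candidate is small enough, relative to the next
    finer granularity stored, to be worth building (the 10% guideline)."""
    return candidate_rows <= threshold * next_finer_rows


# Hypothetical row counts, ordered from finest to coarsest granularity.
levels = [("day", 4_000_000_000), ("week", 2_000_000_000),
          ("quarter", 300_000_000), ("year", 100_000_000)]

stored = [levels[0]]  # the detail is always kept for drill-down
for name, rows in levels[1:]:
    # Compare against the most recent granularity actually stored.
    if keep_aggregate(rows, stored[-1][1]):
        stored.append((name, rows))

stored_names = [name for name, _ in stored]
print(stored_names)
```

With these numbers the weekly table is half the size of the daily detail and is skipped, the quarterly table is kept, and the yearly table is skipped because it is a third of the size of the quarterly table it would be derived from.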
Bottom Line
• There are many implementation techniques for delivery of an OLAP environment.

• Must fully consider the performance, scalability, complexity, and flexibility characteristics when deciding between MOLAP, ROLAP, and HOLAP.

• Understand your tools and RDBMS!

