You are on page 1of 42

OLAP

(on line analytical


processing)
Three-Tier Decision Support Systems
 Warehouse database server
 Almost always a relational DBMS, rarely flat files
 OLAP servers
 Relational OLAP (ROLAP): extended relational DBMS that maps
operations on multidimensional data to standard relational
operators
 Multidimensional OLAP (MOLAP): special-purpose server that
directly implements multidimensional data and operations
 Clients
 Query and reporting tools
 Analysis tools
 Data mining tools

CS 336 2
The Complete Decision Support
System
Information Sources Data Warehouse OLAP Servers Clients
Server (Tier 2) (Tier 3)
(Tier 1)
e.g., MOLAP
Semistructured Analysis
Sources
Data
Warehouse serve

extract Query/Reporting
transform
load serve
refresh
etc. e.g., ROLAP
Operational
DB’s Data Mining
serve

Data Marts
CS 336 3
Approaches to OLAP Servers

 Relational DBMS as Warehouse Servers


 Two possibilities for OLAP servers
 (1) Relational OLAP (ROLAP)
 Relational and specialized relational DBMS to
store and manage warehouse data
 OLAP middleware to support missing pieces
 (2) Multidimensional OLAP (MOLAP)
 Array-based storage structures
 Direct access to array data structures

CS 336 4
OLAP Server: Query Engine
Requirements

 Aggregates (maintenance and querying)


 Decide what to precompute and when
 Query language to support
multidimensional operations
 Standard SQL falls short
 Scalable query processing
 Data intensive and data selective queries

CS 336 5
OLAP for Decision Support
 Support ad-hoc querying for business analyst
 Think in terms of spreadsheets
 View sales data by geography, time, or product
 Extend spreadsheet analysis model to work with
warehouse data
 Large data sets
 Semantically enriched to understand business terms
 Combine interactive queries with reporting functions
 Multidimensional view of data is the foundation
of OLAP
 Data model, operations, etc.
CS 336 6
Warehouse Models & Operators
 Data Models
 relations
 stars & snowflakes
 cubes
 Operators
 slice & dice
 roll-up, drill down
 pivoting
 other
CS 336 7
Multi-Dimensional Data
 Measures - numerical data being tracked
 Dimensions - business parameters that define
a transaction
 Example: Analyst may want to viewsales data
(measure) bygeography , bytime , and by
product (dimensions)
 Dimensional modeling is a technique for
structuring data around the business
concepts
 ER models describe “entities” and
“relationships” vs dimensional models that
describe “measures” and “dimensions”
CS 336 8
The Multi-Dimensional Model
“Sales by product line over the past six months”
“Sales by store between 1990 and 1995”
Store Info Key columns joining fact table
to dimension tables Numerical Measures

Prod Code Time Code Store Code Sales Qty

Fact table for


Product Info
measures

Dimension tables Time Info

...
CS 336 9
Dimensional Modeling

 Dimensions are organized into hierarchies


 E.g., Time dimension: days  weeks  quarters
 E.g., Product dimension: product  product line 
brand
 Dimensions have attributes

CS 336 10
Dimension Hierarchies
Store Dimension Product Dimension

Total Total

Region Manufacturer

District Brand

Stores Products

CS 336 11
An expanded view of the model shows three
dimensions: Time, Store and Product.
Attribute hierarchies are vertical
relationships, while extended attribute
characteristics are diagonal.

CS 336 12
• In the time dimension,
                                                                       
• a given date is further described by its extended attributes "current
flag," "sequence" and "day of the week."
• Extended attribute characteristics have no impact on granularity.
The fact that February 4, 1996 is on a Sunday has no effect on the
fact that we collect sales by day. In practice, though. we may wish
to compare Sunday sales to other days. We can do this by
constraining our query on the extended attribute characteristic "day
of the week" without having to gather any additional information.

• The store dimension includes an extended attribute


characteristic at a higher level: each region has a Regional
Manager.

• Attribute hierarchies imply aggregation of data: stores roll up


into districts; districts into regions.

CS 336 13
ROLAP: Dimensional Modeling Using
Relational DBMS
 Special schema design:star, snowflake
 Special indexes: bitmap, multi-table join
 Special tuning: maximize query throughput
 Proven technology (relational model, DBMS),
tend to outperform specialized MDDB
especially on large data sets
 Products
 IBM DB2, Oracle, Sybase IQ, RedBrick, Informix

CS 336 14
MOLAP: Dimensional Modeling
Using the Multi Dimensional Model

 MDDB: a special-purpose data model


 Facts stored in multi-dimensional arrays
 Dimensions used to index array
 Sometimes on top of relational DB
 Products
 Pilot, Hyperion, Gentia, Express

CS 336 15
Star Schema (in RDBMS)

CS 336 16
Star Schema Example

CS 336 17
Star
Schema
with
Sample
Data

CS 336 18
The “Classic” Star Schema

 A single fact table, with


detail and summary data
 Fact table primary key
has only one key column
per dimension
 Each key is generated
 Each dimension is a
single table, highly
denormalized
Benefits: Easy to understand, easy to define hierarchies, reduces # of physical joins,
low maintenance, very simple metadata
Drawbacks: Summary data in the fact table yields poorer performance for summary
levels, huge dimension tables a problem
CS 336 19
The “Classic” Star Schema
 The fact table may also contain partially
consolidated data, such as sales dollars for a
region, for a given product for a given time period.
 Confusion is possible if ALL consolidated data isn't
included. For example, if data is consolidated only
to the district level, a query for regional information
will bring back no records and, hence, report no
sales activity. For this reason, simple stars MUST
contain either
 ALL of the combinations of aggregated data or
 At least views of every combination

CS 336 20
One approach is to create a multi-part key, identifying each record by the combination
of store/district/region. Using compound keys in a dimension table can cause
problems:
• It requires three separate metadata definitions to define a single relationship,
which adds to complexity in the design and sluggishness in performance.
• Since the fact table must carry all three keys as part of its primary key, addition
or deletion of levels in the hierarchy (such as the addition of "territory" between
store and district) will require physical modification of the fact table, a time-
consuming process that limits flexibility
• Carrying all of the segments of the compound dimensional key in the fact table
increases the size of the crucial fact table index, a real determinant to both
CS 336 performance and scalability. 21
                                                                          
One alternative to compound keys is to
concatenate the keys into a single key. While this
approach solves the first two problems with
compound keys (extra metadata and rigidity in the
fact table), the size of the key is still a problem.
Also, as in the example above, dealing with nulls
can be confusing.
CS 336 22
                                                                          
Instead of "meaningful" keys generate the smallest possible key that will insure
uniqueness of each record. Integers are the most efficient in most cases.
Note that the meaningful keys do not have to disappear; they may be shifted to non-key
attribute columns. If, in fact, these attributes are used frequently in queries (where
region_description is "North"), the columns can still be indexed, even if they aren't used as the
key.
The use of generated keys is preferred because:
• It allows for the highest level of flexibility of metadata
• Low maintenance as the data warehouse matures
CS 336 • Highest possible performance 23
The “Classic” Star Schema

The biggest drawback: dimension


tables must carry alevel indicator for
every record and every query must
use it. In the example below, without
the level constraint, keys for all stores
in the NORTH region, including
aggregates for region and district will
be pulled from the fact table, resulting
in error.

Example :
Select A.STORE_KEY, A.PERIOD_KEY, A.dollars from Fact_Table A Level is needed
where A.STORE_KEY in (select STORE_KEY whenever aggregates
from Store_Dimension B are stored with detail
where region = “North” and Level = 2) facts.
andetc...

CS 336 24
The “Level” Problem
 Level is a problem because it causes potential for
error. If the query builder, human or program, forgets
about it, perfectly reasonable looking WRONG
answers can occur.
 One alternative: the FACT CONSTELLATION model
summary
( tables )

The biggest drawback of the level indicator is that it limits flexibility (we may not know all of
the levels in the attribute hierarchies at first). By limiting ourselves to only certain levels, we
force a physical structure that may change, resulting in higher maintenance costs and more
downtime.
The level concept is a useful tool for very controlled data warehouses, that is, those that either
have no ad hoc users or at least only those ad hoc users who are knowledgably about the
database. In particular, when the results of queries are pre-formatted reports or extracts to
smaller systems, such as data marts, the drawbacks of the level indicator are not so evident.

CS 336 25
The “Fact Constellation” Schema

District Fact Table


Region Fact Table
District_ID
PRODUCT_KEY Region_ID
PRODUCT_KEY
PERIOD_KEY
PERIOD_KEY
Dollars
Dollars
Units Units
Price Price

CS 336 26
The chart above is composed of all of the tables from the Classic Star, plus
aggregated fact (summary) tables. For example, the Store dimension is
formed of a hierarchy of store-> district -> region.
The District fact table contains ONLY data aggregated by district, therefore
there are no records in the table with STORE_KEY matching any record for
the Store dimension at the store level. Therefore, when we scan the Store
dimension table, and select keys that have district = "Texas," they will only
match STORE_KEY in the District fact table when the record is aggregated
for stores in the Texas district. No double (or triple, etc.) counting is
possible and the Level indicator is not needed.
These aggregated fact tables can get very complicated, though. For
example, we need a District and Region fact table, but what level of detail
will they contain about the product dimension? All of the following:
STORE/PROD DISTRICT/PROD REGION/PROD
STORE/BRAND DISTRICT/BRAND REGION/BRAND
STORE/MANUF DISTRICT/MANUF REGION/MANUF
And these are just the combinations from two dimensions!

CS 336 27
The “Fact Constellation” Schema

In the Fact Constellations,


aggregate tables are created
separately from the detail,
therefore,
it is impossible to pick up, for
example, Store detail when
querying
the District Fact Table.

Major Advantage: No need for the “Level” indicator in the dimension tables,
since no aggregated data is stored with lower-level detail

Disadvantage: Dimension tables are still very large in some cases, which can slow
performance; front-end must be able to detect existence of aggregate facts, which
requires more extensive metadata
CS 336 28
Another Alternative to “Level”

 Fact Constellation is a good alternative to the Star, but when


dimensions have very high cardinality, the sub-selects in the
dimension tables can be a source of delay.
 Another drawback is that multiple SQL statements may be
needed to answer a single question, for example: measure the
percent to total of a district to its region. Separate queries are
needed to both the district and region fact tables, then some
complex "stitching" together of the results is needed.
 Once again, it is easy to see that even with its disadvantages, the
Classic Star enjoys the benefit of simplicity over its alternatives.
 An alternative is to normalize the dimension tables by attribute
level, with each smaller dimension table pointing to an
appropriate aggregated fact table, the “Snowflake Schema” ...

CS 336 29
The “Snowflake” Schema
Store Dimension
STORE KEY District_ID Region_ID
Store Description District Desc. Region Desc.
City Region_ID Regional Mgr.
State
District ID
Region_ID

Store Fact Table District Fact Table RegionFact Table


Region_ID
STORE KEY District_ID
PRODUCT_KEY
PRODUCT_KEY PERIOD_KEY
PRODUCT KEY PERIOD_KEY Dollars
PERIOD KEY Dollars Units
Units Price
Dollars Price
Units
Price

CS 336 30
Store Dimension
The “Snowflake” Schema
STORE KEY District_ID Region_ID
Store Description District Desc. Region Desc.
City Region_ID Regional Mgr.
State
District ID
District Desc.
Region_ID
Region Desc.
Regional Mgr. Store Fact Table District Fact Table RegionFact Table
Region_ID
STORE KEY District_ID
PRODUCT_KEY
PRODUCT_KEY PERIOD_KEY
PRODUCT KEY PERIOD_KEY Dollars
PERIOD KEY Dollars Units
Units Price
Dollars Price
Units
Price

The original Store Dimension table, completely de-normalized, is kept


intact, since certain queries can benefit by its all-encompassing
CS 336 content. 31
The “Snowflake” Schema
 No LEVEL in dimension tables
 Dimension tables are normalized by
decomposing at the attribute level
 Each dimension table has one key for
each level of the dimensions
hierarchy
 The lowest level key joins the
dimension table to both the fact table
and the lower level attribute table

How does it work? The best way is for the query to be built by
understanding which summary levels exist, and finding the proper
snowflaked attribute tables, constraining there for keys, then selecting
from the fact table.
CS 336 32
Notice how the Store dimension table generates subsets of records.
First, all records from the table (where level = "District" in the Star) are
extracted, and only those attributes that refer to that level (District
Description, for example) and the keys of the parent hierarchy (Region_ID) are
included in the table. Though the tables are subsets, it is absolutely critical
that column names are the same throughout the schema.

The diagram above is a partial schema - it only


shows the "snowflaking" of one dimension. In
fact, the product and time dimensions would be
similarly decomposed as follows:
Product - product -> brand -> manufacturer (color
and size are extended attribute characteristics of
the attribute "product," not part of the attribute
hierarchy)
Time - day -> month -> quarter -> year
CS 336 33
The “Snowflake” Schema
 Additional features: The original
Store Dimension table, completely
de-normalized, is kept intact, since
certain queries can benefit by its all-
encompassing content.
 In practice, start with a Star
Schema and create the “snowflakes”
with queries. This eliminates the
need to create separate extracts for
each table, and referential integrity
is inherited from the dimension
table.

Advantage: Best performance when queries involve aggregation

Disadvantage: Complicated maintenance and metadata, explosion in the num


of tables in the database
CS 336 34
Aggregates
 Add up amounts for day 1
 In SQL: SELECT sum(amt)
FROM sale
WHERE date = 1

sale prodId storeId date amt


p1 s1 1 12
p2 s1 1 11
p1
p2
s3
s2
1
1
50
8
81
p1 s1 2 44
p1 s2 2 4

CS 336 35
Aggregates
 Add up amounts by day
 In SQL: SELECT date, sum(amt)
FROM sale
GROUP BY date

sale prodId storeId date amt


p1 s1 1 12
ans d a te sum
p2 s1 1 11
1 81
p1 s3 1 50
p2 s2 1 8 2 48
p1 s1 2 44
p1 s2 2 4

CS 336 36
Aggregates (Another Example)
 Add up amounts by day, product
 In SQL: SELECT date, sum(amt)
FROM sale
GROUP BY date, prodId

sale prodId storeId date amt


s a le p r o d Id d a te am t
p1 s1 1 12
p2 s1 1 11 p1 1 62
p1 s3 1 50 p2 1 19
p2 s2 1 8
p1 2 48
p1 s1 2 44
p1 s2 2 4

rollup
drill-down

CS 336 37
Aggregates
 Operators: sum, count, max, min,
median, ave
 “Having” clause
 In order to retrieve the values of these functions only for
groups that satisfy certain conditions, we use HAVING.
SELECT date, sum(amt)
FROM sale
GROUP BY date
HAVING amt > 20;

 Using dimension hierarchy


 average by region (within store)
 maximum by month (within date)

CS 336 38
ROLAP aggregation performance
D3 Stores 100
D2 Products 1000
D1 Days 10,000
It’s given that on average, each store sells 5,000 different products during the entire period.

Sparsity
Cube size: 100 * 1,000 * 10,000 = 1,000,000,000
Actual data: 100 * 5,000 = 500,000
Sparsity = 0.05%

Relational calculation

Data
Length of Fact Table r = 500,000 records
Record size R = 100 Bytes
Block size B = 1024 Bytes
Block factor bfr = 1024/100 = 10
Number of data blocks b = 50,000

Index
Compound key V = 27 bytes
P = 6 Bytes
Ri = 33 Bytes
Index block factor bfri = 1024/33 = 30 entries/block
How many blocks per index bi = ri / bfri = 50,000/30 = 1667 blocks

CS 336 39
Single Retrieval [store, product, day ]
Binary search for index log2 1667 =
11 disk accesses
+ one data access total = 12
disk accesses
12 * 10msec = 120 msec = 0.12 sec

Join

In order to see the total sales for a particular month for a particular
product:
SELECT Sum ( SalesFact.SalesDollars) AS SumOfSalesDollars
FROM TimeDimension JOIN (ProductDimension JOIN
{StoreDimension JOIN SalesFact ON StoreDimension.StoreID = SalesFact.StoreID}
ON Product Dimension.ProductID = SalesFact.ProductID)
ON TimeDimension.TimeID = SalesFact.TimeID
WHERE StoreDimension.StoreName = ‘Dizengoff’ AND ProductDimension.ProductName =‘White
Cheese’ AND TimeDimension.Month=3
AND TimeDimension.Year=1999

CS 336 40
To perform a JOIN needed:
Access (FT + StoreDim. + Result1Table + ProductDim. + Result2Table + TimeDim.)

For simplicity let’s assume only one FT access.

Monthly aggregation [store, product, month ]


Making search for 1st month, 2nd , etc.
For each month 12 * 30 * 1000 * 100 = 36,000,000 accesses
But there are 10 days per block, therefore we need only one access for every 10 days.
It makes it 3.6 Million accesses per month.

Number of months 10,000/30 = 333

Total 333 * 3,600,000 = 1,200,000,000 accesses


= 12,000,000 sec = 3,300 hours!

CS 336 41
Advantages of ROLAP
Dimensional Modeling
 Define complex, multi-dimensional data with
simple model
 Standard SQL tools
 Allows the data warehouse to evolve with
relatively low maintenance
 HOWEVER! Star schema and relational DBMS
are not the magic solution
 Query optimization is still problematic

CS 336 42

You might also like