CS 336 2
The Complete Decision Support System
[Figure: three-tier architecture. Information sources (operational DBs, semistructured sources) are extracted, transformed, loaded, and refreshed into the data warehouse server (Tier 1: data warehouse and data marts), which serves the OLAP servers (Tier 2: e.g., MOLAP, ROLAP), which serve the clients (Tier 3: analysis, query/reporting, data mining).]
Approaches to OLAP Servers
OLAP Server: Query Engine
Requirements
OLAP for Decision Support
Support ad hoc querying for the business analyst
Think in terms of spreadsheets
View sales data by geography, time, or product
Extend the spreadsheet analysis model to work with warehouse data
Large data sets
Semantically enriched to understand business terms
Combine interactive queries with reporting functions
Multidimensional view of data is the foundation of OLAP
Data model, operations, etc.
Warehouse Models & Operators
Data Models
relations
stars & snowflakes
cubes
Operators
slice & dice
roll-up, drill down
pivoting
other
Multi-Dimensional Data
Measures - numerical data being tracked
Dimensions - business parameters that define a transaction
Example: An analyst may want to view sales data (measure) by geography, by time, and by product (dimensions)
Dimensional modeling is a technique for structuring data around business concepts
ER models describe “entities” and “relationships”, whereas dimensional models describe “measures” and “dimensions”
The Multi-Dimensional Model
“Sales by product line over the past six months”
“Sales by store between 1990 and 1995”
[Figure: a fact table and its dimension tables (e.g., Store Info). The fact table holds the key columns joining it to the dimension tables, plus the numerical measures.]
Dimensional Modeling
Dimension Hierarchies
Store Dimension: Total -> Region -> District -> Stores
Product Dimension: Total -> Manufacturer -> Brand -> Products
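A minimal Python sketch of rolling a measure up the Store hierarchy (Stores -> District -> Region -> Total). The store names, districts, regions, and dollar amounts are made-up illustration data, and `roll_up` is a hypothetical helper, not something from the slides.

```python
from collections import defaultdict

# Hypothetical store hierarchy: store -> (district, region).
STORE_HIERARCHY = {
    "Store A": ("District 1", "North"),
    "Store B": ("District 1", "North"),
    "Store C": ("District 2", "North"),
    "Store D": ("District 3", "South"),
}

# Hypothetical sales dollars per store (the measure).
STORE_SALES = {"Store A": 100, "Store B": 50, "Store C": 70, "Store D": 30}


def roll_up(level):
    """Aggregate store-level sales to 'store', 'district', 'region' or 'total'."""
    totals = defaultdict(int)
    for store, dollars in STORE_SALES.items():
        district, region = STORE_HIERARCHY[store]
        key = {"store": store, "district": district,
               "region": region, "total": "Total"}[level]
        totals[key] += dollars
    return dict(totals)


by_district = roll_up("district")
by_region = roll_up("region")
grand_total = roll_up("total")
```

Each step up the hierarchy groups the same store-level facts by a coarser attribute, so the grand total is the same at every level.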
An expanded view of the model shows three
dimensions: Time, Store and Product.
Attribute hierarchies are vertical
relationships, while extended attribute
characteristics are diagonal.
• In the time dimension, a given date is further described by its extended attributes "current
flag," "sequence" and "day of the week."
• Extended attribute characteristics have no impact on granularity.
The fact that February 4, 1996 is a Sunday has no effect on the
fact that we collect sales by day. In practice, though, we may wish
to compare Sunday sales to other days. We can do this by
constraining our query on the extended attribute characteristic "day
of the week" without having to gather any additional information.
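The Sunday comparison can be sketched with Python's built-in `sqlite3` module. The table names (`time_dim`, `sales_fact`), column names, and dollar amounts below are assumptions chosen for illustration; only the idea of constraining on "day of the week" comes from the text.

```python
import sqlite3
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (
    period_key  INTEGER PRIMARY KEY,
    full_date   TEXT,
    day_of_week TEXT      -- extended attribute; the grain is still one row per day
);
CREATE TABLE sales_fact (
    period_key INTEGER REFERENCES time_dim(period_key),
    dollars    REAL
);
""")

# Hypothetical daily sales for the week starting February 4, 1996.
days = [date(1996, 2, d) for d in range(4, 11)]
for key, d in enumerate(days, start=1):
    conn.execute("INSERT INTO time_dim VALUES (?, ?, ?)",
                 (key, d.isoformat(), DAY_NAMES[d.weekday()]))
    conn.execute("INSERT INTO sales_fact VALUES (?, ?)", (key, 100.0 * key))

# Constrain on the extended attribute to compare Sundays with other days.
rows = conn.execute("""
    SELECT CASE WHEN t.day_of_week = 'Sunday' THEN 'Sunday' ELSE 'Other' END AS bucket,
           SUM(f.dollars)
    FROM sales_fact f JOIN time_dim t ON t.period_key = f.period_key
    GROUP BY bucket
    ORDER BY bucket
""").fetchall()
```

No extra data had to be collected: the fact table still stores one row per day, and the comparison comes entirely from the attribute stored in the time dimension.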
ROLAP: Dimensional Modeling Using
Relational DBMS
Special schema design: star, snowflake
Special indexes: bitmap, multi-table join
Special tuning: maximize query throughput
Proven technology (relational model, DBMS),
tend to outperform specialized MDDB
especially on large data sets
Products
IBM DB2, Oracle, Sybase IQ, RedBrick, Informix
MOLAP: Dimensional Modeling
Using the Multi Dimensional Model
Star Schema (in RDBMS)
Star Schema Example
Star Schema with Sample Data
The “Classic” Star Schema
One approach is to create a multi-part key, identifying each record by the combination
of store/district/region. Using compound keys in a dimension table can cause
problems:
• It requires three separate metadata definitions to define a single relationship,
which adds to complexity in the design and sluggishness in performance.
• Since the fact table must carry all three keys as part of its primary key, addition
or deletion of levels in the hierarchy (such as the addition of "territory" between
store and district) will require physical modification of the fact table, a time-
consuming process that limits flexibility
• Carrying all of the segments of the compound dimensional key in the fact table
increases the size of the crucial fact table index, a real detriment to both
performance and scalability.
One alternative to compound keys is to
concatenate the keys into a single key. While this
approach solves the first two problems with
compound keys (extra metadata and rigidity in the
fact table), the size of the key is still a problem.
Also, as in the example above, dealing with nulls
can be confusing.
Instead of "meaningful" keys, generate the smallest possible key that will ensure
uniqueness of each record. Integers are the most efficient in most cases.
Note that the meaningful keys do not have to disappear; they may be shifted to non-key
attribute columns. If these attributes are used frequently in queries (for example, where
region_description is "North"), the columns can still be indexed, even if they aren't used as the
key.
The use of generated keys is preferred because:
• It allows for the highest level of flexibility of metadata
• Low maintenance as the data warehouse matures
• Highest possible performance
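A minimal sketch of generated keys, using Python's built-in `sqlite3` module. The table layout, the `store_code` and `district_code` columns, and all data values are assumptions for illustration; `region_description` is the only name taken from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Generated integer key; the "meaningful" codes become ordinary attributes.
CREATE TABLE store_dimension (
    store_key          INTEGER PRIMARY KEY,   -- generated key
    store_code         TEXT,                  -- former meaningful key (hypothetical)
    district_code      TEXT,
    region_description TEXT
);
-- Frequently queried attributes can still be indexed.
CREATE INDEX idx_store_region ON store_dimension(region_description);

CREATE TABLE fact_table (
    store_key  INTEGER REFERENCES store_dimension(store_key),
    period_key INTEGER,
    dollars    REAL
);
""")

stores = [("S-001", "D-10", "North"), ("S-002", "D-10", "North"),
          ("S-003", "D-20", "South")]
for store in stores:
    # SQLite assigns the next integer key when store_key is omitted.
    conn.execute("INSERT INTO store_dimension "
                 "(store_code, district_code, region_description) "
                 "VALUES (?, ?, ?)", store)

conn.executemany("INSERT INTO fact_table VALUES (?, ?, ?)",
                 [(1, 1, 10.0), (2, 1, 20.0), (3, 1, 40.0)])

north_dollars = conn.execute("""
    SELECT SUM(f.dollars)
    FROM fact_table f JOIN store_dimension s ON s.store_key = f.store_key
    WHERE s.region_description = 'North'
""").fetchone()[0]
keys = [r[0] for r in conn.execute(
    "SELECT store_key FROM store_dimension ORDER BY store_key")]
```

The fact table carries a single small integer per dimension, so adding a hierarchy level later only touches the dimension table.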
The “Classic” Star Schema
Example:
SELECT A.STORE_KEY, A.PERIOD_KEY, A.dollars
FROM Fact_Table A
WHERE A.STORE_KEY IN (SELECT STORE_KEY
                      FROM Store_Dimension B
                      WHERE region = 'North' AND Level = 2)
AND etc...
Level is needed whenever aggregates are stored with detail facts.
The “Level” Problem
Level is a problem because it causes potential for
error. If the query builder, human or program, forgets
about it, perfectly reasonable looking WRONG
answers can occur.
One alternative: the FACT CONSTELLATION model (summary tables)
The biggest drawback of the level indicator is that it limits flexibility (we may not know all of
the levels in the attribute hierarchies at first). By limiting ourselves to only certain levels, we
force a physical structure that may change, resulting in higher maintenance costs and more
downtime.
The level concept is a useful tool for very controlled data warehouses, that is, those that either
have no ad hoc users or have only ad hoc users who are knowledgeable about the
database. In particular, when the results of queries are pre-formatted reports or extracts to
smaller systems, such as data marts, the drawbacks of the level indicator are not so evident.
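The risk of forgetting Level can be sketched with Python's built-in `sqlite3` module. The Level encoding used here (1 = individual store, 2 = region aggregate row) and all data values are assumptions for illustration; the slides do not define the encoding.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Store_Dimension (STORE_KEY INTEGER PRIMARY KEY, region TEXT, Level INTEGER);
CREATE TABLE Fact_Table (STORE_KEY INTEGER, PERIOD_KEY INTEGER, dollars REAL);
""")
# Assumed encoding: Level 1 = individual store, Level 2 = region aggregate row.
conn.executemany("INSERT INTO Store_Dimension VALUES (?, ?, ?)",
                 [(1, "North", 1), (2, "North", 1), (3, "North", 2)])
# Detail facts for stores 1 and 2, plus the stored aggregate under key 3.
conn.executemany("INSERT INTO Fact_Table VALUES (?, ?, ?)",
                 [(1, 1, 100.0), (2, 1, 50.0), (3, 1, 150.0)])


def north_total(level=None):
    """Total dollars for region North, optionally constrained on Level."""
    sql = ("SELECT SUM(A.dollars) FROM Fact_Table A WHERE A.STORE_KEY IN "
           "(SELECT STORE_KEY FROM Store_Dimension B WHERE region = 'North'")
    params = ()
    if level is not None:
        sql += " AND Level = ?"
        params = (level,)
    sql += ")"
    return conn.execute(sql, params).fetchone()[0]


with_level = north_total(level=2)    # aggregate row only
from_detail = north_total(level=1)   # detail rows only
without_level = north_total()        # detail plus aggregate: double counted
```

The query without the Level constraint runs without error and returns a plausible-looking number, which is exactly why the mistake is hard to spot.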
The “Fact Constellation” Schema
The chart above is composed of all of the tables from the Classic Star, plus
aggregated fact (summary) tables. For example, the Store dimension is
formed of a hierarchy of store -> district -> region.
The District fact table contains ONLY data aggregated by district, therefore
there are no records in the table with STORE_KEY matching any record for
the Store dimension at the store level. Therefore, when we scan the Store
dimension table, and select keys that have district = "Texas," they will only
match STORE_KEY in the District fact table when the record is aggregated
for stores in the Texas district. No double (or triple, etc.) counting is
possible and the Level indicator is not needed.
These aggregated fact tables can get very complicated, though. For
example, we need a District and Region fact table, but what level of detail
will they contain about the product dimension? All of the following:
STORE/PROD DISTRICT/PROD REGION/PROD
STORE/BRAND DISTRICT/BRAND REGION/BRAND
STORE/MANUF DISTRICT/MANUF REGION/MANUF
And these are just the combinations from two dimensions!
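The growth in candidate fact tables can be sketched by enumerating one level per dimension. The store and product level names come from the list above; the time levels and the `aggregate_tables` helper are hypothetical, added only to show how a third dimension multiplies the count.

```python
from itertools import product

STORE_LEVELS = ["STORE", "DISTRICT", "REGION"]
PRODUCT_LEVELS = ["PROD", "BRAND", "MANUF"]


def aggregate_tables(*dimension_levels):
    """Every combination of one level per dimension is a candidate fact table."""
    return ["/".join(combo) for combo in product(*dimension_levels)]


two_dims = aggregate_tables(STORE_LEVELS, PRODUCT_LEVELS)

# Hypothetical third dimension to show how quickly the count grows.
TIME_LEVELS = ["DAY", "MONTH", "YEAR"]
three_dims = aggregate_tables(STORE_LEVELS, PRODUCT_LEVELS, TIME_LEVELS)
```

The count is the product of the number of levels in each dimension: 3 x 3 = 9 for two dimensions (STORE/PROD being the detail fact table itself) and 3 x 3 x 3 = 27 once a three-level time dimension is included.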
The “Fact Constellation” Schema
Major Advantage: No need for the “Level” indicator in the dimension tables,
since no aggregated data is stored with lower-level detail
Disadvantage: Dimension tables are still very large in some cases, which can slow
performance; front-end must be able to detect existence of aggregate facts, which
requires more extensive metadata
Another Alternative to “Level”
The “Snowflake” Schema
Store Dimension: STORE KEY, Store Description, City, State, District ID, Region_ID
District attribute table: District_ID, District Desc., Region_ID
Region attribute table: Region_ID, Region Desc., Regional Mgr.
The “Snowflake” Schema
Store Dimension: STORE KEY, Store Description, City, State, District ID, District Desc., Region_ID, Region Desc., Regional Mgr.
District attribute table: District_ID, District Desc., Region_ID
Region attribute table: Region_ID, Region Desc., Regional Mgr.
Store Fact Table: STORE KEY, PRODUCT KEY, PERIOD KEY, Dollars, Units, Price
District Fact Table: District_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price
Region Fact Table: Region_ID, PRODUCT_KEY, PERIOD_KEY, Dollars, Units, Price
How does it work? The best way is for the query to be built by
understanding which summary levels exist, and finding the proper
snowflaked attribute tables, constraining there for keys, then selecting
from the fact table.
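The two steps described above (constrain the snowflaked attribute table for keys, then select from the matching fact table) can be sketched with Python's built-in `sqlite3` module. The table names `District` and `District_Fact`, the reduced column set, and all data values are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE District (District_ID INTEGER PRIMARY KEY,
                       District_Desc TEXT, Region_ID INTEGER);
CREATE TABLE District_Fact (District_ID INTEGER, PRODUCT_KEY INTEGER,
                            PERIOD_KEY INTEGER, Dollars REAL);
""")
conn.executemany("INSERT INTO District VALUES (?, ?, ?)",
                 [(1, "Texas", 10), (2, "Ohio", 20)])
conn.executemany("INSERT INTO District_Fact VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 500.0), (1, 2, 1, 250.0), (2, 1, 1, 900.0)])

# Step 1: constrain the snowflaked attribute table to get the keys.
district_ids = [row[0] for row in conn.execute(
    "SELECT District_ID FROM District WHERE District_Desc = ?", ("Texas",))]

# Step 2: select from the fact table stored at the same summary level.
placeholders = ",".join("?" for _ in district_ids)
texas_dollars = conn.execute(
    f"SELECT SUM(Dollars) FROM District_Fact "
    f"WHERE District_ID IN ({placeholders})",
    district_ids).fetchone()[0]
```

Because the district-level fact table contains only district aggregates, no Level constraint is needed to avoid double counting.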
Notice how the Store dimension table generates subsets of records.
First, all records from the table (where level = "District" in the Star) are
extracted, and only those attributes that refer to that level (District
Description, for example) and the keys of the parent hierarchy (Region_ID) are
included in the table. Though the tables are subsets, it is absolutely critical
that column names are the same throughout the schema.
Aggregates
Add up amounts by day
In SQL: SELECT date, sum(amt)
FROM sale
GROUP BY date
Aggregates (Another Example)
Add up amounts by day, product
In SQL: SELECT date, prodId, sum(amt)
FROM sale
GROUP BY date, prodId
rollup: from (date, prodId) totals to date totals
drill-down: from date totals to (date, prodId) totals
Aggregates
Operators: sum, count, max, min, median, avg
“Having” clause
In order to retrieve the values of these functions only for
groups that satisfy certain conditions, we use HAVING.
SELECT date, sum(amt)
FROM sale
GROUP BY date
HAVING sum(amt) > 20;
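The aggregate queries can be tried against a small in-memory table with Python's built-in `sqlite3` module. The rows and the `storeId` column are made-up illustration data, and the HAVING threshold of 50 is chosen here only so that one group is filtered out. Note that the HAVING condition is written on the aggregate `SUM(amt)`, since it filters groups rather than individual rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50), ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
])

# Add up amounts by day.
by_day = conn.execute(
    "SELECT date, SUM(amt) FROM sale GROUP BY date ORDER BY date").fetchall()

# Drill down: add up amounts by day and product.
by_day_product = conn.execute(
    "SELECT date, prodId, SUM(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId").fetchall()

# Keep only the groups whose total exceeds the threshold.
big_days = conn.execute(
    "SELECT date, SUM(amt) FROM sale GROUP BY date "
    "HAVING SUM(amt) > 50 ORDER BY date").fetchall()

# Roll up: summing the finer groups over prodId reproduces the daily totals.
rolled_up = {}
for day, _prod, total in by_day_product:
    rolled_up[day] = rolled_up.get(day, 0) + total
```

Rolling the (date, prodId) totals back up over prodId gives the same numbers as grouping by date directly.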
ROLAP aggregation performance
D3 Stores 100
D2 Products 1000
D1 Days 10,000
It’s given that on average, each store sells 5,000 different products during the entire period.
Sparsity
Cube size: 100 * 1,000 * 10,000 = 1,000,000,000
Actual data: 100 * 5,000 = 500,000
Sparsity = 0.05%
Relational calculation
Data
Length of Fact Table r = 500,000 records
Record size R = 100 Bytes
Block size B = 1024 Bytes
Block factor bfr = 1024/100 = 10
Number of data blocks b = 50,000
Index
Compound key V = 27 bytes
Pointer P = 6 Bytes
Ri = V + P = 33 Bytes
Index block factor bfri = ⌊1024/33⌋ = 31 entries/block
How many blocks per index bi = ri / bfri = 50,000/31 ≈ 1613 blocks (one index entry per data block, so ri = 50,000)
Single Retrieval [store, product, day]
Binary search for index: log2 1613 ≈ 11 disk accesses
+ one data access: total = 12 disk accesses
12 * 10 msec = 120 msec = 0.12 sec
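The sizing arithmetic can be reproduced in a few lines of Python. All parameters are the ones given in the example; the variable names are mine, and the sketch assumes one index entry per data block. Block factors use floor division and block counts use ceiling division, so intermediate values depend on that rounding choice.

```python
import math

# Parameters as given in the example.
stores, products, days = 100, 1_000, 10_000
records = 100 * 5_000            # r: rows in the fact table
record_size = 100                # R, bytes
block_size = 1024                # B, bytes
key_size, pointer_size = 27, 6   # V and P, bytes
access_ms = 10                   # cost of one disk access, msec

cube_cells = stores * products * days
sparsity_pct = 100 * records / cube_cells

bfr = block_size // record_size                # records per data block
data_blocks = math.ceil(records / bfr)         # b

entry_size = key_size + pointer_size           # Ri
bfri = block_size // entry_size                # index entries per block
index_blocks = math.ceil(data_blocks / bfri)   # bi, one entry per data block

index_accesses = math.ceil(math.log2(index_blocks))  # binary search on the index
total_accesses = index_accesses + 1                   # plus one data block access
retrieval_ms = total_accesses * access_ms
```

Changing `block_size` or `key_size` shows how sensitive the index size is to the width of the compound key, while the logarithmic search cost moves much more slowly.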
Join
In order to see the total sales for a particular month for a particular
product in a particular store:
SELECT SUM(SalesFact.SalesDollars) AS SumOfSalesDollars
FROM TimeDimension JOIN (ProductDimension JOIN
(StoreDimension JOIN SalesFact ON StoreDimension.StoreID = SalesFact.StoreID)
ON ProductDimension.ProductID = SalesFact.ProductID)
ON TimeDimension.TimeID = SalesFact.TimeID
WHERE StoreDimension.StoreName = 'Dizengoff'
AND ProductDimension.ProductName = 'White Cheese'
AND TimeDimension.Month = 3
AND TimeDimension.Year = 1999
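Performing the JOIN requires:
Access (FT + StoreDim. + Result1Table + ProductDim. + Result2Table + TimeDim.)
where FT is the fact table and Result1Table and Result2Table are the intermediate join results.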
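A runnable sketch of the star join, using Python's built-in `sqlite3` module. The table and column names follow the query in the text; the rows are made-up illustration data, and the joins are written in flat form rather than nested, which is an equivalent formulation for inner joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE StoreDimension   (StoreID INTEGER PRIMARY KEY, StoreName TEXT);
CREATE TABLE ProductDimension (ProductID INTEGER PRIMARY KEY, ProductName TEXT);
CREATE TABLE TimeDimension    (TimeID INTEGER PRIMARY KEY, Month INTEGER, Year INTEGER);
CREATE TABLE SalesFact (StoreID INTEGER, ProductID INTEGER,
                        TimeID INTEGER, SalesDollars REAL);
""")
conn.executemany("INSERT INTO StoreDimension VALUES (?, ?)",
                 [(1, "Dizengoff"), (2, "Other Store")])
conn.executemany("INSERT INTO ProductDimension VALUES (?, ?)",
                 [(1, "White Cheese"), (2, "Milk")])
conn.executemany("INSERT INTO TimeDimension VALUES (?, ?, ?)",
                 [(1, 3, 1999), (2, 3, 1999), (3, 4, 1999)])
conn.executemany("INSERT INTO SalesFact VALUES (?, ?, ?, ?)", [
    (1, 1, 1, 10.0),  # matches all three constraints
    (1, 1, 2, 15.0),  # matches all three constraints
    (1, 1, 3, 99.0),  # wrong month
    (1, 2, 1, 7.0),   # wrong product
    (2, 1, 1, 5.0),   # wrong store
])

QUERY = """
SELECT SUM(f.SalesDollars) AS SumOfSalesDollars
FROM SalesFact f
JOIN StoreDimension   s ON s.StoreID   = f.StoreID
JOIN ProductDimension p ON p.ProductID = f.ProductID
JOIN TimeDimension    t ON t.TimeID    = f.TimeID
WHERE s.StoreName = 'Dizengoff'
  AND p.ProductName = 'White Cheese'
  AND t.Month = 3 AND t.Year = 1999
"""
sum_of_sales = conn.execute(QUERY).fetchone()[0]
```

Each dimension constraint narrows the set of fact rows, which is why the order in which the joins are evaluated matters so much for the access cost.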
Advantages of ROLAP
Dimensional Modeling
Define complex, multi-dimensional data with
simple model
Standard SQL tools
Allows the data warehouse to evolve with
relatively low maintenance
HOWEVER! Star schema and relational DBMS
are not the magic solution
Query optimization is still problematic