Professional Documents
Culture Documents
Data Warehousing & OLAP
Data Warehousing & OLAP
Data Mining:
— Chapter 3 —
Jiawei Han
and
What is OLAP?
TOTQTY
1600
2. SELECT S#,
SUM(QTY) AS TOTQTY
FROM SP
GROUP BY (S#) ;
S# TOTQTY
S1 500
S2 700
S3 200
S4 200
3. SELECT P#,
SUM(QTY) AS TOTQTY
FROM SP
GROUP BY (P#) ;
P# TOTQTY
P1 600
P2 1000
4. SELECT S#, P#,
SUM(QTY) AS TOTQTY
FROM SP
GROUP BY (S#,P#) ,
S# P# TOTQTY
S1 P1 300
S1 P2 200
S2 P1 300
S2 P2 400
S3 P2 200
S4 P2 200
Drawbacks
Formulation so many similar but distinct
queries is tedious
Executing the queries is expensive
Make life easier,
more efficient computation
Single query
GROUPING SETS, ROLLUP, CUBE options
Added to SQL standard 1999
GROUPING SETS
Execute several queries simultaneously
SELECT S#, P#, SUM (QTY) AS TOTQTY S# P# TOTQTY
FROM SP S1 null 500
GROUP BY GROUPING SETS ( (S#), (P#) ) ;
S2 null 700
S3 null 200
S4 null 200
Single results table
null P1 600
Not a relation !!
null missing information null P2 1000
SELECT CASE GROUPING ( S# )
WHEN 1 THEN ‘??‘ S# P# TOTQTY
ELSE S# S1 !! 500
AS S#, S2 !! 700
CASE GROUPING ( P# ) S3 !! 200
WHEN 1 THEN ‘!!‘ S4 !! 200
ELSE P# ?? P1 600
AS P#, ?? P2 1000
SUM ( QTY ) AS TOTQTY
FROM SP
GROUP BY GROUPING SETS ( ( S# ), ( P# ) );
S# P# TOTQTY
ROLLUP S1 P1 300
S1 P2 200
S2 P1 300
SELECT S#,P#, SUM ( QTY ) AS TOTQTY S2 P2 400
FROM SP S3 P2 200
GROUP BY ROLLUP (S#, P#) ; S4 P2 200
S1 null 500
S2 null 700
S3 null 200
S4 null 200
null null 1600
(A,B,...,Z)
(A,B,...)
(A,B)
(A)
()
time,location,supplier
3-D cuboids
time,item,location
time,item,supplier item,location,supplier
4-D(base) cuboid
time, item, location, supplier
24=16
Conceptual Modeling of Data
Warehouses
Modeling data warehouses: dimensions & measures
instead of relational model
Subject, facilitates on-line data analysis oriented
Most popular model is the multidimensional model
Most common modeling paradigm:
Star schema
Data warehouse contains a large central table (fact table)
• Contains the data without redundancy
A set of dimension tables (each for each dimension)
time
Example of Star Schema
time_key item
day item_key
day_of_the_week Sales Fact Table item_name
month brand
quarter time_key type
year supplier_type
item_key
branch_key
branch location
location_key
branch_key location_key
branch_name units_sold street
branch_type city
dollars_sold state_or_province
country
avg_sales
Measures
Snowflake schema
branch_key
branch location
location_key
location_key
branch_key
units_sold street
branch_name
city_key
branch_type
dollars_sold city
city_key
avg_sales city
state_or_province
Measures country
Fact constellations
Specification of hierarchies
Schema hierarchy
day < {month < quarter; week}
< year
Set_grouping hierarchy
{1..10} < inexpensive
Multidimensional Data
Sales volume as a function of product,
month, and region
Dimensions: Product, Location, Time
Hierarchical summarization paths
Office Day
Month
Measures of Data Cube:
Three Categories
(Depending on the aggregate functions)
Country
sum
Spain
Germany
sum
Browsing a Data
Cube
Visualization
OLAP capabilities
Interactive manipulation
Typical OLAP Operations
Roll up (drill-up): summarize data
by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
from higher level summary to lower level summary or
detailed data, or introducing new dimensions
Slice and dice: project and select
Pivot (rotate):
reorient the cube, visualization, 3D to series of 2D planes
Other operations
drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)
Fig. 3.10 Typical OLAP
Operations
Multi-dimensional query
Language
No standard yet..
DMQL, DMX,..
• MDX was introduced by Microsoft with Microsoft SQL
Server OLAP Services in around 1998, as the language
component of the OLE DB for OLAP API. More recently,
MDX has appeared as part of the XML for Analysis API.
Microsoft proposed that the MDX is a standard, and its
adoption among application writers and other OLAP
providers is steadily increasing.
Mondrian (open-source)
Mondrian is an OLAP server written in Java. It
enables you to interactively analyze very large
datasets stored in SQL databases without
writing SQL.
http://mondrian.sourceforge.net/index.html
MOLAP
Multidimensional database (server)
Data is stored in cells of a multi-dimensional array
Three dimensions: products, customers, time intervals
Each individual cell value might then represent the total quantity
of the indicated product sold to the indicated customer in the
indicated time interval