Professional Documents
Culture Documents
Chapter 4, Data Warehouse & OLAP Operations
Chapter 4, Data Warehouse & OLAP Operations
Young-Rae Cho
Associate Professor
Department of Computer Science
Baylor University
1
11/16/2017
Data Warehousing
The process of constructing and using data warehouses
2
11/16/2017
3
11/16/2017
Methods
(1) Process to provide uniform interface to multiple data sources
→ Tradition Database Integration
(2) Process to combine multiple data sources into coherent storage
→ Data warehousing
Traditional DB Integration
A query-driven approach
Wrappers / mediators on top of heterogeneous data sources
Data Warehousing
An update-driven approach
Combined the heterogeneous data sources in advance
Stored them in a warehouse for direct query and analysis
4
11/16/2017
5
11/16/2017
Performance Issue
DBMS: tuned for OLTP
e.g., access methods, indexing, concurrency control, recovery
Data warehouse: tuned for OLAP
e.g., complex queries, multidimensional view, consolidation
Data Issue
Decision support requires historical data, consolidated and summarized
data, consistent data
6
11/16/2017
Dimensions
Multi-dimensional data model
Data are stored in the form of a data cube
Data Cube
A view of multi-dimensions
Dimension tables, such as item (item_name, brand, type), or
time (day, week, month, quarter, year)
Fact table contains measures (such as dollars_sold) and keys to each of
the related dimension tables
Cuboid
Each combination of dimensional spaces in a data cube
0-D cuboid, 1-D cuboid, 2-D cuboid, … , n-D cuboid
all
0-D (apex) cuboid
time,item,location time,location,supplier
3-D cuboids
time,item,supplier item,location,supplier
7
11/16/2017
Conceptual Modeling
Examples
Star schema: A fact table in the middle connected to a set of dimension
tables
Snowflake schema: A refinement of star schema where some dimensional
hierarchy is normalized into a set of smaller dimension
tables, forming a shape similar to snowflake
Fact constellations: Multiple fact tables share dimension tables, viewed
as a collection of stars, therefore called galaxy schema
or fact constellation
time
time_key item
day item_key
day_of_the_week Sales Fact Table item_name
month brand
quarter time_key type
year supplier_type
item_key
branch_key
branch
location_key location
branch_key
branch_name units_sold location_key
branch_type street
dollars_sold city
avg_sales state
country
Measures
8
11/16/2017
time
time_key item
day item_key supplier
day_of_the_week Sales Fact Table item_name supplier_key
month brand supplier_type
quarter time_key type
year supplier_key
item_key
branch_key
branch
location_key
branch_key location
branch_name units_sold location_key
branch_type street city
dollars_sold
city_key city_key
avg_sales city
Measures state
country
9
11/16/2017
Example
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars),
avg_sales = avg(sales_in_dollars),
units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter,
year)
define dimension item as (item_key, item_name, brand, type,
supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, state, country)
10
11/16/2017
Example
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars),
avg_sales = avg(sales_in_dollars),
units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter,
year)
define dimension item as (item_key, item_name, brand, type,
supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key,
province_or_state, country))
Example
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name,
location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
11
11/16/2017
Measures
Distributive
If the result derived by applying the function to n aggregate values is the
same as that derived by applying the function on all the data without
partitioning
e.g., count(), sum(), min(), max()
Algebraic
If it can be computed by an algebraic function with m arguments, each of
which is obtained by applying a distributive aggregate function
e.g., avg(), standard_deviation()
Holistic
If there is no constant bound on the storage size needed to describe a
subaggregate
e.g., median(), mode(), rank()
Concept Hierarchy
Schema Hierarchy
Year
e.g., street < city < state < country
Quarter
e.g., day < { month < quarter ; week } < year
| Week
Month
Day
Set-group hierarchy
e.g., { (0..100] ; (100..200] } < (0..200]
e.g., { (0..10] < lowPrice ; { (10..100] ; (100..200] } < highPrice }
< allProducts
allProducts
(0..200]
lowPrice highPrice
(0..100] (100..200]
(0..10] (10..100] (100..200]
12
11/16/2017
Example
Measures: data values as a function of products, locations and time
Dimensions:
Hierarchies:
Year
Company Country
| |
locations
Quarter
Category City
| Week
| |
Month
Product Office
Day
time
time dimension
product, quarter
2-D cuboids
quarter, country
13
11/16/2017
Country
sum
Canada
Mexico
sum
OLAP Operations
Roll-up (drill-up)
Summarizes (aggregates) data
by climbing up hierarchy or by dimension reduction
Drill-down (roll-down)
Reverse of roll-up
by stepping down to lower-level data or introducing new dimensions
Slice
Selecting data on one dimension
Dice
Selecting data on multi-dimensions
Pivot (rotate)
Reorienting the cube, or transforming 3-D data to a series of 2-D spaces
14
11/16/2017
city
location
country
quarter
quarter
time
country month
laptop
pc
supercom
country
canada
laptop
pc USA
mexico
location
Q1 Q2
country
canada quarter
USA
Q1 Q2 Q3 Q4
quarter
time
mexico
country
canada
USA
Q1 Q2 Q3 Q4
quarter
15
11/16/2017
Orders Customer
Shipping Method
contracts
air customerID
order
truck
item
Time Product
annually quarterly daily category
sales-person
city
country
division
region
division
Location
Each circle is
called a footprint Organization
Promotion
16
11/16/2017
Top-Down View
Allows the selection of the relevant information necessary for the data
warehouse
17
11/16/2017
Extract Query
Transform
Data Serve Analysis
Load
Refresh Warehouse
Reports
Other
sources
Data Marts
Enterprise Warehouse
A global view with all the information about subjects spanning the entire
organization
Data Mart
A subset of corporate-wide data that is of value to a specific group of
users
Its scope is confined to specific, selected groups
Independent vs. dependent (directly from warehouse) data mart
Virtual Warehouse
A set of views over operational databases
Only some of possible summary views may be materialized
18
11/16/2017
Multi-Tier
Data Warehouse
Distributed
Data Marts
Enterprise
Data Extraction
Get data from multiple, heterogeneous, and external sources
Data Cleaning
Detect errors in the data and rectify them when possible
Data Transformation
Convert data from the original format to the warehouse format
Loading
Sort, summarize, consolidate, compute views, check integrity, and
build indices and partitions
Refresh
Propagate the updates from data sources to the warehouse
19
11/16/2017
Metadata Repository
Definition of Metadata
The data defining data warehouse objects
Examples
Description of the structure of the data warehouse, e.g., schema, view,
dimensions, hierarchies, data definitions, data mart locations and contents
Operational meta-data, e.g., history of migrated data, currency of data,
warehouse usage statistics, error reports
Algorithms used for summarization
Mapping from operational environment to data warehouse
Data related to system performance
Business data, e.g., business terms and definitions, ownership of data,
charging policies
20
11/16/2017
21
11/16/2017
Cube Operation
Internal Operations
group by (item, city, year)
group by (item, city), (item, year), (city, year)
group by (item), (city), (year)
group by ()
Iceberg Cube
Motivation
Only a small portion of cube cells may be “above the water’’
in a sparse cube
Only calculate “interesting” cells — data above certain threshold
Avoid explosive growth of the cube
22
11/16/2017
Information Processing
Supports querying, basic statistical analysis, and reporting using
crosstabs, tables, charts and graphs
Analytical Processing
Supports OLAP operations in multi-dimensional space
Data Mining
Supports pattern discovery from warehouse data, and presenting
the mining results using visualization tools
23
11/16/2017
Why OLAM ?
High quality data
• Data warehouse contains integrated, consistent and cleaned data
Information processing infrastructure
• ODBC/OLE DB connections, web accessing, service facilities
OLAP-based exploratory data analysis
• Mining with drilling, dicing, pivoting, etc.
On-line selection of data mining functions
• Integration and swapping of multiple data mining functions
Layer3
OLAM OLAP
Engine Engine OLAP/OLAM
Layer2
Data
Meta Data MDDB
Warehouse
24
11/16/2017
Questions?
References
Gray, J., et al., “Data Cube: A Relational Aggregation Operator
Generalizing Group-By, Cross-Tab and Sub-Totals”, Data Mining and
Knowledge Discovery, Vol. 1 (1997)
25