Chapter 4, Data Warehouse & OLAP Operations

11/16/2017
CSI 4352, Introduction to Data Mining
Chapter 4, Data Warehouse and

OLAP Operations
Young-Rae Cho
Associate Professor
Department of Computer Science
Baylor University
Chapter 4, Data Warehouse & OLAP Operations
 Basic Concept of Data Warehouse
 Data Warehouse Modeling
 Data Warehouse Architecture
 Data Warehouse Implementation
 From Data Warehousing to Data Mining
1
11/16/2017
What is Data Warehouse?
 Data Warehouse ( defined in many different ways )

 A decision support database that is maintained separately from the
organization’s operational database
 The support of information processing by providing a solid platform of
consolidated, historical data for analysis
 “A data warehouse is a (1) subject-oriented, (2) integrated,
(3) time-variant, and (4) nonvolatile collection of data in support of
management’s decision-making process.” — W. H. Inmon
 Data Warehousing
 The process of constructing and using data warehouses
Data Warehouse – Subject-Oriented
 Organized around major subjects

 e.g., customers, products, sales
 Focusing on the modeling and analysis of data for decision makers,

not on daily operations or transaction processing
 Provide a simple and concise view around particular subject issues

 Excluding data that are not useful in the decision support process
2
11/16/2017
Data Warehouse – Integrated
 Integrating multiple, heterogeneous data sources

 Relational databases, flat files, on-line transaction records
 Apply data cleaning and data integration techniques

 Ensures consistency in naming conventions, encoding structures,
attribute measures, etc. among different data sources
Data Warehouse – Time Variant
 The time horizon of data warehouses is significantly longer than that of

operational systems
 Operational databases have current data values
 Data warehouses provide information from a historical perspective
(e.g., 10-20 years)
 Time is a key structure in data warehouses

 Contain the attribute of time (explicitly or implicitly)
3
11/16/2017
Data Warehouse – Nonvolatile
 A physically separate storage of data transformed from operational databases
 Operational update of data does not occur

 Not require transaction processing, recovery, and concurrency control
mechanisms
 Require only two operations, initial loading of data and access of data
Data Integration Methods
 Methods
(1) Process to provide uniform interface to multiple data sources
→ Tradition Database Integration
(2) Process to combine multiple data sources into coherent storage
→ Data warehousing
 Traditional DB Integration
 A query-driven approach
 Wrappers / mediators on top of heterogeneous data sources
 Data Warehousing
 An update-driven approach
 Combined the heterogeneous data sources in advance
 Stored them in a warehouse for direct query and analysis
4
11/16/2017
OLTP vs. OLAP
 OLTP (on-line transaction processing)

 Major task of traditional relational DBMS
 Day-to-day operations: e.g., purchasing, inventory, manufacturing,
banking, payroll, registration, accounting, etc.
 OLAP (on-line analytical processing)

 Major task of data warehouse system
 Data analysis and decision making
 Distinct Features (OLTP vs. OLAP)

 User and system orientation (customers vs. market analysts)
 Data contents (current, detailed vs. historical, consolidated)
 Database design (ER + application vs. star + subject)
 View (current, local vs. evolutionary, integrated)
OLTP vs. OLAP
Feature OLTP OLAP
Characteristic operational processing information processing
Orientation transaction analysis

Users clerk, DBA knowledge worker (CEO, analyst)
Function day-to-day operations decision support
DB Design application-oriented subject-oriented
Data current, up-to-date historical, integrated, summarized
Unit of work short, simple transaction complex query
Access read/write/update read-only (lots of scans)
5
11/16/2017
Why Data Warehouse?
 Performance Issue
 DBMS: tuned for OLTP
e.g., access methods, indexing, concurrency control, recovery
 Data warehouse: tuned for OLAP
e.g., complex queries, multidimensional view, consolidation
 Data Issue
 Decision support requires historical data, consolidated and summarized
data, consistent data
 Basic Concept of Data Warehouse
 Data Warehouse Modeling
6
11/16/2017
Data Format for Warehouse
 Dimensions
 Multi-dimensional data model
 Data are stored in the form of a data cube
 Data Cube
 A view of multi-dimensions
 Dimension tables, such as item (item_name, brand, type), or
time (day, week, month, quarter, year)
 Fact table contains measures (such as dollars_sold) and keys to each of
the related dimension tables
 Cuboid
 Each combination of dimensional spaces in a data cube
 0-D cuboid, 1-D cuboid, 2-D cuboid, … , n-D cuboid
The Lattice of Cuboids
all
0-D (apex) cuboid
time item location supplier

1-D cuboids
time,item time,location item,location location,supplier

2-D cuboids
time,supplier item,supplier
time,item,location time,location,supplier
3-D cuboids
time,item,supplier item,location,supplier
4-D (base) cuboid

time, item, location, supplier
7
11/16/2017
Conceptual Modeling
 Key of Modeling Data Warehouses

 Handling dimensions & measures
 Examples
 Star schema: A fact table in the middle connected to a set of dimension
tables
 Snowflake schema: A refinement of star schema where some dimensional
hierarchy is normalized into a set of smaller dimension
tables, forming a shape similar to snowflake
 Fact constellations: Multiple fact tables share dimension tables, viewed
as a collection of stars, therefore called galaxy schema
or fact constellation
Example of Star Schema
time
time_key item
day item_key
day_of_the_week Sales Fact Table item_name
month brand
quarter time_key type
year supplier_type
item_key
branch_key
branch
location_key location
branch_key
branch_name units_sold location_key
branch_type street
dollars_sold city
avg_sales state
country
Measures
8
11/16/2017
Example of Snowflake Schema
time
time_key item
day item_key supplier
day_of_the_week Sales Fact Table item_name supplier_key
month brand supplier_type
quarter time_key type
year supplier_key
item_key
branch_key
branch
location_key
branch_key location
branch_name units_sold location_key
branch_type street city
dollars_sold
city_key city_key
avg_sales city
Measures state
country
Example of Fact Constellation
time Shipping Fact Table

time_key item
time_key
day item_key
day_of_the_week Sales Fact Table item_name item_key
month brand
quarter time_key shipper_key
type
year supplier_type from_location
item_key
branch_key to_location
branch dollars_cost
location_key location
branch_key
branch_name units_sold location_key units_shipped
branch_type street
dollars_sold city shipper
avg_sales state shipper_key
country shipper_name
Measures location_key
shipper_type
9
11/16/2017
Cube Definition in DMQL
 Cube Definition (Fact Table)

 define cube <cube_name> [<dimension_list>]: <measure_list>
 Dimension Definition (Dimension Table)

 define dimension <dimension_name> as (<attribute_or_dimension_list>)
 Special Case (Shared Dimension Table)

 define dimension <dimension_name> as <dimension_name_first>
in cube <cube_name_first>
Star Schema Definition in DMQL
 Example
 define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars),
avg_sales = avg(sales_in_dollars),
units_sold = count(*)
 define dimension time as (time_key, day, day_of_week, month, quarter,
year)
 define dimension item as (item_key, item_name, brand, type,
supplier_type)
 define dimension branch as (branch_key, branch_name, branch_type)
 define dimension location as (location_key, street, city, state, country)
10
11/16/2017
Snowflake Schema Definition in DMQL
 Example
dollars_sold = sum(sales_in_dollars),
avg_sales = avg(sales_in_dollars),
 define dimension time as (time_key, day, day_of_week, month, quarter,
year)
 define dimension item as (item_key, item_name, brand, type,
supplier(supplier_key, supplier_type))
 define dimension location as (location_key, street, city(city_key,
province_or_state, country))
Fact Constellation Schema Definition in DMQL
 Example
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
 define dimension time as (time_key, day, day_of_week, month, quarter, year)
 define dimension item as (item_key, item_name, brand, type, supplier_type)
 define dimension location as (location_key, street, city, state, country)
 define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
 define dimension time as time in cube sales
 define dimension item as item in cube sales
 define dimension shipper as (shipper_key, shipper_name,
location as location in cube sales, shipper_type)
 define dimension from_location as location in cube sales
 define dimension to_location as location in cube sales
11
11/16/2017
Measures
 Distributive
 If the result derived by applying the function to n aggregate values is the
same as that derived by applying the function on all the data without
partitioning
 e.g., count(), sum(), min(), max()
 Algebraic
 If it can be computed by an algebraic function with m arguments, each of
which is obtained by applying a distributive aggregate function
 e.g., avg(), standard_deviation()
 Holistic
 If there is no constant bound on the storage size needed to describe a
subaggregate
 e.g., median(), mode(), rank()
Concept Hierarchy
 Schema Hierarchy
Year
 e.g., street < city < state < country
Quarter
 e.g., day < { month < quarter ; week } < year
| Week
Month
Day
 Set-group hierarchy
 e.g., { (0..100] ; (100..200] } < (0..200]
 e.g., { (0..10] < lowPrice ; { (10..100] ; (100..200] } < highPrice }
< allProducts
allProducts
(0..200]
lowPrice highPrice
(0..100] (100..200]
(0..10] (10..100] (100..200]
12
11/16/2017
Three Components of Data Cube
 Example
 Measures: data values as a function of products, locations and time
 Dimensions:
 Hierarchies:
product location time
Year
Company Country
| |
locations
Quarter
Category City
| Week
| |
Month
Product Office
Day
time
Example of Cuboid Cells
time dimension
all location dimension

product dimension 0-D (apex) cuboid
product quarter country

1-D cuboids
product, country
product, quarter
2-D cuboids
quarter, country
3-D (base) cuboid

product, quarter, country
13
11/16/2017
Example of Data Cube
Quarter Total annual sales

2Qtr of TV in U.S.A.
1Qtr 3Qtr 4Qtr sum
TV
PC U.S.A
VCR
Country
sum
Canada
Mexico
sum
OLAP Operations
 Roll-up (drill-up)
 Summarizes (aggregates) data
 by climbing up hierarchy or by dimension reduction
 Drill-down (roll-down)
 Reverse of roll-up
 by stepping down to lower-level data or introducing new dimensions
 Slice
 Selecting data on one dimension
 Dice
 Selecting data on multi-dimensions
 Pivot (rotate)
 Reorienting the cube, or transforming 3-D data to a series of 2-D spaces
14
11/16/2017
Roll-UP & Drill-Down
city
location
country
quarter
quarter
time
country month
Slice, Dice & Pivot
laptop
pc
supercom
country
canada
laptop
pc USA
mexico
location
Q1 Q2
country
canada quarter
USA
Q1 Q2 Q3 Q4
quarter
time
mexico
country
canada
USA
Q1 Q2 Q3 Q4
quarter
15
11/16/2017
Starnet Query Model
Orders Customer
Shipping Method
contracts
air customerID
order
truck
item
Time Product
annually quarterly daily category
sales-person
city
country
division
region
division
Location
Each circle is
called a footprint Organization
Promotion
 Data Warehouse Architecture
16
11/16/2017
Data Warehouse Design
 Top-Down View
 Allows the selection of the relevant information necessary for the data
warehouse
 Data Source View

 Exposes the information being captured, stored, and managed by
operational systems
 Data Warehouse View

 Consists of fact tables and dimension tables
 Business Query View

 Shows the perspectives of data in the warehouse to end-users
Data Warehouse Design Process
 Categories by Process Direction

 Top-down: Starts with overall design and planning (mature)
 Bottom-up: Starts with experiments and prototypes (rapid)
 Categories by Software Engineering View

 Waterfall: structured, systematic analysis at each step before proceeding
to the next
 Spiral: rapid generation of functional systems, short turn around time
 Typical Data Warehouse Design Process

 Choose business processes for modeling, e.g., orders, invoices, etc
 Choose the grain (atomic level of data) of the business processes
 Choose the dimensions that will apply to each fact table
 Choose the measures that will populate each fact table
17
11/16/2017
Data Warehouse Architecture
Monitor & OLAP Server

Operational Metadata Integrator
DBs
Extract Query
Transform
Data Serve Analysis
Load
Refresh Warehouse
Reports
Other
sources
Data Marts
Data Sources Data Storage OLAP Engine Front-End Tools
Three Data Warehouse Models
 Enterprise Warehouse
 A global view with all the information about subjects spanning the entire
organization
 Data Mart
 A subset of corporate-wide data that is of value to a specific group of
users
 Its scope is confined to specific, selected groups
 Independent vs. dependent (directly from warehouse) data mart
 Virtual Warehouse
 A set of views over operational databases
 Only some of possible summary views may be materialized
18
11/16/2017
Development of Data Warehouse
Multi-Tier
Data Warehouse
Distributed
Data Marts
Enterprise
Data Data Data

Mart Mart Warehouse
Model Refinement Model Refinement
Define a high-level corporate data model
Utilities of Back-End Tools
 Data Extraction
 Get data from multiple, heterogeneous, and external sources
 Data Cleaning
 Detect errors in the data and rectify them when possible
 Data Transformation
 Convert data from the original format to the warehouse format
 Loading
 Sort, summarize, consolidate, compute views, check integrity, and
build indices and partitions
 Refresh
 Propagate the updates from data sources to the warehouse
19
11/16/2017
OLAP Server Architecture
 Relational OLAP (ROLAP)

 Uses relational or extended-relational DBMS to store and manage
warehouse data and OLAP middle ware
 Includes optimization of DBMS back-end, implementation of aggregation
navigation logic, and additional tools and services
 High scalability
 Multidimensional OLAP (MOLAP)

 Sparse array-based multidimensional storage engine
 Fast indexing to pre-computed summarized data
 Hybrid OLAP (HOLAP)

 Low-level: relational / high-level: array
 High flexibility
Metadata Repository
 Definition of Metadata
 The data defining data warehouse objects
 Examples
 Description of the structure of the data warehouse, e.g., schema, view,
dimensions, hierarchies, data definitions, data mart locations and contents
 Operational meta-data, e.g., history of migrated data, currency of data,
warehouse usage statistics, error reports
 Algorithms used for summarization
 Mapping from operational environment to data warehouse
 Data related to system performance
 Business data, e.g., business terms and definitions, ownership of data,
charging policies
20
11/16/2017
 Data Warehouse Implementation
Data Cube Computation
 View as a Lattice of Cuboids

all
 How many cuboids in an n-dimensional cube? product time location
2n
product, location
 How many cuboid cells in an n-dimensional cube
with Li levels? product, time,
n time location
 ( L i  1)
i1 product, time, location
 Materialization of Data Cube

 Full materialization (all cuboids), Partial materialization (some cuboids),
No materialization (only base cuboid)
 Selection of cuboids to materialize
• Based on the size, sharing, access frequency, etc.
21
11/16/2017
Cube Operation
 Cube Definition and Computation in DMQL

 define cube sales [item, city, year]: sum (sales_in_dollars)
 compute cube sales
 Cube Definition and Computation in SQL

 select item, city, year, SUM (amount)
 from sales
 cube by item, city, year
 Internal Operations
 group by (item, city, year)
 group by (item, city), (item, year), (city, year)
 group by (item), (city), (year)
 group by ()
Iceberg Cube
 Iceberg Cube Computation

 Computing only the cuboid cells whose count or
other aggregates satisfying the condition like
HAVING COUNT(*) >= min_sup
 Motivation
 Only a small portion of cube cells may be “above the water’’
in a sparse cube
 Only calculate “interesting” cells — data above certain threshold
 Avoid explosive growth of the cube
22
11/16/2017
 From Data Warehousing to Data Mining
Data Warehouse Usage
 Information Processing
 Supports querying, basic statistical analysis, and reporting using
crosstabs, tables, charts and graphs
 Analytical Processing
 Supports OLAP operations in multi-dimensional space
 Data Mining
 Supports pattern discovery from warehouse data, and presenting
the mining results using visualization tools
23
11/16/2017
From OLAP To OLAM
 On-Line Analytical Mining (OLAM)

 ( OLAP + Data Mining ) in data warehouse
 Why OLAM ?
 High quality data
• Data warehouse contains integrated, consistent and cleaned data
 Information processing infrastructure
• ODBC/OLE DB connections, web accessing, service facilities
 OLAP-based exploratory data analysis
• Mining with drilling, dicing, pivoting, etc.
 On-line selection of data mining functions
• Integration and swapping of multiple data mining functions
OLAM System Architecture
mining query mining result

Layer4
User GUI API
User Interface
Layer3
OLAM OLAP
Engine Engine OLAP/OLAM
Data Cube API
Layer2
Data
Meta Data MDDB
Warehouse
Database API data cleaning data integration

Layer1
Databases Data Repository
24
11/16/2017
Questions?
 References
 Gray, J., et al., “Data Cube: A Relational Aggregation Operator
Generalizing Group-By, Cross-Tab and Sub-Totals”, Data Mining and
Knowledge Discovery, Vol. 1 (1997)
 Lecture Slides are found on the Course Website,

www.ecs.baylor.edu/faculty/cho/4352
25

Chapter 4, Data Warehouse & OLAP Operations

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Chapter 4, Data Warehouse & OLAP Operations

Uploaded by

Copyright:

Available Formats

11/16/2017

CSI 4352, Introduction to Data Mining

Chapter 4, Data Warehouse and

CSI 4352, Introduction to Data Mining