You are on page 1of 25

11/16/2017

CSI 4352, Introduction to Data Mining

Chapter 4, Data Warehouse and


OLAP Operations

Young-Rae Cho
Associate Professor
Department of Computer Science
Baylor University

CSI 4352, Introduction to Data Mining

Chapter 4, Data Warehouse & OLAP Operations

 Basic Concept of Data Warehouse

 Data Warehouse Modeling

 Data Warehouse Architecture

 Data Warehouse Implementation

 From Data Warehousing to Data Mining

1
11/16/2017

What is Data Warehouse?

 Data Warehouse ( defined in many different ways )


 A decision support database that is maintained separately from the
organization’s operational database
 The support of information processing by providing a solid platform of
consolidated, historical data for analysis
 “A data warehouse is a (1) subject-oriented, (2) integrated,
(3) time-variant, and (4) nonvolatile collection of data in support of
management’s decision-making process.” — W. H. Inmon

 Data Warehousing
 The process of constructing and using data warehouses

Data Warehouse – Subject-Oriented

 Organized around major subjects


 e.g., customers, products, sales

 Focusing on the modeling and analysis of data for decision makers,


not on daily operations or transaction processing

 Provide a simple and concise view around particular subject issues


 Excluding data that are not useful in the decision support process

2
11/16/2017

Data Warehouse – Integrated

 Integrating multiple, heterogeneous data sources


 Relational databases, flat files, on-line transaction records

 Apply data cleaning and data integration techniques


 Ensures consistency in naming conventions, encoding structures,
attribute measures, etc. among different data sources

Data Warehouse – Time Variant

 The time horizon of data warehouses is significantly longer than that of


operational systems
 Operational databases have current data values
 Data warehouses provide information from a historical perspective
(e.g., 10-20 years)

 Time is a key structure in data warehouses


 Contain the attribute of time (explicitly or implicitly)

3
11/16/2017

Data Warehouse – Nonvolatile

 A physically separate storage of data transformed from operational databases

 Operational update of data does not occur


 Not require transaction processing, recovery, and concurrency control
mechanisms
 Require only two operations, initial loading of data and access of data

Data Integration Methods

 Methods
(1) Process to provide uniform interface to multiple data sources
→ Tradition Database Integration
(2) Process to combine multiple data sources into coherent storage
→ Data warehousing

 Traditional DB Integration
 A query-driven approach
 Wrappers / mediators on top of heterogeneous data sources

 Data Warehousing
 An update-driven approach
 Combined the heterogeneous data sources in advance
 Stored them in a warehouse for direct query and analysis

4
11/16/2017

OLTP vs. OLAP

 OLTP (on-line transaction processing)


 Major task of traditional relational DBMS
 Day-to-day operations: e.g., purchasing, inventory, manufacturing,
banking, payroll, registration, accounting, etc.

 OLAP (on-line analytical processing)


 Major task of data warehouse system
 Data analysis and decision making

 Distinct Features (OLTP vs. OLAP)


 User and system orientation (customers vs. market analysts)
 Data contents (current, detailed vs. historical, consolidated)
 Database design (ER + application vs. star + subject)
 View (current, local vs. evolutionary, integrated)

OLTP vs. OLAP

Feature OLTP OLAP

Characteristic operational processing information processing

Orientation transaction analysis


Users clerk, DBA knowledge worker (CEO, analyst)

Function day-to-day operations decision support

DB Design application-oriented subject-oriented

Data current, up-to-date historical, integrated, summarized

Unit of work short, simple transaction complex query

Access read/write/update read-only (lots of scans)

5
11/16/2017

Why Data Warehouse?

 Performance Issue
 DBMS: tuned for OLTP
e.g., access methods, indexing, concurrency control, recovery
 Data warehouse: tuned for OLAP
e.g., complex queries, multidimensional view, consolidation

 Data Issue
 Decision support requires historical data, consolidated and summarized
data, consistent data

CSI 4352, Introduction to Data Mining

Chapter 4, Data Warehouse & OLAP Operations

 Basic Concept of Data Warehouse

 Data Warehouse Modeling

 Data Warehouse Architecture

 Data Warehouse Implementation

 From Data Warehousing to Data Mining

6
11/16/2017

Data Format for Warehouse

 Dimensions
 Multi-dimensional data model
 Data are stored in the form of a data cube

 Data Cube
 A view of multi-dimensions
 Dimension tables, such as item (item_name, brand, type), or
time (day, week, month, quarter, year)
 Fact table contains measures (such as dollars_sold) and keys to each of
the related dimension tables

 Cuboid
 Each combination of dimensional spaces in a data cube
 0-D cuboid, 1-D cuboid, 2-D cuboid, … , n-D cuboid

The Lattice of Cuboids

all
0-D (apex) cuboid

time item location supplier


1-D cuboids

time,item time,location item,location location,supplier


2-D cuboids
time,supplier item,supplier

time,item,location time,location,supplier
3-D cuboids
time,item,supplier item,location,supplier

4-D (base) cuboid


time, item, location, supplier

7
11/16/2017

Conceptual Modeling

 Key of Modeling Data Warehouses


 Handling dimensions & measures

 Examples
 Star schema: A fact table in the middle connected to a set of dimension
tables
 Snowflake schema: A refinement of star schema where some dimensional
hierarchy is normalized into a set of smaller dimension
tables, forming a shape similar to snowflake
 Fact constellations: Multiple fact tables share dimension tables, viewed
as a collection of stars, therefore called galaxy schema
or fact constellation

Example of Star Schema

time
time_key item
day item_key
day_of_the_week Sales Fact Table item_name
month brand
quarter time_key type
year supplier_type
item_key
branch_key
branch
location_key location
branch_key
branch_name units_sold location_key
branch_type street
dollars_sold city
avg_sales state
country
Measures

8
11/16/2017

Example of Snowflake Schema

time
time_key item
day item_key supplier
day_of_the_week Sales Fact Table item_name supplier_key
month brand supplier_type
quarter time_key type
year supplier_key
item_key
branch_key
branch
location_key
branch_key location
branch_name units_sold location_key
branch_type street city
dollars_sold
city_key city_key
avg_sales city
Measures state
country

Example of Fact Constellation

time Shipping Fact Table


time_key item
time_key
day item_key
day_of_the_week Sales Fact Table item_name item_key
month brand
quarter time_key shipper_key
type
year supplier_type from_location
item_key
branch_key to_location
branch dollars_cost
location_key location
branch_key
branch_name units_sold location_key units_shipped
branch_type street
dollars_sold city shipper
avg_sales state shipper_key
country shipper_name
Measures location_key
shipper_type

9
11/16/2017

Cube Definition in DMQL

 Cube Definition (Fact Table)


 define cube <cube_name> [<dimension_list>]: <measure_list>

 Dimension Definition (Dimension Table)


 define dimension <dimension_name> as (<attribute_or_dimension_list>)

 Special Case (Shared Dimension Table)


 define dimension <dimension_name> as <dimension_name_first>
in cube <cube_name_first>

Star Schema Definition in DMQL

 Example
 define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars),
avg_sales = avg(sales_in_dollars),
units_sold = count(*)
 define dimension time as (time_key, day, day_of_week, month, quarter,
year)
 define dimension item as (item_key, item_name, brand, type,
supplier_type)
 define dimension branch as (branch_key, branch_name, branch_type)
 define dimension location as (location_key, street, city, state, country)

10
11/16/2017

Snowflake Schema Definition in DMQL

 Example
 define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars),
avg_sales = avg(sales_in_dollars),
units_sold = count(*)
 define dimension time as (time_key, day, day_of_week, month, quarter,
year)
 define dimension item as (item_key, item_name, brand, type,
supplier(supplier_key, supplier_type))
 define dimension branch as (branch_key, branch_name, branch_type)
 define dimension location as (location_key, street, city(city_key,
province_or_state, country))

Fact Constellation Schema Definition in DMQL

 Example
 define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
 define dimension time as (time_key, day, day_of_week, month, quarter, year)
 define dimension item as (item_key, item_name, brand, type, supplier_type)
 define dimension branch as (branch_key, branch_name, branch_type)
 define dimension location as (location_key, street, city, state, country)
 define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
 define dimension time as time in cube sales
 define dimension item as item in cube sales
 define dimension shipper as (shipper_key, shipper_name,
location as location in cube sales, shipper_type)
 define dimension from_location as location in cube sales
 define dimension to_location as location in cube sales

11
11/16/2017

Measures

 Distributive
 If the result derived by applying the function to n aggregate values is the
same as that derived by applying the function on all the data without
partitioning
 e.g., count(), sum(), min(), max()

 Algebraic
 If it can be computed by an algebraic function with m arguments, each of
which is obtained by applying a distributive aggregate function
 e.g., avg(), standard_deviation()

 Holistic
 If there is no constant bound on the storage size needed to describe a
subaggregate
 e.g., median(), mode(), rank()

Concept Hierarchy

 Schema Hierarchy
Year
 e.g., street < city < state < country
Quarter
 e.g., day < { month < quarter ; week } < year
| Week
Month

Day
 Set-group hierarchy
 e.g., { (0..100] ; (100..200] } < (0..200]
 e.g., { (0..10] < lowPrice ; { (10..100] ; (100..200] } < highPrice }
< allProducts
allProducts
(0..200]
lowPrice highPrice
(0..100] (100..200]
(0..10] (10..100] (100..200]

12
11/16/2017

Three Components of Data Cube

 Example
 Measures: data values as a function of products, locations and time
 Dimensions:
 Hierarchies:

product location time

Year
Company Country
| |
locations

Quarter
Category City
| Week
| |
Month
Product Office
Day

time

Example of Cuboid Cells

time dimension

all location dimension


product dimension 0-D (apex) cuboid

product quarter country


1-D cuboids
product, country

product, quarter
2-D cuboids
quarter, country

3-D (base) cuboid


product, quarter, country

13
11/16/2017

Example of Data Cube

Quarter Total annual sales


2Qtr of TV in U.S.A.
1Qtr 3Qtr 4Qtr sum
TV
PC U.S.A
VCR

Country
sum
Canada

Mexico

sum

OLAP Operations

 Roll-up (drill-up)
 Summarizes (aggregates) data
 by climbing up hierarchy or by dimension reduction

 Drill-down (roll-down)
 Reverse of roll-up
 by stepping down to lower-level data or introducing new dimensions

 Slice
 Selecting data on one dimension

 Dice
 Selecting data on multi-dimensions

 Pivot (rotate)
 Reorienting the cube, or transforming 3-D data to a series of 2-D spaces

14
11/16/2017

Roll-UP & Drill-Down

city
location
country

quarter
quarter
time

country month

Slice, Dice & Pivot

laptop
pc
supercom
country

canada
laptop
pc USA
mexico
location

Q1 Q2
country

canada quarter
USA
Q1 Q2 Q3 Q4
quarter
time
mexico
country

canada

USA
Q1 Q2 Q3 Q4
quarter

15
11/16/2017

Starnet Query Model

Orders Customer
Shipping Method

contracts
air customerID

order
truck
item
Time Product
annually quarterly daily category
sales-person
city
country
division
region
division
Location
Each circle is
called a footprint Organization
Promotion

CSI 4352, Introduction to Data Mining

Chapter 4, Data Warehouse & OLAP Operations

 Basic Concept of Data Warehouse

 Data Warehouse Modeling

 Data Warehouse Architecture

 Data Warehouse Implementation

 From Data Warehousing to Data Mining

16
11/16/2017

Data Warehouse Design

 Top-Down View
 Allows the selection of the relevant information necessary for the data
warehouse

 Data Source View


 Exposes the information being captured, stored, and managed by
operational systems

 Data Warehouse View


 Consists of fact tables and dimension tables

 Business Query View


 Shows the perspectives of data in the warehouse to end-users

Data Warehouse Design Process

 Categories by Process Direction


 Top-down: Starts with overall design and planning (mature)
 Bottom-up: Starts with experiments and prototypes (rapid)

 Categories by Software Engineering View


 Waterfall: structured, systematic analysis at each step before proceeding
to the next
 Spiral: rapid generation of functional systems, short turn around time

 Typical Data Warehouse Design Process


 Choose business processes for modeling, e.g., orders, invoices, etc
 Choose the grain (atomic level of data) of the business processes
 Choose the dimensions that will apply to each fact table
 Choose the measures that will populate each fact table

17
11/16/2017

Data Warehouse Architecture

Monitor & OLAP Server


Operational Metadata Integrator
DBs

Extract Query
Transform
Data Serve Analysis
Load
Refresh Warehouse
Reports
Other
sources

Data Marts

Data Sources Data Storage OLAP Engine Front-End Tools

Three Data Warehouse Models

 Enterprise Warehouse
 A global view with all the information about subjects spanning the entire
organization

 Data Mart
 A subset of corporate-wide data that is of value to a specific group of
users
 Its scope is confined to specific, selected groups
 Independent vs. dependent (directly from warehouse) data mart

 Virtual Warehouse
 A set of views over operational databases
 Only some of possible summary views may be materialized

18
11/16/2017

Development of Data Warehouse

Multi-Tier
Data Warehouse

Distributed
Data Marts

Enterprise

Data Data Data


Mart Mart Warehouse

Model Refinement Model Refinement

Define a high-level corporate data model

Utilities of Back-End Tools

 Data Extraction
 Get data from multiple, heterogeneous, and external sources

 Data Cleaning
 Detect errors in the data and rectify them when possible

 Data Transformation
 Convert data from the original format to the warehouse format

 Loading
 Sort, summarize, consolidate, compute views, check integrity, and
build indices and partitions

 Refresh
 Propagate the updates from data sources to the warehouse

19
11/16/2017

OLAP Server Architecture

 Relational OLAP (ROLAP)


 Uses relational or extended-relational DBMS to store and manage
warehouse data and OLAP middle ware
 Includes optimization of DBMS back-end, implementation of aggregation
navigation logic, and additional tools and services
 High scalability

 Multidimensional OLAP (MOLAP)


 Sparse array-based multidimensional storage engine
 Fast indexing to pre-computed summarized data

 Hybrid OLAP (HOLAP)


 Low-level: relational / high-level: array
 High flexibility

Metadata Repository

 Definition of Metadata
 The data defining data warehouse objects

 Examples
 Description of the structure of the data warehouse, e.g., schema, view,
dimensions, hierarchies, data definitions, data mart locations and contents
 Operational meta-data, e.g., history of migrated data, currency of data,
warehouse usage statistics, error reports
 Algorithms used for summarization
 Mapping from operational environment to data warehouse
 Data related to system performance
 Business data, e.g., business terms and definitions, ownership of data,
charging policies

20
11/16/2017

CSI 4352, Introduction to Data Mining

Chapter 4, Data Warehouse & OLAP Operations

 Basic Concept of Data Warehouse

 Data Warehouse Modeling

 Data Warehouse Architecture

 Data Warehouse Implementation

 From Data Warehousing to Data Mining

Data Cube Computation

 View as a Lattice of Cuboids


all
 How many cuboids in an n-dimensional cube? product time location
2n
product, location
 How many cuboid cells in an n-dimensional cube
with Li levels? product, time,
n time location
 ( L i  1)
i1 product, time, location

 Materialization of Data Cube


 Full materialization (all cuboids), Partial materialization (some cuboids),
No materialization (only base cuboid)
 Selection of cuboids to materialize
• Based on the size, sharing, access frequency, etc.

21
11/16/2017

Cube Operation

 Cube Definition and Computation in DMQL


 define cube sales [item, city, year]: sum (sales_in_dollars)
 compute cube sales

 Cube Definition and Computation in SQL


 select item, city, year, SUM (amount)
 from sales
 cube by item, city, year

 Internal Operations
 group by (item, city, year)
 group by (item, city), (item, year), (city, year)
 group by (item), (city), (year)
 group by ()

Iceberg Cube

 Iceberg Cube Computation


 Computing only the cuboid cells whose count or
other aggregates satisfying the condition like
HAVING COUNT(*) >= min_sup

 Motivation
 Only a small portion of cube cells may be “above the water’’
in a sparse cube
 Only calculate “interesting” cells — data above certain threshold
 Avoid explosive growth of the cube

22
11/16/2017

CSI 4352, Introduction to Data Mining

Chapter 4, Data Warehouse & OLAP Operations

 Basic Concept of Data Warehouse

 Data Warehouse Modeling

 Data Warehouse Architecture

 Data Warehouse Implementation

 From Data Warehousing to Data Mining

Data Warehouse Usage

 Information Processing
 Supports querying, basic statistical analysis, and reporting using
crosstabs, tables, charts and graphs

 Analytical Processing
 Supports OLAP operations in multi-dimensional space

 Data Mining
 Supports pattern discovery from warehouse data, and presenting
the mining results using visualization tools

23
11/16/2017

From OLAP To OLAM

 On-Line Analytical Mining (OLAM)


 ( OLAP + Data Mining ) in data warehouse

 Why OLAM ?
 High quality data
• Data warehouse contains integrated, consistent and cleaned data
 Information processing infrastructure
• ODBC/OLE DB connections, web accessing, service facilities
 OLAP-based exploratory data analysis
• Mining with drilling, dicing, pivoting, etc.
 On-line selection of data mining functions
• Integration and swapping of multiple data mining functions

OLAM System Architecture

mining query mining result


Layer4
User GUI API
User Interface

Layer3
OLAM OLAP
Engine Engine OLAP/OLAM

Data Cube API

Layer2
Data
Meta Data MDDB
Warehouse

Database API data cleaning data integration


Layer1

Databases Data Repository

24
11/16/2017

Questions?

 References
 Gray, J., et al., “Data Cube: A Relational Aggregation Operator
Generalizing Group-By, Cross-Tab and Sub-Totals”, Data Mining and
Knowledge Discovery, Vol. 1 (1997)

 Lecture Slides are found on the Course Website,


www.ecs.baylor.edu/faculty/cho/4352

25

You might also like