Note2 3

CSCI6405 Fall 2003
Data Mining and Data Warehousing

Instructor: Qigang Gao, Office: CS219,
Tel:494-3356, Email: qggao@cs.dal.ca
Teaching Assistant: Christopher Jordan,
Email: cjordan@cs.dal.ca
Office Hours: TR, 1:30 - 3:00 PM
7 October 2003
Lectures Outline
Part I: Overview of DM and DW
1. Introduction (ch1) Ass1 Due: Sep 23 Tue
2. Data preprocessing (ch3)
Part II: DW and OLAP
3. Data warehousing and OLAP (Ch2) Ass2: Sep 23 – Oct 14
Part III: Data Mining Methods/Algorithms
4. Data mining primitives (ch4)
5. Classification data mining (ch7) Ass3: Oct 7 – Oct 21
6. Association data mining (ch6) Ass4: Oct 21 – Nov 5
7. Characterization data mining (ch5)
8. Clustering data mining (ch8)
Part IV: Mining Complex Types of Data
9. Mining the Web (Ch9)
10. Mining spatial data (Ch9)
Project Presentations
Project Due: Dec 8
Reservation of the LCD Lab:
Wed: 8:30 am – 2:00 pm

Sat: 12:00 pm - 6:00 pm
Sun: 12:00 pm – 6:00 pm
2. DATA WAREHOUSING AND OLAP
(Ch2)
Objectives of DW/OLAP
What is a DW?
Multidimensional Data Model
DW Schemas
Aggregations
OLAP Operations
DW Architecture
From data warehousing to data mining
How to define a DW schema: the data mining query language DMQL
Cube Definition (Fact Table)
define cube <cube_name> [<dimension_list>]: <measure_list>
Dimension Definition (Dimension Table)
define dimension <dimension_name> as (<attribute_or_subdimension_list>)
Special Case (Shared Dimension Tables)
First time as “cube definition”
define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>
Example of Star Schema
[Figure: star schema. A central Sales Fact Table is linked to four dimension tables.]
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, state_or_province, country
Defining a Star Schema in DMQL
define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month,
quarter, year)
define dimension item as (item_key, item_name, brand, type,
supplier_type)
define dimension branch as (branch_key, branch_name,
branch_type)
define dimension location as (location_key, street, city,
province_or_state, country)
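As a rough illustration of what the DMQL star schema above describes, the following sketch builds the same tables in an in-memory SQLite database and computes the three measures grouped by city. The sample rows, the choice of SQLite, and the mapping of DMQL measures onto SQL aggregates are assumptions made for illustration; they are not part of DMQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time (time_key INTEGER PRIMARY KEY, day INTEGER, day_of_week TEXT,
                   month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                   type TEXT, supplier_type TEXT);
CREATE TABLE branch (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                       province_or_state TEXT, country TEXT);
-- Fact table: one foreign key per dimension plus the raw sales amount.
CREATE TABLE sales (time_key INTEGER REFERENCES time(time_key),
                    item_key INTEGER REFERENCES item(item_key),
                    branch_key INTEGER REFERENCES branch(branch_key),
                    location_key INTEGER REFERENCES location(location_key),
                    sales_in_dollars REAL);
-- Made-up sample rows (assumption, for illustration only).
INSERT INTO time VALUES (1, 7, 'Tue', 10, 'Q4', 2003);
INSERT INTO item VALUES (1, 'Laptop', 'Acme', 'Computer', 'wholesale');
INSERT INTO branch VALUES (1, 'Downtown', 'retail');
INSERT INTO location VALUES (1, '1 Main St', 'Halifax', 'NS', 'Canada'),
                            (2, '2 Oak St', 'Vancouver', 'BC', 'Canada');
INSERT INTO sales VALUES (1, 1, 1, 1, 100.0), (1, 1, 1, 1, 50.0), (1, 1, 1, 2, 200.0);
""")

# The three measures of the cube, evaluated for the cuboid grouped by city.
rows = conn.execute("""
SELECT l.city,
       SUM(s.sales_in_dollars) AS dollars_sold,
       AVG(s.sales_in_dollars) AS avg_sales,
       COUNT(*)                AS units_sold
FROM sales s JOIN location l ON s.location_key = l.location_key
GROUP BY l.city
ORDER BY l.city
""").fetchall()

for row in rows:
    print(row)
```

In a star schema every dimension is one join away from the fact table, which is why the query needs only a single JOIN to group by any location attribute.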
Example of Snowflake Schema
[Figure: snowflake schema. The item and location dimensions are normalized into additional tables.]
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, state_or_province, country
Defining a Snowflake Schema in DMQL
define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month,
quarter, year)
define dimension item as (item_key, item_name, brand, type,
supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name,
branch_type)
define dimension location as (location_key, street, city(city_key,
province_or_state, country))
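The practical effect of the snowflake form is an extra join whenever a query reaches into a normalized sub-dimension. The sketch below shows this for the location and city tables only, using an in-memory SQLite database; the sample rows and the reduced set of tables are assumptions made to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The location dimension is split: city attributes live in their own table.
CREATE TABLE city     (city_key INTEGER PRIMARY KEY, city TEXT,
                       province_or_state TEXT, country TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT,
                       city_key INTEGER REFERENCES city(city_key));
CREATE TABLE sales    (location_key INTEGER REFERENCES location(location_key),
                       sales_in_dollars REAL);
-- Made-up sample rows (assumption, for illustration only).
INSERT INTO city VALUES (1, 'Halifax', 'NS', 'Canada'), (2, 'Seattle', 'WA', 'USA');
INSERT INTO location VALUES (10, '1 Main St', 1), (11, '2 Oak St', 1), (12, '3 Pine St', 2);
INSERT INTO sales VALUES (10, 100.0), (11, 50.0), (12, 200.0);
""")

# Grouping by country now takes two joins: sales -> location -> city.
dollars_by_country = dict(conn.execute("""
SELECT c.country, SUM(s.sales_in_dollars)
FROM sales s
JOIN location l ON s.location_key = l.location_key
JOIN city c     ON l.city_key = c.city_key
GROUP BY c.country
""").fetchall())

print(dollars_by_country)
```

The normalized form stores each city's province and country once instead of once per street address, at the cost of the additional join.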
Example of Fact Constellation
[Figure: fact constellation. Two fact tables, Sales and Shipping, share the time, item, and location dimension tables.]
Sales Fact Table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
shipper: shipper_key, shipper_name, location_key, shipper_type
Defining a Fact Constellation in DMQL
define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state,
country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in
cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
How are hierarchical data materialized in a data warehouse?
Aggregations
- To measure a business event
What do I want to look at? What am I trying to compare?
* define a grouping (i.e., determine a cuboid of the data cube),
* measure the fact about the event (i.e., the cuboid):
retrieve a pre-calculated value, or invoke an aggregate function
* OLAP query: dimension-value pairs.
E.g., dimensions: <time="Q1", location="Vancouver", item="Computer">
value (measured): sales = sum(the data set).
A measure value is computed for a defined cuboid by aggregating the
data corresponding to the respective dimension-value pairs defining the
given event.
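The idea of an OLAP query as a set of dimension-value pairs can be sketched in a few lines of Python. The fact rows and the helper name `measure` are hypothetical; the sketch always aggregates on the fly, whereas a real warehouse might retrieve a pre-calculated value instead.

```python
# Hypothetical fact rows: three dimension values and one sales amount each.
facts = [
    {"time": "Q1", "location": "Vancouver", "item": "Computer", "sales": 1200.0},
    {"time": "Q1", "location": "Vancouver", "item": "Computer", "sales": 800.0},
    {"time": "Q1", "location": "Toronto",   "item": "Computer", "sales": 500.0},
    {"time": "Q2", "location": "Vancouver", "item": "Computer", "sales": 300.0},
    {"time": "Q1", "location": "Vancouver", "item": "Phone",    "sales": 150.0},
]

def measure(facts, aggregate=sum, **dimension_values):
    """Aggregate the sales of all rows matching the given dimension-value pairs."""
    selected = [
        row["sales"]
        for row in facts
        if all(row[dim] == value for dim, value in dimension_values.items())
    ]
    return aggregate(selected)

# <time="Q1", location="Vancouver", item="Computer"> -> sales = sum(the data set)
q1_vancouver_computers = measure(facts, time="Q1", location="Vancouver", item="Computer")
# Fewer pairs define a coarser cuboid: all Q1 sales.
q1_total = measure(facts, time="Q1")

print(q1_vancouver_computers, q1_total)
```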
Measures: Three Categories
* Distributive functions: An aggregate function is distributive if, when a set is divided into n
subsets and the function is applied to each subset, the result computed on the whole set
equals the result obtained by combining the n subset results.
E.g., count(), sum(), min(), max().
* Algebraic functions: An aggregate function is algebraic if it can be calculated by an
algebraic function with M arguments, each of which is obtained by a distributive aggregate function.
E.g., avg() = sum() / count(), standard_deviation(), ...
* Holistic functions: An aggregate function is holistic if it characterizes set element(s)
relative to the other elements of the set and cannot be computed by an algebraic calculation.
E.g., rank(), median(), ...
Distributive and algebraic aggregate functions are the most frequently used and can be
calculated efficiently. In contrast, holistic aggregate functions cannot in general be calculated
efficiently, so they are generally not used in data warehouses.
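The difference between the three categories shows up as soon as a data set is split into partitions. The following sketch uses made-up numbers: sum and count combine exactly from partial results, avg follows from them algebraically, and median does not survive the same treatment.

```python
from statistics import median

data = [3, 1, 4, 1, 5, 9, 2, 6, 5]
partitions = [data[:3], data[3:6], data[6:]]  # an arbitrary split into n = 3 subsets

# Distributive: combining the per-partition results gives the whole-set result.
total = sum(sum(p) for p in partitions)
count = sum(len(p) for p in partitions)  # partial counts are combined by summing

# Algebraic: avg() is computed from two distributive results.
average = total / count

# Holistic: the median of the partition medians is generally NOT the median.
median_of_medians = median(median(p) for p in partitions)
true_median = median(data)

print(total, count, average)           # same values as computing on `data` directly
print(median_of_medians, true_median)  # these two differ for this data
```

This is why a warehouse can store partial sums and counts per cuboid and reuse them, while an exact median needs access to the underlying values.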
Pre-aggregation vs. On-line aggregation
Pre-aggregation: all needed calculations are done by batch process.
On-line aggregation: the aggregating computation is on-line.

The main issue is that the data volume to be aggregated is normally very large.
On-line aggregation results in real time aggravation.
The manager's rule of thumb:

- An average aggregation query should get a response from the data warehousing
system in 20 seconds or less.
Efficient Data Cube Computation
Data cube can be viewed as a lattice of cuboids
The bottom-most cuboid is the base cuboid
The top-most cuboid (apex) contains only one cell
How many cuboids are there in an n-dimensional cube where dimension i has L_i levels?
$$T = \prod_{i=1}^{n} (L_i + 1)$$
E.g., if the cube has 10 dimensions and 4 levels for each dimension:
$T = 5^{10} \approx 9.8 \times 10^{6}$.
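The cuboid count is easy to check numerically. In this sketch the function name `total_cuboids` is hypothetical, and each entry of the input list is the number of levels of one dimension; the "+ 1" accounts for the virtual top level "all".

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """Number of cuboids: the product of (L_i + 1) over all dimensions."""
    return prod(levels + 1 for levels in levels_per_dimension)

# The slide's example: 10 dimensions with 4 levels each.
example = total_cuboids([4] * 10)
print(example)  # 5 ** 10

# Four dimensions with a single level each gives the 2 ** 4 cuboids
# of a plain 4-D lattice.
print(total_cuboids([1, 1, 1, 1]))
```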
Materialization of data cube
Materialize every (cuboid) (full materialization), none (no materialization),
or some (partial materialization)
Selection of which cuboids to materialize
Based on size, sharing, access frequency, etc.
Cube: A Lattice of Cuboids
[Figure: lattice of cuboids for the dimensions time, item, location, supplier.]
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
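The lattice for these four dimensions can be enumerated directly: each cuboid corresponds to one subset of the dimensions. This is a minimal sketch that ignores concept hierarchies within a dimension; the function name `cuboid_lattice` is hypothetical.

```python
from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

def cuboid_lattice(dims):
    """Return {k: [cuboids with k dimensions]} for k = 0..len(dims)."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}

lattice = cuboid_lattice(dimensions)
for k, cuboids in lattice.items():
    # The empty combination is the apex cuboid, conventionally called "all".
    print(f"{k}-D:", [",".join(c) or "all" for c in cuboids])

total = sum(len(cuboids) for cuboids in lattice.values())
print("total cuboids:", total)
```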
OLAP Operations
Roll up (drill-up): summarize data
by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up
from higher level summary to lower level summary or detailed data, or
introducing new dimensions
Slice and dice:
project and select
Pivot (rotate):
reorient the cube, visualization, 3D to series of 2D planes.
Other operations
drill across: involving (across) more than one fact table, etc
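The operations above can be sketched on a tiny cube held in a Python dictionary. The cell values, the city-to-country hierarchy, and the helper `roll_up` are assumptions for illustration; drill-down is simply the return from a rolled-up result to the finer cube it was computed from.

```python
from collections import defaultdict

# Base cuboid: (quarter, city, item) -> dollars_sold. Values are made up.
cube = {
    ("Q1", "Vancouver", "Computer"): 100,
    ("Q1", "Toronto",   "Computer"): 150,
    ("Q1", "Seattle",   "Phone"):     80,
    ("Q2", "Vancouver", "Phone"):     70,
    ("Q2", "Seattle",   "Computer"): 120,
}
country_of = {"Vancouver": "Canada", "Toronto": "Canada", "Seattle": "USA"}

def roll_up(cube, key_fn):
    """Re-key every cell with key_fn and sum the cells that collide."""
    out = defaultdict(int)
    for key, value in cube.items():
        out[key_fn(key)] += value
    return dict(out)

# Roll-up by climbing the location hierarchy: city -> country.
by_country = roll_up(cube, lambda k: (k[0], country_of[k[1]], k[2]))
# Roll-up by dimension reduction: drop the location dimension.
by_quarter_item = roll_up(cube, lambda k: (k[0], k[2]))

# Slice: select on one dimension (time = "Q1").
slice_q1 = {k: v for k, v in cube.items() if k[0] == "Q1"}
# Dice: select on two or more dimensions.
dice = {k: v for k, v in cube.items()
        if k[1] in {"Vancouver", "Toronto"} and k[2] == "Computer"}

# Pivot: reorient the axes, here (quarter, city, item) -> (item, city, quarter).
pivoted = {(k[2], k[1], k[0]): v for k, v in cube.items()}

print(by_country)
print(by_quarter_item)
```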
A Star-Net Query Model
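[Figure: star-net query model. Radial lines from a central point represent the dimensions: Customer Orders, Shipping Method, Customer, Time, Product, Location, Promotion, Organization. Abstraction levels marked along the lines include CONTRACTS, ORDER, AIR-EXPRESS, TRUCK, ANNUALLY, QTRLY, DAILY, PRODUCT LINE, PRODUCT ITEM, PRODUCT GROUP, CITY, COUNTRY, REGION, SALES PERSON, DISTRICT, DIVISION.]
Each circle is called a footprint.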
Example of data warehousing using MS SQL Server 2000
Drill down to see product categories.
Drill down to see product “Clams” sales information
DW Development Procedure
1. Choose a business process to model, understand the complexity of data, determine a
data schema to use, etc.
2. Decide subject(s), choose the measures that will populate each fact table record.
3. Choose fact table: the grain and measures of the subject:
The fundamental, atomic level of data to be represented in the fact table,
such as daily or weekly sales, etc.
4. Choose the dimensions that will apply to each fact table record.
Data Warehouse Development: An Incremental Approach
[Figure: starting from "Define a high-level corporate data model", development proceeds through model refinement to data marts and an enterprise data warehouse, then to distributed data marts, and finally to a multi-tier data warehouse.]
Data Warehouse Architecture
The architecture of data:
Abstraction level
Business rules
|
Metadata
|
Schema
|
Summary data
|
Operational data
The abstraction hierarchy of data and its description helps users navigate around a data
warehouse. As data gets more abstract, it generally gets less voluminous.
The architecture of data (cont)
- Operational data: who, what, where, and when

- Summary data: summaries by who, what, where, and when
- Schema: physical layout of the data, tables, fields, indexes, types
- Metadata: logical model and mappings to physical layout and sources
(by defining the data in business terms)
- Business rules: what's been learned from the data
Multi-tier architecture:
• Client site: The end user can query and visualize data on the local computer or
connect to a display server that has access to the DW.
• Middle server: Logically, OLAP engines present users with multidimensional
data from DWs or data marts. However, physical architecture and implementation
issues must be considered for OLAP engines.
• DW server: The data warehouse is generated from relational or operational databases
through gateways for extraction and integration of multiple data sources, such as ODBC (Open
Database Connectivity), OLE DB (Object Linking and Embedding, Database), and
JDBC (Java Database Connectivity).
Multi-Tiered Architecture
[Figure: multi-tiered architecture, from left to right.]
Data Sources: operational DBs and other sources
(Extract, Transform, Load, Refresh)
Data Storage: data warehouse and data marts, with metadata and a monitor & integrator
(Serve)
OLAP Engine: OLAP server
Front-End Tools: analysis, query, reports, data mining
Data Warehouse Back-End Tools and Utilities
Data extraction:
get data from multiple, heterogeneous, and external sources
Data cleaning:
detect errors in the data and rectify them when possible
Data transformation:
convert data from legacy or host format to warehouse format
Load:
sort, summarize, consolidate, compute views, check integrity, and
build indices and partitions
Refresh
propagate the updates from the data sources to the warehouse
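The back-end steps above can be pictured as a small pipeline. This is only a toy sketch: the two sources, the field names, and the cleaning rule (drop rows whose amount is not numeric) are assumptions, and real tools would also try to rectify bad rows and would handle refresh incrementally.

```python
from collections import defaultdict

# Extraction: rows as they might arrive from two hypothetical sources.
source_a = [{"city": " halifax ", "amount": "100.50"},
            {"city": "Vancouver", "amount": "n/a"}]
source_b = [{"city": "HALIFAX", "amount": "49.50"},
            {"city": "vancouver", "amount": "200"}]

def clean(rows):
    """Cleaning: keep rows whose amount parses as a number, drop the rest."""
    good = []
    for row in rows:
        try:
            float(row["amount"])
        except ValueError:
            continue
        good.append(row)
    return good

def transform(rows):
    """Transformation: convert source format to (city, amount) warehouse rows."""
    return [(row["city"].strip().title(), float(row["amount"])) for row in rows]

def load(rows):
    """Load: sort the rows and summarize them into a total per city."""
    totals = defaultdict(float)
    for city, amount in sorted(rows):
        totals[city] += amount
    return dict(totals)

warehouse = load(transform(clean(source_a + source_b)))
print(warehouse)
```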
OLAP Server Architectures
Multidimensional OLAP (MOLAP)

Implemented as a large multidimensional array
Fast indexing to pre-computed summarized data (with built-in indexing)
Not proven to scale effectively to large, high-dimensionality data sets
Relational OLAP (ROLAP)
Implemented as a collection of relational tables
Can be processed and queried with traditional RDBMS technology (i.e., indexes and
joins, etc.)
Greater scalability
No “built-in” indexing
E.g., the same data can be stored in a multidimensional array for MOLAP and in multiple tables
for ROLAP (see the distributed sheet).
Hybrid OLAP (HOLAP)
User flexibility, e.g., low level: relational, high-level: array
MS SQL Server 2000
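To make the MOLAP/ROLAP contrast concrete, the sketch below stores the same made-up sales figures twice: once as a dense two-dimensional array addressed by position, and once as a list of relational rows that must be searched. The names and values are hypothetical, and real servers add compression and indexing on top of both layouts.

```python
quarters = ["Q1", "Q2"]
cities = ["Halifax", "Vancouver"]

# MOLAP-style: a dense array with one cell per (quarter, city) combination.
# The array position itself acts as the index; empty cells still take space.
array = [
    [150.0, 200.0],  # Q1: Halifax, Vancouver
    [90.0,  0.0],    # Q2: Halifax, Vancouver (no sales)
]

def molap_lookup(quarter, city):
    return array[quarters.index(quarter)][cities.index(city)]

# ROLAP-style: relational rows; only combinations that occur are stored,
# and a lookup has to scan (or use a separately built index).
table = [
    ("Q1", "Halifax", 150.0),
    ("Q1", "Vancouver", 200.0),
    ("Q2", "Halifax", 90.0),
]

def rolap_lookup(quarter, city):
    return sum(value for q, c, value in table if q == quarter and c == city)

for q in quarters:
    for c in cities:
        print(q, c, molap_lookup(q, c), rolap_lookup(q, c))
```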
From On-Line Analytical Processing
to On-Line Analytical Mining (OLAM)
Why online analytical mining?
High quality of data in data warehouses
DW contains integrated, consistent, cleaned data
Available information processing structure surrounding data warehouses
ODBC (Open Database Connectivity), Web accessing, service
facilities, reporting and OLAP tools
OLAP-based exploratory data analysis
mining with drilling, dicing, pivoting, etc.
On-line selection of data mining functions
integration and swapping of multiple mining functions,
algorithms, and tasks.

Architecture of OLAM
An OLAM Architecture
[Figure: OLAM architecture in four layers, from top to bottom.]
Layer 4, User Interface: mining query in, mining result out
User GUI API
Layer 3, OLAP/OLAM: OLAM engine and OLAP engine
Data Cube API
Layer 2, MDDB: multidimensional database (MDDB) and metadata
Database API (filtering & integration, filtering)
Layer 1, Data Repository: databases and data warehouse (data cleaning, data integration)
Summary
Data warehouse
A subject-oriented, integrated, time-variant, and nonvolatile collection of
data in support of management’s decision-making process
A multi-dimensional model of a data warehouse
Multidimensional data model
Star schema, snowflake schema, fact constellations
A data cube consists of identifier dimensions & measure dimension
Concept hierarchies
OLAP operations: drilling, rolling, slicing, dicing and pivoting
OLAP servers: ROLAP, MOLAP, HOLAP
…