Professional Documents
Culture Documents
Data Mining Session 4 - Main Theme Data Warehousing and OLAP Dr. Jean-Claude Franchitti
Data Mining Session 4 - Main Theme Data Warehousing and OLAP Dr. Jean-Claude Franchitti
Agenda
11 Session
Session Overview
Overview
22 Data
Data Warehousing
Warehousing and
and OLAP
OLAP
33 Summary
Summary and
and Conclusion
Conclusion
2
What is the class about?
Textbooks:
» Data Mining: Concepts and Techniques (2nd Edition)
Jiawei Han, Micheline Kamber
Morgan Kaufmann
ISBN-10: 1-55860-901-6, ISBN-13: 978-1-55860-901-3, (2006)
» Microsoft SQL Server 2008 Analysis Services Step by Step
Scott Cameron
Microsoft Press
ISBN-10: 0-73562-620-0, ISBN-13: 978-0-73562-620-31 1st Edition (04/15/09)
Session Agenda
4
Icons / Metaphors
Information
Common Realization
Knowledge/Competency Pattern
Governance
Alignment
Solution Approach
55
Agenda
11 Session
Session Overview
Overview
22 Data
Data Warehousing
Warehousing and
and OLAP
OLAP
33 Summary
Summary and
and Conclusion
Conclusion
6
Data Warehousing and OLAP - Sub-Topics
8
Data Warehouse—Subject-Oriented
Data Warehouse—Integrated
10
Data Warehouse—Time Variant
11
Data Warehouse—Nonvolatile
12
Data Warehouse vs. Heterogeneous DBMS
OLTP OLAP
users clerk, IT professional knowledge worker
function day to day operations decision support
DB design application-oriented subject-oriented
data current, up-to-date historical,
detailed, flat relational summarized, multidimensional
isolated integrated, consolidated
usage repetitive ad-hoc
access read/write lots of scans
index/hash on prim. key
unit of work short, simple transaction complex query
# records accessed tens millions
#users thousands hundreds
DB size 100MB-GB 100GB-TB
metric transaction throughput query throughput, response
15
17
all
0-D(apex) cuboid
time,location,supplier
3-D cuboids
time,item,location
time,item,supplier item,location,supplier
4-D(base) cuboid
time, item, location, supplier
19
time
time_key item
day item_key
day_of_the_week Sales Fact Table item_name
month brand
quarter time_key type
year supplier_type
item_key
branch_key
branch location
location_key
branch_key location_key
branch_name units_sold street
branch_type city
dollars_sold state_or_province
country
avg_sales
Measures
21
time
time_key item
day item_key supplier
day_of_the_week Sales Fact Table item_name supplier_key
month brand supplier_type
quarter time_key type
year item_key supplier_key
branch_key
branch location
location_key
location_key
branch_key
units_sold street
branch_name
city_key
branch_type dollars_sold city
city_key
avg_sales city
state_or_province
Measures country
22
Example of Fact Constellation
time
time_key item Shipping Fact Table
day item_key
day_of_the_week Sales Fact Table item_name time_key
month brand
quarter time_key type item_key
year supplier_type shipper_key
item_key
branch_key from_location
24
Defining Star Schema in DMQL
26
Defining Fact Constellation in DMQL
28
A Concept Hierarchy: Dimension (location)
all all
29
Specification of
hierarchies
Schema hierarchy
day < {month < quarter;
week} < year
Set_grouping
hierarchy
{1..10} < inexpensive 30
Multidimensional Data
Office Day
Month
31
TV
od
PC U.S.A
Pr
VCR
Country
sum
Canada
Mexico
sum
32
Cuboids Corresponding to the Cube
all
0-D(apex) cuboid
product date country
1-D cuboids
3-D(base) cuboid
product, date, country
33
Visualization
OLAP capabilities
Interactive
manipulation
34
Typical OLAP Operations (1/2)
35
Q1 605 at
loc Q1 1000
time
time (quarters)
Q2 Q2
computer Q3
home
entertainment Q4
item (types)
computer security
home phone
dice for entertainment
(location = “Toronto” or “Vancouver”) item (types)
and (time = “Q1” or “Q2”) and
(item = “home entertainment” or “computer”)
roll-up
on location
(from cities
to countries)
s)
t ie
(ci New YorkChicago 440
on 1560
at i Toronto 395
loc Vancouver
Q1 605 825 14 400
time (quarters)
Q2
Q3
Q4
slice
computer security
for time = “Q1”
home phone
entertainment drill-down
item (types) on time
(from quarters
Chicago to months)
location (cities)
New York
Toronto
May
June
July
home
entertainment 605 August
item (types)
September
computer 825
October
phone 14 November
December
security 400
computer security
New York Vancouver home phone
Chicago Toronto entertainment
location (cities) item (types) 36
A Star-Net Query Model
Customer Orders
Shipping Method
Customer
CONTRACTS
AIR-EXPRESS
ORDER
TRUCK
PRODUCT LINE
Time Product
ANNUALY QTRLY DAILY PRODUCT ITEM PRODUCT GROUP
CITY
SALES PERSON
COUNTRY
DISTRICT
REGION
DIVISION
Location Each circle is
called a footprint Promotion Organization
37
38
Design of Data Warehouse: A Business Analysis Framework
Monitor
& OLAP Server
Other Metadata
sources Integrator
Analysis
Operational Extract Query
DBs Transform Data Serve Reports
Load
Refresh
Warehouse Data mining
Data Marts
Enterprise warehouse
» collects all of the information about subjects spanning
the entire organization
Data Mart
» a subset of corporate-wide data that is of value to a
specific groups of users. Its scope is confined to
specific, selected groups, such as marketing data mart
• Independent vs. dependent (directly from warehouse) data mart
Virtual warehouse
» A set of views over operational databases
» Only some of the possible summary views may be
materialized
42
Data Warehouse Development: A Recommended Approach
Multi-Tier Data
Warehouse
Distributed
Data Marts
Enterprise
Data Data
Data
Mart Mart
Warehouse
Data extraction
» get data from multiple, heterogeneous, and external
sources
Data cleaning
» detect errors in the data and rectify them when possible
Data transformation
» convert data from legacy or host format to warehouse
format
Load
» sort, summarize, consolidate, compute views, check
integrity, and build indicies and partitions
Refresh
» propagate the updates from the data sources to the
warehouse
44
Metadata Repository
45
46
Data Warehousing and OLAP - Sub-Topics
47
48
Cube Operation
Iceberg Cube
Explore indexing structures and compressed vs. dense array structs in MOLAP
53
54
Data Warehouse Usage
55
56
An OLAM System Architecture
Mining query Mining result Layer4
User Interface
User GUI API
Layer3
OLAM OLAP
Engine Engine OLAP/OLAM
Layer2
MDDB
MDDB
Meta Data
58
What is Concept Description?
59
Data generalization
» A process which abstracts a large set of task-relevant
data in a database from a low conceptual levels to
higher ones.
1
2
3
4
Conceptual levels
5
» Approaches:
• Data cube approach(OLAP approach)
• Attribute-oriented induction approach
60
Attribute-Oriented Induction
61
62
Attribute-Oriented Induction: Basic Algorithm
Example
Birth_Region
Canada Foreign Total
Gender
M 16 14 30
F 10 22 32
Total 26 36 62
65
Generalized relation:
» Relations where some or all attributes are generalized, with counts
or other aggregation values accumulated.
Cross tabulation:
» Mapping results into cross tabulation form (similar to contingency
tables).
» Visualization techniques:
» Pie charts, bar charts, curves, cubes, and other visual forms.
Quantitative characteristic rules:
» Mapping generalized result into characteristic rules with quantitative
information associated with it, e.g.,
grad ( x) ∧ male ( x) ⇒
birth _ region ( x) ="Canada "[t :53%] ∨ birth _ region ( x) =" foreign"[t : 47%].
66
Mining Class Comparisons
67
Cj = target class
qa = a generalized tuple covers some tuples of
class
» but can also cover some tuples of contrasting class
d-weight
» range: [0, 1] count(q a ∈ C j )
d − weight = m
∑ count(q
i =1
a ∈ Ci )
68
Example: Quantitative Discriminant Rule
Count distribution between graduate and undergraduate students for a generalized tuple
69
Class Description
70
Example: Quantitative Description Rule
N_Am 120 17.65% 60% 560 82.35% 70% 680 100% 68%
Both_ 200 20% 100% 800 80% 100% 1000 100% 100%
regions
71
Similarity:
» Data generalization
» Presentation of data summarization at multiple levels of
abstraction
» Interactive drilling, pivoting, slicing and dicing
Differences:
» OLAP has systematic preprocessing, query independent,
and can drill down to rather low level
» AOI has automated desired level allocation, and may
perform dimension relevance analysis/ranking when
there are many relevant dimensions
72
Agenda
11 Session
Session Overview
Overview
22 Data
Data Warehousing
Warehousing and
and OLAP
OLAP
33 Summary
Summary and
and Conclusion
Conclusion
73
Summary
74
References (1/2)
References (2/2)
76
Assignments & Readings
Readings
» Chapter 3
Assignment #3
» TBA
77
78