Professional Documents
Culture Documents
Lecture 03
Lecture 03
TIES443
Lecture 3
Data Warehousing
Mykola Pechenizkiy
Course webpage: http://www.cs.jyu.fi/~mpechen/TIES443
November 3, 2006
Department of Mathematical Information Technology
University of Jyvskyl
TIES443: Introduction to DM
UNIVERSITY OF JYVSKYL
OLAP
12 Codds rules for OLAP
Main OLAP operations
New buzzwords
TIES443: Introduction to DM
UNIVERSITY OF JYVSKYL
Data Warehouse
A decision support DB that is maintained separately
from the organizations operational databases.
Why Separate Data Warehouse?
High performance for both systems
DBMS tuned for OLTP
access methods, indexing, concurrency control, recovery
UNIVERSITY OF JYVSKYL
Three-Tier Architecture
other
Monitor
&
Integrator
Metadata
sources
OLAP
Server
Analysis
Query/Reporting
Operational
DBs
Extract
Transform
Load
Refresh
Data
Warehouse
Serve
Data Mining
Data Marts
Data Sources
TIES443: Introduction to DM
Data Storage
Lecture 3: Data Warehousing
ROLAP
Server
UNIVERSITY OF JYVSKYL
Three-Tier Architecture
Warehouse database server
OLAP servers
Relational OLAP (ROLAP): extended relational DBMS that maps operations on
multidimensional data to standard relational operators
Multidimensional OLAP (MOLAP): special-purpose server that directly
implements multidimensional data and operations
Hybrid OLAP (HOLAP): user flexibility, e.g., low level: relational, high-level: array
Specialized SQL servers: specialized support for SQL queries over star/snowflake
schemas
Clients
Query and reporting tools
Analysis tools
Data mining tools
TIES443: Introduction to DM
UNIVERSITY OF JYVSKYL
Client
Client
Central
Data
Warehouse
Source
Source
Logical
Data
Warehouse
Centralized architecture
Source
Physical
Data
Warehouse
Source
Source
Federated architecture
TIES443: Introduction to DM
Source
Tiered architecture
6
UNIVERSITY OF JYVSKYL
TIES443: Introduction to DM
UNIVERSITY OF JYVSKYL
TIES443: Introduction to DM
UNIVERSITY OF JYVSKYL
Data Warehouse
A data warehouse is a
subject-oriented,
integrated,
time-varying,
non-volatile
collection of data that is used primarily in organizational
decision making
TIES443: Introduction to DM
UNIVERSITY OF JYVSKYL
Data WarehouseSubject-Oriented
Organized around major subjects, such as customer,
product, sales
Focusing on the modeling and analysis of data for
decision makers, not on daily operations or transaction
processing
Provide a simple and concise view around particular
subject issues by excluding data that are not useful in the
decision support process
TIES443: Introduction to DM
10
UNIVERSITY OF JYVSKYL
Data WarehouseIntegrated
Constructed by integrating multiple, heterogeneous
data sources
relational databases, flat files, on-line transaction records
TIES443: Introduction to DM
UNIVERSITY OF JYVSKYL
11
TIES443: Introduction to DM
12
UNIVERSITY OF JYVSKYL
Data WarehouseNon-Volatile
A physically separate store of data transformed from
the operational environment
Operational update of data does not occur in the data
warehouse environment
Does not require transaction processing, recovery, and
concurrency control mechanisms
Requires only two operations in data accessing:
initial loading of data and access of data
TIES443: Introduction to DM
UNIVERSITY OF JYVSKYL
13
Data warehouse
update-driven, high performance
Information from heterogeneous sources is integrated in advance and
stored in warehouses for direct query and analysis
TIES443: Introduction to DM
14
UNIVERSITY OF JYVSKYL
UNIVERSITY OF JYVSKYL
15
Star Schema
Snowflake Schema
TIES443: Introduction to DM
16
UNIVERSITY OF JYVSKYL
Product
Order No
ProductNO
Order Date
ProdName
Customer
Customer No
Customer Name
Fact Table
ProdDescr
OrderNO
Category
SalespersonID
CategoryDescription
UnitPrice
CustomerNO
Customer
Address
City
Salesperson
ProdNo
Date
DateKey
DateKey
CityName
Date
Quantity
SalespersonID
SalespersonName
City
Total Price
CityName
City
State
Quota
TIES443: Introduction to DM
Country
UNIVERSITY OF JYVSKYL
17
Star Schema
A single fact table and a single table for each dimension
Every fact points to one tuple in each of the dimensions
and has additional attributes
Does not capture hierarchies directly
Generated keys are used for performance and
maintenance reasons
Fact constellation: Multiple Fact tables that share many
dimension tables
Example: Projected expense and the actual expense may share
dimensional tables
TIES443: Introduction to DM
18
UNIVERSITY OF JYVSKYL
Some Terms
Relation, which relates the dimensions to the
measure of interest, is called the fact table (e.g. sale)
Information about dimensions can be represented as
a collection of relations called the dimension tables
(product, customer, store)
Each dimension can have a set of associated
attributes
For each dimension, the set of associated attributes
can be structured as a hierarchy
TIES443: Introduction to DM
19
UNIVERSITY OF JYVSKYL
all
region
country
city
North_America
Canada
Toronto
...
...
Mexico
Europe
Ireland
Dublin
Belfield
office
TIES443: Introduction to DM
...
...
...
...
France
Belfast
Blackrock
20
10
UNIVERSITY OF JYVSKYL
Order No
Order Date
Fact Table
Customer
OrderNO
Customer No
Customer Name
SalespersonID
Category
ProdName
CategoryName
ProdDescr
CategoryDescr
Category
Category
UnitPrice
CustomerNO
Customer
Address
City
Salesperson
ProdNo
Date
DateKey
DateKey
CityName
Date
SalespersonID
Quantity
Month
City
SalespersonName
Total Price
CityName
City
State
Quota
Country
TIES443: Introduction to DM
Month
Month
Year
Year
Year
State
StateName
Country
UNIVERSITY OF JYVSKYL
21
Time
time_key
day
day_of_the_week
month
quarter
year
Item
item_key
item_name
brand
type
supplier_key
Location_key
Branch
branch_key
branch_name
branch_type
Unit_sold
Euros_sold
Avg_sales
Measures
TIES443: Introduction to DM
Location
location_key
street
city
Province/street
country
Item_key
shipper_key
from_location
to_location
Euros_sold
unit_shipped
shipper
shipper_key
shipper_name
location_key
shipper_type
22
11
UNIVERSITY OF JYVSKYL
Fact relation
sale
Product Client
p1
c1
p2
c1
p1
c3
p2
c2
Amt
12
11
50
8
Fact relation
sale
Product Client
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2
TIES443: Introduction to DM
Date
1
1
1
1
2
2
c1
12
11
p1
p2
c2
c3
50
3-dimensional cube
Amt
12
11
50
8
44
4
day 2
day 1
c1
c2
c3
p1
44
4
p2 c1
c2
c3
p1
12
50
p2
11
8
UNIVERSITY OF JYVSKYL
23
Multidimensional Data
Sales volume as a function of product, month, and
region
Dimensions: Product, Location, Time
Re
gi
on
Year
Product
City
Office
Month
Week
Day
Month
TIES443: Introduction to DM
24
12
UNIVERSITY OF JYVSKYL
Definitions
an n-Dimensional base cube is called a base cuboid
The top most 0-D cuboid, which holds the highest-level of
summarization, is called the apex cuboid
The lattice of cuboids forms a data cube
TIES443: Introduction to DM
25
UNIVERSITY OF JYVSKYL
0-D(apex) cuboid
item
location
supplier
1-D cuboids
time,item
time,location
item,location
time,supplier
time,item,location
location,supplier
item,supplier
2-D cuboids
time,location,supplier
3-D cuboids
time,item,supplier
item,location,supplier
4-D(base) cuboid
time, item, location, supplier
TIES443: Introduction to DM
26
13
UNIVERSITY OF JYVSKYL
Pr
od
u
TV
PC
VCR
sum
1Qtr
2Qtr
Date
3Qtr
sum
4Qtr
Ireland
France
Germany
Country
ct
sum
TIES443: Introduction to DM
UNIVERSITY OF JYVSKYL
27
Pivot (rotate)
reorient the cube, visualization, 3D to series of 2D planes.
Other operations
drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its back-end relational
tables (using SQL)
rankings
time functions: e.g. time avg.
TIES443: Introduction to DM
28
14
UNIVERSITY OF JYVSKYL
Total Sales
Total Sales per city
Total Sales per city per store
Total Sales per city per store per month
Drill
Down
Total Sales
Total Sales per city
Total Sales per city by category
Drill
Up
Drill
Up
Drill Across
TIES443: Introduction to DM
29
UNIVERSITY OF JYVSKYL
c1
c2
c3
p1
44
4
p2 c1
c2
c3
p1
12
50
p2
11
8
day 2
day 1
p1
p2
c1
56
11
c2
4
8
c3
50
rollup
drill-down
TIES443: Introduction to DM
sum
c1
67
c2
12
c3
50
129
p1
p2
sum
110
19
30
15
UNIVERSITY OF JYVSKYL
prodId clientid
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2
1
2
Sum
c1
23
44
67
c2
8
4
12
c3
50
50
date
1
1
1
1
2
2
amt
12
11
50
8
44
4
Sum
81
48
129
day 2
day 1
p1
p2
Sum
c1
c2
c3
p1
44
4
p2 c1
c2
c3
p1
12
50
p2
11
8
c1
56
11
67
c2
4
8
12
c3
50
50
Sum
110
19
129
UNIVERSITY OF JYVSKYL
31
TIES443: Introduction to DM
32
16
UNIVERSITY OF JYVSKYL
2. Transparency
The OLAP functionality should be provided behind the user's existing
software without adversely affecting the functionality of the 'host, i.e.
OLAP server should shield the user for the complexity of the data and
application
3. Accessibility
OLAP should allow the user to access diverse data stores (relational,
nonrelational and legacy systems) but see the data within a common
'schema provided by the OLAP tool, i.e. Users shouldnt have to know
the location, type or layout of the data to access it.
OLAP server should automate the mapping of the logical schema to the
physical data
UNIVERSITY OF JYVSKYL
33
6. Generic Dimensionality
Data dimensions must all be treated equally. Functions available
for one dimension must be available for others.
TIES443: Introduction to DM
34
17
UNIVERSITY OF JYVSKYL
UNIVERSITY OF JYVSKYL
35
Data cleaning:
detect errors in the data and rectify them when possible
Data transformation:
convert data from legacy or host format to warehouse format:
different data formats, languages, etc.
Load:
sort, summarize, consolidate, compute views, check integrity, and
build indicies and partitions
Refresh
propagate the updates from the data sources to the warehouse
TIES443: Introduction to DM
36
18
UNIVERSITY OF JYVSKYL
DW Information Flows
INFLOW - Processes associated with the extraction,
cleansing, and loading of the data from the source systems
into the data warehouse.
UPFLOW - Processes associated with adding value to the
data in the warehouse through summarizing, packaging,
and distribution of the data.
DOWNFLOW - Processes associated with archiving and
backing-up/recovery of data in the warehouse.
OUTFLOW - Processes associated with making the data
available to the end-users.
METAFLOW - Processes associated with the management
of the metadata.
TIES443: Introduction to DM
UNIVERSITY OF JYVSKYL
37
Data Cleaning
why?
Data warehouse contains data that is analyzed for business
decisions
More data and multiple sources could mean more errors in the
data and harder to trace such errors
Results in incorrect analysis
38
19
UNIVERSITY OF JYVSKYL
TIES443: Introduction to DM
39
UNIVERSITY OF JYVSKYL
Load
Issues:
huge volumes of data to be loaded
small time window (usually at night) when the warehouse can be
taken off-line
When to build indexes and summary tables
allow system administrator to monitor status, cancel suspend,
resume load, or change load rate
restart after failure with no loss of data integrity
Techniques:
batch load utility: sort input records on clustering key and use
sequential I/O; build indexes and derived tables
sequential loads still too long (~100 days for TB)
use parallelism and incremental techniques
TIES443: Introduction to DM
40
20
UNIVERSITY OF JYVSKYL
Refresh
when to refresh
on every update: too expensive, only necessary if OLAP queries
need current data (e.g., up-the-minute stock quotes)
periodically (e.g., every 24 hours, every week) or after
significant events
refresh policy set by administrator based on user needs and traffic
possibly different policies for different sources
how to refresh
Full extract from base tables
read entire source table or database: expensive
Incremental techniques
detect & propagate changes on base tables: replication servers
logical correctness
transactional correctness: incremental load
TIES443: Introduction to DM
UNIVERSITY OF JYVSKYL
41
Metadata Repository
Administrative metadata
Business data
business terms and definitions
ownership of data
charging policies
Operational metadata
data lineage: history of migrated data and sequence of transf-s applied
currency of data: active, archived, purged
monitoring information: warehouse usage statistics, error reports, audit
trails.
TIES443: Introduction to DM
42
21
UNIVERSITY OF JYVSKYL
TIES443: Introduction to DM
UNIVERSITY OF JYVSKYL
43
TIES443: Introduction to DM
44
22
UNIVERSITY OF JYVSKYL
TIES443: Introduction to DM
UNIVERSITY OF JYVSKYL
45
Meta-data
Data sourcing
Data quality
Data security
Granularity
History- how long and how much?
Performance
TIES443: Introduction to DM
46
23
UNIVERSITY OF JYVSKYL
Common DW Problems
TIES443: Introduction to DM
UNIVERSITY OF JYVSKYL
47
Research Issues
Data cleaning
focus on data inconsistencies, not schema differences
data mining techniques
Physical Design
design of summary tables, partitions, indexes
tradeoffs in use of different indexes
Query processing
selecting appropriate summary tables
dynamic optimization with feedback
query optimization: cost estimation, use of transformations, search
strategies
partitioning query processing between OLAP server and backend server.
Warehouse Management
TIES443: Introduction to DM
48
24
UNIVERSITY OF JYVSKYL
UNIVERSITY OF JYVSKYL
49
Summary
What is a data warehouse
Data warehouse architectures
Conceptual DW Modelling
Physical DW Modelling
50
25
UNIVERSITY OF JYVSKYL
Additional Slides
TIES443: Introduction to DM
UNIVERSITY OF JYVSKYL
51
TIES443: Introduction to DM
52
26
UNIVERSITY OF JYVSKYL
Data
Mart
Enterprise
Data
Warehouse
Data
Mart
Model refinement
Model refinement
UNIVERSITY OF JYVSKYL
53
54
27
UNIVERSITY OF JYVSKYL
TIES443: Introduction to DM
UNIVERSITY OF JYVSKYL
55
TIES443: Introduction to DM
56
28
UNIVERSITY OF JYVSKYL
57
UNIVERSITY OF JYVSKYL
Shipping Method
Customer
CONTRACTS
AIR-EXPRESS
ORDER
TRUCK
PRODUCT LINE
Time
ANNUALY QTRLY DAILY
Product
PRODUCT ITEM PRODUCT GROUP
CITY
SALES PERSON
COUNTRY
DISTRICT
REGION
DIVISION
Location
TIES443: Introduction to DM
Each circle is
called a
footprint
Lecture 3: Data Warehousing
Promotion
Organization
58
29
UNIVERSITY OF JYVSKYL
Index Structures
This and the following slides on Indexing for DW are adopted with minor
modifications from: http://infolab.stanford.edu/~hector/cs245/Notes12.ppt
Indexing principle:
mapping key values to records for associative direct access
inverted lists
bit map indexes
join indexes
text indexes
TIES443: Introduction to DM
59
UNIVERSITY OF JYVSKYL
Inverted Lists
18
19
20
21
22
23
25
26
inverted
lists
age
index
Query:
Get people with age = 20 and
name = fred
TIES443: Introduction to DM
r5
r19
r37
r40
rId
r4
r18
r19
r34
r35
r36
r5
r41
name
joe
fred
sally
nancy
tom
pat
dave
jeff
age
20
20
21
20
20
25
21
26
...
20
23
r4
r18
r34
r35
data
records
30
UNIVERSITY OF JYVSKYL
Bitmap Indexes
Bitmap index: An indexing technique that has
attracted attention in multi-dimensional database
implementation
table
Customer
c1
c2
c3
c4
c5
c6
TIES443: Introduction to DM
City
Detroit
Chicago
Detroit
Poznan
Paris
Paris
Car
Ford
Honda
Honda
Ford
BMW
Nissan
61
UNIVERSITY OF JYVSKYL
Bitmap Indexes
The index consists of bitmaps:
Index on City:
ec1
1
2
3
4
5
6
Chicago Detroit
0
1
1
0
0
1
0
0
0
0
0
0
Index on Car:
Paris
0
0
0
0
1
1
Poznan
0
0
0
1
0
0
bitmaps
ec1
1
2
3
4
5
6
BMW
0
1
0
0
1
0
Ford
1
0
0
1
0
0
Honda
0
1
1
0
0
0
Nissan
0
0
0
0
0
1
bitmaps
62
31
UNIVERSITY OF JYVSKYL
Bitmap Indexes
Index on a particular column
Index consists of a number of bit vectors - bitmaps
Each value in the indexed column has a bit vector
(bitmaps)
The length of the bit vector is the number of records
in the base table
The i-th bit is set if the i-th row of the base table has
the value for the indexed column
TIES443: Introduction to DM
63
UNIVERSITY OF JYVSKYL
Bitmap Index
20
23
20
21
22
23
25
26
age
index
TIES443: Introduction to DM
1
1
0
1
1
0
0
0
0
0
0
1
0
0
0
1
0
1
1
id
1
2
3
4
5
6
7
8
name age
joe
20
fred
20
sally
21
nancy 20
tom
20
pat
25
dave 21
jeff
26
...
18
19
data
records
Query:
Get people with age = 20
and name = fred
List for age = 20:
1101100000
List for name = fred:
0100000001
Answer is intersection:
0100000000
Suited well for domains
with small cardinality
bit
maps
Lecture 3: Data Warehousing
64
32
UNIVERSITY OF JYVSKYL
TIES443: Introduction to DM
65
UNIVERSITY OF JYVSKYL
Join
Combine SALE, PRODUCT relations
In SQL: SELECT * FROM SALE, PRODUCT
sale
prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2
joinTb
TIES443: Introduction to DM
date
1
1
1
1
2
2
prodId
p1
p2
p1
p2
p1
p1
amt
12
11
50
8
44
4
name
bolt
nut
bolt
nut
bolt
bolt
product
price
10
5
10
5
10
10
storeId
c1
c1
c3
c2
c1
c2
date
1
1
1
1
2
2
id
p1
p2
name price
bolt
10
nut
5
amt
12
11
50
8
44
4
66
33
UNIVERSITY OF JYVSKYL
Join Indexes
join index
product
sale
TIES443: Introduction to DM
id
p1
p2
rId
r1
r2
r3
r4
r5
r6
name price
bolt
10
nut
5
prodId
p1
p2
p1
p2
p1
p1
jIndex
r1,r3,r5,r6
r2,r4
storeId
c1
c1
c3
c2
c1
c2
date
1
1
1
1
2
2
amt
12
11
50
8
44
4
UNIVERSITY OF JYVSKYL
67
Join Indexes
Traditional indexes map the value to a list of record
ids. Join indexes map the tuples in the join result of
two relations to the source tables.
In data warehouse cases, join indexes relate the
values of the dimensions of a star schema to rows in
the fact table.
For a warehouse with a Sales fact table and dimension city, a
join index on city maintains for each distinct city a list of
RIDs of the tuples recording the sales in the city
TIES443: Introduction to DM
68
34