Warehousing
Why Do We Need a Warehouse?
Problem: Heterogeneous Information Sources
“Heterogeneities are everywhere”
[Figure: heterogeneous sources: Personal Databases, World Wide Web, Scientific Databases, Digital Libraries]
Different interfaces
Different data representations
Duplicate and inconsistent information
CS 336
Problem: Data Management in Large Enterprises
Vertical fragmentation of informational systems (vertical stove pipes)
Result of application (user)-driven development of operational systems
[Figure: vertical stove pipes; system labels include Sales Planning, Suppliers, Num. Control, Stock Mngmt, Debt Mngmt, Inventory, ...]
[Figure: an Integration System over the World Wide Web, Personal Databases, Digital Libraries, and Scientific Databases]
Two Approaches:
Query-Driven (Lazy)
Warehouse (Eager)
The Traditional Research Approach
Query-driven (lazy, on-demand)
[Figure: Clients at the top; below them one Wrapper per Source (Wrapper, Wrapper, Wrapper, ... over Source, Source, Source, ...)]
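A minimal sketch of the query-driven idea, assuming each source is hidden behind a wrapper that renames its fields into a common schema. The `Wrapper` class, the `answer` function, and both sample sources are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class Wrapper:
    """Presents one source in a common schema (hypothetical helper)."""

    def __init__(self, fetch: Callable[[], List[Row]], rename: Dict[str, str]):
        self.fetch = fetch      # source-specific access function
        self.rename = rename    # source field name -> common field name

    def query(self, predicate: Callable[[Row], bool]) -> List[Row]:
        # The source is contacted at query time, on every query.
        rows = [{self.rename.get(k, k): v for k, v in row.items()}
                for row in self.fetch()]
        return [row for row in rows if predicate(row)]


def answer(wrappers: List[Wrapper], predicate: Callable[[Row], bool]) -> List[Row]:
    """Lazy integration: fan the query out to every wrapper and merge the results."""
    results: List[Row] = []
    for wrapper in wrappers:
        results.extend(wrapper.query(predicate))
    return results


# Two made-up sources that name the same concepts differently.
def personal_db() -> List[Row]:
    return [{"title": "Report A", "yr": 1995}]


def digital_library() -> List[Row]:
    return [{"name": "Report B", "year": 1992},
            {"name": "Report C", "year": 1980}]


wrappers = [
    Wrapper(personal_db, {"yr": "year"}),
    Wrapper(digital_library, {"name": "title"}),
]
recent = answer(wrappers, lambda row: row["year"] >= 1990)
```

Nothing is stored between queries in this sketch: every call to `answer` goes back to all sources.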
Disadvantages of Query-Driven Approach
The Warehousing Approach
Information integrated in advance
Stored in warehouse for direct querying and analysis
[Figure: Clients query the Data Warehouse; an Integration System (with Metadata) fills it from one Extractor/Monitor per Source (Source, Source, Source, ...)]
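A minimal sketch of the eager approach, assuming two made-up sources and an in-memory SQLite database standing in for the warehouse. The extractor functions, table name, and field names are illustrative assumptions, not part of the original slides.

```python
import sqlite3


# Made-up extractor outputs; each source uses its own field names and units.
def extract_store_a():
    return [{"region": "north", "amount_cents": 12000}]


def extract_store_b():
    return [{"area": "north", "amount": 80.0},
            {"area": "south", "amount": 50.0}]


def load_warehouse(conn: sqlite3.Connection) -> None:
    """Integrate in advance: convert every source to one schema and store a copy."""
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL, source TEXT)")
    rows = [(r["region"], r["amount_cents"] / 100.0, "a") for r in extract_store_a()]
    rows += [(r["area"], r["amount"], "b") for r in extract_store_b()]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load_warehouse(conn)

# Queries now run against the stored copy; the sources are not contacted.
totals = dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))
```

The trade-off is the one the slides name: queries are fast and local, but the copy is only as current as the last load.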
Advantages of Warehousing Approach
High query performance
But not necessarily most current information
Doesn’t interfere with local processing at sources
Complex queries at warehouse
OLTP at information sources
Information copied at warehouse
Can modify, annotate, summarize, restructure, etc.
Can store historical information
Security, auditing
Has caught on in industry
Not Either-Or Decision
Concept of Data Warehouse
What is a Data Warehouse?
A Practitioner's Viewpoint
What is a Data Warehouse?
An Alternative Viewpoint
“A DW is a
subject-oriented,
integrated,
time-varying,
non-volatile
collection of data that is used primarily in
organizational decision making.”
-- W.H. Inmon, Building the Data Warehouse, 1992
OLTP vs. OLAP
Concept of Data Mart
Data Mart
(Paul Lane)
Example Data Mart
(Inmon, 2002)
Subject-oriented
The subject orientation of the data warehouse
is shown in Figure 2.1. Classical operations
systems are organized around the applications
of the company. For an insurance company,
the applications may be auto, health, life, and
casualty.
The major subject areas of the insurance
corporation might be customer, policy,
premium, and claim.
For a manufacturer, the major subject areas
might be product, order, vendor, bill of
material, and raw goods. Each type of
company has its own unique set of subjects.
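As a rough illustration of subject orientation, the sketch below regroups records held per application (auto, health) around the customer subject. All records, keys, and the `by_subject` helper are made up for this example.

```python
from collections import defaultdict

# Application-oriented records (made-up), one list per operational application.
auto = [{"customer": "C1", "policy": "A-17", "premium": 300}]
health = [{"customer": "C1", "policy": "H-02", "premium": 450},
          {"customer": "C2", "policy": "H-09", "premium": 500}]


def by_subject(applications):
    """Regroup records around the 'customer' subject instead of the application."""
    customers = defaultdict(list)
    for app_name, records in applications.items():
        for rec in records:
            customers[rec["customer"]].append(
                {"line": app_name, "policy": rec["policy"], "premium": rec["premium"]})
    return dict(customers)


customer_view = by_subject({"auto": auto, "health": health})
total_premium_c1 = sum(p["premium"] for p in customer_view["C1"])
```

After regrouping, a question such as "what does customer C1 pay across all lines?" needs one lookup rather than a search through every application.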
Non-Volatile
Time-Variant
Time variancy implies that every unit of data in the data warehouse is
accurate as of some one moment in time. In some cases, a record is time
stamped. In other cases, a record has a date of transaction. But in every
case, there is some form of time marking to show the moment in time
during which the record is accurate. Figure 2.4 illustrates how time
variancy of data warehouse data can show up in several ways.
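One way to picture time marking is a list of snapshots, each stamped with the date from which it was accurate, plus an "as of" lookup. The sketch below assumes the snapshots are sorted by date; the addresses and the `as_of` helper are invented for illustration.

```python
from bisect import bisect_right
from datetime import date

# Made-up snapshots of one customer's address, each stamped with the date
# from which it was accurate. Assumed to be sorted by date, oldest first.
history = [
    (date(2020, 1, 1), "12 Elm St"),
    (date(2021, 6, 15), "4 Oak Ave"),
    (date(2023, 3, 1), "9 Pine Rd"),
]


def as_of(history, when):
    """Return the value that was accurate on `when`, or None if none existed yet."""
    stamps = [stamp for stamp, _ in history]
    i = bisect_right(stamps, when)  # number of snapshots stamped on or before `when`
    return history[i - 1][1] if i else None


address_2022 = as_of(history, date(2022, 1, 1))
```

Because old snapshots are kept rather than overwritten, the same lookup answers questions about any past date.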
A Data Warehouse is...
Stored collection of diverse data
A solution to data integration problem
Single repository of information
Subject-oriented
Organized by subject, not by application
Used for analysis, data mining, etc.
Optimized differently from transaction-oriented db
User interface aimed at executives
… Cont’d
Large volume of data (GB, TB)
Non-volatile
Historical
Time attributes are important
Updates infrequent
May be append-only
Examples
All transactions ever at Sainsbury’s
Complete client histories at insurance firm
LSE financial information and portfolios
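The non-volatile, append-only points above can be sketched as a table that offers only "add" and "read" operations. The `AppendOnlyTable` class and the account data are hypothetical; a real warehouse would enforce this in the DBMS rather than in application code.

```python
from datetime import date


class AppendOnlyTable:
    """Rows can be added and read, never updated or deleted (illustrative only)."""

    def __init__(self):
        self._rows = []

    def append(self, loaded_on: date, key: str, value: float) -> None:
        self._rows.append((loaded_on, key, value))

    def history(self, key: str):
        """All values ever loaded for `key`, in load order."""
        return [(d, v) for d, k, v in self._rows if k == key]

    def latest(self, key: str):
        """Most recently dated value for `key`, or None if the key is unknown."""
        hist = self.history(key)
        return max(hist, key=lambda pair: pair[0])[1] if hist else None


table = AppendOnlyTable()
table.append(date(2024, 1, 1), "acct-1", 100.0)
table.append(date(2024, 2, 1), "acct-1", 80.0)  # a change is a new row, not an update
```

Here `table.history("acct-1")` still contains the January value after the February load, which is what makes historical analysis possible.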
Warehouse Architecture
Generic Warehouse Architecture
[Figure: components: Client, Client; Query & Analysis; Warehouse; Metadata; Integrator; ... Annotations: Maintenance, Optimization]
Data Warehouse Architectures: Conceptual View
Single-layer
Virtual warehouse
[Figure labels: Operational systems, Informational systems]
Three-layer Architecture: Conceptual View
Transformation of real-time data to derived data really requires two steps
[Figure: data layers, top to bottom:
Derived Data (view level, "particular informational needs")
Reconciled Data (physical implementation of the data warehouse)
Real-time data
Other labels: Operational systems, Informational systems]
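The two steps can be sketched as two functions: one that reconciles real-time data from several systems into a single cleaned schema, and one that derives a summary for a particular informational need. The system names, fields, and cleaning rules below are assumptions made for the example.

```python
from collections import defaultdict

# Real-time data: made-up rows as two operational systems might hold them.
orders_system = [{"cust": " alice ", "total": "10.50"},
                 {"cust": "BOB", "total": "4.00"}]
billing_system = [{"customer_name": "Alice", "amount": 2.5}]


def reconcile(orders, billing):
    """Step 1: real-time data -> reconciled data (one schema, cleaned values)."""
    rows = [{"customer": r["cust"].strip().lower(), "amount": float(r["total"])}
            for r in orders]
    rows += [{"customer": r["customer_name"].strip().lower(), "amount": float(r["amount"])}
             for r in billing]
    return rows


def derive_totals(reconciled):
    """Step 2: reconciled data -> derived data for one informational need."""
    totals = defaultdict(float)
    for row in reconciled:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


reconciled = reconcile(orders_system, billing_system)
totals = derive_totals(reconciled)
```

Keeping the reconciled layer separate means other derived views can be built from the same cleaned data without repeating the cleaning step.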
Data Warehouse Issues
Data Warehousing: Two Distinct Issues
(1) How to get information into the warehouse
“Data warehousing”
(2) What to do with data once it’s in the warehouse
“Warehouse DBMS”
Both rich research areas
Industry has focused on (2)
Issues in Data Warehousing
Warehouse Design
Extraction
Wrappers, monitors (change detectors)
Integration
Cleansing & merging
Warehousing specification & Maintenance
Optimizations
Miscellaneous (e.g., evolution)
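For the monitor (change detector) item, one simple technique is to compare two snapshots of a source keyed by primary key. This is only one possible approach, sketched under the assumption that full snapshots are available; the `detect_changes` function and the sample data are hypothetical.

```python
def detect_changes(old, new):
    """Compare two snapshots keyed by primary key.

    Returns (inserts, updates, deletes), where updates maps each changed key
    to an (old_value, new_value) pair.
    """
    inserts = {k: new[k] for k in new.keys() - old.keys()}
    deletes = {k: old[k] for k in old.keys() - new.keys()}
    updates = {k: (old[k], new[k])
               for k in old.keys() & new.keys() if old[k] != new[k]}
    return inserts, updates, deletes


# Made-up snapshots of a source table: primary key -> e-mail address.
yesterday = {1: "alice@example.com", 2: "bob@example.com"}
today = {2: "bob@example.org", 3: "carol@example.com"}

inserts, updates, deletes = detect_changes(yesterday, today)
```

Only the detected changes would then need to be passed on to the integrator, instead of reloading the whole source.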
Literature
Questions?
Assignment