
Introduction to Data Warehousing

Why Do We Need a Warehouse?

Problem: Heterogeneous Information Sources
“Heterogeneities are everywhere”
[Figure: personal databases, scientific databases, digital libraries, and the World Wide Web as heterogeneous sources]

 Different interfaces
 Different data representations
 Duplicate and inconsistent information
CS 336
Problem: Data Management in Large Enterprises
 Vertical fragmentation of informational systems
(vertical stove pipes)
 Result of application (user)-driven development of
operational systems
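[Figure: vertical stove pipes, one per department (Sales, Administration, Finance, Manufacturing, ...), each holding its own operational systems such as Sales Planning, Stock Mngmt, Suppliers, Debt Mngmt, Num. Control, Inventory, ...]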


Goal: Unified Access to Data

[Figure: an integration system sitting on top of the World Wide Web, personal databases, digital libraries, and scientific databases]

 Collects and combines information


 Provides integrated view, uniform user interface
 Supports sharing
Why a Warehouse?

 Two Approaches:
 Query-Driven (Lazy)
 Warehouse (Eager)

The Traditional Research Approach
 Query-driven (lazy, on-demand)
[Figure: clients query an integration system (with metadata), which reaches each source through its own wrapper]

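A minimal sketch of the query-driven idea, assuming two made-up sources: nothing is copied in advance, and every lookup contacts each source through its wrapper. The source data, the field names, and the helper `query_driven_lookup` are hypothetical and serve only as illustration.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]
Wrapper = Callable[[str], Iterable[Row]]

# Hypothetical source 1: a personnel system with its own field names.
_PERSONNEL = [{"emp_name": "Ana", "dept": "Sales"},
              {"emp_name": "Budi", "dept": "Finance"}]

# Hypothetical source 2: a payroll system that stores plain tuples.
_PAYROLL = [("Ana", 5200), ("Budi", 4800)]


def personnel_wrapper(name: str) -> Iterable[Row]:
    """Translate personnel records into the shared schema at query time."""
    return [{"name": r["emp_name"], "department": r["dept"]}
            for r in _PERSONNEL if r["emp_name"] == name]


def payroll_wrapper(name: str) -> Iterable[Row]:
    """Translate payroll tuples into the shared schema at query time."""
    return [{"name": n, "salary": s} for n, s in _PAYROLL if n == name]


def query_driven_lookup(name: str, wrappers: List[Wrapper]) -> Row:
    """Contact every source on demand and merge the answers."""
    merged: Row = {}
    for wrapper in wrappers:
        for row in wrapper(name):
            merged.update(row)
    return merged


result = query_driven_lookup("Ana", [personnel_wrapper, payroll_wrapper])
```

Each call pays the cost of reaching every source, which is where the delay and the competition with local processing come from.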
Disadvantages of Query-Driven Approach

 Delay in query processing


 Slow or unavailable information sources
 Complex filtering and integration
 Inefficient and potentially expensive for
frequent queries
 Competes with local processing at sources
 Hasn’t caught on in industry

The Warehousing Approach
 Information integrated in advance
 Stored in the warehouse for direct querying and analysis
[Figure: extractors/monitors pull data from each source into an integration system (with metadata), which loads the data warehouse that clients query]
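A minimal sketch of the eager approach, assuming two made-up sources and an in-memory SQLite database standing in for the warehouse. The extractor functions, the `sales` table, and its columns are hypothetical. Data is extracted, converted to a common form, and loaded before any client query arrives.

```python
import sqlite3


def extract_store_sales():
    # Hypothetical source: amounts stored as integer cents.
    return [("2024-01-05", "widget", 1250), ("2024-01-06", "gadget", 990)]


def extract_web_sales():
    # Hypothetical source: amounts stored as decimal units, different field names.
    return [{"day": "2024-01-05", "item": "widget", "amount": 20.0}]


def load_warehouse(conn: sqlite3.Connection) -> None:
    """Integrate both sources in advance and store the result."""
    conn.execute(
        "CREATE TABLE sales (day TEXT, product TEXT, amount REAL, source TEXT)"
    )
    rows = [(d, p, cents / 100, "store") for d, p, cents in extract_store_sales()]
    rows += [(r["day"], r["item"], r["amount"], "web") for r in extract_web_sales()]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def total_by_product(conn: sqlite3.Connection):
    """Client query: answered from the warehouse alone, no source is contacted."""
    cur = conn.execute(
        "SELECT product, SUM(amount) FROM sales GROUP BY product ORDER BY product"
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
load_warehouse(conn)
totals = total_by_product(conn)
```

The query is fast because the integration work was done at load time. The trade-off is that `totals` reflects the last load, not necessarily the current state of the sources.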
Advantages of Warehousing Approach
 High query performance
 But not necessarily most current information
 Doesn’t interfere with local processing at sources
 Complex queries at warehouse
 OLTP at information sources
 Information copied at warehouse
 Can modify, annotate, summarize, restructure, etc.
 Can store historical information
 Security, no auditing
 Has caught on in industry
Not Either-Or Decision

 Query-driven approach still better for


 Rapidly changing information
 Rapidly changing information sources
 Truly vast amounts of data from large numbers
of sources
 Clients with unpredictable needs

Concept of Data Warehouse

What is a Data Warehouse?
A Practitioner's Viewpoint

“A data warehouse is simply a single,
complete, and consistent store of data
obtained from a variety of sources and made
available to end users in a way they can
understand and use it in a business context.”
-- Barry Devlin, IBM Consultant

What is a Data Warehouse?
An Alternative Viewpoint

“A DW is a
 subject-oriented,
 integrated,
 time-varying,
 non-volatile
collection of data that is used primarily in
organizational decision making.”
-- W.H. Inmon, Building the Data Warehouse, 1992

What is a Data Warehouse?
An Alternative Viewpoint

“Data warehousing is really a simple concept:
Take all the data you already have in the
organization, clean and transform it, and
then provide useful strategic information.”
-- Paulraj Ponniah, Data Warehousing Fundamentals

What is a Data Warehouse?
An Alternative Viewpoint

A data warehouse is a database designed to enable
business intelligence activities: it exists to help
users understand and enhance their
organization's performance.
It is designed for query and analysis rather than for
transaction processing, and usually contains
historical data derived from transaction data, but
can include data from other sources.

-- Paul Lane, Data Warehousing Guide, Oracle 12c, 2014


Concept of OLTP & OLAP

OLTP vs. OLAP

 OLTP: Online Transaction Processing
 Describes processing at operational sites
 OLAP: Online Analytical Processing
 Describes processing at warehouse

Advantage of data warehouse:
With a data warehouse you separate analysis
workload from transaction workload. This enables
far better analytical performance and avoids
impacting your transaction systems. (Paul Lane)
Warehouse is a Specialized DB
Standard DB (OLTP)                           Warehouse (OLAP)
 Mostly updates                               Mostly reads
 Many small transactions                      Queries are long and complex
 MB to GB of data                             GB to TB of data
 Current snapshot                             History
 Index/hash on p.k.                           Lots of scans
 Raw data                                     Summarized, reconciled data
 Thousands of users (e.g., clerical users)    Hundreds of users (e.g., decision-makers, analysts)

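A small sketch of the two workload styles, assuming a hypothetical `orders` table. Both statements run against one in-memory SQLite database purely for illustration; the point of a warehouse is that in practice they run on separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer TEXT, year INTEGER, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, "acme", 2023, 100), (2, "acme", 2024, 150), (3, "globex", 2024, 80)],
)

# OLTP style: a short transaction that updates one row found by primary key.
with conn:
    conn.execute(
        "UPDATE orders SET amount = amount + 10 WHERE order_id = ?", (3,)
    )

# OLAP style: a read-only query that scans the table and summarizes it.
yearly = conn.execute(
    "SELECT year, COUNT(*), SUM(amount) FROM orders GROUP BY year ORDER BY year"
).fetchall()
```

The update touches a single record and benefits from the primary-key index, while the summary reads every row, which matches the "index/hash on p.k." versus "lots of scans" contrast.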
Concept of Data Mart

Data Mart

(Paul Lane)

Example Data Mart

Differences in the example data stored at each level of data
(Inmon, 2002)
Example Data Mart

An example of a DW architecture that uses data marts
(Paul Lane)
Characteristics of a Data Warehouse

Subject-oriented
The subject orientation of the data warehouse
is shown in Figure 2.1. Classical operations
systems are organized around the applications
of the company. For an insurance company,
the applications may be auto, health, life, and
casualty.
The major subject areas of the insurance
corporation might be customer, policy,
premium, and claim.
For a manufacturer, the major subject areas
might be product, order, vendor, bill of
material, and raw goods. Each type of
company has its own unique set of subjects.

What are the subject areas of a university?


Integrated

Data is fed from multiple disparate sources into the
data warehouse.

As the data is fed it is converted, reformatted,
resequenced, summarized, and so forth.

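A minimal sketch of the conversion step, assuming two made-up applications that encode the same customer attributes differently. The field names, the code table `GENDER_CODES`, and the target layout are hypothetical.

```python
from typing import Dict

# Hypothetical mapping from each application's encoding to one warehouse code.
GENDER_CODES = {"m": "M", "male": "M", "1": "M",
                "f": "F", "female": "F", "0": "F"}


def from_app_a(rec: Dict[str, object]) -> Dict[str, object]:
    """Application A: letter codes for gender, balance in integer cents."""
    return {
        "customer": str(rec["cust"]).strip().title(),
        "gender": GENDER_CODES[str(rec["sex"]).lower()],
        "balance": int(rec["balance_cents"]) / 100,
    }


def from_app_b(rec: Dict[str, object]) -> Dict[str, object]:
    """Application B: numeric codes for gender, balance as a decimal string."""
    return {
        "customer": str(rec["customer_name"]).strip().title(),
        "gender": GENDER_CODES[str(rec["gender"]).lower()],
        "balance": float(rec["balance"]),
    }


warehouse_rows = [
    from_app_a({"cust": "ana", "sex": "F", "balance_cents": 12050}),
    from_app_b({"customer_name": " Budi ", "gender": "1", "balance": "99.50"}),
]
```

After conversion both rows share one layout and one encoding, whatever the source application used.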
Non-Volatile

Operational data is regularly accessed and manipulated one record
at a time. Data in the data warehouse is loaded and read but not
updated in place, so the history of the data is kept.

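A toy sketch of the contrast, assuming a hypothetical account balance that changes over two days. For brevity a single helper writes to both stores at once; a real warehouse would be loaded separately, typically in batches.

```python
from typing import Dict, List, Tuple

operational: Dict[str, int] = {}            # current value only
warehouse: List[Tuple[str, str, int]] = []  # every snapshot, never overwritten


def record_balance(account: str, day: str, balance: int) -> None:
    operational[account] = balance             # update in place: old value is lost
    warehouse.append((account, day, balance))  # append: old value is kept


record_balance("A-1", "2024-01-01", 100)
record_balance("A-1", "2024-01-02", 70)
```

The operational store now holds only the latest balance, while the warehouse list still holds both snapshots.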
Time-Variant

Time variancy implies that every unit of data in the data warehouse is
accurate as of some one moment in time. In some cases, a record is time
stamped. In other cases, a record has a date of transaction. But in every
case, there is some form of time marking to show the moment in time
during which the record is accurate. Figure 2.4 illustrates how time
variancy of data warehouse data can show up in several ways.
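One way time marking might be used is an "as of" lookup. This sketch assumes a hypothetical history list in which each record carries the date from which it is valid; the helper `value_as_of` is illustrative only.

```python
from datetime import date
from typing import List, Optional, Tuple

# (valid_from, value) pairs, sorted by valid_from.
History = List[Tuple[date, str]]


def value_as_of(history: History, when: date) -> Optional[str]:
    """Return the value that was accurate at the given moment, if any."""
    result = None
    for valid_from, value in history:
        if valid_from <= when:
            result = value
        else:
            break  # later records were not yet valid
    return result


address_history: History = [
    (date(2020, 1, 1), "Jakarta"),
    (date(2022, 6, 15), "Bandung"),
]
```

Asking for a date before the first record returns `None`, since no record was accurate at that moment.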
A Data Warehouse is...
 Stored collection of diverse data
 A solution to data integration problem
 Single repository of information
 Subject-oriented
 Organized by subject, not by application
 Used for analysis, data mining, etc.
 Optimized differently from transaction-oriented db
 User interface aimed at executives

… Cont’d
 Large volume of data (GB, TB)
 Non-volatile
 Historical
 Time attributes are important
 Updates infrequent
 May be append-only
 Examples
 All transactions ever at Sainsbury’s
 Complete client histories at insurance firm
 LSE financial information and portfolios
Warehouse Architecture

Generic Warehouse Architecture
[Figure: clients send query and analysis requests to the warehouse; an integrator loads the warehouse from extractors/monitors attached to each source; metadata supports the design phase, loading, maintenance, and optimization]
Data Warehouse Architectures: Conceptual View

 Single-layer
 Every data element is stored once only
 Virtual warehouse
[Figure: operational systems and informational systems both work on a single layer of “real-time data”]

 Two-layer
 Real-time + derived data
 Most commonly used approach in industry today
[Figure: operational systems work on real-time data; informational systems work on derived data]

Three-layer Architecture: Conceptual View
 Transformation of real-time data to derived data really requires two steps
[Figure: operational systems work on real-time data; reconciled data is the physical implementation of the data warehouse; derived data is the view level (“particular informational needs”) used by informational systems]

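The two steps can be sketched as two functions, assuming made-up sales records from two hypothetical operational systems. Step one reconciles the raw records into one clean form; step two derives a summary for a particular informational need.

```python
from collections import defaultdict
from typing import Dict, List

# Real-time data as the operational systems hold it (hypothetical records).
real_time = [
    {"sys": "pos", "sku": "w-1 ", "qty": "2", "price": "3.50"},
    {"sys": "web", "sku": "W-1", "qty": "1", "price": "3.50"},
    {"sys": "web", "sku": "G-7", "qty": "4", "price": "1.25"},
]


def reconcile(rows: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Step 1: clean keys and types so records from all systems agree."""
    return [
        {"sku": r["sku"].strip().upper(),
         "qty": int(r["qty"]),
         "price": float(r["price"])}
        for r in rows
    ]


def derive_revenue(reconciled: List[Dict[str, object]]) -> Dict[str, float]:
    """Step 2: summarize reconciled data for one informational need."""
    totals: Dict[str, float] = defaultdict(float)
    for r in reconciled:
        totals[r["sku"]] += r["qty"] * r["price"]
    return dict(totals)


reconciled = reconcile(real_time)
revenue = derive_revenue(reconciled)
```

Keeping the reconciled layer separate means other derived views can be built from the same clean data without repeating the cleanup.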
Data Warehouse Issues

Data Warehousing: Two Distinct Issues
(1) How to get information into warehouse
“Data warehousing”
(2) What to do with data once it’s in warehouse
“Warehouse DBMS”
 Both rich research areas
 Industry has focused on (2)

Issues in Data Warehousing
 Warehouse Design
 Extraction
 Wrappers, monitors (change detectors)
 Integration
 Cleansing & merging
 Warehousing specification & Maintenance
 Optimizations
 Miscellaneous (e.g., evolution)

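A monitor (change detector) could, for example, compare two snapshots of a source. This is a minimal sketch of that one technique, assuming the source can be dumped as a mapping from key to record; the function name and the sample data are hypothetical, and real monitors may rely on triggers or logs instead.

```python
from typing import Dict, Hashable


def detect_changes(old: Dict[Hashable, object],
                   new: Dict[Hashable, object]) -> Dict[str, dict]:
    """Compare two snapshots and report what the warehouse must apply."""
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {k: new[k] for k in old.keys() & new.keys() if old[k] != new[k]}
    return {"inserted": inserted, "deleted": deleted, "updated": updated}


yesterday = {1: "Ana", 2: "Budi", 3: "Citra"}
today = {1: "Ana", 2: "Budi S.", 4: "Dewi"}
changes = detect_changes(yesterday, today)
```

Only the detected changes need to be sent on to the integration step, rather than the whole source.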
References

 Slides adapted from Enrico Franconi, CS 636
 Some additional literature from the books listed in the
Data Warehouse course syllabus

Question?

 When is a data warehouse needed?

Assignment

 Find a journal article or paper that discusses a
“data warehouse” implementation, then print it on A4 paper.
 Answer on lined folio paper:
1. Write down the definition of a DW given in the article.
2. State the background problem described in the article
that made a DW necessary.
3. Advantages / results of the DW