Introduction to Data Warehousing
Copyright@ Pawan Likhi
Agenda
Operational Systems
Overview of Data Warehousing
Data Warehouse Architectures
Understand the ETL process
Informatica Introduction
Operational Systems
Characteristics of Operational Systems
Continuous availability
Predefined access paths
Transaction integrity
Volume of transaction - High
Data volume per query - Low
Supports day to day control operations
Large number of users
Data Warehousing
Subject Oriented
Integrated
Time variant
Non-volatile collection of data in support of management decision processes
[Figure: operational systems and warehouse subject areas; labels: Order Entry, Customer, Billing, Usage, Accounting, Revenue]
Operational data is organized by specific processes or tasks and is maintained by separate systems.
Warehoused data is organized by subject area and is populated from many operational systems.
Operational applications: application specific, evolved, primarily concerned with current data
Data warehouse: integrated, designed, generally concerned with historical data
[Figure: changing warehouse data; labels: Load/Update, Insert, Initial Load, Incremental Load, Refresh, Purge or Archive, Warehouse Database]
Constant change: updated constantly; changes according to need, not a fixed schedule
[Figure: Any Source, Any Data, Any Access; labels: Operational data, External data, Relational / Multidimensional, Text, image, Spatial, Web, Audio, video, Oracle Medi`, Relational tools, OLAP tools, Applications / Web]
Types of Data Warehouses
Enterprise Data Warehouse - An enterprise data
warehouse provides a central database for decision
support throughout the enterprise.
ODS (Operational Data Store) - This has a broad,
enterprise-wide scope, but unlike the real enterprise
data warehouse, data is refreshed in near real time and
used for routine business activity, for example finding the
status of a customer order.
Data Mart - A data mart is a subset of the data warehouse that
supports a particular region, business unit, or business
function.
[Figure: warehouse architecture; labels: Client, Query & Analysis, Warehouse, Metadata, Integrator, Extractor/Monitor (three instances); phases: Design Phase, Loading, Maintenance, Optimization]
Warehouse Design
Extraction: wrappers, monitors (change detectors)
Integration: cleansing & merging
Warehouse is a Specialized DB
Standard DB (OLTP)
Mostly updates
Current snapshot
Index/hash on primary key
Raw data
Thousands of users (e.g., clerical users)
Warehouse (OLAP)
Mostly reads
Queries are long and complex
GB to TB of data
History
Lots of scans
Summarized, reconciled data
Hundreds of users (e.g., decision-makers, analysts)
Data Marts
[Figure: data marts and the enterprise data warehouse]
Main Features:
Low cost
Controlled locally rather than centrally, conferring power on the user group.
Contain less information than the warehouse
Rapid response
More easily understood and navigated than an enterprise data warehouse.
Within the range of divisional or departmental budgets
[Figure: data warehouse architecture; labels: Operational & External data, ODS, Data Staging layer, Data warehouse, Data Marts, Information Servers, Information Access, Reporting tools, OLAP, Mining, Web Browsers, Administration]
Operational & External Data Layer
The database of record
Consists of system-specific reference data and event data
Source of data for the data warehouse
Contains detailed data
Continually changes due to updates
Stores data up to the last transaction
Data Staging Layer
Objectives of the enterprise determine the structure
Dimensional Modeling
[Figure: measures (Sales, Revenue, Profitability, Net Profit, Gross Margin, Cost) viewed across dimensions]
Product Dimension
Customer Dimension
Geographic Dimension
Time Dimension
Modeling
Modeling is iterative
[Figure: iterative modeling steps; recoverable labels: Select a business process, steps 2 to 4 (text not recovered), Modeling summaries, Physical model]
Time
Month > Quarter > Year
Product
Type: PC, Server
Monitor: 15 inch, 17 inch, 19 inch, None
Status: New, Rebuilt, Custom
Store
Store > District > Region
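A minimal sketch of rolling a measure up the Month > Quarter > Year hierarchy shown above. The unit counts and variable names are hypothetical and only illustrate the aggregation.

```python
from collections import defaultdict

def quarter_of(month: int) -> int:
    """Map a calendar month (1-12) to its quarter (1-4)."""
    return (month - 1) // 3 + 1

# Hypothetical monthly measure: (year, month) -> units sold.
monthly_units = {(2009, 1): 10, (2009, 2): 5, (2009, 4): 7, (2010, 1): 3}

quarterly_units = defaultdict(int)
yearly_units = defaultdict(int)
for (year, month), units in monthly_units.items():
    quarterly_units[(year, quarter_of(month))] += units  # Month -> Quarter
    yearly_units[year] += units                          # Quarter -> Year

print(dict(quarterly_units))
print(dict(yearly_units))
```

The same pattern applies to Store > District > Region: each lower level carries a reference to its parent, and measures are summed along that reference.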
Dimension Tables
Dimension tables have the following characteristics:
Contain textual information that represents the attributes of the business
Contain relatively static data
Are joined to a fact table through a foreign key reference
[Figure: star schema with a central Facts table (units, price) surrounded by the Product, Channel, Customer, and Time dimension tables]
Fact Tables
[Figure: star schema with the central fact table (Facts: units, price) and the dimension tables (Product, Channel, Customer, Time) labeled]
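The fact and dimension layout above can be sketched with Python's built-in sqlite3 module. This is an illustration under assumptions: the table names follow the slide labels, but the key columns, attribute columns, and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: textual, relatively static attributes.
CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, product_name  TEXT);
CREATE TABLE channel  (channel_id  INTEGER PRIMARY KEY, channel_name  TEXT);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE time_dim (time_id     INTEGER PRIMARY KEY, month_name    TEXT);

-- Fact table: foreign keys to each dimension plus the numeric facts.
CREATE TABLE facts (
    product_id  INTEGER REFERENCES product(product_id),
    channel_id  INTEGER REFERENCES channel(channel_id),
    customer_id INTEGER REFERENCES customer(customer_id),
    time_id     INTEGER REFERENCES time_dim(time_id),
    units       INTEGER,
    price       REAL
);

INSERT INTO product  VALUES (1, 'PC'), (2, 'Server');
INSERT INTO channel  VALUES (1, 'Store'), (2, 'Web');
INSERT INTO customer VALUES (1, 'Customer A'), (2, 'Customer B');
INSERT INTO time_dim VALUES (1, 'January'), (2, 'February');
INSERT INTO facts VALUES
    (1, 1, 1, 1, 2, 500.0),
    (1, 2, 2, 1, 1, 550.0),
    (2, 1, 1, 2, 1, 3000.0);
""")

# Join the fact table to one dimension and aggregate revenue by product.
revenue_by_product = conn.execute("""
    SELECT p.product_name, SUM(f.units * f.price) AS revenue
    FROM facts AS f
    JOIN product AS p ON p.product_id = f.product_id
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()
print(revenue_by_product)
```

Queries against a star schema typically follow this shape: filter and group by dimension attributes, then aggregate the facts.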
Store Table: Store_id, District_id, ...
Time Table: Day_id, Month_id, Period_id, Year_id
Item Table: Item_id, Item_desc, ...
Store Table: Store_id, Store_desc, District_id
District Table: District_id, District_desc
Item Table: Item_id, Item_desc, Dept_id
Dept Table: Dept_id, Dept_desc, Mgr_id
Mgr Table: Dept_id, Mgr_id, Mgr_name
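In the second group of tables above, the store's district is held in its own District table and reached through District_id, a layout commonly described as snowflaked. A rough sketch of rolling sales up to the district level through that reference follows; the `sales` fact table and all sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE district (district_id INTEGER PRIMARY KEY, district_desc TEXT);
CREATE TABLE store (
    store_id    INTEGER PRIMARY KEY,
    store_desc  TEXT,
    district_id INTEGER REFERENCES district(district_id)
);
-- Hypothetical fact table, not taken from the slides.
CREATE TABLE sales (store_id INTEGER REFERENCES store(store_id), units INTEGER);

INSERT INTO district VALUES (1, 'North'), (2, 'South');
INSERT INTO store VALUES (10, 'Store A', 1), (11, 'Store B', 1), (12, 'Store C', 2);
INSERT INTO sales VALUES (10, 5), (11, 7), (12, 3), (10, 1);
""")

# Two joins are needed: fact -> store -> district.
units_by_district = conn.execute("""
    SELECT d.district_desc, SUM(s.units) AS units
    FROM sales AS s
    JOIN store AS st   ON st.store_id = s.store_id
    JOIN district AS d ON d.district_id = st.district_id
    GROUP BY d.district_desc
    ORDER BY d.district_desc
""").fetchall()
print(units_by_district)
```

The trade-off is the usual one: the normalized layout stores each district description once, at the cost of an extra join per hierarchy level.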
Detect cannibalization
Customers buy the promoted product instead of competing products
Promoting Brand A reduces sales of Brand B
Surrogate Keys
Primary keys of dimension tables should
be surrogate keys, not natural keys
Natural key: A key that is meaningful to users
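One simple way to assign surrogate keys during a dimension load is sketched below. The class name and the natural-key strings are hypothetical; a real warehouse would more likely use a database sequence or identity column.

```python
from itertools import count

class SurrogateKeyMap:
    """Hand out a meaningless integer key for each distinct natural key."""

    def __init__(self):
        self._next_key = count(1)   # 1, 2, 3, ...
        self._keys = {}             # natural key -> surrogate key

    def key_for(self, natural_key: str) -> int:
        # Reuse the existing surrogate key if this natural key was seen before.
        if natural_key not in self._keys:
            self._keys[natural_key] = next(self._next_key)
        return self._keys[natural_key]

product_keys = SurrogateKeyMap()
k1 = product_keys.key_for("PC-15IN-NEW")       # hypothetical natural key
k2 = product_keys.key_for("SRV-NONE-REBUILT")  # hypothetical natural key
k3 = product_keys.key_for("PC-15IN-NEW")       # same natural key, same surrogate
print(k1, k2, k3)
```

Because the surrogate key carries no business meaning, the natural key can change or be reused in the source system without touching the fact rows that reference the dimension.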
Other attributes
Product name, Size, Weight, Package Type, etc.
Store
Geography hierarchy
Store > ZIP Code > County > State
Administrative hierarchy
Store > District > Region
Other attributes
Address, Store name, Store Manager, Square Footage, etc.
Hierarchies
Common in dimension tables
Multiple hierarchies can appear in the same dimension
Don't need to be strict hierarchies
e.g. a ZIP code that spans 2 counties
Examples:
Student/department mapping fact table
What is the major field of study for each student?
Even for students who didn't enroll in any courses
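A sketch of how a student/department mapping table can answer the question above. The mapping table holds only keys and no measures; all table names, column names, and rows here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student    (student_id INTEGER PRIMARY KEY, student_name TEXT);
CREATE TABLE department (dept_id    INTEGER PRIMARY KEY, dept_name    TEXT);

-- Mapping fact table: records the relationship itself, with no numeric facts.
CREATE TABLE student_department (
    student_id INTEGER REFERENCES student(student_id),
    dept_id    INTEGER REFERENCES department(dept_id)
);

-- Enrollment fact table: only students who actually enrolled appear here.
CREATE TABLE enrollment (
    student_id  INTEGER REFERENCES student(student_id),
    course_code TEXT
);

INSERT INTO student    VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO department VALUES (1, 'Physics'), (2, 'History');
INSERT INTO student_department VALUES (1, 1), (2, 2);
INSERT INTO enrollment VALUES (1, 'PHYS101');
""")

# The major comes from the mapping table, so Ben is listed despite no enrollments.
majors = conn.execute("""
    SELECT s.student_name, d.dept_name, COUNT(e.course_code) AS courses
    FROM student AS s
    JOIN student_department AS sd ON sd.student_id = s.student_id
    JOIN department AS d          ON d.dept_id = sd.dept_id
    LEFT JOIN enrollment AS e     ON e.student_id = s.student_id
    GROUP BY s.student_id, s.student_name, d.dept_name
    ORDER BY s.student_id
""").fetchall()
print(majors)
```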
Slowly Changing Dimensions (SCD)
The usual changes to dimension tables are classified into three types
Type 1
Type 2
Type 3
We will consider the points discussed earlier when
deciding which type to use
Type 1 Changes
Type 2 Changes
Let's look at the marital status of Miky Schreiber
One of the DWH's requirements is to track orders by marital status (in
addition to other attributes)
All orders before 11/10/2004 will be under Marital Status = Single,
and all orders after that date will be under Marital Status = Married
We need to aggregate the orders before and after the marriage
separately
Let's make life harder:
Miky is living on Negba st., but on 30/8/2009 he moves to Avivim st.
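A minimal in-memory sketch of handling both changes as Type 2: each change closes the current dimension row and adds a new row with a new surrogate key. Assumptions, none of which come from the slides: the column names, the customer id "C100", the initial valid_from date, reading the dates as day/month/year, and treating an order placed on the change date as belonging to the new row.

```python
from datetime import date

customer_dim = [
    {"customer_key": 1, "customer_id": "C100", "name": "Miky Schreiber",
     "marital_status": "Single", "street": "Negba st.",
     "valid_from": date(2000, 1, 1), "valid_to": None},  # placeholder start date
]

def apply_type2_change(dim, customer_id, changes, effective):
    """Close the current row and append a new version with a new surrogate key."""
    current = next(r for r in dim
                   if r["customer_id"] == customer_id and r["valid_to"] is None)
    current["valid_to"] = effective
    new_row = {**current, **changes,
               "customer_key": max(r["customer_key"] for r in dim) + 1,
               "valid_from": effective, "valid_to": None}
    dim.append(new_row)
    return new_row

def key_on(dim, customer_id, order_date):
    """Find the surrogate key of the row version in effect on order_date."""
    for r in dim:
        if (r["customer_id"] == customer_id
                and r["valid_from"] <= order_date
                and (r["valid_to"] is None or order_date < r["valid_to"])):
            return r["customer_key"]
    raise LookupError(f"no row for {customer_id} on {order_date}")

apply_type2_change(customer_dim, "C100", {"marital_status": "Married"}, date(2004, 10, 11))
apply_type2_change(customer_dim, "C100", {"street": "Avivim st."}, date(2009, 8, 30))

for row in customer_dim:
    print(row["customer_key"], row["marital_status"], row["street"])
```

Orders loaded into the fact table carry the surrogate key from `key_on`, so grouping by marital status separates orders before and after the marriage without any date logic in the query.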
Type 3 Changes
Not common at all
Complex queries on type 2 changes may be
Hard to implement
Time-consuming
Hard to maintain
We want to track history without carrying a heavy burden
There are many soft changes and we don't care about the distant history
Type 3 Changes
General Principles:
They usually relate to soft or tentative changes in the source systems
There is a need to keep track of history with old and new values of the changed attribute
They are used to compare performance across the transition
They provide the ability to track forward and backward
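A small sketch of a Type 3 change under these principles: the row keeps one "previous" and one "current" column for the tracked attribute, so only the latest transition is remembered. The column naming scheme, the choice of street as the tracked attribute, and the third street value are assumptions made for illustration.

```python
from datetime import date

customer = {
    "customer_key": 1,
    "name": "Miky Schreiber",
    "current_street": "Negba st.",
    "previous_street": None,
    "street_effective": None,
}

def apply_type3_change(row, attribute, new_value, effective):
    """Shift the current value into the 'previous' column, then overwrite it."""
    row[f"previous_{attribute}"] = row[f"current_{attribute}"]
    row[f"current_{attribute}"] = new_value
    row[f"{attribute}_effective"] = effective
    return row

apply_type3_change(customer, "street", "Avivim st.", date(2009, 8, 30))
after_first = dict(customer)   # snapshot for comparison

# A second change pushes out the oldest value: only one step of history is kept.
apply_type3_change(customer, "street", "Hypothetical st.", date(2012, 1, 1))
print(after_first["previous_street"], "->", after_first["current_street"])
print(customer["previous_street"], "->", customer["current_street"])
```

No new row or surrogate key is created, so existing fact rows can be grouped by either the old or the new value, which is what allows comparison across the transition.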
ETL Concepts
Extraction, transformation, and loading. ETL refers to the
methods involved in accessing and manipulating source data
and loading it into a target database.
The first step in the ETL process is mapping the data between
the source systems and the target database (data warehouse or
data mart). The second step is cleansing the source data in the
staging area. The third step is transforming the cleansed source
data and then loading it into the target system.
ETL (Extraction, Transformation and Loading) is a process
by which data is integrated and transformed from the
operational systems into the data warehouse environment.
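The three steps described above (mapping, cleansing in a staging area, then transforming and loading) can be sketched in plain Python. The field names, sample rows, and cleansing rules are all hypothetical, and plain lists stand in for the staging area and the target.

```python
# Step 1: mapping between source fields and target columns.
MAPPING = {"cust_nm": "customer_name", "st": "state", "amt": "amount"}

source_rows = [
    {"cust_nm": "  Acme Corp ", "st": "ca", "amt": "100.50"},
    {"cust_nm": "Globex", "st": "CA", "amt": "20"},
    {"cust_nm": "", "st": "NY", "amt": "5"},  # missing name: rejected below
]

def cleanse(row):
    """Step 2: resolve inconsistencies; return None for rows that cannot be fixed."""
    cleaned = {field: value.strip() for field, value in row.items()}
    cleaned["st"] = cleaned["st"].upper()   # unify state code casing
    return cleaned if cleaned["cust_nm"] else None

def transform(row):
    """Step 3: rename fields per the mapping and convert types for the target."""
    target = {MAPPING[field]: value for field, value in row.items()}
    target["amount"] = float(target["amount"])
    return target

staging_area = [row for row in map(cleanse, source_rows) if row is not None]
warehouse = [transform(row) for row in staging_area]   # the "load"
print(warehouse)
```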
ETL Glossary
Source System
A database, application, file, or other storage facility from which the
data in a data warehouse is derived.
Mapping
The definition of the relationship and data flow between source and
target objects.
Metadata
Data that describes data and other structures, such as objects,
business rules, and processes. For example, the schema design of a
data warehouse is typically stored in a repository as metadata, which
is used to generate scripts used to build and populate the data
warehouse. A repository contains metadata.
Staging Area
A place where data is processed before entering the warehouse.
ETL Glossary
Cleansing
The process of resolving inconsistencies and fixing the anomalies in
source data, typically as part of the ETL process.
Transformation
The process of manipulating data. Any manipulation beyond copying
is a transformation. Examples include cleansing, aggregating, and
integrating data from multiple sources.
Transportation
The process of moving copied or transformed data from a source to
a data warehouse.
Target System
A database, application, file, or other storage facility to which the
"transformed source data" is loaded in a data warehouse.
[Figure: data flows from operational systems through a cleanser, integrator, and loader into the warehouse. Cleaning rules and transformation rules (Rule 1 to Rule 3 each) drive the transformation engine; errors can be viewed, checked, and corrected.]
Extraction
[Figure: extraction from sources to a target; labels: Oracle, Sybase, Text files, 80 tables, Target, 50 tables]
Transformation
Source
Emp id   Last Name   First Name
10001    Jones       Indiana
10002    Holmes      Sherlock
Staging Area
Name = Concat(First Name, Last Name)
Indiana Jones
Sherlock Holmes
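The name transformation above, written out as a short sketch. The dictionary keys are hypothetical, and the single space between the two names is an assumption based on the output shown on the slide.

```python
source = [
    {"emp_id": 10001, "last_name": "Jones", "first_name": "Indiana"},
    {"emp_id": 10002, "last_name": "Holmes", "first_name": "Sherlock"},
]

def concat_name(row):
    """Name = Concat(First Name, Last Name), joined with a single space."""
    return f"{row['first_name']} {row['last_name']}"

# Staging area rows carry the derived Name column instead of the two source columns.
staging_area = [{"emp_id": row["emp_id"], "name": concat_name(row)} for row in source]
names = [row["name"] for row in staging_area]
print(names)
```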
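Loading
[Figure: raw data moves from the source into the staging area for cleaning, transformation, and integration; the cleaned, transformed, and integrated data is then loaded into the data warehouse. A direct load path also runs from the source to the data warehouse.]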
Tool                     Company Name
Informatica              Informatica Corporation
DT/Studio                Embarcadero Technologies
Data Stage               IBM
Ab Initio
Data Junction            Pervasive Software
                         Oracle Corporation
                         Microsoft
TransformOnDemand        Solonde
Transformation Manager   ETL Solutions
Informatica Corporation
A market-leading provider of e-business infrastructure and
analytic software which enables customers to automate the
integration, analysis, and real-time delivery of critical
corporate information via web, wireless, and voice
Informatica applications include
eCRM application
eBusiness Operations application
eProcurement
More than 1,370 customers, including 60 percent of the
Fortune 100 companies, are using Informatica's analytic
solutions
More than 900 companies are using Informatica products
Informatica Headquarters
Founded in 1993
HQ : Redwood City, CA
Informatica Architecture
Informatica Components
Server Components
1. Informatica Server
2. Repository Server
Client Components
1. Repository Server Administration Console
2. Repository Manager
3. Designer
4. Workflow Manager
5. Workflow Monitor
Thank You!