
Introduction to Data Warehousing
Copyright@ Pawan Likhi

Agenda

Operational Systems
Overview of Data Warehousing
Data Warehouse Architectures
The ETL Process
Informatica Introduction


Operational Systems

What is an Operational System?


Operational systems are just what their name implies: they are the systems that help us run the day-to-day enterprise operations.

They are the backbone systems of any enterprise, supporting functions such as order entry and inventory.

Classic examples include airline reservations, credit-card authorizations, and ATM withdrawals.

Characteristics of Operational Systems

Continuous availability
Predefined access paths
Transaction integrity
High transaction volume
Low data volume per query
Support for day-to-day control operations
Large number of users

Data Warehousing


Data Warehouse Definition


The Data Warehouse is a

Subject-oriented
Integrated
Time-variant
Non-volatile

collection of data in support of management decision processes.

Data Warehouse: Differences from Operational Systems

[Diagram: operational systems (Order Entry, Billing, Accounting) feeding data warehouse subject areas (Customer, Usage, Revenue)]

Operational data is organized by specific processes or tasks and is maintained by separate systems.

Warehoused data is organized by subject area and is populated from many operational systems.

Data Warehouse: Differences from Operational Systems

Operational Systems

Application specific
Applications and their databases were designed and built separately
Evolved over long periods of time

Data Warehouse

Integrated
Designed (or architected) at one time, from the start; implemented iteratively over short periods of time

Data Warehouse: Differences from Operational Systems

Operational Systems: primarily concerned with current data.

Data Warehouse: generally concerned with historical data.

Data Warehouse: Differences from Operational Systems

Operational Systems (database update): Insert, Update, Delete. Constant change; data is updated constantly, according to need rather than a fixed schedule.

Data Warehouse (load/update): Initial load, then incremental loads and Inserts. Added to regularly, but loaded data is rarely directly changed; the warehouse holds consistent points in time.

This does NOT mean the data warehouse is never updated or never changes!

Data in a Data Warehouse

What about the data in the data warehouse?

Separate DSS database
Storage of data only; no data is created
Integrated and scrubbed data
Historical data
Read only (no recasting of history)
Various levels of summarization
Subject oriented
Easily accessible

Life Cycle of the DW

[Diagram: operational databases feed the warehouse database via a first-time load, followed by periodic refreshes; aged data is purged or archived]

Data Warehouse: Application Areas

Following are some business applications of a data warehouse:

Risk management
Financial analysis
Marketing programs
Profit trends
Procurement analysis
Inventory analysis
Statistical analysis
Claims analysis
Manufacturing optimization
Customer relationship management

[Diagram: Any Source (operational data, external data, Web) -> Any Data (relational/multidimensional, text, image, spatial, audio, video) -> Any Access (relational tools, OLAP tools, applications/Web)]

Types of Data Warehouses

Enterprise Data Warehouse - An enterprise data warehouse provides a central database for decision support throughout the enterprise.

ODS (Operational Data Store) - An ODS has a broad, enterprise-wide scope, but unlike the real enterprise data warehouse, its data is refreshed in near real time and used for routine business activity (for example, finding the status of a customer order).

Data Mart - A data mart is a subset of the data warehouse that supports a particular region, business unit, or business function.

Generic Warehouse Architecture

[Diagram: clients issue query & analysis requests against the warehouse; a design phase produces the warehouse and its metadata; an integrator performs loading and maintenance, fed by extractor/monitor components (one per source) with optimization applied along the way]

Issues in Data Warehousing

Warehouse design
Extraction: wrappers, monitors (change detectors)
Integration: cleansing & merging
Warehousing specification & maintenance
Optimizations
Miscellaneous (e.g., evolution)

OLTP vs. OLAP

OLTP: On-Line Transaction Processing - describes processing at operational sites.

OLAP: On-Line Analytical Processing - describes processing at the warehouse.

Warehouse is a Specialized DB

Standard DB (OLTP):
Mostly updates
Many small transactions
MB - GB of data
Current snapshot
Index/hash on primary key
Raw data
Thousands of users (e.g., clerical users)

Warehouse (OLAP):
Mostly reads
Queries are long and complex
GB - TB of data
History
Lots of scans
Summarized, reconciled data
Hundreds of users (e.g., decision-makers, analysts)

Data Marts


What is a Data Mart?

A data mart is a decentralized subset of data found either in a data warehouse or as a standalone subset designed to support the unique business-unit requirements of a specific decision-support system.

Data marts have specific business-related purposes, such as measuring the impact of marketing promotions, or measuring and forecasting sales performance.

[Diagram: data marts drawn from an enterprise data warehouse]

Data Marts: Main Features

Low cost
Controlled locally rather than centrally, conferring power on the user group
Contain less information than the warehouse
Rapid response
More easily understood and navigated than an enterprise data warehouse
Within the range of divisional or departmental budgets

Advantages of a Data Mart over a Data Warehouse

Typically a single subject area and fewer dimensions
Limited feeds
Very quick time to market (30-120 days to pilot)
Quick impact on bottom-line problems
Focused user needs
Limited scope
Optimum model for DW construction
Demonstrates ROI
Allows prototyping

Disadvantages of a Data Mart

Does not provide an integrated view of business information
Uncontrolled proliferation of data marts results in redundancy
A larger number of data marts is more complex to maintain
Scalability issues for large numbers of users and increased data volume

Basic Data Warehouse Architecture

[Diagram: operational & external data flows through a data staging layer into the data warehouse (including the ODS and data marts); a metadata management layer spans the stack; an information access layer (reporting tools, mining, OLAP, information servers, Web browsers) serves end users, with administration alongside]

Operational & External Data Layer

The database of record
Consists of system-specific reference data and event data
Source of data for the data warehouse
Contains detailed data
Continually changes due to updates
Stores data up to the last transaction

Data Staging Layer

Extracts data from operational and external databases
Transforms the data and loads it into the data warehouse
This includes decoding production data and merging records from multiple DBMS formats

Data Warehouse Layer

Stores data used for informational analysis
Presents summarized data to the end user for analysis
The nature of the operational data, the end-user requirements, and the business objectives of the enterprise determine the structure

Meta Data Layer

Metadata is data about data
Stored in a repository
Contains all corporate metadata resources: database catalogs, data dictionaries

Process Management Layer

The scheduler or high-level job control
Builds and maintains the data warehouse and data directory information
Keeps the data warehouse up to date

Information Access Layer

Interfaced with the data warehouse through an OLAP server
Performs analytical operations and presents data for analysis
End users generate ad-hoc reports and perform multidimensional analysis using OLAP tools

Dimensional Modeling

Facts and Measures

Sa

s
e
l

e
v
Re

Pr o
fita
bili

ty

e
u
n

Net Pr
ofit

Gros
s Ma
rgin

st
o
C

Facts or Measures are the Key Performance Indicators of


an enterprise
Factual data about the subject area
Numeric, summarized
Copyright@ Pawan Likhi

Dimension

[Diagram: a measure, such as sales revenue, surrounded by qualifying questions]

What was sold?
Whom was it sold to?
When was it sold?
Where was it sold?

Dimensions put measures in perspective
What, when, and where qualifiers to the measures
Dimensions could be products, customers, time, geography, etc.

Some Examples of Data Warehousing Dimensions

The following dimensions are common, in various forms, in all data warehouses:

Product dimension
Customer dimension
Geographic dimension
Time dimension

Modeling

Warehouses differ from operational structures:
Analytical requirements
Subject orientation

Data must map to subject-oriented information:
Identify business subjects
Define relationships between subjects
Name the attributes of each subject

Modeling is iterative

Modeling the Data Warehouse

1. Defining the business model (select a business process)
2. Creating the dimensional model
3. Modeling summaries
4. Creating the physical model

Identifying Business Rules

Location: geographic proximity (0-1 miles, 1-5 miles, > 5 miles)
Time: Month > Quarter > Year
Product: Type (PC, Server); Monitor (15 inch, 17 inch, 19 inch, None); Status (New, Rebuilt, Custom)
Store: Store > District > Region

Creating the Dimensional Model

Identify fact tables:
Translate business measures into fact tables
Analyze source system information for additional measures
Identify base and derived measures
Document additivity of measures

Identify dimension tables:
Link fact tables to the dimension tables
Create views for users

Dimension Tables

Dimension tables have the following characteristics:
Contain textual information that represents the attributes of the business
Contain relatively static data
Are joined to a fact table through a foreign key reference

[Diagram: a central fact table (units, price) surrounded by Product, Channel, Customer, and Time dimensions]

Fact Tables

Fact tables have the following characteristics:
Contain numeric measures (metrics) of the business
May contain summarized (aggregated) data
May contain date-stamped data
Have a key value that is typically a concatenated key composed of the primary keys of the dimensions
Are joined to dimension tables through foreign keys that reference primary keys in the dimension tables

Dimensional Model (Star Schema)

[Diagram: a central fact table (units, price) joined to Product, Channel, Customer, and Time dimension tables]

Star Schema Model

Sales Fact Table: Product_id, Store_id, Item_id, Day_id, Sales_dollars, Sales_units, ...

Radiating dimension tables:
Product Table: Product_id, Product_desc
Store Table: Store_id, District_id, ...
Time Table: Day_id, Month_id, Period_id, Year_id
Item Table: Item_id, Item_desc, ...

Central fact table
Radiating dimensions
Denormalized model
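The star schema above can be sketched as DDL. This is a minimal sketch using Python's sqlite3; table names are lowercased variants of the slide's, and the concatenated primary key on the fact table follows the Fact Tables slide.

```python
import sqlite3

# In-memory database for the sketch; column lists follow the slide's
# Sales fact table and its Product, Store, Item, and Time dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, product_desc TEXT);
CREATE TABLE store_dim   (store_id   INTEGER PRIMARY KEY, district_id INTEGER);
CREATE TABLE item_dim    (item_id    INTEGER PRIMARY KEY, item_desc TEXT);
CREATE TABLE time_dim    (day_id     INTEGER PRIMARY KEY, month_id INTEGER,
                          period_id  INTEGER, year_id INTEGER);
-- Central fact table: concatenated key built from the dimension keys
CREATE TABLE sales_fact (
    product_id INTEGER REFERENCES product_dim(product_id),
    store_id   INTEGER REFERENCES store_dim(store_id),
    item_id    INTEGER REFERENCES item_dim(item_id),
    day_id     INTEGER REFERENCES time_dim(day_id),
    sales_dollars REAL,
    sales_units   INTEGER,
    PRIMARY KEY (product_id, store_id, item_id, day_id)
);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
```

Note the denormalization: store_dim carries district_id directly rather than pointing at a separate district table, which is exactly what the snowflake variant below changes.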

Star Schema Model

Easy for users to understand
Fast response to queries
Simple metadata
Supported by many front-end tools
Less robust to change
Slower to build

Snowflake Schema Model

Sales Fact Table: Item_id, Store_id, Sales_dollars, Sales_units

Normalized dimension chains:
Product Table: Product_id, Product_desc
Store Table: Store_id, Store_desc, District_id
District Table: District_id, District_desc
Time Table: Week_id, Period_id, Year_id
Item Table: Item_id, Item_desc, Dept_id
Dept Table: Dept_id, Dept_desc, Mgr_id
Mgr Table: Dept_id, Mgr_id, Mgr_name
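Compared with the star schema, the snowflake normalizes each dimension into lookup chains, so resolving a store's district takes an extra join. A minimal sketch with illustrative data:

```python
# Snowflaked store dimension: store -> district lookup chain.
# Data values are made up for illustration.
store_table = {1: {"store_desc": "Downtown", "district_id": 7}}
district_table = {7: {"district_desc": "North"}}

def district_of(store_id):
    """Follow the snowflake chain: one extra lookup per normalized level."""
    district_id = store_table[store_id]["district_id"]
    return district_table[district_id]["district_desc"]
```

In the star schema, the district description would be stored redundantly on every store row; the snowflake trades that redundancy for the extra join shown here.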

Snowflake Schema Model

Direct use by some tools


More flexible to change
Provides for speedier data loading
May become large and unmanageable
Degrades query performance
More complex metadata


Four Steps in Dimensional Modeling

1. Identify the process being modeled.
2. Determine the grain at which facts will be stored.
3. Choose the dimensions.
4. Identify the numeric measures for the facts.

Retail Sales Questions

What is the lift due to a promotion?
Lift = gain in sales of a product because it is being promoted
Requires an estimated baseline sales value
Could be calculated based on historical sales figures

Detect time shifting
Customers stock up on the product that's on sale
Then they don't buy more of it for a long time

Detect cannibalization
Customers buy the promoted product instead of competing products
Promoting Brand A reduces sales of Brand B

Detect cross-sell of complementary products
Promoting charcoal increases sales of lighter fluid
Promoting hamburger meat increases sales of hamburger buns

What is the profitability of a promotion?
Considering promotional costs, discounts, lift, time shifting, cannibalization, and cross-sell
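Lift as defined above can be sketched as actual sales minus an estimated baseline, with the baseline taken from historical (non-promoted) figures. The numbers here are illustrative:

```python
# Baseline estimated from historical weekly sales with no promotion running.
historical_weekly_sales = [100, 95, 105, 100]
baseline = sum(historical_weekly_sales) / len(historical_weekly_sales)

# Sales observed during the promoted week.
promoted_week_sales = 160

# Lift: the gain attributable to the promotion over the baseline estimate.
lift = promoted_week_sales - baseline
```

A real analysis would net out time shifting and cannibalization before attributing the full gain to the promotion, as the bullets above note.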

Grain of a Fact Table

Grain of a fact table = the meaning of one fact table row
Determines the maximum level of detail of the warehouse
Example grain statements (one fact row represents a):
Line item from a cash register receipt
Boarding pass to get on a flight
Daily snapshot of inventory level for a product in a warehouse
Sensor reading per minute for a sensor
Student enrolled in a course

Finer-grained fact tables:
are more expressive
have more rows

Trade-off between performance and expressiveness
Rule of thumb: err in favor of expressiveness
Pre-computed aggregates can solve performance problems
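The trade-off can be sketched directly: fine-grained line-item facts can always be rolled up into a coarser pre-computed aggregate, but the aggregate can never recover the line items. Data values are illustrative:

```python
from collections import defaultdict

# Grain: one row per line item on a receipt (receipt_id, product, units).
line_items = [
    ("r1", "charcoal", 2), ("r1", "lighter fluid", 1),
    ("r2", "charcoal", 1), ("r3", "buns", 4),
]

# Rolling up to a coarser grain (total units per product) is a pure
# aggregation; going the other direction would be impossible.
units_by_product = defaultdict(int)
for _, product, units in line_items:
    units_by_product[product] += units
```

This is why the rule of thumb favors expressiveness: the aggregate can be rebuilt at load time, while detail discarded at load time is gone for good.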

Surrogate Keys

Primary keys of dimension tables should be surrogate keys, not natural keys
Natural key: a key that is meaningful to users
Surrogate key: a meaningless integer key that is assigned by the data warehouse
Keys or codes generated by operational systems are natural keys (avoid using these as keys!)
E.g., account number, UPC code, Social Security Number

Benefits of Surrogate Keys

Data warehouse is insulated from changes to operational systems
Easy to integrate data from multiple systems
What if there's a merger or acquisition?

Narrow dimension keys -> thinner fact table -> better performance
This can actually make a big performance difference.

Better handling of exceptional cases
For example, what if the value is unknown or TBD?
Using NULL is a poor option
Three-valued logic is not intuitive to users
They will get their queries wrong
Join performance will suffer
Better: explicit dimension rows for Unknown, TBD, N/A, etc.

Avoids tempting query writers to assume implicit semantics
Example: WHERE date_key < '01/01/2004'
Will facts with an unknown date be included?
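The "explicit dimension rows" idea can be sketched as follows. This is a minimal sketch; the reserved key values (-1, -2) are an illustrative convention, not a standard:

```python
# Reserved surrogate keys for exceptional cases avoid NULL keys in the
# fact table, so three-valued logic never enters user queries.
date_dim = {
    -1: {"date": None, "description": "Unknown"},
    -2: {"date": None, "description": "To Be Determined"},
    20040101: {"date": "2004-01-01", "description": "Normal day"},
}

fact_rows = [
    {"date_key": 20040101, "amount": 100.0},
    {"date_key": -1, "amount": 55.0},   # date not yet known
]

# Every fact row joins to a dimension row -- no NULL keys, no lost rows.
joined = [(date_dim[r["date_key"]]["description"], r["amount"])
          for r in fact_rows]
```

With a NULL date_key instead, the second fact row would silently drop out of any inner join on the date dimension; here it survives with an explicit "Unknown" label.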

More Dimension Tables

Product
Merchandise hierarchy: SKU -> Brand -> Category -> Department
Other attributes: product name, size, weight, package type, etc.

Store
Geography hierarchy: Store -> ZIP Code -> County -> State
Administrative hierarchy: Store -> District -> Region
Other attributes: address, store name, store manager, square footage, etc.

Hierarchies

Common in dimension tables
Multiple hierarchies can appear in the same dimension
Don't need to be strict hierarchies
e.g., a ZIP code that spans 2 counties

Factless Fact Tables

A factless fact table is a fact table without numeric fact columns
Used to capture relationships between dimensions

Example: a student/department mapping fact table
What is the major field of study for each student?
Answerable even for students who didn't enroll in any courses
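The student/department example can be sketched as a query over a factless fact table. A minimal sketch; the names and keys are illustrative:

```python
# Dimension tables: surrogate key -> descriptive attribute.
student_dim = {1: "Ana", 2: "Ben", 3: "Cal"}
department_dim = {10: "History", 20: "Physics"}

# Factless fact table: one row per student/major relationship, with no
# numeric measure columns -- the row's existence IS the fact. It covers
# every student, even those with no course enrollments.
major_fact = [(1, 10), (2, 20), (3, 20)]

# "Who majors in Physics?" answered directly from the relationship rows.
majors_in_physics = sorted(student_dim[s] for s, d in major_fact if d == 20)
```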

SCD (Slowly Changing Dimensions)

The usual changes to dimension tables are classified into three types:
Type 1
Type 2
Type 3
We will consider the points discussed earlier when deciding which type to use

Type 1 Changes

Usually relate to corrections of errors in the source system
For example, in the customer dimension: Mickey Schreiber -> Miky Schreiber
What will happen when the number of children is changed?

Type 1 Changes, cont.

General principles for Type 1 changes:
Usually, the changes relate to correction of errors in the source system
Sometimes the change in the source system has no significance
The old value in the source system needs to be discarded
The change in the source system need not be preserved in the DWH
What will happen when only the last value before the change is needed?

Type 2 Changes

Let's look at the marital status of Miky Schreiber
One of the DWH's requirements is to track orders by marital status (in addition to other attributes)
All orders before 11/10/2004 will be under Marital Status = Single, and all orders after that date will be under Marital Status = Married
We need to aggregate the orders before and after the marriage separately
Let's make life harder:
Miky lives on Negba St., but on 30/8/2009 he moves to Avivim St.

Type 2 Changes, cont.

General principles for Type 2 changes:
They usually relate to true changes in source systems
There is a need to preserve history in the DWH
This type of change partitions the history in the DWH
Every change to the same attributes must be preserved
Must we track changes for all the attributes?
For which attributes will we track changes? What are the considerations?
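A Type 2 change can be sketched as row versioning with effective dates. The column names (valid_from, valid_to, customer_key) are an illustrative convention; the customer data echoes the slide's Miky Schreiber example:

```python
from datetime import date

# Type 2: a true change closes the current dimension row and inserts a
# new row with a new surrogate key, partitioning history at the change.
customer_dim = [
    {"customer_key": 1, "customer_id": "C100", "name": "Miky Schreiber",
     "marital_status": "Single",
     "valid_from": date(2000, 1, 1), "valid_to": None},
]

def apply_type2_change(dim, customer_id, attribute, new_value, change_date):
    """Close the open row for customer_id and append a new version."""
    current = next(r for r in dim
                   if r["customer_id"] == customer_id and r["valid_to"] is None)
    current["valid_to"] = change_date
    new_row = dict(current, valid_from=change_date, valid_to=None,
                   customer_key=max(r["customer_key"] for r in dim) + 1)
    new_row[attribute] = new_value
    dim.append(new_row)

# Miky marries on 11/10/2004: history is now partitioned at that date.
apply_type2_change(customer_dim, "C100", "marital_status", "Married",
                   date(2004, 10, 11))
```

Fact rows loaded before the change keep pointing at surrogate key 1 (Single), and later facts point at key 2 (Married), which is what makes the before/after aggregation possible.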

Type 3 Changes

Not common at all
Complex queries on Type 2 changes may be:
Hard to implement
Time-consuming
Hard to maintain
We want to track history without lifting a heavy burden
There are many soft changes, and we don't care about the far history

Type 3 Changes, cont.

General principles:
They usually relate to soft or tentative changes in the source systems
There is a need to keep track of history with old and new values of the changed attribute
They are used to compare performance across the transition
They provide the ability to track forward and backward

ETL Concepts

ETL stands for extraction, transformation, and loading. ETL refers to the methods involved in accessing and manipulating source data and loading it into a target database.

The first step in the ETL process is mapping the data between the source systems and the target database (data warehouse or data mart). The second step is cleansing of the source data in the staging area. The third step is transforming the cleansed source data and then loading it into the target system.

ETL is the process by which data is integrated and transformed from the operational systems into the data warehouse environment.
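The three steps described above can be sketched end to end as a toy pipeline. This is an illustration only; the employee records echo the Transformation slide later in the deck, and the field names are hypothetical:

```python
# Toy ETL pass: extract rows from a source list, cleanse them in a
# staging step, transform to the target mapping, and load a target list.
source_rows = [
    {"emp_id": "10001", "last": " Jones ", "first": "Indiana"},
    {"emp_id": "10002", "last": "Holmes", "first": "Sherlock"},
    {"emp_id": "",      "last": "Bad",    "first": "Row"},   # fails cleansing
]

def cleanse(row):
    """Staging-area cleansing: trim whitespace, reject rows missing a key."""
    if not row["emp_id"].strip():
        return None
    return {k: v.strip() for k, v in row.items()}

def transform(row):
    """Apply the mapping: derive a full name for the target schema."""
    return {"emp_id": int(row["emp_id"]),
            "name": f'{row["first"]} {row["last"]}'}

staged = [r for r in (cleanse(r) for r in source_rows) if r is not None]
target = [transform(r) for r in staged]
```

Real ETL tools express the same map/cleanse/transform/load stages declaratively (the mappings and workflows described in the Informatica sections below), rather than as hand-written code.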

ETL Glossary
Source System
A database, application, file, or other storage facility from which the
data in a data warehouse is derived.
Mapping
The definition of the relationship and data flow between source and
target objects.
Metadata
Data that describes data and other structures, such as objects,
business rules, and processes. For example, the schema design of a
data warehouse is typically stored in a repository as metadata, which
is used to generate scripts used to build and populate the data
warehouse. A repository contains metadata.
Staging Area
A place where data is processed before entering the warehouse.

ETL Glossary
Cleansing
The process of resolving inconsistencies and fixing the anomalies in
source data, typically as part of the ETL process.
Transformation
The process of manipulating data. Any manipulation beyond copying
is a transformation. Examples include cleansing, aggregating, and
integrating data from multiple sources.
Transportation
The process of moving copied or transformed data from a source to
a data warehouse.
Target System
A database, application, file, or other storage facility to which the
"transformed source data" is loaded in a data warehouse.

Sample ETL Process Flow

[Diagram: sample ETL process flow]

Detailed ETL Process Flow

[Diagram: filters and extractors pull data from the operational systems (e.g., extraction of 80 source tables from Oracle, Sybase, and text files); a cleanser applies cleaning rules and a transformation engine applies transformation rules, each with an error view for check-and-correct cycles; an integrator and loader then populate the warehouse target (e.g., 50 tables)]

Transformation

Source:

Emp id | Last Name | First Name
10001  | Jones     | Indiana
10002  | Holmes    | Sherlock

Staging area: Name = Concat(First Name, Last Name)

Indiana Jones
Sherlock Holmes

Loading

[Diagram: raw source data either takes a direct load path into the data warehouse, or passes through the staging area for cleaning, transformation, and integration; the cleaned, transformed, and integrated data is then loaded]

Popular ETL Tools

Tool Name                        | Company Name
Informatica                      | Informatica Corporation
DT/Studio                        | Embarcadero Technologies
Data Stage                       | IBM
Ab Initio                        | Ab Initio Software Corporation
Data Junction                    | Pervasive Software
Oracle Warehouse Builder         | Oracle Corporation
Microsoft SQL Server Integration | Microsoft
TransformOnDemand                | Solonde
Transformation Manager           | ETL Solutions

Informatica Corporation

A market-leading provider of e-business infrastructure and analytic software, which enables customers to automate the integration, analysis, and real-time delivery of critical corporate information via web, wireless, and voice.

Informatica applications include:
eCRM application
eBusiness Operations application
eProcurement

More than 1,370 customers, including 60 percent of the Fortune 100 companies, are using Informatica's analytic solutions.
More than 900 companies are using Informatica products.

Informatica Headquarters

Founded in 1993
HQ: Redwood City, CA

Informatica PowerCenter Glossary

Repository: This is where all the metadata information is stored in the Informatica suite. The PowerCenter Client and the Repository Server access this repository to retrieve, store, and manage metadata.

PowerCenter Client: The Informatica client is used for managing users, identifying source and target system definitions, creating mappings and mapplets, creating sessions, and running workflows.

Repository Server: The repository server takes care of all the connections between the repository and the PowerCenter Client.

PowerCenter Server: The PowerCenter server does the extraction from the source and then loads the data into targets.

Informatica PowerCenter Glossary

Designer: Source Analyzer, Mapping Designer, and Warehouse Designer are tools that reside within the Designer. Source Analyzer is used for extracting metadata from source systems. Mapping Designer is used to create mappings between sources and targets; a mapping is a pictorial representation of the flow of data from source to target. Warehouse Designer is used for extracting metadata from target systems, or metadata can be created in the Designer itself.

Data Cleansing: PowerCenter's data cleansing technology improves data quality by validating, correctly naming, and standardizing address data. A person's address may not be the same in all source systems because of typos, or the postal code and city name may not match the address. These errors can be corrected using the data cleansing process, and the standardized data can be loaded into target systems (the data warehouse).

Informatica PowerCenter Glossary

Transformation: Transformations transform the source data according to the requirements of the target system. Sorting, filtering, aggregation, and joining are some examples of transformations. Transformations ensure the quality of the data being loaded into the target, and this is done during the mapping process from source to target.

Workflow Manager: A workflow loads the data from source to target in a sequential manner. For example, if the fact tables are loaded before the lookup tables, the target system will raise an error because the fact table violates foreign key validation. To avoid this, workflows can be created to ensure the correct flow of data from source to target.

Workflow Monitor: This monitor is helpful in monitoring and tracking the workflows created on each PowerCenter Server.

Informatica Architecture

[Diagram: Informatica architecture]

Informatica Components
Server Components
1. Informatica Server
2. Repository Server

Client Components
1. Repository Server Administration Console
2. Repository Manager
3. Designer
4. Workflow Manager
5. Workflow Monitor

Thank You!
