
Introduction to Data Warehousing
Copyright@ Pawan Likhi

Agenda

Operational Systems
Overview of Data Warehousing
Data Warehouse Architectures
The ETL Process
Informatica Introduction


Operational Systems

What is an Operational System?


Operational systems are just what their name implies: they are the systems that help us run the day-to-day enterprise operations.

They are the backbone systems of any enterprise, supporting functions such as order entry and inventory.

Classic examples include airline reservations, credit-card authorizations, and ATM withdrawals.

Characteristics of Operational Systems

Continuous availability
Predefined access paths
Transaction integrity
High transaction volume
Low data volume per query
Support for day-to-day control operations
Large number of users

Data Warehousing


Data Warehouse Definition


The Data Warehouse is a

Subject-oriented
Integrated
Time-variant
Non-volatile

collection of data in support of management decision processes.

Data Warehouse: Differences from Operational Systems

[Diagram: operational systems (Order Entry, Billing, Accounting) feeding data warehouse subject areas (Customer, Usage, Revenue)]

Operational data is organized by specific processes or tasks and is maintained by separate systems.

Warehoused data is organized by subject area and is populated from many operational systems.

Data Warehouse: Differences from Operational Systems

Operational Systems

Application specific
Applications and their databases were designed and built separately
Evolved over long periods of time

Data Warehouse

Integrated
Designed (or architected) at one time, from the start; implemented iteratively over short periods of time

Data Warehouse: Differences from Operational Systems

Operational Systems: primarily concerned with current data.

Data Warehouse: generally concerned with historical data.

Data Warehouse: Differences from Operational Systems

Operational Systems (database update): Insert, Update, Delete. Constant change; data is updated constantly, according to need rather than a fixed schedule.

Data Warehouse (load/update): Initial load, then incremental loads and Inserts. Added to regularly, but loaded data is rarely directly changed; the warehouse holds consistent points in time.

This does NOT mean the data warehouse is never updated or never changes!

Data in a Data Warehouse

What about the data in the data warehouse?

Separate DSS database
Storage of data only; no data is created
Integrated and scrubbed data
Historical data
Read only (no recasting of history)
Various levels of summarization
Subject oriented
Easily accessible

Life Cycle of the DW

[Diagram: operational databases feed the warehouse database via a first-time load, followed by periodic refreshes; aged data is purged or archived]

Data Warehouse: Application Areas

Following are some business applications of a data warehouse:

Risk management
Financial analysis
Marketing programs
Profit trends
Procurement analysis
Inventory analysis
Statistical analysis
Claims analysis
Manufacturing optimization
Customer relationship management

[Diagram: Any Source (operational data, external data, Web) -> Any Data (relational/multidimensional, text, image, spatial, audio, video) -> Any Access (relational tools, OLAP tools, applications/Web)]

Types of Data Warehouses

Enterprise Data Warehouse - An enterprise data warehouse provides a central database for decision support throughout the enterprise.

ODS (Operational Data Store) - An ODS has a broad, enterprise-wide scope, but unlike the real enterprise data warehouse, its data is refreshed in near real time and used for routine business activity (for example, finding the status of a customer order).

Data Mart - A data mart is a subset of the data warehouse that supports a particular region, business unit, or business function.

Generic Warehouse Architecture

[Diagram: clients issue query & analysis requests against the warehouse; a design phase produces the warehouse and its metadata; an integrator performs loading and maintenance, fed by extractor/monitor components (one per source) with optimization applied along the way]

Issues in Data Warehousing

Warehouse design
Extraction: wrappers, monitors (change detectors)
Integration: cleansing & merging
Warehousing specification & maintenance
Optimizations
Miscellaneous (e.g., evolution)

OLTP vs. OLAP

OLTP: On-Line Transaction Processing - describes processing at operational sites.

OLAP: On-Line Analytical Processing - describes processing at the warehouse.

Warehouse is a Specialized DB

Standard DB (OLTP):
Mostly updates
Many small transactions
MB - GB of data
Current snapshot
Index/hash on primary key
Raw data
Thousands of users (e.g., clerical users)

Warehouse (OLAP):
Mostly reads
Queries are long and complex
GB - TB of data
History
Lots of scans
Summarized, reconciled data
Hundreds of users (e.g., decision-makers, analysts)

Data Marts


What is a Data Mart?

A data mart is a decentralized subset of data found either in a data warehouse or as a standalone subset designed to support the unique business-unit requirements of a specific decision-support system.

Data marts have specific business-related purposes, such as measuring the impact of marketing promotions, or measuring and forecasting sales performance.

[Diagram: data marts drawn from an enterprise data warehouse]

Data Marts: Main Features

Low cost
Controlled locally rather than centrally, conferring power on the user group
Contain less information than the warehouse
Rapid response
More easily understood and navigated than an enterprise data warehouse
Within the range of divisional or departmental budgets

Advantages of a Data Mart over a Data Warehouse

Typically a single subject area and fewer dimensions
Limited feeds
Very quick time to market (30-120 days to pilot)
Quick impact on bottom-line problems
Focused user needs
Limited scope
Optimum model for DW construction
Demonstrates ROI
Allows prototyping

Disadvantages of a Data Mart

Does not provide an integrated view of business information
Uncontrolled proliferation of data marts results in redundancy
A larger number of data marts is more complex to maintain
Scalability issues for large numbers of users and increased data volume

Basic Data Warehouse Architecture

[Diagram: operational & external data flows through a data staging layer into the data warehouse (including the ODS and data marts); a metadata management layer spans the stack; an information access layer (reporting tools, mining, OLAP, information servers, Web browsers) serves end users, with administration alongside]

Operational & External Data Layer

The database of record
Consists of system-specific reference data and event data
Source of data for the data warehouse
Contains detailed data
Continually changes due to updates
Stores data up to the last transaction

Data Staging Layer

Extracts data from operational and external databases
Transforms the data and loads it into the data warehouse
This includes decoding production data and merging records from multiple DBMS formats

Data Warehouse Layer

Stores data used for informational analysis
Presents summarized data to the end user for analysis
The nature of the operational data, the end-user requirements, and the business objectives of the enterprise determine the structure

Meta Data Layer

Metadata is data about data
Stored in a repository
Contains all corporate metadata resources: database catalogs, data dictionaries

Process Management Layer

The scheduler or high-level job control
Builds and maintains the data warehouse and data directory information
Keeps the data warehouse up to date

Information Access Layer

Interfaced with the data warehouse through an OLAP server
Performs analytical operations and presents data for analysis
End users generate ad-hoc reports and perform multidimensional analysis using OLAP tools

Dimensional Modeling

Facts and Measures

Sa

s
e
l

e
v
Re

Pr o
fita
bili

ty

e
u
n

Net Pr
ofit

Gros
s Ma
rgin

st
o
C

Facts or Measures are the Key Performance Indicators of


an enterprise
Factual data about the subject area
Numeric, summarized
Copyright@ Pawan Likhi

Dimension

[Diagram: a measure, such as sales revenue, surrounded by qualifying questions]

What was sold?
Whom was it sold to?
When was it sold?
Where was it sold?

Dimensions put measures in perspective
What, when, and where qualifiers to the measures
Dimensions could be products, customers, time, geography, etc.

Some Examples of Data Warehousing Dimensions

The following dimensions are common, in various forms, in all data warehouses:

Product dimension
Customer dimension
Geographic dimension
Time dimension

Modeling

Warehouses differ from operational structures:
Analytical requirements
Subject orientation

Data must map to subject-oriented information:
Identify business subjects
Define relationships between subjects
Name the attributes of each subject

Modeling is iterative

Modeling the Data Warehouse

1. Defining the business model (select a business process)
2. Creating the dimensional model
3. Modeling summaries
4. Creating the physical model

Identifying Business Rules

Location: geographic proximity (0-1 miles, 1-5 miles, > 5 miles)
Time: Month > Quarter > Year
Product: Type (PC, Server); Monitor (15 inch, 17 inch, 19 inch, None); Status (New, Rebuilt, Custom)
Store: Store > District > Region

Creating the Dimensional Model

Identify fact tables:
Translate business measures into fact tables
Analyze source system information for additional measures
Identify base and derived measures
Document additivity of measures

Identify dimension tables:
Link fact tables to the dimension tables
Create views for users

Dimension Tables

Dimension tables have the following characteristics:
Contain textual information that represents the attributes of the business
Contain relatively static data
Are joined to a fact table through a foreign key reference

[Diagram: a central fact table (units, price) surrounded by Product, Channel, Customer, and Time dimensions]

Fact Tables

Fact tables have the following characteristics:
Contain numeric measures (metrics) of the business
May contain summarized (aggregated) data
May contain date-stamped data
Have a key value that is typically a concatenated key composed of the primary keys of the dimensions
Are joined to dimension tables through foreign keys that reference primary keys in the dimension tables

Dimensional Model (Star Schema)

[Diagram: a central fact table (units, price) joined to Product, Channel, Customer, and Time dimension tables]

Star Schema Model

Sales Fact Table: Product_id, Store_id, Item_id, Day_id, Sales_dollars, Sales_units, ...

Radiating dimension tables:
Product Table: Product_id, Product_desc
Store Table: Store_id, District_id, ...
Time Table: Day_id, Month_id, Period_id, Year_id
Item Table: Item_id, Item_desc, ...

Central fact table
Radiating dimensions
Denormalized model
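The star schema above can be sketched as DDL. This is a minimal sketch using Python's sqlite3; table names are lowercased variants of the slide's, and the concatenated primary key on the fact table follows the Fact Tables slide.

```python
import sqlite3

# In-memory database for the sketch; column lists follow the slide's
# Sales fact table and its Product, Store, Item, and Time dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, product_desc TEXT);
CREATE TABLE store_dim   (store_id   INTEGER PRIMARY KEY, district_id INTEGER);
CREATE TABLE item_dim    (item_id    INTEGER PRIMARY KEY, item_desc TEXT);
CREATE TABLE time_dim    (day_id     INTEGER PRIMARY KEY, month_id INTEGER,
                          period_id  INTEGER, year_id INTEGER);
-- Central fact table: concatenated key built from the dimension keys
CREATE TABLE sales_fact (
    product_id INTEGER REFERENCES product_dim(product_id),
    store_id   INTEGER REFERENCES store_dim(store_id),
    item_id    INTEGER REFERENCES item_dim(item_id),
    day_id     INTEGER REFERENCES time_dim(day_id),
    sales_dollars REAL,
    sales_units   INTEGER,
    PRIMARY KEY (product_id, store_id, item_id, day_id)
);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
```

Note the denormalization: store_dim carries district_id directly rather than pointing at a separate district table, which is exactly what the snowflake variant below changes.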

Star Schema Model

Easy for users to understand
Fast response to queries
Simple metadata
Supported by many front-end tools
Less robust to change
Slower to build

Snowflake Schema Model

Sales Fact Table: Item_id, Store_id, Sales_dollars, Sales_units

Normalized dimension chains:
Product Table: Product_id, Product_desc
Store Table: Store_id, Store_desc, District_id
District Table: District_id, District_desc
Time Table: Week_id, Period_id, Year_id
Item Table: Item_id, Item_desc, Dept_id
Dept Table: Dept_id, Dept_desc, Mgr_id
Mgr Table: Dept_id, Mgr_id, Mgr_name
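Compared with the star schema, the snowflake normalizes each dimension into lookup chains, so resolving a store's district takes an extra join. A minimal sketch with illustrative data:

```python
# Snowflaked store dimension: store -> district lookup chain.
# Data values are made up for illustration.
store_table = {1: {"store_desc": "Downtown", "district_id": 7}}
district_table = {7: {"district_desc": "North"}}

def district_of(store_id):
    """Follow the snowflake chain: one extra lookup per normalized level."""
    district_id = store_table[store_id]["district_id"]
    return district_table[district_id]["district_desc"]
```

In the star schema, the district description would be stored redundantly on every store row; the snowflake trades that redundancy for the extra join shown here.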

Snowflake Schema Model

Direct use by some tools


More flexible to change
Provides for speedier data loading
May become large and unmanageable
Degrades query performance
More complex metadata


Four Steps in Dimensional Modeling

1. Identify the process being modeled.
2. Determine the grain at which facts will be stored.
3. Choose the dimensions.
4. Identify the numeric measures for the facts.

Retail Sales Questions

What is the lift due to a promotion?
Lift = gain in sales of a product because it is being promoted
Requires an estimated baseline sales value
Could be calculated based on historical sales figures

Detect time shifting
Customers stock up on the product that's on sale
Then they don't buy more of it for a long time

Detect cannibalization
Customers buy the promoted product instead of competing products
Promoting Brand A reduces sales of Brand B

Detect cross-sell of complementary products
Promoting charcoal increases sales of lighter fluid
Promoting hamburger meat increases sales of hamburger buns

What is the profitability of a promotion?
Considering promotional costs, discounts, lift, time shifting, cannibalization, and cross-sell
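Lift as defined above can be sketched as actual sales minus an estimated baseline, with the baseline taken from historical (non-promoted) figures. The numbers here are illustrative:

```python
# Baseline estimated from historical weekly sales with no promotion running.
historical_weekly_sales = [100, 95, 105, 100]
baseline = sum(historical_weekly_sales) / len(historical_weekly_sales)

# Sales observed during the promoted week.
promoted_week_sales = 160

# Lift: the gain attributable to the promotion over the baseline estimate.
lift = promoted_week_sales - baseline
```

A real analysis would net out time shifting and cannibalization before attributing the full gain to the promotion, as the bullets above note.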

Grain of a Fact Table

Grain of a fact table = the meaning of one fact table row
Determines the maximum level of detail of the warehouse
Example grain statements (one fact row represents a):
Line item from a cash register receipt
Boarding pass to get on a flight
Daily snapshot of inventory level for a product in a warehouse
Sensor reading per minute for a sensor
Student enrolled in a course

Finer-grained fact tables:
are more expressive
have more rows

Trade-off between performance and expressiveness
Rule of thumb: err in favor of expressiveness
Pre-computed aggregates can solve performance problems
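The trade-off can be sketched directly: fine-grained line-item facts can always be rolled up into a coarser pre-computed aggregate, but the aggregate can never recover the line items. Data values are illustrative:

```python
from collections import defaultdict

# Grain: one row per line item on a receipt (receipt_id, product, units).
line_items = [
    ("r1", "charcoal", 2), ("r1", "lighter fluid", 1),
    ("r2", "charcoal", 1), ("r3", "buns", 4),
]

# Rolling up to a coarser grain (total units per product) is a pure
# aggregation; going the other direction would be impossible.
units_by_product = defaultdict(int)
for _, product, units in line_items:
    units_by_product[product] += units
```

This is why the rule of thumb favors expressiveness: the aggregate can be rebuilt at load time, while detail discarded at load time is gone for good.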

Surrogate Keys

Primary keys of dimension tables should be surrogate keys, not natural keys
Natural key: a key that is meaningful to users
Surrogate key: a meaningless integer key that is assigned by the data warehouse
Keys or codes generated by operational systems are natural keys (avoid using these as keys!)
E.g., account number, UPC code, Social Security Number

Benefits of Surrogate Keys

Data warehouse is insulated from changes to operational systems
Easy to integrate data from multiple systems
What if there's a merger or acquisition?

Narrow dimension keys -> thinner fact table -> better performance
This can actually make a big performance difference.

Better handling of exceptional cases
For example, what if the value is unknown or TBD?
Using NULL is a poor option
Three-valued logic is not intuitive to users
They will get their queries wrong
Join performance will suffer
Better: explicit dimension rows for Unknown, TBD, N/A, etc.

Avoids tempting query writers to assume implicit semantics
Example: WHERE date_key < '01/01/2004'
Will facts with an unknown date be included?
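The "explicit dimension rows" idea can be sketched as follows. This is a minimal sketch; the reserved key values (-1, -2) are an illustrative convention, not a standard:

```python
# Reserved surrogate keys for exceptional cases avoid NULL keys in the
# fact table, so three-valued logic never enters user queries.
date_dim = {
    -1: {"date": None, "description": "Unknown"},
    -2: {"date": None, "description": "To Be Determined"},
    20040101: {"date": "2004-01-01", "description": "Normal day"},
}

fact_rows = [
    {"date_key": 20040101, "amount": 100.0},
    {"date_key": -1, "amount": 55.0},   # date not yet known
]

# Every fact row joins to a dimension row -- no NULL keys, no lost rows.
joined = [(date_dim[r["date_key"]]["description"], r["amount"])
          for r in fact_rows]
```

With a NULL date_key instead, the second fact row would silently drop out of any inner join on the date dimension; here it survives with an explicit "Unknown" label.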

More Dimension Tables

Product
Merchandise hierarchy: SKU -> Brand -> Category -> Department
Other attributes: product name, size, weight, package type, etc.

Store
Geography hierarchy: Store -> ZIP Code -> County -> State
Administrative hierarchy: Store -> District -> Region
Other attributes: address, store name, store manager, square footage, etc.

Hierarchies

Common in dimension tables
Multiple hierarchies can appear in the same dimension
Don't need to be strict hierarchies
e.g., a ZIP code that spans 2 counties

Factless Fact Tables

A factless fact table is a fact table without numeric fact columns
Used to capture relationships between dimensions

Example: a student/department mapping fact table
What is the major field of study for each student?
Answerable even for students who didn't enroll in any courses
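The student/department example can be sketched as a query over a factless fact table. A minimal sketch; the names and keys are illustrative:

```python
# Dimension tables: surrogate key -> descriptive attribute.
student_dim = {1: "Ana", 2: "Ben", 3: "Cal"}
department_dim = {10: "History", 20: "Physics"}

# Factless fact table: one row per student/major relationship, with no
# numeric measure columns -- the row's existence IS the fact. It covers
# every student, even those with no course enrollments.
major_fact = [(1, 10), (2, 20), (3, 20)]

# "Who majors in Physics?" answered directly from the relationship rows.
majors_in_physics = sorted(student_dim[s] for s, d in major_fact if d == 20)
```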

SCD (Slowly Changing Dimensions)

The usual changes to dimension tables are classified into three types:
Type 1
Type 2
Type 3
We will consider the points discussed earlier when deciding which type to use

Type 1 Changes

Usually relate to corrections of errors in the source system
For example, in the customer dimension: Mickey Schreiber -> Miky Schreiber
What will happen when the number of children is changed?

Type 1 Changes, cont.

General principles for Type 1 changes:
Usually, the changes relate to correction of errors in the source system
Sometimes the change in the source system has no significance
The old value in the source system needs to be discarded
The change in the source system need not be preserved in the DWH
What will happen when only the last value before the change is needed?

Type 2 Changes

Let's look at the marital status of Miky Schreiber
One of the DWH's requirements is to track orders by marital status (in addition to other attributes)
All orders before 11/10/2004 will be under Marital Status = Single, and all orders after that date will be under Marital Status = Married
We need to aggregate the orders before and after the marriage separately
Let's make life harder:
Miky lives on Negba St., but on 30/8/2009 he moves to Avivim St.

Type 2 Changes, cont.

General principles for Type 2 changes:
They usually relate to true changes in source systems
There is a need to preserve history in the DWH
This type of change partitions the history in the DWH
Every change to the same attributes must be preserved
Must we track changes for all the attributes?
For which attributes will we track changes? What are the considerations?
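A Type 2 change can be sketched as row versioning with effective dates. The column names (valid_from, valid_to, customer_key) are an illustrative convention; the customer data echoes the slide's Miky Schreiber example:

```python
from datetime import date

# Type 2: a true change closes the current dimension row and inserts a
# new row with a new surrogate key, partitioning history at the change.
customer_dim = [
    {"customer_key": 1, "customer_id": "C100", "name": "Miky Schreiber",
     "marital_status": "Single",
     "valid_from": date(2000, 1, 1), "valid_to": None},
]

def apply_type2_change(dim, customer_id, attribute, new_value, change_date):
    """Close the open row for customer_id and append a new version."""
    current = next(r for r in dim
                   if r["customer_id"] == customer_id and r["valid_to"] is None)
    current["valid_to"] = change_date
    new_row = dict(current, valid_from=change_date, valid_to=None,
                   customer_key=max(r["customer_key"] for r in dim) + 1)
    new_row[attribute] = new_value
    dim.append(new_row)

# Miky marries on 11/10/2004: history is now partitioned at that date.
apply_type2_change(customer_dim, "C100", "marital_status", "Married",
                   date(2004, 10, 11))
```

Fact rows loaded before the change keep pointing at surrogate key 1 (Single), and later facts point at key 2 (Married), which is what makes the before/after aggregation possible.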

Type 3 Changes

Not common at all
Complex queries on Type 2 changes may be:
Hard to implement
Time-consuming
Hard to maintain
We want to track history without lifting a heavy burden
There are many soft changes, and we don't care about the far history

Type 3 Changes, cont.

General principles:
They usually relate to soft or tentative changes in the source systems
There is a need to keep track of history with old and new values of the changed attribute
They are used to compare performance across the transition
They provide the ability to track forward and backward

ETL Concepts

ETL stands for extraction, transformation, and loading. ETL refers to the methods involved in accessing and manipulating source data and loading it into a target database.

The first step in the ETL process is mapping the data between the source systems and the target database (data warehouse or data mart). The second step is cleansing of the source data in the staging area. The third step is transforming the cleansed source data and then loading it into the target system.

ETL is the process by which data is integrated and transformed from the operational systems into the data warehouse environment.
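The three steps described above can be sketched end to end as a toy pipeline. This is an illustration only; the employee records echo the Transformation slide later in the deck, and the field names are hypothetical:

```python
# Toy ETL pass: extract rows from a source list, cleanse them in a
# staging step, transform to the target mapping, and load a target list.
source_rows = [
    {"emp_id": "10001", "last": " Jones ", "first": "Indiana"},
    {"emp_id": "10002", "last": "Holmes", "first": "Sherlock"},
    {"emp_id": "",      "last": "Bad",    "first": "Row"},   # fails cleansing
]

def cleanse(row):
    """Staging-area cleansing: trim whitespace, reject rows missing a key."""
    if not row["emp_id"].strip():
        return None
    return {k: v.strip() for k, v in row.items()}

def transform(row):
    """Apply the mapping: derive a full name for the target schema."""
    return {"emp_id": int(row["emp_id"]),
            "name": f'{row["first"]} {row["last"]}'}

staged = [r for r in (cleanse(r) for r in source_rows) if r is not None]
target = [transform(r) for r in staged]
```

Real ETL tools express the same map/cleanse/transform/load stages declaratively (the mappings and workflows described in the Informatica sections below), rather than as hand-written code.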

ETL Glossary
Source System
A database, application, file, or other storage facility from which the
data in a data warehouse is derived.
Mapping
The definition of the relationship and data flow between source and
target objects.
Metadata
Data that describes data and other structures, such as objects,
business rules, and processes. For example, the schema design of a
data warehouse is typically stored in a repository as metadata, which
is used to generate scripts used to build and populate the data
warehouse. A repository contains metadata.
Staging Area
A place where data is processed before entering the warehouse.

ETL Glossary
Cleansing
The process of resolving inconsistencies and fixing the anomalies in
source data, typically as part of the ETL process.
Transformation
The process of manipulating data. Any manipulation beyond copying
is a transformation. Examples include cleansing, aggregating, and
integrating data from multiple sources.
Transportation
The process of moving copied or transformed data from a source to
a data warehouse.
Target System
A database, application, file, or other storage facility to which the
"transformed source data" is loaded in a data warehouse.

Sample ETL Process Flow

[Diagram: sample ETL process flow]

Detailed ETL Process Flow

[Diagram: filters and extractors pull data from the operational systems (e.g., extraction of 80 source tables from Oracle, Sybase, and text files); a cleanser applies cleaning rules and a transformation engine applies transformation rules, each with an error view for check-and-correct cycles; an integrator and loader then populate the warehouse target (e.g., 50 tables)]

Transformation

Source:

Emp id | Last Name | First Name
10001  | Jones     | Indiana
10002  | Holmes    | Sherlock

Staging area: Name = Concat(First Name, Last Name)

Indiana Jones
Sherlock Holmes

Loading

[Diagram: raw source data either takes a direct load path into the data warehouse, or passes through the staging area for cleaning, transformation, and integration; the cleaned, transformed, and integrated data is then loaded]

Popular ETL Tools

Tool Name                        | Company Name
Informatica                      | Informatica Corporation
DT/Studio                        | Embarcadero Technologies
Data Stage                       | IBM
Ab Initio                        | Ab Initio Software Corporation
Data Junction                    | Pervasive Software
Oracle Warehouse Builder         | Oracle Corporation
Microsoft SQL Server Integration | Microsoft
TransformOnDemand                | Solonde
Transformation Manager           | ETL Solutions

Informatica Corporation

A market-leading provider of e-business infrastructure and analytic software, which enables customers to automate the integration, analysis, and real-time delivery of critical corporate information via web, wireless, and voice.

Informatica applications include:
eCRM application
eBusiness Operations application
eProcurement

More than 1,370 customers, including 60 percent of the Fortune 100 companies, are using Informatica's analytic solutions.
More than 900 companies are using Informatica products.

Informatica Headquarters

Founded in 1993
HQ: Redwood City, CA

Informatica PowerCenter Glossary

Repository: This is where all the metadata information is stored in the Informatica suite. The PowerCenter Client and the Repository Server access this repository to retrieve, store, and manage metadata.

PowerCenter Client: The Informatica client is used for managing users, identifying source and target system definitions, creating mappings and mapplets, creating sessions, and running workflows.

Repository Server: The repository server takes care of all the connections between the repository and the PowerCenter Client.

PowerCenter Server: The PowerCenter server does the extraction from the source and then loads the data into targets.

Informatica PowerCenter Glossary

Designer: Source Analyzer, Mapping Designer, and Warehouse Designer are tools that reside within the Designer. Source Analyzer is used for extracting metadata from source systems. Mapping Designer is used to create mappings between sources and targets; a mapping is a pictorial representation of the flow of data from source to target. Warehouse Designer is used for extracting metadata from target systems, or metadata can be created in the Designer itself.

Data Cleansing: PowerCenter's data cleansing technology improves data quality by validating, correctly naming, and standardizing address data. A person's address may not be the same in all source systems because of typos, or the postal code and city name may not match the address. These errors can be corrected using the data cleansing process, and the standardized data can be loaded into target systems (the data warehouse).

Informatica PowerCenter Glossary

Transformation: Transformations transform the source data according to the requirements of the target system. Sorting, filtering, aggregation, and joining are some examples of transformations. Transformations ensure the quality of the data being loaded into the target, and this is done during the mapping process from source to target.

Workflow Manager: A workflow loads the data from source to target in a sequential manner. For example, if the fact tables are loaded before the lookup tables, the target system will raise an error because the fact table violates foreign key validation. To avoid this, workflows can be created to ensure the correct flow of data from source to target.

Workflow Monitor: This monitor is helpful in monitoring and tracking the workflows created on each PowerCenter Server.

Informatica Architecture

[Diagram: Informatica architecture]

Informatica Components
Server Components
1. Informatica Server
2. Repository Server

Client Components
1. Repository Server Administration Console
2. Repository Manager
3. Designer
4. Workflow Manager
5. Workflow Monitor

Thank You!
