
Data Warehousing

Concepts

WHY A DATA WAREHOUSE?
• Multidimensional analysis of data (reporting)

• Improve Decision Making


OLTP – ONLINE TRANSACTION PROCESSING
Example: CITY BANK (Accounts, Loans, Mutual Funds, Insurance)

OLTP database
-- Captures information: Customer, Savings Account, Insurance
-- Front-end applications: Java, .NET
-- INSERT/UPDATE/DELETE statements against tables T1-T4
-- Works transaction by transaction
-- ER (Entity Relationship) model

Multidimensional database
-- Analysis of data
-- Reporting
-- SELECT statements that produce reports R1-R3

OLTP (Online Transaction Processing) systems are also called Transactional Systems or Operational Systems.
Example: CITY BANK salary accounts for IBM/Accenture/Dell

Tables in the OLTP database (each receives Insert/Delete/Update operations):
-- Accounts: Saving Account, Insurance Acc, Loans Acc
-- Customer: employee of each organization
-- Balance: Account ID, Date
-- Office: IBM/Accenture/Dell
-- Branch/ATM

The figures 10,00,000, 5,00,000, 1000 and 8000 appear beside the tables in the diagram.

Report: list the offices and the branches where their employees make the most transactions (analysis for opening a new branch).

Select office, branch
from office, customer, accounts, balance, branch
where office = customer
and customer = account
and account = balance
and balance = branch

Because the report joins five tables on these conditions, it takes a long time to scan the data.
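The report query can be sketched with Python's built-in sqlite3 module. This is only an illustration: the slide names the five tables and the join chain, so every column name and sample row below is an assumption.

```python
import sqlite3

# In-memory database standing in for the bank's OLTP system.
# All column names and sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE office   (office_id INTEGER PRIMARY KEY, office_name TEXT);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, office_id INTEGER);
CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TABLE branch   (branch_id INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE balance  (balance_id INTEGER PRIMARY KEY, account_id INTEGER,
                       branch_id INTEGER, txn_date TEXT);

INSERT INTO office   VALUES (1, 'IBM'), (2, 'Dell');
INSERT INTO customer VALUES (10, 1), (11, 1), (12, 2);
INSERT INTO accounts VALUES (100, 10), (101, 11), (102, 12);
INSERT INTO branch   VALUES (1, 'Branch A'), (2, 'Branch B');
INSERT INTO balance  VALUES
    (1, 100, 1, '2024-01-05'), (2, 100, 1, '2024-01-09'),
    (3, 101, 1, '2024-01-11'), (4, 102, 2, '2024-01-12');
""")

# Four join conditions chain the five tables together, as on the slide.
report = conn.execute("""
    SELECT o.office_name, br.branch_name, COUNT(*) AS txn_count
    FROM office o
    JOIN customer c  ON c.office_id   = o.office_id
    JOIN accounts a  ON a.customer_id = c.customer_id
    JOIN balance  b  ON b.account_id  = a.account_id
    JOIN branch   br ON br.branch_id  = b.branch_id
    GROUP BY o.office_name, br.branch_name
    ORDER BY txn_count DESC
""").fetchall()

for row in report:
    print(row)
```

On a normalized OLTP schema, every such report pays for the whole join chain, which is the cost the slide points to.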
What is a Data Warehouse?

A data warehouse is a subject-oriented, integrated, nonvolatile, time-variant collection of data in support of management's decisions.
- W. H. Inmon

W. H. Inmon is regarded as the father of data warehousing.


Integrated - Characteristics of a Data Warehouse

Source 1: Sales, Hyderabad (OLTP DB: SQL Server)
    Sale ID: Integer
    Product: Char

Source 2: Sales, Chennai (OLTP DB: Oracle Server)
    Sale ID: Numeric
    Product: Varchar2

Both sources are extracted (E) into Staging with Informatica and loaded (L) into the Data Warehouse:
    SaleID: Decimal
    Product: String

Integrated View Is The Essence Of A Data Warehouse
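A minimal sketch of the integration step, assuming the two sources deliver rows as Python dictionaries. The sample rows, the helper name to_warehouse_format and the extra "source" column are hypothetical; only the source and target data types come from the slide.

```python
from decimal import Decimal

# Hypothetical rows as they might arrive from the two source systems.
hyderabad_rows = [{"sale_id": 101, "product": "Pen   "}]     # SQL Server: Integer, Char
chennai_rows = [{"sale_id": "202", "product": "Notebook"}]   # Oracle: Numeric, Varchar2

def to_warehouse_format(row, source):
    """Convert one source row to the warehouse types (Decimal, str)."""
    return {
        "sale_id": Decimal(str(row["sale_id"])),
        "product": str(row["product"]).strip(),  # fixed-width Char values may be space-padded
        "source": source,                        # assumed column, kept for traceability
    }

staging = [to_warehouse_format(r, "Hyderabad") for r in hyderabad_rows]
staging += [to_warehouse_format(r, "Chennai") for r in chennai_rows]

for row in staging:
    print(row)
```

After this step both branches share one representation, so a single query can cover all sales.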
Non-volatile - Characteristics of a Data Warehouse

Operational system: insert, change, delete, replace
Data warehouse: load, read-only access

Data Warehouse Is Relatively Static In Nature


Time Variant - Characteristics of a Data Warehouse

Operational                      Data Warehouse
Current value data               Snapshot data
• Time horizon: 60-90 days       • Time horizon: 5-10 years
                                 • Stores historical data

Data Warehouse Typically Spans Across Time


Subject-Oriented - Characteristics of a Data Warehouse

A DW is a subject-oriented database that supports the business needs of individual departments in the enterprise.

Example: SALES, HR, ACCOUNTS, LOANS, etc.

Focus is on Subject Areas rather than Applications


A data warehouse is a database specifically designed for analyzing the business, not for processing business transactions.
- Ralph Kimball
OLTP Vs Data Warehouse

Operational System                          Data Warehouse
Designed to support operational monitoring  Designed to support the decision-making process
Data is volatile                            Data is non-volatile
Current data                                Historical data
Detailed data                               Summarized data
Normalized                                  De-normalized
Relatively small database                   Large database
Designed to support E-R modeling            Designed to support dimensional modeling
DATA ACQUISITION

It is the process of extracting the relevant business information, transforming the data into the required business format, and loading it into the data warehouse.

It consists of the following processes:

* Data Extraction
* Data Transformation
* Data Loading
[Figure: Data Acquisition. Relational, ERP and Mainframe sources -> Extraction -> Staging (Buffer), where Transformation takes place -> Loading -> Data Warehouse.]
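The three stages can be sketched as three small functions chained together. This is a toy illustration under stated assumptions: the source names, the "customer" field and the title-casing rule are invented for the example and do not come from the slides.

```python
def extract(sources):
    """Read raw records from every source into a staging list."""
    staging = []
    for name, rows in sources.items():
        for row in rows:
            staging.append({**row, "source": name})
    return staging

def transform(staging):
    """Apply an assumed business format: trimmed, title-cased customer names."""
    return [{**row, "customer": row["customer"].strip().title()} for row in staging]

def load(rows, warehouse):
    """Append the transformed rows to the warehouse and report how many were loaded."""
    warehouse.extend(rows)
    return len(rows)

# Hypothetical source systems, each represented as a list of records.
sources = {
    "relational": [{"customer": " alice "}],
    "erp": [{"customer": "BOB"}],
    "mainframe": [{"customer": "carol"}],
}

warehouse = []
loaded = load(transform(extract(sources)), warehouse)
print(loaded, [row["customer"] for row in warehouse])
```

Keeping the stages as separate functions mirrors the diagram: staging sits between extraction and loading, and transformation only ever touches staged data.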
DATA ACQUISITION - Data Extraction

Data Extraction:
It is the process of reading data from various types of sources, such as relational sources, ERP sources, mainframe sources, XML files and flat files.

Source type    Examples
Relational     Oracle, SQL Server, Teradata
ERP            SAP, PeopleSoft
Mainframe      COBOL files, DB2
File           Flat files (text files), XML files


DATA ACQUISITION - Data Transformation

Data Transformation:
It is the process of cleaning the data and transforming it into the required business format.

The following data transformation activities take place in the staging area:

* Data Merging
* Data Cleansing
* Data Scrubbing
* Data Aggregation
Data Merging:
It is the process of combining data from multiple inputs and loading it into a single output. There are two types of data merging activities:

* Join
* Union
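The two merging activities can be sketched in plain Python. The record layouts and values are made up for illustration; the union shown here keeps duplicates, which corresponds to UNION ALL in SQL.

```python
# Hypothetical inputs for a join: two record sets sharing the key cust_id.
orders = [{"order_id": 1, "cust_id": 10}, {"order_id": 2, "cust_id": 11}]
customers = [{"cust_id": 10, "name": "Asha"}, {"cust_id": 11, "name": "Ravi"}]

# Join: combine columns from two inputs on a shared key.
names_by_id = {c["cust_id"]: c["name"] for c in customers}
joined = [
    {**o, "name": names_by_id[o["cust_id"]]}
    for o in orders
    if o["cust_id"] in names_by_id
]

# Union: stack rows from two inputs that have the same columns.
hyderabad_sales = [{"sale_id": 1, "product": "Pen"}]
chennai_sales = [{"sale_id": 2, "product": "Ink"}]
unioned = hyderabad_sales + chennai_sales

print(joined)
print(unioned)
```

A join widens each record with extra columns, while a union lengthens the record set with extra rows.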
Data Cleansing:
It is the process of removing unwanted data from staging, or of correcting inconsistencies and inaccuracies in the data.

Example: InitCap() and Round() functions
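A small cleansing sketch, using Python's str.title() and round() as rough stand-ins for the InitCap() and Round() functions mentioned above. The rule that a record with an empty name counts as unwanted data is an assumption for the example.

```python
# Hypothetical staged records with inconsistent casing, padding and precision.
raw = [
    {"name": "ravi KUMAR", "amount": 1250.4567},
    {"name": "  asha rao ", "amount": 99.991},
    {"name": "", "amount": 10.0},  # assumed to be unwanted: the name is missing
]

def cleanse(rows):
    """Drop unwanted rows, then standardize name casing and amount precision."""
    cleaned = []
    for row in rows:
        name = row["name"].strip()
        if not name:
            continue  # remove unwanted data from staging
        cleaned.append({"name": name.title(), "amount": round(row["amount"], 2)})
    return cleaned

cleaned = cleanse(raw)
print(cleaned)
```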


Data Scrubbing:
It is the process of deriving new data definitions from existing data.

Example: Full Name = Concat(First Name, Last Name), Sale Amount = QTY * Price
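The two derivations in the example can be sketched as follows. The field names (first_name, last_name, qty, price) and the sample row are assumptions chosen to match the slide's wording.

```python
def scrub(row):
    """Add derived columns while leaving the source columns untouched."""
    return {
        **row,
        "full_name": f'{row["first_name"]} {row["last_name"]}',
        "sale_amount": row["qty"] * row["price"],
    }

# Hypothetical staged record.
row = {"first_name": "Asha", "last_name": "Rao", "qty": 3, "price": 20}
scrubbed = scrub(row)
print(scrubbed["full_name"], scrubbed["sale_amount"])
```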

Data Aggregation:
It is the process of calculating summaries for a group of records using aggregate functions.

Example: Average, Max, Min, etc.
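Aggregation over groups can be sketched in plain Python. The sales tuples are invented sample data; grouping by product is just one possible choice of group.

```python
from collections import defaultdict

# Hypothetical (product, amount) records from staging.
sales = [("Pen", 10), ("Pen", 30), ("Ink", 50)]

# Collect the amounts for each group of records.
groups = defaultdict(list)
for product, amount in sales:
    groups[product].append(amount)

# Calculate one summary row per group using aggregate functions.
summary = {
    product: {"avg": sum(v) / len(v), "max": max(v), "min": min(v)}
    for product, v in groups.items()
}
print(summary)
```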


DATA ACQUISITION - Data Loading

Data Loading:
It is the process of inserting data into a target system. There are two types of data loads.

1. Initial or Full Load
It is the process of loading all the required data in the very first load.

2. Incremental or Delta Load
It is the process of loading only the new records after the initial load.
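The difference between the two load types can be sketched as below. Detecting new records by a key that is missing from the target is an assumption made to keep the example short; real pipelines often rely on timestamps or change data capture instead.

```python
def full_load(source_rows):
    """Initial load: copy every required row into an empty target."""
    return {row["id"]: row for row in source_rows}

def incremental_load(target, source_rows):
    """Delta load: insert only rows whose key is not yet in the target."""
    new_rows = [row for row in source_rows if row["id"] not in target]
    for row in new_rows:
        target[row["id"]] = row
    return len(new_rows)

# Hypothetical source table.
source = [{"id": 1, "product": "Pen"}, {"id": 2, "product": "Ink"}]
target = full_load(source)

# A new record arrives in the source after the initial load.
source.append({"id": 3, "product": "Paper"})
inserted = incremental_load(target, source)
print(inserted, sorted(target))
```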
Data Marts

Data marts are known as high-performance query structures.

There are two types of data mart (DM):

1. Dependent DM
2. Independent DM
Top-Down Approach or Dependent Data Marts (W. H. Inmon)

According to W. H. Inmon, we first design an enterprise data warehouse and then derive from it smaller, subject-oriented, department-specific databases known as data marts.
Bottom-Up Approach or Independent Data Marts (Ralph Kimball)

According to Ralph Kimball, we first design department-specific databases known as data marts and then integrate all the data marts into an enterprise data warehouse.
Warehouse Database Schema
• ER design techniques are not appropriate
• Design should reflect a multidimensional view
  – Star Schema
  – Snowflake Schema
  – Fact Constellation Schema
  – Integrated Schema
Star Schema

• A single fact table and a single table for each dimension
• A fact table surrounded by de-normalized dimension tables is called a star schema
[Figure: STAR SCHEMA DATABASE DESIGN. A central FACTS table holding foreign keys FK1-FK4, surrounded by DIMENSION 1 to DIMENSION 4 with primary keys Pk1-Pk4.]
STAR SCHEMA DATABASE DESIGN (example)

SALES FACT: CUSTOMER_ID (FK), STORE_ID (FK), PRODUCT_ID (FK), DATE_ID (FK), QUANTITY (fact), REVENUE (fact)

CUSTOMER: Customer_id (pk), Cust_name, Address, Phone, Fax
TIME: DATE_ID (PK), YEAR, QUARTER, MONTH, WEEK, DAY
STORE: Store_id (pk), Country, Region, State, City, Store
PRODUCT: Product_id (pk), Category, Sub Category, Product
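The star schema example can be tried out with Python's built-in sqlite3 module. This is a reduced sketch: each dimension keeps only a few of the columns from the slide, the TIME dimension is named time_dim here (an assumption, to avoid confusion with SQL's time functions), and all rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (de-normalized, one per dimension).
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, cust_name TEXT);
CREATE TABLE store    (store_id INTEGER PRIMARY KEY, city TEXT, store TEXT);
CREATE TABLE product  (product_id INTEGER PRIMARY KEY, category TEXT, product TEXT);
CREATE TABLE time_dim (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

-- Fact table: one foreign key per dimension plus the measures.
CREATE TABLE sales_fact (
    customer_id INTEGER REFERENCES customer(customer_id),
    store_id    INTEGER REFERENCES store(store_id),
    product_id  INTEGER REFERENCES product(product_id),
    date_id     INTEGER REFERENCES time_dim(date_id),
    quantity    INTEGER,
    revenue     REAL
);

INSERT INTO customer VALUES (1, 'Asha');
INSERT INTO store    VALUES (1, 'Hyderabad', 'S1'), (2, 'Chennai', 'S2');
INSERT INTO product  VALUES (1, 'Stationery', 'Pen');
INSERT INTO time_dim VALUES (20240105, 2024, 1);
INSERT INTO sales_fact VALUES
    (1, 1, 1, 20240105, 2, 40.0),
    (1, 2, 1, 20240105, 1, 20.0),
    (1, 1, 1, 20240105, 3, 60.0);
""")

# A typical analytical query: the fact table joins directly to one dimension.
revenue_by_city = conn.execute("""
    SELECT s.city, SUM(f.revenue)
    FROM sales_fact f
    JOIN store s ON s.store_id = f.store_id
    GROUP BY s.city
    ORDER BY s.city
""").fetchall()
print(revenue_by_city)
```

Every dimension is one join away from the fact table, so a report needs at most one join per dimension it uses.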
Snowflake Schema

A schema in which one or more dimension tables are split into further dimension tables is called a snowflake schema.
SNOWFLAKE SCHEMA DATABASE DESIGN (example)

SALES FACT: CUSTOMER_ID (FK), STORE_ID (FK), PRODUCT_ID (FK), DATE_ID (FK), QUANTITY (fact), REVENUE (fact)

CUSTOMER: Customer_id (pk), Cust_name, Address, Phone, Fax
TIME: DATE_ID (PK), YEAR, QUARTER, MONTH, WEEK, DAY
STORE: Store_id (pk), Country, Region, State, City, Store
PRODUCT: Product_id (pk), Category, Sub Category, Product

The PRODUCT dimension is further split into separate Category and Sub Category tables.
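One way the snowflaked PRODUCT dimension could look, sketched with sqlite3. The slide shows only that Category and Sub Category become tables of their own, so the key columns, the direction of the references and the sample rows are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- PRODUCT normalized into three linked tables (assumed layout).
CREATE TABLE category (
    category_id INTEGER PRIMARY KEY,
    category    TEXT
);
CREATE TABLE sub_category (
    sub_category_id INTEGER PRIMARY KEY,
    sub_category    TEXT,
    category_id     INTEGER REFERENCES category(category_id)
);
CREATE TABLE product (
    product_id      INTEGER PRIMARY KEY,
    product         TEXT,
    sub_category_id INTEGER REFERENCES sub_category(sub_category_id)
);

INSERT INTO category     VALUES (1, 'Stationery');
INSERT INTO sub_category VALUES (1, 'Writing', 1);
INSERT INTO product      VALUES (1, 'Pen', 1), (2, 'Pencil', 1);
""")

# Rebuilding the flat product view now takes two extra joins.
rows = conn.execute("""
    SELECT p.product, sc.sub_category, c.category
    FROM product p
    JOIN sub_category sc ON sc.sub_category_id = p.sub_category_id
    JOIN category c      ON c.category_id = sc.category_id
    ORDER BY p.product
""").fetchall()
print(rows)
```

Compared with the star layout, the category name is stored once instead of on every product row, at the price of extra joins at query time.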
Integrated Schema

An integrated schema joins two or more fact tables and combines star and snowflake schemas.
[Figure: INTEGRATED SCHEMA. Tables D, A, X and C, B, Y linked through PK-FK relationships.]

A and B are conformed dimensions.


Fact Constellation Schema:

Joining two fact tables using a PK-FK relationship is known as a fact constellation schema.
Slowly Changing Dimensions

An SCD captures the changes that take place in a dimension over a period of time.

There are three types of SCD:

1. SCD Type 1: C (current values only)
2. SCD Type 2: C+H (current values plus full history)
3. SCD Type 3: C+P (current plus previous values)


1. SCD Type 1
A Type 1 dimension keeps only the current values. It does not maintain history.

2. SCD Type 2
A Type 2 dimension maintains the full history in the target. For each update it inserts a new record into the target table.

3. SCD Type 3
A Type 3 dimension maintains the current and the previous information (partial history).
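The three behaviours can be compared on a single change, a customer moving from Chennai to Hyderabad. This is a minimal in-memory sketch: the column names (city, prev_city, start_date, end_date, is_current) are common conventions assumed for the example, not something the slides specify.

```python
def scd_type1(dim, key, new_city):
    """Type 1: overwrite in place; only the current value survives."""
    dim[key]["city"] = new_city

def scd_type2(rows, key, new_city, change_date):
    """Type 2: close the current row and insert a new one (full history)."""
    for row in rows:
        if row["cust_id"] == key and row["is_current"]:
            row["is_current"] = False
            row["end_date"] = change_date
    rows.append({"cust_id": key, "city": new_city, "start_date": change_date,
                 "end_date": None, "is_current": True})

def scd_type3(dim, key, new_city):
    """Type 3: keep current and previous values in separate columns."""
    dim[key]["prev_city"] = dim[key]["city"]
    dim[key]["city"] = new_city

# Hypothetical customer dimension in each of the three designs.
type1 = {1: {"city": "Chennai"}}
scd_type1(type1, 1, "Hyderabad")

type2 = [{"cust_id": 1, "city": "Chennai", "start_date": "2023-01-01",
          "end_date": None, "is_current": True}]
scd_type2(type2, 1, "Hyderabad", "2024-06-01")

type3 = {1: {"city": "Chennai", "prev_city": None}}
scd_type3(type3, 1, "Hyderabad")

print(type1)
print(type2)
print(type3)
```

Type 2 is the only variant whose table grows with every change, which is why it typically needs effective dates or a current-row flag to pick out the live record.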
THANK YOU
