
Agenda

• BI Application Architecture
• History of Data Warehouse
• What is Data Warehouse
• What is a Data Mart
• OLTP Vs OLAP
• Top-Down Approach (Dr. Inmon)
• Bottom-Up Approach (Dr. Kimball)
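BI Application Architecture
[Diagram: OLTP sources → ETL → Staging DB → Data warehouse → Data marts → End users]
• OLTP (Bank): read/write, live data, small size. Sources: UK, US, India
• ETL: E (Extract), T (Transform), L (Load)
• Staging DB: data cleansing (remove duplicates, merging, split, convert)
• Data warehouse (OLAP): read only, history, huge size
• Data marts: Loans, Insurance, Credit card
• End users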
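The four cleansing steps named for the staging area (remove duplicates, merging, split, convert) can be sketched in plain Python. This is only an illustration: the row layout, field names and sample values are hypothetical, and only the step names come from the slide.

```python
# Hypothetical rows extracted from three regional OLTP sources.
uk_rows = [{"cust": "John Smith", "amount": "100.50"},
           {"cust": "John Smith", "amount": "100.50"}]   # exact duplicate
us_rows = [{"cust": "Mary Ford", "amount": "75"}]
india_rows = [{"cust": "Ravi Kumar", "amount": "20.25"}]


def merge(*sources):
    """Merging: combine the rows of every source into one list."""
    return [row for source in sources for row in source]


def remove_duplicates(rows):
    """Remove duplicates: keep the first copy of each identical row."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def split_and_convert(row):
    """Split: full name into first/last. Convert: amount text to float."""
    first, last = row["cust"].split(" ", 1)
    return {"first_name": first, "last_name": last,
            "amount": float(row["amount"])}


staged = [split_and_convert(r)
          for r in remove_duplicates(merge(uk_rows, us_rows, india_rows))]
print(staged)
```

A real staging area would usually apply these steps in a database or an ETL tool; the order shown here (merge, de-duplicate, then split and convert) is one reasonable choice, not the only one.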
History:
• Father of the data warehouse: Dr. William H. Inmon
• Dr. Ralph Kimball: introduced Dimension Modeling to design the DWH in the 1990s
What is Data Warehouse?

• As per Dr. William Inmon, a data warehouse is:
 Subject-oriented (loans, mutual funds, etc.)
 Integrated (gets data from various sources)
 Non-volatile (read-only data)
 Time-variant (data for different time periods)
Data Mart
• A data mart is subject oriented
• A DWH is a collection of data marts
• A DWH is a superset of data marts
• A data mart is itself a database
Data Warehouse Means
• It is a large, centralized database
• It stores integrated, summarized and historical data for analysis and decision making
OLTP Vs. OLAP
OLTP (operational data, fresh data) | OLAP (analytical data, historical data)
Designed for transactions (business purpose) | Designed for analysis and reporting
Data modifications are frequent | Data modifications are infrequent
Has normalized tables | Has de-normalized tables
Few indexes recommended (each index slows writes) | Indexes are recommended (they speed up reads)
Maintains small data | Maintains large data
More users (customers) | Fewer users (business people)
Read/write data | Historical data


TOP-DOWN Approach

• Proposed by Dr. Bill Inmon
• The DWH is created first, and the data marts are then derived from it
• The data marts hold consistent, subject-oriented data
• It is robust against business changes, but the cost of the project is very high

[Diagram: OLTP → DWH → data marts. Example for a bank: Loans, Credit Card and Insurance data marts]
Bottom-up Approach

• Proposed by Dr. Ralph Kimball
• Data marts are created first, and they are then integrated into a comprehensive DWH
• As the data marts are created first, business questions can be answered quickly

[Diagram: OLTP, three data marts and the DWH]
Normalization: split larger tables into smaller tables
 Reduces the redundancy and inconsistency of the data
 Writing data is fast
 Reading is slow (because the tables must be joined)
 More joins are required

De-Normalization: combine smaller tables into larger tables
 Reading is very fast
 Fewer joins are required
Normalized Tables: OLTP (ERD Modeling)

EMPLOYEE
EID  ENAME   DEPTID(FK)
101  SMITH   10
102  FORD    10
103  MILLER  20
104  KING    30
105  ADAM    NULL

DEPARTMENT
DEPTID(PK)  DEPTNAME
10          ACCOUNT and Research Department
20          SALES
30          MARKETING

WRITING: insertion fast, deletion fast, updates fast
READING: slow

De-Normalized Table: OLAP (Dimension Modeling)

EID  ENAME   DEPTNAME
101  SMITH   ACCOUNT and Research Department
102  FORD    ACCOUNT and Research Department
103  MILLER  SALES
104  KING    MARKETING
105  ADAM    SALES

READING: fast
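As a rough illustration of the trade-off, the sketch below loads the EMPLOYEE and DEPARTMENT rows into an in-memory SQLite database and builds a flattened copy with a join. The table names `employee_flat` and the shortened department name 'ACCOUNT' are assumptions made for the example. Note that the slide's de-normalized table lists SALES for employee 105 although the normalized table has a NULL DEPTID; the sketch follows the normalized data, so that row gets no department.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (deptid INTEGER PRIMARY KEY, deptname TEXT);
    CREATE TABLE employee (
        eid INTEGER PRIMARY KEY,
        ename TEXT,
        deptid INTEGER REFERENCES department(deptid)
    );
    INSERT INTO department VALUES (10, 'ACCOUNT'), (20, 'SALES'), (30, 'MARKETING');
    INSERT INTO employee VALUES
        (101, 'SMITH', 10), (102, 'FORD', 10), (103, 'MILLER', 20),
        (104, 'KING', 30), (105, 'ADAM', NULL);
    -- De-normalized copy: the join is paid once, at load time.
    CREATE TABLE employee_flat AS
        SELECT e.eid, e.ename, d.deptname
        FROM employee e LEFT JOIN department d ON d.deptid = e.deptid;
""")

# Reading the flat table needs no join at query time.
flat_rows = conn.execute(
    "SELECT eid, ename, deptname FROM employee_flat ORDER BY eid").fetchall()
for row in flat_rows:
    print(row)
```

The cost of the flat table is redundancy: the department name is repeated on every employee row, so renaming a department means updating many rows instead of one.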
INDEX (OLAP): each room holds one kind of item
ROOM1  OIL TINS
ROOM2  COOL DRINKS
ROOM3  STATIONERY
ROOM4  CLOTHES
ROOM5  KITCHEN ITEMS

HEAP (OLTP): items are placed in any room
ROOM1  OIL TINS, CLOTHES
ROOM2  COOL DRINKS, KITCHEN ITEMS, STATIONERY
ROOM3  CLOTHES, COOL DRINKS
ROOM4  KITCHEN ITEMS, OIL TINS, STATIONERY

Operation | OLTP (HEAP) | OLAP (INDEXES)
INSERT | FAST | SLOW
UPDATE | FAST | SLOW
DELETE | FAST | SLOW
READ | SLOW | FAST


EXAMPLE:

HEAP (rows in arrival order)
EMPID  ENAME   SALARY  DEPT  LOCATION
19     SMITH   10000   10    CHICAGO
15     MILLER  25000   20    NEWYORK
1      FORD    15000   10    CHICAGO
20     ADAM    40000   30    WASHINGTON

INDEXES (rows ordered by EMPID)
EMPID  ENAME   SALARY  DEPT  LOCATION
1      FORD    15000   10    CHICAGO
15     MILLER  25000   20    NEWYORK
19     SMITH   10000   10    CHICAGO
20     ADAM    40000   30    WASHINGTON

INSERT NEW RECORD: 2, KING, 20000, 20, NEWYORK
UPDATE THE RECORD: WHERE EMPID IS 15 SET EMPID 21
DELETE THE RECORD:
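The difference between the two layouts can be imitated with a plain Python list (heap) and a list kept sorted by EMPID (a stand-in for an index). This is a simplified sketch: real database indexes are typically B-tree structures, and the function names here are made up for the illustration. Only EMPID and ENAME are kept to keep the rows short.

```python
import bisect

# Heap: rows stay in arrival order.
heap = [(19, "SMITH"), (15, "MILLER"), (1, "FORD"), (20, "ADAM")]
# Index stand-in: the same rows kept sorted by EMPID.
index = sorted(heap)


def heap_insert(rows, row):
    rows.append(row)              # cheap: just add at the end


def index_insert(rows, row):
    bisect.insort(rows, row)      # must find the slot and shift later rows


def heap_find(rows, empid):
    for row in rows:              # scans rows one by one
        if row[0] == empid:
            return row
    return None


def index_find(rows, empid):
    i = bisect.bisect_left(rows, (empid,))   # binary search on sorted rows
    if i < len(rows) and rows[i][0] == empid:
        return rows[i]
    return None


new_row = (2, "KING")
heap_insert(heap, new_row)
index_insert(index, new_row)
print(heap)
print(index)
```

The insert of EMPID 2 lands at the end of the heap but in the second position of the sorted list, which is why writes cost more when the ordering has to be maintained, while lookups benefit from it.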
AGENDA:
• Design concepts of DWH
• Introduction to Dimension Modeling
• What is a Fact or Measure
• What is a Dimension
• Dimension Hierarchies
• Types of dimensions and Facts
• Introduction to Schemas
 Star Schema
 Snowflake Schema
Design Concepts - OLTP and OLAP (DWH)
[Diagram: OLTP → ETL → OLAP. Example: Amazon OLTP → ETL → Amazon DWH]
• To design OLTP: ERD Modeling (Dr. Peter Chen)
• To design OLAP (DWH): Dimension Modeling (Dr. Kimball)

OLTP to OLAP Table Conversion
OLTP contains:
- Master tables
• Product tables
• Category tables
• Customer tables
- Transactional tables
• Payment details
• Order details
• Delivery details
In OLAP:
- Master tables are converted to Dimension tables (Product Master → Product Dimension, Customer Master → Customer Dimension)
- Transactional tables are converted to Fact tables (Sales transactions → Sales fact)
ERD Modeling (OLTP)
Master tables (PK, string data): incremental load
• Product Master (productid, name, price): 50 GB + 50 rows
• Customer Master (custid, name, addr, city, phone): 40 GB + 1000 rows
Transactional tables (FK, numeric data): full load
• Sales (productid, customerid, qty, amount, date): 500 GB, delete all data
Entities:
• Product: Pid (PK), Caid (FK), pName, price
• Customer: cid (PK), Cname, Address, Phone, age
• Sales: Pid (FK), cid (FK), Quantity, Amount, date
Relationships:
• Product M : N Customer (purchase)
• 1:1 relationship
• 1:M relationship
• M:N relationship
Dimension Modeling:

• Developed by Dr. Kimball
• It is a DWH design concept
• It is a process to convert OLTP data into OLAP data in the form of Dimension tables and Fact tables
• Dimension Modeling is the heart of the data warehouse
• Fact: represents a business measure
• Dimension: contains the textual context associated with a business process measurement event
• Dimensions describe the "who, what, when, how, and why" associated with the business measure
• Who: customer
• What: product
• When: time
• How: mode
Fact or Measure
 Measures are the numeric values that we want to aggregate, slice, dice and analyze
 Measures have the following properties:
• How they should be formatted
• How they should aggregate up
• How they interact with specific dimensions, and so on
 Measures always answer questions such as "how much" or "how many"
 Data in the fact table can be filtered and grouped (sliced and diced)
OLAP example schema
Fact_Sales: Pid_key (FK), Cid_key (FK), Time_key (FK), Location_key (FK), pprice, Qty, Salesamount
Dim_Product: Pid_key (PK), Pname, Brand
Time: Time_key (PK), Date, Day, Week, Month, Qtr
Dim_Location: Location_Key (PK), Region, Country, State, City
Dim_Customer: Cid_key (PK), Cname, Address, Phone, Age

Measure/Fact/Metrics: numerical values (sum, min, max, avg), stored with FKs
1) How much
2) How many

Dimensions: string data, stored with PKs
1) When (time)
2) Where (location)
3) Which (products)
4) Who (customers/dealers)
5) How (mode)

Dimension Table = PK + Dimensions
Fact Table = FK + Facts
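A minimal sketch of how a fact table is filtered and grouped through its dimension keys, using an in-memory SQLite database. The table and column names follow the schema listed on the slide in simplified form (only two dimensions are kept); all sample rows and values are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (pid_key INTEGER PRIMARY KEY, pname TEXT, brand TEXT);
    CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, country TEXT, city TEXT);
    CREATE TABLE fact_sales (
        pid_key INTEGER REFERENCES dim_product(pid_key),
        location_key INTEGER REFERENCES dim_location(location_key),
        qty INTEGER,
        salesamount REAL
    );
    INSERT INTO dim_product VALUES (1, 'Soap', 'BrandA'), (2, 'Shampoo', 'BrandB');
    INSERT INTO dim_location VALUES (1, 'India', 'Chennai'), (2, 'UK', 'London');
    INSERT INTO fact_sales VALUES
        (1, 1, 10, 100.0), (1, 2, 5, 50.0), (2, 1, 2, 40.0), (2, 2, 1, 20.0);
""")

# Dice: group the measures by one attribute from each dimension.
by_brand_country = conn.execute("""
    SELECT p.brand, l.country, SUM(f.qty), SUM(f.salesamount)
    FROM fact_sales f
    JOIN dim_product p ON p.pid_key = f.pid_key
    JOIN dim_location l ON l.location_key = f.location_key
    GROUP BY p.brand, l.country
    ORDER BY p.brand, l.country
""").fetchall()

# Slice: fix one dimension value (country = 'India') and total the measure.
india_total = conn.execute("""
    SELECT SUM(f.salesamount)
    FROM fact_sales f
    JOIN dim_location l ON l.location_key = f.location_key
    WHERE l.country = 'India'
""").fetchone()[0]

print(by_brand_country)
print(india_total)
```

The fact table carries only keys and numbers; every descriptive label in the result comes from a dimension table reached through one join.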
Hierarchy
• Time: Year → Qtr → Month → Week → Day
• Location: Country → State → City
Drill-down example:
• Total: 1000 million USD
• Years: 2015 = 400$, 2016 = 600$
• Quarters (Q1 to Q4): values of 50, 50, 100 and 200
• Months: Jan = 100, Feb = 35, Mar = 65
• Weeks: W1, W2, W3, W4
• Days: Mon, Tue, Wed, Thu, Fri, Sat, Sun
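Rolling up along a hierarchy amounts to summing a measure after dropping the lower levels of the key. The sketch below assumes a hypothetical set of month-level facts keyed by (year, quarter, month); the Jan, Feb and Mar amounts reuse the figures from the slide, while the April row and the quarter labels are assumptions.

```python
from collections import defaultdict

# Hypothetical month-level facts: (year, quarter, month) -> amount.
monthly = {
    (2015, "Q1", "Jan"): 100,
    (2015, "Q1", "Feb"): 35,
    (2015, "Q1", "Mar"): 65,
    (2015, "Q2", "Apr"): 50,
}


def roll_up(facts, depth):
    """Sum amounts after truncating each key to `depth` hierarchy levels."""
    totals = defaultdict(int)
    for key, amount in facts.items():
        totals[key[:depth]] += amount
    return dict(totals)


by_quarter = roll_up(monthly, 2)   # Year -> Qtr
by_year = roll_up(monthly, 1)      # Year only
print(by_quarter)
print(by_year)
```

Drilling down is the reverse direction: starting from a yearly total and re-querying with a longer key to see the quarters or months behind it.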
Star Schema:
 In a star schema, the dimension tables are directly linked to the fact table.
 The fact table is at the center and contains keys to every dimension table, such as Dealer_ID, Model ID, Date_ID, Product_ID and Branch_ID, plus other attributes such as units sold and revenue.
• Every dimension in a star schema is represented by only one dimension table.
• The dimension table should contain the set of attributes.
• The dimension table is joined to the fact table using a foreign key.
• The dimension tables are not joined to each other.
• The fact table contains keys and measures.
• The dimension tables are not normalized. For instance, Country_ID does not have a Country lookup table as an OLTP design would have.
• The schema is widely supported by BI tools.
Dimension: string information about a business transaction (what, when, how, why, where). Example: in "sales amount 10000 for soaps", soaps is the dimension and 10000 is the fact.
Fact: business values (measures/metrics) such as revenue, sales amount, deposit, withdrawal (how much, how many)
Dimension Table: PK + Dimensions (a PK doesn't allow duplicates or nulls)
Fact Table: FK + Facts (an FK allows duplicates and nulls)
Star Schema: dimension tables are directly connected to the fact table (denormalized method)
Snowflake Schema: dimension tables are indirectly connected to the fact table (normalized method)
Galaxy Schema: star schema + snowflake schema, with more than one fact table (normalized method)
Snowflake Schema:
Dimension tables are indirectly linked to the fact table
Or
A dimension table links to another dimension table
A snowflake schema is an extension of a star schema that adds additional dimension tables. The dimension tables are normalized, which splits the data into additional tables.
In the snowflake schema example, Country is further normalized into an individual table.

Characteristics of Snowflake Schema:
• The main benefit of the snowflake schema is that it uses less disk space.
• It is easier to implement when a dimension is added to the schema.
• Query performance is reduced because of the multiple tables.
• The primary challenge with the snowflake schema is the extra maintenance effort caused by the additional lookup tables.
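To make the extra join concrete, here is a small sketch in which Country is normalized out of the location dimension into its own lookup table. The table names, keys and sample rows are hypothetical; only the idea of a separate Country table comes from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Snowflaked: country is split out of the location dimension.
    CREATE TABLE dim_country (country_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE dim_location (
        location_key INTEGER PRIMARY KEY,
        city TEXT,
        country_id INTEGER REFERENCES dim_country(country_id)
    );
    CREATE TABLE fact_sales (
        location_key INTEGER REFERENCES dim_location(location_key),
        salesamount REAL
    );
    INSERT INTO dim_country VALUES (1, 'India'), (2, 'UK');
    INSERT INTO dim_location VALUES
        (10, 'Chennai', 1), (11, 'Mumbai', 1), (12, 'London', 2);
    INSERT INTO fact_sales VALUES (10, 100.0), (11, 50.0), (12, 30.0);
""")

# Two joins are needed to reach the country name:
# fact -> location -> country.
sales_by_country = conn.execute("""
    SELECT c.country, SUM(f.salesamount)
    FROM fact_sales f
    JOIN dim_location l ON l.location_key = f.location_key
    JOIN dim_country c ON c.country_id = l.country_id
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()
print(sales_by_country)
```

In a star schema the country name would sit directly on the location dimension, so the same report would need only one join, at the price of repeating 'India' on every Indian city row.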
Star Schema Vs Snowflake Schema

Star Schema | Snowflake Schema
1. Has redundant data and hence is harder to maintain/change | 1. No redundancy and hence is easier to maintain/change
2. Less complex queries, hence easy to understand | 2. More complex queries, hence harder to understand
3. Fewer foreign keys, so faster execution time | 3. More foreign keys, so longer execution time
4. Has de-normalized tables | 4. Has normalized tables
5. Good to use for large databases | 5. Good to use for small databases
6. Fewer joins | 6. More joins
7. Suitable when the dimension tables have fewer rows | 7. Better when the dimension tables are relatively big, as it reduces space
Types of Dimensions
• Role-Playing Dimension
• Conformed Dimension
• Junk Dimension
• Degenerate Dimension
• Slowly Changing Dimension
• Inferred Dimension

• Role-Playing Dimension: one dimension having more than one relationship with the same fact table
• Conformed Dimension: one single dimension shared by multiple fact tables in dimension modeling
Role-Playing Dimension:
When one dimension table has more than one relationship with the fact table, the dimension is called a Role-Playing Dimension.
Fact_Sales: Pid_key (FK), Cid_key (FK), Location_key (FK), pprice, Qty, Salesamount, Orderdate_key (FK), Deliverydate_key (FK)
Dim_Product: Pid_key (PK), Pname, Brand
Time: date_key (PK), Date, Day, Week, Month, Qtr (referenced by both Orderdate_key and Deliverydate_key)
Dim_Location: Location_Key (PK), Region, Country, State, City
Dim_Customer: Cid_key (PK), Cname, Address, Phone, Age
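In SQL, a role-playing dimension usually shows up as the same dimension table joined twice under different aliases. The sketch below assumes a simplified date dimension (`dim_date` with a hypothetical `full_date` column) and a fact table with the two date keys named on the slide; the sample row is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT);
    CREATE TABLE fact_sales (
        orderdate_key INTEGER REFERENCES dim_date(date_key),
        deliverydate_key INTEGER REFERENCES dim_date(date_key),
        salesamount REAL
    );
    INSERT INTO dim_date VALUES (1, '2022-01-01'), (2, '2022-01-03');
    INSERT INTO fact_sales VALUES (1, 2, 250.0);
""")

rows = conn.execute("""
    SELECT od.full_date AS order_date,
           dd.full_date AS delivery_date,
           f.salesamount
    FROM fact_sales f
    JOIN dim_date od ON od.date_key = f.orderdate_key      -- role: order date
    JOIN dim_date dd ON dd.date_key = f.deliverydate_key   -- role: delivery date
""").fetchall()
print(rows)
```

Only one physical date table exists; the aliases `od` and `dd` give it two roles within the same query.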
Conformed Dimension
• A conformed dimension relates to multiple fact tables within the same DWH.
• Dim Date is a common conformed dimension because its attributes (day, week, month, quarter, year, etc.) have the same meaning when joined to any fact table.
[Diagram: Date Dimension (Date_key PK) is referenced by Fact (Loan) through Date_key (FK) and by Fact (Credit card) through Date_key (FK)]
Conformed vs Role-Playing Dimension
A conformed dimension is the same dimension used in different fact tables with the same meaning.
A role-playing dimension is the same dimension used multiple times within the same fact table, but with different meanings, e.g. Date.
Role-playing: date_key → Fact Table 1, used twice (order, delivery)
Conformed: date_key → Fact Table 1, Fact Table 2, Fact Table 3
Degenerate Dimension
• A degenerate dimension is a dimension field in the fact table that doesn't have a corresponding dimension table.
• Placing these text attributes in the fact table will slow down its performance
• Ex: voucher number, bill number
• Fact table: FK + Facts + Degenerate Dimensions
Junk Dimension
• Junk dimensions are used to reduce the number of dimensions in the dimension model
• They also reduce the number of columns in the fact table
• This helps reduce the size of the fact table
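One common way to build a junk dimension is to combine several low-cardinality flags into a single table with one row per combination. The sketch below assumes three hypothetical flags (payment mode, gift wrap, order channel); none of these names come from the slide.

```python
from itertools import product

# Hypothetical low-cardinality flags that would otherwise each need
# their own column or dimension on the fact table.
payment_modes = ["CASH", "CARD"]
gift_wrapped = [False, True]
order_channels = ["STORE", "WEB"]

# One junk-dimension row per combination, keyed by a surrogate key.
junk_dim = {
    key: combo
    for key, combo in enumerate(
        product(payment_modes, gift_wrapped, order_channels), start=1)
}
# Reverse lookup used while loading facts.
junk_key = {combo: key for key, combo in junk_dim.items()}

# The fact row stores one junk_key instead of three separate flags.
fact_row = {"salesamount": 120.0,
            "junk_key": junk_key[("CARD", True, "WEB")]}
print(len(junk_dim), fact_row)
```

With two values per flag the junk dimension has only 8 rows, while the fact table swaps three columns for a single key column.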
Slowly Changing Dimension
• In a data warehouse there is a need to track changes in dimension attributes in order to report historical data.
• Slowly changing dimensions are used when we wish to capture the changing data within the dimension over time.
• SCD Type 1: only current data
• SCD Type 2: history data + current data (storage is row-wise)
• SCD Type 3: history data + current data (storage is column-wise)
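The three SCD types can be contrasted with a customer whose city changes. This is a simplified sketch: the column names (`start_date`, `end_date`, `is_current`, `previous_city`) are common conventions assumed for the example, not names taken from the slide.

```python
def scd_type1(row, new_city):
    """Type 1: overwrite in place; no history is kept."""
    updated = dict(row)
    updated["city"] = new_city
    return updated


def scd_type2(rows, cid, new_city, change_date):
    """Type 2: close the current row and append a new one (row-wise history)."""
    result = []
    for row in rows:
        row = dict(row)
        if row["cid"] == cid and row["is_current"]:
            row["is_current"] = False
            row["end_date"] = change_date
            result.append(row)
            result.append({"cid": cid, "city": new_city,
                           "start_date": change_date, "end_date": None,
                           "is_current": True})
        else:
            result.append(row)
    return result


def scd_type3(row, new_city):
    """Type 3: keep the prior value in an extra column (column-wise history)."""
    updated = dict(row)
    updated["previous_city"] = row["city"]
    updated["city"] = new_city
    return updated


customer = {"cid": 1, "city": "Chennai"}
history = [{"cid": 1, "city": "Chennai", "start_date": "2020-01-01",
            "end_date": None, "is_current": True}]

t1 = scd_type1(customer, "Mumbai")
t2 = scd_type2(history, 1, "Mumbai", "2022-01-01")
t3 = scd_type3(customer, "Mumbai")
print(t1, t2, t3, sep="\n")
```

Type 2 keeps full history at the cost of more rows, while Type 3 keeps only one prior value per tracked column.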
Inferred Dimension
• Inferred dimensions are also called late-arriving dimensions.
• Early-arriving facts, late-arriving dimensions.
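One way to handle a fact that arrives before its dimension member is to insert a placeholder dimension row and fill in its attributes later. The sketch below is an assumption about how this is typically done; the `inferred` flag, the business keys and the helper name `lookup_or_infer` are hypothetical.

```python
# Dimension keyed by business key; one customer is already known.
dim_customer = {"C001": {"cid_key": 1, "cname": "Smith", "inferred": False}}


def lookup_or_infer(dim, business_key):
    """Return the surrogate key, adding a placeholder row if the member is unknown."""
    if business_key not in dim:
        dim[business_key] = {"cid_key": len(dim) + 1,
                             "cname": None, "inferred": True}
    return dim[business_key]["cid_key"]


# Early-arriving facts: C002 is not in the dimension yet.
fact_rows = []
for business_key, amount in [("C001", 100.0), ("C002", 40.0)]:
    fact_rows.append({"cid_key": lookup_or_infer(dim_customer, business_key),
                      "salesamount": amount})

inferred_before = dim_customer["C002"]["inferred"]

# Late-arriving dimension: the real attributes replace the placeholder.
dim_customer["C002"].update(cname="Ford", inferred=False)
print(fact_rows, dim_customer["C002"])
```

The fact row keeps the same surrogate key throughout, so nothing in the fact table has to change when the real customer details arrive.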
Types of Facts
• Fully Additive Measure/Fact
• Semi-Additive Measure/Fact
• Non-Additive Measure/Fact
• Factless Fact
Fully Additive Measure/Fact:
• Can be added across all the associated dimensions
• Eg: Sales Amount
Semi-Additive Fact:
Can be added across only a few dimensions rather than all dimensions.
Eg: sum(customer balance)

Adding across accounts for the same day (meaningful):
Balance of company's acct 1 for day 1: 5000
Balance of company's acct 2 for day 1: 2000
Total: 7000

Adding across days for the same account (not meaningful; day 1 and day 2 belong to the Date/time dimension):
Balance of company's acct 1 for day 1: 5000
Balance of company's acct 1 for day 2: 3000
Total: 8000
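The balance example can be checked with a few lines of Python. The figures are the ones from the slide; taking the latest balance for an account is one common alternative to summing across time and is an assumption here, not something the slide prescribes.

```python
# Balances from the example: (account, day) -> balance.
balances = {
    ("acct1", "day1"): 5000,
    ("acct2", "day1"): 2000,
    ("acct1", "day2"): 3000,
}

# Adding across accounts for one day is meaningful.
total_day1 = sum(b for (acct, day), b in balances.items() if day == "day1")

# Adding one account across days produces a number with no business meaning.
naive_sum_acct1 = sum(b for (acct, day), b in balances.items() if acct == "acct1")

# A common alternative across time: take the latest balance instead of the sum.
latest_day = max(day for (acct, day) in balances if acct == "acct1")
closing_acct1 = balances[("acct1", latest_day)]

print(total_day1, naive_sum_acct1, closing_acct1)
```

Account 1 never held 8000, which is why the sum across days is rejected while the sum across accounts is kept.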
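Non-Additive Facts: cannot be added across any of the dimensions.
Measures that do not give any business meaning when added across dimensions are called non-additive facts. For example, the sum of profit margins does not give any business meaning to the supermarket.

Super Market: one store across dates
Date | Store | Profit Margin
01-Jan-22 | More | 20%, from (1200 - 1000) / 1000
02-Jan-22 | More | 40%, from (1400 - 1000) / 1000
Sum of margins: 60% (not meaningful). Actual margin: (2600 - 2000) / 2000 = 30%

Super Market: two stores on one date
Date | Store | Profit Margin
01-Jan-22 | More | 30%, from (1300 - 1000) / 1000
01-Jan-22 | Dmart | 60%, from (1600 - 1000) / 1000
Sum of margins: 90% (not meaningful). Actual margin: (2900 - 2000) / 2000 = 45%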
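The margin arithmetic can be reproduced in a short sketch. It assumes that in each pair of numbers on the slide the smaller value (1000) is the cost and the larger one is the revenue; the function name `margin` is hypothetical.

```python
# (cost, revenue) pairs for store "More" on two days, per the first example.
sales = [(1000, 1200), (1000, 1400)]


def margin(cost, revenue):
    """Profit margin as a fraction of cost."""
    return (revenue - cost) / cost


daily_margins = [margin(c, r) for c, r in sales]

# Wrong: adding the ratios themselves.
wrong_total = sum(daily_margins)

# Right: add the additive parts first, then recompute the ratio.
total_cost = sum(c for c, _ in sales)
total_revenue = sum(r for _, r in sales)
correct_total = margin(total_cost, total_revenue)

print(daily_margins, wrong_total, correct_total)
```

This is the usual way to handle a non-additive measure: store the additive components (cost and revenue) in the fact table and derive the ratio at query time.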
Factless Fact:
The fact table doesn't contain any measures or facts.
The fact table contains only key attributes.
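A factless fact table is still useful because its rows can be counted. The attendance example below is a hypothetical illustration (the slide gives no example); each row holds only dimension keys.

```python
from collections import Counter

# Hypothetical factless fact table: each row records only that an event
# happened (a student attended a class on a date); there is no measure column.
fact_attendance = [
    {"student_key": 1, "class_key": 10, "date_key": 20220101},
    {"student_key": 2, "class_key": 10, "date_key": 20220101},
    {"student_key": 1, "class_key": 11, "date_key": 20220102},
]

# Analysis is done by counting rows per key.
attendance_per_class = Counter(row["class_key"] for row in fact_attendance)
print(attendance_per_class)
```

The count of rows plays the role that a summed measure plays in an ordinary fact table.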
