
Introduction to Data Warehousing

Srinivasarao Sanka
Agenda
• Introduction
• Normalization and de-normalization
• What is a data warehouse?
• Advantages and disadvantages of DWH
• Types of load
• Dimension table
• Types of SCD
• Fact tables
• Data warehousing schema
• Difference between OLAP and OLTP
• ETL process
• Inmon’s and Kimball’s concepts
• ETL testing process
• Checks in DWH testing
• How to do ETL testing
• Challenges in ETL/DWH testing
• DWH versus DB testing
• DWH versus data marts

2 Confidential Services
Introduction
• The E-R (Entity-Relationship) model is used in two-dimensional databases,
for example SQL Server or Oracle. A table is based on two dimensions: rows
and columns. Generally, OLTP systems are based on two dimensions.
• In dimensional modeling, however, we have more than two dimensions.
• A cube represents a three-dimensional model in a data warehouse. The data is
stored as summarized information, so it can be retrieved more easily than
from a normal OLTP database.
• Let us assume PROD, GEOG, TIME and MEAS are the four dimensions we
have, and a DW system has stored information along these four dimensions.
Suppose you want to know the sales of Lux (PROD) in North India (GEOG) during
Oct 2006 (TIME) for a measure value of Lux 75 grams (MEAS).
• i.e., FACT_TBL(PROD LUX, GEOG NORTH_INDIA, TIME OCT06, MEAS
Units)
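
The lookup above can be sketched in code. This is a minimal illustration, assuming an in-memory SQLite database stands in for the warehouse; the table layout, the `lookup` helper and the sample figures are hypothetical and only mirror the FACT_TBL notation.

```python
import sqlite3

# Hypothetical fact table: one row per (product, geography, period, measure).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_tbl (prod TEXT, geog TEXT, period TEXT, meas TEXT, value INTEGER)"
)
conn.executemany(
    "INSERT INTO fact_tbl VALUES (?, ?, ?, ?, ?)",
    [
        ("LUX", "NORTH_INDIA", "OCT06", "UNITS", 1200),  # made-up figures
        ("LUX", "SOUTH_INDIA", "OCT06", "UNITS", 900),
        ("LUX", "NORTH_INDIA", "NOV06", "UNITS", 1100),
    ],
)

def lookup(prod, geog, period, meas):
    """Return the summed measure value for one point in the four dimensions."""
    row = conn.execute(
        "SELECT SUM(value) FROM fact_tbl "
        "WHERE prod = ? AND geog = ? AND period = ? AND meas = ?",
        (prod, geog, period, meas),
    ).fetchone()
    return row[0]  # None when no row matches

sales = lookup("LUX", "NORTH_INDIA", "OCT06", "UNITS")
```

Each dimension value narrows the fact table further, which is why the four values together identify a single summarized figure.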

Normalization

De-normalization

This works: queries are now blindingly simple
(select * from users)

Necessity is the mother of invention

Why Data Warehouse?

Scenario

ABC Pvt Ltd is a company with branches at Mumbai, Delhi, Chennai and
Bangalore. The Sales Manager wants a quarterly sales report. Each branch
has a separate operational system.

Scenario: ABC Pvt Ltd.

[Diagram: the Mumbai, Delhi, Chennai and Bangalore branch systems. The Sales
Manager wants sales per item type per branch for the first quarter.]

Solution: ABC Pvt Ltd.

• Extract sales information from each database.
• Store the information in a common repository at a single site.

Solution: ABC Pvt Ltd.

[Diagram: data from the Mumbai, Delhi, Chennai and Bangalore systems flows into
the Data Warehouse. Query & Analysis tools produce the report for the Sales
Manager.]

Need for Data Warehousing

• Industry has huge amounts of operational data.
• Knowledge workers want to turn this data into useful information.
• They use this information to support strategic decision making.

Need for Data Warehousing (contd..)

• It is a platform for consolidated historical data for analysis.
• It stores data of good quality so that knowledge workers can make
correct decisions.

Need for Data Warehousing (contd..)

• From a business perspective:
- it is the latest marketing weapon
- it helps to keep customers by learning more about their needs
- it is a valuable tool in today’s competitive, fast-evolving world

What is a Data Warehouse?

Inmon’s definition

A data warehouse is
-subject-oriented,
-integrated,
-time-variant,
-nonvolatile
collection of data in support of management’s
decision making process.

Subject-oriented
• A data warehouse is organized around subjects such as
sales, product and customer.
• It focuses on modeling and analysis of data for
decision makers.
• It excludes data that is not useful in the decision support process.

Integration
• A data warehouse is constructed by integrating
multiple heterogeneous sources.
• Data preprocessing is applied to ensure
consistency.

[Diagram: RDBMS, legacy system and flat file sources pass through data
processing and data transformation into the Data Warehouse.]
Integration

• In terms of data:
– Encoding structures
– Measurement of attributes
– Physical attributes of data
– Naming conventions
– Data type formats

Time-variant
• Provides information from a historical perspective, e.g.
the past 5-10 years.
• Every key structure contains, either implicitly or
explicitly, an element of time.

Nonvolatile
• Data once recorded cannot be updated.
• A data warehouse requires only two operations for
data access:
– Initial loading of data
– Access of data

Advantages of DWH
• Data warehouses tend to have a very high query
success rate because they have complete control over the four
main areas of data management systems:
– Clean data
– Indexes: multiple types
– Query processing: multiple options
– Security: data and access
• Easy report creation.
• Enhanced access to data and information.

Disadvantages of DWH
• There are considerable disadvantages involved in
moving data from multiple, often highly
disparate, data sources to one data warehouse
that translate into long implementation time,
high cost, lack of flexibility, dated information,
and limited capabilities.
• Preparation may be time consuming.
• Compatibility with existing systems.
• Security issues.
• Long initial implementation time and associated
high cost

Disadvantages of DWH Contd..
• Limited flexibility of use and types of users
• Difficult to accommodate changes in data types and ranges,
data source schema, indexes and queries

Data Warehousing Architecture
[Diagram: DATA SOURCES (operational DBs, external sources) are extracted,
transformed, loaded and refreshed into the reconciled data store, supported by
a metadata repository and monitoring & administration. The warehouse and the
DATA MARTS serve OLAP servers, which feed the TOOLS: analysis,
query/reporting and data mining.]
The ETL Process

[Diagram: Source Systems, Staging Area and Presentation System, connected by
the Extract, Transform and Load steps.]

Types of Load

• Initial Load: a one-time load of the source transaction system data of the
past years into the data warehouse system.

• Incremental Load: applying ongoing changes to one or more tables based on a
predefined schedule.

• Full Load: a complete delete and reload of the data. It is typically used for
a small, largely independent set of tables that receive the full data
(current data + history data) as input.
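
The difference between a full load and an incremental load can be sketched as follows. This is only an illustration, assuming an in-memory SQLite table named `target` (hypothetical) and using SQLite's `INSERT OR REPLACE` as a stand-in for whatever upsert mechanism the real ETL tool provides. In this sketch an initial load is simply the first full load into an empty target.

```python
import sqlite3

def make_target():
    """Create an empty, hypothetical target table keyed by id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, amount INTEGER)")
    return conn

def full_load(conn, source_rows):
    """Full load: delete everything, then reload the complete source data."""
    conn.execute("DELETE FROM target")
    conn.executemany("INSERT INTO target VALUES (?, ?)", source_rows)

def incremental_load(conn, changed_rows):
    """Incremental load: apply only new or changed rows (upsert by key)."""
    conn.executemany("INSERT OR REPLACE INTO target VALUES (?, ?)", changed_rows)

dwh = make_target()
full_load(dwh, [(1, 10), (2, 20)])         # first run: loads all rows
incremental_load(dwh, [(2, 25), (3, 30)])  # scheduled run: one update, one insert
loaded = dwh.execute("SELECT id, amount FROM target ORDER BY id").fetchall()
```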

Dimension

• A dimension table is a collection of hierarchies and categories along which
the user can drill down and drill up. It contains only textual attributes.
• A structure that categorizes data to enable end users to answer business
questions.
• e.g., Time, Location, Customer

Types of Dimension Tables

• Conformed Dimensions:
A Dimension that is used in multiple locations is called a conformed dimension.
A conformed dimension may be used with multiple fact tables in a single
database, or across multiple data marts or data warehouses.
Eg: Time Dimension.
• Junk Dimensions:
A junk dimension is a single table with a combination of different and unrelated
attributes to avoid having a large number of foreign keys in the fact table. Junk
dimensions are often created to manage the foreign keys created by Rapidly
Changing Dimensions.

Types of Dimension Tables Cont

• Degenerate Dimensions:
A degenerate dimension is a dimension attribute that is stored as part of the
fact table, and not in a separate dimension table.
• Slowly Changing Dimensions (SCD):
Dimensions that change over time are called Slowly Changing Dimensions.
Three types of SCD are used in DW
SCD Type 1
SCD Type 2
SCD Type 3

Slowly Changing Dimensions

• Type 1:
• The new information simply overwrites the original information. In other
words, no history is kept.
• Example: we originally have the following table:

Customer Key   Name       State
100            Srinivas   Karnataka

• After Srinivas moved from Karnataka to Chennai, the new information
replaces the original record, and we have the following table:

Customer Key   Name       State
100            Srinivas   Chennai
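
A Type 1 change is a plain overwrite. The sketch below assumes an in-memory SQLite table `customer_dim` (a hypothetical name) holding the example row. The helper `scd_type1_change` is illustrative, not part of any ETL tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, name TEXT, state TEXT)"
)
conn.execute("INSERT INTO customer_dim VALUES (100, 'Srinivas', 'Karnataka')")

def scd_type1_change(conn, customer_key, new_state):
    """Type 1: overwrite the attribute in place, so no history is kept."""
    conn.execute(
        "UPDATE customer_dim SET state = ? WHERE customer_key = ?",
        (new_state, customer_key),
    )

scd_type1_change(conn, 100, "Chennai")
type1_rows = conn.execute("SELECT customer_key, name, state FROM customer_dim").fetchall()
```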

Slowly Changing Dimensions Cont

• Type 2:
• A new record is added to the table to represent the new information.
Therefore, both the original and the new record will be present. The new
record gets its own primary key.
• Example: we originally have the following table:

Customer Key   Name       State
100            Srinivas   Karnataka

• After Srinivas moved from Karnataka to Chennai, we add the new
information as a new row into the table:

Customer Key   Name       State
100            Srinivas   Karnataka
101            Srinivas   Chennai
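
A Type 2 change can be sketched as an insert with a fresh key. The code assumes an in-memory SQLite table `customer_dim` (hypothetical) and derives the new key as the current maximum plus one, which is only an assumption made to reproduce the 100/101 keys of the example. Real designs often also add effective-date or current-flag columns, which the example table here does not show.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, name TEXT, state TEXT)"
)
conn.execute("INSERT INTO customer_dim VALUES (100, 'Srinivas', 'Karnataka')")

def scd_type2_change(conn, name, new_state):
    """Type 2: add a new row with its own key; the old row stays untouched."""
    (next_key,) = conn.execute(
        "SELECT MAX(customer_key) + 1 FROM customer_dim"
    ).fetchone()
    conn.execute(
        "INSERT INTO customer_dim VALUES (?, ?, ?)", (next_key, name, new_state)
    )
    return next_key

new_key = scd_type2_change(conn, "Srinivas", "Chennai")
type2_rows = conn.execute(
    "SELECT customer_key, name, state FROM customer_dim ORDER BY customer_key"
).fetchall()
```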

Slowly Changing Dimensions
• Type 3:
• In Type 3 Slowly Changing Dimension, there will be two columns to indicate
the particular attribute of interest, one indicating the original value, and one
indicating the current value. There will also be a column that indicates when
the current value becomes active.
• Example: we originally have the following table:

Customer Key   Name       State
100            Srinivas   Karnataka

• After Srinivas moved from Karnataka to Chennai, the original information gets
updated, and we have the following table:

Customer Key   Name       Original State   Current State   Effective Date
100            Srinivas   Karnataka        Chennai         12-12-12
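
A Type 3 change keeps one row and tracks the previous value in its own column. The sketch assumes an in-memory SQLite table `customer_dim` (hypothetical) that already has the original, current and effective-date columns. The date is stored as plain text exactly as written in the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_dim ("
    "customer_key INTEGER PRIMARY KEY, name TEXT, "
    "original_state TEXT, current_state TEXT, effective_date TEXT)"
)
# Before any change the current value equals the original value.
conn.execute(
    "INSERT INTO customer_dim VALUES (100, 'Srinivas', 'Karnataka', 'Karnataka', NULL)"
)

def scd_type3_change(conn, customer_key, new_state, effective_date):
    """Type 3: keep the original value, overwrite only the current value."""
    conn.execute(
        "UPDATE customer_dim SET current_state = ?, effective_date = ? "
        "WHERE customer_key = ?",
        (new_state, effective_date, customer_key),
    )

scd_type3_change(conn, 100, "Chennai", "12-12-12")
type3_rows = conn.execute("SELECT * FROM customer_dim").fetchall()
```

Only one level of history survives with this layout: a second move would overwrite the current value again while the original state stays as it was.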

Fact Table

• A table that contains facts
• Contains numeric, additive fields (measurements of the business)
• It has two types of columns:
– columns containing facts
– foreign keys to dimension tables

Types of Fact Tables

• Additive: Additive facts are facts that can be summed up through all of the
dimensions in the fact table.

• Semi-Additive: Semi-additive facts are facts that can be summed up for some
of the dimensions in the fact table, but not the others.

• Non-Additive: Non-additive facts are facts that cannot be summed up for any
of the dimensions present in the fact table

• Factless Fact : A factless fact table is a fact table that does not have any
measures. It is essentially an intersection of dimensions.
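
The additive versus semi-additive distinction is easiest to see with numbers. The sketch below uses made-up account data (all names and figures are hypothetical): deposit amounts are additive, while closing balances can be summed across accounts for one day but not across days.

```python
# Each row: (day, account, deposit_amount, closing_balance). Figures are made up.
fact_rows = [
    ("d1", "A", 100, 100),
    ("d1", "B", 60, 60),
    ("d2", "A", 20, 120),
    ("d2", "B", 0, 40),   # B withdrew 20 on d2, so the balance dropped
]

# Additive: deposits can be summed across every dimension (days and accounts).
total_deposits = sum(row[2] for row in fact_rows)

# Semi-additive: balances can be summed across accounts for a single day...
total_balance_d2 = sum(row[3] for row in fact_rows if row[0] == "d2")

# ...but summing balances across days double counts money and means nothing.
meaningless_sum = sum(row[3] for row in fact_rows)
```

A ratio such as a profit margin would be non-additive: it has to be recomputed from its additive components at each level rather than summed.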

Schema

The foundation of each data warehouse is a relational database built using a
dimensional model. A dimensional model consists of dimension and fact tables
and is typically described as a star or snowflake schema.

Data Warehouse Schema
• Star Schema
• Snowflake Schema
• Galaxy /Fact Constellation Schema

Star Schema

• A single, large, central fact table and one table for each dimension.
• Every fact points to one tuple in each of the dimensions and has additional
attributes.
• Does not capture hierarchies directly.

Star Schema (contd..)

Store Dimension   Fact Table    Time Dimension   Product Dimension
Store Key         Store Key     Period Key       Product Key
Store Name        Product Key   Year             Product Desc
City              Period Key    Quarter
State             Units         Month
Region            Price

Benefits: Easy to understand, easy to define hierarchies, reduces no. of physical
joins.
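
The star layout above can be sketched as tables plus one reporting query. This is an illustration only, assuming an in-memory SQLite database; the table names (`store_dim`, `product_dim`, `time_dim`, `sales_fact`) and all sample rows are hypothetical, while the columns follow the diagram.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_dim (store_key INTEGER PRIMARY KEY, store_name TEXT,
                        city TEXT, state TEXT, region TEXT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_desc TEXT);
CREATE TABLE time_dim (period_key INTEGER PRIMARY KEY, year INTEGER,
                       quarter INTEGER, month INTEGER);
-- The fact table holds one foreign key per dimension plus the measures.
CREATE TABLE sales_fact (
    store_key INTEGER REFERENCES store_dim(store_key),
    product_key INTEGER REFERENCES product_dim(product_key),
    period_key INTEGER REFERENCES time_dim(period_key),
    units INTEGER, price REAL);

INSERT INTO store_dim VALUES (1, 'S1', 'Mumbai', 'Maharashtra', 'West'),
                             (2, 'S2', 'Delhi', 'Delhi', 'North');
INSERT INTO product_dim VALUES (10, 'Soap');
INSERT INTO time_dim VALUES (200601, 2006, 1, 1), (200602, 2006, 1, 2);
INSERT INTO sales_fact VALUES (1, 10, 200601, 5, 20.0),
                              (1, 10, 200602, 7, 20.0),
                              (2, 10, 200601, 3, 20.0);
""")

# Units per region per quarter: one join from the fact table to each dimension used.
units_by_region = conn.execute("""
    SELECT s.region, t.quarter, SUM(f.units)
    FROM sales_fact AS f
    JOIN store_dim AS s ON s.store_key = f.store_key
    JOIN time_dim AS t ON t.period_key = f.period_key
    GROUP BY s.region, t.quarter
    ORDER BY s.region
""").fetchall()
```

Because region sits directly in the store dimension, the report needs one join per dimension. In a snowflake layout the same query would need an extra join through a separate city table.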
SnowFlake Schema

• Variant of the star schema model.
• A single, large, central fact table and one or more tables for each
dimension.
• Dimension tables are normalized, i.e. dimension table data is split into
additional tables.

SnowFlake Schema (contd..)
Store Dimension   City Dimension   Fact Table    Time Dimension   Product Dimension
Store Key         City Key         Store Key     Period Key       Product Key
Store Name        City             Product Key   Year             Product Desc
City Key          State            Period Key    Quarter
                  Region           Units         Month
                                   Price

Drawbacks: Time-consuming joins, slow report generation

Galaxy / Fact Constellation Schema

Sales Fact Table   Shipping Fact Table   Product Dimension   Store Dimension
Store Key          Shipper Key           Product Key         Store Key
Product Key        Store Key             Product Desc        Store Name
Period Key         Product Key                               City
Units              Period Key                                State
Price              Units                                     Region
                   Price
Difference between OLAP and OLTP
OLAP Database                          OLTP Database
Multidimensional database structures   Normalized data structures
Indexes: many                          Indexes: few
Joins: few                             Joins: many
Aggregated data: more                  Aggregated data: less
No. of users: few                      No. of users: more
Periodic update of data                Frequent data modification
Huge volumes of data                   Small volumes of data

ETL Process

Bill Inmon's paradigm

One centralized data warehouse acts as the enterprise-wide data warehouse;
data marts are then built as needed for a specific department or process.

It is known as the top-down approach.

The central data warehouse follows the ER modeling approach.

Inmon’s Architecture

Ralph Kimball's paradigm

Build small, business-process-oriented data marts that are joined to each
other using the dimensions common to the business processes.

It is known as the bottom-up approach.

Data marts should be built using the dimensional modeling approach.

Kimball’s Architecture

ETL testing Process

Scope of Testing…

• The scope of testing is much larger in a data warehouse project than in
manual testing.
• The testing is divided into two phases:
– Source to staging
– Staging to target

• Functional, integration, UAT and regression testing are typically done in
a data warehouse project.

Types Of Testing

The following kinds of testing can be done in a data warehouse:
• Functional test: it verifies that the item is compliant with its
specified business requirements.
• Usability test: it evaluates the item by letting users interact
with it, in order to verify that the item is easy to use and
comprehensible.
• Performance test: it checks that the item performance is
satisfactory under typical workload conditions.
• Stress test: it shows how well the item performs with peak
loads of data and very heavy workloads.

Types Of Testing Contd…

• Recovery test: it checks how well an item is able to recover
from crashes, hardware failures and other similar problems.
• Security test: it checks that the item protects data and maintains
functionality as intended.
• Regression test: It checks that the item still functions correctly
after a change has occurred.

What vs. how in testing

[Table: rows are the test types (Functional, Usability, Performance, Stress,
Recovery, Security, Regression). Columns are what is tested: conceptual schema
and logical schema (Analysis & design), then ETL procedures, database and
front-end (Implementation).]

Checks in Warehouse Testing

• Static Data validation
• Data Type Validation
• Counts Check
• Duplicates Check
• Primary Key Foreign Key
• Value Maps/Look ups Testing
• Hash Counts/Checksum
• Validation of Views
• Error Handling/ITO’s

Static Data validation
– Data which is not going to change frequently is static data. This data is
loaded once, and a manual process is generally followed to load
inserts/updates, e.g. Country table, Currency table.
– The data can be validated by comparing it in an Excel sheet:
– string comparison {EXACT(A1,A2)}
– numeric comparison {DELTA(A1,A2)}

• SELECT column_name
FROM Table
WHERE column_name NOT IN (stat_col1_row1, stat_col2_row2, …);
- stat_colN_rowN are all the possible static values
- Use Microsoft Access to perform the comparisons
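
The NOT IN check above can be automated along these lines. This is a sketch under stated assumptions: an in-memory SQLite table `currency` with made-up rows, and a hypothetical helper `unexpected_values`. Table and column names are interpolated into the SQL, so they must come from trusted test configuration, never from user input.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE currency (code TEXT)")
conn.executemany("INSERT INTO currency VALUES (?)", [("USD",), ("INR",), ("XXX",)])

def unexpected_values(conn, table, column, allowed):
    """Return values found in the column that are not in the allowed static list."""
    placeholders = ", ".join("?" for _ in allowed)
    sql = f"SELECT {column} FROM {table} WHERE {column} NOT IN ({placeholders})"
    return [row[0] for row in conn.execute(sql, list(allowed))]

bad_codes = unexpected_values(conn, "currency", "code", ["USD", "INR", "EUR"])
```

An empty result means every loaded value belongs to the expected static set.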

Data Type Validation

• Data type validation helps in identifying whether the data sent from
the source has the expected data type and length. The data mappings, once
finalized, are used to do this comparison.
• In SQL Server the following command helps:
– SELECT column_name, Data_type, Numeric_Precision, Numeric_scale,
Domain_name
FROM INFORMATION_SCHEMA.COLUMNS
WHERE table_name = '<table_name>'

Counts Check
• For tables where there is a direct map from the source to the target, we can
quickly check that all the data is loaded by comparing row counts.
– SELECT count(*) FROM table_name
--WHERE task_id = <load_id> if there is a load id for each load
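
A count comparison between source and target might look like the sketch below. It assumes both tables are reachable from one in-memory SQLite connection, and the names `src_trades`, `tgt_trades` and the helper `row_count` are hypothetical. Table names are interpolated, so they must come from trusted configuration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src_trades (trade_id INTEGER, task_id TEXT);
CREATE TABLE tgt_trades (trade_id INTEGER, task_id TEXT);
INSERT INTO src_trades VALUES (1, 'load-1'), (2, 'load-1'), (3, 'load-2');
INSERT INTO tgt_trades VALUES (1, 'load-1'), (2, 'load-1');
""")

def row_count(conn, table, task_id=None):
    """Count rows in a table, optionally restricted to a single load id."""
    sql = f"SELECT count(*) FROM {table}"
    params = ()
    if task_id is not None:
        sql += " WHERE task_id = ?"
        params = (task_id,)
    return conn.execute(sql, params).fetchone()[0]

load1_matches = row_count(conn, "src_trades", "load-1") == row_count(conn, "tgt_trades", "load-1")
overall_matches = row_count(conn, "src_trades") == row_count(conn, "tgt_trades")
```

Here the first load matches while the overall counts do not, which points at the second load as the one to investigate.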

Duplicates Check

• In the inner select, put all the columns except the primary key. Use only the source
attributes. Leave out attributes such as load id and expiry date, which are added during the
loading process.
• SELECT count(*) AS count FROM
(SELECT trade_px, trade_qty, security, accrued_int, gross_amt, net_amt,
trade_date, dq_id, app_id, tran_type, port, settle_date, fx_rate_sec2port,
fx_rate_sec2usd, buy_ccy, sell_ccy, broker, task_id, fx_rate_sec2cbal,realized_GL,
tran_action_id,trade_id,trade_seq_id
FROM cr.cr_trade_details_f
WHERE task_id = '43DF2AEF-53F6-449A-81D9-C17401E6454C'
GROUP BY trade_px, trade_qty, security, accrued_int, gross_amt, net_amt,
trade_date, dq_id, app_id, tran_type, port, settle_date, fx_rate_sec2port,
fx_rate_sec2usd, buy_ccy, sell_ccy, broker, task_id, fx_rate_sec2cbal,
realized_GL,tran_action_id,trade_id,trade_seq_id
HAVING count(*) > 1) AS tbl

Duplicates Check Contd..
SELECT ColA, count(*) FROM TableName
GROUP BY ColA
HAVING COUNT(*) > 1

Or

SELECT T.cola, T.colb
FROM your_table AS T
JOIN
  (SELECT cola
   FROM your_table
   GROUP BY cola
   HAVING COUNT(*)>1) AS D
  ON T.cola = D.cola ;
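
The GROUP BY ... HAVING pattern can be wrapped in a small helper. This sketch assumes an in-memory SQLite table `trades` with made-up rows, and `duplicate_groups` is a hypothetical name. Column and table names are interpolated, so they must come from trusted test configuration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trades (trade_id INTEGER PRIMARY KEY, security TEXT, trade_qty INTEGER)"
)
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    [(1, "ABC", 10), (2, "ABC", 10), (3, "XYZ", 5)],
)

def duplicate_groups(conn, table, columns):
    """Return each repeated combination of the given columns with its row count."""
    cols = ", ".join(columns)
    sql = (
        f"SELECT {cols}, count(*) FROM {table} "
        f"GROUP BY {cols} HAVING count(*) > 1"
    )
    return conn.execute(sql).fetchall()

# The primary key is left out so that rows differing only by key count as duplicates.
dupes = duplicate_groups(conn, "trades", ["security", "trade_qty"])
```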

Primary Key Foreign Key
• SELECT <fk> FROM <child_table> WHERE <fk> NOT IN (SELECT <pk> FROM <parent_table>)
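
An orphan check of this kind can be sketched as below, assuming an in-memory SQLite database with hypothetical `product_dim` and `sales_fact` tables. The sketch swaps NOT IN for NOT EXISTS: if the subquery of a NOT IN ever returns a NULL, the comparison evaluates to unknown and no rows come back, which would hide real orphans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_desc TEXT);
CREATE TABLE sales_fact (product_key INTEGER, units INTEGER);
INSERT INTO product_dim VALUES (1, 'Soap'), (2, 'Shampoo');
INSERT INTO sales_fact VALUES (1, 5), (2, 3), (99, 7);  -- 99 has no parent row
""")

# Foreign key values in the fact table that have no matching dimension row.
orphans = [
    row[0]
    for row in conn.execute("""
        SELECT f.product_key
        FROM sales_fact AS f
        WHERE NOT EXISTS (
            SELECT 1 FROM product_dim AS d WHERE d.product_key = f.product_key
        )
    """)
]
```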

Value Maps/Look ups Testing
• Value maps: the source system may call a string Finance/Financial/Fin, but
in the target system it is expected to be called Financials. Value maps are
used for this.
• To test value maps, check that each distinct value in the target system maps
back to a value in the source database.
• Possible problems here
– Changes in value maps over time may leave redundant data in the
database.
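
A value-map test can be sketched with a plain dictionary. The mapping reuses the Finance/Financial/Fin example; the helper names and the extra sample values are hypothetical.

```python
# Source spelling -> expected target value.
VALUE_MAP = {
    "Finance": "Financials",
    "Financial": "Financials",
    "Fin": "Financials",
}

def unmapped_source_values(source_values):
    """Source values for which the value map has no entry."""
    return sorted(set(source_values) - set(VALUE_MAP))

def unexpected_target_values(target_values):
    """Target values that no value-map entry could have produced."""
    return sorted(set(target_values) - set(VALUE_MAP.values()))

missing_from_map = unmapped_source_values(["Fin", "Finance", "Fnc"])
not_mapped_in_target = unexpected_target_values(["Financials", "Finance"])
```

A raw source spelling showing up in the target, as "Finance" does here, is the kind of leftover that an outdated value map tends to produce.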

Hash Counts/Checksum
• Hash counts
– are used to check in one go whether all the numerical columns are loaded correctly.
Sum all the numerical columns in source and target and compare them where there is a
one-to-one mapping. If a flat file is used as the source, open the flat
file in Excel and sum the specified column.
– SELECT sum(trade_px) sum_of_trade_px
– FROM cr.cr_trade_details_f
– WHERE task_id = <value>
– If records error out, sum the column in the error records as well: the sum in the
target plus the sum in the error records should equal the sum in the source. The error
records, after correction, should be inserted into the target through the load process.

• Check sum
– is used to verify the row-level data.
This can be used for direct maps to compare the results.
– checksum(field1, field2, ... fieldn)
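
Both checks can be sketched outside the database as well. The code below is an illustration only: it uses Python's `hashlib.md5` as a stand-in for a database checksum function, and the helper names and sample rows are hypothetical.

```python
import hashlib

def column_sum(rows, index):
    """Hash count: sum one numeric column across all rows."""
    return sum(row[index] for row in rows)

def row_checksum(row):
    """Row-level checksum over all fields, joined with a separator."""
    joined = "|".join(str(value) for value in row)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def mismatched_rows(source_rows, target_rows):
    """Source rows whose checksum does not appear anywhere in the target."""
    target_checksums = {row_checksum(row) for row in target_rows}
    return [row for row in source_rows if row_checksum(row) not in target_checksums]

source = [(1, 10.5), (2, 20.0)]
target = [(1, 10.5), (2, 21.0)]   # second row was altered during the load

sums_match = column_sum(source, 1) == column_sum(target, 1)
bad_rows = mismatched_rows(source, target)
```

The column sum flags that something is wrong in one step, and the row-level checksum then pinpoints which row differs.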

Validation of Views

• Select * from v_<View_name> where <Key_column> in (select
key_column from table_name)

Error Handling/ITO’s

• Go to the error logs or the error tables.
• Validate the exact error information, or the ITO number, for the error
scenario tested.

How to do ETL/DWH testing
• Validate the environment
• Prepare/Load the test data
• Run the ETL jobs
• Validate the flow of data as per the mapping document (source
to staging and staging to target)
• Prepare the test data for the error handling scenarios to test the
ITO alarms

Challenges in DWH testing

• Data selection from multiple source systems, and the analysis that
follows, pose a great challenge.
• Volume and complexity of the data.
• Inconsistent and redundant data in a data warehouse.
• Inconsistent and inaccurate reports.
• Non-availability of history data.
• Manipulation of test data to ensure full test coverage.

Tools for DWH Testing
• Excel
• DWH/BI Comparator tool

• Reasons to use them
– Access limitations.
– Link between source and target databases not available.

Difference between DB and DWH testing

Database Testing                      Data Warehouse Testing
Smaller in scale                      Large scale; voluminous data
Usually used to test data at the      Includes several facets; extraction,
source instead of testing using       transformation and loading mechanisms
the GUI                               are the major ones
Usually homogeneous data              Heterogeneous data involved
Normalized data                       De-normalized data
CRUD operations                       Usually read-only operations
Consistent data                       Temporal data inconsistency

Difference between DWH and Data Marts

Data Warehouse                        Data Marts
Expensive                             Relatively cheap
Large development cycle               Delivered in < 6 months
Change management is difficult        Easy to manage change
Difficult to obtain continuous        Can lead to independent and
corporate support                     incompatible marts
Technical challenges in building      Cleansing, transformation and
large databases                       modeling techniques may be
                                      incompatible

?
Thank You
