Srinivasarao Sanka
Agenda
• Introduction
• Normalization and de-normalization
• What is a data warehouse?
• Advantages and Disadvantages of DWH
• Types of Load
• Dimension table
• Types of SCD
• Fact tables
• Data warehousing schema
• Difference between OLAP and OLTP
• ETL process
• Inmon’s and Kimball’s concepts
• ETL testing process
• Checks in DWH testing
• How to do ETL testing
• Challenges in ETL/DWH testing
• DWH Versus DB testing
• DWH Versus Data marts
2 Confidential Services
Introduction
• The E-R (Entity-Relationship) model is used in two-dimensional databases, for
example SQL Server or Oracle. A table has two dimensions: rows
and columns. Generally, OLTP systems are based on two dimensions.
• In dimensional modeling, we have more than two dimensions.
• A cube represents a three-dimensional model. In a data warehouse, the data is
stored in summarized form, so it can be retrieved more easily than from a
normal OLTP database.
• Let us assume PROD, GEOG, TIME and MEAS are the four dimensions we
have, and a DW system stores information along these four dimensions.
Suppose you want to know the sales of Lux (PROD) in North India (GEOG) during
Oct 2006 (TIME), measured in units of Lux 75 grams (MEAS).
• That is, FACT_TBL(PROD LUX, GEOG NORTH_INDIA, TIME OCT06, MEAS
Units)
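A minimal sketch of the FACT_TBL lookup, assuming Python's built-in sqlite3 module as a stand-in for the warehouse. The column names, the helper `units_sold` and all figures are made up for illustration:

```python
import sqlite3

# Hypothetical fact table: one row per (product, geography, time) cell,
# holding a single measure (units).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_tbl (prod TEXT, geog TEXT, time TEXT, units INTEGER)"
)
conn.executemany(
    "INSERT INTO fact_tbl VALUES (?, ?, ?, ?)",
    [
        ("LUX", "NORTH_INDIA", "OCT06", 1200),  # made-up figures
        ("LUX", "SOUTH_INDIA", "OCT06", 900),
        ("LUX", "NORTH_INDIA", "NOV06", 1100),
    ],
)


def units_sold(prod, geog, time):
    """Return the measure stored at one (PROD, GEOG, TIME) cell, or None."""
    row = conn.execute(
        "SELECT SUM(units) FROM fact_tbl WHERE prod = ? AND geog = ? AND time = ?",
        (prod, geog, time),
    ).fetchone()
    return row[0]


lux_north_oct = units_sold("LUX", "NORTH_INDIA", "OCT06")
print(lux_north_oct)
```

Each dimension value narrows the fact table to one cell, and the measure is whatever is stored there.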
Normalization
DE-Normalization
This works: queries are now blindingly simple
(select * from users)
Necessity is the mother of invention
Scenario
Scenario: ABC Pvt Ltd.
• Branches: Mumbai, Delhi, Chennai, Bangalore
• The Sales Manager wants sales per item type per branch for the first quarter.
Solution: ABC Pvt Ltd.
• Data from the Mumbai, Delhi, Chennai and Bangalore branches is brought together in a data warehouse.
• Query and analysis tools run against the data warehouse and produce the report for the Sales Manager.
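As a rough illustration of the report the Sales Manager asks for, the query below computes sales per item type per branch for the first quarter. The `sales` table, its columns, the year 2006 and all amounts are assumptions, not part of the scenario:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (branch TEXT, item_type TEXT, sale_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Mumbai", "Soap", "2006-01-15", 100.0),
        ("Mumbai", "Soap", "2006-03-02", 50.0),
        ("Delhi", "Soap", "2006-02-10", 70.0),
        ("Delhi", "Shampoo", "2006-02-11", 30.0),
        ("Chennai", "Soap", "2006-04-01", 999.0),  # second quarter, excluded
    ],
)

# First quarter = January to March; ISO date strings compare correctly as text.
q1_sales = conn.execute(
    """
    SELECT branch, item_type, SUM(amount)
    FROM sales
    WHERE sale_date >= '2006-01-01' AND sale_date < '2006-04-01'
    GROUP BY branch, item_type
    ORDER BY branch, item_type
    """
).fetchall()

for branch, item_type, total in q1_sales:
    print(branch, item_type, total)
```

With all branches in one store, the report is a single GROUP BY rather than four separate extracts.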
Need for Data Warehousing
Need for Data Warehousing (contd..)
Need for Data Warehousing (contd..)
What is a Data Warehouse?
Inmon’s definition
A data warehouse is a
- subject-oriented,
- integrated,
- time-variant,
- nonvolatile
collection of data in support of management’s
decision-making process.
Subject-oriented
• A data warehouse is organized around subjects such as
sales, product and customer.
• It focuses on modeling and analysis of data for
decision makers.
• It excludes data that is not useful in the decision support process.
Integration
• Data Warehouse is constructed by integrating
multiple heterogeneous sources.
• Data preprocessing is applied to ensure
consistency.
[Figure: RDBMS and legacy system sources feeding the data warehouse]
• In terms of data:
– encoding structures
– measurement of attributes
– physical attributes of data
– remarks
– naming conventions
Nonvolatile
• Data once recorded cannot be updated.
• A data warehouse requires two operations in data
accessing:
– Initial loading of data
– Access of data
Advantages of DWH
• Data warehouses tend to have a very high query
success rate, as they have complete control over the four
main areas of data management systems:
– Clean data
– Indexes: multiple types
– Query processing: multiple options
– Security: data and access
• Easy report creation.
• Enhanced access to data and information.
Disadvantages of DWH
• There are considerable disadvantages involved in
moving data from multiple, often highly
disparate, data sources to one data warehouse
that translate into long implementation time,
high cost, lack of flexibility, dated information,
and limited capabilities.
• Preparation may be time consuming.
• Compatibility with existing systems.
• Security issues.
• Long initial implementation time and associated
high cost
Disadvantages of DWH Contd..
• Limited flexibility of use and types of users
• Difficult to accommodate changes in data types and ranges,
data source schema, indexes and queries
Data Warehousing Architecture
[Figure: data warehousing architecture. Recoverable labels: Monitoring & Administration, Metadata Repository, OLAP Servers, Data Mining, Data Marts]
The ETL Process
Types of Load
• Initial Load: a one-time load of the source transaction
system data of the past years into the data warehouse system.
• Full Load: a complete delete and reload of the data. It is used for small,
largely independent sets of tables, which receive the full data (current
data + history data) as input.
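A minimal sketch of a full load, assuming a small hypothetical `currency_dim` table and Python's sqlite3 module. The delete and the reload run in one transaction, so a failed reload does not leave the table empty:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE currency_dim (code TEXT PRIMARY KEY, name TEXT)")


def full_load(rows):
    """Delete everything in the target and reload it from the full extract."""
    with conn:  # one transaction: the delete and the reload commit together
        conn.execute("DELETE FROM currency_dim")
        conn.executemany("INSERT INTO currency_dim VALUES (?, ?)", rows)


# First extract, then a later extract that again carries the full data.
full_load([("INR", "Indian Rupee"), ("USD", "US Dollar")])
full_load([("INR", "Indian Rupee"), ("USD", "US Dollar"), ("EUR", "Euro")])

row_count = conn.execute("SELECT COUNT(*) FROM currency_dim").fetchone()[0]
print(row_count)
```

Because every extract carries current plus history data, the target never needs to be compared row by row with the source before loading.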
Dimension
Types of Dimension Tables
• Conformed Dimensions:
A Dimension that is used in multiple locations is called a conformed dimension.
A conformed dimension may be used with multiple fact tables in a single
database, or across multiple data marts or data warehouses.
Eg: Time Dimension.
• Junk Dimensions:
A junk dimension is a single table with a combination of different and unrelated
attributes to avoid having a large number of foreign keys in the fact table. Junk
dimensions are often created to manage the foreign keys created by Rapidly
Changing Dimensions.
Types of Dimension Tables Cont
• Degenerate Dimensions:
A degenerate dimension is when the dimension attribute is stored as part of fact
table, and not in a separate dimension table.
• Slowly Changing Dimensions (SCD):
Dimensions that change over time are called Slowly Changing Dimensions.
Three types of SCD are used in DW
SCD Type 1
SCD Type 2
SCD Type 3
Slowly Changing Dimensions
• Type 1:
• The new information simply overwrites the original information. In other
words, no history is kept.
• Example: we originally have the following table:
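A minimal sketch of a Type 1 overwrite, borrowing the customer row (100, Srinivas, Karnataka) from the Type 3 example. The table name `customer_dim` and the helper `scd_type1_update` are assumptions for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_dim ("
    "customer_key INTEGER PRIMARY KEY, name TEXT, state TEXT)"
)
conn.execute("INSERT INTO customer_dim VALUES (100, 'Srinivas', 'Karnataka')")


def scd_type1_update(customer_key, new_state):
    """Type 1: overwrite the attribute in place, so the old value is lost."""
    conn.execute(
        "UPDATE customer_dim SET state = ? WHERE customer_key = ?",
        (new_state, customer_key),
    )


scd_type1_update(100, "Chennai")
type1_rows = conn.execute("SELECT * FROM customer_dim").fetchall()
print(type1_rows)
```

After the update there is still one row for the customer, and nothing records that the state was ever Karnataka.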
Slowly Changing Dimensions Cont
• Type 2:
• A new record is added to the table to represent the new information.
Therefore, both the original and the new record will be present. The new
record gets its own primary key.
• Example: we originally have the following table:
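A minimal sketch of a Type 2 change, again using the customer row from the Type 3 example. The business key column `customer_id` and its value `C-1` are hypothetical; they are what ties the two versions of the customer together:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer_dim (
        customer_key INTEGER PRIMARY KEY,  -- surrogate key, one per version
        customer_id  TEXT,                 -- hypothetical business key
        name         TEXT,
        state        TEXT
    )
    """
)
conn.execute("INSERT INTO customer_dim VALUES (100, 'C-1', 'Srinivas', 'Karnataka')")


def scd_type2_change(customer_id, name, new_state):
    """Type 2: keep the old row and add a new row with its own primary key."""
    cur = conn.execute(
        "INSERT INTO customer_dim (customer_id, name, state) VALUES (?, ?, ?)",
        (customer_id, name, new_state),
    )
    return cur.lastrowid


new_key = scd_type2_change("C-1", "Srinivas", "Chennai")
type2_rows = conn.execute(
    "SELECT customer_key, state FROM customer_dim ORDER BY customer_key"
).fetchall()
print(new_key, type2_rows)
```

Real Type 2 tables often add effective and expiry dates or a current-row flag; those are left out here to keep the sketch to what the slide describes.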
Slowly Changing Dimensions
• Type 3:
• In Type 3 Slowly Changing Dimension, there will be two columns to indicate
the particular attribute of interest, one indicating the original value, and one
indicating the current value. There will also be a column that indicates when
the current value becomes active.
• Example: we originally have the following table:
Customer Key Name State
100 Srinivas Karnataka
• After Srinivas moved from Karnataka to Chennai, the original information gets
updated, and we have the following table:
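A minimal sketch of the Type 3 change for this customer. The column names `original_state`, `current_state` and `effective_date` are assumed from the description above, and the effective date is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer_dim (
        customer_key   INTEGER PRIMARY KEY,
        name           TEXT,
        original_state TEXT,
        current_state  TEXT,
        effective_date TEXT
    )
    """
)
conn.execute(
    "INSERT INTO customer_dim VALUES (100, 'Srinivas', 'Karnataka', 'Karnataka', NULL)"
)


def scd_type3_change(customer_key, new_state, effective_date):
    """Type 3: keep the original value, overwrite only the current value."""
    conn.execute(
        "UPDATE customer_dim SET current_state = ?, effective_date = ? "
        "WHERE customer_key = ?",
        (new_state, effective_date, customer_key),
    )


scd_type3_change(100, "Chennai", "2006-10-01")  # the date is made up
type3_row = conn.execute("SELECT * FROM customer_dim").fetchone()
print(type3_row)
```

Only one earlier value survives: a second move would overwrite `current_state` again and the intermediate value would be lost.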
Fact Table
– containing facts
– foreign keys to dimension tables
Types of Fact Tables
• Additive: Additive facts are facts that can be summed up through all of the
dimensions in the fact table.
• Semi-Additive: Semi-additive facts are facts that can be summed up for some
of the dimensions in the fact table, but not the others.
• Non-Additive: Non-additive facts are facts that cannot be summed up for any
of the dimensions present in the fact table
• Factless Fact : A factless fact table is a fact table that does not have any
measures. It is essentially an intersection of dimensions.
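To make the additive and semi-additive cases concrete, here is a small sketch with a hypothetical `account_fact` table. Deposits are additive; the balance is semi-additive because it can be summed across accounts but not across days:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account_fact (day TEXT, account TEXT, deposits REAL, balance REAL)"
)
conn.executemany(
    "INSERT INTO account_fact VALUES (?, ?, ?, ?)",
    [
        ("2006-10-01", "A", 100.0, 100.0),
        ("2006-10-02", "A", 50.0, 150.0),
        ("2006-10-01", "B", 200.0, 200.0),
        ("2006-10-02", "B", 0.0, 200.0),
    ],
)

# Additive: deposits can be summed across every dimension (day and account).
total_deposits = conn.execute("SELECT SUM(deposits) FROM account_fact").fetchone()[0]

# Semi-additive: balance can be summed across accounts for a single day ...
balance_on_day2 = conn.execute(
    "SELECT SUM(balance) FROM account_fact WHERE day = '2006-10-02'"
).fetchone()[0]

# ... but summing it across days counts the same money twice.
meaningless_sum = conn.execute(
    "SELECT SUM(balance) FROM account_fact WHERE account = 'A'"
).fetchone()[0]

print(total_deposits, balance_on_day2, meaningless_sum)
```

Account A never held 250.0, which is why a balance is usually reported as of a date rather than summed over time.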
Schema
Data Warehouse Schema
• Star Schema
• Snowflake Schema
• Galaxy /Fact Constellation Schema
Star Schema
• A single, large, central fact table and one table for each dimension.
• Every fact points to one tuple in each of the dimensions and has additional
attributes.
• Does not capture hierarchies directly.
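A small sketch of a star schema query, assuming store, product and time dimensions around one sales fact table. Table names, column names and rows are illustrative only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, store_name TEXT, city TEXT);
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_desc TEXT);
    CREATE TABLE time_dim    (period_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE sales_fact  (
        store_key   INTEGER REFERENCES store_dim(store_key),
        product_key INTEGER REFERENCES product_dim(product_key),
        period_key  INTEGER REFERENCES time_dim(period_key),
        units INTEGER,
        price REAL
    );
    INSERT INTO store_dim   VALUES (1, 'Store A', 'Mumbai'), (2, 'Store B', 'Delhi');
    INSERT INTO product_dim VALUES (10, 'Lux 75 grams');
    INSERT INTO time_dim    VALUES (200610, 2006, 10);
    INSERT INTO sales_fact  VALUES (1, 10, 200610, 5, 20.0), (2, 10, 200610, 3, 20.0);
    """
)

# Each fact row points to exactly one row in every dimension, so the query
# is one join per dimension followed by an aggregate.
units_by_city = conn.execute(
    """
    SELECT s.city, p.product_desc, SUM(f.units)
    FROM sales_fact f
    JOIN store_dim   s ON s.store_key   = f.store_key
    JOIN product_dim p ON p.product_key = f.product_key
    JOIN time_dim    t ON t.period_key  = f.period_key
    WHERE t.year = 2006 AND t.month = 10
    GROUP BY s.city, p.product_desc
    ORDER BY s.city
    """
).fetchall()
print(units_by_city)
```

The city sits directly in the store dimension here; a snowflake schema would move it into its own table and add one more join.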
Star Schema (contd..)
[Figure: star schema diagram. Only the Product Dimension (Product Key, Product Desc) is recoverable]
SnowFlake Schema (contd..)
Fact Table: Store Key, Product Key, Period Key, Units, Price
Store Dimension: Store Key, Store Name, City Key
City Dimension: City Key, City, State, Region
Time Dimension: Period Key, Year, Quarter, Month
Product Dimension: Product Key, Product Desc
Galaxy / Fact Constellation Cont
Sales Fact Table: Store Key, Product Key, Period Key, Units, Price
Shipping Fact Table: Shipper Key, Store Key, Product Key, Period Key, Units, Price
Product Dimension: Product Key, Product Desc
Store Dimension: Store Key, Store Name, City, State, Region
Difference between OLAP and OLTP
Difference between OLAP and OLTP
Contd..
OLAP Database                         OLTP Database
Multidimensional database structures  Normalized data structures
Indexes: many                         Indexes: few
Joins: few                            Joins: many
Aggregated data: more                 Aggregated data: less
ETL Process
Bill Inmon's paradigm
Inmon’s Architecture
Ralph Kimball's paradigm
It is known as the bottom-up approach.
Kimball’s Architecture
ETL testing Process
Scope of Testing…
Types Of Testing
Types Of Testing Contd…
Types Of Testing Contd…
What vs. how in testing
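[Table: what is tested (columns) against how it is tested (rows); the cell entries are not recoverable]
What: Conceptual schema, Logical schema, ETL procedures, Database, Front-end
How: Functional, Usability, Performance, Stress, Recovery, Security, Regression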
Checks in Warehouse Testing
Static Data validation
– Data which is not going to change frequently is static data. This data is
loaded once, and inserts/updates generally follow a manual
process. E.g. County table, Currency table
– Methods used to validate the data:
– comparing in an xls sheet
– string comparison {EXACT(A1,A2)}
– numeric comparison {DELTA(A1,A2)}
• SELECT column_name
FROM Table
WHERE column_name NOT IN (stat_col1_row1, stat_col2_row2, ….);
- stat_colN_rowN are all the possible static values
- Use Microsoft Access to perform the testing for comparisons
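The NOT IN check above can be sketched as follows, assuming a hypothetical `currency` table and an agreed list of static values. The helper `unexpected_values` is not from the source:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE currency (code TEXT)")
conn.executemany("INSERT INTO currency VALUES (?)", [("INR",), ("USD",), ("XXX",)])


def unexpected_values(table, column, allowed):
    """Return values of `column` that are not in the agreed static list.

    `table` and `column` are interpolated into the SQL text, so they must
    come from the tester, never from untrusted input.
    """
    placeholders = ", ".join("?" for _ in allowed)
    sql = f"SELECT {column} FROM {table} WHERE {column} NOT IN ({placeholders})"
    return [row[0] for row in conn.execute(sql, list(allowed))]


bad_codes = unexpected_values("currency", "code", ["INR", "USD", "EUR"])
print(bad_codes)
```

An empty result means every loaded value is one of the expected static values; it does not prove that every expected value was loaded.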
Data Type Validation
• Data type validation helps in identifying whether the data sent from
the source has the expected type and length. The data mappings, once
finalized, are used to do this comparison.
• In SQL Server the following command helps:
– SELECT column_name, data_type, numeric_precision, numeric_scale,
domain_name
FROM INFORMATION_SCHEMA.COLUMNS
WHERE table_name = '<table_name>'
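The same comparison can be sketched outside SQL Server. The example below uses SQLite's `PRAGMA table_info` in place of INFORMATION_SCHEMA.COLUMNS; the `trade` table and the expected types (standing in for the mapping document) are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trade (trade_id INTEGER, broker VARCHAR(10), trade_px REAL)"
)

# Expected types as they might appear in the mapping document (hypothetical).
expected = {"trade_id": "INTEGER", "broker": "VARCHAR(20)", "trade_px": "REAL"}


def type_mismatches(table, expected_types):
    """Return {column: (expected, actual)} for every column whose type differs."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    actual = {row[1]: row[2] for row in conn.execute(f"PRAGMA table_info({table})")}
    return {
        col: (want, actual.get(col))
        for col, want in expected_types.items()
        if actual.get(col) != want
    }


mismatches = type_mismatches("trade", expected)
print(mismatches)
```

Here the broker column is declared shorter than the mapping expects, which is the kind of length difference this check is meant to catch.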
Counts Check
• For tables with a direct map from the source to the target, we can
check in one quick step that all the data is loaded.
– SELECT count(*) FROM table_name
--WHERE task_id = <load_id> if there is a load id for each load
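A minimal sketch of the counts check, comparing source and target either for the whole table or for one load. The tables `src_trade` and `tgt_trade` and the load ids `L1` and `L2` are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_trade (trade_id INTEGER, task_id TEXT)")
conn.execute("CREATE TABLE tgt_trade (trade_id INTEGER, task_id TEXT)")
conn.executemany(
    "INSERT INTO src_trade VALUES (?, ?)", [(1, "L1"), (2, "L1"), (3, "L2")]
)
conn.executemany("INSERT INTO tgt_trade VALUES (?, ?)", [(1, "L1"), (2, "L1")])


def row_count(table, task_id=None):
    """Count rows in `table`, optionally restricted to one load id."""
    if task_id is None:
        return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE task_id = ?", (task_id,)
    ).fetchone()[0]


load_l1_matches = row_count("src_trade", "L1") == row_count("tgt_trade", "L1")
full_matches = row_count("src_trade") == row_count("tgt_trade")
print(load_l1_matches, full_matches)
```

Load L1 arrived complete while the table as a whole is one row short, which points at load L2.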
Duplicates Check
• In the inner select, put all the columns except the primary key. Use only the source
attributes. Leave out attributes like load id and expiry date, which are added during the
loading process.
• SELECT count(*) AS count FROM
(SELECT trade_px, trade_qty, security, accrued_int, gross_amt, net_amt,
trade_date, dq_id, app_id, tran_type, port, settle_date, fx_rate_sec2port,
fx_rate_sec2usd, buy_ccy, sell_ccy, broker, task_id, fx_rate_sec2cbal,realized_GL,
tran_action_id,trade_id,trade_seq_id
FROM cr.cr_trade_details_f
WHERE task_id = '43DF2AEF-53F6-449A-81D9-C17401E6454C'
GROUP BY trade_px, trade_qty, security, accrued_int, gross_amt, net_amt,
trade_date, dq_id, app_id, tran_type, port, settle_date, fx_rate_sec2port,
fx_rate_sec2usd, buy_ccy, sell_ccy, broker, task_id, fx_rate_sec2cbal,
realized_GL,tran_action_id,trade_id,trade_seq_id
HAVING count(*) > 1) AS tbl
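A cut-down, runnable version of the same pattern, assuming a three-column `trade_details` table with a surrogate key `id`. Names and rows are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trade_details ("
    "id INTEGER PRIMARY KEY, trade_id TEXT, trade_px REAL, trade_qty INTEGER)"
)
conn.executemany(
    "INSERT INTO trade_details (trade_id, trade_px, trade_qty) VALUES (?, ?, ?)",
    [("T1", 10.5, 100), ("T1", 10.5, 100), ("T2", 11.0, 50)],
)

# Group on every source attribute, leaving out the surrogate primary key `id`.
# Each group with more than one row is one set of duplicates.
duplicate_groups = conn.execute(
    """
    SELECT COUNT(*) FROM (
        SELECT trade_id, trade_px, trade_qty
        FROM trade_details
        GROUP BY trade_id, trade_px, trade_qty
        HAVING COUNT(*) > 1
    ) AS tbl
    """
).fetchone()[0]
print(duplicate_groups)
```

The outer count is the number of duplicated groups, not the number of surplus rows.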
Contd..
SELECT ColA, count(*) FROM TableName
GROUP BY ColA
HAVING COUNT(*) > 1
Or
Primary Key Foreign Key
• SELECT <fk> FROM <child_table> WHERE <fk> NOT IN (SELECT <pk> FROM <parent_table>)
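A minimal sketch of this orphan check, assuming a `sales_fact` table whose `product_key` should always exist in `product_dim`. All names and rows are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE sales_fact (product_key INTEGER, units INTEGER)")
conn.executemany("INSERT INTO product_dim VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?)", [(1, 5), (2, 3), (9, 7)])

# Foreign key values in the fact table with no matching primary key.
orphan_keys = [
    row[0]
    for row in conn.execute(
        """
        SELECT DISTINCT product_key FROM sales_fact
        WHERE product_key NOT IN (SELECT product_key FROM product_dim)
        """
    )
]
print(orphan_keys)
```

One caveat with NOT IN: if the subquery can return NULL, the comparison yields no rows at all, so it is worth confirming the parent key column has no NULLs or using NOT EXISTS instead.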
Value Maps/Look ups Testing
• Value maps: the source system may call a string Finance/Financial/Fin, but
in the target system it is expected to be called Financials. Value maps are
used for this.
• To test value maps, check that the distinct values in the target system correspond to
values in the source database.
• Possible problems here
– Changes in value maps over a period of time may leave redundant data
in the database.
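One way to sketch this test is to apply the value map to the distinct source values and list any target value that nothing maps to. The `VALUE_MAP` dictionary, the `src` and `tgt` tables and the `sector` column are assumptions built around the Finance/Financials example:

```python
import sqlite3

# Hypothetical value map: source spellings on the left, target value on the right.
VALUE_MAP = {"Finance": "Financials", "Financial": "Financials", "Fin": "Financials"}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (sector TEXT)")
conn.execute("CREATE TABLE tgt (sector TEXT)")
conn.executemany("INSERT INTO src VALUES (?)", [("Finance",), ("Fin",)])
# 'Fin' in the target stands for a row loaded before the value map existed.
conn.executemany("INSERT INTO tgt VALUES (?)", [("Financials",), ("Fin",)])


def unmapped_target_values():
    """Return distinct target values that no source value maps to."""
    source_values = [r[0] for r in conn.execute("SELECT DISTINCT sector FROM src")]
    expected = {VALUE_MAP.get(value, value) for value in source_values}
    target_values = [r[0] for r in conn.execute("SELECT DISTINCT sector FROM tgt")]
    return sorted(value for value in target_values if value not in expected)


leftovers = unmapped_target_values()
print(leftovers)
```

The leftover value is the kind of redundant data that appears when a value map changes after some rows are already loaded.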
Hash Counts/Checksum
• Hash counts
– are used to identify in one go whether all the numerical columns are loaded.
Sum all the numerical columns in source and target and compare where there is a
one-to-one mapping. If a flat file is used as the source, open the flat
file in xls and sum the specified column.
– SELECT sum(trade_px) sum_of_trade_px
– FROM cr.cr_trade_details_f
– WHERE task_id = <value>
– If records are errored out, also sum the column over the error records: the sum
in the target plus the sum in the error records should equal the source. The error
records, after correction, should be inserted into the target through the load process.
• Checksum
– is used to verify the row-level data.
This can be used for direct maps to compare the results.
– checksum(field1, field2, ... fieldn)
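A rough sketch of both checks on a two-column table. SQLite has no `checksum()` function, so a SHA-256 digest computed in Python stands in for it here; the tables `src` and `tgt` and the helper names are assumptions:

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (trade_id INTEGER, trade_px REAL)")
conn.execute("CREATE TABLE tgt (trade_id INTEGER, trade_px REAL)")
rows = [(1, 10.5), (2, 20.25)]
conn.executemany("INSERT INTO src VALUES (?, ?)", rows)
conn.executemany("INSERT INTO tgt VALUES (?, ?)", rows)


def column_sum(table, column):
    """Hash count: the sum of one numeric column."""
    return conn.execute(f"SELECT SUM({column}) FROM {table}").fetchone()[0]


def row_checksums(table):
    """Row-level checksum: a digest of each row's fields, keyed by trade_id."""
    result = {}
    for row in conn.execute(f"SELECT trade_id, trade_px FROM {table}"):
        text = "|".join(str(field) for field in row)
        result[row[0]] = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return result


sums_match = column_sum("src", "trade_px") == column_sum("tgt", "trade_px")
checksums_match = row_checksums("src") == row_checksums("tgt")
print(sums_match, checksums_match)
```

The column sum is a cheap first pass; the per-row digests also identify which row differs when the totals happen to agree.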
Validation of Views
Error Handling/ITO’s
How to do ETL/DWH testing
• Validate the environment
• Prepare/Load the test data
• Run the ETL jobs
• Validate the flow of data as per the mapping document (source
to staging and staging to target)
• Prepare the test data for the error handling scenarios to test the
ITO alarms
Challenges in DWH testing
Tools for DWH Testing
• Excel
• DWH/BI comparator tool
• Reason to use
– Access limitations.
– Link between source & target databases not available.
Difference between DB and DWH testing
Difference between DWH and Data Marts
?
Thank You