Content
1 An Overview of Data Warehouse
2 Data Warehouse Architecture
3 Data Modeling for Data Warehouse
4 Overview of Data Cleansing
Content [contd]
6 Metadata Management
7 OLAP
8 Data Warehouse Testing
An Overview
Understanding What is a Data Warehouse
Components of Warehouse
Source Tables: Real-time, volatile data held in relational databases or flat files used for transaction processing (OLTP).
ETL Tools: Extract, cleanse, transform (aggregate, join), and load the data from the sources to the target.
Maintenance and Administration Tools: Authorize and monitor access to the data, set up users, and schedule jobs to run during off-peak periods.
Modeling Tools: Used to design the data warehouse for high performance using dimensional data modeling, and to map source files to target files.
Databases: The target databases and data marts that make up the data warehouse, structured for analysis and reporting purposes.
End-User Tools for Analysis and Reporting: Retrieve reports and analyze the data in the target tables; various querying, data mining, and OLAP tools are used for this purpose.
The warehouse has a staging area, where data is loaded and tested after cleansing and transformation. From there it is loaded into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis purposes.
Data Modeling
Effective way of using a Data Warehouse
Data Modeling
The E-R data model is commonly used in OLTP; in OLAP, the dimensional data model is commonly used. E-R (Entity-Relationship) Data Model
Entity: an object that can be observed and classified by its properties and characteristics, such as an employee, book, or student. Relationship: an association relating entities to other entities.
Star Schema
Dimension Table: product
prodId  name  price
p1      bolt  10
p2      nut   5

Dimension Table: store
storeId  city
c1       nyc
c2       sfo
c3       la

Fact Table: sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

Dimension Table: customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la
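The star schema above can be queried by joining the fact table to its dimension tables. The sketch below loads the example rows into an in-memory SQLite database (table and column names taken from the tables above) and computes total sales per product; it is illustrative, not a recommended warehouse implementation.

```python
import sqlite3

# Build the example star schema in memory.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product(prodId TEXT PRIMARY KEY, name TEXT, price INT);
CREATE TABLE store(storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer(custId INT PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE sale(orderId TEXT, date TEXT, custId INT, prodId TEXT,
                  storeId TEXT, qty INT, amt INT);
INSERT INTO product VALUES ('p1','bolt',10), ('p2','nut',5);
INSERT INTO store VALUES ('c1','nyc'), ('c2','sfo'), ('c3','la');
INSERT INTO customer VALUES (53,'joe','10 main','sfo'),
                            (81,'fred','12 main','sfo'),
                            (111,'sally','80 willow','la');
INSERT INTO sale VALUES ('o100','1/7/97',53,'p1','c1',1,12),
                        ('o102','2/7/97',53,'p2','c1',2,11),
                        ('105','3/8/97',111,'p1','c3',5,50);
""")

# A typical star-schema query: total sales amount per product name,
# joining the fact table to one dimension table.
rows = con.execute("""
    SELECT p.name, SUM(s.amt)
    FROM sale s JOIN product p ON s.prodId = p.prodId
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
print(rows)  # bolt: 12 + 50 = 62, nut: 11
```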
Snowflake Schema
Dimension Table: store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

Dimension Table: sType
tId  size   location
t1   small  downtown
t2   large  suburbs

Dimension Table: city
cityId  pop  regId
sfo     1M   north
la      5M   south
The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval matters more than efficiency of data manipulation. The tables in these schemas are therefore not heavily normalized, and are frequently designed at a level of normalization short of third normal form.
Continuous Monitoring
- Identify and correct the cause of defects
- Refine data capture mechanisms at the source
- Educate users on the importance of data quality (DQ)
2009 Wipro Ltd - Confidential
ETL Architecture
[Figure: ETL architecture for a web site — visitors' web browsers reach the site over the Internet; web server logs and e-commerce transaction data land in flat files (data collection); scheduled extraction moves the data into a staging area (data extraction and transformation); scheduled loading moves it into the RDBMS (data loading)]
ETL Architecture
Data Extraction:
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data
- Transports the data onto another file or database

Data Transformation:
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data

Data Loading:
- Initial and incremental loading
- Updating of metadata
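The extract, transform, and load steps above can be sketched in miniature. The example below is illustrative only (the source data, the business rule, and all names are invented, and a list stands in for the warehouse table): it extracts rows from CSV text, transforms them (type conversion, a derived value, a time attribute), and loads the valid rows while rejecting the rest.

```python
import csv
import datetime
import io

# Invented source data; the pX row violates the (invented) business rule.
SOURCE = "prodId,qty,price\np1,2,10\np2,3,5\npX,-1,5\n"

def extract(text):
    """Extract: read rows out of a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows, load_date):
    """Transform: convert types, derive amt, stamp a time attribute;
    reject rows that fail the business rule (qty must be positive)."""
    loaded, rejected = [], []
    for r in rows:
        qty, price = int(r["qty"]), int(r["price"])
        if qty <= 0:
            rejected.append(r)
            continue
        loaded.append({"prodId": r["prodId"],
                       "qty": qty,
                       "amt": qty * price,          # derived value
                       "load_date": load_date})     # time attribute
    return loaded, rejected

# Load: here the target is just a list standing in for the warehouse table.
target, rejects = transform(extract(SOURCE), datetime.date(1997, 7, 1))
```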
Why ETL ?
Companies have valuable data spread throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats.
To solve the problem, companies use extract, transform and load (ETL) software.
The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.
ETL Tools
- Provide a GUI for specifying a large number of transformation rules
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second Generation
- PowerCenter/PowerMart from Informatica
- Data Mart Solution from Sagent Technology
- DataStage from Ascential
Metadata Management
What Is Metadata?
Metadata is Information...
- That describes the WHAT, WHEN, WHO, WHERE, and HOW of the data warehouse
- About the data being captured and loaded into the warehouse
- Documented in IT tools that improve both business and technical understanding of data and data-related processes
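A concrete way to picture such a metadata record: the sketch below captures the WHAT/WHEN/WHO/WHERE/HOW of one warehouse table as a plain dictionary. Every field name and value here is invented for illustration; real metadata repositories use their own schemas.

```python
# Illustrative metadata record for a single warehouse table.
# All names and values are hypothetical examples, not a standard.
sale_metadata = {
    "what":  "Daily sales fact table, one row per order line",
    "when":  "Loaded nightly by the scheduled ETL batch",
    "who":   "Owned by the sales data steward",
    "where": "Warehouse schema DW.SALE",
    "how":   "Extracted from the order-entry OLTP system; "
             "amounts derived as qty * price",
}

# Both business and technical users can read the same record.
for key in ("what", "when", "who", "where", "how"):
    print(f"{key:5s}: {sale_metadata[key]}")
```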
Importance Of Metadata
Locating Information
- How much time is spent looking for information?
- How often is the information actually found?
- What poor decisions were made based on incomplete information?
Consumers of Metadata
Technical Users
- Warehouse administrator
- Application developer

Business Users - business metadata
- Meanings
- Definitions
- Business rules

Software Tools
- Used in DW life-cycle development
- Metadata requirements for each tool must be identified
- The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool
Reischmann-Informatik-Toolbus
Features include selective bridging of metadata between tools.
OLAP
Agenda
OLAP Definition Distinction between OLTP and OLAP
MDDB Concepts
Implementation Techniques Architectures
Features
Representative Tools
8/30/2012
OLAP System
Purpose of data: decision support. OLAP data is consolidated, drawn from the various OLTP databases. It provides multi-dimensional views of various kinds of business activities, and periodic long-running batch jobs refresh the data.
MDDB Concepts
A multidimensional database is a software system designed for efficient and convenient storage and retrieval of closely related data that is stored, viewed, and analyzed from different perspectives (dimensions). A hypercube represents a collection of multidimensional data. The edges of the cube are called dimensions; individual items within each dimension are called members.
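The dimension/member/cell vocabulary can be made concrete with a toy hypercube: each dimension is a list of members, and each cell is addressed by one member from each dimension. The dimension names and members below echo the deck's sales example; the mechanics (a dict keyed by member tuples) are purely illustrative.

```python
from itertools import product as cartesian

# Dimensions are the edges of the cube; the listed items are members.
dimensions = {
    "MODEL": ["Mini Van", "Coupe", "Sedan"],
    "COLOR": ["Blue", "Red", "White"],
    "DEALERSHIP": ["Clyde", "Carr", "Gleason"],
}

# One cell per combination of members: 3 x 3 x 3 = 27 cells.
cube = {cell: 0 for cell in cartesian(*dimensions.values())}

# A cell is addressed by one member from each dimension.
cube[("Sedan", "Blue", "Clyde")] = 4
```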
MDDB
[Figure: Sales Volumes cube with dimensions MODEL (Mini Van, Sedan, ...), COLOR, and DEALERSHIP. 3 x 3 x 3 = 27 cells; 27 x 4 = 108 cells]
Sparsity
- Input data in applications is typically sparse
- Sparsity increases with the number of dimensions

Data Explosion
- Due to sparsity
- Due to summarization

Performance
- Doesn't perform better than an RDBMS at high data volumes (>20-30 GB)
[Figure: a relational employee table shown alongside the Sales Volumes cube (MODEL x COLOR), contrasting the two storage views]

LAST NAME  EMP#  AGE
SMITH      01    21
REGAN      12    19
FOX        31    63
WELD       14    31
KELLY      54    27
LINK       03    56
KRANZ      41    45
LUCUS      33    41
WEISS      23    19
OLAP Features
- Calculations applied across dimensions, through hierarchies, and/or across members
- Trend analysis over sequential time periods; what-if scenarios
- Slicing/dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through/drill-through to underlying detail data
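Two of the features listed, slicing and rotation, can be sketched on a small two-dimensional cube. The cell values below are invented for illustration; a slice fixes one dimension to a single member, and a 90-degree rotation simply reorders the axes.

```python
# A small MODEL x COLOR cube with invented sales volumes.
cube = {
    ("Mini Van", "Blue"): 6, ("Mini Van", "Red"): 5, ("Mini Van", "White"): 4,
    ("Coupe",    "Blue"): 3, ("Coupe",    "Red"): 5, ("Coupe",    "White"): 5,
    ("Sedan",    "Blue"): 4, ("Sedan",    "Red"): 3, ("Sedan",    "White"): 2,
}

# Slice: keep only cells where MODEL == "Sedan".
sedan_slice = {color: v for (model, color), v in cube.items()
               if model == "Sedan"}

# Rotate 90 degrees: swap the MODEL and COLOR axes; the cell
# values are unchanged, only the viewing orientation differs.
rotated = {(color, model): v for (model, color), v in cube.items()}
```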
[Figure: rotating the Sales Volumes cube 90 degrees swaps the axes — View #1 shows MODEL (Mini Van, Coupe, Sedan) against COLOR (Blue, Red, White); View #2 shows COLOR against MODEL with the same cell values]
[Figure: six views (View #1-#6) of the cube obtained by successive 90-degree rotations across the MODEL, COLOR, and DEALERSHIP axes]
[Figure: a slice of the Sales Volumes cube for MODEL = Coupe, shown across DEALERSHIP (Carr, Clyde) and COLOR (Normal Blue, Metal Blue)]
ORGANIZATION DIMENSION
REGION: Midwest
DISTRICT: Chicago, St. Louis, Gary
DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton

Moving up and down a hierarchy is referred to as drill-up (roll-up) and drill-down.
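Roll-up along such a hierarchy is just aggregation from child level to parent level. The sketch below uses the dealership/district/region names from the organization dimension above; the sales figures and the helper function are invented for illustration.

```python
from collections import defaultdict

# Organization hierarchy: dealership -> district -> region.
district_of = {"Clyde": "Chicago", "Gleason": "Chicago",
               "Carr": "St. Louis", "Levi": "St. Louis",
               "Lucas": "Gary", "Bolton": "Gary"}
region_of = {"Chicago": "Midwest", "St. Louis": "Midwest", "Gary": "Midwest"}

# Invented dealership-level sales figures.
dealership_sales = {"Clyde": 10, "Gleason": 7, "Carr": 5,
                    "Levi": 8, "Lucas": 4, "Bolton": 6}

def roll_up(values, parent_of):
    """Drill-up: sum child-level values into their parent level."""
    out = defaultdict(int)
    for child, v in values.items():
        out[parent_of[child]] += v
    return dict(out)

district_sales = roll_up(dealership_sales, district_of)  # drill-up one level
region_sales = roll_up(district_sales, region_of)        # and one more
```

Drill-down is the reverse direction: starting from the region total and expanding back to districts and dealerships.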
[Figure: MOLAP architecture — a Web Browser and OLAP Tools connect to an OLAP Calculation Engine backed by a Cube]
MOLAP - Features
- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for maximum query performance and optimum space utilization
[Figure: ROLAP architecture — Web Browser, OLAP Tools, and OLAP Applications connect to an OLAP Calculation Engine that issues SQL against a relational DW]
[Figure: HOLAP architecture — Web Browser, OLAP Tools, and OLAP Applications connect to an OLAP Calculation Engine that issues SQL against a relational DW]
HOLAP - Features
- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of the RDBMS plus the performance of the MDDB
- Calculation engine provides full analysis features
- Source of data is transparent to the end user
Architecture Comparison
Definition
- MOLAP: MDDB OLAP = transaction-level data + summary in the MDDB
- ROLAP: Relational OLAP = transaction-level data + summary in the RDBMS
- HOLAP: Hybrid OLAP = ROLAP + summary in the MDDB

Data explosion due to sparsity
- MOLAP: High (may go beyond control; estimation is very important)
- ROLAP: No sparsity
- HOLAP: Sparsity exists only in the MDDB part

Data explosion due to summarization
- MOLAP: 3-10 times, even with good design
- ROLAP: To the necessary extent
- HOLAP: To the necessary extent

Query execution speed
- MOLAP: Fast (depends upon the size of the MDDB)
- ROLAP: Slow
- HOLAP: Optimum; if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP

Cost
- HOLAP: High (RDBMS + disk space + MDDB server cost)

Where to apply?
- MOLAP: Small transactional data + complex model that needs to be viewed/sorted
- ROLAP: Very large transactional data + frequent summary analysis
- HOLAP: Large transactional data + frequent summary analysis
- Oracle Express products
- Hyperion Essbase
- Cognos PowerPlay
- Seagate Holos
- SAS
- MicroStrategy DSS Agent
- Informix MetaCube
- Brio Query
- Business Objects / WebIntelligence
- Sales Analysis
- Financial Analysis
- Profitability Analysis
- Performance Analysis
- Risk Management
- Profiling & Segmentation
- Scorecard Application
- NPA Management
- Strategic Planning
- Customer Relationship Management (CRM)
The methodology required for testing a Data Warehouse is different from testing a typical transaction system
In a data warehouse, most of the testing is system-triggered. Most production/source-system testing involves processing individual transactions driven by user input (application forms, servicing requests, etc.). Very few test cycles cover system-triggered scenarios (such as billing or valuation).
Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
- Are the requirements complete?
- Are the requirements singular?
- Are the requirements unambiguous?
- Are the requirements developable?
- Are the requirements testable?
Unit Testing
Unit testing for data warehouses is white-box. It should check the ETL procedures/mappings/jobs and the reports developed.
Unit testing the ETL procedures:
- Whether the ETLs are accessing and picking up the right data from the right source
- All data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data
- Testing the rejected records that don't fulfil the transformation rules
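A unit test for a transformation rule can be written as plain assertions on both the accept path and the reject path. The rule, function name, and data below are all invented to illustrate the shape of such a test, not taken from any specific tool.

```python
# Hypothetical transformation under test: derive amt = qty * price,
# rejecting rows with negative quantities or prices (invented rule).
def transform_row(row):
    qty, price = row["qty"], row["price"]
    if qty < 0 or price < 0:
        return None                       # rejected record
    return {**row, "amt": qty * price}    # transformed record

def test_transform_row():
    # Accept path: the business rule is applied correctly.
    assert transform_row({"qty": 2, "price": 10})["amt"] == 20
    # Reject path: records violating the rule are rejected.
    assert transform_row({"qty": -1, "price": 10}) is None
    assert transform_row({"qty": 1, "price": -5}) is None

test_transform_row()
```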
Unit Testing
Unit testing the report data:
- Verify report data against the source: data in a data warehouse is stored at an aggregate level compared with the source systems; the QA team should verify the granular data stored in the warehouse against the available source data
- Field-level data verification: the QA team must understand the linkages for the fields displayed in the report, trace them back, and compare them with the source systems
- Derivation formulae/calculation rules should be verified
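The aggregate-versus-source check above reduces to re-deriving the aggregates from granular source rows and comparing. The sketch below uses invented data on both sides; in practice the two inputs would come from the source system and the warehouse respectively.

```python
from collections import defaultdict

# Granular source transactions (prodId, amount) — invented data.
source_rows = [("p1", 12), ("p1", 50), ("p2", 11)]

# Aggregates as stored in the warehouse / shown on the report — invented.
warehouse_totals = {"p1": 62, "p2": 11}

# Re-derive the expected aggregates from the granular source rows.
expected = defaultdict(int)
for prod, amt in source_rows:
    expected[prod] += amt

# Any product whose warehouse total disagrees with the source is a defect.
mismatches = {p: (warehouse_totals.get(p), total)
              for p, total in expected.items()
              if warehouse_totals.get(p) != total}

assert not mismatches, f"warehouse disagrees with source: {mismatches}"
```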
Integration Testing
Integration testing will involve the following:
- Sequence of ETL jobs in a batch
- Initial loading of records into the data warehouse
- Incremental loading of records at a later date, to verify the newly inserted or updated data
- Testing the rejected records that don't fulfil the transformation rules
- Error log generation
Performance Testing
Performance testing should check that ETL processes complete within the allotted time window.
Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client in terms of ETL process integrity, business functionality, and reporting.
Questions
Thank You
77
Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing
79
Content [contd]
6 Metadata Management 7 OLAP 8 Data Warehouse Testing
80
An Overview
Understanding What is a Data Warehouse
81
82
83
84
Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing
86
Content [contd]
6 Metadata Management 7 OLAP 8 Data Warehouse Testing
87
An Overview
Understanding What is a Data Warehouse
88
89
90
91
Content
1 An Overview of Data Warehouse 2 Data Warehouse Architecture 3 Data Modeling for Data Warehouse 4 Overview of Data Cleansing
93
Content [contd]
6 Metadata Management 7 OLAP 8 Data Warehouse Testing
94
An Overview
Understanding What is a Data Warehouse
95
96
97
98
Components of Warehouse
Source Tables: These are real-time, volatile data in relational databases for transaction processing (OLTP). These can be any relational databases or flat files. ETL Tools: To extract, cleansing, transform (aggregates, joins) and load the data from sources to target. Maintenance and Administration Tools: To authorize and monitor access to the data, set-up users. Scheduling jobs to run on offshore periods. Modeling Tools: Used for data warehouse design for high-performance using dimensional data modeling technique, mapping the source and target files. Databases: Target databases and data marts, which are part of data warehouse. These are structured for analysis and reporting purposes. End-user tools for analysis and reporting: get the reports and analyze the data from target tables. Different types of Querying, Data Mining, OLAP tools are used for this purpose.
99
This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.
100
Data Modeling
Effective way of using a Data Warehouse
101
Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model
Entity: Object that can be observed and classified based on its properties and characteristics. Like employee, book, student Relationship: relating entities to other entities.
Star Schema
Dimension Table
product prodId p1 p2 name price bolt 10 nut 5
Dimension Table
store storeId c1 c2 c3 city nyc sfo la
Fact Table
sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97 custId 53 53 111 prodId p1 p2 p1 storeId c1 c1 c3 qty 1 2 5 amt 12 11 50
Dimension Table
customer custId 53 81 111 name joe fred sally address 10 main 12 main 80 willow city sfo sfo la
104
Snowflake Schema
Dimension Table Fact Table
store storeId s5 s7 s9 cityId sfo sfo la tId t1 t2 t1 mgr joe fred nancy
sType tId t1 t2 city size small large location downtown suburbs regId north south
Dimension Table
cityId pop sfo 1M la 5M
The star and snowflake schema are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schema are not normalized much, and are frequently designed at a level of normalization short of third normal form.
105
106
107
Continuous Monitoring
Identify & Correct Cause of Defects Refine data capture mechanisms at source Educate users on importance of DQ
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
108
109
111
112
ETL Architecture
Visitors
Web Browsers
The Internet
Staging Area
Web Server Logs & E-comm Transaction Data Flat Files
Scheduled Extraction
RDBMS
Scheduled Loading
Data Collection
Data Extraction
Data Transformation
Data Loading
113
ETL Architecture
Data Extraction:
Rummages through a file or database Uses some criteria for selection Identifies qualified data and Transports the data over onto another file or database
Data transformation
Integrating dissimilar data types Changing codes Adding a time attribute Summarizing data Calculating derived values Renormalizing data
Data loading
Initial and incremental loading Updation of metadata
114
Why ETL ?
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems,and therefore in all sorts of formats.
To solve the problem, companies use extract, transform and load (ETL) software.
The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, and an Excel spreadsheet.
115
116
117
ETL Tools
Provides facility to specify a large number of transformation rules with a GUI Generate programs to transform data Handle multiple data sources Handle data redundancy Generate metadata as output Most tools exploit parallelism by running on multiple low-cost servers in multi-threaded environment ETL Tools - Second-Generation PowerCentre/Mart from Informatica Data Mart Solution from Sagent Technology DataStage from Ascential
118
Metadata Management
119
What Is Metadata?
Metadata is Information...
That describes the WHAT, WHEN, WHO, WHERE, HOW of the data warehouse About the data being captured and loaded into the Warehouse Documented in IT tools that improves both business and technical understanding of data and data-related processes
120
Importance Of Metadata
Locating Information Time spent in looking for information. How often information is found? What poor decisions were made based on the incomplete information?
121
Consumers of Metadata
Technical Users Warehouse administrator Application developer Business Users -Business metadata Meanings Definitions Business Rules Software Tools Used in DW life-cycle development Metadata requirements for each tool must be identified The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool
123
Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata
125
OLAP
127
Agenda
OLAP Definition Distinction between OLTP and OLAP
MDDB Concepts
Implementation Techniques Architectures
Features
Representative Tools
8/30/2012
128
128
129
OLAP System Consolidation data; OLAP data comes from the various OLTP databases Decision support
Purpose of data
Multi-dimensional views of various kinds of business activities Periodic long-running batch jobs refresh the 130 data
130
MDDB Concepts
A multidimensional database is a computer software system designed to allow for efficient and convenient storage and retrieval of data that is intimately related and stored, viewed and analyzed from different perspectives (Dimensions). A hypercube represents a collection of multidimensional data. The edges of the cube are called dimensions Individual items within each dimensions are called members
131
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
MDDB
Sales Volumes
M O D E L
Mini Van
Sedan
DEALERSHIP
COLOR
27 x 4 = 108 cells
132
3 x 3 x 3 = 27 cells
133
Sparsity
- Input data in applications are typically sparse -Increases with increased dimensions
Data Explosion
-Due to Sparsity -Due to Summarization
Performance
-Doesnt perform better than RDBMS at high data volumes (>20-30 GB)
8/30/2012
134
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
134
LAST NAME EMP# AGE SMITH 01 21 REGAN 12 Sales Volumes 19 FOX 31 63 Miini Van WELD 14 6 5 31 4 M O KELLY 54 3 5 27 Coupe D 5 E L LINK 03 56 4 3 2 Sedan KRANZ 41 45 Blue Red White LUCUS 33 COLOR 41 WEISS 23 19
Smith
Regan
Fox
L A S T N A M E
Weld
Kelly
Link
Kranz
Lucas
Weiss
EMPLOYEE #
8/30/2012
135
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
135
OLAP Features
Calculations applied across dimensions, through hierarchies and/or across members Trend analysis over sequential time periods, What-if scenarios. Slicing / Dicing subsets for on-screen viewing Rotation to new dimensional comparisons in the viewing area Drill-down/up along the hierarchy Reach-through / Drill-through to underlying detail data
8/30/2012
136
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
136
M O D E L
Mini Van
6 3 4
Blue
5 5 3
Red
4 5 2
White
Coupe
C O L O R ( ROTATE 90 )
o
Blue
6 5 4
3 5 5
MODEL
4 3 2
Sedan
Red
Sedan
White
COLOR
View #1
View #2
137
M O D E L
Sedan
C O L O R
Blue
C O L O R
Blue
COLOR
( ROTATE 90 )
MODEL
( ROTATE 90 )
DEALERSHIP
( ROTATE 90 )
DEALERSHIP
DEALERSHIP
MODEL
View #1
D E A L E R S H I P D E A L E R S H I P
View #2
View #3
Mini Van
Clyde
M O D E L
Sedan
COLOR
( ROTATE 90 )
MODEL
( ROTATE 90 )
DEALERSHIP
MODEL
COLOR
COLOR
View #4
View #5
View #6
138
Sales Volumes
M O D E L
Coupe
Carr Clyde
Carr Clyde
Normal Blue
Metal Blue
DEALERSHIP
COLOR
8/30/2012
139
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
139
ORGANIZATION DIMENSION
REGION Midwest
DISTRICT
Chicago
St. Louis
Gary
DEALERSHIP
Clyde
Gleason
Carr
Levi
Lucas
Bolton
Moving Up and moving down in a hierarchy is referred to as drill-up / roll-up and drill-down
8/30/2012
140
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
140
8/30/2012
141
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
141
1st Qtr
4th Qtr
142
143
8/30/2012
144
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
144
OLAP
Cube
OLAP Calculation Engine
Web Browser
OLAP Tools
145
MOLAP - Features
Powerful analytical capabilities (e.g., financial, forecasting, statistical) Aggregation and calculation capabilities Read/write analytic applications Specialized data structures for
Maximum query performance. Optimum space utilization.
8/30/2012
146
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
146
Relational DW
Web Browser
OLAP Calculation Engine
SQL
OLAP Tools
OLAP Applications
8/30/2012
147
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
147
8/30/2012
148
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
148
Relational DW
Web Browser
OLAP Calculation Engine
SQL
OLAP Tools
OLAP Applications
8/30/2012
149
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
149
HOLAP - Features
RDBMS used for detailed data stored in large databases MDDB used for fast, read/write OLAP analysis and calculations Scalability of RDBMS and MDDB performance Calculation engine provides full analysis features Source of data transparent to end user
8/30/2012
150
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
150
Architecture Comparison
MOLAP
Definition
ROLAP
HOLAP
Hybrid OLAP = ROLAP + summary in MDDB Sparsity exists only in MDDB part To the necessary extent
MDDB OLAP = Relational OLAP = Transaction level data + Transaction level data + summary in MDDB summary in RDBMS Good Design 3 10 times High (May go beyond control. Estimation is very important) Fast - (Depends upon the size of the MDDB) No Sparsity To the necessary extent
Data explosion due to Sparsity Data explosion due to Summarization Query Execution Speed
Slow
Optimum - If the data is fetched from RDBMS then its like ROLAP otherwise like MOLAP. High: RDBMS + disk space + MDDB Server cost Large transactional data + frequent summary analysis
Cost
Where to apply?
Small transactional Very large transactional data + complex model + data & it needs to be frequent summary viewed / sorted analysis
8/30/2012
151
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
151
Oracle Express Products Hyperion Essbase Cognos -PowerPlay Seagate - Holos SAS
Micro Strategy - DSS Agent Informix MetaCube Brio Query Business Objects / Web Intelligence
8/30/2012
152
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
152
Sales Analysis Financial Analysis Profitability Analysis Performance Analysis Risk Management Profiling & Segmentation Scorecard Application NPA Management Strategic Planning Customer Relationship Management (CRM)
8/30/2012
153
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
153
154
The methodology required for testing a Data Warehouse is different from testing a typical transaction system
155
156
In data Warehouse, most of the testing is system triggered. Most of the production/Source system testing is the processing of individual transactions, which are driven by some input from the users (Application Form, Servicing Request.). There are very few test cycles, which cover the system-triggered scenarios (Like billing, Valuation.)
157
158
159
160
Requirements testing
The main aim for doing Requirements testing is to check stated requirements for completeness. Requirements can be tested on following factors. Are the requirements Complete? Are the requirements Singular? Are the requirements Ambiguous? Are the requirements Developable? Are the requirements Testable?
161
Unit Testing
Unit testing for data warehouses is WHITEBOX. It should check the ETL procedures/mappings/jobs and the reports developed. Unit testing the ETL procedures: Whether ETLs are accessing and picking up right data from right source.
All the data transformations are correct according to the business rules and data warehouse is correctly populated with the transformed data.
Testing the rejected records that dont fulfil transformation rules.
162
Unit Testing
Unit Testing the Report data:
Verify Report data with source: Data present in a data warehouse will be stored at an aggregate level compare to source systems. QA team should verify the granular data stored in data warehouse against the source data available Field level data verification: QA team must understand the linkages for the fields displayed in the report and should trace back and compare that with the source systems Derivation formulae/calculation rules should be verified
163
Integration Testing
Integration testing will involve following:
Sequence of ETLs jobs in batch. Initial loading of records on data warehouse. Incremental loading of records at a later date to verify the newly inserted or updated data. Testing the rejected records that dont fulfil transformation rules. Error log generation
164
Performance Testing
Performance Testing should check for : ETL processes completing within time window.
165
Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.
166
Questions
167
Thank You
168
Components of Warehouse
Source Tables: These are real-time, volatile data in relational databases for transaction processing (OLTP). These can be any relational databases or flat files. ETL Tools: To extract, cleansing, transform (aggregates, joins) and load the data from sources to target. Maintenance and Administration Tools: To authorize and monitor access to the data, set-up users. Scheduling jobs to run on offshore periods. Modeling Tools: Used for data warehouse design for high-performance using dimensional data modeling technique, mapping the source and target files. Databases: Target databases and data marts, which are part of data warehouse. These are structured for analysis and reporting purposes. End-user tools for analysis and reporting: get the reports and analyze the data from target tables. Different types of Querying, Data Mining, OLAP tools are used for this purpose.
169
This has a staging area, where the data after cleansing, transforming is loaded and tested here. Later is directly loaded to the target database/warehouse. Which is divided to data marts and can be accessed by different users for their reporting and analyzing purposes.
170
Data Modeling
Effective way of using a Data Warehouse
171
Data Modeling
Commonly E-R Data Model is used in OLTP, In OLAP Dimensional Data Model is used commonly. E-R (Entity-Relationship) Data Model
Entity: Object that can be observed and classified based on its properties and characteristics. Like employee, book, student Relationship: relating entities to other entities.
Star Schema
Dimension Table
product prodId p1 p2 name price bolt 10 nut 5
Dimension Table
store storeId c1 c2 c3 city nyc sfo la
Fact Table
sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97 custId 53 53 111 prodId p1 p2 p1 storeId c1 c1 c3 qty 1 2 5 amt 12 11 50
Dimension Table
customer custId 53 81 111 name joe fred sally address 10 main 12 main 80 willow city sfo sfo la
174
Snowflake Schema
Dimension Table Fact Table
store storeId s5 s7 s9 cityId sfo sfo la tId t1 t2 t1 mgr joe fred nancy
sType tId t1 t2 city size small large location downtown suburbs regId north south
Dimension Table
cityId pop sfo 1M la 5M
The star and snowflake schema are most commonly found in dimensional data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulations. As such, the tables in these schema are not normalized much, and are frequently designed at a level of normalization short of third normal form.
175
176
177
Continuous Monitoring
Identify & Correct Cause of Defects Refine data capture mechanisms at source Educate users on importance of DQ
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
178
179
181
182
ETL Architecture
Visitors
Web Browsers
The Internet
Staging Area
Web Server Logs & E-comm Transaction Data Flat Files
Scheduled Extraction
RDBMS
Scheduled Loading
Data Collection
Data Extraction
Data Transformation
Data Loading
183
ETL Architecture
Data Extraction:
Rummages through a file or database Uses some criteria for selection Identifies qualified data and Transports the data over onto another file or database
Data transformation
Integrating dissimilar data types Changing codes Adding a time attribute Summarizing data Calculating derived values Renormalizing data
Data loading
Initial and incremental loading Updation of metadata
184
Why ETL ?
Companies have valuable data lying around throughout their networks that needs to be moved from one place to another. The data lies in all sorts of heterogeneous systems,and therefore in all sorts of formats.
To solve the problem, companies use extract, transform and load (ETL) software.
The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.
ETL Tools
- Provide a GUI for specifying large numbers of transformation rules
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment

ETL Tools - Second Generation
- PowerCenter/PowerMart from Informatica
- Data Mart Solution from Sagent Technology
- DataStage from Ascential
Metadata Management
What Is Metadata?
Metadata is Information...
- That describes the WHAT, WHEN, WHO, WHERE, and HOW of the data warehouse
- About the data being captured and loaded into the warehouse
- Documented in IT tools that improve both business and technical understanding of data and data-related processes
Importance Of Metadata
Locating information
- How much time is spent looking for information?
- How often is the information actually found?
- What poor decisions were made based on incomplete information?
Consumers of Metadata
Technical users
- Warehouse administrator
- Application developer

Business users - business metadata
- Meanings
- Definitions
- Business rules

Software tools
- Used in DW life-cycle development
- Metadata requirements for each tool must be identified
- Tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool
Reischmann-Informatik-Toolbus
Features include facilitation of selective bridging of metadata
OLAP
Agenda
- OLAP definition; distinction between OLTP and OLAP
- MDDB concepts
- Implementation techniques / architectures
- Features
- Representative tools
8/30/2012
OLAP System
- Consolidated data; OLAP data comes from the various OLTP databases
- Purpose of data: decision support
- Multi-dimensional views of various kinds of business activities
- Periodic, long-running batch jobs refresh the data
MDDB Concepts
A multidimensional database (MDDB) is a software system designed for efficient, convenient storage and retrieval of data that is closely related and is stored, viewed, and analyzed from different perspectives (dimensions). A hypercube represents a collection of multidimensional data:
- The edges of the cube are called dimensions
- The individual items within each dimension are called members
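A toy hypercube can be sketched as a mapping from member tuples to cell values. The dimension and member names below echo the figures that follow and are purely illustrative:

```python
# A toy hypercube: each cell is addressed by one member from each of
# three dimensions. Dimension and member names are illustrative.
dimensions = {
    "model":      ["mini van", "coupe", "sedan"],
    "color":      ["blue", "red", "white"],
    "dealership": ["clyde", "gleason", "carr"],
}

# Only populated cells are stored, keyed by a (model, color, dealership) tuple.
cells = {
    ("mini van", "blue", "clyde"): 6,
    ("sedan", "red", "carr"): 5,
}

def get(model, color, dealership):
    """Read one cell of the cube; empty cells read as 0."""
    return cells.get((model, color, dealership), 0)
```

With three members per dimension the cube addresses 3 x 3 x 3 = 27 cells, even though only two hold data here.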
MDDB
[Figure: a Sales Volumes cube with dimensions MODEL (mini van, sedan, ...), COLOR, and DEALERSHIP. A 3 x 3 x 3 cube holds 27 cells; adding a fourth dimension with four members gives 27 x 4 = 108 cells.]
Sparsity
- Input data in applications is typically sparse
- Sparsity increases with the number of dimensions

Data Explosion
- Due to sparsity
- Due to summarization

Performance
- Does not perform better than an RDBMS at high data volumes (> 20-30 GB)
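The link between dimensions and sparsity can be shown with back-of-the-envelope arithmetic; the member and cell counts below are invented for illustration:

```python
# Back-of-the-envelope sparsity arithmetic: the same number of populated
# cells fills an ever smaller share of the cube as dimensions are added.
members = 10      # members per dimension (assumed)
populated = 500   # cells that actually hold input data (assumed)

densities = {n_dims: populated / members ** n_dims for n_dims in (3, 4, 5)}
# 3 dims -> 1,000 addressable cells, 4 dims -> 10,000, 5 dims -> 100,000,
# so the occupied fraction drops from 0.5 to 0.05 to 0.005.
```

Pre-computing summaries over such a cube multiplies the stored cells again, which is the "data explosion due to summarization" named above.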
[Figure: a relational table contrasted with the multidimensional view. The table lists employees by LAST NAME, EMP#, and AGE (SMITH 01 21; REGAN 12 19; FOX 31 63; WELD 14 31; KELLY 54 27; LINK 03 56; KRANZ 41 45; LUCUS 33 41; WEISS 23 19), plotted as LAST NAME against EMPLOYEE #; alongside it, the Sales Volumes cube shows MODEL against COLOR.]
OLAP Features
- Calculations applied across dimensions, through hierarchies, and/or across members
- Trend analysis over sequential time periods; what-if scenarios
- Slicing/dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through/drill-through to underlying detail data
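Slicing, rotation, and a simple cross-dimension aggregate can be sketched over a small MODEL x COLOR grid (the values are taken from the rotation figure in this deck and are purely illustrative):

```python
# Slicing and rotation on a small model x color grid (values illustrative).
models = ["mini van", "coupe", "sedan"]
colors = ["blue", "red", "white"]
sales = [[6, 3, 4],   # mini van: blue, red, white
         [5, 5, 3],   # coupe
         [4, 5, 2]]   # sedan

# Slice: fix one member of a dimension, e.g. all sales of blue cars.
blue = [row[colors.index("blue")] for row in sales]

# Rotate (pivot) 90 degrees: colors become rows, models become columns.
rotated = [list(col) for col in zip(*sales)]

# Cross-dimension calculation: total per model (collapse the color dimension).
per_model = [sum(row) for row in sales]
```

The rotated grid is simply the transpose of the original, which is why OLAP tools can offer rotation as a cheap, purely presentational operation.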
[Figure: rotation. View #1 shows Sales Volumes with MODEL rows (mini van 6/3/4, coupe 5/5/3, sedan 4/5/2) against COLOR columns (blue, red, white); rotating 90 degrees yields View #2, with COLOR rows against MODEL columns.]
[Figure: successive 90-degree rotations of the MODEL x COLOR x DEALERSHIP cube produce six possible views (#1 to #6), each pairing a different two of the three dimensions in the viewing area.]
[Figure: drilling into Sales Volumes for coupes at the Carr and Clyde dealerships, the COLOR dimension is split into finer members: Normal Blue and Metal Blue.]
ORGANIZATION DIMENSION (hierarchy)

  REGION:     Midwest
  DISTRICT:   Chicago, St. Louis, Gary
  DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton

Moving up and moving down in a hierarchy is referred to as drill-up (roll-up) and drill-down.
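Roll-up along such a hierarchy amounts to a grouped aggregation. In the sketch below, the dealership-to-district assignments and the sales figures are assumptions made for illustration; only the member names come from the figure:

```python
# Roll-up along the organization hierarchy: dealership -> district -> region.
hierarchy = {               # dealership: (district, region) -- assumed mapping
    "clyde":   ("chicago",   "midwest"),
    "gleason": ("chicago",   "midwest"),
    "carr":    ("chicago",   "midwest"),
    "levi":    ("st. louis", "midwest"),
    "lucas":   ("st. louis", "midwest"),
    "bolton":  ("gary",      "midwest"),
}
sales = {"clyde": 10, "gleason": 7, "carr": 5, "levi": 8, "lucas": 2, "bolton": 4}

def roll_up(level):
    """Aggregate dealership-level sales up to 'district' or 'region'."""
    idx = {"district": 0, "region": 1}[level]
    out = {}
    for dealer, amt in sales.items():
        key = hierarchy[dealer][idx]
        out[key] = out.get(key, 0) + amt
    return out
```

Drill-down is the inverse navigation: starting from a district total and expanding it back into its member dealerships.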
[Diagram: MOLAP architecture. A web browser and OLAP tools access an OLAP calculation engine, which reads a pre-built OLAP cube.]
MOLAP - Features
- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for maximum query performance and optimum space utilization
[Diagram: ROLAP architecture. A web browser, OLAP tools, and OLAP applications access an OLAP calculation engine, which queries the relational DW via SQL.]
HOLAP - Features
- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of the RDBMS plus MDDB performance
- Calculation engine provides full analysis features
- Source of data is transparent to the end user
Architecture Comparison

Definition:
- MOLAP: MDDB OLAP = transaction-level data + summary in MDDB
- ROLAP: Relational OLAP = transaction-level data + summary in RDBMS
- HOLAP: Hybrid OLAP = ROLAP + summary in MDDB

Data explosion due to sparsity:
- MOLAP: high (may go beyond control; estimation is very important)
- ROLAP: no sparsity
- HOLAP: sparsity exists only in the MDDB part

Data explosion due to summarization:
- MOLAP: 3-10 times with good design
- ROLAP: to the necessary extent
- HOLAP: to the necessary extent

Query execution speed:
- MOLAP: fast (depends upon the size of the MDDB)
- ROLAP: slow
- HOLAP: optimum; like ROLAP when the data is fetched from the RDBMS, otherwise like MOLAP

Cost:
- HOLAP: high (RDBMS + disk space + MDDB server cost)

Where to apply?
- MOLAP: small transactional data + complex model, where data needs to be viewed/sorted
- ROLAP: very large transactional data + frequent summary analysis
- HOLAP: large transactional data + frequent summary analysis
Representative Tools
- Oracle - Express products
- Hyperion - Essbase
- Cognos - PowerPlay
- Seagate - Holos
- SAS
- MicroStrategy - DSS Agent
- Informix - MetaCube
- Brio - BrioQuery
- Business Objects / WebIntelligence
- Sales analysis
- Financial analysis
- Profitability analysis
- Performance analysis
- Risk management
- Profiling & segmentation
- Scorecard applications
- NPA management
- Strategic planning
- Customer relationship management (CRM)
The methodology required for testing a Data Warehouse is different from testing a typical transaction system
In a data warehouse, most of the testing is system-triggered. By contrast, most production/source-system testing covers the processing of individual transactions, driven by user input (application forms, servicing requests); very few test cycles cover system-triggered scenarios (such as billing or valuation).
Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
- Are the requirements complete?
- Are the requirements singular?
- Are the requirements unambiguous?
- Are the requirements developable?
- Are the requirements testable?
Unit Testing
Unit testing for data warehouses is white-box: it should check the ETL procedures/mappings/jobs and the reports developed.

Unit testing the ETL procedures:
- Whether the ETLs access and pick up the right data from the right source
- Whether all data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data
- Testing the rejected records that don't fulfil the transformation rules
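A sketch of such a unit test follows. The transformation rule (known product codes, positive quantities), the field names, and the records are all invented for illustration; a real test would exercise the actual ETL mapping:

```python
# Hypothetical transformation rule: rows must carry a known product code
# and a positive quantity, otherwise they land in the reject set.
VALID_CODES = {"p1", "p2"}

def transform(rows):
    loaded, rejected = [], []
    for row in rows:
        if row["prodId"] in VALID_CODES and row["qty"] > 0:
            loaded.append({**row, "amount": row["qty"] * row["price"]})  # derived value
        else:
            rejected.append(row)   # rejected records are kept for audit
    return loaded, rejected

# Unit test: the right data is picked up and transformed; records that
# don't fulfil the rule go to the reject set rather than the warehouse.
loaded, rejected = transform([
    {"prodId": "p1", "qty": 2, "price": 10},
    {"prodId": "p9", "qty": 1, "price": 5},   # unknown code -> rejected
])
assert loaded[0]["amount"] == 20
assert rejected[0]["prodId"] == "p9"
```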
Unit Testing
Unit testing the report data:
- Verify report data against the source: data in a data warehouse is stored at an aggregate level compared to the source systems, so the QA team should verify the granular data stored in the warehouse against the available source data
- Field-level data verification: the QA team must understand the linkages for the fields displayed in a report, and should trace them back and compare them with the source systems
- Derivation formulae/calculation rules should be verified
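The source-versus-report check above can be sketched by recomputing the warehouse aggregate from granular source rows. The figures reuse the sample sale table from the star schema section and are illustrative:

```python
# QA check: the warehouse stores an aggregate, so recompute it from the
# granular source rows and compare. Figures are illustrative sample data.
source_rows = [("p1", 12), ("p1", 50), ("p2", 11)]   # granular source (prodId, amt)
warehouse   = {"p1": 62, "p2": 11}                   # aggregated warehouse figures

recomputed = {}
for prod, amt in source_rows:
    recomputed[prod] = recomputed.get(prod, 0) + amt

# The report figure traces back to the source exactly.
assert recomputed == warehouse
```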
Integration Testing
Integration testing will involve the following:
- Sequence of ETL jobs in the batch
- Initial loading of records into the data warehouse
- Incremental loading of records at a later date, to verify newly inserted or updated data
- Testing the rejected records that don't fulfil the transformation rules
- Error-log generation
Performance Testing
Performance testing should check that ETL processes complete within the available time window.
Acceptance testing
Here the system is tested with full functionality and is expected to function as it would in production. At the end of UAT, the system should be acceptable to the client in terms of ETL process integrity, business functionality, and reporting.
Questions
Thank You
8/30/2012
374
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
374
Architecture Comparison
MOLAP
Definition
ROLAP
HOLAP
Hybrid OLAP = ROLAP + summary in MDDB Sparsity exists only in MDDB part To the necessary extent
MDDB OLAP = Relational OLAP = Transaction level data + Transaction level data + summary in MDDB summary in RDBMS Good Design 3 10 times High (May go beyond control. Estimation is very important) Fast - (Depends upon the size of the MDDB) No Sparsity To the necessary extent
Data explosion due to Sparsity Data explosion due to Summarization Query Execution Speed
Slow
Optimum - If the data is fetched from RDBMS then its like ROLAP otherwise like MOLAP. High: RDBMS + disk space + MDDB Server cost Large transactional data + frequent summary analysis
Cost
Where to apply?
Small transactional Very large transactional data + complex model + data & it needs to be frequent summary viewed / sorted analysis
8/30/2012
375
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
375
Oracle Express Products Hyperion Essbase Cognos -PowerPlay Seagate - Holos SAS
Micro Strategy - DSS Agent Informix MetaCube Brio Query Business Objects / Web Intelligence
8/30/2012
376
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
376
Sales Analysis Financial Analysis Profitability Analysis Performance Analysis Risk Management Profiling & Segmentation Scorecard Application NPA Management Strategic Planning Customer Relationship Management (CRM)
8/30/2012
377
2009 Wipro Ltd - Confidential 2009 Wipro Ltd - Confidential
377
378
The methodology required for testing a Data Warehouse is different from testing a typical transaction system
379
380
In data Warehouse, most of the testing is system triggered. Most of the production/Source system testing is the processing of individual transactions, which are driven by some input from the users (Application Form, Servicing Request.). There are very few test cycles, which cover the system-triggered scenarios (Like billing, Valuation.)
381
382
383
384
Requirements testing
The main aim for doing Requirements testing is to check stated requirements for completeness. Requirements can be tested on following factors. Are the requirements Complete? Are the requirements Singular? Are the requirements Ambiguous? Are the requirements Developable? Are the requirements Testable?
385
Unit Testing
Unit testing for data warehouses is WHITEBOX. It should check the ETL procedures/mappings/jobs and the reports developed. Unit testing the ETL procedures: Whether ETLs are accessing and picking up right data from right source.
All the data transformations are correct according to the business rules and data warehouse is correctly populated with the transformed data.
Testing the rejected records that dont fulfil transformation rules.
386
Unit Testing
Unit Testing the Report data:
Verify Report data with source: Data present in a data warehouse will be stored at an aggregate level compare to source systems. QA team should verify the granular data stored in data warehouse against the source data available Field level data verification: QA team must understand the linkages for the fields displayed in the report and should trace back and compare that with the source systems Derivation formulae/calculation rules should be verified
387
Integration Testing
Integration testing will involve following:
Sequence of ETLs jobs in batch. Initial loading of records on data warehouse. Incremental loading of records at a later date to verify the newly inserted or updated data. Testing the rejected records that dont fulfil transformation rules. Error log generation
388
Performance Testing
Performance Testing should check for : ETL processes completing within time window.
389
Acceptance testing
Here the system is tested with full functionality and is expected to function as in production. At the end of UAT, the system should be acceptable to the client for use in terms of ETL process integrity and business functionality and reporting.
390
Questions
391
Thank You
392
Components of Warehouse
Source tables: real-time, volatile data held in relational databases for transaction processing (OLTP). These can be any relational databases or flat files.
ETL tools: to extract, cleanse, transform (aggregate, join), and load the data from the sources to the target.
Maintenance and administration tools: to authorize and monitor access to the data, set up users, and schedule jobs to run during off-peak periods.
Modeling tools: used to design the data warehouse for high performance using dimensional data modeling, and to map the source and target files.
Databases: the target databases and data marts that make up the data warehouse. These are structured for analysis and reporting purposes.
End-user tools for analysis and reporting: retrieve reports and analyze the data from the target tables. Different types of querying, data mining, and OLAP tools are used for this purpose.
The architecture includes a staging area, where the data is loaded and tested after cleansing and transformation. It is then loaded into the target database/warehouse, which is divided into data marts that different users can access for their reporting and analysis purposes.
Data Modeling
Effective way of using a Data Warehouse
Data Modeling
The E-R data model is commonly used in OLTP systems, while the dimensional data model is commonly used in OLAP. E-R (Entity-Relationship) Data Model:
Entity: an object that can be observed and classified based on its properties and characteristics, such as an employee, a book, or a student. Relationship: an association relating entities to other entities.
Star Schema

Dimension table: product
prodId | name | price
p1     | bolt | 10
p2     | nut  | 5

Dimension table: store
storeId | city
c1      | nyc
c2      | sfo
c3      | la

Fact table: sale
orderId | date   | custId | prodId | storeId | qty | amt
o100    | 1/7/97 | 53     | p1     | c1      | 1   | 12
o102    | 2/7/97 | 53     | p2     | c1      | 2   | 11
o105    | 3/8/97 | 111    | p1     | c3      | 5   | 50

Dimension table: customer
custId | name  | address   | city
53     | joe   | 10 main   | sfo
81     | fred  | 12 main   | sfo
111    | sally | 80 willow | la
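As a minimal sketch of how this star schema is queried, the following builds the tables in an in-memory SQLite database and runs a typical star join: the fact table joined to two dimension tables, aggregating by dimension attributes.

```python
import sqlite3

# Build the star schema above in an in-memory SQLite database.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store   (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE sale    (orderId TEXT, date TEXT, custId INTEGER,
                      prodId TEXT, storeId TEXT, qty INTEGER, amt INTEGER);
INSERT INTO product VALUES ('p1','bolt',10), ('p2','nut',5);
INSERT INTO store   VALUES ('c1','nyc'), ('c2','sfo'), ('c3','la');
INSERT INTO sale    VALUES ('o100','1/7/97',53,'p1','c1',1,12),
                           ('o102','2/7/97',53,'p2','c1',2,11),
                           ('o105','3/8/97',111,'p1','c3',5,50);
""")

# Star join: fact table joined to the product and store dimensions,
# aggregating sales amount by product name and store city.
rows = con.execute("""
    SELECT p.name, s.city, SUM(f.amt) AS total_amt
    FROM sale f
    JOIN product p ON f.prodId  = p.prodId
    JOIN store   s ON f.storeId = s.storeId
    GROUP BY p.name, s.city
    ORDER BY p.name, s.city
""").fetchall()
print(rows)  # [('bolt', 'la', 50), ('bolt', 'nyc', 12), ('nut', 'nyc', 11)]
```

Note how the query only ever joins the central fact table to its dimensions; the dimensions never join to each other, which is what makes star queries simple and fast.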
Snowflake Schema
The fact table is the sale table shown above; the store dimension is normalized into further dimension tables:

store (dimension table)
storeId | cityId | tId | mgr
s5      | sfo    | t1  | joe
s7      | sfo    | t2  | fred
s9      | la     | t1  | nancy

sType (dimension table)
tId | size  | location
t1  | small | downtown
t2  | large | suburbs

city (dimension table)
cityId | pop | regId
sfo    | 1M  | north
la     | 5M  | south
The star and snowflake schemas are most commonly found in dimensional data warehouses and data marts, where speed of data retrieval is more important than efficiency of data manipulation. Accordingly, the tables in these schemas are not heavily normalized and are frequently designed at a level of normalization short of third normal form.
Continuous Monitoring
Identify and correct the causes of defects; refine data-capture mechanisms at the source; educate users on the importance of data quality.
2009 Wipro Ltd - Confidential
ETL Architecture
[Diagram: ETL architecture for a web/e-commerce source. Visitors' web browsers reach the site over the Internet; web-server logs and e-commerce transaction data land as flat files; scheduled extraction moves them into a staging area, and scheduled loading moves them into the RDBMS. Stages: data collection, data extraction, data transformation, data loading.]
ETL Architecture
Data extraction:
- Rummages through a file or database
- Uses some criteria for selection
- Identifies qualified data and transports it onto another file or database
Data transformation:
- Integrating dissimilar data types
- Changing codes
- Adding a time attribute
- Summarizing data
- Calculating derived values
- Renormalizing data
Data loading:
- Initial and incremental loading
- Updating of metadata
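The extract-transform-load steps above can be sketched as a minimal pipeline. This is an illustrative stdlib-only sketch with invented field names and figures, not a production ETL job: extraction selects qualified rows, transformation changes types, derives a value, and adds a time attribute, and loading performs an initial load into a target table.

```python
import csv, io, sqlite3
from datetime import date

# Hypothetical source file: order records as CSV.
SOURCE = """orderId,prodId,qty,amt
o100,p1,1,12
o101,p2,0,0
o102,p2,2,11
"""

def extract(text):
    # Extract: select only qualified rows (criterion: qty > 0).
    return [r for r in csv.DictReader(io.StringIO(text)) if int(r["qty"]) > 0]

def transform(rows):
    # Transform: convert types, calculate a derived value (unit price),
    # and add a time attribute (load date).
    return [(r["orderId"], r["prodId"], int(r["qty"]), int(r["amt"]),
             round(int(r["amt"]) / int(r["qty"]), 2),
             date(1997, 7, 1).isoformat())
            for r in rows]

def load(rows):
    # Load: initial load into the target fact table.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE fact_sale (orderId, prodId, qty, amt, unit_price, load_date)")
    con.executemany("INSERT INTO fact_sale VALUES (?,?,?,?,?,?)", rows)
    return con

con = load(transform(extract(SOURCE)))
print(con.execute("SELECT COUNT(*), SUM(amt) FROM fact_sale").fetchone())  # (2, 23)
```

An incremental load would follow the same shape but insert only records new or changed since the previous run.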
Why ETL?
Companies have valuable data lying throughout their networks that needs to be moved from one place to another. The data lives in all sorts of heterogeneous systems, and therefore in all sorts of formats.
To solve this problem, companies use extract, transform and load (ETL) software.
The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, or an Excel spreadsheet.
ETL Tools
ETL tools:
- Provide a GUI for specifying a large number of transformation rules
- Generate programs to transform data
- Handle multiple data sources
- Handle data redundancy
- Generate metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment
Second-generation ETL tools: PowerCenter/PowerMart from Informatica, Data Mart Solution from Sagent Technology, DataStage from Ascential.
Metadata Management
What Is Metadata?
Metadata is information:
- That describes the WHAT, WHEN, WHO, WHERE, and HOW of the data warehouse
- About the data being captured and loaded into the warehouse
- Documented in IT tools, improving both business and technical understanding of data and data-related processes
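As a concrete illustration, here is a hypothetical technical-metadata record for a single warehouse table, organized around the what/when/who/where/how above. All names and values are invented for the sketch.

```python
# Illustrative sketch: a hypothetical metadata record for one warehouse
# table, covering the what/when/who/where/how of its data.
table_metadata = {
    "what":  {"table": "fact_sale", "grain": "one row per order line",
              "columns": {"amt": "sale amount in USD", "qty": "units sold"}},
    "when":  {"loaded_at": "2012-08-30T02:00:00", "refresh": "nightly batch"},
    "who":   {"owner": "DW admin team", "steward": "sales analytics"},
    "where": {"source": "orders.csv", "target": "warehouse.fact_sale"},
    "how":   {"transformations": ["filter qty > 0", "derive unit_price = amt / qty"]},
}

# A business user can look up the meaning of a report field by name.
def describe(field):
    return table_metadata["what"]["columns"].get(field, "undocumented")

print(describe("amt"))  # sale amount in USD
```

In practice such records live in a metadata repository rather than application code, but the structure is the same: business meaning alongside technical lineage.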
Importance Of Metadata
Locating information:
- How much time is spent looking for information?
- How often is the information found?
- What poor decisions were made based on incomplete information?
Consumers of Metadata
Technical users:
- Warehouse administrator
- Application developer
Business users (business metadata):
- Meanings
- Definitions
- Business rules
Software tools:
- Used in DW life-cycle development
- Metadata requirements for each tool must be identified
- The tool-specific metadata should be analysed for inclusion in the enterprise metadata repository
- Previously captured metadata should be electronically transferred from the enterprise metadata repository to each individual tool
Reischmann-Informatik-Toolbus
Features include selective bridging of metadata.
OLAP
Agenda
OLAP Definition
Distinction between OLTP and OLAP
MDDB Concepts
Implementation Techniques
Architectures
Features
Representative Tools
OLAP system:
- Consolidated data; OLAP data comes from the various OLTP databases
- Purpose of data: decision support
- Multi-dimensional views of various kinds of business activities
- Periodic long-running batch jobs refresh the data
MDDB Concepts
A multidimensional database is a computer software system designed to allow efficient and convenient storage and retrieval of data that is intimately related and can be stored, viewed, and analyzed from different perspectives (dimensions). A hypercube represents a collection of multidimensional data; the edges of the cube are called dimensions, and the individual items within each dimension are called members.
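The dimension/member/cell vocabulary can be made concrete with a small sketch. The dimensions and members echo the Sales Volumes example used in the slides; the stored figure is invented for illustration.

```python
from itertools import product

# Illustrative sketch: a 3-dimensional "Sales Volumes" hypercube modeled
# as a dict of cells, keyed by one member from each dimension.
dimensions = {
    "MODEL": ["Mini Van", "Coupe", "Sedan"],
    "COLOR": ["Blue", "Red", "White"],
    "DEALERSHIP": ["Clyde", "Carr", "Gleason"],
}

# One cell per combination of members: 3 x 3 x 3 = 27 cells.
cube = {cell: 0 for cell in product(*dimensions.values())}
cube[("Coupe", "Blue", "Clyde")] = 5  # set one cell's sales volume

print(len(cube))                         # 27
print(cube[("Coupe", "Blue", "Clyde")])  # 5
```

Adding a fourth dimension with four members would multiply the cell count to 27 x 4 = 108, which is why cell counts (and sparsity) grow quickly with dimensionality.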
MDDB
[Diagram: the "Sales Volumes" cube with dimensions MODEL (Mini Van, Coupe, Sedan), COLOR, and DEALERSHIP: 3 x 3 x 3 = 27 cells. Adding a fourth dimension with four members gives 27 x 4 = 108 cells.]
Sparsity:
- Input data in applications is typically sparse
- Sparsity increases with the number of dimensions
Data explosion:
- Due to sparsity
- Due to summarization
Performance:
- Does not perform better than an RDBMS at high data volumes (>20-30 GB)
[Slide: the same information viewed as a two-dimensional relational table and as the "Sales Volumes" cube (MODEL x COLOR).]

LAST NAME | EMP# | AGE
SMITH     | 01   | 21
REGAN     | 12   | 19
FOX       | 31   | 63
WELD      | 14   | 31
KELLY     | 54   | 27
LINK      | 03   | 56
KRANZ     | 41   | 45
LUCAS     | 33   | 41
WEISS     | 23   | 19
OLAP Features
- Calculations applied across dimensions, through hierarchies, and/or across members
- Trend analysis over sequential time periods; what-if scenarios
- Slicing/dicing subsets for on-screen viewing
- Rotation to new dimensional comparisons in the viewing area
- Drill-down/up along the hierarchy
- Reach-through/drill-through to underlying detail data
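Three of these operations, slicing, rotation, and drill-up, can be sketched on a small cube. This is a pure-Python illustration with invented sales figures, not how an OLAP engine stores data internally.

```python
from itertools import product

# Invented sales volumes for a MODEL x COLOR x DEALERSHIP cube.
models  = ["Mini Van", "Coupe", "Sedan"]
colors  = ["Blue", "Red", "White"]
dealers = ["Clyde", "Carr"]
cube = {(m, c, d): i
        for i, (m, c, d) in enumerate(product(models, colors, dealers), start=1)}

# Slice: fix one dimension's member, keeping a subcube over the rest.
coupe_slice = {(c, d): v for (m, c, d), v in cube.items() if m == "Coupe"}

# Rotate (pivot): present COLOR x MODEL instead of MODEL x COLOR.
rotated = {(c, m, d): v for (m, c, d), v in cube.items()}

# Drill-up / roll-up: aggregate out the DEALERSHIP dimension.
rollup = {}
for (m, c, d), v in cube.items():
    rollup[(m, c)] = rollup.get((m, c), 0) + v

print(len(coupe_slice))              # 6 cells in the Coupe slice
print(rollup[("Mini Van", "Blue")])  # 1 + 2 = 3
```

Drill-down is the inverse of the roll-up step: returning from the (MODEL, COLOR) totals to the individual dealership-level cells.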
[Diagram: rotation. A 90-degree rotation of the Sales Volumes cube turns View #1 (MODEL by COLOR) into View #2 (COLOR by MODEL).]
[Diagram: successive 90-degree rotations of the MODEL x COLOR x DEALERSHIP cube produce six orientations (View #1 through View #6), each bringing a different pair of dimensions to the front.]
[Diagram: a slice of the Sales Volumes cube showing model Coupe for dealerships Carr and Clyde and colors Normal Blue and Metal Blue.]
ORGANIZATION DIMENSION
- REGION: Midwest
- DISTRICT: Chicago, St. Louis, Gary
- DEALERSHIP: Clyde, Gleason, Carr, Levi, Lucas, Bolton
Moving up and moving down a hierarchy is referred to as drill-up (roll-up) and drill-down, respectively.
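Rolling sales figures up this organization hierarchy can be sketched as follows. The dealership-to-district mapping and the sales figures are invented for illustration; the slide does not state which dealership belongs to which district.

```python
# Hypothetical mapping of dealerships to districts in the Midwest region,
# plus invented per-dealership sales, illustrating drill-up / roll-up.
district_of = {
    "Clyde": "Chicago", "Gleason": "Chicago",
    "Carr": "St. Louis", "Levi": "St. Louis",
    "Lucas": "Gary", "Bolton": "Gary",
}
sales = {"Clyde": 10, "Gleason": 20, "Carr": 5, "Levi": 15, "Lucas": 8, "Bolton": 2}

# Drill-up: aggregate dealership-level cells to the district level...
district_sales = {}
for dealer, amount in sales.items():
    district = district_of[dealer]
    district_sales[district] = district_sales.get(district, 0) + amount

# ...then roll the districts up to the region level.
region_sales = {"Midwest": sum(district_sales.values())}

print(district_sales)  # {'Chicago': 30, 'St. Louis': 20, 'Gary': 10}
print(region_sales)    # {'Midwest': 60}
```

Drill-down simply navigates back the other way: from the Midwest total to districts, and from a district to its dealerships.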
[Diagram: a time-dimension hierarchy example spanning 1st Qtr through 4th Qtr.]
[Diagram: MOLAP architecture. OLAP tools and a web browser access an OLAP calculation engine built around the OLAP cube.]
MOLAP - Features
- Powerful analytical capabilities (e.g., financial, forecasting, statistical)
- Aggregation and calculation capabilities
- Read/write analytic applications
- Specialized data structures for maximum query performance and optimum space utilization
[Diagram: ROLAP architecture. A web browser, OLAP tools, and OLAP applications access an OLAP calculation engine that issues SQL against the relational DW.]
[Diagram: HOLAP architecture. A web browser, OLAP tools, and OLAP applications access an OLAP calculation engine that issues SQL against the relational DW.]
HOLAP - Features
- RDBMS used for detailed data stored in large databases
- MDDB used for fast, read/write OLAP analysis and calculations
- Scalability of the RDBMS with MDDB performance
- Calculation engine provides full analysis features
- Source of data is transparent to the end user
Architecture Comparison
Definition:
- MOLAP: MDDB OLAP; transaction-level data plus summary in the MDDB
- ROLAP: Relational OLAP; transaction-level data plus summary in the RDBMS
- HOLAP: Hybrid OLAP; ROLAP plus summary in an MDDB
Data explosion due to sparsity:
- MOLAP: high (may go beyond control; estimation is very important)
- ROLAP: no sparsity
- HOLAP: sparsity exists only in the MDDB part
Data explosion due to summarization:
- MOLAP: with good design, 3 to 10 times
- ROLAP: to the necessary extent
- HOLAP: to the necessary extent
Query execution speed:
- MOLAP: fast (depends upon the size of the MDDB)
- ROLAP: slow
- HOLAP: optimum; if the data is fetched from the RDBMS it behaves like ROLAP, otherwise like MOLAP
Cost:
- HOLAP: high (RDBMS plus disk space plus MDDB server cost)
Where to apply:
- MOLAP: small transactional data that needs to be viewed/sorted
- ROLAP: very large transactional data, complex models, and frequent summary analysis
- HOLAP: large transactional data with frequent summary analysis
Representative tools:
- MOLAP: Oracle Express, Hyperion Essbase, Cognos PowerPlay, Seagate Holos, SAS
- ROLAP: MicroStrategy DSS Agent, Informix MetaCube, BrioQuery, Business Objects / WebIntelligence
Typical OLAP applications: sales analysis, financial analysis, profitability analysis, performance analysis, risk management, profiling and segmentation, scorecard applications, NPA management, strategic planning, and customer relationship management (CRM).
Data Warehouse Testing
The methodology required for testing a data warehouse is different from that for testing a typical transaction system.
In a data warehouse, most of the testing is system-triggered. In production/source systems, by contrast, most testing covers the processing of individual transactions driven by user input (application forms, servicing requests). Very few test cycles there cover system-triggered scenarios (such as billing or valuation).
Requirements testing
The main aim of requirements testing is to check the stated requirements for completeness. Requirements can be tested on the following factors:
- Are the requirements complete?
- Are the requirements singular?
- Are the requirements unambiguous?
- Are the requirements developable?
- Are the requirements testable?
Unit Testing
Unit testing for data warehouses is white-box: it should check the ETL procedures/mappings/jobs and the reports developed. Unit testing the ETL procedures verifies that:
- The ETLs access and pick up the right data from the right source
- All data transformations are correct according to the business rules, and the data warehouse is correctly populated with the transformed data
- Records that don't fulfil the transformation rules are rejected and handled correctly
Unit Testing
Unit testing the report data:
- Verify report data against the source: data in a data warehouse is stored at an aggregate level compared to the source systems; the QA team should verify the granular data stored in the warehouse against the available source data
- Field-level data verification: the QA team must understand the linkages for the fields displayed in a report, trace them back, and compare them with the source systems
- Derivation formulae and calculation rules should be verified
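The source-to-warehouse verification above can be sketched as an automated check. The tables and figures are invented for illustration; the warehouse is assumed to hold daily aggregates of the granular source transactions.

```python
# Illustrative sketch: verify aggregate-level warehouse data against
# granular source data, as a QA team would in report-data unit testing.
source_transactions = [  # invented granular source rows: (date, amount)
    ("1997-07-01", 12), ("1997-07-01", 11), ("1997-08-03", 50),
]
warehouse_daily = {      # invented warehouse rows, aggregated by date
    "1997-07-01": 23,
    "1997-08-03": 50,
}

def rollup_source(rows):
    # Re-aggregate the source at the warehouse's grain for comparison.
    totals = {}
    for day, amount in rows:
        totals[day] = totals.get(day, 0) + amount
    return totals

# Any date where the warehouse total disagrees with the re-aggregated
# source is a defect to investigate.
mismatches = {day: (warehouse_daily.get(day), total)
              for day, total in rollup_source(source_transactions).items()
              if warehouse_daily.get(day) != total}
print(mismatches)  # {} -> warehouse agrees with the source at this grain
```

The same pattern extends to field-level verification: trace each report field back to its source column, apply the documented derivation rule, and compare.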
Integration Testing
Integration testing will involve the following:
- Sequence of ETL jobs in the batch
- Initial loading of records into the data warehouse
- Incremental loading of records at a later date, to verify newly inserted or updated data
- Testing of rejected records that don't fulfil the transformation rules
- Error log generation
Performance Testing
Performance testing should check that ETL processes complete within the allotted time window.
Acceptance testing
Here the system is tested with full functionality and is expected to function as it would in production. At the end of UAT, the system should be acceptable to the client in terms of ETL process integrity, business functionality, and reporting.
Questions
Thank You