Data Warehouse Testing

Padma M. Kulkarni February 2010


Data Warehouse and Data Warehousing?

• A data warehouse is a repository of an organization's electronically stored data, designed mainly to facilitate reporting and analysis.
• It is a single, complete and consistent store of data, obtained from a variety of different sources, made available to end users in a way that they can understand and use in a business context.
• Data warehousing is the process of managing a data warehouse: combining data from multiple, usually varied, sources into one comprehensive and easily manipulated database, along with any data marts associated with a specific enterprise (business) layer.
• It is a process of transforming data into information and making it available to users in a timely manner so that they can make critical business decisions.


Data Warehouse Architecture


Terms in Data warehouse

ETL
• ETL technology is an important component of the data warehousing architecture.
• It is used to copy data from operational (upstream) applications to the data warehouse staging area, from the staging area into the data warehouse, and finally from the data warehouse into a set of conformed data marts that are accessible to decision makers or downstream applications (a minimal sketch of one such hop follows below).
• The scheduling of ETL jobs is critical. An ETL process may run several times a day, or on weekly, monthly, quarterly or annual production schedules.
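To make the extract, stage and load flow concrete, here is a minimal sketch in Python. The table names, columns and the upper-casing rule are hypothetical, and sqlite3 stands in for the real source and warehouse databases.

```python
# Minimal ETL hop sketch: source -> staging area -> warehouse table.
# All names and the transformation rule are hypothetical stand-ins.
import sqlite3

src = sqlite3.connect(":memory:")   # operational source system
dw = sqlite3.connect(":memory:")    # warehouse side (staging + target)

src.execute("CREATE TABLE orders (id INTEGER, amount REAL, country TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 10.0, "us"), (2, 25.5, "in")])

dw.execute("CREATE TABLE stg_orders (id INTEGER, amount REAL, country TEXT)")

# Extract: copy raw rows from the source into the staging area.
rows = src.execute("SELECT id, amount, country FROM orders").fetchall()
dw.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", rows)

# Transform + Load: apply a simple rule (upper-case country codes)
# while moving data from staging into the warehouse table.
dw.execute("CREATE TABLE dw_orders (id INTEGER, amount REAL, country TEXT)")
dw.execute("INSERT INTO dw_orders SELECT id, amount, UPPER(country) FROM stg_orders")
```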

DW Staging Area
• The data warehouse staging area holds temporary data copied from the source systems.
• Due to varying business cycles, data processing cycles, hardware and network resource limitations and geographical factors, it is not feasible to extract all the data from all operational databases at exactly the same time.
• In short, all required data must be available before it can be integrated into the data warehouse.


Terms in Data warehouse (contd.)

Dimension tables
• Mainly contain master data that is subject to change over a period of time
• Define the business in terms already familiar to users
• Wide rows with lots of descriptive text
• Small tables (about a million rows)
• Heavily indexed
• Typical dimensions: time periods, geographic regions (markets, cities), products, customers, salespersons, etc.

Fact tables
• The central table
• Stores the measures of the business, mostly raw numeric items
• Large number of rows (millions to a billion)
• Accessed via the dimensions (all the data in the fact table is related to data in the dimensions)
• Points to the key value at the lowest level of each dimension table (see the star-schema sketch below)
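As an illustration, a minimal star-schema sketch in Python; table and column names are hypothetical, and sqlite3 keeps it self-contained. It shows measures in a fact table being reached by joining through the dimensions.

```python
# Tiny star schema: one fact table keyed to two dimension tables.
# All table/column names and data are hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales  (date_key INTEGER, product_key INTEGER, amount REAL);

INSERT INTO dim_date    VALUES (20100201, 2010, 2);
INSERT INTO dim_product VALUES (1, 'Widget');
INSERT INTO fact_sales  VALUES (20100201, 1, 99.50);
""")

# Measures in the fact table are accessed via the dimension tables.
for row in db.execute("""
    SELECT d.year, d.month, p.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d    ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY d.year, d.month, p.name"""):
    print(row)
```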


Terms in Data warehouse (contd.)

Operational Data Store (ODS)
• A type of database that is often used as an interim logical area for a data warehouse.
• An integrated, subject-oriented, volatile, current-valued structure designed to serve operational users.
• Usually designed to contain low-level or atomic (indivisible) data.
• Contains limited history that is captured in "real time" or "near real time".

Slowly Changing Dimensions (SCD)
• Describe dimensions whose attribute values vary over time.
• SCDs are used to keep the latest data pertaining to the dimension, e.g. the current address of a client (a Type 2 sketch follows below).
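A minimal sketch, assuming SCD Type 2 handling and hypothetical column names: when a tracked attribute changes, the current row is closed and a new current row is inserted, so history is preserved alongside the latest value.

```python
# SCD Type 2 sketch: expire the current row and insert a new one when
# a tracked attribute (address) changes. Names/data are hypothetical.
from datetime import date

dim_client = [
    # (client_id, address, valid_from, valid_to, is_current)
    (42, "12 Old Street", date(2008, 1, 1), None, True),
]

def apply_scd2(rows, client_id, new_address, as_of):
    out = []
    for cid, addr, frm, to, cur in rows:
        if cid == client_id and cur and addr != new_address:
            out.append((cid, addr, frm, as_of, False))         # close old version
            out.append((cid, new_address, as_of, None, True))  # new current row
        else:
            out.append((cid, addr, frm, to, cur))
    return out

dim_client = apply_scd2(dim_client, 42, "7 New Avenue", date(2010, 2, 1))
for row in dim_client:
    print(row)
```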


Data warehouse testing

What?
Data warehouse testing largely consists of testing the ETLs, i.e. 'back-end' testing in which the source-system data is compared with the end-result data in the loaded area. It also includes validating the data flow across all the integrated applications and validating the data against the business rules defined by the business analysts.

Why?
There is an exponentially increasing cost associated with finding software defects later in the development lifecycle. In data warehousing, this is compounded by the additional business cost of using incorrect data to make critical business decisions.


Process of Data warehouse testing


Goals of ETL Testing

• Data completeness
• Data transformation
• Data quality
• End-to-end integration testing
• Regression testing
• Performance and security
• User acceptance testing
• Reports [BI]


Data Completeness
• Expected data reaches the data warehouse (a completeness-check sketch follows below)
• All records
  - Record counts between source and target
  - Sums of numeric fields between source and target
  - Minus queries between source and target
  - PK values between source and target
  - The right records from the right source [filter conditions]
  - Record counts in target tables, pre and post ETL job execution
• All fields
  - Table structure between source and target objects
  - Table structure against the requirements and design matrix
• Full content in each field
  - Boundary testing
  - Data truncation
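A minimal sketch of the count, sum and minus-query style checks listed above, assuming the source and target are reachable as SQLite databases and using hypothetical table and column names:

```python
# Data-completeness checks sketch: compare row counts, numeric sums and
# primary-key sets between a source table and its warehouse target.
# All names are hypothetical; sqlite3 stands in for the real databases.
import sqlite3

def completeness_checks(src_db, tgt_db, src_table, tgt_table, num_col, pk_col):
    src, tgt = sqlite3.connect(src_db), sqlite3.connect(tgt_db)

    # Record counts must match.
    n_src = src.execute(f"SELECT COUNT(*) FROM {src_table}").fetchone()[0]
    n_tgt = tgt.execute(f"SELECT COUNT(*) FROM {tgt_table}").fetchone()[0]
    assert n_src == n_tgt, f"count mismatch: {n_src} vs {n_tgt}"

    # Sums of a numeric field must match.
    s_src = src.execute(f"SELECT SUM({num_col}) FROM {src_table}").fetchone()[0]
    s_tgt = tgt.execute(f"SELECT SUM({num_col}) FROM {tgt_table}").fetchone()[0]
    assert s_src == s_tgt, f"sum mismatch: {s_src} vs {s_tgt}"

    # 'Minus query' on primary keys: keys present on one side only.
    pk_src = {r[0] for r in src.execute(f"SELECT {pk_col} FROM {src_table}")}
    pk_tgt = {r[0] for r in tgt.execute(f"SELECT {pk_col} FROM {tgt_table}")}
    assert not pk_src - pk_tgt, f"missing in target: {pk_src - pk_tgt}"
    assert not pk_tgt - pk_src, f"extra in target: {pk_tgt - pk_src}"
```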


Data Transformation
• Verify that the data is transformed correctly (a rule check is sketched below)
• Transformations based on business rules (if X -> Y)
• Transformations at ETL job level and at DB level
  - Primary keys and constraints between source and target
  - DB defaults and job defaults [for NULL and NOT NULL values]
  - Transformations w.r.t. the requirements and design matrix
  - ETL-generated fields such as surrogate keys
  - Referential integrity with the associated target tables
  - One-time historical transforms, e.g. inserting records into master tables
  - Re-running ETL jobs [complete and abort]
  - Transformations on the cloud
  - Different data load formats such as append, delete-and-insert, and SCD types
• Stare and compare: a manual check used as a sanity test
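A minimal sketch of a rule-based transformation check; the rule, identifiers and data are hypothetical. For every source row, the expected target value is recomputed independently and compared with what the ETL actually loaded:

```python
# Transformation check sketch: re-apply a business rule 'if X -> Y'
# to source rows and compare against the loaded target rows.
# The rule and the data are hypothetical stand-ins.

def expected_status(amount):
    # Business rule under test: orders over 1000 are flagged 'HIGH'.
    return "HIGH" if amount > 1000 else "NORMAL"

source = {101: 1500.0, 102: 250.0}       # id -> amount (source system)
target = {101: "HIGH", 102: "NORMAL"}    # id -> status (warehouse)

for oid, amount in source.items():
    want, got = expected_status(amount), target.get(oid)
    assert got == want, f"order {oid}: expected {want}, loaded {got}"
print("transformation rule holds for all rows")
```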


Data Quality
• Verify how incorrect data in the source is handled (a reject-handling sketch follows below)
• Reject the record completely
  - Based on business rules
  - Based on DB or job validations
  - Based on filter conditions
• Substitute default values for partially rejected records
  - Default values at job level and at DB level
  - Based on validations against master tables
  - Based on the data mapping
• Exception reports
  - An exception report for each stage of the ETL job [extract, transform and load]
  - Records in the report vs. the count mentioned in the ETL job's log file
  - Records in the report vs. the count mentioned in the control-total tables
  - Email notifications
• Corrected data
  - Rejected data being corrected in the source for subsequent ETL runs
• Negative testing
  - Manipulate data in the source to cover all possible combinations of reject handling
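A minimal sketch of reject handling and its reconciliation; the validation rules, records and control total are hypothetical. Failing rows are routed to an exception list whose count is checked against the control total:

```python
# Reject-handling sketch: rows that fail validation go to an exception
# report; counts are reconciled against a control total.
# The validation rule, records and total are hypothetical.

records = [
    {"id": 1, "country": "US", "amount": 10.0},
    {"id": 2, "country": "",   "amount": 25.0},   # fails: empty country
    {"id": 3, "country": "IN", "amount": -5.0},   # fails: negative amount
]

loaded, rejected = [], []
for rec in records:
    if rec["country"] and rec["amount"] >= 0:
        loaded.append(rec)
    else:
        rejected.append(rec)   # would appear on the exception report

# Reconcile: loaded + rejected must account for every source record,
# and the exception count must match the control total from the job log.
control_total_rejected = 2    # e.g. read from the ETL job's log file
assert len(loaded) + len(rejected) == len(records)
assert len(rejected) == control_total_rejected
```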



End to End Integration testing

• Compatibility with upstream and downstream systems
• Different source bases, such as flat files or other databases
• End-to-end touch points of the ETL, e.g. from the front end to a data mart or any other downstream application
• ETL compatibility with referenced applications, e.g. currency rates
• Compatibility with schedulers and flag values

Regression testing
• Ensure that existing functionality remains intact when:
  - Existing data flows are modified for new functionality
  - Enhancements and defect fixes are applied
  - Any upstream system is modified


Performance testing
• ETLs execute within the expected timeframes for any volume of data (a timing sketch follows below)
  - Larger volumes of data in the source
  - Scheduler response and flag updates
  - Performance of reject handling
  - Performance of the extract, transform and load jobs separately
  - Historic updates of production data
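A minimal sketch of a runtime check against an agreed limit; run_etl_job and the threshold are hypothetical stand-ins:

```python
# Performance check sketch: time an ETL job against an agreed SLA.
# run_etl_job and the 15-minute threshold are hypothetical.
import time

def run_etl_job():
    time.sleep(0.1)   # stand-in for the real extract/transform/load work

SLA_SECONDS = 15 * 60     # agreed maximum runtime for this data volume
start = time.monotonic()
run_etl_job()
elapsed = time.monotonic() - start
assert elapsed <= SLA_SECONDS, f"ETL exceeded SLA: {elapsed:.1f}s"
```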

Security testing
• Ensure that access to all objects and applications is as defined
  - Access to execute ETL jobs
  - Access to databases, DB objects and files
  - Access to exception reports
  - Notification or information emails from the jobs
  - Privileges to modify the data and other configurations
  - Access to all integrated applications

User Acceptance testing

• Verify that the solution meets business expectations
  - Typically done by business users, not the QA team
  - Validates business functionality and the reports used for critical decisions and forecasts
  - Tested with production or production-like data
  - One-time processing must be manageable within the conversion weekend
  - Parallel testing, mainly in software upgrade projects

Reports [Business Intelligence]

• Validate reports with all combinations of parameters
• Content of the report for boundary values
• Drill-down and drill-up features of the reports
• Report data reconciled with the back end (a reconciliation sketch follows below)
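A minimal sketch of reconciling a report figure with the back end; the report value and fact rows are hypothetical:

```python
# Report-vs-backend reconciliation sketch: a total shown on a BI report
# should equal an independent aggregation over the warehouse tables.
# The figures and rows are hypothetical.

report_total_sales = 99.50    # value displayed on the report

# Independent backend computation (could be a SQL SUM over fact_sales).
fact_rows = [("Widget", 99.50)]
backend_total = sum(amount for _, amount in fact_rows)

assert abs(report_total_sales - backend_total) < 1e-6, "report/backend mismatch"
```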


Thank you


Padma M. Kulkarni
padma_kulkarni@mindtree.com
www.mindtree.com

