
Strategies for Testing Data Warehouse Applications

By Jeff Theobald

Businesses are increasingly focusing on the collection and organization of data for
strategic decision-making. The ability to review historical trends and monitor near
real-time operational data has become a key competitive advantage.
This article provides practical recommendations for testing extract, transform and
load (ETL) applications based on years of experience testing data warehouses in the
financial services and consumer retailing areas. Every attempt has been made to
keep this article tool-agnostic so that it applies to any organization building a new
data warehouse or improving an existing one.

Testing Goals
There is an exponentially increasing cost associated with finding software defects
later in the development lifecycle. In data warehousing, this is compounded
because of the additional business costs of using incorrect data to make critical
business decisions. Given the importance of early detection of software defects,
let's first review some general goals of testing an ETL application:

- Data completeness. Ensures that all expected data is loaded.
- Data transformation. Ensures that all data is transformed correctly according to business rules and/or design specifications.
- Data quality. Ensures that the ETL application correctly rejects, substitutes default values, corrects or ignores, and reports invalid data.
- Performance and scalability. Ensures that data loads and queries perform within expected time frames and that the technical architecture is scalable.
- Integration testing. Ensures that the ETL process functions well with other upstream and downstream processes.
- User-acceptance testing. Ensures the solution meets users' current expectations and anticipates their future expectations.
- Regression testing. Ensures existing functionality remains intact each time a new release of code is completed.

Data Completeness
One of the most basic tests of data completeness is to verify that all expected data loads into the data warehouse. This includes validating that all records, all fields and the full contents of each field are loaded. Strategies to consider include:

- Comparing record counts between source data, data loaded to the warehouse and rejected records (see the sketch following this list).
- Comparing unique values of key fields between source data and data loaded to the warehouse. This is a valuable technique that points out a variety of possible data errors without doing a full validation on all fields.
- Utilizing a data profiling tool that shows the range and value distributions of fields in a data set. This can be used during testing and in production to compare source and target data sets and point out any data anomalies from source systems that may be missed even when the data movement is correct.
- Populating the full contents of each field to validate that no truncation occurs at any step in the process. For example, if the source data field is a string(30), make sure to test it with 30 characters.
- Testing the boundaries of each field to find any database limitations. For example, for a decimal(3) field include values of -99 and 999, and for date fields include the entire range of dates expected. Depending on the type of database and how it is indexed, it is possible that the range of values the database accepts is too small.
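
The record-count and key-comparison checks above are easy to automate. Below is a minimal reconciliation sketch in Python with pandas; the staged CSV file names and the account_id key column are hypothetical stand-ins for whatever your source, warehouse and reject extracts actually contain.

```python
# Minimal source-to-target reconciliation sketch (hypothetical file and
# column names; adapt to the actual extracts).
import pandas as pd

source = pd.read_csv("source_extract.csv")
target = pd.read_csv("warehouse_extract.csv")
rejects = pd.read_csv("reject_file.csv")

# Record counts: every source record should be either loaded or rejected.
assert len(source) == len(target) + len(rejects), (
    f"count mismatch: {len(source)} source vs "
    f"{len(target)} loaded + {len(rejects)} rejected"
)

# Unique key comparison: points out missing or unexpected records
# without doing a full validation on all fields.
src_keys = set(source["account_id"])
tgt_keys = set(target["account_id"])
print("keys missing from warehouse:", sorted(src_keys - tgt_keys))
print("unexpected keys in warehouse:", sorted(tgt_keys - src_keys))
```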

Data Transformation
Validating that data is transformed correctly based on business rules can be the most complex part of testing an ETL application with significant transformation logic. One typical method is to pick some sample records and "stare and compare" to validate data transformations manually. This can be useful but requires manual testing steps and testers who understand the ETL logic. A combination of automated data profiling and automated data movement validations is a better long-term strategy. Here are some simple automated data movement techniques:

- Create a spreadsheet of scenarios of input data and expected results and validate these with the business customer. This is a good requirements elicitation exercise during design and can also be used during testing (a driver sketch follows this list).
- Create test data that includes all scenarios. Elicit the help of an ETL developer to automate the process of populating data sets with the scenario spreadsheet to allow for flexibility, because scenarios will change.
- Utilize data profiling results to compare range and distribution of values in each field between source and target data.
- Validate correct processing of ETL-generated fields such as surrogate keys.
- Validate that data types in the warehouse are as specified in the design and/or the data model.
- Set up data scenarios that test referential integrity between tables. For example, what happens when the data contains foreign key values not in the parent table?
- Validate parent-to-child relationships in the data. Set up data scenarios that test how orphaned child records are handled.
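
One way to keep the scenario spreadsheet and the automated checks in sync is to drive the validation directly from the spreadsheet. A minimal sketch follows; the CSV export, its column names (scenario_id, input_product_code, expected_category) and the warehouse extract layout are all hypothetical.

```python
# Scenario-driven transformation check: replay each business-validated
# scenario against the post-ETL extract (hypothetical columns).
import pandas as pd

scenarios = pd.read_csv("transformation_scenarios.csv")
target = pd.read_csv("warehouse_extract.csv")

failures = []
for row in scenarios.itertuples():
    # Locate the loaded record produced from this scenario's input.
    loaded = target[target["product_code"] == row.input_product_code]
    if loaded.empty:
        failures.append(f"{row.scenario_id}: record not loaded")
    elif (loaded["category"] != row.expected_category).any():
        failures.append(
            f"{row.scenario_id}: expected {row.expected_category}, "
            f"got {sorted(loaded['category'].unique())}"
        )

print("\n".join(failures) or "all transformation scenarios passed")
```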

Data Quality
For the purposes of this discussion, data quality is defined as "how the ETL system handles data rejection, substitution, correction and notification without modifying data." To ensure success in testing data quality, include as many data scenarios as possible. Typically, data quality rules are defined during design, for example (a sketch of such rules in code follows this section):

- Reject the record if a certain decimal field has nonnumeric data.
- Substitute null if a certain decimal field has nonnumeric data.
- Validate and correct the state field if necessary based on the ZIP code.
- Compare product code to values in a lookup table, and if there is no match load anyway but report to users.

Depending on the data quality rules of the application being tested, scenarios to test might include null key values, duplicate records in source data and invalid data types in fields (e.g., alphabetic characters in a decimal field). Review the detailed test scenarios with business users and technical designers to ensure that all are on the same page.

Data quality rules applied to the data will usually be invisible to the users once the application is in production; users will only see what's loaded to the database. For this reason, it is important to ensure that what is done with invalid data is reported to the users. These data quality reports present valuable data that sometimes reveals systematic issues with source data. In some cases, it may be beneficial to populate the "before" data in the database for users to view.
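
To make the reject/substitute/correct/report pattern concrete, here is a minimal sketch of those example rules in Python. The field names and the ZIP-to-state lookup stub are hypothetical; a real implementation would live in the ETL tool, with the notifications feeding the data quality report.

```python
# Minimal data quality rule sketch: reject, substitute, correct and
# report invalid data without silently modifying it.
# Field names and the ZIP-to-state stub are hypothetical.

zip_to_state = {"23220": "VA", "10001": "NY"}  # stand-in for a lookup table

def apply_quality_rules(record):
    """Return (record_or_None, notifications); None means rejected."""
    notes = []

    # Reject the record if the amount field has nonnumeric data.
    try:
        float(record["amount"])
    except (TypeError, ValueError):
        notes.append(f"rejected: nonnumeric amount {record['amount']!r}")
        return None, notes

    # Substitute null if the rate field has nonnumeric data.
    try:
        float(record["rate"])
    except (TypeError, ValueError):
        notes.append(f"substituted null for rate {record['rate']!r}")
        record = {**record, "rate": None}

    # Validate and correct the state field based on the ZIP code.
    expected = zip_to_state.get(record["zip"])
    if expected and record["state"] != expected:
        notes.append(f"corrected state {record['state']} -> {expected}")
        record = {**record, "state": expected}

    return record, notes

loaded, report = apply_quality_rules(
    {"amount": "12.50", "rate": "n/a", "zip": "23220", "state": "NC"}
)
print(loaded)   # rate substituted, state corrected
print(report)   # notifications that feed the data quality report
```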

Performance and Scalability
As the volume of data in a data warehouse grows, ETL load times can be expected to increase and the performance of queries can be expected to degrade. This can be mitigated by having a solid technical architecture and good ETL design. The aim of performance testing is to point out any potential weaknesses in the ETL design, such as reading a file multiple times or creating unnecessary intermediate files. The following strategies will help discover performance issues:

- Load the database with peak expected production volumes to ensure that this volume of data can be loaded by the ETL process within the agreed-upon window.
- Compare these ETL loading times to loads performed with a smaller amount of data to anticipate scalability issues. Compare the ETL processing times component by component to point out any areas of weakness (a timing sketch follows this list).
- Monitor the timing of the reject process and consider how large volumes of rejected data will be handled.
- Perform simple and multiple join queries to validate query performance on large database volumes. Work with business users to develop sample queries and acceptable performance criteria for each query.
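
A simple harness that times each component at two volumes can surface scalability problems early. The sketch below assumes a hypothetical run_component hook into the ETL tool and made-up component names; what matters is comparing how each component's time grows relative to volume.

```python
# Time ETL components at a small and a peak volume to anticipate
# scalability issues. run_component is a hypothetical hook.
import time

def run_component(name, volume):
    """Stand-in: invoke one ETL component against a staged data set."""
    time.sleep(0.01)  # replace with the real invocation

def timed_run(volume):
    timings = {}
    for name in ("extract", "transform", "load", "reject_processing"):
        start = time.perf_counter()
        run_component(name, volume)
        timings[name] = time.perf_counter() - start
    return timings

small = timed_run(100_000)
peak = timed_run(10_000_000)
for name in small:
    ratio = peak[name] / small[name]
    # A component whose time grows much faster than the 100x volume
    # increase is a likely design weakness (e.g., re-reading a file).
    print(f"{name}: {small[name]:.2f}s -> {peak[name]:.2f}s (x{ratio:.1f})")
```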

Integration Testing
Typically, system testing only includes testing within the ETL application. The endpoints for system testing are the input and output of the ETL code being tested. Integration testing shows how the application fits into the overall flow of all upstream and downstream applications. When creating integration test scenarios, consider how the overall process can break and focus on touchpoints between applications rather than within one application. Consider how process failures at each step would be handled and how data would be recovered or deleted if necessary.

Most issues found during integration testing are either data related or result from false assumptions about the design of another application. Therefore, it is important to integration test with production-like data. Real production data is ideal, but depending on the contents of the data, there could be privacy or security concerns that require certain fields to be randomized before the data is used in a test environment. As always, don't forget the importance of good communication between the testing and design teams of all systems involved. To help bridge this communication gap, gather team members from all systems together to formulate test scenarios and discuss what could go wrong in production. Run the overall process from end to end in the same order and with the same dependencies as in production (a minimal runner sketch follows). Integration testing should be a combined effort and not the responsibility solely of the team testing the ETL application.
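
Where no scheduler is available in the test environment, even a small script that runs the steps in production order and stops at the first failing touchpoint is useful. The step names and placeholder commands below are hypothetical.

```python
# Run the overall flow end to end in production order, stopping at the
# first failing touchpoint so its recovery procedure can be exercised.
import subprocess

pipeline = [
    ("upstream_extract",  ["true"]),  # placeholders for the real commands
    ("etl_load",          ["true"]),
    ("downstream_report", ["true"]),
]

for name, cmd in pipeline:
    result = subprocess.run(cmd)
    if result.returncode != 0:
        raise SystemExit(f"integration run failed at step: {name}")
print("end-to-end integration run completed")
```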

User-Acceptance Testing
The main reason for building a data warehouse application is to make data available to business users. Users know the data best (not the mechanics of how the ETL application works), and their participation in the testing effort is a key component of the success of a data warehouse implementation. User-acceptance testing (UAT) typically focuses on data loaded to the data warehouse and any views that have been created on top of the tables. Consider the following strategies:

- Use data that is either from production or as near to production data as possible. Users typically find issues once they see the "real" data, sometimes leading to design changes.
- Test database views by comparing view contents to what is expected. It is important that users sign off and clearly understand how the views are created.
- Plan for the system test team to support users during UAT. The users will likely have questions about how the data is populated and will need to understand details of how the ETL works.
- Consider how the users would require the data loaded during UAT and negotiate how often the data will be refreshed.

Regression Testing
Regression testing is revalidation of existing functionality with each new release of code. When building test cases, remember that they will likely be executed multiple times as new releases are created due to defect fixes, enhancements or upstream systems changes. Building automation during system testing will make the process of regression testing much smoother. Test cases should be prioritized by risk in order to help determine which need to be rerun for each new release. A simple but effective and efficient strategy to retest basic functionality is to store source data sets and results from successful runs of the code and compare new test results with previous runs, as in the sketch below. When doing a regression test, it is much quicker to compare results to a previous execution than to do an entire data validation again.
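
A baseline comparison like the one below keeps regression runs cheap. It stores the target extract from a successful run and diffs each new run against it; the file names, the account_id key and the balance column are hypothetical.

```python
# Regression check: compare the current run's extract with a stored
# baseline from a previous successful run (hypothetical names).
import pandas as pd

baseline = pd.read_csv("runs/baseline/warehouse_extract.csv")
current = pd.read_csv("runs/current/warehouse_extract.csv")

# Outer-join on the key so added, removed and changed records all show up.
merged = baseline.merge(
    current, on="account_id", how="outer",
    suffixes=("_baseline", "_current"), indicator=True,
)
added = merged[merged["_merge"] == "right_only"]
removed = merged[merged["_merge"] == "left_only"]
print(f"{len(added)} new records, {len(removed)} missing records")

# Compare a value column for records present in both runs; this is much
# quicker than repeating a full data validation.
both = merged[merged["_merge"] == "both"]
changed = both[both["balance_baseline"] != both["balance_current"]]
print(f"{len(changed)} records with a changed balance")
```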

Taking these considerations into account during the design and testing portions of building a data warehouse will ensure that a quality product is produced and prevent costly mistakes from being discovered in production.

Jeff Theobald is a manager at Capital One. He can be reached at jeff.theobald@capitalone.com.