ETL Testing Process

Author: Srinivasa Rao P.V (

This paper introduces the data warehouse and looks at the different strategies for testing
a data warehouse application. It suggests various approaches that could be beneficial when
testing the ETL process in a DW. A data warehouse is a critical business application, and
defects in it result in business losses that cannot be accounted for. Here, we walk you
through some of the basic phases and strategies to minimize defects.
This is an era of global competition, and ignorance is one of the greatest threats to modern
business. As such, organizations across the globe are relying on IT services for strategic
decision-making. A data warehouse implementation is one such tool that comes to the rescue.
Given the criticality of a DW application, a defect-free DW implementation is a dream come
true for any organization. As QA and testing personnel, our role is to ensure this, thereby
leading to maximized profits, better decisions and customer satisfaction. A bug in the system
traced at a later stage not only increases the cost associated with rework, but also brings
with it the use of incorrect data to make strategic decisions. Hence, pre-implementation
defect detection should be ensured. In light of the above discussion, let us take a look at
the definition and examples of a DW and then at the various strategies involved in the testing
cycle for a DW application.
What is a Data Warehouse?
According to Inmon, famous author of several data warehouse books, "A data warehouse
is a subject oriented, integrated, time variant, non volatile collection of data in support of
management's decision making process".
In order to store data, over the years, many application designers in each branch
have made their individual decisions as to how an application and database should be
built. So source systems will be different in naming conventions, variable measurements,
encoding structures, and physical attributes of data. Consider a bank that has several
branches in several countries, has millions of customers, and whose lines of business
are savings and loans. The following example explains how the data is
integrated from source systems to target systems.
Example of Source Data:
In the example below, the attribute name, column name, data type and values are entirely
different from one source system to another. This inconsistency in data can become a problem
when generating statistics from the historical data, so we avoid it by integrating
the data into a data warehouse with good standards.
            Attribute Name      Column Name              Data type      Values
System 1    Application Date                             NUMERIC(8,0)   11012005
System 2    Application Date    CUST_APPLICATION_DATE    DATE           11012005
System 3    Application Date    APPLICATION_DATE         DATE           01NOV2005
Example of Target Data (Data Warehouse):
            Attribute Name               Column Name                  Data type   Values
Record #1   Customer Application Date    CUSTOMER_APPLICATION_DATE    DATE        01112005
Record #2   Customer Application Date    CUSTOMER_APPLICATION_DATE    DATE        01112005
Record #3   Customer Application Date    CUSTOMER_APPLICATION_DATE    DATE        01112005
In the above example of target data, attribute names, column names, and data types are consistent
throughout the target system. This is how data from various source systems is integrated and
accurately stored into the data warehouse.
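As a rough illustration of the integration step above, the following Python sketch converts
the three source representations of Application Date into one warehouse value. The format
assumptions (MMDDYYYY for Systems 1 and 2, DDMONYYYY for System 3, DDMMYYYY in the target) are
inferred from the example values, and the function names are hypothetical; they do not come
from any particular ETL tool.

```python
from datetime import date

# Month abbreviations as they appear in System 3 values such as 01NOV2005.
MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
          "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12}


def to_warehouse_date(system, raw):
    """Convert a source Application Date into a single date value.

    Assumption: Systems 1 and 2 hold MMDDYYYY, System 3 holds DDMONYYYY.
    """
    text = str(raw).strip().upper()
    if system in ("system_1", "system_2"):
        text = text.zfill(8)  # a numeric column drops a leading zero
        return date(int(text[4:8]), int(text[0:2]), int(text[2:4]))
    if system == "system_3":
        return date(int(text[5:9]), MONTHS[text[2:5]], int(text[0:2]))
    raise ValueError(f"unknown source system: {system}")


def format_for_target(value):
    """Render the date as DDMMYYYY, the layout assumed for the target."""
    return value.strftime("%d%m%Y")
```

A test for this mapping would feed each source value through `to_warehouse_date` and confirm
that all three systems produce the same target date.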
A warehouse is a relational database that is designed for query and analysis rather than
transaction processing. A DW usually contains historical data that is derived from transaction
data. It separates analysis workload from transaction workload and enables a business to
consolidate data from several sources.
In addition to a relational database, a data warehouse environment often consists of an
ETL solution, an OLAP engine, client analysis tools, and other applications that manage the
process of gathering data and delivering it to business users.
There are three types of data warehouses:
Enterprise Data Warehouse - An enterprise data warehouse provides a central database for
decision support throughout the enterprise.
ODS (Operational Data Store) - This has a broad enterprise-wide scope, but unlike the real
enterprise data warehouse, data is refreshed in near real time and used for routine
business activity. One of the typical applications of the ODS is to hold the recent data
before migration to the Data Warehouse. Typically, ODSs are not conceptually equivalent
to the Data Warehouse, although they do store data that has a deeper level of history
than the OLTP data.
Data Mart - A data mart is a subset of a data warehouse and it supports a particular region,
business unit or business function.
The DW Testing Life Cycle:
As with any other piece of software, a DW implementation undergoes the natural cycle of
Unit testing, System testing, Regression testing, Integration testing and Acceptance testing.
However, unlike others, there are no off-the-shelf testing products available for a DW.
Unit testing:
Traditionally this has been the task of the developer. This is white-box testing to ensure
the module or component is coded as per agreed upon design specifications. The developer
should focus on the following:
a) All inbound and outbound directory structures are created properly with appropriate
permissions and sufficient disk space. All tables used during the ETL
are present with necessary privileges.
b) The ETL routines give expected results:
   All transformation logic works as designed from source to target
   Boundary conditions are satisfied; e.g. check for date fields with leap year dates
   Surrogate keys have been generated properly
   NULL values have been populated where expected
   Rejects have occurred where expected and a log for rejects is created with sufficient details
   Error recovery methods
c) That the data loaded into the target is complete:
   All source data that is expected to get loaded into the target actually gets loaded;
   compare counts between source and target and use data profiling tools
   All fields are loaded with full contents; i.e. no data field is truncated
   No duplicates are loaded
   Aggregations take place in the target properly
   Data integrity constraints are properly taken care of
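Several of the completeness checks in c) can be scripted. The sketch below is one possible
way to do this with Python's built-in sqlite3 module; the function name `run_unit_checks` and
the idea of passing table and column names as parameters are illustrative assumptions, and a
real warehouse would use its own database driver.

```python
import sqlite3


def run_unit_checks(conn, source, target, key, not_null_cols):
    """Return simple completeness results for one source/target table pair.

    Table and column names are interpolated directly, so they must come
    from trusted test configuration, never from user input.
    """
    cur = conn.cursor()
    src_count = cur.execute(f"SELECT COUNT(*) FROM {source}").fetchone()[0]
    tgt_count = cur.execute(f"SELECT COUNT(*) FROM {target}").fetchone()[0]
    # Number of key values that occur more than once in the target.
    duplicates = cur.execute(
        f"SELECT COUNT(*) FROM (SELECT {key} FROM {target} "
        f"GROUP BY {key} HAVING COUNT(*) > 1)"
    ).fetchone()[0]
    # NULLs in columns where the mapping says a value is always expected.
    null_counts = {
        col: cur.execute(
            f"SELECT COUNT(*) FROM {target} WHERE {col} IS NULL"
        ).fetchone()[0]
        for col in not_null_cols
    }
    return {
        "counts_match": src_count == tgt_count,
        "duplicate_keys": duplicates,
        "unexpected_nulls": null_counts,
    }
```

Matching counts alone do not prove completeness: a duplicated row can mask a missing one,
which is why the duplicate and NULL checks are reported separately.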
System testing:
Generally the QA team owns this responsibility. For them the design document is the
bible and the entire set of test cases is directly based upon it. Here we test for the
functionality of the application and mostly it is black-box. The major challenge here is
preparation of test data. An intelligently designed input dataset can bring out the flaws in
the application more quickly. Wherever possible use production-like data. You may also use
data generation tools or customized tools of your own to create test data. We must test for
all possible combinations of input and specifically check out the errors and exceptions. An
unbiased approach is required to ensure maximum efficiency. Knowledge of the business process
is an added advantage since we must be able to interpret the results functionally and not
just code-wise.
The QA team must test for:

Data completeness and correctness: match source to target counts and validate the data.

Data aggregations: match aggregated data against staging tables and/or ODS.

Lookups/Transformations are applied correctly as per specifications.

Granularity of data is as per specifications.

Error logs and audit tables are generated and populated properly.

Notifications to IT and/or business are generated in proper format.

ETL Data Validation Components to be considered:

Organizations typically have "dirty data" that must be cleansed or scrubbed before
being loaded into the data warehouse. In an ideal world, there would not be dirty data. The data
in operational systems would be clean. Unfortunately, this is virtually never the case. The data in
these source systems is the result of poor data quality practices and little can be done about the
data that is already there. While organizations should move toward improving data quality at the
source system level, nearly all data warehousing initiatives must cope with dirty data, at least in
the short term. There are many reasons for dirty data, including:
Dummy values. Inappropriate values have been entered into fields. For example, a customer
service representative, in a hurry and not perceiving entering correct data as being
particularly important, might enter the store's ZIP code rather than the customer's ZIP, or
enter 999-99-9999 whenever an SSN is unknown. The operational system accepts the input,
but it is not correct.
Absence of data. Data was not entered for certain fields. This is not always attributable to
lazy data entry habits and the lack of edit checks, but to the fact that different business units
may have different needs for certain data values in order to run their operations. For
example, the department that originates mortgage loans may have a federal reporting
requirement to capture the sex and ethnicity of a customer, whereas the department that
originates consumer loans does not.
Multipurpose fields. A field is used for multiple purposes; consequently, it does not
consistently store the same thing. This can happen with packaged applications that include
fields that are not required to run the application. Different departments may use the "extra"
fields for their own purposes, and as a result, what is stored in the fields is not consistent.
Cryptic data. It is not clear what data is stored in a field. The documentation is poor and the
attribute name provides little help in understanding the field's content. The field may be
derived from other fields or the field may have been used for different purposes over the
years.
Contradicting data. The data should be the same but it isn't. For example, a customer may
have different addresses in different source systems.
Inappropriate use of address lines. Data has been incorrectly entered into address lines.
Address lines are commonly broken down into, for example, Line 1 for first, middle, and last
name, Line 2 for street address, Line 3 for apartment number, and so on. Data is not always
entered into the correct line, which makes it difficult to parse the data for later use.
Violation of business rules. Some of the values stored in a field are inconsistent with
business reality. For example, a source system may have recorded an adjustable rate
mortgage loan where the value of the minimum interest rate is higher than the value of the
maximum interest rate.
Reused primary keys. A primary key is not unique; it is used with multiple occurrences.
There are many ways that this problem can occur. For example, assume that a branch bank
has a unique identifier (i.e., a primary key). The branch is closed and the primary key is no
longer in use. But two years later, a new branch is opened, and the old identifier is reused.
The primary key is the same for the old and the new branch.
Non-unique identifiers. An item of interest, such as a customer, has been assigned multiple
identifiers. For example, in the health care field, it is common for health care providers to
assign their own identifier to patients. This makes it difficult to integrate patient records to
provide a comprehensive understanding of a patient's health care history.
Data integration problems. The data is difficult or impossible to integrate. This can be due
to non-unique identifiers, or the absence of an appropriate primary key. To illustrate, for
decades customers have been associated with their accounts through a customer name field
on the account record. Integrating multiple customer accounts in this situation can be
difficult. When we examine all the account records that belong to one customer, we find
different spellings or abbreviations of the same customer name, sometimes the customer is
recorded under an alias or a maiden name, and occasionally two or three customers have a
joint account and all of their names are squeezed into one name field.
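Some of the problems listed above can be detected mechanically before loading. The following
Python sketch flags four of them (dummy values, absent data, a business rule violation and a
reused primary key) in a list of loan records. The field names `loan_id`, `ssn`, `min_rate`
and `max_rate` are hypothetical and chosen only to mirror the examples in the text.

```python
from collections import Counter

# Placeholder values that operators are known to type when data is unknown.
DUMMY_SSNS = {"999-99-9999"}


def find_dirty_records(records):
    """Return (loan_id, problem) pairs for a list of dict records."""
    problems = []
    key_counts = Counter(r["loan_id"] for r in records)
    for r in records:
        ssn = r.get("ssn")
        if ssn in DUMMY_SSNS:
            problems.append((r["loan_id"], "dummy value"))
        elif not ssn:
            problems.append((r["loan_id"], "absent value"))
        low, high = r.get("min_rate"), r.get("max_rate")
        if low is not None and high is not None and low > high:
            problems.append((r["loan_id"], "business rule violation"))
        if key_counts[r["loan_id"]] > 1:
            problems.append((r["loan_id"], "reused primary key"))
    return problems
```

Cryptic data, multipurpose fields and contradicting data are much harder to detect this way,
since they need knowledge of what the field is supposed to mean.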
There are several alternatives for cleansing dirty data. One option is to rely on the basic
cleansing capabilities of ETL software. Another option is to custom-write data cleansing
routines. The final alternative is to use special-purpose data cleansing software. Regardless
of the alternative selected, the basic process is the same.

The first step is to parse the individual data elements that are extracted from the source
systems (Lyon, 1998). For example, a customer record might be broken down into first name,
middle name, last name, title, firm, street number, street, city, state, and ZIP code.
Data algorithms (possibly based on AI techniques) and secondary, external data sources
(such as US Census data) are then used to correct and enhance the parsed data. For example, a
vanity address (like Lake Calumet) is replaced with the "real" address (Chicago) and the plus
four digits are added to the ZIP code.
Next, the parsed data is standardized. Using both standard and custom business rules, the
data is transformed into its preferred and consistent format. For example, a prename may be
added (e.g., Ms., Dr.), first name match standards may be identified (e.g., Beth may be Elizabeth,
Bethany, or Bethel), and a standard street name may be applied (e.g., South Butler Drive may be
transformed to S. Butler Dr.).

The parsed, corrected, and standardized data is then scanned to match records. The
matching may be based on simple business rules, such as whether the name and address are the
same, or AI based methods that utilize sophisticated pattern recognition techniques.
Matched records are then consolidated. The consolidated records integrate the data from the
different sources and reflect the standards that have been applied. For example, source system
number one may not contain phone numbers but source system number two does. The
consolidated record contains the phone number. The consolidated record also contains the
applied standards, such as recording Ms. Elizabeth James as the person's name, with the
appropriate pre-name applied.
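The parse, standardize, match and consolidate steps described above can be sketched in a few
lines of Python. This is only a toy under stated assumptions: the pipe-delimited input layout,
the two small lookup tables and all function names are invented for illustration, and the
name standard maps Beth to a single value even though, as noted above, it could stand for
several names.

```python
# Hypothetical standardization tables; real ones would be far larger.
STREET_STANDARDS = {"south": "S.", "north": "N.", "drive": "Dr.", "street": "St."}
FIRST_NAME_STANDARDS = {"beth": "Elizabeth"}


def parse(raw):
    """Split 'First Last|Street address|Phone' into named elements."""
    name, street, phone = (part.strip() for part in raw.split("|"))
    first, last = name.split(" ", 1)
    return {"first": first, "last": last, "street": street, "phone": phone or None}


def standardize(rec):
    """Apply the name and street standards to a parsed record."""
    out = dict(rec)
    out["first"] = FIRST_NAME_STANDARDS.get(rec["first"].lower(), rec["first"])
    words = [STREET_STANDARDS.get(w.lower(), w) for w in rec["street"].split()]
    out["street"] = " ".join(words)
    return out


def match_key(rec):
    """Simple business rule: same name and address means same person."""
    return (rec["first"].lower(), rec["last"].lower(), rec["street"].lower())


def consolidate(records):
    """Merge matched records, keeping the first non-empty value per field."""
    merged = {}
    for rec in records:
        target = merged.setdefault(match_key(rec), dict(rec))
        for field, value in rec.items():
            if target.get(field) is None and value is not None:
                target[field] = value
    return list(merged.values())
```

For testing purposes, the useful property is that each step can be checked on its own: a
tester can assert the output of `standardize` before looking at how records were matched.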
Once the data is cleaned, transformed, and integrated, it is ready for loading into the
warehouse. The first loading provides the initial data for the warehouse. Subsequent loadings
can be done in one of two ways. One alternative is to bulk load the warehouse every time. With
this approach, all of the data (i.e., the old and the new) is loaded each time. This approach
requires simple processing logic but becomes impractical as the volume of data increases. The
more common approach is to refresh the warehouse with only newly generated data.
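The difference between the two loading approaches can be shown with a minimal Python sketch.
It assumes, purely for illustration, that the warehouse is a list, that every source row is a
dict with a comparable `created_at` value, and that the time of the last load (a "watermark")
is stored between runs.

```python
def bulk_load(warehouse, source_rows):
    """Reload everything: simple logic, but cost grows with total volume."""
    warehouse.clear()
    warehouse.extend(source_rows)


def incremental_load(warehouse, source_rows, last_loaded_at):
    """Append only rows created after the previous load.

    Returns the new watermark to store for the next run.
    """
    new_rows = [r for r in source_rows if r["created_at"] > last_loaded_at]
    warehouse.extend(new_rows)
    return max((r["created_at"] for r in new_rows), default=last_loaded_at)
```

A useful test for the incremental path is to run it twice with the same watermark logic and
confirm that the second run adds nothing.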
Another issue that must be addressed is how frequently to load the warehouse. Factors that
affect this decision include the business need for the data and the business cycle that provides the
data. For example, users of the warehouse may need daily, weekly, or monthly updates,
depending on their use of the data. Most business processes have a natural business cycle that
generates data that can be loaded into the warehouse at various points in the cycle. For example,
a company's payroll is typically run on a weekly basis. Consequently, data from the payroll
application is loaded to the warehouse on a weekly basis.
The trend is for continuous updating of the data warehouse. This approach is sometimes referred
to as "trickle" loading of the warehouse. There are several factors that are causing this near
real-time updating of the warehouse. As data warehouses are increasingly being used to support
operational processes, having current data is important. Also, when trading partners are given
access to warehouse data, the expectation is that the data is up-to-date. Finally, many firms
operate on a global basis and there is not a good time to load the warehouse. Users around the
world need access to the warehouse on a 24/7 basis. A long "load window" is not acceptable.
Regression testing:
A DW application is not a one-time solution. Possibly it is the best example of an
incremental design where requirements are enhanced and refined quite often based on business
needs and feedback. In such a situation it is very critical to test that the existing
functionality of a DW application is not broken whenever an enhancement is made to it.
Generally this is done by running all functional tests for existing code whenever a new piece
of code is introduced. However, a better strategy could be to preserve earlier test input data
and result sets and run the same tests again. The new results can then be compared against the
older ones to ensure proper functionality.
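The comparison of new results against preserved result sets could look like the following
Python sketch. It assumes each result set is a list of dict rows and that one field (named
by the `key` parameter) identifies a row; both are illustrative assumptions.

```python
def compare_to_baseline(baseline_rows, new_rows, key):
    """Compare a new result set with a preserved one, row by row."""
    old = {row[key]: row for row in baseline_rows}
    new = {row[key]: row for row in new_rows}
    return {
        # Rows that existed before the change but are gone now.
        "missing": sorted(old.keys() - new.keys()),
        # Rows that did not exist in the preserved result set.
        "unexpected": sorted(new.keys() - old.keys()),
        # Rows present in both whose contents differ.
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

An empty result in all three lists means the enhancement did not alter the existing output
for the preserved input data.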
Integration testing:
This is done to ensure that the application developed works from an end-to-end perspective.
Here we must consider the compatibility of the DW application with upstream and downstream
flows. We need to ensure data integrity across the flow. Our test strategy should include
testing for:
Sequence of jobs to be executed with job dependencies and scheduling
Re-startability of jobs in case of failures
Generation of error logs
Cleanup scripts for the environment including database
This activity is a combined responsibility, and participation of experts from all related
applications is a must in order to avoid misinterpretation of results.
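Job sequencing and re-startability can be exercised with a small harness such as the Python
sketch below. It uses the standard library's graphlib module (Python 3.9 or later) to derive
an execution order from job dependencies. The function `run_jobs`, the job names and the idea
of passing in the list of already completed jobs are assumptions made for this example, not
features of any scheduler.

```python
from graphlib import TopologicalSorter  # available from Python 3.9


def run_jobs(dependencies, actions, completed=None):
    """Run jobs in dependency order, skipping ones already completed.

    dependencies maps job name -> set of jobs it depends on.
    actions maps job name -> callable; an exception marks a failure.
    Returns (completed jobs in order, failed job or None).
    """
    done = list(completed or [])
    for job in TopologicalSorter(dependencies).static_order():
        if job in done:
            continue  # restart: do not repeat work that already succeeded
        try:
            actions[job]()
        except Exception:
            return done, job  # stop at the first failure
        done.append(job)
    return done, None
```

A restart test forces one job to fail, fixes the cause, and calls `run_jobs` again with the
previously completed list to confirm that only the remaining jobs are executed.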
Acceptance testing:
This is the most critical part because here the actual users validate your output datasets.
They are the best judges to ensure that the application works as expected by them. However,
business users may not have proper ETL knowledge. Hence, the development and test team
should be ready to provide answers regarding the ETL process that relate to data population.
The test team must have sufficient business knowledge to translate the results in terms of
business. Also the load window, the refresh period for the DW and the views created should be
signed off by the users.
Performance testing:
In addition to the above tests a DW must necessarily go through another phase called
performance testing. Any DW application is designed to be scalable and robust. Therefore,
when it goes into the production environment, it should not cause performance problems. Here, we
must test the system with huge volumes of data. We must ensure that the load window is met even
under such volumes. This phase should involve the DBA team, ETL experts and others who can
review and validate your code for optimization.
Finally, a few words of caution to end with. Testing a DW application should be done
with a sense of utmost responsibility. A bug in a DW traced at a later stage results in
unpredictable losses. And the task is even more difficult in the absence of any single end-to-end
testing tool. So the strategies for testing should be methodically developed, refined and
streamlined. This is also true since the requirements of a DW are often dynamically changing.
Under such circumstances repeated discussions with the development team and users are of utmost
importance to the test team. Another area of concern is test coverage. This has to be reviewed
multiple times to ensure completeness of testing. Always remember, a DW tester must go the
extra mile to ensure near defect-free solutions.
Example for Data Validation:
Testing Levels: There are several levels of testing that can be performed during data warehouse
testing. Some examples: constraint testing, source to target counts, source to target data
validation, and error processing. The level of testing to be performed should be defined as
part of the testing strategy.
Constraints: During constraint testing, the objective is to validate unique constraints, primary
keys, foreign keys, indexes, and relationships. The test script should include these validation
points. Some ETL processes can be developed to validate constraints during the loading of the
warehouse. If the decision is made to add constraint validation to the ETL process, the ETL code
must validate all business rules and relational data requirements. Depending solely on the
automation of constraint testing is risky. When the setup is not done correctly or maintained
throughout the ever changing requirements process, the validation could become incorrect and
will nullify the tests.
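Because relying only on constraint validation inside the ETL is risky, an independent check
after the load is useful. The Python sketch below uses the built-in sqlite3 module and two
hypothetical tables, customer(customer_id) and account(account_id, customer_id), to count
duplicate primary keys and foreign keys that point at no parent row. The table layout is an
assumption for illustration only.

```python
import sqlite3


def constraint_violations(conn):
    """Count duplicate primary keys and orphaned foreign keys.

    Assumes the tables were loaded without enforced constraints, which is
    common for bulk loads, so the checks must be run as queries.
    """
    duplicate_keys = conn.execute(
        "SELECT COUNT(*) FROM (SELECT customer_id FROM customer "
        "GROUP BY customer_id HAVING COUNT(*) > 1)"
    ).fetchone()[0]
    orphans = conn.execute(
        "SELECT COUNT(*) FROM account a "
        "LEFT JOIN customer c ON c.customer_id = a.customer_id "
        "WHERE c.customer_id IS NULL"
    ).fetchone()[0]
    return {
        "duplicate_primary_keys": duplicate_keys,
        "orphaned_foreign_keys": orphans,
    }
```

Both numbers should be zero after a correct load; any other value points at a failed or
missing constraint rule.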
Counts: The objective of the count test scripts is to determine if the record counts in the
source match the record counts in the target. Some ETL processes are capable of capturing
record count information such as records read, records written, records in error, etc. If the
ETL process being used can capture that level of detail and create a list of the counts, allow
it to do so. This will save time during the validation process.
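When the ETL process does report its own counts, they can be reconciled against counts taken
independently from the source and target. The sketch below assumes the ETL statistics arrive
as a dict with the keys `read`, `written` and `errors`; that shape is hypothetical and will
differ from tool to tool.

```python
def reconcile_counts(stats, source_count, target_count):
    """Check ETL-reported counts against independently measured counts.

    Returns a list of human-readable issues; an empty list means the
    counts are consistent.
    """
    issues = []
    if stats["read"] != source_count:
        issues.append("records read differs from source count")
    if stats["written"] != target_count:
        issues.append("records written differs from target count")
    # Every record read must be either written or rejected.
    if stats["read"] != stats["written"] + stats["errors"]:
        issues.append("read != written + errors")
    return issues
```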
Source to Target: No ETL process is smart enough to perform source to target field-to-field
validation. This piece of the testing cycle is the most labor intensive and requires the most
thorough analysis of the data. There are a variety of tests that can be performed during source to
target validation. Below is a list of tests that are best practices:
Threshold testing: Expose any truncation that may be occurring during the transformation or
loading of data.
For example:
Source: table1.field1 (VARCHAR 40)
Stage: table2.field5 (VARCHAR 25)
Target: table3.field2 (VARCHAR 40)
In this example the source field has a threshold of 40, the stage field has a threshold of 25 and
the target mapping has a threshold of 40. The last 15 characters will be truncated during the ETL
process of the stage table. Any data that was stored in positions 26-40 will be lost during the
move from source to staging.
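A threshold test of this kind can be automated from the column definitions alone. The Python
sketch below takes the ordered list of fields a value passes through, with their maximum
lengths, and reports every hop where characters would be lost. The function names and the
list-of-pairs input are assumptions for illustration.

```python
def find_truncation_risks(path):
    """Flag each hop where the next field is narrower than the previous one.

    'path' is an ordered list of (field name, maximum length) pairs.
    Returns (from field, to field, characters lost) tuples.
    """
    risks = []
    for (src_name, src_len), (dst_name, dst_len) in zip(path, path[1:]):
        if dst_len < src_len:
            risks.append((src_name, dst_name, src_len - dst_len))
    return risks


def pass_through(value, path):
    """Simulate what survives when a value is cut to each field's length."""
    for _, max_len in path:
        value = value[:max_len]
    return value
```

Feeding a test value of the full source length through `pass_through` shows that the wider
target field does not bring the lost characters back.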
Field to Field: In field-to-field testing, ask whether a constant value is being populated
during the ETL process. It should not be unless it is documented in the requirements and
subsequently documented in the test scripts. Do the values in the source fields
match the values in the respective target fields? Below are two additional field-to-field tests that
should occur.
Initialization: During the ETL process, if the code does not re-initialize the cursor (or working
storage) after each record, there is a chance that fields with null values may contain data from a
previous record.
For example:
Record 125: Source field1 = Red     Target field1 = Red
Record 126: Source field1 = null    Target field1 = Red
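The initialization defect is easy to reproduce, which helps when designing test data for it.
In the Python sketch below, `load_without_reset` imitates a loader that reuses its working
storage, `load_with_reset` clears it for every record, and `find_carryover` is a check a
tester could run on the loaded data. All three functions are hypothetical illustrations, not
code from a real ETL tool.

```python
def load_without_reset(source_rows):
    """Buggy loader: the working record is reused and never cleared."""
    target, work = [], {}
    for row in source_rows:
        for field, value in row.items():
            if value is not None:
                work[field] = value  # a null leaves the old value in place
        target.append(dict(work))
    return target


def load_with_reset(source_rows):
    """Fixed loader: working storage is re-initialized for every record."""
    target = []
    for row in source_rows:
        work = {field: None for field in row}
        for field, value in row.items():
            if value is not None:
                work[field] = value
        target.append(work)
    return target


def find_carryover(source_rows, target_rows):
    """Return row indexes where the source is null but the target is not."""
    return [
        i for i, (src, tgt) in enumerate(zip(source_rows, target_rows))
        if any(src[f] is None and tgt.get(f) is not None for f in src)
    ]
```

The test data must place a null directly after a populated value in the same field, as in
the two records above, or the defect will not show up.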

Abbreviations:
1. DW: Data Warehouse
2. QA: Quality Assurance
3. ETL: Extraction, Transformation and Loading
4. ODS: Operational Data Store

References:
1. Data Warehousing: Soumendra Mohanty
2. Strategies for testing data warehouse applications: Jeff Theobald, DW review Magazine, June
2007 issue
3. The Data Warehouse Toolkit: Ralph Kimball
