
Data Extraction, Transformation, and Loading Techniques

Introduction
During the ETL process, data is extracted from an OLTP database, transformed to match the data warehouse schema, and loaded into the data warehouse database. Many data warehouses also incorporate data from non-OLTP systems, such as text files, legacy systems, and spreadsheets; such data also requires extraction, transformation, and loading.

In its simplest form, ETL is the process of copying data from one database to another. This simplicity is rarely, if ever, found in data warehouse implementations; in reality, ETL is often a complex combination of process and technology that consumes a significant portion of the data warehouse development effort and requires the skills of business analysts, database designers, and application developers.

When defining ETL for a data warehouse, it is important to think of ETL as a process, not a physical implementation. ETL systems vary from data warehouse to data warehouse and even between departmental data marts within a data warehouse. A monolithic application, regardless of whether it is implemented in Transact-SQL or a traditional programming language, does not provide the flexibility for change necessary in ETL systems. A mixture of tools and technologies should be used to develop applications that each perform a specific ETL task.

The ETL process is not a one-time event; new data is added to a data warehouse periodically. Typical periodicity may be monthly, weekly, daily, or even hourly, depending on the purpose of the data warehouse and the type of business it serves. Because ETL is an integral, ongoing, and recurring part of a data warehouse, ETL processes must be automated and operational procedures documented. ETL also changes and evolves as the data warehouse evolves, so ETL processes must be designed for ease of modification. A solid, well-designed, and documented ETL system is necessary for the success of a data warehouse project.

Data warehouses evolve to improve their service to the business and to adapt to changes in business processes and requirements. Business rules change as the business reacts to market influences; the data warehouse must respond in order to maintain its value as a tool for decision makers. The ETL implementation must adapt as the data warehouse evolves.

Microsoft® SQL Server™ 2000 provides significant enhancements to existing performance and capabilities, and introduces new features that make the development, deployment, and maintenance of ETL processes easier, simpler, and faster.
ETL Functional Elements
Regardless of how they are implemented, all ETL systems have a common purpose: they move data from one database to another. Generally, ETL systems move data from OLTP systems to a data warehouse, but they can also be used to move data from one data warehouse to another. An ETL system consists of four distinct functional elements:
• Extraction
• Transformation
• Loading
• Meta data
Extraction
The ETL extraction element is responsible for extracting data from the source system. During extraction, data may be removed from the source system or a copy made and the original data retained in the source system. It is common to move historical data that accumulates in an operational OLTP system to a data warehouse to maintain OLTP performance and efficiency. Legacy systems may require too much effort to implement such offload processes, so legacy data is often copied into the data warehouse, leaving the original data in place. Extracted data is loaded into the data warehouse staging area (a relational database usually separate from the data warehouse database), for manipulation by the remaining ETL processes.

Data extraction is generally performed within the source system itself, especially if it is a relational database to which extraction procedures can easily be added. It is also possible for the extraction logic to exist in the data warehouse staging area and query the source system for data using ODBC, OLE DB, or other APIs. For legacy systems, the most common method of data extraction is for the legacy system to produce text files, although many newer systems offer direct query APIs or accommodate access through ODBC or OLE DB.

Data extraction processes can be implemented using Transact-SQL stored procedures, Data Transformation Services (DTS) tasks, or custom applications developed in programming or scripting languages.
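For example, here is a minimal Transact-SQL sketch of a pull-style extraction run from the staging database; the linked server name SourceOLTP is an assumption for illustration, and the target table is defined later in this chapter:

-- Assumes a linked server named SourceOLTP was registered with sp_addlinkedserver.
-- Pull the source rows into the staging area's temporary table.
INSERT INTO Authors_Temp
SELECT *
FROM OPENQUERY(SourceOLTP, 'SELECT * FROM pubs.dbo.authors')
GO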
Transformation
The ETL transformation element is responsible for data validation, data accuracy, data type conversion, and business rule application. It is the most complicated of the ETL elements. It may appear to be more efficient to perform some transformations as the data is being extracted (inline transformation); however, an ETL system that uses inline transformations during extraction is less robust and flexible than one that confines transformations to the transformation element. Transformations performed in the OLTP system impose a performance burden on the OLTP database. They also split the transformation logic between two ETL elements and add maintenance complexity when the ETL logic changes.

Tools used in the transformation element vary. Some data validation and data accuracy checking can be accomplished with straightforward Transact-SQL code. More complicated transformations can be implemented using DTS packages. The application of complex business rules often requires the development of sophisticated custom applications in various programming languages. You can use DTS packages to encapsulate multistep transformations into a single task.

Listed below are some basic examples that illustrate the types of transformations performed by this element:
Data Validation
Check that all rows in the fact table match rows in dimension tables to enforce data integrity.

Data Accuracy
Ensure that fields contain appropriate values, such as only "off" or "on" in a status field.

Data Type Conversion
Ensure that all values for a specified field are stored the same way in the data warehouse regardless of how they were stored in the source system. For example, if one source system stores "off" or "on" in its status field and another source system stores "0" or "1" in its status field, then a data type conversion transformation converts the content of one or both of the fields to a specified common value such as "off" or "on".

Business Rule Application
Ensure that the rules of the business are enforced on the data stored in the warehouse. For example, check that all customer records contain values for both FirstName and LastName fields.
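As an illustration, the following Transact-SQL sketch applies two of these transformation types to a hypothetical staging table named Status_Staging; the table and column names are assumptions, not part of the examples later in this chapter:

-- Data type conversion: map the source encodings "1" and "0"
-- to the common values "on" and "off".
UPDATE Status_Staging
SET status = CASE status WHEN '1' THEN 'on' WHEN '0' THEN 'off' ELSE status END
GO

-- Data accuracy: report any rows whose status is still neither "on" nor "off".
SELECT *
FROM Status_Staging
WHERE status NOT IN ('on', 'off')
GO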
Loading
The ETL loading element is responsible for loading transformed data into the data warehouse database. Data warehouses are usually updated periodically rather than continuously, and large numbers of records are often loaded to multiple tables in a single data load. The data warehouse is often taken offline during update operations so that data can be loaded faster and SQL Server 2000 Analysis Services can update OLAP cubes to incorporate the new data. BULK INSERT, bcp, and the Bulk Copy API are the best tools for data loading operations. The design of the loading element should focus on efficiency and performance to minimize the data warehouse offline time. For more information and details about performance tuning, see Chapter 20, "RDBMS Performance Tuning Guide for Data Warehousing."
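For example, a minimal BULK INSERT sketch; the file path, and the assumption that the file was exported in native format (for example, with bcp -n), are for illustration only:

-- Bulk load a native-format data file into the fact table in one operation.
BULK INSERT Fact_DW
FROM 'C:\etl\fact_load.dat'
WITH (DATAFILETYPE = 'native', TABLOCK)
GO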
"eta Data
The ETL meta data functional element is responsible for maintaining information 5meta
data6 about the mo$ement and transformation of data, and the operation of the data
warehouse. #t also documents the data mappings used during the transformations. Meta
data logging pro$ides possibilities for automated administration, trend prediction, and
code reuse.
Examples of data warehouse meta data that can be recorded and used to analy?e the
acti$ity and performance of a data warehouse include2
4 Data Lineae, such as the time that a particular set of records was loaded into the
data warehouse.
4 #c$ema C$anes, such as changes to table definitions.
4 Data Type %sae, such as identifying all tables that use the :*irthdate: userdefined
data type.
4 Transformation #tatistics, such as the execution time of each stage of a
transformation, the number of rows processed by the transformation, the last time the
transformation was executed, and so on.
4 DT# &ac'ae Versionin, which can be used to $iew, branch, or retrie$e any
historical $ersion of a particular DT( pac%age.
4 Data (are$ouse %sae #tatistics, such as "uery times for reports.
ETL Design Considerations
Regardless of their implementation, a number of design considerations are common to all ETL systems:

Modularity
ETL systems should contain modular elements that perform discrete tasks. This encourages reuse and makes them easy to modify when implementing changes in response to business and data warehouse changes. Monolithic systems should be avoided.

Consistency
ETL systems should guarantee consistency of data when it is loaded into the data warehouse. An entire data load should be treated as a single logical transaction; either the entire data load is successful or the entire load is rolled back. In some systems, the load is a single physical transaction, whereas in others it is a series of transactions. Regardless of the physical implementation, the data load should be treated as a single logical transaction.
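The following sketch shows one way to treat a load as a single logical transaction in Transact-SQL, using the @@ERROR pattern of this era; the table names are borrowed from the examples later in this chapter:

-- Load dimension and fact rows as one logical transaction:
-- if either statement fails, roll back the entire load.
BEGIN TRANSACTION
INSERT INTO Authors_DW SELECT * FROM Authors_Staging
IF @@ERROR <> 0 GOTO LoadError
INSERT INTO Fact_DW (ord_num, ord_date, qty, payterms, Store_Key, Title_Key)
SELECT ord_num, ord_date, qty, payterms, Store_Key, Title_Key FROM Fact_Staging
IF @@ERROR <> 0 GOTO LoadError
COMMIT TRANSACTION
RETURN

LoadError:
ROLLBACK TRANSACTION
GO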
Flexibility
ETL systems should be developed to meet the needs of the data warehouse and to accommodate the source data environments. It may be appropriate to accomplish some transformations in text files and some on the source data system; others may require the development of custom applications. A variety of technologies and techniques can be applied, using the tool most appropriate to the individual task of each ETL functional element.

Speed
ETL systems should be as fast as possible. Ultimately, the time window available for ETL processing is governed by data warehouse and source system schedules. Some data warehouse elements may have a huge processing window (days), while others may have a very limited processing window (hours). Regardless of the time available, it is important that the ETL system execute as rapidly as possible.

Heterogeneity
ETL systems should be able to work with a wide variety of data in different formats. An ETL system that only works with a single type of source data is useless.

Meta Data Management
ETL systems are arguably the single most important source of meta data about both the data in the data warehouse and the data in the source systems. In addition, the ETL process itself generates useful meta data that should be retained and analyzed regularly. Meta data is discussed in greater detail later in this chapter.
ETL Architectures
Before discussing the physical implementation of ETL systems, it is important to understand the different ETL architectures and how they relate to each other. Essentially, ETL systems can be classified in two architectures: the homogenous architecture and the heterogeneous architecture.

Homogenous Architecture
A homogenous architecture for an ETL system is one that involves only a single source system and a single target system. Data flows from the single source of data through the ETL processes and is loaded into the data warehouse, as shown in the following diagram.

[Diagram: data flows from a single source system through the ETL processes into the data warehouse.]
Most homogenous ETL architectures have the following characteristics:
• Single data source: Data is extracted from a single source system, such as an OLTP system.
• Rapid development: The development effort required to extract the data is straightforward because there is only one data format for each record type.
• Light data transformation: No data transformations are required to achieve consistency among disparate data formats, and the incoming data is often in a format usable in the data warehouse. Transformations in this architecture typically involve replacing NULLs and other formatting transformations.
• Light structural transformation: Because the data comes from a single source, the amount of structural change, such as table alteration, is also very light. The structural changes typically involve denormalization efforts to meet data warehouse schema requirements.
• Simple research requirements: The research efforts to locate data are generally simple: if the data is in the source system, it can be used. If it is not, it cannot.

The homogenous ETL architecture is generally applicable to data marts, especially those focused on a single subject matter.
Heterogeneous Architecture
A heterogeneous architecture for an ETL system is one that extracts data from multiple sources, as shown in the following diagram. The complexity of this architecture arises from the fact that data from more than one source must be merged, rather than from the fact that data may be formatted differently in the different sources. However, significantly different storage formats and database schemas do provide additional complications.

[Diagram: data from multiple source systems flows through the ETL processes into the data warehouse.]
Most heterogeneous ETL architectures have the following characteristics:
• Multiple data sources.
• More complex development: The development effort required to extract the data is increased because there are multiple source data formats for each record type.
• Significant data transformation: Data transformations are required to achieve consistency among disparate data formats, and the incoming data is often not in a format usable in the data warehouse. Transformations in this architecture typically involve replacing NULLs, additional data formatting, data conversions, lookups, computations, and referential integrity verification. Precomputed calculations may require combining data from multiple sources, or data that has multiple degrees of granularity, such as allocating shipping costs to individual line items.
• Significant structural transformation: Because the data comes from multiple sources, the amount of structural change, such as table alteration, is significant.
• Substantial research requirements to identify and match data elements.

Heterogeneous ETL architectures are found more often in data warehouses than in data marts.
ETL Development
ETL development consists of two general phases: identifying and mapping data, and developing functional element implementations. Both phases should be carefully documented and stored in a central, easily accessible location, preferably in electronic form.
Identify and Map Data
This phase of the development process identifies the sources of data elements, the targets for those data elements in the data warehouse, and the transformations that must be applied to each data element as it is migrated from its source to its destination. High-level data maps should be developed during the requirements gathering and data modeling phases of the data warehouse project. During the ETL system design and development process, these high-level data maps are extended to thoroughly specify system details.
Identify Source Data
For some systems, identifying the source data may be as simple as identifying the server where the data is stored in an OLTP database and the storage type (SQL Server database, Microsoft Excel spreadsheet, or text file, among others). In other systems, identifying the source may mean preparing a detailed definition of the meaning of the data, such as a business rule, a definition of the data itself, such as decoding rules (O = On, for example), or even detailed documentation of a source system for which the system documentation has been lost or is not current.
Identify Target Data
Each data element is destined for a target in the data warehouse. A target for a data element may be an attribute in a dimension table, a numeric measure in a fact table, or a summarized total in an aggregation table. There may not be a one-to-one correspondence between a source data element and a data element in the data warehouse, because the destination system may not contain the data at the same granularity as the source system. For example, a retail client may decide to roll data up to the SKU level by day rather than track individual line item data. The level of item detail that is stored in the fact table of the data warehouse is called the grain of the data. If the grain of the target does not match the grain of the source, the data must be summarized as it moves from the source to the target.

Map Source Data to Target Data
A data map defines the source fields of the data, the destination fields in the data warehouse, and any data modifications that need to be accomplished to transform the data into the desired format for the data warehouse. Some transformations require aggregating the source data to a coarser granularity, such as summarizing individual item sales into daily sales by SKU. Other transformations involve altering the source data itself as it moves from the source to the target. Some transformations decode data into human-readable form, such as replacing "1" with "on" and "0" with "off" in a status field. If two source systems encode data destined for the same target differently (for example, a second source system uses Yes and No for status), a separate transformation for each source system must be defined. Transformations must be documented and maintained in the data maps. The relationship between the source and target systems is maintained in a map that is referenced to execute the transformation of the data before it is loaded in the data warehouse.
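A Transact-SQL sketch of such a grain change; the line-item source table and the daily-SKU staging target are hypothetical names used only for illustration:

-- Summarize individual line items to one row per SKU per day
-- to match the coarser grain of the target fact table.
INSERT INTO Daily_Sales_Staging (sku, sale_day, total_qty, total_amount)
SELECT sku, CONVERT(char(8), sale_date, 112), SUM(qty), SUM(amount)
FROM LineItem_Temp
GROUP BY sku, CONVERT(char(8), sale_date, 112)
GO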
Develop Functional Elements
Design and implementation of the four ETL functional elements, Extraction, Transformation, Loading, and meta data logging, vary from system to system. There will often be multiple versions of each functional element.

Each functional element contains steps that perform individual tasks, which may execute on one of several systems, such as the OLTP or legacy systems that contain the source data, the staging area database, or the data warehouse database. Various tools and techniques may be used to implement the steps in a single functional area, such as Transact-SQL, DTS packages, or custom applications developed in a programming language such as Microsoft Visual Basic®. Steps that are discrete in one functional element may be combined in another.
Extraction
The extraction element may have one version to extract data from one OLTP data source, a different version for a different OLTP data source, and multiple versions for legacy systems and other sources of data. This element may include tasks that execute SELECT queries from the ETL staging database against a source OLTP system, or it may execute some tasks on the source system directly and others in the staging database, as in the case of generating a flat file from a legacy system and then importing it into tables in the ETL database. Regardless of methods or number of steps, the extraction element is responsible for extracting the required data from the source system and making it available for processing by the next element.
Transformation
Frequently a number of different transformations, implemented with various tools or techniques, are required to prepare data for loading into the data warehouse. Some transformations may be performed as data is extracted, such as an application on a legacy system that collects data from various internal files as it produces a text file of data to be further transformed. However, transformations are best accomplished in the ETL staging database, where data from several data sources may require varying transformations specific to the incoming data organization and format.

Data from a single data source usually requires different transformations for different portions of the incoming data. Fact table data transformations may include summarization, and will always require surrogate dimension keys to be added to the fact records. Data destined for dimension tables in the data warehouse may require one process to accomplish one type of update to a changing dimension and a different process for another type of update.

Transformations may be implemented using Transact-SQL (as demonstrated in the code examples later in this chapter), DTS packages, or custom applications. Regardless of the number and variety of transformations and their implementations, the transformation element is responsible for preparing data for loading into the data warehouse.
Loading
The loading element typically has the least variety of task implementations. After the data from the various data sources has been extracted, transformed, and combined, the loading operation consists of inserting records into the various data warehouse database dimension and fact tables. Implementation may vary in the loading tasks, such as using BULK INSERT, bcp, or the Bulk Copy API. The loading element is responsible for loading data into the data warehouse database tables.

Meta Data Logging
Meta data is collected from a number of the ETL operations. The meta data logging implementation for a particular ETL task will depend on how the task is implemented. For a task implemented by using a custom application, the application code may produce the meta data. For tasks implemented by using Transact-SQL, meta data can be captured with Transact-SQL statements in the task processes. The meta data logging element is responsible for capturing and recording meta data that documents the operation of the ETL functional areas and tasks, which includes identification of data that moves through the ETL system as well as the efficiency of ETL tasks.
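A minimal sketch of capturing such meta data with Transact-SQL; the ETL_Log table is a hypothetical structure, and the logged step reuses a staging table defined later in this chapter:

-- Hypothetical log table for transformation statistics.
CREATE TABLE ETL_Log (
    TaskName varchar(50) NOT NULL,
    RowsProcessed int NOT NULL,
    ExecutedAt datetime NOT NULL DEFAULT (getdate())
)
GO

-- After a transformation step executes, record how many rows it touched.
DECLARE @rows int
UPDATE Authors_Staging SET DateUpdated = getdate()
SET @rows = @@ROWCOUNT
INSERT INTO ETL_Log (TaskName, RowsProcessed)
VALUES ('Authors staging update', @rows)
GO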
Common Tasks
Each ETL functional element should contain tasks that perform the following functions, in addition to tasks specific to the functional area itself:

Confirm Success or Failure. A confirmation should be generated on the success or failure of the execution of the ETL processes. Ideally, this mechanism should exist for each task so that rollback mechanisms can be implemented to allow for incremental responses to errors.

Scheduling. ETL tasks should include the ability to be scheduled for execution. Scheduling mechanisms reduce repetitive manual operations and allow for maximum use of system resources during recurring periods of low activity.
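A sketch of scheduling an ETL task with the SQL Server Agent stored procedures; the job name, command, and schedule values are assumptions for illustration:

-- Create a job that runs an ETL stored procedure nightly at 2:00 A.M.
EXEC msdb.dbo.sp_add_job @job_name = 'Nightly ETL Load'
EXEC msdb.dbo.sp_add_jobstep @job_name = 'Nightly ETL Load',
    @step_name = 'Run load', @subsystem = 'TSQL',
    @command = 'EXEC Staging.dbo.usp_LoadDataWarehouse'
EXEC msdb.dbo.sp_add_jobschedule @job_name = 'Nightly ETL Load',
    @name = 'Nightly at 2 AM', @freq_type = 4, @freq_interval = 1,
    @active_start_time = 20000
EXEC msdb.dbo.sp_add_jobserver @job_name = 'Nightly ETL Load'
GO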
SQL Server 2000 ETL Components
SQL Server 2000 includes several components that aid in the development and maintenance of ETL systems:
• Data Transformation Services (DTS): SQL Server 2000 DTS is a set of graphical tools and programmable objects that lets you extract, transform, and consolidate data from disparate sources into single or multiple destinations.
• SQL Server Agent: SQL Server Agent provides features that support the scheduling of periodic activities on SQL Server 2000, or the notification to system administrators of problems that have occurred with the server.
• Stored Procedures and Views: Stored procedures assist in achieving a consistent implementation of logic across applications. The Transact-SQL statements and logic needed to perform a commonly performed task can be designed, coded, and tested once in a stored procedure. A view can be thought of as either a virtual table or a stored query. The data accessible through a view is not stored in the database as a distinct object; only the SELECT statement for the view is stored in the database.
• Transact-SQL: Transact-SQL is a superset of the SQL standard that provides powerful programming capabilities that include loops, variables, and other programming constructs.
• OLE DB: OLE DB is a low-level interface to data. It is an open specification designed to build on the success of ODBC by providing an open standard for accessing all kinds of data.
• Meta Data Services: SQL Server 2000 Meta Data Services provides a way to store and manage meta data about information systems and applications. This technology serves as a hub for data and component definitions, development and deployment models, reusable software components, and data warehousing descriptions.
The ETL Staging Database
In general, ETL operations should be performed on a relational database server separate from the source databases and the data warehouse database. A separate staging area database server creates a logical and physical separation between the source systems and the data warehouse, and minimizes the impact of the intense periodic ETL activity on source and data warehouse databases. If a separate database server is not available, a separate database on the data warehouse database server can be used for the ETL staging area. However, in this case it is essential to schedule periods of high ETL activity during times of low data warehouse user activity.

For small data warehouses with available excess performance and low user activity, it is possible to incorporate the ETL system into the data warehouse database. The advantage of this approach is that separate copies of data warehouse tables are not needed in the staging area. However, there is always some risk associated with performing transformations on live data, and ETL activities must be very carefully coordinated with data warehouse periods of minimum activity. When ETL is integrated into the data warehouse database, it is recommended that the data warehouse be taken offline when performing ETL transformations and loading.

Most systems can effectively stage data in a SQL Server 2000 database, as we describe in this chapter. An ETL system that needs to process extremely large volumes of data will need to use specialized tools and custom applications that operate on files rather than database tables. With extremely large volumes of data, it is not practical to load data into a staging database until it has been cleaned, aggregated, and stripped of meaningless information. Because it is much easier to build an ETL system using the standard tools and techniques that are described in this chapter, most experienced system designers will attempt to use a staging database, and move to custom tools only if data cannot be processed during the load window.

What does "extremely large" mean, and when does it become infeasible to use standard DTS tasks and Transact-SQL scripts to process data from a staging database? The answer depends on the load window, the complexity of transformations, and the degree of data aggregation necessary to create the rows that are permanently stored in the data warehouse. As a conservative rule of thumb, if the transformation application needs to process more than 1 gigabyte of data in less than an hour, it may be necessary to consider specialized high-performance techniques, which are outside the scope of this chapter.
This section provides general information about configuring the SQL Server 2000 database server and the database to support an ETL system staging area database with effective performance. ETL systems can vary greatly in their database server requirements; server configurations and performance option settings may differ significantly from one ETL system to another.

ETL data manipulation activities are similar in design and functionality to those of OLTP systems, although ETL systems do not experience the constant activity associated with OLTP systems. Instead of constant activity, ETL systems have periods of high write activity followed by periods of little or no activity. Configuring a server and database to meet the needs of an ETL system is not as straightforward as configuring a server and database for an OLTP system.

For a detailed discussion of RAID and SQL Server 2000 performance tuning, see Chapter 20, "RDBMS Performance Tuning Guide for Data Warehousing."
Server Configuration
Disk storage system performance is one of the most critical factors in the performance of database systems. Server configuration options offer additional methods for adjusting server performance.

RAID
As with any OLTP system, the RAID level for the disk drives on the server can make a considerable performance difference. For maximum performance of an ETL database, the disk drives for the server computer should be configured with RAID 1 or RAID 10. Additionally, it is recommended that the transaction logs, databases, and tempdb be placed on separate physical drives. Finally, if the hardware controller supports write caching, it is recommended that write caching be enabled. However, be sure to use a caching controller that guarantees that the controller cache contents will be written to disk in case of a system failure.
Server Configuration Options (sp_configure)
No specific changes need to be made to the server configuration options in order to optimize performance for an ETL system. It is recommended that these options be left at their default settings unless there is a specific reason to modify them.
Database Configuration
In SQL Server 2000, database performance can be tuned by proper selection of settings for data file growth and by adjusting database configuration options.

Data File Growth
When creating a database, an initial size for the data files for the database and transaction log must be specified. By default, SQL Server 2000 allows the data files to grow as much as necessary until disk space is exhausted. It is important to size the database appropriately before loading any data into it, to avoid the I/O-intensive operation of autogrowing data files. Failure to size the data files appropriately at the outset means that SQL Server will be forced to increase the size of the data files frequently, which will degrade the performance of the ETL processes.

If a data file is allowed to grow automatically, the file growth may be specified as a percentage or as a fixed value, in megabytes (MB) or kilobytes (KB). If percent is specified, the increment size is the specified percentage of the file size at the time the increment occurs. If the data file is too small, the growth increments will be frequent. For example, if a data file is initially created at 10 MB and set to grow in 10 percent increments until it reaches 20 MB, SQL Server 2000 will perform eight autogrow operations as the data file size increases to 20 MB. Therefore, it is recommended that a fixed MB value be chosen for data file growth increments.

Finally, if the server uses SCSI disks, special care should be taken to prevent disk space consumption from increasing beyond 85 percent of the capacity of the drive. Beyond 85 percent consumption, SCSI disk performance begins to degrade. Therefore, it is recommended that the data files for the database be set to grow automatically, but only to a predefined maximum size, which should be no more than 85 percent of the capacity of the drive.
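A sketch of these sizing recommendations in Transact-SQL; the database name, file paths, and sizes are assumptions for illustration:

-- Pre-size the staging database, grow in fixed 100-MB increments,
-- and cap growth well below 85 percent of the drive's capacity.
CREATE DATABASE ETL_Staging
ON PRIMARY (
    NAME = ETL_Staging_Data,
    FILENAME = 'E:\data\ETL_Staging.mdf',
    SIZE = 2000MB,
    FILEGROWTH = 100MB,
    MAXSIZE = 10000MB )
LOG ON (
    NAME = ETL_Staging_Log,
    FILENAME = 'F:\log\ETL_Staging.ldf',
    SIZE = 500MB,
    FILEGROWTH = 100MB )
GO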
Database Configuration Options
Several database options can be adjusted to enhance the performance of an ETL database. For a complete discussion of these options, see SQL Server Books Online. For more information about database performance tuning, see Chapter 20, "RDBMS Performance Tuning Guide for Data Warehousing."

The following table lists some database options and the settings that may be used to increase ETL performance.

Option name              Setting
AUTO_CREATE_STATISTICS   Off
AUTO_UPDATE_STATISTICS   On
AUTO_SHRINK              Off
CURSOR_DEFAULT           LOCAL
RECOVERY                 BULK_LOGGED
TORN_PAGE_DETECTION      On

Caution: Different recovery model options introduce varying degrees of risk of data loss. It is imperative that the risks be thoroughly understood before choosing a recovery model.
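These settings can be applied with ALTER DATABASE; a sketch, assuming the staging database is named ETL_Staging:

-- Apply the recommended option settings to the staging database.
ALTER DATABASE ETL_Staging SET AUTO_CREATE_STATISTICS OFF
ALTER DATABASE ETL_Staging SET AUTO_UPDATE_STATISTICS ON
ALTER DATABASE ETL_Staging SET AUTO_SHRINK OFF
ALTER DATABASE ETL_Staging SET CURSOR_DEFAULT LOCAL
ALTER DATABASE ETL_Staging SET RECOVERY BULK_LOGGED
ALTER DATABASE ETL_Staging SET TORN_PAGE_DETECTION ON
GO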
Managing Surrogate Keys
Surrogate keys are critical to successful data warehouse design: they provide the means to maintain data warehouse information when dimensions change. For more information and details about surrogate keys, see Chapter 17, "Data Warehouse Design Considerations."

The following are some common characteristics of surrogate keys:
• Used as the primary key for each dimension table, instead of the original key used in the source data system. The original key for each record is carried in the table but is not used as the primary key.
• May be defined as the primary key for the fact table. In general, the fact table uses a composite primary key composed of the dimension foreign key columns, with no surrogate key. In schemas with many dimensions, load and query performance will improve substantially if a surrogate key is used. If the fact table is defined with a surrogate primary key and no unique index on the composite key, the ETL application must be careful to ensure row uniqueness outside the database. A third possibility for the fact table is to define no primary key at all. While there are systems for which this is the most effective approach, it is not good database practice and should be considered with caution.
• Contains no meaningful business information; its only purpose is to uniquely identify each row. There is one exception: the primary key for a time dimension table provides human-readable information in the format "yyyymmdd".
• Is a simple key on a single column, not a composite key.
• Should be numeric, preferably integer, and not text.
• Should never be a GUID.

The SQL Server 2000 Identity column provides an excellent surrogate key mechanism.
ETL Code Examples
Code examples in these sections use the pubs sample database included with SQL Server 2000 to demonstrate various activities performed in ETL systems. The examples illustrate techniques for loading dimension tables in the data warehouse; they do not take into consideration separate procedures that may be required to update OLAP cubes or aggregation tables.

The use of temporary and staging tables in the ETL database allows the data extraction and loading process to be broken up into smaller segments of work that can be individually recovered. The temporary tables allow the source data to be loaded and transformed without impacting the performance of the source system except for what is necessary to extract the data. The staging tables provide a mechanism for data validation and surrogate key generation before loading transformed data into the data warehouse. Transformation, validation, and surrogate key management tasks should never be performed directly on dimension tables in the data warehouse.

The code examples in this chapter are presented as Transact-SQL, in order to communicate to the widest audience. A production ETL system would use DTS to perform this work. A very simple system may use several Execute SQL tasks linked within a package. More complex systems divide units of work into separate packages, and call those subpackages from a master package. For a detailed explanation of how to use DTS to implement the functionality described in this chapter, please see SQL Server Books Online.
Tables for Code Examples
The examples use the authors table in the pubs database as the source of data. The following three tables are created for use by the code examples.

Table name       Purpose
Authors_Temp     Holds the data imported from the source system.
Authors_Staging  Holds the dimension data while it is being updated. The data for the authors will be updated in this table and then the data will be loaded into the data warehouse dimension table.
Authors_DW       Simulates the Authors dimension table in the data warehouse.
These are key points regarding the structures of these tables:
• There is no difference between the structure of the authors table in the pubs database and the Authors_Temp table in the staging area. This allows for straightforward extraction of data from the source system with minimum impact on source system performance.
• The Authors_Staging table is used to generate the surrogate key (Author_Key column) that is used by the data warehouse. This table is also used to validate any data changes, convert data types, and perform any other transformations necessary to prepare the data for loading into the data warehouse.
• The structure of the Authors_Staging table in the staging area is the same as that of the Authors_DW table in the data warehouse. This allows for straightforward loading of the dimension data from the staging database to the data warehouse. If the dimension table in the data warehouse is small enough, it can be truncated and replaced with data from the staging table. In many data warehouses, dimension tables are too large to be efficiently updated by dropping and reloading them in their entirety. In this case, the tables in both the staging area and data warehouse should contain a datetime column, which can be used to determine which records need to be updated, inserted, or deleted in the data warehouse table (see the sketch after this list).
• The staging and data warehouse tables are identical after the data is loaded into the data warehouse. This fact can be considered for use in backup strategy planning.
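A sketch of the datetime-driven incremental approach mentioned in the list above; the @LastLoadTime value is assumed to come from an ETL meta data table:

-- Propagate only the rows touched since the last load, keyed on DateUpdated.
DECLARE @LastLoadTime smalldatetime
SET @LastLoadTime = '2000-01-01' -- in practice, read from an ETL meta data table

UPDATE D
SET contract = S.contract,
    DateUpdated = S.DateUpdated
FROM Authors_Staging S INNER JOIN Authors_DW D
    ON S.Author_Key = D.Author_Key
WHERE S.DateUpdated > @LastLoadTime
GO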
Define Example Tables
The following three Transact-SQL statements create the Authors_Temp and Authors_Staging tables, and the Authors_DW table that simulates the Authors dimension table in the data warehouse:

Code Example 19.1

CREATE TABLE [Authors_Temp] (
    [au_id] [varchar] (11) PRIMARY KEY CLUSTERED,
    [au_lname] [varchar] (40) DEFAULT ('Missing'),
    [au_fname] [varchar] (20) DEFAULT ('Missing'),
    [phone] [char] (12) DEFAULT ('000-000-0000'),
    [address] [varchar] (40) DEFAULT ('Missing'),
    [city] [varchar] (20) DEFAULT ('Missing'),
    [state] [char] (2) DEFAULT ('XX'),
    [zip] [char] (5) DEFAULT ('00000'),
    [contract] [bit] NOT NULL DEFAULT (0))
ON [PRIMARY]
GO

CREATE TABLE Authors_Staging (
    [Author_Key] int NOT NULL IDENTITY (1,1) PRIMARY KEY CLUSTERED,
    [au_id] varchar (11) NOT NULL,
    [au_lname] varchar (40) NOT NULL DEFAULT ('Missing'),
    [au_fname] varchar (20) NOT NULL DEFAULT ('Missing'),
    [phone] char (12) NOT NULL DEFAULT ('000-000-0000'),
    [address] varchar (40) NULL DEFAULT ('Missing'),
    [city] varchar (20) NOT NULL DEFAULT ('Missing'),
    [state] char (2) NOT NULL DEFAULT ('XX'),
    [zip] char (5) NOT NULL DEFAULT ('00000'),
    [contract] bit NOT NULL,
    [DateCreated] smalldatetime NOT NULL DEFAULT (getdate()),
    [DateUpdated] smalldatetime NOT NULL DEFAULT (getdate())
) ON [PRIMARY]
GO

CREATE TABLE [Authors_DW] (
    [Author_Key] [int] NOT NULL PRIMARY KEY CLUSTERED,
    [au_id] [varchar] (11) NOT NULL,
    [au_lname] [varchar] (40) NOT NULL DEFAULT ('Missing'),
    [au_fname] [varchar] (20) NOT NULL DEFAULT ('Missing'),
    [phone] [char] (12) NOT NULL DEFAULT ('000-000-0000'),
    [address] [varchar] (40) NULL DEFAULT ('Missing'),
    [city] [varchar] (20) NOT NULL DEFAULT ('Missing'),
    [state] [char] (2) NOT NULL DEFAULT ('XX'),
    [zip] [char] (5) NOT NULL DEFAULT ('00000'),
    [contract] [bit] NOT NULL,
    [DateCreated] smalldatetime NOT NULL DEFAULT (getdate()),
    [DateUpdated] smalldatetime NOT NULL DEFAULT (getdate())
) ON [PRIMARY]
GO
Populate Example Tables
The following three Transact-SQL statements populate the temporary, staging, and data warehouse sample tables by loading all of the author records except the record for author Johnson White, which will be inserted later to illustrate a technique for adding records to the data warehouse dimension table:

Code Example 19.2

-- Populate the Authors_Temp table with all author records except Johnson White's
INSERT INTO Authors_Temp
SELECT * FROM Authors
WHERE AU_ID <> '172-32-1176'
GO

-- Populate the Authors_Staging table from Authors_Temp
INSERT INTO Authors_Staging (au_id, au_lname, au_fname, phone, address, city, state,
    zip, contract)
SELECT au_id, au_lname, au_fname, phone, address, city, state, zip, contract
FROM Authors_Temp
GO

-- Populate the simulated data warehouse dimension table, Authors_DW
INSERT INTO Authors_DW
SELECT * FROM Authors_Staging
GO

The contents of the three tables now simulate the state following the completion of all previous ETL processing, before the author Johnson White is added to the source data table.
Inserting New Dimension Records
Loading new author records is a relatively simple task. If the extraction method is capable of generating a change set (a set of records that have been altered since the last data extraction) from the source system, we load the change set into the temporary table. If we cannot generate a change set from the source system, we will have to load the entire data set from the source system into the temporary table, even if only a single record has changed.

The following Transact-SQL code demonstrates a simple technique for loading new rows into the Authors dimension. This example assumes that there is a primary key on the source system that we can use, and it assumes that we do not have a change set.
Code Example 19.3

-- Truncate any data that currently exists in the Authors_Temp table
TRUNCATE TABLE Authors_Temp
GO

-- Load all of the data from the source system into the Authors_Temp table
INSERT INTO Authors_Temp
SELECT * FROM Authors
GO

-- Set a starting value for the Contract field for two records,
-- for use by future examples
UPDATE Authors_Temp
SET Contract = 0
WHERE state = 'UT'
GO

-- Locate all of the new records that have been added to the source system by
-- comparing the new temp table contents to the existing staging table contents,
-- and add the new records to the staging table
INSERT INTO Authors_Staging (au_id, au_lname, au_fname, phone, address, city, state,
    zip, contract)
SELECT T.au_id, T.au_lname, T.au_fname, T.phone, T.address, T.city, T.state, T.zip,
    T.contract
FROM Authors_Temp T LEFT OUTER JOIN
    Authors_Staging S ON T.au_id = S.au_id
WHERE (S.au_id IS NULL)
GO

-- Locate all of the new records that are to be added to the data warehouse,
-- and insert them into the data warehouse by comparing Authors_Staging to
-- Authors_DW
INSERT INTO Authors_DW (Author_Key, au_id, au_lname, au_fname, phone, address, city,
    state, zip, contract,
    DateCreated, DateUpdated)
SELECT S.Author_Key, S.au_id, S.au_lname, S.au_fname, S.phone, S.address, S.city,
    S.state, S.zip, S.contract,
    S.DateCreated, S.DateUpdated
FROM Authors_Staging S LEFT OUTER JOIN
    Authors_DW D ON S.au_id = D.au_id
WHERE (D.au_id IS NULL)
GO

Managing Slowly Changing Dimensions
This section describes various techniques for managing slowly changing dimensions in the data warehouse. "Slowly changing dimensions" is the customary term used for dimensions that contain attributes that, when changed, may affect grouping or summarization of historical data. Design approaches to dealing with the issues of slowly changing dimensions are commonly categorized into the following three change types:
• Type 1: Overwrite the dimension record
• Type 2: Add a new dimension record
• Type 3: Create new fields in the dimension record

Type 1 and Type 2 dimension changes are discussed in this section. Type 3 changes are not recommended for most data warehouse applications and are not discussed here. For more information and details about slowly changing dimensions, see Chapter 17, "Data Warehouse Design Considerations."

Type 1 and Type 2 dimension change techniques are used when dimension attributes change in records that already exist in the data warehouse. The techniques for inserting new records into dimensions (discussed earlier in the section "Inserting New Dimension Records") apply to all dimensions regardless of whether changes to dimension attributes are incorporated using Type 1 or Type 2 change techniques.

The code examples in the following sections demonstrate techniques for managing Type 1 and Type 2 dimension changes. The examples have been kept simple to maintain clarity for technique illustration purposes; the examples assume that all changes for a dimension will be of the same type, whereas, in reality, most dimensions include some attributes that require Type 2 changes and other attributes that can be maintained using Type 1 changes. For example, a retailer may decide that a change in the marital status of a customer should be treated as a Type 2 change, whereas a change of street address for the same customer should be treated as a Type 1 change. Therefore, it is important to document all of the attributes in a dimension and, for each attribute, whether a value change should be applied as a Type 1 or a Type 2 change.
Type 1: Overwrite the Dimension Record
A change to a dimension attribute that is never used for analysis can be managed by simply changing the data to the new value. This type of change is called a Type 1 change. For example, a change to a customer's street address is unlikely to affect any summarized information, and the previous street address can be discarded without consequence.

Type 1 dimension changes are straightforward to implement. The following Transact-SQL code demonstrates a simple Type 1 technique for updating existing rows in the Authors dimension. For this example, we will change some data in the Authors_Temp table to simulate changed records received as a result of updates to the authors table in the source database. The value for the Contract field is assumed to be eligible for Type 1 changes in this example. In a later section, the Contract field will be updated using a Type 2 change. The following example assumes that there is a primary key on the source system that we can use, and it assumes that we do not have a change set:
Code Example 19.4

-- Change the Authors_Temp table to simulate updates received from the source system
UPDATE Authors_Temp
SET Contract = 0
WHERE state = 'UT'
GO

-- Update the Authors_Staging table with the new values in Authors_Temp
UPDATE Authors_Staging
SET Contract = T.Contract,
    DateUpdated = getdate()
FROM Authors_Temp T INNER JOIN Authors_Staging S
    ON T.au_id = S.au_id
WHERE T.Contract <> S.Contract
GO

-- Update Authors_DW with the new data in Authors_Staging
UPDATE Authors_DW
SET Contract = S.Contract,
    DateUpdated = getdate()
FROM Authors_Staging S INNER JOIN Authors_DW D
    ON S.Author_Key = D.Author_Key
WHERE S.Contract <> D.Contract
GO
Type 2: Add a New Dimension Record
Type 2 changes cause history to be partitioned at the event that triggered the change. Data prior to the event continues to be summarized and analyzed as before; new data is summarized and analyzed in accordance with the new value of the data. The technique for implementing a Type 2 change is to keep the existing dimension record and add a new record that contains the updated data for the attribute or attributes that have changed. Values are copied from the existing record to the new record for all fields that have not changed. A new surrogate key value is created for the new record and the record is added to the dimension table. Fact records that apply to events subsequent to the Type 2 change must be related to the new dimension record.

Although it is relatively straightforward to implement Type 2 change techniques in the ETL process to manage slowly changing dimensions, the data associated with a dimension member becomes fragmented as such changes are made. Data warehouse analysis and reporting tools must be capable of summarizing data correctly for dimensions that include Type 2 changes. To minimize unnecessary fragmentation, a Type 2 change should not be used if a Type 1 change is appropriate.

The techniques used to insert new records into a Type 2 dimension are the same as the ones used to insert new records into a Type 1 dimension. However, the techniques used to track updates to dimension records are different.

The following Transact-SQL code demonstrates a simple technique for applying Type 2 changes to existing rows in the Authors dimension. Unlike a Type 1 change, existing records are not updated in a Type 2 dimension. Instead, new records are added to the dimension to contain the changes to the source system records. In this example, the values of the contract field changed in the Type 1 example are changed to different values, and we now assume the contract field is to be managed as a Type 2 change.

Notice that the Transact-SQL statement used to load updated records into the staging table is the same as the one used to insert new records into the staging table, except that the predicate clauses in the two statements differ. When loading new records, the WHERE clause uses the au_id field to determine which records are new. When inserting records for Type 2 changes, the WHERE clause causes new records to be added when the value of the attribute of interest (contract) in the temporary table differs from the attribute value in the staging table.
Code Example 19.5

-- Change the Authors_Temp table to simulate updates received from the source system.
-- This change reverses the change made in the Type 1 example by setting Contract to 1
UPDATE Authors_Temp
SET Contract = 1
WHERE state = 'UT'
GO

-- For example purposes, make sure the staging table records have a different value
-- for the contract field for the UT authors
UPDATE Authors_Staging
SET Contract = 0
WHERE state = 'UT'
GO

-- Insert new records into the staging table for those records in the temp table
-- that have a different value for the contract field
INSERT INTO Authors_Staging (au_id, au_lname, au_fname, phone, address, city, state,
    zip, contract)
SELECT T.au_id, T.au_lname, T.au_fname, T.phone, T.address, T.city, T.state, T.zip,
    T.contract
FROM Authors_Temp T
    LEFT OUTER JOIN Authors_Staging S ON T.au_id = S.au_id
WHERE T.Contract <> S.Contract
GO

-- Insert the new records into the data warehouse table
INSERT INTO Authors_DW (Author_Key, au_id, au_lname, au_fname, phone, address, city,
    state, zip, contract,
    DateCreated, DateUpdated)
SELECT S.Author_Key, S.au_id, S.au_lname, S.au_fname, S.phone, S.address, S.city,
    S.state, S.zip, S.contract,
    S.DateCreated, S.DateUpdated
FROM Authors_Staging S LEFT OUTER JOIN
    Authors_DW D ON S.Author_Key = D.Author_Key
WHERE (D.Author_Key IS NULL)
GO

Managing the Fact Table
After all dimension records have been loaded and updated, the fact table must also be loaded with new data. The fact table must be loaded after the dimension tables so the surrogate keys added to the dimension records during the ETL processes can be used as foreign keys in the fact table. This section demonstrates techniques for loading the fact table.

For purposes of these examples, a table (Fact_Source) is created that simulates a data table in a source system from which fact data can be extracted. The Fact_Source table data is a combination of data found in the Sales and TitleAuthor tables in the pubs database.

The following table lists definitions of the tables created for use with the examples that follow.

Table name    Purpose
Fact_Source   A simulated source data table that will be used in the example code. This table is a combination of Sales and TitleAuthor in the pubs database.
Fact_Temp     Receives data imported from the source system.
Fact_Staging  Holds the fact table data during transformation and surrogate key operations. Data is loaded to the data warehouse fact table after ETL operations are complete.
Fact_DW       The fact table in the data warehouse.
Titles_DW     A dimension table for Titles to demonstrate the use of surrogate keys in the fact table.
Stores_DW     A dimension table for Stores to demonstrate the use of surrogate keys in the fact table.
Several key points about the structures of these tables should be noted:
• There is no difference between the structures of the Fact_Source and Fact_Temp tables. This allows for the easiest method to extract data from the source system, so that transformations on the data do not impact the source system.
• The Fact_Staging table is used to add the dimension surrogate keys to the fact table records. This table is also used to validate any data changes, convert any data types, and so on.
• The structures of the Fact_Staging and Fact_DW tables do not match. This is because the final fact table in the data warehouse does not store the original keys, just the surrogate keys.
• The fact table key is an identity column that is generated when the transformed data is loaded into the fact table. Since we will not be updating the records once they have been added to the Fact_DW table, there is no need to generate the key prior to the data load into the fact table. This is not how the key column is generated in dimension tables. As discussed above, the decision to use an identity key for a fact table depends on the complexity of the data warehouse schema and the performance of load and query operations; this example implements an identity key for the fact table.

The following Transact-SQL statements create the tables defined above:
Code Example 19.6

-- Create the simulated source data table
CREATE TABLE [Fact_Source] (
    [stor_id] [char] (4) NOT NULL,
    [ord_num] [varchar] (20) NOT NULL,
    [ord_date] [datetime] NOT NULL,
    [qty] [smallint] NOT NULL,
    [payterms] [varchar] (12) NOT NULL,
    [title_id] [tid] NOT NULL
) ON [PRIMARY]
GO

-- Create the example temporary source data table used in the ETL database
CREATE TABLE [Fact_Temp] (
    [stor_id] [char] (4) NOT NULL,
    [ord_num] [varchar] (20) NOT NULL,
    [ord_date] [datetime] NOT NULL,
    [qty] [smallint] NOT NULL,
    [payterms] [varchar] (12) NOT NULL,
    [title_id] [tid] NOT NULL
) ON [PRIMARY]
GO

-- Create the example fact staging table
CREATE TABLE [Fact_Staging] (
    [stor_id] [char] (4) NOT NULL,
    [ord_num] [varchar] (20) NOT NULL,
    [ord_date] [datetime] NOT NULL,
    [qty] [smallint] NOT NULL DEFAULT (0),
    [payterms] [varchar] (12) NOT NULL,
    [title_id] [tid] NOT NULL,
    [Store_Key] [int] NOT NULL DEFAULT (0),
    [Title_Key] [int] NOT NULL DEFAULT (0)
) ON [PRIMARY]
GO

-- Create the example data warehouse fact table
CREATE TABLE [Fact_DW] (
    [Store_Key] [int] NOT NULL DEFAULT (0),
    [Title_Key] [int] NOT NULL DEFAULT (0),
    [ord_num] [varchar] (20) NOT NULL,
    [ord_date] [datetime] NOT NULL,
    [qty] [smallint] NOT NULL DEFAULT (0),
    [payterms] [varchar] (12) NOT NULL,
    [Fact_Key] [int] IDENTITY (1, 1) NOT NULL PRIMARY KEY CLUSTERED
) ON [PRIMARY]
GO

-- Create the example titles dimension table
CREATE TABLE [Titles_DW] (
    [title_id] [tid] NOT NULL,
    [title] [varchar] (80) NOT NULL,
    [type] [char] (12) DEFAULT ('UNDECIDED'),
    [pub_id] [char] (4) NULL,
    [price] [money] NULL,
    [advance] [money] NULL,
    [royalty] [int] NULL,
    [ytd_sales] [int] NULL,
    [notes] [varchar] (200) NULL,
    [pubdate] [datetime] NOT NULL DEFAULT (getdate()),
    [Title_Key] [int] NOT NULL IDENTITY (1,1) PRIMARY KEY CLUSTERED
) ON [PRIMARY]
GO

-- Create the example stores dimension table
CREATE TABLE [Stores_DW] (
    [stor_id] [char] (4) NOT NULL,
    [stor_name] [varchar] (40) NULL,
    [stor_address] [varchar] (40) NULL,
    [city] [varchar] (20) NULL,
    [state] [char] (2) NULL,
    [zip] [char] (5) NULL,
    [Store_Key] [int] IDENTITY (1,1) PRIMARY KEY CLUSTERED
) ON [PRIMARY]
GO
The following statements populate the sample fact and dimension tables and provide a base set of data for the remainder of the examples. The Fact_Temp and Fact_Source tables may appear to be redundant, but Fact_Source is only used to simulate a source table in an OLTP system.

Code Example 19.7

-- Load the simulated Fact_Source table with example data
INSERT INTO Fact_Source
SELECT S.*
FROM titleauthor TA INNER JOIN sales S ON TA.title_id = S.title_id
GO

-- Load the Fact_Temp table with data from the Fact_Source table
INSERT INTO Fact_Temp
SELECT *
FROM Fact_Source
GO

-- Load the example dimension for Titles
INSERT INTO Titles_DW
SELECT *
FROM Titles
GO

-- Load the example dimension for Stores
INSERT INTO Stores_DW
SELECT *
FROM Stores
GO

This completes the preparation of the sample data. The remaining examples demonstrate the tasks that prepare data for loading and load it into the data warehouse fact table.
The following code loads the Fact_Staging table. Notice that the Store_Key and Title_Key columns that are used for surrogate keys contain zeros when the data is first loaded into the staging table. This is because NULLs are not allowed in these columns. The prevention of NULLs allows for a very clean data load, and it negates the need for NULL logic checks in the ETL code or the reporting system. The zeros in the columns also provide an easy mechanism for locating invalid data in the dimension and fact table data. If a zero appears in either column in the final fact table, then the ETL logic failed to handle a dimension attribute. It is good practice to always add a dimension record with a zero key and assign it the description "unknown." This helps preserve relational integrity in the data warehouse and allows reporting systems to display the invalid data, so that corrections can be made to the ETL logic or the source data.
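A sketch of seeding such an "unknown" member, using IDENTITY_INSERT to force the zero key; the placeholder values are assumptions:

-- Add a zero-key "unknown" member to the stores dimension so that
-- unmatched fact rows still join to a dimension record.
SET IDENTITY_INSERT Stores_DW ON
INSERT INTO Stores_DW (Store_Key, stor_id, stor_name, stor_address, city, state, zip)
VALUES (0, 'XXXX', 'Unknown', NULL, NULL, NULL, NULL)
SET IDENTITY_INSERT Stores_DW OFF
GO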
Code Example 19.8

-- Load the Fact_Staging table with data in the Fact_Temp table
INSERT INTO Fact_Staging (stor_id, ord_num, ord_date, qty, payterms, title_id,
    Store_Key, Title_Key)
SELECT stor_id, ord_num, ord_date, qty, payterms, title_id, 0, 0
FROM Fact_Temp
GO
Now that the Fact_Staging table is loaded, the surrogate keys can be updated. The techniques for updating the surrogate keys in the fact table will differ depending on whether the dimension contains Type 2 changes. The following technique can be used for Type 1 dimensions:
Code Example 19.9

-- Update the Fact_Staging table with the surrogate key for Titles
-- (Type 1 dimension)
UPDATE Fact_Staging
SET Title_Key = T.Title_Key
FROM Fact_Staging F INNER JOIN
Titles_DW T ON F.title_id = T.title_id
GO

-- Update the Fact_Staging table with the surrogate key for Stores
-- (Type 1 dimension)
UPDATE Fact_Staging
SET Store_Key = S.Store_Key
FROM Fact_Staging F INNER JOIN
Stores_DW S ON F.stor_id = S.stor_id
GO
The technique above will not work for dimensions that contain Type 2 changes, however, because there may be more than one dimension record that contains the original source key. The following technique is appropriate for Type 2 dimensions:
Code Example 19.10

-- Add a few new rows to the Stores_DW table to demonstrate the technique
-- Duplicate store records are added that reflect changed store names
INSERT INTO Stores_DW (stor_id, stor_name, stor_address, city, state, zip)
SELECT stor_id, 'New ' + stor_name, stor_address, city, state, zip
FROM Stores_DW
WHERE state = 'WA'
GO

-- Add some new rows to the fact table to demonstrate the technique
INSERT INTO Fact_Staging (stor_id, ord_num, ord_date, qty, payterms, title_id,
Store_Key, Title_Key)
SELECT stor_id, ord_num, ord_date, qty, payterms, title_id, 0, 0
FROM Fact_Temp
GO

-- Update the fact table. Use the maximum store key
-- to relate the new fact data to the latest store record.
BEGIN TRANSACTION

-- get the maximum Store_Key for each stor_id
SELECT MAX(Store_Key) AS Store_Key, stor_id
INTO #Stores
FROM Stores_DW
GROUP BY stor_id
ORDER BY stor_id

-- update the fact table
UPDATE Fact_Staging
SET Store_Key = S.Store_Key
FROM Fact_Staging F INNER JOIN
#Stores S ON F.stor_id = S.stor_id
WHERE F.Store_Key = 0

-- drop the temporary table
DROP TABLE #Stores

COMMIT TRANSACTION
GO
After the fact data has been successfully scrubbed and transformed, it needs to be loaded into the data warehouse. If the ETL database is not on the same server as the data warehouse database, then the data will need to be transferred using DTS, bcp, or another mechanism. An efficient approach is to use bcp to export the data from the ETL database, copy the data to the target server, and then use BULK INSERT to update the target database (a sketch of this approach follows the next code example). However, if the databases are on the same server, a simple INSERT statement will load the new fact table rows:
Code Example 19.11

-- Load the new fact table rows into the data warehouse
INSERT INTO Fact_DW (ord_num, ord_date, qty, payterms, Store_Key, Title_Key)
SELECT ord_num, ord_date, qty, payterms, Store_Key, Title_Key
FROM Fact_Staging
GO
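For the cross-server path described above, the following is a minimal sketch, assuming an ETL database named ETL_DB, a trusted connection, and an exported layout that matches the target table; the server name and file path are placeholders, and a format file may be needed if the staging columns do not line up with Fact_DW.

-- Sketch: export the staged rows with bcp (run from a command prompt; names are placeholders)
-- bcp "ETL_DB.dbo.Fact_Staging" out "C:\etl\fact_staging.dat" -c -S ETLSERVER -T

-- After copying the file to the target server, bulk load it into the warehouse
BULK INSERT Fact_DW
FROM 'C:\etl\fact_staging.dat'
WITH (DATAFILETYPE = 'char', TABLOCK)
GO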
Finally, the following SELECT statement shows the data warehouse fact table, complete with Type 2 change handling for the stores dimension:

Code Example 19.12

-- Demonstrate the success of the technique
SELECT S.stor_id, S.Store_Key, S.stor_name, F.ord_num, F.ord_date, F.qty,
F.payterms
FROM Stores_DW S INNER JOIN
Fact_DW F ON S.Store_Key = F.Store_Key
ORDER BY S.stor_id, S.Store_Key
GO
Advanced Techniques

While the sample techniques described above will work for small to medium-sized dimensions, they will not work for large dimensions. For large dimensions, variations of these techniques can provide greater efficiency. The code examples in this topic show some advanced techniques for Type 2 dimensions.

One of the key design decisions in the above techniques centers on the use of the staging tables. In the techniques illustrated above, the staging tables are exact copies of the final data warehouse dimension tables. However, the efficiency of the above techniques decreases as the number of rows in the dimension increases due to records added for Type 2 changes. For very large dimensions (millions of rows), the above technique will require massive amounts of processing power to complete. Therefore, for large dimensions, we need to introduce a variation of the above technique that will allow the system to scale with the data warehouse.
This variation involves creating a "current version" dimension table for use in the ETL process that contains only a single row for each of the dimension members. This record contains the current attributes of the dimension member. For example, if we have a Type 2 dimension for stores, and the data for the store Bookbeat has undergone three Type 2 changes, then the current version table would not contain all four records for the store. Instead, the table contains a single row for Bookbeat that contains all of the current information for it, including the current surrogate key value for the dimension member. This creates a smaller table with fewer rows that allows for faster access during the ETL process.

The following code incorporates a Stores_Current table to demonstrate this technique for the Stores dimension. The table below describes each of the tables used in the example.
Table name Purpose
Stores_Temp Holds the data imported from the source system.
Stores_Staging Holds the dimension data while it is being updated. The data for the stores will be updated in this table and then the data will be loaded into the data warehouse dimension table.
Stores_Current Contains a single record for each store to track the current information for the store.
Stores_DW Simulates the Stores dimension table in the data warehouse.
The following statements create the four tables:

Code Example 19.13

DROP TABLE Stores_DW
GO

CREATE TABLE [Stores_Temp] (
[stor_id] [char] (4) NOT NULL,
[stor_name] [varchar] (40) NULL,
[stor_address] [varchar] (40) NULL,
[city] [varchar] (20) NULL,
[state] [char] (2) NULL,
[zip] [char] (5) NULL
) ON [PRIMARY]
GO

CREATE TABLE [Stores_Staging] (
[stor_id] [char] (4) NOT NULL,
[stor_name] [varchar] (40) NULL,
[stor_address] [varchar] (40) NULL,
[city] [varchar] (20) NULL,
[state] [char] (2) NULL,
[zip] [char] (5) NULL,
[DateCreated] smalldatetime NOT NULL DEFAULT (getdate()),
[DateUpdated] smalldatetime NOT NULL DEFAULT (getdate()),
[Store_Key] [int] IDENTITY (1,1) PRIMARY KEY CLUSTERED
) ON [PRIMARY]
GO

CREATE TABLE [Stores_Current] (
[stor_id] [char] (4) NOT NULL,
[stor_name] [varchar] (40) NULL,
[stor_address] [varchar] (40) NULL,
[city] [varchar] (20) NULL,
[state] [char] (2) NULL,
[zip] [char] (5) NULL,
[DateCreated] smalldatetime NOT NULL DEFAULT (getdate()),
[DateUpdated] smalldatetime NOT NULL DEFAULT (getdate()),
[Store_Key] [int] PRIMARY KEY CLUSTERED
) ON [PRIMARY]
GO

CREATE TABLE [Stores_DW] (
[stor_id] [char] (4) NOT NULL,
[stor_name] [varchar] (40) NULL,
[stor_address] [varchar] (40) NULL,
[city] [varchar] (20) NULL,
[state] [char] (2) NULL,
[zip] [char] (5) NULL,
[DateCreated] smalldatetime NOT NULL DEFAULT (getdate()),
[DateUpdated] smalldatetime NOT NULL DEFAULT (getdate()),
[Store_Key] [int] PRIMARY KEY CLUSTERED
) ON [PRIMARY]
GO
The following statements populate the Stores_Temp, Stores_Staging, and Stores_Current sample tables to provide a base set of data that will be used in the remainder of the example:

Code Example 19.14

-- Load the Stores_Temp table with the default set of data
INSERT INTO Stores_Temp
SELECT * FROM Stores
GO

-- Load the Stores_Staging table with the default set of data
INSERT INTO Stores_Staging (stor_id, stor_name, stor_address, city, state, zip,
DateCreated, DateUpdated)
SELECT stor_id, stor_name, stor_address, city, state, zip, getdate(), getdate()
FROM Stores_Temp
GO

-- Load the Stores_Current table with the default set of data
INSERT INTO Stores_Current (stor_id, stor_name, stor_address, city, state, zip,
DateCreated, DateUpdated, Store_Key)
SELECT stor_id, stor_name, stor_address, city, state, zip, DateCreated, DateUpdated,
Store_Key
FROM Stores_Staging
GO
The following code adds some new records into the Stores_Staging table to simulate Type 2 changes to the Stores dimension. The new records reflect changes to existing store data; no new store records are added.

Code Example 19.15

-- Insert some change records into Stores_Staging to demonstrate the technique
-- Duplicate records are added that reflect changes to store names for some stores
INSERT INTO Stores_Staging (stor_id, stor_name, stor_address, city, state, zip)
SELECT stor_id, stor_name + ' New', stor_address, city, state, zip
FROM Stores_Staging
WHERE state <> 'CA'
GO
Records for new stores are loaded into Stores_Current before starting to process stores with change records. The following Transact-SQL code loads new stores in the Stores_Staging table into the Stores_Current table. This technique is the same as the one documented earlier in the chapter.

Code Example 19.16

-- Insert any new stores in Stores_Staging into the Stores_Current table
-- In this example there should not be any new stores
INSERT INTO Stores_Current (stor_id, stor_name, stor_address, city, state, zip,
DateCreated, DateUpdated)
SELECT S.stor_id, S.stor_name, S.stor_address, S.city, S.state, S.zip, S.DateCreated,
S.DateUpdated
FROM Stores_Staging S LEFT OUTER JOIN Stores_Current C ON S.stor_id = C.stor_id
WHERE (C.Store_Key IS NULL)
GO
The real change in this technique involves changing the way that dimension members are updated. The following Transact-SQL code demonstrates the alternative way to update the dimension members and load them into the data warehouse dimension. Once the new members of the dimension have been loaded, the next step is to check existing members for attribute changes that require Type 2 changes to the dimension. This example checks the stor_name attribute and updates the row in the Stores_Current table for every store that has had a name change (in this example, all stores that are not in CA).
Code Example 19.17

-- Update the Stores_Current table for all stores that have had a name change
UPDATE Stores_Current
SET stor_name = S.stor_name,
Store_Key = S.Store_Key,
DateUpdated = getdate()
FROM Stores_Staging S
INNER JOIN Stores_Current C ON S.stor_id = C.stor_id
WHERE S.stor_name <> C.stor_name
GO
Now that all of the dimension records have been updated with the latest data, the surrogate keys can be updated for the fact table data with the following Transact-SQL statement. This technique is more efficient because a temporary table does not have to be created to determine the current value of the dimension table key.

Code Example 19.18

-- generate some fact data rows that do not have a Store_Key
INSERT INTO Fact_Staging (stor_id, ord_num, ord_date, qty, payterms, title_id,
Store_Key, Title_Key)
SELECT stor_id, ord_num, getdate(), qty, payterms, title_id, 0, 0
FROM Fact_Staging
WHERE qty < 20
GO

-- Update the fact data using the Store_Key from the Stores_Current table
-- to relate the new fact data to the latest store record
UPDATE Fact_Staging
SET Store_Key = C.Store_Key
FROM Fact_Staging F INNER JOIN
Stores_Current C ON F.stor_id = C.stor_id
WHERE F.Store_Key = 0
GO

Meta Data Logging

A critical design element in a successful ETL implementation is the capability to generate, store, and review meta data. Data tables in a data warehouse store information about customers, items purchased, dates of purchase, and so on. Meta data tables store information about users, query execution times, number of rows retrieved in a report, and so on. In ETL systems, meta data tables store information about transformation execution time, number of rows processed by a transformation, the last date and time a table was updated, failure of a transformation to complete, and so on. This information, if analyzed appropriately, can help predict what is likely to occur in future transformations by analyzing trends of what has already occurred.
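As a simple, hedged illustration of that kind of trend analysis, the following query summarizes recent executions per job using the tblAdmin_Audit_Jobs table defined later in this section; the 30-day window is an arbitrary assumption for the example.

-- Sketch: average duration (minutes) and rows per job over the last 30 days (window assumed)
SELECT JobNumber,
COUNT(*) AS Executions,
AVG(DATEDIFF(mi, StartDate, EndDate)) AS AvgMinutes,
AVG(NumberRecords) AS AvgRecords
FROM tblAdmin_Audit_Jobs
WHERE Successful = 1
AND StartDate >= DATEADD(dd, -30, getdate())
GROUP BY JobNumber
ORDER BY JobNumber
GO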
In the code examples that follow, the terms "Job," "Step," and "Threshold" are used with the following meanings:
• A "Job" is an ETL element that is either executed manually or as a scheduled event. A Job contains one or more steps.
• A "Step" is an individual unit of work in a job, such as an INSERT, UPDATE, or DELETE operation.
• A "Threshold" is a range of values defined by a minimum value and a maximum value. Any value that falls within the specified range is deemed acceptable; any value that does not fall within the range is unacceptable. For example, a processing window is a type of threshold: if a job completes within the time allotted for the processing window, it is acceptable; if it does not, it is not acceptable. A sketch of evaluating such a threshold follows this list.
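To make the threshold idea concrete, here is a hedged sketch that classifies each job execution's duration against its time thresholds; it borrows the CASE logic that appears later in Code Example 19.22 and assumes the audit tables defined in the following pages.

-- Sketch: classify job durations against the master time thresholds
SELECT A.JobAuditID,
CASE
WHEN DATEDIFF(mi, A.StartDate, A.EndDate)
BETWEEN M.MinThreshTime AND M.MaxThreshTime THEN 'On'
WHEN DATEDIFF(mi, A.StartDate, A.EndDate) < M.MinThreshTime THEN 'Under'
ELSE 'Over'
END AS TimeTarget
FROM tblAdmin_Audit_Jobs A INNER JOIN
tblAdmin_Job_Master M ON A.JobNumber = M.JobNumber
GO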
Designing meta data storage requires careful planning and implementation. There are dependencies between tables and order-of-precedence constraints on records. However, the meta data information generated by ETL activities is critical to the success of the data warehouse. Following is a sample set of tables that can be used to track meta data for ETL activities.

Job Audit

ETL jobs produce data points that need to be collected. Most of these data points are aggregates of the data collected for the job steps and could theoretically be derived by querying the job step audit table. However, the meta data for the job itself is important enough to warrant storage in a separate table. Below are sample meta data tables that aid in tracking job information for each step in an ETL process.
tblAdmin_Job_Master

This table lists all of the jobs that are used to populate the data warehouse. These are the fields in tblAdmin_Job_Master:

Field Definition
JobNumber A unique identifier for the record, generally an identity column.
JobName The name (description) for the job. For example, "Load new dimension data."
MinThreshRecords The minimum acceptable number of records affected by the job.
MaxThreshRecords The maximum acceptable number of records affected by the job.
MinThreshTime The minimum acceptable execution time for the job.
MaxThreshTime The maximum acceptable execution time for the job.
CreateDate The date and time the record was created.

tblAdmin_Audit_Jobs

This table is used to track each specific execution of a job. It is related to the tblAdmin_Job_Master table using the JobNumber column. These are the fields in tblAdmin_Audit_Jobs:

Field Definition
JobAuditID A unique identifier for the record, generally an identity column.
JobNumber The number of the job that was executed (from tblAdmin_Job_Master).
JobName The name (description) for the job. For example, "Load new dimension data."
StartDate The date and time the job was started.
EndDate The date and time the job ended.
NumberRecords The number of records affected by the job.
Successful A flag indicating if the execution of the job was successful.
This data definition language will generate the above audit tables:

Code Example 19.19

CREATE TABLE [dbo].[tblAdmin_Job_Master] (
[JobNumber] [int] IDENTITY (1, 1) NOT NULL
CONSTRAINT UPKCL_Job PRIMARY KEY CLUSTERED,
[JobName] [varchar] (50) NULL DEFAULT ('Missing'),
[MinThreshRecords] [int] NOT NULL DEFAULT (0),
[MaxThreshRecords] [int] NOT NULL DEFAULT (0),
[MinThreshTime] [int] NOT NULL DEFAULT (0),
[MaxThreshTime] [int] NOT NULL DEFAULT (0),
[CreateDate] [smalldatetime] NOT NULL DEFAULT (getdate())
) ON [PRIMARY]
GO

CREATE TABLE [dbo].[tblAdmin_Audit_Jobs] (
-- JobAuditID identifies each execution and is referenced by the job and step procedures below
[JobAuditID] [int] IDENTITY (1, 1) NOT NULL
CONSTRAINT UPKCL_AuditJob PRIMARY KEY CLUSTERED,
[JobNumber] [int] NOT NULL,
[JobName] [varchar] (50) NULL DEFAULT ('Missing'),
[StartDate] [smalldatetime] NOT NULL DEFAULT (getdate()),
[EndDate] [smalldatetime] NOT NULL DEFAULT ('01/01/1900'),
[NumberRecords] [int] NOT NULL DEFAULT (0),
[Successful] [bit] NOT NULL DEFAULT (0)
) ON [PRIMARY]
GO
Step Audit

Many ETL jobs are multistep, complicated transformations that involve INSERT, UPDATE, and DELETE statements with branching logic and an execution dependency. It is important to record meta data that tracks the successful completion of an operation, when it happened, and how many rows it processed. This information should be stored for every step in an ETL job. Below are sample meta data tables that aid in tracking information for each step in an ETL job.

tblAdmin_Step_Master

This table lists all of the steps in a job. These are the fields in tblAdmin_Step_Master:

Field Definition
JobNumber The unique number of the job that this step is associated with.
StepSeqNumber The step number within the object that executed the unit of work. Frequently, ETL jobs contain more than a single unit of work, and storing the step number allows for easy debugging and specific reporting. If the object only has a single step, then the value of this field is "1".
StepDescription A description of the step. For example, "Inserted records into tblA."
Object The name of the object. For example, the name of a stored procedure or DTS package that accomplishes the step.
NumberRecords The number of records affected by the step.
MinThreshRecords The minimum "acceptable" number of records affected by the step.
MaxThreshRecords The maximum "acceptable" number of records affected by the step.
MinThreshTime The minimum "acceptable" execution time for the step.
MaxThreshTime The maximum "acceptable" execution time for the step.
CreateDate The date and time the record was created.
StepNumber A unique value assigned to the record, generally an identity column.
tblAdmin_Audit_Step

This table is used to track each specific execution of a job step. It is related to the tblAdmin_Step_Master table using the StepNumber column. These are the fields in tblAdmin_Audit_Step:

Field Definition
RecordID A unique value assigned to the record, generally an identity column.
JobAuditID Used to tie the specific execution of a job step to the specific execution of a job.
StepNumber The step number executed.
Parameters Any parameters sent to the job step for the specific execution instance. These are the parameter values, not a list of the parameters.
NumberRecords The number of records affected by the job step.
StartDate The date and time the job step was started.
EndDate The date and time the job step ended.
UserName The name of the user that executed the job step.

Below is the data definition language that will generate the job step audit tables above.
Code Example 19.20

CREATE TABLE [dbo].[tblAdmin_Step_Master] (
[JobNumber] [int] NOT NULL DEFAULT (1),
[StepSeqNumber] [int] NOT NULL DEFAULT (1),
[StepDescription] [varchar] (50) NOT NULL DEFAULT ('Missing'),
[Object] [varchar] (50) NULL DEFAULT ('Missing'),
[MinThreshRecords] [int] NOT NULL DEFAULT (0),
[MaxThreshRecords] [int] NOT NULL DEFAULT (0),
[MinThreshTime] [int] NOT NULL DEFAULT (0),
[MaxThreshTime] [int] NOT NULL DEFAULT (0),
[CreateDate] [smalldatetime] NOT NULL DEFAULT (getdate()),
[StepNumber] [int] IDENTITY (1, 1) NOT NULL
CONSTRAINT UPKCL_StepMaster PRIMARY KEY CLUSTERED) ON [PRIMARY]
GO

CREATE TABLE [dbo].[tblAdmin_Audit_Step] (
[RecordID] [int] IDENTITY (1, 1) NOT NULL
CONSTRAINT UPKCL_AuditStep PRIMARY KEY CLUSTERED,
-- JobAuditID ties the step execution to a job execution (see the field list above)
[JobAuditID] [int] NOT NULL DEFAULT (0),
[Object] [varchar] (50) NULL DEFAULT ('Missing'),
[StepNumber] [tinyint] NOT NULL DEFAULT (1),
[StepDescription] [varchar] (50) NOT NULL DEFAULT ('Missing'),
[Parameters] [varchar] (100) NULL,
[NumberRecords] [int] NOT NULL DEFAULT (1),
[StartDate] [smalldatetime] NOT NULL DEFAULT (getdate()),
[EndDate] [smalldatetime] NOT NULL DEFAULT ('01/01/1900'),
[UserName] [varchar] (20) NOT NULL DEFAULT ('Missing'),
-- Threshold snapshot columns populated by usp_Admin_Audit_Job_End (Code Example 19.22)
[MinRecords] [int] NULL,
[MaxRecords] [int] NULL,
[MinTime] [int] NULL,
[MaxTime] [int] NULL,
[TimeTarget] [varchar] (10) NULL,
[RecordTarget] [varchar] (10) NULL) ON [PRIMARY]
GO
Error Tracking

Another important type of meta data about transformations is information that tracks what failed and why. ETL jobs produce errors. Just as tracking successful execution is important, tracking failures is equally important. Below is a sample meta data table that aids in tracking error information for each step in an ETL job. This table is designed to track SQL Server 2000 errors, although it could be modified to track OLE DB errors as well.

Note In SQL Server 2000, only the error number can be trapped, not the generated error message.

tblAdmin_Audit_Errors

This table lists all of the errors that are generated during a job step. These are the fields in tblAdmin_Audit_Errors:

Field Definition
RecordID A unique value assigned to the record, generally an identity column.
Object The name of the object (for example, the stored procedure) that generated the error.
StepNumber The step number executed that generated the error.
Parameters Any parameters sent to the job step for the specific execution instance. These are the parameter values, not a list of the parameters.
ErrorNumber The error number raised by SQL Server 2000 (generally the @@ERROR number).
RecordCount The number of records affected by the job step.
UserName The name of the user that executed the job step.
RecordDate The date and time the error record was logged.
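The samples reference this table but do not include its DDL, so the following is a minimal sketch inferred from the field list above and from the INSERT in Code Example 19.25; the column widths are assumptions chosen to match the other audit tables.

-- Hedged sketch: DDL inferred from the field list and usp_Admin_Log_Error (widths assumed)
CREATE TABLE [dbo].[tblAdmin_Audit_Errors] (
[RecordID] [int] IDENTITY (1, 1) NOT NULL
CONSTRAINT UPKCL_AuditErrors PRIMARY KEY CLUSTERED,
[Object] [varchar] (50) NULL DEFAULT ('Missing'),
[StepNumber] [int] NOT NULL DEFAULT (1),
[Parameters] [varchar] (100) NULL,
[ErrorNumber] [int] NOT NULL DEFAULT (0),
[RecordCount] [int] NOT NULL DEFAULT (0),
[UserName] [varchar] (20) NOT NULL DEFAULT ('Missing'),
[RecordDate] [smalldatetime] NOT NULL DEFAULT (getdate())
) ON [PRIMARY]
GO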
Code Sample: Job Audit

The following stored procedures demonstrate one method of logging job-level meta data. The usp_Admin_Audit_Job_Start procedure indicates the start of an ETL job and should be the very first stored procedure executed in the ETL job:

Code Example 19.21

CREATE PROCEDURE usp_Admin_Audit_Job_Start
@JobNumber int = 1 -- The number of the job being executed (from the master job table)
AS
SET NOCOUNT ON

-- DECLARE variables
DECLARE @ErrorNumber int -- the number of the SQL error generated
DECLARE @ErrorRowCount int -- the number of rows in the unit of work affected by error
DECLARE @StartDate smalldatetime -- the datetime the load job started
DECLARE @EndDate smalldatetime -- the datetime the load job ended

-- INSERT the first record (start time) for the job into the tblAdmin_Audit_Jobs table
BEGIN TRANSACTION

SET @StartDate = getdate() -- set a start date for the batch
SET @EndDate = '01/01/1900' -- set a bogus end date for the batch

INSERT INTO tblAdmin_Audit_Jobs (JobNumber, StartDate, EndDate, NumberRecords,
Successful)
VALUES (@JobNumber, @StartDate, @EndDate, 0, 0)

SELECT @ErrorNumber = @@error, @ErrorRowCount = @@rowcount

IF @ErrorNumber <> 0
BEGIN
ROLLBACK TRANSACTION
GOTO Err_Handler
END

COMMIT TRANSACTION
RETURN (0)

Err_Handler:
exec usp_Admin_Log_Error @@ProcID, 1, 'none', @ErrorNumber, @ErrorRowCount
RETURN (1)
GO
The following stored procedure indicates the end of an ETL job and should be the last stored procedure executed in the ETL job. It is important to note that in addition to updating the tblAdmin_Audit_Jobs table, this stored procedure also updates the tblAdmin_Audit_Step table with the threshold information for each step. The threshold information is stored with each step in the table because, over time, the acceptable thresholds for the step may change. If the threshold information is only stored in the master step table (a Type 1 dimension), any changes to the table affect meta data generated for historical steps.

Therefore, storing the threshold with the step (a Type 2 dimension) allows us to maintain historical execution records without affecting their integrity if the master step information is changed. For example, if a step initially loads 1,000 rows but over time the number of rows increases to 1 million, the acceptable threshold information for that step must be changed as well. If the threshold data is stored only in the tblAdmin_Step_Master table and not stored with each record, the context of the data will be lost, which can cause inaccuracies in reports built on the meta data information.

For simplicity, to illustrate the technique, the sample code does not maintain threshold information automatically. In order to change the threshold information for a step, an administrator will need to modify the master step record manually. However, it would be possible to automate this process, as sketched below.
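One hedged way to automate that maintenance, offered as a sketch rather than as part of the original sample, is to recalculate each step's record thresholds from its recent execution history; the 20 percent margins and the 30-day window are arbitrary assumptions.

-- Sketch: refresh master record thresholds from recent history (margins and window assumed)
UPDATE tblAdmin_Step_Master
SET MinThreshRecords = H.AvgRecords * 80 / 100,
MaxThreshRecords = H.AvgRecords * 120 / 100
FROM tblAdmin_Step_Master M INNER JOIN
(SELECT StepNumber, AVG(NumberRecords) AS AvgRecords
FROM tblAdmin_Audit_Step
WHERE StartDate >= DATEADD(dd, -30, getdate())
GROUP BY StepNumber) H ON M.StepNumber = H.StepNumber
GO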
Code Example 19.22

CREATE PROCEDURE usp_Admin_Audit_Job_End
@JobNumber int = 1, -- The number of the job (from the master job table) being executed
@Successful bit -- A flag indicating if the job was successful
AS
SET NOCOUNT ON

-- DECLARE variables
DECLARE @ErrorNumber int -- the number of the SQL error generated
DECLARE @ErrorRowCount int -- the number of rows in the unit of work affected by error
DECLARE @StartDate smalldatetime -- the datetime the load job started
DECLARE @EndDate smalldatetime -- the datetime the load job ended
DECLARE @JobAuditID int -- the # for the instance of the job
DECLARE @RowCount int -- the number of rows affected by the job

-- UPDATE the job record (end time) in the Audit table
BEGIN TRANSACTION

SET @EndDate = getdate() -- set the end date for the batch

SET @JobAuditID = (SELECT MAX(JobAuditID) FROM tblAdmin_Audit_Jobs
WHERE JobNumber = @JobNumber) -- get the most recent execution of the job

SET @RowCount = (SELECT SUM(NumberRecords) -- get the total job record count
FROM tblAdmin_Audit_Step WHERE JobAuditID = @JobAuditID)

UPDATE tblAdmin_Audit_Jobs -- Update the Job record with the end time
SET EndDate = @EndDate,
NumberRecords = @RowCount,
Successful = @Successful
WHERE JobAuditID = @JobAuditID

SELECT @ErrorNumber = @@error, @ErrorRowCount = @@rowcount

UPDATE tblAdmin_Audit_Step -- Update all steps for the job with the
SET MinRecords = T.MinThreshRecords, -- threshold information for each step
MaxRecords = T.MaxThreshRecords,
MinTime = T.MinThreshTime,
MaxTime = T.MaxThreshTime,
TimeTarget = CASE
WHEN DATEDIFF(mi, A.StartDate, A.EndDate) BETWEEN T.MinThreshTime AND
T.MaxThreshTime
THEN 'On'
WHEN DATEDIFF(mi, A.StartDate, A.EndDate) < T.MinThreshTime THEN 'Under'
WHEN DATEDIFF(mi, A.StartDate, A.EndDate) > T.MaxThreshTime THEN 'Over'
ELSE 'Unknown'
END,
RecordTarget = CASE
WHEN A.NumberRecords BETWEEN T.MinThreshRecords AND T.MaxThreshRecords
THEN 'On'
WHEN A.NumberRecords < T.MinThreshRecords THEN 'Under'
WHEN A.NumberRecords > T.MaxThreshRecords THEN 'Over'
ELSE 'Unknown'
END
FROM tblAdmin_Step_Master T
RIGHT OUTER JOIN tblAdmin_Audit_Step A ON T.StepNumber = A.StepNumber
WHERE A.JobAuditID = @JobAuditID

SELECT @ErrorNumber = @@error, @ErrorRowCount = @@rowcount

IF @ErrorNumber <> 0
BEGIN
ROLLBACK TRANSACTION
GOTO Err_Handler
END

COMMIT TRANSACTION
RETURN (0)

Err_Handler:
exec usp_Admin_Log_Error @@ProcID, 1, 'none', @ErrorNumber, @ErrorRowCount
RETURN (1)
GO
Code Sample: Step Audit

The following stored procedures demonstrate one method of logging step records from within ETL stored procedures. Notice that @@ProcID is used to retrieve the object ID of the executing stored procedure. Also note that the values of @@error and @@rowcount are retrieved immediately after the INSERT statement.
Code Example 19.23

CREATE PROCEDURE usp_Admin_Audit_Step
@StepNumber tinyint = 0, -- the unique number of the step
@Parameters varchar(50) = 'none', -- any parameters used in the SP
@RecordCount int = 0, -- the number of records modified by the step
@StartDate smalldatetime, -- the date & time the step started
@EndDate smalldatetime -- the date & time the step ended
AS
SET NOCOUNT ON

-- DECLARE variables
DECLARE @ErrorNumber int
DECLARE @ErrorRowCount int
DECLARE @JobAuditID int

-- INSERT the audit record into the tblAdmin_Audit_Step table
BEGIN TRANSACTION

SET @JobAuditID = (SELECT MAX(JobAuditID) FROM tblAdmin_Audit_Jobs)

INSERT INTO tblAdmin_Audit_Step (JobAuditID, StepNumber, Parameters,
NumberRecords, StartDate, EndDate, UserName)
VALUES (@JobAuditID, @StepNumber, @Parameters, @RecordCount,
@StartDate, @EndDate, user_name())

SELECT @ErrorNumber = @@error, @ErrorRowCount = @@rowcount

IF @ErrorNumber <> 0
BEGIN
ROLLBACK TRANSACTION
GOTO Err_Handler
END

COMMIT TRANSACTION
RETURN (0)

Err_Handler:
exec usp_Admin_Log_Error @@ProcID, 1, 'none', @ErrorNumber, @ErrorRowCount
RETURN (1)
GO
The following stored procedure demonstrates the use of the auditing stored procedure detailed above:

Code Example 19.24

CREATE PROCEDURE usp_AuditSample
AS
SET NOCOUNT ON

-- DECLARE variables
DECLARE @ErrorNumber int
DECLARE @RecordCount int
DECLARE @StartDate smalldatetime
DECLARE @EndDate smalldatetime

BEGIN TRANSACTION

SET @StartDate = getdate() -- get the datetime the step started

INSERT INTO tblTest
SELECT * FROM tblTest

SELECT @ErrorNumber = @@error, @RecordCount = @@rowcount

IF @ErrorNumber <> 0 -- error handler
BEGIN
ROLLBACK TRANSACTION
GOTO Err_Handler
END

SET @EndDate = getdate() -- get the datetime the step finished

-- log the audit record into the tblAdmin_Audit_Step table
exec usp_Admin_Audit_Step 1, 'test from SP', @RecordCount, @StartDate, @EndDate

COMMIT TRANSACTION
RETURN (0)

Err_Handler:
exec usp_Admin_Log_Error @@ProcID, 1, 'none', @ErrorNumber, @RecordCount
RETURN (1)
GO
Code Sample: Error Tracking

The following stored procedures demonstrate one possible method of logging errors in ETL stored procedures. Notice that the stored procedure uses the OBJECT_NAME function to retrieve the name of the object (table, view, stored procedure, and so on). This introduces a level of abstraction; because the object ID is obtained from @@ProcID, the technique is only useful for stored procedures.

Code Example 19.25

CREATE PROCEDURE usp_Admin_Log_Error
@ObjectID int,
@StepNumber int,
@Parameters varchar(50) = 'none',
@ErrorNumber int = 0,
@RecordCount int = 0
AS
SET NOCOUNT ON

-- RETRIEVE the name of the object being audited
DECLARE @ObjectName varchar(50)
SET @ObjectName = OBJECT_NAME(@ObjectID)

-- INSERT the audit record into the tblAdmin_Audit_Errors table
BEGIN TRANSACTION

INSERT INTO tblAdmin_Audit_Errors (Object, StepNumber, Parameters, ErrorNumber,
RecordCount, UserName, RecordDate)
VALUES (@ObjectName, @StepNumber, @Parameters, @ErrorNumber, @RecordCount,
user_name(), getdate())

COMMIT TRANSACTION
GO
Once an error is generated and passed to the error logging stored procedure, it is logged to the tblAdmin_Audit_Errors table. Notice that @@ProcID is used to retrieve the object ID of the executing stored procedure. Also note that the values of @@error and @@rowcount are retrieved immediately after the INSERT statement. With the exception of modifying the value of @Step, this logic may be deployed with no other alterations to the code. The following stored procedure demonstrates how to deploy the error logging method detailed above:

Code Example 19.26

CREATE PROCEDURE usp_ErrorSample
AS
SET NOCOUNT ON

-- DECLARE variables
DECLARE @ObjectName varchar(50)
DECLARE @ErrorNumber int, @RecordCount int
DECLARE @Step int

-- INSERT rows into the Authors table (inserting the table into itself generates an error)
BEGIN TRANSACTION

INSERT INTO Authors
SELECT * FROM authors

SELECT @ErrorNumber = @@error, @RecordCount = @@rowcount, @Step = 2

IF @ErrorNumber <> 0
BEGIN
ROLLBACK TRANSACTION
GOTO Err_Handler
END

COMMIT TRANSACTION
RETURN (0)

Err_Handler:
exec usp_Admin_Log_Error @@ProcID, @Step, 'none', @ErrorNumber, @RecordCount
RETURN (1)
GO
Conclusion

The ETL system efficiently extracts data from its sources, transforms and sometimes aggregates data to match the target data warehouse schema, and loads the transformed data into the data warehouse database. A well-designed ETL system supports automated operation that informs operators of errors with the appropriate level of warning. SQL Server 2000 Data Transformation Services can be used to manage the ETL operations, regardless of the techniques used to implement individual ETL tasks.

While it is tempting to perform some transformation on data as it is extracted from the source system, the best practice is to isolate transformations within the transformation modules. In general, the data extraction code should be designed to minimize the impact on the source system databases.

In most applications, the key to efficient transformation is to use a SQL Server 2000 database for staging. Once extracted data has been loaded into a staging database, the powerful SQL Server 2000 database engine is used to perform complex transformations.

The process of loading fact table data from the staging area into the target data warehouse should use bulk load techniques. Dimension table data is usually small in volume, which makes bulk loading less important for dimension table loading.

The ETL system is a primary source of meta data that can be used to track information about the operation and performance of the data warehouse as well as the ETL processes.