5
Data Staging
In this Chapter
Overview
Data that is extracted, transformed and loaded (ETL) from a source to a destination may
require temporary staging for several reasons, such as recovering from failures, reducing
load on the source system, cleansing data, or auditing. This section walks through
scenarios that help determine whether staging is required in the first place. If it is,
selecting the staging model that best fits the requirements is critical for an optimal
ETL solution.
Time lag: There is a time lag between the extraction and loading processes. If the
extraction occurs at 7 AM and the load into the warehouse happens at midnight, the
extracted data has to be stored somewhere in between. A staging database is a
reasonable option.
Restart on failure: If parts of the ETL process are prone to failure, recovery is
easier when a staging database is available as the starting point. For example,
suppose the extraction process fails after successfully extracting 60 of 100 tables
to the staging area. When the extraction is restarted, it needs to extract only the
remaining 40 tables, because the data for the first 60 is already available in
staging. When source system resources are constrained, it is particularly
important to avoid repeating an extract unnecessarily. The staging database allows
the extraction process to be decoupled from the subsequent processes.
Multiple source systems: When data is to be consolidated from multiple source
systems, a staging database allows the consolidation to take place before the
transformation is processed.
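The restart-on-failure scenario above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `extract_with_restart`, the `extract_table` callback and the use of a plain dictionary as the staging area are all assumptions made for the example. A real staging database would persist between runs.

```python
from typing import Callable, Dict, Iterable, List, Set


def extract_with_restart(
    tables: Iterable[str],
    extract_table: Callable[[str], List[dict]],
    staging: Dict[str, List[dict]],
) -> Set[str]:
    """Extract every table that is not yet in staging.

    `staging` stands in for the staging database (hypothetical: a dict
    keyed by table name). Returns the names of tables that failed.
    """
    failed: Set[str] = set()
    for table in tables:
        if table in staging:
            # Staged by an earlier run: do not query the source again.
            continue
        try:
            staging[table] = extract_table(table)
        except Exception:
            # Record the failure and carry on; a later run retries it.
            failed.add(table)
    return failed
```

Calling the function a second time with the same `staging` object queries the source only for the tables that failed, which is the decoupling the staging database is meant to provide.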
Types of Staging
Assuming that the extracted data needs to be staged, several staging repository
patterns are available. Consider the following common staging repository
architectures when designing the ETL process:
Staggered Staging
Persisted Staging
Accumulated Staging
[Figure: ETL back office and production environments separated by a firewall, showing Source Db, Staging Db and Target Db with the E, T and L steps.]
Staggered Staging
In this case, you create multiple staging databases for different stages of the ETL
process. This approach is useful when individual steps of the process are expensive
to repeat and also prone to failure. It clearly costs disk space, but it provides
multiple levels of restart capability without having to repeat earlier stages of the
process.
[Figure: staggered staging, showing Source Db, Extract, Staging Db, Transform, Load and Target Db, with three Exception outputs.]
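As a rough sketch of the multi-level restart that staggered staging provides, the Python below persists the output of each stage under its own name and skips any stage whose output already exists. The names `run_staggered`, `Stage` and `stores` are hypothetical, and a dictionary stands in for the separate staging databases.

```python
from typing import Any, Callable, Dict, List, Tuple

# A stage is a (name, function) pair; the function maps input data to output data.
Stage = Tuple[str, Callable[[Any], Any]]


def run_staggered(stages: List[Stage], source: Any, stores: Dict[str, Any]) -> Any:
    """Run the stages in order, keeping each stage's output in `stores`.

    A stage whose output is already in `stores` is skipped, so a rerun
    resumes from the first stage that has not completed.
    """
    data = source
    for name, step in stages:
        if name not in stores:
            # May raise; outputs of earlier stages remain in `stores`.
            stores[name] = step(data)
        data = stores[name]
    return data
```

If the transform stage fails, a rerun with the same `stores` starts from the staged extract output rather than returning to the source system. The cost is one stored copy of the data per stage, which matches the disk-space trade-off described above.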
Persisted Staging
In this case, an archive copy of the staging database is created on a regular basis.
The primary reason for choosing this approach is to allow auditing of the extract and
transformation processes beyond the current cycle.
[Figure: persisted staging, showing Source Db, Staging Db, Target Db, the ETL process and an archive of extracted data.]
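One possible way to take the routine archive copy is sketched below, assuming purely for illustration that the staging database is a SQLite database. The function name `archive_staging`, the `cycle_id` label and the file naming scheme are hypothetical. `sqlite3.Connection.backup` is the standard library's online backup call (Python 3.7 and later).

```python
import sqlite3
from pathlib import Path


def archive_staging(staging: sqlite3.Connection, archive_dir: str, cycle_id: str) -> Path:
    """Copy the whole staging database to <archive_dir>/staging_<cycle_id>.db."""
    target = Path(archive_dir) / f"staging_{cycle_id}.db"
    archive = sqlite3.connect(str(target))
    try:
        # Copies every table in the staging database into the archive file.
        staging.backup(archive)
    finally:
        archive.close()
    return target
```

Running this at the end of each ETL cycle, before the staging tables are cleared, leaves one archive file per cycle that auditors can query independently of the live staging database.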
Accumulated Staging
Ideally, the source system provides delta data together with the corresponding
transaction type (insert, update or delete).
The accumulated staging approach can be adopted when:
1. The source system does not have a built-in delta detection mechanism.
2. The source system does not provide the type of transaction applied to each
delta-detected record.
If the source system does not have built-in delta detection, all records from the
required tables are extracted to staging and then compared against the accumulated
staging.
If the source system does have a delta detection mechanism and extracts only the
changed or added records, the next step is to determine the transaction type that
occurred for each of these records (insert, update or delete). The following example
shows how the comparison is made to determine the transaction type.
Incoming Extraction (Table Employee):
EmpId    EmpName    EmpLocation
1011     John       Redmond
1021     Dave       Seattle
Accumulated Staging (Table Employee):
EmpId    EmpName    EmpLocation
1011     John       New York
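In the tables above, when the incoming Employee table is compared with the accumulated
staging table, the record with EmpId 1011 is determined to be Modified, because its
EmpLocation differs, whereas the record with EmpId 1021 is determined to be Inserted,
because it does not yet exist in accumulated staging.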
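The comparison described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: rows are dictionaries keyed by EmpId, the function name `classify_changes` is hypothetical, and the optional `detect_deletes` flag assumes the incoming data is a full extract, since only then does a missing key imply a deletion.

```python
from typing import Dict, Hashable

Row = Dict[str, object]


def classify_changes(
    incoming: Dict[Hashable, Row],
    accumulated: Dict[Hashable, Row],
    detect_deletes: bool = False,
) -> Dict[Hashable, str]:
    """Label each changed key as Inserted, Modified or (optionally) Deleted."""
    changes: Dict[Hashable, str] = {}
    for key, row in incoming.items():
        if key not in accumulated:
            changes[key] = "Inserted"
        elif row != accumulated[key]:
            changes[key] = "Modified"
        # Identical rows are unchanged and are not reported.
    if detect_deletes:
        # Assumption: `incoming` is a full extract, so absence means deletion.
        for key in accumulated:
            if key not in incoming:
                changes[key] = "Deleted"
    return changes


incoming = {
    1011: {"EmpName": "John", "EmpLocation": "Redmond"},
    1021: {"EmpName": "Dave", "EmpLocation": "Seattle"},
}
accumulated = {
    1011: {"EmpName": "John", "EmpLocation": "New York"},
}
print(classify_changes(incoming, accumulated))
# prints {1011: 'Modified', 1021: 'Inserted'}
```

After classification, the accumulated staging table would typically be updated with the incoming rows so that the next cycle compares against the latest state. That step is omitted here.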
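[Figure: accumulated staging, showing Source Db, Staging Db and Target Db; extracts accumulate as E1, E1+E2 and E1+E2+E3, paired with transform/load steps T1 L1, T2 L2 and T3 L3.]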
[Figure: Source Db, Staging Db and Target Db; chunks of data extracted at different times (labels: 11 AM, 3 PM, 9 PM), followed by Transform & Load.]
Destination Considerations
The previous sections described various ways of staging data before loading the data
warehouse. In some BI database implementations, there may be an additional step
that distributes the data to separate data marts.
[Figure: Source Db, Staging Db, Transform & Load, Warehouse and Marts.]
For example, the data from the central data warehouse may need to be distributed to
data marts at different geographical locations. In some ways, the data warehouse
functions as a staging area for the distributed data marts. Although this is not
technically a staging area, it should still be taken into account when choosing the
conditions for loading the warehouse, to ensure that the destination marts are
populated appropriately.