
High Performance Data Warehouse

Design and Construction

ETL Processing

1
ETL Processing
[Diagram: operational data maintained by IT users flows through data transformation into the enterprise warehouse and integrated data marts, and is then replicated to dependent data marts or departmental warehouses used by business users.]

2
Data Acquisition from OLTP Systems

• Why is it hard?
Multiple source system technologies
Multiple sources for the same data element
Complexity of required transformations
Scarcity and cost of legacy cycles
Inconsistent data representations
Volume of legacy data

3
Data Acquisition from OLTP Systems

Multiple source system technologies

* Flat files            * Excel      * Model 204
* VSAM                  * Access     * DBF format
* IMS                   * Oracle     * RDB
* IDMS                  * Informix   * RMS
* DB2 (many flavors)    * Sybase     * Compressed files
* Adabas                * Ingres     * Many others…

4
Data Acquisition from OLTP Systems
• Inconsistent data representations: same data, different domain

Examples:
Date value representations
- 1996-02-14
- 02/14/1996
- 14-FEB-1996
- 960214
- 14485

Gender value representations
- M/F
- M/F/PM/PF
- 0/1
- 1/2
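A minimal sketch of normalizing the date domain during transformation, assuming the short list of formats above; the epoch used for serial values such as 14485 is purely illustrative (a SAS-style 1960-01-01) and would need to be confirmed against the actual source system:

from datetime import datetime, date, timedelta

# Candidate source formats, tried in order; the first successful parse wins.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y", "%y%m%d"]

def normalize_date(value: str) -> date:
    """Map the different source representations onto a single ISO date domain."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    # Purely numeric values such as '14485' are assumed to be day offsets from a
    # system epoch; 1960-01-01 is only an illustrative guess for that epoch.
    if value.isdigit():
        return date(1960, 1, 1) + timedelta(days=int(value))
    raise ValueError(f"Unrecognized date representation: {value!r}")

# The first four representations above all map to the same ISO date.
assert normalize_date("02/14/1996") == normalize_date("14-FEB-1996") == date(1996, 2, 14)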

5
Data Acquisition from OLTP Systems

• Multiple sources for the same data element

Need to establish precedence between source systems on a per
data element basis.
Take the data element from the source system with the highest
precedence in which the element exists.
Must sometimes establish "group precedence" rules to maintain
data integrity.
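A minimal sketch of per-element precedence, with hypothetical source system names and data elements; in practice the precedence table would be driven from metadata:

# Hypothetical precedence lists per data element: the first source with a value wins.
PRECEDENCE = {
    "birth_date": ["crm", "billing", "legacy_mainframe"],
    "email":      ["billing", "crm", "legacy_mainframe"],
}

def merge_element_values(records_by_source: dict) -> dict:
    """Take each data element from the highest-precedence source where it exists."""
    merged = {}
    for element, source_order in PRECEDENCE.items():
        for source in source_order:
            value = records_by_source.get(source, {}).get(element)
            if value is not None:
                merged[element] = value
                break
    return merged

print(merge_element_values({
    "crm":     {"birth_date": "1971-03-02", "email": None},
    "billing": {"email": "a@example.com"},
}))
# -> {'birth_date': '1971-03-02', 'email': 'a@example.com'}

A "group precedence" rule would extend this so that related elements (for example, all address fields) are always taken together from a single source.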

6
Data Acquisition from OLTP Systems

• Complexity of required transformations

Simple scalar transformations
- 0/1 => M/F
One-to-many element transformations
- 6x30 address field => street1, street2, city, state, zip
Many-to-many element transformations
- Householding and individualization of customer records
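As an illustration of the one-to-many case, here is a heuristic sketch that splits a legacy 6x30-character address block into discrete elements; the layout assumption (the last non-blank line holds "city, state zip") is hypothetical:

def split_address(address_6x30):
    """Split a 6 x 30-character legacy address block into street1/street2/city/state/zip."""
    lines = [line.strip() for line in address_6x30 if line.strip()]
    *street, last = lines                      # assume the last non-blank line is "city, state zip"
    city_state, _, zip_code = last.rpartition(" ")
    city, _, state = city_state.rstrip(",").rpartition(",")
    return {
        "street1": street[0] if street else "",
        "street2": street[1] if len(street) > 1 else "",
        "city": city.strip(),
        "state": state.strip(),
        "zip": zip_code,
    }

print(split_address(["123 Main St", "Apt 4B", "Springfield, IL 62704", "", "", ""]))
# -> {'street1': '123 Main St', 'street2': 'Apt 4B', 'city': 'Springfield', 'state': 'IL', 'zip': '62704'}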

7
Data Acquisition from OLTP Systems

• Scarcity and cost of legacy cycles

Generally want to off-load transformation cycles to an open
systems environment.
Often requires new skill sets.
Need an efficient and easy way to deal with mainframe data
formats such as EBCDIC and packed decimal.
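A small sketch of handling these mainframe formats on an open-systems platform, assuming single-byte EBCDIC (code page 037) text and standard COMP-3 packed-decimal fields:

def ebcdic_to_ascii(raw: bytes) -> str:
    """Decode EBCDIC text using Python's built-in cp037 codec."""
    return raw.decode("cp037")

def unpack_comp3(raw: bytes, scale: int = 0):
    """Decode a packed-decimal (COMP-3) field: two BCD digits per byte,
    with the sign carried in the low nibble of the final byte."""
    digits, sign = [], 1
    for i, b in enumerate(raw):
        hi, lo = b >> 4, b & 0x0F
        if i < len(raw) - 1:
            digits += [hi, lo]
        else:
            digits.append(hi)
            sign = -1 if lo == 0x0D else 1
    value = 0
    for d in digits:
        value = value * 10 + d
    return sign * (value / 10 ** scale if scale else value)

print(ebcdic_to_ascii(b"\xC8\xC5\xD3\xD3\xD6"))    # -> 'HELLO'
print(unpack_comp3(b"\x12\x34\x5C", scale=2))      # -> 123.45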

8
Data Acquisition from OLTP Systems

• Volume of legacy data

Need lots of processing and I/O capacity to effectively handle large
data volumes.
The 2 GB file limit in older versions of UNIX is not acceptable for
handling legacy data – need a full 64-bit file system.
Need efficient interconnect bandwidth to transfer large
amounts of data from legacy sources.

9
Data Acquisition from OLTP Systems

• What does the solution look like?

Meta data driven transformation architecture
Modular software solutions with component building blocks
Parallel software and hardware architecture

10
Data Acquisition from OLTP Systems

• Meta data driven transformation architecture

Need multiple meta data structures
- Source meta data
- Target meta data
- Transformation meta data
Must avoid 'hard coding' for maintainability.
Automatic generation of transformations from meta data
structures.
Meta data repository ideally accessible by APIs and end user
tools.
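A minimal sketch of what avoiding 'hard coding' can look like, with hypothetical source, target, and transformation meta data; the transformation logic is interpreted from the meta data structures rather than written into the job:

# Hypothetical meta data structures; in practice these would live in a repository.
SOURCE_META = {"CUST_GNDR": "char(1)", "CUST_DOB": "char(8)"}
TARGET_META = {"gender": "char(1)", "birth_date": "date"}
TRANSFORM_META = [
    {"source": "CUST_GNDR", "target": "gender",     "rule": "map",           "args": {"0": "M", "1": "F"}},
    {"source": "CUST_DOB",  "target": "birth_date", "rule": "date_yyyymmdd", "args": None},
]

# Rule implementations keyed by name; new rules are registered here, not hard coded in jobs.
RULES = {
    "map":           lambda value, args: args[value],
    "date_yyyymmdd": lambda value, args: f"{value[0:4]}-{value[4:6]}-{value[6:8]}",
}

def transform_row(source_row: dict) -> dict:
    """Apply every transformation described in the meta data to one source row."""
    return {m["target"]: RULES[m["rule"]](source_row[m["source"]], m["args"])
            for m in TRANSFORM_META}

print(transform_row({"CUST_GNDR": "0", "CUST_DOB": "19960214"}))
# -> {'gender': 'M', 'birth_date': '1996-02-14'}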

11
Data Acquisition from OLTP Systems

• Modular software solutions with component building blocks

Want a data-flow driven transformation architecture that
supports multiple processing steps.
Meta data structures should map inputs and outputs between
each transformation module.
Leverage pre-packaged tools for transformation steps
wherever possible.

12
Data Acquisition from OLTP Systems

• Parallel software and hardware architecture

Use data parallelism (partitioning) to allow concurrent
execution of multiple job streams.
Software architecture must allow efficient repartitioning of
data between steps in the transformation process.
Want powerful parallel hardware architectures with many
processors and I/O channels.
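A minimal sketch of data parallelism over file partitions, using Python's multiprocessing pool as a stand-in for a parallel transformation framework; the partition file names and the per-row transform are placeholders:

from multiprocessing import Pool

def transform_partition(path: str) -> str:
    """Transform one partition of the extract; each worker owns one job stream."""
    out_path = path + ".transformed"
    with open(path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(line.upper())      # placeholder for the real transformation logic
    return out_path

if __name__ == "__main__":
    partitions = [f"extract_part_{i:02d}.dat" for i in range(8)]   # hypothetical partition files
    with Pool(processes=8) as pool:      # one concurrent job stream per partition
        outputs = pool.map(transform_partition, partitions)
    print(outputs)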

13
A Word of Warning

• The data quality in the source systems will be much
worse than what you expect.

Must allocate explicit time and resources to facilitate data
clean-up.
Data quality is a continuous improvement process – must
institute a TQM program to be successful.
Use the 'house of quality' technique to prioritize and focus data
quality efforts.

14
ETL Processing
•It is important to look at the big picture.
Data acquisition time may include:
Extracts from source systems
Data movement
Transformations
Data loading
Index maintenance
Statistics collection
Summary data maintenance
Data mart construction
Backups

15
Loading Strategies

• Once we have transformed the data, there are three
primary loading strategies:

Full data refresh with 'block slamming' into an empty
table.
Incremental data refresh with 'block slamming' into
existing (populated) tables.
Trickle feed with continuous data acquisition using
row-level insert and update operations.

16
Loading Strategies
We must also worry about rolling off "old" data as its economic
value drops below the cost of storing and maintaining it.

************************
DIAGRAM to be inserted
************************

17
Loading Strategies

• The choice of loading strategy depends on tradeoffs between data
freshness and loading performance, as well as on the data volatility
characteristics.
What is the goal?
Increased data freshness
Increased data loading performance

[Diagram: a spectrum from real-time availability at low update rates to minimal load time (delayed availability) at high update rates.]

18
Loading Strategies

• Should consider:
Data storage requirements
Impact on query workloads
Ratio of existing to new data
Insert versus update workloads

19
Loading Strategies

Tradeoffs in data loading with a high percentage of data
changes per data block

************************
GRAPH to be inserted
************************

20
Loading Strategies

Tradeoffs in data loading with a low percentage of data
changes per data block

************************
GRAPH to be inserted
************************

21
Full Refresh Strategy

Completely re-load the table on each refresh.

Step 1: Load the table using block slamming
Step 2: Build indexes
Step 3: Collect statistics

This is a good (simple) strategy for small tables or when a
high percentage of rows in the table changes on each
refresh (greater than 10%).
E.g., reference lookup tables, or account tables where balances
change on each refresh.
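A sketch of the three steps, using a hypothetical bulk-load utility name and SQL whose exact syntax varies by RDBMS:

import subprocess

def full_refresh(conn):
    # Step 1: block-slam the extract into the empty table with a vendor bulk loader
    # ('db_bulk_loader' is a hypothetical utility name).
    subprocess.run(["db_bulk_loader", "--table", "account", "--file", "account.dat"], check=True)
    cur = conn.cursor()
    # Step 2: build indexes only after the data is in place.
    cur.execute("CREATE INDEX account_cust_ix ON account (customer_id)")
    # Step 3: refresh optimizer statistics so query plans reflect the new data.
    cur.execute("ANALYZE account")
    conn.commit()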

22
Full Refresh Strategy

• Performance hints:

Remove referential integrity (RI) constraints from
table definitions for loading operations.
- Assume that data cleansing takes place in transformation.
Remove secondary index specifications from the table
definition.
- Build indices after the table has been loaded.
Make sure target table logging is disabled during
loads.

23
Full Refresh Strategy

• Consider using 'shadow' tables to allow the refresh to
take place without impacting query workloads.
Load the shadow table.
Use a replace-view operation to direct queries to the
refreshed table and make the new data visible.

Trades storage for availability.
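A sketch of the shadow-table swap; the table and view names are illustrative and the view-replacement syntax varies by RDBMS:

def refresh_via_shadow(conn, load_shadow_table):
    cur = conn.cursor()
    cur.execute("TRUNCATE TABLE sales_shadow")
    load_shadow_table()      # block slam, build indexes, collect statistics
    # Queries reference the view, so the switch makes the new data visible in one step.
    cur.execute("CREATE OR REPLACE VIEW sales AS SELECT * FROM sales_shadow")
    conn.commit()
    # The previously active table becomes next cycle's shadow: storage traded for availability.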

24
Incremental Refresh Strategy

• Incrementally load new data into an existing target
table that has already been populated by previous
loads.

Two primary strategies:

Incremental load directly into the target table.
Shadow table load followed by an insert-select
operation into the target table.

25
Incremental Refresh Strategy

• Design considerations for incremental load directly into the
target using RDBMS utilities:
Indices should be maintained automatically.
Re-collect statistics if demographics have changed
significantly.
Typically requires a table lock to be taken during block
slamming operations.
Do you want to allow 'dirty' reads?
Logging behavior differs across RDBMS products.

26
Incremental Refresh Strategy

• Design considerations for the shadow table implementation:

Use block slamming into an empty 'shadow' table having an
identical structure to the target table.
Staging space is required for the shadow table.
The insert-select operation from the shadow table to the target
table will preserve indices.
Locking will normally escalate to a table-level lock.
Beware of log file size constraints.
Beware of the performance overhead of logging.
Beware of rollbacks if the operation fails for any reason.
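A sketch of the shadow-table variant; object names are illustrative, and the insert-select is the step where index maintenance, table-level locking, and logging costs show up:

def incremental_refresh_via_shadow(conn, load_shadow_table):
    cur = conn.cursor()
    # 1. Block-slam the delta into the empty shadow table (same structure as the target).
    load_shadow_table()
    # 2. Insert-select into the target preserves its indices, but normally escalates to a
    #    table-level lock and is fully logged, so watch log space and rollback cost.
    cur.execute("INSERT INTO sales SELECT * FROM sales_shadow")
    conn.commit()
    # 3. Release the staging space for the next cycle.
    cur.execute("TRUNCATE TABLE sales_shadow")
    conn.commit()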

27
Incremental Refresh Strategy
• Both incremental load strategies described preserve index
structures during the loading operation.
However, there is a cost to maintaining indexes during the
load.
Rule of thumb: Each secondary index maintained during
the load costs 2-3 times the resources of the actual row
insertion of data into the table.
Rule of thumb: Consider dropping and re-building index
structures if the number of rows being incrementally
loaded is more than 10% of the size of the target table.
Note: Drop and re-build of secondary indices may not be acceptable
due to the availability requirements of the DW.

28
Trickle Feed
• Acquire data on a continuous basis into the RDBMS
using row-level SQL insert and update operations.

Data is made available to the DW 'immediately' rather than
waiting for a batch load to complete.
Much higher overhead for data acquisition on a per-record
basis as compared to batch strategies.
Row-level locking mechanisms allow queries to proceed
during data acquisition.
Typically relies on Enterprise Application Integration (EAI)
for data delivery.

29
Trickle Feed

• A tradeoff exists between data freshness and
insert efficiency.
Buffering rows for insertion allows for fewer round trips to the
RDBMS…
… but waiting to accumulate rows into the buffer impacts
data freshness.

Suggested approach: Use a threshold that buffers up to M
rows, but never waits more than N seconds before
sending a buffer of data for insertion.
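A minimal sketch of the M-rows-or-N-seconds threshold, assuming a DB-API connection and an illustrative target table; a production version would also flush from a timer so a quiet feed does not hold rows past N seconds:

import time

class BufferedInserter:
    """Buffer rows and flush when M rows accumulate or the oldest buffered row is N seconds old."""
    def __init__(self, conn, max_rows=500, max_wait_secs=5.0):
        self.conn, self.max_rows, self.max_wait = conn, max_rows, max_wait_secs
        self.buffer, self.oldest = [], None

    def add(self, row):
        if self.oldest is None:
            self.oldest = time.monotonic()
        self.buffer.append(row)
        if len(self.buffer) >= self.max_rows or time.monotonic() - self.oldest >= self.max_wait:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        cur = self.conn.cursor()
        cur.executemany(
            "INSERT INTO sales_fact (sale_id, amount, sale_ts) VALUES (%s, %s, %s)",
            self.buffer)
        self.conn.commit()
        self.buffer, self.oldest = [], None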

30
ELT versus ETL

• There are two fundamental approaches to data
acquisition:
ETL is extract, transform, load, in which transformation takes
place on a transformation server using either an 'engine' or
generated code.
ELT is extract, load, transform, in which data transformations
take place in the relational database on the data warehouse
server.

Of course, hybrids are also possible…

31
ETL Processing

• ETL processing performs the transform operations prior
to loading the data into the RDBMS.
Extract data from the source systems.
Transform data into a form consistent with the target
tables.
Load the data into the target table (or into shadow tables).

32
ETL Processing

ETL processing is typically performed using resources on the
source system platform(s) or on a dedicated transformation
server.

************************
DIAGRAM to be inserted
************************

33
ETL Processing
Perform the transformations on the source system platform if
available resources exist and there is significant data
reduction that can be achieved during the transformations.

Perform the transformations on a dedicated transformation
server if the source systems are highly distributed, lack
capacity, or have a high cost per unit of computing.

34
ETL Processing

• Two approaches for ETL processing:

Engine: ETL processing using an interpretive engine that
applies transformation rules based on meta data
specifications.
- e.g., Ascential, Informatica, DTS
Code Generation: ETL processing using code generated
from meta data specifications.
- e.g., Ab Initio, ETI, DTS

35
ELT Processing
• First, load 'raw' data into empty tables using RDBMS block
slamming utilities.
Next, use SQL to transform the 'raw' data into a form
appropriate to the target table.
- Ideally, the SQL is generated using a meta data driven tool
rather than hand coded.
Finally, use insert-select into the target table for
incremental loads, or view switching if a full refresh
strategy is used.
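A minimal ELT sketch: raw extract data is assumed to already be block-slammed into a staging table, and the transformation runs as SQL inside the warehouse RDBMS; the table and column names are illustrative, and ideally this SQL would be generated from meta data rather than hand coded:

TRANSFORM_SQL = """
    INSERT INTO customer (customer_id, gender, birth_date)
    SELECT cust_id,
           CASE cust_gndr WHEN '0' THEN 'M' WHEN '1' THEN 'F' END,
           CAST(cust_dob AS DATE)
    FROM raw_customer
"""

def elt_cycle(conn):
    cur = conn.cursor()
    cur.execute(TRANSFORM_SQL)   # the set-oriented transform executes on the DW platform
    conn.commit()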

36
ELT Processing
DW server is the transformation server for ELT processing

************************
DIAGRAM to be inserted
************************

37
ELT Processing

• ELT processing obviates the need for a separate
transformation server.
- Assumes that spare capacity exists on the DW platform to support
transformation operations.
ELT leverages the built-in scalability and manageability of
the parallel RDBMS and HW platform.
Must allocate sufficient staging area space to support the load
of raw data and execution of the transformation SQL.
Works well only for batch-oriented transforms, because SQL
is optimized for set processing.

38
Bottom Line

• ETL is a significant task in any DW deployment.

Many options for data loading strategies: need to evaluate
tradeoffs in performance, data freshness, and compatibility
with the source systems environment.
Many options for ETL/ELT deployment: need to evaluate
tradeoffs in where and how transformations should be
applied.

39
