
High Performance Data Warehouse

Design and Construction

ETL Processing

1
ETL Processing
[Diagram: operational data maintained by IT users flows through data transformation into the enterprise warehouse and integrated data marts, and is then replicated to dependent data marts or departmental warehouses used by business users.]

2
Data Acquisition from OLTP Systems

• Why is it hard?
Multiple source system technologies
Multiple sources for the same data element
Complexity of required transformations
Scarcity and cost of legacy cycles
Inconsistent data representations
Volume of legacy data

3
Data Acquisition from OLTP Systems

Multiple source system technologies

* Flat files            * Excel      * Model 204
* VSAM                  * Access     * DBF format
* IMS                   * Oracle     * RDB
* IDMS                  * Informix   * RMS
* DB2 (many flavors)    * Sybase     * Compressed files
* Adabas                * Ingres     * Many others…

4
Data Acquisition from OLTP Systems
• Inconsistent data representations: same data, different domain

Examples:
Date value representations
- 1996-02-14
- 02/14/1996
- 14-FEB-1996
- 960214
- 14485

Gender value representations
- M/F
- M/F/PM/PF
- 0/1
- 1/2
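A minimal sketch of normalizing the date domain during transformation, assuming the short list of formats above; the epoch used for serial values such as 14485 is purely illustrative (a SAS-style 1960-01-01) and would need to be confirmed against the actual source system:

from datetime import datetime, date, timedelta

# Candidate source formats, tried in order; the first successful parse wins.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y", "%y%m%d"]

def normalize_date(value: str) -> date:
    """Map the different source representations onto a single ISO date domain."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    # Purely numeric values such as '14485' are assumed to be day offsets from a
    # system epoch; 1960-01-01 is only an illustrative guess for that epoch.
    if value.isdigit():
        return date(1960, 1, 1) + timedelta(days=int(value))
    raise ValueError(f"Unrecognized date representation: {value!r}")

# The first four representations above all map to the same ISO date.
assert normalize_date("02/14/1996") == normalize_date("14-FEB-1996") == date(1996, 2, 14)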

5
Data Acquisition from OLTP Systems

• Multiple sources for the same data element

Need to establish precedence between source systems on a per
data element basis.
Take the data element from the source system with the highest
precedence in which the element exists.
Must sometimes establish "group precedence" rules to maintain
data integrity.
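A minimal sketch of per-element precedence, with hypothetical source system names and data elements; in practice the precedence table would be driven from metadata:

# Hypothetical precedence lists per data element: the first source with a value wins.
PRECEDENCE = {
    "birth_date": ["crm", "billing", "legacy_mainframe"],
    "email":      ["billing", "crm", "legacy_mainframe"],
}

def merge_element_values(records_by_source: dict) -> dict:
    """Take each data element from the highest-precedence source where it exists."""
    merged = {}
    for element, source_order in PRECEDENCE.items():
        for source in source_order:
            value = records_by_source.get(source, {}).get(element)
            if value is not None:
                merged[element] = value
                break
    return merged

print(merge_element_values({
    "crm":     {"birth_date": "1971-03-02", "email": None},
    "billing": {"email": "a@example.com"},
}))
# -> {'birth_date': '1971-03-02', 'email': 'a@example.com'}

A "group precedence" rule would extend this so that related elements (for example, all address fields) are always taken together from a single source.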

6
Data Acquisition from OLTP Systems

• Complexity of required transformations

Simple scalar transformations
- 0/1 => M/F
One-to-many element transformations
- 6x30 address field => street1, street2, city, state, zip
Many-to-many element transformations
- Householding and individualization of customer records
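As an illustration of the one-to-many case, here is a heuristic sketch that splits a legacy 6x30-character address block into discrete elements; the layout assumption (the last non-blank line holds "city, state zip") is hypothetical:

def split_address(address_6x30):
    """Split a 6 x 30-character legacy address block into street1/street2/city/state/zip."""
    lines = [line.strip() for line in address_6x30 if line.strip()]
    *street, last = lines                      # assume the last non-blank line is "city, state zip"
    city_state, _, zip_code = last.rpartition(" ")
    city, _, state = city_state.rstrip(",").rpartition(",")
    return {
        "street1": street[0] if street else "",
        "street2": street[1] if len(street) > 1 else "",
        "city": city.strip(),
        "state": state.strip(),
        "zip": zip_code,
    }

print(split_address(["123 Main St", "Apt 4B", "Springfield, IL 62704", "", "", ""]))
# -> {'street1': '123 Main St', 'street2': 'Apt 4B', 'city': 'Springfield', 'state': 'IL', 'zip': '62704'}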

7
Data Acquisition from OLTP Systems

• Scarcity and cost of legacy cycles

Generally want to off-load transformation cycles to an open
systems environment.
Often requires new skill sets.
Need an efficient and easy way to deal with mainframe data
formats such as EBCDIC and packed decimal.
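A small sketch of handling these mainframe formats on an open-systems platform, assuming single-byte EBCDIC (code page 037) text and standard COMP-3 packed-decimal fields:

def ebcdic_to_ascii(raw: bytes) -> str:
    """Decode EBCDIC text using Python's built-in cp037 codec."""
    return raw.decode("cp037")

def unpack_comp3(raw: bytes, scale: int = 0):
    """Decode a packed-decimal (COMP-3) field: two BCD digits per byte,
    with the sign carried in the low nibble of the final byte."""
    digits, sign = [], 1
    for i, b in enumerate(raw):
        hi, lo = b >> 4, b & 0x0F
        if i < len(raw) - 1:
            digits += [hi, lo]
        else:
            digits.append(hi)
            sign = -1 if lo == 0x0D else 1
    value = 0
    for d in digits:
        value = value * 10 + d
    return sign * (value / 10 ** scale if scale else value)

print(ebcdic_to_ascii(b"\xC8\xC5\xD3\xD3\xD6"))    # -> 'HELLO'
print(unpack_comp3(b"\x12\x34\x5C", scale=2))      # -> 123.45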

8
Data Acquisition from OLTP Systems

• Volume of legacy data

Need lots of processing and I/O capacity to effectively handle large
data volumes.
The 2 GB file limit in older versions of UNIX is not acceptable for
handling legacy data – need a full 64-bit file system.
Need efficient interconnect bandwidth to transfer large
amounts of data from legacy sources.

9
Data Acquisition from OLTP Systems

• What does the solution look like?

Meta data driven transformation architecture
Modular software solutions with component building blocks
Parallel software and hardware architecture

10
Data Acquisition from OLTP Systems

• Meta data driven transformation architecture

Need multiple meta data structures
- Source meta data
- Target meta data
- Transformation meta data
Must avoid 'hard coding' for maintainability.
Automatic generation of transformations from meta data
structures.
Meta data repository ideally accessible by APIs and end user
tools.
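A minimal sketch of what avoiding 'hard coding' can look like, with hypothetical source, target, and transformation meta data; the transformation logic is interpreted from the meta data structures rather than written into the job:

# Hypothetical meta data structures; in practice these would live in a repository.
SOURCE_META = {"CUST_GNDR": "char(1)", "CUST_DOB": "char(8)"}
TARGET_META = {"gender": "char(1)", "birth_date": "date"}
TRANSFORM_META = [
    {"source": "CUST_GNDR", "target": "gender",     "rule": "map",           "args": {"0": "M", "1": "F"}},
    {"source": "CUST_DOB",  "target": "birth_date", "rule": "date_yyyymmdd", "args": None},
]

# Rule implementations keyed by name; new rules are registered here, not hard coded in jobs.
RULES = {
    "map":           lambda value, args: args[value],
    "date_yyyymmdd": lambda value, args: f"{value[0:4]}-{value[4:6]}-{value[6:8]}",
}

def transform_row(source_row: dict) -> dict:
    """Apply every transformation described in the meta data to one source row."""
    return {m["target"]: RULES[m["rule"]](source_row[m["source"]], m["args"])
            for m in TRANSFORM_META}

print(transform_row({"CUST_GNDR": "0", "CUST_DOB": "19960214"}))
# -> {'gender': 'M', 'birth_date': '1996-02-14'}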

11
Data Acquisition from OLTP Systems

• Modular software solutions with component building blocks

Want a data-flow driven transformation architecture that
supports multiple processing steps.
Meta data structures should map inputs and outputs between
each transformation module.
Leverage pre-packaged tools for transformation steps
wherever possible.

12
Data Acquisition from OLTP Systems

• Parallel software and hardware architecture

Use data parallelism (partitioning) to allow concurrent
execution of multiple job streams.
Software architecture must allow efficient repartitioning of
data between steps in the transformation process.
Want powerful parallel hardware architectures with many
processors and I/O channels.
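A minimal sketch of data parallelism over file partitions, using Python's multiprocessing pool as a stand-in for a parallel transformation framework; the partition file names and the per-row transform are placeholders:

from multiprocessing import Pool

def transform_partition(path: str) -> str:
    """Transform one partition of the extract; each worker owns one job stream."""
    out_path = path + ".transformed"
    with open(path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(line.upper())      # placeholder for the real transformation logic
    return out_path

if __name__ == "__main__":
    partitions = [f"extract_part_{i:02d}.dat" for i in range(8)]   # hypothetical partition files
    with Pool(processes=8) as pool:      # one concurrent job stream per partition
        outputs = pool.map(transform_partition, partitions)
    print(outputs)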

13
A Word of Warning

• The data quality in the source systems will be much
worse than what you expect.

Must allocate explicit time and resources to facilitate data
clean-up.
Data quality is a continuous improvement process – must
institute a TQM program to be successful.
Use the 'house of quality' technique to prioritize and focus data
quality efforts.

14
ETL Processing
•It is important to look at the big picture.
Data acquisition time may include:
Extracts from source systems
Data movement
Transformations
Data loading
Index maintenance
Statistics collection
Summary data maintenance
Data mart construction
Backups

15
Loading Strategies

• Once we have transformed the data, there are three
primary loading strategies:

Full data refresh with 'block slamming' into an empty
table.
Incremental data refresh with 'block slamming' into
existing (populated) tables.
Trickle feed with continuous data acquisition using
row-level insert and update operations.

16
Loading Strategies
We must also worry about rolling off "old" data as its economic
value drops below the cost of storing and maintaining it.

************************
DIAGRAM to be inserted
************************

17
Loading Strategies

• The choice of loading strategy depends on tradeoffs between data
freshness and loading performance, as well as on the data volatility
characteristics.
What is the goal?
Increased data freshness
Increased data loading performance

[Diagram: a spectrum from real-time availability at low update rates to minimal load time (delayed availability) at high update rates.]

18
Loading Strategies

• Should consider:
Data storage requirements
Impact on query workloads
Ratio of existing to new data
Insert versus update workloads

19
Loading Strategies

Tradeoffs in data loading with a high percentage of data
changes per data block

************************
GRAPH to be inserted
************************

20
Loading Strategies

Tradeoffs in data loading with a low percentage of data
changes per data block

************************
GRAPH to be inserted
************************

21
Full Refresh Strategy

Completely re-load the table on each refresh.

Step 1: Load the table using block slamming
Step 2: Build indexes
Step 3: Collect statistics

This is a good (simple) strategy for small tables or when a
high percentage of rows in the table changes on each
refresh (greater than 10%).
E.g., reference lookup tables, or account tables where balances
change on each refresh.
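A sketch of the three steps, using a hypothetical bulk-load utility name and SQL whose exact syntax varies by RDBMS:

import subprocess

def full_refresh(conn):
    # Step 1: block-slam the extract into the empty table with a vendor bulk loader
    # ('db_bulk_loader' is a hypothetical utility name).
    subprocess.run(["db_bulk_loader", "--table", "account", "--file", "account.dat"], check=True)
    cur = conn.cursor()
    # Step 2: build indexes only after the data is in place.
    cur.execute("CREATE INDEX account_cust_ix ON account (customer_id)")
    # Step 3: refresh optimizer statistics so query plans reflect the new data.
    cur.execute("ANALYZE account")
    conn.commit()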

22
Full Refresh Strategy

• Performance hints:

Remove referential integrity (RI) constraints from
table definitions for loading operations.
- Assume that data cleansing takes place in transformation.
Remove secondary index specifications from the table
definition.
- Build indices after the table has been loaded.
Make sure target table logging is disabled during
loads.

23
Full Refresh Strategy

• Consider using 'shadow' tables to allow the refresh to
take place without impacting query workloads.
Load the shadow table.
Use a replace-view operation to direct queries to the
refreshed table and make the new data visible.

Trades storage for availability.
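A sketch of the shadow-table swap; the table and view names are illustrative and the view-replacement syntax varies by RDBMS:

def refresh_via_shadow(conn, load_shadow_table):
    cur = conn.cursor()
    cur.execute("TRUNCATE TABLE sales_shadow")
    load_shadow_table()      # block slam, build indexes, collect statistics
    # Queries reference the view, so the switch makes the new data visible in one step.
    cur.execute("CREATE OR REPLACE VIEW sales AS SELECT * FROM sales_shadow")
    conn.commit()
    # The previously active table becomes next cycle's shadow: storage traded for availability.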

24
Incremental Refresh Strategy

• Incrementally load new data into an existing target
table that has already been populated by previous
loads.

Two primary strategies:

Incremental load directly into the target table.
Shadow table load followed by an insert-select
operation into the target table.

25
Incremental Refresh Strategy

• Design considerations for incremental load directly into the
target using RDBMS utilities:
Indices should be maintained automatically.
Re-collect statistics if demographics have changed
significantly.
Typically requires a table lock to be taken during block
slamming operations.
Do you want to allow 'dirty' reads?
Logging behavior differs across RDBMS products.

26
Incremental Refresh Strategy

• Design considerations for the shadow table implementation:

Use block slamming into an empty 'shadow' table having an
identical structure to the target table.
Staging space is required for the shadow table.
The insert-select operation from the shadow table to the target
table will preserve indices.
Locking will normally escalate to a table-level lock.
Beware of log file size constraints.
Beware of the performance overhead of logging.
Beware of rollbacks if the operation fails for any reason.
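A sketch of the shadow-table variant; object names are illustrative, and the insert-select is the step where index maintenance, table-level locking, and logging costs show up:

def incremental_refresh_via_shadow(conn, load_shadow_table):
    cur = conn.cursor()
    # 1. Block-slam the delta into the empty shadow table (same structure as the target).
    load_shadow_table()
    # 2. Insert-select into the target preserves its indices, but normally escalates to a
    #    table-level lock and is fully logged, so watch log space and rollback cost.
    cur.execute("INSERT INTO sales SELECT * FROM sales_shadow")
    conn.commit()
    # 3. Release the staging space for the next cycle.
    cur.execute("TRUNCATE TABLE sales_shadow")
    conn.commit()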

27
Incremental Refresh Strategy
• Both incremental load strategies described preserve index
structures during the loading operation.
However, there is a cost to maintaining indexes during the
load.
Rule of thumb: Each secondary index maintained during
the load costs 2-3 times the resources of the actual row
insertion of data into the table.
Rule of thumb: Consider dropping and re-building index
structures if the number of rows being incrementally
loaded is more than 10% of the size of the target table.
Note: Drop and re-build of secondary indices may not be acceptable
due to the availability requirements of the DW.

28
Trickle Feed
• Acquire data on a continuous basis into the RDBMS
using row-level SQL insert and update operations.

Data is made available to the DW 'immediately' rather than
waiting for a batch load to complete.
Much higher overhead for data acquisition on a per-record
basis as compared to batch strategies.
Row-level locking mechanisms allow queries to proceed
during data acquisition.
Typically relies on Enterprise Application Integration (EAI)
for data delivery.

29
Trickle Feed

• A tradeoff exists between data freshness and
insert efficiency.
Buffering rows for insertion allows for fewer round trips to the
RDBMS…
… but waiting to accumulate rows into the buffer impacts
data freshness.

Suggested approach: Use a threshold that buffers up to M
rows, but never waits more than N seconds before
sending a buffer of data for insertion.
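A minimal sketch of the M-rows-or-N-seconds threshold, assuming a DB-API connection and an illustrative target table; a production version would also flush from a timer so a quiet feed does not hold rows past N seconds:

import time

class BufferedInserter:
    """Buffer rows and flush when M rows accumulate or the oldest buffered row is N seconds old."""
    def __init__(self, conn, max_rows=500, max_wait_secs=5.0):
        self.conn, self.max_rows, self.max_wait = conn, max_rows, max_wait_secs
        self.buffer, self.oldest = [], None

    def add(self, row):
        if self.oldest is None:
            self.oldest = time.monotonic()
        self.buffer.append(row)
        if len(self.buffer) >= self.max_rows or time.monotonic() - self.oldest >= self.max_wait:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        cur = self.conn.cursor()
        cur.executemany(
            "INSERT INTO sales_fact (sale_id, amount, sale_ts) VALUES (%s, %s, %s)",
            self.buffer)
        self.conn.commit()
        self.buffer, self.oldest = [], None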

30
ELT versus ETL

• There are two fundamental approaches to data
acquisition:
ETL is extract, transform, load, in which transformation takes
place on a transformation server using either an 'engine' or
generated code.
ELT is extract, load, transform, in which data transformations
take place in the relational database on the data warehouse
server.

Of course, hybrids are also possible…

31
ETL Processing

• ETL processing performs the transform operations prior
to loading the data into the RDBMS.
Extract data from the source systems.
Transform data into a form consistent with the target
tables.
Load the data into the target table (or into shadow tables).

32
ETL Processing

ETL processing is typically performed using resources on the
source system platform(s) or on a dedicated transformation
server.

************************
DIAGRAM to be inserted
************************

33
ETL Processing
Perform the transformations on the source system platform if
available resources exist and there is significant data
reduction that can be achieved during the transformations.

Perform the transformations on a dedicated transformation
server if the source systems are highly distributed, lack
capacity, or have a high cost per unit of computing.

34
ETL Processing

• Two approaches for ETL processing:

Engine: ETL processing using an interpretive engine that
applies transformation rules based on meta data
specifications.
- e.g., Ascential, Informatica, DTS
Code Generation: ETL processing using code generated
from meta data specifications.
- e.g., Ab Initio, ETI, DTS

35
ELT Processing
• First, load 'raw' data into empty tables using RDBMS block
slamming utilities.
Next, use SQL to transform the 'raw' data into a form
appropriate to the target table.
- Ideally, the SQL is generated using a meta data driven tool
rather than hand coded.
Finally, use insert-select into the target table for
incremental loads, or view switching if a full refresh
strategy is used.
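A minimal ELT sketch: raw extract data is assumed to already be block-slammed into a staging table, and the transformation runs as SQL inside the warehouse RDBMS; the table and column names are illustrative, and ideally this SQL would be generated from meta data rather than hand coded:

TRANSFORM_SQL = """
    INSERT INTO customer (customer_id, gender, birth_date)
    SELECT cust_id,
           CASE cust_gndr WHEN '0' THEN 'M' WHEN '1' THEN 'F' END,
           CAST(cust_dob AS DATE)
    FROM raw_customer
"""

def elt_cycle(conn):
    cur = conn.cursor()
    cur.execute(TRANSFORM_SQL)   # the set-oriented transform executes on the DW platform
    conn.commit()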

36
ELT Processing
DW server is the transformation server for ELT processing

************************
DIAGRAM to be inserted
************************

37
ELT Processing

• ELT processing obviates the need for a separate
transformation server.
- Assumes that spare capacity exists on the DW platform to support
transformation operations.
ELT leverages the built-in scalability and manageability of
the parallel RDBMS and HW platform.
Must allocate sufficient staging area space to support the load
of raw data and execution of the transformation SQL.
Works well only for batch-oriented transforms, because SQL
is optimized for set processing.

38
Bottom Line

• ETL is a significant task in any DW deployment.

Many options for data loading strategies: need to evaluate
tradeoffs in performance, data freshness, and compatibility
with the source systems environment.
Many options for ETL/ELT deployment: need to evaluate
tradeoffs in where and how transformations should be
applied.

39
