
Extract Transform Load Cycle

Extract Transform Load


●Extract data from the operational systems,
transform it, and load it into the data warehouse
Challenges in ETL process
-ETL functions are challenging because of the nature of source systems
● Diverse and disparate
● Different operating systems/platforms
● May not preserve historical data
● Quality of data may not be guaranteed in the older
operational source systems
● Structures keep changing with time
● Prevalence of data inconsistency in the source system.
● Data may be stored in cryptic form
● Data types, formats, and naming conventions may differ
Steps for ETL process
● Determine all the target data needed
● Determine all the data sources, both internal and external
● Prepare data mapping for target data elements from
sources
● Determine data transformation and cleansing rules
● Plan for aggregate tables
● Organize data staging area and test tools
● Write procedures for all data loads
● ETL for dimension tables
● ETL for fact tables
The ETL Process

●Capture/Extract
●Scrub or data cleansing
●Transform
●Load
ETL = Extract, Transform, and Load
The ETL Process
[Figure: Source Systems, Staging Area, Presentation System;
stages: Extract, Transform, Load]


Data Extraction
●Often performed by COBOL routines
(not recommended because of high program
maintenance and no automatically generated
metadata)
●Sometimes source data is copied to the target
database using the replication capabilities of
a standard RDBMS (not recommended because of
“dirty data” in the source systems)
●Specialized ETL software
Data Extraction Techniques
●Immediate data extraction (real time)
-capture through transaction logs
-capture through database triggers
-capture through source applications
●Deferred data extraction (capture happens later)
-capture based on date and timestamp
-capture by comparing files
Capture through transaction logs

●Does not provide much flexibility for
capturing specifications
●Does not affect the performance of source
systems
●Does not require any revisions to the
existing source applications
●Cannot be used on a file-oriented system
Capture through database triggers

●Does not provide much flexibility for
capturing specifications
●Adds some overhead to the source systems,
because triggers fire inside the source
transactions
●Does not require any revisions to the
existing source applications
●Cannot be used on a file-oriented system
●Cannot be used on a legacy system
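
Trigger-based capture can be sketched with SQLite from the Python standard library. This is only an illustration under assumed names: the `customer` table, the `customer_changes` change table, and the operation codes `'I'` and `'U'` are hypothetical and do not come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);

-- Hypothetical change table that the extract job would read later.
CREATE TABLE customer_changes (
    change_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id INTEGER,
    operation   TEXT,
    name        TEXT
);

-- Each trigger records the change as part of the source transaction.
CREATE TRIGGER customer_ins AFTER INSERT ON customer
BEGIN
    INSERT INTO customer_changes (customer_id, operation, name)
    VALUES (NEW.id, 'I', NEW.name);
END;

CREATE TRIGGER customer_upd AFTER UPDATE ON customer
BEGIN
    INSERT INTO customer_changes (customer_id, operation, name)
    VALUES (NEW.id, 'U', NEW.name);
END;
""")

# The source application runs unchanged; it knows nothing about the capture.
conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Ann Lee')")
conn.execute("UPDATE customer SET name = 'Ann Li' WHERE id = 1")

captured = conn.execute(
    "SELECT customer_id, operation, name FROM customer_changes ORDER BY change_id"
).fetchall()
```

The application code needs no revision, which matches the bullet above, but every captured change costs one extra write in the source database.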
Capture in Source Application
●Provides flexibility for capturing
specification
●Does not affect the performance of source
systems
●Requires the existing source systems to
be revised
●Can be used on a file oriented system
●Can be used on a legacy system
Capture based on date and timestamp
●Provides flexibility for capturing
specification
●Does not affect the performance of source
systems
●Requires the existing source systems to
be revised
●Can be used on a file oriented system
●Cannot be used on a legacy system
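
A minimal sketch of timestamp-based capture, assuming the source table carries a `last_modified` column (a hypothetical name) stored as ISO 8601 text so that string comparison follows time order:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, last_modified TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, "2024-01-01T09:00:00"),
        (2, 25.0, "2024-01-02T14:30:00"),
        (3, 40.0, "2024-01-03T08:15:00"),
    ],
)


def extract_since(conn, last_extract):
    """Return the rows modified after the previous extract's timestamp."""
    return conn.execute(
        "SELECT id, amount, last_modified FROM orders "
        "WHERE last_modified > ? ORDER BY last_modified",
        (last_extract,),
    ).fetchall()


# Deferred capture: run later, picking up everything changed since the last run.
changed = extract_since(conn, "2024-01-02T00:00:00")
```

Rows deleted from the source leave no timestamp behind, so a sketch like this cannot see deletions.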
Capture by comparing files
●Provides flexibility for capturing
specification
●Does not affect the performance of source
systems
●Does not require the existing source
systems to be revised
●May be used on a file-oriented system
●May be used on a legacy system
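
File comparison can be sketched as a diff of two snapshots. The sketch assumes each snapshot has already been read into a dictionary keyed by primary key; the sample records are made up for illustration.

```python
def compare_snapshots(previous, current):
    """Diff two snapshots, each a dict of primary key -> row."""
    inserted = {k: current[k] for k in current.keys() - previous.keys()}
    deleted = {k: previous[k] for k in previous.keys() - current.keys()}
    updated = {
        k: current[k]
        for k in current.keys() & previous.keys()
        if current[k] != previous[k]
    }
    return inserted, updated, deleted


yesterday = {
    1: {"name": "Ann", "city": "Pune"},
    2: {"name": "Raj", "city": "Delhi"},
}
today = {
    2: {"name": "Raj", "city": "Mumbai"},
    3: {"name": "Mei", "city": "Goa"},
}

inserted, updated, deleted = compare_snapshots(yesterday, today)
```

Because the comparison works only on the two extracted files, the source applications stay untouched, but both full snapshots must be kept and scanned.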
Data Cleansing
●Source systems contain “dirty data” that must be
cleansed
●ETL software contains rudimentary data
cleansing capabilities
●Specialized data cleansing software is often
used. Important for performing name and
address correction and householding functions
●Leading data cleansing vendors include Vality
(Integrity), Harte-Hanks (Trillium), and Firstlogic
(i.d.Centric)
Reasons for “Dirty” Data
●Dummy Values
●Absence of Data
●Multipurpose Fields
●Cryptic Data
●Contradicting Data
●Inappropriate Use of Address Lines
●Violation of Business Rules
●Reused Primary Keys
●Non-Unique Identifiers
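
Two of the causes listed above, dummy values and absence of data, are easy to detect mechanically. The following is a rough sketch; the `DUMMY_VALUES` set and the field names are hypothetical and would have to come from profiling the real sources.

```python
# Hypothetical placeholder values; a real list comes from data profiling.
DUMMY_VALUES = {"999-99-9999", "N/A", "UNKNOWN", "00000"}


def audit_record(record, required_fields):
    """Return (field, problem) pairs for absent or dummy values."""
    problems = []
    for field in required_fields:
        value = (record.get(field) or "").strip()
        if not value:
            problems.append((field, "absent"))
        elif value.upper() in DUMMY_VALUES:
            problems.append((field, "dummy"))
    return problems


issues = audit_record(
    {"name": "Ann", "ssn": "999-99-9999", "zip": ""},
    ["name", "ssn", "zip"],
)
```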
Steps in Data Cleansing
●Parsing
●Correcting
●Standardizing
●Matching
●Consolidating
Parsing
●Parsing locates and identifies individual
data elements in the source files and then
isolates these data elements in the target
files.
●Examples include parsing the first, middle,
and last name; street number and street
name; and city and state.
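
The parsing examples above can be sketched as follows. The input layouts (space-separated names, a leading street number, "City, ST") are assumptions; real source files need rules tuned to their own formats.

```python
import re

# Assumed layout: a leading street number followed by the street name.
ADDRESS_RE = re.compile(r"^(?P<street_number>\d+)\s+(?P<street_name>.+)$")


def parse_name(full_name):
    """Split a full name into first, middle, and last components."""
    parts = full_name.split()
    if len(parts) < 2:
        return None  # cannot isolate first and last name
    return {
        "first": parts[0],
        "middle": " ".join(parts[1:-1]),
        "last": parts[-1],
    }


def parse_address(line):
    """Split an address line into street number and street name."""
    match = ADDRESS_RE.match(line.strip())
    return match.groupdict() if match else None


def parse_city_state(text):
    """Split 'City, ST' into city and state."""
    if "," not in text:
        return None
    city, state = (part.strip() for part in text.split(",", 1))
    return {"city": city, "state": state}


name = parse_name("John Quincy Adams")
address = parse_address("221 Baker Street")
place = parse_city_state("Austin, TX")
```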
Correcting
●Corrects parsed individual data
components using sophisticated data
algorithms and secondary data sources.
●Examples include replacing a vanity
address and adding a zip code.
Standardizing
●Standardizing applies conversion routines
to transform data into its preferred (and
consistent) format using both standard and
custom business rules.
●Examples include adding a prename (title),
replacing a nickname, and using a
preferred street name.
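
A small sketch of standardizing with lookup tables. Both tables are hypothetical stand-ins for the standard and custom business rules mentioned above; for example, a real rule set would have to decide whether "St" means "Street" or "Saint".

```python
# Hypothetical conversion tables standing in for the business rules.
NICKNAMES = {"bob": "Robert", "bill": "William", "liz": "Elizabeth"}
STREET_SUFFIXES = {
    "st": "Street", "st.": "Street",
    "ave": "Avenue", "ave.": "Avenue",
    "rd": "Road", "rd.": "Road",
}


def standardize_first_name(name):
    """Replace a known nickname; otherwise just normalize the casing."""
    cleaned = name.strip()
    return NICKNAMES.get(cleaned.lower(), cleaned.title())


def standardize_street(street):
    """Expand abbreviated suffixes to the preferred street name form."""
    return " ".join(STREET_SUFFIXES.get(w.lower(), w) for w in street.split())


first = standardize_first_name("Bob")
street = standardize_street("12 Main St.")
```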
Matching
●Searching and matching records within
and across the parsed, corrected and
standardized data based on predefined
business rules to eliminate duplications.
●Examples include identifying similar
names and addresses.
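
Matching on similar names and addresses can be approximated with `difflib` from the Python standard library. The 0.9 threshold and the choice of comparing name and address as one string are assumptions for this sketch; commercial cleansing tools use far more elaborate rules.

```python
from difflib import SequenceMatcher
from itertools import combinations

records = {
    1: ("Jon Smith", "12 Main Street"),
    2: ("John Smith", "12 Main Street"),
    3: ("Mary Jones", "9 Oak Avenue"),
}

THRESHOLD = 0.9  # assumed cut-off; tune against real data


def similarity(a, b):
    """Similarity in [0, 1] between two (name, address) records."""
    key_a = "|".join(a).lower()
    key_b = "|".join(b).lower()
    return SequenceMatcher(None, key_a, key_b).ratio()


# Compare every pair of records and keep the likely duplicates.
pairs = [
    (id_a, id_b)
    for id_a, id_b in combinations(sorted(records), 2)
    if similarity(records[id_a], records[id_b]) >= THRESHOLD
]
```

Comparing every pair is quadratic in the number of records, so this shape only suits small batches.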
Consolidating
●Analyzing and identifying relationships
between matched records and
consolidating/merging them into ONE
representation.
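
Consolidation needs a survivorship rule that decides which value wins for each field. The sketch below assumes the simplest rule, "the first non-empty value wins"; this rule is an illustration, not one prescribed by the slides.

```python
def consolidate(matched_records):
    """Merge matched records into one, keeping the first non-empty value per field."""
    merged = {}
    for record in matched_records:
        for field, value in record.items():
            if not merged.get(field):
                merged[field] = value
    return merged


matched = [
    {"name": "John Smith", "phone": "", "email": "js@example.com"},
    {"name": "John Smith", "phone": "555-0100", "email": ""},
]

golden = consolidate(matched)
```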
Data Transformation
●Transforms the data in accordance with
the business rules and standards that
have been established
●Examples include format changes, de-duplication,
splitting up fields, replacement of codes,
derived values, and aggregates
●Deals with rectifying any inconsistency
●Attribute naming inconsistencies are resolved
first; once all the data elements have the right
names, they must be converted into common
formats
●Data format has to be standardized
●All the transformation activities are
automated
●Tool: DataMapper
Basic tasks in Transformation
●Selection
●Splitting/joining
●Conversion
●Summarization
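
The four basic tasks can be sketched on a few in-memory rows. The field names, the status code "A", and the region code table are all made up for illustration.

```python
from collections import defaultdict

source_rows = [
    {"cust": "Ann Lee", "status": "A", "amount": "10.50", "region": "N"},
    {"cust": "Raj Rao", "status": "I", "amount": "99.00", "region": "S"},
    {"cust": "Mei Lin", "status": "A", "amount": "4.50", "region": "N"},
]
REGION_CODES = {"N": "North", "S": "South"}  # hypothetical code table

# Selection: keep only the rows the warehouse needs (here, active customers).
active = [row for row in source_rows if row["status"] == "A"]

transformed = []
for row in active:
    # Splitting: one source field becomes two target fields.
    first_name, last_name = row["cust"].split(" ", 1)
    transformed.append({
        "first_name": first_name,
        "last_name": last_name,
        # Conversion: text to number, and cryptic code to readable value.
        "amount": float(row["amount"]),
        "region": REGION_CODES[row["region"]],
    })

# Summarization: aggregate amounts per region.
totals = defaultdict(float)
for row in transformed:
    totals[row["region"]] += row["amount"]
```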
Data Loading
●Data are physically moved to the data
warehouse
●The loading takes place within a “time
window”
●The trend is to near real time updates of
the data warehouse as the warehouse is
increasingly used for operational
applications
Different modes in which data can be
applied to the warehouse
●Load
●Append
●Merge
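
The slide only names the three modes, so the sketch below rests on an assumed reading: load replaces the target's contents, append adds incoming rows while keeping existing ones, and merge updates rows whose key already exists and inserts the rest. The `product` table is hypothetical.

```python
import sqlite3


def make_target():
    """Create a small target table that already holds two rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, price REAL)")
    conn.executemany("INSERT INTO product VALUES (?, ?)", [(1, 5.0), (2, 7.0)])
    return conn


def load(conn, rows):
    # Load: wipe the existing data, then apply the incoming data.
    conn.execute("DELETE FROM product")
    conn.executemany("INSERT INTO product VALUES (?, ?)", rows)


def append(conn, rows):
    # Append: keep existing rows; incoming rows with an existing key are skipped.
    conn.executemany("INSERT OR IGNORE INTO product VALUES (?, ?)", rows)


def merge(conn, rows):
    # Merge: replace rows whose key matches, insert the others.
    conn.executemany("INSERT OR REPLACE INTO product VALUES (?, ?)", rows)


def contents(conn):
    return conn.execute("SELECT id, price FROM product ORDER BY id").fetchall()


incoming = [(2, 8.0), (3, 9.0)]
results = {}
for mode in (load, append, merge):
    conn = make_target()
    mode(conn, incoming)
    results[mode.__name__] = contents(conn)
```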
Loading Techniques
●Initial load
●Incremental load
●Full refresh
Sample ETL Tools
●Teradata Warehouse Builder from Teradata
●DataStage from Ascential Software
●SAS System from SAS Institute
●Power Mart/Power Center from Informatica
●Sagent Solution from Sagent Software
●Hummingbird Genio Suite from Hummingbird
Communications
Steps in data reconciliation

Capture = extract…obtaining a
snapshot of a chosen subset of the
source data for loading into the data
warehouse
Static extract = capturing a snapshot of the source data
at a point in time

Incremental extract = capturing changes that have
occurred since the last static extract
Steps in data reconciliation (continued)

Scrub = cleanse…uses pattern
recognition and AI techniques to
upgrade data quality

Fixing errors: misspellings, erroneous dates, incorrect field
usage, mismatched addresses, missing data, duplicate data,
inconsistencies

Also: decoding, reformatting, time stamping, conversion,
key generation, merging, error detection/logging,
locating missing data
Steps in data reconciliation (continued)

Transform = convert data from
format of operational system to
format of data warehouse

Record-level:
Selection – data partitioning
Joining – data combining
Aggregation – data summarization

Field-level:
single-field – from one field to one field
multi-field – from many fields to one, or one field to many
Steps in data reconciliation (continued)

Load/Index = place transformed data
into the warehouse and create
indexes

Refresh mode: bulk rewriting of target data at periodic
intervals

Update mode: only changes in source data are written to
data warehouse
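
The two modes, together with the index step, can be sketched with SQLite. The `sales_fact` table and the index name are hypothetical, and dropping the index before a bulk rewrite is a common practice assumed here rather than something the slides state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact ("
    "sale_id INTEGER PRIMARY KEY, product_id INTEGER, amount REAL)"
)


def refresh(conn, rows):
    """Refresh mode: rewrite the whole target in bulk, then rebuild the index."""
    conn.execute("DROP INDEX IF EXISTS idx_sales_product")
    conn.execute("DELETE FROM sales_fact")
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_sales_product ON sales_fact (product_id)")


def update(conn, changed_rows):
    """Update mode: write only the changed rows; the index is maintained in place."""
    conn.executemany("INSERT OR REPLACE INTO sales_fact VALUES (?, ?, ?)", changed_rows)


refresh(conn, [(1, 100, 20.0), (2, 101, 35.0)])
update(conn, [(2, 101, 30.0), (3, 100, 12.5)])

rows = conn.execute(
    "SELECT sale_id, product_id, amount FROM sales_fact ORDER BY sale_id"
).fetchall()
index_names = [
    r[0]
    for r in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'sales_fact'"
    )
]
```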