
UNIT – III

ETL: Data Extraction, Transformation, Cleansing, Loading


Data Warehouse Information Flows
Components of a Data Warehouse
Types of difficulties in ETL functions
• Source systems are very diverse and disparate
• Multiple platforms
• Different operating systems
• Legacy applications running on obsolete database technologies
• Generally, historical data on changes in values is not preserved in source operational systems
• Quality of data is dubious in many old source systems that have evolved over time
• Source system structures keep changing over time because of new business conditions
• A gross lack of consistency among source systems is common
• Even when inconsistent data is detected among disparate source systems, the lack of a means for resolving mismatches escalates the problem of inconsistency
• Most source systems do not represent data in types or formats that are meaningful to the users.
Many representations are cryptic and ambiguous
ETL is time-consuming and arduous
• It is not uncommon for a project team to spend as much as 50–70% of the project effort on ETL
functions
• The activities during extraction include determining
• metadata on the source systems
• information on every database
• Information on data structure
• database size
• volatility of the data
• mechanism for capturing changes to data in each of the relevant source systems
Activities for Transformation and Loading
• reformat internal data structures
• Re-sequence data
• apply various forms of conversion techniques
• supply default values wherever values are missing
• design the whole set of aggregates that are needed for performance improvement
• May convert from EBCDIC to ASCII formats (a sketch follows this list)
• The sheer size of the initial load can run to millions of rows in the data warehouse database
• create and manage load images for such large volumes
• testing and applying the load images to actually populate the physical files in the data warehouse
• Sometimes, it may take two or more weeks to complete the initial physical loading
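The EBCDIC-to-ASCII step can be sketched with Python's standard codec support; the sample bytes below are purely illustrative:

```python
# A minimal sketch of an EBCDIC-to-ASCII conversion using Python's built-in
# "cp500" (EBCDIC) codec; the sample byte string is purely illustrative.

def ebcdic_to_ascii(raw: bytes) -> str:
    """Decode an EBCDIC byte string and re-encode it as plain ASCII."""
    text = raw.decode("cp500")                   # EBCDIC -> Unicode
    return text.encode("ascii", "replace").decode("ascii")

# "HELLO" in EBCDIC: H=0xC8, E=0xC5, L=0xD3, O=0xD6
print(ebcdic_to_ascii(b"\xC8\xC5\xD3\xD3\xD6"))  # -> HELLO
```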
Major Steps in ETL process
Extraction
List of Data Extraction issues
• Source Identification
• identify source applications and source structures
• Method of extraction
• for each data source, define whether the extraction process is manual or tool-based
• Extraction frequency
• for each data source, establish how frequently the data extraction must be done—daily, weekly, quarterly, and so on
• Time window
• for each data source, denote the time window for the extraction process
• Job sequencing
• determine whether the beginning of one job in an extraction job stream has to wait until the previous job has
finished successfully
• Exception handling
• determine how to handle input records that cannot be extracted (a sketch follows this list)
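A minimal sketch of the exception-handling step, assuming a pipe-delimited source file with three fields per record; unparseable records go to a reject file instead of aborting the extract:

```python
import csv

def extract(source_path: str, out_path: str, reject_path: str) -> None:
    """Copy good records to a CSV load file; route bad ones to a reject file."""
    with open(source_path) as src, \
         open(out_path, "w", newline="") as out, \
         open(reject_path, "w") as rejects:
        writer = csv.writer(out)
        for lineno, line in enumerate(src, start=1):
            try:
                # Each record must have exactly three fields, the last numeric.
                key, name, amount = line.rstrip("\n").split("|")
                writer.writerow([key, name, float(amount)])
            except ValueError:
                rejects.write(f"line {lineno}: {line}")  # keep for review
```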
Source identification
Data in operational systems
Types of data extractions
1. “as is” (static) data
• capture of data at a given point in time
• snapshot of the relevant source data at a certain point in time
• primarily for the initial load of the data warehouse
• May also be used for full refresh of a dimension table
2. data of revisions
• incremental data capture
• revisions since the last time data was captured
• may be immediate or deferred
Immediate data extraction
1. Capture through transaction logs
• This option uses the transaction logs of the
DBMSs maintained for recovery from possible
failures
• no extra overhead in the operational systems
• However, indexed and other flat files do not
have log files
2. Capture through database triggers
• triggers are special stored procedures (programs) stored in the database and fired when certain predefined events occur
• a trigger program can capture all updates and deletes in a given table
• Additional development effort (see the sketch after this list)
3. Capture in source applications
• application-assisted data capture
• may be used for all types of source data
• revise the programs to write all adds, updates,
and deletes to the source files and database
tables
• may degrade the performance of the source
applications
• additional processing is needed to capture the changes in separate files
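A minimal sketch of option 2 (trigger-based capture), using SQLite from Python's standard library so it runs as-is; the product table and change-log table are made-up examples, and a real source system would use its own DBMS's trigger syntax:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, unit_cost REAL);
CREATE TABLE product_changes (
    product_id INTEGER,
    operation  TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);

-- fired on every update and delete; the change table feeds the extract job
CREATE TRIGGER trg_product_upd AFTER UPDATE ON product
BEGIN
    INSERT INTO product_changes (product_id, operation)
    VALUES (NEW.product_id, 'U');
END;

CREATE TRIGGER trg_product_del AFTER DELETE ON product
BEGIN
    INSERT INTO product_changes (product_id, operation)
    VALUES (OLD.product_id, 'D');
END;
""")

conn.execute("INSERT INTO product VALUES (1, 9.50)")
conn.execute("UPDATE product SET unit_cost = 9.75 WHERE product_id = 1")
conn.execute("DELETE FROM product WHERE product_id = 1")
print(conn.execute("SELECT product_id, operation FROM product_changes").fetchall())
# -> [(1, 'U'), (1, 'D')]
```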
Data extraction using replication technology
Deferred Data Extraction
• Capture Based on Date and Time Stamp
• Every time a source record is created or updated
it is marked with a stamp showing the date and
time
• The time stamp provides the basis for selecting
records for data extraction
• Here the data capture occurs at a later time
• any type of source file
• Any intermediary states between two data
extraction runs are lost
• Deletion of source records presents a special
problem
• Capture by Comparing Files
• also called the snapshot differential technique
• compares two snapshots of the source data
• full file comparison between today’s copy of
the product data and yesterday’s copy
• Also compare the record keys to find the
inserts and deletes
• comparison of full rows in a large file can be
very inefficient
• may be the only feasible option for some
legacy data sources that do not have
transaction logs or time stamps on source
records
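Minimal sketches of both deferred-capture options, with source records modeled as plain Python dicts and illustrative field names:

```python
from datetime import datetime

def capture_by_timestamp(records, last_run: datetime):
    """Date/time-stamp capture: keep rows touched since the last extract run.
    Intermediate states between two runs are lost, as noted above."""
    return [r for r in records if r["updated_at"] > last_run]

def snapshot_diff(yesterday: dict, today: dict):
    """File-comparison capture: diff two keyed snapshots of the source file."""
    inserts = [today[k] for k in today.keys() - yesterday.keys()]
    deletes = [yesterday[k] for k in yesterday.keys() - today.keys()]
    updates = [today[k] for k in today.keys() & yesterday.keys()
               if today[k] != yesterday[k]]
    return inserts, updates, deletes
```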
Data capture techniques: advantages and disadvantages
Data Transformation
Introduction
• Extracted data is raw data and cannot be applied to the data warehouse as is
• the combined data must not violate any business rules
• the quality of data coming from legacy systems is often not good enough for the data warehouse
Data transformation: Basic tasks
1. Selection
• In certain cases, the composition of the source structure may not be amenable to selection of
the necessary parts during data extraction
• extract the whole record and then do the selection as part of the transformation function
2. Splitting/joining
• covers the data manipulation performed on the selected parts of source records
• Joining of parts selected from many source systems is more widespread in the data
warehouse environment
• Less commonly, the selected parts may be split even further during data transformation
3. Conversion
• includes a large variety of rudimentary conversions of single fields, for two primary reasons (a sketch follows this list)
• to standardize among the data extracted from disparate source systems
• to make the fields usable and understandable to the users
4. Summarization
• summarize detailed source data to the level of granularity chosen for the data warehouse

5. Enrichment
• rearrangement and simplification of individual fields to make them more useful for the data
warehouse environment
• one or more fields from the same input record may be used to create a better view of the
data for the data warehouse
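A sketch of the conversion and enrichment tasks applied to one input record; the source codes and field names are assumptions for illustration:

```python
GENDER_CODES = {"M": "Male", "F": "Female", "1": "Male", "2": "Female"}

def transform(record: dict) -> dict:
    return {
        "customer_key": int(record["cust_no"]),                # conversion: data type
        "gender": GENDER_CODES.get(record["sex"], "Unknown"),  # conversion: standardize
        # enrichment: a better view of the data, derived from two input fields
        "full_name": f"{record['first_name']} {record['last_name']}".title(),
    }

print(transform({"cust_no": "0042", "sex": "1",
                 "first_name": "ADA", "last_name": "LOVELACE"}))
# -> {'customer_key': 42, 'gender': 'Male', 'full_name': 'Ada Lovelace'}
```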
Major Transformation Types
• Format Revisions
• changes to the data types and lengths of individual fields
• standardize and change the data type
• Decoding of Fields
• same data items described by a plethora of field values in source systems
• some systems may have cryptic codes to represent business values
• Calculated and Derived Values
• Ex. profit margin, average daily balances, operating ratios
• Splitting of Single Fields
• To improve the operating performance
• users may need to perform analysis by using individual components such as city, state, and
Zip Code
• Merging of Information
• information about a single subject may come from multiple sources
• Character Set Conversion
• Ex. EBCDIC format to ASCII format
• Conversion of Units of Measurements
• Date/Time Conversion
• Summarization
• creating summaries to be loaded in the data warehouse
• Key Restructuring
• replace operational keys with built-in meanings (such as product codes that encode a warehouse or sales territory) with generic surrogate keys
• Deduplication
• remove duplicate records for the same entity, such as multiple records for the same customer (sketches follow this list)
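Sketches of three of the transformation types listed above; the address layout and status codes are assumptions:

```python
STATUS_CODES = {"A": "Active", "I": "Inactive"}      # decoding of fields

def split_city_state_zip(address: str):
    """Splitting of a single field: 'City, ST 12345' -> components."""
    city, rest = address.split(", ")
    state, zip_code = rest.split(" ")
    return city, state, zip_code

def deduplicate(records):
    """Deduplication: keep the first record seen for each business key."""
    seen, unique = set(), []
    for r in records:
        if r["customer_key"] not in seen:
            seen.add(r["customer_key"])
            unique.append(r)
    return unique

print(split_city_state_zip("Springfield, IL 62701"))
# -> ('Springfield', 'IL', '62701')
```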
Typical data source environment
Data Integration and Consolidation - issues
• Entity Identification Problem
• Multiple source systems/departments may have their own unique key for customer
identification
• the problem of identification: which of the records relate to the same subject
• design complex algorithms to match records from multiple sources and then review the
exceptions to automated procedures
• May be done in two phases
• First phase: assign unique identifiers to all records, irrespective of whether they are duplicates or not
• Second phase: reconcile the duplicates periodically through automatic algorithms and manual verification (a sketch follows this section)
• Multiple Sources Problem
• single data element may have more than one source
• Ex. unit cost of products may be available from two systems
• standard costing application
• order processing system
• There could be slight variations in the cost figures from these two systems
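A simplified sketch of the two-phase approach described above; matching on a normalized name plus birth date stands in for the far more complex matching algorithms used in practice:

```python
from collections import defaultdict
from itertools import count

_next_id = count(1)

def phase_one(record: dict) -> dict:
    """Phase one: assign a unique identifier, duplicate or not."""
    record["surrogate_id"] = next(_next_id)
    return record

def phase_two(records):
    """Phase two: group likely duplicates for automatic/manual reconciliation."""
    groups = defaultdict(list)
    for r in records:
        match_key = (r["name"].strip().lower(), r["birth_date"])
        groups[match_key].append(r["surrogate_id"])
    return {k: ids for k, ids in groups.items() if len(ids) > 1}
```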
Loading
Introduction
• Loading is taking the prepared data and applying it to the data warehouse
• loading the data warehouse may take an inordinate amount of time, so loads are generally a cause for great concern
• During the loads, the data warehouse has to be offline
• loads may be scheduled for a window of time without affecting your data warehouse users
• the whole load process may be divided into smaller chunks and only a few files may be
populated at a time
• Certain load images may fail to load, so procedures must be provided to handle such failures
• the quality of the loaded records must also be ensured
Types of loads
1. Initial Load
• populating all the data warehouse tables for the very first time

2. Incremental Load
• applying ongoing changes as necessary in a periodic manner

3. Full Refresh
• completely erasing the contents of one or more tables and reloading with
fresh data
Staging area and warehouse links
• If the data staging area and the data warehouse database are on the same server, it saves the
effort of moving the load images to the data warehouse server
• When the staging area files and the data warehouse repository are on different servers, database
access methods are used
• The loading process involves consideration of the following:
• Speed of loading
• Heterogeneity
• necessary network bandwidth
• impact of the transmissions on the network
• data compression requirements
• contingency plans
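For the data-compression consideration, a minimal sketch that gzips a load image before shipping it to the warehouse server; the file names are illustrative:

```python
import gzip
import shutil

def compress_load_image(src: str, dest: str) -> None:
    """Stream-copy so very large load images never sit fully in memory."""
    with open(src, "rb") as f_in, gzip.open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

# compress_load_image("customer_load.dat", "customer_load.dat.gz")
```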
Database access methods
• Web Services
• Allows a system to connect to an outside hosted system via a simple API call
• In general, a web service is a slow method of communicating for bulk data loads
• Database Links
• Provides a convenient way of connecting to a remote database, but this convenience comes
at the cost of performance
• does not allow parallel loading and therefore often becomes a bottleneck in the data movement process
• Data Pump
• Allows you to choose specific columns and tables, providing a very fast way of moving data between two or more Oracle instances
Database access methods
• Flat Files
• one of the best performing means of moving large data volumes between databases, especially in a heterogeneous environment (a sketch follows this list)
• Transportable Tablespaces
• copy a tablespace's data files wholesale between Oracle databases; one of the fastest ways to move very large volumes
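A sketch of the flat-file method, batch-inserting a pipe-delimited file; SQLite is used only so the example runs, where a production warehouse would use the database's native bulk loader (such as Oracle SQL*Loader):

```python
import csv
import sqlite3

def load_flat_file(path: str, conn: sqlite3.Connection) -> None:
    """Bulk-load a pipe-delimited flat file into an existing product table."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f, delimiter="|"))
    conn.executemany("INSERT INTO product VALUES (?, ?, ?)", rows)
    conn.commit()
```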
Modes of applying data
1. Load
• If the target table to be loaded already exists and data exists in the table, the
load process wipes out the existing data and applies the data from the
incoming file
• If the table is already empty before loading, the load process simply applies
the data from the incoming file
2. Append
• If data already exists in the table, the append process unconditionally adds
the incoming data, preserving the existing data in the target table
• When an incoming record is a duplicate of an already existing record, you may
define how to handle an incoming duplicate
Modes of applying data
3. Destructive Merge
• If the primary key of an incoming record matches with the key of an existing
record, update the matching target record
• If the incoming record is a new record without a match with any existing
record, add the incoming record to the target table
4. Constructive Merge
• If the primary key of an incoming record matches with the key of an existing
record, leave the existing record, add the incoming record, and mark the
added record as superseding the old record
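Minimal sketches of the four modes, with a table modeled as a list of keyed records; the flag marking the added record in a constructive merge is an assumed convention:

```python
def load(target, incoming):
    """Load: wipe out any existing rows and apply the incoming file."""
    return list(incoming)

def append(target, incoming):
    """Append: add incoming rows unconditionally; duplicate policy is separate."""
    return list(target) + list(incoming)

def destructive_merge(target, incoming):
    """Destructive merge: update on key match, insert otherwise."""
    by_key = {r["key"]: r for r in target}
    for rec in incoming:
        by_key[rec["key"]] = rec        # overwrite the match or add a new row
    return list(by_key.values())

def constructive_merge(target, incoming):
    """Constructive merge: keep the old record, add the new one, and mark
    the added record as superseding the old one."""
    keys = {r["key"] for r in target}
    out = list(target)
    for rec in incoming:
        out.append({**rec, "supersedes_old": rec["key"] in keys})
    return out
```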