
ETL (Extract, Transform, and Load)

1. What is ETL ?
2. Major steps in ETL process.
3. Extract
4. Cleaning and Transformation
5. Loading
6. Implementing ETL process using SSIS
What is ETL?
• ETL is the process of populating a data warehouse.
• The main purpose of ETL is to get data out of source systems and load it into the data warehouse.
• As the data warehouse is crucial for supporting decision making, data needs to be cleaned and transformed before it is loaded into the data warehouse.
• ETL consumes about 60 to 70 percent of the resources needed to implement and maintain a data warehouse.
ETL has four major sub-processes:
• Extracting data from source systems
• Loading extracted data into a staging area
• Cleaning and transforming
• Loading cleaned and transformed data into the data warehouse
Cont..
[Diagram: ETL architecture. Data Sources (relational databases, flat files, spreadsheets, mainframes, XML sources, web services, ERP, web logs) feed a Staging Area; the ETL Process (Extract, Transform, Clean, Load) then moves the data into the Target Schemas of the data warehouse.]
Extract
• Gathering raw data from source systems and staging it on disk, either in flat files or relational tables.
• The complexity of extracting raw data depends on the number and nature of the data sources.
• There are two methods of extracting data:
  • Having source systems push the data into flat files, then loading the flat files into a staging area.
  • Connecting to source systems and loading data directly into a staging area.
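The first method (staging source data as flat files) can be sketched in a few lines of Python. This is an illustrative example only: the `customers` table and column names are hypothetical, and a real source system would be a production database rather than an in-memory one.

```python
import csv
import sqlite3

def extract_to_flat_file(conn, table, out_path):
    """Dump every row of `table` into a CSV flat file, header included."""
    cur = conn.execute(f"SELECT * FROM {table}")
    headers = [col[0] for col in cur.description]
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(cur.fetchall())
    return out_path

# Demo with an in-memory stand-in for a source system
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob")])
extract_to_flat_file(conn, "customers", "customers_stage.csv")
```

The resulting CSV file would then be bulk-loaded into the staging area; the second method would skip the file and insert the fetched rows into staging tables directly.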
Issues with Extract:
How often should data be extracted?
It could be daily, weekly, or monthly, depending on the business requirements. How often reports are generated, along with other requirements, determines how often data should be extracted.
Cont..
How much data to extract after conducting the initial extraction?
- The whole data set, or only the changed data in the source systems.
- The initial data extract is done when data is extracted for the first time. At this stage, all data should be extracted.
• The subsequent data extracts, called incremental data extracts, can be done in two ways: extracting the whole data set, or only the changed data in the source systems.
The most efficient incremental data extract is to identify and extract only the changed data in the source systems.
This method is only possible if source systems keep a timestamp indicating when each record was entered or updated.
• The alternative method is to extract all data from the source systems and compare it with previous extracts in the staging area to identify only the changed data.
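The two incremental-extract strategies above can be sketched with plain Python structures standing in for source and staging tables; the row contents and dates are made-up examples.

```python
# Strategy 1: timestamp-based -- pull only rows changed since the last run.
source_rows = [
    {"id": 1, "name": "Alice", "updated_at": "2024-01-05"},
    {"id": 2, "name": "Bob",   "updated_at": "2024-02-10"},
]
last_extract = "2024-02-01"
changed = [r for r in source_rows if r["updated_at"] > last_extract]

# Strategy 2: full extract plus comparison -- diff the current extract
# against the previous snapshot held in the staging area.
previous_snapshot = {1: {"id": 1, "name": "Alice"}}
current_extract = {
    1: {"id": 1, "name": "Alice"},
    2: {"id": 2, "name": "Bob"},
}
delta = [row for key, row in current_extract.items()
         if previous_snapshot.get(key) != row]
```

Both `changed` and `delta` end up holding only Bob's row: the first strategy needs reliable timestamps in the source, while the second pays for a full extract but works without them.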
Data Cleaning
Data cleaning is identifying and fixing errors and missing values.
Examples of data cleaning tasks:
1. Checking the validity of column values
   - For example, ensuring that phone numbers contain only digits and the right number of them.
2. Ensuring consistency across values
   - For example, ensuring that the country column holds either codes or full names, but not a mix of both.
3. Removing duplicates
4. Handling null values
5. Checking referential integrity
6. Checking business rules
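A minimal cleaning pass covering four of the tasks above might look like the following sketch. The phone-number format, country-code mapping, and sample records are all assumptions for illustration.

```python
import re

records = [
    {"phone": "0411-222-333", "country": "AU"},
    {"phone": "0411-222-333", "country": "Australia"},  # duplicate phone
    {"phone": "bad-number",   "country": None},         # invalid + null
]

PHONE_RE = re.compile(r"^\d{4}-\d{3}-\d{3}$")  # assumed phone format
COUNTRY_CODES = {"Australia": "AU"}            # assumed code mapping

cleaned, seen = [], set()
for rec in records:
    if not PHONE_RE.match(rec["phone"]):       # 1. validity check
        continue
    if rec["country"] is None:                 # 4. handle nulls
        rec["country"] = "UNKNOWN"
    # 2. consistency: standardise full names to codes
    rec["country"] = COUNTRY_CODES.get(rec["country"], rec["country"])
    if rec["phone"] in seen:                   # 3. remove duplicates
        continue
    seen.add(rec["phone"])
    cleaned.append(rec)
```

Only the first record survives: the second is a duplicate and the third fails the validity check. Referential-integrity and business-rule checks would follow the same pattern with lookups against dimension keys and rule predicates.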
Data Transformation
Data transformation is the process of converting data from one format into another.
Examples:
- Computing revenue from product price and quantity sold.
- Converting gender values of Male or Female into M or F.
- Creating a new column called customer type based on customers' income and total transaction values.
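The three transformation examples above can be sketched on a single sample row; the income threshold used to derive the customer-type column is a hypothetical business rule, not one given in the text.

```python
row = {"price": 20, "quantity": 3, "gender": "Female", "income": 90000}

# Derived measure: revenue from price and quantity sold
row["revenue"] = row["price"] * row["quantity"]

# Recoding: map full gender labels to single-letter codes
row["gender"] = {"Male": "M", "Female": "F"}[row["gender"]]

# Derived dimension attribute: assumed rule for the customer-type column
row["customer_type"] = "premium" if row["income"] > 80000 else "standard"
```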
Loading Data
• Once data is cleaned and transformed, the final step of the ETL process is loading it into the data warehouse. INSERT SQL commands can be used to insert records into dimension or fact tables.
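As a minimal sketch of the load step, the snippet below inserts cleaned rows into a dimension table with plain INSERT statements; the `dim_customer` table, its columns, and the in-memory database are illustrative stand-ins for a real warehouse.

```python
import sqlite3

# In-memory stand-in for the data warehouse
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, name TEXT, type TEXT)"
)

# Rows assumed to have come out of the cleaning/transformation steps
cleaned_rows = [(1, "Alice", "premium"), (2, "Bob", "standard")]
conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", cleaned_rows)
conn.commit()
```

In practice a bulk-load utility or an ETL tool's destination component would replace row-by-row INSERTs for large volumes.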
Implementing ETL Process
Tools to implement and manage the ETL process:
- Oracle Data Integrator
- SAS Data Management
- Informatica PowerCenter
- SQL Server Integration Services (SSIS)
- Talend Studio for Data Integration
- IBM InfoSphere Information Server
