
4.1. Introduction to ETL
ETL is a technology and set of processes by which data is extracted from numerous systems (databases, applications, files), transformed as appropriate, and loaded into target systems, including (but not limited to) an ODS (Operational Data Store), DW (Data Warehouse), DM (Data Mart), analytical applications, etc. The ETL process periodically refreshes the data warehouse.

Extraction

Extraction is the process of getting data from a source system so that it can be used elsewhere.
Extraction can cover the entire data set in the source system or only a part of it, based on the
requirements.

The source system can be a database, XML files, Excel sheets, COBOL files, etc.

There are two extraction types.

Push Method
In the push method, the source systems push the data to the integration area, where the various
cleansing operations and transformations happen. In other words, it is the responsibility of the
source systems to send the required data to the integration area.

Pull Method

In the pull method, the target system pulls the data from the different source systems. In other words,
it is the responsibility of the target system to extract or retrieve the required data.

Following are the two modes of Extraction:

Incremental

Here, we extract only the data that changed after the last extraction. For example, let's
assume that the last extraction was on 1st January 2014 and the next extraction is on 1st
February 2014. In this mode, on 1st February 2014, the system will extract only the data that
changed after 1st January 2014.

[Figure: Scenario – initial state, and during extraction]

Incremental extraction can be implemented through change data capture, commonly called
CDC. Change data capture is the process of recording or capturing the changes to the source
database, i.e. capturing any insert, update, or delete activity applied to the source data.

Snapshot

Here, we capture a snapshot of the source data at a point in time. For example, let us assume that
the extraction happens on 1st January 2014. In this mode, on 1st January 2014, the system will
extract the data as it looks at that particular moment in time.

[Figure: Scenario – initial state, and during extraction]

Change Data Capture / Data Extraction Options

Capture through Transaction Logs

A transaction log is a record, kept by the database, of all actions; it is stored in a separate file
or table. We can make use of these transaction logs to capture the latest changes.

Capture through Database Triggers

Database triggers are procedures/code stored in the database and implicitly executed
when the associated table is modified. The triggers are executed when an INSERT, UPDATE, or
DELETE statement is issued against the associated table, so they can also be used for
change data capture.

Capture in Source Application or Application-assisted data capture

Here the source application will provide the latest changes in data to the target system.

Capture Based on Date and Time Stamp

Here, the tables whose changes have to be captured contain fields (columns) that
represent the date and time of the last change. Any record in the table with a timestamp
more recent than the last extraction time is considered to have changed.
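
A minimal sketch of timestamp-based capture in Python, assuming an in-memory list of records with a last_modified column (the column name and the sample data are illustrative only):

from datetime import datetime

# Illustrative source records; last_modified marks when each row last changed.
source_rows = [
    {"id": 1, "name": "Anita", "last_modified": datetime(2014, 1, 10)},
    {"id": 2, "name": "Rahul", "last_modified": datetime(2013, 12, 20)},
    {"id": 3, "name": "Meera", "last_modified": datetime(2014, 1, 25)},
]

# Time of the previous extraction (normally persisted by the ETL process).
last_extraction = datetime(2014, 1, 1)

# Incremental extraction: pick only the rows changed after the last extraction.
changed_rows = [row for row in source_rows if row["last_modified"] > last_extraction]
print(changed_rows)  # rows with id 1 and 3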

Capture by Comparing Files

For file sources, the changes are captured by comparing the latest file with the previous file.
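
A minimal sketch of capture by file comparison, assuming line-oriented extract files where each line is one complete record (the sample lines are illustrative):

# Illustrative contents of the previous and the latest extract files.
previous_lines = {"1|Anita|Chennai", "2|Rahul|Pune", "3|Meera|Delhi"}
latest_lines = {"1|Anita|Mumbai", "2|Rahul|Pune", "4|Vikram|Kochi"}

# Records present only in the latest file: inserts or new versions of updates.
new_or_changed = latest_lines - previous_lines

# Records present only in the previous file: deletes or old versions of updates.
removed_or_old = previous_lines - latest_lines

print("New/changed:", new_or_changed)
print("Removed/old:", removed_or_old)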
4.2. Transformation
Transformation is the process of validating, cleaning and transforming the extracted data to
convert it to the required form.

Validation
The data extracted from the source systems has to be validated for range (e.g. the age of a minor
should be less than 18), sequence, blanks (an employee name shouldn't be blank), numeric values
(a phone number should be a number), domain (gender should be male or female), mandatory
fields, duplicates, etc.
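
A minimal sketch of a few of these validations (blank, range, numeric, and domain checks) on an illustrative record:

record = {"name": "Abhishek", "age": 17, "phone": "9876543210", "gender": "male"}
errors = []

# Blank check: the employee name shouldn't be blank.
if not record["name"].strip():
    errors.append("name is blank")

# Range check: the age of a minor should be less than 18.
if not 0 <= record["age"] < 18:
    errors.append("age out of range for a minor")

# Numeric check: the phone number should contain only digits.
if not record["phone"].isdigit():
    errors.append("phone is not numeric")

# Domain check: gender should be male or female.
if record["gender"] not in {"male", "female"}:
    errors.append("gender outside the allowed domain")

print(errors or "record passed validation")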

Cleansing
A lot of cleansing operations have to be performed on the extracted data, such as name standardization
(a name should be First-name Middle-name Last-name), address cleansing, de-duplication, etc.

Data cleansing and validation can include the below tasks:

Restructuring of records or fields


E.g. - Standardizing the size of the record or the fields, if the same data is being fetched from
different data sources.

Removal of Operational-only data


E.g. - Address and Phone number from an invoice may not be picked because it is not required
for any Analysis purpose.

Supply of missing field values


E.g. - Let's suppose a field should contain either Yes (Y) or No (N) in an application. If it is left
blank, should it be considered Y or N? In such cases, the user might have to provide the
default values.

Data Integrity checks


E.g. - There should not be a business transaction for which the customer information is non-existent
in the customer master table (a small sketch of this check appears after this list of tasks).

Data Consistency and Range checks, etc...


E.g. - All negative numbers should have a negative sign one place before the most significant
digit.
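
A minimal sketch of the data integrity check mentioned above, using illustrative customer and transaction data:

# Illustrative customer master and business transactions.
customer_master = {"C001", "C002", "C003"}
transactions = [
    {"txn_id": "T1", "customer_id": "C001", "amount": 250},
    {"txn_id": "T2", "customer_id": "C009", "amount": 400},  # customer missing from master
]

# Integrity check: every transaction must reference an existing customer.
orphans = [t for t in transactions if t["customer_id"] not in customer_master]
print("Transactions with missing customers:", orphans)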

Transforming
A lot of data transformations have to be applied to the extracted raw data to ensure that it abides
by the business rules and fits the data warehouse schema. Typical transformations include
type translations, format changes (e.g. a date should be in YYYY-MM-DD format), code value
mapping (Mum to Mumbai), combining fields, concatenating fields, aggregating rows, splitting
records, combining records, filtering records, augmenting records, etc.

Data transformation may include the following steps (a combined sketch follows these steps):

Integrating dissimilar data types or Format Revision


E.g. - Length might be measured in cm, inches, or yards in different systems, but in the
common data store everything should be converted to a single unit (e.g. cm).
Changing codes or Decoding of Fields
E.g. - The gender might be denoted using different codes in different systems like male/female,
m/f, x/y or 1/0. But in the common data store, only a single code should be used (m/f).

Adding a time attribute & Date Time Conversion


E.g. - Different systems might be using different date formats like 01-13-2013 or 13-01-2013 or
Jan 13, 2013 or 13 January, 2013. Date and time formats should be standardized.

Data Summarization
E.g. - If a Sales Manager wants to analyse only the daily sales revenue, then the individual sale
amounts need not be stored in the Data Warehouse.

Calculation of derived values


E.g. - If the quantity sold and the price of each item are available in the source system, then during
transformation we might have to calculate the total sales amount (quantity * price).

Splitting of single fields


E.g. - Splitting of Name into First name, middle name and last name.

Merging of information
E.g. - Merging house number, street name, town name, state name, country, and PIN code into a
single field named "Address".

Character set conversion.

De-duplication
E.g. - If the name, age, and address of a single customer appear multiple times, we have to
remove the duplicate records and store only one record per customer.
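
A minimal sketch combining a few of the steps above (code value mapping, date standardization, a derived value, and splitting of a name field) on an illustrative sales record:

from datetime import datetime

raw = {"city": "Mum", "sale_date": "01-13-2013", "name": "Abhishek Kumar Gaur",
       "quantity": 4, "price": 250.0}

# Code value mapping: Mum -> Mumbai.
city_codes = {"Mum": "Mumbai", "Del": "Delhi"}
city = city_codes.get(raw["city"], raw["city"])

# Date standardization: MM-DD-YYYY in the source, YYYY-MM-DD in the warehouse.
sale_date = datetime.strptime(raw["sale_date"], "%m-%d-%Y").strftime("%Y-%m-%d")

# Derived value: total sales amount = quantity * price.
total_amount = raw["quantity"] * raw["price"]

# Splitting a single field: name -> first, middle, and last name.
first_name, middle_name, last_name = raw["name"].split(" ", 2)

transformed = {"city": city, "sale_date": sale_date, "total_amount": total_amount,
               "first_name": first_name, "middle_name": middle_name, "last_name": last_name}
print(transformed)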

4.3. Load
Loading is the process of storing the transformed data in the target data store. The whole of the
transformed data, or only a part of it, may be loaded into the target system. Target systems can be
databases, XML files, COBOL files, etc.

Below are the different load types:

Initial load
In this load type, we insert historical data. This is usually a one-time activity which happens when a
new application or process goes live. For example, let us assume that a new data mart is going
live on December 1, 2013. In this mode, all the historical data required from the OLTP application
till December 1, 2013 will first be inserted into the data mart.

[Figure: Scenario – initial state, and after load]

Full refresh

Here, we erase and replace the existing data in the target data store with the new transformed data.
For example, let's assume that the last load was on 1st January 2014 and the next load is on 1st
March 2014. In this mode, on 1st March 2014, the system will delete all the data in the target
tables and insert the new data.

[Figure: Scenario – initial state, and after load]


Incremental Load

In this load mode, new data is appended to the existing data in the target system. Considering the
previous example, on 1st March 2014, the system will append the new data (i.e. the data from
1st January 2014 till 1st March 2014) to the existing data in the target tables.

[Figure: Scenario – initial state, and after load]

Snapshot Load

This load mode appends a snapshot of the source data as at a given point in time.

[Figure: Scenario – initial state, and after load]
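
A minimal sketch contrasting the full refresh and incremental load modes described above, using an in-memory SQLite table purely for illustration (the table and column names are assumptions):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_target (sale_id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO sales_target VALUES (?, ?)", [(1, 100.0), (2, 200.0)])

new_rows = [(3, 300.0), (4, 400.0)]

def full_refresh(conn, rows):
    # Erase and replace: delete the existing data, then insert the new data.
    conn.execute("DELETE FROM sales_target")
    conn.executemany("INSERT INTO sales_target VALUES (?, ?)", rows)

def incremental_load(conn, rows):
    # Append: keep the existing data and insert only the new rows.
    conn.executemany("INSERT INTO sales_target VALUES (?, ?)", rows)

incremental_load(conn, new_rows)
print(conn.execute("SELECT COUNT(*) FROM sales_target").fetchone()[0])  # 4 rows after the append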

4.4. ETL Generic Architecture and Usage

 Extract – Extract the required data from the source system.

 Scrub – Apply the data quality rules and perform all the necessary data cleansing and validation transformations on the data.

 Transform – Convert data to the required format by applying the business rules.

 Load – Load the data into the target system.

 Data reconciliation – Data reconciliation is a process which allows us to ensure the consistency of data in the target system. This is mostly done by comparing the data in the source and the target system after the data is loaded.
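
A minimal sketch of count-based reconciliation after a load; the counts shown are illustrative and would normally come from COUNT(*) queries on the source and target tables:

# Row counts gathered after the load completes (illustrative values).
source_count = 1000000
target_count = 999998

if source_count == target_count:
    print("Reconciliation passed: source and target row counts match")
else:
    print("Reconciliation failed:", source_count - target_count, "rows missing in the target")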

ETL Usage

ETL is used not only in data warehousing but also in other applications.

ETL in Data Warehousing

Data warehousing involves integrating data from heterogeneous data stores, providing a single
store for current and historical information, and providing a single platform to support business
users' needs. It involves extracting data from OLTP sources, transforming the OLTP data into the
format expected in the data warehouse, and finally loading the transformed data into the target
system. This part of data warehousing is done using ETL tools and techniques.

ETL load frequency and load window

ETL load Frequency

ETL load frequency is information about how often the ETL processes are run. Consider a BI
application, where the reports are generated every month. In this case the data has to be
extracted and loaded into the data warehouse every month i.e. the ETL process will run every
month. Here the ETL load frequency is monthly.

ETL load window


The ETL load window is the time allocated for the ETL process to run. Consider a BI application for a
retail store which is open for 12 hours, from 9 am to 9 pm. The OLTP application that handles the
sales transactions will be busy servicing the customers from 9 am to 9 pm. If the BI application
tries to extract data from the OLTP database during this time, it will slow down the entire system.
So the BI application might schedule the ETL processes from the OLTP database to the data
warehouse to run at night, from 9 pm to 9 am, when the traffic is low. Here the ETL load window is
12 hours, from 9 pm to 9 am.

ETL in Data Migration

Data migration is the process of transferring data between storage systems. For example, let us
assume that application A stores its data in files. It is possible that the application owner later
decides to use a database to store the data. In this case, the existing data in the files has to be
moved to the new database. This process of data transfer is called data migration. This
movement can be done using ETL tools and techniques.

ETL in Application integration

Any organization will have multiple applications and information sources. For the business
functions to run smoothly, one or more applications may require data from other applications.
The data storage systems of these different applications might be entirely different, which gives
rise to the need for application integration. In application integration, ETL tools and techniques
can be used to bring together data from different applications into a single data store.
4.5. ETL Products and Metadata
ETL Products

Metadata

Metadata is data about data.

Consider the table Employee_Details, which has the following columns:

Employee_id
Employee_Name
Date_of_Joining
Date_of_Birth
Designation

Technical Metadata

Technical information about the data, such as column names, data types, etc.
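
As an illustration, the technical metadata for Employee_Details could be represented as below; the data types and nullability shown are assumptions, not part of the original table definition:

technical_metadata = {
    "table": "Employee_Details",
    "columns": [
        {"name": "Employee_id", "data_type": "INTEGER", "nullable": False},
        {"name": "Employee_Name", "data_type": "VARCHAR(100)", "nullable": False},
        {"name": "Date_of_Joining", "data_type": "DATE", "nullable": False},
        {"name": "Date_of_Birth", "data_type": "DATE", "nullable": True},
        {"name": "Designation", "data_type": "VARCHAR(50)", "nullable": True},
    ],
}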

Business Metadata

Business information about the data. Business metadata tells you what data you have, where
it comes from, what it means, and what its relationship is to other data in the data
warehouse.

Process Metadata

Metadata that documents the details of the processes used to reformat (convert) or transform
content. Process metadata describes the results of various operations in a data warehouse; it
contains the details of the processes that load data into the data warehouse.

E.g. ETL job start time, ETL job end time, CPU seconds used, number of rows processed etc.

Reference data

Reference data is data that defines the set of permissible values to be used by other data fields.
Reference data is generally uniform and company-wide, and can be created either within the
company or by external standardization bodies. Some types of reference data, such as currencies
and currency codes, are always standardized. Others, such as the positions (roles or designations)
of employees within an organization, are less standardized. Reference data gains in value when it
is widely re-used and widely referenced. Typically, its definition does not change much (apart from
occasional revisions).
Typical examples of reference data are:

Units of measure
Country codes
Corporate codes
Conversion rates (currency, weight, temperature, etc.)
Calendar and calendar constraints

Master Data

Master Data is business-critical data that is stored in disparate systems spread across the
enterprise, e.g. data about customers, products, employees, materials, suppliers, and vendors.

While it is often non-transactional in nature, it is not limited to non-transactional data, and it often
supports transactional processes and operations. Master data is typically shared by multiple
users, groups, and departments across an organization.

4.6. Data Governance


Data governance (DG) refers to the overall management of the availability, usability, integrity,
and security of the data employed in an enterprise.

Benefits

 Increase consistency & confidence in decision making

 Decrease the risk of regulatory fines

 Improve data security

Master Data Management

Consider the following scenario. A bank's customer service department has a record with the
customer name Abhishek Gaur (meaning Abhishek Gaur is already a customer of the bank).
The bank's marketing department (which tries to get new customers) has a marketing system
that also has a customer named Abhishek G. Marketing executives consider Abhishek G only
a potential customer (not realizing that he is an existing customer, due to the difference in name
between the marketing system and the customer service system) and continuously push
promotional offers to him to enrol him as a new customer, which annoys him and wastes both
their time and the customer's.

This issue is an example of not maintaining proper master data management in the company. If
there were a central master data management system that verified, standardized, and published
customer information to all the other systems in the company, this situation could have been
avoided.

Master Data Management (MDM) is a discipline in Information Technology (IT) that focuses on
the management of reference or master data that is shared by several disparate IT systems and
groups. MDM enables consistent computing between diverse system architectures and business
functions. MDM integrates master data across BI, data warehouse, financial, and operational
systems, providing for accurate, consistent, and compliant enterprise reporting. MDM also supplies
metadata for aggregating and integrating transactional data.

MDM Capabilities

Role Definition Support

Support for definition of roles with access rights enforced, depending on the responsibilities
assigned for that role.

ETL

ETL capabilities for extracting master data/reference data files or tables from multiple sources,
and loading the data into the master data repository.

Data Cleansing

Data cleansing capabilities for de-duplication and matching of master data records.

Collaborative platform

A collaborative platform for coordinating decisions on master data reconciliation and
rationalization. The platform should be supported by standards, if available, or via industry
knowledge of a master data domain. An example is a standard product hierarchy for a particular
industry, say retail.

Data synchronization and replication support

For applying changes established in a central server to each consuming application. Incremental
change support is important for performance reasons.

Version control and Change monitoring

Version control at the central policy hub combined with change monitoring across all of the
participating systems. This is needed in order to track changes to master data over time.

MDM architecture

Master data is managed via a policy hub, as shown in the figure.


1) The policy hub for master data management collects master data from participating analytical
and transactional systems.
2) Collaborative applications (applications that coordinate with each other) run on the central
policy hub to coordinate decisions among team members on master data policies.
3) The standard master data is published to each participating system (transactional and
analytical), so that they are synchronized with the hub.

Steps in the Process for Managing and Maintaining Master Data

1) Assign business responsibility for each master data domain such as products, customers,
suppliers, organizational structure.
2) Extract master data for a domain from separate operational and reporting systems to a central
server.
3) Apply data quality standards, such as de-duplication and matching of master data records, to
get a clean set of master data for the domain.
4) Reconcile and rationalize the master data records. This process entails setting policies
pertaining to an optimal product hierarchy, organizational structure, or preferred supplier list.
5) Synchronize participating operational and reporting systems with the centrally managed,
canonical master data.
6) Monitor changes or updates to master data in each participating system. Then repeat the
preceding steps for ongoing maintenance of master data. Over time, with the centralization of
master data management responsibilities, the origination of master data changes moves from the
participating systems to the master data management hub or server.

4.7. Data Quality Management


Data quality is critical to data warehouse and business intelligence solutions. Better informed and
more reliable decisions come from using the right data during the process of loading a data
warehouse. It is important that the data is accurate, complete, and consistent across data
sources. Poor quality of data will lead to bad decisions which will in turn affect the organisation’s
performance.

Data quality can be hampered by errors in the following elements:

Definitions

Sometimes, the column names and column descriptions may be misleading.

Definition problems can be further categorized as below:

Synonyms - The fields EMP_ID, EMPID, and EM01 may or may not all actually refer to the same
type of data.

Homonyms - These indicate fields that are spelled the same but really aren't the same
(c_name can be used for a customer name and a category name).

Relationships - Just because a field is named FK_INVOICE doesn't mean that it is really a
foreign key to the invoice file.

Domains

Domains describe the range and types of values that can be present in a data set.

Some examples of domain errors are listed below (a small sketch follows the list):

 Unexpected values – e.g. Gender should be one of {Male, Female, Others}, and not Orange.

 Cardinality – A Yes/No field can have only two credible values.

 Uniqueness – Consider a column that is supposed to have unique values. If that column is
nullable and more than one record comes in with a null value for this column, then it will
throw an error.
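
A minimal sketch of the unexpected-value and uniqueness checks above, on illustrative records:

records = [
    {"customer_id": "C001", "gender": "Female"},
    {"customer_id": "C002", "gender": "Orange"},  # unexpected value
    {"customer_id": None, "gender": "Male"},
    {"customer_id": None, "gender": "Male"},      # a second null breaks uniqueness
]

# Unexpected values: gender must come from the allowed domain.
allowed_gender = {"Male", "Female", "Others"}
bad_gender = [r for r in records if r["gender"] not in allowed_gender]

# Uniqueness: customer_id is supposed to be unique (nulls are counted here as well).
ids = [r["customer_id"] for r in records]
duplicates = {i for i in ids if ids.count(i) > 1}

print("Unexpected gender values:", bad_gender)
print("Non-unique customer ids:", duplicates)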

Completeness

Complete and accurate data should be present in the database.

Completeness of a data set can be gauged by its:

 Integrity – Does the actual data map to our definition of the data?

 Accuracy – Name and address matching, demographics checks

 Reliability – The zip code should match the city and state

 Redundancy – There shouldn't be data duplication

 Consistency – Is the same invoice number referenced with different amounts?

Validity

Validity indicates whether or not the data conforms to its defined format and rules.

Validity checks used to spot data problems are (a small sketch follows the list):

 Acceptability check – E.g. a product part number should consist of a 7-character
alphanumeric string, with two letters and five digits.

 Anomaly check

 Timeliness check
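
A minimal sketch of the acceptability check above, assuming the two letters come before the five digits (the exact layout of the part number is an assumption made for illustration):

import re

# Two letters followed by five digits, i.e. a 7-character alphanumeric part number.
part_number_pattern = re.compile(r"^[A-Za-z]{2}[0-9]{5}$")

for part in ["AB12345", "A123456", "AB1234X"]:
    status = "acceptable" if part_number_pattern.match(part) else "rejected"
    print(part, "->", status)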

Data Flows

These checks are related to the aggregate results of movement of data from source to target.
Many data quality problems can be traced back to incorrect data loads, missed loads or system
failures that go unnoticed.
Data flow checks to ensure data quality are

 Record counts – Reconciliation of source and target record counts

 Checksums

 Timestamps

 Process Time

Note: A checksum is a small value computed from a block of data (classically, a count of its bits)
that is sent along with the data so that the receiver can recompute it and check whether the
same data arrived. If the values match, it is assumed that the complete data was received.
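
A minimal sketch of a record-count check together with a file checksum; an MD5 hash over the file bytes is used here as the checksum, which is one common choice rather than the classic bit count described in the note:

import hashlib

# Record-count reconciliation: counts gathered from the source and target after the load.
source_count, target_count = 5000, 5000
print("Record counts match:", source_count == target_count)

# File checksum: hash the extract file content on both sides and compare the values.
extract_bytes = b"1|Anita|Mumbai\n2|Rahul|Pune\n"
checksum = hashlib.md5(extract_bytes).hexdigest()
print("Checksum of extract:", checksum)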

Structural Integrity

These checks ensure that when the data is taken as a whole, you are getting correct results.

Structural integrity checks include the following (a small sketch follows the list):

 Cardinality checks between tables

 Primary keys – Are these unique?

 Referential integrity – E.g. a product appears on an invoice but is missing from the product
catalogue.
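
A minimal sketch of the primary key and referential integrity checks above, using an illustrative product catalogue and invoice lines:

# Illustrative product catalogue and invoice lines.
product_catalogue = {"P100", "P200"}
invoice_lines = [
    {"invoice_id": "I1", "product_id": "P100"},
    {"invoice_id": "I2", "product_id": "P999"},  # product missing from the catalogue
]

# Primary key check: invoice_id should be unique across the invoice lines.
ids = [line["invoice_id"] for line in invoice_lines]
duplicate_keys = {i for i in ids if ids.count(i) > 1}

# Referential integrity: every invoiced product must exist in the product catalogue.
missing_products = [l for l in invoice_lines if l["product_id"] not in product_catalogue]

print("Duplicate primary keys:", duplicate_keys)
print("Invoice lines with unknown products:", missing_products)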

Business Rules

Business rule checks measure the degree of compliance between the actual data and the expected
data. These checks consist of:

 Constraints – Does the data comply with a known set of validations?

 Computational rules – Is the formula for deriving an amount correct?

 Comparisons

 Functional dependencies

 Conditions

Transformations

Transformation checks examine the impact of data transformations as data moves from one
system to another. The quality of data can be affected by incorrect transformation logic. The only
way to identify these issues is to compare the source data set with the target data set and verify
the transformations for:

 Computations

 Merging

 Filtering

 Relationships
