Introduction to ETL
ETL is a technology/set of processes by which data is extracted from numerous
systems/databases/applications/files, transformed as appropriate, and loaded into target systems -
including (but not limited to) an ODS (Operational Data Store), DW (Data Warehouse), DM (Data
Mart), analytical applications, etc. The ETL process periodically refreshes the data warehouse.
4.1. Extraction
Extraction is the process of getting data from a source system so that it can be used elsewhere.
Extraction can cover the entire data set in the source system or only a part of it, depending on
the requirements.
The source system can be a database, XML files, Excel sheets, COBOL files, etc.
Push Method
In the push method, the source systems push the data to the integration area, where the various
cleansing operations and transformations happen. In other words, it is the responsibility of the
source systems to send the required data to the integration area.
Pull Method
In the pull method, the target system pulls the data from the different source systems. In other
words, it is the responsibility of the target system to extract or retrieve the required data.
Incremental
Here, we extract only the data that changed after the last extraction. For example, let's
assume that the last extraction was on 1st January, 2014 and the next extraction is on 1st
February, 2014. In this mode, on 1st February 2014, the system will extract only the data that
changed after 1st January, 2014.
During Extraction
Incremental extraction can be implemented through changed data capture, commonly called CDC.
Changed data capture is the process of recording or capturing the changes made to the source
database, i.e. capturing any insert, update or delete activity applied to the source data.
Snapshot
Here we capture a snapshot of the source data at a point in time. For example, let us assume that
the extraction happens on 1st January, 2014. In this mode, on 1st January, 2014, the system will
extract the data exactly as it looks at that particular moment in time.
Transaction logs
A transaction log is a record of all actions saved by the database; it is stored in a separate file
or table. We can make use of these transaction logs to capture the latest changes.
Database triggers
Database triggers are procedures/code that are stored in the database and implicitly executed
when the associated table is modified. The triggers are executed when an INSERT, UPDATE, or
DELETE statement is issued against the associated table. These can also be used to implement
change data capture.
Source application
Here the source application itself provides the latest changes in the data to the target system.
Timestamp columns
Here the tables whose changes have to be captured have fields (columns) that record the date
and time of the last change. Any record in the table with a timestamp more recent than the last
extraction time is considered to have changed.
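As an illustration, below is a minimal sketch of timestamp-based incremental extraction in Python. The orders table, its columns, and the in-memory SQLite source are assumptions made purely for this example, not part of any particular ETL product.

import sqlite3

def incremental_extract(conn, last_extraction_time):
    """Return only the rows whose last_modified timestamp is newer than the previous run."""
    cursor = conn.execute(
        "SELECT order_id, amount, last_modified FROM orders WHERE last_modified > ?",
        (last_extraction_time,),
    )
    return cursor.fetchall()

# Hypothetical source system with two rows: one old, one changed after the last extraction.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, last_modified TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 250.0, "2013-12-20 10:00:00"), (2, 900.0, "2014-01-15 09:30:00")],
)

# Only order 2 (modified after 1st January 2014) is picked up on 1st February 2014.
print(incremental_extract(conn, "2014-01-01 00:00:00"))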
File comparison
For file sources, the changes are captured by comparing the latest file with the previous file.
4.2. Transformation
Transformation is the process of validating, cleaning and transforming the extracted data to
convert it to the required form.
Validation
The data extracted from the source systems have to be validated for range (e.g. the age of a minor
should be less than 18), sequence, blanks (an employee name shouldn't be blank), numeric values
(a phone number should be a number), domain (gender should be male or female), mandatory
fields, duplicates, etc.
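A small sketch of such validation checks in Python; the record layout and the individual rules are illustrative assumptions.

def validate(record):
    """Collect validation errors for one extracted record (field names are assumed)."""
    errors = []
    if not record.get("employee_name", "").strip():                     # blank check
        errors.append("Employee name shouldn't be blank")
    if not str(record.get("phone", "")).isdigit():                      # numeric check
        errors.append("Phone number should be a number")
    if record.get("gender") not in {"Male", "Female"}:                  # domain check
        errors.append("Gender outside the allowed domain")
    if record.get("is_minor") and not (0 <= record.get("age", -1) < 18):  # range check
        errors.append("Age of a minor should be less than 18")
    return errors

print(validate({"employee_name": "", "phone": "98x76", "gender": "Orange",
                "is_minor": True, "age": 21}))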
Cleansing
A lot of cleansing operations have to be done on the extracted data, such as name standardization
(a name should be First-name Middle-name Last-name), address cleansing, de-duplication, etc.
Transforming
A lot of data transformations have to be done on the extracted raw data to ensure that it abides
by the business rules and fits the data warehouse schema. Typical transformations include type
translations, format changes (e.g. dates should be in YYYY-MM-DD format), code value
mapping (Mum to Mumbai), combining fields, concatenating fields, aggregating rows, splitting
records, combining records, filtering records, augmenting records, etc.
Data Summarization
E.g. - If a Sales Manager wants to analyse daily sales revenue only, then individual sale amounts
need not be stored in the Data Warehouse.
Merging of information
E.g. - House no, street name, town name, state name, country, and pin code combined into a single
field named “Address”.
De-duplication
E.g. - If the name, age and address of a single customer come in multiple times, we have to
remove the duplicate records and store only one record per customer.
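The sketch below illustrates a few of the transformations described above (a date format change, a code value mapping, merging of address fields into a single field, and de-duplication) in Python; the field names, city codes, and sample records are assumptions made for illustration only.

from datetime import datetime

CITY_CODES = {"Mum": "Mumbai", "Blr": "Bengaluru"}   # code value mapping (illustrative)

def transform(record):
    # Format change: source date (DD/MM/YYYY) to the warehouse format (YYYY-MM-DD).
    record["join_date"] = datetime.strptime(record["join_date"], "%d/%m/%Y").strftime("%Y-%m-%d")
    # Code value mapping: Mum -> Mumbai.
    record["city"] = CITY_CODES.get(record["city"], record["city"])
    # Merging of information: several fields combined into a single "address" field.
    record["address"] = ", ".join([record.pop("house_no"), record.pop("street"),
                                   record["city"], record.pop("pin_code")])
    return record

rows = [
    {"cust_id": 1, "join_date": "05/01/2014", "city": "Mum",
     "house_no": "12", "street": "MG Road", "pin_code": "400001"},
    {"cust_id": 1, "join_date": "05/01/2014", "city": "Mum",
     "house_no": "12", "street": "MG Road", "pin_code": "400001"},
]

# De-duplication: keep only one record per customer.
seen, deduped = set(), []
for row in (transform(r) for r in rows):
    if row["cust_id"] not in seen:
        seen.add(row["cust_id"])
        deduped.append(row)
print(deduped)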
4.3. Load
Loading is the process of storing the transformed data in the target data store. The whole of the
transformed data, or only a part of it, may be loaded to the target system. Target systems can be
databases, XML files, COBOL files, etc.
Initial load
In this load type, we insert history data. This is usually a one-time activity which happens when a
new application or process goes live. For example, let us assume that a new data mart is going
live on December 1, 2013. In this mode, all the history data required from the OLTP application
up to December 1, 2013 will first be inserted into the data mart.
Full refresh
Here we erase and replace the existing data in target data store with the new transformed data.
For example, let’s assume that the last load was on 1st January, 2014 and the next load is on 1st
March, 2014. In this mode, on 1st March 2014, the system will delete all the data in the target
tables and insert the new data.
Incremental load
In this load mode, new data is appended to the existing data in the target system. Considering the
previous example, on 1st March 2014, the system will append the new data (i.e. the data from
1st January 2014 till 1st March 2014) to the existing data in the target tables.
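A short sketch contrasting a full refresh with an incremental (append) load, using an in-memory SQLite table as a stand-in target; the table and columns are assumptions for illustration.

import sqlite3

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")

def full_refresh(conn, rows):
    """Erase the existing data in the target table and insert the new data."""
    conn.execute("DELETE FROM sales")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

def incremental_load(conn, rows):
    """Append only the new rows to the data already in the target table."""
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

full_refresh(target, [("2014-01-01", 100.0)])
incremental_load(target, [("2014-02-10", 250.0)])   # data since the last load is appended
print(target.execute("SELECT COUNT(*) FROM sales").fetchone()[0])   # -> 2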
Snapshot Load
This load mode appends a snapshot of the source data as it is at a given point in time.
To summarize the three steps:
Extract – Get the required data from the source systems.
Transform – Convert data to the required format by applying the business rules.
Load – Store the transformed data in the target data store.
4.4. ETL Usage
ETL is used not only in data warehousing but also in other applications.
Data warehousing involves integrating data from heterogeneous data stores, providing a single
store for current and historical information, and providing a single platform to support business
users' needs. Data warehousing involves extracting data from OLTP sources, transforming the
OLTP data into the format expected in the data warehouse, and finally loading the transformed
data into the target system. This part of data warehousing is done using ETL tools and
techniques.
ETL load frequency describes how often the ETL processes are run. Consider a BI application
where the reports are generated every month. In this case the data has to be extracted and
loaded into the data warehouse every month, i.e. the ETL process will run every month. Here the
ETL load frequency is monthly.
Data migration is the process of transferring data between storage systems. For example, let us
assume that application A is storing its data to files. It is possible that the application owner later
decides to use a database to store the data. In this case, the previous data from the files have to
be moved to the new database. This process of data transfer is called data migration. This
movement can be done using ETL tools and techniques.
Any organization will have multiple applications and information sources. For the business
functions to run smoothly, one or more applications may require data from other applications.
The data storage systems of these different applications might be entirely different. This gives
rise to the need for application integration. In application integration, ETL tools and techniques
can be used to bring together data from different applications into a single data store.
4.5. ETL Products and Metadata
ETL Products
Metadata
Metadata is data about data, i.e. information that describes the data held in the warehouse. For
example, consider an Employee table with the following columns:
Employee_id
Employee_Name
Date_of_Joining
Date_of_Birth
Designation
Technical Metadata
Technical information about the data, such as column names, data types, etc.
Business Metadata
Business information about the data. Business metadata tells you what data you have, where it
comes from, what it means, and how it relates to the other data in the data warehouse.
Process Metadata
Metadata that documents the details of the processes used to reformat (convert) or transform
content. Process metadata describes the results of the various operations in a data warehouse;
it contains the details of the process that loads data into the data warehouse.
E.g. ETL job start time, ETL job end time, CPU seconds used, number of rows processed, etc.
Reference data
Reference data is data that defines the set of permissible values to be used by other data fields.
Reference data is generally uniform and company-wide, and can be created either within the
company or by external standardization bodies. Some types of reference data, such as currencies
and currency codes, are always standardized. Others, such as the positions (roles or designations)
of employees within an organization, are less standardized. Reference data gains in value when it
is widely re-used and widely referenced. Typically, its definition does not change much (apart
from occasional revisions).
Typical examples of reference data are:
Units of measure
Country codes
Corporate codes
Conversion rates (currency, weight, temperature, etc.)
Calendar and calendar constraints
Master Data
Master Data is business-critical data that is stored in disparate systems spread across the
enterprise, e.g. data about customers, products, employees, materials, suppliers, and vendors.
While it is often non-transactional in nature, it is not limited to non-transactional data, and often
supports transactional processes and operations. Master data is typically shared by multiple
users and groups and departments across an organization.
Benefits
Consider the following scenario. A bank's customer service department has a record with the
customer name Abhishek Gaur (meaning Abhishek Gaur is already a customer of the bank).
The bank's marketing department (which tries to acquire new customers) has a marketing system,
which also has a customer named Abhishek G. The marketing executives consider Abhishek G only
as a potential customer (not realizing that he is an existing customer, because of the difference
in the name between the marketing system and the customer service system), and continuously
push promotional offers to him to enrol him as a new customer, which annoys the customer and
wastes both their time and his.
This issue is an example of not maintaining proper master data management in the company. If
there were a central master data management system that verified, standardized, and published
customer information to all other systems in the company, this situation could have been
avoided.
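As a rough illustration of how such a system might spot that "Abhishek Gaur" and "Abhishek G" refer to the same person, the sketch below uses simple string similarity in Python; the threshold is an assumption, and real MDM matching uses far richer rules (address, date of birth, etc.).

from difflib import SequenceMatcher

def likely_same_customer(name_a, name_b, threshold=0.7):
    """Very rough name-similarity check; real matching would use more attributes."""
    score = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    return score >= threshold

print(likely_same_customer("Abhishek Gaur", "Abhishek G"))   # True: flag for review/merge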
Master Data Management (MDM) is a discipline in Information Technology (IT) that focuses on
the management of reference or master data that is shared by several disparate IT systems and
groups. MDM enables consistent computing between diverse system architectures and business
functions. MDM integrates master data across BI, data warehouse, financial & operational
systems, providing for accurate, consistent and compliant enterprise reporting. MDM supplies
meta-data for aggregating and integrating transactional data.
MDM Capabilities
Support for the definition of roles, with access rights enforced depending on the responsibilities
assigned to each role.
ETL
ETL capabilities for extracting master data/reference data files or tables from multiple sources,
and loading the data into the master data repository.
Data Cleansing
Data cleansing capabilities for de-duplication and matching of master data records.
Collaborative platform
For applying changes established in a central server to each consuming application. Incremental
change support is important for performance reasons.
Version control at the central policy hub, combined with change monitoring across all of the
participating systems, is needed in order to track changes to master data over time.
MDM architecture
1) Assign business responsibility for each master data domain, such as products, customers,
suppliers, and organizational structure.
2) Extract master data for a domain from separate operational and reporting systems to a central
server.
3) Apply data quality standards, such as de-duplication and matching of master data records, to
get a clean set of master data for the domain.
4) Reconcile and rationalize the master data records. This process entails setting policies
pertaining to an optimal product hierarchy, organizational structure, or preferred supplier list.
5) Synchronize participating operational and reporting systems with the centrally managed,
canonical master data.
6) Monitor changes or updates to master data in each participating system. Then repeat the
preceding steps for ongoing maintenance of master data. Over time, with the centralization of
master data management responsibilities, the origination of master data changes moves from the
participating systems to the master data management hub or server.
Definitions
Synonyms - The fields EMP_ID, EMPID, and EM01 may or may not all actually refer to the
same type of data.
Homonyms - These indicate fields that are spelled the same but really aren't the same
(c_name can be used for both customer name and category name).
Relationships - Just because a field is named FK_INVOICE doesn't mean that it is really a
foreign key to the invoice file.
Domains
Domains describe the range and types of values that can be present in a data set.
Unexpected values – e.g. Gender should be one of {Male, Female, Others}, and not Orange.
Completeness
Validity
Anomaly check
Timeliness check
Data Flows
These checks are related to the aggregate results of movement of data from source to target.
Many data quality problems can be traced back to incorrect data loads, missed loads or system
failures that go unnoticed.
Data flow checks used to ensure data quality include:
Checksums
Timestamps
Process Time
Note: - A checksum is a count of the number of bits in a transmission unit. It is included with
the data so that the receiver can check whether the same number of bits arrived. If the counts
match, it is assumed that the complete data was received.
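A minimal sketch of such a data flow check in Python, comparing the row count and a simple checksum of the rows sent by the ETL process with the rows that reached the target; the sample rows are illustrative, and MD5 is used here only as one convenient way to compute a checksum.

import hashlib

def checksum_and_rows(lines):
    """Return (row count, MD5 checksum) for an iterable of extract rows."""
    md5 = hashlib.md5()
    rows = 0
    for line in lines:
        md5.update(line.encode("utf-8"))
        rows += 1
    return rows, md5.hexdigest()

source_extract = ["1,250.00", "2,900.00"]   # rows produced by the extract step (illustrative)
target_load = ["1,250.00", "2,900.00"]      # rows that actually arrived in the target

assert checksum_and_rows(source_extract) == checksum_and_rows(target_load), \
    "Load incomplete or corrupted"
print("Row counts and checksums match")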
Structural Integrity
These checks ensure that, when the data is taken as a whole, you are getting correct results.
Referential integrity – Product available on invoice but missing from product catalogue.
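A tiny sketch of this referential integrity check, with illustrative product codes: every product referenced on an invoice should exist in the product catalogue.

# Products referenced by invoices but missing from the product catalogue (illustrative data).
catalogue = {"P100", "P200", "P300"}
invoice_products = {"P100", "P450"}

orphans = invoice_products - catalogue
print(orphans)   # {'P450'} -> a structural integrity violation to investigate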
Business Rules
Business rule checks measure the degree of compliance between actual data and expected
data. These checks include:
Comparisons
Functional dependencies
Conditions
Transformations
Transformation checks examine the impact of data transformations as data moves from one
system to another. The quality of the data can be affected by incorrect transformation logic. The
only way to identify such issues is to compare the source data set with the target data set and
verify the transformations for:
Computations
Merging
Filtering
Relationships
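For example, a computation check might recompute an aggregate from the source data set and compare it with the value loaded into the target. The sketch below does this in Python for the daily sales revenue example used earlier; the figures are illustrative.

# Source-level detail rows and the aggregated value found in the target (illustrative).
source_sales = [120.0, 80.0, 300.0]        # individual sale amounts in the source system
target_daily_revenue = 500.0               # daily revenue stored in the data warehouse

recomputed = sum(source_sales)
if abs(recomputed - target_daily_revenue) > 0.01:
    print(f"Computation mismatch: source={recomputed}, target={target_daily_revenue}")
else:
    print("Computation check passed")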