
Building the Warehouse

Concepts

Satyam Computer Services Ltd.

Introduction to Building a Data Warehouse

Data Warehousing Architecture

[Diagram: operational systems and external systems feed an information transformation/migration infrastructure with replication services. The enterprise data warehouse supplies dependent data marts (Sales, Marketing); an independent Finance data mart is fed directly. LAN clients access the warehouse directly, while light clients reach it through a web server.]
Data Warehousing Architecture

[Diagram: legacy-system data stores feed an extraction/transformation server, which loads a staging area and then the warehouse/datamart. A metadata repository, maintained through the design/management environment, holds the metadata. Tools involved: scrubbing tool, mapping tool, extraction management tool, transformation tool, migration management tool.]
Building a Data Warehouse
Steps Involved in Building a Data Warehouse

Extracting, Transforming, and Transporting Data

Extracting Data
Extraction Techniques
Extraction Tools

Extraction Phase

Examine the source data and identify the extraction tool.
Extracts are typically written in source-system code (e.g. PL/SQL, VB Script, or COBOL).
The extraction tool also generates source-system code.
Using a tool makes the extraction process easier than hand-coding.
Pre- and post-processes may exist: e.g., before the extract there may be a call to sort the data, or a call to a function that scores a record based on a formula. (A minimal extract sketch follows.)
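
A minimal sketch of a hand-coded extract in PL/SQL; the names src_orders and stg_orders are assumptions for illustration, not objects named in this material:

  -- Copy applicable rows from a source table into a staging table
  CREATE OR REPLACE PROCEDURE extract_orders IS
  BEGIN
    INSERT INTO stg_orders (order_id, customer_id, order_date, amount)
    SELECT order_id, customer_id, order_date, amount
      FROM src_orders;
    COMMIT;
  END;
  /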

Transformation Phase

Importance of quality data
Creating business rules
Tools are available to create reusable transformation modules or objects.
Simple data transformations, including date, number, and character conversion
Assigning surrogate keys
Combining data from separate sources
Validating one-to-one and one-to-many relationships
(A minimal transformation sketch follows.)
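
A minimal sketch of two common transformations, a surrogate key drawn from a sequence and a character-to-date conversion; customer_seq, stg_customer, and dim_customer are assumed names:

  -- Assign a surrogate key and convert a character date while loading
  INSERT INTO dim_customer (customer_key, customer_id, signup_date)
  SELECT customer_seq.NEXTVAL,
         customer_id,
         TO_DATE(signup_date_str, 'DD-MM-YYYY')
    FROM stg_customer;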

Transporting Phase
Insert statements generate logs; a bulk loader is advisable.
Truncate target tables before a full refresh.
Index management: drop indexes before the load and reindex afterwards. (A sketch follows.)
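
A minimal full-refresh sketch; sales_fact and sales_fact_idx are assumed names. Marking the index unusable before a bulk load and rebuilding afterwards is one common way to realize "drop and reindex":

  TRUNCATE TABLE sales_fact;            -- empty the target before the full refresh
  ALTER INDEX sales_fact_idx UNUSABLE;  -- avoid index maintenance during the load
  -- ... bulk load here, e.g. a SQL*Loader direct-path load ...
  ALTER INDEX sales_fact_idx REBUILD;   -- reindex after the load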

Refresh Phase

Process slowly changing dimensions.

Automate the Extract-Transform-Load cycle.

Incremental fact table extracts (see the sketch below).

Purging and archiving data.
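
A minimal sketch of an incremental fact extract driven by a last-load timestamp; etl_control, src_sales, stg_sales, and the last_updated column are assumed names:

  -- Pull only rows changed since the previous load
  INSERT INTO stg_sales
  SELECT s.*
    FROM src_sales s
   WHERE s.last_updated > (SELECT last_load_date FROM etl_control);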

Extracting Data
Extraction Process in Detail
Extraction is the process of getting data from a legacy system or any other data source. After extraction, the data is put in a staging area where it can be scrubbed and cleaned.
The data may come from a single source or from multiple sources. If there are multiple sources, a connector tool is required to connect them.
If the data is from a single source, it can come from an OLTP system or from a flat file.

Extracting Data
The extraction process can be done either by hand-coding or by using tools.
There are advantages and disadvantages to both custom-programmed extraction (PL/SQL scripts) and tool-based extraction:
Tools offer a well-defined, disciplined approach and documentation.
Tools provide an easier way to perform extraction through click, drag-and-drop features.

Extracting Data

Hand-coded extraction is cost-effective, since the PL/SQL constructs are available with the RDBMS.

Hand-coded extraction is used when the programmer has clear knowledge of the data structures involved.

Extraction Techniques
Extraction Methods

Bulk Extraction
The entire data warehouse is refreshed periodically by extractions from the source systems. All applicable data are extracted from the source systems for loading into the warehouse. This approach makes heavy use of the network connection for loading data from source to target databases, but the mechanism is easy to set up and maintain.

Extraction Techniques
Extraction Methods

Change-Based Replication
Only data that have been newly inserted or updated in the source systems are extracted and loaded into the warehouse. This approach uses less network bandwidth because of the smaller volume of data transported. The mechanism involves complex programming to determine when a new warehouse record must be inserted and when an existing warehouse record must be updated. (A sketch of that decision follows.)
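
A minimal sketch of the insert-or-update decision in PL/SQL (written without MERGE so it also suits older Oracle releases); src_customer, wh_customer, and the last_updated column are assumed names:

  DECLARE
    v_last_load DATE := TRUNC(SYSDATE) - 1;  -- assumed previous load time
  BEGIN
    FOR r IN (SELECT customer_id, name
                FROM src_customer
               WHERE last_updated > v_last_load) LOOP
      -- Try to update an existing warehouse record first
      UPDATE wh_customer SET name = r.name
       WHERE customer_id = r.customer_id;
      -- If nothing was updated, the record is new: insert it
      IF SQL%ROWCOUNT = 0 THEN
        INSERT INTO wh_customer (customer_id, name)
        VALUES (r.customer_id, r.name);
      END IF;
    END LOOP;
    COMMIT;
  END;
  /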

Extraction Techniques
Hand-Coding Development Practices

Set up header and comment fields for the code.
Stick to the naming standards.
Test everything - both unit testing and system testing.
Document everything.

Extracting Data
Criteria for Identifying an Extraction Tool

The source system platform and database:
Tools cannot access all types of data sources on all types of computing platforms.

Built-in extraction or duplication functionality:
The availability of built-in extraction or duplication reduces the technical difficulties inherent in the data extraction process.

Extracting Data
Criteria for Identifying an Extraction Tool

The batch windows of the operational systems:
Some extraction mechanisms are faster or more efficient than others. The batch windows of the operational systems determine the time frame available for the extraction.

Extraction Tools
Extraction tools include:
Apertus Carleton: Passport
Evolutionary Technologies: ETI Extract
Platinum: InfoPump

TRANSFORMING DATA

Transforming Data
IMPORTANCE OF QUALITY DATA
TRANSFORMATION
TRANSFORMING DATA: PROBLEMS AND SOLUTIONS
TRANSFORMATION TECHNIQUES
TRANSFORMATION TOOLS

Importance of Quality Data

Quality Data:
Before the extracted data is transformed, its quality has to be examined. Once quality data is transformed, minimal change is needed at the target, which reduces inconsistencies between source and target.

Data Quality Assurance


Characteristics of Quality Data
Accurate
Complete
Consistent
Unique
Timely

Data Quality Assurance


Data Quality Tools assist warehousing teams with the
task of locating and correcting data errors.
Corrections can be made to the source or to the target. But when corrections are made to the target, they cause inconsistencies between the source and target data, which creates synchronization problems.

Data Quality Tools

Though dirty data continues to be one of the biggest issues for data warehousing initiatives, research indicates that data quality investments are a small percentage of total warehouse spending.

DataFlux: Data Quality Workbench
Pine Cone Systems: Content Tracker
Prism: Quality Manager
Vality Technology: Integrity Data Reengineering

Transformation
Transformation:
Transformation is the process by which extracted data is converted into the appropriate format. The extracted data is put into the staging area, where cleaning and scrubbing take place and the data is stored so that the clean data can be transformed. In the transformation phase, data can also come from a cleansing tool. After transformation, the data goes to the transportation stage.

Transforming Data: Problems and Solutions
The common problems in data that come out of a legacy system are:
Inconsistent or incorrect use of codes and special characters.
A single field used for unofficial or undocumented purposes.
Overloaded codes.
Evolving data.
Missing, incorrect, or duplicate values.

Transforming Data: Problems and Solutions
There are different solutions available to check whether the data to be loaded is correct:

Cross-Footing
A template of quality-data norms can be used to identify erroneous data by comparing it with the norms in the template.

Manual Examination
A sampling methodology can be selected and a manual examination made of the sampled data.

Process Validation
Scripts can be generated that identify erroneous records and segregate them (see the sketch below).
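
A minimal process-validation sketch that segregates rows failing a rule into a reject table; stg_orders, stg_orders_rejects, and the rule itself are assumptions for illustration:

  -- Move rows that violate the validation rule into a reject table
  INSERT INTO stg_orders_rejects
  SELECT * FROM stg_orders
   WHERE amount < 0 OR order_date IS NULL;

  DELETE FROM stg_orders
   WHERE amount < 0 OR order_date IS NULL;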

Transformation Techniques

Field Splitting and Consolidation:
A single physical field in the source system may need to be split into more than one target warehouse field.
Several source system fields may have to be consolidated and stored in a single warehouse field.

Example - splitting an address field:

  Address field:  # 123 ABC Street, DEF City, Republic of GH

  No:      123
  Street:  ABC STREET
  City:    DEF
  Country: GH
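
A minimal field-splitting sketch using SUBSTR and INSTR on a comma-delimited address; the stg_address table and addr column are assumed names:

  -- Split 'street, city, country' into three fields
  SELECT TRIM(SUBSTR(addr, 1, INSTR(addr, ',') - 1))                 AS street_part,
         TRIM(SUBSTR(addr, INSTR(addr, ',') + 1,
                     INSTR(addr, ',', 1, 2) - INSTR(addr, ',') - 1)) AS city_part,
         TRIM(SUBSTR(addr, INSTR(addr, ',', 1, 2) + 1))              AS country_part
    FROM stg_address;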

Transformation Techniques
Standardization: Standards and conventions for abbreviations are applied to individual data items to improve uniformity in both source and target objects.

  Before standardization:               After standardization:
  System A Order Date: 05 August 1998   System A Order Date: August 05 1998
  System B Order Date: 08-08-98         System B Order Date: August 08 1998
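
A minimal date-standardization sketch that converts both source conventions to one display format; the column names are assumptions:

  -- System A holds '05 August 1998'; System B holds '08-08-98'
  SELECT TO_CHAR(TO_DATE(order_date_a, 'DD Month YYYY'), 'Month DD YYYY') AS std_a,
         TO_CHAR(TO_DATE(order_date_b, 'DD-MM-RR'),      'Month DD YYYY') AS std_b
    FROM stg_orders;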

Transformation Techniques
Deduplication: Rules are defined to identify duplicate records of customers or products. Where two or more repeated records are found, they are merged to form one warehouse record.

  System A Customer Name: John W Istin
  System B Customer Name: John William Istin

  Merged warehouse record - Customer Name: John William Istin
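
A minimal deduplication sketch that keeps one row per business key in a staging table; stg_customer and the "keep the first ROWID" rule are assumptions (a real rule would prefer the most complete record):

  -- Remove all but one row per customer_id
  DELETE FROM stg_customer s
   WHERE s.ROWID NOT IN (SELECT MIN(s2.ROWID)
                           FROM stg_customer s2
                          GROUP BY s2.customer_id);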

Transformation Tools
Some of the transformation tools include:
Apertus Carleton: Enterprise/Integrator
DataMirror: Transformation Server
Informatica: PowerMart Designer

TRANSPORTATION

Transporting the Data


TRANSPORTING DATA INTO WAREHOUSE
BUILDING THE TRANSPORTATION PROCESS
TRANSPORTING THE DATA
POST PROCESSING OF LOADED DATA

Transporting Data into Warehouse
The transformed data is then transported into the data warehouse. The load images are transported through the loaders into the warehouse.
Data Loaders:
Data loaders load transformed data into the data warehouse.
Stored procedures can be used to handle the warehouse loading if the images are available in the same RDBMS engine.

Transporting Data into Warehouse

[Diagram: Source Data -> Extract -> Staging Area -> Load -> Warehouse Schema]

Transporting Data into Warehouse

Warehouse Schema: simply the dimensional model (dimensions and facts).
Staging Area: simply the workspace where data is made ready after cleaning. This minimizes the time required to prepare the data.
Source Data: can be a flat file, an Oracle table, or some other form.

Transporting Data into Warehouse

First of all, the source data (a flat file, Oracle, or another form) comes to the staging area. This is called extraction from the source; putting it into the staging area after cleaning can be done through a tool, PL/SQL, or SQL*Loader.
In the staging area, the data can be transformed to the required format. After transforming the data in the staging area, it can be moved to the warehouse through the tool or PL/SQL scripts.

Building the Transporting Process
For transporting data we can use:
PL/SQL scripts
SQL*Loader routines for flat files
An ETL tool

Building the Transporting Process

Using PL/SQL
With PL/SQL scripts we can load data into the warehouse from one or more source tables or files. We use PL/SQL for adding surrogate keys to the tables and doing some transformation. We transform based on the requirement, and also store the data in a way that increases performance. (A sketch follows.)
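
A minimal sketch of a PL/SQL load that adds a surrogate key from a sequence; product_seq, stg_product, and dim_product are assumed names for illustration:

  CREATE OR REPLACE PROCEDURE load_dim_product IS
  BEGIN
    -- Surrogate key from a sequence; a simple character transformation on the name
    INSERT INTO dim_product (product_key, product_id, product_name)
    SELECT product_seq.NEXTVAL, product_id, UPPER(product_name)
      FROM stg_product;
    COMMIT;
  END;
  /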

Building the Transporting Process

Using SQL*Loader
Similarly, we can use SQL*Loader for putting data directly from flat files into the tables. We use this for bulk loading. SQL*Loader can load both varying-length and fixed-format files. (A control file sketch follows.)
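
A minimal SQL*Loader control file sketch for a comma-delimited file; the file, table, and the login in the command line are assumptions:

  -- products.ctl
  LOAD DATA
  INFILE 'products.dat'
  APPEND INTO TABLE stg_product
  FIELDS TERMINATED BY ','
  (product_id, product_name, price)

Run with, for example: sqlldr userid=scott/tiger control=products.ctl direct=true (direct=true requests the direct-path bulk load).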

Building the Transporting Process

Using Tools
We can also use a tool for this purpose. A tool provides graphical features: you map the source to the target, add the transformation steps, and it automatically generates the script for transporting the data to the target.
Tools: Oracle Warehouse Builder, Informatica

Transporting the Data

After building the process, the data is loaded into the warehouse. For a PL/SQL process this is done by executing the procedures; for SQL*Loader routines it is done by running the routines.

Post Processing of Loaded Data

Scheduling of Jobs
Oracle Enterprise Manager or an Oracle package (DBMS_JOB) can be used for this purpose. All the jobs or procedures can be scheduled according to the loading requirement. In OEM you can submit a job for scheduling and set the interval for the job; at a later stage you can alter this setting.

Post Processing of Loaded Data

OEM internally uses DBMS_JOB for all its scheduling. DBMS_JOB is a package that can be used for scheduling purposes. You can schedule any job and set its interval by writing a procedure for the job. The job is then executed automatically at the interval set for it.

Post Processing of Loaded Data

  create or replace procedure schedule_job is
    job_no number;
  begin
    -- Submit a job that runs the procedure insert_temp, starting now,
    -- repeating every 1/48 of a day (i.e. every 30 minutes)
    DBMS_JOB.SUBMIT( job_no,
                     'insert_temp;',
                     sysdate,
                     'sysdate+1/48' );
    commit;  -- the job is not visible to the job queue until commit
    dbms_output.put_line('job '||to_char(job_no));
  end;
  /
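
To verify what was scheduled, a usage sketch against the standard USER_JOBS dictionary view:

  -- Each submitted job, what it runs, and when it runs next
  SELECT job, what, next_date, interval FROM user_jobs;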

Data Warehouse Building

[Diagram: operational sources A, B, and C each contribute a part -> Extraction -> Transformation (categorization of transaction data) -> analytical users' view of A, B, C]
ETVL Tools
The following are popular ETVL tools:

Oracle Warehouse Builder

Informatica

Sagent

SAS Warehouse Administrator

ETVL Tools
Oracle Warehouse Builder - Key Features

Easy to use - graphical design.

Wizard-driven interface.

Integrated metadata via the Common Warehouse Metamodel (CWM).

Tightly integrated with Oracle 8i.

A library of pre-defined transformations available.

ETVL Tools
Oracle Warehouse Builder - Key Features

Graphical mapping and Transformation design.

Automated Code Generation.

Support for Heterogeneous Sources.

LEAVING A METADATA TRAIL


DEFINING WAREHOUSE METADATA
DEVELOPING A METADATA STRATEGY
EXAMINING TYPES OF METADATA
METADATA MANAGEMENT TOOLS
COMMON WAREHOUSE METADATA

DEFINING WAREHOUSE
METADATA

Metadata

What is Metadata?
Traditionally defined as "data about data".
A form of abstraction that describes the structure and contents of the data warehouse.

Metadata

Metadata is more comprehensive and transcends the data.
Metadata provides the format and name of data items.
It actually provides the context in which a data element exists:
the domain of possible values;
the relation the data element has to others;
the data's business rules;
and even the origin of the data.

Importance of Metadata
Metadata establish the context of the warehouse data.
Metadata help warehouse administrators and users locate and understand data items, both in the source systems and in the warehouse data structures.
E.g.: The date 02/05/98 could mean either May 2, 1998 or February 5, 1998, depending on the date convention used. Metadata describing the format of this date field could help determine the definite and unambiguous meaning of the data item.

Importance of Metadata

Metadata facilitate the Analysis Process
Metadata must provide warehouse end-users with the information they need to easily perform the analysis steps. It should thus allow users to quickly locate data that are in the warehouse.
Metadata should allow analysts to interpret data correctly by providing information about data formats and data definitions.

Importance of Metadata
Metadata are a form of Audit Trail for Data
Transformation
Metadata document the transformation of source data
into warehouse data. Hence warehouse metadata must
be capable of explaining how a particular piece of
warehouse data was derived from the operational
systems.
All business rules governing the transformation of data
to new values or new formats are also documented as
metadata.

Importance of Metadata
This kind of audit trail is required:
- to build users' confidence regarding the veracity and quality of warehouse data
- to know where the data came from, so that the user has a good understanding of warehouse data
- by some warehousing products, which use this type of metadata to generate extraction and transformation scripts for use in the warehouse back-end

Importance of Metadata
Metadata Improve or Maintain Data Quality
Metadata can improve or maintain warehouse data quality through the definition of valid values for individual warehouse data items. Using a data quality tool prior to the actual loading into the warehouse, the warehouse load images can be reviewed for compliance with the valid values for key data items. Data errors are quickly highlighted for correction.
Metadata can also be the basis for any error-correction processing that should be done when a data error is found. Error-correction rules are documented in the metadata repository and executed by program code on an as-needed basis.

DEVELOPING A METADATA
STRATEGY

METADATA STRATEGY

Metadata organization and administration, which promotes sharing and central management of metadata in a distributed repository architecture.

Content creation and integrity, to maintain consistency of metadata that may be passed among various tools throughout the phases of the project.

METADATA STRATEGY

Component-based metadata sharing, which includes facilities for exchanging metadata among upstream design/modeling tools and downstream analytical tools.

Planning for the future, necessary for ensuring compatibility with emerging metadata and interoperability standards.

EXAMINING
TYPES OF METADATA

METADATA TYPES

ADMINISTRATIVE METADATA
END-USER METADATA
OPTIMIZATION METADATA

Metadata has 3 major categories
The first category is the metadata associated with the decision-support database.
This metadata describes the database structures such as tables, columns, and partitions, as well as security settings and operational information.
The second category of data warehouse metadata is used by the end user to navigate the database.
A query and analysis tool, such as BusinessObjects from Business Objects Inc. or PowerPlay from Cognos Corp., usually creates and manages this metadata.

Metadata has 3 major categories
The third category is the metadata created by the back-end extract/transformation tool that is used to move data from the source systems to the data warehouse.
This metadata is primarily concerned with source data definitions, transformation logic, and source-to-target data mappings. These tools also must be concerned with process scheduling, maintaining data integrity, and error management.

ADMINISTRATIVE METADATA

These contain descriptions of the source databases and their contents, the data warehouse objects, and the business rules used to transform data from the sources into the data warehouse.
Data sources: Descriptions of all data sources used by the warehouse, including information about data ownership. Any relationships between different data sources (e.g., one provides data to the other) are also documented.
Source-to-target field mapping: The mapping of source fields (in operational systems) to target fields (in the data warehouse) explains what fields are used to populate the data warehouse. It also documents transformations and formatting changes that were applied to the original, raw data to derive the warehouse data.

ADMINISTRATIVE METADATA

Warehouse schema design: Describes the warehouse servers, databases, database tables, fields, and any hierarchies that may exist in the data. All referential tables, system codes, etc., are also documented.
Warehouse back-end data structure: A model of the back-end of the warehouse, including staging tables, load image tables, and any other temporary data structures that are used during the data transformation process.
Warehouse back-end tools or programs: A definition of each extraction, transformation, and quality assurance program or tool that is used to build or refresh the data warehouse.

ADMINISTRATIVE METADATA

Warehouse architecture: If the warehouse architecture is one where an enterprise warehouse feeds many departmental or vertical data marts, the architecture should be documented. If a data mart contains a logical subset of the data warehouse contents, this subset should also be defined.
Business rules and policies: All applicable business rules and policies are documented. Examples include business formulae for computing costs or profits.
Access and security rules: Rules governing the flow of data across various users, and their access limitations, are documented.

END-USER METADATA
End-user metadata help users create their queries and interpret the results. They also contain:
Warehouse contents: Must describe the data structure and contents of the data warehouse in user-friendly terms. Aliases, rules, summaries, and precomputed totals are to be documented.
Predefined queries and reports: Queries and reports that have been predefined are documented to avoid duplication of effort.
Business rules and policies: All business rules, and changes to these rules over time, should be documented.

END-USER METADATA
Hierarchy definitions: Hierarchy definitions are important to support drilling up and down warehouse dimensions.
Status information: Status information is required to inform warehouse users of the warehouse status at any point in time.
Data quality: Known data quality problems in the warehouse should be clearly documented; this will prompt users to make careful use of warehouse data.

END-USER METADATA
Warehouse load history: A history of data errors, data volumes, and load schedules should be available.
Warehouse purging rules: The rules that determine when data is removed from the warehouse should be known to end users.

OPTIMIZATION METADATA

Metadata are maintained to aid in the optimization of the data warehouse design and performance.
Aggregate definitions: All warehouse aggregates should be documented so that front-end tools with aggregate navigation facilities can rely on this type of metadata.
Collection of query statistics: It is helpful to track the types of queries that are made against the warehouse. This helps in optimization and tuning, and also helps to identify data that are largely unused.

METADATA MANAGEMENT
TOOLS

METADATA MANAGEMENT TOOLS

The metadata catalog is a generic descriptor for the overall set of metadata used in the warehouse.
Tools are needed for cataloging all of this metadata and keeping track of it. The tool probably can't read and write all the metadata directly, but it will manage metadata stored in many locations.
The functions and services required in metadata catalog maintenance include:
1. Information catalog integration/merge - from data model to database to front-end tools.
2. Metadata management - remove old, unused entries.

METADATA MANAGEMENT TOOLS

3. Capture existing metadata - from mainframe or other sources.
4. Manage and display graphical and tabular representations of the metadata catalog contents - a metadata browser.
5. Maintain user profiles for application and security use.
6. Security for the metadata catalog.
7. Local or centralized metadata catalog support.
8. Creating remote procedure calls to provide

COMMON WAREHOUSE
METADATA

The CWM Metamodel

The CWM metamodel is organized into 18 packages arranged in 4 layers on a UML base (see the figure below).
CWM's architecture defines its sub-metamodels as individual packages. Because CWM uses modeling techniques that minimize the number of dependencies between its packages, tool integrators can select only those metamodel services they need while avoiding problems common to large, monolithic metamodels (such as UML).

The CWM Metamodel

[Figure: CWM package layers on a UML base]
  Management layer: Warehouse Process, Warehouse Operation
  Analysis layer: Transformation, OLAP, Data Mining, Information Visualization, Business Nomenclature
  Resource layer: Object (UML), Relational, Record, Multidimensional, XML
  Foundation layer: Business Information, Data Types, Expressions, Keys & Indexes, Type Mapping, Software Deployment
  Base layer: UML 1.3 (Foundation, Behavioral_Elements, Model_Management)

Package counts:

           Classes   Associations
  CWM        157        115
  CWMX       130         77
  Total      287        192
The CWM Metamodel Cont.

The four layers of the CWM collect together different sorts of metamodel packages:
The Base layer contains the standard UML 1.3 notation and the extensions to support warehouse concepts.
The Foundation layer contains the metamodel shared by other packages (Business Information, Data Types, Expressions, Keys & Indexes, Software Deployment, Type Mapping).
The Resource layer contains data models used for operational data sources and target data warehouses.

The CWM Metamodel Cont.

The Analysis layer provides metamodels supporting


logical services that may be mapped onto data stores
defined by Resource layer packages. For example, the
Transformation metamodel supports the definition of
transformations between data warehouse sources and
targets, and the OLAP metamodel allows data
warehouses stored in either relational or
multidimensional data engines to be viewed as
dimensions and cubes.

The CWM Metamodel Cont.

The Management layer metamodels support the operation of data warehouses by allowing the definition and scheduling of operational tasks (Warehouse Process package) and by recording the activity of warehouse processes and related statistics (Warehouse Operation package).

CWM Design Basis

In accordance with the solution framework, the metamodeling architecture constitutes 4 layers:

Metamodeling language (M3)
Metamodels (M2)
Metadata or Models (M1)
Data or Objects (M0)

CWM Design Basis

Standard OMG components:
  Modeling Language: UML
  Metadata Interchange: XMI
  Metadata API: MOF IDL Mapping

[Figure: the four metamodeling layers, spanning middleware and application]
  Meta-metamodel layer (M3):    MOF: Class, Attribute, Operation, Association
  Metamodel layer (M2):         UML: Class, Attribute; CWM: Table, Column, ElementType, Attribute
  Metadata/Model layer (M1):    Stock: name, price
  User Data/Object layer (M0):  <Stock name="IBM" price="112"/>

Our Vision..

Enable Decisions@speed of thought

SATYAM - Our People Make The Difference

Thank You