
Building the Warehouse

Concepts

Satyam Computer Services Ltd.

Introduction to Building a Data Warehouse

Data Warehousing Architecture

[Diagram: operational systems and external systems feed an information transformation/migration infrastructure with replication services. The enterprise data warehouse supplies dependent data marts (Sales, Marketing); an independent Finance data mart is fed directly. LAN clients access the warehouse directly, while light clients reach it through a web server.]
Data Warehousing Architecture

[Diagram: legacy-system data stores feed an extraction/transformation server, which loads a staging area and then the warehouse/datamart. A metadata repository, maintained through the design/management environment, holds the metadata. Tools involved: scrubbing tool, mapping tool, extraction management tool, transformation tool, migration management tool.]
Building a Data Warehouse
Steps Involved in Building a Data Warehouse

Extracting, Transforming, and Transporting Data

Extracting Data
Extraction Techniques
Extraction Tools

Extraction Phase

Examine the source data and identify the extraction tool.
Extracts are typically written in source-system code (e.g. PL/SQL, VB Script, or COBOL).
The extraction tool also generates source-system code.
Using a tool makes the extraction process easier than hand-coding.
Pre- and post-processes may exist: e.g., before the extract there may be a call to sort the data, or a call to a function that scores a record based on a formula. (A minimal extract sketch follows.)
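
A minimal sketch of a hand-coded extract in PL/SQL; the names src_orders and stg_orders are assumptions for illustration, not objects named in this material:

  -- Copy applicable rows from a source table into a staging table
  CREATE OR REPLACE PROCEDURE extract_orders IS
  BEGIN
    INSERT INTO stg_orders (order_id, customer_id, order_date, amount)
    SELECT order_id, customer_id, order_date, amount
      FROM src_orders;
    COMMIT;
  END;
  /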

Transformation Phase

Importance of quality data
Creating business rules
Tools are available to create reusable transformation modules or objects.
Simple data transformations, including date, number, and character conversion
Assigning surrogate keys
Combining data from separate sources
Validating one-to-one and one-to-many relationships
(A minimal transformation sketch follows.)
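
A minimal sketch of two common transformations, a surrogate key drawn from a sequence and a character-to-date conversion; customer_seq, stg_customer, and dim_customer are assumed names:

  -- Assign a surrogate key and convert a character date while loading
  INSERT INTO dim_customer (customer_key, customer_id, signup_date)
  SELECT customer_seq.NEXTVAL,
         customer_id,
         TO_DATE(signup_date_str, 'DD-MM-YYYY')
    FROM stg_customer;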

Transporting Phase
Insert statements generate logs; a bulk loader is advisable.
Truncate target tables before a full refresh.
Index management: drop indexes before the load and reindex afterwards. (A sketch follows.)
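
A minimal full-refresh sketch; sales_fact and sales_fact_idx are assumed names. Marking the index unusable before a bulk load and rebuilding afterwards is one common way to realize "drop and reindex":

  TRUNCATE TABLE sales_fact;            -- empty the target before the full refresh
  ALTER INDEX sales_fact_idx UNUSABLE;  -- avoid index maintenance during the load
  -- ... bulk load here, e.g. a SQL*Loader direct-path load ...
  ALTER INDEX sales_fact_idx REBUILD;   -- reindex after the load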

Refresh Phase

Process slowly changing dimensions.

Automate the Extract-Transform-Load cycle.

Incremental fact table extracts (see the sketch below).

Purging and archiving data.
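
A minimal sketch of an incremental fact extract driven by a last-load timestamp; etl_control, src_sales, stg_sales, and the last_updated column are assumed names:

  -- Pull only rows changed since the previous load
  INSERT INTO stg_sales
  SELECT s.*
    FROM src_sales s
   WHERE s.last_updated > (SELECT last_load_date FROM etl_control);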

Extracting Data
Extraction Process in Detail
Extraction is the process of getting data from a legacy system or any other data source. After extraction, the data is put in a staging area where it can be scrubbed and cleaned.
The data may come from a single source or from multiple sources. If there are multiple sources, a connector tool is required to connect them.
If the data is from a single source, it can come from an OLTP system or from a flat file.

Extracting Data
The extraction process can be done either by hand-coding or by using tools.
There are advantages and disadvantages to both custom-programmed extraction (PL/SQL scripts) and tool-based extraction:
Tools offer a well-defined, disciplined approach and documentation.
Tools provide an easier way to perform extraction through click, drag-and-drop features.

Extracting Data

Hand-coded extraction is cost-effective, since the PL/SQL constructs are available with the RDBMS.

Hand-coded extraction is used when the programmer has clear knowledge of the data structures involved.

Extraction Techniques
Extraction Methods

Bulk Extraction
The entire data warehouse is refreshed periodically by extractions from the source systems. All applicable data are extracted from the source systems for loading into the warehouse. This approach makes heavy use of the network connection for loading data from source to target databases, but the mechanism is easy to set up and maintain.

Extraction Techniques
Extraction Methods

Change-Based Replication
Only data that have been newly inserted or updated in the source systems are extracted and loaded into the warehouse. This approach uses less network bandwidth because of the smaller volume of data transported. The mechanism involves complex programming to determine when a new warehouse record must be inserted and when an existing warehouse record must be updated. (A sketch of that decision follows.)
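
A minimal sketch of the insert-or-update decision in PL/SQL (written without MERGE so it also suits older Oracle releases); src_customer, wh_customer, and the last_updated column are assumed names:

  DECLARE
    v_last_load DATE := TRUNC(SYSDATE) - 1;  -- assumed previous load time
  BEGIN
    FOR r IN (SELECT customer_id, name
                FROM src_customer
               WHERE last_updated > v_last_load) LOOP
      -- Try to update an existing warehouse record first
      UPDATE wh_customer SET name = r.name
       WHERE customer_id = r.customer_id;
      -- If nothing was updated, the record is new: insert it
      IF SQL%ROWCOUNT = 0 THEN
        INSERT INTO wh_customer (customer_id, name)
        VALUES (r.customer_id, r.name);
      END IF;
    END LOOP;
    COMMIT;
  END;
  /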

Extraction Techniques
Hand-Coding Development Practices

Set up header and comment fields for the code.
Stick to the naming standards.
Test everything - both unit testing and system testing.
Document everything.

Extracting Data
Criteria for Identifying an Extraction Tool

The source system platform and database:
Tools cannot access all types of data sources on all types of computing platforms.

Built-in extraction or duplication functionality:
The availability of built-in extraction or duplication reduces the technical difficulties inherent in the data extraction process.

Extracting Data
Criteria for Identifying an Extraction Tool

The batch windows of the operational systems:
Some extraction mechanisms are faster or more efficient than others. The batch windows of the operational systems determine the time frame available for the extraction.

Extraction Tools
Extraction tools include:
Apertus Carleton: Passport
Evolutionary Technologies: ETI Extract
Platinum: InfoPump

TRANSFORMING DATA

Transforming Data
IMPORTANCE OF QUALITY DATA
TRANSFORMATION
TRANSFORMING DATA: PROBLEMS AND SOLUTIONS
TRANSFORMATION TECHNIQUES
TRANSFORMATION TOOLS

Importance of Quality Data

Quality Data:
Before the extracted data is transformed, its quality has to be examined. Once quality data is transformed, minimal change is needed at the target, which reduces inconsistencies between source and target.

Data Quality Assurance


Characteristics of Quality Data
Accurate
Complete
Consistent
Unique
Timely

Data Quality Assurance


Data Quality Tools assist warehousing teams with the
task of locating and correcting data errors.
Corrections can be made to the source or to the target. But when corrections are made to the target, they cause inconsistencies between the source and target data, which creates synchronization problems.

Data Quality Tools

Though dirty data continues to be one of the biggest issues for data warehousing initiatives, research indicates that data quality investments are a small percentage of total warehouse spending.

DataFlux: Data Quality Workbench
Pine Cone Systems: Content Tracker
Prism: Quality Manager
Vality Technology: Integrity Data Reengineering

Transformation
Transformation:
Transformation is the process by which extracted data is converted into the appropriate format. The extracted data is put into the staging area, where cleaning and scrubbing take place and the data is stored so that the clean data can be transformed. In the transformation phase, data can also come from a cleansing tool. After transformation, the data goes to the transportation stage.

Transforming Data: Problems and Solutions
The common problems in data that come out of a legacy system are:
Inconsistent or incorrect use of codes and special characters.
A single field used for unofficial or undocumented purposes.
Overloaded codes.
Evolving data.
Missing, incorrect, or duplicate values.

Transforming Data: Problems and Solutions
There are different solutions available to check whether the data to be loaded is correct:

Cross-Footing
A template of quality-data norms can be used to identify erroneous data by comparing it with the norms in the template.

Manual Examination
A sampling methodology can be selected and a manual examination made of the sampled data.

Process Validation
Scripts can be generated that identify erroneous records and segregate them (see the sketch below).
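
A minimal process-validation sketch that segregates rows failing a rule into a reject table; stg_orders, stg_orders_rejects, and the rule itself are assumptions for illustration:

  -- Move rows that violate the validation rule into a reject table
  INSERT INTO stg_orders_rejects
  SELECT * FROM stg_orders
   WHERE amount < 0 OR order_date IS NULL;

  DELETE FROM stg_orders
   WHERE amount < 0 OR order_date IS NULL;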

Transformation Techniques

Field Splitting and Consolidation:
A single physical field in the source system may need to be split into more than one target warehouse field.
Several source system fields may have to be consolidated and stored in a single warehouse field.

Example - splitting an address field:

  Address field:  # 123 ABC Street, DEF City, Republic of GH

  No:      123
  Street:  ABC STREET
  City:    DEF
  Country: GH
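
A minimal field-splitting sketch using SUBSTR and INSTR on a comma-delimited address; the stg_address table and addr column are assumed names:

  -- Split 'street, city, country' into three fields
  SELECT TRIM(SUBSTR(addr, 1, INSTR(addr, ',') - 1))                 AS street_part,
         TRIM(SUBSTR(addr, INSTR(addr, ',') + 1,
                     INSTR(addr, ',', 1, 2) - INSTR(addr, ',') - 1)) AS city_part,
         TRIM(SUBSTR(addr, INSTR(addr, ',', 1, 2) + 1))              AS country_part
    FROM stg_address;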

Transformation Techniques
Standardization: Standards and conventions for abbreviations are applied to individual data items to improve uniformity in both source and target objects.

  Before standardization:               After standardization:
  System A Order Date: 05 August 1998   System A Order Date: August 05 1998
  System B Order Date: 08-08-98         System B Order Date: August 08 1998
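
A minimal date-standardization sketch that converts both source conventions to one display format; the column names are assumptions:

  -- System A holds '05 August 1998'; System B holds '08-08-98'
  SELECT TO_CHAR(TO_DATE(order_date_a, 'DD Month YYYY'), 'Month DD YYYY') AS std_a,
         TO_CHAR(TO_DATE(order_date_b, 'DD-MM-RR'),      'Month DD YYYY') AS std_b
    FROM stg_orders;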

Transformation Techniques
Deduplication: Rules are defined to identify duplicate records of customers or products. Where two or more repeated records are found, they are merged to form one warehouse record.

  System A Customer Name: John W Istin
  System B Customer Name: John William Istin

  Merged warehouse record - Customer Name: John William Istin
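
A minimal deduplication sketch that keeps one row per business key in a staging table; stg_customer and the "keep the first ROWID" rule are assumptions (a real rule would prefer the most complete record):

  -- Remove all but one row per customer_id
  DELETE FROM stg_customer s
   WHERE s.ROWID NOT IN (SELECT MIN(s2.ROWID)
                           FROM stg_customer s2
                          GROUP BY s2.customer_id);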

Transformation Tools
Some of the transformation tools include:
Apertus Carleton: Enterprise/Integrator
DataMirror: Transformation Server
Informatica: PowerMart Designer

TRANSPORTATION

Transporting the Data


TRANSPORTING DATA INTO WAREHOUSE
BUILDING THE TRANSPORTATION PROCESS
TRANSPORTING THE DATA
POST PROCESSING OF LOADED DATA

Transporting Data into Warehouse
The transformed data is then transported into the data warehouse. The load images are transported through the loaders into the warehouse.
Data Loaders:
Data loaders load transformed data into the data warehouse.
Stored procedures can be used to handle the warehouse loading if the images are available in the same RDBMS engine.

Transporting Data into Warehouse

[Diagram: Source Data -> Extract -> Staging Area -> Load -> Warehouse Schema]

Transporting Data into Warehouse

Warehouse Schema: simply the dimensional model (dimensions and facts).
Staging Area: simply the workspace where data is made ready after cleaning. This minimizes the time required to prepare the data.
Source Data: can be a flat file, an Oracle table, or some other form.

Transporting Data into Warehouse

First of all, the source data (a flat file, Oracle, or another form) comes to the staging area. This is called extraction from the source; putting it into the staging area after cleaning can be done through a tool, PL/SQL, or SQL*Loader.
In the staging area, the data can be transformed to the required format. After transforming the data in the staging area, it can be moved to the warehouse through the tool or PL/SQL scripts.

Building the Transporting Process
For transporting data we can use:
PL/SQL scripts
SQL*Loader routines for flat files
An ETL tool

Building the Transporting Process

Using PL/SQL
With PL/SQL scripts we can load data into the warehouse from one or more source tables or files. We use PL/SQL for adding surrogate keys to the tables and doing some transformation. We transform based on the requirement, and also store the data in a way that increases performance. (A sketch follows.)
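
A minimal sketch of a PL/SQL load that adds a surrogate key from a sequence; product_seq, stg_product, and dim_product are assumed names for illustration:

  CREATE OR REPLACE PROCEDURE load_dim_product IS
  BEGIN
    -- Surrogate key from a sequence; a simple character transformation on the name
    INSERT INTO dim_product (product_key, product_id, product_name)
    SELECT product_seq.NEXTVAL, product_id, UPPER(product_name)
      FROM stg_product;
    COMMIT;
  END;
  /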

Building the Transporting Process

Using SQL*Loader
Similarly, we can use SQL*Loader for putting data directly from flat files into the tables. We use this for bulk loading. SQL*Loader can load both varying-length and fixed-format files. (A control file sketch follows.)
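
A minimal SQL*Loader control file sketch for a comma-delimited file; the file, table, and the login in the command line are assumptions:

  -- products.ctl
  LOAD DATA
  INFILE 'products.dat'
  APPEND INTO TABLE stg_product
  FIELDS TERMINATED BY ','
  (product_id, product_name, price)

Run with, for example: sqlldr userid=scott/tiger control=products.ctl direct=true (direct=true requests the direct-path bulk load).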

Building the Transporting Process

Using Tools
We can also use a tool for this purpose. A tool provides graphical features: you map the source to the target, add the transformation steps, and it automatically generates the script for transporting the data to the target.
Tools: Oracle Warehouse Builder, Informatica

Transporting the Data

After building the process, the data is loaded into the warehouse. For a PL/SQL process this is done by executing the procedures; for SQL*Loader routines it is done by running the routines.

Post Processing of Loaded Data

Scheduling of Jobs
Oracle Enterprise Manager or an Oracle package (DBMS_JOB) can be used for this purpose. All the jobs or procedures can be scheduled according to the loading requirement. In OEM you can submit a job for scheduling and set the interval for the job; at a later stage you can alter this setting.

Post Processing of Loaded Data

OEM internally uses DBMS_JOB for all its scheduling. DBMS_JOB is a package that can be used for scheduling purposes. You can schedule any job and set its interval by writing a procedure for the job. The job is then executed automatically at the interval set for it.

Post Processing of Loaded Data

  create or replace procedure schedule_job is
    job_no number;
  begin
    -- Submit a job that runs the procedure insert_temp, starting now,
    -- repeating every 1/48 of a day (i.e. every 30 minutes)
    DBMS_JOB.SUBMIT( job_no,
                     'insert_temp;',
                     sysdate,
                     'sysdate+1/48' );
    commit;  -- the job is not visible to the job queue until commit
    dbms_output.put_line('job '||to_char(job_no));
  end;
  /
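
To verify what was scheduled, a usage sketch against the standard USER_JOBS dictionary view:

  -- Each submitted job, what it runs, and when it runs next
  SELECT job, what, next_date, interval FROM user_jobs;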

Data Warehouse Building

[Diagram: operational sources A, B, and C each contribute a part -> Extraction -> Transformation (categorization of transaction data) -> analytical users' view of A, B, C]
ETVL Tools
The following are popular ETVL tools:

Oracle Warehouse Builder

Informatica

Sagent

SAS Warehouse Administrator

ETVL Tools
Oracle Warehouse Builder - Key Features

Easy to use - graphical design.

Wizard-driven interface.

Integrated metadata via the Common Warehouse Metamodel (CWM).

Tightly integrated with Oracle 8i.

A library of pre-defined transformations available.

ETVL Tools
Oracle Warehouse Builder - Key Features

Graphical mapping and Transformation design.

Automated Code Generation.

Support for Heterogeneous Sources.

LEAVING A METADATA TRAIL


DEFINING WAREHOUSE METADATA
DEVELOPING A METADATA STRATEGY
EXAMINING TYPES OF METADATA
METADATA MANAGEMENT TOOLS
COMMON WAREHOUSE METADATA

DEFINING WAREHOUSE
METADATA

Metadata

What is Metadata?
Traditionally defined as "data about data".
A form of abstraction that describes the structure and contents of the data warehouse.

Metadata

Metadata is more comprehensive and transcends the data.
Metadata provides the format and name of data items.
It actually provides the context in which a data element exists:
the domain of possible values;
the relation the data element has to others;
the data's business rules;
and even the origin of the data.

Importance of Metadata
Metadata establish the context of the warehouse data.
Metadata help warehouse administrators and users locate and understand data items, both in the source systems and in the warehouse data structures.
E.g.: The date 02/05/98 could mean either May 2, 1998 or February 5, 1998, depending on the date convention used. Metadata describing the format of this date field could help determine the definite and unambiguous meaning of the data item.

Importance of Metadata

Metadata facilitate the Analysis Process
Metadata must provide warehouse end-users with the information they need to easily perform the analysis steps. It should thus allow users to quickly locate data that are in the warehouse.
Metadata should allow analysts to interpret data correctly by providing information about data formats and data definitions.

Importance of Metadata
Metadata are a form of Audit Trail for Data
Transformation
Metadata document the transformation of source data
into warehouse data. Hence warehouse metadata must
be capable of explaining how a particular piece of
warehouse data was derived from the operational
systems.
All business rules governing the transformation of data
to new values or new formats are also documented as
metadata.

Importance of Metadata
This kind of audit trail is required:
- to build users' confidence regarding the veracity and quality of warehouse data
- to know where the data came from, so that the user has a good understanding of warehouse data
- by some warehousing products, which use this type of metadata to generate extraction and transformation scripts for use in the warehouse back-end

Importance of Metadata
Metadata Improve or Maintain Data Quality
Metadata can improve or maintain warehouse data quality through the definition of valid values for individual warehouse data items. Using a data quality tool prior to the actual loading into the warehouse, the warehouse load images can be reviewed for compliance with the valid values for key data items. Data errors are quickly highlighted for correction.
Metadata can also be the basis for any error-correction processing that should be done when a data error is found. Error-correction rules are documented in the metadata repository and executed by program code on an as-needed basis.

DEVELOPING A METADATA
STRATEGY

METADATA STRATEGY

Metadata organization and administration, which promotes sharing and central management of metadata in a distributed repository architecture.

Content creation and integrity, to maintain consistency of metadata that may be passed among various tools throughout the phases of the project.

METADATA STRATEGY

Component-based metadata sharing, which includes facilities for exchanging metadata among upstream design/modeling tools and downstream analytical tools.

Planning for the future, necessary for ensuring compatibility with emerging metadata and interoperability standards.

EXAMINING
TYPES OF METADATA

METADATA TYPES

ADMINISTRATIVE METADATA
END-USER METADATA
OPTIMIZATION METADATA

Metadata has 3 major categories
The first category is the metadata associated with the decision-support database.
This metadata describes the database structures such as tables, columns, and partitions, as well as security settings and operational information.
The second category of data warehouse metadata is used by the end user to navigate the database.
A query and analysis tool, such as BusinessObjects from Business Objects Inc. or PowerPlay from Cognos Corp., usually creates and manages this metadata.

Metadata has 3 major categories
The third category is the metadata created by the back-end extract/transformation tool that is used to move data from the source systems to the data warehouse.
This metadata is primarily concerned with source data definitions, transformation logic, and source-to-target data mappings. These tools also must be concerned with process scheduling, maintaining data integrity, and error management.

ADMINISTRATIVE METADATA

These contain descriptions of the source databases and their contents, the data warehouse objects, and the business rules used to transform data from the sources into the data warehouse.
Data sources: Descriptions of all data sources used by the warehouse, including information about data ownership. Any relationships between different data sources (e.g., one provides data to the other) are also documented.
Source-to-target field mapping: The mapping of source fields (in operational systems) to target fields (in the data warehouse) explains what fields are used to populate the data warehouse. It also documents transformations and formatting changes that were applied to the original, raw data to derive the warehouse data.

ADMINISTRATIVE METADATA

Warehouse schema design: Describes the warehouse servers, databases, database tables, fields, and any hierarchies that may exist in the data. All referential tables, system codes, etc., are also documented.
Warehouse back-end data structure: A model of the back-end of the warehouse, including staging tables, load image tables, and any other temporary data structures that are used during the data transformation process.
Warehouse back-end tools or programs: A definition of each extraction, transformation, and quality assurance program or tool that is used to build or refresh the data warehouse.

ADMINISTRATIVE METADATA

Warehouse architecture: If the warehouse architecture is one where an enterprise warehouse feeds many departmental or vertical data marts, the architecture should be documented. If a data mart contains a logical subset of the data warehouse contents, this subset should also be defined.
Business rules and policies: All applicable business rules and policies are documented. Examples include business formulae for computing costs or profits.
Access and security rules: Rules governing the flow of data across various users, and their access limitations, are documented.

END-USER METADATA
End-user metadata help users create their queries and interpret the results. They also contain:
Warehouse contents: Must describe the data structure and contents of the data warehouse in user-friendly terms. Aliases, rules, summaries, and precomputed totals are to be documented.
Predefined queries and reports: Queries and reports that have been predefined are documented to avoid duplication of effort.
Business rules and policies: All business rules, and changes to these rules over time, should be documented.

END-USER METADATA
Hierarchy definitions: Hierarchy definitions are important to support drilling up and down warehouse dimensions.
Status information: Status information is required to inform warehouse users of the warehouse status at any point in time.
Data quality: Known data quality problems in the warehouse should be clearly documented; this will prompt users to make careful use of warehouse data.

END-USER METADATA
Warehouse load history: A history of data errors, data volumes, and load schedules should be available.
Warehouse purging rules: The rules that determine when data is removed from the warehouse should be known to end users.

OPTIMIZATION METADATA

Metadata are maintained to aid in the optimization of the data warehouse design and performance.
Aggregate definitions: All warehouse aggregates should be documented so that front-end tools with aggregate navigation facilities can rely on this type of metadata.
Collection of query statistics: It is helpful to track the types of queries that are made against the warehouse. This helps in optimization and tuning, and also helps to identify data that are largely unused.

METADATA MANAGEMENT
TOOLS

METADATA MANAGEMENT TOOLS

The metadata catalog is a generic descriptor for the overall set of metadata used in the warehouse.
Tools are needed for cataloging all of this metadata and keeping track of it. The tool probably can't read and write all the metadata directly, but it will manage metadata stored in many locations.
The functions and services required in metadata catalog maintenance include:
1. Information catalog integration/merge - from data model to database to front-end tools.
2. Metadata management - remove old, unused entries.

METADATA MANAGEMENT TOOLS

3. Capture existing metadata - from mainframe or other sources.
4. Manage and display graphical and tabular representations of the metadata catalog contents - a metadata browser.
5. Maintain user profiles for application and security use.
6. Security for the metadata catalog.
7. Local or centralized metadata catalog support.
8. Creating remote procedure calls to provide

COMMON WAREHOUSE
METADATA

The CWM Metamodel

The CWM metamodel is organized into 18 packages arranged in 4 layers on a UML base (see the figure below).
CWM's architecture defines its sub-metamodels as individual packages. Because CWM uses modeling techniques that minimize the number of dependencies between its packages, tool integrators can select only those metamodel services they need while avoiding problems common to large, monolithic metamodels (such as UML).

The CWM Metamodel

[Figure: CWM package layers on a UML base]
  Management layer: Warehouse Process, Warehouse Operation
  Analysis layer: Transformation, OLAP, Data Mining, Information Visualization, Business Nomenclature
  Resource layer: Object (UML), Relational, Record, Multidimensional, XML
  Foundation layer: Business Information, Data Types, Expressions, Keys & Indexes, Type Mapping, Software Deployment
  Base layer: UML 1.3 (Foundation, Behavioral_Elements, Model_Management)

Package counts:

           Classes   Associations
  CWM        157        115
  CWMX       130         77
  Total      287        192
The CWM Metamodel Cont.

The four layers of the CWM collect together different sorts of metamodel packages:
The Base layer contains the standard UML 1.3 notation and the extensions to support warehouse concepts.
The Foundation layer contains the metamodel shared by other packages (Business Information, Data Types, Expressions, Keys & Indexes, Software Deployment, Type Mapping).
The Resource layer contains data models used for operational data sources and target data warehouses.

The CWM Metamodel Cont.

The Analysis layer provides metamodels supporting


logical services that may be mapped onto data stores
defined by Resource layer packages. For example, the
Transformation metamodel supports the definition of
transformations between data warehouse sources and
targets, and the OLAP metamodel allows data
warehouses stored in either relational or
multidimensional data engines to be viewed as
dimensions and cubes.

The CWM Metamodel Cont.

The Management layer metamodels support the operation of data warehouses by allowing the definition and scheduling of operational tasks (Warehouse Process package) and by recording the activity of warehouse processes and related statistics (Warehouse Operation package).

CWM Design Basis

In accordance with the solution framework, the metamodeling architecture constitutes 4 layers:

Metamodeling language (M3)
Metamodels (M2)
Metadata or Models (M1)
Data or Objects (M0)

CWM Design Basis

Standard OMG components:
  Modeling Language: UML
  Metadata Interchange: XMI
  Metadata API: MOF IDL Mapping

[Figure: the four metamodeling layers, spanning middleware and application]
  Meta-metamodel layer (M3):    MOF: Class, Attribute, Operation, Association
  Metamodel layer (M2):         UML: Class, Attribute; CWM: Table, Column, ElementType, Attribute
  Metadata/Model layer (M1):    Stock: name, price
  User Data/Object layer (M0):  <Stock name="IBM" price="112"/>

Our Vision..

Enable Decisions@speed of thought

SATYAM - Our People Make The Difference

Thank You