
DataStage

:: FUNDAMENTAL CONCEPTS ::

DAY 1 Introduction to the Phases of DataStage

There are four different phases in DataStage:

Phase I: Data Profiling
It is for source system analysis, and the analyses are:
1. Column analysis,
2. Primary key analysis,
3. Foreign key analysis,
4. Base line analysis, and
5. Cross domain analysis.
By these analyses we can find whether the data is dirty or not.

Phase II: Data Quality (also called cleansing)
In this phase we must follow an inter-dependent sequence, i.e., one process after another as shown below:
Parsing -> Correcting -> Standardizing -> Matching -> Consolidating

Phase III: Data Transformation
The ETL process is done here; the data is transformed as it moves from one stage to another. ETL means E - Extract, T - Transform, L - Load.

Phase IV: Meta Data Management
Meta data means "data about the data".


DAY 2 How the ETL programming tool works?

Pictorial view: sources (data base, flat files, MS Excel) feed the ETL process, which loads the DWH and the data marts (DM); the business interface (BI) tools read from there.

Figure: ETL programming process (db / flat files / MS Excel -> ETL -> DWH -> DM -> BI).


DAY 3 Continued

Extract window: the data is extracted from the source, e.g. a .txt file (ASCII code), into staging (permanent data), where it is understood in DataStage format (Native Format).

Load window: after transformation the staged data is loaded from staging into the target, which may be a DWH, a data base, or a file that resides in the local repository, e.g. loading the data back into .txt (ASCII code).

ETL is a process that is performed in stages:

S (OLTP) -> T/S (stage area) -> T/S (stage area) -> T (DWH)

Here, S - source and T - target.

Home Work (HW): one record for each kindle (multiple records for multiple addresses and dummy records for joint accounts).


DAY 4 ETL Developer Requirements

Q: One record for each kindle (multiple records for multiple addresses and dummy records for joint accounts)?

Kindle means the information of a customer across the bank's products: loan, credit card, savings.

A customer maintained as one record but handling different addresses is called a single view of customer, or single version of truth.

HW explanation: here we must read the query very carefully and understand the terminology of the words from the business perspective. "Multiple records" means multiple records of the same customer, and "multiple addresses" means one customer (one account) maintaining multiple addresses across savings / credit card / current account / loan.

ETL Developer Requirements: the inputs to the developer are the HLD and LLD documents. Here, HLD - high level document, LLD - low level document.


ETL Developer Requirements are:
1. Understanding the given documents.
2. Prepare questions: after reading the given document, ask friends / forums / team leads / project leads.
3. Logical design: means paper work.
4. Physical model: using the tool.
5. Unit test.
6. Performance tuning.
7. Peer reviews: releasing versions (version control *.**, where * means a digit in the range 1-9).
8. Design Turn Over Document (DTD) / Detailed Design Document (DDD) / Technical Design Document (TDD).
9. Backups: means importing and exporting the data as required.
10. Job sequencing.


DAY 5 How is a DWH project undertaken?

Process flow: Requirements (HLD) -> Warehouse (WH) -> TD -> Developer -> TEST -> Production -> Migration.

Here an x (cross mark) marks the steps of the flow where the developer is not involved; the tick marks show where the developer (system engineer) is involved in the project and implements all TEN requirements shown above. The share of jobs is roughly: Development (70% - 80%), Production (10%), Migration (30%).

Production based companies are like IBM and so on. Migration means support based companies like TCS, Cognizant, Satyam Mahindra and so on.

In Migration: both server and parallel jobs are worked on. Server jobs were the environment used up to 2002; parallel jobs are the environment used after 2002 and up to today. IBM launched the X-Migrator, which converts server jobs to parallel jobs - up to 70% automatically, and the remaining 30% manually.


Projects are divided into categories with respect to the period (time) of the project:

Category      Period (in months / years)
Simple        6 months
Medium        6 months - 1 year
Complex       1 - 1 1/2 years
Too complex   1 1/2 - 5 years and so on (it may take many years depending upon the project)

5.1. Project Process:

HLD (high level documents) Requirements: SRS, BRD (prepared by the business analyzer / subject matter expert).

HLD Warehouse: Architecture, Schema (structure), Dimensions and tables (target tables), Facts.

LLD (low level docs): TD, Mapping Docs (specifications - specs), Test Specs, Naming Docs.


5.2. Mapping Document: for example, if the query requirements are 1 - experience of the employee, 2 - dname, and 3 - first name, middle name, last name, the mapping is laid out as below.

A mapping document has columns such as: S.no, Load order, Target Entity, Target Attributes/Tables, Source, Source Fields, Transformation, Constant, Pk/Fk/Sk, Error Handling, Common fields.

For this example:
- Target entities: Exp_tbl (Eno, FName, MName, LName) and Exp_emp (DName).
- Source tables: Emp (Ename, Eno, Dno, Hire date) and Dept (Dno, Dname).
- Transformation: experience = Current Date - Hire date (CD - HD).

Funneling: getting data from multiple sources (S1, S2) and combining them into one target; the combining may be horizontal combining or vertical combining. As per this example, horizontal combining is used.


Emp and Dept are combined with HC into the target (Trg). Here, HC means Horizontal Combination: it is used to combine primary rows with secondary rows.

As a developer you will get a maximum of about 100 source fields and about 30 target fields.

Look up means cross verification against the primary table.

After the mapping document: the sources may be files (.txt in formats fwf, csv, vl, sc, s & t, h & t) or data bases (types of dB); S1 and S2 are combined with HC and loaded into TRG. This is the format of the Mapping Document.
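A minimal SQL sketch of this mapping, assuming Oracle-style date functions and illustrative EMP/DEPT column names (the split of the name into first/middle/last is not shown in the notes, so those columns are assumptions):

    -- Hedged sketch: horizontal combine of EMP (primary) and DEPT (secondary)
    -- plus the CD - HD transformation for experience.
    SELECT e.eno,
           e.fname,
           e.mname,
           e.lname,
           d.dname,
           FLOOR(MONTHS_BETWEEN(SYSDATE, e.hire_date) / 12) AS experience_years
    FROM   emp  e
    LEFT OUTER JOIN dept d
           ON d.dno = e.dno;   -- Fk (emp.dno) matched to Pk (dept.dno)

The join supplies dname from the secondary table, while the derived column implements the Current Date - Hire Date rule from the transformation column of the mapping document.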

DAY 6 Architecture of DWH



For example: the Reliance Group has branches such as Reliance Comm., Reliance Power and Reliance Fresh, and every branch has its own manager and its own dB. Above all these managers there is one Top Level Manager (TLM), and the TLM needs the details of sales, customer, employee, period and order as input for analysis.

For the above example the ETL process is done as follows: each branch manager's dB (e.g. the RC manager's ERP, Reliance Fresh) feeds an ETL process. The data either lands in the DWH and from there goes to a dependent data mart (mini WH / bottom level), or one branch (e.g. Reliance Fresh, taken from the group directly) goes through the ETL process straight into an independent data mart.

Dependent Data Mart: the ETL process takes all the managers' information (dB) and keeps it in the warehouse, so the data transfer between the warehouse and the data mart depends on the warehouse. The data mart (also called the bottom level / mini WH) holds the data of an individual manager (like RF, RC, RP and so on); because the data mart depends upon the WH, it is called a dependent data mart.

Independent Data Mart: only one individual manager's data, i.e., the data mart is loaded directly by the ETL process without any help from the warehouse. That is why it is called an independent data mart.

6.1. Two level approaches: for both approaches a two-layer architecture applies.
1. Top - Bottom level approach, and
2. Bottom - Top level approach.

Top - Bottom level approach: the flow starts from the top; as per the example, the Reliance Group branches (R Comm., R Power, R Fresh) go through the ETL process into the Data Warehouse (top level, Layer I) and from there into the separate data marts (bottom level, Layer II).

The Top - Bottom level approach was invented by W. H. Inmon. Here the warehouse is the top level and all the data marts are the bottom level.


6.2. Bottom - Top level approach: here the ETL process loads the data marts (DM) directly, and the data is then put in the warehouse for reference purposes, i.e., the DMs are stored into the Data WareHouse (DWH).

As per the example, the Reliance Group branches (R Comm., R Power, R Fresh) go through the ETL process into the data marts (bottom level, Layer I) and from there into the DWH (top level, Layer II).

The Bottom - Top level approach was invented by Ralph Kimball. Here one data mart (DM) contains information like customers, products, employees, locations and so on.

Both the Top - Bottom and Bottom - Top approaches come under the two-layer architecture.

ETL Tools fall into two groups: GUI (graphical user interface) tools, which are used to extract the data from heterogeneous sources, and ETL programming (coding) tools such as Teradata / Oracle / DB2 and so on.

6.3. Four layers of DWH Architecture:

6.3.1. Layer I: the data is sent directly - in the first case from the source to the Data WareHouse (DWH), and in the second case from the source to a group of Data Marts (DM).

6.3.2. Layer II: the data flows either source -> data warehouse -> data marts, which is called the top - bottom approach, or source -> data marts -> data warehouse, which is called the bottom - top approach. This Layer II architecture is what the Reliance group example above explains. (99.99% of projects use layer 3 and layer 4.)

6.3.3. Layer III: source -> ODS -> DWH -> DMs (Layer I, Layer II, Layer III).

In this layer the data flows from the source to the ODS (Operational Data Store), then to the DWH, then to the Data Marts. The new concept added here is the ODS: it stores operational data for a period like 6 months or one year, and that data is used to solve instant problems. The ETL developer is not involved there; the team that solves the instant/temporary problems is called the interface team. After the period, the ODS data is stored into the DWH and from there it goes to the DMs, where the ETL developers are involved in layer 3. The layer 3 architecture is explained clearly in the example below.

Example #1:


Figure: a plane (the source) is waiting to land at the airport terminal, but it cannot land because of some technical problem at the airport base station. The interface team is involved at the base station (Layer I), where the problem information is captured and stored in the Operational Data Store - a data base that keeps the technical problems for about 1 year (the interface team has at most about 2 hours to solve each problem). The DWH (Layer II) stores the problem information for future reference, and from there it goes to the DMs (Layer III), where the ETL developers are involved.

Example explanation: in this example the source is an aeroplane that is waiting to land at the airport terminal. It is not supposed to land because of some technical problem at the airport base station. To solve this type of operational problem a special team is involved, i.e., the interface team. At the airport base station the technical problems are captured into the Operational Data Store (ODS) in a dB - simply say, the problem information is captured. But the ODS stores the data for one year only, and the year's worth of data is then stored in the data warehouse so that the technical problems are not repeated, or for future reference. From the DWH it goes to the Data Marts, where the ETL developers are involved to solve the technical problems. This is called the layer 3 architecture of the data warehouse.

DAY 7 Continued

Project Architecture: 7.1. Layer IV: Layer 4 is also called the Project Architecture.

Figure: Project Architecture / Layer IV. Sources (Source 1, Source 2) feed interface files (flat files) in Layer I. In L2 the ETL reads the flat files through DS and loads the ODS, rejecting format-mismatch and condition-mismatch records. In L3 the data goes into the DW (with look up), and in L4 it goes to the BI DM (which is for data backup of the DWH & SVC) and the SVC DM, from which reporting is done.

Here, ODS - operational data store, DW - Data Warehouse, DM - Data Mart, SVC - Single View Customer, BI - Business Intelligence, L2 & L3 & L4 - layer 2, 3, 4. A solid line shows reference data and a dashed line shows reject data.

About the project architecture:


In the project architecture there are 4 layers. In the first layer the data goes from the sources to the interface files (flat files). Coming to the second layer, the ETL reads the flat files through DataStage (DS) and sends them to the ODS; while sending the flat files to the ODS, any mismatched data is dropped. There are two types of mismatched data: 1. condition mismatch and 2. format mismatch. In the third layer the ETL transfers the data to the warehouse. In the last layer the data warehouse is checked for the single view of customer, and the data is loaded between the DWH and the DM (business intelligence).

Note: the dropped data appears when the ETL reads the flat files (.txt, .csv, .xml and so on) and loads the ODS.

Two types of mismatched data:

Condition mismatch (CM): this verifies whether the data from the flat files meets the conditions; if a record is mismatched it is dropped automatically. To see the dropped data the reference link is used, and it shows which records are condition mismatched.

Format mismatch (FM): this is like condition mismatch, but it checks whether the format of the incoming records is correct; if it is mismatched the record is dropped. Here also the reference link is used to see the dropped data.

Example for condition mismatch: an employee table contains some data.

SQL> select * from emp;

EID   ENAME    DNO
08    Naveen   10
19    Munna    20
99    Suman    30
15    Sravan   10

The emp tbl contains dno 10, 20, 30, 10 but the TRG only requires dno = 10, so the reference link drops the dno 20 and 30 records from emp.
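A minimal SQL sketch of the same condition-mismatch split, assuming the dno = 10 rule above; the second query plays the role of the reference/reject link:

    -- Records that satisfy the target condition (loaded to TRG)
    SELECT * FROM emp WHERE dno = 10;

    -- Records that mismatch the condition (what the reference link captures)
    SELECT * FROM emp WHERE dno <> 10;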

Example for Format Mismatch:


EID   EName    Place
111   naveen   mncl
222   knl      munna   (x)

Here the table format is tab separated. The cross-marked record (222) does not follow that format (its fields are out of order), so it is format mismatched and the record is simply rejected.

7.2. Single View Customer (SVC): it is also called the single version of truth.

For example: how to make a unique customer? The same records are identified field by field in Phase II; in Phase III they cannot be identified.

Before transforming (5 multiple records of customers):

CName    Adds.
naveen   savings
naveen   loan
munna    insurance
munna    credit
suman    deposit

After transforming (one record per customer):

CName    Adds.
naveen   savings, loan
munna    insurance, credit
suman    deposit

DataStage people are involved in this process of building the SVC / single version of truth.
This type of transforming is also called Reverse Pivoting.

NOTE: the Business Intelligence data mart (BI DM) is for the data backup of the DWH & SVC (single version of truth).
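A minimal SQL sketch of this reverse pivoting, assuming an Oracle-style LISTAGG function and a hypothetical source table cust_src; it folds the multiple product rows into one row per customer:

    -- One output row per customer, with the products/addresses folded into one column
    SELECT cname,
           LISTAGG(adds, ', ') WITHIN GROUP (ORDER BY adds) AS adds
    FROM   cust_src
    GROUP  BY cname;

The GROUP BY gives one output row per CName, which is the single view of the customer.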

DAY 8 Dimensional Model

Modeling: it represents the physical or logical design from our source system to the target system.
- Logical design: client perspective.
- Physical design: data base perspective.
Pictorial / Logical View: e.g. EMP and Dept designed against a SQL dB. Designing this manually is optional; data modelers use DM tools:
- ERWIN
- ER STUDIO
Entity Relation Windows (ERWIN) and Entity Relation Studio (ER-Studio) are the two data modeler tools where the logical and physical design is done, and they support Forward Engineering and Reverse Engineering.

Meta Data: every entity has a structure, and that structure is called Meta Data (simply say, data about the data). In a table there are attributes and domains; there are two types of domain: 1. alphabetical and 2. number.

Forward Engineering (FE): starting from scratch. Reverse Engineering (RE): creating from an existing system - simply say, altering the existing process.

For example: Q: A client requires the experience of an employee.

The source (EMP_table) carries the implicit requirement - the experience of the employee is implied by the Hire Date column. From the developer's point of view this becomes an explicit requirement: to find out everything the client may want to see. The target (employee hire detailed information) therefore carries ENo, EName and the experience broken down into Years, Months, Days, Hours, Minutes, Seconds, Nano_Seconds - the lowest level of detailed information.

8.1. Dimensional Table: a target table that holds everything the client requirement may want to see, i.e. the lowest level of detailed information, is called a Dimensional Table.

Q: How are the tables interconnected?

Here we take some tables and link them to other related tables, as in the product tables below (see the sketch after the tables). This link is created by using a foreign key and a primary key.
- Foreign Key: a constraint used as a reference to another table.
- Primary Key: a constraint that is a combination of unique and not null.
- Surrogate Key: (covered later).

The tables are as follows.

Product table:       Product_ID (Pk), PRD_Desc, PRD_TYPE_ID (Fk)
Product type table:  PRD_TYPE_ID (Pk), PRD_SP_ID (Fk), PRD_Category
PRD_SP table:        PRD_SP_ID (Pk), SName, ADD1

The link is established between the tables using Fk & Pk.
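A minimal SQL sketch of this Fk & Pk link; the DDL below uses illustrative Oracle-style types and is not taken from a real model:

    CREATE TABLE prd_sp (
        prd_sp_id    NUMBER PRIMARY KEY,                       -- Pk
        sname        VARCHAR2(50),
        add1         VARCHAR2(100)
    );

    CREATE TABLE prd_type (
        prd_type_id  NUMBER PRIMARY KEY,                       -- Pk
        prd_sp_id    NUMBER REFERENCES prd_sp (prd_sp_id),     -- Fk
        prd_category VARCHAR2(30)
    );

    CREATE TABLE product (
        product_id   NUMBER PRIMARY KEY,                       -- Pk
        prd_desc     VARCHAR2(100),
        prd_type_id  NUMBER REFERENCES prd_type (prd_type_id)  -- Fk
    );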

8.2. Normalized Data: repetitive information (records) in a table is called Redundancy; the technique of minimizing that repetition is called Normalization.

For example, one flat table:

ENO  EName   Designation      DNo  Higher Quali.  Add1  Add2
111  naveen  ETL Developer    10   M.TECH         JNTU  HYD
222  munna   System analysis  20   M.SC           SVU   HYD
333  Sravan  JAVA Developer   10   M.TECH         JNTU  HYD
444  Raju    Call Center      30   M.SC           -     -

Here the qualification/address columns repeat for every employee of the same department - this is repetitive information or redundancy.

Dividing the table into two - one table (ENO, EName, Designation, DNo as Fk) and another (DNo as Pk, Higher Quali., Add1, Add2) - is the technique known as Normalization, or reducing redundancy.

The target table must always be in De-Normalized format.

De-Normalization means combining the multiple (normalized) tables back into one table, and the combining is done by Horizontal Combining (HC).

But it is not the case that the target must and should be de-normalized in all cases.

DAY 9

E-R Model (An Entity-Relationship Model): in logical design there are two options to design a job: 1. optional and 2. manual. Mandatory means it must be there: 1 primary table and n secondary tables.

The given two tables are EMP and DEPT.

EMP table:
ENO  EName   Designation      DNo
111  naveen  ETL Developer    10
222  munna   System analysis  20
333  Sravan  JAVA Developer   10
444  Raju    Call Center      30
555  Rajesh  JAVA             -

DEPT table:
DNo  Higher Quali.  Add1  Add2
10   M.TECH         JNTU  HYD
20   M.SC           SVU   HYD

Primary (also known as Master Table) and Secondary (also known as Child Table): from the above two tables the primary table is the DEPT table, because it does not depend on any other table, and the EMP table is the secondary table because it depends on the DEPT table.

But when we take it in real time, when joining the two tables using Horizontal Combining, the EMP table is taken as the primary table and the DEPT table as the secondary table.

9.1. Horizontal Combine:
To perform horizontal combining we must satisfy these cases:
- There must be multiple sources.
- There should be dependency.
- 1 primary, n secondary.

Horizontal combining is also called JOIN. HC means combining primary rows with secondary rows based on the primary key column values. There are three types of keys:
- Primary key,
- Foreign key, and
- Surrogate key.

For example, combining the two tables EMP & DEPT:

EMP table (Fk on DNo):
ENO  EName   Designation      DNo
111  naveen  ETL Developer    10
222  munna   System analysis  20
333  Sravan  JAVA Developer   10
444  Raju    Call Center      30
555  naveen  ETL              -

DEPT table (Pk on DNo):
DNo  Higher Quali.  Add1  Add2
10   M.TECH         JNTU  HYD
20   M.SC           SVU   HYD

After combining (joining) the tables using HC, the result looks like this:

ENO  EName   Designation      DNo  Higher Quali.  Add1  Add2
111  naveen  ETL Developer    10   M.TECH         JNTU  HYD
222  munna   System analysis  20   M.SC           SVU   HYD
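A minimal SQL sketch of this horizontal combine, joining the primary rows (EMP) with the secondary rows (DEPT) on the key column; an inner join is shown here as one possible join behaviour:

    SELECT e.eno,
           e.ename,
           e.designation,
           e.dno,
           d.higher_quali,
           d.add1,
           d.add2
    FROM   emp  e
    JOIN   dept d ON d.dno = e.dno;   -- primary key column values must match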


9.2. Different types of Schemas: there are four types of schemas:
- STAR Schema,
- Snow Flake Schema,
- Multi STAR Schema, and
- Galaxy Schema.

1. STAR Schema: in the star schema you must know about two things - the dimensional table and the fact table.

Dimensional table: means the lowest level of detailed information of a table.
Fact Table: means it is a collection of foreign keys from the n dimensional tables.

Definition of STAR Schema: a fact table (a collection of foreign keys) surrounded by multiple dimensional tables, where each dimensional table is a collection of de-normalized data, is called a STAR Schema.

The data movement is done in two different ways: pictorially, source -> (transformation) -> DIM table -> (transformation) -> FACT tbl; in the practical way the data goes directly from the source to the dimensional table and the fact table.


Example for STAR Schema:

Taking some tables as below to derive a star schema. Q: display what Suman bought of the Lux product in Ameerpet in the January 1st week?

The fact table (also called the bridge / intermediate table) sits in the middle, holding foreign keys (Fk) to the surrounding dimensional tables, each of which has its primary key (Pk):
- PRD_Dim_tbl (built from the product table, brand table, category table, unit table),
- Cust_Dim_tbl (built from the customer table and customer_category_table),
- Date_Dim_tbl, and
- Loc_Dim_tbl (built from the customer / location tables).

Here, Pk - primary key and Fk - foreign key. As shown above, the fact table is surrounded by the dimensional tables; the fact table is a collection of foreign keys, whereas a dimensional table is the lowest level of detailed information. The fact table is also called the bridge or intermediate table. But in the current market the STAR Schema and Snow Flake Schema are used rarely.

In the fact table, measurements mean the information taken as per the client or user requirement. As per the above question, it needs information from PRD_dim_tbl, Date_dim_tbl, Cust_dim_tbl, and Loc_dim_tbl; the link to the measurements, i.e., to the fact table, is created by foreign key and primary key.
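A minimal SQL sketch of answering that question against such a star schema; the fact table, surrogate key columns and measure names here are illustrative assumptions, not taken from a real model:

    SELECT f.quantity_sold
    FROM   sales_fact    f
    JOIN   cust_dim_tbl  c ON c.cust_sk = f.cust_sk
    JOIN   prd_dim_tbl   p ON p.prd_sk  = f.prd_sk
    JOIN   loc_dim_tbl   l ON l.loc_sk  = f.loc_sk
    JOIN   date_dim_tbl  d ON d.date_sk = f.date_sk
    WHERE  c.cname         = 'Suman'
    AND    p.prd_desc      = 'Lux'
    AND    l.area          = 'Ameerpet'
    AND    d.month_name    = 'January'
    AND    d.week_of_month = 1;

Every join goes from a fact-table foreign key to a dimension-table primary key, which is exactly the star shape described above.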


2. Snow Flake Schema: the fact table surrounded by dimensional tables, where each dimensional table has look up tables, is called a Snow Flake Schema.

For example: EMP_tbl (Fk) -> Dept_tbl (Pk/Fk) -> Locations (Pk/Fk) -> Area (Pk). If we require information from the location table, it is fetched from that table and displayed as the client requires. When there is huge data in one dimensional table it is sometimes not possible to bring it quickly; to minimize the huge data held at once in a single dimensional table, we divide the dimensional table into some smaller tables, and those tables are known as look up tables.

Definition of Snow Flake Schema: the fact table surrounded by dimensional tables, and each dimensional table having look up tables, is called a Snow Flake Schema.
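A minimal SQL sketch of fetching an attribute through such a chain of look up tables, with illustrative table and column names:

    -- The dimension (dept) is snow-flaked into look up tables (locations, area)
    SELECT e.ename,
           a.area_name
    FROM   emp_tbl   e
    JOIN   dept_tbl  d ON d.dept_id = e.dept_id
    JOIN   locations l ON l.loc_id  = d.loc_id
    JOIN   area      a ON a.area_id = l.area_id;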
Pictorially: Source -> DWH -> Reports (through Cognos/BO). When the warehouse data is de-normalized (DN), the STAR Schema works effectively; when it is normalized (N), as in MIG/H1, the Snow Flake Schema works effectively.

NOTE: the selection of schema is decided at run time; it depends on the report generation.

:: DataStage CONCEPTS ::

DAY 10 DataStage (DS) Concepts:

- History of DS,
- Features of DS,
- Differences between the 7.5.x2 and 8.0.1 versions,
- Architecture of the 7.5.x2 and 8.0.1 versions,
- Enhancements and new features of version 8.0.1.

HISTORY of DataStage

As of the year 2006 there were about 600 ETL tools in the market; some of them are DataStage Parallel Extender, ODI (OWB), SAS (ETL Studio), BODI, Ab Initio and so on. But DataStage is the most famous and widely used in the market, and it is also expensive.

Q: What is DataStage?
ANS: DataStage is a comprehensive ETL tool which provides an end-to-end Enterprise Resource Planning (ERP) solution (here, comprehensive means good in all areas).

History begins:
- In 1997 the first version of DataStage was released by the VMARK company, a US based company, and Mr. LEE SCHEFFLER is the father of DataStage. Only 5 members were involved in releasing the software into the market. In those days DataStage was called Data Integrator.


- There are about 90% changes from 1997 to 2010 comparing the released versions.
- In 1997 Data Integrator was acquired by a company called TORRENT.
- After two years, i.e., in 1999, the INFORMIX company acquired Data Integrator from the TORRENT company.
- In 2000 the ASCENTIAL company acquired both the data base and Data Integrator, and Ascential DataStage Server Edition was released in that year.
  o Through this company DataStage became popular in the market from that year.
  o The released software ran with about 30 tools.
- In 2002, ADSS + ORCHESTRATE: the Ascential company integrated with the ORCHESTRATE company for the parallel capabilities.
  o ORCHESTRATE (PX, UNIX) has parallel extendable capabilities in the UNIX environment. By integrating ADSS + ORCHESTRATE they named it ADSSPX.
  o The ADSSPX version is 6.0; from that version the parallel operations or parallel capabilities start.
  o From there the parallel versions went on developing up to version 7.5.1.
  o But from 6.0 to 7.5.1 the versions support only UNIX flavour environments, because the server could be configured only on the UNIX platform.
- In 2004, version 7.5.x2 was released, which supports server configuration on the Windows platform also.
  o For this, ADSSPX was integrated with MKS_TOOL_KIT.
  o MKS_TOOL_KIT is a virtual UNIX machine that brings the capabilities to Windows to support the server configuration.

NOTE: after installing ADSSPX + MKS_TOOL_KIT on Windows, all the UNIX commands work on the Windows platform.


- In December 2004, version 7.5.x2 had the ASCENTIAL suite components.
  o They are: Profile stage, Quality stage, Audit stage, Meta stage, DataStage Px, DataStage Tx, DataStage MVS, and so on - these are individual tools.
  o There are 12 types of ASCENTIAL suite components.
- In February 2005, IBM acquired all the ASCENTIAL suite components and released IBM DS EE, i.e., the enterprise edition.
- In 2006, IBM made some changes to IBM DS EE: it integrated the profiling stage and audit stage into one, plus the quality stage, Meta stage and DataStage Px.
  o With the combination of these four stages they released IBM WEBSPHERE DS & QS 8.0.
  o This is also called an Integrated Development Environment, i.e., IDE.
- In 2009, IBM released another version, IBM INFOSPHERE DS & QS 8.1.
  o In the current market: 7.5.x2 is used by 40 - 50%, 8.0.1 by 30 - 40%, and 8.1 by 10 - 20%.


NOTE: DataStage is a front end; nothing is stored in it.

DAY 11 DataStage FEATURES

Features of DS: there are 5 important features of DataStage:
- Any to Any,
- Platform Independent,
- Node Configuration,
- Partition Parallelism, and
- Pipeline Parallelism.

Any to Any:
- DataStage is capable of reading from any source and writing to any target.

Platform Independent:
- A job that can run on any processor is called platform independent.
- There are three types of processors: UNI (one CPU with its own HDD and RAM), Symmetric Multi Processing (SMP, several CPUs sharing one HDD and RAM), and Massively Parallel Processing (MPP, many SMP machines - SMP-1, SMP-2, SMP-3 ... SMP-n - clustered together).

Node Configuration:

- A node is software that is created in the operating system.
- A node is a logical CPU, i.e., an instance of a physical CPU.
- Hence, the process of creating virtual CPUs using software is called Node Configuration.
- The node configuration concept works exclusively in DataStage; it is the best feature compared with other ETL tools.
- For example: an ETL job is required to execute 1000 records. For the above question a UNI processor takes 10 minutes to execute the 1000 records, but the same job on an SMP processor takes 2.5 minutes, as explained in the diagram below.

Figure: on a UNI processor (one CPU, HDD, RAM) the 1000 records take 10 minutes; on an SMP processor the 1000 records are shared across four CPUs, so the execution time is reduced to 2.5 minutes.

- As per the above example, node configuration can also create virtual CPUs to reduce the execution time on a UNI processor.
- The figure below shows how the virtual CPUs are created: the 1000 records on one physical CPU are processed through multiple created nodes (Node1, Node2, Node3, Node4), which reduces the 10 minutes to 2.5 minutes.
Partition Parallelism:
- Partitioning is distributing the data across the nodes, based on partition techniques.
- Consider one example of why we use the partition techniques.
- Example: take some records in an EMP table and some in a DEPT table - the EMP table has 9 records and the DEPT table has 3 records.
- After partitioning these records the output must and should have 9 records, because the primary table has 9 records. EMP(10,20,10,30,20,10,10,20,30) and DEPT(10,20,30).

If the data is distributed across the nodes without a key based technique, e.g.:

N1: EMP 10,20,10 with DEPT 10 - only 2 matched
N2: EMP 30,20,10 with DEPT 20 - only 1 matched
N3: EMP 10,20,30 with DEPT 30 - only 1 matched

In total only 4 records are matched, but the output must be 9 records.

In the above example only 4 records reach the final output and 5 records are missing; for this reason the partition techniques are introduced. There are two categories of partition parallelism, and in those there are a total of 8 partition techniques:

Key based: Hash, Modulus, Range, DB/2
Key less:  Same, Random, Entire, Round robin

- The key based category (key based techniques) gives the assurance that the same key column values are collected into the same key partition.
- The key less techniques are used to append the data when joining the given tables.

From the records taken above, partitioning using a key based technique on DNO gives: N1 - EMP dno 10 with DEPT 10, N2 - EMP dno 20 with DEPT 20, N3 - EMP dno 30 with DEPT 30, so all 9 records join.

DAY 12 Features of DataStage (continued)

Partition Parallelism:
- Re-Partition: means re-distributing the already distributed data.

Example table:
ENO  EName   DNo  Loc
111  naveen  30   AP
222  munna   10   TN
333  Sravan  20   KN
444  Raju    10   -

The EMP rows are first partitioned by a key based partition on Dno across N1/N2/N3 and joined with DEPT (10, 20, 30) on Dno. Then, taking a separate column, Loc, the already distributed data is re-distributed across N1 (AP), N2 (TN), N3 (KN) for a second JOIN - this is known as Re-Partition.

- Reverse Partitioning:
  o It is also called collecting. It is done in one case / one situation only: when the data moves from a parallel stage to a sequential stage - the collecting happens in this case only.
  o A link (or pipe) is also a stage in the job design; it is the channel that moves the data from one stage to another stage (SRC -> TRSF -> TRG).
  o Example: parallel files on nodes S1, S2, S3 are collected into a single sequential file, i.e., collecting from nodes N1 ... Nn into N.

There are four categories of collecting techniques: Order, Round robin, Sort merge, and Auto.

Example for collecting techniques - with N1 holding (a, x), N2 holding (b, y) and N3 holding (c, z):
- Order:       a x b y c z
- Round robin: a b c x y z
- Sort merge:  a b c x y z
- Auto:        a z y c x b (any order)

Pipeline Parallelism:

All pipes carry the data in parallel and the processes are done simultaneously.
- In the server environment the execution process is called traditional batch processing.
- For example, in the server environment the job runs Extract (S1, 10 min), landing to disk (HD), then Transform (S2, 10 min), landing to disk (HD), then Load (S3, 10 min). Here the execution takes 30 minutes to complete.
- The same job in the parallel environment: E (S1), T (S2) and L (S3) are connected by pipes and the records (R1, R2, R3, R4, R5) flow through all three stages simultaneously. Here all the pipes carry the data in parallel, the job is processed simultaneously, and the execution takes only 10 minutes to complete.

By using pipeline parallelism we can reduce the processing time.

DAY 13 Differences between 7.5.x2 & 8.0.1

7.5.x2:
- 4 client components: * DS Designer * DS Manager * DS Director * DS Administrator
- Architecture components: * Server Component * Client Component
- II-tier architecture
- OS dependent w.r.t. users
- Capable of Phase-III & Phase-IV only
- No web based administration
- File based repository

8.0.1:
- 5 client components: * DS Designer * DS Director * DS Administrator * Web Console * Information Analyzer
- Architecture components: * Common User Interface * Common Repository * Common Engine * Common Connectivity * Common Shared Services
- N-tier architecture
- OS independent w.r.t. users, but one time dependent only
- Capable of all phases
- Web based administration through the web console (simply say, work from home)
- Data Base based repository



13.1. Client components of 7.5.x2:


-

DS Designer: it is to create jobs, compile, run and multiple job compile. 4 types of jobs can handle by DS Designer. Mainframes job Server job Parallel job Job sequence job

DS Director: it can handle the given list below. Schedule , run jobs Monitor, Unlock, batch jobs Views(job, status, logs) Message Handling. Import and Export of Repository components Node Configuration Create project Delete project Organize project

DS Manager: it can handle the given list below.

DS Administrator: it can handle the given list below.

13.1. Client components of 8.0.1:




- DS Designer: it is to create jobs, compile, run and multiple job compile. 5 types of jobs can be handled by DS Designer:
  * Mainframe jobs
  * Server jobs
  * Parallel jobs
  * Job sequence jobs
  * Data quality jobs
- DS Director: same as shown above for 7.5.x2.
- DS Administrator: same as shown above for 7.5.x2.
- Web Console: the administration component through which the following are performed:
  * Security services
  * Scheduling services
  * Logging services
  * Reporting services
  * Session management
  * Domain management
- Information Analyzer: it is also called the console for IBM INFO SERVER. It performs all the Phase-I activities:
  * Column analysis,
  * Primary key analysis,
  * Foreign key analysis,
  * Base line analysis, and
  * Cross domain analysis.

As an ETL developer you mostly come across DS Designer and DS Director, but some information should be known about the Web Console, Information Analyzer, and DS Administrator.


DAY 14 Description of the 7.5.x2 & 8.0.1 Architectures

14.1. Architecture of 7.5.x2:

* Server Components: divided into 3 categories:
  a. Repository
  b. Engine
  c. Package Installer

- Repository: also called the project or work area.
  o The repository is also an Integrated Development Environment (IDE); the IDE performs design, compile, run and save of jobs.
  o The repository organizes the different components in one area, i.e., it is a collection of components. Some of the components are: jobs, table definitions, shared containers, routines, etc.

  o The repository is for developing applications as well as storing applications.

- Engine: it executes the DataStage jobs and it automatically selects the partition technique.
  o Never leave any stage on auto: if we leave it on auto it selects the auto partition technique, which can affect the performance.

- Package Installer: this component contains two types of package installer - one is plug-ins and the other is packs.
  o Example: a computer needs 1100 drivers to be installed to talk to a printer; an interface provides the 1100 drivers. Here the interface acts as the plug-in between the computer and the printer.
  o Packs sit between DataStage and ERP software: packs are used to configure DataStage for an ERP solution. The best everyday example is that normal Windows XP acquires Service Pack 2 for more capabilities.

* Client Components: divided into 4 categories:
  a. DS Designer
  b. DS Manager
  c. DS Director
  d. DS Administrator
What these categories handle is shown above (DAY 13).

14.2. Architecture of 8.0.1:

1. Common User Interface: also called the unified user interface.
  a. Web Console
  b. Information Analyzer
  c. DS Designer
  d. DS Director
  e. DS Administrator

2. Common Repository: divided into two types.
  a. Global repository: DataStage job files are stored here (it checks security issues).
  b. Local repository: individual files are stored here (it is for performance issues).
  o The common repository is also called the Meta Data SERVER.
  o It holds three types of meta data: Project level MD, Design level MD, Operation level MD.
3. Common Engine:
  o It is responsible for Data Profiling analysis, Data Quality analysis, and Data Transformation analysis.

4. Common Connectivity: it provides the connections to the common repository.

Table representation of the 8.0.1 architecture (top to bottom):
- Common User Interface: WC, IA, DE (Designer), DI (Director), DA (Administrator)
- Common Repository / MD SERVER: Project level MD, Design level MD, Operation level MD
- Common Shared Services
- Common Engine: DP, DQ, DT, DA
- Common Connectivity

DAY 15 Enhancements & New Features of Version 8

In version 8.0.1 there are 8 categories of stages.

Processing stage:
  o New stages:
    1. SCD (Slowly Changing Dimension)
    2. FTP (File Transfer Protocol)
    3. WTX (WebSphere Transformation Extender)
  o Enhanced stages:
    1. Surrogate key stage: a newly introduced concept (see the sketch below).
    2. Lookup stage - previously lookup had: i. normal lookup, ii. sparse lookup; newly added: iii. range lookup, iv. case-less lookup.
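A minimal SQL sketch of what a surrogate key does, assuming an Oracle-style sequence and illustrative table names; this only illustrates the concept, it is not the DataStage stage itself:

    -- A surrogate key is a system-generated key, independent of the business key
    CREATE SEQUENCE cust_sk_seq START WITH 1 INCREMENT BY 1;

    INSERT INTO cust_dim (cust_sk, cust_id, cname)
    SELECT cust_sk_seq.NEXTVAL,   -- surrogate key (Sk)
           s.cust_id,             -- natural/business key from the source
           s.cname
    FROM   cust_src s;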


Data Base stage:
  o New stages: IWAY, Classic Federation, ODBC Connector, NETEZZA.
  o Enhanced stages: all stages' techniques are used with respect to the SQL Builder.

Palette of version 8.0.1:

General - no changes; Data Base - has changes; File - no changes; Processing - has changes; Real time - no changes; Restructure - no changes; Development & Debug - no changes; Data Quality - new.

  o The palette holds the shortcuts of the stages, which we can drag and drop onto the canvas to design the job.
  o Data Quality is an exclusively new concept of 8.0.1.
  o The Data Base and Processing stages have some changes, as shown above.
  o The other stages are the same as in version 7.5.x2, i.e., no changes in this version.


:: Stages Process & Lab Work ::

DAY 16 Starting steps for the DataStage tool

To start DataStage on the system we must follow these steps to do a job - the job development process (this is for designing a job):

- The DB2 repository is started and the DataStage server is started.
- After starting: select DS Designer, enter the uid (e.g. admin) and pwd (e.g. ****), and attach the appropriate project (e.g. Project\navs..).
- Palette (from the tool bar): it lists the stage categories - General, Data Quality, Database, File, Development & Debug, Processing, Real Time, Restructure - and the Designer Canvas (or Editor) is the place where we design the job (e.g. Seq to Seq).
- Select the appropriate stages in the palette, drag them onto the CANVAS, link them (giving the connectivity), and after that setting the properties is important.

The palette contains all the stage shortcuts, i.e., 7 stage categories in 7.5.x2 & 8 in 8.0. These stages are categorized into two groups: 1 > Active stages (whatever stage does transformation is called an active stage), and 2 > Passive stages (whatever stage does extracting or loading is called a passive stage). Among the 8 categories we use sequential stage and parallel stage jobs.

- Save, compile and run the job.
- Run the Director (to see views), i.e., to view the status of your job.

DAY 17 My first job creating process

Process:
- On the computer desktop, the currently running processes are shown at the corner; a round symbol with green colour indicates whether the server for DataStage has started or not. If it does not start automatically, start it manually.
- When the 8th version of DataStage is installed, five client component shortcuts are visible on the desktop: Web Console, Information Analyzer, DS Administrator, DS Designer, DS Director.
- Web Console: when you click it, the login page appears.
  o If the server is not started, a "page cannot be opened" error will appear.
  o If such an error occurs, the server must be restarted before doing or creating jobs.
- DS Administrator: it is for creating/deleting/organizing the projects.
- DS Director: it is for viewing the status of the executed jobs, and to view the log, status and warnings.

- DS Designer: when you click on the Designer icon, it displays the "Attach the project" dialog for creating a new job, as shown below.
  o User id: admin
  o Password: ****
  o If authentication fails at login, it is because of a repository interface error.

The figure below shows how to authenticate, and then the Designer canvas for creating jobs.

Attach the project dialog: Domain - localhost:8080, User Name - admin, Password - phil, Project - Teleco, with OK / Cancel buttons.

After authentication it displays the Designer canvas, and it asks which type of job you want to do:


The job types are: Mainframe, Parallel, Job Sequence and Server jobs. After clicking on parallel jobs, go to the tool bar -> View -> Palette. In the palette the 8 categories of stages are displayed for designing a job: General, Data Quality, Data Base, File, Development & Debug, Processing, Real Time, Restructure.

17.1. File Stage:

Q: How can data be read from files? The file stage can read only flat files, and the formats of flat files are .txt, .csv, .xml.
- In .txt there are different types of formats like fwf, sc, csv, s & t, H & T.
- .csv means comma separated values.
- .xml means extensible markup language.
- In the File stage there are sub-stages like sequential file stage, data set, file set and so on.
- Example of how a job can execute: one sequential file (SF) as source to another SF as target.


o The source file requires target/output properties, and
o The target file requires input/source properties.

How do we read a file in the source stage? On double clicking the source file stage, we must set the properties as below:
- File name   \\ browse the file name
- Location    \\ e.g. in c:\
- Format      \\ .txt, .csv, .xml
- Structure   \\ meta data

General properties of the sequential file:

1. Setting / importing the source file from the local server.
   Select a file name: File = c:\data\se_source_file.txt (typed in, or chosen with the Browse button; the File property can be repeated for multiple purposes).

2. Format selection:
   As per the input file taken, the data must be in the given format - like tab / space / comma - and one of them must be selected.


3. Column structure defining (Load):
   To get the structure of the file, the steps to load a structure are: Import -> Sequential file -> browse the file and import -> select the imported file -> define (load) the structure.

These three are the general properties when we design a simple job.

DAY 18 Sequential File Stage


The sequential file stage is used only for a single structure format. The source side carries the output properties and the target side carries the input properties.

About the Sequential File Stage and how it works:
- Step 1: the sequential file is a file stage that reads flat files with different extensions (.txt, .csv, .xml).
- Step 2: the SF stage reads/writes sequentially by default when it reads/writes a single file, and it also reads/writes in parallel when it reads/writes to or from multiple files.
- Step 3: the sequential file stage supports one input (or) one output, and one reject link.

Link: a link is also a stage; it transfers data from one stage to another stage. Links are divided into categories: stream link, reject link, and reference link (drawn between SF stages).

Link Marker: it shows how the link behaves during the transfer from source to target.


1. Ready BOX: it indicates that a stage is ready with meta data, and that the data transfer is between a sequential stage and a sequential stage.
2. FAN IN: it indicates that the data transfer is from a parallel stage to a sequential stage; it is done when collecting happens.
3. FAN OUT: it indicates that the data transfer is from a sequential stage to a parallel stage; it is also called auto partition.
4. BOX: it indicates that the data transfer is from a parallel stage to a parallel stage; it is also known as partitioning.


5. BOW TIE: it indicates that the data transfer is from a parallel stage to a parallel stage with a change of partitioning; it is also known as re-partitioning.

Link Colour: the link colour indicates the state of a job's execution.
- RED: a link in RED colour means case 1 - a stage is not connected properly, or case 2 - the job aborted.
- BLACK: a link in BLACK colour means the stage is ready.
- BLUE: a link in BLUE colour indicates that the job execution is in process.
- GREEN: a link in GREEN colour means the execution of the job has finished.

NOTE: a stage is an operator; an operator is a pre-built component. A stage imports the import operator for the purpose of converting the data into Native Format, and Native Format is the format DataStage understands. So a stage is an operator.


Compile: a compiler is a translator from source code to target code.

Compiling a .c program: .c (HLL) -> .obj (ALL) -> .exe (BC). Here, *HLL - High Level Language, *ALL - Assembly Level Language, *BC - Binary Code.

Compiling process in DataStage: GUI -> OSH code & C++ -> .obj -> .exe (MC). Here, *MC - Machine Code, *OSH - Orchestrate Shell Script.

Note: Orchestrate Shell Script is generated for every stage except one, the Transformer stage, which is done in C++. During compilation it checks for:
- Link requirements (checks the links)
- Mandatory stage properties
- Syntax rules


DAY 19 Sequential File Stage Properties

Properties:
- Read Method: two options.
  o Specific File: the user or client gives each file name specifically.
  o File Pattern: we can use wild card characters and search for a pattern, i.e., * & ?. For example: C:\eid*.txt, C:\eid??.txt
- Reject Mode: to handle format / data type / condition mismatched records. Three options.
  o Continue: drops the mismatches and continues with the other records.
  o Fail: the job is aborted.
  o Output: it captures the dropped data through the link to another sequential file.
- First Line is Column Names: true/false.
  o If it is false, the first line is read as a record too (and it shows up as a dropped record).
  o If it is true, the first line is treated as the column names and is not dropped as a record.
- Missing File Mode: this option is used if any file name is missing. Two options.
  o OK: drops the file name when it is missing.
  o Error: if a file name is missing, it aborts the job.
- File Name Column: gives, at the target, the source information about which record came from which file (address) on the local server. It directly adds a new column to the existing table and displays the file name in that column.
- Row Number Column: gives, at the target, the source record number, i.e., which source record number each target row came from. It also directly adds a new column to the existing table and displays the number in that column.


- Read First Rows: gets you the top first n records; the option asks you to give the n value to display that number of rows.
- Filter: blocks unwanted data based on UNIX filter commands, like grep, egrep, and so on. Example:
  o grep moon      \\ it is case sensitive; it displays only records containing "moon".
  o grep -i moon   \\ it ignores case; it displays all "moon" records regardless of case.
  o grep -w moon   \\ it displays only exact whole-word matches.
- Read From Multiple Nodes: we can read the data in parallel using the sequential file stage - reading in parallel is possible, but loading in parallel is not possible.

LIMITATIONS of SF:
  o It is sequential processing (it processes the data sequentially) by default.
  o Memory limit of 2 GB (.txt format).
  o The problem with the sequential file is the conversions, like ASCII -> NF and NF -> ASCII at every read and write.
  o It lands (resides) the data outside the boundary of DataStage.


DAY 20 General settings of DataStage and about the Data Set

Default setting for startup with a parallel job: Tools -> Options -> select a default. When creating a new job, it asks which type of job you want; the types of jobs are mainframe / parallel / sequence / server. After setting the above, when we restart the DS Designer it goes directly to the designer canvas.

According to naming standards every stage has to be named. Naming a stage is simple: just right-click on a stage, the rename option is visible, and name the stage as per the naming standards.

General Stage: in this category some of the stages are used for commenting on a stage - what it does or what it can perform - i.e., simply giving comments for a stage. Let us discuss Annotation & Description Annotation:
- Annotation: it is for a stage comment.
- Description Annotation: it is used for the job title (only one can be kept per job).

A parallel job is capable of the 3 steps: extracting from the SRC, transforming, and landing the data into the TRG, which resides in LS/RR/db.

Q: In which format is the data sent between the source file and the target file?
A: if we send a .txt file from the source, it is in ASCII format, because a .txt file supports only ASCII, while DataStage supports only the Native Format. When we convert ASCII into NF, the SRC needs to import an operator; when we convert NF back into ASCII / .txt at the target so it is visible to the user/client, the target also needs an operator. Native Format is also called a Virtual Dataset.

Data Set (DS): it is a file stage, and it is used for staging the data when we design dependent jobs. The Data Set overcomes the limitations of the sequential file stage for better performance. By default the Data Set sends the data in parallel, and in a Data Set the data lands in Native Format.

Q: How does the Data Set overcome the sequential file limitations?
- By default the data is processed in parallel.
- More than 2 GB is allowed.

- No need for conversion, because the Data Set data directly resides in Native Format.
- The data lands in the DataStage repository.
- The Data Set extension is *.ds.

Q: How is the conversion easy with a Data Set?
- In the first job (src_f.txt -> trg_f.ds) we save the structure, e.g. as st_trg, along with the trg_f.ds file name.
- We can then reuse the saved file name and structure of the target in another job (trg_f.ds -> trg_f.txt), copying the structure st_trg & trg_f.ds for reuse there.
- A Data Set can read only Native Format files, just as DataStage reads only the orchestrate format.


DAY 21 Types of Data Set (DS)

There are two types of Data Set:
- Virtual (temporary)
- Persistent (permanent)

Virtual: the data that moves in the link from one stage to another stage, i.e., the link holds the data temporarily.
Persistent: the data sent from the link lands directly in the repository; that data is permanent.

Aliases of the Data Set:
  o ORCHESTRATE FILE
  o OS FILE

Q: How many files are created internally when we create a data set?
A: a Data Set is not a single file; it creates multiple files internally:
  o Descriptor file
  o Data file
  o Control file
  o Header file

- Descriptor File: it contains the schema details and the address of the data.
- Data File: consists of the data in Native Format and resides in the DataStage repository.
- Control File and Header File: they reside in the operating system and both act as the interface between the descriptor file and the data file. "Physical file" means it is stored on the local drive / local server, permanently stored under the install program files, c:\ibm\inser..\server\dataset{pools}.

Q: How can we organize a Data Set to view/copy/delete it in real time, etc.?
A: Case 1: we cannot directly delete the Data Set. Case 2: we cannot directly see or view it. The Data Set is organized using utilities:
  o Using the GUI, i.e., we have a utility in the tool (Data Set Management).
  o Using the command line, with the orchadmin utility.

Navigation to organize a Data Set in the GUI:
  o Tools -> Data Set Management -> file_name.ds (e.g. dataset.ds)
  o Then we will see the general information of the dataset: schema window, data window, copy window, delete window.

At the command line:
  o $ orchadmin rm dataset.ds     \\ the correct way to remove a data set
  o $ rm dataset.ds               \\ the wrong way; we cannot write it like this
  o $ dsrecords dataset.ds        \\ to view the records of a data set


Q: What is the operator associated with the Data Set?
A: the Data Set does not have its own operator, but it uses the copy operator as its operator.

Dataset Version:
- The Data Set has version control; it has a version for each different DataStage version. The default in version 8 is that it saves in version 4.1, i.e., v41.

Q: How do we perform version control at run time?
A: we have to set an environment variable for this.
Navigation for how to set the environment variable: Job properties -> Parameters -> Add environment variable -> Compile -> Dataset version ($APT_WRITE_DS_VERSION), and click on that. After doing this, when we save and run the job it will ask which version you want.


DAY 22 File Set & Sequential File (SF) Input Properties

File Set (FS): it is also for staging the data.
- It is a file stage used in the same way when designing dependent jobs.
- Data Set & File Set are similar, but have minor differences; the differences between DS & FS are shown below.

Data Set:
- Has parallel extendable capabilities
- More than 2 GB limit
- NO reject link with the Data Set
- DS is exclusively for internal use in the DataStage environment
- Copy (file name) operator
- Native format
- Saves .ds files

File Set:
- Has parallel extendable capabilities
- More than 2 GB limit
- Reject link is available within the File Set
- External applications can create an FS; we can use it with any other application
- Import / Export operator
- Binary format
- .fs extension

But the Data Set has more performance than the File Set.


Sequential File Stage: input properties. Setting the input properties at the target file; at the target there are four properties:
1. File update mode
2. Cleanup on failure
3. First line is column names
4. Reject mode

File Update Mode: has three options - append / create (error if exists) / overwrite.
  o Append: when multiple files or a single file are sent to the sequential target, it appends one file after another into a single file.
  o Create (error if exists): just creates the file, and errors if it already exists or the name is given wrong.
  o Overwrite: it overwrites one file with another file.
  o Setting the value at run time (for file update mode): Job properties -> Parameters -> Add environment variable -> Parallel -> Automatically overwrite ($APT_CLOBBER_OUTPUT).

Cleanup on Failure: has two options, true/false.
  o True: when it is true, partially written records are discarded if the job fails. It works only when the file update mode is append.
  o False: it simply leaves whatever was appended or overwritten.

First Line is Column Names: has two options, true/false.
  o True: it makes the first row the field/column names.
  o False: it simply writes every row, including the first row, as a record.


Reject Mode: here the reject mode is the same as in the output properties we discussed before. It has three options: continue / fail / output.
  o Continue: it just drops the data when the format/condition/data type mismatches, and continues processing the remaining records.
  o Fail: it just aborts the job when a format/condition/data type mismatch is found.
  o Output: it captures the dropped record data.

DAY 23 Development & Debug Stage

The Development and Debug category has three groups of stages:
1. Stages that generate data:
   a. Row Generator
   b. Column Generator
2. Stages that are used to pick sample data:
   a. Head
   b. Tail
   c. Sample
3. The stage that helps in debugging:
   a. Peek

Simply say, in Development and Debug we have 6 types of stages, and the 6 stages are divided into the three categories shown above.

23.1. Stages that Generate Data:

Row Generator: it has only one output.

- The Row Generator is for generating sample data; it is used in some cases:
  o For testing purposes.
  o When the client is unable to give the data.
  o To make the job design simple, as a stand-in for jobs.
- The Row Generator can generate junk data automatically by considering the data type, or we can manually set some related, understandable data by giving user defined values.
- It has only one property, and a structure must be selected for creating the junk data.

Row Generator design: ROW Generator -> DS_TRG.

Navigation for the Row Generator: open the RG properties ->
  o Properties -> Number of records = XXX (user defined value)
  o Column -> load the structure (meta data) if it exists, or we can type it there.

For example, n = 30: data is generated for the 30 records, and the junk data is generated considering the data type.

Q: How to generate user defined values instead of junk data?
A: first we must go to the RG properties -> Column -> double click the serial number (or press Ctrl+E) -> Generator.
For example n=30 Data generated for the 30 records and the junk data also generated considering the data type. Q: how to generate User define value instead of junk data? A: first we must go to the RG properties Column o Double click serial number or press ctrl+E Generator


Type = cycle/random (for the integer data type). For the integer data type we have these options:

Under the cycle type there are three parameters for the generated data: increment, initial value, and limit.
- Q: what happens when we select initial value = 30? A: it starts from 30.
- Q: when we select increment = 45? A: it generates a cycle of values starting from 45, adding 45 to every number.
- Q: when we select limit = 20? A: it generates values up to the limit, in a cycle.

Under the random type there are three parameters: limit, seed, and signed.
- Q: when we select limit = 20? A: it generates random values up to limit = 20, and repeats if there are more than 20 rows.
- Q: when we select seed = XX? A: the seed drives the junk data generated for the random values.
- Q: when we select signed? A: it generates signed values for the field (values between -limit and +limit); otherwise it generates values between 0 and +limit.

Column Generator: it has one input and one output. The main purpose of the column generator is to add extra columns to an existing table.

- By using this stage we add extra columns; for the added columns, junk data is generated in the output.
- Mapping should be done in the Column Generator properties: just drag and drop the created columns into the existing table.

    Sequential File ----> Column Generator ----> Data Set

Coming to the Column Generator properties (to open them, just double click on the stage):
  Stage
    o Options -> Column to generate = ?  (and so on, we can add as many as required)
  Output
    o Mapping -> after adding the extra column it is visible here; for mapping we simply drag the existing table onto the right-hand table.
  Column
    o We can change the data type as required.

In the output:
- The junk data is generated automatically for the extra added columns.
- We can also generate some meaningful data for the extra columns manually.
  Navigation for manual values:
    o Column -> Ctrl+E -> Generator

    o Algorithm = two options: cycle / alphabet
      - Cycle: it has only one option, i.e. value.
      - Alphabet: it also has only one option, i.e. string.
    Cycle works the same as shown above for the Row Generator.
    Q: When we select alphabet with string = naveen?
    A: It generates rows varying the given string character by character, alphabetically.

DAY 24 Pick Sample Data & Peek

24.1. Pick sample data: these are debug stages; there are three stages for picking sample data: Head, Tail, Sample.

Head: it reads the top n records of every partition.
  o It has one input and one output.
  o In the Head stage, mapping must be done.

    SF_SRC ----> HEAD ----> DS_TRG

Properties of Head:
  o Rows
    - All Rows (after skip) = false
      It copies all rows to the output, following any requested skip positioning.
    - Number of Rows (per partition) = XX
      It copies this number of rows from input to output, per partition.
  o Partitions
    - All Partitions = true
      True: copies rows from all partitions.
      False: copies only from specific partition numbers, which must be specified.

Tail: it is a debug stage that reads the bottom n rows from every partition.
  o The Tail stage has one input and one output.
  o In this stage mapping must be done; that mapping is done in the Tail output properties.

    SF_SRC ----> TAIL_F ----> DS_TRG

Properties of Tail:
  o The properties of Head and Tail are similar, as shown above.
  o Mainly we must give the value for the number of rows to display.

Sample: it is also a debug stage; it has two modes, period and percentage.
  o Period: when operating in this mode it supports one input and one output.
  o Percentage: when operating in this mode it supports one input and multiple outputs.

    SF_SRC ----> SAMPLE ----> DS_TRG

Period: if the source table has some records and we give a period value of n, it displays or retrieves every nth record from the source table.
Skip: it skips the first n records of the given source table before the sampling starts.

Percentage: it reads from one input and writes to multiple outputs.
  o Coming to the properties -> Options:
    - Percentage = 25, and we must set target = 1
    - Percentage = 50, target = 0
    - Percentage = 15, target = 2
  o The target number we set here is called the link order.
  o Link Order: it specifies to which output the specific data has to be sent.
  o Mapping: it should be done for the multiple outputs.

    SF_SRC ----> SAMPLE ----> Target1 / Target2 / Target3

NOTE: the sum of the percentages of all outputs must be less than or equal to (<=) 100% of the input records.
  o In percentage mode it distributes the data in percentage form. When Sample receives, say, 90% of the data from the source, it treats that 90% as 100% and distributes it as we specify.

24.2. PEEK:

It is a debug stage, and it helps in debugging jobs.

    SF_SRC ----> PEEK

It is used in three ways, they are:
1. It can be used to copy the data from the source to multiple outputs.
2. It can send the data into the job log.
3. It can be used as a stub stage.

Q: How to send the data into the log?
Open the properties of the Peek stage; we must assign:
  o Number of rows = value?
  o Peek record output mode = job log (and so on, as per the options)
  o If we put column name = false, it does not show the column names in the log.

For seeing the log records that we stored:
  o In DS Director -> from the Peek log entries we see the n values of records and fields.

Q: When does Peek act as a copy stage?
A: When the sequential file cannot send the data to multiple outputs; in that case the Peek stage can act as a copy stage.

Q: What is a Stub Stage?
A: A stub stage is a placeholder. In some situations a client requires only the dropped data; in that case the stub stage acts as a placeholder which holds the output data temporarily, while the rejected data is sent to another file.

DAY 25 Database Stages

In this category we generally use Oracle Enterprise, ODBC Enterprise, Teradata with ODBC, Dynamic RDBMS, and so on.
25.1. Oracle Enterprise:

Oracle Enterprise is a database stage; it reads tables from the Oracle database, from source to target.
  o Oracle Enterprise can read from multiple tables, but it loads into one output.

    Oracle Enterprise ----> Data Set

Properties of Oracle Enterprise (OE):
  Read Method has four options:
    - Auto-generated SQL    \\ it generates the query automatically
    - SQL Builder           \\ a new concept in v8 compared with v7
    - Table                 \\ give the table name here
    - User-defined SQL      \\ here we give a user-defined SQL query

  If we select the Table option: Table = <table name>

  Connection
    - Password = *****
    - User = Scott
    - Remote server = oracle

  o Navigation for loading the metadata into the Column tab:
    If the table definition is already present in the plug-in:
      Select the Load option in Column -> go to Table Definitions -> then to Plug-in -> load the EMP table from there.
    If the table is not there in the plug-in:
      Select the Load option in Column -> then go to Import -> Import meta data definition
        o Select the related plug-in: Oracle
        o User id: Scott, Password: tiger
        o After loading, select the specific table and import it.
    After importing into Column, we must change the HIREDATE data type to Timestamp.

Q: A table contains 300 records and I need only 100 records from it?
A: In the read method we use a user-defined SQL query to solve this, by writing a query that reads 100 records. Alternatively, the auto-generate read method can build the query for us, and we can copy that query statement into the user-defined SQL and edit it.

Q: What can we do when we don't know how to write a select command?
A: Select read method = SQL Builder. After selecting the SQL Builder option from the read method:
  o Choose Oracle 10g
  o From there, drag in the table you want
  o Select columns by double clicking in the dragged table
  o There we can also select whatever condition we need. It is totally automated.
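For illustration, a minimal user-defined SQL sketch for the 100-record question above (ROWNUM is Oracle-specific; the column list is an assumption):

    SELECT EMPNO, ENAME, SAL, DEPTNO
    FROM   EMP
    WHERE  ROWNUM <= 100;        -- limits the read to the first 100 rows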

NOTE: in version 7.5.x2 we do not have the option of saving and reusing properties.

Data connection: its main purpose is reusing saved connection properties.
Q: How to reuse the saved properties?
A: Navigation for how to save and reuse the properties:
  Open the OE properties
    o Select Stage -> Data connection -> there, load a saved data connection
    o Naveen_dbc   \\ it is a saved data connection
    o Save it in the table definitions.

DAY 26 ODBC Enterprise

ODBC Enterprise is a database stage.

About ODBC Enterprise: Oracle needs plug-ins to connect to DataStage. When DataStage version 7 was released, Oracle 9i provided drivers to use. Coming to the connection: Oracle Enterprise connects directly to the Oracle database, but ODBC needs OS drivers to hit Oracle, i.e. to connect to the Oracle database.

    Oracle Enterprise ----> ORACLE DB            (directly hitting)
    ODBC Enterprise   ----> OS ----> ORACLE DB   (uses OS drivers to hit the Oracle db)

Difference between Oracle Enterprise (OE) and ODBC Enterprise (ODBCE):

    OE                          ODBCE
    Version dependent           Version independent
    Good performance            Poor performance
    Specific to Oracle          For multiple databases
    Uses plug-ins               Uses OS drivers
    No rejects at source        Rejects at SRC & TRG

Q: How to connect to a database using ODBC?

    ODBCE ----> Data Set

First step: open the properties of ODBCE
  Read method = table
    o Table = EMP
  Connection
    o Data Source = WHR    \\ WHR is the name of the ODBC driver (DSN)
    o Password = ******
    o User = Scott

Creating the WHR ODBC driver at OS level:
  o Administrative Tools -> ODBC -> Add
  o MS ODBC for Oracle
    - Give the name as WHR
    - Provide the user name = Scott and the Oracle server name

The ODBCE driver at OS level has a lengthy process to connect; to overcome this, the ODBC Connector was introduced. Using the ODBC Connector is a quick process compared with ODBCE. The best feature of the ODBC Connector is schema reconciliation, which automatically handles data type mismatches between the source data types and the DataStage data types.

Differences between ODBCE and the ODBC Connector:

    ODBCE                                            ODBC Connector
    It cannot list the Data Source Names (DSN).      It provides the list of ODBC DSNs.
    No way to test the connection.                   We can test the connection with the Test button.
    Reads sequentially and loads sequentially.       It reads and loads in parallel (good performance).

Properties of ODBC Connector:
  o Select the Data Source Name: DSN = WHR
  o User name = Scott
  o Password = *****
  o SQL query

26.1. MS Excel with ODBCE:

The first step is to create an MS Excel file, which is called a workbook; it has n number of sheets in it. For example, a CUST workbook is created.

Q: How to read an Excel workbook with ODBCE?
A: Open the properties of ODBCE
  Read method = table
    o Table = "empl$"   \\ when reading from Excel, the sheet name must be in double quotes and end with the $ symbol
  Connection
    o DSN = EXE
    o Password = *****
    o User = xxxxx
  Column
    o Load -> Import ODBC table definitions
      - DSN             \\ here select the workbook
      - User id & password
      - Filter          \\ enable by clicking on "include system tables"
      - Select which table you need & OK

In the Operating System:
  o Add the MS Excel driver in ODBC, with Name = EXE   \\ it is the DSN

Q: How do you read Excel format with a Sequential File?
A: By changing the CUST Excel format into CUST.csv.

26.2. Teradata with ODBCE:

Teradata is a database from Teradata Corporation, comparable to Oracle, which is used as a database.
Q: How to read Teradata with ODBC?
A: We must start the Teradata connection (by clicking the shortcut).
  o In the OS we must also start the service:
    Start -> Control Panel -> Administrative Tools -> Services -> Teradata db initiator   \\ must be started here

  o Add a DSN in the ODBC drivers: select Teradata in the Add list and provide the details shown below:
    - User id = tduser
    - Password = tduser
    - Server: 127.0.0.1

After these things we must open the properties of ODBCE:
  o Read method = table
    - Table = financial.customer
  o Connection
    - DSN = tduser
    - Uid = tduser
    - Pwd = tduser
  o Column
    - Load -> Import -> Table definitions\plug-in\teradata
      Server: 127.0.0.1, Uid = tduser, Pwd = tduser

After all this navigation, at last we view the data which we have loaded from the source.

DAY 27 Dynamic RDBMS and Processing Stage

27.1. Dynamic RDBMS:

It is a database stage; it is also called DRS. It supports multiple inputs and multiple outputs.

    DRS ----> Ln_EMP_Data  ----> Data Set
        ----> Ln_DEPT_Data ----> Data Set

It has almost all the common properties of Oracle Enterprise. Coming to the DRS properties:
  o Select the db type, i.e. Oracle
  o Scott / Tiger    \\ for authentication
  o At the output:
    - Ln_EMP_Data    \\ set the EMP table here
    - Ln_DEPT_Data   \\ set the DEPT table here
  o Column -> Load the metadata for the tables EMP & DEPT.

In Oracle Enterprise we can read multiple tables, but we cannot load into multiple outputs. We can solve this problem with DRS: it can read multiple tables and load into multiple outputs.

Some of the other database stages:
  - Netezza can be used in the target only (set in the input properties).
  - iWay can be used in the source only (set in the output properties).

27.2. Processing Stage:

There are 28 processing stages, but we generally use 10 stages, and those 10 stages are very important. They are:
1. Transformer
2. Lookup
3. Join
4. Copy
5. Funnel
6. Remove Duplicates
7. Slowly Changing Dimension
8. Modify
9. Sort
10. Surrogate Key

27.3. Transformer Stage:

A simple problem that we solve using the Transformer:

Q: Calculate the net salary (salary plus commission) of an employee from the EMP table.

    Oracle Enterprise ----> Transformer ----> Data Set

Here, in the Oracle Enterprise stage we set the connection and load the metadata into the Column tab; in the Transformer, the source fields and structure are available and the mapping should be done. The Transformer Stage is an all-in-one stage.

Properties of Transformer Stage:

  o For the above question we must create a column to hold the result. At the bottom of the output properties, click in an empty position and name that column NETSAL. By double clicking on NETSAL we can write a derivation there, for example:
        IN.SAL + IN.COMM
    (we can write it by right clicking there; the input columns, functions, and so on are visible).
  o After that, when we execute, the records with a null commission are dropped and only the remaining records are sent to the target.
  o For this we can use a function in the derivation:
        IN.SAL + NullToZero(IN.COMM)
  o With this derivation the records with null values also reach the target.

Q: TakeHome is NETSAL adjusted by 200; how do we include this logic in a derivation?
A: Add a THome column in the output properties, and in the THome derivation include this logic:
    Logic: if NETSAL > 2000 then TakeHome = NETSAL - 200 else TakeHome = NETSAL + 200

        If (IN.SAL + NullToZero(IN.COMM)) > 2000
        Then (IN.SAL + NullToZero(IN.COMM)) - 200
        Else (IN.SAL + NullToZero(IN.COMM)) + 200

  o This logic repeats the same expression and takes more time on huge record volumes, so the best way to overcome this problem is a Stage Variable.

Stage Variable: it is a temporary variable which holds a value until the process completes and which does not send its result to the output.
  - The stage variable button is shown in the tool bar of the transformer properties; after clicking it, the stage variable grid becomes visible.
  - In the stage variables we must add a column, for example NS (initial value 0, integer, length 4, scale 0).
  - Give the NS column the derivation IN.SAL + NullToZero(IN.COMM).
  - Then use it in the derivations of the created output columns:
      o NETSAL = NS
      o THome  = if (NS > 2000) then (NS - 200) else (NS + 200)
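For comparison, the same net-salary and take-home logic as a plain Oracle SQL sketch against the standard SCOTT.EMP table (NVL plays the role of NullToZero here):

    SELECT EMPNO,
           ENAME,
           SAL + NVL(COMM, 0) AS NETSAL,
           CASE WHEN SAL + NVL(COMM, 0) > 2000
                THEN SAL + NVL(COMM, 0) - 200
                ELSE SAL + NVL(COMM, 0) + 200
           END AS TAKEHOME
    FROM   EMP;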

DAY 28 Transformer Functions-I

Examples of Transformer Functions:
1. Left Function
2. Right Function
3. Substring Function
4. Concatenate Function
5. Field Function
6. Constraints (Filter)

For example, from the word MINDQUEST we need only QUE:
  - Right over Left:   Right(Left("MINDQUEST", 7), 3)
  - Left over Right:   Left(Right("MINDQUEST", 5), 3)
  - Substring:         "MINDQUEST"[5, 3]

Filter: DataStage filters in 3 different ways:
1. Source level
2. Stages (Filter, Switch, External Filter)
3. Constraints (Transformer, Lookup)

Constraints: in the Transformer, constraints are used as a filter, i.e. a constraint is also called a filter.
Q: How is a constraint used in the Transformer?
A: In the transformer properties we see a constraint row for each output link; there we can write the derivation by double clicking.

Differences between the Basic Transformer and the Parallel Transformer:

    Basic Transformer                                Parallel Transformer
    It affects performance.                          Does not affect performance, but it affects compile time.
    Basic Tx can only execute up to SMP.             Can execute on any platform.
    Basic Tx can call routines written in BASIC      It supports a wide range of languages or multiple
    and shell.                                       routine types.

NOTE: the Transformer is very sensitive with respect to data types; the source and target cannot have different data types.

Q: How can the file below be read and operated on - filtering, separating fields using the Left, Right, and Substring functions, and displaying the date as DD-MM-YYYY?
A: File.txt
    HINVC23409CID45432120080203DOL
    TPID5650 5 8261.99
    TPID5655 4 2861.69
    TPID5657 7 6218.96
    HINVC12304CID46762120080304EUO
    TPID5640 3 5234.00
    TPID5645 2 7855.67
    TPID5657 9 7452.28
    HINVC43205CID67632120080405EUO
    TPID5630 8 1657.57
    TPID5635 6 9564.13
    TPID5637 1 2343.64

Design:

    SF --IN1--> Tx1 --IN2--> Tx2 --IN3--> Tx3 --OUT--> DS

Total of five steps are needed to solve the given question:

Step 1: Load file.txt into a Sequential File; in the sequential file properties, load the whole data into one record. That means we create one column called REC, and there is no need to load metadata for this.

Step 2: IN1 -> Tx1 properties: in this step we filter the records starting with "H" from the given file, and we create two columns, TYPE and DATA.

    IN1 (REC)  -->  IN2
    Constraint: Left(IN1.REC, 1) = "H"
    Derivation             Column
    Left(IN1.REC, 1)       TYPE
    IN1.REC                DATA

Step 3: IN2 -> Tx2 properties: here we create four columns (INVCNO, CID, BILL_DATE, CURR) and separate the DATA string into them, using derivations such as Left(Right(IN2.DATA, 21), 9) and IN2.DATA[20, 8].

Step 4: IN3 -> Tx3 properties: here the BILL_DATE column is changed into DD-MM-YYYY format using stage variables.

    Stage Variables
    Derivation                              Column
    Right(IN3.BILL_DATE, 2)                 D
    Right(Left(IN3.BILL_DATE, 6), 2)        M
    Left(IN3.BILL_DATE, 4)                  Y

    OUT
    Derivation                 Column
    IN3.INVCNO                 INVCNO
    IN3.CID                    CID
    D : "-" : M : "-" : Y      BILL_DATE

Step 5: Here, set the target file name in the output for displaying the BILL_DATE.

DAY 29 Transformer Functions-II

Examples of Transformer Functions II:
1. Field Function: it separates fields using the delimiter given to it.
2. Trim: it removes all occurrences of a given character (e.g. special characters).
3. TrimB: it removes all trailing spaces.
4. TrimF: it removes all leading spaces.
5. Trim T & L: it removes trailing and leading spaces.
6. Strip White Spaces: it removes all spaces.
7. Compact White Spaces: it removes leading, trailing, and redundant middle spaces.

Q: A file.txt contains special characters, comma delimiters, and spaces (before, after, and in between). How do we clean it with the above functions so that, at the end, each row becomes one record?
File.txt
    EID,ENAME,STATE
    111, NaVeen, AP
    222@, MUnNA, TN
    @333, Sra van, KN@
    444, @ San DeeP, KN
    555, anvesh,MH

Design:

    SF --IN1--> Tx --IN2--> Tx --IN3--> Tx --OUT--> DS

Total of five steps to solve the file.txt using the above functions:

Step 1: Here, extract file.txt and put all the data of a row into one record, in a new column called REC; there is no need to load metadata for this. Point to remember: keep "first line is column names" = true.

Step 2: IN1 -> Tx properties: link IN1 has the REC column; that REC is divided into fields at the comma delimiter, i.e. using the Field function.

    IN1 (REC)  -->  IN2
    Derivation                 Column
    Field(IN1.REC, ",", 1)     EID
    Field(IN1.REC, ",", 2)     ENAME
    Field(IN1.REC, ",", 3)     STATE

Step 3: IN2 -> Tx properties: here we remove the special characters and spaces, and change lower case into upper case, using the Trim, Strip White Spaces (SWS), and Upcase functions.

    IN2 (EID, ENAME, STATE)  -->  IN3
    Derivation                              Column
    Trim(IN2.EID, "@")                      EID
    Upcase(Trim(SWS(IN2.ENAME), "@"))       ENAME
    Upcase(Trim(SWS(IN2.STATE), "@"))       STATE

Step 4: IN3 -> Tx properties: here all the fields that were divided are concatenated, i.e. added back into one REC column.

    IN3 (EID, ENAME, STATE)  -->  OUT
    Derivation                              Column
    IN3.EID : IN3.ENAME : IN3.STATE         REC

Step 5: For the output, assign a target file. At the end the answer is displayed as one record per row, with all special characters and extra spaces removed after applying the transformer functions to the above file.txt.

Final output: Trg_file.ds
    REC
    111NAVEEN AP
    222 MUNNATN
    333SRAVAN KN
    444SAN DEEPKN
    555 ANVESHMH
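A rough SQL sketch of the same cleanup, assuming the raw rows were first loaded into a staging table STG(EID, ENAME, STATE); REPLACE/TRIM/UPPER only approximate the DataStage Trim/Strip White Spaces/Upcase calls:

    SELECT REPLACE(TRIM(EID), '@', '')
           || UPPER(REPLACE(REPLACE(ENAME, ' ', ''), '@', ''))
           || UPPER(REPLACE(REPLACE(STATE, ' ', ''), '@', '')) AS REC
    FROM   STG;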

29.1. Re-Structure Stages:
1. Column Export
2. Column Import

Column Export: it is used to combine multiple columns into a single column; it is like concatenation in the transformer.
  Properties:
    o Input
      - Column method = explicit
      - Column to Export = EID
      - Column to Export = ENAME
      - Column to Export = STATE
    o Output
      - Export column type = varchar
      - Export output column = REC

Column Import: it is used to explode a single column into multiple columns; it is like the Field separator in the transformer.
  Properties:
    o Input
      - Column method = Column to Import
      - Column to Import = REC
    o Output
      - Import column type = varchar
      - Import output column = EID
      - Import output column = ENAME
      - Import output column = STATE

DAY 30 Job Parameters (Dynamic Binding)

Dynamic Binding: passing values during run time, after compiling the job, is known as dynamic binding.

Consider one scenario: when we take an Oracle Enterprise stage, we must provide the table name and load its metadata. The table name must be statically bound. But the authentication to Oracle need not be statically bound, for security reasons. For this we can use job parameters, which provide the values at run time to authenticate.

Job parameters: a job parameter is a technique for passing values at run time; this is also called dynamic binding. Job parameters are divided into two types, they are:
  o Local variables
  o Global variables

Local variables (params): they are created in the DS Designer only, and can be used within that job only.

Global variables: also called environment variables; they are divided into two types:
  o Existing: come with DataStage; these are of two kinds, general and parallel. Under parallel, the compiler, operator specific, and reporting variables are available.
  o User defined: created in the DataStage Administrator only.

NOTE: local parameters created in one job cannot be reused in another job; this is true up to version 7. Coming to version 8, we can reuse them with a technique called a parameter set. In version 7 we can also reuse parameters as user-defined values via the DataStage Administrator.

Q: How to give run-time values using parameters for the following list?
a. Give run-time values for user ID, password, and remote server?
b. Keep the department number (DNO) as a constraint, and at run time select any number from a list to display it?
c. Add BONUS to SAL + COMM at run time?
d. Provide the target file name at run time?
e. Re-use the global variables and the parameter set?

Design:

    ORACLE ----> Tx ----> Data Set

Step 1: Create job parameters for the given question as local variables.
  Job properties -> Parameters:

    Name       DNAME       Type        Default value
    UID        USER        String      SCOTT           (a)
    PWD        Password    Encrypted   ******          (a)
    RS         SERVER      String      ORACLE          (a)
    DNO        DEPT        List        10              (b)
    BONUS      BONUS       Integer     1000            (c)
    IP         DRIVE       String      C:\             (d)
    FOLDER     FOLDER      String      Repository\     (d)
    TRGFILE    TARGET      String      dataset.ds      (d)

  Here, a, b, c, d mark which parameters answer which part of the given question.

Step 2: Create global job parameters and a parameter set.

  DS Administrator
    o Select a project -> Properties -> General -> Environment variables
      -> User defined (there we can write the parameters)

    Name    DNAME      Type        Default value
    UID     USER       String      SCOTT
    PWD     Password   Encrypted   ******
    RS      SERVER     String      ORACLE

  Here, global parameters are preceded by the $ symbol. For re-use we must:
    o Add environment variables -> User defined
        UID -> $UID
        PWD -> $PWD
        RS  -> $RS

Step 3: Create a parameter set for multiple values, providing UID, PWD, and the other values for DEV, PRD, and TEST.
  In the local variables (job parameters):
    o Select multiple values by clicking on them and create a parameter set
    o Provide a name for the set: SUN_ORA
    o Save it in the table definitions
  In the table definitions:
    o Edit the SUN_ORA values to add:

        Name    UID       PWD       SERVER
        DEV     SYSTEM    ******    SUN
        PRD     PRD       ******    ORACLE
        TEST    TEST      ******    MOON

  For re-using this in another job:
    o Add parameter set (in job parameters) -> Table definitions -> Navs -> SUN_ORA (select it here to use it)
  NOTE: a parameter set can be used only in jobs within the same project.

Step 4: In the Oracle Enterprise properties, select the table name and then assign the created job parameters as shown below.
  Properties:
    Read method = table
      o Table = EMP
    Connection
      o Password = #PWD#
      o User = #UID#
      o Remote Server = #RS#
    Column: Load the metadata for the EMP table

  Parameters -> Insert job parameter:
    - Local variables:                UID, PWD
    - Global environment variables:   $UID, $PWD, $RS
    - Parameter set:                  SUN_ORA.UID, SUN_ORA.PWD, SUN_ORA.RS

Step 5: In the Tx properties, DEPTNO is used as a constraint and the BONUS parameter is added for the bonus column.

    Input IN: EID, ENAME, STATE, SAL, COMM, DEPTNO

    Stage Variable
    Derivation                              Column
    IN.SAL + NullToZero(IN.COMM)            NS

    OUT    Constraint: IN.DEPTNO = DNO
    Derivation         Column
    IN.EID             EID
    IN.ENAME           ENAME
    NS                 NETSAL
    NS + BONUS         BONUS

  Here, DNO and BONUS are the job parameters we created above; to use them simply right click -> Job Parameters -> DNO/BONUS (choose what you want).

Step 6: The target file is set at run time by following the steps below.
  Data Set properties
    o Target file = #IP##FOLDER##TRGFILE#
  Here, when we run the job it asks for the drive and the folder, and at last it asks for the target file name you want.


DAY 31 Sort Stage (Processing Stage)

Q: What is sorting? Here sorting means more than the simple sort we usually know.
Q: Why sort the data? To provide sorted data to stages like Join/Aggregator/Merge/Remove Duplicates for good performance.

Two types of sorting:
1. Traditional sorting: a simple sort, arranging the data in ascending or descending order.
2. Complex sorting: it is only for the sort stage - creating a group id, blocking unwanted sorting, and group-wise sorting.

In DataStage we can perform sorting at three levels:
  - Source level: possible only in a database.
  - Link level: it can be used for traditional sort.
  - Stage level: it can be used for traditional sorting as well as complex sorting.

Q: What is the best level to sort when we consider performance?
A: Link level sort is the best we can perform.

Source level sort:
  o It can be done only in a database source, like Oracle Enterprise and so on.
  o How is it done in Oracle Enterprise (OE)?
    Go to the OE properties -> select user-defined SQL
    Query: select * from EMP order by DEPTNO;

Link level sort:
  o Here the sorting is done on the link into a stage, as shown pictorially below.
  o It can be used for traditional sorting only.
  o Link sort is the best sort in terms of performance.

    OE ----> JOIN ----> DS

Q: How to perform a link sort?
A: As per the above design, open the JOIN properties and go to Partitioning:
  o Select a partition technique (the default is Auto)
  o Mark "Perform sort"
    - When we select Unique, it removes duplicates.
    - When we select Stable, it preserves the previous order of the data.

Q: Get all unique records to target1 and the remaining to another target2?
A: For this we must create a group id; it indicates the group identification. It is done in the Sort stage: in the properties of the Sort stage, under Options, keep Create Key Change Column (CKCC) = true (the default is false). Here we must select the key column on which you want the group id.

Sort Stage:

It is a processing stage that can sort the data with a traditional sort or a complex sort.
  - Complex sort means creating a group id, blocking unwanted sorting, and group-wise sorting for stages like Join, Merge, Aggregator, and Remove Duplicates.
  - Traditional sort means sorting in ascending or descending order.

Sort Properties:
  Input properties
    o Sorting key = EID (select the column from the source table)
    o Key mode = sort (sort / don't sort (previously sorted) / don't sort (previously grouped))
    o Options
      - Create cluster key change column = false (true/false)
      - Create key change column = (true/false)
        True = enables the group id.  False = disables the group id.
  Output properties
    o Mapping should be done here.


DAY 32 A Transformer & Sort Stage Job

Q: Sort the given file, extract all the account types of a unique record into one column, and put the count of those entries into a new column.

File.txt
    EID, ENAME, ACCTYPE
    111, munna, savings
    333, naveen, loans
    222, kumar, credit
    111, munna, current
    222, kumar, loans
    111, munna, insurance
    333, naveen, current
    111, munna, loans
    222, kumar, savings

Design:

    SF ----> Sort1 ----> Tx ----> Sort2 ----> DS

Sequential File (SF): here it reads file.txt for the process.

Sort1: here the sorting key = EID, and CKCC is enabled for the group id (keyChange column).

Transformer (Tx): here is the logic that implements the operation for the target.
  o Properties of Tx:

    IN2: EID, ENAME, ACCTYPE, keyChange

    Stage Variables
    Derivation                                                                   Column
    if (IN2.keyChange = 1) then IN2.ACCTYPE else func1 : "," : IN2.ACCTYPE       func1
    if (IN2.keyChange = 1) then 1 else c + 1                                     c

    OUT
    Derivation         Column
    IN2.EID            EID
    IN2.ENAME          ENAME
    func1              ACCTYPE
    c                  COUNT

  For this logic the output will display like below:

    EID, ENAME, ACCTYPE                                    COUNT
    111, munna, savings                                    1
    111, munna, savings, current                           2
    111, munna, savings, current, insurance                3
    111, munna, savings, current, insurance, loans         4
    222, kumar, credit                                     1
    222, kumar, credit, loans                              2
    222, kumar, credit, loans, savings                     3
    333, naveen, current                                   1
    333, naveen, current, loans                            2

Sort2:
  o Here, in the properties, we must set as below:
    Stage
      - Key = ACCTYPE
      - Sort key mode = sort
      - Sort order = descending order
    Input
      - Partition type: hash
      - Perform sort: Stable (uncheck), Unique (check this)
      - Selected key = COUNT; usage = sorting, partitioning; options = ascending, case sensitive
    Output
      - Mapping should be done here.

Data Set (DS):
  o Input: partition type: hash
    - Perform sort: Stable (check this), Unique (check this)
    - Selected key = EID; usage = sorting, partitioning; ascending

Final output:

    EID, ENAME, ACCTYPE, COUNT
    111, munna, sav, curr, insu, loans, 4
    222, kumar, credit, loans, sav, 3
    333, naveen, current, loans, 2
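A rough SQL equivalent of this whole job (a sketch assuming the rows sit in a staging table ACC(EID, ENAME, ACCTYPE); LISTAGG needs Oracle 11g or later and reproduces the stage-variable accumulation):

    SELECT EID,
           ENAME,
           LISTAGG(ACCTYPE, ', ') WITHIN GROUP (ORDER BY ACCTYPE) AS ACCTYPES,
           COUNT(*) AS CNT
    FROM   ACC
    GROUP  BY EID, ENAME;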

DAY 33 Filter Stage

Filter means blocking the unwanted data. In DataStage, filtering can be performed at three levels, they are:
1. Source level
2. Stage level
3. Constraints

Source level filter: it can be done in a database as well as in a file, at the source level.
  o Database: by writing filter queries like  select * from EMP where DEPTNO = 10;
  o Source file: here we have an option called Filter, where we can write filter commands like  grep moon / grep -i moon / grep -w moon.

Stage filter:
  o Stage filters are used in three stages: 1. Filter, 2. Switch, and 3. External Filter.
  o Difference between IF and SWITCH:

    IF                                                  SWITCH
    Poor performance.                                   Better performance than IF.
    IF can use n number of columns in a condition.      SWITCH can perform a condition on only one column.
    It can have n number of cases.                      It can only have 128 cases.

    Here the Filter stage behaves like an IF, and the Switch stage like a switch.

  o Differences between the three filter stages:

    FILTER                            SWITCH                           EXTERNAL FILTER
    Condition on multiple columns.    Condition on a single column.    It is used with GREP commands.
    It has:                           It has:                          It has:
      1 input                           1 input                          1 input
      n outputs                         128 outputs                      1 output
      1 reject                          1 default                        no rejects

Filter stage: it has one input, n outputs, and one reject link.

Q: How does the filter stage send the data from source to target?

Design:

    OE ----> Filter ----> T1 (DS)
                    ----> T2 (DS)
                    ----> Reject (DS)

Step 1: Connect to Oracle to extract the EMP table from it.

Step 2: Filter properties
  Predicates
    o Where clause = DEPTNO = 10,                  Output link = 1
    o Where clause = SAL > 1000 and SAL < 3000,    Output link = 2
    o Output rejects = true    // it is for the reject data output
  Link ordering
    o Set the order of the output links.
  Output:
    o Mapping should be done for the target links we have; mapping for T1 and T2 is done separately for both.

Step 3: Assign the target file names in the targets.

The Filter stage has no dedicated reject link; because it has n number of outputs, we must convert one output link into a reject link.
DAY 34 Jobs on Filter and Properties of the Switch Stage

Assignment Job 1:
  a. Only DEPTNO 10 to target1?
  b. Records satisfying SAL>1000 and SAL<3000 to target2?
  c. Only DEPTNO 20 (where clause SAL<1000 or SAL>3000) to target3?
  d. Reject data to target4?

Design for JOB 1:

    EMP_TBL ----> Filter ----> T1 / T2
                         ----> Filter ----> T3 / T4 (reject)

Step 1: For target1: in the first Filter the where clause is DEPTNO=10 and link order = 0.
Step 2: For target2: where clause = SAL>1000 and SAL<3000 and link order = 1.
Step 3: For target3: where clause = DEPTNO=20 and link order = 0.
Step 4: For target4: convert a link into a reject link, with output reject link = true.

Job 2:
  a. All records from source to target1?
  b. Only DEPTNO=30 to target2?
  c. Where clause SAL<1000 or SAL>3000 to target3?
  d. Reject data to target4?

Design for JOB 2:

    EMP_TBL ----> Copy ----> T1
                       ----> Filter ----> T2 / T3 / T4 (reject)

Step 1: For target1, mapping should be done on the Copy output link.
Step 2: For target2, where clause = DEPTNO=30 and link order = 0.
Step 3: For target3, where clause = SAL<1000 or SAL>3000 and link order = 1.
Step 4: For target4, convert a link into a reject link, with output reject link = true.

Job 3:
  a. All unique records of DEPTNO to target1?
  b. All duplicate records of DEPTNO to target2?
  c. All records to target3?
  d. Only DEPTNO 10 records to target4?
  e. Records with SAL>1000 & SAL<3000, but not DEPTNO=10, to target5?

Design for JOB 3 (sketch): EMP_TBL is sorted with CKCC (key change on DEPTNO), then one Filter feeds T1/T2/T3 and a second Filter feeds T4/T5.

Step 1: For target1: where clause = keyChange=1 and link order = 0.
Step 2: For target2: where clause = keyChange=0 and link order = 1.
Step 3: For target3: mapping should be done on its output link.
Step 4: For target4: where clause = DEPTNO=10.
Step 5: For target5: in the filter properties put "output rows only once" = true, for the where clause SAL>1000 & SAL<3000.

SWITCH Stage:

Condition on a single column; it has only 1 input, 128 outputs, and 1 default link.

Properties of the Switch stage:
  Input
    o Selector column = DEPTNO
  Cases
    o Case = 10, link order value = 0
    o Case = 20, link order value = 1
  Options
    o If not found = (drop / fail / output)
      - Drop = drops the data and continues the process.
      - Fail = if any record drops, the job aborts.
      - Output = to view the reject data through the default link.
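As an analogy only, the case-to-link routing of the Switch stage can be pictured as a SQL CASE over the selector column (LINK_NO is an invented label for illustration, not a stage property):

    SELECT E.*,
           CASE DEPTNO
                WHEN 10 THEN 0      -- case = 10 -> output link 0
                WHEN 20 THEN 1      -- case = 20 -> output link 1
                ELSE NULL           -- "if not found": drop / fail / send to the default link
           END AS LINK_NO
    FROM   EMP E;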

DAY 35 External Filter and Combining

External Filter: it is a processing stage which can perform filtering using UNIX commands. It has 1 input, 1 output, and 1 reject link. To process a text file, it must first be read as a single record in the input. Example filter command: grep newyork.

    Sequential File ----> External Filter ----> Data Set

External Filter properties:
  o Filter command = grep newyork
  o grep -v newyork   \\ filters everything other than newyork

Combining: in DataStage, combining can be done in three ways. They are:
  o Horizontal combining
  o Vertical combining
  o Funneling combining

Horizontal combining: combining primary rows with secondary rows based on a primary key.
  o This is performed by the JOIN, LOOKUP, and MERGE stages.
  o These three stages differ from each other with respect to:
    - input requirements,
    - treatment of unmatched records, and
    - memory usage.

DAY 36 Horizontal Combining (HC) and Description of the HC Stages

Horizontal Combining (HC): combining the primary rows with secondary rows based on a primary key. The selection of the primary table is situation based.

Example: an employee table (ENO, ENAME, DNO) and a department table (DNO, DNAME, LOC) are combined through HC into one output (ENO, ENAME, DNO, DNAME, LOC), joined on the key column DNO.

Here we can combine with: inner join, left outer join, right outer join, and full outer join.

If T1 = {10, 20, 30} and T2 = {10, 20, 40}:

Inner Join: matched primary and secondary records.                                           (T1 intersect T2)
Left Outer Join: matched records plus unmatched primary records.                             (all of T1)
Right Outer Join: matched records plus unmatched secondary records.                          (all of T2)
Full Outer Join: matched records plus unmatched primary and unmatched secondary records.     (T1 union T2)
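As a point of reference, the four join types written in plain SQL (a sketch assuming tables EMP(ENO, ENAME, DNO) and DEPT(DNO, DNAME, LOC) as in the example above):

    SELECT E.ENO, E.ENAME, D.DNO, D.DNAME, D.LOC FROM EMP E INNER JOIN DEPT D ON E.DNO = D.DNO;        -- inner join
    SELECT E.ENO, E.ENAME, D.DNO, D.DNAME, D.LOC FROM EMP E LEFT OUTER JOIN DEPT D ON E.DNO = D.DNO;   -- left outer join
    SELECT E.ENO, E.ENAME, D.DNO, D.DNAME, D.LOC FROM EMP E RIGHT OUTER JOIN DEPT D ON E.DNO = D.DNO;  -- right outer join
    SELECT E.ENO, E.ENAME, D.DNO, D.DNAME, D.LOC FROM EMP E FULL OUTER JOIN DEPT D ON E.DNO = D.DNO;   -- full outer join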

Description of the HC stages: the description of horizontal combining is divided into nine parts. They are:
  o Input names,
  o Inputs / outputs / rejects,
  o Join types,
  o Input requirements with respect to sorting,
  o De-duplication (removing duplicates),
  o Treatment of unmatched records,
  o Memory usage,
  o Key column names, and
  o Type of inner join.

The differences between Join, Lookup, and Merge with respect to the above nine points are shown below.

:: Input names ::
  JOIN:   when we work on HC with JOIN, the first source is the left table, the last source is the right table, and all middle sources are intermediate tables.
  LOOKUP: the first link from the source is the primary/input link and the remaining links are lookup/reference links.
  MERGE:  the first table is the master table and the remaining tables are update tables.

:: Inputs / outputs / rejects ::
  JOIN:   N inputs (inner, LOJ, ROJ), 2 inputs (FOJ), 1 output, no reject.
  LOOKUP: N inputs (normal), 2 inputs (sparse), 1 output, and 1 reject.
  MERGE:  N inputs, 1 output, (n - 1) rejects.

:: Join types ::
  JOIN:   inner join, left outer join, right outer join, and full outer join.
  LOOKUP: inner join, left outer join.
  MERGE:  inner join, left outer join.

:: Input requirements with respect to sorting ::
  JOIN:   primary mandatory, secondary mandatory.
  LOOKUP: primary optional, secondary optional.
  MERGE:  primary mandatory, secondary mandatory.

:: De-duplication (removing the duplicates) ::
  JOIN:   primary OK (nothing happens), secondary OK.
  LOOKUP: primary OK, secondary warnings.
  MERGE:  primary warnings, secondary OK.

:: Treatment of unmatched records ::
  JOIN:   primary - drop (inner) or target (left); secondary - drop (inner) or target (right).
  LOOKUP: primary - drop, target (continue), or reject (unmatched primary records); secondary - drop.
  MERGE:  primary - drop or target (keep); secondary - drop or reject (unmatched secondary records).

:: Memory usage ::
  JOIN:   light memory.
  LOOKUP: heavy memory.
  MERGE:  light memory.

:: Key column names ::
  JOIN:   must be SAME.
  LOOKUP: optional (same in the case of a lookup file set).
  MERGE:  must be SAME.

:: Type of inner join ::
  JOIN:   ALL.
  LOOKUP: ALL.
  MERGE:  ANY.

DAY 37 LOOKUP Stage (Processing Stage)

Lookup stage: in real-time projects, 95% of horizontal combining is done with this stage. The Lookup stage is for cross-verification of primary records against secondary records.

DataStage version 8 supports four types of LOOKUP, they are:
  o Normal LOOKUP
  o Sparse LOOKUP
  o Range LOOKUP
  o Caseless LOOKUP

For example, a simple job with the EMP and DEPT tables:
  - Primary table: EMP, with columns EID, ENAME, DNO.
  - Reference table: DEPT, with columns DNO, DNAME, LOC.

    EMP table (primary/input)     ----> LOOKUP ----> Data Set (target)
    DEPT table (reference/lookup) --^

LOOKUP properties for the two tables:
  - Primary table: ENO, ENAME, DNO; reference table: DNO, DNAME, LOC; target: ENO, ENAME, DNAME.
  - Key column for both tables: it can be set by just dragging the DNO column from the primary table onto the reference table.

The tool bar of the LOOKUP stage contains a constraints button, in which we have to select:
  - Continue: this option is for a left outer join.
  - Drop: it is for an inner join.
  - Fail: it aborts the job if any unmatched primary records are there.
  - Reject: it captures the unmatched primary records.

Caseless LOOKUP: during execution, by default the key match is case sensitive. But we have an option to remove the case sensitivity, i.e. Key type = caseless.

DAY 38 Sparse and Range LOOKUP

Sparse LOOKUP: if the source is a database, it supports only two inputs.
  - Normal lookup: cross-verification of primary records with secondary records in memory.
  - Sparse lookup: cross-verification of primary records with secondary records at the source (database) level itself.
  - To set a sparse lookup we must set the key type to sparse on the reference table only. By default, a normal LOOKUP is done in the lookup stage.

Note: a sparse lookup does not support another reference when the reference is a database. But in ONE case the sparse LOOKUP stage can support n references: by taking a lookup file set.

Job 1: a sequential file extracts a text file and loads it into a lookup file set (lfs).

    Sequential File ----> Lookup File Set

  Here, in the lookup file set properties:
    o The column names should be the same as in the sequential file.
    o The target file is stored with the .lfs extension.
    o The address (path) of the target must be saved, to be used in another job.

Job 2: in this job we use the lookup file set as a sparse lookup.

    SF       ----> LOOKUP ----> DS
    LFS, LFS --^

  In the lookup file set we must paste the address of the above lfs. A lookup file set supports n references, which means indirectly a sparse lookup supports n references.

Range LOOKUP:

Range lookup keeps a condition (a range comparison) between the tables. How to set the range lookup: in the LOOKUP properties, select the check box for the column on which you need the condition.

Condition for the LOOKUP stage: how to write a condition in the lookup stage?
  o Go to the tool bar constraint; there we will see the condition box.
  o In the condition, for example: in.primary = "AP".
  o For multiple links we can write multiple conditions, for n references.

DAY 39 Funnel, Copy and Modify Stages

Funnel Stage:

It is a processing stage which combines multiple sources into one target. To perform the funnel stage, some conditions must be followed:
1. The number of columns should be the same.
2. The column names should also be the same.
3. The column names are case sensitive.
4. The data types should be the same.

The funnel stage appends the records of one table after another, but the above four conditions have to be met.

Simple example for the funnel stage: one source table (ENO, EN, GEN, Loc) and another source table (EMPID, EName, GEN, Loc, Company, Country) are appended into one target. The second table is passed through a Copy/Modify stage so that its column names are changed to match the primary table, and the GEN column values M and F are exchanged into 1 and 0, before the funnel appends the two flows.

The funnel operation has three modes:
  - Continuous funnel: the records are collected in random order.
  - Sequence: the collection of records is based on the link order.
  - Sort funnel: it is based on the key column values.
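A rough SQL analogue of the funnel append (the table and column names EMP_A and EMP_B are assumed here just for illustration):

    SELECT ENO, EN, GEN, LOC FROM EMP_A
    UNION ALL                                           -- the funnel appends rows; it does not remove duplicates
    SELECT EMPID AS ENO, ENAME AS EN, GEN, LOC FROM EMP_B;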

Copy Stage:

It is a processing stage which can be used for:
1. Copying the source data to multiple targets.
2. Changing the column names.
3. Dropping columns.
4. Acting as a stub stage.
NOTE: it is best for changing column names and dropping columns.

Modify Stage:

It is a processing stage which can perform:
1. Dropping columns.
2. Keeping columns.
3. Changing the column names.
4. Modifying the data types.
5. Altering the data.

    Oracle Enterprise ----> Modify ----> Data Set

From OE, using the Modify stage, send data into the data set with respect to the above five points.

In the Modify properties:
  Specification: drop SAL, MGR, DEPTNO
    o Here it drops the above columns.
  Specification: keep SAL, MGR, DEPTNO
    o Here it keeps only these columns; the remaining columns are dropped.
  At run time: Data Set Management (view the result of the operation).
  Specification: DOJ = HIREDATE                              (<new column name> = <old column>)
    o Here it changes the column name.
  Specification: DOJ = DATE_FROM_TIMESTAMP(HIREDATE)         (<new column name> = conversion(<old column>))
    o Here it changes the column name along with the data type.
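For comparison, the keep/rename effect of the Modify specifications can be sketched in SQL against the standard EMP table (an illustration only, not what the stage generates):

    SELECT EMPNO,
           ENAME,
           TRUNC(HIREDATE) AS DOJ    -- renamed column; TRUNC drops the time portion
    FROM   EMP;                      -- columns not listed (SAL, MGR, DEPTNO, ...) are dropped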

DAY 40 JOIN Stage (Processing Stage)

The Join stage is used in horizontal combining; it differs with respect to input requirements, treatment of unmatched records, and memory usage.
  - Join stage input names are left table, right table, and intermediate tables.
  - The Join stage has n inputs (inner, LOJ, ROJ), 2 inputs (FOJ), 1 output, and no reject.
  - The types of join are inner, left outer join, right outer join, and full outer join.
  - Input requirements with respect to sorting: sorting is mandatory on the primary and secondary tables.
  - Input requirements with respect to de-duplication: nothing happens, i.e. it is OK if duplicates are present.
  - Treatment of unmatched records: in the primary table, with the Inner option the unmatched records are simply dropped, and with LOJ all primary records are kept in the target. In the secondary table, with the Inner option they are dropped, and with ROJ all secondary records are kept in the target.
  - Memory usage: light memory in the join stage.
  - Key column names must be the SAME in this stage.
  - All types of inner join are supported.

A simple job for the JOIN stage:

JOIN properties: it needs a key column.
  o Inner join and left outer join: the driving records come from the left table.
  o Right outer join: the driving records come from the right table.
  o Full outer join: records come from both tables; there is no scope for a third table, which is why FOJ has only two inputs.
  o In the join stage, when we sort with different key column names the job can still execute, but it affects the performance (simply put, WARNINGS will occur).

We can change a column name in two ways: with a Copy stage, or with a query statement.
Example SQL query:  select DEPTNO1 as DEPTNO, DNAME, LOC from DEPT;
DAY 41 MERGE Stage (Processing Stage)

The Merge stage is a processing stage; it performs horizontal combining and differs with respect to input requirements, treatment of unmatched records, and memory usage.
  - Merge stage input names are master and updates.
  - N inputs, 1 output, and (n - 1) rejects for the merge stage.
  - The join types of this stage are inner join and left outer join.
  - Input requirements with respect to sorting: it is mandatory to sort before performing the merge stage.
  - Input requirements with respect to de-duplication: in the primary (master) table we will get warnings when we don't remove the duplicates; in the secondary (update) tables nothing happens, it is OK when we don't remove the duplicates.
  - Treatment of unmatched records: for the primary table, Drop (drops) or Target (keeps) the unmatched primary records; for the secondary tables, drop, or reject, which captures the unmatched secondary records.
  - In the merge stage the memory usage is LIGHT memory.
  - The key column names must be the SAME.
  - For the type of inner join, it compares against ANY of the update tables.

NOTE:
  - Static information is stored in the master table.
  - All changing information is stored in the update tables.
  - Merge operates with only two options:
    o Keep (left outer join)
    o Drop (inner join)

Simple job for the MERGE stage:

    Master table                  Update (U1)                Update (U2)
    PID PRD_DESC PRD_MANF         PID PRD_SUPP PRD_CAT       PID PRD_AGE PRD_PRICE
    11  indica   tata             11  abc      XXX           11  4       1000
    22  swift    maruthi          33  xyz      XXX           22  9       1200
    33  civic                     55  pqr      XXX           66  3       1500
                                  77  mno      XXX           88  9       1020

    Master, U1, U2 ----> MERGE ----> TRG, Reject (U1), Reject (U2)

In the MERGE properties:
  - Merge has an inbuilt sort = (ascending order / descending order).
  - The link order must be followed.
  - Merge supports (n - 1) reject links.
  NOTE: there have to be as many reject links as update links, or zero reject links.

Here the COPY stage can act as a stub stage, meaning it holds the data without sending it to a target.

DAY 42 Remove Duplicates & Aggregator Stages

Remove Duplicates: it is a processing stage which removes the duplicates on a key column and retains the first or last of the duplicate rows.

    Sequential File ----> Remove Duplicates ----> Data Set

Properties of Remove Duplicates: there are two options in this stage.
  o Key column = <column name>
  o Duplicate to retain = (first/last)
The Remove Duplicates stage supports 1 input and 1 output.
NOTE: for every stage with n inputs and n outputs, mapping must be done.
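A rough SQL counterpart of "remove duplicates, retain first" (a sketch only; it assumes the EMP table and uses Oracle's analytic ROW_NUMBER, which is not what the stage runs internally):

    SELECT *
    FROM  (SELECT E.*,
                  ROW_NUMBER() OVER (PARTITION BY DEPTNO ORDER BY EMPNO) AS RN
           FROM   EMP E)
    WHERE  RN = 1;      -- keeps the first row per key, like "duplicate to retain = first"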

Aggregator:

It is a processing stage that performs a count of rows and different calculations between columns, i.e. the same operation as GROUP BY in Oracle.

    SF ----> Aggregator ----> DS

Properties of Aggregator:
  Grouping keys:
    o Group = DEPTNO
  o Aggregation type = count rows (count rows / calculation / re-calculation)
  o Count output column = count <column name>

1Q: Count the number of all records, and the count DEPTNO-wise, in the EMP table?
1 Design:

    OE_EMP ----> Copy of EMP ----> counting rows per DEPTNO         ----> TRG1
                             ----> generating a constant column ----> counting rows of the created column ----> TRG2

For doing some group calculations between columns, for example:
  - Select the group key: Group = DEPTNO
  - Aggregation type = calculation
  - Column for calculation = SAL <column name>
  - Operations are:
      Maximum value output column = max <new column name>
      Minimum value output column = min <new column name>
      Sum of column = sum <new column name>, and so on.
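The same group-wise calculation written as a plain SQL sketch (the standard EMP table is assumed):

    SELECT DEPTNO,
           COUNT(*) AS CNT,
           MAX(SAL) AS MAX_SAL,
           MIN(SAL) AS MIN_SAL,
           SUM(SAL) AS SUM_SAL
    FROM   EMP
    GROUP  BY DEPTNO;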

Here, the calculation is done on SAL, grouped by DEPTNO.

2Q: In target one, find the DEPTNO-wise maximum, minimum, and sum; and in target two, the company-wise maximum?
2 Design:

    OE_emp ----> copy of emp ----> max, min, sum per DEPTNO       ----> trg1
                             ----> company: IBM ----> max of IBM  ----> trg2

3Q: Find the maximum salary of the company from the EMP table, along with all the details of that employee?
& 4Q: Find the max, min, and sum of salary DEPTNO-wise in the EMP table?

3 & 4 Design (sketch): the EMP rows are copied; one copy is divided DEPTNO-wise (dno=10, dno=20, dno=30), each group is aggregated (max(deptno), min(deptno)), the group results are funneled together (UNION ALL) and compared back against a dummy copy of the detail rows; for the company (IBM) the maximum SAL is compared with the detail rows to get all the details of that employee.
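For reference, 3Q also has a classic SQL answer (a sketch against the standard EMP table; the subquery finds the top salary and the outer query brings back the full details):

    SELECT *
    FROM   EMP
    WHERE  SAL = (SELECT MAX(SAL) FROM EMP);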

DAY 43 Slowly Changing Dimensions (SCD) Stage Before SCD we must understand: types of loading 1. Initial load 2. Incremental load
Initial load: complete dump in dimensions or data warehouse i.e., target also before

data is called Initial load.


The subsequent is alter is called incremental load i.e., coming from OLTP also source

is after data.

Navs notes

Page 131

2010

DataStage
Example: #1 Before data (already data in a table) CID 11 CNAME A ADD HYD GEN M BALANCE Phone No 30000 988531068 8 After data (update n insert at source level data) CID 11 CNAME A ADD SEC GEN M BALANCE Phone No 60000 988586542 2 Column fields that have changes types: Address slowly change Balance rapid change Phone No often change Age frequently AGE 25

AGE 24

Example: #2 Before Data: CID 11 22 33 CNAME A B C ADD HYD SEC DEL

After Data: (update n insert option loading a table) CID 11 22 CNAME A B ADD HYD CUL

Navs notes

Page 132

2010

DataStage
33 D PUN

We have SIX types of SCDs; they are SCD I, SCD II, SCD III, SCD IV or V, and SCD VI.

Explanation:

SCD I: it only maintains the current update, and no historical data is kept. As per SCD I, it updates the before data with the after data, and no history is present after execution.

SCD II: it maintains both the current update data and the historical data, with some special operational columns; they are: surrogate key, active flag, effective start date, and effective end date.
  - In SCD II there is no natural primary key, so a system-generated primary key is needed, i.e. a surrogate key. Here the surrogate key acts as the primary key.
  - When SCD II is performed we get a practical problem of identifying the old and the current record. That we can solve with the active flag: Y or N.
  - In SCD II two new concepts are introduced, i.e. effective start date (ESDATE) and effective end date (EEDATE).
  - Record version: it is a concept used when ESDATE and EEDATE cannot be used, in some conditions.
  - Unique key: the unique key is established by comparing.

SCD III: SCD I (+) SCD II - it maintains the history but no duplicates. It extracts the after and before data from the DW (or database) to compare and upsert.

SCD IV or V: SCD II + record version. When we do not maintain the date version, the record version is useful.

SCD VI: SCD I + unique identification.

Example table of SCD data:

    SID  CID  CNAME  ADD  AF  ESDATE    EEDATE      RV  UID
    1    11   A      HYD  N   03-06-06  29-11-10    1   1
    2    22   B      SEC  N   03-06-06  07-09-07    1   2
    3    33   C      DEL  Y   03-06-06  9999-12-31  1   3
    4    22   B      DEL  N   08-09-07  29-11-10    2   2
    5    44   D      MCI  Y   08-09-07  9999-12-31  1   5
    6    11   A      GDK  Y   30-11-10  9999-12-31  2   1
    7    22   B      RAJ  Y   30-11-10  9999-12-31  3   2
    8    55   E      CUL  Y   30-11-10  9999-12-31  1   8

Table: this table illustrates the columns used by the six SCD types described above.

DAY 44 SCD I & SCD II (Design and Properties)

SCD I: Type 1 (Design and Properties):

Transfer job:

    OE_SRC (after: 10, 20, 40)  ----> SCD1 ----> fact ----> DS_FACT (10, 20, 40)
    OE_DIM (before: 10, 20, 30) --^         ----> dim  ----> DS_TRG_DIM (10, 20, 40)

Load job:

    DS_TRG_DIM (10, 20, 40) ----> OE_UPSERT (update and insert)

In Oracle we have to create table1 and table2:
  Table1:
    Create table SRC(SNO number, SNAME varchar2(25));
    o Insert into src values(111, 'naveen');
    o Insert into src values(222, 'munna');
    o Insert into src values(333, 'kumar');
  Table2 (the BEFORE table):
    Create table DIM(SKID number, SNO number, SNAME varchar2(25));

    o No records to display initially.

Process of the SCD1 transfer job:

Step 1: Load the plug-in metadata from Oracle for the before and after data, as shown in the links above, which come from different sources.

Step 2: SCD1 properties.
  Fast path 1 of 5: select the output link: fact.
  Fast path 2 of 5: navigate the key column value between the before and after tables.

      AFTER (SNO, SNAME)
      Key Expression      Column Name    Purpose
      -                   SKID           surrogate key
      AFTER.SNO           SNO            business key

  Fast path 3 of 5: select the source type and source name for the surrogate key state file.
      Source type: flat file
      Source name: D:\study\navs\empty.txt
      NOTE: every time we run the program we should empty the source file (empty.txt), else the surrogate key will continue from the last stored value.

  Fast path 4 of 5: select the output to DIM.

      AFTER (SNO, SNAME) --> DIM
      Derivation          Column Name    Purpose
      NextSK()            SKID           surrogate key
      AFTER.SNO           SNO            business key

  Fast path 5 of 5: set the output paths to the FACT data set.

      AFTER (SNO, SNAME), BEFORE (SKID, SNO, SNAME) --> FACT
      Derivation          Column Name
      BEFORE.SKID         SKID
      AFTER.SNO           SNO

Step 3: In the next job, i.e. the load job, if we change or edit the source table, then when we load into Oracle we must change the write method to upsert; in that we have two options, they are:
  - update then insert   \\ if the key column value already exists
  - insert then update   \\ if the key column value is new

Here is the SCD I result for the input below:

    Before table                After table             Target dimension table of SCD I
    CID  CNAME  SKID            CID  CNAME              CID  CNAME  SKID
    10   abc    1               10   abc                10   abc    1
    20   xyz    2               20   nav                20   nav    2
    30   pqr    3               40   pqr                40   pqr    3
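For orientation, the SCD I upsert behaviour written as a plain Oracle MERGE sketch (the table and column names come from the example; SKID_SEQ is an assumed sequence, and this is an illustration rather than the SQL the stage generates):

    MERGE INTO DIM d
    USING SRC s
    ON    (d.SNO = s.SNO)                           -- business key
    WHEN MATCHED THEN
      UPDATE SET d.SNAME = s.SNAME                  -- SCD I: overwrite, keep no history
    WHEN NOT MATCHED THEN
      INSERT (SKID, SNO, SNAME)
      VALUES (SKID_SEQ.NEXTVAL, s.SNO, s.SNAME);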

SCD II: (Design and Properties):

Transfer job:

    OE_SRC (after: 10, 20, 40)                       ----> SCD2 ----> fact ----> DS_FACT (10, 20, 20, 30, 40)
    OE_DIM (before: 10, 20, 30) ----> Tx (BEFORE_TX) --^         ----> dim  ----> DS_TRG_DIM (10, 20, 20, 30, 40)

Load job:

    DS_TRG_DIM ----> OE_UPSERT (update and insert)

Step 1: In the transformer stage, add some columns to the before table and convert the ESDATE and EEDATE columns into timestamps, so the transformer output can feed SCD II.

  In the Tx properties:

      BEFORE (SKID, SNO, SNAME, ESDATE, EEDATE, ACF) --> BEFORE_TX
      Derivation          Column Name
      BEFORE.SKID         SKID
      BEFORE.SNO          SNO
      BEFORE.SNAME        SNAME

Step 2: In the SCD II properties:
  Fast path 1 of 5: select the output link: fact.
  Fast path 2 of 5: navigate the key column value between the before and after tables.

      BEFORE / AFTER (SNO, SNAME)
      Key Expression      Column Name    Purpose
      -                   SKID           surrogate key
      AFTER.SNO           SNO            business key
      -                   SNAME          Type 2
      -                   ESDATE         effective date

  Fast path 3 of 5: select the source type and source name for the surrogate key state file.
      Source type: flat file
      Source name: D:\study\navs\empty.txt
      NOTE: every time we run the program we should empty the source file (empty.txt), else the surrogate key will continue from the last stored value.

  Fast path 4 of 5: select the output to DIM.

      AFTER (SNO, SNAME) --> DIM
      Derivation                                                       Column Name    Purpose
      NextSK()                                                         SKID           surrogate key
      AFTER.SNO                                                        SNO            business key
      AFTER.SNAME                                                      SNAME          Type 2
      CurrentDate()                                                    ESDATE         effective date
      DateFromJulianDay(JulianDayFromDate(CurrentDate()) - 1)          EEDATE         expiration date

  Fast path 5 of 5: set the output paths to the FACT data set.

      AFTER (SNO, SNAME), BEFORE (SKID, SNO, SNAME, ESDATE, EEDATE, ACF) --> FACT
      Derivation          Column Name
      BEFORE.SKID         SKID
      AFTER.SNO           SNO
      AFTER.SNAME         SNAME
      BEFORE.ESDATE       ESDATE

Step 3: In the next job, i.e. the load job, if we change or edit the source table, then when we load into Oracle we must change the write method to upsert; in that we have two options, they are:
  - update then insert   \\ if the key column value already exists
  - insert then update   \\ if the key column value is new

Here is the SCD II result for the input below:

    Before table
    CID  CNAME  SKID  ESDATE    EEDATE      ACF
    10   abc    1     01-10-08  9999-12-31  Y
    20   xyz    2     01-10-08

    Target dimension table of SCD II
    CID  CNAME  SKID  ESDATE    EEDATE      ACF
    10   abc    1     01-10-08  9999-12-31  Y
    20   xyz    2     01-10-08  09-12-10    N
    20   xyz    4     10-12-10

    After table
    CID  CNAME
    10   abc
    20   nav
    40

DAY 45 Change Capture, Change Apply & Surrogate Key Stages

Change Capture Stage: it is a processing stage that captures whether a record of a table is a copy, edited, inserted, or to be deleted, by keeping a code column.

Simple example of change capture:

    before, after ----> Change_Capture ----> changes

Properties of Change Capture:
  Change keys
    o Key = EID (key column name)
  Change values
    o Value = ?   \\ ENAME
    o Value = ?   \\ ADD
  Options
    o Change mode = (explicit keys & values / explicit keys, all values)
    o Drop output for copy   = (false/true), false by default
    o Drop output for delete = (false/true), false by default
    o Drop output for edit   = (false/true), false by default
    o Drop output for insert = (false/true), false by default
    o Copy code = 0, Insert code = 1, Delete code = 2, Edit code = 3
    o Code column name = <column name>
    o Log statistics = (false/true), false by default
    o Sort order = ascending order

Change Apply Stage: it is a processing stage that applies the captured changes to the records of a table.

Properties of Change Apply:
  Change keys
    o Key = EID
  Options
    o Change mode = explicit keys & values
    o Check value columns on delete = (false/true), true by default
    o Log statistics = false
    o Code column name = <column name>   \\ this has to be the SAME as in Change Capture for the apply operation
    o Sort order = ascending order

SCD II in version 7.5.x2: design of the job

- before.txt and after.txt are compared in a Change Capture stage (key = EID, option: explicit keys & values).
- The captured output is split on the change code: c = 3 (edits) and c = all (the full change set).
- Defaults for a new active row: ESDATE = CurrentDate(), EEDATE = 9999-12-31, ACF = Y, key = EID.
- In the Transformer on the before.txt side, the derivations are:
  o ESDATE = CurrentDate()
  o EEDATE = if c = 3 then DateFromJulianDay(JulianDayFromDate(CurrentDate()) - 1) else 9999-12-31
  o ACF    = if c = 3 then N else Y
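Written out as they might appear in the Transformer derivation editor (the change-code column name change_code and the link name inlink are assumptions):

   EEDATE:  If inlink.change_code = 3 Then DateFromJulianDay(JulianDayFromDate(CurrentDate()) - 1) Else StringToDate('9999-12-31', '%yyyy-%mm-%dd')
   ACF:     If inlink.change_code = 3 Then 'N' Else 'Y'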

SURROGATE KEY Stage: In version 7.5.x2 the stage only knows the last generated value from the first compile and run of the job; for that reason, in version 7.x we have to build another job to store the last generated value. That job in version 7.5.x2, design:

SF -> SK -> Copy -> DS, with the Copy stage also feeding Tail -> Peek

In this job, the Surrogate Key stage generates system key column values that act like primary key values, but it only generates them correctly on the first compile and run. By adding the Tail stage we trace the last value and store it in the Peek stage, i.e. in a buffer; with that buffered value we can continue generating the sequence of surrogate key values in version 7.5.x2.

In version 8.0: the above problem with version 7 is overcome. The version 8.0 Surrogate Key stage keeps the last-value information in an empty text file (empty.txt) or a database sequence, and uses it to generate the next sequence values.


Design in version 8.0: Before.txt -> SK -> Data Set

Properties of the Surrogate Key stage (version 8):
Option 1 (flat-file source):
  o Generated output column name = skid
  o Source name = g:\data\empty.txt
  o Source type = flat file
Option 2 (database sequence source):
  o Database type = Oracle (DB2/Oracle)
  o Source name = sq9   \\ in Oracle, create sequence sq9; it plays the same role as empty.txt
  o User id = scott
  o Password = tiger
  o Server name = oracle
  o Source type = database sequence
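A minimal sketch of creating that sequence on the Oracle side (only the name sq9 comes from these notes; the start and increment values are assumptions):

   -- sequence read by the Surrogate Key stage (start/increment assumed)
   CREATE SEQUENCE sq9 START WITH 1 INCREMENT BY 1;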

DAY 46 DataStage Manager

Export: Export is used to save a group of jobs to a file at a location we choose.
Navigation - how to export? From the DataStage toolbar:
o Job components to export
  - Change selection: ADD or REMOVE or SELECT ALL
o Here there are three options:
  - Export job designs with executables (where applicable)
  - Export job designs without executables
  - Export job executables without designs
o Export to file: Source name\.....   \\ where we want to locate the export file
o Type of export - the file can be exported in two formats:
  - dsx (optionally 7-bit encoded)
  - xml

Import:
It is used to import .dsx or .xml files into a particular project, and also to import definitions, as shown below. The import options are:
o DataStage components
o DataStage components (XML)
o External function definitions
o Web services function definitions
o Table definitions
o IMS definitions - in IMS there are two options:
  - Database Description (DBD)
  - Program Specification Block (PSB/PCB)
In DataStage components:
o Import from file - give the source name to import.

o Import all / Import selected
o Overwrite without query
o Perform impact analysis

Generate Report:
It generates a report for a job (or a specific object) instantly. For that, go to File:
o Generate report
  - Report name
  - Options: use default style sheet / use custom style sheet
After finishing the settings, the report is generated in the default location /reportingsendfile/ send file/ tempDir.tmp

Node Configuration:
Q: How to see the nodes a job used in a project?
o Go to the Director, check the job log, and double-click on the main program entry: it shows the APT config file.

Q: What are the node components?
1. Node name - logical CPU/node name.
2. Fast name - server name or system name.
3. Pools - logical areas in which stages are executed.
4. Resource - disk and scratch space associated with the node.

o Node components store permanent data on disk at the address below:
  c:\ibm\information server\server\parasets
o Node components store temporary data at the address below:
  c:\ibm\information server\scratch

Q: Which node configuration handles every job by default, and what is the name of the configuration file?
o Every job runs on the default APT node configuration.
o The name of the configuration file is C:\ibm\.........\default.apt
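For reference, a single-node configuration file has roughly this shape (a sketch only; the fastname and resource paths below are assumptions, not the contents of the real default.apt):

   {
     node "node1"
     {
       fastname "dsserver"
       pools ""
       resource disk "c:/ibm/informationserver/server/datasets" {pools ""}
       resource scratchdisk "c:/ibm/informationserver/scratch" {pools ""}
     }
   }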

Q: How to run a job on a specific configuration file?
o Job properties -> Parameters -> Add environment variable -> Parallel -> Config file (add $APT_CONFIG_FILE).
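At run time the variable is then pointed at the configuration file to use; the path below is only an illustrative assumption:

   $APT_CONFIG_FILE = C:\ibm\InformationServer\Server\Configurations\four_node.apt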

Q: How to create a new node configuration file?
o Tools -> Configurations: there we see default.apt.
o default.apt holds the single-node information.
o We can create a new configuration with the NEW option.
o Save after creating the new nodes, using Save Configuration As.

o NOTE: the number of nodes created for a project is usually a power of two (2^0, 2^1 and so on, say 8 or 16), depending on how many CPUs the system has.

Q: If a uniprocessing system with 1 CPU needs a minimum of 1 node to run a job, how many minimum nodes does an SMP system with 4 CPUs need?
o Only 1 node.

Advanced Find:
It is a new feature in version 8. It finds objects related to a job, such as:
1. Where used,
2. Dependencies,
3. Comparison reports.

Q: How to run a job from within a job?
Navigation: Job properties
o Job control - select a job   \\ here the Job Control Language (JCL) script is generated.
o Dependencies - select the job (compile that job first, before the main job).

Q: What does the Advanced Find palette contain?
o Name to find: Nav*
o Folder to search: D:\datastage\
o Type
o Creation
o Last modification
o Where used
  - Find objects that use any of the following objects (options: Add, Remove, Remove all)

o Dependencies of job

Q: What Advanced Find actions are available on a repository object through the toolbar?
o Cross project compare
o Compare against
o Export
o Multiple job compile
o Add to palette
o Create copy
o Locate in tree
o Find dependencies

Q: How to find the dependencies of a job?
o Go to the toolbar -> Repository -> Find dependencies (all types) of a job.

DAY 47 DataStage Director

DS Director maintains:
- Schedule
- Monitor
- Views
  o Job view
  o Status view
  o Log view
- Message handling
- Batch jobs
- Unlocking

Schedule:
Schedule means a job can be run at specific timings. To set the timings:
o Right-click on the job in DS Director -> Add to schedule -> set the timings.
In real time, job sequences are scheduled with external tools (this happens only in production):
o Control-M
o Cron tab
o Autosys

Purge:
It means cleaning, washing out, or deleting the logs that have already been created. Within a job we can also filter the job log with the FILTER option (by right-clicking).
Navigation to set the purge:

o Toolbar -> Job -> Clear log (choose the option)
  - Immediate purge
  - Auto purge

Monitor:
It shows the status of the job, the number of rows executed, started at (time), elapsed time, rows/sec, and the percentage of CPU used.
Navigation - how to monitor a job:
o Right-click on the job -> click Monitor: it shows the performance of the job.

Like the figure below for a simple job:

  Status     No. rows   Started at   Elapsed time   Rows/sec   %CPU
  Finished   6          sys time     00:00:03       2          9
  Finished   6          sys time     00:00:03       2          7

NOTE: Based on this we can check the performance tuning of a stage in a particular job.

Reasons for warnings:
Default warnings in a sequential file are:
1. Field <column name> has import error and no default value; data : { e i d }, at offset: 0
2. Import warnings at record 0.
3. Import unsuccessful at record 0.
o These three warnings can be solved by a simple option in the sequential file, i.e., First Line is Column Names = True (the default is False).
Other reasons for warnings:
o Missing record delimiter "\r\n", saw EOF instead (a format mismatch).
o When working with a Lookup, if the reference (secondary) input has duplicates we get a warning.
o When there is a length mismatch, e.g. source length (10) and target length (20).
o When sorting on a different key column than the one used in a Join.
o Issues on the second (update) input of a Merge.

Abort a job:
Q: How can we abort a job conditionally?
o When we run a job we can set a limit on warnings:
  - No limit
  - Abort job after: <n warnings>
o In the Transformer stage:
  - Constraint -> Otherwise/log -> Abort after rows: 5 (if 5 records do not meet the constraint, the job simply aborts)
o We can keep a constraint like this in a Range Lookup as well.

Message Handling:
If the warnings cannot be handled otherwise, we come to message handling.
Navigation - how to add a rule to a message handler for a warning:

o Job log -> right-click on a warning -> Add rule to message handler.
Two options:
- Suppress from log
- Demote to information
Choose one of the above options and add the rule.

Batch jobs:
Executing a set of jobs in order.
Q: How to create a batch?
Navigation for creating a batch, in DS Director:
o Tools -> Batch -> New (give the batch a name)
o Add jobs to the created batch
o Just compile after adding the jobs to the new batch.

Allow multiple instances:
The same job can be opened by multiple clients and run. If we do not enable this option, the job opens read-only for other users (it can't be edited), although the job can still be executed by multiple users at the same time in the Director.
Navigation to enable allow multiple instances: go to the toolbar in DS Designer
o Job properties -> check the box Allow multiple instances.

Unlock the jobs:

We can unlock jobs by releasing all of their held resources.
Navigation to unlock a job, from the DS Director toolbar:
o Job -> Cleanup resources -> Processes -> Show by job / Show all -> Release all.
To see the PIDs for jobs globally, in DS Administrator:
o General -> Environment variables -> Parallel -> Reporting -> add APT_PM_SHOW_PIDS and set it to (true/false).

DAY 48 Web Console Administrator

Components of the administrator:
Administration:
o Users & groups -> Users: user names & passwords are created here, and permissions are assigned.
Session management:
o Active sessions.
Reports:
o DS INDIA (server/system name): we can create the reports here for the admin, upload them to review, and view them.
Domain Management:
o License: update the license here.
Scheduling management:
It is to know what each user is doing; from here:
o Scheduling views -> New
  - the view columns are: schedule | run | creation task | last update

DAY 49 Job Sequencing

Stages of job sequencing:
Job sequencing is for executing jobs in a sequence that we can schedule; in other words, it controls the order in which jobs execute. A simple flow is processed in the order below:
o Extract
o Transform
o Load
o Master job: it controls the order of execution.
Important stages in job sequencing are:
1. Job Activity
2. Sequencer
3. Terminator Activity
4. Exception Handler
5. Notification Activity
6. Wait For File Activity

Job Activity:
It holds a job; it has 1 input and n outputs.

How is the Job Activity dragged onto the design canvas?

- There are two methods:
1. Go to the toolbar -> View -> Repository -> Jobs -> just drag the job onto the canvas.
2. Go to the toolbar -> View -> Palette -> Job Activity -> just drag the icon onto the canvas.

Simple job sequence: a Student job activity sends its OK and WAR (warnings) links to a Sequencer, which feeds the student rank job activity; the FAIL link goes to a Terminator Activity.

Properties of Job Activity:
Load the job you want into the activity:
o Job name: D:\DS\scd_job
o Execution action: RUN
  - options: (Run / Reset if required, then run / Validate only / Reset only)
o Do not checkpoint run
options - (Run/Reset if required, than run/ Validate only/ Reset only) Check Point: Job has re-started where it aborted it is called check point It is special option that we must enable manually Go to
o

Job properties of DS Designer Enable check point


Parameter mapping: if the job already has some parameters, we can map them into this activity's job if needed.

Triggers: a trigger holds the output link's expression type, which decides how the link fires.

  Name of output link   Expression type          Expression
  OK                    OK (conditional)         Execution finished OK
  WAR (warnings)        Warnings (conditional)   Execution finished with warnings
  Fail                  Failed (conditional)     Execution failed

And some more options in Expression type:
- Custom (conditional): a custom expression
- Unconditional: N/A (the default)
- Otherwise: N/A
- User status: = <user-defined message>
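As a sketch of a custom (conditional) trigger expression, assuming the upstream activity is named Job_Activity_1 (the activity name is an assumption; DSJS.RUNOK and DSJS.RUNWARN are the job-status constants for "finished OK" and "finished with warnings"):

   Job_Activity_1.$JobStatus = DSJS.RUNOK Or Job_Activity_1.$JobStatus = DSJS.RUNWARN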

Terminator Activity:
It is the stage that handles the error path when something fails. Properties: it consists of two options, used when any subordinate jobs are still running:
o Send STOP requests to all running jobs, and wait for all jobs to finish (for a job failure).
o Abort without sending STOP requests, wait for all jobs to finish first (for when the server goes down while the process is running).


Sequencer: it holds multiple inputs and multiple outputs (n number of links). It has two options or modes:
- ALL: fires when all of the OK & WAR links have triggered.
- ANY: fires when any link has triggered (used for the FAIL path).

Exception Handler:
It handles server interrupts. We do not connect any stage into it; it sits separately in the sequence job. A simple job for the exception handler:

Exception handler -> Notification activity -> Terminator activity

Exception handler properties: it has only general information.

Notification Activity:
It sends an acknowledgement (an e-mail) during the process. Options to fill in its properties:
- SMTP mail server name:
- Sender's email address:
- Recipients' email address:
- Email subject:
- Attachments: D:\DS\SCD_LOAD (browse file)
- Email body:

Wait For File Activity:
It places the sequence on pause until a file condition is met.
- File name:
- Two options: Wait for file to appear / Wait for file to disappear
- Timeout length (hh:mm:ss)
- Do not timeout (no time limit for the above options)

DAY 50 Performance tuning w.r.t partition techniques & Stages

Partition techniques fall into two categories:
Key based: 1. Hash 2. Modulus 3. DB2 4. Range
Key less: 1. Same 2. Round Robin 3. Entire 4. Random

In the key-based partition techniques:
- DB2 is used when the target is a DB2 database; DB2 and Range are used rarely.
- Hash partition technique:
  o It is selected when there are multiple key columns, i.e., key columns (>1) with heterogeneous data types (different data types).
  o Other than that situation, we can select the modulus partition technique.
- Modulus partition technique:
  o It distributes the data based on mod values.
  o The mod formula is MOD(key value, number of nodes); see the worked example at the end of this day's notes.

NOTE: Modulus performs better than Hash because of the way it groups the data, based on the mod value.
NOTE: But Modulus can only be selected if there is exactly one key column and its data type is integer.

In the key-less partition techniques:
- Same: never redistributes the data; it carries forward whatever partitioning the previous stage produced.
- Entire: distributes the complete set of records to every node; the purpose is to avoid mismatched records during the operation.
- Round Robin: generator stages such as the Column Generator are associated with this technique.
  o It is the better partition technique compared to Random.
- Random: stages that use a key-less technique take this as their default.

Performance tuning w.r.t Stages:
- If sorting is already performed, we can use the JOIN stage; else the LOOKUP stage is the best.
- LOOKUP FILE SET: this option is used to remove duplicates on the lookup (reference) side.
- SORT stage: for a complex sort go to the Sort stage; else go to the link sort.
- Remove duplicates: if the data is already sorted, use the Remove Duplicates stage; for sorting and removing duplicates together, go to the link sort (unique).
- Constraints: when an operation and constraints are both needed, go to the Transformer stage; if only constraints are needed, simply go to the FILTER stage.
- Conversions: Modify stage or Transformer stage (the Transformer takes more compile time).
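A quick worked example of the modulus rule above (the key values are made up for illustration), on a 4-node configuration:

   MOD(105, 4) = 1  -> the record with key 105 goes to partition/node 1
   MOD(8, 4)   = 0  -> the record with key 8 goes to partition/node 0
   MOD(14, 4)  = 2  -> the record with key 14 goes to partition/node 2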

DAY 51 Compress, Expand, Generic, Pivot, xml input & output Stages

Compress Stage: It is a processing stage that compresses the records into a single format, i.e. into a single (zipped) file. It supports 1 input and 1 output.

Properties:
- Stage
  o Options: Command = (compress/gzip)
- Input
  o <do nothing>
- Output
  o Load the metadata of the source file.

Expand Stage: It is a processing stage that extracts the compressed (zipped) data back into its uncompressed form. It supports 1 input and 1 output.

Properties:
- Stage
  o Options: Command = (uncompress/gunzip)
- Input
  o <do nothing>
- Output
  o Load the metadata of the source file for further processing.

Encode Stage:

It is a processing stage that encodes the records into a single format with the support of a command line. It supports 1 input and 1 output.

Properties:
- Stage
  o Options: Command line = (compress/gzip)
- Input
  o <do nothing>
- Output
  o Load the metadata of the source file.

Decode Stage: It is a processing stage that decodes the encoded data. It supports 1 input and 1 output.

Properties:
- Stage
  o Options: Command line = (uncompress/gunzip)
- Output
  o Load the metadata of the source file.

Generic Stage: It is a processing stage through which any operator can be called, provided its properties are filled in properly. It supports n inputs and n outputs, but no reject links.

- When a job is compiled, the job-related OSH code is generated.
- The Generic stage can call a DataStage operator directly.
- Its purpose is the migration of server jobs to parallel jobs (IBM has the X-Migrator tool, which converts about 70% of a job automatically).
- It can call ANY operator, but the properties must be filled in.

Properties:
- Stage
  o Options: Operator = copy (we can write any stage operator here)
- Input
  o <do nothing>
- Output
  o Load the metadata of the source file.

Pivot Stage: It is a processing stage that pivots columns into rows in a table (each listed input column becomes a separate output row). It supports 1 input and 1 output.

Properties:
- Stage: <do nothing>
- Input: <do nothing>
- Output:

  Column name   SQL Type   Length   Derivation
  REC           varchar    25       <col_1, col_2, ... comma separated>
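A small worked sketch (the column names and values are assumptions): with input columns EID, Q1, Q2, Q3 and an output column REC whose derivation is Q1, Q2, Q3, one input row produces three output rows:

   Input:   EID=11, Q1=100, Q2=200, Q3=300
   Output:  (11, 100)
            (11, 200)
            (11, 300)    \\ EID repeats because it is mapped straight through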

XML Stages: This is a real-time stage family through which data is stored as single records, or aggregated, in XML format.
The XML stage is divided into two types:
1. XML Output
2. XML Input

XML Input: