



Introduction to the Phases of DataStage

There are four different phases in DataStage:

Phase I: Data Profiling
It is for source system analysis. The analyses performed are:
1. Column analysis,
2. Primary key analysis,
3. Foreign key analysis (by these analyses we can find whether the data is "dirty" or not),
4. Base line analysis, and
5. Cross domain analysis.

Phase II: Data Quality (also called cleansing)
In this phase the processes are interdependent, i.e., they run one after another, as shown below:
Correcting -> Standardizing

Phase III: Data Transformation
The ETL process is done here: the data is transformed as it moves from one stage to another.
ETL means:
E - Extract
T - Transform
L - Load

Phase IV: Meta Data Management - "Metadata means data about the data".
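The extract, transform, and load steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not DataStage code; all function and field names here are hypothetical.

```python
# Minimal ETL sketch: extract raw records, transform them, load a target.
# Illustrative only - the names are not a real DataStage API.

def extract(source_rows):
    """Extract: read raw records from a source (here, a list of dicts)."""
    return list(source_rows)

def transform(rows):
    """Transform: standardize the name field (a Phase II style cleanup)."""
    out = []
    for r in rows:
        out.append({
            "eno": r["eno"],
            "ename": r["ename"].strip().upper(),  # trim and standardize case
            "dno": r["dno"],
        })
    return out

def load(rows, target):
    """Load: append the transformed rows into the target store."""
    target.extend(rows)
    return target

source = [{"eno": 1, "ename": " naveen ", "dno": 10}]
warehouse = []
load(transform(extract(source)), warehouse)
```

Running the three steps in order leaves the cleaned record in `warehouse`.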


How does the ETL programming tool work?

Pictorial view: sources (Database, Flat files, MS Excel) -> ETL Process -> Business Interface

Figure: ETL programming process




The flow within the ETL process:
- Extract the data from .txt source files (ASCII code).
- Convert it to the DataStage format (native format).
- Stage the data: source staging before, and staging after, the transformation (staged data is permanent).
- Through the load window, load the data into .txt (ASCII) files or a database, or it resides in the local repository of the source DWH.

ETL is a process that is performed in stages:

OLTP -> staging area -> staging area -> staging area -> DWH

Here, S means source and T means target.

Home Work (HW): one record for each kindle (multiple records for multiple addresses, and dummy records for joint accounts).



HW explanation: we must read the query very carefully and understand the terminology of the words in a business perspective.
Q: One record for each kindle (multiple records for multiple addresses and dummy records for joint accounts).
- Kindle means information about customers.
- Multiple records means multiple records (customers), and multiple addresses means one customer (one account) maintaining multiple products, like savings / credit cards / current account / loan.
- A customer maintaining one record while handling different addresses is called a 'single view customer' or 'single version of truth'.
ETL Developer documents:
- HLD: high level document
- LLD: low level document
Kalyan-9703114894

The ETL Developer Requirements are:
1. Understanding
2. Prepare Questions: after reading the given document, ask friends / forums / team leads / project leads.
3. Logical design: means paper work.
4. Physical model: using a tool.
5. UNIT Test
6. Performance Tuning
7. Design Turn Over Document (DTD) / Detailed Design Document (DDD) / Technical Design Document (TDD)
8. Peer Reviews: releasing versions (version control *.*; here * means a range of 1-9).
9. Backups: importing and exporting the data as required.
10. Job Sequencing

DAY 5

How is a DWH project undertaken?
Process: Requirements -> HLD -> TD -> jobs -> Test -> Production -> Migration
- The developer (system engineer) is involved in the warehouse development work, 70 - 80% of the jobs, and implements all TEN requirements shown above.
- Production takes about 10% and migration about 30%; the cross marks (x) in the flow mark the stages where the developer is not involved.
- Production based companies are companies like IBM and so on. Support (migration) based companies are like TCS, Cognizant, Satyam Mahindra and so on.
- Migration means converting server jobs to parallel jobs. Up to 2002 the server environment was in use; after 2002, and up to now, IBM's X-Migrator converts server jobs to parallel jobs, up to 70% automatically and 30% manually. In migration the developer works with both server and parallel jobs.

Categories: a project is divided into categories with respect to its period (the time of the project, taken in months and years):
- Simple: 6 months
- Medium: 6 months - 1 year
- Complex: 1 - 1.5 years
- Too complex: 1.5 - 5 years and so on (it may take many years depending on the project)

Project Process:
HLD (high level documents):
- Requirements: SRS, BRD (here the business analyzer / subject matter expert is involved)
- Architecture
- Warehouse: schema (structure), dimensions and fact tables (target tables)
LLD (low level documents):
- Mapping documents (specifications - spec's)
- Test spec's
- Naming documents
TD

Mapping Document:
For example, if the query requirements are: 1. experience of the employee, 2. dname, and 3. first name, middle name, last name.
The mapping document lists, column by column: S.no, Load order, Target Entity, Target Attributes, Source Tables, Source Fields, Transformation, Constraint, Error Handling. For the experience attribute the transformation is Current Date - Hire Date (CD - HD).
Pictorially: the source tables EMP (Eno Pk, Ename (FName, MName, LName), Hire date, Dno Fk) and DEPT (Dno Pk, Dname) share the common field Dno and feed the targets Exp_tbl / Exp_emp.
Funneling: getting data from multiple sources (S1, S2) into one target; the combining ('C') may be horizontal combining or vertical combining. As per this example, horizontal combining is used.

- As a developer you will get a maximum of around 100 source fields.
- As a developer you will get a maximum of around 30 target fields.
- HC means Horizontal Combination; it is used to combine primary rows with secondary rows.
- "Look Up!" means cross verification against the primary table.
Format of the mapping document: sources S1, S2 (e.g. .txt flat files: fwf, vl, sc, cv, h & t) -> HC -> TRG (target: file or database, with its database type).

DAY 6
Architecture of DWH

For example, the Reliance Group: the group has some branches (Reliance Power, Reliance Comm, Reliance Fresh), every branch has its own database and its own manager, and above all these managers there is one Top Level Manager (TLM). The TLM needs the details listed (sales, customer, employee, period, order) for analysis.
For the above example the ETL process is done as follows: each branch's data (e.g. the ERP data of Reliance Fresh) goes through the ETL process into the DWH (mini WH), and from the DWH into the data marts; or one branch, taken directly from the group, goes through the ETL process straight into its own data mart (independent data mart).
A data mart is also called the 'bottom level' or a 'mini WH'.
Dependent Data Mart: the ETL process takes all the managers' information (databases) and keeps it in the warehouse; the data transmission between warehouse and data mart then depends on each other. Hence a data mart that depends upon the WH is called a dependent data mart.

Here, the warehouse is the top level and all the data marts are the bottom level, as shown in the figure above.
Independent Data Mart: the data of only one individual manager (like RF, RC, RP and so on), i.e., the data mart accesses the ETL process directly without any help from the warehouse. That is why it is called an independent data mart.
6.1 Two level approaches: for both approaches a two layer architecture applies. They are 1. the Top-Bottom level approach and 2. the Bottom-Top level approach.
6.1.1. Top-Bottom level approach: the flow starts from the top; as per the example, from the Reliance Group sources the ETL process loads the Data Warehouse (top level, Layer I) and from there all the separate data marts (bottom level, Layer II). This approach was invented by W. H. Inmon.

6.1.2. Bottom-Top level approach: here the ETL process loads the data marts (DM) directly (Layer I: bottom level), and the data is then put in the warehouse (Layer II: top level) for reference purposes, i.e., storing the DMs in the Data WareHouse (DWH). The Bottom-Top level approach was invented by Ralph Kimball.
Here one data mart (DM) contains information like customers, employees, products, location and so on.
These two approaches, Top-Bottom and Bottom-Top, come under the two layer architecture.

6.2. Four layers of DWH Architecture:
6.2.1. Layer I: in this layer the data is sent directly, in the first case from source to Data WareHouse (DWH), and in the second case from source to a group of Data Marts (DM).
- ETL tools are GUI (graphical user interface) tools; they are used to "extract the data from heterogeneous sources".
- ETL database platforms are Teradata / Oracle / DB2 and so on.
6.2.2. Layer II: in this layer the data flows from source - data warehouse - data marts, and this type of flow is called the "top - bottom approach". In the other case the data flows from source - data marts - data warehouse, and this type of flow is called the "bottom - top approach". The Layer II architecture was explained in the example shown above (the Reliance Group); it is the best example for a clear explanation.
6.2.3. Layer III: in this layer the data flows from source - ODS (operational data store) - DWH - data marts. The new concept added here is the ODS: it stores operational data for a period like 6 months or one year, and that data is used to solve instant problems; the ETL developer is not involved there. The team that solves the instant / temporary problems is called the interface team. After its period, the ODS data is stored into the DWH, and from there it goes to the DMs; the ETL developers are involved from that point in layer 3. (99.99% of projects use layer 3 and layer 4.) The clear explanation of the layer 3 architecture is in the example below.

Example: the source is an aeroplane that is waiting to land at the airport terminal. It is not able to land because of some technical problem at the airport base station (it may take up to 2 hours at most to solve the problem). To solve this type of operational problem a special team is involved: the interface team. At the airport base station the problem information is captured and stored in the Operations Data Store (ODS) database, which stores the technical problems for one year (Layer II). Years of such databases are stored in the data warehouse so that the technical problems are not repeated, i.e., for future reference (Layer III); from the DWH the data goes to the data marts, where the ETL developers are involved. But the ODS itself stores the data for one year only. This is also called the layer 3 architecture of the data warehouse.

DAY 7

6.2.4. Layer IV: Layer 4 is also called the "Project Architecture".
- Layer I: sources (Source1, Source2, Source3) to interface files (flat files), read through DS.
- Layer II: the ETL process loads the ODS; condition mismatch and format mismatch records are handled here.
- Layer III: the ETL process, with a lookup, loads the DW and the SVC.
- Layer IV: Business Intelligence (BI DM) and reporting; the BI DM is for data backup of the DWH and SVC.
Here, DM - Data Mart, DW - Data Warehouse, BI - Business Intelligence, SVC - Single View Customer, ODS - operations data store, L2 / L3 / L4 - layer 2, 3, 4; solid lines carry reference data and dashed lines carry reject data.
Figure: Project Architecture / Layer IV
About the project architecture:

In the project architecture there are 4 layers.
- In the first layer: source to interface files (flat files: .txt, .csv, .xml and so on).
- In the second layer: the ETL reads the flat files through DataStage (DS) and sends them to the ODS. When the ETL sends the flat files to the ODS, if there is any mismatched data it drops that data.
- In the third layer: the ETL transfers the data to the warehouse.
- In the last layer: the data warehouse checks whether it is a single customer or not, and the data loading (transmission) happens between the DWH and the DM (business intelligence).
There are two types of mismatched data: 1. Condition mismatch and 2. Format mismatch.
- Condition mismatch (CM): this verifies whether the data from the flat files satisfies the conditions or is mismatched; if a record is mismatched it is dropped automatically. To see the dropped data a reference link is used; it shows which records were condition mismatched.
- Format mismatch (FM): this is like condition mismatch, but it checks whether the format of the record being sent is correct or mismatched. Here also a reference link is used to see the dropped data.
Example for condition mismatch: an employee table (SQL> select * from emp;) contains EID, ENAME, DNO. With the constraint dno = 10, the records (08 Naveen 10) and (15 Sravan 10) reach the target table, which contains only the required dno = 10 rows, while the reference link drops the dno 20 and 30 records (19 Munna 20, 99 Suman 30).
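The condition-mismatch check with its reject (reference) link can be sketched as below. This is only an illustration in Python, not DataStage itself; the row values mirror the example above.

```python
# Condition-mismatch sketch: rows satisfying the constraint dno = 10 go to the
# target; mismatched rows go to the reject list (the "reference link").

emp = [
    {"eid": 8,  "ename": "Naveen", "dno": 10},
    {"eid": 19, "ename": "Munna",  "dno": 20},
    {"eid": 99, "ename": "Suman",  "dno": 30},
    {"eid": 15, "ename": "Sravan", "dno": 10},
]

target, rejects = [], []
for row in emp:
    # route each record by the condition, never silently lose it
    (target if row["dno"] == 10 else rejects).append(row)
```

The target keeps the dno = 10 rows and the reject list shows exactly which records were dropped and why.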

Example for format mismatch: the table format is tab-space separated with the fields EID, EName, Place:
111 naveen mncl
222 munna knl
333suman hyd   <- this record's format is mismatched (no tab between 333 and suman), so the record is just rejected.
NOTE: Business Intelligence (BI DM) is for data backup of the DWH and SVC (single version of truth).
7.2. Single View Customer (SVC): it is also called the "single version of truth". DataStage people are involved in this process.
For example, to make a unique customer from 5 multiple records of customers:
naveen savings / naveen loan / munna deposit / munna insurance / suman credit
these are transformed into one record per customer (CName, Adds.):
naveen: savings, loan / munna: insurance, deposit / suman: credit
This type of transformation is also called Reverse Pivoting.
Phase II: identify field by field (same records). Phase III: cannot identify in this.

DAY 8

Dimensional Modeling: it represents, in a physical or logical design, the path from our source system to the target system.
- Logical design: the client perspective.
- Physical design: the database perspective.
- Meta Data: every entity has a structure, which is called metadata (simply, 'data about data'). In a table there are attributes and domains; there are two types of domain: 1. Alphabetical and 2. Number.
The design (logical view / pictorial view, e.g. EMP - DEPT with an optional relationship) can be done manually, or data modelers use DM tools:
- ERWIN (Entity Relation WINdows)
- ER-Studio (Entity Relation Studio)
These two are data modelers where the logical and physical design is done. They support:
- Forward Engineering (FE): starting from scratch.
- Reverse Engineering (RE): creating from an existing system; simply, "altering the existing process".
For example: Q: a client requires the experience of an employee.

From the source EMP_table, Hire Date is the implicit requirement (the experience of the employee is derived from it). From the developer's point of view, the explicit requirement is to find out everything the client wants to see. The target (employee hire detailed information) has: ENo, EName, Years, Months, Days, Hours, Minutes, Seconds, Nano_Seconds - the lowest level of detailed information.
8.1. Dimensional Table: finding out everything as per what the client wants to see, i.e., the "lowest level of detailed information" in the target tables, is called a dimensional table.
Q: how are the tables interconnected? By taking some tables and linking them with the related tables. The link is created using a foreign key and a primary key. There are three kinds of keys:
- Primary Key: a constraint; it is a combination of unique and not null.
- Foreign Key: a constraint used as a reference to another table.
- Surrogate key.

Like in the product tables: Product (Product_ID Pk, PRD_Desc, PRD_TYPE_ID Fk) links to Product_Type (PRD_TYPE_ID Pk, PRD_Category, PRD_SP_ID Fk), which links to Supplier (PRD_SP_ID Pk, SName, ADD1); the links are established using Fk and Pk.
8.2. Normalized Data: repetitive information (records) in a table is called redundancy. Minimizing that information is the technique known as Normalization.
For example, a table (ENO, EName, Designation, DNo, Higher Quali., Add1, Add2) repeats the qualification and address columns for every employee of the same department; that repetitive information is redundancy. Dividing it into two tables, EMP (ENO, EName, Designation, DNo Fk) and DEPT (DNo Pk, Higher Quali., Add1, Add2), is the normalization (redundancy reducing) technique.
- The target table must always be in de-normalized format.
- De-normalization means combining the multiple tables into one table, and the combining is done by Horizontal Combining (HC).
- But it is not so in all cases; in real time, though, de-normalization is a must.

DAY 9 Kalyan-9703114894 .  But it is not in all cases. And combining is done by Horizontal combine. HC Normalization De-Normalization  De-Normalization means combining the multiple tables into one table. that de-normalized is must and should.

An Entity-Relationship Model: in logical design there are two options for a relationship in a job design: 1. Mandatory (a must) and 2. Optional.
Given the two tables EMP (ENO, EName, Designation, DNo) and DEPT (DNo, Higher Quali., Add1, Add2):
- DEPT is the primary table (also known as the master table), because it does not depend on any other table.
- EMP is the secondary table (also known as the child table), because it depends on the DEPT table.
But when we take it in real time, joining the two tables using horizontal combining takes the EMP table as the primary table and the DEPT table as the secondary table: 1 primary table and n secondary tables.
9.1. Horizontal Combine:

- HC means combining primary rows with secondary rows based on the primary key column values.
- Horizontal combining is also called JOIN.
To perform horizontal combining we must follow these cases:
- It must have multiple sources (1 primary, n secondary).
- There should be a dependency.
- There are three types of keys: primary key, foreign key, and surrogate key.
For example, combining the EMP and DEPT tables (EMP.DNo Fk -> DEPT.DNo Pk): after combining (joining) by HC, each EMP row (ENO, EName, Designation, DNo) is extended with the matching DEPT columns (Higher Quali., Add1, Add2), e.g. 111 naveen ETL Developer 10 M.TECH JNTU HYD, 222 munna System analysis 20 M.SC SVU HYD, 444 Raju Call Center 30 M.SC KU WRL, and so on.
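The horizontal combine (join) of EMP with DEPT on the key column DNo can be sketched as follows. This is an illustrative Python sketch with the example data above, not DataStage code.

```python
# Horizontal combining (JOIN) sketch: extend each primary (EMP) row with the
# secondary (DEPT) columns that match on the key column dno.

emp = [
    {"eno": 111, "ename": "naveen", "dno": 10},
    {"eno": 222, "ename": "munna",  "dno": 20},
    {"eno": 444, "ename": "Raju",   "dno": 30},
]
dept = {  # secondary table keyed by its primary key dno
    10: {"quali": "M.TECH", "add1": "JNTU"},
    20: {"quali": "M.SC",   "add1": "SVU"},
    30: {"quali": "M.SC",   "add1": "KU"},
}

# inner join: keep EMP rows whose dno has a matching DEPT row
joined = [{**e, **dept[e["dno"]]} for e in emp if e["dno"] in dept]
```

Each joined row now carries both the EMP fields and the matching DEPT fields, which is exactly the de-normalized shape the target table needs.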

9.2. Different types of Schemas: there are four types of schemas:
- STAR Schema
- Snow Flake Schema
- Multi STAR Schema
- Galaxy Schema
1. STAR Schema: for the star schema you must know about two things: the dimensional table and the fact table.
- Dimensional table: the "lowest level detailed information" of a table.
- Fact table: a collection of foreign keys from the n dimensional tables.
Definition of STAR Schema: "A fact table, a collection of foreign keys, surrounded by multiple dimensional tables, where each dimensional table is a collection of de-normalized data, is called a STAR Schema."
The data transmission is done in two different methods:
- In the pictorial way: Source T -> DIM table -> FACT table (the fact table is fed through the dimension).
- In the practical way: directly from the source to both the dimensional table and the fact table.

Example for STAR Schema: "taking some tables (Product, Brand, Category, Unit, Customer, Customer_Category, Location, Date) to derive a star schema from them".
Q: display what suman bought - a lux product in ameerpet in the January 1st week?
- As per the above question, the fact table needs information from PRD_Dim_tbl, Date_Dim_tbl, Cust_Dim_tbl and Loc_Dim_tbl. The link to the measurements, i.e., to the fact table, is created by foreign key and primary key: each dimension table's Pk appears in the fact table as an Fk, alongside the measurements.
- In the fact table, measurements means taking the information as per the client requirement or user requirement.
- As shown above, the fact table is surrounded by the dimensional tables, where a dimensional table is the lowest level detailed information and the fact table is a collection of foreign keys.
- The fact table is also called a bridge or intermediate table.
- But in the current market the STAR Schema and Snow Flake Schema are used rarely.
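The fact-and-dimension lookup for the "suman buys lux in ameerpet" question can be sketched like this. All table contents and surrogate key values here are hypothetical, purely to show how the fact table's foreign keys resolve against the dimensions.

```python
# Star-schema sketch: a fact row holds only foreign keys plus a measurement;
# joining it to the dimension tables answers the business question.

cust_dim = {1: "suman"}      # Cust_Dim_tbl: Pk -> customer name
prd_dim  = {7: "lux"}        # PRD_Dim_tbl:  Pk -> product name
loc_dim  = {3: "ameerpet"}   # Loc_Dim_tbl:  Pk -> location name

# fact table: one row per sale, Fk per dimension + the measurement (qty)
fact = [{"cust_id": 1, "prd_id": 7, "loc_id": 3, "qty": 2}]

report = [
    (cust_dim[f["cust_id"]], prd_dim[f["prd_id"]], loc_dim[f["loc_id"]], f["qty"])
    for f in fact
]
```

Resolving each foreign key against its dimension turns the narrow fact row into the readable report the client asked for.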

2. Snow Flake Schema: sometimes, if there is huge data in a dimensional table, it is not possible to fetch it as soon as required. To minimize the huge data in one dimensional table, we divide the dimensional table into some tables; those tables are known as "lookup tables".
Definition of Snow Flake Schema: "The fact table surrounded by dimensional tables, where each dimensional table has lookup tables, is called a Snow Flake Schema."
For example, the chain EMP_tbl (Fk) -> Dept_tbl (Pk, Fk) -> Locations (Pk, Fk) -> Area (Pk): if we require information from the location table, it is fetched from that table and the client's requirement is displayed.
- NOTE: the selection of the schema depends, at run time, on report generation. With de-normalized data (Cognos / BO reports from the DWH) the STAR Schema works effectively; with normalized data (migration / H1) the Snow Flake Schema works effectively.

:: DataStage CONCEPTS ::

DAY 10
DataStage (DS) Concepts:
- History of DS
- Architecture of the 7.5.x2 and 8.0.1 versions
- Differences between the 7.5.x2 and 8.0.1 versions
- Features of DS
- Enhancements and new features of version 8
10.1 HISTORY of DataStage
As of the year 2006 there were around 600 ETL tools in the market, some of them:
- DataStage Parallel Extender
- SAS (ETL Studio)
- ODI (OWB)
- BODI
- Ab Initio, and so on...
But DataStage is famous and widely used in the market, and it is expensive too.
Q: What is DataStage?
ANS: DataStage is a comprehensive ETL tool which provides an End-to-End Enterprise Resource Planning (ERP) solution (here, comprehensive means good in all areas).
History begins:
- In 1997, the first version of DataStage was released by the VMARK company, a US based company; Mr. LEE SCHEFFLER is the father of DataStage.
- In those days DataStage was called "Data Integrator". Only 5 members were involved in releasing the software into the market.
- In 1997, Data Integrator was acquired by a company named TORRENT.
- There are 90% changes from 1997 to 2010, comparing the released versions.

- In 1999, the INFORMIX company acquired Data Integrator from the TORRENT company.
- In 2000, the ASCENTIAL company acquired both the database and Data Integrator, and after that the Ascential DataStage Server Edition was released in this year. Through this company DataStage became popular in the market from that year. The released software had around 30 tools. The server was configured only on the UNIX platform (environment).
- In 2002, ADSS + ORCHESTRATE: the ASCENTIAL company integrated with the ORCHESTRATE company for the parallel capabilities, because ORCHESTRATE (PX, i.e., UNIX) had parallel extender capabilities in the UNIX environment. The integrated ADSS + ORCHESTRATE was named ADSSPX, version 6.0; from that version the parallel operations (parallel capabilities) start. The parallel versions then went on developing up to 7.5.1; the versions from 6.0 to 7.5.1 supported only UNIX flavour environments.
- In 2004, version 7.5.x2 was released, which supports the server configuration on the Windows platform also. For this, ADSSPX was integrated with MKS_TOOL_KIT, a virtual UNIX machine that brings the capability to Windows to support the server configuration. NOTE: after installing ADSSPX + MKS_TOOL_KIT on Windows, all the UNIX commands work on the Windows platform.

- In 2004, December, version 7.5.x2 had the ASCENTIAL suite components. They are: Profile stage, Audit stage, Quality stage, Meta stage, DataStage Px, DataStage Tx, DataStage MVS, and so on; these are individual tools. There are 12 types of ASCENTIAL suite components.
- In 2005, February, IBM acquired all the ASCENTIAL suite components and released IBM DS EE, i.e., the Enterprise Edition.
- In 2006, IBM made some changes to IBM DS EE: the profiling stage and audit stage were integrated into one. With the combination of four stages they released "IBM WEBSPHERE DS & QS 8.0.0". This is also called an "Integrated Development Environment" (IDE).
- In 2009, IBM released another version, "IBM INFOSPHERE DS & QS 8.1.0".
- In the current market:
  7.5.x2 is used by 40 - 50%
  8.0.1 is used by 30 - 40%
  8.1 is used by 10 - 20%
NOTE: DataStage is a front end; nothing is stored in it.

- Platform Independent:
o "A job that can run on any processor is called platform independent."
o There are three types of processors:
  - UNI processor (one CPU with its own HDD and RAM)
  - Symmetric Multi Processor (SMP) (multiple CPUs sharing one HDD and RAM)
  - Massively Parallel Processor (MPP) (a group of SMP machines: SMP-1, SMP-2, SMP-3 ... SMP-n)
- Any to Any:
o DataStage is capable of connecting any source to any target.

DAY 11
DataStage FEATURES
There are 5 important features of DataStage:
- Any to Any
- Platform Independent
- Node Configuration
- Partition Parallelism
- Pipeline Parallelism

- Node Configuration:
o A node is software created in the operating system. "A node is a logical CPU, i.e., an instance of a physical CPU."
o Node configuration, using software, is "the process of creating virtual CPUs". The node configuration concept works exclusively in DataStage; hence it is the best feature compared with other ETL tools.
o For example: an ETL job requires executing 1000 records.
  - For the above job a UNI processor takes 10 minutes to execute the 1000 records.
  - But for the same job an SMP processor takes 2.5 minutes: the 1000 records are shared across four CPUs, hence the execution time is reduced.
o Hence node configuration can also create virtual CPUs to reduce the execution time on a UNI processor. This is explained clearly in the diagram below.

o Using node configuration for the above example on a UNI processor: multiple nodes (Node1, Node2, Node3, Node4) are created on the single CPU, and the 10 minutes reduces to 2.5 minutes.
- Partition Parallelism:
o Partitioning is distributing the data across the nodes, based on partition techniques.
o Consider one example of why we use the partition techniques, taking some records in an EMP table and some in a DEPT table:
  - The EMP table has 9 records: EMP(10,10,10,20,20,20,30,30,30).
  - The DEPT table has 3 records: DEPT(10,20,30).
o After partitioning, the join output must have 9 records, because the primary table has 9 records. But with a naive distribution such as N1 = 10,20,10 / N2 = 30,10,20 / N3 = 10,20,30, only 4 records match in total (2 on N1, 1 on N2, 1 on N3), while the output must be 9 records.

o In the above example only 4 records reach the final output and 5 records are missing; for this reason the partition techniques were introduced.
o There are two categories of partition parallelism, with 8 partition techniques in total:
  - Key based: Hash, Modulus, Range, DB2
  - Key less: Same, Random, Entire, Round robin
o The key based category (key based techniques) gives the assurance that rows with the same key column value are collected into the same key partition. Partitioning the records above key based on EMP.DNO gives N1 = 10, N2 = 20, N3 = 30, and the JOIN with DEPT then matches every row.
o A key less technique is used to append the data when joining the given tables.
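The difference between a key based technique (modulus) and a key less one (round robin) can be sketched as below. This is an illustrative Python sketch of the idea, not the DataStage engine; the 3-node layout and the rows are made up.

```python
# Key based (modulus) vs key less (round robin) partitioning over 3 nodes.
# Modulus guarantees that equal key values land on the same node, which is
# what makes a partitioned join correct; round robin only balances load.

NODES = 3

def modulus_partition(rows, key):
    parts = [[] for _ in range(NODES)]
    for r in rows:
        parts[r[key] % NODES].append(r)  # same dno -> always the same node
    return parts

def round_robin_partition(rows):
    parts = [[] for _ in range(NODES)]
    for i, r in enumerate(rows):
        parts[i % NODES].append(r)       # one row per node, in turn
    return parts

rows = [{"dno": 10}, {"dno": 20}, {"dno": 30}, {"dno": 10}]
by_key = modulus_partition(rows, "dno")        # both dno=10 rows on one node
balanced = round_robin_partition(rows)         # rows spread evenly
```

With `by_key`, a join against a DEPT table partitioned the same way finds every match locally; with `balanced`, matching rows may sit on different nodes.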

DAY 12
Continues... Features of DataStage
- Partition Parallelism (continued):
o Re-Partition: means re-distributing the already distributed data, i.e., distributing the distributed data again; this is known as re-partitioning.
o Example: EMP (ENO, EName, DNo, Loc) is first partitioned key based on DNo (N1: 10, N2: 20, N3: 30) for the JOIN with DEPT (DNo); then, taking the separate column Loc, the data is re-partitioned on Loc (N1: AP, N2: TN, N3: KN).

o Reverse Partitioning:
  - It is also called collecting: moving data from the parallel node partitions (N1 ... Nn) into a sequential / single file.
  - It is done in one case (one situation) only: "when the data moves from a parallel stage to a sequential stage, the collecting happens in this case only."
  - When designing a job, the "stages" (e.g. SRC -> TRSF -> TRG) are connected by links (pipes); a link is the channel that moves the data from one stage to another.
  - There are four categories of collecting techniques: Ordered, Round robin, Sort - merge, Auto.
  - Example for the collecting techniques, with N1 = a, x; N2 = b, y; N3 = c, z: the ordered collector reads node by node (a x b y c z); round robin takes one record from each node in turn (a b c x y z); sort - merge emits the records merged in sorted key order; auto lets DataStage choose the technique.
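The ordered and round robin collectors above can be sketched as follows. This is an illustrative Python sketch of the two collection orders, using the same N1/N2/N3 example data; it is not the DataStage collector implementation.

```python
# Collecting sketch: merge 3 node partitions into one sequential stream.

partitions = [["a", "x"], ["b", "y"], ["c", "z"]]  # N1, N2, N3

def collect_ordered(parts):
    """Ordered: drain node 1 completely, then node 2, then node 3."""
    out = []
    for p in parts:
        out.extend(p)
    return out

def collect_round_robin(parts):
    """Round robin: take one record from each node per pass."""
    out = []
    for i in range(max(len(p) for p in parts)):
        for p in parts:
            if i < len(p):
                out.append(p[i])
    return out
```

The same partitions yield two different sequential orders, which is why the choice of collecting technique matters when the target must be ordered.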


DAY 13
Differences between 7.5.x2 & 8.0.1:

7.5.x2:
- 4 client components: DS Designer, DS Manager, DS Director, DS Administrator
- OS dependent with respect to users
- No web based administration
- 2-tier architecture
- Architecture components: Server Components, Client Components
- Capable of Phase III & Phase IV
- File based repository

8.0.1:
- 5 client components: DS Designer, DS Director, DS Administrator, Web Console, Information Analyzer
- OS independent with respect to users, but one-time dependent only
- Web based administration through the Web Console (simply put, work from home)
- N-tier architecture
- Architecture components: Common User Interface, Common Repository, Common Engine, Common Connectivity, Common Shared Services
- Capable of all phases
- Database based repository

13.1. Client components of 7.5.x2:
- DS Designer: it is to create jobs, compile, run, and do multiple-job compiles. 4 types of jobs can be handled by the DS Designer: Mainframe jobs, Server jobs, Parallel jobs, Job Sequence jobs.
- DS Manager: it can handle: Import and Export of repository components; Node Configuration.
- DS Director: it can handle: Schedule; Unlock; compile, run, batch jobs; Views (job, status, logs); Message Handling.
- DS Administrator: it can handle: Create project; Delete project; Organize project.
13.2. Client components of 8.0.1:
- DS Designer: it is to create jobs, compile, run, and do multiple-job compiles. The jobs handled by the DS Designer are: Mainframe jobs, Server jobs,

 Parallel job  Job sequence job  Data quality job
- DS Director: same as shown above for 7.5.x2.
- DS Administrator: same as shown above for 7.5.x2.
- Web Console: the administration component, through which the following are performed:
 Security services  Scheduling services  Logging services  Reporting services  Session management  Domain manager
- Information Analyzer: also called the console for IBM Information Server.
 It performs all Phase-I (Data Profiling) activities:
 Column analysis,  Primary key analysis,  Foreign key analysis,  Base line analysis, and  Cross domain analysis.
- As an ETL developer you will mostly come across DS Designer and DS Director, but some information should also be known about the Web Console.

DAY 14
Description of the 7.5.x2 & 8.0.1 Architectures

14.1. Architecture of 7.5.x2:
* Server Components: divided into 3 categories, they are a. Repository, b. Engine, c. Package Installer.
 Repository: also called the project or work area.
o The repository organizes the different components in one area; it is a collection of components. Some of the components are:
 Jobs  Table definitions  Shared containers  Routines ... etc.
o The repository is for developing applications as well as storing applications.
o The repository is therefore also an Integrated Development Environment (IDE):
 the IDE performs design, compile, run, and save of jobs.
 Engine: it executes DataStage jobs and automatically selects the partition technique.
o Never leave any stage set to auto: if left on auto it selects the auto partition technique, which affects performance.

 Package Installer: this component contains two types of installers, one is plug-ins and the other is packs.
o Plug-in example: a printer needs a driver (e.g., an 1100 driver provided for an 1100 printer) before the computer can use it; here the interface between computer and printer is what a plug-in is.
o Packs are used to configure DataStage for ERP solutions (ERP software to DataStage via Packs).
o A familiar example: normal Windows XP acquires Service Pack 2 for more capabilities.
* Client components: divided into 4 categories, they are a. DS Designer, b. DS Manager, c. DS Director, d. DS Administrator.
 What these categories handle is shown above (page 39).

14.2. Architecture of 8.0.1:
1. Common User Interface: also called the unified user interface. Its categories are:
a. Web Console, b. Information Analyzer, c. DS Designer, d. DS Director, e. DS Administrator.

2. Common Repository: also called the Metadata Server; it holds three types of metadata:
 Project level MD  Design level MD  Operation level MD
It is divided into two parts:
a. Global repository: the DataStage job files are stored here (it also checks security issues).
b. Local repository: individual files are stored here (kept local for performance reasons).
3. Common Engine: it is responsible for
 Data Profiling analysis
 Data Quality analysis
 Data Transformation analysis
4. Common Connectivity: it provides the connections to the common repository.

[Figure: the five clients (WC, IA, DE, DI, DA) sit on the Common Shared Services layer, over the Repository / Metadata Server (project-, design-, and operation-level metadata), the Common Engine (DP, DQ, DT, DA), and Common Connectivity.]

This is a table representation of the "8.0.1 Architecture".

DAY 15
Enhancements & new features of version 8

In version 8 there are 8 categories of stages.

 Processing stage:
o New stages:
1. SCD (Slowly Changing Dimension)
2. FTP (File Transfer Protocol)
3. WTX (WebSphere TX)
4. Surrogate Key stage: a newly introduced concept.
o Enhanced stages:
1. Lookup stage. Previously lookup had: i. Normal lookup, ii. Sparse lookup. Newly added: iii. Range lookup, iv. Caseless lookup.

 Database stage:
o New stages:
 IWAY  Classic Federation  ODBC Connector  NETEZZA
o Enhanced stages:
 All stage techniques now work with respect to the SQL Builder.

 Palette of version 8.0.1 (changes versus 7.5.x2):
 General - no change
 Data Quality - new
 Data Base - has changes
 File - no change
 Development & Debug - no change
 Processing - has changes
 Real Time - has changes
 Restructure - no change
o The palette holds the shortcuts of the stages, which we drag and drop onto the canvas to design a job.
o Data Quality is an exclusively new concept of 8.0.1.
o The Database and Processing stages have the changes shown above; the other stages are the same as in version 7.5.x2, i.e., no changes in this version.

:: Stages Process & Lab Work ::

DAY 16
Starting steps for the DataStage tool

To start DataStage on the system we must follow these steps to do a job:
 The DB2 repository must be started and the DataStage server started.
 After starting: select DS Designer, enter uid: admin, enter pwd: **** (e.g.: phil), and attach the appropriate project: Project\navs...
 Palette (from the tool bar): the palette contains all stage shortcuts, from which we design the job, in these categories:
 General  Data Quality  Database  File  Development & Debug  Processing  Real Time  Restructure
(7 stage categories in 7.5.2, 8 stage categories in version 8.)
 The stages are categorized into two groups (e.g.: Seq to Seq):
1 - Active Stage: whatever stage transforms data is called an active stage.
2 - Passive Stage: whatever stage extracts or loads data is called a passive stage.
 Five-step job development process (this is for designing a job):
o Select the appropriate stages in the palette and drag them onto the CANVAS.
o Link them (give connectivity); after that, setting the properties in the Designer canvas (editor) is important.
o Save, compile, and run the job.

DAY 17
My first job creating process

Process:
 On the computer desktop, check whether the DataStage server has started or not.
o The currently running processes show at the left corner; a round symbol in green indicates it has started. If it does not start automatically, start it manually: the server must be (re)started before doing or creating jobs.
 When the 8th version of DataStage is installed, five client component shortcuts are visible on the desktop:
 Web Console  Information Analyzer  DS Administrator  DS Designer  DS Director
 Web Console: when you click it and the server has started, "the login page appears".
o If the server has not started, a "the page cannot be opened" error appears.
o If an error occurs like that, restart the server.
 DS Administrator: it is for creating/deleting/organizing projects.
 DS Designer: when you click on the designer icon, it displays the "attach the project" dialog for creating a new job.
 DS Director: it is for viewing the status of executed jobs, and to view logs, status, and warnings.
 Run director (to see views), or to view the status of your job. As shown below:
o User id: admin

o Password: ****
o If authentication fails at login, i.e., it is because of a repository interface error.
 The figure below shows how to authenticate and the designer canvas for creating jobs.

Attach the project
Domain: Localhost:8080
User Name: admin
Password: phil
Project: Teleco
[OK] [cancel]

 After clicking, it asks which type of job you want to do, they are:
 Mainframe  Parallel  Sequential  Server jobs
 After clicking on Parallel jobs, it displays the Designer canvas.
 After authentication, go to tool bar - View - Palette.
 In the palette the 8 types of stages are displayed for designing a job, they are:
 General  Data Quality  Data Base  File  Development & Debug  Processing

 Real Time  Re-Structure

17.1. File Stage:
Q: How can data be read from files?
 The file stage can read only flat files, and the formats of flat files are .txt, .csv, .xml.
 .csv means comma separated values; .xml means extensible markup language; within .txt there are different layouts such as fwf (fixed-width fields), H & T (header and trailer), and s & t.
 In the File Stage there are sub-stages like Sequential File, Data Set, File Set, and so on.
o Example of how a job executes: one sequential file (SF) to another SF:
Source -> Target
o The source file requires target/output properties, and
o The target file requires input/source properties.
 In the source file (on double-clicking the source file), how do we read a file? We must set the properties below:
 File name \\ browse to the file name
 Location \\ for example in c:\
 Format \\ .txt, .csv, .xml
 Structure \\ the metadata

General properties of a sequential file:
1. Setting / importing the source file from the local server.
Select a file name: File: \ c:\data\se_source_file.txt (via the Browse button; the File: \? option is for multiple purposes).

2. Format selection:
. One of "tab / space / comma" must be selected, as per the input file, and the data must be in the given format.
3. Column structure definition (LOAD):
Steps to load a structure, i.e., to get the structure of the file:
. Import
o Sequential file
 Browse the file and import
 Select the imported file
o Define the structure.
 These three (file name, format, structure) are the general properties when we design a simple job.
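The three properties above (file name, format/delimiter, column structure) map onto what any flat-file reader needs. A minimal sketch in plain Python, where the file content and column names are made-up examples, not from the notes:

```python
import csv
import io

# stand-in for c:\data\se_source_file.txt (hypothetical sample content)
flat_file = io.StringIO("eid,ename\n100,sam\n200,ram\n")

reader = csv.DictReader(flat_file, delimiter=",")  # delimiter plays the "format" role
rows = list(reader)                                # header row supplies the "structure"
print(reader.fieldnames)  # ['eid', 'ename']
print(rows[0]["ename"])   # sam
```

Changing `delimiter` to `"\t"` or `" "` corresponds to the tab/space/comma format choice above.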

DAY 18
Sequential File Stage

The source side of a sequential file stage exposes "output properties" (and the target side "input properties").
 Step 1: The Sequential File stage is a file stage; it reads flat files with different extensions (.txt, .csv, .xml).
 Step 2: The SF stage reads/writes sequentially by default, when it reads/writes a single file.
o And it also reads/writes in parallel when it reads/writes to or from multiple files.
 Step 3: The sequential stage supports one input (or) one output, and one reject link.

Link: a link is also a stage; it transfers data from one stage to another stage.
o Links are divided into categories:
 Stream link: SF -> SF
 Reject link: SF -> SF (rejects)

 Reference link: SF -> SF (reference)

Link Marker: it shows how the link behaves during the transfer from source to target.
1. Ready BOX: it indicates that "a stage is ready with metadata", and data transfers from sequential stage to sequential stage.
2. FAN OUT: it indicates "data transferring from a sequential stage to a parallel stage"; it is also called auto partition.
3. FAN IN: it indicates "data transferring from a parallel stage to a sequential stage"; it happens when collecting occurs.
4. BOX: it indicates "data transferring from a parallel stage to a parallel stage"; it is also known as partitioning.

5. BOW-TIE: it indicates "data transferring from a parallel stage to a parallel stage with a changed distribution"; it is also known as re-partitioning.

Link Color: the link color indicates the state of a job's execution.
 RED: a link in RED means
 case 1: a stage is not connected properly, or
 case 2: the job aborted.
 BLUE: a link in BLUE means "the job execution is in process".
 GREEN: a link in GREEN means "the execution of the job finished".
 BLACK: a link in BLACK means "the stage is ready".

NOTE: "A stage is an operator; an operator is a pre-built component." A source stage imports the import operator for the purpose of creating data in Native Format, because Native Format is the format DataStage understands.

 Compile: a compiler is a translator from source code to target code.
 Generic compilation: HLL -> (compile) -> ALL -> .OBJ -> BC -> .EXE / MC
*HLL - High Level Language, *ALL - Assembly Level Language, *BC - Binary Code, *MC - Machine Code
 Compiling process in DataStage: GUI -> OSH code & C++
*OSH - Orchestrate Shell Script
Note: OSH is generated for all stages except one, i.e., the Transformer stage, which is done by C++ (it needs a C function/compiler).
 While compiling, it checks for:
 Link requirements (checks the links)
 Mandatory stage properties
 Syntax rules

DAY 19
Sequential File Stage Properties

 Read Methods: two options are
o Specific File: the user or client gives each file name explicitly.
o File Pattern: we can use wildcard characters and search by pattern, i.e., * & ?
 For example: C:\eid*.txt
 Reject Mode: to handle records that mismatch on "format/data type/condition".
Three options
o Continue: drops the mismatched records and continues with the other records.
o Fail: the job aborts.
o Output: captures the dropped records through a reject link into another sequential file.
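A rough sketch of the three reject-mode behaviours in plain Python (the 2-column structure and the sample records are hypothetical, and this is an illustration, not DataStage code):

```python
def read_with_reject_mode(records, mode):
    good, rejects = [], []
    for rec in records:
        fields = rec.split(",")
        if len(fields) == 2:           # matches the expected 2-column structure
            good.append(fields)
        elif mode == "continue":       # drop the mismatch, keep going
            continue
        elif mode == "output":         # capture the drop on a reject link
            rejects.append(rec)
        elif mode == "fail":           # abort the whole job
            raise ValueError("job aborted: bad record %r" % rec)
    return good, rejects

data = ["100,sam", "badrecord", "200,ram"]
print(read_with_reject_mode(data, "continue"))  # good rows only, no rejects kept
print(read_with_reject_mode(data, "output"))    # good rows plus the captured reject
```

With `mode="fail"` the first mismatched record raises, which is the aborted-job case.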
 First Line is Column Names: true/false.
o If false, the first line is also treated as a data record (and may show up as a dropped record).
o If true, the first record is read as column names and is not dropped as data.

 Missing File Mode: used when a given file name is missing.
Two options
o OK: drops the missing file name and continues.
o Error: aborts the job if a file name is missing.

 File Name Column: "source information at the target". It adds a new column to the existing table that records, for each record, which source file (its address on the local server) it came from, and displays it in that column.


 Row Number Column: "source record number at the target". It also adds a new column to the existing table, giving each record's row number in the source, and displays it in that column.
 Read First Rows: "will get you the top n rows".
o The option asks for an n value and reads only the first n records.

 Filter: "blocking unwanted data based on UNIX filter commands"
 such as grep, egrep, ... and so on.
 Examples:
o grep "moon" \\ case sensitive: it passes only records containing moon in exactly that case.
o grep -i "moon" \\ it ignores case: it passes all records containing moon in any case.
o grep -w "moon" \\ it passes only records containing moon as a whole word (exact match).
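The three grep variants can be mimicked in plain Python to see what each one passes through (the sample records are made up for illustration):

```python
import re

records = ["moon walk", "Moonlight", "MOON", "moons"]

case_sensitive = [r for r in records if "moon" in r]            # grep "moon"
ignore_case    = [r for r in records if "moon" in r.lower()]    # grep -i "moon"
whole_word     = [r for r in records if re.search(r"\bmoon\b", r)]  # grep -w "moon"

print(case_sensitive)  # ['moon walk', 'moons']
print(ignore_case)     # all four records
print(whole_word)      # ['moon walk']
```

Note how "moons" passes the plain substring match but fails the whole-word match.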

 Read from Multiple Nodes: with this option we can read the data in parallel using the sequential stage.
 Reading in parallel is possible.
 Loading in parallel is not possible.

Limitations of the sequential file stage:
o It processes the data sequentially by default.
o Memory limit of 2 GB (.txt format).
o A further problem with sequential files is the conversions:
 like ASCII - NF - ASCII - NF on every read and write.
o Its data lands or resides "outside the boundary" of DataStage.


DAY 20
General settings DataStage and about Data Set

Default setting for starting up with a parallel job:
- Tools
o Options
 Select a default
 "Create new": it asks which type of job you want.
- The types of jobs are mainframe/parallel/sequential/server.
- After the setting above, when we restart DS Designer it goes directly to the designer canvas.

 According to naming standards, every stage has to be named.
o Naming a stage is simple: just right-click on a stage, the rename option is visible,
and name the stage as per the naming standards.

General Stage:
Some stages in this category are used for commenting on what a stage does or how it
behaves, i.e., simply giving comments for a stage.

 Let us discuss Annotation & Description Annotation:
- Annotation: it is for a stage comment.
- Description Annotation: it is used for the job title (only one can be kept per job).

Parallel Capable of 3 jobs:


Data Set (DS): "It is a file stage, and it is used for staging the data when we design dependent jobs."
 By default the Data Set sends and processes the data in parallel.
 In a Data Set the data lands in "Native Format", the format DataStage understands internally; the in-flight form is also called a Virtual Dataset.
 The Data Set overcomes the limitations of the sequential file stage, for better performance.

Q: In which format is the data sent between the source file and the target file?
A: If we send a .txt file from the source, it is in ASCII format, because a .txt file supports only ASCII and DataStage supports only Native Format (NF).
. The source needs to import an operator: when we convert ASCII code into NF, the ASCII code is converted into Native Format, which is understandable to DataStage.
. The target also needs an operator: at the target, the NF code is converted back into ASCII, in .txt format, visible to the user/client.
. The data then resides in (lands into) the local server / remote repository / database.

Q: How does the Data Set overcome the sequential file limitations?

A:
. The Data Set can read only Native Format files, so no conversion is needed: the data directly resides in Native Format (DataStage reads only the Orchestrate format), which makes the conversion step trivial.
. It can hold more than 2 GB.
. The data lands in the DataStage repository.
. The Data Set extension is *.ds, for example trg_f.ds.
. We can reuse the saved file name and structure of the target in another job: we copy the "trg_f.ds" file name, and we must also save the structure of trg_f.ds (e.g., saving the structure as "st_trg") for reuse.

DAY 21
Types of Data Set (DS)

Two types of Data Set, they are
 Virtual (temporary): the data moving in the link from one stage to another, i.e., the link holds the data temporarily.
 Persistent (permanent): the data coming off the link lands directly into the repository; that data is permanent.

Aliases of Data Set:
o ORCHESTRATE FILE
o OS FILE

Q: How many files are created internally when we create a Data Set?
A: A Data Set is not a single file; it creates multiple files internally:
o Descriptor file: it contains the schema details and the address of the data.
o Data file: consists of the data in Native Format and resides in the DataStage repository.
o Control file and Header file: these reside in the operating system, acting as the interface between the descriptor file and the data file.

 The data files are physical files, i.e., stored on the local drive/local server:
they are stored permanently under the install program files, e.g. c:\ibm\...\server\dataset ("pools").

Q: How can we organize a Data Set (view/copy/delete, etc.) in real time?
A: Case 1: we can't directly delete the Data Set. Case 2: we can't directly see or view it.
 The Data Set is organized using utilities:
o Using the GUI, i.e., the utility in Tools (Dataset Management):
 Tools -> Dataset Management -> select the file, e.g.: dataset.ds
o Then we will see the general information of the dataset:
 Schema window  Data window  Copy window  Delete window
 At the command line, we have to start with $orchadmin:
o $orchadmin rm dataset.ds \\ this command removes a file (this is the correct process)
o $rm dataset.ds \\ cannot write it like this (this is the wrong process)
o $ds records \\ to view files in a folder

Dataset Version:
. A Dataset has version control: there is a Dataset version for each different DataStage version.
. The default in version 8 is that it saves in version 4.1, i.e., v41.

Q: How do we perform version control at run time?
A: We set an environment variable for this.
 Navigation for how to set the environment variable:
 Job properties
o Parameters
 Add environment variable
. Compile
o Dataset version ($APT_WRITE_DS_VERSION)
 Click on that.
 After doing this, when we want to save the job, it will ask which version you want.

Q: Which operator is associated with the Dataset?
A: The Dataset doesn't have any operator of its own; it uses the copy operator as its operator.

DAY 22
File Set & Sequential File (SF) input properties

File Set (FS): "It is also for staging the data." It is the same kind of file stage, used to design dependent jobs.
. Data Set & File Set are much the same, but have minor differences.
. The differences between DS & FS are shown below:

Data Set | File Set
 Parallel extendable capabilities |  Parallel extendable capabilities
 More than 2 GB limit |  More than 2 GB limit
 NO REJECT link with the Dataset |  REJECT LINK within the File Set
 DS is exclusively for internal use in the DataStage environment |  External applications can create/use the FS
 Copy (file name) operator |  Import / Export operators
 Native format |  Binary format
 .ds file extension |  .fs extension
 Multiple segments |  Multiple files

. But the Data Set has better performance than the File Set.

Sequential File Stage: input properties
. Setting the input properties at the target file; at the target there are four properties:
1. File update mode
2. Cleanup on failure
3. First line is column names
4. Reject mode

File Update Mode: has three options - append / create (error if exists) / overwrite.
o Append: when multiple files (or a single file) are sent to the sequential target, it appends one file after another into a single file.
o Create (error if exists): just creates the file, erroring if it already exists or is given wrong.
o Overwrite: it overwrites one file with another file.
 Setting the value at run time (for file update mode):
o Job properties
 Parameters
. Add environment variables
o Parallel
 Automatically overwrite ($APT_CLOBBER_OUTPUT)

Cleanup on Failure: has two options - true/false. It works only when "file update mode" is equal to append.
 True - partially written records are cleaned up when the job fails.
 False - it simply appends or overwrites the records, leaving any partial records.

First Line is Column Names: has two options - true/false.
 True - it enables the first row to be treated as the column-name fields.
 False - it simply reads every row, including the first, as a record.
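A sketch of the three file-update modes using ordinary file I/O (plain Python, not DataStage; the path is a temporary stand-in for the real target file):

```python
import os
import tempfile

def write_target(path, rows, mode):
    # "create (error if exists)" refuses to replace an existing target
    if mode == "create" and os.path.exists(path):
        raise FileExistsError("error if exists: %s" % path)
    flag = "a" if mode == "append" else "w"   # append adds; overwrite/create replace
    with open(path, flag) as f:
        f.writelines(r + "\n" for r in rows)

path = os.path.join(tempfile.mkdtemp(), "trg.txt")
write_target(path, ["r1"], "overwrite")   # replaces (or creates) the target
write_target(path, ["r2"], "append")      # adds after the existing records
print(open(path).read().split())          # ['r1', 'r2']
```

A third call with mode "create" on the same path would raise, which mirrors the "error if exists" option.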

Reject Mode: the reject mode here is the same as in the output properties we discussed before.
In this we have three options - continue/fail/output:
 Continue - it just drops records whose format/condition/data type mismatch, and continues processing the remaining records.
 Fail - it just aborts the job when a format/condition/data type mismatch is found.
 Output - it captures the dropped records.

DAY 23
Development & Debug Stage

The development and debug category has three sub-groups, they are:
1. Stages that generate data:
a. Row Generator
b. Column Generator
2. Stages used to pick sample data:
a. Head
b. Tail
c. Sample
3. The stage that helps in debugging:
a. Peek
 Simply put, in development and debug we have 6 stages, divided into the three categories shown above.

23.1. Stages that Generate Data:
Row Generator: "it has only one output"

. The Row Generator is for generating sample data; it is used in some cases, such as:
o When the client is unable to give the data.
o For doing testing.
o To make a job design simple, as a stand-in for job sources.
. The Row Generator can generate junk data automatically by considering the data type, or we can manually set some related, understandable data by giving user-defined values.

Row Generator design, as below:
ROW Generator -> DS_TRG

Navigation for the Row Generator:
. Open the RG properties
. Properties
o Number of records = XXX (user-defined value)
. Column
o Load the structure (metadata) if it exists, or we can type it there.
. In this stage there is only one main property, plus selecting a structure for creating the junk data.
. For example, n = 30: data is generated for the 30 records, the junk data generated considering each column's data type.

Q: How do we generate user-defined values instead of junk data?
A: First we must go to the RG properties:
. Column
o Double-click the serial number, or press Ctrl+E
 Generator

Main purpose of column generator to group a table as one.  Type = cycle/random (it is integer data type)  In integer data type we have three option Under cycle type: There are three types of cycle generated data Increment. A: it is going to generate the junk data for random values. Q: when we select seed=XX. Q: when we select limit=20? A: it is going to generate up to limit number in a cycle form. and limit. seed. Kalyan-9703114894 . Q: when we select increment=45? A: it going to generate a cycle value of from 45 and after adds every number with 45. Initial value. otherwise generate values between 0 and +limit. Q: when we select limit=20? A: it going to generate random value up to limit=20 and continues if more than 20 rows. Q: when we select initial value=30? A: it starts from 30 only. Under Random type: There are three types of random generated data – limit. Column Generator Data: “it having the one input and one output” . Q: when we select signed? A: it going to generate signed values for the field (values between –limit and +limit). and signed.

. By using the Column Generator we add an extra column; for the added column, junk data is generated in the output.
. The junk data is generated automatically for the extra added columns; alternatively we can generate some meaningful data for the extra columns manually.
. Design: Sequential File -> Column Generator -> DataSet
. To open the properties, just double-click on the stage.

Navigation:
. Stage
o Options
 Column to generate = ?
 And so on, we can add up to whatever is required.
. Output
o Mapping
 After adding the extra column it will be visible here; mapping means simply dragging the created column into the existing table on the right side. Mapping should be done in the Column Generator properties.
. Column
o We can change the data type as you require.
. Navigation for manual (meaningful) data:
o Column
 Ctrl+E
 Generator

DAY 24
Pick sample data & Peek

24.1. Pick sample data: "it is a debug group; there are three stages for picking sample data":
. Head
. Tail
. Sample

 Head: "it reads the top 'n' records of every partition."
o It has one input and one output.
o In the Head stage, mapping must and should be done.
SF_SRC -> HEAD -> DS_TRG
 Properties of Head:
o Rows

 All rows (after skip) = false
. Copies all rows to the output, following any requested skip positioning.
 Number of rows (per partition) = XX
. Copies that number of rows from input to output, per partition.
o Partitions
 All partitions = true
. True: copies rows from all partitions.
. False: copies from specific partition numbers, which must be specified.
o Mainly, we must give the value for "number of rows to display".

 Tail: "it is a debug stage that can read the bottom 'n' rows from every partition."
o The Tail stage has one input and one output.
o In this stage mapping must and should also be done, in the Tail output properties.
SF_SRC -> TAIL_F -> DS_TRG
 Properties of Tail:
o The properties of Head and Tail are similar, as shown above.

 Sample: "it is also a debug stage, consisting of period and percentage modes."
o Period: when operating in this mode it supports one input and one output.
o Percentage: when operating in this mode it supports one input and multiple outputs.
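Head and Tail semantics, top/bottom n records per partition, in a plain-Python sketch (the two partitions of numbers are made-up sample data):

```python
def head(partitions, n):
    # top n records of *every* partition
    return [p[:n] for p in partitions]

def tail(partitions, n):
    # bottom n records of *every* partition
    return [p[-n:] for p in partitions]

parts = [[1, 2, 3, 4], [5, 6, 7, 8]]      # two partitions of data
print(head(parts, 2))  # [[1, 2], [5, 6]]
print(tail(parts, 2))  # [[3, 4], [7, 8]]
```

This is why Head/Tail output depends on how the data was partitioned, not just on the source order.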

SF_SRC -> SAMPLE -> DS_TRG

 Period: if I have some records in the source table and we give a period value 'n', it displays or retrieves every nth record from the source table.
 Skip: it skips records from the given source table before sampling starts.
 Percentage: it reads from one input to multiple outputs.
o Coming to the properties:
 Options
. Percentage = 50, and we must set target = 0
. Percentage = 25, and we must set target = 1
. Percentage = 15, and we must set target = 2
o The target number we set here is called the link order.
o Link Order: it specifies to which output the specific share of data has to be sent.
o Mapping: it should be done for each of the multiple outputs.

Target1 / Target2 / ...
SF_SRC -> SAMPLE
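A plain-Python sketch of the Sample stage's two modes (period and percentage); the contiguous-slice split used for the percentage mode is a simplification of however the real stage deals records out:

```python
def sample_period(records, n):
    # keep every nth record (1-based), like period = n
    return [r for i, r in enumerate(records, start=1) if i % n == 0]

def sample_percent(records, percentages):
    # split one input across outputs sized by percentage of the input
    outputs, start = [], 0
    for pct in percentages:           # e.g. [50, 25, 15] must total <= 100
        count = len(records) * pct // 100
        outputs.append(records[start:start + count])
        start += count
    return outputs

rows = list(range(1, 21))                 # 20 source records
print(sample_period(rows, 5))             # [5, 10, 15, 20]
print([len(o) for o in sample_percent(rows, [50, 25, 15])])  # [10, 5, 3]
```

The position of each percentage in the list plays the role of the link order above.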

o In percentage mode it distributes the data in percentage form.

NOTE: the sum of the percentages of all outputs must be less than or equal to (<=) 100% of the input records. When Sample receives, say, 90% of the data from the source, it considers that 90% as 100% and distributes it as we specify.

24.2. PEEK: "it is a debug stage, and it helps in debugging."
SF_SRC -> PEEK
It is used in three ways, they are:
1. Sending the data into the logs.
2. Copying the data from a source to multiple outputs (acting as a copy stage).
3. As a stub stage (place holder).

Q: How do we send the data into the logs?
 Open the Peek stage properties, and we must assign:
o Number of rows = value?
o Peek record output mode = job log, and so on, as per the options.

 For seeing the log records that we stored:
o In DS Director
 From Peek - log - peek: we see the 'n' values of records and fields.
o If we put column name = false, it doesn't show the column names in the log.

Q: When does Peek act as a copy stage?
A: When a sequential file can't itself send the data to multiple outputs, a Peek in between copies the data from the source to the multiple outputs; at that time the Peek acts as a copy stage.

Q: What is a Stub Stage?
A: A Stub Stage is a place holder: it holds the output data temporarily, loading into one output. It can also send the rejected data to another file, because in some situations a client requires only the dropped data.

DAY 25
Database Stages

In this category we generally use Oracle Enterprise, ODBC Enterprise, Teradata with ODBC, Dynamic RDBMS, and so on.

25.1. Oracle Enterprise: "Oracle Enterprise is a database stage; it reads tables from the Oracle database, from source to target."
o Oracle Enterprise can read multiple tables from the database, but it loads into one output.
 Design: Oracle Enterprise -> Data Set
o Properties of Oracle Enterprise (OE):
 Read Method has four options

 Table \\ giving the table name here
 User Defined \\ here we give a user-defined SQL query
 Auto Generated \\ it generates the query automatically
 SQL Builder \\ a new concept in v8 compared with v7

 If we select the Table option:
 Table = "<table name>"
 Connection
 Password = *****
 User = scott
 Remote server = oracle

o Navigation for how the metadata is loaded into the columns:
 If the table definition is already present in the plug-in:
 Select the Load option in Column
 Go to the Table Definitions
 Then to Plug-in
 Load the EMP table from there.
 If the table is not there in the plug-in:
 Select the Load option in Column
 Then we go to Import
 Import "meta data definition"
o Select the related plug-in
 Oracle
 User id: scott
 Password: tiger
 After loading, select the specific table and import it.

Q: A table contains 300 fields, and I need only 100 fields from it?
A: In the read method we use a User-Defined SQL query to solve this problem, by writing a query that reads only the 100 required fields.
 Note: after importing into Column, in the definitions we must change the hire-date data type to "Timestamp".

Q: What can we do when we don't know how to write a select command?
A: Select Read method = SQL Builder.
 After selecting the SQL Builder option from the read method:
o Oracle 10g
o From there, drag in whichever table you want
o And select columns by double-clicking in the dragged table
 There we can also select whatever conditions we need.
 It is totally automated.
 Alternatively, with the Auto Generated read method we can auto-generate the query and then reuse it by copying the query statement into a User-Defined SQL.

Data connection: its main purpose is reusing saved properties.
Q: How do we reuse the saved properties?
A: Navigation for how to save and reuse the properties:
 Open the OE properties
o Select Stage
 Data connection
 There, load a saved data connection
o naveen_dbc \\ this is a saved data connection
o Save it in the table definition.

NOTE: in version 7.5.x2 we don't have saving and reusing of properties.

DAY 26
ODBC Enterprise

ODBC Enterprise is a database stage.

About ODBC Enterprise:
 Oracle needs plug-ins to connect to DataStage: when DataStage version 7 was released, Oracle 9i provided the drivers to use.
 But ODBC needs OS drivers to hit Oracle, i.e., to connect to the Oracle database.
 When it comes to the connection, Oracle Enterprise connects directly to the Oracle database:

Oracle Enterprise -> directly hitting -> ORACLE DB
ODBC Enterprise -> OS drivers -> ORACLE DB

(ODBC Enterprise uses OS drivers to hit the Oracle DB.)

 Differences between Oracle Enterprise (OE) and ODBC Enterprise:

OE | ODBCE
 Version dependent |  Version independent
 Good performance |  Poor performance
 Specific to Oracle |  For multiple databases
 Uses plug-ins |  Uses OS drivers
 No rejects at source |  Rejects at SRC & TRG

Q: How do we connect to a database using ODBC?
ODBCE -> Data Set
First step: open the properties of ODBCE:
 Read method = table
o Table = EMP
 Connection
o Data Source = WHR \\ WHR means the name of the ODBC driver (DSN)
o Password = ******
o User = scott

Creating the WHR ODBC driver at OS level:
  o Administrative tools -> ODBC -> Add
  o MS ODBC for Oracle: give the name WHR, provide user name = Scott and server = tiger.

Connecting through ODBCE at OS level is a lengthy process; to overcome this, the ODBC Connector was introduced.

Differences between ODBCE and the ODBC Connector:

  ODBCE                                   ODBC Connector
  - It cannot list the Data Source        - It provides the list of ODBC DSNs.
    Names (DSN).
  - No testing of the connection.         - The connection can be tested with
                                            the test button.
  - Reads sequentially and loads in       - Reads and loads in parallel
    parallel (poor performance).            (good performance).

  - Using the ODBC Connector is quicker compared with ODBCE.
  - The best feature of the ODBC Connector is "schema reconciliation": it automatically handles data type mismatches between the source data types and the DataStage data types.

Properties of the ODBC Connector:
  o Data Source Name DSN = WHR
  o User name = Scott

  o Password = *****
  o SQL query

26.1. MS Excel with ODBCE:
  - The first step is to create an MS Excel file, called a "workbook", which is used as a database; it can hold 'n' sheets. For example, a CUST workbook is created.
Q: How to read an Excel workbook with ODBCE?
A: Open the properties of ODBCE:
  - Read method = table
    o Table = "empl$" \\ when reading from Excel, the name must be in double quotes and end with the $ symbol
  - Connections
    o DSN = EXE
    o Password = *****
    o User = xxxxx
  - Column
    o Load -> Import ODBC table definitions -> DSN \\ select the workbook here
    o Give the user id & password
    o Filter \\ enable by clicking "include system tables"
    o Select which tables you need & OK
  - In the Operating System:

    o Add in ODBC -> MS Excel drivers -> Name = EXE \\ it is the DSN
Q: How do you read Excel format with a Sequential File?
A: By changing the CUST Excel file into CUST.csv.

26.2. Teradata with ODBCE:
  - Teradata (from Teradata Corporation) is, like Oracle, used as a database.
Q: How to read Teradata with ODBC?
A: We must start the Teradata connection (by clicking its shortcut).
  o In the OS we must also start it: Start -> Control Panel -> Administrative Tools -> Services -> Teradata DB initiator \\ must be started here
  o Add a DSN in the ODBC drivers: select Teradata in the add list and provide the details below:
      User id = tduser
      Password = tduser
      Server: 127.0.0.1
  o After these things, open the properties of ODBCE:
      Read method = table; Table = financial.customer
      Connections: DSN = tduser, Uid = tduser, Pwd = tduser

  o Column -> Load -> Import -> Table definitions\plug-in\teradata
      Server: 127.0.0.1, Uid = tduser, Pwd = tduser
  o After all this navigation, at last we can view the data which we have loaded in the source.

DAY 27
Dynamic RDBMS and Processing Stages

27.1. Dynamic RDBMS: "It is a database stage; it is also called DRS."
  - It supports multiple inputs and multiple outputs.

e. but we can’t load into multiple files..  We can solve this problem with DRS that we can read multiple files and load in to multiple files. oracle o Oracle  Scott \\ for authentication  Tiger o At output  Ln_EMP_Data \\ set emp table here  And Ln_DEPT_Data \\ set dept table here o Column  Load  Meta data for table EMP & DEPT. Some of data base stages:  IWay can use in source only to set in output properties.  Netezza can use in target only to set in input properties. DRS Ln_DEPT_Data Data Set  It all most common properties of oracle enterprise.  Coming to DRS properties o Select db type i. Kalyan-9703114894 .  In oracle enterprise we can read multiple files.

27.2. Processing Stages: There are 28 processing stages, but generally we use 10 of them, and those 10 stages are very important. They are:
  1. Transformer
  2. Look Up
  3. Join
  4. Copy
  5. Funnel
  6. Remove Duplicates
  7. Slowly Changing Dimension
  8. Modify
  9. Sort
  10. Surrogate Key

27.3. Transformer Stage:
A simple query that we solve by using the transformer:
Q: Calculate the salary plus commission of an employee from the EMP table.

Design: Oracle Enterprise -> Transformer -> Data Set
Here, the connection is set in Oracle Enterprise; the source fields and structure become available once the metadata is loaded into the columns and the mapping is done. The derivations are written in the Transformer.
  - The Transformer Stage is an "all in one" stage.
  - Properties of the Transformer Stage:
    o For the above question we must create a column to hold the result: at the bottom, in the output properties, click in an empty position and add a column; name that column NETSAL.
    o Double-click on NETSAL and write the derivation: IN.SAL + IN.COMM \\ we can write it by right-clicking there; it is visible under input column \ function \ and so on.
    o With this derivation, when we execute, the records with null commission are dropped and only the remaining records are sent to the target.
    o To send the null-value records to the target as well, use: IN.SAL + NullToZero(IN.COMM).

Q: NETSAL = SAL + COMM. Logic: if NETSAL > 2000 then TakeHome = NETSAL - 200 else TakeHome = NETSAL + 200. How do we include this logic in a derivation?
A: Add a THome column in the output properties; in the THome derivation part we include this logic:

  o If (IN.SAL + NullToZero(IN.COMM)) > 2000
      Then (IN.SAL + NullToZero(IN.COMM)) - 200
      Else (IN.SAL + NullToZero(IN.COMM)) + 200
  o With this logic the expression is evaluated repeatedly, so it takes more time on huge record sets; the best way to overcome this problem is a Stage Variable.

Stage Variable: "it is a temporary variable which holds its value until the process completes and which is not sent to the output".
  - The stage variable option is shown in the toolbar of the transformer properties; after clicking it, it becomes visible in the input properties.
  - In the stage variables we must add a column, for example NS (integer, length 4, default 0).
  - After adding the NS column, give it the derivation: NS = IN.SAL + NullToZero(IN.COMM).
  - Then add these derivations to the created output columns:
    o NETSAL = NS
    o THome = if (NS > 2000) then (NS - 200) else (NS + 200)

DAY 28
Transformer Functions-I

Examples of Transformer Functions:
  1. Left Function
  2. Right Function
  3. Substring Function
  4. Concatenate Function
  5. Field Function
  6. Constraints (Filter)
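The NETSAL / TakeHome logic above can be mirrored in plain Python as a sketch. NullToZero and stage variables are DataStage features; the helpers below only imitate their behavior, computing NS once instead of repeating the expression in every derivation:

```python
def null_to_zero(x):
    # DataStage NullToZero equivalent: treat NULL (None) as 0
    return 0 if x is None else x

def take_home(sal, comm):
    # Stage-variable style: compute NS once, then reuse it in the logic
    ns = sal + null_to_zero(comm)
    return ns - 200 if ns > 2000 else ns + 200

print(take_home(2500, None))  # NS = 2500 -> 2300
print(take_home(1500, 300))   # NS = 1800 -> 2000
```

The point of the stage variable is exactly this factoring: the shared sub-expression is evaluated once per row rather than once per output column.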

For example, take the word MINDQUEST; from that word we need only QUE:
  - Right Function: R(L(7), 3) \\ Left 7 gives MINDQUE, Right 3 of that gives QUE
  - Left Function: L(R(5), 3) \\ Right 5 gives QUEST, Left 3 of that gives QUE
  - Substring: SST(5, 3) \\ 3 characters starting at position 5

Differences between the Basic Transformer and the Parallel Transformer:

  Basic Transformer                        Parallel Transformer
  - It affects performance.                - It doesn't affect performance,
                                             but it affects compile time.
  - It can call routines written in the    - It supports a wide range of
    BASIC language or shell scripting.       languages.
  - It can only execute up to SMP.         - It can execute on any platform.

Filter: DataStage filters in 3 different ways:
  1. Source level
  2. Stages (filter, switch, external filter)
  3. Constraints (transformer, lookup)

Constraints: "In the transformer, constraints are used as a filter; a constraint is also called a filter."
Q: How is a constraint used in the Transformer?
A: In the transformer properties we will see a constraints row on the output link; we can write the derivation there by double-clicking.
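The MINDQUEST example can be checked with ordinary string slicing. These are Python stand-ins for the DataStage Left/Right/Substring functions, assuming the 1-based positions used in the notes:

```python
word = "MINDQUEST"

# Right(Left(word, 7), 3): "MINDQUE" -> "QUE"
print(word[:7][-3:])
# Left(Right(word, 5), 3): "QUEST" -> "QUE"
print(word[-5:][:3])
# Substring(5, 3): 3 characters starting at 1-based position 5
print(word[4:7])
```

All three routes extract the same QUE, which is the point of the exercise: several function combinations can isolate the same slice.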

NOTE: The Tx is very sensitive with respect to data types; the source and target cannot have different data types.

Q: How can the file below be read, and operations performed like filtering, separating with the left, right, and substring functions, and displaying the date like DD-MM-YYYY?

File.txt
HINVC23409CID45432120080203DOL
TPID5650 5 8261.99
TPID5655 4 2861.57
TPID5635 6 9564.96
HINVC12304CID46762120080304EUO
TPID5640 3 5234.28
HINVC43205CID67632120080405EUO
TPID5630 8 1657.69
TPID5657 7 6218.13
TPID5637 1 2343.13
TPID5645 2 7855.00
TPID5657 9 7452.67

Design: SF -> (IN1) Tx1 -> (IN2) Tx2 -> (IN3) Tx3 -> (OUT) DS

A total of five steps are needed to solve the given question:

Step 1: Load File.txt into a Sequential File. In the properties of the sequential file, load the whole data into one record: create a single column called REC; there is no need to load metadata for this.

Step 2: IN1 Tx properties. In this step we filter the "H" starting records from the given file and create two columns, TYPE and DATA:

  Constraint: Left(IN1.REC, 1) = "H"

  Derivation          Column Name
  Left(IN1.REC, 1)    TYPE
  IN1.REC             DATA

Step 3: IN2 Tx properties. Here we create four columns and separate the data into them as per the created columns:

  Derivation                       Column Name
  Left(Right(IN2.DATA, 29), 9)     INVCNO      \\ e.g. INVC23409
  IN2.DATA[11, 9]                  CID         \\ e.g. CID454321
  IN2.DATA[20, 8]                  BILL_DATE   \\ e.g. 20080203
  Right(IN2.DATA, 3)               CURR        \\ e.g. DOL

Step 4: IN3 Tx properties. Here the BILL_DATE column is changed into DD-MM-YYYY format using stage variables:

  Stage Variable Derivation           Name
  Left(IN3.BILL_DATE, 4)              Y
  Right(Left(IN3.BILL_DATE, 6), 2)    M
  Right(IN3.BILL_DATE, 2)             D

  OUT Derivation           Column Name
  IN3.INVCNO               INVCNO
  IN3.CID                  CID
  D : '-' : M : '-' : Y    BILL_DATE
  IN3.CURR                 CURR

Step 5: Here we set the output file name in the target, for displaying the BILL_DATE in the new format.

DAY 29
Transformer Functions-II

Examples of Transformer Functions II:
  1. Trim: "it removes all special characters".
  2. Trim B: "it removes all trailing (after) spaces".
  3. Trim F: "it removes all leading (before) spaces".
  4. Trim T & L: "it removes all trailing and leading spaces".
  5. Strip White Spaces: "it removes all spaces".
  6. Field Function: "it separates the fields using delimiter support".
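The Step 4 date handling can be sketched outside DataStage. A small Python stand-in for the D/M/Y stage variables and the D:'-':M:'-':Y concatenation:

```python
def to_dd_mm_yyyy(bill_date):
    # Stage-variable style split of a "YYYYMMDD" string:
    y = bill_date[:4]         # Left(BILL_DATE, 4)
    m = bill_date[:6][-2:]    # Right(Left(BILL_DATE, 6), 2)
    d = bill_date[-2:]        # Right(BILL_DATE, 2)
    return d + "-" + m + "-" + y   # D:'-':M:'-':Y

print(to_dd_mm_yyyy("20080203"))  # 03-02-2008
```

Each stage variable holds one piece of the date; the output derivation only glues them together with the '-' separators.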

  7. Compact White Spaces: "it removes spaces before, after, and in between".

Q: A file.txt consists of special characters, comma delimiters, and spaces (before, after, and in the middle). How do we solve it with the above functions, so that at last it becomes one record?

File.txt
EID, ENAME, STATE
111, NaVeen, AP
222@, MUnNA, TN
@333, Sra van, KN@
444, @ San DeeP, KN
555, anvesh, MH

Design: SF -> (IN1) Tx -> (IN2) Tx -> (IN3) Tx -> (OUT) DS

A total of five steps solve File.txt using the above functions:

Step 1: Here we extract file.txt and set all the data into one record, in the newly created column REC; no need to load metadata for this.
  - Point to remember: keep "first line is column name" = true.

Step 2: IN1 Tx properties. Here, using the Field function, the REC column is divided into fields by the comma delimiter:

  Derivation                Column Name
  Field(IN1.REC, ',', 1)    EID
  Field(IN1.REC, ',', 2)    ENAME
  Field(IN1.REC, ',', 3)    STATE

Step 3: IN2 Tx properties. Here we remove the special characters and spaces, and change lower case into upper case, by using the Trim, Strip Whitespaces (SWS), and Upcase functions:

  Derivation                               Column Name
  Trim(SWS(IN2.EID), '@', '')              EID
  Upcase(Trim(SWS(IN2.ENAME), '@', ''))    ENAME
  Trim(SWS(IN2.STATE), '@', '')            STATE
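The Field/Trim/StripWhiteSpaces/Upcase combination of Steps 2-3 can be imitated in plain Python. This is a sketch only; the real functions are DataStage derivations:

```python
def clean(rec):
    # Field(rec, ',', n): split the record on the comma delimiter
    eid, ename, state = rec.split(",")
    strip_ws = lambda s: s.replace(" ", "")   # StripWhiteSpaces
    trim_at = lambda s: s.replace("@", "")    # Trim(s, '@', '')
    return (trim_at(strip_ws(eid)),
            trim_at(strip_ws(ename)).upper(), # Upcase for ENAME
            trim_at(strip_ws(state)))

print(clean("@333, Sra van, KN@"))  # ('333', 'SRAVAN', 'KN')
```

Applied to every row of File.txt, this produces the cleaned EID/ENAME/STATE triples that Step 4 then concatenates back into one REC column.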

Step 4: IN3 Tx properties. Here the fields are concatenated back, adding all the columns into one REC:

  Derivation                         Column Name
  IN3.EID : IN3.ENAME : IN3.STATE    REC

Step 5: For the output, here we assign a target file. At last the answer displays as one record per row, with all the special characters and spaces removed after implementing the transformer functions on the above file.txt.

Final output: Trg_file.ds
  REC
  111NAVEEN AP
  222 MUNNATN
  333SRAVAN KN
  444SAN DEEPKN
  555 ANVESHMH

29.1. Re-Structure Stages:
  1. Column Export
  2. Column Import

Column Export: "it is used to combine multiple columns into a single column"; it is like concatenate in the transformer functions.
  - Properties:
    o Input
      - Column method = explicit
      - Column To Export = EID
      - Column To Export = ENAME

      - Column To Export = STATE
    o Output
      - Export column type = "varchar"
      - Export output column = REC

Column Import: "it is used to explode a single column into multiple columns"; it is like the field separator in the transformer functions.
  - Properties:
    o Input
      - Column method = explicit
      - Column To Import = REC
    o Output
      - Import column type = "varchar"
      - Import output column = EID
      - Import output column = ENAME
      - Import output column = STATE

DAY 30
JOB Parameters (Dynamic Binding)

Dynamic Binding: "Compiling the job and then passing the values during the runtime is known as dynamic binding."
  - Assume a scenario where we take an Oracle Enterprise stage: we must provide the table and load its metadata, so the table name must be statically bound. But there is no need for the authentication to Oracle to be statically bound; for security reasons we can use job parameters, which provide the values at runtime to authenticate.

Job parameters: "job parameters are a technique of passing values at runtime; it is also called dynamic binding". Job parameters are divided into two types:
  o Local variables (params): "created in the DS Designer only; they can be used within the job only".
  o Global variables: "also called environment variables". These are of two kinds:
    - User Defined: created in the DataStage Administrator only.
    - Existing: comes with DataStage, in two groups, one general and the other parallel; under the parallel compiler, operator-specific and reporting variables are available.
NOTE: "The local parameters created in one job cannot be reused in another job; this is the case up to version 7 (although in version 7 we can reuse parameters via User Defined values in the DataStage Administrator). Coming to version 8, we can reuse them by a technique called a parameter set."

Q: How do we give runtime values using parameters for the following list?
  a. Runtime values for user ID, password, and remote server?
  b. Department number (DNO) kept as a constraint, with a runtime list of numbers to select and display?
  c. Add BONUS to SAL + COMM at runtime?
  d. Provide the target file name at runtime?
  e. Re-use the globals and a parameter set?

Design:

Design: ORACLE -> Tx -> Data Set

Here, a, b, c, d represent the solutions for the given question.

Step 1: "Creating job parameters for the given question as local variables."
  - Job parameters -> Parameters:

    Name      Display Name  Type       Default value
    UID       USER          String     SCOTT         } a
    PWD       Password      Encrypted  ******        } a
    RS        SERVER        String     ORACLE        } a
    DNO       DEPT          List       10            } b
    BONUS     BONUS         Integer    1000          } c
    IP        DRIVE         String     C:\           } d
    FOLDER    FOLDER        String     Repository\   } d
    TRGFILE   TARGET        String     dataset.ds    } d

Step 2: "Creating global job parameters."
  - DS Administrator -> select a project -> Properties -> General -> Environment variables -> User defined (we can write the parameters there):

    Name   Display Name  Type       Default value
    UID    USER          String     SCOTT
    PWD    Password      Encrypted  ******
    RS     SERVER        String     ORACLE

  - Here, global parameters are preceded by the $ symbol. For re-use,

to use the global variables in a job, we add them in the job parameters:
  o Add environment variables -> User defined:
      UID  $UID
      PWD  $PWD
      RS   $RS

Step 3: "Creating a parameter set for multiple values, providing UID, PWD, and the other values for DEV, PRD, and TEST."
  - In the local-variable job parameters, select multiple values by clicking on them, and create a parameter set.
  - Provide a name to the set, e.g. SUN_ORA, and save it in the Table Definitions.
  - In the table definition, edit the SUN_ORA values to add:

    Name   UID     PWD     SERVER
    DEV    SYSTEM  ******  MOON
    PRD    PRD     ******  SUN
    TEST   TEST    ******  ORACLE

  - For re-using this in another job:
    o Add parameter set (in the job parameters) -> Table definitions -> Navs -> SUN_ORA (select it here to use it).
NOTE: "A parameter set can be used in jobs within the project only."

Step 4: "In the Oracle Enterprise properties, select the table name and then assign the created job parameters as shown below."

  Properties:
    - Read method = table
      o Table = EMP
    - Connection
      o Password = #PWD#
      o User = #UID#
      o Remote Server = #RS#
    - Column
      o Load -> metadata for the EMP table

  To insert a job parameter there, we can pick it from:
    - the local variables: UID, PWD, RS
    - the global environment variables: $UID, $PWD, $RS
    - the parameter set: SUN_ORA.UID, SUN_ORA.PWD, SUN_ORA.RS

Step 5: "In the Tx properties, DEPTNO is used as a constraint, and the bonus is assigned to the BONUS column."

  Stage Variable Derivation       Name
  IN.SAL + NullToZero(IN.COMM)    NS

  Constraint: IN.DEPTNO = DNO

  Derivation     Column Name
  IN.EID         EID
  IN.ENAME       ENAME
  NS             NETSAL
  NS + BONUS     BONUS
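The #NAME# notation used in Step 4 can be illustrated with a small substitution sketch. The helper below is hypothetical, not DataStage's own resolver; it only shows how placeholder names are swapped for runtime values:

```python
def resolve(template, params):
    # Job-parameter style substitution: every #NAME# in the template is
    # replaced with the value supplied at "runtime".
    for name, value in params.items():
        template = template.replace("#" + name + "#", value)
    return template

params = {"IP": "C:\\", "FOLDER": "Repository\\", "TRGFILE": "dataset.ds"}
print(resolve("#IP##FOLDER##TRGFILE#", params))  # C:\Repository\dataset.ds
```

The same mechanism serves #UID#, #PWD#, and #RS# in the connection properties: the compiled job keeps the placeholders and the engine fills them in at run time.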

Here, DNO and BONUS are the job parameters we created above; to use them, simply right-click -> job parameters -> DNO/BONUS (choose what you want).

Step 6: "Setting the target file at runtime; follow the steps below to keep it at runtime."
  - Data set properties:
    o Target file = #IP##FOLDER##TRGFILE#
  - When we run the job, it asks in which drive, in which folder, and at last what target file name you want.

DAY 31
Sort Stage (Processing Stage)

Q: What is sorting?
"Here sorting means more than what we usually know."
Q: Why sort the data?
"To provide sorted data to stages like join / aggregator / merge / remove duplicates, for good performance."

Two types of sorting:

  1. Traditional sorting: "a simple sort, arranging the data in ascending or descending order".
  2. Complex sorting: "it is only for sort stages: creating group ids, blocking unwanted sorting, and group-wise sorting".

In DataStage we can perform sorting at three levels:
  - Source level: "it is only possible in a database".
  - Link level: "it can be used for traditional sort".
  - Stage level: "it can be used for traditional as well as complex sorting".

Q: What is the best level to sort at, when we consider performance?
"Link level sort is the best we can perform."

Source level sort:
  o It can be done only in a database stage, like Oracle Enterprise and so on, and it is used for traditional sorting only.
  o How is it done in Oracle Enterprise (OE)? Go to the OE properties, select user-defined SQL, and write the query: select * from EMP order by DEPTNO.

Link level sort:
  o Here the sorting is done on the link of a stage, as shown pictorially below.
  o Link sort is the best sort in terms of performance.

Design: OE -> (link sort) -> JOIN -> DS

Q: How do we perform a link sort?
"As per the above design, open the JOIN properties":
  - Go to Partitioning:
    o Select a partition technique (the default here is 'auto').
    o Mark "perform sort".
    o When we select "unique", it removes duplicates.
    o When we select "stable", it preserves the stable (original) order of the data.

Sort Stage: "It is a processing stage that can sort the data with a traditional sort or a complex sort."
  - Traditional sort means sorting in ascending or descending order.
  - Complex sort means creating a group id, blocking unwanted sorting, and group-wise sorting for sort-dependent stages like join, merge, aggregator, and remove duplicates.

Q: Get all unique records to target1 and the remaining to target2?
"For this we must create a group id. It is done in the stage called the sort stage: in its properties, under options, keep Create Key Change Column (CKCC) = 'true' (the default is false). Here we must select the column on which we want the group id; it indicates the group identification."

Sort Properties:
 Input properties
o Sorting key = EID (select the column from source table)
o Key mode = sort (sort / don't sort (previously sorted) / don't sort (previously grouped))
o Options
 Create cluster key change column = false (true/ false)
 Create key change column = (true/ false)
 True = enables group id.
 False = disables the group id.
 Output properties
o Mapping should be done here.
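The effect of CKCC = true can be emulated in a few lines of Python as a sketch: after sorting, the first row of each key group gets keychange = 1 and the rest get 0:

```python
def key_change_column(rows, key):
    # Emulates CKCC = true: after sorting on the key, the first row of
    # each group is marked with keychange = 1, the rest with 0.
    out, prev = [], object()   # sentinel that matches no real key value
    for row in sorted(rows, key=lambda r: r[key]):
        out.append({**row, "keychange": 1 if row[key] != prev else 0})
        prev = row[key]
    return out

rows = [{"EID": 111}, {"EID": 222}, {"EID": 111}]
print([r["keychange"] for r in key_change_column(rows, "EID")])  # [1, 0, 1]
```

Downstream stages can then filter on keychange = 1 (unique representatives) versus keychange = 0 (duplicates), which is exactly how the group id is used in the jobs that follow.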

DAY 32
A Transformer & Sort stage job

Q: Sort the given file, extract all the addresses of each unique record into one column, and put the count
of the addresses into a new column.
File.txt 111, munna, savings
333, naveen, loans
222, kumar, credit
111,munna, current
222, kumar, loans
111, munna, insurance
333, naveen, current
111, munna, loans
222, kumar, savings


Design: SF -> Sort1 -> Tx -> Sort2 -> DS

 Sequential File (SF): here we read file.txt for the process.
 Sort1: here the sorting key = EID,
 and CKCC is enabled for the group id (the keychange column).

 Transformer (Tx): here the logic for the target is implemented.
o Properties of Tx:

  Stage Variable Derivation                                                Name
  if (IN2.keychange = 1) then IN2.ACCTYPE else func1 : ',' : IN2.ACCTYPE   func1
  if (IN2.keychange = 1) then 1 else c + 1                                 c

  OUT Derivation    Column Name
  IN2.EID           EID
  IN2.ENAME         ENAME
  func1             ACCTYPE
  c                 COUNT

 With this logic, the output displays as below:


111, munna, savings 1
111,munna, savings, current 2
111, munna, savings, current, insurance 3
111, munna, savings, current, insurance, loans 4
222, kumar, credit 1
222, kumar, credit ,loans 2
222, kumar, credit ,loans, savings 3
333, naveen, current 1
333, naveen, current, loans 2
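The whole accumulation can be emulated in Python as a sketch. Note that the order of addresses inside a group simply follows the input order after the stable sort, so it may differ slightly from the listing above:

```python
records = [
    (111, "munna", "savings"), (333, "naveen", "loans"), (222, "kumar", "credit"),
    (111, "munna", "current"), (222, "kumar", "loans"), (111, "munna", "insurance"),
    (333, "naveen", "current"), (111, "munna", "loans"), (222, "kumar", "savings"),
]

# Sort1 + Tx: after sorting on EID, the keychange logic starts a fresh
# address list on the first row of each group and appends afterwards.
acc = {}
for eid, name, acct in sorted(records, key=lambda r: r[0]):
    if eid not in acc:
        acc[eid] = [name, acct, 1]      # first record of the group
    else:
        acc[eid][1] += ", " + acct      # func1: append the next address
        acc[eid][2] += 1                # c: running count

# Sort2 (descending on count) + unique on EID keeps only the fullest
# row per EID, which is exactly what acc now holds.
for eid in sorted(acc):
    print(eid, *acc[eid])
```

This makes the role of the two stage variables concrete: func1 is the growing address string and c is the running count, both reset whenever keychange flags a new group.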

 Sort2:
o Here, in the properties we must set as below.
 Stage
o Sort key mode = sort
o Sort order = Descending order
 Input
 Partition type: hash


 Data Set (DS):
  o Input: partition type = hash
  o Sorting: perform sort, with Stable (check this) and Unique (check this)
    - Selected key = EID; usage = sorting, partitioning; options = ascending, case sensitive
  o Output: the mapping should be done here.

Final output (EID, ENAME, ACCTYPE, COUNT):
  111, munna, savings, current, insurance, loans   4
  222, kumar, credit, loans, savings               3
  333, naveen, current, loans                      2

DAY 33
FILTER STAGE

Filter means "blocking the unwanted data". In DataStage the filtering can be performed at three levels; they are:
  1. Source level
  2. Stage level

  3. Constraints

 Source Level Filter: "it can be done in a database as well as in a file, at the source level".
  o Database: by writing filter queries like "select * from EMP where DEPTNO = 10".
  o Source file: here we have an option called filter, where we can write filter commands like grep "moon" / grep -i "moon" / grep -w "moon".

 Stage Filter: "stage filtering uses three stages, and they are 1. Filter, 2. Switch, and 3. External Filter".
  o Here the Filter stage is like an IF, and the Switch stage is like a switch: a condition can be performed.
  o Difference between IF and SWITCH:

    IF                                   SWITCH
    - Poor performance.                  - Good performance.
    - It can have 'n' number of cases.   - It can only have 128 cases.
    - IF can use 'n' number of columns   - SWITCH can use only one column
      in the condition.                    in the condition.

  o Differences between the three filter stages:

    FILTER                    SWITCH                    EXTERNAL FILTER
    - Condition on multiple   - Condition on a single   - Filters using GREP
      columns                   column                    commands
    - 1 input                 - 1 input                 - 1 input
    - n outputs               - 128 outputs             - 1 output
    - 1 reject                - 1 reject (default)      - no rejects
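The filter-stage semantics from the comparison above, one output link per where-clause plus a reject link, can be sketched like this (plain Python, not DataStage):

```python
def filter_stage(rows, clauses):
    # One output link per where-clause; a row can satisfy several
    # clauses, and rows matching none go to the reject link.
    outputs = [[] for _ in clauses]
    rejects = []
    for row in rows:
        matched = False
        for link, clause in enumerate(clauses):
            if clause(row):
                outputs[link].append(row)
                matched = True
        if not matched:
            rejects.append(row)
    return outputs, rejects

emp = [{"DEPTNO": 10, "SAL": 2000}, {"DEPTNO": 20, "SAL": 500},
       {"DEPTNO": 30, "SAL": 5000}]
(t1, t2), rej = filter_stage(emp, [lambda r: r["DEPTNO"] == 10,
                                   lambda r: 1000 < r["SAL"] < 3000])
print(len(t1), len(t2), len(rej))  # 1 1 2
```

The first employee lands on both output links because it satisfies both where-clauses, which is the key behavioral difference from a switch, where each row can go to at most one case link.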

Filter stage: "it has one input, n outputs, and one reject link".

Q: How does the filter stage send the data from source to target?

Design: OE -> Filter -> T1 (DS)
                     -> T2 (DS)
                     -> reject link (DS)

Step 1: Connect to Oracle to extract the EMP table from it.
Step 2: Filter properties:
  - Predicates
    o Where clause = DEPTNO = 10

      Output link = 1
    o Where clause = SAL > 1000 and SAL < 3000
      Output link = 2
    o Output rejects = true \\ this is for the reject data; because the filter stage has 'n' outputs it has no built-in reject link, so here we must convert a link into a reject link.
  - Link ordering
    o Set the order of the output links.
  - Output:
    o The mapping should be done for the target links we have; the mapping for T1 and T2 should be done separately.
Step 3: "Assign the target file names in the targets."

DAY 34
Jobs on Filter and properties of Switch stage

Assignment Job 1:
  a. Only DEPTNO 10 to target1?
  b. Records satisfying the condition SAL>1000 and SAL<3000 to target2?
  c. Only DEPTNO 20 with where clause

SAL<1000 or SAL>3000 to target3?
  d. Reject data to target4?

Design for JOB 1:

  EMP_TBL -> Filter -> T1
                    -> T2
          -> Filter -> T3
                    -> T4

Step 1: "For target1: in the filter, the where clause for target1 is DEPTNO=10, link order=0."
Step 2: "For target2: where clause = SAL>1000 and SAL<3000, link order=1."
Step 3: "For target3: where clause = DEPTNO=20, link order=0."
Step 4: "For target4: convert the link into a reject link, output reject link=true."

Job 2:
  a. All records from source to target1?
  b. Only DEPTNO=30 to target2?
  c. SAL<1000 or SAL>3000 to target3?
  d. Reject data to target4?

Design for JOB 2:

Step4: “For target4 convert link into reject link and output reject link=true”. All duplicates records of DEPTNO to target2? c. Job 3: a. Step2: “For target2 where clause = DEPTNO=30 and link order =0”. All records to target3? d. All unique records of DEPTNO to target1? b. Only DEPTNO 10 records to target4? e. Condition SAL>1000 & SAL<3000. but no DEPTNO=10 to target5? Kalyan-9703114894 . T1 Copy EMP_TBL T2 T3 Filter T4 Step1: “For target1 mapping should be done output links for this”. Step3: “For target3 where clause = SAL<1000 and SAL>3000 and link order=1”.

Step4: “For target4: where clause= DEPTNO=10”. Step3: “For target3: mapping should be done output links for this”.default”. 128 – outputs and 1. Step5: “For target5: in filter properties put output rows only once= true for where clause SAL>1000 & SAL<3000”.Design to the JOB3: K=1 T1 Filter K=0 T2 EMP_TBL TT3 T4 Filter T5 Step1: “For target1: where clause = keychange=1 and link order=0”. Step2: “For target2: where clause = keychange=0 and link order=1”. Kalyan-9703114894 . SWITCH Stage: “Condition on single column and it has only 1 – input.

Properties of the Switch stage:
  - Input
    o Selector column = DEPTNO
  - Case values
    o Case = 10 -> link order 0
    o Case = 20 -> link order 1
  - Options
    o If not found = (Drop / Fail / Output)
      - Drop: drops the unmatched data and continues the process.
      - Fail: if any record drops, the job aborts.
      - Output: lets us view the reject data through the reject link.

DAY 35
External Filter and Combining

External Filter: "It is a processing stage which can perform filtering with UNIX commands."
  - Example filter command: grep "newyork".
  - To process a text file, it must first be read as a single record in the input.
  - It has 1 input, 1 output, and 1 reject link.
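A grep-style filter (including grep -v) is easy to imitate, which shows what the external filter does to each record. This is a Python sketch; the real stage shells out to the UNIX grep command:

```python
def grep(lines, pattern, invert=False):
    # grep "newyork" keeps matching lines; grep -v inverts the match
    return [line for line in lines if (pattern in line) != invert]

lines = ["newyork 100", "chicago 200", "newyork 300"]
print(grep(lines, "newyork"))               # ['newyork 100', 'newyork 300']
print(grep(lines, "newyork", invert=True))  # ['chicago 200']
```

Since each record is one text line, the whole stage reduces to a per-line substring (or regular-expression) test.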

Design: Sequential File -> External Filter -> Data Set

 External Filter properties:
  o Filter command = grep "newyork"
  o grep -v "newyork" \\ it filters everything other than "newyork"

Combining: "in DataStage, combining can be done in three ways". They are:
  o Horizontal combining
  o Vertical combining
  o Funneling

DAY 36
Horizontal Combining (HC) and Description of HC stages

Horizontal Combining (HC): "combining the primary rows with secondary rows based on the primary key".
  o This is performed by the JOIN, LOOKUP, and MERGE stages.
  o These three stages differ from each other with respect to:
    - input requirements,
    - treatment of unmatched records, and
    - memory usage.

Example tables:

  EMP (primary)             DEPT (secondary)
  ENO  EName   DNo          DNo  DName  LOC
  111  naveen  10           10   IT     HYD
  222  munna   20           20   SE     SEC
  333  kumar   30           40   SA     WRL

Here we can combine with an inner join, left outer join, right outer join, or full outer join. If T1 = {10, 20, 30} and T2 = {10, 20, 40}:

  - Inner Join: "matched primary and secondary records" -> {10, 20}
  - Left Outer Join: "matched primary & secondary plus unmatched primary records" -> {10, 20, 30}
  - Right Outer Join: "matched primary & secondary plus unmatched secondary records" -> {10, 20, 40}
  - Full Outer Join: "matched primary & secondary plus unmatched primary & unmatched secondary records" -> {10, 20, 30, 40}
  - Selection of the primary table is situation based.

Description of HC stages: "the description of horizontal combining is divided into nine parts".
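The four join types on the key sets T1 and T2 above can be verified with plain set operations:

```python
t1, t2 = {10, 20, 30}, {10, 20, 40}

inner = t1 & t2          # matched primary and secondary keys
left_outer = t1          # matched keys plus unmatched primary keys
right_outer = t2         # matched keys plus unmatched secondary keys
full_outer = t1 | t2     # everything from both sides

print(sorted(inner))       # [10, 20]
print(sorted(full_outer))  # [10, 20, 30, 40]
```

On real rows the unmatched side is padded with nulls, but at the key level the four joins are exactly these set expressions.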

The nine points are:
  o Input names
  o Input / output / rejects
  o Join types
  o Input requirements with respect to sorting
  o De-duplication (removing duplicates)
  o Treatment of unmatched records
  o Memory usage
  o Key column names
  o Types of inner join

The differences between join, lookup, and merge with respect to the above nine points are shown below:

  JOIN                          LOOKUP                         MERGE

  Input names:
  The first SRC is the left     The first link from the        The first table is the master
  table, the last SRC is the    source is the primary/input    table and the remaining
  right table, and all middle   link, and the remaining        tables are update tables.
  SRCs are intermediate         links are lookup/reference
  tables.                       links.

  Input / output / rejects:
  n inputs (inner, LOJ, ROJ);   n inputs (normal);             n inputs;
  2 inputs (FOJ);               2 inputs (sparse);             1 output;
  1 output; no reject.          1 output; 1 reject.            (n - 1) rejects.

  Join types:
  Inner join, left outer        Inner join,                    Inner join,
  join, right outer join,       left outer join.               left outer join.
  full outer join.

:: Input Requirements with respect to sorting ::

  JOIN                         LOOKUP                          MERGE
  Primary: mandatory           Optional                        Mandatory
  Secondary: mandatory         Optional                        Mandatory

:: De-Duplication (removing the duplicates) ::
  Primary: OK (nothing         OK                              Warnings
  happens)
  Secondary: OK                Warnings                        OK

:: Treatment of Unmatched Records ::
  Primary: Drop (inner),       Drop, Target (continue),        Drop,
  Target (left)                Reject (unmatched primary       Target (keep)
                               records)
  Secondary: Drop (inner),     Drop                            Drop,
  Target (right)                                               Reject (unmatched
                                                               secondary records)

:: Memory Usage ::
  Light memory                 Heavy memory                    Light memory

:: Key Column Names ::
  Must be SAME                 Optional (must be same in       Must be SAME
                               case of a lookup file set)

:: Type of Inner Join ::
  ALL                          ALL                             ANY

DAY 37
LOOKUP stage (Processing Stage)

Lookup stage:
  - In real-time projects, 95% of horizontal combining is done by this stage.
  - "The lookup stage is for cross verification of primary records with secondary records."

DataStage version 8 supports four types of LOOKUP; they are:
  o Normal LOOKUP
  o Sparse LOOKUP
  o Range LOOKUP
  o Caseless LOOKUP

For example, a simple job with the EMP and DEPT tables:
  - Primary table EMP with the columns ENO, ENAME, DNO.
  - Reference table DEPT with the columns DNO, DNAME, LOC.

Design: EMP table (primary/input) -> LOOKUP -> Data Set (target)
        DEPT table (reference/lookup) -> LOOKUP

LOOKUP properties for the two tables:
  Primary table: ENO, ENAME, DNO
  Reference table: DNO, DNAME, LOC
  Target: ENO, ENAME, DNO, DNAME

  - The key column is the same for both tables; it can be set by just dragging from the primary table to the reference table's DNO column.

 Normal lookup: "the cross verification of primary records with secondary records is done in memory".
  - The toolbar of the LOOKUP stage has a constraints button; for unmatched primary records, there we have to select one of:
    o Continue: this option gives a Left Outer Join.
    o Drop: this gives an Inner Join.
    o Fail: it aborts the job if there are unmatched primary records.
    o Reject: it captures the unmatched primary records.

 Caseless LOOKUP: during execution the lookup by default acts as case sensitive, but we have an option to remove the case sensitivity, i.e.:
  o Key type = caseless.

DAY 38
Sparse and Range LOOKUP

Sparse LOOKUP:
  - If the source is a database, it supports only two inputs.
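The four lookup-failure options (Continue / Drop / Fail / Reject) can be sketched with an in-memory match, which is what a normal lookup does. The helper names here are hypothetical, not DataStage API:

```python
def lookup(primary, reference, key, on_failure="continue"):
    # Normal lookup: the reference rows are held in memory, keyed for matching.
    ref = {r[key]: r for r in reference}
    matched, rejected = [], []
    for row in primary:
        if row[key] in ref:
            matched.append({**row, **ref[row[key]]})
        elif on_failure == "continue":   # left outer join
            matched.append(dict(row))
        elif on_failure == "fail":
            raise RuntimeError("unmatched primary record: %r" % (row,))
        elif on_failure == "reject":
            rejected.append(row)
        # "drop" -> inner join: unmatched primary rows are discarded
    return matched, rejected

emp = [{"DNO": 10, "ENAME": "naveen"}, {"DNO": 30, "ENAME": "kumar"}]
dept = [{"DNO": 10, "DNAME": "IT"}]
out, rej = lookup(emp, dept, "DNO", on_failure="reject")
print(len(out), len(rej))  # 1 1
```

The in-memory dictionary is also why the notes list lookup as "heavy memory": the whole reference table must fit in memory, unlike join and merge which stream sorted inputs.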

  - Sparse lookup: "the cross verification of primary records with secondary records happens at the source level itself".
  - To set a sparse lookup we must change the key type to sparse, in the reference table only. By default, a Normal LOOKUP is done in the lookup stage.
  - Note: a sparse lookup does not support another reference when the reference is a database. But in ONE case the sparse LOOKUP stage can support 'n' references: by taking a lookup file set.

Job 1: a sequential file extracts a text file and loads it into a lookup file set (lfs).

Design: Sequential File -> Lookup File Set

  - Here, in the lookup file set properties:
    o The column names should be the same as in the sequential file.
    o The target file is stored with the .lfs extension.
    o The address (path) of the target must be saved, to be used in another job.

Job 2: in this job we use the lookup file set for the sparse lookup (LFS, LFS, ... as references).

Design: SF -> LOOKUP -> DS (with the LFS files as references)

  - In the lookup file set, we must paste the address of the above .lfs file.
  - A lookup file set supports 'n' references, which means sparse lookup indirectly supports 'n' references.

Condition for the LOOKUP stage:
  - How do we write a condition in the lookup stage?
    o Go to the toolbar constraint; there we will see the condition box.
    o In the condition, write for example: in.primary = "AP".
    o For multiple links we can write multiple conditions, for 'n' references.

Range LOOKUP:
  - "A range lookup keeps a condition in between the tables."
  - How to set the range lookup: in the LOOKUP properties, select the check box for the column you need the condition on.

The Funnel stage appends the records of one table after another. To perform the funnel, four conditions must be met:
1. The number of columns should be the same.
2. The column names should also be the same.
3. Column names are case sensitive.
4. The data types should be the same.

Funnel operates in three modes:
- Continuous funnel: collects records at random.
- Sequence: collects records based on link order.
- Sort funnel: collects records based on key column values.

Simple example for the funnel stage (through a Copy/Modify stage, the column names of the second table are changed to match the primary table, and the GEN values M and F are exchanged into 1 and 0):

ENO  EN      Loc  GEN        ENO  EN     ADD  GEN
111  naveen  HYD  M          222  munna  SEC  F
                             333  kumar  BAN  M

EMPID  EName  Loc  Country  Company  GEN
444    IT     DEL  INDIA    IBM      1
555    SA     NY   USA      IBM      0
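The "sequence" mode and the four schema conditions can be sketched in a few lines of Python (illustrative only; the sample tables are invented):

```python
# A rough sketch of Funnel "sequence" mode: append records from each source
# in link order. All sources must share identical column names (case sensitive).
def funnel(*sources):
    expected = sorted(sources[0][0].keys())
    out = []
    for src in sources:
        for row in src:
            if sorted(row.keys()) != expected:
                raise ValueError("column mismatch between sources")
            out.append(row)
    return out

t1 = [{"ENO": 111, "EN": "naveen"}]
t2 = [{"ENO": 222, "EN": "munna"}, {"ENO": 333, "EN": "kumar"}]
rows = funnel(t1, t2)
```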

Copy Stage: "It is a processing stage which can be used for":
1. Copying source data to multiple targets.
2. Changing the column names.
3. Dropping the columns.
4. Altering the data.
5. Acting as a stub stage.

Modify Stage: "It is a processing stage which can perform":
1. Changing the column names.
2. Dropping the columns.
3. Keeping the columns.
4. Modifying the data types.
NOTE: it is best for changing column names and dropping columns.

Oracle Enterprise → Modify → Data Set

From OE, using the modify stage, send data into a data set with respect to the above points.

In the modify properties:
- Specification: DROP SAL, MGR, DEPTNO — here it drops the listed columns.
- Specification: KEEP SAL, MGR, DEPTNO — here it keeps the listed columns.

- With KEEP, the remaining columns are dropped.
- Specification: <new column name> DOJ = HIREDATE <old column> — here it changes the column name.
- Specification: <new column name> DOJ = DATE_FROM_TIMESTAMP(HIREDATE) <old column> — here it changes the column name along with the data type.
- At runtime: Data Set Management (view the operation process).

DAY 40
JOIN Stage (processing stage)

The Join stage performs horizontal combining. It is characterized with respect to input requirements, treatment of unmatched records, and memory usage.

- Types of join: inner join, left outer join, right outer join, and full outer join.
- The Join stage takes n inputs for inner, LOJ, and ROJ, but only 2 inputs for FOJ (a full outer join has no scope for a third table); it has 1 output and no reject link.
- Input names: left table, right table, and intermediate tables.
- A key column is needed, and the key column names should be the SAME in this stage.
- Input requirements with respect to sorting: sorting is mandatory on the primary and secondary tables.
- Input requirements with respect to de-duplication: nothing happens; duplicates are OK.
- Treatment of unmatched records: in the primary table, the Inner option simply drops them, while LOJ keeps all records in the target; in the secondary table, Inner drops them and ROJ keeps all records in the target. A Full Outer Join keeps records from both tables.
- Memory usage: light memory in the join stage.

A simple job for the JOIN stage — in the JOIN properties, a key column is needed for every join type.
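The unmatched-record treatment described above can be sketched in Python (an illustrative model of join semantics, not DataStage code; the sample tables are invented):

```python
# Sketch of Join-stage semantics on a shared key column.
def join(left, right, key, how="inner"):
    rindex = {}
    for r in right:
        rindex.setdefault(r[key], []).append(r)
    out = []
    for l in left:
        matches = rindex.get(l[key], [])
        if matches:
            out.extend({**l, **r} for r in matches)
        elif how in ("left", "full"):      # LOJ keeps all primary records
            out.append(dict(l))
    if how in ("right", "full"):           # ROJ/FOJ keep unmatched secondaries
        lkeys = {l[key] for l in left}
        out.extend(dict(r) for r in right if r[key] not in lkeys)
    return out

dept = [{"DNO": 10, "DNAME": "SALES"}, {"DNO": 20, "DNAME": "HR"}]
emp = [{"DNO": 10, "EN": "naveen"}]
```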

NOTE (Join stage):
- If we sort on different key column names, the job executes but performance suffers (simply put, WARNINGS will occur).
- We can change a column name in two ways: with a Copy stage, or with a query statement.
  Example SQL query: select DEPTNO1 as DEPTNO, DN, Loc from DEPT;

DAY 41
MERGE Stage (processing stage)

The Merge stage is a processing stage that performs horizontal combining. It is characterized with respect to input requirements, treatment of unmatched records, and memory usage.
- Merge stage input names are master and updates.

- Join types in this stage: inner join and left outer join. Merge operates with only two options: Keep (left outer join) and Drop (inner join).
- n inputs, 1 output, and (n-1) rejects for the merge stage.
- The key column names must be the SAME.
- Input requirements with respect to sorting: it is mandatory to sort before performing the merge stage.
- Input requirements with respect to de-duplication: in the primary (master) table we get warnings when we don't remove the duplicates; in the secondary (update) tables nothing happens, duplicates are OK.
- In the inner join type it compares against ANY update table.
- Treatment of unmatched records: in the primary table, Drop (drops them) or Keep (keeps the unmatched primary records in the target); in the secondary table, it drops them, and a reject link captures the unmatched secondary table records.
- Memory usage: LIGHT memory.

NOTE:
- Static information is stored in the master table.
- All changing information is stored in the update tables.

Simple job for the MERGE stage:

PID PRD_DESC PRD_MANF    PID PRD_SUPP PRD_CAT    PID PRD_AGE PRD_PRICE
11  indica   tata        11  abc      XXX        11  4       1000
22  swift    maruthi     33  xyz      XXX        22  9       1200
33  civic    hundai      55  pqr      XXX        66  3       1500
44  logon    mahindra    77  mno      XXX        88  9       1020

Master table + Update (U1) + Update (U2) → Merge → TRG, with Reject (U1) and Reject (U2).

In MERGE properties:
- Merge has an inbuilt sort (ascending/descending order).
- Link order must be followed.
- Merge supports (n-1) reject links.
- NOTE: there have to be the same number of reject links as update links, or zero reject links.

Here a COPY stage can act as a STUB stage, meaning it holds the data without sending it to any target.

DAY 42
Remove Duplicates & Aggregator Stages

Remove Duplicates: "It is a processing stage which removes the duplicates on a column and retains the first or last duplicate row."

Sequential File → Remove Duplicates → Data Set

Properties of Remove Duplicates (two options in this stage):
o Key = <column name>
o Duplicate to retain = (first/last)

The Remove Duplicates stage supports 1 input and 1 output.

NOTE: for every stage with n inputs and n outputs, mapping must be done.

Aggregator: "It is a processing stage that performs counts of rows and different calculations between columns, i.e. the same operation as GROUP BY in Oracle."

SF → Aggregator → DS

Properties of Aggregator:
- Grouping keys:
  o Group = DEPTNO
- Aggregator:
  o Aggregation type = count rows (count rows / calculation / re-calculation)
  o Count output column = count <column name>

1Q: Count the number of all records, and deptno-wise, in an EMP table?
1 Design:

OE_EMP → Copy of EMP → counting rows of deptno → TRG1
                     → generating a column, counting rows of the created column → TRG2

For doing group calculations between columns (here, calculating on SAL grouped by DEPTNO):
- Select the group key: Group = DEPTNO
- Aggregation type = calculation
- Column for calculation = SAL <column name>

Operations:
- Maximum value output column = max <new column name>
- Minimum value output column = min <new column name>
- Sum output column = sum <new column name>, and so on.

2Q: In target one, find the deptno-wise maximum, minimum, and sum of rows; in target two, the company-wise maximum?
2 Design: OE_emp → copy of emp → max, min, sum of deptno → trg1 (first output).
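The group calculations above map directly onto a GROUP BY. A small Python sketch of the Aggregator's count/max/min/sum operations (illustrative only; the EMP rows are invented):

```python
# Sketch of the Aggregator stage: GROUP BY-style count and calculations.
from collections import defaultdict

def aggregate(rows, group, calc_col):
    groups = defaultdict(list)
    for r in rows:
        groups[r[group]].append(r[calc_col])
    # One result row per group key, like "Group = DEPTNO".
    return {k: {"count": len(v), "max": max(v), "min": min(v), "sum": sum(v)}
            for k, v in groups.items()}

emp = [{"DEPTNO": 10, "SAL": 1000}, {"DEPTNO": 10, "SAL": 3000},
       {"DEPTNO": 20, "SAL": 2000}]
stats = aggregate(emp, "DEPTNO", "SAL")
```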

Company: IBM → max of IBM → trg2 (second output of design 2).

3Q: Find the max salary of a company from the EMP table, along with all the details of that record?
4Q: Find the max, min, and sum of salary, deptno-wise, in the EMP table?

3 & 4 Design:
emp → max(deptno) for dno=10/20/30, compared with a dummy → UNION ALL → copy → min(deptno);
company IBM: compare the maximum SAL with its details → max(IBM)

DAY 43
Slowly Changing Dimensions (SCD) Stage

Before SCD we must understand the types of loading:
1. Initial load
2. Incremental load

Initial load: a complete dump into the dimensions or the data warehouse, i.e. loading while the target holds no data yet ("before" data).

Incremental load: the subsequent alterations, i.e. the data coming from OLTP; the source here is the "after" data.

Example #1
Before data (data already in the table):
CID CNAME ADD GEN BALANCE PhoneNo    AGE
11  A     HYD M   30000   9885310688 24
After data (update-and-insert data at the source level):
CID CNAME ADD GEN BALANCE PhoneNo    AGE
11  A     SEC M   60000   9885865422 25

Column fields change at different rates:
- Address: changes slowly
- Balance: changes rapidly
- Phone No: changes often
- Age: changes frequently

Example #2
Before data:
CID CNAME ADD
11  A     HYD
22  B     SEC
33  C     DEL
After data (loading the table with the update-and-insert option):
CID CNAME ADD
11  A     HYD

e.  Record version: it is concept that when the ESDATE and EEDATE where not able to use is some conditions. and no historical data were organized”. With some special operation columns they are. i. active flag. they are  SCD – I  SCD – II  SCD – III  SCD – IV or V  SCD – VI Explanation: SCD – I: “it only maintains current update. Kalyan-9703114894 . SCD – II: “it maintains both current update data and historical data”. not having primary key that need system generated primary key.  In SCD – II. surrogate key. new concepts are introduced here i.22 B CUL 33 D PUN  Extracting after and before data from DW (or) database to compare and upsert. effect start date.  In SCD – II. That we can solve by active flag: “Y” or “N”.. As per SCD – I. and effect end date. We have SIX Types of SCD’s are there.  Unique key: the unique key is done by comparing. effect start date (ESDATE) and effect end date (EEDATE).e. Here surrogate key acting as a primary key. it updates the before data with after data and no history present after the execution. surrogate key..  And when SCD – II performs we get a practical problem is to identify old and current record. SCD – III: SCD – I (+) SCD – II “maintain the history but no duplicates”.

SCD – IV or V: SCD – II + record version — "when we do not maintain date versions, the record version is useful".

SCD – VI: SCD – I + unique identification.

Example table of SCD data:
SID CID CNAME ADD AF ESDATE   EEDATE     RV UID
1   11  A     HYD N  03-06-06 29-11-10   1  1
2   22  B     SEC N  03-06-06 07-09-07   1  2
3   33  C     DEL Y  03-06-06 9999-12-31 1  3
4   22  B     DEL N  08-09-07 29-11-10   2  2
5   44  D     MCI Y  08-09-07 9999-12-31 1  5
6   11  A     GDK Y  30-11-10 9999-12-31 2  1
7   22  B     RAJ Y  30-11-10 9999-12-31 3  2
8   55  E     CUL Y  30-11-10 9999-12-31 1  8
Table: this table illustrates the special SCD columns described above.

DAY 44
SCD I & SCD II (Design and Properties)

SCD – I: Type 1 (Design and Properties):

Transfer job: OE_SRC (10, 20, 40) + OE_DIM (before: 10, 20, 30) → SCD1 → fact → DS_FACT; dim → DS_TRG_DIM (10, 20, 40)
Load job: DS_TRG_DIM (10, 20, 40) → OE_UPSERT (update and insert)

In Oracle we have to create table1 and table2.

Table1:
- create table SRC(SNO number, SNAME varchar2(25));
- insert into src values(111, 'naveen');
- insert into src values(222, 'munna');
- insert into src values(333, 'kumar');

Table2:
- create table DIM(SKID number, SNO number, SNAME varchar2(25));
- No records to display.

Process of the transform job (SCD1):

Step 1: Load the plug-in metadata from Oracle for the before and after data, as shown in the above links coming from the different sources.

Step 2: "SCD1 properties"
Fast path 1 of 5: select the output link as: fact
Fast path 2 of 5: navigating the key column values between the before and after tables:
- SKID: surrogate key
- AFTER.SNO → SNO: business key
- SNAME: Type1

Fast path 3 of 5: selecting the source type and source name.
Source type: Flat file; source name: D:\study\navs\empty.txt

NOTE: every time we run the job we should empty the source file (empty.txt); otherwise the surrogate key will continue from the last stored value.

Fast path 4 of 5: select output in DIM.

SNAME next sk() SKID surrogate key
AFTER.SNO SNO business key

Fast path 5 of 5: setting the output paths to the FACT data set.

SNO Derivation COLUMN N



Step 3: In the next job, i.e. the load job, if we change or edit the source table, then when loading into Oracle we must set write method = upsert, which has two options:

- update then insert \\ if the key column value already exists.
- insert then update \\ if the key column value is new.

Here is the SCD I result for the below input:

Before table:        After table:
10 abc 1             10 abc
20 xyz 2             20 nav
30 pqr 3             40 pqr

Target dimensional table of SCD I:
10 abc 1
20 nav 2
40 pqr 3

SCD – II: (Design and Properties):

Transfer job Load job
OE_DIM fact DS_FACT 10, 20, 20, 30, 40 10, 20, 20, 30, 40


10, 20, 40 After dim 10, 20, 20, 30, 40 -update and insert


Step 1: in transformer stage:


Adding some columns to the before table: to convert the EEDATE and ESDATE columns into
timestamps, a transformer stage is used before performing SCD II.

In TX properties:


In SCD II properties:

Fast path 1 of 5: select output link as: fact

Fast path 2 of 5: navigating the key column value between before and after tables

SKID: surrogate key
AFTER.SNO → SNO: business key
SNAME: Type2
ESDATE: effective date
EEDATE: expiration date
ACF: current indicator


Fast path 3 of 5: selecting the source type and source name.
Source type: Flat file; source name: D:\study\navs\empty.txt
NOTE: every time we run the job we should empty the source file (empty.txt); otherwise the surrogate key will continue from the last stored value.

Fast path 4 of 5: select the output in DIM. Derivations for the DIM output:
- next sk() → SKID: surrogate key
- AFTER.SNO → SNO: business key
- AFTER.SNAME → SNAME: Type2
- curr date() → ESDATE: effective date
- "9999-12-31" → EEDATE: expiration date
- "Y" → ACF: current indicator
- Expires: "N", with Date from Julian day(Julian day from date(current date()) - 1)

Fast path 5 of 5: setting the output paths to the FACT data set, mapping BEFORE.SNO → SNO, BEFORE.SNAME → SNAME, BEFORE.ESD → ESDATE, BEFORE.EED → EEDATE, BEFORE.ACF → ACF, and SKID → SKID.

Step 3: In the next job, i.e. the load job, if we change or edit the source table, then when loading into Oracle we must set write method = upsert, which has two options:
- update then insert \\ if the key column value already exists.
- insert then update \\ if the key column value is new.

Here is the SCD II result for the below input:

Before table:
CID CNAME SKID ESDATE   EEDATE   ACF
10  abc   1    01-10-08 99-12-31 Y
20  xyz   2    01-10-08 99-12-31 Y
30  pqr   3    01-10-08 99-12-31 Y

After table:
CID CNAME
10  abc
20  nav
40  mun

Target dimensional table of SCD II:
CID CNAME SKID ESDATE   EEDATE   ACF
10  abc   1    01-10-08 99-12-31 Y
20  xyz   2    01-10-08 09-12-10 N
20  nav   4    10-12-10 99-12-31 Y
30  pqr   3    01-10-08 99-12-31 Y
40  mun   5    01-10-08 99-12-31 Y

DAY 45
Change Capture, Change Apply & Surrogate Key stages

Change Capture Stage: "It is a processing stage that captures whether a record of a table is a copy, or was edited, inserted, or deleted, by keeping a code column."

Simple example of change capture:
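The capture codes (copy = 0, insert = 1, delete = 2, edit = 3) can be sketched in Python — an illustrative model of the comparison, not the stage itself; the sample rows are invented:

```python
# Sketch of Change Capture codes: copy=0, insert=1, delete=2, edit=3.
CODES = {"copy": 0, "insert": 1, "delete": 2, "edit": 3}

def change_capture(before, after, key):
    b = {r[key]: r for r in before}
    a = {r[key]: r for r in after}
    out = []
    for k, row in a.items():
        if k not in b:
            out.append({**row, "change_code": CODES["insert"]})
        elif row == b[k]:
            out.append({**row, "change_code": CODES["copy"]})
        else:
            out.append({**row, "change_code": CODES["edit"]})
    for k, row in b.items():
        if k not in a:                      # present before, gone after
            out.append({**row, "change_code": CODES["delete"]})
    return out

before = [{"EID": 1, "ADD": "HYD"}, {"EID": 2, "ADD": "SEC"}]
after = [{"EID": 1, "ADD": "HYD"}, {"EID": 3, "ADD": "DEL"}]
codes = {r["EID"]: r["change_code"] for r in change_capture(before, after, "EID")}
```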


Change_capture

Properties of Change Capture:
- Change keys
  o Key = EID (key column name)
  o Sort order = ascending order
- Change values
  o Values = ? \\ ENAME
  o Values = ? \\ ADD
- Options
  o Change mode = (explicit keys & values / explicit keys, values)
  o Drop output for copy = (false/true) "false - default"
  o Drop output for delete = (false/true) "false - default"
  o Drop output for edit = (false/true) "false - default"
  o Drop output for insert = (false/true) "false - default"
  o Copy code = 0
  o Delete code = 2
  o Edit code = 3
  o Insert code = 1
  o Code column name = <column name>
  o Log statistics = (false/true) "false - default"

Change Apply Stage: "It is a processing stage that applies the captured changes to the records of a table."

Change Apply

Properties of Change Apply:
- Change keys
  o Key = EID
  o Sort order = ascending order
- Options
  o Change mode = explicit keys & values
  o Check value columns on delete = (false/true) "true - default"
  o Log statistics = false
  o Code column name = <column name> \\ this has to be the SAME as in change capture for the apply operation

SCD II in version 7.5.x2 — design of that:
before.txt + after.txt → Change Capture (key = EID, option: explicit keys & values) → c=3 / c=all → Change Apply (key = EID, option: explicit keys & values) → target
Derivations: ESDATE = current date(), EEDATE = "9999-12-31", ACF = "Y"

Continuing the derivations: EEDATE = if c=3 then Date from Julian day(Julian day from date(current date()) - 1) else "9999-12-31"; ACF = if (c=3) then "N" else "Y".

SURROGATE KEY Stage:

In version 7.5.x2, a surrogate key stage is used to generate system key column values that act like primary key values. But it generates them only at the first compile; identifying the last value generated is the problem, so in version 7 we must build another job to store the last generated value.

Version 7.5.x2 design:
SF → SK → copy → ds
            → tail → peek
- In this job, by taking a tail stage we trace the last value and store it into the peek stage, i.e. in a buffer.
- With that buffer value we can generate the sequence values, i.e. the surrogate keys, in version 7.5.x2.

Version 8.0: "The above problem of version 7 is overcome in version 8.0: the surrogate key stage takes an empty text file (empty.txt), stores the last-value information in that file, and uses it to generate the sequence values."
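The version-8 approach — persist the last generated value in a small state file — can be sketched in Python (illustrative only; the file layout and function name are invented, not the stage's actual format):

```python
# Sketch of a surrogate-key generator that persists its last value in a
# state file, like the empty.txt state file used by the version-8 stage.
import os
import tempfile

def next_surrogate_keys(state_file, count):
    last = 0
    if os.path.exists(state_file):
        with open(state_file) as f:
            text = f.read().strip()
            last = int(text) if text else 0
    keys = list(range(last + 1, last + 1 + count))
    with open(state_file, "w") as f:
        f.write(str(keys[-1]))             # remember the last generated value
    return keys

state = os.path.join(tempfile.mkdtemp(), "sk_state.txt")
first = next_surrogate_keys(state, 3)      # first run of the job
second = next_surrogate_keys(state, 2)     # next run continues the sequence
```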

SF → SK → Data Set

Properties of SK in version 8:
Option 1:
- Generated output column name = skid
- Source type = flat file
- Source name = g:\data\empty.txt
Option 2:
- Source type = database sequence
- Database type = oracle (DB2/oracle)
- Source name = sq9 (in Oracle: create sequence sq9) \\ it works like empty.txt
- User id = scott
- Password = tiger
- Server name = oracle

DAY 46
DataStage Manager

Export: "Export is used to save a group of jobs, for export purposes, to wherever we want."

Navigation — how to export? DataStage toolbar:
- Change selection: ADD or REMOVE or SELECT ALL
- Job components to export — here there are three options:
  . Export job designs with executables (where applicable)

  . Export job designs without executables
  . Export job executables without designs
- Export to file: source name \ where we want to locate the export file.
- Type of export: by two options we can export the file, .dsx or .xml
  . dsx
  . dsx 7-bit encoded
  . xml

Import: "It is used to import the .dsx or .xml exports into a particular project, and also to import some definitions as shown below."

Options of import are:
- DataStage components...
- DataStage components (xml)...
- External function definitions
- Web services function definitions
- Table definitions
- IMS definitions
  . In IMS there are two options: Database Description (DBD) and Program Specification Block (PSB/PCB)
- In DataStage components: import from file — give the source name to import.

Import options:
- Import all / Import selected
- Overwrite without query
- Perform impact analysis

Generate Report: "It generates a report for a job, or a specific item, instantly."
For that, go to:
- File
  o Generate report
    . Report name
    . Options: use default style sheet / use custom style sheet
After finishing the settings, it generates the report in the default position "/reportingsendfile/ send file/ tempDir.tmp"

Node Configuration:

Q: How to see the nodes in a project?
- Go to the run director
  o Check in logs
  o Double click on the main program: APT config file

Q: What are the Node Components?
1. Node name — logical CPU name.
2. Fast name — server name or system name.
3. Pools — logical area where stages are executed.
4. Resource — memory associated with the node.

- Node components are stored permanently on disc at: "c:\ibm\information server\server\parasets"
- Node components are stored temporarily at: "c:\ibm\information server\scratch"

Q: Which node handles running each and every job, and what is the name of the configuration file?
- Every job runs on the APT node; that is the default for every job.
- The name of the configuration file is C:\ibm\...\default.apt; default.apt holds the single-node information.

Q: How to create a new Node configuration file?
- Tools
  o Configurations — there we see default.apt
  o We can create a new node with the option NEW
  o Save the configuration after creating new nodes via "save configuration As"
  o NOTE: it is best to create 8 or 16 nodes in a project,

Q: How to run a job on a specific configuration file?
- Job properties
  o Parameters → Add environment variables
  o Parallel → Compiler → Config file (Add $APT_CONFIG_FILE)

 2^0,2^1(say) CPU’s have & so on.
Q: If uni processing system with 1 CPU needs minimum 1 node to run a job then for SMP
with 4 CPU needs how many minimum nodes?
o Only 1 node.

Advanced Find:
“It is the new feature to version8”
It consists of to find objects of a job like list shown below
1. Where used,
2. Dependency,
3. Compare report.
Q: How to run a job in a job?
Navigation for how to run a job in a job
 Job properties
o Job control
 Select a job
 -------------
 ------------- here, Job Control Language (JCL) script presents.
 -------------
o Dependencies
 Select job (first compile this job before the main
Q: Repository of Advanced Find (i.e. the palette of Advanced Find)?
o Name to find: Nav*

o Folder to search: D:\datastage\
o Type
o Creation
o Last modification
o Where used


 Find objects that use any of the following objects.
 Options: Add, remove, remove all
o Dependencies of job

Q: Advance Find of repository through tool bar?
o Cross project compare….
o Compare against
o Export
o Multiple job compile
o Add to palette
o Create copy
o Locate in tree
o Find dependencies

Q: How to find dependency in a job?
o Go to tool bar
 Repository
 Find dependency: all types of a job

DAY 47
DataStage Director

DS Director maintains:
 Schedule
 Monitor
 Views
o Job view
o Status view


o Log view
 Message Handling
 Batch jobs
 Unlocking

“Schedule means a job can run in specific timings”
 To set timings for that,
o Right click on job in the DS Director
 Click on “add to schedule…”
 And set the timings.

 In real time, specific the job sequence by some tools shown below
o Tools to schedule jobs (its happen the production only)
 Control M
 Cron tab
 Autosys

Purge: "It means cleaning, washing out, or deleting the already created logs."
- We can clear the logs per job.
- Job logs have a FILTER option, accessible by right-clicking.
 Navigation for set the purge.
o Tool bar
 Job
- Clear log (choose the option)


Purge options:
o Immediate purge
o Auto purge

Monitor: "It shows the status of the job, the number of rows executed, started at (time), elapsed time, rows/sec, and the percentage of CPU used."

Navigation — how to monitor a job:
- Right click on the job
  o Click monitor — "it shows the performance of the job", like the figure below for a simple job:

Status    rows  started at  elapsed time  rows/sec  %CPU
Finished  6     sys time    00:00:03      2         =9
Finished  6     sys time    00:00:03      2         =7
Finished  6     sys time    00:00:03      2         =0

NOTE: based on this we can check the performance tuning of a stage in a particular job.

Reasons for warnings — the default warnings in a sequential file are:
1. Import warning at record 0; import unsuccessful at record 0, at offset: 0.
2. Field "<column name>" has import error and no default value; data: { e i d }.
3. Missing record delimiter "\r\n", saw EOF instead (format mismatch).
- These three warnings can be solved by a simple option in the sequential file: First line is column names = set as true (the default option is false).

Other situations that produce warnings:
- When the secondary stage in a merge has duplicates, we get a warning.
- When working on a lookup.
- When sorting on a different key column in a join.
- Where there is a length mismatch, e.g. source length (10) and target length (20).

Abort a job:
Q: How can we abort a job conditionally?
- When we run a job, we can keep a constraint on warnings: No limit / 5 / Abort job after.
- In the transformer stage:
  o Constraint → Otherwise/log → Abort after rows: 5 (if 5 records do not meet the constraint, it simply aborts the job).
  o We can keep a constraint like this only in a Range Lookup as well.

Message Handling: "If the warnings fail to be handled, then we come to message handling."
Navigation — how to add a rule set to message-handle the warnings:
- Job logs
  o Right click on a warning
    . Add rule to message handler — two options:
      Suppress from log

      Demote to information
    . Choose one of the above options and add the rule.

Allow multiple instances: "The same job can be opened by multiple clients, and each can run the job."
- If we do not enable this option, the job opens read-only for other users, so they can't edit it; but a job can still be executed by multiple users at the same time in the Director.
- Navigation to enable allow multiple instances: go to the toolbar in DS Designer → Job properties → check the box "allow multiple instances".

Batch jobs: "Executing a set of jobs in an order."
Q: How to create a batch? Navigation for creating a batch:
- DS Director
  o Tools → Batch → New (give the name of the batch)
  o Add jobs to the created batch
  o Just compile after adding jobs to the new batch.

Unlock the jobs: "We can unlock the jobs locked by multiple instances by releasing all the permissions."
Navigation to unlock a job:
- DS Director
  o Toolbar → Job → Cleanup resources

    . Processes: show by job / show all
    . Release all

To globally see the PIDs for jobs, in the DS Administrator:
- General → Environment variables → Parallel → Reporting → Add APT_PM_SHOW_PIDS → set as (true/false)

DAY 48
Web Console Administrator

Components of the administrator:
- Administration:
  o Users & groups

    . Users: user names & passwords are created here, and permissions are assigned.
- Session management:
  o Active sessions (for the admin)
- Reports:
  o DS → INDIA (server/system name) → view report; we can create the reports here.
- Domain Management:
  o License: update the license here (upload to review).
- Scheduling management: "It is to know what each user is doing."
  o Scheduling views: New → schedule | Run → creation task run | last update

DAY 49
Job Sequencing

Stages of job sequencing: "Job sequencing is for executing jobs in a sequence, so that we can schedule them."

Or: "It controls the order of execution of jobs."

A simple job processes as below:
o Extract
o Transform
o Load
o Master jobs: "they control the order of execution".

Important stages in job sequencing are:
1. Job activity
2. Sequencer
3. Terminator activity
4. Exception handler
5. Notification activity
6. Wait for file activity

Job Activity: "The job activity holds a job; it has 1 input and n outputs."

How is the Job Activity dragged onto the design canvas? In two methods:
1. Go to tool bar – view – palette – job activity – just drag the icon onto the canvas.
2. Go to tool bar – view – repository – jobs – just drag the job onto the canvas.

Simple job (with OK and WAR links):

Student → Job Activity → Sequencer → student rank, with OK/WAR links, and a FAIL link to a Terminator activity.

Properties of Job Activity:
- Load the job you want in the activity
  o Job name: D:\DS\scd_job
- Execution action: RUN (Run / Reset if required, then run / Validate only / Reset only)
- Do not check point (run option)

Check Point: "The job re-starts from where it aborted; this is called a check point."
- It is a special option that we must enable manually:
  o Job properties in DS Designer → Enable check point

Parameter mapping: "If a job already has some parameters, we can map them to another job if we need to."

Triggers: "A trigger holds the link expression type, i.e. how the link acts."

Name of output link | Expression type      | Expression
OK                  | OK-(conditional)     | "executed OK"
WAR                 | WAR-(conditional)    | "execution finished with warnings"
FAIL                | Failed-(conditional) | "execution failed"

And some more options in "Expression type":
- Otherwise: "N/A"
- Custom-(conditional): "custom"
- User Status = "<user-defined message>"
- Unconditional: "N/A (the default)"

Terminator Activity: "It is a stage that handles the error if a job fails."
Properties — it consists of two options, for when any subordinate jobs are still running:
- Send STOP requests to all running jobs, and wait for all jobs to finish (for job failure).
- Abort without sending STOP requests; wait for all jobs to finish first (for when the server goes down in the middle of a running process).

Sequencer: "It holds multiple inputs and multiple outputs."
It has two options or modes ('n' number of links):
- ALL: for OK & WAR links
- ANY: for FAIL links

Exception handler: "It handles the server interrupts."
- We don't connect any input stage to it; it sits separately in a job.
A simple job for the exception handler:

Exception handler → Notification activity → Terminator activity

Exception handler properties: "It has only general information."

Notification Activity: "It sends an acknowledgement in between the process."
Options to fill in the properties:
- SMTP mail server name:
- Sender's email address:
- Recipients' email address:
- Email subject:
- Attachments:
- Email body:

Wait for file Activity: "To place the job on pause."
- File name: D:\DS\SCD_LOAD (browse file)
- Two options: wait for file to appear / wait for file to disappear
- Timeout length (hh:mm:ss)
- Do not timeout (no time limit for the above options)

DAY 50
Performance tuning w.r.t partition techniques & Stages

Partition techniques fall into two categories:

Key based:

1. Hash
2. Modulus
3. DB2
4. Range

Key less:
1. Same
2. Round Robin
3. Entire
4. Random

In the key-based partition techniques:
- DB2 is used when the target is a database.
- DB2 and Range techniques are used rarely.
- Hash: selected when there are multiple key columns (>1) with heterogeneous data types (different data types). In any other situation we can select the modulus technique.
- Modulus: distributes the data based on mod values; the formula is MOD(value, number of nodes). It can only be selected when there is exactly one key column and its data type is integer. The purpose is to avoid mismatched records in between the operations.
- NOTE: Modulus has higher performance than Hash, because of the way it groups the data based on the mod value.

In the key-less partition techniques:
- Same: never redistributes the data; it carries the previous technique forward, continuing as-is.
- Entire: distributes the same complete set of records to all nodes.
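The two key-based workhorses, modulus and hash, can be sketched in Python (an illustrative model of how rows land on nodes, not the engine's actual implementation; the rows are invented):

```python
# Sketch of key-based partitioning: modulus vs hash.
def modulus_partition(rows, key, nodes):
    parts = [[] for _ in range(nodes)]
    for r in rows:
        parts[r[key] % nodes].append(r)     # MOD(value, number of nodes)
    return parts

def hash_partition(rows, keys, nodes):
    # Hash handles multiple key columns and mixed data types.
    parts = [[] for _ in range(nodes)]
    for r in rows:
        parts[hash(tuple(r[k] for k in keys)) % nodes].append(r)
    return parts

rows = [{"DNO": d} for d in (10, 11, 12, 13)]
parts = modulus_partition(rows, "DNO", 2)
```

Rows with equal key values always land on the same node, which is what keeps matching records together during joins and aggregations.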

- Round Robin: associated with generating stages such as the Column Generator; it is a better partition technique than Random.
- Random: the default technique for all key-less partition stages.

Performance tuning w.r.t Stages:
- If sorting is already performed, we can use the JOIN stage; else the LOOKUP stage is best.
- LOOKUP FILE SET: an option used to remove duplicates in the lookup stage.
- SORT stage: for a complex sort, go to the stage sort; else go to the link sort.
- Conversions: Modify stage or Transformer stage (the Transformer takes more compile time).
- Remove Duplicates: if the data is already sorted, use the Remove Duplicates stage; for sorting and removing duplicates together, go to the link sort (unique).
- Constraints: when both an operation and constraints are needed, go to the Transformer stage; for constraints only, simply go to the FILTER stage.

DAY 51
Compress, Expand, Generic, Pivot, xml input & output Stages

Compress Stage: "It is a processing stage that compresses the records into a single format, i.e. into a single file, or compresses the records into a zip."
- It supports 1 input and 1 output.
Properties:
- Stage → Options → Command = (compress/gzip)
- Input → <do nothing>
- Output → load the 'Meta' data of the source file.

Expand Stage: "It is a processing stage that extracts the compressed data, i.e. unzips the zipped data."
- It supports 1 input and 1 output.
Properties:
- Stage → Options → Command = (uncompress/gunzip)
- Input → <do nothing>
- Output → load the 'Meta' data of the source file.
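The compress/expand round trip is the same idea as gzip compression and decompression, sketched here with Python's standard gzip module (illustrative of the concept only; the sample records are invented):

```python
# Sketch of Compress/Expand semantics using gzip.
import gzip

records = b"111,naveen\n222,munna\n"
compressed = gzip.compress(records)      # Compress stage: command = gzip
restored = gzip.decompress(compressed)   # Expand stage: command = gunzip
```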

Encode Stage: "It is a processing stage that encodes the records into a single format with the support of a command line."
- It supports 1 input and 1 output.
Properties:
- Stage → Options → Command line = (compress/gzip)
- Input → <do nothing>
- Output → load the 'Meta' data of the source file.

Decode Stage: "It is a processing stage that decodes the encoded data."
- It supports 1 input and 1 output.
Properties:
- Stage → Options → Command line = (uncompress/gunzip)
- Output → load the 'Meta' data of the source file for further processing.

Generic Stage:

"It is a processing stage that can call any operator, but it must fulfill the properties."
- The Generic stage can call ANY operator of DataStage; when the job compiles, the job-related OSH code is generated.
- Its purpose is migrating server jobs to parallel jobs (IBM has x.migrator, which converts about 70%).
- It supports n inputs and n outputs, but no rejects.
Properties:
- Stage → Options → Operator: copy (we can write any stage operator here)
- Input → <do nothing>
- Output → load the Meta data of the source file.

Pivot Stage: "It is a processing stage that converts rows into columns in a table."
- It supports 1 input and 1 output.
Properties:
- Stage: <do nothing>
- Input: <do nothing>
- Output:
  Column name | Derivation                      | SQL Type | Length
  REC         | <column names, comma separated> | varchar  | 25

XML Stages:

"These are real-time stages where the data is stored in single records, or in aggregate, within the XML format."

The XML stages are divided into two types:
1. XML Input
2. XML Output

XML Input: ""