DataStage

:: FUNDAMENTAL CONCEPTS ::

DAY 1 Introduction to the Phases of DataStage

There are four different phases in DataStage. They are:

Phase I: Data Profiling. It is for source system analysis, and the analyses are 1. Column analysis, 2. Primary key analysis, 3. Foreign key analysis, 4. Base line analysis, and 5. Cross domain analysis. By these analyses we can find whether the data is "dirty" or "not".

Phase II: Data Quality (also called cleansing). In this process the steps are inter-dependent, i.e., one follows the other as shown: Parsing - Correcting - Standardizing - Matching - Consolidating.

Phase III: Data Transformation. The ETL process is done here; the data moves from one stage to another. ETL means E - Extract, T - Transform, L - Load.

Phase IV: Meta Data Management. "Meta data means data about the data."

DAY 2 How does the ETL programming tool work?  Pictorial view:
Figure: ETL programming process — sources such as a database (db), flat files and MS Excel feed the ETL process; the ETL process loads the data warehouse (DWH) and data marts (DM), which are read by the business interface/intelligence (BI) tools.

DAY 3 Continue…
Figure: how extract, staging and load work —
Extract window: the data is extracted from the source .txt file (ASCII code) and understood in the DataStage format (Native Format).
Staging: the data is staged after each transformation step; permanent staging data resides in a database or in the local repository.
Load window: the data is loaded from staging into the DWH, and at the target it is written back into .txt (ASCII code).

ETL is a process that is performed in stages: S (OLTP) → T, then S → T through one or more stage (staging) areas (sa), and finally into the DWH. Here, S - source and T - target, and "sa" is a stage area.

Home Work (HW): one record for each kindle (multiple records for multiple addresses and dummy records for joint accounts);
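As a rough illustration of one such staged ETL step in SQL (a sketch only — the table names stg_customer, dwh_customer and oltp_customer are hypothetical, not from these notes):

    -- Extract from the OLTP source into a stage area (sa)
    INSERT INTO stg_customer (cust_id, cust_name, address)
    SELECT cust_id, cust_name, address
    FROM   oltp_customer;

    -- Transform and Load the staged rows into the warehouse target (T)
    INSERT INTO dwh_customer (cust_id, cust_name, address, load_date)
    SELECT cust_id, UPPER(cust_name), TRIM(address), CURRENT_DATE
    FROM   stg_customer;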


DAY 4 ETL Developer Requirements

• Q: One record for each kindle (multiple records for multiple addresses and dummy records for joint accounts).
• Kindle means the information of customers. Inputs here: Customer, Loan, Credit card, Savings — all of the Bank.
• Multiple records means multiple records of the customers, and multiple addresses means one customer (one account) maintaining multiple addresses like savings / credit cards / current account / loan.
• A customer maintained as one record while handling different addresses is called a 'single view customer' or 'single version of truth'.

HW explanation: Here we must read the query very carefully and understand the terminology of the words from the business perspective.

ETL Developer Requirements: HLD - high level document, LLD - low level document.

ETL Developer Requirements are:
1. Understanding
2. Prepare Questions: after reading the document which is given, ask friends / forums / team leads / project leads.
3. Logical designs: means paper work.
4. Physical model: using the tool.
5. UNIT Test
6. Performance Tuning
7. Peer Reviews: it is nothing but releasing versions (version control *.**; here * means a range of 1-9).
8. Design Turn Over Document (DTD) / Detailed Design Document (DDD) / Technical Design Document (TDD)
9. Backups: means importing and exporting the data as required.
10. Job Sequencing

DAY 5 How is the DWH project undertaken?

Process: the HLD requirements flow to the warehouse (WH), then to TD, and then to the developer. The developer / system engineer involvement by job type is roughly: Development (70%-80%), Production (10%), Migration (30%). The  mark is where the developer is involved in the project and implements all TEN requirements shown above; the x mark is where the developer is not involved in the flow.
• Production based companies are like IBM and so on.
• Support based companies are like TCS, Satyam Mahindra, Cognizant and so on.
Migration means converting server jobs to parallel jobs; migration work involves both server and parallel jobs. Up to 2002 the server-job environment was used; after 2002 and up to the present, IBM provides X-Migrator, which converts server jobs to parallel jobs — up to 70% automatically, 30% manually.

A project is divided into categories with respect to its period (the time the project takes), as shown below:

  Category      Period (in months and years)
  Simple        6 m
  Medium        6 m - 1 y
  Complex       1 - 1 1/2 y
  Too complex   1 1/2 y - 5 y and so on (it may take many years depending on the project)

HLD Requirements (high level documents): SRS, BRD (here, prepared by the business analyzer / subject matter expert).
HLD Warehouse: Architecture, Schema (structure), Dimensions and tables (target tables), Facts.
LLD (low level documents): TD, Mapping Doc's (specifications - spec's), Test Spec's, Naming Doc's.

5.2. Mapping Document: For example, if the query requirements are 1 - experience of the employee, 2 - first name, middle name, last name, and 3 - dname, the mapping is laid out pictorially with these common fields:

  S.no | Load order | Target Entity | Target Attributes | Source Tables | Source Fields | Transformation | Constraint | Pk Fk Sk | Error Handling

In this example the targets Exp_tbl / Exp_emp take Eno, FName, MName, LName, DName from the source tables Emp and Dept (fields Eno, Ename, Dno, Dname); the experience is derived in the transformation column as Current Date - Hire date (CD - HD).
Funneling: S1 and S2 get data from multiple tables; 'C' is combining into the target — horizontal combining or vertical combining. As per this example, horizontal combining is used. (See the sketch below.)
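The same mapping can be sketched in SQL (an illustration only — the target table exp_emp and the exact date arithmetic are assumptions and vary by database; they are not taken from the notes):

    -- Derive experience (Current Date - Hire date) and combine EMP with DEPT horizontally
    INSERT INTO exp_emp (eno, fname, mname, lname, dname, experience_days)
    SELECT e.eno,
           e.fname,
           e.mname,
           e.lname,
           d.dname,
           CURRENT_DATE - e.hire_date   -- the CD - HD transformation from the mapping document
    FROM   emp  e
    JOIN   dept d ON d.dno = e.dno;     -- horizontal combining on the key column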

Format of the Mapping Document: the sources (S1, S2) may be flat files (.txt — fwf, cv, vl, sc, s & t, h & t) or databases (types of dB); they are combined with HC into the target (TRG). Here, HC means Horizontal Combination, which is used to combine primary rows with secondary rows. "Look Up!" means cross verification against the primary table.
 As a developer you will get a maximum of about 100 source fields.
 As a developer you will get a maximum of about 30 target fields.

DAY 6 Architecture of DWH

For example: the Reliance Group has branches — Reliance Comm., Reliance Power, Reliance Fresh — and every branch has one manager with its own database. For all these managers there is one Top Level Manager (TLM), and the TLM needs the managers' details of the list below for analysis: sales, customer, employee, period, order.
For the above example the ETL process is done as shown: a branch (e.g., Reliance Fresh) goes through an ETL process into a mini WH / data mart, and the ETL process also loads the DWH (dependent data mart versus independent data mart). The data transmission between the warehouse and a data mart depends on which one depends upon the other. Here a data mart is also called the 'Bottom level' or 'mini WH'.

Dependent Data Mart: the ETL process takes all the managers' information (or databases) and keeps it in the Warehouse; the data of an individual manager (like RF, RC, RP and so on) is then taken from the WH. Hence a data mart that depends upon the WH is called a dependent data mart.
Independent Data Mart: only one, or an individual manager's, data mart directly accesses the ETL process without any help of the Warehouse. That is why it is called an independent data mart.

6.1 Two level approaches: for both approaches a two-layer architecture applies. They are 1. Top-Bottom level approach, and 2. Bottom-Top level approach.
1. Top - Bottom level approach: the level starts from the top; as per the example, from the Reliance group the ETL process loads the Data Warehouse (top level, Layer I) and from there all the separate data marts (bottom level, Layer II). This approach was invented by W. H. Inmon.

6.2. Bottom - Top level approach: here the ETL process loads directly into the data marts (DM) first (Layer I, bottom level), and the data is then put into the warehouse (Layer II, top level) for reference purposes, i.e., the DMs are stored into the Data Warehouse (DWH). Here one data mart (DM) contains information like customer, products, employees, location and so on.
The Bottom - Top level approach was invented by Ralph Kimball.
These two approaches — Top - Bottom and Bottom - Top — come under the two-layer architecture. Programming (coding):

• ETL Tools: GUI (graphical user interface) tools; these tools "extract the data from heterogeneous sources".
• ETL programming (coding) tools are "Teradata / Oracle / DB2 and so on…".

6.2. Four layers of DWH Architecture:
6.2.1. Layer I: the data is sent directly — in the first case from the source to the Data Warehouse (DWH), and in the second case from the source to a group of Data Marts (DM).
6.2.2. Layer II: in the TOP - BOTTOM approach the data flows source - data warehouse - data mart; in the BOTTOM - TOP approach it flows source - data marts - data warehouse.

For this Layer II architecture the example shown above (e.g., the Reliance group) is the clearest explanation. (About 99.99% of projects use layer 3 and layer 4.)

6.2.3. Layer III: in this layer the data flows source - ODS (Operations Data Store) - DWH - Data Marts. The new concept added here is the ODS: it stores operational data for a period such as 6 months or one year, and that data is used to solve instant problems; the ETL developer is not involved here. The team that solves the instant / temporary problems is called the Interface team, and it is involved here. After the period, the ODS data is stored into the DWH and from there it goes to the DMs; the ETL developers are involved there in layer 3. The clear explanation of the layer 3 architecture is in the example below.

Example #1:

DAY 7

In this example the source is an aeroplane that is waiting to land at the airport terminal, but it is not able to land because of some technical problem at the airport base station. To solve this type of operational problem a special team is involved — the interface team (taking at most about 2 hours to solve the problem); simply say, the problem information is captured. The technical problems are stored in the Operations Data Store (ODS) database, but the ODS stores the data for one year only. Years of that database are then stored in the data warehouse, so that the technical problems are not repeated, or for future reference. From the DWH the data goes to the Data Marts, where the ETL developers are involved to solve technical problems. This is also called the layer 3 architecture of the data warehouse.

Continues…

7. Layer IV: Layer 4 is also called the "Project Architecture".

Figure: Project Architecture / layer IV — Source 1 and Source 2 (Layer I, SRC) feed the Interface Files (FLAT FILES); the ETL reads the flat files through DS (Layer 2) into the ODS, handling Format MISMATCH and Condition MISMATCH rejects; the data then moves to the DW and SVC (Layer 3), with a lookup, and finally to the BI DM used for Reporting (Layer 4). The BI DM is for data backup of the DWH & SVC. A dashed line (-------------) is reference data and a dotted arrow (...->) is reject data.

Here, BI - Business Intelligence, DM - Data Mart, DW - Data Warehouse, SVC - Single View Customer, ODS - Operations Data Store.

About the project architecture:

In the project architecture there are 4 layers.
 In the first layer: source to interface files (flat files).
 Coming to the second layer: the ETL reads the flat files (.txt, .csv, .xml and so on) through DataStage (DS) and sends them to the ODS. When the ETL sends the flat files to the ODS, any mismatched data is dropped. There are two types of mismatched data: 1. Condition mismatch, 2. Format mismatch.
 In the third layer the ETL transfers the data to the warehouse.
 In the last layer the data warehouse checks whether it is a single customer or not, and the data loading / transmission happens between the DWH and the DM (business intelligence).

Note: (information about dropped data when the transmission is done between the ETL reading the flat files and the ODS.)

Two types of mismatch data:
• Condition mismatch (CM): this verifies whether the data from the flat files satisfies the conditions; if a record is mismatched it is dropped automatically. To see the dropped data a reference link is used, and it shows which records were condition-mismatched.
• Format mismatch (FM): this is like condition mismatch, but it checks whether the format of the records being sent is correct or mismatched. Here also a reference link is used to see the dropped data.

Example for condition mismatch: an employee table contains some data —
SQL> select * from emp;
  EID  ENAME   DNO
  08   Naveen  10
  19   Munna   20
  99   Suman   30
  15   Sravan  10
The emp table contains dno 10, 20, 30, 10, but the target requires only dno = 10; the records with dno 20 and 30 are dropped, and the reference link captures them. (A SQL sketch follows.)

Example for Format Mismatch:
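In SQL terms the condition-mismatch split can be sketched like this (illustrative only — the target and reject table names trg_emp and emp_reject are assumed, not from the notes):

    -- Records that satisfy the condition go to the target
    INSERT INTO trg_emp (eid, ename, dno)
    SELECT eid, ename, dno FROM emp WHERE dno = 10;

    -- Condition-mismatched records are captured separately (like the reference/reject link)
    INSERT INTO emp_reject (eid, ename, dno)
    SELECT eid, ename, dno FROM emp WHERE dno <> 10;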

  EID  EName   Place
  111  naveen  mncl
  222  munna   knl
Here the table format is tab / space separated; a record whose format is mismatched (marked with a cross) is simply rejected.

7.2. Single View Customer (SVC): it is also called the "single version of truth". For example, to make a unique customer: the same customer appears as multiple records —
  CName   Adds.
  naveen  savings
  munna   insurance
  suman   credit
  naveen  loan
  munna   deposit
— these 5 multiple records of customers are transformed into a single record per customer with the addresses combined (naveen: savings, loan; munna: insurance, deposit; suman: credit). DataStage people are involved in this process. This type of transforming is also called Reverse Pivoting. In Phase - II we can identify duplicates field by field; in Phase - III we cannot identify them.

NOTE: Business intelligence (BI DM) is for data backup of the DWH & SVC (single version of truth).

DAY 8
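A rough SQL sketch of this reverse pivoting (only an illustration — LISTAGG availability and the table name customer_records depend on the database; they are assumptions):

    -- Collapse multiple rows per customer into one single-view record
    SELECT cname,
           LISTAGG(adds, ', ') WITHIN GROUP (ORDER BY adds) AS all_addresses
    FROM   customer_records
    GROUP BY cname;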

Dimensional Model

Modeling: it represents the physical or logical design from our source system to the target system.
  o Logical design: client perspective (designed manually, on paper — the pictorial / logical view, e.g., EMP, DEPT, SQB).
  o Physical design: database perspective.
Data Modelers use DM tools, where the logical and physical design is done:
  o ERWIN (Entity Relationship for Windows)
  o ER - STUDIO (Entity Relationship Studio)
 Forward Engineering (FE): it starts from scratch.
 Reverse Engineering (RE): it creates from an existing system; simply say, "altering the existing process".
 Meta Data: every entity has a structure, which is called Meta Data (simply say, 'data about the data').
  o In a table there are attributes and domains; there are two types of domain: 1. Alphabetical and 2. Number.

For example — Q: A client requires the experience of an employee.

In the source EMP_table the Hire Date is the implicit requirement (it implies the experience of the employee). From the developer's point of view the explicit requirement is to find out everything the client wants to see: in the target (TRG) the employee hire information is kept in detail — ENo, EName, Years, Months, Days, Hours, Minutes, Seconds, Nano_Seconds — the lowest level of detailed information.

8.1. Dimensional Table: finding out everything as per what the client requires to see, (or) the "lowest level of detailed information" in the target tables, is called a Dimensional Table.

There are three kinds of keys:
- Primary Key: a constraint; it is a combination of unique and not null.
- Foreign Key: a constraint used as a reference to another table.
- Surrogate key.

Q: How are the tables interconnected? Taking some tables and linking them with the related tables: in the product table, Product_ID is the primary key and PRD_TYPE_ID is the foreign key; the product-type table has PRD_TYPE_ID as its primary key and PRD_SP_ID as a foreign key to the supplier table (PRD_SP_ID, SName, ADD1). The link is established using the foreign key and primary key.
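The same Pk/Fk link can be sketched in SQL DDL (a minimal sketch; the column lists are abbreviated assumptions based on the example above):

    CREATE TABLE product_type (
        prd_type_id   INTEGER PRIMARY KEY,   -- primary key of the type table
        prd_sp_id     INTEGER,
        prd_category  VARCHAR(30)
    );

    CREATE TABLE product (
        product_id    INTEGER PRIMARY KEY,   -- primary key: unique + not null
        prd_desc      VARCHAR(50),
        prd_type_id   INTEGER REFERENCES product_type (prd_type_id)  -- foreign key link
    );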

8.2. Normalized Data: if there is repetitive information (records) in a table it is called Redundancy. Minimizing that information is the technique called Normalization (or reducing redundancy).

For example, the table

  ENO  EName   Designation      DNo  Higher Quali.  Add1  Add2
  111  naveen  ETL Developer    10   M.TECH         JNTU  HYD
  222  munna   System analysis  20   M.TECH         JNTU  HYD
  333  Sravan  JAVA Developer   10   M.TECH         JNTU  HYD
  444  Raju    Call Center      30   M.SC           SVU   HYD
  555  Rajesh  JAVA Developer   10   M.SC           SVU   HYD

contains repetitive information (redundancy) in the qualification / address columns. Dividing it into two tables — an EMP table (ENO, EName, Designation, DNo as Fk) and a DEPT table (DNo as Pk, Higher Quali., Add1, Add2) — is the normalization technique. The target table, however, must always be in de-normalized format.
De-Normalization means combining the multiple tables back into one table, and that combining is done by Horizontal Combining (HC). (A SQL sketch of the split follows.)
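As a sketch of the same normalization in SQL (illustrative — emp_denorm is an assumed name for the redundant source table, and CREATE TABLE AS syntax varies by database):

    -- Split the repeated qualification/address columns into a DEPT master table
    CREATE TABLE dept AS
    SELECT DISTINCT dno, higher_quali, add1, add2
    FROM   emp_denorm;

    -- Keep only the non-repeating columns (plus the Fk) in the EMP table
    CREATE TABLE emp AS
    SELECT eno, ename, designation, dno
    FROM   emp_denorm;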

 But it is not so in all cases (not every target table is de-normalized).

DAY 9

E-R Model

An Entity-Relationship Model: in the logical design there are two options to design a job. They are 1. Manual, and 2. Optional; the mandatory one is a must.

Given the two tables EMP and DEPT:
  EMP table (ENO, EName, Designation, DNo) — secondary (also known as the Child Table), because it depends on the DEPT table.
  DEPT table (DNo, Higher Quali., Add1, Add2) — primary (also known as the Master Table), because it does not depend on any other table.
Here 1 - primary table & n - secondary tables. But when we take it in real time, joining the two tables using Horizontal Combining takes the EMP table as the primary (driving) input and the DEPT table as the secondary input.

9.1. Horizontal Combine:

To perform horizontal combining we must follow these cases:
 There should be dependency.
 It must have multiple sources.
 1 - primary, n - secondary.
 There are three types of keys: o Primary key, o Foreign key, and o Surrogate key.
 HC means combining primary rows with secondary rows based on the primary key column values.
Horizontal combining is also called a JOIN (see the SQL sketch below).

For example, combining the two tables EMP (ENO, EName, Designation, DNo as Fk) and DEPT (DNo as Pk, Higher Quali., Add1, Add2): after combining (joining) the tables using HC, the result looks like

  ENO  EName   Designation      DNo  Higher Quali.  Add1  Add2
  111  naveen  ETL Developer    10   M.TECH         JNTU  HYD
  222  munna   System analysis  20   M.TECH         JNTU  HYD
  ...
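The same horizontal combine expressed as a SQL join (a sketch; the column names simply follow the example above):

    -- Combine primary (EMP) rows with secondary (DEPT) rows on the key column DNO
    SELECT e.eno, e.ename, e.designation, e.dno,
           d.higher_quali, d.add1, d.add2
    FROM   emp  e
    JOIN   dept d ON d.dno = e.dno;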

9.2. Different types of Schemas: there are four types of schemas — o STAR Schema, o Snow Flake Schema, o Multi STAR Schema, and o Galaxy Schema.

1. STAR Schema: in the star schema you must know about two things — o the Dimensional table and o the Fact table.
 Dimensional table: means the 'lowest level detailed information' of a table.
 Fact Table: means it is a collection of foreign keys from the n dimensional tables.
Definition of STAR Schema: "A fact table holding a collection of foreign keys, surrounded by multiple dimensional tables, where each dimensional table is a collection of de-normalized data" — this is called a STAR Schema.
The data transmission is done in two different methods: in the pictorial way it looks like Source → T → DIM table → T → FACT table; in the practical way it goes directly from the source to the dimensional table and the fact table.
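A hedged SQL sketch of querying a star schema (illustrative dimension/fact table and column names modelled on the example that follows; they are assumptions, not the notes' exact tables):

    -- "What did suman buy, for product lux, in ameerpet, in the first week of January?"
    SELECT c.cname, p.prd_desc, l.area, d.cal_date, f.quantity
    FROM   fact_sales f
    JOIN   cust_dim   c ON c.cust_id = f.cust_id      -- the fact table holds only foreign keys
    JOIN   prd_dim    p ON p.prd_id  = f.prd_id
    JOIN   loc_dim    l ON l.loc_id  = f.loc_id
    JOIN   date_dim   d ON d.date_id = f.date_id
    WHERE  c.cname = 'suman' AND p.prd_desc = 'lux'
    AND    l.area = 'ameerpet' AND d.month_name = 'January' AND d.week_of_month = 1;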

Example for STAR Schema: "taking some tables as below to derive a star schema from them".
Q: display what suman bought — a lux product — in ameerpet in the January 1st week?
As per the above question, the information needed comes from the PRD_dim_tbl, Cust_dim_tbl, Date_dim_tbl and Loc_dim_tbl; the fact table is surrounded by these dimensional tables, where a dimensional table is the lowest level of detailed information and the fact table is a collection of foreign keys. The link to the fact table is created through the measurements — i.e., measurements mean taking the information as per the client/user requirement — using the foreign keys (Fk) in the fact table and the primary keys (Pk) of the dimensions. The fact table is also called a Bridge or Intermediate table.
 But in the current market the plain STAR Schema and the Snow Flake Schema are used rarely.

2. Snow Flake Schema: "the fact table surrounded by dimensional tables, where each dimensional table has its own look-up tables", is called a Snow Flake Schema.
 Sometimes it is not possible to fetch huge data quickly when it all sits in one dimensional table; to minimize the huge data in a single dimensional table we divide the dimensional table into some tables, and those tables are known as "look-up tables". For example, an EMP_tbl dimension can have Dept_tbl and Locations look-up tables; if we require information from the Locations table, it is fetched from that table and displayed as the client requires.
 With de-normalization the STAR Schema works effectively; with normalization the Snow Flake Schema works effectively (Source → DWH → Reports with Cognos / BO; MIG / H1).

 NOTE: The selection of a schema at run time depends on report generation.

:: DataStage CONCEPTS ::

DAY 10 DataStage (DS) Concepts:
  History of DS.
  Features of DS.
  Architecture of the 7.5.x2 and 8.0.1 versions.
  Differences between the 7.5.x2 and 8.0.1 versions.
  Enhancements and new features of version 8.0.1.

Q: What is DataStage?
ANS: DataStage is a comprehensive ETL tool, which provides an End - to - End Enterprise Resource Planning (ERP) solution (here, comprehensive means good in all areas).

HISTORY of DataStage: according to the year 2006 there were around 600 ETL tools in the market, some of them being DataStage Parallel Extender, SAS (ETL Studio), ODI (OWB), BODI, Ab Initio and so on… But DataStage is famous and widely used in the market, and it is expensive too.

History begins:
- In 1997, the first version of DataStage was released by the VMARK company (a US based company); only 5 members were involved in releasing the software into the market, and Mr. LEE SCHEFFLER is the father of DataStage. In those days DataStage was called the "Data Integrator".

- In 1997, Data Integrator was acquired by a company named TORRENT.
- Two years later, in 1999, the INFORMIX company acquired Data Integrator from the TORRENT company.
- In 2000, the ASCENTIAL company acquired both the database and the Data Integrator, and the ASCENTIAL DataStage Server Edition was released in this year. Through this company DataStage became popular in the market from that year; the released software ran with about 30 tools. The server was configured only on the UNIX platform / environment.
- In 2002, ASCENTIAL integrated with the ORCHESTRATE company for the parallel capabilities (ADSS + ORCHESTRATE), and they named the product ADSSPX; its version was 6.0, and from that version the parallel operations / capabilities start. Because ORCHESTRATE (PX, UNIX) had parallel extendable capabilities in the UNIX environment, the parallel versions from 6.0 up to 7.5.1 supported only UNIX-flavour environments.
- In 2004, version 7.5.x2 was released, which supports server configuration for the Windows platform also. For this, ADSSPX was integrated with MKS_TOOL_KIT. MKS_TOOL_KIT is a virtual UNIX machine that brings the capabilities to Windows to support the server configuration.
  NOTE: after installing ADSSPX + MKS_TOOL_KIT on Windows, all the UNIX commands work on the Windows platform.

- In 2004, December: version 7.5.x2 shipped with the ASCENTIAL suite components. There are 12 types of ASCENTIAL suite components; some of them are the Profile stage, Audit stage, Quality stage, Meta stage, DataStage Px, DataStage Tx, DataStage MVS and so on — these are individual tools.
- In 2005, February: IBM acquired all the ASCENTIAL suite components, and IBM released IBM DS EE, i.e., the enterprise edition.
- In 2006: IBM made some changes to IBM DS EE; the changes integrated the profiling stage and the audit stage into one. With the combination of four stages they released "IBM WEBSPHERE DS & QS 8.0"; this is also called an "Integrated Development Environment", i.e., IDE.
- In 2009: IBM released another version, "IBM INFOSPHERE DS & QS 8.1".
o In the current market: 7.5.x2 is used 40 - 50%, 8.0.1 is used 30 - 40%, 8.1 is used 10 - 20%.
NOTE: DataStage is a front end; nothing is stored in it by itself.

DAY 11 DataStage FEATURES

Features of DS: there are 5 important features of DataStage. They are: Any to Any, Platform Independent, Node configuration, Partition parallelism, and Pipeline parallelism.
 Any to Any: DataStage is capable of reading from any source and writing to any target.
 Platform Independent: "a job that can run on any processor" is called platform independent. There are three types of processors: UNI (uniprocessor), Symmetric Multi Processor (SMP), and Massively Multi Processor (MMP).

Figure: UNI, SMP-1 … SMP-n and MMP processor configurations (each with its own CPUs, HDD and RAM).

 Node Configuration:
  o A node is software that is created in the operating system; "a node is a logical CPU, i.e., an instance of a physical CPU."
  o Using software, "the process of creating virtual CPUs is called Node Configuration."
  o The node configuration concept works exclusively in DataStage; hence it is the best feature compared with other ETL tools.
  o For example: an ETL job requires executing 1000 records. A UNI processor takes 10 minutes to execute the 1000 records, but for the same job an SMP processor takes only 2.5 minutes, because the 1000 records are shared across four CPUs and the execution time is reduced.

o Using Node Configuration for the above example on a UNI processor: Node Configuration can create virtual CPUs (Node1, Node2, Node3, Node4) to reduce the execution time even on a UNI processor — the 10 minutes reduce to about 2.5 minutes.

 Partition Parallelism:
  o Partitioning is distributing the data across the nodes, based on partition techniques.
  o Consider one example of why we use the partition techniques: take some records in an EMP table and some in a DEPT table — the EMP table has 9 records, EMP(10,20,30,10,20,30,10,20,30), and the DEPT table has 3 records, DEPT(10,20,30).
  o After partitioning and joining, the output must have 9 records, because here the primary table has 9 records.

If the data is distributed across the nodes without considering the key column (e.g., N1 gets 10,20,10 / N2 gets 30,20,10 / N3 gets 30,10,20 against DEPT 10,20,30), only about 4 records match in the final output and the remaining records are missing; for this reason the partition techniques are introduced. In total there are 8 types of partition techniques, in two categories of partition parallelism:
  o Key based: • Hash • Modulus • Range • DB/2
  o Key less: • Same • Random • Entire • Round robin
 The key based category (key based techniques) gives the assurance that rows with the same key column value are collected into the same key partition.
 The key less techniques are used to append the data, e.g., for joining the given tables.
From the records taken above, we partition using a key based technique, as sketched below.
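As a rough way to see what a key based (modulus style) partition does, here is an illustrative SQL expression (not DataStage syntax; it only shows that equal key values always land in the same partition):

    -- Assign each EMP row to one of 3 nodes by its key column DNO:
    -- every row with the same DNO gets the same partition number.
    SELECT eno, dno,
           MOD(dno, 3) AS partition_no
    FROM   emp;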

With key based partitioning on DNO, the EMP rows go to N1 (dno 10), N2 (dno 20), N3 (dno 30), matching the DEPT rows partitioned the same way, so the JOIN produces all the records.

DAY 12 Continues… Features of DataStage

 Partition Parallelism:
  o Re-Partition: means re-distributing the already distributed data.

For example, with EMP (ENO, EName, DNo, Loc: 111 naveen 30 AP / 222 munna 10 TN / 333 Sravan 20 KN / 444 Raju 10), the first partition is done by key based partitioning on dno for the JOIN with DEPT; then taking a separate column such as location (Loc) and re-distributing the already distributed data on that column for the next stage is known as Re-Partitioning.

o Reverse Partitioning:
   It is also called collecting.
   It is done in one case (one situation) only: "when the data moves from a parallel stage to a sequential stage" — the collecting happens only in this case.
   When designing a job in "stages", the connector is also called a link or pipe; it is the channel that moves the data from one stage to another stage.
   Example: S1, S2, S3 → SRC - TRSF - TRG; here the data is collected from the nodes N1, N2 … Nn (parallel files) into a single sequential file.
   There are four categories of collecting techniques: • Order • Round robin

• Sort - merge • Auto
Example for the collecting techniques, with partitions N1 (a, x), N2 (b, y), N3 (c, z): Order gives a x b y c z; Round robin (RR) gives a b c x y z; Sort merge (SM) gives a b c x y z in key order; Auto picks one of them.

 Pipeline Parallelism: "all pipes carry the data in parallel and the processes are done simultaneously."
  o In the server environment the execution process is called traditional batch processing.
  o For example, how the execution is done in the server environment: Extract (S1, 10 min) → HD → Transform (S2, 10 min) → HD → Load (S3, 10 min); here the execution takes 30 minutes to complete.
  o The same job in the parallel environment:

In the parallel environment all the pipes carry the data in parallel, processing the job simultaneously, and the execution takes only 10 minutes to complete. By using pipeline parallelism we can reduce the process time.

DAY 13 Differences between 7.5.x2 & 8.0.1

Differences:
  7.5.x2
  - Architecture Components: * Server Component * Client Component
  - 4 client components: * DS Designer * DS Manager * DS Director * DS Administrator
  - two-tier architecture

  8.0.1
  - Architecture Components: * Common User Interface * Common Repository * Common Engine * Common Connectivity * Common Shared Services
  - 5 client components: * DS Designer * DS Director * DS Administrator * Web Console * Information Analyzer
  - N-tier architecture

  - 7.5.x2: OS dependent with respect to users; capable of Phase-III & Phase-IV only; no web based administration; file based repository.
  - 8.0.1: OS independent with respect to users (but one-time dependent only); capable of all phases; web based administration through the web console (simply say, work from home); database based repository.

13.1. Client components of 7.5.x2:
- DS Designer: it is to create jobs, compile, run, and multiple-job compile. 4 types of jobs can be handled by DS Designer: • Mainframe jobs • Server jobs • Parallel jobs • Job sequence jobs
- DS Director: it can handle the list given below: • Schedule, Unlock, run, batch jobs • Views (job, status, logs) • Message Handling • Monitor
- DS Manager: it can handle the list given below: • Import and Export of repository components • Node Configuration
- DS Administrator: it can handle the list given below: • Create project • Delete project • Organize project

13.2. Client components of 8.0.1:

- DS Designer: it is to create jobs, compile, run, and multiple-job compile. 5 types of jobs can be handled by DS Designer: • Mainframe jobs • Server jobs • Parallel jobs • Job sequence jobs • Data quality jobs
- DS Director: same as shown above for 7.5.x2.
- DS Administrator: same as shown above for 7.5.x2.
- Web Console: the administration component through which the following are performed: • Security services • Scheduling services • Logging services • Reporting services • Session management • Domain manager
- Information Analyzer: it is also called the console for IBM INFO SERVER. It performs all the Phase-I activities: • Column analysis • Primary key analysis • Foreign key analysis • Base line analysis • Cross domain analysis
 As an ETL developer you will mostly come across DS Designer and DS Director; but some information should be known about the Web Console and DS Administrator.

DAY 14 Description of the 7.5.x2 & 8.0.1 Architectures

14.1. Architecture of 7.5.x2:
* Server Components: divided into 3 categories. They are a. Repository, b. Engine, c. Package Installer.
 Repository: it is also called the project or work area.
  o Here the repository is also the Integrated Development Environment (IDE); the IDE performs design, compile, run and save of jobs.
  o The repository organizes the different components in one area, which is called a collection of components. Some of the components are:  Jobs  Table definitions  Shared containers  Routines … etc.

  o The repository is for developing the application as well as storing the application.
 Engine: it executes the DataStage jobs and it automatically selects the partition technique.
  o Never leave any stage on auto? If we leave it on auto, it selects the auto partition technique, which can affect performance.
 Package Installer: this component contains two types of package installers — one is plug-ins and the other is packs.
  o Example: a printer needs an 1100 driver to be installed; the 1100 driver provides the interface between the computer and the printer. Here that interface is also called a plug-in.
  o Packs are used to configure DataStage for an ERP solution (ERP - SW - DS - Packs). The best example: normal Windows XP acquires Service Pack 2 for more capabilities.

* Client components: divided into 4 categories. They are a. DS Designer, b. DS Manager, c. DS Director,

d. DS Administrator.
 What these categories handle is shown above (i.e., under Day 13, page 39).

14.2. Architecture of 8.0.1:
1. Common User Interface: it is also called the unified user interface: a. Web Console, b. Information Analyzer, c. DS Administrator, d. DS Designer, e. DS Director. (It also checks security issues.)
2. Common Repository: it is divided into two types:
   a. Global repository: the DataStage job files are stored here.
   b. Local repository: individual files are stored here (it is for performance issues).
   o The common repository is also called the Meta Data SERVER; it holds three types:  Project level MD  Design level MD  Operation level MD
3. Common Engine: it is responsible for  Data Profiling analysis  Data Quality analysis  Data Transformation analysis
4. Common Connectivity: it provides the connections to the common repository.

Table representation of the "8.0.1 Architecture": the client tools (WC, IA, DE, DI, DA) sit on the common shared services, the common repository (MD SERVER with project / design / operation level MD), the common engine (DP, DQ, DT, DA) and the common connectivity.

DAY 15 Enhancements & New features of version 8

In version 8.0.1 there are 8 categories of stages.
 Processing stage:
  o New stages: 1. SCD (Slowly Changing Dimension), 2. FTP (File Transfer Protocol), 3. Surrogate Key stage — a newly introduced concept — and WTX (WebSphere TX).
  o Enhanced stage: 1. Lookup stage — previously the lookup had i. Normal lookup and ii. Sparse lookup; newly added are iii. Range lookup and iv. Case-less lookup.

 Data Base stage:
  o New stages:  IWAY  Classic Federation  ODBC Connector  NETEZZA
  o Enhanced stages: all stages' techniques are used with respect to the SQL Builder.
 Data Quality is an exclusively new concept of 8.0.1.
 The other stages are the same as in version 7.5.x2, i.e., no changes in this version.
 Palette of version 8.0.1: General, File, Restructure, Development & Debug — no changes; Data Base, Processing, Real Time — have changes; Data Quality — new.
  o The Data Base and Processing stages have some changes, as shown above.
  o The palette holds the shortcuts of the stages, from where we can drag and drop them onto the canvas to design the job.

:: Stages Process & Lab Work ::

DAY 16 Starting steps for the DataStage tool

To start DataStage on the system we must follow these steps to do a job.
• Five different steps make up the job development process (this is for designing a job).
 After starting: select DS Designer & enter uid: admin and pwd: **** (eg.: phil), & attach the appropriate project (Project\navs…).
 Palette -> (it's from the tool bar): General, Data Quality, Database, File, Development & Debug, Processing, Real Time, Restructure — this is where we pick the stages we design the job with.
 Designer Canvas (or Editor): select the appropriate stage in the palette and drag it onto the CANVAS, link the stages (give connectivity), and after that setting the properties is important. Eg: Seq to Seq.

 Save, compile and run the job.
 Run the Director (to see the views), or to view the status of your job.
 The palette contains all the stage shortcuts, i.e., 7 stage categories in 7.5.x2 & 8 in version 8.
 These stages are categorized into two groups: 1 –> Active Stage (whatever stage does the transformation is called an active stage), 2 –> Passive Stage (whatever stage extracts or loads is called a passive stage).
 Among the 8 categories we will use sequential-stage and parallel-stage jobs.

DAY 17 My first job creating process

Process:
 When the 8th version of DataStage is installed, five client component shortcuts are visible on the computer desktop:  Web Console  Information Analyzer  DS Administrator  DS Designer  DS Director
 Web Console: when you click it, it shows whether the server for DataStage has started or not; the currently running process shows at the left corner — a round symbol in green colour means started. If it does not start automatically, start it manually.

 DS Director: it is for viewing the status of the executed job, and for viewing the log, warnings and status.
 DS Administrator: it is for creating / deleting / organizing the projects.
 DS Designer: when you click on the designer icon it displays "the login page", and it asks you to attach the project for creating a new job, as shown below:
   o Domain: Localhost:8080, User Name: admin, Password: **** (eg. phil), Project: Teleco — OK / cancel.
   o If the server is not started, it displays a "the page cannot open" error (a repository interface error); if an error like that occurs, the server must be restarted before doing or creating jobs.
   o If authentication fails, the login is refused.
 After authentication it displays the Designer canvas, and it asks which type of job you want to do; they are:

 Mainframe  Parallel  Sequential  Server jobs
 After clicking on parallel jobs, go to the tool bar - View - Palette; in the palette the 8 types of stages are displayed for designing a job:  General  Data Quality  Data Base  File  Development & Debug  Processing  Real Time  Re-Structure

17.1. File Stage:
Q: How can data be read from files?
 The file stage can read only flat files, and the formats of flat files are .txt, .csv, .xml.
 In .txt there are different record formats like fwf, sc, s & t, H & T and so on.
 .csv means comma separated values.
 .xml means extensible markup language.
 In the File stage there are sub-stages like the sequential file stage, data set, file set and so on.
 Example of how a job can execute: one sequential file (SF) to another SF — Source → Target.

o The source file requires target/output properties, and
o the target file requires input/source properties.
In the source file, how do we read a file? On double clicking the source file, we must set the properties as below:
   File name (browse the file name)  Location (example: in c:\)  Format (.txt, .csv, .xml)  Structure (\\ meta data)

General properties of the sequential file:
1. Select a file name: File: c:\data\se_source_file.txt (browse button). The option File: \? is for multiple purposes (wildcards).
2. Format selection: as per the input file taken, the data must be in the given format — like "tab / space / comma" — and one of them must be selected.
3. Column structure defining: to get the structure of the file, load it — Import → Sequential file → browse the file and import → select the imported file → define/load the structure.
 These three are the general properties when we design a simple job.

DAY 18 Sequential File Stage

The sequential file stage is also said to have "output properties" (as a source) and "input properties" (as a target). Only for a single structure format do we use the sequential file stage.

About the Sequential File Stage and how it works:
 Step 1: the sequential file is a file stage; it reads flat files with different extensions (.txt, .csv, .xml).
 Step 2: the SF reads/writes sequentially by default when it reads/writes from a single file, and it reads/writes in parallel when it reads/writes to or from multiple files.
 Step 3: the sequential stage supports one input (or) one output, and one reject link.

 Link: a link is also a stage; it transfers data from one stage to another stage. The links are divided into categories:  Stream link  Reject link  Reference link
 Link Marker: it shows how the link behaves during the transmission from source to target.
1. Ready BOX: it indicates that "a stage is ready with Meta Data", when data transfers from a sequential stage to a sequential stage.
2. FAN IN: it indicates "data transferring from a parallel stage to a sequential stage"; it appears when collecting happens.
3. FAN OUT: it indicates "data transferring from a sequential stage to a parallel stage"; it is also called auto partition.
4. BOX: it indicates "data transferring from a parallel stage to a parallel stage"; it is also known as partitioning.


5. BOW - TIE: it indicates "data transferring from a parallel stage to a parallel stage" with re-distribution; it is also known as re-partitioning.

Link Colors: the link colour indicates the state of the process during execution of a job.
 BLACK: a link in BLACK colour means "the stage is ready".
 RED: a link in RED colour means case 1: a stage is not connected properly, or case 2: the job aborted.
 BLUE: a link in BLUE colour means "the job execution is in process".
 GREEN: a link in GREEN colour means "the execution of the job finished".

NOTE: "A stage is an operator; an operator is a pre-built component." The stage imports the import operator for the purpose of creating data in the Native Format; the Native Format is the format DataStage understands.

Compile: a compile translates source code to target code (like C: HLL → .OBJ → .EXE; *HLL - High Level Language, *ALL - Assembly Level Language, *BC - Binary Code, *MC - Machine Code).

 Compiling process in DataStage: GUI → OSH code (& C++) → .OBJ → .EXE. *OSH - Orchestrate Shell Script.
Note: the Orchestrate Shell Script is generated for every stage except one, i.e., the Transformer stage, which is done by C++.

 While compiling, it checks for:
   Link Requirements (checks the links)
   Mandatory stage properties
   Syntax Rules

DAY 19 Sequential File Stage Properties

Properties:
 Read Method: two options —
   o Specific File: the user or client gives each file name specifically.
   o File Pattern: we can use wild card characters and search by pattern, i.e., * & ?. For example: C:\eid*.txt or C:\eid??.txt.
 Reject Mode: to handle "format / data type / condition" mismatched records. Three options —
   o Continue: drops the mismatches and continues with the other records.
   o Fail: the job is aborted.
   o Output: captures the dropped data through a reject link to another sequential file.
 Missing File Mode: used if any file name is missing. Two options —
   o OK: drops the file name when it is missing.
   o Error: if a file name is missing it aborts the job.
 First line is column names: true/false. If true, the first line is used as the column (field) names; if false, the first line is read as an ordinary data record.
 Row Number Column: "source record number at the target" — it directly adds a new column to the existing table giving, for every record, its source record number.
 File Name Column: "source information at the target" — it gives information about which record came from which address (file) on the local server.

It also directly adds a new column to the existing table and displays the value in that column.
 Read First Rows: "will get you the top first n records"; the Read First Rows option asks you to give an n value to display that number of records.
 Filter: "blocking unwanted data based on UNIX filter commands", like grep, egrep, … and so on. For example:
   o grep "moon"        \\ it is case sensitive; it displays only the records containing moon.
   o grep -i "moon"     \\ it ignores the case; it displays all moon records.
   o grep -w "moon"     \\ it displays only exact-match records.
 Read From Multiple Nodes: we can read the data in parallel using the sequential stage — reading in parallel is possible, but loading (writing) in parallel is not.
 LIMITATIONS of SF:
   o It should be sequential processing (it processes the data sequentially).
   o Memory limit 2 GB (.txt format).
   o The problem with the sequential file is the conversions (ASCII ↔ Native Format).

DAY 20 General settings of DataStage and about the Data Set

Default setting for starting up with a parallel job: Tools → Options → select a default; when creating a new job it asks which type of job you want (the types of jobs are mainframe / parallel / sequential / server). After setting the above, when we restart the DS Designer it goes directly to the designer canvas.
 According to naming standards every stage has to be named. Naming a stage is simple: just right click on the stage, the rename option is visible, and name the stage as per the naming standards.

General Stage: in this category some of the stages are used for commenting on a stage — what it does or what it performs — i.e., simply giving comments for a stage.
 Let us discuss Annotation & Description Annotation:
   Annotation: it is for a stage comment.
   Description Annotation: it is used for the job title (only one title can be kept per job).

Q: In which format is the data sent between the source file and the target file?
A: If we send a .txt file from the source it is in ASCII format, because a .txt file supports only ASCII, while DataStage supports only its Native Format. When we convert ASCII code into NF, the source needs to import an import operator — the ASCII code is converted into the Native Format that DataStage understands. When we convert NF back into ASCII, the target needs an export operator, and at the target the data is converted back into .txt (ASCII) so it is visible to the user/client.

Data Set (DS): "it is a file stage, and it is used for staging the data when we design dependent jobs."
 In the Data Set the data lands in the "Native Format"; the "Native Format" (in the links) is also called a Virtual Dataset.
 By default the Data Set sends the data in parallel.
 The Data Set overcomes the limitations of the sequential file stage, for better performance: more than 2 GB, and the data lands/resides inside the DataStage repository.

Q: How is the conversion easy in the Data Set?
 No conversion is needed, because the Dataset data directly resides in the Native Format — DataStage reads only the orchestrate format.
 The data lands in the DataStage repository. The Data Set extension is *.ds (e.g., trg_f.ds).
 We must also save the structure of trg_f.ds (for example as st_trg) for reuse: we can use the saved "trg_f.ds" file name and the saved structure of the target in another job, e.g., copying the structure st_trg & trg_f.ds into a dependent job.


DAY 21 Types of Data Set (DS)

There are two types of Data Set:  Virtual (temporary)  Persistency (permanent)
-

Virtual: it is a Data Set stage that the data moves in the link from one stage to another stage i.e., link holds the data temporary. Persistency: means the data sending from the link it directly lands into the repository. That data is permanent.

-

Alias of Data Set: o ORCHESTRATE FILE o OS FILE Q: How many files are created internally when we created data set? A: Data Set is not a single file; it creates multiple files when it created internally. o Descriptor file o Data file o Control file o Header file
 Descriptor File: it contains schema details and address of data. 

Data File: consists of data in the Native Format and resides in DataStage repository.

 Control File:

Navs notes

Page 61

2010

DataStage
It resides in the operating system and both acting as interface between descriptor file and data file.  Physical file means it stores in the local drive/ local server.
 Permanently stores in the install program files c:\ibm\inser..\server\dataset{“pools”}
2010

 Header File:

Q: How can we organize Data Set to view/copy/delete in real time and etc., A: Case1: we can’t directly delete the Data Set Case2: we can’t directly see it or view it.
 Data Set organizes using utilities.

o Using GUI i.e., we have utility in tool (dataset management) o Using Command Line: we have to start with $orachadmin grep “moon”;  Navigation of organize Data Set in GUI: o Tools

Dataset Management File_name.ds(eg.: dataset.ds)

o Then we will see the general information of dataset     Schema window Data window Copy window Delete window

 At command line o $orachadmin rm dataset.ds (this is correct process) \\ this command for remove a file o $rm dataset.ds (this is wrong process) \\ cannot write like this o $ds records \\ to view files in a folder

Navs notes

Page 62

DataStage
Q: What is the operator which associates to Dataset: A: Dataset doesn’t have any operator, but it uses copy operator has a it’s operator.
2010

Dataset Version:
-

Dataset have version control Dataset has version for different DataStage version Default version in 8 is it saves in the version 4.1 i.e., v41

-

Q: how to perform version control in run time? A: we have set the environment variable for this question.  Navigation for how to set a environment variable.  Job properties o Parameters  Add environments variable Compile
o

Dataset version ($APT_WRITE_DS_VERSION)  Click on that.

 After doing this when we want to save the job, it will ask whether which version you

want.

Navs notes

Page 63

Data Set have more performance than File Set. - File stage is same to design in dependent jobs. Data Set & File Set are same.DataStage DAY 22 File Set & Sequential File (SF) input properties File Set (FS): “It is also a staging the data”.fs extension  Copy (file name) operator  Native format  .ds files saves  But. Navs notes Page 64 2010 . but having minor differences The differences between DS & FS are shown below Data Set  Having parallel extendable capabilities  More than 2 GB limit  NO REJECT link with the Dataset  DS is exclusively for internal use DataStage environment File Set  Having parallel extendable capabilities  More than 2GB limit  REJECT LINK with in File Set -  External application create FS we use the any other application  Import / Export operator  Binary Format  .

Cleanup on failure 3. Reject Mode File Update Mode: having three options – append/create (error if exists)/overwrite o Append: when the multiple file or single file sending to sequential target it’s appends one file after another file to single file. File update mode 2. o o Create (error if exists): just creating a file if not exist or given wrong.  Setting passing value in Run time(for file update mode) o Job properties  Parameters Add environment variables o Parallel  Automatically overwrite ($APT_CLOBBER_OUTPUT) Cleanup on Failure: having two options – true/false.  True – the cleanup on failure option when it is true it adds partially coded or records.  False – it’s simple appends or overwrites the records. Navs notes Page 65 .DataStage Sequential File Stage: input properties Setting input properties at target file. Overwrite: it’s overwriting one file with another file. and at target there have four properties 2010 1.  True – it is enable the first row or record as a fields of column  False – it is simple reads every row include first row read as record. Its works only when “file update mode” is equal to append. First Line in Column Names: having two options – true/false. First line in column names 4.

Row Generated Data b.DataStage Reject mode: here reject mode is same like as output properties we discussed already before. Stages that Generated Data: Row Generator Data: “It having only one output” Navs notes Page 66 . Column Generated Data 2. Tail c. The stage that helps in Debugging: a. Simple 3. 2010  Continue – it just drops when the format/condition/data type miss match the data and  Fail – it just abort the file when format/condition/data type miss match were found. In this we have three options – continue/fail/output. they are 1. Stage that Generated Data: a.1. Head b. continues process remain records. Peek  Simply say in development and debug we having 6 types of stages and the 6 stages where divided into three categories as above shown.  Output – it capture the drops record data. The stage that used to Pick Sample Data: a. 23. DAY 23 Development & Debug Stage The development and debug stage having three categories.

For example n=30 Data generated for the 30 records and the junk data also generated considering the data type. Q: how to generate User define value instead of junk data? A: first we must go to the RG properties Column o Double click serial number or press ctrl+E  Generator Navs notes Page 67 . To make job design simple that shoots for jobs. o 2010 - o When client unable to give the data. o For doing testing purpose. - Row Generator can generate the junk data automatically by considering data type.DataStage - The row generator is for generating the sample data. Some cases are. Row Generator design as below: ROW Generator Navigation for Row Generator: Opening the RG properties Properties DS_TRG o Number of records = XXX( user define value) Column o Load structure or Meta data if existing or we can type their. In this having only one property and select a structure for creating junk data. in some cases it is used. or we manual can set a some related understandable data by giving user define values.

DataStage • Type = cycle/random (it is integer data type) In integer data type we have three option 2010 • Under cycle type: There are three types of cycle generated data Increment. seed. and limit. Navs notes Page 68 . Q: when we select signed? A: it going to generate signed values for the field (values between –limit and +limit). otherwise generate values between 0 and +limit. Q: when we select initial value=30? A: it starts from 30 only. Q: when we select seed=XX. Q: when we select limit=20? A: it is going to generate up to limit number in a cycle form. A: it is going to generate the junk data for random values. Initial value. Under Random type: There are three types of random generated data – limit. Column Generator Data: “it having the one input and one output” Main purpose of column generator to group a table as one. and signed. Q: when we select limit=20? A: it going to generate random value up to limit=20 and continues if more than 20 rows. Q: when we select increment=45? A: it going to generate a cycle value of from 45 and after adds every number with 45.

- Column o We can change data type as you require. In the output. dropping created column into existing table. 2010 Here mapping should be done in the column generated properties. Stage o Options   Column to generate =? And so on we can give up to required. For manual we can generate some meaning full data to extra column’s Navigation for manual: o Column  Ctrl+E • Generator - Navs notes Page 69 . - The junk data will generate automatically for extra added columns. and for mapping we drag simple to existing table into right side of a table.DataStage - And by using this we add extra column for the added column the junk data will be generated in the output. To open the properties just double clicking on that. means just drag and Sequential file - Column Generator DataSet Coming to the column generator properties. Navigation: - Output o Mapping  After adding extra column it will visible here.

Pick sample data: “it is a debug stage. Cycle is same like above shown in row generator.1. o In the head stage mapping must and should do. there are three types of pick sample data”. Q: when we select alphabet where string=naveen? A: it going to generate different rows with given alphabetical wise. string.. value 2010 o Alphabet – it also have only one option i. SF_SRC  Properties of Head: o Rows HEAD DS_TRG Navs notes Page 70 . Head Tail Sample  Head: “it reads the top ‘n’ records of the every partition”.e.DataStage o Algorithm = two options “cycle/ alphabet” o Cycle – it have only one option i..e. DAY 24 Pick sample Data & Peek 24. o It having one input and one output.

o Mainly we must give the value for “number of rows to display”  Sample: “it is also a debug stage consists of period and percentage” o o Period: means when it’s operating is supports one input and one output. which must be specified.DataStage  All Rows(after skip)=false - It is to copy all rows to the output following any requested skip 2010 positioning  Number of rows(per partition)=XX o Partitions  All partition = true - It copy number of rows from input to output per partition. o In this stage mapping must and should do. That mapping done in the tail output properties. SF_SRC  Properties of Tail: TAIL_F DS_TRG o The properties of head and tail are similar way as show above.  Tail: “it is debug stage. that it can read bottom ‘n’ rows from every partition” o Tail stage having one input and one output. Percentage: means when it’s operating is supports one input and multiple of outputs. True: copies row from all partitions False: copies from specific partition numbers. Navs notes Page 71 .

o Coming to the properties  Options - Percentage = 25 and we must set target =1 Percentage = 50 . target = 2 o Here we setting target number that is called link order. o Link Order: it specifies to which output the specific data has to be send.DataStage SF_SRC SAMPLE DS_TRG  Period: if I have some records in source table and when we give ‘n’ number of period value it displays or retrieves the every nth record from the source table. Target1 Target2 SF_SRC SAMPLE Navs notes Page 72 2010 . target = 0 Percentage = 15 .  Skip: it also displays or retrieves the every nth record from given source table.  Percentage: it reads from one input to multiple outputs. o Mapping: it should be done for multiple outputs.

It can use as copying the data from Source to multiple outputs. as per options Navs notes Page 73 2010 . o In the percentage it distributes the data in percentage form. 3. Q: How to send the data into logs?  Opening properties of peek stage. It considers 90% as 100% and it distributes as we specify.2. And it can use as stub stage. When sample receives the 90% of data from source. we must assign o Number of row = value? o Peek record output mode = job log and so on. 2. PEEK: “it is a debug stage and it helps in debugging stage” SF_SRC It is used in three types they are PEEK 1. 24.DataStage Target3 NOTE: sum of percentage of all outputs must be less than are equal to ‘<=’ to ‘n’ records of input records. Send the data into logs.

 For seeing the log records that we stored. 2010 o In DS Director  From Peek – log – peek . DAY 25 Database Stages In this stage we have use generally oracle enterprise. it doesn’t shows the column in the log. because in some situations a client requires only dropped data. In that time the peek act as copy stage.We see here ‘n’ values of records and fields Q: When the peek act as copy stage? A: It is done when the sequence file it doesn’t send the data to multiple outputs. 25. and its sends the rejected data to the another file. Q: What is Stub Stage? A: Stub Stage is a place holder. and dynamic RDBMS and so on. Tara data with ODBC.DataStage o If we put column name = false. it reads tables from the oracle data base from source to the target” o Oracle enterprise reads multiple tables from. but it loads in the one output. Oracle Enterprise: “Oracle enterprise is a data base stage. In that time the stub stage acts as a place holder which holds the output data as temporary.1. Oracle Enterprise o Properties of Oracle Enterprise(OE): Data Set Navs notes Page 74 . ODBC enterprise.

If table not in the not their in plug-in. • • • •  Select load option in column Going to the table definitions Than to plug-in Loading EMP table from their.DataStage  Read Method have four options • • • •  •  • • • Auto Generated \\ it generated auto query 2010 SQL Builder \\ its new concept apart comparing from v7 to v8. If we select table option Table = “<table name>” Connection Password = ***** User = Scott Remote server = oracle o Navigations for how the data load to the column  This is for already data present in plug-in. Table \\ giving table name here User Defined \\ here we are giving user defined SQL query. • • • Select load option in column Then we go to import Import “meta data definition” o Select related plug-in   Oracle User id: Scott Navs notes Page 75 .

Data connection: its main purpose is reusing the saved properties. Q: A table containing 300 records in that. I need only 100 fields from that? A: In read method we use user-defined SQL query to solve this problem by writing a query for reading 100 records.DataStage   • Password: tiger After loading select specific table and import. Q: How to reuse the saved properties? A: navigation for how to save and reuse the properties  Opening the OE properties o Select stage  Data connection • There load saved dc Navs notes Page 76 . It is totally automated. we can auto generate the query by that we can use by coping the query statement in user-defined SQL.5.  But by the first read method option.x2 we don’t have saving and reusing the properties. NOTE: in version 7. 2010 After importing into column. in define we must change hired date data type as “Time Stamp”. Q: What we can do when we don’t know how to write a select command? A: Selecting in read method = SQL Builder  After selecting SQL Builder option from read method o Oracle 10g o From their dragging which table you want o And select column or double clicking in the dragged table   There we can select what condition we need to get.
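For instance, such a user-defined SQL query could be written as follows – a minimal sketch for Oracle, assuming the table is EMP and using the standard ROWNUM pseudo-column:
select * from EMP where ROWNUM <= 100
The same statement can be pasted into the User Defined read method, or first auto-generated / built with SQL Builder and then edited by hand.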

DataStage o Naveen_dbc \\ it is a saved dc o Save in table definition. Oracle Enterpris e Navs notes ODBC Enterpris e ORACLE DB Page 77 OS . But ODBC needs OS drivers to hit oracle or to connect oracle data base. When DataStage version7 released that time the oracle 9i provides some drivers to use. 2010 DAY 26 ODBC Enterprise ODBC Enterprise is a data base stage About ODBC Enterprise:  Oracle needs some plug-ins to connect the DataStage.  When coming to connection oracle enterprise connects directly to oracle data base.

Q: How database connect using ODBC? ODBCE First step: opening the properties of ODBCE  Read method = table o Table = EMP  Connection Data Set o Data Source = WHR \\ WHR means name of ODBC driver Navs notes Page 78 2010 .DataStage Directly hitting Use OS drivers to hit the oracle db  Difference between Oracle Enterprise (OE) and ODBC Enterprise OE  Version dependent  Good performance  Specific to oracle  Uses plug-ins  No rejects at source ODBCE  Version independent  Poor performance  For multiple db  Uses OS drivers  Reject at SRC &TRG.

 ODBCE read sequentially and load ODBC  It provides the list have in ODBC DSN.DataStage o Password = ****** o User = Scott  Creating of WHR ODBC driver at OS level.  Using ODBC Connector is quick process as we compare with ODBCE.  Best Feature by using ODBC Connector is “Schema reconciliation”.  Differences between ODBCE and ODBC Connector.  It read parallel and loads parallel (good performance). That automatically handles data type miss match between the source data types and DataStage data types. o Administration tools  ODBC • Add o MS ODBC for Oracle    Giving name as WHR Providing user name= Scott And server= tiger.  In the ODBCE “no testing the connection”. Navs notes Page 79 . ODBCE Connector  It cannot make the list of Data Source Name (DSN).  In this we can test the connection by test button. to over this ODBC connector were introduced. 2010  ODBCE driver at OS level having lengthy process to connect.

MS Excel with ODBCE:  First step is to create MS Excel that is called “work book”.1.  Connections o DSN = EXE o Password = ***** o User = xxxxx  Column o Load  Import ODBC table definitions • • Navs notes DSN \\ here select work book User id & password Page 80 .  For example CUST work book is created Q: How to read Excel work book with ODBCE? A: opening the properties of ODBCE  Read method = table o Table = “empl$” \\ when we reading from excel name must be in double codes end with $ symbol. It’s having ‘n’ number of sheets in that.DataStage  Properties of ODBC Connector: o Selecting Data Source Name DSN = WHR 2010 o User name = Scott o Password = ***** o SQL query 26.

2. o And in OS also we must start  Start ->control panel ->Administrator tools -> services -> • Tara Data db initiator \\ must start here o Add DSN in ODBC drivers   Select Tara data in add list We must provide details as shown below • • • User id = tduser Password = tduser Server : 127.csv 26. which use as a data base. Tara Data with ODBCE:  Tara Data is like an oracle cooperation data base.DataStage o Filter \\ enable by click on include system tables o And select which you need & ok 2010  In Operating System o Add in ODBC  MS EXCEL drivers • Name = EXE \\ it is DSN Q: How do you read Excel format in Sequential File? A: By changing the CUST excel format into CUST.0.0. Q: How to read Tara Data with ODBC A: we must start the Tara Data connection (by clicking shortcut).1  After these things we must open the properties of ODBCE o Read method = table  Table = financial.customer Navs notes Page 81 .

it is also called as DRS”  It supports multiple inputs and multiple outputs Navs notes Page 82 .0.DataStage o Connections     Column o Load  Import • • • • Table definitions\plug-in\taradata Server: 127.1 Uid = tduser Pwd = tduser DSN = tduser 2010 Uid = tduser Pwd = tduser  After all this navigation at last we view the data.1. Dynamic RDBMS: “It is data base stage. which we have load in source.0. DAY 27 Dynamic RDBMS and PROCESSING STAGE 27.

but we can’t load into multiple files.DataStage Ln_EMP_Data Data Set DRS Ln_DEPT_Data Data Set  It all most common properties of oracle enterprise.e. oracle o Oracle   o Scott Tiger \\ for authentication At output   Ln_EMP_Data \\ set emp table here And Ln_DEPT_Data \\ set dept table here o Column  Load • Meta data for table EMP & DEPT.  Coming to DRS properties o Select db type i. Navs notes Page 83 2010 ..  In oracle enterprise we can read multiple files.  We can solve this problem with DRS that we can read multiple files and load in to multiple files.

Some other database stages:
o Netezza can be used only as a target, so only its input properties are set.
o iWay can be used only as a source, so only its output properties are set.
27.2. Processing Stage: there are 28 processing stages, but we generally use 10 of them, and these 10 stages are very important. They are:
1. Transformer 2. Look Up 3. Join 4. Copy 5. Funnel 6. Remove Duplicates 7. Slowly Changing Dimension 8. Modify 9. Sort 10. Surrogate Key
27.3. Transformer Stage: the symbol of the Transformer Stage is shown in the job design below.

COMM \\ we can write by write clicking their  It visible in input column\function\ and so on. Oracle Enterprise Transformer Data Set Here.  Transformer Stage is “all in one stage”.e. we can write derivation here. source field and structure available mapping should be do. For example.  After that when we execute the null values records it drops and remaining records it sends to the target.SAL + IN. Navs notes Page 85 .COMM) o By this derivation we can null values records as target. IN.DataStage A simple query that we solving by using transformer i.SAL + NullToZero (IN. That column we name as NETSAL By double clicking on the NETSAL. setting the connection and load Meta data in to column here.  Properties of Transformer Stage: o For above question we must create a column to write description     In the down at output properties clicking in empty position. 2010 Q: calculate the salary and commission of an employee from EMP table. o For this we can functions in derivation  IN.
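As a rough analogy only (the derivation itself lives in the transformer, not in SQL), the same NETSAL calculation could be written in Oracle SQL with NVL, assuming EMP has columns like EID, ENAME, SAL and COMM:
select EID, ENAME, SAL + NVL(COMM, 0) as NETSAL from EMP
NullToZero plays the same role here as NVL(COMM, 0): it turns a null commission into 0 before the addition.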

how to include this logic in derivation? A: adding THome column in output properties. IN. so the best way to over this problem is Stage Variable.DataStage Q: NETSAL= SAL + COMM +200. Stage Variable: “it is a temporary variable which will holds the value until the process completes and which doesn’t sent to the result to output”  Stage variable is shown in the tool bar of transformer properties.  After clicking that it visible in the input properties In stage variable we must add a column for example.SAL + NullToZero (IN.SAL + NullToZero (IN.SAL + NullToZero (IN.COMM))> 2000   Then (IN.  Adding these derivations to the input properties to created columns.COMM) ) + 200 o By this logic it takes more time in huge records. DAY 28 Transformer Functions-I Examples on Transformer Functions: Navs notes Page 86 .SAL + NullToZero (IN. NS Variables to adding column 1 NS 0 integer 4 0 After adding NS column  To NS column including the derivation.  In THome derivation part we include this logic o 2010 Logic: if NETSAL > 2000 then TakeHome = NETSAL – 200 else TakeHome = NETSAL If (IN.COMM).COMM)) – 200 Else (IN. o NETSAL = NS o THome = if (NS > 2000) then (NS -200) else (NS + 200).

The transformer functions covered here are: 1. Left Function 2. Right Function 3. Substring Function 4. Concatenate Function 5. Field Function 6. Constraints Function (Filter).
Filtering can be done in DataStage in 3 different ways: 1. Source level 2. Stages (filter, switch, external filter) 3. Constraints (transformer, lookup).
Constraints: "In the transformer, constraints are used as a filter; a constraint is also called a filter."
Q: how is a constraint used in the Transformer?
A: in the transformer properties we see a constraints row on the output link; the derivation can be written there by double clicking it.
For example, from the word MINDQUEST we need only QUE:
o Right function: Right(Left("MINDQUEST", 7), 3)
o Left function: Left(Right("MINDQUEST", 5), 3)
o Substring: "MINDQUEST"[5,3]
Differences between the Basic transformer and the Parallel transformer:
o Basic Transformer – it affects performance; it can call routines written in BASIC and shell; it supports a wide range of languages; it can execute only up to SMP.
o Parallel Transformer – it does not affect performance, but it affects compile time; it can execute on any platform.

right.28 HINVC43205CID67632120080405EUO TPID5630 8 1657. separating by using left.13 TPID5637 1 2343.57 TPID5635 6 9564.00 TPID5645 2 7855.99 TPID5655 4 2861.69 TPID5657 7 6218.64 Design: IN1 IN2 Navs notes Page 88 2010 .DataStage NOTE: Tx is very sensitive with respect to Data Types.txt HINVC23409CID45432120080203DOL TPID5650 5 8261.67 TPID5657 9 7452. if an source and target be cannot different data types.96 HINVC12304CID46762120080304EUO TPID5640 3 5234. Q: How the below file can read and perform operation like filtering. substring functions and date display like DD-MM-YYYY? A: File.

here creating four column and separating the data as per created columns. 9) CID IN2. 1) Left (Right (IN2. we are creating two columns TYPE and DATA. Here. Step 2: IN1 Tx.DATA [20. DS IN1 REC IN1 CONSTRAINT Left (IN1.REC.txt into sequential file. 1) IN1. in the properties of sequential file loading the whole data into one record.REC. 8] INVCNO Page 89 2010 .REC DATA TYPE Step 3: IN2 Tx properties. IN2 TYPE DATA Navs notes IN3 Left (IN1.REC. 21).DataStage SF Tx1 IN3 Tx2 OUT Tx3 Total five steps to need to solve the given question: Step 1: Loading file.DATA.1)=”H” IN2 Derivation Column Left (IN1. Means here creating one column called REC and no need of loading of Meta data for this. in this step we are filtering the “H” staring records from the given file.Properties.

CID CID D:’-‘: M:’-‘: Y Step 5: here. DAY 29 Transformer Functions-II Examples on Transformer Functions II: Navs notes Page 90 2010 . setting the output file name for displaying the BILL_DATE.BILL_DATE. Stage Variable IN3 INVCNO CID BILL_DA TE CURR Derivation Column Right (IN3. 4) D M Y OUT Derivation Column IN3.BILL_DATE.DataStage Derivation Column Step 4: IN3 Tx properties. here BILL_DATE column going to change into DD-MM-YYYY format using Stage Variable. 6). 2) Left (IN3.INVCNO INVCNO IN3. 2) Right (Left (IN3.BILL_DATE.

1. Trim: "it removes all special characters".
2. Trim B: "it removes all trailing (after) spaces".
3. Trim F: "it removes all leading (before) spaces".
4. Trim T & L: "it removes all trailing and leading spaces".
5. Strip White Spaces: "it removes all spaces".
6. Compact White Spaces: "it removes the spaces before, after, and in between (middle)".
7. Field Function: "it separates the fields using delimiter support".
Q: A file.txt consists of special characters, comma delimiters and spaces (before, after, and in between). How can it be handled with the above functions so that at last it becomes one record?
File.txt
EID, ENAME, STATE
111, NaVeen, AP
222@, MUnNA, TN
@333, Sra van, KN@
444, @ San DeeP, KN
555, anvesh, MH
Design (figure): SF -IN1-> Tx -IN2-> Tx (the remaining links are shown with the steps below).
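Before walking through the job steps, a quick hedged illustration of what these functions do to one of the records above:
Input record : 222@ , MUnNA , TN
After Trim (to remove '@'), Strip White Spaces and UpCase : 222 , MUNNA , TN
The cleaned fields are then concatenated back into a single record in the later steps.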

EID.REC.””) EID Upcase(Trim(SWS(IN2.’.DataStage IN3 2010 OUT Tx Total Five steps to solve the File.3) EID ENAME STATE Step 3: IN2. Tx properties  In link IN1 having the REC.ENAME.’. using field functions..”@”.’. IN2 IN3 Derivation Column EID ENAME STATE Navs notes Trim(IN2.txt using above functions: Step 1: Here. IN1 REC Derivation DS IN2 Column Field(IN2.REC.e. spaces. to remove special characters. lower cases into upper cases by using the trim.REC.1) Field(IN2.’.’.””)) ENAME Page 92 . Strip Whitespaces (SWS). Tx properties  Here. Up case functions.txt and setting into all data into one record to the new column created that REC.’. that REC to divide into fields by comma delimiter i. no need of load meta data to this.2) Field(IN2. Step 2:IN1. extracting the file.”@”.  Point to remember keep that first line is column name = true.

Tx properties  Here. Re-Structure Stage: 1.STATE REC Step 5:  For the output. IN3 OUT Derivation Column EID ENAME STATE IN3.ENAME: IN3.DataStage Step 4: IN3. spaces were removed after doing are implementing the transformer functions to the above file.EID: IN3.ds REC 111NAVEEN AP 222 MUNNATN 333SRAVAN KN 444SAN DEEPKN 555 ANVESHMH 29. And at last the answer will display in one record but all special characters. Column Import Column Export: Navs notes Page 93 2010 .txt.1. all rows that divided into fields are concatenating means adding all records into one REC. here assigning a target file. Column Export 2. Final output: Trg_file.

DataStage  “it is used to combine the multiple of columns into single column” and it is also like concatenate in the transformer function. o Input     o Output   Column Import:  “it is used to explore from single column into multiple columns” and it is also like field separator in the transformer function.  Properties: o Input   o Output     Import column type = “varchar” Import output column= EID Import output column= ENAME Import output column= STATE DAY 30 JOB Parameters (Dynamic Binding) Column method= Column To Import = REC Export column type = “varchar” Export output column = REC Column method = explicit Column To Export = EID Column To Export = ENAME Column To Export = STATE 2010  Properties: Navs notes Page 94 .

this is up to version7. To give runtime values for user ID.  Job parameters are divided into two types.  But there is no need for giving the authentication to oracle are to be static bind. it is divided into two types. NOTE: “The local parameters that created one job they cannot be reused in other job. it is also called dynamic binding”. it can use with in the 2010 dynamic binding”.  Global Variables: “it is also called as environment variables”. and remote server? Navs notes Page 95 . Under parallel compiler.DataStage Dynamic Binding: “After compiling the job and passing the values during the runtime is known as  Assuming one scenario that when we taking a oracle enterprise. But in version7 we can also reuse parameters by User Define values by DataStage Administrator. we must provide the table and load its meta data. job only”. o Existing: comes with in DataStage. For this we can use job parameters that can provide values at runtime to authenticate. in this two types one general and another one parallel. reporting will available. Job parameters: “job parameters is a technique that passing values at the runtime. because of some security reasons. they are o Local variables o Global Variable  Local variables (params): “it is created by the DS Designer only. Here table name must be static bind. operator specific. password. o User Defining: it is created in the DataStage administrator only. But coming to version8 we can reuse them by technique called parameter set”. They are. Q: How to give Runtime values using parameters for the following list? a.

Providing target file name at runtime? e. d are represents a solution for the given question. Navs notes Page 96 . c. Re-using the global and parameter set? Design: 2010 c. a.DataStage b.  Job parameters o Parameters Name  a b c DNAME USER Password SERVER DEPT BONUS DRIVE FOLDER TARGET Type string Encrypted String List Integer String String String Default value SCOTT ****** ORACLE 10 1000 C:\ Repository\ dataset. Add BONUS to SAL + COMM at runtime? ORACLE Step1: Tx Data Set “Creating job parameters for given question in local variable”. b. Step 2:“Creating global job parameters and parameter set”.ds UID PWD RS DNO BONUS IP FOLDER TRG FILE      d   Here. Department number (DNO) to keep as constraint and runtime to select list of any number to display it? d.

 In local variables job parameters o Select multiple of values by clicking on  And create parameter set • Providing name to the set o SUN_ORA  Saving in Table definition • In table definition Navs notes Page 97 . and TEST”. we must o Add environment variables  User defined • • • UID $UID PWD $PWD RS $RS Step 3: “Creating parameter set for multiple values & providing UID and PWD other values for DEV. global parameters are preceded by $ symbol.DataStage  DS Administrator o Select a project • 2010  Properties General o Environment variables  User defined (there we can write parameters) Default value SCOTT ****** ORACLE Name UID PWD RS DNAME USER Password SERVER Type string Encrypted String  Here. PRD.  For Re-use.

UID SUN_ORA. Step 4: “In oracle enterprise properties selecting the table name and later assign created job parameter as shown below”.DataStage o Edit SUN_ORA values to add Name DEV PRD TEST UID SYSTEM PRD TEST PWD ****** ****** ****** SERVER SUN ORACLE 2010 MOON  For re-using this to another job.RS UID PWD Local variables global environment Navs notes Page 98 . Properties:  Read method = table o Table = EMP  Connection o Password = #PWD# o User = #UID# o Remote Server = #RS# Column:  Load o Meta data for EMP table Parameters Insert job parameters $UID $PWD variables $RS SUN_ORA.PWD parameter set SUN_ORA. o Add parameters set (in job parameters)  Table definitions • Navs o SUN_ORA(select here to use) NOTE: “Parameter set use in the jobs with in the project only”.

Step 5: "In the Tx properties, DEPTNO is used as a constraint and the bonus is assigned to a BONUS column."
Stage variable: NS = IN.SAL + NullToZero(IN.COMM)
Output link constraint: IN.DEPTNO = DNO
Output derivations: EID = IN.EID; ENAME = IN.ENAME; NETSAL = NS; BONUS = NS + BONUS

Here, DNO and BONUS are the job parameters we have created above to use here. For that simply right click->job parameters->DNO/BONUS (choose what you want) Step 6: “Target file set at runtime, means following below steps to follow to keep at runtime”.  Data set properties o Target file= #IP##FOLDER##TRGFILE# Here, when run the job it asks in what drive, and in which folder. At last it asks what target file name you want.


DAY 31 Sort Stage (Processing Stage)
Q: What is sorting? "Sorting here means more than the simple sort we already know."
Q: Why sort the data? "To provide sorted data to stages like join/aggregator/merge/remove duplicates for good performance."
Two types of sorting:
1. Traditional sorting: "a simple sort, arranging the data in ascending or descending order".
2. Complex sorting: "it is only for the sort stage, and is used to create a group id, block unwanted sorting, and do group-wise sorting".
In DataStage we can perform sorting at three levels:
o Source level: "it is possible only in a database".
o Link level: "it can be used for traditional sorting".
o Stage level: "it can be used for traditional sorting as well as complex sorting".
Q: Which level is best for sorting when we consider performance? "Link level sort is the best we can perform."
Source level sort:
o It can be done only in a database stage, like Oracle Enterprise and so on.
o How is it done in Oracle Enterprise (OE)?

Go to the OE properties, select the user-defined SQL read method, and give the query: select * from EMP order by DEPTNO.
Link level sort:
o Here the sorting is done on the link, as shown in the design below.
o It is used for traditional sorting only, and link sort is the best sort in terms of performance.
Design (figure): OE -> JOIN -> DS, with the sort set on the link going into the JOIN stage.

Q: How to perform a Link Sort? “Here as per above design, open the JOIN properties”.  And go to partitions o Select partition technique (here default is ‘auto’)  Mark “perform sort” • • When we select unique (it removes duplicates) When we select stable (it displays the stable data)

Q: Get all unique records to target1 and remaining to another target2? “For this we must create group id, it indicates the group identification”.


Sort Stage: "It is a processing stage that can sort the data as a traditional sort or as a complex sort."
o It is done in a stage called the sort stage.
o Traditional sort means sorting in ascending or descending order.
o Complex sort means creating a group id, blocking unwanted sorting, and group-wise sorting for stages such as join, aggregator, merge, and remove duplicates.
o The group id is created in the sort stage properties by setting create key change column (CKCC) = "true" under the options. Here we must select the column on which the group id is wanted. True enables the group id, false disables it; the default is false.
Sort Properties:
o Input properties: Sorting key = EID (select the column from the source table); Key mode = sort (sort / don't sort (previously sorted) / don't sort (previously grouped)); Options: Create cluster key change column = false (true/false), Create key change column = (true/false).
o Output properties: mapping should be done here.
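A small hedged illustration of the key change column (the generated column name depends on the stage settings): if the input is sorted on DEPTNO and CKCC = true, the column is 1 for the first record of each group and 0 for the rest:
DEPTNO keyChange
10 1
10 0
10 0
20 1
20 0
30 1
This group id is what the downstream filter/transformer logic uses to separate unique records from duplicates.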

naveen.txt EID. kumar. munna. ENAME. insurance 333.munna. ACCTYPE 111. loans 222. current 111. loans 111. naveen. kumar. File. savings 333. credit 111.DataStage DAY 32 A Transformer & Sort stage job Q: Sort the given file and extract the all addresses to one column of a unique record and count of the addresses to new column. loans 222. savings Design: SF Sort1 DS Navs notes Page 103 2010 . kumar. munna. munna. current 222.

munna. munna. o Properties of TX: Stage Variable IN2 EID ENAME ACCTYP E KeyChan Derivation Column if (IN2. munna.DataStage  Sequential File (SF): here reads the file.munna. savings. credit . credit . naveen. savings 333. current.keychange=1) then 1 else c+1 OUT Derivation Column IN2. ENAME.txt for the process.ACCTYPE func1 else func1 :’. current.ACCTYPE if(IN2. kumar. current 333.keychange = 1) then IN2. kumar.ENAME func1 ACCTYPE ENAME  For this logic output will displays like below: EID.’: IN2.EID EID IN3. kumar. insurance. savings 111. savings. ACCTYPE 111. current 111.loans. loans 222.loans 222. savings. credit 222.  Sort1: here sorting key = EID  And enables the CKCC for group id.  Transformer (TX): here logic to implement operation for target. insurance 111. naveen. loans COUNT 1 2 3 4 1 2 3 1 2 Navs notes Page 104 2010 Tx Sort2 . current.

partitioning Options= ascending.  Stage • Key=ACCTYPE o o Sort key mode = sort Sort order = Descending order  Input • • Partition type: hash Sorting o Perform sort   Stable (uncheck) Unique (check this) o Selected     Output • Key= count Usage= sorting. in the properties we must set as below.  Data Set (DS): o Input:  partition type: hash o Sorting: Navs notes Page 105 2010 .DataStage  Sort2: o Here. case sensitive Mapping should be doing here.

2. o Data Base: by write filter quires like “select * from EMP where DEPTNO = 10”. loans 222. IF  IF can write ‘n’ number of column in condition. Switch and 3.  Stage Filter: o “Stage filters use in three stages. ACCTYPE. . COUNT 4 3 2 DAY 33 FILTER STAGE 2010   Unique (check this) Final output: o Selected    Key= EID Usage= sorting. kumar. current. Constraints  Source Level Filter: “it can be done in data base and as well as in file at source level”. naveen.  Better SWITCH performance than IF. sav. partition Ascending 111. o Source File: here we have option called filter there we can write filter commands like “grep “moon”/ grep –I “moon”/ grep –w “moon” ”.DataStage  Perform sort Stable (check this) EID. External filter”. and they are 1. sav 333. munna. ENAME. curr.  It have ‘n’ number of cases. o Difference between if and switch:  Poor performance. Navs notes  SWITCH can only one Page 106 condition can perform. credit . loans Filter means “blocking the unwanted data”.  It can only have 128 cases. In DataStage Filter stage can perform in three level. Filter. Stage level 3.loans. insu. Source level 2. they are 1.

Here the filter stage acts like an IF, and the switch stage like a switch..case..default.
Differences between the three filter stages:
o FILTER: condition on multiple columns; 1 input, n outputs, 1 reject.
o SWITCH: condition on a single column; 1 input, 128 outputs, 1 default.
o EXTERNAL FILTER: it filters using grep commands; 1 input, 1 output, no rejects.
Filter stage: "it has one input, 'n' outputs, and one reject link". The symbol of the filter stage is the Filter icon.
Q: How does the filter stage send the data from source to target?

Design (figure): OE (source) -> Filter -> T1 (data set), T2 (data set), and a reject link to a third data set.
Step1: Connect to Oracle to extract the EMP table.

Step2: Filter properties  Predicates o Where clauses = DEPT NO =10  Output link =1

o Where clauses = SAL > 1000 and SAL < 3000  Output link = 2

o Output rejects = true // it is for output reject data.  Link ordering o Order of the following output links  Output: o Mapping should be done for links of the targets we have.  Step3:
 “Assigning a target files names in the target”.

Here, Mapping for T1 and T2 should be done separately for both.

The filter stage has no dedicated reject link; because it has 'n' output links, we must convert one of the output links into a reject link.

DAY 34 Jobs on Filter and properties of Switch stage Assignment Job 1: a. Only DEPTNO 10 to target1? b. Condition SAL>1000 and SAL<3000 satisfied records to target2? c. Only DEPTNO 20 where clause = SAL<1000 and SAL>3000 to target3? d. Reject data to target4? Design to the JOB1:
Design for JOB1 (figure): EMP_TBL is read once and passed through Filter stages that load the four targets T1, T2, T3 and T4.

Step1: “For target1: In filter where clause for target1 is DEPTNO=10 and link order=0”. Step2: “For target2: where clause = SAL>1000 and SAL<3000 and link order=1”. Step3: “For target3: where clause= DEPTNO=20 and link order=0”. Step4: “For target4: convert link into reject link and output reject link=true”. Job 2: a. All records from source to target1? b. Only DEPTNO=30 to target2?
c. Where clause = SAL<1000 and SAL>3000 to target3?

d. Reject data to target4? Design to the JOB2:
Design for JOB2 (figure): EMP_TBL -> Copy; one output of the Copy loads target T1 and the other feeds a Filter stage that loads targets T2, T3 and the reject target T4.

Condition SAL>1000 & SAL<3000. All records to target3? d. Step4: “For target4 convert link into reject link and output reject link=true”. Only DEPTNO 10 records to target4? e. Job 3: a. Step2: “For target2 where clause = DEPTNO=30 and link order =0”.DataStage Step1: “For target1 mapping should be done output links for this”. All unique records of DEPTNO to target1? b. All duplicates records of DEPTNO to target2? c. Step3: “For target3 where clause = SAL<1000 and SAL>3000 and link order=1”. but no DEPTNO=10 to target5? Design to the JOB3: K= T Filter EMP_TBL K= T TT T Navs notes Page 111 2010 .

Step2: “For target2: where clause = keychange=0 and link order=1”. Step3: “For target3: mapping should be done output links for this”. Step5: “For target5: in filter properties put output rows only once= true for where clause SAL>1000 & SAL<3000”. Picture of switch stage: Properties of Switch stage:  Input o Selector column = DEPTNO  Cases o values Case = 10 = 0 link order o Case = 20 = 1  Options Navs notes Page 112 2010 . Step4: “For target4: where clause= DEPTNO=10”. SWITCH Stage: “Condition on single column and it has only 1 – input.default”. 128 – outputs and 1.DataStage Filter T Step1: “For target1: where clause = keychange=1 and link order=0”.

If not found = options (drop / fail / output):
o Drop – drops the unmatched data and the process continues.
o Fail – the job aborts if any record would be dropped.
o Output – the unmatched data is sent out through the default link so it can be viewed.
DAY 35 External Filter and Combining
External Filter: "It is a processing stage which can perform filtering by UNIX commands (grep)."
o It has 1 input, 1 output, and 1 reject link.
o To filter a text file, it must first be read as a single record on the input.
o Example filter command: grep "newyork".
Design (figure): Sequential File -> External Filter -> Data Set.
External Filter properties: o Filter command = grep "newyork"; grep -v "newyork" filters everything other than "newyork".
Combining: "in DataStage, combining can be done in three types"; they are listed after the example below.
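A few hedged examples of such UNIX filter commands, assuming the single-column input is matched against the word newyork:
grep "newyork" \\ keeps only the records that contain newyork
grep -v "newyork" \\ keeps the records that do NOT contain newyork
grep -i "newyork" \\ case-insensitive match
grep -w "newyork" \\ matches newyork only as a whole word
Only the command itself is given in the Filter command property; the stage passes every input record through that command.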

o Horizontal combining
o Vertical combining
o Funnel combining
DAY 36 Horizontal Combining (HC) and Description of HC stages
Horizontal Combining (HC): "combining the primary rows with secondary rows based on the primary key".
o This combining is performed by the JOIN, LOOKUP, and MERGE stages.
o Selection of the primary table is situation based.
o These three stages differ from each other with respect to: input requirements, treatment of unmatched records, and memory usage.
Example (figure): an EMP table (ENO, EName, DNo) and a DEPT table (DNo, DName, LOC) are combined horizontally on DNo, giving a target with ENO, ENAME, DNO, DNAME and LOC.

Description of HC stages: "the description of horizontal combining is divided into nine parts". They are:
o Input output rejects
o Input names
o Join types
o Input requirements with respect to sorting
o De-duplication (removing duplicates)
o Treatment of unmatched records
o Memory usage
o Key column names
o Types of inner join
Join types, taking T1 = {10, 20, 30} and T2 = {10, 20, 40} as an example:
o Inner Join: "matched primary and secondary records".
o Left Outer Join: "matched primary & secondary records plus unmatched primary records".
o Right Outer Join: "matched primary & secondary records plus unmatched secondary records".
o Full Outer Join: "matched primary & secondary records plus unmatched primary and unmatched secondary records".
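To make the four join types concrete, here is the outcome for the sample key sets T1 and T2 above (a hedged illustration; only the key values are listed):
Inner Join: {10, 20}
Left Outer Join: {10, 20, 30} – 30 keeps its T1 values, and the T2 columns come out null/empty
Right Outer Join: {10, 20, 40} – 40 keeps its T2 values, and the T1 columns come out null/empty
Full Outer Join: {10, 20, 30, 40}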

The differences between JOIN, LOOKUP and MERGE with respect to the nine points above are:
Input output rejects:
o JOIN: n inputs (inner, LOJ, ROJ), 2 inputs (FOJ); 1 output; no reject.
o LOOKUP: n inputs (normal), 2 inputs (sparse); 1 output; 1 reject.
o MERGE: n inputs; 1 output; (n – 1) rejects.
Input names:
o JOIN: the first source is the left table, the last source is the right table, and all middle sources are intermediate tables.
o LOOKUP: the first link from the source is the primary/input link and the remaining links are lookup/reference links.
o MERGE: the first table is the master table and the remaining tables are update tables.
Join types:
o JOIN: inner join, left outer join, right outer join, and full outer join.
o LOOKUP: inner join, left outer join.
o MERGE: inner join, left outer join.
Input requirements with respect to sorting:
o JOIN: mandatory for both primary and secondary.
o LOOKUP: optional.
o MERGE: mandatory.
De-duplication (removing the duplicates):
o JOIN: OK (nothing happens) on both inputs.
o LOOKUP: primary OK, secondary gives warnings.
o MERGE: primary gives warnings, secondary OK.
Treatment of unmatched records:
o Primary: JOIN – drop (inner) or target (left); LOOKUP – drop, fail, reject (unmatched primary records) or target (continue); MERGE – drop or target (keep).
o Secondary: JOIN – drop (inner) or target; LOOKUP – drop; MERGE – drop or reject (the reject captures the unmatched secondary records).

 DataStage version8 supports four types of LOOKUP. LOC Navs notes Page 117 2010 . DNAME. DNO  Reference table as DEPT with column consisting of DNO. they are o Normal LOOKUP o Sparse LOOKUP o Range LOOKUP o Case less LOOKUP For example in simple job with EMP and DEPT tables:  Primary table as EMP with column consisting of EID.  “Look up stage is for cross verification of primary records with secondary records”.DataStage :: MEMORY USAGE:: Light memory :: Key Column Names:: Must be SAME :: Type of Inner Join :: ALL ALL ANY Optional Same in case of lookup file set Must be SAME Heavy memory Light memory DAY 37 LOOKUP stage (Processer Stage) Lookup stage:  In real time projects. ENAME. 95% of horizontal combining is used by this stage.

Navs notes Page 118 2010 .DataStage DEPT table (reference/ lookup) EMP table (Primary/ input) LOOKUP Data Set (target) LOOKUP properties for two tables: Primary Table ENO ENAM E DNO Target ENO ENAM E DNAM Reference Table DNO DNAM E LOC Key column for both tables  It can set by just drag from primary table to reference table to DNO column.

By taking lookup file set Navs notes Page 119 .  But we have a option to remove the case sensitive i.. Note: sparse lookup not support another reference when it is database.  Reject: it’s captured the primary unmatched records. DAY 38 Sparse and Range LOOKUP Sparse LOOKUP:  If the source is database. Case less LOOKUP: In execution by default it acts as a case sensitive. But in ONE Case sparse LOOKUP stage can supports ‘n’ references. o Key type = case less.e. in that we have to select  Continue: this option for Left Outer Join. 2010  Drop: it is to Inner Join.  Normal lookup: “is cross verification of primary records with secondary at memory”. By default Normal LOOKUP is done in lookup stage.  Sparse lookup: “is cross verification of primary records with secondary at source level itself”. if a primary unmatched records are their.DataStage In tool bar of LOOKUP stage consists of constraints button. its supports only two inputs.  To set sparse lookup we must adjust key type as sparse in reference table only.  Fail: its aborts job.

LFS …………………… LFS SF LOOKUP DS  In lookup file set.DataStage Job1: a sequential file extracting a text file to load into lookup file set (lfs). Navs notes Page 120 .lfs extension. 2010 Sequential file  Here in lookup file set properties: Lookup file set o Column names should same as in sequential file. we must paste the address of the above lfs. o Address of the target must save to use in another job. o Target file stored in .  Lookup file supports ‘n’ references means indirectly sparse supports ‘n’ references. Job2: in this job we are using lookup file set as sparse lookup.

Data type should be same Funnel stage it is process to append the records one table after the one. for example: in. Columns names should be case sensitive 4. To perform the funnel stage some conditions must to follow: 1. Navs notes Page 121 2010 Range LOOKUP: .primary= “AP” o For multiple links we can write multiple conditions for ‘n’ references. o In condition. DAY 39 Funnel. Columns should be same 2. Columns names also should be same 3. there we will see condition box.  How to set the range lookup: In LOOKUP properties:  Select the check box for column you need to condition.DataStage  “Range lookup is keeping condition in between the tables”. Condition for LOOKUP stage:  How to write a condition in the lookup stage? o Go to tool bar constraint. Copy and Modify stages Funnel Stage: “It is a processing stage which performs combining of multiple sources to a target”. but above four conditions has to be meet.

Navs notes Page 122 . Copying source data to multiple targets. DEL INDIA IBM NY USA IBM Funnel operation three modes:  Continues funnel: it’s random. 4.  Sort funnel: it’s based on key column values.  Sequence: collection of records is based on link order. 3.DataStage In this stage the column GEN M has to exchange into 1 and F=0. 2. Drop the columns. NOTE: best for change column names and drop columns. Charge the column names. 2010 Simple example for funnel stage: ENO EN GEN 111 HYD 222 naveen M munna Loc T X ENO GEN Copy /Modi fy EN ADD EMPID EName Loc Company GEN 444 555 IT SA Country 1 0 In this column names has change as primary table. Stub stage. 1. Copy Stage: “It is processing stage which can be used from”.

Keep the columns. MGR. remaining columns were drops. 2010 “It is processing stage which can perform”.  At runtime: Data Set Management (view the operation process)  Specification: <new column name> DOJ=HIREDATE<old column> o Here to change column name. 2. Alter the data. Oracle Enterprise Modify Data Set From OE using modify stage send data into data set with respect to above five points. DEPTNO o Here accept the columns. DEPTNO o Here drops the above columns. Change the column names. MGR. 5. Drop the columns. Modify the data types. 3. Navs notes Page 123 . 4.  Specification: keep SAL.DataStage Modify Stage: 1. In modify properties:  Specification: drop SAL.

no reject.  Types of Join stage are inner. and intermediate tables. left outer join. DAY 40 JOIN Stage (processing stage) Join stage it used in horizontal combining with respect to input requirements. right outer join. LOJ. Navs notes Page 124 . right table.  Join stage having n – inputs (inner. and memory usage. treatment of unmatched records. 1.DataStage  Specification: <new column name>DOJ=DATE_FROM_TIMESTAMP(HIREDATE) <old column> 2010 o Here changing the column name with data type. 2 – inputs (FOJ).output.  Input requirements with respect to sorting: it is mandatory in primary and secondary tables.  Join stage input names are left table. and full outer join. ROJ).

A simple job for JOIN Stage: JOIN properties:  Need a key column o Inner JOIN. in this no scope from third table that’s why FOJ have two inputs. drops and when it is LOJ will keep all records in target.DataStage  Input requirements with respect to de – duplication: nothing happens means it’s OK when de – duplication. Left Outer JOIN comes in left table. And in secondary table in Inner option it’s drops and it ROJ will keep all records in target.  Memory usage: light memory in join stage.  All types of inner join will supports. that job can executes but its effect on the performance (simply say WARNINGS will occurs) Navs notes Page 125 . o Full Outer JOIN comes both tables.  In join stage when we sort with different key column names. 2010  Treatment of unmatched records: in primary table when the option Inner its simple  Key column names should be SAME in this stage. o Right Outer JOIN comes in right table.

We can change a column name in two ways: with a Copy stage, or with a query statement. Example of such a SQL query: select DEPTNO1 as DEPTNO, DNAME, LOC from DEPT.
DAY 41 MERGE Stage (processing stage)
The Merge stage is a processing stage that performs horizontal combining, differing from the other HC stages in its input requirements, treatment of unmatched records, and memory usage.
o It has n inputs, 1 output, and (n – 1) rejects.
o Merge stage input names are master and updates.
o The join types of this stage are inner join and left outer join.
o The input requirement with respect to sorting: it is mandatory to sort before performing the merge.

unmatched records of the unmatched primary table records.  Merge operates with only two options o Keep (left outer join) o Drop (inner Join) Simple job for MERGE stage: PID PRD_DESC PRD_MANF 11 indica tata 22 swift maruthi 33 civic PID PRD_SUPP PRD_CAT 11 abc XXX 33 xyz XXX 55 pqr XXX 77 mno XXX PID PRD_AGE PRD_PRICE 11 4 1000 22 9 1200 66 3 1500 88 9 1020 Master Table Master table Update (U1) Update (U2) Navs notes Page 127 . And in secondary  Treatment of unmatched records in primary table Drop (drops).  In the merge stage the memory usage is LIGHT memory.  In type of inner join it compares in ANY update tables. And in secondary table drops and reject it captures the unmatched secondary table records.DataStage  Input requirements with respect to de – duplication in the primary table it will get warnings when we don’t remove the duplicates in primary table.  All changes information stores in the update tables. Target (keep) the 2010 table nothing will happens its OK when we don’t remove the duplicates.  The key column names must be the SAME. NOTE:  Static information stores in the master table.

 Merge supports (n-1) reject links. Sequential File Remove Duplicates Data Set Navs notes Page 128 . DAY 42 Remove Duplicates & Aggregator Stages Remove Duplicates: “It is a processing stage which removes the duplicates from a column and retains the first or last duplicate rows”.  NOTE: there has to be same number of reject links as update links or zero reject links.DataStage TRG 2010 U1 U2 or Reject (U1) In MERGE properties:  Merge have inbuilt sort = (Ascending Order/Descending Order) Reject (U2)  Must to follow link order.  Here COPY stage is acting as STUB Stage means holding the data with out sending the data into the target.

Properties of Remove Duplicates: there are two options in this stage –
o Key column = <column name>
o Duplicate to retain = (first/last)
The Remove Duplicates stage supports 1 input and 1 output.
NOTE: for every stage with n inputs and n outputs, mapping must be done.
Aggregator: "It is a processing stage that performs a count of rows and different calculations between columns, i.e. the same operation as GROUP BY in Oracle."
Design (figure): SF -> Aggregator -> DS.
Properties of Aggregator:
o Grouping keys: Group = DEPTNO
o Aggregator type = count rows (count rows / calculation / re-calculation)
o Count output column = <column name>
1Q: Count the number of all records, and the DEPTNO-wise count, in the EMP table?
Design 1 (figure): OE_EMP -> Copy of EMP -> counting rows of DEPTNO -> TRG1.

sum of deptno trg1 Company: IBM max of IBM trg2 3Q: To find max salary from emp table of a company and find all the details of that? Navs notes Page 130 2010 . 2Q In Target one dept no wise to find maximum.Column for calculation = SAL <column name> Operations are  Maximum value output column = max <new column name>  Minimum value output column = min <new column name>  Sum of column = sum <new column name> and so on. min. and sum of rows.DataStage Generating a column counting rows of created column TRG2 For doing some group calculation between columns: Example: Select group key Group= DEPTNO . Here. minimum. and in target two company wise maximum? 2 Design: OE_emp copy of emp max. doing calculation on SAL based on DEPTNO.Aggregation type = calculation .
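As a hedged SQL analogy for the aggregations described in 1Q and 2Q above (the DataStage jobs themselves use the Aggregator stage, not SQL):
select count(*) from EMP \\ count of all records
select DEPTNO, count(*) from EMP group by DEPTNO \\ DEPTNO-wise count
select DEPTNO, max(SAL), min(SAL), sum(SAL) from EMP group by DEPTNO \\ DEPTNO-wise calculations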

. coming from OLTP also source is after data. target also before data is called Initial load. sum of salary of a deptno wise in a emp table? dummy dno=10 3 & 4 Design: compare emp max(deptno) UNION ALL diving dno=20 compare copy min(deptno) dummy dno=30 company: IBM compare maximum SAL with his details max (IBM) DAY 43 Slowly Changing Dimensions (SCD) Stage Before SCD we must understand: types of loading 1. Initial load 2.e. min. Navs notes Page 131 2010 .e.  The subsequent is alter is called incremental load i.. Incremental load  Initial load: complete dump in dimensions or data warehouse i.DataStage & 4Q: To find max.

DataStage Example: #1 Before data (already data in a table) CID 11 CNAME A ADD HYD GEN M BALANCE Phone No 30000 988531068 8 After data (update n insert at source level data) CID 11 CNAME A ADD SEC GEN M BALANCE Phone No 60000 988586542 2 Column fields that have changes types: Address – slowly change Balance – rapid change Phone No – often change Age – frequently AGE 25 AGE 24 Example: #2 Before Data: CID 11 22 33 CNAME A B C ADD HYD SEC DEL After Data: (update ‘n’ insert option loading a table) CID 11 22 CNAME A B ADD HYD CUL Navs notes Page 132 2010 .

Here surrogate key acting as a primary key. SCD – III: SCD – I (+) SCD – II “maintain the history but no duplicates”. and no historical data were organized”.  In SCD – II.  Unique key: the unique key is done by comparing.. .. they are  SCD – I  SCD – II  SCD – III  SCD – IV or V  SCD – VI Explanation: SCD – I: execution.  And when SCD – II performs we get a practical problem is to identify old and current record. surrogate key. effect start date.e. surrogate key. i. With some special “it only maintains current update. Navs notes Page 133 2010  Extracting after and before data from DW (or) database to compare and upsert. and effect end date. SCD – II: “it maintains both current update data and historical data”.e. effect start date (ESDATE) and effect end date (EEDATE). active flag. new concepts are introduced here i. That we can solve by active flag: “Y” or “N”.DataStage 33 D PUN We have SIX Types of SCD’s are there. As per SCD – I.  Record version: it is concept that when the ESDATE and EEDATE where not able to use is some conditions. it updates the before data with after data and no history present after the operation columns they are. not having primary key that need system generated primary key.  In SCD – II.

20. 40 10. DAY 44 SCD I & SCD II (Design and Properties) SCD – I: Type1 (Design and Properties): Transfer job 10.DataStage SCD – IV or V: SCD – II + record version 2010 “When we not maintain date version then the record version useful”.20. SCD – VI: SCD – I + unique identification. 20.30 OE_DIM before fact DS_FACT 10. 40 After dim OE_UPSERT 10. Example table of SCD data: SID 1 2 3 4 5 6 7 8 CID 11 22 33 22 44 11 22 55 CNAME A B C B D A B E ADD HYD SEC DEL DEL MCI GDK RAJ CUL AF N N Y N Y Y Y Y ESDATE 03-06-06 03-06-06 03-06-06 08-09-07 08-09-07 30-11-10 30-11-10 30-11-10 EEDATE 29-11-10 07-09-07 9999-12-31 29-11-10 9999-12-31 9999-12-31 9999-12-31 9999-12-31 RV 1 1 1 2 1 2 3 1 UID 1 2 3 2 5 1 2 8 Table: this table is describing the SCD six types and the description is shown above. 20. 40 DS_TRG_DIM -update and insert OE_SRC Navs notes Page 134 .20. 40 Load job DS_TRG_DIM 10.

In Oracle we have to create table1 and table2.
Table1: Create table SRC(SNO number, SNAME varchar2(25));
o Insert into src values(111, 'naveen');
o Insert into src values(222, 'kumar');
o Insert into src values(333, 'munna');
Table2: Create table DIM(SKID number, SNO number, SNAME varchar2(25)); (no records to display yet).
Process of the transform job SCD1:
Step 1: Load the plug-in meta data from Oracle for the before and after data, as shown in the links above coming from the different sources.
Step 2: "SCD1 properties"
Fast path 1 of 5: select the output link as: fact.
Fast path 2 of 5: navigate the key column values between the before and after tables – SKID has purpose "surrogate key", and AFTER.SNO maps to SNO with purpose "business key".
Fast path 3 of 5: select the source type and source name – Source type: Flat file; source name: D:\study\navs\empty.txt.

-update n insert \\ used when the key column value already exists.
Fast path 4 of 5: select the output into DIM. Derivations: SKID = NextSK() (surrogate key); SNO = AFTER.SNO (business key); SNAME = AFTER.SNAME.
Fast path 5 of 5: set the output path to the FACT data set. Derivations: SKID = BEFORE.SKID; SNO = AFTER.SNO.
Step 3: In the next (load) job, if we change or edit the source table, then while loading into Oracle we must change the write method to upsert, which has the two options shown here.
NOTE: every time the program is run we should empty the source file (empty.txt); otherwise the surrogate key will continue from the last stored value.
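For comparison, the upsert write method behaves much like an Oracle MERGE statement. A minimal hedged sketch, assuming the SRC and DIM tables created above and a hypothetical sequence DIM_SEQ standing in for the surrogate-key generator:
merge into DIM d
using SRC s
on (d.SNO = s.SNO)
when matched then update set d.SNAME = s.SNAME
when not matched then insert (SKID, SNO, SNAME) values (DIM_SEQ.nextval, s.SNO, s.SNAME);
This is roughly what DataStage generates for you when the write method is set to upsert with the update/insert options.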

30. 40 10. 20. 30. 20. 20. 20. 20. 40 Load job DS_TRG_DIM 10. Before table CID CNAME SKID 10 abc 1 20 xyz 2 30 pqr 3 Target Dimensional table of SCD I CID 10 20 40 CNAME SKID abc 1 nav 2 pqr 3 After table CID CNAME 10 abc 20 nav 40 pqr SCD – II: (Design and Properties): Transfer job 10. 20. 40 OE_UPSERT -update and insert OE_SRC DS_TRG_DIM Step 1: in transformer stage: Navs notes Page 137 2010 Here SCD I result is for the below input .DataStage -insert n update \\ if key column value is new.20. 30. 40 After dim 10.30 before OE_DIM fact DS_FACT 10. 20.

SKID SKID BEFORE.DataStage Adding some columns to the to before table – to covert EEDATE and ESDATE columns into time stamp transformer stage to perform SCD II In TX properties: BEFORE BEFORE_TX SKID SNO SNAME ESDATE EEDATE ACF Derivation NAM BEFORE.SNAME SNAME COLUMN SNO In SCD II properties: Fast path 1 of 5: select output link as: fact Fast path 2 of 5: navigating the key column value between before and after tables BEFORE AFTER KEY EXPR SNO SNAME COLUMN N PURPOSE SKID surrogate key AFTER.SNO BEFORE.SNO SNO business key SNAME Type2 ESDATE experi date Page 138 Navs notes 2010 .

ESD ESDATE BEFORE SKID SNO SNAME ESDATE EEDATE ACF Navs notes Page 139 2010 Source type: Flat file .e.SNO SNO business key AFTER. else surrogate key will continue with last stored value. DIM SNO SNAME Derivation COLUMN N PURPOSE Expires next sk() SKID surrogate key AFTER.DataStage Fast path 3 of 5: selecting source type and source name. Fast path 4 of 5: AFTER select output in DIM.SNAME SNAME BEFORE.SNO SNO AFTER..txt NOTE: for every time of running the program we should empty the source name i. AFTER SNO SNAME FACT Derivation COLUMN NAME BEFORE.SNAME SNAME Type2 curr date() ESDATE experi date - Date from Julian (Julian day from day (current date ()) – 1) For path 5 of 5: setting the output paths to FACT data set.SKID SKID AFTER.txt. source name: D:\study\navs\empty. empty.

e. Simple example of change capture: Navs notes Page 140 . i. Change Apply & Surrogate Key stages Change Capture Stage: “It is processing stage. 2010 are loading into oracle we must change the write method = upsert in that we have two options Here SCD II result is for the below input Before table CID CNAME SKID ESDATE EEDATE ACF 10 abc 1 01-10-08 99-1231 Y 20 xyz 20 01-10-08 Target Dimensional table of SCD II CID CNAME SKID ESDATE EEDATE ACF 10 abc 1 01-10-08 99-1231 Y 20 xyz 2 01-10-08 09-12-10 N 20 xyz 4 10-12-10 After table CID CNAME 10 abc 20 nav 40 DAY 45 Change Capture. -update n insert -insert n update \\ if key column value is already.DataStage Step 3: In the Next job. \\ if key column value is new. that it capture whether a record from table is copy or edited or insert or to delete by keeping the code column name”. in load job if we change or edit in the source table and when you they are.

DataStage Change_capture Properties of Change Capture:  Change keys o Key = EID (key column name)   Change valves o Values =? \\ ENAME o Values =? \\ ADD  Options o Change mode = (explicit keys & values / explicit keys. that it applies the changes of records of a table”. Navs notes Page 141 2010 . values) o Drop output for copy = (false/ true) “false – default ” o Drop output for delete = (false/ true) “false – default” o Drop output for edit = (false/ true) “false – default” o Drop output for insert = (false/ true) “false – default”      Sort order = ascending order Copy code = 0 Delete code = 2 Edit code = 3 Insert code = 1 Code column name = <column name> o Log statistics = (false/ true) “false – default” Change Apply Stage: “It is processing stage.
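A small hedged before/after illustration of the codes listed above (copy = 0, insert = 1, delete = 2, edit = 3), with EID as the change key and ENAME/ADD as the change values:
Before: 11 naveen HYD | 22 munna SEC | 33 kumar DEL
After : 11 naveen HYD | 22 munna PUNE | 44 raj MCI
Change Capture output: 22 munna PUNE code=3 (edit); 33 kumar DEL code=2 (delete); 44 raj MCI code=1 (insert); record 11 is a copy (code=0) and is kept or dropped according to the "drop output for copy" option.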

Properties of Change Apply:
- Change keys
    o Key = EID
- Options
    o Change mode = explicit keys & values
    o Check value columns on delete = (false/true)  "true – default"
    o Log statistics = false
    o Code column name = <column name>  \\ set in Change Capture, and it has to be the SAME for the apply operation
    o Sort order = ascending order

SCD II in version 7.5.x2

[Figure: design – before.txt and after.txt are compared in a Change Capture stage (key = EID, option: explicit keys & values); the change code c splits the rows (c = 3 for edited rows, c = all for the rest) before the dimension is written.]

For the new (current) rows: ESDATE = current date(), EEDATE = "9999-12-31", ACF = "Y".
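The shorthand used on the next page (DFJD, JDFD, CD) stands for the parallel transformer date functions; a sketch of the three derivations, with c being the change-code column delivered by the Change Capture stage:

  ESDATE <- CurrentDate()
  EEDATE <- If c = 3 Then DateFromJulianDay(JulianDayFromDate(CurrentDate()) - 1) Else "9999-12-31"
  ACF    <- If c = 3 Then "N" Else "Y"

So an edited key (c = 3) gets its old row expired as of yesterday and flagged not current, while every other row keeps the open-ended expiry date and stays current.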

Transformer derivations in that design (using the change code c from before.txt):
  ESDATE – current date()
  EEDATE – if c=3 then DFJD(JDFD(CD())-1) else "9999-12-31"
  ACF    – if c=3 then "N" else "Y"

SURROGATE KEY Stage:

In version 7.5.x2: "identify the last value that was generated the first time the job containing the surrogate key stage was compiled and run."

[Figure: version 7.5.x2 design – SF → SK → copy → ds, with a Tail stage feeding a Peek stage.]

- In this job, a surrogate key stage is used to generate the system key column values, which are like primary key values. But it generates them at the first compile only.
- By adding a Tail stage we trace the last value and store it into the Peek stage, that is, into a buffer.
- With that buffer value we can generate the next sequence values for the surrogate key in version 7.5.x2; for that reason, in version 7 we have to build another job just to store the last generated value.

In version 8.0: "The above problem with version 7 is overcome by the version 8.0 surrogate key stage, by taking an empty text file (empty.txt) and storing the last-value information in that file; using that, it generates the sequence values."
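Option 2 of the version-8 properties just below uses a database sequence instead of the state file. The Oracle sequence named sq9 there can be created with ordinary SQL – a minimal sketch:

  CREATE SEQUENCE sq9 START WITH 1 INCREMENT BY 1;

The stage then reads sq9 through the scott/tiger connection shown below and hands out the next key values from it.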

[Figure: version 8 design – Before.txt → SK (surrogate key stage) → Data Set.]

Properties of SK, version 8:

Option 1 (flat file state):
  - Generated output column name = skid
  - Source name = g:\data\empty.txt
  - Source type = flat file

Option 2 (database sequence):
  - Database type = oracle (DB2/oracle)
  - Source name = sq9 (in oracle – create sequence sq9)  \\ it plays the same role as empty.txt
  - User id = scott
  - Password = tiger
  - Server name = oracle
  - Source type = database sequence

DAY 46 DataStage Manager

Export: "Export is used to save a group of jobs, for export, to wherever we want."

Navigation – "how to export"?
  DataStage toolbar
    o Job components to export
      - Change selection: ADD or REMOVE or SELECT ALL
    o Here there are three options:
      - Export job designs with executables (where applicable)

      - Export job designs without executables
      - Export job executables without designs
    o Export to file – where we want to locate the export file:
      - Source name \ ...
    o Type of export – by two options we can export a file:
      - dsx (also dsx 7-bit encoded)
      - xml

Import: "It is used to import the .dsx or .xml extensions into a particular project, and also to import some definitions as shown below."
  o Import from file – give the source name to import ...
  o Options of import are:
    - DataStage components...
    - DataStage components (xml)...
    - External function definitions
    - Web services function definitions
    - Table definitions
    - IMS definitions
      • In IMS there are two options:
        - Database description (DBD)
        - Program Specification Block (PSB / PCB)
  o In DataStage components:

    - Import all
    - Import selected
    - Overwrite without query
    - Perform impact analysis

Generate Report: "It is for generating a report for a job or a specific object; it generates a report for a job instantly."
  For that, go to
    o File
      - Generate report
        • Report name
        • Options: Use default style sheet / Use custom style sheet
  After finishing the settings, it generates in the default position "/reportingsendfile/send file/tempDir2.tmp".

Node Configuration:

Q: To see the nodes in a project?
  o Go to the Director
    - Check in the logs
      • Double click on the main program entry: it shows the APT config file.

Q: What are the node components? (A sample configuration file follows this list.)
  1. Node name – logical CPU name.
  2. Fast name – server name or system name.
  3. Pools – logical area where stages are executed.
  4. Resource – memory associated with the node.
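A node configuration file ties these four components together. A minimal sketch of a two-node file (the server name and the disk/scratch paths are placeholders – substitute your own):

  {
    node "node1"
    {
      fastname "dev_server"
      pools ""
      resource disk "/ibm/InformationServer/Server/Datasets" {pools ""}
      resource scratchdisk "/ibm/InformationServer/Scratch" {pools ""}
    }
    node "node2"
    {
      fastname "dev_server"
      pools ""
      resource disk "/ibm/InformationServer/Server/Datasets" {pools ""}
      resource scratchdisk "/ibm/InformationServer/Scratch" {pools ""}
    }
  }

Pointing $APT_CONFIG_FILE at a file like this (as described below) makes the job run on these nodes.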

o Node components are stored permanently on disc at the below address:
    "c:\ibm\information server\server\parasets"
o Node components are stored temporarily at the below address:
    "c:\ibm\information server\scratch"

Q: What node handles running each and every job, and what is the name of the configuration file?
  o Every job runs on the APT node below, which is the default for every job.
  o The name of the configuration file is C:\ibm\...\default.apt
  o Default.apt holds the single-node information.

Q: How to run a job on a specific configuration file?
  o Job properties
    - Parameters
      • Add environment variables
        o Parallel → Compiler
          • Config file (add $APT_CONFIG_FILE)

Q: How to create a new node configuration file?
  o Tools
    - Configurations
      • There we see default.apt
      • We can create a new node with the option NEW
      • Save the things after creating new nodes, by "save configuration as"
  o NOTE: best is to create 8 or 16 nodes in a project.

Q: If a uniprocessing system with 1 CPU needs a minimum of 1 node to run a job, how many minimum nodes does an SMP with 4 CPUs need?
  o Only 1 node (CPU counts go as 2^0, 2^1, … and so on, say).

Q: How to run a job in a job?
  Navigation for how to run a job in a job:
    o Job properties
      - Job control
        • Select a job
    o Dependencies
      - Select job (first compile this job before the main job)

Advanced Find: "It is a new feature of version 8."
  It finds objects of a job like the list shown below:
    1. Where used,
    2. Dependency,
    3. Compared report, and
    4. Job Control Language (JCL) scripts present.

Q: Repository (palette) of Advanced Find?
  o Name to find: Nav*
  o Folder to search: D:\datastage\
  o Type
  o Creation
  o Last modification
  o Where used
    - Find objects that use any of the following objects – options: Add, remove, remove all
  o Dependencies of job

Q: Advanced Find of the repository through the tool bar?
  o Cross project compare…
  o Compare against
  o Export
  o Multiple job compile
  o Add to palette
  o Create copy
  o Locate in tree
  o Find dependencies

Q: How to find a dependency in a job?
  o Go to the tool bar
    - Repository
      • Find dependency: all types of a job

DAY 47 DataStage Director

DS Director maintains:
  - Schedule
  - Monitor
  - Views
    o Job view
    o Status view
    o Log view
  - Message Handling
  - Batch jobs
  - Unlocking

Schedule: "Schedule means a job can run at specific timings."
  Navigation to set the timings:
    o Right click on the job in the DS Director
      - Click on "add to schedule…"
        • And set the timings.
  In real time, the job sequence is scheduled by the tools shown below (this happens in production only):
    o Tools to schedule jobs
      - Control M
      - Cron tab
      - Autosys

Purge: "It means cleaning, washing out or deleting the already created logs."
  In a job we can clear the job logs with the FILTER option; by right clicking we can filter.
  Navigation to set the purge:

    o Tool bar
      - Job
        o Immediate purge
        o Auto purge
    - Clear log (choose the option)

Monitor: "It shows the status of the job, the number of rows executed, started at (time), elapsed time, rows/sec and the percentage of CPU used."
  Navigation for how to monitor a job:
    o Right click on the job
      - Click monitor
        • "it shows the performance of a job"
  Like the below figure for a simple job:

    Status     No. rows   started at   elapsed time   rows/sec   %CPU
    Finished   6          sys time     00:00:03       2          =9
    Finished   6          sys time     00:00:03       2          =7

  NOTE: based on this we can check the performance tuning of a stage in a particular job.

Reasons for warnings:
  - The default warnings in a sequential file are:
    1. Field "<column name>" has import error and no default value, data: { e i d }, at offset: 0
    2. Import warnings at record 0.

    3. Import unsuccessful at record 0, saw EOF instead (format mismatch).
  - These three warnings can be solved by a simple option in the sequential file:
      o First line is column names = set as true (here the default option is false).
  - When we work on a look-up and the secondary (reference) stage has duplicates, we get a warning.
  - When the second stage in a merge has duplicates, we get a warning.
  - When sorting on a different key column than the one used in the join.
  - Where there is a length mismatch, like source length (10) and target (20).
  - Missing record delimiter "\r\n".

Message Handling: "If the warnings fail to be handled, then we come across message handling."
  The navigation for adding a rule to message-handle the warnings is given on the next page.

Abort a job:
Q: How can we abort a job conditionally?
  - Conditionally, when we run a job, we can keep a constraint there:
      • Like warnings: o No limit  o Abort job after:
  - In a transformer stage:
      o Constraint
        - Otherwise/log
          • Abort after rows: 5 (if 5 records do not meet the constraint, it simply aborts the job – see the sketch below)
  - A constraint like this can also be kept only in a Range Lookup.
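A minimal sketch of the transformer settings described above; the link and column names are assumptions for illustration only:

  Output link constraint:   IsNotNull(lnk_src.EID) And lnk_src.EID > 0
  Otherwise/log link:       (no constraint – it catches whatever the first link rejects)
  Abort after rows:         5    (the job aborts once 5 records have gone down the otherwise link)

The constraint itself is just a boolean expression; the "Abort after rows" setting on the otherwise/log link is what turns failing records into a conditional abort.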

  Navigation for how to add a rule to the message handler:
    - Job logs
      o Right click on a warning
        - Add rule to message handler
          • Two options:
            - Suppress from log
            - Demote to information
          • Choose any one of the above options and add the rule.

Batch jobs: "Executing a set of jobs in an order."
Q: How to create a batch? (A sketch of the job-control code such a batch runs is shown at the end of this section.)
  Navigation for creating a batch:
    - DS Director
      o Tools
        - Batch
          • New (give the name of the batch)
          • Add jobs to the created batch
            o Just compile after adding them to the new batch.

Allow multiple instances: "The same job can be opened by multiple clients, and they can run it."
  - If we do not enable the option, it opens read-only, so you cannot edit it.
  - But a job can be executed by multiple users at the same time in the Director.
  Navigation to enable allow multiple instances:
    - Go to the tool bar in DS Designer
      o Job properties
        - Check the box "allow multiple instances"
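The batch created above is run through the DataStage BASIC job-control API; a minimal hand-written sketch of what runs underneath (the job names are assumptions):

  * Run two jobs one after the other, as a Director batch does
  hJob1 = DSAttachJob("scd_transfer_job", DSJ.ERRFATAL)   ;* attach the first job
  ErrCode = DSRunJob(hJob1, DSJ.RUNNORMAL)                ;* request a normal run
  ErrCode = DSWaitForJob(hJob1)                           ;* block until it finishes
  Status = DSGetJobInfo(hJob1, DSJ.JOBSTATUS)
  If Status = DSJS.RUNFAILED Then Call DSLogFatal("scd_transfer_job failed", "Batch")

  hJob2 = DSAttachJob("scd_load_job", DSJ.ERRFATAL)
  ErrCode = DSRunJob(hJob2, DSJ.RUNNORMAL)
  ErrCode = DSWaitForJob(hJob2)

The Director builds equivalent code automatically when you add jobs to a batch; the sketch is only to show what a batch actually executes.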

Unlock the jobs: "We can unlock the jobs for multiple instances by releasing all the permissions."
  Navigation for unlocking a job:
    - Tool bar
      o Job
        - Cleanup resources
          • Processes
            - Show by job
            - Show all
          • Release all

To see the PIDs for jobs globally, for that:
  - DS Administrator
    o General
      - Environment variables
        • Parallel
          o Reporting
            - Add APT_PM_SHOW_PIDS
              • Set as (true/false)
  - DS Director (where the PIDs are then shown in the log)

DAY 48 Web Console Administrator

Components of the administrator:
  - Administration:
      o User & group
        - Users
          • User name & password is created here,
          • and permissions are assigned here.
  - Session management:
      o Active sessions
  - Reports:
      o DS → INDIA (server/system name)
        • We can create the reports.
        • Upload to review / View report (for admin).
  - Domain Management:
      o License
        - Update the license here.
  - Scheduling management: "It is for knowing what each user is doing."
      o Scheduling views
        - New
          • columns: schedule | run | creation task | last update

DAY 49 Job Sequencing

Stages of job sequencing: "It is for executing jobs in a sequence, which we can schedule as a job sequence", or: "it controls the order of execution of jobs."
  - A simple job is processed as below:
      o Extract
      o Transform
      o Load
      o Master jobs: "they control the order of execution."

Important stages in job sequencing are:
  1. Job activity
  2. Sequencer
  3. Terminator activity
  4. Exception handler
  5. Notification activity
  6. Wait for file activity

Job Activity: "It is the activity that holds a job, and it has 1 input and n outputs."

How is the Job Activity dragged onto the design canvas?

In two methods we can:
  1. Go to tool bar – view – palette – job activity – just drag the icon to the canvas.
  2. Go to tool bar – view – repository – jobs – just drag the job to the canvas.

[Figure: simple sequence – a Student job activity with OK, WAR and FAIL links; OK/WAR pass through a Sequencer into a "student rank" job activity, FAIL goes to a Terminator activity.]

Properties of Job Activity:
  - Load the job you want into the activity:
      o Job name: D:\DS\scd_job
  - Execution action: RUN
      Run options – (Run / Reset if required, then run / Validate only / Reset only)
  - Do not check point run

Check Point: "When a job is re-started from where it aborted, that is called a check point."
  - It is a special option that we must enable manually:
      o Go to Job properties in DS Designer
        - Enable check point

Parameter mapping: "If the job already has some parameters, we can map them to another job if we need to."

Triggers: "It holds the link expression type, i.e. how the link is to act."

  Name of output link   Expression type        Expression
  OK                    OK-(conditional)       "executed OK"
  WAR                   WAR-(conditional)      "execution finished with warnings"
  Fail                  Failed-(conditional)   "execution failed"

And some more options in "Expression type":
  Unconditional          "N/A (the default)"
  Otherwise              "N/A"
  User Status            = "<user defined message>"
  Custom-(conditional)   "custom"   (see the expression sketch after the Terminator notes below)

Terminator Activity: "It is the stage that handles the error if a job fails."
Properties – it consists of two options, for when any subordinate jobs are still running:
  - For a job failure:
      o Send STOP requests to all running jobs, and wait for all jobs to finish.
  - For the server going down in the middle of a running process:
      o Abort without sending STOP requests; wait for all jobs to finish first.
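For the custom expression type, the trigger is just a boolean expression over the upstream activity's variables; a minimal sketch (the activity name scd_job and the exact constant names are assumptions based on the usual sequencer expression editor):

  scd_job.$JobStatus = DSJS.RUNOK Or scd_job.$JobStatus = DSJS.RUNWARN

A link with this custom trigger fires when the job finished cleanly or only with warnings – effectively what the built-in OK/WAR conditional types give you, but combinable into one link.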

Sequencer: "It holds multiple inputs and multiple outputs."
It has two options or modes:
  - ALL – it is for the OK & WAR links.
  - ANY – it is for the FAIL links ('n' number of links).

Exception handler: "It handles the server interrupts."
  - We do not connect any input stage to it; it sits separately in the job.
  A simple job for the exception handler:
    Exception handler → Notification activity → Terminator activity
  Exception handler properties: it has only general information.

Notification Activity: "It sends an acknowledgement in between the process."
  Options to fill in the properties:
    - SMTP mail server name:
    - Sender's email address:
    - Recipients' email address:
    - Email subject:
    - Attachments: (browse file, e.g. D:\DS\SCD_LOAD)
    - Email body:

Wait for file Activity: "To place the job in a pause."

  - File name:
  - Two options: wait for file to appear / wait for file to disappear
  - Do not timeout (no time limit for the above options) / Timeout length (hh:mm:ss)

DAY 50 Performance tuning w.r.t partition techniques & Stages

Partition techniques: they fall into two categories.

Key based:            Key less:
  1. Hash               1. Same
  2. Modulus            2. Entire
  3. DB2                3. Round Robin
  4. Range              4. Random

In key based partition techniques:
  - DB2 is used when the target is a database.
  - DB2 and Range techniques are used rarely.
  - Hash partition technique:
      o It is selected when there are a number of key columns, i.e. key columns (>1) and heterogeneous data types (meaning different data types).
      o Other than this situation we can select the "modulus partition technique".
  - Modulus partition technique:
      o It distributes the data based on the mod value, and the mod formula is MOD(key value, number of nodes) – see the worked example below.
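A small worked example of that formula on a 4-node configuration (assumed integer key values):

  MOD(100, 4) = 0  -> row goes to node 0
  MOD(101, 4) = 1  -> node 1
  MOD(102, 4) = 2  -> node 2
  MOD(103, 4) = 3  -> node 3
  MOD(104, 4) = 0  -> back to node 0

Rows with the same key value always produce the same remainder, so they always land on the same node.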

NOTE: Modulus has higher performance than Hash, because of the way it groups the data based on the mod value.
NOTE: But Modulus can only be selected if there is only one key column and only one data type, and that data type is integer.

In key less partition techniques:
  - Same: it never redistributes the data, but carries on with whatever the previous technique produced.
  - Entire: it distributes the same group of records (the whole data set) to all nodes; the purpose is to avoid mismatched records in between the operation.
  - Round Robin: generating stages, like the Column Generator and so on, are associated with this partition technique.
      o It is the best partition technique compared to Random.
  - Random: all key less partition technique stages use this technique as their default.

Performance tuning w.r.t Stages:
  - If the sorting is already performed, then we can use the JOIN stage.
  - Else the LOOKUP stage is the best.
  - LOOKUP FILE SET: this option is used to remove duplicates in the lookup stage.
  - Conversions: Modify stage or Transformer stage (the Transformer takes more compile time).
  - SORT stage: for a complex sort – go to the stage sort; else – go to the link sort.
  - Remove Duplicates: if the data is already sorted – use the Remove Duplicates stage;
      • for sorting and removing duplicates together – go to the link sort (unique).
  - Constraints: when an operation and constraints are both needed – go to the Transformer stage;
      • else, for constraints only – simply go to the FILTER stage.

DAY 51 Compress, Expand, Generic, Pivot, XML Input & Output Stages

Compress Stage: "It is a processing stage that compresses the records into a single format, meaning a single file – it compresses the records into a zip."
  - It supports "1 input and 1 output".

Properties:
  - Stage
      o Options
        - Command = (compress/gzip)
  - Input
      o <do nothing>
  - Output
      o Load the 'Meta' data of the source file.

Expand Stage: "It is a processing stage that extracts the compressed data – it extracts the zipped data back into unzipped data."
  - It supports "1 input and 1 output".

Properties:
  - Stage
      o Options: command = (uncompress/gunzip)
  - Input
      o <do nothing>
  - Output
      o Load the meta data of the source file for the further process.

Encode Stage: "It is a processing stage that encodes the records into a single format with the support of a command line."
  - It supports "1 input and 1 output".

Properties:
  - Stage
      o Options: Command line = (compress/gzip)
  - Input
      o <do nothing>
  - Output
      o Load the 'Meta' data of the source file.

Decode Stage: "It is a processing stage that decodes the encoded data."
  - It supports "1 input and 1 output".


Properties:
  - Stage
      o Options: command line = (uncompress/gunzip)
  - Output
      o Load the 'Meta' data of the source file.

Generic Stage: "It is a processing stage that can call any operator, but it must and should fulfil the properties."
  - It supports "n inputs and n outputs, but no rejects".
  - The Generic stage can call any operator of the DataStage engine; when the job is compiled, the job-related OSH code is generated.
  - Its purpose is migrating server jobs to parallel jobs (IBM has X-Migrator, which converts up to 70%).

Properties:
  - Stage
      o Options
        - Operator: copy (we can write any stage operator here)
  - Input
      o <do nothing>
  - Output
      o Load the meta data of the source file.

Pivot Stage: "It is a processing stage that converts columns into rows in a table."
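A minimal illustration of what this stage does, which the properties below configure (the column names are assumptions; EID is defined as a pass-through output column and REC as the pivoted column):

  Input row:    EID = 100, Q1 = 10, Q2 = 20, Q3 = 30, Q4 = 40
  Output column REC with derivation "Q1, Q2, Q3, Q4" gives four output rows:
    100, 10
    100, 20
    100, 30
    100, 40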

Properties:
  - Stage – <do nothing>
  - Input – <do nothing>
  - Output:
      Column name   SQL type   Length   Derivation
      REC           varchar    25       <columns, comma separated>
  - It supports "1 input and 1 output".

XML Stages: "It is a real-time stage where the data is stored in a single record, or in aggregate, within the XML format."
  - The XML stage is divided into two types; they are
      1. XML Output
      2. XML Input
  XML Input: "".
