DataStage
• An ETL tool used to extract data from any number or type of databases, transform the data, and load it into a data warehouse.
• It has client and server components.
  - Client components: DS Designer, DS Director, DS Manager, DS Administrator
  - Server components: Repository, DS Server
• DataStage files (job exports) have the extension .dsx.

Day-1

DataStage Designer
• It is used to create jobs and compile them.
• A job can also be run from the Designer, but this is an optional feature.
• If we run the job in the Designer, we won't be able to view the logs there.

DataStage Director
• It is used to validate the jobs created in the Designer, and then to run and monitor them.
• Once a job has been executed, we can check the logs to get the details of the run.

DataStage Manager
• It is used to view and edit the contents of the repository.
• It is used to export, import and manage the metadata.

DataStage Administrator
• It is a user interface used to perform administrative tasks.

• Development / Debug stages
  - Row Generator
  - Column Generator
  - Peek
  - Tail
  - Head
  - Sample

• File stages
  - Sequential File
  - Dataset
  - Fileset
  - Lookup Fileset
• Processing stages
  - Sort
  - Copy
  - Remove Duplicates
  - Join
  - Lookup
  - Merge
  - Funnel
  - Modify
  - Aggregator
  - Transformer
  - Pivot
  - Filter
  - Surrogate Key Generator

Row Generator (no input -> 1 output)
• The Row Generator generates mock data based on the metadata.
• It can be used as input data for testing purposes.
• By default, the number of rows generated is 10.

Column Generator (1 input -> 1 output)
• We must enter the metadata for the column to be generated.
• The number of output rows depends on the number of input rows, so if the input has 15 rows then the output will also have 15 rows.
• To give every record in the generated column the same value, set the default value to that value.
• We can also set the start value, end value and increment.

Head (1 input -> 1 output)
• It is used to pass the first 'n' records of the input file.
• By default 'n' is set to 10; it can be changed to any value we need.
• If n = 10 and the input has 50 records, then only the first 10 records pass through.

Tail (1 input -> 1 output)
• It is similar to Head; the main difference is that it selects the last 'n' records of the input data.

Sample (1 input -> n outputs)
• It selects 'n' records in a random manner from the full set of input records, where 'n' is given as a percentage.
• If n = 50% and the input has 150 records, the output will have 75 records selected in a random manner.

Peek (1 input -> 1 output (optional))
• It allows us to view the data in the logs in the Director.
• By default the output property is set to Log, so that we can view the data in the Director.
• If the output property is set to Output, the output from the Peek is stored in a separate output file and cannot be seen in the logs.
• It has records only if the stage is valid and has been executed.
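The row selection done by Head, Tail and Sample can be sketched in plain Python. This is only an illustration of the behaviour described in these notes, not DataStage code: the function names and the `seed` parameter are assumptions made for the example, and the sample size is assumed to be the rounded percentage of the input.

```python
import random

def head(rows, n=10):
    """Pass only the first n rows (n defaults to 10, as in the notes)."""
    return rows[:n]

def tail(rows, n=10):
    """Pass only the last n rows."""
    return rows[-n:] if n > 0 else []

def sample(rows, percent, seed=None):
    """Pick percent% of the rows at random, keeping the input order."""
    rng = random.Random(seed)          # seed is only here to make runs repeatable
    k = round(len(rows) * percent / 100)
    chosen = sorted(rng.sample(range(len(rows)), k))
    return [rows[i] for i in chosen]

rows = list(range(1, 51))              # 50 input records
print(head(rows))                      # records 1 to 10
print(tail(rows))                      # records 41 to 50
print(len(sample(list(range(150)), 50, seed=1)))   # 75 records
```

With 50 input records and n = 10, `head` keeps records 1 to 10 and `tail` keeps records 41 to 50, matching the example on the slide.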

DAY-2

Types of Jobs we use in DataStage
• Parallel Job
• Parallel Shared Container
• Job Sequence
Server jobs are used rarely.

Parallel Processing
• There are mainly 2 types of parallel processing:
  - Partition parallelism
  - Pipeline parallelism

• Partition Parallelism
  - To process very large quantities of data, the data is split into smaller subsets and each subset is handled by a separate processor, so that the job completes faster.
  - At the end of the job, the data subsets can be collected back together and written to a single data source.
• Pipeline Parallelism
  - When a job has multiple stages, each processor starts processing a particular stage and loads its data into the pipeline.
  - Multiple stages are therefore executed simultaneously, so that the job completes at a much faster pace.

Sequential File
• A sequential file may be a text file, a CSV file, a fixed-width file, etc.
• The data from a sequential file is transferred in a sequential manner.
• So the fan-out/fan-in process is done before the data is transferred to or from the parallel processing stages.
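As a rough illustration of the partition parallelism described earlier, the sketch below splits the rows into subsets, processes each subset in its own worker, and collects the results back into one output. It is plain Python, not DataStage: the threads stand in for separate processors, and `partition`, `process_subset` and `run_partitioned` are hypothetical names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(rows, n_parts):
    """Split rows into n_parts subsets, dealing them out round-robin."""
    return [rows[i::n_parts] for i in range(n_parts)]

def process_subset(subset):
    """Stand-in for the work a stage does on one partition."""
    return [value * 2 for value in subset]

def run_partitioned(rows, n_parts=4):
    parts = partition(rows, n_parts)
    # Each subset is handled by its own worker.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        results = list(pool.map(process_subset, parts))
    # Collect the subsets back together into a single, ordered output.
    return sorted(value for part in results for value in part)

print(partition([1, 2, 3, 4, 5], 2))    # [[1, 3, 5], [2, 4]]
print(run_partitioned(list(range(10))))
```

The final `sorted` call is only there to make the collected output deterministic; a real collector may return the subsets in a different order.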

Dataset
• A dataset has the extension .ds.
• It can be viewed only using DataStage, because it is stored in DataStage's internal format rather than as plain text.
• The data from a dataset can be accessed in parallel, so it is much faster when compared to sequential files.

DAY-3

Join [2 inputs (more than 2 optional) -> 1 output]
• It performs a join based on the field that is set as the key.
• 4 types of join can be performed:
  - Inner join
  - Left outer join
  - Right outer join
  - Full outer join
• We can explicitly set each input table to be left, right or intermediate.
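The four join types can be illustrated with a small Python sketch over lists of dictionaries. This is an assumption-laden illustration rather than how the Join stage is implemented: the `join` function and its `how` parameter are hypothetical, and unmatched rows simply lack the other table's columns instead of carrying nulls.

```python
def join(left, right, key, how="inner"):
    """Join two lists of dicts on `key`; how is inner, left, right or full."""
    right_by_key = {}
    for row in right:
        right_by_key.setdefault(row[key], []).append(row)
    out, matched = [], set()
    for l_row in left:
        matches = right_by_key.get(l_row[key], [])
        if matches:
            matched.add(l_row[key])
            out.extend({**l_row, **r_row} for r_row in matches)
        elif how in ("left", "full"):
            out.append(dict(l_row))        # unmatched left row is kept
    if how in ("right", "full"):
        for k, r_rows in right_by_key.items():
            if k not in matched:
                out.extend(dict(r) for r in r_rows)   # unmatched right rows
    return out

left = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
right = [{"id": 2, "city": "x"}, {"id": 3, "city": "y"}]
for how in ("inner", "left", "right", "full"):
    print(how, join(left, right, "id", how))
```

Only id 2 exists in both inputs, so the inner join returns one row, the left and right outer joins return two rows each, and the full outer join returns three.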

• For the Join stage, the left, right and intermediate inputs are assigned in the Link Ordering tab.
• The data coming into the Join must be sorted.
• Join is faster than Lookup but slower than Merge.
• The Join stage does not have a reject link.

Pivot (1 input -> 1 output)
• It is used to convert columns to rows.
• It performs the equivalent of a matrix transpose.
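Converting columns to rows can be sketched in Python as follows. The `pivot` function, the `year` column and the sample values are hypothetical, chosen only to show four quarterly columns q1 to q4 becoming four rows with a common column 'q'.

```python
def pivot(rows, pivot_columns, new_name):
    """Turn each listed column of a row into its own output row."""
    out = []
    for row in rows:
        # Columns that are not pivoted are repeated on every output row.
        kept = {k: v for k, v in row.items() if k not in pivot_columns}
        for column in pivot_columns:
            out.append({**kept, new_name: row[column]})
    return out

sales = [{"year": 2020, "q1": 10, "q2": 20, "q3": 30, "q4": 40}]
for record in pivot(sales, ["q1", "q2", "q3", "q4"], "q"):
    print(record)
```

One input row with four pivoted columns produces four output rows, each carrying the unpivoted `year` value.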

• In the Pivot stage, if there are 4 columns q1, q2, q3, q4 and we are going to convert them into rows with a common column name 'q', the description for q is: q = q1, q2, q3, q4.

Remove Duplicates (1 input -> 1 output)
• The prerequisite for this stage is that the input data must be sorted.
• If duplicate values exist in the input, only one record is retained and the others are dropped.
• In the properties we can set which record is retained: either the first one or the last one.
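Why the sorted-input prerequisite matters can be seen in a short Python sketch. It assumes the stage compares each record only with the one next to it; `remove_duplicates` and its `retain` parameter are hypothetical names for the example.

```python
def remove_duplicates(rows, key, retain="first"):
    """Drop a row when its key equals the key of the adjacent row."""
    out = []
    for row in rows:
        if out and out[-1][key] == row[key]:
            if retain == "last":
                out[-1] = row      # keep the later record instead
            continue
        out.append(row)
    return out

unsorted_rows = [{"id": 2, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}]
sorted_rows = sorted(unsorted_rows, key=lambda r: r["id"])

print(remove_duplicates(unsorted_rows, "id"))         # both id 2 rows survive
print(remove_duplicates(sorted_rows, "id"))           # first id 2 row kept
print(remove_duplicates(sorted_rows, "id", "last"))   # last id 2 row kept
```

With unsorted input the two id 2 records are never adjacent, so neither is dropped; after sorting, only one of them remains.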

• We must also give the Remove Duplicates stage a field or key on which the duplicate records are to be removed.
• It compares each record with the very next record; if both are the same, one of them is dropped. So if the input is not sorted, duplicate values will remain in the output.

Copy (1 input -> 'n' outputs)
• It is used simply to copy the input data to multiple output links.
• It is better to have a Copy between two transformation stages.

Day-4

Sort (1 input -> 1 output)
• It is used to sort the incoming data based on a particular key that we specify explicitly.
• We can also choose the sort order, either ascending or descending.
• We can also specify how null values are handled, i.e. whether they are placed first or last.
• There is an option called Allow Duplicates. If it is set to false and duplicate records exist, only the first record is retained and the subsequent records are dropped.

• The Sort stage cannot have a reject link.
• The Sort stage is active if Allow Duplicates is set to true, and passive otherwise.
• Sorting can also be done on the link, by setting the properties of the link.

Funnel ('n' inputs -> 1 output)
• It is used to combine multiple input files into a single output file.
• There are 3 types of funnel:
  - Sort
  - Sequence
  - Continuous

• By default, the Funnel type is Continuous.
• In a Continuous funnel, records are combined in turn: the 1st record from the 1st file, then the 1st record from the 2nd file, and so on through the subsequent records of the input files.
• In a Sequence funnel, all the records from the first input come first, then those from the 2nd input, and so on for all the input files.
• In a Sort funnel, we must give a key; the records from the n input files are sorted on that key and combined into a single output file.
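The three funnel types can be compared with a small Python sketch. It follows the descriptions in these notes, so the Continuous funnel is modelled as strict round-robin, which is a simplifying assumption; the function names are hypothetical.

```python
from itertools import chain, zip_longest

_MISSING = object()    # marks inputs that have already run out of rows

def sequence_funnel(*inputs):
    """All rows of the first input, then all rows of the second, and so on."""
    return list(chain(*inputs))

def continuous_funnel(*inputs):
    """Take one row from each input in turn (round-robin, per the notes)."""
    out = []
    for group in zip_longest(*inputs, fillvalue=_MISSING):
        out.extend(row for row in group if row is not _MISSING)
    return out

def sort_funnel(*inputs, key):
    """Combine all inputs and order the result on the given key."""
    return sorted(chain(*inputs), key=key)

a, b, c = [5, 1], [4, 2], [3]
print(sequence_funnel(a, b, c))               # [5, 1, 4, 2, 3]
print(continuous_funnel(a, b, c))             # [5, 4, 3, 1, 2]
print(sort_funnel(a, b, c, key=lambda r: r))  # [1, 2, 3, 4, 5]
```

The same three inputs give three different orderings, which is the only difference between the funnel types in this sketch.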

Merge (1 master input, 'n' update inputs -> 'n' rejects)
• In Merge, the number of update links must match the number of reject links.
• The master and the updates must be explicitly specified.
• It is mandatory that the inputs are sorted.
• It is the fastest way of joining two or more tables.
• The reject links contain only the records that are dropped from the updates.
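The relationship between update links and reject links can be sketched in Python. This is a simplified model, not the stage's actual algorithm: it assumes keys are unique within each input, keeps master rows that have no matching update, and uses a dictionary lookup instead of relying on sorted input. The `merge` function is a hypothetical name.

```python
def merge(master, updates, key):
    """Merge update rows into master rows; return (output, rejects)."""
    master_keys = {row[key] for row in master}
    output = [dict(row) for row in master]
    rejects = []                       # one reject list per update input
    for update in updates:
        by_key = {row[key]: row for row in update}
        for row in output:
            row.update(by_key.get(row[key], {}))
        # Update rows with no matching master row go to this update's reject.
        rejects.append([row for row in update if row[key] not in master_keys])
    return output, rejects

master = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
update1 = [{"id": 1, "city": "x"}, {"id": 3, "city": "y"}]
update2 = [{"id": 2, "age": 30}]
output, rejects = merge(master, [update1, update2], "id")
print(output)
print(rejects)
```

Two update inputs produce two reject lists; here only the id 3 row of the first update is rejected, because the master has no id 3.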

Filter (1 input -> n outputs, 1 reject (optional))
• It is used to filter the records from the input based on a condition, and to pass the records that satisfy the condition to the output.
• We can set multiple conditions and get multiple outputs based on those conditions.
• The records that do not satisfy any of the conditions are either dropped or passed on to the reject link.
• The link ordering can be done in the Link Ordering tab of the properties dialog box.
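Routing records to several outputs and a reject link can be sketched in Python as below. The `filter_stage` function and the `amount` column are hypothetical; the sketch also assumes that a record matching more than one condition is sent to every matching output.

```python
def filter_stage(rows, conditions):
    """Route rows to one output per condition; the rest go to reject."""
    outputs = [[] for _ in conditions]
    reject = []
    for row in rows:
        matched = False
        for i, condition in enumerate(conditions):
            if condition(row):
                outputs[i].append(row)
                matched = True
        if not matched:
            reject.append(row)         # satisfied none of the conditions
    return outputs, reject

rows = [{"amount": 5}, {"amount": 50}, {"amount": 500}]
conditions = [lambda r: r["amount"] < 10, lambda r: r["amount"] >= 100]
outputs, reject = filter_stage(rows, conditions)
print(outputs)    # one list per condition
print(reject)     # the amount 50 row
```

Leaving the reject list unused corresponds to dropping the records that fail every condition.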

DAY-5

Transformer (1 input -> 'n' outputs, 1 reject)
• It is an active stage.
• It can do filter, copy, sort, etc.
• It must be used only if the process cannot be done using any other stage.
• In the properties there is a constraints option, where we can set any conditions to be checked in the incoming data.
• It has stage variables, where we can create variables to store values that are used frequently among the outputs.

• The stage variables that are declared and defined in the Transformer are calculated in the order in which they are derived.
• The derivations option allows us to perform calculations using functions such as arithmetic, string, trigonometric, etc.
• Null handling, conditional statements like if-else, and date functions can also be used in the derivations.
• We can direct the outputs from the Transformer to multiple targets based on the derivations.
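How stage variables, derivations and constraints fit together can be sketched in Python. This is an analogy only and does not use Transformer expression syntax; the column names, the threshold of 100 and the two output links are assumptions made for the example.

```python
def transformer(rows):
    """Apply stage variables, derivations and constraints to each row."""
    high, low = [], []                 # two output links
    for row in rows:
        # Stage variables, evaluated top to bottom: the second uses the first.
        sv_total = row["price"] * row["qty"]
        sv_is_high = sv_total >= 100
        # Derivations: null handling, a string function and an if-else.
        derived = {
            "name": (row["name"] or "UNKNOWN").upper(),
            "total": sv_total,
            "band": "HIGH" if sv_is_high else "LOW",
        }
        # Constraints decide which output link receives the row.
        (high if sv_is_high else low).append(derived)
    return high, low

rows = [
    {"name": "pen", "price": 2, "qty": 10},
    {"name": None, "price": 50, "qty": 3},
]
high, low = transformer(rows)
print(high)
print(low)
```

Because `sv_is_high` is computed after `sv_total`, swapping the two lines would fail, which mirrors the point that stage variables are evaluated in order.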

Lookup (1 source (primary), 'n' references -> 1 reject)
• Lookup is used to join tables based on the key (field) we specify.
• Each record from the primary is looked up in the reference input, and if the keys match the join is performed.
• There are 2 types of lookup:
  - Normal lookup
  - Sparse lookup
• In this case the reject link gets the records that are rejected from the primary.

• If no record in the reference table matches a record from the primary table, that record is dropped and passed to the reject link.
• Lookup is the slowest when compared to the Merge and Join stages.
• In Lookup we can also set constraints.
• There are two types of constraints:
  - Condition failure
  - Lookup failure

• In case of failure there are 4 ways of handling it:
  - Continue
  - Fail
  - Reject
  - Drop
• Continue just puts a null value in case of failure and continues with the process.
• Fail quits the job and comes out of the execution (i.e. it aborts the process).
• Reject passes the rejected records to the reject link.
• Drop just drops the record that didn't meet the condition and continues processing the next record.
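The four failure actions can be compared in a short Python sketch of a lookup. It is a simplified model under stated assumptions: reference keys are unique, the `lookup` function, the `on_failure` parameter and the `LookupFailed` exception are hypothetical names, and raising an exception stands in for aborting the job.

```python
class LookupFailed(Exception):
    """Raised when the failure action is 'fail'."""

def lookup(primary, reference, key, on_failure="fail"):
    """Look up each primary row in the reference; return (output, reject)."""
    ref_by_key = {row[key]: row for row in reference}
    ref_columns = {c for row in reference for c in row if c != key}
    output, reject = [], []
    for row in primary:
        match = ref_by_key.get(row[key])
        if match is not None:
            output.append({**row, **match})
        elif on_failure == "continue":
            # Keep the row and fill the reference columns with nulls.
            output.append({**row, **{c: None for c in ref_columns}})
        elif on_failure == "reject":
            reject.append(row)
        elif on_failure == "drop":
            continue
        else:
            raise LookupFailed(f"no reference row for {key}={row[key]!r}")
    return output, reject

primary = [{"id": 1}, {"id": 2}]
reference = [{"id": 1, "city": "x"}]
for action in ("continue", "reject", "drop"):
    print(action, lookup(primary, reference, "id", on_failure=action))
```

Only id 1 has a reference row, so the id 2 record is the one whose fate changes with the chosen action.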

