KT Document

Manikandan. K


DAY - 1

• Datastage
  – An ETL tool used to extract data from any number or type of database, transform the data, and load it into a data warehouse.
  – It has client and server components.
  – Client components: DS Designer, DS Director, DS Manager, DS Administrator.
  – Server components: Repository, DS Server.
  – Datastage export files have the extension .dsx.
  – If we run a job from the Designer we are not able to view the logs there.


• Datastage Designer
  – It is used to create the jobs and compile them.
  – Running a job from the Designer is also possible, but it is an optional feature.
• Datastage Director
  – It is used to validate the jobs created in the Designer, then run and monitor them.
  – Once a job has executed we can check the logs to get the details of the run.

• Datastage Administrator
  – It is a user interface used to perform administrative tasks.
• Datastage Manager
  – It is used to view and edit the contents of the repository.
  – It is used to export, import and manage the metadata.

• Development / Debug stages
  – Row Generator
  – Column Generator
  – Peek
  – Tail
  – Head
  – Sample

• File stages
  – Sequential File
  – Dataset
  – Fileset
  – Lookup Fileset
• Processing stages
  – Sort
  – Copy
  – Remove Duplicates
  – Join
  – Look Up
  – Merge
  – Funnel
  – Modify
  – Aggregator
  – Transformer
  – Pivot
  – Filter
  – Surrogate Key Generator

Row Generator (no input -> one output)
• Row Generator generates mock data based on the metadata defined for its output columns.
• It can be used as input data for testing purposes.
• By default the number of rows generated is 10.

Column Generator (1 input -> 1 output)
• We must enter the metadata for the column to be generated.
• Its output depends on the number of input rows: if the input has 15 rows then the output will also have 15 rows.
• We can also set the increment, start value and end value for the generated column.
• To give every record the same value in the generated column, set the default value to that particular value.

Head (1 input -> 1 output)
• It is used to pass the first 'n' records of the input file.
• By default 'n' is set to 10; it can be changed to any value we need.
• If n=10 and the input has 50 records, only the first 10 records pass through.

Tail (1 input -> 1 output)
• It is similar to Head; the main difference is that it selects the last 'n' records of the input data.

Sample (1 input -> n outputs)
• It selects 'n' records, where 'n' is given as a percentage, from the full set of input records in a random manner.
• If n=50% and the input has 150 records, the output will have 75 records selected in a random manner.

Peek (1 input -> 1 output (optional))
• Allows us to view the data in the logs in the Director.
• By default the output property is set to Log so that we can view the records in the Director.
• If the output property is set to Output instead, the records from the Peek go to a separate output link and cannot be seen in the logs.
• It has the records only if the stage is valid and is executed.
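
To make the record counts concrete, here is a minimal Python sketch of the Head, Tail and Sample behaviour described above. This is a toy model, not Datastage itself: the function names are illustrative, while the default of 10 and the 50%-of-150 example come from these notes.

```python
import random

rows = [{"id": i} for i in range(1, 151)]   # 150 mock input records

def head(records, n=10):
    # Head: pass only the first n records (the stage's default n is 10).
    return records[:n]

def tail(records, n=10):
    # Tail: the same idea, but the last n records.
    return records[-n:]

def sample(records, percent):
    # Sample: select percent% of the input in a random manner.
    k = int(len(records) * percent / 100)
    return random.sample(records, k)

print(len(head(rows)))        # 10
print(len(tail(rows)))        # 10
print(len(sample(rows, 50)))  # 75, chosen at random
```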

DAY - 2

Types of jobs we use in Datastage
• Parallel Job
• Parallel Shared Container
• Job Sequence
Server jobs are used rarely.

Parallel processing
• There are mainly 2 types of parallel processing:
  – Partition parallelism
  – Pipeline parallelism

• Partition parallelism
  – To process very large quantities of data, the data is split into smaller subsets and each subset is handled by a separate processor, so the job completes faster.
  – At the end of the job the data subsets can be collected back together again and written to a single data source.
• Pipeline parallelism
  – Here there are multiple stages; each processor starts processing a particular stage and loads the data into the pipeline.
  – So multiple stages execute simultaneously and the job completes at a much faster pace.
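
A toy illustration of the partition-parallelism idea using Python's multiprocessing — not how Datastage is implemented, just the split/process/collect pattern described above; the transform function is a stand-in.

```python
from multiprocessing import Pool

def transform(value):
    # Stand-in for whatever work a stage would do on one record.
    return value * 2

if __name__ == "__main__":
    data = list(range(1_000_000))            # a large quantity of data
    with Pool(processes=4) as pool:          # four "partitions"
        # Each worker processes its own subset of the data independently.
        results = pool.map(transform, data, chunksize=250_000)
    # The subsets are collected back together into a single result,
    # just as the partitions are collected and written to one data source.
    print(len(results))                      # 1000000
```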

Sequential File
• A sequential file may be a text file, a CSV file, a fixed-width file, etc.
• The data from a sequential file is transferred in a sequential manner.
• So a fan-out/fan-in step is done before the data is transferred from/to parallel processing stages.

Dataset
• A dataset has the extension .ds.
• It can be viewed only using Datastage, since it is stored in Datastage's internal (non-readable) format.
• The data in a dataset can be accessed in parallel, so it is much faster compared to sequential files.

DAY - 3

Join [2 inputs (more than 2 optional) -> 1 output]
• It performs a join based on the field that is set as the key.
• 4 types of join can be performed:
  – Inner join
  – Left outer join
  – Right outer join
  – Full outer join
• We can explicitly set an input table to be the left, right or intermediate input; this is done in the Link Ordering tab.
• The data coming into the Join must be sorted.
• It is faster compared to Look Up but slower compared to Merge.
• It does not have a reject link.
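
The four join types are easiest to see on a tiny example. A minimal Python sketch (the column names are invented, and unique keys on each side are assumed for brevity):

```python
left  = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
right = [{"id": 2, "city": "X"}, {"id": 3, "city": "Y"}]

def join(l, r, key, how="inner"):
    # Index the right input on the join key (unique keys assumed).
    r_by_key = {row[key]: row for row in r}
    l_keys = {row[key] for row in l}
    out = []
    for row in l:
        match = r_by_key.get(row[key])
        if match is not None:                # key present on both sides
            out.append({**row, **match})
        elif how in ("left", "full"):        # unmatched left rows survive
            out.append(dict(row))
    if how in ("right", "full"):             # unmatched right rows survive
        out.extend(dict(row) for row in r if row[key] not in l_keys)
    return out

for how in ("inner", "left", "right", "full"):
    print(how, join(left, right, "id", how))
```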

Pivot (1 input -> 1 output)
• It is used to convert columns to rows.
• It performs the transpose function of a matrix.
• If there are 4 columns q1, q2, q3, q4 and we convert them into rows under a common column named 'q', the derivation for q is q1, q2, q3, q4.
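
A minimal sketch of that q1..q4 example in Python — the record layout and the "student" column are invented for illustration:

```python
record = {"student": 7, "q1": 10, "q2": 20, "q3": 30, "q4": 40}

def pivot(rec, cols, out_col):
    # Every pivoted column becomes its own output row, with the
    # non-pivoted fields carried along unchanged.
    fixed = {k: v for k, v in rec.items() if k not in cols}
    return [{**fixed, out_col: rec[c]} for c in cols]

for row in pivot(record, ["q1", "q2", "q3", "q4"], "q"):
    print(row)   # {'student': 7, 'q': 10} ... {'student': 7, 'q': 40}
```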

Remove Duplicates (1 input -> 1 output)
• The pre-requisite for this stage is that the input data must be sorted.
• We must give a field or key based on which the duplicate records are to be removed.
• If duplicate values exist in the input, only one record is retained and the others are dropped.
• In the properties we can set which record is retained: either the first one or the last one.
• It compares each record with the very next record; if both are the same, one record is dropped. So if the input is not sorted, duplicate values will remain in the output.
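
A minimal Python sketch of this neighbour-comparison behaviour, including the first/last retention option (function and field names are illustrative; the input is assumed already sorted on the key, as the stage requires):

```python
rows = [{"id": 1, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}]  # sorted on id

def remove_duplicates(records, key, keep="first"):
    out = []
    for rec in records:
        # Compare each record with the previous one, as the stage does;
        # this only works because the input is sorted on the key.
        if out and out[-1][key] == rec[key]:
            if keep == "last":
                out[-1] = rec       # retain the last duplicate instead
            continue                # keep == "first": drop this record
        out.append(rec)
    return out

print(remove_duplicates(rows, "id"))                # keeps {'id': 1, 'v': 'a'}
print(remove_duplicates(rows, "id", keep="last"))   # keeps {'id': 1, 'v': 'b'}
```

Note how an unsorted input would defeat this: equal keys that are not adjacent are never compared, so the duplicates survive.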

Copy (1 input -> 'n' outputs)
• It is used to simply copy the input data to multiple output links.
• It is better to have a Copy between two transformation stages.

DAY - 4

Sort (1 input -> 1 output)
• It is used to sort the input data based on a particular key that we mention explicitly.
• We can also choose the sort order, either ascending or descending.
• We can also mention how to handle null values, with a first or last preference.
• There is an option called Allow Duplicates; if it is set to false and duplicate records exist, only the first record is retained and the subsequent records are dropped.
• The stage is active if Allow Duplicates is set to true and passive in the latter case.
• Sorting can also be done on a link by setting the sort properties in the link.
• We cannot have a reject link on the Sort stage.
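
A minimal Python sketch of these Sort options — key, ascending/descending, nulls-first/last and Allow Duplicates (the function and its parameter names are invented; Python's stable sort makes "keep the first per key" work):

```python
rows = [{"k": 3, "v": "a"}, {"k": None, "v": "b"},
        {"k": 1, "v": "c"}, {"k": 3, "v": "d"}]

def stage_sort(records, key, ascending=True, nulls_first=True,
               allow_duplicates=True):
    # Order the nulls explicitly (first or last), everything else by the key.
    nulls    = [r for r in records if r[key] is None]
    non_null = sorted((r for r in records if r[key] is not None),
                      key=lambda r: r[key], reverse=not ascending)
    out = nulls + non_null if nulls_first else non_null + nulls
    if not allow_duplicates:
        # Allow Duplicates = false: retain only the first record per key value.
        seen, unique = set(), []
        for r in out:
            if r[key] not in seen:
                seen.add(r[key])
                unique.append(r)
        out = unique
    return out

print(stage_sort(rows, "k"))
print(stage_sort(rows, "k", ascending=False, nulls_first=False,
                 allow_duplicates=False))
```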

Funnel ('n' inputs -> 1 output)
• It is used to combine multiple input files into a single output file.
• In Funnel there are 3 types:
  – Sort
  – Sequence
  – Continuous

• By default it will be Continuous.
• In Continuous it combines the records round-robin: the 1st record from the 1st file, then the 1st from the 2nd file, and so on through the subsequent records of all the input files.
• In the case of a Sequence funnel, first all the records from the 1st input appear, then the ones from the 2nd input, and so on for all the input files.
• For the Sort funnel we must give the key; the records from the n input files are combined and sorted on that key into a single output file.
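
The three funnel modes, sketched in Python on two toy inputs (the Sort variant here assumes the inputs are already sorted on the key and merges them, which keeps the sketch short):

```python
import heapq
from itertools import chain, zip_longest

f1 = [{"k": 1}, {"k": 4}]
f2 = [{"k": 2}, {"k": 3}, {"k": 5}]

# Sequence funnel: all records of input 1, then all of input 2, and so on.
sequence = list(chain(f1, f2))

# Continuous funnel: the 1st record of each input, then the 2nd of each, ...
_PAD = object()
continuous = [r for group in zip_longest(f1, f2, fillvalue=_PAD)
              for r in group if r is not _PAD]

# Sort funnel: the sorted inputs are merged into one overall-sorted output.
sort_funnel = list(heapq.merge(f1, f2, key=lambda r: r["k"]))

print([r["k"] for r in sequence])     # [1, 4, 2, 3, 5]
print([r["k"] for r in continuous])   # [1, 2, 4, 3, 5]
print([r["k"] for r in sort_funnel])  # [1, 2, 3, 4, 5]
```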

Merge (1 master input, 'n' update inputs -> 1 output, 'n' rejects)
• In Merge the number of update links must match the number of reject links.
• The master and update inputs must be explicitly identified.
• It is mandatory that the inputs be sorted.
• It is the fastest way of joining two or more tables.
• The reject links carry only the records that are dropped from the updates.
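
A minimal sketch of that master/update/reject flow in Python, assuming a single update link and unique keys per input (both assumptions are mine, made to keep the example short):

```python
master  = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]   # sorted on id
updates = [{"id": 2, "city": "X"}, {"id": 3, "city": "Y"}]   # sorted on id

def merge_stage(master_rows, update_rows, key):
    # The master drives the output; update records whose key has no
    # matching master record are dropped to the reject link.
    master_keys = {m[key] for m in master_rows}
    upd_by_key  = {u[key]: u for u in update_rows}   # unique keys assumed
    output  = [{**m, **upd_by_key.get(m[key], {})} for m in master_rows]
    rejects = [u for u in update_rows if u[key] not in master_keys]
    return output, rejects

out, rej = merge_stage(master, updates, "id")
print(out)   # id 1 passes through; id 2 is enriched with city X
print(rej)   # [{'id': 3, 'city': 'Y'}] -- rejected from the update only
```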

Filter (1 input -> n outputs, 1 reject (optional))
• It is used to filter the records from the input based on a condition, passing the records that satisfy the condition to the output.
• We can set multiple conditions and get multiple outputs based on those conditions.
• The records that do not satisfy any of the conditions are either dropped or passed on to the reject link.
• The link ordering can be done in the Link Ordering tab in the properties dialog box.
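
A minimal Python sketch of the one-output-per-condition routing with a reject for records matching nothing. For brevity this sketch sends each record down the first matching link only; the real stage can also be configured to test every condition.

```python
rows = [{"age": 15}, {"age": 35}, {"age": 70}]

def filter_stage(records, conditions):
    # One output link per condition; unmatched records go to the reject
    # link (or would simply be dropped if no reject link exists).
    outputs = [[] for _ in conditions]
    reject = []
    for rec in records:
        for i, cond in enumerate(conditions):
            if cond(rec):
                outputs[i].append(rec)
                break                     # first matching link wins here
        else:
            reject.append(rec)
    return outputs, reject

(minors, seniors), rejected = filter_stage(
    rows, [lambda r: r["age"] < 18, lambda r: r["age"] >= 65])
print(minors)    # [{'age': 15}]
print(seniors)   # [{'age': 70}]
print(rejected)  # [{'age': 35}]
```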

DAY - 5

Transformer (1 input -> 'n' outputs, 1 reject)
• It is an active stage.
• It can do filtering, copying, sorting, etc.
• It should be used only if the processing cannot be done using any other stage.
• In the properties there is a Constraints option where we can set conditions to be checked against the incoming data.
• It has stage variables, where we can create variables to store values that are used frequently among the outputs.

• The Derivations option allows us to perform calculations using functions: arithmetic, string, trigonometric, conditional statements like if-else, etc.
• Null handling and date functions can also be used in the derivations.
• We can derive the outputs from the Transformer to multiple targets based on the derivations.
• The stage variables declared and defined are calculated in the order in which they are derived.
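
A minimal Python sketch of a Transformer pass over two records — a constraint, a stage variable reused by the derivations, and string/arithmetic/if-else derivations. The column names, the bonus rule and the sv_full_name variable are invented for illustration:

```python
rows = [{"first": "ada",  "last": "lovelace", "salary": 5000},
        {"first": "alan", "last": "turing",   "salary": None}]

output, reject = [], []
for rec in rows:
    # Constraint: records failing the check go down the reject link.
    if rec["salary"] is None:
        reject.append(rec)
        continue
    # Stage variable: computed once per record, reused in the derivations.
    sv_full_name = (rec["first"] + " " + rec["last"]).title()  # string function
    # Derivations: arithmetic and if-else logic on the way to the output.
    output.append({"name":  sv_full_name,
                   "bonus": rec["salary"] * 0.10,
                   "band":  "high" if rec["salary"] > 4000 else "low"})

print(output)  # [{'name': 'Ada Lovelace', 'bonus': 500.0, 'band': 'high'}]
print(reject)  # [{'first': 'alan', 'last': 'turing', 'salary': None}]
```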

Look Up (1 source (primary), 'n' references -> 1 output, 1 reject)
• Look Up is used to join tables based on the key (field) we mention.
• Each record from the primary is looked up in the reference input, and if the keys match the join is performed.
• There are 2 types of look up:
  – Normal lookup
  – Sparse lookup
• In this case the reject link gets the records that are rejected from the primary.

• If there is no record in the reference table matching the primary table's key, the record is dropped or passed to the reject link.
• It is the slowest compared to the Merge and Join stages.
• In Look Up we can also set constraints. There are two types of constraints:
  – Condition failure
  – Lookup failure

• In case of failure there are 4 ways of handling it:
  – Continue
  – Fail
  – Reject
  – Drop
• Continue just puts a null value in case of failure and continues with the process.
• Fail quits the job and comes out of the execution, i.e. it aborts the process.
• Reject passes the rejected records to the reject link.
• Drop just drops the record that did not meet the condition and continues processing the next record.
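
The four failure modes, sketched in Python on a keyed reference (the column names and the on_failure parameter are invented; the continue branch null-fills the reference column, here just "city"):

```python
primary   = [{"id": 1}, {"id": 2}, {"id": 3}]
reference = {1: {"city": "X"}, 3: {"city": "Y"}}   # reference input, keyed on id

def lookup(records, ref, key, on_failure="continue"):
    output, reject = [], []
    for rec in records:
        match = ref.get(rec[key])
        if match is not None:
            output.append({**rec, **match})         # key found: join performed
        elif on_failure == "continue":
            output.append({**rec, "city": None})    # null-fill, keep going
        elif on_failure == "reject":
            reject.append(rec)                      # pass to the reject link
        elif on_failure == "drop":
            pass                                    # silently drop the record
        elif on_failure == "fail":
            raise RuntimeError("lookup failure: aborting the job")
    return output, reject

print(lookup(primary, reference, "id", on_failure="continue"))
print(lookup(primary, reference, "id", on_failure="reject"))
```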

Thank You!
