KT Document

Manikandan. K


DAY - 1


• Datastage
  – ETL tool used to extract data from any number or type of database, transform the data, and load the data into a data warehouse.
  – It has client and server components.
  – Client components: DS Designer, DS Director, DS Manager, DS Administrator.
  – Server components: Repository, DS Server.
  – Datastage files have the extension .dsx.
  – If we run a job in the designer we will not be able to view the logs there; running the job from the designer is possible, but it is an optional feature.

• Datastage Designer
  – It is used to create jobs and compile them.
• Datastage Director
  – It is used to validate the jobs created using the designer, then run and monitor them.
  – Once a job is executed we can check the logs for the details of the run.

• Datastage Manager
  – It is used to view and edit the contents of the repository.
  – It is used to export, import, and manage the metadata.
• Datastage Administrator
  – It is a user interface used to perform administrative tasks.

• Development / Debug stages
  – Row Generator
  – Column Generator
  – Peek
  – Tail
  – Head
  – Sample

• File stages
  – Sequential File
  – Dataset
  – Fileset
  – Lookup Fileset
• Processing stages
  – Sort
  – Copy
  – Remove Duplicates
  – Join

  – Look Up
  – Merge
  – Funnel
  – Modify
  – Aggregator
  – Transformer
  – Pivot
  – Filter
  – Surrogate Key Generator

Row Generator (no input -> 1 output)
• Row Generator generates mock data based on the metadata.
• It can be used as input data for testing purposes.
• By default the number of rows generated is 10.

Column Generator (1 input -> 1 output)
• We must enter the metadata for the column to be generated.
• The output depends on the number of input rows, so if the input has 15 rows then the output will also have 15 rows.
• To set a default value for all the records in the column being generated, set the default value property to that particular value.
• We can also set the start value, end value, and increment.

Head (1 input -> 1 output)
• It is used to pass only the first 'n' records of the input file.
• By default 'n' is set to 10; it can be changed to any value we need.

Tail (1 input -> 1 output)
• It is similar to Head; the main difference is that it selects the last 'n' records of the input data.
• If n=10 and the input has 50 records, then only the last 10 records will pass through.

Sample (1 input -> n outputs)
• It selects 'n' records, where 'n' is given as a percentage, from the full set of input records in a random manner.
• If n=50% and the input has 150 records, the output will have 75 records selected in a random manner.

Peek (1 input -> 1 output (optional))
• Allows us to view the data in the logs in the Director.
• By default the Output property is set to Log so that we can view the records in the Director.
• If the Output property is set to Output instead, the output from the Peek is stored in a separate output file and cannot be seen in the logs.
• It shows records only if the stage is valid and has been executed.
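The Head, Tail, and Sample behaviour described above can be sketched in plain Python (an illustration only, not DataStage code; the function names and the `pct` parameter are my own):

```python
import random

def head(rows, n=10):
    # Head: pass only the first n records (n defaults to 10).
    return rows[:n]

def tail(rows, n=10):
    # Tail: pass only the last n records.
    return rows[-n:]

def sample(rows, pct, seed=0):
    # Sample: select pct% of the input records in a random manner.
    k = int(len(rows) * pct / 100)
    return random.Random(seed).sample(rows, k)

records = list(range(1, 51))     # 50 input records
print(head(records))             # first 10 records: 1..10
print(tail(records))             # last 10 records: 41..50
print(len(sample(records, 50)))  # 50% of 50 records -> 25
```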

DAY - 2

Types of jobs we use in Datastage
• Parallel Job
• Parallel Shared Container
• Job Sequence
• Server jobs are used rarely.

Parallel Processing
• Mainly there are 2 types of parallel processing:
  – Partition parallelism
  – Pipeline parallelism

• Partition Parallelism
  – To process very large quantities of data, the data is split into smaller subsets and each subset is handled by a separate processor so that the job is completed faster.
  – At the end of the job the data subsets can be collected back together again and written to a single data source.
• Pipeline Parallelism
  – Where there are multiple stages, each processor starts processing a particular stage and loads the data into the pipeline.
  – So multiple stages are executed simultaneously and the job is completed at a much faster pace.
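As a rough analogy in plain Python (my own sketch, not how DataStage is implemented), partition parallelism splits the data into subsets, processes each subset with a separate worker, and collects the results back together:

```python
from concurrent.futures import ThreadPoolExecutor

def transform(subset):
    # The per-partition work: each worker handles one subset of the data.
    return [row * 2 for row in subset]

def partition(rows, n):
    # Round-robin partitioning into n smaller subsets.
    return [rows[i::n] for i in range(n)]

data = list(range(100))
parts = partition(data, 4)                    # split into 4 subsets
with ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(transform, parts))  # process subsets in parallel
# Collect the subsets back together into a single result.
collected = sorted(row for part in results for row in part)
assert collected == [r * 2 for r in data]
```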

Sequential File
• A sequential file may be a text file, a CSV file, a fixed-width file, etc.
• The data from the sequential file is transferred in a sequential manner.
• So a fan-out/fan-in process is done before the data is transferred to/from parallel processing stages.

Dataset
• A dataset has the extension .ds.
• It can be viewed only using Datastage; it is stored in Datastage's internal binary format.
• The data from a dataset can be accessed in parallel, so it is much faster compared to sequential files.

DAY - 3

Join [2 inputs (more than 2 optional) -> 1 output]
• It performs a join based on the field that is set as the key.
• 4 types of join can be performed:
  – Inner join
  – Left outer join
  – Right outer join
  – Full outer join
• We can explicitly set an input table to be left, right, or intermediate. This is done in the Link Ordering tab.
• It is faster compared to Look Up but slower compared to Merge.
• The data coming into the join must be sorted.
• It does not have a reject link.

Pivot (1 input -> 1 output)
• It is used to convert columns to rows.
• It performs the transpose function of a matrix.
• If there are 4 columns q1, q2, q3, q4 and we convert them into rows under a common column named 'q', then the derivation for q = q1, q2, q3, q4.

Remove Duplicates (1 input -> 1 output)
• The prerequisite for this stage is that the input data must be sorted.
• If duplicate values exist in the input then only one record is retained; the others are dropped.
• In the properties we can set which record is to be retained: either the first one or the last one.
• It checks each record against the very next record; if both are the same then one record is dropped. So if the input is not sorted, duplicate values will remain in the output.
• We must also give a field or key based on which the duplicate records are to be removed.

Copy (1 input -> 'n' outputs)
• It is used to simply copy the input data to multiple output links.
• It is better to have a Copy between two transformation stages.
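The key-based join semantics described above can be sketched in Python (an illustration under my own naming; unlike DataStage's Join stage, this dict-based version does not require sorted input):

```python
def join(left, right, key, how="inner"):
    # Join two lists of dicts on the given key field.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    out = []
    for row in left:
        matches = index.get(row[key], [])
        if matches:
            out.extend({**row, **m} for m in matches)
        elif how == "left":  # left outer join keeps unmatched left rows
            out.append(dict(row))
    return out

emp = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
dept = [{"id": 1, "dept": "HR"}]
print(join(emp, dept, "id"))              # inner: only id 1 survives
print(join(emp, dept, "id", how="left"))  # left outer: id 2 kept without dept
```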

DAY - 4

Sort (1 input -> 1 output)
• It is used to sort the input data based on a particular key that we mention explicitly.
• We can also choose the sort order: either ascending or descending.
• We can also mention how to handle null values, such as first or last preference.
• We have an option called Allow Duplicates; if this is set to False and duplicate records exist, only the first record is retained and the subsequent records are dropped.
• The stage is active if Allow Duplicates is set to True, and passive otherwise.
• We cannot have a reject link for the Sort stage.
• Sorting can also be done on a link by setting the properties of the link.

Funnel ('n' inputs -> 1 output)
• It is used to combine multiple input files into a single output file.
• In Funnel there are 3 types:
  – Sort
  – Sequence
  – Continuous
• By default it is set to Continuous.
• In Continuous mode it interleaves the records: the 1st record from the 1st file, then the 1st from the 2nd file, and so on through the subsequent records of the input files.
• In Sequence mode all the records from the 1st input come first, then the ones from the 2nd input, and so on for all the input files.
• In Sort mode we must give the key; it takes the input from the 'n' files, sorts the records, and combines them into a single output file.
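The three funnel modes can be sketched in Python (my own function names; `continuous` interleaves round-robin, `sequence` concatenates the inputs in order, and `sort` merges inputs that are already sorted):

```python
import heapq
from itertools import chain, zip_longest

def funnel_continuous(*inputs):
    # 1st record from the 1st file, 1st from the 2nd file, then the 2nd records, ...
    _SKIP = object()  # sentinel for exhausted shorter inputs
    return [r for group in zip_longest(*inputs, fillvalue=_SKIP)
            for r in group if r is not _SKIP]

def funnel_sequence(*inputs):
    # All records of the 1st input, then the 2nd input, and so on.
    return list(chain(*inputs))

def funnel_sort(*inputs, key=None):
    # Merge already-sorted inputs into one sorted output on the key.
    return list(heapq.merge(*inputs, key=key))

a, b = [1, 4, 7], [2, 3, 9]
print(funnel_continuous(a, b))  # [1, 2, 4, 3, 7, 9]
print(funnel_sequence(a, b))    # [1, 4, 7, 2, 3, 9]
print(funnel_sort(a, b))        # [1, 2, 3, 4, 7, 9]
```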

Merge (1 master input, 'n' update inputs -> 1 output, 'n' rejects)
• In Merge the number of update links must match the number of reject links.
• It is the fastest way of joining two or more tables.
• It is mandatory that the inputs are sorted.
• The master and the updates must be explicitly mentioned.
• The rejects will have only the records that are dropped from the updates.
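A rough Python sketch of the merge semantics above (my own simplification: one update link, matched update columns merged onto the master row, unmatched update rows going to that link's reject output):

```python
def merge(master, update, key):
    # Returns (merged output, rejects from the update link).
    master_keys = {row[key] for row in master}
    index = {row[key]: row for row in update}
    out = [{**row, **index.get(row[key], {})} for row in master]
    rejects = [row for row in update if row[key] not in master_keys]
    return out, rejects

master = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
update = [{"id": 2, "city": "NY"}, {"id": 3, "city": "LA"}]
out, rejects = merge(master, update, "id")
print(out)      # id 2 gains the city column; id 1 passes through unchanged
print(rejects)  # id 3 was dropped from the update link, so it is rejected
```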

Filter (1 input -> n outputs, 1 reject (optional))
• It is used to filter the records from the input based on a condition and pass the records which satisfy the condition to the output.
• We can set multiple conditions and get multiple outputs based on those conditions.
• The records that do not satisfy any of the conditions will either be dropped or passed on to the reject link.
• The link ordering can be done in the Link Ordering tab of the properties dialog box.
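The filter behaviour (multiple conditions, one output per condition, and a reject link for records that satisfy none of them) could be sketched as follows; this is an assumption-laden illustration in which a record is sent to every output whose condition it matches:

```python
def filter_stage(rows, conditions):
    # One output list per condition, plus a reject list for rows
    # that satisfy none of the conditions.
    outputs = [[] for _ in conditions]
    rejects = []
    for row in rows:
        matched = False
        for out, cond in zip(outputs, conditions):
            if cond(row):
                out.append(row)
                matched = True
        if not matched:
            rejects.append(row)
    return outputs, rejects

rows = [{"age": 15}, {"age": 30}, {"age": 70}]
outputs, rejects = filter_stage(rows, [lambda r: r["age"] < 18,
                                       lambda r: r["age"] > 60])
print(outputs)  # [[{'age': 15}], [{'age': 70}]]
print(rejects)  # [{'age': 30}] - satisfied neither condition
```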

DAY - 5

Transformer (1 input -> 'n' outputs, 1 reject)
• It is an active stage.
• It can do filter, copy, sort, etc.
• It must be used only if the processing cannot be done using any other stage.
• In the properties there is a Constraints option where we can set conditions to be checked on the incoming data.
• It has stage variables where we can create variables to store values that are used frequently among the outputs.
• The stage variables declared and defined are calculated in the order they are derived.
• The Derivations option allows us to perform calculations using functions: arithmetic, string, date, trigonometric, etc.
• Null handling and conditional statements like if-else can also be used in the derivations.
• We can derive the outputs from the transformer to multiple targets based on the derivations.
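A plain-Python sketch of the transformer ideas above: a stage variable computed once per row and reused, a derivation using an if-else, and a constraint routing each row to an output link (all names here are illustrative, not DataStage syntax):

```python
def transform(rows):
    high, low = [], []  # two output links
    for row in rows:
        # Stage variable: computed once per row, reused in the derivations.
        total = row["price"] * row["qty"]
        # Derivation with a conditional statement (if-else).
        band = "HIGH" if total >= 100 else "LOW"
        out = {**row, "total": total, "band": band}
        # Constraint: route the row to the matching output link.
        (high if band == "HIGH" else low).append(out)
    return high, low

rows = [{"price": 30, "qty": 5}, {"price": 10, "qty": 2}]
high, low = transform(rows)
print(high)  # total 150 -> HIGH output link
print(low)   # total 20  -> LOW output link
```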

Look Up (1 source (primary), 'n' references -> 1 output, 1 reject)
• Look Up is used to join tables based on the key (field) we mention.
• Each record from the primary is looked up in the reference input, and if the keys match the join is performed.
• There are 2 types of look up:
  – Normal lookup
  – Sparse lookup
• The reject link gets the records that are rejected from the primary.
• If there is no record in the reference table matching the primary record, the record is dropped or passed to the reject link.
• It is the slowest compared to the Merge and Join stages.
• In Look Up also we can set constraints. There are two types of constraints:
  – Condition failure
  – Lookup failure
• In case of failure there are 4 ways of handling it:
  – Continue
  – Fail
  – Reject
  – Drop
• Continue puts a null value in case of failure and continues with the process.
• Fail quits the job and comes out of the execution, i.e. it aborts the process.
• Reject passes the rejected records to the reject link.
• Drop just drops the record that did not meet the condition and continues processing the next record.
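The four lookup-failure options can be sketched in Python (an illustration with my own names; "continue" pads the missing field with None, "drop" skips the record, "reject" routes it to the reject link, and "fail" aborts):

```python
def lookup(primary, reference, key, ref_field, on_failure="continue"):
    index = {row[key]: row for row in reference}
    out, rejects = [], []
    for row in primary:
        match = index.get(row[key])
        if match is not None:
            out.append({**row, ref_field: match[ref_field]})
        elif on_failure == "continue":
            out.append({**row, ref_field: None})  # null value, keep going
        elif on_failure == "drop":
            continue                              # silently drop the record
        elif on_failure == "reject":
            rejects.append(row)                   # send to the reject link
        elif on_failure == "fail":
            raise RuntimeError("lookup failure: job aborted")
    return out, rejects

primary = [{"id": 1}, {"id": 2}]
reference = [{"id": 1, "dept": "HR"}]
print(lookup(primary, reference, "id", "dept"))            # continue: id 2 gets None
print(lookup(primary, reference, "id", "dept", "reject"))  # reject: id 2 rejected
```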

Thank You!
