

What is the main difference between the Data Set and File Set stages?
A Data Set is an internal format of DataStage. The main points to consider about a Data Set before using it are:
1) It stores data in binary, in the internal format of DataStage, so it takes less time to read/write from a Data Set than from any other source/target.
2) It preserves the partitioning scheme, so you don't have to partition the data again.
3) You cannot view the data without DataStage.
Now, about the File Set:
1) It stores data in a format similar to a sequential file.
2) The only advantage of using a File Set over a sequential file is that it preserves the partitioning scheme.
3) You can view the data, but in the order defined by the partitioning scheme.
What is the difference between the Join/Lookup/Merge stages? How do these stages react if duplicate records come in on the input links?
Join stage:
1) It has n input links (one primary and the rest secondary), one output link and no reject link.
2) It supports 4 join operations: inner join, left outer join, right outer join and full outer join.
3) Join occupies less memory, hence performance is high in the Join stage.
4) The default partitioning technique is Hash.
5) A prerequisite for the Join stage is that the data must be sorted before performing the join.
Lookup stage:
1) It has n input links (one stream link and the rest reference links), one output link and 1 reject link.
2) It can perform only 2 join operations: inner join and left outer join.
3) Lookup occupies more memory (the reference data is held in memory), hence performance is reduced.
4) The default partitioning technique is Entire.
5) The data does not need to be sorted.
Merge stage:
1) It has n input links (one master link and the rest update links) and n-1 reject links (one per update link).
2) It can also perform 2 join operations: inner join and left outer join.
3) The Hash partitioning technique is used by default.
4) Very little memory is used, hence performance is high.
5) Sorted data on the master and update links is mandatory.
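The duplicate-record behaviour asked about above follows standard relational join semantics: when a key value is duplicated on a link, a join produces one output row per match (a cross product per key). A minimal Python sketch of this, with invented row and column names standing in for link data:

```python
# Sketch of inner vs. left outer join semantics, and of how duplicate
# key values multiply the output rows (one output row per match),
# which is how a join combines links too.

def join(primary, secondary, key, how="inner"):
    """Join two lists of dict-rows on `key`."""
    index = {}
    for row in secondary:
        index.setdefault(row[key], []).append(row)
    out = []
    for row in primary:
        matches = index.get(row[key], [])
        if matches:
            for m in matches:       # duplicates: one output row per match
                out.append({**row, **m})
        elif how == "left":         # left outer: keep unmatched primary rows
            out.append(dict(row))
    return out

customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
orders    = [{"id": 1, "amount": 10}, {"id": 1, "amount": 20}]  # duplicate key 1

print(join(customers, orders, "id"))          # 2 rows: Ann matched twice
print(join(customers, orders, "id", "left"))  # 3 rows: Bob kept without a match
```

Note this only models the Join stage's cross-product behaviour; a Lookup or Merge stage can instead be configured to warn, reject or fail on duplicate reference/update records.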
How many reject links can I have in a Merge stage?
Unlike the Join and Lookup stages, the Merge stage allows you to specify several reject links: as many as there are update links.

In a Join stage, if one input has col1, col2, col3 and the other input has col4, col5, col6, how do we join them and perform a left outer join?

When do we use the Lookup stage?
DataStage doesn't know how large your data is, so it cannot make an informed choice whether to combine data using a Join stage or a Lookup stage. Here's how to decide which to use:
If the reference datasets are big enough to cause trouble, use a Join. A Join does a high-speed sort on the driving and reference datasets. This can involve I/O if the data is big enough, but the I/O is all highly optimized and sequential. Once the sort is over, the join processing is very fast and never involves paging or other I/O.
If the reference datasets are small enough to fit in memory, use a Lookup. The Lookup stage does not require the data to be sorted.

How many partitioning methods are there in DataStage?
9 types: Auto, DB2, Entire, Hash, Modulus, Random, Range, Round robin, Same.

What is keyless partitioning?
Keyless partitioning methods distribute rows without examining the contents of the data:
Same: retains the existing partitioning from the previous stage.
Round-robin: distributes rows evenly across partitions in a round-robin partition assignment.
Random: distributes rows evenly across partitions in a random partition assignment.
Entire: each partition receives the entire dataset.

What is keyed partitioning?
Keyed partitioning examines the data values in one or more key columns, ensuring that records with the same values in those key columns are assigned to the same partition:
Hash: assigns rows with the same values in one or more key columns to the same partition using an internal hashing algorithm.
Modulus: assigns rows with the same values in a single integer key column to the same partition using a simple modulus calculation.
Range: assigns rows with the same values in one or more key columns to the same partition using a specified range map generated by pre-reading the dataset.
DB2: for DB2 Enterprise Server Edition with DPF (DB2/UDB) only; matches the internal partitioning of the specified source or target table.

Can we use Hash partitioning for the reference link in a Lookup stage?
Yes.

How many types of joins are supported by the Merge/Join/Lookup stages?

Which partitioning methods should we use in Merge/Lookup/Join, and why?

How to remove duplicates in a table without using an inner query?

What is a "degenerate dimension"?
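The keyed vs. keyless distinction above is easy to mimic outside the engine. A minimal Python sketch (Python's built-in hash stands in for DataStage's internal hashing algorithm; the partition count and rows are invented for illustration):

```python
# Sketch of three of the partition-assignment rules described above.
# Keyed methods send equal key values to the same partition;
# round-robin never examines the data at all.

NPART = 4  # number of partitions (invented for the example)

def hash_partition(key_value):
    # Keyed: same key value always lands in the same partition.
    return hash(key_value) % NPART

def modulus_partition(int_key):
    # Keyed: single *integer* key column, simple modulus calculation.
    return int_key % NPART

def round_robin(rows):
    # Keyless: rows dealt out evenly; contents never examined.
    return [(i % NPART, row) for i, row in enumerate(rows)]

rows = [{"cust": 7}, {"cust": 8}, {"cust": 7}, {"cust": 9}]
print([modulus_partition(r["cust"]) for r in rows])  # [3, 0, 3, 1]
print(round_robin(rows))  # partitions 0, 1, 2, 3 regardless of values
```

Note how both rows with cust=7 get partition 3 under Modulus, which is exactly the property a key-based Join or Merge relies on.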

What are Transport Blocks?
The following environment variables are all concerned with the block size used for the internal transfer of data as jobs run: APT_MIN_TRANSPORT_BLOCK_SIZE, APT_MAX_TRANSPORT_BLOCK_SIZE, APT_DEFAULT_TRANSPORT_BLOCK_SIZE and APT_LATENCY_COEFFICIENT. The default block size is 131072 bytes. Some of the settings apply only to fixed-length records.

1) What are system variables?
IBM® InfoSphere® DataStage® provides a set of variables containing useful system information that you can access from a transform or routine. For example:
@DATE: the internal date when the program started (see the Date function).
@DAY: the day of the month extracted from the value in @DATE.
@FALSE: the compiler replaces the value with 0.

2) What are Active and Passive stages?
An Active stage is the "T" of ETL; a Passive stage is the "E & L" of ETL.

3) Define data aggregation.
It summarizes the data.

4) What is an InterProcess (IPC) stage?
An InterProcess (IPC) stage is a passive stage which provides a communication channel between IBM InfoSphere DataStage processes running simultaneously in the same server job. It can be used to speed up data transfer between two data sources (see the IBM documentation topic c_dsvjbref_InterProcess_Stages).

5) What are Stage Variables, Derivations and Constraints?
Stage Variable: an intermediate processing variable that retains its value during a read and does not pass its value into a target column.
Derivation: an expression that specifies the value to be passed on to the target column.
Constraint: a condition that is either true or false and specifies the flow of data within a link.
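The "retains its value during a read" behaviour of a stage variable can be mimicked in ordinary code. A sketch in Python, with invented names, using a running total that persists across rows the way a Transformer stage variable does:

```python
# A stage variable keeps its value from row to row, unlike a column
# derivation, which is recomputed for each row from that row alone.

def transform(rows):
    sv_total = 0                 # stage variable: persists across rows
    out = []
    for row in rows:
        sv_total += row["qty"]   # updated on every row, value carried forward
        # derivation: expression whose value is passed to the target column
        out.append({**row, "running_total": sv_total})
    return out

print(transform([{"qty": 2}, {"qty": 5}, {"qty": 1}]))
# running_total accumulates: 2, 7, 8
```

A constraint would correspond to an if-test deciding whether a row flows down a given output link at all.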

6) Containers: usage and types?
A container is a collection of stages grouped together for the purpose of reusability. There are 2 types of containers:
a) Local Container: specific to the job it is created in.
b) Shared Container: can be used in any job within a project. There are two types of shared container:
1. Server shared container: used in server jobs (can also be used in parallel jobs).
2. Parallel shared container: used in parallel jobs.
You can also include server shared containers in parallel jobs as a way of incorporating server job functionality into a parallel stage (for example, you could use one to make a server plug-in stage available to a parallel job).

7) Where does DataStage store its repository?
Most of it is stored in SQL Server or Oracle.

8) What is a Surrogate key?

9) What are routines?

10) What are job parameters?

11) DataStage architecture?

12) What is the ORA Bulk stage?
This stage is used to bulk load the Oracle target table.