
Datastage Interview Questions & Answers

1. Tell me about your current project?


Ans: Explain in detail.
2. What is the difference between OLTP and OLAP?
Ans:
OLTP systems contain normalized data whereas OLAP systems contain denormalized data. OLTP stores only current data whereas OLAP stores current and historical data for analysis.
Query retrieval is very fast in OLTP compared to OLAP because OLTP queries are short transactions that touch a few rows, whereas OLAP queries join and aggregate large volumes of data across fact and dimension tables.
3. What are the dimensions & facts you are loading?
Ans: Name some dimension tables and fact tables from your project.
4. How are these dimensions and facts connected?
Ans:
By using primary keys in dimension tables and foreign keys in fact tables, we can connect the dimension and fact tables.
5. What is the use of flag values and timestamp values in target tables?
Ans:
Flag values and timestamp values are used to maintain history.
6. What is the difference between star schema and Snowflake schema?
Ans:
In a star schema, dimension tables contain denormalized data and fact tables contain normalized data, whereas in a snowflake schema both dimension and fact tables contain normalized data.
7. What is the use of partitioning and what are the types of
partitioning?
Ans:
If you want to process a huge amount of data, you need partitioning.
By using partitioning, we can send the data to different nodes.
Parallelism is of 2 types:
1) Pipeline parallelism: the ability of downstream stages to begin processing a row as soon as the upstream stage has finished processing that row, instead of waiting for the whole data set.
2) Partition parallelism: the data is divided across nodes; for example, with 100 records and a 4-node configuration file, each node processes 25 records.
8. What are link partitioner and link collector?
Ans:
Link Partitioner is used to send the data to different nodes and Link Collector is used to collect the data back from those nodes. Both are server job stages.
9. How do you preserve partitioning?
Ans:
By using the Same partitioning method, we can preserve the partitioning done in the previous stage.
10. What are the types of partitioning techniques?
Ans:
Hash, Entire, Same, Modulus, Round Robin, Range, Random, DB2, and Auto.

11. If we use SAME partitioning in the first stage, which partitioning method will it take?
Ans:
DataStage uses Round Robin when it partitions the data initially.
12. What is the use of Modulus partitioning?
Ans:
If the key column is an integer, we use Modulus. Of course, we can use Hash partitioning as well, but performance-wise Modulus is better: Hash must compute a hash code for each key to decide which node receives the row, so it takes more time to process the data.
13. What are the types of transformers used in DataStage PX?
Ans:
Transformers are of 2 types: a. Basic Transformer b. Parallel (Normal) Transformer
Difference:
A Basic transformer compiles in "Basic Language" whereas a Normal
Transformer compiles in "C++".
Basic transformer does not run on multiple nodes whereas a Normal
Transformer can run on multiple nodes giving better performance.
Basic transformer takes less time to compile than the Normal Transformer.
Usage:
A basic transformer should be used in Server Jobs.
A Normal Transformer should be used in Parallel jobs, as it runs on multiple nodes there, giving better performance.
14. What performance tunings have you done in your last project to increase the performance of slowly running jobs?
Ans:

1. Use the Dataset stage instead of sequential files wherever necessary.
2. Use the Join stage instead of the Lookup stage when the data is huge.
3. Use operator stages like Remove Duplicates, Filter, and Copy instead of the Transformer stage.
4. Sort the data before sending it to the Change Capture or Remove Duplicates stage.
5. The key column should be hash partitioned and sorted before an aggregate operation.
6. Filter unwanted records at the beginning of the job flow itself.
15. What is the Peek stage? When do you use it?
Ans:

The Peek stage is a Development/Debug stage. It can have a single input link
and any number of output links. The Peek stage lets you print record column
values either to the job log or to a separate output link as the stage copies
records from its input data set to one or more output data sets, like the Head
stage and the Tail stage. The Peek stage can be helpful for monitoring the
progress of your application or to diagnose a bug in your application.
16. What is the Row Generator stage? When do you use it?
Ans:
The Row Generator stage is a Development/Debug stage. It has no input links and a single output link.

The Row Generator stage produces a set of mock data fitting the specified
metadata.
This is useful where we want to test our job but have no real data available to
process.
17. What is RCP? How is it implemented?
Ans:

DataStage is flexible about metadata. It can cope with the situation where metadata isn't fully defined. You can define part of your schema and specify that, if your job encounters extra columns that are not defined in the metadata when it actually runs, it will adopt these extra columns and propagate them through the rest of the job. This is known as Runtime Column Propagation (RCP).
This can be enabled for a project via the DataStage Administrator, and set for individual links via the Output Page Columns tab for most stages, or the Output Page General tab for Transformer stages. You should always ensure that runtime column propagation is turned on.
RCP is implemented through a schema file.
The schema file is a plain text file that contains a record (or row) definition.
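For reference, a minimal sketch of what a schema file can look like (the column names here are hypothetical; the format defines a record with typed, optionally nullable fields):
record
(empno:int32;
ename:string[max=30];
sal:nullable decimal[7,2];
)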
18. What are Stage Variables, Derivations and Constraints?
Ans:
Stage Variable - an intermediate processing variable that retains its value during the read and doesn't pass the value into a target column.
Derivation - an expression that specifies the value to be passed on to the target column.
Constraint - a condition that is either true or false and that controls the flow of data on a link.
The order of execution is Stage Variables -> Constraints -> Derivations.
19. What is the significance of the surrogate key in DataStage?
Ans:
The surrogate key is mainly used in SCD Type 2. For example, I have a table EMP where empno is the primary key. Whenever I try to load duplicate data on empno, it gives a unique (primary key) constraint violation. For that reason we have the surrogate key concept in DataStage: the Surrogate Key Generator produces sequence numbers, and by using these surrogate key numbers we can uniquely identify each record in a table.
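In Oracle terms, a surrogate key is what a sequence provides; a minimal sketch, assuming a hypothetical EMP_DIM table and EMP_SK_SEQ sequence:
Create sequence emp_sk_seq start with 1 increment by 1;
Insert into emp_dim (emp_sk, empno, ename)
values (emp_sk_seq.nextval, 7369, 'SMITH'); -- emp_sk stays unique even if the same empno is loaded again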
20. What is the difference between SCD Type 1, Type 2 and Type 3?
Ans: Type 1: maintains only current data (old values are overwritten).
Type 2: maintains current data and full historical data.
Type 3: maintains current data and the previous data.
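The three types can be illustrated in SQL; a minimal sketch, assuming a hypothetical CUST_DIM table with cust_sk, cust_id, city, prev_city, current_flag, eff_ts and end_ts columns, and a cust_sk_seq sequence:
-- Type 1: overwrite in place, history is lost
Update cust_dim set city = 'DALLAS' where cust_id = 101;
-- Type 2: expire the current row, then insert a new version (full history)
Update cust_dim set current_flag = 'N', end_ts = sysdate
where cust_id = 101 and current_flag = 'Y';
Insert into cust_dim (cust_sk, cust_id, city, current_flag, eff_ts)
values (cust_sk_seq.nextval, 101, 'DALLAS', 'Y', sysdate);
-- Type 3: keep only the previous value in a separate column
Update cust_dim set prev_city = city, city = 'DALLAS' where cust_id = 101;
This also shows why target tables carry flag and timestamp columns: they mark which row version is current and when it was effective.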
21. How to implement SCD Type 2 in DataStage?
Ans: First we use the Change Capture stage. It compares the before and after data sets and gives change codes for copy, insert, update and delete records. Then we generate surrogate keys using the Surrogate Key Generator stage, and finally we use the Change Apply stage to apply the changes.
22. How to capture duplicate data in DataStage? (or) I have a file A with column eno, and the values in eno are 1, 2, 3, 4, 1 and 2. I want 1, 2, 3, 4 (unique records) in one file and 1, 2 (duplicate records) in another file. How will you do that in DataStage?
Ans: By using the Create Key Change Column property in the Sort stage, we can capture the duplicates. This property assigns 1 to the first record of each key group and 0 to the remaining (duplicate) records. Then, by using a Filter or Transformer stage, we can send the unique records to one file and the duplicate records to another file.
23.
) I have a file A and the column is eno and values in eno are 1,
2,3,4,1 and, 2. I want 3, 4 (complete unique records) in one file and
1, 2 (complete duplicate records) in another file. How will u do that
in DataStage?
Ans:
By using aggregator stage we can do that. We need to set aggregator
type=count rows and output column name=out1 and group by =eno (key
column). Then by using filter or transformer stage, we can send complete
unique records into one file and complete duplicate records into another file.
I have files A and B. Both are customer files. File A has columns a, b and c. File B has columns b and c. I want an output file C with columns a, b and c. I have 10 records in file A and 5 records in file B. Now tell me how to concatenate these two files in DataStage?
Ans:
In file A the columns are a, b and c, and in file B the columns are b and c. So we need to generate a dummy column a in file B by using the Column Generator stage; then, by using the Funnel stage, we can concatenate the 2 files. You will get 15 records in the output.
24. What is the difference between Normal lookup and Sparse lookup?
Ans:
If the reference table has a smaller amount of data than the primary (input) link, it is better to go for a Normal lookup (the reference data is read into memory). If the reference table has a huge amount of data compared to the primary link, it is better to go for a Sparse lookup (the lookup SQL is fired against the database for each input row).

25. What is meant by a Junk dimension?
Ans:
Junk dimensions are dimensions that contain miscellaneous data (like flags
and indicators) that do not fit in the base dimension table.
26. What is meant by a Degenerate dimension?
Ans:
A degenerate dimension is data that is dimensional in nature but stored in a fact table.
Degenerate Dimension:
This is nothing but dimension data stored within fact tables.
Example: suppose you have a dimension with Order Number and Invoice Number fields that has a one-to-one relationship with the fact table. In such a case you may want one table with a billion records instead of two tables with a billion records each, so you store these fields within the fact itself instead of keeping them in a separate dimension table, to save space.


Junk Dimension:
A junk dimension is nothing but miscellaneous data that does not fit in any base dimension and is hence stored in a separate table.
Example:
If you have fields like flags or indicators repeating in each fact record, you may create a separate table to hold all possible combinations of flags and indicators and keep a reference to it in the fact table.

27. What type of data are you getting?
Ans:
Customer data only.
28. I have created a dataset on a 4-node configuration file. Tell me how many files in total will be created? What are those?
Ans:
A total of 5 files will be created: 1 descriptor file and 4 data files.
29. I have created a dataset on a 2-node configuration file. Can I use the same dataset on a 4-node configuration file?
Ans:
Yes, we can do that. But vice versa is not possible: if you create a dataset on a 4-node configuration file and try to reuse the same dataset on a 2-node configuration file, the job will execute without any error, but you will not get the expected data in the output.
30. What value would be listed in the dataset when the column value is "NULL"?
Ans: The dataset will show NULL when there is a null in the data.
Oracle Interview Questions & Answers
31. What is meant by Referential Integrity?
Ans: Referential integrity is used to maintain the relationship between tables: every foreign key value in a child table must match an existing primary key value in the parent table, which keeps the data in the tables consistent.
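A minimal sketch, assuming hypothetical DEPT (parent) and EMP (child) tables:
Create table dept (deptno number primary key, dname varchar2(30));
Create table emp (empno number primary key, ename varchar2(30),
deptno number references dept(deptno)); -- foreign key to the parent
-- Fails with ORA-02291 if department 99 does not exist in DEPT:
Insert into emp values (1001, 'SMITH', 99);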
32. How do you connect to the Oracle server?
Ans: By creating a DSN (Data Source Name), we can connect to the server.
33. What is the difference between Union and Union All?
Ans: Union sorts the combined set and removes duplicates, whereas Union All does not remove duplicates.
Union All is faster than Union because Union's duplicate elimination requires a sorting operation, which takes time.
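A quick illustration, assuming the EMP and DEPT tables above:
Select deptno from emp
Union              -- sorts and removes duplicate deptno values
Select deptno from dept;
Select deptno from emp
Union all          -- keeps every row; no duplicate-elimination step, hence faster
Select deptno from dept;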
34. What is the difference between Delete and Truncate?
Ans:
a. Delete is a DML command whereas Truncate is a DDL command.
b. We can write a WHERE clause with a Delete operation, whereas we cannot write a WHERE clause with a Truncate operation.
c. We can roll back the data in a Delete, whereas we cannot roll back the data in a Truncate.
d. Truncate is faster than Delete because a delete first writes the affected rows to the rollback (undo) space and then performs the deletion, whereas a truncate directly deallocates the data.
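A quick illustration on the EMP table:
Delete from emp where deptno = 10; -- DML: row by row, generates undo
Rollback;                          -- the deleted rows come back
Truncate table emp;                -- DDL: deallocates storage, implicit commit, cannot be rolled back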
35. What is the difference between a view and a materialized view?
Ans:
a. A view is a logical representation of data, whereas a materialized view is a physical duplicate representation of the data.
b. A view doesn't hold data but points to the data, whereas a materialized view holds data.
c. Whenever you update a base table, the corresponding views reflect it automatically, whereas a materialized view is refreshed periodically, at a defined interval.
d. The main purpose of a materialized view is to pre-compute calculations and store data joined from multiple tables.
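A minimal sketch, assuming the EMP and DEPT tables; the daily refresh interval is illustrative:
-- View: stores only the query; data comes from the base tables at query time
Create view emp_dept_v as
select e.ename, d.dname from emp e join dept d on e.deptno = d.deptno;
-- Materialized view: physically stores the result and refreshes periodically
Create materialized view emp_dept_mv
refresh complete start with sysdate next sysdate + 1
as select e.ename, d.dname from emp e join dept d on e.deptno = d.deptno;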
36. What is the difference between the WHERE clause and the HAVING clause?
Ans:
a. The WHERE clause can be used without a GROUP BY clause, whereas HAVING cannot be used without a GROUP BY clause.
b. The WHERE clause filters rows before grouping, whereas the HAVING clause filters groups after grouping.
c. The WHERE clause cannot contain aggregate functions, whereas the HAVING clause can contain aggregate functions.
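A quick illustration on the EMP table:
Select deptno, count(*) as emp_cnt
from emp
where sal > 1000       -- filters individual rows before grouping
group by deptno
having count(*) > 5;   -- filters whole groups after aggregation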
37. What is the difference between the IN clause and the EXISTS clause?
Ans: If the result of the subquery is small, it is better to use the IN clause, whereas if the result of the subquery is huge, it is better to use the EXISTS clause.
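A quick illustration, assuming the EMP and DEPT tables:
-- IN: the subquery result is built first (fine when it is small)
Select * from emp
where deptno in (select deptno from dept where dname = 'SALES');
-- EXISTS: the correlated subquery is probed per outer row (better when it is huge)
Select * from emp e
where exists (select 1 from dept d where d.deptno = e.deptno and d.dname = 'SALES');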
38. What is the use of the DROP option in the ALTER TABLE command?
Ans: The DROP option in the ALTER TABLE command is used to drop constraints (or columns) specified on the table.
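A quick illustration, assuming a foreign key constraint named emp_dept_fk on EMP:
Alter table emp drop constraint emp_dept_fk; -- drop a named constraint
Alter table emp drop primary key;            -- drop the primary key constraint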
39. What is the use of CASCADE CONSTRAINTS?
Ans: When this clause is used with the DROP command, a parent table can be dropped even when a child table references it.
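A quick illustration with the EMP/DEPT tables above:
Drop table dept;                     -- fails with ORA-02449: EMP still references DEPT
Drop table dept cascade constraints; -- drops DEPT and the referencing foreign key on EMP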
40. How to find duplicate records in a table?
Ans: Select empno, count(*) from EMP group by empno having count(*) > 1;
41. How to remove duplicates from a table?
Ans:
a) Delete from EMP e1 where rowid not in (select min(rowid) from EMP e2 group by empno);
b) Delete from EMP e1 where rowid > (select min(rowid) from EMP e2 where e1.empno = e2.empno);
42. How to retrieve the first 10 records from a table?
Ans: Select * from (select empno, ename, sal, row_number() over (order by sal desc) as rn from EMP) where rn <= 10;
43. How to find the 2nd highest salary from a table?
Ans: Select max(sal) as high2 from EMP where sal < (select max(sal) from EMP);

44. How to find the 5th highest salary from a table?
Ans: Select min(sal) as high5 from (select distinct sal from EMP order by sal desc) where rownum <= 5;
45. How to find the nth highest salary from a table?
Ans: Select distinct e1.sal from EMP e1 where &N = (select count(distinct e2.sal) from EMP e2 where e2.sal >= e1.sal);
46. Tell me the syntax of the DECODE statement?
Ans: The DECODE function has the same functionality as an IF-THEN-ELSE statement.
DECODE(expression, search, result [, search, result]... [, default])
Expression is the value to compare. Search is the value that is compared against expression. Result is the value returned if the expression equals search. Default is returned when no search matches.
Example:
Select ename, decode(empid, 1000, 'IBM', 2000, 'Microsoft', 3000, 'Capgemini', 'TCS') as result from emp;
The above DECODE statement is equivalent to the following IF-THEN-ELSE statement:
IF empid = 1000 THEN
result := 'IBM';
ELSIF empid = 2000 THEN
result := 'Microsoft';
ELSIF empid = 3000 THEN
result := 'Capgemini';
ELSE
result := 'TCS';
END IF;
Note:
Also prepare the SQL queries that are in the SQL Questions & Answers document.
Prepare Unix commands as well (the difference between find and grep, and how to delete a dataset using the orchadmin command).
Prepare the below questions as well:
1. Merge statement (insert, update and delete in one SQL) - see the sketch after this list.
2. Ref table - 5 lakh records, primary table - 50,000 records: go for a Sparse lookup. If it is vice versa, then a Normal lookup.
3. Unix command to run a DataStage job - the dsjob command.
4. Normal table vs dimension table: I have 2 tables A and B; how to classify which is the normal table and which is the dimension table? A normal table doesn't have any hierarchies, but dimension tables have hierarchies.
5. Normalization and denormalization.
6. Is normalization done on the basis of dimensions or facts? (It is done on the basis of attributes.)
7. Select empno, count(*) from emp group by empno having count(*) > 1;
8. DECODE function syntax.
9. Types of source systems, and issues you have faced in your project.
10. Unix command to find and replace a string - the sed command.
11. How to extract only duplicates in DataStage - the Create Key Change Column in the Sort stage.
12. $? and $0 - what do they do? ($? holds the exit status of the last command; $0 holds the name of the current script or shell.)
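For item 1, a minimal MERGE sketch, assuming a hypothetical staging table EMP_STG with the same columns as EMP:
Merge into emp t
using emp_stg s
on (t.empno = s.empno)
when matched then
update set t.sal = s.sal
delete where t.sal is null  -- optional: remove matched rows whose updated sal is null
when not matched then
insert (empno, ename, sal) values (s.empno, s.ename, s.sal);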
