
Data Warehouse Basic Questions

Centralized Data Warehouse

A Centralized Data Warehouse is a data warehousing implementation wherein a single data warehouse serves the needs
of several separate business units simultaneously using a single data model that spans the needs of multiple business
divisions.

What is Central Data Warehouse

A Central Data Warehouse is a repository of company data where a database is created from operational data extracts. This database
adheres to a single, consistent enterprise data model to ensure consistency in decision making support across the company.

A Central Data Warehouse is a single physical database which contains business data for a specific function area, department, branch,
division or the whole enterprise. Choosing the central data warehouse is commonly based on where there is the largest common need
for informational data and where the largest numbers of end users are already hooked to a central computer or a network.

A central data warehouse employs the computing style of having all the information systems located and managed from one physical
location even if there are many data sources spread around the globe.

What is Active Data Warehouse

An Active Data Warehouse is a repository of captured transactional data, in any form, that can be used to find trends
and patterns for future decision making.

What is Active Metadata Warehouse

An Active Metadata Warehouse is a repository of metadata that helps speed up data reporting and analysis from an active
data warehouse. In its simplest definition, metadata is data describing data.

What is Enterprise Data Warehouse


Enterprise Data Warehouse is a centralized warehouse which provides service for the entire enterprise. A data warehouse
is by essence a large repository of historical and current transaction data of an organization. An Enterprise Data
Warehouse is a specialized data warehouse which may have several interpretations.

To give a clear picture of an Enterprise Data Warehouse and how it differs from an ordinary data warehouse, five
attributes are commonly considered. These attributes are not exclusive, but they bring people closer to a focused meaning of the
Enterprise Data Warehouse from among the many interpretations of the term. They mainly pertain to the
overall philosophy as well as the underlying infrastructure of an Enterprise Data Warehouse.

The first attribute of an Enterprise Data Warehouse is that it should hold a single version of the truth: the entire goal of
the warehouse's design is to come up with a definitive representation of the organization's business data as well as
the corresponding rules. Given the number and variety of systems and silos of company data that exist within any
business organization, many business warehouses may not qualify as an Enterprise Data Warehouse.

The second attribute is that an Enterprise Data Warehouse should have multiple subject areas. In order to have a unified
version of the truth for an organization, an Enterprise Data Warehouse should contain all subject areas related to
the enterprise, such as marketing, sales, finance, human resources and others.

The third attribute is that an Enterprise Data Warehouse should have a normalized design. This may be an arguable
attribute, as both normalized and denormalized databases have their own advantages for a data warehouse. In fact,
many data warehouse designers have used denormalized models such as star or snowflake schemas for
implementing data marts. But many also go for normalized databases for an Enterprise Data Warehouse, putting
flexibility first and performance second.

The fourth attribute is that an Enterprise Data Warehouse should be implemented as a Mission-Critical Environment. The
entire underlying infrastructure should be able to handle any unforeseen critical conditions because failure in the
data warehouse means stoppage of the business operation and loss of income and revenue. An Enterprise Data
Warehouse should have high availability features such as online parameter or database structural changes,
business continuance such as failover and disaster recovery features and security features.

Finally an Enterprise Data Warehouse should be scalable across several dimensions. It should expect that a company's
main objective is to grow and that the warehouse should be able to handle the growth of data as well as the growing
complexities of processes which will come together with the evolution of the business enterprise.

What is Functional Data Warehouse

Today's business environment is very data driven, and more companies hope to create a competitive advantage over
their competitors by building a system through which they can assess the current status of their
operations at any given moment and, at the same time, analyze trends and patterns within the
company's operations and relate them to the trends and patterns of the industry in a truly up-to-date fashion.

Breaking down the Enterprise Data Warehouse into several Functional Data Warehouses can have big benefits.
Since a data-driven enterprise deals with very high volumes of data, having separate
Functional Data Warehouses distributes the load and compartmentalizes the processes. With this setup, the whole
information system will never break down at once: if there is a glitch in one of the functional data
warehouses, only that component has to be temporarily halted while it is fixed. In a
monolithic data warehouse setup, by contrast, if the central database breaks down, the whole system suffers.

What is Operational Data Store (ODS)

An Operational Data Store (ODS) is an integrated database of operational data. Its sources include legacy systems and it
contains current or near term data. An ODS may contain 30 to 60 days of information, while a data warehouse
typically contains years of data.

An operational data store is basically a database that serves as an interim area for a data warehouse. Its
primary purpose is to handle data that is actively in use, such as transactions, inventory, and data collected
from point-of-sale systems. It works with a data warehouse, but unlike a data warehouse, an operational data store
does not contain static data. Instead, an operational data store contains data that is constantly updated through
the course of business operations.

What is Operational Database


An operational database contains enterprise data that is up to date and modifiable. In an enterprise data management
system, an operational database can be considered the counterpart of a decision support database, which
contains non-modifiable data extracted for statistical analysis. For example, a decision
support database might provide the data used to determine the average salary of many different kinds of workers,
while the operational database contains the same data, used to calculate the workers'
pay checks based on the number of days they have worked in a given period of time.

Data profiling

Data profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and
collecting statistics and information about that data. The purpose of these statistics may be to:

Find out whether existing data can easily be used for other purposes

Give metrics on data quality including whether the data conforms to company standards

Assess the risk involved in integrating data for new applications, including the challenges of joins

Track data quality

Assess whether metadata accurately describes the actual values in the source database

Understand data challenges early in any data-intensive project, so that late project surprises are avoided; finding data
problems late in the project can incur time delays and project cost overruns

Have an enterprise view of all data, for uses such as Master Data Management where key data is needed, or Data
governance for improving data quality
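As a rough illustration, the sketch below profiles one column of a CSV extract with plain Python and reports row counts, null percentage and the most frequent values; the file name and column name are hypothetical stand-ins for a real source.

    # Minimal data-profiling sketch (assumed file/column names; standard library only).
    import csv
    from collections import Counter

    def profile_column(path, column):
        """Collect simple statistics about one column of a CSV source."""
        total = nulls = 0
        values = Counter()
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                total += 1
                value = (row.get(column) or "").strip()
                if value == "":
                    nulls += 1          # treat empty strings as missing
                else:
                    values[value] += 1
        return {
            "rows": total,
            "null_pct": 100.0 * nulls / total if total else 0.0,
            "distinct": len(values),
            "top_values": values.most_common(5),
        }

    if __name__ == "__main__":
        # Hypothetical source extract; adjust path and column for your own data.
        print(profile_column("customer_extract.csv", "CITY_NAME"))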

Data governance is a quality control discipline for assessing, managing, using, improving, monitoring, maintaining, and
protecting organizational information.[1] It is a system of decision rights and accountabilities for information-related
processes, executed according to agreed-upon models which describe who can take what actions with what
information, and when, under what circumstances, using what methods.[2]

What are derived facts and cumulative facts?

There are two kinds of derived facts. The first kind is additive and can be calculated entirely from the other facts in the same fact
table row; it can be shown in a user view as if it existed in the real data, and the user will never know the difference.

The second kind of derived fact is a non-additive calculation, such as a ratio or cumulative fact, that is typically expressed at
a different level of detail than the base facts themselves.

A cumulative fact might be a year-to-date or month-to-date fact. In any case, these kinds of derived facts cannot be
presented in a simple view at the DBMS level because they violate the grain of the fact table. They need to be
calculated at query time by the BI tool.

Question : What is the data type of the surrogate key?

Answer :
The data type of the surrogate key is either integer or numeric (number).

Question : What is a hybrid slowly changing dimension?

Answer :
Hybrid SCDs are a combination of both SCD Type 1 and SCD Type 2.

It may happen that in a table some columns are important and we need to track changes for
them, i.e. capture the historical data for them, whereas for other columns we don't care even if the data changes.

For such tables we implement hybrid SCDs, where some columns are Type 1 and some are Type 2.
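A minimal sketch of the hybrid idea, assuming an in-memory customer dimension where 'phone' is treated as Type 1 (overwrite in place) and 'address' as Type 2 (expire the old row and add a new version); the table layout and helper names are illustrative, not any specific tool's API.

    # Hybrid SCD sketch: Type 1 columns are overwritten in place,
    # Type 2 columns trigger a new current row and expire the old one.
    from datetime import date

    TYPE1_COLS = {"phone"}      # changes overwrite history
    TYPE2_COLS = {"address"}    # changes create a new version

    def apply_change(dim_rows, natural_key, incoming, today=None):
        today = today or date.today().isoformat()
        current = next(r for r in dim_rows
                       if r["cust_id"] == natural_key and r["current_flag"] == "Y")
        type2_changed = any(incoming.get(c) != current[c] for c in TYPE2_COLS)
        if type2_changed:
            current["current_flag"] = "N"
            current["end_date"] = today
            new_row = dict(current, **{c: incoming[c] for c in TYPE1_COLS | TYPE2_COLS},
                           surrogate_key=max(r["surrogate_key"] for r in dim_rows) + 1,
                           start_date=today, end_date=None, current_flag="Y")
            dim_rows.append(new_row)
        else:
            for c in TYPE1_COLS:          # silent overwrite, no history kept
                current[c] = incoming.get(c, current[c])
        return dim_rows

    dim = [{"surrogate_key": 1, "cust_id": "C1", "address": "Old St", "phone": "111",
            "start_date": "2020-01-01", "end_date": None, "current_flag": "Y"}]
    apply_change(dim, "C1", {"address": "New Ave", "phone": "222"})
    print(dim)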
Question : Can a dimension table contain numeric values?

Answer :
Yes, but those columns are typically stored as character data; only the values themselves may be numeric.

Question : What are Data Marts?

Answer :
A data mart is a focused subset of a data warehouse that deals with a single area of data (such as a single
department) and is organized for quick analysis.

Question : Difference between Snowflake and Star Schema. What are situations where Snowflake
Schema is better than Star Schema to use and when the opposite is true?

Answer :
Star schema contains the dimension tables mapped around one or more fact tables.

It is a denormalized model.

There is no need to use complicated joins.

Queries return results quickly.

Snowflake schema

It is the normalized form of the star schema.

It contains more joins, because the dimension tables are split into many pieces. We can easily make modifications
directly in the tables.

We have to use more complicated joins, since we have more tables.

There will be some delay in processing queries.

Question : What is an ER Diagram?

Answer :
The Entity-Relationship (ER) model was originally proposed by Peter Chen in 1976 [Chen76] as a way
to unify the network and relational database views.

Simply stated, the ER model is a conceptual data model that views the real world as entities and
relationships. A basic component of the model is the Entity-Relationship diagram, which is used to
visually represent data objects.

Since Chen wrote his paper, the model has been extended, and today it is commonly used for
database design. For the database designer, the utility of the ER model is:

It maps well to the relational model. The constructs used in the ER model can easily be
transformed into relational tables.

It is simple and easy to understand with a minimum of training. Therefore, the model can be used
by the database designer to communicate the design to the end user.

In addition, the model can be used as a design plan by the database developer to implement a
data model in specific database management software.

Question : What is a degenerate dimension?

Answer :
Degenerate dimensions: values kept in the fact table which are neither ordinary dimension attributes nor measures
are called degenerate dimensions, e.g. invoice ID, employee number.

Question : What is VLDB?

Answer :
The perception of what constitutes a VLDB continues to grow. A one-terabyte database would
normally be considered a VLDB.

Question : What are the various ETL tools in the market?

Answer :
Various ETL tools used in the market are:

Informatica
DataStage
Oracle Warehouse Builder
Ab Initio
Data Junction

Question : What is the main difference between a schema in an RDBMS and schemas in a Data
Warehouse?

Answer :
RDBMS Schema
* Used for OLTP systems
* Traditional and old schema
* Normalized
* Difficult to understand and navigate
* Extraction and complex analytical problems are hard to solve
* Poorly suited to analytical modeling

DWH Schema
* Used for OLAP systems
* New generation schema
* Denormalized
* Easy to understand and navigate
* Extraction and complex analytical problems can be easily solved
* Well suited to analytical modeling

Question : What are the possible data marts in retail sales?

Answer :
Product information, sales information.

Question : 1. What is incremental loading?
2. What is batch processing?
3. What is a cross-reference table?
4. What is an aggregate fact table?

Answer :
Incremental loading means loading only the ongoing changes from the OLTP system.

An aggregate table contains the measure values aggregated/grouped/summed up to some level
of the hierarchy.

Question : What is meant by metadata in the context of a Data Warehouse and how is it important?

Answer :
Metadata is data about data. A business analyst or data modeler usually captures information
about data - the source (where and how the data originated), the nature of the data (char, varchar,
nullable, existence, valid values, etc.) and the behavior of the data (how it is modified or derived and its life
cycle) in a data dictionary, a.k.a. metadata. Metadata is also present at the data mart level, for
subsets, facts and dimensions, the ODS, etc. For a DW user, metadata provides vital information for
analysis / DSS.

Question : What is a linked cube?

Answer :
A linked cube is one in which a subset of the data can be analyzed in great detail. The linking ensures
that the data in the cubes remains consistent.

Question : What is a surrogate key? Where do we use it? Explain with examples.

Answer :
A surrogate key is a substitution for the natural primary key.

It is just a unique identifier or number for each row that can be used as the primary key for the
table. The only requirement for a surrogate primary key is that it is unique for each row in the
table.

Data warehouses typically use a surrogate key (also known as an artificial or identity key) for the
dimension tables' primary keys. They can use an Informatica sequence generator, an Oracle sequence, or
SQL Server identity values for the surrogate key.

It is useful because the natural primary key (i.e. Customer Number in Customer table) can
change and this makes updates more difficult.

Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the
primary keys (according to the business users), but not only can these change, indexing on a
numeric value is probably better, so you could consider creating a surrogate key called, say,
AIRPORT_ID. This would be internal to the system and, as far as the client is concerned, you may
display only the AIRPORT_NAME.

2. Adapted from response by Vincent on Thursday, March 13, 2003

Another benefit you can get from surrogate keys (SID) is :

Tracking the SCD - Slowly Changing Dimension.

Let me give you a simple, classical example:

On the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1' (that is what would be
in your Employee Dimension). This employee has turnover allocated to him on Business
Unit 'BU1'. But on the 2nd of June the Employee 'E1' is moved from Business Unit 'BU1' to
Business Unit 'BU2'. All the new turnover has to belong to the new Business Unit 'BU2', but the
old turnover should belong to Business Unit 'BU1'.

If you used the natural business key 'E1' for your employee within your data warehouse
everything would be allocated to Business Unit 'BU2' even what actually belongs to 'BU1.'

If you use surrogate keys, you could create on the 2nd of June a new record for the Employee
'E1' in your Employee Dimension with a new surrogate key.

This way, in your fact table, you have your old data (before 2nd of June) with the SID of the
Employee 'E1' + 'BU1.' All new data (after 2nd of June) would take the SID of the employee 'E1' +
'BU2.'

You could consider a Slowly Changing Dimension as an enlargement of your natural key: the natural
key of the Employee was the Employee Code 'E1', but for you it becomes
Employee Code + Business Unit - 'E1' + 'BU1' or 'E1' + 'BU2'. The difference with the natural
key enlargement process is that you might not have all parts of your new key within your fact
table, so you might not be able to join on the new enlarged key - so you need another id.
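The sketch below mirrors the E1/BU1/BU2 example with a small in-memory employee dimension; the surrogate keys (SIDs) come from a simple counter standing in for an Informatica sequence generator, Oracle sequence or SQL Server identity column, and all table and column names are made up.

    # SCD Type 2 with surrogate keys: the natural key 'E1' stays the same,
    # but each business-unit change gets a new surrogate key (SID).
    from itertools import count

    _next_sid = count(1)
    employee_dim = []          # rows: sid, emp_code, business_unit, start_date, end_date

    def add_version(emp_code, business_unit, start_date):
        # Expire the current version of this employee, if any.
        for row in employee_dim:
            if row["emp_code"] == emp_code and row["end_date"] is None:
                row["end_date"] = start_date
        sid = next(_next_sid)
        employee_dim.append({"sid": sid, "emp_code": emp_code,
                             "business_unit": business_unit,
                             "start_date": start_date, "end_date": None})
        return sid

    sid_bu1 = add_version("E1", "BU1", "2002-01-01")
    sid_bu2 = add_version("E1", "BU2", "2002-06-02")

    # Fact rows reference the SID in force at transaction time, so old turnover
    # stays attached to BU1 and new turnover to BU2.
    fact_turnover = [
        {"emp_sid": sid_bu1, "turnover": 1000, "txn_date": "2002-03-15"},
        {"emp_sid": sid_bu2, "turnover": 1500, "txn_date": "2002-07-10"},
    ]
    for dim_row in employee_dim:
        total = sum(f["turnover"] for f in fact_turnover if f["emp_sid"] == dim_row["sid"])
        print(dim_row["emp_code"], dim_row["business_unit"], total)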

Question : What are the data types present in BO? And what happens if we implement a view in the designer
and report?

Answer :
Three different data types: Dimension, Measure and Detail.

A view is nothing but an alias, and it can be used to resolve loops in the universe.

Question : What are the data validation strategies for data mart validation after the loading process?

Answer :
Data validation is making sure that the loaded data is accurate and meets the business
requirements.

Strategies are the different methods followed to meet the validation requirements.

Question : What is a Data Warehousing hierarchy?

Answer :
Hierarchies
Hierarchies are logical structures that use ordered levels as a means of organizing data. A
hierarchy can be used to define data aggregation. For example, in a time dimension, a hierarchy
might aggregate data from the month level to the quarter level to the year level. A hierarchy can
also be used to define a navigational drill path and to establish a family structure.

Within a hierarchy, each level is logically connected to the levels above and below it. Data values
at lower levels aggregate into the data values at higher levels. A dimension can be composed of
more than one hierarchy. For example, in the product dimension, there might be two hierarchies--
one for product categories and one for product suppliers.

Dimension hierarchies also group levels from general to granular. Query tools use hierarchies to
enable you to drill down into your data to view different levels of granularity. This is one of the key
benefits of a data warehouse.

When designing hierarchies, you must consider the relationships in business structures. For
example, a divisional multilevel sales organization.

Hierarchies impose a family structure on dimension values. For a particular level value, a value at
the next higher level is its parent, and values at the next lower level are its children. These
familial relationships enable analysts to access data quickly.

Levels

A level represents a position in a hierarchy. For example, a time dimension might have a
hierarchy that represents data at the month, quarter, and year levels. Levels range from general
to specific, with the root level as the highest or most general level. The levels in a dimension are
organized into one or more hierarchies.

Level Relationships

Level relationships specify top-to-bottom ordering of levels from most general (the root) to most
specific information. They define the parent-child relationship between the levels in a hierarchy.

Hierarchies are also essential components in enabling more complex query rewrites. For example, the
database can aggregate existing sales revenue at the quarterly level up to a yearly aggregation
when the dimensional dependencies between quarter and year are known.

Question : What is a BUS Schema?

Answer :
A BUS schema is composed of a master suite of conformed dimensions and standardized definitions
of facts.

Question : What are the methodologies of Data Warehousing?

Answer :
Every company has a methodology of its own, but to name a few, the SDLC and AIM
methodologies are commonly used. Other methodologies are AMM, World Class methodology
and many more.

Question : What is a conformed dimension?

Answer :
Conformed dimensions are dimensions which can be used across multiple data marts in
combination with multiple fact tables.

Question : What is the difference between E-R Modeling and Dimensional Modeling?

Answer :
Data modeling is nothing but designing the database using database normalization techniques
(1NF, 2NF, 3NF, etc.).

Data modeling has two types:

1. ER Modeling
2. Dimensional Modeling

ER Modeling is for OLTP databases and uses one of the normal forms (1NF, 2NF or 3NF). It contains
normalized data.

Dimensional Modeling is for data warehouses. It contains denormalized data.

The main difference between these two modeling techniques is the degree of normalization used to
design the databases. Both modeling techniques are represented using ER diagrams, so
the choice depends on the client requirement.

Question : Why is the fact table in normal form?

Answer :
The fact table is the central table in a star schema. It is kept normalized because it is very
large and we should avoid redundant data in it. That is why we create separate dimension tables,
thereby making a normalized star schema model, which helps query performance and eliminates
redundant data.

Question : What is the definition of a normalized and a denormalized view and what are the differences
between them?

Answer :
Normalization is the process of removing redundancies.

Denormalization is the process of allowing redundancies.

Question : What is a junk dimension? What is the difference between a junk dimension and a
degenerate dimension?

Answer :
Junk dimension: grouping random flags and text attributes and moving them into a separate
sub-dimension of their own.

Degenerate dimension: keeping control information on the fact table. For example, consider a dimension
table with fields like order number and order line number that has a 1:1 relationship with the fact
table. In this case the dimension table is removed and the order information is stored directly in the
fact table, in order to eliminate unnecessary joins while retrieving order information.

Question : What is the difference between a view and a materialized view?

Answer :
View - stores the SQL statement in the database and lets you use it as a table. Every time you
access the view, the SQL statement executes.

Materialized view - stores the result of the SQL in table form in the database. The SQL statement
executes only once, and after that every time you run the query the stored result set is used.
Pros include quick query results.
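As a rough illustration of the difference, the sqlite3 sketch below simulates a materialized view by storing a query's result set in its own table and refreshing it on demand; SQLite has no native materialized views, so this is only an analogy for how the stored result set goes stale until refreshed.

    # Ordinary view vs. simulated "materialized view" in SQLite.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE sales (region TEXT, amount REAL);
        INSERT INTO sales VALUES ('EAST', 100), ('EAST', 250), ('WEST', 300);

        -- A plain view: only the SQL text is stored; it re-executes on every access.
        CREATE VIEW v_sales_by_region AS
            SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
    """)

    def refresh_materialized():
        # "Materialize" the view: store its current result set in a real table.
        con.executescript("""
            DROP TABLE IF EXISTS mv_sales_by_region;
            CREATE TABLE mv_sales_by_region AS
                SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
        """)

    refresh_materialized()
    con.execute("INSERT INTO sales VALUES ('WEST', 999)")

    print(con.execute("SELECT * FROM v_sales_by_region").fetchall())   # sees the new row
    print(con.execute("SELECT * FROM mv_sales_by_region").fetchall())  # stale until refreshed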

Question : What is the main difference between the Inmon and Kimball philosophies of data
warehousing?

Answer :
Both differ in their approach to building the data warehouse.

According to Kimball:

Kimball views data warehousing as a constituency of data marts. Data marts are focused on
delivering business objectives for departments in the organization. And the data warehouse is a
conformed dimension of the data marts. Hence a unified view of the enterprise can be obtained
from the dimension modeling on a local departmental level.

Inmon believes in creating a data warehouse on a subject-by-subject area basis. Hence the
development of the data warehouse can start with data from the online store. Other subject areas
can be added to the data warehouse as their needs arise. Point-of-sale (POS) data can be added
later if management decides it is necessary.

i.e.,

Kimball - first data marts, later combined into a data warehouse.

Inmon - first the data warehouse, later data marts.

Question : What are the advantages of data mining over traditional approaches?

Answer :
Data mining is used for estimating the future. For example, if we take a company or business
organization, by using data mining we can predict the future of the business in terms
of revenue, employees, customers, orders, etc.

Traditional approaches use simple algorithms for estimating the future, but they do not give
results as accurate as data mining.

Question : What are the different architectures of a data warehouse?

Answer :
There are two main approaches:

1. Top-down (Bill Inmon)

2. Bottom-up (Ralph Kimball)

Question : What are the steps to build the data warehouse?

Answer :
As far as I know:

Gather business requirements
Identify sources
Identify facts
Define dimensions
Define attributes
Redefine dimensions and attributes
Organize the attribute hierarchy and define relationships
Assign unique identifiers
Additional conventions: cardinality, adding ratios

Question : What is a factless fact table? Where have you used it in your project?

Answer :
A factless fact table contains only keys; no measures are available in it.

Question : What is the difference between ODS and OLTP?

Answer :
An ODS is a collection of tables created in the data warehouse that maintains only
current data,

whereas an OLTP system maintains data only for transactions; OLTP systems are designed for recording the daily
operations and transactions of a business.

Question : What is the difference between a data warehouse and BI?

Answer :
Simply speaking, BI is the capability of analyzing the data of a data warehouse to the advantage of
the business. A BI tool analyzes the data of a data warehouse and arrives at business
decisions based on the result of the analysis.

Question : What is the difference between OLAP and a data warehouse?

Answer :
A data warehouse is the place where the data is stored for analysis,

whereas OLAP is the process of analyzing the data: managing aggregations and

partitioning information into cubes for in-depth visualization.

Question : What are aggregate tables and aggregate fact tables? Any examples of both?

Answer :
An aggregate table contains summarized data; materialized views are often used as aggregate tables.

For example, in sales we may have only date-level transactions. If we want to create a report like sales by product
per year, we aggregate the date values into week_agg, month_agg, quarter_agg and
year_agg tables. To retrieve data from these tables we use the @aggregate function.
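A small sketch of building a year-level aggregate from date-grain sales rows using only the Python standard library; the row layout and the year_agg naming follow the example above but are otherwise assumptions.

    # Build a year-level aggregate fact from date-grain sales rows.
    from collections import defaultdict

    sales_fact = [                       # date-grain base fact (illustrative rows)
        {"sale_date": "2005-03-10", "product": "P1", "amount": 120.0},
        {"sale_date": "2005-11-02", "product": "P1", "amount": 80.0},
        {"sale_date": "2006-01-15", "product": "P2", "amount": 200.0},
    ]

    year_agg = defaultdict(float)        # (year, product) -> summed amount
    for row in sales_fact:
        year = row["sale_date"][:4]
        year_agg[(year, row["product"])] += row["amount"]

    # The aggregate table now answers "sales by product per year" without
    # scanning the detailed fact rows.
    for (year, product), total in sorted(year_agg.items()):
        print(year, product, total)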

Question : What are non-additive facts, in detail?

Answer :
A fact may be a measure, a metric or a dollar value. Measures and metrics are often non-additive facts,
while a dollar value is an additive fact: if we want to find the amount for a particular place over a particular
period of time, we can add the dollar amounts and come up with the total amount.

An example of a non-additive fact is the measured heights of citizens by geographical location: when we
roll up 'city' data to 'state' level we should not add the heights of the citizens; rather we may want
to use them to derive a count.

Question : Why is denormalization promoted in Universe designing?

Answer :
In a relational data model, for normalization purposes, some lookup tables are not merged into a
single table. In dimensional data modeling (star schema), these tables are merged into a
single table called a DIMENSION table for performance and for slicing data. Because of this merging of
tables into one large dimension table, complex intermediate joins are avoided; dimension
tables are joined directly to fact tables. Although redundancy of data occurs in the DIMENSION
table, the size of the DIMENSION table is only about 15% of the FACT table. That is why
denormalization is promoted in Universe designing.

Question : Are OLAP databases called decision support systems? True or false?

Answer :
True.

Question : What is a snapshot?

Answer :
You can disconnect the report from the catalog to which it is attached by saving the report with a
snapshot of the data. However, you must reconnect to the catalog if you want to refresh the data.

ETL Basic Questions


******************************
Question : What is a data warehouse?

Answer :
A data warehouse is a repository of integrated information, available for queries and analysis.
Data and information are extracted from heterogeneous sources as they are generated. This
makes it much easier and more efficient to run queries over data that originally came from
different sources.

Typical relational databases are designed for on-line transactional processing (OLTP) and do not
meet the requirements for effective on-line analytical processing (OLAP). As a result, data
warehouses are designed differently than traditional relational databases.

Question : What are Data Marts?

Answer :
Data marts are designed to help managers make strategic decisions about their business.
Data marts are subsets of the corporate-wide data that are of value to a specific group of users.

There are two types of data marts:

1. Independent data marts - sourced from data captured from OLTP systems, external providers, or
data generated locally within a particular department or geographic area.

2. Dependent data marts - sourced directly from enterprise data warehouses.

Question : What is an ER Diagram?

Answer :
The Entity-Relationship (ER) model was originally proposed by Peter Chen in 1976 [Chen76] as a way
to unify the network and relational database views.

Simply stated, the ER model is a conceptual data model that views the real world as entities and
relationships. A basic component of the model is the Entity-Relationship diagram, which is used to
visually represent data objects.

Since Chen wrote his paper, the model has been extended, and today it is commonly used for
database design. For the database designer, the utility of the ER model is:

It maps well to the relational model. The constructs used in the ER model can easily be
transformed into relational tables.

It is simple and easy to understand with a minimum of training. Therefore, the model can be used
by the database designer to communicate the design to the end user.

In addition, the model can be used as a design plan by the database developer to implement a
data model in specific database management software.

Question : What is a Star Schema?

Answer :
A star schema is a way of organizing tables so that results can be retrieved from the
warehouse database easily and quickly. Usually a star schema consists of one or more
dimension tables arranged around a fact table; the layout looks like a star, which is how it got its name.
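A minimal star schema sketched in SQLite: one fact table with foreign keys to two dimension tables, plus the typical star join; all table and column names are invented for illustration.

    # Tiny star schema: sales_fact in the middle, dim_product and dim_date around it.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
        CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, year INTEGER);
        CREATE TABLE sales_fact  (product_key INTEGER, date_key INTEGER, sales_amount REAL);

        INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
        INSERT INTO dim_date    VALUES (20240101, '2024-01-01', 2024),
                                       (20240102, '2024-01-02', 2024);
        INSERT INTO sales_fact  VALUES (1, 20240101, 100.0), (2, 20240102, 150.0),
                                       (1, 20240102, 75.0);
    """)

    # The classic star join: the fact table joined directly to each dimension,
    # grouped by dimension attributes.
    query = """
        SELECT d.year, p.product_name, SUM(f.sales_amount)
        FROM sales_fact f
        JOIN dim_product p ON p.product_key = f.product_key
        JOIN dim_date d    ON d.date_key    = f.date_key
        GROUP BY d.year, p.product_name
    """
    print(con.execute(query).fetchall())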

Question : What is Dimensional Modeling? Why is it important?

Answer :
Dimensional modeling is a design concept used by many data warehouse designers to build their
data warehouse. In this design model all the data is stored in two types of tables - fact tables
and dimension tables. The fact table contains the facts/measurements of the business, and the
dimension table contains the context of the measurements, i.e., the dimensions by which the facts
are analyzed.

Why is Data Modeling Important?


---------------------------------------

Data modeling is probably the most labor intensive and time consuming part of the development
process. Why bother especially if you are pressed for time? A common response by practitioners
who write on the subject is that you should no more build a database without a model than you
should build a house without blueprints.

The goal of the data model is to make sure that the all data objects required by the database are
completely and accurately represented. Because the data model uses easily understood
notations and natural language, it can be reviewed and verified as correct by the end-users.

The data model is also detailed enough to be used by the database developers to use as a
"blueprint" for building the physical database. The information contained in the data model will be
used to define the relational tables, primary and foreign keys, stored procedures, and triggers. A
poorly designed database will require more time in the long-term. Without careful planning you
may create a database that omits data required to create critical reports, produces results that
are incorrect or inconsistent, and is unable to accommodate changes in the user's requirements.

Question : What is a Snowflake Schema?

Answer :
In a snowflake schema, each dimension has a primary dimension table, to which one or more
additional dimension tables can join. The primary dimension table is the only table that can join to the
fact table.

Question : What are Aggregate tables?

Answer :
An aggregate table contains a summary of existing warehouse data, grouped to certain
levels of dimensions. Retrieving the required data from the actual table, which may have millions of
records, takes more time and also affects performance. To avoid this we can aggregate the
table to the required level and use it. This reduces the load on the database server,
increases query performance and returns results very quickly.
Question : What is the Difference between OLTP and OLAP?

Answer :
Main differences between OLTP and OLAP are:

1. User and System Orientation

OLTP: customer-oriented, used for transaction and query processing by clerks, clients and IT
professionals.

OLAP: market-oriented, used for data analysis by knowledge workers (managers, executives,
analysts).

2. Data Contents

OLTP: manages current data, very detail-oriented.

OLAP: manages large amounts of historical data, provides facilities for summarization and
aggregation, stores information at different levels of granularity to support decision making
process.

3. Database Design

OLTP: adopts an entity relationship(ER) model and an application-oriented database design.

OLAP: adopts star, snowflake or fact constellation model and a subject-oriented database
design.

4. View

OLTP: focuses on the current data within an enterprise or department.

OLAP: spans multiple versions of a database schema due to the evolutionary process of an
organization; integrates information from many organizational locations and data stores

Question : What is ETL?

Answer :
ETL stands for extraction, transformation and loading.

ETL tools provide developers with an interface for designing source-to-target mappings,
transformations and job control parameters.
· Extraction
Take data from an external source and move it to the warehouse pre-processor database.

· Transformation
Transform data task allows point-to-point generating, modifying and transforming data.

· Loading
Load data task adds records to a database table in a warehouse.
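The three steps can be sketched end to end in a few lines of Python; the CSV source file, the gender code translation and the SQLite target table are all assumptions standing in for real source systems and a real warehouse loader.

    # Extract -> Transform -> Load sketch against assumed file and table names.
    import csv, sqlite3

    def extract(path):
        with open(path, newline="") as f:
            return list(csv.DictReader(f))          # pull rows out of the source file

    def transform(rows):
        gender_map = {"M": 1, "F": 2}               # translate coded values
        out = []
        for r in rows:
            out.append({
                "customer_id": int(r["customer_id"]),
                "gender_code": gender_map.get(r["gender"], 0),
                "sale_amount": int(r["qty"]) * float(r["unit_price"]),  # derived value
            })
        return out

    def load(rows, db_path="warehouse.db"):
        con = sqlite3.connect(db_path)
        con.execute("""CREATE TABLE IF NOT EXISTS sales_fact
                       (customer_id INTEGER, gender_code INTEGER, sale_amount REAL)""")
        con.executemany(
            "INSERT INTO sales_fact VALUES (:customer_id, :gender_code, :sale_amount)", rows)
        con.commit()
        con.close()

    if __name__ == "__main__":
        load(transform(extract("daily_sales.csv")))   # hypothetical daily source extract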

Question : What are the various reporting tools in the market?

Answer :
1. MS Excel
2. Business Objects (Crystal Reports)
3. Cognos (Impromptu, PowerPlay)
4. MicroStrategy
5. MS Reporting Services
6. Informatica PowerAnalyzer
7. Actuate
8. Hyperion (Brio)
9. Oracle Express OLAP
10. ProClarity

Question : What is a Fact table?

Answer :
A fact table contains the measurements, metrics or facts of a business process. If your business
process is "Sales", then a measurement of this business process such as "monthly sales
number" is captured in the fact table. The fact table also contains the foreign keys for the dimension
tables.

Question : What is a dimension table?

Answer :
A dimension table is a collection of hierarchies and categories along which the user can drill
down and drill up. It contains only textual attributes.

Question : What is a lookup table?

Answer :
A lookup table is one which is used when updating a warehouse. When the lookup is placed
on the target table (fact table / warehouse) based on the primary key of the target, it
updates the table by allowing only new records or updated records, based on the lookup
condition.

Question : What is a general purpose scheduling tool?

Answer :
The basic purpose of a scheduling tool in a DW application is to streamline the flow of data
from source to target at a specific time or based on some condition.

Question : What modeling tools are available in the market?

Answer :
There are a number of data modeling tools:

Tool Name - Company Name
ERwin - Computer Associates
Embarcadero - Embarcadero Technologies
Rational Rose - IBM Corporation
PowerDesigner - Sybase Corporation
Oracle Designer - Oracle Corporation

Question : What is real-time data warehousing?

Answer :
Real-time data warehousing is a combination of two things: 1) real-time activity and 2) data
warehousing. Real-time activity is activity that is happening right now. The activity could be
anything, such as the sale of widgets. Once the activity is complete, there is data about it.

Data warehousing captures business activity data. Real-time data warehousing captures
business activity data as it occurs. As soon as the business activity is complete and there is data
about it, the completed activity data flows into the data warehouse and becomes available
instantly. In other words, real-time data warehousing is a framework for deriving information from
data as the data becomes available.

Question : What is data mining?

Answer :
Data mining is a process of extracting hidden trends from a data warehouse. For example, an
insurance data warehouse can be used to mine data for the highest-risk people to insure in a
certain geographical area.

Question : What are Normalization, First Normal Form, Second Normal Form and Third Normal
Form?

Answer :
1. Normalization is the process of assigning attributes to entities. It reduces data redundancies, helps
eliminate data anomalies, and produces controlled redundancies to link tables.

2. Normalization is the analysis of functional dependencies between attributes / data items of user
views. It reduces a complex user view to a set of small and stable subgroups of fields / relations.

1NF: Repeating groups are eliminated, dependencies can be identified, and all key
attributes are defined; there are no repeating groups in the table.

2NF: The table is already in 1NF and includes no partial dependencies - no attribute is dependent on only a
portion of the primary key. It may still exhibit transitive dependency: attributes may be
functionally dependent on non-key attributes.

3NF: The table is already in 2NF and contains no transitive dependencies.

Question : What is ODS?

Answer :
1. ODS means Operational Data Store.

2. A collection of operational or base data that is extracted from operational databases and
standardized, cleansed, consolidated, transformed, and loaded into an enterprise data architecture.
An ODS is used to support data mining of operational data, or as the store for base data that is
summarized for a data warehouse. The ODS may also be used to audit the data warehouse to
ensure that summarized and derived data is calculated properly. The ODS may further become the
enterprise shared operational database, allowing operational systems that are being
reengineered to use the ODS as their operational database.

Question : What type of indexing mechanism do we need to use for a typical data warehouse?

Answer :
On the fact table it is best to use bitmap indexes. Dimension tables can use bitmap and/or the
other types of clustered/non-clustered, unique/non-unique indexes.

To my knowledge, SQL Server does not support bitmap indexes; only Oracle supports bitmaps.

Question : Which columns go to the fact table and which columns go to the dimension table?

Answer :
The descriptive attributes of the source tables (entities) go to the dimension tables.
The primary key columns of the dimension tables go to the fact tables as foreign keys, alongside the measures.

Question : What is the level of granularity of a fact table?

Answer :
Level of granularity means the level of detail that you put into the fact table in a data warehouse. For
example, based on the design you can decide to store the sales data for each transaction. The level of
granularity then means how much detail you are willing to keep for each transactional fact: product
sales recorded for each minute, or aggregated up to the minute before the data is stored.

Question : What does the level of granularity of a fact table signify?

Answer :
Granularity
The first step in designing a fact table is to determine the granularity of the fact table. By
granularity, we mean the lowest level of information that will be stored in the fact table. This
constitutes two steps:

Determine which dimensions will be included.

Determine where along the hierarchy of each dimension the information will be kept.

The determining factors usually go back to the requirements.

Question : How are dimension tables designed?

Answer :
Most dimension tables are designed using normalization principles up to 2NF. In some instances
they are further normalized to 3NF.

Find where the data for this dimension is located.

Figure out how to extract this data.

Determine how to maintain changes to this dimension (see more on this in the next section).

Change the fact table and DW population routines.

Question : What are slowly changing dimensions?

Answer :
SCD stands for slowly changing dimensions. Slowly changing dimensions are of three types:

SCD1: only the updated values are maintained.

Ex: when a customer address is modified, we update the existing record with the new address.

SCD2: historical information and current information are maintained by using

A) effective dates

B) versions

C) flags

or a combination of these.

SCD3: by adding new columns to the target table we maintain both historical and current
information.

Question : What are non-additive facts?

Answer :
Non-additive facts are facts that cannot be summed up over any of the dimensions present in the
fact table.

Question : What are conformed dimensions?

Answer :
Conformed dimensions are dimensions which are common to multiple cubes (a cube here meaning a schema
containing fact and dimension tables).

Consider Cube-1 containing F1, D1, D2, D3 and Cube-2 containing F2, D1, D2, D4 as facts and dimensions;
here D1 and D2 are the conformed dimensions.

Question : What is VLDB?

Answer :
VLDB stands for Very Large Database.

It is an environment or storage space managed by a relational database management system
(RDBMS) consisting of vast quantities of information.

_____________________

Another view: VLDB doesn't refer only to the size of the database or the vast amount of information stored.
It also refers to the window of opportunity to back up the database.

The window of opportunity refers to the time interval available; if the DBA is unable to take a backup in
the specified time, the database is considered a VLDB.
Question : What are semi-additive and factless facts, and in which scenario would you use such kinds
of fact tables?

Answer :
Snapshot facts are semi-additive; we use semi-additive facts when we maintain aggregated facts.

Ex: average daily balance.

A fact table without numeric fact columns is called a factless fact table.

Ex: promotion facts.

This is used when recording promotion events against transactions (ex: product samples), because such a table
doesn't contain any measures.

Question : How do you load the time dimension?

Answer :
Time dimensions are usually loaded by a program that loops through all possible dates that may appear in
the data. It is not unusual for 100 years to be represented in a time dimension, with one row per day.
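A small generator of that kind, looping one row per day over a date range and deriving a few typical time-dimension attributes; the chosen attributes and key format are illustrative.

    # Generate a time dimension with one row per day (range and attributes are illustrative).
    from datetime import date, timedelta

    def build_time_dimension(start=date(2000, 1, 1), end=date(2099, 12, 31)):
        rows, day = [], start
        while day <= end:
            rows.append({
                "date_key": int(day.strftime("%Y%m%d")),   # surrogate-style key, e.g. 20240131
                "full_date": day.isoformat(),
                "year": day.year,
                "quarter": (day.month - 1) // 3 + 1,
                "month": day.month,
                "day_of_week": day.strftime("%A"),
                "is_weekend": day.weekday() >= 5,
            })
            day += timedelta(days=1)
        return rows

    dim_time = build_time_dimension(date(2024, 1, 1), date(2024, 1, 7))
    for row in dim_time:
        print(row)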

Question : Why are OLTP database designs not generally a good idea for a Data Warehouse?

Answer :
In OLTP, tables are normalized, so query response will be slow for end users, and OLTP systems
don't contain years of data, so historical analysis is not possible.

Question : Why should you put your data warehouse on a different system than your OLTP system?

Answer :
An OLTP system is basically "data oriented" (ER model) and not "subject oriented" (dimensional
model). That is why we design a separate system with a subject-oriented OLAP design.

Moreover, a complex query fired at an OLTP system causes heavy overhead on the
OLTP server, which directly affects the day-to-day business.

_____________

The loading of a warehouse will likely consume a lot of machine resources. Additionally, users may create
queries or reports that are very resource intensive because of the potentially large amount of data
available. Such loads and resource needs will conflict with the needs of the OLTP systems for
resources and will negatively impact those production systems.

Question : What are full load and incremental (refresh) load?

Answer :
Full load: completely erasing the contents of one or more tables and reloading them with fresh data.

Incremental load: applying only the ongoing changes to one or more tables, based on a predefined
schedule.

Question : What are snapshots? What are materialized views and where do we use them? What is a
materialized view log?

Answer :
A materialized view is a view in which the data is also stored in a physical table. With an ordinary
view, the database stores only the query, and each time we call the view it extracts data from the
base tables. With a materialized view, the data is stored in its own table.

Question : What is the difference between an ETL tool and OLAP tools?

Answer :
An ETL tool is meant for extracting data from legacy systems and loading it into a specified database,
with some data cleansing along the way.

Ex: Informatica, DataStage, etc.

OLAP is meant for reporting. In OLAP, data is available in a multidimensional model, so you
can write simple queries to extract data from the database.

Ex: BusinessObjects, Cognos, etc.

Question : Where do we use semi-additive and non-additive facts?

Answer :
Additive: a measure that can participate in arithmetic calculations using all or any dimensions.

Ex: sales profit.

Semi-additive: a measure that can participate in arithmetic calculations using only some dimensions.

Ex: account balance (additive across accounts, but not across time).

Non-additive: a measure that can't participate in arithmetic calculations using any dimension.

Ex: temperature.
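The difference shows up as soon as the facts are rolled up over time; the toy daily rows below (profit, account balance and temperature, all invented) illustrate which aggregations make sense for each kind of measure.

    # Additive vs. semi-additive vs. non-additive behaviour when rolling up over time.
    daily = [
        {"day": "2024-01-01", "sales_profit": 100, "account_balance": 500, "temperature": 20},
        {"day": "2024-01-02", "sales_profit": 150, "account_balance": 520, "temperature": 24},
        {"day": "2024-01-03", "sales_profit": 120, "account_balance": 480, "temperature": 22},
    ]

    # Additive: summing profit across the time dimension is meaningful.
    print("total profit:", sum(r["sales_profit"] for r in daily))

    # Semi-additive: summing balances over time is meaningless; take the latest
    # (or an average) instead.
    print("closing balance:", daily[-1]["account_balance"])

    # Non-additive: temperatures are never summed; only averages, minima or
    # maxima make sense.
    print("avg temperature:", sum(r["temperature"] for r in daily) / len(daily))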

Question : What is a staging area? Do we need it? What is the purpose of a staging area?

Answer :
Data staging is a collection of processes used to prepare source system data for loading into
a data warehouse. Staging includes the following steps:

Source data extraction,

Data transformation (restructuring),

Data transformation (data cleansing, value transformations),

Surrogate key assignments.

Question : What are the various methods of getting incremental records or delta records from the
source systems?

Answer :
One foolproof method is to maintain a field called 'Last Extraction Date' and then impose a
condition in the code saying 'current_extraction_date > last_extraction_date'.
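A sketch of that 'Last Extraction Date' approach, with a JSON file standing in for wherever the watermark is really kept and a hypothetical orders table as the source; it only demonstrates the comparison, not a production extractor.

    # Delta extraction using a persisted 'last extraction date' watermark.
    import json, os, sqlite3
    from datetime import datetime, timezone

    STATE_FILE = "last_extraction.json"          # assumed location of the watermark

    def read_watermark():
        if os.path.exists(STATE_FILE):
            with open(STATE_FILE) as f:
                return json.load(f)["last_extraction_date"]
        return "1900-01-01T00:00:00"             # first run: take everything

    def write_watermark(value):
        with open(STATE_FILE, "w") as f:
            json.dump({"last_extraction_date": value}, f)

    def extract_delta(con):
        last = read_watermark()
        now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
        # Only rows changed since the previous run:
        # current_extraction_date > last_extraction_date.
        rows = con.execute(
            "SELECT * FROM orders WHERE last_updated > ? AND last_updated <= ?",
            (last, now)).fetchall()
        write_watermark(now)
        return rows

    if __name__ == "__main__":
        con = sqlite3.connect(":memory:")
        con.execute("CREATE TABLE orders (order_id INTEGER, last_updated TEXT)")
        con.execute("INSERT INTO orders VALUES (1, '2024-05-01T10:00:00')")
        print(extract_delta(con))    # picks up the new row, then advances the watermark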

Question : What is a three-tier data warehouse?

Answer :
A data warehouse can be thought of as a three-tier system in which a middle tier provides
usable data in a secure way to end users. On either side of this middle tier are the end users
and the back-end data stores.

Question : What are active transformations / passive transformations?

Answer :
An active transformation can change the number of rows that pass through it (decrease or increase
the row count).

A passive transformation cannot change the number of rows that pass through it.

Question : Compare ETL and manual development.

Answer :
ETL - Extracting data from multiple sources (e.g. flat files, XML, COBOL, SAP) is much simpler with the help of tools.
Manual - Loading data from anything other than flat files and Oracle tables needs more effort.

ETL - High and clear visibility of logic.
Manual - Complex and not so user-friendly visibility of logic.

ETL - Contains metadata, and changes can be made easily.
Manual - No metadata concept, and changes need more effort.

ETL- Error handling, log summary and load progress makes life easier for developer and
maintainer.
Manual - need maximum effort from maintenance point of view.

ETL - Can handle Historic data very well.


Manual - as data grows the processing time degrades.

These are some differences b/w manual and ETL development.

Defining the data warehouse

A Data Warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of
management decisions.

Subject-oriented: Focus on natural data groups, not application boundaries.
Integrated: Provide consistent formats and encodings.
Time-variant: Data is organized by time and is stored in diverse time slices.
Nonvolatile: No updates are allowed. Only load and retrieval operations.

Subject-orientation mandated a cross-functional slice of data drawn from multiple sources to support a diversity
of needs. This was a radical departure from serving only the vertical application views of data (supply-side) or
the overlapping departmental needs for data (demand side).

The integration goal was taken from the realm of enterprise rhetoric down to something attainable. Integration is
not the act of wiring applications together. Nor is it simply commingling data from a variety of sources.
Integration is the process of mapping dissimilar codes to a common base, developing consistent data element
presentations and delivering this standardized data as broadly as possible.

Time variance is the most confusing Inmon concept but also a most pivotal one. At its essence, it calls for
storage of multiple copies of the underlying detail in aggregations of differing periodicity and/or time frames.
You might have detail for seven years along with weekly, monthly and quarterly aggregates of differing
duration. The time variant strategy is essential, not only for performance but also for maintaining the
consistency of reported summaries across departments and over time.

Non-volatile design is essential. It is also the principle most often violated or poorly implemented. Non-volatility
literally means that once a row is written, it is never modified. This is necessary to preserve incremental net
change history. This, in turn, is required to represent data as of any point in time. When you update a data row,
you destroy information. You can never recreate a fact or total that included the unmodified data. Maintaining
"institutional memory" is one of the higher goals of data warehousing.

Inmon lays out several other principles that are not a component of his definition. Some of these principles were
initially controversial but are commonly accepted now. Others are still in dispute or have fallen into disfavor.
Modification of one principle is the basis for the next leap forward.

Data Warehouse 2000: Real-Time Data Warehousing


Our next step in the data warehouse saga is to eliminate the snapshot concept and the batch ETL mentality
that has dominated since the very beginning. The majority of our developmental dollars and a massive amount
of processing time go into retrieving data from operational databases. What if we eliminated this whole write
then detect then extract process? What if the data warehouse read the same data stream that courses into and
between the operational system modules? What if data that was meaningful to the data warehouse
environment was written by the operational system to a queue as it was created?

This is the beginning of a real-time data warehouse model. But, there is more.

What if we had a map for every operational instance that defined its initial transformation and home location in
the data warehouse detail/history layer? What if we also had publish-and-subscribe rules that defined
downstream demands for this instance in either raw form or as a part of some derivation or aggregation? What
if this instance was propagated from its operational origin through its initial transformation then into the
detail/history layer and to each of the recipient sites in parallel and in real time?

1. Which DataStage EE client application is used to manage roles for DataStage projects?
A. Director
B. Manager
C. Designer
D. Administrator
2. Importing metadata from data modeling tools like ERwin is accomplished by
which facility?
A. MetaMerge
B. MetaExtract
C. MetaBrokers
D. MetaMappers
3. Which two statements are true of writing intermediate results between parallel
jobs to persistent data sets? (Choose two.)
A. Datasets are pre-indexed.
B. Datasets are stored in native internal format.
C. Datasets retain data partitioning and sort order.
D. Datasets can only use RCP when a schema file is specified.
4. You are reading customer data using a Sequential File stage and sorting it by
customer ID using the Sort stage. Then the sorted data is to be sent to an
Aggregator stage which will count the number of records for each customer.

Which partitioning method is more likely to yield optimal performance without violating the business requirements?

A. Entire
B. Random
C. Round Robin
D. Hash by customer ID
5. A customer wants to create a parallel job to append to an existing Teradata table
with an input file of over 30 gigabytes. The input data also needs to be
transformed and combined with two additional flat files. The first has State codes
and is about 1 gigabyte in size. The second file is a complete view of the current
data which is roughly 40 gigabytes in size. Each of these files will have a one to
one match and ultimately be combined into the original file.

Which DataStage stage will communicate with Teradata using the maximum
parallel performance to write the results to an existing Teradata table?

A. Teradata API
B. Teradata Enterprise
C. Teradata TPump
D. Teradata MultiLoad
6. Which column attribute could you use to avoid rejection of a record with a NULL
when it is written to a nullable field in a target Sequential File?
A. null field value
B. bytes to skip
C. out format
D. pad char
7. You are reading customer records from a sequential file. In addition to the
customer ID, each record has a field named Rep ID that contains the ID of the
company representative assigned to the customer. When this field is blank, you
want to retrieve the customer's representative from the REP table.

Which stage has this functionality?

A. Join Stage
B. Merge Stage
C. Lookup Stage
D. No stage has this functionality.
8. You want to ensure that you package all the jobs that are used in a Job Sequence
for deployment to a production server.

Which command line interface utility will let you search for jobs that are used in a
specified Job Sequence?

A. dsjob
B. dsinfo
C. dsadmin
D. dssearch
9. Your job is running in a grid environment consisting of 50 computers each having
two processors. You need to add a job parameter that will allow you to run the job
using different sets of resources and computers on different job runs.

Which environment variable should you add to your job parameters?

A. APT_CONFIG_FILE
B. APT_DUMP_SCORE
C. APT_EXECUTION_MODE
D. APT_RECORD_COUNTS
10. Which two statements are valid about Job Templates? (Choose two.)
A. Job Templates can be created from any parallel job or Job Sequence.
B. Job Templates should include recommended environment variables including
APT_CONFIG_FILE.
C. Job Templates are stored on the DataStage development server where they
can be shared among developers.
D. The location where Job Templates are stored can be changed within the
DataStage Designer Tools - Options menu.

Answer Key:

1. D
2. C
3. B and C
4. D
5. B
6. A
7. C
8. D
9. A

10. A and B

ETL is important, as it is the way data actually gets loaded into the warehouse. This article assumes that data is always
loaded into a data warehouse, whereas the term ETL can in fact refer to a process that loads any database.


Extract

The first part of an ETL process is to extract the data from the source systems. Most data warehousing projects
consolidate data from different source systems. Each separate system may also use a different data organization / format.
Common data source formats are relational databases and flat files, but may include non-relational database structures
such as IMS or other data structures such as VSAM or ISAM. Extraction converts the data into a format for transformation
processing.


Transform

The transform phase applies a series of rules or functions to the extracted data to derive the data to be loaded. Some
data sources will require very little manipulation of data. However, in other cases any combination of the following
transformations types may be required:

• Selecting only certain columns to load (or if you prefer, null columns not to load)
• Translating coded values (e.g. If the source system stores M for male and F for female but the warehouse
stores 1 for male and 2 for female)
• Encoding free-form values (e.g. Mapping "Male" and "M" and "Mr" onto 1)
• Deriving a new calculated value (e.g. sale_amount = qty * unit_price)
• Joining together data from multiple sources (e.g. lookup, merge, etc)
• Summarizing multiple rows of data (e.g. total sales for each region)
• Generating surrogate key values
• Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
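Two of the listed transformation types sketched concretely in Python: encoding free-form gender values onto a single code, and transposing repeated per-quarter columns into rows (a simple unpivot); the record layouts are invented.

    # Encoding free-form values and unpivoting repeated columns (illustrative layouts).
    FREE_FORM_TO_CODE = {"male": 1, "m": 1, "mr": 1, "female": 2, "f": 2, "mrs": 2, "ms": 2}

    def encode_gender(raw):
        return FREE_FORM_TO_CODE.get(raw.strip().lower(), 0)   # 0 = unknown

    def unpivot_quarters(row):
        """Turn {'cust': 'C1', 'q1': 10, 'q2': 20, ...} into one row per quarter."""
        return [{"cust": row["cust"], "quarter": q, "sales": row[q]}
                for q in ("q1", "q2", "q3", "q4") if q in row]

    print(encode_gender("  Mr "))                                   # -> 1
    print(unpivot_quarters({"cust": "C1", "q1": 10, "q2": 20, "q3": 5, "q4": 8}))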

Load

The load phase loads the data into the data warehouse. Depending on the requirements of the organization, this process
ranges widely. Some data warehouses merely overwrite old information with new data. More complex systems can
maintain a history and audit trail of all changes to the data.


Challenges

ETL processes can be quite complex, and significant operational problems can occur with improperly designed ETL
systems.

The range of data values or data quality in an operational system may be outside the expectations of designers at the
time validation and transformation rules are specified. Data profiling of a source during data analysis is recommended to
identify the data conditions that will need to be managed by transform rules specifications.

The scalability of an ETL system across the lifetime of its usage needs to be established during analysis. This includes
understanding the volumes of data that will have to be processed within Service Level Agreements. The time available to
extract from source systems may change, which may mean the same amount of data may have to be processed in less
time. Some ETL systems have to scale to process terabytes of data to update data warehouses with tens of terabytes of
data. Increasing volumes of data may require designs that can scale from daily batch to intra-day micro-batch to
integration with message queues for continuous transformation and update.

A recent development in ETL software is the implementation of parallel processing. This has enabled a number of
methods to improve the overall performance of ETL processes when dealing with large volumes of data.

There are three main types of parallelism as implemented in ETL applications:

Data: splitting a single sequential file into smaller data files to provide parallel access.

Pipeline: allowing several components to run simultaneously on the same data stream, e.g. performing step 2 (look up a
value) on record 1 at the same time as step 1 (add two fields together) is performed on record 2.

Component: running multiple processes simultaneously on different data streams in the same job, e.g. sorting input file 1
at the same time as the contents of input file 2 are deduplicated.

All three types of parallelism are usually combined in a single job.
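
As a rough sketch of the data-parallel idea (outside of any ETL tool), the chunks of a data set can be handed to separate worker processes; the transform function and the sample data below are assumptions for illustration only.

# Illustrative data parallelism: transform chunks of rows in separate processes.
from multiprocessing import Pool

def transform_chunk(rows):
    # placeholder transform: upper-case every field of every row
    return [[field.upper() for field in row] for row in rows]

if __name__ == "__main__":
    rows = [["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]]
    chunks = [rows[i::2] for i in range(2)]   # split the data two ways
    with Pool(processes=2) as pool:
        results = pool.map(transform_chunk, chunks)
    print(results)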

An additional difficulty is making sure the data being uploaded is relatively consistent. Since the multiple source databases
have different update cycles (some may be updated every few minutes, while others may take days or weeks), an ETL
system may be required to hold back certain data until all sources are synchronized. Likewise, where a warehouse may
have to be reconciled to the contents of a source system or to the general ledger, establishing synchronization and
reconciliation points is necessary.


Tools

While an ETL process can be created using almost any programming language, creating them from scratch is quite
complex. Increasingly, companies are buying ETL tools to help in the creation of ETL processes.

A good ETL tool must be able to communicate with the many different relational databases and read the various file
formats used throughout an organization. ETL tools have started to migrate into Enterprise Application Integration, or
even Enterprise Service Bus, systems that now cover much more than just the extraction, transformation and loading of
data. Many ETL vendors now have data profiling, data quality and metadata capabilities.
Question: What other performance tuning have you done in your last project to increase the performance of slowly
running jobs? (Added: 7/26/2006)

Minimise the usage of Transformer stages (use Copy, Modify, Filter or Row Generator instead).
Use SQL code while extracting the data.
Handle the nulls.
Minimise the warnings.
Reduce the number of lookups in a job design.
Use not more than 20 stages in a job.
Use an IPC stage between two passive stages; it reduces processing time.
Drop indexes before loading data and recreate them after loading data into the tables.
Generally we cannot avoid lookups if the requirements make them compulsory.
There is no hard limit on the number of stages (such as 20 or 30), but we can break the job into smaller jobs and use
Data Set stages to store the intermediate data.
The IPC stage is provided in Server Jobs, not in Parallel Jobs.
Check the write cache of the hash file; if the same hash file is used both for lookup and as a target, disable this option.
If the hash file is used only for lookup, then enable "Preload to memory". This will improve the performance. Also check
the order of execution of the routines.
Don't use more than 7 lookups in the same Transformer; introduce new Transformers if it exceeds 7 lookups.
Use the "Preload to memory" option on the hash file output.
Use "Write to cache" on the hash file input.
Write into the error tables only after all the Transformer stages.
Reduce the width of the input record - remove the columns that you will not use.
Cache the hash files you are reading from and writing into, and make sure your cache is big enough to hold the hash
files.
Use ANALYZE.FILE or HASH.HELP to determine the optimal settings for your hash files; this also minimizes overflow
on the hash file.

If possible, break the input into multiple threads and run multiple instances of the job.
Answer: Staged the data coming from ODBC/OCI/DB2UDB stages or any database on the server using
Hash/Sequential files, both for optimum performance and for data recovery in case the job aborts.
Tuned the OCI stage 'Array Size' and 'Rows per Transaction' numerical values for faster inserts,
updates and selects.
Tuned the 'Project Tunables' in Administrator for better performance.
Used sorted data for Aggregator.
Sorted the data as much as possible in DB and reduced the use of DS-Sort for better performance of
jobs
Removed the data not used from the source as early as possible in the job.
Worked with DB-admin to create appropriate Indexes on tables for better performance of DS queries
Converted some of the complex joins/business in DS to Stored Procedures on DS for faster execution
of the jobs.
If an input file has an excessive number of rows and can be split-up then use standard logic to run
jobs in parallel.
Before writing a routine or a transform, make sure that the required functionality is not already available in one of
the standard routines supplied in the SDK or DS utilities categories.
Constraints are generally CPU intensive and take a significant amount of time to process. This may
be the case if the constraint calls routines or external macros but if it is inline code then the overhead
will be minimal.
Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate the
unnecessary records even getting in before joins are made.
Tuning should occur on a job-by-job basis.
Use the power of DBMS.
Try not to use a sort stage when you can use an ORDER BY clause in the database.
Using a constraint to filter a record set is much slower than performing a SELECT … WHERE….
Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally
faster than using ODBC or OLE.

CoolInterview.com

Question: How can I extract data from DB2 (on IBM iSeries) to the data warehouse using DataStage as the ETL tool?
Do I first need to use ODBC to create connectivity and then use an adapter for the extraction and transformation of
data? (Added: 7/26/2006)

Answer: You would need to install ODBC drivers to connect to the DB2 instance (these do not come with the regular
drivers we usually install; use the CD provided with the DB2 installation, which includes ODBC drivers for DB2) and
then try it out.

CoolInterview.com

What is DS Designer used for - did u use it?


Question: Added: 7/26/2006

You use the Designer to build jobs by creating a visual design that models the flow and
transformation of data from the data source through to the target warehouse. The Designer graphical
Answer: interface lets you select stage icons, drop them onto the Designer work area, and add links.

CoolInterview.com

How can I connect my DB2 database on AS400 to DataStage?


Question: Added: 7/26/2006
Do I need to use ODBC 1st to open the database connectivity
and then use an adapter for just connecting between the two?
Thanks alot of any replies.

You need to configure the ODBC connectivity for database (DB2 or AS400) in the datastage.
Answer: CoolInterview.com

How to improve the performance of hash file?


Question: Added: 7/26/2006

Answer: You can improve the performance of a hashed file by:

1. Preloading the hash file into memory - this can be done by enabling the preload option in the hash file
output stage.

2. Write caching options - data is written into a cache before being flushed to disk. You can
enable this to ensure that hash file rows are written in order onto the cache before being flushed to disk,
instead of in the order in which individual rows are written.

3. Preallocating - estimating the approximate size of the hash file so that the file does not need to be split too
often after write operations.

CoolInterview.com

How can we pass parameters to job by using file.


Question: Added: 7/26/2006

Answer: You can do this by passing parameters from a Unix file and then calling the execution of the DataStage
job; the DS job has the parameters defined (and they are passed in by the Unix script). A sketch follows.
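
A minimal sketch of the idea (the parameter file, project and job names are assumptions): read name=value pairs from the file and pass each one to dsjob with the -param option.

# Sketch: read parameters from a file and pass them to dsjob as -param name=value.
import shlex, subprocess

def run_job_with_param_file(project, job, param_file):
    cmd = ["dsjob", "-run"]
    with open(param_file) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#"):
                cmd += ["-param", line]        # e.g. SRC_DIR=/data/in
    cmd += [project, job]
    print("running:", " ".join(shlex.quote(part) for part in cmd))
    return subprocess.call(cmd)

# run_job_with_param_file("MyProject", "MyJob", "/tmp/job_params.txt")
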
CoolInterview.com

What is a project? Specify its various components?


Question: Added: 7/26/2006

You always enter DataStage through a DataStage project. When you start a DataStage client you are
prompted to connect to a project. Each project contains:

DataStage jobs.

Answer: Built-in components. These are predefined components used in a job.

User-defined components. These are customized components created using the DataStage Manager
or DataStage Designer

CoolInterview.com

How can u implement slowly changed dimensions in datastage?


Question: Added: 7/26/2006
explain?

2) can u join flat file and database in datastage?how?


Yes, we can do it in an indirect way. First create a job which can populate the data from database into
a Sequential file and name it as Seq_First1. Take the flat file which you are having and use a Merge
Answer: Stage to join the two files. You have various join types in Merge Stage like Pure Inner Join, Left Outer
Join, Right Outer Join etc., You can use any one of these which suits your requirements.
CoolInterview.com

Question: Can anyone tell me how to extract data from more than one heterogeneous source - for example a
sequential file, Sybase and Oracle - in a single job? (Added: 7/26/2006)

Answer: Yes, you can extract the data from two heterogeneous sources in DataStage using the Transformer stage;
it is simple - you just need to form a link between the two sources in the Transformer stage.
CoolInterview.com

Question: Will DataStage consider the second constraint in the Transformer once the first condition is satisfied (if
link ordering is given)? (Added: 7/26/2006)

Answer: Yes.

Question: What is the difference between "validated OK" and "compiled" in DataStage? (Added: 7/26/2006)

When we say "Validating a Job", we are talking about running the Job in the "check only" mode. The
following checks are made :

- Connections are made to the data sources or data warehouse.


Answer: - SQL SELECT statements are prepared.
- Files are opened. Intermediate files in Hashed File, UniVerse, or ODBC stages that use the local
data source are created, if they do not already exist.

CoolInterview.com

Why do you use SQL LOADER or OCI STAGE?


Question: Added: 7/26/2006

Answer: When the source data is enormous, or for bulk data, we can use OCI and SQL*Loader depending upon
the source.
CoolInterview.com

Question: Where do we use the Link Partitioner in a DataStage job? Explain with an example. (Added: 7/26/2006)

Answer: We use the Link Partitioner in DataStage Server Jobs. The Link Partitioner stage is an active stage which
takes one input and allows you to distribute partitioned rows to up to 64 output links. Through the Link
Partitioner, Link Collector and IPC stages we can achieve parallelism in Server jobs.
CoolInterview.com

Question: Purpose of using keys, and the difference between surrogate keys and natural keys. (Added: 7/26/2006)

We use keys to provide relationships between the entities (tables). By using primary and foreign key
relationships, we can maintain the integrity of the data.

The natural key is the one coming from the OLTP system.

Answer: The surrogate key is an artificial key which we create in the target data warehouse. We can use
these surrogate keys instead of the natural key. In SCD Type 2 scenarios, surrogate keys play a
major role.
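
A small sketch of the difference: the natural key comes from the source system, while the surrogate key is generated in the warehouse (here with a simple in-memory counter purely for illustration; in practice a database sequence or a surrogate key stage would be used).

# Sketch: assigning surrogate keys to natural keys coming from the OLTP source.
surrogate_map = {}
next_key = 1

def get_surrogate(natural_key):
    global next_key
    if natural_key not in surrogate_map:
        surrogate_map[natural_key] = next_key
        next_key += 1
    return surrogate_map[natural_key]

print(get_surrogate("CUST-001"))   # 1
print(get_surrogate("CUST-002"))   # 2
print(get_surrogate("CUST-001"))   # 1 again - same natural key, same surrogate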

CoolInterview.com
Question: How does DataStage handle user security? (Added: 7/26/2006)

Answer: We have to create users in the Administrator and give the necessary privileges to those users.
CoolInterview.com

Question: How can I specify a filter command for processing data while defining sequential file output data?
(Added: 7/26/2006)

Answer: We have something called after-job and before-job subroutines, with which we can execute
Unix commands. Here we can use the sort command or the filter command.

CoolInterview.com

Question: How to parameterise a field in a sequential file? I am using DataStage as the ETL tool and a sequential
file as the source. (Added: 7/26/2006)

Answer: We cannot parameterize a particular field in a sequential file; instead we can parameterize the source
file name of the sequential file.
CoolInterview.com

Is it possible to move the data from oracle ware house to SAP


Question: Added: 7/26/2006
Warehouse using with DATASTAGE Tool.

We can use DataStage Extract Pack for SAP R/3 and DataStage Load Pack for SAP BW to transfer
the data from oracle to SAP Warehouse. These Plug In Packs are available with DataStage Version
Answer: 7.5
CoolInterview.com

Question: How to implement Type 2 slowly changing dimensions in DataStage? Explain with an example.
(Added: 7/26/2006)

We can handle SCD in the following ways

Type 1: Just use, “Insert rows Else Update rows”

Or

“Update rows Else Insert rows”, in update action of target

Type 2: Use the steps as follows

a) Use one hash file to look up the target


Answer:
b) Take 3 instances of target

c) Give different conditions depending on the process

d) Give different update actions in target

e) Use system variables like Sysdate and Null.

CoolInterview.com

How to handle the rejected rows in datastage?


Question: Added: 7/26/2006

Answer: We can handle rejected rows in two ways with the help of constraints in a Transformer: 1) by marking the
Rejected cell where we write our constraints in the properties of the Transformer, or 2) by using REJECTED in the
expression editor of the constraint. Create a hash file as temporary storage for rejected rows, create a link and use
it as one of the outputs of the Transformer, and apply either of the two steps above to that link. All the rows that are
rejected by all the constraints will go to the hash file.

CoolInterview.com
How do we do the automation of dsjobs?
Question: Added: 7/26/2006

We can call Datastage Batch Job from Command prompt using 'dsjob'. We can also pass all the
parameters from command prompt.
Answer: Then call this shell script in any of the market available schedulers.
The 2nd option is schedule these jobs using Data Stage director.
CoolInterview.com

What is Hash file stage and what is it used for?


Question: Added: 7/26/2006

Answer: We can also use the Hash File stage to avoid or remove duplicate rows by specifying the hash key on a
particular field.
CoolInterview.com

What is version Control?


Question: Added: 7/26/2006

Version Control

stores different versions of DS jobs

runs different versions of same job


Answer:
reverts to previous version of a job

view version histories

CoolInterview.com

How to find the number of rows in a sequential file?


Question: Added: 7/26/2006

Using Row Count system variable


Answer: CoolInterview.com

Suppose if there are million records did you use OCI? if not then
Question: Added: 7/26/2006
what stage do you prefer?

Using Orabulk
Answer: CoolInterview.com

How to run the job in command prompt in unix?


Question: Added: 8/18/2006

Using dsjob command,

-options

Answer: dsjob -run -jobstatus projectname jobname

CoolInterview.com

How to find errors in job sequence?


Question: Added: 7/26/2006

using DataStage Director we can find the errors in job sequence


Answer: CoolInterview.com

How do you eliminate duplicate rows?


Question: Added: 7/26/2006

Answer: Use the Remove Duplicates stage: it takes a single sorted data set as input, removes all duplicate
records, and writes the results to an output data set.

Without the Remove Duplicates stage (a small sketch follows this list):

1. In the target, make the column the key column and run the job.
2. Go to the partitioning tab, select hash partitioning, select perform sort, select unique, choose the
column on which you want to remove duplicates and then run; it will work.
3. If, for example, the source is a database table, you can write a user-defined query at
source level such as SELECT DISTINCT; or, if the source data comes from a sequential file, you can
pass it to a Sort stage and set the "allow duplicates" option to false there.
In the Sequential File stage we also have the Filter property, where you can give sort -u <file
name>; it removes the duplicates and gives the output.
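
For comparison, a tiny sketch of key-based duplicate removal outside the tool (the key column name is an assumption):

# Sketch: keep only the first row seen for each key value.
def dedupe(rows, key="customer_id"):
    seen = set()
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            yield row

rows = [{"customer_id": 1, "name": "A"},
        {"customer_id": 1, "name": "A duplicate"},
        {"customer_id": 2, "name": "B"}]
print(list(dedupe(rows)))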

CoolInterview.com

Question: If I add a new environment variable in Windows, how can I access it in DataStage? (Added: 7/26/2006)

Answer: You can view all the environment variables in Designer. You can check them in the job properties, and
you can add and access environment variables from the job properties.
CoolInterview.com

Question: How do you pass the parameter to the job sequence if the job is running at night? (Added: 7/26/2006)

Answer: Two ways:
1. Set the default values of the parameters in the Job Sequencer and map these parameters to the job.
2. Run the job in the sequencer using the dsjob utility, where we can specify the value to be taken for
each parameter.

CoolInterview.com

Question: What are the transaction size and the array size in the OCI stage? How can these be used?
(Added: 7/26/2006)

Transaction Size - This field exists for backward compatibility, but it is ignored for release 3.0 and
later of the Plug-in. The transaction size for new jobs is now handled by Rows per transaction on the
Transaction Handling tab on the Input page.

Rows per transaction - The number of rows written before a commit is executed for the transaction.
Answer: The default value is 0, that is, all the rows are written before being committed to the data table.

Array Size - The number of rows written to or read from the database at a time. The default value is 1,
that is, each row is written in a separate statement.
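
To illustrate the two settings, here is a hedged sketch of batched writes: rows are sent to the database in arrays of a given size and committed every so many rows. The generic DB-API calls, the SQLite test table and the column names are assumptions for illustration, not the OCI plug-in itself.

# Sketch: "array size" = rows sent per executemany() call,
#         "rows per transaction" = rows written before each commit.
import sqlite3

def batched_load(conn, rows, array_size=100, rows_per_txn=1000):
    cur = conn.cursor()
    written = 0
    for i in range(0, len(rows), array_size):
        batch = rows[i:i + array_size]
        cur.executemany("INSERT INTO target (id, amount) VALUES (?, ?)", batch)
        written += len(batch)
        if written >= rows_per_txn:
            conn.commit()
            written = 0
    conn.commit()   # commit any remainder

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER, amount REAL)")
batched_load(conn, [(i, i * 1.5) for i in range(2500)])
print(conn.execute("SELECT COUNT(*) FROM target").fetchone())   # (2500,)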

CoolInterview.com

What is the difference between drs and odbc stage


Question: Added: 7/26/2006

To answer your question, the DRS stage should be faster than the ODBC stage as it uses native
database connectivity. You will need to install and configure the required database clients on your
DataStage server for it to work.

Dynamic Relational Stage was leveraged for Peoplesoft to have a job to run on any of the supported
Answer: databases. It supports ODBC connections too. Read more of that in the plug-in documentation.

ODBC uses the ODBC driver for a particular database, DRS is a stage that tries to make it seamless
for switching from one database to another. It uses the native connectivities for the chosen target ...

CoolInterview.com

How do you track performance statistics and enhance it?


Question: Added: 7/26/2006

Through Monitor we can view the performance statistics.


Answer: CoolInterview.com
Question: What is the meaning of "Try to have the constraints in the 'Selection' criteria of the jobs itself. This will
eliminate the unnecessary records even getting in before joins are made"? (Added: 7/26/2006)

Answer: This means: try to improve performance by avoiding constraints wherever possible and
instead filter while selecting the data itself, using a WHERE clause. This improves performance.
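
A hedged sketch of the difference (the SQLite table and column names are assumptions): pushing the filter into the extraction SQL moves far less data than pulling every row and discarding most of them in a downstream constraint.

# Sketch: filtering at selection time versus filtering in a downstream constraint.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (cust_id INTEGER, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(1, "EAST", 10.0), (2, "WEST", 20.0), (3, "EAST", 30.0)])

# Constraint-style: pull every row, then throw most of them away in the job.
everything = conn.execute("SELECT cust_id, region, amount FROM sales").fetchall()
east_slow = [row for row in everything if row[1] == "EAST"]

# Selection-criteria style: the WHERE clause eliminates the rows at the source.
east_fast = conn.execute(
    "SELECT cust_id, region, amount FROM sales WHERE region = 'EAST'").fetchall()

print(east_slow == east_fast)   # same result, far less data moved the second way
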
CoolInterview.com

My requirement is like this :


Question: Added: 7/26/2006
Here is the codification suggested:

SALE_HEADER_XXXXX_YYYYMMDD.PSV
SALE_LINE_XXXXX_YYYYMMDD.PSV

XXXXX = LVM sequence to ensure uniqueness and continuity of file exchanges
(caution: there will be an increment to implement).
YYYYMMDD = LVM date of file creation

COMPRESSION AND DELIVERY TO:


SALE_HEADER_XXXXX_YYYYMMDD.ZIP AND
SALE_LINE_XXXXX_YYYYMMDD.ZIP

if we run that job the target file names are like this
sale_header_1_20060206 & sale_line_1_20060206.

If we run next time means the target files we like this


sale_header_2_20060206 & sale_line_2_20060206.

If we run the same in next day means the target files we want
like this
sale_header_3_20060306 & sale_line_3_20060306.

i.e., whenever we run the same job the target files automatically
changes its filename to
filename_increment to previous number(previousnumber +
1)_currentdate;

Please do the needful by replying to this question.

Answer: This can be done using a Unix script (a sketch follows):

1. Keep the target filename as a constant name, xxx.psv.

2. Once the job has completed, invoke the Unix script through the after-job routine ExecSH.

3. The script should get the number used in the previous file and increment it by 1; after that, move the file
from xxx.psv to filename_(previousnumber + 1)_currentdate.psv and then delete xxx.psv. This
is the easiest way to implement it.
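
A hedged sketch of such an after-job step (the directory, the constant file name xxx.psv and the helper function itself are assumptions taken from the description above):

# Sketch: rename the constant target file to <prefix>_<n+1>_<yyyymmdd>.psv
import glob, os, re
from datetime import date

def roll_target(directory, prefix, constant_name="xxx.psv"):
    # find the highest sequence number already used for this prefix
    highest = 0
    for path in glob.glob(os.path.join(directory, prefix + "_*_*.psv")):
        match = re.search(prefix + r"_(\d+)_\d{8}\.psv$", os.path.basename(path))
        if match:
            highest = max(highest, int(match.group(1)))
    new_name = "%s_%d_%s.psv" % (prefix, highest + 1,
                                 date.today().strftime("%Y%m%d"))
    os.rename(os.path.join(directory, constant_name),
              os.path.join(directory, new_name))
    return new_name

# roll_target("/data/out", "sale_header")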

CoolInterview.com

How to drop the index befor loading data in target and how to
Question: Added: 7/26/2006
rebuild it in data stage?

This can be achieved by the "Direct Load" option of the SQL*Loader utility.


Answer: CoolInterview.com

What are the Job parameters?


Question: Added: 7/26/2006

These Parameters are used to provide Administrative access and change run time values of the job.
Answer:
EDIT>JOBPARAMETERS

In that Parameters Tab we can define the name,prompt,type,value

CoolInterview.com

There are three different types of user-created stages available


Question: Added: 7/26/2006
for PX.
What are they? Which would you use? What are the
disadvantage for using each type?

These are the three different stages:


i) Custom
Answer: ii) Build
iii) Wrapped
CoolInterview.com

What are the different types of lookups in datastage?


Question: Added: 7/26/2006

Answer: There are two types of lookups: the Lookup stage and the Lookup File Set. Lookup: references another
stage or a database to get the data from it and transform it to the other database. Lookup File Set: allows
you to create a lookup file set or reference one for a lookup. The stage can have a single input link or
a single output link. The output link must be a reference link. The stage can be configured to execute
in parallel or sequential mode when used with an input link. When creating lookup file sets, one file
will be created for each partition. The individual files are referenced by a single descriptor file, which
by convention has the suffix .fs.
CoolInterview.com

How can we create Containers?


Question: Added: 7/26/2006

There are Two types of containers

1.Local Container

2.Shared Container

Local container is available for that particular Job only.

Where as Shared Containers can be used any where in the project.

Local container:

Answer: Step1:Select the stages required

Step2:Edit>ConstructContainer>Local

SharedContainer:

Step1:Select the stages required

Step2:Edit>ConstructContainer>Shared

Shared containers are stored in the SharedContainers branch of the Tree Structure

CoolInterview.com

Briefly describe the various client components?


Question: Added: 7/26/2006

There are four client components


Answer:
DataStage Designer. A design interface used to create DataStage applications (known as jobs). Each
job specifies the data sources, the transforms required, and the destination of the data. Jobs are
compiled to create executables that are scheduled by the Director and run by the Server.

DataStage Director. A user interface used to validate, schedule, run, and monitor DataStage jobs.

DataStage Manager. A user interface used to view and edit the contents of the Repository.

DataStage Administrator. A user interface used to configure DataStage projects and users.

CoolInterview.com

Types of views in Datastage Director?


Question: Added: 7/27/2006

There are 4 types of views in Datastage Director


a) Job Status View - Dates of Jobs Compiled,Finished,Start time,End time and Elapsedtime
b)Job Scheduler View-It Displays whar are the jobs are scheduled.
Answer: c) Log View - Status of Job last run
d) Detail View - Warning Messages, Event Messages, Program Generated Messages.

CoolInterview.com

How to implement routines in data stage,


Question: Added: 7/26/2006

There are 3 kind of routines is there in Datastage.

1.server routines which will used in server jobs.

these routines will write in BASIC Language

Answer: 2.parlell routines which will used in parlell jobs

These routines will write in C/C++ Language

3.mainframe routines which will used in mainframe jobs

CoolInterview.com

What are the environment variables in datastage?give some


Question: Added: 7/26/2006
examples?

Answer: These are variables used at the project or job level. We can use them to configure the job,
e.g. we can associate the configuration file (without this you cannot run your job) or increase the sequential
or data set read/write buffer.

Example: $APT_CONFIG_FILE

Like the above, we have many environment variables. Go to the job properties and click on "Add
environment variable" to see most of them.

What are the Steps involved in development of a job in


Question: Added: 7/26/2006
DataStage?

The steps required are:

Select the data source stage depending upon the sources, e.g. flat file, database, XML, etc.

Select the required stages for the transformation logic, such as Transformer, Link Collector, Link Partitioner,
Answer: Aggregator, Merge, etc.

Select the final target stage where you want to load the data, whether it is a data warehouse, data mart,
ODS, staging area, etc.

Question: What is the purpose of the exception activity in DataStage 7.5? (Added: 7/26/2006)

What is Modulus and Splitting in Dynamic Hashed File?


Question: Added: 7/27/2006

The modulus size can be increased by contacting your Unix Admin.


Answer:

What are Static Hash files and Dynamic Hash files?


Question: Added: 7/26/2006

The hashed files have the default size established by their modulus and separation when you create
them, and this can be static or dynamic.

Answer: Overflow space is only used when data grows over the reserved size for one of the groups
(sectors) within the file. There are as many groups as specified by the modulus.

Question: What is the exact difference between the Join, Merge and Lookup stages? (Added: 7/26/2006)

The exact difference between Join,Merge and lookup is

The three stages differ mainly in the memory they use

DataStage doesn't know how large your data is, so cannot make an informed choice whether to
combine data using a join stage or a lookup stage. Here's how to decide which to use:

if the reference datasets are big enough to cause trouble, use a join. A join does a high-speed sort on
Answer: the driving and reference datasets. This can involve I/O if the data is big enough, but the I/O is all
highly optimized and sequential. Once the sort is over the join processing is very fast and never
involves paging or other I/O

Unlike Join stages and Lookup stages, the Merge stage allows you to specify several reject links - as
many as there are update input links.

What does separation option in static hash-file mean?


Question: Added: 7/26/2006

The different hashing algorithms are designed to distribute records evenly among the groups of the
file based on characters and their position in the record ids.

When a hashed file is created, Separation and Modulo respectively specifies the group buffer size
and the number of buffers allocated for a file. When a Static Hashfile is created, DATASTAGE
Answer: creates a file that contains the number of groups specified by modulo.

Size of Hashfile = modulus(no. groups) * Separations (buffer size)

DataStage Interview Questions & Answers

What is DS Administrator used for - did u use it?


Question: Added: 7/26/2006
The Administrator enables you to set up DataStage users, control the purging of the Repository, and, if
National Language Support (NLS) is enabled, install and manage maps and locales.
Answer:

What is the max capacity of Hash file in DataStage?


Question: Added: 7/26/2006

Take a look at the uvconfig file:

# 64BIT_FILES - This sets the default mode used to


# create static hashed and dynamic files.
# A value of 0 results in the creation of 32-bit
# files. 32-bit files have a maximum file size of
# 2 gigabytes. A value of 1 results in the creation
Answer: # of 64-bit files (ONLY valid on 64-bit capable platforms).
# The maximum file size for 64-bit
# files is system dependent. The default behavior
# may be overridden by keywords on certain commands.
64BIT_FILES 0

Question: What is the difference between symmetric parallel processing and massively parallel processing?
(Added: 7/26/2006)

Answer: Symmetric Multiprocessing (SMP) - Some hardware resources may be shared by processors. The
processors communicate via shared memory and have a single operating system.

Cluster or Massively Parallel Processing (MPP) - Known as "shared nothing", in which each processor has
exclusive access to its hardware resources. Cluster systems can be physically dispersed. The processors
have their own operating system and communicate via a high-speed network.

What is the order of execution done internally in the transformer


Question: Added: 7/26/2006
with the stage editor having input links on the lft hand side and
output links?

Stage variables, constraints and column derivation or expressions.


Answer:

What are Stage Variables, Derivations and Constants?


Question: Added: 7/27/2006

Stage Variable - An intermediate processing variable that retains value during read and doesnt pass the
value into target column.

Derivation - Expression that specifies value to be passed on to the target column.


Answer:
Constant - Conditions that are either true or false that specify the flow of data within a link.

What is SQL tuning? how do you do it ?


Question: Added: 7/26/2006

Answer: SQL tuning can be done using cost-based optimization.

These pfile parameters are very important:

sort_area_size, sort_area_retained_size, db_multi_block_count, open_cursors, cursor_sharing

optimizer_mode = choose/rule

How to implement type2 slowly changing dimenstion in datastage?


Question: Added: 7/26/2006
give me with example?
Answer: Slowly changing dimensions are a common problem in data warehousing. For example: there is a
customer called Lisa in a company ABC and she lives in New York. Later she moves to Florida. The
company must modify her address now. In general there are three ways to solve this problem.

Type 1: The new record replaces the original record; there is no trace of the old record at all. Type 2: A new record
is added to the customer dimension table; the customer is therefore treated essentially as two different
people. Type 3: The original record is modified to reflect the change.

In Type 1 the new value overwrites the existing one, which means no history is maintained; the history of where the
person stayed before is lost. It is simple to use.

In Type 2 a new record is added, so both the original and the new record will be present, and the new
record gets its own primary key. The advantage of Type 2 is that historical information is maintained,
but the size of the dimension table grows, so storage and performance can become a concern.
Type 2 should only be used if it is necessary for the data warehouse to track the historical changes.

In Type 3 there are two columns, one to indicate the original value and the other to indicate the current
value. For example, a new column will be added which shows the original address as New York and the current
address as Florida. This helps keep some part of the history, and the table size is not increased. One
problem is that when the customer moves from Florida to Texas, the New York information is lost, so Type 3
should only be used if the changes will only occur a finite number of times.
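
To make the Type 2 behaviour concrete, here is a minimal sketch of expiring the old row and inserting a new one with its own surrogate key (the column names and the in-memory "dimension" are assumptions for illustration only):

# Sketch of SCD Type 2: close the current row, add a new row with a new surrogate key.
from datetime import date

dimension = [   # current state of the customer dimension
    {"sk": 1, "customer_id": "LISA", "city": "New York",
     "effective_from": date(2000, 1, 1), "effective_to": None, "current": True},
]

def apply_type2(dim, customer_id, new_city, change_date):
    next_sk = max(row["sk"] for row in dim) + 1
    for row in dim:
        if row["customer_id"] == customer_id and row["current"]:
            row["effective_to"] = change_date      # expire the old version
            row["current"] = False
    dim.append({"sk": next_sk, "customer_id": customer_id, "city": new_city,
                "effective_from": change_date, "effective_to": None, "current": True})

apply_type2(dimension, "LISA", "Florida", date(2006, 7, 26))
for row in dimension:
    print(row)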

Functionality of Link Partitioner and Link Collector?


Question: Added: 7/27/2006

Answer: Server jobs mainly execute in a sequential fashion; the IPC stage, as well as the Link Partitioner and Link
Collector, simulate a parallel mode of execution over server jobs that have a single CPU. Link Partitioner:
it receives data on a single input link and diverts the data to a maximum of 64 output links, with the data
processed by the same stage having the same metadata. Link Collector: it collects the data from up to 64
input links, merges it into a single data flow and loads it to the target. Both are active stages, and the design
and mode of execution of server jobs has to be decided by the designer.

What is the difference between sequential file and a dataset?


Question: Added: 7/26/2006
When to use the copy stage?

The Sequential File stage stores a small amount of data, with any extension, in order to access the file, whereas
a Data Set is used to store a huge amount of data and it opens only with the .ds extension.
Answer: The Copy stage copies a single input data set to a number of output data sets. Each record of the
input data set is copied to every output data set. Records can be copied without modification, or you
can drop columns or change the order of columns.

What Happens if RCP is disable ?


Question: Added: 7/26/2006

Runtime column propagation (RCP): If RCP is enabled for any job, and specifically for those stage
whose output connects to the shared container input, then meta data will be propagated at run time,
so there is no need to map it at design time.
Answer:
If RCP is disabled for the job, in such case OSH has to perform Import and export every time when
the job runs and the processing time job is also increased.

What are Routines and where/how are they written and have
Question: Added: 7/26/2006
you written any routines before?

Answer: Routines are stored in the Routines branch of the DataStage Repository, where you can
create, view or edit them using the Routine dialog box. The following program components are
classified as routines:
- Transform functions. These are functions that you can use when defining custom transforms. DataStage has a
number of built-in transform functions, which are located in the Routines ➤ Examples ➤ Functions branch of the
Repository. You can also define your own transform functions in the Routine dialog box.
- Before/After subroutines. When designing a job, you can specify a subroutine to run before or after the job, or
before or after an active stage. DataStage has a number of built-in before/after subroutines, which are located in
the Routines ➤ Built-in ➤ Before/After branch in the Repository. You can also define your own before/after
subroutines using the Routine dialog box.
- Custom UniVerse functions. These are specialized BASIC functions that have been defined outside DataStage.
Using the Routine dialog box, you can get DataStage to create a wrapper that enables you to call these functions
from within DataStage. These functions are stored under the Routines branch in the Repository. You specify the
category when you create the routine. If NLS is enabled,

How we can call the routine in datastage job?explain with


Question: Added: 7/26/2006
steps?
Answer: Routines are used for implementing business logic. They are of two types: 1) Before subroutines
and 2) After subroutines. Steps: double-click on the Transformer stage, right-click on any one of the
mapping fields, select the DS routines option, and within the edit window give the business logic and select
either of the options (Before / After subroutine).

Types of Parallel Processing?


Question: Added: 7/27/2006

Parallel Processing is broadly classified into 2 types.


Answer: a) SMP - Symmetrical Multi Processing.
b) MPP - Massive Parallel Processing.

What are orabulk and bcp stages?


Question: Added: 7/26/2006

ORABULK is used to load bulk data into single table of target oracle database.
Answer:
BCP is used to load bulk data into a single table for microsoft sql server and sysbase.

What is the OCI? and how to use the ETL Tools?


Question: Added: 7/26/2006

OCI doesn't mean the orabulk data. It actually uses the "Oracle Call Interface" of the oracle to load
Answer: the data. It is kind of the lowest level of Oracle being used for loading the data.

It is possible to run parallel jobs in server jobs?


Question: Added: 7/26/2006

No, It is not possible to run Parallel jobs in server jobs. But Server jobs can be executed in Parallel
Answer: jobs through Server Shared Containers.

It is possible to access the same job two users at a time in


Question: Added: 7/26/2006
datastage?

No, it is not possible to access the same job two users at the same time. DS will produce the
Answer: following error : "Job is accessed by other user"

Explain the differences between Oracle8i/9i?


Question: Added: 7/26/2006

Answer: Multiprocessing, databases, more dimensional modeling.

Do u know about METASTAGE?


Question: Added: 7/26/2006

MetaStage is used to handle the Metadata which will be very useful for data lineage and data
Answer: analysis later on. Meta Data defines the type of data we are handling. This Data Definitions are stored
in repository and can be accessed with the use of MetaStage.

What is merge and how it can be done plz explain with simple
Question: Added: 7/26/2006
example taking 2 tables

Merge is used to join two tables.It takes the Key columns sort them in Ascending or descending
order.Let us consider two table i.e Emp,Dept.If we want to join these two tables we are having
Answer: DeptNo as a common Key so we can give that column name as key and sort Deptno in ascending
order and can join those two tables

What is merge ?and how to use merge?


Question: Added: 7/26/2006

Merge is a stage that is available in both parallel and server jobs.

Answer: The merge stage is used to join two tables(server/parallel) or two tables/datasets(parallel). Merge
requires that the master table/dataset and the update table/dataset to be sorted. Merge is performed
on a key field, and the key field is mandatory in the master and update dataset/table.

What is difference between Merge stage and Join stage?


Question: Added: 7/26/2006

Merge and Join stage differences:

Answer:
1. The Merge stage has reject links.

2. It can take multiple update links.

3. If you use it for comparison, then the first matching data will be the output,

because it uses the update links to extend the primary details which come from the master link.

What are the enhancements made in datastage 7.5 compare


Question: Added: 7/26/2006
with 7.0

Many new stages were introduced compared to datastage version 7.0. In server jobs we have stored
procedure stage, command stage and generate report option was there in file tab. In job sequence
Answer: many stages like startloop activity, end loop activity,terminate loop activity and user variables
activities were introduced. In parallel jobs surrogate key stage, stored procedure stage were
introduced. For all other specifications,

What is NLS in datastage? how we use NLS in Datastage ?


Question: Added: 7/26/2006
what advantages in that ? at the time of installation i am not
choosen that NLS option , now i want to use that options what
can i do ? to reinstall that datastage or first uninstall and install
once again ?

Answer: Just reinstall you can see the option to include the NLS

How can we join one Oracle source and Sequential file?.


Question: Added: 7/26/2006

Answer: Join and look up used to join oracle and sequential file

Question: What is job control? How can it be used? Explain with steps. (Added: 7/26/2006)

Answer: JCL stands for Job Control Language; it is used to run a number of jobs at a time, with or without using
loops. Steps: click on Edit in the menu bar and select 'Job Properties', then enter the parameters as
parameter / prompt / type: STEP_ID / STEP_ID / string; Source / SRC / string; DSN / DSN / string;
Username / unm / string; Password / pwd / string. After editing the above, set the JCL (job control) tab, select the
jobs from the list box and run the job.

What is the difference between Datastage and Datastage TX?


Question: Added: 7/26/2006

Its a critical question to answer, but one thing i can tell u that Datastage Tx is not a ETL tool & this is
not a new version of Datastage 7.5.
Answer:
Tx is used for ODS source ,this much i know

If the size of the Hash file exceeds 2GB..What happens? Does


Question: Added: 7/26/2006
it overwrite the current rows?

Answer: It overwrites the file

Do you know about INTEGRITY/QUALITY stage?


Question: Added: 7/26/2006

Answer: The Integrity/Quality stage is a data integration tool from Ascential which is used to standardize and
integrate the data from different sources.

How much would be the size of the database in DataStage ?


Question: Added: 7/26/2006
What is the difference between Inprocess and Interprocess ?

In-process
Answer: You can improve the performance of most DataStage jobs by turning in-process row buffering on and
recompiling the job. This allows connected active stages to pass data via buffers rather than row by
row.
Note: You cannot use in-process row-buffering if your job uses COMMON blocks in transform
functions to pass data between stages. This is not recommended practice, and it is advisable to
redesign your job to use row buffering rather than COMMON blocks.

Inter-process
Use this if you are running server jobs on an SMP parallel system. This enables the job to run using a
separate process for each active stage, which will run simultaneously on a separate processor.
Note: You cannot inter-process row-buffering if your job uses COMMON blocks in transform functions
to pass data between stages. This is not recommended practice, and it is advisable to redesign your
job to use row buffering rather than COMMON blocks.

How can you do incremental load in datastage?


Question: Added: 7/26/2006

Answer: Incremental load means daily load.

Whenever you are selecting data from the source, select the records which were loaded or updated
between the timestamp of the last successful load and the start date and time of today's load.

For this you have to pass parameters for those two dates: store the last run date and time in a file, read that value
through a job parameter, and set the second argument to the current date and time.
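
A hedged sketch of that approach: keep the timestamp of the last successful load in a small file and select only the rows changed since then (the state file, table and column names are assumptions).

# Sketch: incremental (delta) extraction using a stored "last run" timestamp.
import os
from datetime import datetime

STATE_FILE = "last_run.txt"          # assumed location of the stored timestamp

def read_last_run():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as fh:
            return fh.read().strip()
    return "1900-01-01 00:00:00"     # first run: take everything

def write_last_run(timestamp):
    with open(STATE_FILE, "w") as fh:
        fh.write(timestamp)

last_run = read_last_run()
this_run = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
sql = ("SELECT * FROM source_table "
       "WHERE last_updated > '%s' AND last_updated <= '%s'" % (last_run, this_run))
print(sql)   # in a real job, last_run / this_run would be passed as job parameters
write_last_run(this_run)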

What is the meaning of the following..


Question: Added: 7/26/2006
1)If an input file has an excessive number of rows and can be
split-up then use standard

2)logic to run jobs in parallel

3)Tuning should occur on a job-by-job basis. Use the power of


DBMS.

Question: I want to process 3 files sequentially, one by one. How can I do that? While processing the files it
should fetch the files automatically. (Added: 7/26/2006)

Answer: If the metadata for all the files is the same, then create a job having the file name as a parameter, then
use the same job in a routine and call the job with different file names, or you can create a sequencer to use the
job.

Question: What happens if the output of a hash file is connected to a Transformer? What error does it throw?
(Added: 7/26/2006)

Answer: If the hash file output is connected to a Transformer stage, the hash file will be considered the lookup file
when there is a primary link to the same Transformer stage; if there is no primary link, then the hash file will be
treated as the primary link itself. You can do SCD in a server job by using the lookup functionality. This will not
return any error code.

What is iconv and oconv functions?


Question: Added: 7/26/2006

Iconv( )-----converts string to internal storage format


Answer: Oconv( )----converts an expression to an output format

How I can convert Server Jobs into Parallel Jobs?


Question: Added: 7/26/2006

I have never tried doing this, however, I have some information which will help you in saving a lot of
Answer: time. You can convert your server job into a server shared container. The server shared container
can also be used in parallel jobs as shared container.

Can we use shared container as lookup in datastage server


Question: Added: 7/26/2006
jobs?

I am using DataStage 7.5, Unix. we can use shared container more than one time in the job.There is
Answer: any limit to use it. why because in my job i used the Shared container at 6 flows. At any time only 2
flows are working. can you please share the info on this.

DataStage from Staging to MDW is only running at 1 row per


Question: Added: 7/26/2006
second! What do we do to remedy?

I am assuming that there are too many stages, which is causing problem and providing the solution.

In general. if you too many stages (especially transformers , hash look up), there would be a lot of
overhead and the performance would degrade drastically. I would suggest you to write a query
instead of doing several look ups. It seems as though embarassing to have a tool and still write a
query but that is best at times.

If there are too many look ups that are being done, ensure that you have appropriate indexes while
querying. If you do not want to write the query and use intermediate stages, ensure that you use
proper elimination of data between stages so that data volumes do not cause overhead. So, there
might be a re-ordering of stages needed for good performance.
Answer:
Other things in general that could be looked in:

1) for massive transaction set hashing size and buffer size to appropriate values to perform as much
as possible in memory and there is no I/O overhead to disk.

2) Enable row buffering and set appropriate size for row buffering

3) It is important to use appropriate objects between stages for performance

CoolInterview.com

What is the flow of loading data into fact & dimensional tables?
Question: Added: 7/26/2006

Here is the sequence of loading a datawarehouse.

1. The source data is first loaded into the staging area, where data cleansing takes place.

2. The data from staging area is then loaded into dimensions/lookups.


Answer:
3.Finally the Fact tables are loaded from the corresponding source tables from the staging area.

Question: How to handle date conversions in DataStage? Convert a mm/dd/yyyy format to yyyy-dd-mm.
(Added: 7/26/2006)

Here is the right conversion:

Answer: Function to convert mm/dd/yyyy format to yyyy-dd-mm is


Oconv(Iconv(Filedname,"D/MDY[2,2,4]"),"D-YDM[4,2,2]") .
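
For comparison, the same mm/dd/yyyy to yyyy-dd-mm conversion outside DataStage might look like this sketch:

# Sketch: convert mm/dd/yyyy to yyyy-dd-mm (same effect as the Iconv/Oconv pair above).
from datetime import datetime

def convert(value):
    parsed = datetime.strptime(value, "%m/%d/%Y")   # parse the external format
    return parsed.strftime("%Y-%d-%m")              # emit the yyyy-dd-mm output format

print(convert("07/26/2006"))   # 2006-26-07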

What is difference between serverjobs & paraller jobs


Question: Added: 7/26/2006

Server jobs. These are available if you have installed DataStage Server. They run on the DataStage Server, connecting to
other data sources as necessary.
Answer:
Parallel jobs. These are only available if you have installed Enterprise Edition. These run on DataStage servers that are SMP,
MPP, or cluster systems. They can also run on a separate z/OS (USS) machine if required.
Question: What are the most important aspects that a beginner must consider when doing his first DS project?
(Added: 7/26/2006)

Answer: He should be good at data warehousing concepts and he should be familiar with all the stages.

What is hashing algorithm and explain breafly how it works?


Question: Added: 7/26/2006

Hashing is key-to-address translation. This means the value of a key is transformed into a disk address by means of an
algorithm, usually a relative block and anchor point within the block. It's closely related to statistical probability as to how well
the algorithms work.

It sounds fancy but these algorithms are usually quite simple and use division and remainder techniques. Any good book on
Answer: database systems will have information on these techniques.

Interesting to note that these approaches are called "Monte Carlo Techniques" because the behavior of the hashing or
randomizing algorithms can be simulated by a roulette wheel where the slots represent the blocks and the balls represent the
records (on this roulette wheel there are many balls not just one).
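
A toy sketch of the division-and-remainder idea: the key is turned into a number, and the remainder after dividing by the number of groups (the modulus) decides which group the record lands in. The key-to-number step below is deliberately crude and purely illustrative.

# Toy sketch of key-to-address translation by division and remainder.
def group_for_key(key, modulus=11):
    numeric = sum(ord(ch) for ch in str(key))   # crude key-to-number step
    return numeric % modulus                    # the remainder picks the group

for key in ("CUST-001", "CUST-002", "ORDER-17"):
    print(key, "-> group", group_for_key(key))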

How the hash file is doing lookup in serverjobs?How is it comparing the key
Question: Added: 7/26/2006
values?

Answer: A hashed file is used for two purposes: 1. to remove duplicate records, and 2. as a reference for lookups.
The hashed file contains three parts: each record has a hashed key, a key header and a data portion. By using the
hashing algorithm and the key value, the lookup is faster.

What are types of Hashed File?


Question: Added: 7/26/2006

Hashed File is classified broadly into 2 types.

a) Static - Sub divided into 17 types based on Primary Key Pattern.


b) Dynamic - sub divided into 2 types
Answer: i) Generic
ii) Specific.

The default hashed file is Dynamic - Type 30.

Difference between Hashfile and Sequential File?


Question: Added: 7/26/2006

A hash file stores the data based on a hash algorithm and on a key value. A sequential file is just a file with no key
Answer: column. A hash file can be used as a reference for a lookup; a sequential file cannot.

Where actually the flat files store?what is the path?


Question: Added: 7/26/2006

Answer: Flat files stores the data and the path can be given in general tab of the sequential file stage

What is data set? and what is file set?


Question: Added: 7/26/2006

File set: it allows you to read data from or write data to a file set. The stage can have a single input
link, a single output link, and a single rejects link. It executes only in parallel mode. The data files and
the file that lists them are called a file set. This capability is useful because some operating systems
Answer: impose a 2 GB limit on the size of a file, and you need to distribute files among nodes to prevent
overruns.

Data sets are used to import the data in parallel jobs, like ODBC in server jobs.

what is meaning of file extender in data stage server jobs.


Question: Added: 7/26/2006
can we run the data stage job from one job to another job that
file data where it is stored and what is the file extender in ds
jobs.

File extender means the adding the columns or records to the already existing the file, in the data
Answer: stage,
we can run the data stage job from one job to another job in data stage.

How do you merge two files in DS?


Question: Added: 7/26/2006

Either used Copy command as a Before-job subroutine if the metadata of the 2 files are same or
Answer: created a job to concatenate the 2 files into one if the metadata is different.

What is the default cache size? How do you change the cache
Question: Added: 7/27/2006
size if needed?

Default read cache size is 128MB. We can increase it by going into Datastage Administrator and
Answer: selecting the Tunable Tab and specify the cache size over there.

What about System variables?


Question: Added: 7/26/2006

DataStage provides a set of variables containing useful system information that you can access from
a transform or routine. System variables are read-only.

@DATE The internal date when the program started. See the Date function.

@DAY The day of the month extracted from the value in @DATE.

@FALSE The compiler replaces the value with 0.

@FM A field mark, Char(254).

@IM An item mark, Char(255).

@INROWNUM Input row counter. For use in constrains and derivations in Transformer stages.

@OUTROWNUM Output row counter (per link). For use in derivations in Transformer stages.

@LOGNAME The user login name.

@MONTH The current month extracted from the value in @DATE.

@NULL The null value.

@NULL.STR The internal representation of the null value, Char(128).

Answer: @PATH The pathname of the current DataStage project.

@SCHEMA The schema name of the current DataStage project.

@SM A subvalue mark (a delimiter used in UniVerse files), Char(252).

@SYSTEM.RETURN.CODE
Status codes returned by system processes or commands.

@TIME The internal time when the program started. See the Time function.

@TM A text mark (a delimiter used in UniVerse files), Char(251).

@TRUE The compiler replaces the value with 1.

@USERNO The user number.

@VM A value mark (a delimiter used in UniVerse files), Char(253).

@WHO The name of the current DataStage project directory.

@YEAR The current year extracted from @DATE.

REJECTED Can be used in the constraint expression of a Transformer stage of an output link.
REJECTED is initially TRUE, but is set to FALSE whenever an output link is successfully written.

Question: Where does a Unix script of DataStage execute - on the client machine or on the server? Suppose it
executes on the server, then it will ... (Added: 7/26/2006)

Answer: DataStage jobs are executed on the server machines only. There is nothing that is stored on the client
machine.

What is DS Director used for - did u use it?


Question: Added: 7/26/2006

Answer: Datastage Director is GUI to monitor, run, validate & schedule datastage server jobs.

What's the difference between Datastage Developers and


Question: Added: 7/26/2006
Datastage Designers. What are the skill's required for this.

Answer: A DataStage developer is one who will code the jobs. A DataStage designer is one who will design the
job; I mean he will deal with the blueprints and he will design the jobs, i.e. the stages that are required in
developing the code.

If data is partitioned in your job on key 1 and then you


Question: Added: 7/26/2006
aggregate on key 2, what issues could arise?

Answer: Data will partitioned on both the keys ! hardly it will take more for execution .

Dimension Modelling types along with their significance


Question: Added: 7/26/2006

Data Modeling
1) E-R Diagrams

Answer: 2) Dimensional modeling


2.a) logical modeling
2.b)Physical modeling
CoolInterview.com

What is job control?how it is developed?explain with steps?


Question: Added: 7/26/2006

Controlling Datstage jobs through some other Datastage jobs. Ex: Consider two Jobs XXX and YYY.
The Job YYY can be executed from Job XXX by using Datastage macros in Routines.

To Execute one job from other job, following steps needs to be followed in Routines.

1. Attach job using DSAttachjob function.


Answer:
2. Run the other job using DSRunjob function

3. Stop the job using DSStopJob function

CoolInterview.com

Containers : Usage and Types?


Question: Added: 7/27/2006

Container is a collection of stages used for the purpose of Reusability. There are 2 types of
Containers.
a) Local Container: Job Specific
b) Shared Container: Used in any job within a project. ·
There are two types of shared container:·
Answer: 1. Server shared container: used in server jobs (can also be used in parallel jobs).
2. Parallel shared container: used in parallel jobs. You can also include server shared containers in
parallel jobs as a way of incorporating server job functionality into a parallel stage (for example, you
could use one to make a server plug-in stage available to a parallel job).
CoolInterview.com

* What are constraints and derivation?


Question: Added: 7/26/2006
* Explain the process of taking backup in DataStage?
*What are the different types of lookups available in DataStage?

Constraints are used to check for a condition and filter the data. Example: Cust_Id <> 0 is set as a
Answer: constraint, and it means that only those records meeting this condition will be processed further.
Derivation is a method of deriving the fields, for example if you need to get some SUM, AVG etc.

What does a Config File in parallel extender consist of?


Question: Added: 7/27/2006

Config file consists of the following.


Answer: a) Number of Processes or Nodes.
b) Actual Disk Storage Location.
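
As an illustration only (the host name and paths are assumptions, and the exact syntax should be checked against your own installation's default.apt), a two-node configuration file might look roughly like this:

{
  node "node1"
  {
    fastname "etl_server"
    pools ""
    resource disk "/data/datasets" {pools ""}
    resource scratchdisk "/data/scratch" {pools ""}
  }
  node "node2"
  {
    fastname "etl_server"
    pools ""
    resource disk "/data/datasets" {pools ""}
    resource scratchdisk "/data/scratch" {pools ""}
  }
}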

How can you implement Complex Jobs in datastage


Question: Added: 7/26/2006

Complex design means having more joins and more look ups. Then that job design will be called as
complex job.We can easily implement any complex design in DataStage by following simple tips in
terms of increasing performance also. There is no limitation of using stages in a job. For better
Answer: performance, Use at the Max of 20 stages in each job. If it is exceeding 20 stages then go for another
job.Use not more than 7 look ups for a transformer otherwise go for including one more transformer.

What are validations you perform after creating jobs in designer.


Question: Added: 7/26/2006
What r the different type of errors u faced during loading and
how u solve them

Check for parameters,

Answer: and check whether the input files exist or not, check whether the input tables exist or not, and also check
the usernames, data source names, passwords and the like.

How do you fix the error "OCI has fetched truncated data" in
Question: Added: 7/26/2006
DataStage

Answer: We can use the Change Capture stage to get the truncated data. (Members, please confirm.)

What user varibale activity when it used how it used !where it is


Question: Added: 7/26/2006
used with real example

Answer: By using the User Variables activity we can create some variables in the job sequence; these variables
are available to all the activities in that sequence. Most probably this activity is placed at the start of the job
sequence.

Question: How do we use NLS in DataStage? What are the advantages of NLS? Where can we use it? Explain briefly. (Added: 7/26/2006)

Answer: By using NLS we can do the following:
- Process data in a wide range of languages
- Use local formats for dates, times and money
- Sort data according to local rules
If NLS is installed, various extra features appear in the product.
For server jobs, NLS is implemented in the DataStage Server engine.
For parallel jobs, NLS is implemented using the ICU library.

Question: If a DataStage job aborts after, say, 1000 records, how do you continue the job from the 1000th record after fixing the error? (Added: 7/26/2006)

Answer: By specifying checkpointing in the job sequence properties. If we then restart the sequence, the job will start by skipping records up to the failed record. This option is available from the 7.5 edition.

Question: Differentiate database data and data warehouse data? (Added: 7/26/2006)

Answer: By database, one means OLTP (On-Line Transaction Processing). This can be the source systems or the ODS (Operational Data Store), which contains the transactional data.
Question: What are environment variables? What are they used for? (Added: 8/18/2006)

Answer: An environment variable is a predefined variable that we can use while creating a DataStage job. We can set it either at project level or at job level. Once we set a specific variable, that variable is available in the project/job. We can also define new environment variables; for that, go to DataStage Administrator.

Question: What are the third-party tools used with DataStage? (Added: 7/26/2006)

Answer: AutoSys, TNG, Event Coordinator.

Question: What is APT_CONFIG in DataStage? (Added: 7/26/2006)

Answer: APT_CONFIG is just an environment variable used to identify the *.apt configuration file. Don't confuse the variable with the *.apt file itself, which holds the node information and the configuration of the SMP/MPP server.

Question: If you are running 4-way parallel and you have 10 stages on the canvas, how many processes does DataStage create? (Added: 7/26/2006)

Answer: The answer is 40. You have 10 stages and each stage can be partitioned and run on 4 nodes, so the total number of processes generated is 40.

Question: Did you parameterize the job or hard-code the values in the jobs? (Added: 7/26/2006)

Answer: Always parameterize the job. The values come either from Job Properties or from a 'Parameter Manager' (a third-party tool). There is no reason to hard-code parameters in your jobs. The most often parameterized variables in a job are: DB DSN name, username, password, and the dates against which the data is to be looked up.

Question: Default nodes for DataStage parallel edition? (Added: 7/26/2006)

Answer: The number of nodes depends on the number of processors in your system. If your system has two processors, you get two nodes by default.

Question: What will you do in a situation where somebody wants to send you a file, use that file as an input or reference, and then run the job? (Added: 7/26/2006)

Answer:
A. Under Windows: use the Wait For File activity in the sequencer and then run the job. You could schedule the sequencer around the time the file is expected to arrive.
B. Under UNIX: poll for the file. Once the file has arrived, start the job or sequencer depending on the file.

Question: What are the command-line functions that import and export the DS jobs? (Added: 7/26/2006)

Answer:
A. dsimport.exe imports the DataStage components.
B. dsexport.exe exports the DataStage components.

Question: Dimensional modelling is again subdivided into 2 types. (Added: 7/26/2006)

Answer:
a) Star schema - simple and much faster; denormalized form.
b) Snowflake schema - more complex with more granularity; more normalized form.

Question: What are Sequencers? (Added: 7/26/2006)

Answer: A sequencer allows you to synchronize the control flow of multiple activities in a job sequence. It can have multiple input triggers as well as multiple output triggers. The sequencer operates in two modes:
ALL mode - all of the inputs to the sequencer must be TRUE for any of the sequencer outputs to fire.
ANY mode - output triggers can be fired if any of the sequencer inputs are TRUE.

Question: What are the Repository tables in DataStage and what are they? (Added: 7/26/2006)

Answer: A data warehouse is a repository (centralized as well as distributed) of data, able to answer any ad-hoc, analytical, historical or complex queries. Metadata is data about data. Examples of metadata include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions, and process/method descriptions. The repository environment encompasses all corporate metadata resources: database catalogs, data dictionaries, and navigation services. Metadata includes things like the name, length, valid values, and description of a data element. Metadata is stored in a data dictionary and repository. It insulates the data warehouse from changes in the schema of operational systems.
In DataStage, under the Interface tab of I/O and Transfer you will find the Input, Output and Transfer pages; the last tab is Build, under which you can find the TABLE NAME.
The DataStage client components are:
Administrator - administers DataStage projects and conducts housekeeping on the server.
Designer - creates DataStage jobs that are compiled into executable programs.
Director - used to run and monitor the DataStage jobs.
Manager - allows you to view and edit the contents of the repository.

Question: What is the difference between an operational data store (ODS) and a data warehouse? (Added: 7/26/2006)

Answer: A data warehouse is a decision-support database for organizational needs. It is a subject-oriented, non-volatile, integrated, time-variant collection of data.
An ODS (Operational Data Store) is an integrated collection of related information; it typically contains at most about 90 days of information.

Question: How do you do usage analysis in DataStage? (Added: 7/26/2006)

Answer:
1. If you want to know whether a job is part of a sequence, then in the Manager right-click the job and select Usage Analysis. It will show all the job's dependents.
2. To find how many jobs are using a particular table.
3. To find how many jobs are using a particular routine.
Like this, you can find all the dependents of a particular object. It is nested: you can move forward and backward and see all the dependents.

Question: How do you pass a filename as a parameter for a job? (Added: 7/26/2006)

Answer:
1. Go to DataStage Administrator -> Projects -> Properties -> Environment -> User Defined. Here you can see a grid, where you can enter your parameter name and the corresponding path of the file.
2. Go to the Stage tab of the job, select the NLS tab, click on "Use Job Parameter" and select the parameter name which you entered above. The selected parameter name appears in the text box beside the "Use Job Parameter" button. Copy the parameter name from the text box and use it in your job. Keep the project default in the text box.

Question: How do you remove duplicates in a server job? (Added: 7/26/2006)

Answer:
1) Use a Hashed File stage, or
2) if you use the sort command in UNIX (in a before-job subroutine), you can reject duplicate records using the -u parameter, or
3) use a Sort stage.

Question: What are the difficulties faced in using DataStage? Or, what are the constraints in using DataStage? (Added: 7/26/2006)

Answer:
1) When the number of lookups is large.
2) Handling the case where the job aborts for some reason while loading the data.

Question: Does Enterprise Edition only add parallel processing for better performance? Are any stages/transformations available in the Enterprise Edition only? (Added: 7/26/2006)

Answer:
• DataStage Standard Edition was previously called DataStage and DataStage Server Edition.
• DataStage Enterprise Edition was originally called Orchestrate, then renamed to Parallel Extender when purchased by Ascential.
• DataStage Enterprise: server jobs, sequence jobs, parallel jobs. The Enterprise Edition offers parallel processing features for scalable, high-volume solutions. Designed originally for UNIX, it now supports Windows, Linux and UNIX System Services on mainframes.
• DataStage Enterprise MVS: server jobs, sequence jobs, parallel jobs, MVS jobs. MVS jobs are designed using an alternative set of stages that are generated into COBOL/JCL code; the jobs are developed on a UNIX or Windows server and transferred to the mainframe to be compiled and run.
The first two versions share the same Designer interface but have a different set of design stages depending on the type of job you are working on. Parallel jobs have parallel stages but also accept some server stages via a container. Server jobs only accept server stages; MVS jobs only accept MVS stages. There are some stages that are common to all types (such as aggregation) but they tend to have different fields and options within that stage.

Question: What utility do you use to schedule the jobs on a UNIX server other than using Ascential Director? (Added: 7/26/2006)

Answer: "AUTOSYS": through AutoSys you can automate the jobs by invoking a shell script written to schedule the DataStage jobs.

Question: Is it possible to call one job from another job in server jobs? (Added: 7/26/2006)

Answer: Yes, we can call one job from another. In fact "calling" is not quite the right word, because you attach/add the other job through the job properties; you can attach zero or more jobs.
The steps are: Edit --> Job Properties --> Job Control, then click Add Job and select the desired job.

Question: How do you clean the DataStage repository? (Added: 8/18/2006)

Answer: Remove log files periodically.

Universe Commands from DS Administrator

Here is the process:

1. Telnet onto the DataStage server.
   E.g. > telnet pdccal05
2. Log on with the DataStage username and password as if you were logging onto DS Director or Administrator of the server.
3. When prompted for the account name, choose the project name or hit enter.

Welcome to the DataStage Telnet Server.


Enter user name: dstage
Enter password: *******
Account name or path(live): live

DataStage Command Language 7.5


Copyright (c) 1997 - 2004 Ascential Software Corporation. All Rights
Reserved
live logged on: Monday, June 12, 2006 12:48

Type >"DS.TOOLS"
You then get

DataStage Tools Menu

1. Report on project licenses


2. Rebuild Repository indices
3. Set up server-side tracing >>
4. Administer processes/locks >>
5. Adjust job tunable properties

Which would you like? ( 1 - 5 ) ? "select # 4"


You then get
DataStage Process and Lock Administration

1. List all processes


2. List state of a process
3. List state of all processes in a job

Stopping and Restarting the Server Engine


From time to time you may need to stop or restart the DataStage
server engine manually, for example, when you wish to shut down the
physical server.
A script called uv is provided for these purposes.
To stop the server engine, use:
# dshome/bin/uv -admin -stop
This shuts down the server engine and frees any resources held by
server engine processes.
To restart the server engine, use:
# dshome/bin/uv -admin -start
This ensures that all the server engine processes are started correctly.
You should leave some time between stopping and restarting. A
minimum of 30 seconds is recommended.

1. Dimension Modelling types along with their significance

Data Modelling is broadly classified into 2 types: a) E-R Diagrams (Entity-Relationship). b) Dimensional Modelling.

2. What is the flow of loading data into fact & dimensional tables?

Fact table - table with a collection of foreign keys corresponding to the primary keys in the dimension tables; consists of fields with numeric values. Dimension table - table with a unique primary key. Loa

3. Orchestrate Vs Datastage Parallel Extender?

Orchestrate itself is an ETL tool with extensive parallel processing capabilities, running on the UNIX platform. Datastage used Orchestrate with Datastage XE (Beta version of 6.0) to incorporate the p
4. Differentiate Primary Key and Partition Key?

Primary Key is a combination of unique and not null. It can be a collection of key values called a composite primary key. Partition Key is just a part of the Primary Key. There are several methods of

5. How do you execute a datastage job from the command line prompt?

Using the "dsjob" command as follows: dsjob -run -jobstatus projectname jobname

6. What are Stage Variables, Derivations and Constants?

Stage Variable - an intermediate processing variable that retains its value during a read and does not pass the value into the target column. Derivation - an expression that specifies the value to be passed on to the target column.

7. What is the default cache size? How do you change the cache size if needed?

The default cache size is 256 MB. We can increase it by going into DataStage Administrator, selecting the Tunables tab and specifying the cache size there.

8. Containers: Usage and Types?

A container is a collection of stages used for the purpose of reusability. There are 2 types of containers: a) Local container: job specific. b) Shared container: used in any job within a project.

9. Compare and Contrast ODBC and Plug-In stages?

ODBC: a) Poor performance. b) Can be used for a variety of databases. c) Can handle stored procedures. Plug-In: a) Good performance. b) Database specific (only one database). c) Cannot handle Stored Pr

10. How to run a Shell Script within the scope of a DataStage job?

By using the "ExecSH" command in the before/after job properties.

11. Types of Parallel Processing?

Parallel Processing is broadly classified into 2 types: a) SMP - Symmetric Multi Processing. b) MPP - Massively Parallel Processing.
12. What does a Config File in parallel extender consist of?

The config file consists of the following: a) Number of processes or nodes. b) Actual disk storage location.

13. Functionality of Link Partitioner and Link Collector?

Link Partitioner: it splits data into various partitions or data flows using various partition methods. Link Collector: it collects the data coming from the partitions and merges it into a single partition.

14. What is Modulus and Splitting in a Dynamic Hashed File?

In a dynamic hashed file, the size of the file keeps changing as records are added and removed. The modulus is the number of groups in the file; when the file grows, groups are split ("splitting") and the modulus increases, and when it shrinks, groups are merged and the modulus decreases.

15. Types of views in Datastage Director?

There are 3 types of views in Datastage Director: a) Job view - dates of jobs compiled. b) Log view - status of the job's last run. c) Status view - warning messages, event messages, program-generated messag

16. Differentiate Database data and Data warehouse data?

Data in a database is a) detailed or transactional, b) both readable and writable, c) current.

37. What are Static Hash files and Dynamic Hash files?

As the names themselves suggest what they mean. In general we use Type-30 dynamic hash files. The data file has a default size of 2 GB and the overflow file is used if the data exceeds the 2 GB size.

62. Does the selection of 'Clear the table and Insert rows' in the ODBC stage send a TRUNCATE statement to the DB or does it do some kind of delete logic?

There is no TRUNCATE on ODBC stages. 'Clear the table' issues a DELETE FROM statement. On an OCI stage such as Oracle, you do have both Clear and Truncate options, and they are radically di

63. How do you rename all of the jobs to support your new File-naming conventions?

Create an Excel spreadsheet with new and old names. Export the whole project as a dsx. Write a Perl program which can do a simple rename of the strings, looking up the Excel file. Then import the new d

101. Default nodes for datastage parallel Edition

Actually the number of nodes depends on the number of processors in your system.

How to implement routines in DataStage? Have any o...

There are 3 kinds of routines in DataStage:

1. Server routines, which are used in server jobs. These routines are written in the BASIC language.

2. Parallel routines, which are used in parallel jobs. These routines are written in C/C++.

3. Mainframe routines, which are used in mainframe jobs.

SQL Interview Questions

Question : How to find the second maximum value from a table?

Answer :
select max(field1) from tname1 where field1 < (select max(field1) from tname1);

Field1 - salary field
Tname1 - table name.

Question : How to retrieve the data from the 11th row to the nth row in a table?

Answer :
select * from emp where rowid in
  (select rowid from emp where rownum <= &upto
   minus
   select rowid from emp where rownum < &startfrom);

From this you can select any range of rows.

Question : I have a table with duplicate names in it. Write me a query which returns only the duplicate rows with the number of times they are repeated.

Answer :
SELECT COL1, COUNT(*)
FROM TAB1
GROUP BY COL1
HAVING COUNT(COL1) > 1;

Question : Difference between a stored procedure and a trigger?

Answer :
Information related to stored procedures can be seen in the USER_SOURCE and USER_OBJECTS (current user) views.
Information related to triggers is stored in the USER_SOURCE and USER_TRIGGERS (current user) views.
A stored procedure cannot be made inactive, but a trigger can be disabled.

Question : When using a count(distinct), is it better to use a self-join or a temp table to find redundant data? Provide an example.

Answer :
Instead of this we can use a GROUP BY clause with a HAVING condition. For example:

Select count(*), lastname from tblUsers group by lastname having count(*) > 1

This query returns the duplicated lastname values in the lastname column of the tblUsers table.

Question : What is the advantage of using triggers in your PL/SQL?

Answer :
Triggers are fired implicitly on the tables/views on which they are created. There are various advantages of using a trigger. Some of them are:

- Suppose we need to validate a DML statement (insert/update/delete) that modifies a table; then we can write a trigger on the table that gets fired implicitly whenever a DML statement is executed on that table.

- Another reason for using triggers is the automatic update of one or more tables whenever a DML/DDL statement is executed on the table on which the trigger is created.

- Triggers can be used to enforce constraints. For example: insert/update/delete statements should not be allowed on a particular table after office hours. For enforcing this constraint, triggers should be used.

- Triggers can be used to publish information about database events to subscribers. A database event can be a system event like database startup or shutdown, or it can be a user event like a user logging in or logging off.

Question : What is the difference between UNION and UNION ALL?

Answer :
UNION will remove the duplicate rows from the result set while UNION ALL does not.
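A minimal illustration of the difference, using a hypothetical table t with a single column x:

select x from t where x < 20
union
select x from t where x < 10;      -- duplicates are removed from the combined result

select x from t where x < 20
union all
select x from t where x < 10;      -- duplicates are kept; no de-duplication step is performed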

Question : How to display duplicate rows in a table?

Answer :
select empid, count(*)
from emp
group by empid
having count(empid) > 1;

Question : What is the difference between the TRUNCATE and DELETE commands?

Answer :
Both result in deleting all the rows in the table. TRUNCATE cannot be rolled back, as it is a DDL command, and all memory space for that table is released back to the server; TRUNCATE is much faster. DELETE is a DML command and can be rolled back.
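A short hypothetical illustration (the table name emp and column deptno are assumptions):

delete from emp where deptno = 10;   -- removes only the matching rows; can be rolled back
rollback;                            -- the deleted rows come back

truncate table emp;                  -- removes all rows and releases the space; cannot be rolled back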

Question : Which system table contains information on the constraints on all the tables created?

Answer :
USER_CONSTRAINTS. This data dictionary view contains information on the constraints on all the tables created by the current user.
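For example, a query along these lines lists the constraints on your own tables (the columns shown are standard USER_CONSTRAINTS columns):

select constraint_name, constraint_type, table_name
from user_constraints
order by table_name;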

Question : What is the difference between a single-row subquery and a scalar subquery?

Answer :
A single-row subquery returns a value which is used by the WHERE clause, whereas a scalar subquery is a SELECT statement used in the column list and can be thought of as an inline function in the SELECT column list.
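A rough sketch of the two forms on the usual emp table (the column names are assumptions):

-- single-row subquery feeding the WHERE clause
select empname from emp
where basicsal = (select max(basicsal) from emp);

-- scalar subquery in the column list
select empname,
       (select count(*) from emp) as total_employees
from emp;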

Question : How to copy a SQL table?

Answer :
COPY FROM database TO database action -
destination_table (column_name, column_name...) USING query

e.g.

copy from scott/tiger@ORCL92 -
to scott/tiger@ORCL92 -
create new_emp -
using select * from emp;

Question : What is a tablespace?

Answer :
A tablespace is a logical storage unit that maps to one or more physical data files; the data files contain the blocks/pages where the records of the database tables are stored, so a tablespace contains tables.

Question : What is the difference between a correlated subquery and a nested subquery?

Answer :
A correlated subquery runs once for each row selected by the outer query. It contains a reference to a value from the row selected by the outer query.
A nested subquery runs only once for the entire nesting (outer) query. It does not contain any reference to the outer query row.

For example,

Correlated subquery:

select e1.empname, e1.basicsal, e1.deptno from emp e1 where e1.basicsal = (select max(basicsal) from emp e2 where e2.deptno = e1.deptno)

Nested subquery:

select empname, basicsal, deptno from emp where (deptno, basicsal) in (select deptno, max(basicsal) from emp group by deptno)

Question : There is an eno and a gender column in a table. Eno is the primary key and gender has a check constraint for the values 'M' and 'F'. While inserting the data into the table, M was misspelled as F and F as M. What is the update statement to replace F with M and M with F?

Answer :
update <TableName>
set gender = case when gender = 'F' then 'M'
                  when gender = 'M' then 'F'
             end;

Question : When we give SELECT * FROM EMP; how does Oracle respond?

Answer :
When you give SELECT * FROM EMP; the server reads all the data in the EMP table and displays every row and column of the EMP table.

Question : What is a reference cursor?

Answer :
A reference cursor is a dynamic cursor variable that can be opened for different SELECT statements at run time, for example: select * from emp.

Question : What operator performs pattern matching?

Answer :
The pattern-matching operator is LIKE and it is used with two wildcard characters:
1. % and
2. _ (underscore)

% matches zero or more characters and underscore matches exactly one character.

Question : There are 2 tables, Employee and Department. There are a few records in the Employee table for which the department is not assigned. The output of the query should contain all the employee names and their corresponding departments if the department is assigned, otherwise the employee name and a null value in place of the department name. What is the query?

Answer :
What you want to use here is called a left outer join, with the Employee table on the left side. A left outer join, as the name says, picks up all the records from the left table and, based on the join column, picks the matching records from the right table; in case there are no matching records in the right table, it shows null for the selected columns of the right table. In this query the keyword LEFT OUTER JOIN is used, though the syntax varies across databases. DB2/UDB uses the keyword LEFT OUTER JOIN; in the case of SQL Server/Sybase the old-style connector is

Employee_table.Dept_id *= Dept_table.Dept_id

and in Oracle's old-style syntax it is

Employee_table.Dept_id = Dept_table.Dept_id (+)
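A hedged ANSI-SQL sketch of the same query, assuming tables Employee(emp_name, dept_id) and Department(dept_id, dept_name):

select e.emp_name, d.dept_name
from Employee e
left outer join Department d
  on e.dept_id = d.dept_id;
-- employees with no department come back with dept_name as NULL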

Question : What is normalization? Describe the types with examples.

Answer :
There are 5 normal forms. It is necessary for any database to be in the third normal form to maintain referential integrity and non-redundancy.
First Normal Form: every field of a table (row, column) must contain an atomic value.
Second Normal Form: every non-key column must depend on the whole primary key, with no partial dependency on part of a composite key.
Third Normal Form: every non-key column must depend only on the key, with no transitive dependencies on other non-key columns.
Fourth Normal Form: a table must not contain two or more independent multi-valued facts. This normal form is often avoided for maintenance reasons.
Fifth Normal Form: is about symmetric (join) dependencies.
Each normal form assumes that the table is already in the earlier normal form.

Question : Difference between an equijoin and a union?

Answer :
Equijoin and union are indeed very different. An equijoin is used to establish an equality condition between two tables in order to select data from them, e.g.

select a.employeeid, a.employeename, b.dept_name from
employeemaster a, DepartmentMaster b
where a.employeeid = b.employeeid;

This is an example of an equijoin, whereas a UNION allows you to select similar data based on different conditions, e.g.

select a.employeeid, a.employeename from
employeemaster a where a.employeeid > 100

Union

select a.employeeid, a.employeename from
employeemaster a where a.employeename like 'B%'

The above is an example of a UNION, where we select the employee name and id for two different conditions into the same record set, which is used thereafter.

Question : What is a database?

Answer :
A database is a collection of data that is organized so that its contents can easily be accessed, managed and updated.

Question : Difference between VARCHAR and VARCHAR2?

Answer :
VARCHAR historically stores character data with a maximum size of 2000 bytes, while VARCHAR2 stores variable-length character data with a maximum size of 4000 bytes. In current Oracle versions VARCHAR is treated as a synonym for VARCHAR2, and Oracle recommends using VARCHAR2.

Question : How to write a SQL statement to find the first occurrence of a non-zero value?

Answer :
There is a slight chance the column "a" has a value of 0, which is not null; in that case you would lose that information. There is another way of searching for the first non-null value of a column:
select column_name from table_name where column_name is not null and rownum < 2;

Question : How can I hide a particular table name of our schema?

Answer :
You can hide the table name by creating a synonym.
e.g. you can create a synonym y for table x:

create synonym y for x;

Question : What is the main difference between the IN and EXISTS clauses in subqueries?

Answer :
The main difference between the IN and EXISTS predicates in a subquery is the way in which the query gets executed.
IN -- the inner query is executed first and the list of values obtained as its result is used by the outer query. The inner query is executed only once.
EXISTS -- the first row from the outer query is selected, then the inner query is executed and the outer query uses this result for checking. This process of inner query execution repeats as many times as there are rows from the outer query; that is, if ten rows can result from the outer query, the inner query is executed that many times.
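A small sketch of the two predicates, assuming tables emp(deptno, ...) and dept(deptno, ...):

-- IN: the subquery's value list is built once and then used by the outer query
select * from emp
where deptno in (select deptno from dept);

-- EXISTS: the correlated subquery is checked for each row of the outer query
select * from emp e
where exists (select 1 from dept d where d.deptno = e.deptno);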

Question : In subqueries, which is more efficient, the IN clause or the EXISTS clause? Do they produce the same result?

Answer :
EXISTS is often more efficient because:

1. EXISTS can stop as soon as a match is found, so it is frequently faster than the IN clause.

2. IN returns a list of values to the main query, whereas EXISTS returns only a Boolean (TRUE or FALSE).

Question : What is the difference between DBMS and RDBMS?

Answer :
1. RDBMS = DBMS + referential integrity.
2. An RDBMS is one that follows Codd's 12 rules.

Question : What is a Materialized View?

Answer :
A materialized view is a database object that contains the results of a query. Materialized views are local copies of data located remotely, or are used to create summary tables based on aggregations of a table's data. Materialized views that store data based on remote tables are also known as snapshots.
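A minimal Oracle-style sketch; the table name, columns and refresh options below are illustrative assumptions:

create materialized view mv_sales_summary
build immediate
refresh complete on demand
as
select product_id, sum(amount) as total_amount
from sales
group by product_id;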

Question : How to find the 10th highest salary in a SQL query?

Answer :
Table - Tbl_Test_Salary
Column - int_salary

select min(int_salary)
from Tbl_Test_Salary
where int_salary in
(select top 10 int_salary from Tbl_Test_Salary order by int_salary desc)

Question : What are the advantages and disadvantages of primary keys and foreign keys in SQL?

Answer :
Primary key
Advantage: 1) It is a unique key on which all the other candidate keys are functionally dependent.
Disadvantage: 1) There can be more than one key on which all the other attributes are dependent.

Foreign key
Advantage: 1) It allows referencing another table using that table's primary key.

Question : What operator tests a column for the absence of data?

Answer :
The IS NULL operator.

Question : What is the parameter substitution symbol used with the INSERT INTO command?

Answer :
&

Question : Which command displays the SQL command in the SQL buffer, and then executes it?

Answer :
RUN.

Question : What are the wildcards used for pattern matching?

Answer :
_ for single-character substitution and % for multi-character substitution.

Question : What are the privileges that can be granted on a table by a user to others?

Answer :
Insert, update, delete, select, references, index, execute, alter, all.

Question : What command is used to take back the privileges offered by the GRANT command?

Answer :
REVOKE.

Question : Which system tables contain information on privileges granted and privileges obtained?

Answer :
USER_TAB_PRIVS_MADE, USER_TAB_PRIVS_RECD.

Question : What command is used to create a table by copying the structure of another table?

Answer :
The CREATE TABLE .. AS SELECT command.

Explanation:
To copy only the structure, the WHERE clause of the SELECT command should contain a FALSE condition, as in the following:
CREATE TABLE NEWTABLE AS SELECT * FROM EXISTINGTABLE WHERE 1=2;
If the WHERE condition is true, then all the rows, or the rows satisfying the condition, will be copied to the new table.

Question : Which date function is used to find the difference between two dates?

Answer :
MONTHS_BETWEEN.

Question : Why does the following command give a compilation error?

DROP TABLE &TABLE_NAME;

Answer :
Variable names should start with an alphabetic character. Here the table name starts with an '&' symbol.

Question : What is the advantage of specifying WITH GRANT OPTION in the GRANT command?

Answer :
The privilege receiver can further grant the privileges he/she has obtained from the owner to any other user.
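For example (the user name scott and table emp are assumptions):

grant select on emp to scott with grant option;
-- scott can now pass the SELECT privilege on emp to another user, e.g.:
-- grant select on emp to adam;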

Question : What is the use of the DROP option in the ALTER TABLE command?

Answer :
It is used to drop constraints specified on the table.

Question : What is the use of DESC in SQL?

Answer :
DESC has two purposes. It is used to describe a schema object as well as to retrieve rows from a table in descending order.
Explanation:
The query SELECT * FROM EMP ORDER BY ENAME DESC will display the output sorted on ENAME in descending order.

Question : What is the use of CASCADE CONSTRAINTS?

Answer :
When this clause is used with the DROP command, a parent table can be dropped even when a child table exists.
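A short illustrative example (the table names dept and emp are assumptions):

drop table dept cascade constraints;
-- the foreign key constraints in child tables such as emp that reference dept
-- are dropped along with dept, so the parent can be removed even though children exist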

Question : Which function is used to find the largest integer less than or equal to a specific value?

Answer :
FLOOR.

Question : How can we count duplicate entries in a particular table against the primary key? What are constraints?

Answer :
The syntax in the previous answer (where count(*) > 1) is very questionable. Suppose you think that you have duplicate employee numbers: there is no need to count all of them just to find out which values are duplicated. The following SQL shows only the empnos that are duplicated and how many of each exist in the table:

Select empno, count(*)
from employee
group by empno
having count(*) > 1

Generally speaking, aggregate functions (count, sum, avg etc.) go in the HAVING clause. Some systems allow them in the WHERE clause, but you must be very careful in interpreting the result. WHERE COUNT(*) > 1 will absolutely NOT work in DB2 or Oracle; Sybase and SQL Server are a different animal.

Question : How to display the nth highest record in a table? For example, how to display the 4th highest (salary) record from the customer table?

Answer :
Query: SELECT sal FROM `emp` ORDER BY sal DESC LIMIT (n-1), 1
For the question "how to display the 4th highest (salary) record", the query will be: SELECT sal FROM `emp` ORDER BY sal DESC LIMIT 3, 1

How to get the User Tables in Oracle?

SELECT table_name, comments
FROM dictionary
WHERE table_name LIKE 'USER%'
ORDER BY table_name;

View: Description
ALL_CATALOG: every table, view and synonym.
ALL_OBJECT_TABLES: all object-oriented tables.
ALL_TAB_COMMENTS: comments for every table, usually a short description of the table.
ALL_TAB_GRANTS: all owner privileges, including PUBLIC ('everyone').
ALL_TAB_GRANTS_RECD: privileges granted to you by others.
ALL_TYPES: all object types.
ALL_TYPES_METHODS: all object types that are methods.
ALL_USERS: names and create dates for all users.
DICTIONARY ('DICT'): this item is very important. If you know nothing, you can start with SELECT * FROM DICTIONARY; it is a top-level record of everything in the data dictionary.
DBA_FREE_SPACE: remaining free space in each tablespace. A DBA typically spends a lot of time increasing free space as databases have to hold more data than originally planned.
USER_CATALOG: all tables, views and synonyms that the currently logged-in user can see.
USER_INDEXES: details of user-specific indexes.
USER_TABLES: details of user-specific tables, with some statistics such as record counts.
USER_TAB_COLUMNS: details of columns within each user table, which is very useful when you want to find the structure of your tables.
USER_TAB_GRANTS: details of user-specific privileges for table access.
USER_VIEWS: details of user-specific views, and the SQL for each view (which is essential).

Q.How to rename the file in the after job routine with today’s date?

A.In the after job routine select ExecSh and write the following command.

mv <oldfilename.csv> <newfilename_`date '+%d_%m_%Y_%H%M%S'`.csv>

Explain the difference between server transformer and parallel transformer ??

A. The main difference is that the server Transformer supports BASIC transforms only, but the parallel Transformer supports both BASIC and parallel transforms. The server Transformer is BASIC language compatible; the parallel Transformer is C++ language compatible.
B. The server Transformer accepts multiple input (reference) links; the parallel Transformer has a single input link.
C. The server Transformer accepts routines written in the BASIC language; the parallel Transformer accepts routines written in C/C++.

Q.How to see the data in the Dataset in UNIX. What command we have to use to see the data in Dataset in
UNIX?
Ans : orchadmin dump <datasetname>.ds

What is the difference between the Change Capture and Change Apply stages?
A. The Change Capture stage compares two data sets (before and after) and makes a record of the differences.
The Change Apply stage combines the changes from the Change Capture stage with the original before data set to reproduce the after data set.

Orchadmin is a command-line utility provided by DataStage to work with data sets.

The general callable format is : $orchadmin <command> [options] [descriptor file]

1. Before using orchadmin, you should make sure that either the working directory or the $APT_ORCHHOME/etc
contains the file “config.apt” OR

The environment variable $APT_CONFIG_FILE should be defined for your session.

Orchadmin commands

The various commands available with orchadmin are

1. CHECK: $orchadmin check

Validates the configuration file contents, such as the accessibility of all the nodes defined in the configuration file, the scratch disk definitions, and so on. Throws an error when the config file is not found or not defined properly.

2. COPY : $orchadmin copy <source.ds> <destination.ds>

Makes a complete copy of the datasets of source with new destination descriptor file name.

Please note that

a. You cannot use the UNIX cp command, as it just copies the descriptor file to a new name; the data is not copied.

b. The new datasets will be arranged according to the config file that is currently in use, not according to the old config file that was in use with the source.

3. DELETE : $orchadmin < delete | del | rm > [-f | -x] descriptorfiles….

The UNIX rm utility cannot be used to delete the datasets. The orchadmin delete or rm command should be used to delete one or more persistent data sets.

The -f option makes a force delete. If some nodes are not accessible then -f forces deletion of the dataset partitions from the accessible nodes and leaves the other partitions on the inaccessible nodes as orphans.

-x forces the current config file to be used while deleting, rather than the one stored in the data set.

4. DESCRIBE: $orchadmin describe [options] descriptorfile.ds

This is the single most important command.

1. Without any option lists the no.of.partitions, no.of.segments, valid segments, and preserve partitioning flag details of the
persistent dataset.

-c : Print the configuration file that is written in the dataset if any


-p: Lists down the partition level information.

-f: Lists down the file level information in each partition

-e: Lists the segment-level information.

-s: Lists the meta-data schema of the data set.

-v: Lists all segments, valid or otherwise.

-l : Long listing. Equivalent to -f -p -s -v -e

5. DUMP: $orchadmin dump [options] descriptorfile.ds

The dump command is used to dump(extract) the records from the dataset.

Without any options the dump command lists down all the records starting from first record from first partition till last
record in last partition.

-delim '<string>' : Uses the given string as the delimiter for fields instead of a space.

-field <name> : Lists only the given field instead of all fields.

-name : List all the values preceded by field name and a colon

-n numrecs : List only the given number of records per partition.

-p period(N) : Lists every Nth record from each partition starting from first record.

-skip N: Skip the first N records from each partition.

-x : Use the current system configuration file rather than the one stored in dataset.

6. TRUNCATE: $orchadmin truncate [options] descriptorfile.ds

Without options deletes all the data(ie Segments) from the dataset.

-f: Uses force truncate. Truncate accessible segments and leave the inaccesible ones.

-x: Uses current system config file rather than the default one stored in the dataset.

-n N: Leaves the first N segments in each partition and truncates the remaining.

7. HELP: $orchadmin -help OR $orchadmin <command> -help

Help manual about the usage of orchadmin or orchadmin commands.

Datastage: Job design tips

I am just collecting the general design tips that helps the developers to build clean & effective jobs.

1. Turn off Runtime Column propagation wherever it’s not required.


2. Make use of Modify, Filter, Aggregator, Column Generator etc. stages instead of the Transformer stage only if the anticipated volumes are high and performance becomes a problem. Otherwise use the Transformer; it is much easier to code a Transformer than a Modify stage.

3. Avoid propagation of unnecessary metadata between the stages. Use the Modify stage to drop the metadata. The Modify stage will drop the metadata only when explicitly specified using the DROP clause.

4. One of the most common mistakes developers make is not doing a volumetric analysis before deciding whether to use the Join, Lookup or Merge stage. Estimate the volumes and then decide which stage to go for.

5. Add reject files wherever you need reprocessing of rejected records or you think considerable data loss may happen. Try to keep reject links at least on Sequential File stages and on writes to database stages.

6. Make use of an ORDER BY clause when a DB stage feeds a join. The intention is to make use of the database's power for sorting instead of DataStage resources. Keep the join partitioning as Auto, and indicate the "don't sort, previously sorted" option between the DB stage and the Join stage using a Sort stage when the ORDER BY clause is used.

7. While doing Outer joins, you can make use of Dummy variables for just Null checking instead of fetching an explicit
column from table.

8. Use Sort stages instead of Remove duplicate stages. Sort stage has got more grouping options and sort indicator
options.

9. One of the most frequent problems developers face is lookup failures caused by not taking care of the string pad character that DataStage appends when converting strings of lower precision to higher precision. Try to decide on the APT_STRING_PADCHAR and APT_CONFIG_FILE parameters from the beginning. Ideally APT_STRING_PADCHAR should be set to 0x00 (the C/C++ end-of-string character) and the configuration file to the maximum number of nodes available.

10. Data partitioning is a very important part of parallel job design. It is always advisable to keep the data partitioning as 'Auto' unless you are comfortable with partitioning, since all DataStage stages are designed to perform in the required way with Auto partitioning.

11. Do remember that Modify drops the metadata only when it is explicitly asked to do so using the KEEP/DROP clauses.

Which partitioning is used in Join, Merge and Lookup?

The Join stage follows the Modulus partitioning method. Merge follows the same partitioning method as well as the Auto partitioning method. Lookup follows the Entire partitioning method.

These functions can be used in a job control routine, which is defined as part of a job’s properties and allows other jobs to
be run and controlled from the first job. Some of the functions can also be used for getting status information on the
current job; these are useful in active stage expressions and before- and after-stage subroutines.
To do this ...                                                             Use this function ...
Specify the job you want to control: DSAttachJob
Set parameters for the job you want to control: DSSetParam
Set limits for the job you want to control: DSSetJobLimit
Request that a job is run: DSRunJob
Wait for a called job to finish: DSWaitForJob
Get the meta data details for the specified link: DSGetLinkMetaData
Get information about the current project: DSGetProjectInfo
Get buffer size and timeout value for an IPC or Web Service stage: DSGetIPCStageProps
Get information about the controlled job or current job: DSGetJobInfo
Get information about the metabag properties associated with the named job: DSGetJobMetaBag
Get information about a stage in the controlled job or current job: DSGetStageInfo
Get the names of the links attached to the specified stage: DSGetStageLinks
Get a list of stages of a particular type in a job: DSGetStagesOfType
Get information about the types of stage in a job: DSGetStageTypes
Get information about a link in a controlled job or current job: DSGetLinkInfo
Get information about a controlled job's parameters: DSGetParamInfo
Get the log event from the job log: DSGetLogEntry
Get a number of log events on the specified subject from the job log: DSGetLogSummary
Get the newest log event, of a specified type, from the job log: DSGetNewestLogId
Log an event to the job log of a different job: DSLogEvent
Stop a controlled job: DSStopJob
Return a job handle previously obtained from DSAttachJob: DSDetachJob
Log a fatal error message in a job's log file and abort the job: DSLogFatal
Log an information message in a job's log file: DSLogInfo
Put an info message in the job log of a job controlling the current job: DSLogToController
Log a warning message in a job's log file: DSLogWarn
Generate a string describing the complete status of a valid attached job: DSMakeJobReport
Insert arguments into the message template: DSMakeMsg
Ensure a job is in the correct state to be run or validated: DSPrepareJob
Interface to the system send mail facility: DSSendMail
Log a warning message to a job log file: DSTransformError
Convert a job control status or error code into an explanatory text message: DSTranslateCode
Suspend a job until a named file either exists or does not exist: DSWaitForFile
Check if a BASIC routine is cataloged, either in VOC as a callable item or in the catalog space: DSCheckRoutine
Execute a DOS or DataStage Engine command from a before/after subroutine: DSExecute
Set a status message for a job to return as a termination message when it finishes: DSSetUserStatus

A number of macros are provided in the JOBCONTROL.H file to facilitate getting information about the current job, and about links and stages belonging to the current job. These can be used in expressions (for example in Transformer stages), job control routines, filenames and table names, and before/after subroutines.

The available macros are:


DSHostName
DSProjectName
DSJobStatus
DSJobName
DSJobController
DSJobStartDate
DSJobStartTime
DSJobStartTimestamp
DSJobWaveNo
DSJobInvocations
DSJobInvocationId
DSStageName
DSStageLastErr
DSStageType
DSStageInRowNum
DSStageVarList
DSLinkRowCount
DSLinkLastErr
DSLinkName

These macros provide the functionality of using the DSGetProjectInfo, DSGetJobInfo, DSGetStageInfo, and
DSGetLinkInfo functions with the DSJ.ME token as the JobHandle and can be used in all active stages and before/after
subroutines. The macros provide the functionality for all the possible InfoType arguments for the DSGet…Info functions.
See the Function call help topics for more details.
Datastage Parallel jobs Vs Datastage Server jobs

--------------------------------------------------------------------------------

1) The basic difference between server and parallel jobs is the degree of parallelism. Server job stages do not have built-in partitioning and parallelism mechanisms for extracting and loading data between different stages.

• All you can do to enhance speed and performance in server jobs is to enable inter-process row buffering through the Administrator. This helps stages to exchange data as soon as it is available in the link.
• You could use the IPC stage too, which helps one passive stage read data from another as soon as data is available. In other words, stages do not have to wait for the entire set of records to be read first and then transferred to the next stage. The Link Partitioner and Link Collector stages can be used to achieve a certain degree of partitioning parallelism.
• All of the above features which have to be explicitly used in server jobs are built into DataStage PX.

2) The PX engine runs on a multiprocessor system and takes full advantage of the processing nodes defined in the configuration file. Both SMP and MPP architectures are supported by DataStage PX.

3) PX takes advantage of both pipeline parallelism and partitioning parallelism. Pipeline parallelism means that as soon as data is available between stages (in pipes or links), it can be exchanged between them without waiting for the entire record set to be read. Partitioning parallelism means that the entire record set is partitioned into small sets and processed on different nodes (logical processors). For example, if there are 100 records and 4 logical nodes, then each node would process 25 records. This enhances the speed at which loading takes place to an amazing degree. Imagine situations where billions of records have to be loaded daily. This is where DataStage PX comes as a boon for the ETL process and surpasses all other ETL tools in the market.

4) In parallel jobs we have the Data Set, which acts as intermediate data storage between linked stages; it is the best storage option because it stores the data in DataStage's internal format.

5) In parallel jobs we can choose to display the generated OSH, which gives information about how the job works.

6) In the parallel Transformer there is no reference link possibility; in server jobs a reference link can be given to the Transformer. The parallel stage can use both BASIC and parallel-oriented functions.

7) DataStage server jobs are executed by the DataStage server environment, but parallel jobs are executed under the control of the DataStage runtime environment.

8) Server jobs are compiled into BASIC (interpreted pseudo code) and parallel jobs are compiled into OSH (Orchestrate scripting language).

9) Debugging and testing stages are available only in the Parallel Extender.

10) More processing stages are not included in Server jobs, for example Join, CDC, Lookup, etc.

11) In file stages, the Hashed File is available only in Server, and the Complex Flat File, Data Set and Lookup File Set are available in parallel only.

12) The server Transformer supports BASIC transforms only, but the parallel edition supports both BASIC and parallel transforms.

13) The server Transformer offers BASIC language compatibility; the parallel Transformer offers C++ language compatibility.

14) Lookup against a sequential file is possible in parallel jobs.

15) In parallel we can specify more file paths to fetch data from, using a file pattern similar to the Folder stage in Server, while in server we can specify only one file name on one output link.

16) We can simultaneously give an input as well as an output link to a Sequential File stage in Server. But an additional output link in parallel means a reject link, that is, a link that collects records that fail to load into the sequential file for some reason.

17) There is a difference in the file size restriction:
Sequential file size in server: 2 GB.
Sequential file size in parallel: no limitation.

18) The parallel Sequential File stage has filter options too, where you can specify a file pattern.
Datastage: Join or Lookup or Merge or CDC

Many times this question pops up in the mind of Datastage developers.

All the above stages can be used to do the same task. Match one set of data (say primary) with another set of
data(references) and see the results. DataStage normally uses different execution plans (hmm… i should ignore my
Oracle legacy when posting on Datastage). Since DataStage is not so nice as Oracle, to show its Execution plan easily,
we need to fill in the gap of Optimiser and analyze our requiremets. Well I have come up with a nice table ,

Most importantly its the Primary/Reference ratio that needs to be considered not the actual counts.

Primary Source Volume        Reference Volume            Preferred Method

Little (< 5 million)         Very huge (> 50 million)    Sparse Lookup

Little (< 5 million)         Little (< 5 million)        Normal Lookup

Huge (> 10 million)          Little (< 5 million)        Normal Lookup

Little (< 5 million)         Huge (> 10 million)

Huge (> 10 million)          Huge (> 10 million)         Join

Huge (> 10 million)          Huge (> 10 million)         Merge, if you want to handle rejects in reference links

Datastage: Warning removals


Here I am collecting most of the warnings developers encounter when coding datastage jobs and trying to resolve them.

1. Warning: Possible truncation of input string when converting from a higher length string to lower length string
in Modify.

Resolution: In the Modify stage explicitly specify the length of output column.

Ex: CARD_ACCOUNT:string[max=16] = string_trim[" ", end, begin](CARD_ACCOUNT) instead of just CARD_ACCOUNT


= string_trim[" ", end, begin](CARD_ACCOUNT);

2. Warning: A Null result can come up in Non Nullable field. Mostly thrown by DataStage when aggregate
functions are used in Oracle DB stage.

Resolution: Use a Modify or Transformer stage in between lookup and Oracle stage. When using a modify stage, use the
handle_null clause.

EX: CARD_ACCOUNT:string[max=19] = handle_null(CARD_ACCOUNT,-128);

-128 will be replaced in CARD_ACCOUNT wherever CARD_ACCOUNT is Null.

3. Warning: Some Decimal precision error converting from decimal [p,s] to decimal[x,y].

Resolution: Specify the exact scale and precision of the output column in the Modify stage specification and use trunc_zero (the default r_type with the decimal_from_decimal conversion).
Ex: CREDIT_LIMIT:decimal[10,2] = decimal_from_decimal[trunc_zero](CREDIT_LIMIT); instead of just CREDIT_LIMIT = decimal_from_decimal[trunc_zero](CREDIT_LIMIT);

For further information on where to specify the output column type explicitly and where not necessary, refer to the data
type default/manual conversion guide

For default data type conversion (‘d’) size specification is not required. For manual conversion (‘m’) explicit size
specification is required. The table is available in parallel job developer’s guide

4. Warning: A sequential operator cannot preserve the partitioning of input data set on input port 0

Resolution: Clear the preserve partition flag before Sequential file stages.

5. Warning: A user defined sort operator does not satisfy the requirements.

Resolution: In the job flow, notice the columns on which the sort is happening. The order of the columns must also be the same, i.e. if you specify a sort on columns in the order X, Y in the Sort stage and specify the join on the columns in the order Y, X, then the Join stage will throw the warning, since it cannot recognise the change in order. Also, DataStage appears to throw this warning at compile time, so if you rename a column between stages, the warning is also thrown. Say I have sorted on column X in the Sort stage, but the column name is changed to Y at the output interface; then the warning is also thrown. Just revert the output interface column to 'X' and the warning disappears.

What is a Bit-Mapped Index

Bitmap indexing is a technique commonly used in relational databases where the index represents data using binary bitmaps. This technique was originally used for low-cardinality data, but products like Sybase IQ have used this technique efficiently in wider cases.
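A minimal Oracle-style example of creating a bitmap index on a low-cardinality column (the table and column names are assumptions):

create bitmap index emp_gender_bix on emp(gender);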

What is the difference between DataStage and DataStage TX?

1. IBM DataStage TX is an any-to-any message transformation tool. It accepts a message of any format (XML, fixed length) and can convert it to any desired format.
DataStage is an ETL tool; DataStage TX is an EAI tool. DataStage is used in data warehousing, while TX is used in EDI (Electronic Data Interchange). The applications of the two tools differ accordingly.

What does a Config File in parallel extender consist of?

Config file consists of the following.


a) Number of Processes or Nodes.
b) Actual Disk Storage Location.
{ node "node1"

{
fastname "stvsauxpac01"
pools ""
resource disk "/local/datastage/Datasets" {pools ""}
resource scratchdisk "/local/datastage/Scratch" {pools ""}
}
node "node2"
{
fastname "stvsauxpac01"
pools ""
resource disk "/local/datastage/Datasets" {pools ""}
resource scratchdisk "/local/datastage/Scratch" {pools ""} }}
The APT configuration file holds the resource disk, node pool and scratch disk information. The node information specifies how many nodes are available to run the jobs; based on the nodes, DataStage creates processes at the back end while running the jobs. The resource disk is the location where the data is stored, and the scratch disk is used, for example, whenever lookups or sorts are performed in the jobs.

What is the exact difference between parallel jobs and server jobs?

Server jobs run on a single node; parallel jobs run on multiple nodes.
Server jobs do not support pipelining and partitioning; parallel jobs do.
Server jobs load only when one job finishes; parallel jobs load in a synchronized fashion.
Server jobs use symmetric multiprocessing (SMP); parallel jobs use both massively parallel processing (MPP) and symmetric multiprocessing.

What is a data set? And what is a file set?

Dataset: DataStage parallel extender jobs use data sets to manage data within a job. You can think of each link in a job as carrying a data set. The Data Set stage allows you to store data being operated on in a persistent form, which can then be used by other DataStage jobs.
FileSet: DataStage can generate and name exported files, write them to their destination, and list the files it has generated in a file whose extension is, by convention, .fs. The data files and the file that lists them are called a file set. This capability is useful because some operating systems impose a 2 GB limit on the size of a file and you need to distribute files among nodes to prevent overruns.
(Note: a lookup file set, by contrast, is used only by Lookup stages.)

What is the max size of Data set stage?

The Max size of a Dataset is equal to the summation of the space available in the Resource disk specified in the
configuration file.

What are the different types of data warehousing?

There are four types:

Native data warehouse


Software data warehouse
package data warehouse
Data management

Fetching the last row from a particular column of a sequential file

Develop a job: source Sequential File --> Transformer --> output stage.

In the Transformer create a stage variable, say rowcount, with the following derivation:
go to the DS functions and click on DSGetLinkInfo; you will get "DSGetLinkInfo(DSJ.ME,%Arg2%,%Arg3%,%Arg4%)".
Arg2 is your source stage name, Arg3 is your source link name, and for Arg4 click DS Constant and select DSJ.LINKROWCOUNT.
Now the derivation is:
DSGetLinkInfo(DSJ.ME, "source", "link", DSJ.LINKROWCOUNT)

Create a constraint as @INROWNUM = rowcount and map the required columns to the output link.

What is BUS Schema?

A BUS schema is composed of a master suite of conformed dimensions and standardized definitions of facts.
In a BUS schema we would eventually have conformed dimensions and facts defined to be shared across all
enterprise data marts. This way all Data Marts can use the conformed dimensions and facts without having them
locally. This is the first step towards building an enterprise Data Warehouse from Kimball's perspective. For (e.g) we
may have different data marts for Sales, Inventory and Marketing and we need common entities like Customer,
Product etc to be seen across these data marts and hence would be ideal to have these as Conformed objects. The
challenge here is that some times each line of business may have different definitions for these conformed objects
and hence choosing conformed objects have to be designed with some extra care.

What is a linked cube?


A linked cube is one in which a sub-set of the data can be analysed in great detail. The linking ensures that the data in the
cubes remains consistent.

What is Data Modeling

Data Modeling is a method used to define and analyze the data requirements needed to support the
business functions of an enterprise. These data requirements are recorded as a conceptual data model with
associated data definitions. Data modeling defines the relationships between data elements and structures.

What is Data Cluster

Clustering, in the computer-science sense, is the classification of data or objects into different groups. It
can also be referred to as partitioning of a data set into different subsets, where the data in each subset ideally share
some common traits. Data clusters are created to meet specific requirements that cannot be met using any of the
categorical levels.

What is Data Aggregation

In data aggregation, a value is derived from the aggregation of two or more contributing data
characteristics. Aggregation can be made from different data occurrences within the same data subject, business
transactions and a de-normalized database, and between the real world and the detailed data resource design within the
common data

What is Data Dissemination

The best example of dissemination is the ubiquitous internet. Every single second throughout the year,
data gets disseminated to millions of users around the world. The data may sit on millions of servers located in
scattered geographical locations. Data dissemination on the internet is possible through many different kinds of
communications protocols.

What is Data Distribution

Often, data warehouses are managed by more than one computer server, because the
high volume of data cannot be handled by one computer alone. In the past, mainframes were used for processes
involving big bulks of data. Mainframes were giant computers housed in big rooms to be used for critical applications
involving lots of data

What is Data Administration

Data administration refers to the way in which data integrity is maintained within a data warehouse. Data
warehouses are very large repositories of all sorts of data, and these data may be of different formats. To make the data
useful to the company, the database running the data warehouse has to be configured so that it obeys the business

What is Data Collection Frequency

Data Collection Frequency, just as the name suggests, refers to the frequency at which data is
collected at regular intervals. This often refers to whatever time of the day or the year in any given length of period.
In a data warehouse, the relational database management systems continually gather, extract, transform and load
data onto the storage

What is Data Duplication

The definition of what constitutes a duplicate has somewhat different interpretations. For instance, some
define a duplicate as having the exact syntactic terms and sequence, whether or not there are formatting differences. In
effect, there are either no differences or only formatting differences and the contents of the data are exactly the same.

What is Data Integrity Rule

In the past, data integrity was defined and enforced with data edits, but this method did not cope with
the growth of technology and data value quality greatly suffered at the cost of business operations. Organizations
were starting to realize that the rules were no longer appropriate for the business. The concept of business rules is
already widely

Top ten features in DataStage Hawk (DataStage 8.0)

1) The metadata server. To borrow a simile from that judge on American Idol "Using MetaStage is kind of like bathing
in the ocean on a cold morning. You know it's good for you but that doesn't stop it from freezing the crown jewels."
MetaStage is good for ETL projects but none of the projects I've been on has actually used it. Too much effort
required to install the software, setup the metabrokers, migrate the metadata, and learn how the product works and
write reports. Hawk brings the common repository and improved metadata reporting and we can get the positive
effects of bathing in sea water without the shrinkage that comes with it.

2) QualityStage overhaul. Data Quality reporting can be another forgotten aspect of data integration projects. Like
MetaStage the QualityStage server and client had an additional install, training and implementation overhead so
many DataStage projects did not use it. I am looking forward to more integration projects using standardisation,
matching and survivorship to improve quality once these features are more accessible and easier to use.

3) Frictionless Connectivity and Connection Objects. I've called DB2 every rude name under the sun. Not because
it's a bad database but because setting up remote access takes me anywhere from five minutes to five weeks
depending on how obscure the error message and how hard it is to find the obscure setup step that was missed
during installation. Anything that makes connecting to database easier gets a big tick from me.

4) Parallel job range lookup. I am looking forward to this one because it will stop people asking for it on forums. It
looks good, it's been merged into the existing lookup form and seems easy to use. Will be interested to see the
performance.

5) Slowly Changing Dimension Stage. This is one of those things that Informatica were able to trumpet at product
comparisons, that they have more out of the box DW support. There are a few enhancements to make updates to
dimension tables easier, there is the improved surrogate key generator, there is the slowly changing dimension stage
and updates passed to in memory lookups. That's it for me with DBMS generated keys, I'm only doing the keys in the
ETL job from now on! DataStage server jobs have the hash file lookup where you can read and write to it at the
same time, parallel jobs will have the updateable lookup.

6) Collaboration: better developer collaboration. Everyone hates opening a job and being told it is locked. "Bloody
what his name has gone to lunch, locked the job and now his password protected screen saver is up! Unplug his
PC!" Under Hawk you can open a read only copy of a locked job plus you get told who has locked the job so you
know whom to curse.

7) Session Disconnection. Accompanied by the metallic cry of "exterminate! exterminate!" an administrator can
disconnect sessions and unlock jobs.

8) Improved SQL Builder. I know a lot of people cross the street when they see the SQL Builder coming. Getting the
SQL builder to build complex SQL is a bit like teaching a monkey how to play chess. What I do like about the current
SQL builder is that it synchronises your SQL select list with your ETL column list to avoid column mismatches. I am
hoping the next version is more flexible and can build complex SQL.

9) Improved job startup times. Small parallel jobs will run faster. I call it the death of a thousand cuts, your very large
parallel job takes too long to run because a thousand smaller jobs are starting and stopping at the same time and
cutting into CPU and memory. Hawk makes these cuts less painful.

10) Common logging. Log views that work across jobs, log searches, log date constraints, wildcard message filters,
saved queries. It's all good. You no longer need to send out a search party to find an error message.

How does Relational Data Modeling differ from Dimensional Data Modeling?

In relational models data is normalized to 1st, 2nd or 3rd normal form. In dimensional models data is denormalized.
The typical design is that of a star, where there is a central fact table containing additive, measurable facts, and this
central fact table is in relationship with dimension tables which generally contain the textual attributes that normally occur in
the WHERE clause of a query.
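For instance, a typical star-schema query aggregates the facts and pushes the dimension attributes into the joins and the WHERE clause. A small sketch follows; the table and column names are purely illustrative, not taken from the text:

-- Facts are summed; dimension attributes supply the filters and the grouping.
SELECT d.month_name,
       p.product_category,
       SUM(f.turnover) AS total_sales
FROM   sales_fact f
       JOIN time_dim    d ON f.time_key    = d.time_key
       JOIN product_dim p ON f.product_key = p.product_key
WHERE  d.calendar_year = 2006
  AND  p.product_category = 'BIKES'
GROUP BY d.month_name, p.product_category;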

What is data sparsity and how it effect on aggregation?

Data sparsity is a term used for how much data we have for a particular dimension/entity of the model.
It affects aggregation depending on how deep the combinations of members of the sparse dimensions go. If there are
many combinations and those combinations do not have any factual data, then creating space to store those
aggregations is a waste and, as a result, the database will become huge.

What is weak entity?


A weak entity is part of a one-to-many relationship, with the identifying entity on the one side of the relationship and
with total participation on the many side.
The weak entity relies on the identifying entity for its identification. The primary key of a weak entity is a composite
key (the PK of the identifying entity (the identifier) plus a discriminator). There can be more than one value of the discriminator for each
identifier.
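As a small illustration (the table and column names are invented for this example, not taken from the text), an order line is a weak entity identified by its owning order plus a line-number discriminator:

CREATE TABLE orders (
    order_id   INTEGER PRIMARY KEY,
    order_date DATE
);

-- Weak entity: its key combines the identifying entity's key with a discriminator.
CREATE TABLE order_line (
    order_id INTEGER NOT NULL REFERENCES orders (order_id),
    line_no  INTEGER NOT NULL,            -- discriminator
    quantity INTEGER,
    PRIMARY KEY (order_id, line_no)       -- composite key: identifier + discriminator
);
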
How are we going to decide which schema to implement in the data warehouse?
Pro Star Schema:

• Users find it easier to query and query times are faster.


• The space consequences of repeated data make little difference, as dimension
tables hold relatively few rows compared with the large fact tables.

Pro Snowflake Schema:

• DBAs find it easier to maintain.


• May exist in staging but should be de-normalized in production.
• The cleanest designs use surrogate keys for all dimension levels.
• Smaller storage size because normalized data takes less space.

In which normal form is the dimension table and fact table in the schema?

Unlike OLTP, the goal of dimensional and fact modeling is not to achieve the highest normal form but rather to
make key performance indicators (the often sought-after measures) readily accessible to ad-hoc queries.

That being said, dimensions can strive to be in Boyce-Codd/3rd normal form, while fact tables may be in 1st normal
form - having only a primary key that is unique.

Normalized dimension tables have the advantage of low storage space,
while de-normalized, 1st normal form dimension tables take more space but perform faster.

Every schema deals with facts and dimensions.

The fact table is the central table in the schema, whereas the dimension tables surround the fact table. A dimension
table is one which gives the description of the business transactions and always has a primary key. The fact table is the one
dealing with measures and holding foreign keys.

What is the difference between logical data model and physical data model in Erwin?
The Logical Data Model (LDM) is derived from the Conceptual Data Model (CDM).

The CDM consists of the major entity sets and the relationship sets, and does not state anything about the attributes of
the entity sets.

The LDM consists of the entity sets, their attributes, the relationship sets, the cardinality, the type of relationship, etc.

The Physical Data Model (PDM) consists of the Entity Sets (Tables), their attributes (columns of tables), the Relationship
sets (whose attributes are also mapped to columns of tables), along with the Datatype of the columns, the various integrity
constraints, etc.

Erwin calls the conversion / transformation of LDM => PDM as Forward Engineering which further leads to the actual code
generation and the conversion of Code => PDM => LDM as Reverse Engineering!

Unix Programming Interview Questions

what is the UNIX command to wait for a specified number of seconds before exit?

sleep <seconds> (for example, sleep 10)
DataStage Q & A

Configuration Files
APT_CONFIG_FILE is the file using which DataStage determines the configuration file (one can have many
configuration files for a project) to be used. In fact, this is what is generally used in production. However, if this
environment variable is not defined then how DataStage determines which file to use?
If the APT_CONFIG_FILE environment variable is not defined then DataStage looks for the default configuration file
(config.apt) in the following paths:
Current working directory.
INSTALL_DIR/etc, where INSTALL_DIR ($APT_ORCHHOME) is the top level directory of DataStage installation.
What are the different options a logical node can have in the configuration file?
fastname – The fastname is the physical node name that stages use to open connections for high volume data
transfers. The attribute of this option is often the network name. Typically, you can get this name by using Unix
command ‘uname -n’.
pools – Name of the pools to which the node is assigned to. Based on the characteristics of the processing nodes
you can group nodes into set of pools.
A pool can be associated with many nodes and a node can be part of many pools.
A node belongs to the default pool unless you explicitly specify a pools list for it and omit the default pool name ("")
from the list.
A parallel job or specific stage in the parallel job can be constrained to run on a pool (set of processing nodes).
If both the job and a stage within the job are constrained to run on specific processing nodes, then the stage will run on
the nodes which are common to both the stage and the job.
resource – resource resource_type "location" [{pools "disk_pool_name"}] | resource resource_type "value".
The resource_type can be canonicalhostname (which takes the quoted ethernet name of a node in the cluster that is
unconnected to the conductor node by the high-speed network), or disk (to read/write persistent data to this directory),
or scratchdisk (the quoted absolute path name of a directory on a file system where intermediate data will be
temporarily stored; it is local to the processing node), or RDBMS-specific resources (e.g. DB2, INFORMIX,
ORACLE, etc.).
How does DataStage decide on which processing node a stage should be run?
If a job or stage is not constrained to run on specific nodes then parallel engine executes a parallel stage on all
nodes defined in the default node pool. (Default Behavior)
If the stage is constrained to particular nodes then those processing nodes are chosen while executing the parallel stage.
(Refer to 2.2.3 for more detail).
When configuring an MPP, you specify the physical nodes in your system on which the parallel engine will run your
parallel jobs. This is called the conductor node. For other nodes, you do not need to specify the physical node. Also,
you need to copy the (.apt) configuration file only to the nodes from which you start parallel engine applications. It is
possible that conductor node is not connected with the high-speed network switches. However, the other nodes are
connected to each other using a very high-speed network switches. How do you configure your system so that you
will be able to achieve optimized parallelism?
Make sure that none of the stages are specified to be run on the conductor node.
Use conductor node just to start the execution of parallel job.
Make sure that conductor node is not the part of the default pool.
Although parallelization increases the throughput and speed of the process, why is maximum parallelization not
necessarily the optimal parallelization?
Datastage creates one process for every stage for each processing node. Hence, if the hardware resource is not
available to support the maximum parallelization, the performance of the overall system goes down. For example,
suppose we have an SMP system with three CPUs and a parallel job with 4 stages. We have 3 logical nodes (one
corresponding to each physical node, i.e. CPU). Now DataStage will start 3*4 = 12 processes, which have to be
managed by a single operating system. Significant time will be spent in context switching and scheduling the
processes.
Since we can have different logical processing nodes, it is possible that some node will be more suitable for some
stage while other nodes will be more suitable for other stages. So, how do we decide which node will be suitable for
which stage?
If a stage is performing a memory-intensive task then it should be run on a node which has more memory and scratch disk space
available for it, e.g. sorting data is a memory-intensive task and it should be run on such nodes.
If some stage depends on licensed version of software (e.g. SAS Stage, RDBMS related stages, etc.) then you need
to associate those stages with the processing node, which is physically mapped to the machine on which the
licensed software is installed. (Assumption: The machine on which licensed software is installed is connected
through other machines using high speed network.)
If a job contains stages, which exchange large amounts of data then they should be assigned to nodes where stages
communicate by either shared memory (SMP) or high-speed link (MPP) in most optimized manner.
Basically, nodes are nothing but a set of machines (especially in MPP systems). You start the execution of parallel jobs
from the conductor node. The conductor node creates a shell on the remote machines (depending on the processing nodes)
and copies the same environment to them. However, it is possible to create a startup script which will selectively
change the environment on a specific node. This script has the default name startup.apt. However, like the main
configuration file, we can also have many startup configuration files. The appropriate configuration file can be picked
up using the environment variable APT_STARTUP_SCRIPT. What is the use of the APT_NO_STARTUP_SCRIPT
environment variable?
Using APT_NO_STARTUP_SCRIPT environment variable, you can instruct Parallel engine not to run the startup
script on the remote shell.
What are the generic things one must follow while creating a configuration file so that optimal parallelization can be
achieved?
Consider avoiding the disk/disks that your input files reside on.
Ensure that the different file systems mentioned as the disk and scratchdisk resources hit disjoint sets of spindles
even if they’re located on a RAID (Redundant Array of Inexpensive Disks) system.
Know what is real and what is NFS:
Real disks are directly attached, or are reachable over a SAN (storage-area network -dedicated, just for storage, low-
level protocols).
Never use NFS file systems for scratchdisk resources; remember scratchdisks are also used for temporary storage of
files/data during processing.
If you use NFS file system space for disk resources, then you need to know what you are doing. For example, your
final result files may need to be written out onto the NFS disk area, but that doesn’t mean the intermediate data sets
created and used temporarily in a multi-job sequence should use this NFS disk area. Better to setup a “final” disk
pool, and constrain the result sequential file or data set to reside there, but let intermediate storage go to local or
SAN resources, not NFS.
Know what data points are striped (RAID) and which are not. Where possible, avoid striping across data points that
are already striped at the spindle level.

What is a conductor node?


Every parallel job run contains a conductor process, where the execution is started, a section leader process
for each processing node, a player process for each set of combined operators, and an individual player process
for each uncombined operator. Whenever we want to kill a job we have to destroy the player processes first, then the
section leader processes and then the conductor process.

Relational vs Dimensional

Relational Data Modeling:
• Data is stored in an RDBMS
• Tables are units of storage
• Data is normalized and used for OLTP; optimized for OLTP processing
• Several tables and chains of relationships among them
• Volatile (several updates) and time variant
• SQL is used to manipulate data
• Detailed level of transactional data
• Normal reports

Dimensional Data Modeling:
• Data is stored in an RDBMS or in multidimensional databases
• Cubes are units of storage
• Data is denormalized and used in data warehouses and data marts; optimized for OLAP
• Few tables; fact tables are connected to dimension tables
• Non-volatile and time invariant
• MDX is used to manipulate data
• Summary of bulky transactional data (aggregates and measures) used in business decisions
• User friendly, interactive, drag and drop multidimensional OLAP reports

ETL process and concepts

ETL stands for extraction, transformation and loading. ETL is a process that involves the following tasks:

• extracting data from source operational or archive systems which are the primary source of data for the data
warehouse
• transforming the data - which may involve cleaning, filtering, validating and applying business rules
• loading the data into a data warehouse or any other database or application that houses data

The ETL process is also very often referred to as Data Integration process and ETL tool as a Data Integration platform.
The terms closely related to and managed by ETL processes are: data migration, data management, data cleansing, data
synchronization and data consolidation.

The main goal of maintaining an ETL process in an organization is to migrate and transform data from the source OLTP
systems to feed a data warehouse and form data marts.


What is data Migration?


Data migration is actually the translation of data from one format to another format or from one storage device to
another storage device.
Data migration typically has four phases: analysis of source data, extraction and transformation of data, validation
and repair of data, and use of data in the new program.

What is Data Management?


Data management is the development and execution of architectures, policies, practices and procedures in order to
manage the information lifecycle needs of an enterprise in an effective manner.

What is Data Synchronization? And why is it important?


When suppliers and retailers attempt to communicate with one another using unsynchronized data, there is
confusion. Neither party completely understands what the other is requesting. The inaccuracies cause costly errors
in a variety of business systems. By synchronizing item and supplier data, each organization works from identical
information, therefore, minimizing miscommunication. Data synchronization is the key to accurate and timely
exchange of item and supplier data across enterprises and organizations.

What is Data Cleansing?


Also referred to as data scrubbing, the act of detecting and removing and/or correcting a database’s dirty data (i.e.,
data that is incorrect, out-of-date, redundant, incomplete, or formatted incorrectly). The goal of data cleansing is not
just to clean up the data in a database but also to bring consistency to different sets of data that have been merged
from separate databases

What is Data Consolidation?


Data consolidation is usually associated with moving data from remote locations to a central location or combining
data due to an acquisition or merger.

Data Warehouse Architecture

The main difference between the database architecture in a standard, on-line transaction processing oriented system
(usually ERP or CRM system) and a Data Warehouse is that the system’s relational model is usually de-normalized into
dimension and fact tables which are typical to a data warehouse database design.
The differences in the database architectures are caused by different purposes of their existence.

In a typical OLTP system the database performance is crucial, as end-user interface responsiveness is one of the
most important factors determining usefulness of the application. That kind of a database needs to handle inserting
thousands of new records every hour. To achieve this usually the database is optimized for speed of Inserts, Updates and
Deletes and for holding as few records as possible. So from a technical point of view most of the SQL queries issued will
be INSERT, UPDATE and DELETE.

In contrast to OLTP systems, a Data Warehouse is a system that should give a response to almost any question
regarding company performance measures. Usually the information delivered from a data warehouse is used by people
who are in charge of making decisions. So the information should be accessible quickly and easily but it doesn't need to
be the most recent possible and in the lowest detail level.

Usually the data warehouses are refreshed on a daily basis (very often the ETL processes run overnight) or once a month
(data is available for the end users around 5th working day of a new month). Very often the two approaches are
combined.

The main challenge of a Data Warehouse architecture is to enable the business to access historical, summarized data, with
read-only access for the end-users. Again, from a technical standpoint, most SQL queries would start with a SELECT
statement.

In Data Warehouse environments, the relational model can be transformed into the following architectures:
• Star schema
• Snowflake schema
• Constellation schema

Star schema architecture

Star schema architecture is the simplest data warehouse design. The main feature of a star schema is a table at the
center, called the fact table and the dimension tables which allow browsing of specific categories, summarizing, drill-
downs and specifying criteria.
Typically, most of the fact tables in a star schema are in database third normal form, while dimensional tables are de-
normalized (second normal form).
Despite the fact that the star schema is the simplest data warehouse architecture, it is most commonly used in
data warehouse implementations across the world today (in about 90-95% of cases).

Fact table
The fact table is not a typical relational database table as it is de-normalized on purpose - to enhance query response
times. The fact table typically contains records that are ready to explore, usually with ad hoc queries. Records in the fact
table are often referred to as events, due to the time-variant nature of a data warehouse environment.
The primary key for the fact table is a composite of all the columns except numeric values / scores (like QUANTITY,
TURNOVER, exact invoice date and time).

Typical fact tables in a global enterprise data warehouse are (usually there may be additional company or business
specific fact tables):

• sales fact table - contains all details regarding sales

• orders fact table - in some cases the table can be split into open orders and historical orders. Sometimes the values
for historical orders are stored in a sales fact table.
• budget fact table - usually grouped by month and loaded once at the end of a year.
• forecast fact table - usually grouped by month and loaded daily, weekly or monthly.
• inventory fact table - reports stocks, usually refreshed daily

Dimension table

Nearly all of the information in a typical fact table is also present in one or more dimension tables. The main purpose of
maintaining Dimension Tables is to allow browsing the categories quickly and easily.
The primary keys of each of the dimension tables are linked together to form the composite primary key of the fact table.
In a star schema design, there is only one de-normalized table for a given dimension.

Typical dimension tables in a data warehouse are:

• time dimension table

• customers dimension table
• products dimension table
• key account managers (KAM) dimension table
• sales office dimension table

Star schema example

An example of a star schema architecture is depicted below.

Star schema DW architecture
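The diagram itself is not reproduced here, but the idea can be sketched in SQL; the table and column names below are illustrative only, not taken from the original figure:

-- Dimension tables: de-normalized, one table per dimension, surrogate primary keys.
CREATE TABLE time_dim (
    time_key      INTEGER PRIMARY KEY,
    full_date     DATE,
    month_name    VARCHAR(20),
    calendar_year INTEGER
);

CREATE TABLE product_dim (
    product_key      INTEGER PRIMARY KEY,
    product_name     VARCHAR(100),
    product_category VARCHAR(50),
    brand_name       VARCHAR(50)
);

-- Fact table: composite primary key built from the dimension keys, plus numeric measures.
CREATE TABLE sales_fact (
    time_key    INTEGER NOT NULL REFERENCES time_dim (time_key),
    product_key INTEGER NOT NULL REFERENCES product_dim (product_key),
    quantity    INTEGER,
    turnover    DECIMAL(15,2),
    PRIMARY KEY (time_key, product_key)
);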

Snowflake Schema architecture


Snowflake schema architecture is a more complex variation of a star schema design. The main difference is that
dimensional tables in a snowflake schema are normalized, so they have a typical relational database design.

Snowflake schemas are generally used when a dimensional table becomes very big and when a star schema can’t
represent the complexity of a data structure. For example if a PRODUCT dimension table contains millions of rows, the
use of a snowflake schema should significantly improve performance by moving some data out to another table (with
BRANDS, for instance).

The problem is that the more normalized the dimension table is, the more complicated SQL joins must be issued to query
them. This is because in order for a query to be answered, many tables need to be joined and aggregates generated.
An example of a snowflake schema architecture is depicted below.

Snowflake schema DW architecture
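Again, the diagram is not shown here; a minimal SQL sketch of the PRODUCT/BRAND split mentioned above (names are illustrative, and this product dimension is the normalized replacement for the de-normalized version in the star-schema sketch) could look like this - note the extra join needed at query time:

-- Brand attributes are moved out to their own table instead of being repeated per product.
CREATE TABLE brand_dim (
    brand_key  INTEGER PRIMARY KEY,
    brand_name VARCHAR(50)
);

CREATE TABLE product_dim (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(100),
    brand_key    INTEGER NOT NULL REFERENCES brand_dim (brand_key)
);

-- Querying by brand now needs one more join than in the star schema.
SELECT b.brand_name, SUM(f.turnover) AS total_turnover
FROM   sales_fact f
       JOIN product_dim p ON f.product_key = p.product_key
       JOIN brand_dim   b ON p.brand_key   = b.brand_key
GROUP BY b.brand_name;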

Fact constellation schema architecture

For each star schema or snowflake schema it is possible to construct a fact constellation schema.
This schema is more complex than star or snowflake architecture, which is because it contains multiple fact tables. This
allows dimension tables to be shared amongst many fact tables.
That solution is very flexible, however it may be hard to manage and support.

The main disadvantage of the fact constellation schema is a more complicated design because many variants of
aggregation must be considered.

In a fact constellation schema, different fact tables are explicitly assigned to the dimensions which are relevant for the given
facts. This may be useful in cases when some facts are associated with a given dimension level and other facts with a
deeper dimension level.
Use of that model should be reasonable when for example, there is a sales fact table (with details down to the exact date
and invoice header id) and a fact table with sales forecast which is calculated based on month, client id and product id.
In that case using two different fact tables on a different level of grouping is realized through a fact constellation model.

An example of a constellation schema architecture is depicted below.


Fact constellation schema DW architecture
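As a rough sketch of the example above (names are illustrative only), the detailed sales fact and the coarser forecast fact share the same product dimension while being stored at different grains:

-- Detailed sales fact: one row per invoice line, keyed down to the exact date
-- (time_key and product_key point at the shared dimension tables).
CREATE TABLE sales_fact_detail (
    time_key    INTEGER NOT NULL,
    product_key INTEGER NOT NULL,
    invoice_id  INTEGER NOT NULL,
    turnover    DECIMAL(15,2),
    PRIMARY KEY (time_key, product_key, invoice_id)
);

-- Forecast fact: coarser grain - month, client and product only,
-- sharing the product dimension with the detailed fact table.
CREATE TABLE forecast_fact (
    year_month   INTEGER NOT NULL,   -- e.g. 200610
    client_key   INTEGER NOT NULL,
    product_key  INTEGER NOT NULL,
    forecast_qty INTEGER,
    PRIMARY KEY (year_month, client_key, product_key)
);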

Question
*************
I am getting the records from the source table and, after doing the lookup against the target, I have to insert the
new records and also the updated records into the target table.

If I use the Change Capture stage or the Difference stage then it gives only the updated records in the
output, but I have to maintain the history in the target table, i.e. I want the existing record and also the
updated record in the target table. I don't have flag columns in the target tables.

Is this possible in parallel jobs without using Transformer logic?

Answers
*************
One option is a Compare stage; as well as the comparison result you get each source row as a single-column subrecord (you can promote
these later with Promote Subrecord stages).

There are two ways to get and apply your change capture. You start with a Before set of data and an After set of
data. If you use the change capture stage alone it gives you new/changed/deleted/unchanged records. There is a
flag to turn each one of these on or off and an extra field is written out that indicates what type of change it is.

You can now either apply this to a target database table using insert/update/delete/load stages and transformers and
filters OR you can merge it with your Before set of data using the Change Apply stage.

The Change Apply will give you the new changes and the old history.

Questions
****************

Suppose I have to create a reject file with today's date and time stamp - what is the solution?

In the Job Properties, in the after-job subroutine, type the file names and concatenate the date and time as follows:

cat <filename1> <filename2> >> <filename>`date +"%Y%m%d_%H%M%S"`.txt

Question
*****************
Hi,

I have a problem using the Iconv and Oconv functions for date conversion in DataStage 8.0.1. The format
that I entered the date in was mm/dd/yyyy and the output desired was yyyy/mm/dd, but I got an output of yyyy
mm dd. How can I add the slash in my output?

And I would also like to know how I could convert the date to the form
yy MON dd.
The function that I used is as follows:

Oconv(Iconv(InputFieldName,"D DMY[2,2,4]"),"D YMD[4,2,2]")

Solution:
Code:
Oconv(Iconv(InputFieldName,"DMDY"),"D/YMD[4,2,2]")

For the second question, try


Code:
Oconv(Iconv(InputFieldName,"DMDY"),"D YMD[2,A3,2]")

Difference between FTP and SMTP?

FTP (File Transfer Protocol) is used to transfer files across a network, whereas SMTP (Simple Mail Transfer Protocol) is used to
send e-mail across the network (i.e. for mailing purposes). Both FTP and SMTP are supported by IIS.

Difference between SUBSTR and INSTR


Hi All,

The INSTR function finds the numeric starting position of a string within a string.

For example:

SELECT INSTR('Mississippi', 'i', 3, 3) test1,
       INSTR('Mississippi', 'i', 1, 3) test2,
       INSTR('Mississippi', 'i', -2, 3) test3
FROM dual;

Its output would be like this:

Test1 Test2 Test3
___________________________________________
   11     8     2

The SUBSTR function returns the part of the specified string defined by numeric character positions.

For example:

SELECT SUBSTR('The Three Musketeers', 1, 3) FROM dual;

will return 'The'.
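The two functions are often combined. As a small illustrative sketch (the employees table and its email column are assumed here, not taken from the original answer), this returns everything after the '@' of an e-mail address:

-- INSTR finds the position of '@'; SUBSTR cuts from the next character to the end.
SELECT email,
       SUBSTR(email, INSTR(email, '@') + 1) AS mail_domain
FROM   employees;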


What is the difference between a Filter and a Switch...
A Filter stage is used to filter the incoming data. Suppose you want the details of customer 20: if you give customer 20
as the constraint in the Filter stage it will output only the customer 20 records, and you can also add a reject link so that the rest of the records
go down the reject link. Whereas in the Switch stage
we need to define cases,
like case1, case2:
case1 = 10;
case2 = 20;
It will give the output of the customer 10 and customer 20 records.
The Switch stage will check the cases and execute them.

In the Filter stage we can give multiple conditions on multiple columns, but every time data comes from the source system the
filtered data is loaded into the target. Whereas in the Switch stage we can give multiple conditions on a single column, but the data comes
only once from the source, all the conditions are checked in the Switch stage and the data is loaded into the target.

What is the difference between Filter and External Filter Stage?

The Filter stage is used to pass records on based on some condition. We can specify multiple WHERE conditions and can
send rows to different output links.

The External Filter stage is used to execute a UNIX command as the filter. You give the command in the specified box with the
corresponding arguments.

cardinality

From an OLTP perspective, this refers to the number of rows in a table. From a data warehousing perspective, this
typically refers to the number of distinct values in a column. For most data warehouse DBAs, a more important issue is
the degree of cardinality.

degree of cardinality

The number of unique values of a column divided by the total number of rows in the table. This is particularly important
when deciding which indexes to build. You typically want to use bitmap indexes on low degree of cardinality columns and
B-tree indexes on high degree of cardinality columns. As a general rule, a cardinality of under 1% makes a good
candidate for a bitmap index.
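As a hedged, Oracle-flavoured illustration (the customers table and its columns are assumed here, not taken from the text), the degree of cardinality can be checked with a simple query and then used to choose the index type:

-- Degree of cardinality = number of distinct values / total number of rows.
SELECT COUNT(DISTINCT gender) / COUNT(*) AS degree_of_cardinality
FROM   customers;

-- Low degree of cardinality (well under 1%): a bitmap index is a good candidate.
CREATE BITMAP INDEX customers_gender_bix ON customers (gender);

-- High degree of cardinality: an ordinary B-tree index.
CREATE INDEX customers_phone_ix ON customers (phone_number);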

fact

Data, usually numeric and additive, that can be examined and analyzed. Examples include sales, cost, and profit. Fact
and measure are synonymous; fact is more commonly used with relational environments, measure is more commonly
used with multidimensional environments.

derived fact (or measure)

A fact (or measure) that is generated from existing data using a mathematical operation or a data transformation.
Examples include averages, totals, percentages, and differences.
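For example (table and column names here are illustrative only), a profit margin percentage is a derived measure computed from two stored facts:

-- Margin % derived from the stored turnover and cost measures.
SELECT product_key,
       SUM(turnover) AS total_sales,
       SUM(cost)     AS total_cost,
       ROUND((SUM(turnover) - SUM(cost)) / NULLIF(SUM(turnover), 0) * 100, 2) AS margin_pct
FROM   sales_fact
GROUP BY product_key;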

10 Ways to Make DataStage Run Slower

Everyone wants to tell you how to make your ETL jobs run faster, well here is how to make them slower!
The Structured Data blog has posted a list Top Ways How Not To Scale Your Data Warehouse that is a great chat about
bad ways to manage an Oracle Data Warehouse. It inspired me to find 10 ways to make DataStage jobs slower! How do
you put the brakes on a DataStage job that is supposed to be running on a massively scalable parallel architecture?

1. Use the same configuration file for all your jobs.

You may have two nodes configured for each CPU on your DataStage server and this allows your high volume jobs to run
quickly but this works great for slowing down your small volume jobs. A parallel job with a lot of nodes to partition across
is a bit like the solid wheel on a velodrome racing bike, they take a lot of time to crank up to full speed but once you are
there they are lightning fast. If you are processing a handful of rows the configuration file will instruct the job to partition
those rows across a lot of processes and then repartition them at the end. So a job that would take a second or less on a
single node can run for 5-10 seconds across a lot of nodes and a squadron of these jobs will slow down your entire
DataStage batch run!

2. Use a sparse database lookup on high volumes.

This is a great way to slow down any ETL tool, it works on server jobs or parallel jobs. The main difference is that server
jobs only do sparse database lookups - the only way to avoid a sparse lookup is to dump the table into a hash file. Parallel
jobs by default do cached lookups where the entire database table is moved into a lookup fileset either in memory or, if it's
too large, into scratch space on the disk. You can slow parallel jobs down by changing the lookup to a sparse lookup and
for every row processed it will send a lookup SQL statement to the database. So if you process 10 million rows you can
send 10 million SQL statements to the database! That will put the brakes on!

3. Keep resorting your data.

Sorting is the Achilles heel of just about any ETL tool, the average ETL job is like a busy restaurant, it makes a profit by
getting the diners in and out quickly and serving multiple seatings. If the restaurant fits 100 people it can feed several
hundred in a couple of hours by processing each diner quickly and getting them out the door. The sort stage is like having to
wait until every person who is going to eat at that restaurant that night has arrived and has been put in order of
height before anyone gets their food. You need to read every row before you can output your sort results. You can really
slow your DataStage parallel jobs down by putting in more than one sort, or giving a job data that is already sorted by the
SQL select statement but sorting it again anyway!

4. Design single threaded bottlenecks

This is really easy to do in server edition and harder (but possible) in parallel edition. Devise a step on the critical path of
your batch processing that takes a long time to finish and only uses a small part of the DataStage engine. Some good
bottlenecks: a large volume Server Job that hasn't been made parallel by multiple instance or interprocess functionality. A
script FTP of a file that keeps an entire DataStage Parallel engine waiting. A bulk database load via a single update
stream. Reading a large sequential file from a parallel job without using multiple readers per node.

5. Turn on debugging and forget that it's on

In a parallel job you can turn on a debugging setting that forces it to run in sequential mode, forever! Just turn it on to
debug a problem and then step outside the office and get run over by a tram. It will be years before anyone spots the
bottleneck!
6. Let the disks look after themselves

Never look at what is happening on your disk I/O - that's a Pandora's Box of better performance! You can get some
beautiful drag and slow down by ignoring your disk I/O as parallel jobs write a lot of temporary data and datasets to the
scratch space on each node and write out to large sequential files. Disk striping or partitioning or choosing the right disk
type or changing the location of your scratch space are all things that stand between you and slower job run times.

7. Keep Writing that Data to Disk

Staging of data can be a very good idea. It can give you a rollback point for failed jobs, it can give you a transformed
dataset that can be picked up and used by multiple jobs, it can give you a modular job design. It can also slow down
Parallel Jobs like no tomorrow - especially if you stage to sequential files! All that repartitioning to turn native parallel
datasets into a stupid ASCII metadata dumb file and then import and repartition to pick it up and process it again.
Sequential files are the Forrest Gump of file storage, simple and practical but dumb as all hell. It costs time to write to one
and time to read and parse them so designing an end to end process that writes data to sequential files repeatedly will
give you massive slow down times.

8. Validate every field

A lot of data comes from databases. Often DataStage pulls straight out of these databases or saves the data to an ASCII
file before being processed by DataStage. One way to slow down your job and slow down your ETL development and
testing is to validate and transform metadata even though you know there is nothing wrong with it. For example, validating
that a field is VARCHAR(20) using DataStage functions even though the database defines the source field as
VARCHAR(20). DataStage has implicit validation and conversion of all data imported that validates that it's the metadata
you say it is. You can then do explicit metadata conversion and validation on top of that. Some fields need explicit
metadata conversion - such as numbers in VARCHAR fields and dates in string fields and packed fields, but most don't.
Adding a layer of validation you don't need should slow those jobs down.

9. Write extra steps in database code

The same phrase gets uttered on many an ETL project. "I can write that in SQL", or "I can write that in Java", or "I can do
that in an Awk script". Yes, we know, we know that just about any programming language can do just about anything - but
leaving a complex set of steps as a prequel or sequel to an ETL job is like leaving a turd on someone's doorstep. You'll be
long gone when someone comes to clean it up. This is a sure fire way to end up with a step in the end to end integration
that is not scalable, is poorly documented, cannot be easily modified and slows everything down. If someone starts saying
"I can write that in..." just say "okay, if you sign a binding contract to support it for every day that you have left on this
earth".

10. Don't do Performance Testing

Do not take your highest volume jobs into performance testing, just keep the default settings, default partitioning and your
first draft design and throw that into production and get the hell out of there.

How to check Datastage internal error descriptions

To check the description of a number go to the datastage shell (from administrator or telnet to the server machine) and
invoke the following command:
• SELECT * FROM SYS.MESSAGE WHERE @ID='081021'; - where in that case the number 081021 is an error
number

The command will produce a brief error description which probably will not be helpful in resolving an issue but can be a
good starting point for further analysis.

How to stop a job when its status is running?

To stop a running job go to DataStage Director and click the stop button (or Job -> Stop from the menu). If it doesn't help, go
to Job -> Cleanup Resources, select a process which holds a lock and click Logout.

If it still doesn't help go to the datastage shell and invoke the following command: ds.tools
It will open an administration panel. Go to 4.Administer processes/locks , then try invoking one of the clear locks
commands (options 7-10).

How to run and schedule a job from command line?

To run a job from command line use a dsjob command

Command Syntax: dsjob [-file <file> <server> | [-server <server>][-user <user>][-password <password>]] <command> [<arguments>]

The command can be placed in a batch file and run in a system scheduler.

Is it possible to run two versions of datastage on the same pc?

Yes, even though different versions of Datastage use different system dll libraries.
To dynamically switch between Datastage versions install and run DataStage Multi-Client Manager. That application can
unregister and register system libraries used by Datastage.

How to release a lock held by jobs?

Go to the data stage shell and invoke the following command: ds.tools
It will open an administration panel. Go to 4.Administer processes/locks , then try invoking one of the clear locks
commands (options 7-10).

What is a command to analyze hashed file?

There are two ways to analyze a hashed file. Both should be invoked from the datastage command shell. These are:
• FILE.STAT command
• ANALYZE.FILE command
What is the difference between the Logging Text and the Final Warning Text in the Terminator stage?

Every stage has a 'Logging Text' area on its General tab which logs an informational message when the stage is
triggered or started.
Informational - a green line, a DSLogInfo() type message.
The Final Warning Text - the red fatal message, which is included in the sequence abort message.

How to invoke an Oracle PLSQL stored procedure from a server job

To run a pl/sql procedure from Datastage a Stored Procedure (STP) stage can be used.
However it needs a flow of at least one record to run.

It can be designed in the following way:

• source odbc stage which fetches one record from the database and maps it to one column - for example: select
sysdate from dual
• A transformer which passes that record through. If required, add the pl/sql procedure parameters as columns on the
right-hand side of the transformer's mapping
• Put Stored Procedure (STP) stage as a destination. Fill in connection parameters, type in the procedure name
and select Transform as procedure type. In the input tab select 'execute procedure for each row' (it will be run
once).
Design of a DataStage server job with Oracle plsql procedure call

Datastage routine to open a text file with error catching

Note: work_dir and file1 are parameters passed to the routine.
* open file1
OPENSEQ work_dir : '\' : file1 TO H.FILE1 THEN
CALL DSLogInfo("******************** File " : file1 : " opened successfully", "JobControl")
END ELSE
CALL DSLogInfo("Unable to open file", "JobControl")
ABORT
END

Datastage routine which reads the first line from a text file
Note: work_dir and file1 are parameters passed to the routine.
* open file1
OPENSEQ work_dir : '\' : file1 TO H.FILE1 THEN
CALL DSLogInfo("******************** File " : file1 : " opened successfully", "JobControl")
END ELSE
CALL DSLogInfo("Unable to open file", "JobControl")
ABORT
END

READSEQ FILE1.RECORD FROM H.FILE1 ELSE


Call DSLogWarn("******************** File is empty", "JobControl")
END

firstline = Trim(FILE1.RECORD[1,32]," ","A") ;* reads the first 32 characters


Call DSLogInfo("******************** Record read: " : firstline, "JobControl")
CLOSESEQ H.FILE1

How to adjust commit interval when loading data to the database?


In earlier versions of datastage the commit interval could be set up in:
General -> Transaction size (in version 7.x it's obsolete)

Starting from Datastage 7.x it can be set up in properties of ODBC or ORACLE stage in Transaction handling -> Rows per
transaction.
If set to 0 the commit will be issued at the end of a successful transaction.

Database update actions in ORACLE stage


The destination table can be updated using various Update actions in Oracle stage. Be aware of the fact that it's crucial to
select the key columns properly as it will determine which column will appear in the WHERE part of the SQL statement.
Available actions:

• Clear the table then insert rows - deletes the contents of the table (DELETE statement) and adds new rows
(INSERT).
• Truncate the table then insert rows - deletes the contents of the table (TRUNCATE statement) and adds new
rows (INSERT).
• Insert rows without clearing - only adds new rows (INSERT statement).
• Delete existing rows only - deletes matched rows (issues only the DELETE statement).
• Replace existing rows completely - deletes the existing rows (DELETE statement), then adds new rows
(INSERT).
• Update existing rows only - updates existing rows (UPDATE statement).
• Update existing rows or insert new rows - updates existing data rows (UPDATE) or adds new rows (INSERT).
An UPDATE is issued first and if it succeeds the INSERT is omitted.
• Insert new rows or update existing rows - adds new rows (INSERT) or updates existing rows (UPDATE). An
INSERT is issued first and if it succeeds the UPDATE is omitted.
• User-defined SQL - the data is written using a user-defined SQL statement.
• User-defined SQL file - the data is written using a user-defined SQL statement from a file.
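For instance, with CUSTOMER_ID selected as the key column, the 'Update existing rows or insert new rows' action generates statements roughly along these lines (the table and column names are illustrative, and the ? marks are bind placeholders filled from the link columns):

-- Key columns drive the WHERE clause of the UPDATE; non-key columns are SET.
UPDATE customers
SET    customer_name = ?, customer_city = ?
WHERE  customer_id = ?;

-- If the UPDATE matches no row, the row is inserted instead.
INSERT INTO customers (customer_id, customer_name, customer_city)
VALUES (?, ?, ?);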

Use and examples of ICONV and OCONV functions?

ICONV and OCONV functions are quite often used to handle data in Datastage.
ICONV converts a string to an internal storage format and OCONV converts an expression to an output format.
Syntax:
Iconv (string, conversion code)
Oconv(expression, conversion )

Some useful iconv and oconv examples:


Iconv("10/14/06", "D2/") = 14167
Oconv(14167, "D-E") = "14-10-2006"
Oconv(14167, "D DMY[,A,]") = "14 OCTOBER 2006"
Oconv(12003005, "MD2$,") = "$120,030.05"

That expression formats a number and rounds it to 2 decimal places:


Oconv(L01.TURNOVER_VALUE*100,"MD2")

Iconv and oconv can be combined in one expression to reformat date format easily:
Oconv(Iconv("10/14/06", "D2/"),"D-E") = "14-10-2006"

Can Datastage use Excel files as a data input?


Microsoft Excel spreadsheets can be used as a data input in Datastage. Basically there are two possible
approaches available:

• Access the Excel file via ODBC - this approach requires creating an ODBC connection to the Excel file on the
Datastage server machine and using an ODBC stage in Datastage. The main disadvantage is that it is impossible
to do this on a Unix machine. On Datastage servers operating in Windows it can be set up here:
Control Panel -> Administrative Tools -> Data Sources (ODBC) -> User DSN -> Add -> Driver do Microsoft
Excel (.xls) -> Provide a Data source name -> Select the workbook -> OK

• Save the Excel file as CSV - save the data from the Excel spreadsheet to a CSV text file and use a Sequential File stage
in Datastage to read the data.

Error timeout waiting for mutex

The error message usually looks like follows:


... ds_ipcgetnext() - timeout waiting for mutex

There may be several reasons for the error and thus solutions to get rid of it.
The error usually appears when using Link Collector, Link Partitioner and Interprocess (IPC) stages. It may also
appear when doing a lookup with the use of a hash file or if a job is very complex, with the use of many
transformers.

There are a few things to consider to work around the problem:


- increase the buffer size (up to 1024K) and the Timeout value in the Job properties (on the Performance
tab).
- ensure that the key columns in active stages or hashed files are composed of allowed characters – get rid of
nulls and try to avoid language specific chars which may cause the problem.
- try to simplify the job as much as possible (especially if it’s very complex). Consider splitting it into two or
three smaller jobs, review fetches and lookups and try to optimize them (especially have a look at the SQL
statements).
