Ascential DataStage: Parallel Job Developer's Guide
Version 6.0
September 2002
Part No. 00D-023DS60
Table of Contents
Preface
Documentation Conventions .................................................................................... xix
User Interface Conventions ................................................................................ xxi
DataStage Documentation ......................................................................................... xxi
Chapter 1. Introduction
DataStage Parallel Jobs ............................................................................................... 1-1
Appendix A. Schemas
Schema Format ........................................................................................................... A-1
Date Columns ...................................................................................................... A-3
Decimal Columns ............................................................................................... A-3
Floating-Point Columns ..................................................................................... A-4
Integer Columns .................................................................................................. A-4
Raw Columns ...................................................................................................... A-4
String Columns ................................................................................................... A-5
Time Columns ..................................................................................................... A-5
Timestamp Columns .......................................................................................... A-5
Vectors ...................................................................................................................A-6
Subrecords ............................................................................................................A-6
Tagged Columns ..................................................................................................A-8
Partial Schemas ...........................................................................................................A-9
Appendix B. Functions
Date and Time Functions ........................................................................................... B-1
Logical Functions ........................................................................................................ B-4
Mathematical Functions ............................................................................................ B-4
Null Handling Functions .......................................................................................... B-6
Number Functions ..................................................................................................... B-7
Raw Functions ............................................................................................................ B-8
String Functions .......................................................................................................... B-8
Type Conversion Functions .................................................................................... B-10
Utility Functions ....................................................................................................... B-12
Index
Preface
This manual describes the features of the DataStage Manager and
DataStage Designer. It is intended for application developers and
system administrators who want to use DataStage to design and
develop data warehousing applications using parallel jobs.
If you are new to DataStage, you should read the DataStage Designer
Guide and the DataStage Manager Guide. These provide general
descriptions of the DataStage Manager and DataStage Designer, and
give you enough information to get you up and running.
This manual contains more specific information and is intended to be
used as a reference guide. It gives detailed information about parallel
job design and stage editors.
Documentation Conventions
This manual uses the following conventions:
Bold, UPPERCASE, Italic, Plain, Courier, and Courier Bold text styles distinguish interface items, keywords, variable information, and code examples. In syntax, square brackets [ ] enclose optional items, braces { } enclose items from which you must make a choice, a vertical bar (itemA | itemB) separates mutually exclusive alternatives, and an ellipsis (...) indicates that the preceding item can be repeated.
A right arrow between menu commands indicates you should choose each command in sequence. For example, Choose File ➤ Exit means you should choose File from the menu bar, then choose Exit from the File pull-down menu.
In code examples, a continuation character at the end of a line (shown as "This line ➥ continues") indicates that the line continues onto the following line.
User Interface Conventions
[Figure: an annotated DataStage dialog box identifying the General tab, a Browse button, a field, a check box, an option button, and a command button.]
DataStage Documentation
DataStage documentation includes the following:
DataStage Parallel Job Developer's Guide: This guide describes the tools that are used in building a parallel job, and it supplies programmer's reference information.
Chapter 1. Introduction
This chapter gives an overview of parallel jobs. Parallel jobs are compiled and run on the DataStage server. Such jobs connect to a data source, extract and transform data, and write it to a data warehouse.
DataStage also supports server jobs and mainframe jobs. Server jobs are also compiled and run on the server. These are for use on non-parallel systems and SMP systems with up to 64 processors. Server jobs are described in DataStage Server Job Developer's Guide. Mainframe jobs are available if you have XE/390 installed. These are loaded onto a mainframe and compiled and run there. Mainframe jobs are described in XE/390 Job Developer's Guide.
[Figure: a simple job in which a Data Source stage is linked through a Transformer stage to a Data Warehouse stage.]
The links between the stages represent the flow of data into or out of a stage.
You must specify the data you want at each stage, and how it is handled.
For example, do you want all the columns in the source data, or only a
select few? Should the data be aggregated or converted before being
passed on to the next stage?
General information on how to construct your job and define the required
meta data using the DataStage Designer and the DataStage Manager is in
the DataStage Designer Guide and DataStage Manager Guide. Chapter 4 onwards of this manual describes the individual stage editors that you may use when developing parallel jobs.
Chapter 2. Designing Parallel Extender Jobs
The DataStage Parallel Extender brings the power of parallel processing to
your data extraction and transformation applications.
This chapter gives a basic introduction to parallel processing, and
describes some of the key concepts in designing parallel jobs for
DataStage. If you are new to DataStage, you should read the introductory
chapters of the DataStage Designer Guide first so that you are familiar
with the DataStage Designer interface and the way jobs are built from
stages and links.
Parallel Processing
There are two basic types of parallel processing: pipeline and partitioning.
DataStage allows you to use both of these methods. The following sections
illustrate these methods using a simple DataStage job which extracts data
from a data source, transforms it in some way, then writes it to another
data source. In all cases this job would appear the same on your Designer canvas, but you can configure it to behave in different ways (which are shown diagrammatically).
Pipeline Parallelism
If you implemented the example job using the parallel extender and ran it
sequentially, each stage would process a single row of data then pass it to
the next process, which would run and process this row then pass it on,
etc. If you ran it in parallel, on a system with at least three processing nodes, the stage reading the data would start on one node and begin filling a pipeline with the data it had read. The transformer stage would start running on another node as soon as there was data in the pipeline, process it, and start filling another pipeline. The stage writing the transformed data to the target database would similarly start writing as soon as there was data available. Thus all three stages are operating simultaneously.
[Figure: two timelines comparing the total time taken when the three stages run one after another with the shorter time taken when they run as a pipeline.]
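To make the idea concrete outside of DataStage, the following is a minimal Python sketch of pipeline parallelism: three stages run simultaneously, connected by queues that stand in for the pipelines. The stage logic and the row source are invented for the example; this is an analogy, not DataStage code.

    import queue
    import threading

    def reader(out_q):
        # Extract: start filling the first pipeline one row at a time.
        for row in range(10):           # stands in for rows read from a source
            out_q.put(row)
        out_q.put(None)                 # end-of-data marker

    def transformer(in_q, out_q):
        # Transform: process rows as soon as they appear in the pipeline.
        while True:
            row = in_q.get()
            if row is None:
                break
            out_q.put(row * 2)          # a trivial stand-in transformation
        out_q.put(None)

    def writer(in_q):
        # Load: write rows while the upstream stages are still running.
        while True:
            row = in_q.get()
            if row is None:
                break
            print("wrote", row)

    q1, q2 = queue.Queue(), queue.Queue()
    stages = [threading.Thread(target=reader, args=(q1,)),
              threading.Thread(target=transformer, args=(q1, q2)),
              threading.Thread(target=writer, args=(q2,))]
    for s in stages:
        s.start()
    for s in stages:
        s.join()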
Partition Parallelism
Imagine you have the same simple job as described above, but that it is
handling very large quantities of data. In this scenario you could use the
power of parallel processing to your best advantage by partitioning the
data into a number of separate sets, with each partition being handled by
a separate processing node.
Using partition parallelism the same job would effectively be run simultaneously by several processing nodes, each handling a separate subset of
the total data.
At the end of the job the data partitions can be collected back together
again and written to a single data source.
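As a similarly hedged Python analogy, partition parallelism amounts to splitting the rows into subsets, processing each subset in its own worker, and collecting the results back together. The transform function and row values are invented for the example.

    import multiprocessing as mp

    def transform(partition):
        # Each worker stands in for a processing node handling one partition.
        return [row * 2 for row in partition]

    if __name__ == "__main__":
        data = list(range(100))
        nparts = 4
        # Deal the rows into four roughly equal partitions.
        partitions = [data[i::nparts] for i in range(nparts)]
        with mp.Pool(nparts) as pool:
            results = pool.map(transform, partitions)
        # Collect the partitions back together into a single data set.
        collected = [row for part in results for row in part]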
Partitioning
In the simplest scenario you probably won't be bothered about how your data is partitioned. It is enough that it is partitioned and that the job runs faster. In these circumstances you can safely delegate responsibility for partitioning to DataStage. Once you have identified where you want to
partition data, DataStage will work out the best method for doing it and
implement it.
The aim of most partitioning operations is to end up with a set of partitions
that are as near equal size as possible, ensuring an even load across your
processing nodes.
When performing some operations, however, you will need to take control
of partitioning to ensure that you get consistent results. A good example
of this would be where you are using an aggregator stage to summarize
your data. To get the answers you want (and need) you must ensure that
related data is grouped together in the same partition before the summary
operation is performed on that partition. DataStage lets you do this.
There are a number of different partitioning methods available:
Round robin. The first record goes to the first processing node, the second
to the second processing node, and so on. When DataStage reaches the last
processing node in the system, it starts over. This method is useful for
resizing partitions of an input data set that are not equal in size. The round
robin method always creates approximately equal-sized partitions.
Random. Records are randomly distributed across all processing nodes. Like round robin, random partitioning can rebalance the partitions of an input data set to guarantee that each processing node receives an approximately equal-sized partition. Random partitioning has a slightly higher overhead than round robin because of the extra processing required to calculate a random value for each record.
Same. The operator using the data set as input performs no repartitioning and takes as input the partitions output by the preceding stage. With this partitioning method, records stay on the same processing node; that is, they are not redistributed. Same is the fastest partitioning method.
Entire. Every instance of a stage on every processing node receives the
complete data set as input. It is useful when you want the benefits of
parallel execution, but every instance of the operator needs access to the
entire input data set. You are most likely to use this partitioning method
with stages that create lookup tables from their input.
Hash by field. Partitioning is based on a function of one or more columns
(the hash partitioning keys) in each record. This method is useful for
ensuring that related records are in the same partition. It does not necessarily result in an even distribution of data between partitions.
Modulus. Partitioning is based on a key column modulo the number of
partitions. This method is similar to hash by field, but involves simpler
computation.
Range. Divides a data set into approximately equal-sized partitions, each
of which contains records with key columns within a specified range. This
method is also useful for ensuring that related records are in the same
partition.
DB2. Partitions an input data set in the same way that DB2 would partition it. For example, if you use this method to partition an input data set
containing update information for an existing DB2 table, records are
assigned to the processing node containing the corresponding DB2 record.
Then, during the execution of the parallel operator, both the input record
and the DB2 table record are local to the processing node. Any reads and
writes of the DB2 table would entail no network activity.
The most common method you will see on the DataStage stages is Auto.
This just means that you are leaving it to DataStage to determine the best
partitioning method to use depending on the type of stage, and what the
previous stage in the job has done.
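To make the distinction between the methods concrete, here is a minimal Python sketch of how round robin, hash-by-field, and modulus partitioners might assign records to partitions. It illustrates the general techniques only, not DataStage's implementation; the record layout and key column are invented for the example.

    def round_robin(records, nparts):
        # Record i goes to partition i modulo the number of partitions.
        parts = [[] for _ in range(nparts)]
        for i, rec in enumerate(records):
            parts[i % nparts].append(rec)
        return parts

    def hash_by_field(records, nparts, key):
        # Equal keys always land in the same partition, but partition
        # sizes are not guaranteed to be even.
        parts = [[] for _ in range(nparts)]
        for rec in records:
            parts[hash(rec[key]) % nparts].append(rec)
        return parts

    def modulus(records, nparts, key):
        # Like hash by field, but uses the numeric key value directly.
        parts = [[] for _ in range(nparts)]
        for rec in records:
            parts[rec[key] % nparts].append(rec)
        return parts

    records = [{"acct": n, "amount": n * 1.5} for n in range(10)]
    print(hash_by_field(records, 3, "acct"))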
Collecting
Collecting is the process of joining your partitions back together again into
a single data set. There are various situations where you may want to do
this. There may be a stage in your job that you want to run sequentially
rather than in parallel, in which case you will need to collect all your partitioned data at this stage to make sure it is operating on the whole data set.
Similarly, at the end of a job, you may want to write all your data to a single
database, in which case you need to collect it before you write it.
There may be other cases where you don't want to collect the data at all.
For example, you may want to write each partition to a separate flat file.
Just as for partitioning, in many situations you can leave DataStage to
work out the best collecting method to use. There are situations, however,
where you will want to explicitly specify the collection method. The
following methods are available:
Round robin. Read a record from the first input partition, then from the
second partition, and so on. After reaching the last partition, start over.
After reaching the final record in any partition, skip that partition in the
remaining rounds.
Ordered. Read all records from the first partition, then all records from the
second partition, and so on. This collection method preserves the order of
totally sorted input data sets. In a totally sorted data set, both the records
in each partition and the partitions themselves are ordered.
Sorted merge. Read records in an order based on one or more columns of
the record. The columns used to define record order are called collecting
keys.
The most common method you will see on the DataStage stages is Auto.
This just means that you are leaving it to DataStage to determine the best
collecting method to use depending on the type of stage, and what the
previous stage in the job has done.
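The collection methods can be sketched the same way. In this hedged Python analogy, heapq.merge stands in for the sorted-merge collector; the partitions and key are invented for the example.

    import heapq

    def round_robin_collect(partitions):
        # One record from each partition in turn, skipping exhausted ones.
        parts = [list(p) for p in partitions]
        out, i = [], 0
        while any(parts):
            if parts[i % len(parts)]:
                out.append(parts[i % len(parts)].pop(0))
            i += 1
        return out

    def ordered_collect(partitions):
        # All of partition 0, then all of partition 1, and so on.
        return [rec for part in partitions for rec in part]

    def sorted_merge_collect(partitions, key):
        # Merge partitions that are already sorted on the collecting key.
        return list(heapq.merge(*partitions, key=key))

    parts = [[1, 4, 7], [2, 5, 8], [3, 6]]
    print(round_robin_collect(parts))               # [1, 2, 3, 4, 5, 6, 7, 8]
    print(sorted_merge_collect(parts, key=lambda r: r))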
Partitioning Icons
Each parallel stage in a job can partition or repartition incoming data before it operates on it. Equally it can just accept the partitions in which the data arrives. There is an icon on the input link to a stage which shows how the stage handles partitioning.
In most cases, if you just lay down a series of parallel stages in a DataStage
job and join them together, the auto method will determine partitioning.
This is shown on the canvas by the auto partitioning icon:
If you specifically select a partitioning method for a stage, rather than just
leaving it to default to Auto, the following icon is shown:
You can specify that you want to accept the existing data partitions by
choosing a partitioning method of same. This is shown by the following
icon on the input link:
Partitioning methods are set on the Partitioning tab of the Inputs pages on
a stage editor (see page 3-11).
Collecting Icons
A stage in the job which is set to run sequentially will need to collect partitioned data before it operates on it. There is an icon on the input link to a
stage which shows that it is collecting data:
Meta Data
Meta data is information about data. It describes the data flowing through
your job in terms of column definitions, which describe each of the fields
making up a data record.
DataStage has two alternative ways of handling meta data, through Table
definitions, or through Schema files. By default, parallel stages derive their
meta data from the columns defined on the Outputs or Inputs page
Column tab of your stage editor. Additional formatting information is
supplied, where needed, by a Formats tab on the Outputs or Inputs page.
You can also specify that the stage uses a schema file instead by explicitly setting a property on the stage editor and specifying the name and location of the schema file.
Table Definitions
A Table Definition is a set of related column definitions that are stored in
the DataStage Repository. These can be loaded into stages as and when
required.
You can import a table definition from a data source via the DataStage
Manager or Designer. You can also edit and define new Table Definitions
in the Manager or Designer (see DataStage Manager Guide and DataStage
Designer Guide). If you want, you can edit individual column definitions
once you have loaded them into your stage.
You can also simply type in your own column definition from scratch on
the Outputs or Inputs page Column tab of your stage editor (see
page 3-16 and page 3-28). When you have entered a set of column definitions you can save them as a new Table definition in the Repository for
subsequent reuse in another job.
Data Types
When you work with parallel job column definitions, you will see that
they have an SQL type associated with them. This maps onto an underlying data type which you use when specifying a schema via a file, and
which you can view in the Parallel tab of the Edit Column Meta Data
dialog box (see page 3-16 for details). The following table summarizes the
underlying data types that column definitions can have:
SQL Type                                Underlying Data Type   Size
Date                                    date                   4 bytes
Decimal, Numeric                        decimal                (Roundup(p)+1)/2 bytes
Float, Real                             sfloat                 4 bytes
Double                                  dfloat                 8 bytes
TinyInt                                 int8 or uint8          1 byte
SmallInt                                int16 or uint16        2 bytes
Integer                                 int32 or uint32        4 bytes
BigInt                                  int64 or uint64        8 bytes
Binary, Bit, LongVarBinary, VarBinary   raw                    1 byte per character
Unknown, Char, LongNVarChar,            string                 1 byte per character
  LongVarChar, NChar, NVarChar,
  VarChar
Char                                    subrec                 sum of lengths of subrecord fields
                                                               (a complex data type comprising nested columns)
Char                                    tagged                 sum of lengths of subrecord fields
                                                               (a complex data type comprising tagged columns)
Time                                    time                   5 bytes
Timestamp                               timestamp              9 bytes
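As a hedged illustration of how the underlying data types appear when you supply meta data through a schema file (the schema format is described in Appendix A), a record definition might look like the following. The column names are invented for the example:

    record (
      orderID:int32;
      customer:string[30];
      price:decimal[8,2];
      ordered:date;
      shipped:timestamp;
    )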
Subrecords
A subrecord is a nested data structure. The column with type subrecord
does not itself define any storage, but the columns it contains do. These
columns can have any data type, and you can nest subrecords one within
another. The LEVEL property is used to specify the structure of subrecords.
Tagged Subrecord
This is a special type of subrecord structure: it comprises a number of columns of different types, and the actual column is ONE of these, as indicated by the value of a tag at run time. The columns can be of any type except subrecord or tagged. The following diagram illustrates a tagged subrecord.
[Figure: a tagged subrecord. Parent (tagged) contains Child1 (string), Child2 (int8), and Child3 (raw). When Tag = Child1, the column has a data type of string.]
Vector
A vector is a one-dimensional array of any type except tagged. All the elements of a vector are of the same type, and are numbered from 0. The vector can be of fixed or variable length. For fixed-length vectors the length is explicitly stated; for variable-length ones a property defines a link field which gives the length at run time. The following diagram illustrates a vector of fixed length and one of variable length.
[Figure: a fixed-length vector of int32 elements with an explicitly stated length, and a variable-length vector of int32 elements numbered 0 through N, where a link field supplies N at run time.]
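Continuing the hedged schema illustration (see Appendix A for the full format), a fixed-length vector, a subrecord, and a tagged column might be declared as follows; the column names are again invented:

    record (
      readings[8]:int32;
      address:subrec (
        street:string[64];
        city:string[30];
      );
      payment:tagged (
        card:string[16];
        check:int32;
      );
    )

Here readings is a fixed-length vector of eight int32 elements, address is a nested structure, and payment holds whichever one of its branches the tag selects at run time.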
Chapter 3. Stage Editors
The Parallel job stage editors all use a generic user interface (with the
exception of the Transformer stage and Shared Container stages). This
chapter describes the generic editor and gives a guide to using it.
Parallel jobs have a large number of stages available. You can remove the ones you don't intend to use regularly using the View ➤ Customize Palette feature.
The stage editors are divided into the following basic types:
Active. These are stages that perform some processing on the data
that is passing through them. Examples of active stages are the
Aggregator and Sort stages.
File. These are stages that read or write data contained in a file or
set of files. Examples of file stages are the Sequential File and Data
Set stages.
Database. These are stages that read or write data contained in a
database. Examples of database stages are the Oracle and DB2
stages.
All of the stage types use the same basic stage editor, but the pages that
actually appear when you edit the stage depend on the exact type of stage
you are editing. The following sections describe all the page types and sub
tabs that are available. The individual descriptions of stage editors in the
following chapters tell you exactly which features of the generic editor
each stage type uses.
General Tab
All stage editors have a General tab. This allows you to enter an optional
description of the stage. Specifying a description here enhances job
maintainability.
Properties Tab
A Properties tab appears on the Stage page where there are general properties that need setting for the particular stage you are editing. Properties tabs can also occur under Input and Output pages where there are link-specific properties that need to be set.
All the properties for active stages are set under the Stage page.
The available properties are displayed in a tree structure. They are divided
into categories to help you find your way around them. All the mandatory
properties are included in the tree by default and cannot be removed.
Properties that you must set a value for (i.e., which do not have a default value) are shown in the warning color (red by default), but change to black when you have set a value. You can change the warning color by opening the Options dialog box (select Tools ➤ Options from the DataStage Designer main menu) and choosing the Transformer item from the tree. Reset the Invalid column color by clicking on the color bar and choosing a new color from the palette.
To set a property, select it in the list and specify the required property value
in the property value field. The title of this field and the method for
entering a value changes according to the property you have selected. In
the example above, the Key property is selected so the Property Value field
is called Key and you set its value by choosing one of the available input
columns from a drop down list. Key is shown in red because you must
select a key for the stage to work properly. The Information field contains
details about the property you currently have selected in the tree. Where
you can browse for a property value, or insert a job parameter whose value
is provided at run time, a right arrow appears next to the field. Click on
this and a menu gives access to the Browse Files dialog box and/or a list
of available job parameters (job parameters are defined in the Job Properties dialog box - see DataStage Designer Guide).
Some properties have default values, and you can always return to the
default by selecting it in the tree and choosing Set to default from the
shortcut menu.
Some properties are optional. These appear in the Available properties to add field. Click on an optional property to add it to the tree, or choose to add it from the shortcut menu. You can remove it again by selecting it in the tree and selecting Remove from the shortcut menu.
Some properties can be repeated. In the example above you can add
multiple key properties. The Key property appears in the Available properties to add list when you select the tree top level Properties node. Click
on the Key item to add multiple key properties to the tree.
Some properties have dependents. These are properties which somehow
relate to or modify the parent property. They appear under the parent in a
tree structure.
For some properties you can supply a job parameter as their value. At runtime the value of this parameter will be used for the property. Such properties are identified by an arrow next to their Property Value box (as shown for the example Sort stage Key property above). Click the arrow to get a list of currently defined job parameters to choose from (see DataStage Designer Guide for information about job parameters).
You can switch to a multiline editor for entering property values for some
properties. Do this by clicking on the arrow next to their Property Value
box and choosing Switch to multiline editor from the menu.
The property capabilities are indicated by different icons in the tree as
follows:
non-repeating property with no dependents
non-repeating property with dependents
repeating property with no dependents
repeating property with dependents
The properties for individual stage types are described in the chapter
about the stage.
Advanced Tab
All stage editors have an Advanced tab. This allows you to:
Specify the execution mode of the stage. This allows you to choose
between Parallel and Sequential operation. If the execution mode
for a particular type of stage cannot be changed, then this drop
down list is disabled. Selecting Sequential operation forces the
stage to be executed on a single node. If you have intermixed
sequential and parallel stages this has implications for partitioning
and collecting data between the stages. You can also let DataStage
decide by choosing the default setting for the stage (the drop down
list tells you whether this is parallel or sequential).
Set or clear the preserve partitioning flag. This indicates whether
the stage wants to preserve partitioning at the next stage of the job.
You choose between Set, Clear and Propagate. For some stage
types, Propagate is not available. The operation of each option is as
follows:
Set. Sets the preserve partitioning flag. This indicates to the next stage in the job that it should preserve existing partitioning if possible.
Clear. Clears the preserve partitioning flag. Indicates that this
stage does not care which partitioning method the next stage
uses.
Propagate. Sets the flag to Set or Clear depending on what the
previous stage in the job has set (or if that is set to Propagate the
stage before that and so on until a preserve partitioning flag
setting is encountered).
You can also let DataStage decide by choosing the default setting
for the stage (the drop down list tells you whether this is set, clear,
or propagate).
Specify node map or node pool or resource pool constraints. This
enables you to limit where the stage can be executed as follows:
Node pool and resource constraints. This allows you to specify
constraints in a grid. Select Node pool or Resource pool from the
Constraint drop-down list. Select a Type for a resource pool and,
finally, select the name of the pool you are limiting execution to.
You can select multiple node or resource pools.
Node map constraints. Select the option box and type in the nodes to which execution will be limited in the text box. You can also browse through the available nodes to add to the text box. Using this feature conceptually sets up an additional node pool which doesn't appear in the configuration file.
The lists of available nodes, available node pools, and available
resource pools are derived from the configuration file.
The Data Set stage only allows you to select disk pool constraints.
Link Ordering Tab
The tab allows you to order input links and/or output links as needed. Where link ordering is not important or is not possible the tab does not appear.
The link label gives further information about the links being ordered. In
the example we are looking at the Link Ordering tab for a Join stage. The
join operates in terms of having a left link and a right link, and this tab tells
you which actual link the stage regards as being left and which right. If
you use the arrow keys to change the link order, the link name changes but
not the link label. In our example, if you pressed the down arrow button,
DSLink27 would become the left link, and DSLink26 the right.
A Join stage can only have one output link, so in the example the Order
the following output links section is disabled.
The following example shows the Link Ordering tab from a Merge stage.
In this case you can order both input links and output links. The Merge
stage handles reject links as well as a stream link and the tab allows you to
order these, although you cannot move them to the stream link position.
Again the link labels give the sense of how the links are being used.
Inputs Page
The Inputs page gives information about links going into a stage. In the
case of a file or database stage an input link carries data being written to
the file or database. In the case of an active stage it carries data that the
stage will process before outputting to another stage. Where there are no
input links the stage editor has no Inputs page.
Where it is present, the Inputs page contains various tabs depending on
stage type. The only field the Inputs page itself contains is Input name,
which gives the name of the link being edited. Where a stage has more
than one input link, you can select the link you are editing from the Input
name drop-down list.
The Inputs page also has a Columns button. Click this to open a window showing column names from the meta data defined for this link. You can drag these columns to various fields in the Inputs page tabs as required.
Certain stage types will also have a View Data button. Press this to view
the actual data associated with the specified data source or data target. The
button is available if you have defined meta data for the link.
General Tab
The Inputs page always has a General tab. This allows you to enter an
optional description of the link. Specifying a description for each link
enhances job maintainability.
Properties Tab
Some types of file and database stages can have properties that are particular to specific input links. In this case the Inputs page has a Properties tab. This has the same format as the Stage page Properties tab (see Properties Tab on page 3-2).
Partitioning Tab
Most parallel stages have a default partitioning or collecting method associated with them. This is used depending on the execution mode of the
stage (i.e., parallel or sequential), whether Preserve Partitioning on the
Stage page Advanced tab is Set, Clear, or Propagate, and the execution
mode of the immediately preceding stage in the job. For example, if the
preceding stage is processing data sequentially and the current stage is
processing in parallel, the data will be partitioned as it enters the
current stage. Conversely if the preceding stage is processing data in
parallel and the current stage is sequential, the data will be collected as it
enters the current stage.
You can, if required, override the default partitioning or collecting method
on the Partitioning tab. The selected method is applied to the incoming
data as it enters the stage on a particular link, and so the Partitioning tab
appears on the Inputs page. You can also use the tab to repartition data
between two parallel stages. If both stages are executing sequentially, you
cannot select a partition or collection method and the fields are disabled.
The fields are also disabled if the particular stage does not permit selection of partitioning or collection methods. The following table summarizes the default partitioning and collecting behavior:
Preceding Stage   Current Stage   Default behavior
Parallel          Parallel        Partition
Parallel          Sequential      Collect
Sequential        Parallel        Partition
Sequential        Sequential      None (disabled)
The Partitioning tab also allows you to specify that the data should be
sorted as it enters.
The following partitioning methods are available:
(Auto). DataStage attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning flag has been set on the previous stage in the job, and how many nodes are specified in the Configuration file. This is the default method for many stages.
Entire. Every processing node receives the entire data set. No
further information is required.
Hash. The records are hashed into partitions based on the value
of a key column or columns selected from the Available list.
Modulus. The records are partitioned using a modulus function
on the key column selected from the Available list. This is
commonly used to partition on tag fields.
Random. The records are partitioned randomly, based on the
output of a random number generator. No further information is
required.
Round Robin. The records are partitioned on a round robin basis
as they enter the stage. No further information is required.
Same. Preserves the partitioning already in place. No further
information is required.
DB2. Replicates the DB2 partitioning method of a specific DB2 table. Requires extra properties to be set. Access these properties by clicking the properties button.
Range. Divides a data set into approximately equal-size partitions based on one or more partitioning keys. Range partitioning is often a preprocessing step to performing a total sort on a data set. Requires extra properties to be set. Access these properties by clicking the properties button.
The following collection types are available:
(Auto). DataStage attempts to work out the best collection
method depending on execution modes of current and preceding
stages, and how many nodes are specified in the Configuration
file. This is the default collection method for many stages.
Ordered. Reads all records from the first partition, then all
records from the second partition, and so on. Requires no further
information.
Round Robin. Reads a record from the first input partition, then
from the second partition, and so on. After reaching the last partition, the operator starts over.
Columns Tab
The Inputs page always has a Columns tab. This displays the column
meta data for the selected input link in a grid.
If you click in a row and select Edit Row from the shortcut menu, the
Edit Column meta data dialog box appears, which allows you to edit the row
details in a dialog box format. It also has a Parallel tab which allows you
to specify properties that are peculiar to parallel job column definitions.
The dialog box only shows those properties that are relevant for the
current link.
The Parallel tab enables you to specify properties that give more detail
about each column, and properties that are specific to the data type.
Field Format
This has the following properties:
Bytes to Skip. Skip the specified number of bytes from the end of
the previous column to the beginning of this column.
Delimiter. Specifies the trailing delimiter of the column. Type an
ASCII character or select one of whitespace, end, none, or null.
String Type
This has the following properties:
Default. The value to substitute for a column that causes an error.
Export EBCDIC as ASCII. Select this to specify that EBCDIC characters are written as ASCII characters.
Date Type
Byte order. Specifies how multiple byte data types are ordered.
Choose from:
little-endian. The high byte is on the right.
big-endian. The high byte is on the left.
native-endian. As defined by the native format of the machine.
Days since. Dates are written as a signed integer containing the
number of days since the specified date. Enter a date in the form
%yyyy-%mm-%dd.
Format. Specifies the data representation format of a column.
Choose from:
binary
text
Format string. The string format of a date. By default this is %yyyy-%mm-%dd.
Is Julian. Select this to specify that dates are written as a numeric
value containing the Julian day. A Julian day specifies the date as
the number of days from 4713 BCE January 1, 12:00 hours (noon)
GMT.
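The byte-order and Julian-day properties lend themselves to a short worked example. The following Python sketch is illustrative only; the pivot date is an invented value, and the ordinal offset is the standard constant relating proleptic Gregorian ordinals to Julian day numbers.

    import datetime
    import struct

    # Byte order: the same int32 written big-endian and little-endian.
    print(struct.pack(">i", 1).hex())   # 00000001 - high byte on the left
    print(struct.pack("<i", 1).hex())   # 01000000 - high byte on the right

    # Days since: dates stored as a signed count of days from a pivot date.
    pivot = datetime.date(1990, 1, 1)                    # invented pivot date
    print((datetime.date(1990, 1, 31) - pivot).days)     # 30

    # Julian day: days counted from 4713 BCE January 1, noon GMT.
    def julian_day(d):
        return d.toordinal() + 1721425

    print(julian_day(datetime.date(2000, 1, 1)))         # 2451545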
Time Type
Byte order. Specifies how multiple byte data types are ordered.
Choose from:
little-endian. The high byte is on the right.
big-endian. The high byte is on the left.
native-endian. As defined by the native format of the machine.
Timestamp Type
Byte order. Specifies how multiple byte data types are ordered.
Choose from:
little-endian. The high byte is on the right.
big-endian. The high byte is on the left.
native-endian. As defined by the native format of the machine.
Format. Specifies the data representation format of a column.
Choose from:
binary
text
Format string. Specifies the format of a column representing a timestamp as a string. Defaults to %yyyy-%mm-%dd %hh:%nn:%ss.
Integer Type
Byte order. Specifies how multiple byte data types are ordered.
Choose from:
little-endian. The high byte is on the right.
big-endian. The high byte is on the left.
native-endian. As defined by the native format of the machine.
C_format. Perform non-default conversion of data from integer data to a string. This property specifies a C-language format string used for writing integer strings. This is passed to sprintf().
Default. The value to substitute for a column that causes an error.
Format. Specifies the data representation format of a column.
Choose from:
binary
text
Is link field. Selected to indicate that a column holds the length of another, variable-length column of the record or of the tag value of a tagged record field.
Layout max width. The maximum number of bytes in a column
represented as a string. Enter a number.
Layout width. The number of bytes in a column represented as a
string. Enter a number.
Out_format. Format string used for conversion of data from
integer or floating-point data to a string. This is passed to sprintf().
Pad char. Specifies the pad character used when strings or numeric
values are exported to an external string representation. Enter an
ASCII character or choose null.
Decimal Type
Allow all zeros. Specifies whether to treat a packed decimal
column containing all zeros (which is normally illegal) as a valid
representation of zero. Select Yes or No.
Default. The value to substitute for a column that causes an error.
Format. Specifies the data representation format of a column.
Choose from:
binary
text
Layout max width. The maximum number of bytes in a column
represented as a string. Enter a number.
Layout width. The number of bytes in a column represented as a
string. Enter a number.
Packed. Select Yes to specify that the decimal columns contain data
in packed decimal format or No to specify that they contain
unpacked decimal with a separate sign byte. This property has two
dependent properties as follows:
Check. Select Yes to verify that data is packed, or No to not verify.
Signed. Select Yes to use the existing sign when writing decimal columns. Select No to write a positive sign (0xf) regardless of the column's actual sign value.
Float Type
C_format. Perform non-default conversion of data from integer or
floating-point data to a string. This property specifies a C-language
format string used for writing integer or floating point strings. This
is passed to sprintf().
Default. The value to substitute for a column that causes an error.
Format. Specifies the data representation format of a column.
Choose from:
binary
text
Is link field. Selected to indicate that a column holds the length of another, variable-length column of the record or of the tag value of a tagged record field.
Layout max width. The maximum number of bytes in a column
represented as a string. Enter a number.
Layout width. The number of bytes in a column represented as a
string. Enter a number.
Out_format. Format string used for conversion of data from
integer or floating-point data to a string. This is passed to sprintf().
Pad char. Specifies the pad character used when strings or numeric
values are exported to an external string representation. Enter an
ASCII character or choose null.
Vectors
If the row you are editing represents a column which is a variable-length vector, tick the Variable check box. The Vector properties appear; these give the size of the vector in one of two ways:
Link Field Reference. The name of a column containing the
number of elements in the variable length vector. This should have
an integer or float type, and have its Is Link field property set.
Vector prefix. Specifies 1-, 2-, or 4-byte prefix containing the
number of elements in the vector.
If the row you are editing represents a column which is a vector of known
length, enter the number of elements in the Vector Occurs box.
Subrecords
If the row you are editing represents a column which is part of a subrecord
the Level Number column indicates the level of the column within the
subrecord structure.
If you specify Level numbers for columns, the column immediately
preceding will be identified as a subrecord. Subrecords can be nested, so
can contain further subrecords with higher level numbers (i.e., level 06 is
nested within level 05). Subrecord fields have a Tagged check box to indicate that this is a tagged subrecord.
Format Tab
Certain types of file stage (i.e., the Sequential File stage) also have a Format
tab which allows you to specify the format of the flat file or files being read
from.
The Format tab is similar in structure to the Properties tab. A flat file has
a number of properties that you can set different attributes for. Select the property in the tree and select the attributes you want to set from the Available properties to add window; it will then appear as a dependent property in the property tree and you can set its value as required.
If you click the Load button you can load the format information from a
table definition in the Repository.
The short-cut menu from the property tree gives access to the following
functions:
Format as. This applies a predefined template of properties.
Choose from the following:
Delimited/quoted
Fixed-width records
UNIX line terminator
DOS line terminator
Outputs Page
The Outputs page gives information about links going out of a stage. In
the case of a file or database stage an output link carries data being read
from the file or database. In the case of an active stage it carries data that
the stage has processed. Where there are no output links the stage editor
has no Outputs page.
Where it is present, the Outputs page contains various tabs depending on
stage type. The only field the Outputs page itself contains is Output name,
which gives the name of the link being edited. Where a stage has more
than one output link, you can select the link you are editing from the
Output name drop-down list.
The Outputs page also has a Columns button. Click this to open a
window showing column names from the meta data defined for this link.
You can drag these columns to various fields in the Outputs page tabs as
required.
General Tab
The Outputs page always has a General tab. This allows you to enter an
optional description of the link. Specifying a description for each link
enhances job maintainability.
Properties Tab
Some types of file and database stages can have properties that are particular to specific output links. In this case the Outputs page has a Properties
tab. This has the same format as the Stage page Properties tab (see Properties Tab on page 3-2).
Columns Tab
The Outputs page always has a Columns tab. This displays the column
meta data for the selected output link in a grid.
Sequential File
File Set
External Source
External Target
See the individual stage descriptions for details of these.
If you click in a row and select Edit Row from the shortcut menu, the Edit Column meta data dialog box appears, which allows you to edit the row details in a dialog box format. It also has a Parallel tab which allows you to specify properties that are peculiar to parallel job column definitions. (See page 3-17 for details.)
If the selected output link is a reject link, the column meta data grid is read
only and cannot be modified.
Format Tab
Certain types of file stage (i.e., the Sequential File stage) also have a Format
tab which allows you to specify the format of the flat file or files being
written to.
The Format tab is similar in structure to the Properties tab. A flat file has a number of properties that you can set different attributes for. Select the property in the tree and select the attributes you want to set from the Available properties to add window.
Mapping Tab
For active stages the Mapping tab allows you to specify how the output
columns are derived, i.e., what input columns map onto them or how they
are generated.
The left pane shows the input columns and/or the generated columns.
These are read only and cannot be modified on this tab. These columns
represent the data that the stage has produced after it has processed the
input data.
The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it
in by dragging input columns over, or by using the Auto-match facility. If
you have not yet defined any output column definitions, this will define
them for you. If you have already defined output column definitions, the
stage performs the mapping for you as far as possible.
In the above example the left pane represents the data after it has been
joined. The Expression field shows how the column has been derived, the
Column Name shows the column after it has been renamed by the join
operation (preceded by leftRec_ or RightRec_). The right pane represents
the data being output by the stage after the join. In this example the data
has been mapped straight across.
More details about mapping operations for the different stages are given
in the individual stage descriptions.
A shortcut menu can be invoked from the right pane that gives access to additional column operations.
The Find button opens a dialog box which allows you to search for particular output columns.
The Auto-Match button opens a dialog box which will automatically map
left pane columns onto right pane columns according to the specified
criteria.
Select Location match to map input columns onto the output ones occupying the equivalent position. Select Name match to match by names. You
can specify that all columns are to be mapped by name, or only the ones
you have selected. You can also specify that prefixes and suffixes are
ignored for input and output columns, and that case can be ignored.
Chapter 4. Sequential File Stage
The Sequential File stage is a file stage. It allows you to read data from or
write data to one or more flat files. The stage can have a single input link or
a single output link, and a single rejects link. It usually executes in parallel
mode but can be configured to execute sequentially if it is only reading one
file with a single reader.
When you edit a Sequential File stage, the Sequential File stage editor
appears. This is based on the generic stage editor described in Chapter 3,
Stage Editors.
The stage editor has up to three pages, depending on whether you are
reading or writing a file:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is present when you are writing to a flat file. This
is where you specify details about the file or files being written to.
Outputs page. This is present when you are reading from a flat file.
This is where you specify details about the file or files being read
from.
There are one or two special points to note about using runtime column
propagation (RCP) with Sequential stages. See Using RCP With Sequential Stages on page 4-20 for details.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Advanced tab allows you to specify how the stage executes.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the contents of the file are
processed by the available nodes as specified in the Configuration
file, and by any node constraints specified on the Advanced tab. In
Sequential mode the entire contents of the file are processed by the
conductor node. When a stage is reading a single file the Execution
Mode is sequential and you cannot change it. When a stage is
reading multiple files, the Execution Mode is parallel and you
cannot change it.
Preserve partitioning. You can select Set or Clear. If you select Set,
file read operations will request that the next stage preserves the
partitioning as is (it is ignored for file write operations). If you set
the Keep File Partitions output property this will automatically set
the preserve partitioning flag.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pool or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Inputs Page
The Inputs page allows you to specify details about how the Sequential
File stage writes data to one or more flat files. The Sequential File stage can
have only one input link, but this can write to multiple files.
The General tab allows you to specify an optional description of the input
link. The Properties tab allows you to specify details of exactly what the
link does. The Partitioning tab allows you to specify how incoming data
is partitioned before being written to the file or files. The Format tab gives information about the format of the files being written. The Columns tab
specifies the column definitions of incoming data.
Details about Sequential File stage properties, partitioning, and formatting
are given in the following sections. See Chapter 3, Stage Editors, for a
general description of the other tabs.
Properties Tab

Category/Property            Values                    Default    Mandatory?  Repeats?  Dependent of
Target/File                  pathname                  N/A        Y           Y         N/A
Target/File Update Mode      Append/Create/Overwrite   Create     Y           N         N/A
Options/Cleanup On Failure   True/False                True       Y           N         N/A
Options/Reject Mode          Continue/Fail/Save        Continue   Y           N         N/A
Options/Filter               command                   N/A        N           N         N/A
Options/Schema File          pathname                  N/A        N           N         N/A
Target Category
File. This property defines the flat file that the incoming data will be
written to. You can type in a pathname, or browse for a file. You can specify
multiple files by repeating the File property. Do this by selecting the Properties item at the top of the tree, and clicking on File in the Available
properties to add window. Do this for each extra file you want to specify.
You must specify at least one file to be written to, which must exist unless
you specify a File Update Mode of Create or Overwrite.
File Update Mode. This property defines how the specified file or files are
updated. The same method applies to all files being written to. Choose
from Append to append to existing files, Overwrite to overwrite existing
files, or Create to create a new file. If you specify the Create property for a
file that already exists you will get an error at runtime.
By default this property is set to Overwrite.
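As a loose analogy only (Python file modes, not DataStage internals; the file names are invented), the three update modes behave like these open flags:

    # Append: add to the end of an existing file.
    with open("target.txt", "ab") as f:
        f.write(b"more rows\n")
    # Overwrite: truncate and rewrite the file.
    with open("target.txt", "wb") as f:
        f.write(b"fresh rows\n")
    # Create: raises an error at run time if the file already exists.
    with open("brand_new.txt", "xb") as f:
        f.write(b"first rows\n")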
Options Category
Cleanup On Failure. This is set to True by default and specifies that the
stage will delete any partially written files if the stage fails for any reason.
Set this to False to specify that partially written files should be left.
Reject Mode. This specifies what happens to any data records that are not
written to a file for some reason. Choose from Continue to continue operation and discard any rejected rows, Fail to cease writing if any rows are
rejected, or Save to send rejected rows down a reject link.
Continue is set by default.
Filter. This is an optional property. You can use this to specify that the data
is passed through a filter program before being written to the file or files.
Specify the filter command, and any required arguments, in the Property
Value box.
Schema File. This is an optional property. By default the Sequential File
stage will use the column definitions defined on the Columns and Format
tabs as a schema for writing to the file. You can, however, override this by
specifying a file containing a schema. Type in a pathname or browse for a
file.
Partitioning on Input Links
If the Sequential File stage is set to execute in parallel and the preserve partitioning flag has been set on the previous stage in the job (see page 4-2), the stage will attempt to preserve the partitioning of the incoming data.
If the Sequential File stage is operating in sequential mode, it will first
collect the data before writing it to the file using the default Auto collection
method.
The Partitioning tab allows you to override this default behavior. The
exact operation of this tab depends on:
Whether the Sequential File stage is set to execute in parallel or
sequential mode.
Whether the preceding stage in the job is set to execute in parallel
or sequential mode.
If the Sequential File stage is set to execute in parallel, then you can set a
partitioning method by selecting from the Partitioning mode drop-down
list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the Stage page Advanced tab).
If the Sequential File stage is set to execute in sequential mode, but the
preceding stage is executing in parallel, then you can set a collection
method from the Collection type drop-down list. This will override the
default auto collection method.
The following partitioning methods are available:
(Auto). DataStage attempts to work out the best partitioning
method depending on execution modes of current and preceding
stages, whether the Preserve Partitioning flag has been set on the
previous stage in the job, and how many nodes are specified in the
Configuration file. This is the default collection method for the
Sequential File stage.
Entire. Each file written to receives the entire data set.
Hash. The records are hashed into partitions based on the value of
a key column or columns selected from the Available list.
Modulus. The records are partitioned using a modulus function on
the key column selected from the Available list. This is commonly
used to partition on tag fields.
Random. The records are partitioned randomly, based on the
output of a random number generator.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is
also set, the first record is retained.
You can also specify sort direction, case sensitivity, and collating sequence
for each column in the Selected list by selecting it and right-clicking to
invoke the shortcut menu.
Outputs Page
The Outputs page allows you to specify details about how the Sequential
File stage reads data from one or more flat files. The Sequential File stage
can have only one output link, but this can read from multiple files.
It can also have a single reject link. This is typically used when you are
writing to a file and provides a location where records that have failed to
be written to a file for some reason can be sent.
The Output name drop-down list allows you to choose whether you are
looking at details of the main output link (the stream link) or the reject link.
The General tab allows you to specify an optional description of the
output link. The Properties tab allows you to specify details of exactly
what the link does. The Format tab gives information about the format of
the files being read. The Columns tab specifies the column definitions of
incoming data.
Details about Sequential File stage properties and formatting are given in
the following sections. See Chapter 3, Stage Editors, for a general
description of the other tabs.
Category/Property                   Values                          Default            Mandatory?                            Repeats?   Dependent of
Source/File                         pathname                        N/A                Y if Read Method = Specific File(s)   Y          N/A
Source/File Pattern                 pathname                        N/A                Y if Read Method = File Pattern       N          N/A
Source/Read Method                  Specific File(s)/File Pattern   Specific File(s)   Y                                     N          N/A
Options/Missing File Mode           Error/OK/Depends                Depends            Y if File used                        N          N/A
Options/Keep file Partitions        True/False                      False              Y                                     N          N/A
Options/Reject Mode                 Continue/Fail/Save              Continue           Y                                     N          N/A
Options/Report Progress             Yes/No                          Yes                Y                                     N          N/A
Options/Filter                      command                         N/A                N                                     N          N/A
Options/Number Of Readers Per Node  number                          1                  N                                     N          N/A
Options/Schema File                 pathname                        N/A                N                                     N          N/A
Source Category
File. This property defines the flat file that data will be read from. You can
type in a pathname, or browse for a file. You can specify multiple files by
repeating the File property. Do this by selecting the Properties item at the
top of the tree, and clicking on File in the Available properties to add
window. Do this for each extra file you want to specify.
File Pattern. Specifies a group of files to import. Specify a file containing a list of files, or a job parameter representing the file. The file could also be any valid shell expression, in Bourne shell syntax, that generates a list of file names.
Read Method. This property specifies whether you are reading from a
specific file or files or using a file pattern to select files.
Options Category
Missing File Mode. Specifies the action to take if one of your File properties has specified a file that does not exist. Choose from Error to stop the
job, OK to skip the file, or Depends, which means the default is Error,
unless the file has a node name prefix of *: in which case it is OK. The
default is Depends.
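For example, specifying a File of *:/data/in/day1.txt (an illustrative
pathname) with Missing File Mode left at Depends means the job skips the
file silently if it is missing, because of the *: node name prefix.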
Keep file Partitions. Set this to True to partition the imported data set
according to the organization of the input file(s). So, for example, if you are
reading three files you will have three partitions. Defaults to False.
Reject Mode. Allows you to specify behavior if a record fails to be read
for some reason. Choose from Continue to continue operation and discard
any rejected rows, Fail to cease reading if any rows are rejected, or Save to
send rejected rows down a reject link. Defaults to Continue.
Report Progress. Choose Yes or No to enable or disable reporting. By
default the stage displays a progress report at each 10% interval when it
can ascertain file size. Reporting occurs only if the file is greater than 100
KB, records are fixed length, and there is no filter on the file.
Filter. This is an optional property. You can use this to specify that the
data is passed through a filter program after being read from the files.
Specify the filter command, and any required arguments, in the Property
Value box.
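For example (an illustrative command only), a filter of:

   grep -v '^#'

would drop comment lines beginning with # from the data as it is read.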
Number Of Readers Per Node. This is an optional property. Specifies the
number of instances of the file read operator on each processing node. The
default is one operator per node per input data file. If numReaders is greater
than one, each instance of the file read operator reads a contiguous range
of records from the input file. The starting record location in the file for
each operator, or seek location, is determined by the data file size, the
record length, and the number of instances of the operator, as specified by
numReaders.
The resulting data set contains one partition per instance of the file read
operator, as determined by numReaders. The data file(s) being read must
contain fixed-length records.
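As an illustration (the figures are invented): reading a single 800,000-byte
file of 80-byte fixed-length records with Number Of Readers Per Node set to
4 gives 10,000 records split across four instances of 2,500 records each,
with seek locations at bytes 0, 200,000, 400,000, and 600,000, and yields a
data set with four partitions.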
Schema File. This is an optional property. By default the Sequential File
stage will use the column definitions defined on the Columns and Format
tabs as a schema for reading the file. You can, however, override this by
specifying a file containing a schema. Type in a pathname or browse for a
file.
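A schema file is a plain text file in the format described in Appendix A.
As a minimal sketch (the column names and types here are purely
illustrative):

   record (
     custid: int32;
     name: string[max=30];
     balance: decimal[8,2];
   )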
Intact. Allows you to define that this is a partial record schema. See
Appendix A for details on complete versus partial schemas. This
property has a dependent property:
Check Intact. Select this to force validation of the partial schema
as the file or files are imported. Note that this can degrade
performance.
Delimiter. Specifies the trailing delimiter of all columns in the
record. Type an ASCII character or select one of the following:
whitespace. A whitespace character is used.
end. The last column in the record is composed of all remaining
bytes until the end of the record.
none. No delimiter.
null. Null character is used.
Delimiter string. Specify the string used as the trailing delimiter at
the end of each column. Enter one or more ASCII characters.
Prefix bytes. Specifies that each column in the data file is prefixed
by 1, 2, or 4 bytes containing, as a binary value, either the column's
length or the tag value for a tagged field.
Print field. Select this to specify the stage writes a message for each
column that it reads of the format:
Importing columnname value
5
File Set Stage
The File Set stage is a file stage. It allows you to read data from or write
data to a file set. The stage can have a single input link, a single output link,
and a single rejects link. It only executes in parallel mode.
What is a file set? DataStage can generate and name exported files, write
them to their destination, and list the files it has generated in a file whose
extension is, by convention, .fs. The data files and the file that lists them
are called a file set. This capability is useful because some operating
systems impose a 2 GB limit on the size of a file and you need to distribute
files among nodes to prevent overruns.
The amount of data that can be stored in each destination data file is
limited by the characteristics of the file system and the amount of free disk
space available. The number of files created by a file set depends on:
The number of processing nodes in the default node pool
The number of disks in the export or default disk pool connected to
each processing node in the default node pool
The size of the partitions of the data set
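To illustrate (the figures are invented): with four processing nodes in the
default node pool, each contributing one disk to the export pool, and
partitions small enough to fit on those disks, the stage would typically
create four data files plus the .fs descriptor file that lists them.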
The File Set stage enables you to create and write to file sets, and to read
data back from file sets.
When you edit a File Set stage, the File Set stage editor appears. This is
based on the generic stage editor described in Chapter 3, Stage Editors.
The stage editor has up to three pages, depending on whether you are
reading or writing a file set:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is present when you are writing to a file set. This
is where you specify details about the file set being written to.
Outputs page. This is present when you are reading from a file set.
This is where you specify details about the file set being read from.
There are one or two special points to note about using runtime column
propagation (RCP) with File Set stages. See Using RCP With File Set
Stages on page 5-20 for details.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Advanced page allows you to specify how the stage executes.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. This is set to parallel and cannot be changed.
Preserve partitioning. You can select Set or Clear. If you select Set,
file set read operations will request that the next stage preserves
the partitioning as is (it is ignored for file set write operations).
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pools
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Inputs Page
The Inputs page allows you to specify details about how the File Set stage
writes data to a file set. The File Set stage can have only one input link.
The General tab allows you to specify an optional description of the input
link. The Properties tab allows you to specify details of exactly what the
link does. The Partitioning tab allows you to specify how incoming data
is partitioned before being written to the file set. The Formats tab gives
information about the format of the files being written. The Columns tab
specifies the column definitions of the data.
Details about File Set stage properties, partitioning, and formatting are
given in the following sections. See Chapter 3, Stage Editors, for a
general description of the other tabs.
The following table gives a quick reference list of the properties and
their attributes. A more detailed description of each property follows.

Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Target/File Set | pathname | N/A | Y | N | N/A
Target/File Set Update Policy | Create (Error if exists)/Overwrite/Use Existing (Discard records)/Use Existing (Discard schema & records) | Create (Error if exists) | Y | N | N/A
Target/File Set Schema policy | Write/Omit | Write | Y | N | N/A
Options/Cleanup on Failure | True/False | True | Y | N | N/A
Options/Single File Per Partition | True/False | False | Y | N | N/A
Options/Reject Mode | Continue/Fail/Save | Continue | Y | N | N/A
Options/Diskpool | string | N/A | N | N | N/A
Options/File Prefix | string | export.username | N | N | N/A
Options/File Suffix | string | none | N | N | N/A
Options/Maximum File Size | number MB | N/A | N | N | N/A
Options/Schema File | pathname | N/A | N | N | N/A
Target Category
File Set. This property defines the file set that the incoming data will be
written to. You can type in a pathname of, or browse for, a file set descriptor
file (by convention ending in .fs).
File Set Update Policy. Specifies what action will be taken if the file set
you are writing to already exists. Choose from:
Create (Error if exists)
Overwrite
Use Existing (Discard records)
Use Existing (Discard schema & records)
The default is Overwrite.
Options Category
Cleanup on Failure. This is set to True by default and specifies that the
stage will delete any partially written files if the stage fails for any reason.
Set this to False to specify that partially written files should be left.
Single File Per Partition. Set this to True to specify that one file is
written for each partition. The default is False.
Intact. Allows you to define that this is a partial record schema. See
Partial Schemas in Appendix A for details on complete versus
partial schemas. (The dependent property Check Intact is only relevant for output links.)
Record delimiter string. Specify a string to be written at the end of
each record. Enter one or more ASCII characters.
Record delimiter. Specify a single character to be written at the end
of each record. Type an ASCII character or select one of the
following:
\n. Newline (the default).
null. Null character.
Mutually exclusive with Record delimiter string.
Prefix bytes. Specifies that each column in the data file is prefixed
by 1, 2, or 4 bytes containing, as a binary value, either the column's
length or the tag value for a tagged field.
Print field. This property is not relevant for input links.
Quote. Specifies that variable length columns are enclosed in
single quotes, double quotes, or another ASCII character or pair of
ASCII characters. Choose Single or Double, or enter an ASCII
character.
Vector prefix. For columns that are variable length vectors, specifies a 1-, 2-, or 4-byte prefix containing the number of elements in
the vector.
Type Defaults. These are properties that apply to all columns of a specific
data type unless specifically overridden at the column level. They are
divided into a number of subgroups according to data type.
General. These properties apply to several data types (unless overridden
at column level):
Byte order. Specifies how multiple byte data types (except string
and raw data types) are ordered. Choose from:
little-endian. The high byte is on the right.
big-endian. The high byte is on the left.
native-endian. As defined by the native format of the machine.
Format. Specifies the data representation format of a column.
Choose from:
binary
text
Layout max width. The maximum number of bytes in a column
represented as a string. Enter a number.
Layout width. The number of bytes in a column represented as a
string. Enter a number.
Pad char. Specifies the pad character used when strings or numeric
values are exported to an external string representation. Enter an
ASCII character or choose null.
String. These properties are applied to columns with a string data type,
unless overridden at column level.
Export EBCDIC as ASCII. Select this to specify that EBCDIC characters are written as ASCII characters.
Import ASCII as EBCDIC. Not relevant for input links.
Decimal. These properties are applied to columns with a decimal data
type unless overridden at column level.
Allow all zeros. Specifies whether to treat a packed decimal
column containing all zeros (which is normally illegal) as a valid
representation of zero. Select Yes or No.
Packed. Select Yes to specify that the decimal columns contain data
in packed decimal format or No to specify that they contain
unpacked decimal with a separate sign byte. This property has two
dependent properties as follows:
Check. Select Yes to verify that data is packed, or No to not verify.
Signed. Select Yes to use the existing sign when writing decimal
columns. Select No to write a positive sign (0xf) regardless of the
column's actual sign value.
Precision. Specifies the precision where a decimal column is
written in text format. Enter a number.
Rounding. Specifies how to round a decimal column when writing
it. Choose from:
up (ceiling). Truncate source column towards positive infinity.
down (floor). Truncate source column towards negative infinity.
nearest value. Round the source column towards the nearest
representable value.
truncate towards zero. This is the default. Discard fractional
digits to the right of the right-most fractional digit supported by
the destination, regardless of sign.
Scale. Specifies how to round a source decimal when its precision
and scale are greater than those of the destination.
Numeric. These properties are applied to columns with an integer or float
data type unless overridden at column level.
C_format. Perform non-default conversion of integer or
floating-point data to a string. This property specifies a C-language
format string used for writing integer or floating point strings. This
is passed to sprintf().
In_format. Not relevant for input links.
Out_format. Format string used for conversion of data from
integer or floating-point data to a string. This is passed to sprintf().
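For example (an illustrative format only), an Out_format of %7.2f would
write each value as a seven-character fixed-point string with two decimal
places, exactly as sprintf() would render it.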
Date. These properties are applied to columns with a date data type unless
overridden at column level.
Days since. Dates are written as a signed integer containing the
number of days since the specified date. Enter a date in the form
%yyyy-%mm-%dd.
Format string. The string format of a date. By default this is %yyyy-%mm-%dd.
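As an illustration (format and value invented), a format string of
%dd/%mm/%yyyy would write the date 23 September 2002 as 23/09/2002.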
Is Julian. Select this to specify that dates are written as a numeric
value containing the Julian day. A Julian day specifies the date as
the number of days from 4713 BCE January 1, 12:00 hours (noon)
GMT.
Time. These properties are applied to columns with a time data type
unless overridden at column level.
Format string. Specifies the format of columns representing time as
a string. By default this is %hh-%mm-%ss.
Is midnight seconds. Select this to specify that times are written as
a binary 32-bit integer containing the number of seconds elapsed
from the previous midnight.
Timestamp. These properties are applied to columns with a timestamp
data type unless overridden at column level.
Format string. Specifies the format of a column representing a timestamp as a string. Defaults to %yyyy-%mm-%dd %hh:%nn:%ss.
Outputs Page
The Outputs page allows you to specify details about how the File Set
stage reads data from a file set. The File Set stage can have only one output
link. It can also have a single reject link, where records that have failed to
be written or read for some reason can be sent. The Output name
drop-down list allows you to choose whether you are looking at details of
the main output link (the stream link) or the reject link.
The General tab allows you to specify an optional description of the
output link. The Properties tab allows you to specify details of exactly
what the link does. The Formats tab gives information about the format of
the files being read. The Columns tab specifies the column definitions of
incoming data.
Details about File Set stage properties and formatting are given in the
following sections. See Chapter 3, Stage Editors, for a general description of the other tabs.
The following table gives a quick reference list of the properties and
their attributes. A more detailed description of each property follows.

Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Source/File Set | pathname | N/A | Y | N | N/A
Options/Keep file Partitions | True/False | False | Y | N | N/A
Options/Reject Mode | Continue/Fail/Save | Continue | Y | N | N/A
Options/Report Progress | Yes/No | Yes | Y | N | N/A
Options/Filter | command | N/A | N | N | N/A
Options/Number Of Readers Per Node | number | N/A | N | N | N/A
Options/Schema File | pathname | N/A | N | N | N/A
Options/Use Schema Defined in File Set | True/False | False | Y | N | N/A
Source Category
File Set. This property defines the file set that the data will be read from.
You can type in a pathname of, or browse for, a file set descriptor file (by
convention ending in .fs).
Options Category
Keep file Partitions. Set this to True to partition the read data set
according to the organization of the input file(s). So, for example, if you are
reading three files you will have three partitions. Defaults to False.
Reject Mode. Allows you to specify behavior if a record fails to be read
for some reason. Choose from Continue to continue operation and discard
any rejected rows, Fail to cease reading if any rows are rejected, or Save to
send rejected rows down a reject link. Defaults to Continue.
Report Progress. Choose Yes or No to enable or disable reporting. By
default the stage displays a progress report at each 10% interval when it
can ascertain file size. Reporting occurs only if the file is greater than 100
KB, records are fixed length, and there is no filter on the file.
Filter. This is an optional property. You can use this to specify that the data
is passed through a filter program after being read from the files. Specify
the filter command, and any required arguments, in the Property Value
box.
Number Of Readers Per Node. This is an optional property. Specifies the
number of instances of the file read operator on each processing node. The
default is one operator per node per input data file. If numReaders is greater
than one, each instance of the file read operator reads a contiguous range
of records from the input file. The starting record location in the file for
each operator, or seek location, is determined by the data file size, the
record length, and the number of instances of the operator, as specified by
numReaders.
The resulting data set contains one partition per instance of the file read
operator, as determined by numReaders. The data file(s) being read must
contain fixed-length records.
Schema File. This is an optional property. By default the File Set stage
will use the column definitions defined on the Columns and Format tabs
as a schema for reading the file. You can, however, override this by specifying a file containing a schema. Type in a pathname or browse for a file.
Use Schema Defined in File Set. When you create a file set you have an
option to save the schema along with it. When you read the file set you can
use this schema in preference to the column definitions or a schema file by
setting this property to True.
6
Data Set Stage
The Data Set stage is a file stage. It allows you to read data from or
write data to a data set. The stage can have a single input link or a
single output link. It can be configured to execute in parallel or
sequential mode.
What is a data set? DataStage parallel extender jobs use data sets to
store data being operated on in a persistent form. Data sets are operating system files, each referred to by a control file, which by
convention has the suffix .ds. Using data sets wisely can be key to
good performance in a set of linked jobs. You can also manage data
sets independently of a job using the Data Set Management utility,
available from the DataStage Designer, Manager, or Director; see
Chapter 50.
The stage editor has up to three pages, depending on whether you are
reading or writing a data set:
Stage page. This is always present and is used to specify
general information about the stage.
Inputs page. This is present when you are writing to a data set.
This is where you specify details about the data set being
written to.
Outputs page. This is present when you are reading from a
data set. This is where you specify details about the data set
being read from.
Stage Page
The General tab allows you to specify an optional description of the
stage. The Advanced page allows you to specify how the stage
executes.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. This is not relevant for a data set and so is
disabled.
Preserve partitioning. A data set stores the setting of the
preserve partitioning flag with the data. It cannot be changed
on this stage and so the field is disabled (it does not appear if
your stage only has an input link).
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pools
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Node map constraint. This is not relevant to a Data Set stage.
Inputs Page
The Inputs page allows you to specify details about how the Data Set
stage writes data to a data set. The Data Set stage can have only one
input link.
The General tab allows you to specify an optional description of the
input link. The Properties tab allows you to specify details of exactly
what the link does. The Columns tab specifies the column definitions
of the data.
Details about Data Set stage properties are given in the following
sections. See Chapter 3, Stage Editors, for a general description of
the other tabs.
The following table gives a quick reference list of the properties and
their attributes. A more detailed description of each property follows.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Target/File | pathname | N/A | Y | N | N/A
Target/Update Policy | Append/Create (Error if exists)/Overwrite/Use existing (Discard records)/Use existing (Discard records and schema) | Create (Error if exists) | Y | N | N/A
Target Category
File. The name of the control file for the data set. You can browse for
the file or enter a job parameter. By convention, the file has the suffix
.ds.
Update Policy. Specifies what action will be taken if the data set you
are writing to already exists. Choose from:
Append. Append any new data to the existing data.
Create (Error if exists). DataStage reports an error if the data
set already exists.
Overwrite. Overwrites any existing data with new data.
Use existing (Discard records). Keeps the existing data and
discards any new data.
Use existing (Discard records and schema). Keeps the existing
data and discards any new data and its associated schema.
The default is Overwrite.
Outputs Page
The Outputs page allows you to specify details about how the Data
Set stage reads data from a data set. The Data Set stage can have only
one output link.
The General tab allows you to specify an optional description of the
output link. The Properties tab allows you to specify details of exactly
what the link does. The Columns tab specifies the column definitions
of incoming data.
Details about Data Set stage properties and formatting are given in the
following sections. See Chapter 3, Stage Editors, for a general
description of the other tabs.
Values
Default
Manda
DepenRepeats?
tory?
dent of
Source/File
pathname
N/A
N/A
Source Category
File. The name of the control file for the data set. You can browse for
the file or enter a job parameter. By convention the file has the suffix
.ds.
7
Lookup File Set Stage
The Lookup File Set stage is a file stage. It allows you to create a lookup
file set or reference one for a lookup. The stage can have a single input link
or a single output link. The output link must be a reference link. The stage
can be configured to execute in parallel or sequential mode when used
with an input link.
For more information about lookup operations, see Chapter 20, Lookup
Stage.
When you edit a Lookup File Set stage, the Lookup File Set stage editor
appears. This is based on the generic stage editor described in Chapter 3,
Stage Editors.
The stage editor has up to two pages, depending on whether you are
creating or referencing a file set:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is present when you are creating a lookup table.
This is where you specify details about the file set being created
and written to.
Outputs page. This is present when you are reading from a lookup
file set, i.e., where the stage is providing a reference link to a
Lookup stage. This is where you specify details about the file set
being read from.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Advanced page allows you to specify how the stage executes.
Advanced Tab
This tab only appears when you are using the stage to create a reference
file set (i.e., where the stage has an input link). It allows you to specify the
following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the contents of the table are
processed by the available nodes as specified in the Configuration
file, and by any node constraints specified on the Advanced tab. In
Sequential mode the entire contents of the table are processed by
the conductor node.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pools
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Inputs Page
The Inputs page allows you to specify details about how the Lookup File
Set stage writes data to a table or file set. The Lookup File Set stage can
have only one input link.
The General tab allows you to specify an optional description of the input
link. The Properties tab allows you to specify details of exactly what the
link does. The Partitioning tab allows you to specify how incoming data
is partitioned before being written to the table or file set. The Columns tab
specifies the column definitions of the data.
Details about Lookup File Set stage properties and partitioning are given
in the following sections. See Chapter 3, Stage Editors, for a general
description of the other tabs.
The following table gives a quick reference list of the properties and
their attributes. A more detailed description of each property follows.

Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Lookup Keys/Key | Input column | N/A | Y | Y | N/A
Lookup Keys/Case Sensitive | True/False | True | N | N | Key
Target/Lookup File Set | pathname | N/A | Y | N | N/A
Options/Allow Duplicates | True/False | False | Y | N | N/A
Options/Diskpool | string | N/A | N | N | N/A
Target Category
Lookup File Set. This property defines the file set that the incoming data
will be written to. You can type in a pathname of, or browse for, a file set
descriptor file (by convention ending in .fs).
Options Category
Allow Duplicates. Set this to cause multiple copies of duplicate records to
be saved in the lookup table without a warning being issued. Two lookup
records are duplicates when all lookup key columns have the same value
in the two records. If you do not specify this option, DataStage issues a
warning message when it encounters duplicate records and discards all
but the first of the matching records.
Diskpool. This is an optional property. Specify the name of the disk pool
into which to write the file set. You can also specify a job parameter.
Sort. Select this to specify that data coming in on the link should be
sorted. Select the column or columns to sort on from the Available
list.
Stable. Select this if you want to preserve previously sorted data
sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is
also set, the first record is retained.
You can also specify sort direction, case sensitivity, and collating sequence
for each column in the Selected list by selecting it and right-clicking to
invoke the shortcut menu.
Outputs Page
The Outputs page allows you to specify details about how the Lookup File
Set stage references a file set. The Lookup File Set stage can have only one
output link, which is a reference link.
The General tab allows you to specify an optional description of the
output link. The Properties tab allows you to specify details of exactly
what the link does. The Columns tab specifies the column definitions of
incoming data.
Details about Lookup File Set stage properties are given in the following
sections. See Chapter 3, Stage Editors, for a general description of the
other tabs.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Lookup Source/Lookup File Set | pathname | N/A | Y | N | N/A
8
External Source Stage
The External Source stage is a file stage. It allows you to read data that is
output from one or more source programs. The stage can have a single
output link, and a single rejects link. It can be configured to execute in
parallel or sequential mode.
The External Source stage allows you to perform actions such as
interfacing with databases not currently supported by the DataStage
Parallel Extender.
When you edit an External Source stage, the External Source stage editor
appears. This is based on the generic stage editor described in Chapter 3,
Stage Editors.
The stage editor has two pages:
Stage page. This is always present and is used to specify general
information about the stage.
Outputs page. This is where you specify details about the program
or programs whose output data you are reading.
There are one or two special points to note about using runtime column
propagation (RCP) with External Source stages. See Using RCP With
External Source Stages on page 8-10 for details.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Advanced page allows you to specify how the stage executes.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the data input from external
programs is processed by the available nodes as specified in the
Configuration file, and by any node constraints specified on the
Advanced tab. In Sequential mode all the data from the source
program is processed by the conductor node.
Preserve partitioning. You can select Set or Clear. If you select Set,
it will request that the next stage preserves the partitioning as is.
Clear is the default.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pools
or pools specified in the grid. The grid allows you to make choices
from drop-down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Outputs Page
The Outputs page allows you to specify details about how the External
Source stage reads data from an external program. The External Source
stage can have only one output link. It can also have a single reject link,
where records that have failed to be read for some reason can be sent. The
Output name drop-down list allows you to choose whether you are
looking at details of the main output link (the stream link) or the reject link.
The General tab allows you to specify an optional description of the
output link. The Properties tab allows you to specify details of exactly
what the link does. The Formats tab gives information about the format of
the files being read. The Columns tab specifies the column definitions of
incoming data.
Details about External Source stage properties and formatting are given in
the following sections. See Chapter 3, Stage Editors, for a general
description of the other tabs.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Source/Source Program | string | N/A | Y if Source Method = Specific Program(s) | Y | N/A
Source/Source Programs File | pathname | N/A | Y if Source Method = Program File(s) | Y | N/A
Source/Source Method | Specific Program(s)/Program File(s) | Specific Program(s) | Y | N | N/A
Options/Keep File Partitions | True/False | False | Y | N | N/A
Options/Reject Mode | Continue/Fail/Save | Continue | Y | N | N/A
Options/Report Progress | Yes/No | Yes | Y | N | N/A
Options/Schema File | pathname | N/A | N | N | N/A
Source Category
Source Program. Specifies the name of a program providing the source
data. DataStage calls the specified program and passes to it any arguments
specified. You can repeat this property to specify multiple program
instances with different arguments. You can use a job parameter to supply
program name and arguments.
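For example (a purely hypothetical program and arguments):

   /usr/local/bin/gen_orders -year 2002

DataStage runs this program on each processing node and reads the data the
program outputs as the source data.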
Source Programs File. Specifies a file containing a list of program names
and arguments. You can browse for the file or specify a job parameter. You
can repeat this property to specify multiple files.
Source Method. This property specifies whether you are directly specifying
a program (using the Source Program property) or using a file to specify a
program (using the Source Programs File property).
Options Category
Keep File Partitions. Set this to True to maintain the partitioning of the
read data. Defaults to False.
Reject Mode. Allows you to specify behavior if a record fails to be read
for some reason. Choose from Continue to continue operation and discard
any rejected rows, Fail to cease reading if any rows are rejected, or Save to
send rejected rows down a reject link. Defaults to Continue.
Report Progress. Choose Yes or No to enable or disable reporting. By
default the stage displays a progress report at each 10% interval when it
can ascertain input data size. Reporting occurs only if the input data size
is greater than 100 KB, records are fixed length, and there is no filter
specified.
Schema File. This is an optional property. By default the External Source
stage will use the column definitions defined on the Columns and Format
tabs as a schema for reading the data. You can, however, override
this by specifying a file containing a schema. Type in a pathname or
browse for a file.
You cannot change the properties of a reject link; the Properties tab for
a reject link is blank. Similarly, you cannot edit the column definitions
for a reject link. The link uses the column definitions for the link
rejecting the data records.
Intact. Allows you to define that this is a partial record schema. See
Partial Schemas in Appendix A for details on complete versus
partial schemas. This property has a dependent property:
Check Intact. Select this to force validation of the partial schema
as the file or files are imported. Note that this can degrade
performance.
Record delimiter string. Specifies the string at the end of each
record. Enter one or more ASCII characters.
Pad char. Specifies the pad character used when strings or numeric
values are exported to an external string representation. Enter an
ASCII character or choose null.
String. These properties are applied to columns with a string data type,
unless overridden at column level.
Export EBCDIC as ASCII. Not relevant for output links.
Import ASCII as EBCDIC. Select this to specify that ASCII characters are read as EBCDIC characters.
Decimal. These properties are applied to columns with a decimal data
type unless overridden at column level.
Allow all zeros. Specifies whether to treat a packed decimal
column containing all zeros (which is normally illegal) as a valid
representation of zero. Select Yes or No.
Packed. Select Yes to specify that the decimal columns contain data
in packed decimal format, No (separate) to specify that they
contain unpacked decimal with a separate sign byte, or No (zoned)
to specify that they contain an unpacked decimal in either ASCII or
EBCDIC text. This property has two dependent properties as
follows:
Check. Select Yes to verify that data is packed, or No to not verify.
Signed. Select Yes to use the existing sign when reading decimal
columns. Select No to use a positive sign (0xf) regardless of the
column's actual sign value.
Precision. Specifies the precision where a decimal column is
represented in text format. Enter a number.
Rounding. Specifies how to round a decimal column when reading
it. Choose from:
up (ceiling). Truncate source column towards positive infinity.
down (floor). Truncate source column towards negative infinity.
nearest value. Round the source column towards the nearest
representable value.
truncate towards zero. This is the default. Discard fractional
digits to the right of the right-most fractional digit supported by
the destination, regardless of sign.
9
External Target Stage
The External Target stage is a file stage. It allows you to write data to one
or more target programs. The stage can have a single input link and a
single rejects link. It can be configured to execute in parallel or sequential
mode.
The External Target stage allows you to perform actions such as
interfacing with databases not currently supported by the DataStage
Parallel Extender.
When you edit an External Target stage, the External Target stage editor
appears. This is based on the generic stage editor described in Chapter 3,
Stage Editors.
The stage editor has up to three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify details about the program
or programs you are writing data to.
Outputs Page. This appears if the stage has a rejects link.
There are one or two special points to note about using runtime column
propagation (RCP) with External Target stages. See Using RCP With
External Target Stages on page 9-12 for details.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Advanced page allows you to specify how the stage executes.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the data output to external
programs is processed by the available nodes as specified in the
Configuration file, and by any node constraints specified on the
Advanced tab. In Sequential mode all the data from the source
program is processed by the conductor node.
Preserve partitioning. You can select Set or Clear. If you select Set,
it will request that the next stage preserves the partitioning as is.
Clear is the default.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pools
or pools specified in the grid. The grid allows you to make choices
from drop-down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Inputs Page
The Inputs page allows you to specify details about how the External
Target stage writes data to an external program. The External Target stage
can have only one input link.
The General tab allows you to specify an optional description of the input
link. The Properties tab allows you to specify details of exactly what the
link does. The Partitioning tab allows you to specify how incoming data
is partitioned before being written to the external program. The Formats
tab gives information about the format of the data being written. The
Columns tab specifies the column definitions of the data.
Details about External Target stage properties, partitioning, and formatting are given in the following sections. See Chapter 3, Stage Editors, for
a general description of the other tabs.
The following table gives a quick reference list of the properties and
their attributes. A more detailed description of each property follows.

Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Target/Destination Program | string | N/A | Y if Target Method = Specific Program(s) | Y | N/A
Target/Destination Programs File | pathname | N/A | Y if Target Method = Program File(s) | Y | N/A
Target/Target Method | Specific Program(s)/Program File(s) | Specific Program(s) | Y | N | N/A
Options/Cleanup on Failure | True/False | True | Y | N | N/A
Options/Reject Mode | Continue/Fail/Save | Continue | N | N | N/A
Options/Schema File | pathname | N/A | N | N | N/A
Target Category
Destination Program. This is an optional property. Specifies the name of
a program receiving data. DataStage calls the specified program and
passes to it any arguments specified. You can repeat this property to
specify multiple program instances with different arguments. You can use
a job parameter to supply program name and arguments.
Select a property type from the main tree, then add the properties you want
to set to the tree structure by clicking on them in the Available properties
to set window. You can then set a value for that property in the Property
Value box. Pop-up help for each of the available properties appears if you
hover the mouse pointer over it.
The following sections list the Property types and properties available for
each type.
Record level. These properties define details about how data records are
formatted in a file. The available properties are:
Fill char. Specify an ASCII character or a value in the range 0 to
255. This character is used to fill any gaps in an exported record
caused by column positioning properties. Set to 0 by default.
Final delimiter string. Specify a string to be written after the last
column of a record in place of the column delimiter. Enter one or
more ASCII characters (precedes the record delimiter if one is
used).
Final delimiter. Specify a single character to be written after the
last column of a record in place of the column delimiter. Type an
ASCII character or select one of whitespace, end, none, or null.
Intact. Allows you to define that this is a partial record schema. See
Partial Schemas in Appendix A for details on complete versus
partial schemas. (The dependent property Check Intact is only relevant for output links.)
Record delimiter string. Specify a string to be written at the end of
each record. Enter one or more ASCII characters.
Record delimiter. Specify a single character to be written at the end
of each record. Type an ASCII character or select one of the
following:
\n. Newline (the default).
null. Null character.
Mutually exclusive with Record delimiter string.
Record length. Select Fixed where the fixed length columns are
being written. DataStage calculates the appropriate length for the
record. Alternatively specify the length of fixed records as number
of bytes.
Record Prefix. Specifies that a variable-length record is prefixed by
a 1-, 2-, or 4-byte length prefix. 1 byte is the default.
Record type. Specifies that data consists of variable-length blocked
records (varying) or implicit records (implicit). If you choose the
implicit property, data is written as a stream with no explicit record
boundaries. The end of the record is inferred when all of the
columns defined by the schema have been parsed. The varying
property allows you to specify one of the following IBM blocked
or spanned formats: V, VB, VS, or VBS.
This property is mutually exclusive with Record length, Record
delimiter, Record delimiter string, and Record prefix.
User defined. Allows free format entry of any properties not
defined elsewhere. Specify in a comma-separated list.
Field Defaults. Defines default properties for columns written to the files.
These are applied to all columns written. The available properties are:
Delimiter. Specifies the trailing delimiter of all columns in the
record. Type an ASCII character or select one of whitespace, end,
none, or null.
whitespace. A whitespace character is used.
end. Specifies that the last column in the record is composed of
all remaining bytes until the end of the record.
none. No delimiter.
null. Null character is used.
Delimiter string. Specify a string to be written at the end of each
column. Enter one or more ASCII characters.
Prefix bytes. Specifies that each column in the data file is prefixed
by 1, 2, or 4 bytes containing, as a binary value, either the column's
length or the tag value for a tagged column.
Print field. This property is not relevant for input links.
Outputs Page
The Outputs page appears if the stage has a Reject link.
The General tab allows you to specify an optional description of the
output link.
You cannot change the properties of a Reject link. The Properties tab for a
reject link is blank.
Similarly, you cannot edit the column definitions for a reject link. The link
uses the column definitions for the link rejecting the data records.
10
Write Range Map
Stage
The Write Range Map stage allows you to write data to a range map.
The stage can have a single input link. It can only run in parallel mode.
The Write Range Map stage takes an input data set produced by
sampling and sorting a data set and writes it to a file in a form usable
by the range partitioning method. The range partitioning method uses
the sampled and sorted data set to determine partition boundaries.
See Partitioning and Collecting Data on page 2-7 for a description of the range partitioning method.
A typical use for the Write Range Map stage would be in a job which
used the Sample stage to sample a data set, the Sort stage to sort it,
and the Write Range Map stage to write the resulting data set to a file.
The Write Range Map stage editor has two pages:
Stage page. This is always present and is used to specify
general information about the stage.
Inputs page. This is present when you are writing a range
map. This is where you specify details about the file being
written to.
Stage Page
The General tab allows you to specify an optional description of the
stage. The Advanced page allows you to specify how the stage
executes.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage always executes in parallel mode.
Preserve partitioning. This is Set by default. The partitioning
mode is range and cannot be overridden.
Node pool and resource constraints. Select this option to
constrain parallel execution to the node pool or pools and/or
resource pools or pools specified in the grid. The grid allows
you to make choices from drop down lists populated from the
Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by
clicking the browse button to open the Available Nodes dialog
box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any
node pools defined in the Configuration file).
Inputs Page
The Inputs page allows you to specify details about how the Write
Range Map stage writes the range map to a file. The Write Range Map
stage can have only one input link.
The General tab allows you to specify an optional description of the
input link. The Properties tab allows you to specify details of exactly
what the link does. The Partitioning tab allows you to specify sorting
details. The Columns tab specifies the column definitions of the data.
Details about Write Range Map stage properties and partitioning are
given in the following sections. See Chapter 3, Stage Editors, for a
general description of the other tabs.
Any properties that you must set a value for (that is, properties that do
not have a default value) are shown in a warning color (red by default)
and turn black when you supply a value for them.
The following table gives a quick reference list of the properties and
their attributes. A more detailed description of each property follows.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Options/File Update Mode | Create/Overwrite | Create | Y | N | N/A
Options/Key | input column | N/A | Y | Y | N/A
Options/Range Map File | pathname | N/A | Y | N | N/A
Options Category
File Update Mode. This is set to Create by default. If the file you
specify already exists this will cause an error. Choose Overwrite to
overwrite existing files.
Key. This allows you to specify the key for the range map. Choose an
input column from the drop-down list. You can specify a composite
key by specifying multiple key properties.
Range Map File. Specify the file that is to hold the range map. You
can browse for a file or specify a job parameter.
11
SAS Data Set Stage
The Parallel SAS Data Set stage is a file stage. It allows you to read data
from or write data to a parallel SAS data set in conjunction with an
SAS stage. The stage can have a single input link or a single output
link. It can be configured to execute in parallel or sequential mode.
DataStage uses a parallel SAS data set to store data being operated on
by an SAS stage in a persistent form. A parallel SAS data set is a set of
one or more sequential SAS data sets, with a header file specifying the
names and locations of all the component files. By convention, the
header file has the suffix .psds.
The stage editor has up to three pages, depending on whether you are
reading or writing a data set:
Stage page. This is always present and is used to specify
general information about the stage.
Inputs page. This is present when you are writing to a data set.
This is where you specify details about the data set being
written to.
Outputs page. This is present when you are reading from a
data set. This is where you specify details about the data set
being read from.
Stage Page
The General tab allows you to specify an optional description of the
stage. The Advanced page allows you to specify how the stage
executes.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed
by the available nodes as specified in the Configuration file,
and by any node constraints specified on the Advanced tab. In
Sequential mode the entire data set is processed by the
conductor node.
Preserve partitioning. This is Propagate by default. It adopts
Set or Clear from the previous stage. You can explicitly select
Set or Clear. Select Set to request that the next stage in the job
should attempt to maintain the partitioning.
Node pool and resource constraints. Select this option to
constrain parallel execution to the node pool or pools and/or
resource pools or pools specified in the grid. The grid allows
you to make choices from drop down lists populated from the
Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by
clicking the browse button to open the Available Nodes dialog
box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any
node pools defined in the Configuration file).
Inputs Page
The Inputs page allows you to specify details about how the SAS Data
Set stage writes data to a data set. The SAS Data Set stage can have
only one input link.
The General tab allows you to specify an optional description of the
input link. The Properties tab allows you to specify details of exactly
what the link does. The Partitioning tab allows you to specify how
incoming data is partitioned before being written to the data set. The
Columns tab specifies the column definitions of the data.
Details about SAS Data Set stage properties are given in the following
sections. See Chapter 3, Stage Editors, for a general description of
the other tabs.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Target/File | pathname | N/A | Y | N | N/A
Target/Update Policy | Append/Create (Error if exists)/Overwrite | Create (Error if exists) | Y | N | N/A
Target Category
File. The name of the control file for the data set. You can browse for
the file or enter a job parameter. By convention the file has the suffix
.psds.
Update Policy. Specifies what action will be taken if the data set you
are writing to already exists. Choose from:
Append. Append to the existing data set.
Create (Error if exists). DataStage reports an error if the data
set already exists.
Overwrite. Overwrite any existing data set.
The default is Overwrite.
The Partitioning tab also allows you to specify that data arriving on
the input link should be sorted before being written to the data set.
The sort is always carried out within data partitions. If the stage is
partitioning incoming data the sort occurs after the partitioning. If the
stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.
Select the check boxes as follows:
Sort. Select this to specify that data coming in on the link
should be sorted. Select the column or columns to sort on from
the Available list.
Stable. Select this if you want to preserve previously sorted
data sets. This is the default.
Unique. Select this to specify that, if multiple records have
identical sorting key values, only one record is retained. If
stable sort is also set, the first record is retained.
You can also specify sort direction, case sensitivity, and collating
sequence for each column in the Selected list by selecting it and
right-clicking to invoke the shortcut menu.
Outputs Page
The Outputs page allows you to specify details about how the Parallel
SAS Data Set stage reads data from a data set. The Parallel SAS Data
Set stage can have only one output link.
The General tab allows you to specify an optional description of the
output link. The Properties tab allows you to specify details of exactly
what the link does. The Columns tab specifies the column definitions
of incoming data.
Details about Data Set stage properties and formatting are given in the
following sections. See Chapter 3, Stage Editors, for a general
description of the other tabs.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Source/File | pathname | N/A | Y | N | N/A
Source Category
File. The name of the control file for the parallel SAS data set. You can
browse for the file or enter a job parameter. The file has the suffix
.psds.
12
DB2 Stage
The DB2 stage is a database stage. It allows you to read data from and
write data to a DB2 database. It can also be used in conjunction with a
Lookup stage to access a lookup table hosted by a DB2 database (see
Chapter 20, Lookup Stage.)
The DB2 stage can have a single input link and a single output reject
link, or a single output link or output reference link.
When you edit a DB2 stage, the DB2 stage editor appears. This is based
on the generic stage editor described in Chapter 3, Stage Editors.
The stage editor has up to three pages, depending on whether you are
reading or writing a database:
Stage page. This is always present and is used to specify
general information about the stage.
Inputs page. This is present when you are writing to a DB2
database. This is where you specify details about the data
being written.
Outputs page. This is present when you are reading from a
DB2 database, or performing a lookup on a DB2 database. This
is where you specify details about the data being read.
To use DB2 stages you must have valid accounts and appropriate privileges on the databases to which they connect. The required DB2
privileges are as follows:
SELECT on any tables to be read.
INSERT on any existing tables to be updated.
TABLE CREATE to create any new tables.
INSERT and TABLE CREATE on any existing tables to be
replaced.
Stage Page
The General tab allows you to specify an optional description of the
stage. The Advanced page allows you to specify how the stage
executes.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the contents of the file are
processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the
Advanced tab. In Sequential mode the entire write is processed
by the conductor node.
Preserve partitioning. You can select Set or Clear. If you select
Set, file read operations will request that the next stage
preserves the partitioning as is (this field does not appear if your
stage only has an input link).
Node pool and resource constraints. Select this option to
constrain parallel execution to the node pool or pools and/or
resource pools or pools specified in the grid. The grid allows
you to make choices from drop down lists populated from the
Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by
clicking the browse button to open the Available Nodes dialog
box and selecting nodes from there. You are effectively
defining a new node pool for this stage (in addition to any
node pools defined in the Configuration file).
Inputs Page
The Inputs page allows you to specify details about how the DB2
stage writes data to a DB2 database. The DB2 stage can have only one
input link writing to one table.
The General tab allows you to specify an optional description of the
input link. The Properties tab allows you to specify details of exactly
what the link does. The Partitioning tab allows you to specify how
incoming data is partitioned before being written to the database. The
Columns tab specifies the column definitions of incoming data.
Details about DB2 stage properties, partitioning, and formatting are
given in the following sections. See Chapter 3, Stage Editors, for a
general description of the other tabs.
The following table gives a quick reference list of the properties and
their attributes. A more detailed description of each property follows.

Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Target/Table | string | N/A | Y | N | N/A
Target/Upsert Mode | Auto-generated Update & Insert/Auto-generated Update Only/User-defined Update & Insert/User-defined Update Only | Auto-generated Update & Insert | Y if Write method = Upsert | N | N/A
Target/Insert SQL | string | N/A | Y if Write method = Upsert | N | N/A
Target/Update SQL | string | N/A | Y if Write method = Upsert | N | N/A
Target/Write Method | Write/Load/Upsert | Load | Y | N | N/A
Target/Write Mode | Append/Create/Replace/Truncate | Append | Y | N | N/A
Connection/Use Database Environment Variable | True/False | True | Y | N | N/A
Connection/Use Server Environment Variable | True/False | True | Y | N | N/A
Connection/Override Database | string | N/A | Y (if Use Database environment variable = False) | N | N/A
Connection/Override Server | string | N/A | Y (if Use Server environment variable = False) | N | N/A
Options/Silently Drop Columns Not in Table | True/False | False | Y | N | N/A
Options/Truncate Column Names | True/False | False | Y | N | N/A
Options/Truncation Length | number | 18 | N | N | Truncate Column Names
Options/Close Command | string | N/A | N | N | N/A
Options/Default String Length | number | 32 | N | N | N/A
Options/Open Command | string | N/A | N | N | N/A
Options/Use ASCII Delimited Format | True/False | False | Y (if Write Method = Load) | N | N/A
Options/Cleanup on Failure | True/False | False | Y (if Write Method = Load) | N | N/A
Options/Message File | pathname | N/A | N | N | N/A
Target Category
Table. Specify the name of the table to write to. You can specify a job
parameter if required.
Upsert Mode. This only appears for the Upsert write method. Allows
you to specify how the insert and update statements are to be derived.
Choose from:
Auto-generated Update & Insert. DataStage generates update and
insert statements for you, based on the values you have supplied for
table name and on column details. The statements can be viewed by
selecting the Insert SQL or Update SQL properties.
Auto-generated Update Only. DataStage generates an update
statement for you, based on the values you have supplied for table
name and on column details. The statement can be viewed by
selecting the Update SQL property.
User-defined Update & Insert. Select this option to define your own
insert and update statements. Edit them in the Insert SQL and
Update SQL properties.
User-defined Update Only. Select this option to define your own
update statement. Edit it in the Update SQL property.
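With one of the user-defined modes selected, the Insert SQL and Update
SQL properties hold the statements themselves. As a minimal sketch (the
table and column names are illustrative; it assumes the convention of
referring to input columns as ORCHESTRATE.column_name), an upsert
pair might look like:

    -- Insert SQL
    INSERT INTO customers (cust_id, cust_name)
    VALUES (ORCHESTRATE.cust_id, ORCHESTRATE.cust_name);
    -- Update SQL
    UPDATE customers
    SET cust_name = ORCHESTRATE.cust_name
    WHERE cust_id = ORCHESTRATE.cust_id;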
Connection Category
Use Server Environment Variable. This is set to True by default,
which causes the stage to use the setting of the DB2INSTANCE environment variable to derive the server. If you set this to False, you must
specify a value for the Override Server property.
Use Database Environment Variable. This is set to True by default,
which causes the stage to use the setting of the environment variable
APT_DBNAME, if defined, and DB2DBDFT otherwise to derive the
database. If you set this to False, you must specify a value for the
Override Database property.
Override Server. Optionally specifies the DB2 instance name for the
table. This property appears if you set the Use Server Environment
Variable property to False.
Override Database. Optionally specifies the name of the DB2 database to access. This property appears if you set the Use Database
Environment Variable property to False.
Options Category
Silently Drop Columns Not in Table. This is False by default. Set to
True to silently drop all input columns that do not correspond to
columns in an existing DB2 table. Otherwise the stage reports an error
and terminates the job.
Truncate Column Names. Select this option to truncate column
names to 18 characters. To specify a length other than 18, use the Truncation Length dependent property:
Truncation Length
This is set to 18 by default. Change it to specify a different truncation length.
Close Command. This is an optional property. Use it to specify any
command to be parsed and executed by the DB2 database on all
processing nodes after the stage finishes processing the DB2 table. You
can specify a job parameter if required.
Default String Length. This is an optional property and is set to 32 by
default. Sets the default string length of variable-length strings written
to a DB2 table. Variable-length strings longer than the set length cause
an error.
The maximum length you can set is 4000 bytes. Note that the stage
always allocates the specified number of bytes for a variable-length
string. In this case, setting a value of 4000 allocates 4000 bytes for every
string. Therefore, you should set the expected maximum length of
your largest string and no larger.
Open Command. This is an optional property. Use it to specify any
command to be parsed and executed by the DB2 database on all
processing nodes before the DB2 table is opened. You can specify a job
parameter if required.
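For instance, you might use these properties to bracket the write with
simple housekeeping statements (a sketch only; the table names are
illustrative, and any SQL your DB2 version accepts can be used):

    -- Open command: take an exclusive lock before the bulk write
    LOCK TABLE sales IN EXCLUSIVE MODE;
    -- Close command: record that the load completed
    INSERT INTO load_audit VALUES (CURRENT TIMESTAMP);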
Use ASCII Delimited Format. This property only appears if Write
Method is set to Load. Specify this option to configure DB2 to use the
ASCII-delimited format for loading binary numeric data instead of the
default ASCII-fixed format.
This option can be useful when you have variable-length columns,
because the database will not have to allocate the maximum amount
of storage for each variable-length column. However, all numeric
columns are converted to an ASCII format by DB2, which is a
CPU-intensive operation. See the DB2 reference manuals for more
information.
Cleanup on Failure. This property only appears if Write Method is set
to Load. Specify this option to deal with failures during stage execution that leave the tablespace being loaded in an inaccessible state.
The cleanup procedure neither inserts data into the table nor deletes
data from it. You must delete rows that were inserted by the failed
execution either through the DB2 command-level interpreter or by
subsequently running the stage with the Replace or Truncate write
mode.
Message File. This property only appears if Write Method is set to
Load. Specifies the file where the DB2 loader writes diagnostic
information.
Outputs Page
The Outputs page allows you to specify details about how the DB2
stage reads data from a DB2 database. The DB2 stage can have only
one output link. Alternatively it can have a reference output link,
which is used by the Lookup stage when referring to a DB2 lookup
table. It can also have a reject link where rejected records are routed
(used in conjunction with an input link).
The General tab allows you to specify an optional description of the
output link. The Properties tab allows you to specify details of exactly
what the link does. The Columns tab specifies the column definitions
of incoming data.
Details about DB2 stage properties are given in the following sections.
See Chapter 3, Stage Editors, for a general description of the other
tabs.
The following table gives a quick reference list of the properties and
their attributes. A more detailed description of each property follows.
Category/Property | Values | Default | Mandatory? | Dependent of
Source/Lookup Type | Normal/Sparse | Normal | Y (if output is reference link connected to Lookup stage) | N/A
Source/Read Method | Table/Auto-generated SQL/User-defined SQL | Table | Y | N/A
Source/Table | string | N/A | Y (if Read Method = Table) | N/A
Source/Where clause | string | N/A | N | Table
Source/Select List | string | N/A | N | Table
Source/Query | string | N/A | Y (if Read Method = Query) | N/A
Source/Partition Table | string | N/A | N | Query
Connection/Use Database Environment Variable | True/False | True | Y | N/A
Connection/Override Database | string | N/A | Y (if Use Database Environment Variable = False) | N/A
Connection/Use Server Environment Variable | True/False | True | Y | N/A
Connection/Override Server | string | N/A | Y (if Use Server Environment Variable = False) | N/A
Options/Close Command | string | N/A | N | N/A
Options/Open Command | string | N/A | N | N/A
Options/Make Combinable | True/False | False | Y (if link is reference and Lookup Type = Sparse) | N/A
Source Category
Lookup Type. Where the DB2 stage is connected to a Lookup stage
via a reference link, this property specifies whether the DB2 stage will
provide data for an in-memory lookup (Lookup Type = Normal) or
whether the lookup will access the database directly (Lookup Type =
Sparse). If the Lookup Type is Normal, the Lookup stage can have
multiple reference links. If the Lookup Type is Sparse, the Lookup
stage can only have one reference link.
Read Method. Select Table to use the Table property to specify the
read (this is the default). Select Auto-generated SQL to have DataStage
automatically generate an SQL query based on the columns you have
defined and the table you specify in the Table property. Select
User-defined SQL to define your own query.
Query. This property is used to contain the SQL query when you
choose a Read Method of User-defined SQL or Auto-generated SQL.
If you are using Auto-generated SQL you must select a table and
specify some column definitions. An SQL statement can contain
joins, views, database links, synonyms, and so on. It has the following
dependent option:
Partition Table
Specifies execution of the query in parallel on the processing
nodes containing a partition derived from the named table. If
you do not specify this, the stage executes the query sequentially on a single node.
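A user-defined query can draw on several tables at once. A minimal
sketch (the table and column names are illustrative):

    -- Read order rows joined to their customer names
    SELECT o.order_id, o.amount, c.cust_name
    FROM orders o, customers c
    WHERE o.cust_id = c.cust_id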
Table. Specifies the name of the DB2 table. The table must exist and
you must have SELECT privileges on the table. If your DB2 user name
does not correspond to the owner of the specified table, you can prefix
it with a table owner in the form:
table_owner.table_name
If you use a Read Method of Table, the Table property has two
dependent properties:
Where clause
Allows you to specify a WHERE clause of the SELECT statement to specify the rows of the table to include or exclude from
the read operation. If you do not supply a WHERE clause, all
rows are read.
Select List
Allows you to specify an SQL select list of column names.
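Together, Table and its dependent properties determine the SELECT
statement that is run. For example (illustrative values only), a Table of
sales, a Select List of price, storeID, and a Where clause of price > 100
amount to:

    SELECT price, storeID FROM sales WHERE price > 100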
Connection Category
Use Server Environment Variable. This is set to True by default,
which causes the stage to use the setting of the DB2INSTANCE environment variable to derive the server. If you set this to False, you must
specify a value for the Override Server property.
Use Database Environment Variable. This is set to True by default,
which causes the stage to use the setting of the environment variable
APT_DBNAME, if defined, and DB2DBDFT otherwise to derive the
database. If you set this to False, you must specify a value for the
Override Database property.
Override Server. Optionally specifies the DB2 instance name for the
table. This property appears if you set the Use Server Environment
Variable property to False.
Override Database. Optionally specifies the name of the DB2 database to access. This property appears if you set the Use Database
Environment Variable property to False.
Options Category
Close Command. This is an optional property. Use it to specify any
command to be parsed and executed by the DB2 database on all
processing nodes after the stage finishes processing the DB2 table. You
can specify a job parameter if required.
Open Command. This is an optional property. Use it to specify any
command to be parsed and executed by the DB2 database on all
processing nodes before the DB2 table is opened. You can specify a job
parameter if required.
Make Combinable. Only applies to reference links where the Lookup
Type property has been set to sparse. Set to True to specify that the
lookup can be combined with its preceding and/or following process.
13
Oracle Stage
The Oracle stage is a database stage. It allows you to read data from and
write data to an Oracle database. It can also be used in conjunction with a
Lookup stage to access a lookup table hosted by an Oracle database (see
Chapter 20, Lookup Stage).
The Oracle stage can have a single input link and a single reject link, or a
single output link or output reference link.
When you edit an Oracle stage, the Oracle stage editor appears. This is
based on the generic stage editor described in Chapter 3, Stage Editors.
The stage editor has up to three pages, depending on whether you are
reading or writing a database:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is present when you are writing to an Oracle database. This is where you specify details about the data being
written.
Outputs page. This is present when you are reading from an Oracle
database, or performing a lookup on an Oracle database. This is
where you specify details about the data being read.
You need to be running Oracle 8 or later, Enterprise Edition, in order to
use the Oracle stage.
You must also have SELECT privilege on the following views:
DBA_EXTENTS
DBA_DATA_FILES
DBA_TAB_PARTITIONS
DBA_OBJECTS
ALL_PART_INDEXES
ALL_PART_TABLES
ALL_INDEXES
SYS.GV_$INSTANCE (only if Oracle Parallel Server is used)
Stage Page
The General tab allows you to specify an optional description of the stage.
The Advanced page allows you to specify how the stage executes.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the data is processed by the
available nodes as specified in the Configuration file, and by any
node constraints specified on the Advanced tab. In sequential mode
the entire write is processed by the conductor node.
Preserve partitioning. You can select Set or Clear. If you select Set,
read operations will request that the next stage preserves the partitioning as is (it is ignored for write operations).
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pool
or pools specified in the grid. The grid allows you to make choices
from drop-down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Inputs Page
The Inputs page allows you to specify details about how the Oracle stage
writes data to an Oracle database. The Oracle stage can have only one input
link writing to one table.
link writing to one table.
The General tab allows you to specify an optional description of the input
link. The Properties tab allows you to specify details of exactly what the
link does. The Partitioning tab allows you to specify how incoming data
is partitioned before being written to the database. The Columns tab specifies the column definitions of incoming data.
Details about Oracle stage properties, partitioning, and formatting are
given in the following sections. See Chapter 3, Stage Editors, for a
general description of the other tabs.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.

Category/Property | Values | Default | Mandatory? | Dependent of
Target/Table | string | N/A | Y (if Write Method = Load) | N/A
Target/Write Method | Upsert/Load | Load | Y | N/A
Target/Write Mode | Append/Create/Replace/Truncate | Append | Y (if Write Method = Load) | N/A
Target/Upsert method | Auto-generated Update & Insert/Auto-generated Update Only/User-defined Update & Insert/User-defined Update Only | Auto-generated Update & Insert | Y (if Write Method = Upsert) | N/A
Target/Insert SQL | string | N/A | Y (if Write Method = Upsert) | N/A
Target/Insert Array Size | number | 500 | N | Insert SQL
Target/Update SQL | string | N/A | Y (if Write Method = Upsert) | N/A
Connection/DB Options | string | N/A | Y | N/A
Connection/DB Options Mode | Auto-generate/User-defined | Auto-generate | Y | N/A
Connection/User | string | N/A | Y (if DB Options Mode = Auto-generate) | DB Options Mode
Connection/Password | string | N/A | Y (if DB Options Mode = Auto-generate) | DB Options Mode
Connection/Remote Server | string | N/A | N | N/A
Options/Output Reject Records | True/False | False | Y (if Write Method = Upsert) | N/A
Options/Silently Drop Columns Not in Table | True/False | False | Y (if Write Method = Load) | N/A
Options/Truncate Column Names | True/False | False | Y (if Write Method = Load) | N/A
Options/Close Command | string | N/A | N | N/A
Options/Default String Length | number | 32 | N | N/A
Options/Index Mode | Maintenance/Rebuild | N/A | N | N/A
Options/Add NOLOGGING clause to Index rebuild | True/False | False | N | Index Mode
Options/Add COMPUTE STATISTICS clause to Index rebuild | True/False | False | N | Index Mode
Options/Open Command | string | N/A | N | N/A
Options/Oracle 8 Partition | string | N/A | N | N/A
Target Category
Table. This only appears for the Load Write Method. Specify the name of
the table to write to. You can specify a job parameter if required.
Upsert method. This only appears for the Upsert write method. Allows
you to specify how the insert and update statements are to be derived.
Choose from:
Auto-generated Update & Insert. DataStage generates update and
insert statements for you, based on the values you have supplied
for table name and on column details. The statements can be
viewed by selecting the Insert SQL or Update SQL properties.
Auto-generated Update Only. DataStage generates an update
statement for you, based on the values you have supplied for table
name and on column details. The statement can be viewed by
selecting the Update SQL properties.
User-defined Update & Insert. Select this option to define your own
update and insert statements. Edit the statements in the Update SQL
and Insert SQL properties.
User-defined Update Only. Select this option to define your own
update statement. Edit the statement in the Update SQL property.
Connection Category
DB Options. Specify a user name and password for connecting to Oracle
in the form:
user=<user>,password=<password>[,arraysize=<num_records>]
Arraysize is only relevant to the Upsert Write Method.
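For example (an illustrative sketch only; scott/tiger is a placeholder
account, not a recommendation):

    user=scott,password=tiger,arraysize=1000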
DB Options Mode. If you select Auto-generate for this property,
DataStage will create a DB Options string for you. If you select
User-defined, you have to edit the DB Options property yourself. When
Auto-generate is selected, there are two dependent properties:
User
The user name to use in the auto-generated DB options string.
Password
The password to use in the auto-generated DB options string.
Remote Server. This is an optional property. Allows you to specify a
remote server name.
Options Category
Output Reject Records. This only appears for the Upsert write method.
It is False by default; set it to True to send rejected records to the reject link.
Silently Drop Columns Not in Table. This only appears for the Load
Write Method. It is False by default. Set to True to silently drop all input
columns that do not correspond to columns in an existing Oracle table.
Otherwise the stage reports an error and terminates the job.
Truncate Column Names. This only appears for the Load Write Method.
Set this property to True to truncate column names to 30 characters.
Close Command. This is an optional property and only appears for the
Load Write Method. Use it to specify any command, in single quotes, to be
parsed and executed by the Oracle database on all processing nodes after
the stage finishes processing the Oracle table. You can specify a job parameter if required.
Default String Length. This is an optional property and only appears for
the Load Write Method. It is set to 32 by default. Sets the default string
length of variable-length strings written to an Oracle table. Variable-length
strings longer than the set length cause an error.
The maximum length you can set is 2000 bytes. Note that the stage always
allocates the specified number of bytes for a variable-length string. In this
case, setting a value of 2000 allocates 2000 bytes for every string. Therefore,
you should set the expected maximum length of your largest string and no
larger.
Index Mode. This is an optional property and only appears for the Load
Write Method. Lets you perform a direct parallel load on an indexed table
without first dropping the index. You can choose either Maintenance or
Rebuild mode. The Index Mode property only applies to the Append and
Truncate write modes.
Rebuild skips index updates during table load and instead rebuilds the
indexes after the load is complete using the Oracle alter index rebuild
command. The table must contain an index, and the indexes on the table
must not be partitioned. The Rebuild option has two dependent
properties:
Add NOLOGGING clause to Index rebuild
This is False by default. Set it to True to add a NOLOGGING clause.
Add COMPUTE STATISTICS clause to Index rebuild
This is False by default. Set it to True to add a COMPUTE STATISTICS
clause.
Maintenance results in each table partition being loaded sequentially.
Because of the sequential load, the table index that exists before the table
is loaded is maintained after the table is loaded. The table must contain an
index and be partitioned, and the index on the table must be a local
range-partitioned index that is partitioned according to the same range values
that were used to partition the table. Note that in this case sequential means
sequential per partition, that is, the degree of parallelism is equal to the
number of partitions.
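In Rebuild mode, the statement issued after the load is essentially the
standard Oracle index rebuild command; with both dependent properties
set to True it would resemble the following (the index name is
illustrative):

    ALTER INDEX sales_idx REBUILD NOLOGGING COMPUTE STATISTICS;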
Open Command. This is an optional property and only appears for the
Load Write Method. Use it to specify any command, in single quotes, to be
parsed and executed by the Oracle database on all processing nodes before
the Oracle table is opened. You can specify a job parameter if required.
Oracle 8 Partition. This is an optional property and only appears for the
Load Write Method. Name of the Oracle 8 table partition that records will
be written to. The stage assumes that the data provided is for the partition
specified.
If the stage is collecting data, you can set a collection method from the
Collection type drop-down list. This will override the default collection method.
The following partitioning methods are available:
(Auto). DataStage attempts to work out the best partitioning
method depending on execution modes of current and preceding
stages, whether the Preserve Partitioning option has been set, and
how many nodes are specified in the Configuration file. This is the
default partitioning method for the Oracle stage.
Entire. Each file written to receives the entire data set.
Hash. The records are hashed into partitions based on the value of
a key column or columns selected from the Available list.
Modulus. The records are partitioned using a modulus function on
the key column selected from the Available list. This is commonly
used to partition on tag fields.
Random. The records are partitioned randomly, based on the
output of a random number generator.
Round Robin. The records are partitioned on a round robin basis
as they enter the stage.
Same. Preserves the partitioning already in place. This is the
default for Oracle stages.
DB2. Replicates the DB2 partitioning method of the specified
DB2 table.
Range. Divides a data set into approximately equal size partitions
based on one or more partitioning keys. Range partitioning is often
a preprocessing step to performing a total sort on a data set.
Requires extra properties to be set. Access these properties by
clicking the properties button.
The following Collection methods are available:
(Auto). DataStage attempts to work out the best collection method
depending on execution modes of current and preceding stages,
and how many nodes are specified in the Configuration file. This is
the default collection method for Oracle stages.
Ordered. Reads all records from the first partition, then all records
from the second partition, and so on.
Round Robin. Reads a record from the first input partition, then
from the second partition, and so on. After reaching the last partition, the operator starts over.
Sort Merge. Reads records in an order based on one or more
columns of the record. This requires you to select a collecting key
column from the Available list.
The Partitioning tab also allows you to specify that data arriving on the
input link should be sorted before being written to the database. The sort
is always carried out within data partitions. If the stage is partitioning
incoming data the sort occurs after the partitioning. If the stage is
collecting data, the sort occurs before the collection. The availability of
sorting depends on the partitioning method chosen.
Select the check boxes as follows:
Sort. Select this to specify that data coming in on the link should be
sorted. Select the column or columns to sort on from the Available
list.
Stable. Select this if you want to preserve previously sorted data
sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is
also set, the first record is retained.
You can also specify sort direction, case sensitivity, and collating sequence
for each column in the Selected list by selecting it and right-clicking to
invoke the shortcut menu.
Outputs Page
The Outputs page allows you to specify details about how the Oracle stage
reads data from an Oracle database. The Oracle stage can have only one
output link. Alternatively it can have a reference output link, which is
used by the Lookup stage when referring to an Oracle lookup table. It can
also have a reject link where rejected records are routed (used in conjunction with an input link). The Output Name drop-down list allows you to
choose whether you are looking at details of the main output link or the
reject link.
The General tab allows you to specify an optional description of the
output link. The Properties tab allows you to specify details of exactly
what the link does. The Columns tab specifies the column definitions of
incoming data.
Details about Oracle stage properties are given in the following sections.
See Chapter 3, Stage Editors, for a general description of the other tabs.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.

Category/Property | Values | Default | Mandatory? | Dependent of
Source/Lookup Type | Normal/Sparse | Normal | Y (if output is reference link connected to Lookup stage) | N/A
Source/Read Method | Table/Query | Table | Y | N/A
Source/Table | string | N/A | Y (if Read Method = Table) | N/A
Source/Where | string | N/A | N | Table
Source/Select List | string | N/A | N | Table
Source/Query | string | N/A | Y (if Read Method = Query) | N/A
Source/Partition Table | string | N/A | N | Query
Connection/DB Options | string | N/A | Y | N/A
Connection/DB Options Mode | Auto-generate/User-defined | Auto-generate | Y | N/A
Connection/User | string | N/A | Y (if DB Options Mode = Auto-generate) | DB Options Mode
Connection/Password | string | N/A | Y (if DB Options Mode = Auto-generate) | DB Options Mode
Connection/Remote Server | string | N/A | N | N/A
Options/Close Command | string | N/A | N | N/A
Options/Open Command | string | N/A | N | N/A
Options/Make Combinable | True/False | False | Y (if link is reference and Lookup Type = Sparse) | N/A
Source Category
Lookup Type. Where the Oracle stage is connected to a Lookup stage via
a reference link, this property specifies whether the Oracle stage will
provide data for an in-memory lookup (Lookup Type = Normal) or
whether the lookup will access the database directly (Lookup Type =
Sparse). If the Lookup Type is Normal, the Lookup stage can have multiple
reference links. If the Lookup Type is Sparse, the Lookup stage can only
have one reference link.
Read Method. This property specifies whether you are specifying a table
or a query when reading the Oracle database.
Query. Optionally allows you to specify an SQL query to read a table. The
query specifies the table and the processing that you want to perform on
the table as it is read by the stage. This statement can contain joins, views,
database links, synonyms, and so on. It has one dependent option,
Partition Table, described below.
Table. Specifies the name of the Oracle table. The table must exist and you
must have SELECT privileges on the table. If your Oracle user name does
not correspond to the owner of the specified table, you can prefix it with a
table owner in the form:
table_owner.table_name
Table has dependent properties:
Where
Stream links only. Specifies a WHERE clause of the SELECT statement to specify the rows of the table to include or exclude from the
read operation. If you do not supply a WHERE clause, all rows are
read.
Select List
Optionally specifies an SQL select list, enclosed in single quotes,
that can be used to determine which columns are read. You must
specify the columns in the list in the same order as the columns are
defined in the record schema of the input table.
Partition Table. This only appears for stream links. Specifies execution of
the SELECT in parallel on the processing nodes containing a partition
derived from the named table. If you do not specify this, the stage executes
the query sequentially on a single node.
Connection Category
DB Options. Specify a user name and password for connecting to Oracle
in the form:
user=<user>,password=<password>[,arraysize=<num_records>]
DB Options Mode. If you select Auto-generate for this property,
DataStage will create a DB Options string for you. If you select
User-defined, you have to edit the DB Options property yourself. When
Auto-generate is selected, there are two dependent properties:
User
The user name to use in the auto-generated DB options string.
Password
The password to use in the auto-generated DB options string.
Remote Server. This is an optional property. Allows you to specify a
remote server name.
Options Category
Close Command. This is an optional property and only appears for
stream links. Use it to specify any command to be parsed and executed by
the Oracle database on all processing nodes after the stage finishes
processing the Oracle table. You can specify a job parameter if required.
Open Command. This is an optional property and only appears for stream
links. Use it to specify any command to be parsed and executed by the
Oracle database on all processing nodes before the Oracle table is opened.
You can specify a job parameter if required.
Make Combinable. Only applies to reference links where the Lookup
Type property has been set to Sparse. Set to True to specify that the lookup
can be combined with its preceding and/or following process.
14
Teradata Stage
The Teradata stage is a database stage. It allows you to read data from and
write data to a Teradata database.
The Teradata stage can have a single input link or a single output link.
When you edit a Teradata stage, the Teradata stage editor appears. This is
based on the generic stage editor described in Chapter 3, Stage Editors.
The stage editor has up to three pages, depending on whether you are
reading or writing a database:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is present when you are writing to a Teradata
database. This is where you specify details about the data being
written.
Outputs page. This is present when you are reading from a Teradata database. This is where you specify details about the data
being read.
There are no special steps you need in order to ensure that the Teradata
stage can communicate with Teradata, other than ensuring that you have
/usr/lib in your path.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Advanced page allows you to specify how the stage executes.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the data is processed by the
available nodes as specified in the Configuration file, and by any
node constraints specified on the Advanced tab. In sequential mode
the entire write is processed by the conductor node.
Preserve partitioning. You can select Set or Clear. If you select Set,
read operations will request that the next stage preserves the partitioning as is.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pool
or pools specified in the grid. The grid allows you to make choices
from drop-down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Inputs Page
The Inputs page allows you to specify details about how the Teradata
stage writes data to a Teradata database. The Teradata stage can have only
one input link writing to one table.
The General tab allows you to specify an optional description of the input
link. The Properties tab allows you to specify details of exactly what the
link does. The Partitioning tab allows you to specify how incoming data
is partitioned before being written to the database. The Columns tab specifies the column definitions of incoming data.
Details about Teradata stage properties, partitioning, and formatting are
given in the following sections. See Chapter 3, Stage Editors, for a
general description of the other tabs.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.

Category/Property | Values | Default | Mandatory? | Dependent of
Target/Table | table name | N/A | Y | N/A
Target/Primary Index | columns list | N/A | N | Table
Target/Select List | list | N/A | N | Table
Target/Write Mode | Append/Create/Replace/Truncate | Append | Y | N/A
Connection/DB Options | string | N/A | Y | N/A
Connection/DB Options Mode | Auto-generate/User-defined | Auto-generate | Y | N/A
Connection/User | string | N/A | Y (if DB Options Mode = Auto-generate) | DB Options Mode
Connection/Password | string | N/A | Y (if DB Options Mode = Auto-generate) | DB Options Mode
Connection/Database | database name | N/A | N | N/A
Connection/Server | server name | N/A | Y | N/A
Options/Close Command | string | N/A | N | N/A
Options/Open Command | string | N/A | N | N/A
Options/Silently Drop Columns Not in Table | True/False | False | N | N/A
Options/Default String Length | number | 32 | N | N/A
Options/Truncate Column Names | True/False | False | N | N/A
Options/Progress Interval | number | 100000 | N | N/A
Target Category
Table. Specify the name of the table to write to. The table name must be a
valid Teradata table name. Table has two dependent properties:
Select List
Specifies a list that determines which columns are written. If you
do not supply the list, the Teradata stage writes to all columns. Do
not include formatting characters in the list.
Primary Index
Specify a comma-separated list of column names that will become
the primary index for tables. Format the list according to Teradata
standards and enclose it in single quotes.
For performance reasons, the data set should not be sorted on the
primary index. The primary index should not be a smallint, or a
column with a small number of values, or a high proportion of null
values. If no primary index is specified, the first column is used.
All the considerations noted above apply to this case as well.
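For example, setting Primary Index to 'cust_id, order_date' (an
illustrative value; note the single quotes and the Teradata list format)
would cause a table created by the stage to carry a primary index
roughly like this:

    CREATE TABLE sales (
        cust_id    INTEGER,
        order_date DATE,
        amount     DECIMAL(9,2)
    ) PRIMARY INDEX (cust_id, order_date);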
Connection Category
DB Options. Specify a user name and password for connecting to Teradata in the form:
user=<user>,password=<password>[,arraysize=<num_records>]
DB Options Mode. If you select Auto-generate for this property,
DataStage will create a DB Options string for you. If you select
User-defined, you have to edit the DB Options property yourself. When
Auto-generate is selected, there are two dependent properties:
User
The user name to use in the auto-generated DB options string.
Password
The password to use in the auto-generated DB options string.
Database. By default, the write operation is carried out in the default
database of the Teradata user whose profile is used. If no default database
is specified in that user's Teradata profile, the user name is the default
database. If you supply the database name, the database to which it refers
must exist and you must have the necessary privileges.
Server. Specify the name of a Teradata server.
Options Category
Close Command. Specify a Teradata command to be parsed and
executed by Teradata on all processing nodes after the table has been
populated.
Open Command. Specify a Teradata command to be parsed and executed
by Teradata on all processing nodes before the table is populated.
Silently Drop Columns Not in Table. Specifying True causes the stage to
silently drop all unmatched input columns; otherwise the job fails.
Write Mode. Select from the following:
Append. Appends new records to the table. The database user
must have TABLE CREATE privileges and INSERT privileges on
the table being written to. This is the default.
Create. Creates a new table. The database user must have TABLE
CREATE privileges. If a table exists of the same name as the one
you want to create, the data flow that contains the Teradata stage
terminates in error.
Replace. Drops the existing table and creates a new one in its place;
the database user must have TABLE CREATE and TABLE DELETE
privileges. If a table exists of the same name as the one you want to
create, it is overwritten.
Truncate. Retains the table attributes, including the table definition, but discards existing records and appends new ones. The
database user must have DELETE and INSERT privileges on the
table.
Default String Length. Specify the maximum length of variable-length
raw or string columns. The default length is 32 bytes. The upper bound is
slightly less than 32 KB.
Truncate Column Names. Specify whether the column names should be
truncated to 30 characters or not.
The Partitioning tab also allows you to specify that data arriving on the
input link should be sorted before being written to the database. The sort
is always carried out within data partitions. If the stage is partitioning
incoming data the sort occurs after the partitioning. If the stage is
collecting data, the sort occurs before the collection. The availability of
sorting depends on the partitioning method chosen.
Select the check boxes as follows:
Sort. Select this to specify that data coming in on the link should be
sorted. Select the column or columns to sort on from the Available
list.
Stable. Select this if you want to preserve previously sorted data
sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is
also set, the first record is retained.
You can also specify sort direction, case sensitivity, and collating sequence
for each column in the Selected list by selecting it and right-clicking to
invoke the shortcut menu.
Outputs Page
The Outputs page allows you to specify details about how the Teradata
stage reads data from a Teradata database. The Teradata stage can have
only one output link.
The General tab allows you to specify an optional description of the
output link. The Properties tab allows you to specify details of exactly
what the link does. The Columns tab specifies the column definitions of
incoming data.
Details about Teradata stage properties are given in the following sections.
See Chapter 3, Stage Editors, for a general description of the other tabs.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.
Category/Property | Values | Default | Mandatory? | Dependent of
Source/Read Method | Table/Auto-generated SQL/User-defined SQL | Table | Y | N/A
Source/Table | table name | N/A | Y (if Read Method = Table or Auto-generated SQL) | N/A
Source/Select List | list | N/A | N | Table
Source/Where Clause | filter | N/A | N | Table
Source/Query | SQL query | N/A | Y (if Read Method = User-defined SQL or Auto-generated SQL) | N/A
Connection/DB Options | string | N/A | Y | N/A
Connection/Database | database name | N/A | N | N/A
Connection/Server | server name | N/A | Y | N/A
Options/Close Command | string | N/A | N | N/A
Options/Open Command | string | N/A | N | N/A
Options/Progress Interval | number | 100000 | N | N/A
Source Category
Read Method. Select Table to use the Table property to specify the read
(this is the default). Select Auto-generated SQL to have DataStage
automatically generate an SQL query based on the columns you have
defined and the table you specify in the Table property. You must select the
Query property and select Generate from the right-arrow menu to actually generate the statement. Select User-defined SQL to define your own
query.
Table. Specifies the name of the Teradata table to read from. The table
must exist, and the user must have the necessary privileges to read it.
The Teradata stage reads the entire table, unless you limit its scope by
means of the Select List and/or Where suboptions:
Select List
Specifies a list of columns to read. The items of the list must appear
in the same order as the columns of the table.
Where Clause
Specifies selection criteria to be used as part of an SQL statement's
WHERE clause. Do not include formatting characters in the query.
These dependent properties are only available when you have specified a
Read Method of Table rather than Auto-generated SQL.
Query. This property is used to contain the SQL query when you choose a
Read Method of User-defined SQL or Auto-generated SQL. If you are
using Auto-generated SQL you must select a table and specify some
column definitions, then select Generate from the right-arrow menu to
have DataStage generate the query.
Connection Category
DB Options. Specify a user name and password for connecting to Teradata in the form:
user=<user>,password=<password>[,arraysize=<num_records>]
The default arraysize is 1000.
Options Category
Close Command. Optionally specifies a Teradata command to be run
once by Teradata on the conductor node after the query has completed.
Open Command. Optionally specifies a Teradata command run once by
Teradata on the conductor node before the query is initiated.
Progress Interval. By default, the stage displays a progress message for
every 100,000 records per partition it processes. Specify this option either
to change the interval or to disable the message. To change the interval,
specify a new number of records per partition. To disable the messages,
specify 0.
15
Informix XPS Stage
The Informix XPS stage is a database stage. It allows you to read data from
and write data to an Informix XPS database.
The Informix XPS stage can have a single input link or a single output link.
When you edit an Informix XPS stage, the Informix XPS stage editor
appears. This is based on the generic stage editor described in Chapter 3,
Stage Editors.
The stage editor has up to three pages, depending on whether you are
reading or writing a database:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is present when you are writing to an Informix
XPS database. This is where you specify details about the data
being written.
Outputs page. This is present when you are reading from an
Informix XPS database. This is where you specify details about the
data being read.
You must have the correct privileges and settings in order to use the
Informix XPS stage. You must have a valid account and appropriate privileges on the databases to which you connect.
You require read and write privileges on any table to which you connect,
and Resource privileges for using the Partition Table property on an
output link or using create and replace modes on an input link.
You must also configure access to Informix XPS before using the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Advanced page allows you to specify how the stage executes.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the data is processed by the
available nodes as specified in the Configuration file, and by any
node constraints specified on the Advanced tab. In sequential mode
the entire write is processed by the conductor node.
Preserve partitioning. You can select Set or Clear. If you select Set,
read operations will request that the next stage preserves the partitioning as is (it is ignored for write operations).
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pool
or pools specified in the grid. The grid allows you to make choices
from drop-down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Inputs Page
The Inputs page allows you to specify details about how the Informix XPS
stage writes data to an Informix XPS database. The stage can have only one
input link writing to one table.
The General tab allows you to specify an optional description of the input
link. The Properties tab allows you to specify details of exactly what the
link does. The Partitioning tab allows you to specify how incoming data
is partitioned before being written to the database. The Columns tab specifies the column definitions of incoming data.
Details about stage properties, partitioning, and formatting are given in
the following sections. See Chapter 3, Stage Editors, for a general
description of the other tabs.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.

Category/Property | Values | Default | Mandatory? | Dependent of
Target/Table | table name | N/A | Y | N/A
Target/Select List | list | N/A | N | Table
Target/Write Mode | Append/Create/Replace/Truncate | Append | Y | N/A
Connection/Database | database name | N/A | N | N/A
Options/Close Command | string | N/A | N | N/A
Options/Open Command | string | N/A | N | N/A
Options/Silently Drop Columns Not in Table | True/False | False | N | N/A
Options/Default String Length | string length | 32 | N | N/A
Target Category
Write Mode. Select from the following:
Append. Appends new records to the table. The database user
who writes in this mode must have Resource privileges. This is the
default mode.
Create. Creates a new table. The database user who writes in this
mode must have Resource privileges. The stage returns an error if
the table already exists.
Replace. Deletes the existing table and creates a new one in its
place. The database user who writes in this mode must have
Resource privileges.
Truncate. Retains the table attributes but discards existing records
and appends new ones. The stage will run more slowly in this
mode if the user does not have Resource privileges.
Table. Specify the name of the Informix XPS table to write to. It has a
dependent property:
Select List
Specifies a list that determines which columns are written. If you
do not supply the list, the stage writes to all columns.
Connection Category
Database. Specify the name of the Informix XPS database containing the
table specified by the Table property.
Option Category
Close Command. Specify an INFORMIX SQL statement to be parsed and
executed by Informix XPS on all processing nodes after the table has been
populated.
Open Command. Specify an INFORMIX SQL statement to be parsed and
executed by Informix XPS on all processing nodes before opening the
table.
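As with the other database stages, these commands can be any SQL the
database accepts; a minimal sketch (the statements and table name are
illustrative):

    -- Open command: wait for locks rather than failing immediately
    SET LOCK MODE TO WAIT;
    -- Close command: refresh optimizer statistics on the loaded table
    UPDATE STATISTICS FOR TABLE sales;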
Silently Drop Columns Not in Table. Use this property to cause the
stage to drop, with a warning, all input columns that do not correspond to
the columns of an existing table. If you do not specify drop, an unmatched
column generates an error and the associated step terminates.
Default String Length. Set the default length of string columns. If you do
not specify a length, the default is 32 bytes. You can specify a length up to
255 bytes.
(Auto). DataStage attempts to work out the best collection method
depending on execution modes of current and preceding stages,
and how many nodes are specified in the Configuration file. This is
the default collection method for Informix XPS stages.
Ordered. Reads all records from the first partition, then all records
from the second partition, and so on.
Round Robin. Reads a record from the first input partition, then
from the second partition, and so on. After reaching the last partition, the operator starts over.
Sort Merge. Reads records in an order based on one or more
columns of the record. This requires you to select a collecting key
column from the Available list.
The Partitioning tab also allows you to specify that data arriving on the
input link should be sorted before being written to the database. The sort
is always carried out within data partitions. If the stage is partitioning
incoming data the sort occurs after the partitioning. If the stage is
collecting data, the sort occurs before the collection. The availability of
sorting depends on the partitioning method chosen.
Select the check boxes as follows:
Sort. Select this to specify that data coming in on the link should be
sorted. Select the column or columns to sort on from the Available
list.
Stable. Select this if you want to preserve previously sorted data
sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is
also set, the first record is retained.
You can also specify sort direction, case sensitivity, and collating sequence
for each column in the Selected list by selecting it and right-clicking to
invoke the shortcut menu.
Outputs Page
The Outputs page allows you to specify details about how the Informix
XPS stage reads data from an Informix XPS database. The stage can have
only one output link.
The General tab allows you to specify an optional description of the
output link. The Properties tab allows you to specify details of exactly
what the link does. The Columns tab specifies the column definitions of
incoming data.
Details about Informix XPS stage properties are given in the following
sections. See Chapter 3, Stage Editors, for a general description of the
other tabs.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.

Category/Property | Values | Default | Mandatory? | Dependent of
Source/Read Method | Table/Auto-generated SQL/User-defined SQL | Table | Y | N/A
Source/Table | table name | N/A | Y (if Read Method = Table or Auto-generated SQL) | N/A
Source/Select List | list | N/A | N | Table
Source/Where Clause | filter | N/A | N | Table
Source/Partition Table | table | N/A | N | Table
Source/Query | SQL query | N/A | Y (if Read Method = User-defined SQL or Auto-generated SQL) | N/A
Connection/Database | database name | N/A | N | N/A
Connection/Server | server name | N/A | N | N/A
Options/Close Command | string | N/A | N | N/A
Options/Open Command | string | N/A | N | N/A
Source Category
Read Method. Select Table to use the Table property to specify the read
(this is the default). Select Auto-generated SQL to have DataStage
automatically generate an SQL query based on the columns you have
defined and the table you specify in the Table property. Select
User-defined SQL to define your own query.
Table. Specify the name of the Informix XPS table to read from. The table
must exist. You can prefix the table name with a table owner in the form:
table_owner.table_name.
Where Clause
Specify selection criteria to be used as part of an SQL statement's
WHERE clause, to specify the rows of the table to include in or
exclude from the data set.
Select List
Specifies a list that determines which columns are read. If you do
not supply the list, the stage reads all columns. Do not include
formatting characters in the list.
Partition Table
Specify this property if the table is fragmented to improve performance by creating one instance of the stage per table fragment. If
the table is fragmented across nodes, this property creates one
instance of the stage per fragment per node. If the table is fragmented and you do not specify this option, the stage nonetheless
functions successfully, if more slowly. You must have Resource
privilege to invoke this property.
These dependent properties are only available when you have specified a
Read Method of Table rather than Auto-generated SQL.
Query. This property is used to contain the SQL query when you choose a
Read Method of User-defined SQL or Auto-generated SQL. If you are
using Auto-generated SQL you must select a table and specify some
column definitions to have DataStage generate the query.
Connection Category
Database. The name of the Informix XPS database.
Server. The name of the Informix XPS server.
Options Category
Close Command. Optionally specify an INFORMIX SQL statement to be
parsed and executed on all processing nodes after the table selection or
query is completed.
Open Command. Optionally specify an INFORMIX SQL statement to be
parsed and executed by the database on all processing nodes before the
read query is prepared and executed.
16
Transformer Stage
The Transformer stage is an active stage. Transformer stages do not extract
data or write data to a target database. They are used to handle extracted
data, perform any conversions required, and pass data to another active
stage or a stage that writes data to a target database or file.
The Transformer stage can have a single input and any number of outputs.
It can also have two types of reject link:
Constraint reject. This is a link defined inside the Transformer
stage which takes any rows that have failed the constraint on all
other output links.
Failure reject. This link is defined outside the Transformer stage
and takes any rows which have not been written to any of the
output links by reason of a write failure.
Unlike most of the other stages in a Parallel Extender job, the Transformer
stage has its own user interface. It does not use the generic interface as
described in Chapter 3.
Toolbar
The Transformer toolbar contains buttons for: stage properties,
constraints, show all or selected relations, show/hide stage variables,
cut, copy, find/replace, column auto-match, input link execution order,
and output link execution order.
Link Area
The top area displays links to and from the Transformer stage, showing
their columns and the relationships between them.
The link area is where all column definitions and stage variables are
defined.
The link area is divided into two panes; you can drag the splitter bar
between them to resize the panes relative to one another. There is also a
horizontal scroll bar, allowing you to scroll the view left or right.
The left pane shows the input link, the right pane shows output links.
Output columns that have no derivation defined are shown in red.
Within the Transformer Editor, a single link may be selected at any one
time. When selected, the link's title bar is highlighted, and arrowheads
indicate any selected columns within that link.
If you select a link in the link area, its meta data tab is brought to the front
automatically.
You can edit the grids to change the column meta data on any of the links.
You can also add and delete meta data.
Shortcut Menus
The Transformer Editor shortcut menus are displayed by right-clicking the
links in the links area.
There are slightly different menus, depending on whether you right-click
an input link, an output link, or a stage variable. The input link menu
offers you operations on input columns, the output link menu offers you
operations on output columns and their derivations, and the stage variable menu offers you operations on stage variables.
The shortcut menu enables you to:
Open the Constraints dialog box to specify a constraint (only available for output links).
Open the Column Auto Match dialog box.
Display the Find/Replace dialog box.
Edit, validate, or clear a derivation, or stage variable.
Append a new column or stage variable to the selected link.
Select all columns on a link.
Insert or delete columns or stage variables.
Cut, copy, and paste a column or a key expression or a derivation
or stage variable.
If you display the menu from the links area background, you can:
Open the Stage Properties dialog box in order to specify stage or
link properties.
Open the Constraints dialog box in order to specify a constraint for
the selected output link.
Open the Link Execution Order dialog box in order to specify the
order in which links should be processed.
Toggle between viewing link relations for all links, or for the
selected link only.
Input Link
The input data source is joined to the Transformer stage via the input link.
Output Links
You can have any number of output links from your Transformer stage.
You may want to pass some data straight through the Transformer stage
unaltered, but it's likely that you'll want to transform data from some
input columns before outputting it from the Transformer stage.
You can specify such an operation by entering a transform expression. The
source of an output link column is defined in that column's Derivation cell
within the Transformer Editor. You can use the Expression Editor to enter
expressions in this cell. You can also simply drag an input column to an
output column's Derivation cell, to pass the data straight through the
Transformer stage.
In addition to specifying derivation details for individual output columns,
you can also specify constraints that operate on entire output links. A
constraint is an expression that specifies criteria that data must meet
before it can be passed to the output link. You can also specify a reject link,
which is an output link that carries all the data not output on other links,
that is, columns that have not met the criteria.
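Both constraints and derivations are written in the expression language
described in Appendix B. A minimal sketch (the link and column names
are illustrative, and it assumes the IsNotNull null-handling function
listed in Appendix B):

    Constraint on an output link:
        InLink.OrderValue > 0 And IsNotNull(InLink.CustId)
    Derivation of an output column:
        InLink.UnitPrice * InLink.Qty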
Each output link is processed in turn. If the constraint expression evaluates
to TRUE for an input row, the data row is output on that link. Conversely,
if the constraint expression evaluates to FALSE, the row is not output on
that link.
To use drag and drop:
1. Click the source cell to select it.
2.
Click the selected cell again and, without releasing the mouse button,
drag the mouse pointer to the desired location within the target link.
An insert point appears on the target link to indicate where the new
cell will go.
3. Release the mouse button to drop the selected cell.
You can drag and drop multiple columns, key expressions, or derivations.
Use the standard Explorer keys when selecting the source column cells,
then proceed as for a single cell.
You can drag and drop the full column set by dragging the link title.
You can add a column to the end of an existing derivation by holding
down the Ctrl key as you drag the column.
Press F3 to repeat the last search you made without opening the Find and
Replace dialog box.
output link. The output columns will have the same names as the
input columns from which they were derived.
If the output column already exists, you can drag or copy an input
column to the output column's Derivation field. This specifies that
the column is directly derived from an input column, with no
transformations performed.
You can use the column auto-match facility to automatically set
output columns to be derived from their matching input
columns.
You may need one output link column derivation to be the same as
another output link column derivation. In this case you can use
drag and drop or copy and paste to copy the derivation cell from
one column to another.
In many cases you will need to transform data before deriving an
output column from it. For these purposes you can use the Expression Editor. To display the Expression Editor, double-click on the
required output link column Derivation cell. (You can also invoke
the Expression Editor using the shortcut menu or the shortcut
keys.)
2. Choose the output link whose columns you want to match with the
input link's columns from the drop-down list.
3.
Click Location match or Name match from the Match type area.
If you choose Location match, this will set output column derivations
to the input link columns in the equivalent positions. It starts with the
first input link column going to the first output link column, and
works its way down until there are no more input columns left.
Note: Auto-matching does not take into account any data type incompatibility between matched columns; the derivations are set
regardless.
The Transformer Stage Properties dialog box appears with the Link
Ordering tab of the Stage page uppermost:
2. Use the arrow buttons to rearrange the list of links in the execution
order required.
3. When you are happy with the order, click OK.
You can show or hide the table by clicking the Stage Variable button in the
Transformer toolbar or choosing Stage Variable from the background shortcut menu.
Note: Stage variables are not shown in the output link meta data area at
the bottom of the right pane.
The table lists the stage variables together with the expressions used to
derive their values. Link lines join the stage variables with input columns
used in the expressions. Links from the right side of the table link the variables to the output columns that use them.
To declare a stage variable:
1. Do one of the following:
Select Insert New Stage Variable from the stage variable shortcut
menu. A new variable is added to the stage variables table in the
links pane. The variable is given a default name and the
default data type VarChar (255). You can edit these properties
using the Transformer Stage Properties dialog box, as described in
the next step.
Click the Stage Properties button on the Transformer toolbar.
Select Stage Properties from the background shortcut menu.
Select Stage Variable Properties from the stage variable shortcut
menu.
The Transformer Stage Properties dialog box appears:
2. Using the grid on the Variables page, enter the variable name, initial
value, SQL type, precision, scale, and an optional description. Variable names must begin with an alphabetic character (a-z, A-Z) and
can only contain alphanumeric characters (a-z, A-Z, 0-9).
3. Click OK. The new variable appears in the stage variable table in the
links pane.
You can perform most of the same operations on a stage variable as you can on
an output column (see page 16-9). A shortcut menu offers the same
commands. You cannot, however, paste a stage variable as a new column,
or a column as a new stage variable.
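A stage variable is evaluated once per input row and its value can then feed several output column derivations. A rough Python picture of the idea (all names are hypothetical):

    for row in [{"first": "Ada", "last": "Lovelace"}]:
        full_name = row["first"] + " " + row["last"]    # stage variable derivation
        out = {
            "Greeting": "Dear " + full_name,            # both output columns reuse
            "Label":    full_name.upper(),              # the same stage variable
        }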
Entering Expressions
Whenever the insertion point is in an expression box, you can use the
Expression Editor to suggest the next element in your expression. Do this
by right-clicking the box, or by clicking the Suggest button to the right of
the box. This opens the Suggest Operand or Suggest Operator menu.
Which menu appears depends on context, i.e., whether you should be
entering an operand or an operator as the next expression element. (The
Functions available from this menu are described in Appendix B.)
Suggest Operand Menu:
Stage Page
The Stage page has four tabs:
General. Allows you to enter an optional description of the stage.
Variables. Allows you to set up stage variables for use in the stage.
Advanced. Allows you to specify how the stage executes.
Link Ordering. Allows you to specify the order in which the
output links will be processed.
The Variables tab is described in Defining Local Stage Variables on
page 16-15. The Link Ordering tab is described in Specifying Link Order
on page 16-14.
Advanced Tab
The Advanced tab is the same as the Advanced tab of the generic stage
editor as described in Advanced Tab on page 3-5. This tab allows you to
specify the following:
Inputs Page
The Inputs page allows you to specify details about data coming into the
Transformer stage. The Transformer stage can have only one input link.
The General tab allows you to specify an optional description of the input
link. The Partitioning tab allows you to specify how incoming data is
partitioned. This is the same as the Partitioning tab in the generic stage
editor described in Partitioning Tab on page 3-11.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is
also set, the first record is retained.
You can also specify sort direction, case sensitivity, and collating sequence
for each column in the Selected list by selecting it and right-clicking to
invoke the shortcut menu.
Outputs Page
The Outputs Page has a General tab which allows you to enter an optional
description for each of the output links on the Transformer stage.
17
Aggregator Stage
The Aggregator stage is an active stage. It classifies data rows from a single
input link into groups and computes totals or other aggregate functions
for each group. The summed totals for each group are output from the
stage via an output link.
When you edit an Aggregator stage, the Aggregator stage editor appears.
This is based on the generic stage editor described in Chapter 3, Stage
Editors.
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify details about the data being
grouped and/or aggregated.
Outputs page. This is where you specify details about the groups
being output from the stage.
The Aggregator stage gives you access to grouping and summary operations. One of the easiest ways to expose patterns in a collection of records
is to group records with similar characteristics, then compute statistics on
all records in the group. You can then use these statistics to compare properties of the different groups. For example, records containing cash register
transactions might be grouped by the day of the week to see which day
had the largest number of transactions, the largest amount of revenue, etc.
Records can be grouped by one or more characteristics, where record characteristics correspond to column values. In other words, a group is a set of
records with the same value for one or more columns. For example, transaction records might be grouped by both day of the week and by month.
These groupings might show that the busiest day of the week varies by
season.
In addition to revealing patterns in your data, grouping can also reduce
the volume of data by summarizing the records in each group, making it
easier to manage. If you group a large volume of data on the basis of one
or more characteristics of the data, the resulting data set is generally much
smaller than the original and is therefore easier to analyze using standard
workstation or PC-based tools.
At a practical level, you should be aware that, in a parallel environment,
the way that you partition data before grouping and summarizing it can
affect the results. For example, if you partitioned using the round robin
method, records with identical values in the column you are grouping on
would end up in different partitions. If you then performed a sum operation within these partitions you would not be operating on all the relevant
rows. In such circumstances you may want to hash partition the data
on one or more of the grouping keys to ensure that your groups are
intact.
It is important that you bear these facts in mind and take any steps you
need to prepare your data set before presenting it to the Aggregator stage.
In practice this could mean you use Sort stages or additional Aggregate
stages in the job.
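A small Python sketch of this effect (the transaction rows, keyed by day of the week, are made up):

    from collections import defaultdict

    rows = [("Mon", 1), ("Mon", 2), ("Tue", 5), ("Mon", 4)]

    def partition_round_robin(rows, n):
        parts = [[] for _ in range(n)]
        for i, row in enumerate(rows):
            parts[i % n].append(row)
        return parts

    def partition_hash(rows, n, key=lambda r: r[0]):
        parts = [[] for _ in range(n)]
        for row in rows:
            parts[hash(key(row)) % n].append(row)
        return parts

    def sums_per_partition(parts):
        out = []
        for part in parts:
            totals = defaultdict(int)
            for day, amount in part:
                totals[day] += amount
            out.append(dict(totals))
        return out

    # Round robin scatters the "Mon" rows across partitions, so each partition's
    # sum for "Mon" is only partial; hash partitioning on the grouping key keeps
    # each group whole, so the per-partition sums are the true group totals.
    print(sums_per_partition(partition_round_robin(rows, 2)))
    print(sums_per_partition(partition_hash(rows, 2)))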
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties page lets you specify what the stage does. The Advanced
page allows you to specify how the stage executes.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. Some of the properties are mandatory, although
many have default settings. Properties without default settings appear in
the warning color (red by default) and turn black when you supply a value
for them.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Grouping Keys/Group | Input column | N/A | | | N/A
Aggregations/Aggregation Type | Calculate/Recalculate/Count rows | Calculate | | | N/A
Aggregations/Column for Calculation | Input column | N/A | | | N/A
Aggregations/Count Output Column | Output column | N/A | Y (if Aggregation Type = Count Rows) | | N/A
Aggregations/Summary Column for Recalculation | Input column | N/A | Y (if Aggregation Type = Rereduce) | | N/A
Aggregations/Corrected Sum of Squares | Output column | N/A | | | Column to Calculate & Summary Column for Recalculation
Aggregations/Maximum Value | Output column | N/A | | | Column to Calculate & Summary Column for Recalculation
Aggregations/Mean Value | Output column | N/A | | | Column to Calculate & Summary Column for Recalculation
Aggregations/Minimum Value | Output column | N/A | | | Column to Calculate & Summary Column for Recalculation
Aggregations/Missing Value | Output column | N/A | | | Column to Calculate
Aggregations/Missing Values Count | Output column | N/A | | | Column to Calculate & Summary Column for Recalculation
Aggregations/Non-missing Values Count | Output column | N/A | | | Column to Calculate & Summary Column for Recalculation
Aggregations/Percent Coefficient of Variation | Output column | N/A | | | Column to Calculate & Summary Column for Recalculation
Aggregations/Range | Output column | N/A | | | Column to Calculate & Summary Column for Recalculation
Aggregations/Standard Deviation | Output column | N/A | | | Column to Calculate & Summary Column for Recalculation
Aggregations/Standard Error | Output column | N/A | | | Column to Calculate & Summary Column for Recalculation
Aggregations/Sum of Weights | Output column | N/A | | | Column to Calculate & Summary Column for Recalculation
Aggregations/Sum | Output column | N/A | | | Column to Calculate & Summary Column for Recalculation
Aggregations/Summary | Output column | N/A | | | Column to Calculate & Summary Column for Recalculation
Aggregations/Uncorrected Sum of Squares | Output column | N/A | | | Column to Calculate & Summary Column for Recalculation
Aggregations/Variance | Output column | N/A | | | Column to Calculate & Summary Column for Recalculation
Aggregations/Variance divisor | Default/Nrecs | Default | | | Variance
Aggregations/Weighting column | Input column | N/A | | | Column to Recalculate or Count Output Column
Options/Method | hash/sort | hash | | | N/A
Options/Ignore Null Values | True/False | False | | | N/A
Aggregations Category
Aggregation Type. This property allows you to specify the type of aggregation operation your stage is performing. Choose from Calculate (the
default), Recalculate, and Count Rows.
Column for Calculation. The Calculate aggregate type allows you to
summarize the contents of a particular column or columns in your input
data set by applying one or more aggregate functions to it. Select the
column to be aggregated, then select dependent properties to specify the
operation to perform on it.
Options Category
Method. The Aggregator stage has two modes of operation: hash and sort.
Your choice of mode depends primarily on the number of groupings in the
input data set, taking into account the amount of memory available. You
typically use hash mode for a relatively small number of groups; generally,
fewer than about 1000 groups per megabyte of memory to be used.
When using hash mode, you should hash partition the input data set by
one or more of the grouping key columns so that all the records in the same
group are in the same partition (this happens automatically if (auto) is set
in the Partitioning tab). However, hash partitioning is not mandatory; you
can use any partitioning method you choose if keeping groups together in
a single partition is not important. For example, if you're summing records
in each partition and later you'll add the sums across all partitions, you
don't need all records in a group to be in the same partition to do this.
Note, though, that there will be multiple output records for each group.
If the number of groups is large, which can happen if you specify many
grouping keys, or if some grouping keys can take on many values, you
would normally use sort mode. However, sort mode requires the input
data set to have been partition sorted with all of the grouping keys specified as hashing and sorting keys (this happens automatically if (auto) is set
in the Partitioning tab). Sorting is, in effect, a pregrouping operation: after
sorting, all records in a given group are consecutive within the same
partition.
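The difference between the two modes can be sketched in Python (hash keeps a table of running totals per group; sort assumes group rows arrive consecutively):

    from collections import defaultdict
    from itertools import groupby

    rows = [("Mon", 1), ("Mon", 2), ("Tue", 5)]

    # Hash mode: one in-memory entry per group, input order irrelevant
    totals = defaultdict(int)
    for key, amount in rows:
        totals[key] += amount

    # Sort mode: input already sorted on the key, so each group can be
    # finished and emitted as soon as the key value changes
    sorted_totals = {k: sum(a for _, a in grp)
                     for k, grp in groupby(sorted(rows), key=lambda r: r[0])}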
Mean Value
Gives the mean value in the aggregate column and outputs it to the
specified output column.
Minimum Value
Gives the minimum value in the aggregate column and outputs it
to the specified output column.
Missing Value
This specifies what constitutes a missing value, for example -1 or
NULL. Enter the value as a floating point number. Not available
for Summary Column to Rereduce.
Missing Values Count
Counts the number of aggregate columns with missing values in
them and outputs the count to the specified output column. Not
available for rereduce.
Non-missing Values Count
Counts the number of aggregate columns with values in them and
outputs the count to the specified output column.
Percent Coefficient of Variation
Calculates the percent coefficient of variation for the aggregate
column and outputs it to the specified output column.
Range
Calculates the range of values in the aggregate column and outputs
it to the specified output column.
Standard Deviation
Calculates the standard deviation of values in the aggregate
column and outputs it to the specified output column.
Standard Error
Calculates the standard error of values in the aggregate column
and outputs it to the specified output column.
Sum of Weights
Calculates the sum of values in the weight column specified by the
Weight column property and outputs it to the specified output
column.
Sum
Sums the values in the aggregate column and outputs the sum to
the specified output column.
Summary
Specifies a subrecord to write the results of the reduce or rereduce
operation to.
Uncorrected Sum of Squares
Produces an uncorrected sum of squares for data in the aggregate
column and outputs it to the specified output column.
Variance
Calculates the variance for the aggregate column and outputs it
to the specified output column. This has a dependent
property:
Variance divisor
Specifies the variance divisor. By default, DataStage uses a value of
the number of records in the group minus the number of records
with missing values minus 1 to calculate the variance. This corresponds to a vardiv setting of Default. If you specify NRecs,
DataStage uses the number of records in the group minus the
number of records with missing values instead.
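A minimal worked example of the two divisors (plain Python; the group values are made up):

    values = [4.0, 6.0, None, 8.0]                 # one group; None marks a missing value
    present = [v for v in values if v is not None]
    n, nmiss = len(values), values.count(None)
    mean = sum(present) / len(present)
    ss = sum((v - mean) ** 2 for v in present)
    var_default = ss / (n - nmiss - 1)             # Variance divisor = Default
    var_nrecs = ss / (n - nmiss)                   # Variance divisor = NRecs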
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data set is processed
by the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Inputs Page
The Inputs page allows you to specify details about the incoming data set.
The General tab allows you to specify an optional description of the input
link. The Partitioning tab allows you to specify how incoming data is
partitioned before being grouped and/or summarized. The Columns tab
specifies the column definitions of incoming data.
Details about Aggregator stage partitioning are given in the following
section. See Chapter 3, Stage Editors, for a general description of the
other tabs.
Ordered. Reads all records from the first partition, then all records
from the second partition, and so on.
Round Robin. Reads a record from the first input partition, then
from the second partition, and so on. After reaching the last partition, the operator starts over.
Sort Merge. Reads records in an order based on one or more
columns of the record. This requires you to select a collecting key
column from the Available list.
The Partitioning tab also allows you to specify that data arriving on the
input link should be sorted before being grouped and/or summarized. The
sort is always carried out within data partitions. If the stage is partitioning
incoming data the sort occurs after the partitioning. If the stage is
collecting data, the sort occurs before the collection. The availability of
sorting depends on the partitioning method chosen.
Select the check boxes as follows:
Sort. Select this to specify that data coming in on the link should be
sorted. Select the column or columns to sort on from the Available
list.
Stable. Select this if you want to preserve previously sorted data
sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is
also set, the first record is retained.
You can also specify sort direction, case sensitivity, and collating sequence
for each column in the Selected list by selecting it and right-clicking to
invoke the shortcut menu.
Outputs Page
The Outputs page allows you to specify details about data output from the
Aggregator stage. The Aggregator stage can have only one output link.
The General tab allows you to specify an optional description of the
output link. The Columns tab specifies the column definitions of incoming
data. The Mapping tab allows you to specify the relationship between the
processed data being produced by the Aggregator stage and the Output
columns.
Mapping Tab
For the Aggregator stage the Mapping tab allows you to specify how the
output columns are derived, i.e., what input columns map onto them or
how they are generated.
The left pane shows the input columns and/or the generated columns.
These are read only and cannot be modified on this tab.
The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it
in by dragging columns over from the left pane, or by using the Auto-match facility.
In the above example the left pane represents the data after it has been
grouped and summarized. The Expression field shows how the column
has been derived. The right pane represents the data being output by the
stage after the grouping and summarizing. In this example ocol1 carries
the value of the key field on which the data was grouped (for example, if
you were grouping by date it would contain each date grouped on).
Column ocol2 carries the mean of all the col2 values in the group, ocol4 the
minimum value, and ocol3 the sum.
18
Join Stage
The Join stage is an active stage. It performs join operations on two or
more data sets input to the stage and then outputs the resulting data set.
The input data sets are notionally identified as the right set and the
left set, and intermediate sets. You can specify which is which. It has
any number of input links and a single output link.
The stage can perform one of four join operations:
Inner transfers records from input data sets whose key columns
contain equal values to the output data set. Records whose key
columns do not contain equal values are dropped.
Left outer transfers all values from the left data set but transfers
values from the right data set and intermediate data sets only
where key columns match. The operator drops the key column
from the right data set.
Right outer transfers all values from the right data set and transfers values from the left data set and intermediate data sets only
where key columns match. The operator drops the key column
from the left data set.
Full outer transfers records in which the contents of the key
columns are equal from the left and right input data sets to the
output data set. It also transfers records whose key columns
contain unequal values from both input data sets to the output
data set. (Full outer joins do not support more than two input
links.)
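The four operations can be pictured with a small Python sketch (a hypothetical two-input case, keyed on a single column):

    left = {1: "L1", 2: "L2"}     # key -> left row
    right = {2: "R2", 3: "R3"}    # key -> right row

    inner       = {k: (left[k], right[k]) for k in left.keys() & right.keys()}
    left_outer  = {k: (left[k], right.get(k)) for k in left}
    right_outer = {k: (left.get(k), right[k]) for k in right}
    full_outer  = {k: (left.get(k), right.get(k)) for k in left.keys() | right.keys()}
    # inner keeps key 2 only; the outer joins pad the non-matching side with None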
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify details about the data sets
being joined.
Outputs page. This is where you specify details about the joined
data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties page lets you specify what the stage does. The Advanced
page allows you to specify how the stage executes. The Link Ordering tab
allows you to specify which of the input links is the right link and which
is the left link and which are intermediate.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. Some of the properties are mandatory, although
many have default settings. Properties without default settings appear in
the warning color (red by default) and turn black when you supply a value
for them.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Join Keys/Key | Input column | N/A | | | N/A
Join Keys/Case Sensitive | True/False | True | | | Key
Options/Join Type | Full Outer/Inner/Left Outer/Right Outer | Inner | | | N/A
Join Keys Category
Key. Choose the input column you want to join on. The column must
appear in each input data set with the same
name and compatible data types. If, for example, you select a column
called name from the left link, the stage will expect there to be an
equivalent column called name on the right link.
You can join on multiple key columns. To do so, repeat the Key property.
Key has a dependent property:
Case Sensitive
Use this to specify whether each group key is case sensitive or not.
This is set to True by default; i.e., the values CASE and case
would not be judged equivalent.
Options Category
Join Type. Specify the type of join operation you want to perform.
Choose one of:
Full Outer
Inner
Left Outer
Right Outer
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Preserve partitioning. This is Propagate by default. It adopts the
setting which results from ORing the settings of the input stages,
i.e., if either of the input stages uses Set then this stage will use Set.
You can explicitly select Set or Clear. Select Set to request that the
next stage in the job attempts to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pool
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Link Ordering
This tab allows you to specify which input link is regarded as the left link
and which link is regarded as the right link, and which links are regarded
as intermediate. By default the first link you add is regarded as the left
link, and the last one as the right link, with all other links labelled as
Intermediate N. You can use this tab to override the default order.
In the example, DSLink4 is the left link; click on it to select it, then click on
the down arrow to convert it into the right link.
Inputs Page
The Inputs page allows you to specify details about the incoming data
sets. Choose an input link from the Input name drop down list to specify
which link you want to work on.
The General tab allows you to specify an optional description of the input
link. The Partitioning tab allows you to specify how incoming data is
partitioned before being joined. The Columns tab specifies the column
definitions of incoming data.
Details about Join stage partitioning are given in the following section.
See Chapter 3, Stage Editors, for a general description of the other tabs.
If the Join stage is set to execute in sequential mode, but the preceding
stage is executing in parallel, then you can set a collection method from the
Collection type drop-down list. This will override the default collection
method.
The following partitioning methods are available:
(Auto). DataStage attempts to work out the best partitioning
method depending on execution modes of current and preceding
stages, whether the Preserve Partitioning flag has been set on the
previous stage in the job, and how many nodes are specified in the
Configuration file. This is the default partitioning method for the Join
stage.
Entire. Each file written to receives the entire data set.
Hash. The records are hashed into partitions based on the value of
a key column or columns selected from the Available list.
Modulus. The records are partitioned using a modulus function on
the key column selected from the Available list. This is commonly
used to partition on tag fields.
Random. The records are partitioned randomly, based on the
output of a random number generator.
Round Robin. The records are partitioned on a round robin basis
as they enter the stage.
Same. Preserves the partitioning already in place.
DB2. Replicates the DB2 partitioning method of a specific DB2
table. Requires extra properties to be set. Access these properties
by clicking the properties button.
Range. Divides a data set into approximately equal size partitions
based on one or more partitioning keys. Range partitioning is often
a preprocessing step to performing a total sort on a data set.
Requires extra properties to be set. Access these properties by
clicking the properties button.
The following Collection methods are available:
(Auto). DataStage attempts to work out the best collection method
depending on execution modes of current and preceding stages,
and how many nodes are specified in the Configuration file. This is
the default collection method for Join stages.
Ordered. Reads all records from the first partition, then all records
from the second partition, and so on.
Round Robin. Reads a record from the first input partition, then
from the second partition, and so on. After reaching the last partition, the operator starts over.
Sort Merge. Reads records in an order based on one or more
columns of the record. This requires you to select a collecting key
column from the Available list.
The Partitioning tab also allows you to specify that data arriving on the
input link should be sorted before being joined. The sort is always carried
out within data partitions. If the stage is partitioning incoming data the
sort occurs after the partitioning. If the stage is collecting data, the sort
occurs before the collection. The availability of sorting depends on the
partitioning method chosen.
Select the check boxes as follows:
Sort. Select this to specify that data coming in on the link should be
sorted. Select the column or columns to sort on from the Available
list.
Stable. Select this if you want to preserve previously sorted data
sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is
also set, the first record is retained.
You can also specify sort direction, case sensitivity, and collating sequence
for each column in the Selected list by selecting it and right-clicking to
invoke the shortcut menu.
Outputs Page
The Outputs page allows you to specify details about data output from the
Join stage. The Join stage can have only one output link.
The General tab allows you to specify an optional description of the
output link. The Columns tab specifies the column definitions of incoming
data. The Mapping tab allows you to specify the relationship between the
columns being input to the Join stage and the Output columns.
Details about Join stage mapping are given in the following section. See
Chapter 3, Stage Editors, for a general description of the other tabs.
Mapping Tab
For Join stages the Mapping tab allows you to specify how the output
columns are derived, i.e., what input columns map onto them or how they
are generated.
The left pane shows the input columns and/or the generated columns.
These are read only and cannot be modified on this tab.
The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it
in by dragging input columns over, or by using the Auto-match facility.
In the above example the left pane represents the data after it has been
joined. The Expression field shows how the column has been derived, the
Column Name shows the column after it has been renamed by the join
operation. The right pane represents the data being output by the stage
after the join. In this example the data has been mapped straight across.
19
Funnel Stage
The Funnel stage is an active stage. It copies multiple input data sets to a
single output data set. This operation is useful for combining separate data
sets into a single large data set. The stage can have any number of input
links and a single output link.
The Funnel stage can operate in one of three modes:
Funnel combines the records of the input data in no guaranteed
order. It uses a round robin method to transfer data from input
links to output link, i.e., it takes one record from each input link in
turn.
Sort Funnel combines the input records in the order defined by the
value(s) of one or more key columns and the order of the output
records is determined by these sorting keys.
Sequence copies all records from the first input data set to the
output data set, then all the records from the second input data set,
and so on.
For all methods the meta data of all input data sets must be identical.
The sort funnel method has some particular requirements about its input
data. All input data sets must be sorted by the same key columns as used
by the Funnel operation.
Typically all input data sets for a sort funnel operation are hash-partitioned before they're sorted (choosing the (auto) partitioning method will
ensure that this is done). Hash partitioning guarantees that all records
with the same key column values are located in the same partition and so
are processed on the same node. If sorting and partitioning are carried out
on separate stages before the Funnel stage, this partitioning must be
preserved.
The sortfunnel operation allows you to set one primary key and multiple
secondary keys. The Funnel stage first examines the primary key in each
input record. For multiple records with the same primary key value, it
then examines secondary keys to determine the order of records it will
output.
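The three modes can be sketched in a few lines of Python (two tiny, already-sorted input data sets; the values are made up):

    import heapq
    from itertools import chain, zip_longest

    inputs = [[("a", 1), ("c", 2)], [("b", 3)]]   # two sorted input data sets

    # Funnel: round robin across the inputs, no guaranteed overall order
    funnel = [r for batch in zip_longest(*inputs) for r in batch if r is not None]

    # Sequence: all of the first input, then all of the second, and so on
    sequence = list(chain.from_iterable(inputs))

    # Sort Funnel: merge the already-sorted inputs on the key column(s)
    sort_funnel = list(heapq.merge(*inputs, key=lambda r: r[0]))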
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify details about the data sets
being combined.
Outputs page. This is where you specify details about the
combined data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced tab
allows you to specify how the stage executes. The Link Ordering tab
allows you to specify which order the input links are processed in.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. Some of the properties are mandatory, although
many have default settings. Properties without default settings appear in
the warning color (red by default) and turn black when you supply a value
for them.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Options/Funnel Type | Funnel/Sequence/Sort funnel | Funnel | | | N/A
Sorting Keys/Key | Input column | N/A | Y (if Funnel Type = Sort Funnel) | Y | N/A
Sorting Keys/Sort Order | Ascending/Descending | Ascending | Y (if Funnel Type = Sort Funnel) | N | Key
Sorting Keys/Nulls position | First/Last | First | Y (if Funnel Type = Sort Funnel) | N | Key
Sorting Keys/Case Sensitive | True/False | True | | | Key
Sorting Keys/Character Set | ASCII/EBCDIC | ASCII | | | Key
Options Category
Funnel Type. Specifies the type of Funnel operation. Choose from:
Funnel
Sequence
Sort Funnel
The default is Funnel.
Nulls position
This property appears for the Sort Funnel type. By default columns
containing null values appear first in the funneled data set. To
specify that columns containing null values appear last in the
funneled data set, select Last.
Character Set
By default data is represented in the ASCII character set. To represent data in the EBCDIC character set, choose EBCDIC.
Case Sensitive
Use this to specify whether each key is case sensitive or not. This is
set to True by default; i.e., the values CASE and case would
not be judged equivalent.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Preserve partitioning. This is Propagate by default. It adopts the
setting which results from ORing the settings of the input stages,
i.e., if any of the input stages uses Set then this stage will use Set.
You can explicitly select Set or Clear. Select Set to request that the
next stage in the job attempts to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pool
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Link Ordering
This tab allows you to specify the order in which links input to the
Funnel stage are processed. This is only relevant if you have chosen
the Sequence Funnel Type.
By default the input links will be processed in the order they were
added. To rearrange them, choose an input link and click the up
arrow button or the down arrow button.
Inputs Page
The Inputs page allows you to specify details about the incoming data
sets. Choose an input link from the Input name drop down list to specify
which link you want to work on.
The General tab allows you to specify an optional description of the input
link. The Partitioning tab allows you to specify how incoming data is
partitioned before being funneled. The Columns tab specifies the column
definitions of incoming data.
Details about Funnel stage partitioning are given in the following section.
See Chapter 3, Stage Editors, for a general description of the other tabs.
Hash. The records are hashed into partitions based on the value of
a key column or columns selected from the Available list.
Modulus. The records are partitioned using a modulus function on
the key column selected from the Available list. This is commonly
used to partition on tag fields.
Random. The records are partitioned randomly, based on the
output of a random number generator.
Round Robin. The records are partitioned on a round robin basis
as they enter the stage.
Same. Preserves the partitioning already in place.
DB2. Replicates the DB2 partitioning method of a specific DB2
table. Requires extra properties to be set. Access these properties
by clicking the properties button.
Range. Divides a data set into approximately equal size partitions
based on one or more partitioning keys. Range partitioning is often
a preprocessing step to performing a total sort on a data set.
Requires extra properties to be set. Access these properties by
clicking the properties button.
The following Collection methods are available:
(Auto). DataStage attempts to work out the best collection method
depending on execution modes of current and preceding stages,
and how many nodes are specified in the Configuration file. This is
the default collection method for Funnel stages.
Ordered. Reads all records from the first partition, then all records
from the second partition, and so on.
Round Robin. Reads a record from the first input partition, then
from the second partition, and so on. After reaching the last partition, the operator starts over.
Sort Merge. Reads records in an order based on one or more
columns of the record. This requires you to select a collecting key
column from the Available list.
The Partitioning tab also allows you to specify that data arriving on the
input link should be sorted before being funneled. The sort is always
carried out within data partitions. If the stage is partitioning incoming
data the sort occurs after the partitioning. If the stage is collecting data, the
sort occurs before the collection.
If you are using the Sort Funnel method, and haven't sorted the data in a
previous stage, you should sort it here using the same keys that the data is
hash partitioned on and funneled on. The availability of sorting depends
on the partitioning method chosen.
Select the check boxes as follows:
Sort. Select this to specify that data coming in on the link should be
sorted. Select the column or columns to sort on from the Available
list.
Stable. Select this if you want to preserve previously sorted data
sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is
also set, the first record is retained.
You can also specify sort direction, case sensitivity, and collating sequence
for each column in the Selected list by selecting it and right-clicking to
invoke the shortcut menu.
Outputs Page
The Outputs page allows you to specify details about data output from the
Funnel stage. The Funnel stage can have only one output link.
The General tab allows you to specify an optional description of the
output link. The Columns tab specifies the column definitions of incoming
data. The Mapping tab allows you to specify the relationship between the
columns being input to the Funnel stage and the Output columns.
Details about Funnel stage mapping are given in the following section. See
Chapter 3, Stage Editors, for a general description of the other tabs.
Mapping Tab
For Funnel stages the Mapping tab allows you to specify how the output
columns are derived, i.e., what input columns map onto them or how they
are generated.
The left pane shows the input columns. These are read only and cannot be
modified on this tab. It is a requirement of the Funnel stage that all input
links have identical meta data, so only one set of column definitions is
shown.
The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill
it in by dragging input columns over, or by using the Auto-match facility.
In the above example the left pane represents the incoming data after it has
been funneled. The right pane represents the data being output by the
stage after the funnel operation. In this example the data has been mapped
straight across.
20
Lookup Stage
The Lookup stage is an active stage. It is used to perform lookup operations on a lookup table contained in a Lookup File Set stage (see Chapter 7,
Lookup File Set Stage) or provided by one of the database stages that
support reference output links (see Chapter 12 and Chapter 13). It can also
perform a look up on a data set read into memory from any other Parallel
job stage that can output data.
The Lookup stage can have a reference link, a single input link, a single
output link, and a single rejects link. Depending upon the type and setting
of the stage(s) providing the look up information, it can have multiple
reference links (where it is directly looking up a DB2 table or Oracle table,
it can only have a single reference link).
The input link carries the data from the source data set and is known as the
primary link.
For each record of the source data set from the input link, the Lookup stage
performs a table lookup on each of the lookup tables attached by reference
links. The table lookup is based on the values of a set of lookup key columns,
one set for each table. For in-memory look ups, the keys are defined on the
Lookup stage (in the Inputs page Properties tab). For lookups of data
accessed through other stages, the keys are defined in that stage (i.e., the
Lookup File Set stage, the Oracle and DB2 stages in sparse lookup mode).
Each record of the output data set contains all of the columns from a source
record plus columns from all the corresponding lookup records where
corresponding source and lookup records have the same value for the
lookup key columns.
The optional reject link carries source records that do not have a corresponding entry in the input lookup tables.
For example, you could have an input data set carrying names and
addresses of your U.S. customers. The data as presented identifies state as
a two-letter U.S. state postal code, but you want the data to carry the full
name of the state. You could define a lookup table that carries a list of
codes matched to states, defining the code as the key column. As the
Lookup stage reads each line, it uses the key to look up the state in the
lookup table. It adds the state to a new column defined for the output link,
and so the full state name is added to each address. If any state codes have
been incorrectly entered in the data set, the code will not be found in the
lookup table, and so that record will be rejected.
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify details about the incoming
data and the reference links.
Outputs page. This is where you specify details about the data
being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced tab
allows you to specify how the stage executes. The Link Ordering tab
allows you to specify which order the input links are processed in.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. Some of the properties are mandatory, although
many have default settings. Properties without default settings appear in
the warning color (red by default) and turn black when you supply a value
for them.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Options/If Not Found | Fail/Continue/Drop/Reject | Fail | | | N/A
Options Category
If Not Found. This property specifies the action to take if the lookup value
is not found in the lookup table. Choose from:
Fail. The default; failure to find a value in the lookup table or tables
causes the job to fail.
Continue. The stage adds the offending record to its output and
continues.
Drop. The stage drops the offending record and continues.
Reject. The offending record is sent to the reject link.
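A compact Python sketch of the lookup and the four If Not Found behaviors (the table contents and column names are made up):

    lookup = {"CA": "California", "NY": "New York"}    # code -> full state name

    def process(rows, if_not_found="fail"):
        out, rejects = [], []
        for row in rows:
            match = lookup.get(row["state"])
            if match is not None:
                out.append({**row, "state_name": match})
            elif if_not_found == "fail":
                raise RuntimeError("lookup failed for %r" % row)
            elif if_not_found == "continue":
                out.append({**row, "state_name": None})   # record passes through
            elif if_not_found == "drop":
                pass                                      # record is discarded
            elif if_not_found == "reject":
                rejects.append(row)                       # sent to the reject link
        return out, rejects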
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Preserve partitioning. This is Propagate by default. It adopts the
setting of the previous stage on the stream link. You can explicitly
select Set or Clear. Select Set to request that the next stage in the job
should attempt to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pool
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Link Ordering
This tab allows you to specify which input link is the primary link and the
order in which the reference links are processed.
By default the input links will be processed in the order they were added.
To rearrange them, choose an input link and click the up arrow button or
the down arrow button.
Inputs Page
The Inputs page allows you to specify details about the incoming data set
and the reference links. Choose a link from the Input name drop down list
to specify which link you want to work on.
The General tab allows you to specify an optional description of the link.
The Partitioning tab allows you to specify how incoming data on the
source data set link is partitioned. The Columns tab specifies the column
definitions of incoming data.
Details about Lookup stage partitioning are given in the following
section. See Chapter 3, Stage Editors, for a general description of the
other tabs.
The following table gives a quick reference list of the input link properties
and their attributes.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Lookup Keys/Key | Input column | N/A | | | N/A
Lookup Keys/Case Sensitive | True/False | True | | | Key
Options/Allow Duplicates | True/False | False | | | N/A
Options/Diskpool | string | N/A | | | N/A
Options/Save to Lookup File Set | pathname | N/A | | | N/A
Options Category
Allow Duplicates. Set this to cause multiple copies of duplicate records to
be saved in the lookup table without a warning being issued. Two lookup
records are duplicates when all lookup key columns have the same value
in the two records. If you do not specify this option, DataStage issues a
warning message when it encounters duplicate records and discards all
but the first of the matching records.
Diskpool. This is an optional property. Specify the name of the disk pool
into which to write the table or file set. You can also specify a job
parameter.
Save to Lookup File Set. Allows you to specify a lookup file set to save
the look up data.
If the Lookup stage is set to execute in sequential mode, but the preceding
stage is executing in parallel, then you can set a collection method from the
Collection type drop-down list. This will override the default auto collection method.
The following partitioning methods are available:
(Auto). DataStage attempts to work out the best partitioning
method depending on execution modes of current and preceding
stages, whether the Preserve Partitioning flag has been set on the
previous stage in the job, and how many nodes are specified in the
Configuration file. This is the default method for the Lookup stage.
Entire. Each file written to receives the entire data set.
Hash. The records are hashed into partitions based on the value of
a key column or columns selected from the Available list.
Modulus. The records are partitioned using a modulus function on
the key column selected from the Available list. This is commonly
used to partition on tag fields.
Random. The records are partitioned randomly, based on the
output of a random number generator.
Round Robin. The records are partitioned on a round robin basis
as they enter the stage.
Same. Preserves the partitioning already in place.
DB2. Replicates the DB2 partitioning method of a specific DB2
table. Requires extra properties to be set. Access these properties
by clicking the properties button.
Range. Divides a data set into approximately equal size partitions
based on one or more partitioning keys. Range partitioning is often
a preprocessing step to performing a total sort on a data set.
Requires extra properties to be set. Access these properties by
clicking the properties button.
The following Collection methods are available:
(Auto). DataStage attempts to work out the best collection method
depending on execution modes of current and preceding stages,
and how many nodes are specified in the Configuration file. This is
the default collection method for the Lookup stage.
Ordered. Reads all records from the first partition, then all records
from the second partition, and so on.
Round Robin. Reads a record from the first input partition, then
from the second partition, and so on. After reaching the last partition, the operator starts over.
Sort Merge. Reads records in an order based on one or more
columns of the record. This requires you to select a collecting key
column from the Available list.
The Partitioning tab also allows you to specify that data arriving on the
input link should be sorted before the lookup is performed. The sort is
always carried out within data partitions. If the stage is partitioning
incoming data the sort occurs after the partitioning. If the stage is
collecting data, the sort occurs before the collection. The availability of
sorting depends on the partitioning method chosen.
Select the check boxes as follows:
Sort. Select this to specify that data coming in on the link should be
sorted. Select the column or columns to sort on from the Available
list.
Stable. Select this if you want to preserve previously sorted data
sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is
also set, the first record is retained.
You can also specify sort direction, case sensitivity, and collating sequence
for each column in the Selected list by selecting it and right-clicking to
invoke the shortcut menu.
Outputs Page
The Outputs page allows you to specify details about data output from the
Lookup stage. The Lookup stage can have only one output link. It can also
have a single reject link, where records can be sent if the lookup fails. The
Output Link drop-down list allows you to choose whether you are
looking at details of the main output link (the stream link) or the reject link.
The General tab allows you to specify an optional description of the
output link. The Columns tab specifies the column definitions of incoming
data. The Mapping tab allows you to specify the relationship between the
columns being input to the Lookup stage and the Output columns.
Details about Lookup stage mapping are given in the following section. See
Chapter 3, Stage Editors, for a general description of the other tabs.
Mapping Tab
For Lookup stages the Mapping tab allows you to specify how the output
columns are derived, i.e., what input columns map onto them or how they
are generated.
The left pane shows the lookup columns. These are read only and cannot
be modified on this tab. This shows the meta data from the primary input
link and the reference input links. If a given lookup column appears in
more than one lookup table, only one occurrence of the column will
appear in the left pane.
The right pane shows the output columns for the output link. This has a
Derivations field where you can specify how the column is derived. You
can fill it in by dragging input columns over, or by using the Auto-match
facility.
21
Sort Stage
The Sort stage is an active stage. It is used to perform more complex sort
operations than can be provided for on the Input page Partitioning tab of
parallel job stage editors. You can also use it to insert a more explicit simple
sort operation where you want to make your job easier to understand. The
Sort stage has a single input link which carries the data to be sorted, and a
single output link carrying the sorted data.
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify details about the data sets
being sorted.
Outputs page. This is where you specify details about the sorted
data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced tab
allows you to specify how the stage executes.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. Some of the properties are mandatory, although
many have default settings. Properties without default settings appear in
the warning color (red by default) and turn black when you supply a value
for them.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Sorting Keys/Key | Input Column | N/A | | | N/A
Sorting Keys/Sort Order | Ascending/Descending | Ascending | | | Key
Sorting Keys/Nulls position | First/Last | First | | | Key
Sorting Keys/Collating Sequence | ASCII/EBCDIC | ASCII | | | Key
Sorting Keys/Case Sensitive | True/False | True | | | Key
Sorting Keys/Sort Key Mode (only available for Sort Utility = DataStage) | Sort/Don't Sort (Previously Sorted)/Don't Sort (Previously Grouped) | Sort | | | Key
Options/Sort Utility | DataStage/SyncSort/UNIX | DataStage | | | N/A
Options/Stable Sort | True/False | True for Sort Utility = DataStage, False otherwise | | | N/A
Options/Allow Duplicates (not available for Sort Utility = UNIX) | True/False | True | | | N/A
Options/Output Statistics | True/False | False | | | N/A
Options/Create Cluster Key Change Column (only available for Sort Utility = DataStage) | True/False | False | | | N/A
Options/Create Key Change Column | True/False | False | | | N/A
Options/Restrict Memory Usage | number MB | 20 | | | N/A
Options/SyncSort Extra Options | string | N/A | | | N/A
Options/Workspace | string | N/A | | | N/A
Case Sensitive
Use this to specify whether each key is case sensitive or not. This is set
to True by default, i.e., the values CASE and case would not be judged
equivalent.
Sort Key Mode
This property appears for sort type DataStage. It is set to Sort by
default and this sorts on all the specified key columns.
Set to Don't Sort (Previously Sorted) to specify that input records
are already sorted by this column. The Sort stage will then sort on
secondary key columns, if any. This option can increase the speed
of the sort and reduce the amount of temporary disk space when
your records are already sorted by the primary key column(s)
because you only need to sort your data on the secondary key
column(s).
Set to Don't Sort (Previously Grouped) to specify that input
records are already grouped by this column, but not sorted.
The operator will then sort on any secondary key columns. This
option is useful when your records are already grouped by the
primary key column(s), but not necessarily sorted, and you want to
sort your data only on the secondary key column(s) within each
group.
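A brief Python sketch of the idea (hypothetical two-column records, already grouped on the first column):

    from itertools import groupby

    rows = [("A", 3), ("A", 1), ("B", 2), ("B", 0)]   # already grouped on column 0

    # With Don't Sort on the primary key, only the secondary key is sorted,
    # one group at a time
    out = []
    for _, group in groupby(rows, key=lambda r: r[0]):
        out.extend(sorted(group, key=lambda r: r[1]))
    # out == [("A", 1), ("A", 3), ("B", 0), ("B", 2)]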
Options Category
Sort Utility. The type of sort the stage will carry out. Choose from:
DataStage. The default. This uses the built-in DataStage sorter; you
do not require any additional software to use this option.
SyncSort. This specifies that the SyncSort utility (UNIX version,
Release 1) is used to perform the sort.
UNIX. This specifies that the UNIX sort command is used to
perform the sort.
Stable Sort. Applies to a Sort Utility of DataStage or SyncSort; the default
is True. Set it to True to guarantee that this sort operation will not rearrange records that are already in a properly sorted data set. If set to False,
no prior ordering of records is guaranteed to be preserved by the sorting
operation.
Allow Duplicates. Set to True by default. If False, specifies that, if
multiple records have identical sorting key values, only one record is
retained. If Stable Sort is True, then the first record is retained. This property is not available for the UNIX sort type.
Output Statistics. Set to False by default. If True, causes the sort operation
to output statistics. This property is not available for the UNIX sort type.
Create Cluster Key Change Column. This property appears for sort
type DataStage and is optional. It is set False by default. If set True it tells
the Sort stage to create the column clusterKeyChange in each output
record. The clusterKeyChange column is set to 1 for the first record in each
group where groups are defined by using a Sort Key Mode of Don't Sort
(Previously Sorted) or Don't Sort (Previously Grouped). Subsequent
records in the group have the clusterKeyChange column set to 0.
Create Key Change Column. This property appears for sort type
DataStage and is optional. It is set False by default. If set True it tells the
Sort stage to create the column KeyChange in each output record. The
KeyChange column is set to 1 for the first record in each group where the
value of the sort key changes. Subsequent records in the group have the
KeyChange column set to 0.
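A one-line Python rendering of the rule (the key values are made up):

    rows = ["A", "A", "B", "B"]   # sort key value of each record, in order
    key_change = [1 if i == 0 or rows[i - 1] != key else 0
                  for i, key in enumerate(rows)]
    # key_change == [1, 0, 1, 0]: 1 marks the first record of each key group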
Restrict Memory Usage. This is set to 20 by default. It causes the Sort
stage to restrict itself to the specified number of megabytes of virtual
memory on a processing node.
We recommend that the number of megabytes specified is smaller than the
amount of physical memory on a processing node.
Workspace. This property appears for sort type SyncSort and UNIX only.
Optionally specifies the workspace used by the stage.
SyncSort Extra Options. This property appears for sort type SyncSort
and is optional. It allows you to specify arguments that are passed on the
command line to SyncSort. You can use a job parameter if required.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Preserve partitioning. This is Set by default. You can explicitly
select Set or Clear. Select Set to request that the next stage in the job
should attempt to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pools
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Inputs Page
The Inputs page allows you to specify details about the data coming in to
be sorted. The Sort stage can have only one input link.
The General tab allows you to specify an optional description of the link.
The Partitioning tab allows you to specify how incoming data on the
source data set link is partitioned. The Columns tab specifies the column
definitions of incoming data.
Details about Sort stage partitioning are given in the following section.
See Chapter 3, Stage Editors, for a general description of the other tabs.
By default the stage uses the auto partitioning method. If the Preserve
Partitioning option has been set on the previous stage in the job the stage
will attempt to preserve the partitioning of the incoming data.
If the Sort stage is operating in sequential mode, it will first collect the
data before writing it to the file using the default auto collection method.
The Partitioning tab allows you to override this default behavior. The
exact operation of this tab depends on:
Whether the Sort stage is set to execute in parallel or sequential
mode.
Whether the preceding stage in the job is set to execute in parallel
or sequential mode.
If the Sort stage is set to execute in parallel, then you can set a partitioning
method by selecting from the Partitioning mode drop-down list. This will
override any current partitioning (even if the Preserve Partitioning option
has been set by the previous stage in the job).
If the Sort stage is set to execute in sequential mode, but the preceding
stage is executing in parallel, then you can set a collection method from the
Collection type drop-down list. This will override the default auto collection method.
The following partitioning methods are available:
(Auto). DataStage attempts to work out the best partitioning
method depending on execution modes of current and preceding
stages, whether the Preserve Partitioning flag has been set on the
previous stage in the job, and how many nodes are specified in the
Configuration file. This is the default method for the Sort stage.
Entire. Each file written to receives the entire data set.
Hash. The records are hashed into partitions based on the value of
a key column or columns selected from the Available list.
Modulus. The records are partitioned using a modulus function on
the key column selected from the Available list. This is commonly
used to partition on tag fields.
Random. The records are partitioned randomly, based on the
output of a random number generator.
Round Robin. The records are partitioned on a round robin basis
as they enter the stage.
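As a sketch of what key-based partitioning means here, the following plain Python fragment (invented column names; DataStage's own hash function is internal to the engine) distributes records the way the Hash method does, so records with equal keys always land in the same partition:

    def hash_partition(records, key, num_partitions):
        """Assign each record to a partition by hashing its key column,
        so records with equal keys always share a partition."""
        partitions = [[] for _ in range(num_partitions)]
        for record in records:
            partitions[hash(record[key]) % num_partitions].append(record)
        return partitions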
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is
also set, the first record is retained.
You can also specify sort direction, case sensitivity, and collating sequence
for each column in the Selected list by selecting it and right-clicking to
invoke the shortcut menu.
Outputs Page
The Outputs page allows you to specify details about data output from the
Sort stage. The Sort stage can have only one output link.
The General tab allows you to specify an optional description of the
output link. The Columns tab specifies the column definitions of incoming
data. The Mapping tab allows you to specify the relationship between the
columns being input to the Sort stage and the Output columns.
Details about Sort stage mapping are given in the following section. See
Chapter 3, Stage Editors, for a general description of the other tabs.
Mapping Tab
For Sort stages the Mapping tab allows you to specify how the output
columns are derived, i.e., what input columns map onto them.
The left pane shows the columns of the sorted data. These are read only
and cannot be modified on this tab. This shows the meta data from the
input link.
The right pane shows the output columns for the output link. This has a
Derivations field where you can specify how the column is derived. You
can fill it in by dragging input columns over, or by using the Auto-match
facility.
In the above example the left pane represents the incoming data after the
sort has been performed. The right pane represents the data being output
by the stage after the sort operation. In this example the data has been
mapped straight across.
22
Merge Stage
The Merge stage is an active stage. It can have any number of input links,
a single output link, and the same number of reject links as there are
update input links.
The Merge stage combines a sorted master data set with one or more
sorted update data sets. The columns from the records in the master and
update data sets are merged so that the output record contains all the
columns from the master record plus any additional columns from each
update record.
A master record and an update record are merged only if both of them
have the same values for the merge key column(s) that you specify. Merge
key columns are one or more columns that exist in both the master and
update records. As part of preprocessing your data for the Merge stage,
you first sort the input data sets and remove duplicate records from the
master data set. If you have more than one update data set, you must
remove duplicate records from the update data sets as well. This chapter
describes how to use the Merge stage. See Chapter 21 for information
about the Sort stage and Chapter 23 for information about the Remove
Duplicates stage.
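The following sketch shows the shape of the merge (plain Python with invented column names; it assumes a single update data set and, unlike the stage itself, uses a dictionary lookup in place of a sorted merge):

    def merge(masters, updates, key):
        """Combine each master record with its matching update record."""
        updates_by_key = {u[key]: u for u in updates}
        merged = []
        for master in masters:
            record = dict(master)  # all the master columns
            update = updates_by_key.get(master[key])
            if update is not None:
                for column, value in update.items():
                    record.setdefault(column, value)  # add update-only columns
            merged.append(record)
        return merged

Each output record keeps every master column and gains any additional update-only columns, which mirrors the description above.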
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify details about the data sets
being merged.
Outputs page. This is where you specify details about the merged
data being output from the stage and about the reject links.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced tab
allows you to specify how the stage executes.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. Some of the properties are mandatory, although
many have default settings. Properties without default settings appear in
the warning color (red by default) and turn black when you supply a value
for them.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.
Category/Property                 Values                  Default     Dependent of
Merge Keys/Key                    Input Column            N/A         N/A
Merge Keys/Sort Order             Ascending/Descending    Ascending   Key
Merge Keys/Nulls position         First/Last              First       Key
Merge Keys/Character Set          ASCII/EBCDIC            ASCII       Key
Merge Keys/Case Sensitive         True/False              True        Key
Options/Reject Masters Mode       Keep/Drop               Keep        N/A
Options/Warn On Reject Masters    True/False              True        N/A
Options/Warn On Reject Updates    True/False              True        N/A
Sort Order
Choose Ascending or Descending. The default is Ascending.
Nulls position
By default columns containing null values appear first in the
merged data set. To override this default so that columns
containing null values appear last in the merged data set, select
Last.
Character Set
By default data is represented in the ASCII character set. To represent data in the EBCDIC character set, choose EBCDIC.
Case Sensitive
Use this to specify whether each merge key is case sensitive or not.
It is set to True by default; i.e., the values CASE and case
would not be judged equivalent.
Options Category
Reject Masters Mode. Set to Keep by default. It specifies that rejected
rows from the master link are output to the merged data set. Set to Drop to
specify that rejected records are dropped instead.
Warn On Reject Masters. Set to True by default. This will warn you
when bad records from the master link are rejected. Set it to False to receive
no warnings.
Warn On Reject Updates. Set to True by default. This will warn you
when bad records from any update links are rejected. Set it to False to
receive no warnings.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Link Ordering
This tab allows you to specify which of the input links is the master link
and the order in which links input to the Merge stage are processed. You
can also specify which of the output links is the master link, and which of
the reject links corresponds to which of the incoming update links.
By default the links will be processed in the order they were added. To
rearrange them, choose an input link and click the up arrow button or the
down arrow button.
Inputs Page
The Inputs page allows you to specify details about the data coming in to
be merged. Choose an input link from the Input name drop down list to
specify which link you want to work on.
The General tab allows you to specify an optional description of the link.
The Partitioning tab allows you to specify how incoming data on the
source data set link is partitioned. The Columns tab specifies the column
definitions of incoming data.
Details about Merge stage partitioning are given in the following section.
See Chapter 3, Stage Editors, for a general description of the other tabs.
Outputs Page
The Outputs page allows you to specify details about data output from the
Merge stage. The Merge stage can have only one master output link
carrying the merged data and a number of reject links, each carrying
rejected records from one of the update links. Choose a link from the
Output name drop down list to specify which link you want to work on.
The General tab allows you to specify an optional description of the
output link. The Columns tab specifies the column definitions of incoming
data. The Mapping tab allows you to specify the relationship between the
columns being input to the Merge stage and the Output columns.
Details about Merge stage mapping are given in the following section. See
Chapter 3, Stage Editors, for a general description of the other tabs.
Mapping Tab
For Merge stages the Mapping tab allows you to specify how the output
columns are derived, i.e., what input columns map onto them.
The left pane shows the columns of the merged data. These are read only
and cannot be modified on this tab. This shows the meta data from the
master input link and any additional columns carried on the update links.
The right pane shows the output columns for the master output link. This
has a Derivations field where you can specify how the column is derived.
You can fill it in by dragging input columns over, or by using the Auto-match facility.
In the above example the left pane represents the incoming data after the
merge has been performed. The right pane represents the data being
output by the stage after the merge operation. In this example the data has
been mapped straight across.
23
Remove Duplicates Stage
The Remove Duplicates stage is an active stage. It can have a single input
link and a single output link.
The Remove Duplicates stage takes a single sorted data set as input,
removes all duplicate records, and writes the results to an output data set.
Removing duplicate records is a common way of cleansing a data set
before you perform further processing. Two records are considered duplicates if they are adjacent in the input data set and have identical values for
the key column(s). A key column is any column you designate to be used
in determining whether two records are identical.
The input data set to the remove duplicates operator must be sorted so that
all records with identical key values are adjacent. You can either achieve
this using the in-stage sort facilities available on the Inputs page Partitioning tab, or have an explicit Sort stage feeding the Remove duplicates
stage.
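A compact sketch of the rule (plain Python, invented column names; like the stage, it relies on records with equal keys being adjacent). The retain argument anticipates the Duplicate to retain property described later in this chapter:

    def remove_duplicates(records, key, retain="First"):
        """Keep one record from each run of adjacent records whose key
        columns match; `retain` mirrors the Duplicate to retain property."""
        result = []
        for record in records:
            if result and result[-1][key] == record[key]:
                if retain == "Last":
                    result[-1] = record  # later duplicate replaces earlier
            else:
                result.append(record)
        return result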
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify details about the data set
having its duplicates removed.
Outputs page. This is where you specify details about the
processed data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced tab
allows you to specify how the stage executes.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. Some of the properties are mandatory, although
many have default settings. Properties without default settings appear in
the warning color (red by default) and turn black when you supply a value
for them.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.
Category/Property                            Values         Default   Dependent of
Keys that Define Duplicates/Key              Input Column   N/A       N/A
Keys that Define Duplicates/Character Set    ASCII/EBCDIC   ASCII     Key
Keys that Define Duplicates/Case Sensitive   True/False     True      Key
Options/Duplicate to retain                  First/Last     First     N/A
Case Sensitive
Use this to specify whether each key is case sensitive or not. It is
set to True by default; i.e., the values CASE and case would
not be judged equivalent.
Options Category
Duplicate to retain. Specifies which of the duplicate records encountered to retain. Choose between First and Last. It is set to First by default.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Preserve partitioning. This is Propagate by default. It adopts Set
or Clear from the previous stage. You can explicitly select Set or
Clear. Select Set to request that the next stage in the job should
attempt to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pools
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Inputs Page
The Inputs page allows you to specify details about the data coming in to
be sorted. Choose an input link from the Input name drop down list to
specify which link you want to work on.
The General tab allows you to specify an optional description of the link.
The Partitioning tab allows you to specify how incoming data on the
source data set link is partitioned. The Columns tab specifies the column
definitions of incoming data.
Details about Remove Duplicates stage partitioning are given in the
following section. See Chapter 3, Stage Editors, for a general
description of the other tabs.
The Partitioning tab also allows you to specify that data arriving on the
input link should be sorted before the remove duplicates operation is
performed. The sort is always carried out within data partitions. If the
stage is partitioning incoming data the sort occurs after the partitioning. If
the stage is collecting data, the sort occurs before the collection. The availability of sorting depends on the partitioning method chosen.
Select the check boxes as follows:
Sort. Select this to specify that data coming in on the link should be
sorted. Select the column or columns to sort on from the Available
list.
Stable. Select this if you want to preserve previously sorted data
sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is
also set, the first record is retained.
You can also specify sort direction, case sensitivity, and collating sequence
for each column in the Selected list by selecting it and right-clicking to
invoke the shortcut menu.
Output Page
The Outputs page allows you to specify details about data output from the
Remove Duplicates stage. The stage only has one output link.
The General tab allows you to specify an optional description of the
output link. The Columns tab specifies the column definitions of incoming
data. The Mapping tab allows you to specify the relationship between the
columns being input to the Remove Duplicates stage and the output
columns.
Details about Remove Duplicates stage mapping are given in the following
section. See Chapter 3, Stage Editors, for a general description of the
other tabs.
Mapping Tab
For Remove Duplicates stages the Mapping tab allows you to specify how
the output columns are derived, i.e., what input columns map onto them.
The left pane shows the columns of the input data. These are read only and
cannot be modified on this tab. This shows the meta data from the
incoming link.
The right pane shows the output columns for the output link. This
has a Derivations field where you can specify how the column is derived.
You can fill it in by dragging input columns over, or by using the Auto-match facility.
In the above example the left pane represents the incoming data after the
remove duplicates operation has been performed. The right pane represents the data being output by the stage after the remove duplicates
operation. In this example the data has been mapped straight across.
24
Compress Stage
The Compress stage is an active stage. It can have a single input link and
a single output link.
The Compress stage uses the UNIX compress or GZIP utility to compress a
data set. It converts a data set from a sequence of records into a stream of
raw binary data.
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify details about the data set
being compressed.
Outputs page. This is where you specify details about the
compressed data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced tab
allows you to specify how the stage executes.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. The stage only has a single property which determines whether the stage uses compress or GZIP.
Category/Property    Values          Default    Dependent of
Options/Command      compress/gzip   compress   N/A
Options Category
Command. Specifies whether the stage will use compress (the default) or
GZIP.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Preserve partitioning. This is Set by default. You can explicitly
select Set or Clear. Select Set to request that the next stage should
attempt to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pools
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
24-2
Input Page
The Inputs page allows you to specify details about the data set being
compressed. There is only one input link.
The General tab allows you to specify an optional description of the link.
The Partitioning tab allows you to specify how incoming data on the
source data set link is partitioned. The Columns tab specifies the column
definitions of incoming data.
Details about Compress stage partitioning are given in the following
section. See Chapter 3, Stage Editors, for a general description of the
other tabs.
Output Page
The Outputs page allows you to specify details about data output from the
Compress stage. The stage only has one output link.
The General tab allows you to specify an optional description of the
output link. The Columns tab specifies the column definitions of incoming
data. See Chapter 3, Stage Editors, for a general description of the tabs.
25
Expand Stage
The Expand stage is an active stage. It can have a single input link and a
single output link.
The Expand stage uses the UNIX uncompress or GZIP utility to expand a
data set. It converts a previously compressed data set back into a sequence
of records from a stream of raw binary data.
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify details about the data set
being expanded.
Outputs page. This is where you specify details about the
expanded data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced
page allows you to specify how the stage executes.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. The stage only has a single property which determines whether the stage uses uncompress or GZIP.
Category/Property    Values            Default      Dependent of
Options/Command      uncompress/gzip   uncompress   N/A
Options Category
Command. Specifies whether the stage will use uncompress (the default)
or GZIP.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Preserve partitioning. This is Propagate by default. The stage has a
mandatory partitioning method of Same, this overrides the
preserve partitioning flag and so the partitioning of the incoming
data is always preserved.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pools
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Input Page
The Inputs page allows you to specify details about the data set being
expanded. There is only one input link.
The General tab allows you to specify an optional description of the link.
The Partitioning tab allows you to specify how incoming data on the
source data set link is partitioned. The Columns tab specifies the column
definitions of incoming data.
Details about Expand stage partitioning are given in the following
section. See Chapter 3, Stage Editors, for a general description of the
other tabs.
Output Page
The Outputs page allows you to specify details about data output from the
Expand stage. The stage only has one output link.
The General tab allows you to specify an optional description of the
output link. The Columns tab specifies the column definitions of outgoing
data.
See Chapter 3, Stage Editors, for a general description of the tabs.
26
Sample Stage
The Sample stage is an active stage. It can have a single input link and any
number of output links.
The Sample stage samples an input data set. It operates in two modes. In
Percent mode, it extracts records, selecting them by means of a random
number generator, and writes a given percentage of these to each output
data set. You specify the number of output data sets, the percentage
written to each, and a seed value to start the random number generator.
You can reproduce a given distribution by repeating the same number of
outputs, the percentage, and the seed value.
In Period mode, it extracts every Nth row from each partition, where N is
the period, which you supply. In this case all rows will be output to a single
data set.
For both modes you can specify the maximum number of rows that you
want to sample from each partition.
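Both modes are easy to picture in plain Python (a sketch only; random.Random stands in for the stage's internal generator, so the selections themselves will not match DataStage's):

    import random

    def sample_percent(records, percent, seed):
        """Percent mode: pass roughly `percent`% of records through."""
        rng = random.Random(seed)  # the same seed reproduces the sample
        return [r for r in records if rng.uniform(0, 100) < percent]

    def sample_period(records, period):
        """Period mode: extract every Nth record, where N is the period."""
        return records[period - 1::period]

Because the generator is seeded, rerunning sample_percent with the same seed and percentages reproduces the same selection, which is what the paragraph above relies on.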
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Input page. This is where you specify details about the data set
being Sampled.
Outputs page. This is where you specify details about the Sampled
data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced tab
allows you to specify how the stage executes. The Link Ordering tab
allows you to specify which output links are which.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. Some of the properties are mandatory, although
many have default settings. Properties without default settings appear in
the warning color (red by default) and turn black when you supply a value
for them.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.
Category/Property                Values           Default   Mandatory?                     Dependent of
Options/Sample Mode              percent/period   percent                                  N/A
Options/Percent                  number           N/A       Y (if Sample Mode = Percent)   N/A
Options/Output Link Number       number           N/A                                      Percent
Options/Seed                     number           N/A                                      N/A
Options/Period (Per Partition)   number           N/A       Y (if Sample Mode = Period)    N/A
Options/Max Rows Per Partition   number           N/A                                      N/A
Options Category
Sample Mode. Specifies the type of sample operation. You can sample on
a percentage of input rows (percent), or you can sample the Nth row of
every partition (period).
Percent. Specifies the sampling percentage for each output data set when
you use a Sample Mode of Percent. You can repeat this property to specify
different percentages for each output data set. The sum of the percentages
specified for all output data sets cannot exceed 100%. You can specify a job
parameter if required.
Percent has a dependent property:
Output Link Number
This specifies the output link to which the percentage corresponds.
You can specify a job parameter if required.
Seed. This is the number used to initialize the random number generator.
You can specify a job parameter if required. This property is only available
if Sample Mode is set to percent.
Period (Per Partition). Specifies the period when using a Sample Mode of
Period.
Max Rows Per Partition. This specifies the maximum number of rows
that will be sampled from each partition.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Preserve partitioning. This is Propagate by default. It adopts Set
or Clear from the previous stage. You can explicitly select Set or
Clear. Select Set to request that the next stage should attempt to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pools
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Link Ordering
This tab allows you to specify the order in which the output links are
processed.
By default the output links will be processed in the order they were added.
To rearrange them, choose an output link and click the up arrow button or
the down arrow button.
Input Page
The Input page allows you to specify details about the data set being
sampled. There is only one input link.
The General tab allows you to specify an optional description of the link.
The Partitioning tab allows you to specify how incoming data on the
source data set link is partitioned. The Columns tab specifies the column
definitions of incoming data.
Details about Sample stage partitioning are given in the following section.
See Chapter 3, Stage Editors, for a general description of the other tabs.
Sort. Select this to specify that data coming in on the link should be
sorted. Select the column or columns to sort on from the Available
list.
Stable. Select this if you want to preserve previously sorted data
sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is
also set, the first record is retained.
You can also specify sort direction, case sensitivity, and collating sequence
for each column in the Selected list by selecting it and right-clicking to
invoke the shortcut menu.
Outputs Page
The Outputs page allows you to specify details about data output from the
Sample stage. The stage can have any number of output links; choose the
one you want to work on from the Output Link drop down list.
The General tab allows you to specify an optional description of the
output link. The Columns tab specifies the column definitions of outgoing
data. The Mapping tab allows you to specify the relationship between the
columns being input to the Sample stage and the output columns.
Details about Sample stage mapping are given in the following section. See
Chapter 3, Stage Editors, for a general description of the other tabs.
Mapping Tab
For Sample stages the Mapping tab allows you to specify how the output
columns are derived, i.e., what input columns map onto them.
The left pane shows the columns of the sampled data. These are read only
and cannot be modified on this tab. This shows the meta data from the
incoming link.
The right pane shows the output columns for the output link. This has a
Derivations field where you can specify how the column is derived. You
can fill it in by dragging input columns over, or by using the Auto-match
facility.
In the above example the left pane represents the incoming data after the
Sample operation has been performed. The right pane represents the data
being output by the stage after the Sample operation. In this example the
data has been mapped straight across.
27
Row Generator Stage
The Row Generator stage is a file stage. It can have any number of output
links.
The Row Generator stage produces a set of mock data fitting the specified
meta data. This is useful where you want to test your job but have no real
data available to process. (See also the Column Generator stage which
allows you to add extra columns to existing data sets.)
The meta data you specify on the output link determines the columns you
are generating. Most of the properties are specified using the Edit Column
Meta Data dialog box to provide format details for each column (the Edit
Column Meta Data dialog box is accessible from the shortcut menu of the
Outputs Page Columns tab - select Edit Row).
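In spirit the stage behaves like this sketch (plain Python; the column-specification format here is invented and is not the schema format DataStage uses):

    import itertools

    def generate_rows(columns, count):
        """Produce `count` mock records by cycling sample values for
        each column named in `columns`."""
        cycles = {name: itertools.cycle(values)
                  for name, values in columns.items()}
        return [{name: next(cycle) for name, cycle in cycles.items()}
                for _ in range(count)]

    rows = generate_rows({"code": ["AA", "BB"], "qty": [1, 2, 3]}, count=10)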
The stage editor has two pages:
Stage page. This is always present and is used to specify general
information about the stage.
Outputs page. This is where you specify details about the generated data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Advanced page allows you to specify how the stage executes.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The Generate stage executes in Sequential mode
by default. You can select Parallel mode to generate data sets in
separate partitions.
Preserve partitioning. This is Propagate by default. If you have an
input data set, it adopts Set or Clear from the previous stage. You
can explicitly select Set or Clear. Select Set to request that the next stage
should attempt to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pools
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Outputs Page
The Outputs page allows you to specify details about data output from the
Row Generator stage. The stage can have any number of output links;
choose the one you want to work on from the Output Link drop down list.
The General tab allows you to specify an optional description of the
output link. The Properties page lets you specify what the stage does. The
Columns tab specifies the column definitions of outgoing data.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. Some of the properties are mandatory, although
many have default settings. Properties without default settings appear in
the warning color (red by default) and turn black when you supply a value
27-2
for them. The following table gives a quick reference list of the properties
and their attributes. A more detailed description of each property follows.
Category/Property           Values     Default   Dependent of
Options/Number of Records   number     10        N/A
Options/Schema File         pathname   N/A       N/A
Options Category
Number of Records. The number of records you want your generated
data set to contain.
The default number is 10.
Schema File. By default the stage will take the meta data defined on the
output link to base the mock data set on. But you can specify the column
definitions in a schema file, if required. You can browse for the schema file
or specify a job parameter.
28
Column Generator Stage
The Column Generator stage is an active stage. It can have a single input
link and a single output link.
The Column Generator stage adds columns to incoming data and generates mock data for these columns for each data row processed. The new
data set is then output. (See also the Row Generator stage which allows
you to generate complete sets of mock data.)
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Input page. This is where you specify details about the input link.
Outputs page. This is where you specify details about the generated data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced tab
allows you to specify how the stage executes.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. Some of the properties are mandatory, although
many have default settings. Properties without default settings appear in
the warning color (red by default) and turn black when you supply a value
for them.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.
Category/Property            Values                 Default    Mandatory?                           Dependent of
Options/Column Method        Explicit/Schema File   Explicit                                        N/A
Options/Column to Generate   output column          N/A        Y (if Column Method = Explicit)      N/A
Options/Schema File          pathname               N/A        Y (if Column Method = Schema File)   N/A
Options Category
Column Method. Select Explicit if you are going to specify the column or
columns you want the stage to generate data for. Select Schema File if you
are supplying a schema file containing the column definitions.
Column to Generate. When you have chosen a column method of
Explicit, this property allows you to specify which output columns the
stage is generating data for. Repeat the property to specify multiple
columns. You can specify the properties for each column using the Parallel
tab of the Edit Column Meta Data dialog box (accessible from the shortcut
menu on the columns grid of the output Columns tab).
Schema File. When you have chosen a column method of schema file,
this property allows you to specify the column definitions in a schema file.
You can browse for the schema file or specify a job parameter.
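Conceptually the stage does something like the following sketch (plain Python; the generator functions are invented placeholders for the formats you set in the Edit Column Meta Data dialog box):

    def add_generated_columns(records, generators):
        """Add a mock value for every generated column to each record."""
        for row_number, record in enumerate(records):
            for column, make_value in generators.items():
                record[column] = make_value(row_number)
        return records

    records = add_generated_columns(
        [{"id": 1}, {"id": 2}],
        {"batch": lambda n: n % 3},  # hypothetical generated column
    )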
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The Generate stage executes in Sequential mode
by default. You can select Parallel mode to generate data sets in
separate partitions.
Preserve partitioning. This is Propagate by default. If you have an
input data set, it adopts Set or Clear from the previous stage. You
can explicitly select Set or Clear. Select Set to request that the next stage
should attempt to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pools
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Input Page
The Inputs page allows you to specify details about the incoming data set
you are adding generated columns to. There is only one input link and this
is optional.
The General tab allows you to specify an optional description of the link.
The Partitioning tab allows you to specify how incoming data on the
source data set link is partitioned. The Columns tab specifies the column
definitions of incoming data.
Details about Generate stage partitioning are given in the following
section. See Chapter 3, Stage Editors, for a general description of the
other tabs.
By default the stage uses the auto partitioning method. If the Preserve
Partitioning option has been set on the previous stage in the job, the stage
will attempt to preserve the partitioning of the incoming data.
If the Column Generator stage is operating in sequential mode, it will first
collect the data before writing it to the file using the default auto collection
method.
The Partitioning tab allows you to override this default behavior. The
exact operation of this tab depends on:
Whether the Column Generator stage is set to execute in parallel or
sequential mode.
Whether the preceding stage in the job is set to execute in parallel
or sequential mode.
If the Column Generator stage is set to execute in parallel, then you can set
a partitioning method by selecting from the Partitioning mode drop-down list. This will override any current partitioning (even if the Preserve
Partitioning option has been set on the Stage page Advanced tab).
If the Column Generator stage is set to execute in sequential mode, but the
preceding stage is executing in parallel, then you can set a collection
method from the Collection type drop-down list. This will override the
default auto collection method.
The following partitioning methods are available:
(Auto). DataStage attempts to work out the best partitioning
method depending on execution modes of current and preceding
stages, whether the Preserve Partitioning flag has been set on the
previous stage in the job, and how many nodes are specified in the
Configuration file. This is the default method for the Column
Generator stage.
Entire. Each file written to receives the entire data set.
Hash. The records are hashed into partitions based on the value of
a key column or columns selected from the Available list.
Modulus. The records are partitioned using a modulus function on
the key column selected from the Available list. This is commonly
used to partition on tag fields.
Random. The records are partitioned randomly, based on the
output of a random number generator.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is
also set, the first record is retained.
You can also specify sort direction, case sensitivity, and collating sequence
for each column in the Selected list by selecting it and right-clicking to
invoke the shortcut menu.
Outputs Page
Details about Column Generator stage mapping are given in the following
section. See Chapter 3, Stage Editors, for a general description of the
other tabs.
Mapping Tab
For Column Generator stages the Mapping tab allows you to specify how
the output columns are derived, i.e., how the generated data maps onto
them.
The left pane shows the generated columns. These are read only and
cannot be modified on this tab. These columns are automatically mapped
onto the equivalent output columns.
The right pane shows the output columns for the output link. This has a
Derivations field where you can specify how the column is derived. You
can fill it in by dragging input columns over, or by using the Auto-match
facility.
29
Copy Stage
The Copy stage is an active stage. It can have a single input link and any
number of output links.
The Copy stage copies a single input data set to a number of output data
sets. Each record of the input data set is copied to every output data set
without modification. This lets you make a backup copy of a data set on
disk while performing an operation on another copy, for example.
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Input page. This is where you specify details about the input link
carrying the data to be copied.
Outputs page. This is where you specify details about the copied
data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced tab
allows you to specify how the stage executes.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. Some of the properties are mandatory, although
many have default settings. Properties without default settings appear in
the warning color (red by default) and turn black when you supply a value
for them.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.
Category/Property   Values       Default   Dependent of
Options/Force       True/False   False     N/A
Options Category
Force. Set to False by default. Set it to True to specify that DataStage
should not try to optimize the job by removing a Copy operation where
there is one input and one output.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Preserve partitioning. This is Propagate by default. It adopts the
setting of the previous stage. You can explicitly select Set or Clear.
Select Set to request that the next stage should attempt to maintain
the partitioning.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pools
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Input Page
The Inputs page allows you to specify details about the data set being
copied. There is only one input link.
The General tab allows you to specify an optional description of the link.
The Partitioning tab allows you to specify how incoming data on the
source data set link is partitioned. The Columns tab specifies the column
definitions of incoming data.
Details about Copy stage partitioning are given in the following section.
See Chapter 3, Stage Editors, for a general description of the other tabs.
Outputs Page
The Outputs page allows you to specify details about data output from the
Copy stage. The stage can have any number of output links; choose the
one you want to work on from the Output name drop down list.
The General tab allows you to specify an optional description of the
output link. The Columns tab specifies the column definitions of outgoing
data. The Mapping tab allows you to specify the relationship between the
columns being input to the Copy stage and the output columns.
Details about Copy stage mapping are given in the following section. See
Chapter 3, Stage Editors, for a general description of the other tabs.
Mapping Tab
For Copy stages the Mapping tab allows you to specify how the output
columns are derived, i.e., what copied columns map onto them.
The left pane shows the copied columns. These are read only and cannot
be modified on this tab.
The right pane shows the output columns for the output link. This has a
Derivations field where you can specify how the column is derived. You
can fill it in by dragging copied columns over, or by using the Auto-match
facility.
In the above example the left pane represents the incoming data after the
copy has been performed. The right pane represents the data being output
by the stage after the copy operation. In this example the data has been
mapped straight across.
30
External Filter Stage
The External Filter stage is an active stage. It can have a single input link
and a single output link.
The External filter stage allows you to specify a UNIX command that acts
as a filter on the data you are processing. An example would be to use the
stage to grep a data set for a certain string, or pattern, and discard records
which did not contain a match. This can be a quick and efficient way of
filtering data.
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Input page. This is where you specify details about the input link
carrying the data to be filtered.
Outputs page. This is where you specify details about the filtered
data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties page lets you specify what the stage does. The Advanced
page allows you to specify how the stage executes.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. Some of the properties are mandatory, although
many have default settings. Properties without default settings appear in
the warning color (red by default) and turn black when you supply a value
for them.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.
Category/Property        Values   Default   Dependent of
Options/Filter Command   string   N/A       N/A
Options/Arguments        string   N/A       N/A
Options Category
Filter Command. Specifies the filter command line to be executed and
any command line options it requires. For example:
grep
Arguments. Allows you to specify any arguments that the command line
requires. For example:
\(cancel\).*\1
Together with the grep command, this would extract all records that
contain the string cancel twice and discard all other records.
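Taken together, the two properties amount to running a command line of this general form against the data (a sketch of the effective command, with shell quoting added; it is not output captured from DataStage):

    grep '\(cancel\).*\1'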
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Preserve partitioning. This is Propagate by default. It adopts the
setting of the previous stage. You can explicitly select Set or Clear.
Select Set to request that the next stage should attempt to maintain
the partitioning.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pools
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Input Page
The Inputs page allows you to specify details about the data set being
filtered. There is only one input link.
The General tab allows you to specify an optional description of the link.
The Partitioning tab allows you to specify how incoming data on the
source data set link is partitioned. The Columns tab specifies the column
definitions of incoming data.
Details about External Filter stage partitioning are given in the following
section. See Chapter 3, Stage Editors, for a general description of the
other tabs.
Outputs Page
The Outputs page allows you to specify details about data output from the
External Filter stage. The stage can only have one output link.
31
Change Capture Stage
The Change Capture Stage is an active stage. The stage compares two data
sets and makes a record of the differences.
The Change Capture stage takes two input data sets, denoted before and
after, and outputs a single data set whose records represent the changes
made to the before data set to obtain the after data set. The stage produces
a change data set, whose table definition is transferred from the after data
set's table definition with the addition of one column: a change code with
values encoding the four actions: insert, delete, copy, and edit. The
preserve-partitioning flag is set on the change data set.
The compare is based on a set of key columns; records from the two
data sets are assumed to be copies of one another if they have the same
values in these key columns. You can also optionally specify change
values. If two records have identical key columns, you can compare the
value columns to see if one is an edited copy of the other.
The stage assumes that the incoming data is hash-partitioned and sorted
in ascending order. The columns the data is hashed on should be the key
columns used for the data compare. You can achieve the sorting and partitioning using the Sort stage or by using the built-in sorting and
partitioning abilities of the Change Capture stage.
You can use the companion Change Apply stage to combine the changes
from the Change Capture stage with the original before data set to reproduce the after data set.
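A sketch of the comparison in plain Python (invented column names; a dictionary stands in for the hash-partitioned, sorted processing the stage actually relies on). The numeric codes are the defaults described under the Options category below:

    def change_capture(before, after, key):
        """Label after-format records with the default change codes."""
        before_by_key = {b[key]: b for b in before}
        changes = []
        for record in after:
            old = before_by_key.pop(record[key], None)
            if old is None:
                code = 1          # insert: key absent from the before set
            elif old == record:
                code = 0          # copy: record unchanged
            else:
                code = 3          # edit: same key, different values
            changes.append(dict(record, change_code=code))
        for old in before_by_key.values():
            changes.append(dict(old, change_code=2))  # delete
        return changes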
The Change Capture stage is very similar to the Difference stage described
in Chapter 35, Difference Stage.
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify details about the before and
after data sets being compared.
Outputs page. This is where you specify details about the change
data set being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced tab
allows you to specify how the stage executes. The Link Ordering tab
allows you to specify which input link carries the before data set and which
the after data set.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. Some of the properties are mandatory, although
many have default settings. Properties without default settings appear in
the warning color (red by default) and turn black when you supply a value
for them.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.
Category/Property                Values                      Default                  Dependent of
Change Keys/Key                  Input Column                N/A                      N/A
Change Keys/Case Sensitive       True/False                  True                     Key
Change Keys/Sort Order           Ascending/Descending        Ascending                Key
Change Keys/Nulls Position       First/Last                  First                    Key
Change Values/Value              Input Column                N/A                      N/A
Change Values/Case Sensitive     True/False                  True                     Value
Options/Change Mode              Explicit Keys & Values/     Explicit Keys & Values   N/A
                                 All keys, Explicit values/
                                 Explicit Keys, All Values
Options/Log Statistics           True/False                  False                    N/A
Options/Drop Output for Insert   True/False                  False                    N/A
Options/Drop Output for Delete   True/False                  False                    N/A
Options/Drop Output for Edit     True/False                  False                    N/A
Options/Drop Output for Copy     True/False                  True                     N/A
Options/Code Column Name         string                      change_code              N/A
Options/Copy Code                number                      0                        N/A
Options/Deleted Code             number                      2                        N/A
Options/Edit Code                number                      3                        N/A
Options/Insert Code              number                      1                        N/A
Case Sensitive
Use this property to specify whether each key is case sensitive or
not. It is set to True by default; for example, the values CASE and
case would not be judged equivalent.
Sort Order
Specify ascending or descending sort order.
Nulls Position
Specify whether null values should be placed first or last.
Options Category
Change Mode. This mode determines how keys and values are specified.
Choose Explicit Keys & Values to specify the keys and values yourself.
Choose All keys, Explicit values to specify that value columns must be
defined, but all other columns are key columns unless excluded. Choose
Explicit Keys, All Values to specify that key columns must be defined but
all other columns are value columns unless they are excluded.
Log Statistics. This property configures the stage to display result information containing the number of input records and the number of copy,
delete, edit, and insert records.
Drop Output for Insert. Specifies to drop (not generate) an output record
for an insert result. By default, an output record is always created by the
stage.
Drop Output for Delete. Specifies to drop (not generate) the output
record for a delete result. By default, an output record is always created by
the stage.
Drop Output for Edit. Specifies to drop (not generate) the output record
for an edit result. By default, an output record is always created by the
stage.
Drop Output for Copy. Specifies to drop (not generate) the output record
for a copy result. By default, an output record is always created by the
stage.
Code Column Name. Allows you to specify a different name for the
output column carrying the change code generated for each record by the
stage. By default the column is called change_code.
Copy Code. Allows you to specify an alternative value for the code that
indicates the after record is a copy of the before record. By default this code
is 0.
Deleted Code. Allows you to specify an alternative value for the code
that indicates that a record in the before set has been deleted from the after
set. By default this code is 2.
Edit Code. Allows you to specify an alternative value for the code that
indicates the after record is an edited version of the before record. By default
this code is 3.
Insert Code. Allows you to specify an alternative value for the code that
indicates a new record has been inserted in the after set that did not exist
in the before set. By default this code is 1.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Link Ordering
This tab allows you to specify which input link carries the before data set
and which carries the after data set.
By default the first link added will represent the before set. To rearrange
the links, choose an input link and click the up arrow button or the down
arrow button.
Inputs Page
The Inputs page allows you to specify details about the incoming data
sets. The Change Capture stage expects two incoming data sets: a before data set
and an after data set.
The General tab allows you to specify an optional description of the input
link. The Partitioning tab allows you to specify how incoming data is
partitioned before being compared. The Columns tab specifies the column
definitions of incoming data.
Details about Change Capture stage partitioning are given in the
following section. See Chapter 3, Stage Editors, for a general description
of the other tabs.
If the Change Capture stage is set to execute in parallel, then you can set a
partitioning method by selecting from the Partitioning mode drop-down
list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage).
If the Change Capture stage is set to execute in sequential mode, but the
preceding stage is executing in parallel, then you can set a collection
method from the Collection type drop-down list. This will override the
default collection method.
The following partitioning methods are available:
(Auto). DataStage attempts to work out the best partitioning
method depending on execution modes of current and preceding
stages, whether the Preserve Partitioning option has been set, and
how many nodes are specified in the Configuration file. This is the
default partitioning method for the Change Capture stage.
Entire. Each file written to receives the entire data set.
Hash. The records are hashed into partitions based on the value of
a key column or columns selected from the Available list.
Modulus. The records are partitioned using a modulus function on
the key column selected from the Available list. This is commonly
used to partition on tag fields.
Random. The records are partitioned randomly, based on the
output of a random number generator.
Round Robin. The records are partitioned on a round robin basis
as they enter the stage.
Same. Preserves the partitioning already in place.
DB2. Replicates the DB2 partitioning method of a specific DB2
table. Requires extra properties to be set. Access these properties
by clicking the properties button.
Range. Divides a data set into approximately equal size partitions
based on one or more partitioning keys. Range partitioning is often
a preprocessing step to performing a total sort on a data set.
Requires extra properties to be set. Access these properties by
clicking the properties button.
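To make the distinction between some of these methods concrete, the following fragment (an editorial sketch in Python, not DataStage internals) shows how a record might be assigned to one of n partitions under the Hash, Modulus, and Round Robin methods:

```python
# Sketch of three partitioning methods; not DataStage internals.
import itertools

def hash_partition(key_value, n):
    # Hash: the key (of any type) is hashed, then reduced to a partition.
    return hash(key_value) % n

def modulus_partition(tag_value, n):
    # Modulus: an integer key such as a tag field is used directly, so
    # consecutive tag values cycle evenly through the n partitions.
    return tag_value % n

# Round robin: records go to partitions 0, 1, ..., n-1, 0, 1, ... in turn.
round_robin = lambda n: itertools.cycle(range(n))
rr = round_robin(4)
partition_for_next_record = next(rr)   # 0, then 1, 2, 3, 0, ...
```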
The following Collection methods are available:
(Auto). DataStage attempts to work out the best collection method depending on execution modes of current and preceding stages, and how many nodes are specified in the Configuration file. This is the default collection method for Change Capture stages.
Ordered. Reads all records from the first partition, then all records from the second partition, and so on.
Round Robin. Reads a record from the first input partition, then from the second partition, and so on. After reaching the last partition, the operator starts over.
Sort Merge. Reads records in an order based on one or more columns of the record. This requires you to select a collecting key column from the Available list.
Outputs Page
The Outputs page allows you to specify details about data output from the
Change Capture stage. The Change Capture stage can have only one
output link.
Mapping Tab
For the Change Capture stage the Mapping tab allows you to specify how
the output columns are derived, i.e., what input columns map onto them
and which column carries the change code data.
The left pane shows the columns from the before/after data sets plus the
change code column. These are read only and cannot be modified on this
tab.
The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it
in by dragging input columns over, or by using the Auto-match facility. By
default the data set columns are mapped automatically. You need to
ensure that there is an output column to carry the change code and that
this is mapped to the change_code column.
32
Change Apply Stage
The Change Apply stage is an active stage. It takes the change data set, which contains the changes in the before and after data sets, from the Change
Capture stage and applies the encoded change operations to a before data
set to compute an after data set. (See Chapter 31 for a description of the
Change Capture stage.)
The before input to Change Apply must have the same columns as the before
input that was input to Change Capture, and an automatic conversion
must exist between the types of corresponding columns. In addition,
results are only guaranteed if the contents of the before input to Change
Apply are identical (in value and record order in each partition) to the
before input that was fed to Change Capture, and if the keys are unique.
The change input to Change Apply must have been output from Change
Capture without modification. Because preserve-partitioning is set on the
change output of Change Capture, the Change Apply stage has the same
number of partitions as the Change Capture stage. Additionally, both
inputs of Change Apply are designated as partitioned using the Same
partitioning method.
The Change Apply stage reads a record from the change data set and from the before data set, compares their key column values, and acts accordingly (a sketch of this merge logic follows the list below):
If the before keys come before the change keys in the specified sort
order, the before record is copied to the output. The change record is
retained for the next comparison.
If the before keys are equal to the change keys, the behavior depends
on the code in the change_code column of the change record:
Insert: The change record is copied to the output; the stage retains
the same before record for the next comparison. If key columns are
not unique, and there is more than one consecutive insert with
the same key, then Change Apply applies all the consecutive
inserts before existing records. This record order may be different
from the after data set given to Change Capture.
Delete: The value columns of the before and change records are
compared. If the value columns are the same or if the Check
Value Columns on Delete is specified as False, the change and
before records are both discarded; no record is transferred to the
output. If the value columns are not the same, the before record is
copied to the output and the stage retains the same change record
for the next comparison.
If key columns are not unique, the value columns ensure that the
correct record is deleted. If more than one record with the same
keys have matching value columns, the first-encountered record
is deleted. This may cause different record ordering than in the
after data set given to the Change Capture stage. A warning is
issued, and both the change record and the before record are discarded; no output record results.
Edit: The change record is copied to the output; the before record is
discarded. If key columns are not unique, then the first before
record encountered with matching keys will be edited. This may
be a different record from the one that was edited in the after data
set given to the Change Capture stage. A warning is issued and
the change record is copied to the output, but the stage retains the
same before record for the next comparison.
Copy: The change record is discarded. The before record is copied
to the output.
If the before keys come after the change keys, behavior also depends
on the change_code column:
Insert. The change record is copied to the output, the stage retains
the same before record for the next comparison. (The same as
when the keys are equal.)
Delete. A warning is issued and the change record discarded
while the before record is retained for the next comparison.
Edit or Copy. A warning is issued and the change record is copied
to the output while the before record is retained for the next
comparison.
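Condensed into pseudocode, the merge works as follows. This is an editorial sketch in Python, not the stage implementation: the warning messages are omitted, and the Check Value Columns on Delete property is reduced to a flag.

```python
# Condensed sketch of the Change Apply merge; illustrative only.
COPY, INSERT, DELETE, EDIT = 0, 1, 2, 3

def change_apply(before, changes, keys, values, check_values_on_delete=True):
    keyof = lambda rec: tuple(rec[k] for k in keys)
    strip = lambda ch: {k: v for k, v in ch.items() if k != "change_code"}
    i = 0
    for ch in changes:
        while True:
            if i < len(before) and keyof(before[i]) < keyof(ch):
                yield before[i]; i += 1          # before key sorts first
                continue                         # retain the change record
            code = ch["change_code"]
            if i < len(before) and keyof(before[i]) == keyof(ch):
                if code == INSERT:
                    yield strip(ch)              # before record retained
                elif code == DELETE:
                    if check_values_on_delete and \
                       any(before[i][v] != ch[v] for v in values):
                        yield before[i]; i += 1  # not the right record;
                        continue                 # retain the change record
                    i += 1                       # discard both records
                elif code == EDIT:
                    yield strip(ch); i += 1      # before record replaced
                else:                            # COPY
                    yield before[i]; i += 1      # change record discarded
            else:                                # before keys come after
                if code != DELETE:               # insert, edit, or copy:
                    yield strip(ch)              # copy change to output
                # (the real stage issues warnings in these cases)
            break
    while i < len(before):                       # copy any remaining
        yield before[i]; i += 1                  # before records through
```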
Note: If the before input of Change Apply is identical to the before input
of Change Capture and either the keys are unique or copy records
are used, then the output of Change Apply is identical to the after
input of Change Capture. However, if the before input of Change
Apply is not the same (different record contents or ordering), or the
keys are not unique and copy records are not used, this is not
detected and the rules described above are applied anyway,
producing a result that might or might not be useful.
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify details about the two incoming data sets: the before data set and the change data set.
Outputs page. This is where you specify details about the
processed data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced tab
allows you to specify how the stage executes.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. Some of the properties are mandatory, although
many have default settings. Properties without default settings appear in
the warning color (red by default) and turn black when you supply a value
for them.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.
Category/Property | Values | Default | Dependent of
Change Keys/Key | Input Column | N/A | N/A
Change Keys/Case Sensitive | True/False | True | Key
Change Keys/Sort Order | Ascending/Descending | Ascending | Key
Change Keys/Nulls Position | First/Last | First | Key
Change Values/Value | Input Column | N/A | N/A
Change Values/Case Sensitive | True/False | True | Value
Options/Change Mode | Explicit Keys & Values/All keys, Explicit values/Explicit Keys, All Values | Explicit Keys & Values | N/A
Options/Log Statistics | True/False | False | N/A
Options/Check Value Columns on Delete | True/False | True | N/A
Options/Code Column Name | string | change_code | N/A
Options/Copy Code | number | 0 | N/A
Options/Deleted Code | number | 2 | N/A
Options/Edit Code | number | 3 | N/A
Options/Insert Code | number | 1 | N/A
Sort Order
Specify ascending or descending sort order.
Nulls Position
Specify whether null values should be placed first or last.
Options Category
Change Mode. This mode determines how keys and values are specified.
Choose Explicit Keys & Values to specify the keys and values yourself.
Choose All keys, Explicit values to specify that value columns must be
defined, but all other columns are key columns unless excluded. Choose
Explicit Keys, All Values to specify that key columns must be defined but
all other columns are value columns unless they are excluded.
Log Statistics. This property configures the stage to display result information containing the number of input records and the number of copy,
delete, edit, and insert records.
Check Value Columns on Delete. Specifies whether DataStage should check value columns on deletes. Normally (True, the default), Change Apply compares the value columns of delete change records to those in the before record to ensure that it is deleting the correct record; set this property to False to skip the check.
Code Column Name. Allows you to specify that a different name has
been used for the change data set column carrying the change code generated for each record by the stage. By default the column is called
change_code.
Copy Code. Allows you to specify an alternative value for the code that
indicates a record copy. By default this code is 0.
Deleted Code. Allows you to specify an alternative value for the code
that indicates a record delete. By default this code is 2.
Edit Code. Allows you to specify an alternative value for the code that
indicates a record edit. By default this code is 3.
Insert Code. Allows you to specify an alternative value for the code that
indicates a record insert. By default this code is 1.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Preserve partitioning. This is Propagate by default. It adopts Set
or Clear from the previous stage. You can explicitly select Set or
Clear. Select Set to request that the next stage in the job should attempt
to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pools
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Link Ordering
This tab allows you to specify which input link carries the before data set
and which carries the change data set.
By default the first link added will represent the before set. To rearrange the
links, choose an input link and click the up arrow button or the down
arrow button.
Inputs Page
The Inputs page allows you to specify details about the incoming data set.
The General tab allows you to specify an optional description of the input
link. The Partitioning tab allows you to specify how incoming data is
partitioned before being compared. The Columns tab specifies the column
definitions of incoming data.
Details about Change Apply stage partitioning are given in the following
section. See Chapter 3, Stage Editors, for a general description of the
other tabs.
Sort. Select this to specify that data coming in on the link should be
sorted. Select the column or columns to sort on from the Available
list.
Stable. Select this if you want to preserve previously sorted data
sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is
also set, the first record is retained.
You can also specify sort direction, case sensitivity, and collating sequence
for each column in the Selected list by selecting it and right-clicking to
invoke the shortcut menu.
Outputs Page
The Outputs page allows you to specify details about data output from the
Change Apply stage. The Change Apply stage can have only one output
link.
The General tab allows you to specify an optional description of the
output link. The Columns tab specifies the column definitions of outgoing
data. The Mapping tab allows you to specify the relationship between the
columns being input to the Change Apply stage and the Output columns.
Details about Change Apply stage mapping are given in the following
section. See Chapter 3, Stage Editors, for a general description of the
other tabs.
Mapping Tab
For the Change Apply stage the Mapping tab allows you to specify how
the output columns are derived, i.e., what input columns map onto them
or how they are generated.
The left pane shows the common columns of the before and change data
sets. These are read only and cannot be modified on this tab.
The right pane shows the output columns for the output link. This has a
Derivations field where you can specify how the column is derived. You
can fill it in by dragging input columns over, or by using the Auto-match
facility. By default the columns are mapped straight across as shown in the
example.
33
Encode Stage
The Encode stage is an active stage. It encodes a data set using a UNIX
encoding command that you supply. The stage converts a data set from a
sequence of records into a stream of raw binary data. The companion
Decode stage reconverts the data stream to a data set.
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify the details about the single
input set from which you are selecting records.
Outputs page. This is where you specify details about the
processed data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. This stage has only one property, and you must supply a value for it. The property appears in the warning color (red by
default) until you supply a value.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.
Category/Property | Values | Default | Dependent of
Options/Command Line | Command Line | N/A | N/A
Options Category
Command Line. Specifies the command line used for encoding the data
set. The command line must configure the UNIX command to accept input
from standard input and write its results to standard output. The
command must be located in your search path and be accessible by every
processing node on which the Encode stage executes.
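A typical choice is a standard compression utility such as gzip, which, when invoked with no file arguments, reads standard input and writes to standard output. To show the shape such a command must have, here is a hypothetical stand-in written in Python (an illustration only, not a supported or recommended script):

```python
#!/usr/bin/env python3
# my_encode.py: hypothetical stand-in for an Encode stage command line.
# Like gzip, it reads raw bytes from standard input and writes the
# encoded result to standard output, which is all the stage requires.
import sys
import zlib

sys.stdout.buffer.write(zlib.compress(sys.stdin.buffer.read()))
```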
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Preserve partitioning. This is Set by default to request that the next stage in the job should attempt to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pools
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Inputs Page
The Inputs page allows you to specify details about the incoming data
sets. The Encode stage can only have one input link.
The General tab allows you to specify an optional description of the input
link. The Partitioning tab allows you to specify how incoming data is
partitioned before being encoded. The Columns tab specifies the column
definitions of incoming data.
Details about Encode stage partitioning are given in the
following section. See Chapter 3, Stage Editors, for a general description
of the other tabs.
If the Encode stage is set to execute in sequential mode, but the preceding stage is executing in parallel, then you can set a collection method from the Collection type drop-down list. This will override the default collection method.
The following partitioning methods are available:
(Auto). DataStage attempts to work out the best partitioning
method depending on execution modes of current and preceding
stages, whether the Preserve Partitioning option has been set, and
how many nodes are specified in the Configuration file. This is the
default partitioning method for the Encode stage.
Entire. Each file written to receives the entire data set.
Hash. The records are hashed into partitions based on the value of
a key column or columns selected from the Available list.
Modulus. The records are partitioned using a modulus function on
the key column selected from the Available list. This is commonly
used to partition on tag fields.
Random. The records are partitioned randomly, based on the
output of a random number generator.
Round Robin. The records are partitioned on a round robin basis
as they enter the stage.
Same. Preserves the partitioning already in place.
DB2. Replicates the DB2 partitioning method of a specific DB2
table. Requires extra properties to be set. Access these properties
by clicking the properties button.
Range. Divides a data set into approximately equal size partitions
based on one or more partitioning keys. Range partitioning is often
a preprocessing step to performing a total sort on a data set.
Requires extra properties to be set. Access these properties by
clicking the properties button.
The following Collection methods are available:
(Auto). DataStage attempts to work out the best collection method
depending on execution modes of current and preceding stages,
and how many nodes are specified in the Configuration file. This is
the default collection method for Encode stages.
Ordered. Reads all records from the first partition, then all records
from the second partition, and so on.
Round Robin. Reads a record from the first input partition, then
from the second partition, and so on. After reaching the last partition, the operator starts over.
Sort Merge. Reads records in an order based on one or more
columns of the record. This requires you to select a collecting key
column from the Available list.
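Sort Merge collection is essentially a k-way merge of sorted streams. A minimal sketch (editorial Python, assuming each partition is already sorted on the collecting key):

```python
# Sketch of Sort Merge collection: merge already-sorted partitions into
# a single sorted stream on the collecting key column(s).
import heapq

def sort_merge_collect(partitions, key_columns):
    keyof = lambda rec: tuple(rec[c] for c in key_columns)
    return heapq.merge(*partitions, key=keyof)
```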
The Partitioning tab also allows you to specify that data arriving on the
input link should be sorted before being encoded. The sort is always
carried out within data partitions. If the stage is partitioning incoming
data the sort occurs after the partitioning. If the stage is collecting data, the
sort occurs before the collection. The availability of sorting depends on the
partitioning method chosen.
Select the check boxes as follows:
Sort. Select this to specify that data coming in on the link should be
sorted. Select the column or columns to sort on from the Available
list.
Stable. Select this if you want to preserve previously sorted data
sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is
also set, the first record is retained.
You can also specify sort direction, case sensitivity, and collating sequence
for each column in the Selected list by selecting it and right-clicking to
invoke the shortcut menu.
Outputs Page
The Outputs page allows you to specify details about data output from the
Encode stage. The Encode stage can have only one output link.
The General tab allows you to specify an optional description of the
output link. The Columns tab specifies the column definitions of outgoing data.
See Chapter 3, Stage Editors, for a general description of these tabs.
34
Decode Stage
The Decode stage is an active stage. It decodes a data set using a UNIX
decoding command that you supply. It converts a data stream of raw
binary data into a data set. Its companion stage Encode converts a data set
from a sequence of records to a stream of raw binary data.
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify the details about the single
input set from which you are selecting records.
Outputs page. This is where you specify details about the
processed data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced tab
allows you to specify how the stage executes.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. This stage has only one property, and you must supply a value for it. The property appears in the warning color (red by default) until you supply a value.
Category/Property | Values | Default | Dependent of
Options/Command Line | Command Line | N/A | N/A
Options Category
Command Line. Specifies the command line used for decoding the data
set. The command line must configure the UNIX command to accept input
from standard input and write its results to standard output. The
command must be located in the search path of your application and be
accessible by every processing node on which the Decode stage executes.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Preserve partitioning. This is Propagate by default. It adopts Set
or Clear from the previous stage. You can explicitly select Set or
Clear. Select Set to request that the next stage in the job should attempt
to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pools
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Inputs Page
The Inputs page allows you to specify details about the incoming data set. The Decode stage expects one incoming data set.
The General tab allows you to specify an optional description of the input
link. The Partitioning tab allows you to specify how incoming data is
partitioned before being decoded. The Columns tab specifies
the column definitions of incoming data.
Details about Decode stage partitioning are given in the following
section. See Chapter 3, Stage Editors, for a general description of the
other tabs.
Outputs Page
The Outputs page allows you to specify details about data output from the
Decode stage. The Decode stage can have only one output link.
The General tab allows you to specify an optional description of the
output link. The Columns tab specifies the column definitions of outgoing
data.
See Chapter 3, Stage Editors, for a general description of the tabs.
35
Difference Stage
The Difference stage is an active stage. It performs a record-by-record
comparison of two input data sets, which are different versions of the
same data set designated the before and after data sets. The Difference stage
outputs a single data set whose records represent the difference between
them. The stage assumes that the input data sets have been hash-partitioned and sorted in ascending order on the key columns you specify for
the Difference stage comparison. You can achieve this by using the Sort stage or by using the built-in sorting and partitioning abilities of the Difference stage.
The comparison is performed based on a set of difference key columns.
Two records are copies of one another if they have the same value for all
difference keys. You can also optionally specify change values. If two
records have identical key columns, you can compare the value columns
to see if one is an edited copy of the other.
The stage generates an extra column, DiffCode, which indicates the result
of each record comparison.
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify details about the two data sets being compared: the before and after data sets.
Outputs page. This is where you specify details about the
processed data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced tab
allows you to specify how the stage executes. The Link Ordering tab
allows you to specify which input link carries the before data set and which
the after data set.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. Some of the properties are mandatory, although
many have default settings. Properties without default settings appear in
the warning color (red by default) and turn black when you supply a value
for them.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.
Category/Property | Values | Default | Dependent of
Difference Keys/Key | Input Column | N/A | N/A
Difference Keys/Case Sensitive | True/False | True | Key
Difference Values/All non-Key Columns are Values | True/False | False | N/A
Difference Values/Case Sensitive | True/False | True | All non-Key Columns are Values
Options/Tolerate Unsorted Inputs | True/False | False | N/A
Options/Log Statistics | True/False | False | N/A
Options/Drop Output for Insert | True/False | False | N/A
Options/Drop Output for Delete | True/False | False | N/A
Options/Drop Output for Edit | True/False | False | N/A
Options/Drop Output for Copy | True/False | False | N/A
Options/Copy Code | number | 0 | N/A
Options/Deleted Code | number | 2 | N/A
Options/Edit Code | number | 3 | N/A
Options/Insert Code | number | 1 | N/A
Options Category
Tolerate Unsorted Inputs. Specifies that the input data sets are not
sorted. This property allows you to process groups of records that may be
arranged by the difference key columns but not sorted. The stage processes the input records in the order in which they appear on its input.
It is False by default.
Log Statistics. This property configures the stage to display result information containing the number of input records and the number of copy,
delete, edit, and insert records. It is False by default.
Drop Output for Insert. Specifies to drop (not generate) an output record
for an insert result. By default, an output record is always created by the
stage.
Drop Output for Delete. Specifies to drop (not generate) the output
record for a delete result. By default, an output record is always created by
the stage.
Drop Output for Edit. Specifies to drop (not generate) the output record
for an edit result. By default, an output record is always created by the
stage.
Drop Output for Copy. Specifies to drop (not generate) the output record
for a copy result. By default, an output record is always created by the
stage.
Copy Code. Allows you to specify an alternative value for the code that
indicates the after record is a copy of the before record. By default this code
is 0.
Deleted Code. Allows you to specify an alternative value for the code
that indicates that a record in the before set has been deleted from the after
set. By default this code is 2.
Edit Code. Allows you to specify an alternative value for the code that
indicates the after record is an edited version of the before record. By default
this code is 3.
Insert Code. Allows you to specify an alternative value for the code that
indicates a new record has been inserted in the after set that did not exist
in the before set. By default this code is 1.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Preserve partitioning. This is Propagate by default. It adopts Set
or Clear from the previous stage. You can explicitly select Set or
Clear. Select Set to request that the next stage in the job should attempt
to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pools
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Link Ordering
This tab allows you to specify which input link carries the before data set
and which carries the after data set.
By default the first link added will represent the before set. To rearrange
the links, choose an input link and click the up arrow button or the down
arrow button.
Inputs Page
The Inputs page allows you to specify details about the incoming data
sets. The Difference stage expects two incoming data sets: a before data set
and an after data set.
The General tab allows you to specify an optional description of the input
link. The Partitioning tab allows you to specify how incoming data is
partitioned before being compared. The Columns tab specifies the column
definitions of incoming data.
Details about Difference stage partitioning are given in the following
section. See Chapter 3, Stage Editors, for a general description of the
other tabs.
Sort. Select this to specify that data coming in on the link should be
sorted. Select the column or columns to sort on from the Available
list.
Stable. Select this if you want to preserve previously sorted data
sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is
also set, the first record is retained.
You can also specify sort direction, case sensitivity, and collating sequence
for each column in the Selected list by selecting it and right-clicking to
invoke the shortcut menu.
Outputs Page
The Outputs page allows you to specify details about data output from the
Difference stage. The Difference stage can have only one output link.
The General tab allows you to specify an optional description of the
output link. The Columns tab specifies the column definitions of outgoing
data. The Mapping tab allows you to specify the relationship between the
columns being input to the Difference stage and the Output columns.
Details about Difference stage mapping are given in the following section.
See Chapter 3, Stage Editors, for a general description of the other tabs.
Mapping Tab
For the Difference stage the Mapping tab allows you to specify how the
output columns are derived, i.e., what input columns map onto them or
how they are generated.
The left pane shows the columns from the before/after data sets plus the
DiffCode column. These are read only and cannot be modified on this tab.
The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it
in by dragging input columns over, or by using the Auto-match facility. By
default the data set columns are mapped automatically. You need to
ensure that there is an output column to carry the change code and that
this is mapped to the DiffCode column.
36
Column Import Stage
The Column Import stage is an active stage. It can have a single input link,
a single output link and a single rejects link.
The Column Import stage imports data from a single column and outputs
it to one or more columns. You would typically use it to divide data
arriving in a single column into multiple columns. The data would be
delimited in some way to tell the Column Import stage where to make the
divisions. The input column must contain string or binary data; the output columns can be of any data type.
You supply an import table definition to specify the target columns and
their types. This also determines the order in which data from the import
column is written to output columns. Information about the format of the
incoming column (e.g., how it is delimited) is given in the Format tab of
the Outputs page. You can optionally save reject records, that is, records
whose import was rejected, and write them to a rejects link.
In addition to importing a column you can also pass other columns
straight through the stage. So, for example, you could pass a key column
straight through.
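In outline, the stage behaves like the following sketch. This is editorial Python with hypothetical column names, assuming a simple comma-delimited input column; the real delimiters are whatever you describe on the Format tab of the Outputs page.

```python
# Sketch of a column import: split one delimited string column into
# typed output columns. Names and types here are hypothetical.
TARGETS = [("OrderID", int), ("Amount", float), ("Region", str)]

def column_import(record, import_column="comp_col", keep_input=False):
    parts = record[import_column].split(",")
    if len(parts) != len(TARGETS):
        raise ValueError("reject")              # would go to the rejects link
    out = {name: cast(part) for (name, cast), part in zip(TARGETS, parts)}
    out["Key"] = record["Key"]                  # passed straight through
    if keep_input:                              # Keep Input Column = True
        out[import_column] = record[import_column]
    return out
```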
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify the details about the single
input set from which you are selecting records.
Outputs page. This is where you specify details about the
processed data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced tab
allows you to specify how the stage executes.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. Some of the properties are mandatory, although
many have default settings. Properties without default settings appear in
the warning color (red by default) and turn black when you supply a value
for them.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.
Category/Property | Values | Default | Mandatory? | Repeats? | Dependent of
Input/Import Input Column | Input Column | N/A | | | N/A
Output/Column Method | Explicit/Schema File | Explicit | | | N/A
Output/Column to Import | Output Column | N/A | Y (if Column Method = Explicit) | Y | N/A
Output/Schema File | Pathname | N/A | Y (if Column Method = Schema file) | N | N/A
Options/Keep Input Column | True/False | False | | | N/A
Options/Reject Mode | Continue (warn)/Output/Fail | Continue | | | N/A
Input Category
Import Input Column. Specifies the name of the column containing the
string or binary data to import.
Output Category
Column Method. Specifies whether the columns to import should be
derived from column definitions on the Output page Columns tab
(Explicit) or from a schema file (Schema File).
Column to Import. Specifies an output column. The meta data for this
column determines the type that the import column will be converted to.
Repeat the property to specify multiple columns. You can specify the properties for each column using the Parallel tab of the Edit Column Metadata
dialog box (accessible from the shortcut menu on the columns grid of the
output Columns tab).
Schema File. Instead of specifying the target data type details via output
column definitions, you can use a schema file. You can type in the schema
file name or browse for it.
Options Category
Keep Input Column. Specifies whether the original input column should
be transferred to the output data set unchanged in addition to being
imported and converted. Defaults to False.
Reject Mode. The values of this property specify the following actions:
Fail. The stage fails when it encounters a record whose import is
rejected.
Output. The stage continues when it encounters a reject record and
writes the record to the reject link.
Continue. The stage is to continue but report failures to the log file.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or sequential mode. In parallel mode the input data is processed by the available nodes as specified in the Configuration file, and by any node constraints specified on the Advanced tab. In Sequential mode the entire data set is processed by the conductor node.
Inputs Page
The Inputs page allows you to specify details about the incoming data
sets. The Column Import stage expects one incoming data set.
The General tab allows you to specify an optional description of the input
link. The Partitioning tab allows you to specify how incoming data is
partitioned before being imported. The Columns tab specifies the column
definitions of incoming data.
Details about Column Import stage partitioning are given in the following
section. See Chapter 3, Stage Editors, for a general description of the
other tabs.
If the Column Import stage is operating in parallel mode, the data is partitioned using the default (Auto) method. This attempts to work out the best partitioning method depending on execution modes of current and preceding stages, whether the Preserve Partitioning option has been
set, and how many nodes are specified in the Configuration file. If the
Preserve Partitioning option has been set on the previous stage in the job,
this stage will attempt to preserve the partitioning of the incoming data.
If the Column Import stage is operating in sequential mode, it will first
collect the data using the default Auto collection method.
The Partitioning tab allows you to override this default behavior. The
exact operation of this tab depends on:
Whether the Column Import stage is set to execute in parallel or
sequential mode.
Whether the preceding stage in the job is set to execute in parallel
or sequential mode.
If the Column Import stage is set to execute in parallel, then you can set a
partitioning method by selecting from the Partitioning mode drop-down
list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage).
If the Column Import stage is set to execute in sequential mode, but the
preceding stage is executing in parallel, then you can set a collection
method from the Collection type drop-down list. This will override the
default collection method.
The following partitioning methods are available:
(Auto). DataStage attempts to work out the best partitioning
method depending on execution modes of current and preceding
stages, whether the Preserve Partitioning option has been set, and
how many nodes are specified in the Configuration file. This is the
default partitioning method for the Column Import stage.
Entire. Each file written to receives the entire data set.
Hash. The records are hashed into partitions based on the value of
a key column or columns selected from the Available list.
Modulus. The records are partitioned using a modulus function on
the key column selected from the Available list. This is commonly
used to partition on tag fields.
Random. The records are partitioned randomly, based on the
output of a random number generator.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is
also set, the first record is retained.
You can also specify sort direction, case sensitivity, and collating sequence
for each column in the Selected list by selecting it and right-clicking to
invoke the shortcut menu.
Outputs Page
The Outputs page allows you to specify details about data output from the
Column Import stage. The Column Import stage can have only one output
link, but can also have a reject link carrying records that have been
rejected.
The General tab allows you to specify an optional description of the
output link. The Format tab allows you to specify details about how data
in the column you are importing is formatted so the stage can divide it into
separate columns. The Columns tab specifies the column definitions of
incoming data. The Mapping tab allows you to specify the relationship
between the columns being input to the Column Import stage and the
Output columns.
Details about Column Import stage mapping are given in the following
section. See Chapter 3, Stage Editors, for a general description of the
other tabs.
Format Tab
The Format tab allows you to supply information about the format of the
column you are importing. You use it in the same way as you would to
describe the format of a flat file you were reading. The tab has a similar
format to the Properties tab and is described in detail on page 3-24.
Select a property type from the main tree then add the properties you want to
set to the tree structure by clicking on them in the Available properties to
add window. You can then set a value for that property in the Property
Value box. Pop-up help for each of the available properties appears if you hover the mouse pointer over it.
The following sections list the Property types and properties available for
each type.
Record level. These properties define details about how data records are
formatted in the column. The available properties are:
Fill char. Specify an ASCII character or a value in the range 0 to
255. This character is used to fill any gaps in an exported record
caused by column positioning properties. Set to 0 by default.
Final delimiter string. Specify a string to be written after the last
column of a record in place of the column delimiter. Enter one or
more ASCII characters (precedes the record delimiter if one is
used).
Final delimiter. Specify a single character to be written after the
last column of a record in place of the column delimiter. Type an
ASCII character or select one of whitespace, end, none, or null.
Type Defaults. These are properties that apply to all columns of a specific
data type unless specifically overridden at the column level. They are
divided into a number of subgroups according to data type.
General. These properties apply to several data types (unless overridden
at column level):
Byte order. Specifies how multiple byte data types (except string
and raw data types) are ordered. Choose from:
little-endian. The high byte is on the right.
big-endian. The high byte is on the left.
native-endian. As defined by the native format of the machine.
Format. Specifies the data representation format of a column.
Choose from:
binary
text
Layout max width. The maximum number of bytes in a column
represented as a string. Enter a number.
Layout width. The number of bytes in a column represented as a
string. Enter a number.
Pad char. Specifies the pad character used when strings or numeric
values are exported to an external string representation. Enter an
ASCII character or choose null.
String. These properties are applied to columns with a string data type,
unless overridden at column level.
Export EBCDIC as ASCII. Select this to specify that EBCDIC characters are written as ASCII characters.
Import ASCII as EBCDIC. Not relevant for input links.
Decimal. These properties are applied to columns with a decimal data
type unless overridden at column level.
Allow all zeros. Specifies whether to treat a packed decimal
column containing all zeros (which is normally illegal) as a valid
representation of zero. Select Yes or No.
Packed. Select Yes to specify that the decimal columns contain data
in packed decimal format or No to specify that they contain
unpacked decimal with a separate sign byte. This property has two
dependent properties as follows:
Check. Select Yes to verify that data is packed, or No to not verify.
Signed. Select Yes to use the existing sign when writing decimal
columns. Select No to write a positive sign (0xf) regardless of the column's actual sign value.
Precision. Specifies the precision where a decimal column is
written in text format. Enter a number.
Rounding. Specifies how to round a decimal column when writing
it. Choose from:
up (ceiling). Truncate source column towards positive infinity.
down (floor). Truncate source column towards negative infinity.
nearest value. Round the source column towards the nearest
representable value.
truncate towards zero. This is the default. Discard fractional
digits to the right of the right-most fractional digit supported by
the destination, regardless of sign.
Scale. Specifies how to round a source decimal when its precision
and scale are greater than those of the destination.
Numeric. These properties are applied to columns with an integer or float
data type unless overridden at column level.
C_format. Perform non-default conversion of data from integer or
floating-point data to a string. This property specifies a C-language
format string used for writing integer or floating point strings. This
is passed to sprintf().
In_format. Not relevant for input links.
Out_format. Format string used for conversion of data from
integer or floating-point data to a string. This is passed to sprintf().
Date. These properties are applied to columns with a date data type unless
overridden at column level.
Days since. Dates are written as a signed integer containing the
number of days since the specified date. Enter a date in the form
%yyyy-%mm-%dd.
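For example, with a base date of 1990-01-01 the date 1990-01-11 is written as the integer 10. A sketch of the arithmetic only:

```python
# "Days since" representation: a date stored as a signed integer offset.
from datetime import date

base = date(1990, 1, 1)                  # the specified base date
print((date(1990, 1, 11) - base).days)   # prints 10
print((date(1989, 12, 31) - base).days)  # prints -1 (the offset is signed)
```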
Mapping Tab
For the Column Import stage the Mapping tab allows you to specify how
the output columns are derived.
The left pane shows the columns the stage is deriving from the single
imported column. These are read only and cannot be modified on this tab.
The right pane shows the output columns for each link.
In the example the stage has automatically mapped the specified Columns
to Import onto the output columns. The Key column is an extra input
column and is automatically passed through the stage. Because the Keep Input Column property was set to True, the original column (comp_col
in this example) is available to map onto an output column.
We recommend that you maintain the automatic mappings of the generated columns when using this stage.
Reject Link
You cannot change the details of a Reject link. The link uses the column
definitions for the link rejecting the data records.
37
Column Export Stage
The Column Export stage is an active stage. It can have a single input link,
a single output link and a single rejects link.
The Column Export stage exports data from a number of columns of
different data types into a single column of data type string or binary. It is
the complementary stage to Column Import (see Chapter 36).
The input data column definitions determine the order in which the
columns are exported to the single output column. Information about how
the single column being exported is delimited is given in the Format tab of the Inputs page. You can optionally save reject records, that is, records
whose export was rejected.
In addition to exporting columns, you can also pass other columns
straight through the stage. So, for example, you could pass a key column
straight through.
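In outline this is the inverse of the sketch shown for the Column Import stage; again this is editorial Python with hypothetical column names, and the actual delimiting is controlled by the Format tab of the Inputs page.

```python
# Sketch of a column export: format several typed input columns into a
# single string column. Column names here are hypothetical.
def column_export(record, cols=("OrderID", "Amount", "Region")):
    out = {"exported_row": ",".join(str(record[c]) for c in cols)}
    out["Key"] = record["Key"]     # other columns can pass straight through
    return out
```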
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify the details about the single
input set from which you are selecting records.
Outputs page. This is where you specify details about the
processed data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. Some of the properties are mandatory, although
many have default settings. Properties without default settings appear in
the warning color (red by default) and turn black when you supply a value
for them.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.
Category/Property | Values | Default | Dependent of
Options/Export Output Column | Output Column | N/A | N/A
Options/Export Column Type | Binary/VarChar | | N/A
Options/Reject Mode | Continue (warn)/Output | Continue | N/A
Options/Column to Export | Input Column | N/A | N/A
Options/Schema File | Pathname | N/A | N/A
Options Category
Export Output Column. Specifies the name of the single column to which
the input column or columns are exported.
Export Column Type. Specify either binary or VarChar (string).
Reject Mode. The values of this property specify the following actions:
Output. The stage continues when it encounters a reject record and
writes the record to the rejects link.
Continue (warn). The stage is to continue but report failures to the
log file.
Column to Export. Specifies an input column the stage extracts data
from. The format properties for this column can be set on the Format tab
of the Inputs page. Repeat the property to specify multiple input columns.
Schema File. Instead of specifying the source data details via input
column definitions, you can use a schema file. You can type in the schema
file name or browse for it.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Preserve partitioning. This is Propagate by default. It adopts Set
or Clear from the previous stage. You can explicitly select Set or
Clear. Select Set to request that the next stage in the job should attempt
to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pools
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Inputs Page
The Inputs page allows you to specify details about the incoming data
sets. The Column Export stage expects one incoming data set.
The General tab allows you to specify an optional description of the input
link. The Partitioning tab allows you to specify how incoming data is
partitioned before being exported. The Format tab allows you to specify
details about how data in the column you are exporting will be formatted. The
Columns tab specifies the column definitions of incoming data.
Details about Column Export stage partitioning are given in the following
section. See Chapter 3, Stage Editors, for a general description of the
other tabs.
The Partitioning tab also allows you to specify that data arriving on the
input link should be sorted before being exported. The sort is always
carried out within data partitions. If the stage is partitioning incoming
data the sort occurs after the partitioning. If the stage is collecting data, the
sort occurs before the collection. The availability of sorting depends on the
partitioning method chosen.
Select the check boxes as follows:
Sort. Select this to specify that data coming in on the link should be
sorted. Select the column or columns to sort on from the Available
list.
Stable. Select this if you want to preserve previously sorted data
sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is
also set, the first record is retained.
You can also specify sort direction, case sensitivity, and collating sequence
for each column in the Selected list by selecting it and right-clicking to
invoke the shortcut menu.
Format Tab
The Format tab allows you to supply information about the format of the
column you are exporting. You use it in the same way as you would to
describe the format of a flat file you were writing. The tab has a similar
format to the Properties tab and is described in detail on page 3-24.
Select a property type from the main tree then add the properties you want to
set to the tree structure by clicking on them in the Available properties to
add window. You can then set a value for that property in the Property
Value box. Pop-up help for each of the available properties appears if you
hover the mouse pointer over it.
The following sections list the Property types and properties available for
each type.
Record level. These properties define details about how data records are
formatted in the column. The available properties are:
Fill char. Specify an ASCII character or a value in the range 0 to
255. This character is used to fill any gaps in an exported record
caused by column positioning properties. Set to 0 by default.
Final delimiter string. Specify a string to be written after the last
column of a record in place of the column delimiter. Enter one or
more characters.
Byte order. Specifies how multiple byte data types (except string
and raw data types) are ordered. Choose from:
little-endian. The high byte is on the right.
big-endian. The high byte is on the left.
native-endian. As defined by the native format of the machine.
Format. Specifies the data representation format of a column.
Choose from:
binary
text
Layout max width. The maximum number of bytes in a column
represented as a string. Enter a number.
Layout width. The number of bytes in a column represented as a
string. Enter a number.
Pad char. Specifies the pad character used when strings or numeric
values are exported to an external string representation. Enter an
ASCII character or choose null.
String. These properties are applied to columns with a string data type,
unless overridden at column level.
Export EBCDIC as ASCII. Select this to specify that EBCDIC characters are written as ASCII characters.
Import ASCII as EBCDIC. Not relevant for input links.
Decimal. These properties are applied to columns with a decimal data
type unless overridden at column level.
Allow all zeros. Specifies whether to treat a packed decimal
column containing all zeros (which is normally illegal) as a valid
representation of zero. Select Yes or No.
Packed. Select Yes to specify that the decimal columns contain data
in packed decimal format or No to specify that they contain
unpacked decimal with a separate sign byte. This property has two
dependent properties as follows:
Check. Select Yes to verify that data is packed, or No to not verify.
Signed. Select Yes to use the existing sign when writing decimal
columns. Select No to write a positive sign (0xf) regardless of the
column's actual sign value.
Time. These properties are applied to columns with a time data type
unless overridden at column level.
Format string. Specifies the format of columns representing time as
a string. By default this is %hh-%mm-%ss.
Is midnight seconds. Select this to specify that times are written as
a binary 32-bit integer containing the number of seconds elapsed
from the previous midnight.
Timestamp. These properties are applied to columns with a timestamp
data type unless overridden at column level.
Outputs Page
The Outputs page allows you to specify details about data output from the
Column Export stage. The Column Export stage can have only one output
link, but can also have a reject link carrying records that have been
rejected.
The General tab allows you to specify an optional description of the
output link. The Columns tab specifies the column definitions of incoming
data. The Mapping tab allows you to specify the relationship between the
columns being input to the Column Export stage and the Output columns.
Details about Column Export stage mapping are given in the following
section. See Chapter 3, Stage Editors, for a general description of the
other tabs.
Mapping Tab
For the Column Export stage the Mapping tab allows you to specify how
the output columns are derived, i.e., what input columns map onto them
or how they are generated.
The left pane shows the input columns plus the composite column that the
stage exports the specified input columns to. These are read only and
cannot be modified on this tab.
The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it
in by dragging input columns over, or by using the Auto-match facility.
In the example, the Key column is being passed straight through (it has not
been defined as a Column to Export in the stage properties). The remaining
columns are all being exported to comp_col, which is the specified Export
Column. You could also pass the original columns through the stage, if
required.
Reject Link
You cannot change the details of a Reject link. The link uses the column
definitions for the link rejecting the data records.
38. Make Subrecord Stage
The Make Subrecord stage is an active stage. It can have a single input link
and a single output link.
The Make Subrecord stage combines specified vectors in an input data set
into a vector of subrecords whose columns have the names and data types
of the original vectors. You specify the vector columns to be made into a
vector of subrecords and the name of the new subrecord. See Complex
Data Types on page 2-14 for an explanation of vectors and subrecords.
The Split Subrecord stage performs the inverse operation. See Chapter 39,
Split Subrecord Stage.
The length of the subrecord vector created by this operator equals the
length of the longest vector column from which it is created. If a variable-length vector column was used in subrecord creation, the subrecord vector
is also of variable length.
Vectors that are smaller than the largest combined vector are padded with
default values: NULL for nullable columns and the corresponding type-dependent value for non-nullable columns. When the Make Subrecord
stage encounters mismatched vector lengths, it warns you by writing to
the job log.
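The following minimal sketch (plain Python, not DataStage code; the column
and subrecord names are invented) illustrates the combining and padding
behavior described above:

    # Two vector columns of unequal length are combined into one vector of
    # subrecords; the shorter vector is padded (here with None, standing in
    # for the default value the stage would use).
    record = {"acct": [101, 102, 103], "amount": [9.99, 4.50]}

    longest = max(len(vec) for vec in record.values())
    parent = [
        {name: (vec[i] if i < len(vec) else None)
         for name, vec in record.items()}
        for i in range(longest)
    ]
    print({"parent": parent})
    # {'parent': [{'acct': 101, 'amount': 9.99},
    #             {'acct': 102, 'amount': 4.5},
    #             {'acct': 103, 'amount': None}]}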
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify the details about the single
input set from which you are selecting records.
Outputs page. This is where you specify details about the
processed data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced tab
allows you to specify how the stage executes.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. Some of the properties are mandatory, although
many have default settings. Properties without default settings appear in
the warning color (red by default) and turn black when you supply a value
for them.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.
Category/Property                          Values         Default  Mandatory?  Repeats?  Dependent of
Options/Subrecord Output Column            Output Column  N/A      Y           N         N/A
Options/Vector Column for Subrecord        Input Column   N/A      Y           Y         Key
Options/Disable Warning of Column Padding  True/False     False    N           N         N/A
Options Category
Subrecord Output Column. Specify the name of the subrecord into
which you want to combine the columns specified by the Vector Column
for Subrecord property.
Vector Column for Subrecord. Specify the name of the column to
include in the subrecord. You can specify multiple columns to be
combined into a subrecord. For each column, specify the property
followed by the name of the column to include.
Disable Warning of Column Padding. When the operator combines
vectors of unequal length, it pads columns and displays a message to this
effect. Optionally specify this property to disable display of the message.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Preserve partitioning. This is Propagate by default. It adopts Set
or Clear from the previous stage. You can explicitly select Set or
Clear. Select Set to request that the next stage in the job should
attempt to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pool
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Inputs Page
The Inputs page allows you to specify details about the incoming data
sets. The Make Subrecord stage expects one incoming data set.
The General tab allows you to specify an optional description of the input
link. The Partitioning tab allows you to specify how incoming data is
partitioned before being converted. The Columns tab specifies the column
definitions of incoming data.
Details about Make Subrecord stage partitioning are given in the following
section. See Chapter 3, Stage Editors, for a general description of the
other tabs.
If the Make Subrecord stage is set to execute in parallel, then you can set a
partitioning method by selecting from the Partitioning mode drop-down
list. This will override any current partitioning (even if the Preserve
Partitioning option has been set on the previous stage).
If the Make Subrecord stage is set to execute in sequential mode, but the
preceding stage is executing in parallel, then you can set a collection
method from the Collection type drop-down list. This will override the
default collection method.
The following partitioning methods are available:
(Auto). DataStage attempts to work out the best partitioning
method depending on execution modes of current and preceding
stages, whether the Preserve Partitioning option has been set, and
how many nodes are specified in the Configuration file.
Entire. Each file written to receives the entire data set.
Hash. The records are hashed into partitions based on the value of
a key column or columns selected from the Available list.
Modulus. The records are partitioned using a modulus function on
the key column selected from the Available list. This is commonly
used to partition on tag columns.
Random. The records are partitioned randomly, based on the
output of a random number generator.
Round Robin. The records are partitioned on a round robin basis
as they enter the stage.
Same. Preserves the partitioning already in place. This is the
default partitioning method for the Make Subrecord stage.
DB2. Replicates the DB2 partitioning method of a specific DB2
table. Requires extra properties to be set. Access these properties
by clicking the properties button.
Range. Divides a data set into approximately equal size partitions
based on one or more partitioning keys. Range partitioning is often
a preprocessing step to performing a total sort on a data set.
Requires extra properties to be set. Access these properties by
clicking the properties button.
The following Collection methods are available:
(Auto). DataStage attempts to work out the best collection method
depending on execution modes of current and preceding stages,
and how many nodes are specified in the Configuration file. This is
the default collection method for Make Subrecord stages.
Ordered. Reads all records from the first partition, then all records
from the second partition, and so on.
Round Robin. Reads a record from the first input partition, then
from the second partition, and so on. After reaching the last partition, the operator starts over.
Sort Merge. Reads records in an order based on one or more
columns of the record. This requires you to select a collecting key
column from the Available list.
The Partitioning tab also allows you to specify that data arriving on the
input link should be sorted before being converted. The sort is always
carried out within data partitions. If the stage is partitioning incoming
data the sort occurs after the partitioning. If the stage is collecting data, the
sort occurs before the collection. The availability of sorting depends on the
partitioning method chosen.
Select the check boxes as follows:
Sort. Select this to specify that data coming in on the link should be
sorted. Select the column or columns to sort on from the Available
list.
Stable. Select this if you want to preserve previously sorted data
sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is
also set, the first record is retained.
You can also specify sort direction, case sensitivity, and collating sequence
for each column in the Selected list by selecting it and right-clicking to
invoke the shortcut menu.
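As an illustration of how the Stable and Unique options interact, here is a
minimal sketch (plain Python, not DataStage code; the records are invented).
A stable sort preserves the existing order of records with equal key values,
so Unique then keeps the first record of each key:

    records = [{"key": 2, "v": "a"}, {"key": 1, "v": "b"}, {"key": 2, "v": "c"}]

    # Python's sort is stable, so the two key=2 records keep their order.
    stable_sorted = sorted(records, key=lambda r: r["key"])

    seen, unique = set(), []
    for r in stable_sorted:
        if r["key"] not in seen:    # only the first record per key is retained
            seen.add(r["key"])
            unique.append(r)
    print(unique)   # [{'key': 1, 'v': 'b'}, {'key': 2, 'v': 'a'}]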
Outputs Page
The Outputs page allows you to specify details about data output from the
Make Subrecord stage. The Make Subrecord stage can have only one
output link.
The General tab allows you to specify an optional description of the
output link. The Columns tab specifies the column definitions of incoming
data.
See Chapter 3, Stage Editors, for a general description of the tabs.
39. Split Subrecord Stage
The Split Subrecord stage separates an input subrecord field into a set of
top-level vector columns. It can have a single input link and a single
output link.
The stage creates one new vector column for each element of the original
subrecord. That is, each top-level vector column that is created has the
same number of elements as the subrecord from which it was created. The
stage outputs columns of the same name and data type as those of the
columns that comprise the subrecord.
The Make Subrecord stage performs the inverse operation (see Chapter 38,
Make Subrecord Stage).
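The following minimal sketch (plain Python, not DataStage code; the column
names are invented) illustrates the operation:

    # One record holding a vector of subrecords is split into one top-level
    # vector column per subrecord element, keeping names and order.
    record = {"parent": [{"acct": 101, "amount": 9.99},
                         {"acct": 102, "amount": 4.5}]}

    vectors = {}
    for element in record["parent"]:
        for name, value in element.items():
            vectors.setdefault(name, []).append(value)
    print(vectors)   # {'acct': [101, 102], 'amount': [9.99, 4.5]}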
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify the details about the single
input set from which you are selecting records.
Outputs page. This is where you specify details about the
processed data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced tab
allows you to specify how the stage executes.
Properties Tab
The Properties tab allows you to specify properties which determine what
the stage actually does. Some of the properties are mandatory, although
many have default settings. Properties without default settings appear in
the warning color (red by default) and turn black when you supply a value
for them.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.
Category/Property         Values        Default  Mandatory?  Repeats?  Dependent of
Options/Subrecord Column  Input Column  N/A      Y           N         N/A
Options Category
Subrecord Column. Specifies the name of the vector whose elements you
want to promote to a set of similarly named top-level columns.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Preserve partitioning. This is Propagate by default. It adopts Set
or Clear from the previous stage. You can explicitly select Set or
Clear. Select Set to request that the next stage in the job should
attempt to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pool
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Inputs Page
The Inputs page allows you to specify details about the incoming data
sets. There can be only one input to the Split Subrecord stage.
The General tab allows you to specify an optional description of the input
link. The Partitioning tab allows you to specify how incoming data is
partitioned before being converted. The Columns tab specifies the column
definitions of incoming data.
Details about Split Subrecord stage partitioning are given in the following
section. See Chapter 3, Stage Editors, for a general description of the
other tabs.
If the Split Subrecord stage is set to execute in parallel, then you can set a
partitioning method by selecting from the Partitioning mode drop-down
list. This will override any current partitioning (even if the Preserve Partitioning option has been set on the previous stage).
If the Split Subrecord stage is set to execute in sequential mode, but the
preceding stage is executing in parallel, then you can set a collection
method from the Collection type drop-down list. This will override the
default collection method.
The following partitioning methods are available:
(Auto). DataStage attempts to work out the best partitioning
method depending on execution modes of current and preceding
stages, whether the Preserve Partitioning option has been set, and
how many nodes are specified in the Configuration file. This is the
default partitioning method for the Split Subrecord stage.
Entire. Each file written to receives the entire data set.
Hash. The records are hashed into partitions based on the value of
a key column or columns selected from the Available list.
Modulus. The records are partitioned using a modulus function on
the key column selected from the Available list. This is commonly
used to partition on tag fields.
Random. The records are partitioned randomly, based on the
output of a random number generator.
Round Robin. The records are partitioned on a round robin basis
as they enter the stage.
Same. Preserves the partitioning already in place.
DB2. Replicates the DB2 partitioning method of a specific DB2
table. Requires extra properties to be set. Access these properties
by clicking the properties button.
Range. Divides a data set into approximately equal size partitions
based on one or more partitioning keys. Range partitioning is often
a preprocessing step to performing a total sort on a data set.
Requires extra properties to be set. Access these properties by
clicking the properties button.
The following Collection methods are available:
(Auto). DataStage attempts to work out the best collection method
depending on execution modes of current and preceding stages,
and how many nodes are specified in the Configuration file. This is
the default collection method for Split Subrecord stages.
Ordered. Reads all records from the first partition, then all records
from the second partition, and so on.
Round Robin. Reads a record from the first input partition, then
from the second partition, and so on. After reaching the last partition, the operator starts over.
Sort Merge. Reads records in an order based on one or more
columns of the record. This requires you to select a collecting key
column from the Available list.
Outputs Page
The Outputs page allows you to specify details about data output from the
Split Subrecord stage. The Split Subrecord stage can have only one output
link.
The General tab allows you to specify an optional description of the
output link. The Columns tab specifies the column definitions of incoming
data.
See Chapter 3, Stage Editors, for a general description of the tabs.
40. Promote Subrecord Stage
The Promote Subrecord stage is an active stage. It can have a single input
link and a single output link.
The Promote Subrecord stage promotes the columns of an input subrecord
to top-level columns. The number of output records equals the number of
subrecord elements. The data types of the input subrecord columns determine those of the corresponding top-level columns.
The Combine Records stage performs the inverse operation. See
Chapter 41, Combine Records Stage.
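The following minimal sketch (plain Python, not DataStage code; the column
names are invented) illustrates the promotion:

    # One input record whose subrecord vector has two elements produces two
    # top-level output records.
    record = {"parent": [{"acct": 101, "amount": 9.99},
                         {"acct": 102, "amount": 4.5}]}

    output = [dict(element) for element in record["parent"]]
    print(output)
    # [{'acct': 101, 'amount': 9.99}, {'acct': 102, 'amount': 4.5}]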
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify the details about the single
input set from which you are selecting records.
Outputs page. This is where you specify details about the
processed data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced tab
allows you to specify how the stage executes.
Properties
The Promote Subrecord Stage has one property:
Category/Property         Values        Default  Mandatory?  Repeats?  Dependent of
Options/Subrecord Column  Input Column  N/A      Y           N         N/A
Options Category
Subrecord Column. Specifies the name of the subrecord whose elements
will be promoted to top-level records.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Preserve partitioning. This is Propagate by default. It adopts Set
or Clear from the previous stage. You can explicitly select Set or
Clear. Select Set to request that the next stage in the job should
attempt to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pool
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Inputs Page
The Inputs page allows you to specify details about the incoming data
sets. The Promote Subrecord stage expects one incoming data set.
The General tab allows you to specify an optional description of the input
link. The Partitioning tab allows you to specify how incoming data is
partitioned before being converted. The Columns tab specifies the column
definitions of incoming data.
Details about Promote Subrecord stage partitioning are given in the
following section. See Chapter 3, Stage Editors, for a general description
of the other tabs.
If the Promote Subrecord stage is set to execute in parallel, then you can
set a partitioning method by selecting from the Partitioning mode drop-down
list. This will override any current partitioning (even if the Preserve
Partitioning option has been set on the previous stage).
If the Promote Subrecord stage is set to execute in sequential mode, but the
preceding stage is executing in parallel, then you can set a collection
method from the Collection type drop-down list. This will override the
default collection method.
The following partitioning methods are available:
(Auto). DataStage attempts to work out the best partitioning
method depending on execution modes of current and preceding
stages, whether the Preserve Partitioning option has been set, and
how many nodes are specified in the Configuration file.
Entire. Each file written to receives the entire data set.
Hash. The records are hashed into partitions based on the value of
a key column or columns selected from the Available list.
Modulus. The records are partitioned using a modulus function on
the key column selected from the Available list. This is commonly
used to partition on tag fields.
Random. The records are partitioned randomly, based on the
output of a random number generator.
Round Robin. The records are partitioned on a round robin basis
as they enter the stage.
Same. Preserves the partitioning already in place. This is the
default partitioning method for the Promote Subrecord stage.
DB2. Replicates the DB2 partitioning method of a specific DB2
table. Requires extra properties to be set. Access these properties
by clicking the properties button.
Range. Divides a data set into approximately equal size partitions
based on one or more partitioning keys. Range partitioning is often
a preprocessing step to performing a total sort on a data set.
Requires extra properties to be set. Access these properties by
clicking the properties button.
The following Collection methods are available:
(Auto). DataStage attempts to work out the best collection method
depending on execution modes of current and preceding stages,
and how many nodes are specified in the Configuration file. This is
the default collection method for Promote Subrecord stages.
Ordered. Reads all records from the first partition, then all records
from the second partition, and so on.
Round Robin. Reads a record from the first input partition, then
from the second partition, and so on. After reaching the last partition, the operator starts over.
Sort Merge. Reads records in an order based on one or more
columns of the record. This requires you to select a collecting key
column from the Available list.
The Partitioning tab also allows you to specify that data arriving on the
input link should be sorted before being converted. The sort is always
carried out within data partitions. If the stage is partitioning incoming
data the sort occurs after the partitioning. If the stage is collecting data, the
sort occurs before the collection. The availability of sorting depends on the
partitioning method chosen.
Select the check boxes as follows:
Sort. Select this to specify that data coming in on the link should be
sorted. Select the column or columns to sort on from the Available
list.
Stable. Select this if you want to preserve previously sorted data
sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is
also set, the first record is retained.
You can also specify sort direction, case sensitivity, and collating sequence
for each column in the Selected list by selecting it and right-clicking to
invoke the shortcut menu.
Outputs Page
The Outputs page allows you to specify details about data output from the
Promote Subrecord stage. The Promote Subrecord stage can have only one
output link.
The General tab allows you to specify an optional description of the
output link. The Columns tab specifies the column definitions of incoming
data.
See Chapter 3, Stage Editors, for a general description of the tabs.
41. Combine Records Stage
The Combine Records stage is an active stage. It can have a single input
link and a single output link.
The Combine Records stage combines records, in which particular key-column values are identical, into vectors of subrecords. As input, the stage
takes a data set in which one or more columns are chosen as keys. All adjacent records whose key columns contain the same value are gathered into
the same record as subrecords.
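The following minimal sketch (plain Python, not DataStage code; the column
and subrecord names are invented) illustrates the gathering. It keeps the
key as a top-level column, as when the Top Level Keys property is set to
True:

    from itertools import groupby

    records = [
        {"cust": "A", "amount": 9.99},
        {"cust": "A", "amount": 4.50},
        {"cust": "B", "amount": 1.25},
    ]

    # Adjacent records with the same key value become one record holding a
    # vector of subrecords.
    combined = [
        {"cust": key, "sub": [{"amount": r["amount"]} for r in group]}
        for key, group in groupby(records, key=lambda r: r["cust"])
    ]
    print(combined)
    # [{'cust': 'A', 'sub': [{'amount': 9.99}, {'amount': 4.5}]},
    #  {'cust': 'B', 'sub': [{'amount': 1.25}]}]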
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify the details about the single
input set from which you are selecting records.
Outputs page. This is where you specify details about the
processed data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced tab
allows you to specify how the stage executes.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. Some of the properties are mandatory, although
many have default settings. Properties without default settings appear in
the warning color (red by default) and turn black when you supply a value
for them.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.
Category/Property                Values         Default  Mandatory?  Repeats?  Dependent of
Options/Subrecord Output Column  Output Column  N/A      Y           N         N/A
Options/Key                      Input Column   N/A      Y           Y         N/A
Options/Case Sensitive           True/False     True     N           N         Key
Options/Top Level Keys           True/False     False    N           N         N/A
Options Category
Subrecord Output Column. Specify the name of the subrecord that the
Combine Records stage creates.
Top Level Keys. Specify whether to leave keys as top-level columns or
have them put into the subrecord. False by default.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Preserve partitioning. This is Propagate by default. It adopts Set
or Clear from the previous stage. You can explicitly select Set or
Clear. Select Set to request that the next stage in the job should
attempt to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pool
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Inputs Page
The Inputs page allows you to specify details about the incoming data
sets. The Combine Records stage expects one incoming data set.
The General tab allows you to specify an optional description of the input
link. The Partitioning tab allows you to specify how incoming data is
partitioned before being converted. The Columns tab specifies the column
definitions of incoming data.
Details about Combine Records stage partitioning are given in the following
section. See Chapter 3, Stage Editors, for a general description of the
other tabs.
The Partitioning tab also allows you to specify that data arriving on the
input link should be sorted before being converted. The sort is always
carried out within data partitions. If the stage is partitioning incoming
data the sort occurs after the partitioning. If the stage is collecting data, the
sort occurs before the collection. The availability of sorting depends on the
partitioning method chosen.
Select the check boxes as follows:
Sort. Select this to specify that data coming in on the link should be
sorted. Select the column or columns to sort on from the Available
list.
Stable. Select this if you want to preserve previously sorted data
sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is
also set, the first record is retained.
You can also specify sort direction, case sensitivity, and collating sequence
for each column in the Selected list by selecting it and right-clicking to
invoke the shortcut menu.
Outputs Page
The Outputs page allows you to specify details about data output from the
Combine Records stage. The Combine Records stage can have only one
output link.
The General tab allows you to specify an optional description of the
output link. The Columns tab specifies the column definitions of incoming
data.
See Chapter 3, Stage Editors, for a general description of the tabs.
42. Make Vector Stage
The Make Vector stage is an active stage. It can have a single input link and
a single output link.
The Make Vector stage combines specified columns of an input data record
into a vector of columns of the same type. The input columns must be
consecutive and numbered in ascending order. The numbers must
increase by one. The columns must be named column_name0 to
column_namen, where column_name starts the name of a column and 0 and
n are the first and last of its consecutive numbers. All these columns are
combined into a vector of the same length as the number of columns (n+1).
The vector is called column_name.
The Split Vector stage performs the inverse operation. See Chapter 43,
Split Vector Stage.
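The following minimal sketch (plain Python, not DataStage code; the column
names are invented) illustrates the combining:

    import re

    # Consecutively numbered columns col0..col2 are combined into a single
    # vector column named "col"; other columns are untouched.
    record = {"col0": 10, "col1": 20, "col2": 30, "other": "x"}

    name = "col"   # the Columns Common Partial Name
    numbered = sorted(
        (k for k in record if re.fullmatch(re.escape(name) + r"\d+", k)),
        key=lambda k: int(k[len(name):]),
    )
    out = {k: v for k, v in record.items() if k not in numbered}
    out[name] = [record[k] for k in numbered]
    print(out)   # {'other': 'x', 'col': [10, 20, 30]}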
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify the details about the single
input set from which you are selecting records.
Outputs page. This is where you specify details about the
processed data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced tab
allows you to specify how the stage executes.
Properties
The Make Vector stage has one property:
Category/Property                    Values  Default  Mandatory?  Repeats?  Dependent of
Options/Columns Common Partial Name  Name    N/A      Y           N         N/A
Options Category
Columns Common Partial Name. Specifies the beginning column_name
of the series of consecutively numbered columns column_name0 to
column_namen to be combined into a vector called column_name.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Preserve partitioning. This is Propagate by default. It adopts Set
or Clear from the previous stage. You can explicitly select Set or
Clear. Select Set to request that the next stage in the job should
attempt to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pool
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Inputs Page
The Inputs page allows you to specify details about the incoming data
sets. The Make Vector stage expects one incoming data set.
The General tab allows you to specify an optional description of the input
link. The Partitioning tab allows you to specify how incoming data is
partitioned before being converted. The Columns tab specifies the column
definitions of incoming data.
Details about Make Vector stage partitioning are given in the following
section. See Chapter 3, Stage Editors, for a general description of the
other tabs.
Outputs Page
The Outputs page allows you to specify details about data output from the
Make Vector stage. The Make Vector stage can have only one output link.
The General tab allows you to specify an optional description of the
output link. The Columns tab specifies the column definitions of incoming
data.
See Chapter 3, Stage Editors, for a general description of the tabs.
43. Split Vector Stage
The Split Vector stage is an active stage. It can have a single input link and a single output
link.
The Split Vector stage promotes the elements of a fixed-length vector to a
set of similarly named top-level columns. The stage creates columns of the
format name0 to namen, where name is the original vector's name and 0 and
n are the first and last elements of the vector.
The Make Vector stage performs the inverse operation.
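The following minimal sketch (plain Python, not DataStage code; the column
names are invented) illustrates the promotion:

    # A fixed-length vector column "col" becomes top-level columns col0..col2.
    record = {"col": [10, 20, 30], "other": "x"}

    out = {k: v for k, v in record.items() if k != "col"}
    out.update({"col%d" % i: v for i, v in enumerate(record["col"])})
    print(out)   # {'other': 'x', 'col0': 10, 'col1': 20, 'col2': 30}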
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify the details about the single
input set from which you are selecting records.
Outputs page. This is where you specify details about the
processed data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced
tab allows you to specify how the stage executes.
Properties
The Split Vector stage has one property:
Category/Property      Values  Default  Mandatory?  Repeats?  Dependent of
Options/Vector Column  Name    N/A      Y           N         N/A
Options Category
Vector Column. Specifies the name of the vector whose elements you
want to promote to a set of similarly named top-level columns.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Preserve partitioning. This is Propagate by default. It adopts Set
or Clear from the previous stage. You can explicitly select Set or
Clear. Select Set to request that the next stage in the job should
attempt to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pool
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Inputs Page
The Inputs page allows you to specify details about the incoming data
sets. There can be only one input to the Split Vector stage.
The General tab allows you to specify an optional description of the input
link. The Partitioning tab allows you to specify how incoming data is
partitioned before being converted. The Columns tab specifies the column
definitions of incoming data.
Details about Split Vector stage partitioning are given in the following
section. See Chapter 3, Stage Editors, for a general description of the
other tabs.
If the Split Vector stage is set to execute in parallel, then you can set a
partitioning method by selecting from the Partitioning mode drop-down
list. This will override any current partitioning (even if the Preserve
Partitioning option has been set on the previous stage).
If the Split Vector stage is set to execute in sequential mode, but the
preceding stage is executing in parallel, then you can set a collection
method from the Collection type drop-down list. This will override the
default collection method.
The following partitioning methods are available:
(Auto). DataStage attempts to work out the best partitioning
method depending on execution modes of current and preceding
stages, whether the Preserve Partitioning option has been set, and
how many nodes are specified in the Configuration file. This is the
default partitioning method for the Split Vector stage.
Entire. Each file written to receives the entire data set.
Hash. The records are hashed into partitions based on the value of
a key column or columns selected from the Available list.
Modulus. The records are partitioned using a modulus function on
the key column selected from the Available list. This is commonly
used to partition on tag fields.
Random. The records are partitioned randomly, based on the
output of a random number generator.
Round Robin. The records are partitioned on a round robin basis
as they enter the stage.
Same. Preserves the partitioning already in place.
DB2. Replicates the DB2 partitioning method of a specific DB2
table. Requires extra properties to be set. Access these properties
by clicking the properties button.
Range. Divides a data set into approximately equal size partitions
based on one or more partitioning keys. Range partitioning is often
a preprocessing step to performing a total sort on a data set.
Requires extra properties to be set. Access these properties by
clicking the properties button.
The following Collection methods are available:
(Auto). DataStage attempts to work out the best collection method
depending on execution modes of current and preceding stages,
and how many nodes are specified in the Configuration file. This is
the default collection method for Split Vector stages.
Ordered. Reads all records from the first partition, then all records
from the second partition, and so on.
Round Robin. Reads a record from the first input partition, then
from the second partition, and so on. After reaching the last partition, the operator starts over.
Sort Merge. Reads records in an order based on one or more
columns of the record. This requires you to select a collecting key
column from the Available list.
The Partitioning tab also allows you to specify that data arriving on the
input link should be sorted before being converted. The sort is always
carried out within data partitions. If the stage is partitioning incoming
data the sort occurs after the partitioning. If the stage is collecting data, the
sort occurs before the collection. The availability of sorting depends on the
partitioning method chosen.
Select the check boxes as follows:
Sort. Select this to specify that data coming in on the link should be
sorted. Select the column or columns to sort on from the Available
list.
Stable. Select this if you want to preserve previously sorted data
sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is
also set, the first record is retained.
You can also specify sort direction, case sensitivity, and collating sequence
for each column in the Selected list by selecting it and right-clicking to
invoke the shortcut menu.
Outputs Page
The Outputs page allows you to specify details about data output from the
Split Vector stage. The Split Vector stage can have only one output link.
The General tab allows you to specify an optional description of the
output link. The Columns tab specifies the column definitions of incoming
data.
Details about Split Vector stage mapping are given in the following section.
See Chapter 3, Stage Editors, for a general description of the other tabs.
44. Head Stage
The Head Stage is an active stage. It can have a single input link and a
single output link.
The Head Stage selects the first N records from each partition of an input
data set and copies the selected records to an output data set. You determine which records are copied by setting properties which allow you to
specify:
The number of records to copy
The partition from which the records are copied
The location of the records to copy
The number of records to skip before the copying operation begins
This stage is helpful in testing and debugging applications with large data
sets. For example, the Partition property lets you see data from a single
partition to determine if the data is being partitioned as you want it to be.
The Skip property lets you access a certain portion of a data set.
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify the details about the single
input set from which you are selecting records.
Outputs page. This is where you specify details about the
processed data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced tab
allows you to specify how the stage executes.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. Some of the properties are mandatory, although
many have default settings. Properties without default settings appear in
the warning color (red by default) and turn black when you supply a value
for them.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.
Category/Property                    Values            Default  Mandatory?                     Repeats?  Dependent of
Rows/All Rows                        True/False        False    N                              N         N/A
Rows/Number of Rows (per Partition)  Count             10       N                              N         N/A
Rows/Period (per Partition)          Number            N/A      N                              N         N/A
Rows/Skip (per Partition)            Number            N/A      N                              N         N/A
Partitions/All Partitions            Partition Number  N/A      N                              Y         N/A
Partitions/Partition Number          Number            N/A      Y (if All Partitions = False)  Y         N/A
Rows Category
All Rows. Copy all input rows to the output data set. You can skip rows
before Head performs its copy operation by using the Skip property. The
Number of Rows property is not needed if All Rows is true.
Number of Rows (per Partition). Specify the number of rows to copy
from each partition of the input data set to the output data set. The default
44-2
value is 10. The Number of Rows property is not needed if All Rows is
true.
Period (per Partition). Copy every Pth record in a partition, where P is
the period. You can start the copy operation after records have been
skipped by using the Skip property. P must be greater than or equal to 1.
Skip (per Partition). Ignore the first number of rows of each partition of
the input data set, where number is the number of rows to skip. The default
skip count is 0.
Partitions Category
All Partitions. If False, copy records only from the indicated partition,
specified by number. By default, the operator copies rows from all
partitions.
Partition Number. Specifies particular partitions to perform the Head
operation on. You can specify the Partition Number property multiple
times to specify multiple partition numbers.
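The following minimal sketch (plain Python, not DataStage code; the
function and its defaults are invented for the example) shows one way the
row-selection properties could combine within a single partition:

    def head_partition(rows, num_rows=10, period=1, skip=0, all_rows=False):
        # Skip rows, then copy every period-th row until num_rows have been
        # copied (or to the end of the partition if all_rows is set).
        out, copied = [], 0
        for i, row in enumerate(rows):
            if i < skip or (i - skip) % period != 0:
                continue
            if not all_rows and copied >= num_rows:
                break
            out.append(row)
            copied += 1
        return out

    print(head_partition(range(20), num_rows=3, period=2, skip=4))  # [4, 6, 8]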
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Preserve partitioning. This is Propagate by default. It adopts Set
or Clear from the previous stage. You can explicitly select Set or
Clear. Select Set to request that the next stage in the job should
attempt to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pool
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Inputs Page
The Inputs page allows you to specify details about the incoming data
sets. The Head stage expects one input.
The General tab allows you to specify an optional description of the input
link. The Partitioning tab allows you to specify how incoming data is
partitioned before being headed. The Columns tab specifies the column
definitions of incoming data.
Details about Head stage partitioning are given in the following section.
See Chapter 3, Stage Editors, for a general description of the other tabs.
Outputs Page
The Outputs page allows you to specify details about data output from the
Head stage. The Head stage can have only one output link.
The General tab allows you to specify an optional description of the
output link. The Columns tab specifies the column definitions of incoming
data. The Mapping tab allows you to specify the relationship between the
columns being input to the Head stage and the Output columns.
Details about Head stage mapping are given in the following section. See
Chapter 3, Stage Editors, for a general description of the other tabs.
Mapping Tab
For the Head stage the Mapping tab allows you to specify how the output
columns are derived, i.e., what input columns map onto them or how they
are generated.
The left pane shows the input columns and/or the generated columns.
These are read only and cannot be modified on this tab.
The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill
it in by dragging input columns over, or by using the Auto-match facility.
45. Tail Stage
The Tail Stage is an active stage. It can have a single input link and a single
output link.
The Tail Stage selects the last N records from each partition of an input
data set and copies the selected records to an output data set. You determine which records are copied by setting properties which allow you to
specify:
The number of records to copy
The partition from which the records are copied
This stage is helpful in testing and debugging applications with large data
sets. For example, the Partition property lets you see data from a single
partition to determine if the data is being partitioned as you want it to be.
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify the details about the single
input set from which you are selecting records.
Outputs page. This is where you specify details about the
processed data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced
tab allows you to specify how the stage executes.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. Some of the properties are mandatory, although
many have default settings. Properties without default settings appear in
the warning color (red by default) and turn black when you supply a value
for them.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.
Category/Property                    Values            Default  Mandatory?                     Repeats?  Dependent of
Rows/Number of Rows (per Partition)  Count             10       N                              N         N/A
Partitions/All Partitions            Partition Number  N/A      N                              Y         N/A
Partitions/Partition Number          Number            N/A      Y (if All Partitions = False)  Y         N/A
Rows Category
Number of Rows (per Partition). Specify the number of rows to copy
from each partition of the input data set to the output data set. The default
value is 10.
Partitions Category
All Partitions. If False, copy records only from the indicated partition,
specified by number. By default, the operator copies records from all
partitions.
Partition Number. Specifies particular partitions to perform the Tail operation on. You can specify the Partition Number property multiple times to
specify multiple partition numbers.
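The following minimal sketch (plain Python, not DataStage code; the
function is invented for the example) illustrates the selection within a
single partition:

    from collections import deque

    def tail_partition(rows, num_rows=10):
        # A bounded deque discards older records as new ones arrive, leaving
        # the last num_rows records of the partition.
        return list(deque(rows, maxlen=num_rows))

    print(tail_partition(range(100), num_rows=3))   # [97, 98, 99]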
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Preserve partitioning. This is Propagate by default. It adopts Set
or Clear from the previous stage. You can explicitly select Set or
Clear. Select Set to request that the next stage in the job should
attempt to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pool
or pools specified in the grid. The grid allows you to make choices
from drop down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Inputs Page
The Inputs page allows you to specify details about the incoming data
sets. The Tail stage expects one input.
The General tab allows you to specify an optional description of the input
link. The Partitioning tab allows you to specify how incoming data is
partitioned before being tailed. The Columns tab specifies the column
definitions of incoming data.
Details about Tail stage partitioning are given in the following section. See
Chapter 3, Stage Editors, for a general description of the other tabs.
The Partitioning tab also allows you to specify that data arriving on the
input link should be sorted before being tailed. The sort is always
carried out within data partitions. If the stage is partitioning incoming
data the sort occurs after the partitioning. If the stage is collecting data,
the sort occurs before the collection. The availability of sorting depends
on the partitioning method chosen.
Select the check boxes as follows:
Sort. Select this to specify that data coming in on the link should be
sorted. Select the column or columns to sort on from the Available
list.
Stable. Select this if you want to preserve previously sorted data
sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is
also set, the first record is retained.
You can also specify sort direction, case sensitivity, and collating sequence
for each column in the Selected list by selecting it and right-clicking to
invoke the shortcut menu.
Outputs Page
The Outputs page allows you to specify details about data output from the
Tail stage. The Tail stage can have only one output link.
The General tab allows you to specify an optional description of the
output link. The Columns tab specifies the column definitions of incoming
data. The Mapping tab allows you to specify the relationship between the
columns being input to the Tail stage and the Output columns.
Details about Tail stage mapping are given in the following section. See
Chapter 3, Stage Editors, for a general description of the other tabs.
Mapping Tab
For the Tail stage the Mapping tab allows you to specify how the output
columns are derived, i.e., what input columns map onto them or how they
are generated.
The left pane shows the input columns and/or the generated columns.
These are read only and cannot be modified on this tab.
The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill
it in by dragging input columns over, or by using the Auto-match facility.
46. Compare Stage
The Compare stage is an active stage. It can have two input links and a
single output link.
The Compare stage performs a column-by-column comparison of records
in two presorted input data sets. You can restrict the comparison to specified key columns.
The Compare stage does not change the table definition, partitioning, or
content of the records in either input data set. It transfers both data sets
intact to a single output data set generated by the stage. The comparison
results are also recorded in the output data set.
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify the details about the single
input set from which you are selecting records.
Outputs page. This is where you specify details about the
processed data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced
tab allows you to specify how the stage executes.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. Some of the properties are mandatory, although
many have default settings. Properties without default settings appear in
the warning color (red by default) and turn black when you supply a value
for them.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.
Category/Property                      Values        Default  Mandatory?  Repeats?  Dependent of
Options/Abort On Difference            True/False    False    N           N         N/A
Options/Warn on Record Count Mismatch  True/False    False    N           N         N/A
Options/Equals Value                   number        0        N           N         N/A
Options/First is Empty Value           number        1        N           N         N/A
Options/Greater Than Value             number        2        N           N         N/A
Options/Less Than Value                number        -1       N           N         N/A
Options/Second is Empty Value          number        -2       N           N         N/A
Options/Key                            Input Column  N/A      N           Y         N/A
Options/Case Sensitive                 True/False    True     N           N         Key
Options Category
Abort On Difference. This property forces the stage to abort its operation
each time a difference is encountered between two corresponding
columns in any record of the two input data sets. This is False by default;
if you set it to True, you cannot set Warn on Record Count Mismatch.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Link Ordering
By default the first link added will represent the First set. To rearrange the
links, choose an input link and click the up arrow button or the down
arrow button.
Inputs Page
The Inputs page allows you to specify details about the incoming data
sets. The Compare stage expects two incoming data sets.
The General tab allows you to specify an optional description of the input
link. The Partitioning tab allows you to specify how incoming data is
partitioned before being compared. The Columns tab specifies the column
definitions of incoming data.
Details about Compare stage partitioning are given in the following
section. See Chapter 3, Stage Editors, for a general description of the
other tabs.
Outputs Page
The Outputs page allows you to specify details about data output from the
Compare stage. The Compare stage can have only one output link.
The General tab allows you to specify an optional description of the
output link. The Columns tab specifies the column definitions of incoming
data.
See Chapter 3, Stage Editors, for a general description of the tabs.
Chapter 47. Peek Stage
The Peek stage is an active stage. It has a single input link and any number
of output links.
The Peek stage lets you print record column values either to the job log or
to a separate output link as the stage copies records from its input data set
to one or more output data sets. This can be helpful for monitoring the
progress of your application or to diagnose a bug in your application.
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify the details about the single
input set from which you are selecting records.
Outputs page. This is where you specify details about the
processed data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced tab
allows you to specify how the stage executes.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. Some of the properties are mandatory, although
many have default settings. Properties without default settings appear in
the warning color (red by default) and turn black when you supply a value
for them.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.
Category/Property                        Values           Default   Mandatory?                              Repeats?   Dependent of
Rows/All Records (After Skip)            True/False       False                                                        N/A
Rows/Number of Records (Per Partition)   number           10                                                           N/A
Rows/Period (per Partition)              number           N/A                                                          N/A
Rows/Skip (per Partition)                number           N/A                                                          N/A
Columns/Peek All Input Columns           True/False       True                                                         N/A
Columns/Input Column to Peek             Input Column     N/A       Y (if Peek All Input Columns = False)   Y          N/A
Partitions/All Partitions                True/False       True                                                         N/A
Partitions/Partition Number              number           N/A       Y (if All Partitions = False)                      N/A
Options/Peek Records Output Mode         Job Log/Output   Job Log                                                      N/A
Options/Show Column Names                True/False       False                                                        N/A
Options/Delimiter String                 space/nl/tab     space                                                        N/A
Rows Category
All Records (After Skip). True to print all records from each partition. Set
to False by default.
Columns Category
Peek All Input Columns. True by default and prints all the input
columns. Set to False to specify that only selected columns will be printed
and specify these columns using the Input Column to Peek property.
Input Column to Peek. If you have set Peek All Input Columns to False,
use this property to specify a column to be printed. Repeat the property to
specify multiple columns.
Partitions Category
All Partitions. Set to True by default. Set to False to specify that only
certain partitions should have columns printed, and specify which partitions using the Partition Number property.
Partition Number. If you have set All Partitions to False, use this property
to specify which partition you want to print columns from. Repeat the
property to specify multiple partitions.
Options Category
Peek Records Output Mode. Specifies whether the output should go to
an output column (the Peek Records column) or to the job log.
Show Column Names. If True, causes the stage to print the column
name, followed by a colon, followed by the column value. By default, the
stage prints only the column value, followed by a space.
Delimiter String. The string to use as a delimiter on columns. Can be
space, tab or newline. The default is space.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Preserve partitioning. This is Propagate by default. It adopts Set
or Clear from the previous stage. You can explicitly select Set or
Clear. Select Set to request that the next stage in the job should
attempt to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pool
or pools specified in the grid. The grid allows you to make choices
from drop-down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Link Ordering
This tab allows you to specify which output link carries the peek records
data set if you have chosen to output the records to a link rather than the
job log.
By default the last link added will represent the peek data set. To rearrange
the links, choose an output link and click the up arrow button or the down
arrow button.
Inputs Page
The Inputs page allows you to specify details about the incoming data
sets. The Peek stage expects one incoming data set.
The General tab allows you to specify an optional description of the input
link. The Partitioning tab allows you to specify how incoming data is
partitioned before being peeked. The Columns tab specifies the column
definitions of incoming data.
Details about Peek stage partitioning are given in the following section.
See Chapter 3, Stage Editors, for a general description of the other tabs.
Outputs Page
The Outputs page allows you to specify details about data output from the
Peek stage. The Peek stage can have any number of output links. Select the
link whose details you are looking at from the Output name drop-down
list.
The General tab allows you to specify an optional description of the
output link. The Columns tab specifies the column definitions of incoming
data. The Mapping tab allows you to specify the relationship between the
columns being input to the Peek stage and the Output columns.
Details about Peek stage mapping are given in the following section. See
Chapter 3, Stage Editors, for a general description of the other tabs.
Mapping Tab
For the Peek stage the Mapping tab allows you to specify how the output
columns are derived, i.e., what input columns map onto them or how they
are generated.
The left pane shows the columns being peeked. These are read only and
cannot be modified on this tab.
The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it
in by dragging input columns over, or by using the Auto-match facility.
Chapter 48. SAS Stage
The SAS stage is an active stage. It can have multiple input links and
multiple output links.
The SAS stage allows you to execute part or all of an SAS application in
parallel. It reduces or eliminates the performance bottlenecks that might
otherwise occur when SAS is run on a parallel computer.
DataStage enables SAS users to:
Access, for reading or writing, large volumes of data in parallel
from parallel relational databases, with much higher throughput
than is possible using PROC SQL.
Process parallel streams of data with parallel instances of SAS
DATA and PROC steps, enabling scoring or other data transformations to be done in parallel with minimal changes to existing SAS
code.
Store large data sets in parallel, eliminating restrictions on data-set
size imposed by your file system or physical disk-size limitations.
Parallel data sets are accessed from SAS programs in the same way
as conventional SAS data sets, but at much higher data I/O rates.
Realize the benefits of pipeline parallelism, in which some number
of SAS stages run at the same time, each receiving data from the
previous process as it becomes available.
The stage editor has three pages:
Stage page. This is always present and is used to specify general
information about the stage.
Inputs page. This is where you specify the details about the data
sets being input to the SAS code.
Outputs page. This is where you specify details about the
processed data being output from the stage.
Stage Page
The General tab allows you to specify an optional description of the stage.
The Properties tab lets you specify what the stage does. The Advanced tab
allows you to specify how the stage executes.
Properties
The Properties tab allows you to specify properties which determine what
the stage actually does. Some of the properties are mandatory, although
many have default settings. Properties without default settings appear in
the warning color (red by default) and turn black when you supply a value
for them.
The following table gives a quick reference list of the properties and their
attributes. A more detailed description of each property follows.
Category/Property                           Values                     Default   Mandatory?                            Repeats?   Dependent of
SAS Source/Source Method                    Explicit/Source File       Explicit                                                   N/A
SAS Source/Source                                                      N/A       Y (if Source Method = Explicit)                  N/A
SAS Source/Source File                                                 N/A       Y (if Source Method = Source File)               N/A
Inputs/Input Link Number                    number                     N/A                                             Y          N/A
Inputs/Input SAS Data Set Name              string                     N/A       Y (if input link number specified)    N          N/A
Outputs/Output Link Number                  number                     N/A                                             Y          N/A
Outputs/Output SAS Data Set Name            string                     N/A       Y (if output link number specified)              N/A
Options/Disable Working Directory Warning   True/False                 False                                                      N/A
Options/Convert Local                       True/False                 False                                                      N/A
Options/Debug Program                       No/Verbose/Yes             No                                                         N/A
Options/SAS List File Location Type         File/Job Log/None/Output   Job Log                                                    N/A
Options/SAS Log File Location Type          File/Job Log/None/Output   Job Log                                                    N/A
Options/SAS Options                         string                     N/A                                                        N/A
Options/Working Directory                   pathname                   N/A                                                        N/A
Inputs Category
Input Link Number. Specifies inputs to the SAS code in terms of input
link numbers. Repeat the property to specify multiple links. This has a
dependent property:
Input SAS Data Set Name.
The name of the SAS data set receiving its input from the specified
input link.
Outputs Category
Output Link Number. Specifies an output link to connect to the output of
the SAS code. Repeat the property to specify multiple links. This has a
dependent property:
Output SAS Data Set Name.
The name of the SAS data set sending its output to the specified
output link.
Options Category
Disable Working Directory Warning. Disables the warning message
generated by the stage when you omit the Working Directory property. By
default, if you omit the Working Directory property, the SAS working
directory is indeterminate and the stage generates a warning message.
Convert Local. Specify that the conversion phase of the SAS stage (from
the input data set format to the stage SAS data set format) should run on
the same nodes as the SAS stage. If this option is not set, the conversion
runs by default with the previous stage's degree of parallelism and, if
possible, on the same nodes as the previous stage.
Debug Program. A setting of Yes causes the stage to ignore errors in the
SAS program and continue execution of the application. This allows your
application to generate output even if an SAS step has an error. By default,
the setting is No, which causes the stage to abort when it detects an error
in the SAS program.
Setting the property as Verbose is the same as Yes, but in addition it causes
the operator to echo the SAS source code executed by the operator.
SAS List File Location Type. Specifying File for this property causes the
stage to write the SAS list file generated by the executed SAS code to a
plain text file. The list is sorted before being written out. The name of the
list file, which cannot be modified, is dsident.lst, where ident is the name
of the stage, including an index in parentheses if there is more than one
stage with the same name. For example, dssas(1).lst is the list file from the
second SAS stage in a data flow.
Specifying Job Log causes the list to be written to the DataStage job log.
Specifying Output causes the list file to be written to an output data set of
the stage. The data set from a parallel SAS stage containing the list information will not be sorted.
If you specify None, no list is generated.
SAS Log File Location Type. Specifying File for this property causes the
stage to write the SAS log file generated by the executed SAS code to a
plain text file, named in the same way as the list file. Specifying Job Log
causes the log to be written to the DataStage job log. Specifying Output
causes the log to be written to an output data set of the stage. If you
specify None, no log is generated.
SAS Options. Specify any options for the SAS code in a quoted string.
These are the options that you would specify to an SAS OPTIONS
directive.
Working Directory. Name of the working directory on all the processing
nodes executing the SAS application. All relative pathnames in the SAS
code are relative to this pathname.
Advanced Tab
This tab allows you to specify the following:
Execution Mode. The stage can execute in parallel mode or
sequential mode. In parallel mode the input data is processed by
the available nodes as specified in the Configuration file, and by
any node constraints specified on the Advanced tab. In Sequential
mode the entire data set is processed by the conductor node.
Preserve partitioning. This is Propagate by default. It adopts Set
or Clear from the previous stage. You can explicitly select Set or
Clear. Select Set to request that the next stage in the job should
attempt to maintain the partitioning.
Node pool and resource constraints. Select this option to constrain
parallel execution to the node pool or pools and/or resource pool
or pools specified in the grid. The grid allows you to make choices
from drop-down lists populated from the Configuration file.
Node map constraint. Select this option to constrain parallel
execution to the nodes in a defined node map. You can define a
node map by typing node numbers into the text box or by clicking
the browse button to open the Available Nodes dialog box and
selecting nodes from there. You are effectively defining a new node
pool for this stage (in addition to any node pools defined in the
Configuration file).
Link Ordering
This tab allows you to specify how input links and output links are
numbered. This is important when you are specifying Input Link Number
and Output Link Number properties.
By default the first link added will be link 1, the second link 2 and so on.
Select a link and use the arrow buttons to change its position.
Inputs Page
The Inputs page allows you to specify details about the incoming data
sets. There can be multiple inputs to the SAS stage.
The General tab allows you to specify an optional description of the input
link. The Partitioning tab allows you to specify how incoming data is
partitioned before being passed to the SAS code. The Columns tab specifies the column definitions of incoming data.
Details about SAS stage partitioning are given in the following section. See
Chapter 3, Stage Editors, for a general description of the other tabs.
Hash. The records are hashed into partitions based on the value of
a key column or columns selected from the Available list.
Modulus. The records are partitioned using a modulus function on
the key column selected from the Available list. This is commonly
used to partition on tag fields.
Random. The records are partitioned randomly, based on the
output of a random number generator.
Round Robin. The records are partitioned on a round robin basis
as they enter the stage.
Same. Preserves the partitioning already in place.
DB2. Replicates the DB2 partitioning method of a specific DB2
table. Requires extra properties to be set. Access these properties
by clicking the properties button.
Range. Divides a data set into approximately equal size partitions
based on one or more partitioning keys. Range partitioning is often
a preprocessing step to performing a total sort on a data set.
Requires extra properties to be set. Access these properties by
clicking the properties button.
The following Collection methods are available:
(Auto). DataStage attempts to work out the best collection method
depending on execution modes of current and preceding stages,
and how many nodes are specified in the Configuration file. This is
the default collection method for SAS stages.
Ordered. Reads all records from the first partition, then all records
from the second partition, and so on.
Round Robin. Reads a record from the first input partition, then
from the second partition, and so on. After reaching the last partition, the operator starts over.
Sort Merge. Reads records in an order based on one or more
columns of the record. This requires you to select a collecting key
column from the Available list.
The Partitioning tab also allows you to specify that data arriving on the
input link should be sorted before being passed to the SAS code. The sort
is always carried out within data partitions. If the stage is partitioning
incoming data the sort occurs after the partitioning. If the stage is
collecting data, the sort occurs before the collection. The availability of
sorting depends on the partitioning method chosen.
Select the check boxes as follows:
Sort. Select this to specify that data coming in on the link should be
sorted. Select the column or columns to sort on from the Available
list.
Stable. Select this if you want to preserve previously sorted data
sets. This is the default.
Unique. Select this to specify that, if multiple records have identical sorting key values, only one record is retained. If stable sort is
also set, the first record is retained.
You can also specify sort direction, case sensitivity, and collating sequence
for each column in the Selected list by selecting it and right-clicking to
invoke the shortcut menu.
Outputs Page
The Outputs page allows you to specify details about data output from the
SAS stage. The SAS stage can have multiple output links. Choose the link
whose details you are viewing from the Output Name drop-down list.
The General tab allows you to specify an optional description of the
output link. The Columns tab specifies the column definitions of incoming
data. The Mapping tab allows you to specify the relationship between the
columns being input to the SAS stage and the Output columns.
Details about SAS stage mapping are given in the following section. See
Chapter 3, Stage Editors, for a general description of the other tabs.
Mapping Tab
For the SAS stage the Mapping tab allows you to specify how the output
columns are derived, that is, how SAS data maps onto them.
The left pane shows the data output from the SAS code. These are read
only and cannot be modified on this tab.
The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it
in by dragging input columns over, or by using the Auto-match facility.
Chapter 49. Specifying Custom Parallel Stages
In addition to the wide range of parallel stage types available, DataStage
allows you to define your own stage types, which you can then use in
parallel jobs.
There are three different types of stage that you can define:
Custom. This allows knowledgeable Orchestrate users to specify
an Orchestrate operator as a DataStage stage. This is then available
to use in DataStage Parallel jobs.
Build. This allows you to design and build your own bespoke
operator as a stage to be included in DataStage Parallel Jobs.
Wrapped. This allows you to specify a UNIX command to be
executed by a DataStage stage. You define a wrapper file that in
turn defines arguments for the UNIX command and inputs and
outputs.
The DataStage Manager provides an interface that allows you to define a
new DataStage Parallel job stage of any of these types. This interface is also
available from the Repository window of the DataStage Designer.
2.
Choose File > New Parallel Stage > Custom from the main menu or
New Parallel Stage > Custom from the shortcut menu. The Stage
Type dialog box appears.
3.
Category. The category that the new stage will be stored in under
the stage types branch of the Repository tree view. Type in or
browse for an existing category or type in the name of a new one.
Parallel Stage type. This indicates the type of new Parallel job
stage you are defining (Custom, Build, or Wrapped). You cannot
change this setting.
Execution Mode. Choose the execution mode. This is the mode
that will appear in the Advanced tab on the stage editor. You can
override this mode for individual instances of the stage as
required, unless you select Parallel only or Sequential only. See
Advanced Tab on page 3-5 for a description of the execution
mode.
Output Mapping. Choose whether the stage has a Mapping tab or
not. A Mapping tab enables the user of the stage to specify how
output columns are derived from the data produced by the stage.
Choose None to specify that output mapping is not performed;
choose Default to accept the default setting that DataStage uses.
Preserve Partitioning. Choose the default setting of the Preserve
Partitioning flag. This is the setting that will appear in the
Advanced tab on the stage editor. You can override this setting for
individual instances of the stage as required. See Advanced Tab
on page 3-5 for a description of the preserve partitioning flag.
Partitioning. Choose the default partitioning method for the stage.
This is the method that will appear in the Inputs Page Partitioning
tab of the stage editor. You can override this method for individual
instances of the stage as required. See Partitioning Tab on
page 3-11 for a description of the partitioning methods.
Collecting. Choose the default collection method for the stage.
This is the method that will appear in the Inputs Page Partitioning
tab of the stage editor. You can override this method for individual
instances of the stage as required. See Partitioning Tab on
page 3-11 for a description of the collection methods.
Operator. Enter the name of the Orchestrate operator that you
want the stage to invoke.
Short Description. Optionally enter a short description of the
stage.
Long Description. Optionally enter a long description of the stage.
4.
Go to the Links page and specify information about the links allowed
to and from the stage you are defining.
Use this to specify the minimum and maximum number of input and
output links that your custom stage can have.
5.
6.
Go to the Properties page. This allows you to specify the options that
the Orchestrate operator requires as properties that appear in the
Stage Properties tab. For custom stages the Properties tab always
appears under the Stage page.
Boolean
Float
Integer
String
Pathname
Input Column
Output Column
The Code for the Build stage is specified in C++. There are a number of
macros available to make the job of coding simpler (see Build Stage
Macros on page 49-16). There are also a number of header files available
containing many useful functions; see Appendix C.
When you have specified the information, and request that the stage is
generated, DataStage generates a number of files and then compiles these
to build an operator which the stage executes. The generated files include:
Header files (ending in .h)
Source files (ending in .C)
Object files (ending in .so)
To define a Build stage from the DataStage Manager:
1.
2.
Choose File > New Parallel Stage > Build from the main menu or
New Parallel Stage > Build from the shortcut menu. The Stage Type
dialog box appears:
3.
Stage type name. This is the name that the stage will be known by
to DataStage. Avoid using the same name as existing stages.
Category. The category that the new stage will be stored in under
the stage types branch. Type in or browse for an existing category
or type in the name of a new one.
Class Name. The name of the C++ class. By default this takes the
name of the stage type.
Parallel Stage type. This indicates the type of new Parallel job
stage you are defining (Custom, Build, or Wrapped). You cannot
change this setting.
Execution mode. Choose the default execution mode. This is the
mode that will appear in the Advanced tab on the stage editor. You
can override this mode for individual instances of the stage as
required, unless you select Parallel only or Sequential only. See
Advanced Tab on page 3-5 for a description of the execution
mode.
Preserve Partitioning. Choose the default setting of the Preserve
Partitioning flag. This is the setting that will appear in the
Advanced tab on the stage editor. You can override this setting for
individual instances of the stage as required. See Advanced Tab
on page 3-5 for a description of the preserve partitioning flag.
Partitioning. Choose the default partitioning method for the stage.
This is the method that will appear in the Inputs Page Partitioning
tab of the stage editor. You can override this method for individual
instances of the stage as required. See Partitioning Tab on
page 3-11 for a description of the partitioning methods.
Collecting. Choose the default collection method for the stage.
This is the method that will appear in the Inputs Page Partitioning
tab of the stage editor. You can override this method for individual
instances of the stage as required. See Partitioning Tab on
page 3-11 for a description of the collection methods.
Operator. The name of the operator that your code is defining and
which will be executed by the DataStage stage. By default this
takes the name of the stage type.
Short Description. Optionally enter a short description of the
stage.
Long Description. Optionally enter a long description of the stage.
4.
5.
Go to the Properties page. This allows you to specify the options that
the Build stage requires as properties that appear in the Stage
Properties tab. For custom stages the Properties tab always appears
under the Stage page.
Boolean
Float
Integer
String
Pathname
Input Column
Output Column
Click on the Build page. The tabs here allow you to define the actual
operation that the stage will perform.
The Interfaces tab enables you to specify details about inputs to and
outputs from the stage, and about automatic transfer of records from
input to output. You specify port details, a port being where a link
connects to the stage. You need a port for each possible input link to
the stage, and a port for each possible output link from the stage.
output buffer and can still be accessed and altered by the code until it
is actually written to the port.
When you have filled in the details in all the pages, click Generate to
generate the stage. A window appears showing you the result of the
build.
The macros available fall into four categories:
Informational
Flow-control
Input and output
Transfer
Informational Macros
Use these macros in your code to determine the number of inputs, outputs,
and transfers as follows:
inputs(). Returns the number of inputs to the stage.
outputs(). Returns the number of outputs from the stage.
transfers(). Returns the number of transfers in the stage.
Flow-Control Macros
Use these macros to override the default behavior of the Per-Record loop
in your stage definition:
endLoop(). Causes the operator to stop looping, following completion of the current loop and after writing any auto outputs for this
loop.
nextLoop(). Causes the operator to immediately skip to the start of
the next loop, without writing any outputs.
failStep(). Causes the operator to return a failed status and terminate the job.
49-17
Transfer Macros
These macros allow you to explicitly control the transfer of individual
records.
Each of the macros takes an argument as follows:
input is the index of the input (0 to n). If you have defined a name
for the input port you can use this in place of the index in the form
portname.portid_.
output is the index of the output (0 to n). If you have defined a
name for the output port you can use this in place of the index in
the form portname.portid_.
index is the index of the transfer (0 to n).
The following macros are available:
doTransfer(index). Performs the transfer specified by index.
doTransfersFrom(input). Performs all transfers from input.
doTransfersTo(output). Performs all transfers to output.
transferAndWriteRecord(output). Performs all transfers and writes
a record for the specified output. Calling this macro is equivalent to
calling the macros doTransfersTo() and writeRecord().
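As an illustration only, the following Per-Record sketch combines the
flow-control and transfer macros. It assumes a single input (index 0) with
Auto Read enabled, a single output (index 0), and one transfer defined
from the input to the output with Auto Transfer disabled; the indexes are
illustrative, not prescriptive.

// Per-Record code sketch (C++, using the build stage macros).
// Assumes: input 0 is Auto Read, transfer 0 moves the current input
// record to output 0, and Auto Write is off for the output.
if (inputDone(0))
{
    // No record was read this time around; stop after this iteration.
    endLoop();
}
else
{
    // Move the record to the output buffer and write it; equivalent
    // to doTransfersTo(0) followed by writeRecord(0).
    transferAndWriteRecord(0);
}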
1.
2.
3.
Loops repeatedly until either all inputs have run out of records, or the
Per-Record code has explicitly invoked endLoop(). In the loop,
performs the following steps:
a.
Reads one record for each input, except where any of the
following is true:
The input has no more records left.
The input has Auto Read set to false.
The holdRecord() macro was called for the input last time
around the loop.
b.
Executes the Per-Record code, which can explicitly read and write
records, perform transfers, and invoke loop-control macros such
as endLoop().
c.
d. Writes one record for each output, except where any of the
following is true:
The output has Auto Write set to false.
The discardRecord() macro was called for the output during the
current loop iteration.
4.
5.
When you actually use your stage in a job, you have to specify meta data
for the links that attach to these ports. For the job to run successfully the
meta data specified for the port and that specified for the link should
match. An exception to this is where you have runtime column propagation enabled for the job. In this case the input link meta data can be a superset of the port meta data and the extra columns will be automatically
propagated.
This method is fine where you process a record from the auto read input
every time around the loop, and then process records from one or more of
the other inputs depending on the results of processing the auto read
record.
Using Inputs with Auto Read Disabled. Your code must explicitly
perform all record reads. You should define Per-Loop code which calls
readRecord() once for each input to start processing. Your Per-record code
should call inputDone() for every input each time round the loop to determine if a record was read on the most recent readRecord(), and if so, call
readRecord() again for that input. When all inputs run out of records, the
Per-Loop code should exit.
This method is intended where you want explicit control over how each
input is treated.
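A minimal sketch of this pattern, assuming two input ports (indexes 0
and 1) with Auto Read disabled, and assuming that inputDone() returns
true once its input is exhausted, might look like this:

// Per-Loop code: prime each input with an initial read.
readRecord(0);
readRecord(1);

// Per-Record code: process whichever inputs still deliver records.
int remaining = 0;
if (!inputDone(0))
{
    // ... process the record read from input 0 here ...
    readRecord(0);    // request the next record from input 0
    remaining++;
}
if (!inputDone(1))
{
    // ... process the record read from input 1 here ...
    readRecord(1);    // request the next record from input 1
    remaining++;
}
if (remaining == 0)
{
    endLoop();        // both inputs are exhausted; stop looping
}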
2.
Details about the stage's creation are supplied on the Creator page:
3.
4.
Details about the single input to Divide are given on the Input sub-tab
of the Interfaces tab. A table definition for the input link is available
to be loaded from the DataStage Repository.
Details about the outputs are given on the Output sub-tab of the Interfaces tab.
Note: When you use the stage in a job, make sure that you use table definitions compatible with the tables defined in the input and output
sub-tabs.
Details about the transfers carried out by the stage are defined on the
Transfer sub-tab of the Interfaces tab.
5.
The code itself is defined on the Logic tab. In this case all the
processing is done in the Per-Record loop and so is entered on the
Per-Record sub-tab.
6.
As this example uses all the compile and build defaults, all that
remains is to click Generate to build the stage.
2.
Choose File > New Parallel Stage > Wrapped from the main menu
or New Parallel Stage > Wrapped from the shortcut menu. The
Stage Type dialog box appears:
3.
5.
Stage Properties tab. For custom stages the Properties tab always
appears under the Stage page.
Boolean
Float
Integer
String
Pathname
Input Column
Output Column
Details about inputs to the stage are defined on the Inputs sub-tab:
Link. The link number; this is assigned for you and is read-only.
When you actually use your stage, links will be assigned in the
order in which you add them. In our example, the first link will be
taken as link 0, the second as link 1, and so on. You can reassign the
links using the stage editor's Link Ordering tab on the General
page.
Table Name. The meta data for the link. You define this by loading
a table definition from the Repository. Type in the name, or browse
for a table definition. Alternatively, you can specify an argument to
the UNIX command which specifies a table definition. In this case,
when the wrapped stage is used in a job design, the designer will
be prompted for an actual table definition to use.
Stream. Here you can specify whether the UNIX command expects
its input on standard in, or another stream, or whether it expects it
in a file. Click on the browse button to open the Wrapped Stream
dialog box.
In the case of a file, you should also specify whether the file to be
read is specified in a command line argument, or by an environment variable.
Details about outputs from the stage are defined on the Outputs subtab:
Link. The link number; this is assigned for you and is read-only.
When you actually use your stage, links will be assigned in the
order in which you add them. In our example, the first link will be
taken as link 0, the second as link 1, and so on. You can reassign the
links using the stage editor's Link Ordering tab on the General
page.
Table Name. The meta data for the link. You define this by loading
a table definition from the Repository. Type in the name, or browse
for a table definition.
Stream. Here you can specify whether the UNIX command will
write its output to standard out, or another stream, or whether it
outputs to a file. Click on the browse button to open the Wrapped
Stream dialog box.
In the case of a file, you should also specify whether the file to be
written is specified in a command line argument, or by an environment variable.
The Environment tab gives information about the environment in
which the command will execute.
When you have filled in the details in all the pages, click Generate to
generate the stage.
The following screen shots show how this stage is defined in DataStage
using the Stage Type dialog box:
1.
First, general details are supplied in the General tab. The argument
defining the second column as the key is included in the command
because this does not vary:
2.
3.
The fact that the sort command expects two files as input is defined
on the Input sub-tab on the Interfaces tab of the Wrapper page.
4.
The fact that the sort command outputs to a file is defined on the
Output sub-tab on the Interfaces tab of the Wrapper page.
Note: When you use the stage in a job, make sure that you use table definitions compatible with the tables defined in the input and output
sub-tabs.
5.
Because all exit codes other than 0 are treated as errors, and because
there are no special environment requirements for this command,
you do not need to alter anything on the Environment tab of the
Wrapped page. All that remains is to click Generate to build the
stage.
Chapter 50. Managing Data Sets
DataStage parallel extender jobs use data sets to store data being operated on in a persistent form. Data sets are operating system files, each
referred to by a descriptor file, usually with the suffix .ds.
You can create and read data sets using the Data Set stage, which is
described in Chapter 6. DataStage also provides a utility for managing
data sets from outside a job. This utility is available from the
DataStage Designer, Manager, and Director clients.
[Figure: structure of a data set. The data set is divided into partitions (Partition 1 through Partition 4); each partition is stored as segments (Segment 1 through Segment 3) held in one or more data files.]
The descriptor file for a data set contains the following information:
Data set header information.
Creation time and date of the data set.
The schema of the data set.
A copy of the configuration file used when the data set was created.
For each segment, the descriptor file contains:
The time and date the segment was added to the data set.
A flag marking the segment as valid or invalid.
Statistical information such as the number of records in the segment
and the number of bytes.
Path names of all data files, on all processing nodes.
This information can be accessed through the Data Set Manager.
2.
3.
Select the data set you want to manage and click OK. The Data Set
Viewer appears. From here you can copy or delete the chosen data set.
You can also view its schema (column definitions) or the data it
contains.
Rows to display. Specify the number of rows of data you want the
data browser to display.
Skip count. Skip the specified number of rows before viewing
data.
Period. Display every Pth record where P is the period. You can
start after records have been skipped by using the Skip property. P
must equal or be greater than 1.
Partitions. Choose between viewing the data in All partitions or
the data in the partition selected from the drop-down list.
Click OK to view the selected data; the Data Viewer window appears.
The new data set will have the same record schema, number of partitions
and contents as the original data set.
Note: You cannot use the UNIX cp command to copy a data set because
DataStage represents a single data set with multiple files.
Chapter 51. DataStage Development Kit (Job Control Interfaces)
DataStage provides a range of methods that enable you to run DataStage
server or parallel jobs directly on the server, without using the DataStage
Director. The methods are:
C/C++ API (the DataStage Development kit)
DataStage BASIC calls
Command line Interface commands (CLI)
DataStage macros
These methods can be used in different situations as follows:
API. Using the API you can build a self-contained program that
can run anywhere on your system, provided that it can connect to a
DataStage server across the network.
BASIC. Programs built using the DataStage BASIC interface can be
run from any DataStage server on the network. You can use this
interface to define jobs that run and control other jobs. The controlling job can be run from the Director client like any other job, or
directly on the server machine from the TCL prompt. (Job
sequences provide another way of producing control jobs; see
DataStage Designer Guide for details.)
CLI. The CLI can be used from the command line of any DataStage
server on the network. Using this method, you can run jobs on
other servers too.
Macros. A set of macros can be used in job designs or in BASIC
programs. These are mostly used to retrieve information about
other jobs.
1. If required, set the server name, user name, and password to use for
connecting to DataStage (DSSetServerParams).
2.
3.
4.
5.
6.
7.
8.
9.
10. Poll for the job or wait for job completion (DSWaitForJob,
DSStopJob, DSGetJobInfo).
11. Unlock the job (DSUnlockJob).
12. Display a summary of the job's log entries (DSFindFirstLogEntry,
DSFindNextLogEntry).
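For illustration, a minimal control program following these steps might
look like the sketch below. The server, project, and job names are placeholders, error handling is abbreviated, and DSJ_RUNNORMAL is assumed
to be the run-mode key for a normal run:

/* Minimal sketch of a DataStage API control program (C/C++). */
#include <stdio.h>
#include "dsapi.h"

int main(void)
{
    DSPROJECT hProject;
    DSJOB hJob;

    /* Set connection details (placeholders, not real credentials). */
    DSSetServerParams("myserver", "myuser", "mypassword");

    hProject = DSOpenProject("myproject");
    if (hProject == NULL)
    {
        printf("DSOpenProject failed: %d\n", DSGetLastError());
        return 1;
    }

    hJob = DSOpenJob(hProject, "myjob");
    if (hJob == NULL)
    {
        printf("DSOpenJob failed: %d\n", DSGetLastError());
        DSCloseProject(hProject);
        return 1;
    }

    /* Lock the job before starting the run. */
    DSLockJob(hJob);

    /* Start the run and wait for it to complete. */
    if (DSRunJob(hJob, DSJ_RUNNORMAL) == DSJE_NOERROR)
        DSWaitForJob(hJob);

    /* Unlock and close; DSCloseJob also unlocks a locked job. */
    DSUnlockJob(hJob);
    DSCloseJob(hJob);
    DSCloseProject(hProject);
    return 0;
}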
1. Write the program, including the dsapi.h header file in all source
modules that use the DataStage API.
2.
Compile the code. Ensure that the WIN32 token is defined. (This
happens automatically in the Microsoft Visual C/C++ compiler
environment.)
3.
Redistributing Applications
If you intend to run your DataStage API application on a computer where
DataStage Server is installed, you do not need to include DataStage API
DLLs or libraries as these are installed as part of DataStage Server.
If you want to run the application from a computer used as a DataStage
client, you should redistribute the following library with your
application:
vmdsapi.dll
If you intend to run the program from a computer that has neither
DataStage Server nor any DataStage client installed, in addition to the
library mentioned above, you should also redistribute the following:
uvclnt32.dll
unirpc32.dll
You should locate these files where they will be in the search path of any
user who uses the application, for example, in the
%SystemRoot%\System32 directory.
API Functions
This section details the functions provided in the DataStage API. These
functions are described in alphabetical order. The following table briefly
describes the functions categorized by usage:
Usage                      Function              Description
Accessing projects         DSCloseProject        Closes a project.
                           DSGetProjectList      Obtains a list of all projects on the host system.
                           DSGetProjectInfo      Obtains a list of jobs in a project.
                           DSOpenProject         Opens a project.
                           DSSetServerParams     Sets the server name, user name, and password to use when connecting to DataStage.
Accessing jobs             DSCloseJob            Closes a job.
                           DSGetJobInfo          Retrieves information about the status of a job.
                           DSLockJob             Locks a job.
                           DSOpenJob             Opens a job.
                           DSRunJob              Runs a job.
                           DSStopJob             Stops a running job.
                           DSUnlockJob           Unlocks a job.
                           DSWaitForJob          Waits for a job run to complete.
Accessing job parameters   DSGetParamInfo        Retrieves information about a job parameter.
                           DSSetJobLimit         Sets row or warning limits for a job run.
                           DSSetParam            Sets job parameter values.
Accessing stages           DSGetStageInfo        Obtains information about a stage within a job.
Accessing links            DSGetLinkInfo         Retrieves information about a link of an active stage.
Accessing log entries      DSFindFirstLogEntry   Retrieves the first log entry that meets the specified criteria.
                           DSFindNextLogEntry    Retrieves the next log entry from the cache.
                           DSGetLogEntry         Retrieves detailed information about a specific log entry.
                           DSGetNewestLogId      Obtains the identifier of the newest entry in the job log.
                           DSLogEvent            Adds a new entry to a job log file.
Handling errors            DSGetLastError        Returns the calling thread's last error code value.
                           DSGetLastErrorMsg     Retrieves the text of the last reported error.
DSCloseJob
Closes a job that was opened using DSOpenJob.
Syntax
int DSCloseJob(
DSJOB JobHandle
);
Parameter
JobHandle is the value returned from DSOpenJob.
Return Values
If the function succeeds, the return value is DSJE_NOERROR.
If the function fails, the return value is:
DSJE_BADHANDLE    Invalid JobHandle.
Remarks
If the job is locked when DSCloseJob is called, it is unlocked.
If the job is running when DSCloseJob is called, the job is allowed to
finish, and the function returns a value of DSJE_NOERROR
immediately.
DSCloseProject
Closes a project that was opened using the DSOpenProject function.
Syntax
int DSCloseProject(
DSPROJECT ProjectHandle
);
Parameter
ProjectHandle is the value returned from DSOpenProject.
Return Value
This function always returns a value of DSJE_NOERROR.
Remarks
Any open jobs in the project are closed, running jobs are allowed to
finish, and the function returns immediately.
DSFindFirstLogEntry
Retrieves all the log entries that meet the specified criteria, and writes the first
entry to a data structure. Subsequent log entries can then be read using the
DSFindNextLogEntry function.
Syntax
int DSFindFirstLogEntry(
DSJOB JobHandle,
int EventType,
time_t StartTime,
time_t EndTime,
int MaxNumber,
DSLOGEVENT *Event
);
Parameters
JobHandle is the value returned from DSOpenJob.
EventType is one of the following keys:
DSJ_LOGINFO        Information
DSJ_LOGWARNING     Warning
DSJ_LOGFATAL       Fatal
DSJ_LOGREJECT
DSJ_LOGSTARTED     Job started
DSJ_LOGRESET       Job reset
DSJ_LOGBATCH       Batch control
DSJ_LOGOTHER
DSJ_LOGANY
EndTime limits the returned log events to those that occurred before
the specified date and time. Set this value to 0 to return all entries up
to the most recent.
MaxNumber specifies the maximum number of log entries to retrieve,
starting from the latest.
Event is a pointer to a data structure to use to hold the first retrieved
log entry.
Return Values
If the function succeeds, the return value is DSJE_NOERROR, and
summary details of the first log entry are written to Event.
If the function fails, the return value is one of the following:
Token             Description
DSJE_NOMORE
DSJE_NO_MEMORY
DSJE_BADHANDLE    Invalid JobHandle.
DSJE_BADTYPE      Invalid EventType.
DSJE_BADTIME      Invalid StartTime or EndTime.
DSJE_BADVALUE     Invalid MaxNumber.
Remarks
The retrieved log entries are cached for retrieval by subsequent calls
to DSFindNextLogEntry. Any cached log entries that are not
processed by a call to DSFindNextLogEntry are discarded at the next
DSFindFirstLogEntry call (for any job), or when the project is closed.
Note: The log entries are cached by project handle. Multiple threads
using the same open project handle must coordinate access to
DSFindFirstLogEntry and DSFindNextLogEntry.
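For illustration, a typical retrieval loop might look like this sketch, where
hJob is a handle returned by DSOpenJob and passing 0 for both times is
assumed to disable time filtering:

/* Sketch: summarize up to 100 of the most recent warning entries. */
DSLOGEVENT event;
int rc = DSFindFirstLogEntry(hJob, DSJ_LOGWARNING, 0, 0, 100, &event);
while (rc == DSJE_NOERROR)
{
    /* ... examine the summary details held in event ... */
    rc = DSFindNextLogEntry(hJob, &event);  /* DSJE_NOMORE ends the loop */
}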
DSFindNextLogEntry
Retrieves the next log entry from the cache.
Syntax
int DSFindNextLogEntry(
DSJOB JobHandle,
DSLOGEVENT *Event
);
Parameters
JobHandle is the value returned from DSOpenJob.
Event is a pointer to a data structure to use to hold the next log entry.
Return Values
If the function succeeds, the return value is DSJE_NOERROR and
summary details of the next available log entry are written to Event.
If the function fails, the return value is one of the following:
Token                Description
DSJE_NOMORE
DSJE_SERVER_ERROR
Remarks
This function retrieves the next log entry from the cache of entries
produced by a call to DSFindFirstLogEntry.
Note: The log entries are cached by project handle. Multiple threads
using the same open project handle must coordinate access to
DSFindFirstLogEntry and DSFindNextLogEntry.
DSGetJobInfo
Retrieves information about the status of a job.
Syntax
int DSGetJobInfo(
DSJOB JobHandle,
int InfoType,
DSJOBINFO *ReturnInfo
);
Parameters
JobHandle is the value returned from DSOpenJob.
InfoType is a key indicating the information to be returned and can
have any of the following values:
DSJ_JOBSTATUS
DSJ_JOBNAME
DSJ_JOBCONTROLLER
DSJ_JOBWAVENO
DSJ_USERSTATUS
DSJ_STAGELIST
DSJ_JOBINTERIMSTATUS
DSJ_PARAMLIST
DSJ_JOBCONTROL
DSJ_JOBPID
DSJ_JOBLASTTIMESTAMP
DSJ_JOBINVOCATIONS
DSJ_JOBINVOCATIONID
Return Values
If the function succeeds, the return value is DSJE_NOERROR.
If the function fails, the return value is one of the following:
Token                 Description
DSJE_NOT_AVAILABLE
DSJE_BADHANDLE        Invalid JobHandle.
DSJE_BADTYPE          Invalid InfoType.
Remarks
For controlled jobs, this function can be used either before or after a
call to DSRunJob.
DSGetLastError
Returns the calling thread's last error code value.
Syntax
int DSGetLastError(void);
Return Values
The return value is the last error code value. The Return Values
section of each reference page notes the conditions under which the
function sets the last error code.
Remarks
Use DSGetLastError immediately after any function whose return
value on failure might contain useful data, otherwise a later,
successful function might reset the value back to 0 (DSJE_NOERROR).
Note: Multiple threads do not overwrite each other's error codes.
DSGetLastErrorMsg
Retrieves the text of the last reported error from the DataStage server.
Syntax
char *DSGetLastErrorMsg(
DSPROJECT ProjectHandle
);
Parameter
ProjectHandle is either the value returned from DSOpenProject or
NULL.
Return Values
The return value is a pointer to a series of null-terminated strings, one
for each line of the error message associated with the last error generated by the DataStage Server in response to a DataStage API function
call. Use DSGetLastError to determine what the error number is.
The following example shows the buffer contents with <null> representing the terminating NULL character:
line1<null>line2<null>line3<null><null>
Remarks
If ProjectHandle is NULL, this function retrieves the error message
associated with the last call to DSOpenProject or DSGetProjectList,
otherwise it returns the last message associated with the specified
project.
The error text is cleared following a call to DSGetLastErrorMsg.
Note: The text retrieved by a call to DSGetLastErrorMsg relates to
the last error generated by the server and not necessarily the
last error reported back to a thread using DataStage API.
Multiple threads using DataStage API must cooperate in order
to obtain the correct error message text.
DSGetLinkInfo
Retrieves information relating to a specific link of the specified active stage of a job.
Syntax
int DSGetLinkInfo(
DSJOB JobHandle,
char *StageName,
char *LinkName,
int InfoType,
DSLINKINFO *ReturnInfo
);
Parameters
JobHandle is the value returned from DSOpenJob.
StageName is a pointer to a null-terminated character string specifying
the name of the active stage to be interrogated.
LinkName is a pointer to a null-terminated character string specifying
the name of a link (input or output) attached to the stage.
InfoType is a key indicating the information to be returned and is one
of the following values:
DSJ_LINKLASTERR
DSJ_LINKSQLSTATE
DSJ_LINKDBMSCODE
ReturnInfo is a pointer to a DSLINKINFO data structure where the
requested information is stored. The DSLINKINFO data structure
contains a union with an element for each of the possible return values
from the call to DSGetLinkInfo. For more information, see Data
Structures on page 51-46.
Return Value
If the function succeeds, the return value is DSJE_NOERROR.
If the function fails, the return value is one of the following:
Token                Description
DSJE_NOT_AVAILABLE
DSJE_BADHANDLE       Invalid JobHandle.
DSJE_BADTYPE         Invalid InfoType.
DSJE_BADSTAGE
DSJE_BADLINK
Remarks
This function can be used either before or after a call to DSRunJob.
DSGetLogEntry
Retrieves detailed information about a specific entry in a job log.
Syntax
int DSGetLogEntry(
DSJOB JobHandle,
int EventId,
DSLOGDETAIL *Event
);
Parameters
JobHandle is the value returned from DSOpenJob.
EventId is the identifier for the event to be retrieved; see Remarks.
Event is a pointer to a data structure to hold details of the log entry.
Return Values
If the function succeeds, the return value is DSJE_NOERROR and the
event structure contains the details of the requested event.
If the function fails, the return value is one of the following:
Token                Description
DSJE_BADHANDLE       Invalid JobHandle.
DSJE_SERVER_ERROR
DSJE_BADEVENTID      Invalid EventId.
Remarks
Entries in the log file are numbered sequentially starting from 0. The
latest event ID can be obtained through a call to DSGetNewestLogId.
When a log is cleared, there always remains a single entry saying
when the log was cleared.
DSGetNewestLogId
Obtains the identifier of the newest entry in the job's log.
Syntax
int DSGetNewestLogId(
DSJOB JobHandle,
int EventType
);
Parameters
JobHandle is the value returned from DSOpenJob.
EventType is a key specifying the type of log entry whose identifier you
want to retrieve and can be one of the following:
DSJ_LOGINFO        Information
DSJ_LOGWARNING     Warning
DSJ_LOGFATAL       Fatal
DSJ_LOGREJECT
DSJ_LOGSTARTED     Job started
DSJ_LOGRESET       Job reset
DSJ_LOGOTHER
DSJ_LOGBATCH       Batch control
DSJ_LOGANY
Return Values
If the function succeeds, the return value is the positive identifier of
the most recent entry of the requested type in the job log file.
If the function fails, the return value is -1. Use DSGetLastError to
retrieve one of the following error codes:
Token             Description
DSJE_BADHANDLE    Invalid JobHandle.
DSJE_BADTYPE      Invalid EventType.
Remarks
Use this function to determine the ID of the latest entry in a log file
before starting a job run. Once the job has started or finished, it is then
possible to determine which entries have been added by the job run.
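A sketch of that usage follows; hJob is an open job handle, and the run
itself is elided:

/* Capture the newest log entry ID before the run starts. */
int beforeId = DSGetNewestLogId(hJob, DSJ_LOGANY);

/* ... start the job run and wait for it to finish ... */

/* Fetch every entry added by the run (IDs are sequential). */
int afterId = DSGetNewestLogId(hJob, DSJ_LOGANY);
DSLOGDETAIL detail;
for (int id = beforeId + 1; id <= afterId; id++)
{
    if (DSGetLogEntry(hJob, id, &detail) == DSJE_NOERROR)
    {
        /* ... examine the details of this entry ... */
    }
}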
DSGetParamInfo
Retrieves information about a particular parameter within a job.
Syntax
int DSGetParamInfo(
DSJOB JobHandle,
char *ParamName,
DSPARAMINFO *ReturnInfo
);
Parameters
JobHandle is the value returned from DSOpenJob.
ParamName is a pointer to a null-terminated string specifying the
name of the parameter to be interrogated.
ReturnInfo is a pointer to a DSPARAMINFO data structure where the
requested information is stored. For more information, see Data
Structures on page 51-46.
Return Values
If the function succeeds, the return value is DSJE_NOERROR.
If the function fails, the return value is one of the following:
Token                Description
DSJE_SERVER_ERROR
DSJE_BADHANDLE       Invalid JobHandle.
Remarks
Unlike the other information retrieval functions, DSGetParamInfo
returns all the information relating to the specified item in a single call.
The DSPARAMINFO data structure contains all the information
required to request a new parameter value from a user and partially
validate it. See Data Structures on page 51-46.
This function can be used either before or after a DSRunJob call has
been issued:
If called after a successful call to DSRunJob, the information
retrieved refers to that run of the job.
If called before a call to DSRunJob, the information retrieved
refers to any previous run of the job, and not to any call to
DSSetParam that may have been issued.
DSGetProjectInfo
Obtains a list of jobs in a project.
Syntax
int DSGetProjectInfo(
DSPROJECT ProjectHandle,
int InfoType,
DSPROJECTINFO *ReturnInfo
);
Parameters
ProjectHandle is the value returned from DSOpenProject.
InfoType is a key indicating the information to be returned.
DSJ_JOBLIST
DSJ_PROJECTNAME
DSJ_HOSTNAME
Return Values
If the function succeeds, the return value is DSJE_NOERROR.
If the function fails, the return value is one of the following:
Token                Description
DSJE_NOT_AVAILABLE
DSJE_BADTYPE         Invalid InfoType.
Remarks
The DSPROJECTINFO data structure contains a union with an
element for each of the possible return values from a call to DSGetProjectInfo.
Note: The returned list contains the names of all jobs known to the
project, whether they can be opened or not.
DSGetProjectList
Obtains a list of all projects on the host system.
Syntax
char* DSGetProjectList(void);
Return Values
If the function succeeds, the return value is a pointer to a series of nullterminated strings, one for each project on the host system, ending
with a second null character. The following example shows the buffer
contents with <null> representing the terminating null character:
project1<null>project2<null><null>
If the function fails, the return value is NULL, and the DSGetLastError
function retrieves the following error code:
DSJE_SERVER_ERROR    Unexpected/unknown server error occurred.
Remarks
This function can be called before any other DataStage API function.
Note: DSGetProjectList opens, uses, and closes its own communications link with the server, so it may take some time to retrieve
the project list.
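For illustration, the returned buffer can be walked as in this sketch
(assuming <stdio.h> and <string.h> are included):

char *list = DSGetProjectList();
if (list != NULL)
{
    /* Each name is null-terminated; an empty string ends the series. */
    for (char *p = list; *p != '\0'; p += strlen(p) + 1)
    {
        printf("project: %s\n", p);
    }
}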
DSGetStageInfo
Obtains information about a particular stage within a job.
Syntax
int DSGetStageInfo(
DSJOB JobHandle,
char *StageName,
int InfoType,
DSSTAGEINFO *ReturnInfo
);
Parameters
JobHandle is the value returned from DSOpenJob.
StageName is a pointer to a null-terminated string specifying the name
of the stage to be interrogated.
InfoType is one of the following keys:
DSJ_STAGELASTERR
DSJ_STAGENAME             Stage name.
DSJ_STAGETYPE
DSJ_STAGEINROWNUM
DSJ_LINKLIST
DSJ_VARLIST
DSJ_STAGESTARTTIMESTAMP
DSJ_STAGEENDTIMESTAMP
Return Values
If the function succeeds, the return value is DSJE_NOERROR.
If the function fails, the return value is one of the following:
Token                Description
DSJE_NOT_AVAILABLE
DSJE_BADHANDLE       Invalid JobHandle.
DSJE_BADSTAGE
DSJE_BADTYPE         Invalid InfoType.
Remarks
This function can be used either before or after a DSRunJob function
has been issued.
The DSSTAGEINFO data structure contains a union with an element
for each of the possible return values from the call to DSGetStageInfo.
DSLockJob
Locks a job. This function must be called before setting a job's run parameters or
starting a job run.
Syntax
int DSLockJob(
DSJOB JobHandle
);
Parameter
JobHandle is the value returned from DSOpenJob.
Return Values
If the function succeeds, the return value is DSJE_NOERROR.
If the function fails, the return value is one of the following:
Token            Description
DSJE_BADHANDLE   Invalid JobHandle.
DSJE_INUSE       The job is locked by another process.
Remarks
Locking a job prevents any other process from modifying the job
details or status. This function must be called before any call of DSSetJobLimit, DSSetParam, or DSRunJob.
If you try to lock a job you already have locked, the call succeeds. If
you have the same job open on several DataStage API handles, locking
the job on one handle locks the job on all the handles.
DSLogEvent
Adds a new entry to a job log file.
Syntax
int DSLogEvent(
DSJOB JobHandle,
int EventType,
char *Reserved,
char *Message
);
Parameters
JobHandle is the value returned from DSOpenJob.
EventType is one of the following keys specifying the type of event to
be logged:
This key         Description
DSJ_LOGINFO      Information
DSJ_LOGWARNING   Warning
Return Values
If the function succeeds, the return value is DSJE_NOERROR.
If the function fails, the return value is one of the following:
Token            Description
DSJE_BADHANDLE   Invalid JobHandle.
Remarks
Messages that contain more than one line of text should contain a
newline character (\n) to indicate the end of a line.
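A minimal C sketch of logging an information message (the header name is an assumption; Reserved is assumed to take NULL):

#include "dsapi.h"   /* assumed name of the DataStage API header */

void log_progress(DSJOB job)
{
    /* Reserved parameter: NULL is assumed here */
    int rc = DSLogEvent(job, DSJ_LOGINFO, NULL, "monthly sales complete");
    if (rc != DSJE_NOERROR) {
        /* handle the error, e.g. report rc */
    }
}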
DSOpenJob
Opens a job. This function must be called before any other function that manipulates the job.
Syntax
DSJOB DSOpenJob(
DSPROJECT ProjectHandle,
char *JobName
);
Parameters
ProjectHandle is the value returned from DSOpenProject.
JobName is a pointer to a null-terminated string that specifies the name
of the job that is to be opened. This may be in either of the following
formats:

job            The latest version of the job is used.
job%Reln.n.n   A particular release of the job is used.
Return Values
If the function succeeds, the return value is a handle to the job.
If the function fails, the return value is NULL. Use DSGetLastError to
retrieve one of the following:
Token            Description
DSJE_OPENFAIL    The attempt to open the job failed; perhaps it is not a known job name.
DSJE_NO_MEMORY   Failed to allocate dynamic memory.
Remarks
The DSOpenJob function must be used to return a job handle before
a job can be addressed by any of the DataStage API functions. You can
gain exclusive access to the job by locking it with DSLockJob.
The same job may be opened more than once and each call to
DSOpenJob will return a unique job handle. Each handle must be
separately closed.
DSOpenProject
Opens a project. It must be called before any other DataStage API function, except
DSGetProjectList or DSGetLastError.
Syntax
DSPROJECT DSOpenProject(
char *ProjectName
);
Parameter
ProjectName is a pointer to a null-terminated string that specifies the
name of the project to open.
Return Values
If the function succeeds, the return value is a handle to the project.
If the function fails, the return value is NULL. Use DSGetLastError to
retrieve one of the following:
Token                      Description
DSJE_BAD_VERSION           The DataStage server does not support this version of the DataStage API.
DSJE_INCOMPATIBLE_SERVER   The server version is incompatible with this version of the DataStage API.
DSJE_SERVER_ERROR          Unexpected or unknown server error occurred.
DSJE_BADPROJECT            ProjectName is not a known DataStage project.
DSJE_NO_DATASTAGE          DataStage is not installed on the server system.
DSJE_NOLICENSE             No DataStage license is available for the project.
Remarks
The DSGetProjectList function can return the name of a project that
does not contain valid DataStage jobs, but this is detected when
DSOpenProject is called. A process can only have one project open at
a time.
DSRunJob
Starts a job run.
Syntax
int DSRunJob(
DSJOB JobHandle,
int RunMode
);
Parameters
JobHandle is a value returned from DSOpenJob.
RunMode is a key determining the run mode and should be one of the
following values:
This key          Description
DSJ_RUNNORMAL     Standard job run. This is the default.
DSJ_RUNRESET      Job is to be reset.
DSJ_RUNVALIDATE   Job is to be validated only.
Return Values
If the function succeeds, the return value is DSJE_NOERROR.
If the function fails, the return value is one of the following:
Token               Description
DSJE_BADHANDLE      Invalid JobHandle.
DSJE_BADSTATE       Job is not in the right state (must be compiled, not running).
DSJE_BADTYPE        Invalid RunMode.
DSJE_SERVER_ERROR   Unexpected or unknown server error occurred.
DSJE_NOTLOCKED      The job has not been locked.
Remarks
The job specified by JobHandle must be locked, using DSLockJob,
before the DSRunJob function is called.
If no limits were set by calling DSSetJobLimit, the default limits are
used.
DSSetJobLimit
Sets row or warning limits for a job.
Syntax
int DSSetJobLimit(
DSJOB JobHandle,
int LimitType,
int LimitValue
);
Parameters
JobHandle is a value returned from DSOpenJob.
LimitType is one of the following keys specifying the type of limit:
This key        Description
DSJ_LIMITWARN   Set the job's warning limit.
DSJ_LIMITROWS   Set the job's row limit.
Return Values
If the function succeeds, the return value is DSJE_NOERROR.
If the function fails, the return value is one of the following:
Token               Description
DSJE_BADHANDLE      Invalid JobHandle.
DSJE_BADSTATE       Job is not in the right state (must be compiled, not running).
DSJE_BADTYPE        Invalid LimitType.
DSJE_BADVALUE       Invalid LimitValue.
DSJE_SERVER_ERROR   Unexpected or unknown server error occurred.
DSJE_NOTLOCKED      The job has not been locked.
Remarks
The job specified by JobHandle must be locked, using DSLockJob,
before the DSSetJobLimit function is called.
Any job limits that are not set explicitly before a run will use the
default values. Make two calls to DSSetJobLimit in order to set both
types of limit.
Set the value to 0 to indicate that there should be no limit for the job.
DSSetParam
Sets job parameter values before running a job. Any parameter that is not explicitly
set uses the default value.
Syntax
int DSSetParam(
DSJOB JobHandle,
char *ParamName,
DSPARAM *Param
);
Parameters
JobHandle is the value returned from DSOpenJob.
ParamName is a pointer to a null-terminated string that specifies the
name of the parameter to set.
Param is a pointer to a structure that specifies the name, type, and
value of the parameter to set.
Note: The type specified in Param need not match the type specified
for the parameter in the job definition, but it must be possible
to convert it. For example, if the job defines the parameter as a
string, it can be set by specifying it as an integer. However, it
will cause an error with unpredictable results if the parameter
is defined in the job as an integer and a nonnumeric string is
passed by DSSetParam.
Return Values
If the function succeeds, the return value is DSJE_NOERROR.
If the function fails, the return value is one of the following:
Token               Description
DSJE_BADHANDLE      Invalid JobHandle.
DSJE_BADSTATE       Job is not in the right state (must be compiled, not running).
DSJE_BADPARAM       ParamName is not a known parameter of the job.
DSJE_BADTYPE        Invalid type in Param.
DSJE_BADVALUE       Invalid value in Param.
DSJE_SERVER_ERROR   Unexpected or unknown server error occurred.
DSJE_NOTLOCKED      The job has not been locked.
Remarks
The job specified by JobHandle must be locked, using DSLockJob,
before the DSSetParam function is called.
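A hedged C sketch of setting a string parameter on a locked job (the header name is an assumption; the DSPARAM structure is described later in this chapter):

#include "dsapi.h"   /* assumed name of the DataStage API header */

int set_quarter(DSJOB job)
{
    DSPARAM param;
    param.paramType = DSJ_PARAMTYPE_STRING;
    param.paramValue.pString = "1";   /* converted by the server if needed */
    return DSSetParam(job, "quarter", &param);
}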
DSSetServerParams
Sets the logon parameters to use for opening a project or retrieving a project list.
Syntax
void DSSetServerParams(
char *ServerName,
char *UserName,
char *Password
);
Parameters
ServerName is a pointer to either a null-terminated character string
specifying the name of the server to connect to, or NULL.
UserName is a pointer to either a null-terminated character string specifying the user name to use for the server session, or NULL.
Password is a pointer to either a null-terminated character string specifying the password for the user specified in UserName, or NULL.
Return Values
This function has no return value.
Remarks
By default, DSOpenProject and DSGetProjectList attempt to connect
to a DataStage Server on the same computer as the client process, then
create a server process that runs with the same user identification and
access rights as the client process. DSSetServerParams overrides this
behavior and allows you to specify a different server, user name, and
password.
Calls to DSSetServerParams are not cumulative. All parameter
values, including NULL pointers, are used to set the parameters to be
used on the subsequent DSOpenProject or DSGetProjectList call.
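For example (a sketch; the server name and credentials are placeholders):

#include "dsapi.h"   /* assumed name of the DataStage API header */

int main(void)
{
    /* applies to the next DSOpenProject or DSGetProjectList call */
    DSSetServerParams("dsserver", "dsadm", "secret");
    DSPROJECT project = DSOpenProject("MyProject");
    return (project != NULL) ? 0 : 1;
}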
DSStopJob
Aborts a running job.
Syntax
int DSStopJob(
DSJOB JobHandle
);
Parameter
JobHandle is the value returned from DSOpenJob.
Return Values
If the function succeeds, the return value is DSJE_NOERROR.
If the function fails, the return value is:
DSJE_BADHANDLE
Invalid JobHandle.
Remarks
The DSStopJob function should be used only after a DSRunJob function has been issued. The stop request is sent regardless of the job's current status. To ascertain if the job has stopped, use the DSWaitForJob function or the DSJobStatus macro.
DSUnlockJob
Unlocks a job, preventing any further manipulation of the job's run state and
freeing it for other processes to use.
Syntax
int DSUnlockJob(
DSJOB JobHandle
);
Parameter
JobHandle is the value returned from DSOpenJob.
Return Values
If the function succeeds, the return value is DSJE_NOERROR.
If the function fails, the return value is:
DSJE_BADHANDLE
Invalid JobHandle.
Remarks
The DSUnlockJob function returns immediately without waiting for
the job to finish. Attempting to unlock a job that is not locked does not
cause an error. If you have the same job open on several handles,
unlocking the job on one handle unlocks it on all handles.
DSWaitForJob
Waits for the completion of a job run.
Syntax
int DSWaitForJob(
DSJOB JobHandle
);
Parameter
JobHandle is the value returned from DSOpenJob.
Return Values
If the function succeeds, the return value is DSJE_NOERROR.
If the function fails, the return value is one of the following:
Token            Description
DSJE_BADHANDLE   Invalid JobHandle.
DSJE_WRONGJOB    Job for this JobHandle was not started from a call to DSRunJob by the current process.
DSJE_TIMEOUT     The job appears not to have started after waiting for a reasonable length of time.
Remarks
This function is only valid if the current job has issued a DSRunJob
call on the given JobHandle. It returns if the job was started since the
last DSRunJob, and has since finished. The finishing status can be
found by calling DSGetJobInfo.
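Putting the calls in this chapter together, the following hedged C sketch shows a typical run sequence: open, lock, set a limit, run, wait, read the finishing status, unlock, close. (The header name is an assumption; DSCloseJob and DSGetJobInfo are described elsewhere in this chapter.)

#include <stdio.h>
#include "dsapi.h"   /* assumed name of the DataStage API header */

int run_qsales(DSPROJECT project)
{
    DSJOB job;
    DSJOBINFO info;

    job = DSOpenJob(project, "qsales");
    if (job == NULL)
        return -1;

    DSLockJob(job);                         /* required before set/run calls */
    DSSetJobLimit(job, DSJ_LIMITWARN, 10);  /* stop the run after 10 warnings */
    DSRunJob(job, DSJ_RUNNORMAL);
    DSWaitForJob(job);                      /* block until the run finishes */

    if (DSGetJobInfo(job, DSJ_JOBSTATUS, &info) == DSJE_NOERROR)
        printf("finishing status: %d\n", info.info.jobStatus);

    DSUnlockJob(job);
    DSCloseJob(job);
    return 0;
}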
Data Structures
The DataStage API uses the data structures described in this section to
hold data passed to, or returned from, functions. (See Data Structures,
Result Data, and Threads on page 51-2). The data structures are
summarized below, with full descriptions in the following sections:
This data structure   Holds this information                                         Used by this function
DSJOBINFO             Information about a DataStage job                              DSGetJobInfo
DSLINKINFO            Information about a link to or from an active stage in a job   DSGetLinkInfo
DSLOGDETAIL           Details of a single entry in a job log file                    DSGetLogEntry
DSLOGEVENT            Details of an entry in a job log file                          DSLogEvent, DSFindFirstLogEntry, DSFindNextLogEntry
DSPARAM               The type and value of a job parameter                          DSSetParam
DSPARAMINFO           Further information about a job parameter, such as its         DSGetParamInfo
                      default value and a description
DSPROJECTINFO         Information about a DataStage project                          DSGetProjectInfo
DSSTAGEINFO           Information about an active stage in a job                     DSGetStageInfo
DSJOBINFO
The DSJOBINFO structure represents information values about a DataStage job.
Syntax
typedef struct _DSJOBINFO {
int infoType;
union {
int jobStatus;
char *jobController;
time_t jobStartTime;
int jobWaveNumber;
char *userStatus;
char *paramList;
char *stageList;
char *jobname;
int jobcontrol;
int jobPid;
time_t jobLastTime;
char *jobInvocations;
int jobInterimStatus;
char *jobInvocationid;
} info;
} DSJOBINFO;
Members
infoType is one of the following keys indicating the type of
information:

This key                Returns this information
DSJ_JOBSTATUS           Current status of the job.
DSJ_JOBNAME             Name of the job.
DSJ_JOBCONTROLLER       Name of the job controlling this job.
DSJ_JOBSTARTTIMESTAMP   Date and time when the job started.
DSJ_JOBWAVENO           Wave number of the last or current job run.
DSJ_USERSTATUS          Value, if any, set as the job's user status.
DSJ_PARAMLIST           List of the job's parameter names.
DSJ_STAGELIST           List of stages in the job.
DSJ_JOBCONTROL          Job control status of the job.
DSJ_JOBPID              Process id of the job's controlling process.
DSJ_JOBLASTTIMESTAMP    Date and time when the job last finished.
DSJ_JOBINVOCATIONS      List of the job's invocation ids.
DSJ_JOBINTERIMSTATUS    Interim status of the job.
DSJ_JOBINVOCATIONID     Invocation id of the job, if any.
jobStatus is the current status of the job and is returned when infoType
is set to DSJ_JOBSTATUS. Possible values are as follows:

This key           Description
DSJS_RUNNING       Job running.
DSJS_RUNOK         Job finished a normal run with no warnings.
DSJS_RUNWARN       Job finished a normal run with warnings.
DSJS_RUNFAILED     Job failed a normal run.
DSJS_VALOK         Job finished a validation run with no warnings.
DSJS_VALWARN       Job finished a validation run with warnings.
DSJS_VALFAILED     Job failed a validation run.
DSJS_RESET         Job finished a reset run.
DSJS_CRASHED       Job was stopped by some indeterminate action.
DSJS_STOPPED       Job was stopped by operator intervention.
DSJS_NOTRUNNABLE   Job has not been compiled.
DSJS_NOTRUNNING    Any other status.
jobController is the name of the job controlling the job referenced by
JobHandle and is returned when infoType is set to DSJ_JOBCONTROLLER. Note that
this may be several job names, separated by periods, if the job is
controlled by a job which is itself controlled, and so on.
jobStartTime is the date and time when the last or current job run
started and is returned when infoType is set to
DSJ_JOBSTARTTIMESTAMP.
jobWaveNumber is the wave number of the last or current job run and
is returned when infoType is set to DSJ_JOBWAVENO.
userStatus is the value, if any, set by the job as its user defined status,
and is returned when infoType is set to DSJ_USERSTATUS.
paramList is a pointer to a buffer that contains a series of null-terminated strings, one for each job parameter name, that ends with a
second null character. It is returned when infoType is set to
DSJ_PARAMLIST. The following example shows the buffer contents
with <null> representing the terminating null character:
first<null>second<null><null>
stageList is a pointer to a buffer that contains a series of null-terminated strings, one for each stage in the job, that ends with a second null
character. It is returned when infoType is set to DSJ_STAGELIST. The
following example shows the buffer contents with <null> representing the terminating null character:
first<null>second<null><null>
DSLINKINFO
The DSLINKINFO structure represents various information values about a link to
or from an active stage within a DataStage job.
Syntax
typedef struct _DSLINKINFO {
int infoType;
union {
DSLOGDETAIL lastError;
int rowCount;
char *linkName;
char *linkSQLState;
char *linkDBMSCode;
} info;
} DSLINKINFO;
Members
infoType is a key indicating the type of information and is one of the
following values:
This key           Returns this information
DSJ_LINKLASTERR    Last error message reported by the link.
DSJ_LINKNAME       Name of the link.
DSJ_LINKROWCOUNT   Number of rows that have passed down the link.
DSJ_LINKSQLSTATE   SQLSTATE value from the last error reported by the link.
DSJ_LINKDBMSCODE   DBMS error code from the last error reported by the link.
lastError is a data structure containing the error log entry for the last
error message reported from a link and is returned when infoType is set
to DSJ_LINKLASTERR.
rowCount is the number of rows that have been passed down a link so
far and is returned when infoType is set to DSJ_LINKROWCOUNT.
DSLOGDETAIL
The DSLOGDETAIL structure represents detailed information for a single entry
from a job log file.
Syntax
typedef struct _DSLOGDETAIL {
int eventId;
time_t timestamp;
int type;
char *reserved;
char *fullMessage;
} DSLOGDETAIL;
Members
eventId is a number, 0 or greater, that uniquely identifies the log entry
for the job.
timestamp is the date and time at which the entry was added to the job
log file.
type is a key indicating the type of the event, and is one of the following
values:
This key         Description
DSJ_LOGINFO      Information
DSJ_LOGWARNING   Warning
DSJ_LOGFATAL     Fatal error
DSJ_LOGREJECT    Reject link was active
DSJ_LOGSTARTED   Job started
DSJ_LOGRESET     Job reset
DSJ_LOGBATCH     Batch control
DSJ_LOGOTHER     Any other type of log entry
DSLOGEVENT
The DSLOGEVENT structure represents the summary information for a single
entry from a job's event log.
Syntax
typedef struct _DSLOGEVENT {
int eventId;
time_t timestamp;
int type;
char *message;
} DSLOGEVENT;
Members
eventId is a number, 0 or greater, that uniquely identifies the log entry
for the job.
timestamp is the date and time at which the entry was added to the job
log file.
type is a key indicating the type of the event, and is one of the
following values:
This key         Description
DSJ_LOGINFO      Information
DSJ_LOGWARNING   Warning
DSJ_LOGFATAL     Fatal error
DSJ_LOGREJECT    Reject link was active
DSJ_LOGSTARTED   Job started
DSJ_LOGRESET     Job reset
DSJ_LOGBATCH     Batch control
DSJ_LOGOTHER     Any other type of log entry
DSPARAM
The DSPARAM structure represents information about the type and value of a
DataStage job parameter.
Syntax
typedef struct _DSPARAM {
int paramType;
union {
char *pString;
char *pEncrypt;
int pInt;
float pFloat;
char *pPath;
char *pListValue;
char *pDate;
char *pTime;
} paramValue;
} DSPARAM;
Members
paramType is a key specifying the type of the job parameter. Possible
values are as follows:
This key                  Description
DSJ_PARAMTYPE_STRING      A character string.
DSJ_PARAMTYPE_ENCRYPTED   An encrypted character string (for example, a password).
DSJ_PARAMTYPE_INTEGER     An integer.
DSJ_PARAMTYPE_FLOAT       A floating-point number.
DSJ_PARAMTYPE_PATHNAME    A file system pathname.
DSJ_PARAMTYPE_LIST        A string from an enumerated list of values.
DSJ_PARAMTYPE_DATE        A date in the form YYYY-MM-DD.
DSJ_PARAMTYPE_TIME        A time in the form HH:MM.
DSPARAMINFO
The DSPARAMINFO structure represents information values about a parameter
of a DataStage job.
Syntax
typedef struct _DSPARAMINFO {
DSPARAM defaultValue;
char *helpText;
char *paramPrompt;
int paramType;
DSPARAM desDefaultValue;
char *listValues;
char *desListValues;
int promptAtRun;
} DSPARAMINFO;
Members
defaultValue is the default value, if any, for the parameter.
helpText is a description, if any, for the parameter.
paramPrompt is the prompt, if any, for the parameter.
paramType is a key specifying the type of the job parameter. Possible
values are as follows:
This key                  Description
DSJ_PARAMTYPE_STRING      A character string.
DSJ_PARAMTYPE_ENCRYPTED   An encrypted character string (for example, a password).
DSJ_PARAMTYPE_INTEGER     An integer.
DSJ_PARAMTYPE_FLOAT       A floating-point number.
DSJ_PARAMTYPE_PATHNAME    A file system pathname.
DSJ_PARAMTYPE_LIST        A string from an enumerated list of values.
DSJ_PARAMTYPE_DATE        A date in the form YYYY-MM-DD.
DSJ_PARAMTYPE_TIME        A time in the form HH:MM.
desDefaultValue is the default value set for the parameter by the job's
designer.
Note: Default values can be changed by the DataStage administrator,
so a value may not be the current value for the job.
listValues is a pointer to a buffer that receives a series of null-terminated strings, one for each valid string that can be used as the
parameter value, ending with a second null character as shown in the
following example (<null> represents the terminating null character):
first<null>second<null><null>
desListValues is a pointer to a buffer containing the default list of values
set for the parameter by the job's designer. The buffer contains a series
of null-terminated strings, one for each valid string that can be used as
the parameter value, that ends with a second null character. The
following example shows the buffer contents with <null> representing the terminating null character:
first<null>second<null><null>
Note: Default values can be changed by the DataStage administrator,
so a value may not be the current value for the job.
promptAtRun is either 0 (False) or 1 (True). 1 indicates that the operator
is prompted for a value for this parameter whenever the job is run; 0
indicates that there is no prompting.
DSPROJECTINFO
The DSPROJECTINFO structure represents information values for a DataStage
project.
Syntax
typedef struct _DSPROJECTINFO {
int infoType;
union {
char *jobList;
} info;
} DSPROJECTINFO;
Members
infoType is a key value indicating the type of information to retrieve.
Possible values are as follows:

This key          Returns this information
DSJ_JOBLIST       List of jobs in the project.
DSJ_PROJECTNAME   Name of the project.
DSJ_HOSTNAME      Host name of the server holding the project.
DSSTAGEINFO
The DSSTAGEINFO structure represents information values about an active stage within a DataStage job.
Syntax
typedef struct _DSSTAGEINFO {
int infoType;
union {
DSLOGDETAIL lastError;
char *typeName;
int inRowNum;
char *linkList;
char *stagename;
char *varlist;
char *stageStartTime;
char *stageEndTime;
} info;
} DSSTAGEINFO;
Members
infoType is a key indicating the information to be returned and is one of the
following:
This key                  Returns this information
DSJ_STAGELASTERR          Last error message reported from any link of the stage.
DSJ_STAGENAME             Name of stage.
DSJ_STAGETYPE             Stage type name.
DSJ_STAGEINROWNUM         Primary link's input row number.
DSJ_LINKLIST              List of links in the stage.
DSJ_VARLIST               List of stage variable names.
DSJ_STAGESTARTTIMESTAMP   Date and time when the stage started.
DSJ_STAGEENDTIMESTAMP     Date and time when the stage finished.
lastError is a data structure containing the error message for the last error
(if any) reported from any link of the stage. It is returned when infoType is
set to DSJ_STAGELASTERR.
typeName is the stage type name and is returned when infoType is set to
DSJ_STAGETYPE.
inRowNum is the primary link's input row number and is returned when
infoType is set to DSJ_STAGEINROWNUM.
linkList is a pointer to a buffer that contains a series of null-terminated
strings, one for each link in the stage, ending with a second null character,
as shown in the following example (<null> represents the terminating null
character):
first<null>second<null><null>
Error Codes
The following table lists DataStage API error codes in alphabetical order:
Error Token                Code   Description
DSJE_BADHANDLE             1      Invalid JobHandle.
DSJE_BADLINK               9      LinkName does not refer to a known link for the stage in question.
DSJE_BADNAME               12     Invalid project name.
DSJE_BADPARAM              3      ParamName is not a parameter name in the job.
DSJE_BADPROJECT            1002   ProjectName is not a known DataStage project.
DSJE_BADSTAGE              7      StageName does not refer to a known stage in the job.
DSJE_BADSTATE              2      Job is not in the right state (compiled, not running).
DSJE_BADTIME               13     Invalid StartTime or EndTime value.
DSJE_BADTYPE               5      Information or event type was unrecognized.
DSJE_BAD_VERSION           1008   The DataStage server does not support this version of the DataStage API.
DSJE_BADVALUE              4      Invalid MaxNumber value.
DSJE_DECRYPTERR            15     Failed to decrypt encrypted values.
DSJE_INCOMPATIBLE_SERVER   1009   The server version is incompatible with this version of the DataStage API.
DSJE_JOBDELETED            11     The job has been deleted.
DSJE_JOBLOCKED             10     The job is locked by another process.
DSJE_NOACCESS              16     Cannot get values, default values, or design default values for any job except the current job.
DSJE_NO_DATASTAGE          1003   DataStage is not installed on the server system.
DSJE_NOERROR               0      No DataStage API error has occurred.
DSJE_NO_MEMORY             1005   Failed to allocate dynamic memory.
DSJE_NOMORE                1001   All events matching the filter criteria have been returned.
DSJE_NOT_AVAILABLE         1007   The requested information was not found.
DSJE_NOTINSTAGE            8      Internal server error.
DSJE_OPENFAIL              1004   The attempt to open the job failed; perhaps it is not a known job name.
DSJE_REPERROR              99     General server error.
DSJE_SERVER_ERROR          1006   Unexpected or unknown server error occurred.
DSJE_TIMEOUT               14     The job appears not to have started after waiting for a reasonable length of time.
DSJE_WRONGJOB              6      Job for this JobHandle was not started from a call to DSRunJob by the current process.
The following table lists DataStage API error codes in numerical order:
Code   Error Token                Description
0      DSJE_NOERROR               No DataStage API error has occurred.
1      DSJE_BADHANDLE             Invalid JobHandle.
2      DSJE_BADSTATE              Job is not in the right state (compiled, not running).
3      DSJE_BADPARAM              ParamName is not a parameter name in the job.
4      DSJE_BADVALUE              Invalid MaxNumber value.
5      DSJE_BADTYPE               Information or event type was unrecognized.
6      DSJE_WRONGJOB              Job for this JobHandle was not started from a call to DSRunJob by the current process.
7      DSJE_BADSTAGE              StageName does not refer to a known stage in the job.
8      DSJE_NOTINSTAGE            Internal server error.
9      DSJE_BADLINK               LinkName does not refer to a known link for the stage in question.
10     DSJE_JOBLOCKED             The job is locked by another process.
11     DSJE_JOBDELETED            The job has been deleted.
12     DSJE_BADNAME               Invalid project name.
13     DSJE_BADTIME               Invalid StartTime or EndTime value.
14     DSJE_TIMEOUT               The job appears not to have started after waiting for a reasonable length of time.
15     DSJE_DECRYPTERR            Failed to decrypt encrypted values.
16     DSJE_NOACCESS              Cannot get values, default values, or design default values for any job except the current job.
99     DSJE_REPERROR              General server error.
1001   DSJE_NOMORE                All events matching the filter criteria have been returned.
1002   DSJE_BADPROJECT            ProjectName is not a known DataStage project.
1003   DSJE_NO_DATASTAGE          DataStage is not installed on the server system.
1004   DSJE_OPENFAIL              The attempt to open the job failed; perhaps it is not a known job name.
1005   DSJE_NO_MEMORY             Failed to allocate dynamic memory.
1006   DSJE_SERVER_ERROR          Unexpected or unknown server error occurred.
1007   DSJE_NOT_AVAILABLE         The requested information was not found.
1008   DSJE_BAD_VERSION           The DataStage server does not support this version of the DataStage API.
1009   DSJE_INCOMPATIBLE_SERVER   The server version is incompatible with this version of the DataStage API.
The following table lists some common errors that may be returned from
the lower-level communication layers:
Error Number   Description
39121          The DataStage server license has expired.
39134          The DataStage server user limit has been reached.
80011          Incorrect system name, or invalid user name or password provided.
80019          Password has expired.

DataStage BASIC Interface

The following functions can be used in job control routines, which are defined as part of a job's properties and allow other jobs to be run and controlled from the first job.
DSAttachJob Function
Attaches to a job in order to run it in job control sequence. A handle is returned
which is used for addressing the job. There can only be one handle open for a
particular job at any one time.
Syntax
JobHandle = DSAttachJob (JobName, ErrorMode)
JobHandle is the name of a variable to hold the return value which is
subsequently used by any other function or routine when referring to
the job. Do not assume that this value is an integer.
JobName is a string giving the name of the job to be attached to.
ErrorMode is a value specifying how other routines using the handle
should report errors. It is one of:

DSJ.ERRFATAL     Log a fatal message and abort the controlling job (default).
DSJ.ERRWARNING   Log a warning message but carry on.
DSJ.ERRNONE      No message logged; the caller takes full responsibility for error handling.
Remarks
A job cannot attach to itself.
The JobName parameter can specify either an exact version of the job
in the form job%Reln.n.n, or the latest version of the job in the form
job. If a controlling job is itself released, you will get the latest released
version of job. If the controlling job is a development version, you will
get the latest development version of job.
Example
This is an example of attaching to Release 1 of the job Qsales:
Qsales_handle = DSAttachJob ("Qsales%Rel1",
DSJ.ERRWARN)
DSCheckRoutine Function
Checks if a BASIC routine is cataloged, either in the VOC as a callable item, or in
the catalog space.
Syntax
Found = DSCheckRoutine(RoutineName)
RoutineName is the name of the BASIC routine to check.
Found is @True if RoutineName is findable, or @False otherwise.
Example
rtn$ok = DSCheckRoutine("DSU.DSSendMail")
If(NOT(rtn$ok)) Then
   * error handling here
End
DSDetachJob Function
Gives back a JobHandle acquired by DSAttachJob if no further control of a job is
required (allowing another job to become its controller). It is not strictly necessary
to call this function; any attached jobs will always be detached automatically
when the controlling job finishes.
Syntax
ErrCode = DSDetachJob (JobHandle)
JobHandle is the handle for the job as derived from DSAttachJob.
ErrCode is 0 if DSDetachJob is successful, otherwise it may be the
following:
DSJE.BADHANDLE
Invalid JobHandle.
Example
The following command detaches the handle for the job qsales:
Deterr = DSDetachJob (qsales_handle)
DSExecute Subroutine
Executes a DOS or DataStage Engine command from a before/after subroutine.
Syntax
Call DSExecute (ShellType, Command, Output, SystemReturnCode)
ShellType (input) specifies the type of command you want to execute
and is either NT or UV (for DataStage Engine).
Command (input) is the command to execute. Command should not
prompt for input when it is executed.
Output (output) is any output from the command. Each line of output
is separated by a field mark, @FM. Output is added to the job log file
as an information message.
SystemReturnCode (output) is a code indicating the success of the
command. A value of 0 means the command executed successfully. A
value of 1 (for a DOS command) indicates that the command was not
found. Any other value is a specific exit code from the command.
Remarks
Do not use DSExecute from a transform; the overhead of running a
command for each row processed by a stage will degrade
performance of the job.
DSGetJobInfo Function
Provides a method of obtaining information about a job, which can be used generally as well as for job control. It can refer to the current job or a controlled job,
depending on the value of JobHandle.
Syntax
Result = DSGetJobInfo (JobHandle, InfoType)
JobHandle is the handle for the job as derived from DSAttachJob, or it
may be DSJ.ME to refer to the current job.
InfoType specifies the information required and can be one of:
DSJ.JOBCONTROLLER
DSJ.JOBINVOCATIONS
DSJ.JOBINVOCATIONID
DSJ.JOBNAME
DSJ.JOBSTARTTIMESTAMP
DSJ.JOBSTATUS
DSJ.JOBWAVENO
DSJ.PARAMLIST
DSJ.STAGELIST
DSJ.USERSTATUS
DSJ.JOBINTERIMSTATUS
Result depends on the specified InfoType, as follows:
DSJ.JOBSTATUS Integer. Current status of job overall.
Possible statuses that can be returned are currently divided
into two categories:
Firstly, a job that is in progress is identified by:
DSJS.RESET       Job finishing a reset run.
DSJS.RUNFAILED   Job failed a normal run.
DSJS.RUNNING     Job running. This is the only status that means the job is actually running.

Secondly, jobs that are not running may have the following
statuses:

DSJS.RUNOK       Job finished a normal run with no warnings.
DSJS.RUNWARN     Job finished a normal run with warnings.
DSJS.STOPPED     Job was stopped by operator intervention (cannot tell run type).
DSJS.VALFAILED   Job failed a validation run.
DSJS.VALOK       Job finished a validation run with no warnings.
DSJS.VALWARN     Job finished a validation run with warnings.
DSJ.PARAMLIST. Returns a comma-separated list of parameter names.
DSJ.STAGELIST. Returns a comma-separated list of active
stage names.
DSJ.USERSTATUS String. Whatever the job's last call of
DSSetUserStatus recorded, else the empty string.
DSJ.JOBINTERIMSTATUS. Returns the status of a job after it
has run all stages and controlled jobs, but before it has
attempted to run an after-job subroutine. (Designed to be used
by an after-job subroutine to get the status of the current job).
Result may also return error conditions as follows:
DSJE.BADHANDLE   JobHandle was invalid.
DSJE.BADTYPE     InfoType was unrecognized.
Remarks
When referring to a controlled job, DSGetJobInfo can be used either
before or after a DSRunJob has been issued. Any status returned
following a successful call to DSRunJob is guaranteed to relate to
that run of the job.
Examples
The following command requests the job status of the job qsales:
q_status = DSGetJobInfo(qsales_handle, DSJ.JOBSTATUS)
The following command requests the actual name of the current job:
whatname = DSGetJobInfo (DSJ.ME, DSJ.JOBNAME)
DSGetLinkInfo Function
Provides a method of obtaining information about a link on an active stage, which
can be used generally as well as for job control. This routine may reference either
a controlled job or the current job, depending on the value of JobHandle.
Syntax
Result = DSGetLinkInfo (JobHandle, StageName, LinkName,
InfoType)
JobHandle is the handle for the job as derived from DSAttachJob, or it
can be DSJ.ME to refer to the current job.
StageName is the name of the active stage to be interrogated. May also
be DSJ.ME to refer to the current stage if necessary.
LinkName is the name of a link (input or output) attached to the stage.
May also be DSJ.ME to refer to current link (e.g. when used in a
Transformer expression or transform function called from link code).
InfoType specifies the information required and can be one of:
DSJ.LINKLASTERR
DSJ.LINKNAME
DSJ.LINKROWCOUNT
Result depends on the specified InfoType, as follows:
DSJ.LINKLASTERR String - last error message (if any)
reported from the link in question.
DSJ.LINKNAME String - returns the name of the link, most
useful when used with JobHandle = DSJ.ME and StageName =
DSJ.ME and LinkName = DSJ.ME to discover your own name.
DSJ.LINKROWCOUNT Integer - number of rows that have
passed down a link so far.
Result may also return error conditions as follows:

DSJE.BADHANDLE   JobHandle was invalid.
DSJE.BADTYPE     InfoType was unrecognized.
DSJE.BADSTAGE    StageName does not refer to a known stage in the job.
DSJE.BADLINK     LinkName does not refer to a known link for the stage in question.
Remarks
When referring to a controlled job, DSGetLinkInfo can be used
either before or after a DSRunJob has been issued. Any status
returned following a successful call to DSRunJob is guaranteed to
relate to that run of the job.
Example
The following command requests the number of rows that have
passed down the order_feed link in the loader stage of the job qsales:
link_status = DSGetLinkInfo(qsales_handle, "loader",
"order_feed", DSJ.LINKROWCOUNT)
DSGetLogEntry Function
Reads the full event details given in EventId.
Syntax
EventDetail = DSGetLogEntry (JobHandle, EventId)
JobHandle is the handle for the job as derived from DSAttachJob.
EventId is an integer that identifies the specific log event for which
details are required. This is obtained using the DSGetNewestLogId
function.
EventDetail is a string containing substrings separated by \. The
substrings are as follows:

Substring1        Timestamp in the form YYYY-MM-DD HH:NN:SS.
Substring2        User information.
Substring3        EventType - see DSGetNewestLogId.
Substring4 to n   Event message.

The following errors may be returned:

DSJE.BADHANDLE   Invalid JobHandle.
DSJE.BADVALUE    Invalid EventId.
Example
The following command reads full event details of the log event
identified by LatestLogid into the string LatestEventString:
LatestEventString =
DSGetLogEntry(qsales_handle,latestlogid)
DSGetLogSummary Function
Returns a list of short log event details. The details returned are determined by the
setting of some filters. (Care should be taken with the setting of the filters, otherwise a large amount of information can be returned.)
Syntax
SummaryArray = DSGetLogSummary (JobHandle, EventType,
StartTime, EndTime, MaxNumber)
JobHandle is the handle for the job as derived from DSAttachJob.
EventType is the type of event logged and is one of:

DSJ.LOGINFO      Information message.
DSJ.LOGWARNING   Warning message.
DSJ.LOGFATAL     Fatal error.
DSJ.LOGREJECT    Reject link was active.
DSJ.LOGSTARTED   Job started.
DSJ.LOGRESET     Job reset.
DSJ.LOGANY       Any type of event.

StartTime and EndTime limit the retrieved events to those logged between the two timestamps.

MaxNumber is an integer that specifies the maximum number of events to retrieve.

SummaryArray is a dynamic array of fields, each field comprising the following substrings separated by \:

Substring1        EventId (as per DSGetLogEntry).
Substring2        Timestamp in the form YYYY-MM-DD HH:NN:SS.
Substring3        EventType - see above.
Substring4 to n   Event message.
If any of the following errors are found, they are reported via a fatal
log event:
DSJE.BADHANDLE   Invalid JobHandle.
DSJE.BADTYPE     Invalid EventType.
DSJE.BADTIME     Invalid StartTime or EndTime value.
DSJE.BADVALUE    Invalid MaxNumber.
Example
The following command produces an array of reject link active events
recorded for the qsales job between 18th August 1998, and 18th
September 1998, up to a maximum of MAXREJ entries:
RejEntries = DSGetLogSummary (qsales_handle,
DSJ.LOGREJECT, "1998-08-18 00:00:00", "1998-09-18
00:00:00", MAXREJ)
DSGetNewestLogId Function
Gets the ID of the most recent log event in a particular category, or in any category.
Syntax
EventId = DSGetNewestLogId (JobHandle, EventType)
JobHandle is the handle for the job as derived from DSAttachJob.
EventType is the type of event logged and is one of:

DSJ.LOGINFO      Information message.
DSJ.LOGWARNING   Warning message.
DSJ.LOGFATAL     Fatal error.
DSJ.LOGREJECT    Reject link was active.
DSJ.LOGSTARTED   Job started.
DSJ.LOGRESET     Job reset.
DSJ.LOGANY       Any type of event.
EventId is an integer that identifies the specific log event. EventId can
also be returned as a negative number, in which case it contains an error code
as follows:
DSJE.BADHANDLE Invalid JobHandle.
DSJE.BADTYPE
Invalid EventType.
Example
The following command obtains an ID for the most recent warning
message in the log for the qsales job:
Warnid = DSGetNewestLogId (qsales_handle,
DSJ.LOGWARNING)
DSGetParamInfo Function
Provides a method of obtaining information about a parameter, which can be used
generally as well as for job control. This routine may reference either a controlled
job or the current job, depending on the value of JobHandle.
Syntax
Result = DSGetParamInfo (JobHandle, ParamName, InfoType)
JobHandle is the handle for the job as derived from DSAttachJob, or it
may be DSJ.ME to refer to the current job.
ParamName is the name of the parameter to be interrogated.
InfoType specifies the information required and may be one of:
DSJ.PARAMDEFAULT
DSJ.PARAMHELPTEXT
DSJ.PARAMPROMPT
DSJ.PARAMTYPE
DSJ.PARAMVALUE
DSJ.PARAMDES.DEFAULT
DSJ.PARAMLISTVALUES
DSJ.PARAMDES.LISTVALUES
DSJ.PARAMPROMPT.AT.RUN
Result depends on the specified InfoType, as follows:
DSJ.PARAMDEFAULT String Current default value for the
parameter in question. See also DSJ.PARAMDES.DEFAULT.
DSJ.PARAMHELPTEXT String Help text (if any) for the
parameter in question.
DSJ.PARAMPROMPT String Prompt (if any) for the parameter in question.
DSJ.PARAMTYPE Integer Describes the type of validation
test that should be performed on any value being set for this
parameter. Is one of:
DSJ.PARAMTYPE.STRING
DSJ.PARAMTYPE.ENCRYPTED
DSJ.PARAMTYPE.INTEGER
DSJ.PARAMTYPE.FLOAT (the parameter may contain
periods and E)
DSJ.PARAMTYPE.PATHNAME
DSJ.PARAMTYPE.LIST (should be a string of Tab-separated
strings)
DSJ.PARAMTYPE.DATE (should be a string in form YYYY-MM-DD)
DSJ.PARAMTYPE.TIME (should be a string in form HH:MM)
DSJ.PARAMVALUE String Current value of the parameter
for the running job or the last job run if the job is finished.
DSJ.PARAMDES.DEFAULT String Original default value of
the parameter - may differ from DSJ.PARAMDEFAULT if the
latter has been changed by an administrator since the job was
installed.
DSJ.PARAMLISTVALUES String Tab-separated list of
allowed values for the parameter. See also
DSJ.PARAMDES.LISTVALUES.
DSJ.PARAMDES.LISTVALUES String Original Tab-separated list of allowed values for the parameter - may differ from
DSJ.PARAMLISTVALUES if the latter has been changed by
an administrator since the job was installed.
DSJ.PARAMPROMPT.AT.RUN String 1 means the parameter is to be
prompted for when the job is run; anything else means it is not
(DSJ.PARAMDEFAULT String to be used directly).
Result may also return error conditions as follows:
DSJE.BADHANDLE   JobHandle was invalid.
DSJE.BADPARAM    ParamName is not a parameter name in the job.
DSJE.BADTYPE     InfoType was unrecognized.
Remarks
When referring to a controlled job, DSGetParamInfo can be used
either before or after a DSRunJob has been issued. Any status
returned following a successful call to DSRunJob is guaranteed to
relate to that run of the job.
Example
The following command requests the default value of the quarter
parameter for the qsales job:
Qs_quarter = DSGetParamInfo(qsales_handle, "quarter",
DSJ.PARAMDEFAULT)
DSGetProjectInfo Function
Provides a method of obtaining information about the current project.
Syntax
Result = DSGetProjectInfo (InfoType)
InfoType specifies the information required and can be one of:
DSJ.JOBLIST
DSJ.PROJECTNAME
DSJ.HOSTNAME
Result depends on the specified InfoType, as follows:
DSJ.JOBLIST String - comma-separated list of names of all
jobs known to the project (whether the jobs are currently
attached or not).
DSJ.PROJECTNAME String - name of the current project.
DSJ.HOSTNAME String - the host name of the server holding
the current project.
Result may also return an error condition as follows:

DSJE.BADTYPE   InfoType was unrecognized.
DSGetStageInfo Function
Provides a method of obtaining information about a stage, which can be used
generally as well as for job control. It can refer to the current job, or a controlled job,
depending on the value of JobHandle.
Syntax
Result = DSGetStageInfo (JobHandle, StageName, InfoType)
JobHandle is the handle for the job as derived from DSAttachJob, or it
may be DSJ.ME to refer to the current job.
StageName is the name of the stage to be interrogated. It may also be
DSJ.ME to refer to the current stage if necessary.
InfoType specifies the information required and may be one of:
DSJ.STAGELASTERR
DSJ.STAGENAME
DSJ.STAGETYPE
DSJ.STAGEINROWNUM
DSJ.LINKLIST
DSJ.STAGEVARLIST
Result depends on the specified InfoType, as follows:
DSJ.STAGELASTERR String - last error message (if any)
reported from any link of the stage in question.
DSJ.STAGENAME String - most useful when used with
JobHandle = DSJ.ME and StageName = DSJ.ME to discover
your own name.
DSJ.STAGETYPE String - the stage type name (e.g. "Transformer", "BeforeJob").
DSJ.STAGEINROWNUM Integer - the primary link's input
row number.
DSJ.LINKLIST - comma-separated list of link names in the
stage.
DSJ.STAGEVARLIST - comma-separated list of stage variable
names.
Result may also return error conditions as follows:
DSJE.BADHANDLE    JobHandle was invalid.
DSJE.BADTYPE      InfoType was unrecognized.
DSJE.NOTINSTAGE   StageName was DSJ.ME and the caller is not running within a stage.
DSJE.BADSTAGE     StageName does not refer to a known stage in the job.
Remarks
When referring to a controlled job, DSGetStageInfo can be used
either before or after a DSRunJob has been issued. Any status
returned following a successful call to DSRunJob is guaranteed to
relate to that run of the job.
Example
The following command requests the last error message for the
loader stage of the job qsales:
stage_status = DSGetStageInfo(qsales_handle, "loader",
DSJ.STAGELASTERR)
DSLogEvent Function
Logs an event message to a job other than the current one. (Use DSLogInfo,
DSLogFatal, or DSLogWarn to log an event to the current job.)
Syntax
ErrCode = DSLogEvent (JobHandle, EventType, EventMsg)
JobHandle is the handle for the job as derived from DSAttachJob.
EventType is the type of event logged and is one of:

DSJ.LOGINFO      Information message.
DSJ.LOGWARNING   Warning message.

EventMsg is a string containing the event message.

ErrCode is 0 if DSLogEvent is successful, otherwise it may be one of
the following:

DSJE.BADHANDLE   Invalid JobHandle.
DSJE.BADTYPE     Invalid EventType.
Example
The following command, when included in the msales job, adds the
message "monthly sales complete" to the log for the qsales job:
Logerror = DSLogEvent (qsales_handle, DSJ.LOGINFO,
"monthly sales complete")
DSLogFatal Function
Logs a fatal error message in a job's log file and aborts the job.
Syntax
Call DSLogFatal (Message, CallingProgName)
Message (input) is the fatal error message you want to log. Message is
automatically prefixed with the name of the current stage and the
calling before/after subroutine.
CallingProgName (input) is the name of the before/after subroutine
that calls the DSLogFatal subroutine.
Remarks
DSLogFatal writes the fatal error message to the job log file and
aborts the job. DSLogFatal never returns to the calling before/after
subroutine, so it should be used with caution. If a job stops with a
fatal error, it must be reset using the DataStage Director before it can
be rerun.
In a before/after subroutine, it is better to log a warning message
(using DSLogWarn) and exit with a nonzero error code, which allows
DataStage to stop the job cleanly.
DSLogFatal should not be used in a transform. Use
DSTransformError instead.
Example
Call DSLogFatal("Cannot open file", "MyRoutine")
DSLogInfo Function
Logs an information message in a job's log file.
Syntax
Call DSLogInfo (Message, CallingProgName)
Message (input) is the information message you want to log. Message
is automatically prefixed with the name of the current stage and the
calling program.
CallingProgName (input) is the name of the transform or before/after
subroutine that calls the DSLogInfo subroutine.
Remarks
DSLogInfo writes the message text to the job log file as an
information message and returns to the calling routine or transform.
If DSLogInfo is called during the test phase for a newly created
routine in the DataStage Manager, the two arguments are displayed
in the results window.
Unlimited information messages can be written to the job log file.
However, if a lot of messages are produced the job may run slowly
and the DataStage Director may take some time to display the job log
file.
Example
Call DSLogInfo("Transforming: ":Arg1, "MyTransform")
DSLogToController Function
This routine may be used to put an info message in the log file of the job controlling
this job, if any. If there isn't one, the call is just ignored.
Syntax
Call DSLogToController(MsgString)
MsgString is the text to be logged. The log event is of type
Information.
Remarks
If the current job is not under control, a silent exit is performed.
Example
Call DSLogToController("This is logged to parent")
DSLogWarn Function
Logs a warning message in a job's log file.
Syntax
Call DSLogWarn (Message, CallingProgName)
Message (input) is the warning message you want to log. Message is
automatically prefixed with the name of the current stage and the
calling before/after subroutine.
CallingProgName (input) is the name of the before/after subroutine
that calls the DSLogWarn subroutine.
Remarks
DSLogWarn writes the message to the job log file as a warning and
returns to the calling before/after subroutine. If the job has a warning
limit defined for it, when the number of warnings reaches that limit,
the call does not return and the job is aborted.
DSLogWarn should not be used in a transform. Use
DSTransformError instead.
Example
If InputArg > 100 Then
Call DSLogWarn("Input must be =< 100; received
":InputArg,"MyRoutine")
End Else
* Carry on processing unless the job aborts
End
DSMakeJobReport Function
Generates a string describing the complete status of a valid attached job.
Syntax
ReportText = DSMakeJobReport(JobHandle, ReportLevel,
LineSeparator)
JobHandle is the string as returned from DSAttachJob.
ReportLevel is the number where 0 = basic and 1 = more detailed.
LineSeparator is the string used to separate lines of the report. Special
values recognised are:
"CRLF" => CHAR(13):CHAR(10)
"LF" => CHAR(10)
"CR" => CHAR(13)
The default is CRLF if on Windows NT, else LF.
Remarks
If a bad job handle is given, or any other error is encountered,
information is added to the ReportText.
Example
h$ = DSAttachJob("MyJob", DSJ.ERRNONE)
rpt$ = DSMakeJobReport(h$, 0, "CRLF")
DSMakeMsg Function
Inserts arguments into a message template. Optionally, it will look up a template ID
in the standard DataStage messages file, and use any returned message template
instead of that given to the routine.
Syntax
FullText = DSMakeMsg(Template, ArgList)
FullText is the message with parameters substituted.
Template is the message template, in which %1, %2 etc. are to be
substituted with values from the equivalent position in ArgList. If the
template string starts with a number followed by "\", that is assumed
to be part of a message id to be looked up in the DataStage message
file.
Note: If an argument token is followed by "[E]", the value of that
argument is assumed to be a job control error code, and an
explanation of it will be inserted in place of "[E]". (See the
DSTranslateCode function.)
ArgList is the dynamic array, one field per argument to be substituted.
Remarks
This routine is called from job control code created by the
JobSequence Generator. It is basically an interlude to call
DSRMessage which hides any runtime includes.
It will also perform local job parameter substitution in the message
text. That is, if called from within a job, it looks for substrings such as
"#xyz#" and replaces them with the value of the job parameter named
"xyz".
Example
t$ = DSMakeMsg("Error calling DSAttachJob(%1)<L>%2",
jb$:@FM:DSGetLastErrorMsg())
DSPrepareJob Function
Used to ensure that a compiled job is in the correct state to be run or validated.
Syntax
JobHandle = DSPrepareJob(JobHandle)
JobHandle is the handle, as returned from DSAttachJob(), of the job to
be prepared.
JobHandle is either the original handle or a new one. If returned as 0,
an error occurred and a message is logged.
Example
h$ = DSPrepareJob(h$)
DSRunJob Function
Starts a job running. Note that this call is asynchronous; the request is passed to the
run-time engine, but you are not informed of its progress.
Syntax
ErrCode = DSRunJob (JobHandle, RunMode)
JobHandle is the handle for the job as derived from DSAttachJob.
RunMode is the name of the mode the job is to be run in and is one of:

DSJ.RUNNORMAL     (Default) Standard job run.
DSJ.RUNRESET      Job is to be reset.
DSJ.RUNVALIDATE   Job is to be validated only.

ErrCode is 0 if DSRunJob is successful, otherwise it may be one of
the following:

DSJE.BADHANDLE   Invalid JobHandle.
DSJE.BADSTATE    Job is not in the right state (compiled, not running).
DSJE.BADTYPE     RunMode is not a known mode.
Remarks
If the controlling job is running in validate mode, then any calls of
DSRunJob will act as if RunMode was DSJ.RUNVALIDATE,
regardless of the actual setting.
A job in validate mode will run its JobControl routine (if any) rather
than just check for its existence, as is the case for before/after
routines. This allows you to examine the log of what jobs it started up
in validate mode.
After a call of DSRunJob, the controlled job's handle is unloaded. If
you require to run the same job again, you must use DSDetachJob
and DSAttachJob to set a new handle. Note that you will also need to
use DSWaitForJob, as you cannot attach to a job while it is running.
Example
The following command starts the job qsales in standard mode:
RunErr = DSRunJob(qsales_handle, DSJ.RUNNORMAL)
DSSendMail Function
This routine is an interface to a sendmail program that is assumed to exist somewhere in the search path of the current user (on the server). It hides the different
call interfaces to various sendmail programs, and provides a simple interface for
sending text.
Syntax
Reply = DSSendMail(Parameters)
Parameters is a set of name:value parameters, separated by either a
mark character or "\n".
Currently recognized names (case-insensitive) are:

"From"      Mail address of the sender.
"To"        Mail address of the recipient.
"Subject"   Text for the subject line of the message. Refers to
            the "%subject%" token.
"Body"      Message body. Can be omitted. An empty message will
            be sent. If used, it must be the last parameter, to
            allow for getting multiple lines into the message,
            using "\n" for line breaks. Refers to the "%body%"
            token.
Note: The text of the body may contain the tokens "%report%" or
"%fullreport%" anywhere within it, which will cause a report on
the current job status to be inserted at that point. A full report
contains stage and link information as well as job status.
Reply. Possible replies are:

DSJE.NOERROR      (0) OK.
DSJE.NOPARAM      Parameter name missing; field does not look like name:value.
DSJE.NOTEMPLATE   Cannot find template file.
Remarks
The routine looks for a local file, in the current project directory, with
a well-known name. That is, a template to describe exactly how to
run the local sendmail command.
Example
code =
DSSendMail("From:me@here\nTo:You@there\nSubject:Hi
ya\nBody:Line1\nLine2")
DSSetJobLimit Function
By default a controlled job inherits any row or warning limits from the controlling
job. These can, however, be overridden using the DSSetJobLimit function.
Syntax
ErrCode = DSSetJobLimit (JobHandle, LimitType, LimitValue)
JobHandle is the handle for the job as derived from DSAttachJob.
LimitType is the name of the limit to be applied to the running job and
is one of:

DSJ.LIMITWARN   Job to be stopped after LimitValue warning events.
DSJ.LIMITROWS   Stages to be limited to LimitValue rows.

LimitValue is the value to set the limit to.

ErrCode is 0 if DSSetJobLimit is successful, otherwise it may be one of
the following:

DSJE.BADHANDLE   Invalid JobHandle.
DSJE.BADSTATE    Job is not in the right state (compiled, not running).
DSJE.BADTYPE     LimitType is not a known limiting condition.
DSJE.BADVALUE    LimitValue is not appropriate for the limiting condition type.
Example
The following command sets a limit of 10 warnings on the qsales job
before it is stopped:
LimitErr = DSSetJobLimit(qsales_handle,
DSJ.LIMITWARN, 10)
DSSetParam Function
Specifies job parameter values before running a job. Any parameter not set will be
defaulted.
Syntax
ErrCode = DSSetParam (JobHandle, ParamName, ParamValue)
JobHandle is the handle for the job as derived from DSAttachJob.
ParamName is a string giving the name of the parameter.
ParamValue is a string giving the value for the parameter.
ErrCode is 0 if DSSetParam is successful, otherwise it is one of the
following negative integers:
DSJE.BADHANDLE
Invalid JobHandle.
DSJE.BADSTATE
DSJE.BADPARAM
DSJE.BADVALUE
Example
The following commands set the quarter parameter to 1 and the
startdate parameter to 1997-01-01 for the qsales job:
paramerr = DSSetParam (qsales_handle, "quarter", "1")
paramerr = DSSetParam (qsales_handle, "startdate",
"1997-01-01")
DSSetUserStatus Subroutine
Applies only to the current job, and does not take a JobHandle parameter. It can be
used by any job in either a JobControl or After routine to set a termination code for
interrogation by another job. In fact, the code may be set at any point in the job, and
the last setting is the one that will be picked up at any time. So to be certain of
getting the actual termination code for a job the caller should use DSWaitForJob
and DSGetJobInfo first, checking for a successful finishing status.
Note: This routine is defined as a subroutine not a function because there are no
possible errors.
Syntax
Call DSSetUserStatus (UserStatus)
UserStatus String is any user-defined termination message. The string
will be logged as part of a suitable "Control" event in the calling job's
log, and stored for retrieval by DSGetJobInfo, overwriting any
previous stored string.
This string should not be a negative integer, otherwise it may be
indistinguishable from an internal error in DSGetJobInfo calls.
Example
The following command sets a termination code of "sales job done":
Call DSSetUserStatus("sales job done")
DSStopJob Function
This routine should only be used after a DSRunJob has been issued. It immediately sends a stop request to the run-time engine. The call is asynchronous. If you
need to know that the job has actually stopped, you must call DSWaitForJob or use
the Sleep statement and poll for DSGetJobStatus. Note that the stop request gets
sent regardless of the job's current status.
Syntax
ErrCode = DSStopJob (JobHandle)
JobHandle is the handle for the job as derived from DSAttachJob.
ErrCode is 0 if DSStopJob is successful, otherwise it may be the
following:
DSJE.BADHANDLE
Invalid JobHandle.
Example
The following command requests that the qsales job is stopped:
stoperr = DSStopJob(qsales_handle)
DSTransformError Function
Logs a warning message to a job log file. This function is called from transforms
only.
Syntax
Call DSTransformError (Message, TransformName)
Message (input) is the warning message you want to log. Message is
automatically prefixed with the name of the current stage and the
calling transform.
TransformName (input) is the name of the transform that calls the
DSTransformError subroutine.
Remarks
DSTransformError writes the message (and other information) to the
job log file as a warning and returns to the transform. If the job has a
warning limit defined for it, when the number of warnings reaches
that limit, the call does not return and the job is aborted.
In addition to the warning message, DSTransformError logs the
values of all columns in the current rows for all input and output
links connected to the current stage.
Example
Function MySqrt(Arg1)
If Arg1 < 0 Then
   Call DSTransformError("Negative value:":Arg1, "MySqrt")
   Return("0") ;* transform produces 0 in this case
End
Result = Sqrt(Arg1) ;* else return the square root
Return(Result)
DSTranslateCode Function
Converts a job control status or error code into an explanatory text message.
Syntax
Ans = DSTranslateCode(Code)
Code is:
If Code > 0, it's assumed to be a job status.
If Code < 0, it's assumed to be an error code.
(0 should never be passed in, and will return "no error".)
Remarks
If Code is not recognized, then Ans will report it.
Example
code$ = DSGetLastErrorMsg()
ans$ = DSTranslateCode(code$)
DSWaitForFile Function
Suspends a job until a named file either exists or does not exist.
Syntax
Reply = DSWaitForFile(Parameters)
Parameters is the full path of file to wait on. No check is made as to
whether this is a reasonable path (for example, whether all directories
in the path exist). A path name starting with "-", indicates a flag to
check the non-existence of the path. It is not part of the path name.
Parameters may also end in the form " timeout:NNNN" (or
"timeout=NNNN"). This indicates a non-default time to wait before
giving up. There are several possible formats, case-insensitive:

nnn        number of seconds to wait (from now)
nnnS       ditto
nnnM       number of minutes to wait (from now)
nnnH       number of hours to wait (from now)
nn:nn:nn   wait until this time of day

Reply may be one of the following:

DSJE.NOERROR   (0) OK - file now exists or does not exist, depending on the flag.
DSJE.BADTIME   Unrecognized timeout format.
Examples
Reply = DSWaitForFile("C:\ftp\incoming.txt
timeout:2H")
(wait 7200 seconds for file on C: to exist before it gives up.)
Reply = DSWaitForFile("-incoming.txt timeout=15:00")
DSWaitForJob Function
This function is only valid if the current job has issued a DSRunJob on the given
JobHandle(s). It returns when the job or jobs have started since the last DSRunJob
and have since finished.
Syntax
ErrCode = DSWaitForJob (JobHandle)
JobHandle is the string returned from DSAttachJob. If commas are
contained, it's a comma-delimited set of job handles, representing a
list of jobs that are all to be waited for.
ErrCode is 0 if no error, else possible error values (<0) are:
DSJE.BADHANDLE
Invalid JobHandle.
DSJE.WRONGJOB
ErrCode is >0 => handle of the job that finished from a multi-job wait.
Remarks
DSWaitForJob will wait for either a single job or multiple jobs.
Example
To wait for the return of the qsales job:
WaitErr = DSWaitForJob(qsales_handle)
DataStage Macros

The following macros are provided for obtaining information about the
current job, and the links and stages belonging to the current job:
DSJobStatus
DSJobName
DSJobController
DSJobStartDate
DSJobStartTime
DSJobWaveNo
DSJobInvocations
DSJobInvocationID
DSStageName
DSStageLastErr
DSStageType
DSStageInRowNum
DSStageVarList
DSLinkRowCount
DSLinkLastErr
DSLinkName
For example, to obtain the name of the current job:
MyName = DSJobName
In addition, the following macros are provided to manipulate Transformer stage variables:
DSGetVar(VarName) returns the current value of the named
stage variable. If the current stage does not have a stage variable called VarName, then "" is returned and an error message
is logged. If the named stage variable is defined but has not
been initialized, then "" is returned and an error message is
logged.
DSSetVar(VarName, VarValue) sets the value of the named
stage variable. If the current stage does not have a stage variable called VarName, then an error message is logged.
Command Line Interface

All output from the dsjob command is in plain text without column
headings on lists of things, or any other sort of description. This
enables the command to be used in shell or batch scripts without extra
processing.
The DataStage CLI returns a completion code of 0 to the operating
system upon successful execution, or one of the DataStage API error
codes on failure. See Error Codes on page 51-59.
The Logon Clause

By default, the DataStage CLI connects to the DataStage server on the local
system using the user name and password of the user invoking the command.
You can specify a different server, user name, or password using the logon
clause, which is equivalent to the API DSSetServerParams function. Its syntax
is as follows:

[ -server servername ][ -user username ][ -password password ]
Starting a Job
You can start, stop, validate, and reset jobs using the run option.
dsjob -run
[ -mode [ NORMAL | RESET | VALIDATE ] ]
[ -param name=value ]
[ -warn n ]
[ -rows n ]
[ -wait ]
[ -stop ]
[ -jobstatus ]
[ -userstatus ]
[ -local ]
project job
-mode specifies the type of job run. NORMAL starts a job run, RESET
resets the job, and VALIDATE validates the job. If -mode is not specified, a
normal job run is started.

-param specifies a parameter value to pass to the job. The value is in the
format name=value, where name is the parameter name, and value is the
value to be set. If you use this to pass a value of an environment variable
for a job (as you may do for parallel jobs), you need to quote the environment
variable and its value, for example -param '$APT_CONFIG_FILE=chris.apt',
otherwise the current value of the environment variable will be used.

-warn n sets warning limits to the value specified by n (equivalent to the
DSSetJobLimit function used with DSJ_LIMITWARN specified as the
LimitType parameter).

-rows n sets row limits to the value specified by n (equivalent to the
DSSetJobLimit function used with DSJ_LIMITROWS specified as the
LimitType parameter).

-wait waits for the job to complete (equivalent to the DSWaitForJob
function).

-stop terminates a running job (equivalent to the DSStopJob function).

-jobstatus waits for the job to complete, then returns an exit code derived
from the job status.

-userstatus waits for the job to complete, then returns an exit code derived
from the user status if that status is defined. The user status is a string, and
it is converted to an integer exit code. The exit code 0 indicates that the job
completed without an error, but that the user status string could not be
converted. If a job returns a negative user status value, it is interpreted as
an error.

-local use this when running a DataStage job from within a shell script on
a UNIX server. Provided the script is run in the project directory, the job
will pick up the settings for any environment variables set in the script and
any settings specific to the user environment.
project is the name of the project containing the job.
job is the name of the job.
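For example, the following command (the project name dstage and job name qsales are placeholders) starts a normal run, sets the quarter parameter, and waits for completion:

dsjob -run -mode NORMAL -param quarter=1 -wait dstage qsales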
Stopping a Job
You can stop a job using the stop option.
dsjob -stop project job
-stop terminates a running job (equivalent to the DSStopJob function).
project is the name of the project containing the job.
job is the name of the job.
Listing Projects
The following syntax displays a list of all known projects on the server:
dsjob -lprojects
This syntax is equivalent to the DSGetProjectList function.
Listing Jobs
The following syntax displays a list of all jobs in the specified project:
dsjob -ljobs project
project is the name of the project containing the jobs to list.
This syntax is equivalent to the DSGetProjectInfo function.
Listing Stages
The following syntax displays a list of all stages in a job:
dsjob -lstages project job
project is the name of the project containing job.
job is the name of the job containing the stages to list.
This syntax is equivalent to the DSGetJobInfo function with
DSJ_STAGELIST specified as the InfoType parameter.
Listing Links
The following syntax displays a list of all the links to or from a stage:
dsjob -llinks project job stage
project is the name of the project containing job.
job is the name of the job containing stage.
stage is the name of the stage containing the links to list.
This syntax is equivalent to the DSGetStageInfo function with
DSJ_LINKLIST specified as the InfoType parameter.
Listing Parameters
The following syntax displays a list of all the parameters in a job and their
values:
dsjob -lparams project job
project is the name of the project containing job.
job is the name of the job whose parameters are to be listed.
This syntax is equivalent to the DSGetJobInfo function with
DSJ_PARAMLIST specified as the InfoType parameter.
Retrieving Information
The dsjob command can be used to retrieve and display the available information about specific projects, jobs, stages, or links. The different versions
of the syntax are described in the following sections.
-type type specifies the type of log entry to retrieve. If -type type is not
specified, all the entries are retrieved. type can be one of the following
options:
This option
INFO
Information.
WARNING
Warning.
FATAL
Fatal error.
REJECT
STARTED
Job started.
RESET
Job reset.
BATCH
Batch control.
ANY
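For example, using the log summary option -logsum, fatal error entries for a job could be retrieved with a command of this form (the project and job names are illustrative):
dsjob -logsum -type FATAL dstage myjob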
A
Schemas
Schemas are an alternative way for you to specify column definitions for
the data used by parallel jobs. By default, most parallel job stages take
their meta data from the Columns tab, which contains table definitions,
supplemented where necessary by format information from the Format
tab. For some stages, you can specify a property that causes the stage to
take its meta data from a specified schema file instead. Some stages also
allow you to specify a partial schema. This allows you to describe only
those columns that a particular stage is processing and ignore the rest.
The schema file is a plain text file; this appendix describes its format. A
partial schema has the same format.
Schema Format
A schema contains a record (or row) definition. This describes each
column (or field) that will be encountered within the record, giving
column name and data type. The following is an example record schema:
record (
name:string[255];
address:nullable string[255];
value1:int32;
value2:int32;
date:date)
(The line breaks are there for ease of reading; you would omit these if you
were defining a partial schema. For example,
record(name:string[255];value1:int32;date:date) is a valid
schema.)
The format of each line describing a column is:
column_name:[nullability]datatype;
datatype. This is the data type of the column. This uses the internal
data types as described on page 2-13, not the SQL data types used
on the Columns tab in stage editors.
You can include comments in schema definition files. A comment is
started by a double slash //, and ended by a newline.
The example schema corresponds to an equivalent table definition as specified on the Columns tab of a stage editor.
Date Columns
The following examples show various date column definitions:
record (dateField1:date; )           // single date
record (dateField2[10]:date; )       // 10-element date vector
record (dateField3[]:date; )         // variable-length date vector
record (dateField4:nullable date;)   // nullable date
(See Complex Data Types on page 2-14 for information about vectors.)
Decimal Columns
To define a record field with data type decimal, you must specify the
column's precision, and you may optionally specify its scale, as follows:
column_name:decimal[precision, scale];
where precision is greater than or equal to 1, and scale is greater than or
equal to 0 and less than precision.
If the scale is not specified, it defaults to zero, indicating an integer value.
The following examples show different decimal column definitions:
record (dField1:decimal[12]; )            // 12-digit integer
record (dField2[10]:decimal[15,3]; )      // 10-element decimal vector
record (dField3:nullable decimal[15,3];)  // nullable decimal
Floating-Point Columns
To define floating-point fields, you use the sfloat (single-precision) or
dfloat (double-precision) data type, as in the following examples:
record (aSingle:sfloat; )            // single-precision float
record (aDouble:dfloat; )            // double-precision float
record (aSingle[10]:sfloat; )        // fixed-length vector of sfloat
record (aDouble:nullable dfloat;)    // nullable dfloat
Integer Columns
To define integer fields, you use an 8-, 16-, 32-, or 64-bit integer data type
(signed or unsigned), as shown in the following examples:
record (n:int32;)          // 32-bit signed integer
record (n:nullable int64;) // nullable, 64-bit signed integer
record (n[10]:int16;)      // fixed-length vector of 16-bit signed integers
record (n[]:uint8;)        // variable-length vector of 8-bit unsigned integers
Raw Columns
You can define a record field that is a collection of untyped bytes, of fixed
or variable length. You give the field data type raw. The definition for a
raw field is similar to that of a string field, as shown in the following
examples:
record (var1:raw[];)      // variable-length raw field
record (var2:raw;)        // variable-length raw field; same as raw[]
record (var3:raw[40];)    // fixed-length raw field of 40 bytes
record (var4[5]:raw[40];) // fixed-length vector of fixed-length raw fields
You can specify the maximum number of bytes allowed in the raw field
with the optional property max, as shown in the example below:
record (var7:raw[max=80];)
String Columns
You can define string fields of fixed or variable length. For variable-length
strings, the string length is stored as part of the string as a hidden integer.
The storage used to hold the string length is not included in the length of
the string.
The following examples show string field definitions:
record (var1:string[];)            // variable-length string
record (var2:string;)              // variable-length string; same as string[]
record (var3:string[80];)          // fixed-length string of 80 bytes
record (var4:nullable string[80];) // nullable fixed-length string
record (var5[10]:string;)          // fixed-length vector of strings
record (var6[]:string[80];)        // variable-length vector of fixed-length strings
You can specify the maximum length of a string with the optional property
max, as shown in the example below:
record (var7:string[max=80];)
Time Columns
By default, the smallest unit of measure for a time value is seconds, but
you can instead use microseconds with the [microseconds] option. The
following are examples of time field definitions:
record (tField1:time; ) // single time field in seconds
record (tField2:time[microseconds];) // time field in microseconds
record (tField3[]:time; ) // variable-length time vector
record (tField4:nullable time;) // nullable time
Timestamp Columns
Timestamp fields contain both time and date information. In the time
portion, you can use seconds (the default) or microseconds for the smallest
unit of measure. For example:
record (tsField1:timestamp;)               // single timestamp field in seconds
record (tsField2:timestamp[microseconds];) // timestamp in microseconds
Vectors
Many of the previous examples show how to define a vector of a particular
data type. You define a vector field by following the column name with
brackets []. For a variable-length vector, you leave the brackets empty,
and for a fixed-length vector you put the number of vector elements in the
brackets. For example, to define a variable-length vector of int32, you
would use a field definition such as the following one:
intVec[]:int32;
You can define a vector of any data type, including string and raw. You
cannot define a vector of a vector or of a tagged type. You can, however,
define a vector of type subrecord, and that subrecord can itself include a
tagged column or a vector.
You can make vector elements nullable, as shown in the following record
definition:
record (vInt[]:nullable int32;
vDate[6]:nullable date; )
Subrecords
Record schemas let you define nested field definitions, or subrecords, by
specifying the type subrec. A subrecord itself does not define any storage;
instead, the fields of the subrecord define storage. The fields in a subrecord
can be of any data type, including tagged.
The following example defines a record that contains a subrecord:
record ( intField:int16;
    aSubrec:subrec (
        aField:int16;
        bField:sfloat; );
)
In this example, the record contains a 16-bit integer field, intField, and a
subrecord field, aSubrec. The subrecord includes two fields: a 16-bit
integer and a single-precision float.
Subrecord columns of value data types (including string and raw) can be
nullable, and subrecord columns of subrec or vector types can have
nullable elements. A subrecord itself cannot be nullable.
You can define vectors (fixed-length or variable-length) of subrecords. The
following example shows a definition of a fixed-length vector of
subrecords:
record (aSubrec[10]:subrec (
aField:int16;
bField:sfloat; );
)
You can also nest subrecords and vectors of subrecords, to any depth of
nesting. The following example defines a fixed-length vector of
subrecords, aSubrec, that contains a nested variable-length vector of
subrecords, cSubrec:
record (aSubrec[10]:subrec (
aField:int16;
bField:sfloat;
cSubrec[]:subrec (
cAField:uint8;
cBField:dfloat; );
);
)
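A subrecord can also include a tagged column, as in the following example (the field names are illustrative):
record (aSubrec:subrec (
    aField:string;
    bField:int32;
    cField:tagged (
        dField:int16;
        eField:sfloat; );
    );
)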
In this example, aSubrec has a string field, an int32 field, and a tagged
aggregate field. The tagged aggregate field cField can have either of two
data types, int16 or sfloat.
Tagged Columns
You can use schemas to define tagged columns (similar to C unions), with
the data type tagged. Defining a record with a tagged type allows each
record of a data set to have a different data type for the tagged column.
When your application writes to a field in a tagged column, DataStage
updates the tag, which identifies it as having the type of the column that
is referenced.
The data type of a tagged column can be any data type except tagged
or subrec. For example, the following record defines a tagged field:
record ( tagField:tagged (
    aField:string;
    bField:int32;
    cField:sfloat; );
)
In the example above, the data type of tagField can be one of the following:
a variable-length string, an int32, or an sfloat.
Partial Schemas
Some parallel job stages allow you to use a partial schema. This means that
you need only define column definitions for those columns that you are
actually going to operate on. The stages that allow you to do this are file
stages that have a Format tab.
You specify a partial schema using the Intact property on the Format tab
of the stage. Positional information for the columns in a partial schema is
derived from other settings on the Format tab.
Note: If you wish to use runtime column propagation to propagate those
columns you have not defined through the job, then you will also
have to define the full schema using the schema file property on
the input or output page of the stage.
You need to describe the record and the individual columns. Describe the
record as follows:
intact. This property specifies that the schema being defined is a
partial one. You can optionally specify a name for the intact schema
here as well.
record_length. The length of the record, including record delimiter
characters.
record_delim_string. String giving the record delimiter as an
ASCII string in single quotes. (For a single-character delimiter, use
record_delim and supply a single ASCII character in single
quotes.)
Describe the columns as follows:
position. The position of the starting character within the record.
delim. The column's trailing delimiter; it can be any of the following:
ws to skip all standard whitespace characters (space, tab, and
newline) trailing after a field.
end to specify that the last field in the record is composed of all
remaining bytes until the end of the record.
none to specify that fields have no delimiter.
null to specify that the delimiter is the ASCII null character.
ASCII_char specifies a single ASCII delimiter. Enclose ASCII_char
in single quotation marks. (To specify multiple ASCII characters,
use delim_string followed by the string in single quotes.)
text specifies the data representation type of a field as being text
rather than binary. Data is formatted as text by default. (Specify
binary if data is binary.)
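Putting these together, a partial schema definition might look like the following sketch. The intact name, column names, positions, and lengths are illustrative, and the brace-enclosed property syntax is shown as an assumed form:
record {intact=details, record_delim_string='\r\n'}
    ( name:string[20] {position=0, delim=none, text};
      value:int32 {position=20, delim=ws, text};
    )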
B
Functions
This appendix describes the functions that are available from the expression editor under the Function menu item. You would typically use
these functions when defining a column derivation in a Transformer stage.
The functions are described by category.
Date and Time Functions
The functions available in the Date and Time category are listed below
(square brackets indicate an argument is optional). Each entry shows the
function name, its arguments, and its output, with a description in
parentheses where one applies:
DateFromDaysSince: number (int32), [baseline date] -> date
DateFromJulianDay: juliandate (uint32) -> date
DaysSinceFromDate: source_date, given_date -> days since (int32)
HoursFromTime: time -> hours (int8)
JulianDayFromDate: date -> julian date (uint32)
MicroSecondsFromTime: time -> microseconds (int32)
MinutesFromTime: time -> minutes (int8)
MonthDayFromDate: date -> day (int8)
MonthFromDate: date -> month number (int8)
NextWeekdayFromDate: source date, day of week (string) -> date
PreviousWeekdayFromDate: source date, day of week (string) -> date
SecondsFromTime: time -> seconds (dfloat)
SecondsSinceFromTimestamp: timestamp, base timestamp -> seconds (dfloat)
TimeDate: (no arguments) -> system time and date (string)
TimeFromMidnightSeconds: seconds (dfloat) -> time
TimestampFromDateTime: date, time -> timestamp (returns a timestamp from the given date and time)
TimestampFromSecondsSince: seconds (dfloat), [base timestamp] -> timestamp
TimestampFromTimet: timet (int32) -> timestamp (returns a timestamp from the given UNIX time_t value)
TimetFromTimestamp: timestamp -> timet (int32)
WeekdayFromDate: date -> day (int8)
YeardayFromDate: date -> day (int16)
YearFromDate: date -> year (int16)
YearweekFromDate: date -> week (int16)
Date, Time, and Timestamp functions that take a format string (for
example, TimeToString(time, format)) require specific format components.
For a date, the format components are:
%dd two-digit day
%mm two-digit month
%yy two-digit year (from 1900)
%year_cutoffyy two-digit year from year_cutoff (e.g. %2000yy)
%yyyy four-digit year
%ddd three-digit day of the year
The default date format is %yyyy-%mm-%dd.
For a time, the format components are:
%hh two-digit hour
%nn two-digit minutes
%ss two-digit seconds
The default time format is %hh:%nn:%ss.
A timestamp can include the components for date and time above. The
default timestamp format is %yyyy-%mm-%dd %hh:%nn:%ss.
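For example, a Transformer stage derivation might use these functions and format strings as follows (the link and column names are illustrative):
StringToDate(inLink.orderDate, "%yyyy-%mm-%dd")
TimestampToString(inLink.eventTs, "%dd/%mm/%yyyy %hh:%nn:%ss")
The first expression converts a string column to a date using the default date components; the second formats a timestamp back into a string with an explicit format.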
Logical Functions
The functions available in the Logical category are listed below (square
brackets indicate an argument is optional):
Not: expression -> complement (int8)
Mathematical Functions
The functions available in the Mathematical category are listed below
(square brackets indicate an argument is optional). Each entry shows the
function name, its arguments, and its output, with a description in
parentheses where one applies:
Abs: number (int32) -> result (dfloat)
Acos: number (dfloat) -> result (dfloat)
Asin: number (dfloat) -> result (dfloat)
Atan: number (dfloat) -> result (dfloat)
Ceil: number (decimal) -> result (int32)
Cos: number (dfloat) -> result (dfloat)
Cosh: number (dfloat) -> result (dfloat)
Div: dividend (dfloat), divisor (dfloat) -> result (dfloat)
Exp: number (dfloat) -> result (dfloat)
Fabs: number (dfloat) -> result (dfloat)
Floor: number (decimal) -> result (int32)
Ldexp: mantissa (dfloat), exponent (int32) -> result (dfloat) (calculates a number from an exponent and mantissa)
Llabs: number -> result (int64)
Ln: number (dfloat) -> result (dfloat)
Log10: number (dfloat) -> result (dfloat)
Max: number 1 (int32), number 2 (int32) -> result (int32)
Min: number 1 (int32), number 2 (int32) -> result (int32)
Mod: dividend (int32), divisor (int32) -> result (int32)
Neg: number (dfloat) -> result (dfloat) (negates a number)
Pwr: expression (dfloat), power (dfloat) -> result (dfloat)
Rand: (no arguments) -> result (uint32)
Random: (no arguments) -> result (uint32) (returns a random number between 0 and 2^32 - 1)
Sin: number (dfloat) -> result (dfloat)
Sinh: number (dfloat) -> result (dfloat)
Sqrt: number (dfloat) -> result (dfloat)
Tan: number (dfloat) -> result (dfloat)
Tanh: number (dfloat) -> result (dfloat)
Null Handling Functions
The functions available in the Null Handling category are listed below
(square brackets indicate an argument is optional):
HandleNull: any (column), string (string)
IsDFloatInBandNull: number (dfloat) -> true/false (int8)
IsInt16InBandNull: number (int16) -> true/false (int8)
IsInt32InBandNull: number (int32) -> true/false (int8)
IsInt64InBandNull: number (int64) -> true/false (int8)
IsNotNull: any -> true/false (int8)
IsNull: any -> true/false (int8)
IsSFloatInBandNull: number (dfloat) -> true/false (int8)
IsStringInBandNull: string -> true/false (int8)
MakeNull: any (column), string (string)
SetNull: (no arguments; assigns a null to the target column)
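For example, a Transformer derivation can guard against nulls before using a column value (the link and column names are illustrative):
If IsNull(inLink.middleName) Then "" Else inLink.middleName
This returns an empty string when the incoming column is null, and the column's value otherwise.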
Number Functions
The functions available in the Number category are listed below (square
brackets indicate an argument is optional):
MantissaFromDecimal: number (decimal), [fixzero (int8)] -> result (dfloat)
MantissaFromDFloat: number (dfloat) -> result (dfloat)
Raw Functions
The functions available in the Raw category are listed below (square
brackets indicate an argument is optional):
RawLength: input string (raw) -> result (int32)
String Functions
The functions available in the String category are listed below (square
brackets indicate an argument is optional). Each entry shows the function
name, its arguments, and its output, with a description in parentheses
where one applies:
AlNum: string (string) -> true/false (int8)
Alpha: string (string) -> result (int8) (returns 1 if string is purely alphabetic)
CompactWhiteSpace: string (string) -> result (string)
Compare: string1 (string), string2 (string) -> result (int8)
CompareNoCase: string1 (string), string2 (string) -> result (int8)
CompareNum: string1 (string), string2 (string), length (int16) -> result (int8)
CompareNumNoCase: string1 (string), string2 (string), length (int16) -> result (int8) (caseless comparison of the first n characters of the two strings)
Convert: fromlist (string), tolist (string), expression (string) -> result (string)
Count: string (string), substring (string) -> result (int32)
Dcount: string (string), delimiter (string) -> result (int32)
DownCase: string (string) -> result (string)
DQuote: string (string) -> result (string) (encloses a string in double quotation marks)
Field: string (string), delimiter (string), occurrence (int32), [number (int32)] -> result (string)
Index: string (string), substring (string), occurrence (int32) -> result (int32)
Left: string (string), number (int32) -> result (string) (leftmost n characters of string)
Len: string (string) -> result (int32) (length of string in characters)
Num: string (string) -> result (int8)
PadString: string (string), padlength (int32) -> result (string)
Right: string (string), number (int32) -> result (string)
Space: length (int32) -> result (string) (returns a string of n space characters)
Squote: string (string) -> result (string)
Str: string (string), repeats (int32) -> result (string) (repeats a string)
StripWhiteSpace: string (string) -> result (string)
Trim: string (string), [stripchar (string)], [options (string)] -> result (string)
TrimB: string (string) -> result (string)
TrimF: string (string) -> result (string)
TrimLeadingTrailing: string (string) -> result (string)
Upcase: string (string) -> result (string)
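For example, a derivation might tidy an incoming name column before output (the link and column names are illustrative):
Upcase(Trim(inLink.surname))
This strips leading and trailing spaces from the column value and then converts the result to uppercase.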
Type Conversion Functions
The functions available in the Type Conversion category are listed below
(square brackets indicate an argument is optional). Each entry shows the
function name, its arguments, and its output, with a description in
parentheses where one applies:
DateToString: date, [format (string)] -> result (string)
DecimalToDecimal: decimal (decimal), [type (string)], [packedflag (int8)] -> result (decimal) (returns the given decimal in decimal representation with specified precision and scale)
DecimalToDFloat: number (decimal), [fixzero (int8)] -> result (dfloat)
DecimalToString: number (decimal), [fixzero (int8)], [prefix (string)], [prefixlen (int32)], [suffix (string)], [suffixlen (int32)] -> result (string)
DfloatToDecimal: number (dfloat), [rtype (string)] -> result (decimal) (returns the given dfloat in decimal representation)
IsValid: type (string), format (string) -> result (int8)
StringToDate: date (string), format (string) -> date
StringToDecimal: string (string), [rtype (string)] -> result (decimal)
StringToRaw: string (string) -> result (raw)
StringToTime: string (string), [format (string)] -> time
StringToTimestamp: string (string), [format (string)] -> timestamp (returns a timestamp representation of the given string)
TimestampToDate: timestamp -> date
TimestampToString: timestamp, [format (string)] -> result (string)
TimestampToTime: timestamp -> time
TimeToString: time, [format (string)] -> result (string)
Functions that take an rtype argument use it to specify a rounding type,
which can be one of the following:
ceil. Round the source field toward positive infinity. E.g., 1.4 -> 2, -1.6 -> -1.
floor. Round the source field toward negative infinity. E.g., 1.6 -> 1,
-1.4 -> -2.
round_inf. Round or truncate the source field toward the nearest
representable value, breaking ties by rounding positive values
toward positive infinity and negative values toward negative
infinity. E.g., 1.4 -> 1, 1.5 -> 2, -1.4 -> -1, -1.5 -> -2.
trunc_zero. Discard any fractional digits to the right of the rightmost
fractional digit supported in the destination, regardless of
sign. For example, if the destination is an integer, all fractional
digits are truncated. If the destination is another decimal with a
smaller scale, round or truncate to the scale size of the destination
decimal. E.g., 1.6 -> 1, -1.6 -> -1.
The default is trunc_zero.
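For example, the following derivation converts a dfloat column to a decimal, rounding toward positive infinity (the link and column names are illustrative):
DfloatToDecimal(inLink.amount, "ceil")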
Utility Functions
The functions available in the Utility category are listed below (square
brackets indicate an argument is optional):
GetEnvironment: environment variable (string) -> result (string)
C
Header Files
DataStage comes with a range of header files that you can include in code
when you are defining a Build stage. The following sections list the header
files and the classes and macros that they contain. See the header files
themselves for more details about available functionality.
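As a minimal sketch, code in a Build stage definition might include headers from this list as follows (the particular headers chosen are illustrative):

// Illustrative includes for a Build stage's code sections
#include <apt_framework/operator.h>   // declares APT_Operator
#include <apt_util/errlog.h>          // declares APT_ErrorLog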
apt_framework/config.h
APT_Config
APT_Node
APT_NodeResource
APT_NodeSet
apt_framework/cursor.h
APT_CursorBase
APT_InputCursor
APT_OutputCursor
apt_framework/dataset.h
APT_DataSet
apt_framework/fieldsel.h
APT_FieldSelector
apt_framework/fifocon.h
APT_FifoConnection
apt_framework/gsubproc.h
APT_GeneralSubprocessConnection
APT_GeneralSubprocessOperator
apt_framework/impexp/impexp_function.h
APT_GFImportExport
apt_framework/operator.h
APT_Operator
apt_framework/partitioner.h
APT_Partitioner
APT_RawField
apt_framework/schema.h
APT_Schema
APT_SchemaAggregate
APT_SchemaField
APT_SchemaFieldList
APT_SchemaLengthSpec
apt_framework/step.h
APT_Step
apt_framework/type/conversion.h
APT_FieldConversion
APT_FieldConversionRegistry
apt_framework/type/date/date.h
APT_DateDescriptor
APT_InputAccessorToDate
APT_OutputAccessorToDate
apt_framework/type/decimal/decimal.h
APT_DecimalDescriptor
APT_InputAccessorToDecimal
APT_OutputAccessorToDecimal
apt_framework/type/descriptor.h
APT_FieldTypeDescriptor
APT_FieldTypeRegistry
apt_framework/type/function.h
APT_GenericFunction
APT_GenericFunctionRegistry
APT_GFComparison
APT_GFEquality
APT_GFPrint
apt_framework/type/protocol.h
APT_BaseOffsetFieldProtocol
APT_EmbeddedFieldProtocol
APT_FieldProtocol
APT_PrefixedFieldProtocol
APT_TraversedFieldProtocol
apt_framework/type/time/time.h
APT_TimeDescriptor
APT_InputAccessorToTime
APT_OutputAccessorToTime
apt_framework/type/timestamp/timestamp.h
APT_TimeStampDescriptor
APT_InputAccessorToTimeStamp
APT_OutputAccessorToTimeStamp
apt_framework/utils/fieldlist.h
APT_FieldList
apt_util/archive.h
APT_Archive
APT_FileArchive
APT_MemoryArchive
apt_util/argvcheck.h
APT_ArgvProcessor
apt_util/date.h
APT_Date
apt_util/dbinterface.h
APT_DataBaseDriver
APT_DataBaseSource
APT_DBColumnDescriptor
apt_util/decimal.h
APT_Decimal
apt_util/endian.h
APT_ByteOrder
apt_util/env_flag.h
APT_EnvironmentFlag
apt_util/errind.h
APT_Error
apt_util/errlog.h
APT_ErrorLog
apt_util/errorconfig.h
APT_ErrorConfiguration
apt_util/fast_alloc.h
APT_FixedSizeAllocator
APT_VariableSizeAllocator
apt_util/fileset.h
APT_FileSet
Index
Symbols
_cplusplus token 51-2
_STDC_ token 51-2
A
active stages 3-1
advanced tab, stage page 3-5
aggregator stage 17-1
aggregator stage properties 17-2
API 51-2
B
batch log entries 51-111
build stage macros 49-16
build stages 49-1
C
change apply stage 32-1
change apply stage properties 32-3
change capture stage 31-1
change capture stage properties 31-2
Cluster systems 2-5
collecting data 2-7
collection types
ordered 2-9
round robin 2-9
sorted merge 2-9
column auto-match facility 16-11
column export stage 37-1
column export stage properties
D
data set stage 6-1
data set stage input properties 6-2
data set stage output properties 6-4
data structures
description 51-45
how used 51-2
summary of usage 51-44
data types 2-13
complex 2-14
database stages 3-1
DataStage API
building applications that use 51-4
header file 51-2
programming logic example 51-3
redistributing programs 51-4
DataStage CLI
completion codes 51-104
logon clause 51-104
overview 51-104
using to run jobs 51-105
DataStage Development Kit 51-2
API functions 51-5
command line interface 51-104
data structures 51-44
dsjob command 51-105
error codes 51-57
job status macros 51-104
writing DataStage API
programs 51-3
DataStage server engine 51-104
DB2 partition properties 3-15
DB2 partitioning 2-8
DB2 stage 12-1
DB2 stage input properties 12-3
DB2 stage output properties 12-11
decode stage 34-1
decode stage properties 34-1
defining
local stage variables 16-15
difference stage 35-1
difference stage properties 35-2
DLLs 51-4
documentation conventions xix-xx
dsapi.h header file
description 51-2
including 51-4
DSCloseJob function 51-7
DSCloseProject function 51-8
DSDetachJob function 51-66
dsdk directory 51-4
E
editing
Transformer stages 16-6
encode stage 33-1
encode stage properties 33-1
entire partitioning 2-8
error codes 51-57
errors
and DataStage API 51-57
functions used for handling 51-6
retrieving message text 51-15
retrieving values for 51-14
Event Type parameter 51-9
example build stage 49-21
examples
Development Kit program A-1
expand stage 25-1
expand stage properties 25-2
expression editor 16-18
external filter stage 30-1
external filter stage properties 30-1
external source stage 8-1
external source stage output
properties 8-3
external target stage 9-1
external target stage input
properties 9-3
F
fatal error log entries 51-111
file set output properties 5-13
file set stage 5-1
file set stage input properties 5-3
file stages 3-1
Find and Replace dialog box 16-8
Folder stages 12-1
format tab, inputs page 3-24
format tab, outputs page 3-29
functions, table of 51-5
funnel stage 19-1
funnel stage properties 19-2
G
general tab, inputs page 3-10
general tab, outputs page 3-26
general tab, stage page 3-2
H
hash by field partitioning 2-8
head stage 44-1
head stage properties 44-2
I
information log entries 51-111
Informix XPS stage 15-1
Informix XPS stage input
properties 15-2
Informix XPS stage output properties 15-7
input links 16-5
inputs page 3-6, 3-9
columns tab 3-16
format tab 3-24
general tab 3-10
partitioning tab 3-11
properties tab 3-10
J
job control interface 51-1
job handle 51-31
job parameters
displaying information
about 51-110
functions used for accessing 51-6
listing 51-108
retrieving information about 51-21
setting 51-38
job status macros 51-104
jobs
closing 51-7
displaying information
about 51-108
functions used for accessing 51-5
listing 51-23, 51-107
locking 51-28
opening 51-31
resetting 51-34, 51-105
retrieving status of 51-12
running 51-34, 51-105
stopping 51-41, 51-107
unlocking 51-42
validating 51-34, 51-105
L
library files 51-4
limits 51-36
link ordering tab, stage page 3-6
links
displaying information
about 51-109
functions used for accessing 51-6
input 16-5
listing 51-107
output 16-5
reject 16-5
retrieving information about 51-16
specifying order 16-14
log entries
adding 51-29, 51-110
batch control 51-111
fatal error 51-111
finding newest 51-19, 51-112
functions used for accessing 51-6
job reset 51-111
job started 51-111
new lines in 51-30
rejected rows 51-111
retrieving 51-9, 51-11
retrieving specific 51-18, 51-111
types of 51-9
warning 51-111
logon clause 51-104
lookup file set stage 7-1
lookup file set stage output
properties 7-7
lookup stage 20-1
lookup stage input properties 20-5
lookup stage properties 20-2
M
macros, job status 51-104
make subrecord stage 38-1
make subrecord stage properties 38-2
make vector stage 42-1, 42-2
make vector stage properties 42-2
mapping tab, outputs page 3-30
merge stage 22-1
merge stage properties 22-2
meta data 2-11
modulus partitioning 2-8
MPP systems 2-5
N
new lines in log entries 51-30
O
optimizing performance 2-1
Oracle stage 13-1
Oracle stage input properties 13-2
Oracle stage output properties 13-11
ordered collection 2-9
output links 16-5
outputs page 3-25
columns tab 3-28
format tab 3-29
general tab 3-26
mapping tab 3-30
properties tab 3-27
P
parallel processing 2-1
parallel processing environments 2-5
parameters, see job parameters
partial schemas 2-12
partition parallel processing 2-1
partition parallelism 2-3
partitioning data 2-7
R
random partitioning 2-7
range partition properties 3-15
range partitioning 2-8
S
same partitioning 2-7
sample stage 26-1
sample stage properties 26-2
SAS data set stage 11-1
SAS data set stage input
properties 11-3
SAS data set stage output
properties 11-6
SAS stage 48-1
schema files 2-12
sequential file input properties 4-3
sequential file output properties 4-12
sequential file stage 4-1
server names, setting 51-40
shared containers 2-17
shortcut menus in Transformer
Editor 16-4
SMP systems 2-5
sort stage 21-1
sort stage properties 21-1
T
tokens
_cplusplus 51-2
_STDC_ 51-2
WIN32 51-2
toolbars
Transformer Editor 16-3
Transformer Editor
link area 16-3
meta data area 16-3
shortcut menus 16-4
toolbar 16-3
transformer stage 16-1
transformer stage properties 16-18
Transformer stages
basic concepts 16-5
editing 16-6
V
vector 2-15
vmdsapi.dll 51-4
vmdsapi.lib library 51-4
U
unirpc32.dll 51-5
UniVerse stages 4-1
user names, setting 51-40
uvclnt32.dll 51-5
W
warning limits 51-36, 51-106
warnings 51-111
WIN32 token 51-2
wrapped stages 49-1
write range map stage 10-1
write range map stage input
properties 10-2
writing
DataStage API programs 51-3