DataStage Best Practices & Performance Tuning

1 Environment Variable Settings
1.1 Environment Variable Settings for All Jobs
1.2 Additional Environment Variable Settings
2 Configuration Files
2.1 Logical Processing Nodes
2.2 Optimizing Parallelism
2.3 Configuration File Examples
2.3.1 Example for Any Number of CPUs and Any Number of Disks
2.3.2 Example that Reduces Contention
2.3.3 Smaller Configuration Example
2.4 Sequential File Stages (Import and Export)
2.4.1 Improving Sequential File Performance
2.4.2 Partitioning Sequential File Reads
2.4.3 Sequential File (Export) Buffering
2.4.4 Reading from and Writing to Fixed-Length Files
2.4.5 Reading Bounded-Length VARCHAR Columns
2.5 Transformer Usage Guidelines
2.5.1 Choosing Appropriate Stages
2.5.2 Transformer NULL Handling and Reject Link
2.5.3 Transformer Derivation Evaluation
2.5.4 Conditionally Aborting Jobs
2.6 Lookup vs. Join Stages
2.7 Capturing Unmatched Records from a Join
2.8 The Aggregator Stage
2.9 Appropriate Use of SQL and DataStage Stages
2.10 Optimizing Select Lists
2.11 Designing for Restart
2.12 Database OPEN and CLOSE Commands
2.13 Database Sparse Lookup vs. Join
2.14 Oracle Database Guidelines
2.14.1 Proper Import of Oracle Column Definitions (Schema)
2.14.2 Reading from Oracle in Parallel
2.14.3 Oracle Load Options
3 Tips for Debugging Enterprise Edition Jobs
3.1 Reading a Score Dump
3.2 Partitioner and Sort Insertion
4 Performance Tips for Job Design
5 Performance Monitoring and Tuning
5.1 The Job Monitor
5.2 OS/RDBMS-Specific Tools
5.3 Obtaining Operator Run-Time Information
5.4 Selectively Rewriting the Flow
5.5 Eliminating Repartitions
5.6 Ensuring Data is Evenly Partitioned
5.7 Buffering for All Versions
5.8 Resolving Bottlenecks


5.8.1 Variable Length Data
5.8.2 Combinable Operators
5.8.3 Disk I/O
5.8.4 Buffering

1 Environment Variable Settings


DataStage EE provides a number of environment variables to control how jobs operate
on a UNIX system. In addition to providing required information, environment variables
can be used to enable or disable various DataStage features, and to tune performance
settings.

1.1 Environment Variable Settings for All Jobs

Ascential recommends the following environment variable settings for all Enterprise
Edition jobs. These settings can be made at the project level, or may be set on an
individual basis within the properties for each job.

Environment Variable Settings for All Jobs

$APT_CONFIG_FILE [filepath]
    Specifies the full pathname to the EE configuration file.

$APT_DUMP_SCORE
    Outputs the EE score dump to the DataStage job log, providing detailed
    information about the actual job flow, including operators, processes, and
    datasets. Extremely useful for understanding how a job actually ran in the
    environment. (See section 3.1, Reading a Score Dump.)

$OSH_ECHO
    Includes a copy of the generated osh in the job's DataStage log. Starting
    with v7, this option is enabled when the "Generated OSH visible for Parallel
    jobs in ALL projects" option is enabled in DataStage Administrator.

$APT_RECORD_COUNTS
    Outputs record counts to the DataStage job log as each operator completes
    processing. The count is per operator per partition.

$APT_PM_SHOW_PIDS
    Places entries in the DataStage job log showing the UNIX process ID (PID)
    for each process started by a job. Does not report PIDs of DataStage
    "phantom" processes started by Server shared containers.

$APT_BUFFER_MAXIMUM_TIMEOUT
    Maximum buffer delay in seconds.

$APT_THIN_SCORE (DataStage 7.0 and earlier)
    Only needed for DataStage v7.0 and earlier. Setting this environment
    variable significantly reduces memory usage for very large (>100 operator)
    jobs.
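These settings are normally made at the project level in DataStage Administrator or in the individual job properties. For a quick command-line test they can also be supplied at run time with dsjob, provided the corresponding environment-variable parameters have already been added to the job. A minimal sketch (the project name "dstage1" and job name "LoadCustomers" are placeholders):

    # Hypothetical project and job names; each $APT_* parameter must already be
    # defined as an environment-variable job parameter in DataStage Designer.
    dsjob -run -wait \
          -param '$APT_CONFIG_FILE=/opt/ds/configs/4node.apt' \
          -param '$APT_DUMP_SCORE=1' \
          -param '$APT_RECORD_COUNTS=1' \
          -param '$APT_PM_SHOW_PIDS=1' \
          dstage1 LoadCustomers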

1.2 Additional Environment Variable Settings

Ascential recommends setting the following environment variables on an as-needed
basis. These variables can be used to tune the performance of a particular job flow, to
assist in debugging, and to change the default behavior of specific EE stages.


NOTE: The environment variable settings in this section are only examples. Set values
that are optimal for your environment.

Sequential File Stage Environment Variables

$APT_IMPORT_BUFFER_SIZE [Kbytes]
$APT_EXPORT_BUFFER_SIZE [Kbytes]
    Define the size of the I/O buffer for Sequential File reads (imports) and
    writes (exports) respectively. The default is 128 (128K), with a minimum of
    8. Increasing these values on heavily-loaded file servers may improve
    performance.

$APT_CONSISTENT_BUFFERIO_SIZE [bytes]
    In some disk array configurations, setting this variable to a value equal
    to the read/write size in bytes can improve performance of Sequential File
    import/export operations.

$APT_DELIMITED_READ_SIZE [bytes]
    Specifies the number of bytes the Sequential File (import) stage reads
    ahead to get the next delimiter. The default is 500 bytes, but this can be
    set as low as 2 bytes. This setting should be set to a lower value when
    reading from streaming inputs (eg. socket, FIFO) to avoid blocking.

$APT_EXPORT_FLUSH_COUNT [nrows]
    Specifies how frequently (in rows) the Sequential File stage (export
    operator) flushes its internal buffer to disk. Setting this value to a low
    number (such as 1) is useful for realtime applications, but there is a
    small performance penalty from increased I/O.

$APT_MAX_DELIMITED_READ_SIZE [bytes]
    By default, Sequential File (import) will read ahead 500 bytes to get the
    next delimiter. If it is not found, the importer looks ahead 4*500=2000
    (1500 more) bytes, and so on (4X) up to 100,000 bytes. This variable
    controls the upper bound, which is 100,000 bytes by default. When more than
    500 bytes of read-ahead is desired, use this variable instead of
    $APT_DELIMITED_READ_SIZE.

Oracle Environment Variables

$ORACLE_HOME [path]
    Specifies the installation directory for the current Oracle instance.
    Normally set in a user's environment by Oracle scripts.

$ORACLE_SID [sid]
    Specifies the Oracle service name, corresponding to a TNSNAMES entry.

$APT_ORAUPSERT_COMMIT_ROW_INTERVAL [num]
$APT_ORAUPSERT_COMMIT_TIME_INTERVAL [seconds]
    These two environment variables work together to specify how often target
    rows are committed for target Oracle stages with the Upsert method. Commits
    are made whenever the time interval has passed or the row interval is
    reached, whichever comes first. By default, commits are made every 2
    seconds or 5000 rows.

$APT_ORACLE_LOAD_OPTIONS [SQL*Loader options]
    Specifies Oracle SQL*Loader options used in a target Oracle stage with the
    Load method. By default, this is set to OPTIONS(DIRECT=TRUE, PARALLEL=TRUE).

$APT_ORA_IGNORE_CONFIG_FILE_PARALLELISM
    When set, a target Oracle stage with the Load method will limit the number
    of players to the number of datafiles in the table's tablespace.

$APT_ORA_WRITE_FILES [filepath]
    Useful in debugging Oracle SQL*Loader issues. When set, the output of a
    target Oracle stage with the Load method is written to files instead of
    invoking the Oracle SQL*Loader. The filepath specified by this environment
    variable specifies the file with the SQL*Loader commands.

$DS_ENABLE_RESERVED_CHAR_CONVERT
    Allows DataStage to handle Oracle databases which use the special
    characters # and $ in column names.
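As an illustration only (the values are arbitrary), the default commit behavior of an Oracle Upsert target could be relaxed for a large batch load by setting the two commit variables together, for example in the job's environment-variable parameters or, as sketched here, in the shell used for testing:

    # Illustrative values -- commit every 30 seconds or every 50000 rows,
    # whichever comes first, instead of the default 2 seconds / 5000 rows.
    export APT_ORAUPSERT_COMMIT_TIME_INTERVAL=30
    export APT_ORAUPSERT_COMMIT_ROW_INTERVAL=50000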

Job Monitoring Environment Variables

$APT_MONITOR_TIME [seconds]
    In v7 and later, specifies the time interval (in seconds) for generating
    job monitor information at runtime. To enable size-based job monitoring,
    unset this environment variable and set $APT_MONITOR_SIZE below.

$APT_MONITOR_SIZE [rows]
    Determines the minimum number of records the job monitor reports. The
    default of 5000 records is usually too small. To minimize the number of
    messages during large job runs, set this to a higher value (eg. 1000000).

$APT_NO_JOBMON
    Disables job monitoring completely. In rare instances, this may improve
    performance. In general, this should only be set on a per-job basis when
    attempting to resolve performance bottlenecks.

$APT_RECORD_COUNTS
    Prints record counts in the job log as each operator completes processing.
    The count is per operator per partition.


2 Configuration Files
The configuration file tells DataStage Enterprise Edition how to exploit underlying system
resources (processing, temporary storage, and dataset storage). In more advanced
environments, the configuration file can also define other resources such as databases
and buffer storage. At runtime, EE first reads the configuration file to determine what
system resources are allocated to it, and then distributes the job flow across these
resources.
When you modify the system, by adding or removing nodes or disks, you must modify
the DataStage EE configuration file accordingly. Since EE reads the configuration file
every time it runs a job, it automatically scales the application to fit the system without
having to alter the job design.
There is not necessarily one ideal configuration file for a given system because of the
high variability between the way different jobs work. For this reason, multiple
configuration files should be used to optimize overall throughput and to match job
characteristics to available hardware resources. At runtime, the configuration file is
specified through the environment variable $APT_CONFIG_FILE.

2.1 Logical Processing Nodes

The configuration file defines one or more EE processing nodes on which parallel jobs
will run. EE processing nodes are a logical rather than a physical construct. For this
reason, it is important to note that the number of processing nodes does not necessarily
correspond to the actual number of CPUs in your system.
Within a configuration file, the number of processing nodes defines the degree of
parallelism and resources that a particular job will use to run. It is up to the UNIX
operating system to actually schedule and run the processes that make up a DataStage
job across physical processors. A configuration file with a larger number of nodes
generates a larger number of processes that use more memory (and perhaps more disk
activity) than a configuration file with a smaller number of nodes.
While the DataStage documentation suggests creating half the number of nodes as
physical CPUs, this is a conservative starting point that is highly dependent on system
configuration, resource availability, job design, and other applications sharing the server
hardware. For example, if a job is highly I/O dependent or dependent on external (eg.
database) sources or targets, it may be appropriate to have more nodes than physical
CPUs.
For typical production environments, a good starting point is to set the number of nodes
equal to the number of CPUs. For development environments, which are typically
smaller and more resource-constrained, create smaller configuration files (eg. 2-4
nodes). Note that even in the smallest development environments, a 2-node
configuration file should be used to verify that job logic and partitioning will work in
parallel (as long as the test data can sufficiently identify data discrepancies).
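For reference, a minimal two-node configuration file of the kind suggested for development environments might look like the sketch below; the host name and directory paths are placeholders and must be adjusted to your environment:

{
  node "node1" {
    fastname "devhost"
    pools ""
    resource disk "/ds/data" {pools ""}
    resource scratchdisk "/ds/scratch" {pools ""}
  }
  node "node2" {
    fastname "devhost"
    pools ""
    resource disk "/ds/data" {pools ""}
    resource scratchdisk "/ds/scratch" {pools ""}
  }
}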

2.2 Optimizing Parallelism

The degree of parallelism of a DataStage EE application is determined by the number of
nodes you define in the configuration file. Parallelism should be optimized rather than
maximized. Increasing parallelism may better distribute your work load, but it also adds
to your overhead because the number of processes increases. Therefore, you must
weigh the gains of added parallelism against the potential losses in processing
efficiency. The CPUs, memory, disk controllers and disk configuration that make up your
system influence the degree of parallelism you can sustain.
Keep in mind that the closest equal partitioning of data contributes to the best overall
performance of an application running in parallel. For example, when hash partitioning,
try to ensure that the resulting partitions are evenly populated. This is referred to as
minimizing skew.
When business requirements dictate a partitioning strategy that is excessively skewed,
remember to change the partition strategy to a more balanced one as soon as possible
in the job flow. This will minimize the effect of data skew and significantly improve
overall job performance.

2.3 Configuration File Examples

Given the large number of considerations for building a configuration file, where do you
begin? For starters, the default configuration file (default.apt) created when DataStage is
installed is appropriate for only the most basic environments.
The default configuration file has the following characteristics:
- number of nodes = number of physical CPUs
- disk and scratchdisk storage use subdirectories within the DataStage install
filesystem
You should create and use a new configuration file that is optimized to your hardware
and file systems. Because different job flows have different needs (CPU-intensive?
Memory-intensive? Disk-intensive? Database-intensive? Sorts? Need to share resources
with other jobs, databases, or applications? etc.), it is often appropriate to have multiple
configuration files optimized for particular types of processing.
With the synergistic relationship between hardware (number of CPUs, speed, cache,
available system memory, number and speed of I/O controllers, local vs. shared disk,
RAID configurations, disk size and speed, network configuration and availability),
software topology (local vs. remote database access, SMP vs. Clustered processing),
and job design, there is no definitive science for formulating a configuration file. This
section attempts to provide some guidelines based on experience with actual production
applications.
IMPORTANT: Follow the order of all sub-items within individual node specifications, as
shown in the example configuration files given in this section.
2.3.1 Example for Any Number of CPUs and Any Number of Disks
Assume you are running on a shared-memory multi-processor system, an SMP server,
which is the most common platform today. Let's assume these properties:
computer host name fastone
6 CPUs
4 separate file systems on 4 drives named /fs0, /fs1, /fs2, /fs3
You can adjust the sample to match your precise environment.
The configuration file you would use as a starting point would look like the one below.
Assuming that the system load from processing outside of DataStage is minimal, it may
be appropriate to create one node per CPU as a starting point.
In the following example, the way disk and scratchdisk resources are handled is the
important point.
{ /* config files allow C-style comments. */
/* Configuration files do not have a flexible syntax. Keep all the sub-items of
the individual node specifications in the order shown here. */
node "n0" {
pools "" /* on an SMP node pools aren't used often. */
fastname "fastone"
resource scratchdisk "/fs0/ds/scratch" {} /* start with fs0 */
resource scratchdisk "/fs1/ds/scratch" {}
resource scratchdisk "/fs2/ds/scratch" {}
resource scratchdisk "/fs3/ds/scratch" {}
resource disk "/fs0/ds/disk" {} /* start with fs0 */
resource disk "/fs1/ds/disk" {}
resource disk "/fs2/ds/disk" {}
resource disk "/fs3/ds/disk" {}
}
node "n1" {
pools ""
fastname "fastone"
resource scratchdisk "/fs1/ds/scratch" {} /* start with fs1 */
resource scratchdisk "/fs2/ds/scratch" {}
resource scratchdisk "/fs3/ds/scratch" {}
resource scratchdisk "/fs0/ds/scratch" {}
resource disk "/fs1/ds/disk" {} /* start with fs1 */
resource disk "/fs2/ds/disk" {}
resource disk "/fs3/ds/disk" {}
resource disk "/fs0/ds/disk" {}
}
node "n2" {
pools ""
fastname "fastone"
resource scratchdisk "/fs2/ds/scratch" {} /* start with fs2 */
resource scratchdisk "/fs3/ds/scratch" {}
resource scratchdisk "/fs0/ds/scratch" {}
resource scratchdisk "/fs1/ds/scratch" {}
resource disk "/fs2/ds/disk" {} /* start with fs2 */
resource disk "/fs3/ds/disk" {}
resource disk "/fs0/ds/disk" {}
resource disk "/fs1/ds/disk" {}
}
node "n3" {
pools ""
fastname "fastone"
resource scratchdisk "/fs3/ds/scratch" {} /* start with fs3 */
resource scratchdisk "/fs0/ds/scratch" {}
resource scratchdisk "/fs1/ds/scratch" {}
resource scratchdisk "/fs2/ds/scratch" {}
resource disk "/fs3/ds/disk" {} /* start with fs3 */
resource disk "/fs0/ds/disk" {}
resource disk "/fs1/ds/disk" {}
resource disk "/fs2/ds/disk" {}
}
node "n4" {
pools ""
fastname "fastone"
/* Now we have rotated through starting with a different disk, but the fundamental problem
* in this scenario is that there are more nodes than disks. So what do we do now?
* The answer: something that is not perfect. We're going to repeat the sequence. You could
* shuffle differently, i.e., use /fs0 /fs2 /fs1 /fs3 as an order, but that most likely won't
* matter. */
resource scratchdisk "/fs0/ds/scratch" {} /* start with fs0 again */
resource scratchdisk "/fs1/ds/scratch" {}
resource scratchdisk "/fs2/ds/scratch" {}
resource scratchdisk "/fs3/ds/scratch" {}
resource disk "/fs0/ds/disk" {} /* start with fs0 again */
resource disk "/fs1/ds/disk" {}
resource disk "/fs2/ds/disk" {}
resource disk "/fs3/ds/disk" {}
}
node "n5" {
pools ""
fastname "fastone"
resource scratchdisk "/fs1/ds/scratch" {} /* start with fs1 */
resource scratchdisk "/fs2/ds/scratch" {}
resource scratchdisk "/fs3/ds/scratch" {}
resource scratchdisk "/fs0/ds/scratch" {}
resource disk "/fs1/ds/disk" {} /* start with fs1 */
resource disk "/fs2/ds/disk" {}
resource disk "/fs3/ds/disk" {}
resource disk "/fs0/ds/disk" {}
}
} /* end of entire config */

The configuration file above follows a "give every node all the disks" pattern, albeit in
different orders to minimize I/O contention. This configuration method works well
when the job flow is complex enough that it is difficult to determine and precisely plan for
good I/O utilization.
Within each node, EE does not stripe the data across multiple filesystems. Rather, it
fills the disk and scratchdisk filesystems in the order specified in the configuration file. In
the example above, the order of the disks is purposely shifted for each node, in
an attempt to minimize I/O contention.


Even in this example, giving every partition (node) access to all the I/O resources can
cause contention, but EE attempts to minimize this by using fairly large I/O blocks.
This configuration style works for any number of CPUs and any number of disks since it
doesn't require any particular correspondence between them. The heuristic here is:
When it's too difficult to figure out precisely, at least go for achieving balance.

2.3.2 Example that Reduces Contention

The alternative to the first configuration method is more careful planning of the I/O
behavior to reduce contention. You can imagine this could be hard given our
hypothetical 6-way SMP with 4 disks because setting up the obvious one-to-one
correspondence doesn't work. Doubling up some nodes on the same disk is unlikely to
be good for overall performance since we create a hotspot.
We could give every CPU two disks and rotate them around, but that would be little
different than the previous strategy. So, let's imagine a less constrained environment
with two additional disks:
computer host name fastone
6 CPUs
6 separate file systems on 6 drives named /fs0, /fs1, /fs2, /fs3, /fs4, /fs5
Now a configuration file for this environment might look like this:
{
node "node1" {
pools ""
fastname "fastone"
resource disk "/fs0/ds/data" {pools ""}
resource scratchdisk "/fs0/ds/scratch" {pools ""}
}
node "node2" {
fastname "fastone"
pools ""
resource disk "/fs1/ds/data" {pools ""}
resource scratchdisk "/fs1/ds/scratch" {pools ""}
}
node "node3" {
fastname "fastone"
pools ""
resource disk "/fs2/ds/data" {pools ""}
resource scratchdisk "/fs2/ds/scratch" {pools ""}
}
node "node4" {
fastname "fastone"
pools ""
resource disk "/fs3/ds/data" {pools ""}
resource scratchdisk "/fs3/ds/scratch" {pools ""}
}
node "node5" {
fastname "fastone"
pools ""
resource disk "/fs4/ds/data" {pools ""}
resource scratchdisk "/fs4/ds/scratch" {pools ""}
}
node "node6" {
fastname "fastone"
pools ""
resource disk "/fs5/ds/data" {pools ""}
resource scratchdisk "/fs5/ds/scratch" {pools ""}
}
} /* end of entire config */

While this is the simplest scenario, it is important to realize that no single player, stage,
or operator instance on any one partition can go faster than the single disk it has access
to.
You could combine strategies by adding in a node pool where disks have a one-to-one
association with nodes. These nodes would then not be in the default node pool, but a
special one that you would specifically assign to stage / operator instances.
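As an illustration of that idea (the pool name and paths below are invented for the example), such a node would be placed in a named pool rather than the default pool, and the stages that should use the one-to-one disk association would be constrained to that pool in their stage properties:

node "node7" {
  fastname "fastone"
  pools "dedicated_io"    /* member of a named pool, not the default pool "" */
  resource disk "/fs6/ds/data" {pools ""}
  resource scratchdisk "/fs6/ds/scratch" {pools ""}
}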
2.3.3 Smaller Configuration Example
Because disk and scratchdisk resources are assigned per node, depending on the total
disk space required to process large jobs, it may be necessary to distribute file systems
across nodes in smaller environments (fewer available CPUs/memory).
Using the above server example, this time with 4 nodes:
computer host name fastone
4 CPUs
8 separate file systems on 8 drives named /fs0, /fs1, /fs2, /fs3, /fs4, /fs5, /fs6, /fs7
{
node "node1" {
fastname "fastone"
pools ""
resource disk "/fs0/ds/data" {pools ""} /* start with fs0 */
resource disk "/fs4/ds/data" {pools ""}
resource scratchdisk "/fs4/ds/scratch" {pools ""} /* start with fs4 */
resource scratchdisk "/fs0/ds/scratch" {pools ""}
}
node "node2" {
fastname "fastone"
pools ""
resource disk "/fs1/ds/data" {pools ""}
resource disk "/fs5/ds/data" {pools ""}
resource scratchdisk "/fs5/ds/scratch" {pools ""}
resource scratchdisk "/fs1/ds/scratch" {pools ""}
}
node "node3" {
fastname "fastone"
pools ""
resource disk "/fs2/ds/data" {pools ""}
resource disk "/fs6/ds/data" {pools ""}
resource scratchdisk "/fs6/ds/scratch" {pools ""}
resource scratchdisk "/fs2/ds/scratch" {pools ""}
}
node "node4" {
fastname "fastone"
pools ""
resource disk "/fs3/ds/data" {pools ""}
resource disk "/fs7/ds/data" {pools ""}
resource scratchdisk "/fs7/ds/scratch" {pools ""}
resource scratchdisk "/fs3/ds/scratch" {pools ""}
}
} /* end of entire config */

The 4-node example above illustrates another concept in configuration file setup: you
can assign multiple disk and scratchdisk resources to each node.


Unfortunately, physical limitations of available hardware and disk configuration don't
always lend themselves to the clean configurations illustrated above.
Other configuration file tips:
Consider avoiding the disk(s) that your input files reside on. Often those disks will
be hotspots until the input phase is over. If the job is large and complex this is
less of an issue since the input part is proportionally less of the total work.

Ensure that the different file systems mentioned as the disk and scratchdisk
resources hit disjoint sets of spindles even if they're located on a RAID system.
Do not trust high-level RAID/SAN monitoring tools, as their cache hit ratios are
often misleading.
Never use NFS file systems for scratchdisk resources. Know what's real and
what's NFS: Real disks are directly attached, or are reachable over a SAN
(storage-area network - dedicated, just for storage, low-level protocols).
Proper configuration of scratch and resource disk (and the underlying filesystem
and physical hardware architecture) can significantly affect overall job
performance. Beware if you use NFS (and, often SAN) filesystem space for disk
resources. For example, your final result files may need to be written out onto the
NFS disk area, but that doesn't mean the intermediate data sets created and
used temporarily in a multi-job sequence should use this NFS disk area. It is
better to set up a "final" disk pool and constrain the result sequential file or data
set to reside there, but let intermediate storage go to local or SAN resources, not
NFS.

2.4 Sequential File Stages (Import and Export)

2.4.1 Improving Sequential File Performance


If the source file is fixed-width or delimited, the "Readers Per Node" option can be used to read a
single input file in parallel at evenly-spaced offsets. Note that in this manner, input row
order is not maintained.
If the input sequential file cannot be read in parallel, performance can still be improved
by separating the file I/O from the column parsing operation. To accomplish this, define a
single large string column for the non-parallel Sequential File read, and then pass this to
a Column Import stage to parse the file in parallel. The formatting and column properties
of the Column Import stage match those of the Sequential File stage.
On heavily-loaded file servers or some RAID/SAN array configurations, the environment
variables $APT_IMPORT_BUFFER_SIZE and $APT_EXPORT_BUFFER_SIZE can be used to
improve I/O performance. These settings specify the size of the read (import) and write
(export) buffers in Kbytes, with a default of 128 (128K). Increasing these may improve
performance.
Finally, in some disk array configurations, setting the environment variable
$APT_CONSISTENT_BUFFERIO_SIZE to a value equal to the read/write size in bytes can
significantly improve performance of Sequential File operations.
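For example (the values below are purely illustrative, not recommendations), the buffers might be enlarged for a job whose source and target files sit on a heavily-loaded file server backed by an array with a 1 MB read/write size:

    # Illustrative tuning values -- adjust to the measured characteristics of your storage.
    export APT_IMPORT_BUFFER_SIZE=1024           # read buffer in Kbytes (default 128)
    export APT_EXPORT_BUFFER_SIZE=1024           # write buffer in Kbytes (default 128)
    export APT_CONSISTENT_BUFFERIO_SIZE=1048576  # match the array's 1 MB read/write size, in bytes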
2.4.2 Partitioning Sequential File Reads
Care must be taken to choose the appropriate partitioning method from a Sequential File
read:

Don't read from Sequential File using SAME partitioning! Unless more than one
source file is specified, SAME will read the entire file into a single partition,
making the entire downstream flow run sequentially (unless it is later
repartitioned).
When multiple files are read by a single Sequential File stage (using multiple
files, or by using a File Pattern), each file's data is read into a separate partition.
It is important to use ROUND-ROBIN partitioning (or other partitioning
appropriate to downstream components) to evenly distribute the data in the flow.

2.4.3 Sequential File (Export) Buffering


By default, the Sequential File (export operator) stage buffers its writes to optimize
performance. When a job completes successfully, the buffers are always flushed to disk.
The environment variable $APT_EXPORT_FLUSH_COUNT allows the job developer to specify
how frequently (in number of rows) that the Sequential File stage flushes its internal
buffer on writes. Setting this value to a low number (such as 1) is useful for realtime
applications, but there is a small performance penalty associated with increased I/O.

2.4.4 Reading from and Writing to Fixed-Length Files

Particular attention must be paid when processing fixed-length fields using the
Sequential File stage:
If the incoming columns are variable-length data types (eg. Integer, Decimal,
Varchar), the field width column property must be set to match the fixed-width of
the input column. Double-click on the column number in the grid dialog to set this
column property.

If a field is nullable, you must define the null field value and length in the Nullable
section of the column property. Double-click on the column number in the grid
dialog to set these properties.

When writing fixed-length files from variable-length fields (eg. Integer, Decimal,
Varchar), the field width and pad string column properties must be set to match
the fixed-width of the output column. Double-click on the column number in the
grid dialog to set this column property.

To display each field value, use the print_field import property. All import and
export properties are listed in chapter 25, Import/Export Properties of the
Orchestrate 7.0 Operators Reference.

2.4.5 Reading Bounded-Length VARCHAR Columns

Care must be taken when reading delimited, bounded-length Varchar columns (Varchars
with the length option set). By default, if the source file has fields with values longer than
the maximum Varchar length, these extra characters will be silently truncated.
Starting with v7.01, the environment variable $APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS
will direct DataStage to reject records with strings longer than their declared maximum
column length.

2.5 Transformer Usage Guidelines

2.5.1 Choosing Appropriate Stages


The parallel Transformer stage always generates C code which is then compiled to a
parallel component. For this reason, it is important to minimize the number of
transformers, and to use other stages (Copy, Filter, Switch, etc) when derivations are not
needed.

The Copy stage should be used instead of a Transformer for simple operations
including:
- Job Design placeholder between stages (unless the Force option is set to True, EE
will optimize this out at runtime)
- Renaming Columns
- Dropping Columns
- Default Type Conversions
Note that rename, drop (if runtime column propagation is disabled), and
default type conversion can also be performed by the output mapping tab of
any stage.

NEVER use the BASIC Transformer stage in large-volume job flows. Instead,
user-defined functions and routines can expand parallel Transformer capabilities.

Consider, if possible, implementing complex derivation expressions that follow regular
patterns using Lookup tables instead of a Transformer with nested derivations.
For example, the derivation expression:
If A=0,1,2,3 Then B=X
If A=4,5,6,7 Then B=C
Could be implemented with a lookup table containing values for column A
and corresponding values of column B.
Optimize the overall job flow design to combine derivations from multiple
Transformers into a single Transformer stage when possible.
In v7 and later, the Filter and/or Switch stages can be used to separate rows
into multiple output links based on SQL-like link constraint expressions.
In v7 and later, the Modify stage can be used for non-default type
conversions, null handling, and character string trimming.
Buildops should be used instead of Transformers in the handful of scenarios
where complex reusable logic is required, or where existing Transformer-based
job flows do not meet performance requirements.

2.5.2 Transformer NULL Handling and Reject Link


When evaluating expressions for output derivations or link constraints, the Transformer
will reject (through the reject link indicated by a dashed line) any row that has a NULL
value used in the expression. To create a Transformer reject link in DataStage Designer,
right-click on an output link and choose Convert to Reject.


The Transformer rejects NULL derivation results because the rules for arithmetic and
string handling of NULL values are by definition undefined. For this reason, always test
for null values before using a column in an expression, for example:
If ISNULL(link.col) Then ... Else ...
Note that if an incoming column is only used in a pass-through derivation, the
Transformer will allow this row to be output. DataStage release 7 enhances this behavior
by placing warnings in the log file when discards occur.
2.5.3 Transformer Derivation Evaluation
Output derivations are evaluated BEFORE any type conversions on the assignment. For
example, the PadString function uses the length of the source type, not the target.
Therefore, it is important to make sure the type conversion is done before a row reaches
the Transformer.
For example, TrimLeadingTrailing(string) works only if string is a VarChar field.
Thus, the incoming column must be type VarChar before it is evaluated in the
Transformer.
2.5.4 Conditionally Aborting Jobs
The Transformer can be used to conditionally abort a job when incoming data matches a
specific rule. Create a new output link that will handle rows that match the abort rule.
Within the link constraints dialog box, apply the abort rule to this output link, and set the
Abort After Rows count to the number of rows allowed before the job should be aborted
(eg. 1).
Since the Transformer will abort the entire job flow immediately, it is possible that valid
rows will not have been flushed from Sequential File (export) buffers, or committed to
database tables. It is important to set the Sequential File buffer flush (see section 7.3) or
database commit parameters.

2.6 Lookup vs. Join Stages

The Lookup stage is most appropriate when the reference data for all lookup stages in a
job is small enough to fit into available physical memory. Each lookup reference requires
a contiguous block of physical memory. If the datasets are larger than available
resources, the JOIN or MERGE stage should be used.
If the reference to a Lookup is directly from an Oracle table, and the number of input rows
is significantly smaller (eg. 1:100 or more) than the number of reference rows, a Sparse
Lookup may be appropriate.

2.7 Capturing Unmatched Records from a Join

The Join stage does not provide reject handling for unmatched records (such as in an
InnerJoin scenario). If un-matched rows must be captured or logged, an OUTER join
operation must be performed. In an OUTER join scenario, all rows on an outer link (eg.
Left Outer, Right Outer, or both links in the case of Full Outer) are output regardless of
match on key values.


During an Outer Join, when a match does not occur, the Join stage inserts NULL values
into the unmatched columns. Care must be taken to change the column properties to
allow NULL values before the Join. This is most easily done by inserting a Copy stage
and mapping a column from NON-NULLABLE to NULLABLE.
A Filter stage can be used to test for NULL values in unmatched columns.
In some cases, it is simpler to use a Column Generator to add an indicator column, with
a constant value, to each of the outer links and test that column for the constant after
you have performed the join. This is also handy with Lookups that have multiple
reference links.

2.8 The Aggregator Stage

By default, the output data type of a parallel Aggregator stage calculation or
recalculation column is Double. Starting with v7.01 of DataStage EE, the new optional
property Aggregations/Default to Decimal Output specifies that all calculation or
recalculations result in decimal output of the specified precision and scale.
You can also specify that the result of an individual calculation or recalculation is decimal
by using the optional Decimal Output subproperty.

2.9 Appropriate Use of SQL and DataStage Stages

When using relational database sources, there is often a functional overlap between
SQL and DataStage stages. Although it is possible to use either SQL or DataStage to
solve a given business problem, the optimal implementation involves leveraging the
strengths of each technology to provide maximum throughput and developer
productivity.
While there are extreme scenarios when the appropriate technology choice is clearly
understood, there may be gray areas where the decision should be made based on factors such
as developer productivity, metadata capture and re-use, and ongoing application
maintenance costs.
The following guidelines can assist with the appropriate use of SQL and DataStage
technologies in a given job flow:
a) When possible, use a SQL filter (WHERE clause) to limit the number of
rows sent to the DataStage job. This minimizes impact on network and
memory resources, and leverages the database capabilities.
b) Use a SQL Join to combine data from tables with a small number of rows
in the same database instance, especially when the join columns are
indexed.
c) When combining data from very large tables, or when the source includes
a large number of database tables, the efficiency of the DataStage EE
Sort and Join stages can be significantly faster than an equivalent SQL
query. In this scenario, it can still be beneficial to use database filters
(WHERE clause) if appropriate.
d) Avoid the use of database stored procedures (eg. Oracle PL/SQL) on a
per-row basis within a high-volume data flow. For maximum scalability
and parallel performance, it is best to implement business rules natively
using DataStage components.

2.10 Optimizing Select Lists


For best performance and optimal memory usage, it is best to explicitly specify column
names on all source database stages, instead of using an unqualified Table or SQL
SELECT * read. For Table read method, always specify the Select List subproperty.
For Auto-Generated SQL, the DataStage Designer will automatically populate the
select list based on the stage's output column definition.
The only exception to this rule is when building dynamic database jobs that use runtime
column propagation to process all rows in a source table.

2.11 Designing for Restart


To enable restart of high-volume jobs, it is important to separate the transformation
process from the database write (Load or Upsert) operation. After transformation, the
results should be landed to a parallel data set. Subsequent job(s) should read this data
set and populate the target table using the appropriate database stage and write
method.
As a further optimization, a Lookup stage (or Join stage, depending on data volume) can
be used to identify existing rows before they are inserted into the target table.

2.12 Database OPEN and CLOSE Commands


The native parallel database stages provide options for specifying OPEN and CLOSE
commands. These options allow commands (including SQL) to be sent to the database
before (OPEN) or after (CLOSE) all rows are read/written/loaded to the database. OPEN
and CLOSE are not offered by plug-in database stages.
For example, the OPEN command could be used to create a temporary table, and the
CLOSE command could be used to select all rows from the temporary table and insert
into a final target table.
As another example, the OPEN command can be used to create a target table, including
database-specific options (tablespace, logging, constraints, etc) not possible with the
Create option. In general, don't let EE generate target tables unless they are used for
temporary storage. There are few options to specify Create table options, and doing so may
violate data-management (DBA) policies.
It is important to understand the implications of specifying a user-defined OPEN and
CLOSE command. For example, when reading from DB2, a default OPEN statement
places a shared lock on the source. When specifying a user-defined OPEN command,
this lock is not sent and should be specified explicitly if appropriate.
Further details are outlined in the respective database sections of the Orchestrate
Operators Reference which is part of the Orchestrate OEM documentation.

2.13 Database Sparse Lookup vs. Join


Data read by any database stage can serve as the reference input to a Lookup
operation. By default, this reference data is loaded into memory like any other reference
link (Normal Lookup).
When directly connected as the reference link to a Lookup stage, both DB2/UDB
Enterprise and Oracle Enterprise stages allow the lookup type to be changed to
Sparse, sending individual SQL statements to the reference database for each
incoming Lookup row. Sparse Lookup is only available when the database stage is
directly connected to the reference link, with no intermediate stages.
IMPORTANT: The individual SQL statements required by a Sparse Lookup are
an expensive operation from a performance perspective. In most cases, it is
faster to use a DataStage JOIN stage between the input and DB2 reference data
than it is to perform a Sparse Lookup.
For scenarios where the number of input rows is significantly smaller (eg. 1:100 or more)
than the number of reference rows in an Oracle table, a Sparse Lookup may be
appropriate.

2.14 Oracle Database Guidelines


2.14.1 Proper Import of Oracle Column Definitions (Schema)
DataStage EE always uses the Oracle table definition, regardless of explicit job design
metadata (Data Type, Nullability, etc.).
IMPORTANT: Always import Oracle table definitions using the orchdbutil option
(available in v6.0.1 or later) of DataStage Designer to avoid unexpected default data
type conversions.
2.14.2 Reading from Oracle in Parallel
By default, the Oracle Enterprise stage reads sequentially from its source table or query.
Setting the "partition table" option to the specified table will enable parallel extracts from an
Oracle source. The underlying Oracle table does not have to be partitioned for parallel
read within DataStage EE.

It is important to note that certain types of queries cannot run in parallel. Examples
include:
- queries containing a GROUP BY clause that are also hash partitioned on the
same field
- queries performing a non-collocated join (a SQL JOIN between two tables
that are not stored in the same partitions with the same partitioning strategy)
2.14.3 Oracle Load Options
When writing to an Oracle table (using Write Method = Load), Parallel Extender uses the
Parallel Direct Path Load method. When using this method, the Oracle stage cannot
write to a table that has indexes (including indexes automatically generated by Primary
Key constraints) on it unless you specify the Index Mode option (maintenance, rebuild).
Setting the environment variable $APT_ORACLE_LOAD_OPTIONS to
OPTIONS (DIRECT=TRUE, PARALLEL=FALSE) also allows loading of indexed
tables without index maintenance. In this instance, the Oracle load will be done
sequentially.
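A sketch of that setting as it might be supplied in the environment (or, equivalently, as an environment-variable job parameter):

    # Direct-path load into an indexed table; the load will run sequentially.
    export APT_ORACLE_LOAD_OPTIONS='OPTIONS(DIRECT=TRUE, PARALLEL=FALSE)'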
The Upsert Write Method can be used to insert rows into a target Oracle table without
bypassing indexes or constraints. In order to automatically generate the SQL required by
the Upsert method, the key column(s) must be identified using the check boxes in the
column grid.

3 Tips for Debugging Enterprise Edition Jobs


There are a number of tools available to debug DataStage EE jobs. The general process
for debugging a job is:
Check the DataStage job log for warnings. These may indicate an underlying
logic problem or unexpected data type conversion. When a fatal error occurs, the
log entry is sometimes preceded by a warning condition.

Enable the Job Monitoring Environment Variables detailed in section 1.2.

Use the Data Set Management tool (available in the Tools menu of DataStage
Designer or DataStage Manager) to examine the schema, look at row counts,
and to manage source or target Parallel Data Sets.

For flat (sequential) sources and targets:


o To display the actual contents of any file (including embedded control
characters or ASCII NULLs), use the UNIX command
od -xc
o To display the number of lines and characters in a specified ASCII text file,
use the UNIX command
wc -lc [filename]
Dividing the total number of characters by the number of lines provides an audit to
ensure all rows are the same length.
NOTE: The wc command counts UNIX line delimiters, so if the file has any
binary columns, this count may be incorrect.
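
A small sketch of that audit; the file name and the expected 80-byte record length are hypothetical:

    # Hypothetical fixed-length file: 80 data bytes plus a newline per row.
    wc -lc customers.dat
    #   125000  10125000 customers.dat
    # 10125000 characters / 125000 lines = 81 bytes per row (80 data + 1 newline),
    # so all rows appear to be the same length.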

Use $OSH_PRINT_SCHEMAS to verify that the job's runtime schemas match what the
job developer expected in the design-time column definitions.

Examine the score dump (placed in the DataStage job log when $APT_DUMP_SCORE
is enabled).

3.1 Reading a Score Dump

When attempting to understand an EE flow, the first task is to examine the score dump
which is generated when you set APT_DUMP_SCORE=1 in your environment. A score
dump includes a variety of information about a flow, including how composite operators
and shared containers break down; where data is repartitioned and how it is
repartitioned; which operators, if any, have been inserted by EE; what degree of
parallelism each operator runs with; and exactly which nodes each operator runs on.
Also available is some information about where data may be buffered.


The following score dump shows a flow with a single dataset, which has a hash
partitioner that partitions on key field a. It shows three stages: Generator, Sort (tsort) and
Peek. The Peek and Sort stages are combined; that is, they have been optimized into
the same process. All stages in this flow are running on one node. The job runs 3
processes on 2 nodes.
##I TFSC 004000 14:51:50(000) <main_program>
This step has 1 dataset:
ds0: {op0[1p] (sequential generator)
eOther(APT_HashPartitioner { key={ value=a }
})->eCollectAny
op1[2p] (parallel APT_CombinedOperatorController:tsort)}
It has 2 operators:
op0[1p] {(sequential generator)
on nodes (
lemond.torrent.com[op0,p0]
)}
op1[2p] {(parallel APT_CombinedOperatorController:
(tsort)
(peek)
)on nodes (
lemond.torrent.com[op1,p0]
lemond.torrent.com[op1,p1]
)}
In a score dump, there are three areas to investigate:
Are there sequential stages?
Is needless repartitioning occurring?
In a cluster, are the computation-intensive stages shared evenly across all
nodes?

3.2 Partitioner and Sort Insertion

Partitioner and sort insertion are two processes that can insert additional components
into the work flow. Because these processes, especially sort insertion, can be
computationally expensive, understanding the score dump can help a user detect any
superfluous sorts or partitioners.
EE automatically inserts partitioner and sort components in the work flow to optimize
performance. This makes it possible for users to write correct data flows without having
to deal directly with issues of parallelism.
However, there are some situations where these features can be a hindrance. Pre-sorted
data, coming from a source other than a dataset, must be explicitly marked as
sorted, using the "Don't Sort, Already Sorted" key property in the Sort stage. This same
mechanism can be used to override sort insertion on any specific link. Partitioner
insertion may be disabled on a per-link basis by specifying SAME partitioning on the
appropriate link. Orchestrate users accomplish this by inserting same partitioners.


In some cases, setting $APT_SORT_INSERTION_CHECK_ONLY=1 may improve
performance if the data is pre-partitioned or pre-sorted but EE does not know this. With
this setting, EE still inserts sort stages, but instead of actually sorting the data, they
verify that the incoming data is sorted correctly. If the data is not correctly sorted, the job
will abort.
As a last resort, $APT_NO_PART_INSERTION=1 and
$APT_NO_SORT_INSERTION=1 can be used to disable the two features on a flow-wide basis. It is generally advised that both partitioner insertion and sort insertion be left
alone by the average user, and that more experienced users carefully analyze the score
to determine if sorts or partitioners are being inserted sub-optimally.
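
A sketch of how these variables might be applied while investigating a job whose input is already partitioned and sorted; set them per job (for example as environment-variable job parameters) rather than project-wide:

    # Verify instead of re-sort: inserted tsort operators only check that incoming data is sorted.
    export APT_SORT_INSERTION_CHECK_ONLY=1

    # Last resort only -- disable automatic partitioner and sort insertion for the whole flow.
    # export APT_NO_PART_INSERTION=1
    # export APT_NO_SORT_INSERTION=1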

4 Performance Tips for Job Design


- Remove unneeded columns as early as possible within the job flow; every
additional unused column requires additional buffer memory, which can impact
performance (it also makes each transfer of a record from one stage to the next
more expensive).
o When reading from database sources, use a select list to read needed
columns instead of the entire table (if possible)
o To ensure that columns are actually removed using a stage's Output
Mapping, disable runtime column propagation for that column.

- Always specify a maximum length for Varchar columns. Unbounded strings
(Varchars without a maximum length) can have a significant negative performance
impact on a job flow. There are limited scenarios when the memory overhead of
handling large Varchar columns would dictate the use of unbounded strings. For
example:
o Varchar columns of a large (eg. 32K) maximum length that are rarely
populated
o Varchar columns of a large maximum length with highly varying data sizes
Placing unbounded columns at the end of the schema definition may improve
performance.

- In DataStage v7.0 and earlier, limit the use of variable-length records within a flow.
Depending on the number of variable-length columns, it may be beneficial to convert
incoming records to fixed-length types at the start of a job flow, and trim to
variable-length at the end of the flow before writing to a target database or flat file
(using fixed-length records can dramatically improve performance). DataStage v7.01 and later
implement internal performance optimizations for variable-length columns that
specify a maximum length.

- Avoid type conversions if possible.
o Be careful to use the proper data type from the source (especially Oracle) in the
EE job design. Enable $OSH_PRINT_SCHEMAS to verify that the runtime schema
matches the job design column definitions.
o Verify that the data type of defined Transformer stage variables matches the
expected result type.

- Minimize the number of Transformers. Where appropriate, use other stages (eg.
Copy, Filter, Switch, Modify) instead of the Transformer.

- NEVER use the BASIC Transformer in large-volume data flows. Instead, user-defined
functions and routines can expand the capabilities of the parallel Transformer.


- Buildops should be used instead of Transformers in the handful of scenarios where
complex reusable logic is required, or where existing Transformer-based job flows do
not meet performance requirements.

- Minimize and combine use of Sorts where possible.
o It is sometimes possible to re-arrange the order of business logic within a job
flow to leverage the same sort order, partitioning, and groupings. If data has
already been partitioned and sorted on a set of key columns, specifying the
"don't sort, previously sorted" option for those key columns in the Sort stage
will reduce the cost of sorting and take greater advantage of pipeline
parallelism.
o When writing to parallel datasets, sort order and partitioning are preserved.
When reading from these datasets, try to maintain this sorting if possible by
using SAME partitioning.
o The stable sort option is much more expensive than non-stable sorts, and
should only be used if there is a need to maintain row order except as
needed to perform the sort.
o Performance of individual sorts can be improved by increasing the memory
usage per partition using the Restrict Memory Usage (MB) option of the
standalone Sort stage. The default setting is 20MB per partition. Note that
sort memory usage can only be specified for standalone Sort stages; it
cannot be changed for inline (on a link) sorts.

5 Performance Monitoring and Tuning


5.1 The Job Monitor

The Job Monitor provides a useful snapshot of a job's performance at a moment of
execution, but does not provide thorough performance metrics. That is, a Job Monitor
snapshot should not be used in place of a full run of the job, or a run with a sample set of
data. Due to buffering and to some job semantics, a snapshot image of the flow may not
be a representative sample of the performance over the course of the entire job.
The CPU summary information provided by the Job Monitor is useful as a first
approximation of where time is being spent in the flow. However, it does not include
operators that are inserted by EE. Such operators include sorts which were not explicitly
included and the suboperators of composite operators. The Job Monitor also does not
monitor sorts on links. For these components, the score dump can be of assistance. See
Reading a Score Dump in section 10.1.
A worst-case scenario occurs when a job flow reads from a dataset, and passes
immediately to a sort on a link. The job will appear to hang, when, in fact, rows are being
read from the dataset and passed to the sort.
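
For convenience, the score dump mentioned above is normally produced by setting an
environment variable before the run; the variable named below is assumed to be the
setting described in section 10.1:

export APT_DUMP_SCORE=1    # write the job score (inserted sorts/buffers, combined
                           # operators, partitioning) to the job log at startup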

5.2 OS/RDBMS-Specific Tools

Each OS and RDBMS has its own set of tools which may be useful in performance
monitoring. Talking to the system administrator or DBA may provide some useful
monitoring strategies.

5.3 Obtaining Operator Run-Time Information

Setting $APT_PM_PLAYER_TIMING=1 provides information for each stage in the DataStage
job log. For example:
##I TFPM 000324 08:59:32(004) <generator,0> Calling runLocally: step=1, node=rh73dev04,
op=0, ptn=0
##I TFPM 000325 08:59:32(005) <generator,0> Operator completed. status: APT_StatusOk
elapsed: 0.04 user: 0.00 sys: 0.00 suser: 0.09 ssys: 0.02 (total CPU: 0.11)
##I TFPM 000324 08:59:32(006) <peek,0> Calling runLocally: step=1, node=rh73dev04, op=1,
ptn=0
##I TFPM 000325 08:59:32(012) <peek,0> Operator completed. status: APT_StatusOk elapsed:
0.01 user: 0.00 sys: 0.00 suser: 0.09 ssys: 0.02 (total CPU: 0.11)
##I TFPM 000324 08:59:32(013) <peek,1> Calling runLocally: step=1, node=rh73dev04a, op=1,
ptn=1
##I TFPM 000325 08:59:32(019) <peek,1> Operator completed. status: APT_StatusOk elapsed:
0.00 user: 0.00 sys: 0.00 suser: 0.09 ssys: 0.02 (total CPU: 0.11)

This output shows that each partition of each operator has consumed about one tenth of
a second of CPU time during its runtime portion. In a real-world flow, we'd see many
more operators and partitions.
It can often be very useful to see how much CPU each operator, and each partition of
each component, is using. If one partition of an operator is using significantly more CPU
than others, it may mean the data is partitioned in an unbalanced way, and that
repartitioning, or choosing different partitioning keys, might be a useful strategy.
If one operator is using a much larger portion of the CPU than others, it may be an
indication that there is a problem in your flow. Common sense is generally required here;
for example, a sort is going to use dramatically more CPU time than a copy. This,
however, gives you a sense of which operators are using more of the CPU; and when
combined with other metrics presented in this document, the information can be very
enlightening.
Setting $APT_DISABLE_COMBINATION=1, which globally disables stage combination, may
be useful in some situations to get finer-grained information as to which operators are
using up CPU cycles. Be aware, however, that setting this flag changes the
performance behavior of your flow, so it should be done with care.
Unlike the Job Monitor CPU percentages, setting $APT_PM_PLAYER_TIMING provides
timings on every operator within the flow.
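
As a sketch, both settings discussed in this section can be enabled for a test run from the
shell; whether they are set in dsenv or as job-level environment variables is an assumption
and depends on the installation:

export APT_PM_PLAYER_TIMING=1      # per-operator, per-partition CPU timing in the job log
export APT_DISABLE_COMBINATION=1   # optional: disable operator combination for
                                   # finer-grained (but slower) timing detail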

5.4 Selectively Rewriting the Flow

One of the most useful mechanisms you can use to determine what is causing
bottlenecks in your flow is to isolate sections of the flow by rewriting portions of it to
exclude stages from the set of possible causes. The goal of modifying the flow is to see
whether the modified flow runs noticeably faster than the original flow. If the flow is running
at roughly an identical speed, change more of the flow.
While editing a flow for testing, it is important to keep in mind that removing one operator
might have unexpected effects on the flow. Comparing the score dump between runs is
useful before concluding what has made the performance difference.
When modifying the flow, take care not to introduce new performance problems. For
example, adding a persistent dataset to a flow introduces disk contention with any other
datasets being read. This is rarely a problem, but it might be significant in some cases.
Reading and writing data are two obvious places to be aware of potential performance
bottlenecks. Changing a job to write into a Copy stage with no outputs discards the data
(see the sketch at the end of this section). Keep the degree of parallelism the same, with
a nodemap if necessary. Similarly,
landing any read data to a dataset can be helpful if the point of origin of the data is a flat
file or RDBMS.
This pattern should be followed, removing any potentially suspicious stages while trying
to keep the rest of the flow intact. Removing any customer-created operators or
sequence operators should be at the top of the list. Much work has gone into the latest
7.0 release to improve Transformer performance.
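
As an illustration of the Copy-stage technique described above, the osh fragment below
sketches a flow whose database write has been replaced by a copy operator with no
outputs, so the rows are simply discarded; the file name and schema are hypothetical:

# Hypothetical test harness: read the source as usual, keep the rest of the
# flow intact, and discard the output instead of writing it to the target.
osh "import
       -file /tmp/customers.dat
       -schema record(cust_id:int32; cust_name:string[30];)
     | copy"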

5.5 Eliminating Repartitions

Superfluous repartitioning should be eliminated. Due to operator or license limitations
(import, export, RDBMS operators, SAS operators, and so on), some operators run with
a degree of parallelism that is different from the default degree of parallelism. Some of
this cannot be eliminated, but understanding where, when, and why these repartitions
occur is important for understanding the flow. Repartitions are especially expensive when
the data is being repartitioned on an MPP, where significant network traffic is generated.
Sometimes a repartition can be moved further upstream in order to eliminate a
previous, implicit repartition. Imagine a flow that reads from Oracle, does some
processing, and is then hash-partitioned and joined with another dataset. There might be
a repartition after the Oracle read stage and then another for the hash, when only one
repartitioning is ever necessary.
Similarly, a nodemap on a stage may prove useful for eliminating repartitions. In this
case, a transform between a DB2 read and a DB2 write might need to have a nodemap
placed on it to force it to run with the same degree of parallelism as the two DB2 stages
in order to avoid two repartitions.

5.6 Ensuring Data is Evenly Partitioned

Due to the nature of EE, the entire flow runs as slow as its slowest component. If data is
not evenly partitioned, the slowest component is often a result of data skew. If one
partition has ten records and another has ten million, EE simply cannot make ideal use
of the resources.
Setting $APT_RECORD_COUNTS=1 displays the number of records per partition for each
component. Ideally, counts across all partitions should be roughly equal. Differences in
data volumes between keys often skew this data slightly, but any significant (over 5 or
10%) difference in volume should be a warning sign that alternate keys or an alternate
partitioning strategy might be required.
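
A minimal sketch of enabling this check from the shell (where the variable is set is an
assumption; it can equally be supplied as a job parameter):

export APT_RECORD_COUNTS=1   # log the record count for each partition of each operator
                             # so that skewed partitions can be spotted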

5.7 Buffering for All Versions

Buffer operators are introduced in a flow anywhere that a directed cycle exists or
anywhere that the user or operator requests them using the C++ API or osh.
The default goal of the buffer operator, on a specific link, is to make the source stage's
output rate match the consumption rate of the target stage. In any flow where this default
behavior is not the right one for the stages involved, performance degrades. For example,
a target stage may have two inputs and wait until it has exhausted one of those inputs
before reading from the next. Identifying these spots in the flow requires an understanding
of how each stage involved reads its records, and they are often only found by empirical
observation.
A buffer operator tuning issue is likely when a flow runs slowly as one massive flow, yet
each component runs quickly when the flow is broken up. For example, replacing an
Oracle write with a Copy stage vastly improves performance, and writing that same data
to a dataset and then loading it with an Oracle write also runs quickly; but when the two
are put together, performance grinds to a crawl.
Section 5.8.4, Buffering, details specific common buffer operator configurations in the
context of resolving various bottlenecks. For more information on buffering, see
Appendix A, "Data Set Buffering", in the Orchestrate 7.0 User Guide.

5.8 Resolving Bottlenecks

5.8.1 Variable Length Data
In releases prior to v7.01, using fixed-length records can dramatically improve
performance; therefore, limit the use of variable-length records within a flow. This is no
longer an issue in 7.01 and later releases.
5.8.2 Combinable Operators
Combined operators generally improve performance at least slightly; and in some cases,
the performance improvement may be dramatic. However, there may be situations
where combining operators actually hurts performance. Identifying such operators can
be difficult without trial and error.
The most common situation arises when multiple operators, such as Sequential File
(import and export) and Sort, are combined and are performing disk I/O. In I/O-bound
situations, turning off combination for these specific operators may result in a
performance increase.
This is a new option in the Advanced stage properties of DataStage Designer version
7.x. Combinable operators often provide a dramatic performance increase when a large
number of variable length fields are used in a flow.
To experiment with this, try disabling the combination of any stages that perform I/O and
any sort stages. $APT_DISABLE_COMBINATION=1 globally disables operator combining.
5.8.3 Disk I/O
Total disk throughput is often a fixed quantity that EE has no control over. There are,
however, some settings and rules of thumb that are often beneficial:

If data is going to be read back in, in parallel, it should never be written as a
sequential file. A dataset or fileset is a much more appropriate format.

When importing fixed-length data, the Number of Readers Per Node option on
the Sequential File stage can often provide a noticeable performance boost as
compared with a single process reading the data. However, if there is a need to
assign a number in source file row order, the -readers option cannot be used
because it opens multiple streams at evenly-spaced offsets in the source file.
Also, this option can only be used for fixed-length sequential files.

Some disk arrays have read-ahead caches that are only effective when data is
read repeatedly in like-sized chunks. $APT_CONSISTENT_BUFFERIO_SIZE=n forces
import to read data in chunks which are size n or a multiple of n.

Memory mapped IO is, in many cases, a big performance win; however, in
certain situations, such as a remote disk mounted via NFS, it may cause
significant performance problems. APT_IO_NOMAP=1 and
APT_BUFFERIO_NOMAP=1 turn off this feature and sometimes affect
performance. AIX and HP-UX default to NOMAP; APT_IO_MAP=1 and
APT_BUFFERIO_MAP=1 can be used to turn memory mapped IO on for
these platforms. (See the sketch after this list.)
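
The sketch below shows how the disk and memory-mapped-IO settings named in this list
might be applied for a test run; the chunk size is illustrative only and should be matched
to the read-ahead cache of the disk array in question:

export APT_CONSISTENT_BUFFERIO_SIZE=1048576   # read data in 1MB-aligned chunks (illustrative)
export APT_IO_NOMAP=1                         # turn off memory mapped IO, e.g. for NFS-mounted disks
export APT_BUFFERIO_NOMAP=1                   # (both NOMAP variables are described in the list above)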

5.8.4 Buffering
Buffer operators are intended to slow down their input to match the consumption rate of
the output. When the target stage reads very slowly, or not at all, for a length of time,
upstream stages begin to slow down. This can cause a noticeable performance loss if
the optimal behavior of the buffer operator is something other than rate matching.
By default, the buffer operator has a 3MB in-memory buffer. Once that buffer reaches
two-thirds full, the stage begins to push back on the rate of the upstream stage. Once
the 3MB buffer is filled, data is written to disk in 1MB chunks.
In the following discussions, settings in all caps are environment variables and affect all
buffer operators. Settings in all lowercase are buffer-operator options and can be set per
buffer operator.

Page 29 of 30

Data Stage Best Practices & Performance Tuning


In most cases, the easiest way to tune the buffer operator is to eliminate the push back
and allow it to buffer the data to disk as necessary. $APT_BUFFER_FREE_RUN=n or
bufferfreerun does this. The buffer operator reads N * max_memory (3MB by default)
bytes before beginning to push back on the upstream stage. If there is enough disk space
to buffer large amounts of data, this usually fixes any egregious slow-down issues caused
by the buffer operator.
If there is a significant amount of memory available on the machine, increasing the
maximum in-memory buffer size is likely to be very useful if the buffer operator is
causing any disk IO. $APT_BUFFER_MAXIMUM_MEMORY or maximummemorybuffersize is
used to do this. It defaults to roughly 3000000 (3MB).
For systems where small to medium bursts of IO are not desirable, the 1MB write-to-disk
chunk size may be too small. $APT_BUFFER_DISK_WRITE_INCREMENT or
diskwriteincrement controls this and defaults to roughly 1000000 (1MB). This setting
may not exceed max_memory * 2/3.
Finally, in a situation where a large, fixed buffer is needed within the flow,
queueupperbound (no environment variable exists) can be set equal to max_memory to
force a buffer of exactly max_memory bytes. Such a buffer blocks an upstream stage
(until data is read by the downstream stage) once its buffer has been filled, so this
setting should be used with extreme caution. This setting is rarely necessary to achieve
good performance, but is useful when there is a large variability in the response time of
the data source or data target. No environment variable is available for this flag; it can
only be set at the osh level.
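
A hedged sketch of the environment-variable forms of these settings; the values are
illustrative only and should be sized against the memory and scratch disk actually
available per partition:

export APT_BUFFER_FREE_RUN=10                      # buffer up to 10 x max_memory before pushing back
export APT_BUFFER_MAXIMUM_MEMORY=50000000          # ~50MB in-memory buffer per buffer operator
export APT_BUFFER_DISK_WRITE_INCREMENT=10000000    # write to disk in ~10MB chunks
                                                   # (must not exceed max_memory * 2/3)
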
For releases 7.0.1 and beyond, per-link buffer settings are available in EE. They appear
on the Advanced tab of the Input & Output tabs. The settings saved on an Output tab are
shared with the Input tab of the next stage and vice versa, like Columns.
