DataStage
Best Practices & Performance Tuning
DataStage
1 Environment Variable Settings
  1.1 Environment Variable Settings for All Jobs
  1.2 Additional Environment Variable Settings
2 Configuration Files
  2.1 Logical Processing Nodes
  2.2 Optimizing Parallelism
  2.3 Configuration File Examples
    2.3.1 Example for Any Number of CPUs and Any Number of Disks
    2.3.2 Example that Reduces Contention
    2.3.3 Smaller Configuration Example
  2.4 Sequential File Stages (Import and Export)
    2.4.1 Improving Sequential File Performance
    2.4.2 Partitioning Sequential File Reads
    2.4.3 Sequential File (Export) Buffering
    2.4.4 Reading from and Writing to Fixed-Length Files
    2.4.5 Reading Bounded-Length VARCHAR Columns
  2.5 Transformer Usage Guidelines
    2.5.1 Choosing Appropriate Stages
    2.5.2 Transformer NULL Handling and Reject Link
    2.5.3 Transformer Derivation Evaluation
    2.5.4 Conditionally Aborting Jobs
  2.6 Lookup vs. Join Stages
  2.7 Capturing Unmatched Records from a Join
  2.8 The Aggregator Stage
  2.9 Appropriate Use of SQL and DataStage Stages
  2.10 Optimizing Select Lists
  2.11 Designing for Restart
  2.12 Database OPEN and CLOSE Commands
  2.13 Database Sparse Lookup vs. Join
  2.14 Oracle Database Guidelines
    2.14.1 Proper Import of Oracle Column Definitions (Schema)
    2.14.2 Reading from Oracle in Parallel
    2.14.3 Oracle Load Options
3 Tips for Debugging Enterprise Edition Jobs
  3.1 Reading a Score Dump
  3.2 Partitioner and Sort Insertion
4 Performance Tips for Job Design
5 Performance Monitoring and Tuning
  5.1 The Job Monitor
  5.2 OS/RDBMS-Specific Tools
  5.3 Obtaining Operator Run-Time Information
  5.4 Selectively Rewriting the Flow
  5.5 Eliminating Repartitions
  5.6 Ensuring Data is Evenly Partitioned
  5.7 Buffering for All Versions
  5.8 Resolving Bottlenecks
1 Environment Variable Settings

1.1 Environment Variable Settings for All Jobs

Ascential recommends the following environment variable settings for all Enterprise
Edition jobs. These settings can be made at the project level, or may be set on an
individual basis within the properties for each job.

- $APT_CONFIG_FILE [filepath]: path to the parallel configuration file used at runtime.
- $APT_DUMP_SCORE [1]: writes the parallel job score to the job log.
- $OSH_ECHO [1]: echoes the generated osh script to the job log.
- $APT_RECORD_COUNTS [1]: reports per-partition record counts for each operator to the job log.
- $APT_PM_SHOW_PIDS [1]: logs the process ID of each player process.
- $APT_BUFFER_MAXIMUM_TIMEOUT [seconds]: maximum buffer delay, in seconds, before a retry.
- $APT_THIN_SCORE [1]: reduces the memory used by the job score.
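These variables are normally set through DataStage Administrator (project level) or in the job properties dialog. For jobs driven from a UNIX shell, a minimal sketch of equivalent exports might look like the following; the configuration file path and the on/off values are illustrative, not taken from this document:

# Sketch only: the path and values below are examples, not recommendations.
export APT_CONFIG_FILE=/opt/datastage/configs/4node.apt   # hypothetical path to a configuration file
export APT_DUMP_SCORE=1            # write the job score to the log
export OSH_ECHO=1                  # echo the generated osh script to the log
export APT_RECORD_COUNTS=1         # per-partition record counts for each operator
export APT_PM_SHOW_PIDS=1          # show player process IDs in the log
export APT_BUFFER_MAXIMUM_TIMEOUT=1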
1.2 Additional Environment Variable Settings

The following additional environment variables are useful depending on the stages and
databases used in a particular job flow.

Sequential File stage:
- $APT_EXPORT_FLUSH_COUNT [nrows]: how frequently (in rows) the Sequential File stage (export) flushes its buffer to disk.
- $APT_IMPORT_BUFFER_SIZE and $APT_EXPORT_BUFFER_SIZE [Kbytes]: per-file read and write buffer sizes for import and export.
- $APT_CONSISTENT_BUFFERIO_SIZE [bytes]: forces reads to be performed in chunks of the given size, or a multiple of it.
- $APT_DELIMITED_READ_SIZE [bytes]: block size used when reading ahead in delimited files.

Oracle:
- $ORACLE_HOME [path]: Oracle installation directory.
- $ORACLE_SID [sid]: Oracle instance identifier.
- $APT_ORACLE_LOAD_OPTIONS [SQL*Loader options]: options passed to SQL*Loader when loading Oracle tables.
- $APT_ORA_IGNORE_CONFIG_FILE_PARALLELISM
- $APT_ORA_WRITE_FILES [filepath]
- $DS_ENABLE_RESERVED_CHAR_CONVERT: enables handling of database column names containing the reserved characters # and $.

Job monitoring:
- $APT_MONITOR_TIME [seconds]: in v7 and later, specifies the time interval (in seconds) for generating job monitor information at runtime. To enable size-based job monitoring, unset this environment variable and set $APT_MONITOR_SIZE below.
- $APT_MONITOR_SIZE [rows]: specifies the number of new rows processed before job monitor information is generated (size-based monitoring).
- $APT_NO_JOBMON: disables job monitoring for parallel jobs.
- $APT_RECORD_COUNTS: prints per-partition record counts for each operator to the job log (see section 1.1).
2 Configuration Files
The configuration file tells DataStage Enterprise Edition how to exploit underlying system
resources (processing, temporary storage, and dataset storage). In more advanced
environments, the configuration file can also define other resources such as databases
and buffer storage. At runtime, EE first reads the configuration file to determine what
system resources are allocated to it, and then distributes the job flow across these
resources.
When you modify the system, by adding or removing nodes or disks, you must modify
the DataStage EE configuration file accordingly. Since EE reads the configuration file
every time it runs a job, it automatically scales the application to fit the system without
having to alter the job design.
There is not necessarily one ideal configuration file for a given system because of the
high variability between the way different jobs work. For this reason, multiple
configuration files should be used to optimize overall throughput and to match job
characteristics to available hardware resources. At runtime, the configuration file is
specified through the environment variable $APT_CONFIG_FILE.
2.1 Logical Processing Nodes
The configuration file defines one or more EE processing nodes on which parallel jobs
will run. EE processing nodes are a logical rather than a physical construct. For this
reason, it is important to note that the number of processing nodes does not necessarily
correspond to the actual number of CPUs in your system.
Within a configuration file, the number of processing nodes defines the degree of
parallelism and resources that a particular job will use to run. It is up to the UNIX
operating system to actually schedule and run the processes that make up a DataStage
job across physical processors. A configuration file with a larger number of nodes
generates a larger number of processes and therefore uses more system resources
(memory and, potentially, CPU) to run the same job.
2.2 Optimizing Parallelism
2.3 Configuration File Examples
Given the large number of considerations for building a configuration file, where do you
begin? For starters, the default configuration file (default.apt) created when DataStage is
installed is appropriate for only the most basic environments.
The default configuration file has the following characteristics:
- number of nodes = number of physical CPUs
- disk and scratchdisk storage use subdirectories within the DataStage install
filesystem
You should create and use a new configuration file that is optimized to your hardware
and file systems. Because different job flows have different needs (CPU-intensive?
Memory-intensive? Disk-intensive? Database-intensive? Sorts? A need to share resources
with other applications?), no single configuration file is likely to suit them all; create
several and select the appropriate one for each job through $APT_CONFIG_FILE.
2.3.1 Example for Any Number of CPUs and Any Number of Disks
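For a hypothetical server named fastone with four filesystems (/fs0 through /fs3), a configuration of this kind might look roughly like the following sketch. The host name, directory paths, and node count are illustrative only; adapt them to your own system.

{
node "node1" {
fastname "fastone"
pools ""
resource disk "/fs0/ds/data" {pools ""}
resource disk "/fs1/ds/data" {pools ""}
resource disk "/fs2/ds/data" {pools ""}
resource disk "/fs3/ds/data" {pools ""}
resource scratchdisk "/fs0/ds/scratch" {pools ""}
resource scratchdisk "/fs1/ds/scratch" {pools ""}
}
node "node2" {
fastname "fastone"
pools ""
resource disk "/fs1/ds/data" {pools ""}
resource disk "/fs2/ds/data" {pools ""}
resource disk "/fs3/ds/data" {pools ""}
resource disk "/fs0/ds/data" {pools ""}
resource scratchdisk "/fs1/ds/scratch" {pools ""}
resource scratchdisk "/fs2/ds/scratch" {pools ""}
}
node "node3" {
fastname "fastone"
pools ""
resource disk "/fs2/ds/data" {pools ""}
resource disk "/fs3/ds/data" {pools ""}
resource disk "/fs0/ds/data" {pools ""}
resource disk "/fs1/ds/data" {pools ""}
resource scratchdisk "/fs2/ds/scratch" {pools ""}
resource scratchdisk "/fs3/ds/scratch" {pools ""}
}
node "node4" {
fastname "fastone"
pools ""
resource disk "/fs3/ds/data" {pools ""}
resource disk "/fs0/ds/data" {pools ""}
resource disk "/fs1/ds/data" {pools ""}
resource disk "/fs2/ds/data" {pools ""}
resource scratchdisk "/fs3/ds/scratch" {pools ""}
resource scratchdisk "/fs0/ds/scratch" {pools ""}
}
} /* end of entire config */

Each node lists every filesystem as a disk resource, but in a different order; that rotation is the property the discussion below relies on.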
The configuration file above follows a "give every node all the disks" pattern, albeit in
different orders to minimize I/O contention. This configuration method works well
when the job flow is complex enough that it is difficult to determine and precisely plan for
good I/O utilization.
Within each node, EE does not stripe the data across multiple filesystems. Rather, it
fills the disk and scratchdisk filesystems in the order specified in the configuration file. In
the 4-node example above, the order of the disks is purposely shifted for each node, in
an attempt to minimize I/O contention.
2.3.2 Example that Reduces Contention
The alternative to the first configuration method is more careful planning of the I/O
behavior to reduce contention. You can imagine this could be hard given our
hypothetical 6-way SMP with 4 disks because setting up the obvious one-to-one
correspondence doesn't work. Doubling up some nodes on the same disk is unlikely to
be good for overall performance since we create a hotspot.
We could give every CPU two disks and rotate them around, but that would be little
different from the previous strategy. So, let's imagine a less constrained environment
with two additional disks:
computer host name fastone
6 CPUs
6 separate file systems on 6 drives named /fs0, /fs1, /fs2, /fs3, /fs4, /fs5
Now a configuration file for this environment might look like this:
{
node "n0" {
pools ""
fastname "fastone"
resource disk "/fs0/ds/data" {pools ""}
resource scratchdisk "/fs0/ds/scratch" {pools ""}
}
node "node2" {
fastname "fastone"
pools ""
resource disk "/fs1/ds/data" {pools ""}
resource scratchdisk "/fs1/ds/scratch" {pools ""}
}
node "node3" {
fastname "fastone"
pools ""
resource disk "/fs2/ds/data" {pools ""}
resource scratchdisk "/fs2/ds/scratch" {pools ""}
}
node "node4" {
fastname "fastone"
pools ""
resource disk "/fs3/ds/data" {pools ""}
resource scratchdisk "/fs3/ds/scratch" {pools ""}
}
node "node5" {
fastname "fastone"
pools ""
resource disk "/fs4/ds/data" {pools ""}
Page 10 of 30
While this is the simplest scenario, it is important to realize that no single player, stage,
or operator instance on any one partition can go faster than the single disk it has access
to.
You could combine strategies by adding in a node pool where disks have a one-to-one
association with nodes. These nodes would then not be in the default node pool, but a
special one that you would specifically assign to stage / operator instances.
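As a rough sketch of that idea (the pool name, node name, and paths below are hypothetical, not from this document), such a node could be declared outside the default pool and placed in a named pool instead:

node "node7" {
fastname "fastone"
pools "diskbound"                          /* not in the default "" pool */
resource disk "/fs5/ds/data" {pools ""}    /* one-to-one: this node owns /fs5 only */
resource scratchdisk "/fs5/ds/scratch" {pools ""}
}

Only stages whose node pool constraint names "diskbound" would then run on this node; everything else stays on the nodes in the default pool.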
2.3.3 Smaller Configuration Example
Because disk and scratchdisk resources are assigned per node, depending on the total
disk space required to process large jobs, it may be necessary to distribute file systems
across nodes in smaller environments (fewer available CPUs/memory).
Using the above server example, this time with 4 nodes:
computer host name fastone
4 CPUs
8 separate file systems named /fs0 through /fs7
{
node "node1" {
fastname "fastone"
pools ""
resource disk "/fs0/ds/data" {pools ""} /* start with fs0 */
resource disk "/fs4/ds/data" {pools ""}
resource scratchdisk "/fs4/ds/scratch" {pools ""} /* start with fs4 */
resource scratchdisk "/fs0/ds/scratch" {pools ""}
}
node "node2" {
fastname "fastone"
pools ""
resource disk "/fs1/ds/data" {pools ""}
resource disk "/fs5/ds/data" {pools ""}
resource scratchdisk "/fs5/ds/scratch" {pools ""}
resource scratchdisk "/fs1/ds/scratch" {pools ""}
}
node "node3" {
fastname "fastone"
pools ""
resource disk "/fs2/ds/data" {pools ""}
resource disk "/fs6/ds/data" {pools ""}
resource scratchdisk "/fs6/ds/scratch" {pools ""}
resource scratchdisk "/fs2/ds/scratch" {pools ""}
}
node "node4" {
fastname "fastone"
pools ""
resource disk "/fs3/ds/data" {pools ""}
resource disk "/fs7/ds/data" {pools ""}
resource scratchdisk "/fs7/ds/scratch" {pools ""}
resource scratchdisk "/fs3/ds/scratch" {pools ""}
}
} /* end of entire config */
The 4-node example above illustrates another concept in configuration file setup: you
can assign multiple disk and scratchdisk resources to each node.
- Ensure that the different file systems mentioned as the disk and scratchdisk resources hit disjoint sets of spindles, even if they're located on a RAID system.
- Do not trust high-level RAID/SAN monitoring tools, as their cache hit ratios are often misleading.
- Never use NFS file systems for scratchdisk resources. Know what's real and what's NFS: real disks are directly attached, or are reachable over a SAN (storage-area network - dedicated, just for storage, low-level protocols).
- Proper configuration of scratch and resource disks (and the underlying filesystem and physical hardware architecture) can significantly affect overall job performance. Beware of using NFS (and, often, SAN) filesystem space for disk resources. For example, your final result files may need to be written out onto the NFS disk area, but that doesn't mean the intermediate data sets created and used temporarily in a multi-job sequence should use this NFS disk area. It is better to set up a "final" disk pool and constrain the result sequential file or data set to reside there, but let intermediate storage go to local or SAN resources, not NFS.
2.4 Sequential File Stages (Import and Export)
- Don't read from a Sequential File using SAME partitioning! Unless more than one source file is specified, SAME will read the entire file into a single partition, making the entire downstream flow run sequentially (unless it is later repartitioned).
- When multiple files are read by a single Sequential File stage (using multiple files, or by using a File Pattern), each file's data is read into a separate partition. It is important to use ROUND-ROBIN partitioning (or other partitioning appropriate to downstream components) to evenly distribute the data in the flow.
2.4.4 Reading from and Writing to Fixed-Length Files
Particular attention must be taken when processing fixed-length fields using the Sequential File stage:
- If the incoming columns are variable-length data types (e.g. Integer, Decimal, Varchar), the field width column property must be set to match the fixed width of the input column. Double-click on the column number in the grid dialog to set this column property.
- If a field is nullable, you must define the null field value and length in the Nullable section of the column property. Double-click on the column number in the grid dialog to set these properties.
- When writing fixed-length files from variable-length fields (e.g. Integer, Decimal, Varchar), the field width and pad string column properties must be set to match the fixed width of the output column. Double-click on the column number in the grid dialog to set this column property.
- To display each field value during import, use the print_field import property. All import and export properties are listed in chapter 25, Import/Export Properties, of the Orchestrate 7.0 Operators Reference.
2.4.5 Reading Bounded-Length VARCHAR Columns
Care must be taken when reading delimited, bounded-length Varchar columns (Varchars
with the length option set). By default, if the source file has fields with values longer than
the maximum Varchar length, these extra characters will be silently truncated. Starting
with v7.01, the environment variable $APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS
will direct DataStage to reject records with strings longer than their declared maximum
column length.
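Like the variables in section 1, this is usually set in Administrator or in the job properties; as a sketch, a shell-level setting for a command-line run might be (the value 1 is an assumed on/off flag, not taken from this document):

export APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS=1   # reject, rather than truncate, over-length string fields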
2.5 Transformer Usage Guidelines

2.5.1 Choosing Appropriate Stages
The Copy stage should be used instead of a Transformer for simple operations, including:
- Job design placeholder between stages (EE will optimize the Copy out at runtime unless its Force property is set to True)
- Renaming columns
- Dropping columns
- Default type conversions
Note that rename, drop (if runtime column propagation is disabled), and default type
conversion can also be performed by the output mapping tab of any stage.
NEVER use the BASIC Transformer stage in large-volume job flows. Instead,
user-defined functions and routines can expand parallel Transformer capabilities.
2.6 Lookup vs. Join Stages
The Lookup stage is most appropriate when the reference data for all lookup stages in a
job is small enough to fit into available physical memory. Each lookup reference requires
a contiguous block of physical memory. If the datasets are larger than available
resources, the JOIN or MERGE stage should be used.
If the reference to a Lookup is directly from an Oracle table, and the number of input rows
is significantly smaller (e.g. 1:100 or more) than the number of reference rows, a Sparse
Lookup may be appropriate.
2.7 Capturing Unmatched Records from a Join
The Join stage does not provide reject handling for unmatched records (such as in an
Inner Join scenario). If unmatched rows must be captured or logged, an OUTER join
operation must be performed. In an OUTER join scenario, all rows on an outer link (e.g.
Left Outer, Right Outer, or both links in the case of Full Outer) are output regardless of
match on key values.
2.8 The Aggregator Stage
2.9 Appropriate Use of SQL and DataStage Stages
When using relational database sources, there is often a functional overlap between
SQL and DataStage stages. Although it is possible to use either SQL or DataStage to
solve a given business problem, the optimal implementation involves leveraging the
strengths of each technology to provide maximum throughput and developer
productivity.
While there are extreme scenarios when the appropriate technology choice is clearly
understood, there may be gray areas where the decision should be based on factors such
as developer productivity, metadata capture and re-use, and ongoing application
maintenance costs.
The following guidelines can assist with the appropriate use of SQL and DataStage
technologies in a given job flow:
a) When possible, use a SQL filter (WHERE clause) to limit the number of
rows sent to the DataStage job. This minimizes impact on network and
memory resources, and leverages the database capabilities.
b) Use a SQL Join to combine data from tables with a small number of rows
in the same database instance, especially when the join columns are
indexed.
c) When combining data from very large tables, or when the source includes
a large number of database tables, the DataStage EE Sort and Join stages
can be significantly faster than an equivalent SQL join.
3 Tips for Debugging Enterprise Edition Jobs
- Use the Data Set Management tool (available in the Tools menu of DataStage Designer or DataStage Manager) to examine the schema, look at row counts, and manage source or target parallel Data Sets.
- To display the number of lines and characters in a specified ASCII text file, use the UNIX command
  wc -lc [filename]
  Dividing the total number of characters by the number of lines provides an audit to ensure all rows are the same length.
  NOTE: The wc command counts UNIX line delimiters, so if the file has any binary columns, this count may be incorrect.
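For example (the file name and counts below are illustrative, not from this document):

$ wc -lc customers.dat
    5000   405000 customers.dat

405000 characters / 5000 lines = 81 characters per row (80 bytes of data plus the newline), so every row in this hypothetical file is the same length.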
3.1 Reading a Score Dump
When attempting to understand an EE flow, the first task is to examine the score dump
which is generated when you set APT_DUMP_SCORE=1 in your environment. A score
dump includes a variety of information about a flow, including how composite operators
and shared containers break down; where data is repartitioned and how it is
repartitioned; which operators, if any, have been inserted by EE; what degree of
parallelism each operator runs with; and exactly which nodes each operator runs on.
Also available is some information about where data may be buffered.
3.2 Partitioner and Sort Insertion
Partitioner and sort insertion are two processes that can insert additional components
into the work flow. Because these processes, especially sort insertion, can be
computationally expensive, understanding the score dump can help a user detect any
superfluous sorts or partitioners.
EE automatically inserts partitioner and sort components in the work flow to optimize
performance. This makes it possible for users to write correct data flows without having
to deal directly with issues of parallelism.
However, there are some situations where these features can be a hindrance. Pre-sorted
data coming from a source other than a dataset must be explicitly marked as
sorted, using the "Don't Sort, Already Sorted" key property in the Sort stage. This same
mechanism can be used to override sort insertion on any specific link. Partitioner
insertion may be disabled on a per-link basis by specifying SAME partitioning on the
appropriate link. Orchestrate users accomplish this by inserting Same partitioners.
4 Performance Tips for Job Design
- Remove unneeded columns as early as possible within the job flow; every additional unused column requires additional buffer memory, which can impact performance (it also makes each transfer of a record from one stage to the next more expensive).
  o When reading from database sources, use a select list to read only the needed columns instead of the entire table (if possible).
  o To ensure that columns are actually removed using a stage's Output Mapping, disable runtime column propagation for that column.
- In DataStage v7.0 and earlier, limit the use of variable-length records within a flow. Depending on the number of variable-length columns, it may be beneficial to convert incoming records to fixed-length types at the start of a job flow, and trim back to variable-length at the end of the flow before writing to a target database or flat file (using fixed-length records can dramatically improve performance). DataStage v7.01 and later implement internal performance optimizations for variable-length columns that specify a maximum length.
- Minimize the number of Transformers. Where appropriate, use other stages (e.g. Copy, Filter, Switch, Modify) instead of the Transformer.
- NEVER use the BASIC Transformer in large-volume data flows. Instead, user-defined functions and routines can expand the capabilities of the parallel Transformer.
5 Performance Monitoring and Tuning
5.2 OS/RDBMS-Specific Tools
Each OS and RDBMS has its own set of tools which may be useful in performance
monitoring. Talking to the system administrator or DBA may provide some useful
monitoring strategies.
5.3 Obtaining Operator Run-Time Information
This output shows that each partition of each operator has consumed about one tenth of
a second of CPU time during its runtime portion. In a real-world flow, we'd see many
more operators and partitions.
It can often be very useful to see how much CPU each operator, and each partition of
each component, is using. If one partition of an operator is using significantly more CPU
than others, it may mean the data is partitioned in an unbalanced way, and that
repartitioning, or choosing different partitioning keys, might be a useful strategy.
If one operator is using a much larger portion of the CPU than others, it may be an
indication that there is a problem in your flow. Common sense is generally required here;
for example, a sort is going to use dramatically more CPU time than a copy. This,
however, gives you a sense of which operators are using more of the CPU; and when
combined with other metrics presented in this document, the information can be very
enlightening.
Setting $APT_DISABLE_COMBINATION=1, which globally disables stage combination, may
be useful in some situations to get finer-grained information as to which operators are
using up CPU cycles. Be aware, however, that setting this flag changes the
performance behavior of your flow; therefore, this should be done with care.
Unlike the Job Monitor CPU percentages, setting $APT_PM_PLAYER_TIMING provides
timings on every operator within the flow.
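Both settings can be applied per project or per job; for a shell-driven debugging run, a minimal sketch (the values are the usual on/off flags, assumed here rather than quoted from this document) would be:

export APT_DISABLE_COMBINATION=1   # one player process per operator, so CPU time is attributed per operator
export APT_PM_PLAYER_TIMING=1      # log the CPU time consumed by each player process

Remember to remove both settings after the investigation, since they change the runtime behavior of the flow.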
5.4 Selectively Rewriting the Flow
One of the most useful mechanisms you can use to determine what is causing
bottlenecks in your flow is to isolate sections of the flow by rewriting portions of it to
exclude stages from the set of possible causes. The goal of modifying the flow is to see
whether the modified flow runs noticeably faster than the original flow. If the flow is running
at roughly an identical speed, change more of the flow.
While editing a flow for testing, it is important to keep in mind that removing one operator
might have unexpected effects on the flow. Comparing the score dump between runs is
useful before concluding what has made the performance difference.
When modifying the flow, be aware of introducing any new performance problems. For
example, adding a persistent dataset to a flow introduces disk contention with any other
datasets being read. This is rarely a problem, but it might be significant in some cases.
Reading and writing data are two obvious places to be aware of potential performance
bottlenecks. Changing a job to write into a Copy stage with no outputs discards the data.
Keep the degree of parallelism the same, with a nodemap if necessary. Similarly,
landing any read data to a dataset can be helpful if the point of origin of the data is a flat
file or RDBMS.
This pattern should be followed, removing any potentially suspicious stages while trying
to keep the rest of the flow intact. Removing any customer-created operators or
sequence operators should be at the top of the list. Much work has gone into the latest
7.0 release to improve Transformer performance.
5.5 Eliminating Repartitions
5.6 Ensuring Data is Evenly Partitioned
Due to the nature of EE, the entire flow runs only as fast as its slowest component. If data is
not evenly partitioned, the slowest component is often a result of data skew. If one
partition receives a disproportionate share of the data, the operators running in that
partition become the bottleneck for the entire flow.
5.7 Buffering for All Versions
Buffer operators are introduced in a flow anywhere that a directed cycle exists or
anywhere that the user or operator requests them using the C++ API or osh.
The default goal of the buffer operator on a specific link is to make the source stage's
output rate match the consumption rate of the target stage. In any flow where this default
is the wrong behavior for the buffer operator, performance degrades; for example, when
the target stage has two inputs and waits until it has exhausted one of those inputs before
reading from the next. Identifying these spots in the flow requires an understanding of
how each stage involved reads its records, and they are often found only by empirical
observation.
A buffer operator tuning issue exists when a flow runs slowly as one massive flow, but
each component runs quickly when broken up. For example, replacing an Oracle write
with a Copy stage vastly improves performance, and writing that same data to a dataset
and then loading it with the Oracle write also goes quickly; yet when the two are put
together, performance grinds to a crawl.
The Resolving Bottlenecks section below details specific common buffer operator
configurations in the context of resolving various bottlenecks. For more information on
buffering, see Appendix A, Data Set Buffering, in the Orchestrate 7.0 User Guide.
5.8 Resolving Bottlenecks
When importing fixed-length data, the Number of Readers Per Node option on
the Sequential File stage can often provide a noticeable performance boost as
compared with a single process reading the data. However, if there is a need to
assign a number in source file row order, the -readers option cannot be used
because it opens multiple streams at evenly-spaced offsets in the source file.
Also, this option can only be used for fixed-length sequential files.
Some disk arrays have read-ahead caches that are only effective when data is
read repeatedly in like-sized chunks. $APT_CONSISTENT_BUFFERIO_SIZE=n forces
import to read data in chunks which are size n or a multiple of n.
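For instance, to align reads with a hypothetical 1 MB read-ahead cache (the value shown is illustrative, not a recommendation):

export APT_CONSISTENT_BUFFERIO_SIZE=1048576   # read in 1 MB chunks (or multiples of 1 MB)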
5.8.4 Buffering
Buffer operators are intended to slow down their input to match the consumption rate of
the output. When the target stage reads very slowly, or not at all, for a length of time,
upstream stages begin to slow down. This can cause a noticeable performance loss if
the optimal behavior of the buffer operator is something other than rate matching.
By default, the buffer operator has a 3MB in-memory buffer. Once that buffer reaches
two-thirds full, the stage begins to push back on the rate of the upstream stage. Once
the 3MB buffer is filled, data is written to disk in 1MB chunks.
In the following discussions, settings in all caps are environment variables and affect all
buffer operators. Settings in all lowercase are buffer-operator options and can be set per
buffer operator.
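As a hedged illustration, two environment variables of this kind (named here from the Orchestrate documentation rather than from this document) control the in-memory buffer size and the disk write increment; the values shown are examples only, not recommendations:

export APT_BUFFER_MAXIMUM_MEMORY=6291456        # per-buffer in-memory limit in bytes (6 MB here; 3 MB is the default described above)
export APT_BUFFER_DISK_WRITE_INCREMENT=2097152  # chunk size when the buffer spills to scratch disk (2 MB here; 1 MB is the default described above)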