Component usage

The following guidelines should be followed when constructing parallel jobs in IBM InfoSphere DataStage Enterprise Edition:
1. Never use Server Edition components (BASIC Transformer, Server Shared Containers) within a parallel job.
2. Always use parallel Data Sets for intermediate storage between jobs unless that specific data also needs to be shared with other applications.
3. Use the Copy stage for changing column names and to facilitate default type conversions, instead of the Modify stage (see the example after this list).
4. Use the parallel Transformer stage (not the BASIC Transformer) instead of the Filter or Switch stages.
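For example, guideline 3 means that a simple column rename with a default type conversion (say, mapping a source column CUST_NO, defined as Integer, onto an output column customer_id defined as Varchar; the names are hypothetical) can be handled entirely by the output mapping of a Copy stage. The Modify stage is only needed for conversions the framework does not perform by default.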

DataStage data types
The following guidelines should be followed with DataStage data types:
1. Be aware of the mapping between DataStage (SQL) data types and the internal DS/EE data types.
2. If possible, import table definitions for source databases using the Orchestrate Schema Importer (orchdbutil) utility.
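As a point of reference, typical mappings between DataStage (SQL) types and the internal DS/EE (Orchestrate schema) types include the following; verify them against the documentation for your release:

   Integer       → int32
   BigInt        → int64
   Char(n)       → string[n]
   Varchar(n)    → string[max=n]
   Decimal(p,s)  → decimal[p,s]
   Timestamp     → timestamp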

Partitioning data
In most cases, the default partitioning method (Auto) is appropriate. With Auto partitioning, the Information Server Engine chooses the type of partitioning at runtime based on stage requirements, degree of parallelism, and source and target systems. While Auto partitioning generally gives correct results, it might not give optimized performance. As the job developer, you have visibility into requirements and can optimize within a job and across job flows. Given the numerous options for keyless and keyed partitioning, the following objectives form a methodology for assigning partitioning:
1. Objective 1: Choose a partitioning method that gives close to an equal number of rows in each partition, while minimizing overhead. This ensures that the processing workload is evenly balanced, minimizing overall run time.
2. Objective 2: The partition method must match the business requirements and stage functional requirements, assigning related records to the same partition if required (see the example after this list). Any stage that processes groups of related records (generally using one or more key columns) must be partitioned using a keyed partition method. This includes, but is not limited to: Aggregator, Change Capture, Change Apply, Join, Merge, Remove Duplicates, and Sort stages. It might also be necessary for Transformers and BuildOps that process groups of related records. Note: In satisfying the requirements of this second objective, it might not be possible to choose a partitioning method that gives an almost equal number of rows in each partition.
3. Objective 3: Unless partition distribution is highly skewed, minimize re-partitioning, especially in cluster or Grid configurations. Re-partitioning data in a cluster or Grid configuration incurs the overhead of network transport.
4. Objective 4: The partition method should not be overly complex. The simplest method that meets the above objectives will generally be the most efficient and yield the best performance.
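As an example of Objective 2, a Join stage that matches orders to customers on a customer key (the column name customer_id here is illustrative) requires both input links to be Hash partitioned on that same key, so that matching rows from both inputs land in the same partition. If a few customers account for most of the rows, the resulting partitions will be uneven, which is exactly the trade-off against Objective 1 that the note above describes.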

Using the above objectives as a guide, the following methodology can be applied:
a. Start with Auto partitioning (the default).
b. Specify Hash partitioning for stages that require groups of related records as follows:
   • Specify only the key column(s) that are necessary for correct grouping, as long as the number of unique values is sufficient.
   • Use Modulus partitioning if the grouping is on a single integer key column.
   • Use Range partitioning if the data is highly skewed and the key column values and distribution do not change significantly over time (the Range Map can be reused).
c. If grouping is not required, use Round Robin partitioning to redistribute data equally across all partitions. This is especially useful if the input Data Set is highly skewed or sequential.
d. Use Same partitioning to optimize end-to-end partitioning and to minimize re-partitioning.

Collecting data
Given the options for collecting data into a sequential stream, the following guidelines form a methodology for choosing the appropriate collector type:
1. When output order does not matter, use the Auto collector (the default).
2. Consider how the input Data Set has been sorted:
   2.1 When the input Data Set has been sorted in parallel, use the Sort Merge collector to produce a single, globally sorted stream of rows.
   2.2 When the input Data Set has been sorted in parallel and Range partitioned, the Ordered collector might be more efficient.
3. Use a Round Robin collector to reconstruct rows in input order for round-robin partitioned input Data Sets, as long as the Data Set has not been re-partitioned or reduced.

Sorting
Apply the following methodology when sorting in an IBM InfoSphere DataStage Enterprise Edition data flow:
1. Start with a link sort.
2. Specify only necessary key column(s).
3. Do not use Stable Sort unless needed.
4. Use a stand-alone Sort stage instead of a Link sort for options that are not available on a Link sort:
   4.1 The “Restrict Memory Usage” option should be included here: if you want more memory available for the sort, you can only set that via the Sort stage, not on a sort link. The environment variable $APT_TSORT_STRESS_BLOCKSIZE can also be used to set sort memory usage (in MB) per partition.
   4.2 Sort Key Mode, Create Key Change Column, Create Cluster Key Change Column, Output Statistics.
   4.3 Always specify the “DataStage” Sort Utility for standalone Sort stages.
   4.4 Use the “Sort Key Mode=Don’t Sort (Previously Sorted)” to resort a sub-grouping of a previously sorted input Data Set.
5. Be aware of automatically-inserted sorts: set $APT_SORT_INSERTION_CHECK_ONLY to verify, but not establish, the required sort order.
6. Minimize the use of sorts within a job flow.
7. To generate a single, sequential ordered result set, use a parallel Sort and a Sort Merge collector.
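The two environment variables named above are normally set at the project or job level (for example, through the Administrator client) or in the environment that invokes the job. A minimal sketch in shell form; the values shown are illustrative, not recommendations:

   export APT_SORT_INSERTION_CHECK_ONLY=1   # verify, but do not establish, the required sort order
   export APT_TSORT_STRESS_BLOCKSIZE=512    # per-partition sort memory, in MB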

Stage specific guidelines
The guidelines by stage are as follows:

Transformer
Take precautions when using expressions or derivations on nullable columns within the parallel Transformer:
• Always convert nullable columns to in-band values before using them in an expression or derivation (see the example after this section).
• Always place a reject link on a parallel Transformer to capture / audit possible rejects.

Lookup
Lookup is most appropriate when reference data is small enough to fit into available shared memory. If the Data Sets are larger than available memory resources, use the Join or Merge stage. Limit the use of database Sparse Lookups to scenarios where the number of input rows is significantly smaller (for example, 1:100 or more) than the number of reference rows.

Join
Be particularly careful to observe the nullability properties for input links to any form of Outer Join. Even if the source data is not nullable, the non-key columns must be defined as nullable in the Join stage input in order to identify unmatched records.

Aggregators
Use Hash method Aggregators only when the number of distinct key column values is small. A Sort method Aggregator should be used when the number of distinct key values is large or unknown.

Database stages
The following guidelines apply to database stages:
1. Where possible, use Connector stages or native parallel database stages for maximum performance and scalability.
2. The ODBC Connector and ODBC Enterprise stages should only be used when a native parallel stage is not available for the given source or target database.
3. When using Oracle, DB2, or Informix databases, use the Orchestrate Schema Importer (orchdbutil) to properly import design metadata.
4. Take care to observe the data type mappings.
5. If possible, use an SQL where clause to limit the number of rows sent to a DataStage job.
6. Avoid the use of database stored procedures on a per-row basis within a high-volume data flow.
7. For maximum scalability and parallel performance, it is best to implement business rules natively using DataStage parallel components.
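As a minimal illustration of the Transformer null-handling guideline above (the link and column names lnk_in and order_qty are hypothetical), a derivation that converts a nullable column to an in-band value before using it could be written as:

   If IsNull(lnk_in.order_qty) Then 0 Else lnk_in.order_qty

or with the built-in null-handling function:

   NullToValue(lnk_in.order_qty, 0)

If a null reaches an expression without such handling, the row is sent to the reject link (or dropped with a warning if no reject link exists), which is why the guideline pairs the conversion with always placing a reject link on the parallel Transformer.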
