  

Avoid propagating unnecessary metadata between stages. Use the Modify stage to drop such metadata, but remember that Modify drops metadata only when explicitly told to do so through its KEEP/DROP clauses. Use the Modify, Filter, Aggregator, Column Generator, etc. stages instead of the Transformer stage only if the anticipated volumes are high and performance becomes a problem; otherwise use the Transformer, since it is far easier to code a Transformer than a Modify stage.
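As an illustration, a minimal sketch of a Modify stage Specification property that drops unneeded columns (the column names are hypothetical):

    DROP audit_ts, batch_id

Without an explicit DROP (or a KEEP listing only the wanted columns), the Modify stage passes all input metadata through unchanged.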

   

Turn off Runtime Column Propagation wherever it is not required. One of the most common mistakes developers make is failing to do a volumetric analysis before deciding between the Join, Lookup, and Merge stages; estimate the volumes first and then choose the stage. Add reject links wherever rejected records need reprocessing or considerable data loss could otherwise occur; at a minimum, keep reject links on Sequential File stages and on stages that write to databases. When a database stage feeds a Join, use an ORDER BY clause in the query so that the database's power is used for sorting instead of DataStage resources, and indicate this with a Sort stage set to the don't-sort option between the database stage and the Join stage.
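For example, a hedged sketch of such a source query (table and column names are hypothetical), sorting on the join key inside the database:

    SELECT cust_id, cust_name, region
    FROM   customer_dim
    ORDER  BY cust_id

With the rows arriving pre-sorted on cust_id, the Sort stage in front of the Join can be set to the 'Don't Sort (Previously Sorted)' option.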

 

Use Sort stages instead of Remove Duplicates stages; the Sort stage has more grouping options and sort-indicator options. One of the most frequent causes of lookup failures is not accounting for the string pad character that DataStage appends when converting strings of lower precision to higher precision. Decide on the APT_STRING_PADCHAR and APT_CONFIG_FILE parameters from the beginning: ideally, APT_STRING_PADCHAR should be set to 0x0 (the C/C++ end-of-string NULL character) and the configuration file should be set to the maximum number of nodes available.
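A minimal sketch of these two settings as project-level environment variables (the configuration file path is hypothetical and varies by installation):

    APT_STRING_PADCHAR=0x0
    APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/8node.apt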

Data partitioning is a very important part of parallel job design. It is always advisable to leave the partitioning set to 'Auto' unless you are comfortable with partitioning, since all DataStage stages are designed to perform correctly with Auto partitioning.

While doing outer joins, you can use a dummy variable purely for NULL checking instead of fetching an explicit column from the table.
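For example, a hedged sketch (table and column names are hypothetical): on the inner input of the join, select a constant dummy column instead of a real one, and test it for NULL after the join:

    SELECT cust_id, 'X' AS match_flag
    FROM   customer_dim

After a left outer join on cust_id, any output row with a NULL match_flag had no match in customer_dim.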

Stage variables: Having too many stage variables in a Transformer can increase memory consumption. Avoid looping in a Transformer unless it is necessary, and avoid the BASIC Transformer whenever the parallel Transformer can do the work.

Buffering: Before changing a job's buffering policy, try tuning the following: 1) the job design, 2) the configuration file, 3) the disks.

Performance of a job can be improved if: 1) Unnecessary columns are removed from the upstream and downstream links; removing them reduces memory consumption. 2) The list of columns is always specified in the SELECT statement when reading from a database, so that unnecessary column data never enters the job, saving memory and network bandwidth (see the sketch below). 3) RCP is used very carefully. 4) Data types are understood before being used in the job; do data profiling before bringing data into the job.
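As a sketch of the explicit column list mentioned above (table and column names are hypothetical), prefer:

    SELECT cust_id, cust_name, region FROM customer_dim

over SELECT * FROM customer_dim, so that only the needed columns travel through the job.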

Sort operations: Always perform the following checks before using a sort in the design: 1) Is the sort really needed? 2) What volume of data is going to be sorted? 3) Is the data being read from a database first and then sorted in the job? Could the database sort it instead, so the job reads already-sorted data? 4) What sort-related values are set in the system? Giving attention to these questions before applying a sort helps create a more performant job.

Communication between operators should be optimized: the amount of data handled by the operators must be handled efficiently by the nodes. Having more nodes can improve the performance of a job, but you should also think about resource consumption on the server: more nodes bring more processes, which can exhaust the server, and more resource usage overall. ULIMIT settings that are too low can prevent parallel jobs from running and processes from executing, so setting them up properly helps the server and the jobs execute better.

The following factors affect the job design:
1. Degree of parallelism: The degree of parallelism is determined by the configuration file, where you can check how many nodes are defined (see the configuration file sketch below). Increased parallelism brings more overhead but helps distribute the work.
2. Datasets: Datasets are the best choice for storing intermediate results. Datasets keep the partitions and the sort order if so set; this saves re-partitioning and makes the job more robust.
3. Scratch area: The scratch area is disk space used by jobs when the data to be processed does not fit in the memory buffers. Keep the scratch area well maintained for the smooth operation of the jobs.

Points to remember:
– Parallelism is not always good: it is not always beneficial, and you have to think about the design of the job and the configuration together.
– While partitioning the data, make sure the partitions hold roughly equal amounts of data; inequality makes the job less performant.
– You cannot fit everything in memory: remember this when reading or sorting large amounts of data, which is why correct stage selection is so necessary. Never use a Lookup stage where a large record set would need to be stored in virtual datasets.
– Stored procedures: avoid invoking stored procedures on a per-row basis.
– While configuring an SMP environment, leave some processors for the operating system's own processes.
– To get the maximum performance from a job, start the job design with a smaller set of data and then increase the volume; you will only get the best-performing job by experimenting with the design, for example with different partitioning methods.
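For reference, a minimal sketch of a two-node configuration file in the standard APT syntax (the host name and paths are hypothetical and vary by installation):

    {
        node "node1" {
            fastname "etl_host"
            pools ""
            resource disk "/data/ds/disk1" {pools ""}
            resource scratchdisk "/data/ds/scratch1" {pools ""}
        }
        node "node2" {
            fastname "etl_host"
            pools ""
            resource disk "/data/ds/disk2" {pools ""}
            resource scratchdisk "/data/ds/scratch2" {pools ""}
        }
    }

Adding node entries to this file raises the degree of parallelism (and the number of processes) without any change to the jobs themselves.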

Standards: It is important to establish and follow consistent standards in:
_ Directory structures for installation and application support directories.
_ Naming conventions, especially for DataStage Project categories, stage names, and links.
All DataStage jobs should be documented with the Short Description field, as well as Annotation fields. It is the DataStage developer's responsibility to make personal backups of their work on their local workstation, using DataStage's DSX export capability; this can also be used for integration with source code control systems.
Note: A detailed discussion of these practices is beyond the scope of this Redbooks publication, and you should speak to your Account Executive to engage IBM IPS Services.

Development guidelines: Modular development techniques should be used to maximize re-use of DataStage jobs and components:
_ Job parameterization allows a single job design to process similar logic instead of creating multiple copies of the same job. The Multiple-Instance job property allows multiple invocations of the same job to run simultaneously.
_ A set of standard job parameters should be used in DataStage jobs for source and target database parameters (DSN, user, password, etc.) and directories where files are stored. To ease re-use, these standard parameters and settings should be made part of a Designer Job Parameter Set.
_ Create a standard directory structure outside of the DataStage project directory for source and target files, intermediate work files, and so forth (see the example layout at the end of this section).
_ Where possible, create re-usable components such as parallel shared containers to encapsulate frequently-used logic.
_ DataStage Template jobs should be created with:
– Standard parameters, such as source and target file paths and database login properties
– Environment variables and their default settings
– Annotation blocks
_ Job Parameters should always be used for file paths, file names, database login settings, and so forth.
_ Standardized Error Handling routines should be followed to capture errors and rejects.

Component usage: The following guidelines should be followed when constructing parallel jobs in IBM InfoSphere DataStage Enterprise Edition:
_ Never use Server Edition components (BASIC Transformer, Server Shared Containers) within a parallel job. BASIC Routines are appropriate only for job control sequences.
_ Always use parallel Data Sets for intermediate storage between jobs unless that specific data also needs to be shared with other applications.
_ Use the Copy stage as a placeholder for iterative design.
_ Use BuildOp stages only when logic cannot be implemented in the parallel Transformer.
_ Use the parallel Transformer stage (not the BASIC Transformer) instead of the Filter or Switch stages.
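As an illustration of the standard directory structure recommended in the development guidelines above (the layout and names are a hypothetical sketch, not a product convention):

    /etl/<project>/
        src/       incoming source files
        tgt/       outgoing target files
        work/      intermediate work files
        ds/        parallel Data Sets
        scripts/   job control and maintenance scripts

Job Parameters can then reference these paths so that no path is hard-coded in any stage.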

DataStage data types: The following guidelines should be followed with DataStage data types:
_ Be aware of the mapping between DataStage (SQL) data types and the internal DS/EE data types.
_ If possible, import table definitions for source databases using the Orchestrate Schema Importer (orchdbutil) utility.
_ Leverage default type conversions using the Copy stage or across the Output mapping tab of other stages.

Partitioning data: In most cases, the default partitioning method (Auto) is appropriate. With Auto partitioning, the Information Server Engine will choose the type of partitioning at runtime based on stage requirements, degree of parallelism, and source and target systems. While Auto partitioning will generally give correct results, it might not give optimized performance. As the job developer, you have visibility into requirements and can optimize within a job and across job flows. Given the numerous options for keyless and keyed partitioning, the following objectives form a methodology for assigning partitioning:
_ Objective 1: Choose a partitioning method that gives close to an equal number of rows in each partition, while minimizing overhead. This ensures that the processing workload is evenly balanced, minimizing overall run time.
_ Objective 2: The partition method must match the business requirements and stage functional requirements, assigning related records to the same partition if required. Any stage that processes groups of related records (generally using one or more key columns) must be partitioned using a keyed partition method. This includes, but is not limited to: Aggregator, Change Capture, Change Apply, Join, Merge, Remove Duplicates, and Sort stages. It might also be necessary for Transformers and BuildOps that process groups of related records.
_ Objective 3: Unless partition distribution is highly skewed, minimize re-partitioning, especially in cluster or Grid configurations. Re-partitioning data in a cluster or Grid configuration incurs the overhead of network transport.

_ Objective 4: The partition method should not be overly complex; the simplest method that meets the above objectives will generally be the most efficient and yield the best performance.
Note: In satisfying the requirements of the second objective, it might not be possible to choose a partitioning method that gives an almost equal number of rows in each partition.

Using the above objectives as a guide, the following methodology can be applied (a sketch appears at the end of this subsection):
a. Start with Auto partitioning (the default).
b. Specify Hash partitioning for stages that require groups of related records, as follows:
• Specify only the key column(s) that are necessary for correct grouping, as long as the number of unique values is sufficient.
• Use Modulus partitioning if the grouping is on a single integer key column.
• Use Range partitioning if the data is highly skewed and the key column values and distribution do not change significantly over time (so the Range Map can be reused).
c. If grouping is not required, use Round Robin partitioning to redistribute data equally across all partitions; this is especially useful if the input Data Set is highly skewed or sequential.
d. Use Same partitioning to optimize end-to-end partitioning and to minimize re-partitioning:
• Be mindful that Same partitioning retains the degree of parallelism of the upstream stage.
• Within a flow, examine upstream partitioning and sort order and attempt to preserve them for downstream processing. This may require re-examining key column usage within stages and re-ordering stages within a flow (if business requirements permit).
Across jobs, persistent Data Sets can be used to retain the partitioning and sort order. This is particularly useful if downstream jobs are run with the same degree of parallelism (configuration file) and require the same partition and sort order.

Collecting data: Given the options for collecting data into a sequential stream, the following guidelines form a methodology for choosing the appropriate collector type:
1. When output order does not matter, use the Auto collector (the default).
2. Consider how the input Data Set has been sorted:
– When the input Data Set has been sorted in parallel, use the Sort Merge collector to produce a single, globally sorted stream of rows.
– When the input Data Set has been sorted in parallel and Range partitioned, the Ordered collector might be more efficient.
3. Use a Round Robin collector to reconstruct rows in input order for round-robin partitioned input Data Sets, as long as the Data Set has not been re-partitioned or reduced.
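Returning to step b of the partitioning methodology: as a conceptual sketch in osh-style operator notation (the key column is hypothetical, and in Designer these settings are made through stage and link properties rather than typed as osh), hash partitioning and sorting on the grouping key ahead of a grouping stage can be written as:

    hash -key cust_id | tsort -key cust_id

Because both operators use the same key, all records of a group land in the same partition and arrive in sorted order.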

Sorting: Apply the following methodology when sorting in an IBM InfoSphere DataStage Enterprise Edition data flow:
1. Start with a Link sort.
2. Specify only the necessary key column(s).
3. Do not use Stable Sort unless needed.
4. Use a stand-alone Sort stage instead of a Link sort for options that are not available on a Link sort: Sort Key Mode, Create Key Change Column, Create Cluster Key Change Column, and Output Statistics.
– The "Restrict Memory Usage" option also belongs here: if you want more memory available for the sort, you can only set that via the Sort stage, not on a sort link. The environment variable $APT_TSORT_STRESS_BLOCKSIZE can also be used to set sort memory usage (in MB) per partition.
– Always specify the "DataStage" Sort Utility for standalone Sort stages.
– Use "Sort Key Mode=Don't Sort (Previously Sorted)" to resort a sub-grouping of a previously-sorted input Data Set.
5. Be aware of automatically-inserted sorts: set $APT_SORT_INSERTION_CHECK_ONLY to verify, but not establish, the required sort order.
6. Minimize the use of sorts within a job flow.
7. To generate a single, sequentially ordered result set, use a parallel Sort and a Sort Merge collector.
(A sketch of the sort-related environment settings appears at the end of this section.)

Stage specific guidelines: The guidelines by stage are as follows:
_ Transformer: Take precautions when using expressions or derivations on nullable columns within the parallel Transformer:
– Always convert nullable columns to in-band values before using them in an expression or derivation.
– Always place a reject link on a parallel Transformer to capture / audit possible rejects.
_ Lookup: It is most appropriate when the reference data is small enough to fit into available shared memory. If the Data Sets are larger than available memory resources, use the Join or Merge stage. Limit the use of database Sparse Lookups to scenarios where the number of input rows is significantly smaller (for example 1:100 or more) than the number of reference rows, or to exception processing.
_ Join: Be particularly careful to observe the nullability properties of the input links to any form of Outer Join. Even if the source data is not nullable, the non-key columns must be defined as nullable in the Join stage input in order to identify unmatched records.
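Returning to the sorting methodology above, a hedged sketch of the two sort-related environment settings (the values shown are illustrative only):

    APT_SORT_INSERTION_CHECK_ONLY=1    # verify, but do not establish, the required sort order
    APT_TSORT_STRESS_BLOCKSIZE=256     # sort memory usage per partition, in MB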

_ Aggregators: Use the Hash method Aggregator only when the number of distinct key column values is small; a Sort method Aggregator should be used when the number of distinct key values is large or unknown.
_ Database stages: The following guidelines apply to database stages:
– Where possible, use the Connector stages or native parallel database stages for maximum performance and scalability.
– The ODBC Connector and ODBC Enterprise stages should only be used when a native parallel stage is not available for the given source or target database.
– When using Oracle, DB2, or Informix databases, use the Orchestrate Schema Importer (orchdbutil) to properly import design metadata.
– Take care to observe the data type mappings.
– If possible, use an SQL WHERE clause to limit the number of rows sent to a DataStage job (see the sketch below).
– Avoid the use of database stored procedures on a per-row basis within a high-volume data flow. For maximum scalability and parallel performance, it is best to implement business rules natively using DataStage parallel components.
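For example, a hedged sketch of limiting rows at the source (table, columns, and the filter value are hypothetical):

    SELECT order_id, cust_id, amount
    FROM   orders
    WHERE  order_date >= '2009-01-01'

Filtering in the database keeps unneeded rows off the network and out of the job entirely.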
