Information Integration Solutions Center of Excellence

Parallel Framework Standard Practices
Investigate, Design, Develop: Data Flow Job Development
Prepared by IBM Information Integration Solutions Center of Excellence July 17, 2006

CONFIDENTIAL, PROPRIETARY, AND TRADE SECRET NATURE OF ATTACHED DOCUMENTS
This document is Confidential, Proprietary and Trade Secret Information (“Confidential Information”) of IBM, Inc. and is provided solely for the purpose of evaluating IBM products with the understanding that such Confidential Information will be disclosed only to those who have a “need to know.” The attached documents constitute Confidential Information as they include information relating to the business and/or products of IBM (including, without limitation, trade secrets, technical, business, and financial information) and are trade secret under the laws of the State of Massachusetts and the United States. Copyrights © 2006 IBM Information Integration Solutions All rights reserved. No part of this publication may be reproduced, transmitted, transcribed, stored in a retrieval system, or translated into any language in any form by any means without the written permission of IBM. While every precaution has been taken in the preparation of this document to reflect current information, IBM assumes no responsibility for errors or omissions or for damages resulting from the use of information contained herein.


Document Goals

Intended Use: This document presents a set of standard practices, methodologies, and examples for IBM WebSphere® DataStage Enterprise Edition™ ("DS/EE") on UNIX, Windows, and USS. Except where noted, this document is intended to supplement, not replace, the installation documentation.

Target Audience: The primary audience for this document is DataStage developers who have been trained in Enterprise Edition. Information in certain sections may also be relevant for Technical Architects, System Administrators, and Developers.

Product Version: This document is intended for the following product releases:
- WebSphere DataStage Enterprise Edition 7.5.1 (UNIX, USS)
- WebSphere DataStage Enterprise Edition 7.5x2 (Windows)

Document Revision History

Date               Rev   Description
April 16, 2004     1.0   Initial Services release.
June 30, 2005      2.0   First version based on separation of EE BP into four separate documents; merged new material on Remote DB2 and configuring DataStage for multiple users.
December 9, 2005   3.0   Significant updates, additional material.
January 31, 2006   3.1   Updates based on review feedback. Added patch install checklist item (7.10) and Windows 7.5x2 patch list.
February 17, 2006  4.0   Significant updates; new material on ETL overview, data types, naming standards, USS, design standards, database stage usage, and database data type mappings; updated styles and use of cross-references.
March 10, 2006     4.1   Corrected missing Figure 9. Added new material on establishing job boundaries, balancing job resource requirements / startup time with required data volume and processing windows, and minimizing the number of runtime processes. Moved Baselining Performance discussion to the Performance Tuning BP. Expanded performance tuning section.
March 31, 2006     4.2   Removed Architecture Overview (now a separate document). Expanded file stage recommendations. Updated directory naming standards for consistency with DS/EE Automation Standards and Toolkit.
May 08, 2006       4.3   Segmented content into "Red Book" and "Standards". Clarified terminology ("Best Practices").
July 17, 2006      5.0   Incorporated additional field feedback.

Document Conventions
This document uses the following conventions:

Convention      Usage
Bold            In syntax, bold indicates commands, function names, keywords, and options that must be input exactly as shown. In text, bold indicates keys to press, function names, and menu selections.
Italic          In syntax, italic indicates information that you supply. In text, italic also indicates UNIX commands and options, file names, and pathnames.
Plain           In text, plain indicates Windows NT commands and options, file names, and pathnames.
Bold Italic     Indicates important information.

Lucida Console        Indicates examples of source code and system output.
Lucida Console Bold   In examples, indicates characters that the user types or keys the user presses (for example, <Return>).
Lucida Blue           In examples, indicates the operating system command-line prompt.
Right arrow           A right arrow between menu commands indicates you should choose each command in sequence. For example, "Choose File → Exit" means you should choose File from the menu bar, and then choose Exit from the File pull-down menu.
Continuation character   Used in source code examples to indicate a line that is too long to fit on the page, but must be entered as a single line on screen, for example: This line  continues

The following are also used:
• Syntax definitions and examples are indented for ease in reading.
• All punctuation marks included in the syntax—for example, commas, parentheses, or quotation marks—are required unless otherwise indicated.
• Syntax lines that do not fit on one line in this manual are continued on subsequent lines. The continuation lines are indented. When entering syntax, type the entire syntax entry, including the continuation lines, on the same input line.
• Text enclosed in parentheses and underlined (like this) following the first use of a proper term will be used instead of the proper term.
Interaction with our example system will usually include the system prompt (in blue) and the command, most often on two or more lines. If appropriate, the system prompt will include the user name and directory for context. For example:
%etl_node%:dsadm /usr/dsadm/Ascential/DataStage > /bin/tar -cvf /dev/rmt0 /usr/dsadm/Ascential/DataStage/Projects


Table of Contents

1 DATA INTEGRATION OVERVIEW
  1.1 JOB SEQUENCES
  1.2 JOB TYPES
2 STANDARDS
  2.1 DIRECTORY STRUCTURES
  2.2 NAMING CONVENTIONS
  2.3 DOCUMENTATION AND ANNOTATION
  2.4 WORKING WITH SOURCE CODE CONTROL SYSTEMS
  2.5 UNDERSTANDING A JOB'S ENVIRONMENT
3 DEVELOPMENT GUIDELINES
  3.1 MODULAR DEVELOPMENT
  3.2 ESTABLISHING JOB BOUNDARIES
  3.3 JOB DESIGN TEMPLATES
  3.4 DEFAULT JOB DESIGN
  3.5 JOB PARAMETERS
  3.6 PARALLEL SHARED CONTAINERS
  3.7 ERROR AND REJECT RECORD HANDLING
  3.8 COMPONENT USAGE
4 DATASTAGE DATA TYPES
  4.2 NULL HANDLING
  4.3 RUNTIME COLUMN PROPAGATION
5 PARTITIONING AND COLLECTING
  5.1 PARTITION TYPES
  5.2 MONITORING PARTITIONS
  5.3 PARTITION METHODOLOGY
  5.4 PARTITIONING EXAMPLES
  5.5 COLLECTOR TYPES
  5.6 COLLECTING METHODOLOGY
6 SORTING
  6.1 PARTITION AND SORT KEYS
  6.2 COMPLETE (TOTAL) SORT
  6.3 LINK SORT AND SORT STAGE
  6.4 STABLE SORT
  6.5 SUB-SORTS
  6.6 AUTOMATICALLY-INSERTED SORTS
  6.7 SORT METHODOLOGY
  6.8 TUNING SORT
7 FILE STAGE USAGE
  7.1 WHICH FILE STAGE TO USE
  7.2 DATA SET USAGE
  7.3 SEQUENTIAL FILE STAGES (IMPORT AND EXPORT)
  7.4 COMPLEX FLAT FILE STAGE
8 TRANSFORMATION LANGUAGES
  8.1 TRANSFORMER STAGE
  8.2 MODIFY STAGE
9 COMBINING DATA
  9.1 LOOKUP VS. JOIN VS. MERGE
  9.2 CAPTURING UNMATCHED RECORDS FROM A JOIN
  9.3 THE AGGREGATOR STAGE
10 DATABASE STAGE GUIDELINES
  10.1 DATABASE DEVELOPMENT OVERVIEW
  10.2 DB2 GUIDELINES
  10.3 INFORMIX DATABASE GUIDELINES
  10.4 ODBC ENTERPRISE GUIDELINES
  10.5 ORACLE DATABASE GUIDELINES
  10.6 SYBASE ENTERPRISE GUIDELINES
  10.7 TERADATA DATABASE GUIDELINES
11 TROUBLESHOOTING AND MONITORING
  11.1 WARNING ON SINGLE-NODE CONFIGURATION FILES
  11.2 DEBUGGING ENVIRONMENT VARIABLES
  11.3 HOW TO ISOLATE AND DEBUG A PARALLEL JOB
  11.4 VIEWING THE GENERATED OSH
  11.5 INTERPRETING THE PARALLEL JOB SCORE
12 PERFORMANCE TUNING JOB DESIGNS
  12.1 HOW TO DESIGN A JOB FOR OPTIMAL PERFORMANCE
  12.2 UNDERSTANDING OPERATOR COMBINATION
  12.3 MINIMIZING RUNTIME PROCESSES AND RESOURCE REQUIREMENTS
  12.4 UNDERSTANDING BUFFERING
APPENDIX A: STANDARD PRACTICES SUMMARY
APPENDIX B: DATASTAGE NAMING REFERENCE
APPENDIX C: UNDERSTANDING THE PARALLEL JOB SCORE
APPENDIX D: ESTIMATING THE SIZE OF A PARALLEL DATA SET
APPENDIX E: ENVIRONMENT VARIABLE REFERENCE
APPENDIX F: SORTING AND HASHING ADVANCED EXAMPLE

1 Data Integration Overview
Work performed by Data Integration jobs falls into 4 general categories:
• Reading input data, including sequential files, databases, and DS/EE Data Sets;
• Performing row validation to support data quality;
• Performing transformation from data sources to data targets; and
• Provisioning data targets.

Here is the general flow diagram for DataStage Enterprise Edition jobs:

[Flow diagram: Before Job Subroutine, then Read Input Data, Perform Validations, Perform Transformations, and Perform Load and/or Create Intermediate Datasets, then After Job Subroutine. At each step, a "Halt on Error?" decision exits with failure; reject files (limited) are created when reading and loading, and error and reject files are created for validation and transformation errors and warnings. After the final step, an "Over Job Warning Threshold?" decision exits with failure if exceeded.]

transcribed.1 Job Sequences As shown in the previous diagram.Information Integration Solutions Center of Excellence 1. etc). auditing/capture. Job sequences also provide the recommended level of integration with external schedulers (such as AutoSys. and provides an appropriate leveraging of the respective technologies. or translated into any language in any form by any means without the written permission of IBM. In most production deployments. All rights reserved. etc). and Production Automation. CA7. transmitted. No part of this publication may be reproduced. ETL development is intended to be modular. Parallel Framework Red Book: Data Flow Job Design July 17. and together form a single end-to-end module within a DataStage application. stored in a retrieval system. This provides a level of granularity and control that is easy to manage and maintain. built from individual Parallel jobs assembled in DataStage Enterprise Edition (“DS/EE”) controlled as modules from master DataStage Sequence jobs. error logging. 2006 7 of 179 © 2006 IBM Information Integration Solutions. Job Sequences require a level of integration with various production automation technologies (scheduling. These topics are discussed in Parallel Framework Standard Practices: Administration. Management. . as illustrated in the example below: These job Sequences control the interaction and error handling between individual DataStage jobs. Cron.

1.2 Job Types
Nearly all data integration jobs fall into three major types: Transformation, Hybrid, and Provisioning.
• Transformation jobs prepare data for provisioning jobs.
• Provisioning jobs load transformed data.
• Hybrid jobs do both.
The following table defines when each type should be used:

Type: Transformation
Data Requirements: Data must NOT be changed by any method unless jobs transforming an entire subject area have successfully completed. This prevents partial replacement of reference data in the event of transformation failure, and preserves the compute effort of long running transformation jobs.
Example: Reference tables upon which all subsequent jobs and/or the current data target (usually a database) will depend, or where the resource requirements for data transformation are very large.

Type: Hybrid
Data Requirements: Data can be changed regardless of success or failure. The data target (usually a database) must allow subsequent processing of error or reject rows and tolerate partial or complete non-update of targets.
Example: Non-reference data or independent data are candidates, where neither the transformation nor provisioning requirements are large.

Type: Provisioning
Data Requirements: Data must NOT be changed by any method unless jobs transforming an entire subject area have successfully completed.
Example: Any target where either all sources have been successfully transformed, or where the resources required to transform the data must be preserved in the event of a load failure, or where the provisioning will take so long that it increases the probability of job failure if the job includes transformation and provisioning; also long running provisioning processes, or where the resource requirements for data provisioning are very large.

1.2.1 Transformation Jobs
In transformation jobs, data sources, some of which may be write-through cache Data Sets, are processed to produce a load-ready Data Set that represents either the entire target table or new records to be appended to the target table. If the entire target table is held in the load-ready Data Set, that Data Set qualifies as write-through cache and may be used as source data instead of the target table.
The following example transformation job demonstrates the use of write-through cache DS/EE Data Sets; the target table is among the inputs:

The following example transformation job does NOT produce write-through cache – its sources do NOT include the target table.

1.2.2 Hybrid Jobs
The following example hybrid job demonstrates several interesting techniques that might be used in more complex jobs. This job also loads the target database table and creates write-through cache. If the load fails, the cache is deleted, forcing other jobs that might depend on this data to access the existing (not updated) target database table. This enforces a coherent view of the subject area from either cache (current state if all jobs complete successfully) or target tables (previous state if any job fails).
Some of the more interesting solutions in this job are circled, and are described below following the highlighted areas from left to right:
• A Column Generator inserts the key column for a join and generates a single value guaranteed to never appear in the other input(s) to the join. By specifying a full-outer join we produce a Cartesian product dataset. In this case, we replicated the Oracle structure (lower input) for each country found in the write-through cache country dataset (upper input).
• The key column for a Referential Integrity check is validated by a Transformer stage. If the key column is NULL, it is rejected by the Transformer to a reject port and the validation is not performed for those records.
• The non-validated records, the validated records, and the write-through cache records from the last load of the target database are merged.
• The merged records are grouped and ordered before being de-duplicated to remove obsolete records.
• The de-duplicated records are re-grouped and ordered before calculation of the terminating keys, producing an ordered and linked associative table.

1.2.3 Provisioning Jobs
This example provisioning job demonstrates the straightforward approach to simple provisioning tasks:

2 Standards
Establishing consistent development standards helps to improve developer productivity and reduce ongoing maintenance costs. Development standards can also make it easier to integrate external processes such as automated auditing and reporting, and to build technical and support documentation.

2.1 Directory Structures

2.1.1 Data, Install, and Project Directory Structures
The following diagrams depict the IBM WebSphere DataStage software directory structures and the support directory structures. These directories are configured during product installation. File systems are highlighted in blue.

Figure 1: Recommended DataStage Install, Scratch, and Data Directories
[Install file system: /Ascential/DataStage containing DSEngine, PXEngine, Configurations, patches, and Projects (Project_A ... Project_Z); Scratch file systems: /Scratch0 ... /ScratchN, each with a subdirectory per project; Data file systems: /Data0 ... /DataN, each with a subdirectory per project]

By default, the DataStage Administrator client creates its projects (repositories) in the Projects directory of the DataStage installation directory. In general, it is a bad practice to create DataStage projects in the default directory, as disk space is typically limited in production install directories. For this reason, projects should be installed in their own file system; this is illustrated in the above diagram as a separate file system for the Projects subdirectory within the DataStage installation. NOTE: On some operating systems, it is possible to create separate file systems at non-root levels.

The DataStage installation creates the following two directories:
    $DSHOME/../Scratch
    $DSHOME/../Datasets

Scratch is used by the EE framework for temporary files such as buffer overflow and sort memory overflow. The DataStage Administrator should ensure that these default directories are never used by any parallel configuration files. It is a bad practice to share the DataStage project file system and conductor file system with volatile files like scratch files and Parallel Data Set part files, because they increase the risk of filling the DataStage project file systems.

To scale I/O performance within DataStage, the administrator should consider creating separate file systems for each Scratch and Data (resource) partition. This best practice advocates creating subdirectories for each project within each scratch and disk partition. Consider naming the file systems in accordance with the partition numbers in your DataStage EE configuration file.
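As an illustration only (the node names, server fastname, and directory paths below are assumptions, not shipped defaults), a two-node parallel configuration file following this practice might map each partition's resources to per-project subdirectories on separate file systems:

    {
        node "node1"
        {
            fastname "etl_server"
            pools ""
            resource disk "/Data0/Project_A" {pools ""}
            resource scratchdisk "/Scratch0/Project_A" {pools ""}
        }
        node "node2"
        {
            fastname "etl_server"
            pools ""
            resource disk "/Data1/Project_A" {pools ""}
            resource scratchdisk "/Scratch1/Project_A" {pools ""}
        }
    }

Each logical node's resource disk and scratchdisk point at a different file system, each with a subdirectory per project, so data and scratch I/O for the project are spread across devices and are easy to monitor.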

. system integration.Information Integration Solutions Center of Excellence /Staging /dev /si /qa /prod /Project_A /Project_A /Project_A /Project_A /archive /archive /archive /archive .. All rights reserved. and production) as appropriate. or translated into any language in any form by any means without the written permission of IBM. target data files.... Within each deployment directory. stored in a retrieval system./archive of these development phases may be present on a local file not all system.. data directories are implemented for each deployment phase of a job (development. /Staging /dev Top-Level Directory development data tree. location of source data files. error and reject files.. 2006 15 of 179 . transcribed. /Project_A /archive /si /qa /prod Parallel Framework Red Book: Data Flow Job Design Figure 2: DataStage Staging Directories /Project_Z /Project_Z © 2006 IBM Information Integration Solutions. files are separated by Project name as shown below. If the file system is not /archive /archive /archive shared across multiple servers. . subdirectory created for each project location of compressed archives created by archive process of previously processed files System Integration (also known as “test”) data tree Quality Assurance data tree Production data tree . No part of this publication may be reproduced. transmitted. qa. /Project_Z July 17. /Project_Z Within the separate Staging file system.. .

2.1.2 Extending the DataStage Project for External Entities
It is quite common for a DataStage application to be integrated with external entities, such as the operating system, another application, or middleware. The integration can be as simple as a file system for housing source files, or it could require scripts, for example when integrating with an enterprise scheduler. To completely integrate all aspects of a DataStage application, the directory structure that is used for integration with external entities should be defined in a way that provides a complete and separate structure, in the same spirit as a DataStage project. A directory structure should be created that organizes external entities and is directly associated with one and only one DataStage project. This provides a convenient vehicle to group and manage the resources used by a project. The directory structure is made transparent to the DataStage application through the use of environment variables, which enables DataStage applications to move through the life cycle without any code changes. Environment variables are a critical portability tool.

Figure 3: Project_Plus Directory Structure
[Project_Plus hierarchy: for each phase (/dev, /si, /qa, /prod), a subdirectory per project (/Project_A ... /Project_Z) containing /bin, /src, /doc, /datasets, /logs, /params, /schemas, /scripts, and /sql]

Within the Project_Plus hierarchy, directories are created for each deployment phase of a job (development, system integration, qa, and production) as appropriate; not all of these development phases may be present on a local file system if the file system is not shared across multiple servers.

Project_Plus      Top level of directory hierarchy
  /dev            Development code tree
    /Project_A    Subdirectory created for each project
      /bin        Location of custom programs, DataStage routines, BuildOps, utilities, and shells
      /doc        Location of documentation for programs found in the /bin subdirectory
      /src        Location of source code and makefiles for items found in the /bin subdirectory (Note: depending on change management policies, this directory may only be present in the /dev development code tree)
      /datasets   Location of Data Set header files (.ds files)
      /logs       Location of custom job logs and reports
      /params     Location of parameter files for automated program control, a copy of dsenv, and copies of DSParams.$ProjectName project files
      /schemas    Location of Orchestrate schema files
      /scripts    Location of operating system (shell) script files
      /sql        Location of maintenance or template SQL
  /si             System integration (aka "test") code tree
  /qa             Quality assurance code tree
  /prod           Production code tree

In support of a Project_Plus directory structure, environment variable parameters should be configured.
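As a rough sketch (the variable names and values below are assumptions patterned on the hierarchy above, not product defaults), such user-defined parameters, whether set in the DataStage Administrator or held in a parameter file, might look like:

    # Project_Plus locations for one project in the development phase (illustrative).
    PROJECT_PLUS_HOME=/Project_Plus/dev/Project_A
    PROJECT_PLUS_DATASETS=$PROJECT_PLUS_HOME/datasets
    PROJECT_PLUS_LOGS=$PROJECT_PLUS_HOME/logs
    PROJECT_PLUS_PARAMS=$PROJECT_PLUS_HOME/params
    PROJECT_PLUS_SCHEMAS=$PROJECT_PLUS_HOME/schemas
    PROJECT_PLUS_SCRIPTS=$PROJECT_PLUS_HOME/scripts
    PROJECT_PLUS_SQL=$PROJECT_PLUS_HOME/sql

Because jobs reference only the variable names, promoting a job from the /dev tree to /qa or /prod requires changing only these values, not the job designs.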

2 Naming Conventions As a graphical development environment.Information Integration Solutions Center of Excellence Figure 4: Project_Plus Environment Variables In some implementations. While the default names may create a functional data flow. All rights reserved. they do not facilitate ease of maintenance over time. there may be external entities that are shared with other DataStage projects. and the order the item is placed on the design canvas. for example all jobs are invoked with the same Script. A similar directory structure to the Project_Plus structure could be configured and referred to as DataStage_Plus. stored in a retrieval system. or an XML extract Parallel Framework Red Book: Data Flow Job Design July 17. nor do they adequately document the business rules or subject areas. the Designer tool assigns default names based on the object type. 2. A consistent naming standard is essential to • maximize the speed of development • minimize the effort and cost of downstream maintenance • enable consistency across multiple teams and projects • facilitate concurrent development • maximize the quality of the developed application • increase the readability of the objects in the visual display medium • increase the understanding of components when seen in external systems. DataStage offers (within certain restrictions) flexibility to developers when naming various objects and components used to build a data flow. 2006 18 of 179 © 2006 IBM Information Integration Solutions. for example in WebSphere MetaStage. or translated into any language in any form by any means without the written permission of IBM. transcribed. . No part of this publication may be reproduced. transmitted. By default.

This section presents a set of standards and guidelines to apply to developing data integration applications using DataStage Enterprise Edition. Throughout this section, the term "Standard" refers to principles that are required, while the term "Guideline" refers to recommended, but not required, principles. Any set of standards needs to take on the culture of an organization and be tuned according to needs, so it is envisaged that these standards will develop and adapt over time to suit both the organization and the purpose.

There are a number of benefits from using a graphical development tool like DataStage, and many of these benefits were used to establish this naming standard:
• With rapid development, more effort can be put into analysis and design while maintaining quality, enabling a greater understanding of the requirements and greater control over how they are delivered.
• There can be a much tighter link between design and development.
• Since much of the development work is done using a click, drag and drop paradigm, there is less typing involved, and hence the opportunity to use longer, more meaningful, more readable names.

2.2.1 Key Attributes of the Naming Convention
This naming convention is based on a three-part convention: Subject, Subject Modifier, and Class Word. In the context of DataStage, the class word is used to identify either a type of object or the function that a particular type of object performs. Where there is no sub-classification required, the class word simply refers to the object. In some cases, where appropriate, objects can be sub-typed (for example, a Left Outer Join); in these cases the class word represents the subtype. For example, in the case of a link object, the class word refers to the functions of Reading, Moving, or Writing data (or, within a Sequence Job, the moving of a message). In the case of a data store, the class word refers to the type of data store, for example: Data Set, Sequential File, Table, View, and so forth.

As an example, a transformer might be named: Data_Block_Split_Tfm

As a guideline, the Class Word is represented as a two-, three-, or four-letter abbreviation. Where it is a two-letter abbreviation, both letters should be capitalized; where it is a three- or four-letter abbreviation, it should be word capitalized.

A list of frequently-used Class Word abbreviations is provided in Appendix B: DataStage Naming Reference.

One benefit of using the Subject, Subject Modifier, Class Word approach over using a prefix approach is that it enables two levels of sorting or grouping. In WebSphere MetaStage, the object type is defined in a separate field: there is a field that denotes whether the object is a column, a link, a stage, a job design, a derivation, and so forth. This is the same or similar information that would be carried in a prefix approach. Carrying this information as a separate attribute enables the first word of the name to be used as the subject matter, allowing sorting either by subject matter or by object type. Secondly, the class word approach enables sub-classification by object type to provide additional information.

The key issue is readability. Though DataStage imposes some limitations on the type of characters and length of various object names, the standard, where possible, is to separate words by an underscore, which allows clear identification of each word in a name. This should be enhanced by also using word capitalization; for example, the first letter of each word should be capitalized.

2.2.2 Designer Object Layout
The effective use of naming conventions means that objects need to be spaced appropriately on the DataStage Designer canvas. For stages with multiple links, expanding the icon border can significantly improve readability. This type of approach takes extra effort at first, so a pattern of work needs to be identified and adopted to help development. The "Snap to Grid" feature of Designer can help improve development speed. Where possible, consideration should be made to provide DataStage developers with higher resolution screens, as this provides them with more screen display real estate; this can help make them more productive and makes their work more easily read.

2.2.3 Documentation and Metadata Capture
One of the major problems with any development effort, whatever tool you use, is maintaining documentation. Though best intentions are always apparent, documentation is often something that is left until later, or inadequately carried out. DataStage provides the ability to document during development through the use of meaningful naming standards (as outlined in this section). Like a logical name, the abbreviated form is used when creating the object; however, for the purposes of documentation, all word abbreviations should be referenced by the long form, to get used to saying the name in full even if reading the abbreviation. This will help reinforce wider understanding of the subjects. When development is more or less complete, attention should be given to the layout to enhance readability before it is handed over to versioning. Establishing standards also eases the use of external tools and

processes such as WebSphere MetaStage, which can provide impact analysis as well as documentation and auditing.

2.2.4 Naming Conventions by Object Type

2.2.4.1 Projects
Each DataStage Project is a standalone repository. It may or may not have a one-to-one relationship with an organization's project of work. This factor often can cause terminology issues, especially in teamwork where both business and developers are involved. The name of a DataStage Project may only be 18 characters in length; it can contain alphanumeric characters and underscores. With the limit of 18 characters, the name is most often composed of abbreviations. The suffix of a Project name should be used to identify Development ("Dev"), Test ("Test"), and Production ("Prod").
Examples of Project naming where the project is single-application focused are:
• "Accounting Engine NAB Development" would be named: Acct_Eng_NAB_Dev
• "Accounting Engine NAB Production" would be named: Acct_Eng_NAB_Prod
Examples of Project naming where the project is multi-application focused are:
• Accounting Engine Development or Acct_Engine_Dev
• Accounting Engine Production or Acct_Engine_Prod

2.2.4.2 Category Hierarchy
DataStage organizes objects in its repository by Categories, allowing related objects to be grouped together. DataStage enforces the top-level Directory structure for different types of objects (for example, Jobs, Routines, Shared Containers, Table Definitions). Below this level, developers have the flexibility to define their own Directory or Category hierarchy. Category names can be long, are alphanumeric, and can also contain both spaces and underscores; Directory names should be word capitalized and separated by either an underscore or a space. Within Designer, dialog box fields that specify a new category have only one input area for defining the Category name. Multiple levels of hierarchy are named by specifying the hierarchy levels separated by a backslash ("\"). For example, the structure "A Test\Lower\Lower Still" is shown below:

Figure 5: Creating Category Hierarchies

2.2.4.3 Job Category Naming
The main reason for having Categories is to group related objects. Where possible, a Category level should only contain objects that are directly related. For example, a job category might contain a Job Sequence and all the jobs, and only those jobs, that are contained in that sequence. Organizing related DataStage objects within categories also facilitates backup/export/import/change control strategies for projects, since Manager can import/export objects by category grouping.

Categorization by Functional Module
For a given application, all Jobs and Job Sequences will be grouped in a single parent Category, with sub-levels for individual functional modules. Within each functional module category, Jobs and Job Sequences are grouped together in the same scope as the technical design documents. Note that Job names must be unique within a DataStage project, not within a category. For example, jobs that read write-through cache for an ECRP subset in the ECRDEV project, that cleanse and load multi-family mortgage data, and that are driven by a sequencer might have a hierarchy that looks like the following example:

Figure 6: Categorization by Functional Module

Categorization by Developer
In development projects, categories will be created for each developer as their personal sandbox and the place where they perform unit test activities on jobs they are developing. In the previous illustration, two developers have private categories for sandbox and development activities, and there are 2 additional high-level categories, ECRP and Templates. It is the responsibility of each developer to delete unused or obsolete code, and the responsibility of the development manager assigned the DataStage Manager role to ensure that projects are not obese with unused jobs, categories, and metadata. Remembering that Job names must be unique within a given project, two developers cannot save a copy of the same job with the same name within their individual "sandbox" categories – a unique Job name must be given.

2.2.4.4 Table Definition Categories
Unlike other types of DataStage objects, Table Definitions are always categorized using two-level names. On import, the first-level Table Definition category is identified as the "Data Source Type" and the second-level categorization is referred to as the "Data Source Name", as shown in the example below; the placement of these fields varies with the method of metadata import. By default, DataStage assigns the level names based on the source of the metadata import (for example, Orchestrate, PlugIn, Saved, etc.), but this can be overridden during import.
Although the default Table Definition categories are useful from a functional perspective, establishing a Table Definition categorization that matches the project development organization is recommended. New Table Definition categories can be created within the repository by right-clicking within the Table Definitions area of the DataStage project repository and choosing the "New Category" command. Once created, and when implementing a customized Table Definition categorization, care must be taken to override the default choices for category names during Table Definition import.
Temporary TableDefs created by developers to assist with job creation appear under the Saved category by default; if these TableDefs are to be used by other jobs, they must be moved to the appropriate category and re-imported from that category in every job where they are used. TableDefs that remain in the Saved category should be deleted as soon as possible.

2.g: DWPH1 or ECRP. Job and Job Sequence names should be descriptive and should use word capitalization to make them readable. and underscores only. A Job will be suffixed with the class word “Job” and a Job Sequence will be suffixed with the class word “Seq”. No part of this publication may be reproduced.4.2 Category Hierarchy.4. numbers. stored in a retrieval system. Figure 7: Table Definition Categories 2.: Datasets. transcribed. Each subject area will have a master category. e.2. . 2006 24 of 179 © 2006 IBM Information Integration Solutions. or translated into any language in any form by any means without the written permission of IBM. In this example. All rights reserved. Because the name of can be long. transmitted. Examples of Job naming are: • CodeBlockAggregationJob • CodeBlockProcessingSeq Jobs should be organized under Category Directories to provide grouping such that a Directory should contain a Sequence Job and all the Jobs that are contained within that sequence. Jobs and Job Sequences are all held under the Category Directory Structure of which the top level is the category “Jobs”. the TableDefs have been grouped into a master category of Custom. An alternative implementation is to set the “Data source name” to that of the source system or schema. with sub-categories intended to identify the type of the source. e. Parallel Framework Red Book: Data Flow Job Design July 17.g.Information Integration Solutions Center of Excellence should be deleted as soon as possible. The following is one of the TableDefs from this project showing how to correctly specify the category and sub-category. This will be discussed further in Section 2.5 Jobs and Job Sequences Job names must begin with a letter and can contain letters.

2.2.4.6 Shared Containers
Shared containers have the same naming constraints as jobs in that the name can be long but cannot contain underscores. When a Shared Container is used, a character code is automatically added to that instance of its use throughout the project; it is optional as to whether you decide to change this code to something meaningful. Shared Containers have their own Category Directory, and consideration should be given to a meaningful Directory hierarchy.
To differentiate between Parallel Shared Containers and Server Shared Containers, the following Class Word naming is recommended:
• Psc = Parallel (Enterprise Edition) Shared Container
• Ssc = Server Edition Shared Container
IMPORTANT: Use of Server Edition Shared Containers is discouraged within a parallel job.
Examples of Shared Container naming are:
• AuditTrailPsc (this is the original as seen in the Category Directory)
• AuditTrailPscC1 (this is an instance of use of the above shared container)
• AuditTrailPscC2 (this is another instance of use of the same shared container)
In the above examples, the characters "C1" and "C2" are automatically applied to the Shared Container stage by DataStage Designer when it is dragged onto the design canvas.

2.2.4.7 Parameters
A Parameter can have a long name consisting of alphanumeric characters and underscores, so the parameter name should be made readable using capitalized words separated by underscores. The class word suffix is "Parm".
An example of Parameter naming is:
• Audit_Trail_Output_Path_Parm
Note that where a parameter is used in a stage property, the parameter name is delimited by the # sign: #Audit_Trail_Output_Path_Parm#

2.2.4.8 Links
Within a DataStage Job, links are objects that represent the flow of data from one stage to the next. Within a Job Sequence, links represent the flow of a message from one activity/step to the next. Within the graphical Designer environment, stage editors identify links by name; it is therefore particularly important to establish a consistent naming convention for link names, instead of using the default "DSLink#" (where "#" is an assigned number). Having a descriptive link name reduces the chance for errors (for example, during Link Ordering). Furthermore, when sharing data with external applications (for example, through Job reporting), establishing standardized link names makes it easier to understand results and audit counts.

The following rules can be used to establish a link name:
• Use the prefix "lnk_" before the subject name to differentiate from stage objects.
• The link name should define the subject of the data that is being moved.
• For non-stream links, the link name should include the link type (reference, reject) to reinforce the visual cues of the Designer canvas:
  o "Ref" for reference links (Lookup)
  o "Rej" for reject links (Lookup, Merge, Transformer, Sequential File, Database, etc.)
• The type of movement may optionally be part of the Class Word, for example:
  o "In" for input
  o "Out" for output
  o "Upd" for updates
  o "Ins" for inserts
  o "Del" for deletes
  o "Get" for shared container inputs
  o "Put" for shared container outputs
• As data is enriched through stages, the same name may be appropriate for multiple links. In this case, always specify a unique link name within a particular Job or Job Sequence by including a number. (The DataStage Designer does not require link names on different stages to be unique.)
Example link names:
• Input Transactions: "lnk_Txn_In"
• Reference Account Numbers: "lnk_Account_Ref"
• Customer File Rejects: "lnk_Customer_Rej"
• Reception Succeeded Message: "lnk_Reception_Succeeded_Msg"

2.2.4.9 Stage Names
DataStage assigns default names to stages as they are dragged onto the Designer canvas. These names are based on the type of stage (object) and a unique number, based on the order in which the object was added to the flow. Within a Job or Job Sequence, stage names must be unique. Instead of using the full object name, a 2-, 3-, or 4-character abbreviation should be used for the Class Word suffix, after the subject name and subject modifier. A list of frequently-used stages and their corresponding Class Word abbreviations may be found in Appendix B: DataStage Naming Reference.


2.2.4.10 Data Stores
For the purposes of this section, a data store is a physical piece of disk storage where data is held for some period of time. In DataStage terms, this can be either a table in a database structure or a file contained within a disk directory or catalog structure. Data held in a database structure is referred to as either a Table or a View. In data warehousing, two additional subclasses of table might be used: Dimension and Fact. Data held in a file in a directory structure will be classified according to its type, for example: Sequential File, Parallel Data Set, Lookup File Set, etc.
The concept of source and target can be applied in a couple of ways. Every job in a series of jobs could consider the data it gets in to be a source and the data it writes out as being a target. However, for the sake of this naming convention, a Source will only be data that is extracted from an original system, and a Target will be the data structures that are produced or loaded as the final result of a particular series of jobs. This is based on the purpose of the project: to move some data from a source to a target. Data Stores used as temporary structures to land data between jobs, supporting restart and modularity, should use the same names in the originating job and any downstream jobs reading the structure.
Examples of Data Store naming are:
• Transaction Header Sequential File or Txn_Header_SF
• Customer Dimension or Cust_Dim (this optionally could be further qualified as Cust_Dim_Tgt if you wish to qualify it as a final target)
• Customer Table or Cust_Tab
• General Ledger Account Number View or GL_Acctno_View

2.2.4.11 Transformer Stage and Stage Variables
A Transformer Stage name can be long (over 50 characters) and can contain underscores. Therefore the name can be descriptive and readable through word capitalization and underscores. DataStage Enterprise Edition supports two types of Transformers:
• "Tfm": Parallel (Enterprise Edition) Transformer
• "BTfm": BASIC (Server Edition) Transformer
IMPORTANT: For maximum performance and scalability, BASIC Transformers should be avoided in Enterprise Edition data flows.
A Transformer Stage Variable can have a long name consisting of alphanumeric characters but not underscores. Therefore the Stage Variable name must be made readable only by using Capitalized words. The Class Word suffix is Stage Variable or "SV". Stage Variables should be named according to their purpose.


When developing Transformer derivation expressions, it is important to remember that Stage Variable names are case-sensitive.

2.2.4.12 DataStage Routines
DataStage BASIC routine names will indicate their function and they will be grouped in sub-categories by function under a main category of Custom, for example: Routines/Custom/SetDSParamsFromFile. A How-To document describing the appropriate use of the routine must be provided by the author of the routine, and placed in a documentation repository.
DataStage Custom Transformer routine names will indicate their function and they will be grouped in sub-categories by function under a main category of Custom, for example: Routines/Custom/DetectTeradataUnicode. Source code, a makefile, and the resulting object for each Custom Transformer routine must be placed in the project phase source directory, for example: /home/dsadm/dev/bin/source.

2.2.4.13 File Names
Source file names should include the name of the source database or system and the source table name or copybook name. The goal is to connect the name of the file with the name of the storage object on the source system. Source flat files will have a unique serial number composed of the date, "_ETL_" and time, for example: Client_Relationship_File1_In_20060104_ETL_184325.psv
Intermediate datasets are created between modules. Their names will include the name of the module that created the dataset OR the contents of the dataset, since more than one module may use the dataset after it is written, for example: BUSN_RCR_CUST.ds
Target output files will include the name of the target database or system and the target table name or copybook name. The goal is the same as with source files: to connect the name of the file with the name of the file on the target system. Target flat files will have a unique serial number composed of the date, "_ETL_" and time, for example: Client_Relationship_File1_Out_20060104_ETL_184325.psv
Files and datasets will have suffixes that allow easy identification of the content and type. DataStage proprietary format files have required suffixes; these are identified in italics in the table below, which defines the types of files and their suffixes.
File Type                                       Suffix
Flat delimited and non-delimited files          .dat
Flat pipe (|) delimited files                   .psv
Flat comma-and-quote delimited files            .csv
DataStage datasets                              .ds
DataStage filesets                              .fs
DataStage hash files                            .hash
Orchestrate schema files                        .schema
Flat delimited or non-delimited REJECT files    .rej
DataStage REJECT datasets                       _rej.ds
Flat delimited or non-delimited ERROR files     .err
DataStage ERROR datasets                        _err.ds
Flat delimited or non-delimited LOG files       .log
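The date / "_ETL_" / time serial number described above can be generated mechanically. The following shell sketch is illustrative only; the variable names and the idea of scripting the name are assumptions, and only the resulting format comes from the naming standard:

# Sketch: compose a source flat-file name following the
# <subject>_In_<YYYYMMDD>_ETL_<HHMMSS>.psv convention described above.
SERIAL="$(date +%Y%m%d)_ETL_$(date +%H%M%S)"
SRC_FILE="Client_Relationship_File1_In_${SERIAL}.psv"
# e.g. Client_Relationship_File1_In_20060104_ETL_184325.psv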

2.3 Documentation and Annotation

DataStage Designer provides description fields for each object type. These fields allow the developer to provide additional descriptions that can be captured and used by administrators and other developers. The Short Description field is also displayed on summary lines within the Director and Manager clients. At a minimum, description annotations must be provided in the Job Properties Short Description field for each job and job sequence, as shown below:

Figure 8: Job Level Short Description

Within a job, the Annotation tool should be used to highlight steps in a given job flow. Note that by changing the vertical alignment properties (for example, Bottom) the annotation can be drawn around the referenced stage(s), as shown in the following example.

Figure 9: Example Job Annotation

DataStage also allows descriptions to be attached to each stage, within the General tab of the stage properties. Each stage should have a short description of its function specified within the stage properties. More complex operators or operations should have correspondingly longer and more complex explanations on this tab.
These descriptions will appear in the job documentation automatically generated from jobs and sequencers adhering to the standards in this document. Examples of such annotations include:
Job "short" description: This Job takes the data from GBL Oracle Table AD_TYP and does a truncate load into Teradata Table AD_TYP.
ODBC Enterprise stage read: Read the GLO.RcR_GLOBAL_BUSN_CAT_TYP table from jpORACLE_SERVER using the ODBC driver. There are no selection criteria in the WHERE clause.
Oracle Enterprise stage read: Read the GLOBAL.GLOBAL_REST_CHAR table from jpORACLE_SERVER using the Oracle Enterprise operator. There are no selection criteria in the WHERE clause.

Remove Duplicates stage: This stage removes all but one record with duplicate BUSN_OWN_TYP_ID keys.
Lookup stage: This stage validates the input and writes rejects. / This stage validates the input and continues.
Transformer stage: This stage generates sequence numbers that have a less-than-file scope. / This stage identifies changes and drops records not matched (not updated). / This stage converts null dates.
Copy stage: This stage sends data to the TDMLoadPX stage for loading into Teradata, and to a dataset for use as write-through cache. / This stage is cosmetic and is optimized out. / This stage renames and/or drops columns and is NOT optimized out.
Modify stage: This stage performs data conversions not requiring a transformer.
Data Set stage: This stage writes the GLOBAL_Ad_Typ dataset, which is used as write-through cache to avoid the use of Teradata in subsequent jobs. / This stage reads the GLOBAL_Lcat dataset, which is used as write-through cache to avoid the use of Teradata.
Sequential File stage: This is the source file for the LANG table. / This is the target file for business qualification process rejects.
Teradata MultiLoad stage: Load the RcR_GLOBAL_LCAT_TYP table.

2.4 Working with Source Code Control Systems
DataStage's built-in repository manages objects (jobs, sequences, table definitions, routines, custom components, etc.) during job development. However, this repository is not capable of managing non-DataStage components (for example, UNIX shell scripts, environment files, job scheduler configurations) that may be part of a completed application.

Source code control systems (such as ClearCase, PVCS, SCCS) are useful for managing the development lifecycle of all components of an application, organized into specific releases for version control.
DataStage does not directly integrate with source code control systems, but it does offer the ability to exchange information with these systems. The Manager client is the primary interface to the DataStage object repository. Using Manager, you can export objects (job designs, table definitions, custom stage types, user-defined routines, and so on) from the repository as clear-text format files. These files can then be checked into the external source code control system. It is the responsibility of the DataStage developer to maintain DataStage objects within the source code control system.
The export file format for DataStage objects can be either .DSX (DataStage eXport format) or .XML. Both formats contain the same information, although the XML file is generally much larger. Unless there is a need to parse information in the export file, .DSX is the recommended export format.

2.4.1 Source Code Control Standards
The first step to effective integration with source code control systems is to establish standards and rules for managing this process:

a) Establish Category naming and organization standards
DataStage objects can be exported individually or by category (folder hierarchy). Grouping related objects by folder can simplify the process of exchanging information with the external source code control system. This object grouping also helps establish a manageable "middle ground" between entire project exports and individual object exports.

b) Define rules for exchange with source code control
As a graphical development environment, Designer facilitates iterative job design. It would be cumbersome to require the developer to check in every change to a DataStage object in the external source code control system. Rather, rules should be defined for when this transfer should take place. Typically, milestone points in the development lifecycle are a good point for transferring objects to the source code control system - for example, when a set of objects has completed initial development, unit test, and so on.

c) Don't rely on the source code control system for backups
Because the rules defined for transfer to the source code control system will typically be only at milestones in the development cycle, they would not be an effective backup strategy. Furthermore, operating system backups of the project repository files only establish a "point in time", and cannot be used to restore individual objects. For these reasons, it is important that an identified individual maintains backup copies of the important job designs using .DSX file exports to a local or (preferably) shared file system.

These backups can be done on a scheduled basis by an Operations support group, or by the individual DataStage developer. In either case, the developer should create a local backup prior to implementing any extensive changes.

2.4.2 Using Object Categorization Standards
As discussed in Section 2.2.2: Category Hierarchy, establishing and following a consistent naming and categorization standard is essential to the change management process. Assigning related objects to the same category provides a balanced level of granularity when exporting and importing objects with external source code control systems.

2.4.3 Export to Source Code Control System
The process of exporting DataStage objects to a source code control system is straightforward. It can be done interactively by the developer or project manager using the Manager client, as explained in this section. The DataStage client also includes Windows command-line utilities for automating the export process. These utilities (dsexport and dscmdexport) are documented in the DataStage Manager Guide.
The DataStage Manager can export at the Project, Category, and individual Object levels. All exports from the DataStage repository are performed on the Windows workstation. There is no server-side project export facility.
• Select the object or category in the Manager browser.

Figure 10: Manager Category browser

• Choose "Export DataStage Components" from the "Export" menu.

NOTE: Objects cannot be exported from DataStage if they are open in Designer. Make sure all objects are saved and closed before exporting.

Figure 11: Manager Export Options

The filename for export is specified in the "Export to file:" field at the top of the Export dialog. To export a group of objects to a single export file, the option "Selection: By category" should be specified in the "Options" tab. If you wish to include compiled Transformer objects for a selected job, make sure the "Job Executables" category is checked.
• Using your source code control utilities, check in the exported .DSX file.
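Where the export and check-in steps are scripted, they might look like the sketch below. This is illustrative only: the dscmdexport option syntax should be verified against the Manager Guide for your release, the host, credentials, and paths are placeholders, and the CVS commands merely stand in for whatever source code control system the project actually uses.

# Export the project from the Windows workstation, then check the .DSX file in.
dscmdexport /H=dsserver /U=dsadm /P=secret DevProject C:\dsexports\DevProject.dsx
cvs add DevProject.dsx
cvs commit -m "Milestone export: unit test complete" DevProject.dsx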

2.4.4 Import from Source Code Control System
In a similar manner, the import of objects from an external source code control system is a straightforward process. Import can be interactive through the Manager client (as described in this section), or automated through command-line utilities. Unlike the export process, command-line import utilities are available for both Windows workstation and DataStage server platforms. The Windows workstation utilities (dsimport and dscmdimport) are documented in the DataStage Manager Guide.
• Use the source code control system to check out (or export) the .DSX file to your client workstation.
• Import objects in the .DSX file using Manager. Choose "Import DataStage Components" from the "Import" menu.

Figure 12: Manager Import options

• Select the file you checked out of your source code control system by clicking the ellipsis ("…") next to the filename field in the import dialog. After selecting your file, click OK to import.
• The import of the .DSX file will place each object in the same DataStage category it originated from. This means that, if necessary, the import will create the Job Category if it doesn't already exist.
• If the objects were not exported with the "Job Executables", then compile the imported objects from Designer, or by using the Multi-Job Compile tool.
For test and production environments, it is possible to import the job executables from the DataStage server host using the dsjob command line, as documented in the DataStage Development Kit chapter of the Parallel Job Advanced Developer's Guide. Note that using dsjob will only import job executables; job designs can only be imported using the Manager client or the dsimport or dscmdimport client tools.

2.5 Understanding a Job's Environment
DataStage Enterprise Edition provides a number of environment variables to enable / disable product features and to fine-tune job performance. Although operating system environment variables can be set in multiple places, there is a defined order of precedence that is evaluated when a job's actual environment is established at runtime:

1) The daemon for managing client connections to the DataStage server engine is called dsrpcd. By default (in a root installation), dsrpcd is started when the server is installed, and should start whenever the machine is restarted. dsrpcd can also be manually started and stopped using the $DSHOME/uv -admin command. (For more information, see the DataStage Administrator Guide.)
By default, DataStage jobs inherit the dsrpcd environment, which on UNIX platforms is set in the /etc/profile and $DSHOME/dsenv scripts. On Windows, the default DataStage environment is defined in the registry. Note that client connections DO NOT pick up per-user environment settings from their $HOME/.profile script. On USS environments, the dsrpcd environment is not inherited since DataStage jobs do not execute on the conductor node.

2) Environment variable settings for particular projects can be set in the DataStage Administrator client. Any project-level settings for a specific environment variable will override any settings inherited from dsrpcd.
These settings are stored in a file named DSPARAMS in the project directory. Any project-level environment variables must be set for new projects using the Administrator client, or by carefully editing the DSPARAMS file within the project. Refer to the DataStage Administration, Management, and Production Automation Best Practice for additional details.
IMPORTANT: When migrating projects between machines or environments, it is important to note that project-level environment variable settings are not exported when a project is exported.

3) Within Designer, environment variables may be defined for a particular job using the Job Properties dialog box. Any job-level settings for a specific environment variable will override any settings inherited from dsrpcd or from project-level defaults.

To avoid hard-coding default values for job parameters, there are three special values that can be used for environment variables within job parameters:
• $ENV - causes the value of the named environment variable to be retrieved from the operating system of the job environment. Typically this is used to pick up values set in the operating system outside of DataStage.
NOTE: $ENV should not be used for specifying the default $APT_CONFIG_FILE value because, during job development, the Designer parses the corresponding parallel configuration file to obtain a list of node maps and constraints (advanced stage properties).
• $PROJDEF - causes the project default value for the environment variable (as shown in the Administrator client) to be picked up and used to set the environment variable and job parameter for the job.

• $UNSET - causes the environment variable to be removed completely from the runtime environment.
It may be helpful to create a Job Template and include these environment variables in the parameter settings.

2.5.1 Environment Variable Settings
An extensive list of environment variables is documented in the DataStage Parallel Job Advanced Developer's Guide. Several environment variables are evaluated only for their presence in the environment (for example, APT_SORT_INSERTION_CHECK_ONLY). This section is intended to call attention to some specific environment variables, and to document a few that are not part of the documentation.

2.5.1.1 Environment Variable Settings for All Jobs
IBM recommends the following environment variable settings for all DataStage Enterprise Edition jobs. These settings can be made at the project level, or may be set on an individual basis within the properties for each job.

$APT_CONFIG_FILE = filepath
    Specifies the full pathname to the EE configuration file. This variable should be included in all job parameters so that it can be easily changed at runtime.
$APT_DUMP_SCORE = 1
    Outputs the EE score dump to the DataStage job log, providing detailed information about the actual job flow including operators, processes, and Data Sets. Extremely useful for understanding how a job actually ran in the environment. (See 12.4.2: Understanding the Parallel Job Score.)
$OSH_ECHO = 1
    Includes a copy of the generated osh in the job's DataStage log.
$APT_RECORD_COUNTS = 0
    Outputs record counts to the DataStage job log as each operator completes processing. The count is per operator per partition. This setting should be disabled by default, but be part of every job design so that it can be easily enabled for debugging purposes.
$OSH_PRINT_SCHEMAS = 0
    Outputs the actual runtime metadata (schema) to the DataStage job log. This setting should be disabled by default, but be part of every job design so that it can be easily enabled for debugging purposes.
$APT_PM_SHOW_PIDS = 1
    Places entries in the DataStage job log showing the UNIX process ID (PID) for each process started by a job. Does not report PIDs of DataStage "phantom" processes started by Server shared containers.
$APT_BUFFER_MAXIMUM_TIMEOUT = 1
    Maximum buffer delay in seconds.
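These defaults are normally established as project-level defaults in the Administrator client or as parameters in a template job. Purely as an illustration, the same values expressed as dsenv-style shell exports would look like the sketch below; the configuration file path shown is an assumption.

# Illustrative dsenv-style settings; substitute the real configuration file path.
APT_CONFIG_FILE=/opt/Ascential/DataStage/Configurations/default.apt; export APT_CONFIG_FILE
APT_DUMP_SCORE=1;             export APT_DUMP_SCORE
OSH_ECHO=1;                   export OSH_ECHO
APT_RECORD_COUNTS=0;          export APT_RECORD_COUNTS
OSH_PRINT_SCHEMAS=0;          export OSH_PRINT_SCHEMAS
APT_PM_SHOW_PIDS=1;           export APT_PM_SHOW_PIDS
APT_BUFFER_MAXIMUM_TIMEOUT=1; export APT_BUFFER_MAXIMUM_TIMEOUT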

The following platform-specific settings also apply:

$APT_IO_NOMAP (Solaris platforms only)
    When working with very large parallel Data Sets (where the individual data segment files are larger than 2 GB), you must define the environment variable $APT_IO_NOMAP.
$APT_PM_NO_SHARED_MEMORY = 1 (Tru64 5.1A platforms only)
    On Tru64 platforms, the environment variable $APT_PM_NO_SHARED_MEMORY should be set to 1 to work around a performance issue with shared memory MMAP operations. This setting instructs EE to use named pipes rather than shared memory for local data transport.

2.5.1.2 Additional Environment Variable Settings
Throughout this document, a number of environment variables will be mentioned for tuning the performance of a particular job flow, assisting in debugging, or changing the default behavior of specific Enterprise Edition stages. The environment variables mentioned in this document are summarized in Appendix D: Environment Variable Reference. An extensive list of environment variables is documented in the DataStage Parallel Job Advanced Developer's Guide.

3 Development Guidelines

3.1 Modular Development
Modular development techniques should be used to maximize re-use of DataStage jobs and components:
• Job parameterization allows a single job design to process similar logic instead of creating multiple copies of the same job. The Multiple-Instance job property allows multiple invocations of the same job to run simultaneously.
• A set of standard job parameters should be used in DataStage jobs for source and target database parameters (DSN, user, password, etc) and directories where files are stored. To ease re-use, these standard parameters and settings should be made part of a Designer Job Template.
• Create a standard directory structure outside of the DataStage project directory for source and target files, intermediate work files, and so forth.
• Where possible, create re-usable components such as parallel shared containers to encapsulate frequently-used logic.

3.2 Establishing Job Boundaries
It is important to establish appropriate job boundaries when developing with DS/EE. While it may be possible to construct a large, complex job that satisfies given functional requirements, this may not be appropriate. In some cases, functional requirements may dictate job boundaries; for example, it may be appropriate to update all dimension values before inserting new entries in a data warehousing fact table. But functional requirements may not be the only factor driving the size of a given DataStage job. Factors to consider when establishing job boundaries include:
• Establishing job boundaries through intermediate Data Sets creates "checkpoints" that can be used in the event of a failure when processing must be restarted. Without these checkpoints, processing must be restarted from the beginning of the job flow. It is for these reasons that long-running tasks are often segmented into separate jobs in an overall sequence.
  o For example, if the extract of source data takes a long time (such as an FTP transfer over a wide area network) it would be good to land the extracted source data to a parallel Data Set before processing.
  o As another example, it is generally a good idea to land data to a parallel Data Set before loading to a target database unless the data volume is small or the overall time to process the data is minimal.


• Larger, more complex jobs require more system resources (CPU, memory, swap) than a series of smaller jobs, sequenced together through intermediate Data Sets. Resource requirements are further increased when running with a greater degree of parallelism specified by a given configuration file. However, the sequence of smaller jobs generally requires more disk space to hold intermediate data, and the speed of the I/O subsystem can impact overall end-to-end throughput. Section 12.3: Minimizing Runtime Processes and Resource Requirements provides some recommendations for minimizing resource requirements of a given job design, especially when the volume of data does not dictate parallel processing.

• Breaking large job flows into smaller jobs may further facilitate modular development and re-use if business requirements for more than one process depend on intermediate data created by an earlier job.
• The size of a job directly impacts the speed of development tasks such as opening, saving, and compiling. These factors may be amplified when developing across a wide-area or high-latency network connection. In extreme circumstances this can significantly impact developer productivity and ongoing maintenance costs.
• The startup time of a given job is directly related to the number of stages and links in the job flow. Larger, more complex jobs require more time to start up before actual data processing can begin. Job startup time is further impacted by the degree of parallelism specified by the parallel configuration file. Remember that the number of stages in a parallel job includes the number of stages within each shared container used in a particular job flow.

As a rule of thumb, keeping job designs to less than 50 stages may be a good starting point. But this is not a hard-and-fast rule. The proper job boundaries are ultimately dictated by functional / restart / performance requirements, expected throughput and data volumes, degree of parallelism, number of simultaneous jobs and their corresponding complexity, and the capacity and capabilities of the target hardware environment. Combining or splitting jobs is relatively easy, so don't be afraid to experiment and see what works best for your jobs in your environment.

3.3 Job Design Templates

DataStage Designer provides the developer with re-usable Job Templates, which can be created from an existing Parallel Job or Job Sequence using the “New Template from Job” command.


Template jobs should be created with:
- standard parameters (for example, source and target file paths, database login properties, and so on)
- environment variables and their default settings (as outlined in Section 2.5.1 Environment Variable Settings)
- annotation blocks
In addition, template jobs may contain any number of stages and pre-built logic, allowing multiple templates to be created for different types of "standardized" processing.
By default, the Designer client stores all job templates in the local "Templates" directory within the DataStage client install directory, for example:
C:\Program Files\Ascential\DataStage751\Templates
To facilitate greater re-use of job templates, especially in team-based development, the template directory can be changed using the Windows Registry Editor. This change must be made on each client workstation, by altering the following registry key:
HKEY_LOCAL_MACHINE\SOFTWARE\Ascential Software\DataStage Client\CurrentVersion\Intelligent Assistant\Templates

3.4 Default Job Design

Default job designs include all of the capabilities detailed in Section 2: Standards. Template jobs should contain all the default characteristics and parameters the project requires. These defaults provide at a minimum:
1. Development phase neutral storage (e.g.: dev, si, qa and prod; see the directory sketch after this list);
2. Support for Teradata, Oracle, DB2/UDB and SQL Server login requirements;
3. Enforced project standards;
4. Optional operational metadata (runtime statistics) suitable for loading into a database; and
5. Optional auditing capabilities.
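The directory sketch below illustrates one possible phase-neutral layout for item 1. The root and subject-area names are assumptions; the structure simply mirrors the /jpSTAGING/jpENVIRON/jpSUBJECT_AREA pathname rule given in Section 3.5.

# Hypothetical layout: one staging tree per project phase.
STAGING=/staging          # assumed value of jpSTAGING
SUBJECT_AREA=CUSTOMER     # assumed value of jpSUBJECT_AREA
for ENV in dev si qa prod; do
    mkdir -p "${STAGING}/${ENV}/${SUBJECT_AREA}"
done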

The default job design specifically will support the creation of write-through cache in which data in load-ready format is stored in DS/EE Data Sets for use in the load process or in the event the target table becomes unavailable. The default job design incorporates several features and components of DataStage that are used together to support tactical and strategic job deployment. These features include:
1. Re-startable job sequencers which manage one or more jobs, detect and report failure conditions, provide monitoring and alert capabilities, and support checkpoint restart functionality.
2. Custom routines written in DataStage BASIC (DS Basic) that detect external events, manage and manipulate external resources, provide enhanced notification and alert capabilities, and interface to the UNIX operating system.
3. DataStage Enterprise Edition (DS/EE) ETL jobs that exploit job parameterization, runtime UNIX environment variables, and conditional execution.

Each subject area is broken into sub-areas and each sub-area may be further subdivided. These sub-areas are populated by a DataStage job sequencer utilizing two types of DataStage jobs at a minimum:
1. A job that reads source data and:
• transforms it to load-ready format
• optionally stores its results in a write-through cache DataStage Data Set or loads the data to the target table.
2. A job that reads the DataStage dataset and loads it to the target table.
Other sections will discuss each of these components in detail and give examples of their use in a working example job sequencer.

3.5 Job Parameters

Parameters are passed to a job as either DataStage job parameters or as environment variables. Job parameters can be set from a file and are distinguished by the presence of a 'jp' prefix in the variable name; this prefix is part of the DataStage development standard. The names of environment variables have no prefix when they are set (UNIX_VAR="some value") and a prefix of "$" when used (myval=$UNIX_VAR). Job parameters are passed from a job sequencer to the jobs in its control as if a user were answering the runtime dialog questions displayed in the DataStage Director job-run dialog. Default environment variables cannot be reset during this dialog unless explicitly specified in the job.
The scope of a parameter depends on its type. Essentially:
o The scope of a job parameter is specific to the job in which it is defined and used. Job parameters are stored internally within DataStage for the duration of the job, and are not accessible outside that job.
o The scope of a job parameter can be extended by the use of a job sequencer, which can manage and pass job parameters among jobs.
o The scope of an environment variable is wider, as it is defined at the operating system level, though conversely the use of environment variables is limited within this exercise.
Job parameters are required for the following DataStage programming elements:
1. File name entries in stages that use files or Data Sets must NEVER use a hard-coded operating system pathname.
  a. Staging area files must ALWAYS have pathnames as follows: /jpSTAGING/jpENVIRON/jpSUBJECT_AREA[filename.suffix]
  b. DataStage datasets must ALWAYS have pathnames as follows: /jpDSTAGE_ROOT/jpENVIRON/datasets/[filename.suffix]
2. Database stages must ALWAYS use variables for the server name, schema (if appropriate), userid and password.
Use and management of job parameters, as well as standardized routines for use in Job Sequencers, are discussed further in Parallel Framework Standard Practices: Administration, Management, and Production Automation. A sketch of how these parameters are referenced follows.
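For illustration only, the sketch below shows how 'jp' parameters might be supplied from the command line with dsjob (the project name, job name, and values are invented), followed by one plausible rendering of the staging pathname rule as it would appear in a stage property, where job parameters are delimited by # signs:

# Hypothetical invocation; project name, job name, and values are placeholders.
dsjob -run \
      -param jpENVIRON=dev \
      -param jpSTAGING=/staging \
      -param jpSUBJECT_AREA=CUSTOMER \
      DevProject LoadCustomerDim
# A Sequential File stage property could then reference the parameters as:
#   #jpSTAGING#/#jpENVIRON#/#jpSUBJECT_AREA#/customer_in.psv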


3.6 Parallel Shared Containers
Parallel Shared Containers allow common logic to be shared across multiple jobs. For maximum component re-use, enable RCP at the project level and for every stage within the parallel shared container. Using RCP, the container input and output links need contain only the columns relevant to the container processing; any additional columns are passed through the container at runtime without the need to separate and re-merge.
Because Parallel Shared Containers are inserted when a job is compiled, all jobs that use a shared container must be recompiled when the container is changed. The Usage Analysis and Multi-Job Compile tools can be used to recompile jobs that use a shared container.

3.7 Error and Reject Record Handling
Reject rows are those rows that fail active or passive business rule driven validation, as specified in the job design document. Reject files will include those records rejected from the ETL stream due to Referential Integrity failures, data rule violations or other reasons that would disqualify a row from processing. The presence of rejects may indicate that a job has failed and prevent further processing. The exact policy for each reject is specified in the job design document. Specification of this action is the responsibility of the Business Analyst and will be published in the design document.
Error rows are those rows caused by unforeseen data events such as values too large for a column or text in an unsupported language. Error files will include those records from sources that fail quality tests. The presence of errors may not prevent further processing. Specification of this action is the responsibility of the Business Analyst and will be published in the design document.
Both rejects and errors will be archived and placed in a special directory for evaluation or other action by support staff. The presence of rejects and errors will be detected and notification sent by email to selected staff, and further, whether the job or ETL processing is to continue is specified on a per-job and/or per-sequence and/or per-script basis, based on business requirements. These activities are the responsibility of job sequencers used to group jobs by some reasonable grain, or of a federated scheduler.


If there are multiple validations to perform. The *. All rights reserved. Use the Fail option. Pass successful lookups to the output stream. to enforce error management ONLY ONE REFERENCE LINK is allowed on a Lookup stage. or translated into any language in any form by any means without the written permission of IBM. a local error handler based on a shared container can be used. each must be done in its own Lookup.err extension when rejects can be ignored but need to be recorded. Furthermore.rej file or tag and merge with the output stream. the *. however.err file or tag and merge with the output stream. and should be reviewed by the Data Steward. No part of this publication may be reproduced.3 Reject Handling with the Transformer Stage Rejects occur when a transformer stage is used and a row: Parallel Framework Red Book: Data Flow Job Design July 17. If a file is created by this option. and rejects can occur if the key fields are not found in the reference data.7.rej extension is used when rejects require investigation after a job run. Rejects are categorized in the ETL job design document using the following ranking: Category 1 Description Rejects are expected and can be ignored Rejects can exist in the data. Pass successful lookups to the output stream. Drop Drop lookup failures from the input stream. DS/EE offers the following options within a Lookup stage: Option Description Continue Ignore lookup failures and pass lookup fields as nulls to the output stream. Pass successful lookups to the output stream. Alternatively. 2006 45 of 179 © 2006 IBM Information Integration Solutions. Lookup Stage Option Drop if lookup fields are necessary down stream or Continue if lookup fields are optional Send the reject stream to an *. The reject option should be used in all cases where active management of the rejects is required.7. Send the reject stream to an *.rej or *. Rejects should not exist but should not stop the job. transmitted. . Rejects should not exist and should stop the job. stored in a retrieval system. it must have a *. 2 3 4 3. they only need to be recorded but not acted on. Fail Abort job on lookup failure Reject Reject lookup failures to the reject stream.2 Reject Handling with the Lookup Stage The Lookup stage compares a single input stream to one or more reference streams using keys. This behavior makes the Lookup stage very valuable for positive (reference is found) and negative (reference is NOT found) business rule validation.err file extension.Information Integration Solutions Center of Excellence 3. transcribed.

A message is always written to the Director log which details the count of rows successfully read and rows rejected. a shared container error handler is used.err file or tag and merge with the output stream.err extension when rejects can be ignored but need to be recorded. and be reviewed by the Data Steward. OR 2. it must have a *.4 Reject Handling with target database stages Some database stages (such as DB2/UDB Enterprise. Otherwise. 3. . 2006 46 of 179 © 2006 IBM Information Integration Solutions. Satisfies requirements for a reject conditional output stream. stored in a retrieval system. To capture rejects from a target database. transcribed. Rejects are categorized in the ETL job design document using the following ranking: Category 1 2 3 4 Description Rejects are expected and can be ignored. Pass rows that fail to be written to the reject stream.rej extension is used when rejects require investigation after a job run. All rights reserved. Rejects should not exist and should stop the job. reject rows will not be captured. Alternatively. Cannot satisfy requirements of any conditional output stream and is rejected by the default output stream. No part of this publication may be reproduced. Alternatively. Rejects can exist in the data. and Oracle Enterprise) offer an optional reject link that can be used to capture rows that cannot be written to the target database. The *. Send the reject stream to a reject file and halt the job. it must have a *. ODBC Enterprise. however. Transformer Stage Option Funnel the reject stream back to the output stream(s). transmitted. a shared container error handler can be used. they only need to be recorded but not acted on. the *. Rejects should not exist but should not stop the job. a reject link must exist on that stage. Parallel Framework Red Book: Data Flow Job Design July 17.rej file extension.7. or translated into any language in any form by any means without the written permission of IBM.rej file or tag and merge with the output stream. If a file is created by this option. Send the reject stream to an *.err file extension. If a file is created from the reject stream. Target database stages offer the following reject options: Option No reject link exists Reject link exists Description Do not capture rows that fail to be written. Send the reject stream to an *. The reject option should be used in all cases where active management of the rejects is required.Information Integration Solutions Center of Excellence 1.rej or *.

Send the reject stream to a *. 2006 47 of 179 Parallel Framework Red Book: Data Flow Job Design © 2006 IBM Information Integration Solutions.Rejects are tracked by count only.err Rows will be converted to the common file record format with 9 columns (below) using Column Export and Transformer stages for each reject port. and 5. The job serial number (jpJOBSERIALNO) and a period “.7. stored in a retrieval system.rej file. Only records that match the given table definition and database constraints are written.Information Integration Solutions Center of Excellence Rejects are categorized in the ETL job design document using the following ranking: Category 1 Description Rejects are expected and can be ignored Target Database Stage Option No reject link exists. 3. Reject link exists. 2. The project phase (jpENVIRON) and a underscore “_”. 3. or translated into any language in any form by any means without the written permission of IBM.”. 2 Rejects should not exist but should not stop the job. . For example. transmitted.5. 3. one of “rej” or “err”. The Column Export and Transformer stages may be kept in a template Shared Container the developer will make local in each job. job DECRP_N_XformClients in the ECR_FACTS project in the development environment with a serial number of 20060201-ETL-091504 would have these reject and error file names: ECR_FACTS_DECRP_N_XformClients_dev_20060201-ETL-091504. and gathered using a Funnel stage that feeds a Sequential File stage. The standard columns for error and reject processing are: Column Name HOST_NAME Key? Yes Data Source DSHostName transformer macro in the error handler July 17. This section deals with both methods of handling errors. 4.5 Error Processing Requirements Jobs will produce flat files containing reject and errors and may alternatively process rows on reject ports and merge these rows with the normal output stream.7.1 Processing Errors and Rejects to a Flat File Each job will produce a flat file for errors and a flat file for rejects with a specific naming convention: 1. No part of this publication may be reproduced. The project name (jpPROJECT_NAME) and a underscore “_”.rej ECR_FACTS_DECRP_N_XformClients_dev_20060201-ETL-091504. The appropriate file type. All rights reserved. and should be reviewed by the Data Steward. transcribed. The job name (jpJOB_NAME) and a underscore “_”.

The Transformer stage adds the required key columns. All rights reserved. Figure 13: Error Processing Components Parallel Framework Red Book: Data Flow Job Design July 17.Information Integration Solutions Center of Excellence PROJECT_NAME JOB_NAME STAGE_NAME DATA_OBJ_NAME JOB_SERIALNO ETL_ROW_NUM ETL_BAT_ID ROW_DATA Yes Yes Yes Yes Yes Yes Yes No DSProjectName transformer macro in the error handler DSJobName transformer macro in the error handler The name of the stage from which the error came The source table or file data object name jpJOBSERIALNO Data stream coming in to the error handler Data stream coming in to the error handler The columns from the upstream stages reject port exported to a single pipe-delimited “|” varchar(2000) column using the Column Export stage in the error handler In this example. transmitted. stored in a retrieval system. or translated into any language in any form by any means without the written permission of IBM. . No part of this publication may be reproduced. 2006 48 of 179 © 2006 IBM Information Integration Solutions. transcribed. the following stages process the only errors produced by a job: The Column Export stage maps the unique columns to the single standard column.

Track*) to a single output column. ROW_DATA: Figure 14: Error Processing Column Export stage And the downstream Transformer stage builds the standard output record by creating the required keys: Parallel Framework Red Book: Data Flow Job Design July 17. 2006 49 of 179 © 2006 IBM Information Integration Solutions. No part of this publication may be reproduced.Information Integration Solutions Center of Excellence The input to the Column Export stage explicitly converts the data unique to the reject stream (in this case. transcribed. transmitted. . or translated into any language in any form by any means without the written permission of IBM. All rights reserved. stored in a retrieval system.

OR with columns contain illegal values for some operation performed on said columns. stored in a retrieval system. A failed switch will reject an intact input row show key fails to resolve to one of the Switch output stream. Method Connect the reject port to a Transformer stage where those columns selected for replacement are set to specific values. Connect the reject port to a Transformer stage where columns are set to specific values. Stage Lookup Description A failed lookup will reject an intact input row whose key fails to match the reference link key.Information Integration Solutions Center of Excellence Figure 15: Error Processing Transformer stage 3. transcribed. Connect the output stream of the Transformer stage and one or more output streams of the Switch stage to a Funnel stage to merge the two (or more) streams. or translated into any language in any form by any means without the written permission of IBM. One or more columns may have been selected for replacement when a reference key is found. rows rejected by the Lookup stage are processed by a corrective Transformer stage where the failed references as set to a specific value and then merged with the output of the Lookup stage: Parallel Framework Red Book: Data Flow Job Design July 17. All rights reserved. This is done by processing the rows from the reject ports and setting the value of a specific column with a value specified by the design document. . No part of this publication may be reproduced. In either case. attaching a nonspecific reject stream (referred to as the stealth reject stream) will gather rows from either condition to the reject stream. 2006 50 of 179 © 2006 IBM Information Integration Solutions.7. Connect the output stream of the Transformer and Lookup stages to a Funnel stage to merge the two streams. Connect the reject port to a Transformer stage where columns are set to specific values. The following table identifies the tagging method to be used for the previously cited operators. A Transformer will reject an intact input row that cannot pass conditions specified on the output streams. Connect the output stream of the corrective Transformer stage and one or more output streams of the original Transformer stage to a Funnel stage to merge the two (or more) streams. transmitted.5.2 Processing Errors and Rejects and Merging with an Output Stream There may be processing requirements that specify that rejected or error rows be tagged as having failed a validation and merged back into the output stream. Switch Transformer In this example.

stored in a retrieval system.8 Component Usage DataStage Enterprise Edition offers a wealth of component types for building ETL flows. 3. This section provides guidelines appropriate use of various stages when building a parallel job flows. transmitted. The ability to use a Server Edition component within a parallel job is intended only as a migration option for existing Server Edition applications that might benefit by leveraging some parallel capabilities on SMP platforms.8. No part of this publication may be reproduced.Information Integration Solutions Center of Excellence Figure 16: Error Processing Lookup example 3. or translated into any language in any form by any means without the written permission of IBM. Server Edition components limit overall performance of large-volume job flows since many components such as the BASIC Transformer use interpreted psuedo-code. In clustered an MPP environments Server Edition components only run on the primary (conductor) node. .1 Server Edition Components Avoid the use of Server Edition components in parallel job flows. severely impacting scalability and network resources. All rights reserved. transcribed. 2006 51 of 179 © 2006 IBM Information Integration Solutions. Parallel Framework Red Book: Data Flow Job Design July 17.

Dropping Columns July 17. . Enterprise Edition will optimize this out at runtime) . and Production Automation. 3. For this reason. o Unless the Force property is set to “True”. All rights reserved. stored in a retrieval system. always write to parallel Data Sets. it is important to minimize the number of transformers. it can be used at the end of a data flow o For simple jobs with only two stages.Renaming Columns .BASIC Routines . they should not be used for long-term archive of source data.8. or translated into any language in any form by any means without the written permission of IBM. a Copy stage with a single input link and a single output link will be optimized out of the final job flow at runtime. it is best to develop a job iteratively using the Copy stage as a “placeholder”.Server shared containers Note that BASIC Routines are still appropriate. transmitted. Data Sets offer parallel I/O on read and write operations.Job Design placeholder between stages (unless the Force option =true. Management. Since the Copy stage does not require an output link. Data Sets achieve end-to-end parallelism across job boundaries by writing data in partitioned form. the Copy stage should be used as a placeholder so that new stages can be inserted easily should future requirements change.4 Parallel Transformer stages The DataStage Enterprise Edition parallel Transformer stage generates “C” code which is then compiled into a parallel component. in sort order. parallel Data Sets effectively establish restart points in the event that a job (or sequence) needs to be re-run. 3.2 Copy Stage For complex data flows. for the job control components of a DataStage Job Sequence and Before/After Job Subroutines for parallel jobs.3 Parallel Data Sets When writing intermediate results between DS/EE parallel jobs. and necessary. without overhead for format or data type conversions. This is discussed in more detail in Parallel Framework Standard Practices: Administration. transcribed. 2006 52 of 179 Parallel Framework Red Book: Data Flow Job Design © 2006 IBM Information Integration Solutions. NOTE: Because parallel Data Sets are platform and configuration-specific.Information Integration Solutions Center of Excellence Server Edition Components that should be avoided within parallel job flows include: . Used in this manner. No part of this publication may be reproduced. and to use other stages (such as Copy) when derivations are not needed. 3.8. • The Copy stage should be used instead of a Transformer for simple operations including: . and in Enterprise Edition native format.BASIC Transformers .8.

- Default Type Conversions [see Section 4.1.2: Default and Explicit Type Conversions]
Note that rename, drop (if Runtime Column Propagation is disabled), and default type conversion can also be performed by the output mapping tab of any stage.
• The Modify stage can be used for non-default type conversions, null handling, and character string trimming. See Section 8.2: Modify Stage.
• Optimize the overall job flow design to combine derivations from multiple Transformers into a single Transformer stage when possible.
• Because the parallel Transformer is compiled, it is faster than the interpreted Filter and Switch stages. The only time that Filter or Switch should be used is when the selection clauses need to be parameterized at runtime.
• Consider, if possible, implementing complex derivation expressions that follow regular patterns with Lookup tables instead of using a Transformer with nested derivations. For example, the derivation expression:
If A=0,3 Then B="X"; If A=4,7 Then B="C"
could also be implemented with a lookup table containing values for column A and corresponding values of column B (a conceptual sketch follows at the end of this section).
• NEVER use the "BASIC Transformer" stage in large-volume job flows. Instead, user-defined functions and routines can expand parallel Transformer capabilities. The BASIC Transformer is intended as a "stop-gap" migration choice for existing Server Edition jobs containing complex routines. Even then its use should be restricted, and the routines should be converted as soon as possible.

3.8.5 BuildOp stages
BuildOps should only be used when:
- Complex reusable logic cannot be implemented using the Transformer, or
- Existing Transformers do not meet performance requirements
As always, performance should be tested in isolation to identify the specific cause of bottlenecks.
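As a hedged illustration of the lookup-table recommendation above, the following Python sketch expresses the same derivation logic as a table of values rather than nested conditionals. The column names A and B and the value pairs come from the example; everything else (the dictionary, function name, and default) is invented for the illustration and is not DataStage syntax.

    # Conceptual sketch only: the nested derivation
    #   If A=0,3 Then B="X"; If A=4,7 Then B="C"
    # expressed as a lookup table (conceptually, the reference input of a Lookup).
    derivation_table = {0: "X", 3: "X", 4: "C", 7: "C"}

    def derive_b(a, default=None):
        """Return the derived value of column B for a given value of column A."""
        return derivation_table.get(a, default)

    print(derive_b(3))   # "X"
    print(derive_b(7))   # "C"
    print(derive_b(9))   # None (no rule defined for this value)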

4 DataStage Data Types
The DataStage Designer and Manager represent column data types using SQL notation. Each SQL data type maps to an underlying data type in the Enterprise Edition engine. The internal Enterprise Edition data types are used in schema files and are displayed when viewing generated OSH or when viewing the output from $OSH_PRINT_SCHEMAS. The following table summarizes the underlying data types of DataStage Enterprise Edition:

SQL Type | Internal Type | Size | Description
Date | date | 4 bytes | Date with month, day, and year
Decimal, Numeric | decimal | (roundup(p)+1)/2 | Packed decimal, compatible with IBM packed decimal format
Float, Real | sfloat | 4 bytes | IEEE single-precision (32-bit) floating point value
Double | dfloat | 8 bytes | IEEE double-precision (64-bit) floating point value
TinyInt | int8, uint8 | 1 byte | Signed or unsigned integer of 8 bits (specify unsigned Extended option for unsigned)
SmallInt | int16, uint16 | 2 bytes | Signed or unsigned integer of 16 bits (specify unsigned Extended option for unsigned)
Integer | int32, uint32 | 4 bytes | Signed or unsigned integer of 32 bits (specify unsigned Extended option for unsigned)
BigInt(1) | int64, uint64 | 8 bytes | Signed or unsigned integer of 64 bits (specify unsigned Extended option for unsigned)
Binary, Bit, LongVarBinary, VarBinary | raw | 1 byte per character | Untyped collection, consisting of a fixed or variable number of contiguous bytes and an optional alignment value
Unknown, Char, LongVarChar, VarChar | string | 1 byte per character | ASCII character string of fixed or variable length (Unicode Extended option NOT selected)
NChar, NVarChar, LongNVarChar | ustring | multiple bytes per character | ASCII character string of fixed or variable length (Unicode Extended option NOT selected)
Char, LongVarChar, VarChar | ustring | multiple bytes per character | ASCII character string of fixed or variable length (Unicode Extended option IS selected)
Time | time | 5 bytes | Time of day, with resolution to seconds
Time | time(microseconds) | 5 bytes | Time of day, with resolution of microseconds (specify microseconds Extended option)
Timestamp | timestamp | 9 bytes | Single field containing both date and time value, with resolution to seconds
Timestamp | timestamp(microseconds) | 9 bytes | Single field containing both date and time value, with resolution to microseconds (specify microseconds Extended option)

(1) BigInt values map to long long integers on all supported platforms except Tru64, where they map to longer integer values.

4.1.1 Strings and Ustrings
If NLS is enabled on your DataStage server, parallel jobs support two types of underlying character data types: strings and ustrings. String data represents unmapped bytes; ustring data represents full Unicode (UTF-16) data.
The Char, VarChar, and LongVarChar SQL types relate to underlying string types where each character is 8 bits and does not require mapping because it represents an ASCII character. You can, however, specify that these data types are extended, in which case they are taken as ustrings and do require mapping. (They are specified as such by selecting the Extended check box for the column in the Edit Meta Data dialog box.) An Extended field appears in the columns grid, and extended Char, VarChar, or LongVarChar columns have 'Unicode' in this field. The NChar, NVarChar, and LongNVarChar types relate to underlying ustring types, so they do not need to be explicitly extended.

4.1.2 Default and Explicit Type Conversions
DataStage Enterprise Edition provides a number of default conversions and conversion functions when mapping from a source to a target data type. Default type conversions take place across the stage output mappings of any Enterprise Edition stage.
The following table summarizes Data Type conversions between source field types and target field types (int8, uint8, int16, uint16, int32, uint32, int64, uint64, sfloat, dfloat, decimal, string, ustring, raw, date, time, and timestamp), where:
d = There is a default type conversion from source field type to destination field type.
e = You can use a Modify or a Transformer conversion function to explicitly convert from the source field type to the destination field type.
A blank cell indicates that no conversion is provided.
[Conversion matrix: source field types (rows) versus target field types (columns), marked with d, e, or de flags.]

The conversion of numeric data types may result in a loss of precision and cause incorrect results, depending on the source and result data types. In these instances, Enterprise Edition displays a warning message in the job log.
When converting from variable-length to fixed-length strings, Enterprise Edition pads the remaining length with NULL (ASCII zero) characters by default.
• The environment variable APT_STRING_PADCHAR can be used to change the default pad character from an ASCII NULL (0x0) to another character, for example an ASCII space (0x20) or a Unicode space (U+0020). When entering a space for the value of APT_STRING_PADCHAR, do not enclose the space character in quotes.
• Some stages (for example, Sequential File and DB2/UDB Enterprise targets) allow the pad character to be specified in their stage or column definition properties. When used in these stages, the specified pad character will override the default for that stage only.
• As an alternate solution, the PadString Transformer function can be used to pad a variable-length (Varchar) string to a specified length using a specified pad character. Note that PadString does not work with fixed-length (CHAR) string types. You must first convert a Char string type to a Varchar type before using PadString.

4.2 Null Handling
DataStage Enterprise Edition represents nulls in two ways:
- It allocates a single bit to mark a field as null. This type of representation is called an out-of-band null.
- It designates a specific field value to indicate a null, for example a numeric field's most negative possible value. This type of representation is called an in-band null. In-band null representation can be disadvantageous because you must reserve a field value for nulls, and this value cannot be treated as valid data elsewhere.
The Transformer and Modify stages can change a null representation from an out-of-band null to an in-band null and from an in-band null to an out-of-band null.
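The distinction between the two null representations can be sketched conceptually in Python as shown below. This is only an illustration of the idea, not DataStage internals; the field name and the sentinel value are invented for the example.

    # Conceptual sketch: two ways of representing a null quantity value.

    # Out-of-band: a separate marker (here Python's None) flags the null,
    # so every numeric value remains usable as real data.
    out_of_band_row = {"quantity": None}

    # In-band: one reserved data value (here the most negative 32-bit integer)
    # stands for null, so that value can never appear as legitimate data.
    NULL_SENTINEL = -2**31
    in_band_row = {"quantity": NULL_SENTINEL}

    def is_null_out_of_band(row):
        return row["quantity"] is None

    def is_null_in_band(row):
        return row["quantity"] == NULL_SENTINEL

    print(is_null_out_of_band(out_of_band_row))  # True
    print(is_null_in_band(in_band_row))          # True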

The Table Definition of a stage's input or output data set can contain columns defined to support out-of-band nulls (the Nullable attribute is checked). The next table lists the rules for handling nullable fields when a stage takes a Data Set as input or writes to a Data Set as output:

Source Field | Destination Field | Result
not Nullable | not Nullable | Source value propagates to destination.
Nullable | Nullable | Source value or null propagates.
not Nullable | Nullable | Source value propagates; destination value is never null.
Nullable | not Nullable | If the source value is not null, the source value propagates. If the source value is null, a fatal error occurs.

When reading from Data Set and database sources with nullable columns, Enterprise Edition uses the internal, out-of-band null representation for NULL values.
When reading from or writing to Sequential Files or File Sets, the in-band (value) null representation must be explicitly defined in the extended column attributes for each Nullable column, as shown in Figure 17:
Figure 17: Extended Column Metadata (Nullable properties)
IMPORTANT: When processing nullable columns in a Transformer stage, care must be taken to avoid data rejects. See Section 8.1: Transformer NULL Handling and Reject Link.

4.3 Runtime Column Propagation
Runtime column propagation ("RCP") allows job designs to accommodate additional columns beyond those defined by the job developer. Using RCP judiciously in a job design facilitates re-usable job designs based on input metadata, rather than using a large number of jobs with hard-coded table definitions to perform the same tasks. Some stages, for example the Sequential File stage, allow their runtime schema to be parameterized, further extending re-use through RCP.
Before a DataStage developer can use RCP, it must be enabled at the project level through the Administrator client.
Furthermore, RCP facilitates re-use through parallel shared containers. Using RCP, only the columns explicitly referenced within the shared container logic need to be defined; the remaining columns pass through at runtime, as long as each stage in the shared container has RCP enabled on its stage Output properties.

5 Partitioning and Collecting
Partition parallelism is a key to establishing scalable performance of DataStage Enterprise Edition. Partitioners distribute rows of a single link into smaller segments that can be processed independently in parallel. Partitioners exist before any stage that is running in parallel. If the prior stage was running sequentially, a "fan-out" icon is drawn on the link within the Designer canvas, as shown in this example:
[Diagram: stage running sequentially linked to a stage running in parallel]
Figure 18: "fan-out" icon
Collectors combine parallel partitions of a single link for sequential processing. Collectors only exist before stages running sequentially and when the previous stage is running in parallel, and are indicated by a "fan-in" icon as shown in this example:
[Diagram: stage running in parallel linked to a stage running sequentially]
Figure 19: Collector icon
This section provides an overview of partitioning and collecting methods, and provides guidelines for appropriate use in job designs. It also provides tips for monitoring jobs running in parallel.

5.1 Partition Types
While partitioning allows data to be distributed across multiple processes running in parallel, it is important that this distribution does not violate business requirements for accurate data processing. For this reason, different types of partitioning are provided for the parallel job developer. Partitioning methods are separated into keyless and keyed classes:
- Keyless partitioning distributes rows without regard to the actual data values. Different types of keyless partitioning methods define the method of data distribution.
- Keyed partitioning examines the data values in one or more key columns, ensuring that records with the same values in those key column(s) are assigned to the same partition. Keyed partitioning is used when business rules (for example, Remove Duplicates) or stage requirements (for example, Join) require processing on groups of related records.

The default partitioning method for newly-drawn links is Auto partitioning. The partitioning method is specified in the Input stage properties using the "Partitioning" option, as shown on the right:
Figure 20: Specifying Partition method

5.1.1 Auto Partitioning
The default partitioning method for newly-drawn links, Auto partitioning specifies that the Enterprise Edition engine will attempt to select the appropriate partitioning method at runtime. Based on the configuration file, Data Sets, and job design (stage requirements and properties), Auto partitioning will select between keyless (Same, Round Robin, Entire) and keyed (Hash) partitioning methods to produce functionally correct results and, in some cases, to improve performance.
Within the Designer canvas, links with Auto partitioning are drawn with the following link icon:
Figure 21: Auto partitioning icon
Auto partitioning is designed to allow the beginning DataStage developer to construct simple data flows without having to understand the details of parallel design principles. Auto partitioning will ensure correct results when using built-in stages. However, since the Enterprise Edition engine has no visibility into user-specified logic (such as Transformer or BuildOp stages), it may be necessary to explicitly specify a partitioning method for some stages. For example, if the logic defined in a Transformer stage is based on a group of related records, then a keyed partitioning method must be specified to achieve correct results.
Furthermore, the ability of the Enterprise Edition engine to determine the appropriate partitioning method depends on the information available to it. In general, the partitioning method it selects may not necessarily be the most efficient from an overall job perspective.
The "Preserve Partitioning" flag is an internal "hint" that Auto partitioning uses to attempt to preserve carefully ordered data (for example, on the output of a parallel Sort).

This flag is set automatically by some stages (Sort, for example), although it can be explicitly set or cleared in the "Advanced" stage properties of a given stage, as shown on the right:
Figure 22: Preserve Partitioning option
The Preserve Partitioning flag is part of the Data Set structure, and its state is stored in persistent Data Sets. There are some cases when the input stage requirements prevent partitioning from being preserved. In these instances, if the Preserve Partitioning flag was set, a warning will be placed in the Director log indicating that Enterprise Edition was unable to preserve partitioning for a specified stage.

5.1.2 Keyless Partitioning
Keyless partitioning methods distribute rows without examining the contents of the data:

Keyless Partition Method | Description
Same | Retains existing partitioning from the previous stage.
Round Robin | Distributes rows evenly across partitions in a round robin partition assignment.
Random | Distributes rows evenly across partitions in a random partition assignment.
Entire | Each partition receives the entire Data Set.

5.1.2.1 Same Partitioning
Same partitioning in fact performs no partitioning to the input Data Set. Instead, it retains the partitioning from the output of the upstream stage, as illustrated on the right:
[Diagram: rows 0, 3, 6 / 1, 4, 7 / 2, 5, 8 remain in their existing partitions]
Same partitioning doesn't move data between partitions (or, in the case of a cluster or Grid, between servers), and is appropriate when trying to preserve the grouping of a previous operation (for example, a parallel Sort).
Within the Designer canvas, links that have been specified with Same partitioning are drawn with a "horizontal line" partitioning icon:
Figure 23: Same partitioning icon
It is important to understand the impact of Same partitioning in a given data flow. Because Same does not redistribute existing partitions, the degree of parallelism remains unchanged:

- If the upstream stage is running sequentially, Same partitioning will effectively cause a downstream parallel stage to also run sequentially.
- If you read a parallel Data Set with Same partitioning, the downstream stage runs with the degree of parallelism used to create the Data Set, regardless of the current $APT_CONFIG_FILE.

5.1.2.2 Round Robin Partitioning
Round Robin partitioning evenly distributes rows across partitions in a round-robin assignment, similar to dealing cards:
[Diagram: rows ...8 7 6 5 4 3 2 1 0 dealt round-robin into partitions (0, 3, 6), (1, 4, 7), (2, 5, 8)]
Round robin partitioning has a fairly low overhead. Since optimal parallel processing occurs when all partitions have the same workload, Round Robin partitioning is useful for redistributing data that is highly skewed (where there are an unequal number of rows in each partition).

5.1.2.3 Random Partitioning
Like Round Robin, Random partitioning evenly distributes rows across partitions, but using a random assignment. As a result, the order that rows are assigned to a particular partition will differ between job runs. Since the random partition number must be calculated, Random partitioning has a slightly higher overhead than Round Robin partitioning. While in theory Random partitioning is not subject to regular data patterns that might exist in the source data, it is rarely used in real-world data flows.

5.1.2.4 Entire Partitioning
Entire partitioning distributes a complete copy of the entire Data Set to each partition, as illustrated on the right:
[Diagram: rows ...8 7 6 5 4 3 2 1 0 copied in their entirety to every partition]
Entire partitioning is useful for distributing the reference data of a Lookup task (this may or may not involve the Lookup stage). On clustered and Grid implementations, Entire partitioning may have a performance impact, as the complete Data Set must be distributed across the network to each node.
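The keyless assignments described above can be sketched in Python as follows. This is a conceptual illustration of the row-to-partition assignment only, not how the Enterprise Edition engine is implemented; the function names are invented for the example.

    # Conceptual sketch of keyless partitioning: each function returns, for a
    # given stream of rows, the list of rows assigned to each of n partitions.

    def round_robin_partition(rows, n):
        """Deal rows across n partitions like cards: row i goes to partition i mod n."""
        partitions = [[] for _ in range(n)]
        for i, row in enumerate(rows):
            partitions[i % n].append(row)
        return partitions

    def entire_partition(rows, n):
        """Every partition receives a complete copy of the input (reference data)."""
        rows = list(rows)
        return [list(rows) for _ in range(n)]

    rows = list(range(9))
    print(round_robin_partition(rows, 3))        # [[0, 3, 6], [1, 4, 7], [2, 5, 8]]
    print(entire_partition(rows, 3)[0] == rows)  # True: each partition holds all rows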

5.1.3 Keyed Partitioning
Keyed partitioning examines the data values in one or more key columns, ensuring that records with the same values in those key column(s) are assigned to the same partition. Keyed partitioning is used when business rules (for example, Remove Duplicates) or stage requirements (for example, Join) require processing on groups of related records.

Keyed Partition Method | Description
Hash | Assigns rows with the same values in one or more key column(s) to the same partition using an internal hashing algorithm.
Modulus | Assigns rows with the same values in a single integer key column to the same partition using a simple modulus calculation.
Range | Assigns rows with the same values in one or more key column(s) to the same partition using a specified range map generated by pre-reading the Data Set.
DB2 | For DB2 Enterprise Server Edition with DPF (DB2/UDB) only – matches the internal partitioning of the specified source or target table.

5.1.3.1 Hash Partitioning
Hash partitioning assigns rows with the same values in one or more key column(s) to the same partition using an internal hashing algorithm. If the source data values are evenly distributed within these key column(s), and there are a large number of unique values, then the resulting partitions will be of relatively equal size.
As an example of hashing, consider the following sample Data Set:

ID | LName | FName | Address
1 | Ford | Henry | 66 Edison Avenue
2 | Ford | Clara | 66 Edison Avenue
3 | Ford | Edsel | 7900 Jefferson
4 | Ford | Eleanor | 7900 Jefferson
5 | Dodge | Horace | 17840 Jefferson
6 | Dodge | John | 75 Boston Boulevard
7 | Ford | Henry | 4901 Evergreen
8 | Ford | Clara | 4901 Evergreen
9 | Ford | Edsel | 1100 Lakeshore
10 | Ford | Eleanor | 1100 Lakeshore

[Diagram: values of the key column passed through the HASH partitioner to assign each row to a partition]

Hashing on key column LName would produce the following results:

Partition 0:
ID | LName | FName | Address
5 | Dodge | Horace | 17840 Jefferson
6 | Dodge | John | 75 Boston Boulevard

Partition 1:
ID | LName | FName | Address
1 | Ford | Henry | 66 Edison Avenue
2 | Ford | Clara | 66 Edison Avenue
3 | Ford | Edsel | 7900 Jefferson
4 | Ford | Eleanor | 7900 Jefferson
7 | Ford | Henry | 4901 Evergreen
8 | Ford | Clara | 4901 Evergreen
9 | Ford | Edsel | 1100 Lakeshore
10 | Ford | Eleanor | 1100 Lakeshore

In this example, there are more instances of "Ford" than "Dodge", producing partition skew, which would impact performance. Also note that in this example the number of unique values will limit the degree of parallelism, regardless of the actual number of nodes in the parallel configuration file.
When using hash partitioning on a composite key (more than one key column), individual key column values have no significance for partition assignment; only the unique combination of key column values determines the partition. Using the same source Data Set, hash partitioning on the key columns LName and FName yields the following distribution with a 4-node configuration file:

Partition 0:
ID | LName | FName | Address
2 | Ford | Clara | 66 Edison Avenue
8 | Ford | Clara | 4901 Evergreen

Partition 1:
ID | LName | FName | Address
3 | Ford | Edsel | 7900 Jefferson
5 | Dodge | Horace | 17840 Jefferson
9 | Ford | Edsel | 1100 Lakeshore

Partition 2:
ID | LName | FName | Address
4 | Ford | Eleanor | 7900 Jefferson
6 | Dodge | John | 75 Boston Boulevard
10 | Ford | Eleanor | 1100 Lakeshore

Partition 3:
ID | LName | FName | Address
1 | Ford | Henry | 66 Edison Avenue
7 | Ford | Henry | 4901 Evergreen

In this example, the key column combination of LName and FName yields improved data distribution and a greater degree of parallelism. Also note that rows with the same combination of key column values always appear in the same partition when used for hash partitioning.

5.1.3.2 Modulus Partitioning
Modulus partitioning uses a simplified algorithm for assigning related records based on a single integer key column. It performs a modulus operation on the data value using the number of partitions as the divisor. The remainder is used to assign the value to a given partition:
partition = MOD(key_value, number of partitions)
Like hash, the partition size of modulus partitioning will be equally distributed as long as the data values in the key column are equally distributed. Modulus partitioning cannot be used for composite keys, or for a non-integer key column.
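The two keyed assignments just described can be sketched in Python as shown below. Note that the real engine uses its own internal hashing algorithm; Python's built-in hash() is used here purely to illustrate the idea that equal key values always land in the same partition. The function and column names are invented for the example.

    # Conceptual sketch of keyed partitioning.

    def hash_partition(row, key_columns, n):
        """Rows with the same values in the key column(s) map to the same partition."""
        key = tuple(row[c] for c in key_columns)
        return hash(key) % n          # illustrative only; not the engine's algorithm

    def modulus_partition(row, key_column, n):
        """Single integer key: partition = key value modulo number of partitions."""
        return row[key_column] % n

    row = {"ID": 7, "LName": "Ford", "FName": "Henry"}
    print(hash_partition(row, ["LName", "FName"], 4))  # same result for every Ford/Henry row
    print(modulus_partition(row, "ID", 4))             # 7 mod 4 = 3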

Since modulus partitioning is simpler and faster than hash, it should be used if you have a single integer key column.

5.1.3.3 Range Partitioning
As a keyed partitioning method, Range partitioning assigns rows with the same values in one or more key column(s) to the same partition. Given a sufficient number of unique values, Range partitioning ensures a balanced workload by assigning an approximately equal number of rows to each partition, unlike Hash and Modulus partitioning where partition skew is dependent on the actual data distribution.
[Diagram: key column values assigned to partitions through a Range Map file]
To achieve this balanced distribution, Range partitioning must read the Data Set twice: once to create a Range Map file, and a second time to actually partition the data within a flow using the Range Map. A Range Map file is specific to a given parallel configuration file.
The "read twice" penalty of Range partitioning limits its use to specific scenarios, typically where the incoming data values and distribution are consistent over time. In these instances, the Range Map file can be re-used.
It is important to note that if the data distribution changes without recreating the Range Map, partition balance will be skewed, defeating the intention of Range partitioning. Also, if new data values are processed outside of the range of a given Range Map, these rows will be assigned to either the first or the last partition, depending on the value.
In another scenario to avoid, if the incoming Data Set is sequential and ordered on the key column(s), Range partitioning will result in sequential processing.

5.1.3.4 DB2 Partitioning
The DB2/UDB Enterprise stage matches the internal database partitioning of the source or target DB2 Enterprise Server Edition with Data Partitioning Facility database (previously called "DB2/UDB EEE"). Using the DB2/UDB Enterprise stage, data is read in parallel from each DB2 node. And, by default, when writing data to a target DB2 database using the DB2/UDB Enterprise stage, data is partitioned to match the internal partitioning of the target DB2 table using the DB2 partitioning method.
DB2 partitioning can only be specified for target DB2/UDB Enterprise stages. To maintain partitioning on data read from a DB2/UDB Enterprise stage, use Same partitioning on the input to downstream stages.

5.2 Monitoring Partitions
At runtime, DataStage Enterprise Edition determines the degree of parallelism for each stage using:
a) the parallel configuration file (APT_CONFIG_FILE)
b) the degree of parallelism of existing source and target Data Sets (and, in some cases, databases)
c) and, if specified, a stage's node pool (Stage/Advanced properties)
This information is detailed in the parallel job score, which is output to the Director job log when the environment variable APT_DUMP_SCORE is set to True. Specific details on interpreting the parallel job score can be found in Section 12.2: Understanding the Parallel Job Score.
Partitions are assigned numbers, starting at zero. The partition number is appended to the stage name for messages written to the Director log, as shown in the example log below, where the stage named "Peek" is running with four degrees of parallelism (partition numbers zero through 3):
Figure 24: Partition numbers as shown in Director log
To display row counts per partition in the Director Job Monitor window, right-click anywhere in the window and select the "Show Instances" option. This is very useful in determining the distribution across parallel partitions (skew). In this instance, the stage named "Sort_3" is running across four partitions ("x 4" next to the stage name), and each partition is processing an equal number (12,500) of rows for an optimal balanced workload.
Figure 25: Director Job Monitor row counts by partition
Setting the environment variable APT_RECORD_COUNTS will output the row count per link per partition to the Director log as each stage/node completes processing, as illustrated below:

Figure 26: Output of APT_RECORD_COUNTS in Director log
Finally, the "Data Set Management" tool (available in the Tools menu of Designer, Director, or Manager) can be used to identify the degree of parallelism and number of rows per partition for an existing persistent Data Set, as shown below:
Figure 27: Data Set Management Tool
In a non-graphical way, the orchadmin command line utility on the DataStage server can also be used to examine a given parallel Data Set.

5.3 Partition Methodology
Given the numerous options for keyless and keyed partitioning, the following objectives help to form a methodology for assigning partitioning:

Objective 1: Choose a partitioning method that gives close to an equal number of rows in each partition, while minimizing overhead. This ensures that the processing workload is evenly balanced, minimizing overall run time.
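To make Objective 1 concrete, the per-partition row counts reported by the Job Monitor (or by APT_RECORD_COUNTS) can be checked with a small calculation like the Python sketch below. This is only an illustration; the function name is invented and the counts are sample numbers (the balanced case mirrors the 12,500-row example above, the skewed case is hypothetical).

    # Conceptual sketch: quantify partition skew from per-partition row counts.
    def skew_percent(counts):
        """Return how far the largest partition deviates from a perfectly even split."""
        ideal = sum(counts) / len(counts)
        return 100.0 * (max(counts) - ideal) / ideal

    balanced = [12500, 12500, 12500, 12500]   # evenly balanced workload
    skewed   = [40000,  4000,  3000,  3000]   # hypothetical skewed run
    print(skew_percent(balanced))             # 0.0
    print(round(skew_percent(skewed), 1))     # 220.0 (one partition does most of the work)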

Objective 2: The partition method must match the business requirements and stage functional requirements, assigning related records to the same partition if required.
Any stage that processes groups of related records (generally using one or more key columns) must be partitioned using a keyed partition method. This includes, but is not limited to: Aggregator, Change Capture, Change Apply, Join, Merge, Remove Duplicates, and Sort stages. It may also be necessary for Transformers and BuildOps that process groups of related records.
Note that in satisfying the requirements of this second objective, it may not be possible to choose a partitioning method that gives close to an equal number of rows in each partition.

Objective 3: Unless partition distribution is highly skewed, minimize repartitioning, especially in cluster or Grid configurations.
Repartitioning data in a cluster or Grid configuration incurs the overhead of network transport.

Objective 4: The partition method should not be overly complex. The simplest method that meets the above objectives will generally be the most efficient and yield the best performance.

Using the above objectives as a guide, the following methodology can be applied (a conceptual sketch of these decision steps follows the list):
a) Start with Auto partitioning (the default)
b) Specify Hash partitioning for stages that require groups of related records
o Specify only the key column(s) that are necessary for correct grouping, as long as the number of unique values is sufficient
o Use Modulus partitioning if the grouping is on a single integer key column
o Use Range partitioning if the data is highly skewed and the key column values and distribution do not change significantly over time (the Range Map can be reused)
c) If grouping is not required, use Round Robin partitioning to redistribute data equally across all partitions
o Especially useful if the input Data Set is highly skewed or sequential
d) Use Same partitioning to optimize end-to-end partitioning and to minimize repartitioning
o Be mindful that Same partitioning retains the degree of parallelism of the upstream stage
o Within a flow, examine up-stream partitioning and sort order and attempt to preserve them for down-stream processing. This may require re-examining key column usage within stages and re-ordering stages within a flow (if business requirements permit).
o Across jobs, persistent Data Sets can be used to retain the partitioning and sort order. This is particularly useful if downstream jobs are run with the same degree of parallelism (configuration file) and require the same partition and sort order.
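Purely as an illustration, the decision steps above can be summarized in the following Python sketch. The flag names are invented; the function simply encodes the documented guidance and is not a DataStage facility.

    # Conceptual sketch of the partitioning methodology (steps a-d above).
    def choose_partitioning(needs_grouping=False, single_integer_key=False,
                            stable_skewed_data=False, highly_skewed_input=False):
        if not needs_grouping:
            # No related-record grouping: default to Auto, or rebalance skewed input.
            return "Round Robin" if highly_skewed_input else "Auto"
        if single_integer_key:
            return "Modulus"
        if stable_skewed_data:
            return "Range"          # Range Map can be reused over time
        return "Hash"               # on only the key columns needed for grouping

    print(choose_partitioning(needs_grouping=True))                            # Hash
    print(choose_partitioning(needs_grouping=True, single_integer_key=True))   # Modulus
    print(choose_partitioning(highly_skewed_input=True))                       # Round Robin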

5.4 Partitioning Examples
In this section, we'll apply the partitioning methodology defined earlier to several example job flows.

5.4.1 Partitioning Example 1 – Optimized Partitioning
The Aggregator stage only outputs key columns and aggregate result columns. To add aggregate columns to every detail row, a Copy stage is used to send the detail rows to an Inner Join and an Aggregator. The output of the Aggregator is then sent to the second input of the Join.
The standard solution would be to Hash partition (and Sort) the inputs to the Join and Aggregator stages, as shown below:
Figure 28: "Standard" Partitioning assignment
However, on closer inspection, the partitioning and sorting of this scenario can be optimized. Because the Join and Aggregator use the same partition keys and sort order, we can move the Hash partition and Sort before the Copy stage, and apply Same partitioning to the downstream links, as shown below:
Figure 29: Optimized Partitioning assignment
This example will be revisited in the Sorting discussion because there is one final step necessary to optimize the sorting in this example. Additional, more advanced partitioning and sorting examples are given in Section 12.4.2: Sorting and Hashing Advanced Example.

5.4.2 Partitioning Example 2 – Use of Entire Partitioning
In this example, a Transformer is used to extract data from a single header row of an input file. Within the Transformer, a new output column is defined on the header and detail links using a single constant value derivation. This column is used as the key for a subsequent Inner Join to attach the header values to every detail row. Using a "standard" solution, both inputs to the Join would be Hash partitioned and sorted on this single join column (either explicitly, or through Auto partitioning):

Figure 30: "Standard" Partitioning assignment for a Join stage
Although Hash partitioning guarantees correct results for stages that require groupings of related records, it is not always the most efficient solution, depending on the business requirements. Although functionally correct, the above solution has one serious limitation. Remembering that the degree of parallel operation is limited by the number of distinct values, the single-value join column will assign all rows to a single partition, resulting in sequential processing.
To optimize partitioning, consider that the single header row is really a form of reference data. An optimized solution would be to alter the partitioning for the input links to the Join stage:
- Use Round Robin partitioning on the detail input to evenly distribute rows across all partitions
- Use Entire partitioning on the header input to copy the single header row to all partitions
Figure 31: Optimized Partitioning assignment based on business requirements
Because we are joining on a single value, there is no need to pre-sort the input to the Join, so we will revisit this in the Sorting discussion.
In order to process a large number of detail records, the link order of the Inner Join is significant. The Join stage operates by reading a single row from the Left input and reading all rows from the Right input that match the key value(s). For this reason, the link order in this example should be set so that the single header row is assigned to the Right input, and the detail rows are assigned to the Left input, as shown in the following illustration:
Figure 32: Specifying Link Order in Join stage

If defined in reverse of this order, the Join will attempt to read all detail rows from the right input (since they have the same key column value) into memory.
Although functionally correct, there is one further detail in this example. Because the Join will wait until it receives an End of Group (new key value) or End of Data (no more rows on the input Data Set) from the Right input, the detail rows in the Left input will buffer to disk to prevent a deadlock. (See Section 12.3: Minimizing Runtime Processes and Resource Requirements.) Changing the output derivation on the header row to a series of numbers instead of a constant value will establish the End of Group and prevent buffering to disk.

5.5 Collector Types
Collectors combine parallel partitions of an input Data Set (single link) into a single input stream to a stage running sequentially. Like partitioning methods, the collector method is defined in the stage Input/Partitioning properties for any stage running sequentially, when the previous stage is running in parallel, as shown on the right:
Figure 33: Specifying Collector method

5.5.1 Auto Collector
The Auto collector eagerly reads rows from partitions in the input Data Set without blocking if a row is unavailable on a particular partition. For this reason, the order of rows in an Auto collector is undefined, and may vary between job runs on the same Data Set. Auto is the default collector method.

5.5.2 Round Robin Collector
The Round Robin collector patiently reads rows from partitions in the input Data Set by reading input partitions in round robin order. The Round Robin collector is generally slower than an Auto collector because it must wait for a row to appear in a particular partition.
However, for advanced users, there is a specialized example where the Round Robin collector may be appropriate. Consider an example where data is read sequentially and passed to a Round Robin partitioner:

Figure 34: Round Robin Collector example
[Diagram: sequential input, Round Robin partitioner, stage running in parallel, Round Robin collector, sequential output]
Assuming the data is not repartitioned within the job flow and that the number of rows is not reduced (for example, through aggregation), then a Round Robin collector can be used before the final Sequential output to reconstruct a sequential output stream in the same order as the input data stream. This is because the Round Robin collector reads from partitions using the same partition order that a Round Robin partitioner uses to assign rows to parallel partitions.

5.5.3 Ordered Collector
An Ordered collector reads all rows from the first partition, then reads all rows from the next partition, until all rows in the Data Set have been collected. Ordered collectors are generally only useful if the input Data Set has been Sorted and Range partitioned on the same key column(s). In this scenario, an Ordered collector will generate a sequential stream in sort order.

5.5.4 Sort Merge Collector
If the input Data Set is sorted in parallel, the Sort Merge collector will generate a sequential stream of rows in globally sorted order. The Sort Merge collector requires one or more key column(s) to be defined, and these should be the same columns, in the same order, as used to sort the input Data Set in parallel. Row order is undefined for non-key columns.

5.6 Collecting Methodology
Given the options for collecting data into a sequential stream, the following guidelines form a methodology for choosing the appropriate collector type:
a) When output order does not matter, use the Auto collector (the default)
b) When the input Data Set has been sorted in parallel, use the Sort Merge collector to produce a single, globally sorted stream of rows
o When the input Data Set has been sorted in parallel and Range partitioned, the Ordered collector may be more efficient
c) Use a Round Robin collector to reconstruct rows in input order for round-robin partitioned input Data Sets, as long as the Data Set has not been repartitioned or reduced.
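The collector behaviors described above can be sketched with standard Python tools, as shown below. heapq.merge stands in conceptually for the Sort Merge collector and simple interleaving for the Round Robin collector; the function names and sample data are invented for the illustration.

    import heapq
    from itertools import chain, zip_longest

    # Sort Merge collector sketch: merge already-sorted partitions into one
    # globally sorted sequential stream, using the same sort key.
    def sort_merge_collect(partitions, key):
        return list(heapq.merge(*partitions, key=key))

    # Round Robin collector sketch: read one row from each partition in turn.
    def round_robin_collect(partitions):
        interleaved = zip_longest(*partitions, fillvalue=None)
        return [row for row in chain.from_iterable(interleaved) if row is not None]

    p0, p1 = [("Dodge", 5), ("Dodge", 6)], [("Ford", 1), ("Ford", 2)]
    print(sort_merge_collect([p0, p1], key=lambda r: r[0]))
    # [('Dodge', 5), ('Dodge', 6), ('Ford', 1), ('Ford', 2)]
    print(round_robin_collect([[0, 3, 6], [1, 4, 7], [2, 5, 8]]))
    # [0, 1, 2, 3, 4, 5, 6, 7, 8]  -- restores the original round-robin input order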

6 Sorting
Traditionally, the process of sorting data uses one primary key column and, optionally, one or more secondary key column(s) to generate a sequential, ordered result set. The order of key columns determines the sequence and groupings in the result set. Each column is specified with an ascending or descending sort order. This is the method SQL databases use for an ORDER BY clause, as illustrated in the following example, sorting on primary key LName (ascending) and secondary key FName (descending):

Input Data:
ID | LName | FName | Address
1 | Ford | Henry | 66 Edison Avenue
2 | Ford | Clara | 66 Edison Avenue
3 | Ford | Edsel | 7900 Jefferson
4 | Ford | Eleanor | 7900 Jefferson
5 | Dodge | Horace | 17840 Jefferson
6 | Dodge | John | 75 Boston Boulevard
7 | Ford | Henry | 4901 Evergreen
8 | Ford | Clara | 4901 Evergreen
9 | Ford | Edsel | 1100 Lakeshore
10 | Ford | Eleanor | 1100 Lakeshore

After Sorting by LName, FName:
ID | LName | FName | Address
6 | Dodge | John | 75 Boston Boulevard
5 | Dodge | Horace | 17840 Jefferson
1 | Ford | Henry | 66 Edison Avenue
7 | Ford | Henry | 4901 Evergreen
4 | Ford | Eleanor | 7900 Jefferson
10 | Ford | Eleanor | 1100 Lakeshore
3 | Ford | Edsel | 7900 Jefferson
9 | Ford | Edsel | 1100 Lakeshore
2 | Ford | Clara | 66 Edison Avenue
8 | Ford | Clara | 4901 Evergreen

However, in most cases there is no need to globally sort data to produce a single sequence of rows. Instead, sorting is most often needed to establish order within specified groups of data. This sort can be done in parallel. For example, the Remove Duplicates stage selects either the first or last row from each group of an input Data Set sorted by one or more key columns. Other stages (for example, Sort Aggregator, Change Capture, Change Apply, Join, Merge) require pre-sorted groups of related records.

6.1 Partition and Sort Keys
Using the parallel Sort within DataStage Enterprise Edition:
- Partitioning is used to gather related records, assigning rows with the same key column values to the same partition.
- Sorting is used to establish group order within each partition, based on one or more key column(s).
NOTE: By definition, when data is re-partitioned, sort order is not maintained. To restore row order and groupings, a sort is required after repartitioning.
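The two-key ordering shown in the example above (LName ascending, FName descending) can be sketched in Python as follows. This is only a conceptual illustration; it relies on the fact that Python's sort is stable, so sorting by the secondary key first and the primary key second yields the combined order.

    # Conceptual sketch of a two-key sort: LName ascending, FName descending.
    rows = [
        {"ID": 1, "LName": "Ford",  "FName": "Henry"},
        {"ID": 2, "LName": "Ford",  "FName": "Clara"},
        {"ID": 5, "LName": "Dodge", "FName": "Horace"},
        {"ID": 6, "LName": "Dodge", "FName": "John"},
    ]
    rows.sort(key=lambda r: r["FName"], reverse=True)  # secondary key, descending
    rows.sort(key=lambda r: r["LName"])                # primary key, ascending (stable)
    print([(r["LName"], r["FName"]) for r in rows])
    # [('Dodge', 'John'), ('Dodge', 'Horace'), ('Ford', 'Henry'), ('Ford', 'Clara')]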

In the following example, the previous input Data Set is partitioned on the LName and FName columns. Given a 4-node configuration file, we would see the following results:

Partition 0:
ID | LName | FName | Address
2 | Ford | Clara | 66 Edison Avenue
8 | Ford | Clara | 4901 Evergreen

Partition 1:
ID | LName | FName | Address
3 | Ford | Edsel | 7900 Jefferson
5 | Dodge | Horace | 17840 Jefferson
9 | Ford | Edsel | 1100 Lakeshore

Partition 2:
ID | LName | FName | Address
4 | Ford | Eleanor | 7900 Jefferson
6 | Dodge | John | 75 Boston Boulevard
10 | Ford | Eleanor | 1100 Lakeshore

Partition 3:
ID | LName | FName | Address
1 | Ford | Henry | 66 Edison Avenue
7 | Ford | Henry | 4901 Evergreen

Applying a parallel sort to this partitioned input Data Set, using the primary key column LName (ascending) and secondary key column FName (descending), would generate the resulting Data Set:

Partition 0:
ID | LName | FName | Address
2 | Ford | Clara | 66 Edison Avenue
8 | Ford | Clara | 4901 Evergreen

Partition 1:
ID | LName | FName | Address
5 | Dodge | Horace | 17840 Jefferson
3 | Ford | Edsel | 7900 Jefferson
9 | Ford | Edsel | 1100 Lakeshore

Partition 2:
ID | LName | FName | Address
6 | Dodge | John | 75 Boston Boulevard
4 | Ford | Eleanor | 7900 Jefferson
10 | Ford | Eleanor | 1100 Lakeshore

Partition 3:
ID | LName | FName | Address
1 | Ford | Henry | 66 Edison Avenue
7 | Ford | Henry | 4901 Evergreen

Note that the partition and sort keys do not always have to match. For example, secondary sort keys can be used to establish order within a group for selection with the Remove Duplicates stage (which can specify First or Last duplicate to retain). Let's say that an input Data Set consists of order history based on CustID and OrderDate. Using Remove Duplicates, we want to select the most recent order for a given customer. To satisfy these requirements we could:
- Partition on CustID to group related records
- Sort on OrderDate in Descending order
- Remove Duplicates on CustID, with Duplicate To Retain=First
Section 12.4.2: Sorting and Hashing Advanced Example provides a more detailed discussion and example of partitioning and sorting.
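A hedged Python sketch of the CustID/OrderDate requirement above follows. The column names come from the example; the data values and function logic are invented. Conceptually it performs the same group-sort-dedup sequence within a single partition.

    from itertools import groupby

    # Conceptual sketch: retain the most recent order per customer
    # (group by CustID, order by OrderDate descending, keep the first of each group).
    orders = [
        {"CustID": 100, "OrderDate": "2006-01-15"},
        {"CustID": 100, "OrderDate": "2006-03-02"},
        {"CustID": 200, "OrderDate": "2006-02-20"},
    ]
    orders.sort(key=lambda r: r["OrderDate"], reverse=True)  # Sort on OrderDate, Descending
    orders.sort(key=lambda r: r["CustID"])                   # group related records (stable)
    latest = [next(group) for _, group in groupby(orders, key=lambda r: r["CustID"])]
    print([(r["CustID"], r["OrderDate"]) for r in latest])
    # [(100, '2006-03-02'), (200, '2006-02-20')]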

6.2 Complete (Total) Sort
If a single, sequential, ordered result is needed, in general it is best to use a two-step process:
- partition and parallel Sort on the key column(s)
- use a Sort Merge collector on these same key column(s) to generate a sequential, ordered result set
This is similar to the way parallel database engines perform their parallel sort operations.

6.3 Link Sort and Sort Stage
DataStage Enterprise Edition provides two methods for parallel sorts: the standalone Sort stage (when execution mode is set to Parallel) and sort on a link (when using a keyed input partitioning method). Both methods use the same internal sort package (the tsort operator).
The Link sort offers fewer options, but is easier to maintain in a DataStage job, as there are fewer stages on the design canvas. The stand-alone Sort stage offers more options, but as a separate stage it makes job maintenance slightly more complicated. In general, use the Link sort unless a specific option is needed on the stand-alone stage. Most often, the standalone Sort stage is used to specify the Sort Key mode for partial sorts.

6.3.1 Link Sort
Sorting on a link is specified on the Input/Partitioning stage options, when specifying a keyed partitioning method. (Sorting on a link is not available with Auto partitioning, although the Enterprise Edition engine may insert a sort if required.) When specifying key column(s) for partitioning, the "Perform Sort" option is checked. Within the Designer canvas, links that have a sort defined will have a Sort icon in addition to the partitioning icon, as shown below:
Figure 35: Link Sort icon
Additional properties can be specified by right-clicking on the key column, as shown in the following illustration:
Figure 36: Specifying Link Sort options
Key column options let the developer specify:
- key column usage: sorting, partitioning, or both
- sort direction: Ascending or Descending
- case sensitivity (strings)
- sorting character set: ASCII (default) or EBCDIC (strings)
- position of nulls in the result set (for nullable columns)

6.3.2 Sort Stage
The standalone Sort stage offers more options than the sort on a link. Specifically, the following properties are not available when sorting on a link:
- Sort Key Mode (a particularly important performance optimization)
- Create Cluster Key Change Column
- Create Key Change Column
- Output Statistics
- Sort Utility (don't change this!)
- Restrict Memory Usage
Figure 37: Sort Stage options
Of the options only available in the standalone Sort stage, the Sort Key Mode is most frequently used.
NOTE: The Sort Utility option is an artifact of previous releases. Always specify the "DataStage" Sort Utility, which is significantly faster than a "UNIX" sort.

6.4 Stable Sort
Stable sorts preserve the order of non-key columns within each sort group. This requires some additional overhead in the sort algorithm, and thus a stable sort is generally slower than a non-stable sort for the same input Data Set and sort keys. For this reason, disable Stable sort unless it is needed. It is important to note that, by default, the Stable sort option is disabled for sorts on a link and enabled with the standalone Sort stage.

6.5 Sub-Sorts
Within the standalone Sort stage, the key column property "Sort Key Mode" is a particularly powerful feature and a significant performance optimization. It is used when resorting a sub-grouping of a previously sorted input Data Set, instead of performing a complete Sort. This "subsort" uses significantly less disk space and CPU resource, and can often be performed in memory (depending on the size of the new subsort groups).

To resort based on a sub-grouping, all key columns must still be defined in the Sort stage. Re-used sort keys are specified with the "Don't Sort (Previously Sorted)" key mode property, while new sort keys are specified with the "Sort" key mode property, as shown in the following example:
Figure 38: Sort Key Mode property
To successfully perform a subsort, keys with the "Don't Sort (Previously Sorted)" property must be at the top of the list, without gaps between them, and the key column order for these keys must match the key columns and order defined in the previously-sorted input Data Set. If the input data does not match the key column definition for a subsort, the job will abort.
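A minimal Python sketch of the subsort idea follows: the data is assumed to arrive already sorted on the first key, and only each existing group is re-sorted on the new key, rather than re-sorting the whole Data Set. The column and function names are invented for the illustration.

    from itertools import groupby

    # Conceptual sketch of a subsort: input already sorted on "Region";
    # re-sort only within each Region group on "City" instead of sorting everything.
    def subsort(rows, sorted_key, new_key):
        result = []
        for _, group in groupby(rows, key=lambda r: r[sorted_key]):
            result.extend(sorted(group, key=lambda r: r[new_key]))  # small, often in-memory
        return result

    rows = [
        {"Region": "East", "City": "Boston"},
        {"Region": "East", "City": "Albany"},
        {"Region": "West", "City": "Seattle"},
        {"Region": "West", "City": "Denver"},
    ]
    print([r["City"] for r in subsort(rows, "Region", "City")])
    # ['Albany', 'Boston', 'Denver', 'Seattle']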

6.6 Automatically-Inserted Sorts
By default, DataStage Enterprise Edition inserts sort operators as necessary to ensure correct results. Enterprise Edition inserts sorts before any stage that requires matched key values or ordered groupings of records (Join, Merge, Remove Duplicates, Sort Aggregator). Sorts are only inserted automatically when the flow developer has not explicitly defined an input sort.
The parallel job score (see Section 12.2: Understanding the Parallel Job Score) can be used to identify automatically-inserted sorts, as shown in this score fragment:

    op1[4p] {(parallel inserted tsort operator
    {key={value=LastName}, key={value=FirstName}}(0))
    on nodes (
    node1[op2,p0]
    node2[op2,p1]
    node3[op2,p2]
    node4[op2,p3]
    )}

While ensuring correct results, inserted sorts can be a significant performance impact if they are not necessary. There are two ways to prevent Enterprise Edition from inserting an un-necessary sort:
a) Insert an upstream Sort stage on each link, and define all sort key columns with the Sort Key Mode property "Don't Sort (Previously Sorted)"
b) Set the environment variable APT_SORT_INSERTION_CHECK_ONLY. This will verify sort order but not actually perform a sort, aborting the job if data is not in the required sort order.
Revisiting the partitioning examples in Section 5.4: Partitioning Examples, the environment variable $APT_SORT_INSERTION_CHECK_ONLY should be set to prevent Enterprise Edition from inserting un-necessary sorts before the Join stage.

6.7 Sort Methodology
Using the rules and behavior outlined in the previous sections, the following methodology should be applied when sorting in a DataStage Enterprise Edition data flow:
a) Start with a link sort
b) Specify only necessary key column(s)
c) Don't use Stable Sort unless needed
d) Use a stand-alone Sort stage instead of a Link sort for options that are not available on a Link sort:
- Sort Key Mode, Create Cluster Key Change Column, Create Key Change Column, Output Statistics
- Always specify the "DataStage" Sort Utility for standalone Sort stages
- Use "Sort Key Mode=Don't Sort (Previously Sorted)" to resort a sub-grouping of a previously-sorted input Data Set
e) Be aware of automatically-inserted sorts
- Set $APT_SORT_INSERTION_CHECK_ONLY to verify, but not establish, the required sort order
f) Minimize the use of sorts within a job flow
g) To generate a single, sequential ordered result set, use a parallel Sort and a Sort Merge collector

6.8 Tuning Sort
Sort is a particularly expensive task within DataStage Enterprise Edition, requiring CPU, memory, and disk resources. To perform a sort, rows in the input Data Set are read into a memory buffer on each partition. If the sort operation can be performed in memory (as is often the case with a subsort), then no disk I/O is performed.
By default, each sort uses 20MB of memory per partition for its memory buffer. This value can be changed for each standalone Sort stage using the "Restrict Memory Usage" option (the minimum is 1MB per partition). On a global basis, the environment variable APT_TSORT_STRESS_BLOCKSIZE can be used to specify the size of the memory buffer, in MB, for all sort operators (link and standalone), overriding any per-sort specifications.

Having a greater number of scratch disks for each node allows the sort to spread I/O across multiple file systems.the directory “/tmp” (on UNIX) or “C:/TMP” (on Windows) if available The file system configuration and number of scratch disks defined in parallel configuration file can greatly impact the I/O performance of a parallel sort. stored in a retrieval system. 2006 80 of 179 © 2006 IBM Information Integration Solutions. All rights reserved.scratch disks defined in the current configuration file (APT_CONFIG_FILE) in the “sort” named disk pool . Parallel Framework Red Book: Data Flow Job Design July 17. If the input Data Set cannot fit into the sort memory buffer. or translated into any language in any form by any means without the written permission of IBM. No part of this publication may be reproduced. . in MB.scratch disks defined in the current configuration file default disk pool . transmitted. transcribed.Information Integration Solutions Center of Excellence use to specify the size of the memory buffer. then results are temporarily spooled to disk in the following order: .the default directory specified by the environment variable TMPDIR . for all sort operators (link and standalone). overriding any per-sort specifications.
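For example, a configuration file node entry along the following lines places one scratch disk in the "sort" pool and leaves another in the default pool (a minimal sketch; the node name, host name, and directory paths are illustrative assumptions, not taken from this document):

    {
      node "node1" {
        fastname "etl_host"
        pools ""
        resource disk "/ds/data1" {pools ""}
        resource scratchdisk "/ds/scratch_sort1" {pools "sort"}
        resource scratchdisk "/ds/scratch1" {pools ""}
      }
    }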

7 File Stage Usage

7.1 Which File Stage to Use
DataStage/EE offers various stages for reading from and writing to files. Recommendations for when to use a particular stage, and any limitations, are summarized below:

File Stage: Sequential File
  Recommended Usage: Read and write standard files in a single format.
  Limitations: Cannot write to a single file in parallel; does not support hierarchical data files.

File Stage: Complex Flat File
  Recommended Usage: Need to read source data in complex (hierarchical) format, such as mainframe sources with COBOL copybook file definitions.
  Limitations: Cannot write in parallel.

File Stage: Data Set
  Recommended Usage: Intermediate storage between DataStage parallel jobs.
  Limitations: Can only be read from and written to by DataStage parallel jobs or the orchadmin command.

File Stage: File Set
  Recommended Usage: Need to share information with external applications; can write in parallel (generates multiple segment files).
  Limitations: Slightly higher overhead than Data Set; performance penalty of format conversion.

File Stage: SAS Parallel Data Set
  Recommended Usage: Need to share data with an external Parallel SAS application. (Requires SAS connectivity license for DataStage.)
  Limitations: Requires Parallel SAS; can only be read from / written to by DS/EE or Parallel SAS.

File Stage: Lookup File Set
  Recommended Usage: Rare instances where lookup reference data is required by multiple jobs and is not updated frequently.
  Limitations: Can only be used as a reference link on a Lookup stage; can only be written – contents cannot be read or verified.

No DS/EE file stage supports "update" of existing records. Some stages (parallel Data Set) support "Append" to add new records to an existing file, but this is not recommended as it imposes risks for failure recovery.

7.2 Data Set Usage
Parallel Data Sets are the persistent (on-disk) representation of the in-memory data structures of DS/EE. As such, Data Sets provide maximum performance for reading and writing data from disk, as no overhead is needed to translate data to the internal DS/EE representation. Data Sets store data in partitioned form, using the internal format of the parallel engine. However, Data Sets can only be read from and written to using a DataStage parallel job. If data is to be shared with applications outside of DataStage parallel jobs, another file stage (such as Sequential File or File Set) should be used, at the performance penalty of format conversion.
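Outside of a parallel job, the orchadmin command-line utility mentioned above can be used to inspect or remove Data Sets. The following is a sketch only; the paths are illustrative and the exact subcommand names should be confirmed against the orchadmin usage on your installation:

    orchadmin describe /staging/ds/customer.ds
    orchadmin rm /staging/ds/customer.ds

Removing the descriptor file with a plain UNIX rm would leave the underlying segment files behind, which is why orchadmin is preferred for Data Set housekeeping.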

7.3 Sequential File Stages (Import and Export)
The Sequential File stage can be used to read from or write to one or more flat files of the same format. Unlike the Complex Flat File stage, the Sequential File stage can only read and write data that is in flattened (row/column) format.

7.3.1 Reading from a Sequential File in Parallel
The ability to read Sequential File(s) in parallel within Enterprise Edition depends on the Read Method and the options specified:

Sequential File – options to read sequentially:
- Read Method: Specific Files, only one file specified; may be a file or named pipe
- Read Method: File Pattern

Sequential File – options to read in parallel:
- Read Method: Specific Files, Readers Per Node option greater than 1; useful for SMP configurations; the file may be either fixed or variable-width
- Read Method: Specific Files, Read From Multiple Nodes option set to Yes; useful for cluster and Grid configurations; the file may only be fixed-width
- Read Method: Specific Files, more than one file specified; each file specified within a single Sequential File stage must be of the same format
- Read Method: File Pattern, with the environment variable $APT_IMPORT_PATTERN_USES_FILESET set

Note that when reading in parallel, input row order is not maintained across readers.

7.3.2 Writing to a Sequential File in Parallel
It is only possible to write in parallel from a Sequential File stage when more than one output file is specified. In these instances, the degree of parallelism of the write will correspond to the number of file names specified. A better option for writing to a set of Sequential Files in parallel is to use the FileSet stage. This will create a single header file (in text format) and corresponding data files, using the format options specified in the FileSet stage. The FileSet stage will write in parallel.

7.3.3 Separating I/O from Column Import
If the Sequential File input cannot be read in parallel, performance can still be improved by separating the file I/O from the column parsing operation. As shown in the following Job fragment, define a single large string column for the non-parallel Sequential File read, and then pass this to a Column Import stage to parse the file in parallel. The formatting and column properties of the Column Import stage match those of the Sequential File stage.

Figure 39: Column Import example

Note that this method is also useful for External Source and FTP sequential source stages.

7.3.4 Partitioning Sequential File Reads
Care must be taken to choose the appropriate partitioning method from a Sequential File read:
• Don't read from a Sequential File using SAME partitioning in the downstream stage! Unless more than one source file is specified, SAME will read the entire file into a single partition, making the entire downstream flow run sequentially (unless it is later repartitioned).
• When multiple files are read by a single Sequential File stage (using multiple files, or by using a File Pattern), each file's data is read into a separate partition. It is important to use ROUND-ROBIN partitioning (or other partitioning appropriate to downstream components) to evenly distribute the data in the flow.

7.3.5 Sequential File (Export) Buffering
By default, the Sequential File (export operator) stage buffers its writes to optimize performance. The environment variable $APT_EXPORT_FLUSH_COUNT allows the job developer to specify how frequently (in number of rows) the Sequential File stage flushes its internal buffer on writes. Setting this value to a low number (such as 1) is useful for realtime applications, but there is a small performance penalty associated with increased I/O. When a job completes successfully, the buffers are always flushed to disk.

7.3.6 Parameterized Sequential File Format
The Sequential File stage supports a Schema File option to specify the column definitions and file format of the source file, instead of specifying them statically through Table Definitions. Using the Schema File option allows the format of the source file to be specified at runtime. The format of the Schema File, including Sequential File import / export format properties, is documented in the Orchestrate Record Schema manual. Note that this document is required, since the Import / Export properties used by the Sequential File and Column Import stages are not documented in the DataStage Parallel Job Developer's Guide.
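For illustration, a schema file for a simple comma-delimited source might look like the following sketch (the column names, format properties, and null representation are assumptions for the example, not taken from this document):

    record {final_delim=end, delim=',', quote=double}
    (
      CustomerID: int32;
      CustomerName: string[max=30];
      Balance: nullable decimal[10,2] {null_field='NULL'};
    )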

7.3.7 Reading and Writing Nullable Columns
When reading from or writing to Sequential Files or File Sets, the in-band (value) null representation must be explicitly defined in the extended column attributes for each Nullable column, as shown below:

Figure 40: Extended Column Metadata (Nullable properties)

7.3.8 Reading from and Writing to Fixed-Length Files
Particular attention must be paid when processing fixed-length fields using the Sequential File stage:
• If a field is nullable, you must define the null field value and length in the Nullable section of the column property. Double-click on the column number in the grid dialog, or right mouse click on the column and select Edit Column, to set these properties.
• If the incoming columns are variable-length data types (for example, Integer, Decimal, Varchar), the field width column property must be set to match the fixed-width of the input column. Double-click on the column number in the grid dialog to set this column property.
• When writing fixed-length files from variable-length fields (for example, Integer, Decimal, Varchar), the field width and pad string column properties must be set to match the fixed-width of the output column. Double-click on the column number in the grid dialog to set these column properties.
• To display each field value, use the print_field import property. Use caution when specifying this option, as it can generate an enormous amount of detail in the job log. All import and export properties are listed in the Import/Export Properties chapter of the Orchestrate Operators Reference.

7.3.9 Reading Bounded-Length VARCHAR Columns
Care must be taken when reading delimited, bounded-length Varchar columns (Varchars with the length option set). By default, if the source file has fields with values longer than the maximum Varchar length, these extra characters will be silently truncated. The environment variable $APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS will direct Enterprise Edition to reject records with strings longer than their declared maximum column length.

7.3.10 Tuning Sequential File Performance
On heavily-loaded file servers or some RAID/SAN array configurations, the environment variables $APT_IMPORT_BUFFER_SIZE and $APT_EXPORT_BUFFER_SIZE can be used to improve I/O performance. These settings specify the size of the read (import) and write (export) buffers in Kbytes, with a default of 128 (128K). Increasing this size may improve performance. Finally, in some disk array configurations, setting the environment variable $APT_CONSISTENT_BUFFERIO_SIZE to a value equal to the read/write size in bytes can significantly improve performance of Sequential File operations.

7.4 Complex Flat File Stage
The Complex Flat File (CFF) stage can be used to read or write one or more files in the same hierarchical format. When used as a source, the stage allows you to read data from one or more complex flat files, including MVS datasets with QSAM and VSAM files. A complex flat file may contain one or more GROUPs, REDEFINES, or OCCURS clauses. When used as a target, the stage allows you to write data to one or more complex flat files. It does not write to MVS datasets.

Complex Flat File source stages execute in parallel mode when they are used to read multiple files, but you can configure the stage to execute sequentially if it is only reading one file with a single reader.

NOTE: The Complex Flat File stage cannot read from sources with OCCURS DEPENDING ON clauses. (This is an error in the DataStage documentation.)

7.4.1 CFF Stage Data Type Mapping
When you work with mainframe data using the CFF stage, the data types are mapped to internal Enterprise Edition data types as follows:

COBOL Type                            | Description           | Size            | Internal Type                  | Internal Options
S9(1-4) COMP/COMP-5                   | binary, native binary | 2 bytes         | int16                          |
S9(5-9) COMP/COMP-5                   | binary, native binary | 4 bytes         | int32                          |
S9(10-18) COMP/COMP-5                 | binary, native binary | 8 bytes         | int64                          |
9(1-4) COMP/COMP-5                    | binary, native binary | 2 bytes         | uint16                         |
9(5-9) COMP/COMP-5                    | binary, native binary | 4 bytes         | uint32                         |
9(10-18) COMP/COMP-5                  | binary, native binary | 8 bytes         | uint64                         |
X(n)                                  | character             | n bytes         | string(n)                      |
X(n)                                  | character for filler  | n bytes         | raw(n)                         |
X(n)                                  | varchar               | n bytes         | string(max=n)                  |
9(x)V9(y) COMP-3                      | decimal               | (x+y)/2+1 bytes | decimal[x+y,y]                 | packed
S9(x)V9(y) COMP-3                     | decimal               | (x+y)/2+1 bytes | decimal[x+y,y]                 | packed
9(x)V9(y)                             | display_numeric       | x+y bytes       | decimal[x+y,y] or string[x+y]  | zoned
S9(x)V9(y)                            | display_numeric       | x+y bytes       | decimal[x+y,y] or string[x+y]  | zoned, trailing
S9(x)V9(y) SIGN IS TRAILING           | display_numeric       | x+y bytes       | decimal[x+y,y]                 | zoned, trailing
S9(x)V9(y) SIGN IS LEADING            | display_numeric       | x+y bytes       | decimal[x+y,y]                 | zoned, leading
S9(x)V9(y) SIGN IS TRAILING SEPARATE  | display_numeric       | x+y+1 bytes     | decimal[x+y,y]                 | separate, trailing
S9(x)V9(y) SIGN IS LEADING SEPARATE   | display_numeric       | x+y+1 bytes     | decimal[x+y,y]                 | separate, leading
COMP-1                                | float                 | 4 bytes         | sfloat                         |
COMP-2                                | float                 | 8 bytes         | dfloat                         |
N(n) or G(n) DISPLAY-1                | graphic_n, graphic_g  | n*2 bytes       | ustring[n]                     |
N(n) or G(n) DISPLAY-1                | vargraphic_g/n        | n*2 bytes       | ustring[max=n]                 |
Group                                 |                       |                 | subrec                         |
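To make the mapping concrete, consider a hypothetical copybook item (the field name is illustrative only):

    05  CUST-BALANCE    PIC S9(7)V9(2) COMP-3.

Following the table above, this packed-decimal field would be imported by the CFF stage as decimal[9,2] with the packed option, occupying (7+2)/2+1 = 5 bytes in the source file.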

8 Transformation Languages

8.1 Transformer Stage
The DataStage Enterprise Edition parallel Transformer stage generates "C" code which is then compiled into a parallel component. For this reason, it is important to minimize the number of Transformers, and to use other stages (such as Copy) when derivations are not needed. See Section 3.4: Parallel Transformer stages for guidelines on Transformer stage usage.

8.1.1 Transformer NULL Handling and Reject Link
When evaluating expressions for output derivations or link constraints, the Transformer will reject (through the reject link indicated by a dashed line) any row that has a NULL value used in the expression. To create a Transformer reject link in Designer, right-click on an output link and choose "Convert to Reject":

Figure 41: Transformer Reject link

The parallel Transformer rejects NULL derivation results (including output link constraints) because the rules for arithmetic and string handling of NULL values are, by definition, undefined. Even if the target column in an output derivation allows nullable results, the Transformer will reject the row instead of sending it to the output link(s). For this reason, if you intend to use a nullable column within a Transformer derivation or output link constraint, it should be converted from its out-of-band (internal) null representation to an in-band (specific value) null representation using stage variables or the Modify stage. For example, the following stage variable expression would convert a null value to a specific empty string:

    If ISNULL(link.col) Then "" Else link.col

Note that if an incoming column is only used in an output column mapping, the Transformer will allow this row to be sent to the output link without being rejected.

Always include reject links in a parallel Transformer. This makes it easy to identify reject conditions (by row counts). When rows are rejected by a Transformer, entries are placed in the Director job log.

8.1.2 Parallel Transformer System Variables
The system variable @ROWNUM behaves differently in the Enterprise Edition Transformer stage than in the Server Edition Transformer. Because the DS/EE Transformer runs in parallel, @ROWNUM is assigned to incoming rows for each partition. Thus, when generating a sequence of numbers in parallel, or performing parallel derivations, the system variables @NUMPARTITIONS and @PARTITIONNUM should be used.
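For instance, one common way to produce a unique, increasing value across all partitions is a derivation along these lines (a sketch only, assuming @ROWNUM starts at 1 within each partition; the exact expression depends on requirements):

    ((@ROWNUM - 1) * @NUMPARTITIONS) + @PARTITIONNUM + 1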

8.1.3 Transformer Derivation Evaluation
Output derivations are evaluated before any type conversions on the assignment. For example, the PadString function uses the length of the source type, not the target. Therefore, it is important to make sure the type conversion is done before a row reaches the Transformer. For example, TrimLeadingTrailing(string) works only if string is a VarChar field; thus, the incoming column must be type VarChar before it is evaluated in the Transformer.

8.1.4 Conditionally Aborting Jobs
The Transformer can be used to conditionally abort a job when incoming data matches a specific rule. Create a new output link that will handle rows that match the abort rule. Within the link constraints dialog box, apply the abort rule to this output link, and set the "Abort After Rows" count to the number of rows allowed before the job should be aborted (for example, 1).

Since the Transformer will abort the entire job flow immediately, it is possible that valid rows will not have been flushed from Sequential File (export) buffers, or committed to database tables. It is important to set the database commit parameters or adjust the Sequential File buffer settings (see Section 7.3.5: Sequential File (Export) Buffering).

8.1.5 Transformer Decimal Arithmetic
When decimal data is evaluated by the Transformer stage, there are times when internal decimal variables need to be generated in order to perform the evaluation. By default, these internal decimal variables will have a precision and scale of 38 and 10. If more precision is required, the environment variables APT_DECIMAL_INTERM_PRECISION and APT_DECIMAL_INTERM_SCALE can be set to the desired range, up to a maximum precision of 255 and scale of 125.

By default, internal decimal results are rounded to the nearest applicable value. The environment variable APT_DECIMAL_INTERM_ROUND_MODE can be used to change the rounding behavior using one of the following keywords:

ceil: Rounds towards positive infinity. Examples: 1.4 -> 2, -1.6 -> -1
floor: Rounds towards negative infinity. Examples: 1.6 -> 1, -1.4 -> -2
round_inf: Rounds or truncates towards the nearest representable value, breaking ties by rounding positive values toward positive infinity and negative values toward negative infinity. Examples: 1.4 -> 1, 1.5 -> 2, -1.4 -> -1, -1.5 -> -2
trunc_zero: Discards any fractional digits to the right of the rightmost fractional digit supported, regardless of sign. For example, if $APT_DECIMAL_INTERM_SCALE is smaller than the results of the internal calculation, round or truncate to the scale size. Examples: 1.56 -> 1.5, -1.56 -> -1.5

8.1.6 Optimizing Transformer Expressions and Stage Variables
In order to write efficient Transformer stage derivations, it is useful to understand what items get evaluated and when. The evaluation sequence is as follows:

    Evaluate each stage variable initial value
    For each input row to process:
        Evaluate each stage variable derivation value, unless the derivation is empty
        For each output link:
            Evaluate the link constraint; if true
                Evaluate each column derivation value
                Write the output record
            Else skip the link
        Next output link
    Next input row

The stage variables and the columns within a link are evaluated in the order in which they are displayed in the Transformer editor. Similarly, the output links are also evaluated in the order in which they are displayed.

From this sequence, it can be seen that there are certain constructs that would be inefficient to include in output column derivations, as they would be evaluated once for every output column that uses them. Such constructs are:

• Where the same part of an expression is used in multiple column derivations
For example, suppose multiple columns in output links want to use the same substring of an input column. The following test may appear in a number of output column derivations:

    IF (DSLINK1.col[1,3] = "001") THEN ...

In this case, the substring of DSLINK1.col[1,3] is evaluated for each column that uses it. This can be made more efficient by moving the substring calculation into a stage variable. By doing this, the substring is evaluated just once for every input row. In this case, the stage variable definition would be:

    DSLINK1.col[1,3]

and each column derivation would start with:

    IF (StageVar1 = "001") THEN ...

In fact, this example could be improved further by also moving the string comparison into the stage variable. The stage variable would be:

    IF (DSLink1.col[1,3] = "001") THEN 1 ELSE 0

and each column derivation would start with:

    IF (StageVar1) THEN ...

This reduces both the number of substring functions evaluated and the number of string comparisons made in the Transformer.

• Where an expression includes calculated constant values
For example, a column definition may include a function call that returns a constant value, such as:

    Str(" ", 20)

This returns a string of 20 spaces. In this case, the function would be evaluated every time the column derivation is evaluated. It would be more efficient to calculate the constant value just once for the whole Transformer. This can be achieved using stage variables. This function could be moved into a stage variable derivation, but in this case the function would still be evaluated once for every input row. The solution here is to move the function evaluation into the initial value of a stage variable.

A stage variable can be assigned an initial value from the Stage Properties dialog / Variables tab in the Transformer stage editor. In this case, the variable would have its initial value set to:

    Str(" ", 20)

You would then leave the derivation of the stage variable on the main Transformer page empty. The initial value of the stage variable is evaluated just once, before any input rows are processed. Then, because the derivation expression of the stage variable is empty, it is not re-evaluated for each input row. Therefore, its value for the whole Transformer processing is unchanged from the initial value.

In addition to a function call returning a constant value, another example would be part of an expression such as:

    "abc" : "def"

As with the function-call example, this concatenation is evaluated every time the column derivation is evaluated. Since this subpart of the expression is constant, it could again be moved into a stage variable, using the initial value setting to perform the concatenation just once.

• Where an expression requiring a type conversion is used as a constant, or is used in multiple places
For example, an expression may include something like this:

    DSLink1.col1 + "1"

In this case, the "1" is a string constant, and so, in order to be able to add it to DSLink1.col1, it must be converted from a string to an integer each time the expression is evaluated. The solution in this case is just to change the constant from a string to an integer:

    DSLink1.col1 + 1

In this example, if DSLINK1.col1 were a string field, then a conversion would be required every time the expression is evaluated. If this appeared just once in one output column expression, that would be fine. However, if an input column is used in more than one expression, where it requires the same type conversion in each expression, then it would be more efficient to use a stage variable to perform the conversion once. In this case, you would create, for example, an integer stage variable, specify its derivation to be DSLINK1.col1, and then use the stage variable in place of DSLink1.col1 wherever that conversion would have been required.

It should be noted that when using stage variables to evaluate parts of expressions, the data type of the stage variable should be set correctly for that context. Otherwise, needless conversions are required wherever that variable is used.

8.2 Modify Stage
The Modify stage is the most efficient "stage" available, since it uses low-level functionality that is part of every DataStage Enterprise Edition component. As noted in the previous section, the Output Mapping properties for any parallel stage will generate an underlying modify for default data type conversions, dropping and renaming columns.

The standalone Modify stage can be used for non-default type conversions (nearly all date and time conversions are non-default), null conversion, and string trim. The Modify stage uses the syntax of the underlying modify operator, documented in the Parallel Job Developer's Guide as well as the Orchestrate Operators Reference.
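For example, a Modify stage specification for a non-default date conversion and a column rename might look like the following sketch (the column names and the format string are assumptions for illustration, not taken from this document):

    OrderDate:date = date_from_string[%yyyy-%mm-%dd](OrderDateText)
    CUSTOMER_ID = CustomerId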

8.2.1 Modify and Null Handling
The Modify stage can be used to convert an out-of-band null value to an in-band null representation and vice-versa.

NOTE: The DataStage Parallel Job Developer's Guide gives incorrect syntax for converting an out-of-band null to an in-band null (value) representation.

To convert from an out-of-band null to an in-band null (value) representation within Modify, the syntax is:

    destField[:dataType] = handle_null (sourceField, value)

where:
- destField is the destination field's name
- dataType is its optional data type; use it if you are also converting types
- sourceField is the source field's name
- value is the value you wish to represent a null in the output. The destField is converted from an Orchestrate out-of-band null to a value of the field's data type. For a numeric field, value can be a numeric value; for decimal, string, date, time, and timestamp fields, value can be a string.

To convert from an in-band null to an out-of-band null, the syntax is:

    destField[:dataType] = make_null(sourceField, value)

where:
- destField is the destination field's name
- dataType is its optional data type; use it if you are also converting types
- sourceField is the source field's name
- value is the value of the source field when it is null

8.2.2 Modify and String Trim
The function string_trim has been added to Modify, with the following syntax:

    stringField = string_trim[character, direction, justify] (string)

You can use this function to remove the characters used to pad variable-length strings when they are converted to fixed-length strings of greater length. By default, these characters are retained when the fixed-length string is then converted back to a variable-length string.

The character argument is the character to remove; by default, this is NULL. The value of the direction and justify arguments can be either begin or end; direction defaults to end, and justify defaults to begin. Justify has no effect when the target string has variable length.

The following example removes all leading ASCII NULL characters from the beginning of name and places the remaining characters in an output variable-length string with the same name:

    name:string = string_trim[NULL, begin](name)

The following example removes all trailing Z characters from color, and left-justifies the resulting hue fixed-length string:

    hue:string[10] = string_trim['Z', end, begin](color)
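Following the null-handling syntax described in 8.2.1, a complete specification takes the same form; for example (the column name and replacement value are illustrative assumptions):

    MiddleName:string = handle_null(MiddleName, '')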

9 Combining Data

9.1 Lookup vs. Join vs. Merge
The Lookup stage is most appropriate when the reference data for all Lookup stages in a job is small enough to fit into available physical memory. Each lookup reference requires a contiguous block of shared memory. If the Data Sets are larger than available memory resources, the JOIN or MERGE stage should be used.

Limit the use of database Sparse Lookups (available in the DB2 Enterprise, Oracle Enterprise, and ODBC Enterprise stages) to scenarios where the number of input rows is significantly smaller (for example, 1:100 or more) than the number of reference rows (see Section 10.1.7: Database Sparse Lookup vs. Join). Sparse Lookups may also be appropriate for exception-based processing when the number of exceptions is a small fraction of the main input data. It is best to test both the Sparse and Normal options to see which actually performs best, and to retest if the relative volumes of data change dramatically.

9.2 Capturing Unmatched Records from a Join
The Join stage does not provide reject handling for unmatched records (such as in an InnerJoin scenario). If un-matched rows must be captured or logged, an OUTER join operation must be performed. In an OUTER join scenario, all rows on an outer link (for example, Left Outer, Right Outer, or both links in the case of Full Outer) are output regardless of match on key values.

During an Outer Join, when a match does not occur, the Join stage inserts values into the unmatched non-key column(s) using the following rules:
a) If the non-key column is defined as nullable (on the Join input links), then Enterprise Edition will insert NULL values in the unmatched columns
b) If the non-key column is defined as not-nullable, then Enterprise Edition inserts "default" values based on the data type. For example, the default value for an Integer is zero, the default value for a Varchar is an empty string (""), and the default value for a Char is a string of padchar characters equal to the length of the Char column.

A Transformer stage can be used to test for NULL values in unmatched columns. However, care must be taken to change the column properties to allow NULL values before the Join. This is most easily done by inserting a Copy stage and mapping a column from NON-NULLABLE to NULLABLE.

In most cases, it is best to use a Column Generator to add an 'indicator' column, with a constant value, to each of the inner links and test that column for the constant after you have performed the join. This isolates your match/no-match logic from any changes in the metadata. It is also handy with Lookups that have multiple reference links.
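For example (the link and column names are hypothetical), if the Column Generator adds a non-nullable column match_ind with the constant value "Y" to the inner link, a downstream Transformer can route unmatched rows with an output link constraint such as:

    lnkJoined.match_ind <> "Y"

and matched rows with the complementary constraint lnkJoined.match_ind = "Y".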

9.3 The Aggregator Stage

9.3.1 Aggregation Method
By default, the Aggregation Method is set to Hash, which maintains the results of each key-column value / aggregation pair in memory. Because each key value / aggregation requires approximately 2K of memory, the Hash Aggregator should only be used when the number of distinct key values is small and finite. The Sort Aggregation Method should be used when the number of key values is unknown or very large. Unlike the Hash Aggregator, the Sort Aggregator requires pre-sorted data, but only maintains the calculations for the current group in memory.

9.3.2 Aggregation Data Type
By default, the output data type of a parallel Aggregator stage calculation or recalculation column is floating point (Double). To aggregate in decimal precision, set the optional property "Aggregations/Default to Decimal Output" within the Aggregator stage. You can also specify that the result of an individual calculation or recalculation is decimal by using the optional "Decimal Output" sub-property.

Note that performance is typically better if you let calculations occur in floating point (Double) data type and convert the results to decimal downstream in the flow. An exception to this is financial calculations, which should be done in decimal to preserve appropriate precision.

9.3.3 Performing Total Aggregations
The Aggregator counts and calculates based on distinct key value groupings. To perform a total aggregation, use the stages shown on the right to:
- generate a single constant-value key column using the Column Generator or an upstream Transformer
- aggregate in parallel on the generated column (partition Round Robin, aggregate on the generated key column); there is no need to sort or hash-partition the input data with only one key column value
- aggregate Sequentially on the generated column

Note that in this example two Aggregators are used, to prevent the sequential aggregation from disrupting upstream processing.

10 Database Stage Guidelines

10.1 Database development overview
This section is intended to provide guidelines appropriate to accessing any database within DataStage Enterprise Edition. Subsequent sections provide database-specific tips and guidelines.

10.1.1 Database stage types
DataStage Enterprise Edition offers database connectivity through native parallel and plug-in stage types:

Native Parallel Database Stages:
- DB2/UDB Enterprise
- Informix Enterprise
- ODBC Enterprise
- Oracle Enterprise
- SQL Server Enterprise
- Teradata Enterprise

Plug-In Database Stages:
- Dynamic RDBMS
- DB2/UDB API
- DB2/UDB Load
- Informix CLI
- Informix Load
- Informix XPS Load
- Oracle OCI Load
- RedBrick Load
- Sybase IQ12 Load
- Sybase OC
- Teradata API
- Teradata MultiLoad (MultiLoad)
- Teradata MultiLoad (TPump)

NOTE: Not all database stages (for example, Teradata API) are visible in the default DataStage Designer palette. You may need to customize the palette to add hidden stages.

10.1.1.1 Native Parallel database stages
In general, for maximum parallel performance, scalability, and features, it is best to use the native parallel database stages in a job design if connectivity requirements can be satisfied. Because of their tight integration with database technologies, the native parallel stages often have more stringent connectivity requirements than plug-in stages. For example, the DB2/UDB Enterprise stage is only compatible with DB2 Enterprise Server Edition with DPF on the same UNIX platform as the DataStage server.

Native parallel stages always pre-query the database for actual runtime metadata (column names, types, attributes). This allows Enterprise Edition to match return columns by name, not position in the stage Table Definitions. However, care must be taken to assign the correct data types in the job design.

10.1.1.2 ODBC Enterprise stage
In general, native database components (such as the Oracle Enterprise stage) are preferable to ODBC connectivity if both are supported on the database platform, operating system, and version. The benefit of the ODBC Enterprise stage comes from the large number of included and third-party ODBC drivers that enable connectivity to all major database platforms. ODBC also provides an increased level of "data virtualization" which can be useful when sources and targets (or deployment platforms) change.

DataStage Enterprise Edition bundles OEM versions of ODBC drivers from DataDirect. On UNIX, the DataDirect ODBC Driver Manager is also included. "Wire Protocol" ODBC drivers generally do not require database client software to be installed on the server platform.

Unlike the database-specific parallel stages, the ODBC Enterprise stage cannot read in parallel (although a patch to allow parallel read may be available on some platforms through IBM IIS Support). Furthermore, the ODBC Enterprise stage cannot interface with database-specific parallel load technologies, and cannot span multiple servers in a clustered or Grid configuration.

10.1.1.3 Plug-In database stages
Plug-in stage types are intended to provide connectivity to database configurations not offered by the native parallel stages. Because plug-in stage types cannot read in parallel, they should only be used when it is not possible to use a native parallel stage. Unlike the database-specific parallel stages, plug-in database stages match columns by order, not name, so Table Definitions must match the order of columns in a query.

10.1.2 Database Metadata

10.1.2.1 Runtime metadata
At runtime, the DS/EE native parallel database stages always "pre-query" the database source or target to determine the actual metadata (column names, data types, nullability) and, in some cases, the partitioning scheme of the source or target table. For each native parallel database stage:
- rows of the database result set correspond to records of a DS/EE Data Set
- columns of the database row correspond to columns of a DS/EE record
- the name and data type of each database column corresponds to a DS/EE Data Set name and data type, using a predefined mapping of database data types to Enterprise Edition data types
- both DS/EE and relational databases support null values, and a null value in a database column is stored as an out-of-band NULL value in the DS/EE column

The actual metadata used by a DS/EE native parallel database stage is always determined at runtime, regardless of the Table Definitions assigned by the DataStage developer. This allows the database stages to match return values by column name instead of position. Database-specific data type mapping tables are included in the following sections.

10.1.2.2 Metadata Import
When using the native parallel DB2 Enterprise, Informix Enterprise, or Oracle Enterprise stages, use orchdbutil to import metadata to avoid type conversion issues. This utility is available as a server command-line utility and within Designer and Manager using "Import Orchestrate Schema Definitions", selecting the "Import from Database Table" option in the wizard, as illustrated below:

Figure 42: orchdbutil metadata import

One disadvantage of the graphical orchdbutil metadata import is that the user interface requires each table to be imported individually. When importing a large number of tables, it is easier to use the corresponding orchdbutil command-line utility from the DataStage server machine, since as a command orchdbutil can be scripted to automate the process of importing a large number of tables.

10.1.2.3 Defining Metadata for Database Functions
When using database functions within a SQL SELECT list in a Read or Lookup, it is important to use SQL aliases to explicitly name the calculated columns so that they can be referenced within the DataStage job. For example, the following SQL assigns the alias Total to the calculated column:

    SELECT store_name, SUM(sales) Total
    FROM store_info
    GROUP BY store_name

The alias name(s) should then be added to the Table Definition within DataStage. Note that in many cases it may be more appropriate to aggregate using the Enterprise Edition Aggregator stage. However, there may be cases where user-defined functions or logic need to be executed on the database server.

10.1.3 Optimizing Select Lists
For best performance and optimal memory usage, it is best to explicitly specify column names on all source database stages, instead of using an unqualified "Table" or SQL "SELECT *" read. For the "Table" read method, always specify the "Select List" subproperty. For the "Auto-Generated" SQL read method, the DataStage Designer will automatically populate the select list based on the stage's output column definition. The only exception to this rule is when building dynamic database jobs that use runtime column propagation to process all columns in a source table.

10.1.4 Testing Database Connectivity
The "View Data" button on the Output / Properties tab of source database stages lets you verify database connectivity and settings without having to create and run a job. Test the connection using the View Data button. If the connection is successful, you will see a window with the result columns and data, similar to the illustration on the right:

Figure 43: Sample View Data Output

If the connection fails, an error message may appear, and you will be prompted to view additional detail. Clicking YES will display a detailed dialog box with the specific error messages generated by the database stage, which can be very useful in debugging a database connection failure.

Figure 44: View Additional Error Detail

10.1.5 Designing for Restart
To enable restart of high-volume jobs, it is important to separate the transformation process from the database write (Load or Upsert) operation. After transformation, the results should be landed to a parallel Data Set. Subsequent job(s) should read this Data Set and populate the target table using the appropriate database stage and write method. As a further optimization, a Lookup stage (or Join stage, depending on data volume) can be used to identify existing rows before they are inserted into the target table.

10.1.6 Database OPEN and CLOSE Commands
The native parallel database stages provide options for specifying OPEN and CLOSE commands. These options allow commands (including SQL) to be sent to the database before (OPEN) or after (CLOSE) all rows are read, written, or loaded. OPEN and CLOSE are not offered by plug-in database stages.

For example, the OPEN command can be used to create a target table, including database-specific options (tablespace, logging, constraints, etc) not possible with the "Create" option. In general, it is not a good idea to let DataStage generate target tables unless they are used for temporary storage; there are limited capabilities to specify Create table options in the stage, and doing so may violate data-management (DBA) policies.

As another example, the OPEN command could be used to create a temporary table, and the CLOSE command could be used to select all rows from the temporary table and insert them into a final target table.
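As one hedged variation on that pattern, assuming a pre-existing work table (the table names are hypothetical), the stage properties might specify:

    Open command:   DELETE FROM stg_orders_work
    Close command:  INSERT INTO orders_final SELECT * FROM stg_orders_work

so that the work table is emptied before the parallel write begins, and its contents are published in a single statement once all rows have been loaded.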

It is important to understand the implications of specifying a user-defined OPEN or CLOSE command. For example, when reading from DB2, a default OPEN statement places a shared lock on the source. When specifying a user-defined OPEN command, this lock is not sent – and should be specified explicitly if appropriate. Further details are outlined in the respective database sections of the Orchestrate Operators Reference, which is part of the Orchestrate OEM documentation.

10.1.7 Database Sparse Lookup vs. Join
Data read by any database stage can serve as the reference input to a Lookup operation. By default, this reference data is loaded into memory like any other reference link ("Normal" Lookup).

When directly connected as the reference link to a Lookup stage, the DB2/UDB Enterprise, ODBC Enterprise, and Oracle Enterprise stages allow the lookup type to be changed to "Sparse", sending individual SQL statements to the reference database for each incoming Lookup row. Sparse Lookup is only available when the database stage is directly connected to the reference link, with no intermediate stages.

IMPORTANT: The individual SQL statements required by a "Sparse" Lookup are an expensive operation from a performance perspective. In most cases, it is faster to use a DataStage JOIN stage between the input and DB2 reference data than it is to perform a "Sparse" Lookup. For scenarios where the number of input rows is significantly smaller (for example, 1:100 or more) than the number of reference rows in a DB2 or Oracle table, a Sparse Lookup may be appropriate.

10.1.8 Appropriate Use of SQL and DataStage
When using relational database sources, there is often a functional overlap between SQL and DataStage functionality. Although it is possible to use either SQL or DataStage to solve a given business problem, the optimal implementation involves leveraging the strengths of each technology to provide maximum throughput and developer productivity. While there are extreme scenarios when the appropriate technology choice is clearly understood, there may be "gray areas" where the decision should be made based on factors such as developer productivity, metadata capture and re-use, and ongoing application maintenance costs.

The following guidelines can assist with the appropriate use of SQL and DataStage technologies in a given job flow:
• When possible, use a SQL filter (WHERE clause) to limit the number of rows sent to the DataStage job. This minimizes impact on network and memory resources, and leverages the database capabilities.

• Use a SQL Join to combine data from tables with a small number of rows in the same database instance, especially when the join columns are indexed. A join that significantly reduces the result set is also often appropriate to do in the database.
• When combining data from very large tables, or when the source includes a large number of database tables, the efficiency of the Enterprise Edition Sort and Join stages can be significantly faster than an equivalent SQL query. In this scenario, it can still be beneficial to use database filters (WHERE clause) if appropriate.
• Avoid the use of database stored procedures (for example, Oracle PL/SQL) on a per-row basis within a high-volume data flow. For maximum scalability and parallel performance, it is best to implement business rules using native parallel DataStage components.

10.2 DB2 Guidelines

10.2.1 DB2 Stage Types
DataStage Enterprise Edition provides access to DB2 databases using one of five stages, summarized in the following table:

DataStage Stage Name | Stage Type | DB2 Requirement | Supports Partitioned DB2? | Parallel Read? | Parallel Write? | Parallel Sparse Lookup? | SQL Open / Close?
DB2/UDB Enterprise | Native Parallel | DPF, same platform as ETL server (2) | Yes / directly to each DB2 node | Yes | Yes | Yes | Yes
DB2/UDB API | Plug-In | Any DB2 via DB2 Client or DB2-Connect | Yes / through DB2 node 0 | No | Possible Limitations | No | No
DB2/UDB Load | Plug-In | Subject to DB2 Loader limitations | No | No | No | No | No
ODBC Enterprise | Native | Any DB2 via DB2 Client or DB2-Connect | Yes / through DB2 node 0 | No (3) | No | No | No
Dynamic RDBMS | Plug-In | Any DB2 via DB2 Client or DB2-Connect | Yes / through DB2 node 0 | No | Possible Limitations | No | No

(2) It is possible to connect the DB2/UDB Enterprise stage to a remote database by simply cataloging the remote database in the local instance and then using it as if it were a local database. This will only work when the authentication mode of the database on the remote instance is set to "client authentication". If you use the stage in this way, you may experience data duplication when working in partitioned instances, since the node configuration of the local instance may not be the same as the remote instance. For this reason, the "client authentication" configuration of a remote instance is not recommended.

(3) A patched version of the ODBC Enterprise stage allowing parallel read is available from IBM IIS Support for some platforms. Check with IBM IIS Support for availability.

For specific details on the stage capabilities, consult the DataStage documentation (DataStage Parallel Job Developer's Guide and DataStage Plug-In guides).

10.2.1.1 DB2/UDB Enterprise stage
Enterprise Edition provides native parallel read, lookup, upsert, and load capabilities to parallel DB2 databases on UNIX using the native parallel DB2/UDB Enterprise stage.

The DB2/UDB Enterprise stage requires DB2 Enterprise Server Edition on UNIX with the Data Partitioning Facility (DPF) option. (Before DB2 v8, this was also called "DB2 EEE".) Furthermore, the DB2 hardware/UNIX/software platform must match the hardware/software platform of the DataStage ETL server.

As a native, parallel component, the DB2/UDB Enterprise stage is designed for maximum performance and scalability. These goals are achieved through tight integration with the DB2 RDBMS, including direct communication with each DB2 database node, and reading from and writing to DB2 in parallel (where appropriate), using the same data partitioning as the referenced DB2 tables.

10.2.1.2 ODBC and DB2 Plug-In Stages
The ODBC Enterprise and plug-in stages are designed for lower-volume access to DB2 databases without the DPF option installed (prior to v8, "DB2 EE"). These stages also provide connectivity to non-UNIX DB2 databases, databases on UNIX platforms that differ from the platform of the DataStage ETL server, or DB2 databases on Windows or mainframe platforms (except for the "Load" stage against a mainframe DB2 instance, which is not supported).

While facilitating flexible connectivity to multiple types of remote DB2 database servers, the use of DataStage plug-in stages will limit overall performance and scalability. Because each plug-in invocation will open a separate connection to the same target DB2 database table, the ability to write in parallel may be limited by the table and index configuration set by the DB2 database administrator. Furthermore, when used as data sources, plug-in stages cannot read from DB2 in parallel.

Using the DB2/UDB API stage or the Dynamic RDBMS stage, it may be possible to write to a DB2 target in parallel, since the DS/EE framework will instantiate multiple copies of these stages to handle the data that has already been partitioned in the parallel framework. Sparse Lookup is not supported through the DB2/API stage.

The DB2/API (plug-in) stage should only be used to read from and write to DB2 databases on non-UNIX platforms (such as mainframe editions through DB2-Connect).

10.2.2 Connecting to DB2 with the DB2/UDB Enterprise Stage
Create a Parallel job and add a DB2/UDB Enterprise stage. Add the following properties:

Figure 45: DB2/UDB Enterprise stage properties

For connection to a remote DB2/UDB instance, you will need to set the following properties on the DB2/UDB Enterprise stage in your parallel job:
• Client Instance Name. Set this to the DB2 client instance name. If you set this property, DataStage assumes you require a remote connection.
• Server. Optionally set this to the instance name of the DB2 server. Otherwise, use the DB2 environment variable DB2INSTANCE to identify the instance name of the DB2 server.
• Client Alias DB Name. Set this to the DB2 client's alias database name for the remote DB2 server database. This is required only if the client's alias is different from the actual name of the remote server database.
• Database. Optionally set this to the remote server database name. Otherwise, use the environment variables $APT_DBNAME or $APT_DB2DBDFT to identify the database.
• User. Enter the user name for connecting to DB2. This is required for a remote connection; because catalog information is retrieved from the local instance of DB2, this user must have privileges for that local instance.
• Password. Enter the password for connecting to DB2. This is required for a remote connection; because catalog information is retrieved from the local instance of DB2, this user must have privileges for that local instance.

10.2.3 Configuring DB2 Multiple Instances in One DataStage Job
Although it is not officially supported, it is possible to connect to more than one DB2 instance within a single job. Your job must meet one of the following configurations (the word "stream" refers to a contiguous flow of one stage to another within a single job):
1. Single stream, two instances only: reading from one instance and writing to another instance, with no other DB2 instances. (It is not clear how many stages of these two instances can be added to the canvas for lookups in this configuration.)
2. Two streams, one instance per stream: reading from instance A and writing to instance A, and reading from instance B and writing to instance B. (It is not clear how many stages of these two instances can be added to the canvas for lookups in this configuration.)
3. Multiple streams with N DB2 sources and no DB2 targets: reading from 1 to n DB2 instances in separate source stages, with no other DB2 stages downstream.

In order to get this configuration to work correctly, you must adhere to all of the directions specified for connecting to a remote instance AND the following:
• You must not set the APT_DB2INSTANCE_HOME environment variable. Once this variable is set, DataStage will try to use it for each of the connections in the job, and since a db2nodes.cfg file can only contain information for one instance, this will create problems.

• In order for DataStage to locate the db2nodes.cfg file, you must build a user on the DataStage server with the same name as the instance you are trying to connect to (the default logic of the DB2/UDB Enterprise stage is to use the instance's home directory, as defined for the UNIX user with the same name as the DB2 instance). In that user's UNIX home directory, create a sqllib subdirectory and place the remote instance's db2nodes.cfg there. Since APT_DB2INSTANCE_HOME is not set, DataStage defaults to this directory to find the configuration file for the remote instance.

To connect to multiple DB2 instances, we recommend using separate jobs with their respective DB2 environment variable settings, landing intermediate results to a parallel Data Set. Or, if the data volumes are sufficiently small, DB2 plug-in stages (DB2 API, DB2 Load, Dynamic RDBMS) may be used to access data in other instances. Depending on platform configuration and I/O subsystem performance, separate jobs can also communicate through named pipes, although this incurs the overhead of the Sequential File stage (and its corresponding export/import operators), which does not run in parallel.
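The following sketch shows what the single remote-instance setup might look like (the instance name, host, and paths are hypothetical, and the commands assume root access on a Linux DataStage server):

    # Create a local user named after the remote DB2 instance so the DB2/UDB Enterprise
    # stage can derive the instance "home" directory (hypothetical names and paths)
    useradd -m -d /home/db2rmt1 db2rmt1
    mkdir -p /home/db2rmt1/sqllib
    # Copy the remote instance's db2nodes.cfg into the sqllib subdirectory
    scp db2host.example.com:/home/db2rmt1/sqllib/db2nodes.cfg /home/db2rmt1/sqllib/

    # When only one (remote) instance is used in a job, the same location could instead be
    # supplied explicitly; do NOT set this variable for multiple-instance jobs (see 10.2.3):
    # export APT_DB2INSTANCE_HOME=/home/db2rmt1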

10.2.4 DB2/UDB Enterprise stage Column Names
At runtime, the native parallel DB2/UDB Enterprise stage translates column names exactly, except when a component of a DB2 column name is not compatible with Enterprise Edition column naming conventions. Enterprise Edition places no limit on the length of a column name, but has the following restrictions:
- the name must start with a letter or underscore character
- the name can contain only alphanumeric and underscore characters
- the name is case insensitive
When there is an incompatibility, Enterprise Edition converts the DB2 column name as follows:
- if the DB2 column name does not begin with a letter or underscore, the string "APT__column#" (two underscores) is added to the beginning of the column name, where column# is the number of the column. For example, if the third DB2 column is named 7dig, the Enterprise Edition column will be named "APT__37dig".
- if the DB2 column name contains a character that is not alphanumeric or an underscore, the character is replaced by two underscore characters.

10.2.5 DB2/API stage Column Names
When using the DB2/API, DB2 Load, and Dynamic RDBMS plug-in stages, set the environment variable $DS_ENABLE_RESERVED_CHAR_CONVERT if your DB2 database uses the reserved characters # or $ in column names. This converts those special characters into an internal representation that DataStage can understand. Observe the following guidelines when $DS_ENABLE_RESERVED_CHAR_CONVERT is set:
- Avoid using the strings __035__ and __036__ in your DB2 column names (these are used as the internal representations of # and $ respectively).
- Import meta data using the Plug-in Meta Data Import tool, and avoid hand editing (this minimizes the risk of mistakes or confusion).
- Once the table definition is loaded, the internal column names are displayed rather than the original DB2 names, both in table definitions and in the Data Browser, and they are also used in derivations and expressions. The original names are used in generated SQL statements, however, and you should use them if entering SQL in the job yourself.

10.2.6 DB2/UDB Enterprise stage Data Type Mapping
The DB2 database schema to be accessed must NOT have any columns with User Defined Types (UDTs). Use the "db2 describe table [table-name]" command on the DB2 client for each table to be accessed to determine whether UDTs are in use, or examine the DDL for each schema to be accessed.

Table Definitions should be imported into DataStage using orchdbutil to ensure accurate Table Definitions. The DB2/UDB Enterprise stage converts DB2 data types to Enterprise Edition data types, as shown in the following table:

DB2 Data Type            Enterprise Edition Data Type
CHAR(n)                  string[n] or ustring[n]
CHARACTER VARYING(n,r)   string[max=n] or ustring[max=n]
DATE                     date
DATETIME                 time or timestamp, with corresponding fractional precision for time: if the DATETIME starts with a year component, the result is a timestamp field; if the DATETIME starts with an hour, the result is a time field
DECIMAL[p,s]             decimal[p,s] where p is the precision and s is the scale
DOUBLE-PRECISION         dfloat
FLOAT                    dfloat
INTEGER                  int32
MONEY                    decimal
NCHAR(n,r)               string[n] or ustring[n]
NVARCHAR(n,r)            string[max=n] or ustring[max=n]
REAL                     sfloat
SERIAL                   int32
SMALLFLOAT               sfloat
SMALLINT                 int16
VARCHAR(n)               string[max=n] or ustring[max=n]

IMPORTANT: DB2 data types that are not listed in the above table cannot be used in the DB2/UDB Enterprise stage, and will generate an error at runtime.
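As a quick pre-check for the UDT restriction described at the start of 10.2.6, the DB2 command line processor can be used as sketched below (the database, user, and table names are placeholder values):

    # Connect to the target database and describe the table; a column whose type schema
    # is not SYSIBM (for example a distinct or structured type) indicates a UDT
    db2 connect to salesdb user dsadm
    db2 describe table dwh.customer_dim
    db2 connect reset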

10.2.7 DB2/UDB Enterprise stage options
The DB2/UDB Enterprise (native parallel) stage should be used for reading from, performing lookups against, and writing to a DB2 Enterprise Server Edition database with the Database Partitioning Feature (DPF):
• As a native, parallel component, the DB2/UDB Enterprise stage is designed for maximum performance and scalability against very large partitioned DB2 UNIX databases.
• The DB2/UDB Enterprise stage is tightly integrated with the DB2 RDBMS, communicating directly with each database node, reading from and writing to DB2 in parallel (where appropriate), and using the same data partitioning as the referenced DB2 tables.

When writing to a DB2 database in parallel, the DB2/UDB Enterprise stage offers the choice of SQL methods (insert / update / upsert / delete) or the fast DB2 loader method. The choice between these methods depends on required performance, database log usage, and recoverability:
a) The Write Method (and corresponding insert / update / upsert / delete) communicates directly with the DB2 database nodes to execute instructions in parallel. All operations are logged to the DB2 database log, and the target table(s) may be accessed by other users. Time- and row-based commit intervals determine the transaction size and the availability of new rows to other applications.
b) The DB2 Load method requires that the DataStage user running the job have DBADM privilege on the target DB2 database. During the load operation, the DB2 Load method places an exclusive lock on the entire DB2 tablespace into which it loads the data, and no other tables in that tablespace can be accessed by other applications until the load completes. The DB2 load operator performs a non-recoverable load; that is, if the load operation is terminated before it is completed, the contents of the table are unusable and the tablespace is left in a load pending state. In this scenario, the DB2 Load DataStage job must be re-run in Truncate mode to clear the load pending state.

10.2.8 Performance Notes
In some cases, when using user-defined SQL without partitioning against large volumes of DB2 data, the overhead of routing information through a remote DB2 coordinator may be significant. In these instances, it may be beneficial to have the DB2 DBA configure separate DB2 coordinator nodes (with no local data) on each ETL server in clustered ETL configurations. In this configuration, the DB2 Enterprise stage should not include the Client Instance Name property, forcing the DB2 Enterprise stages on each ETL server to communicate directly with their local DB2 coordinator.

10.2.9 DB2 in the DataStage USS environment
The manner in which DataStage/USS Edition interfaces with DB2 is slightly different from the non-z/OS environment. All activity in the z/OS environment always goes through the DB2 coordinator node, so parallelism differs slightly depending on how DB2 is accessed.

When accessing a DB2 table using the Table read method, functions within the db2read operator read the DB2 SYSTABLES table to retrieve the tablespace and database name for the table. These values are in turn used to read the SYSTABLEPART table to retrieve the number of partitions, the partitioning index name(s), and the partition limit key value(s). Finally, the SYSKEYS and SYSCOLUMNS tables are read using the index name to get the associated column metadata (name and type). This information determines the number of db2read operators that the conductor builds into the score and the queries that they execute, as illustrated in Figure 46:

Figure 46: DB2 read on DataStage/USS

For example, suppose table T is in tablespace TS, and TS is partitioned into 3 partitions on Col1 (limits: F, P, T) and Col2 (limits: 10, 20, 40). The WHERE clauses created to read this table are:

    Where Col1 < 'F' or (Col1 = 'F' and (Col2 < 10 or Col2 = 10))
    Where (Col1 > 'F' and Col1 < 'P') or (Col1 = 'F' and Col2 > 10) or (Col1 = 'P' and (Col2 < 20 or Col2 = 20))
    Where Col1 > 'T' or (Col1 = 'T' and Col2 > 40)

The method that DataStage/USS Edition uses to write to DB2 UDB on z/OS works differently from the read process. Since all write operations must go through the DB2 coordinator node on z/OS (this differs from non-z/OS platforms), the number of operators does not have to match the number of partitions, and is instead controlled by the number of nodes in the configuration file. This is illustrated in Figure 47.

Figure 47: DB2 write on DataStage/USS

On DataStage/USS Edition, lookups work differently depending on whether the lookup is done normally (in memory) or using a sparse technique, where each lookup is effectively a query to the database. An example of an in-memory Normal Lookup is shown in Figure 48.

Figure 48: In-Memory Lookup on DataStage/USS

Here the Normal Lookup consists of reading the DB2 table into memory and then performing the lookup against the in-memory copy of the table. When the conductor creates the score, it matches the number of db2read operators to the partitioning scheme of the table (similar to the read) and the number of lookup operators to the number of nodes in the configuration file.

Contrast the Normal Lookup with the way a Sparse Lookup is done, shown in Figure 49, where each lookup operator issues an SQL query to DB2 for every row it processes. Since each of these queries must go through the DB2 coordinator node, we can effectively ignore the level of parallelism specified for the table.

Figure 49: DB2 Sparse Lookup on DataStage/USS

Finally, using the DB2 load utility on USS is different from non-z/OS environments. The DB2 LOAD utility is designed to run from JCL only, so to invoke it from a DataStage/USS job we call a DB2 stored procedure called DSNUTILS. The LOAD utility has a second limitation in that data cannot be piped into it, nor can it be read from a USS HFS file. This requires DataStage/USS to create an MVS flat file to pass to the loader; note that this is the only non-HFS file that DS/USS can write to. Since there is no Sequential File stage associated with this MVS load file, a special resource statement must be added to the configuration file to specify the MVS dataset name to use. Figure 50 illustrates the DB2 LOAD process on USS and also shows the format of the special resource statement used to define the MVS dataset used during the load operation.

Figure 50: Calling DB2 Load Utility on DataStage/USS

10.3 Informix Database Guidelines

10.3.1 Informix Enterprise Stage Column Names
For each Informix Enterprise stage:
- rows of the database result set correspond to records of a DS/EE Data Set
- columns of the database row correspond to columns of a DS/EE record
- the name and data type of each database column corresponds to a DS/EE Data Set name and data type, using a predefined mapping of database data types to Enterprise Edition data types
- both DS/EE and Informix support null values, and a null value in a database column is stored as an out-of-band NULL value in the DS/EE column

10.3.2 Informix Enterprise stage Data Type Mapping
Table Definitions should be imported into DataStage using orchdbutil to ensure accurate Table Definitions. The Informix Enterprise stage converts Informix data types to Enterprise Edition data types, as shown in the following table:

Informix Data Type       Enterprise Edition Data Type
CHAR(n)                  string[n]
CHARACTER VARYING(n,r)   string[max=n]
DATE                     date
DATETIME                 date, time, or timestamp, with corresponding fractional precision for time: if the DATETIME starts with a year component and ends with a month, the result is a date field; if the DATETIME starts with a year component, the result is a timestamp field; if the DATETIME starts with an hour, the result is a time field
DECIMAL[p,s]             decimal[p,s] where p is the precision and s is the scale; the maximum precision is 32, and a decimal with floating scale is converted to dfloat
DOUBLE-PRECISION         dfloat
FLOAT                    dfloat
INTEGER                  int32
MONEY                    decimal
NCHAR(n,r)               string[n]
NVARCHAR(n,r)            string[max=n]
REAL                     sfloat
SERIAL                   int32
SMALLFLOAT               sfloat
SMALLINT                 int16
VARCHAR(n)               string[max=n]

IMPORTANT: Informix data types that are not listed in the above table cannot be used in the Informix Enterprise stage, and will generate an error at runtime.

10.4 ODBC Enterprise Guidelines

10.4.1 ODBC Enterprise Stage Column Names
For each ODBC Enterprise stage:
- rows of the database result set correspond to records of a DS/EE Data Set
- columns of the database row correspond to columns of a DS/EE record
- the name and data type of each database column corresponds to a DS/EE Data Set name and data type, using a predefined mapping of database data types to Enterprise Edition data types
- names are translated exactly, except when the external data source column name contains a character that DataStage does not support; in that case, two underscore characters replace the unsupported character
- both DS/EE and ODBC support null values, and a null value in a database column is stored as an out-of-band NULL value in the DS/EE column

10.4.2 ODBC Enterprise stage Data Type Mapping
ODBC data sources are not supported by the orchdbutil utility, so it is important to verify the correct ODBC to Enterprise Edition data mapping, as shown in the following table:

ODBC Data Type       Enterprise Edition Data Type
SQL_BIGINT           int64
SQL_BINARY           raw(n)
SQL_CHAR             string[n]
SQL_DECIMAL          decimal[p,s] where p is the precision and s is the scale
SQL_DOUBLE           decimal[p,s]
SQL_FLOAT            decimal[p,s]
SQL_GUID             string[36]
SQL_INTEGER          int32
SQL_BIT              int8 [0 or 1]
SQL_REAL             decimal[p,s]
SQL_SMALLINT         int16
SQL_TINYINT          int8
SQL_TYPE_DATE        date
SQL_TYPE_TIME        time[p]
SQL_TYPE_TIMESTAMP   timestamp[p]
SQL_VARBINARY        raw[max=n]
SQL_VARCHAR          string[max=n]
SQL_WCHAR            ustring[n]
SQL_WVARCHAR         ustring[max=n]

Note that the maximum size of a DataStage record is limited to 32K. If you attempt to read a record larger than 32K, Enterprise Edition will return an error and abort your job.

IMPORTANT: ODBC data types that are not listed in the above table cannot be used in the ODBC Enterprise stage, and will generate an error at runtime.

10.4.3 Reading ODBC Sources
Unlike other native parallel database stages, the ODBC Enterprise stage does not support parallel read (4), because this capability is not provided by the ODBC API.

Depending on the target database and the table configuration (row or page level lock mode, if available), it may be possible to write to a target database in parallel using the ODBC Enterprise stage.

(4) On some platforms, a patch may be available through IBM IIS Support to support parallel reads through ODBC. Parallel reads through ODBC match the degree of parallelism in the $APT_CONFIG_FILE.

10.5 Oracle Database Guidelines

10.5.1 Oracle Enterprise Stage Column Names
For each Oracle Enterprise stage:
- rows of the database result set correspond to records of a DS/EE Data Set
- columns of the database row correspond to columns of a DS/EE record
- the name and data type of each database column corresponds to a DS/EE Data Set name and data type, using a predefined mapping of database data types to Enterprise Edition data types
- names are translated exactly, except when the Oracle source column name contains a character that DataStage does not support; in that case, two underscore characters replace the unsupported character
- both DS/EE and Oracle support null values, and a null value in a database column is stored as an out-of-band NULL value in the DS/EE column

10.5.2 Oracle Enterprise stage Data Type Mapping
Oracle Table Definitions should be imported into DataStage using orchdbutil to ensure accurate Table Definitions. This is particularly important for Oracle databases, which are not heavily typed. Enterprise Edition maps Oracle data types based on the rules given in the following table:

Oracle Data Type   Enterprise Edition Data Type
CHAR(n)            string[n] or ustring[n], a fixed-length string with length = n
DATE               timestamp
NUMBER             decimal[38,10]
NUMBER[p,s]        int32 if precision (p) < 11 and scale (s) = 0; decimal[p,s] if precision (p) >= 11 or scale > 0
RAW(n)             not supported
VARCHAR(n)         string[max=n] or ustring[max=n], a variable-length string with maximum length = n

Note that the maximum size of a DataStage record is limited to 32K. If you attempt to read a record larger than 32K, Enterprise Edition will return an error and abort your job.

IMPORTANT: Oracle data types that are not listed in the above table cannot be used in the Oracle Enterprise stage, and will generate an error at runtime.

10.5.3 Reading from Oracle in Parallel
By default, the Oracle Enterprise stage reads sequentially from its source table or query. Setting the partition table option to the specified table enables parallel extracts from an Oracle source. The underlying Oracle table does not have to be partitioned for parallel read within Enterprise Edition.

It is important to note that certain types of queries cannot run in parallel. Examples include:
- queries containing a GROUP BY clause that are also hash partitioned on the same field

- queries performing a non-collocated join (a SQL JOIN between two tables that are not stored in the same partitions with the same partitioning strategy)

10.5.4 Oracle Load Options
When writing to an Oracle table (using Write Method = Load), Enterprise Edition uses the Parallel Direct Path Load method. When using this method, the Oracle stage cannot write to a table that has indexes on it (including indexes automatically generated by Primary Key constraints) unless you specify the Index Mode option (maintenance or rebuild).

Setting the environment variable $APT_ORACLE_LOAD_OPTIONS to "OPTIONS(DIRECT=TRUE, PARALLEL=FALSE)" also allows loading of indexed tables without index maintenance; in this instance, the Oracle load will be done sequentially.

The Upsert Write Method can be used to insert rows into a target Oracle table without bypassing indexes or constraints. In order to automatically generate the SQL required by the Upsert method, the key column(s) must be identified using the check boxes in the column grid.
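As a sketch of the load-options setting mentioned above (how the variable is supplied, for example in dsenv or as a job parameter, depends on local conventions):

    # Direct path load of an indexed table without index maintenance;
    # with PARALLEL=FALSE the Oracle load runs sequentially
    export APT_ORACLE_LOAD_OPTIONS="OPTIONS(DIRECT=TRUE, PARALLEL=FALSE)"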


10.6 Sybase Enterprise Guidelines
10.6.1 Sybase Enterprise Stage Column Names
For each Sybase Enterprise stage:
- rows of the database result set correspond to records of a DS/EE Data Set
- columns of the database row correspond to columns of a DS/EE record
- the name and data type of each database column corresponds to a DS/EE Data Set name and data type, using a predefined mapping of database data types to Enterprise Edition data types
- names are translated exactly, except when the Sybase source column name contains a character that DataStage does not support; in that case, two underscore characters replace the unsupported character
- both DS/EE and Sybase support null values, and a null value in a database column is stored as an out-of-band NULL value in the DS/EE column

10.6.2 Sybase Enterprise stage Data Type Mapping
Sybase databases are not supported by the orchdbutil utility, so it is important to verify the correct Sybase to Enterprise Edition data mapping, as shown in the following table:

Sybase Data Type            Enterprise Edition Data Type
BINARY(n)                   raw(n)
BIT                         int8
CHAR(n)                     string[n], a fixed-length string with length n
DATE                        date
DATETIME                    timestamp
DEC[p,s] or DECIMAL[p,s]    decimal[p,s] where p is the precision and s is the scale
DOUBLE PRECISION or FLOAT   dfloat
INT or INTEGER              int32
MONEY                       decimal[15,4]
NCHAR(n)                    ustring[n], a fixed-length string with length n (ASE only)
NUMERIC[p,s]                decimal[p,s] where p is the precision and s is the scale
NVARCHAR(n,r)               ustring[max=n], a variable-length string with length n (ASE only)
REAL                        sfloat
SERIAL                      int32
SMALLDATETIME               timestamp
SMALLFLOAT                  sfloat
SMALLINT                    int16
SMALLMONEY                  decimal[10,4]
TINYINT                     int8
TIME                        time
UNSIGNED INT                uint32
VARBINARY(n)                raw[max=n]
VARCHAR(n)                  string[max=n], a variable-length string with maximum length n

IMPORTANT: Sybase data types that are not listed in the above table cannot be used in the Sybase Enterprise stage, and will generate an error at runtime.



10.7 Teradata Database Guidelines
10.7.1 Choosing the Proper Teradata Stage
Within DataStage Enterprise Edition, the following stages can be used for reading from and writing to Teradata databases in a parallel job flow:

Source Teradata Stages:
- Teradata Enterprise
- Teradata API

Target Teradata Stages:
- Teradata Enterprise
- Teradata API
- Teradata MultiLoad (MultiLoad option)
- Teradata MultiLoad (TPump option)

For maximum performance of high-volume data flows, the native parallel Teradata Enterprise stage should be used. Teradata Enterprise uses the programming interface of the Teradata utilities FastExport (reads) and FastLoad (writes), and is subject to all these utilities’ restrictions. NOTE: Unlike the FastLoad utility, the Teradata Enterprise stage supports Append mode, inserting rows into an existing target table. This is done through a shadow “terasync” table.

Teradata has a system-wide limit on the number of concurrent database utilities. Each use of the Teradata Enterprise stage counts toward this limit.

10.7.2 Source Teradata Stages

Teradata Enterprise (Native Parallel)
  Usage Guidelines: reading a large number of rows in parallel; supports OPEN and CLOSE commands; subject to the limits of Teradata FastExport
  Parallel Read: Yes
  Teradata Utility Limit: applies

Teradata API (Plug-In)
  Usage Guidelines: reading a small number of rows sequentially
  Parallel Read: No
  Teradata Utility Limit: none

10.7.3 Target Teradata Stages

Teradata Enterprise (Native Parallel)
  Usage Guidelines: writing a large number of rows in parallel; supports OPEN and CLOSE commands; limited to INSERT (new table) or APPEND (existing table); subject to the limits of Teradata FastLoad (but also supports APPEND); locks the target table in exclusive mode
  Parallel Write: Yes
  Teradata Utility Limit: applies

Teradata MultiLoad (MultiLoad utility) (Plug-In)
  Usage Guidelines: Insert, Update, Delete, Upsert of moderate data volumes; locks the target table(s) in exclusive mode
  Parallel Write: No
  Teradata Utility Limit: applies


Teradata MultiLoad (TPump utility) (Plug-In)
  Usage Guidelines: Insert, Update, Delete, Upsert of small volumes of data within a large database; does not lock the target tables; should not be run in parallel, because each node and use counts toward the system-wide Teradata utility limit
  Teradata Utility Limit: applies

Teradata API (Plug-In)
  Usage Guidelines: Insert, Update, Delete, Upsert of small volumes of data; allows concurrent writes (does not lock the target); slower than TPump for equivalent operations
  Parallel Write: Yes
  Teradata Utility Limit: none

10.7.4 Teradata Enterprise Stage Column Names
For each Teradata Enterprise stage:
- rows of the database result set correspond to records of a DS/EE Data Set
- columns of the database row correspond to columns of a DS/EE record
- the name and data type of each database column corresponds to a DS/EE Data Set name and data type, using a predefined mapping of database data types to Enterprise Edition data types
- both DS/EE and Teradata support null values, and a null value in a database column is stored as an out-of-band NULL value in the DS/EE column
- DS/EE gives its columns the same name as the Teradata column name; however, while DS/EE column names can appear in either upper or lower case, Teradata column names appear only in upper case

10.7.5 Teradata Enterprise stage Data Type Mapping
Teradata databases are not supported by the orchdbutil utility, so it is important to verify the correct Teradata to Enterprise Edition data mapping, as shown in the following table:

Teradata Data Type   Enterprise Edition Data Type
byte(n)              raw[n]
byteint              int8
char(n)              string[n]
date                 date
decimal[p,s]         decimal[p,s] where p is the precision and s is the scale
double precision     dfloat
float                dfloat
graphic(n)           raw[max=n]
integer              int32
long varchar         string[max=n]
long vargraphic      raw[max=n]
numeric(p,s)         decimal[p,s]
real                 dfloat
smallint             int16
time                 time
timestamp            timestamp
varbyte(n)           raw[max=n]
varchar(n)           string[max=n]
vargraphic(n)        raw[max=n]


IMPORTANT: Teradata data types that are not listed in the above table cannot be used in the Teradata Enterprise stage, and will generate an error at runtime. Aggregates and most arithmetic operators are not allowed in the SELECT clause of a Teradata Enterprise stage.

10.7.6 Specifying Teradata Passwords with Special Characters
Teradata permits passwords with special characters and symbols. To specify a Teradata password that contains special characters, the password must be surrounded by an "escaped" single quote as shown, where pa$$ is the example password: \'pa$$\'

10.7.7 Teradata Enterprise Settings
Within the Teradata Enterprise stage, the DB Options property specifies the connection string and connection properties in the form:

    user=username,password=password[,SessionsPerPlayer=nn][,RequestedSessions=nn]

where SessionsPerPlayer and RequestedSessions are optional connection parameters that are required when accessing large Teradata databases.

By default, RequestedSessions equals the maximum number of available sessions on the Teradata instance, but it can be set to a value between 1 and the number of database vprocs.

The SessionsPerPlayer option determines the number of connections each DataStage EE player opens to Teradata. Indirectly, this determines the number of DataStage players, and hence the number of UNIX processes and the overall system resource requirements of the DataStage job. SessionsPerPlayer should be set such that:

    RequestedSessions = SessionsPerPlayer * (number of nodes) * (players per node)

The default value for the SessionsPerPlayer suboption is 2. Setting SessionsPerPlayer too low on a large system can result in so many players that the job fails due to insufficient resources. In that case, SessionsPerPlayer should be increased and/or RequestedSessions should be decreased.

10.7.8 Improving Teradata Enterprise Performance
Setting the environment variable $APT_TERA_64K_BUFFERS may significantly improve the performance of Teradata Enterprise connections, depending on network configuration. By default, the Teradata Enterprise stage uses 32K buffers. (Note that 64K buffers must be enabled at the Teradata server level.)
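Returning to the DB Options property described in 10.7.7, here is a worked illustration (the user, password, and counts are hypothetical): a job running on a 4-node configuration file with 2 players per node and SessionsPerPlayer=4 would request 4 * 4 * 2 = 32 sessions, so the property could look like:

    user=dsadm,password=\'pa$$\',SessionsPerPlayer=4,RequestedSessions=32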

10.7.9 Teradata on USS
On the USS platform, the Teradata Enterprise stage uses CLIv2 for channel-attached systems (OS/390 and z/OS). To connect to a Teradata server, you must supply the client with the Teradata Director Program (TDP) identifier, also known as the tdpid. On a network-attached system, the tdpid is the host name of the Teradata server. On MVS, the tdpid must be in the form TDPx, where x is 0-9, A-Z (case insensitive), $, #, or @; the first three characters must be TDP. That leaves 39 possible TDP names, and is different from the convention used for non-channel-attached systems.

11 Troubleshooting and Monitoring

11.1 Warning on Single-Node Configuration Files
Because the DS/EE configuration file can be changed at runtime, it is important that all jobs be tested with a configuration file that has more than one node in its default node pool. This ensures that the jobs have been designed with proper partitioning logic. If the job results are correct with a single-node configuration file and incorrect with a multi-node configuration file, the job's partitioning logic and parallel design concepts (especially within Transformer stages) should be examined.

11.2 Debugging Environment Variables
The following environment variables can be set to assist in debugging a parallel job:

$OSH_PRINT_SCHEMAS (setting: 1)
  Outputs the actual schema definitions used by the DataStage EE framework at runtime in the DataStage log. This can be useful when determining whether the actual runtime schema matches the expected job design table definitions.

$DS_PX_DEBUG (setting: 1)
  Captures copies of the job score, generated osh, and internal Enterprise Edition log messages in a directory corresponding to the job name. This directory is created in the "Debugging" sub-directory of the Project home directory on the DataStage server.

$APT_PM_PLAYER_TIMING (setting: 1)
  Prints detailed information in the job log for each operator, including CPU utilization and elapsed processing time.

$APT_PM_PLAYER_MEMORY (setting: 1)
  Prints detailed information in the job log for each operator when allocating additional heap memory.

$APT_BUFFERING_POLICY (setting: FORCE)
  Forces an internal buffer operator to be placed between every operator. Using $APT_BUFFERING_POLICY=FORCE in combination with $APT_BUFFER_FREE_RUN effectively isolates each operator from slowing upstream production; combined with the job monitor performance statistics, this can identify which part of a job flow is impacting overall performance. Setting $APT_BUFFERING_POLICY=FORCE is not recommended for production job runs.

$APT_PM_STARTUP_CONCURRENCY (setting: 5)
  This environment variable should not normally need to be set. When trying to start very large jobs on heavily-loaded servers, lowering this number limits the number of processes that are simultaneously created when a job is started.

$APT_PM_NODE_TIMEOUT (setting: [seconds])
  For heavily loaded MPP or clustered environments, this variable determines the number of seconds the conductor node will wait for a successful startup from each section leader. The default is 30 seconds.
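As a sketch of how a few of these might be enabled for a single diagnostic run (assuming the variables have been added as job parameters, and using hypothetical project and job names), they can be overridden from the dsjob command line:

    # Hypothetical example: enable schema, timing, and debug output for one run of a job
    dsjob -run \
          -param '$OSH_PRINT_SCHEMAS=1' \
          -param '$APT_PM_PLAYER_TIMING=1' \
          -param '$DS_PX_DEBUG=1' \
          -jobstatus dstage_dev TestJob01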

11.3 How to Isolate and Debug a Parallel Job
There are a number of tools available to debug DataStage Enterprise Edition jobs. The general process for debugging a job is:
- Check the Director job log for warnings. These may indicate an underlying logic problem or an unexpected data type conversion. All fatal and warning messages should be addressed before attempting to debug, tune, or promote a job from development into test or production. When a fatal error occurs, the log entry is sometimes preceded by a warning condition. In some instances, it may not be possible to remove all warning messages generated by the EE engine, but all warnings should be examined and understood.
- Enable the Job Monitoring environment variables detailed in Section 2.5.1: Environment Variable Settings and in the DataStage Parallel Job Advanced Developer's Guide.
- Use the Data Set Management tool (available in the Tools menu of DataStage Designer or DataStage Manager) to examine the schema, look at row counts, and manage source or target Parallel Data Sets.
- Use $OSH_PRINT_SCHEMAS to verify that the job's runtime schemas match what the job developer expected in the design-time column definitions. This places entries in the Director log with the actual runtime schema for every link, using Enterprise Edition internal data types.
  NOTE: For large jobs, it is possible for $OSH_PRINT_SCHEMAS to generate a log entry that is too large for DataStage Director to store or display. To capture the full schema output in these cases, enable both $OSH_PRINT_SCHEMAS and $DS_PX_DEBUG.
- Set the environment variable $DS_PX_DEBUG to capture copies of the job score, generated osh, and internal Enterprise Edition log messages in a directory corresponding to the job name. This directory is created in the "Debugging" sub-directory of the Project home directory on the DataStage server.
- Examine the score dump (placed in the Director log when $APT_DUMP_SCORE is enabled).

- For flat (sequential) sources and targets:
  o To display the actual contents of any file in hexadecimal and ASCII (including embedded control characters or ASCII NULLs), use the UNIX command: od -xc -Ax [filename]
  o To display the number of lines and characters in a specified ASCII text file, use the UNIX command: wc -lc [filename]. Dividing the total number of characters by the number of lines provides an audit to ensure all rows are the same length.
  NOTE: The wc command counts UNIX line delimiters, so if the file has any binary columns, this count may be incorrect. It is also not useful for files in non-delimited fixed-length record format.
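For instance (the file name and sizes below are hypothetical), a quick audit of a fixed-width extract might look like this:

    $ wc -lc customer_extract.dat
        1000   81000 customer_extract.dat
    # 81000 characters / 1000 lines = 81 bytes per row (80 data bytes plus the newline),
    # so every row appears to be the same length

    $ od -xc -Ax customer_extract.dat | head
    # shows the leading bytes in hex and ASCII, exposing embedded NULLs or control characters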

11.4 Viewing the Generated OSH
Within Designer, jobs are compiled into OSH (Orchestrate SHell) scripts that are used to execute the given job design at runtime. It is useful to examine the generated OSH for debugging purposes, and to understand internally what is running. To view generated OSH, it must be enabled for a given project within the Administrator client:

Figure 51: Generated OSH Administrator option

Once this option has been enabled for a given project, the generated OSH tab appears in the Job Properties dialog box:

Figure 52: Generated OSH in Designer Job Properties

11.5 Interpreting the Parallel Job Score
When attempting to understand an Enterprise Edition flow, the first task is to examine the score dump, which is generated when you set APT_DUMP_SCORE=TRUE in your environment. A score dump includes a variety of information about a flow, including how composite operators and shared containers break down, where data is repartitioned and how it is repartitioned, which operators (if any) have been inserted by EE, what degree of parallelism each operator runs with, and exactly which nodes each operator runs on. Some information about where data may be buffered is also available.

The following score dump shows a flow with a single Data Set, which has a hash partitioner that partitions on key field a. It shows three stages: Generator, Sort (tsort), and Peek. The Peek and Sort stages are combined; that is, they have been optimized into the same process. All stages in this flow are running on one node. The job runs 3 processes on 2 nodes.

    ##I TFSC 004000 14:51:50(000) <main_program> This step has 1 dataset:
    ds0: {op0[1p] (sequential generator)
          eOther(APT_HashPartitioner { key={ value=a } })->eCollectAny
          op1[2p] (parallel APT_CombinedOperatorController:tsort)}
    It has 2 operators:
    op0[1p] {(sequential generator)
        on nodes (
          lemond.torrent.com[op0.p0]
        )}
    op1[2p] {(parallel APT_CombinedOperatorController:
          (tsort)
          (peek)
        ) on nodes (
          lemond.torrent.com[op1.p0]
          lemond.torrent.com[op1.p1]
        )}

In a score dump, there are three areas to investigate:
• Are there sequential stages?
• Is needless repartitioning occurring?
• In a cluster or Grid, are the computation-intensive stages shared evenly across all nodes?

More details on interpreting the parallel job score can be found in 12.4.2 Understanding the Parallel Job Score.

12 Performance Tuning Job Designs
The ability to process large volumes of data in a short period of time depends on all aspects of the flow and environment being optimized for maximum throughput and performance. Performance tuning and optimization is an iterative process that begins at job design and unit tests, proceeds through integration and volume testing, and continues throughout an application's production lifecycle. This section provides tips for designing a job for optimal performance, and for optimizing the performance of a given data flow using various settings and features within DataStage Enterprise Edition.

12.1 How to Design a Job for Optimal Performance
Overall job design can be the most significant factor in data flow performance. This section outlines performance-related tips that can be followed when building a parallel data flow using DataStage Enterprise Edition.

a) Use Parallel Data Sets to land intermediate results between parallel jobs.
• Parallel Data Sets retain data partitioning and sort order, in the DS/EE native internal format, facilitating end-to-end parallelism across job boundaries.
• Data Sets can only be read by other DS/EE parallel jobs (or the orchadmin command line utility). If you need to share information with external applications, File Sets facilitate parallel I/O, at the expense of exporting to a specified file format.
• Lookup File Sets can be used to store reference data used in subsequent jobs. They maintain reference data in DS/EE internal format, pre-indexed. However, Lookup File Sets can only be used on reference links to a Lookup stage, and there are no utilities for examining data within a Lookup File Set.

b) Remove unneeded columns as early as possible within the data flow. Every unused column requires additional memory, which can impact performance (it also makes each transfer of a record from one stage to the next more expensive).
• When reading from database sources, use a select list to read the needed columns instead of the entire table (if possible).
• Be alert when using runtime column propagation ("RCP"); it may be necessary to disable RCP for a particular stage to ensure that columns are actually removed using that stage's Output Mapping.

c) Always specify a maximum length for Varchar columns. Unbounded strings (Varchars without a maximum length) can have a significant negative performance impact on a job flow.
• There are limited scenarios when the memory overhead of handling large Varchar columns would instead dictate the use of unbounded strings. For example:
  - Varchar columns of a large (for example, 32K) maximum length that are rarely populated
  - Varchar columns of a large maximum length with highly varying data sizes

d) Avoid type conversions, if possible.
o When working with database sources and targets, use orchdbutil to ensure that the design-time metadata matches the actual runtime metadata (especially with Oracle databases).
o Enable $OSH_PRINT_SCHEMAS to verify that the runtime schema matches the job design column definitions.
o Verify that the data type of defined Transformer stage variables matches the expected result type.

e) Minimize the number of Transformers. For data type conversions, renaming, and removing columns, other stages (for example, Copy or Modify) may be more appropriate. Note, however, that unless dynamic (parameterized) conditions are required, a Transformer is always faster than a Filter or Switch stage.
• Avoid using the BASIC Transformer, especially in large-volume data flows. External user-defined functions can expand the capabilities of the parallel Transformer.
• Use BuildOps only when existing Transformers do not meet performance requirements, or when complex reusable logic is required. Because BuildOps are built in C++, there is greater control over the efficiency of the code, at the expense of ease of development (and more skilled developer requirements).

f) Minimize the number of partitioners in a job. It is usually possible to choose a smaller partition-key set, and to simply re-sort on a differing set of secondary/tertiary keys.
• When possible, ensure data is as close to evenly distributed as possible. When business rules dictate otherwise and the data volume is large and sufficiently skewed, repartition to a more balanced distribution as soon as possible to improve the performance of downstream stages.

• Know your data. Choose hash key columns that generate sufficient unique key combinations (while satisfying business requirements).
• Use SAME partitioning carefully. Remember that SAME maintains the degree of parallelism of the upstream operator.

g) Minimize and combine the use of Sorts where possible.
o It is frequently possible to arrange the order of business logic within a job flow to leverage the same sort order, partitioning, and groupings.
o If data has already been partitioned and sorted on a set of key columns, specifying the "don't sort, previously sorted" option for those key columns in the Sort stage reduces the cost of sorting and takes greater advantage of pipeline parallelism.
o When writing to parallel Data Sets, sort order and partitioning are preserved. When reading from these Data Sets, try to maintain this sorting, if possible, by using SAME partitioning.
o The stable sort option is much more expensive than non-stable sorts, and should only be used if there is a need to maintain an implied (that is, not explicitly stated in the sort keys) row order.
o The performance of individual sorts can be improved by increasing the memory usage per partition using the "Restrict Memory Usage (MB)" option of the standalone Sort stage. The default setting is 20 MB per partition. In addition, the environment variable APT_TSORT_STRESS_BLOCKSIZE can be used to set (in units of MB) the size of the RAM buffer for all sorts in a job, even those that have the "Restrict Memory Usage" option set.

12.2 Understanding Operator Combination
At runtime, DataStage Enterprise Edition analyzes a given job design and uses the parallel configuration file to build a job score, which defines the processes and the connection topology (Data Sets) between them used to execute the job logic. When composing the score, Enterprise Edition attempts to reduce the number of processes by combining the logic from two or more stages (operators) into a single process (per partition). Combined operators are generally adjacent to each other in a data flow.

The purpose behind operator combination is to reduce the overhead associated with an increased process count. If two processes are interdependent (one processes the other's output) and they are both CPU-bound or I/O-bound, there is nothing to be gained from pipeline partitioning (5). As with any example, your results should be tested in your environment.

(5) One exception to this guideline is when operator combination generates too few processes to keep the processors busy. In these configurations, disabling operator combination allows CPU activity to be spread across multiple processors instead of being constrained to a single processor.

However, the assumptions used by the Enterprise Edition optimizer to determine which stages can be combined may not always be the most efficient. It is for this reason that combination can be enabled or disabled on a per-stage basis, or globally. There are two ways to affect operator combination:
o The environment variable APT_DISABLE_COMBINATION disables ALL combination in the entire data flow; in general, this is only recommended on pre-7.0 versions of DS/EE.
o Within Designer, combination can be set on a per-stage basis (on the Stage/Advanced tab).

When deciding which operators to include in a particular combined operator (also known as a Combined Operator Controller), Enterprise Edition is "greedy": it includes all operators that meet the following rules:
o must be contiguous
o must have the same degree of parallelism
o must be combinable; here is a partial list of non-combinable operators:
  - Join
  - Aggregator
  - Remove Duplicates
  - Merge
  - BufferOp
  - Funnel
  - DB2 Enterprise stage
  - Oracle Enterprise stage
  - ODBC Enterprise stage
  - BuildOps

In general, it is best to let DS/EE decide what to combine and what to leave uncombined. However, when other performance tuning measures have been applied and still greater performance is needed, tuning combination might yield additional performance benefits.

The job score identifies which components are combined. (For information on interpreting a job score dump, see 12.4.2 Understanding the Parallel Job Score.) In addition, if the "%CPU" column is displayed in a Job Monitor window in Director, combined stages are indicated by parentheses surrounding the %CPU, as shown in the following illustration:

Figure 53: CPU-bound combined process in Job Monitor
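For completeness, here is a small sketch of the global switch mentioned above (per-stage control remains on the Stage/Advanced tab):

    # Disable ALL operator combination for a diagnostic run
    # (generally only recommended on pre-7.0 releases, per the guidance above)
    export APT_DISABLE_COMBINATION=True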

Choosing which operators to disallow combination for is as much art as science. In general, it is good to separate I/O-heavy operators (Sequential File, Full Sorts, etc.) from CPU-heavy operators (Transformer, Change Capture, etc.). For example, if you have several transformers and database operators combined with an output Sequential File, it might be a good idea to set the Sequential File to be non-combinable. This will prevent I/O requests from waiting on CPU to become available, and vice versa.

In fact, in the above job design, the I/O-intensive FileSet is combined with a CPU-intensive Transformer. Disabling combination with the Transformer enables pipeline partitioning, and improves performance, as shown in this subsequent Job Monitor for the same job:

Figure 54: Throughput in Job Monitor after disabling combination

12.3 Minimizing Runtime Processes and Resource Requirements
The architecture of Enterprise Edition is well suited for processing massive volumes of data in parallel across available resources. Toward that end, DS/EE executes a given job across the resources defined in the specified configuration file. There are times, however, when it is appropriate to minimize the resource requirements for a given scenario, for example:
• Batch jobs that process a small volume of data
• Real-time jobs that process data in small message units
• Environments running a large number of jobs simultaneously on the same server(s)
In these instances, a single-node configuration file is often appropriate to minimize job startup time and resource requirements without significantly impacting overall performance.

There are many factors that can reduce the number of processes generated at runtime:
• Use a single-node configuration file
• Remove ALL partitioners and collectors (especially when using a single-node configuration file)
• Enable runtime column propagation on Copy stages with only one input and one output
• Minimize join structures (any stage with more than one input, such as Join, Lookup, Merge, Funnel)

• Minimize non-combinable stages (as outlined in the previous section) such as Join, Aggregator, Remove Duplicates, Merge, Funnel, ODBC Enterprise, Oracle Enterprise, DB2 Enterprise, BuildOps, and BufferOp
• Selectively (being careful to avoid deadlocks) disable buffering. (Buffering is discussed in more detail in the following section.)

12.4 Understanding Buffering
There are two types of buffering in Enterprise Edition: 'inter-operator transport' and 'deadlock prevention'.

12.4.1 Inter-Operator Transport Buffering
Though it may look like it from the performance statistics, and although documentation might discuss 'record streaming', records do not, strictly speaking, stream from one stage to another. They are actually transferred in blocks (just like on old magnetic tapes) called "Transport Blocks". Each pair of operators that have a producer/consumer relationship will share at least 2 of these blocks. The first block is used by the upstream/producer stage to output data it is done with; the second block is used by the downstream/consumer stage to obtain data that is ready for the next processing step. Once the upstream block is full and the downstream block is empty, the blocks are swapped and the process begins again.

This type of buffering (or 'Record Blocking') is rarely tuned. It usually only comes into play when the size of a single record exceeds the default size of the transport block. If necessary, setting APT_DEFAULT_TRANSPORT_BLOCK_SIZE to a multiple of (or equal to) the record size will resolve the problem. Remember, there are 2 of these transport blocks for each partition of each link, so setting this value too high will result in a large amount of memory consumption.

The behavior of these transport blocks is determined by these environment variables:
• APT_DEFAULT_TRANSPORT_BLOCK_SIZE
o Specifies the default block size for transferring data between players. The default value is 8192, with a valid value range between 8192 and 1048576. The value provided by a user for this variable is rounded up to the operating system's nearest page size.
• APT_MIN_TRANSPORT_BLOCK_SIZE
o Specifies the minimum allowable block size for transferring data between players. Cannot be less than 8192; cannot be greater than 1048576. Default is 8192. This variable is only meaningful when used in combination with APT_LATENCY_COEFFICIENT, APT_AUTO_TRANSPORT_BLOCK_SIZE and APT_MAX_TRANSPORT_BLOCK_SIZE.

• APT_MAX_TRANSPORT_BLOCK_SIZE
o Specifies the maximum allowable block size for transferring data between players. Cannot be less than 8192; cannot be greater than 1048576. Default is 1048576. This variable is only meaningful when used in combination with APT_LATENCY_COEFFICIENT, APT_AUTO_TRANSPORT_BLOCK_SIZE and APT_MIN_TRANSPORT_BLOCK_SIZE.
• APT_AUTO_TRANSPORT_BLOCK_SIZE
o If set, the framework calculates the block size for transferring data between players according to this algorithm:
    if (recordSize * APT_LATENCY_COEFFICIENT < APT_MIN_TRANSPORT_BLOCK_SIZE)
        then blockSize = APT_MIN_TRANSPORT_BLOCK_SIZE
    else if (recordSize * APT_LATENCY_COEFFICIENT > APT_MAX_TRANSPORT_BLOCK_SIZE)
        then blockSize = APT_MAX_TRANSPORT_BLOCK_SIZE
    else blockSize = recordSize * APT_LATENCY_COEFFICIENT
• APT_LATENCY_COEFFICIENT
o Specifies the number of records to be written to each transport block.
• APT_SHARED_MEMORY_BUFFERS
o Specifies the number of Transport Blocks between a pair of operators; must be at least 2.

NOTE: The environment variables APT_MIN/MAX_TRANSPORT_BLOCK_SIZE, APT_LATENCY_COEFFICIENT, and APT_AUTO_TRANSPORT_BLOCK_SIZE are used only with fixed-length records.

12.4.2 Deadlock Prevention Buffering
The other type of buffering, "Deadlock Prevention", comes into play any time there is a Fork-Join structure in a job. Here is an example job fragment:

Figure 55: Fork-Join example

In this example, the Transformer creates a fork with 2 parallel Aggregators, which go into an Inner Join. Note, however, that "Fork-Join" is a graphical description; it doesn't necessarily have to involve a Join stage.

To understand deadlock-prevention buffering, it is important to understand the operation of a parallel pipeline. Imagine that the Transformer is waiting to write to Aggregator1, Aggregator1 is waiting to write to the Join, Join is waiting to read from Aggregator2, and Aggregator2 is waiting to read from the Transformer. Like this:

[Diagram: a circular dependency; Transformer waiting to write to Aggregator1 (queued write), Aggregator1 waiting to write to Join (queued write), Join waiting to read from Aggregator2 (queued read), Aggregator2 waiting to read from Transformer (queued read). Here the arrows represent dependency direction, instead of data flow.]

Without deadlock buffering, this scenario would create a circular dependency where Transformer is waiting on Aggregator1, Aggregator1 is waiting on Join, Join is waiting on Aggregator2, and Aggregator2 is waiting on Transformer. Without deadlock buffering, the job would deadlock, bringing processing to a halt (though the job does not stop running, it would eventually time out).

To guarantee that this problem never happens in Enterprise Edition, there is a specialized operator called BufferOp. BufferOp is always ready to read or write and will not allow a read/write request to be queued. It is placed on all inputs to a join structure (again, not necessarily a Join stage) by Enterprise Edition during job startup. So the above job structure would be altered by the DS/EE engine to look like this:

[Diagram: the same flow with a BufferOp inserted on each input to the Join; the Transformer feeds Aggregator1 and Aggregator2, Aggregator1 feeds BufferOp1, Aggregator2 feeds BufferOp2, and both BufferOps feed the Join. Here the arrows now represent data flow, not dependency.]

Since BufferOp is always ready to read or write, Join cannot be 'stuck' waiting to read from either of its inputs, thus breaking the circular dependency and guaranteeing that no deadlock will occur.

BufferOps will also be placed on the input partitions to any sequential stage that is fed by a parallel stage, as these same types of circular dependencies can result from partition-wise Fork-Joins.

By default, BufferOps will allocate 3MB of memory each (remember that this is per operator, per partition). When that is full (because the upstream operator is still writing but the downstream operator isn't ready to accept that data yet), it will begin to flush data to the scratchdisk resources specified in the configuration file (detailed in Chapter 11, "The Parallel Engine Configuration File", of the DataStage Manager guide).

TIP: For very wide rows, it may be necessary to increase the default buffer size (APT_BUFFER_MAXIMUM_MEMORY) to hold more rows in memory.

The behavior of deadlock-prevention BufferOps can be tuned through these environment variables:
• APT_BUFFER_DISK_WRITE_INCREMENT
o Controls the "blocksize" written to disk as the memory buffer fills. Default is 1 MB. May not exceed 2/3 of APT_BUFFER_MAXIMUM_MEMORY.
• APT_BUFFER_FREE_RUN
o Maximum capacity of the buffer operator before it starts to offer resistance to incoming flow, expressed as a nonnegative (proper or improper) fraction of APT_BUFFER_MAXIMUM_MEMORY. Values greater than 1 indicate that the buffer

operator will free run (up to a point) even when it has to write data to disk.
• APT_BUFFER_MAXIMUM_MEMORY
o Maximum memory consumed by each buffer operator for data storage. Default is 3 MB.
• APT_BUFFERING_POLICY
o Specifies the buffering policy for the entire job. When it is not defined, or is defined to be the null string, the default buffering policy is AUTOMATIC_BUFFERING. Valid settings are:
  - AUTOMATIC_BUFFERING: buffer as necessary to prevent dataflow deadlocks
  - FORCE_BUFFERING: buffer all virtual Data Sets
  - NO_BUFFERING: inhibit buffering on all virtual Data Sets
WARNING: Inappropriately specifying NO_BUFFERING can cause dataflow deadlock during job execution; use of this setting is only recommended for advanced users. FORCE_BUFFERING can be used to reveal bottlenecks in a job design during development and performance tuning, but will almost certainly degrade performance and therefore shouldn't be used in production job runs.

Additionally, the buffer mode, buffer size, buffer free run, queue bound, and write increment can be set on a per-stage basis from the Input/Advanced tab of the stage properties, as shown in the illustration below:

Aside from ensuring that no deadlock occurs, BufferOps also have the effect of "smoothing out" production/consumption spikes. This allows the job to run at the highest rate possible even when a

downstream stage is ready for data at different times than when its upstream stage is ready to produce that data.

As implied above, when a buffer has consumed its RAM, it will ask the upstream stage to "slow down"; this is called "pushback". Because of this, if you do not have force buffering set and APT_BUFFER_FREE_RUN set to at least ~1000, you cannot determine that any one stage is waiting on any other stage, as some other stage far downstream could be responsible for cascading pushback all the way upstream to the place you are seeing the bottleneck.

When attempting to address these mismatches in production/consumption, setting the buffering policy to "FORCE_BUFFERING" will cause buffering to occur everywhere. By using the performance statistics in conjunction with this buffering, you may be able to identify points in the data flow where a downstream stage is waiting on an upstream stage to produce data. Each place may offer an opportunity for buffer tuning.

Stages upstream/downstream from high-latency stages (such as remote databases, NFS mount points for data storage, etc.) are a good place to start. If that doesn't yield enough of a performance boost (remember to test iteratively, while changing only 1 thing at a time), it is best to tune the buffers on a per-stage basis, instead of globally through environment variable settings.

Choosing which stages to tune buffering for and which to leave alone is as much art as science, and buffer tuning should be considered among the last resorts for performance tuning.
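As a concrete illustration of the sequence described above, the settings below show how these buffering variables might be combined during a tuning exercise. This is a sketch only: the values are arbitrary examples rather than recommendations, and in practice the per-stage Input/Advanced settings are preferred over these global variables.

    # 1) During development, force buffering everywhere with a large free run,
    #    so that pushback does not hide the true bottleneck.
    export APT_BUFFERING_POLICY=FORCE      # never use in production runs
    export APT_BUFFER_FREE_RUN=1000
    # 2) Once the slow link has been identified, remove the global settings and
    #    tune only the affected stage, for example by raising buffer memory for
    #    very wide rows (shown globally here only for brevity).
    export APT_BUFFER_MAXIMUM_MEMORY=8388608   # 8 MB per buffer, per partition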

Appendix A: Standard Practices Summary

This Appendix summarizes the Standard Practices recommendations outlined in this document, along with cross-references for more detail.

1. Standards
It is important to establish and follow consistent standards in:
• Directory structures for install and application support directories. An example directory naming structure is given in Section 2.1: Directory Structures.
• Naming conventions, especially for DataStage Project categories, stage names, and links. An example DataStage naming structure is given in Section 2.2: Naming Conventions.
All DataStage jobs should be documented with the Short Description field, as well as Annotation fields. See Section 2.3: Documentation and Annotation.
It is the DataStage developer's responsibility to make personal backups of their work on their local workstation, using the Manager DSX export capability. This can also be used for integration with source code control systems. See Section 2.4: Working with Source Code Control Systems.

2. Development Guidelines
Modular development techniques should be used to maximize re-use of DataStage jobs and components, as outlined in Section 3: Development Guidelines:
• Job parameterization allows a single job design to process similar logic instead of creating multiple copies of the same job. The Multiple-Instance job property allows multiple invocations of the same job to run simultaneously.
• A set of standard job parameters should be used in DataStage jobs for source and target database parameters (DSN, user, password, etc.) and directories where files are stored.
• Job Parameters should always be used for file paths, file names, database login settings, and so forth. The scope of a parameter is discussed further in Section 3.5: Job Parameters.
• To ease re-use, these standard parameters and settings should be made part of a Designer Job Template. DataStage Template jobs should be created with:
  o standard parameters (for example, source and target file paths, database login properties)
  o environment variables and their default settings (as outlined in Section 2.5.1: Environment Variable Settings)
  o annotation blocks
• Where possible, create re-usable components such as parallel shared containers to encapsulate frequently-used logic.
• Create a standard directory structure outside of the DataStage project directory for source and target files, intermediate work files, and so forth.

• Parallel Shared Containers should be used to encapsulate frequently-used logic, using RCP to maximize re-use.
• Standardized Error Handling routines should be followed to capture errors and rejects. Further details are provided in Section 3.7: Error and Reject Record Handling.

3. Component Usage
As discussed in Section 3.8: Component Usage, the following guidelines should be followed when constructing parallel jobs in DS/EE:
• Never use Server Edition components (BASIC Transformer, Server Shared Containers) within a parallel job. BASIC Routines are appropriate only for job control sequences.
• Always use parallel Data Sets for intermediate storage between jobs.
• Use the Copy stage as a placeholder for iterative design, and to facilitate default type conversions.
• Use the parallel Transformer stage (not the BASIC Transformer) instead of the Filter or Switch stages.
• Use BuildOp stages only when logic cannot be implemented in the parallel Transformer.

4. DataStage Data Types
Be aware of the mapping between DataStage (SQL) data types and the internal DS/EE data types, as outlined in Section 4: DataStage Data Types. Leverage default type conversions using the Copy stage or across the Output mapping tab of other stages.

5. Partitioning Data
Given the numerous options for keyless and keyed partitioning, the following objectives help to form a methodology for assigning partitioning:
Objective 1: Choose a partitioning method that gives close to an equal number of rows in each partition, while minimizing overhead. This ensures that the processing workload is evenly balanced, minimizing overall run time.
Objective 2: The partition method must match the business requirements and stage functional requirements, assigning related records to the same partition if required. Any stage that processes groups of related records (generally using one or more key columns) must be partitioned using a keyed partition method.

This includes, but is not limited to: Aggregator, Change Capture, Change Apply, Join, Merge, Remove Duplicates, and Sort stages. It may also be necessary for Transformers and BuildOps that process groups of related records.
Note that in satisfying the requirements of this second objective, it may not be possible to choose a partitioning method that gives close to an equal number of rows in each partition.
Objective 3: Unless partition distribution is highly skewed, minimize repartitioning, especially in cluster or Grid configurations. Repartitioning data in a cluster or Grid configuration incurs the overhead of network transport.
Objective 4: The partition method should not be overly complex. The simplest method that meets the above objectives will generally be the most efficient and yield the best performance.

Using the above objectives as a guide, the following methodology can be applied:
a) Start with Auto partitioning (the default)
b) Specify Hash partitioning for stages that require groups of related records
o Specify only the key column(s) that are necessary for correct grouping, as long as the number of unique values is sufficient
o Use Modulus partitioning if the grouping is on a single integer key column
o Use Range partitioning if the data is highly skewed and the key column values and distribution do not change significantly over time (the Range Map can be reused)
c) If grouping is not required, use Round Robin partitioning to redistribute data equally across all partitions
o Especially useful if the input Data Set is highly skewed or sequential
d) Use Same partitioning to optimize end-to-end partitioning and to minimize repartitioning
o Being mindful that Same partitioning retains the degree of parallelism of the upstream stage
o Within a flow, examine up-stream partitioning and sort order and attempt to preserve them for down-stream processing. This may require re-examining key column usage within stages and re-ordering stages within a flow (if business requirements permit).
o Across jobs, persistent Data Sets can be used to retain the partitioning and sort order. This is particularly useful if downstream jobs are run with the same degree of parallelism (configuration file) and require the same partition and sort order.
Further details on partitioning methods can be found in Section 5: Partitioning and Collecting.

6. Collecting Data
Given the options for collecting data into a sequential stream, the following guidelines form a methodology for choosing the appropriate collector type:
a) When output order does not matter, use the Auto collector (the default)

b) When the input Data Set has been sorted in parallel, use a Sort Merge collector to produce a single, globally sorted stream of rows
o When the input Data Set has been sorted in parallel and Range partitioned, the Ordered collector may be more efficient
c) Use a Round Robin collector to reconstruct rows in input order for round-robin partitioned input Data Sets, as long as the Data Set has not been repartitioned or reduced
Further details on collecting methods can be found in Section 5: Partitioning and Collecting.

7. Sorting
Using the rules and behavior outlined in Section 6: Sorting, the following methodology should be applied when sorting in a DataStage Enterprise Edition data flow:
a) Start with a link sort
b) Specify only necessary key column(s)
c) Don't use Stable Sort unless needed
d) Use a stand-alone Sort stage instead of a Link sort for options that are not available on a Link sort: Sort Key Mode, Create Key Change Column, Create Cluster Key Change Column, Output Statistics
o Always specify the "DataStage" Sort Utility for standalone Sort stages
o Use "Sort Key Mode=Don't Sort (Previously Sorted)" to resort a sub-grouping of a previously-sorted input Data Set
e) Be aware of automatically-inserted sorts
o Set $APT_SORT_INSERTION_CHECK_ONLY to verify, but not establish, the required sort order
f) Minimize the use of sorts within a job flow
g) To generate a single, sequential ordered result set, use a parallel Sort and a Sort Merge collector

8. Stage-Specific Guidelines
As discussed in Section 9.1.1: Transformer NULL Handling and Reject Link, precautions must be taken when using expressions or derivations on nullable columns within the parallel Transformer:
o Always convert nullable columns to in-band values before using them in an expression or derivation.
o Always place a reject link on a parallel Transformer to capture / audit possible rejects.

As discussed in Section 8.1: Lookup vs. Join vs. Merge, the Lookup stage is most appropriate when reference data is small enough to fit into available memory. If the Data Sets are larger than available memory resources, use the Join or Merge stage.

• Limit the use of database Sparse Lookups to scenarios where the number of input rows is significantly smaller (for example, 1:100 or more) than the number of reference rows, or when exception processing.
• Be particularly careful to observe the nullability properties for input links to any form of Outer Join. Even if the source data is not nullable, the non-key columns must be defined as nullable in the Join stage input in order to identify unmatched records. (See Section 9.1.2: Capturing Unmatched Records from a Join.)
• Use Hash method Aggregators only when the number of distinct key column values is small. A Sort method Aggregator should be used when the number of distinct key values is large or unknown.

9. Database Stage Guidelines
• Where possible, use the native parallel database stages for maximum performance and scalability, as discussed in Section 10.1: Database stage types. The native parallel database stages are: DB2/UDB Enterprise, Informix Enterprise, ODBC Enterprise, Oracle Enterprise, SQL Server Enterprise, and Teradata Enterprise.
• The ODBC Enterprise stage should only be used when a native parallel stage is not available for the given source or target database.
• When using Oracle, DB2, or Informix databases, use orchdbutil to properly import design metadata.
• Care must be taken to observe the data type mappings documented in Section 10: Database Stage Guidelines when designing a parallel job with DS/EE.
• If possible, use a SQL where clause to limit the number of rows sent to a DataStage job.
• Avoid the use of database stored procedures on a per-row basis within a high-volume data flow. For maximum scalability and parallel performance, it is best to implement business rules natively using DataStage parallel components.

10. Troubleshooting and Monitoring
• Always test DS/EE jobs with a parallel configuration file ($APT_CONFIG_FILE) that has two or more nodes in its default pool.
• Check the Director log for warnings, which may indicate an underlying problem or data type conversion issue. All warnings and failures should be addressed (and removed if possible) before deploying a DS/EE job.

• The environment variable $DS_PX_DEBUG can be used to capture all generated OSH, error, and warning messages from a running DS/EE job.
• Enable $APT_DUMP_SCORE by default, and examine the job score by following the guidelines outlined in Section 12.4.2: Understanding the Parallel Job Score.
• Set the environment variable $OSH_PRINT_SCHEMAS to capture the actual runtime schema to the Director log. Set $DS_PX_DEBUG if the schema record is too large to capture in a Director log entry.
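A minimal sketch of how these troubleshooting defaults might be established for a development environment follows. The file location, configuration file name, and values are hypothetical; in practice these are usually defined as project-level environment variable defaults in the Administrator client rather than in a shell script.

    # Hypothetical developer defaults (for example, appended to dsenv):
    export APT_CONFIG_FILE=/opt/IBM/DataStage/Configurations/dev_2node.apt  # 2+ nodes
    export APT_DUMP_SCORE=1        # write the job score to the Director log
    export OSH_PRINT_SCHEMAS=1     # log the actual runtime schemas
    # export DS_PX_DEBUG=1         # enable only when full debug capture is needed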

Appendix B: DataStage Naming Reference

Every name should be based on a three-part concept: Subject, Subject Modifier, Class Word, where the following frequently-used Class Words describe the object type, or the function the object performs:

Project Repository and Components: Development = <proj>Dev; Test = <proj>Test; Production = <proj>Prod; BuildOp = BdOp<name>; Parallel External Function = XFn<name>; Wrapper = Wrap<name>

Job Names and Properties: Extract Job = Src<job>; Load = Load<job>; Sequence = <job>_Seq; Parallel Shared Container = <job>Psc; Server Shared Container = <job>Ssc; Parameter = <name>Parm

Links (prefix with "lnk_"): Reference (Lookup) = Ref; Reject (Lookup, File, DB) = Rej; Message (Sequence) = Msg; Get (Shared Container) = Get; Put (Shared Container) = Put; Input = In; Output = Out; Delete = Del; Insert = Ins; Update = Upd

Data Store: Database = DB; Stored Procedure = SP; Table = Tab; View = View; Dimension = Dim; Fact = Fact; Source = Src; Target = Tgt

Development / Debug Stages: Column Generator = CGen; Head = Head; Peek = Peek; Row Generator = RGen; Sample = Smpl; Tail = Tail

File Stages: Sequential File = SF; Complex Flat File = CFF; File Set = FS; Parallel Data Set = DS; Lookup File Set = LFS; External Source = XSrc; External Target = XTgt; Parallel SAS Data Set = SASd

Processing Stages: Aggregator = Agg; Change Apply = ChAp; Change Capture = ChCp; Copy = Cp; Filter = Filt; Funnel = Funl; Join (Inner) = InJn; Join (Left Outer) = LOJn; Join (Right Outer) = ROJn; Join (Full Outer) = FOJn; Lookup = Lkp; Merge = Mrg; Modify = Mod; Pivot = Pivt; Remove Duplicates = RmDp; SAS processing = SASp; Sort = Srt; Surrogate Key Generator = SKey; Switch = Swch; Transformer (native parallel) = Tfm; BASIC Transformer (Server) = BTfm; Stage Variable = SV

Real Time Stages: RTI Input = RTIi; RTI Output = RTIo; XML Input = XMLi

Real Time Stages (continued): XML Output = XMLo; XML Transformer = XMLt

Restructure Stages: Column Export = CExp; Column Import = CImp
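To illustrate how the three-part concept and these Class Words combine, here are a few hypothetical names. The subjects are invented for the example, and the exact placement of the Class Word should follow the project's documented convention.

    lnk_Customer_Ref        reference (lookup) link carrying customer data
    lnk_Orders_Rej          reject link for order records
    SrcCustomerAccounts     extract job for a CustomerAccounts subject area
    CustomerDim_Seq         job sequence controlling the customer dimension load
    Cust_Order_InJn         Inner Join stage matching customers to orders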

Appendix C: Understanding the Parallel Job Score

Jobs developed in DataStage Enterprise Edition are independent of the actual hardware and degree of parallelism used to run the job. The parallel Configuration File provides a mapping at runtime between the compiled job and the actual runtime infrastructure and resources, by defining logical processing nodes.

At runtime, similar to the way a parallel database optimizer builds a query plan, Enterprise Edition uses the given job design and configuration file to compose a job score which details the processes created, the degree of parallelism and node (server) assignments, and the interconnects (Data Sets) between them. Where possible, multiple operators are combined within a single operating system process to improve performance and optimize resource requirements.

The DS/EE job score:
• Identifies the degree of parallelism and node assignment(s) for each operator
• Details mappings between functional (stage/operator) and actual operating system processes
• Includes operators automatically inserted at runtime:
  o Buffer operators to prevent deadlocks and optimize data flow rates between stages
  o Sorts and Partitioners that have been automatically inserted to ensure correct results
• Outlines the connection topology (Data Sets) between adjacent operators and/or persistent Data Sets
• Defines the number of actual operating system processes

1. Viewing the Job Score
When the environment variable APT_DUMP_SCORE is set, the job score is output to the DataStage Director log. It is recommended that this setting be enabled by default at the project level, as the job score offers invaluable data for debugging and performance tuning, and the overhead to capture the score is negligible.

Two separate scores are written to the log for each job run. The first score is from the license operator, not the actual job, and can be ignored. The second score entry is the actual job score. As shown in the illustration below, job score entries start with the phrase "main_program: This step has n datasets…"

or translated into any language in any form by any means without the written permission of IBM. Terminology in this section can be used to identify the type of partitioning or collecting that was used between operators. • Operators: starts with the words “It has n operators:” The second section details actual operators created to execute the job flow. All rights reserved. “node3”. (in this example: “node1”. stored in a retrieval system. The actual node names correspond to node names in the parallel configuration file. “node2”. there are two virtual Data Sets. and the degree of parallelism per operator o Node assignment for each operator. for a total of 9 operating system process. July 17. there are 3 operators. one running sequentially. as shown in the example on the right: • Data Sets: starts with the words “main_program: This step has n datasets:” The first section details all Data Sets. No part of this publication may be reproduced. including persistent (on disk) and virtual (in memory. 2006 149 of 179 Parallel Framework Red Book: Data Flow Job Design © 2006 IBM Information Integration Solutions. This includes: o Sequential or Parallel operation. Parallel Job Score Components The Enterprise Edition parallel job score is divided into two sections. In this example. .Information Integration Solutions Center of Excellence Actual job score 2. transcribed. “node4”). In this example. links between stages). transmitted. two running in parallel across 4 nodes.

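For reference, a configuration file that could produce four node names like those in this example might look like the following minimal sketch. The host name and resource paths are hypothetical; only the first two nodes are shown, and node3 and node4 would be declared the same way.

    {
      node "node1"
      {
        fastname "etlhost"
        pools ""
        resource disk "/ibm/ds/data" {pools ""}
        resource scratchdisk "/ibm/ds/scratch" {pools ""}
      }
      node "node2"
      {
        fastname "etlhost"
        pools ""
        resource disk "/ibm/ds/data" {pools ""}
        resource scratchdisk "/ibm/ds/scratch" {pools ""}
      }
    }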
Note that the number of virtual Data Sets and the degree of parallelism determine the amount of memory used by the inter-operator transport buffers. The memory used by deadlock-prevention BufferOps can be calculated based on the number of inserted BufferOps.

3. Job Score: Data Sets
The parallel pipeline architecture of DataStage Enterprise Edition passes data from upstream producers to downstream consumers through in-memory virtual Data Sets. Data Sets are identified in the first section of the parallel job score, with each Data Set identified by its number (starting at zero). In the above example, the first Data Set is identified as "ds0", and the next "ds1".

Within the Data Set definition, the upstream producer is identified first, followed by a notation to indicate the type of partitioning or collecting (if any), followed by the downstream consumer; that is, each entry reads: producer, partitioner, collector, consumer. Producers and consumers may be either persistent (on disk) Data Sets or parallel operators. Persistent Data Sets are identified by their Data Set name, while operators are identified by their operator number and name, corresponding to the lower section of the job score. The degree of parallelism is identified in brackets after the operator name, as illustrated in the example on the right. For example, operator zero (op0) is running sequentially, with 1 degree of parallelism [1p]; operator 1 (op1) is running in parallel with 4 degrees of parallelism [4p].

The notation between producer and consumer is used to report the type of partitioning or collecting (if any) that is applied. The partition type is associated with the first term, the collector type with the second. The symbol between the partition name and the collector name indicates:
  ->   Sequential producer to Sequential consumer
  <>   Sequential producer to Parallel consumer
  =>   Parallel producer to Parallel consumer (SAME partitioning)
  #>   Parallel producer to Parallel consumer (repartitioned, not SAME)
  >>   Parallel producer to Sequential consumer
  >    No producer or no consumer (typically, for persistent Data Sets)
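As a hypothetical illustration of this notation (the stage names are invented), a Data Set entry similar to the following would describe a four-way parallel producer whose output is hash repartitioned into a four-way parallel consumer. The hash partitioner appears with the producer, the collector (eCollectAny, meaning no specific collection order) with the consumer, and the "#>" symbol indicates a parallel-to-parallel repartition:

    ds0: {op0[4p] (parallel Tfm_Standardize)
          eOther(APT_HashPartitioner { key={ value=CustID }})#>eCollectAny
          op1[4p] (parallel Srt_ByCustID)}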

Finally, if the Preserve Partitioning flag has been set for a particular Data Set, the notation "[pp]" will appear in this section of the job score.

4. Job Score: Operators
The lower portion of the parallel job score details the mapping between stages and the actual processes generated at runtime. For each operator, this includes (as illustrated in the job score fragment on the right):
• the operator name (opn), numbered sequentially from zero (example: "op0")
• the degree of parallelism, within brackets (example: "[4p]")
• "sequential" or "parallel" execution mode
• the components of the operator, which:
  o typically correspond to the user-specified stage name in the Designer canvas
  o may also include combined operators (APT_CombinedOperatorController), which include logic from multiple stages in a single operator
  o may include "composite" operators (for example, Lookup)
  o may also include framework-inserted operators such as Buffers and Sorts

    op0[1p] {(sequential APT_CombinedOperatorController:
        (Row_Generator_0)
        (inserted tsort operator {key={value=LastName}, key={value=FirstName}})
      ) on nodes (
        node1[op0.p0]
      )}
    op1[4p] {(parallel inserted tsort operator {key={value=LastName}, key={value=FirstName}}(0))
      on nodes (
        node1[op1.p0]  node2[op1.p1]  node3[op1.p2]  node4[op1.p3]
      )}

Some stages are composite operators: to the DataStage developer, a composite operator appears to be a single stage on the design canvas, but internally a composite operator includes more than one function. For example, Lookup is a composite operator. It is composed of the following internal operators:
- APT_LUTCreateImpl: reads the reference data into memory
- APT_LUTProcessImpl: performs the actual lookup processing once the reference data has been loaded

At runtime, each individual component of a composite operator is represented as an individual operator in the job score, as shown in the following score fragment (the host names, such as ecc3671, are those of the example environment):

    op2[1p] {(parallel APT_LUTCreateImpl in Lookup_3)
      on nodes (
        ecc3671[op2.p0]
      )}
    op3[4p] {(parallel buffer(0))
      on nodes (
        ecc3671[op3.p0]  ecc3672[op3.p1]  ecc3673[op3.p2]  ecc3674[op3.p3]
      )}
    op4[4p] {(parallel APT_CombinedOperatorController:
        (APT_LUTProcessImpl in Lookup_3)
        (APT_TransformOperatorImplV0S7_cpLookupTest1_Transformer_7 in Transformer_7)
        (PeekNull)
      ) on nodes (
        ecc3671[op4.p0]  ecc3672[op4.p1]  ecc3673[op4.p2]  ecc3674[op4.p3]
      )}

Using this information together with the output from the $APT_PM_SHOW_PIDS environment variable, you can evaluate the memory used by a lookup. Since the entire structure needs to be loaded before actual lookup processing can begin, you can also determine the delay associated with loading the lookup structure.

In a similar way, a persistent Data Set defined to "Overwrite" an existing Data Set of the same name will have multiple entries in the job score, to:
- Delete Data Files
- Delete Descriptor File

    main_program: This step has 2 datasets:
    ds0: {op1[1p] (parallel delete data files in delete temp.ds)
          ->eCollectAny
          op2[1p] (sequential delete descriptor file in delete temp.ds)}
    ds1: {op0[1p] (sequential Row_Generator_0)
          -> temp.ds}
    It has 3 operators:
    op0[1p] {(sequential Row_Generator_0)
        on nodes (
          node1[op0.p0]
        )}
    op1[1p] {(parallel delete data files in delete temp.ds)
        on nodes (
          node1[op1.p0]
        )}
    op2[1p] {(sequential delete descriptor file in delete temp.ds)
        on nodes (
          node1[op2.p0]
        )}

Appendix D: Estimating the Size of a Parallel Data Set

For the advanced user, this Appendix provides a more accurate and detailed way to estimate the size of a parallel Data Set, based on the internal storage requirements for each data type:

    Data Type        Size
    Integer          4 bytes
    Small Integer    2 bytes
    Tiny Integer     1 byte
    Big Integer      8 bytes
    Decimal          (precision+1)/2, rounded up
    Float            8 bytes
    VarChar(n)       n + 4 bytes for non-NLS data; 2n + 4 bytes for NLS data (internally stored as UTF-16)
    Char(n)          n bytes for non-NLS data; 2n bytes for NLS data
    Time             4 bytes (8 bytes with microsecond resolution)
    Date             4 bytes
    Timestamp        8 bytes (12 bytes with microsecond resolution)

For the overall record width:
- add (# nullable fields)/8 for null indicators
- add one byte per column for field alignment (worst case is 3.5 bytes per field)

Using the internal DataStage Enterprise Edition C++ libraries, the method APT_Record::estimateFinalOutputSize() can give you an estimate for a given record schema, as can APT_Transfer::getTransferBufferSize(), if you have a transfer that transfers all fields from input to output.

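As a worked illustration using the sizes above (the schema is hypothetical): consider a record with an Integer, a Char(10) and a VarChar(50) (both non-NLS), and a Date, where only the VarChar is nullable.

    Integer                                    4 bytes
    Char(10), non-NLS                         10 bytes
    VarChar(50), non-NLS                 50 + 4 = 54 bytes
    Date                                       4 bytes
    Null indicators       1 nullable field / 8, rounded up to 1 byte
    Field alignment       roughly 1 byte per column x 4 columns = 4 bytes
    Estimated width       approximately 77 bytes per record

At that width, 10 million records would occupy roughly 770 million bytes of Data Set storage, before any file system overhead.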
Appendix E: Environment Variable Reference

This Appendix summarizes the environment variables mentioned throughout this document. These variables can be used on an as-needed basis to tune the performance of a particular job flow, to assist in debugging, or to change the default behavior of specific DataStage Enterprise Edition stages. An extensive list of environment variables is documented in the DataStage Parallel Job Advanced Developer's Guide.

NOTE: The environment variable settings in this Appendix are only examples. Set values that are optimal to your environment.

1. Job Design Environment Variables

$APT_STRING_PADCHAR [char]
Overrides the default pad character of 0x0 (ASCII null) used when EE extends, or pads, a variable-length string field to a fixed length (or a fixed length to a longer fixed length). See Section 4.1.2: Default and Explicit Type Conversions.

2. Sequential File Stage Environment Variables

$APT_EXPORT_FLUSH_COUNT [nrows]
Specifies how frequently (in rows) the Sequential File stage (export operator) flushes its internal buffer to disk. Setting this value to a low number (such as 1) is useful for realtime applications, but there is a small performance penalty from increased I/O.

$APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS 1 (DataStage v7.01 and later)
By default, imported string fields that exceed their maximum declared length are truncated. Setting this environment variable directs DataStage to reject Sequential File records with strings longer than their declared maximum column length.

$APT_IMPORT_BUFFER_SIZE / $APT_EXPORT_BUFFER_SIZE [Kbytes]
Defines the size of the I/O buffer for Sequential File reads (imports) and writes (exports), respectively. Default is 128 (128K), with a minimum of 8. Increasing these values on heavily-loaded file servers may improve performance.

$APT_CONSISTENT_BUFFERIO_SIZE [bytes]
In some disk array configurations, setting this variable to a value equal to the read / write size in bytes can improve performance of Sequential File import/export operations.

$APT_DELIMITED_READ_SIZE [bytes]
Specifies the number of bytes the Sequential File (import) stage reads ahead to get the next delimiter. The default is 500 bytes, but this can be set as low as 2 bytes. This setting should be set to a lower value when reading from streaming inputs (for example, socket or FIFO) to avoid blocking.

$APT_MAX_DELIMITED_READ_SIZE [bytes]
By default, Sequential File (import) will read ahead 500 bytes to get the next delimiter. If it is not found, the importer looks ahead 4*500=2000 (1500 more) bytes, and so on (4X) up to 100,000 bytes. This variable controls the upper bound, which is by default 100,000 bytes. When more than 500 bytes of read-ahead is desired, use this variable instead of APT_DELIMITED_READ_SIZE.

$APT_IMPORT_PATTERN_USES_FILESET [set]
When this environment variable is set (present in the environment), file pattern reads are done in parallel by dynamically building a File Set header based on the list of files that match the given expression. For disk configurations with multiple controllers and disks, this will significantly improve file pattern reads.

3. DB2 Environment Variables

$INSTHOME [path]
Specifies the DB2 install directory. This variable is usually set in a user's environment from .db2profile.

$APT_DB2INSTANCE_HOME [path]
Used as a backup for specifying the DB2 installation directory (if $INSTHOME is undefined).

$APT_DBNAME [database]
Specifies the name of the DB2 database for DB2/UDB Enterprise stages if the "Use Database Environment Variable" option is True. If $APT_DBNAME is not defined, $DB2DBDFT is used to find the database name.

$APT_RDBMS_COMMIT_ROWS [rows]
Specifies the number of records to insert between commits. The default value is 2000 per partition. Can also be specified with the "Row Commit Interval" stage input property.

$DS_ENABLE_RESERVED_CHAR_CONVERT 1
Allows DataStage plug-in stages to handle DB2 databases which use the special characters # and $ in column names.

4. Informix Environment Variables

$INFORMIXDIR [path]
Specifies the Informix install directory.

$INFORMIXSQLHOSTS [filepath]
Specifies the path to the Informix sqlhosts file.

$INFORMIXSERVER [name]
Specifies the name of the Informix server matching an entry in the sqlhosts file.

$APT_COMMIT_INTERVAL [rows]
Specifies the commit interval in rows for Informix HPL Loads. The default is 10000 per partition.

5. Oracle Environment Variables

$ORACLE_HOME [path]
Specifies the installation directory for the current Oracle instance. Normally set in a user's environment by Oracle scripts.

$ORACLE_SID [sid]
Specifies the Oracle service name, corresponding to a TNSNAMES entry.

$APT_ORAUPSERT_COMMIT_ROW_INTERVAL [num] / $APT_ORAUPSERT_COMMIT_TIME_INTERVAL [seconds]
These two environment variables work together to specify how often target rows are committed for target Oracle stages with Upsert method. Commits are made whenever the time interval period has passed or the row interval is reached, whichever comes first. By default, commits are made every 2 seconds or 5000 rows per partition.

$APT_ORACLE_LOAD_OPTIONS [SQL*Loader options]
Specifies Oracle SQL*Loader options used in a target Oracle stage with Load method. By default, this is set to OPTIONS(DIRECT=TRUE, PARALLEL=TRUE).

$APT_ORACLE_LOAD_DELIMITED [char] (DataStage 7.01 and later)
Specifies a field delimiter for target Oracle stages using the Load method. Setting this variable makes it possible to load fields with trailing or leading blank characters.

$APT_ORA_IGNORE_CONFIG_FILE_PARALLELISM 1
When set, a target Oracle stage with Load method will limit the number of players to the number of datafiles in the table's tablespace.

$APT_ORA_WRITE_FILES [filepath]
When set, the output of a target Oracle stage with Load method is written to files instead of invoking the Oracle SQL*Loader. The filepath specified by this environment variable specifies the file with the SQL*Loader commands. Useful in debugging Oracle SQL*Loader issues.

$DS_ENABLE_RESERVED_CHAR_CONVERT 1
Allows DataStage plug-in stages to handle Oracle databases which use the special characters # and $ in column names.

6. Teradata Environment Variables

$APT_TERA_SYNC_DATABASE [name]
Starting with v7, specifies the database used for the terasync table. (By default, EE uses the …)

$APT_TERA_SYNC_USER [user]
Starting with v7, specifies the user that creates and writes to the terasync table.

$APT_TER_SYNC_PASSWORD [password]
Specifies the password for the user identified by $APT_TERA_SYNC_USER.

$APT_TERA_64K_BUFFERS 1
Enables 64K buffer transfers (32K is the default). May improve performance depending on network configuration.

$APT_TERA_NO_ERR_CLEANUP 1
This environment variable is not recommended for general use. When set, it may assist in job debugging by preventing the removal of error tables and the partially written target table.

$APT_TERA_NO_PERM_CHECKS 1
Disables permission checking on Teradata system tables that must be readable during the Teradata Enterprise load process. This can be used to improve the startup time of the load.

7. Job Monitoring Environment Variables

$APT_MONITOR_TIME [seconds]
In v7 and later, specifies the time interval (in seconds) for generating job monitor information at runtime. To enable size-based job monitoring, unset this environment variable and set $APT_MONITOR_SIZE below.

$APT_MONITOR_SIZE [rows]
Determines the minimum number of records the job monitor reports. The default of 5000 records is usually too small. To minimize the number of messages during large job runs, set this to a higher value (for example, 1000000).

$APT_NO_JOBMON 1
Disables job monitoring completely. In rare instances, this may improve performance. In general, this should only be set on a per-job basis when attempting to resolve performance bottlenecks.

$APT_RECORD_COUNTS 1
Prints record counts in the job log as each operator completes processing. The count is per operator, per partition.


8. Performance Tuning Environment Variables

$APT_BUFFER_MAXIMUM_MEMORY [bytes] (example: 41903040)
Specifies the maximum amount of virtual memory, in bytes, used per buffer per partition. If not set, the default is 3MB (3145728). Setting this value higher will use more memory, depending on the job flow, but may improve performance.

$APT_BUFFER_FREE_RUN [number] (example: 1000)
Specifies how much of the available in-memory buffer to consume before the buffer offers resistance to any new data being written to it. If not set, the default is 0.5 (50% of $APT_BUFFER_MAXIMUM_MEMORY). If this value is greater than 1, the buffer operator will read $APT_BUFFER_FREE_RUN * $APT_BUFFER_MAXIMUM_MEMORY before offering resistance to new data. When this setting is greater than 1, buffer operators will spool data to disk (by default, scratch disk) after the $APT_BUFFER_MAXIMUM_MEMORY threshold. The maximum disk required will be $APT_BUFFER_FREE_RUN * (number of buffers) * $APT_BUFFER_MAXIMUM_MEMORY.

$APT_PERFORMANCE_DATA [directory path]
Enables capture of detailed, per-process performance data in an XML file in the specified directory. Unset this environment variable to disable.

$TMPDIR [path]
Defaults to /tmp. Used for miscellaneous internal temporary data, including FIFO queues and Transformer temporary storage. As a minor optimization, it may be best set to a filesystem outside of the DataStage install directory.
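As a worked illustration of the disk formula above (the figures are hypothetical): a job whose score shows 2 buffer operators, each running 4-way parallel, with $APT_BUFFER_FREE_RUN=1000 and the default 3 MB of maximum buffer memory, could in the worst case spool the following amount of data to scratch disk.

    buffer instances = 2 operators x 4 partitions = 8
    maximum disk     = APT_BUFFER_FREE_RUN x buffer instances x APT_BUFFER_MAXIMUM_MEMORY
                     = 1000 x 8 x 3 MB
                     = approximately 24 GB of scratch space

This is why large free-run values should only be used deliberately, with adequate scratch disk resources available.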

9. Job Flow Debugging Environment Variables

$OSH_PRINT_SCHEMAS 1
Outputs the actual schema definitions used by the DataStage EE framework at runtime in the DataStage log. This can be useful when determining if the actual runtime schema matches the expected job design table definitions.

$APT_DISABLE_COMBINATION 1
Disables operator combination for all stages in a job, forcing each EE operator into a separate process. While not normally needed in a job flow, this setting may help when debugging a job flow or investigating performance by isolating individual operators to separate processes. Note that disabling operator combination will generate more UNIX processes, and hence require more system resources (and memory). Disabling operator combination also disables internal optimizations for job efficiency and run times. (The Advanced Stage Properties editor in DataStage Designer v7.1 and later allows combination to be enabled and disabled on a per-stage basis.)

$APT_PM_PLAYER_TIMING 1
Prints detailed information in the job log for each operator, including CPU utilization and elapsed processing time.

$APT_PM_PLAYER_MEMORY 1
Prints detailed information in the job log for each operator when allocating additional heap memory.

$APT_BUFFERING_POLICY FORCE
Forces an internal buffer operator to be placed between every operator. Normally, the DataStage EE framework inserts buffer operators into a job flow at runtime to avoid deadlocks and improve performance. Using $APT_BUFFERING_POLICY=FORCE in combination with $APT_BUFFER_FREE_RUN effectively isolates each operator from slowing upstream production. Using the job monitor performance statistics, this can identify which part of a job flow is impacting overall performance. Setting $APT_BUFFERING_POLICY=FORCE is not recommended for production job runs.

$DS_PX_DEBUG 1
Set this environment variable to capture copies of the job score, generated osh, and internal Enterprise Edition log messages in a directory corresponding to the job name. This directory will be created in the "Debugging" sub-directory of the Project home directory on the DataStage server.

$APT_PM_STARTUP_CONCURRENCY 5
This environment variable should not normally need to be set. When trying to start very large jobs on heavily-loaded servers, lowering this number will limit the number of processes that are simultaneously created when a job is started.

$APT_PM_NODE_TIMEOUT [seconds]
For heavily loaded MPP or clustered environments, this variable determines the number of seconds the conductor node will wait for a successful startup from each section leader. The default is 30 seconds.


Appendix F: Sorting and Hashing Advanced Example

The standard recipe for using the 'Inter-Record Relationship Suite' (Sort, Join, Merge, RemoveDuplicates, and related stages) is: Hash and Sort/Join/Merge on exactly the same keys, in the same order. There is also an "advanced" rule:

a) Hash on any sub-set of the keys
b) Sort (join/etc.) on any super-set of the keys

This approach is guaranteed to work, but is frequently inefficient, as records are 'over-hashed' and 'over-partitioned'.

This Appendix contains descriptions of what happens "behind the scenes". It will be followed by a detailed example that discusses these ideas in much greater depth. If you have a lot of experience with hashing and sorting, this may be review for you. The second portion of this Appendix assumes you have read and thoroughly understand these concepts.

Sorting is rarely required by the business logic. In most cases, sorting is needed to satisfy an input requirement of a downstream stage: Join, Merge, and RemoveDuplicates, for example, all require sorted inputs. The reason for this requirement lies in the "light-weight" nature of these stages. Join, for example, only needs to see two records at a time—one from each input stream—to do its job.

Sort actually does two things:
(i) Groups rows that share the same values in key columns (forces related rows to be contiguous, a.k.a. key clustering or record adjacency)
(ii) Orders the clusters resulting from (i)

These operations take place in parallel, within each partition, not globally across all partitions. In summary:

a) Sort works within partitions, not globally across all partitions.
b) Hash gathers into the same partition, at the same time, rows from all partitions that share the same value in key columns. This creates partition-wise concurrency (a.k.a. partition-wise co-location): related rows are in the same partition, but other rows may separate them within that partition.
c) Sort clusters and orders.
d) Partitioners respect row order but split clusters.

Sorting is often overkill; key-clustering—(i) only—is sufficient in many cases. Remove Duplicates, for example, requires only (i): it will notice that two rows have identical values in the user-defined key column only if the two rows are contiguous, and when it completes processing all the rows in a key cluster, it does not care about the key value of the next cluster with respect to the current key value—in part because this stage takes only one input. This is an illustrative piece of information, but there is little you can do to take advantage of it, as there are no stages which guarantee key-clustering but do not perform a sort (some databases might be able to do key-clustering more cheaply than a sort; this is one instance where that might be handy).

Join and Merge, on the other hand, require both (i) and (ii). This is due, in part, to the fact that these stages take multiple input links—they only see two records at a time (see footnote i), one from each input. When they complete processing all the rows in a key cluster, they DO care about the key value of the next cluster. If the values on both inputs aren't ordered (as opposed to merely grouped/clustered, as for Remove Duplicates), Join/Merge can't effectively choose which input stream to advance to find subsequent key matches, so row order between the two inputs is obviously critical.

1. Inside a Partitioner

In Enterprise Edition, partitioners (except for SAME), like stages, work in parallel. As a rule, partitioners (and most other stages) do not gratuitously alter existing intra-partition row order: Enterprise Edition will not, for example, allow a row in a partition to jump ahead of another row in the same partition (see footnote ii). Partitioners do, however, reshuffle rows among partitions. Whenever you re-partition sorted data on multi-partitioned (i.e. non-sequential) inputs, any existing sort order is usually destroyed—see the example below. Therefore, a sort operation is needed even on previously sorted columns following any partitioner.

There is a component that will allow you to partition sorted data and achieve a sorted result: parallelsortmerge (PSM). Enterprise Edition itself normally manages the use of this component, but it can be invoked via the generic stage: to restore row-adjacency, follow the partitioner with a PSM, and your data will retain its previous sort order (see usage notes in footnote iii). Nonetheless, as you will see in this Appendix, there are more advanced methods to sort and partition that can leverage this capability and mitigate the cost of sorting vs. clustering.

Example: 6 rows in 2 partitions.

Before repartitioning (row order within each partition):
    P0: 2, 1, 3
    P1: 101, 102, 103   (P1 happens to be sorted)

After repartitioning (one possible outcome):
    P0: 2, 101, 3
    P1: 1, 103, 102

Note that '1' and '101' have switched partitions, and that neither partition has a sorted result despite P1 having a sorted input.

Hash partitioning behaves similarly with respect to ordering. Consider the following 6 rows (ID, First Name, Last Name, Street) in 2 partitions:

Before the partitioner:
    Partition 0: (3, Eve, Smith, Pine), (1, Orlando, Jones, Elm), (2, Adam, Smith, Pine)
    Partition 1: (10, Rose, Jones, Pine), (10, Boris, Smith, Walnut), (10, John, Zorn, Walnut)

After hash partitioning on Street (one possible outcome):
    Partition 0: (3, Eve, Smith, Pine), (10, Rose, Jones, Pine), (2, Adam, Smith, Pine)
    Partition 1: (10, Boris, Smith, Walnut), (10, John, Zorn, Walnut), (1, Orlando, Jones, Elm)

There is more than one way to correctly hash-partition any Data Set (see footnote 6). Another run could distribute the same key groups across the partitions differently, and running the same job with the same data but a different number of partitions would do so as well, while still keeping each key group together.

6 There is an exception to this rule: if your hash key has only one value, every row must land in the same partition, so only one outcome is possible.
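To make the hash behavior concrete, here is a small, framework-independent Python sketch (an illustration of the concept only, not DataStage code; the row values are the sample rows above). It shows that every row sharing a key value lands in the same partition, while the particular partition each key maps to—and therefore the interleaving seen downstream—is an implementation detail, which is why several outcomes are equally correct.

    from collections import defaultdict

    # Sample rows: (ID, First Name, Last Name, Street)
    rows = [
        (3, "Eve", "Smith", "Pine"),
        (1, "Orlando", "Jones", "Elm"),
        (2, "Adam", "Smith", "Pine"),
        (10, "Rose", "Jones", "Pine"),
        (10, "Boris", "Smith", "Walnut"),
        (10, "John", "Zorn", "Walnut"),
    ]

    def hash_partition(rows, key_index, num_partitions):
        """Assign each row to a partition based on the hash of its key column."""
        partitions = defaultdict(list)
        for row in rows:
            # Rows that share a key value always map to the same partition number.
            partitions[hash(row[key_index]) % num_partitions].append(row)
        return partitions

    # Partition on Street: all 'Pine' rows end up together, all 'Walnut' rows end
    # up together, and so on. Which partition number each street maps to is not
    # determined by the key values themselves (Python even salts hash() for strings
    # per run), so more than one assignment is a correct outcome.
    for pid, part in sorted(hash_partition(rows, key_index=3, num_partitions=2).items()):
        print(pid, part)

    # Hashing on a sub-set of a stage's grouping keys (the "advanced" rule above) is
    # still correct for that stage: all rows of any full-key group remain co-located,
    # because they necessarily agree on the sub-set of keys as well.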

2. Minimizing Record Movement for Maximizing Performance

Now that we have covered the basic rules and mechanics for hash-partitioning and sorting, let's look at how we can capitalize on these behaviors for performance benefits.

Scenario Description: Our customer is a national retail business with several hundred outlets nation-wide. They wish to determine the weighted average transaction amount per-item nation-wide, as well as the average transaction amount per-item, per store, for all stores in the nation, and append these values to the original data. This would make it possible to determine how well each store is doing in relation to the national averages and track these performance trends over time. There are many common extensions on gathering these kinds of sales metrics that take the following ideas and increase the scale of the problem at hand, thereby increasing the value of this exercise.

Here is our source data:

Data Set 1: 32 Rows

Store Location  Item ID  Transaction Date  Transaction Amt
1               1        2004-01-01        1
1               1        2004-01-02        2
1               1        2004-01-01        3
1               1        2004-01-03        5
1               1        2004-01-04        5
1               1        2004-01-02        54
1               1        2004-01-04        7
1               1        2004-01-03        8
1               2        2004-01-04        2
1               2        2004-01-03        3
1               2        2004-01-01        45
1               2        2004-01-04        65
1               2        2004-01-02        7
1               2        2004-01-02        85
1               2        2004-01-01        9
1               2        2004-01-03        98
2               1        2004-01-03        23
2               1        2004-01-01        3
2               1        2004-01-02        32
2               1        2004-01-04        45
2               1        2004-01-02        54
2               1        2004-01-03        56
2               1        2004-01-01        7
2               1        2004-01-04        8
2               2        2004-01-04        23
2               2        2004-01-01        45
2               2        2004-01-03        534
2               2        2004-01-02        6
2               2        2004-01-04        65
2               2        2004-01-02        7
2               2        2004-01-01        78
2               2        2004-01-03        87

The screen capture below shows how to implement the business logic in an efficient manner, taking advantage of Enterprise Edition's ability to analyze a job flow and insert sorts and partitioners in appropriate places. Notice there is only one sort and one repartition in the diagram (both on the output link of JoinSourceToAggregator_1).

NOTE: In this job, automatic sort insertion and automatic partition insertion must be turned on (see footnote 7). Here you want to let DS/EE choose where to insert sorts and partitioners for you, so you want to leave them enabled (the default).

The Aggregator stage NationalAverageItemTransactionAmt will aggregate the data on 'Item ID' and 'Transaction Date', calculate the average of the 'Transaction Amt' column, and place the results in a column named 'National Average Item Transaction Amt'. This is the nation-wide transaction average per item (weighted by transaction, not store). To do this, Enterprise Edition will hash-partition and sort on 'Item ID' and 'Transaction Date'.

The Aggregator stage StoreAverageItemTransactionAmt will aggregate the data on 'Store ID', 'Item ID', and 'Transaction Date', calculate the average of the 'Transaction Amt' column, and place the results in a column named 'Store Average Item Transaction Amt'. This is the per-store transaction average per item. Here, Enterprise Edition will hash-partition and sort on 'Store ID', 'Item ID', and 'Transaction Date'.

Since the aggregator reduces row count (to the group count), we will need to join each aggregator's output back to the original data in order to get the original row count—that is, the original data with the averages appended. This is done with JoinSourceToAggregator_1 and JoinSourceToAggregator_2.

7 To enable automatic sort insertion, ensure that the environment variable APT_NO_SORT_INSERTION is NOT defined in your environment. If you allow this environment variable to exist with ANY value, you will disable this facility. Likewise, to enable automatic partition insertion, ensure that the environment variable APT_NO_PART_INSERTION is NOT defined in your environment. If you allow this environment variable to exist with ANY value, you will disable this facility.
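For reference, the following is a small, serial Python sketch of the same business logic (an illustration only; it is not DataStage code, and it ignores partitioning and parallelism). It computes the two averages and appends them to every input row, which is what the two Aggregators and two Joins above accomplish.

    from collections import defaultdict

    # rows: (store_location, item_id, transaction_date, transaction_amt);
    # only the first few rows of Data Set 1 are shown here.
    rows = [
        (1, 1, "2004-01-01", 1),
        (1, 1, "2004-01-02", 2),
        (1, 1, "2004-01-01", 3),
        (2, 2, "2004-01-03", 87),
    ]

    def averages(rows, key_func):
        """Average of transaction_amt per group defined by key_func."""
        totals = defaultdict(lambda: [0.0, 0])
        for r in rows:
            t = totals[key_func(r)]
            t[0] += r[3]
            t[1] += 1
        return {k: s / n for k, (s, n) in totals.items()}

    # National (weighted-by-transaction) average per (item, date).
    national = averages(rows, lambda r: (r[1], r[2]))
    # Per-store average per (store, item, date).
    per_store = averages(rows, lambda r: (r[0], r[1], r[2]))

    # "Join" the aggregates back to the original rows to restore the original row count.
    result = [r + (national[(r[1], r[2])], per_store[(r[0], r[1], r[2])]) for r in rows]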

NOTE: For this scenario, you will need to set the Aggregator's "Method" to Sort, not Hash. The hash method only requires the input data to be hashed; however, it does not guarantee output order. It works by keeping running totals in memory for the aggregation for each output group; therefore, it consumes an amount of RAM proportionate to the number of output rows and the number of columns involved in the aggregation. The sort method requires the input data to be hashed and sorted; in return, it guarantees the output order to be sorted, since the result of each aggregation can be released for downstream processing as soon as the key change is detected.

JoinSourceToAggregator_2 produces the final result: the original input Data Set with two columns appended ('National Average Item Transaction Amt' and 'Store Average Item Transaction Amt'). The output Data Set should look something like this (a 3-node configuration file was used in this implementation):

Data Set 2: 32 Rows
[Output listing: PeekFinalOutput for Partition 0 and Partition 2, 16 rows each, with columns Store Location, Item ID, Transaction Date, Transaction Amt, National Average Item Transaction Amt, and Store Average Item Transaction Amt.]
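To illustrate the difference between the two Aggregator methods described in the NOTE above, here is a framework-independent Python sketch (a conceptual illustration, not the DataStage implementation): the hash method keeps a running total per output group in memory, while the sort method can release each group as soon as the key changes on already-sorted input.

    from collections import defaultdict
    from itertools import groupby

    def hash_method(rows, key_func, value_func):
        """Running totals per group: memory grows with the number of output groups,
        and output order is not guaranteed."""
        totals = defaultdict(lambda: [0.0, 0])
        for r in rows:
            t = totals[key_func(r)]
            t[0] += value_func(r)
            t[1] += 1
        for k, (s, n) in totals.items():
            yield k, s / n

    def sort_method(sorted_rows, key_func, value_func):
        """Input must already be grouped/sorted on the key: each group is released
        downstream as soon as its last row has been seen, so memory stays small and
        output is emitted in key order."""
        for k, group in groupby(sorted_rows, key=key_func):
            vals = [value_func(r) for r in group]
            yield k, sum(vals) / len(vals)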

Since both the Aggregator and Join expect the data to arrive hashed and sorted on the grouping key(s)—both operations that consume large amounts of CPU—a couple of questions arise with respect to efficiency:

- What is the minimum number of hash-partitioners needed to implement this solution correctly?
- What is the minimum number of sorts needed to implement this solution?
- What is the minimum number of times that sort will need to buffer the entire Data Set to disk to implement this solution?

Though running the job sequentially eliminates questions related to partitioners, even sequential job execution does not alter the answer for the sort-related questions, as only partition concurrency is affected by sequential execution (see footnote 8).

An examination of the job above would suggest: 6, 6, and 6. A deeper examination (of the score dump, appended to the end of this document for masochists; see footnote iv) might suggest: 4, 3, and 3. This is certainly an improvement on the previous answer. A much better answer is: 1, 3, and 1—a more efficient (see footnote v) solution (score dump also attached below; see footnote vi). Here's a screen shot of this solution:

8 Records cannot be adjacent if they are not in the same partition, i.e., record adjacency assumes partition concurrency.

NOTE: In this job, automatic sort insertion and automatic partition insertion must be turned off (see footnote vii).

In our initial copy stage (DistributeCopiesOfSourceData), we hash and sort on ItemID and TransactionDate only. Hashing on these fields will gather all unique combinations of ItemID and TransactionDate into the same partition. This combination of hash and sort adequately prepares the data for NationalAverageItemTransactionAmt. However, the data is not properly prepared for StoreAverageItemTransactionAmt.

What is wrong with the data? The sort order does not include the StoreLocation. This is a problem for StoreAverageItemTransactionAmt, as it expects all of the records for a particular StoreLocation/TransactionDate/ItemId combination to arrive on the same partition, in order, just as in the previous example. You may be wondering why the partitioning wasn't mentioned as part of the problem. This is because the data is already partitioned in a compatible manner for this aggregator: the 'advanced' rule for hash partitioning is that you may partition on any sub-set of the aggregation/join/sort/etc. keys (see footnote viii, which contains key concepts that this document addresses, but is a lengthy parenthetical statement that would interrupt the flow of the scenario discussion).

Therefore, we still need to fix the sort order. One would expect that we would need to sort on StoreLocation, TransactionDate, and ItemId, but we know that the data is already sorted on TransactionDate and ItemId. Sort offers an efficiency mode for pre-sorted data, but you must use the Sort stage to access it, as it isn't available on the link sort. The settings in the sort should look like this:

As you can see, we have instructed the sort stage that the data is already sorted on ItemID and TransactionDate (as always with sorting records, key order is very important), and that we only want to 'sub-sort' the data on StoreLocation (this option is only viable for situations where you need to maintain the sort order on the initial keys). This lets sort know that it only needs to gather all records with a unique combination of ItemID and TransactionDate in order to sort a batch of records, instead of buffering the entire Data Set. If the group size was only several hundred records, but the entire Data Set was 100 million records, this would save a tremendous amount of very expensive disk I/O, as sort can hold a few hundred records in memory in most cases (disk I/O is typically several orders of magnitude more costly than memory I/O, even for 'fast' disks).

Also worth noting here: because we already hashed the data on ItemID and TransactionDate, ALL extant values of the remaining columns are already in the same partition, which is what makes this sort possible without re-partitioning (which is also quite expensive, especially in MPP environments where repartitioning implies network I/O).

The previous two paragraphs contain two key concepts in Enterprise Edition (pun fully intended, however dreadful).

Getting back to the aggregators: the output of DistributeCopiesOfSourceData and NationalAverageItemTransactionAmt are already hashed and sorted on the keys needed to perform JoinSourceToAggregator_1, to append a column representing the national (weighted) average item transaction amount. This accomplishes the first goal.

The output of StoreAverageItemTransactionAmt contains the other column we need to append to our source rows. Since the aggregator does not need to disturb row-order (for pre-sorted data), the rows will come out in the same order they went in (different rows, granted, but the group keys will force the proper order). However, since we sub-sorted the data before this aggregator (unlike NationalAverageItemTransactionAmt), we will have to prep the output from the first join to account for the new row ordering of StoreAverageItemTransactionAmt.

This sort will look exactly like the other sort stage. Remember to disable the 'Stable Sort' option if you do not need it: it will try to maintain row order except as needed to perform the sort (useful for preserving previous sort orderings), it is enabled by default, and it is much more expensive than non-stable sorts.

Output from the above solution:

Data Set 3: 32 Rows
[Output listing: PeekFinalOutput for Partition 1 and Partition 2, 16 rows each, with columns Store Location, Item ID, Transaction Date, Transaction Amt, National Average Item Transaction Amt, and Store Average Item Transaction Amt.]

This solution produces the same result but is achieved with only one complete sort, a single partitioner, and two sub-sorts—a much more efficient solution for large data volumes.

Imagine a job with 100 million records as the input. With the initial solution, we had to sort (on disk) 300,000,000 records in addition to hashing 300,000,000 records. The second solution only sorts (on disk) 100,000,000 records and only hashes 100,000,000 records, a savings of 400,000,000 additional record movements—half of them involving disk I/O—for a 100 million record input volume. That is a LOT of saved processing power.

There is an even more efficient solution. It looks very similar to the first solution, but there is a critical difference.

NOTE: In this job, automatic sort insertion and automatic partition insertion must be turned off (see footnote vii), for the same reasons.

Looks a lot like solution 1, except without the sort on the output of JoinSourceToAggregator_1. The difference is on DistributeCopiesOfSourceData:

Here, we have chosen to use the StoreLocation column as a part of our sorting key, but NOT to use it for hashing. This is functionally equivalent to doing a sub-sort right before the StoreAverageItemTransactionAmt aggregator; however, it will not create additional processes to handle the records and re-order them. Also, the second sort on the output of JoinSourceToAggregator_1 is no longer needed, a significant savings. Comparing the efficiency of this solution with that of number two, we saved a sub-sort on 100 million records. This is a potentially huge savings on large data volumes (remember the previous example).

Here is the output from this version of the job:

Data Set 4: 32 Rows
[Output listing: PeekFinalOutput for Partition 1 and Partition 2, 16 rows each, with columns Store Location, Item ID, Transaction Date, Transaction Amt, National Average Item Transaction Amt, and Store Average Item Transaction Amt.]

5 180. transmitted. 2006 172 of 179 © 2006 IBM Information Integration Solutions.5 6. in addition to the heavy penalty paid in disk I/O for using a full sort.25 180.5 6.5 46 46 6.75 61.25 26. .25 44. All rights reserved.5 23 23 23 23 26.5 35.75 38.5 33.5 35.75 38. transcribed.5 50.5 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2004/01/02 2004/01/02 2004/01/02 2004/01/02 2004/01/03 2004/01/03 2004/01/03 2004/01/03 2004/01/02 2004/01/02 2004/01/02 2004/01/02 2004/01/03 2004/01/03 2004/01/03 2004/01/03 2 54 54 32 5 8 23 56 7 85 7 6 3 98 534 87 Finally.5 35.5 180. No part of this publication may be reproduced.5 44 44 PeekFinalOutput.Information Integration Solutions Center of Excellence 2 2 1 1 2 2 2 2 2 2 2 2 2004/01/01 2004/01/01 2004/01/04 2004/01/04 2004/01/04 2004/01/04 78 45 2 65 23 65 44.25 26.75 38.5 310. stored in a retrieval system. or translated into any language in any form by any means without the written permission of IBM.5 50.5 180. inhibits pipe-lining (by buffering large amounts of data to disk since it needs to see all data before it can determine the resulting sorted sequence)ix. by definition.25 26. sort. Partition 2: 16 Rows Store Location Item ID Transaction Date Transaction Amt National Average Item Transaction Amt 35.5 39.5 Store Average Item Transaction Amt 28 28 43 43 6.5 39. Here is a screen shot of a sort running on 40 million records: Parallel Framework Red Book: Data Flow Job Design July 17.25 38.5 310.5 33.

As you can see, although roughly 5 million records have entered the sort, no rows have left yet. This is because a standard sort requires all rows to be present in order to release the first row, requiring a large amount of scratch disk (see footnote x). This situation is analogous to all of the sorts in solution 1, the link sort in solution 2, and the link sort in solution 3.

Here is an example of a 'sub-sort':

Here, you can clearly see that a sub-sort does not inhibit pipe-lining—very nearly the same number of rows have entered and left the sort stage (and NO buffering is required to perform the sub-sort). This allows down-stream stages to be processing data during the sorting process, instead of waiting until all 40 million records have been sorted (in this instance, we are sub-sorting the data we sorted in the previous diagram).
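A small Python sketch of the sub-sort idea (an illustration under the assumption of in-memory Python iterables, not the DataStage Sort stage): because the input is already grouped and ordered on the leading keys, only one group at a time needs to be buffered while it is re-ordered on the additional key, so rows keep flowing downstream.

    from itertools import groupby

    def sub_sort(rows, presorted_key, extra_key):
        """Re-order rows on extra_key within each run of equal presorted_key values.
        Only the current group is held in memory, so output begins almost immediately."""
        for _, group in groupby(rows, key=presorted_key):
            yield from sorted(group, key=extra_key)

    # Example: rows are (store_location, item_id, transaction_date, transaction_amt),
    # already sorted on (item_id, transaction_date); add store_location to the order:
    # out = list(sub_sort(rows, lambda r: (r[1], r[2]), lambda r: r[0]))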

i This is an over-simplification; it is only true for cases where the key is unique. In other cases, Join needs to see all of the rows in the current cluster on at least one of the input links, otherwise a Cartesian product is impossible.

ii A common problem: suppose you have two (or more) datasets with differing partition counts and you wish to join/merge them. At least one of these datasets must be re-hashed, which would result in having to completely re-sort that dataset despite having a sorted version already. Another common problem: you need to hash and sort on columns A and C to implement the business logic in one section of the flow, but in another section you need to hash and sort on columns A and B. You could hash only on A, but suppose that A has too small a number of unique values (country codes, gender codes, race/gender/ethnicity codes are typical). This 'problem' is addressed by the parallelsortmerge component, which would allow you to combine other columns into your hash key to reduce data-skew, but not introduce superfluous sorts. A third, less common, problem: you created a fileset with 8 nodes, but the job that reads it only has 4 nodes. Normally EE would re-partition the data into 4 nodes and destroy your sort order; however, you can use the ParallelSortMerge stage to ensure that no matter the degree of parallelism of the writer and reader, the sort order will be preserved. There are other situations where this is valuable, but they are much less common.

iii ParallelSortMerge Operator Options:
-key -- specify a key field; 1 or more
    name -- input field name
    Sub-options for key:
        -ci -- case-insensitive comparison, optional
        -cs -- case-sensitive comparison, default (mutually exclusive: -ci, -cs)
        -asc or ascending -- ascending sort order, default
        -desc or descending -- descending sort order, optional (mutually exclusive: -asc, -desc)
        -nulls -- where null values should sort; value one of first, last; default=first, optional
        -param -- extra parameters for key; property=value pair(s), without curly braces, optional
-warnLevel -- queue length at which to issue a warning; integer, 0 or larger, 2147483647 or smaller; default=10000, optional
-doStats -- report statistics at the end of the run, optional
This operator may have the following inputs:
    -Sorted -- dataset to be resorted/merged, presorted; exactly one occurrence required
This operator may have the following outputs:
    -reSorted -- resorted dataset; exactly one occurrence required
<add example here of how psm works>

iv Dump score for solution 1
main_program: This step has 16 datasets:
ds0: {op0[3p] (parallel SourceData.DSLink2) eAny=>eCollectAny op1[3p] (parallel DistributeCopiesOfSourceDta)} ds1: {op1[3p] (parallel DistributeCopiesOfSourceDta) eOther(APT_HashPartitioner { key={ value=ItemID },
key={ value=TransactionDate } })#>eCollectAny op2[3p] (parallel APT_HashedGroup2Operator in NationalAverageItemTransactionAmt)} ds2: {op1[3p] (parallel DistributeCopiesOfSourceDta) eOther(APT_HashPartitioner { key={ value=ItemID }.

key={ value=StoreLocation.p0] node1[op1. subArgs={ cs } } })#>eCollectAny op10[3p] (parallel JoinSourceToAggregator_2.p2] )} op1[3p] {(parallel DistributeCopiesOfSourceDta) on nodes ( node1[op1. subArgs={ cs } }.DSLink18_Sort)} ds12: {op10[3p] (parallel JoinSourceToAggregator_2.DSLink18_Sort) [pp] eSame=>eCollectAny op11[3p] (parallel buffer(2))} ds13: {op11[3p] (parallel buffer(2)) [pp] eSame=>eCollectAny op13[3p] (parallel APT_JoinSubOperator in JoinSourceToAggregator_2)} ds14: {op12[3p] (parallel buffer(3)) [pp] eSame=>eCollectAny op13[3p] (parallel APT_JoinSubOperator in JoinSourceToAggregator_2)} ds15: {op13[3p] (parallel APT_JoinSubOperator in JoinSourceToAggregator_2) [pp] eSame=>eCollectAny op14[3p] (parallel PeekFinalOutput)} It has 15 operators: op0[3p] {(parallel SourceData.p1] node1[op1.[pp] eSame=>eCollectAny op6[3p] (parallel buffer(0))} ds7: {op5[3p] (parallel APT_TSortOperator(1)) [pp] eSame=>eCollectAny op7[3p] (parallel buffer(1))} ds8: {op6[3p] (parallel buffer(0)) [pp] eSame=>eCollectAny op9[3p] (parallel APT_JoinSubOperator in JoinSourceToAggregator_1)} ds9: {op7[3p] (parallel buffer(1)) [pp] eSame=>eCollectAny op9[3p] (parallel APT_JoinSubOperator in JoinSourceToAggregator_1)} ds10: {op8[3p] (parallel APT_TSortOperator(2)) [pp] eSame=>eCollectAny op12[3p] (parallel buffer(3))} ds11: {op9[3p] (parallel APT_JoinSubOperator in JoinSourceToAggregator_1) [pp] eOther(APT_HashPartitioner { key={ value=ItemID.p1] node3[op2.p0] node2[op4.p1] node1[op0.p2] )} op6[3p] {(parallel buffer(0)) .p1] node3[op5. key={ value=TransactionDate }.p2] )} op5[3p] {(parallel APT_TSortOperator(1)) on nodes ( node1[op5.p0] node2[op5.DSLink2) on nodes ( node1[op0.p2] )} op2[3p] {(parallel APT_HashedGroup2Operator in NationalAverageItemTransactionAmt) on nodes ( node1[op2.p0] node2[op2.p0] node1[op0.p2] )} op4[3p] {(parallel APT_TSortOperator(0)) on nodes ( node1[op4.p1] node3[op4.p2] )} op3[3p] {(parallel APT_HashedGroup2Operator in StoreAverageItemTransactionAmt) on nodes ( node1[op3.p0] node2[op3.p1] node3[op3.

p1] node3[op9.p2] )} op7[3p] {(parallel buffer(1)) on nodes ( node1[op7.p2] )} op13[3p] {(parallel APT_JoinSubOperator in JoinSourceToAggregator_2) on nodes ( node1[op13.DSLink2_Sort) [pp] eSame=>eCollectAny op2[3p] (parallel DistributeCopiesOfSourceDta)} ds2: {op2[3p] (parallel DistributeCopiesOfSourceDta) .p0] node2[op10. subArgs={ cs } }.p1] node3[op6. key={ value=TransactionDate } })#>eCollectAny op1[3p] (parallel DistributeCopiesOfSourceDta.p0] node2[op12.p2] )} op10[3p] {(parallel JoinSourceToAggregator_2.p2] )} op12[3p] {(parallel buffer(3)) on nodes ( node1[op12.p0] node2[op13.p1] node3[op8.p1] node3[op12.p2] )} op11[3p] {(parallel buffer(2)) on nodes ( node1[op11. Since moving records around takes CPU time and extra system calls.e.p1] node3[op11. fewer times.DSLink18_Sort) on nodes ( node1[op10.p2] )} op14[3p] {(parallel PeekFinalOutput) on nodes ( node1[op14. v Throughout this document the general meaning of the phrase ‘more efficient’ is fewer record movements--i.p2] )} op9[3p] {(parallel APT_JoinSubOperator in JoinSourceToAggregator_1) on nodes ( node1[op9.p2] )} op8[3p] {(parallel APT_TSortOperator(2)) on nodes ( node1[op8.p0] node2[op9. or order.DSLink2_Sort)} ds1: {op1[3p] (parallel DistributeCopiesOfSourceDta. a record changes partition. vi Dump Score for Solution 2 main_program: This step has 15 datasets: ds0: {op0[3p] (parallel SourceData.p1] node3[op7.p0] node2[op7.p1] node3[op13.p0] node2[op8.p0] node2[op14.p1] node3[op14.p0] node2[op6.p2] )} It runs 45 processes on 3 nodes.on nodes ( node1[op6. if you move records unnecessarily. your run time will be adversely affected.p0] node2[op11.p1] node3[op10.DSLink2) eOther(APT_HashPartitioner { key={ value=ItemID.

p1] node1[op0.p0] node2[op1.p2] )} op2[3p] {(parallel DistributeCopiesOfSourceDta) on nodes ( node1[op2.p2] )} op1[3p] {(parallel DistributeCopiesOfSourceDta.p1] node3[op5.p0] node2[op5.p1] node3[op3.p0] node2[op2.p2] )} op4[3p] {(parallel APT_SortedGroup2Operator in NationalAverageItemTransactionAmt) on nodes ( node1[op4.DSLink2) on nodes ( node1[op0.p1] node3[op2.p0] node2[op4.p0] node2[op3.p0] node1[op0.p1] node3[op1.p2] .p2] )} op5[3p] {(parallel APT_SortedGroup2Operator in StoreAverageItemTransactionAmt) on nodes ( node1[op5.[pp] eSame=>eCollectAny op3[3p] (parallel SubSortOnStoreLocation)} ds3: {op2[3p] (parallel DistributeCopiesOfSourceDta) [pp] eSame=>eCollectAny op4[3p] (parallel APT_SortedGroup2Operator in NationalAverageItemTransactionAmt)} ds4: {op2[3p] (parallel DistributeCopiesOfSourceDta) [pp] eSame=>eCollectAny op6[3p] (parallel buffer(0))} ds5: {op3[3p] (parallel SubSortOnStoreLocation) [pp] eSame=>eCollectAny op5[3p] (parallel APT_SortedGroup2Operator in StoreAverageItemTransactionAmt)} ds6: {op4[3p] (parallel APT_SortedGroup2Operator in NationalAverageItemTransactionAmt) [pp] eSame=>eCollectAny op7[3p] (parallel buffer(1))} ds7: {op5[3p] (parallel APT_SortedGroup2Operator in StoreAverageItemTransactionAmt) [pp] eSame=>eCollectAny op8[3p] (parallel buffer(2))} ds8: {op6[3p] (parallel buffer(0)) [pp] eSame=>eCollectAny op9[3p] (parallel APT_JoinSubOperator in JoinSourceToAggregator_1)} ds9: {op7[3p] (parallel buffer(1)) [pp] eSame=>eCollectAny op9[3p] (parallel APT_JoinSubOperator in JoinSourceToAggregator_1)} ds10: {op8[3p] (parallel buffer(2)) [pp] eSame=>eCollectAny op12[3p] (parallel APT_JoinSubOperator in JoinSourceToAggregator_2)} ds11: {op9[3p] (parallel APT_JoinSubOperator in JoinSourceToAggregator_1) [pp] eSame=>eCollectAny op10[3p] (parallel SubSortOnStoreLocation2)} ds12: {op10[3p] (parallel SubSortOnStoreLocation2) [pp] eSame=>eCollectAny op11[3p] (parallel buffer(3))} ds13: {op11[3p] (parallel buffer(3)) [pp] eSame=>eCollectAny op12[3p] (parallel APT_JoinSubOperator in JoinSourceToAggregator_2)} ds14: {op12[3p] (parallel APT_JoinSubOperator in JoinSourceToAggregator_2) [pp] eSame=>eCollectAny op13[3p] (parallel PeekFinalOutput)} It has 14 operators: op0[3p] {(parallel SourceData.DSLink2_Sort) on nodes ( node1[op1.p2] )} op3[3p] {(parallel SubSortOnStoreLocation) on nodes ( node1[op3.p1] node3[op4.

)}
op6[3p] {(parallel buffer(0)) on nodes ( node1[op6.p0] node2[op6.p1] node3[op6.p2] )}
op7[3p] {(parallel buffer(1)) on nodes ( node1[op7.p0] node2[op7.p1] node3[op7.p2] )}
op8[3p] {(parallel buffer(2)) on nodes ( node1[op8.p0] node2[op8.p1] node3[op8.p2] )}
op9[3p] {(parallel APT_JoinSubOperator in JoinSourceToAggregator_1) on nodes ( node1[op9.p0] node2[op9.p1] node3[op9.p2] )}
op10[3p] {(parallel SubSortOnStoreLocation2) on nodes ( node1[op10.p0] node2[op10.p1] node3[op10.p2] )}
op11[3p] {(parallel buffer(3)) on nodes ( node1[op11.p0] node2[op11.p1] node3[op11.p2] )}
op12[3p] {(parallel APT_JoinSubOperator in JoinSourceToAggregator_2) on nodes ( node1[op12.p0] node2[op12.p1] node3[op12.p2] )}
op13[3p] {(parallel PeekFinalOutput) on nodes ( node1[op13.p0] node2[op13.p1] node3[op13.p2] )}
It runs 42 processes on 3 nodes.

vii In this instance, you want auto insertion turned off b/c EE will see that you are 'missing' a sort/partitioner and insert one for you, thus introducing the inefficiencies we are trying to avoid.

viii To understand why this is true, look at this example. Here is my source data:

ColumnA  ColumnB  ColumnC
1        1        1
1        1        2
1        1        3
1        2        1
1        2        2
1        2        3
2        1        1
2        1        2
2        1        3
2        2        1
2        2        2
2        2        3
2        3        1
2        3        2
2        3        3

Here are the possible outcomes if I hash-partition on ColumnA and ColumnB:

Group 1: (1,1,1) (1,1,2) (1,1,3)
Group 2: (1,2,1) (1,2,2) (1,2,3)
Group 3: (2,1,1) (2,1,2) (2,1,3)
Group 4: (2,2,1) (2,2,2) (2,2,3)
Group 5: (2,3,1) (2,3,2) (2,3,3)

There must be exactly 5 groups identified by the hash algorithm b/c there are exactly 5 unique combinations of ColumnA and ColumnB. Any combination of these groups can be in any partition, regardless of the number of partitions: if you are running a job with 6 partitions, you could have ALL 5 groups sent to the same partition (this is unlikely, and the likelihood decreases with larger numbers of groups; in fact, the distribution of groups across partitions is nearly even for large numbers of groups). This does not mean that these groups will be in unique partitions: consider a job that only has 3 partitions.

Here are the possible outcomes if I hash-partition on ColumnA only:

Group 1: (1,1,1) (1,1,2) (1,1,3) (1,2,1) (1,2,2) (1,2,3)
Group 2: (2,1,1) (2,1,2) (2,1,3) (2,2,1) (2,2,2) (2,2,3) (2,3,1) (2,3,2) (2,3,3)

As you can see, there are only two groups by hashing on ColumnA only. So hashing on fewer columns resulted in fewer, larger groups. One effect is that if we wanted to aggregate on ColumnA and ColumnB, summing ColumnC, this grouping is still OK, b/c if all unique values of ColumnA are together, then all unique combinations of ColumnA and ColumnB are together, as well as all unique combinations of ColumnA, ColumnB, and ColumnC.

In the scenario that we are discussing in the main document, we want to reduce the number of times that we hash (b/c partitioning costs CPU time). We can do this by identifying the intersection of keys needed among all of the hash-partitioners and hashing only on those keys: TransactionDate and ItemId.

NOTE: if you take this to an extreme, you will get a very small number of groups, which will, effectively, reduce the parallelism of the job. In the above example, we would have only two groups; even if we ran the job 12-ways, we wouldn't see any improvements in performance over a 2-way job. You need to understand your data and make educated decisions about your hashing strategy.

ix This means that down-stream processes will be sitting idle until the sort is completed, consuming RAM and process space and offering nothing in return.

x This is a slight oversimplification. It is only true on a per-partition basis, not for the entire dataset.
