Front cover

IBM InfoSphere Advanced DataStage v8

(Course code KM400)

Student Notebook
ERC 1.0

Trademarks
IBM® and the IBM logo are registered trademarks of International Business Machines
Corporation.
The following are trademarks of International Business Machines Corporation, registered in
many jurisdictions worldwide:
DataStage®, DB2®, Informix®, InfoSphere™
Windows is a trademark of Microsoft Corporation in the United States, other countries, or
both.
UNIX is a registered trademark of The Open Group in the United States and other
countries.
Other product and service names might be trademarks of IBM or other companies.

July 2011 edition


The information contained in this document has not been submitted to any formal IBM test and is distributed on an “as is” basis without any warranty either express or implied. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer’s ability to evaluate and integrate them into the customer’s operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk.

© Copyright International Business Machines Corporation 2005, 2011.


This document may not be reproduced in whole or in part without the prior written permission of IBM.
Note to U.S. Government Users — Documentation related to restricted rights — Use, duplication or disclosure is subject to restrictions
set forth in GSA ADP Schedule Contract with IBM Corp.

Contents
Course description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv

Agenda . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii

Unit 0. IBM InfoSphere Advanced DataStage v8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 0-1


Course objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 0-2
Agenda . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 0-3
Agenda . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 0-4
Introductions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 0-5

Unit 1. Introduction to the Parallel Framework Architecture . . . . . . . . . . . . . . . . . . 1-1


Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-2
Why study the parallel architecture? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-3
What we need to master . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-4
DataStage parallel job documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-5
Key parallel concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-6
Scalable hardware environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-7
Drawbacks of traditional batch processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-8
Pipeline parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-9
Partition parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-10
Partitioning illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-11
DataStage combines partitioning and pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . 1-12
Job design versus execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-13
Defining parallelism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-14
Configuration file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-15
Example configuration file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-16
Job Design Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-18
Generating mock data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-19
Job design for generating mock data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-20
Specifying the generating algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-21
Inside the Lookup stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-22
Configuration file displayed in job log . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-23
Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-24
Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-25
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-26

Unit 2. Compilation and Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-1


Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-2
Parallel Job Compilation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-3
Parallel job compilation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-4
Transformer job compilation notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-5
Generated OSH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-6
Stage to OSH operator mappings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-7
Generated OSH primer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-8


DataStage GUI versus OSH terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-9


Configuration File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-10
Configuration file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-11
Processing nodes (partitions) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-12
Configuration file format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-13
Node options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-14
Sample configuration file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-15
Resource pools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-16
Sorting resource pools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-17
Another configuration file example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-18
Constraining operators to specific node pools . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-19
Configuration Editor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-20
Configuration editor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-21
Parallel Runtime Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-22
Parallel Job startup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-23
Parallel job run time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-24
Viewing the job Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-25
Example job Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-26
Job execution: The orchestra metaphor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-27
Runtime control and data networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-28
Parallel data flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-29
Monitoring job startup and execution in the log . . . . . . . . . . . . . . . . . . . . . . . . . . 2-30
Counting the total number of processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-31
Parallel Job Design Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-32
Peeking at the data stream . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-33
Peeking at the data stream design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-34
Using Transformer stage variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-35
Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-36
Exercise 2 - Compilation and Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-37
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-38

Unit 3. Partitioning and Collecting Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-1


Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2
Partitioning and collecting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-3
Partitioning and collecting icons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-4
Partitioners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-5
Where partitioning is specified . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-6
The Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
Viewing the Score operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-8
Interpreting the Score partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-9
Score partitioning example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-10
Partition numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-11
Partitioning methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-12
Selecting a partitioning method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-13
Selecting a partitioning method, continued . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-14
Same partitioning algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-15
Caution regarding Same partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-16
Round Robin and Random . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-17


Parallel runtime example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-18


Entire partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-19
Hash partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-20
Unequal distribution example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-21
Modulus partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-22
Range partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-23
Using Range partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-24
Example partitioning icons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-25
Auto partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-26
Preserve partitioning flag . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-27
Partitioning strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-28
Partitioning strategy, continued . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-29
Collecting Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-30
Collectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-31
Specifying the collector method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-32
Collector methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-33
Sort Merge example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-34
Non-deterministic execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-35
Choosing a collector method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-36
Collector method versus Funnel stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-37
Parallel Job Design Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-38
Parallel number sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-39
Row Generator sequences of numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-40
Generated numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-41
Transformer example using @INROWNUM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-42
Transformer example using parallel variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-43
Header and detail processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-44
Job design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-45
Inside the Transformer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-46
Examining the Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-47
Difficulties with the design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-48
Examining the Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-49
Generating a header detail data file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-50
Inside the Column Export stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-51
Inside the Funnel stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-52
Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-53
Exercise 3 - Read data with multiple record formats . . . . . . . . . . . . . . . . . . . . . . . 3-54
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-55

Unit 4. Sorting Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-1


Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-2
Traditional (sequential) sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
Parallel sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-4
Example parallel sort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-5
Stages that require sorted data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-6
Parallel sorting methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-7
In-Stage sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-8
Sort stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-9


Stable sorts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-10


Resorting on sub-groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-11
Don’t sort (previously grouped) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-12
Partitioning and sort order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-13
Global sorting methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-14
Inserted tsorts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-15
Changing inserted tsort behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-16
Sort resource usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-17
Partition and sort keys . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-18
Optimizing job performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-19
Job Design Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-20
Fork join job example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-21
Fork join job design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-22
Examining the Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-23
Difficulties with the design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-24
Optimized solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-25
Score of optimized Job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-26
Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-27
Exercise 4 - Optimize a fork join job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-28
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-29

Unit 5. Buffering in Parallel Jobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-1


Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-2
Introducing the buffer operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-3
Identifying buffer operators in the Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-4
How buffer operators work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-5
Buffer flow control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-6
Buffer tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-7
Cautions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-8
Changing buffer settings in a job stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-9
Buffer resource usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-10
Buffering for group stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-11
Join stage internal buffering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-12
Avoiding buffer contention in fork-join jobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-13
Parallel Job Design Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-14
Revisiting the header detail job design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-15
Buffering solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-16
Redesigned header detail processing job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-17
Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-18
Exercise 5 - Optimize a fork join job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-19
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-20

Unit 6. Parallel Framework Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-1


Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-2
Data formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-3
Data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-4
Example schemas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-5
Type conversions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-6


Source to target type conversions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-7


Using Modify Stage For Type Conversions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-8
Processing external data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-9
Sequential file import conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-10
COBOL file import conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-11
Oracle automatic conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-12
Standard Framework data types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-13
Complex data types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-14
Schema with complex types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-15
Complex types column definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-16
Complex Flat File Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-17
Complex Flat File (CFF) stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-18
Sample COBOL copybook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-19
Importing a COBOL File Definition (CFD) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-20
COBOL table definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-21
COBOL file layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-22
Specifying a date mask . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-23
Example data file with multiple formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-24
Sample job With CFF Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-25
File options tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-26
Records tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-27
Record ID tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-28
Selection tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-29
Record options tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-30
Layout tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-31
View data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-32
Processing multi-format records . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-33
Transformer constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-34
Nullability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-35
Nullable data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-36
Null transfer rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-37
Nulls and sequential files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-38
Null field value examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-39
Viewing data with Null values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-40
Lookup stage and nullable columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-41
Default values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-42
Nullability in lookups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-43
Outer joins and nullable columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-44
Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-45
Exercise 6 - Test nullability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-46
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-47

Unit 7. Reusable components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-1


Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-2
Using Schema Files to Read Sequential Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-3
Schema file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-4
Creating a schema file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-5
Importing a schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-6


Creating a schema from a table definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-7


Reading a sequential file using a schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-8
Runtime Column Propagation (RCP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-9
Runtime Column Propagation (RCP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-10
Enabling Runtime Column Propagation (RCP) . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-11
Enabling RCP at Project Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-12
Enabling RCP at Job Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-13
Enabling RCP at Stage Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-14
When RCP is Disabled . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-15
When RCP is Enabled . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-16
Where do RCP columns come from? (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-17
Where do RCP columns come from? (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-18
Where do RCP columns come from? (3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-19
Shared Containers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-20
Shared containers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-21
Creating a shared container . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-22
Inside the shared container . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-23
Inside the shared container Transformer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-24
Using a shared container in a job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-25
Mapping input / output links to the container . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-26
Interfacing with the shared container . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-27
Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-28
Exercise 7 - Reusable components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-29
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-30

Unit 8. Advanced Transformer Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-1


Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-2
Transformer Null Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-3
Transformer legacy null handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-4
Legacy null processing example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-5
Inside the Transformer stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-6
Transformer stage properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-7
Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-8
Transformer non-legacy null handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-9
Transformer stage properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-10
Results with non-legacy null processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-11
Transformer Loop Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-12
Transformer loop processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-13
Repeating columns example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-14
Solution using multiple-output links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-15
Inside the Transformer stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-16
Limitations of the multiple output links solution . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-17
Loop processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-18
Creating the loop condition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-19
Loop variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-20
Repeating columns solution using a loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-21
Inside the Transformer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-22
Transformer Group Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-23


Transformer group processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-24


Building a Transformer group processing job (1) . . . . . . . . . . . . . . . . . . . . . . . . . 8-25
Building a Transformer group processing job (2) . . . . . . . . . . . . . . . . . . . . . . . . . 8-26
Group processing example job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-27
Transformer stage variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-28
Stage Variable Derivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-29
Specifying the Loop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-30
Runtime errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-31
Validating rows before saving them in the queue . . . . . . . . . . . . . . . . . . . . . . . . . 8-32
Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-33
Exercise 8 - Transformer Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-34
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-35

Unit 9. Extending the Functionality of Parallel Jobs. . . . . . . . . . . . . . . . . . . . . . . . . 9-1


Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-2
Ways of adding new functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-3
Wrapped Stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-4
Building Wrapped stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-5
Wrapped stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-6
Wrapped stage example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-7
Creating a Wrapped stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-8
Defining the Wrapped stage interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-9
Specifying Wrapped stage properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-10
Job with Wrapped stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-11
Exercise 9 - Wrapped stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-12
Build Stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-13
Build stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-14
Example job with Build stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-15
Creating a new Build stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-16
Build stage elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-17
Anatomy of a Build stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-18
Defining the input, output interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-20
Interface table definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-21
Specifying the input interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-22
Specifying the output interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-23
Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-24
Defining a transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-25
Anatomy of a transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-26
Defining stage properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-27
Specifying properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-28
Defining the Build stage logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-29
Definitions tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-30
Pre-Loop tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-31
Per-Record tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-32
Post-Loop tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-33
Writing to the job log . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-34
Using a Build stage in a job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-35
Stage properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-36


Build stages with multiple ports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-37


Build Macros . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-38
Build macros . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-39
Turning off auto read, write, and transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-40
Reading records using macros . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-41
APT Framework Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-42
APT framework and utility classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-43
Framework class sampler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-44
APT_String Build stage example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-45
Exercise 9 - Build stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-46
External Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-47
Parallel routines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-48
External function example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-49
Another external function example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-50
Creating an external function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-51
Defining the input arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-52
Calling the external function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-53
Exercise 9 - External Function Routines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-54
Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-55
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-56

Unit 10. Accessing Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-1


Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-2
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-3
Connector stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-4
Connector stage usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-5
Connector stage look and feel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-6
Connector stage GUI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-7
Connection properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-8
Usage properties - Generate SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-9
Deprecated stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-10
Database stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-11
Do it in DataStage or in the Database? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-12
Connector Stage Functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-13
Reading with Connector stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-14
Before/After SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-15
Sparse lookups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-16
Writing using Connector stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-17
Parameterizing the table action . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-18
Optimizing the insert/update performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-19
Commit interval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-20
Bulk load . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-21
Cleaning Up failed DB2 loads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-22
Error Handling in Connector stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-23
Error handling in Connector stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-24
Connector stage with reject link . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-25
Specifying reject conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-26
Added error code information examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-27


Multiple Input Links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-28


Multiple input links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-29
Inside the Connector - stage properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-30
Job Design Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-31
Data Connection Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-32
Standard insert plus update example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-33
Insert-Update Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-34
Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-35
Exercise 10 - Working with Connectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-36
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-37

Unit 11. Processing XML Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-1


Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-2
XML stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-3
Schema Library Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-4
Schema Library Manager window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-5
Schemas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-6
Schema file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-7
Composing XML Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-8
Composing XML data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-9
Compositional Job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-10
Inside the XML stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-11
Inside the Assembly editor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-12
Input step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-13
Composer step - XML Target tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-14
Composer step - XML Document Root tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-15
Composer step - Validation tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-16
Composer step - Mappings tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-17
XML file output . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-18
Parsing XML Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-19
Parsing XML data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-20
Parser step - XML Source tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-21
Parser step - Document Root tab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-22
Transforming XML Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-23
Transforming XML data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-24
Transformation Example - HJoin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-25
Editing the HJoin step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-26
Switch step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-27
Aggregate step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-28
Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-29
Exercise 11 - XML stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-30
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11-31

Unit 12. Slowly Changing Dimensions Stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-1


Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-2
Surrogate Key Generator Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-3
Surrogate Key Generator stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-4
Example job to create surrogate key state files . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-5


Editing the Surrogate Key Generator stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-6


Example job to update the surrogate key state file . . . . . . . . . . . . . . . . . . . . . . . . 12-7
Specifying the update information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-8
Slowly Changing Dimensions Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-9
Slowly Changing Dimensions stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-10
Star schema database structure and mappings . . . . . . . . . . . . . . . . . . . . . . . . . 12-11
Example Slowly Changing Dimensions (SCD) job . . . . . . . . . . . . . . . . . . . . . . . 12-13
Working in the SCD stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-14
Selecting the output link . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-15
Specifying the purpose codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-16
Surrogate key management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-17
Dimension update specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-18
Output mappings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-19
Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-20
Exercise 12 - Slowly Changing Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-21
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12-22

Unit 13. Best Practices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-1


Unit objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-2
Job Design Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-3
Overall job design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-4
Balancing performance with requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-5
Modular job design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-6
Establishing job boundaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-7
Use job sequences to combine job modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-8
Adding environment variables as job parameters . . . . . . . . . . . . . . . . . . . . . . . . . 13-9
Stage Usage Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-10
Reading sequential files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-11
Reading a sequential file in parallel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-12
Parallel file pattern I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-13
Partitioning and sequential files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-14
Other sequential file tips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-15
Buffering sequential file writes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-16
Lookup Stage Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-17
Lookup stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-18
Partitioning lookup reference data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-19
Lookup reference data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-20
Lookup file sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-21
Using Lookup File Set stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-22
Aggregator Stage Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-23
Aggregator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-24
Using Aggregator to sum all input rows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-25
Transformer Stage Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-26
Transformer performance guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-27
Transformer vs. other stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-28
Modify stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-29
Optimizing Transformer expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-30
Simplifying Transformer expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-31


Transformer stage compared with Build stage . . . . . . . . . . . . . . . . . . . . . . 13-32


Transformer decimal arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-33
Transformer decimal rounding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-34
Conditionally aborting a job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-35
Job Design Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-36
Summing all rows with Aggregator stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-37
Conditionally aborting the job . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-38
Checkpoint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-39
Exercise 13 - Best practices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-40
Unit summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13-41


Course description


IBM InfoSphere Advanced DataStage v8

Duration: 4 days

Purpose
This course is designed to introduce advanced job development
techniques in DataStage v8.5.

Audience
Experienced DataStage developers who seek training in more
advanced DataStage techniques and an understanding of the
parallel framework architecture.

Prerequisites
DataStage Essentials course or equivalent and at least one year of
experience developing parallel jobs using DataStage.

Objectives
After completing this course, you should be able to:
- Describe the parallel processing architecture and development
and runtime environments
- Describe the compile process and the runtime job execution
process
- Describe how partitioning and collection works in the parallel
framework
- Describe sorting and buffering in the parallel framework and
optimization techniques
- Describe and work with parallel framework data types
- Create reusable job components
- Use loop processing in a Transformer stage
- Process groups in a Transformer stage
- Extend the functionality of DataStage by building custom stages
and creating new Transformer functions

- Use Connector stages to read and write from relational tables
and handle errors in Connector stages
- Process XML data in DataStage jobs using the XML stage
- Design a job that processes a star schema database with Type
1 and Type 2 slowly changing dimensions
- List job and stage best practices


Agenda
Day 1
(00:30) Welcome
(01:35) Unit 1 - Introduction to the Parallel Framework Architecture
(02:10) Unit 2 - Compilation and Execution
(02:10) Unit 3 - Partitioning and Collecting Data

Day 2
(01:40) Unit 4 - Sorting Data
(01:00) Unit 5 - Buffering in Parallel Jobs
(02:00) Unit 6 - Parallel Framework Data Types
(01:45) Unit 7 - Reusable components

Day 3
(02:10) Unit 8 - Advanced Transformer Logic
(04:10) Unit 9 - Extending the Functionality of Parallel Jobs
(01:55) Unit 10 - Accessing Databases (start if there is time)

Day 4
(-------) Unit 10 - Accessing Databases, continued
(01:40) Unit 11 - Processing XML Data
(01:20) Unit 12 - Slowly Changing Dimensions Stages
(01:50) Unit 13 - Best Practices


Unit 0. IBM InfoSphere Advanced DataStage v8

What this unit is about


This unit describes the course objectives and agenda.


Course objectives
After completing this course, you should be able to:
• Describe the parallel processing architecture and development and runtime
environments
• Describe the compile process and the runtime job execution process
• Describe how partitioning and collection works in the parallel framework
• Describe sorting and buffering in the parallel framework and optimization techniques
• Describe and work with parallel framework data types
• Create reusable job components
• Use loop processing in a Transformer stage
• Process groups in a Transformer stage
• Extend the functionality of DataStage by building custom stages and creating new
Transformer functions
• Use Connector stages to read and write from relational tables and handle errors in
Connector stages
• Process XML data in DataStage jobs using the XML stage
• Design a job that processes a star schema database with Type 1 and Type 2 slowly
changing dimensions
• List job and stage best practices

© Copyright IBM Corporation 2011

Figure 0-1. Course objectives KM4001.0

Notes:


Agenda
Day 1
• Unit 1: Introduction to the Parallel Framework Architecture
– Exercise 1
• Unit 2: Compilation and Execution
– Exercise 2
• Unit 3: Partitioning and Collecting Data
– Exercise 3
Day 2
• Unit 4: Sorting Data
– Exercise 4
• Unit 5: Buffering in Parallel Jobs
– Exercise 5
• Unit 6: Parallel Framework Data Types
– Exercise 6
• Unit 7: Reusable components
– Exercise 7
© Copyright IBM Corporation 2011

Figure 0-2. Agenda KM4001.0

Notes:


Agenda
Day 3
• Unit 8: Advanced Transformer Logic
– Exercise 8
• Unit 9: Extending the Functionality of Parallel Jobs
– Exercise 9
• Unit 10: Accessing Databases (start)
Day 4
• Unit 10: Accessing Databases (finish)
– Exercise 10
• Unit 11: Processing XML Data
– Exercise 11
• Unit 12: Slowly Changing Dimensions Stages
– Exercise 12
• Unit 13: Best Practices
– Exercise 13
© Copyright IBM Corporation 2011

Figure 0-3. Agenda KM4001.0

Notes:


Introductions
• Name
• Company
• Where you live
• Your job role
• Current experience with products and technologies
in this course
– Database
– ETL tools
– DataStage
– Programming
• Do you meet the course prerequisites?
– DataStage Essentials course or equivalent
– 1 year experience using DataStage
• Class expectations
© Copyright IBM Corporation 2011

Figure 0-4. Introductions KM4001.0

Notes:


Unit 1. Introduction to the Parallel Framework Architecture

What this unit is about


This unit introduces you to parallelism and the parallel framework
environment. Later units will go into more detail.

What you should be able to do


After completing this unit, you should be able to:
• Describe the parallel processing architecture
• Describe pipeline and partition parallelism
• Describe the role of the configuration file
• Design a job that creates robust test data

How you will check your progress


• Checkpoint questions and lab exercises.


Unit objectives
After completing this unit, you should be able to:
• Describe the parallel processing architecture
• Describe pipeline and partition parallelism
• Describe the role of the configuration file
• Design a job that creates robust test data

© Copyright IBM Corporation 2006-2011

Figure 1-1. Unit objectives KM4001.0

Notes:


Why study the parallel architecture?

• DataStage Client is a productivity tool


– GUI design functionality is intended for fast development
– Not intended to mirror underlying architecture
• GUI depicts standard ETL process
– Parallelism is implemented under the covers
– GUI hides and in some cases distorts things
• For example, sorts, buffers, partitioning operators
• Sound, scalable designs require an
understanding of underlying architecture

© Copyright IBM Corporation 2006-2011

Figure 1-2. Why study the parallel architecture? KM4001.0

Notes:
Learning DataStage at the GUI job design level is not enough. In order to develop the
ability to design sound, scalable jobs, it is necessary to understand the underlying
architecture. This is because the DataStage client is primarily a productivity tool. It is not
intended to mirror underlying architecture.


What we need to master

• How the GUI job design gets executed


– What is generated from the GUI (OSH)
– How this is executed in the parallel framework
• How parallelism is implemented
– Pipeline parallelism
– Partition parallelism
• Role of the configuration file
• Score
• Development environment
– How to develop efficient, well-performing GUI job designs
– How to debug and change the GUI job design based on
the generated OSH and Score and messages in the job
log

© Copyright IBM Corporation 2006-2011

Figure 1-3. What we need to master KM4001.0

Notes:
To be able to design robust parallel jobs, we need to get behind and beyond the GUI. We
need to understand what gets generated from the GUI design and how this gets executed
by the parallel framework. We also need to be able to debug and modify our job designs
based on what we see happen at runtime.


DataStage parallel job documentation

• “Administrator Client Guide”


• “Director Client Guide”
• “Connectivity” guides
– Database connectivity information
• “Parallel Job Developer’s Guide”
– Parallelism
– Stages information
– Configuration file
• “Parallel Job Advanced Developer’s Guide”
– Environment variables
– Buffering
– Stage to operator mappings
– Performance
– Custom stages and functions
• “Custom Operator Reference”
– Coding custom operators
© Copyright IBM Corporation 2006-2011

Figure 1-4. DataStage parallel job documentation KM4001.0

Notes:
This slide lists and summarizes the main DataStage guides covering the material in this
course. DataStage documentation is installed during the DataStage client installation.


Key parallel concepts


• Parallel processing:
– Executing the job on multiple CPUs
• Scalable processing:
– Add more resources (CPUs and disks) to increase system
performance

(Diagram: six numbered processing nodes, each with disks.)
• Example system: 6 CPUs (processing nodes) and disks
• Scale up by adding more CPUs
• Add CPUs as individual nodes or to an SMP system

© Copyright IBM Corporation 2006-2011

Figure 1-5. Key parallel concepts KM4001.0

Notes:
Parallel processing is the key to building jobs that are highly scalable.
The parallel engine uses the processing node concept. Standalone processes, rather than
threads, are used. This process-based architecture is platform-independent and allows
greater scalability across resources within the processing pool.


Scalable hardware environments

• Single CPU
– Dedicated memory & disk
• SMP
– Multi-CPU (2-64+)
– Shared memory & disk
• GRID / Clusters
– Multiple, multi-CPU systems
– Dedicated memory per system
– Typically SAN-based shared storage
• MPP
– Multiple nodes with dedicated memory, storage
– 2 – 1000’s of CPUs
© Copyright IBM Corporation 2006-2011

Figure 1-6. Scalable hardware environments KM4001.0

Notes:
DataStage parallel jobs are designed to be platform-independent. A single job, if properly
designed, can run across resources within a single machine (SMP) or multiple machines
(cluster, GRID, or MPP architectures).
While DataStage can run on a single-CPU environment, it is designed to take advantage of
parallel platforms.


Drawbacks of traditional batch processing

• Poor utilization of resources


– Lots of idle processing time
– Lots of disk and I/O for staging
• Complex to manage
– Lots of small jobs
• Impractical with large data volumes
© Copyright IBM Corporation 2006-2011

Figure 1-7. Drawbacks of traditional batch processing KM4001.0

Notes:
Traditional batch processing consists of a distinct set of steps, defined by business
requirements. Between each step, intermediate results are written to disk.
This processing may exist outside of a database (using flat files for intermediate results) or
within a database (using SQL, stored procedures, and temporary tables).
There are several problems with this approach. First, each step must complete and write its
entire result set before the next step can begin. Second, landing intermediate results
incurs a large performance penalty through increased I/O; in this example, a single source
incurs 7 times the I/O to process. Third, with increased I/O requirements come increased
storage costs.


Pipeline parallelism

• Transform, enrich, load processes execute simultaneously


• Like a conveyor belt moving rows from process to process
– Start downstream process while upstream process is running
• Advantages:
– Reduces disk usage for staging areas
– Keeps processors busy
• Still has limits on scalability

© Copyright IBM Corporation 2006-2011

Figure 1-8. Pipeline parallelism KM4001.0

Notes:
In this diagram, the arrows represent rows of data flowing through the job. While earlier
rows are undergoing the Loading process, later rows are undergoing the Transform and
Enrich processes. In this way a number of rows (7 in the picture) are being processed in
parallel.
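To make the conveyor-belt picture concrete, here is a minimal sketch in plain Python (illustrative only, not DataStage code). Generator functions stand in for the operators: each row moves downstream as soon as it is produced, so the load step starts consuming rows while the extract step is still producing them, and nothing is staged to disk. All names are made up for the example.

def extract():
    # pretend these rows come from a source file
    for i in range(10):
        yield {"id": i}

def transform(rows):
    # transform each row as it arrives, not after all rows are read
    for row in rows:
        row["doubled"] = row["id"] * 2
        yield row

def enrich(rows):
    # enrich rows without waiting for the transform step to finish
    for row in rows:
        row["flag"] = row["id"] % 2 == 0
        yield row

def load(rows):
    # the final operator consumes rows as they flow off the "belt"
    for row in rows:
        print(row)

load(enrich(transform(extract())))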


Partition parallelism
• Divide the incoming stream of data into subsets to
be separately processed by an operation
– Subsets are called partitions
• Each partition of data is processed by the same
operation
– For example, if the operation is Filter, each partition will
run the Filter operation
• Facilitates near-linear scalability
– 8 times faster on 8 processors
– 24 times faster on 24 processors
– This assumes the data is evenly distributed

© Copyright IBM Corporation 2006-2011

Figure 1-9. Partition parallelism KM4001.0

Notes:
Partitioning breaks a data set into smaller sets. This is a key to scalability. However, the
data needs to be evenly distributed across the partitions; otherwise, the benefits of
partitioning are reduced.
It is important to note that what is done to each partition of data is the same. How the data
is processed or transformed is the same.
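The sketch below (plain Python, purely conceptual; the three-way round-robin split and the even-number filter are assumptions chosen for illustration) shows the essence of partition parallelism: the rows are divided into subsets, and the same operation runs on every subset with one worker per partition.

from multiprocessing import Pool

def operation(partition):
    # the same logic runs on every partition, here a simple Filter
    return [row for row in partition if row % 2 == 0]

def partition_round_robin(rows, n):
    # deal rows out to n partitions like cards
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

if __name__ == "__main__":
    data = list(range(100))
    partitions = partition_round_robin(data, 3)    # 3 "processing nodes"
    with Pool(3) as pool:
        results = pool.map(operation, partitions)  # same operation, in parallel
    print([len(r) for r in results])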


Partitioning illustration

(Diagram: incoming Data is partitioned into subset1, subset2, and subset3;
the same Operation runs on Node 1, Node 2, and Node 3, one per subset.)

• Here the data is partitioned into three subsets


• The same operation is performed on each partition of data
separately and in parallel
• If the data is evenly distributed, the data will be processed
roughly three times faster
© Copyright IBM Corporation 2006-2011

Figure 1-10. Partitioning illustration KM4001.0

Notes:
This diagram depicts how partition parallelism is implemented in DataStage. The data is
split into multiple data streams which are each processed separately by the same stage
operations.


DataStage combines partitioning and pipelining

– Within DataStage, pipelining, partitioning, and repartitioning are automatic


– Job developer only identifies:
• Sequential vs. parallel operations (by stage)
• Method of data partitioning
• Configuration file (which identifies resources)
• Advanced stage options (buffer tuning, operator combining, etc.)

© Copyright IBM Corporation 2006-2011

Figure 1-11. DataStage combines partitioning and pipelining KM4001.0

Notes:
By combining both pipelining and partitioning, DataStage creates jobs with higher volume
throughput.
The configuration file drives the parallelism by specifying the number of partitions.


Job design versus execution

User assembles the flow using DataStage Designer

… at runtime, this job runs in parallel for any configuration


(1 node, 4 nodes, N nodes)

No need to modify or recompile the job design!


© Copyright IBM Corporation 2006-2011

Figure 1-12. Job design versus execution KM4001.0

Notes:
Much of the parallel processing paradigm is hidden from the designer. The designer simply
designates the process flow, as shown in the upper portion of this diagram. The Parallel
engine, using definitions in a configuration file, will actually execute processes that are
partitioned and parallelized, as illustrated in the bottom portion.
A misleading feature of the lower diagram is that it makes it appear as if the data remains in
the same partitions through the duration of the job. In fact, partitioning and re-partitioning
occurs on a stage-by-stage basis. There will be times when the data moves from one
partition to another.


Defining parallelism
• Execution mode (sequential / parallel) is controlled by stage
definition and properties
– Default is parallel for most stages
– Can override default in most cases (Advanced Properties tab)
– By default, Sequential File stage runs in sequential mode
• Can run in parallel mode when using multiple readers
– By default, Sort stage (and most other stages) run in parallel mode
• Degree of parallelism is determined by configuration file the job is
running with
– Total number of logical nodes in the configuration file
• Assuming these nodes exist in available node pools

© Copyright IBM Corporation 2006-2011

Figure 1-13. Defining parallelism KM4001.0

Notes:
Stages run in two possible execution modes: sequential, parallel. The default is parallel for
most stages. For example, the Sequential File stage runs in sequential mode by default.
The Sort stage, and most other stages, run in parallel mode.
If a stage runs in sequential mode, it will run on only one of the available nodes specified
in the configuration file. If a stage runs in parallel mode, it can use all the available
nodes specified in the configuration file.
The Score provides this information.


Configuration file
• Configuration file separates configuration (hardware / software) from
job design
– Specified per job at runtime by $APT_CONFIG_FILE environment variable
– Optimizes overall throughput and matches job characteristics to overall
hardware resources
– Allows you to change hardware and resources without changing job design
• Defines number of nodes (logical processing units) with their
resources
– Need not match the number of physical CPUs
– Resources include dataset, scratch, buffer disk (file systems)
– Optional resources include database, SAS
– Resource usage can be optimized using “pools” (named subsets of nodes)
– Allows runtime constraints on resource usage on a per job basis
• Different configuration files can be used on different job runs
– Add $APT_CONFIG_FILE as a job parameter

© Copyright IBM Corporation 2006-2011

Figure 1-14. Configuration file KM4001.0

Notes:
The configuration file determines the degree of parallelism (number of partitions) of jobs
that use it. Each job runs under a configuration file, which is specified by the
$APT_CONFIG_FILE job parameter.
DataStage job runs can point to different configuration files by using job parameters. Thus,
a job can utilize different hardware architectures without being recompiled. It might pay,
for example, to run a 4-node configuration file on a 2-processor box if the job is
“resource bound,” because disk I/O can then be spread among more controllers.
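Since the degree of parallelism is simply the number of node definitions in the configuration file, it can be counted mechanically. The sketch below is a naive illustration in Python; the real configuration grammar is richer than this regex assumes.

import re

def degree_of_parallelism(config_text):
    # count node "name" { ... } clauses; one logical node per clause
    return len(re.findall(r'\bnode\s+"[^"]+"\s*{', config_text))

config = '''
{
  node "n1" { fastname "s1" pool "" resource disk "/orch/n1/d1" {} }
  node "n2" { fastname "s2" pool "" resource disk "/orch/n2/d1" {} }
}
'''
print(degree_of_parallelism(config))   # 2 -> jobs run two-way parallel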


Example configuration file


{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {"bigdata"}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource disk "/orch/n2/d2" {"bigdata"}
    resource scratchdisk "/temp" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1" {}
    resource scratchdisk "/temp" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1" {}
    resource scratchdisk "/temp" {}
  }
}

Key points:
1. Number of nodes defined
2. Resources assigned to each node. Their order is significant.
3. Nameless node pool (“”). Nodes in it are available to stages by default.

© Copyright IBM Corporation 2006-2011

Figure 1-15. Example configuration file KM4001.0

Notes:
This example shows a typical configuration file. Pools can be applied to nodes or other
resources. The curly braces following some disk resources specify the resource pools
associated with that resource. A node pool is simply a collection of nodes. The pools a
given node belongs to are listed after the key word ‘pool’ for the given node. A stage that is
constrained to use a particular named pool will run only on the nodes that are in that pool.
By default, all stages run on the nodes that are in the nameless pool (“”).
Following the keyword “node” is the name of the node (logical processing unit).
The order of resources is significant. The first disk is used before the second, and so on.
Keywords, such as “sort” and “bigdata”, when used, restrict the signified processes to the
use of the resources that are identified. For example, “sort” restricts sorting to node pools
and scratch disk resources labeled “sort”.
Database resources (not shown here) can also be created that restrict database access to
certain nodes.


Question: Can objects be constrained to specific CPUs? No, a request is made to the
operating system and the operating system chooses the CPU.


Job Design Examples

© Copyright IBM Corporation 2006-2011

Figure 1-16. Job Design Examples KM4001.0

Notes:


Generating mock data


• Row Generator stage
– Define columns in which to generate the data
– On the Extended Properties page, select algorithm for generating
values
• Different types have different algorithms available
• Lookups can be used to generate large amounts of robust
mock data
– Lookup tables map integers to values
– Column Generator columns generate integers to look up
– Cycling through integer sets can generate all possible
combinations

© Copyright IBM Corporation 2006-2011

Figure 1-17. Generating mock data KM4001.0

Notes:
Among its many uses, the Row Generator stage can be used to generate mock or test
data. When used with Lookup stages in a job, large amounts of robust mock data can be
generated.


Job design for generating mock data

(Diagram: a Row Generator stage feeding Lookup stages that reference lookup tables.)

© Copyright IBM Corporation 2006-2011

Figure 1-18. Job design for generating mock data KM4001.0

Notes:
In this job design, the Row Generator stage generates integers to look up. For different
columns, it cycles through integer sets, generating all possible combinations. The lookup
files map these integers to specific values. For example, FName maps different integer
values to first names. LName maps different integer values to last names. And so on.


Specifying the generating algorithm

(Screenshot callouts: Cycle through; Columns generating integers)

© Copyright IBM Corporation 2006-2011

Figure 1-19. Specifying the generating algorithm KM4001.0

Notes:
The number of values to cycle through should be different for each set of integers, so that
all possible combinations will be generated, for example:
000
111
220
301
010
121
200
Here the first column cycles through 0-3, the second 0-2, and the third 0-1.
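The following sketch reproduces this cycling in plain Python, together with the mapping that the Lookup stages perform. It is illustrative only: the lookup tables fname_lookup, lname_lookup, and gender_lookup are made up for the example. Note that the tuple of keys repeats with a period equal to the least common multiple of the cycle lengths, so cycle lengths that are pairwise coprime guarantee that every key combination eventually appears.

# lookup tables mapping generated integers to values (illustrative names)
fname_lookup = {0: "Ann", 1: "Bob", 2: "Carl", 3: "Dina"}   # cycle length 4
lname_lookup = {0: "Smith", 1: "Jones", 2: "Lee"}           # cycle length 3
gender_lookup = {0: "F", 1: "M"}                            # cycle length 2

for row in range(7):
    # each column cycles through its own range of integers
    k1, k2, k3 = row % 4, row % 3, row % 2
    print(k1, k2, k3, "->",
          fname_lookup[k1], lname_lookup[k2], gender_lookup[k3])
# The three key columns printed are exactly the sequences shown above:
# 000, 111, 220, 301, 010, 121, 200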


Inside the Lookup stage

(Screenshot callouts: Metadata for data file; Return mapped values)

© Copyright IBM Corporation 2006-2011

Figure 1-20. Inside the Lookup stage KM4001.0

Notes:
This shows the inside of the Lookup stage. Notice how integer columns (int1, int2, …) are
specified as keys into the lookup files. The values the keys are mapped to are returned in
the output.


Configuration file displayed in job log

(Screenshot callouts: Message displaying config file; First partition; Second partition)
© Copyright IBM Corporation 2006-2011

Figure 1-21. Configuration file displayed in job log KM4001.0

Notes:
The job log contains a lot of valuable information. One message displays the configuration
file the job is running under. This slide shows that message.


Checkpoint
1. What two main factors determine the number of nodes a
stage in a job will run on?
2. What two types of parallelism are implemented in parallel
jobs?
3. What stage is often used to generate mock data?

© Copyright IBM Corporation 2006-2011

Figure 1-22. Checkpoint KM4001.0

Notes:
Write your answers here:


Exercise 1
• In this lab exercise, you will:
– Generate mock data
– Examine the job log

© Copyright IBM Corporation 2006-2011

Figure 1-23. Exercise 1 KM4001.0

Notes:


Unit summary
Having completed this unit, you should be able to:
• Describe the parallel processing architecture
• Describe pipeline and partition parallelism
• Describe the role of the configuration file
• Design a job that creates robust test data

© Copyright IBM Corporation 2006-2011

Figure 1-24. Unit summary KM4001.0

Notes:


Unit 2. Compilation and Execution

What this unit is about


This unit describes the compile time and run time architectures of
DataStage parallel jobs.

What you should be able to do


After completing this unit, you should be able to:
• Describe the main parts of the configuration file
• Describe the compile process and the OSH that is generated
during it
• Describe the role and the main parts of the Score
• Describe the job execution process


Unit objectives
After completing this unit, you should be able to:
• Describe the main parts of the configuration file
• Describe the compile process and the OSH that is generated
during it
• Describe the role and the main parts of the Score
• Describe the job execution process

© Copyright IBM Corporation 2005-2011

Figure 2-1. Unit objectives KM4001.0

Notes:


Parallel Job Compilation

© Copyright IBM Corporation 2005-2011

Figure 2-2. Parallel Job Compilation KM4001.0

Notes:


Parallel job compilation


DataStage generates all code (OSH, C++)
• OSH is a scripting language that can be turned into code executable by
the DataStage parallel engine
• Validates link requirements, mandatory stage options, Transformer logic,
and other requirements
• Generates OSH representation of data flow and stages
– Stages are compiled into C++ operators
– Table definitions are compiled into schemas
• Generates transform operators for Transformers
– Compiled into C++ source code and then into C++ operators
• Build stages are compiled manually within the GUI
(Diagram: the Designer client sends the job for compile on the DataStage
server; the executable job consists of the generated OSH plus Transformer
components, C++ for each Transformer.)

© Copyright IBM Corporation 2005-2011

Figure 2-3. Parallel job compilation KM4001.0

Notes:
During the compile process, DataStage generates all the code for the job. The compilation
process generates OSH (a scripting language) from the job design and also C++ code for
any Transformer stages that are used in the job.
For each Transformer, DataStage builds a C++ operator. This explains why jobs with
Transformers often take longer to compile (but not to run).


Transformer job compilation notes


• To improve compilation times, previously compiled, unchanged
Transformers are not recompiled
– The Force Compile option in the Multiple Job Compile utility can be used to
force recompiles of batches of jobs
• On clustered and grid runtime environments, job processing is
distributed across multiple platforms
– Transformer operators must be available to all the platforms the job is running
on
– To share Transformer operators, you can share the project directory across
the different platforms
• Must be identical mount point across the platforms

Alternatively, you can set $APT_COPY_TRANSFORM_OPERATOR on
the first job run to distribute Transformer operators to all the platforms
– Build and custom stage code must be shared or distributed manually

© Copyright IBM Corporation 2005-2011

Figure 2-4. Transformer job compilation notes KM4001.0

Notes:
As previously mentioned, DataStage generates and then compiles C++ source code for
each Transformer in a job. These become custom operators in the OSH. This explains why
jobs with Transformers often take longer to compile. This also creates a problem if the jobs
are run in a grid or clustered environment which distributes the processing across multiple
platforms. These custom operators must exist on each of the platforms. This is not a
problem for other standard stages, because their corresponding operators are distributed
to all the platforms at installation time.


Generated OSH
• Enable viewing of generated OSH in Administrator
• OSH is visible in:
- Job Properties
- Job log
- View Data
- Table definitions
(Screenshot callouts point out the comments, operator, and schema in the
generated OSH.)

© Copyright IBM Corporation 2005-2011

Figure 2-5. Generated OSH KM4001.0

Notes:
You can view generated OSH in Designer in several places, as shown above. To view the
OSH, you must first enable this in Administrator on the Parallel tab. Once enabled, the
setting applies to all projects.


Stage to OSH operator mappings


• Sequential File stage
– When used as a source: import operator
– When used as a target: export operator
• Data Set stage: copy operator
• Sort: tsort operator
• Aggregator: group operator
• Row Generator, Column Generator, Surrogate Key Generator
– All mapped to generator operator
• Oracle
– Source: oraread
– Sparse Lookup: oralookup
– Target Load: orawrite
– Target Upsert: oraupsert
• Lookup File Set
– Target: lookup -createOnly

© Copyright IBM Corporation 2005-2011

Figure 2-6. Stage to OSH operator mappings KM4001.0

Notes:
The stages on the diagram do not necessarily map one-to-one to OSH operators. For
example, the Sequential File stage when used as a source is mapped to the import
operator. When used as a target it is mapped to the export operator.
The converse is also true. Different stages can be mapped to a single operator. For
example, the Row Generator and Column Generator stages are both mapped to the
generator operator with different parameters.


Generated OSH primer

• Comment blocks introduce each operator
– Operator order in the OSH script is determined by the order stages were
added to the canvas
– Execution order is determined by the ordering of the inputs and outputs
of the operators
• OSH uses the familiar syntax of the UNIX shell
– Operator name
– Schema (generated from stage table definitions)
– Operator options (-name value format)
– Virtual (in-memory) data sets are used as inputs and outputs to and from
operators
– Inputs to the operator are indicated by n< where n is the input number
– Outputs from the operator are indicated by n> where n is the output number
• Virtual data sets are generated to connect operators
– Have *.v extensions
– A virtual data set connects the output of one operator to the input of
another

Generated OSH for first 2 stages:

####################################################
#### STAGE: Row_Generator_0
## Operator
generator
## Operator options
-schema record
(
  a:int32;
  b:string[max=12];
  c:nullable decimal[10,2] {nulls=10};
)
-records 50000
## General options
[ident('Row_Generator_0'); jobmon_ident('Row_Generator_0')]
## Outputs
0> [] 'Row_Generator_0:lnk_gen.v'
;
####################################################
#### STAGE: SortSt
## Operator
tsort
## Operator options
-key 'a'
-asc
## General options
[ident('SortSt'); jobmon_ident('SortSt'); par]
## Inputs
0< 'Row_Generator_0:lnk_gen.v'
## Outputs
0> [modify (
keep
a,b,c;
)] 'SortSt:lnk_sorted.v'
;
© Copyright IBM Corporation 2005-2011

Figure 2-7. Generated OSH primer KM4001.0

Notes:
Data sets connect the OSH operators. These are virtual data sets, that is, in-memory data
flows. These data sets correspond to links in the job diagram. Link names are used in data
set names. So good practice is to name links meaningfully, so they can be recognized in
the OSH.
To determine the execution order of the operators, trace the output to input data sets. For
example, if operator1 has dataSet1.v as an output and this data set is input to operator2,
then operator2 follows operator1 in the execution order.
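This tracing rule can be applied mechanically. The sketch below is an illustration, not a real OSH parser: the input and output virtual data sets are copied from the generated OSH shown earlier, a hypothetical downstream peek consumer is added, and a topological sort recovers the execution order.

from graphlib import TopologicalSorter   # Python 3.9+

# inputs/outputs per operator, as read off the OSH comment blocks
operators = {
    "generator": {"in": [], "out": ["Row_Generator_0:lnk_gen.v"]},
    "tsort":     {"in": ["Row_Generator_0:lnk_gen.v"],
                  "out": ["SortSt:lnk_sorted.v"]},
    "peek":      {"in": ["SortSt:lnk_sorted.v"], "out": []},  # assumed consumer
}

# operator B follows operator A when one of A's output data sets feeds B
producers = {ds: op for op, io in operators.items() for ds in io["out"]}
graph = {op: {producers[ds] for ds in io["in"] if ds in producers}
         for op, io in operators.items()}

print(list(TopologicalSorter(graph).static_order()))
# ['generator', 'tsort', 'peek']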


DataStage GUI versus OSH terminology

GUI                  OSH
-----------------------------------------
table definition     schema
property             format
SQL column type      C++ data type
link                 virtual dataset
row                  record
column               field
stage                operator

• Log messages use OSH terminology

© Copyright IBM Corporation 2005-2011

Figure 2-8. DataStage GUI versus OSH terminology KM4001.0

Notes:
This slide lists some of the equivalencies of terminology between the DataStage GUI and
the generated OSH.
OSH terms and DataStage GUI terms have an equivalency. The GUI frequently uses terms
from both paradigms. Log messages almost exclusively use OSH terminology because this
is what the parallel engine executes.


Configuration File

© Copyright IBM Corporation 2005-2011

Figure 2-9. Configuration File KM4001.0

Notes:


Configuration file

• Specifies the processing nodes


– Determines the degree of parallelism
• Identifies resources connected to each processing
node
• When system resources change, only need to change
the configuration file
– No need to modify or recompile jobs
• When a DataStage parallel job runs, the configuration
file is read
– The application is automatically configured to fit the
system

© Copyright IBM Corporation 2005-2011

Figure 2-10. Configuration file KM4001.0

Notes:
The “Parallel Job Developers Guide” documents the configuration file.


Processing nodes (partitions)


• Locations on which the engine runs the generated OSH
operators
• Logical rather than physical
• Do not necessarily correspond to the number of CPUs in
your system
– May be more or less
• Distinguished from “computer nodes” in a network or grid
– A single computer may support multiple processing nodes

© Copyright IBM Corporation 2005-2011

Figure 2-11. Processing nodes (partitions) KM4001.0

Notes:
Processing nodes are specified in the configuration file. These do not necessarily
correspond to “computer nodes.” A single computer node can run multiple processing
nodes.


Configuration file format

• Text file that is passed to the parallel engine


– Stored on DataStage Server
– Can be displayed and edited
• Name and location of the file to be used is
determined by the $APT_CONFIG_FILE
environmental variable
• Primary elements
– Node name
– Fast name
– Pools
– Resources

© Copyright IBM Corporation 2005-2011

Figure 2-12. Configuration file format KM4001.0

Notes:
This slide describes the configuration file format. The primary elements are the node name, fast
name, pools, and resources.


Node options
• Node name
– User-defined name of a processing node
• Fast name
– Name of the computer system upon which a node is located, as referred to by the
fastest network in the system
– Specified for each processing node
– Used by DataStage operators to open connections
– For non-distributive systems such as SMP, the operators all run on a single system
• So all processing nodes will have the same fast name
• Pools: Two types
– Node pools
• Names of pools to which a node is assigned
• Used to logically group nodes
– Resource pools
• Names of pools to which resources are assigned
• Used to logically group resources
– The Default pool: Specified by the empty string (“”)
• By default, all operators can use any node assigned to the default pool
• By default, resources assigned to the default pool are available to all operators
• Resources
– Disk
– Scratch disk

© Copyright IBM Corporation 2005-2011

Figure 2-13. Node options KM4001.0

Notes:
The node name is not required to correspond to anything physical. It is a user-defined
name for a virtual location where operators can run.
Fast name is the name of the node as it is referred to on the fastest network in the system,
such as an IBM switch, FDDI, or BYNET. For non-distributive systems such as SMP, the
operators all run on a single system. So regardless of the number of processing nodes, the
fast name will be the same for all of them. The fast name is the physical node name that
operators use to open connections for high-volume data transfers. Typically this is the
principal node name as returned by the UNIX command uname –n.


Sample configuration file


{
  node "Node1"
  {
    fastname "BlackHole"
    pools "" "node1"
    resource disk "/usr/dsadm/Datasets1" {pools ""}
    resource disk "/usr/dsadm/Datasets2" {pools "bigdata"}
    resource scratchdisk "/usr/dsadm/Scratch1" {pools ""}
    resource scratchdisk "/usr/dsadm/Scratch2" {pools "sort"}
  }
}
Callouts: "" is the default node pool; "node1" is a named node pool;
"bigdata" is a named disk pool; "sort" is a reserved named pool.

© Copyright IBM Corporation 2005-2011

Figure 2-14. Sample configuration file KM4001.0

Notes:
There are a set of resource pool reserved names, including: db2, oracle, informix, sas,
sort, lookup, buffer. Certain types of operators will use resources assigned to these
reserved name pools. For example, the sort operator will use scratch disk assigned to the
sort pool, if it exists. If it exhausts the space on the sort pool, it will use other default
scratch disk.


Resource pools

• Resource pools allocate disk storage
(Diagram: a resource pool named "bigdata")
• By default, operators use the
default pool, specified by “”
• But operators can be
constrained to use resources
assigned to specific named
pools, for example “bigdata”

© Copyright IBM Corporation 2005-2011

Figure 2-15. Resource pools KM4001.0

Notes:
Resource pools allocate resources, mainly disk resources, to nodes as specified in the
configuration file. One resource pool, specified by “” (the empty pool) is special. It is the
default pool of resources to be used by operators.


Sorting resource pools


• Resources can be specified for sorting
– Scratch disk available for sorting
– Sort operator can use “sort” disk pool or default disk pool
• Sort stage looks first for scratch disk resources in the
sort pool
– Then it looks for resources assigned to default disk pools

© Copyright IBM Corporation 2005-2011

Figure 2-16. Sorting resource pools KM4001.0

Notes:
One type of resource pool, the sort pool, specifies disk resources to be used if a sorting
operation runs out of memory. If it runs out of sort disk resources, it will use scratch disk
resources.


Another configuration file example


{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "sort"
    resource disk "/data/n1/d1" {}
    resource disk "/data/n1/d2" {}
    resource scratchdisk "/scratch" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/data/n2/d1" {}
    resource scratchdisk "/scratch" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/data/n3/d1" {}
    resource scratchdisk "/scratch" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/data/n4/d1" {}
    resource scratchdisk "/scratch" {}
  }
}
Callouts: "sort" on node n1 is a node pool for the sort operator; the
scratchdisk pool {"sort"} is a resource pool for sort; {} marks the default
resource pool; the fast names are all different, so this configuration runs
on a grid or cluster.

© Copyright IBM Corporation 2005-2011

Figure 2-17. Another configuration file example KM4001.0

Notes:
This slide shows an example of a configuration file. Notice the sort keyword used for the
first node which designates disk resources to use by a sort operator running on node n1.
This is disk the sort operator will use if it runs out of memory.


Constraining operators to specific node pools

• Named resource pool with extra disk resources
• Named node pool with extra nodes for bottleneck situations

© Copyright IBM Corporation 2005-2011

Figure 2-18. Constraining operators to specific node pools KM4001.0

Notes:
In this example, since a sparse lookup is viewed as the bottleneck, the stage has been set
to execute on multiple nodes. These are nodes that are assigned to the extra node pool. It
is also given extra resources. These are resources that are assigned to the extra resource
pool.


Configuration Editor

• View and edit configuration files


– Click Tools>Configurations in Designer
• Create new configurations
– Easiest way is to save an existing configuration under a
new name and modify it
• There is a button you can click to check a
configuration for syntax errors

© Copyright IBM Corporation 2005-2011

Figure 2-19. Configuration Editor KM4001.0

Notes:
DataStage Designer has a configuration editor you can use to create and edit configuration
files. The editor also contains functionality for checking the configuration file for syntax
errors.


Configuration editor

(Screenshot callouts: Select configuration; Check configuration; Output from check)

© Copyright IBM Corporation 2005-2011

Figure 2-20. Configuration editor KM4001.0

Notes:
This slide shows the configuration editor. Select the configuration file to edit from the list
box at the top. When you click the Check button the editor checks the syntax and displays
the results in the lower window.


Parallel Runtime Architecture

© Copyright IBM Corporation 2005-2011

Figure 2-21. Parallel Runtime Architecture KM4001.0

Notes:


Parallel Job startup


• Generated OSH and configuration file are used to “compose” the job
Score
– Think of “Score” as in musical score, not game score
– Similar to a database building a query optimization plan
– Identifies degree of parallelism and node assignments for each operator
– Inserts sorts and partitioners as needed to ensure correct results
– Defines connection topology (virtual data sets) between adjacent
operators
– Inserts buffer operators to prevent deadlocks
– Defines number of actual operating system processes
• Where possible, multiple operators are combined within a single process to
improve performance and optimize resource requirements
• Set $APT_STARTUP_STATUS to show each step of job startup
• Set $APT_PM_SHOW_PIDS to show process IDs in log
messages
© Copyright IBM Corporation 2005-2011

Figure 2-22. Parallel Job startup KM4001.0

Notes:
The Score is one of the main runtime debugging tools. It is generated from the OSH and
configuration file. This slide lists some of the information the Score contains.


Parallel job run time


• It is only after the job Score and processes are
created that processing begins
– “Startup overhead” of a parallel job
• Job processing ends when either:
– Last row of data is processed by final operator
– A fatal error is encountered by any operator
– Job is halted by DataStage job sequence or user
intervention (for example, DataStage Director STOP)

© Copyright IBM Corporation 2005-2011

Figure 2-23. Parallel job run time KM4001.0

Notes:
Generating the Score and initiating the operator processes is part of the job overhead.
Processing does not begin until this occurs. This is inconsequential for jobs processing
very large amounts of data, but it can be consequential for jobs processing smaller
amounts of data.


Viewing the job Score


• Set $APT_DUMP_SCORE to output the Score to the job log
• To identify the Score message, look for “main program:
This step has N datasets”
– The word ‘Score’ does not appear anywhere in the message
(Screenshot callouts: Score message; Score contents)

© Copyright IBM Corporation 2005-2011

Figure 2-24. Viewing the job Score KM4001.0

Notes:
The only place the Score is displayed is in the job log. Unfortunately, the message does not
contain the word “Score” in its heading. Look for the message heading that begins “main
program: This step has N data sets”.


Example job Score


• Job score is divided into two
sections
– Datasets
• Partitioning and collecting
algorithms
– Operators
• Operator to node mappings
• Both sections identify sequential
or parallel processing

(Screenshot callout: number of player processes)
© Copyright IBM Corporation 2005-2011

Figure 2-25. Example job Score KM4001.0

Notes:
The Score contains a lot of useful information including the number of operators and data
sets, and the mappings of operators to processing nodes. Recall that the names of the
nodes are arbitrary. “1” in “node1” is just part of an arbitrary string name; it does not identify
where it is in the partitioning order. “p0”, “p1”, identify the partitions and their ordering, as
determined by the configuration file.
The last entry identifies the number of player processes. In this example, there is one for
the Row Generator stage, which is running sequentially, and four each for the two Peek
stages, which are running in parallel using all the nodes. The total is nine processes.
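The arithmetic generalizes: each operator contributes one player per partition it runs on (one for a sequential operator, N for a parallel operator under an N-node configuration file). A trivial sketch of the count for this job:

# players per operator, as reported in the Score above
players = {"generator": 1, "peek_1": 4, "peek_2": 4}
print(sum(players.values()))   # 9 player processes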


Job execution: The orchestra metaphor


(Diagram: the Conductor (C) runs on the Conductor Node; each Processing Node
runs one Section Leader (SL) and its Player (P) processes.)
• Conductor (C) - initial startup process
– On the system where DataStage is installed
– Composes Score from OSH and configuration file
– Creates Section Leader (SL) processes (one per node)
– Consolidates messages to job log
– Manages orderly shutdown
• Section Leader (SL) processes
– Forks Player processes (one per operator)
– Manages up/down communication
• Player (P) processes
– The actual processes associated with operators
– Send error and information messages to their Section Leaders
– Establish connections to other players for data flow
– Clean up upon completion
• Default Communication:
– SMP: Shared Memory
– MPP: Shared Memory (within hardware node); TCP (across hardware nodes)

Figure 2-26. Job execution: The orchestra metaphor KM4001.0

Notes:
The conductor node hosts the startup process. The Conductor creates the Score based on the OSH and the configuration file, and then starts the Section Leader processes.
Section Leaders manage communication between the Conductor and the Players. Error and information messages returned by an operator running on a node (that is, by a player process) are passed to the Section Leader, which then passes them to the Conductor.


Runtime control and data networks

(Diagram: the Conductor is connected to Section Leader 0, Section Leader 1, and Section Leader 2 by a control channel (TCP) and by stdout and stderr channels (pipes). Each Section Leader manages a generator player and a copy player on its node. The players pass data directly to one another over APT_Communicator data channels, shown as dotted lines.)

$ osh "generator -schema record(a:int32) [par] | roundrobin | copy"


Figure 2-27. Runtime control and data networks KM4001.0

Notes:
The dotted lines are communication channels between player processes for passing data.
Data that moves between nodes (for example, between section leader 1 and section leader
2) is being repartitioned.
Every player has to be able to communicate with every other player. There are separate
communication channels (pathways) for control, messages, errors, and data. Note that the
data channel does not go through the section leader/conductor, as this would limit
scalability. Data flows directly from upstream operators to downstream operators.
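A rough way to see why this matters for scalability, sketched in Python below. This is an illustration only, not the framework's actual bookkeeping; it simply assumes that when data is repartitioned, each of the N upstream players must be able to reach each of the M downstream players:

    # Illustrative sketch: possible player-to-player data pathways across
    # a repartitioning link (N-way producer feeding an M-way consumer).
    def data_connections(producer_players, consumer_players):
        return producer_players * consumer_players

    # In the osh example above, generator and copy each run 3 ways,
    # with a roundrobin repartition between them:
    print(data_connections(3, 3))  # 9 pathways, none routed through the Conductor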


Parallel data flow


• Think of a running job as a series of "conveyor belts" transporting rows for each link
  – If the stage (operator) is parallel, each link will have multiple independent "belts" (partitions)
• Row order is undefined ("non-deterministic") across partitions and across multiple links
  – Order within a particular link and partition is deterministic
    • Based on the partition type and optionally on the sort order
• For this reason, job designs cannot include "circular" references
  – For example, you cannot update a source or reference file used in the same flow

(Diagram: a data flow with the label "Undefined order across partitions and links".)


Figure 2-28. Parallel data flow KM4001.0

Notes:
Conceptually, you can picture a running parallel job as a series of conveyor belts
transporting rows. The order of the rows across the partitions is non-deterministic. Within a
single partition the order is determined.


Monitoring job startup and execution in the log


(Log callouts, in startup order: the Conductor creates the Score; Section Leaders are defined; the Score is sent to the Section Leaders; Players are started; data connections are set up between Players; job processing starts.)

Figure 2-29. Monitoring job startup and execution in the log KM4001.0

Notes:
This slide shows some of the information contained in the log about the start-up
processing. Reporting environment variables control how much of this information shows
up in the log.


Counting the total number of processes


• One for the conductor process
• One section leader process for each node
– Four nodes = four processes
• One Player process for each operator running on a node
– One operator running (sequentially) on one node = one process
– One operator running (in parallel) on four nodes = four processes
• Total number of processes = Conductor + Section Leader processes + Player processes for all operators


Figure 2-30. Counting the total number of processes KM4001.0

Notes:
The total number of processes a job generates is important to performance. If you can
reduce the number of processes a job is using, relative to a certain configuration file, you
can improve its performance.
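As a quick sanity check, here is a minimal Python sketch (illustrative only) that applies this formula to the earlier example job: a sequential Row Generator plus two Peek stages running 4-way parallel on a 4-node configuration file:

    # Illustrative process count for the earlier example job
    nodes = 4
    conductor = 1
    section_leaders = nodes          # one Section Leader per node
    players = 1 + 4 + 4              # sequential Row Generator + two 4-way Peek stages
    total = conductor + section_leaders + players
    print(players, total)            # 9 player processes, 14 processes in total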


Parallel Job Design Examples


Figure 2-31. Parallel Job Design Examples KM4001.0

Notes:


Peeking at the data stream


How do you view what is happening on a link at run time?
• Use Copy stage to split the stream off to a Peek stage
• Use Filter stage to select the data
– Map out the columns you’re interested in
• Use Peek stage to display selected data in the job log


Figure 2-32. Peeking at the data stream KM4001.0

Notes:
Sometimes it would be nice to know what is happening to the data at a particular place in the job. For example, maybe you want to know what the data looks like before it is processed by a Transformer stage. This is one use you can make of Copy and Peek stages.


Peeking at the data stream design

(Screenshot callouts: a Copy stage used as a placeholder; the stream is copied off to the Peek stages; a Filter stage selects the records to peek at; the Peek stage.)


Figure 2-33. Peeking at the data stream design KM4001.0

Notes:
This slide shows a job with Copy stages used to get snapshots of the data during the
processing. The second Copy stage will be optimized away, because it has only one output
and the Force property has been set to False. The first may be combined with the
Transformer in the final optimization. This information can be seen in the log.


Using Transformer stage variables

(Screenshot callouts: stage variables are executed top to bottom; reference stage variables in column derivations.)

Figure 2-34. Using Transformer stage variables KM4001.0

Notes:
In this job example, stage variables are defined. Stage variables are executed top to
bottom, just like columns. They are executed before any output links are processed.


Checkpoint
1. Why do jobs with Transformer stages take longer to compile?
2. What do the following objects in the GUI correspond to in the OSH: table definition, stage, link?
3. Suppose node2 is in a node pool named "sort". Then a Sort stage operator can run on this node. What about other stage operators? Could, for example, a Transformer stage operator run on node2?
4. From the Score we learn that this job generates three operators. The first runs sequentially. The last two run in parallel, each on two nodes. How many player processes does it run? How many total processes?


Figure 2-35. Checkpoint KM4001.0

Notes:
Write your answers here:


Exercise 2 – Compilation and Execution


• In this lab exercise, you will:
  – Examine OSH from a job
  – Examine the configuration file and create a node pool
  – Create a node pool constraint in a job
  – Build a job using Filter and Peek stages to view selected data
  – Define derivations in a Transformer
  – Examine the Score


Figure 2-36. Exercise 2 - Compilation and Execution KM4001.0

Notes:


Unit summary
Having completed this unit, you should be able to:
• Describe the main parts of the configuration file
• Describe the compile process and the OSH that is generated
during it
• Describe the role and the main parts of the Score
• Describe the job execution process


Figure 2-37. Unit summary KM4001.0

Notes:

Unit 3. Partitioning and Collecting Data

What this unit is about


This unit describes how partitioning and collecting works in the parallel
job environment.

What you should be able to do


After completing this unit, you should be able to:
• Understand how partitioning works in the Framework
• View partitioners in the Score
• Select partitioning algorithms
• Generate sequences of numbers (surrogate keys) in a partitioned,
parallel environment

How you will check your progress


• Lab exercises and checkpoint questions.


Unit objectives
After completing this unit, you should be able to:
• Understand how partitioning works in the Framework
• View collectors and partitioners in the Score
• Select collecting and partitioning algorithms
• Generate sequences of numbers (surrogate keys) in a
partitioned, parallel environment


Figure 3-1. Unit objectives KM4001.0

Notes:


Partitioning and collecting

• Partitioners distribute rows of a link into smaller segments that can be processed independently in parallel
  – ONLY on input links before stages running in parallel mode
• Collectors combine parallel partitions of a link for sequential processing
  – ONLY on input links before stages running in sequential mode

(Diagram: a partitioner feeds a stage running in parallel; a collector feeds a stage running sequentially.)


Figure 3-2. Partitioning and collecting KM4001.0

Notes:
Partitioners are generated by default or when you specify them explicitly in the stage. They
distribute rows of a link into smaller segments that can be processed independently in
parallel. Collectors reverse this process. They combine parallel partitions of a link into a
single partition for sequential processing.


Partitioning and collecting icons

"Fan-out" partitioner icon: sequential to parallel.
"Fan-in" collector icon: parallel to sequential.

Partitioner and collector icons always appear left to right regardless of the angle of the link.


Figure 3-3. Partitioning and collecting icons KM4001.0

Notes:
This slide shows a job opened in Designer. The partition and collector icons show up on the
input links going to a stage. They always appear left to right regardless of the angle of the
link.


Partitioners

• Partitioners are inserted before stages running in parallel. The previous stage may be running:
  – Sequentially
    • Results in a "fan-out" icon
  – In parallel
    • If the partitioning method changes, the data is repartitioned (repartitioning icon)


Figure 3-4. Partitioners KM4001.0

Notes:
Partitioners are inserted before stages running in parallel. The previous stage may be running in parallel or sequentially. The former yields a "box" (Same) or "butterfly" icon, depending on whether the data is repartitioned. The latter yields a "fan-out" icon.


Where partitioning is specified


(Screenshot callouts: the Inputs tab of a parallel stage; the Partitioning tab; the partitioning method; note the word "Partition".)


Figure 3-5. Where partitioning is specified KM4001.0

Notes:
This slide shows the Inputs>Partitioning tab. Auto is the default. If a partitioning method
other than Auto is selected, then this information can go into the OSH. If Auto is selected,
the framework inserts partitioners when the Score is composed.


The Score
• Set $APT_DUMP_SCORE to include the Score in the job log
• The Score includes information such as:
– How the data is partitioned and collected
• Including partitioning keys
– Extra operators and buffers inserted into the flow
– Degree of parallelism each operator runs on, and on which nodes
– tsort operators inserted into the flow


Figure 3-6. The Score KM4001.0

Notes:
The setting of the environment variable $APT_DUMP_SCORE determines whether the Score is displayed in the job log. The Score contains a lot of valuable information, some of which is listed on this slide.


Viewing the Score operators

(Screenshot callouts: summary of operators; running in parallel; number of partitions; text name; alias; nodes the operator is running on; partition the node is associated with.)

Figure 3-7. Viewing the Score operators KM4001.0

Notes:
Under each operator is a list of nodes the operator is running on. This will include multiple
nodes if the operator is running in parallel. For each node (for example, node1), the
partition (p0) the node name is associated with is shown.
Each operator has a name derived from the GUI stage the operator was generated from
and an alias (op0, op1, and so on) used within the Score. So, for example, op0 was
generated from a Row Generator stage. Following the operator alias is the number of
partitions (1p, 2p, and so on) that it is running on.


Interpreting the Score partitioning

• The Framework implements a producer-consumer data flow model at the operator level
  – Upstream stages (operators or persistent data sets) produce rows that are consumed by downstream stages (operators or data sets)
  – The partitioning method is associated with the producer
    • Even though it is set in the consumer stage on the GUI
  – The collector method is associated with the consumer
  – Producer and consumer are separated by an indicator:
      ->  Sequential to Sequential
      <>  Sequential to Parallel
      =>  Parallel to Parallel (SAME)
      #>  Parallel to Parallel (not SAME)
      >>  Parallel to Sequential
      >   No producer or no consumer
  – May also include [pp] notation when the Preserve Partitioning flag is set

Figure 3-8. Interpreting the Score partitioning KM4001.0

Notes:
At the operator level, partitioning and collecting involves a pair of operators. The first
operator produces the rows; the second consumes them. At the GUI level in the job design
we specify the partitioning or collecting algorithm always and only at the consumer stage.
So here the GUI is a little misleading when we specify a partitioning method.
To interpret the score partitioning and collecting methods, first look for the indicator symbol
in the row between the two operators. The indicator identifies the parallelism sequence
between the two operators, as shown in the list.
Look to the left of the indicator to determine the partitioning method. eAny indicates Auto,
which is a default as determined by the type of stage. If we had, for example, chosen
Entire as the partitioning method for the Transformer, we would see eEntire to the left of
the indicator.
Look to the right of the indicator symbol to determine the collection method. Since the
Transformer is running in parallel there is no useful information on the right side. The
eCollectAny symbol indicates that even when a Transformer operator is running in parallel
it still has to retrieve the data from the producer operator.


Score partitioning example

(Screenshot callouts: combinability mode: Don't Combine; partitioning method: Hash by c1, producing a Hash partitioner, sequential to parallel; collector: Ordered, parallel to sequential.)

Figure 3-9. Score partitioning example KM4001.0

Notes:
This slide shows the Score generated for the job displayed. Property settings in the stages
in the job affect the contents of the Score. For example, Hash by c1 has been set in the
Transformer stage. Notice that a hash partitioner is generated in the Score, as indicated.


Partition numbers

• At runtime, the parallel framework determines the degree of parallelism for each stage from:
  – Configuration file
  – Execution mode (Stage>Advanced tab)
  – Stage constraints, if applicable (Stage>Advanced tab)
• Partitions are assigned numbers, starting at zero
  – The partition number is appended to the stage name for messages written to the job log

(Screenshot callouts: stage name; partition number.)


Figure 3-10. Partition numbers KM4001.0

Notes:
At runtime, the parallel framework determines the degree of parallelism for each stage from
the configuration file and other settings. Partitions are assigned numbers, starting at zero.
In the log, the partition number is appended to the stage name in messages.


Partitioning methods
Keyless partitioning: rows are distributed independently of data values
• Same
  – Existing partitioning is not altered
• Round Robin
  – Rows are evenly alternated among partitions
• Random
  – Rows are assigned randomly to partitions
• Entire
  – Each partition gets the entire data set (rows are duplicated)

Keyed partitioning: rows are distributed based on values in specified key columns
• Hash
  – Rows with the same key column values go to the same partition
• Modulus
  – Assigns each row of an input data set to a partition, as determined by a specified numeric key column
• Range
  – Similar to Hash, but the partition mapping is user-determined and partitions are ordered
• DB2
  – Matches DB2 EEE partitioning


Figure 3-11. Partitioning methods KM4001.0

Notes:
This slide lists the two main categories of partitioning methods: Keyless, and Keyed.
Auto (the default method): DataStage chooses appropriate partitioning method. Round
Robin, Same, or Hash are most commonly chosen.
Random: DataStage uses a Random algorithm to choose where the row goes. The result
of Random is that you cannot know where a row will end up.
Hash: DataStage’s internal algorithm applied to key values determines the partition. The
data type of the key value is irrelevant. All key values are converted to characters before
the algorithm is applied.
Range: The partition is chosen based on a range map, which maps ranges of values to
specified partitions. There is a stage that can be used to build the range map, but its use is
not required.
DB2: DB2 has published its hashing algorithm and DataStage copies that. Use when
hashing to partitioned DB2 tables.


Selecting a partitioning method


• Choose a partitioning method that gives approximately an equal number of rows to each partition
  – Ensures that processing is evenly distributed across nodes
  – Greatly varied partition sizes increase processing time
• Enable "Show Instances" in the Director Job Monitor to show data distribution across partitions
• Setting the environment variable $APT_RECORD_COUNTS outputs row counts per partition to the job log as each stage operator completes processing


Figure 3-12. Selecting a partitioning method KM4001.0

Notes:
In general, choose a partitioning method that gives approximately an equal number of rows to each partition while still satisfying business requirements. This ensures that processing is evenly distributed across the nodes.


Selecting a partitioning method, continued

• The partitioning method must match the stage logic
  – Assign related records to the same partition if required
  – This includes any stage that operates on groups of related data (often using key columns)
    • Aggregator, Join, Merge, Sort, Remove Duplicates, Transformer and Build stages (when processing groups)
  – A partitioning method needed to ensure correct results may lead to uneven distribution
• Leverage partitioning performed earlier in the flow
  – Repartitioning increases performance overhead

Figure 3-13. Selecting a partitioning method, continued KM4001.0

Notes:
The partition method must match the stage logic. Some stages, for example, require that all
related records (by key) are in the same partition. This includes any stage that operates on
groups of related data. For best performance, leverage partitioning performed earlier in the
flow.


Same partitioning algorithm Keyless

• Keyless partitioning method
• Rows retain their current distribution and order from the output of the previous parallel stage
  – Does not move data between partitions
  – Retains "carefully partitioned" data (such as the output of a previous sort)
• Fastest partitioning method (no overhead)

(Diagram: row IDs 0-8 in three partitions pass through SAME unchanged; the SAME partitioning icon appears on the link.)



Figure 3-14. Same partitioning algorithm KM4001.0

Notes:
This slide illustrates the Same partitioning algorithm. It is a keyless method that retains the
current distribution and order of the rows from the previous parallel stage.


Caution regarding Same partitioning


• The degree of parallelism remains unchanged in the downstream stage
• Do not follow a stage running sequentially (for example, a Sequential File stage) with a stage using Same partitioning
  – The downstream stage will run sequentially!
• Do not follow a Data Set stage with a stage using Same partitioning
  – This would occur if one job writes to a data set that a second job reads
  – The downstream stage will run with the degree of parallelism used to create the data set
    • Regardless of the degree of parallelism defined in the job's configuration file


Figure 3-15. Caution regarding Same partitioning KM4001.0

Notes:
Same has low overhead, but there are times when it should not be used. Do not follow a
stage running sequentially (for example, a Sequential File stage) with a stage using Same
partitioning. And do not follow a Data Set stage with a stage using Same partitioning.


Round Robin and Random Keyless

• Keyless partitioning methods
• Rows are evenly distributed across partitions
  – Good for the initial import of data if no other partitioning is needed
  – Useful for redistributing data
• Fairly low overhead
• Round Robin assigns rows to partitions like dealing cards
  – The row assignment will always be the same for a given configuration file
• Random has slightly higher overhead, but assigns rows in a non-deterministic fashion between job runs

(Diagram: input rows …8 7 6 5 4 3 2 1 0 are dealt Round Robin into three partitions: {0, 3, 6}, {1, 4, 7}, and {2, 5, 8}.)


Figure 3-16. Round Robin and Random KM4001.0

Notes:
Round Robin and Random are two other keyless methods. In both cases, rows are evenly
distributed across partitions. Random has slightly higher overhead, but assigns rows in a
non-deterministic fashion between job runs.


Parallel runtime example


• Row order is undefined across partitions
• Consider this example job: a Row Generator producing 10 rows {A: Integer, initial_value=1, incr=1}, followed by Round Robin partitioning
  – Round Robin partitioning distributes rows to the nodes in a specific order
  – But, across nodes, the order in which a particular node outputs its rows may change with each run

Results with a 4-node configuration file:
  Node 0: 1, 5, 9
  Node 1: 2, 6, 10
  Node 2: 3, 7
  Node 3: 4, 8

(Screenshots: in one run, a:3 arrives before a:2; in another run, a:3 arrives after a:2.)


Figure 3-17. Parallel runtime example KM4001.0

Notes:
It is very important to know that row order across partitions can differ between job runs; this slide illustrates the point. In the first job run, the row containing a:3, which is in partition 2, arrives first. In the second job run, the row containing a:3 arrives after rows in other partitions, for example, a:2 in partition 1.
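A minimal Python sketch (illustrative only) of the two halves of this behavior; the dealing is deterministic, but the arrival order is not:

    # Round Robin dealing of rows 1..10 across a 4-node configuration file.
    partitions = {p: [] for p in range(4)}
    for i, row in enumerate(range(1, 11)):
        partitions[i % 4].append(row)
    print(partitions)  # {0: [1, 5, 9], 1: [2, 6, 10], 2: [3, 7], 3: [4, 8]}
    # The assignment above never changes for this configuration file, but the
    # order in which the partitions deliver their rows downstream can differ
    # on every run.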


Entire partitioning Keyless


• Each partition gets a copy of each row
  – Useful for distributing lookup and reference data
  – May have a performance impact in MPP / clustered environments
    • On SMP platforms, the Lookup stage uses shared memory instead of duplicating the entire reference data
    • On MPP platforms, each server uses shared memory for a single local copy
• Entire is the default partitioning method for Lookup reference links
  – On SMP platforms, it is a good practice to set this explicitly

(Diagram: input rows …8 7 6 5 4 3 2 1 0 pass through ENTIRE; every partition receives a complete copy: 0, 1, 2, 3, ….)


Figure 3-18. Entire partitioning KM4001.0

Notes:
Entire partitioning is another keyless method. Each partition gets a complete copy of each
row. This is very useful for distributing lookup and reference data. On SMP platforms, the
Lookup stage uses shared memory instead of duplicating the entire reference data, so
there is no performance impact.


Hash partitioning Keyed

• Keyed partitioning method
• Rows are distributed according to values in key columns
  – Rows with the same key values go into the same partition
  – Prevents matching rows from "hiding" in other partitions
    • For example, with Join, Merge, Remove Duplicates, …
• Partition distribution is relatively equal if the data across the source key columns is evenly distributed

(Diagram: input key values …0 3 2 1 0 2 3 2 1 1 are hashed into three partitions: {0, 3, 0, 3}, {1, 1, 1}, and {2, 2, 2}.)

Figure 3-19. Hash partitioning KM4001.0

Notes:
For certain stages (Remove Duplicates, Join, Merge) to work correctly in parallel, the user
must use a keyed method such as Hash.
In this example, the numbers are values of the key column. Hash guarantees that all the rows with key value 3 end up in the same partition. Hash does not guarantee contiguity: here, the threes are bunched with the zeros, not with the neighboring value 2.
Hash may not provide an even distribution of the data. Use key columns that have enough
values to distribute data across the available partitions. For example, gender would be a
poor choice of key because all rows would flow into two partitions.
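A minimal Python sketch (illustrative only; DataStage's real hashing algorithm is internal and different) of the two properties that matter here: the key is converted to characters first, and identical keys always land in the identical partition:

    # Toy stand-in for a hash partitioner; not DataStage's algorithm.
    def hash_partition(key_value, num_partitions):
        key_chars = str(key_value)                # key values become characters first
        code = sum(ord(c) for c in key_chars)     # toy deterministic hash
        return code % num_partitions

    for key in [0, 3, 2, 1, 0, 2, 3, 2, 1, 1]:
        print(key, "->", hash_partition(key, 3))  # equal keys, equal partition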


Unequal distribution example


• Rows with the same key values are assigned to the same partition
• Hash on LName, with a 4-node configuration file:

Source data:
  ID  LName  FName    Address
  1   Ford   Henry    66 Edison Avenue
  2   Ford   Clara    66 Edison Avenue
  3   Ford   Edsel    7900 Jefferson
  4   Ford   Eleanor  7900 Jefferson
  5   Dodge  Horace   17840 Jefferson
  6   Dodge  John     75 Boston Boulevard
  7   Ford   Henry    4901 Evergreen
  8   Ford   Clara    4901 Evergreen
  9   Ford   Edsel    1100 Lakeshore
  10  Ford   Eleanor  1100 Lakeshore

Partition 0:
  ID  LName  FName   Address
  5   Dodge  Horace  17840 Jefferson
  6   Dodge  John    75 Boston Boulevard

Partition 1:
  ID  LName  FName    Address
  1   Ford   Henry    66 Edison Avenue
  2   Ford   Clara    66 Edison Avenue
  3   Ford   Edsel    7900 Jefferson
  4   Ford   Eleanor  7900 Jefferson
  7   Ford   Henry    4901 Evergreen
  8   Ford   Clara    4901 Evergreen
  9   Ford   Edsel    1100 Lakeshore
  10  Ford   Eleanor  1100 Lakeshore

The hash partitioning distribution matches the distribution of key values in the source data. Here, the number of distinct hash key values limits parallelism!

Figure 3-20. Unequal distribution example KM4001.0

Notes:
This is an example of unequal distribution of rows down the different partitions. This is
something you would want to avoid if possible. Partition 1 would take much longer to
process and so the job as a whole would take longer.


Modulus partitioning Keyed

• Keyed partitioning method
• Rows are distributed according to the values in a numeric key column
  – The modulus determines the partition:
    • partition = MOD(key_value, number of partitions)
• Faster than Hash
• Guarantees that rows with identical key values go into the same partition
• Partition size is relatively equal if the data within the key column is evenly distributed

(Diagram: input key values …0 3 2 1 0 2 3 2 1 1 are distributed by MODULUS into three partitions: {0, 3, 0, 3}, {1, 1, 1}, and {2, 2, 2}.)


Figure 3-21. Modulus partitioning KM4001.0

Notes:
Modulus is a keyed partitioning method that works like Hash, except that it can only be set
for numeric columns. Rows are distributed according to the values in a numeric key
column. Like Hash, which is slower, Modulus guarantees that rows with identical key
values go into the same partition.
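A minimal Python sketch (illustrative) of Modulus on the key values from the slide, assuming three partitions:

    # Modulus partitioning: partition = MOD(key_value, number of partitions)
    num_partitions = 3
    for key in [0, 3, 2, 1, 0, 2, 3, 2, 1, 1]:
        print(key, "->", key % num_partitions)   # 0->0, 3->0, 2->2, 1->1, ...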


Range partitioning Keyed

• Rows are distributed by range according to the values in one or more key columns
• Pre-process the data to generate a range map file
  – More expensive than Hash partitioning
  – Must read the entire data twice to guarantee results
• Guarantees that rows with identical values in key columns end up in the same partition
• Rows outside the map go into the first or last partition
• Limited use: only useful in cases where the incoming data distribution is consistent over time

(Diagram: input key values 4 0 5 1 6 0 5 4 3 are distributed by RANGE, using a range map file, into partitions {0, 1, 0}, {4, 4, 3}, and {5, 6, 5}.)

Figure 3-22. Range partitioning KM4001.0

Notes:
Range partitioning is a keyed method. Rows are distributed by range according to the
values in one or more key columns. The partitioning is based on a range map.
If the source data distribution is consistent over time, it may be possible to re-use the
range map file and thereby avoid the time it takes to pre-process the data.
Note that at runtime, values that are outside of a given range map will land in the first or
last partition as appropriate.


Using Range partitioning Keyed

• Create a range map file using the Write Range Map stage
  – Save the job as a new job
  – Replace the target stage with a Write Range Map stage
• Reference this range map file when specifying Range partitioning
• Note that range map files are specific to a given configuration file


Figure 3-23. Using Range partitioning KM4001.0

Notes:
In general, it is best not to use Range partitioning, because it requires two passes over the data to guarantee good results: one pass to create the range map, and another pass to run the job using the Range partitioning method.


Example partitioning icons

(Screenshot callouts: "fan-out" icon, sequential to parallel; Same partitioner; repartitioning icon, watch for this!; Auto partitioner.)


Figure 3-24. Example partitioning icons KM4001.0

Notes:
Reading link markings:
  S ----------------- S   no marking
  S ---(fan-out)----- P   partitioner
  P ---(fan-in)------ S   collector
  P ---(box)--------- P   no reshuffling: partitioner using the Same method
  P ---(butterfly)--- P   reshuffling: partitioner using another method


Auto partitioning
• DataStage inserts partitioners as necessary to ensure
correct results
– Generally chooses Round Robin or Same
– Inserts Hash for stages that require matched key values
(Join, Merge, Remove Duplicates)
– Inserts Entire on Lookup reference links
• Since DataStage has limited awareness of your data and
business rules, best practice is to explicitly specify Hash
partitioning when needed
– DataStage has no visibility into Transformer logic
– Hash is required before Sort and Aggregator stages
– DataStage sometimes inserts unnecessary partitioners
• Check the Score


Figure 3-25. Auto partitioning KM4001.0

Notes:
When Auto is chosen, DataStage inserts partitioners as necessary to ensure correct
results. Auto generally chooses Round Robin when going from sequential to parallel. It
generally chooses Same when going from parallel to parallel.
Since DataStage has limited awareness of your data and business rules, best practice is to
explicitly specify Hash partitioning when needed, that is, when processing requires groups
of related records.


Preserve partitioning flag


• The Preserve Partitioning flag is used in a stage before stages that use Auto
  – The flag has 3 possible settings:
    • Set: downstream stages are to attempt to retain the partitioning and sort order
    • Clear: downstream stages need not retain the partitioning and sort order
    • Propagate: tries to pass the flag setting from input to output links
  – Set automatically by some operators (Sort, Hash partitioning)
  – Can be manually set (Stage>Advanced tab)
  – Functionally equivalent to explicitly specifying Same partitioning
    • But allows DataStage to override and optimize for performance
• The Preserve Partitioning setting is part of data set metadata
• Log warnings are issued when the Preserve Partitioning flag is set but downstream operators cannot use the same partitioning

Figure 3-26. Preserve partitioning flag KM4001.0

Notes:
The Preserve Partitioning flag is used in a stage before stages that use Auto. It has three possible settings, but most often the default is used. Sometimes you may want to choose Set; in that case, downstream stages are to attempt to retain the partitioning and sort order.

Partitioning strategy
• Use Hash when stage requires grouping of related values
• Use Modulus if group key is a single integer column
– Better performance than Hash
• Range may be appropriate in cases where data distribution
is uneven but consistent over time
• Know your data!
– How many unique values in the Hash key columns?
• If grouping is not required, use Round Robin
– Little overhead
• Try to optimize partitioning based on the entire job flow


Figure 3-27. Partitioning strategy KM4001.0

Notes:
This slide lists some best practices for setting stage partitioning.


Partitioning strategy, continued


• Minimize the number of repartitions within and across job flows
• Within a job flow:
  – Examine upstream partitioning and sort order, and attempt to preserve them for downstream stages using Same partitioning
• Across jobs:
  – Use data sets to retain partitioning


Figure 3-28. Partitioning strategy, continued KM4001.0

Notes:
This slide continues the list of best practices for setting stage partitioning.


Collecting Data

(Diagram: a collector feeding a stage running sequentially.)


Figure 3-29. Collecting Data KM4001.0

Notes:


Collectors
• Collectors combine partitions into a single input stream going to a sequential stage

(Diagram: multiple data partitions (NOT links) pass through a collector into a sequential stage.)


Figure 3-30. Collectors KM4001.0

Notes:
Collector methods combine partitions into a single input stream going to a sequential stage. This slide illustrates the process: at the top, multiple data partitions are reduced to one.


Specifying the collector method


• The collector method is defined on the Input>Partitioning tab
  – The stage must be running sequentially
  – The previous stage must have been running in parallel

(Screenshot callouts: a stage running in parallel feeding a stage running sequentially; the collector icon; note the word "Collector".)


Figure 3-31. Specifying the collector method KM4001.0

Notes:
Collector method is defined on the Input>Partitioning tab just as for the partitioning. The
word “Collector” indicates that we are selecting a collector method as opposed to a
partitioning method.


Collector methods
– Auto
• Read in the first row that shows up from any partition
• Output row order is undefined (non-deterministic)
• Default collector method
– Round Robin
• Pick row from input partitions in round robin order
• Slower than Auto, rarely used
– Ordered
• Read all rows from first partition, then second, and so on
• Preserves order of rows that exists within each partition
– Sort Merge
• Produces a single stream of rows sorted on specified key
columns from input sorted on those keys
• Row order is not preserved for non-key columns


Figure 3-32. Collector methods KM4001.0

Notes:
This slide lists the collector methods available.
Auto (the default) reads rows from partitions as soon as they arrive. This can yield different
row orders in different runs with identical data (non-deterministic execution). Round Robin
picks the first row from partition 0, the next from partition 1, even if other partitions can
produce rows faster than partition 1.
Ordered is the “great American novel” collector. Assume you just finished writing the great
American novel. You use DataStage to spell check in parallel. Partition 0 holds chapter
one, partition 1 holds chapter 2, and so on. You need a collector before sending your opus
to the printer. The default collector (Auto) will print lines in random-looking order. Round
Robin will print the first line from partition 0, the next from partition 1, and so on. What you
need is the Ordered collector: it will first read all lines from partition 0, then from partition 1,
and so on.
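A minimal Python sketch (illustrative only) contrasting the Ordered and Round Robin collectors on the novel example; Auto, by contrast, would take rows in whatever order the partitions happen to produce them:

    import itertools

    # Three partitions holding chapters 1-3, line by line (illustrative data)
    parts = [["ch1 line1", "ch1 line2"], ["ch2 line1"], ["ch3 line1"]]

    # Ordered: all of partition 0, then partition 1, then partition 2
    ordered = list(itertools.chain(*parts))

    # Round Robin: one row from each partition in turn
    round_robin = [line for group in itertools.zip_longest(*parts)
                   for line in group if line is not None]

    print(ordered)      # chapter order preserved
    print(round_robin)  # lines interleaved across chapters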


Sort Merge example


(Diagram: partition 0 holds the sorted input values 0 0 1 2 2 and partition 1 holds the sorted input values 1 3 3 5 5; the Sort Merge collector combines them into the globally sorted stream 0 0 1 1 2 2 3 3 5 5.)

Figure 3-33. Sort Merge example KM4001.0

Notes:
Sort Merge produces a (globally) sorted sequential stream from within partition sorted
rows. Let us look how it works on a two-node example. Rows have one column, an integer,
and it is the key column. Assume that the rows are already sorted within each of the two
partitions. Sort Merge produces a sorted sequential stream using the following algorithm:
always pick the next row from the partition that produces the smallest key value.
This produces the desired ordered sequence: 0011223355, regardless of the original
partitioning as long as the input data partitions are sorted by key.
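The same algorithm is what a standard heap-based merge implements; a minimal Python sketch (illustrative only):

    import heapq

    # Two already-sorted partitions from the example
    partition0 = [0, 0, 1, 2, 2]
    partition1 = [1, 3, 3, 5, 5]

    # Repeatedly take the next row from whichever partition offers the
    # smallest key value.
    print(list(heapq.merge(partition0, partition1)))
    # [0, 0, 1, 1, 2, 2, 3, 3, 5, 5]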


Non-deterministic execution
• The collector may yield non-deterministic results in case of ties

    partition_0     partition_1
    -----------     -----------
    2,"K"           5,"j"
    2,"a"           5,"p"
    1,"x"           3,"x"
    0,"p"           3,"y"
    0,"a"           1,"y"
• The third row can equally be (1,"x") or (1,"y") because there is
a tie (same key value 1) between partitions
• Can be avoided by Hash partitioning on the key values
– Then key 1 could not exist in both partitions


Figure 3-34. Non-deterministic execution KM4001.0

Notes:
The Sort Merge collector can yield non-deterministic results in some cases, for example,
when there are rows in two or more partitions that fit the sort sequence. In this example, the
third row can equally be 1,”x” or 1,”y” because there is a tie (same key value 1) in both
partitions. Which one is chosen depends on the relative speed in which these partitions are
processing rows.


Choosing a collector method


• Generally Auto is the fastest and most efficient method of collection
• To generate a single stream of sorted data, use the Sort Merge
collector
– Input data must be sorted on these keys
– Sort Merge does not perform a sort
• It assumes the data has already been sorted
• Ordered is only appropriate in special cases
• Round robin collector can sometimes be used to reconstruct the
original (sequential) row ordering for Round Robin partitioned inputs
– Intermediate processing must not have altered row order or
reduced the number of rows
– Rarely used


Figure 3-35. Choosing a collector method KM4001.0

Notes:
This slide describes some best practices for choosing a collector method. Generally Auto
is the fastest and most efficient method of collection. When you need sorted data, select
Sort Merge.


Collector method versus Funnel stage

Do not confuse a collector with a Funnel stage!

• Collector
  – Operates on a single, partitioned link
  – Consolidates partitions as the input to a sequential stage
  – Always identified by a "fan-in" link icon
• Funnel stage
  – A stage that runs in parallel
  – Merges data from multiple links
  – Table definitions (schemas) of all links must match


Figure 3-36. Collector method versus Funnel stage KM4001.0

Notes:
Sometimes links are confused with partitions, so collectors seem like Funnel stages. But remember that a single link can (and most often does) contain multiple partitions. They are not the same thing.


Parallel Job Design Examples


Figure 3-37. Parallel Job Design Examples KM4001.0

Notes:


Parallel number sequences


• Examples: counters, surrogate keys
• In the partitioned world, each operation is performed on each partition
  – A counter in a Transformer counts the number of rows in the partition (not globally)
  – Each partition comes up with a separate count
  – @INROWNUM operates at the partition level
• Variables to facilitate parallel calculations:
  – Row / Column Generator stages
    • part: partition number
    • partcount: total number of partitions
  – Transformer
    • @PARTITIONNUM: partition number
    • @NUMPARTITIONS: total number of partitions
• Surrogate Key Generator stage
  – Generates a unique sequence of integers across jobs


Figure 3-38. Parallel number sequences KM4001.0

Notes:
In the parallel world, creating unique sequences of numbers is complicated. Each operation is performed on each partition, so a counter in a Transformer counts the number of rows in the partition (not globally), and each partition creates a duplicate list.
But there are some system variables that can be used to generate unique sequences.
The Surrogate Key Generator stage is discussed in a later unit.


Row Generator sequences of numbers

(Screenshot callouts: set Initial value to part (the partition number); set Increment to partcount (the number of partitions).)


Figure 3-39. Row Generator sequences of numbers KM4001.0

Notes:
In the Row Generator stage you can create a unique sequence of numbers in a particular
column by setting the properties shown here. Set Initial value to part (partition number).
Set Increment to partcount (number of partitions).


Generated numbers

(Screenshots: left, a Row Generator running in parallel with initial=1, increment=1; right, a Row Generator running in parallel with initial value=part, increment=partcount.)


Figure 3-40. Generated numbers KM4001.0

Notes:
This shows the results of setting the Row Generator properties as shown in the previous
slide. The number of nodes in this example equals 2.


Transformer example using @INROWNUM

(Screenshot callouts: a counter using @INROWNUM, with 2 nodes; the Counter is mapped to the RowCount target column; the RowCount target column output.)


Figure 3-41. Transformer example using @INROWNUM KM4001.0

Notes:
This slide shows the results of a Transformer using @INROWNUM. Assume that there are 4
partitions. @INROWNUM will contain the number of the row going through the partition. Each
partition repeats the same sequence of integers.


Transformer example using parallel variables

(Screenshot callouts: a counter derivation using the parallel variables together with @INROWNUM, with 2 nodes; the Counter is mapped to the RowCount target column; the RowCount target column output.)


Figure 3-42. Transformer example using parallel variables KM4001.0

Notes:
This slide shows how to use the system variables along with @INROWNUM to generate a unique sequence of integers.
Assume that there are 4 partitions. @INROWNUM contains the number of the row going through the partition, starting at 1. The formula @PARTITIONNUM + (@NUMPARTITIONS * (@INROWNUM - 1)) yields the following sequence of integers for the rows going down partition 0: 0, 4, 8, … For partition 1, the series is: 1, 5, 9, … For partition 2, the series is: 2, 6, 10, … For partition 3, the series is: 3, 7, 11, …


Header and detail processing


• The source file contains both header and detail records
  – The second column is the record type
    • A = Header
    • B = Trailer
  – The first column is the order number
• Task: assign header information to detail records
  – Resulting detail records contain the name and date from the header record

(Diagram: the source file and the resulting target file.)


Figure 3-43. Header and detail processing KM4001.0

Notes:
It is sometimes necessary to process files that have a header and detail format. The header row contains information that applies to all the detail rows that follow (up to the next header row). These two types of rows have different formats, so their individual columns cannot be specified on the Columns tab.


Job design

(Job design callouts, left to right: a variable-format data file, read in as a single field; the records are split into header and detail streams and the individual fields are parsed out; the records are combined by order number.)


Figure 3-44. Job design KM4001.0

Notes:
Here is a job that can be used to process a header detail file. The source file is a variable
format data file. The trick is to read the rows in as single fields. Then the individual fields
can be parsed out in the Transformer stage.


Inside the Transformer


(Screenshot callouts: a constraint limits one output to header records; the Field function parses out fields: Field(string, delimiter, num); a string field is converted to an integer.)


Figure 3-45. Inside the Transformer KM4001.0

Notes:
This shows how the Field function in a Transformer can be used to parse columns. The
Column Import stage can also be used.
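For example (using an illustrative record, not the lab data), if a comma-delimited input field named lnk.rec contains the string 1000,A,Jones,2011-07-01, then Field(lnk.rec, ',', 3) returns Jones and Field(lnk.rec, ',', 1) returns 1000; the occurrence argument is 1-based. The link and column names here are hypothetical.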


Examining the Score

(Screenshot callouts: inserted Hash partitioners on both input links to the Join.)


Figure 3-46. Examining the Score KM4001.0

Notes:
This slide shows the Score for the job. Notice the inserted Hash partitioners.


Difficulties with the design

• Going into the Join stage, Hash operators on the Join key (OrderNum) are inserted (by default)
  – Each group of header/detail records will be hashed into the same partition
  – Each group of records will run sequentially
    • Essentially, the whole job runs sequentially
  – Solution:
    • Select the Entire partitioning algorithm for Header into the Join
    • Select the Same partitioning algorithm for Detail into the Join
    • Not all Detail records are in the same partition, but in every partition they are in, there is a Header


Figure 3-47. Difficulties with the design KM4001.0

Notes:
This job design has some performance issues. Because of the parallelism that occurs in DataStage, this is not a particularly easy task to accomplish. The header goes down only one partition, but we actually need it in all partitions. Two solutions suggest themselves: join the streams together by hashing on the key (the problem with this approach is that the join hashes on a single value and, in essence, runs in sequential mode), or copy the header information to all partitions so that the join runs in parallel.


Examining the Score

(Screenshot callouts: no Hash partitioners; one input link now uses Same, the other Entire.)


Figure 3-48. Examining the Score KM4001.0

Notes:
Notice that in the revised job design, there are different partitioners.


Generating a header detail data file

[Slide graphic callouts: Export multiple columns to a single column; Combine multiple data streams into a single stream]

Figure 3-49. Generating a header detail data file

Notes:
You may be interested in knowing how to create a header detail file. This example shows
one way. The Column Export stages are used to put the header and detail records, which
have different formats, into a single format. This is necessary in order to use the Funnel
stage to merge these records together.


Inside the Column Export stage

[Slide graphic callouts: Columns to export; Single output column]

Figure 3-50. Inside the Column Export stage

Notes:
This slide shows the inside of the Column Export stage. The Explicit column method has
been chosen and the individual columns in the input link are explicitly listed. These are
combined into a single column of output named Header. This can be funneled together with
the individual column of output from the detail link.


Inside the Funnel stage

[Slide graphic callouts: All input links must have the same column metadata; Single output stream]

Figure 3-51. Inside the Funnel stage

Notes:
This slide shows the inside of the Funnel stage. All input links must have the same column
metadata.


Checkpoint
1. What two sections does a Score contain?
2. How does Modulus partitioning differ from Hash partitioning?
3. What collection method can be used to collect sorted rows in
multiple partitions into a single sorted partition?

Figure 3-52. Checkpoint

Notes:
Write your answers here:


Exercise 3 – Read data with multiple record formats


• In this lab exercise, you will:
– Build a job that generates a header,
detail data file
– Build a job that processes a header,
detail data file

Figure 3-53. Exercise 3 - Read data with multiple record formats

Notes:


Unit summary
Having completed this unit, you should be able to:
• Understand how partitioning works in the Framework
• View collectors and partitioners in the Score
• Select collecting and partitioning algorithms
• Generate sequences of numbers (surrogate keys) in a
partitioned, parallel environment

Figure 3-54. Unit summary

Notes:


Unit 4. Sorting Data

What this unit is about


This unit describes how sorting is implemented in parallel jobs. It also
talks about optimizing DataStage jobs by reducing the number of
sorts.

What you should be able to do


After completing this unit, you should be able to:
• Sort data in the parallel framework
• Find inserted sorts in the Score
• Reduce the number of inserted sorts
• Optimize Fork-Join jobs
• Use Sort stages to determine the last row in a group

How you will check your progress


• Lab exercises and checkpoint questions.


Unit objectives
After completing this unit, you should be able to:
• Sort data in the parallel framework
• Find inserted sorts in the Score
• Reduce the number of inserted sorts
• Optimize Fork-Join parallel jobs

Figure 4-1. Unit objectives

Notes:


Traditional (sequential) sort


• Traditionally, the process of sorting data uses one primary key
column and (optionally) multiple secondary key columns to generate
a sequential, ordered result set
– Order of key columns determines sequence (and groupings)
– Each key column specifies an ascending or descending sort group

Sorted Result
Source Data

ID LName FName Address ID LName FName Address


1 Ford Henry 66 Edison Avenue 6 Dodge John 75 Boston Boulevard

2 Ford Clara 66 Edison Avenue 5 Dodge Horace 17840 Jefferson

3 Ford Edsel 7900 Jefferson


Sort 1 Ford Henry 66 Edison Avenue

4 Ford Eleanor 7900 Jefferson


on: 7 Ford Henry 4901 Evergreen

5 Dodge Horace 17840 Jefferson 4 Ford Eleanor 7900 Jefferson

6 Dodge John 75 Boston Boulevard


Lname 10 Ford Eleanor 1100 Lakeshore

7 Ford Henry 4901 Evergreen


(asc),
3 Ford Edsel 7900 Jefferson

8 Ford Clara 4901 Evergreen


FName 9 Ford Edsel 1100 Lakeshore

9 Ford Edsel 1100 Lakeshore


(desc) 2 Ford Clara 66 Edison Avenue

10 Ford Eleanor 1100 Lakeshore


8 Ford Clara 4901 Evergreen

Figure 4-2. Traditional (sequential) sort

Notes:
This slide discusses the traditional (sequential) sort. This process of sorting data uses one
primary key column and (optionally) multiple secondary key columns to generate a
sequential, ordered result set. This is the method that SQL uses in SQL statements with an
ORDER BY clause. This will be contrasted with parallel sort described on the next slide.
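In SQL terms (an illustrative statement; the table name Customers is hypothetical), the sorted result above corresponds to:

   SELECT ID, LName, FName, Address
   FROM   Customers
   ORDER  BY LName ASC, FName DESC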


Parallel sort
• In many cases, there is no need to globally sort data
– Sorting is most often needed to establish order within specified groups of
data
• For example: Join, Merge, Aggregator, Remove Duplicates
• This sort can be done in parallel!
• Hash partitioning can be used to gather related rows into single
partitions
– Assigns rows with the same key column values to the same
partition
– Sorting is used to establish grouping and order within each partition based on
key columns
• Rows with the same key values are grouped together within the
partition
• Hash and Sort keys need not totally match
– Often the case before Remove Duplicates stage
• Hash ensures that all duplicates are in the same partition
• Sort groups the rows and then establishes an ordering within each
group, for example, by latest date
Figure 4-3. Parallel sort

Notes:
This slide discusses the parallel sort. In many cases, there is no need to globally sort data.
In these cases a parallel sort can be used and this will be much faster. If a global sort is
needed Sort Merge can be used to accomplish this after the parallel sort.


Example parallel sort

• Using the same source data, Hash partition by LName, FName
• Within each partition, sort using LName, FName

(Each row is ID, LName, FName, Address.)

Part 0 input:  2 Ford Clara 66 Edison Avenue; 8 Ford Clara 4901 Evergreen
Part 0 sorted: 2 Ford Clara 66 Edison Avenue; 8 Ford Clara 4901 Evergreen

Part 1 input:  3 Ford Edsel 7900 Jefferson; 5 Dodge Horace 17840 Jefferson;
               9 Ford Edsel 1100 Lakeshore
Part 1 sorted: 5 Dodge Horace 17840 Jefferson; 3 Ford Edsel 7900 Jefferson;
               9 Ford Edsel 1100 Lakeshore

Part 2 input:  4 Ford Eleanor 7900 Jefferson; 6 Dodge John 75 Boston Boulevard;
               10 Ford Eleanor 1100 Lakeshore
Part 2 sorted: 6 Dodge John 75 Boston Boulevard; 4 Ford Eleanor 7900 Jefferson;
               10 Ford Eleanor 1100 Lakeshore

Part 3 input:  1 Ford Henry 66 Edison Avenue; 7 Ford Henry 4901 Evergreen
Part 3 sorted: 1 Ford Henry 66 Edison Avenue; 7 Ford Henry 4901 Evergreen

Figure 4-4. Example parallel sort

Notes:
This illustrates a parallel sort. Each partition sorts the data within it separately from the
others.


Stages that require sorted data


• Stages that process groups of data
– Aggregator
– Remove Duplicates
– Transformer, Build stages with group processing logic
• Stages that can minimize memory usage by requiring
the data to be sorted
– Join
– Merge
– Aggregator (using Sort method, rather than Hash method)

Figure 4-5. Stages that require sorted data

Notes:
There are a number of stages that require sorted data. This includes stages that process
groups of data, and stages that can minimize memory usage by requiring the data to be
sorted.


Parallel sorting methods

• Two methods for parallel sorting:
  – Sort stage
    • In parallel execution mode (default)
  – In-stage sorts
    • Partitioning cannot be Auto
    • Input links with in-stage sorts will have a Sort icon
• Both methods generate the same internal tsort operator in the OSH

Figure 4-6. Parallel sorting methods

Notes:
There are two parallel sorting methods available in DataStage: The Sort stage, and
in-stage sorts. Internally they both use the same tsort operator, so there is no difference in
terms of performance.
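For reference, a minimal osh fragment of the kind both methods generate might look like the following (a sketch only; the key names are made up):

   tsort -key LastName -key FirstName

Whether the sort is defined in a Sort stage or on an input link, the generated operator is a tsort with the specified keys.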


In-Stage sorting

[Slide graphic callouts: Right-click to specify sort options; Requires
partitioner other than Auto; Key can be used for sort or partitioning or both]

- Easier job maintenance (fewer stages on job canvas)
- But fewer options (tuning, features)

Figure 4-7. In-Stage sorting

Notes:
This shows how to define an in-stage sort. It requires a partitioning method other than
Auto, as illustrated. The same key can be specified for sorting, partitioning, or both sorting
and partitioning.


Sort stage
Offers more options than an in-stage sort

DataStage Sort Utility is recommended

Figure 4-8. Sort stage

Notes:
This shows the Sort stage. The Sort stage offers more options than an in-stage sort. The
default sort utility is DataStage, which is recommended.


Stable sorts
• Preserves the order of non-key columns within each sort group
• Slower than non-stable sorts
  – Use only when needed
  – Enabled by default in Sort stage
  – Not enabled by default for in-stage sorts

Figure 4-9. Stable sorts

Notes:
Both the Sort stage and in-stage sorts offer stable sorts. A stable sort preserves the order
of non-key columns within each sort group. This is necessary for some business purposes,
but stable sorts are slower than non-stable sorts. Use only when needed. It is enabled by
default in the Sort stage, so be sure to disable this if it is not needed.


Resorting on sub-groups
• Use Sort Key Mode property to re-use key column
groupings from previous sorts
– Uses significantly less memory and disk
• Sorts within previously sorted groups, not the total data set
• Outputs rows after each group, not the total data set
• Key column order is important
– Must be consistent across stages
• Be sure to retain incoming sort order and partitioning (Same)

Figure 4-10. Resorting on sub-groups

Notes:
A major property that the Sort stage has that is not available for in-stage sorts is the Sort
Key Mode property. Use the Sort Key Mode property to re-use key column groupings from
previous sorts. This uses significantly less memory and disk and improves performance.
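A sketch of how the keys might be configured in a downstream Sort stage (the column names are hypothetical):

   Key = CustID      Sort Key Mode = Don't Sort (Previously Sorted)
   Key = OrderDate   Sort Key Mode = Sort

With the stage's input partitioning set to Same, only OrderDate is actually sorted, within each existing CustID group, so rows can be output group by group instead of after the whole data set.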


Don’t sort (previously grouped)


• What’s the difference between Don’t Sort (Previously sorted)
and Don’t Sort (Previously grouped)?
• When rows were previously grouped by a key, all the rows with
the same key value are grouped together
– But the groups of rows are not necessarily in sort order
• When rows are previously sorted by a key, all the rows are
grouped together and, moreover, the groups are in sort order
• In either case the Sort stage can be used to sort by a sub-key
within each group

Sorted by col1:    Grouped by col1:
  1,c                2,r
  1,b                2,a
  2,r                3,a
  2,a                1,c
  3,a                1,b

Figure 4-11. Don’t sort (previously grouped)

Notes:
The Sort Key Mode offers two options: Don’t Sort (Previously sorted) and Don’t Sort
(Previously grouped). This slide discusses the difference. When rows were previously
grouped by a key, all the rows with the same key value are grouped together. But the
groups of rows are not necessarily in sort order.


Partitioning and sort order


• Sort order is not preserved after re-partitioning
  – To restore row order and grouping, a new sort is required
• In the example, the data is no longer in sort order
  – 101 has moved to the left partition
  – 1 has moved to the right partition

Before (two sorted partitions):      After re-partitioning:
  1    101                             2    1
  2    102                             101  102
  3    103                             3    103

Figure 4-12. Partitioning and sort order

Notes:
Try to avoid repartitioning after a sort because this destroys the sort order. In that case, you
will not be able to reuse that sort order downstream.


Global sorting methods

• Two methods for generating a global sort:
  – Sort stage operating in Sequential mode
  – Sort Merge collecting method
    • Requires sorted input partitions

Figure 4-13. Global sorting methods

Notes:
There are two ways you can create a global sort: Operate the Sort stage in sequential
mode, or use parallel sort followed by Sort Merge. In general, parallel sort along with the
Sort Merge collector will be much faster than a sequential sort unless data is already
sequential.
Database systems sort in a similar parallel way to achieve adequate performance.


Inserted tsorts
• By default, tsort operators are inserted into the Score as necessary
  – Before any stage that requires matched key values (Join, Merge, RemDups)
• Only inserted if the user has not explicitly defined the sort
  – Explicitly defined sorts show up as Sort operators qualified with the
    name of the stage
• Check the Score for inserted tsort operators
  – May be unnecessary

Score showing an inserted tsort operator:
  op1[4p] {(parallel inserted tsort operator
    {key={value=LastName},
     key={value=FirstName}}(0))
    on nodes (
      node1[op2,p0]
      node2[op2,p1]
      node3[op2,p2]
      node4[op2,p3]
  )}

Figure 4-14. Inserted tsorts

Notes:
By default, tsort operators are inserted into the Score as necessary. By default they will be
inserted before any stage that requires matched key values (Join, Merge, RemDups). They
are only inserted if the user has not explicitly defined the sort. Explicitly defined sorts show
up as Sort operators qualified with the name of the stage.


Changing inserted tsort behavior


• Set $APT_SORT_INSERTION_CHECK_ONLY or $APT_NO_SORT_INSERTION to change
  the behavior of automatically inserted sorts
• Set $APT_SORT_INSERTION_CHECK_ONLY
  – The inserted sort operators only VERIFY that the data is sorted
  – If data is not sorted properly at runtime, the job aborts
• Set $APT_NO_SORT_INSERTION to stop inserted sorts entirely

Score when $APT_SORT_INSERTION_CHECK_ONLY is turned on. Note "subArgs = {sorted}":
  op1[4p] {(parallel inserted tsort operator {key={value=LastName,
    subArgs={sorted}},
    key={value=FirstName},
    subArgs={sorted}}(0))
    on nodes (
      node1[op2,p0]
      node2[op2,p1]
      node3[op2,p2]
      node4[op2,p3]
  )}
Figure 4-15. Changing inserted tsort behavior

Notes:
You can use the $APT_SORT_INSERTION_CHECK_ONLY and
$APT_NO_SORT_INSERTION environment variables to change behavior of automatically
inserted sorts. When $APT_NO_SORT_INSERTION is turned on, tsort operators are not
inserted even for checking.
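For example (a sketch; these can be defined at the project level in the DataStage Administrator or added as job parameters):

   $APT_SORT_INSERTION_CHECK_ONLY = 1   (inserted tsorts only verify order; job aborts if unsorted)
   $APT_NO_SORT_INSERTION = 1           (no tsort operators are inserted at all)

Defining the variable is what enables the behavior; the exact value is typically not significant.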


Sort resource usage


• By default, Sort uses 20MB per partition as an internal memory
buffer
– Applies to both job defined sorts and inserted tsorts
– Change the default using the Restrict Memory Usage option
• Increasing this value can improve performance
– Especially if the entire (or group) data partition can fit into memory
• Decreasing this value may hurt performance
• This option is unavailable for in-stage sorts
– To change the amount of memory used by all tsort operators set:
• $APT_TSORT_STRESS_BLOCKSIZE = [mb]
– This overrides the per-stage memory settings
• When the memory buffer is filled, sort uses temporary disk space in
the following order:
• Scratch disks in the $APT_CONFIG_FILE “sort” named disk pool
• Scratch disks in the $APT_CONFIG_FILE default disk pool
• The default directory specified by $TMPDIR
• The UNIX /tmp directory

Figure 4-16. Sort resource usage

Notes:
By default, Sort uses 20MB per partition as an internal memory buffer. You can change the
default using the “Restrict Memory Usage” option; increasing it may improve performance.
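A configuration file fragment that defines a named "sort" scratch disk pool might look like this (the node names and paths are examples only):

   {
     node "node1"
     {
       fastname "etlserver"
       pools ""
       resource disk "/opt/ds/data" { pools "" }
       resource scratchdisk "/scratch/sort" { pools "sort" }
       resource scratchdisk "/scratch" { pools "" }
     }
   }

With this file, sort overflow is written to /scratch/sort first, and only then to the default scratch disk /scratch.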


Partition and sort keys

• Note that partition and sort keys do not always have to be the same
  – Partitioning assigns related records
  – Sorting establishes group order
• Example: Remove Duplicates
  – Partition on SSN, FName, LName
  – Sort on SSN, FName, LName, Order Date
  – Remove Duplicates on SSN, FName, LName, Order Date

Figure 4-17. Partition and sort keys

Notes:
There is a difference between partition and sort keys, and they do not have to be the same.
Partitioning assigns related records. Sorting establishes group order.


Optimizing job performance

• Minimize number of sorts within a job flow
  – Each sort interrupts the parallel pipeline
    • Must read all rows in the partition before generating output
• Specify only necessary key columns
• Avoid stable sorts unless needed
• Re-use previous sort keys
  – Use “Sort Key Usage” key column option
• Within Sort stage, try adjusting “Restrict Memory Usage” to see if more
  memory will help

Figure 4-18. Optimizing job performance

Notes:
This slide lists some best practices in using the Sort stage. A basic principle of optimization
is to minimize the number of sorts within a job flow. You can do this by defining the sort as
far as possible upstream and reusing the sort as far as possible downstream.


Job Design Examples

Figure 4-19. Job Design Examples

Notes:


Fork join job example


• Task: Assign a group summary value to each row
  – For each customer, count the number of customers in the same Zip
  – Add this count to each customer record

[Slide graphic callouts: Customer records; Add column with zip count]

Figure 4-20. Fork join job example

Notes:
A fork join is one important job design you should be aware of. The data stream is split into
two output streams and then joined back. In this example, the data is split so that an
aggregation can be performed which is then joined back to each row.


Fork join job design

[Slide graphic callouts: Fork; Join by zip; Group by zip, count records in the group]

Figure 4-21. Fork join job design

Notes:
This slide shows the job design. The Copy stage is used to fork the data to the Aggregator
and Join. The Join stage merges the aggregation result back to the main stream.


Examining the Score

• Note inserted Hash partitioners
• Note inserted tsort operators

Figure 4-22. Examining the Score

Notes:
This shows the Score for the fork join job. Notice that hash partitioners are inserted, even
though they were not explicitly specified in the job. Similarly tsort operators have been
inserted.


Difficulties with the design

Here sorts are explicitly specified

• Under the covers, DataStage inserts by default Hash partitioners and tsort
  operators before the Aggregator and Join stages
  – Default when Auto is chosen
• This design is not optimized!
  – Need to minimize sorts and hashing
Figure 4-23. Difficulties with the design

Notes:
The sort icons have been added to show what is going on under the covers, even when no
sorts are defined. Sorting is occurring before the Aggregator stage and on both input links
to the Join stage. This job can be optimized to remove so many sorts.


Optimized solution

[Slide graphic callouts: Move sort upstream; Add Same partitioners]

• Explicitly set a Sort by Zip before the Copy stage
• Explicitly specify Same as the partitioner for the Aggregator and Join stages
• Notice that the data is repartitioned and sorted once, instead of three times

Figure 4-24. Optimized solution

Notes:
To optimize the job, the sort has been moved upstream before the Copy stage. Same
partitioners have been specified to avoid repartitioning which destroys the sort.
In earlier versions of DataStage, tsort operators are inserted and perform sorts unless the
$APT_SORT_INSERTION_CHECK_ONLY environment variable is set. In the latest
versions of DataStage, it is not necessary to set this variable, because tsort operators will
not be inserted, as shown on the next slide.


Score of optimized Job

• Notice there are no inserted tsorts

Figure 4-25. Score of optimized Job

Notes:
Notice that in the Score for the optimized job there are no inserted tsort operators.


Checkpoint
1. Name two stages that require the data to be sorted.
2. What are the advantages of using a Sort stage in a job
design rather than an in-stage sort?

Figure 4-26. Checkpoint

Notes:
Write your answers here:


Exercise 4 – Optimize a fork join job


• In this lab exercise, you will:
– Optimize a fork-join job by moving sorts
and partitioning upstream
– Optimize a fork-join job by moving sorts
and partitioning into the source data set

Figure 4-27. Exercise 4 - Optimize a fork join job

Notes:


Unit summary
Having completed this unit, you should be able to:
• Sort data in the parallel framework
• Find inserted sorts in the Score
• Reduce the number of inserted sorts
• Optimize Fork-Join jobs

Figure 4-28. Unit summary

Notes:



Unit 5. Buffering in Parallel Jobs

What this unit is about


This unit describes how buffering works within DataStage jobs.

What you should be able to do


After completing this unit, you should be able to:
• Describe how buffering works in parallel jobs
• Tune buffers in parallel jobs
• Avoid buffer contentions

How you will check your progress


• Lab exercises and checkpoint questions.


Unit objectives
After completing this unit, you should be able to:
• Describe how buffering works in parallel jobs
• Tune buffers in parallel jobs
• Avoid buffer contentions

Figure 5-1. Unit objectives

Notes:


Introducing the buffer operator


• In the Score, buffer operators are inserted to prevent deadlocks and to
  optimize performance
  – Provide resistance for incoming rows
  – In fork join jobs, buffer operators are inserted on all inputs to the
    downstream join operator
  – Buffer operators may also be inserted in an attempt to match producer
    and consumer rates
• Data is never repartitioned across a buffer operator
  – First-in, first-out row processing
• Some stages (Sort, Aggregator in Hash mode) internally buffer the entire
  dataset before outputting a row
  – Buffer operators are never inserted after these stages

[Slide diagram: Stage 1 forks into two links, each passing through an
inserted Buffer into Stage 2]
Figure 5-2. Introducing the buffer operator

Notes:
In the Score, buffer operators are inserted to prevent deadlocks and to optimize
performance. Buffers provide resistance for incoming rows so that operators are not
overwhelmed with incoming rows.


Identifying buffer operators in the Score


• In the Score, buffer It has 6 operators:
op0[1p] {(sequential Row_Generator_0)
operators are located in the on nodes (
ecc3671[op0,p0]

operators section )}
op1[1p] {(sequential Row_Generator_1)
on nodes (
ecc3672[op1,p0]
)}
op2[1p] {(parallel APT_LUTCreateImpl in Lookup_3)
on nodes (
ecc3671[op2,p0]
)}
op3[4p] {(parallel buffer(0))
on nodes (
ecc3671[op3,p0]
ecc3672[op3,p1]
ecc3673[op3,p2]
ecc3674[op3,p3]
)}
op4[4p] {(parallel APT_CombinedOperatorController:
(APT_LUTProcessImpl in Lookup_3)
(APT_TransformOperatorImplV0S7_cpLookupTest1_Transformer_7 in
Transformer_7)

Inserted buffer (PeekNull)


) on nodes (
ecc3671[op4,p0]
operator ecc3672[op4,p1]
ecc3673[op4,p2]
ecc3674[op4,p3]
)}
op5[1p] {(sequential APT_RealFileExportOperator in Sequential_File_12)
on nodes (
ecc3672[op5,p0]
)}
It runs 12 processes on 4 nodes.

Figure 5-3. Identifying buffer operators in the Score

Notes:
In the Score, buffer operators are displayed in the operators section. This example Score
shows one buffer operator that has been inserted.


How buffer operators work


• The primary goal of a buffer operator is to prevent
deadlocks
• This is accomplished by keeping rows in the buffer until the
downstream operator is ready to process them
– Rows are held in memory up to size defined by
APT_BUFFER_MAXIMUM_MEMORY
• Default is 3MB per buffer per partition
– When the buffer memory is filled, rows are spilled to scratch disk

Producer -> Buffer -> Consumer

Figure 5-4. How buffer operators work

Notes:
To prevent deadlocks, the buffer operator provides resistance to incoming rows.
Buffer sizes can be specified, but keep in mind that this will be allocated per partition and
per operator. The total amount of memory may be great.


Buffer flow control


• When buffer memory usage reaches
APT_BUFFER_FREE_RUN the buffer operator offers
resistance to new rows
– This slows down the rate rows are produced
– Default 0.5 = 50%

Producer -> Buffer -> Consumer

$APT_BUFFER_FREE_RUN: past this threshold the buffer will offer resistance
to new rows, slowing down the rate rows are produced

Figure 5-5. Buffer flow control

Notes:
As the buffer fills, it will begin to push back once the $APT_BUFFER_FREE_RUN
threshold is crossed.
By default the number is 50%, at which point the buffer will offer resistance. 50% is
designated as .5.
If you set $APT_BUFFER_FREE_RUN to greater than 100%, it will stop the buffer from
offering any resistance.


Buffer tuning
• Apply to stage (operator) links (input or output)
• Buffer policy
– $APT_BUFFERING_POLICY specifies the default buffering policy:
• AUTOMATIC_BUFFERING (Auto buffer)
– Initial installation default
– Buffer only if necessary to prevent a deadlock
• FORCE_BUFFERING (Buffer)
– Unconditionally buffer all links
• NO_BUFFERING (No buffer)
– Do not buffer under any circumstances
– May lead to deadlocks
• Buffer settings
– APT_BUFFER_MAXIMUM_MEMORY
• Maximum amount of memory per buffer (default is 3 MB)
– $APT_BUFFER_FREE_RUN
• Amount of memory to consume before offering resistance
– $APT_BUFFER_DISK_WRITE_INCREMENT
• Size of blocks of data moved to and from disk by buffering operator

Figure 5-6. Buffer tuning

Notes:
This slide summarizes the buffer settings. $APT_BUFFERING_POLICY specifies the
default buffering policy. This can be set to AUTOMATIC_BUFFERING (Auto buffer),
FORCE_BUFFERING (Buffer), or NO_BUFFERING (No buffer). The other settings
customize the degree of buffering.
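As an illustration (the values are examples, not recommendations), a project might set:

   $APT_BUFFERING_POLICY = AUTOMATIC_BUFFERING
   $APT_BUFFER_MAXIMUM_MEMORY = 6291456        (6 MB per buffer per partition, in bytes)
   $APT_BUFFER_FREE_RUN = 0.5                  (offer resistance at 50% full)
   $APT_BUFFER_DISK_WRITE_INCREMENT = 1048576  (move 1 MB blocks to and from disk)

Remember that the memory value applies to each buffer operator in each partition, so the total allocation grows with the number of buffers and the degree of parallelism.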


Cautions

• In general, buffer tuning should be done cautiously
• Default settings are appropriate for most jobs
• For jobs processing very wide rows, it may be necessary to increase the
  default buffer size to handle more rows in memory
• Calculate total record width using internal storage for each column data
  type, length, and scale. For variable length columns, use the maximum length

Figure 5-7. Cautions

Notes:
Only tune buffers if you know what you are doing. Improper buffer settings can cause
deadlocks.
The width of a row determines how many rows fit into a buffer; wide rows may therefore
require larger buffers. In this context, rows with more than 1000 columns are considered
wide.
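A rough worked example (the layout is hypothetical): a row of four int32 columns (4 bytes each) and two varchar(200) columns sized at maximum length is roughly 4 x 4 + 2 x 200 = 416 bytes. A default 3MB buffer (3,145,728 bytes) then holds on the order of 3,145,728 / 416, or about 7,500 rows per partition, before spilling to disk. A very wide row of, say, 30,000 bytes drops that to about 100 rows, which is when increasing the buffer size can pay off.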


Changing buffer settings in a job stage


• Change on Inputs>Advanced
tab or Outputs>Advanced tab
of a stage
• Buffering mode specifies the
policy to use
– Default: Use policy specified by
$APT_BUFFERING_POLICY
– Auto buffer: Only for deadlocks
– Buffer: Force buffering
– No buffer
• Edit settings
– Amounts are in bytes
– Queue upper bound size setting
• 0 = No upper bound for
amount of data to buffer in
memory and on disk

Figure 5-8. Changing buffer settings in a job stage

Notes:
Buffer settings can be specified in a job stage. These settings apply only to the operator
generated by the relevant stage. They are made on the Inputs>Advanced tab or
Outputs>Advanced tab of a stage, and they apply to the link, either the input link or the
output link.


Buffer resource usage


• By default, each buffer operator uses 3MB per partition of
virtual memory
– Can be changed through Advanced link properties, or
globally using $APT_BUFFER_MAXIMUM_MEMORY
• When buffer memory is filled, temporary disk space is used
in the following order:
• Scratch disks in the $APT_CONFIG_FILE “buffer” named disk pool
• Scratch disks in the $APT_CONFIG_FILE default disk pool
• The default directory specified by $TMPDIR
• The UNIX /tmp directory

Figure 5-9. Buffer resource usage

Notes:
This slide discusses buffer resource usage. By default, each buffer operator uses 3MB per
partition of virtual memory. This can be changed through Advanced link properties or
globally using $APT_BUFFER_MAXIMUM_MEMORY.


Buffering for group stages


• Stages that process groups of data (Join, Merge,
Aggregator in Sort mode) cannot output a row until:
– Data in the grouping key column changes (end of group) or all rows
have been processed (end of data)
– Rows are buffered in memory until then
• Some stages (Sort, Aggregator in Hash mode) must read
the entire input before outputting a single record
– Setting Don’t Sort, Previously Sorted key option changes Sort stage
behavior to output on groups instead of entire dataset

Figure 5-10. Buffering for group stages

Notes:
Stages that process groups of data (Join, Merge, Aggregator in Sort mode) cannot output a
row until either an “end of group” or an “end of data” event occurs.
“End of data” and “end of group” are events that might cause something to happen within
the system. For example, in an Aggregator stage there may be a change in the key value of
a group that is being processed. This indicates that the group is done. At this point the
stage can output the summary row. Once an operator gets to the end of data, it can shut
itself down.


Join stage internal buffering

• Even for inner Joins, there is a difference between the different input
  links to a Join stage
• The first link (#0, Left within link ordering) is the driver
  – Reads rows one at a time
• The second link (#1, Right by link ordering) buffers all rows with key
  values that match the driver row

Figure 5-11. Join stage internal buffering

Notes:
For Join and Merge stages, the order of links is important. This slide illustrates that there is
a difference between buffering (as done by the buffer operator) and buffering as it is done
in the Join stage. Both can affect performance (but in different ways). An example of a job
where join buffering can degrade performance appears later in this unit.


Avoiding buffer contention in fork-join jobs

• One possible solution for avoiding fork-join buffer contention is to split
  the job into two separate jobs
  – Write intermediate results to data sets
    • Data sets preserve partitioning
    • Data sets do not buffer
  – Develop the single fork-join job first
    • Check if testing indicates a buffer-related performance issue
    • Check if issue can be resolved by adjusting buffer settings
  – If not, try splitting the job

Figure 5-12. Avoiding buffer contention in fork-join jobs

Notes:
If buffering becomes an issue in a fork-join job, one solution is to split the job into two
separate jobs. Develop the single fork-join job first. Check if testing indicates a
buffer-related performance issue. See if you can resolve it by changing buffer settings. If
not, try changing the job design.


Parallel Job Design Examples

Figure 5-13. Parallel Job Design Examples

Notes:


Revisiting the header detail job design


• For large data volumes, buffering introduces a possible problem with this
  solution:
  – At runtime, buffer operators are inserted for this scenario
  – The Join stage, operating on key-column groups, is unable to output rows
    until end of group or data
• Generating one header row with no subsequent change in join column, data
  is buffered until end of group
• Problem: Processing is halted until all rows in the group are read

[Slide diagram: Src forks into Header and Detail streams; each passes
through an inserted Buffer into the Join, which produces Out]

Figure 5-14. Revisiting the header detail job design

Notes:
Everything gets buffered up, and the join will not output data until the end-of-group
condition. If the groups contain small numbers of records, then the performance impact is
minimal. The problem is most severe in cases where there is a single header row to be
merged with all rows.


Buffering solution
• Perform the join in the Transformer using stage variables to
store the header record information
• Data is Hash partitioned to ensure that header and detail
records in a group are not spread across different partitions

Figure 5-15. Buffering solution

Notes:
Here we consider one possible solution to a buffering issue. Perform the join in the
Transformer using stage variables to store the header record information. In this case, the
data does not need to be split into two streams and then joined back.


Redesigned header detail processing job

[Slide graphic callouts: Parse out OrderNum and RecType columns; Store
header info in stage variables]

Figure 5-16. Redesigned header detail processing job

Notes:
The two fields (OrderNum and RecType) that are in common to both header and detail
records are parsed out using the Column Import stage. OrderNum is needed so that the
data can be Hash partitioned by OrderNum before it is processed by the Transformer.
It is assumed here that the data in the Orders file is sorted so that each group is contiguous
and the header record precedes the detail records that make up the group. This order is
required at the time of the Hash partitioning before the Transformer. So the Column Import
stage has to run sequentially.
Header record info (RecType = “A”) is stored in the Name and OrderDate fields. Detail
records (RecType= “B”) are written out with the added header information.
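A sketch of the Transformer logic (the stage variable and column names are illustrative, not the exact lab names): two stage variables carry the most recent header values forward, taking advantage of the fact that stage variables retain their values across rows within a partition:

   svName:      If lnk_in.RecType = "A" Then Field(lnk_in.Rec, ",", 3) Else svName
   svOrderDate: If lnk_in.RecType = "A" Then Field(lnk_in.Rec, ",", 4) Else svOrderDate

The output link constraint lnk_in.RecType = "B" passes only detail rows, and the output derivations add svName and svOrderDate to each of them, so every detail row picks up the values from the header row that preceded it.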


Checkpoint
1. Which property determines the degree to which a buffer
offers resistance to new rows?
2. Name two stages that must read the entire input set of input
records before outputting a single record.

Figure 5-17. Checkpoint

Notes:
Write your answers here:


Exercise – Optimize a fork join job


• In this lab exercise, you will:
– Redesign a fork join job to avoid join
buffering

Figure 5-18. Exercise - Optimize a fork join job

Notes:


Unit summary
Having completed this unit, you should be able to:
• Describe how buffering works in parallel jobs
• Tune buffers in parallel jobs
• Avoid buffer contentions

Figure 5-19. Unit summary

Notes:

Unit 6. Parallel Framework Data Types

What this unit is about


This unit describes how the framework handles different types of data,
including sequential data and mainframe data.

What you should be able to do


After completing this unit, you should be able to:
• Describe virtual data sets
• Describe schemas
• Describe data type mappings and conversions
• Describe how external data is processed
• Handle nulls
• Work with complex data

How you will check your progress


• Lab exercises and checkpoint questions.


Unit objectives
After completing this unit, you should be able to:
• Describe virtual data sets
• Describe schemas
• Describe data type mappings and conversions
• Describe how external data is processed
• Handle nulls
• Work with complex data

Figure 6-1. Unit objectives

Notes:


Data formats

• Parallel operators process in-memory data sets
• For external data, conversions are performed:
  – Format translation
    • Using data type mappings
  – May also require:
    • Recordization
    • Columnization

[Slide diagram: External data -> Conversion -> Data set format ->
Conversion -> External data]

Figure 6-2. Data formats

Notes:
Parallel operators process in-memory data sets. For data that exists outside of DataStage
(external) data, conversions are performed. These conversions can involve data type
conversions, recordization (breaking up the data block into individual records), and
columnization (breaking up the records into individual fields).


Data sets
• Structured internal representations of data within the parallel framework
• Consist of:
– Schema
• Describes the format of records and columns
– Data
• Partitioned according to the number of nodes
• Virtual data sets are in memory
– Correspond to job links
• Persistent data sets are stored on disk
– Descriptor file: Lists schema, configuration file, data file locations, flags
– Multiple data files
• One per node
• Stored in disk resource file systems
• The copy operator is used to write to data sets
– No conversion is necessary as for external data

Figure 6-3. Data sets

Notes:
Data sets are structured internal representations of data. They include a schema, which
describes the format of records and columns, and the data.


Example schemas

[Slide graphic callouts: Column name: data type; Extended column properties;
Record properties]

Figure 6-4. Example schemas

Notes:
The schema includes all the properties found in a table definition, including extended
properties. The schema data types are C++ types. The extended properties are in brackets
following the data type. Notice the record property at the top of the schema, which lists the
record delimiter and the column delimiter.
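A small hand-written schema in this format might look like the following (the column names, lengths, and delimiters are examples):

   record
   {final_delim=end, record_delim='\n', delim=','}
   (
     CustID: int32;
     Name: string[max=30];
     Balance: decimal[10,2];
     OrderDate: date;
   )

The properties in braces are record properties; the bracketed items on individual columns, such as the maximum string length, are the extended column properties.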


Type conversions
• DataStage provides conversion functions between input and output
data types
• Default type conversions
– Examples: int -> varchar, varchar -> char, char -> varchar
• Generally what makes sense (see chart on next page)
– Not defaulted: char -> date, date -> char
– Variable to fixed-length string conversions are padded, by default, with ASCII
Null (0x0) characters
– Use $APT_STRING_PADCHAR to change the default
• Non-default type conversions require use of Transformer or Modify
stages
– Use their type conversion functions
– Modify and Transformer offer many additional conversions
• Warnings are issued for default conversions with potentially
unexpected results
– For example, varchar(100) -> varchar(50)
• Truncation may occur

Figure 6-5. Type conversions

Notes:
When an input column is mapped to an output column of a different type, a data type
conversion occurs. Some of these are default conversions that occur automatically. Other
conversions must be explicitly specified in, for example, a Transformer derivation.
The default pad character is in keeping with the C++ end-of-string character (ASCII Null
(0x0)).


Source to target type conversions


Source Field (rows) / Target Field (columns)
d = There is a default type conversion from source field type to destination
field type.
e = You can use a Modify or a Transformer conversion function to convert from
the source type to the destination type. A blank cell indicates that no
conversion is provided.

[Conversion matrix: both axes list the framework types int8, uint8, int16,
uint16, int32, uint32, int64, uint64, sfloat, dfloat, decimal, string,
ustring, raw, date, time, and timestamp; each cell holds d, e, de, or blank.]
Figure 6-6. Source to target type conversions

Notes:
You can use this chart as a reference. In the chart, “e” means that you need to explicitly
define the conversion. “d” means a default conversion is available. Note that this chart uses
framework data types, not DataStage GUI type names.


Using Modify Stage For Type Conversions


• Modify syntax:
  – outputColumn:newtype = conversionFunction [format specifier] (inputColumn)
• Colon (:) precedes type
  – Types are Framework types (see table on previous slide)
• Format specifier is only applicable for some conversions, for example,
  date conversions
• Input column name is in parentheses
• Example: Converting a string to a date:
  – OrderDate:date = date_from_string [%mm/%dd/%yyyy] (inDate)

Figure 6-7. Using Modify Stage For Type Conversions

Notes:
The Modify stage can be used to perform type conversions. It generates the modify
operator in the OSH. The modify operator is also inserted by DataStage as necessary into
the OSH and Score to perform required conversions.


Processing external data


• External data can come from:
– Relational tables (DB2, Informix, Oracle, Teradata)
– SAS
– Flat text files
– Flat binary files
– Mainframe files (COBOL files)
• External data conversions fall in two major types:
– Automatic conversions
• Relational data
• SAS data
• Modify operator is inserted to perform the conversions
– Manual conversions
• Flat text files, binary files
– Use the Sequential File stage
• COBOL files
– Use the Complex Flat File stage
• import / export operators are used to perform the conversions
• Converting with source stage: import
• Converting with target stage: export
Figure 6-8. Processing external data

Notes:
External data can come from many sources. Some can be converted automatically, for
example, relational data. Others require a manual conversion. In the case of a sequential file, the
Sequential File stage is used to import the data from the file into the internal framework
format.


Sequential file import conversion

[Slide graphic callouts: Conversion operator (import); Sequential File stage;
GUI column definitions; Schema of conversion types]

Figure 6-9. Sequential file import conversion

Notes:
In the Sequential File stage Format and Columns tab, you specify how you want the data
converted. The GUI column data types are SQL types. A schema is generated from the
table definition, which the import operator uses to perform the conversion.


COBOL file import conversion

[Slide graphic callouts: Conversion operator (import); Complex Flat File
stage column definitions (COBOL view); Schema of converted types]
Figure 6-10. COBOL file import conversion

Notes:
COBOL files are another source of external data that needs to be converted. The Complex
Flat File (CFF) stage is used for this purpose. The Complex Flat File stage supports
complex data types including arrays (OCCURS) and groups. The schema generated from
the CFF stage includes complex framework types. Note the use of the subrec type, which
corresponds to a COBOL group.


Oracle automatic conversion


[Slide graphic callouts: Oracle read operator; Modify operator used to
perform the type conversions; Oracle table, input columns; Stage output
columns, with schema types]
Figure 6-11. Oracle automatic conversion

Notes:
Database stages, such as the Oracle Connector stage, convert types automatically. The
modify operator, which is inserted into the Score, is used to perform these conversions.


Standard Framework data types


• Char: Fixed-length string
• VarChar
  – Variable-length string
  – Specify maximum length
• Integer
• Decimal (Numeric)
  – Precision: total number of digits, including digits after the decimal point
  – Scale: number of digits after the decimal point
• Floating point
• Date
  – Default string format: %yyyy-%mm-%dd
• Time
  – Default string format: %hh:%nn:%ss
• Timestamp
  – Default string format: %yyyy-%mm-%dd %hh:%nn:%ss
• VarBinary (raw): string of untyped bytes

Figure 6-12. Standard Framework data types KM4001.0

Notes:
This slide lists the standard (non-complex) framework data types. These correspond to the
standard SQL types.
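For illustration, here is a minimal sketch of a schema that uses these standard types (the
record layout and column names are invented for this example; the // comments are
schema-file comments):

    record (
      CustName: string[20];        // Char(20): fixed-length string
      Comment: string[max=100];    // VarChar(100): variable-length string
      Qty: int32;                  // Integer
      Price: decimal[8,2];         // Decimal: precision 8, scale 2
      Rate: dfloat;                // Floating point
      OrderDate: date;             // default string format %yyyy-%mm-%dd
      OrderTime: time;             // default string format %hh:%nn:%ss
      Updated: timestamp;
      Payload: raw;                // VarBinary: string of untyped bytes
    )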


Complex data types


• Vector
– A one-dimensional array
– Elements are numbered 0 to n
– Elements can be of any single type
– All elements must have the same type
– Can have fixed or variable number of elements
• Subrecord
– A group or structure of elements
– Elements of the subrecord can be of any type
– Subrecords can be embedded


Figure 6-13. Complex data types KM4001.0

Notes:
This slide lists the framework complex types. A vector is a one-dimensional array. All the
elements in the array have to be of the same type. A subrecord is a group or structure of
elements. The elements of the subrecord can be of any type.


Schema with complex types

[Screenshot: a table definition Layout tab; call-outs mark the subrecord and the vector in
the schema]

• Table definition with complex types
• Authors is a subrecord
• Books is a vector of three strings of length 5

Figure 6-14. Schema with complex types KM4001.0

Notes:
On the Layout tab, you can view the metadata according to different views. In this way you
can easily see how a COBOL file description (CFD) will be converted to a schema. Shown
in this screenshot is the Parallel view schema that includes some complex types.
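A sketch of how such a schema might read, using the field names from the slide (the
element names and sizes inside the subrecord are invented for this example):

    record (
      Books[3]: string[5];       // vector: three strings of length 5
      Authors: subrec (          // subrecord: a group of elements
        FirstName: string[15];
        LastName: string[15];
      );
    )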


Complex types column definitions

[Screenshot: the Columns tab; call-outs mark the subrecord, the elements of the subrecord
(higher level numbers indicate a subrecord, that is, a group), and the vector]

Figure 6-15. Complex types column definitions KM4001.0

Notes:
Importing metadata from a COBOL copybook can generate these level structures for use in
the Complex Flat File stage. In COBOL, the higher level numbers indicate a subrecord
(group). The OCCURS is equivalent to a vector (array).


Complex Flat File Stage


Figure 6-16. Complex Flat File Stage KM4001.0

Notes:


Complex Flat File (CFF) stage


• Process data in a COBOL file
– File is described by a COBOL file description (CFD)
– File can contain multiple record types
• COBOL copybooks with multiple record formats can be
imported
– Each format is stored as a separate DataStage table definition
• Columns can be loaded for each record type
• On the Record ID tab, you specify how to identify each type of record
• Columns from any or all record types can be selected for
output
– This allows columns of data from multiple records of different types to
be combined into a single output record


Figure 6-17. Complex Flat File (CFF) stage KM4001.0

Notes:
The Complex Flat File (CFF) stage can be used to process data in a mainframe COBOL
file. A COBOL file is described by a COBOL file description (CFD). COBOL copybooks with
multiple record formats can be imported. In the stage separate table definitions can be
loaded for each record type.


Sample COBOL copybook

[Screenshot: a COBOL copybook; call-outs mark the CLIENT, POLICY, and COVERAGE
record formats]

Figure 6-18. Sample COBOL copybook KM4001.0

Notes:
This slide shows an example of a COBOL copybook. It has three record types as indicated
by the call-outs.
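The copybook in the screenshot is not reproduced here, but a hypothetical fragment of the
same shape, with one 01 level per record type, might look like this (all names and PIC
clauses are invented for illustration):

    01  CLIENT-REC.
        05  RECTYPE-1      PIC X(1).
        05  CLIENT-ID      PIC X(6).
        05  CLIENT-NAME    PIC X(30).
    01  POLICY-REC.
        05  RECTYPE-2      PIC X(1).
        05  POLICY-TYPE    PIC X(3).
        05  POLICY-DATE    PIC X(10).
    01  COVERAGE-REC.
        05  RECTYPE-3      PIC X(1).
        05  COVERAGE-AMT   PIC 9(7)V99.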


Importing a COBOL File Definition (CFD)

[Screenshot: the CFD import window; call-outs mark the level 01 column position setting
and the list of level 01 items]

Figure 6-19. Importing a COBOL File Definition (CFD) KM4001.0

Notes:
In Designer, you can import CFD files. These will be converted into one or more table
definitions. In this example, a single file contains three record types: CLIENT, COVERAGE,
and POLICY. These correspond to the level 01 items in the CFD file.


COBOL table definitions

[Screenshot: the imported CLIENT table definition; a call-out marks the COBOL level
numbers]

Figure 6-20. COBOL table definitions KM4001.0

Notes:
This shows the table definition for the CLIENT record type that was imported. The level
numbers are preserved in the table definition indicating the column hierarchy.


COBOL file layout

[Screenshot: the table definition Layout tab showing the COBOL layout]

Figure 6-21. COBOL file layout KM4001.0

Notes:
In the table definition Layout tab, you can switch from the Parallel view to the COBOL view
and back. In the screenshot, PIC X(30) is a COBOL data type, equivalent to Char(30).


Specifying a date mask

[Screenshot: the Edit Column Meta Data window; a call-out marks the Date format list
where the date mask is selected]

Figure 6-22. Specifying a date mask KM4001.0

Notes:
You can specify date masks for columns that contain dates. Double-click to the left of the
column number on the Columns tab to open the Edit Column Meta Data window. Select a
field that contains date values. Then select the date mask that describes the format of the
date from the Date format list.
The SQL type is changed to Date. All dates are stored in a common format, which is
described in project or job properties. By default, dates are stored in DB2 format.
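For example (an illustrative mask, not necessarily the one in the screenshot), if a Char(10)
field holds dates such as 25/12/2010, you would select the mask:

    %dd/%mm/%yyyy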


Example data file with multiple formats


• Record Type = ‘1’: CLIENT record
• Record Type = ‘2’: POLICY record
• Record Type = ‘3’: COVERAGE record

Figure 6-23. Example data file with multiple formats KM4001.0

Notes:
For clarity in this example, each record type has been placed on a separate line. Spaces
have been added between fields. In practice, the records might follow each other
immediately without being placed on a separate line. In the file used in the lab exercises,
records follow each other immediately with a single record character, the pipe (|),
separating them.
In this example, client information is stored as a group of three types of records: CLIENT,
POLICY, COVERAGE. There is one CLIENT record type which is the first record of the
group. This can be followed by one or more POLICY records. Each POLICY record is
followed by one or more COVERAGE records. Client Ralesh has two insurance policies.
The first is for motor vehicles (MOT). He has two coverages under this policy. The second
policy is for travel (TRA). He has one coverage under this policy.


Sample job With CFF Stage

[Screenshot: a sample job in which a CFF stage feeds a Transformer that splits the data
into multiple output links]

Figure 6-24. Sample job With CFF Stage KM4001.0

Notes:
The Transformer in this job is used to split the data into multiple output streams. In the
Transformer, a separate constraint is defined on each output link. Alternatively, the three
output links with their constraints could have come directly from the CFF stage, which
supports multiple output links and constraints. A Transformer is required if derivations
need to be performed on any columns of data.


File options tab

[Screenshot: the File options tab; call-outs mark the data file property and the job
parameter used in its value]

Figure 6-25. File options tab KM4001.0

Notes:
The CFF stage contains a number of tabs. This shows the File options tab. Here you can
specify one or more files to be read.


Records tab

[Screenshot: the Records tab; call-outs mark the active record type, the button to add
another record type, the Load button that loads columns for a record type, and the icon to
set a record type as master]

Figure 6-26. Records tab KM4001.0

Notes:
Define each record type on the Records tab. Here we see that three record types have
been defined. For each type, click the Load button to load the table definition that defines
the type.
To add another record type, click the button at the bottom of the Records tab. Click the
far-right icon to set a record type as the master. When a master record is read, the output
buffer is emptied.


Record ID tab

[Screenshot: the Record ID tab; a call-out marks the condition that identifies the type of
record]

Figure 6-27. Record ID tab KM4001.0

Notes:
On the Record ID tab, you specify how to identify which type of record you are currently
reading. The condition specified here says that if the RECTYPE_3 field contains a ‘3’, then
the record is a COVERAGE record. A constraint must be defined for each record type; a
sketch of all three follows.
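As a sketch, the conditions for the three record types in the lab file might look like this
(the RECTYPE_3 = '3' condition is from the slide; the other column names are assumed to
follow the same pattern):

    CLIENT:    RECTYPE_1 = '1'
    POLICY:    RECTYPE_2 = '2'
    COVERAGE:  RECTYPE_3 = '3'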


Selection tab


Figure 6-28. Selection tab KM4001.0

Notes:
After each record is read, a record will be sent out the output link. This tab is where you
specify the columns of the output record. Notice that the output record can contain values
from any or all of the record types.
Since only a single record type is read at a time, only some of the output columns (those
which get their values from the current record type) will receive values. The other columns
will retain whatever value they had before or they will be empty. Whenever the master
record is read, all columns are emptied before the new values are written.
It is crucial to be aware that although each output record has all of these columns, not all of
these columns will necessarily have valid data. When you process these records, for
example, in a Transformer, you need to determine which fields contain valid data.


Record options tab


Figure 6-29. Record options tab KM4001.0

Notes:
On the Record options tab, you specify format information about the file records. Here, the
file is described as a text file (rather than binary), as an ASCII file (rather than EBCDIC),
and as a file with records separated by the pipe (|).


Layout tab

[Screenshot: the CFF stage Layout tab showing the COBOL layout]

Figure 6-30. Layout tab KM4001.0

Notes:
The Layout tab is a very useful tab. It displays the length of the record (as described by the
metadata), and the lengths and offsets of each column in the record. It is crucial that the
metadata accurately describe the actual physical layout of the file. Otherwise, errors will
occur when the file is read.


View data

[Screenshot: the View Data window; call-outs mark the start of the CLIENT columns, the
start of the POLICY columns, and the start of the COVERAGE columns]

Figure 6-31. View data KM4001.0

Notes:
Click the View Data button to view the data. When you view the data, you are viewing the
data in all the output columns.
Notice that output columns for a given row can contain data from previous reads. For
example, when the second record, which is a POLICY record, is read, the CLIENT columns
are populated with data from the previous record, which was a CLIENT record. So you
need to distinguish, usually within a Transformer, which columns contain valid data.


Processing multi-format records

[Screenshot: stage variables in the Transformer; their derivations identify which type of
record is coming into the Transformer]

Figure 6-32. Processing multi-format records KM4001.0

Notes:
Usually a CFF stage will be followed by a Transformer stage so that the different record
types can be identified and processed. In this example, when the IsClient stage variable
equals ‘Y’, then we know that the CLIENT columns contain valid data. When the IsPolicy
stage variable equals ‘Y’, then we know that the POLICY columns contain valid data. When
the IsCoverage stage variable equals ‘Y’, then we know that the COVERAGE columns
contain valid data.
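A minimal sketch of such stage variable derivations, assuming record-type indicator
columns like those used on the Record ID tab (link and column names invented):

    IsClient:    IF lnk_in.RECTYPE_1 = '1' THEN 'Y' ELSE 'N'
    IsPolicy:    IF lnk_in.RECTYPE_2 = '2' THEN 'Y' ELSE 'N'
    IsCoverage:  IF lnk_in.RECTYPE_3 = '3' THEN 'Y' ELSE 'N'

The output link constraints then test these flags, for example IsClient = 'Y' on the CLIENT
output link.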


Transformer constraints


Figure 6-33. Transformer constraints KM4001.0

Notes:
These constraints ensure that a record is written out to the CLIENT output link only when
the columns contain valid client information. And so on, for the POLICY and COVERAGE
output links.


Nullability


Figure 6-34. Nullability KM4001.0

Notes:


Nullable data
• Out-of-band: an internal data value marks a field as null
– Value cannot be mistaken for a valid data value of the given type
• In-band: a specific user-defined field value indicates a null
– Disadvantage:
• Must reserve a field value that cannot be used as valid data elsewhere
– Examples:
• Numeric field’s most negative possible value
• Empty string
• To convert an out-of-band null to an in-band Null, and vice-versa:
– Transformer stage:
• Stage variables: IF ISNULL(linkname.colname) THEN … ELSE …
• Derivations: SetNull(linkname.colname)
– Modify stage:
• destinationColumnName = handle_null(sourceColumnName,value)
• destinationColumnName = make_null(sourceColumnName,value)


Figure 6-35. Nullable data KM4001.0

Notes:
Nulls are categorized as “out-of-band” and “in-band”. The former is an internal data value
that marks a field as null. The latter is a specific user-defined field value that indicates a
null. The advantage of an out-of-band null is that it cannot be mistaken for a valid data
value of the given type. A sketch of both conversion directions follows.
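A minimal sketch of both directions, assuming a nullable Salary column with -1 reserved
as its in-band null marker (all names are invented for this example):

    Transformer derivation (out-of-band null to in-band value):
        IF ISNULL(lnk_in.Salary) THEN -1 ELSE lnk_in.Salary

    Modify stage specifications:
        Salary_out = handle_null(Salary, -1)   (null becomes -1)
        Salary_out = make_null(Salary, -1)     (-1 becomes null)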


Null transfer rules

When mapping between source and destination columns of different nullability settings:

  Source Field    Destination Field   Result
  not_nullable    not_nullable        Source value propagates to destination.
  nullable        nullable            Source value or Null propagates.
  not_nullable    nullable            Source value propagates; destination value is
                                      never Null.
  nullable        not_nullable        WARNING messages in log. If source value is Null,
                                      a fatal error occurs. Must handle in Transformer
                                      or Modify stage.


Figure 6-36. Null transfer rules KM4001.0

Notes:
When mapping between source and destination columns of different nullability settings,
there are four possibilities. The last case (nullable -> not_nullable) is the only one that
creates a problem; a guard like the one sketched below avoids it.
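A minimal guard for the nullable -> not_nullable case, using the parallel Transformer's
NullToValue function (the column name and substitute value are invented):

    NullToValue(lnk_in.MiddleName, 'N/A')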


Nulls and sequential files

• When writing to nullable columns in sequential files, the null


representation can be:
– Null field value
• A number, string, or C-style literal escape value (for example, \xAB)
that defines the Null value representation
– Null field value of empty string (‘’)
• Only for variable-length files
• Null field representation can be any string, regardless of
valid values for actual column data type


Figure 6-37. Nulls and sequential files KM4001.0

Notes:
Nulls can be written to and read from sequential files. In the file, some value or lack of a
value (for example, indicated by two side-by-side column delimiters) means null. You can
specify this in a Sequential File stage.
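In schema terms, the null representation can be attached to a field with the null_field
property; a sketch, with an illustrative value and column name:

    record (
      MiddleName: nullable string[max=20] {null_field='NULL'};
    )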


Null field value examples

[Screenshot: Null field value settings for several column types: Integer, Varchar, Date, and
Char(6) or Varchar]

Figure 6-38. Null field value examples KM4001.0

Notes:
This slide shows some examples of values you can specify. The null field representation
can be any string, regardless of valid values for actual column data type.


Viewing data with Null values

[Screenshot: the View Data window; call-outs contrast the word “NULL”, which DataStage
displays for null values, with the actual values in the file]

Figure 6-39. Viewing data with Null values KM4001.0

Notes:
When you view the file data within DataStage, the word “NULL” is displayed by DataStage
for null values, regardless of their actual value in the file.


Lookup stage and nullable columns


• When the Lookup Failure option is Continue, set reference link non-key columns to
  nullable
  – Even if the reference data is not nullable
  – Ensures that Lookup assigns null values to non-key reference columns for
    unmatched rows
  – Can be used to identify unmatched rows in subsequent Transformer logic

[Diagram: a Lookup with Lookup Failure = Continue. Reference input: K: integer,
Z: varchar(30) NULLABLE. Stream input: K: integer, A: varchar(20). Output: K: integer,
A: varchar(20), Z: varchar(30) NULLABLE]

Figure 6-40. Lookup stage and nullable columns KM4001.0

Notes:
A best practice, when using the Lookup stage, is to specify that the reference link non-key
columns are nullable. This ensures that the Lookup stage assigns null values to non-key
reference columns for unmatched rows. These lookup failure rows can then be identified in
a Transformer following the Lookup stage.
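A minimal sketch of such a check in the Transformer, using the column names from the
diagram (the link name is invented):

    IF ISNULL(lkp_ref.Z) THEN 'NO MATCH' ELSE 'MATCH'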


Default values
• What happens if non-key reference columns are not nullable?
  – Lookup stage assigns a default value to a row without a match
  – The default value depends on the data type. For example:
    • Integer columns default to zero
    • Varchar defaults to the empty string (not to be confused with Null)
    • Char defaults to a fixed-length string of $APT_STRING_PADCHAR characters
  – Unmatched rows are more difficult to identify in subsequent stages (Transformer)

[Diagram: a Lookup with the Lookup Failure option set to “Continue”. Reference input:
K: integer, Z: varchar(30) NOT NULLABLE. Stream input: K: integer, A: varchar(20).
Output: K: integer, A: varchar(20), Z: varchar(30) NOT NULLABLE or NULLABLE.
Unmatched output rows follow the nullability attributes of the non-key reference link
columns]

Figure 6-41. Default values KM4001.0

Notes:
If non-key reference columns are not nullable then the Lookup stage assigns a default
value to a row without a match. The default value depends on the data type. This makes it
more difficult to identify whether the row is a lookup failure in subsequent stages such as a
Transformer.


Nullability in lookups
[Screenshot: inside the Lookup stage; call-outs note that the Lookup Failure option is
Continue (if the lookup fails, Nulls are returned in the reference columns) and mark a
non-key reference column, which returns Nulls if nullable and otherwise returns the empty
string]

Figure 6-42. Nullability in lookups KM4001.0

Notes:
This slide shows the inside of the Lookup stage. Here the JOB_DESCRIPTION column has
been set to nullable, so that null will be returned by the Lookup stage for a lookup failure.
Note that the output column that this column is mapped to must also be nullable, or you
will get a runtime error.


Outer joins and nullable columns


• Similar to using Lookup: when performing an outer
join, set non-key columns on outer links to nullable
– Ensures that Join stage assigns null values to columns
associated with unmatched (outer) records

[Diagrams: two Join examples. Left outer join: Left input K: integer, Z: varchar(30)
NULLABLE; Right input K: integer, A: varchar(20); output K: integer, Z: varchar(30)
NULLABLE, A: varchar(20) NULLABLE. Full outer join: same inputs; output leftrec_K:
integer, rightrec_K: integer, Z: varchar(30) NULLABLE, A: varchar(20) NULLABLE]

Figure 6-43. Outer joins and nullable columns KM4001.0

Notes:
Like the Lookup stage, the Join stage can generate nulls when using outer joins.


Checkpoint
1. What type of files contain the metadata that is typically
loaded into the CFF stage?
2. Does the CFF stage support variable length records?
3. What does it accomplish to select a record type as a master?
4. Which of the following conversions are automatic and which
require manual conversions using a Transformer or Modify
stage? integer --> varchar, date --> char, varchar --> char,
char --> varchar, char --> date
5. Suppose the Lookup Failure option is "Continue". The
reference link column is a varchar but not nullable. What
values will be returned for rows that are lookup failures?


Figure 6-44. Checkpoint KM4001.0

Notes:
Write your answers here:


Exercise 6 – Test nullability


• In this lab exercise, you will:
– Test nullability
– Test data conversion


Figure 6-45. Exercise 6 - Test nullability KM4001.0

Notes:


Unit summary
Having completed this unit, you should be able to:
• Describe virtual data sets
• Describe schemas
• Describe data type mappings and conversions
• Describe how external data is processed
• Handle nulls
• Work with complex data


Figure 6-46. Unit summary KM4001.0

Notes:

Unit 7. Reusable components

What this unit is about


This unit describes how you can take advantage of Runtime Column
Propagation (RCP) to create flexible, reusable components.

What you should be able to do


After completing this unit, you should be able to:
• Create a schema file
• Read a sequential file using a schema
• Describe Runtime Column Propagation (RCP)
• Enable and disable RCP
• Create and use shared containers


Unit objectives
After completing this unit, you should be able to:
• Create a schema file
• Read a sequential file using a schema
• Describe Runtime Column Propagation (RCP)
• Enable and disable RCP
• Create and use shared containers


Figure 7-1. Unit objectives KM4001.0

Notes:


Using Schema Files to Read Sequential Files

Figure 7-2. Using Schema Files to Read Sequential Files KM4001.0

Notes:


Schema file
• Alternative way of specifying column definitions and record
formats
– Similar to a table definition
• Written in a plain text file
• Can be imported as a table definition
• Can be created from a table definition
• Can be used in place of a table definition in a Sequential
File stage
– Requires Runtime Column Propagation (RCP)
– Schema file path can be parameterized
• Enables a single job to process files with different column
definitions


Figure 7-3. Schema file KM4001.0

Notes:
The format of each line describing a column is: column_name:[nullability]datatype;
Here column_name is the name that identifies the column. Names must start with a letter
or an underscore (_) and can contain only alphanumeric or underscore characters. The
name is not case sensitive and can be of any length.
You can optionally specify whether a column is allowed to contain a null value or whether
this would be viewed as invalid. If the column can be null, insert the word ‘nullable’. By
default, columns are not nullable. You can also include the nullable property at record level
to specify that all columns are nullable, and then override the setting for individual columns
by specifying ‘not nullable’. For example:

    record nullable (
      Age: int32;
      BirthDate: date;
    )

Following the nullability specifier is the C++ data type of the column.

Creating a schema file


• Using a text editor
– Follow correct syntax for definitions
– Not recommended
• Import from an existing data set or file set
– In DataStage Designer: Import>Table Definitions>
Orchestrate Schema Definitions
– Select the checkbox for a file with a .fs or .ds suffix
• Import from a database table
• Create from a table definition
– Click Parallel on Layout tab


Figure 7-4. Creating a schema file KM4001.0

Notes:
This slide lists several ways to create a schema file. Another good way of capturing a
schema is to set the $OSH_PRINT_SCHEMAS environment variable and copy entries
from the DataStage Director log.


Importing a schema

[Screenshot: the Import Orchestrate Schema wizard; call-outs note that the schema
location can be the server or the workstation, and mark the options to import from a
database table or from a data set or file set]

Figure 7-5. Importing a schema KM4001.0

Notes:
Schemas can be imported from data sets, file sets, files on the DataStage Server system or
your workstation, and from database tables.


Creating a schema from a table definition

[Screenshot: a table definition Layout tab with the Parallel layout selected; call-outs mark
the schema corresponding to the table definition and the Save schema option]

Figure 7-6. Creating a schema from a table definition KM4001.0

Notes:
It is easy to create a schema file from an existing table definition. Open the table definition
to the Layout tab. This displays the schema. Then right-click and select Save As. The file
is saved on your client system.


Reading a sequential file using a schema

[Screenshot: a Sequential File stage reading with a schema; call-outs note that no columns
are defined on the Columns tab and mark the path to the schema file]

Figure 7-7. Reading a sequential file using a schema KM4001.0

Notes:
To use a schema file to read from a sequential file, first add the Schema File optional
property. Schemas can only be used when Runtime Column Propagation (RCP) is turned
on in the stage. This is discussed later in this unit.


Runtime Column Propagation (RCP)

Figure 7-8. Runtime Column Propagation (RCP) KM4001.0

Notes:


Runtime Column Propagation (RCP)


• When RCP is turned on:
– Columns of data can flow through a stage without being explicitly defined in
the stage
– Output columns in a stage need not have any input columns or values
explicitly mapped to them
• No column mapping enforcement at design time
– Input values are implicitly mapped to output columns based on the column
name
• Benefits of RCP
– Job flexibility
• Job can process input files and tables with different column layouts
– Ability to create reusable components in shared containers
• Component logic can apply to a single named column
• All other columns flow through untouched


Figure 7-9. Runtime Column Propagation (RCP) KM4001.0

Notes:
This slide describes RCP and lists some benefits of using it. The key feature of RCP is
that, when it is turned on, columns of data can flow through a stage without being explicitly
defined in the stage. The key benefit is a flexible job design.


Enabling Runtime Column Propagation (RCP)


• Project level
– DataStage Administrator Parallel tab
• Job level
– Job properties General tab
• Stage level
– Link Output Column tab
• Settings at a lower level override settings at a higher level
– E.g., disable at the project level, but enable for a given job
– E.g., enable at the job level, but disable a given stage


Figure 7-10. Enabling Runtime Column Propagation (RCP) KM4001.0

Notes:
RCP can be enabled at the project level, the job level, or even the stage level.


Enabling RCP at Project Level

[Screenshot: the Administrator Parallel tab; call-outs mark the checkbox that enables RCP
to be used in the project and the checkbox that makes RCP the default for new jobs]

Figure 7-11. Enabling RCP at Project Level KM4001.0

Notes:
In the Administrator client, you must set the Enable Runtime Column Propagation for
Parallel Jobs property if you are to use RCP in the project at any level. Check the Enable
Runtime Column Propagation for new links property (not recommended) to have it
turned on by default.


Enabling RCP at Job Level

[Screenshot: the Job Properties General tab; a call-out marks the checkbox that makes
RCP the job default]

Figure 7-12. Enabling RCP at Job Level KM4001.0

Notes:
If RCP has been enabled for the project in Administrator, it can be turned on at the job level
on the Job Properties General tab. This will turn it on for all stages in the job.


Enabling RCP at Stage Level


• Sequential File stage
  – Output Columns tab
• Transformer
  – Open Stage Properties
  – Stage Properties Output tab

[Screenshots: call-outs mark where to enable RCP in the Sequential File stage and the
checkbox that enables RCP in the Transformer]

Figure 7-13. Enabling RCP at Stage Level KM4001.0

Notes:
If RCP has been enabled for the project in Administrator, it can be turned on at the stage
level. How this is done varies somewhat for different types of stages. Shown here are the
Sequential File Stage and the Transformer stage.


When RCP is Disabled


• DataStage Designer enforces Stage Input to Output column mappings

[Screenshot: unmapped output columns are colored red; the job will not compile]

Figure 7-14. When RCP is Disabled KM4001.0

Notes:
When RCP is turned off, every output column must have an input column explicitly mapped
to it. Otherwise, the job will not compile. In this example, this is indicated by the columns in
red.


When RCP is Enabled


• DataStage does not enforce mapping rules
• Runtime error if no incoming columns match unmapped target column names

[Screenshot: the same columns, no longer red; the job will compile]

Figure 7-15. When RCP is Enabled KM4001.0

Notes:
When RCP is turned on, output columns do not have to have input columns explicitly
mapped to them. The job will compile. In this example, this is indicated by the columns not
being in red. However, a runtime error will occur if no incoming columns match unmapped
target column names.


Where do RCP columns come from? (1)


• Columns previously defined in the job flow
– Suppose a Copy stage earlier in the flow has an output column named
“Address” and Address does not explicitly continue in the flow

[Diagram: columns flowing through a job. The Copy stage columns are CustID, Name, and
Address; the Address column is not pushed through explicitly. The Transformer stage
columns are CustID and Name; nothing is explicitly mapped to an output column Address,
yet Address continues to flow by RCP]

Figure 7-16. Where do RCP columns come from? (1) KM4001.0

Notes:
There are a number of ways in which implicit columns (columns not explicitly defined on a
stage Columns tab) can get into the job. One way, shown here, is from columns previously
defined in the job flow.


Where do RCP columns come from? (2)


• A sequential file is read using a schema file
– Column data values flowing in get the column names used to read
them
• In the example below, “Ohio” flows into the Address column
– If the schema file is parameterized, the names used to read them
can change in different job runs
[Diagram: a row of values read from a sequential file (33, “Alvin”, “Ohio”) takes the schema
column names CustID, Name, and Address; in the Transformer, nothing is explicitly
mapped to an output column Address, yet “Ohio” flows into it]

Figure 7-17. Where do RCP columns come from? (2) KM4001.0

Notes:
There are a number of ways in which implicit columns (columns not explicitly defined on a
stage Columns tab) can get into the job. Another way is from a sequential file read with a
Sequential File stage using a schema file.


Where do RCP columns come from? (3)


• A relational table read using SELECT *
[Diagram: a row of values (33, “Alvin”, “Ohio”) read from the Customer table with
SELECT * FROM Customers takes the table’s column names CustID, Name, and Address;
in the Transformer, nothing is explicitly mapped to any output columns]

Figure 7-18. Where do RCP columns come from? (3) KM4001.0

Notes:
There are a number of ways in which implicit columns (columns not explicitly defined on a
stage Columns tab) can get into the job. Another way is by reading from a relational table
using SELECT *.


Shared Containers


Figure 7-19. Shared Containers KM4001.0

Notes:


Shared containers
• Encapsulate job design components into a named stored container
• Provide named reusable job components stored in the Repository
– Example: Apply stored Transformer business logic to convert dates from one
format to another
• Can be inserted into jobs
– Inserted by reference: Changes made to the shared container outside of the job
will apply to the job
• Jobs that use the container need to be recompiled for the changes to take effect
– Can include job parameters
• If included, values or job parameters from the containing job must be specified for them
• Creating shared containers
– Can be created from scratch
– Easiest to save a set of stages and links within an existing job as a shared
container
• Can be converted to local containers (local to the job)
– Local containers can be “deconstructed”


Figure 7-20. Shared containers KM4001.0

Notes:
Shared containers encapsulate job design components into a named stored container. In
this way they provide named reusable job components stored in the Repository which can
be inserted into jobs.
Shared containers are inserted by reference: Changes made to the shared container
outside of the job will apply to the job, although the job must be recompiled.


Creating a shared container


• Select stages from an existing job
• Click Edit>Construct Container>Shared
• The selected components reformat two input date columns
– InDate1 and InDate2 are converted from year-first to year-last dates

[Screenshot: a job in Designer; call-outs mark the selected components and the menu
command that creates the shared container]

Figure 7-21. Creating a shared container KM4001.0

Notes:
The easiest way to create a shared container is by selecting components from within an
existing job. This also allows you to test the container at the time you build it.


Inside the shared container


• Input and Output stages are used in the shared container to
provide an interface to links in the containing job
– Number of Input / Output stages determines the number of expected
input / output links going to the container in the containing job
– Input / Output stage format determines the columns expected in the
input / output links going to the container from the containing job

[Screenshot: inside the shared container; call-outs mark the Input interface stage and the
Output interface stage]

Figure 7-22. Inside the shared container KM4001.0

Notes:
This shows the inside of a shared container. Input and Output stages are used in the
shared container to provide an interface to links in the job to which it is added. The
container will only work in a job if there are input and output links in the job that can be
validly mapped to the Input and Output stages of the container.


Inside the shared container Transformer


• Transformer will process two columns named InDate1 and
InDate2
• Other columns will flow through by RCP


Figure 7-23. Inside the shared container Transformer KM4001.0

Notes:
In this example shared container, the Transformer will process two columns named
InDate1 and InDate2. Other columns will flow through by RCP. Two columns will need to
match InDate1 and InDate2 by name.
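As a sketch of what one of those derivations might look like, assuming InDate1 arrives as a
year-first string such as '2011-07-15' and the output should be year-last ('15-07-2011'),
using the Transformer's substring and concatenation operators:

    InDate1[9,2] : '-' : InDate1[6,2] : '-' : InDate1[1,4]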


Using a shared container in a job


• Input links going into the container must match the number of expected
input links
• Output links going out of the container must match the number of
expected output links
• If RCP is being used input and output link columns are matched by
name
• If RCP is not being used, the number, order, and types of columns
must match up
– Names do not have to match

[Screenshot: a job containing the shared container in its flow]

Figure 7-24. Using a shared container in a job KM4001.0

Notes:
This shows the shared container in a job. If RCP is being used, input and output link
columns are matched by name. If RCP is not being used, the number, order, and types of
columns must match up.


Mapping input / output links to the container


• Select the container link to map input link to
– Metadata in the input link to the shared container must match the
input interface
• Click Validate to validate the interface mapping compatibility

[Screenshot: the shared container stage editor; call-outs mark where to specify property
(job parameter) values, where to specify the output link mapping, and where to select the
container link to map the input link to]

Figure 7-25. Mapping input / output links to the container KM4001.0

Notes:
After you link up the shared container in the job, open it. A shared container can contain
job parameters; if it does, their values can be specified on the Stage tab. On the Inputs and
Outputs tabs, map job links to the container links. Click Validate to validate the interface
mapping compatibility.


Interfacing with the shared container

[Screenshot: the container input mapping; call-outs mark the input link columns from the
job, the input columns passed to the container, and the container input columns]

Figure 7-26. Interfacing with the shared container KM4001.0

Notes:
This shows one type of interface where RCP is being used. So there are many input
columns in the link mapped to the container. Two of the columns have to match InDate1
and InDate2. The other columns will flow through the container by RCP.


Checkpoint
1. What are two benefits of RCP?
2. What can you use to encapsulate stages and links in a
job to make them reusable?


Figure 7-27. Checkpoint KM4001.0

Notes:
Write down your answers here:
1.
2.


Exercise 7 – Reusable components


• In this lab exercise, you will:
– Create schema file
– Read a sequential file using a schema
– Create a flexible job
– Create and use a shared container


Figure 7-28. Exercise 7 - Reusable components KM4001.0

Notes:


Unit summary
Having completed this unit, you should be able to:
• Create a schema file
• Read a sequential file using a schema
• Describe Runtime Column Propagation (RCP)
• Enable and disable RCP
• Create and use shared containers


Figure 7-29. Unit summary KM4001.0

Notes:

Unit 8. Advanced Transformer Logic

What this unit is about


This unit describes advanced Transformer processing.

What you should be able to do


After completing this unit, you should be able to:
• Describe and set null handling in the Transformer
• Use Loop processing in the Transformer
• Process groups in the Transformer


Unit objectives
After completing this unit, you should be able to:
• Describe and set null handling in the Transformer
• Use Loop processing in the Transformer
• Process groups in the Transformer


Figure 8-1. Unit objectives KM4001.0

Notes:


Transformer Null Handling


Figure 8-2. Transformer Null Handling KM4001.0

Notes:


Transformer legacy null handling


• Before IS release 8.5, input rows processed with unhandled nulls were
dropped or rejected by the Transformer stage
– The Transformer Legacy null processing option can be used to set the
Transformer to this behavior
• Turned off by default for new IS 8.5 jobs
• Turned on by default for imported legacy IS jobs
• Set on Transformer stage properties Stage>General tab
– Example:
• Fname : Lname : Address1 : City : State : PostalCode -> Address
– Here, Address is a stage variable or output column in a Transformer. All others are input
columns
• If Address1 or any other of the input columns is null in the current row being
processed, then the row will be rejected
• Set the Abort on unhandled null option to abort the job when a row
with an unhandled null is processed by the Transformer
• Add a reject link from the Transformer to capture rejected rows
– This property is not compatible with the Abort on unhandled null option


Figure 8-3. Transformer legacy null handling KM4001.0

Notes:
Before IS release 8.5, input rows processed with unhandled nulls were dropped or rejected
by the Transformer stage. This behavior is called legacy null handling. With IS 8.5, this
behavior can be turned on or off. With legacy behavior it is recommended that you add a
reject link from the Transformer to capture rejected rows.
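One way to keep such rows flowing instead is to handle the nulls explicitly in the derivation;
a minimal sketch using the slide's columns and the NullToEmpty function (the link name is
invented):

    lnk_in.Fname : lnk_in.Lname : NullToEmpty(lnk_in.Address1) :
    lnk_in.City : lnk_in.State : NullToEmpty(lnk_in.PostalCode)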


Legacy null processing example


• Source file contains nulls in FName and Zip columns
• Reject link added to Transformer to capture rows rejected by the
Transformer
• Derivation for a Transformer stage variable contains Zip
• Derivation for a target column contains FName


Figure 8-4. Legacy null processing example KM4001.0

Notes:
This is a legacy null processing example. Notice the reject link to capture rejected rows.
Here we assume that the source file contains nulls in the FName and Zip columns.
Suppose that a derivation for a Transformer stage variable contains Zip. And suppose that
a derivation for a target column contains FName. It is expected that rows containing these
nulls will be rejected.


Inside the Transformer stage

[Screenshot: inside the Transformer; call-outs mark a nullable input column used in a
stage variable derivation and a nullable input column used in an output column derivation]

Figure 8-5. Inside the Transformer stage KM4001.0

Notes:
This shows the inside of the Transformer stage. There is a nullable input column in the
derivation of the stage variable and in the derivation for the output column.


Transformer stage properties

[Screenshot: the Transformer Stage Properties General tab; call-outs mark the Legacy null
processing and Abort on unhandled null options]

Figure 8-6. Transformer stage properties KM4001.0

Notes:
Open the Transformer Stage Properties window to specify legacy null handling.


Results
• First case: input rows with nulls are rejected or dropped
• Second case: when the Abort on unhandled null option is set, the job aborts

[Screenshots: job log results. First case: records are rejected. Second case: the job aborts]
Figure 8-7. Results KM4001.0

Notes:
Now let us look at some results. In the first test, input rows with nulls are rejected or
dropped as we see from messages in the job log. In the second test where the Abort on
unhandled null option is set, the job aborts.

8-8 Advanced DataStage v8 © Copyright IBM Corp. 2005, 2011


Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.
V5.4
Student Notebook

Uempty

Transformer non-legacy null handling


• With IS release 8.5, derivations involving unhandled nulls
return null
– The Transformer Legacy null processing option must be turned off
in the Transformer stage properties to get this behavior
– Example:
• Fname : Lname : Address1 : City : State : PostalCode -> Address
– Here, Address is a stage variable or output column in a Transformer. All
others are input columns
• If Address1 or any other of the input columns is null in the current row
being processed, then null will be written to the Address stage variable
• Set the Abort on unhandled null option to abort the job when
a row with an unhandled null is processed by the Transformer
– Expressions containing nulls will not abort the job
• They will evaluate to null
– Nulls written to non-nullable output columns will abort the job

Figure 8-8. Transformer non-legacy null handling KM4001.0

Notes:
With IS release 8.5 non-legacy null handling, derivations involving unhandled nulls return
null. This is true whether they are derivations for output columns or stage variables.
What happens if you set the Abort on unhandled null option? Expressions containing
nulls will not abort the job, since they are being handled by evaluating to null.
However, be aware that nulls written to non-nullable output columns will abort the job. This
is always true.


Transformer stage properties

[Screenshot: the Transformer Stage Properties General tab with Legacy null processing
unchecked (non-legacy null processing); a call-out marks the Abort on unhandled null
option]

Figure 8-9. Transformer stage properties KM4001.0

Notes:
This shows the Non-legacy setting on the Transformer Stage Properties General tab.


Results with non-legacy null processing


• Case
– Legacy null processing not set
– Abort on unhandled null not set
• Results:
– If output columns are nullable, no rows are rejected
– If output columns are non-nullable, rows with nulls are dropped or rejected
• Case
– Legacy null processing not set
– Abort on unhandled null is set
• Result
– Transform operator aborts thereby aborting the job when target
column is non-nullable


Figure 8-10. Results with non-legacy null processing KM4001.0

Notes:
This slide summarizes the results for non-legacy null processing.


Transformer Loop Processing


Figure 8-11. Transformer Loop Processing KM4001.0

Notes:


Transformer loop processing


• Enable each row to be processed an indefinite number of
times within a Transformer, each with separate outputs
• Multiple output links in a Transformer enable each row to be
processed multiple times, but this is fixed by the number of
output links
• Uses
– Process rows containing repeating input columns
• Output one row for each repeating input column
– Process rows containing multiple values within a single column
• Output one row for each value within the column
– Output multiple rows based on value within an input column
• Number of rows output depends on the value

© Copyright IBM Corporation 2011

Figure 8-12. Transformer loop processing KM4001.0

Notes:
Transformer loop processing enables each row to be processed an indefinite number of
times within a Transformer, each iteration producing its own output. Something similar can
be done without a loop by using multiple output links, but then the number of outputs is
fixed by the number of links. Loop processing has many uses, including the ability to
process rows containing an indefinite number of values within a single column.


Repeating columns example


• Input rows

ProdID  Size  Color1  Color2  Color3  Color4
P21     12    Red     Blue    Yellow  Black
P31     7     Green   Yellow  NULL    NULL
P41     8     Tan     Orange  Black   NULL

• First three output rows for the first input row

ProdID  Size  Color
P21     12    Red
P21     12    Blue
P21     12    Yellow

Figure 8-13. Repeating columns example KM4001.0

Notes:
Here is an example of rows that have repeating columns. A product can have up to four
colors. These colors are entered into the Color columns. Nulls are added if there are fewer
than four colors.
In this example, each output row carries just one of the product's colors. One row is output
for each color.


Solution using multiple-output links


• Each ColorN Transformer output link writes out a record based
on the value of the ColorN column in the input row
– Potentially, one output row is written out for each ColorN column
• The Funnel stage collects the records into one output stream

(Job design: the ColorN output links feed a Funnel stage)

Figure 8-14. Solution using multiple-output links KM4001.0

Notes:
This example shows how this can be done using multiple output links. Each ColorN
Transformer output link writes out a record based on the value of the ColorN column in the
input row. The Funnel stage collects the records into one output stream.


Inside the Transformer stage


• Each output link writes out a row with a single color
• Constraints check that the color input column is not null

(Screenshot callout: Constraint)


Figure 8-15. Inside the Transformer stage KM4001.0

Notes:
Inside the Transformer, the constraint for each Color link checks whether the column
contains a color or is null.
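A minimal sketch of one such constraint, using illustrative link names: the output link for the second color would carry the constraint

   Not(IsNull(lnkIn.Color2))

with the link's Color output column derived directly from lnkIn.Color2. Each of the four output links repeats this pattern for its own ColorN column.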


Limitations of the multiple output links solution


• Works only in cases where the maximum number of potential output rows is
known
– Because the input record format is fixed in this case, we know the maximum number, but
this is not always the case
• May not be known if the repeating values are stored within a single column
• May not be known in cases where the record formats are variable

ProdID  Size  Colors
P21     12    Red | Blue | Yellow | Black
P31     12    Green | Yellow
P41     12    Tan | Orange | Black

• Requires the extra step of collecting all the output rows from the multiple links
into a single stream


Figure 8-16. Limitations of the multiple output links solution KM4001.0

Notes:
The main limitation of the multiple output links solution is that it works only in cases where
the maximum number of potential output rows is known.
We can also imagine a similar case involving variable record formats. The variable record
formats example is similar to the one shown here, except that each of the colors is in a
separate column. So the first row would have six columns, the second four, and the third
five. How to read a file with multiple record formats is discussed elsewhere in this course.


Loop processing
• For each row read into the Transformer, the loop condition is
tested
• While the loop condition remains true, the loop variables are
processed in order from top to bottom
– After the loop variables are processed each output link is processed
• If the output link constraint is true then process output columns
– When the loop condition is false, the loop variables are not processed
and no output rows are written out


Figure 8-17. Loop processing KM4001.0

Notes:
Now let us see how this can be done with loop processing. For each row read into the
Transformer, the loop condition is tested. While the loop condition remains true, the loop
variables are processed in order from top to bottom.


Creating the loop condition


• The loop continues while the loop condition remains true
– It should become false after the input row has been fully processed
• @ITERATION system variable holds a count of the number of
times the loop has iterated, starting at 1
– Reset to 1 when a new input row is read by the Transformer
• If you can determine the number of iterations that are needed,
then the loop condition can be specified as follows:
– @ITERATION <= numNeededIterations
– For a delimited list of items, the Count function can be used:
• Count(“Red/Blue/Green”, “/”) + 1 -> 3 items
• Loop iteration warning threshold
– Warning written to log when threshold is reached
– Designed to inform about loops that never stop iterating


Figure 8-18. Creating the loop condition KM4001.0

Notes:
Each loop has a loop condition. The loop continues while the loop condition remains true. It
should become false after the input row has been fully processed. The @ITERATION
system variable can be used in the condition.
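A minimal sketch, with illustrative link and column names: if lnkIn.Colors holds a delimited list such as "Red/Blue/Green", the loop condition could be written as

   @ITERATION <= Count(lnkIn.Colors, "/") + 1

and a loop variable could extract the current item with Field(lnkIn.Colors, "/", @ITERATION). The condition becomes false once every item in the list has been processed.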


Loop variables
• Executed in order from top to bottom
• Similar to stage variables
• Defined on Transformer Stage>Loop Variables tab
• Loop variables can be referenced in derivations for other loop
variables and in derivations for output columns


Figure 8-19. Loop variables KM4001.0

Notes:
Loop variables are similar to stage variables. Their derivations are executed in order from
top to bottom.


Repeating columns solution using a loop


• For each input row, create a colors list string and store it in a
stage variable:
– “Red/Blue/Yellow/Black” -> varColors stage variable
• Create the loop condition
– Use the Count() function to determine the number of colors in the list
• Use the Field function to extract the next color as you iterate
through the list
(Job design callout: only one output link is needed)


Figure 8-20. Repeating columns solution using a loop KM4001.0

Notes:
Here is a repeating columns solution using a loop. Notice that the job is simpler in overall
design. This slide outlines the main steps of the solution.


Inside the Transformer

(Screenshot callouts, top to bottom: create colors list; count number of colors; loop
condition; extract color for this iteration; output row with extracted color)


Figure 8-21. Inside the Transformer KM4001.0

Notes:
The varColors stage variable is used to create a colors list by examining the Color
columns in the input row. Then the number of colors in the list is counted. varColorCount
contains the number of colors to process in the loop. This variable is used in the loop
condition. During each loop iteration, the color corresponding to the iteration number is
extracted from the list and written out in a separate row.
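One plausible set of derivations, using the variable names above (varColor is an illustrative loop variable name, and the expression that assembles varColors from Color1 through Color4 depends on how nulls are handled):

   varColorCount:  Count(varColors, "/") + 1
   Loop condition: @ITERATION <= varColorCount
   varColor:       Field(varColors, "/", @ITERATION)

The output link's Color column is then derived from varColor, so each iteration writes one row carrying one color.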


Transformer Group Processing


Figure 8-22. Transformer Group Processing KM4001.0

Notes:


Transformer group processing


• Provides an alternative to using an Aggregator stage
– Can add results to individual rows without using Fork-Join job design
• Offers better performance than Fork-Join design
– Can perform aggregations not possible in Aggregator stage
• For example, create a list of the group row IDs
• Perform calculations not available in the Aggregator stage
• Validate entries before including them in the aggregation
• LastRowInGroup(In.Col) function can be used to
determine when the last row in a group is being processed
– Transformer stage must be preceded by a Sort stage that sorts the
data by the group key columns


Figure 8-23. Transformer group processing KM4001.0

Notes:
Transformer group processing provides an alternative to using an Aggregator stage. It can
add aggregation results to individual rows without using a Fork-Join job design. The
LastRowInGroup(In.Col) function can be used to determine when the last row in a group
is being processed.


Building a Transformer group processing job (1)


• In this example, the aggregation result will be added to each row
• Precede Transformer by a Sort stage
– Sort key columns define the groups
• If data is already sorted, use “Don’t Sort (Previously Sorted)” option
• Use SaveInputRecord() function to save the group records in the
Transformer queue
– They are saved so that the aggregation result can be added to each one before it
is written out of the Transformer
– Execute in a stage variable derivation
– Returns the number of rows in the queue
• Use the LastRowInGroup(group_key_columns) function to determine
when the last row in the group is being processed
– Execute in a stage variable derivation
– Returns True (1) if last row in the group is being processed; else returns False (0)
– Requires a Sort stage before the Transformer


Figure 8-24. Building a Transformer group processing job (1) KM4001.0

Notes:
This slide lists the main steps of our example. In this example, the aggregation result will be
added to each row.


Building a Transformer group processing job (2)


• Perform aggregations
– Build summary results as rows are read in and store the results in stage variables
– Initialize the stage variable (RESULT) that stores the results before the next group
is read in
– Save the final aggregation result in a stage variable before initializing the stage
variable that stores the results (FINAL_RESULT)
• The FINAL_RESULT stage variable should precede the RESULT stage variable
– So the derivation for FINAL_RESULT references the RESULT before it gets initialized
• Determine the number of loop iterations
– Typically, equal to the number of rows in the queue after the last row in the group
has been read in
• Make sure there are no loop iterations before the last row of the group!
• Use GetSavedInputRecord() in a loop variable derivation
– Returns the index of the saved row
– Populates the input columns just as if the row had been read in by the
Transformer from the input link
• Mappings from input columns to output columns will move values from the saved
record


Figure 8-25. Building a Transformer group processing job (2) KM4001.0

Notes:
This slide continues the list of the main steps of our example.


Group processing example job


• For each customer row:
– Count of the number of customers in the same postal code
– List of the customer IDs in the same postal code

(Job design callouts: customer rows; sort by group key; Count and List of IDs outputs)



Figure 8-26. Group processing example job KM4001.0

Notes:
This slide shows the job and the expected results. For each customer row, the job produces
a count of the customers in the same postal code and a list of the customer IDs in that
postal code. The Sort stage is required when using the LastRowInGroup(group_key_columns) function.


Transformer stage variables


• NumSavedRows: Number of rows in queue
• IsBreak: Last row of the group?
• ZipCount: Number of customers in same postal code
• TotalZipList: List of customers in same postal code
• ZipList: Running list of customer IDs in the group
• NumIterations: Number of loop iterations to perform


Figure 8-27. Transformer stage variables KM4001.0

Notes:
This slide lists the stage variables that will be used.


Stage Variable Derivations


• SaveInputRecord() returns number of saved rows
• A change in Zip column value indicates a break
• The number of IDs in the list is equal to the number of records in the
queue
• At the time the derivation for TotalZipList is executed, ZipList contains
all the IDs except for the current row
• NumIterations, which is used in the Loop condition, must be 0 except at
the break
– So it only runs when all the rows in the group are in the queue


Figure 8-28. Stage Variable Derivations KM4001.0

Notes:
This slide explains the derivations for each of the stage variables.
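One plausible set of derivations matching these descriptions (link and column names, the list delimiter, and the handling of the first row of a group are illustrative assumptions):

   NumSavedRows:  SaveInputRecord()
   IsBreak:       LastRowInGroup(lnkIn.Zip)
   ZipCount:      If IsBreak Then NumSavedRows Else 0
   TotalZipList:  If IsBreak Then ZipList : "," : lnkIn.CustID Else ""
   ZipList:       If IsBreak Then "" Else If ZipList = "" Then lnkIn.CustID
                  Else ZipList : "," : lnkIn.CustID
   NumIterations: If IsBreak Then NumSavedRows Else 0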


Specifying the Loop


• The @ITERATION system variable is initialized to 1 before the
loop is processed
– Incremented by 1 after each iteration
• GetSavedInputRecord() retrieves the next row from the
queue
– First row retrieved is the first row put into the queue (row index 1)


Figure 8-29. Specifying the Loop KM4001.0

Notes:
The loop condition uses the @ITERATION system variable. The GetSavedInputRecord()
function retrieves the next row from the queue.
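A minimal sketch, using the stage variables defined earlier (output link and column names are illustrative):

   Loop condition:         @ITERATION <= NumIterations
   Loop variable SavedRow: GetSavedInputRecord()
   Output columns:         lnkOut.ZipCount derived from ZipCount
                           lnkOut.ZipList derived from TotalZipList

Because NumIterations is 0 except at a group break, the loop runs only after the whole group has been queued; each call to GetSavedInputRecord() repopulates the input columns from the next saved row before the output link is processed.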


Runtime errors
• The number of calls to GetSavedInputRecord() must be equal to the number of
calls to SaveInputRecord()
– Runtime error if GetSavedInputRecord() is called before SaveInputRecord() is called
– Runtime error if GetSavedInputRecord() is called three times but SaveInputRecord() was
only called twice
– Runtime error if SaveInputRecord() is called but GetSavedInputRecord() is never called
• After GetSavedInputRecord() is called once, it must be called enough times to
empty the queue before another call to SaveInputRecord()
– Runtime error if there are only two iterations of the loop, each iteration calling
GetSavedInputRecord(), but there are three or more records in the queue
• A warning is written to the job log whenever a multiple of the loop iteration
warning threshold is reached
– Set in Transformer Stage properties Loop Variables tab or using the
APT_TRANSFORM_LOOP_WARNING_THRESHOLD environment variable
• Set to 10,000 by default
• Applies both to the number of loop iterations and the number of records written to the queue
• Jobs can be set to abort after a certain number of warnings in the Job Run Options window


Figure 8-30. Runtime errors KM4001.0

Notes:
A number of runtime errors are possible. This slide lists some of the main cases.


Validating rows before saving them in the queue


• That is, rows that do not meet certain conditions are not saved
in the queue
– Drops the row
• For example, only save rows with valid customer IDs
(ID>=10100)
• For example, only save rows with valid postal codes
(zip>10000 and <= 99999)
• Programming complication:
– Be careful not to include data from invalid rows in group totals


Figure 8-31. Validating rows before saving them in the queue KM4001.0

Notes:
One thing you can do in Transformer group processing that cannot be done using, for
example, an Aggregator stage is to validate the rows before saving them in the queue.
Invalid rows do not become part of the group summary.
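A minimal sketch of such validation, reusing the stage variables above (IsValid is an illustrative name; the tests come from the slide):

   IsValid:      lnkIn.CustID >= 10100 And lnkIn.Zip > 10000 And lnkIn.Zip <= 99999
   NumSavedRows: If IsValid Then SaveInputRecord() Else NumSavedRows

Because SaveInputRecord() executes only for valid rows, invalid rows never enter the queue; any running totals kept in other stage variables must be guarded with IsValid as well.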


Checkpoint
1. What function can you use in a Transformer to determine
when you are processing the last row in a group? What
additional stage is required to use this function?
2. What function can you use in a Transformer to save copies
of input rows?
3. What function can you use in a Transformer to retrieve saved
rows?


Figure 8-32. Checkpoint KM4001.0

Notes:
Write your answers here:


Exercise 8 – Transformer Logic


• In this lab exercise, you will:
– Explore Transformer stage legacy null
processing
– Explore Transformer stage non-legacy
null processing
– For comparison, first process repeating
columns using multiple output links
– Then process repeating columns using
a loop
– Process groups in a Transformer stage


Figure 8-33. Exercise 8 - Transformer Logic KM4001.0

Notes:


Unit summary
Having completed this unit, you should be able to:
• Describe and set null handling in the Transformer
• Use Loop processing in the Transformer
• Process groups in the Transformer


Figure 8-34. Unit summary KM4001.0

Notes:


Unit 9. Extending the Functionality of Parallel Jobs

What this unit is about


This unit describes the three ways of extending the functionality of
DataStage jobs. This includes creating Build and Wrapped stages and
adding C++ functions written outside of DataStage to the list of
available Transformer functions.

What you should be able to do


After completing this unit, you should be able to:
• Create Wrapped stages
• Create Build stages
• Create new External Function routines
• Describe Custom stages

How you will check your progress


• Lab exercises and checkpoint questions.


Unit objectives
After completing this unit, you should be able to:
• Create Wrapped stages
• Create Build stages
• Create new External Function routines
• Describe Custom stages


Figure 9-1. Unit objectives KM4001.0

Notes:


Ways of adding new functionality


• Build stages
– A way of creating a new stage using the DataStage GUI that compiles into a new framework
operator
– You define the new stage and specify:
• Properties
• Input and output interfaces
• C++ source code to be compiled and executed
– C++ source is created, compiled, and linked inside of DataStage
• Custom stages
– A way of creating a new stage in C++ that compiles into a new framework operator
• New operators are instantiations of the APT_Operator class
– You define a new stage that invokes the custom operator
• Property values are passed to the operator by the stage
– C++ source is created, compiled, and linked outside of DataStage
• Wrapped stages
– Wrap an existing executable into a new custom stage
• External functions
– Define a new parallel routine (function) to use in Transformer stages
– Specify input arguments
– C++ function is created, compiled, and linked outside of DataStage


Figure 9-2. Ways of adding new functionality KM4001.0

Notes:
This slide lists and describes four ways the functionality of DataStage can be extended.
Custom stages are beyond the scope of this course. They require low-level knowledge of
C++ and the Framework class libraries.


Wrapped Stages


Figure 9-3. Wrapped Stages KM4001.0

Notes:


Building Wrapped stages

You can “wrap” an executable:


> Binary
> Unix command
> Shell script

… and turn it into a custom stage capable of parallel execution …

As long as the legacy executable is:


• Amenable to data-partition parallelism
> No dependencies between rows
• Pipe-safe
> Can read rows sequentially
> No random access to data


Figure 9-4. Building Wrapped stages KM4001.0

Notes:
You may improve performance of an existing legacy application that meets the
requirements for parallelism by wrapping it and running it in parallel.


Wrapped stages

Wrapped stages are treated as “black boxes”


• DataStage has no knowledge of contents
• DataStage has no means of managing anything that occurs inside the
wrapped stage
• DataStage only knows how to export data into and import data out of
the wrapped stage
• User must know at design time the intended behavior of the wrapped
stage and its schema interface


Figure 9-5. Wrapped stages KM4001.0

Notes:
Wrapped stages are treated as "black boxes". DataStage has no knowledge of their contents
and no means of managing anything that occurs inside the wrapped stage.


Wrapped stage example


• Create a source stage that produces a listing of files
– Wrap the UNIX ls command
– Pass a directory path as a parameter
• The new stage will have no inputs and one output
– Output will be a single VarChar column containing the list returned from the ls
command
• To create a new Wrapped stage
– Specify name of stage and operator
– Specify how to invoke the executable
– Specify properties to be passed to the operator, including:
• -Name Value: Pass the value preceded by the name of the property
• Value only: Pass just the value to the executable
– Create and load table definitions that define the input and output interfaces


Figure 9-6. Wrapped stage example KM4001.0

Notes:
We will take a look at a Wrapped stage example. This example will wrap the UNIX ls
command. It will have one property: the directory to be listed.


Creating a Wrapped stage

(Screenshot callouts: new stage type; default execution mode; executable command)

Figure 9-7. Creating a Wrapped stage KM4001.0

Notes:
To create a new Wrapped stage, click the right mouse button over the Stage Types folder
and then click New>Other>Parallel Stage Type (Wrapped). Specify the new stage type
and the command to be executed.


Defining the Wrapped stage interfaces

(Screenshot callouts: optionally define properties; select table definition defining the
output; generate new stage)


Figure 9-8. Defining the Wrapped stage interfaces KM4001.0

Notes:
On the Interfaces tab, specify input and output interfaces. This is done by selecting an
existing table definition that specifies the expected input or output columns and their types.


Specifying Wrapped stage properties

(Screenshot callouts: property name; required or optional?; how to convert the property
passed to the operator)


Figure 9-9. Specifying Wrapped stage properties KM4001.0

Notes:
On the Properties tab you can specify stage properties. This stage has one property
named Dir, the directory to be listed.
Here we want to run, for example, ls c:/KM400Files, so we choose the conversion
property Value Only. If you chose -Name Value, the following would be executed: ls -Dir
c:/KM400Files. This is not proper syntax for the ls command, so the job would abort.


Job with Wrapped stage

(Screenshot callout: Wrapped stage)


Figure 9-10. Job with Wrapped stage KM4001.0

Notes:
This shows a job with a Wrapped stage. The stage functions as any other stage does.
Also shown here are some sample results of running the job.


Exercise 9 – Wrapped stages


• In this lab exercise, you will:
– Create a simple Wrapped stage
– Create a job that uses the created
Wrapped stage


Figure 9-11. Exercise 9 - Wrapped stages KM4001.0

Notes:


Build Stages


Figure 9-12. Build Stages KM4001.0

Notes:


Build stages

• Work like existing parallel stages


• Extend the functionality of parallel jobs
• Can be used in any parallel jobs
• Coded in C++
– Predefined macros can be used in the code
– Predefined header files make additional framework classes and
class functions available
• Documentation
– Parallel Job Advanced Developer Guide: “Specifying Your Own
Parallel Stages”


Figure 9-13. Build stages KM4001.0

Notes:
Build stages, like Wrapped stages, work like existing parallel stages and extend the
functionality of parallel jobs. They differ from Wrapped stages in that their functionality is
created in DataStage using C++ code.


Example job with Build stage

(Screenshot callout: Build stage)


Figure 9-14. Example job with Build stage KM4001.0

Notes:
This shows a job with a Build stage. The stage functions as any other stage does. The
Copy stage here is used to change column names so that they match the Build stage's
expected interface.


Creating a new Build stage

(Screenshot callouts: stage name; class instantiation name from the APT_Operator
class; framework operator name)
• To create a new Build stage
– Click right mouse button over Repository folder
– Click New>Other>Parallel Stage Type (Build)


Figure 9-15. Creating a new Build stage KM4001.0

Notes:
The Build stage window is similar to the Wrapped stage window. On the General tab you
provide the stage type name and the name of the operator that will be generated by the
stage.


Build stage elements


• Properties
– Defined properties show up on stage Properties tab
• Input / output
– Interfaces
• Build stages require at least one input and one output
• Build stages have static interfaces, cannot dynamically access a schema
– Reads / Writes
• Specify auto or noauto
• Framework macros can be used to explicitly execute reads and writes
• Transfer method
– Specify auto or noauto
– Combine or Separate transfers
– Framework macros can be used to explicitly execute transfers
• Code
– Variable definitions and initializations
– Pre-loop: Code executed before rows are read
– Per-record: Code executed for each row read
– Post-loop: Code executed after all rows have been read


Figure 9-16. Build stage elements KM4001.0

Notes:
This slide lists the main tasks that need to be done in the Build stage: specify properties,
define input/output interfaces, specify the transfer method, and write the C++ code.


Anatomy of a Build stage

(Diagram: an input link with columns a, b feeds input port in0, whose interface columns are
a, b. The Build stage contains Properties, Definitions, and the Pre-Loop, Per-Record, and
Post-Loop code sections. Output port out0, with interface columns a, b, x, y, feeds an
output link with columns a, b, x, y. Reads and writes can each be auto or noauto; the
transfer can be auto or noauto, and combine or separate.)

• Interface input / output fields are defined by table definitions


• C++ variables, includes added to Definitions tab
• C++ code added to Pre-Loop, Per-Record, Post-Loop of Build tab


Figure 9-17. Anatomy of a Build stage KM4001.0

Notes:
On the left is the input virtual data set whose rows will be read. Its schema provides the
names and types of the input field values. On the right is the output virtual data set whose
rows will be written. Its schema provides the names and types of the output field values.
The Build stage has an input interface consisting of one or more ports along with their
schemas. This interface is specified by means of Table Definitions referenced when
building the stage, one Table Definition for each input port. The Build stage also has an
output interface consisting of one or more ports along with their schemas. This interface is
specified by means of Table Definitions referenced when building the stage.
There are three ways for data to move across or through the Build stage:
(1) Code assignments. Fields enumerated in the output interface can be assigned values.
These values can be based on values in referenced input fields. (2) Transfers. When a
Transfer is specified (whether automatic or manual), the whole input record is copied to the
output schema. The input record includes columns of values specified in the input interface
as well as all the columns in the input data set schema. The transferred columns are added
after the columns explicitly enumerated in the output interface. (3) RCP. RCP functions like
a Transfer. RCP on an input port adds the whole input record to the input as a block of
fields. RCP on an output port copies the whole input record to the output schema. RCP
must be turned on for both the input and output. RCP is also incompatible with Transfer.


Defining the input, output interfaces


• Create table definitions for each input and output link
– Table definitions are required if you want to reference specific input/output
fields (which you do most of the time)
• List inputs and outputs on Interfaces tab
– Provide input port name: Default is in0, in1, …
– Provide output port name: Default is out0, out1, …
– Specify Auto Read / Auto Write
– RCP
• Leave this False
• This should not be confused with RCP in parallel jobs.
• Input/Output macros can be used to explicitly control reads and
writes
– readRecord(port_number), writeRecord(port_number)
– inputDone(port_number)
• Need to execute after read, before referencing fields populated by the read


Figure 9-18. Defining the input, output interfaces KM4001.0

Notes:
To define the interfaces, first create table definitions for each input and output link. Then
specify whether you want the stage to automatically read and write records or you want to
control this using input/output macros.


Interface table definition

• This DataStage table definition defines the input interface to a Build


stage
– Provides column names and types
• Choose C++ field data types for input interface
– Otherwise, class function and operator signatures will not directly apply
• The input link column to the Build stage must have columns with the
same names and compatible types as the input interface columns
– If necessary, use Copy stage to modify incoming field names

Figure 9-19. Interface table definition KM4001.0

Notes:
This shows an example of an interface table definition. Choose C++ field data types for
input interface. Otherwise, class function and operator signatures will not directly apply.


Specifying the input interface

(Screenshot callouts: port name, an alias for in0; auto / use macros; table definition
defining the interface; intra-stage RCP, an alternative to Transfer. Don't use!)

• Default port names are in0, in1, … in order defined


• Select table definition that defines the input fields
– Required to reference specific input columns in the C++ code


Figure 9-20. Specifying the input interface KM4001.0

Notes:
RCP is an alternative to the mechanism of transfer for moving data not explicitly assigned
to output columns through the stage. If it is turned on, then you cannot specify either
automatic or manual transfers. Without some special reason, it should be turned off, since
the transfer mechanism is more flexible.


Specifying the output interface

(Screenshot callouts: port name, an alias for out0; auto / no auto; table definition)

• Default port names are out0, out1, … in order defined


• Select table definition that defines the output interface


Figure 9-21. Specifying the output interface KM4001.0

Notes:
Specifying the output interface is similar to specifying the input interface. Select the table
definition that defines the output interface.


Transfer
• Used to pass unreferenced input link columns through the Build
stage
– Input link columns are passed as a block to the output link
– Auto transfers occur at end of each iteration of the Per-Record loop code
• Can be auto or no auto
• Transfer type can be combined or separate
– Combined: Second transfer to same output link (port) replaces the first
• Assumes same source metadata for each transfer
– Separate: Second transfer to same output link adds columns
• Assumes different source metadata for each transfer
• Transfer macros are used in code to explicitly transfer records from
input buffers to output buffers
– doTransfer(transfer_index): Index is the integer of a defined transfer: 0,
1, …
– doTransfersFrom(input)
– doTransfersTo(output)
– transferAndWriteRecord(output)


Figure 9-22. Transfer KM4001.0

Notes:
The transfer mechanism is used to pass unreferenced input link columns through the Build
stage. Transfers can be done automatically by the stage or manually specified in the code.
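As a minimal sketch of the manual alternative, assuming Auto Transfer and Auto Write are both turned off for the first output port, the Per-Record code could end with:

   doTransfer(0);             // execute transfer 0: copy the inRec.* block to the output buffer
   writeRecord(0);            // then write the record to output port 0

   // or, equivalently, in a single call:
   transferAndWriteRecord(0);

These are the macros listed above; which index applies depends on the order in which the transfers were defined.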


Defining a transfer

(Screenshot callouts: input port name; output port name; auto / no auto; transfer type:
combined / separate)

• Definition order defines the transfer index: 0, 1,…


• Refer to ports by specified names or default names
• Specify whether transfer is to be done automatically at the end of
each loop
• Specify type of transfer (separate or combined)

Figure 9-23. Defining a transfer KM4001.0

Notes:
Define the transfer on the Transfer tab. Specify the input port that is to be transferred to the
specified output port. Also specify whether transfer is to be done automatically at the end of
each loop.


Anatomy of a transfer

(Diagram: input link columns Qty, Price, TaxRate, OrderNum, ItemNum flow into the input
buffer, which holds the interface columns Qty, Price, TaxRate plus the inRec.* block. The
Per-Record code assigns Amount = Qty*Price…. The output buffer holds Amount plus the
outRec.* block and feeds the output link columns OrderNum, ItemNum, Qty, Price,
TaxRate, Amount.)
• Record reads (auto or explicit) bring enumerated input interface values into the input buffer. If a
transfer is specified, then the whole input record also comes in as a block of values
• Assignments in Per-Record move values to enumerated output fields
• Transfers copy input fields as a block (inRec.*) to the output buffer
• Duplicate columns coming from a transfer are dropped with warnings in log
– For example, if the input record contained a column named Amount, this would be dropped. Explicit
assignments in the code to output interface columns take precedence over Transferred column values
• If RCP is enabled instead of a Transfer, the picture is the same. If neither Transfer nor RCP is
specified, then the inRec.* and outRec.* will not exist

Figure 9-24. Anatomy of a transfer KM4001.0

Notes:
This slide looks at the anatomy of a transfer. Record reads (auto or explicit) bring
enumerated input interface values into the input buffer. If a transfer is specified, then the
whole input record also comes in as a block of values. Transfers copy input fields as a
block (inRec.*) to the output buffer. Duplicate columns coming from a transfer are dropped
with warnings in the log.


Defining stage properties


• Property specifications
– Name
– Data type
– Prompt
– Default value
– Required: Is the property required or optional?
– Conversion: Specifies how the property gets processed
• Most of the time you should choose –Name Value
– Other types are not generally useful
• Extended properties can also be specified


Figure 9-25. Defining stage properties KM4001.0

Notes:
Defining stage properties in a Build stage is similar to defining stage properties in a
Wrapped stage. Specifying the conversion is required. Most of the time you should choose
–Name Value.


Specifying properties

(Screenshot callouts: property type; default value; choose -Name Value)

• If the data type is List, open the Extended Properties window to define the members


Figure 9-26. Specifying properties KM4001.0

Notes:
In a Build stage, you would generally choose -Name Value. The other options are not very
useful since you are not invoking the operator created by the Build stage from the
command line.


Defining the Build stage logic

• Definitions
– Variables
– Include files
• Pre-Loop
– Code executed once, prior to entering the Per-Record loop
• Per-Record
– Executed for each input record
• Post-Loop
– Code executed once, after exiting the Per-Record loop


Figure 9-27. Defining the Build stage logic KM4001.0

Notes:
The code is written on several different tabs depending on its purpose. This slide lists and
describes the different tabs.


Definitions tab

• Define variables
• Include header files


Figure 9-28. Definitions tab KM4001.0

Notes:
On the Definitions tab you define variables and specify any header files you want to
include.


Pre-Loop tab

(Screenshot callout: initialize variable)

• Code to be executed before input records are processed


• This code is executed only once


Figure 9-29. Pre-Loop tab KM4001.0

Notes:
On the Pre-Loop tab, you specify the code that is to be executed before input records are
processed. This code is executed only once.


Per-Record tab

(Screenshot callouts: Build macro; qualified input column; unqualified output column)

• Code to be executed for each input record read in


• This code is executed once for each input record


Figure 9-30. Per-Record tab KM4001.0

Notes:
Most of your code will be written on the Per-Record tab. This is code to be executed for
each input record read in. In this example, the code is C++ code. Notice the macros that
are used in the code, for example endLoop().
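A minimal Per-Record sketch in the same spirit (port, column, and variable names are illustrative assumptions):

   // Skip rows with no quantity; nextLoop() moves on to the next input record
   if (in0.Qty == 0)
       nextLoop();
   // Assign an enumerated output interface column from qualified input columns
   Amount = in0.Qty * in0.Price;
   rowCount++;   // rowCount would be declared on the Definitions tab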


Post-Loop tab

(Screenshot callouts: Framework types; property; Framework functions)

• Code to be executed after all input records are processed


• This code is executed only once


Figure 9-31. Post-Loop tab KM4001.0

Notes:
On the Post-Loop tab you specify code to be executed after all input records are
processed. Notice in this example the reference to a property, Debug. This property was
defined on the Properties tab as shown earlier and can be referenced in the code.


Writing to the job log


• Use ostream standard output objects: cout, clog, cerr
– Standard output objects redirected to the DataStage log
– clog, cerr generate log warning messages
– Example: cout << “Message to write” << endl;
• Use the errorLog object
– Defined in errorlog.h
– Example:
• *errorLog() << "Message to write" << endl;
• errorLog().logInfo(index): generates informational log
message
• errorLog().logWarning(index): generates warning
• errorLog().logError(index): generates error


Figure 9-32. Writing to the job log KM4001.0

Notes:
Using the errorLog object is the only way to write error messages to the log (messages
with yellow or red icons by default). Writing an error message does not abort the job. To
abort the job, you can call the failstep() macro after writing an error message to the log.
The message number is an index to the message. However, this is not relevant for Build
stages, so you can choose any number.
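A minimal sketch combining these mechanisms (message text and index are arbitrary):

   cout << "Rows processed: " << rowCount << endl;    // informational; redirected to the job log
   *errorLog() << "Unexpected value in input row" << endl;
   errorLog().logWarning(1);                          // emits the buffered message as a warning
   // errorLog().logError(2); followed by failstep() would log an error and abort the job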


Using a Build stage in a job

(Screenshot callout: Build stage)


Figure 9-33. Using a Build stage in a job KM4001.0

Notes:
This shows an example of a job using a Build stage. A Build stage functions as any other
stage.


Stage properties

(Screenshot callouts: category and property; list of property values)


Figure 9-34. Stage properties KM4001.0

Notes:
This shows the inside of the Build stage in the job. Notice that the properties are displayed
and edited in the same way as other stages.


Build stages with multiple ports


• Input / output ports are indexed: 0, 1, 2,…
– Order defined by order of definition, top to bottom
– Example: writeRecord(1) writes to second output port/link
– Example: readRecord(0) reads from first input port/link
– Use Link Ordering on stage Properties tab to specify ordering
of links (ports)


Figure 9-35. Build stages with multiple ports KM4001.0

Notes:
Build stages can have multiple input/output ports. They are indexed 0, 1, 2, and so on. In
macros you can specify which port you are reading the record from or which port you are
writing the record to.


Build Macros


Figure 9-36. Build Macros KM4001.0

Notes:


Build macros
• Informational
– inputs() -> number of inputs
– outputs() -> number of outputs
– transfers() -> number of transfers
• Flow Control
– endLoop(): Exit Per-Record loop and go to Post-Loop code
– nextLoop(): Read next input record
– failStep(): Abort the job
• Input/Output
– Input / output ports are indexed: 0, 1, 2, …
– readRecord(index), writeRecord(index), inputDone(index)
– holdRecord(index): suspends next auto read
– discardRecord(index): suspends next auto write
– discardTransfer(index): suspends next auto transfer
• Transfers
– Transfers are indexed: 0, 1, 2, …
– doTransfer(index): Do specified transfer
– doTransfersFrom(index): Do all transfers from specified input
– doTransfersTo(index): Do all transfers to specified output
– transferAndWriteRecord(index): Do all transfers to specified output, then write a record


Figure 9-37. Build macros KM4001.0

Notes:
This slide lists most of the macros that are available to you and puts them into different
categories depending on their functions.


Turning off auto read, write, and transfer


(Screenshot callouts: turn off Auto Read; turn off Auto Write; turn off Auto Transfer)


Figure 9-38. Turning off auto read, write, and transfer KM4001.0

Notes:
In most cases it is simpler to let the Build stage automatically handle reading, writing, and
transferring records. But for maximum control you can turn off this functionality and
explicitly handle it yourself in the code.


Reading records using macros


• readRecord(0) reads a record from the first input link
• Use readRecord(0) in the pre-Loop logic to bring in the first
record
• After all records have been read, the execution of
readRecord(0) will not bring in usable input data for
processing
– Use inputDone() to test whether the current readRecord(0)
instance contains a genuine record to be processed


Figure 9-39. Reading records using macros KM4001.0

Notes:
This slide lists and describes the macros you can use for reading. readRecord(0) reads a
record from the first input link. Use readRecord(0) in the pre-Loop logic to bring in the first
record. You can use the inputDone() macro to test whether the current readRecord(0)
instance contains a genuine record to be processed.
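One common arrangement of these macros, as a sketch (assuming Auto Read is turned off for port 0):

   // Pre-Loop: prime the first record
   readRecord(0);

   // Per-Record:
   if (inputDone(0))
       endLoop();       // no genuine record remains; go on to the Post-Loop code
   // ... process the current input record here ...
   readRecord(0);       // bring in the next record for the following iteration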


APT Framework Classes


Figure 9-40. APT Framework Classes KM4001.0

Notes:


APT framework and utility classes


• Classes of functions and macros
– Use in Build stage code
• IBM/InformationServer/PXEngine/include/apt_util
– Classes of useful functions and macros
– Automatically included in Build stages
• No need to explicitly include them
– APT_ prefix distinguishes utility objects from standard C++ objects
– string.h
• Defines string handling functions and operators
– errlog.h
• Functions for writing messages to the DataStage log


Figure 9-41. APT framework and utility classes KM4001.0

Notes:
When you install DataStage, the APT framework and utility classes are installed. You can
include the header files for these classes and then use any of the class functions in your
code. The APT_ prefix distinguishes utility objects from standard C++ objects.


Framework class sampler


• APT_String
• APT_Decimal
• APT_Int32
• APT_DFloat
• APT_SFloat
• APT_Date
• APT_Time
• APT_TimeStamp


Figure 9-42. Framework class sampler KM4001.0

Notes:
This slide lists some of the APT classes. It is beyond the scope of this course to look at this
in detail. But you can open up the header files and study the functions that are available.


APT_String Build stage example

(Screenshot callouts: s1 and s2 are declared as APT_String variables; APT_String
assignment operator; APT_String concatenation operator; APT_String function)


Figure 9-43. APT_String Build stage example KM4001.0

Notes:
This slide shows an APT_String Build stage example. In this example s1 and s2 are
declared to be APT_String objects. This allows the + operator to be used for string
concatenation, as well as the toLower and toUpper class functions to be used.
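A sketch of the kind of code shown on the slide, with illustrative column and variable names:

   APT_String s1 = in0.FirstName;            // APT_String assignment operator
   APT_String s2 = s1 + " " + in0.LastName;  // + concatenates APT_String values
   FullName = s2.toUpper();                  // class function named in these notes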


Exercise 9 – Build stages


• In this exercise, you will:
– Create a simple Build stage
– Create a job that uses the created Build
stage
– Define and use properties in a Build
stage
– Send an error message from a Build
stage
– Use Build macros


Figure 9-44. Exercise 9 - Build stages KM4001.0

Notes:



External Functions


Figure 9-45. External Functions KM4001.0

Notes:


Parallel routines
• Two types:
– External function
• Returns a value
• Use in Transformer derivations and constraints
– External Before/After
• Does not return values
• Can be executed before/after a job runs or before/after a Transformer stage
• Specify in Job Properties or Transformer Stage Properties
• C++ function compiled outside of DataStage
– Object file
– Shared object library
• In DataStage, define routine metadata
– Input and output arguments
– Static / dynamic linking


Figure 9-46. Parallel routines KM4001.0

Notes:
External function routines extend the functionality of a Transformer stage. There are two
types. Our focus is on the external function type that can be used in the Transformer.



External function example

• Function returns “Y” if key words are found in the input string; else returns “N”


Figure 9-47. External function example KM4001.0

Notes:
The function itself is coded outside of DataStage in the usual C++ way. In this example,
keyWords returns a string (char*). It returns “Y” if it finds in the input parameter string
(inString) any of the words listed in the code (“hello”, “ugly”).
This is a simple example, but a function like it can serve a real business purpose. An
enterprise may want a function that checks text for specific business names.
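
A sketch of what the keyWords function might look like; the signature and word list follow
the Notes above, but the body is an assumption, not the course's verbatim code.

    #include <string.h>

    // Returns "Y" if any key word occurs in the input string; else "N"
    char* keyWords(char* inString)
    {
        if (strstr(inString, "hello") != NULL ||
            strstr(inString, "ugly") != NULL)
            return (char*)"Y";
        return (char*)"N";
    }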


Another external function example

(Screenshot callouts: the Framework classes include file; the APT_String class.)

• Function returns “Y” if key words are found in the input string; else
returns “N”
• This version of the function uses the APT_String class functions
– Note that orchestrate.h is included. This file includes all Framework classes


Figure 9-48. Another external function example KM4001.0

Notes:
This shows another example using class functions in the framework classes. The
framework class is included at the top of the code. Note here that orchestrate.h is
included. This file includes all Framework classes.
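
A sketch of the APT_String variant; including orchestrate.h and calling the toLower() class
function come from the course text, but terminatedContent() is a hypothetical accessor for
the underlying C string, and the real APT_String interface may differ.

    #include <orchestrate.h>   // includes all Framework classes
    #include <string.h>

    // Returns "Y" if any key word occurs in the input string; else "N"
    char* keyWords(APT_String inString)
    {
        APT_String s = inString;
        s.toLower();                              // normalize case (assumed in place)
        const char* raw = s.terminatedContent();  // hypothetical accessor
        if (strstr(raw, "hello") != NULL || strstr(raw, "ugly") != NULL)
            return (char*)"Y";
        return (char*)"N";
    }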



Creating an external function


(Screenshot callouts: the DataStage GUI function name; the external function name; the
static link option; the return type; the object file path.)

• To create an external function
– Right-click the Routines branch
– Click New Parallel Routine
– Select the External Function type

Figure 9-49. Creating an external function KM4001.0

Notes:
Once you have coded the external function outside of DataStage, you need to register it
within the DataStage GUI. In this example, the C++ executable object file is referenced in
the Library path box. The function return type (char*) is specified in the Return type box.


Defining the input arguments

• Define all the input arguments


• Only one input argument is defined in this example


Figure 9-50. Defining the input arguments KM4001.0

Notes:
As noted, the keyWords function has one input argument. Define this on the Arguments
tab.



Calling the external function

(Screenshot callout: the new function.)

• External functions are listed in the DSRoutines folder in the DataStage Expression Editor


Figure 9-51. Calling the external function KM4001.0

Notes:
Once created, the external function is available in the Transformer in the DSRoutines
folder.


Exercise 9 – External Function Routines


• In this lab exercise, you will:
– Create an External Function Routine
– Use an External Function in a
Transformer stage


Figure 9-52. Exercise 9 - External Function Routines KM4001.0

Notes:



Checkpoint
1. What is a Wrapper stage? How does it differ from a Build
stage?
2. What defines the input and output interfaces to Build and
Wrapper stages?
3. True or false? External functions are C++ functions that are coded within the DataStage
GUI.


Figure 9-53. Checkpoint KM4001.0

Notes:
Write your answers here:


Unit summary
Having completed this unit, you should be able to:
• Create Wrapped stages
• Create Build stages
• Create new External Function routines
• Describe Custom stages


Figure 9-54. Unit summary KM4001.0

Notes:


Unit 10. Accessing Databases

What this unit is about


This unit describes the functionality of the DataStage Connector
stages.

What you should be able to do


After completing this unit, you should be able to:
• Use Connector stages to read from relational tables
• Use Connector stages to write to relational tables
• Handle SQL errors in Connector stages
• Use Connector stages with multiple input links
• Optimize jobs that write to relational tables by separating inserts
from updates


Unit objectives
After completing this unit, you should be able to:
• Use Connector stages to read from relational tables
• Use Connector stages to write to relational tables
• Handle SQL errors in Connector stages
• Use Connector stages with multiple input links
• Optimize jobs that write to relational tables by separating
inserts from updates


Figure 10-1. Unit objectives KM4001.0

Notes:



Overview


Figure 10-2. Overview KM4001.0

Notes:


Connector stages
• Used to read and write to database tables
• Types of Connector stages
– Individual databases: DB2, Oracle, Teradata, and so on
– ODBC: connect to data sources using ODBC drivers
• Data sources include databases and non-relational data sources such as files
– Any source that has an ODBC driver available
– Classic Federation for z/OS stage
• Access mainframe data by means of a Federation Server database
– Requires Federation Server to be installed
• Provides an SQL interface to mainframe data
– Stored Procedure stage
• Extends functionality of DataStage
• Supports several database types: DB2, Oracle, SQL Server, Teradata
• Can be called once per job or once per row
• Supports input and output parameters
• Documentation
– See the set of Connectivity Guides for each database type


Figure 10-3. Connector stages KM4001.0

Notes:
Connector stages are used to read and write to database tables. This slide shows a partial
list of the main types; other supported types include Informix, Sybase, and SQL Server.



Connector stage usage


• Parallel support for both reading and writing
– Read: parallel connections to the server and modified SQL queries for each
connection
– Write: parallel connections to the server
– Supports bulk loading
– Multiple input links can be used to write rows to multiple tables within
the same unit of work
• Can be used for lookups
– Supports sparse lookups
• You can create your own SQL or let the stage generate the SQL
– Create your own SQL manually, using a tool outside of DataStage, or using SQL
Builder
• SQL Builder is accessible from within the stage and fully integrated with the stage
– The Connector stage optionally generates SQL based on the table name and
column definitions
• Supports Before / After SQL
– SQL statement to be processed once before or after data is processed by the
Connector stage
– Use, for example, to create or drop secondary indexes


Figure 10-4. Connector stage usage KM4001.0

Notes:
Connector stages offer parallel support for both reading and writing. They also support bulk
loading. You can create your own SQL or let the stage generate the SQL.


Connector stage look and feel


• Connector types have the same look and feel and the same core set of properties
– Some types include properties specific to the database type
• Job parameters can be inserted into any properties
• Required properties are visually identified
• Properties are divided into two basic categories
– Connection properties
• Data Connection objects can be used to populate these
properties
– Usage properties


Figure 10-5. Connector stage look and feel KM4001.0

Notes:
All Connector stages have the same look and feel and the same core set of properties.
Some may include properties specific to the database type.



Connector stage GUI

(Screenshot callouts: the Navigation panel; properties; columns; test connection; view data.)


Figure 10-6. Connector stage GUI KM4001.0

Notes:
This slide shows the inside of the ODBC Connector stage and highlights some of its
features. The Navigation panel provides a way of moving between different sets of
properties. Click the stage icon in the middle to access the stage properties. Click a link
icon to access properties related to that link.


Connection properties
• ODBC Connection Properties
– Data source name or database name
– User name and password
– Requires a defined ODBC data source on the DataStage Server
• DB2 Connection Properties
– Instance
• Not necessary if a default is specified in the environment variables
– Database
– User name and password
– DB2 client library file
• Use Test to test the connection
• Can Load Connection properties from a Data Connection
object (discussed later)


Figure 10-7. Connection properties KM4001.0

Notes:
This slide discusses the connection properties in the Connector stage.



Usage properties - Generate SQL


• Have the stage generate the SQL?
– If Yes, the stage generates SQL based on the column definitions and the
specified table name
• If the schema name is not specified, the DataStage user ID is assumed
> For example: ITEMS -> DSADM.ITEMS
– If No, then you must specify the SQL
• Paste it in
• Manually type it
• Invoke SQL Builder


Figure 10-8. Usage properties - Generate SQL KM4001.0

Notes:
This slide discusses the Usage properties in the Connector stage. It focuses on the
Generate SQL properties.


Deprecated stages
• Enterprise stages: DB2 UDB Enterprise, Oracle Enterprise,
Teradata Enterprise, and others
• “Plug-in” stages: DB2 UDB API, DB2 UDB Load, Oracle OCI
Load, Teradata API, Dynamic RDBMS, and others
– “Plug-in” stages are stages ported over from Server jobs
• Run sequentially
• Invoke the DataStage Server engine
• Cannot span multiple servers in grid or cluster configurations
• Deprecated stages have been removed from the Palette but
are still available in the Repository Stage Types folder


Figure 10-9. Deprecated stages KM4001.0

Notes:
There are many database stages available, and many of them have been deprecated.
That is, they have been replaced by the Connector stages that are the focus of this unit. In
some cases, you may still want to use one of the deprecated stages. Deprecated stages
have been removed from the Palette but are still available in the Repository Stage Types folder.



Database stages

(Screenshot callouts: all available database stages; the Connector stages; the Stored
Procedure stage; the Classic Federation stage; the stages in the Palette.)


Figure 10-10. Database stages KM4001.0

Notes:
The DataStage Repository window displays all database stages available in the
Stage Types > Parallel > Database folder. Not all of these stages are included in the
default Designer Palette. You can customize the Palette to add additional stage types by
dragging them from the Repository window to the Palette.


Do it in DataStage or in the Database?


• When reading data from a database, it is often possible to use either
SQL or DataStage for some tasks
• Leverage the strengths of each technology:
– Where possible, use an SQL filter (WHERE clause) to limit the number of
rows sent to the DataStage job
– Use an SQL join to combine data from tables with a small-to-medium number
of rows, especially where the join columns are indexed
– In general, avoid sorting in the database
• DataStage sorting is much faster and can be run in parallel
• Use DataStage sort and join to combine data from very large tables, or when the
join condition is complex
– Avoid the use of database stored procedures on a per-row basis
• Implement these routines in DataStage
• When the choice is not obvious, test the possibilities to see which
yields the best performance


Figure 10-11. Do it in DataStage or in the Database? KM4001.0

Notes:
When reading data from a database, it is often possible to use either SQL or DataStage for
some tasks. In these situations you should leverage the strengths of each technology.



Connector Stage
Functionality


Figure 10-12. Connector Stage Functionality KM4001.0

Notes:


Reading with Connector stages


• For best performance, limit the use of SELECT * when reading columns
– Uses more memory
• May impact job performance
– Appropriate for dynamic source flows, where columns move through by RCP
• Explicitly specify only the columns needed
– Select only the columns needed when loading columns on the Columns tab
– Or use auto-generated or user-defined SQL with specified columns

(Screenshot callout: the selected columns.)


Figure 10-13. Reading with Connector stages KM4001.0

Notes:
This slide lists some best practices for using the Connector stages.



Before/After SQL
• Before SQL statements are executed before the stage starts
processing
– For example, create temporary table to write to
• After SQL statements are executed after the stage finishes processing
– For example, copy rows from the temporary table to the actual table with
INSERT INTO … SELECT … FROM
– For example, drop the temporary table

(Screenshot callouts: enable Before/After SQL; the Before SQL statement.)

Figure 10-14. Before/After SQL KM4001.0

Notes:
Before/After SQL can be used in Connector stages. Before SQL statements are executed
before the stage starts processing. After SQL statements are executed after the stage
finishes processing.


Sparse lookups
• By default lookup data is loaded into memory
• A sparse lookup sends individual SQL statements to the database for
each input row
– Expensive operation from a performance point of view
– Appropriate when the number of input rows is significantly smaller than the
number of reference rows (1:100 or more)
• An alternative is to use a Join stage between input and reference table

(Screenshot callout: the Sparse lookup option.)


Figure 10-15. Sparse lookups KM4001.0

Notes:
When a Connector stage is being used for a lookup, the Sparse lookup option is available.
By default lookup data is loaded into memory. A sparse lookup sends individual SQL
statements to the database for each input row. This is a very expensive operation from a
performance point of view. It may be appropriate when you are dealing with huge lookup
tables that cannot fit into memory.



Writing using Connector stages


• Write mode: Type of write, including:
– Insert, Update, Delete
– Insert then update, update then insert, delete then insert
– Bulk load
• Not available in ODBC Connector
• Table action
– Append
– Create
– Replace: Drop the table if it exists; then create it
– Truncate: Empty the table before writing to it
– Can be parameterized
• Create within the Connector stage
• Insert uses database host array processing to improve
performance
– Default Array size is 2000

Figure 10-16. Writing using Connector stages KM4001.0

Notes:
Connector stages offer several types of write operation, including bulk load.


Parameterizing the table action


• All Connector properties can be parameterized
– When creating a parameter for a list type property, it is best to create it in the Connector stage
• Select the property
• Click the Use Job Parameter icon
• Click New Parameter
• Specify the parameter

(Screenshot callouts: the parameter name; the default value.)


Figure 10-17. Parameterizing the table action KM4001.0

Notes:
All Connector properties can be parameterized. When creating a parameter for a list type
property, it is best to create it in the Connector stage. To do this click New Parameter from
the Use Job Parameter icon.



Optimizing the insert/update performance


• For Insert then update write mode:
– The insert statement is executed first
– If the insert fails with a unique-constraint violation, the update
statement is executed
• For Update then insert write mode:
– The update statement is executed first
– If the update fails because there is no matching key, the insert
statement is executed
• Choose Insert then update or Update then insert based on
the expected number of inserts over updates
• For larger data volumes, it is often faster to identify insert and
update data within the job and separate into different
Connector target stages


Figure 10-18. Optimizing the insert/update performance KM4001.0

Notes:
The Connector stage offers two types of insert plus update (sometimes called “upsert”)
statements. For the Insert then update write mode, the insert statement is executed first. If
the insert fails with a unique-constraint violation, the update statement is executed. The
Update then insert mode is the reverse. Choose Insert then update or Update then insert
based on the expected proportion of inserts to updates. For example, if you expect more
updates than inserts, choose the latter.
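
Conceptually, the two write modes behave like the following per-row logic. This is only an
illustration of the semantics described above, with hypothetical insertRow/updateRow
calls, not the Connector stage's actual implementation.

    // Insert then update (per row)
    if (insertRow(row) == UNIQUE_CONSTRAINT_VIOLATION)   // hypothetical call
        updateRow(row);                                  // hypothetical call

    // Update then insert (per row)
    if (updateRow(row) == NO_MATCHING_KEY)
        insertRow(row);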


Commit interval
• Auto commit
– If Off (default), commits are made after the number of records specified
by the Record count property has been processed
– If On, commits are made after each write operation
• Record count
– Number of rows before a commit
– Default is 2000 rows
– Must be a multiple of Array size

(Screenshot callouts: the Record count property; the Auto commit property.)

Figure 10-19. Commit interval KM4001.0

Notes:
This slide discusses how commits are handled. For example, with the default Array size of
2000 and a Record count of 4000, rows are written to the database in host arrays of 2000,
and a commit is issued after every 4000 rows.



Bulk load
• Most Connector stages support bulk load
– Insert/update uses database APIs
• Allows concurrent processing with other jobs and applications
• Does not bypass database constraints, indexes, triggers
– Bulk load uses database-specific parallel load utilities
• Significantly faster than insert/update for large data volumes
• Subject to database-specific limitations of load utilities
– May be issues with index maintenance, constraints, etc.
– May not work with tables that have associated triggers
– Requires exclusive access to target table
• A Load control set of properties is enabled to set utility-specific parameters

(Screenshot callout: the Load control properties.)


Figure 10-20. Bulk load KM4001.0

Notes:
Most Connector stages support bulk load. Insert/update uses database APIs. Bulk load
uses database-specific parallel load utilities. It can be significantly faster than insert/update
for large data volumes.


Cleaning Up failed DB2 loads


• In the event of a failure during a DB2 Load operation, the DB2
Fast Loader marks the table inaccessible (quiesced exclusive
or load pending state)
• To reset the target table state to normal mode:
– Re-run the job setting Clean-up on failure to Yes
– Any rows that were inserted before the load failure must be deleted
manually

Clean-up on failure


Figure 10-21. Cleaning Up failed DB2 loads KM4001.0

Notes:
In the event of a failure during a DB2 Load operation, the DB2 Fast Loader marks the table
inaccessible (quiesced exclusive or load pending state).
You can reset the target table to the normal mode by rerunning the job with the Clean-up
on failure option turned on.



Error Handling in Connector stages


Figure 10-22. Error Handling in Connector stages KM4001.0

Notes:


Error handling in Connector stages


• Reject links can be added to Connector stages to specify
conditions under which rows are rejected
– If no conditions are specified, SQL errors abort the job
– Optionally include the error code and message as additional columns
in the rejected rows
– Optionally abort the job after a specified number or percentage of
rejects
• Specify conditions upon which rows will be rejected
– SQL error
• For example, trying to insert a row whose key values match those of an
existing row in the table
– Row not updated
• The key in the row does not match the key of any existing row in the
table


Figure 10-23. Error handling in Connector stages KM4001.0

Notes:
Reject links can be added to Connector stages to specify conditions under which rows are
rejected. If no conditions are specified, SQL errors abort the job. Conditions include SQL
error and row not updated.



Connector stage with reject link


• Connector stages with multiple input links can include multiple
reject links
• Select the reject link in the stage navigation panel to specify
the reject conditions

(Screenshot callout: the reject link.)


Figure 10-24. Connector stage with reject link KM4001.0

Notes:
This example shows a Connector stage with a reject link. Connector stages can have
multiple input links. (This is discussed later in this unit.) If there are multiple input links there
can be multiple reject links.


Specifying reject conditions

(Screenshot callouts: select the reject link; the reject conditions; include error info in
rejected rows; the abort condition.)


Figure 10-25. Specifying reject conditions KM4001.0

Notes:
Select the reject link in the Navigation panel to specify the reject link conditions and other
properties.



Added error code information examples


• Insert error message (the screenshot shows its reject condition)
• Update error message (the screenshot shows its reject condition)


Figure 10-26. Added error code information examples KM4001.0

Notes:
This slide shows the types of messages that will show up in the job log when reject
conditions are specified. The top shows an insert error message. The bottom shows an
update error message.


Multiple Input Links


Figure 10-27. Multiple Input Links KM4001.0

Notes:



Multiple input links


• Write rows to multiple tables within the same unit of work
– Use navigation panel in stage to select link properties
– Order of input records to input links can be specified
• Record ordering Stage property
– All records: All records from first link, then next link, etc.
– First record: One record from each link is processed at a time
– Ordered: User specified ordering
• Reject links can be created for each input link


Figure 10-28. Multiple input links KM4001.0

Notes:
Multiple input links write rows to multiple tables within the same unit of work. Reject links
can be created for each input link based on SQL error or Row not updated conditions.


Inside the Connector – stage properties

(Screenshot callouts: specify record ordering in the stage properties; the Record ordering
property.)


Figure 10-29. Inside the Connector - stage properties KM4001.0

Notes:
In the Navigation panel you see reject links corresponding to each of the input links. Click
the stage properties icon to specify the record ordering properties, which apply to both
input links.



Job Design Examples


Figure 10-30. Job Design Examples KM4001.0

Notes:


Data Connection Objects

(Screenshot callouts: the stage type; the connection properties and their default values.)


Figure 10-31. Data Connection Objects KM4001.0

Notes:
Data Connection objects can be used to store data connection property values in a named
Repository object. They are similar to parameter sets. Passwords can be encrypted.



Standard insert plus update example

(Screenshot callouts: the Connector stage; load Data Connection values; the Insert then
update write mode.)


Figure 10-32. Standard insert plus update example KM4001.0

Notes:
In this example, Insert then update has been chosen for the Write mode.


Insert-Update Example

(Screenshot callout: separate links for updates and inserts.)


Figure 10-33. Insert-Update Example KM4001.0

Notes:
For the Inserts link, just use the standard Insert, or you can choose Insert then update.
From a performance point of view this will be equivalent to just doing inserts, since inserts
are tried first and updates are performed only if the insert fails. For the Updates link, use
Update, or Update then insert if you want to be safe.



Checkpoint
1. What is a sparse lookup?
2. How do you decide whether to use Insert then update or
Update then insert write modes?


Figure 10-34. Checkpoint KM4001.0

Notes:
Write your answers here:


Exercise 10. Working with Connectors


• In this lab exercise, you will:
– Handle errors in the Connector stage
– Create a job with multiple Connector
input links
– Separate inserts from updates to
improve performance


Figure 10-35. Exercise 10. Working with Connectors KM4001.0

Notes:



Unit summary
Having completed this unit, you should be able to:
• Use Connector stages to read from relational tables
• Use Connector stages to write to relational tables
• Handle SQL errors in Connector stages
• Use Connector stages with multiple input links
• Optimize jobs that write to relational tables by separating
inserts from updates


Figure 10-36. Unit summary KM4001.0

Notes:


Unit 11. Processing XML Data

What this unit is about


This unit describes the XML stage and its functionality.

What you should be able to do


After completing this unit, you should be able to:
• Use the XML stage to parse, compose, and transform XML data
• Use the Schema Library Manager to import and manage XML
schemas
• Use the Assembly editor in the XML stage to build an assembly of
parsing, composing, and transformation steps


Unit objectives
After completing this unit, you should be able to:
• Use the XML stage to parse, compose, and transform XML
data
• Use the Schema Library Manager to import and manage XML
schemas
• Use the Assembly editor in the XML stage to build an
assembly of parsing, composing, and transformation steps


Figure 11-1. Unit objectives KM4001.0

Notes:



XML stage
• Use to parse, compose, and transform XML data
• Located in Real Time folder in Designer Palette
• Supports both input and output links
– Supports multiple input and output links
– Can have both or just one or the other
• Configured by creating an assembly
– Built using the Assembly editor
– An assembly consists of a series of steps
• The initial step maps DataStage tabular data (rows and columns) to an XML
hierarchical structure
• The final step maps the XML hierarchical structure to DataStage tabular data (rows
and columns)
– Steps in between can parse, compose, and transform
• Importing XML schemas
– Import>Schema Library Manager
– You can create libraries of schemas organized into categories (folders)

Figure 11-2. XML stage KM4001.0

Notes:
The XML stage can be used to parse, compose, and transform XML data. It can also
combine any of these operations. The XML stage is configured by creating an assembly.
An assembly consists of a series of parse, compose, and transform steps.


Schema Library Manager


• Import schemas for use in the XML stage
– Used for parsing and composing XML data
• To open the Schema Library Manager:
– Click Import>Schema Library Manager in Designer
– Click the Library tab within the XML stage
• Imported schemas can be organized into libraries
– Shared across all DataStage projects
– Validated whenever schemas are added to or removed from the library


Figure 11-3. Schema Library Manager KM4001.0

Notes:
The Schema Library Manager is fully integrated with the XML stage but is also available in
Designer outside the stage. Imported schemas can be organized into libraries.



Schema Library Manager window

(Screenshot callouts: import a schema file; a library category; a library; the schema files it
contains. A library can include multiple schema files.)


Figure 11-4. Schema Library Manager window KM4001.0

Notes:
This slide shows the inside of the Schema Library Manager window. In this example,
km400 is a schema library category (folder) used to organize the libraries. The km400
category contains one library named EmpDept. You can create new categories and
libraries. Click the Import New Resource button to import a schema file.


Schemas
• Describes the structure of an XML document
• XML data is hierarchical
– Contains objects within objects
– Most data processed in DataStage jobs is flat (tabular)
• Consists of rows
• Each row consists of columns of single values
• Example structure
• Employees: list of employees
• Employee:
– Employee ID
– Job title
– Name
– Gender
– Birth date
– Hire date
– Work department
> Department number
> Department name
> Department location


Figure 11-5. Schemas KM4001.0

Notes:
A schema describes the structure of an XML document. XML data is hierarchical, but most
data processed in DataStage jobs is flat. Input and output links from the XML stage are
used to map the flat data to the hierarchical data and vice versa.
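
As a rough C++ analogy (illustrative only; the field names mirror the example structure
listed on the slide), hierarchical XML data looks like nested objects rather than flat rows of
scalar columns:

    #include <string>
    #include <vector>

    struct WorkDepartment {
        std::string number;      // Department number
        std::string name;        // Department name
        std::string location;    // Department location
    };

    struct Employee {
        std::string employeeId;  // Employee ID
        std::string jobTitle;    // Job title
        std::string name;        // Name
        std::string gender;      // Gender
        std::string birthDate;   // Birth date
        std::string hireDate;    // Hire date
        WorkDepartment workDept; // nested object, not a flat column
    };

    // Employees: a list of Employee objects
    typedef std::vector<Employee> Employees;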



Schema file

(Screenshot callouts: PersonType has three elements; EmployeeType is an extension of
PersonType, with additional elements and additional attributes; one of the EmployeeType
elements.)

Figure 11-6. Schema file KM4001.0

Notes:
This slide shows an example of a schema file. Different objects and their elements are
highlighted.


Composing XML Data


Figure 11-7. Composing XML Data KM4001.0

Notes:



Composing XML data


• Create an XML document from DataStage tabular data
– Based on a referenced schema library
• XML targets include: File, XML string to pass downstream, or
LOB (Large OBject) to go into LOB-aware target field
– For LOB targets, the last stage in the job must be a LOB-aware stage,
such as the DB2 Connector or Oracle Connector
• Document root: Select from the top-level elements of the
schema library
• Mappings
– Define how to create the target nodes
• Target nodes can be either list nodes or content nodes (values)
– Once a target list node is mapped its content nodes become available for
mappings
• All required target nodes must have mappings
– Select item for mapping or use Auto map or enter constant value


Figure 11-8. Composing XML data KM4001.0

Notes:
A composition step in the XML stage can be used to create an XML document from
DataStage tabular data. The document created is based on a referenced schema library.
The compositional step target does not have to be a file. It can also be an XML string
passed downstream in the job or an LOB (Large OBject).


Compositional Job
• Tabular data to be “composed” into an XML document comes
from upstream sources
• Output link is not needed if target is a file
– Specify path to output directory and filename prefix

(Screenshot callouts: a Join stage joins department data to employee data; the XML
stage; the input tabular data.)

Figure 11-9. Compositional Job KM4001.0

Notes:
This shows an example of a compositional job. First, the data from two tables is joined.
This becomes input to the XML stage. The XML stage will compose an XML document
from this data. In this example, the XML target is a file, so no output link is necessary. In the
XML stage a path to the file is specified.



Inside the XML stage


• Same GUI as Connector stages
• Most work is done in the Assembly editor
– The Assembly editor can only be accessed from within the XML stage

(Screenshot callouts: the stage properties; the Assembly editor.)


Figure 11-10. Inside the XML stage KM4001.0

Notes:
The XML stage has the same basic GUI as a Connector stage. However, most of the work
is done in the Assembly editor which is invoked from the stage.


Inside the Assembly editor


(Screenshot callouts: the Schema Library Manager; the button to open the palette; the
assembly of steps (click a step to open it); the palette steps.)


Figure 11-11. Inside the Assembly editor KM4001.0

Notes:
This shows the inside of the Assembly editor. Notice the Libraries tab at the top where you
can invoke the Schema Library Manager. Click the Palette button to open the palette. The
palette contains the list of steps that you can add to your assembly. The assembly steps
are shown on the left.
You can add any number of steps in any order. The Input Step and Output Step are
always present and are always first and last, respectively. If there is no input or output link,
then the corresponding step will be empty, but still present.



Input step
• The output from one step becomes the input for the next step
• The input to this step is the link data
– There can be multiple links

(Screenshot callouts: switch to link view; the input to the step in tree view; the output from
the step.)


Figure 11-12. Input step KM4001.0

Notes:
The input to the Input step is the input link data. The input step maps this data into the
stage. The mapped data then becomes available to the step following the Input step.


Composer step – XML Target tab


• On the XML Target tab, specify the target
– File: Specify path to directory and filename prefix
• Depending on stage configuration multiple files may be output
– String or LOB
• A result-string node will be created on the Output tab

(Screenshot callout: select the target.)


Figure 11-13. Composer step - XML Target tab KM4001.0

Notes:
Once you have added a step, for example, the Composer step shown here, you can open
it. Inside there are tabs to edit based on the type of step. A Composer step has a XML
Target tab. On this tab you specify the type of target. In this example, the target is a file or
set of files. You specify the directory path and the prefix to use for the name when creating
the XML file or files.



Composer step – XML Document Root tab


• Select the document root within a schema library
• The selected root and its tree of elements is displayed
– These will become the mapping targets

(Screenshot callouts: the Document Root tab; browse for the root in a library; the selected
root.)


Figure 11-14. Composer step - XML Document Root tab KM4001.0

Notes:
Once you have added a step, for example, the Composer step shown here, you can open
it. Inside there are tabs to edit based on the type of step. A Composer step has a
Document Root tab. On this tab you browse a schema library for the root of the document
you want to compose.


Composer step – Validation tab


• Select the type of validation and the action to take for
exceptions
– Strict validation (default): By default, job fails with violations
– Minimal validation: By default, job ignores violations
(Screenshot callouts: the Validation tab; the type of validation; the type of violation; the
menu of options.)

Figure 11-15. Composer step - Validation tab KM4001.0

Notes:
The Validation tab exists in all the different types of steps. Select the type of validation and
the action to take for exceptions. For Strict validation (default), the job fails with any
violations. For Minimal validation, the job ignores violations although they are recorded in
the log. Either of these types of validation can be modified using the menu lists available.



Composer step – Mappings tab


• Source mapping choices for a target node must match level
and type
Header info to add to
– Lists map to lists; values to values document
– Value type of source must match of target
Target list nodes and
value nodes

Select
mapping
source: list
or value


Figure 11-16. Composer step - Mappings tab KM4001.0

Notes:
On the Mappings tab, you specify mappings to the target document nodes. The target
nodes can be list nodes or value nodes. The object mapped to the target node must be the
same level of object. For example, the employee node is an object that contains a number
of elements and attributes. You can map a source link object to this node because the link
contains a number of columns. But you cannot map a single column to the employee node.
In general, list nodes must be mapped to lists and value nodes must be mapped to values.
Map the list nodes first. Then the elements it contains are available for mapping.


XML file output

(Screenshot callouts: the Employees list; an Employee with an employee number attribute;
the Employee elements.)


Figure 11-17. XML file output KM4001.0

Notes:
This shows the XML document output produced in this example. Notice that the employee
object attributes (for example, EmpNo) and elements (for example, dateOfBirth) have
been populated with values.



Parsing XML Data


Figure 11-18. Parsing XML Data KM4001.0

Notes:


Parsing XML data


• Converts XML data into DataStage tabular data
• Add an XML_Parser step from the Assembly Editor Palette
– Specify the XML source: String, single file, set of files
– Select the document root from a schema library to use to parse the
document
– Specify validation
– Optionally use the Test Data tab to test whether the document can be
parsed using the specified schema library root


Figure 11-19. Parsing XML data KM4001.0

Notes:
A parsing step can be used to convert XML hierarchical data into DataStage tabular data. It
is the reverse of a compositional step. Here, we need to specify the format of the XML
source that is to be flattened. We do this by referencing a schema library and choosing the
document root.



Parser step – XML Source tab


• Add the XML_Parser step from the Palette
• Select source
(Screenshot callouts: the Parser step; a single-file source.)


Figure 11-20. Parser step - XML Source tab KM4001.0

Notes:
On the XML Source tab specify the type of source: file, string, or set of files.


Parser step – Document Root tab


• Open the Schema Library Manager
– Select the library
– Select the document root
• Must match the root element in the document being parsed

(Screenshot callouts: the Document Root; browse for the root in a schema library; the
selected root.)


Figure 11-21. Parser step - Document Root tab KM4001.0

Notes:
On the Document root tab, browse for the document root in a schema library.



Transforming XML Data


Figure 11-22. Transforming XML Data KM4001.0

Notes:


Transforming XML data


• Types of transformations include:
– Aggregation: Aggregate on the items in a list
– Sort: Sort items in a list
– Horizontal Pivot (H-Pivot): Transpose a list into a set of items
– Vertical Pivot (V-Pivot): Transform records into fields of another
record
– Join (HJoin): Merge two lists of items into one nested list
• One list becomes the child of the other list
– Union: Combine two lists into a single list with a pre-defined structure
– Switch: Split items in a list into one or more new lists


Figure 11-23. Transforming XML data KM4001.0

Notes:
There are several types of transformation steps that you can choose from. This slide lists
and describes the types.



Transformation Example - HJoin


• Merge multiple input lists
• Tasks:
– Add HJoin step
– Select parent list
– Select child list
– Specify join key

(Screenshot callout: multiple input lists.)


Figure 11-24. Transformation Example - HJoin KM4001.0

Notes:
In this example, we are using the HJoin transformation step to join the data from the two
source tables, much as you can do with the Join stage. The difference is that the result of
the join will be an XML hierarchical object, not a flat tabular object.


Editing the HJoin step

(Screenshot callouts: add the HJoin step; select the parent and child lists; define the join
key.)


Figure 11-25. Editing the HJoin step KM4001.0

Notes:
On the Configuration tab, you select the parent list, the child list, and the key used to join
the two lists together. In this example, the Department link provides the child list and the
Employee link provides the parent list.



Switch step
• Categorize items in a list into one or more target lists based on a
constraint.
• Constraints include:
– isNull, Greater than, Equals, Compare, Like, and so on
• A default target captures all items that fail to go to any of the other
targets
– Targets become output nodes from the step
(Screenshot callouts: the list to categorize; the Switch targets; the added targets.)


Figure 11-26. Switch step KM4001.0

Notes:
Another type of transformation step you can create is a Switch step. It functions a little
like a Transformer stage: it splits the data into one or more target lists based on
constraints.


Aggregate step
• Aggregate one or more items in a list
• Tasks: Select the list. Add items and specified aggregation
functions
• Functions include: Sum, Max, Min, First, Last, Average,
Concatenate
(Screenshot callouts: the list to aggregate; the target output nodes; the added targets.)


Figure 11-27. Aggregate step KM4001.0

Notes:
Another type of transformation step you can create is an Aggregate step. It functions a
little like an Aggregator stage: it performs summary calculations over elements in a list.



Checkpoint
1. What three types of steps can be performed within an XML
stage?
2. What three types of XML targets are supported?


Figure 11-28. Checkpoint KM4001.0

Notes:


Exercise 11 – XML stage


• In this lab exercise, you will:
– Compose XML data
– Parse XML data
– Transform XML data


Figure 11-29. Exercise 11 - XML stage KM4001.0

Notes:


Unit summary
Having completed this unit, you should be able to:
• Use the XML stage to parse, compose, and transform XML
data
• Use the Schema Library Manager to import and manage XML
schemas
• Use the Assembly editor in the XML stage to build an
assembly of parsing, composing, and transformation steps


Figure 11-30. Unit summary KM4001.0

Notes:

Unit 12. Slowly Changing Dimensions Stages

What this unit is about


This unit describes the Slowly Changing Dimensions stage, and how
to use it to load and update a star schema database.

What you should be able to do


After completing this unit, you should be able to:
• Design a job that creates a surrogate key source key file
• Design a job that updates a surrogate key source key file from a
dimension table
• Design a job that processes a star schema database with Type 1
and Type 2 slowly changing dimensions


Unit objectives
After completing this unit, you should be able to:
• Design a job that creates a surrogate key source key file
• Design a job that updates a surrogate key source key file from
a dimension table
• Design a job that processes a star schema database with Type
1 and Type 2 slowly changing dimensions


Figure 12-1. Unit objectives KM4001.0

Notes:


Surrogate Key Generator Stage


Figure 12-2. Surrogate Key Generator Stage KM4001.0

Notes:


Surrogate Key Generator stage


• Use to create and update the surrogate key state file
• Surrogate key state file
– One file per dimension table
– Stores the last used surrogate key integer for the dimension table
– Binary file


Figure 12-3. Surrogate Key Generator stage KM4001.0

Notes:
The Surrogate Key Generator stage is used to create and update a surrogate key state file.
There is one file per dimension table. The file stores the last used surrogate key integer for
the dimension table.


Example job to create surrogate key state files

Create Surrogate State File for Product dimension table

Create Surrogate State File for Store dimension table


Figure 12-4. Example job to create surrogate key state files KM4001.0

Notes:
Without any links, the stage is used just to create a state file for a dimension table.


Editing the Surrogate Key Generator stage

Path to state file

Create the
state file


Figure 12-5. Editing the Surrogate Key Generator stage KM4001.0

Notes:
In this example, the stage is just creating a file. The path to the file to be created is
specified.


Example job to update the surrogate key state file


Figure 12-6. Example job to update the surrogate key state file KM4001.0

Notes:
If there are links going into the Surrogate Key Generator stage, as shown in this example, the stage
can be used to update the state file based on the surrogate keys that already exist in the
table.


Specifying the update information

Table column containing surrogate key values

Update the state file


Figure 12-7. Specifying the update information KM4001.0

Notes:
Here, the column that contains the surrogate keys needs to be indicated. The Source
Update action in this case is Update.


Slowly Changing Dimensions Stage


Figure 12-8. Slowly Changing Dimensions Stage KM4001.0

Notes:


Slowly Changing Dimensions stage


• Used for processing a star schema
• Performs a lookup into a star schema dimension table
– Multiple SCD stages can be chained to process multiple dimension
tables
• Inserts new rows into the dimension table as required
• Updates existing rows in the dimension table as required
– Type 1 fields of a matching row are overwritten
– Type 2 fields of a matching row are retained as history rows
• A new record with the new field value is added to the dimension table
and made the current record
• Generally used in conjunction with the Surrogate Key
Generator stage
– Creates a Surrogate Key state file that retains a list of the previously
used surrogate keys

Figure 12-9. Slowly Changing Dimensions stage KM4001.0

Notes:
The Slowly Changing Dimensions stage is designed for processing a star schema data
warehouse. It is an extremely powerful stage that performs all the necessary tasks: it
performs a lookup into a star schema dimension table to determine whether the incoming
row is an insert or an update, inserts new rows into the dimension table as required, and
updates existing rows as required. It can perform both type 1 and type 2 updates.


Star schema database structure and mappings


Dimension tables    Source rows

Fact table


Figure 12-10. Star schema database structure and mappings KM4001.0

Notes:
Here is an example of a star schema data warehouse. This example is used in the lab
exercise for this unit.
The fact table is the center of the star schema. It contains the numerical (factual) data that
is aggregated over to produce analytical reports covering the different dimensions.
Non-numerical (non-factual) information is stored in the dimension tables. This information
is referenced by surrogate key values in the fact table rows.
This example star schema database has two dimensions. The StoreDim table stores
non-numerical information about stores. Each store has been assigned a unique surrogate
key value (integer). Each row stores information about a single store, including its name, its
manager, and its business identifier (a.k.a., natural key, business key). The ProdDim table
stores non-numerical information about a single product, including its brand, its description,
and its business identifier.
Each row in the fact table references a single store and a single product by means of their
surrogate keys. Why are surrogate keys used rather than the business keys? There are
two major reasons. First, surrogate keys can yield better performance because they are
numbers rather than, possibly, long strings of characters. Second, it is possible for there to
be duplicate business keys, coming from different source systems. For example, the
business key X might refer to bananas in Australia, but tomato soup in Mexico.
In this example, each row in the fact table contains a sales amount and units for a particular
product sold by a particular store for some given period of time not shown in this example.
For simplicity, the time dimension has been omitted.
A source record contains sales detail from a sales order. It includes information about the
product sold and the store that sold the product. This information needs to be put into the
star schema. The store information needs to go into the StoreDim table. The product
information needs to go into the ProdDim table and the factual information needs to go into
the Facttbl table. Moreover, the record put into the fact table must contain surrogate key
references to the corresponding rows in the StoreDim and ProdDim tables.
In this example, the Mgr field in the StoreDim table is considered a type 1 dimension table
attribute. This means that if a source record that references a certain store lists a different
manager, then this is to be considered a simple update of the record for that store. The
value in the source data replaces the value in the existing store record by means of a
simple update to the existing record. Similarly, Brand is a type 1 dimension table attribute
of the ProdDim table.
In this example, the Descr field is a type 2 dimension table attribute. Suppose a source
data record contains a different product description for a given product than the current
record for that product in the ProdDim table. The record in the ProdDim table is not simply
updated with the new product description. The record is retained with the old product
description but flagged as “non-current”. A new record is created for that product with the
new product description. This record is flagged as “current”. The field that is used to flag a
record as current or non-current is called the “Current Indicator” field. Two additional fields
(called “Effective Date” and “Expire Date”) are used to specify the date range during which
the description is applicable.


Example Slowly Changing Dimensions (SCD) job

Check for matching Product rows        Check for matching StoreDim rows

Perform Type 1 and Type 2 updates      Perform Type 1 and Type 2 updates
to Product table                       to StoreDim table

Figure 12-11. Example Slowly Changing Dimensions (SCD) job KM4001.0

Notes:
This shows the SCD job. It processes two dimensions, so there are two SCD stages. For
each SCD stage there is both a reference link to a Connector stage used for lookup into the
dimension table and an output link to a Connector stage used to insert and update rows in
the dimension table. It is important to note that both Connector stages access the same
dimension table: one reads it for the lookup, and the other writes the inserts and updates.


Working in the SCD stage


• Five “Fast Path” pages to edit
• Select the Output link
– This is the link coming out of the SCD stage that is not used to update the
dimension table
• Specify the purpose codes
– Fields to match by
• Business key fields and the source fields to match to it
– Surrogate key field
– Type 1 fields
– Type 2 fields
– Current Indicator field for Type 2
– Effective Date, Expire Date for Type 2
• Surrogate Key management
– Location of State file
• Dimension update specification
• Output mappings

Figure 12-12. Working in the SCD stage KM4001.0

Notes:
The SCD stage is designed like a “wizard”. A series of five “fast path” pages guides you
through the process. This slide lists and describes the five pages. The following slides go
through each page.


Selecting the output link

Select the
output link


Figure 12-13. Selecting the output link KM4001.0

Notes:
On the first page, all you need to do is select the output link from the SCD stage. Recall
that there are two output links from the SCD stage. One goes to the Connector stage that
updates the dimension table. The other goes out of the SCD stage to the downstream stage,
which in this case happens to be another SCD stage. The output link is the latter link, not
the one that goes to the Connector stage.


Specifying the purpose codes

Lookup key mapping    Type 1 field    Surrogate key    Type 2 field

Fields used for Type 2 handling


Figure 12-14. Specifying the purpose codes KM4001.0

Notes:
On this fast path page, you select the purpose codes for the columns in the dimension
table. Select Surrogate Key for the table column that contains the surrogate keys. Select
Business Key for the table column that contains the natural or business key. Also map the
field in the input record that contains the business key. This information is used for the
lookup that determines whether the record is an insert or an update.
For any type 1 updates select Type 1. For any type 2 updates select Type 2. Not all fields
in the dimension table are required to have purpose codes or required to be updated.
Choose one field as the Current Indicator field if you are performing type 2 updates. This
field will be used to indicate whether the record is the currently active record or an historical
record. You can use Effective Date and Expiration Date codes to specify the date range
that a particular record is effective.


Surrogate key management

Path to state file

Initial surrogate key value
Number of values to retrieve at one time

Figure 12-15. Surrogate key management KM4001.0

Notes:
On the Surrogate Key Management tab, specify the path to the surrogate key state file
associated with this dimension table. You can specify the number of values to retrieve at
each physical read of the file. The larger the block of numbers, the fewer reads are
required and the better the performance.
In this example, the path is a Windows path format (rather than UNIX). This implies that in
this example the job is running on a Windows system. The surrogate key state files are
always located on the DataStage Server system; never on the client system.


Dimension update specification

Function used to retrieve the next surrogate key value
Value that means current
Functions used to calculate history date range

Figure 12-16. Dimension update specification KM4001.0

Notes:
On the Dim Update tab, you specify how updates are to be performed given the purpose
codes you specified earlier. For the Surrogate Key column, invoke the
NextSurrogateKey() function to retrieve the next available surrogate key from the state file
or the block of surrogate key values held in memory by the stage.
For the type 1 and type 2 updates, map the columns from the source file that will be used to
update the table columns.
For the Current Indicator field, specify the values for current and non-current. In this
example, ‘Y’ means current and ‘N’ means non-current.
For the Effective Date and Expiration Date fields specify the functions or values that are
to be used.
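As a sketch of what these derivations can look like (the link name ToProd and the date
function are hypothetical; the exact grid depends on the purpose codes you assigned), the
Dim Update entries for the ProdDim table might be:

    PRODSK          NextSurrogateKey()      -- next value from the state file
    Brand           ToProd.Brand            -- Type 1: overwrite in place
    Descr           ToProd.Descr            -- Type 2: expire old row, insert new
    CurrentInd      'Y' / 'N'               -- values meaning current / expired
    EffectiveDate   CurrentDate()           -- start of the new row's date range
    ExpireDate      CurrentDate()           -- end of the expired row's date range

NextSurrogateKey() is supplied by the stage itself; CurrentDate() stands in for whatever
function or literal value you choose for the date columns.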


Output mappings


Figure 12-17. Output mappings KM4001.0

Notes:
On the Output Mappings tab, specify the columns that will be sent out of the stage. In this
example, surrogate key (PRODSK) is output. The business key field is not retained
because it is not used in the fact table. The sales fields are output because they will be
processed in the next SCD stage, which updates the StoreDim dimension table. The
product columns are dropped because they are not used in the fact table.


Checkpoint
1. How many Slowly Changing Dimension stages are needed to
process a star schema with 4 dimension tables?
2. How many Surrogate Key state files are needed to process a
star schema with 4 dimension tables?
3. What’s the difference between a Type 1 and a Type 2
dimension field attribute?
4. What additional fields are needed for handling a Type 2
slowly changing dimension field attribute?


Figure 12-18. Checkpoint KM4001.0

Notes:
Write your answers here:


Exercise 12 – Slowly Changing Dimensions


• In this lab exercise, you will:
– Create the surrogate key source files
– Build an SCD job to process the first
dimension of a data mart with slowly
changing dimensions
– Build an SCD job to process the second
dimension


Figure 12-19. Exercise 12 - Slowly Changing Dimensions KM4001.0

Notes:


Unit summary
Having completed this unit, you should be able to:
• Design a job that creates a surrogate key source key file
• Design a job that updates a surrogate key source key file from
a dimension table
• Design a job that processes a star schema database with Type
1 and Type 2 slowly changing dimensions


Figure 12-20. Unit summary KM4001.0

Notes:

Unit 13. Best Practices

What this unit is about


This unit describes a variety of best practices related to overall job
guidelines and individual stages.

What you should be able to do


After completing this unit, you should be able to:
• Describe overall job guidelines
• Describe stage usage guidelines
• Describe Lookup stage guidelines
• Describe Aggregator stage guidelines
• Describe Transformer stage guidelines


Unit objectives
After completing this unit, you should be able to:
• Describe overall job guidelines
• Describe stage usage guidelines
• Describe Lookup stage guidelines
• Describe Aggregator stage guidelines
• Describe Transformer stage guidelines


Figure 13-1. Unit objectives KM4001.0

Notes:


Job Design Guidelines


Figure 13-2. Job Design Guidelines KM4001.0

Notes:


Overall job design

• First priority is that job design must meet business and infrastructure
requirements
• Ideal job design must strike a balance between performance, resource usage,
and restartability

Performance

Resources Restartability


Figure 13-3. Overall job design KM4001.0

Notes:
The first requirement for a job is that it meets the business requirements. The other
considerations on this slide should be examined only after this first requirement is met.


Balancing performance with requirements


• In theory, best performance results when all data is
processed in memory, without landing to disk
• This requires hardware resources (CPU, memory)
and operating system resources (number of processes,
number of files)
– Resource usage grows exponentially based on degree of parallelism and
number of stages in a flow
– Must also consider what else is running on the Servers
• May not be possible with very large amounts of data
– Sort will use scratch disk if data is larger than memory buffer
• Business rules may dictate job boundaries
– Dimension table loading before fact table loading
– Lookup reference data must be created before lookup processing


Figure 13-4. Balancing performance with requirements KM4001.0

Notes:
Resource usage can grow dramatically as the degree of parallelism increases.
The operating system ulimit settings restrict the number of processes that can be spawned
and the amount of memory that can be allocated; these limits may prevent your parallel job
from running.
Reading everything into memory may not be possible with large amounts of data.


Modular job design


• Parallel shared containers facilitate component reusability
– Reuse stages, logic
– RCP allows maximum shared container re-use
• Only need define columns used within container logic
• Job parameters facilitate job reusability
– More flexible runtime configuration
– Parameterize schemas to enable a job to process data in different
formats
• Land intermediate results to parallel datasets
– Breaks a job up into more manageable components
– Use job sequences to run them as a batch
– Landing the data reduces performance, but improves
maintainability and modularity
• Datasets preserve partitioning


Figure 13-5. Modular job design KM4001.0

Notes:
Shared containers create operations that can be reused.
If you use a shared container in a job that utilizes RCP, the shared container will act
somewhat like a function call. The data not affected by the shared container will
automatically pass through the flow; only the data needed by the shared container will be
affected.
Remember, a shared container may have many stages and will demand resources for all
processes hidden by the shared container.
Landing intermediate results to a data set preserves the partitioning and is very efficient.


Establishing job boundaries


• Business requirements may dictate boundaries
– Land data for auditing
– Restart points
• Functional / DataStage requirements may dictate a boundary
– Lookup tables need to be loaded
• File stages and relational stages do not support both input and output links at the
same time
• Establish re-start points in the event of a failure
– Segment long-running steps
– Separate final database load from extract and transformation phases
• Resource utilization
– Break a job up into smaller jobs requiring less resources
• Performance
– Fork-join job flows may run faster if split into two separate jobs with
intermediate datasets
• Depends on processing requirements and ability to tune buffering
• Job development and maintenance
– Smaller jobs are easier to develop and maintain


Figure 13-6. Establishing job boundaries KM4001.0

Notes:
The developer is responsible for making jobs restartable. DataStage does not have an
automatic restart.
Separate long-running processes from other processes.


Use job sequences to combine job modules


• Job sequences can be used to combine modular jobs into a functional stream
– Enable sequence restart in job properties (enabled by default)
• Job sequences are “restartable”
– Re-running the sequence does not re-run stage activities that have
successfully completed
– The Do not checkpoint run property executes the stage activity every run
– Developers must ensure that individual modules can be re-run after a failure


Figure 13-7. Use job sequences to combine job modules KM4001.0

Notes:
If you use a job sequence and one stage activity fails, the sequence can be re-run, and it
will start at the step that failed. It does not do any specialized processing, such as rolling
back rows from a database. If you want behavior like that, you need to build it into the job
design.


Adding environment variables as job parameters


• Applies to both system environment variables and user-defined
environment variables
• When an environment variable is added as a job parameter its default
value is added
– This value is hard-coded into the parameter default value field
– This allows the environment variable value to be changed on different job runs
• Use $PROJDEF to pick up the default value at Administrator level when
the job is run
– Picks up the value current at Administrator level at the time the job is run

Environment variables have “$” prefix
Use current value set in Administrator

Figure 13-8. Adding environment variables as job parameters KM4001.0

Notes:
When an environment variable is added as a job parameter its default value is added. Use
$PROJDEF to pick up the default value at Administrator level when the job is run.
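As an illustrative sketch (the variable and path are examples only), an environment variable
added as a job parameter might look like this:

    Parameter name:  $APT_CONFIG_FILE
    Default value:   $PROJDEF    (resolves to the project-level value at run time)

    Example override for a single run:
    $APT_CONFIG_FILE = /opt/IBM/InformationServer/Server/Configurations/4node.apt

With $PROJDEF, changing the value in Administrator changes what every job picks up on its
next run, without editing the jobs.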


Stage Usage Guidelines


Figure 13-9. Stage Usage Guidelines KM4001.0

Notes:


Reading sequential files

• Accessing sequential files in parallel depends on the access method and
options
• Sequential I/O:
– Specific files; single file
– File pattern
• Parallel I/O:
– Single file when Readers Per Node > 1
– Multiple individual files
– Reading with a file pattern, when
$APT_IMPORT_PATTERN_USES_FILESET is
turned on


Figure 13-10. Reading sequential files KM4001.0

Notes:
Sequential row order cannot be maintained when reading a file in parallel.

© Copyright IBM Corp. 2005, 2011 Unit 13. Best Practices 13-11
Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.
Student Notebook

Reading a sequential file in parallel


• The Number of Readers Per Node optional
property can be used to read a single input file in
parallel at evenly spaced offsets


Figure 13-11. Reading a sequential file in parallel KM4001.0

Notes:
The readers per node can be set for both fixed and variable-length files.
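As an illustrative sketch (the file path is hypothetical), the relevant Sequential File stage
settings might be:

    Read Method                  Specific file(s)
    File                         /data/input/transactions.dat
    Number of Readers Per Node   4

On a one-node configuration file this spawns four readers, each importing the file at evenly
spaced offsets; row order across the readers is not preserved.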


Parallel file pattern I/O


• By default, the File Pattern read method spawns a
single sub-process that calls the cat command to
stream output sequentially
– Uses a list of the files that match the given expression
– Data in individual files matching the pattern is
concatenated into a single, sequential stream
• To change the default, set
$APT_IMPORT_PATTERN_USES_FILESET
– Dynamically creates a fileset header with a list of files
that match the given expression
– Files in the list are read in parallel
• Degree of parallelism is determined by
$APT_CONFIG_FILE

Figure 13-12. Parallel file pattern I/O KM4001.0

Notes:
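As an illustrative sketch (the pattern is hypothetical), a parallel file pattern read combines
the environment variable with the stage's File Pattern read method:

    $APT_IMPORT_PATTERN_USES_FILESET = 1   (set as a job parameter or in the environment)

    Read Method    File pattern
    File Pattern   /data/input/sales_*.txt

With the variable set, the files matching the pattern are read in parallel, with the degree of
parallelism determined by $APT_CONFIG_FILE.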


Partitioning and sequential files


• Sequential File stage creates one partition for each input file
– Always follow a Sequential File stage with Round Robin or other
appropriate partitioning type
• Never follow a Sequential File stage with Same partitioning!
• If reading from one file, downstream flow will run sequentially!
• If reading from multiple files, the number of files may not match the
number of partitions defined in the configuration file
• Same is only appropriate in cases where the source data is already
running in multiple partitions


Figure 13-13. Partitioning and sequential files KM4001.0

Notes:
Round robin is the fastest way to partition data.


Other sequential file tips


• When writing from variable-length columns to fixed-length fields
– Padding occurs if the fixed-length field is larger than the variable-length value
– Use the Pad char extended property to specify the padding character
• By default an ASCII NULL character 0x0 is used
• When reading delimited files, extra characters are silently truncated
for source file values longer than the maximum specified length of
VarChar columns
– Set the environment variable $APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS to
reject these records instead


Figure 13-14. Other sequential file tips KM4001.0

Notes:
You should specify the pad character because normally you do not want ASCII nulls as
padding.


Buffering sequential file writes


• By default, target Sequential File stages write to memory
buffers
– Improves performance
– Buffers are automatically flushed to disk when the job completes
successfully
– Not all rows may be written to disk if the job crashes
• Environment variable $APT_EXPORT_FLUSH_COUNT can
be used to specify the number of rows to buffer
– $APT_EXPORT_FLUSH_COUNT=1 flushes to disk for every row
– Setting this value too low incurs a performance penalty!


Figure 13-15. Buffering sequential file writes KM4001.0

Notes:
When DataStage issues a write, it writes to memory and assumes the record made it to
disk. However, with operating system buffering, this may not be true if there is a hard crash.
Setting the $APT_EXPORT_FLUSH_COUNT to 1 will guarantee the record is written to
disk.
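For example (the values shown are illustrative trade-offs, not recommendations):

    $APT_EXPORT_FLUSH_COUNT = 1       flush after every row: safest, slowest
    $APT_EXPORT_FLUSH_COUNT = 1000    flush every 1000 rows: a compromise
                                      between safety and performance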


Lookup Stage Guidelines


Figure 13-16. Lookup Stage Guidelines KM4001.0

Notes:


Lookup stage
• Lookup stage runs in two phases:
– Read all rows from reference link into memory;
indexed by lookup key
– Process incoming rows
• Reference data should be small enough to fit into
physical (shared) memory
– For reference data sets larger than available memory,
use the JOIN or MERGE stage
• Lookup stage processing cannot begin until all
reference links have been read into memory


Figure 13-17. Lookup stage KM4001.0

Notes:
The Lookup stage always runs in two phases. First, it reads all rows from the reference
links into memory (until end-of-data), indexing them by lookup key. Second, it processes
incoming rows. Lookup processing cannot begin until the data for all reference links has
been read into memory.


Partitioning lookup reference data


• Entire is the default partitioning method for Lookup
reference links
– On single systems, the Lookup stage uses shared memory
instead of duplicating the entire reference data
• Be careful using Entire on Clustered / Grid / MPP
configurations:
– Reference data will be copied to all engine machines
– It may be appropriate to use a keyed partitioning method,
especially if data is already partitioned on those keys
• Make sure input stream link and reference link partitioning
keys match


Figure 13-18. Partitioning lookup reference data KM4001.0

Notes:
On SMP configurations, it is usually best to specify ENTIRE for lookup reference data
partitioning.
For clustered/GRID/MPP configurations you should consider a keyed (for example, Hash)
partitioning method.


Lookup reference data


• The Lookup stage cannot output any rows until all reference link
data has been read into memory
– In this job design, that means reading all the source data (which might be
vast) into memory
• Never generate Lookup reference data using a fork-join of source
data
– Separate the creation of Lookup reference data from lookup processing

HeaderRef
Header

Src Out
Detail


Figure 13-19. Lookup reference data KM4001.0

Notes:
Because the Lookup stage cannot begin to process data until all reference data has been
loaded, you should never create lookup reference data from a fork of incoming source data.


Lookup file sets


• Use Lookup file sets to store Lookup data
– Data is stored in native format, partitioned, and pre-
indexed on lookup key columns
– Key columns and partitioning are specified when the
file set is created
– Particularly useful when static reference data can be
re-used in multiple jobs (or runs of the same job)
• Lookup file sets can only be used as reference input
link to a Lookup stage
– The partitioning method and key columns specified
when the Lookup file set was created will be used to
process the reference data


Figure 13-20. Lookup file sets KM4001.0

Notes:
Lookups read data from the source into memory and create indexes on that data. If you are
going to reuse data that does not change much, create a lookup file set because the
indexes are saved with the data.
Lookup file sets can only be read in a lookup reference, which limits their real-world use.
No utilities can read a lookup file set, including orchadmin.


Using Lookup File Set stages


• Lookup file sets are specific to the
configuration file used to create them
– Particularly on cluster/GRID/MPP
configurations
• Within the Lookup stage editor, you cannot
change the Lookup key column derivations
– Key column names in the Lookup file set must
match source key column names


Figure 13-21. Using Lookup File Set stages KM4001.0

Notes:
Beware when using Lookup File Set stages: lookup file sets are specific to the
configuration file used to create them.


Aggregator Stage Guidelines


Figure 13-22. Aggregator Stage Guidelines KM4001.0

Notes:


Aggregator
• Match input partitioning to Aggregator stage groupings
• Use Hash method for a limited number of distinct key values
(that is, limited number of groups)
– Uses 2K of memory per group
– Incoming data does not need to be pre-sorted
– Results are output after all rows
have been read
– Output row order is undefined
• Even if input data is sorted
• Use Sort method with a large
number of distinct key-column values
– “Control break” processing
– Requires input pre-sorted on key columns
– Results are output after each group

Figure 13-23. Aggregator KM4001.0

Notes:
Because grouping requires related rows to be in the same partition, partitioning matters.
The Hash method performs aggregations in memory, building a table of groups; the data
does not need to be sorted.


Using Aggregator to sum all input rows


• Generate a constant-value key column for all rows
– Column Generator: use the cycle algorithm on one value
– Transformer: hardcode the value
• Aggregate on the new column
– Run the Aggregator in sequential mode (set in Stage Advanced properties)
• Use two Aggregators to reduce collection time
– First Aggregator processes rows in parallel
– Second Aggregator runs sequentially, getting the final global sum

Parallel Sequential


Figure 13-24. Using Aggregator to sum all input rows KM4001.0

Notes:
The Aggregator stage does not contain a “sum all” function, and by default it runs in
parallel. This slide outlines the steps for summing over all rows in all partitions.
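As a sketch of the two-Aggregator pattern (stage, link, and column names are hypothetical):

    ColGen (AllRows = 1) --> Agg_Partial (parallel)   --> Agg_Total (sequential)
                             group by AllRows             group by AllRows
                             Sum(Amount) -> PartSum       Sum(PartSum) -> GrandTotal

The first Aggregator produces one partial sum per partition in parallel; the second, running
sequentially, adds the handful of partial sums to get the global total.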


Transformer Stage Guidelines


Figure 13-25. Transformer Stage Guidelines KM4001.0

Notes:


Transformer performance guidelines

• Minimize the number of Transformers


– If possible, combine derivations from multiple
Transformers
• Never use the “BASIC Transformer”
– Doesn’t show up in the standard palette by default
– Intended to provide a migration path for existing DataStage
Server applications that use DataStage BASIC routines
– Runs sequentially
– Invokes the DataStage server engine
– Extremely slow!


Figure 13-26. Transformer performance guidelines KM4001.0

Notes:
Transformer stages have a lot of overhead. You can reduce the overhead by reducing the
number of Transformer stages.


Transformer vs. other stages

• For optimum performance, consider more


appropriate stages instead of a Transformer in
parallel job flows:
– Use non-Transformer stage (for example, Copy
stage) to:
• Rename Columns
• Drop Columns
• Perform default type conversions
• Split output
• Transformer constraints are faster than the Filter or Switch
stages
– Filter and Switch expressions are interpreted at runtime
– Transformer constraints are compiled

Figure 13-27. Transformer vs. other stages KM4001.0

Notes:
Transformer stages have a lot of overhead. You can reduce the overhead by replacing
Transformers with other stages that have less overhead. For example, use a Copy stage
rather than a Transformer if all you want to do is rename some columns or split the output
stream.


Modify stage
• May perform better than the Transformer stage in some
cases
• Consider as a possible alternative for:
– Non-default type conversions
– Null handling
– String trimming
– Date / Time handling
• Drawback of Modify stage is that it has no expression editor
– Expressions are “hand-coded”
• Can be used to parameterize column names
– Only stage where you can do this


Figure 13-28. Modify stage KM4001.0

Notes:
The Modify stage can do many of the types of derivations the Transformer stage can do,
but it has less overhead. On the negative side, it is not as user-friendly or maintainable.
Specific syntax for Modify is detailed in the “DataStage Parallel Job Developers’ Guide”.
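As a hedged sketch of what hand-coded specifications look like (the column names are
hypothetical, and you should verify each function's exact syntax in that guide):

    CustId:int32 = int32_from_decimal(CustId)      non-default type conversion
    Region = handle_null(Region, 'UNKNOWN')        replace nulls with a default value
    CustName = string_trim[' ', end](CustName)     trim trailing pad characters
    DROP TempCol                                   drop a column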


Optimizing Transformer expressions


• Minimize repeated use of the same derivation
– Execute derivation in a stage variable
• Reference stage variable for other uses
• Examples:
– Portions of output column derivations that are used in multiple derivations
– Where an expression includes calculated constant values:
• Use the stage variable initial value to calculate once for all rows
– Where an expression is used as a constant (same value for every row
read):
• Set it as the stage variable initial value


Figure 13-29. Optimizing Transformer expressions KM4001.0

Notes:
You can improve Transformer performance by optimizing Transformer expressions. This
slide lists some ways.
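For example (the link, column, and stage-variable names are hypothetical), compute a
derivation once in a stage variable instead of repeating it in every output column:

    Stage variable derivation:
        svFullName = Trim(lnkIn.FirstName) : ' ' : Trim(lnkIn.LastName)

    Output column derivations:
        lnkOut.DisplayName = svFullName
        lnkOut.SortName    = UpCase(svFullName)

The trim-and-concatenate work is done once per row rather than once per output column.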


Simplifying Transformer expressions

• Leverage built-in functions to simplify complex expressions
• For example:
Original expression:
IF Link_1.ProdNum = "000" OR Link_1.ProdNum = "800" OR
Link_1.ProdNum = "888" OR Link_1.ProdNum = "866" OR
Link_1.ProdNum = "877" OR Link_1.ProdNum = "855" OR
Link_1.ProdNum = "844" OR Link_1.ProdNum = "833" OR
Link_1.ProdNum = "822" OR Link_1.ProdNum = "900"
THEN 'N'
ELSE 'Y'

Simplified expression:
IF index('000|800|888|866|877|855|844|833|822|900',
Link_1.ProdNum, 1) > 0
THEN 'N'
ELSE 'Y'

Figure 13-30. Simplifying Transformer expressions KM4001.0

Notes:
Simplifying Transformer expressions can save some time.


Transformer stage compared with Build stage

• Build stages provide a lower-level method to build


framework components
– Can redesign Transformer stages that do not meet
performance requirements
• Only replace those Transformers that are
bottlenecks
– Build stages require more knowledgeable developers
– Generally, more difficult to maintain


Figure 13-31. Transformer stage compared with Build stage KM4001.0

Notes:
Build stages provide a lower-level method to build framework components. You may be
able to accomplish what a Transformer is doing in a Build stage and get better
performance. Only replace those Transformers that are bottlenecks. Build stages are much
more difficult to maintain.


Transformer decimal arithmetic

• Default internal decimal variables are precision 38, scale 10
– Can be changed by:
• $APT_DECIMAL_INTERM_PRECISION
• $APT_DECIMAL_INTERM_SCALE


Figure 13-32. Transformer decimal arithmetic KM4001.0

Notes:
Default internal decimal variables are precision 38, scale 10. You can change these
defaults using the environment variables described on this slide.
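For example (the values are illustrative), the variables can be added as job parameters to
resize intermediate results:

    $APT_DECIMAL_INTERM_PRECISION = 30
    $APT_DECIMAL_INTERM_SCALE     = 6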


Transformer decimal rounding


• Use $APT_DECIMAL_INTERM_ROUND_MODE to specify decimal
rounding
• ceil: round up
– 1.4 -> 2, -1.6 -> -1
• floor: round down
– 1.6 -> 1, -1.4 -> -2
• round_inf: round to the nearest value; ties round away from zero
– 1.4 -> 1, 1.5 -> 2, -1.4 -> -1, -1.5 -> -2
• trunc_zero: discard any fractional digits to the right
of the rightmost fractional digit supported
– 1.56 -> 1.5, -1.56 -> -1.5


Figure 13-33. Transformer decimal rounding KM4001.0

Notes:
Use $APT_DECIMAL_INTERM_ROUND_MODE to specify decimal rounding.
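For example (a sketch; the variable is typically added as a job parameter):

    $APT_DECIMAL_INTERM_ROUND_MODE = ceil

With this setting and the behavior shown on the slide, an intermediate value of 1.4 rounds
up to 2 and -1.6 rounds up to -1.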


Conditionally aborting a job


• Use the Abort After Rows property in the Transformer
constraints to conditionally abort a job
– Create a new output link and assign a link constraint that
matches the abort condition
– Set the Abort After Rows for this link to the number of rows
allowed before the job aborts


Figure 13-34. Conditionally aborting a job KM4001.0

Notes:
Use the Abort After Rows property in the Transformer constraints to conditionally abort a
job. Here, the constraint describes the condition that you are measuring to determine
whether to abort. These might, for example, be rows that contain out-of-range values.
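As a sketch (the link and column names are hypothetical), a Rejects link that aborts the job
on too many out-of-range values might be set up as:

    Rejects link constraint:   lnkIn.Amount < 0 Or lnkIn.Amount > 999999
    Abort After Rows:          50

With this setting, the job aborts once 50 rows in any one partition satisfy the constraint, as
in the example job that follows.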


Job Design Examples


Figure 13-35. Job Design Examples KM4001.0

Notes:


Summing all rows with Aggregator stage

Used to generate a new column with a single value

Aggregates over the new generated column

Aggregator must run sequentially


Figure 13-36. Summing all rows with Aggregator stage KM4001.0

Notes:
The CUSTS_Log sequential file records the number of rows that should have been written
to the CUSTS table, because the same rows that go down the CUSTS link also go down the
ToCount link. The database may reject some rows; in that case, the number in the log will
be greater than the number in the table. So the CUSTS_Log file provides a check.
On the next slide we see that the purpose of the Transformer in this job is to conditionally
abort the job.


Conditionally aborting the job

Range check constraint on Rejects link
Number of rows per partition before abort
Abort message in the log


Figure 13-37. Conditionally aborting the job KM4001.0

Notes:
Here, we are conditionally aborting the job when the number of rows going down the
Rejects link reaches 50 (in any given partition).


Checkpoint
1. What effect does using $PROJDEF as a default value of a
job parameter have?
2. What optional property can you use to read a sequential file
in parallel?
3. What is the default partitioning method for the Lookup stage?


Figure 13-38. Checkpoint KM4001.0

Notes:
Write your answers here:


Exercise 13 – Best practices


• In this lab exercise, you will:
– Sum all input rows using
Aggregator stage
– Use Column Import stage to add
key column
– Run Aggregator stage
sequentially, so all rows get
counted
– Conditionally abort a job


Figure 13-39. Exercise 13 - Best practices KM4001.0

Notes:


Unit summary
Having completed this unit, you should be able to:
• Describe overall job guidelines
• Describe stage usage guidelines
• Describe Lookup stage guidelines
• Describe Aggregator stage guidelines
• Describe Transformer stage guidelines


Figure 13-40. Unit summary KM4001.0

Notes:
