Course Guide
IBM InfoSphere DataStage Essentials v11.5
Course code KM204 ERC 1.0

IBM Training
Preface

November 2015
NOTICES
This information was developed for products and services offered in the USA.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for
information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to
state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any
non-IBM product, program, or service. IBM may have patents or pending patent applications covering subject matter described in this document.
The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to:
IBM Director of Licensing
IBM Corporation
North Castle Drive, MD-NC119
Armonk, NY 10504-1785
United States of America
The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law:
INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND,
EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in
certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these
changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the
program(s) described in this publication at any time without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of
those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you. Information
concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available
sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM
products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the
examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and
addresses used by an actual business enterprise is entirely coincidental.
TRADEMARKS
IBM, the IBM logo, ibm.com, InfoSphere, and DataStage are trademarks or registered trademarks of International Business Machines Corp.,
registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM
trademarks is available on the web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml.
Adobe, and the Adobe logo, are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other
countries.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
Microsoft, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
© Copyright International Business Machines Corporation 2015.
This document may not be reproduced in whole or in part without the prior written permission of IBM.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.


Contents
Preface................................................................................................................. P-1
Contents ............................................................................................................. P-3
Course overview............................................................................................... P-14
Document conventions ..................................................................................... P-15
Additional training resources ............................................................................ P-16
IBM product help .............................................................................................. P-17
Introduction to DataStage .................................................................... 1-1
Unit objectives .................................................................................................... 1-3
What is IBM InfoSphere DataStage? .................................................................. 1-4
What is Information Server? .............................................................. 1-5
Information Server backbone.............................................................................. 1-6
Information Server Web Console........................................................................ 1-7
DataStage architecture ....................................................................................... 1-8
DataStage Administrator .................................................................................... 1-9
DataStage Designer ......................................................................................... 1-10
DataStage Director ........................................................................................... 1-11
Developing in DataStage .................................................................................. 1-12
DataStage project repository ............................................................................ 1-13
Types of DataStage jobs .................................................................................. 1-14
Design elements of parallel jobs....................................................................... 1-15
Pipeline parallelism .......................................................................................... 1-16
Partition parallelism .......................................................................................... 1-17
Three-node partitioning .................................................................................... 1-18
Job design versus execution ............................................................................ 1-19
Configuration file .............................................................................................. 1-20
Example: Configuration file............................................................................... 1-21
Checkpoint ....................................................................................................... 1-22
Checkpoint solutions ........................................................................................ 1-23
Unit summary ................................................................................................... 1-24
Deployment ........................................................................................... 2-1
Unit objectives .................................................................................................... 2-3
What gets deployed ............................................................................................ 2-4
Deployment: Everything on one machine ........................................................... 2-5
Deployment: DataStage on a separate machine ................................................ 2-6
Metadata Server and DB2 on separate machines .............................................. 2-7
Information Server start-up ................................................................................. 2-8
Starting Information Server on Windows ............................................................ 2-9
Starting Information Server on Linux ................................................................ 2-10

Verifying that Information Server is running ...................................... 2-11
Web Console Login window ............................................................................. 2-12
Checkpoint ....................................................................................................... 2-13
Checkpoint solutions ........................................................................................ 2-14
Demonstration 1: Log into the Information Server Administration Console ....... 2-15
Unit summary ................................................................................................... 2-18
DataStage Administration .................................................................... 3-1
Unit objectives .................................................................................................... 3-3
Information Server Web Console - Administration .............................................. 3-4
Web Console Login window ............................................................................... 3-5
User and group management ............................................................................. 3-6
Create a DataStage User ID ............................................................................... 3-7
Assign DataStage roles ...................................................................................... 3-8
DataStage credentials ........................................................................................ 3-9
DataStage Credentials Default Mapping........................................................... 3-10
Logging onto DataStage Administrator ............................................................. 3-11
DataStage Administrator Projects tab ............................................................... 3-12
DataStage Administrator General tab ............................................................... 3-13
Environment variables ...................................................................................... 3-14
Environment reporting variables ....................................................................... 3-15
DataStage Administrator Permissions tab ........................................................ 3-16
Adding users and groups.................................................................................. 3-17
Specify DataStage role ..................................................................................... 3-18
DataStage Administrator Logs tab .................................................................... 3-19
DataStage Administrator Parallel tab ................................................................ 3-20
Checkpoint ....................................................................................................... 3-21
Checkpoint solutions ........................................................................................ 3-22
Demonstration 1: Administering DataStage ...................................................... 3-23
Unit summary ................................................................................................... 3-34
Work with metadata .............................................................................. 4-1
Unit objectives .................................................................................................... 4-3
Login to Designer ............................................................................................... 4-4
Designer work area ............................................................................................ 4-5
Repository window ............................................................................................. 4-6
Import and export ............................................................................................... 4-7
Export procedure ................................................................................................ 4-8
Export window .................................................................................................... 4-9
Import procedure .............................................................................................. 4-10
Import options................................................................................................... 4-11
Source and target metadata ............................................................................. 4-12

Sequential file import procedure ....................................................... 4-13
Import sequential metadata .............................................................................. 4-14
Sequential import window ................................................................................. 4-15
Specify format .................................................................................................. 4-16
Edit column names and types........................................................................... 4-17
Extended properties window ............................................................................ 4-18
Table definition in the repository ....................................................................... 4-19
Checkpoint ....................................................................................................... 4-20
Checkpoint solutions ........................................................................................ 4-21
Demonstration 1: Import and export DataStage objects ................................... 4-22
Demonstration 2: Import a table definition ........................................................ 4-27
Unit summary ................................................................................................... 4-33
Create parallel jobs ............................................................................... 5-1
Unit objectives .................................................................................................... 5-3
What is a parallel job? ........................................................................................ 5-4
Job development overview ................................................................................. 5-5
Tools Palette ...................................................................................................... 5-6
Add stages and links .......................................................................................... 5-7
Job creation example sequence ......................................................................... 5-8
Create a new parallel job .................................................................................... 5-9
Drag stages and links from the Palette ............................................................. 5-10
Rename links and stages ................................................................................. 5-11
Row Generator stage ....................................................................................... 5-12
Inside the Row Generator stage ....................................................................... 5-13
Row Generator Columns tab ............................................................................ 5-14
Extended properties ......................................................................................... 5-15
Peek stage ....................................................................................................... 5-16
Peek stage properties....................................................................................... 5-17
Job parameters ................................................................................................ 5-18
Define a job parameter ..................................................................................... 5-19
Use a job parameter in a stage......................................................................... 5-20
Add job documentation ..................................................................................... 5-21
Job Properties window documentation ............................................................. 5-22
Annotation stage properties .............................................................................. 5-23
Compile and run a job ...................................................................................... 5-24
Errors or successful message .......................................................................... 5-25
DataStage Director ........................................................................................... 5-26
Run options ...................................................................................................... 5-27
Performance statistics ...................................................................................... 5-28
Director Status view ......................................................................................... 5-29
Job log, viewed from Designer ......................................................................... 5-30

Message details ............................................................................... 5-31
Other job log functions...................................................................................... 5-32
Director monitor ................................................................................................ 5-33
Run jobs from the command line ...................................................................... 5-34
Parameter sets ................................................................................................. 5-35
Create a parameter set..................................................................................... 5-36
Defining the parameters ................................................................................... 5-37
Defining values files ......................................................................................... 5-38
Load a parameter set into a job ........................................................................ 5-39
Use parameter set parameters ......................................................................... 5-40
Run jobs with parameter set parameters .......................................................... 5-41
Checkpoint ....................................................................................................... 5-42
Checkpoint solutions ........................................................................................ 5-43
Demonstration 1: Create parallel jobs............................................................... 5-44
Unit summary ................................................................................................... 5-56
Access sequential data ........................................................................ 6-1
Unit objectives .................................................................................................... 6-3
How sequential data is handled .......................................................................... 6-4
Features of the Sequential File stage ................................................................. 6-5
Sequential file format example ........................................................................... 6-6
Job design with Sequential File stages ............................................................... 6-7
Sequential File stage properties ......................................................................... 6-8
Format tab .......................................................................................................... 6-9
Columns tab ..................................................................................................... 6-10
Reading sequential files using a file pattern...................................................... 6-11
Multiple readers ................................................................................................ 6-12
Writing to a sequential file ................................................................................ 6-13
Reject links ....................................................................................................... 6-14
Source and target reject links ........................................................................... 6-15
Setting the Reject Mode property ..................................................................... 6-16
Copy stage ....................................................................................................... 6-17
Copy stage example ......................................................................................... 6-18
Copy stage Mappings ....................................................................................... 6-19
Demonstration 1: Reading and writing to sequential files ................................. 6-20
Working with nulls ............................................................................................ 6-32
Specifying a value for null................................................................................. 6-33
Empty string example ....................................................................................... 6-34
Viewing data with nulls ..................................................................................... 6-35
Demonstration 2: Reading and writing null values ............................................ 6-36
Data Set stage.................................................................................................. 6-43
Job with a target Data Set stage....................................................................... 6-44

Data Set Management utility............................................................. 6-45
Data and schema displayed ............................................................................. 6-46
File set stage .................................................................................................... 6-47
Demonstration 3: Working with data sets ......................................................... 6-48
Checkpoint ....................................................................................................... 6-53
Checkpoint solutions ........................................................................................ 6-54
Unit summary ................................................................................................... 6-55
Partitioning and collecting algorithms ................................................ 7-1
Unit objectives .................................................................................................... 7-3
Partition parallelism ............................................................................................ 7-4
Stage partitioning ............................................................................................... 7-5
DataStage hardware environments .................................................................... 7-6
Partitioning algorithms ........................................................................................ 7-7
Collecting ........................................................................................................... 7-8
Collecting algorithms ........................................................................................ 7-10
Keyless versus keyed partitioning algorithms ................................................... 7-11
Round Robin and Random partitioning ............................................................. 7-12
Entire partitioning ............................................................................................. 7-13
Hash partitioning .............................................................................................. 7-14
Modulus partitioning ......................................................................................... 7-15
Auto partitioning ............................................................................................... 7-16
Partitioning requirements for related records .................................................... 7-17
Partition imbalances example ........................................................................... 7-18
Partitioning / Collecting link icons ..................................................................... 7-19
More partitioning icons ..................................................................................... 7-20
Specify a partitioning algorithm......................................................................... 7-21
Specify a collecting algorithm ........................................................................... 7-22
Configuration file .............................................................................................. 7-23
Example configuration file ................................................................................ 7-24
Adding $APT_CONFIG_FILE as a job parameter ............................................ 7-25
Editing configuration files.................................................................................. 7-26
Parallel job compilation..................................................................................... 7-27
Generated OSH................................................................................................ 7-28
Stage-to-operator mapping examples............................................................... 7-29
Job Score ......................................................................................................... 7-30
Viewing the Score ............................................................................................ 7-31
Checkpoint ....................................................................................................... 7-32
Checkpoint solutions ........................................................................................ 7-33
Demonstration 1: Partitioning and collecting ..................................................... 7-34
Unit summary ................................................................................................... 7-43

Combine data ........................................................................................ 8-1
Unit objectives .................................................................................................... 8-3
Combine data ..................................................................................................... 8-4
Lookup, Join, Merge stages ............................................................................... 8-5
Lookup Stage features ....................................................................................... 8-6
Lookup types ...................................................................................................... 8-7
Equality match Lookup stage example ............................................................... 8-8
Lookup stage with an equality match .................................................................. 8-9
Define the lookup key ....................................................................................... 8-10
Specify the output columns .............................................................................. 8-11
Lookup failure actions....................................................................................... 8-12
Specifying lookup failure actions ...................................................................... 8-13
Lookup stage with reject link ............................................................................ 8-14
Lookup stage behavior ..................................................................................... 8-15
Lookup stage output ......................................................................................... 8-16
Demonstration 1: Using the Lookup stage ........................................................ 8-17
Range Lookup stage job................................................................................... 8-26
Range on reference link ................................................................................... 8-27
Selecting the stream column ............................................................................ 8-28
Range expression editor................................................................................... 8-29
Range on stream link........................................................................................ 8-30
Specifying the range lookup ............................................................................. 8-31
Range expression editor................................................................................... 8-32
Demonstration 2: Range lookups ..................................................................... 8-33
Join stage ......................................................................................................... 8-39
Job with Join stage ........................................................................................... 8-40
Join stage properties ........................................................................................ 8-41
Output Mapping tab .......................................................................................... 8-42
Join stage behavior .......................................................................................... 8-43
Inner join output................................................................................................ 8-44
Left outer join output ......................................................................................... 8-45
Right outer join output ...................................................................................... 8-46
Full outer join .................................................................................................... 8-47
Merge stage ..................................................................................................... 8-48
Merge stage job................................................................................................ 8-49
Merge stage properties..................................................................................... 8-50
Comparison Chart ............................................................................................ 8-51
What is a Funnel stage? ................................................................................... 8-52
Funnel stage example ...................................................................................... 8-53
Funnel stage properties .................................................................................... 8-54

Checkpoint ....................................................................................... 8-55
Checkpoint solutions ........................................................................................ 8-56
Demonstration 3: Using Join, Merge, and Funnel stages ................................. 8-57
Unit summary ................................................................................................... 8-65
Group processing stages ..................................................................... 9-1
Unit objectives .................................................................................................... 9-3
Group processing stages.................................................................................... 9-4
Sort data............................................................................................................. 9-5
Sorting alternatives ............................................................................................. 9-6
In-Stage sorting .................................................................................................. 9-7
Stable sort illustration ......................................................................................... 9-8
Sort stage Properties tab .................................................................................... 9-9
Specify sort keys .............................................................................................. 9-10
Sort stage options ............................................................................................ 9-11
Create key change column ............................................................................... 9-12
Partition sorts ................................................................................................... 9-13
Aggregator stage .............................................................................................. 9-14
Job with Aggregator stage ................................................................................ 9-15
Aggregation types ............................................................................................ 9-16
Count Rows aggregation type .......................................................................... 9-17
Output Mapping tab .......................................................................................... 9-18
Output Columns tab ......................................................................................... 9-19
Calculation aggregation type ............................................................................ 9-20
Grouping methods ............................................................................................ 9-21
Method = Hash ................................................................................................. 9-22
Method = Sort................................................................................................... 9-23
Remove duplicates ........................................................................................... 9-24
Remove Duplicates stage job ........................................................................... 9-25
Remove Duplicates stage properties ................................................................ 9-26
Checkpoint ....................................................................................................... 9-27
Checkpoint solutions ........................................................................................ 9-28
Demonstration 1: Group processing stages ...................................................... 9-29
Fork-Join Job Design ....................................................................................... 9-39
Unit Summary................................................................................................... 9-40
Transformer stage ............................................................................ 10-1
Unit objectives .................................................................................................. 10-3
Transformer stage ............................................................................................ 10-4
Job with a Transformer stage ........................................................................... 10-5
Inside the Transformer stage ............................................................................ 10-6
Transformer stage elements ............................................................................. 10-7

Constraints ....................................................................................... 10-9
Constraints example ....................................................................................... 10-10
Define a constraint ......................................................................................... 10-11
Use the expression editor ............................................................................... 10-12
Otherwise links for data integrity..................................................................... 10-13
Otherwise link example .................................................................................. 10-14
Specify the link ordering ................................................................................. 10-15
Specify the Otherwise link constraint .............................................................. 10-16
Demonstration 1: Define a constraint.............................................................. 10-17
Derivations ..................................................................................................... 10-24
Derivation targets ........................................................................................... 10-25
Stage variables............................................................................................... 10-26
Stage variable definitions ............................................................................... 10-27
Build a derivation ............................................................................................ 10-28
Define a derivation ......................................................................................... 10-29
IF THEN ELSE derivation ............................................................................... 10-30
String functions and operators ........................................................................ 10-31
Null handling .................................................................................................. 10-32
Unhandled nulls.............................................................................................. 10-33
Legacy null processing ................................................................................... 10-34
Transformer stage reject link .......................................................................... 10-35
Demonstration 2: Define derivations ............................................................... 10-36
Loop processing ............................................................................................. 10-44
Functions used in loop processing ................................................................. 10-45
Loop processing example............................................................................... 10-46
Loop processing example job ......................................................................... 10-47
Inside the Transformer stage .......................................................................... 10-48
Demonstration 3: Loop processing ................................................................. 10-49
Group processing ........................................................................................... 10-55
Group processing example ............................................................................. 10-56
Job results ...................................................................................................... 10-57
Transformer logic ........................................................................................... 10-58
Loop through saved input rows....................................................................... 10-59
Example job results ........................................................................................ 10-60
Transformer logic ........................................................................................... 10-61
Parallel job debugger ..................................................................................... 10-62
Set breakpoints .............................................................................................. 10-63
Edit breakpoints.............................................................................................. 10-64
Running a parallel job in the debugger ........................................................... 10-65
Add columns to the watch list ......................................................................... 10-66

Demonstration 4: Group processing in a Transformer .................... 10-67
Checkpoint ..................................................................................................... 10-85
Checkpoint solutions ...................................................................................... 10-86
Unit summary ................................................................................................. 10-87
Repository functions ....................................................................... 11-1
Unit objectives .................................................................................................. 11-3
Quick find ......................................................................................................... 11-4
Found results ................................................................................................... 11-5
Advanced Find window..................................................................................... 11-6
Advanced Find options ..................................................................................... 11-7
Using the found results ..................................................................................... 11-8
Performing an impact analysis.......................................................................... 11-9
Initiating an impact analysis ............................................................................ 11-10
Results in text format ...................................................................................... 11-11
Results in graphical format ............................................................................. 11-12
Displaying the dependency graph .................................................................. 11-13
Displaying the dependency path..................................................................... 11-14
Generating an HTML report ............................................................................ 11-15
Viewing column-level data flow....................................................................... 11-16
Finding where a column originates ................................................................. 11-17
Displayed results ............................................................................................ 11-18
Finding the difference between two jobs......................................................... 11-19
Initiating the comparison................................................................................. 11-20
Comparison results ........................................................................................ 11-21
Saving to an HTML file ................................................................................... 11-22
Comparing table definitions ............................................................................ 11-23
Checkpoint ..................................................................................................... 11-24
Checkpoint solutions ...................................................................................... 11-25
Demonstration 1: Repository functions ........................................................... 11-26
Unit summary ................................................................................................. 11-35
Work with relational data ................................................................. 12-1
Unit objectives .................................................................................................. 12-3
Importing relational table definitions ................................................................. 12-4
Orchestrate schema import .............................................................................. 12-5
ODBC import .................................................................................................... 12-6
Connector stages ............................................................................................. 12-7
Reading from database tables .......................................................................... 12-8
Connector stage GUI ........................................................................................ 12-9
Navigation panel............................................................................................. 12-10
Connection properties .................................................................................... 12-11

Usage properties - Generate SQL .................................................. 12-12
Usage properties - Transaction ...................................................................... 12-13
Usage properties - Session and Before/After SQL ......................................... 12-14
Writing to database tables .............................................................................. 12-15
DB2 Connector GUI ....................................................................................... 12-16
Connector write properties ............................................................................. 12-17
Data connection objects ................................................................................. 12-18
Data connection object ................................................................................... 12-19
Creating a new data connection object ........................................................... 12-20
Loading the data connection .......................................................................... 12-21
Demonstration 1: Read and write to relational tables...................................... 12-22
Multiple input links .......................................................................................... 12-32
Job with multiple input links and reject links.................................................... 12-33
Specifying input link properties ....................................................................... 12-34
Record ordering property................................................................................ 12-35
Reject link specification .................................................................................. 12-36
Demonstration 2: Connector stages with multiple input links .......................... 12-37
SQL Builder .................................................................................................... 12-49
Table definition Locator tab ............................................................................ 12-50
Opening SQL Builder ..................................................................................... 12-51
SQL Builder window ....................................................................................... 12-52
Creating a calculated column ......................................................................... 12-53
Constructing a WHERE clause ....................................................................... 12-54
Sorting the data .............................................................................................. 12-55
Viewing the generated SQL ............................................................................ 12-56
Checkpoint ..................................................................................................... 12-57
Checkpoint solutions ...................................................................................... 12-58
Demonstration 3: Construct SQL using SQL Builder ...................................... 12-59
Unit summary ................................................................................................. 12-67
Job control........................................................................................ 13-1
Unit objectives .................................................................................................. 13-3
What is a job sequence? .................................................................................. 13-4
Basics for creating a job sequence ................................................................... 13-5
Job sequence stages........................................................................................ 13-6
Job sequence example..................................................................................... 13-7
Job sequence properties .................................................................................. 13-8
Job Activity stage properties ............................................................................. 13-9
Job Activity trigger .......................................................................................... 13-10
Execute Command stage ............................................................................... 13-11
Notification Activity stage................................................................................ 13-12
User Variables stage ...................................................................................... 13-13

Referencing the user variable ......................................................... 13-14
Wait for File stage .......................................................................................... 13-15
Sequencer stage ............................................................................................ 13-16
Nested Condition stage .................................................................................. 13-17
Loop stages .................................................................................................... 13-18
Handling activities that fail .............................................................................. 13-19
Exception Handler stage ................................................................................ 13-20
Enable restart ................................................................................................. 13-21
Disable checkpoint for a Stage ....................................................................... 13-22
Checkpoint ..................................................................................................... 13-23
Checkpoint solutions ...................................................................................... 13-24
Demonstration 1: Build and run a job sequence ............................................. 13-25
Unit summary ................................................................................................. 13-38

Course overview
Preface overview
This course enables project administrators and ETL developers to acquire the skills
necessary to develop parallel jobs in DataStage. The emphasis is on developers: only
the administrative functions that are relevant to DataStage developers are discussed in detail.
Students will learn to create parallel jobs that access sequential and relational data and
combine and transform the data using functions and other job components.
Intended audience
Project administrators and ETL developers responsible for data extraction and
transformation using DataStage.
Topics covered
Topics covered in this course include:
• Introduction to DataStage
• Deployment
• DataStage Administration
• Work with metadata
• Create parallel jobs
• Access sequential data
• Partitioning and collecting algorithms
• Combine data
• Group processing stages
• Transformer stage
• Repository functions
• Work with relational data
• Job control
Course prerequisites
This course has no prerequisites.

Document conventions
Conventions used in this guide follow Microsoft Windows application standards, where
applicable. In addition, the following conventions are observed:
• Bold: Bold style is used in demonstration and exercise step-by-step solutions to
indicate a user interface element that is actively selected or text that must be
typed by the participant.
• Italic: Used to reference book titles.
• CAPITALIZATION: All file names, table names, column names, and folder names
appear in this guide exactly as they appear in the application.
To keep capitalization consistent with this guide, type text exactly as shown.

Additional training resources


• Visit IBM Analytics Product Training and Certification on the IBM website for
details on:
• Instructor-led training in a classroom or online
• Self-paced training that fits your needs and schedule
• Comprehensive curricula and training paths that help you identify the courses
that are right for you
• IBM Analytics Certification program
• Other resources that will enhance your success with IBM Analytics Software
• For the URL relevant to your training requirements outlined above, bookmark:
• Information Management portfolio:
http://www-01.ibm.com/software/data/education/

IBM product help


Help type: Task-oriented
When to use: You are working in the product and you need specific task-oriented help.
Location: IBM Product - Help link

Help type: Books for Printing (.pdf)
When to use: You want to use search engines to find information. You can then print out selected pages, a section, or the whole book. Use the Step-by-Step online books (.pdf) if you want to know how to complete a task but prefer to read about it in a book. The Step-by-Step online books contain the same information as the online help, but the method of presentation is different.
Location: Start/Programs/IBM Product/Documentation

Help type: IBM on the Web
When to use: You want to access any of the following:
• IBM - Training and Certification: http://www-01.ibm.com/software/analytics/training-and-certification/
• Online support: http://www-947.ibm.com/support/entry/portal/Overview/Software
• IBM Web site: http://www.ibm.com

Introduction to DataStage

IBM InfoSphere DataStage v11.5

Unit objectives
• List and describe the uses of DataStage
• List and describe the DataStage clients
• Describe the DataStage workflow
• Describe the two types of parallelism exhibited by DataStage
parallel jobs


What is IBM InfoSphere DataStage?


• Design jobs for Extraction, Transformation, and Loading (ETL)
• Ideal tool for data integration projects, such as data warehouses,
data marts, and system migrations
• Import, export, create, and manage metadata for use within jobs
• Build, run, and monitor jobs, all within DataStage
• Administer your DataStage development and execution environments
• Create batch (controlling) jobs
 Called job sequences


What is IBM InfoSphere DataStage?


DataStage is a comprehensive tool for the fast, easy creation and maintenance of data
marts and data warehouses. It provides the tools you need to build, manage, and
expand them. With DataStage, you can build solutions faster and give users access to
the data and reports they need.
With DataStage you can design jobs that extract, integrate, aggregate, load, and
transform the data for your data warehouse or data mart. To facilitate your
development, you can create and reuse metadata and job components. After building
the DataStage job, you can run, monitor, and schedule it.


What is Information Server?


• Suite of applications, including DataStage, that share a common:
 Repository
 Set of application services and functionality
− Provided by the Metadata Server component
• By default an application named “server1”, hosted by an IBM WebSphere
Application Server (WAS) instance
− Provided services include:
• Security
• Repository
• Logging and reporting
• Metadata management
• Managed using the Information Server Web Console client


What is Information Server?


Information Server (IS) is a suite of applications that all share the same repository and
the same backbone of services and functionality. It is managed using web console
clients. Individual applications are managed using their own set of clients.
The backbone of services is provided by a WebSphere Application Server (WAS)
instance, which by default is named server1. Individual applications and components in
the Information Server suite all utilize these services.


Information Server backbone

[Diagram: the Information Server backbone. Hosted applications across the top:
Information Services Director, Information Governance Catalog, Information
Analyzer, FastTrack, DataStage/QualityStage, Data Click, and MetaBrokers. They
share Metadata Access Services and Metadata Analysis Services, provided by the
Metadata Server, and are managed through the Information Server Web Console.]

Information Server backbone


This graphic shows the Information Server backbone. The hosted applications are at
the top. They all share the same services displayed in the middle. They all share the
same repository displayed at the lower right. They are managed using the Information
Server Web Console as well as their individual clients.
Although DataStage and QualityStage are separate products with separate licenses,
QualityStage is actually embedded within DataStage as a set of stages.


Information Server Web Console

[Screenshot: the Information Server Web Console, showing the Administration
and Reporting tabs and the InfoSphere Users folder]

Information Server Web Console


This graphic shows the Information Server Administration Console. Click the
Administration tab to perform Information Server administrative functions. Shown is
the folder where DataStage user IDs are created. An Information Server administration
role is required to create user IDs for any of the Information Server products.
Also shown is the Reporting tab. DataStage users can log in and create reports using
one of the supplied DataStage report templates.


DataStage architecture
• DataStage clients

Administrator Designer Director

• DataStage engines
 Parallel engine
− Runs parallel jobs
 Server engine
− Runs server jobs
− Runs job sequences


DataStage architecture
The top half displays the DataStage clients. On the lower half are two engines. The
parallel engine runs DataStage parallel jobs. The server engine runs DataStage server
jobs and job sequences. Our focus in this course is on Parallel jobs and job sequences.
The DataStage clients are:
Administrator
Configures DataStage projects and specifies DataStage user roles.
Designer
Creates DataStage jobs that are compiled into executable programs.
Director
Used to run and monitor the DataStage jobs, although this can also be done in
Designer.


DataStage Administrator

Project environment
variables


DataStage Administrator
Use the Administrator client to specify general server defaults, to add and delete
projects, and to set project defaults and properties.
On the General tab, you have access to the project environment variables. On the
Permissions tab, you can specify DataStage user roles. On the Parallel tab, you
specify general defaults for parallel jobs. On the Sequence tab, you specify defaults for
job sequences. On the Logs tab, you specify defaults for the job log.
A DataStage administrator role, set in the Information Server Web Console, has full
authorization to work in the DataStage Administrator client.


DataStage Designer

Menus / toolbar

DataStage parallel
job with DB2
Connector stage

Job log


DataStage Designer
DataStage Designer is where you build your ETL (Extraction, Transformation, Load)
jobs. You build a job by dragging stages from the Palette (lower left corner) to the
canvas. You draw links between the stages to specify the flow of data. In this example,
a Sequential File stage is used to read data from a sequential file. The data flows into a
Transformer stage where various transformations are performed. Then the data is
written out to target DB2 tables based on constraints defined in the Transformer and
SQL specified in the DB2 Connector stage.
The links coming out of the DB2 Connector stage are reject links which capture SQL
errors.


DataStage Director

Log messages


DataStage Director
As your job runs, messages are written to the log. These messages display information
about errors and warnings, information about the environment in which the job is
running, statistics about the numbers of rows processed by various stages, and much
more.
The graphic shows the job log displayed in the Director client. For individual jobs open
in Designer, the job log can also be displayed in Designer.


Developing in DataStage
• Define global and project properties in Administrator
• Import metadata into the Repository
 Specifies formats of sources and targets accessed by your jobs
• Build job in Designer
• Compile job in Designer
• Run the job and monitor job log messages
 The job log can be viewed either in Director or in Designer
− In Designer, only the job log for the currently opened job is available
 Jobs can be run from Director, from Designer, or from the command line (see
the command-line sketch below)
 Performance statistics show up in the log and also on the Designer canvas
as the job runs


Developing in DataStage
Development workflow: Define your project’s properties in Administrator. Import the
metadata that defines the format of data your jobs will read from or write to. In Designer,
build the job. Define data extractions (reads). Define data flows. Define data
combinations, data transformations, data constraints, data aggregations, and data
loads (writes).
After you build your job, compile it in Designer. Then you can run and monitor the job,
either in Designer or Director.
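Jobs can also be started from the engine system with the dsjob command-line
client. A minimal sketch, assuming the domain, engine, and project names used
in this course environment and a hypothetical job named MyParallelJob:

   dsjob -domain edserver:9443 -user dsadmin -password dsadmin
         -server EDSERVER -run -jobstatus DSProject MyParallelJob

Here -run starts the job and -jobstatus waits for it to finish, setting the
command's exit code from the job's final status. Check the dsjob reference for
your release, since options can vary.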


DataStage project repository

User-added folder

Standard jobs folder

Standard table
definitions folder


DataStage project repository


All your work is stored in a DataStage project. Before you can do anything, other than
some general administration, you must open (attach to) a project.
Projects are created during and after the installation process. You can add projects
after installation on the Projects tab of Administrator.
A project is associated with a directory. The project directory is used by DataStage to
store your jobs and other DataStage objects and metadata on the DataStage server
system.
Projects are self-contained. Although multiple projects can be open at the same time,
they are separate environments. You can, however, import and export objects between
them.
Multiple users can be working in the same project at the same time. However,
DataStage will prevent multiple users from editing the same DataStage object (job,
table definition, and so on) at the same time.
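Objects can be exported and imported through the Designer client, and also
from the command line. A sketch using the dscmdexport command installed with
the DataStage Windows clients; the domain, engine, and project names are the
ones assumed in this course, and the output path is a placeholder:

   dscmdexport /D=edserver:9443 /H=EDSERVER /U=dsadmin /P=dsadmin
      DSProject C:\exports\DSProject.dsx

This writes the project's objects to a .dsx file that can then be imported
into another project. Verify the option letters against the command reference
for your release.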


Types of DataStage jobs


• Parallel jobs
 Executed by the DataStage parallel engine
 Built-in capability for pipeline and partition parallelism
 Compiled into OSH
− Executable script viewable in Designer and the log
• Server jobs
 Executed by the DataStage Server engine
 Use a different set of stages than parallel jobs
 No built-in capability for partition parallelism
 Runtime monitoring in the job log
• Job sequences (batch jobs, controlling jobs)
 A server job that runs and controls jobs and other activities
 Can run both parallel jobs and other job sequences
 Provides a common interface to the set of jobs it controls

Types of DataStage jobs


This course focuses on parallel jobs and job sequences that control batches of jobs. But
these are not the only kinds of jobs you can create in DataStage. Each type of job has
its own canvas and set of stages.
The key difference between DataStage parallel and server jobs is the engine used to
run them. DataStage parallel jobs are run using the parallel engine. Parallel jobs can
achieve very high performance using the engine’s capacity for pipeline and partition
parallelism.


Design elements of parallel jobs


• Stages
 Passive stages (E and L of ETL)
− Read data
− Write data
− Examples: Sequential File, DB2, Oracle, Peek stages
 Processor (active) stages (T of ETL)
− Transform data (Transformer stage)
− Filter data (Transformer stage)
− Aggregate data (Aggregator stage)
− Generate data (Row Generator stage)
− Merge data (Join, Lookup stages)
• Links
 “Pipes” through which the data moves from stage to stage


Design elements of parallel jobs


You design your DataStage parallel job using stages and links. Links are like pipes
through which data flows. There are two categories of stages. Passive stages are used
to read and write to data sources. Processor (active) stages are used to perform some
sort of operation on the data.
There are many different types of active stages. Many perform very specific functions,
such as sorting, filtering, and joining data. Others contain large amounts of functionality,
such as the Transformer and XML stages.


Pipeline parallelism

• Transform, Enrich, Load stages execute in parallel


• Like a conveyor belt moving rows from stage to stage
 Run downstream stages while upstream stages are running
• Advantages:
 Reduces disk usage for staging areas
 Keeps processors busy
• Has limits on scalability


Pipeline parallelism
In this diagram, the arrows represent rows of data flowing through the job. While earlier
rows are undergoing the Loading process, later rows are undergoing the Transform and
Enrich processes. In this way a number of rows (7 in the picture) are being processed
at the same time, in parallel.
Although pipeline parallelism improves performance, there are limits on its scalability.


Partition parallelism
• Divide the incoming stream of data into subsets to be separately
processed by an operation
 Subsets are called partitions
• Each partition of data is processed by copies of the same stage
 For example, if the stage is Filter, each partition will be filtered in exactly
the same way
• Facilitates near-linear scalability
 8 times faster on 8 processors
 24 times faster on 24 processors
 This assumes the data is evenly distributed


Partition parallelism
Partitioning breaks a stream of data into smaller subsets. This is a key to scalability.
However, the data needs to be evenly distributed across the partitions; otherwise, the
benefits of partitioning are reduced.
It is important to note that what is done to each partition of data is the same. How the
data is processed or transformed is the same. In effect, copies of each stage or
operator are running simultaneously, and separately, on each partition of data.
To scale up the performance, you can increase the number of partitions (assuming your
computer system has the processors to process them).


Three-node partitioning
[Diagram: the incoming data is split into three subsets (Node 1: subset1,
Node 2: subset2, Node 3: subset3), and a copy of the stage runs on each
subset in parallel]

• Here the data is split into three partitions (nodes)


• The stage is executed on each partition of data separately and in
parallel
• If the data is evenly distributed, the data will be processed three
times faster


Three-node partitioning
This diagram depicts how partition parallelism is implemented in DataStage. The data is
split into multiple data streams which are each processed separately by the same stage
or operator.


Job design versus execution

A developer designs the flow in DataStage Designer

… at runtime, this job runs in parallel for any number of partitions (nodes)


Job design versus execution


Much of the parallel processing paradigm is hidden from the designer. The designer
simply diagrams the process flow, as shown in the upper portion of this diagram. The
parallel engine, using definitions in a configuration file, will actually execute processes
that are partitioned and parallelized, as illustrated in the bottom portion.
A misleading feature of the lower diagram is that it makes it appear as if the data
remains in the same partition for the duration of the job. In fact, partitioning and
repartitioning occur on a stage-by-stage basis. There will be times when the data moves
from one partition to another.


Configuration file
• Determines the degree of parallelism (number of partitions) of jobs
that use it
• Every job runs under a configuration file
• Each DataStage project has a default configuration file
 Specified by the $APT_CONFIG_FILE environment variable
 Individual jobs can run under different configuration files than the project
default
− The same job can also run using different configuration files on different job runs


Configuration file
The configuration file determines the degree of parallelism (number of partitions) of jobs
that use it. Each job runs under a configuration file. The configuration file is specified by the
$APT_CONFIG_FILE environment variable. This environment variable can be added
to the job as a job parameter. This allows the job to use different configuration files on
different job runs.


Example: Configuration file

Node (partition)

Node (partition)

Resources attached
to the node


Example: Configuration file


Here you see a configuration file, viewed in the Designer Configurations editor. In this
example, there are two nodes (partitions). Any job running under this configuration file
will process the data in two parallel partitions.
In addition to specifying the number of partitions, the configuration file also specifies
resources used by stages and operators running in the partitions. For example, scratch
disk is disk space used for sorting when memory is exhausted.
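A minimal sketch of the text behind such a file, assuming a two-node
configuration on a single engine host named edserver (the resource paths are
placeholders; actual paths depend on the installation):

   {
     node "node1" {
       fastname "edserver"
       pools ""
       resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
       resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
     }
     node "node2" {
       fastname "edserver"
       pools ""
       resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
       resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
     }
   }

Any job run under this file processes data in two partitions. Adding another
node entry raises the degree of parallelism without changing the job design.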


Checkpoint
1. True or false: DataStage Director is used to build and compile your
ETL jobs
2. True or false: Use Designer to monitor your job during execution
3. True or false: Administrator is used to set global and project
properties


Checkpoint solutions
1. False.
DataStage Designer is used to build and compile jobs.
Use DataStage Director to run and monitor jobs, but you can do this
from DataStage Designer too.
2. True.
The job log is available both in Director and Designer. In Designer,
you can only view log messages for a job open in Designer.
3. True.


Unit summary
• List and describe the uses of DataStage
• List and describe the DataStage clients
• Describe the DataStage workflow
• Describe the two types of parallelism exhibited by DataStage parallel
jobs

Unit 2 Deployment

IBM InfoSphere DataStage v11.5

Unit objectives
• Identify the components of Information Server that need to be installed
• Describe what a deployment domain consists of
• Describe different domain deployment options
• Describe the installation process
• Start the Information Server


Unit objectives
In this unit we will take a look at how DataStage is deployed. The deployment is
somewhat complex because DataStage is now one component among many.


What gets deployed


An Information Server domain, consisting of the following:
• Metadata Server backbone, hosted by an IBM WebSphere Application
Server (WAS) instance
• One or more DataStage servers
 Can be on the same system or on separate systems
• One database manager instance containing the Repository database
(XMETA)
• Information Server clients
 Web Console
 DataStage clients
• Additional Information Server products
 Information Analyzer, Information Governance Catalog,
 QualityStage (part of DataStage), Data Click, FastTrack


What gets deployed


Here is a list of the different components that get deployed, including an
IBM WebSphere Application Server (WAS) instance, a database manager instance
containing the Information Server repository (XMETA), one or more DataStage servers,
and the various clients and the component applications.
Many of these different components can be on different computer systems.


Deployment: Everything on one machine


• All Information Server components on one system
• Additional client workstations can connect to this machine

[Diagram: client workstations connecting to a single system that hosts the
Metadata Server backbone (WAS), the DataStage server, and the XMETA
Repository]

Deployment: Everything on one machine


Information Server is available for a variety of Windows and UNIX platforms, but
platforms cannot be mixed (except for the clients).
The DataStage clients only run on Windows. If Information Server is installed on a
UNIX platform, then the DataStage clients must be running on a separate Windows
system.
Multiple DataStage servers can run on the same system or on separate systems in the
same domain. For simplicity only one DataStage server is shown.
Another complexity not shown here is that DataStage parallel jobs can in certain grid
environments be distributed over multiple systems.


Deployment: DataStage on a separate machine

• IS components on multiple systems
 DataStage servers
 Metadata server (WAS) and XMETA repository

[Diagram: clients and one or more separate DataStage server systems connecting
to a system that hosts the Metadata Server backbone (WAS) and the XMETA
Repository]

Deployment: DataStage on a separate machine


Here WAS and the repository are on the same system. The DataStage server system
or systems are separate. If multiple DataStage servers are in the domain, they can be
on the same or on separate systems.
When multiple systems are involved, the systems must be connected by a high-speed
network, so that they can communicate with each other. Agent processes run on each
of the nodes to facilitate the communication.


Metadata Server and DB2 on separate machines

• IS components all on separate systems
 DataStage Server
 Metadata Server (WAS)
 XMETA Repository

[Diagram: the clients, the DataStage server, the Metadata Server backbone
(WAS), and the XMETA Repository each on a separate system]

Metadata Server and DB2 on separate machines


Here the repository has been placed on a separate system from the WAS. This
configuration may not always perform well because of the high volume of network traffic
between the WAS and the repository database.


Information Server start-up


• Starting the Metadata Server (WAS) on Windows:
 Select the IBM WebSphere menu
 Click Start the Server from the InfoSphere profile
• Starting the Metadata Server on Unix platforms:
 Invoke the startServer.sh script in the
WebSphere/AppServer/profiles/InfoSphere/bin directory
• By default, the startup services are configured to run automatically
upon system startup
• To begin work in DataStage, double-click on a DataStage client icon,
and then log in
• To begin work in the Information Server Web Console, open a web
browser, enter the address of the services (WAS) system, and then log
in


Information Server start-up


By default, the startup services are configured to run automatically when the system
starts, but they can also be started manually. The first two bullets describe the manual
process. The XMETA Repository database must be running before you try to start
Information Server.


Starting Information Server on Windows

[Screenshot: the IBM WebSphere Application Server menu, with the Profiles
folder expanded and the Start the Server item shown for the InfoSphere
profile]

Starting Information Server on Windows


Information Server can be set up to start automatically when Windows is started.
Information Server can also be started from the Windows command line. Shown here
is the menu item used to start the Metadata Server (WAS). To access this menu, click
IBM WebSphere Application Server>Profiles>InfoSphere>Start the server.
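The command-line equivalent is a short sketch, assuming WAS is installed under
C:\IBM\WebSphere\AppServer (the path on your system may differ):

   cd C:\IBM\WebSphere\AppServer\profiles\InfoSphere\bin
   startServer.bat server1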


Starting Information Server on Linux


• Open a terminal window
• Change to the AppServer/bin directory
• Run the startServer.sh script

Change to AppServer/bin directory

Default name of
Metadata Server


Starting Information Server on Linux


This graphic shows how to manually start Information Server from the Unix command
line.
You can also check the status of the Metadata Server using the command
./serverStatus.sh server1.
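A sketch of the full sequence, assuming the WAS instance is installed under
/opt/IBM/WebSphere/AppServer (an assumed path) and the Metadata Server keeps
its default name, server1:

   cd /opt/IBM/WebSphere/AppServer/profiles/InfoSphere/bin
   ./startServer.sh server1       # start the Metadata Server
   ./serverStatus.sh server1      # confirm that it is running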


Verifying that Information Server is running


• Log into the Information Server Administration Console
 Note: This doesn’t establish that individual component applications such as
DataStage are running
• To log into the Administration Console:
 Click the Administration Console link in the Information Server Launch Pad
− To log into the Launch Pad: https://edserver:9443/ibm/iis/launchpad
• edserver: Name of the Information Server domain system
• 9443: Port address for communicating with the domain server
 In a web browser enter the address of the InfoSphere Information Server
Web Console: https://edserver:9443/ibm/iis/console/
• On the WAS system, you can check whether the Metadata Server is
running using the serverStatus.sh script
 Change to WAS bin directory and run serverStatus.sh server1
− By default, the Metadata Server is “server1”
− Log in as WAS administrator: wasadmin

Verifying that Information Server is running


From the client, an easy way to tell if Information Server is running is to open the
Information Server Administration Console. You log into the Administration Console
from a web browser using the address shown or from the Information Server Launch
Pad. The Information Server Launch Pad contains links to various Information Server
products and components including the Administration Console.
From the WAS system, you can use the serverStatus.sh script to determine whether
Information Server is running. First, change to the WAS bin directory (for example,
/opt/IBM/WebSphere/AppServer/bin on Linux, or c:\IBM\WebSphere\AppServer\bin
on a Windows server).


Web Console Login window

Information Server
Administrator ID

Log in


Web Console Login window


This graphic shows how to log into the Information Server Administration Console.
In a web browser, type the address: https://edserver:9443/ibm/iis/console/. Log in
using an Information Server administrator ID. The default administrator ID is isadmin.


Checkpoint
1. What Information Server components make up a domain?
2. Can a domain contain multiple DataStage servers?
3. Does the database manager with the repository database need to be
on the same system as the WAS application server?


Checkpoint solutions
1. Metadata Server hosted by a WAS instance. One or more
DataStage servers. One database manager (for example, DB2 or
Oracle) containing the XMETA Repository.
2. Yes.
The DataStage servers can be on separate systems or on a single
system.
3. No.
The DB2 instance with the repository can reside on a separate
machine from the WebSphere Application Server (WAS).


Demonstration 1
Log into the Information Server Administration Console


Demonstration 1:
Log into the Information Server Administration Console

Purpose:
In this demonstration you will log into the Information Server Administration
Console and verify that Information Server is running.
Windows User/Password: student/student
Server: https://edserver:9443/
Console: Administration Console
User/Password: isadmin / isadmin
Task 1. Log into the Information Server Administration
Console.
1. If prompted to login to Windows, use student/student.
2. In the Mozilla Firefox browser, type the address of the InfoSphere Information
Server Launch Pad: https://edserver:9443/ibm/iis/launchpad/.
Here, edserver is the name of the Information Server computer system and
9443 is the port number used to communicate with it.

3. Click Administration Console. Type the Information Server Administrator user
ID/password, isadmin/isadmin.


4. Click Login.

Note: If the login window does not show up, this is probably because
Information Server (DataStage) has not started up. It can take over 5 minutes to
start up.
If it has not started up, examine Windows services. There is a shortcut on the
desktop. Verify that DB2 - DB2Copy has started. If not, select it and then click
Start. Then select IBM WebSphere Application Server and then click Restart.
DB2 typically starts up automatically, but if it does not, Information Server
(DataStage) will not start.
Results:
In this demonstration you logged into the Information Server Administration
Console and verified that Information Server is running.


Unit summary
• Identify the components of Information Server that need to be installed
• Describe what a deployment domain consists of
• Describe different domain deployment options
• Describe the installation process
• Start the Information Server

Unit 3 DataStage Administration

IBM InfoSphere DataStage v11.5

Unit objectives
• Open the Information Server Web console
• Create new users and groups
• Assign Suite roles and Component roles to users and groups
• Give users DataStage credentials
• Log into DataStage Administrator
• Add a DataStage user on the Permissions tab and specify the user’s
role
• Specify DataStage global and project defaults
• List and describe important environment variables


Unit objectives
This unit goes into detail about the Administrator client.


Information Server Web Console - Administration


• Used for administering Information Server
 Domain management
 Session management
 Users and groups
 Log management
 Schedule management
• Our focus is on users and groups
 How DataStage user IDs are created
• We will also look at domain management
 DataStage credentials


Information Server Web Console - Administration


There are many administrative functions that can be performed on the Administration
tab of the Information Server Administration Console. However, our focus in this course
is on the management of DataStage users and groups and what is referred to as
domain management.
In practice you will probably not be creating Information Server user IDs. However, it is
important that you have some understanding of how this is done, so that you can
function effectively as a DataStage developer.


Web Console Login window

Administration
console address

Information
Server
administrator ID

Log in


Web Console Login window


To open the Administration Console, enter the web console address in an internet
browser, either Internet Explorer or Mozilla Firefox.
The console address is of the form: https://machine:nnnn/ibm/iis/console/
Here machine is the host name or IP address of the machine running the application
server that hosts Metadata Server.
nnnn is the port address of the console. By default, it is 9443.
The Information Server administrator ID and password is specified during installation.
The default is isadmin. After installation, new administrator IDs can be specified.
You can also log into the Web Console with an Information Server non-administrator
user role. However, the user role is limited. An administrator role is required for creating
user IDs.

© Copyright IBM Corp. 2005, 2015 3-5


Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Unit 3 DataStage Administration

User and group management


• Authorizations can be provided to either users or groups
 Users that are members of a group acquire the authorizations of the group
• Authorizations are provided in the form of roles
 Two types of roles
− Suite roles: Apply to the suite
− Suite component roles: Apply to a specific product or component of Information Server, for
example, DataStage
• Two levels within each type of role
 Administrator: Full authorizations
 User: Limited authorizations
• DataStage roles
 Administrator: Full authorizations
− Full authorizations within Administrator client
− Full developer and operator authorizations within Designer and Director
 User: Limited set of authorizations
− Permissions are specified in the DataStage Administrator client by a DataStage
administrator

User and group management


There are two DataStage roles that can be set in the Information Server Web Console:
administrator and user. If the user ID is assigned the DataStage administrator role, then
the user will immediately acquire the DataStage administrator role for all projects.
If the user ID is assigned the DataStage user role, the specific permissions the user
has in DataStage are specified in the DataStage Administrator client by a DataStage
administrator.


Creating a DataStage User ID

[Screenshot: the Administration tab of the Web Console, with the Users folder
and the New User link highlighted]

Create a DataStage User ID


This graphic shows the Administration tab of the Information Server Web Console.
The Users and Groups folder has been expanded.
The process of creating a new group is similar to creating a new user. Users assigned
to a group inherit the authorizations assigned to the group.
To create a user ID, expand the Users and Groups folder, and then click Users. Then
click New User.
Shown in the graphic are the list of users already created, including an Information
Server administrator (isadmin) and a WAS administrator (wasadmin).


Assigning DataStage roles

[Screenshot: the New User window, highlighting the user ID field, the Suite
User suite role, and the DataStage Administrator component role]

Assign DataStage roles


In this graphic, the user dsadmin is given the Suite User role and the DataStage
Administrator role. Users of any Information Server application must be given the
Suite User role.
Required fields include the user ID and password and the user name. Other user
information is optional.


DataStage credentials
• DataStage credentials for a user ID
 Required by DataStage
 Required in addition to Information Server authorizations
• DataStage credentials are given to a user ID (for example, dsadmin)
by mapping the user ID to an operating system user ID on the
DataStage server system
• Specified in the Domain Management>Engine Credentials folder
 Default or individual mappings can be specified


DataStage credentials
To log into a DataStage client, in addition to having a DataStage user ID, you also need
DataStage credentials. The reason for this has to do with the DataStage legacy.
Originally, DataStage was a stand-alone product that required a DataStage server
operating system user ID. Although DataStage is now part of the Information Server
suite of products, and uses the Information Server registry, it still has this legacy
requirement. This requirement is implemented by mapping DataStage user IDs to
DataStage server operating system IDs.
This assumes that when DataStage was installed, the style of user registry selected for
the installation was Internal User Registry. Other options are possible.


DataStage Credentials Default Mapping

Operating system
user ID on the
DataStage Server


DataStage Credentials Default Mapping


On the Engine Credentials tab, select the DataStage server. Then click Open
Configuration. In the text boxes specify an operating system user ID and password on
the DataStage Server system.
You can also map individual Information Server user IDs to specific DataStage
Server user IDs. Select the DataStage Server, and then click Open User Credentials.
Individual mappings provide better accountability.
Note that dsadm in this example need not be a suite administrator or user. It is an
operating system user ID that DataStage user IDs are mapped to.


Logging onto DataStage Administrator

[Screenshot: the DataStage Administrator login window, highlighting the host
name of the services (WAS) system, the DataStage administrator ID and
password, and the name of the DataStage server system]

Logging onto DataStage Administrator


This graphic shows the DataStage Administrator login window. Select the host name
(here EDSERVER), user name and password, and select the host name of the system
running DataStage (here EDSERVER).
Recall that multiple DataStage servers can exist in a domain. Here you select the
DataStage server that you want to administer.
You can log in as either a DataStage administrator or user. The user role has some
limitations.


DataStage Administrator Projects Tab


Click to specify
project properties

Link to Information
Server Web console

DataStage Administrator Projects tab


This graphic shows the Administrator Projects tab. Select the project you want to
configure and then click Properties.
When you first log in you are placed in the General tab.
Notice also that you can add and delete projects from this tab. The
ANALYZERPROJECT project shown in the projects list is a special project created for
Information Analyzer, which is another product in the Information Server suite. This
project and dstage1 were created during Information Server installation. DSProject
was created after Information Server installation by clicking the Add button on this tab.
Notice the link in the lower-right corner. You can use this link to open the Information
Server Administration Console.


DataStage Administrator General tab

Environment variable
settings


DataStage Administrator General tab


This graphic shows the General tab of Administrator. This is where you get access to
the environment variables for the project. Click the Environment button to display and
edit environment variables settings.
The following pages discuss some of the main environment variables.


Environment variables

[Screenshot: the Environment Variables window, showing the parallel job
variables and the configuration file path set in $APT_CONFIG_FILE]

Environment variables
This graphic shows the Parallel folder in the Environment Variables window.
Click the Environment button on the General tab to open this window. The variables
listed in the Parallel folder apply to parallel jobs.
In particular, notice the $APT_CONFIG_FILE environment variable. This specifies the
path to the default configuration file for the project. Any parallel job in the project will, by
default, run under this configuration file.
You can also specify your own environment variables in the User Defined folder.
These variables can be passed to jobs through their job parameters to provide project
level job defaults.
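These project-level defaults can also be scripted. A sketch using the dsadmin
command-line tool that ships with the engine; treat the exact option names as
an assumption to verify against your release's documentation:

   dsadmin -domain edserver:9443 -user dsadmin -password dsadmin
      -server EDSERVER -envset APT_CONFIG_FILE
      -value /opt/IBM/InformationServer/Server/Configurations/default.apt
      DSProject

This sets the project default configuration file for DSProject without opening
the Administrator client.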


Environment reporting variables

Display Score

Display OSH


Environment reporting variables


This graphic shows the Reporting folder of environment variables. These are variables
that determine how much information is displayed in the job log. Information includes
startup processes, performance statistics, debugging information, and the like.
The Score and OSH environment variables are highlighted. These variables provide
very useful information for debugging DataStage parallel jobs.


DataStage Administrator Permissions tab

Assigned DataStage roles

Add DataStage users



DataStage Administrator Permissions tab


This graphic shows the Permissions tab.
Listed are suite users and groups that have either a DataStage user or administrator
role.
When suite users or groups that have a DataStage administrator role are added, they
automatically are displayed here and assigned the role of DataStage Administrator.
Suite users or groups that have a DataStage user role need to be manually added. To
accomplish this, click the Add User or Group button. Then you select the DataStage
user role (Operator, Super Operator, Developer, Production Manager) that this user
ID is to have.


Adding users and groups

[Screenshot: the Add Users and Groups window, listing the available users and
groups with a DataStage User role and the button to add them]

Adding users and groups


Click the Add User or Group button to open this window. On the left are Information
Server users and groups that have been assigned a DataStage user role in the
Information Server Web Console. Select the users to be added and then click OK.


Specify DataStage role

Added DataStage
user

Select DataStage role


Specify DataStage role


Once a user or group has been added, you can specify the user’s role within this
DataStage project.
There are four user roles that can be assigned to a DataStage user:
DataStage Developer, who has full access to all areas of the DataStage project.
DataStage Operator, who can run and monitor DataStage jobs in the Director client.
DataStage Super Operator, who can open Designer and view the parallel jobs and
other DataStage objects in read-only mode.
DataStage Production Manager, who can create and manipulate protected projects. A
protected project is a project that stores the DataStage jobs that have been released for
production.


DataStage Administrator Logs tab

[Screenshot: the Logs tab, highlighting auto-purge of the Director job log]

DataStage Administrator Logs tab


This graphic shows the Logs tab. Here you can set defaults regarding the DataStage
job log.
The Auto-purge option is highlighted. The job log can fill up quickly. If this box is
checked, DataStage will automatically purge the log after either a certain number of job
runs (here two) or a certain number of days.


DataStage Administrator Parallel tab

Display OSH

Column type defaults


DataStage Administrator Parallel tab


This graphic shows the Parallel tab. Use this tab to specify parallel job defaults,
including the project default formats for dates and times.
Here you can also choose to make the OSH visible in DataStage projects. Be aware
that this setting applies to all DataStage projects, not just the one opened in
Administrator. Typically, this setting is enabled. When you click the Compile button in
DataStage Designer, the GUI job diagram is compiled into an OSH script that can be
executed by the parallel engine. Viewing the OSH can sometimes provide useful
information about how your jobs work, because it provides a lower-level view of the job.


Checkpoint
1. Authorizations can be assigned to what two items?
2. What two types of authorization roles can be assigned to a user or
group?
3. In addition to Suite authorization to log into DataStage, what else
does a DataStage developer require to work in DataStage?
4. Suppose that dsuser has been assigned the DataStage User role in
the IS Web Console. What permission role in DataStage
Administrator does dsuser need to build jobs in DataStage?


Checkpoint solutions
1. Users and groups.
Members of a group acquire the authorizations of the group.
2. Suite roles and suite component roles.
3. DataStage credentials.
4. DataStage Developer.


Demonstration 1
Administering DataStage


Demonstration 1:
Administering DataStage

Purpose:
You will create DataStage user IDs in the InfoSphere Web Console. Then you
will log into DataStage Administrator and configure your DataStage
environment.
Windows User/Password: student/student
Information Server Launch Pad: https://edserver:9443/ibm/iis/launchpad/
Console: Administration Console
User/Password: isadmin / isadmin
Task 1. Create a DataStage administrator and user.
1. From the Information Server Launch Pad, log into the Information Server
Administration Console as isadmin/isadmin.
2. In the Information Server Administration Console, click the
Administration tab.
3. Expand Users and Groups, and then click Users.
You should see at least two users: isadmin is the Information Server
administrator ID; wasadmin is the WebSphere Application Server administrator
ID. These users are created during Information Server installation.
4. Select the checkbox for the isadmin user, and then in the right pane, click
Open User.
Note the first and last names of this user.


5. Expand Suite and Suite Component, if not already expanded.


Note what Suite roles and Suite Component roles have been assigned to this
user. Scroll to view more roles.

6. In the left pane, click Users to return to the Users main window.
7. In the right pane, click New User.


8. Create a new user ID named dsadmin, with the following:


Password: dsadmin
First name: dsadmin
Last Name: dsadmin
Suite Role: Suite User
Suite Component Role: DataStage and QualityStage Administrator

9. Scroll down to the bottom of the window, and then click Save and Close.
Note: If prompted to save the password, click "Never Remember Password For
This Site."
10. Following the same procedure, create an additional user named dsuser, with
the following:
Password: dsuser
First name: dsuser
Last Name: dsuser
Suite Role: Suite User
Suite Component Role: DataStage and QualityStage User
11. Scroll down, and then click Save and Close.


12. Verify that dsuser and dsadmin have been created.

13. Click File > Exit to close the InfoSphere Administration Console.
Task 2. Log into DataStage Administrator.
1. Double-click the Administrator Client icon on the Windows desktop.


2. Select the host name and port number edserver:9443, in the User name and
Password boxes type dsadmin/dsadmin, and then select EDSERVER as your
Information Server engine.

3. Click Login.
Task 3. Specify property values in DataStage Administrator.
1. Click the Projects tab, select your project - DSProject - and then click the
Properties button.


2. On the General tab, select Enable Runtime Column Propagation for Parallel
jobs (do not select the new links option).

3. Click the Environment button to open up the Environment Variables window.


4. Under Categories, with Parallel expanded, click Parallel to select it. Examine
the APT_CONFIG_FILE parameter and its default. The configuration file is
discussed in a later unit.

5. Click Reporting to select it, and then ensure that the APT_DUMP_SCORE,
APT_STARTUP_STATUS, and OSH_DUMP variables are set to True.
Tip: you may need to resize the Environment Variables window, and the Name
column under the Details pane, to view the variable names.

6. Click OK.
7. On the Parallel tab, enable the option to make the generated OSH visible.
Note the default date and time formats. For example, the default date format is
“YYYY-MM-DD”, which is expressed by the format string shown.

8. On the Sequence tab, select all options that are available.

Task 4. Set DataStage permissions and defaults.


1. Click the Permissions tab.
Notice that isadmin and dsadmin (among others) already exist as DataStage
Administrators. This is because they were assigned the DataStage and QualityStage
Administrator Suite Component role in the Information Server Administration Console.
DataStage Administrators have full developer and administrator permissions in
all DataStage projects.
On the other hand, dsuser does not receive permission to develop within a
specified DataStage project unless a DataStage Administrator explicitly grants
that permission. So you do not see dsuser here.

2. Click Add User or Group.


Notice that dsuser is available to be added.

3. Click dsuser to select it, and then click Add.

4. Click OK to return to the Permissions tab. Select dsuser. In the User Role
drop-down, select the DataStage and QualityStage Developer role.

5. Click OK, and then click Close to close DataStage Administrator.


6. Relaunch Administrator Client, and log in as dsuser/dsuser.
7. Select your project, and then click Properties.
Notice that the Permissions tab is disabled. This is because dsuser has not
been assigned the DataStage Administrator role and therefore does not have
the authority to set DataStage permissions.

8. Click the Logs tab, ensure Auto-purge of job log is selected, and then set the
Auto-purge action to up to 2 previous job runs.

9. Click OK, and then close Administrator Client.


Results:
You created DataStage user IDs in the Information Server Administration
Console. Then you logged into DataStage Administrator and configured your
DataStage environment.

Unit summary
• Open the Information Server Web console
• Create new users and groups
• Assign Suite roles and Component roles to users and groups
• Give users DataStage credentials
• Log into DataStage Administrator
• Add a DataStage user on the Permissions tab and specify the user’s
role
• Specify DataStage global and project defaults
• List and describe important environment variables

Work with metadata

IBM Infosphere DataStage v11.5

Unit objectives
• Login to DataStage
• Navigate around DataStage Designer
• Import and export DataStage objects to a file
• Import a table definition for a sequential file


Login to Designer
• A domain may contain multiple DataStage Servers
• Qualify the project (DSProject) by the name of the DataStage Server (EDSERVER)



Login to Designer
This graphic shows the Designer Attach to Project window, which you use to log into
DataStage Designer. The process is similar to logging onto Administrator, but here you
select a specific project on a particular DataStage server.
In this example, the project is named DSProject. Notice that the project name is
qualified by the name of the DataStage server system on which the project exists.
This qualifier is required because multiple DataStage server systems can exist in an
Information Server domain.

Designer work area




Designer work area


This graphic shows the Designer window. The major elements are highlighted.
There are four major areas shown here. Exactly how these areas are configured is
customizable, but this is close to the standard default layout. At the top left corner is the
Repository window. This stores the DataStage jobs and other objects that you create.
One of these DataStage jobs is opened and displayed in the canvas at the top right
corner.
When a job is open, the Palette window at the middle left side contains the stages that
can be dragged onto the canvas.
At the bottom is the job log for the job currently open and displayed. This window is
optionally displayed. Click View>Job Log to toggle open this window. It is convenient
to have this window open, so you do not have to log into Director to view the job log
messages.


Repository window



Repository window
The Repository window displays the folders of objects stored in the repository for the
DataStage project you are logged into.
The project repository contains a standard set of folders where objects are stored by
default. These include the Jobs folder, where DataStage jobs are saved by default.
However, new folders can be created at any level to store jobs and other objects, and
any object can be saved into any folder.
In this example, there is a user-created folder named _Training. In this folder there are
sub-folders (not shown) for storing jobs and the table definitions associated with the
jobs.


Import and export


• Any object or set of objects in the Repository window can be exported
to a file
• Can export whole projects
• Uses:
 Use for backup
 Sometimes used for version control
 Move DataStage objects from one project to another
 Share DataStage jobs and projects with other developers
• How environment variables are handled in an export
 Environment variables included in jobs or parameter sets will be created in
the new project they are imported into if they do not already exist
− Their default values are set to the empty string


Import and export


Any set of project repository objects, including whole projects, can be exported to a
file. This export file can then be imported back into a DataStage project, either the
same or a different project.
Import and export can be used for many purposes, including:
• Backing up jobs and projects.
• Maintaining different versions of a job or project.
• Moving DataStage objects from one project to another. Just export the objects,
move to the other project, then re-import them into the new project.
• Sharing jobs and projects with other developers. The export files, when zipped,
are small and can be easily emailed from one developer to another.


Export procedure
• Click Export > DataStage Components
• Add DataStage objects for export
• Specify type of export:
 DSX: Default format
 XML: Enables processing of export file by XML applications, for example,
for generating reports
• Specify file path on client system
• Can also right-click over selected objects in the Repository to do an export


Export procedure
Click Export > DataStage Components to begin the export process.
Select the types of components to export. You can select either the whole project or
select a sub-set of the objects in the project.
Specify the name and path of the file to export to. By default, objects are exported to a
text file in a special format, with the extension dsx. Alternatively, you can export the
objects to an XML document.
The directory you export to is on the DataStage client, not the server.
Objects can also be exported from a list of objects returned by a search. This procedure
is discussed later in the course.
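For scripted backups and automated migration, the istool command line that ships with
Information Server can also export DataStage assets to an archive. The following
invocation is a sketch only, reusing this course's domain, project, and folder names;
confirm the exact option syntax, especially the asset path passed to -datastage, against
the istool documentation for your installation:

    istool export -domain edserver:9443 -username dsadmin -password dsadmin
        -archive "C:\CourseData\Training.isx"
        -datastage "EDSERVER/DSProject/_Training/*/*.*"

Note that istool writes .isx archives rather than .dsx files; a matching istool import
command loads the archive into another project.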


Export window



Export window
This graphic shows the Repository Export window.
Click Add to browse the repository for objects to export. Specify a path on your client
system. Click Export.
By default, the export type is dsx. For most purposes, use this format.


Import procedure
• Click Import > DataStage Components
 Or Import > DataStage Components (XML) if you are importing an
XML-format export file
• Select DataStage objects for import


Import procedure
A previously created export (dsx) file can be imported back into a DataStage project.
To import DataStage components, click Import>DataStage Components.
Select the file to import. Click Import all to begin the import process, or click Import
selected to view a list of the objects in the import file. You can import selected objects
from the list. Select the Overwrite without query button to overwrite objects with the
same name without warning.


Import options


Import options
This graphic shows the Repository Import window. Browse for the file in the Import
from file box. Select whether you want to import all the objects or whether you want to
display a list of the objects in the import file.
For large imports, you may want to disable Perform impact analysis. This adds
overhead to the import process.


Source and target metadata


• Metadata, “data about data”, describes the format of data, whether source data
or target data
• In order to read rows of data from a data source, DataStage needs to be given
metadata that describes the data it is to read
• DataStage stores metadata as “table definitions”
• Table definitions can be loaded into job stages
• You can import table definitions for:
 Sequential files
 Relational tables
 COBOL files
 Many other types of data sources


Source and target metadata


Table definitions define the formats of a variety of data files and tables. These
definitions can then be used and reused in your jobs for reading from and writing to
these files and tables.
For example, you can import the format and column definitions of the Customers.txt
file. You can then load this into a Sequential File source stage of a job that extracts data
from the Customers.txt file.
You can load this same metadata into other stages that access data with the same
format. In this sense the metadata is reusable. It can be used to access any file or table
with the same format.
If the column definitions are similar to what you need, you can modify the definitions
and save the table definition under a new name.
You can import and define many different kinds of table definitions including table
definitions for sequential files and for relational tables.
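To make this concrete, the sketch below shows the kind of information a table definition
captures for a comma-delimited sequential file. The notation is the parallel engine's
record schema syntax, and the column names and types are invented for illustration:

    record {final_delim=end, delim=',', quote=double}
    (
        CustID: int32;
        CustName: string[max=30];
        OrderDate: date;
    )

You normally never type this by hand. The import wizard builds the equivalent definition
by examining the file, and you then refine the column names and types on the Columns tab.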


Sequential file import procedure


• Click Import > Table Definitions > Sequential File Definitions
• Select directory containing sequential file
 The files are displayed in the Files window
• Select the sequential file
• Select a Repository folder in which to store the table definition
• Examine the format and column specifications and edit as necessary


Sequential file import procedure


To start the import, click Import>Table Definitions>Sequential File Definitions. The
Import Meta Data (Sequential) window is displayed. Then select the directory
containing the sequential files. The Files box is then populated with the files you can
import.
Select the file to import. Then select or specify the repository folder to store the table
definition into.
DataStage guesses the types of the columns in the sequential file by reading rows of
data from the file. You know your data better than DataStage. You can and should edit
the column types and lengths as required to properly handle the data.


Import sequential metadata



Import sequential metadata


This graphic shows the menu selection for importing table definitions for sequential files.
Click Import>Table Definitions and then select Sequential File Definitions.
Notice from the menu list that there are many different types of imports that can be
performed.


Sequential import window




Sequential import window


This graphic shows the sequential file Import Metadata window.
Select the directory on the DataStage server system that contains the sequential file
you want to import. The files in the directory are displayed in the Files window. In the
To folder box, select a folder in the repository in which to store the imported table
definition, and then click Import.


Specify format


Specify format
This graphic shows the Format tab of the Define Sequential Metadata window.
On the Format tab, specify the format including, in particular, the column delimiter, and
whether the first row contains column names. Click Preview to display the data using
the specified format. If everything looks good, click the Define tab to specify the column
definitions.


Edit column names and types



Edit column names and types


This graphic shows the Define tab of the Define Sequential Metadata window.
The column names displayed come from the first row of column names, if it exists. If
there is not a first row of column names, then default column names are used. Edit the
names and types of the columns as required. DataStage is guessing their types based
on its examination of rows of data in the file. DataStage can sometimes be wrong about
the types.
You can also specify additional extended properties for any columns. Double-click on
the number to the left of the column name to open up a window in which you specify
these extended properties.


Extended properties window



Extended properties window


This graphic shows the extended properties window. On the Parallel tab, there are
several folders of properties that can be added. Select the folder and select the specific
property. Then specify the value the property is to have in the text box that is enabled.
The standard properties are displayed at the top half of the window. You can change
any of these properties here as well as on the Define tab.


Table definition in the repository



Table definition in the repository


After the table definition has been imported, it is stored in the folder you specified during
the import. This graphic shows the table definition after it has been opened in the
Repository window for viewing.
To view the table definition, in the Repository window, select the folder that contains
the table definition. Double-click the table definition to open the Table Definition
window.
Click the Columns tab to view and modify any column definitions. Select the Format
tab to edit the file format specification. Select the Parallel tab to specify parallel format
properties.


Checkpoint
1. True or false? The directory to which you export is on the DataStage
client machine, not on the DataStage server machine.
2. Can you import table definitions for sequential files with fixed-length
record formats?


Checkpoint solutions
1. True.
2. Yes.
Record lengths are determined by the lengths of the individual
columns.


Demonstration 1
Import and export DataStage objects


Demonstration 1:
Import and export DataStage objects

Purpose:
You will use DataStage Designer to import and export DataStage objects. As
part of this demonstration, you will create Repository folders and import
DataStage object files. Finally, you will export a folder.
Windows User/Password: student/student
DataStage Client: Designer
Designer Client User/Password: student/student
Project: EDSERVER/DSProject
Task 1. Log into DataStage Designer.
1. Open Designer Client via the icon on the Windows desktop.

2. Log in to your DataStage project with:


• Host name of the services tier and port number: edserver:9443
• User name: student
• Password: student
• Project: EDSERVER/DSProject
Task 2. Create Repository folders.
1. Click Cancel to close the New window.
2. In the left pane, below Repository, select your project, DSProject.
3. Right-click DSProject, and then click New > Folder.
4. Create a folder named _Training, and under it, create two sub-folders:
Jobs and Metadata.


5. From the Repository menu, click Refresh.
This moves the folder(s) you created to the top of the view.

Task 3. Import DataStage object files.


1. From the Import menu, click DataStage Components.
2. In Import from file, browse to C:\CourseData\DSEss_Files\dsxfiles, select
the file TableDefs.dsx, and then click Open.
Tip: Start browsing by clicking Computer in the left pane.
3. Confirm Import selected is selected.


4. Click OK.

5. Click to select Table Definitions, and then click OK.


6. Double-click the table definition you just imported. You will find it under the
_Training > Metadata folder. It is named Employees.txt.
Tip: if double-clicking does not work, right-click and select Properties.
7. Click the Columns tab.
Note the column definitions and their types.

8. Click OK to close the Table Definition window.


Task 4. Export a folder.
In this task, you export your _Training folder into a file named Training.dsx.
1. Right-click _Training, and then click Export.
2. In the Export to file box, set the folder path (by using the browse option) to
C:\CourseData\DSEss_Files\dsxfiles, and add the file name Training.dsx by
typing it into File name.


3. Click Open.
The _Training folder, including the Employees.txt table definition, can now be
exported based on your settings.

4. Click Export, click OK to the message, and then click Close.


Results:
You used DataStage Designer to import and export DataStage objects. As part
of this demonstration, you created Repository folders and imported DataStage
object files. Finally, you exported a folder.


Demonstration 2
Import a table definition


Demonstration 2:
Import a table definition

Purpose:
You want to load a table definition into a Sequential File stage so that the
file can be read. You will first import a table definition for a sequential
file and then view the table definition stored in the Repository.

Task 1. Import a table definition from a sequential file.


1. In a text editor, such as WordPad, open up the Selling_Group_Mapping.txt
file found in your C:\CourseData\DSEss_Files directory, and examine its
format and contents. Some questions to consider:
• Is the first row a row of column names?
• Are the columns delimited or fixed-width?
• If the columns are delimited, what is the delimiter?
• How many columns? What types are they?
2. In Designer, from the Import menu, click
Table Definitions > Sequential File Definitions.
3. In the Directory box, browse to CourseData > DSEss_Files directory.
Note that the files in that directory will not show up in the selection window
because you are just selecting the directory that contains the files.
4. Click OK.
The files in the DSEss_Files directory are displayed in the Files panel.
5. In the Files box, select Selling_Group_Mapping.txt.


6. In the To folder box, select _Training\Metadata, and then click OK.

7. Click Import.
You specify the general format on the Format tab.
8. Specify that the first line is column names, if this is the case.
DataStage can use these names in the column definitions.


9. Click Preview to view the data in your file, in the specified format.
If you change the delimiter, clicking Preview shows the change in the Data
Preview window. This is a method to confirm whether you have defined the
format correctly. If it looks like a mess, you have not correctly specified the
format. In the current case, everything looks fine.
10. Click the Define tab to examine the column definitions.

11. Click OK to import your table definition, and then click Close.


12. After closing the Import Meta Data window, locate and then open your new
table definition in the Repository window. It is located in the folder you specified
in the To folder box during the import, namely, _Training\Metadata.

NOTE: If the table definition is not in _Training\Metadata in Designer, look for it
in the Table Definitions folder, where table definitions go by default. You may
move the table definition from there to _Training\Metadata by drag and drop.
13. Click on the Columns tab to examine the imported column definitions.


14. Click on the Format tab to examine the format specification.


Notice the delimiter, and that First line is column names is selected.

15. Click OK to close the Table Definition window.


Results:
You imported a table definition for a sequential file and then viewed the table
definition, including its column definitions and format, in the Repository.


Unit summary
• Login to DataStage
• Navigate around DataStage Designer
• Import and export DataStage objects to a file
• Import a table definition for a sequential file

Create parallel jobs

IBM Infosphere DataStage v11.5

Unit objectives
• Design a parallel job in DataStage Designer
• Define a job parameter
• Use the Row Generator, Peek, and Annotation stages in the job
• Compile the job
• Run the job
• Monitor the job log
• Create a parameter set and use it in a job


What is a parallel job?


• Executable DataStage program
• Created in DataStage Designer
 Built using DataStage components, primarily stages and links
• Built using a graphical user interface
• Compiles into a scripting language called OSH
• Run using the DataStage parallel engine


What is a parallel job?


A job is an executable DataStage program. DataStage jobs are designed and built in
Designer. They are then compiled and executed under the control of DataStage. When
they are compiled, the GUI design is converted into what is called an OSH script. In the
OSH, for instance, stages are converted into operators and links are converted into
input and output data sets.
The OSH is executable code that can be run by the DataStage parallel engine.
Recall that you can view the OSH if you enable this for the project in DataStage
Administrator.
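As a rough sketch only (the OSH that DataStage actually generates is more verbose and
varies by version), a job that feeds a Row Generator stage into a Peek stage compiles
into something of this general shape, with each stage becoming an operator and the link
becoming a virtual data set:

    # simplified sketch of generated OSH
    generator
        -schema record ( id: int32; name: string[10]; )
        -records 10
        0> 'GenToPeek.v'
    ;
    peek
        -nrecs 10
        0< 'GenToPeek.v'
    ;

You never write this script by hand for parallel jobs, but being able to read the
generated OSH can help when diagnosing runtime problems.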


Job development overview


• Import metadata defining sources and targets
 Done within Designer using import process
• In Designer, add stages defining data extractions and loads
• Add processing stages to define data transformations
• Add links defining the flow of data from one stage to another
• Click the compile button
 Creates an executable (OSH) version of the job
• Run and monitor the job log
 Job log can be viewed in Designer or Director
 Can run the job in Designer or Director


Job development overview


In a previous unit, you learned how to import a table definition for a sequential file. In
this unit, you will learn how to load this table definition into a stage in a DataStage job.
The job we build here will be a relatively simple job, but it will enable us to see the
whole development process: design, build, compile, run, and monitor.


Tools Palette



Tools Palette
This graphic shows the Designer Palette. The Palette contains the stages you can add
to your job design by dragging them over to the job canvas.
There are several categories of stages. At first you may have some difficulty knowing
where a stage is. Most of the stages you will use will be in the Database folder, the File
folder, and the Processing folder. A small collection of special-purpose stages,
including the Row Generator stage which we will use in our example job, are in the
Development/Debug folder.


Add stages and links


• Drag stages from the Palette to the diagram
 Can also be dragged from Stage Type branch in the Repository window to
the diagram
• Draw links from source to target stage
 Hold down the right mouse button over the source stage
 Drag to the target stage and release the mouse button


Add stages and links


To build your job, drag stages from the Palette to the diagram. Then draw links from
source to target stages on the canvas. To draw the link, hold down your right mouse
button over the source stage. Drag the link across to the target stage and release the
mouse button.


Job creation example sequence


• Brief walkthrough of procedure
• Assumes table definition of source already exists in the Repository
• The job in this simple example:
 Generates its own data using the Row Generator stage
− The Row Generator stage is one type of Source stage
− Other source stages, which we will look at later, include the Sequential File stage
and the DB2 stage
 Writes its output data to the job log using the Peek stage
− The Peek stage is one type of target stage
− Other target stages, which we will look at later, include the Sequential File stage
and the DB2 stage


Job creation example sequence


The example illustrated in the following pages will give you a brief walkthrough of the
job development workflow. For this example, we will assume that a table definition
already exists.
Our example job consists of a Row Generator stage and a Peek stage. The former
generates rows of data based on the table definition loaded into it. The Peek stage
writes messages to the job log.


Create a new parallel job



Create a new parallel job


This graphic shows how to open a new canvas for a parallel job. Click the New button
in the toolbar to open the New window. Click on the Parallel Job icon to create a new
parallel job (the focus of this course).
As mentioned earlier, there are several different types of jobs that can be created in
DataStage. Each type has its own special set of stages. Be sure you see the word
Parallel in the top left corner of the canvas, so you can verify that you are working with
the correct set of stages.


Drag stages and links from the Palette



Drag stages and links from the Palette


This graphic shows the job after the stages have been dragged to the canvas and
linked.
The Job Properties icon is highlighted because this is where job parameters are
created. The Compile and Run buttons are also highlighted.


Rename links and stages


• Click on a stage or link to rename it
• Meaningful names have many benefits
 Documentation
 Clarity
 Fewer development errors


Rename links and stages


This graphic illustrates how to rename links and stages. If you click on a stage and
start typing, a text box, in which you can write the name, is enabled.
One of the major benefits of DataStage is that DataStage jobs are in a sense “self-
documenting”. The GUI layout of the job documents the data flow of the job. You
will, however, only get this benefit if you give meaningful names to your links and
stages, and add additional Annotation stages where needed.

© Copyright IBM Corp. 2005, 2015 5-11


Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Unit 5 Create parallel jobs

Row Generator stage


• Produces mock data for specified columns
• No input link; single output link
• On Properties tab, specify number of rows
• On Columns tab, load or specify column definitions
 Open Extended Properties window to specify the algorithms used to
generate the data
 The algorithms available depend on the column data type
• Algorithms for Integer type
 Random: seed, limit
 Cycle: Initial value, increment
• Algorithms for string type: Cycle, alphabet
• Algorithms for date type: Random, cycle


Row Generator stage


In our example job, the Row Generator stage produces the source data. Later jobs
in this course will read the data from files and tables.
The Row Generator stage is in the Development/Debug folder because it is often
used during development to create test data for a new job.
Most of the stages have a similar look and feel. Typically, there is a Properties tab
that contains a list of properties specific to the stage type. You specify values for
these properties to configure how the stage is to behave in the job.
There is also typically a Columns tab which lists the columns of the data that will
flow through the stage. A table definition can be loaded into the stage to create
these columns.
In a previous unit, you learned about extended properties. For the Row Generator
stage, extended properties are used to specify how the data is to be generated for
each of the columns. Based on the column type, there are different algorithms that
you can choose from.
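As an illustration of how these extended properties drive the data generation, the
same algorithms can be written as options attached to the columns in the generator's
schema. The fragment below is a sketch with invented column names; the option syntax
follows the parallel engine's generator operator, so treat the details as indicative
rather than exact:

    record
    (
        id: int32 {cycle = {init = 1, incr = 1}};
        code: string[1] {alphabet = 'ABC'};
        amount: int32 {random = {limit = 100, seed = 42}};
    )

Here id counts 1, 2, 3, and so on, code cycles through the letters A, B, C, and amount
takes pseudo-random values up to the limit. In Designer you set the same options through
the Extended Properties window rather than typing them.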


Inside the Row Generator stage



Inside the Row Generator stage


This graphic shows the Properties tab in the Row Generator stage.
To specify a value for a property, select the property. Then use the text box on the right
side to manually specify or select the value for the property.
The properties are divided into folders. In this simple stage, there is only one folder with
only one property. If you select a folder, additional properties you can add show up in
the Available properties to add window at the lower right corner of the stage. (In the
graphic, this area is dulled-out.)


Row Generator Columns tab



Row Generator Columns tab


The top graphic shows the Row Generator's Columns tab. You can see the columns
that have been loaded from the table definition shown at the lower left.
Once loaded, the column definitions can be changed. Alternatively, these column
definitions can be entered and edited manually.
The data that gets generated from the stage will correspond to these columns.


Extended properties



Extended properties
This graphic shows the Extended Properties window.
In this example, the Generator folder was selected and then the Type property was
added from the Available properties to add window at the lower right corner. The
cycle value was selected for the Type property. Then the Type property was selected
and the Initial Value and Increment properties were added.
The cycle algorithm generates values by cycling through a list of values beginning with
the specified initial value.


Peek stage
• Displays field values
 By default, written to the job log
 Can control number of records to be displayed
 Can specify the columns to be displayed
• Useful stage for checking the data at a particular stage in the job
 For example, put one Peek stage before a Transformer stage and one
Peek stage after it
− Gives a before / after picture of the data


Peek stage
The generated data is then written to the Peek stage. By default, the Peek stage
displays column values in the job log, rather than writing them to a file. After the job is
run, the Peek messages can be viewed in the job log. In this example, the rows
generated by the Row Generator stage will be written to the log.


Peek stage properties



Peek stage properties


This graphic shows the Properties tab of the Peek stage. Typically, the default values
selected for the properties do not require editing.
By default, the Peek stage writes to the log. You can also output from the Peek stage to
a file.


Job parameters
• Defined in Job Properties window
• Makes the job more flexible
• Parameters can be used anywhere a value can be specified
 Used in path and file names
 To specify property values
 Used in constraints and derivations in a Transformer stage
• Parameter values are specified at run time
• When used for directory and files names and property values, they are
surrounded with pound signs (#)
 For example, #NumRows#
 The pound signs distinguish the job parameter from a hand-coded value
• DataStage environment variables can be included as job parameters


Job parameters
Job parameters are defined in the Job Properties window. They make a job more flexible
by allowing values to be specified at runtime to configure how the job behaves.
Job parameters can be entered in many places in a DataStage job. Here we focus on
their use as property values. A job parameter is used in place of a hand-coded value
of a property. On different job runs, different values can then be specified for the
property.
In this example, instead of typing in, say, 100 for the Number of Records property, we
create a job parameter named NumRows and specify the parameter as the value of
the property. At runtime, we can enter a value for this parameter, for example, 100 or
100,000.
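For example, using the NumRows parameter above plus a hypothetical SourceDir parameter
for a file path, the property values would be entered as follows, and at runtime
DataStage substitutes the values supplied in the Job Run Options window:

    Number of Records:  #NumRows#
    File:               #SourceDir#/Employees.txt

The same compiled job can then generate, say, 10 rows in one run and 100,000 in another,
simply by supplying different parameter values.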


Define a job parameter




Define a job parameter


This graphic shows the Parameters tab in the Job Properties window. Here, you can
manually specify any job parameters you want to use in your job. Also, you can click the
Add Environment Variable button to add environment variables as parameters.
Click the Job Properties icon in the Designer toolbar to open the Job Properties
window.
Notice too the Add Parameter Set button. Click this button to add parameter set
variables to the list of parameters. Parameter sets are discussed later in this unit.


Use a job parameter in a stage



Use a job parameter in a stage


This graphic shows how to use job parameters in your job. Here, you see how to use
the NumRows job parameter in the Row Generator stage.
Select the property. Then enter the value in the text box. Click the button at the right of
the text box to display a menu for selecting a job parameter.


Add job documentation


• In Job Properties window
 Short and long descriptions
• Annotation stage
 Displays formatted text descriptions on diagram


Add job documentation


In addition to the documentation that the naming of links and stages provides, you can
also add further documentation using Annotation stages. You can also specify
descriptions that describe the job on the General tab of the Job Properties window.


Job Properties window documentation



Job Properties window documentation


This graphic shows where you can add job descriptions on the General tab of the
Job Properties window.
Job descriptions are available to users without opening the job. Some users, such as
DataStage operators, do not have permission to open a job or even to log into
Designer. These job descriptions are all they have (apart from the job name) to
determine what the job does.


Annotation stage properties


Annotation stage properties


This graphic shows the inside of the Annotation stage. Add one or more Annotation
stages to the canvas to document your job.
An Annotation stage works like a text box with various formatting options. You type in
the text. You can specify the font and text properties.
You can optionally show or hide the Annotation stages by pressing a button on the
toolbar.
There are two types of Annotation stages. The Description Annotation stage links its
text to the descriptions specified in the Job Properties window.


Compile and run a job



Compile and run a job


This graphic shows how to compile and run a job within Designer.
Before you can run your job, you must compile it. To compile it, click File > Compile or
click the Compile button on the toolbar. The Compile Job window displays the status
of the compile.
After you compile the job, assuming it compiles without errors, you can run it from within
Designer or Director. To view the job log, you will need to either go into the Director
client or open the job log within Designer.


Errors or successful message



Errors or successful message


This graphic shows the Compile Job window, which shows the status of the compile.
If an error occurs, you can click Show Error to highlight the stage where the error
occurred.
When enabled, click More to retrieve additional information about the error beyond what
you see in the Compilation Status window.


DataStage Director
• Use to run and schedule jobs
• View runtime messages
• Can invoke directly from Designer
 Tools > Run Director


DataStage Director
You can open Director from within Designer by clicking Tools > Run Director. In a
similar way, you can move from Director to Designer.
There are two methods for running a job: run it immediately, or schedule it to run at a
later date and time. Click the Schedule view icon in the toolbar to schedule the job.
To run a job immediately in Director, select the job in the Job Status view. The job
must have been compiled. Then click Job > Run Now or click the Run Now button in
the toolbar. The Job Run Options window is displayed. If the job has job parameters,
you can set them at this point or accept any default parameter values.


Run options



Run options
This graphic shows the Job Run Options window. The Job Run Options window is
displayed when you click Job > Run Now.
In this window, you can specify values for any job parameters. If default values were
specified for the job parameters when they were defined, these defaults initially show
up.
Click the Run button on this window to start the job.


Performance statistics
• Performance statistics are displayed in Designer when the job runs
 To enable, right-click over the canvas and then click Show performance statistics
• Link turns green if data flows through it
• Number of rows and rows-per-second are displayed
• Links turn red if runtime errors occur


Performance statistics
This graphic shows the Designer performance statistics, which appear when you run a
job while viewing it in Designer. These statistics are updated as the job runs.
The colors of the links indicate the status of the job. Green indicates that the data
flowed through the link without errors. Red indicates an error.
To turn performance monitoring on or off, click the right mouse button over the canvas
and then enable or disable Show performance statistics.


Director Status view



Director Status view


This graphic shows the Director Status view, which lists jobs in the project and their
statuses: Compiled, Running, Aborted, and so on. It also displays the start and stop
times of the last run.
The jobs are listed in the right pane along with their statuses. Click the “open book” icon
to view the job log for a selected job.


Job log, viewed from Designer



Job log, viewed from Designer


This graphic shows the job log in Designer for a specific job. The job log is available
both in Designer (click View > Job log) and Director (click the Log icon). The job log
displays messages that are written during the execution of the job.
Some messages are about control events, such as the starting, finishing, or aborting of
a job. Also included are informational messages, warning messages, and error
messages. Double-click on a message to open it.
Peek messages are prefixed by the name of the Peek stage.


Message details



Message details
This graphic shows an example of message details. Double-click on a message to
open it and read the message details.
In this example, the Peek message is displaying rows of data in one of the partitions or
nodes (partition 0). If the job is running on multiple partitions, there will be Peek
messages for each.
Each row displays the names of columns followed by their values.
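As an illustration (the exact layout varies by version and by the Peek stage's options,
and the column names here are invented), the detail of a Peek message for partition 0
might look something like this, with one line per row and each column shown as a
name:value pair:

    empID:1 name:A salary:1000
    empID:2 name:B salary:1010

The message header identifies the Peek stage and the partition that produced the rows,
for example PeekEmployees,0.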


Other job log functions


• Clear job log of messages
 In Director, click Job > Clear Log
 This function is not available in Designer
• Job reset
 If a job aborts, it may go into an unexecutable state
− Click the Reset button in the Director toolbar or the Designer job log toolbar to
return the job to an executable state


Other job log functions


Some other useful job log functions are listed here. The job log can fill up, so you may
want to clear the messages in the log for a particular job. In Director, click
Job > Clear Log to do this. This function is not available in Designer. With respect to
the job log, Director has more functionality than Designer.
Sometimes if a job aborts, it may go into a non-executable state. You can reset it using
the Reset button. Sometimes it may not be possible to reset a job. In those cases, you
need to recompile the job to return it to an executable state.


Director monitor
• Director Monitor
 Click Tools > New Monitor
 View runtime statistics on a stage / link basis
(like the performance statistics on the canvas)
 View runtime statistics on a partition-by-partition basis
− Click right mouse over window to turn this on



Director monitor
This graphic shows the Director Monitor, which depicts performance statistics. As
mentioned earlier you can also view runtime statistics on the Designer canvas.
However, the statistics on the Designer canvas cannot be broken down to individual
partitions, which you can view in Director.
Here we see that the Peek stage named PeekEmployees runs on both partitions
(0 and 1). Each instance processes 5 rows, so overall 10 rows are processed by the
Peek stage.
The Employees Row Generator stage is running on a single partition (0). Here, we see
that it has generated 10 rows.


Run jobs from the command line


• dsjob -run -param numrows=10 DSProject GenDataJob
 Runs a job
 Use -run to run the job
 Use -param to specify parameters
 In this example, DSProject is the name of the project
 In this example, GenDataJob is the name of the job
• dsjob -logsum DSProject GenDataJob
 Displays a job’s messages in the log
• Documented in “IBM InfoSphere DataStage Programmer’s Guide”


Run jobs from the command line


Although the focus in this course is on running jobs and viewing the log through the
DataStage clients, it is important to note that DataStage also has a command line
interface. This lists some command examples.
The primary command is the dsjob command. The first example uses it to run the
GenDataJob in a DataStage project named DSProject.
The second example uses the dsjob command to display the messages in the job log
for the same job.
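As an illustrative sketch only (check the exact option spellings against the
Programmer's Guide), a small shell script could run the job, wait for it to finish, and
then print the log summary; the -wait option, which makes dsjob block until the job
completes, is assumed here:

dsjob -run -wait -param numrows=10 DSProject GenDataJob
echo "dsjob exit status: $?"
dsjob -logsum DSProject GenDataJob

Because of -wait, the exit status and the log summary reflect the completed run
rather than just the submission of the job.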


Parameter sets
• Store a collection of job parameters in a named repository object
 Can be imported and exported like any other repository object
• One or more values files can be linked to the parameter set
 Particular values files can be selected at runtime
 Implemented as text files stored in the project directory
• Uses:
 Store standard sets of parameters for re-use
 Use values files to store common sets of job parameter values


Parameter sets
Parameter sets store a collection of job parameters in a named object. This allows
them to be loaded into a job as a collection rather than separately. It also allows
them to be imported and exported as a set.
Suppose that an enterprise has a common set of 20 parameters that they include in
every job they create. Without parameter sets, they would have to manually create
those parameters in every job. With parameter sets, they can add the whole collection
at once.
Another key feature of parameter sets is that they can be linked to one or more “values
files” - files that supply values to the parameters in the parameter set. At runtime, a user
can select which values file to use.
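As a sketch of what such a file looks like: a values file is a plain text file of
name=value lines, one line per parameter in the set. Assuming a set that contains
NumRows and TargetFile parameters (names used in the demonstrations in this
course), a values file named HighGen might contain:

NumRows=10000
TargetFile=TargetFile.txt

The file itself is stored under the project directory on the DataStage server system.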


Create a parameter set


Create a parameter set


To create a parameter set, click New and then select the Other folder. The graphic
shows the Other folder icons.


Defining the parameters


• Specify job parameters just as you would in a job
• Default values specified here become the default values for the
parameters specified in the values files, on the Values tab

Specify parameter set name, via the General tab


Defining the parameters


This graphic shows the Parameters tab of the Parameter Set window. Individual
parameters are defined just as they are defined individually in jobs. You specify the
name, prompt, type, and, optionally, a default value for the parameter.
As you will see, when you create a values file, on the Values tab, the default values you
specify here become the default values in the values file.
Note that environment variables can be included as parameters in a parameter set.


Defining values files


• Type in names of values files
• Enter values for parameters
 Default values show up initially, but can be overridden


Defining values files


This graphic shows the Values tab of the Parameter Set window. Optionally, type in
the names of one or more values files. The parameters specified on the Parameters
tab then become column headings on this tab. The default values entered on the
Parameters tab become the default values in the values file.
You can edit any of these default parameter values. The whole purpose of these values
files is to provide alternative sets of values. For example, one values file might be used
during development and another during production.


Load a parameter set into a job

Added parameter set

View parameter set

Add parameter set


Load a parameter set into a job


This graphic shows the Parameters tab of the Job Properties window in a job. Click
the Add Parameter Set button to add the collection of parameters. Notice that the type
(Parameter Set) distinguishes it in the window from an ordinary parameter.
You can also click the View Parameter Set button to view the contents of the
parameter set while working within the Job Properties window.


Use parameter set parameters

Parameter set prefix


Parameter


Use parameter set parameters


This graphic shows the Properties tab of the Row Generator stage in our example job.
A parameter from a parameter set is used as the Number of Records property value.
Notice that parameter set parameters are distinguished from ordinary parameters by
being prefixed by the name of the parameter set.
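As an illustration using names from the demonstration later in this unit, the
Number of Records property referencing the NumRows parameter from the
RowGenTarget parameter set would look like this; the #...# notation marks a job
parameter reference in a property value:

Number of Records = #RowGenTarget.NumRows#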


Run jobs with parameter set parameters

Select values file


Run jobs with parameter set parameters


This graphic shows the Job Run Options window, which opens when you click the
Run button. The parameter set is listed along with the individual parameters in the
parameter set.
For the parameter set you can select a values file. For any individual parameter, you
can change its value, thereby overriding the default value provided by the values file.


Checkpoint
1. Which stage can be used to display output data in the job log?
2. Which stage is used for documenting your job on the job canvas?
3. What command is used to run jobs from the operating system
command line?
4. What is a “values file”?


Checkpoint


Checkpoint solutions
1. Peek stage
2. Annotation stage
3. dsjob -run
4. One or more values files are associated with a parameter set. The
values file is a text file that contains values that can be passed to the
job at runtime.


Checkpoint solutions


Demonstration 1
Creating parallel jobs

• In this demonstration, you will:


 Create a DataStage job
 Compile a job
 Run a job
 View messages written to the job log
 Document a job using the Annotation stage
 Define and use a job parameter in the job
 Define and use a parameter set in the job


Demonstration 1: Create parallel jobs


Demonstration 1:
Create parallel jobs

Purpose:
You want to explore the entire process of creating, compiling, running, and
monitoring a DataStage parallel job. To do this you will first design, compile,
and run the DataStage parallel job. Next, you will monitor the job by first
viewing the job log, and then documenting it in the Annotation stage. Finally
you will use job parameters to increase the flexibility of the job and create a
parameter set to store the parameters for reuse.
Windows User/Password: student/student
DataStage Client: Designer
Designer Client User/Password: student/student
Project: EDSERVER/DSProject
Task 1. Create a parallel job.
You want to create a new parallel job with the name GenDataJob, and then save it in
your _Training > Jobs folder.
1. Log into Designer as student/student.
2. From the File menu, click New.

3. Click Parallel Job, and then click OK.


4. From the File menu, click Save.
5. Save your job as GenDataJob in your _Training > Jobs folder.
Next you want to add a Row Generator stage and a Peek stage from the
Development/Debug folder.
6. In the left pane, below Palette, expand Development/Debug.
Tip: you may need to resize panes, to be able to view elements under Palette.
7. Drag the Row Generator and Peek stages to the GenDataJob canvas.
8. Draw a link from the Row Generator stage to the Peek stage.
To accomplish this, click and hold the right mouse button over the Row
Generator stage, drag the mouse cursor to the Peek stage, and then release the
mouse button.


9. Name the Row Generator stage and its link Employees. Name the Peek stage
PeekEmployees, as shown.

10. Open up the Employees - Row Generator stage, and then click the
Columns tab.
11. Click the Load button, and then load the column definitions from the
Employees.txt table definition you imported in an earlier demonstration.
12. Verify your column definitions with the following.


13. On the Properties tab, specify that 100 records are to be generated. To do this,
select Number of Records = 10 in the left pane, and then update the value in
the Number of Records box to 100. Press Enter to apply the new value.

14. Click View Data, and then click OK, to view the data that will be generated.

15. Click Close, and then click OK to close the Row Generator stage.


Task 2. Compile, run, and monitor the job.


1. From the toolbar, click Compile to compile your job. If your job compiles
with errors, fix the errors before continuing.
2. Right-click over an empty part of the canvas, and ensure that Show
performance statistics is enabled.

3. Run your job by clicking Run from the toolbar.


4. From the View menu, enable Job Log to open the pane within Designer, so
that you can view the log messages.
5. Scroll through the messages in the log.
There should be no warnings or errors. If there are, double-click on the
messages to examine their contents. Fix the problem, and then recompile and
run.
Notice that there are one or more log messages starting with “PeekEmployees,”
the label on your Peek stage.
6. Double-click on one of these to open the Log Event Detail window.

7. Close the Job Log window.


Task 3. Specify Extended Properties.
1. Save your job as GenDataJobAlgor, in your _Training > Jobs folder.
2. Open up the Employees Row Generator stage, and then go to the
Columns tab.
3. Double-click on the row number to the left of the first column name.


4. Specify the extended properties, as shown.


• Click on Type to add the Type property.
• Click on Initial Value; set its value to 10000 in the Initial value field to the
right.
• Select the Type property, and then add the Increment property; set 1 as the
increment value.

5. Click Apply, then click Next.


6. For the Name column, specify that you want to cycle through three names of
your choice, by setting the following:
• Select Generator in the Properties panel, and then click Algorithm.
• Choose cycle from the drop down menu on the right.
• Click on Value; in the Value field, add a name for the first value.
• Press Enter to add a second value.
• Repeat to add a third value.

7. Click Apply, and then Next.


8. For the HireDate column, specify that you want the dates generated randomly.
• In the Available properties to add: window on the lower right, choose
Type.
• In the Type field, select random.

9. Click Apply, and then click Close.


10. Click View Data to see the data that will be generated.

11. Close the stage.


Task 4. Document your job.
1. From the Palette General folder, add an Annotation stage to your job diagram.
Open up the Annotation stage and choose another background color. Briefly
describe what the job does.

2. Compile and run your job.


3. In Designer, click View > Job Log to view the messages in the job log. Fix any
warnings or errors.
4. Verify the data by examining the Peek stage messages in the log.
Task 5. Add a job parameter.
1. Save your job as GenDataJobParam, in your _Training > Jobs folder.
2. From the Designer menu, click Edit > Job Properties. (Alternatively, click the
Job Properties icon in the toolbar.) Click the Parameters tab.
3. Define a new parameter named NumRows, with a default value of 10, type
Integer.

4. Open up the Properties tab of the Row Generator stage in your job. Select the
Number of Records property, and then click on the right-pointing arrow, as
shown, to select your new NumRows parameter.

The result appears as follows:

5. View the data.


6. Compile and run your job. Verify the results.


Task 6. Create a parameter set.


1. From the File menu, click New.
2. Click the Other folder.

3. Double-click the Parameter Set icon, and then name the parameter set
RowGenTarget.


4. Click the Parameters tab. Create the NumRows parameter as an Integer, with
the default value shown (100).

5. Click the Values tab. Create two values files. The first is named LowGen and
uses the default values for the NumRows parameter. The second, HighGen,
changes the default value of the NumRows parameter to 10000.

6. Click OK. Save your parameter set in your _Training > Metadata folder.
7. Save your job as GenDataJobParamSet.
8. From the Edit menu, click Job Properties, and then select the Parameters tab.
9. Click the Add Parameter Set button.
10. Select the RowGenTarget parameter set you created earlier (expand folders).


11. Click OK to add the parameter set to the job.

12. Click OK to close the Job Properties window.


13. Open up the Employees Row Generator stage, and then select the Number of
Records property.
14. Select the NumRows parameter from the parameter set, as the value for the
property.

15. Click OK to close the stage.


16. Compile your job.


17. Click the Run button. In the Job Run Options dialog, select the HighGen
values file.

18. Click Run. Verify that the job generates 10000 records.

Results:
You wanted to explore the entire process of creating, compiling, running, and
monitoring a DataStage parallel job. To do this you first designed, compiled,
and ran the DataStage parallel job. Next, you monitored the job by first
viewing the job log, and then documenting it in the Annotation stage. Finally
you used job parameters to increase the flexibility of the job and created a
parameter set to store a collection of parameters for reuse.


Unit summary
• Design a parallel job in DataStage Designer
• Define a job parameter
• Use the Row Generator, Peek, and Annotation stages in the job
• Compile the job
• Run the job
• Monitor the job log
• Create a parameter set and use it in a job


Unit summary

Access sequential data


IBM Infosphere DataStage v11.5


Unit objectives
• Understand the stages for accessing different kinds of file data
• Read and write to sequential files using the Sequential File stage
• Read and write to data set files using the Data Set stage
• Create reject links
• Work with nulls in sequential files
• Read from multiple sequential files using file patterns
• Use multiple readers


Unit objectives
Purpose - In the last unit, students built a job that sourced data generated by the
Row Generator stage. In this unit we work with one major type of data: sequential
data. In a later unit we will focus on the other major type of data: relational data.


How sequential data is handled


• The Sequential File stage can be used to read from and write to
sequential files
• The Sequential File stage uses a table definition to determine the
format of the data in the sequential files
• The table definition describes the record format (end of line) and the
column format (column types, delimiter)
 Records that cannot be read or written are “rejected”
• Messages in the job log use the “import” / “export” terminology
 Import = read; Export = write
 For example, “100 records imported / exported successfully; 2 rejected”


How sequential data is handled


The Sequential File stage is used to read from and write to sequential files in a
DataStage job. In order to successfully read from a sequential file, the stage needs
to be told the format of the file and the number of columns and their types. This is
typically done by loading a table definition into the stage.
What happens if the stage cannot read one or more of the rows of data? Usually
this happens because the data in the row does not match the table definition that
was loaded into the stage. Perhaps the row has fewer columns than expected. Or
perhaps the value in one of the columns does not match the type of the column.
For example, the data is a non-numeric string “abc”, but the column is defined as an
integer type.
When a row cannot be read by the stage, it is rejected. As you will see later, these
rows can be captured using a reject link.
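As a concrete illustration (the table definition and data here are made up): suppose
the loaded table definition declares two columns, Id (Integer) and Name (Varchar).
Then this row imports successfully:

102,Smith

while this row is rejected, because "abc" cannot be imported as an integer:

abc,Smith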


Features of the Sequential File stage


• Normally executes in sequential mode
• Can execute in parallel
 When reading multiple files
 When using multiple readers
• The stage needs to be told:
 How the file is divided into rows (record format)
 How rows are divided into columns (column format)
• Optionally supports a reject link
 Captures rows that are rejected by the stage


Features of the Sequential File stage


This lists the main features of the Sequential File stage. By default, a Sequential
File stage executes in sequential mode, but it can execute in parallel mode
depending on some property settings, as you will see later in this unit.
In order to read the sequential file, the stage needs to be told about the format of
the file. It needs to be told the record format and column format. Record format has
to do with how the stage can tell where one record of data ends and another begins.
That is, is there an end-of-line character or do the records have a fixed length? If
there is an end-of-line character, is it DOS or UNIX?
As mentioned earlier, a reject link can be created to capture rows that the stage
cannot successfully read (import).


Sequential file format example

Record delimiter: each record ends with a newline (nl)
Field delimiter: fields are separated by commas

Final Delimiter = end:    Field 1 , Field 2 , Field 3 , Last field nl
Final Delimiter = comma:  Field 1 , Field 2 , Field 3 , Last field , nl


Sequential file format example


This graphic shows the format of one major type of sequential file. Delimiters
separate columns. Similarly, records are separated by terminating characters. In
order to read and write to sequential files, this information must be specified in the
stage. Typically, it is specified by loading a table definition into the stage, but it can
also be manually specified.
In this graphic commas are used as column delimiters, but any character is
possible. Frequently, you will also see the pipe character (|) used as the column
delimiter.
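As a small made-up sample, a comma-delimited file with UNIX newlines might
contain rows like the following; the stage must be told the comma delimiter and the
newline terminator before it can split this text into records and columns:

1001,Smith,2010-05-01
1002,Jones,2012-11-15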


Job design with Sequential File stages

Read from file / Write to file
Stream link
Reject links (broken lines)


Job design with Sequential File stages


This graphic shows a job that reads from one file using a Sequential File stage and
writes to another file also using a Sequential File stage. A Sequential File stage
used to read from a file will have a single stream output link (unbroken line) and
optionally a reject link (broken line). The data that is read in will flow out this
stream link.
A Sequential File stage used to write to a file will have a single stream input link
(unbroken line) and optionally a reject output link (broken line). The data that is
written to the file will flow into the stage from this link.
The Sequential File stage does not allow more than one input link or output (stream)
link. And it cannot have both an input and an output stream link.


Sequential File stage properties

Output tab
Properties tab
Path to file
Column names in first row


Sequential File stage properties


The graphic shows the Properties tab in the Sequential File stage. Here you
specify the Read Method (a specifically named file, or a file pattern) and the path to
the file. Select the File property and then browse for the file you want the stage to
read. The file path must be visible from the DataStage server system, where the
DataStage job is run.
These properties are being specified on the Output tab. This implies that there is a
link going out of the stage. Therefore, this stage is being used to read from a file.
Some (not all) sequential files have a first row of column names. This row is not real
data. It is used as metadata describing the contents of the file. If you are reading
from a file that has this, set the First Line is Column Names property to true.
Otherwise, the stage will confuse this row with real data and probably reject the row.


Format tab

Format tab
Record format
Load format from table definition
Column format


Format tab
This graphic shows the Format tab of the Sequential File stage.
Here you specify the record delimiter and general column format, including the
column delimiter and quote character. Generally, these properties are specified by
loading the imported table definition that describes the sequential file, but these
properties can also be specified manually.
Use the Load button to load the format information from a table definition.
Note that the columns definitions are not specified here, but rather separately on the
Columns tab. So, as you will see, there are two places where you can load the
table definitions: the Format tab and the Columns tab.


Columns tab

Columns tab
View data
Load columns from table definition
Save as a new table definition


Columns tab
This graphic shows the Columns tab of the Sequential File stage.
Click the Load button to load the table definition columns into the stage. The column
definitions can be modified after they are loaded. When this is done you can save
the modified columns as a new table definition. This is the purpose of the Save
button. Note, do not confuse this Save button with saving the job. Clicking this
button does not save the job.
After you finish editing the stage properties and format, you can click the View Data
button. This is a good test to see if the stage properties and format have been
correctly specified. If you cannot view the data, then your job when it runs will
probably not be able to read the data either!


Reading sequential files using a file pattern

Use wild cards
Select File Pattern


Reading sequential files using a file pattern


The graphic shows the Properties tab of the Sequential File stage.
To read files using a file pattern, change the Read Method to File Pattern. The File
Pattern property recognizes the asterisk (*) and question mark (?) wild card
characters in the path specification. The asterisk matches zero or more characters.
The question mark matches any single character.
In this example, the stage will read all the files in the /Temp directory with names
that start with “TargetFile_” followed by any single character. It is assumed that all
of these files have the same format and column definitions.
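A few illustrative matches, assuming these files exist in the /Temp directory:

/Temp/TargetFile_?.txt matches TargetFile_A.txt and TargetFile_B.txt
/Temp/TargetFile_*.txt matches those files and also TargetFile_.txt and
TargetFile_AB.txt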


Multiple readers

Number of Readers per Node is an optional property you can add
2 readers per node


Multiple readers
The graphic shows the Properties tab of the Sequential File stage.
The Number of Readers per Node is an optional property you can add that allows
you to read a single sequential file using multiple reader processes running in
parallel. If, for example, you specify two readers, the file can be read roughly twice as
fast as with just one reader (the default). Conceptually, you can picture this as one
reader reading the top half of the file and the second reader reading the bottom half
of the file, simultaneously, in parallel.
Note that the row order is not maintained when you use multiple readers. Therefore,
if input rows need to be identified, this option can only be used if the data itself
provides a unique identifier. This works for both fixed-length and variable-length
records.


Writing to a sequential file

Input tab
Path to output file
Append / Overwrite
Add first row of column names


Writing to a sequential file


We have been discussing how to use the Sequential File stage to read from
sequential files. Now we turn to using it to write to sequential files. This graphic
shows the Properties tab of the Sequential File stage on the Input tab. This implies
that there is a link going into the stage. Therefore, this stage is being used to write
to a sequential file.
The File property is used to specify the path to the output file, which may or may not
already exist. The File Update Mode property is used to specify whether you want
to overwrite the existing file, if it exists, or append to the existing file.
The First Line is Column Names property also exists here. In this case, it specifies
whether the stage is to add a first row of columns based on the column definitions
loaded into the stage.


Reject links
• Optional output link
• Distinguished from normal, stream output links by their broken lines
• Capture rows that the stage rejects
 In a source Sequential File stage, rows that cannot be read because of a
metadata or format issue
 In a target Sequential File stage, rows that cannot be written because of a
metadata or format issue
• Captured rows can be written to a Sequential File stage or Peek stage
or processed in some other manner
• Rejected rows are written as a single column of data:
datatype = raw (binary)
• Use the Reject Mode property to specify that rejects are to be output


Reject links
The Sequential File stage can have a single reject link. Reject links can be added to
Sequential File stages used either for reading or for writing. They capture rows that
the stage rejects. In a source Sequential File stage, this includes rows that cannot
be read because of a metadata or format issue. In a target Sequential File stage,
this includes rows that cannot be written because of a metadata or format issue.
In addition to drawing the reject link out of the stage, you also must set the
Reject Mode property. Otherwise, you will get a compile error.
Rejected rows are written out the reject link as a single column of binary data
(data type raw).


Source and target reject links

Stream link
Reject links (broken lines)


Source and target reject links


This graphic displays a job with reject links from Sequential File stages.
The second link you draw from a source stage is automatically interpreted as a
reject link. You can change the type of a link by right-clicking over it and
selecting the type.
In this example, rejects are sent to Peek stages, which write the data to the job log.
However, you could also send the data to Sequential File stages or to processing
stages, such as a Transformer stage.


Setting the Reject Mode property

Output rejects


Setting the Reject Mode property


This graphic shows the Properties tab of the Sequential File stage.
By default the Reject Mode property is set to Continue. This means that a rejected
row will be thrown away and processing will continue with the next row. If you add a
reject link, then you must set the Reject Mode to Output.


Copy stage
• Rows coming into the Copy stage through the input link can be
mapped to one or more output links
• No transformations can be performed on the data
• No filtering conditions can be specified
 What goes in must come out
• Operations that can be performed:
 Numbers of columns can be reduced
 Names of columns can be changed
 Automatic type conversions can occur
• On the Mapping tab, input columns are mapped to output link columns


Copy stage
The Copy stage is a simple, but powerful processing stage. It is called the Copy
stage because no transformations or filtering of the data can be performed within
the stage. The input data is simply copied to the output links. For this reason, the
stage has little overhead. Nevertheless, the stage has several important uses. Since
it supports multiple output links, it can be used to split a single stream into multiple
streams for separate processing.
Metadata can also be changed using the stage. The number of columns in the
output can be reduced and the names of the output columns can be changed.
Although no explicit transformations can be performed, automatic type conversions
do take place. For example, Varchar() type columns can be changed to Char() type
columns.


Copy stage example


• One input link
• Two output links
 Splits the input data into two output streams
 All input rows go out both output links


Copy stage example


This graphic shows a Copy stage with one input link and two output links. This splits
the single input stream into multiple output streams. All of the input rows will go out
both output links.


Copy stage Mappings

Output name list: list of output links
Column mappings
Names of columns have changed


Copy stage Mappings


This graphic shows the Output > Mapping tab of the Copy stage.
Mappings from input columns to output columns are done on the
Output > Mapping tab. In this example, two input columns have been dragged to
the output side. The names of the columns have also been changed. Four columns
flow in, two columns flow out this output link.
If there are multiple output links, you need to specify the mappings for each. Select
the name of each output link from the Output name list at the top left of the stage,
and then specify the mappings for each.


Demonstration 1
Reading and writing to sequential files

• In this demonstration, you will:
 Read from a sequential file using the Sequential File stage
 Write to a sequential file using the Sequential File stage
 Use the Copy stage in a job
 Create reject links from Sequential File stages
 Use multiple readers in the Sequential File stage
 Read multiple files using a file pattern


Demonstration 1: Reading and writing to sequential files


Demonstration 1:
Reading and writing to sequential files

Purpose:
Sequential files are one type of data that enterprises commonly need to
process. You will read and write sequential files using the Sequential File
Stage. Later, you will create a second output link, create reject links from
Sequential File stages, use multiple readers in the Sequential file stage, and
read multiple files using a file pattern.
Windows User/Password: student/student
DataStage Client: Designer
Designer Client User/Password: student/student
Project: EDSERVER/DSProject
Task 1. Read and write to a sequential file.
In this task, you design a job that reads data from the Selling_Group_Mapping.txt file,
copies it through a Copy stage, and then writes the data to a new file named
Selling_Group_Mapping_Copy.txt.
1. From the File menu, click New, and then in the left pane, click Jobs.

2. Click Parallel Job, click OK, and then save the job under the name
CreateSeqJob to the _Training > Jobs folder.
3. Add a Sequential File stage from the Palette File folder, a Copy stage from the
Palette Processing folder, and a second Sequential File stage.


4. Draw links between stages, and name the stages and links as shown.

5. In the source (Selling_Group_Mapping) Sequential File stage, on the Columns
and Format tabs, load the column and format definitions from the
Selling_Group_Mapping.txt table definition you imported in a previous
demonstration.
6. On the Properties tab, specify a path to the file to be read - namely, the
Selling_Group_Mapping.txt file. Also, set the First Line is Column Names
property to True.
If you do not set the property, your job will have trouble reading the first row and
issue a warning message in the job log.


7. Click View Data to verify that the metadata has been specified properly in the
stage.

8. Click Close, and then click OK.


9. In the Copy stage, Output > Mapping tab, drag all the columns across from
the source to the target.

10. Click OK.


11. In the target (Selling_Group_Mapping_Copy) Sequential File stage, click the
Format tab. Confirm that Field defaults > Delimiter = comma.
12. Return to the Properties tab. Name the file
Selling_Group_Mapping_Copy.txt, and write it to your
C:\CourseData\DSEss_Files\Temp directory.


13. Create it with a first line of column names. It should overwrite any existing file
with the same name.

14. Click OK. Compile and run your job.


15. View the job log, and fix any errors - if any exist.
16. To view the data in the target stage, right-click over the stage, and then click
View <stage name> data. Since no changes were made to the data, it will look
the same as it did in the source stage.
Task 2. Create a job parameter for the target file.
1. Save your CreateSeqJob job as CreateSeqJobParam. Rename the last link
and Sequential File stage to TargetFile.

2. Open up the Job Properties window.


3. On the Parameters tab, define a job parameter named TargetFile, of type
String. Create an appropriate default filename, for example, TargetFile.txt.

4. Open up your target Sequential File stage to the Properties tab. Select the File
property. In the File text box, retain the directory path. Replace the name of your
file with your job parameter.


Task 3. Add Reject links.


1. Add a second link (which will automatically become a reject link) from the
source Sequential File stage to a Peek stage. Also add a reject link from the
target Sequential File stage to a Peek stage. Give appropriate names to these
new stages and links.

2. On the Properties tab of each Sequential File stage, change the Reject Mode
property value to Output.

3. Compile and run. Verify that it is running correctly. You should not have any
rejects, errors, or warnings.
4. To test the rejects link, temporarily change the property First Line is Column
Names to False, in the source stage, and then recompile and run.
This will cause the first row to be rejected because the values in the first row,
which are all strings, will not match the column definitions, some of which are
integer types.


5. Examine the job log. Look for a warning message indicating an import error in
the first record read (record 0). Also open the SourceRejects Peek stage
message. Note the data in the row that was rejected.

Task 4. Create a second output link from a Copy stage.


1. Add a second output link from your Copy stage to a Peek stage, naming the link
ToPeek.

2. Open the Copy stage. Click the Output > Mapping tab, and then, from the
Output name drop down list box, select the link to your Peek stage, ToPeek.


3. Drag the first two columns to the target link.

4. Click on the Columns tab, and then rename the second column SG_Desc.


5. Compile and run your job. View the messages written to the log by the Peek
output stage.

Task 5. Read a file using multiple readers.


1. Save your job as CreateSeqJobMultiRead.
2. Click the Properties tab of your source Sequential File stage.
3. Click the Options folder to select it, and then add the Number of Readers Per
Node property. Set this property to 2.
4. Compile and run your job.
5. View the job log.
Note: You will receive some warning messages in the job log related to the first
row. And this row will be rejected. You can safely ignore these.
Task 6. Create a job that reads multiple files.
1. Save your job as CreateSeqJobPattern.
2. Open the target Sequential File stage, and select the Format tab.
3. Select the Record Level folder, and then click Record delimiter in the
Available properties to add window.


4. Accept its default value - UNIX newline.


This will produce the files with UNIX record delimiters, which is what we want in
this case - because the source stage reads files in that format.

5. Compile and then run your job twice, specifying the following file names in the
job parameter for the target file: TargetFile_A.txt, TargetFile_B.txt. This writes
two files to your DSEss_Files\Temp directory.
6. Edit the source Sequential File stage, and change the Read Method to File
Pattern. You will get a warning message. Click Yes to continue.


7. Browse for the TargetFile_A.txt file. Place a wildcard (?) in the last portion of
the file name: TargetFile_?.txt.

8. Click View Data to verify that you can read the files.
9. Compile and run the job, writing to a file named TargetFile.txt. View the job log.
10. Right-click the target stage, and then click View TargetFile data, to verify the
results.
There should be two copies of each row, since you are now reading two
identical files. You can use the Find button in the View Data window to locate
both copies.
Results:
You read and wrote sequential files using the Sequential File Stage. Later, you
created a second output link, created reject links from Sequential File stages,
used multiple readers in the Sequential file stage, and read multiple files
using a file pattern.


Working with nulls


• Internally, null is represented by a special value outside the range of
any existing, legitimate values
• If null is written to a non-nullable column, the job will abort
• Columns can be specified as nullable
 Nulls can be written to nullable columns
• You must “handle” nulls written to nullable columns in a
Sequential File stage
 You need to tell DataStage what value to write to the file
 Unhandled rows are rejected
• In a Sequential File source stage, you can specify values you want
DataStage to convert to nulls


Working with nulls


Nulls can enter the job flow, and when they do, they must be carefully handled.
Otherwise, runtime errors and unexpected results can occur. This outlines how null
values can be handled in DataStage in the context of sequential files. Later units will
discuss null values in other contexts.
Internally, null is represented by a value outside the range of any possible legitimate
data value. Therefore, it cannot be confused with a legitimate data value. And this is
why it is so useful.
Nullability is a property of columns. Columns either allow nulls or they prohibit nulls.
A null value written to a non-nullable column at runtime will cause the job to abort.
Columns in a Sequential File stage can be nullable. Therefore, nulls can be read
from and written to columns in a Sequential File stage. But what value should go
into the sequential file when a null is written to a nullable column in the Sequential
File stage? Should it be the empty string? Should it be the word “NULL” or should it
be some other value? The Sequential File stage allows you to specify the value. It
can be whatever value supports your business purpose.


Specifying a value for null

Nullable column

Added property


Specifying a value for null


This graphic shows the extended properties window for a nullable column in the
Sequential File stage. To specify a value for null, add the optional Null field value
property. Then specify a value for this property. The value can be whatever you want
it to be: the empty string (“”), the word “unknown”, anything. The value does not
even have to match the column type. For example, you can use “unknown” to
represent null integer values.
What happens if you do not specify a value for a nullable column and null is written
to the column at runtime? The job does not abort. The row is rejected.
Note that on the Format tab, you can specify a default value for all nullable columns
in the stage.


Empty string example


• If you want two column delimiters with nothing between them to
mean null, then specify the empty string (“”) as the Null field value

Empty string value


Empty string example


The graphic shows how to specify the empty string (“”) as the null value. Add the
Null field value property and then type two quotes without spaces. The quotes can
be either single quotes or double quotes. Here, and in general, DataStage allows
either.
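For example, with "" as the Null field value for the third column, the empty field
between the back-to-back commas in this made-up row is read as null:

10,Group A,,Retail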


Viewing data with nulls


• When you click View Data, null values, regardless of their actual value
in the file, show up as “NULL”
• To see the actual values that represent null, you need to view the
actual data file

Empty string value


Viewing data with nulls


This graphic shows how null values are displayed when you click the View Data
button. Regardless of the actual value in the file, the value is displayed by the word
“NULL”. This sometimes confuses DataStage developers. They have, for example,
just specified the word “unknown” to represent null. But it appears as if the word
“unknown” was not written to the file. However, if you go look directly at the file (in a
text editor) on the DataStage server system, you will find the word “unknown”, not
the word “NULL”.


Demonstration 2
Reading and writing null values

• In this demonstration, you will:


 Read values meaning null from a sequential file
 Write values meaning null to a sequential file


Demonstration 2: Reading and writing null values


Demonstration 2:
Reading and writing NULL values

Purpose:
You want to read and write NULL values using a sequential file. NULL values
enter into the job stream in a number of places in DataStage jobs. You want to
look at how the NULL values are handled in the context of reading from and
writing to sequential files.
Windows User/Password: student/student
DataStage Client: Designer
Designer Client User/Password: student/student
Project: EDSERVER/DSProject
NOTE:
In this demonstration and other demonstrations in this course there may be tasks that
start with jobs you have been instructed to build in previous tasks. If you were not able
to complete the earlier job you can import it from the
DSEssLabSolutions_V11_5_1.dsx file in your C:\CourseData\DSEss_Files\dsx files
directory. This file contains all the jobs built in the demonstrations for this course.
Please note: If you need to import (and overwrite your existing saved work) you may
want to rename your existing element, so that you don't lose what you have created.
This will avoid overwriting (and losing) what you have worked on so far in the course.
Steps:
1. From the Designer menu, click Import, and then click DataStage
Components.
2. Select the Import selected option (this will enable you to pick and choose what
you want to import), and then select the element you require from the list of
elements that is displayed.
Task 1. Read NULL values from a sequential file.
1. Open your CreateSeqJobParam job.


2. Save your job as CreateSeqJobNULL.

3. From the Windows Start menu, click All Programs > Accessories > WordPad.


4. From the File menu in WordPad, click Open.
5. In the Open window, change the file type to Text Documents (*.txt) - if it is not
already showing - and then browse under the following path to open the file:
C:\CourseData\DSEss_Files\Selling_Group_Mapping_Nulls.txt.

Notice in the data that the Special_Handling_Code column contains some integer
values of 1. Notice also that the last column (Distr_Chann_Desc) is missing some
values.
To test how to read NULLs, let us assume that 1 in the third column means
NULL, and that the absence of a value in the last column also means NULL.
In the following steps, you will specify this.


6. Open up the source Sequential stage to the Columns tab. Double-click to the
left of the Special_Handling_Code column to open up the Edit Column Meta
Data window.
7. Change the Nullable field to Yes. Notice that the Nullable folder shows up in
the Properties pane. Select this folder and then add the Null field value
property. Specify a value of 1 for it.

8. Click Apply, and then click Next.


9. Move to the Distribution_Channel_Description column. Set this field to
nullable. Add the Null field value property. Here, you will treat the empty string
as meaning NULL. To do this, specify "" (back-to-back double quotes).

10. Click Apply, and then click Close.


11. On the Properties tab, for the File property, select the
Selling_Group_Mapping_Nulls.txt file.
12. Click the View Data button.
Notice that values that are interpreted by DataStage as NULL show up as the
word “NULL”, regardless of their actual value in the file.

13. Click Close, and then click OK.


14. Compile and run your job.


It should abort since NULL values will be written to non-nullable columns on
your target.
15. View the job log to see the messages.

Task 2. Write NULL values to a sequential file.


1. Save your job as CreateSeqJobHandleNULL.
2. Open up your target Sequential File stage to the Columns tab. Specify that the
Special_Handling_Code column and the Distribution_Channel_Description
column are nullable.
3. Compile and run your job.
What happens?
In this case, the job does not abort, since NULL values are not being written to
non-nullable columns. But the rows with NULL values get rejected because the
NULL values are not being handled. They are written to the TargetRejects
Peek stage, and you can view them in the job log.

Now, let us handle the NULL values. That is, we will specify values to be written
to the target file that represent NULLs.


4. Open up the target stage on the Columns tab, and then specify:
• Special_Handling_Code column, Null field value of -99999.
• Distribution_Channel_Description column, Null field value UNKNOWN.
The procedure is the same as when the Sequential File stage is used as a source
(Task 1 of this demonstration).
The results appear as follows:

5. Compile and run your job. View the job log. You should not get any errors or
rejects.
6. Click View Data on the target Sequential File stage to verify the results.
7. To see the actual values written to the file open the file TargetFile.txt in the
DSEss_Files\Temp directory. Look for the values -99999 and UNKNOWN.
Note: When you view the data in DataStage, all you will see is the word “NULL”,
not the actual values. To see actual values you would need to open up the data
file on the DataStage server system in a text editor.
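Conceptually, the Null field value property maps between an on-disk marker and a NULL. The following minimal Python sketch (illustrative only, not DataStage code; the column names and markers are the ones used in this demonstration) shows what the import and export steps do:

    # Illustrative sketch only - not actual DataStage code.
    NULL_MARKERS = {"Special_Handling_Code": "1",
                    "Distr_Chann_Desc": ""}          # read-side null markers
    WRITE_MARKERS = {"Special_Handling_Code": "-99999",
                     "Distr_Chann_Desc": "UNKNOWN"}  # write-side null markers

    def import_field(column, raw):
        """On read: a field equal to its null marker becomes NULL (None)."""
        return None if raw == NULL_MARKERS.get(column) else raw

    def export_field(column, value):
        """On write: a NULL becomes the column's null marker string."""
        return WRITE_MARKERS[column] if value is None else str(value)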
Results:
You read and wrote NULL values using a sequential file. NULL values enter
into the job stream in a number of places in DataStage jobs. You looked at
how the NULL values are handled in the context of reading from and writing to
sequential files.


Data Set stage


• Binary data file
• Preserves partitioning
 Component dataset files are written to each partition
• Suffixed by .ds
• Referred to by a header file
• Managed by Data Set Management utility from GUI
(Designer, Director)
• Represents persistent data
• Key to good performance in set of linked jobs
 No import / export conversions are needed
 No repartitioning needed
• Accessed using Data Set stage
• Linked to a particular configuration file

Data Set stage


Data sets represent persistent data maintained in the DataStage internal format.
They are files, but they are a special kind of file, very different from sequential files.
To identify a file as a data set file, apply the .ds extension to the filename.
There are two main features of data sets. First, they contain binary data, and so
their data cannot be viewed using an ordinary text editor. In this respect, they differ
from file sets, which are discussed later in this unit.
Secondly, data sets contain partitioned data. Their data is partitioned according to
the number of nodes in the configuration file used to create the data set. Individual
data component files, referenced by a header file, exist on each node identified in
the configuration file.
Data sets are the key to good performance between a set of linked parallel jobs.
One job can write to a data set that the next job reads from without collecting the
data onto a single node, which would slow the performance.


Job with a target Data Set stage


Data Set stage

Data Set stage


properties


Job with a target Data Set stage


The top graphic displays a job with a target Data Set stage. The bottom graphic
displays the Properties tab of the Data Set stage. The File property has been set to
the name and path of the data set. This is the actual location of the data set header
file. The linked data component files will be located elsewhere, on each of the
nodes.


Data Set Management utility

Display data

Display schema

Display record counts


for each partition


Data Set Management utility


This graphic displays the Data Set Management window.
The window is available from both Designer and Director. In Designer, click
Tools > Data Set Management to open this window.
Click the Show Schema icon at the top of the window to view the data set schema.
A data set contains its own column metadata in the form of a schema. A schema is
the data set version of a table definition.
Click the Data Set Viewer icon to view the data in the data set. Records can be
displayed for each individual partition or altogether.


Data and schema displayed

Data viewer

Schema describing the


format of the data


Data and schema displayed


The left graphic shows the data set data from the Data Set Viewer window. The
right graphic shows the Record Schema window, describing the format of the data.
Notice that the record consists of the names of the columns followed by their data
types. The data types are C++ data types. At the DataStage GUI level most of the
column data types are SQL types. Internally, DataStage uses C++ types.
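For illustration, a record schema as displayed in the Record Schema window has roughly the following form (the column names here are hypothetical):

    record
    ( Selling_Group_Code: int32;
      Selling_Group_Desc: nullable string[max=30];
      Special_Handling_Code: nullable int32;
    )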


File set stage


• Use to read and write to file sets
• Files suffixed by .fs
• Similar to a dataset
 Partitioned
• How file sets differ from data sets
 File sets are readable by text editors (non-binary)
− Hence suitable for archiving


File set stage


File sets are similar to data sets. Like data sets, they are partitioned. They both
have headers, which reference component data files on each partition.
Their main difference is that they are readable by ordinary text editors. This slightly
reduces their performance, compared to data sets, but makes them suitable for
archiving.


Demonstration 3
Working with data sets

• In this demonstration, you will:


 Write to a data set
 Use the Data Set Management utility to view data in a data set


Demonstration 3: Working with data sets


Demonstration 3:
Working with data sets

Purpose:
Data Sets are suitable as temporary staging files between DataStage jobs.
Here, you will write to a data set and then view the data in the data set using
the Data Set Management Utility.
NOTE:
In this demonstration and other demonstrations in this course there may be tasks that
start with jobs you have been instructed to build in previous tasks. If you were not able
to complete the earlier job you can import it from the
DSEssLabSolutions_V11_5_1.dsx file in your C:\CourseData\DSEss_Files\dsx files
directory. This file contains all the jobs built in the demonstrations for this course.
Steps:
1. Click Import, and then click DataStage Components.
2. Select the Import selected option, and then select the job you want from the list
that is displayed.
If you want to save a previous version of the job, be sure to save it under a new name
before you import the version from the demonstration solutions file.
Task 1. Write to a Data Set
1. Open up your CreateSeqJob job, and then save it as CreateDataSetJob.
2. Delete the target sequential stage, leaving a dangling link.
3. Drag a Data Set stage from the Palette File folder to the canvas, and then
connect it to the dangling link. Change the name of the target stage to
Selling_Group_Mapping_Copy.


4. Edit the target Data Set stage properties. Write to a file named
Selling_Group_Mapping.ds in your DSEss_Files\Temp directory.

5. Open the source Sequential File stage and add the optional property to set
number of readers per node. Click Yes when confronted with the warning
message. Change the value of the property to 2.
(This will ensure that data is written to more than one partition.)
6. Compile and run your job. Check the job log for errors. You can safely ignore
the warning message about record 0.


Task 2. View a data set.


1. In Designer, click Tools > Data Set Management. Browse for the data set that
was created. Notice how many records are written to each of the two partitions.

2. Click the Show Data Window icon at the top of the window. Select partition
number 1. This will only display the data in the second partition.

3. Click OK to view the records in that partition.


4. Click the Show Schema Window icon at the top of the window to view the
data set schema.
A data set contains its own column metadata in the form of a schema. A
schema is the data set version of a table definition.

Results:
You wrote to a data set and then viewed the data in the data set using the Data
Set Management Utility.


Checkpoint
1. List three types of file data.
2. What makes data sets perform better than sequential files in
parallel jobs?
3. What is the difference between a data set and a file set?


Checkpoint


Checkpoint solutions
1. Sequential files, data sets, file sets.
2. They are partitioned and they store data in the native parallel format.
3. Both are partitioned. Data sets store data in a binary format not
readable by user applications. File sets are readable.


Checkpoint solutions


Unit summary
• Understand the stages for accessing different kinds of file data
• Read and write to sequential files using the Sequential File stage
• Read and write to data set files using the Data Set stage
• Create reject links
• Work with nulls in sequential files
• Read from multiple sequential files using file patterns
• Use multiple readers


Unit summary

Unit 7 Partitioning and collecting algorithms

IBM Infosphere DataStage v11.5


Unit objectives
• Describe parallel processing architecture
• Describe pipeline parallelism
• Describe partition parallelism
• List and describe partitioning and collecting algorithms
• Describe configuration files
• Describe the parallel job compilation process
• Explain OSH
• Explain the Score


Unit objectives
Purpose - DataStage developers need a basic understanding of the parallel
architecture and framework in order to develop efficient and robust jobs.


Partition parallelism
• Divide the incoming stream of data into subsets to be separately
processed by a stage/operation
 Subsets are called partitions (nodes)
 Facilitates high-performance processing
− 2 nodes = Twice the performance
− 12 nodes = Twelve times the performance
• Each partition of data is processed by the same stage/operation
 If the stage is a Transformer stage, each partition will be processed by
instances of the same Transformer stage
• Number of partitions is determined by the configuration file
• Partitioning occurs at the stage level
 At the input link of a stage that is partitioning, the stage determines the
algorithm that will be used to partition the data


Partition parallelism
Partitioning breaks the stream of data into smaller sets that are processed
independently, in parallel. This is a key to scalability. You can increase performance
by increasing the number of partitions, assuming that you have the number of
physical processors to process them. Although there are limits to the number of
processors reasonably available in a single system, a GRID configuration is
supported which distributes the processing among a networked set of computer
systems. There is no limit to the number of systems (and hence processors) that
can be networked together.
The data needs to be evenly distributed across the partitions; otherwise, the
benefits of partitioning are reduced.
It is important to note that what is done to each partition of data is the same. Exact
copies of each stage/operator are run on each partition.
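For example, with an even distribution, a job processing 1,000,000 rows on a 4-node configuration handles roughly 250,000 rows per node, for an ideal speedup approaching four times.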


Stage partitioning

[Diagram: the incoming data is split into subset1, subset2, and subset3, each processed by its own instance of the stage/operation on a separate node.]

• Here the data is partitioned into three partitions


• The operation is performed on each partition of data separately and in
parallel
• If the data is evenly distributed, the data will be processed three times
faster

Stage partitioning
This diagram illustrates how stage partitioning works. Subsets of the total data go
into each partition where the same stage or operation is applied. How the data is
partitioned is determined by the stage partitioning algorithm that is used.
The diagram is showing just one stage. Typical jobs involve many stages. At each
stage, partitioning, re-partitioning, or collecting occurs.


DataStage hardware environments

• Single CPU
 Dedicated memory & disk
• SMP
 Multi-CPU (2-64+)
 Shared memory & disk
• Grid / Cluster
 Multiple, multi-CPU systems
 Dedicated memory per node
 Typically SAN-based shared storage
• MPP
 Multiple nodes with dedicated memory, storage
 2 - 1000's of CPUs


DataStage hardware environments


This graphic illustrates the hardware environments that can be used to run DataStage jobs: single CPU, SMP, and grid/cluster or MPP.
DataStage parallel jobs are designed to be platform-independent. A single job, if
properly designed, can run across the resources within a single machine (single
CPU or SMP) or multiple machines (cluster, GRID, or MPP architectures).
While parallel jobs can run on a single-CPU environment, DataStage is designed to
take advantage of parallel platforms.


Partitioning algorithms
• Round robin
• Random
• Hash: Determine partition based on key value
 Requires key specification
• Modulus
 Requires key specification
• Entire: Send all rows down all partitions
• Same: Preserve the same partitioning
• Auto: Let DataStage choose the algorithm
 DataStage chooses the algorithm based on the type of stage


Partitioning algorithms
Partitioning algorithms determine how the stage partitions the data. Shown here are
the main algorithms used. You are not required to explicitly specify an algorithm for
each stage. Most types of stages are by default set to Auto, which allows
DataStage to choose the algorithm based on the type of stage.
Do not think of Same as a separate partitioning algorithm. It signals that the stage is
to use the same partitioning algorithm adopted by the previous stage, whatever that
happens to be.


Collecting (1 of 2)
• Collecting returns partitioned data back into a single stream
 Collection algorithms determine how the data is collected
• Collection reduces performance, but:
 Sometimes is necessary for a business purpose
− For example, we want the data loaded into a single sequential file
 Sometimes required by the stage
− Some, mostly legacy, stages only run in sequential mode
− Stages sometimes run in sequential mode to get a certain result, for example,
a global count of all records


Collecting
Collecting is the opposite of partitioning. Collecting returns partitioned data back into
a single stream. Collection algorithms determine how the data is collected.
Generally speaking, it is the parallel processing of the data that boosts the
performance of the job. In general, then, it is preferable to avoid collecting the data.
However, collecting is often required to meet business requirements. And some
types of stages run in sequential mode. For example, the Sequential File and the Row Generator stages both run by default in sequential mode.


Collecting (2 of 2)

[Diagram: stage instances running in parallel on Node 0, Node 1, and Node 2 feed a single downstream stage instance, which collects the three partitions onto one node.]

• Here the data is collected from three partitions down to a single node
• At the input link of a stage that is collecting, the stage determines the
algorithm that will be used to collect the data


This diagram illustrates how the data in three partitions is collected into a single
data stream. The initial stage, shown here, is running in parallel on three nodes. The
second stage is running sequentially. To support the operation of the second stage,
all the data has to be collected onto a single node (Node 0).
Just as with partitioning, there are different algorithms that the second stage can
use to collect the data. Generally, by default, the algorithm is “take the row that
arrives first”.


Collecting algorithms
• Round robin
• Auto
 Collect first available record
• Sort Merge
 Read in by key
 Presumes data is sorted by the collection key in each partition
 Builds a single sorted stream based on the key
• Ordered
 Read all records from first partition, then second, and so on


Collecting algorithms
Shown is a list of the main collecting algorithms. By default, most stages are set to
Auto, which lets DataStage decide the algorithm to use. In most cases, this is to
collect the next available row.
Sort Merge is the collection algorithm most often used apart from Auto. It is used to
build a global, sorted collection of data from several partitions of sorted data.


Keyless versus keyed partitioning algorithms


• Keyless: Rows are distributed independently of data values
 Round Robin
 Random
 Entire
 Same
• Keyed: Rows are distributed based on values in the specified key
 Hash: Partition based on key
− Example: Key is State. All “CA” rows go into the same partition; all “MA” rows
go into the same partition. Two rows from the same state never go into
different partitions
 Modulus: Partition based on the remainder when the key value is divided
by the number of partitions. Key must be a numeric type
− Example: Key is OrderNumber (numeric type). Rows with the same order
number will all go into the same partition
 DB2: Matches DB2 Enterprise Edition partitioning

Keyless versus keyed partitioning algorithms


Partitioning algorithms can be divided into two main categories: keyed and keyless.
The former distributes the data based on the data in one or more key columns. The
latter distributes the data independently of data values. Among the keyless
algorithms are Round Robin, Random, Entire, and Same.
The primary keyed partitioning algorithm is Hash. This algorithm maps data values
in one or more columns to partition numbers. Every occurrence of the same data
value in the key column is guaranteed to go into the same partition. For example,
suppose the key column is State and that there are multiple rows of data with the
same value “CA” in the key column. All of these rows will go into the same partition.
We do not know which one, but we know wherever one goes, the others will go too.
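The difference is easy to see in a small sketch. The following Python fragment (illustrative only, not DataStage internals) contrasts keyless round robin assignment with keyed hash assignment:

    # Illustrative sketch - not DataStage internals.
    def round_robin_partition(row_index, num_partitions):
        # Keyless: depends only on arrival order, never on data values
        return row_index % num_partitions

    def hash_partition(key_value, num_partitions):
        # Keyed: equal key values always map to the same partition.
        # (Python's hash() is salted per process, but within one run
        # equal keys always land together, which is the guarantee.)
        return hash(key_value) % num_partitions

    states = ["CA", "MA", "CA", "NY", "CA"]
    print([round_robin_partition(i, 3) for i in range(len(states))])  # [0, 1, 2, 0, 1]
    print([hash_partition(s, 3) for s in states])  # every "CA" row gets the same number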


Round Robin and Random partitioning


• Keyless partitioning methods
• Rows are evenly distributed across partitions
 Good for initial import of data if no other partitioning is needed
 Useful for redistributing data
• Low overhead
• Round Robin assigns rows to partitions like dealing cards
• Random has slightly higher overhead, but assigns rows in a non-deterministic fashion between job runs

[Diagram: an incoming stream of rows (..., 8, 7, 6, 5, 4, 3, 2, 1, 0) is dealt round robin across three partitions: partition 0 receives rows 0, 3, 6; partition 1 receives rows 1, 4, 7; partition 2 receives rows 2, 5, 8.]


Round Robin and Random partitioning


The diagram illustrates the Round Robin partitioning method. Round Robin
assigns rows to partitions like dealing cards. The first row goes to the first partition,
the second goes to the second partition, and so on. The main advantage of using
the Round Robin partitioning algorithm is that it evenly distributes the data across
all partitions. As mentioned earlier, this yields the best performance.
Random has a similar result of more-or-less evenly distributing the rows (although
not perfectly of course). But there is no fixed ordering of the rows into the partitions.
For certain initial sets of data, this might be desirable. Random has slightly more
overhead than Round Robin.


Entire partitioning
• Each partition gets a complete copy of the data
 May have performance impact because of the duplication of data
• Entire is the default partitioning algorithm for Lookup stage reference links
 On SMP platforms, Lookup stage uses shared memory instead of duplicating the entire set of reference data
 On Grid platforms data duplication will occur

[Diagram: every row of the incoming stream (..., 3, 2, 1, 0) is copied to all three partitions.]


Entire partitioning
The diagram illustrates the Entire partitioning method. Each partition gets a
complete copy of all the data. Entire is the default partitioning algorithm for Lookup
reference links. This ensures that the search for a matching row in the lookup table
will always succeed, if a match exists. The row cannot be “hiding” in another
partition, since all the rows are in all the partitions.


Hash partitioning
• Keyed partitioning method
• Rows are distributed according to the values in key columns
 Guarantees that rows with same key values go into the same partition
 Needed to prevent matching rows from “hiding” in other partitions
 Data may become unevenly distributed across the partitions depending on the frequencies of the key column values
• Selected by default for Aggregator, Remove Duplicates, Join stages

[Diagram: rows with key column values ..., 0, 3, 2, 1, 0, 2, 3, 2, 1, 1 are hashed into three partitions; every 0 and 3 lands in partition 0, every 1 in partition 1, and every 2 in partition 2.]


Hash partitioning
For certain stages (Remove Duplicates, Join, Merge) to work correctly in parallel,
Hash - or one of the other similar algorithms (Range, Modulus) - is required. The
default selection Auto selects Hash for these stages.
The diagram illustrates the Hash partitioning method. Here the numbers are no
longer row identifiers, but the values of the key column. Hash guarantees that all
the rows with key value 3, for example, end up in the same partition.
Hash does not guarantee “continuity” between the same values. Notice in the
diagram that there are zeros separating some of the threes.
Hash also does not guarantee load balance. Some partitions may have many more
rows than others. Make sure to choose key columns that have enough different
values to distribute the data across the available partitions. Gender, for example,
would be a poor choice of a key. All rows would go into just a few partitions,
regardless of how many partitions are available.


Modulus partitioning
• Keyed partitioning method
• Rows are distributed according to the values in one numeric key column
 Uses modulus: partition = MOD(key_value, number of partitions)
• Faster than Hash
• Logically equivalent to Hash

[Diagram: rows with key values ..., 0, 3, 2, 1, 0, 2, 3, 2, 1, 1 are assigned by modulus across three partitions: 0s and 3s go to partition 0, 1s to partition 1, 2s to partition 2.]


Modulus partitioning
Modulus functions the same as Hash. The only difference is that it requires the key
column to be numeric. Because the key column is restricted to numeric types, the
algorithm is somewhat faster than Hash.
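For example, with a 4-node configuration file, a row whose key value is 14 is assigned to partition MOD(14, 4) = 2, and every other row with key value 14 lands in that same partition.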


Auto partitioning
• DataStage inserts partition operators as necessary to ensure
correct results
 Generally chooses Round Robin or Same
 Inserts Hash on stages that require matched key values
(Join, Merge, Remove Duplicates)
 Inserts Entire on Lookup stage reference links
• Since DataStage has limited awareness of your data and business
rules, you may want to explicitly specify Hash or other partitioning
 DataStage has no visibility into Transformer logic
 DataStage may choose more expensive partitioning algorithms than you
know are needed
− Check the Score in the job log to determine the algorithm used


Auto partitioning
Auto is the default choice of stages. Do not think of Auto, however, as a separate
partitioning algorithm. It signals that DataStage is to choose the specific algorithm.
DataStage’s choice is generally based on the type of stage.
Auto generally chooses Round Robin when going from sequential to parallel
stages. It generally chooses Same when going from parallel to parallel stages. It
chooses the latter to avoid unnecessary repartitioning, which reduces performance.
Since DataStage has limited awareness of your data and business rules, best
practice is to explicitly specify Hash partitioning when needed, that is, when
processing requires groups of related records.


Partitioning requirements for related records


• Misplaced records
 Using Aggregator stage to sum customer sales by customer number
 If there are 25 customers, 25 records should be output
 But suppose records with the same customer numbers are spread across
partitions
− This will produce more than 25 groups (records)
 Solution: Use Hash partitioning algorithm
• Partition imbalances
 If all the records are going down only one of the nodes, then the job is in
effect running sequentially


Partitioning requirements for related records


Choose the right partitioning algorithm to avoid misplaced records and partition
imbalances, as described here.
Partitioning imbalances occur when the number of records going down some of the available partitions far exceeds the number going down others. Processing the partitions with the most records obviously takes longer than processing the partitions with fewer records. The crucial point to realize is that the total run time of the job is the time it takes to process the slowest partition. That is, the job does not finish until all partitions are finished.
The problem of misplaced records occurs when the total set of records needed to perform a certain calculation is not available within a single partition. That is, some of the records are in other partitions. Instead of a single calculation over all the records for customer X, there are multiple calculations for customer X, one for each partition that has customer X records. To avoid this, all of the customer X records have to be in one, and only one, partition.
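A small sketch makes the misplaced-records problem concrete. The following Python fragment (illustrative only, not DataStage code; the customer IDs and amounts are made up) shows how keyless partitioning splits one customer's records into partial groups, while hash partitioning keeps them together:

    # Illustrative sketch - not DataStage code.
    from collections import Counter

    rows = [("custA", 10), ("custB", 5), ("custA", 7), ("custA", 3)]

    # Round robin spreads custA's records across both partitions ...
    partitions = [rows[0::2], rows[1::2]]
    sums = [Counter() for _ in partitions]
    for counter, part in zip(sums, partitions):
        for cust, amount in part:
            counter[cust] += amount
    print(sums)  # custA appears in BOTH partitions: two partial groups

    # ... whereas hash partitioning keeps each customer's records together
    hashed = [[], []]
    for cust, amount in rows:
        hashed[hash(cust) % 2].append((cust, amount))
    print(hashed)  # all custA records end up in one partition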


Partition imbalances example

• Same key values are assigned to the same partition
• Hash on LName, with 2-node configuration file

Source Data:

ID LName FName Address
1 Ford Henry 66 Edison Avenue
2 Ford Clara 66 Edison Avenue
3 Ford Edsel 7900 Jefferson
4 Ford Eleanor 7900 Jefferson
5 Dodge Horace 17840 Jefferson
6 Dodge John 75 Boston Boulevard
7 Ford Henry 4901 Evergreen
8 Ford Clara 4901 Evergreen
9 Ford Edsel 1100 Lakeshore
10 Ford Eleanor 1100 Lakeshore

Partition 0:

ID LName FName Address
5 Dodge Horace 17840 Jefferson
6 Dodge John 75 Boston Boulevard

Partition 1:

ID LName FName Address
1 Ford Henry 66 Edison Avenue
2 Ford Clara 66 Edison Avenue
3 Ford Edsel 7900 Jefferson
4 Ford Eleanor 7900 Jefferson
7 Ford Henry 4901 Evergreen
8 Ford Clara 4901 Evergreen
9 Ford Edsel 1100 Lakeshore
10 Ford Eleanor 1100 Lakeshore

Partition imbalances example


This is an example of a partition imbalance of rows down different partitions.
Partition distribution matches source data distribution. In this example, the low
number of distinct Hash key values limits the benefit of parallelism! The job will not
finish until all the rows in partition 1 are processed. In effect, this job will not run
much faster than if it were running sequentially, with all rows in a single partition.


Partitioning / Collecting link icons

[Graphic: a job with a “fan out” icon on one link, indicating that the data is being partitioned, and a “fan in” icon on another link, indicating that the data is being collected.]


Partitioning / Collecting link icons


This graphic highlights the partitioning icons on the links of a job.
The “fan out” icon (on the left) indicates that the data is being partitioned. That is,
the data is moving from one node (partition) to multiple nodes (partitions). The “fan
in” icon indicates that the data is being collected. That is, the data is moving from
multiple nodes to a single node. The particular algorithm that is being used for
partitioning / collecting is not indicated.


More partitioning icons

[Graphic: a job showing the Same partitioner icon, the Auto partitioner icon, and a “butterfly” icon that indicates repartitioning.]


More partitioning icons


This graphic highlights more partitioning icons in a job.
Some icons indicate the partitioning algorithm that is being used. Here icons
indicating Auto and Same are highlighted. The “butterfly” icon indicates that re-
partitioning is occurring. That is, rows of data in some partitions are moving to other
partitions. This is something to watch out for. Data moving across partitions can
impact performance, especially on a GRID, where repartitioned data travels across
a network.


Specify a partitioning algorithm

[Graphic: the Input > Partitioning tab of a stage, with callouts for selecting key columns, the Partition type label, and the algorithm list.]


Specify a partitioning algorithm


This graphic displays the Input > Partitioning tab in an example stage. The
partitioning algorithms from which you can choose are displayed.
If you select a keyed partitioning algorithm (for example, Hash), then you need to
select the column or columns that make up the key.
You select both partitioning and collecting algorithms on the Input > Partitioning
tab. How can you tell whether the stage is partitioning or collecting? The words just
above the list indicate this. If you see Partition type as opposed to Collector type,
you know the stage is partitioning.


Specify a collecting algorithm

[Graphic: the Input > Partitioning tab of a stage that is collecting, with callouts for selecting key columns and the Collector type label.]


Specify a collecting algorithm


This graphic displays the Input > Partitioning tab in an example stage. The
collecting algorithms from which you can choose are listed. Notice the words
Collector type above the list, indicating that the stage is collecting, rather than
partitioning.


Configuration file
• Determines the number of nodes (partitions) the job runs on
• Specifies resources that can be used by individual nodes for:
 Temporary storage
 Memory overflow
 Data Set data storage
• Specifies “node pools”
 Used to constrain stages (operators) to use certain nodes
• The setting of the environment variable $APT_CONFIG_FILE determines which configuration file is in effect during a job run
 If you add $APT_CONFIG_FILE as a job parameter you can specify at runtime which configuration file a job uses


Configuration file
The configuration file determines the number of nodes (partitions) a job runs on. The
configuration in effect for a particular job run is the configuration file currently
referenced by the $APT_CONFIG_FILE environment variable. This variable has a
project default or can be added as a job parameter to a job.
In addition to determining the number of nodes, the configuration file specifies
resources that can be used by the job on each of the nodes. These resources
include temporary storage, storage for data sets, and temporary storage that can be
used when memory is exhausted.


Example configuration file

[Graphic: a two-node configuration file, with callouts for the node names and the node resources.]


Example configuration file


This graphic displays an example configuration file with two nodes. The node names
are user specified. Notice the resource entries for each node. These specify
resources that can be used by the job for stages running on the node.
In the job log, open the message labeled main_program: APT configuration file…
to display the configuration file used by the job during that job run.
The fastname entry indicates the network name of the computer system on which
the node exists. In this example, both nodes exist on EDSERVER.


Adding $APT_CONFIG_FILE as a job parameter

[Graphic: the Parameters tab, with a callout for adding the $APT_CONFIG_FILE environment variable.]


Adding $APT_CONFIG_FILE as a job parameter


This graphic shows the Parameters tab in the Job Properties window for an open
job in Designer.
If you add the environment variable $APT_CONFIG_FILE as a job parameter, you
can select at runtime the configuration file the job is to use. If not added, the job will
use the default configuration file specified for the project.


Editing configuration files


• Click Tools > Configurations to open the editor
• Use to create and edit configuration files


Editing configuration files


This graphic shows the Configuration File editor in Designer. Click
Tools > Configurations to open the editor. Here you can optionally create, view,
and edit available configuration files.
When Information Server is installed, a default configuration file is created. You can
create additional configuration files that can be selected for the
$APT_CONFIG_FILE environment variable.
It is easy to add a node to a configuration file. Just copy one of the existing nodes
and then change the node name. Then modify any resources or other entries as
required for the new node.


Parallel job compilation


• What gets generated
• OSH: A kind of script
• OSH represents the design data flow and stages
 Stages are compiled into OSH operators
• Transform operator for each Transformer
 A custom operator built during the compile
 Compiled into C++ and then to corresponding native operators
− Thus a C++ compiler is needed to compile jobs with a Transformer stage

[Diagram: the Designer client compiles the job design on the DataStage server, producing the executable job and its Transformer components.]


Parallel job compilation


When you click the Compile button for a job, OSH (Orchestrate Shell Script) is
generated. This is a script file that can be executed by the DataStage parallel
engine. The OSH contains operators that correspond to stages on the diagram.
The graphic illustrates how for each Transformer stage in a job, the compile process
builds a customized OSH operator. First it generates C++ source code for the
operator and then it compiles the C++ source code into an executable OSH
operator. This explains why DataStage requires a C++ compiler on the system in
which it is installed. The C++ compiler is not needed to run DataStage jobs. It is
needed to compile DataStage parallel jobs containing Transformer stages.


Generated OSH

[Graphic: generated OSH, with callouts for a stage name, an operator, and a schema; the “OSH viewable” option is enabled.]

OSH is visible in:
- Job Properties window
- Job log
- View Data window
- Table definitions


Generated OSH
You can view the generated OSH in DataStage Designer on the Job Properties
Generated OSH tab. This displays the OSH that is generated when the job is
compiled. It is important to note, however, that this OSH may go through some
additional changes before it is executed.
The left graphic shows the generated OSH in the Job Properties window. In order to view the generated OSH, the view OSH option must be turned on in Administrator, as shown in the graphic at the top right.


Stage-to-operator mapping examples


• Sequential File stage
 Used as a Source: import operator
 Used as a Target: export operator
• Data Set stage: copy operator
• Sort stage: tsort
• Aggregator stage: group operator
• Row Generator stage: generator operator
• Transformer stage: custom operator labeled with word ‘transform’ in
the name


Stage-to-operator mapping examples


When the OSH is generated, stages on the GUI canvas get mapped to OSH
operators. Here some examples are listed.
The stages on the diagram do not necessarily map one-to-one to operators. For
example, the Sequential File stage, when used as a source, is mapped to the
import operator. When the same stage used as a target, it is mapped to the export
operator.
The converse is also true. Different types of stages can be mapped to the same
operator. For example, the Row Generator and Column Generator stages are both
mapped to the generator operator.
As previously mentioned, the Transformer stage operator is mapped to a custom
operator. You can identify this operator in the OSH by the word ‘transform’ in its
name.
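As a schematic illustration (the stage and link names here are hypothetical, and real generated OSH contains more options), the OSH for a source Sequential File stage has roughly this shape:

    #### STAGE: SrcFile
    ## Operator
    import
    ## Operator options
    -schema record ( Selling_Group_Code: int32; )
    -file 'C:/CourseData/DSEss_Files/Selling_Group_Mapping.txt'
    ## Outputs
    0> [] 'SrcFile:lnk_out.v'
    ;

The import operator reads the flat file and converts it to the internal data set format; the numbered output line identifies the virtual data set that carries the rows to the next operator.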


Job Score
• Generated from the OSH along with the configuration file used to run
the job
• Think of “Score” as in musical score, not game score
• Assigns nodes (partitions) to each OSH operator
• Specifies additional OSH operators as needed
 tsort operators, when required by a stage
 Partitioning algorithm operators explicitly or implicitly specified (Auto)
 Adds buffer operators to prevent deadlocks
• Defines the actual job processes
• Useful for debugging and performance tuning


Job Score
The Job Score is generated from the OSH along with the configuration file used to
run the job. Since it is not known until runtime which configuration file a job will use,
the Job Score is not generated until runtime. Generating the Score is part of the
initial overhead of the job.
The Score directs which operators run on which nodes. This will be a single node for
(stages) operators running in sequential mode. This can be multiple nodes for
operators running in parallel mode.
The Score also adds additional operators as needed. For example, some stages,
such as the Join stage, require the data to be sorted. The Score will add tsort
operators to perform these sorts. Buffer operators are also added as necessary to
buffer data going into operators, where deadlocks can occur.
Experienced DataStage developers frequently look at the Score to gather
information useful for debugging and performance tuning.


Viewing the Score


• Set $APT_DUMP_SCORE to output the Score to the job log
• To identify the Score message, look for “main_program: This step …”
 The word ‘Score’ is not used

[Graphic: a Score message in the job log, showing operators with node assignments.]


Viewing the Score


The Score is not viewable until the job is run. One of the Reporting environment variables determines whether it is displayed in the job log. To identify the Score message, look for the message titled “main_program: This step …”
The graphic displays an example Score. Notice how operators are assigned to
nodes. Notice that op0 is assigned to a single node (node1). This was generated
from a Sequential File stage running in sequential mode. op2, generated from a
Copy stage, is assigned to two nodes.
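As a schematic illustration (the operator and node names are illustrative), the operator section of a Score message looks roughly like this:

    It has 2 operators:
    op0[1p] {(sequential SrcFile)
        on nodes (
          node1[op0,p0]
        )}
    op1[2p] {(parallel Copy_2)
        on nodes (
          node1[op1,p0]
          node2[op1,p1]
        )}
    It runs 3 processes on 2 nodes.

Here op0 runs as one sequential process on node1, while op1 runs as two parallel processes, one on each node.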


Checkpoint
1. What file defines the degree of parallelism a job runs under?
2. Name two partitioning algorithms that partition based on key values.
3. Which partitioning algorithms produce even distributions of data in
the partitions?
4. What does a job design compile into?
5. What gets generated from the OSH and the configuration file used to
run the job?


Checkpoint


Checkpoint solutions
1. Configuration file.
2. Hash, Modulus.
3. Round Robin, Entire, Random (maybe).
4. OSH script.
5. Score.


Checkpoint solutions


Demonstration 1
Partitioning and collecting

• In this demonstration, you will:


 View partitioning icons
 Set partitioning algorithms in stages
 View the OSH in the job log
 View the configuration file in the job log
 View the Score in the job log


Demonstration 1: Partitioning and collecting


Demonstration 1:
Partitioning and collecting

Purpose:
In this exercise, you will determine how data gets put into the nodes
(partitions) of a job by setting partitioning and collecting algorithms in each
stage.
Windows User/Password: student/student
DataStage Client: Designer
Designer Client User/Password: student/student
Project: EDSERVER/DSProject
NOTE:
In this demonstration and other demonstrations in this course there may be tasks that start with jobs you have been instructed to build in previous tasks. If you were not able to complete the earlier job, you can import it from the
DSEssLabSolutions_V11_5_1.dsx file in your C:\CourseData\DSEss_Files\dsx files
directory. This file contains all the jobs built in the demonstrations for this course.
Steps:
1. Click Import, and then click DataStage Components.
2. Select the Import selected option, and then select the job you want from the list
that is displayed.
If you want to save a previous version of the job, be sure to save it under a new name
before you import the version from the demonstration solutions file.
Task 1. Partitioning and collecting.
1. Save your CreateSeqJobParam job as CreateSeqJobPartition.
Note the icon on the input link to the target stage (fan-in). It indicates that the
stage is collecting the data.


2. Open up the target Sequential File stage to the Input > Partitioning tab.
Note under the Partitioning / Collecting area, that it indicates 'Collector type' -
and that the collecting algorithm '(Auto)' is selected.

3. Compile and run your job.


4. View the data in the target stage.
5. Open up the target Sequential stage to the Properties tab.
Instead of writing to a single file, you want to write to 2 files that have different
names. You want the files in your DSEss_Files\Temp directory.
6. Click the Target folder.
7. Under the Available properties to add panel, click File.
8. For the File properties, add the directory path and the #TargetFile# parameter
for the second file.


9. Append something to the end of each path to distinguish the two file names, for example, 1 and 2, so that the names of the two files are different.

10. Click on the Partitioning tab.


Notice that the stage is no longer collecting, but now is partitioning, because it is
writing the data to the two files in separate, parallel streams of output data. You
can confirm this by noting the words above the Partitioning / Collecting drop
down. If it says Partition type, then the stage is partitioning. If it says
Collector type, it is collecting.


11. Click OK to close the stage.


Notice that the partitioning icon has changed. It no longer indicates collecting.
The icon you see now indicates Auto partitioning.

12. Now open the target Sequential File stage again, and change Partition type to
Same.


13. Close the stage.


Notice how the partitioning icon has changed.

14. Compile and run your job.


15. View the job log. Notice how the data is exported to the two different partitions
(0 and 1). 24 records go into one partition (partition 0) and 23 records go into
the other (partition 1).


Task 2. View the OSH, Configuration File, and Score.


1. In the job log for the last run of the CreateSeqJobPartition job, open the
message labeled OSH script.
This displays the OSH script that was generated when the job was compiled.

2. In the OSH notice the following:


• Operators: These correspond to stages in the job design.
• Schemas: These correspond to table definitions in the stages.
• Properties: These correspond to properties defined on the stage Properties
tab.


3. In the log open up the message labeled


main_program: APT configuration file.

4. Notice the following in the configuration file:


• The number of nodes and their names. In this example, there are two
nodes labeled “node1” and “node2”
• Resource disks used by each node. The entries labeled “resource disk”.
This identifies disk space used to store the data in data sets.
• Resource scratch disks used by each node. These store temporary files
created during a job run, such as those used in sorting.


5. In the log, open up the message labeled main_program: This step has X datasets (where 'X' represents a number).
This is the Score.
The score is divided into two sections. The second section lists the nodes each
operator runs on. For example, op0 runs on just the single node, node1.
Notice that op3 (…TargetFile) runs on two nodes.

Results:
You determined how data gets put into the nodes (partitions) of a job by
setting partitioning and collecting algorithms in each stage.


Unit summary
• Describe parallel processing architecture
• Describe pipeline parallelism
• Describe partition parallelism
• List and describe partitioning and collecting algorithms
• Describe configuration files
• Describe the parallel job compilation process
• Explain OSH
• Explain the Score


Unit summary

Unit 8 Combine data

IBM Infosphere DataStage v11.5

Unit objectives
• Combine data using the Lookup stage
• Define range lookups
• Combine data using Merge stage
• Combine data using the Join stage
• Combine data using the Funnel stage


Unit objectives
This unit discusses the main stages that can be used to combine data. Previous units discussed some “passive” stages for accessing data (Sequential File stage, Data Set stage). In this unit you begin working with some “active” processing stages.


Combine data
• Common business requirement
 Records contain columns that reference data in other data sources
− An order record contains customer IDs that reference customer information in the CUSTOMERS table or file
 Records from two or more different sources are combined into one longer
record based on a matching key value
− An employee’s payroll information in one record is combined with the employee’s
address information from another record
• DataStage has a number of different stages that can be used to
combine data:
 Join
 Merge
 Lookup
• Combine data from one or more input links which can contain data
from relational tables, files, or upstream processing

Combine data
Combining data is a common business requirement. For example, records of data in
one table or file might contain references to data in another table or file. The data is to
be combined so that individual records contain data from both tables.
DataStage has a number of different stages that can be used to combine data: Join,
Merge, and Lookup. You can generally accomplish the same result using any one of
these stages. However, they differ regarding their requirements and individual
properties.
It is important to note that these stages combine data streams or links of data. The
source of the data is not restricted. You can combine data from relational tables, flat
files, or data coming out of another processing stage, such as a Transformer.


Lookup, Join, Merge stages


• These stages combine two or more input links
 Data is combined by designated key columns
• These stages differ mainly in:
 Memory usage
 Stage properties
 Stage requirements
− Whether data has to be sorted
− Whether data has to be de-duplicated
 How match failures are handled


Lookup, Join, Merge stages


These stages have similar functionality. So, which do you use? This depends on
several factors, listed here. The main differences are regarding memory usage (some
of these stages need more memory than others), stage requirements (some require
that the input data is sorted), and stage properties (one of these stages may have a
property that is useful to you in the given context).
All of these stages combine data based on matching key column values.


Lookup Stage features


• One stream input link (source link)
• One or more input reference links
• One output link
• Optional reject link
 Captures match failures
• Lookup failure options
 Continue, Drop, Fail, Reject
• Can optionally return multiple matching rows from one input
reference link
• Builds an indexed file structure in memory from the reference link
data
 Indexed by the lookup key
 Must have enough memory to hold the reference data or the data spills
over to disk


Lookup Stage features


This lists the main features of the Lookup stage. The Lookup stage can have only a
single stream input link and a single stream output link. Optionally, an additional output
link, called a reject link, can be added to capture lookup match failures.
Links from lookup tables, files, or other processing stages come into the Lookup stage
as input links and are called reference links. They are drawn with broken lines to
distinguish them from the main stream input link.
Before the Lookup stage processes its first row, all the reference data is stored in
memory in an indexed structure, so no physical file reads are necessary to perform a
lookup at the time a row is read. In this way, lookups can be performed quickly.
However, there must be enough memory to hold all of the reference data, or the data
will be written to disk.
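The in-memory indexed structure can be pictured with a small Python sketch (illustrative only, not DataStage code or internals; the column names and data are invented). A dictionary plays the role of the index, so each source row costs one probe rather than a file scan:

    # Build the lookup index once, before any source rows are processed.
    # (Illustrative sketch only; not DataStage internals.)
    reference_rows = [
        {"Item": "A100", "Description": "Widget"},
        {"Item": "B200", "Description": "Gadget"},
    ]
    index = {}
    for row in reference_rows:
        # Allow multiple reference rows per key value.
        index.setdefault(row["Item"], []).append(row)

    # Each incoming source row is then matched with one dictionary probe.
    source_row = {"Item": "A100", "Qty": 5}
    matches = index.get(source_row["Item"], [])
    print(matches)  # [{'Item': 'A100', 'Description': 'Widget'}]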


Lookup types
• Equality match
 Match values in the lookup key column of the reference link to selected
values in the source row
 Return matching row or rows
 Supports exact match or caseless match
• Range match
 Two columns define the range
 A match occurs when a value is within the specified range
 Range can be on the source input link or on the reference link
 Range matches can be combined with equality matches
− Lookup records for the employee ID within a certain range of dates


Lookup types
There are two general types of lookups that you can perform using the Lookup stage:
equality matches and range lookups. Equality matches compare two or more key
column values for equality. An example is matching a customer ID value in a stream
link column to a value in a column in the reference link.
A range match compares a value in a column in the stream link with the values in two
columns in the reference link. The match succeeds if the value is between the values in
the two columns. Range matches can also compare a single value in a reference link to
two columns in the stream link.
Range lookups can be combined with equality lookups. For example, you can look for
matching customer ID within a range of dates.
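As a hedged sketch of the two match types (plain Python, not DataStage code; the column names and data are invented for illustration), the equality part compares key values directly, while the range part tests whether a value falls between two reference columns:

    # Illustrative only: an equality match on EmpID combined with a range
    # match on date, as in "look up records for the employee ID within a
    # certain range of dates". ISO date strings compare correctly as text.
    reference = [
        {"EmpID": 7, "StartDate": "2015-01-01", "EndDate": "2015-07-01", "Dept": "A"},
        {"EmpID": 7, "StartDate": "2015-07-01", "EndDate": "2016-01-01", "Dept": "B"},
    ]

    def lookup(emp_id, date):
        return [r for r in reference
                if r["EmpID"] == emp_id                      # equality match
                and r["StartDate"] <= date < r["EndDate"]]   # range match

    print(lookup(7, "2015-08-15"))  # matches the second row (Dept B)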


Equality match Lookup stage example

(Diagram: job with a Lookup stage; the source (stream) link and the reference link are labeled)

Equality match Lookup stage example


This slide displays an example of a DataStage job with a Lookup stage (center stage).
In this example, the job uses an equality match to determine which row or rows to
extract from the reference link, which in this case is a link to a sequential file (Items).
The Sequential File stage as the source of the reference data is just an example. There
are no restrictions on the reference link data. It can flow from a relational table, a
sequential file, or from more complex processing.
Notice that the stream input and output links have solid lines. The reference link has a
dotted line.


Lookup stage with an equality match


(Screenshot: inside of the Lookup stage, with callouts for the source link columns, lookup constraints, output columns, the lookup match, the reference link columns, and the column names and definitions)

Lookup stage with an equality match


This slide shows the inside of the Lookup stage and highlights its main features.
For an equality or caseless match lookup, one or more columns in the reference link
are selected as keys (see lower left panel). Columns from the source link are matched
to the key columns using drag and drop. To specify an equality match, select the equal
sign (=) from the Key Type cell of the reference link panel. To specify a caseless
match, select Caseless from the Key Type box of the reference link panel.
Output columns are specified in the top, right panel. Columns from the source and
reference link are dragged to the front of these columns to specify the values to be
mapped to the output columns.
The column definitions of the columns listed in the link windows are specified in the tabs
at the bottom of the window.


Define the Lookup key


• Drag columns from the source input link to the cell to the left of the
matching reference key columns
 The Key checkbox of the reference link column is checked
• Select the Key type
 Equality
 Caseless

(Screenshot callouts: Drag this column; Key column; Equality match; Lookup key column)

Define the lookup key


This slide shows the left side of the Lookup stage where the equality match is specified.
In this example, the Items window lists the reference link columns and the Warehouse
window lists the stream link columns. First you need to select the key column or
columns from the Items window and specify the type of match in the Key Type cell to
its left.
To specify the lookup key matching columns, drag the key column from the stream link
(here, the Item column of the Warehouse link) to the matching key column in the
reference link (the Item column of the Items link).


Specify the output columns


• Drag columns from the
reference link or stream link
on the left side over to the
right side
• You can select one or more
columns to drag
 Dragging the link header drags
all the columns
• Optionally, rename output link
columns in the bottom
window
• Optionally reorder output
columns using drag and drop

(Screenshot callout: Renamed column)

Specify the output columns


Output mappings are specified on the right side of the Lookup stage window. Input
columns that you want to send out the stage can be dragged across from the left
windows to the right window. In this example, all of the columns from the Warehouse
link have been dragged across, along with the Description column from the Items link.
As mentioned earlier, the tabs at the bottom provide the metadata for the columns in
the link windows. In this example, the name of the Description column has been
changed to ItemDescription. This column also has been moved to third in the output
list.


Lookup failure actions


• If the lookup fails to find a matching key column, one of several
actions can be taken:
 Fail (Default)
− Stage reports an error and the job fails
 Drop
− Input row is dropped
 Continue
− Input row is transferred to the output. Reference link columns are filled with null
or default values
 Reject
− Input row sent to a reject link
− Stage must have a reject link


Lookup failure actions


Click the Lookup Constraints icon in the top left corner of the Lookup stage to specify
the lookup failure actions. By default, the lookup failure action is Fail, that is, the job
fails (aborts). For many purposes, this action is too drastic.
Rather than fail the job, you can specify that the lookup failure row is to be dropped,
rejected, or sent out the stage for further processing.
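To make the four failure actions concrete, here is a minimal Python sketch (an illustrative model only, not DataStage code; it assumes a single reference link with one key column, and uses the citizen data from the behavior examples later in this unit):

    # Illustrative model of the Continue / Drop / Fail / Reject actions.
    reference = {"M_B_Dextrous": "Nasdaq", "Righty": "NYSE"}
    source = [{"Revolution": 1789, "Citizen": "Lefty"},
              {"Revolution": 1776, "Citizen": "M_B_Dextrous"}]

    def run_lookup(action):
        output, rejects = [], []
        for row in source:
            exchange = reference.get(row["Citizen"])
            if exchange is not None:
                output.append({**row, "Exchange": exchange})
            elif action == "Continue":
                output.append({**row, "Exchange": None})  # null/default fill
            elif action == "Drop":
                pass                                      # row is discarded
            elif action == "Reject":
                rejects.append(row)                       # goes out the reject link
            elif action == "Fail":
                raise RuntimeError("Lookup failed; the job aborts")
        return output, rejects

    print(run_lookup("Continue"))  # both rows out; Lefty gets Exchange=None
    print(run_lookup("Drop"))      # only the matching row is output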


Specifying lookup failure actions

(Screenshot: Lookup Stage Conditions window, with callouts: select a reference link to return multiple rows; select the lookup failure action)

Specifying lookup failure actions


Click the Lookup Constraints icon in the top left corner of the Lookup stage to open
the Lookup Stage Conditions window. On the right side, select the Lookup Failure
action.
By default, if there is more than one matching row, only one match is returned. You can
select a reference link from which all matching rows should be returned; if there is more
than one reference link, only one of them can be selected for this. If this option is
selected, a single input row going into the Lookup stage can result in multiple rows
going out of the stage, one for each match.


Lookup stage with reject link

(Diagram: job with a reject link from the Lookup stage; Reject is selected as the lookup failure action)

Lookup stage with reject link


This slide shows a job with a reject link from a Lookup stage. This requires that Reject
is selected as the Lookup Failure Action. (See previous page.) Any input rows that
have no matching reference row will be sent out this link. In this example, the rows are
sent to a Peek stage, but any passive stage or series of processing stages can be used
to process the rejects.


Lookup stage behavior

Source link:                  Reference link:

Revolution  Citizen           Citizen       Exchange
1789        Lefty             M_B_Dextrous  Nasdaq
1776        M_B_Dextrous      Righty        NYSE

(Lookup key column: Citizen)

Lookup stage behavior


This example and the following illustrate Lookup stage behavior for different lookup
failure actions. In this example, the Citizen column in the source link is matched to the
Citizen column in the reference link. For the first source row, the lookup will not find a
match (because there is no Lefty row in the reference link data). For the second, it will
find a match (the first row with M_B_Dextrous).
The next page illustrates the output from the Lookup stage.


Lookup stage output

Output of Lookup with Continue option:

Revolution  Citizen       Exchange
1789        Lefty         (empty string or null)
1776        M_B_Dextrous  Nasdaq

Output of Lookup with Drop option:

Revolution  Citizen       Exchange
1776        M_B_Dextrous  Nasdaq

Lookup stage output


This shows the results, depending on which Lookup option has been selected.
For the first source row (1789), the lookup fails to find a match. Since Continue is the
lookup failure option, the row is output. The Exchange column is populated with null
(if the column is nullable) or the empty string (if the column is not nullable).
For the second source row (1776), the lookup finds a match, so the Exchange column
gets a value from the lookup file.
If Drop is the lookup failure action, the first row is dropped, because there is no match.


Demonstration 1
Using the Lookup stage


Demonstration 1: Using the Lookup stage


Demonstration 1:
Using the Lookup stage

Purpose:
You will create lookups using the Lookup stage, identify how lookup failures
are handled, and finally capture lookup failures with a reject link.
Windows User/Password: student/student
DataStage Client: Designer
Designer Client User/Password: student/student
Project: EDSERVER/DSProject
NOTE:
In this demonstration and other demonstrations in this course there may be tasks that
start with jobs you have been instructed to build in previous tasks. If you were not able
to complete the earlier job, you can import it from the
DSEssLabSolutions_v11_5_1.dsx file in your C:\CourseData\DSEss_Files\dsx files
directory. This file contains all the jobs built in the demonstrations for this course.
Steps:
1. Click Import, and then click DataStage Components.
2. Select the Import selected option, and then select the job you want from the list
that is displayed.
If you want to save a previous version of the job, be sure to save it under a new name
before you import the version from the demonstration solutions file.
Task 1. Look up the warehouse item description
1. Open a new parallel job, and save it under the name LookupWarehouseItem.
2. Add the stages, laying them out as shown, and name them accordingly. The
Lookup stage is found in the Processing section of the Palette.


3. Once all stages are added, add the links - starting from left to right - between
the 3 stages across the bottom of the diagram first. Once the bottom 3 stages
are connected, add the link from the remaining stage to the Lookup stage.
Your results will appear as shown (note the solid versus dashed connectors):

4. From Windows Explorer, locate and open the following file, using Wordpad:
C:\CourseData\DSEss_Files\Warehouse.txt
Note the delimiter in the data - in this case, the pipe (|) symbol.
5. Import the table definition for the Warehouse.txt sequential file to your
_Training > Metadata folder.


6. Click Import, and confirm your settings are as shown below.

7. Click the Define tab, verify your column names appear, and then click OK.
8. Edit the Warehouse Sequential File stage, defining Warehouse.txt as the
source file from which data will be extracted. The format properties identified in
the table definition will need to be duplicated in the Sequential File stage. Be
sure you can view the data. If there are problems, check that the metadata is
correct on both the Columns and the Format tabs.
9. Import the table definition for the Items.txt file.


10. Edit the Items Sequential File stage to extract data from the Items.txt file.
Perform the Load, and confirm your results as shown. Be sure to update the
Quote option to 'single'.

11. Again, be sure you can view the data in the Items stage before continuing.
12. Open the Lookup stage. Map the Item column in the top left pane to the lookup
Item key column in the bottom left pane of the Items table panel, by dragging
one to the other. If the Confirm Action window appears, click Yes to make the
Item column a key field.


13. Drag all the Warehouse panel columns to the Warehouse_Items target link on
the right.
14. Drag the Description column from the Items panel to just above the Onhand
target column in the Warehouse_Items panel.
15. On the Warehouse_Items tab at the bottom of the window, change the name
of the Description target column, which you just added, to ItemDescription.

16. Edit your target Sequential stage as needed.


17. Compile and run. Examine the job log. Your job probably aborted. Try to
determine why it failed and think what you might do about it. (You will fix things
in the next task.)


Task 2. Handle lookup failures.


1. Save your job as LookupWarehouseItemNoMatch.
2. Open up the Lookup stage. Click the Constraints icon (top, second from left).
When the lookup fails, specify that the job is to continue.

3. Compile and run. Examine the log. You should not get any fatal errors this time.
4. View the data in the target file. Do you find any rows in the target file in which
the lookup failed? These would be rows with missing item descriptions.
Increase the number of rows displayed to at least a few hundred, if you do not
initially see any missing items. By default, when there is a lookup failure with
Continue, DataStage outputs empty values to the lookup columns. If the
columns are nullable, DataStage outputs NULLs. If the columns are not
nullable, DataStage outputs default values depending on their type.

5. Open up the Lookup stage. Make both the Description column on the left side
and the ItemDescription column on the right side nullable. Now, for non-
matches DataStage will return NULLs instead of empty strings.


6. Since NULLs will be written to the target stage, you will need to handle them.
Open up the target Sequential stage. Replace NULLs by the string
“NOMATCH”. To do this, double-click to the left of the ItemDescription column
on the Columns tab. In the extended properties, specify a null field value of
NOMATCH.

7. Compile and run.


8. View the data in the target Sequential File stage. Run the view with at least 200
rows of data.
9. Click Find. Type NULL in the Find what: box. Select ItemDescription for the
In column: drop down. Click Find Next to locate the first NULL value. Results
will appear similar to below.


Task 3. Add a Reject link.


1. Save your job as LookupWarehouseItemReject.
2. Open up the Lookup stage and, using Constraints, specify that lookup failures are
to be rejected.

3. Close the Lookup stage and then add a rejects link going to a Peek stage to
capture the lookup failures.

4. Compile and run. Examine the Peek messages in the job log to see what rows
were lookup failures.
5. Examine the job log. Notice in the Peek messages that a number of rows were
rejected.
Results:
You created lookups using the Lookup stage, identified how lookup failures
are handled, and finally captured lookup failures with a reject link.


Range Lookup stage job

(Diagram: job with a Lookup stage and a reference link, set up for a range lookup)

Range Lookup stage job


This slide again shows a job with a Lookup stage. In this example, a range lookup will
be specified in the Lookup stage instead of an equality match.


Range on reference link

(Screenshot: source data and reference link data; callouts mark the reference range values, the description to retrieve, and the source values)

Range on reference link


Here, you see the source data and the reference link data. The Item column value in
the source link will be matched to the range specified in the reference link by the
StartItem and EndItem columns. In this example, the first row of the source data will fit
within the “Description A” range. So for the first row, “Description A” will be
returned.


Selecting the stream column

(Screenshot: Lookup stage with the source link and reference link; a callout marks where to double-click to specify the range)

Selecting the stream column


This slide shows the inside of the Lookup stage. Warehouse is the stream link and
Range_Description is the reference link. To specify a range on the reference link, you
first select the Range box next to the key column (Item). Then double-click on the Key
Expression cell on the left of the key column. This opens the Range Expression
Editor window, where you specify the range.


Range expression editor

(Screenshot: Range Expression Editor; callouts: select range columns, select operators)

Range expression editor


This slide shows the Range Expression Editor window. Select the operators and
columns to define the range. In this example, the range expression will be true when
Item is greater than or equal to the StartItem value and less than the EndItem column
value.
Notice here that two separate conditions are conjoined (AND) using a logical operator.


Range on stream link

(Diagram: job with the range on the stream link; callouts mark the source range, the other column values to retrieve, and the reference link key)

Range on stream link


This slide shows a job example where the range is on the stream link instead of the
reference link. Notice that the stream link (the solid line) is coming from the
Range_Description stage at the top. It has two columns, StartItem and EndItem,
which specify the range. The reference link has the Item column that will be matched to
this range.


Specifying the range lookup

(Screenshot callout: select the Range key type)

Specifying the range lookup


Here you see the inside of the Lookup stage. Select Range in the Key Type column
next to Item in the Warehouse reference link. Then double-click on the cell to its left to
open the Range Expression Editor window.


Range expression editor

(Screenshot callout: select range columns)

Range expression editor


This slide shows the Range Expression Editor window. Here, as before, you select
the operators and columns to define the range.


Demonstration 2
Range lookups


Demonstration 2: Range lookups


Demonstration 2:
Range lookups

Purpose:
You want to understand the two types of range lookups better. In order to do so,
you will design a job with a reference link range lookup and a job with a
stream range lookup.
NOTE:
In this demonstration and other demonstrations in this course there may be tasks that
start with jobs you have been instructed to build in previous tasks. If you were not able
to complete the earlier job, you can import it from the
DSEssLabSolutions_v11_5_1.dsx file in your C:\CourseData\DSEss_Files\dsx files
directory. This file contains all the jobs built in the demonstrations for this course.
Steps:
1. Click Import, and then click DataStage Components.
2. Select the Import selected option, and then select the job you want from the list
that is displayed.
If you want to save a previous version of the job, be sure to save it under a new name
before you import the version from the demonstration solutions file.
Task 1. Design a job with a reference link range lookup.
1. Open your LookupWarehouseItem job and save it under the name
LookupWarehouseItemRangeRef. Save in the _Training > Jobs folder.
Rename the stages and links as shown.


2. Import the table definition for the Range_Descriptions.txt sequential file. The
StartItem and EndItem fields should be defined like the Item field is defined in
the Warehouse stage, namely, as VarChar(255).

3. Edit the Range_Description Sequential File stage to read from the
Range_Descriptions.txt file by setting the properties and changing the format
settings appropriately. When loading the new column definitions, delete the
existing columns first. Verify that you can view the data.
4. Open the Lookup stage. Edit the Description column on the left and the
ItemDescription column on the right so that both are nullable.

5. Select the Range checkbox to the left of the Item field in the Warehouse panel
window.


6. Double-click on the Key Expression cell for the Item column to open the
Range Expression editor. Specify that the Warehouse.Item column value is to
be greater than or equal to the StartItem column value and less than the
EndItem column value.

7. Open the Constraints window and specify that the job is to continue if a lookup
failure occurs.
8. Edit the target Sequential File stage. The ItemDescription column in the
Sequential File stage is nullable. Go to the extended properties window for this
column. Replace NULL values by the string NO_DESCRIPTION.
9. Compile and run your job.
10. View the data in the target stage to verify the results.


Task 2. Design a job with a stream range lookup.


This job reads from the Range_Descriptions.txt file. It then does a lookup into the
Warehouse.txt file. For each row read, it selects all the records from the
Warehouse.txt file with items within the range. The appropriate description is
added to each record, which is then written out to a file.
1. Save your job as LookupItemsRangeStream in your _Training > Jobs folder.
2. Reverse the source and lookup links. First make the source link a reference link.
Click the right mouse button and click Convert to reference. Then make the
lookup link a stream link.

3. Open up your Lookup stage. Select the Item column in the Warehouse table as
the key. Specify the Key type as Range.
4. Double-click on the Key Expression cell next to Item. Specify the range
expression.


5. Click the Constraints icon. Specify that multiple rows are to be returned from
the Warehouse link. Also specify that the job is to continue if there is a lookup
failure.

6. Compile and run your job.


7. View the data to verify the results.

Results:
You designed a job with a reference link range lookup and a job with a stream
range lookup.


Join stage
• Four types of joins:
 Inner
 Left outer
 Right outer
 Full outer
• Input link data must be sorted
 Left link and a right link. Which is which can be specified in the stage
 Supports additional “intermediate” links
• Light-weight
 Little memory required, because of the sort requirement
• Join key column or columns
 Column names for each input link must match. If necessary, add a Copy
stage before the Join stage to change the name of one of the key columns


Join stage
Like the Lookup stage, the Join stage can also be used to combine data. It has the
same basic functionality as an SQL join. You can select one of four types of joins: inner,
left outer, right outer, and full outer.
An inner join outputs rows that match.
A left outer join outputs all rows on the left link, whether they have a match on the right
link or not. Default values are entered for any missing values in case of a match failure.
A right outer join outputs all rows on the right link, whether they have a match on the left
link or not. Default values are entered for any missing values in case of a match failure.
A full outer join outputs all rows on the left link and right link, whether they have
matches or not. Default values are entered for any missing values in case of match
failures.
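The four join types can be sketched in Python (an illustrative model assuming a single key column; it is not how the stage is implemented, and unlike the real stage it does not require sorted input). The full outer case is simplified here: the real stage creates renamed leftRec_/rightRec_ key columns, as shown later in this unit.

    # Illustrative model of inner / left outer / right outer / full outer joins.
    left  = [{"Revolution": 1789, "Citizen": "Lefty"},
             {"Revolution": 1776, "Citizen": "M_B_Dextrous"}]
    right = [{"Citizen": "M_B_Dextrous", "Exchange": "Nasdaq"},
             {"Citizen": "Righty", "Exchange": "NYSE"}]

    def join(kind):
        out, matched_right = [], set()
        for l in left:
            hits = [r for r in right if r["Citizen"] == l["Citizen"]]
            for r in hits:
                matched_right.add(id(r))
                out.append({**l, **r})                    # matches: always output
            if not hits and kind in ("left", "full"):
                out.append({**l, "Exchange": None})       # null/default fill
        if kind in ("right", "full"):
            for r in right:
                if id(r) not in matched_right:
                    out.append({"Revolution": None, **r}) # null/default fill
        return out

    for kind in ("inner", "left", "right", "full"):
        print(kind, join(kind))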


Job with Join stage

(Diagram: job with a Join stage and two ordered input links, one designated the left link and the other the right link)

Job with Join stage


This slide displays a simple job with a Join stage. There are two input links. The links
are ordered. One is designated the left link and the other is designated the right link,
which is important when defining left and right outer joins. The stage contains a tab
where this link ordering can be specified. (You cannot tell from the diagram which link is left
and which is right, although this is highlighted in the example.)


Join stage properties

(Screenshot: Join stage Properties tab; callouts: select which link is left / right; column to match; select the join type; select the Key property again if multiple columns make up the join key)

Join stage properties


This slide shows the Properties tab of the Join stage. Here, you specify the join key
columns and the join type. The Link Ordering tab is highlighted.
By default, a single Key property is specified. This allows you to choose one key
column. If the key contains more than one key column, click the Key property in the
Available properties to add window.
The key columns consist of columns from both the left and right links. The column
names must match exactly. Thus, the Item column in the example refers to an Item
column in the left link and an Item column in the right link. If the key column names do
not match exactly, you will need to add a Copy stage before the Join stage to rename
one of the columns, so that they match.


Output Mapping tab


• Drag input columns from the input to the output
• Output link includes columns from both input links
 Item.Description from one input link
 All columns from the other input link


Output Mapping tab


This slide shows the Output>Mapping tab. Here you specify the output column
mappings. The Join stage requires a single output link. Multiple output links are not
supported.


Join stage behavior

Left link (primary input):    Right link (secondary input):

Revolution  Citizen           Citizen       Exchange
1789        Lefty             M_B_Dextrous  Nasdaq
1776        M_B_Dextrous      Righty        NYSE

(Join key column: Citizen)

Join stage behavior


In this and the following pages, examples illustrate the Join stage behavior. In this
example, the Citizen column in the left link is matched to the Citizen column in the
right link. For the first left-link row (Lefty), there is no matching row in the right link.
For the second, there is a matching row (M_B_Dextrous).


Inner join output


• Only rows with matching key values are output

Output of inner join on key Citizen:

Revolution  Citizen       Exchange
1776        M_B_Dextrous  Nasdaq

Inner join output


If an inner join is selected in the stage, only the second row of the left link
(M_B_Dextrous), combined with its matching row in the right link, will be output.


Left outer join output


• All rows from the left link are output. All rows from the right link with
matching key values are output

Revolution  Citizen       Exchange
1789        Lefty         (null or default value)
1776        M_B_Dextrous  Nasdaq

Left outer join output


If a left outer join is selected in the stage, both rows from the left link will be output. The
first row in the left link (Lefty) does not have a matching row in the right link. Therefore
the row's Exchange column, which comes from the right link, is filled in with either null
or a default value, depending on the column type.


Right outer join output


• All rows from the right link are output. All rows from the left link with
matching key values are output

Revolution               Citizen       Exchange
1776                     M_B_Dextrous  Nasdaq
(null or default value)  Righty        NYSE

Right outer join output


If a right outer join is selected in the stage, both rows from the right link will be output.
The first row in the right link (M_B_Dextrous) has a matching row in the left link. The
second row does not. Therefore that row's Revolution column, which comes from the
left link, is filled in with either null or a default value, depending on the column type.


Full outer join


• All rows from the left link are output. All rows from the right link are
output
• Creates new columns corresponding to the key columns of the left
and right links

Revolution  leftRec_Citizen    rightRec_Citizen   Exchange
1789        Lefty              (null or default)  (null or default)
1776        M_B_Dextrous       M_B_Dextrous       Nasdaq
0           (null or default)  Righty             NYSE

Full outer join


This shows the results for a full outer join. It combines the results of both a left outer join
and a right outer join. The Revolution and Exchange columns which exist on just one
link will receive null or default values for non-matches.
Notice that both the right link key columns and the left link key columns will be added to
the output. For non-matching output rows, at least one of these columns will contain null
or default values.


Merge stage
• Similar to Join stage
 Master (stream) link and one or more secondary links
• Stage requirements
 Master and secondary link data must be sorted by merge key
 Master link data must be duplicate-free
• Light-weight
 Little memory required, because of the sort requirement
• Unmatched master link rows can be kept or dropped
• Unmatched secondary link rows can be captured
 One reject link can be added for each secondary link


Merge stage
The Merge stage is similar to the Join stage. It can have multiple input links, one of
which is designated the master link.
It differs somewhat in its stage requirements. Master link data must be duplicate-free, in
addition to being sorted, which was not a requirement of the Join stage.
The Merge stage also differs from the Join stage in some of its properties. Unmatched
secondary link rows can be captured in reject links. One reject link can be added for
each secondary link.
Like the Join stage, it requires little memory, because of the sort requirement.
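A hedged sketch of the Merge stage's distinctive behavior (illustrative Python, not DataStage code; it assumes one update link and a duplicate-free master, and the data is invented):

    # Illustrative model: one master link, one update (secondary) link.
    # Both are assumed sorted by the merge key; the master is duplicate-free.
    master  = [{"Item": 1, "Qty": 5}, {"Item": 2, "Qty": 7}]
    updates = [{"Item": 2, "Desc": "Gadget"}, {"Item": 3, "Desc": "Gizmo"}]

    def merge(keep_unmatched_masters=True):
        by_key = {u["Item"]: u for u in updates}
        output, matched = [], set()
        for m in master:
            u = by_key.get(m["Item"])
            if u is not None:
                matched.add(m["Item"])
                output.append({**m, **u})           # matched update is consumed
            elif keep_unmatched_masters:            # Unmatched Masters Mode = Keep
                output.append({**m, "Desc": None})
        # One reject set per update link captures its non-matches.
        update_rejects = [u for u in updates if u["Item"] not in matched]
        return output, update_rejects

    print(merge())                               # Item 3 lands in the reject set
    print(merge(keep_unmatched_masters=False))   # unmatched master Item 1 dropped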


Merge stage job

(Diagram: Merge stage job with a master link, a secondary link, and a reject link that captures secondary link non-matches)

Merge stage job


This slide shows an example job with a Merge stage. The input links are ordered:
Master link and secondary link.
As mentioned earlier, the Merge stage supports reject links for capturing secondary link
non-matches. In this example, the ItemsReject link captures non-matching rows from
the Items secondary link.


Merge stage properties

(Screenshot: Merge stage Properties tab; callouts: match key; keep or drop unmatched masters)

Merge stage properties


This slide shows the Property tab of the Merge stage. In addition to the Key properties,
there are several optional properties that can be used. Highlighted is the Unmatched
Masters Mode property. Use this property to specify whether the stage is to keep or
drop master rows that do not have matching secondary link rows.


Comparison Chart

                                  Joins                             Lookup                              Merge
Model                             RDBMS-style relational            Source - in RAM LU Table            Master - Update(s)
Memory usage                      light                             heavy                               light
# and names of Inputs             2 or more: left, right            1 Source, N LU Tables               1 Master, N Update(s)
Mandatory Input Sort              all inputs                        no                                  all inputs
Duplicates in primary input       OK                                OK                                  Warning!
Duplicates in secondary input(s)  OK                                Warning!                            OK only when N = 1
Options on unmatched primary      Keep (left outer), Drop (Inner)   [fail] | continue | drop | reject   [keep] | drop
Options on unmatched secondary    Keep (right outer), Drop (Inner)  NONE                                capture in reject set(s)
On match, secondary entries are   captured                          captured                            consumed
# Outputs                         1                                 1 out, (1 reject)                   1 out, (N rejects)
Captured in reject set(s)         Nothing (N/A)                     unmatched primary entries           unmatched secondary entries

Comparison Chart
This chart summarizes the differences between the three combination stages. The key
point here is that the Join and Merge stages are light on memory usage, but have the
additional requirement that the data is sorted. The Lookup stage does not have the sort
requirement, but is heavy on memory usage.
Apart from the memory requirements, each stage offers a slightly different set of
properties.


What is a Funnel stage?


• Collects rows of data from multiple input links into a single output
stream
 Rows coming out have the same metadata as rows going in. Just more
rows
• All sources must have compatible metadata
 Same number of columns of compatible types
• Three modes
 Continuous: Records are combined in no particular order
 Sort Funnel: Preserves the sorted output of sorted input links
 Sequence: Outputs all records from the first input link, then all from the
second link, and so on


What is a Funnel stage?


The Funnel stage collects rows of data from multiple input links into a single output
stream. Although the Funnel stage combines data, it combines in a very different way
from the Join, Merge, and Lookup stages. The latter horizontally combine the columns
from each input link. The Funnel stage output link has the same columns as exist in the
input links. And each input link has the same number of columns with compatible types.
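The three funnel modes can be sketched as follows (illustrative Python only; Continuous is simulated here by interleaving, whereas the real stage simply takes rows as they arrive, in no guaranteed order):

    import heapq

    link1 = [{"Item": 1}, {"Item": 3}]
    link2 = [{"Item": 2}, {"Item": 4}]

    # Sequence mode: all rows from the first link, then all from the second.
    sequence = link1 + link2

    # Sort Funnel mode: inputs already sorted by Item; merging preserves order.
    sort_funnel = list(heapq.merge(link1, link2, key=lambda r: r["Item"]))

    # Continuous mode: rows combined in no particular order
    # (simulated here by a simple interleave).
    continuous = [r for pair in zip(link1, link2) for r in pair]

    print(sequence)
    print(sort_funnel)
    print(continuous)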


Funnel stage example

(Diagram: job with a Funnel stage combining two input links into a single output link)

Funnel stage example


This slide shows a job with a funnel stage. Both input links must have the same
metadata, that is, same number of columns and compatible column types. The output is
a single stream containing all the rows from both input links.
The total number of rows going through the output link is the sum of the number of rows
for each input link.


Funnel stage properties


• Funnel stage has only one property: Funnel Type
 Here Continuous Funnel has been selected

(Screenshot callout: Funnel Type property)

Funnel stage properties


This slide shows the Funnel stage properties. The Funnel stage has only one property:
Funnel Type. Here Continuous Funnel has been selected. This implies that the
records going through the output link will not be in any particular ordering.


Checkpoint
1. Which stage uses the least amount of memory? Join or Lookup?
2. Which stage requires that the input data is sorted? Join or Lookup?
3. If the left input link has 10 rows and the right input link has 15 rows,
how many rows are output from the Join stage for a Left Outer join?
From the Funnel stage?


Checkpoint


Checkpoint solutions
1. Join
2. Join
3. At least 10 rows will be output from the Join stage using a Left Outer
Join. Possibly up to 15, if there are multiple matches. 25 rows will
be output from the Funnel stage.


Checkpoint solutions


Demonstration 3
Using Join, Merge, and Funnel stages


Demonstration 3: Using Join, Merge, and Funnel stages


Demonstration 3:
Using the Join, Merge, and Funnel stages

Purpose:
You want to understand how the Join, Merge and Funnel stages can be used
to combine data, so you will use each of these stages in a job.
NOTE:
In this demonstration and other demonstrations in this course there may be tasks that
start with jobs you have been instructed to build in previous tasks. If you were not able
to complete the earlier job, you can import it from the
DSEssLabSolutions_v11_5_1.dsx file in your C:\CourseData\DSEss_Files\dsx files
directory. This file contains all the jobs built in the demonstrations for this course.
Steps:
1. Click Import, and then click DataStage Components.
2. Select the Import selected option, and then select the job you want from the list
that is displayed.
If you want to save a previous version of the job, be sure to save it under a new name
before you import the version from the demonstration solutions file.
Task 1. Use the Join stage in a job.
1. Open your LookupWarehouseItem job. Save it as JoinWarehouseItem.
2. Delete the Lookup stage and replace it with a Join stage available from the
Processing folder in the palette. (Just delete the Lookup stage, drag over a
Join stage, and then reconnect the links.)

3. Verify that you can view the data in the Warehouse stage.


4. Verify that you can view the data in the Items stage.
5. Open the Join stage. Join by Item. Specify a Right Outer join.

6. Click the Link Ordering tab. Make Warehouse the Right link by selecting
either Items or Warehouse, and then clicking up or down arrow accordingly.

7. Click the Output > Mapping tab. Be sure all columns are mapped to the output.


8. Edit the target Sequential File stage. Edit or confirm that the job writes to a file
named WarehouseItems.txt in your lab files Temp directory.
9. Compile and run. Verify that the number of records written to the target
sequential file is the same as were read from the Warehouse.txt file, since this
is a Right Outer join.

10. View the data. Verify that the description is joined onto each Warehouse file record.


Task 2. Use the Merge stage in a job.


In this task, you will see if the Merge stage can be used in place of the Join stage.
You will see that it cannot be successfully used.
1. Save your job as MergeWarehouseItem. Replace the Join stage by the Merge
stage. (Just delete the Join stage, drag over a Merge stage, and then reconnect
the links.)

2. In the Merge stage, specify that data is to be merged, with case sensitivity, by
the key (Item). Assume that the data is sorted in ascending order. Also specify
that unmatched records from Warehouse (the master link) are to be dropped.


3. On the Link Ordering tab, ensure that the Warehouse link is the master link.

4. On the Output > Mapping tab, be sure that all input columns are mapped to
the appropriate output columns.

5. Compile and run. View the data.


6. View the job log. Notice that a number of master records have been dropped
because they are duplicates.

Recall that the Merge stage requires the master data to be duplicate-free in the
key column. A number of update records have also been dropped because they
did not match master records.


The moral here - you cannot use the Merge stage if your Master source has
duplicates. None of the duplicate records will match with update records.
Recall that another requirement of the Merge stage (and Join stage) is that the
data is hash partitioned and sorted by the key. You did not do this explicitly, so
why did your job not fail? Let us examine the job log for clues.
7. Open up the Score message.
Notice that hash partitioners and sorts (tsort operators) have been inserted by
DataStage.

Task 3. Use the Funnel stage in a job.


In this task, you will funnel rows from two input files into a single file.
1. Open a new parallel job and save it as FunnelWarehouse. Add links and
stages and name them as shown.


2. Edit the two source Sequential File stages to, respectively, extract data from the
two Warehouse files, Warehouse_031005_01.txt and
Warehouse_031005_02.txt. They have the same format and column
definitions as the Warehouse.txt file.
3. Edit the Funnel stage to combine data from the two files in Continuous Funnel
mode.

4. On the Output > Mapping tab, map all columns through the stage.
5. In the target stage, write to a file named TargetFile.txt in the Temp directory.
6. Compile and run. Verify that the number of rows going into the target is the sum
of the number of rows coming from the two sources.

Results:
You wanted to understand how the Join, Merge and Funnel stages can be
used to combine data, so you used each of these stages in a job.


Unit summary
• Combine data using the Lookup stage
• Define range lookups
• Combine data using the Merge stage
• Combine data using the Join stage
• Combine data using the Funnel stage


Unit summary


Group processing stages

IBM Infosphere DataStage v11.5



Unit objectives
• Sort data using in-stage sorts and Sort stage
• Combine data using the Aggregator stage
• Combine data using the Remove Duplicates stage


Unit objectives


Group processing stages


• Group processing stages include:
 Sort stage
 Aggregator stage
 Remove Duplicates stage
 Transformer stage (discussed in another unit)
• In all Group processing stages, you will specify one or more key
columns that define the groups


Group processing stages


Group processing stages perform activities over groups of rows. The groups are
defined by one or more key columns. The Sort stage puts the groups into sort order.
The Aggregator stage performs calculations over each group. The Remove Duplicates
stage retains a single row from each group.
In addition to the Sort, Aggregator, and Remove Duplicates stages, the Transformer
stage can also perform group processing. This is discussed in a later unit.


Sort data
• Uses
 Sorting is a common business requirement
− Pre-requisite for many types of reports
 Some stages require sorted input
− Join, Merge stages
 Some stages are more efficient with sorted input
− Aggregator stage uses less memory
• Two ways to sort:
 In-stage sorts
− On input link Partitioning tab
• Requires partitioning algorithm other than Auto
− Sort icon shows up on input link
 Sort stages
− More configurable properties than in-stage sorting


Sort data
Sorting has many uses within DataStage jobs. In addition to implementing business
requirements, sorted input data is required by some stages and helpful to others.
Sorting can be specified within stages (in-stage sorts), or using a separate Sort stage.
The latter provides properties not available in in-stage sorts.


Sorting alternatives

(Diagram: two jobs; callouts mark the Sort stage in the top job and the in-stage sort icon in the lower job)

Sorting alternatives
This slide shows two jobs that sort data. The Sort stage is used in the top job. In the
lower job, you see the in-stage sort icon, which provides a visual indicator that a sort
has been defined in the stage associated with the icon.


In-Stage sorting
(Screenshot: Input > Partitioning tab; callouts: Partitioning tab; enable sort; preserve non-key row ordering (Stable); remove dups (Unique); select key columns; select partitioning algorithm; sort key)

In-Stage sorting
This slide shows the Input>Partitioning tab of a typical stage (here, a Merge stage).
To specify an in-stage sort, you first select the Perform sort check box. Then you
select the sort key columns from the Available box. In the Selected box you can
specify some sort options.
You can optionally select Stable. Stable will preserve the original ordering of records
within each key group. If not set, no particular ordering of records within sort groups is
guaranteed.
Optionally, select the Unique box to remove duplicate rows based on the key columns.
Sorting is only enabled if a Partition type other than Auto is selected.


Stable sort illustration

Before sort:        After stable sort:

Key  Col            Key  Col
4    X              1    K
3    Y              1    A
1    K              2    P
3    C              2    L
2    P              3    Y
3    D              3    C
1    A              3    D
2    L              4    X

Stable sort illustration


This diagram illustrates how stable sorting functions. The ordering of non-key column
values within each sort group is preserved. For example, on the left the 1-K row is
before the 1-A row. On the right, this ordering is preserved. Similarly, the 2-P row is
before 2-L row. This ordering is preserved.
Sometimes, for business requirements, this ordering needs to be preserved. For
example, suppose that the last record is considered to be the “final” version, which is
used in later processing. The earlier versions are to be removed from later processing.
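To make this concrete, here is a minimal Python sketch (illustrative only, not DataStage code) that reproduces the slide's data. Python's sorted() is itself a stable sort, so it models the Stable option directly:

rows = [(4, "X"), (3, "Y"), (1, "K"), (3, "C"),
        (2, "P"), (3, "D"), (1, "A"), (2, "L")]

# Sort on the key column only; ties keep their original relative order.
stable = sorted(rows, key=lambda r: r[0])
print(stable)
# [(1, 'K'), (1, 'A'), (2, 'P'), (2, 'L'), (3, 'Y'), (3, 'C'), (3, 'D'), (4, 'X')]
# Within each key group the original order (K before A, P before L) survives.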


Sort stage Properties tab

(Screenshot of the Sort stage Properties tab with callouts: sort key, sort options.)


Sort stage Properties tab


This slide shows the inside of the Sort stage and highlights the Sort Keys property. In this example, the sort key has three columns.
There are two folders of properties: Sorting Keys and Options. These properties and options are discussed on the following pages.


Specify sort keys


• Add one or more keys
• Specify Sort Key Mode for each key
 Sort: Sort by this key
 Don’t sort (previously sorted):
− Assumes the data has already been sorted on this key
 Purpose is to avoid unnecessary sorting, which impacts performance
• Specify sort order: Ascending / Descending
• Specify case sensitivity


Specify sort keys


The most important property within the Sort stage, one which is unavailable for in-stage
sorts, is the Sort Key Mode property. Its purpose is to avoid unnecessary sorting,
which impacts performance. If the data has already been partially sorted, the stage can
take advantage of that.


Sort stage options

Option                    More information

Sort Utility              Choose DataStage, which is the default

Stable                    Same as for in-stage sorting

Allow duplicates          Same as for in-stage sorting

Restrict Memory Usage     Specifies the maximum amount of memory that can be
                          used for sorting
                          • Amount is per partition
                          • Sorting is done in memory to improve performance
                          • Uses scratch disk (as defined in the configuration
                            file) if it runs out of memory
                          • Increasing the amount of memory can improve
                            performance

Create key change column  Adds a column with a value of 1 or 0
                          • 1 indicates that the key value has changed
                          • 0 means that the key value has not changed
                          • Useful for group processing in the Transformer stage


Sort stage options


There are several optional sort properties available within the Sort stage.
By default, the Sort stage uses the DataStage sort utility, which is faster than the alternative.
The Restrict Memory Usage property specifies the maximum amount of memory available to the stage per partition. Increase this amount if there is not enough memory available to the stage.
The Create key change column property is used for group processing within a downstream Transformer stage. Group processing in the Transformer stage is discussed in a later unit.


Create key change column

Before sort:            After sort, with key change column:

Key  Col                Key  Col  K_C
 4    X                  1    K    1
 3    Y                  1    A    0
 1    K                  2    P    1
 3    C                  2    L    0
 2    P                  3    Y    1
 3    D                  3    C    0
 1    A                  3    D    0
 2    L                  4    X    1


Create key change column


This diagram illustrates how the Create Key Change Column works. Notice that after
the sort, an additional column (K_C) has been added with 1’s and 0’s. “1” indicates the
start of a new group of rows. In this example, 3-Y, 1-K, and 4-X are among the rows
that start new groups.
The Transformer stage sees one row at a time, but can keep running totals. It can use
the key change column to detect when its total for a group is complete.
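As an illustration only (plain Python, not DataStage code), the following sketch derives a key change column over sorted rows, then shows how a row-at-a-time consumer can use it to detect group boundaries:

rows = [(1, "K"), (1, "A"), (2, "P"), (2, "L"),
        (3, "Y"), (3, "C"), (3, "D"), (4, "X")]

# Derive K_C: 1 when the key differs from the previous row's key, else 0.
prev_key = None
flagged = []
for key, col in rows:
    flagged.append((key, col, 1 if key != prev_key else 0))
    prev_key = key

# A row-at-a-time consumer (like a Transformer) counts rows per group:
count = 0
for key, col, k_c in flagged:
    if k_c == 1 and count > 0:   # a new group starts: previous group is done
        print("group complete, size", count)
        count = 0
    count += 1
print("group complete, size", count)  # flush the final group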


Partition sorts
• Sorting occurs separately within each partition
 By default, the Sort stage runs in parallel mode
• What if you need a final global sort, that is, a sort of all the data, not
just the data in a particular partition?
 When you write the data out, collect the data using the
Sort Merge algorithm
 Or, run the Sort stage in sequential mode
(not recommended because this reduces performance)


Partition sorts
By default, the Sort stage runs in parallel mode. Sorting occurs separately within each
partition. In many cases, this is all the sorting that is needed. In some cases, a global
sort, across all partitions, is needed. Even in this case, it makes sense to run the stage in parallel mode and collect the data afterward using Sort Merge. This is generally much faster than running the stage in sequential mode.
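Conceptually (a Python sketch assuming two partitions; not DataStage code), a per-partition sort followed by Sort Merge collection yields a globally sorted stream without a sequential sort:

import heapq

# Each partition is sorted independently (in parallel, in DataStage).
part0 = sorted([(3, "Y"), (1, "K"), (2, "P"), (1, "A")])
part1 = sorted([(4, "X"), (3, "C"), (2, "L"), (3, "D")])

# Sort Merge collection: merge already-sorted partitions into one stream.
globally_sorted = list(heapq.merge(part0, part1))
print(globally_sorted)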


Aggregator stage
• Purpose: Perform data aggregations
 Functions like an SQL statement with a GROUP BY clause
• Specify one or more key columns that define the aggregation groups
• Two types of aggregations
 Those that aggregate the data within specific columns
− Select the columns
− Specify the aggregations: SUM, MAX, MIN, etc.
 Those that simply count the rows within each group
• The Aggregator stage can work more efficiently if the data has been
pre-sorted
 Specified in the Method property: Hash (default) / Sort


Aggregator stage
This slide lists the major features of the Aggregator stage. It functions much like an SQL statement with a GROUP BY clause. However, it offers far more possible aggregations than SQL typically provides.
The key activities you perform in the Aggregator stage are specifying the key columns that define the groups and selecting the aggregations the stage is to perform. There are two basic types of calculations: counting the rows within each group, which is not performed over any specific column, and calculations performed over selected columns.
If the data going into the Aggregator stage has already been sorted, the stage can work more efficiently. You indicate this using the Method property.


Job with Aggregator stage

(Job screenshot with the Aggregator stage highlighted.)


Job with Aggregator stage


This slide shows a "fork-join" job design with an Aggregator stage. In this job, all rows go out both output links from the Copy stage. One output link goes to the Aggregator stage, where the data is grouped and summarized. The summary result is then joined back to each of the rows flowing from the Copy stage to the Join stage.
It is called a "fork-join" design because the data is forked out into multiple output streams and then joined back together.


Aggregation types
• Count rows
 Count rows in each group
 Specify the output column
• Calculation
 Select columns for calculation
 Select calculations to perform, including:
− Sum
− Min, max
− Mean
− Missing value count
− Non-missing value count
 Specify output columns


Aggregation types
There are two basic aggregation types: Count rows, Calculation. The former counts
the number of rows in each group. With the latter type, you select an input column that
you want to perform calculations on. Then you select the calculations to perform on that
input column and the output columns to put the results in.
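For intuition only (a Python sketch, not DataStage code), the two aggregation types over groups keyed on the first column might look like this:

from collections import defaultdict

rows = [("A", 10), ("A", 30), ("B", 5), ("B", 25), ("B", 15)]

counts = defaultdict(int)   # Count Rows: one row count per group
mins, maxes = {}, {}        # Calculation: e.g. Min and Max over a column
for key, value in rows:
    counts[key] += 1
    mins[key] = min(mins.get(key, value), value)
    maxes[key] = max(maxes.get(key, value), value)

print(dict(counts))  # {'A': 2, 'B': 3}
print(mins)          # {'A': 10, 'B': 5}
print(maxes)         # {'A': 30, 'B': 25}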


Count Rows aggregation type

(Screenshot of the Aggregator stage Properties tab with callouts: group key column, Count Rows aggregation type, column for the result.)


Count Rows aggregation type


This slide shows the inside of the Aggregator stage on the Properties tab and
highlights the main properties. The Group property specifies the columns that define
the groups. Select either Count Rows or Calculation for the Aggregation Type
property.
To specify a new output column, just type in the name of the output column in the
Count Output Column property. This column will show up on the Output > Mapping
tab with a default type. On the Output > Mapping tab, you can edit the column data
type, if needed.
In this example, Sort has been selected for the Method property. This tells the stage
that the data going into the stage has already been sorted. The stage itself does not
sort the data! If the data is not actually sorted, runtime errors will occur.


Output Mapping tab


• Drag the columns across to create the output columns
• You can modify the name and type of the columns on the
Columns tab

(Screenshot of the Output > Mapping tab; callout: results column for the count.)



Output Mapping tab


This slide shows the Output > Mapping tab of the Aggregator stage. This is where you
map the aggregation results to output columns. In this example, the stage output has
not yet been specified. Here, both columns on the left will be dragged across to the
output link. So the output link will have both the group key and the group results. The
group key will be used to join the data back to the other stream, in the Join stage.


Output Columns tab


• New output columns are created with a default type of Double
 Optionally, change the type of the output column

(Screenshot of the Output > Columns tab; callout: default column type.)


Output Columns tab


This slide shows the Output > Columns tab. This shows the output column metadata
for the columns specified on the Properties tab. You can edit the column names and
default types.


Calculation aggregation type

(Screenshot of the Aggregator stage Properties tab with callouts: grouping key column, Calculation aggregation type, calculations and output column names, column for calculation, more calculations.)


Calculation aggregation type


In this example a Calculation aggregation type has been selected. When this type
is selected, you need to select the column or columns upon which calculations are
to be performed along with the results columns for the calculations.
In this example, calculations are being performed over the values in the Item
column. The Maximum is taken and put into a column named ItemMax. The
Minimum is taken and put into a column named ItemMin.


Grouping methods
• Hash (default)
 Calculations are made for all groups and stored in memory
− Hash table structure (hence the name)
 Results are written out after all rows in the partition have been processed
 Input does not need to be sorted
 Needs enough memory to store all the groups of data to be processed
• Sort
 Requires the input data to be sorted by grouping keys
− Does not perform the sort! Expects the sort
 Only a single group is kept in memory at a time
− After a group is processed, the group result is written out
 Only needs enough memory to store the currently processed group


Grouping methods
There are two grouping methods in the Aggregator stage. This slide summarizes their features and differences. The default method is Hash. When this method is selected, the Aggregator stage makes calculations for all the groups and stores the results in memory. Put another way, all the input data is read in and processed. If there is not enough memory to process all of the data in memory, the stage uses scratch disk, which slows processing down considerably. This method does not require that the data be presorted.
The Sort method requires that the data has been presorted. The stage itself does not perform the sort. When Sort is selected, the stage only stores a single group in memory at a time, so very little memory is required. The Aggregator stage can also work faster, since the data has been presorted.
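The memory trade-off can be sketched in Python (illustrative only, not DataStage code): the Hash method keeps every group's running result in a dict, while the Sort method streams sorted input and holds only one group at a time:

from itertools import groupby

rows = [(4, "X"), (3, "Y"), (1, "K"), (3, "C"),
        (2, "P"), (3, "D"), (1, "A"), (2, "L")]

# Hash method: unsorted input; all group results stay in memory to the end.
hash_counts = {}
for key, _ in rows:
    hash_counts[key] = hash_counts.get(key, 0) + 1
print(hash_counts)                     # every group held until input ends

# Sort method: input must already be sorted; emit each group as it closes.
for key, group in groupby(sorted(rows), key=lambda r: r[0]):
    print(key, sum(1 for _ in group))  # only the current group is in memory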


Method = Hash

Input (unsorted):       In-memory hash table (all groups held at once):

Key  Col
 4    X                 4 → 4X
 3    Y                 3 → 3Y 3C 3D
 1    K                 1 → 1K 1A
 3    C                 2 → 2P 2L
 2    P
 3    D
 1    A
 2    L

Method = Hash
This diagram illustrates the Hash method.
When Method equals Hash, all the groups of data must be put into memory. This is
illustrated by the circle around all of the groups. The structure in memory is a keyed
structure for fast return of the results.


Method = Sort
Input (sorted):         In memory (one group at a time):

Key  Col
 1    K                 [1K 1A]
 1    A
 2    P                 [2P 2L]
 2    L
 3    Y                 [3Y 3C 3D]
 3    C
 3    D
 4    X                 [4X]


Method = Sort
This diagram illustrates the Sort method.
When Method equals Sort, only the current group needs to be put into memory. This is
illustrated by the circles around the individual groups.


Remove duplicates
• by Sort stage
 Use unique option
− No choice on which duplicate to keep
− Stable sort always retains the first row in the group
− Non-stable sort is indeterminate

OR

• by Remove Duplicates stage


 Has more sophisticated ways to remove duplicates
− Can choose to retain first or last


Remove duplicates
There are several ways you can remove duplicates in a DataStage job. When sorting,
you can optionally specify that duplicates are to be removed, whether you are sorting
using a Sort stage or performing an in-stage sort. Alternatively, the job can use the
Remove Duplicates stage. The advantage of using the Remove Duplicates stage is that
you can specify whether the first or last duplicate is to be retained.


Remove Duplicates stage job

(Job screenshot with the Remove Duplicates stage highlighted.)


Remove Duplicates stage job


Here is an example of a DataStage job with a Remove Duplicates stage. Like the Sort stage, it has one input link and one output link.


Remove Duplicates stage properties

(Screenshot of the Remove Duplicates stage Properties tab with callouts: key columns that define duplicates, duplicate to retain, option to add more key columns.)


Remove Duplicates stage properties


This slide shows the Properties tab of the Remove Duplicates stage. The main
requirement is to specify the Key columns that define what counts as a duplicate
record (two records with matching key values). It is important to note that duplicate
does not mean all the data in the records match. It just means that all the data in the
specified key columns match. The key columns define what it means to be a
duplicate.
The other key property in the stage is the Duplicate to Retain property. This
property is not available in the Sort stage.


Checkpoint
1. What stage is used to perform calculations of column values
grouped in specified ways?
2. In what two ways can sorts be performed?
3. What is a stable sort?
4. What two types of aggregations can be performed?


Checkpoint


Checkpoint solutions
1. Aggregator stage
2. Using a Sort stage, or using in-stage sorts.
3. A stable sort preserves the original ordering of records within each key group.
4. Count Rows and Calculation.


Checkpoint solutions


Demonstration 1
Group processing stages

• In this demonstration, you will:


 Create a job that uses Sort, Aggregator, and Remove Duplicates stages
 Create a Fork-Join job design


Demonstration 1: Group processing stages


Demonstration 1:
Group processing stages

Purpose:
In order to understand how groups of data are processed, you will create a job that uses the Sort, Aggregator, and Remove Duplicates stages. You will also create a fork-join design.
Windows User/Password: student/student
DataStage Client: Designer
Designer Client User/Password: student/student
Project: EDSERVER/DSProject
NOTE:
In this demonstration and other demonstrations in this course there may be tasks that
start with jobs you have been instructed to build in previous tasks. If you were not able
to complete the earlier job you can import it from the
DSEssLabSolutions_V11_5_1.dsx file in your C:\CourseData\DSEss_Files\dsx files
directory. This file contains all the jobs built in the demonstrations for this course.
Steps:
1. Click Import, and then click DataStage Components.
2. Select the Import selected option, and then select the job you want from the list
that is displayed.
If you want to save a previous version of the job, be sure to save it under a new name
before you import the version from the demonstration solutions file.


Task 1. Create the job design.


1. Open a new parallel job and save it as ForkJoin. Add stages and links and name them as shown. You will find the Sort, Aggregator, Copy, Join, and Remove Duplicates stages in Palette > Processing.
2. Edit the Selling_Group_Mapping_Dups Sequential File stage to read from the Selling_Group_Mapping_Dups.txt file. It has the same format as the Selling_Group_Mapping.txt file.


3. Edit the Sort_By_Code Sort stage. Perform an ascending sort by Selling_Group_Code. The sort should not be a stable sort. Send all columns through the stage.

4. In the Copy stage, specify that all columns move through the stage to the
CopyToJoin link.


5. Specify that only the Selling_Group_Code column moves through the Copy
stage to the Aggregator stage.

6. Edit the Aggregator stage. Specify that records are to be grouped by Selling_Group_Code.
7. Specify that the type of aggregation is Count Rows.
8. Specify that the aggregation result is to go into a column named CountGroup. Select Sort as the Method, because the data has been sorted by the grouping key column.

Next you want to define the columns.


9. On the Output > Mapping tab, drag both columns to AggToJoin. We want to include Selling_Group_Code so we can join the outputs in the Join stage later.

10. On the Output > Columns tab, define CountGroup as an integer, length 10.

11. Edit the Join stage. The join key is Selling_Group_Code. The join type is
Left Outer.


12. Verify on the Link Ordering tab that the CopyToJoin link is the left link.

13. On the Output > Mapping tab, map all columns across. Click Yes to the
message to overwrite the value, if prompted.


14. Edit the Sort_By_Handling_Code stage. The key column of Selling_Group_Code has already been sorted, so specify Don't Sort (Previously Sorted) for that key column. Add Special_Handling_Code as an additional sort key. Turn off stable sort.

15. On the Output > Mapping tab, move all columns through the stage.
16. On the Input > Partitioning tab, select Same to guarantee that the partitioning
going into the stage will not change.


17. Edit the Remove Duplicates stage. Group by Selling_Group_Code. Retain the last record in each group.

18. On the Output > Mapping tab, move all columns through the stage.
19. Edit the target Sequential stage. Write to a file named
Selling_Group_Code_Deduped.txt in the lab files Temp directory. On the
Partitioning tab, collect the data using Sort Merge based on the two columns
by which the data has been sorted, clicking the columns to move them to the
Selected box.

20. Compile and run. View the job log to check whether there are any problems.


21. View the results. There should be fewer rows going into the target stage than
the number coming out of the source stage, because the duplicate records have
been eliminated.

22. View the data in the target stage. Take a look at the CountGroup column to see that some groups contained multiple duplicate rows.

Results:
In order to understand how groups of data are processed, you created a job that uses the Sort, Aggregator, and Remove Duplicates stages. You also created a fork-join design.


Fork-Join Job Design

(Job screenshot with callouts: data forked at the Copy stage, joined back at the Join stage.)


Fork-Join Job Design


The Copy stage forks the data into two output streams. One stream goes to an
Aggregator stage where calculations are performed over all the groups of data in the
input. The results are then joined back to each row of data from the left fork.


Unit summary
• Sort data using in-stage sorts and the Sort stage
• Combine data using the Aggregator stage
• Remove duplicate rows using the Remove Duplicates stage


Unit Summary

Transformer stage

IBM Infosphere DataStage v11.5


Unit objectives
• Use the Transformer stage in parallel jobs
• Define constraints
• Define derivations
• Use stage variables
• Create a parameter set and use its parameters in constraints and
derivations


Unit objectives
This unit focuses on the primary stage for implementing business logic in a DataStage
job, namely, the Transformer.


Transformer stage
• Primary stage for filtering, directing, and transforming data
• Define constraints
 Only rows that satisfy the specified condition can pass out the link
 Use to filter data
− For example, only write out rows for customers located in California
 Use to direct data down different output links based on specified conditions
− For example, send unregistered customers out one link and registered customers out another link
• Define derivations
 Derive an output value from various input columns and write it to a column or
stage variable
• Compiles into a custom operator in the OSH
 This is why DataStage requires a C++ compiler
• Optionally include a reject link
 Captures rows that the Transformer stage cannot process

Transformer stage
This slide lists the primary features of the Transformer stage, which is the primary stage for filtering, directing, and transforming data.
In a Transformer stage, you can specify constraints for any output links. Constraints can be used to filter data or to direct data down specific output links.
In a Transformer stage, you can define derivations for any output column or stage variable. A derivation defines the value that is to be written to the column or variable.


Job with a Transformer stage

(Job screenshot with callouts: Transformer stage, single input link, reject link, multiple output links.)


Job with a Transformer stage


This slide shows an example of a job with a Transformer stage.
In this example, rows written out of the Transformer stage are directed down one of two output links based on constraints defined in the stage. Rows that cannot be processed by the Transformer stage are captured by a reject link.


Inside the Transformer stage

(Screenshot of the Transformer stage editor with callouts: stage variables, loops, input link columns, derivations, output columns, column definitions.)


Inside the Transformer stage


This slide shows the inside of the Transformer stage and highlights its main features,
which are described in more detail in subsequent pages.
On the top, left side are the columns of the input link going into the Transformer. The
definitions for these columns are displayed at the bottom, left side.
On the top, right side are the columns for each of the stage output links. The columns
for each output link are located in separate windows within the stage. The definitions for
these columns are displayed and edited at the bottom, right side.


Transformer stage elements (1 of 2)


• Input link columns
 Names of columns are listed in the input link window on the left side
 Column metadata (name, type, nullability) is specified on the tabs at the
bottom
− One tab per link window
• Output link columns
 Names of link columns are listed in output link windows on the right side
 Column metadata (name, type, nullability) is specified on the tabs at the
bottom
 There is one output link window for each output link
− Title is the name of the output link. (Be sure to name your output links!)
• Derivation cells
 Cells to the left of each stage variable or output column
 Double-click on the cell to open the expression editor


Transformer stage elements


This describes the primary Transformer stage features identified on the previous
page.


Transformer stage elements (2 of 2)


• Constraints
 Double-click to the right of the word “Constraint” at the top of an output link
window to open the Transformer Stage Constraints window
− Alternatively click the Constraints icon at the top (second from the left)
• Stage variables window: Top right
 Lists defined stage variables in the order of their execution
 Right-click mouse, then click Stage Variable Properties to define new stage
variables
• Loop Condition window: Second-to-top right
 Right-click, then click Loop Variable Properties to define new loop variables
 Double-click to the right of Loop While to open the expression editor to define the Loop While condition
• Transformer stage properties
 Click the icon at the top left corner of the window


This continues the description of the Transformer stage features identified on the prior
page.


Constraints
• What is a constraint?
 Defined for each output link
 Specifies a condition under which a row of data is allowed to flow out the
link
• Uses
 Filter data: Functions like an SQL WHERE clause
 Direct data down different output links based on the constraints defined on
the links
• Built using the expression editor
• Specified on the Constraints window
 Lists the names of the output links
 Double-click on the cell to the right of the link name to open the expression
editor to define the constraint
 Output links with no defined constraints output all rows


Constraints
This describes the main features of constraints: what they are, how they are used, and
how they are built.
A constraint is a condition: it is either true or false. When it is true (satisfied), data is allowed to flow through its output link. Only if the constraint is satisfied will the derivations for each of the link's output columns be executed.


Constraints example
• Here, low handling codes are directed down one output link and high
handling codes down another
• In the Transformer, constraints are defined for both output links


Constraints example
This slide displays a parallel job with a Transformer stage. There are two output links. In
the Transformer, constraints are defined for both output links. In this example, low
handling codes are directed down one output link and high handling codes down the
other.
A row of data can satisfy no constraints, one constraint, or more than one. It will be written out every output link whose constraint is satisfied. All rows will be written out for links that have no constraints.


Define a constraint

(Screenshot of the Transformer Stage Constraints window with callouts: output links, selecting an input column from the menu.)


Define a constraint
You double-click on the cell to the right of the link name to open the Transformer stage
expression editor to define the constraint. This slide shows an example of a constraint
defined in the expression editor. Select items from the menu to build the constraint.
Click the Constraints icon at the top of the Transformer (yellow chain) to open the
Transformer Stage Constraints window.


Use the expression editor


• Click the right mouse button at the spot where you want to insert an
item (for example, an input column)
• Select the type of item to insert into the expression
• Select from the list of items presented


Use the expression editor


This discusses how constraints are built. In the example shown in the screenshot, an
input column is being inserted into the expression. The menu provides a list of all the
items (input columns, job parameters, system variables, and so on) that you can insert
into the expression.
You can, alternatively, manually type in the names of these items, but be aware that
some items, such as input columns, have prefixes that are part of their complete
names. Input columns are prefixed by the names of their input links.
The location of the cursor determines the type of items available to be inserted. If the
cursor is located where an operator belongs, the menu will display a list of available
operators (>, <, =, and so on).


Otherwise links for data integrity


• Suppose a row contains a special handling code outside the range of
valid values
• You can add an “otherwise” link to capture these bad-data rows
 Add another link
 Set the link to be last in the link ordering
 Check the Otherwise box next to the new link name
• The otherwise constraint captures all rows that do not satisfy any of the
earlier executed constraints


Otherwise links for data integrity


Otherwise links can be added to promote data integrity. Suppose, for example, a row
contains a special handling code outside the range of valid values. You can add an
otherwise link to capture these bad-data rows.
An otherwise link is the same as any other output link, except that it does not have an
explicitly coded constraint defined for it. Do not confuse an otherwise link with a reject
link, the latter of which is a special type of output link, indicated by the broken line.
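Conceptually (a Python sketch, not DataStage code), constraints behave like per-link predicates evaluated in link order; a row flows out every link whose constraint it satisfies, and an otherwise link catches rows that no earlier constraint accepted:

rows = [{"code": 1}, {"code": 5}, {"code": 9}]

low, high, range_errors = [], [], []
for row in rows:
    matched = False
    if 0 <= row["code"] <= 2:        # LowCode link constraint
        low.append(row); matched = True
    if 3 <= row["code"] <= 6:        # HighCode link constraint
        high.append(row); matched = True
    if not matched:                  # otherwise link: nothing earlier matched
        range_errors.append(row)

print(len(low), len(high), len(range_errors))   # 1 1 1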


Otherwise link example


• The RangeErrors link is intended to capture any rows with codes
outside the acceptable range
• Here it is assumed that rows that do not satisfy either the LowCode
or HighCode constraints are bad data

(Job screenshot: the RangeErrors output link highlighted as the otherwise link.)


Otherwise link example


This slide shows an otherwise link coming out of a Transformer. It is an additional output link. It will capture rows that do not go down the LowCode and HighCode output links.
Subsequent pages show how to set RangeErrors to be an otherwise link.


Specify the link ordering


• Output links have an execution order
 A row is tested against the first output link constraint, then the second,
and so on
• Otherwise links should be placed last in the execution order
• Click the top right icon to open the Link Ordering tab

(Screenshot of the Link Ordering tab; the otherwise link is last in the order.)


Specify the link ordering


Otherwise links should be placed last in the execution order. This slide shows the Link Ordering tab, where the ordering of output links can be specified. Notice that RangeErrors is last in the list. Use the controls at the right to change the ordering of any of the links.
Click the Stage Properties icon at the top left of the Transformer to open the
Transformer Stage Properties window.


Specify the Otherwise link constraint


• Checking this box does two things:
 Captures rows not processed by earlier output links
 Writes a warning message to the log if any rows go down the
Otherwise link
• Do not code the constraint in the Constraint cell
 If you code a constraint, the constraint acts like an ordinary constraint,
but adds a warning message to the log if it is satisfied

(Screenshot of the Transformer Stage Constraints window; the Otherwise/Log box is checked for the otherwise link.)


Specify the Otherwise link constraint


To create an Otherwise link, check the Otherwise/Log box next to the link name in the
Transformer Stage Constraints window, displayed here.
Do not code a constraint for an Otherwise link. This defeats its purpose, and makes it
behave like an ordinary output link.


Demonstration 1
Define a constraint

• In this demonstration, you will:


 Define constraints in a job
 Create an Otherwise link


Demonstration 1: Define a constraint


Demonstration 1:
Define a constraint

Purpose:
You want to define constraints in the Transformer stage of a job. Later you
will define an Otherwise link.
Windows User/Password: student/student
DataStage Client: Designer
Designer Client User/Password: student/student
Project: EDSERVER/DSProject
NOTE:
In this demonstration and other demonstrations in this course there may be tasks that
start with jobs you have been instructed to build in previous tasks. If you were not able
to complete the earlier job you can import it from the
DSEssLabSolutions_V11_5_1.dsx file in your C:\CourseData\DSEss_Files\dsx files
directory. This file contains all the jobs built in the demonstrations for this course.
Steps:
1. Click Import, and then click DataStage Components.
2. Select the Import selected option, and then select the job you want from the list
that is displayed.
If you want to save a previous version of the job, be sure to save it under a new name
before you import the version from the lab solutions file.
Task 1. Define Transformer Constraints.
1. Create a new parallel job and save it as TransSellingGroup.
2. Add a Sequential File stage (available from Palette > File), a Transformer stage
(available from Palette > Processing), and two target Sequential File stages to
the canvas. Name the links and stages as shown.


3. Open the source Sequential File stage. Edit it to read data from the
Selling_Group_Mapping_RangeError.txt file. It has the same metadata as
the Selling_Group_Mapping.txt file.
4. Open up the Transformer stage. Drag all the input columns across to both
output link windows.

5. Double-click to the right of the word Constraint in either output link window.
This opens the Transformer Stage Constraints window.


6. Double-click the Constraint cell for LowCode to open the Expression Editor. Click the ellipsis box, and then select Input Column. Start by selecting Special_Handling_Code from the Input Column menu. Right-click to the right of the added item to use the editor to define a condition that selects just rows with special handling codes between 0 and 2 inclusive.
7. Double-click on the Constraint cell to the right of the HighCode link name to
open the Expression Editor. Using the same process as in the previous step,
define a condition that selects just rows with special handling codes between 3
and 6 inclusive.

8. Edit the LowCode target Sequential File stage to write to a file named
LowCode.txt in the lab files Temp directory.
9. Edit the HighCode target Sequential File stage to write to a file named
HighCode.txt in the lab files Temp directory.
10. Compile and run your job.
11. View the data in your target files to verify that they each contain the right rows.
Here is the LowCode.txt file data. Notice that it only contains rows with special
handling codes between 0 and 2.


Task 2. Use an Otherwise Link to capture range errors in the data.
1. Save your job as TransSellingGroupOtherwise.
2. Add an additional link from the Transformer to another Sequential File stage
and label the new stage and link RangeErrors.

3. In the Transformer, drag all input columns across to the new target link.

4. From the toolbar, click Output Link Execution Order.


5. Reorder the links so that the RangeErrors link is last in output link ordering.
(Depending on how you drew your links, this link may already be last.)

6. Open the Constraints window. Select the Otherwise/Log box to the right of
RangeErrors.

7. Edit the RangeErrors Sequential File stage as needed to write to the RangeErrors.txt file in the lab files Temp directory.


8. Compile and run your job. There should be a few range errors.

Results:
You defined constraints in the Transformer stage of a job. Later you defined
an Otherwise link.


Derivations
• Derivations are expressions that derive a value
• Like expressions for constraints they are built out of items:
 Input columns
 Job parameters
 Functions
 Stage variables
 System variables
• How derivations differ from constraints
 Constraints are:
− Expressions that are either true or false
− Apply to rows
 Derivations:
− Return a value that is written to a stage variable or output column
− Apply to columns


Derivations
Here are the main features of derivations. Derivations are expressions that return a
value.
Derivations are built using the same expression editor that constraints are built with.
And for the most part, they can contain the same types of items. The difference is that
constraints are conditions that evaluate to either true or false. Derivations return a value
(other than true or false) that can be stored in a column or variable.


Derivation targets
• Derivation results can be written to:
 Output columns
 Stage variables
 Loop variables
• Derivations are executed in order from top to bottom
 Stage variable derivations are executed first
 Loop variable derivations are executed second
 Output column derivations are executed last
− Executed only if the output link constraints are satisfied
− Output link ordering determines the order between the sets of output link variables


Derivation targets
The values derived from derivations can be written to several different targets: output
columns, stage variables, loop variables. (Loop variables are discussed later in this
unit.)


Stage variables
• Function like target columns, but they are not output (directly)
from the stage
• Stage variables are one item that can be referenced in
derivations and constraints
 In derivations, function in a similar way as input columns
• Have many uses, including:
 Simplify complex derivations
 Reduce the number of derivations
− The derivation into the stage variable is executed once, but can be used many times


Stage variables
Stage variables function like target columns, but they are not output (directly) from the
stage. Stage variables are one item (among others) that can be referenced in
derivations and constraints. They have many uses, including: simplifying complex
derivations and reducing the number of derivations.
Stage variables are called “stage” variables because their scope is limited to the
Transformer in which they are defined. For example, a derivation in one Transformer
cannot reference a stage variable defined in another Transformer.


Stage variable definitions


• Click the Stage Properties icon (far left)
 Click the Stage Variables tab
• Defining the stage variable
 Name
 SQL type and precision
 Initial value
− Value before any rows are processed by the stage


Stage variable definitions


Defining a stage variable is like defining a column. You specify a name, type, and
precision. Unlike with columns, however, you can initialize the stage variable with a
value. This is the value it will have when the first row is read in by the Transformer
stage to be processed.
Stage variables are not automatically refreshed when new rows are read in. They retain
their values until derivations change their values. This is a key feature of stage
variables. This makes it possible to compare the values from earlier rows to values in
the current row.
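A Python sketch of these semantics (illustrative only, not DataStage code): a stage variable persists across rows, so a derivation can compare the current row with state left over from the previous row:

rows = [{"key": 1}, {"key": 1}, {"key": 2}, {"key": 3}, {"key": 3}]

sv_prev_key = None            # stage variable: initialized before any rows
for row in rows:
    # stage variable derivation, executed once per row
    is_new_group = row["key"] != sv_prev_key
    sv_prev_key = row["key"]  # the value is retained into the next row
    print(row["key"], "new group" if is_new_group else "same group")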


Build a derivation
• Double-click in the cell to the left of the stage variable or output
column to open the expression editor
• Select the input columns, stage variables, functions and other
elements needed in your derivation
 Do not try to manually type the names of input columns
− Easy to make a mistake
− Input columns are prefixed by their link name
 Functions are divided into categories: Date & Time, Number, String,
Type conversion, and so on
− When you insert an empty function, it displays its syntax and parameter types


Build a derivation
As with constraints, derivations are built using the expression editor. Double-click in the
cell to the left of the stage variable or output column to open the expression editor.
To avoid errors in derivations, it is generally preferable to insert items into the
expression using the expression editor menu, rather than manually typing in their
names.


Define a derivation
(Screenshot of the expression editor with callouts: input column, string in quotes (single or double), concatenation operator (:).)


Define a derivation
This slide shows an example of a derivation being defined in the expression editor. Use
the menu to insert items into the expression.
This expression contains string constants. String constants must be surrounded by
either single or double quotes. The colon (:) is the concatenation operator. Use it to
combine two strings together into a single string. Shown in the above concatenation is a
column (Special_Handling_Code). For this expression to work, this column should be
a string type: char or varchar. You cannot concatenate, for example, an integer with a
string (unless the integer is a string numeric such as “32”).


IF THEN ELSE derivation


• Use IF THEN ELSE to conditionally derive a value
• Format:
 IF <condition> THEN <expression1> ELSE <expression2>
 If the condition evaluates to true, then the result of expression1 will be
written out
 If the condition evaluates to false, then the result of expression2 will be
written out
• Example:
 Suppose the source column is named In.OrderID and the target column is
named Out.OrderID
 To replace In.OrderID values of 3000 by 4000:
 IF In.OrderID = 3000 THEN 4000 ELSE In.OrderID


IF THEN ELSE derivation


IF THEN ELSE derivations are frequently used to express business rules. Using them,
you can express what value is to conditionally go into an output column or variable.
One typical use is replacing one data value with another. This might be used when the
name or identifier for a product or service is changed. Notice in the example how this is
done. You cannot code the derivation as just IF In.OrderID = 3000 THEN 4000. A derivation must return a value in every case. Without an ELSE clause, it would not return a value when the IF condition is false. Since you must have an ELSE, you need to output some value, so you output the unchanged value in the column.

© Copyright IBM Corp. 2005, 2015 10-30


Course materials may not be reproduced in whole or in part without the prior written permission of IBM.
Unit 10 Transformer stage

String functions and operators


• Substring operator
 Format: “String” [loc, length]
 Example:
− Suppose In.Description contains the string “Orange Juice”
− In.Description[8,5] = “Juice”

• UpCase(<string>) / DownCase(<string>)
 Example: UpCase(In.Description) = “ORANGE JUICE”
• Len(<string>)
 Example: Len(In.Description) = 12


String functions and operators


One common type of function you may need to use in your derivations are string
functions. Here you see a few of the many string functions you can use in your
derivations.
UpCase and DownCase are very useful functions when you need to compare strings.
For example, suppose you need to compare a string in a column to a string in a job
parameter. To make sure that the comparison will work when one string is upper case
and the other is mixed case, you can “standardize” the two strings by first applying the
UpCase or DownCase functions to them.
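For readers who think in Python, rough equivalents (a sketch; note that the DataStage substring operator is 1-based and takes a length, while Python slicing is 0-based):

description = "Orange Juice"

# In.Description[8,5]: start at position 8 (1-based), take 5 characters
print(description[7:7 + 5])   # 'Juice'

# UpCase / DownCase
print(description.upper())    # 'ORANGE JUICE'
print(description.lower())    # 'orange juice'

# Len
print(len(description))       # 12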


Null handling
• Nulls can get into the data flow:
 From lookups (lookup failures)
 From source data that contains nulls
• Nulls written to non-nullable output columns cause the job to abort
• Nulls can be handled using Transformer
null-handling functions:
 Test for null in column or variable
− IsNull(<column>)
− IsNotNull(<column>)

 Replace null with a value
− NullToValue(<column>, <value>)
 Set to null: SetNull()
− Example: IF In.Col = 5 THEN SetNull() ELSE In.Col

Null handling
This slide shows the standard null handling functions available in the Transformer
expression editor.
Nulls in the job flow have to be handled or the job can abort or yield unexpected results.
For example, a null value written to a non-nullable column will cause the job to abort.
This type of runtime error can be difficult to catch, because the job may run fine for a
while before it aborts from the occurrence of the null.
Also, recall that nulls written to a sequential file will be rejected by the Sequential File
stage, unless they are handled. Although these nulls can be handled in the Sequential
File stage, they can also be handled earlier in a Transformer.


Unhandled nulls
• What happens if an input column in a derivation contains null, but
is not handled, for example by using NullToValue(in.col)?
 This is determined by the Legacy null processing setting
− If set, the row is dropped or rejected
• Use a reject link to capture these rows
− If not set, the derivation returns null
• Example: Assume in.col is nullable and for this row is null
 5 + NullToValue(in.col, 0) = 5
 5 + in.col = Null, if Legacy null processing is not set
 5 + in.col = row is rejected or dropped, if Legacy null processing is set
• Best practice
 When Legacy null processing is set, create a reject link


Unhandled nulls
The Legacy null processing setting determines how nulls are handled in the
Transformer. If set, the row is dropped or rejected, just as it was in earlier versions of
DataStage. Use a reject link to capture these rows. If not set, the derivation returns null.
This feature was added in DataStage v8.5.
Note that this has to do with how nulls are handled within expressions, whether an
expression involving a null returns null or is rejected. In either case, a null value can
never be written to a non-nullable column.
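Modeling nulls as Python None (a sketch of the semantics described above, not DataStage code):

def null_to_value(col, default):
    # NullToValue: substitute a value when the column is null
    return default if col is None else col

col = None                        # an unhandled null reaching a derivation
print(5 + null_to_value(col, 0))  # 5 -- the null was handled

# With Legacy null processing off, an expression over a null returns null:
result = None if col is None else 5 + col
print(result)                     # None

# With Legacy null processing on, the row would instead be dropped, or
# rejected and captured by a reject link if one exists.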


Legacy null processing


• When set, when an unhandled null occurs the row is rejected
 Set on the Stage Properties>General tab
• If Abort on unhandled Null is set in addition to Legacy Null Processing,
unhandled nulls cause the job to abort

(Screenshot of the Stage Properties > General tab with callouts: Legacy null processing, Abort on unhandled null.)


Legacy null processing


This slide shows where the Legacy null processing option is set, namely, in the
Transformer Stage Properties window General tab.
By default, this option will be turned on for imported parallel jobs created prior to v8.5.
This is to ensure that those jobs will behave as they behaved when they were first
created. By default, jobs created in v8.5 and later will have this option turned off.


Transformer stage reject link


• Capture unhandled nulls
• To create, draw an output link. Right-click over the link, and then
select Convert to reject

(Job screenshot with the reject link highlighted, shown as a broken line.)


Transformer stage reject link


This slide shows a Transformer with a reject link to capture unhandled nulls. As
mentioned earlier, if you are using legacy null processing, best practice is to have reject
links for Transformers. Otherwise, any rejected rows will disappear. It is very difficult to
tell whether any rows have been rejected by a Transformer if you do not have reject
links to capture them.


Demonstration 2
Define derivations

• In this demonstration, you will:


 Define a stage variable
 Build a formatting derivation
 Use functions in derivations
 Build a conditional replacement derivation
 Specify null processing options
 Capture rejects


Demonstration 2: Define derivations


Demonstration 2:
Define derivations

Purpose:
You want to define derivations in the Transformer stage.
NOTE:
In this demonstration, and other demonstrations in this course, there may be tasks that
start with jobs you have been instructed to build in previous tasks. If you were not able
to complete the earlier job, you can import it from the
DSEssLabSolutions_V11_5_1.dsx file in your C:\CourseData\DSEss_Files\dsx files
directory. This file contains all the jobs built in the demonstrations for this course.
Steps:
1. Click Import, and then click DataStage Components.
2. Select the Import selected option, and then select the job you want from the list
that is displayed.
If you want to save a previous version of the job, be sure to save it under a new name
before you import the version from the lab solutions file.
Task 1. Build a formatting derivation.
1. Open up your TransSellingGroupOtherwise job and save it as
TransSellingGroupDerivations.


2. Open the Transformer.

3. From the toolbar, click Stage Properties, and then click the Stage >
Stage Variables tab.
4. Create a stage variable named HCDesc. Set its initial value to the empty string.
Its SQL type is VarChar, precision 255.

5. Close the Transformer Stage Properties window. The name of the stage
variable shows up in the Stage Variables window.


6. Double-click in the cell to the left of the HCDesc stage variable. Define a
derivation that places each row's special handling code within a string of the
following form: “Handling code = [xxx]”. Here “xxx” is the value in the
Special_Handling_Code column.
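
One derivation that produces this format, assuming the input link is named
Selling_Group_Mapping (substitute your own link name), uses the : string
concatenation operator:

   "Handling code = [" : Selling_Group_Mapping.Special_Handling_Code : "]"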

7. Create a new VarChar(255) column named Handling_Code_Description for
each of the LowCode and HighCode output links. You can create these on the
corresponding tabs at the bottom of the Transformer window.
8. Drag the value of the HCDesc stage variable to each of these link columns.


9. Compile and run. View the data in the output files.

Task 2. Use a function in a derivation.


1. Open the Transformer.
2. In the derivation for the Distribution_Channel_Description target column in
the LowCode output link, turn the output text to uppercase and trim the string of
any blanks.
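
A sketch of one such derivation, assuming the input link is named In (substitute
your own link name):

   Upcase(Trim(In.Distribution_Channel_Description))

Trim removes the surrounding blanks, and Upcase converts the result to uppercase.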


3. Compile, run, and view the results.

Task 3. Build a conditional replacement derivation.


1. Open the Transformer.
2. Write a derivation for the target Selling_Group_Desc columns in both the
LowCode and HighCode output links that replaces "SG055" by "SH055",
leaving the rest of the description as it is. In other words, "SG055 Live Swine",
for example, becomes "SH055 Live Swine".
NOTE: Use the IF THEN ELSE operator. Also, you may need to use the
substring operator and Len functions.
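
For reference, one possible form of the derivation, assuming the input link is
named In (a hypothetical name) and using the [start, length] substring operator:

   If In.Selling_Group_Desc[1, 5] = "SG055"
   Then "SH055" : In.Selling_Group_Desc[6, Len(In.Selling_Group_Desc) - 5]
   Else In.Selling_Group_Desc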

3. Compile, run, and test your job. Here is some of the output from the HighCode
stage. Notice specifically the row (550000), which shows the replacement of
SG055 with SH055 in the second column.


Task 4. Capture rejects.


1. Save your job as TransSellingGroupRejects.
2. Add another output link to a Peek stage. Name the link Rejects and the stage
Peek_Rejects.
3. Right-click over the link and then click Convert to reject.


4. Open up the Transformer and then click the Stage Properties icon (top left).
Select the Legacy null processing box (if it is not already selected).

5. Compile and run your job.


Your job probably will not have any rejects.
Results:
You defined derivations in the Transformer stage.


Loop processing
• For each row read, the loop is processed
 Multiple output rows can be written out for each input row
• A loop consists of:
 Loop condition: Loop continues to iterate while the condition is true
− @ITERATION system variable:
• Holds a count of the number of times the loop has iterated, starting at 1
• Reset to 1 when a new row is read
− Loop iteration warning threshold
• Warning written to log when threshold is reached
 Loop variables:
− Executed in order from top to bottom
− Similar to stage variables
− Defined on Loop Variables tab


Loop processing
With loops, multiple output rows can be written out for each input row. A loop consists
of a loop condition and loop variables, which are similar to stage variables. As long as
the loop condition is satisfied the loop variable derivations will continue to be executed
from top to bottom.
The loop condition is an expression that evaluates to true or false (like a constraint). It is
evaluated once after a row is read, before the loop variable derivations are executed.
You must ensure that the loop condition will eventually evaluate to false. Otherwise,
your loop will continue running forever. The loop iteration warning threshold is designed
to catch some of these cases. After a certain number of warnings, your job will abort.
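
For example, a minimal loop condition that writes three output rows for every
input row:

   @ITERATION <= 3

Because @ITERATION starts at 1 for each input row and increases by 1 on each
pass, this condition is guaranteed to eventually become false.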


Functions used in loop processing


• Key break detection
− When your data is grouped and sorted on a column, you can detect the last
row in a group using the LastRowInGroup(In.Col) function
• In.Col is the column the data is grouped by
• When multiple columns are part of the key, choose the inner-most
• Count(In.col, “sub-string”)
 Counts the number of occurrences of a substring in In.col
 Example: Count(“Red|Blue|Green”, “|”) = 2
• Field(In.col, “|”, n)
 Retrieves the n-th sub-field from a string, where the sub-string delimiter
in this example is “|”
− Example: Field(“abc|de|fghi”, “|”, 2) = “de”


Functions used in loop processing


Here are some functions typically used in loop processing. If your data is grouped and
sorted on a column, you can detect the last row in a group using the
LastRowInGroup(In.Col) function. You can use the Count function to count the
number of occurrences of a substring. You can use the Field function to retrieve the
n-th field in a string.


Loop processing example


• Each source row contains a field that contains a list of item colors
 Example: 23,Red|Green|Black
• For each row, separate the colors out into separate rows
 Example:
− 23,Red
− 23,Green
− 23,Black


Loop processing example


In this example, each source row contains a field that contains a list of item colors, as
shown in the example. You can use the Field function to parse out individual colors in
the list.


Loop processing example job

Source data

Results


Loop processing example job


This slide displays the loop processing example job. It shows the source data and the
final results. The source data row 16, for example, contains a list of four colors. In the
output results, four item 16 rows are written out, one for each color.
For each row read, the loop will iterate through the colors in the list.


Inside the Transformer stage

Count the number of colors

Iterate through the list of colors


Inside the Transformer stage


This slide shows the inside of the Transformer stage. The loop condition references the
@ITERATION system variable, which tracks the current iteration through the loop. The
Field function is used to parse individual colors from the list into the Color loop
variable. The Color loop variable is mapped to an output column. Each iteration is then
written out.
The @ITERATION system variable is incremented after each iteration through the loop.
Eventually, it will exceed the number contained in the stage variable NumColors, and
then the loop condition will become false.
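
A sketch of the complete set of expressions, assuming the input link is named In
and the list column is named Colors (illustrative names; use the names from your
own job):

   NumColors (stage variable):   Count(In.Colors, "|") + 1
   Loop While (loop condition):  @ITERATION <= NumColors
   Color (loop variable):        Field(In.Colors, "|", @ITERATION)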


Demonstration 3
Loop processing

• In this demonstration, you will:


 Create a job that outputs multiple rows for each input row
 Use a loop to iterate through a list of colors contained in a single column
of the input


Demonstration 3: Loop processing


Demonstration 3:
Loop processing

Purpose:
You want to create loop variables and loop conditions. You also want to
process input rows through a loop.
NOTE:
In this demonstration, and other demonstrations in this course, there may be tasks that
start with jobs you have been instructed to build in previous tasks. If you were not able
to complete the earlier job, you can import it from the
DSEssLabSolutions_V11_5_1.dsx file in your C:\CourseData\DSEss_Files\dsx files
directory. This file contains all the jobs built in the demonstrations for this course.
Steps:
1. Click Import, and then click DataStage Components.
2. Select the Import selected option, and then select the job you want from the list
that is displayed.
If you want to save a previous version of the job, be sure to save it under a new name
before you import the version from the lab solutions file.
Task 1. Pivot.
1. Open C:\CourseData\DSEss_Files\ColorMappings.txt in WordPad.
This is your source file. Each Item number is followed by a list of colors.


2. Create a new parallel job named TransPivot. Name the links and stages as
shown.

3. Import the table definition for the ColorMappings.txt file. Store it in your
_Training>Metadata folder.
4. Open the ColorMappings stage. Edit the stage so that it reads from the
ColorMappings.txt file. Verify that you can view the data.

5. Open the Transformer stage. Drag the Item column across to the ItemColor
output link.


6. Create a new VarChar(10) column named Color.

7. Create a new integer stage variable named NumColors.


This will store the number of colors in the list of colors.
8. Next, click in the Derivation box beside the NumColors stage variable to set the
variable. Use the Count string function to count the number of occurrences of
the substring “|” in the Colors input column.
Note that the number of “|” delimiters in the color list is one less than the
number of colors.
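
One derivation that satisfies this, assuming the input link is named
ColorMappings (use the link name from your own job):

   Count(ColorMappings.Colors, "|") + 1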

9. Open the Loop Condition window. Double-click the white box beside the
Loop While box to open the Expression Editor. Specify a loop condition that will
iterate for each color. The total number of iterations is stored in the NumColors
stage variable. Use the @ITERATION system variable.
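
One condition that satisfies this:

   @ITERATION <= NumColors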


10. Create a new VarChar(10) loop variable named Color.

11. For each iteration, store the corresponding color from the colors list in the Color
loop variable. Use the Field function to retrieve the color from the colors list.
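
One derivation that satisfies this, again assuming the input link is named
ColorMappings:

   Field(ColorMappings.Colors, "|", @ITERATION)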

12. Drag the Color loop variable down to the derivation cell next to the Color output
link column.

13. Edit the target stage to write to a sequential file named ItemColor.txt in your lab
files Temp directory. Be sure the target file is written with a first row of column
names.


14. Compile and run your job. You should see more rows going into the target file
than coming out of the source file.

15. View the data in the target stage. You should see multiple rows for each item
number.

16. Test that you have the right results. For example, count the number of rows for
item 16.
Results:
You created loop variables and loop conditions. You also processed input
rows through a loop.


Group processing
• LastRowInGroup(In.Col) can be used to determine when the last
row in a group is being processed
 Transformer stage must be preceded by a Sort stage that sorts the data
by the group key columns
• Stage variables can be used to calculate group summaries and
aggregations


Group processing
In group processing, the LastRowInGroup(In.Col) function can be used to determine
when the last row in a group is being processed.
This function requires the Transformer stage to be preceded by a Sort stage that sorts
the data by the group key columns.


Group processing example


• In order to use the LastRowInGroup(In.Col) function, a Sort stage is
required before the Transformer
• Here, the ItemColor.txt file contains items sold with their individual
colors
• For each item, you want a list of all the colors it can have

Sort by group key


Group processing example


This slide shows the group processing example job. Notice the Sort stage preceding
the Transformer stage. This is required when using the LastRowInGroup() function.
The Sort stage does not have to immediately precede the Transformer, but the
DataStage compiler must be able to determine from the job flow that the data is
grouped in the right way.


Job results

Before After


Job results
These slides show the before and after job results. Notice that the individual colors for
the group of Item records show up in the results as a list of colors.
The source data is grouped by item number. The data is also sorted by item number,
but this is not required. The LastRowInGroup() function is used to determine that, for
example, the row 16 white color is the last row in the group. At this point the results for
the group can be completed and written out. In this example, the group result consists of
a list of all the colors in the group. But this is just an example; any type of group
aggregation can be produced in a similar way.


Transformer logic

LastRowInGroup()

TotalColorList

CurrentColorList


Transformer logic
In this example, the IsLastInGroup stage variable is used as a flag. When it equals
“Y”, the last row is currently being processed. The LastRowInGroup() function is used
to set the flag.
The CurrentColorList stage variable is built up as each row in the group is processed.
When the IsLastInGroup flag is set, CurrentColorList contains the whole list except for
the current row, so TotalColorList is derived by concatenating the current color onto
CurrentColorList. After TotalColorList has been derived for the last row in the group,
CurrentColorList is reset to the empty string, ready for the next group. This ordering
works because stage variables execute from top to bottom.
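
A sketch of the three derivations, in top-to-bottom order, assuming the input link
is named Sorted with columns Item and Color (illustrative names):

   IsLastInGroup:    If LastRowInGroup(Sorted.Item) Then "Y" Else "N"
   TotalColorList:   If IsLastInGroup = "Y" Then CurrentColorList : Sorted.Color Else ""
   CurrentColorList: If IsLastInGroup = "Y" Then "" Else CurrentColorList : Sorted.Color : ","

With an output link constraint of IsLastInGroup = "Y", exactly one row per group is
written, carrying the complete comma-delimited color list.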


Loop through saved input rows


• The SaveInputRecord() function can be used to save a copy of
the current input row into a queue for later retrieval
 Located in the functions Utility folder
 Returns the number of rows saved in the queue
 Can only be invoked in a stage variable derivation
• The GetSavedInputRecord() function can be used to retrieve
rows in the queue
 Located in the functions Utility folder
 Returns the index of the row in the queue
 Can only be invoked in a loop variable derivation
• Can use these functions to iterate through a set of saved rows
adding group results to individual group records


Loop through saved input rows


The Transformer stage supports looping through saved input rows. The
SaveInputRecord() function can be used to save a copy of the current input row into a
queue for later retrieval. The GetSavedInputRecord() function can be used to retrieve
rows in the queue. You can use these functions to iterate through a set of saved rows,
adding group results to individual group records.


Example job results

Before After


Example job results


These slides show the before and after results for the example job. Here, for example,
there are two item 25 records. In the output, the total list of colors for item 25 is
added to each individual record. So there are two item 25 rows, each containing the
total list of item 25 colors.
This is similar to what can be accomplished using a fork-join job design.


Transformer logic

Save input row

Iterate through saved rows when the last group row is processed

Retrieve saved row

Output


Transformer logic
This slide shows Transformer logic. After saving the records in a group, the records are
retrieved in a loop. An output row is written for each iteration through the loop. This
consists of data from the retrieved row plus the total color list.
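
A sketch of how the pieces fit together, assuming an IsLastInGroup stage variable
like the one described earlier (names are illustrative):

   NumSavedRows (stage variable):  SaveInputRecord()
   Loop While (loop condition):    IsLastInGroup = "Y" And @ITERATION <= NumSavedRows
   SavedRowIndex (loop variable):  GetSavedInputRecord()

Every input row is queued by SaveInputRecord(). When the last row of a group
arrives, the loop condition becomes true, and each call to GetSavedInputRecord()
makes the next saved row's columns available to the output derivations.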


Parallel job debugger


• Set breakpoints on links in a parallel job
• Specify a condition under which a breakpoint is enabled
 Every nth row
 Expression
− Expressions can include input columns, operators, and string constants
• Examine the data in the link columns when the breakpoint is enabled
 Viewed in the Debug window
 The data can be viewed for each of the nodes the stage/operator is
running in
• Optionally, add columns to the watch list
 Displays values for each node with enabled breakpoints


Parallel job debugger


A breakpoint is a point in the job where processing is suspended. Breakpoints are set
on links. When data flows through the link, the breakpoint suspends processing, if the
breakpoint condition is satisfied.
When a breakpoint is enabled, the link columns of data are displayed in the Debug
window.
Typically jobs are running on multiple partitions (nodes). The link columns of data are
displayed for each node.


Set breakpoints
Debug window

Set breakpoint

Breakpoint icon


Set breakpoints
To set a breakpoint, select the link and then click the Toggle Breakpoint icon in the
Debug window. To open the Debug window, click Debug>Debug Window.
Use the icons in the Debug window toolbar to set and edit breakpoints, add watch
variables, run the job within the debugger, and other operations.
When a breakpoint is set on a link, a small icon is added to the link on the diagram, as
indicated.


Edit breakpoints
• Select the link and then click Edit Breakpoints
• Expressions can include input columns, operators, and string
constants

Breakpoint
conditions


Edit breakpoints
The breakpoint condition is either Every N Rows or an expression that you build using
the expression editor. Expressions can include input columns, operators (=, <>, and so
on), and string constants.
The Edit Breakpoints window displays all the breakpoints that are set in the job. You
can edit the breakpoint condition for any selected breakpoint in the job.


Running a parallel job in the debugger


• Click the Start/Continue icon in the Debug window
 Alternatively, click Run to End to run the job to completion
• The job stops at the next enabled breakpoint
• Data in the link columns is displayed
 One tab per node

Start/Continue icon

Node 1 tab

Enabled breakpoint

Link columns data


Running a parallel job in the debugger


Click the Start/Continue icon in the Debug window toolbar to run the job to the next
enabled breakpoint. The link where the breakpoint is enabled is graphically
emphasized, as you can see in the diagram.
In the Debug window, there are separate tabs for each of the nodes on which the
breakpoints are enabled. Click a tab to view the link columns data on that node.


Add columns to the watch list


• Right-click over the column to add
 Select Add to Watch List
• Watch list displays values for all
nodes with enabled breakpoints

Watch list


Add columns to the watch list


You can add columns to a watch list. These are typically the columns of data you are
most interested in. The data for each of the active nodes is displayed horizontally next
to the column name.


Demonstration 4
Group processing in a Transformer

• In this demonstration, you will:


 Use the LastRowInGroup() function to determine when you are
processing the last row in a group
 Use stage variables to accumulate group results
 Use the SaveInputRecord() and GetSavedInputRecord() functions to
add group results to individual records
 Use the parallel job debugger to debug a parallel job
 Set breakpoints
 Edit breakpoint conditions
 Add watch variables
 View column data at breakpoints


Demonstration 4: Group processing in a Transformer


Demonstration 4:
Group processing in a Transformer

Purpose:
You want to process groups of data rows in a Transformer. Later you will use
the parallel job debugger.
NOTE:
In this demonstration, and other demonstrations in this course, there may be tasks that
start with jobs you have been instructed to build in previous tasks. If you were not able
to complete the earlier job, you can import it from the
DSEssLabSolutions_V11_5_1.dsx file in your C:\CourseData\DSEss_Files\dsx files
directory. This file contains all the jobs built in the demonstrations for this course.
Steps:
1. Click Import, and then click DataStage Components.
2. Select the Import selected option, and then select the job you want from the list
that is displayed.
If you want to save a previous version of the job, be sure to save it under a new name
before you import the version from the lab solutions file.
Task 1. Process groups in a Transformer.
1. Create a new job named TransGroup. Name the links and stages as shown.


2. Import a table definition for the ItemColor.txt file that you created in the
previous lab. Reminder: This file is located in the Temp directory rather than the
DSEss_Files directory. (If you did not previously create this file, you can use
the ItemColor_Copy.txt file in your lab files directory.)
Below, a portion of the file is displayed.

3. Edit the source Sequential File stage to read data from the ItemColor.txt file.
4. On the Format tab, remove the Record delimiter property in the Record level
folder. Then add the Record delimiter string property and set its value to DOS
format.
This is because the file you created in your Temp directory uses Windows DOS
format.
5. Be sure you can view the data.


6. Edit the Sort stage. Sort the data by the Item column.

7. On the Sort stage Output > Mapping tab, drag all columns across.


8. On the Sort Input > Partitioning tab, hash partition by the Item column.

9. Open the Transformer stage. Drag the Item column across to the output link.
Define a new column named Colors as a VarChar(255).

10. Create a Char(1) stage variable named IsLastInGroup. Initialize it with 'N'
(meaning "No").
11. Create a VarChar(255) stage variable named TotalColorList. Initialize it with
the empty string.


12. Create a VarChar(255) stage variable named CurrentColorList. Initialize it
with the empty string.

13. For the derivation for IsLastInGroup, use the LastRowInGroup() function on
the Item column to determine if the current row is the last in the current group of
Items. If so, return 'Y' (meaning "Yes"); else return 'N'.

14. For the derivation of TotalColorList, return the concatenation of the current
color onto CurrentColorList when the last row in the group is being processed.
Otherwise, return the empty string.

15. For the derivation of CurrentColorList, return the concatenation of the current
color onto CurrentColorList when the last row in the group is not being
processed. When the last row is being processed, return the empty string.

16. Drag the TotalColorList stage variable down to the cell next to Colors in the
target link.


17. Next, define a constraint for the target link. Add the constraint
IsLastInGroup = 'Y' to output a row only when the last row in the group is
being processed.

18. Click OK to close the Transformer.


19. Edit the target Sequential File stage. Write to a file named ColorMappings2.txt
in your lab files Temp directory.


20. Compile and run your job. Check the job log for error messages.

View the data in your target stage. For each set of Item rows in the input file,
you should have a single row in the target file, containing the item number
followed by a comma-delimited list of its colors.

Task 2. Add group results to individual group records.


1. Save your job as TransGroupLoop.
2. Open the Transformer stage.


3. Add a new integer stage variable named NumSavedRows.

4. For its derivation invoke the SaveInputRecord() function, found in the Utility
folder.
This saves a copy of the row into the Transformer stage queue.

5. Define the loop condition. Iterate through the saved rows after the last row in the
group is reached.
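
One loop condition that satisfies this, using the IsLastInGroup and NumSavedRows
stage variables defined earlier:

   IsLastInGroup = "Y" And @ITERATION <= NumSavedRows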

6. Define an integer loop variable named SavedRowIndex.


7. For its derivation invoke the GetSavedInputRecord() function in the Utility
folder. This retrieves a copy of the row from the Transformer stage queue.

8. Drag the Color column across from the input link to the target output link. Put
the column second in the list of output columns.

9. Remove the output link constraint by right-clicking the constraint under
ColorMappings2 and opening the Constraints dialog. Double-click the
constraint definition, and clear it.

10. Compile and run. Check the job log for errors. View the data in the output.


Task 3. DataStage parallel job debugger.


1. Open up your TransSellingGroupOtherwise job and save it as
TransSellingGroupDebug.

NOTE: If you do not have a working copy of the TransSellingGroupOtherwise
job, import the TransSellingGroupOtherwise.dsx job in your lab files dsxfiles
directory.
2. Open up your source stage. Set the stage to read from the
Selling_Group_Mapping_Debug.txt file.
3. From Job Properties, create a job parameter named Channel. Make it a string
with a default value of "Food Service", with the quotes.


4. In the Transformer, open up the Constraints window. Add to the LowCode and
HighCode constraints the condition that the
Distribution_Channel_Description column value matches the Channel
parameter value.
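
For example, a condition of the following form, assuming the input link is named
In (use your own link name):

   In.Distribution_Channel_Description = Channel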

5. Compile the job.


6. From the Designer menu, click Debug > Debug Window.

Select the LowCode output link, and then click Toggle Breakpoint in the
Debug window. Repeat for the HighCode and RangeErrors links. Verify that
the breakpoint icon has been added to the links on the diagram.

7. Select the RangeErrors link, and then click Edit Breakpoints in the
Debug window.


8. Set the breakpoint Expression to break when
Distribution_Channel_Description equals "Food Service".

9. Similarly, set the LowCode and HighCode breakpoint expressions to break
when Distribution_Channel_Description does not equal “Food Service”.
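
For example, expressions of the following form, where In stands for your input
link name:

   RangeErrors link:            In.Distribution_Channel_Description = "Food Service"
   LowCode and HighCode links:  In.Distribution_Channel_Description <> "Food Service"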

10. Click Start/Continue in the Debug window.


11. When prompted for the job parameter value, accept the default of
"Food Service", and then click OK.

Notice that the debugger stops at the RangeErrors link. The column values are
displayed in the Debug window.
12. Click on the Node 1 and Node 2 tabs to view the data values for both
nodes. Notice that each seems to have the correct value in the
Distribution_Channel_Description column. And the
Special_Handling_Code is not out of range. So why are these values going
out the otherwise link instead of down the LowCode link?


13. In the Debug window, right-click over the Distribution_Channel_Description
column, and then click Add to Watch List.
This way you can highlight the values for the column in both nodes.

14. In the Debug window, click Run to End to see where the other rows go.

The job finishes and all the rows go down the otherwise link. But why? This
should not happen.
Note: To quickly see how many items are written to each sequential file, right-
click anywhere on the canvas, and then ensure that there is a check mark
beside Show performance statistics.


15. In the Debug window, click the Start/Continue Debugging icon to start the job
again. This time, remove the quotes from around “Food Service” when
prompted for the job parameter value.

16. Things definitely look better this time. More rows have gone down the
LowCode link and the breakpoint for the LowCode link has not been activated.
The breakpoint for the otherwise link has been activated. Since the
Special_Handling_Code value is out of range, this is as things should be.

17. In the Debug window, click Run to End to continue the job.
This time the job completes.


18. View the data in the LowCode file to verify that it contains only “Food Service”
rows.

19. View the data in the RangeErrors file to verify that it does not contain any
“Food Service” rows that are not out of range.
There appear to be several “Food Service” rows that should have gone out the
LowCode link.

20. See if you can fix the bugs left in the job.
Hint: Try recoding the constraints in the Transformer.
Results:
You processed groups of data rows in a Transformer. Later you used the
parallel job debugger to examine the data.


Checkpoint
1. What occurs first? Derivations or constraints?
2. Can stage variables be referenced in constraints?
3. What function can you use in a Transformer to determine when you
are processing the last row in a group? What additional stage is
required to use this function?
4. What function can you use in a Transformer to save copies of input
rows?
5. What function can you use in a Transformer to retrieve saved rows?


Checkpoint


Checkpoint solutions
1. Constraints.
2. Yes.
3. LastRowInGroup(In.Col) function.
The Transformer stage must be preceded by a Sort stage which
sorts by the group key column or columns.
4. SaveInputRecord().
5. GetSavedInputRecord().


Checkpoint solutions


Unit summary
• Use the Transformer stage in parallel jobs
• Define constraints
• Define derivations
• Use stage variables
• Create a parameter set and use its parameters in constraints and
derivations


Unit summary



Repository functions

IBM Infosphere DataStage v11.5


Unit objectives
• Perform a simple Find
• Perform an Advanced Find
• Perform an impact analysis
• Compare the differences between two table definitions
• Compare the differences between two jobs


Unit objectives


Quick find

Name with wild card character (*)

Include matches in object descriptions

Execute Find


Quick find
This slide shows an example of a Quick Find. It searches for objects matching the
name in the Name to find box. The asterisk (*) is a wild card character standing for
zero or more characters.
Quick Find highlights the first object that matches in the Repository window. You can
click Find repeatedly to move through more matching objects.
If the Include descriptions box is checked, the text in Short descriptions and Long
descriptions will be searched as well as the names of the objects.


Found results

Click to open Advanced Find window

Click Next to highlight the next item

Found item


Found results
This slide shows the results from the Quick Find. The first found item is highlighted.
Click Next to go to the next found item.
You can move to the Advanced Find window by clicking the Adv... button. The
Advanced Find window lists all the found results in one list.


Advanced Find window

Found items

Search options


Advanced Find window


The Advanced Find window lists all the results in a single window, as shown in this
slide.
You can also initiate searches from within this window. The Advanced Find window
supports more search options than the Quick Find. These options are listed and
described on the next page.


Advanced Find options


• Type: type of object
 Select the list of types of objects to search: Table definitions, stages, …
• Creation:
 Select by a range of dates and/or user who created the object
− For example, up to a week ago
• Last modification:
 Select by a range of dates of the last modification
• Where used: objects that use the searched for objects
 For example, a job that uses a specified table definition
• Dependencies of: objects that the searched-for objects depend on
 For example, a table definition that is referenced in a specified job
• Options
 Case sensitivity
 Search within last result set

Advanced Find options


This slide lists and describes the Advanced Find options. As with the Quick Find, you
can select the types of objects you want to search for. In addition, you can specify a
number of options regarding how the object was created: when it was created, by
whom, and so forth.
The Where used and Dependencies of options create impact analysis reports, which
are discussed later in this unit.


Using the found results

Compare objects

Create impact analysis

Export to a file


Using the found results


Once you have captured a set of results, you can use the set of found results in various
ways. For example, you can compare the objects, export them to a file, or create impact
analyses. To initiate these, select the objects and then click your right mouse button.
Select the operation from the menu that is displayed.


Performing an impact analysis


• Find where an object is used
 Find the jobs or stages a table definition is used in
 Find the job sequences a job is in
 Find the jobs, table definitions, stages where columns are used
• Find object dependencies
 Find the stages a job depends on
• Right-click over an object to open the menu
• The dependency can be displayed textually or graphically


Performing an impact analysis


An impact analysis is aimed at finding the impact of making a change to an object (table
definition, job). What other objects will be impacted if the change is made? One of the
most common uses of this is when a file or table that a job reads from or writes to is
changed. Perhaps, a column is added or removed. The table definition that describes
this table or file is also changed. This impacts any job that uses that table definition. The
impact analysis will provide a list of all the jobs that need to be modified and retested.
You can perform an impact analysis from two directions. You can find where an object
is used, which displays the objects that are dependent on a selected object. Or you can
search for object dependencies.
A dependency graph of the results can be displayed textually or graphically.


Initiating an impact analysis

Find jobs a table definition is used in


Initiating an impact analysis


It is easy to initiate an impact analysis. Select the object and then click your right mouse
button. If you are searching for other objects that are dependent on the selected job,
click Find where used. If you are searching for objects that the selected object is
dependent on, click Find dependencies. You can then select the types of objects you
are interested in.
There are two versions of each of these commands. The deep version differs only in
the range of different types of objects you can select from.


Results in text format

Results

Results tab


Results in text format


There are two formats that the dependency graph can be presented in. This slide
shows the detailed results of an impact analysis displayed in text format.


Results in graphical format

Results

Jobs that depend on the table definition

“Bird’s Eye” view

Graphical Results tab

Results in graphical format


This slide shows the graphical results of an impact analysis.
Click the Results - Graphical tab at the bottom of the window to display this format.
The results show that there are two jobs (on the left) that depend on the table definition
on the right.
The Bird’s Eye View window appears in the lower right-hand corner. It displays how
the diagram fits onto the canvas. This will reveal if there are any parts of the diagram
that are extending outside the viewing area.
At the top of the window are controls for zooming in and zooming out.


Displaying the dependency graph


• Displays in detail how one object (for example, a job) depends on
another object (a table definition)
• Select the dependency in the Results list (textual or graphical) and
then click Show dependency path to ‘…’

Show dependency graph


Displaying the dependency graph


This slide shows how to display a dependency graph for a table definition. A
dependency graph displays in detail how one object (for example, a job) depends on
another object (for example, a table definition).


Displaying the dependency path

Table definition

Job containing (dependent on) the table definition


Displaying the dependency path


This slide shows the dependency graph. On the left is the job. On the far right is the
table definition. This graph answers the question, “How does this job depend on this
table definition?” The answer is as follows. The job contains a stage, which contains an
output link, which contains columns that are in the table definition.


Generating an HTML report


• Where used:
 \_Training\Metadata\Range_Description.txt
− Case insensitive: Yes
− Find in last result set: No
− Name and description matching: Either name or description can match

Dependency path descriptions

Name: LookupWarehouseItemRangeRef
Folder path: \Training\Jobs    Type: Parallel Job
Sample dependency path: LookupWarehouseItemRangeRef -> Range_Description ->
Range_Description -> EndItem -> EndItem -> Range_Descriptions.txt

Name: LookupWarehouseItemRangeStream
Folder path: \Training\Jobs    Type: Parallel Job
Sample dependency path: LookupWarehouseItemRangeStream -> Range_Description ->
Range_Description -> EndItem -> EndItem -> Range_Descriptions.txt


Generating an HTML report


Viewing column-level data flow


• Display how data will flow through the job
 How data will flow to a selected column
 How data flows from a selected column
• The analysis is based on column mappings at design time
 Information Server Metadata Workbench can provide reports based on
runtime analyses
• The flow is graphically displayed on the diagram through highlighting
• You can also trace column data flow from Repository table definitions
 Select the table definition in the Repository
 Right-click Find where column used
 Select columns to trace


Viewing column-level data flow


Column-level data flow shows how input columns are mapped to output columns
through the job. You can trace how data in a particular column will move through the
job.
To create a column-level data flow analysis, open a job. Then select a stage. Right-click
Show where data flows to / originates. Select a link flowing in or out of the stage or
the stage table definition. Then select one or more columns on the link. You can also
right-click outside of any stage and select Configure data flow view.
You can trace forwards from a column or backwards from a column. The latter answers
the question, “Where did the data in this column come from?” The former answers the
question, “Where is the data in this column going?”


Finding where a column originates

Select, then click Show where data originates from

Select columns


Finding where a column originates


This slide shows an example job. A column in the target Data Set stage has been
selected. You want to know where the data in this column comes from.
Finding where data flows to involves a similar process. Select a stage with an output
link. Click Show where data flows to. Select the columns you want to trace.


Displayed results


Displayed results
This slide shows the job after the graph has been generated. The path from the Items
Sequential File stage to the target Data Set stage is highlighted in yellow.


Finding the difference between two jobs


• Example: Job1 is saved as Job2. Changes are made to Job2. What
changes have been made?
 Job1 may be a production job.
 Job2 is a copy of the production job after enhancements or other changes
have been made to it


Finding the difference between two jobs


It is sometimes very useful to determine the differences between two jobs.
Here, for example, Job1 may be a production job. Job2 is a copy of the production job
after enhancements or other changes have been made to it. You now want to compare
the enhanced version of the job to the previous version.


Initiating the comparison

Job with the changes


Initiating the comparison


This slide shows how to initiate a comparison between two jobs. Select one of the jobs.
Click your right mouse button, and then click Compare against…


Comparison results

Click underlined item to open stage editor

Click stage and link references to highlight in open jobs


Comparison results
This slide shows the comparison results and highlights certain features in the report. In
this particular example, the report lists changes to the name of the job, changes to
property values within stages, and changes to column definitions.
Notice that some items are underlined. You can click on these to open the item in a
stage editor.


Saving to an HTML file

Click when Comparison Results window is active


Saving to an HTML file


The comparison results can be saved into an HTML file. This slide shows how to initiate
this. Click File > Save As with the Comparison Results window open.


Comparing table definitions


• Same procedure as when comparing jobs


Comparing table definitions


You can also compare table definitions. This slide shows the results of comparing two
example table definitions.


Checkpoint
1. You can compare the differences between what two kinds of objects?
2. What “wild card” characters can be used in a Find?
3. You have a job whose name begins with “abc”. You cannot
remember the rest of the name or where the job is located. What
would be the fastest way to export the job to a file?
4. Name three filters you can use in an Advanced Find.


Checkpoint
Write your answers here:


Checkpoint solutions
1. Jobs. Table definitions.
2. Asterisk (*). It stands for zero or more characters.
3. Do a Find for objects matching “abc*”. Filter by type job. Locate the
job in the result set, click the right mouse button over it, and then
click Export.
4. Type of object, creation date range, last modified date range, where
used, dependencies of, other options including case sensitivity and
search within last result set.


Checkpoint solutions


Demonstration 1
Repository functions

• In this demonstration, you will:


 Execute a quick find
 Execute an advanced find
 Generate a report
 Perform an impact analysis
 Find differences between jobs
 Find differences between table definitions


Demonstration 1: Repository functions


Demonstration 1:
Repository functions

Purpose:
You want to use repository functions to find DataStage objects, generate a
report, and perform an impact analysis. Finally you will find the differences
between two jobs and between two table definitions.
Windows User/Password: student/student
DataStage Client: Designer
Designer Client User/Password: student/student
Project: EDSERVER/DSProject
NOTE:
In this demonstration, and in other demonstrations in this course, some tasks start with jobs that you were instructed to build in previous tasks. If you were not able to complete the earlier job, you can import it from the DSEssLabSolutions_v11_5_1.dsx file in your C:\CourseData\DSEss_Files\dsx files directory. This file contains all the jobs built in the demonstrations for this course.
Steps:
1. Click Import, and then click DataStage Components.
2. Select the Import selected option, and then select the job you want from the list
that is displayed.
If you want to save a previous version of the job, be sure to save it under a new name
before you import the version from the lab solutions file.
Task 1. Execute a Quick Find.
1. In the left pane, in the Repository window, click Open quick find at the top.

2. In the Name to find box, type Lookup*.


3. In the Types to find list, click Unselect all, and then under Jobs, select
Parallel Jobs.
4. Select the Include descriptions box.


5. Click Find. The first found item will be highlighted.


Note: Your results might differ somewhat from the screenshots shown in this
unit, since the results depend on what each person has done on their systems.

6. Click Next to highlight the next item.


Task 2. Execute an Advanced Find.
1. Click on the Adv button. This opens the Repository Advanced Find window.
2. In the Name to find field choose Lookup* from the drop down menu. If
Lookup* is not available, type it in the field.
3. In the Type box, ensure Parallel Jobs and Table Definitions are selected.
4. In the Last modification panel, specify objects modified within the last week by
your user ID, student.
5. In the Where used panel, select the DSProject\_Training\Metadata\
Range_Descriptions.txt table definition.
This reduces the list of found items to those that use this table definition.


6. Click Find.

7. Select the found items, right-click them, and then click Export.
8. Export these jobs to a file named LookupJobs.dsx in your lab files Temp
folder.
9. Close the Repository Export window.
10. Click the Results – Graphical tab.

Next, you want to explore some of the graphical tools.


11. Expand the graphic. Move it around either by holding down the right mouse button over it and dragging, or by moving the icon in the Bird's Eye view window. Explore.
Task 3. Generate a report.
1. Click File > Generate report to open a window from which you can generate a
report describing the results of your advanced find.
2. Click OK to generate the report, and then click the top link to view the report.
This report is saved in the Repository where it can be viewed by logging onto
the Reporting Console.

3. Scroll through the report to view its contents.


Task 4. Perform an impact analysis.


1. In the graphical results window, right-click on
LookupWarehouseItemRangeRef. Click Show dependency path to
'Range_Descriptions.txt'.

2. If necessary, use the Zoom control to adjust the size of the dependency path so
that it fits into the window.

3. Hold your right mouse button over a graphical object and move the path around.
4. Close the Advanced Search window.


Task 5. Find the differences between two jobs.


1. Open your LookupWarehouseItemRangeRef job, and save it as
LookupWarehouseItemRangeRefComp into your _Training > Jobs folder.
2. Make the following changes to the LookupWarehouseItemRangeRefComp
job:
• Open the Range_Description sequential file stage, and then on the
Columns tab, change the length of the first column (StartItem) to 111.
On the Properties tab, change the First Line is Column Names to
False.
• Change the name of the link going to the Warehouse_Items target
Sequential File stage to WAREHOUSE_ITEMS.
• Open the Lookup stage. In the constraints window, change the Lookup
Failure condition to Drop.
3. Save the changes to your job.
4. Open up both the LookupWarehouseItemRangeRef and the
LookupWarehouseItemRangeRefComp jobs. Click Tile from the Window
menu to display both jobs in a tiled manner.


5. In the Repository window, right-click your LookupWarehouseItemRangeRefComp job, and then select Compare Against.
6. In the Compare window, click your LookupWarehouseItemRangeRef job, and
then click OK.
The Comparison Results window appears as shown.

7. Click on a stage or link in the report, for example, Range_Description.


Notice that the stage is highlighted in both of the jobs.
8. Click on one of the underlined words.
Notice that the editor is opened for the referenced item.
9. With the Comparison Results window selected, click File > Save as, and then
save your report as an HTML file to your DSEss_Files\Temp folder.
10. Open up the HTML file in a browser to see what it looks like.


Task 6. Find the differences between two table definitions.


1. In the Repository pane on the left side, in the _Training\Metadata folder, right-click your Warehouse.txt table definition, and then click Create copy to create CopyOfWarehouse.txt.

2. Open CopyOfWarehouse.txt, and then on the General tab, update the Short
description field to reflect your name.
3. On the Columns tab, change the name of the Item column to ITEM_ZZZ, and
then change its type and length to Char(33).
4. Click OK, and click Yes if prompted.
5. Right-click over your copy of the table definition, and then select Compare
against.
6. In the Comparison window select your original Warehouse.txt table.
7. Click OK to display the Comparison Results window.

Results:
You used repository functions to find DataStage objects, generate a report,
and perform an impact analysis. Finally you found the differences between
two jobs and between two table definitions.


Unit summary
• Perform a simple Find
• Perform an Advanced Find
• Perform an impact analysis
• Compare the differences between two table definitions
• Compare the differences between two jobs


Unit summary



Work with relational data

IBM Infosphere DataStage v11.5


Unit objectives
• Import table definitions for relational tables
• Create data connections
• Use ODBC and DB2 Connector stages in a job
• Use SQL Builder to define SQL SELECT and INSERT statements
• Use multiple input links into Connector stages to update multiple
tables within a single transaction
• Create reject links from Connector stages to capture rows with
SQL errors


Unit objectives


Importing relational table definitions


• Can import using ODBC or using Orchestrate schema definitions
 With Orchestrate schema definitions, can import only one table at a time
 With ODBC, multiple tables can be imported at one time
− Requires ODBC data source connection
• Import > Table Definitions > Orchestrate Schema Definitions
• Import > Table Definitions > ODBC Table Definitions


Importing relational table definitions


There are two primary methods for importing relational table definitions: the orchdbutil utility (Orchestrate schema import) and ODBC import.
The orchdbutil utility is limited to importing one table at a time. However, this utility is
also available as a command-line utility that can be scripted to import a large number of
table definitions.
Within Designer, ODBC offers a simple way of importing table definitions.


Orchestrate schema import

(Screenshot: Import Orchestrate Schema window. Callouts: import database table option, table name, DBMS type, database name.)


Orchestrate schema import


This slide shows the Import Orchestrate Schema window. It highlights the properties
to set to import a table definition. As you would expect, you need to provide information,
including the table name, database type, database name, and a user ID and password
authorized to access the database table.
Depending on how DataStage is configured, you also may need to specify the
database server.


ODBC import
(Screenshot: ODBC Import Metadata window. Callouts: ODBC data source name selection, start import, tables to import, target Repository folder for the table definitions.)


ODBC import
This slide shows the ODBC Import Metadata window. The ODBC data source that accesses the database containing the tables to be imported must have been defined previously.
Select one or more tables to import. In the To folder box, select the Repository folder in
which to store the imported table definitions.


Connector stages
• Connector types include:
 ODBC
 DB2
 Oracle
 Teradata
• All Connector stages have the same look and feel and the same core
set of properties
 Some types include properties specific to the database type
• Job parameters can be inserted into any property
• Required properties are visually identified
• Parallel support for both reading and writing
 Read: parallel connections to the server and modified SQL queries for each
connection
 Write: parallel connections to the server

Connector stages
Connector stages exist for all the major database types, and additional types are added
on an ongoing basis. All Connector types have the same look and feel and the same
core set of properties.
Other stages exist for accessing relational data (for example, Enterprise stages), but in
most cases Connector stages offer the most functionality and the best performance.
Connector stages offer parallel support for both reading from and writing to database
tables. This is true whether or not the database system itself implements parallelism.


Reading from database tables

(Screenshot: a parallel job reading from a database table through an ODBC Connector stage.)


Reading from database tables


This slide shows a parallel job that reads from a database table using the ODBC Connector stage. The ODBC Connector can read from any database for which an ODBC data source has been defined.


Connector stage GUI

(Screenshot: ODBC Connector stage editor. Callouts: Properties tab, Columns tab, Test connection button, View data button.)


Connector stage GUI


This slide shows the inside of the ODBC Connector stage and highlights some of its
features. Shown here is the ODBC Connector, but other Connector stages have the
same look-and-feel.
At the top left is the link name box. Use it to select a link and display its properties. This is useful when there are multiple input and/or output links.
Just as with the other stages, Connector stages have a Columns tab where table
definitions can be imported.


Navigation panel
• Stage tab
 Displays the subset of properties in common to all uses of the stage,
regardless of its input and output links
 For example, database connection properties
• Output / Input tab
 Displays properties related to the output or input link
 For example, the name of the table the output link is reading from or the
input link is writing to


Navigation panel
In the Navigation panel, highlight a link or stage to display the properties associated with it.


Connection properties
• ODBC Connection properties
 Data source name or database name
 User name and password
 Requires a defined ODBC data source on the DataStage Server
• DB2 Connection properties
 Instance
− Not necessary if a default is specified in the environment variables
 Database
 User name and password
 DB2 client library file
• Use Test to test the connection
• Can load connection properties from a data connection object
(discussed later)


Connection properties
The particular set of connection properties depends on the type of stage. All require a
data source or database name and a user name and password. Some types of
Connector stages will include additional connection properties. The DB2 Connector
stage has properties for specifying the name of the DB2 instance, if this cannot be
determined by environment variable settings, and for specifying the location of the DB2
client library file, if this cannot be determined by environment variable settings.
When you have specified the connection properties, click Test to verify the connection.


Usage properties - Generate SQL


• Have the stage generate the SQL?
 If Yes, stage generates SQL based on column definitions and specified
table name
− Table name
• If schema name is not specified, then assumes DataStage user ID
• For example: ITEMS becomes STUDENT.ITEMS
 If No, then you must specify the SQL
• Paste it in
• Manually type it
• Invoke SQL Builder


Usage properties - Generate SQL


The Usage properties folder contains the Generate SQL property. Use this property to specify whether you want the stage to generate the SQL, based on your other property settings and the imported table definition columns, or whether you will provide the SQL yourself.
If you choose to provide the SQL, you can create it outside the stage and paste it in, manually type it into the stage, or have the SQL Builder utility build it. The SQL Builder utility is invoked from within the Connector stage.
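As an illustration, consider a read stage whose Table name property is ITEMS, run under the user ID STUDENT, with two hypothetical columns, ITEM and QUANTITY, on its Columns tab. The generated SQL would take roughly this form (a sketch; the stage's exact output may differ):

    -- Sketch of stage-generated read SQL. The unqualified table name ITEMS
    -- is qualified with the user ID, becoming STUDENT.ITEMS.
    SELECT ITEM, QUANTITY
    FROM STUDENT.ITEMS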


Usage properties - Transaction


• Defines the Unit of Work, when a COMMIT occurs
• Record count
 Number of records to process before the current transaction is committed
• Array size
 Number of rows to transfer in each read or write operation
 Record count must be a multiple of Array size
• End of wave
 A marker that is inserted into the data to indicate the end of a Unit of Work
 The transaction unit is committed when the end of wave marker has
passed through the job
− Data is written to output data sets or database tables as a batch of rows (record count) when the end of wave marker is reached


Usage properties - Transaction


The Usage properties folder in the Connector stage contains a set of transaction
properties. A transaction defines the unit of work. That is, it specifies the number of
rows written by the stage before the data is committed to the table. A value of 0 in the
Record count property directs the stage to write out all rows before the commit.
Array size determines the number of rows transferred in each read or write operation.
The larger the array size, the fewer the physical writes, and therefore the better the performance.
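For example, with illustrative values (an assumption for the sake of the arithmetic, not values from the course):

    Array size   = 100     rows per physical read or write operation
    Record count = 2000    rows per transaction (a multiple of the array size)

Each transaction then commits 2,000 rows, delivered to the database in 20 writes of 100 rows each; a Record count of 0 instead defers the commit until all rows have been written.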


Usage properties - Session and Before/After SQL


• Session
 Isolation level:
− Read uncommitted: Rows that are read during a transaction can be changed
by other processes
− Read committed: Rows that are read during a transaction can be changed by
other processes, but can’t be read until the transaction is completed
− Repeatable read: Rows can’t be changed by other processes until the
transaction is completed
− Serializable: Rows can’t be read or changed by other processes until the
transaction is completed
• Before / After SQL
 SQL statement to be processed before or after data is processed by the
Connector stage
 Used, for example, to create or drop secondary indexes


Usage properties - Session and Before/After SQL


The Usage folder also contains a folder of Session properties. Here, you can specify
an isolation level.
Connector stages support Before / After SQL. These are SQL statements that are to
be executed either before the stage begins processing the data or after the stage
processes the data.
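A common use is the drop/re-create index pattern the slide mentions. A minimal sketch, assuming a hypothetical secondary index ITEMS_WH_IX on a STUDENT.ITEMS table:

    -- Before SQL: drop the secondary index so the load is not slowed
    -- by index maintenance.
    DROP INDEX STUDENT.ITEMS_WH_IX;

    -- After SQL: re-create the index once the rows are written.
    CREATE INDEX STUDENT.ITEMS_WH_IX ON STUDENT.ITEMS (WAREHOUSE);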


Writing to database tables

(Screenshot: a parallel job writing to a DB2 table through a DB2 Connector stage.)


Writing to database tables


This slide shows a job that writes to a DB2 table using the DB2 Connector stage.
Connector stages support multiple input links and reject links. This is discussed later in
this unit.


DB2 Connector GUI

(Screenshot: DB2 Connector stage editor. Callouts: Connection properties, Write mode, Generate SQL, Table action.)

DB2 Connector GUI


This slide shows the inside of the DB2 Connector stage and highlights some of its main
properties. Notice that the DB2 Connector stage has the same basic look-and-feel as
the ODBC Connector stage. The only difference is that it has a couple of additional
properties.


Connector write properties


• Write mode includes:
 Insert
 Update
 Insert then update
− If Insert fails, try update
 Update then insert
− If update fails, try insert
 Bulk load
− Invoke DB2 bulk loading utility
• Table action
 Append: append data to existing table
 Truncate: delete existing data before writing
 Create: create the table
 Replace: create table or replace existing table

Connector write properties


Connector stages used for table writes have a Write mode property. Use this property
to specify the type of write operation. The stage supports both inserts and updates. It
also supports combined inserts and updates. Choose Insert then update if your job will
be doing more inserts than updates. Choose Update then insert if your job will be
doing more updates than inserts. The results are the same in either case. Which you
choose is a matter of performance.
If the database type, such as DB2, supports bulk loading, then you can optionally have
the Connector stage invoke this utility.
Use the Table action property to specify whether the written rows are to be added to
existing rows in the table (Append) or whether they replace the existing rows
(Truncate). You can also direct the Connector stage to create or re-create the table
before writing the rows.
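In SQL terms, the combined write modes behave roughly like the following per-row logic (a sketch with hypothetical table and column names; the stage generates and binds the actual statements):

    -- Insert then update: the insert is attempted first ...
    INSERT INTO STUDENT.ITEMS (ITEM, QUANTITY) VALUES (?, ?);
    -- ... and if it fails (for example, on a duplicate key), the update is tried:
    UPDATE STUDENT.ITEMS SET QUANTITY = ? WHERE ITEM = ?;

Update then insert simply reverses the order of the two attempts.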


Data connection objects


• Stores connection property values as a Repository object:
 User name and password
− Password is encrypted
 Data source or database name
 Other connection properties specific to the type of connection
• Data connection objects are linked to a specific type of Connector or
other type of database stage
• Data connection object values can be loaded into a job Connector
stage
 Load link within the stage
 Right mouse button>Load Data Connection menu selection
 Existing stage values can also be saved into a data connection object


Data connection objects


Data connection objects store connection property values in a named Repository
object. These connection properties can then be loaded into the Connector stage as a
set. This avoids the task of manually entering values for connection properties. It also
allows developers to enter connection properties into a Connector stage without
knowing the actual password, which is encrypted.
Data connection objects are linked to a specific type of Connector. When a data
connection object is created, the type of Connector stage it will be used in is selected.


Data connection object


(Screenshot: data connection object editor. Callouts: selection of the relational stage type, Connector property values.)


Data connection object


This slide shows the inside of a data connection object. Notice that it provides connection property values for a DB2 Connector stage type.


Creating a new data connection object

(Screenshot: creating a new data connection object from the New menu.)


Creating a new data connection object


This slide shows how to create a new data connection object. Click New, and then
select the Other folder.
You can also optionally save the parameters and values specified in an existing
Connector stage into a new data connection object.


Loading the data connection

(Screenshot: stage context menu. Callouts: Load Data Connection, Save Data Connection.)


Loading the data connection


This slide shows one way of loading a data connection object into a stage. Click your
right mouse button over the stage, and then click Load Data Connection.
You can also load the data connection by dragging and dropping it onto the stage, or by clicking the Load button within the stage.
Click Save data connection to save the connection property values in the stage to a
new data connection object.


Demonstration 1
Read and write to relational tables

• In this demonstration, you will:


 Create a data connection object for a DB2 Connector stage type
 Create and load a DB2 table using the DB2 Connector stage
 Import a table definition using ODBC
 Read from a DB2 table using the ODBC Connector stage


Demonstration 1: Read and write to relational tables


Demonstration 1:
Read and write to relational tables

Purpose:
You want to read and write from a database. To do so, first you will create a
Data Connection object, then you will create and load a DB2 table. Finally you
will read from the DB2 table and write to a file.
Windows User/Password: student/student
DataStage Client: Designer
Designer Client User/Password: student/student
Project: EDSERVER/DSProject
NOTE:
In this demonstration, and in other demonstrations in this course, some tasks start with jobs that you were instructed to build in previous tasks. If you were not able to complete the earlier job, you can import it from the DSEssLabSolutions_v11_5_1.dsx file in your C:\CourseData\DSEss_Files\dsx files directory. This file contains all the jobs built in the demonstrations for this course.
Steps:
1. Click Import, and then click DataStage Components.
2. Select the Import selected option, and then select the job you want from the list
that is displayed.
If you want to save a previous version of the job, be sure to save it under a new name
before you import the version from the lab solutions file.
Task 1. Create a Data Connection object.
1. Click New, and then click Other.


2. Click Data Connection, and then click OK, to open the Data Connection
window.
3. In the Data Connection name box, type DB2_Connect_student.

4. Click the Parameters tab, and then in the Connect using Stage Type box,
click the ellipsis button to select the DB2 Connector stage type:


5. Click Open, and then enter parameter values for the first three parameters:
• ConnectionString: SAMPLE
• Username: student
• Password: student

6. Click OK, and then save the parameter set to your Metadata folder.
Task 2. Create and load a DB2 table using the DB2 Connector
stage.
1. Create a new parallel job named relWarehouseItems. The source stage is a
Sequential File stage. The target stage is a DB2 Connector stage, which you
will find in Palette > Database. Name the links and stages as shown.

2. Edit the Warehouse Sequential File stage to read data from the
Warehouse.txt file. Be sure you can view the data.
Next you want to edit the DB2 Connector stage


3. Double-click the DB2 Connector stage, and then, in the right corner of the Properties pane, click the Load link to load the connection information from the DB2_Connect_student object that you created earlier.
This sets the Database property to SAMPLE, and sets the user name and
password properties.
4. Set the Write mode property to Insert. Set Generate SQL to Yes. The Table
name is ITEMS.
NOTE: You can also type STUDENT.ITEMS, because the DB2 schema for this
database is STUDENT.


5. Scroll down and set the Table action property to Replace. Also change the
number of rows per transaction (Record count) to 1. Once the value is
changed, you must also set Array size to 1 (because the number of rows per
transaction must be a multiple of the array size).

6. Compile and run, and then check the job log for errors.
Next you want to see the data in the table.
7. Right-click ITEMS, and then click View Warehouse data.


Task 3. Import a table definition using ODBC.


1. From the Designer menu, click Import > Table Definitions > ODBC Table
Definitions.
2. In the DSN box, select SAMPLE.
3. In the User name and Password boxes, type student / student.

4. Click OK.
5. Specify the To folder to point to your _Training > Metadata folder. Select the
STUDENT.ITEMS table.
NOTE: If you have trouble finding it, type STUDENT.ITEMS in the Name
Contains box, and then click Refresh.

6. Click Import.
7. Open up your STUDENT.ITEMS table definition in the Repository pane, and
then click the Columns tab to examine its column definitions. If the ITEM
column contains an odd SQL type, change the SQL type to NVarChar.


8. Click on the Locator tab, and then type EDSERVER in the Computer box.
9. Verify that the schema and table fields are filled in correctly, as shown.
This metadata is saved in the Repository with the table definition, and is used by
Information Server tools and components, including SQL Builder.

10. Click OK to close the table definition.


Task 4. Create a job that reads from a DB2 table using the
ODBC Connector stage.
1. Create a new parallel job named relReadTable_odbc. Use the ODBC
Connector stage to read from the ITEMS table you created in an earlier task.
Write to a Data Set stage.

2. Open up the ITEMS Connector stage to the Properties tab. Type SAMPLE in
the Data source box. Specify your database user name and password - in this
case, student/student. Click Test to test the connection.
3. Set the Generate SQL property to Yes.


4. Type the table name: STUDENT.ITEMS.

5. Click the Columns tab. Load your STUDENT.ITEMS table definition. Verify that
the column definitions match what you see below.


6. On the Properties tab, verify that you can view the data.
7. In the Transformer stage, map all columns from ITEMS to ItemsOut.
8. In the target Data Set stage, write to a file named ITEMS.ds in your Temp
directory.
9. Compile and run your job. Check the job log for errors. Be sure you can view
the data in the target data set file.

Results:
First you created a Data Connection object, then you created and loaded a
DB2 table. Finally you read from the DB2 table and wrote to a Data Set file.


Multiple input links


• Write rows to multiple tables within the same unit of work
 Use navigation panel in stage to select link properties
 Order of input records to input links can be specified
− Record ordering property
• All records: All records from first link, then next link, etc.
• First record: One record from each link is processed at a time
• Ordered: User specified ordering
• Reject links can be created for each input link
 Can be based on:
− SQL error
− Row not updated
 ERRORCODE and ERRORTEXT columns can be added to each reject row
− Contain error code and error text, respectively


Multiple input links


Multiple input links write rows to multiple tables within the same unit of work. Reject links can be created for each input link. Rows can be captured based on two conditions: the occurrence of an SQL error or an update failure. The former would occur if, for example, an insert failed because the key column value matched an existing row's key column value. The latter would occur if an update failed because there was no existing row with a matching key value.
When using multiple input links, the order in which rows are written can be specified
using the Record ordering property. Select All records to write all records from the
first link before writing records from the next link. Select First record to write records
one at a time to each link. Select Ordered to specify a customized ordering.


Job with multiple input links and reject links

(Screenshot: job with multiple input links and corresponding reject links into a DB2 Connector stage.)


Job with multiple input links and reject links


This slide shows a job writing to two DB2 tables using the DB2 Connector stage with multiple input links. Also shown are reject links corresponding to each of the input links. So, for example, the top reject link labeled SGM_DESC_Rejects will capture SQL errors occurring in the SGM_DESC input link.


Specifying input link properties

(Screenshot: Connector stage editor. Callouts: input link selector, job parameter in a property, icon for creating a job parameter.)


Specifying input link properties


This slide shows the inside of the Connector stage. You can click on a particular input
link in the link name box to display its properties. In this example, the SGM_DESC input
link has been selected. The table action specified applies to this link.
Notice also that a job parameter is being used to specify the table action. Click the icon
indicated to create a job parameter for a property within the Connector stage.


Record ordering property

(Screenshot: Connector stage editor. Callouts: Stage properties, Record ordering property.)


Record ordering property


This slide shows the stage properties for the Connector stage. Here is where you can
specify the ordering of records for multiple input links using the Record ordering
property.


Reject link specification

(Screenshot: reject link properties. Callouts: reject link selection, reject link conditions, columns to include in the reject row, reject link association.)


Reject link specification


Select a reject link in the link name box to display its properties. In the window on the
left, below the link name box, you specify the conditions capturing rows in the reject link.
In the window on the right, you can specify whether to include error information along
with the rejected row. If, for example, you check ERRORCODE, a column named
ERRORCODE will be added to each reject row. This new column will contain the SQL
error code that occurred.
Each reject link is associated with an input link. You specify this in the
Reject From Link box at the bottom of the window.


Demonstration 2
Connector stages with multiple input links

• In this demonstration, you will:


 Create a job with multiple input links to a Connector stage
 Create job parameters for Connector stage properties
 Create Connector stage Reject links


Demonstration 2: Connector stages with multiple input links


Demonstration 2:
Connector stages with multiple input links

Purpose:
You will update relational tables using multiple Connector input links in a
single job.
NOTE:
In this demonstration, and in other demonstrations in this course, some tasks start with jobs that you were instructed to build in previous tasks. If you were not able to complete the earlier job, you can import it from the DSEssLabSolutions_v11_5_1.dsx file in your C:\CourseData\DSEss_Files\dsx files directory. This file contains all the jobs built in the demonstrations for this course.
Steps:
1. Click Import, and then click DataStage Components.
2. Select the Import selected option, and then select the job you want from the list
that is displayed.
If you want to save a previous version of the job, be sure to save it under a new name
before you import the version from the lab solutions file.
Task 1. Create a job with multiple Connector input links.
1. Create a new parallel job named relMultInput. Name the links and stages as
shown. Be sure to work from left to right as you create your job workflow, adding
your elements and connectors.


2. Open the source Sequential File stage. Edit it so that it reads from the
Selling_Group_Mapping.txt file. Be sure you can view the data.

3. Open the Transformer. Map the Selling_Group_Code and Selling_Group_Desc fields to the SGM_DESC output link. Map the Selling_Group_Code, Special_Handling_Code, and Distribution_Channel_Description fields to the SGM_CODES output link.
The Distribution_Channel_Description presents a problem. The column
name is too long for DB2.
4. Change the name of the output column to Distribution_Channel_Desc.


5. Open up the DB2 Connector stage.


6. Click on the Stage tab at the top left.
This displays the Connection properties.
7. Click the Load link. Select the DB2_Connect_student Data Connection object
you created in an earlier lab.

8. Click on the Input tab.


9. In the Input name (upstream stage) box, select SGM_DESC (Split). Set the
Write mode property to Insert, set Generate SQL to Yes, and type
SGM_DESC for Table name, as shown.

10. Click Table action to select the row, and then click Use Job Parameter.
11. Click New Parameter, and then create a new job parameter named
TableAction, with a default value of Append.


12. Click OK.


This adds the job parameter enclosed in pound signs (#).

13. Click the Columns tab. Select the Key box next to Selling_Group_Code.
This will define the column as a key column when the table is created.


14. In the Input name (upstream stage) box at the top left of the stage, select
SGM_CODES (Split).
15. On the Properties tab, set the Write mode property to Insert, the Generate
SQL property to Yes, the Table name property to SGM_CODES, and Table
action to #TableAction#, as shown.

16. Click the Columns tab. Select the Key box next to the Selling_Group_Code
box.
This will define the column as a key column when the table is created.

17. Click on the Output tab, and then select SGM_DESC_Rejects (Peek_SGM_DESC_Rejects) from the Output name (downstream stage) drop down list.


18. In the Reject From Link box, select SGM_DESC.


19. Select the SQL error, ERRORCODE, and ERRORTEXT boxes.

20. From the drop down list, select SGM_CODES_Rejects (Peek_SGM_CODES_Rejects).
21. In the Reject From Link box, select SGM_CODES.


22. Select the SQL error, ERRORCODE, and ERRORTEXT boxes.

23. Click OK to close the Connector stage.


24. Compile your work.


25. Run your job.


The Job Run Options window is displayed.
26. The first time you run this job, select Create as the Table action, so that the
target tables get created.

27. Click the Run button.


28. View the job log. Notice the DB2 Connector stage messages that display
information about the numbers of rows inserted and rejected.


29. In the log, open the message that describes the statement used to generate the
table. Notice that the CREATE TABLE statement includes the PRIMARY KEY
option.
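The generated DDL has roughly this shape (a sketch; the column types shown are illustrative, since the real ones come from the Columns tab):

    -- Illustrative sketch of the generated CREATE TABLE statement:
    CREATE TABLE SGM_DESC (
        Selling_Group_Code VARCHAR(10) NOT NULL,
        Selling_Group_Desc VARCHAR(50),
        PRIMARY KEY (Selling_Group_Code)
    )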

30. Now, let us test the reject links. Run the job again, this time selecting a
Table action of Append.


31. Notice that all the rows are rejected, because they have duplicate keys.

32. In the job log, open up one of the reject Peek messages and view the
information it contains. Notice that it contains two additional columns of
information (RejectERRORCODE, RejectERRORTEXT) that contain SQL error information.

Results:
You updated relational tables using multiple Connector input links in a single
job.


SQL Builder
• Uses the table definition
 Be sure the Locator tab information is correct
− Schema and table names are based on Locator tab information
• Drag table definitions to SQL Builder canvas
• Drag columns from table definition to select columns table
 Optionally, specify sort order
• Define column expressions
• Define WHERE clause


SQL Builder
Connector stages contain a utility called SQL Builder that can be used to build the SQL
used by the stage. SQL is built using GUI operations such as drag-and-drop in a
canvas area. Using SQL Builder you can construct complex SQL statements without
knowing how to manually construct them.


Table definition Locator tab


(Screenshot: table definition editor. Callouts: Locator tab, table schema name, table name.)


Table definition Locator tab


If you are going to use SQL Builder, it is important that the table definition you drag to the SQL Builder canvas to specify the SELECT clause has the correct information on its Locator tab. SQL Builder uses some of this information in the construction of the SQL. In particular, make sure the table schema name and table name are correct, since these names cannot be edited directly from within SQL Builder.


Opening SQL Builder

(Screenshot: Connector stage Properties tab. Callouts: button that opens SQL Builder, constructed SQL in the Select statement property.)


Opening SQL Builder


This slide shows how to open SQL Builder from within a Connector stage. The Tools
button is at the far right of the SQL statement row. In this example, a SELECT
statement has been built using SQL Builder. Alternatively, this is where you would
manually type or paste in an SQL statement.


SQL Builder window

(Screenshot: SQL Builder window. Callouts: drag the table definition to the canvas, drag columns to the Select columns grid, WHERE clause area, ORDER BY area.)


SQL Builder window


This slide shows the SQL Builder window.
You build the query on the Selection tab, which is the first window you see when you
open SQL Builder. Begin by dragging a table definition to the canvas from the
Repository window shown at the top left. Be sure the information on the Locator tab of
the table definition is correct. In particular, be sure the table name and schema are
correctly specified.
From the table definition, you can drag columns down to the Select columns window
to build the SQL SELECT clause. Use the Construct filter expression window to
construct your WHERE clause.


Creating a calculated column


(Screenshot: Expression Editor. Callouts: expression editor selection, column alias, function choice, function parameters.)


Creating a calculated column


This slide shows how to build a calculated column in SQL Builder. First open the
expression editor for a new Column Expression cell. In this window, select a predicate
(Functions, Calculation) and then begin building the expression.
In this example, the SUBSTRING function has been selected in the Expression Editor
list. Then the parameters for this function have been specified at the right. The string to which the function is applied is a column from the ITEMS table. The substring starts at character 1 and runs for 15 characters.
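The expression that results for this example, in the ODBC extended syntax used later in this unit's demonstration, looks something like this:

    -- Sketch of the calculated column expression:
    SUBSTRING(ITEM FROM 1 FOR 15)

Giving the expression a column alias (the demonstration uses SHORT_ITEM) names the calculated column in the generated SELECT list.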


Constructing a WHERE clause

(Screenshot: Construct filter expression window. Callouts: predicate selection, job parameter in the condition, Add button that appends the condition to the clause, second job parameter.)


Constructing a WHERE clause


This slide illustrates how to construct a WHERE clause in SQL Builder. Construct the
expression as shown in this example. Then click Add to add the expression to the
expression window. Then you can optionally create additional expressions to add to the
WHERE clause.
Notice that job parameters can be used within an expression. In this example, the job
parameter #WarehouseLow# sets the low value of a range.
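For the range filter used in this unit's demonstration, the constructed clause comes out roughly as follows (a sketch; SQL Builder may equally render it with a BETWEEN predicate):

    -- Sketch of a WHERE clause built from two conditions and two job parameters:
    WHERE (WAREHOUSE >= #WarehouseLow#) AND (WAREHOUSE <= #WarehouseHigh#)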


Sorting the data

(Screenshot: Select columns grid. Callouts: Sort column (Ascending/Descending), Sort Order column identifying the first and second columns to sort by.)


Sorting the data


This slide illustrates how to create an ORDER BY clause in the SQL statement. In the
Select columns window, specify the ordering of the sort key columns in the Sort Order
column. For each of these, you can specify Ascending or Descending in the Sort
column.
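With ITEM as the first sort column and WAREHOUSE as the second, both ascending (the choices made in this unit's demonstration), the generated clause reads along these lines:

    -- Sketch of the resulting clause:
    ORDER BY ITEM ASC, WAREHOUSE ASC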


Viewing the generated SQL

(Screenshot: SQL tab showing the read-only generated SQL.)


Viewing the generated SQL


At any time, you can view the SQL that has been generated up to that point. The SQL
tab is read-only. You cannot edit the SQL manually.
Notice in the SQL the FROM clause, where the table name and schema names are
used. These came from the table definition Locator tab.
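Putting the pieces from the preceding slides together, the SQL tab would show a statement of roughly this shape (a sketch based on this unit's running example; the builder's exact column list and formatting may differ):

    -- Illustrative sketch of a complete generated statement:
    SELECT ITEM, WAREHOUSE, SUBSTRING(ITEM FROM 1 FOR 15) AS SHORT_ITEM
    FROM STUDENT.ITEMS
    WHERE (WAREHOUSE >= #WarehouseLow#) AND (WAREHOUSE <= #WarehouseHigh#)
    ORDER BY ITEM ASC, WAREHOUSE ASC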


Checkpoint
1. What are three ways of building SQL statements in Connector
stages?
2. Which of the following statements can be specified in Connector
stages? Select, Insert, Update, Upsert, Create Table.
3. What are two ways of loading data connection metadata into a
database stage?


Checkpoint
Write your answers here:


Checkpoint solutions
1. Manually. Using SQL Builder. Have the Connector stage generate
the SQL.
2. All of them.
3. Click the right mouse button over the stage and click Load Data
Connection. Drag the data connection from the Repository and
drop it on the stage.


Checkpoint solutions


Demonstration 3
Construct SQL using SQL Builder

• In this demonstration, you will:


 Invoke SQL Builder
 Construct the SELECT clause
 Construct the ORDER BY clause
 Create a column expression
 Define a WHERE clause


Demonstration 3: Construct SQL using SQL Builder


Demonstration 3:
Construct SQL using SQL Builder

Purpose:
You want to build an SQL SELECT statement using SQL Builder.
NOTE:
In this demonstration, and in other demonstrations in this course, some tasks start with jobs that you were instructed to build in previous tasks. If you were not able to complete the earlier job, you can import it from the DSEssLabSolutions_v11_5_1.dsx file in your C:\CourseData\DSEss_Files\dsx files directory. This file contains all the jobs built in the demonstrations for this course.
Steps:
1. Click Import, and then click DataStage Components.
2. Select the Import selected option, and then select the job you want from the list
that is displayed.
If you want to save a previous version of the job, be sure to save it under a new name
before you import the version from the lab solutions file.
Task 1. Build an SQL SELECT statement using SQL Builder.
1. Open your relReadTable_odbc job and save it as
relReadTable_odbc_sqlBuild.


2. Open up your STUDENT.ITEMS table definition. Click on the Locator tab. Edit
or verify that the schema and table boxes contain the correct schema name
and table name, respectively.

3. Open up the Job Properties window, and then create two job parameters:
• WarehouseLow as an integer type, with a default value of 0
• WarehouseHigh as an integer type, with a default value of 999999


4. Open up the Connector source stage. In the Usage folder, set the Generate SQL property to No. Notice the new warning next to Select statement.

5. Click the Select statement row, and then click Tools. Click
Build new SQL (ODBC 3.52 extended syntax).
This opens the SQL Builder window.
6. Drag your STUDENT.ITEMS table definition onto the canvas.

7. Select all the columns except ALLOCATED and HARDALLOCATED, and then
drag them to the Select columns pane.


8. Sort by ITEM and WAREHOUSE, in ascending order. To accomplish this, select Ascending in the Sort column. Specify the sort order in the last column.

9. Click the SQL tab at the bottom of the window to view the SQL based on your
specifications so far.

10. Click OK to save and close your SQL statement and SQL Editor.
11. You may get some warning messages. Click Yes to accept the SQL as
generated and allow DataStage to merge the SQL Builder selected columns
with the columns on the Columns tab.
12. Click the Columns tab. Ensure that the ALLOCATED and HARDALLOCATED
columns are removed, since they are not referenced in the SQL. Also make sure that the remaining column definitions match the columns selected in the SQL.


13. Click the Properties tab. Notice that the SQL statement you created using SQL Builder has been put into the Select statement property.

14. Open up the Transformer. Remove the output columns in red, since they are
no longer used.
15. Compile and run with defaults. View the job log.
16. Verify that you can view the data in the target stage.
Task 2. Use the SQL Builder expression editor.
1. Save your job as relReadTable_odbc_expr.
2. Open up your source ODBC Connector stage, and then beside the SELECT
statement you previously generated click on the Tools button.
3. Click Edit existing SQL (ODBC 3.52 extended syntax).
4. Click in the empty Column Expression cell beside *. From the drop-down list,
select Expression Editor.
This opens the Expression Editor Dialog window.
5. In the Predicates box select the Functions predicate and then select the
SUBSTRING function in the Expression Editor box. Specify that it is to select
the first 15 characters of the ITEM column.

6. Click OK.


7. For the new calculated column, specify a column alias of SHORT_ITEM.

8. In the Construct filter expression (WHERE clause) window, construct a WHERE clause that selects warehouses with numbers between #WarehouseLow# and #WarehouseHigh#, where #WarehouseLow# and #WarehouseHigh# are job parameters.
9. Click the Add button to add it to the SELECTION window.

10. Click the SQL tab at the bottom of the SQL Builder to view the constructed
SQL. Verify that it is correct.
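If your specifications match the steps above, the constructed statement should resemble this sketch. Again this is an approximation: the exact column list, the SUBSTRING syntax generated for your driver, and SQL Builder's formatting may differ, and at run time DataStage substitutes the job parameter values for #WarehouseLow# and #WarehouseHigh#.

   SELECT ITEM, WAREHOUSE, ONHAND, ONORDER,
      SUBSTRING(ITEM FROM 1 FOR 15) AS SHORT_ITEM
   FROM STUDENT.ITEMS
   WHERE WAREHOUSE BETWEEN #WarehouseLow# AND #WarehouseHigh#
   ORDER BY ITEM ASC, WAREHOUSE ASC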

11. Click OK to return to the Properties tab. A message is displayed informing you
that your columns in the stage do not match columns in the SQL statement.
Click Yes to add the SHORT_ITEM column to your metadata.
12. On the Columns tab, specify the correct type for the SHORT_ITEM column,
namely Varchar(15).


13. Open the Transformer stage, and then map the new SHORT_ITEM column
across. Remove the ONHAND and ONORDER columns from the output.

14. Compile and run.


15. View the results.

Results:
You built an SQL SELECT statement using SQL Builder.


Unit summary
• Import table definitions for relational tables
• Create data connections
• Use ODBC and DB2 Connector stages in a job
• Use SQL Builder to define SQL SELECT and INSERT statements
• Use multiple input links into Connector stages to update multiple
tables within a single transaction
• Create reject links from Connector stages to capture rows with
SQL errors

Unit 13. Job control

Unit objectives
• Use the DataStage job sequencer to build a job that controls a
sequence of jobs
• Use Sequencer links and stages to control the order in which a set of jobs runs
• Use Sequencer triggers and stages to control the conditions under
which jobs run
• Pass information in job parameters from the master controlling job to
the controlled jobs
• Define user variables
• Enable restart
• Handle errors and exceptions


What is a job sequence?


• A master controlling job that controls the execution of a set of
subordinate jobs
• Passes values to the subordinate job parameters
• Controls the order of execution (links)
• Specifies conditions under which the subordinate jobs get executed
(triggers)
• Specifies complex flow of control
  - Loops
  - All / Any
  - Wait for file
• Performs system activities
  - Email
  - Execute system commands and executables
• Can include Restart checkpoints


A job sequence is a master controlling job that controls the execution of a set of subordinate jobs. A job sequence is a special type of job, with its own canvas and set of stages that can be dragged onto the canvas.
The job sequence manages and controls the set of subordinate jobs. Parameter values
can be passed from the job sequence to the individual jobs. In this way, the job
sequence can provide a single interface to a whole set of jobs.
The job sequence controls when its subordinate jobs run and the order in which they
run. There are also a number of separate stages that can be used to control the job
flow.
In addition to controlling and running jobs, other system activities can be performed.


Basics for creating a job sequence


• Open a new job sequence
  - Specify whether it's restartable
• Add stages
  - Stages to execute jobs
  - Stages to execute system commands and executables
  - Special-purpose stages
• Add links
  - Specify the order in which jobs are to be executed
• Specify triggers
  - Triggers specify the condition under which control passes across a link
• Specify error handling
• Enable / disable restart checkpoints



To create a job sequence, you first open a new job sequence canvas. You then add
stages and links, just as for parallel jobs. However, the stages and links have a different
meaning. The stages are used to execute jobs and to perform other activities. The
links are used to specify the order in which jobs get executed.
For each link, you can specify a triggering condition under which control will be allowed
to pass to the next stage.


Job sequence stages


• Run stages
  - Job Activity: Run a job
  - Execute Command: Run a system command
  - Notification Activity: Send an email
• Flow control stages
  - Sequencer: Go if All / Any
  - Wait for File: Go when file exists / doesn't exist
  - StartLoop / EndLoop
  - Nested Condition: Go if condition satisfied
• Error handling
  - Exception Handler
  - Terminator
• Variables
  - User Variables


The job sequence stages shown in the slide on the left can be placed into different
categories, as shown. Some stages are used to run jobs and perform other sorts of
activities. Some stages are used for complex flow of control. There are two stages that
are used for error handling. And the User Variables stage provides a mechanism for
passing data to individual job parameters.
These stages are each discussed in the following pages.


Job sequence example

(Screenshot: an example job sequence; callouts identify a Wait for File stage, a Job Activity stage that runs a job, an Execute Command stage, a Notification stage that sends an email, and an Exception Handler stage.)



This slide displays an example of a job sequence. It contains many of the different stages that are available; the different types of stages are highlighted by the callouts.
Notice the coloring of the links. Different colors indicate different triggering conditions, which are discussed in the following pages. For example, a red link passes control to the following stage when a job or other activity fails; a green link passes control to the following stage when a job or other activity succeeds.


Job sequence properties

(Screenshot: the job sequence Job Properties window; callouts identify the restart options, the job log options, and the Exception stage used to handle aborts.)


This slide shows the job sequence properties that can be set. One key feature of job
sequences is that they are restartable. That is, if one of the jobs fails after several have run successfully, execution will resume at the point of failure when the sequence is restarted.
To enable restartability, check the Add checkpoints so sequence is restartable on
failure box.


Job Activity stage properties

(Screenshot: the Job tab of a Job Activity stage; callouts identify the job to be executed, the execution mode, and the job parameters and their values.)



This slide shows the Job tab of a Job Activity stage and highlights its main features. A
Job Activity stage is used to run a job. The Job name field specifies the job.
The Execution action specifies how the job is to run. The Reset if required, then run
execution mode will reset a job that aborted on the previous run to an executable
condition.
The job parameters of the job to be executed are listed at the bottom, along with the
values that are to be passed to them. Value expressions for these parameters can
include the parameters of the job sequence. In this way, when the sequence is run, the
values passed to the job sequence will be passed down to the individual jobs it controls.


Job Activity trigger


(Screenshot: the Triggers tab of a Job Activity stage; callouts identify the output link names, the list of trigger types, and the editor used to build custom trigger expressions.)



This slide displays the Triggers tab of a Job Activity stage. Most job sequence stages
have a Triggers tab.
A trigger can be specified for each link going out of the stage. A list of the trigger types
is shown at the lower left. In this example, a Custom trigger is being defined. The
trigger expression is built using the expression editor. A menu of items that can be
inserted into the expression is displayed.
Several other types of triggers can be selected. The OK trigger passes control across the link if the job or other activity runs successfully. The Failed trigger passes control across the link if the job or other activity fails.
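As a sketch of a custom trigger, an expression that passes control only when an Execute Command activity returns zero might read as follows; the stage name here is illustrative:

   Execute_Command_0.$ReturnValue = 0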


Execute Command stage

• Execute system commands, shell scripts, and other executables
• Use, for example, to drop or rename database tables

(Screenshot: the Execute Command stage; callouts identify the executable to run and the parameters to pass.)



This slide shows the inside of the Execute Command stage, which is used to run
system commands, shell scripts, and other executables.
The command to run the executable is specified in the Command box. In this example,
the Echo_Script.sh script will be executed.
Parameters can be passed to the executable. The parameter values are listed in the
Parameters box.


Notification Activity stage

(Screenshot: the Notification Activity stage; a callout highlights the option to include job status information in the email body.)



This slide displays the inside of the Notification Activity stage. The Notification Activity
stage is used to send emails. Boxes are provided in which to specify the email
addresses of the sender and recipients. A subject line and attachments can also be
specified.
Select the Include job status in email box to include a status report about the
activities in the job sequence in the email.


User Variables stage

(Screenshot: a job sequence containing a User Variables stage; callouts identify the stage, the variable, and the expression defining the value for the variable.)



This slide shows a job sequence with a User Variables Activity stage. The inside of the
User Variables Activity stage is shown. A single variable is defined along with the
expression that specifies its value. This variable can be passed to any of the jobs that
follow it. For example, this variable can be passed to seqJob1 or seqJob3.


Referencing the user variable

(Screenshot: the Job tab of a Job Activity stage; a callout highlights the user variable in a parameter's value expression.)



This slide displays the Job tab of a Job Activity stage. The PeekHeading parameter is
passed the user variable shown on the previous page.
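In a Value Expression cell, a user variable is referenced by qualifying it with the name of the User Variables Activity stage that defines it. As a sketch (both names here are illustrative):

   UserVariables_Activity.varPrefix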


Wait for File stage

(Screenshot: the Wait for File stage; callouts identify the Filename box and the wait options.)



This slide shows the inside of the Wait for File stage. In the Filename box, you specify the file whose appearance or disappearance the stage waits for. When that event happens, control is passed out of the stage based on the specified Trigger conditions.
In this example, control will be passed to the next stage when the StartRun file
disappears.


Sequencer stage
• Sequence multiple jobs using the Sequencer stage

(Screenshot: a job sequence with a Sequencer stage; a callout notes that the mode can be set to All or Any.)

This slide shows an example of a job sequence with the Sequencer stage. This stage passes control to the next stage (PTPCredit) when control reaches it from all or some of its input links. It has two modes: All and Any. If All is the active mode, control must reach it from all of its input links before it passes control to the next stage. If Any is the active mode, control must reach it from at least one of its input links before it passes control to the next stage.


Nested Condition stage

(Screenshot: a Nested Condition stage in a job sequence; callouts identify the fork based on trigger conditions and the Trigger conditions window.)



This slide shows the Nested Condition stage in a job sequence. It can be used to pass
control across one or more output links based on their Trigger conditions. The specified
Trigger conditions are displayed in the window at the bottom left, as noted.
The Nested Condition stage does not perform any activity. It is used to split the flow of
control across different output paths.


Loop stages
(Screenshot: a job sequence with Loop stages; callouts identify the reference link back to the start of the loop, the counter values, and the counter value passed to the job.)

This slide shows a job sequence with a loop stage. In this example, the Loop stage
processes each of the list of values in the Delimited Values box shown at the bottom
left. The values are delimited by commas. In this example, the loop will iterate three
times. The value for each iteration is stored in the Counter stage variable, which is passed to the ProcessPayrollFiles Job Activity stage in the FileName parameter.
For each iteration, the job run by the Job Activity stage will read from the file whose
name is in the Counter stage variable.
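As a sketch (the file names here are illustrative), the Delimited Values box might contain:

   payroll01.txt,payroll02.txt,payroll03.txt

and the FileName parameter of the Job Activity stage would reference the counter with an expression such as the following, where StartLoop_Activity is the name of the StartLoop stage:

   StartLoop_Activity.$Counter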


Handling activities that fail

(Screenshot: the job sequence Job Properties window; a callout highlights the option that passes control to the Exception stage when an activity fails.)



This slide shows the Job Properties window of the job sequence. If the
Automatically handle activities that fail box is selected, as shown here, control will be
passed to the Exception Handler stage when any activity fails.


Exception Handler stage

(Screenshot: a job sequence with an Exception Handler stage; a callout notes that control goes to this stage if an activity fails.)



This slide shows a job sequence with an Exception Handler stage, which is highlighted.
If one of the activities run by an Activity stage fails (for example, Job_2 or
Execute_Command_27), control is immediately passed to the Exception Handler
stage. This stage initiates a set of activities. In this example, the sequence sends an
email and gracefully terminates the jobs handled by the job sequence.


Enable restart

(Screenshot: the job sequence Job Properties window; a callout highlights the checkbox that enables checkpoints to be added.)

This slide shows the Job Properties window of the job sequence.
If Add checkpoints so sequence is restartable on failure is checked, the sequence can be restarted upon failure. Execution will start at the point of failure. Activities that previously ran successfully, and were checkpointed, will not be rerun.


Disable checkpoint for a stage

(Screenshot: a Job Activity stage; a callout highlights the Do not checkpoint run box.)



This slide shows the inside of a Job Activity stage. The Do not checkpoint run box is
highlighted. If this box is checked, this Job Activity stage will run each time the
sequence is run, whether or not it ran successfully on the previous run.


Checkpoint
1. Which stage is used to run jobs in a job sequence?
2. Does the Exception Handler stage support an input link?


Checkpoint solutions
1. Job Activity stage
2. No, control is automatically passed to the stage when an exception
occurs (for example, a job aborts).


Demonstration 1
Build and run a job sequence

• In this demonstration, you will:
  - Build a job sequence that runs three jobs
  - Pass parameters from the job sequence to the Job Activity stages
  - Specify custom triggers
  - Define a user variable
  - Add a Wait for File stage
  - Add exception handling
  - Run a job sequence


Demonstration 1:
Build and run a job sequence

Purpose:
You want to build a job sequence that runs three jobs and explore how to
handle exceptions.
Windows User/Password: student/student
DataStage Client: Designer
Designer Client User/Password: student/student
Project: EDSERVER/DSProject
NOTE:
In this demonstration, and in other demonstrations in this course, there may be tasks that start with jobs you were instructed to build in previous tasks. If you were not able to complete an earlier job, you can import it from the DSEssLabSolutions_V11_5_1.dsx file in your C:\CourseData\DSEss_Files\dsxfiles directory. This file contains all the jobs built in the demonstrations for this course.
Steps:
1. Click Import, and then click DataStage Components.
2. Select the Import selected option, and then select the job you want from the list
that is displayed.
If you want to save a previous version of the job, be sure to save it under a new name
before you import the version from the lab solutions file.
Task 1. Build a Job Sequence.
1. Import the seqJobs.dsx file in your DSEss_Files\dsxfiles directory.
This file contains the jobs you will execute in your job sequence: seqJob1,
seqJob2, and seqJob3.
2. When prompted, import everything listed in the DataStage Import dialog.
3. Open up seqJob1. Compile the job.


4. In the Repository window, right-click seqJob2, and then click Multiple Job
Compile.
The DataStage Compilation Wizard window is opened.
5. Ensure both seqJob2 and seqJob3 are added to the Selected items window.


6. Click Next two times to move to the Compile Process window.

7. Click Start Compile.


8. After the jobs compile successfully, click Finish. If a report opens after the
compile, you can just close it.
9. Return to the open seqJob1 canvas. In the Job Properties window, click the
Parameters tab, and note the parameters defined for seqJob1. The other jobs
have similar parameters.


10. Open the Transformer stage. Notice that the job parameter PeekHeading
prefixes the column of data that will be written to the job log using the Peek
stage.

11. Click New, and then select the Jobs folder.

12. Open a new Sequence Job, and then save it as seq_Jobs.


13. Under Palette, under Sequence, drag three Job Activity stages to the canvas,
link them, and name the stages and links as shown. (Alternatively, you can drag
seqJob1, seqJob2, and seqJob3 to the canvas.)


14. Open the General tab in the Job Properties window. Review and select all
compilation options.

15. Add job parameters to the job sequence to supply values to the job parameters
in the jobs. Click on the Add Environment Variable button and then add
$APT_DUMP_SCORE. Set $APT_DUMP_SCORE to True.
Hint: double-click the bottom of the window to sort the variables.
16. Add three numbered RecCount variables: RecCount1, RecCount2, and
RecCount3. All are type string with a default value of 10.

17. Open up the first Job Activity stage and set and/or verify that the Job name
value is set to the job the Activity stage is to run.
18. For the Job Activity stage, set the job parameters to the corresponding job
parameters of the job sequence. For the PeekHeading value use a string with a
single space.


19. Set the Execution action to Reset if required, then run.
The result for seqJob1 appears as follows; the other stages are similar.
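In outline (a sketch; the record-count parameter name is illustrative, since the actual parameter names come from the imported seqJobs.dsx):

   Job name:          seqJob1
   Execution action:  Reset if required, then run
   Parameters:
      NumRows     = RecCount1    (sequence job parameter; name illustrative)
      PeekHeading = " "          (a single space)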

20. Repeat the setup for the other two stages, using the corresponding 2 and 3 values for each stage.
In each of the first two Job Activity stages, you want to set the job triggers so that later jobs run only if earlier jobs ran without errors, although possibly with warnings. This means that the job status ($JobStatus) is either DSJS.RUNOK or DSJS.RUNWARN.
To do this, you need to create a custom trigger that specifies that the previous job's status is equal to one of these two values.
21. For seqJob1, on the Triggers tab, in the Expression Type box, select Custom
- (Conditional).
22. Double-click the Expression cell, right-click, click Activity Variable, and then
insert $JobStatus.
23. Right-click to insert "=", right-click, click DS Constant, and then insert
DSJS.RUNOK.
24. Right-click to insert Or.
25. Right-click to insert "=", right-click, click DS Constant, and then insert
DSJS.RUNWARN.


26. Press Enter.


The result for seqJob1 appears as follows:
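As a sketch, the completed expression reads as follows; the activity-variable prefix is the name of your first Job Activity stage, so yours may differ:

   seqJob1.$JobStatus = DSJS.RUNOK Or seqJob1.$JobStatus = DSJS.RUNWARN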

27. Repeat the previous steps for seqJob2 to add the custom expression. The result for seqJob2 is the same expression, built from seqJob2's activity variables.

28. Compile and run your job sequence.


29. View the job log for the sequence. Verify that each job ran successfully and
examine the job sequence summary message and the individual job report
messages.

Task 2. Add a user variable.


1. Save your job sequence as seq_Jobs_UserVar. Add a User Variables Activity
stage as shown.


2. Open the User Variables stage, then the User Variables tab. Right-click in the
pane, and then click Add Row. Create a user variable named
varMessagePrefix.
3. Double-click in the Expression cell to open the Expression Editor. Concatenate
the string constant "Date is " with the DSJobStartDate DSMacro, followed by a
bar surrounded with spaces (" | ").
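DataStage uses the colon (:) as its string concatenation operator, so the resulting expression is a sketch like:

   "Date is " : DSJobStartDate : " | "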

4. Open each Job Activity stage. For each PeekHeading parameter, insert the user variable varMessagePrefix in the Value Expression cell.

5. Compile and run.


You want to confirm that your user variable is added to every peek heading
item.


6. From Tools > Run Director, double-click the seqJob1 job.


The following shows that PeekHeading is added, but does not show that the
value was added to every item. More detail is required.

7. Close the Job Status Detail dialog, then right-click seqJob1, and then click
View Log.
8. In the job log, double-click the Peek_0.0 item, as indicated.
You can now see that the user variable value ("Date is ...") prefixes the data going into col1.


Task 3. Add a Wait for File stage.


In this task, you modify your design so that the job waits to be executed until the
StartRun.txt file appears in your DSEss_Files/Temp directory.
1. Save your job sequence as seq_Jobs_Wait.
2. Add a Wait for File Activity stage as shown.

3. On the Job Properties page, add a job parameter named StartFile to pass the
name of the file to wait for. Specify a default value StartRun.txt.

4. Edit the Wait for File stage. Specify that the job is to wait forever until the #StartFile# file appears in the DSEss_Files/Temp directory.


5. On the Triggers tab, specify an unconditional trigger.


6. Compile and run your job sequence. Now view the job log for the sequence. As
you can see in the log, the sequence is waiting for the file.

7. Now open the seqStartSequence job that was part of the seqJobs.dsx file
that you imported earlier. This job creates the StartRun.txt file in your
DSEss_Files/Temp directory.
8. Compile and run the seqStartSequence job to create the StartRun.txt file.
Then return to the log for your sequence to watch the sequence continue to the
end.
Task 4. Add exception handling.
1. Save your sequence as seq_Jobs_Exception.
2. Add the Exception Handler and Terminator Activity stages as shown.


3. Edit the Terminator stage so that any running jobs are stopped when an
exception occurs.

4. Compile and run your job. To test that it handles exceptions, make an activity fail; for example, set the RecCount3 parameter to -10. Then go to the job log and open the Summary message. Verify that the Terminator stage was executed.

Results:
You built a job sequence that runs three jobs and explored how to handle
exceptions.


Unit summary
• Use the DataStage job sequencer to build a job that controls a
sequence of jobs
• Use Sequencer links and stages to control the order in which a set of jobs runs
• Use Sequencer triggers and stages to control the conditions under
which jobs run
• Pass information in job parameters from the master controlling job to
the controlled jobs
• Define user variables
• Enable restart
• Handle errors and exceptions

IBM Training

© Copyright IBM Corporation 2015. All Rights Reserved.
