
DataStage Essentials v8.1 Core Modules

Copyright IBM Corporation 2009

Copyright, Disclaimer of Warranties and Limitation of Liability


Copyright IBM Corporation 2009 IBM Software Group One Rogers Street Cambridge, MA 02142 All rights reserved. Printed in the United States. IBM and the IBM logo are registered trademarks of International Business Machines Corporation. The following are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both: AnswersOnLine AIX APPN AS/400 BookMaster C-ISAM Client SDK Cloudscape Connection Services Database Architecture DataBlade DataJoiner DataPropagator DB2 DB2 Connect DB2 Extenders DB2 Universal Database Distributed Database Distributed Relational DPI DRDA DynamicScalableArchitecture DynamicServer DynamicServer.2000 DynamicServer with Advanced DecisionSupportOption DynamicServer with Extended ParallelOption DynamicServer with UniversalDataOption DynamicServer with WebIntegrationOption DynamicServer, WorkgroupEdition Enterprise Storage Server FFST/2 Foundation.2000 Illustra Informix Informix4GL InformixExtendedParallelServer InformixInternet Foundation.2000 Informix RedBrick Decision Server J/Foundation MaxConnect MVS MVS/ESA Net.Data NUMA-Q ON-Bar OnLineDynamicServer OS/2 OS/2 WARP OS/390 OS/400 PTX QBIC QMF RAMAC RedBrickDesign RedBrickDataMine RedBrick Decision Server RedBrickMineBuilder RedBrickDecisionscape RedBrickReady RedBrickSystems RelyonRedBrick S/390 Sequent SP System View Tivoli TME UniData UniData&Design UniversalDataWarehouseBlueprint UniversalDatabaseComponents UniversalWebConnect UniVerse VirtualTableInterface Visionary VisualAge WebIntegrationSuite WebSphere WebSphere DataStage

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Java, JDBC, and all Java-based trademarks are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. All other product or brand names may be trademarks of their respective companies. All information contained in this document has not been submitted to any formal IBM test and is distributed on an "as is" basis without any warranty either express or implied. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk. The original repository material for this course has been certified as being Year 2000 compliant. This document may not be reproduced in whole or in part without the prior written permission of IBM. Note to U.S. Government Users: Documentation related to restricted rights. Use, duplication, or disclosure is subject to restrictions set forth in the GSA ADP Schedule Contract with IBM Corp.

Course Contents
Mod 01: Introduction
Mod 02: Deployment
Mod 03: Administering DataStage
Mod 04: DataStage Designer
Mod 05: Creating Parallel Jobs
Mod 06: Accessing Sequential Data
Mod 07: Platform Architecture
Mod 08: Combining Data
Mod 09: Sorting and Aggregating Data
Mod 10: Transforming Data
Mod 11: Repository Functions
Mod 12: Working With Relational Data
Mod 13: Metadata in the Parallel Framework
Mod 14: Job Control
Appendix A: Information Server on zLinux

IBM InfoSphere DataStage Mod 01: Introduction

Copyright IBM Corporation 2009

Unit objectives
After completing this unit, you should be able to:
List and describe the uses of DataStage
List and describe the DataStage clients
Describe the DataStage workflow
List and compare the different types of DataStage jobs
Describe the two types of parallelism exhibited by DataStage parallel jobs

What is IBM InfoSphere DataStage?


Design jobs for Extraction, Transformation, and Loading (ETL)
Ideal tool for data integration projects such as data warehouses, data marts, and system migrations
Import, export, create, and manage metadata for use within jobs
Schedule, run, and monitor jobs, all within DataStage
Administer your DataStage development and execution environments
Create batch (controlling) jobs

IBM Information Server


Suite of applications, including DataStage, that:
Share a common repository (DB2, by default)
Share a common set of application services and functionality
  Provided by Metadata Server components hosted by an application server
IBM InfoSphere Application Server

Provided services include:


Security
Repository
Logging and reporting
Metadata management

Managed using web console clients


Administration console Reporting console

Information Server Backbone

Information Services Director

Business Glossary

Information Analyzer

DataStage

QualityStage

MetaBrokers Federation Server

Metadata Access Services

Metadata Analysis Services Metadata Server

Information Server console

Information Server Administration Console

DataStage Architecture

Clients

Parallel engine

Server engine Shared Repository

Engines

DataStage Administrator

DataStage Designer

DataStage parallel job

DataStage Director
Log message events

Developing in DataStage
Define global and project properties in Administrator
Import metadata into the Repository
Build the job in Designer
Compile the job in Designer
Run and monitor job log messages in Director
  Jobs can also be run in Designer, but job log messages cannot be viewed in Designer

DataStage Project Repository


User-added folder

Standard jobs folder

Standard Table Definitions folder

Types of DataStage Jobs


Server jobs
  Executed by the DataStage server engine
  Compiled into BASIC
  Runtime monitoring in DataStage Director

Parallel jobs
  Executed by the DataStage parallel engine
  Built-in functionality for pipeline and partition parallelism
  Compiled into OSH (Orchestrate Scripting Language)
  OSH executes Operators (executable C++ class instances)
  Runtime monitoring in DataStage Director

Job sequences (batch jobs, controlling jobs)
  Master Server jobs that kick off jobs and other activities
  Can kick off Server or Parallel jobs
  Executed by the Server engine
  Runtime monitoring in DataStage Director

Design Elements of Parallel Jobs


Stages
Implemented as OSH operators (pre-built components)
Passive stages (E and L of ETL)
  Read data or write data
  E.g., Sequential File, DB2, Oracle, Peek stages
Processor (active) stages (T of ETL)
  Transform, filter, aggregate, generate, or split / merge data
  E.g., Transformer, Aggregator, Join, Sort stages

Links
Pipes through which the data moves from stage to stage

Pipeline Parallelism

Transform, clean, load processes execute simultaneously Like a conveyor belt moving rows from process to process
Start downstream process while upstream process is running

Advantages:
Reduces disk usage for staging areas Keeps processors busy

Still has limits on scalability

Partition Parallelism
Divide the incoming stream of data into subsets to be separately processed by an operator
Subsets are called partitions

Each partition of data is processed by the same operator


E.g., if operation is Filter, each partition will be filtered in exactly the same way 8 times faster on 8 processors 24 times faster on 24 processors This assumes the data is evenly distributed

Facilitates near-linear scalability

Three-Node Partitioning

(Diagram: the data is divided into subset1, subset2, and subset3, each processed by the same operation on Node 1, Node 2, and Node 3.)

Here the data is partitioned into three partitions The operation is performed on each partition of data separately and in parallel If the data is evenly distributed, the data will be processed three times faster

Job Design v. Execution


User designs the flow in DataStage Designer

at runtime, this job runs in parallel for any configuration or partitions (called nodes)

Checkpoint
1. True or false: DataStage Director is used to build and compile your ETL jobs 2. True or false: Use Designer to monitor your job during execution 3. True or false: Administrator is used to set global and project properties

Checkpoint solutions
1. False. 2. False. 3. True.

Unit summary
Having completed this unit, you should be able to:
List and describe the uses of DataStage
List and describe the DataStage clients
Describe the DataStage workflow
List and compare the different types of DataStage jobs
Describe the two types of parallelism exhibited by DataStage parallel jobs

IBM InfoSphere DataStage Mod 02: Deployment

Copyright IBM Corporation 2009

Unit objectives
After completing this unit, you should be able to:
Identify the components of Information Server that need to be installed
Describe what a deployment domain consists of
Describe different domain deployment options
Describe the installation process
Start the Information Server

What Gets Deployed


An Information Server domain, consisting of the following:
  MetaData Server, hosted by an IBM InfoSphere Application Server instance
  One or more DataStage servers
DataStage server includes both the parallel and server engines

  One DB2 UDB instance containing the Repository database
Information Server clients
  Administration console
  Reporting console
  DataStage clients: Administrator, Designer, Director

Additional Information Server applications


Information Analyzer
Business Glossary
Rational Data Architect
Information Services Director
Federation Server

Deployment: Everything on One Machine


Here we have a single domain with the hosted MetaData Server backbone applications all on one machine Clients Additional Client workstations can connect to this machine using TCP/IP

DataStage server Clients

DB2 instance with Repository

Deployment: DataStage on Separate Machine


Here the domain is split between two machines
DataStage server MetaData server and DB2 Repository

MetaData Server backbone

DataStage server Clients

DB2 instance with Repository

MetaData Server and DB2 On Separate Machines


Here the domain is split between three machines
DataStage server MetaData server DB2 Repository
MetaData Server backbone

Clients DataStage server DB2 instance with Repository

Information Server Installation

Installation Configuration Layers


Configuration layers include:
Client: DataStage and Information Server clients
Engine: DataStage and other application engines
Domain: MetaData Server and hosted metadata server components; installed products' domain components
Repository: Repository database server and database
Documentation

Selected layers are installed on the machine local to the installation Already existing components can be configured and used
E.g., DB2, InfoSphere Application Server

Information Server Start-Up


Start the MetaData Server
  From the Windows Start menu, click Start the Server under the profile to be used (e.g., default)
  From the command line, open the profile bin directory and enter: startup server1
    server1 is the default name of the application server hosting the MetaData Server

Start the ASB agent
  From the Windows Start menu, click Start the agent after selecting the Information Server folder
  Only required if DataStage and the MetaData Server are on different machines
To begin work in DataStage, double-click on a DataStage client icon
To begin work in the Administration and Reporting consoles, double-click on the Web Console for Information Server icon

Starting the MetaData Server Backbone

Application Server Profiles folder

Profile Profile

Start the Server Profile\bin directory

Startup command

Starting the ASB Agent

Start the agent

Checkpoint
1. What application components make up a domain? 2. Can a domain contain multiple DataStage servers? 3. Do the DB2 instance and the Repository database need to be on the same machine as the Application Server? 4. Suppose DataStage is on a separate machine from the Application Server. What two components need to be running before you can log onto DataStage?

Checkpoint solutions
1. Metadata server hosted by the Application Server. One or more DataStage servers. One DB2/UDB instance containing the Suite Repository database. 2. Yes. The DataStage servers must be on separate machines. 3. No. The DB2 instance with the Repository can reside on a separate machine from the Application Server. 4. The Application Server and the ASB agent.

Unit summary
Having completed this unit, you should be able to:
Identify the components of Information Server that need to be installed
Describe what a deployment domain consists of
Describe different domain deployment options
Describe the installation process
Start the Information Server

IBM InfoSphere DataStage Mod 03: Administering DataStage

Copyright IBM Corporation 2009

Unit objectives
After completing this unit, you should be able to:
Open the Administration console
Create new users and groups
Assign Suite roles and Product roles to users and groups
Give users DataStage credentials
Log onto DataStage Administrator
Add a DataStage user on the Permissions tab and specify the user's role
Specify DataStage global and project defaults
List and describe important environment variables

Information Server Administration Web Console


Web application for administering Information Server Use for:
Domain management
Session management
Management of users and groups
Logging management
Scheduling management

Opening the Administration Web Console

Information Server console address

Information Server administrator ID

Users and Group Management

User and Group Management


Suite authorizations can be provided to users or groups
  Authorizations are provided in the form of roles
  Users that are members of a group acquire the authorizations of the group

Two types of roles

Suite roles: apply to the Suite
  Administrator: perform user and group management tasks; includes all the privileges of the Suite User role
  User: create views of scheduled tasks and logged messages; create and run reports

Suite Component roles: apply to a specific product or component of Information Server, e.g., DataStage
  DataStage user: permissions are assigned within DataStage (Developer, Operator, Super Operator, Production Manager)
  DataStage administrator: full permissions to work in DataStage Administrator, Designer, and Director
  And so on, for all products in the Suite

Creating a DataStage User ID


Administration console

Create new user

Users

Assigning DataStage Roles


Assign Suite Administrator role

User ID

Users

Assign Suite User role

Assign DataStage User role

Assign DataStage Administrator role

DataStage Credential Mapping


DataStage credentials
Required by DataStage in order to log onto a DataStage client Managed by the DataStage Server operating system or LDAP

Users given DataStage Suite roles in the Suite Administration console do not automatically receive DataStage credentials
Users need to be mapped to a user who has DataStage credentials on the DataStage Server machine This DataStage user must have file access permission to the DataStage engine/project files or Administrator rights on the operating system

Three ways of mapping users (IS ID to OS ID):
  Many to 1
  1 to 1
  Combination of the above

DataStage Credential Mapping

Map DataStage User credential

DataStage Administrator

Logging onto DataStage Administrator


Host name, port number of application server

DataStage administrator ID and password Name or IP address of DataStage server machine

DataStage Administrator Projects Tab

Server projects Click to specify project properties

Link to Information Server Administration console

DataStage Administrator General Tab

Enable runtime column propagation (RCP) by default

Specify auto purge

Environment variable settings

Environment Variables
Parallel job variables

User-defined variables

Environment Reporting Variables

Display Score

Display record counts

Display OSH

DataStage Administrator Permissions Tab

Assigned product role

Add DataStage users

Adding Users and Groups

Available users / groups. Must have a DataStage product User role.

Add DataStage users

Specify DataStage Role

Added DataStage user

Select DataStage role

DataStage Administrator Tracing Tab

DataStage Administrator Parallel Tab

DataStage Administrator Sequence Tab

Checkpoint
1. Authorizations can be assigned to what two items? 2. What two types of authorization roles can be assigned to a user or group? 3. In addition to Suite authorization to log onto DataStage what else does a DataStage developer require to work in DataStage?

Checkpoint solutions
1. Users and groups. Members of a group acquire the authorizations of the group. 2. Suite roles and Component roles. 3. Must be mapped to a user with DataStage credentials.

Unit summary
Having completed this unit, you should be able to:
Open the Administration console
Create new users and groups
Assign Suite roles and Product roles to users and groups
Give users DataStage credentials
Log onto DataStage Administrator
Add a DataStage user on the Permissions tab and specify the user's role
Specify DataStage global and project defaults
List and describe important environment variables

IBM InfoSphere DataStage Mod 04: DataStage Designer

Copyright IBM Corporation 2009

Unit objectives
After completing this unit, you should be able to:
Log onto DataStage
Navigate around DataStage Designer
Import and export DataStage objects to a file
Import a Table Definition for a sequential file

Logging onto DataStage Designer


Host name, port number of application server

DataStage server machine/project

Designer Work Area


Repository Menus Toolbar

Parallel canvas

Palette

Importing and Exporting DataStage Objects

Repository Window
Search for objects in the project

Project Default jobs folder

Default Table Definitions folder

Import and Export


Any object in the repository window can be exported to a file
Can export whole projects
Use for backup
Sometimes used for version control
Can be used to move DataStage objects from one project to another
Use to share DataStage jobs and projects with other developers

Export Procedure
Click Export>DataStage Components
Select DataStage objects for export
Specify the type of export:
  DSX: default format
  XML: enables processing of the export file by XML applications, e.g., for generating reports

Specify file path on client machine

Export Window

Click to select objects from the repository

Selected objects

Export type

Export location on client system Begin export

Import Procedure
Click Import>DataStage Components
Or Import>DataStage Components (XML) if you are importing an XML-format export file

Select DataStage objects for import

Import Options
Import all objects in the file

Display list to select from

Import/Export Command Line Interface


Command istool Located \IBM\InformationServer\Clients\istools\clients
Documentation in same location

Invoke through Programs > IBM Information Server > Information Server Command Line Interface
Examples:
  istool hostname domain:9080 username dsuser password inf0server deployment-package importfile.dsx import
  istool hostname domain:9080 username dsuser password inf0server job GenDataJob filename exportfile.dsx export

Information Server Manager

Information Server Manager


Create and deploy packages of DataStage objects Perform DataStage object exports / imports
Similar to DataStage Designer exports / imports Export files are stored as archives (extension .isx)

Provides a domain-wide view of the DataStage Repository


Displays all DataStage servers E.g., a Development server and a Production server Displays all DataStage projects on each server

Before you can use it you must add a domain


Right-click over Repository, click Add Domain Select domain Specify user ID / password

Invoke through Information Server start menu:


IBM Information Server Information Server Manager

Adding a Domain

Information Server domain. May contain multiple DataStage systems

Information Server Manager Window

Created export archive DataStage Repository. Shows all DataStage servers and projects

Information Server Manager Package View


Multiple packages Multiple builds within each package Build history Domain Packages

Builds

Objects

Importing Table Definitions

Importing Table Definitions


Table Definitions describe the format and columns of files and tables You can import Table Definitions for:
Sequential files Relational tables COBOL files Many other things

Table Definitions can be loaded into job stages

Sequential File Import Procedure


Click Import>Table Definitions>Sequential File Definitions
Select the directory containing the sequential file, and then the file
Select a repository folder to store the Table Definition in
Examine the format and column definitions and edit as necessary

Importing Sequential Metadata


Import Table Definition for a sequential file

Sequential Import Window


Select directory containing files

Start import Select file

Select repository folder

Specify Format
Edit columns Select if first row has column names

Delimiter

Edit Column Names and Types


Double-click to define extended properties

Extended Properties window

Parallel properties

Property categories

Available properties

Table Definition General Tab

Type

Source

Stored Table Definition

Checkpoint
1. True or False? The directory to which you export is on the DataStage client machine, not on the DataStage server machine.
2. Can you import Table Definitions for sequential files with fixed-length record formats?

Checkpoint solutions
1. True. 2. Yes. Record lengths are determined by the lengths of the individual columns.

Unit summary
Having completed this unit, you should be able to:
Log onto DataStage
Navigate around DataStage Designer
Import and export DataStage objects to a file
Import a Table Definition for a sequential file

IBM InfoSphere DataStage Mod 05: Creating Parallel Jobs

Copyright IBM Corporation 2009

Unit objectives
After completing this unit, you should be able to:
Design a simple Parallel job in Designer
Define a job parameter
Use the Row Generator, Peek, and Annotation stages in a job
Compile your job
Run your job in Director
View the job log

What Is a Parallel Job?


Executable DataStage program Created in DataStage Designer
Can use components from Repository

Built using a graphical user interface Compiles into Orchestrate script language (OSH) and object code (from generated C++)

Job Development Overview


Import metadata defining sources and targets
Done within Designer using import process

In Designer, add stages defining data extractions and loads
Add processing stages to define data transformations
Add links defining the flow of data from sources to targets
Compile the job
In Director, validate, run, and monitor your job
  Can also run the job in Designer
  Can only view the job log in Director

Tools Palette
Stage categories

Stages

Adding Stages and Links


Drag stages from the Tools Palette to the diagram
Can also be dragged from Stage Type branch to the diagram

Draw links from source to target stage


Right mouse over source stage Release mouse button over target stage

Job Creation Example Sequence


Brief walkthrough of procedure Assumes Table Definition of source already exists in the repository

Create New Parallel Job

Open New window Parallel job

Drag Stages and Links From Palette


Compile

Row Generator Job properties

Peek

Renaming Links and Stages


Click on a stage or link to rename it Meaningful names have many benefits
Documentation Clarity Fewer development errors

Row Generator Stage
Produces mock data for specified columns
No input link; single output link
On the Properties tab, specify the number of rows
On the Columns tab, load or specify column definitions
  Open the Extended Properties window to specify the values to be generated for a column
  A number of algorithms for generating values are available, depending on the data type

Algorithms for Integer type


Random: seed, limit Cycle: Initial value, increment

Algorithms for string type: Cycle , alphabet Algorithms for date type: Random, cycle

Inside the Row Generator Stage


Properties tab

Set property value

Property

Columns Tab
Double-click to specify extended properties

View data

Select Table Definition

Load a Table Definition

Extended Properties

Specified properties and their values

Additional properties to add

Peek Stage
Displays field values
Displayed in the job log or sent to a file
Skip records option
Can control the number of records to be displayed
Shows data in each partition, labeled 0, 1, 2, ...

Useful stub stage for iterative job development


Develop job to a stopping point and check the data

Peek Stage Properties

Output to job log

Job Parameters
Defined in the Job Properties window
Makes the job more flexible
Parameters can be:
  Used in directory and file names
  Used to specify property values
  Used in constraints and derivations
Parameter values are determined at run time
When used for directory and file names and property values, they are surrounded with pound signs (#)
  E.g., #NumRows#

Job parameters can reference DataStage environment variables


Prefaced by $, e.g., $APT_CONFIG_FILE

Defining a Job Parameter


Parameters tab

Parameter Add environment variable

Using a Job Parameter in a Stage

Job parameter

Click to insert Job parameter

Adding Job Documentation


Job Properties
Short and long descriptions

Annotation stage
Added from the Tools Palette Displays formatted text descriptions on diagram

Job Properties Documentation

Documentation

Annotation Stage Properties

Compiling a Job

Run Compile

Errors or Successful Message

Highlight stage with error

Click for more info

Running Jobs and Viewing the Job Log in Designer

DataStage Director
Use to run and schedule jobs View runtime messages Can invoke directly from Designer
Tools > Run Director

Run Options

Assign values to parameter

Run Options

Stop after number of warnings

Stop after number of rows

Director Status View


Status view Schedule view Log view

Select job to view messages in the log view

Director Log View


Click the notepad icon to view log messages

Peek messages

Message Details

Other Director Functions


Schedule job to run on a particular date/time Clear job log of messages Set job log purging conditions Set Director options
Row limits (server job only) Abort after x warnings

Running Jobs from Command Line


dsjob -run -param numrows=10 dx444 GenDataJob
  Runs a job
  Use -run to run the job
  Use -param to specify parameters
  In this example, dx444 is the name of the project
  In this example, GenDataJob is the name of the job

dsjob -logsum dx444 GenDataJob
  Displays a job's messages in the log

Documented in the Parallel Job Advanced Developer's Guide

Checkpoint
1. Which stage can be used to display output data in the job log? 2. Which stage is used for documenting your job on the job canvas? 3. What command is used to run jobs from the operating system command line?

Checkpoint solutions
1. Peek stage 2. Annotation stage 3. dsjob -run

Unit summary
Having completed this unit, you should be able to:
Design a simple Parallel job in Designer
Define a job parameter
Use the Row Generator, Peek, and Annotation stages in a job
Compile your job
Run your job in Director
View the job log

IBM InfoSphere DataStage Mod 06: Accessing Sequential Data

Copyright IBM Corporation 2009

Unit objectives
After completing this unit, you should be able to:
Understand the stages for accessing different kinds of file data
  Sequential File stage
  Data Set stage
Create jobs that read from and write to sequential files
Create Reject links
Work with NULLs in sequential files
Read from multiple files using file patterns
Use multiple readers

Types of File Data


Sequential
Fixed or variable length

Data Set Complex flat file

How Sequential Data is Handled


Import and export operators are generated
Stages get translated into operators during the compile

Import operators convert data from the external format, as described by the Table Definition, to the framework internal format
Internally, the format of data is described by schemas

Export operators reverse the process Messages in the job log use the import / export terminology
E.g., 100 records imported successfully; 2 rejected E.g., 100 records exported successfully; 0 rejected Records get rejected when they cannot be converted correctly during the import or export

Using the Sequential File Stage


Both import and export of sequential files (text, binary) can be performed by the SequentialFile Stage.

Importing/Exporting Data

Data import:

Internal format

Data export

Internal format

Features of Sequential File Stage


Normally executes in sequential mode Executes in parallel when reading multiple files Can use multiple readers within a node
Reads chunks of a single file in parallel

The stage needs to be told:


How file is divided into rows (record format) How row is divided into columns (column format)

Sequential File Format Example


(Diagram: a record whose fields are separated by a field delimiter (comma) and terminated by a record delimiter (nl).)
With Final Delimiter = end:    Field1,Field2,Field3,Last field<nl>
With Final Delimiter = comma:  Field1,Field2,Field3,Last field,<nl>

Sequential File Stage Rules


One input link
One stream output link
Optionally, one reject link
  Will reject any records not matching the metadata in the column definitions
  Example: you specify three columns separated by commas, but the row that's read has no commas in it
  Example: the second column is designated a decimal in the Table Definition, but contains alphabetic characters

Job Design Using Sequential Stages

Stream link

Reject link (broken line)

Input Sequential Stage Properties


Output tab File to access

Column names in first row

Click to add more files to read

Format Tab

Record format

Load from Table Definition

Column format

Sequential Source Columns Tab

View data Load from Table Definition Save as a new Table Definition

Reading Using a File Pattern

Use wild cards

Select File Pattern

Properties - Multiple Readers

Multiple readers option allows you to set number of readers per node

Sequential Stage As a Target


Input Tab

Append / Overwrite

Reject Link
Reject mode =
  Continue: continue reading records
  Fail: abort the job
  Output: send down the output link
In a source stage: all records not matching the metadata (column definitions) are rejected
In a target stage: all records that fail to be written for any reason are rejected
Rejected records consist of one column, of data type raw


Reject mode property

Inside the Copy Stage

Column mappings

Reading and Writing NULL Values to a Sequential File

Working with NULLs


Internally, NULL is represented by a special value outside the range of any existing, legitimate values
If NULL is written to a non-nullable column, the job will abort
Columns can be specified as nullable
  NULLs can be written to nullable columns
You must handle NULLs written to nullable columns in a Sequential File stage
  You need to tell DataStage what value to write to the file
  Unhandled rows are rejected
In a Sequential File source stage, you can specify values you want DataStage to convert to NULLs

Specifying a Value for NULL

Nullable column

Added property

DataSet Stage

Data Set
Binary data file; preserves partitioning
  Component dataset files are written to each partition
Suffixed by .ds
Referred to by a header file
Managed by the Data Set Management utility from the GUI (Manager, Designer, Director)
Represents persistent data
Key to good performance in a set of linked jobs
  No import / export conversions are needed
  No repartitioning needed
Accessed using the DataSet stage
Implemented with two types of components:
  Descriptor file: contains metadata and the data location, but NOT the data itself
  Data file(s): contain the data; multiple files, one per partition (node)

Job With DataSet Stage


DataSet stage

DataSet stage properties

Data Set Management Utility


Display schema Display data

Display record counts for each data file

Data and Schema Displayed


Data viewer

Schema describing the format of the data

File Set Stage



Use to read and write to file sets
Files suffixed by .fs
File sets are similar to datasets
  Partitioned
  Implemented with a header file and data files
How file sets differ from datasets
  Data files are text files, hence readable by external applications
  Datasets have a proprietary data format which may change in future DataStage versions

Checkpoint
1. List three types of file data. 2. What makes datasets perform better than other types of files in parallel jobs? 3. What is the difference between a data set and a file set?

Checkpoint solutions
1. Sequential, dataset, & complex flat files. 2. They are partitioned and they store data in the native parallel format. 3. Both are partitioned. Data sets store data in a binary format not readable by user applications. File sets are readable.

Unit summary
Having completed this unit, you should be able to:
Understand the stages for accessing different kinds of file data
  Sequential File stage
  Data Set stage
Create jobs that read from and write to sequential files
Create Reject links
Work with NULLs in sequential files
Read from multiple files using file patterns
Use multiple readers

IBM InfoSphere DataStage Mod 07: Platform Architecture

Copyright IBM Corporation 2009

Unit objectives
After completing this unit, you should be able to:
Describe the parallel processing architecture
Describe pipeline parallelism
Describe partition parallelism
List and describe partitioning and collecting algorithms
Describe configuration files
Describe the parallel job compilation process
Explain OSH
Explain the Score

Key Parallel Job Concepts


Parallel processing:
Executing the job on multiple CPUs

Scalable processing:
Add more resources (CPUs and disks) to increase system performance

Example system: 6 CPUs (processing nodes) and disks Scale up by adding more CPUs Add CPUs as individual nodes or to an SMP system

Scalable Hardware Environments

Single CPU: dedicated memory and disk
SMP: multiple CPUs (2-64+), shared memory and disk
GRID / clusters: multiple multi-CPU systems, dedicated memory per node, typically SAN-based shared storage
MPP: multiple nodes with dedicated memory and storage, 2 to 1000s of CPUs

Pipeline Parallelism

Transform, clean, load processes execute simultaneously Like a conveyor belt moving rows from process to process
Start downstream process while upstream process is running

Advantages:
Reduces disk usage for staging areas Keeps processors busy

Still has limits on scalability
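The conveyor-belt idea can be illustrated outside DataStage. The following minimal Python sketch (illustrative only, not DataStage internals; the stage names and sample rows are invented) chains three generator "stages" so the downstream load step consumes each row as soon as the upstream steps produce it, with no intermediate staging file:

# Minimal sketch of pipeline parallelism using Python generators:
# each "stage" starts consuming rows as soon as the upstream stage
# yields them, instead of waiting for a complete staging file.

def extract():
    for i in range(10):          # pretend these rows come from a source file
        yield {"id": i, "amount": i * 10}

def transform(rows):
    for row in rows:             # processes each row as it arrives
        row["amount"] *= 1.1
        yield row

def load(rows):
    for row in rows:             # "loads" rows while extract/transform still run
        print(row)

load(transform(extract()))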

Partition Parallelism
Divide the incoming stream of data into subsets to be separately processed by an operation
Subsets are called partitions (nodes)

Each partition of data is processed by the same operation


E.g., if operation is Filter, each partition will be filtered in exactly the same way

Facilitates near-linear scalability
  8 times faster on 8 processors
  24 times faster on 24 processors
  This assumes the data is evenly distributed

Three-Node Partitioning

(Diagram: the data is divided into subset1, subset2, and subset3, each processed by the same operation on Node 1, Node 2, and Node 3.)

Here the data is partitioned into three partitions The operation is performed on each partition of data separately and in parallel If the data is evenly distributed, the data will be processed three times faster

Parallel Jobs Combine Partitioning and Pipelining

Within parallel jobs, pipelining, partitioning, and repartitioning are automatic
The job developer only identifies:
  Sequential vs. parallel mode (by stage); the default for most stages is parallel mode
  The method of data partitioning: how to distribute the data across the available nodes
  The method of data collecting: how to collect the data from the nodes into a single node
  The configuration file: specifies the nodes and resources

Job Design v. Execution


User assembles the flow using DataStage Designer

at runtime, this job runs in parallel for any configuration (1 node, 4 nodes, N nodes)

No need to modify or recompile the job design!

Configuration File
Configuration file separates configuration (hardware / software) from job design
Specified per job at runtime by $APT_CONFIG_FILE Change hardware and resources without changing job design

Defines nodes with their resources (need not match physical CPUs)
Dataset, Scratch, Buffer disk (file systems) Optional resources (Database, SAS, etc.) Advanced resource optimizations
Pools (named subsets of nodes)

Multiple configuration files can be used for different occasions of job execution


Optimizes overall throughput and matches job characteristics to overall hardware resources Allows runtime constraints on resource usage on a per job basis

Example Configuration File


{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {"bigdata"}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource disk "/orch/n2/d2" {"bigdata"}
    resource scratchdisk "/temp" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1" {}
    resource scratchdisk "/temp" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1" {}
    resource scratchdisk "/temp" {}
  }
}

Key points:
  1. Number of nodes defined
  2. Resources assigned to each node; their order is significant
  3. Advanced resource optimizations and configuration (named pools, database, SAS)

Partitioning and Collecting

Partitioning and Collecting


Partitioning breaks incoming rows into multiple streams of rows (one for each node) Each partition of rows is processed separately by the stage/operator Collecting returns partitioned data back to a single stream Partitioning / Collecting is specified on stage input links

Partitioning / Collecting Algorithms


Partitioning algorithms include:
  Round robin
  Random
  Hash: determine the partition based on key value (requires key specification)
  Modulus
  Entire: send all rows down all partitions
  Same: preserve the same partitioning
  Auto: let DataStage choose the algorithm

Collecting algorithms include:
  Auto: collect the first available record
  Round robin
  Sort Merge: read in by key; presumes the data is sorted by the key in each partition; builds a single sorted stream based on the key
  Ordered: read all records from the first partition, then the second, ...

Keyless V. Keyed Partitioning Algorithms


Keyless: Rows are distributed independently of data values
Round Robin Random Entire Same

Keyed: Rows are distributed based on values in the specified key


Hash: partition based on key value
  Example: the key is State. All "CA" rows go into the same partition; all "MA" rows go into the same partition. Two rows with the same state never go into different partitions
Modulus: partition based on the modulus of the key value by the number of partitions; the key must be a numeric type
  Example: the key is OrderNumber (numeric type). Rows with the same order number will all go into the same partition
DB2: matches DB2 EEE partitioning
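As a rough illustration of the difference between keyless and keyed assignment (a Python sketch only, not DataStage's partitioner operators; the sample rows and three-partition setup are invented):

# Illustrative sketch of keyless round robin versus keyed hash and
# modulus partitioning across three partitions.

NUM_PARTITIONS = 3
rows = [{"order": 101, "state": "CA"},
        {"order": 102, "state": "MA"},
        {"order": 103, "state": "CA"},
        {"order": 104, "state": "NY"},
        {"order": 105, "state": "MA"}]

# Keyless: round robin deals rows out in turn, regardless of their values.
partitions = [[] for _ in range(NUM_PARTITIONS)]
for i, row in enumerate(rows):
    partitions[i % NUM_PARTITIONS].append(row)

# Keyed: hash sends every row with the same key value to the same partition.
def hash_partition(key_value, n=NUM_PARTITIONS):
    return hash(key_value) % n

# Keyed: modulus does the same for a single integer key, without hashing.
def modulus_partition(int_key, n=NUM_PARTITIONS):
    return int_key % n

for row in rows:
    print(row, hash_partition(row["state"]), modulus_partition(row["order"]))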

Round Robin and Random Partitioning Keyless


Keyless partitioning methods
Rows are evenly distributed across partitions
  Good for initial import of data if no other partitioning is needed
  Useful for redistributing data
  Fairly low overhead
Round Robin assigns rows to partitions like dealing cards
  Row/partition assignment will be the same for a given $APT_CONFIG_FILE
  (Diagram: rows 0-8 dealt round robin across three partitions: 0,3,6 / 1,4,7 / 2,5,8)
Random has slightly higher overhead, but assigns rows in a non-deterministic fashion between job runs

ENTIRE Partitioning
Each partition gets a complete copy of the data
  Useful for distributing lookup and reference data
  May have a performance impact in MPP / clustered environments
  On SMP platforms, the Lookup stage (only) uses shared memory instead of duplicating ENTIRE reference data
  On MPP platforms, each server uses shared memory for a single local copy
ENTIRE is the default partitioning for Lookup reference links with Auto partitioning
  On SMP platforms, it is a good practice to set this explicitly on the Normal Lookup reference link(s)
(Diagram: keyless; rows 0, 1, 2, 3, ... are copied in full to every partition)

HASH Partitioning
Keyed partitioning method
Rows are distributed according to the values in the key columns
  Guarantees that rows with the same key values go into the same partition
  Needed to prevent matching rows from hiding in other partitions (e.g., Join, Merge, Remove Duplicates, ...)
Partition distribution is relatively equal if the data across the source key column(s) is evenly distributed
(Diagram: key values 0 3 2 1 0 2 3 2 1 1 hashed into partitions 0,3,0,3 / 1,1,1 / 2,2,2)

Modulus Partitioning
Keyed partitioning method
Rows are distributed according to the values in one integer key column
  Uses the modulus: partition = MOD(key_value, number of partitions)
  E.g., with 3 partitions, a row whose key value is 7 goes to partition MOD(7, 3) = 1
Faster than HASH
Guarantees that rows with identical key values go into the same partition
Partition size is relatively equal if the data within the key column is evenly distributed
(Diagram: key values 0 3 2 1 0 2 3 2 1 1 assigned by modulus to partitions 0,3,0,3 / 1,1,1 / 2,2,2)

Auto Partitioning
DataStage inserts partition components as necessary to ensure correct results
  Before any stage with Auto partitioning
  Generally chooses ROUND ROBIN or SAME
  Inserts HASH on stages that require matched key values (e.g., Join, Merge, Remove Duplicates)
  Inserts ENTIRE on Normal (not Sparse) Lookup reference links
  NOT always appropriate for MPP/clusters
Since DataStage has limited awareness of your data and business rules, explicitly specify HASH partitioning when needed
  DataStage has no visibility into Transformer logic
  Hash is required before Sort and Aggregator stages
  DataStage sometimes inserts un-needed partitioning; check the log

Partitioning Requirements for Related Records


Misplaced records
  Using the Aggregator stage to sum customer sales by customer number
  If there are 25 customers, 25 records should be output
  But suppose records with the same customer numbers are spread across partitions
    This will produce more than 25 groups (records)
  Solution: use the hash partitioning algorithm (see the sketch below)
Partition imbalances
  If all the records are going down only one of the nodes, then the job is in effect running sequentially
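A small Python sketch of the misplaced-records problem described above (illustrative only; the customer data and the two-partition setup are invented):

# Per-partition sums are only correct per customer if every row for that
# customer lands in the same partition, which hash partitioning on the
# customer key guarantees.

from collections import defaultdict

rows = [("cust1", 10), ("cust1", 5), ("cust2", 20), ("cust2", 15)]

def aggregate(partition):
    totals = defaultdict(int)
    for customer, amount in partition:
        totals[customer] += amount
    return dict(totals)

# Round robin: each customer's rows are split across both partitions,
# so the job emits four groups instead of two.
p0, p1 = rows[0::2], rows[1::2]
print(aggregate(p0), aggregate(p1))

# Hash on the customer key: each customer stays in one partition,
# so the job emits exactly one group per customer.
h0 = [r for r in rows if hash(r[0]) % 2 == 0]
h1 = [r for r in rows if hash(r[0]) % 2 == 1]
print(aggregate(h0), aggregate(h1))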

Unequal Distribution Example


Same key values are assigned to the same partition.

Source data (hash on LName, with a 2-node config file):
  ID  LName  FName    Address
  1   Ford   Henry    66 Edison Avenue
  2   Ford   Clara    66 Edison Avenue
  3   Ford   Edsel    7900 Jefferson
  4   Ford   Eleanor  7900 Jefferson
  5   Dodge  Horace   17840 Jefferson
  6   Dodge  John     75 Boston Boulevard
  7   Ford   Henry    4901 Evergreen
  8   Ford   Clara    4901 Evergreen
  9   Ford   Edsel    1100 Lakeshore
  10  Ford   Eleanor  1100 Lakeshore

Partition 0 (2 rows):
  5   Dodge  Horace   17840 Jefferson
  6   Dodge  John     75 Boston Boulevard

Partition 1 (8 rows):
  1   Ford   Henry    66 Edison Avenue
  2   Ford   Clara    66 Edison Avenue
  3   Ford   Edsel    7900 Jefferson
  4   Ford   Eleanor  7900 Jefferson
  7   Ford   Henry    4901 Evergreen
  8   Ford   Clara    4901 Evergreen
  9   Ford   Edsel    1100 Lakeshore
  10  Ford   Eleanor  1100 Lakeshore

Partitioning / Collecting Link Icons

Partitioning icon

Collecting icon

More Partitioning Icons


fan-out Sequential
to Parallel

SAME partitioner

Re-partition
watch for this!

AUTO partitioner

Partitioning Tab
Input tab Key specification

Algorithms

Collecting Specification

Key specification

Algorithms

Runtime Architecture

Parallel Job Compilation


What gets generated:
  OSH: a kind of script
    OSH represents the design data flow and stages; stages become OSH operators
  A Transform operator for each Transformer stage
    A custom operator built during the compile
    Compiled into C++ and then into corresponding native operators
    Thus a C++ compiler is needed to compile jobs with a Transformer stage
(Diagram: Designer client -> Compile -> DataStage server -> Executable job, including the Transformer components)

Generated OSH
Enable viewing of generated OSH in Administrator:

Comments Operator Schema

OSH is visible in:

- Job properties - Job run log - View Data - Table Defs

Stage to Operator Mapping Examples


Sequential File
  Source: import
  Target: export
DataSet: copy
Sort: tsort
Aggregator: group
Row Generator, Column Generator, Surrogate Key Generator: generator
Oracle
  Source: oraread
  Sparse Lookup: oralookup
  Target Load: orawrite
  Target Upsert: oraupsert

Generated OSH Primer


Comment blocks introduce each operator
  Operator order is determined by the order stages were added to the canvas
For each operator:
  Operator name
  Schema
  Operator options (-name value format)
  Input (indicated by n< where n is the input #)
  Output (indicated by n> where n is the output #); may include modify

Generated OSH for first 2 stages

Syntax

####################################################
#### STAGE: Row_Generator_0
## Operator
generator
## Operator options
-schema record
(
  a:int32;
  b:string[max=12];
  c:nullable decimal[10,2] {nulls=10};
)
-records 50000
## General options
[ident('Row_Generator_0'); jobmon_ident('Row_Generator_0')]
## Outputs
0> [] 'Row_Generator_0:lnk_gen.v'
;
####################################################
#### STAGE: SortSt
## Operator
tsort
## Operator options
-key 'a' -asc
## General options
[ident('SortSt'); jobmon_ident('SortSt'); par]
## Inputs
0< 'Row_Generator_0:lnk_gen.v'
## Outputs
0> [modify (
  keep a,b,c;
)] 'SortSt:lnk_sorted.v'
;

For every operator, input and/or output datasets are numbered sequentially starting from 0. E.g.: Virtual datasets are generated to connect operators
op1 0> dst op1 1< src

Virtual dataset is used to connect output of one operator to input of another

Job Score
Generated from the OSH and the configuration file
  Think of the Score as a musical score, not a game score
Assigns nodes for each operator
Inserts sorts and partitioners as needed
Defines the connection topology (virtual datasets) between adjacent operators
Inserts buffer operators to prevent deadlocks
Defines the actual processes

Where possible, multiple operators are combined into a single process

Viewing the Job Score


Score message in job log
Set $APT_DUMP_SCORE to output the Score to the job log
To identify the Score dump, look for "main program: This step ..."
  You don't see the word "Score" anywhere in the message

Virtual datasets

Operators with node assignments

Job Execution: The Orchestra


Conductor Node
  Conductor: the initial process
    Composes the Score
    Creates Section Leader processes (one per node)
    Consolidates messages to the DataStage log
    Manages orderly shutdown
Processing Nodes
  Section Leader (one per node)
    Forks Player processes (one per stage)
    Manages up/down communication
  Players
    The actual processes associated with stages
    Combined players: one process only
    Send stderr and stdout to the Section Leader
    Establish connections to other players for data flow
    Clean up upon completion

Runtime Control and Data Networks


(Diagram: the Conductor communicates with Section Leaders 0, 1, and 2 through a control channel (TCP) and stdout/stderr channels (pipes) via the APT_Communicator; each Section Leader manages its generator and copy players.)

$ osh generator -schema record(a:int32) [par] | roundrobin | copy

Checkpoint
1. What file defines the degree of parallelism a job runs under? 2. Name two partitioning algorithms that partition based on key values? 3. What partitioning algorithm produces the most even distribution of data in the various partitions?

Checkpoint solutions
1. Configuration file. 2. Hash, Modulus, DB2. 3. Round robin or Entire.

Unit summary
Having completed this unit, you should be able to:
Describe the parallel processing architecture
Describe pipeline parallelism
Describe partition parallelism
List and describe partitioning and collecting algorithms
Describe configuration files
Describe the parallel job compilation process
Explain OSH
Explain the Score

IBM InfoSphere DataStage Mod 08: Combining Data

Copyright IBM Corporation 2009

Unit objectives
After completing this unit, you should be able to:
Combine data using the Lookup stage
Define range lookups
Combine data using the Merge stage
Combine data using the Join stage
Combine data using the Funnel stage

Combining Data
Ways to combine data:
Horizontally
  Multiple input links
  One output link made of columns from different input links
  Join, Lookup, and Merge stages
Vertically
  One input link, one output link combining groups of related records into a single record
  Aggregator and Remove Duplicates stages
Funneling
  Multiple input streams funneled into a single output stream
  Funnel stage

Lookup, Merge, Join Stages


These stages combine two or more input links
Data is combined by designated "key" column(s)

These stages differ mainly in:


Memory usage Treatment of rows with unmatched key values Input requirements (sorted, de-duplicated)

Lookup Stage

Lookup Features
One stream input link (source link)
Multiple reference links
One output link
Optionally, one reject link
  Only available when the lookup failure option is set to Reject
Lookup failure options: Continue, Drop, Fail, Reject
Can return multiple matching rows
A hash table is built in memory from the lookup files
  Indexed by key
  Should be small enough to fit into physical memory

Lookup Types
Equality match
  Match values in the lookup key column of the reference link exactly to selected values in the source row
  Return the row or rows (if multiple matches are to be returned) that match
Caseless match
  Like an equality match, except that it is caseless
  E.g., "abc" matches "AbC"
Range on the reference link
  Two columns on the reference link define the range
  A match occurs when a selected value in the source row is within the range
Range on the source link
  Two columns on the source link define the range
  A match occurs when a selected value in the reference link is within the range

Lookup Example

Reference link Source (stream) link

Lookup Stage With an Equality Match


Source link columns Lookup constraints

Mappings to output columns

Reference key column and source mapping

Column definitions

Lookup Failure Actions


If the lookup fails to find a matching key column, one of several actions can be taken:
Fail (default): the stage reports an error and the job fails immediately
Drop: the input row is dropped
Continue: the input row is transferred to the output; reference link columns are filled with null or default values
Reject: the input row is sent to a reject link (this requires that a reject link has been created for the stage)

Specifying Lookup Failure Actions


Select to return multiple rows

Select lookup failure action

Lookup Stage Behavior

Source link:
  Revolution  Citizen
  1789        Lefty
  1776        M_B_Dextrous

Reference link (lookup key column = Citizen):
  Citizen       Exchange
  M_B_Dextrous  Nasdaq
  Righty        NYSE

Output of Lookup with the Continue option (the unmatched row gets an empty string or NULL for Exchange):
  Revolution  Citizen       Exchange
  1789        Lefty
  1776        M_B_Dextrous  Nasdaq

Output of Lookup with the Drop option:
  Revolution  Citizen       Exchange
  1776        M_B_Dextrous  Nasdaq
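The same behavior can be sketched in Python (illustrative only; it reuses the slide's sample data and models the reference data as an in-memory dictionary keyed on Citizen):

# Build an in-memory table from the reference link, probe it for each
# source row, and apply a lookup-failure action.

source = [{"Revolution": "1789", "Citizen": "Lefty"},
          {"Revolution": "1776", "Citizen": "M_B_Dextrous"}]
reference = {"M_B_Dextrous": {"Exchange": "Nasdaq"},
             "Righty": {"Exchange": "NYSE"}}

def lookup(rows, ref, on_failure="continue"):
    for row in rows:
        match = ref.get(row["Citizen"])
        if match is not None:
            yield {**row, **match}
        elif on_failure == "continue":
            yield {**row, "Exchange": None}   # fill reference columns with NULL
        elif on_failure == "drop":
            continue                          # discard the unmatched source row
        # "fail" would abort the job; "reject" would send the row to a reject link

print(list(lookup(source, reference, "continue")))
print(list(lookup(source, reference, "drop")))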

Designing a Range Lookup Job

Range Lookup Job


Reference link

Lookup stage

Range on Reference Link


Reference range values

Retrieve description

Source values

Selecting the Stream Column


Source link Double-click to specify range

Reference link

Range Expression Editor


Select range columns

Select operators

Range on Stream Link


Source range

Retrieve other column values Reference link key

Specifying the Range Lookup

Reference link key column Range type

Range Expression Editor

Select range columns
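A minimal sketch of the range-lookup idea (illustrative only; the column names low, high, and band are invented, not taken from the course job):

# A source value matches a reference row when it falls between that
# row's low and high columns.

reference = [
    {"low": 0,   "high": 99,  "band": "bronze"},
    {"low": 100, "high": 499, "band": "silver"},
    {"low": 500, "high": 999, "band": "gold"},
]

def range_lookup(value):
    for row in reference:
        if row["low"] <= value <= row["high"]:
            return row["band"]
    return None            # lookup failure: handle per the chosen failure action

for amount in (42, 250, 750, 1200):
    print(amount, range_lookup(amount))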

Join Stage

Join Stage
Four types of joins: inner, left outer, right outer, full outer
A left link and a right link; supports additional intermediate links
Input links must be sorted
Light-weight: little memory required, because of the sort requirement
Join key column or columns: column names for each input link must match

Job With Join Stage

Right input link

Left input link

Join stage

Join Stage Editor

Column to match Select left input link

Select join type

Select if multiple columns make up the join key

Join Stage Behavior

Left link (primary input):
  Revolution  Citizen
  1789        Lefty
  1776        M_B_Dextrous

Right link (secondary input):
  Citizen       Exchange
  M_B_Dextrous  Nasdaq
  Righty        NYSE

Join key column: Citizen

Inner Join
Transfers rows from both data sets whose key columns have matching values
Output of inner join on key Citizen:
  Revolution  Citizen       Exchange
  1776        M_B_Dextrous  Nasdaq

Left Outer Join


Transfers all values from the left link and values from the right link only where key columns match
  Revolution  Citizen       Exchange
  1789        Lefty         (null or default value)
  1776        M_B_Dextrous  Nasdaq

Right Outer Join


Transfers all values from the right link and values from the left link only where key columns match
  Revolution                Citizen       Exchange
  1776                      M_B_Dextrous  Nasdaq
  (null or default value)   Righty        NYSE

Full Outer Join


Transfers rows from both data sets whose key columns contain equal values to the output link
It also transfers rows whose key columns contain unequal values from both input links to the output link
Treats both inputs symmetrically
Creates new columns, with new column names!
  Revolution  leftRec_Citizen  rightRec_Citizen  Exchange
  1789        Lefty
  1776        M_B_Dextrous     M_B_Dextrous      Nasdaq
  0                            Righty            NYSE
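For reference, a compact Python sketch of the four join types applied to the slide data (illustrative only; unlike the real Full Outer output it keeps a single Citizen column rather than leftRec_/rightRec_ columns):

left  = [("1789", "Lefty"), ("1776", "M_B_Dextrous")]        # (Revolution, Citizen)
right = [("M_B_Dextrous", "Nasdaq"), ("Righty", "NYSE")]     # (Citizen, Exchange)

def join(kind):
    rows = []
    left_keys = {citizen for _, citizen in left}
    for revolution, citizen in left:
        matches = [exchange for c, exchange in right if c == citizen]
        if matches:
            rows += [(revolution, citizen, exchange) for exchange in matches]
        elif kind in ("left", "full"):
            rows.append((revolution, citizen, None))      # unmatched left row
    if kind in ("right", "full"):
        rows += [(None, citizen, exchange)                # unmatched right rows
                 for citizen, exchange in right if citizen not in left_keys]
    return rows

for kind in ("inner", "left", "right", "full"):
    print(kind, join(kind))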

Merge Stage

Merge Stage

Similar to the Join stage
Follows the Master-Update model
  Master link and one or more secondary (update) links
  The master must be duplicate-free
Input links must be sorted
Light-weight: little memory required, because of the sort requirement
A master row and one or more update rows are merged if and only if they have the same value in the user-specified key column(s)
Matched update rows are consumed
Unmatched master rows can be kept or dropped
Unmatched ("bad") update rows in input link n can be captured in a "reject" link in the corresponding output link n
If a non-key column name occurs in several inputs, the lowest input port number prevails (e.g., master over update; the update values are ignored)
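A rough Python sketch of these Merge semantics with a single update link (illustrative only; the sample rows and column names are invented):

masters = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
updates = [{"id": 2, "score": 10}, {"id": 3, "score": 20}]

def merge(masters, updates, keep_unmatched_masters=True):
    updates_by_key = {u["id"]: u for u in updates}
    merged = []
    for master in masters:
        update = updates_by_key.pop(master["id"], None)
        if update is not None:
            merged.append({**update, **master})   # master values win on shared columns
        elif keep_unmatched_masters:
            merged.append(master)
    rejects = list(updates_by_key.values())       # unmatched update rows
    return merged, rejects

print(merge(masters, updates))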

Merge Stage Job

Merge stage

Capture update link non-matches

Merge Stage Properties


Match key

Keep or drop unmatched masters

The Merge Stage

(Diagram: a Master input (port 0) and one or more Update inputs (ports 1..n) flow into the Merge stage; the outputs are the merged Output plus one reject link per update input.)
Allows composite keys
Multiple update links
Matched update rows are consumed
Unmatched updates in input port n can be captured in output port n
Lightweight

Comparison: Joins, Lookup, Merge

                                  Joins                              Lookup                              Merge
Model                             RDBMS-style relational             Source - in RAM LU Table            Master - Update(s)
Memory usage                      light                              heavy                               light
# and names of inputs             2 or more: left, right             1 Source, N LU Tables               1 Master, N Update(s)
Mandatory input sort              all inputs                         no                                  all inputs
Duplicates in primary input       OK                                 OK                                  Warning!
Duplicates in secondary input(s)  OK                                 Warning!                            OK only when N = 1
Options on unmatched primary      Keep (left outer), Drop (inner)    [fail] | continue | drop | reject   [keep] | drop
Options on unmatched secondary    Keep (right outer), Drop (inner)   NONE                                capture in reject set(s)
On match, secondary entries are   captured                           captured                            consumed
# Outputs                         1                                  1 out, (1 reject)                   1 out, (N rejects)
Captured in reject set(s)         Nothing (N/A)                      unmatched primary entries           unmatched secondary entries

Funnel Stage

What is a Funnel Stage?


Combines data from multiple input links to a single output link
All sources must have identical metadata
Three modes:
  Continuous: records are combined in no particular order (first available record)
  Sort Funnel: combines the input records in the order defined by a key; produces sorted output if the input links are all sorted by the same key
  Sequence: outputs all records from the first input link, then all from the second input link, and so on
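The three modes can be sketched in Python (illustrative only; heapq.merge and itertools.chain stand in for the Sort Funnel and Sequence behavior on two small, already-sorted inputs):

import heapq
from itertools import chain

link1 = [1, 4, 7]
link2 = [2, 3, 9]

# Sequence: everything from the first link, then the second, and so on.
print(list(chain(link1, link2)))            # [1, 4, 7, 2, 3, 9]

# Sort Funnel: combine inputs in key order (inputs must be pre-sorted).
print(list(heapq.merge(link1, link2)))      # [1, 2, 3, 4, 7, 9]

# Continuous: take whichever record is available first; the ordering is
# arbitrary in a real parallel run (here we simply alternate as a stand-in).
print([r for pair in zip(link1, link2) for r in pair])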

Funnel Stage Example

Checkpoint
1. Name three stages that horizontally join data? 2. Which stage uses the least amount of memory? Join or Lookup? 3. Which stage requires that the input data is sorted? Join or Lookup?

Checkpoint solutions
1. Lookup, Merge, Join 2. Join 3. Join

Unit summary
Having completed this unit, you should be able to:
Combine data using the Lookup stage
Define range lookups
Combine data using the Merge stage
Combine data using the Join stage
Combine data using the Funnel stage

IBM InfoSphere DataStage Mod 09: Sorting and Aggregating Data

Copyright IBM Corporation 2009

Unit objectives
After completing this unit, you should be able to:
Sort data using in-stage sorts and the Sort stage
Combine data using the Aggregator stage
Combine data using the Remove Duplicates stage

Sort Stage

Sorting Data
Uses
  Some stages require sorted input
    Join and Merge stages require sorted input
  Some stages use less memory with sorted input
    E.g., Aggregator
Sorts can be done:
  Within stages
    On the input link Partitioning tab; set partitioning to anything other than Auto
  In a separate Sort stage
    Makes the sort more visible on the diagram
    Has more options

Sorting Alternatives

Sort stage

Sort within stage

In-Stage Sorting
Partitioning tab

Do sort Preserve non-key ordering Remove dups

Cant be Auto when sorting

Sort key

Sort Stage
Sort key

Sort options

Sort keys
- Add one or more keys
- Specify a sort mode for each key:
  - Sort: sort by this key
  - Don't sort (previously sorted): assumes the data has already been sorted on this key; continue sorting by any secondary keys
- Specify the sort order: ascending / descending
- Specify case sensitive or not

Sort Options
Sort Utility
- DataStage: the default
- UNIX: don't use; it is slower than the DataStage sort utility

Stable
Allow duplicates
Memory usage
- Sorting takes advantage of the available memory for increased performance
- Uses disk if necessary
- Increasing the amount of memory can improve performance

Create key change column
- Adds a column with a value of 1 or 0: 1 indicates that the key value has changed, 0 means that the key value hasn't changed
- Useful for processing groups of rows in a Transformer

Stable Sort
Input (unsorted)            Output (stable sort on Key)
Key   Col                   Key   Col
 4     X                     1     K
 3     Y                     1     A
 1     K                     2     P
 3     C                     2     L
 2     P                     3     Y
 3     D                     3     C
 1     A                     3     D
 2     L                     4     X

Rows with equal key values keep their original relative order (K before A, P before L, Y before C before D); the sketch below reproduces this.
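A tiny Python sketch, for illustration only, that reproduces the example above using Python's built-in stable sort (this is not how the Sort stage is implemented):

# Reproducing the stable-sort example above.
rows = list(zip([4, 3, 1, 3, 2, 3, 1, 2], "XYKCPDAL"))
stable = sorted(rows, key=lambda r: r[0])   # Python's sort is stable
print(stable)
# [(1, 'K'), (1, 'A'), (2, 'P'), (2, 'L'), (3, 'Y'), (3, 'C'), (3, 'D'), (4, 'X')]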

Create (Cluster) Key Change Column


Input (unsorted)            Output (sorted, with key-change column)
Key   Col                   Key   Col   K_C
 4     X                     1     K     1
 3     Y                     1     A     0
 1     K                     2     P     1
 3     C                     2     L     0
 2     P                     3     Y     1
 3     D                     3     C     0
 1     A                     3     D     0
 2     L                     4     X     1
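A small Python sketch of how such a key-change column can be derived from sorted rows; it is illustrative only (the function and output column names are made up), not the stage's implementation.

# Deriving a key-change column over sorted rows.
def with_key_change(sorted_rows, key):
    out, prev = [], object()            # sentinel that never equals a real key value
    for row in sorted_rows:
        out.append({**row, "keyChange": 1 if row[key] != prev else 0})
        prev = row[key]
    return out

rows = [{"Key": k, "Col": c} for k, c in zip([1, 1, 2, 2, 3, 3, 3, 4], "KAPLYCDX")]
# keyChange values: 1, 0, 1, 0, 1, 0, 0, 1 -- matching the table above
print(with_key_change(rows, "Key"))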

Partitioning vs. Sorting Keys

- Partitioning keys are often different from sorting keys
  - Keyed partitioning (e.g., Hash) is used to group related records into the same partition
  - Sort keys are used to establish order within each partition
- Example: partition on HouseHoldID, sort on HouseHoldID, EntryDate
  - Partitioning on HouseHoldID ensures that rows with the same ID will not be spread across multiple partitions
  - Sorting orders the records with the same ID by entry date
  - Useful for deciding which of a group of duplicate records with the same ID should be retained (see the sketch below)
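A conceptual Python sketch of hash partitioning followed by a within-partition sort; it is not the parallel engine, and the helper names are made up (the HouseHoldID and EntryDate column names come from the example above).

# Hash-partition on HouseHoldID, then sort each partition on (HouseHoldID, EntryDate).
def hash_partition(rows, n_partitions, part_key="HouseHoldID"):
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        partitions[hash(row[part_key]) % n_partitions].append(row)
    return partitions

def sort_within(partitions, sort_keys=("HouseHoldID", "EntryDate")):
    return [sorted(p, key=lambda r: tuple(r[k] for k in sort_keys))
            for p in partitions]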

Aggregator Stage

Aggregator Stage
Purpose: perform data aggregations

Specify:
- One or more key columns that define the aggregation units (groups)
- The columns to be aggregated
- The aggregation functions, which include (among many others): count (nulls/non-nulls), sum, max / min / range
- The grouping method (hash table or pre-sort), which is a performance issue

Job with Aggregator Stage

Aggregator stage

Aggregation Types
Count rows
- Counts the rows in each group
- Puts the result in a specified output column

Calculation
- Select a column; the result of the calculation is put in a specified output column
- Calculations include: sum, count, min, max, mean, missing value count, non-missing value count, percent coefficient of variation (several of these are sketched below)
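A small Python sketch of a few of these per-group calculations, for illustration only; the function and column names are made up and this is not the Aggregator's implementation.

# A few of the aggregation functions above, computed per group.
from collections import defaultdict

def aggregate(rows, group_key, value_col):
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_col])
    result = {}
    for key, values in groups.items():
        present = [v for v in values if v is not None]     # non-missing values
        result[key] = {
            "count_rows": len(values),
            "missing": len(values) - len(present),
            "non_missing": len(present),
            "sum": sum(present),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
            "mean": sum(present) / len(present) if present else None,
        }
    return result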

Count Rows Aggregator Properties

Grouping key columns

Count Rows aggregation type

Output column for the result

Calculation Type Aggregator Properties

Grouping key columns

Calculation aggregation type

Calculations with output columns Available calculations

Grouping Methods
Hash (default)
- Calculations are made for all groups and stored in memory in a hash table structure (hence the name)
- Results are written out after all input has been processed
- Input does not need to be sorted
- Useful when the number of unique groups is small: the running tally for each group's aggregations needs to fit into memory

Sort
- Requires the input data to be sorted by the grouping keys; it does not perform the sort itself, it expects the sort
- Only a single aggregation group is kept in memory; when a new group is seen, the current group is written out
- Can handle unlimited numbers of groups

(The sketch below contrasts the two methods.)
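A conceptual Python sketch of the difference between the two grouping methods, using a simple row count; it is illustrative only, not how the Aggregator is implemented.

# Hash grouping keeps a running tally for every group in memory;
# sort grouping keeps only the current group.
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def hash_group_count(rows, key):
    tally = defaultdict(int)              # one in-memory entry per distinct group
    for row in rows:
        tally[row[key]] += 1
    return dict(tally)                    # results emitted only after all input is read

def sort_group_count(sorted_rows, key):
    # the input must already be sorted on the grouping key
    for k, group in groupby(sorted_rows, key=itemgetter(key)):
        yield k, sum(1 for _ in group)    # each group is emitted as soon as it ends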

Grouping Method - Hash


[Diagram: the unsorted input rows 4X, 3Y, 1K, 3C, 2P, 3D, 1A, 2L are accumulated into an in-memory hash table with one entry per key: 1 -> K, A; 2 -> P, L; 3 -> Y, C, D; 4 -> X]

Grouping Method - Sort


[Diagram: the sorted input rows 1K, 1A, 2P, 2L, 3Y, 3C, 3D, 4X are processed one group at a time; each group (1: K, A; 2: P, L; 3: Y, C, D; 4: X) is written out as soon as the next key value arrives]

Remove Duplicates Stage

Removing Duplicates
Can be done by the Sort stage
- Use the unique option
- No choice over which duplicate to keep: a stable sort always retains the first row in the group; a non-stable sort is indeterminate

OR

Remove Duplicates stage
- Has more sophisticated ways to remove duplicates
- Can choose to retain the first or the last duplicate (see the sketch below)
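A minimal Python sketch of the retain-first / retain-last behavior, for illustration only (not the stage's implementation); it assumes the input is already sorted on the key.

# Retaining the first or last row per key.
from itertools import groupby
from operator import itemgetter

def dedupe(sorted_rows, key, retain="first"):
    for _, group in groupby(sorted_rows, key=itemgetter(key)):
        rows = list(group)
        yield rows[0] if retain == "first" else rows[-1]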

Remove Duplicates Stage Job

Remove Duplicates stage

Remove Duplicates Stage Properties


Key that defines duplicates

Retain first or last duplicate

Checkpoint
1. What stage is used to perform calculations of column values grouped in specified ways? 2. In what two ways can sorts be performed? 3. What is a stable sort?

Checkpoint solutions
1. The Aggregator stage. 2. In a separate Sort stage, or in-stage on an input link's Partitioning tab. 3. A stable sort preserves the relative order of rows that have equal key values (the order of non-key values).

Unit summary
Having completed this unit, you should be able to:
- Sort data using in-stage sorts and the Sort stage
- Combine data using the Aggregator stage
- Remove duplicates using the Remove Duplicates stage

IBM InfoSphere DataStage Mod 10: Transforming Data

Copyright IBM Corporation 2009

Unit objectives
After completing this unit, you should be able to:
- Use the Transformer stage in parallel jobs
- Define constraints
- Define derivations
- Use stage variables
- Create a parameter set and use its parameters in constraints and derivations

Transformer Stage

Transformer Stage
- Column mappings, derivations, and constraints
  - Written in BASIC; the final compiled code is C++ generated object code
- Filter data
- Direct data down different output links, for different processing or storage
- Expressions for constraints and derivations can reference:
  - Input columns
  - Job parameters
  - Functions
  - System variables and constants
  - Stage variables
  - External routines

Job With a Transformer Stage


Transformer

Single input

Reject link

Multiple outputs

Inside the Transformer Stage


Stage variables Show/Hide Input columns Constraint Derivations / Mappings Column defs Output columns

Defining a Constraint

Input column

Function Job parameter

Defining a Derivation
Input column

String in quotes

Concatenation operator (:)

IF THEN ELSE Derivation


- Use IF THEN ELSE to conditionally derive a value
- Format: IF <condition> THEN <expression1> ELSE <expression2>
  - If the condition evaluates to true, the result of expression1 is copied to the target column or stage variable
  - If the condition evaluates to false, the result of expression2 is copied to the target column or stage variable
- Example: suppose the source column is named In.OrderID and the target column is named Out.OrderID. To replace In.OrderID values of 3000 with 4000:
  IF In.OrderID = 3000 THEN 4000 ELSE In.OrderID

String Functions and Operators


- Substring operator
  - Format: <string>[loc, length]
  - Example: if In.Description contains the string "Orange Juice", then In.Description[8,5] returns "Juice"
- UpCase(<string>) / DownCase(<string>)
  - Example: UpCase(In.Description) returns "ORANGE JUICE"
- Len(<string>)
  - Example: Len(In.Description) returns 12

Checking for NULLs


- Nulls can be introduced into the data flow from lookups: mismatches (lookup failures) can produce nulls
- Nulls can be handled in constraints, derivations, stage variables, or a combination of these
- NULL functions:
  - Test for NULL: IsNull(<column>), IsNotNull(<column>)
  - Replace NULL with a value: NullToValue(<column>, <value>)
  - Set to NULL: SetNull(). Example: IF In.Col = 5 THEN SetNull() ELSE In.Col

Transformer Functions
Date & Time Logical Null Handling Number String Type Conversion

Transformer Execution Order


- Derivations in stage variables are executed first
- Constraints are executed before derivations
- Column derivations in earlier links are executed before those in later links
- Derivations in higher columns are executed before those in lower columns

Transformer Stage Variables


- Derivations execute in order from top to bottom
  - Later stage variables can reference earlier stage variables
  - Earlier stage variables can reference later stage variables, but those references return the value derived from the previous row that came into the Transformer
- Multi-purpose:
  - Counters
  - Store values from previously read rows to make comparisons with the currently read row
  - Store derived values to be used in multiple target field derivations
  - Can be used to control execution of constraints

(A small sketch of the counter / previous-row pattern follows below.)
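A conceptual Python sketch of stage variables as values that persist from row to row; it is illustrative only, not Transformer syntax, and the variable and column names are made up.

# Modelling stage variables that persist across rows.
def transform(rows, key):
    sv_prev_key = None        # "stage variable": value carried over from the previous row
    sv_count = 0              # "stage variable": running counter within the current key group
    for row in rows:
        sv_count = 1 if row[key] != sv_prev_key else sv_count + 1
        yield {**row, "rowInGroup": sv_count}
        sv_prev_key = row[key]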

Transformer Reject Links

Reject link

Convert link to a Reject link

Transformer and Null Expressions


- Within a Transformer, any expression that includes a NULL value produces a NULL result
  - 1 + NULL = NULL
  - "John" : NULL : "Doe" = NULL
- When the result of a link constraint or output derivation is NULL, the Transformer outputs that row to the reject link; if there is no reject link, the job aborts
- Standard practices:
  - Always create a reject link
  - Always test for NULL values in expressions with nullable columns: IF IsNull(link.col) THEN ... ELSE ...
  - The Transformer issues warnings for rejects

(The sketch below illustrates the NULL propagation and reject behavior.)
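A minimal Python sketch of the behavior described above, modelling NULL as None; it is illustrative only, not the Transformer itself, and the function and column names are made up.

# NULL (None) propagating through an expression, and an unhandled NULL result
# sending the row to the reject link.
def concat(*parts):
    return None if any(p is None for p in parts) else "".join(parts)

def run_transformer(rows):
    output, rejects = [], []
    for row in rows:
        full_name = concat(row["first"], " ", row["last"])   # NULL in -> NULL out
        if full_name is None:         # unhandled NULL derivation -> row is rejected
            rejects.append(row)
        else:
            output.append({"fullName": full_name})
    return output, rejects

out, rej = run_transformer([{"first": "John", "last": "Doe"},
                            {"first": "Ann", "last": None}])
# out -> [{'fullName': 'John Doe'}]; rej -> [{'first': 'Ann', 'last': None}]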

Otherwise Link

Otherwise link

Defining an Otherwise Link

Check to create otherwise condition

Can specify abort condition

Specifying Link Ordering


Link ordering toolbar icon

Last in order

Parameter Sets

Parameter Sets
- Store a collection of parameters in a named object
- One or more values files can be named and specified
  - A values file stores values for specified parameters
  - Values are picked up at runtime (sketched conceptually below)
- Parameter Sets can be added to the job parameters specified on the Parameters tab in the job properties
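A conceptual Python sketch of resolving parameter values from a named values file at run time. This is illustration only: the real values files are created and managed by DataStage, and the simple name=value layout and function names here are assumptions made for the example.

# Resolving job parameters from a values file at run time (conceptual only).
def load_values_file(path):
    values = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#"):
                name, _, value = line.partition("=")
                values[name.strip()] = value.strip()
    return values

def resolve(defaults, values_file=None):
    params = dict(defaults)                            # defaults from the parameter set
    if values_file:
        params.update(load_values_file(values_file))   # values file overrides at run time
    return params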

Creating a New Parameter Set

New Parameter Set

Parameters Tab

Specify parameters

Parameter set name is specified on General tab

Values Tab

Values file name

Values for parameters

Adding a Parameter Set to Job Properties

View Parameter Set Parameter Set Reference Add Parameter Set

Using Parameter Set Parameters

Select Parameter Set parameters

Checkpoint
1. What occurs first? Derivations or constraints? 2. Can stage variables be referenced in constraints? 3. Where should you test for NULLS within a Transformer? Stage Variable derivations or output column derivations?

Checkpoint solutions
1. Constraints. 2. Yes. 3. Stage variable derivations. Reference the stage variables in the output column derivations.

Unit summary
Having completed this unit, you should be able to:
- Use the Transformer stage in parallel jobs
- Define constraints
- Define derivations
- Use stage variables
- Create a parameter set and use its parameters in constraints and derivations

IBM InfoSphere DataStage Mod 11: Repository Functions

Copyright IBM Corporation 2009

Unit objectives
After completing this unit, you should be able to:
- Perform a simple Find
- Perform an Advanced Find
- Perform an impact analysis
- Compare the differences between two Table Definitions
- Compare the differences between two jobs

Searching the Repository

Quick Find
Name with wild card character (*)

Include matches in object descriptions

Execute Find

Found Results

Highlight next item

Number found; Click to open Advanced Find window

Found item

Advanced Find Window

Filter search

Advanced Find Filtering Options


- Type: the type of object (Job, Table Definition, etc.)
- Creation: a range of dates (e.g., up to a week ago)
- Last modification: a range of dates (e.g., up to a week ago)
- Where used: objects that use specified objects (e.g., a job that uses a specified Table Definition)
- Dependencies of: objects that are dependencies of specified objects (e.g., a Table Definition that is referenced in a specified job)
- Options: case sensitivity; search within the last result set

Using the Found Results

Create impact analysis Export to a file Draw Comparison

Impact Analysis

Performing an Impact Analysis


Find where Table Definitions are used
- Right-click over a stage or Table Definition
- Select "Find where Table Definitions used" or "Find where Table Definitions used (deep)"; "deep" includes additional object types
- Displays a list of the objects using the Table Definition

Find object dependencies
- Select "Find dependencies" or "Find dependencies (deep)"
- Displays a list of objects dependent on the one selected

Graphical functionality
- Display the dependency path
- Collapse selected objects
- Move the graphical object
- Bird's-eye view

Initiating an Impact Analysis from a Stage

