
DataStage Essentials v8.1 Core Modules

Copyright IBM Corporation 2009

Copyright, Disclaimer of Warranties and Limitation of Liability


Copyright IBM Corporation 2009 IBM Software Group One Rogers Street Cambridge, MA 02142 All rights reserved. Printed in the United States. IBM and the IBM logo are registered trademarks of International Business Machines Corporation. The following are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both: AnswersOnLine AIX APPN AS/400 BookMaster C-ISAM Client SDK Cloudscape Connection Services Database Architecture DataBlade DataJoiner DataPropagator DB2 DB2 Connect DB2 Extenders DB2 Universal Database Distributed Database Distributed Relational DPI DRDA DynamicScalableArchitecture DynamicServer DynamicServer.2000 DynamicServer with Advanced DecisionSupportOption DynamicServer with Extended ParallelOption DynamicServer with UniversalDataOption DynamicServer with WebIntegrationOption DynamicServer, WorkgroupEdition Enterprise Storage Server FFST/2 Foundation.2000 Illustra Informix Informix4GL InformixExtendedParallelServer InformixInternet Foundation.2000 Informix RedBrick Decision Server J/Foundation MaxConnect MVS MVS/ESA Net.Data NUMA-Q ON-Bar OnLineDynamicServer OS/2 OS/2 WARP OS/390 OS/400 PTX QBIC QMF RAMAC RedBrickDesign RedBrickDataMine RedBrick Decision Server RedBrickMineBuilder RedBrickDecisionscape RedBrickReady RedBrickSystems RelyonRedBrick S/390 Sequent SP System View Tivoli TME UniData UniData&Design UniversalDataWarehouseBlueprint UniversalDatabaseComponents UniversalWebConnect UniVerse VirtualTableInterface Visionary VisualAge WebIntegrationSuite WebSphere WebSphere DataStage

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Java, JDBC, and all Java-based trademarks are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. All other product or brand names may be trademarks of their respective companies. All information contained in this document has not been submitted to any formal IBM test and is distributed on an "as is" basis without any warranty either express or implied. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk. The original repository material for this course has been certified as being Year 2000 compliant. This document may not be reproduced in whole or in part without the prior written permission of IBM. Note to U.S. Government Users: Documentation related to restricted rights. Use, duplication, or disclosure is subject to restrictions set forth in the GSA ADP Schedule Contract with IBM Corp.

Course Contents
Mod 01: Introduction
Mod 02: Deployment
Mod 03: Administering DataStage
Mod 04: DataStage Designer
Mod 05: Creating Parallel Jobs
Mod 06: Accessing Sequential Data
Mod 07: Platform Architecture
Mod 08: Combining Data
Mod 09: Sorting and Aggregating Data
Mod 10: Transforming Data
Mod 11: Repository Functions
Mod 12: Working With Relational Data
Mod 13: Metadata in the Parallel Framework
Mod 14: Job Control
Appendix A: Information Server on zLinux

IBM InfoSphere DataStage Mod 01: Introduction

Copyright IBM Corporation 2009

Unit objectives
After completing this unit, you should be able to:
List and describe the uses of DataStage
List and describe the DataStage clients
Describe the DataStage workflow
List and compare the different types of DataStage jobs
Describe the two types of parallelism exhibited by DataStage parallel jobs

What is IBM InfoSphere DataStage?


Design jobs for Extraction, Transformation, and Loading (ETL)
Ideal tool for data integration projects such as data warehouses, data marts, and system migrations
Import, export, create, and manage metadata for use within jobs
Schedule, run, and monitor jobs, all within DataStage
Administer your DataStage development and execution environments
Create batch (controlling) jobs

IBM Information Server


Suite of applications, including DataStage, that:
Share a common repository (DB2, by default)
Share a common set of application services and functionality
  Provided by Metadata Server components hosted by an application server
IBM InfoSphere Application Server

Provided services include:


Security
Repository
Logging and reporting
Metadata management

Managed using web console clients


Administration console Reporting console

Information Server Backbone

Information Services Director

Business Glossary

Information Analyzer

DataStage

QualityStage

MetaBrokers Federation Server

Metadata Access Services

Metadata Analysis Services Metadata Server

Information Server console

Information Server Administration Console

DataStage Architecture

Clients

Parallel engine

Server engine Shared Repository

Engines

DataStage Administrator

DataStage Designer

DataStage parallel job

DataStage Director
Log message events

Developing in DataStage
Define global and project properties in Administrator
Import metadata into the Repository
Build the job in Designer
Compile the job in Designer
Run and monitor job log messages in Director
  Jobs can also be run in Designer, but job log messages cannot be viewed in Designer

DataStage Project Repository


User-added folder

Standard jobs folder

Standard Table Definitions folder

Types of DataStage Jobs


Server jobs
  Executed by the DataStage server engine
  Compiled into BASIC
  Runtime monitoring in DataStage Director

Parallel jobs
  Executed by the DataStage parallel engine
  Built-in functionality for pipeline and partition parallelism
  Compiled into OSH (Orchestrate Scripting Language)
  OSH executes Operators (executable C++ class instances)
  Runtime monitoring in DataStage Director

Job sequences (batch jobs, controlling jobs)
  Master Server jobs that kick off jobs and other activities
  Can kick off Server or Parallel jobs
  Executed by the Server engine
  Runtime monitoring in DataStage Director

Design Elements of Parallel Jobs


Stages
Implemented as OSH operators (pre-built components)
Passive stages (E and L of ETL)
  Read data or write data
  E.g., Sequential File, DB2, Oracle, Peek stages
Processor (active) stages (T of ETL)
  Transform, filter, aggregate, generate, or split / merge data
  E.g., Transformer, Aggregator, Join, Sort stages

Links
Pipes through which the data moves from stage to stage

Pipeline Parallelism

Transform, clean, load processes execute simultaneously Like a conveyor belt moving rows from process to process
Start downstream process while upstream process is running

Advantages:
Reduces disk usage for staging areas Keeps processors busy

Still has limits on scalability

Partition Parallelism
Divide the incoming stream of data into subsets to be separately processed by an operator
Subsets are called partitions

Each partition of data is processed by the same operator


E.g., if operation is Filter, each partition will be filtered in exactly the same way 8 times faster on 8 processors 24 times faster on 24 processors This assumes the data is evenly distributed

Facilitates near-linear scalability

Three-Node Partitioning

(Diagram: the data is divided into subset1, subset2, and subset3, each processed by the same operation on Node 1, Node 2, and Node 3.)

Here the data is partitioned into three partitions The operation is performed on each partition of data separately and in parallel If the data is evenly distributed, the data will be processed three times faster

Job Design v. Execution


User designs the flow in DataStage Designer

at runtime, this job runs in parallel for any configuration or partitions (called nodes)

Checkpoint
1. True or false: DataStage Director is used to build and compile your ETL jobs 2. True or false: Use Designer to monitor your job during execution 3. True or false: Administrator is used to set global and project properties

Checkpoint solutions
1. False. 2. False. 3. True.

Unit summary
Having completed this unit, you should be able to:
List and describe the uses of DataStage
List and describe the DataStage clients
Describe the DataStage workflow
List and compare the different types of DataStage jobs
Describe the two types of parallelism exhibited by DataStage parallel jobs

IBM InfoSphere DataStage Mod 02: Deployment

Copyright IBM Corporation 2009

Unit objectives
After completing this unit, you should be able to:
Identify the components of Information Server that need to be installed
Describe what a deployment domain consists of
Describe different domain deployment options
Describe the installation process
Start the Information Server

What Gets Deployed


An Information Server domain, consisting of the following:
  MetaData Server, hosted by an IBM InfoSphere Application Server instance
  One or more DataStage servers
DataStage server includes both the parallel and server engines

  One DB2 UDB instance containing the Repository database
Information Server clients
  Administration console
  Reporting console
  DataStage clients: Administrator, Designer, Director

Additional Information Server applications


Information Analyzer
Business Glossary
Rational Data Architect
Information Services Director
Federation Server

Deployment: Everything on One Machine


Here we have a single domain with the hosted MetaData Server backbone applications all on one machine Clients Additional Client workstations can connect to this machine using TCP/IP

DataStage server Clients

DB2 instance with Repository

Deployment: DataStage on Separate Machine


Here the domain is split between two machines
DataStage server MetaData server and DB2 Repository

MetaData Server backbone

DataStage server Clients

DB2 instance with Repository

MetaData Server and DB2 On Separate Machines


Here the domain is split between three machines
DataStage server MetaData server DB2 Repository
MetaData Server backbone

Clients DataStage server DB2 instance with Repository

Information Server Installation

Installation Configuration Layers


Configuration layers include:
Client: DataStage and Information Server clients
Engine: DataStage and other application engines
Domain: MetaData Server and hosted metadata server components; installed products' domain components
Repository: Repository database server and database
Documentation

Selected layers are installed on the machine local to the installation Already existing components can be configured and used
E.g., DB2, InfoSphere Application Server

Information Server Start-Up


Start the MetaData Server
  From the Windows Start menu, click Start the Server under the profile to be used (e.g., default)
  From the command line, open the profile bin directory and enter: startup server1
    server1 is the default name of the application server hosting the MetaData Server

Start the ASB agent
  From the Windows Start menu, click Start the agent after selecting the Information Server folder
  Only required if DataStage and the MetaData Server are on different machines
To begin work in DataStage, double-click on a DataStage client icon
To begin work in the Administration and Reporting consoles, double-click on the Web Console for Information Server icon

Starting the MetaData Server Backbone

Application Server Profiles folder

Profile Profile

Start the Server Profile\bin directory

Startup command

Starting the ASB Agent

Start the agent

Checkpoint
1. What application components make up a domain? 2. Can a domain contain multiple DataStage servers? 3. Do the DB2 instance and the Repository database need to be on the same machine as the Application Server? 4. Suppose DataStage is on a separate machine from the Application Server. What two components need to be running before you can log onto DataStage?

Checkpoint solutions
1. Metadata server hosted by the Application Server. One or more DataStage servers. One DB2/UDB instance containing the Suite Repository database. 2. Yes. The DataStage servers must be on separate machines. 3. No. The DB2 instance with the Repository can reside on a separate machine from the Application Server. 4. The Application Server and the ASB agent.

Unit summary
Having completed this unit, you should be able to:
Identify the components of Information Server that need to be installed
Describe what a deployment domain consists of
Describe different domain deployment options
Describe the installation process
Start the Information Server

IBM InfoSphere DataStage Mod 03: Administering DataStage

Copyright IBM Corporation 2009

Unit objectives
After completing this unit, you should be able to:
Open the Administration console
Create new users and groups
Assign Suite roles and Product roles to users and groups
Give users DataStage credentials
Log onto DataStage Administrator
Add a DataStage user on the Permissions tab and specify the user's role
Specify DataStage global and project defaults
List and describe important environment variables

Information Server Administration Web Console


Web application for administering Information Server Use for:
Domain management
Session management
Management of users and groups
Logging management
Scheduling management

Opening the Administration Web Console

Information Server console address

Information Server administrator ID

Users and Group Management

User and Group Management


Suite authorizations can be provided to users or groups
  Authorizations are provided in the form of roles
  Users that are members of a group acquire the authorizations of the group

Two types of roles

Suite roles: apply to the Suite
  Administrator: perform user and group management tasks; includes all the privileges of the Suite User role
  User: create views of scheduled tasks and logged messages; create and run reports

Suite Component roles: apply to a specific product or component of Information Server, e.g., DataStage
  DataStage user: permissions are assigned within DataStage (Developer, Operator, Super Operator, Production Manager)
  DataStage administrator: full permissions to work in DataStage Administrator, Designer, and Director
  And so on, for all products in the Suite

Creating a DataStage User ID


Administration console

Create new user

Users

Assigning DataStage Roles


Assign Suite Administrator role

User ID

Users

Assign Suite User role

Assign DataStage User role

Assign DataStage Administrator role

DataStage Credential Mapping


DataStage credentials
Required by DataStage in order to log onto a DataStage client Managed by the DataStage Server operating system or LDAP

Users given DataStage Suite roles in the Suite Administration console do not automatically receive DataStage credentials
Users need to be mapped to a user who has DataStage credentials on the DataStage Server machine This DataStage user must have file access permission to the DataStage engine/project files or Administrator rights on the operating system

Three ways of mapping users (IS ID to OS ID):
  Many to 1
  1 to 1
  Combination of the above

DataStage Credential Mapping

Map DataStage User credential

DataStage Administrator

Logging onto DataStage Administrator


Host name, port number of application server

DataStage administrator ID and password Name or IP address of DataStage server machine

DataStage Administrator Projects Tab

Server projects Click to specify project properties

Link to Information Server Administration console

DataStage Administrator General Tab

Enable runtime column propagation (RCP) by default

Specify auto purge

Environment variable settings

Environment Variables
Parallel job variables

User-defined variables

Environment Reporting Variables

Display Score

Display record counts

Display OSH

DataStage Administrator Permissions Tab

Assigned product role

Add DataStage users

Adding Users and Groups

Available users / groups. Must have a DataStage product User role.

Add DataStage users

Specify DataStage Role

Added DataStage user

Select DataStage role

DataStage Administrator Tracing Tab

DataStage Administrator Parallel Tab

DataStage Administrator Sequence Tab

Checkpoint
1. Authorizations can be assigned to what two items? 2. What two types of authorization roles can be assigned to a user or group? 3. In addition to Suite authorization to log onto DataStage what else does a DataStage developer require to work in DataStage?

Checkpoint solutions
1. Users and groups. Members of a group acquire the authorizations of the group. 2. Suite roles and Component roles. 3. Must be mapped to a user with DataStage credentials.

Unit summary
Having completed this unit, you should be able to:
Open the Administration console
Create new users and groups
Assign Suite roles and Product roles to users and groups
Give users DataStage credentials
Log onto DataStage Administrator
Add a DataStage user on the Permissions tab and specify the user's role
Specify DataStage global and project defaults
List and describe important environment variables

IBM InfoSphere DataStage Mod 04: DataStage Designer

Copyright IBM Corporation 2009

Unit objectives
After completing this unit, you should be able to:
Log onto DataStage
Navigate around DataStage Designer
Import and export DataStage objects to a file
Import a Table Definition for a sequential file

Logging onto DataStage Designer


Host name, port number of application server

DataStage server machine/project

Designer Work Area


Repository Menus Toolbar

Parallel canvas

Palette

Importing and Exporting DataStage Objects

Repository Window
Search for objects in the project

Project Default jobs folder

Default Table Definitions folder

Import and Export


Any object in the repository window can be exported to a file
Can export whole projects
Use for backup
Sometimes used for version control
Can be used to move DataStage objects from one project to another
Use to share DataStage jobs and projects with other developers

Export Procedure
Click Export>DataStage Components
Select DataStage objects for export
Specify the type of export:
  DSX: default format
  XML: enables processing of the export file by XML applications, e.g., for generating reports

Specify file path on client machine

Export Window

Click to select objects from the repository

Selected objects

Export type

Export location on client system Begin export

Import Procedure
Click Import>DataStage Components
Or Import>DataStage Components (XML) if you are importing an XML-format export file

Select DataStage objects for import

Import Options
Import all objects in the file

Display list to select from

Import/Export Command Line Interface


Command istool Located \IBM\InformationServer\Clients\istools\clients
Documentation in same location

Invoke through Programs > IBM Information Server > Information Server Command Line Interface
Examples:
  istool hostname domain:9080 username dsuser password inf0server deployment-package importfile.dsx import
  istool hostname domain:9080 username dsuser password inf0server job GenDataJob filename exportfile.dsx export

Information Server Manager

Information Server Manager


Create and deploy packages of DataStage objects Perform DataStage object exports / imports
Similar to DataStage Designer exports / imports Export files are stored as archives (extension .isx)

Provides a domain-wide view of the DataStage Repository


Displays all DataStage servers E.g., a Development server and a Production server Displays all DataStage projects on each server

Before you can use it you must add a domain


Right-click over Repository, click Add Domain Select domain Specify user ID / password

Invoke through Information Server start menu:


IBM Information Server Information Server Manager

Adding a Domain

Information Server domain. May contain multiple DataStage systems

Information Server Manager Window

Created export archive DataStage Repository. Shows all DataStage servers and projects

Information Server Manager Package View


Multiple packages Multiple builds within each package Build history Domain Packages

Builds

Objects

Importing Table Definitions

Importing Table Definitions


Table Definitions describe the format and columns of files and tables You can import Table Definitions for:
Sequential files Relational tables COBOL files Many other things

Table Definitions can be loaded into job stages

Sequential File Import Procedure


Click Import>Table Definitions>Sequential File Definitions
Select the directory containing the sequential file, and then the file
Select a repository folder to store the Table Definition in
Examine the format and column definitions and edit as necessary

Importing Sequential Metadata


Import Table Definition for a sequential file

Sequential Import Window


Select directory containing files

Start import Select file

Select repository folder

Specify Format
Edit columns Select if first row has column names

Delimiter

Edit Column Names and Types


Double-click to define extended properties

Extended Properties window

Parallel properties

Property categories

Available properties

Table Definition General Tab

Type

Source

Stored Table Definition

Checkpoint
1. True or False? The directory to which you export is on the DataStage client machine, not on the DataStage server machine.
2. Can you import Table Definitions for sequential files with fixed-length record formats?

Checkpoint solutions
1. True. 2. Yes. Record lengths are determined by the lengths of the individual columns.

Unit summary
Having completed this unit, you should be able to:
Log onto DataStage
Navigate around DataStage Designer
Import and export DataStage objects to a file
Import a Table Definition for a sequential file

IBM InfoSphere DataStage Mod 05: Creating Parallel Jobs

Copyright IBM Corporation 2009

Unit objectives
After completing this unit, you should be able to:
Design a simple Parallel job in Designer
Define a job parameter
Use the Row Generator, Peek, and Annotation stages in a job
Compile your job
Run your job in Director
View the job log

What Is a Parallel Job?


Executable DataStage program Created in DataStage Designer
Can use components from Repository

Built using a graphical user interface Compiles into Orchestrate script language (OSH) and object code (from generated C++)

Job Development Overview


Import metadata defining sources and targets
Done within Designer using import process

In Designer, add stages defining data extractions and loads
Add processing stages to define data transformations
Add links defining the flow of data from sources to targets
Compile the job
In Director, validate, run, and monitor your job
  Can also run the job in Designer
  Can only view the job log in Director

Tools Palette
Stage categories

Stages

Adding Stages and Links


Drag stages from the Tools Palette to the diagram
Can also be dragged from Stage Type branch to the diagram

Draw links from source to target stage


Right mouse over source stage Release mouse button over target stage

Job Creation Example Sequence


Brief walkthrough of procedure Assumes Table Definition of source already exists in the repository

Create New Parallel Job

Open New window Parallel job

Drag Stages and Links From Palette


Compile

Row Generator Job properties

Peek

Renaming Links and Stages


Click on a stage or link to rename it Meaningful names have many benefits
Documentation Clarity Fewer development errors

Row Generator Stage
Produces mock data for specified columns
No input link; single output link
On the Properties tab, specify the number of rows
On the Columns tab, load or specify column definitions
  Open the Extended Properties window to specify the values to be generated for a column
  A number of algorithms for generating values are available, depending on the data type

Algorithms for Integer type


Random: seed, limit Cycle: Initial value, increment

Algorithms for string type: Cycle , alphabet Algorithms for date type: Random, cycle

Inside the Row Generator Stage


Properties tab

Set property value

Property

Columns Tab
Double-click to specify extended properties

View data

Select Table Definition

Load a Table Definition

Extended Properties

Specified properties and their values

Additional properties to add

Peek Stage
Displays field values
Displayed in the job log or sent to a file
Skip records option
Can control the number of records to be displayed
Shows data in each partition, labeled 0, 1, 2, ...

Useful stub stage for iterative job development


Develop job to a stopping point and check the data

Peek Stage Properties

Output to job log

Job Parameters
Defined in the Job Properties window
Makes the job more flexible
Parameters can be:
  Used in directory and file names
  Used to specify property values
  Used in constraints and derivations
Parameter values are determined at run time
When used for directory and file names and property values, they are surrounded with pound signs (#)
  E.g., #NumRows#

Job parameters can reference DataStage environment variables


Prefaced by $, e.g., $APT_CONFIG_FILE

Defining a Job Parameter


Parameters tab

Parameter Add environment variable

Using a Job Parameter in a Stage

Job parameter

Click to insert Job parameter

Adding Job Documentation


Job Properties
Short and long descriptions

Annotation stage
Added from the Tools Palette Displays formatted text descriptions on diagram

Job Properties Documentation

Documentation

Annotation Stage Properties

Compiling a Job

Run Compile

Errors or Successful Message

Highlight stage with error

Click for more info

Running Jobs and Viewing the Job Log in Designer

DataStage Director
Use to run and schedule jobs View runtime messages Can invoke directly from Designer
Tools > Run Director

Run Options

Assign values to parameter

Run Options

Stop after number of warnings

Stop after number of rows

Director Status View


Status view Schedule view Log view

Select job to view messages in the log view

Director Log View


Click the notepad icon to view log messages

Peek messages

Message Details

Other Director Functions


Schedule job to run on a particular date/time Clear job log of messages Set job log purging conditions Set Director options
Row limits (server job only) Abort after x warnings

Running Jobs from Command Line


dsjob -run -param numrows=10 dx444 GenDataJob
  Runs a job
  Use -run to run the job
  Use -param to specify parameters
  In this example, dx444 is the name of the project
  In this example, GenDataJob is the name of the job

dsjob -logsum dx444 GenDataJob
  Displays a job's messages in the log

Documented in the Parallel Job Advanced Developer's Guide

Checkpoint
1. Which stage can be used to display output data in the job log? 2. Which stage is used for documenting your job on the job canvas? 3. What command is used to run jobs from the operating system command line?

Checkpoint solutions
1. Peek stage 2. Annotation stage 3. dsjob -run

Unit summary
Having completed this unit, you should be able to:
Design a simple Parallel job in Designer
Define a job parameter
Use the Row Generator, Peek, and Annotation stages in a job
Compile your job
Run your job in Director
View the job log

IBM InfoSphere DataStage Mod 06: Accessing Sequential Data

Copyright IBM Corporation 2009

Unit objectives
After completing this unit, you should be able to:
Understand the stages for accessing different kinds of file data
  Sequential File stage
  Data Set stage
Create jobs that read from and write to sequential files
Create Reject links
Work with NULLs in sequential files
Read from multiple files using file patterns
Use multiple readers

Types of File Data


Sequential
Fixed or variable length

Data Set Complex flat file

How Sequential Data is Handled


Import and export operators are generated
Stages get translated into operators during the compile

Import operators convert data from the external format, as described by the Table Definition, to the framework internal format
Internally, the format of data is described by schemas

Export operators reverse the process Messages in the job log use the import / export terminology
E.g., 100 records imported successfully; 2 rejected E.g., 100 records exported successfully; 0 rejected Records get rejected when they cannot be converted correctly during the import or export

Using the Sequential File Stage


Both import and export of sequential files (text, binary) can be performed by the SequentialFile Stage.

Importing/Exporting Data

Data import:

Internal format

Data export

Internal format

Features of Sequential File Stage


Normally executes in sequential mode Executes in parallel when reading multiple files Can use multiple readers within a node
Reads chunks of a single file in parallel

The stage needs to be told:


How file is divided into rows (record format) How row is divided into columns (column format)

Sequential File Format Example


(Diagram: a record whose fields are separated by a field delimiter (comma) and terminated by a record delimiter (nl).)
With Final Delimiter = end:    Field1,Field2,Field3,Last field<nl>
With Final Delimiter = comma:  Field1,Field2,Field3,Last field,<nl>

Sequential File Stage Rules


One input link
One stream output link
Optionally, one reject link
  Will reject any records not matching the metadata in the column definitions
  Example: you specify three columns separated by commas, but the row that's read has no commas in it
  Example: the second column is designated a decimal in the Table Definition, but contains alphabetic characters

Job Design Using Sequential Stages

Stream link

Reject link (broken line)

Input Sequential Stage Properties


Output tab File to access

Column names in first row

Click to add more files to read

Format Tab

Record format

Load from Table Definition

Column format

Sequential Source Columns Tab

View data Load from Table Definition Save as a new Table Definition

Reading Using a File Pattern

Use wild cards

Select File Pattern

Properties - Multiple Readers

Multiple readers option allows you to set number of readers per node

Sequential Stage As a Target


Input Tab

Append / Overwrite

Reject Link
Reject mode =
  Continue: continue reading records
  Fail: abort the job
  Output: send down the output link
In a source stage: all records not matching the metadata (column definitions) are rejected
In a target stage: all records that fail to be written for any reason are rejected
Rejected records consist of one column, of data type raw


Reject mode property

Inside the Copy Stage

Column mappings

Reading and Writing NULL Values to a Sequential File

Working with NULLs


Internally, NULL is represented by a special value outside the range of any existing, legitimate values
If NULL is written to a non-nullable column, the job will abort
Columns can be specified as nullable
  NULLs can be written to nullable columns
You must handle NULLs written to nullable columns in a Sequential File stage
  You need to tell DataStage what value to write to the file
  Unhandled rows are rejected
In a Sequential File source stage, you can specify values you want DataStage to convert to NULLs

Specifying a Value for NULL

Nullable column

Added property

DataSet Stage

Data Set
Binary data file; preserves partitioning
  Component dataset files are written to each partition
Suffixed by .ds
Referred to by a header file
Managed by the Data Set Management utility from the GUI (Manager, Designer, Director)
Represents persistent data
Key to good performance in a set of linked jobs
  No import / export conversions are needed
  No repartitioning needed
Accessed using the DataSet stage
Implemented with two types of components:
  Descriptor file: contains metadata and the data location, but NOT the data itself
  Data file(s): contain the data; multiple files, one per partition (node)

Job With DataSet Stage


DataSet stage

DataSet stage properties

Data Set Management Utility


Display schema Display data

Display record counts for each data file

Data and Schema Displayed


Data viewer

Schema describing the format of the data

File Set Stage



Use to read and write to file sets
Files suffixed by .fs
File sets are similar to datasets
  Partitioned
  Implemented with a header file and data files
How file sets differ from datasets
  Data files are text files, hence readable by external applications
  Datasets have a proprietary data format which may change in future DataStage versions

Checkpoint
1. List three types of file data. 2. What makes datasets perform better than other types of files in parallel jobs? 3. What is the difference between a data set and a file set?

Checkpoint solutions
1. Sequential, dataset, & complex flat files. 2. They are partitioned and they store data in the native parallel format. 3. Both are partitioned. Data sets store data in a binary format not readable by user applications. File sets are readable.

Unit summary
Having completed this unit, you should be able to:
Understand the stages for accessing different kinds of file data
  Sequential File stage
  Data Set stage
Create jobs that read from and write to sequential files
Create Reject links
Work with NULLs in sequential files
Read from multiple files using file patterns
Use multiple readers

IBM InfoSphere DataStage Mod 07: Platform Architecture

Copyright IBM Corporation 2009

Unit objectives
After completing this unit, you should be able to:
Describe the parallel processing architecture
Describe pipeline parallelism
Describe partition parallelism
List and describe partitioning and collecting algorithms
Describe configuration files
Describe the parallel job compilation process
Explain OSH
Explain the Score

Key Parallel Job Concepts


Parallel processing:
Executing the job on multiple CPUs

Scalable processing:
Add more resources (CPUs and disks) to increase system performance

Example system: 6 CPUs (processing nodes) and disks Scale up by adding more CPUs Add CPUs as individual nodes or to an SMP system

Scalable Hardware Environments

Single CPU: dedicated memory and disk
SMP: multiple CPUs (2-64+), shared memory and disk
GRID / clusters: multiple multi-CPU systems, dedicated memory per node, typically SAN-based shared storage
MPP: multiple nodes with dedicated memory and storage, 2 to 1000s of CPUs

Pipeline Parallelism

Transform, clean, load processes execute simultaneously Like a conveyor belt moving rows from process to process
Start downstream process while upstream process is running

Advantages:
Reduces disk usage for staging areas Keeps processors busy

Still has limits on scalability
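The conveyor-belt idea can be illustrated outside DataStage. The following minimal Python sketch (illustrative only, not DataStage internals; the stage names and sample rows are invented) chains three generator "stages" so the downstream load step consumes each row as soon as the upstream steps produce it, with no intermediate staging file:

# Minimal sketch of pipeline parallelism using Python generators:
# each "stage" starts consuming rows as soon as the upstream stage
# yields them, instead of waiting for a complete staging file.

def extract():
    for i in range(10):          # pretend these rows come from a source file
        yield {"id": i, "amount": i * 10}

def transform(rows):
    for row in rows:             # processes each row as it arrives
        row["amount"] *= 1.1
        yield row

def load(rows):
    for row in rows:             # "loads" rows while extract/transform still run
        print(row)

load(transform(extract()))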

Partition Parallelism
Divide the incoming stream of data into subsets to be separately processed by an operation
Subsets are called partitions (nodes)

Each partition of data is processed by the same operation


E.g., if operation is Filter, each partition will be filtered in exactly the same way

Facilitates near-linear scalability
  8 times faster on 8 processors
  24 times faster on 24 processors
  This assumes the data is evenly distributed

Three-Node Partitioning

(Diagram: the data is divided into subset1, subset2, and subset3, each processed by the same operation on Node 1, Node 2, and Node 3.)

Here the data is partitioned into three partitions The operation is performed on each partition of data separately and in parallel If the data is evenly distributed, the data will be processed three times faster

Parallel Jobs Combine Partitioning and Pipelining

Within parallel jobs, pipelining, partitioning, and repartitioning are automatic
The job developer only identifies:
  Sequential vs. parallel mode (by stage); the default for most stages is parallel mode
  The method of data partitioning: how to distribute the data across the available nodes
  The method of data collecting: how to collect the data from the nodes into a single node
  The configuration file: specifies the nodes and resources

Job Design v. Execution


User assembles the flow using DataStage Designer

at runtime, this job runs in parallel for any configuration (1 node, 4 nodes, N nodes)

No need to modify or recompile the job design!

Configuration File
Configuration file separates configuration (hardware / software) from job design
Specified per job at runtime by $APT_CONFIG_FILE Change hardware and resources without changing job design

Defines nodes with their resources (need not match physical CPUs)
Dataset, Scratch, Buffer disk (file systems) Optional resources (Database, SAS, etc.) Advanced resource optimizations
Pools (named subsets of nodes)

Multiple configuration files can be used for different occasions of job execution


Optimizes overall throughput and matches job characteristics to overall hardware resources Allows runtime constraints on resource usage on a per job basis

Example Configuration File


{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {"bigdata"}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource disk "/orch/n2/d2" {"bigdata"}
    resource scratchdisk "/temp" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1" {}
    resource scratchdisk "/temp" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1" {}
    resource scratchdisk "/temp" {}
  }
}

Key points:
  1. Number of nodes defined
  2. Resources assigned to each node; their order is significant
  3. Advanced resource optimizations and configuration (named pools, database, SAS)

Partitioning and Collecting

Partitioning and Collecting


Partitioning breaks incoming rows into multiple streams of rows (one for each node) Each partition of rows is processed separately by the stage/operator Collecting returns partitioned data back to a single stream Partitioning / Collecting is specified on stage input links

Partitioning / Collecting Algorithms


Partitioning algorithms include:
  Round robin
  Random
  Hash: determine the partition based on key value (requires key specification)
  Modulus
  Entire: send all rows down all partitions
  Same: preserve the same partitioning
  Auto: let DataStage choose the algorithm

Collecting algorithms include:
  Auto: collect the first available record
  Round robin
  Sort Merge: read in by key; presumes the data is sorted by the key in each partition; builds a single sorted stream based on the key
  Ordered: read all records from the first partition, then the second, ...

Keyless V. Keyed Partitioning Algorithms


Keyless: Rows are distributed independently of data values
Round Robin Random Entire Same

Keyed: Rows are distributed based on values in the specified key


Hash: partition based on key value
  Example: the key is State. All "CA" rows go into the same partition; all "MA" rows go into the same partition. Two rows with the same state never go into different partitions
Modulus: partition based on the modulus of the key value by the number of partitions; the key must be a numeric type
  Example: the key is OrderNumber (numeric type). Rows with the same order number will all go into the same partition
DB2: matches DB2 EEE partitioning
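As a rough illustration of the difference between keyless and keyed assignment (a Python sketch only, not DataStage's partitioner operators; the sample rows and three-partition setup are invented):

# Illustrative sketch of keyless round robin versus keyed hash and
# modulus partitioning across three partitions.

NUM_PARTITIONS = 3
rows = [{"order": 101, "state": "CA"},
        {"order": 102, "state": "MA"},
        {"order": 103, "state": "CA"},
        {"order": 104, "state": "NY"},
        {"order": 105, "state": "MA"}]

# Keyless: round robin deals rows out in turn, regardless of their values.
partitions = [[] for _ in range(NUM_PARTITIONS)]
for i, row in enumerate(rows):
    partitions[i % NUM_PARTITIONS].append(row)

# Keyed: hash sends every row with the same key value to the same partition.
def hash_partition(key_value, n=NUM_PARTITIONS):
    return hash(key_value) % n

# Keyed: modulus does the same for a single integer key, without hashing.
def modulus_partition(int_key, n=NUM_PARTITIONS):
    return int_key % n

for row in rows:
    print(row, hash_partition(row["state"]), modulus_partition(row["order"]))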

Round Robin and Random Partitioning Keyless


Keyless partitioning methods
Rows are evenly distributed across partitions
  Good for initial import of data if no other partitioning is needed
  Useful for redistributing data
  Fairly low overhead
Round Robin assigns rows to partitions like dealing cards
  Row/partition assignment will be the same for a given $APT_CONFIG_FILE
  (Diagram: rows 0-8 dealt round robin across three partitions: 0,3,6 / 1,4,7 / 2,5,8)
Random has slightly higher overhead, but assigns rows in a non-deterministic fashion between job runs

ENTIRE Partitioning
Each partition gets a complete copy of the data
  Useful for distributing lookup and reference data
  May have a performance impact in MPP / clustered environments
  On SMP platforms, the Lookup stage (only) uses shared memory instead of duplicating ENTIRE reference data
  On MPP platforms, each server uses shared memory for a single local copy
ENTIRE is the default partitioning for Lookup reference links with Auto partitioning
  On SMP platforms, it is a good practice to set this explicitly on the Normal Lookup reference link(s)
(Diagram: keyless; rows 0, 1, 2, 3, ... are copied in full to every partition)

HASH Partitioning
Keyed partitioning method
Rows are distributed according to the values in the key columns
  Guarantees that rows with the same key values go into the same partition
  Needed to prevent matching rows from hiding in other partitions (e.g., Join, Merge, Remove Duplicates, ...)
Partition distribution is relatively equal if the data across the source key column(s) is evenly distributed
(Diagram: key values 0 3 2 1 0 2 3 2 1 1 hashed into partitions 0,3,0,3 / 1,1,1 / 2,2,2)

Modulus Partitioning
Keyed partitioning method
Rows are distributed according to the values in one integer key column
  Uses the modulus: partition = MOD(key_value, number of partitions)
  E.g., with 3 partitions, a row whose key value is 7 goes to partition MOD(7, 3) = 1
Faster than HASH
Guarantees that rows with identical key values go into the same partition
Partition size is relatively equal if the data within the key column is evenly distributed
(Diagram: key values 0 3 2 1 0 2 3 2 1 1 assigned by modulus to partitions 0,3,0,3 / 1,1,1 / 2,2,2)

Auto Partitioning
DataStage inserts partition components as necessary to ensure correct results
  Before any stage with Auto partitioning
  Generally chooses ROUND ROBIN or SAME
  Inserts HASH on stages that require matched key values (e.g., Join, Merge, Remove Duplicates)
  Inserts ENTIRE on Normal (not Sparse) Lookup reference links
  NOT always appropriate for MPP/clusters
Since DataStage has limited awareness of your data and business rules, explicitly specify HASH partitioning when needed
  DataStage has no visibility into Transformer logic
  Hash is required before Sort and Aggregator stages
  DataStage sometimes inserts un-needed partitioning; check the log

Partitioning Requirements for Related Records


Misplaced records
  Using the Aggregator stage to sum customer sales by customer number
  If there are 25 customers, 25 records should be output
  But suppose records with the same customer numbers are spread across partitions
    This will produce more than 25 groups (records)
  Solution: use the hash partitioning algorithm (see the sketch below)
Partition imbalances
  If all the records are going down only one of the nodes, then the job is in effect running sequentially
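A small Python sketch of the misplaced-records problem described above (illustrative only; the customer data and the two-partition setup are invented):

# Per-partition sums are only correct per customer if every row for that
# customer lands in the same partition, which hash partitioning on the
# customer key guarantees.

from collections import defaultdict

rows = [("cust1", 10), ("cust1", 5), ("cust2", 20), ("cust2", 15)]

def aggregate(partition):
    totals = defaultdict(int)
    for customer, amount in partition:
        totals[customer] += amount
    return dict(totals)

# Round robin: each customer's rows are split across both partitions,
# so the job emits four groups instead of two.
p0, p1 = rows[0::2], rows[1::2]
print(aggregate(p0), aggregate(p1))

# Hash on the customer key: each customer stays in one partition,
# so the job emits exactly one group per customer.
h0 = [r for r in rows if hash(r[0]) % 2 == 0]
h1 = [r for r in rows if hash(r[0]) % 2 == 1]
print(aggregate(h0), aggregate(h1))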

Unequal Distribution Example


Same key values are assigned to the same partition.

Source data (hash on LName, with a 2-node config file):
  ID  LName  FName    Address
  1   Ford   Henry    66 Edison Avenue
  2   Ford   Clara    66 Edison Avenue
  3   Ford   Edsel    7900 Jefferson
  4   Ford   Eleanor  7900 Jefferson
  5   Dodge  Horace   17840 Jefferson
  6   Dodge  John     75 Boston Boulevard
  7   Ford   Henry    4901 Evergreen
  8   Ford   Clara    4901 Evergreen
  9   Ford   Edsel    1100 Lakeshore
  10  Ford   Eleanor  1100 Lakeshore

Partition 0 (2 rows):
  5   Dodge  Horace   17840 Jefferson
  6   Dodge  John     75 Boston Boulevard

Partition 1 (8 rows):
  1   Ford   Henry    66 Edison Avenue
  2   Ford   Clara    66 Edison Avenue
  3   Ford   Edsel    7900 Jefferson
  4   Ford   Eleanor  7900 Jefferson
  7   Ford   Henry    4901 Evergreen
  8   Ford   Clara    4901 Evergreen
  9   Ford   Edsel    1100 Lakeshore
  10  Ford   Eleanor  1100 Lakeshore

Partitioning / Collecting Link Icons

Partitioning icon

Collecting icon

More Partitioning Icons


fan-out Sequential
to Parallel

SAME partitioner

Re-partition
watch for this!

AUTO partitioner

Partitioning Tab
Input tab Key specification

Algorithms

Collecting Specification

Key specification

Algorithms

Runtime Architecture

Parallel Job Compilation


What gets generated:
  OSH: a kind of script
    OSH represents the design data flow and stages; stages become OSH operators
  A Transform operator for each Transformer stage
    A custom operator built during the compile
    Compiled into C++ and then into corresponding native operators
    Thus a C++ compiler is needed to compile jobs with a Transformer stage
(Diagram: Designer client -> Compile -> DataStage server -> Executable job, including the Transformer components)

Generated OSH
Enable viewing of generated OSH in Administrator:

Comments Operator Schema

OSH is visible in:

- Job properties - Job run log - View Data - Table Defs

Stage to Operator Mapping Examples


Sequential File
  Source: import
  Target: export
DataSet: copy
Sort: tsort
Aggregator: group
Row Generator, Column Generator, Surrogate Key Generator: generator
Oracle
  Source: oraread
  Sparse Lookup: oralookup
  Target Load: orawrite
  Target Upsert: oraupsert

Generated OSH Primer


Comment blocks introduce each operator
  Operator order is determined by the order stages were added to the canvas
For each operator:
  Operator name
  Schema
  Operator options (-name value format)
  Input (indicated by n< where n is the input #)
  Output (indicated by n> where n is the output #); may include modify

Generated OSH for first 2 stages

Syntax

####################################################
#### STAGE: Row_Generator_0
## Operator
generator
## Operator options
-schema record
(
  a:int32;
  b:string[max=12];
  c:nullable decimal[10,2] {nulls=10};
)
-records 50000
## General options
[ident('Row_Generator_0'); jobmon_ident('Row_Generator_0')]
## Outputs
0> [] 'Row_Generator_0:lnk_gen.v'
;
####################################################
#### STAGE: SortSt
## Operator
tsort
## Operator options
-key 'a' -asc
## General options
[ident('SortSt'); jobmon_ident('SortSt'); par]
## Inputs
0< 'Row_Generator_0:lnk_gen.v'
## Outputs
0> [modify (
  keep a,b,c;
)] 'SortSt:lnk_sorted.v'
;

For every operator, input and/or output datasets are numbered sequentially starting from 0. E.g.: Virtual datasets are generated to connect operators
op1 0> dst op1 1< src

Virtual dataset is used to connect output of one operator to input of another

Job Score
Generated from the OSH and the configuration file
  Think of the Score as a musical score, not a game score
Assigns nodes for each operator
Inserts sorts and partitioners as needed
Defines the connection topology (virtual datasets) between adjacent operators
Inserts buffer operators to prevent deadlocks
Defines the actual processes

Where possible, multiple operators are combined into a single process

Viewing the Job Score


Score message in job log
Set $APT_DUMP_SCORE to output the Score to the job log
To identify the Score dump, look for "main program: This step ..."
  You don't see the word "Score" anywhere in the message

Virtual datasets

Operators with node assignments

Job Execution: The Orchestra


Conductor Node
  Conductor: the initial process
    Composes the Score
    Creates Section Leader processes (one per node)
    Consolidates messages to the DataStage log
    Manages orderly shutdown
Processing Nodes
  Section Leader (one per node)
    Forks Player processes (one per stage)
    Manages up/down communication
  Players
    The actual processes associated with stages
    Combined players: one process only
    Send stderr and stdout to the Section Leader
    Establish connections to other players for data flow
    Clean up upon completion

Runtime Control and Data Networks


(Diagram: the Conductor communicates with Section Leaders 0, 1, and 2 through a control channel (TCP) and stdout/stderr channels (pipes) via the APT_Communicator; each Section Leader manages its generator and copy players.)

$ osh generator -schema record(a:int32) [par] | roundrobin | copy

Checkpoint
1. What file defines the degree of parallelism a job runs under? 2. Name two partitioning algorithms that partition based on key values? 3. What partitioning algorithm produces the most even distribution of data in the various partitions?

Checkpoint solutions
1. Configuration file. 2. Hash, Modulus, DB2. 3. Round robin or Entire.

Unit summary
Having completed this unit, you should be able to:
Describe the parallel processing architecture
Describe pipeline parallelism
Describe partition parallelism
List and describe partitioning and collecting algorithms
Describe configuration files
Describe the parallel job compilation process
Explain OSH
Explain the Score

IBM InfoSphere DataStage Mod 08: Combining Data

Copyright IBM Corporation 2009

Unit objectives
After completing this unit, you should be able to:
Combine data using the Lookup stage
Define range lookups
Combine data using the Merge stage
Combine data using the Join stage
Combine data using the Funnel stage

Combining Data
Ways to combine data:
Horizontally
  Multiple input links
  One output link made of columns from different input links
  Join, Lookup, and Merge stages
Vertically
  One input link, one output link combining groups of related records into a single record
  Aggregator and Remove Duplicates stages
Funneling
  Multiple input streams funneled into a single output stream
  Funnel stage

Lookup, Merge, Join Stages


These stages combine two or more input links
Data is combined by designated "key" column(s)

These stages differ mainly in:


Memory usage Treatment of rows with unmatched key values Input requirements (sorted, de-duplicated)

Lookup Stage

Lookup Features
One stream input link (source link)
Multiple reference links
One output link
Optionally, one reject link
  Only available when the lookup failure option is set to Reject
Lookup failure options: Continue, Drop, Fail, Reject
Can return multiple matching rows
A hash table is built in memory from the lookup files
  Indexed by key
  Should be small enough to fit into physical memory

Lookup Types
Equality match
  Match values in the lookup key column of the reference link exactly to selected values in the source row
  Return the row or rows (if multiple matches are to be returned) that match
Caseless match
  Like an equality match, except that it is caseless
  E.g., "abc" matches "AbC"
Range on the reference link
  Two columns on the reference link define the range
  A match occurs when a selected value in the source row is within the range
Range on the source link
  Two columns on the source link define the range
  A match occurs when a selected value in the reference link is within the range

Lookup Example

Reference link Source (stream) link

Lookup Stage With an Equality Match


Source link columns Lookup constraints

Mappings to output columns

Reference key column and source mapping

Column definitions

Lookup Failure Actions


If the lookup fails to find a matching key column, one of several actions can be taken:
Fail (default): the stage reports an error and the job fails immediately
Drop: the input row is dropped
Continue: the input row is transferred to the output; reference link columns are filled with null or default values
Reject: the input row is sent to a reject link (this requires that a reject link has been created for the stage)

Specifying Lookup Failure Actions


Select to return multiple rows

Select lookup failure action

Lookup Stage Behavior

Source link:
  Revolution  Citizen
  1789        Lefty
  1776        M_B_Dextrous

Reference link (lookup key column = Citizen):
  Citizen       Exchange
  M_B_Dextrous  Nasdaq
  Righty        NYSE

Output of Lookup with the Continue option (the unmatched row gets an empty string or NULL for Exchange):
  Revolution  Citizen       Exchange
  1789        Lefty
  1776        M_B_Dextrous  Nasdaq

Output of Lookup with the Drop option:
  Revolution  Citizen       Exchange
  1776        M_B_Dextrous  Nasdaq
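The same behavior can be sketched in Python (illustrative only; it reuses the slide's sample data and models the reference data as an in-memory dictionary keyed on Citizen):

# Build an in-memory table from the reference link, probe it for each
# source row, and apply a lookup-failure action.

source = [{"Revolution": "1789", "Citizen": "Lefty"},
          {"Revolution": "1776", "Citizen": "M_B_Dextrous"}]
reference = {"M_B_Dextrous": {"Exchange": "Nasdaq"},
             "Righty": {"Exchange": "NYSE"}}

def lookup(rows, ref, on_failure="continue"):
    for row in rows:
        match = ref.get(row["Citizen"])
        if match is not None:
            yield {**row, **match}
        elif on_failure == "continue":
            yield {**row, "Exchange": None}   # fill reference columns with NULL
        elif on_failure == "drop":
            continue                          # discard the unmatched source row
        # "fail" would abort the job; "reject" would send the row to a reject link

print(list(lookup(source, reference, "continue")))
print(list(lookup(source, reference, "drop")))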

Designing a Range Lookup Job

Range Lookup Job


Reference link

Lookup stage

Range on Reference Link


Reference range values

Retrieve description

Source values

Selecting the Stream Column


Source link Double-click to specify range

Reference link

Range Expression Editor


Select range columns

Select operators

Range on Stream Link


Source range

Retrieve other column values Reference link key

Specifying the Range Lookup

Reference link key column Range type

Range Expression Editor

Select range columns
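A minimal sketch of the range-lookup idea (illustrative only; the column names low, high, and band are invented, not taken from the course job):

# A source value matches a reference row when it falls between that
# row's low and high columns.

reference = [
    {"low": 0,   "high": 99,  "band": "bronze"},
    {"low": 100, "high": 499, "band": "silver"},
    {"low": 500, "high": 999, "band": "gold"},
]

def range_lookup(value):
    for row in reference:
        if row["low"] <= value <= row["high"]:
            return row["band"]
    return None            # lookup failure: handle per the chosen failure action

for amount in (42, 250, 750, 1200):
    print(amount, range_lookup(amount))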

Join Stage

Join Stage
Four types of joins: inner, left outer, right outer, full outer
A left link and a right link; supports additional intermediate links
Input links must be sorted
Light-weight: little memory required, because of the sort requirement
Join key column or columns: column names for each input link must match

Job With Join Stage

Right input link

Left input link

Join stage

Join Stage Editor

Column to match Select left input link

Select join type

Select if multiple columns make up the join key

Join Stage Behavior

Left link (primary input):
  Revolution  Citizen
  1789        Lefty
  1776        M_B_Dextrous

Right link (secondary input):
  Citizen       Exchange
  M_B_Dextrous  Nasdaq
  Righty        NYSE

Join key column: Citizen

Inner Join
Transfers rows from both data sets whose key columns have matching values
Output of inner join on key Citizen:
  Revolution  Citizen       Exchange
  1776        M_B_Dextrous  Nasdaq

Left Outer Join


Transfers all values from the left link and values from the right link only where key columns match
  Revolution  Citizen       Exchange
  1789        Lefty         (null or default value)
  1776        M_B_Dextrous  Nasdaq

Right Outer Join


Transfers all values from the right link and values from the left link only where key columns match
  Revolution                Citizen       Exchange
  1776                      M_B_Dextrous  Nasdaq
  (null or default value)   Righty        NYSE

Full Outer Join


Transfers rows from both data sets whose key columns contain equal values to the output link
It also transfers rows whose key columns contain unequal values from both input links to the output link
Treats both inputs symmetrically
Creates new columns, with new column names!
  Revolution  leftRec_Citizen  rightRec_Citizen  Exchange
  1789        Lefty
  1776        M_B_Dextrous     M_B_Dextrous      Nasdaq
  0                            Righty            NYSE
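For reference, a compact Python sketch of the four join types applied to the slide data (illustrative only; unlike the real Full Outer output it keeps a single Citizen column rather than leftRec_/rightRec_ columns):

left  = [("1789", "Lefty"), ("1776", "M_B_Dextrous")]        # (Revolution, Citizen)
right = [("M_B_Dextrous", "Nasdaq"), ("Righty", "NYSE")]     # (Citizen, Exchange)

def join(kind):
    rows = []
    left_keys = {citizen for _, citizen in left}
    for revolution, citizen in left:
        matches = [exchange for c, exchange in right if c == citizen]
        if matches:
            rows += [(revolution, citizen, exchange) for exchange in matches]
        elif kind in ("left", "full"):
            rows.append((revolution, citizen, None))      # unmatched left row
    if kind in ("right", "full"):
        rows += [(None, citizen, exchange)                # unmatched right rows
                 for citizen, exchange in right if citizen not in left_keys]
    return rows

for kind in ("inner", "left", "right", "full"):
    print(kind, join(kind))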

Merge Stage

Merge Stage

Similar to the Join stage
Follows the Master-Update model
  Master link and one or more secondary (update) links
  The master must be duplicate-free
Input links must be sorted
Light-weight: little memory required, because of the sort requirement
A master row and one or more update rows are merged if and only if they have the same value in the user-specified key column(s)
Matched update rows are consumed
Unmatched master rows can be kept or dropped
Unmatched ("bad") update rows in input link n can be captured in a "reject" link in the corresponding output link n
If a non-key column name occurs in several inputs, the lowest input port number prevails (e.g., master over update; the update values are ignored)
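A rough Python sketch of these Merge semantics with a single update link (illustrative only; the sample rows and column names are invented):

masters = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
updates = [{"id": 2, "score": 10}, {"id": 3, "score": 20}]

def merge(masters, updates, keep_unmatched_masters=True):
    updates_by_key = {u["id"]: u for u in updates}
    merged = []
    for master in masters:
        update = updates_by_key.pop(master["id"], None)
        if update is not None:
            merged.append({**update, **master})   # master values win on shared columns
        elif keep_unmatched_masters:
            merged.append(master)
    rejects = list(updates_by_key.values())       # unmatched update rows
    return merged, rejects

print(merge(masters, updates))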

Merge Stage Job

Merge stage

Capture update link non-matches

Merge Stage Properties


Match key

Keep or drop unmatched masters

The Merge Stage

(Diagram: a Master input (port 0) and one or more Update inputs (ports 1..n) flow into the Merge stage; the outputs are the merged Output plus one reject link per update input.)
Allows composite keys
Multiple update links
Matched update rows are consumed
Unmatched updates in input port n can be captured in output port n
Lightweight

Comparison: Joins, Lookup, Merge

                                  Joins                              Lookup                              Merge
Model                             RDBMS-style relational             Source - in RAM LU Table            Master - Update(s)
Memory usage                      light                              heavy                               light
# and names of inputs             2 or more: left, right             1 Source, N LU Tables               1 Master, N Update(s)
Mandatory input sort              all inputs                         no                                  all inputs
Duplicates in primary input       OK                                 OK                                  Warning!
Duplicates in secondary input(s)  OK                                 Warning!                            OK only when N = 1
Options on unmatched primary      Keep (left outer), Drop (inner)    [fail] | continue | drop | reject   [keep] | drop
Options on unmatched secondary    Keep (right outer), Drop (inner)   NONE                                capture in reject set(s)
On match, secondary entries are   captured                           captured                            consumed
# Outputs                         1                                  1 out, (1 reject)                   1 out, (N rejects)
Captured in reject set(s)         Nothing (N/A)                      unmatched primary entries           unmatched secondary entries

Funnel Stage

What is a Funnel Stage?


Combines data from multiple input links to a single output link
All sources must have identical metadata
Three modes:
  Continuous: records are combined in no particular order (first available record)
  Sort Funnel: combines the input records in the order defined by a key; produces sorted output if the input links are all sorted by the same key
  Sequence: outputs all records from the first input link, then all from the second input link, and so on
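The three modes can be sketched in Python (illustrative only; heapq.merge and itertools.chain stand in for the Sort Funnel and Sequence behavior on two small, already-sorted inputs):

import heapq
from itertools import chain

link1 = [1, 4, 7]
link2 = [2, 3, 9]

# Sequence: everything from the first link, then the second, and so on.
print(list(chain(link1, link2)))            # [1, 4, 7, 2, 3, 9]

# Sort Funnel: combine inputs in key order (inputs must be pre-sorted).
print(list(heapq.merge(link1, link2)))      # [1, 2, 3, 4, 7, 9]

# Continuous: take whichever record is available first; the ordering is
# arbitrary in a real parallel run (here we simply alternate as a stand-in).
print([r for pair in zip(link1, link2) for r in pair])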

Funnel Stage Example

Checkpoint
1. Name three stages that horizontally join data? 2. Which stage uses the least amount of memory? Join or Lookup? 3. Which stage requires that the input data is sorted? Join or Lookup?

Checkpoint solutions
1. Lookup, Merge, Join 2. Join 3. Join

Unit summary
Having completed this unit, you should be able to:
Combine data using the Lookup stage
Define range lookups
Combine data using the Merge stage
Combine data using the Join stage
Combine data using the Funnel stage

IBM InfoSphere DataStage Mod 09: Sorting and Aggregating Data

Copyright IBM Corporation 2009

Unit objectives
After completing this unit, you should be able to:
Sort data using in-stage sorts and the Sort stage
Combine data using the Aggregator stage
Combine data using the Remove Duplicates stage

Sort Stage

Sorting Data
Uses
  Some stages require sorted input
    Join and Merge stages require sorted input
  Some stages use less memory with sorted input
    E.g., Aggregator
Sorts can be done:
  Within stages
    On the input link Partitioning tab; set partitioning to anything other than Auto
  In a separate Sort stage
    Makes the sort more visible on the diagram
    Has more options

Sorting Alternatives

Sort stage

Sort within stage

In-Stage Sorting
Partitioning tab

Do sort Preserve non-key ordering Remove dups

Cant be Auto when sorting

Sort key

Sort Stage
Sort key

Sort options

Sort keys
- Add one or more keys
- Specify a sort mode for each key:
  - Sort: sort by this key
  - Don't sort (previously sorted): assumes the data has already been sorted on this key; continue sorting by any secondary keys
- Specify the sort order: ascending / descending
- Specify case sensitive or not

Sort Options
Sort Utility
- DataStage: the default
- UNIX: don't use; it is slower than the DataStage sort utility

Stable
Allow duplicates
Memory usage
- Sorting takes advantage of the available memory for increased performance
- Uses disk if necessary
- Increasing the amount of memory can improve performance

Create key change column
- Adds a column with a value of 1 or 0: 1 indicates that the key value has changed, 0 means that the key value hasn't changed
- Useful for processing groups of rows in a Transformer

Stable Sort
Input (unsorted)            Output (stable sort on Key)
Key   Col                   Key   Col
 4     X                     1     K
 3     Y                     1     A
 1     K                     2     P
 3     C                     2     L
 2     P                     3     Y
 3     D                     3     C
 1     A                     3     D
 2     L                     4     X

Rows with equal key values keep their original relative order (K before A, P before L, Y before C before D); the sketch below reproduces this.
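A tiny Python sketch, for illustration only, that reproduces the example above using Python's built-in stable sort (this is not how the Sort stage is implemented):

# Reproducing the stable-sort example above.
rows = list(zip([4, 3, 1, 3, 2, 3, 1, 2], "XYKCPDAL"))
stable = sorted(rows, key=lambda r: r[0])   # Python's sort is stable
print(stable)
# [(1, 'K'), (1, 'A'), (2, 'P'), (2, 'L'), (3, 'Y'), (3, 'C'), (3, 'D'), (4, 'X')]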

Create (Cluster) Key Change Column


Input (unsorted)            Output (sorted, with key-change column)
Key   Col                   Key   Col   K_C
 4     X                     1     K     1
 3     Y                     1     A     0
 1     K                     2     P     1
 3     C                     2     L     0
 2     P                     3     Y     1
 3     D                     3     C     0
 1     A                     3     D     0
 2     L                     4     X     1
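A small Python sketch of how such a key-change column can be derived from sorted rows; it is illustrative only (the function and output column names are made up), not the stage's implementation.

# Deriving a key-change column over sorted rows.
def with_key_change(sorted_rows, key):
    out, prev = [], object()            # sentinel that never equals a real key value
    for row in sorted_rows:
        out.append({**row, "keyChange": 1 if row[key] != prev else 0})
        prev = row[key]
    return out

rows = [{"Key": k, "Col": c} for k, c in zip([1, 1, 2, 2, 3, 3, 3, 4], "KAPLYCDX")]
# keyChange values: 1, 0, 1, 0, 1, 0, 0, 1 -- matching the table above
print(with_key_change(rows, "Key"))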

Partitioning vs. Sorting Keys

- Partitioning keys are often different from sorting keys
  - Keyed partitioning (e.g., Hash) is used to group related records into the same partition
  - Sort keys are used to establish order within each partition
- Example: partition on HouseHoldID, sort on HouseHoldID, EntryDate
  - Partitioning on HouseHoldID ensures that rows with the same ID will not be spread across multiple partitions
  - Sorting orders the records with the same ID by entry date
  - Useful for deciding which of a group of duplicate records with the same ID should be retained (see the sketch below)
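A conceptual Python sketch of hash partitioning followed by a within-partition sort; it is not the parallel engine, and the helper names are made up (the HouseHoldID and EntryDate column names come from the example above).

# Hash-partition on HouseHoldID, then sort each partition on (HouseHoldID, EntryDate).
def hash_partition(rows, n_partitions, part_key="HouseHoldID"):
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        partitions[hash(row[part_key]) % n_partitions].append(row)
    return partitions

def sort_within(partitions, sort_keys=("HouseHoldID", "EntryDate")):
    return [sorted(p, key=lambda r: tuple(r[k] for k in sort_keys))
            for p in partitions]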

Aggregator Stage

Aggregator Stage
Purpose: perform data aggregations

Specify:
- One or more key columns that define the aggregation units (groups)
- The columns to be aggregated
- The aggregation functions, which include (among many others): count (nulls/non-nulls), sum, max / min / range
- The grouping method (hash table or pre-sort), which is a performance issue

Job with Aggregator Stage

Aggregator stage

Aggregation Types
Count rows
- Counts the rows in each group
- Puts the result in a specified output column

Calculation
- Select a column; the result of the calculation is put in a specified output column
- Calculations include: sum, count, min, max, mean, missing value count, non-missing value count, percent coefficient of variation (several of these are sketched below)
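A small Python sketch of a few of these per-group calculations, for illustration only; the function and column names are made up and this is not the Aggregator's implementation.

# A few of the aggregation functions above, computed per group.
from collections import defaultdict

def aggregate(rows, group_key, value_col):
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_col])
    result = {}
    for key, values in groups.items():
        present = [v for v in values if v is not None]     # non-missing values
        result[key] = {
            "count_rows": len(values),
            "missing": len(values) - len(present),
            "non_missing": len(present),
            "sum": sum(present),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
            "mean": sum(present) / len(present) if present else None,
        }
    return result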

Count Rows Aggregator Properties

Grouping key columns

Count Rows aggregation type

Output column for the result

Calculation Type Aggregator Properties

Grouping key columns

Calculation aggregation type

Calculations with output columns Available calculations

Grouping Methods
Hash (default)
- Calculations are made for all groups and stored in memory in a hash table structure (hence the name)
- Results are written out after all input has been processed
- Input does not need to be sorted
- Useful when the number of unique groups is small: the running tally for each group's aggregations needs to fit into memory

Sort
- Requires the input data to be sorted by the grouping keys; it does not perform the sort itself, it expects the sort
- Only a single aggregation group is kept in memory; when a new group is seen, the current group is written out
- Can handle unlimited numbers of groups

(The sketch below contrasts the two methods.)
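A conceptual Python sketch of the difference between the two grouping methods, using a simple row count; it is illustrative only, not how the Aggregator is implemented.

# Hash grouping keeps a running tally for every group in memory;
# sort grouping keeps only the current group.
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def hash_group_count(rows, key):
    tally = defaultdict(int)              # one in-memory entry per distinct group
    for row in rows:
        tally[row[key]] += 1
    return dict(tally)                    # results emitted only after all input is read

def sort_group_count(sorted_rows, key):
    # the input must already be sorted on the grouping key
    for k, group in groupby(sorted_rows, key=itemgetter(key)):
        yield k, sum(1 for _ in group)    # each group is emitted as soon as it ends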

Grouping Method - Hash


[Diagram: the unsorted input rows 4X, 3Y, 1K, 3C, 2P, 3D, 1A, 2L are accumulated into an in-memory hash table with one entry per key: 1 -> K, A; 2 -> P, L; 3 -> Y, C, D; 4 -> X]

Grouping Method - Sort


[Diagram: the sorted input rows 1K, 1A, 2P, 2L, 3Y, 3C, 3D, 4X are processed one group at a time; each group (1: K, A; 2: P, L; 3: Y, C, D; 4: X) is written out as soon as the next key value arrives]

Remove Duplicates Stage

Removing Duplicates
Can be done by the Sort stage
- Use the unique option
- No choice over which duplicate to keep: a stable sort always retains the first row in the group; a non-stable sort is indeterminate

OR

Remove Duplicates stage
- Has more sophisticated ways to remove duplicates
- Can choose to retain the first or the last duplicate (see the sketch below)
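A minimal Python sketch of the retain-first / retain-last behavior, for illustration only (not the stage's implementation); it assumes the input is already sorted on the key.

# Retaining the first or last row per key.
from itertools import groupby
from operator import itemgetter

def dedupe(sorted_rows, key, retain="first"):
    for _, group in groupby(sorted_rows, key=itemgetter(key)):
        rows = list(group)
        yield rows[0] if retain == "first" else rows[-1]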

Remove Duplicates Stage Job

Remove Duplicates stage

Remove Duplicates Stage Properties


Key that defines duplicates

Retain first or last duplicate

Checkpoint
1. What stage is used to perform calculations of column values grouped in specified ways? 2. In what two ways can sorts be performed? 3. What is a stable sort?

Checkpoint solutions
1. The Aggregator stage. 2. In a separate Sort stage, or in-stage on an input link's Partitioning tab. 3. A stable sort preserves the relative order of rows that have equal key values (the order of non-key values).

Unit summary
Having completed this unit, you should be able to:
- Sort data using in-stage sorts and the Sort stage
- Combine data using the Aggregator stage
- Remove duplicates using the Remove Duplicates stage

IBM InfoSphere DataStage Mod 10: Transforming Data

Copyright IBM Corporation 2009

Unit objectives
After completing this unit, you should be able to:
- Use the Transformer stage in parallel jobs
- Define constraints
- Define derivations
- Use stage variables
- Create a parameter set and use its parameters in constraints and derivations

Transformer Stage

Transformer Stage
- Column mappings, derivations, and constraints
  - Written in BASIC; the final compiled code is C++ generated object code
- Filter data
- Direct data down different output links, for different processing or storage
- Expressions for constraints and derivations can reference:
  - Input columns
  - Job parameters
  - Functions
  - System variables and constants
  - Stage variables
  - External routines

Job With a Transformer Stage


Transformer

Single input

Reject link

Multiple outputs

Inside the Transformer Stage


Stage variables Show/Hide Input columns Constraint Derivations / Mappings Column defs Output columns

Defining a Constraint

Input column

Function Job parameter

Defining a Derivation
Input column

String in quotes

Concatenation operator (:)

IF THEN ELSE Derivation


- Use IF THEN ELSE to conditionally derive a value
- Format: IF <condition> THEN <expression1> ELSE <expression2>
  - If the condition evaluates to true, the result of expression1 is copied to the target column or stage variable
  - If the condition evaluates to false, the result of expression2 is copied to the target column or stage variable
- Example: suppose the source column is named In.OrderID and the target column is named Out.OrderID. To replace In.OrderID values of 3000 with 4000:
  IF In.OrderID = 3000 THEN 4000 ELSE In.OrderID

String Functions and Operators


- Substring operator
  - Format: <string>[loc, length]
  - Example: if In.Description contains the string "Orange Juice", then In.Description[8,5] returns "Juice"
- UpCase(<string>) / DownCase(<string>)
  - Example: UpCase(In.Description) returns "ORANGE JUICE"
- Len(<string>)
  - Example: Len(In.Description) returns 12

Checking for NULLs


- Nulls can be introduced into the data flow from lookups: mismatches (lookup failures) can produce nulls
- Nulls can be handled in constraints, derivations, stage variables, or a combination of these
- NULL functions:
  - Test for NULL: IsNull(<column>), IsNotNull(<column>)
  - Replace NULL with a value: NullToValue(<column>, <value>)
  - Set to NULL: SetNull(). Example: IF In.Col = 5 THEN SetNull() ELSE In.Col

Transformer Functions
Date & Time Logical Null Handling Number String Type Conversion

Transformer Execution Order


- Derivations in stage variables are executed first
- Constraints are executed before derivations
- Column derivations in earlier links are executed before those in later links
- Derivations in higher columns are executed before those in lower columns

Transformer Stage Variables


- Derivations execute in order from top to bottom
  - Later stage variables can reference earlier stage variables
  - Earlier stage variables can reference later stage variables, but those references return the value derived from the previous row that came into the Transformer
- Multi-purpose:
  - Counters
  - Store values from previously read rows to make comparisons with the currently read row
  - Store derived values to be used in multiple target field derivations
  - Can be used to control execution of constraints

(A small sketch of the counter / previous-row pattern follows below.)
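A conceptual Python sketch of stage variables as values that persist from row to row; it is illustrative only, not Transformer syntax, and the variable and column names are made up.

# Modelling stage variables that persist across rows.
def transform(rows, key):
    sv_prev_key = None        # "stage variable": value carried over from the previous row
    sv_count = 0              # "stage variable": running counter within the current key group
    for row in rows:
        sv_count = 1 if row[key] != sv_prev_key else sv_count + 1
        yield {**row, "rowInGroup": sv_count}
        sv_prev_key = row[key]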

Transformer Reject Links

Reject link

Convert link to a Reject link

Transformer and Null Expressions


- Within a Transformer, any expression that includes a NULL value produces a NULL result
  - 1 + NULL = NULL
  - "John" : NULL : "Doe" = NULL
- When the result of a link constraint or output derivation is NULL, the Transformer outputs that row to the reject link; if there is no reject link, the job aborts
- Standard practices:
  - Always create a reject link
  - Always test for NULL values in expressions with nullable columns: IF IsNull(link.col) THEN ... ELSE ...
  - The Transformer issues warnings for rejects

(The sketch below illustrates the NULL propagation and reject behavior.)
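A minimal Python sketch of the behavior described above, modelling NULL as None; it is illustrative only, not the Transformer itself, and the function and column names are made up.

# NULL (None) propagating through an expression, and an unhandled NULL result
# sending the row to the reject link.
def concat(*parts):
    return None if any(p is None for p in parts) else "".join(parts)

def run_transformer(rows):
    output, rejects = [], []
    for row in rows:
        full_name = concat(row["first"], " ", row["last"])   # NULL in -> NULL out
        if full_name is None:         # unhandled NULL derivation -> row is rejected
            rejects.append(row)
        else:
            output.append({"fullName": full_name})
    return output, rejects

out, rej = run_transformer([{"first": "John", "last": "Doe"},
                            {"first": "Ann", "last": None}])
# out -> [{'fullName': 'John Doe'}]; rej -> [{'first': 'Ann', 'last': None}]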

Otherwise Link

Otherwise link

Defining an Otherwise Link

Check to create otherwise condition

Can specify abort condition

Specifying Link Ordering


Link ordering toolbar icon

Last in order

Parameter Sets

Parameter Sets
- Store a collection of parameters in a named object
- One or more values files can be named and specified
  - A values file stores values for specified parameters
  - Values are picked up at runtime (sketched conceptually below)
- Parameter Sets can be added to the job parameters specified on the Parameters tab in the job properties
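A conceptual Python sketch of resolving parameter values from a named values file at run time. This is illustration only: the real values files are created and managed by DataStage, and the simple name=value layout and function names here are assumptions made for the example.

# Resolving job parameters from a values file at run time (conceptual only).
def load_values_file(path):
    values = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#"):
                name, _, value = line.partition("=")
                values[name.strip()] = value.strip()
    return values

def resolve(defaults, values_file=None):
    params = dict(defaults)                            # defaults from the parameter set
    if values_file:
        params.update(load_values_file(values_file))   # values file overrides at run time
    return params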

Creating a New Parameter Set

New Parameter Set

Parameters Tab

Specify parameters

Parameter set name is specified on General tab

Values Tab

Values file name

Values for parameters

Adding a Parameter Set to Job Properties

View Parameter Set Parameter Set Reference Add Parameter Set

Using Parameter Set Parameters

Select Parameter Set parameters

Checkpoint
1. What occurs first? Derivations or constraints? 2. Can stage variables be referenced in constraints? 3. Where should you test for NULLS within a Transformer? Stage Variable derivations or output column derivations?

Checkpoint solutions
1. Constraints. 2. Yes. 3. Stage variable derivations. Reference the stage variables in the output column derivations.

Unit summary
Having completed this unit, you should be able to:
- Use the Transformer stage in parallel jobs
- Define constraints
- Define derivations
- Use stage variables
- Create a parameter set and use its parameters in constraints and derivations

IBM InfoSphere DataStage Mod 11: Repository Functions

Copyright IBM Corporation 2009

Unit objectives
After completing this unit, you should be able to:
- Perform a simple Find
- Perform an Advanced Find
- Perform an impact analysis
- Compare the differences between two Table Definitions
- Compare the differences between two jobs

Searching the Repository

Quick Find
Name with wild card character (*)

Include matches in object descriptions

Execute Find

Found Results

Highlight next item

Number found; Click to open Advanced Find window

Found item

Advanced Find Window

Filter search

Advanced Find Filtering Options


- Type: the type of object (Job, Table Definition, etc.)
- Creation: a range of dates (e.g., up to a week ago)
- Last modification: a range of dates (e.g., up to a week ago)
- Where used: objects that use specified objects (e.g., a job that uses a specified Table Definition)
- Dependencies of: objects that are dependencies of specified objects (e.g., a Table Definition that is referenced in a specified job)
- Options: case sensitivity; search within the last result set

Using the Found Results

Create impact analysis Export to a file Draw Comparison

Impact Analysis

Performing an Impact Analysis


Find where Table Definitions are used
- Right-click over a stage or Table Definition
- Select "Find where Table Definitions used" or "Find where Table Definitions used (deep)"; "deep" includes additional object types
- Displays a list of the objects using the Table Definition

Find object dependencies
- Select "Find dependencies" or "Find dependencies (deep)"
- Displays a list of objects dependent on the one selected

Graphical functionality
- Display the dependency path
- Collapse selected objects
- Move the graphical object
- Bird's-eye view

Initiating an Impact Analysis from a Stage

