
IBM WebSphere DataStage

Introduction To Enterprise Edition








Course Contents















Module 01: Introduction
Module 02: Setting Up Your DataStage Environment
Module 03: Creating Parallel Jobs
Module 04: Accessing Sequential Data
Module 05: Platform Architecture
Module 06: Combining Data
Module 07: Sorting and Aggregating Data
Module 08: Transforming Data
Module 09: Standards and Techniques





Module 10: Accessing Relational Data
Module 11: Compilation and Execution
Module 12: Testing and Debugging
Module 13: Metadata in Enterprise Edition
Module 14: Job Control




































Course Objectives
DataStage Clients and Server
Setting up the parallel environment
Importing metadata
Building DataStage jobs
Loading metadata into job stages
Accessing Sequential data
Accessing Relational data
Introducing the Parallel framework architecture
Transforming data
Sorting and aggregating data
Merging data
Configuration files
Creating job sequences





































IBM WebSphere DataStage
Module 01: Introduction







What is IBM WebSphere DataStage?



Design jobs for Extraction, Transformation, and Loading (ETL)

Ideal tool for data integration projects such as data warehouses, data marts,
and system migrations

Import, export, create, and manage metadata for use within jobs

Schedule, run, and monitor jobs all within DataStage

Administer your DataStage development and execution environments

Create batch (controlling) jobs















































DataStage Server and Clients
Windows or Unix Server
Microsoft Windows







































Client Logon






DataStage Administrator







































DataStage Manager






DataStage Designer






DataStage Director




Developing in DataStage



Define global and project properties in Administrator

Import metadata into the Repository
Manager
Designer Repository View

Build job in Designer

Compile job in Designer

Run and monitor job in Director













































DataStage Projects





































DataStage Jobs
Parallel jobs
Executed under control of DataStage Server runtime environment
Built-in functionality for Pipeline and Partitioning Parallelism
Compiled into OSH (Orchestrate Scripting Language)
OSH executes Operators
Executable C++ class instances
Runtime monitoring in DataStage Director
Job Sequences (Batch jobs, Controlling jobs)
Master Server jobs that kick-off jobs and other activities
Can kick-off Server or Parallel jobs
Runtime monitoring in DataStage Director
Server jobs (Requires Server Edition license)
Executed by the DataStage Server Edition
Compiled into Basic (interpreted pseudo-code)
Runtime monitoring in DataStage Director
Mainframe jobs (Requires Mainframe Edition license)
Compiled into COBOL
Executed on the Mainframe, outside of DataStage









































Design Elements of Parallel Jobs
Stages
Implemented as OSH operators (pre-built components)
Passive stages (E and L of ETL)



Read data
Write data
E.g., Sequential File, Oracle, Peek stages
Processor (active) stages (T of ETL)






Transform data
Filter data
Aggregate data
Generate data
Split / Merge data
E.g., Transformer, Aggregator, Join, Sort stages
Links
Pipes through which the data moves from stage to stage






































Quiz - True or False?







DataStage Designer is used to build and compile your ETL jobs

Manager is used to execute your jobs after you build them

Director is used to execute your jobs after you build them

Administrator is used to set global and project properties






































Introduction to the Lab Exercises
Two types of exercises in this course:
Conceptual exercises
Designed to reinforce a specific module's topics
Provide hands-on experiences with DataStage
Introduced by the word Concept
E.g., Conceptual Lab 01A
Solution Development exercises
Based on production applications
Provide development examples
Introduced by the word Solution
E.g., Solution Lab 05A
The Solution Development exercises are introduced and discussed in a later
module






Lab Exercises
Conceptual Lab 01A
Install DataStage clients
Test connection to the DataStage Server
Install lab files





































IBM WebSphere DataStage
Module 02: Setting Up Your DataStage Environment






Module Objectives







Setting project properties in Administrator

Defining Environment Variables

Importing / Exporting DataStage objects in Manager

Importing Table Definitions defining sources and targets in Manager






































Setting Project Properties





































Project Properties
Projects can be created and deleted in Administrator
Each project is associated with a directory on the DataStage Server

Project properties, defaults, and environment variables are specified
in Administrator
Can be overridden at the job level









Setting Project Properties
To set project properties, log onto Administrator, select your project,
and then click Properties








































Project Properties General Tab







































Environment Variables






Permissions Tab







































Tracing Tab







































Parallel Tab






Sequence Tab




Importing and Exporting DataStage Objects




What Is Metadata?
Diagram: data flows from a Source through a Transform stage to a Target; the metadata describing each of them is stored in the Metadata Repository.








































DataStage Manager





Manager Contents
Metadata
Describing sources and targets: Table definitions
Describing inputs / outputs from external routines
Describing inputs and outputs to BuildOp and CustomOp stages

DataStage objects
Jobs
Routines
Compiled jobs / objects
Stages






Import and Export











Any object in Manager can be exported to a file

Can export whole projects

Use for backup

Sometimes used for version control

Can be used to move DataStage objects from one project to another

Use to share DataStage jobs and projects with other developers







































Export






Procedure

In Manager, click Export>DataStage Components

Select DataStage objects for export

Specify type of export:
DSX: Default format
XML: Enables processing of export file by XML applications, e.g., for
generating reports

Specify file path on client machine







































Quiz - True or False?
You can export DataStage objects such as jobs, but you can't export
metadata, such as field definitions of a sequential file.







































Quiz - True or False?
The directory to which you export is on the DataStage client machine,
not on the DataStage server machine.









































Exporting DataStage Objects









































Select Objects for Export









































Options Tab
Select by folder or
individual object







































Import Procedure
In Manager, click Import>DataStage Components
Or Import>DataStage Components (XML) if you are importing an XML-
format export file

Select DataStage objects for import








Importing DataStage Objects







Import Options





Importing Metadata







































Metadata Import









Import format and column definitions from sequential files

Import relational table column definitions

Imported as Table Definitions

Table definitions can be loaded into job stages

Table definitions can be used to define Routine and Stage interfaces







































Sequential File Import Procedure







In Manager, click Import>Table Definitions>Sequential File Definitions

Select directory containing sequential file and then the file

Select Manager category

Examine the format and column definitions and edit as necessary









































Importing Sequential Metadata






Sequential Import Window







Specify Format









































Specify Column Names and Types
Double-click to define
extended properties







Extended Properties window
Property
categories
Available
properties







Table Definition General Tab
Second level
category
Top level
category









































Table Definition Columns Tab









































Table Definition Parallel Tab









































Table Definition Format Tab







































Lab Exercises
Conceptual Lab 02A
Set up your DataStage environment

Conceptual Lab 02B
Import a sequential file Table Definition













































IBM WebSphere DataStage
Module 03: Creating Parallel Jobs




Module Objectives







Design a simple Parallel job in Designer

Compile your job

Run your job in Director

View the job log





Creating Parallel Jobs







































What Is a Parallel Job?



Executable DataStage program

Created in DataStage Designer
Can use components from Manager Repository

Built using a graphical user interface

Compiles into Orchestrate shell language (OSH) and object code
(from generated C++)








Job Development Overview
Import metadata defining sources and targets
Can be done within Designer or Manager

In Designer, add stages defining data extractions and loads

Add processing stages to define data transformations

Add links defining the flow of data from sources to targets

Compile the job

In Director, validate, run, and monitor your job
Can also run the job in Designer
Can only view the job log in Director
















Designer Work Area
Canvas
Repository view
Tools Palette







Designer Toolbar
Provides quick access to the main functions of Designer
Show/hide metadata markers
Run
Job properties
Compile







Tools Palette





Adding Stages and Links
Drag stages from the Tools Palette to the diagram
Can also be dragged from Stage Type branch to the diagram

Draw links from source to target stage
Press the right mouse button over the source stage
Release the mouse button over the target stage






Job Creation Example Sequence



Brief walkthrough of procedure

Assumes table definition of source already exists in the repository







Create New Job







Drag Stages and Links From Palette
Peek
Row Generator
Annotation







Renaming Links and Stages



Click on a stage or link to rename it

Meaningful names have many
benefits
Documentation
Clarity
Fewer development errors







































Row Generator Stage







Produces mock data for specified columns

No input links; single output link

On Properties tab, specify number of rows

On Columns tab, load or specify column definitions
Click Edit Row over a column to specify the values to be generated for that
column
A number of algorithms for generating values are available depending on the
data type

Algorithms for Integer type
Random: seed, limit
Cycle: Initial value, increment

Algorithms for string type: Cycle , alphabet

Algorithms for date type: Random, cycle











Inside the Row Generator Stage

Properties
tab
Set property
value
Property









Columns Tab
View data
Load a
Table
definition
Select Table
Definition







Extended Properties
Specified
properties and
their values
Additional
properties to add





Peek Stage
Displays field values
Displayed in the job log or sent to a file
Skip records option
Can control number of records to be displayed
Shows data in each partition, labeled 0, 1, 2, ...

Useful stub stage for iterative job development
Develop job to a stopping point and check the data







Peek Stage Properties
Output to
job log






































Job Parameters



Defined in Job Properties window
Makes the job more flexible
Parameters can be:
Used in directory and file names
Used to specify property values
Used in constraints and derivations
Parameter values are determined at run time

When used for directory and file names and names of properties,
surround with pound signs (#)
E.g., #NumRows#

Job parameters can reference DataStage and system environment
variables
$PROJDEF
$ENV










Defining a Job Parameter
Parameters tab
Parameter







Using a Job Parameter in a Stage
Job parameter surrounded
with pound signs




Adding Job Documentation
Job Properties
Short and long descriptions
Shows in Manager

Annotation stage
Added from the Tools Palette
Display formatted text descriptions on diagram








Job Properties Documentation
Documentation







Annotation Stage Properties







Compiling a Job
Compile







Errors or Successful Message
Highlight stage
with error Click for more info





Running Jobs and Viewing the Job
Log in Designer







Prerequisite to Job Execution





DataStage Director





Use to run and schedule jobs

View runtime messages

Can invoke from DataStage Manager or Designer
Tools > Run Director







Run Options
Stop after number
of warnings
Stop after number
of rows







Director Log View
Click the open
book icon to view
log messages
Peek messages







Message Details





Other Director Functions







Schedule job to run on a particular date/time

Clear job log of messages

Set job log purging conditions

Set Director options
Row limits
Abort after x warnings





Running Jobs from Command Line





Use dsjob -run to run a job

Use dsjob -logsum to display messages in the log

Documented in the Parallel Job Advanced Developer's Guide, ch. 7
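As a hedged sketch of what such a command line might look like (the project name, job name, and parameter are placeholders; confirm the exact flags against the Advanced Developer's Guide for your release):

# Run the job, supplying a job parameter value
dsjob -run -param NumRows=100 MyProject GenDataJob

# Summarize the log entries for that job
dsjob -logsum MyProject GenDataJob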





Lab Exercises
Conceptual Lab 03A
Design a simple job in Designer
Define a job parameter
Document the job
Compile
Run
Monitor the job in Director




IBM WebSphere DataStage
Module 04: Accessing Sequential Data






Module Objectives













Understand the stages for accessing different kinds of sequential data

Sequential File stage

Data Set stage

Complex Flat File stage

Create jobs that read from and write to sequential files

Read from multiple files using file patterns

Use multiple readers





Types of Sequential Data Stages
Sequential
Fixed or variable length

Data Set

Complex Flat File








The Framework and Sequential Data



The EE Framework processes only datasets

For files other than datasets, such as sequential flat files, import and
export operations are done
Import and export OSH operators are generated by Sequential and
Complex Flat File stages

During import or export DataStage performs format translations
into, or out of, the EE internal format


Internally, the format of the data is described by schemas (similar to Table Definitions)









Using the Sequential File Stage
Both import and export of general files (text, binary) are performed by the Sequential File stage
Data import: converts the file into the EE internal format
Data export: converts from the EE internal format back out to the file




Features of Sequential File Stage





Normally executes in sequential mode

Executes in parallel when reading multiple files

Can use multiple readers within a node
Reads chunks of a single file in parallel

The stage needs to be told:
How file is divided into rows (record format)
How row is divided into columns (column format)






File Format Example
Diagram: two example record layouts, each with fields Field 1, Field 2, Field 3, and Last field separated by a comma field delimiter and terminated by a newline (nl) record delimiter. One layout uses Final Delimiter = comma (a delimiter also follows the last field); the other uses Final Delimiter = end (no delimiter after the last field).





Sequential File Stage Rules





One input link

One stream output link

Optionally, one reject link
Will reject any records not matching metadata in the column definitions
Example: You specify three columns separated by commas, but the row
that's read has no commas in it







Job Design Using Sequential Stages
Reject link







Sequential Source Columns Tab
View data
Load Table Definition
Save as a new
Table Definition







Input Sequential Stage Properties
Output tab
File to
access
Column names
in first row
Click to add more files having
the same format







Format Tab
Record format
Column format







Reading Using a File Pattern
Use wild
cards
Select File
Pattern







Properties - Multiple Readers
Multiple readers option allows
you to set number of readers
per node









































Sequential Stage As a Target
Input Tab
Append /
Overwrite








Reject Link
Reject mode =
Continue: Continue reading records
Fail: Abort job
Output: Send down output link
In a source stage
All records not matching the
metadata (column definitions) are
rejected
In a target stage
All records that fail to be written for
any reason

Rejected records consist of one
column, datatype = raw



Reject mode property







Inside the Copy Stage
Column mappings





DataSet Stage







Data Set



Operating system (Framework) file

Preserves partitioning
Component dataset files are written on each partition

Suffixed by .ds

Referred to by a header file

Managed by Data Set Management utility from GUI (Manager, Designer,
Director)

Represents persistent data

Key to good performance in set of linked jobs
No import / export conversions are needed
No repartitioning needed















Persistent Datasets

Accessed using the DataSet stage

Two parts:
Descriptor file (e.g., input.ds): contains metadata and data location, but NOT the data itself
Data file(s): contain the data, stored as multiple Unix files (one per node, e.g., node1:/local/disk1/..., node2:/local/disk2/...), accessible in parallel

Example schema stored in the descriptor:
record (
partno: int32;
description: string;
)





Data Translation
Occurs on import
From sequential files or file sets
From RDBMS

Occurs on export
From datasets to file sets or sequential files
From datasets to RDBMS

DataStage engine is most efficient when processing internally
formatted records (i.e. datasets)







FileSet Stage






File Set Stage





Can read or write file sets

Files suffixed by .fs

File set consists of:
Descriptor file contains location of raw data files + metadata
Individual raw data files

Can be processed in parallel

Similar to a dataset
Main difference is that file sets are not in the internal format and
therefore more accessible to external applications













File Set Stage Example
Descriptor file







Lab Exercises
Conceptual Lab 04A
Read and write to a sequential file
Create reject links
Create a data set

Conceptual Lab 04B
Read multiple files using a file pattern

Conceptual Lab 04C
Read a file using multiple readers







DataStage Data Types


Standard types:
Char
VarChar
Integer
Decimal (Numeric)
Floating point
Date
Time
Timestamp
VarBinary (raw)

Complex types:
Vector (array, occurs)
Subrecord (group)





Standard Types
Char
Fixed-length string
VarChar
Variable-length string
Specify maximum length
Integer
Decimal (Numeric)
Precision (length including digits after the decimal point)
Scale (number of digits after the decimal point)
Floating point
Date
Default string format: %yyyy-%mm-%dd
Time
Default string format: %hh:%nn:%ss
Timestamp
Default string format: %yyyy-%mm-%dd %hh:%nn:%ss
VarBinary (raw)
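For example, under these default formats the value representing 3:45 PM on January 15, 2006 would appear as:
Date: 2006-01-15
Time: 15:45:00
Timestamp: 2006-01-15 15:45:00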






Complex Data Types
Vector
A one-dimensional array
Elements are numbered 0 to n
Elements can be of any single type
All elements must have the same type
Can have fixed or variable number of elements
Subrecord
A group or structure of elements
Elements of the subrecord can be of any type
Subrecords can be embedded








Schema With Complex Types
subrecord
vector





Table Definition with complex types

Authors is a subrecord

Books is a vector of 3 strings of length 5
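As a sketch, the same structure could be expressed in Orchestrate schema notation roughly as follows (the field names and lengths mirror the Authors/Books example above; treat the exact syntax as an assumption to verify against your schema reference):

record (
  authors: subrec (
    firstname: string[max=15];
    lastname: string[max=20];
  );
  books[3]: string[5];
)

Here authors is the subrecord (a group of elements) and books[3]: string[5] declares a vector of 3 fixed-length strings of length 5.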









































Complex Types Column Definitions
subrecord
Elements of subrecord
Vector







Reading and Writing Complex Data
Complex Flat
File target
stage
Complex Flat
File source
stage







Importing COBOL Copybooks
Click Import>Table
Definitions>COBOL File Definitions
to begin the import

Each level 01 item begins a Table
Definition

Specify position of level 01 items


Level 01 start
position
Path to
copybook file
Where to store the
Table Definition







































Reading and Writing NULL Values






Working with NULLs
Internally, NULL is represented by a special value outside the range of
any existing, legitimate values

If NULL is written to a non-nullable column, the job will abort

Columns can be specified as nullable
NULLs can be written to nullable columns

You must handle NULLs written to non-nullable columns in a
Sequential File stage
You need to tell DataStage what value to write to the file
Unhandled rows are rejected

In a Sequential source stage, you can specify values you want
DataStage to convert to NULLs












Specifying a Value for NULL
Nullable
column
Added
property







































Managing DataSets





Managing DataSets



GUI (Manager, Designer, Director): Tools > Data Set Management

Dataset management from the system command line:
orchadmin
Unix command-line utility
List records
Remove datasets
Removes all component files, not just the header file
dsrecords
Lists the number of records in a dataset







Displaying Data and Schema
Display data
Schema








































Manage Datasets from the System Command Line
dsrecords
Gives record count
Unix command-line utility
$ dsrecords ds_name
E.g., $ dsrecords myDS.ds
156999 records
orchadmin
Manages EE persistent data sets
Unix command-line utility
E.g., $ orchadmin delete myDataSet.ds




Lab Exercises
Conceptual Lab 04D
Use the dsrecords utility
Use Data Set Management tool

Conceptual Lab 04E
Reading and Writing NULLs





IBM WebSphere DataStage
Module 05: Platform Architecture




Module Objectives









Parallel processing architecture

Pipeline parallelism

Partition parallelism

Partitioning and collecting

Configuration files





Key EE Concepts
Parallel processing:
Executing the job on multiple CPUs

Scalable processing:
Add more resources (CPUs and disks) to increase system performance

Example system: 6 CPUs (processing
nodes) and disks
Scale up by adding more CPUs
Add CPUs as individual nodes or to
an SMP system
















Scalable Hardware Environments



Single CPU
Dedicated memory & disk

SMP
Multi-CPU (2-64+)
Shared memory & disk

GRID / Clusters
Multiple, multi-CPU systems
Dedicated memory per node
Typically SAN-based shared storage

MPP
Multiple nodes with dedicated memory and storage
2 to 1000s of CPUs









Pipeline Parallelism



Transform, clean, load processes execute simultaneously

Like a conveyor belt moving rows from process to process
Start downstream process while upstream process is running

Advantages:
Reduces disk usage for staging areas
Keeps processors busy

Still has limits on scalability







Partition Parallelism
Divide the incoming stream of data into subsets to be separately
processed by an operation
Subsets are called partitions (nodes)
Each partition of data is processed by the same operation
E.g., if operation is Filter, each partition will be filtered in exactly the same
way
Facilitates near-linear scalability
8 times faster on 8 processors
24 times faster on 24 processors
This assumes the data is evenly distributed



















Three-Node Partitioning
Diagram: the incoming data is split into three subsets; Node 1, Node 2, and Node 3 each run the same operation on their own subset.



Here the data is partitioned into three partitions
The operation is performed on each partition of data separately and in parallel
If the data is evenly distributed, the data will be processed three times faster







EE Combines Partitioning and Pipelining
Within EE, pipelining, partitioning, and repartitioning are automatic
Job developer only identifies:




Sequential vs. Parallel operations (by stage)
Method of data partitioning
Configuration file (which identifies resources)
Advanced stage options (buffer tuning, operator combining, etc.)



















































































Job Design v. Execution
User assembles the flow using DataStage Designer
at runtime, this job runs in parallel for any configuration
(1 node, 4 nodes, N nodes)
No need to modify or recompile the job design!







































Configuration File
Configuration file separates configuration (hardware / software) from job design
Specified per job at runtime by $APT_CONFIG_FILE
Change hardware and resources without changing job design

Defines number of nodes (logical processing units) with their resources (need not
match physical CPUs)
Dataset, Scratch, Buffer disk (file systems)
Optional resources (Database, SAS, etc.)
Advanced resource optimizations
Pools (named subsets of nodes)

Multiple configuration files can be used at runtime
Optimizes overall throughput and matches job characteristics to overall hardware resources
Allows runtime constraints on resource usage on a per job basis






Example Configuration File
Key points:
1. Number of nodes defined
2. Resources assigned to each node; their order is significant
3. Advanced resource optimizations and configuration (named pools, database, SAS)




{
node "n1" {
fastname "s1"
pool "" "n1" "s1" "app2" "sort"
resource disk "/orch/n1/d1" {}
resource disk "/orch/n1/d2" {"bigdata"}
resource scratchdisk "/temp" {"sort"}
}
node "n2" {
fastname "s2"
pool "" "n2" "s2" "app1"
resource disk "/orch/n2/d1" {}
resource disk "/orch/n2/d2" {"bigdata"}
resource scratchdisk "/temp" {}
}
node "n3" {
fastname "s3"
pool "" "n3" "s3" "app1"
resource disk "/orch/n3/d1" {}
resource scratchdisk "/temp" {}
}
node "n4" {
fastname "s4"
pool "" "n4" "s4" "app1"
resource disk "/orch/n4/d1" {}
resource scratchdisk "/temp" {}
}
}







































Partitioning and Collecting







































Partitioning and Collecting



Partitioning breaks incoming rows into sets (partitions) of rows

Each partition of rows is processed separately by the stage/operator
If the hardware and configuration file supports parallel processing, partitions
of rows will be processed in parallel

Collecting returns partitioned data back to a single stream

Partitioning / Collecting occurs on stage Input links

Partitioning / Collecting is implemented automatically
Based on stage and stage properties
How the data is partitioned / collected can be specified










Partitioning / Collecting Algorithms
Partitioning algorithms include:
Round robin
Hash: Determine partition based on key value
Requires key specification
Entire: Send all rows down all partitions
Same: Preserve the same partitioning
Auto: Let DataStage choose the algorithm
Collecting algorithms include:
Round robin
Sort Merge
Read in by key
Presumes data is sorted by the key in each partition
Builds a single sorted stream based on the key
Ordered
Read all records from the first partition, then the second, and so on








































Keyless V. Keyed Partitioning Algorithms
Keyless: Rows are distributed independently of data values
Round Robin
Entire
Same

Keyed: Rows are distributed based on values in the specified key
Hash: Partition based on key
Example: Key is State. All CA rows go into the same partition; all MA
rows go in the same partition. Two rows of the same state never go into
different partitions
Modulus: Partition based on modulus of key divided by the number of
partitions. Key is a numeric type.
Example: Key is OrderNumber (numeric type). Rows with the same
order number will all go into the same partition.
DB2: Matches DB2 EEE partitioning
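A quick worked example of the Modulus rule (the order numbers and the 4-partition count are made up for illustration): with 4 partitions, OrderNumber 1000 goes to partition 1000 mod 4 = 0, 1001 to partition 1, 1002 to partition 2, 1003 to partition 3, and 1004 wraps back to partition 0, so rows sharing an order number always land in the same partition.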






Partitioning Requirements for Related Records
Misplaced records
Using Aggregator stage to sum customer sales by customer number
If there are 25 customers, 25 records should be output
But suppose records with the same customer numbers are spread
across partitions
This will produce more than 25 groups (records)
Solution: Use hash partitioning algorithm
Partition imbalances
Peek stage shows number of records going down each partition






Unequal Distribution Example
Same key values are assigned to the same partition
Hash on LName, with a 2-node config file

Source data:
ID LName FName Address
1 Ford Henry 66 Edison Avenue
2 Ford Clara 66 Edison Avenue
3 Ford Edsel 7900 Jefferson
4 Ford Eleanor 7900 Jefferson
5 Dodge Horace 17840 Jefferson
6 Dodge John 75 Boston Boulevard
7 Ford Henry 4901 Evergreen
8 Ford Clara 4901 Evergreen
9 Ford Edsel 1100 Lakeshore
10 Ford Eleanor 1100 Lakeshore

Partition 0 (all Dodge rows):
ID LName FName Address
5 Dodge Horace 17840 Jefferson
6 Dodge John 75 Boston Boulevard

Partition 1 (all Ford rows):
ID LName FName Address
1 Ford Henry 66 Edison Avenue
2 Ford Clara 66 Edison Avenue
3 Ford Edsel 7900 Jefferson
4 Ford Eleanor 7900 Jefferson
7 Ford Henry 4901 Evergreen
8 Ford Clara 4901 Evergreen
9 Ford Edsel 1100 Lakeshore
10 Ford Eleanor 1100 Lakeshore






Partitioning / Collecting Link Icons
Partitioning icon
Collecting icon









More Partitioning Icons
fan-out
Sequential to Parallel
SAME partitioner
Re-partition
watch for this!
AUTO partitioner







Partitioning Tab
Key specification
Algorithms







Collecting Specification
Key specification
Algorithms















Quiz - True or False?
Everything that has been data-partitioned must be
collected in the same job









Data Set Stage
Is the data partitioned?





Introduction to the Solution Development Exercises





Solution Development Jobs



Series of 4 jobs extracted from production jobs

Use a variety of stages in interesting, realistic configurations
Sort, Aggregator stages
Join, lookup stage
Peek, Filter stages
Modify stage
Oracle stage

Contain useful techniques
Use of Peeks
Datasets used to connect jobs
Use of project environment variables in job parameters
Fork Joins
Lookups for auditing








Warehouse Job 01







Glimpse Into the Sort Stage
Algorithms
Sort key to add






Copy Stage With Multiple Output Links
Select output link





Filter Stage





Used with Peek stage to select a portion of data for checking

On Properties tab, specify a Where clause to filter the data

On Mapping tab, map input columns to output columns







Setting the Filtering Condition
Filtering
condition







Warehouse Job 02







Warehouse Job 03







Warehouse Job 04







Warehouse Job 02 With Lookup





Lab Exercises
Conceptual Lab 05A
Experiment with partitioning / collecting
Solution Lab 05B (Build Warehouse_01 Job)
Add environment variables as job parameters
Read multiple sequential files
Use the Sort stage
Use Filter and Peek stages
Write to a DataSet stage





IBM WebSphere DataStage
Module 06: Combining Data






Module Objectives







Combine data using the Lookup stage

Combine data using Merge stage

Combine data using the Join stage

Combine data using the Funnel stage







Combining Data
Ways to combine data:
Horizontally:
Multiple input links
One output link made of columns from different input links.
Joins
Lookup
Merge

Vertically:
One input link, one output link combining groups of related records into a
single record
Aggregator
Remove Duplicates

Funneling: Multiple input streams funneled into a single output stream
Funnel stage







Lookup, Merge, Join Stages
These stages combine two or more input links
Data is combined by designated "key" column(s)

These stages differ mainly in:
Memory usage
Treatment of rows with unmatched key values
Input requirements (sorted, de-duplicated)






Not all Links are Created Equal
DataStage distinguishes between:
The Primary input (Framework port 0)
Secondary inputs, in some cases called "Reference" (other Framework ports)

Conventions:
Joins: Primary = Left, Secondary = Right
Lookup: Primary = Source, Secondary = Lookup table(s)
Merge: Primary = Master, Secondary = Update(s)

Tip: Check the "Link Ordering" tab to make sure the intended Primary is listed first


Lookup Stage










Lookup Features







One Stream Input link (Source)

Multiple Reference links (Lookup files)

One output link

Optional Reject link
Only one per Lookup stage, regardless of number of reference links

Lookup Failure options
Continue, Drop, Fail, Reject

Can return multiple matching rows

Hash tables are built in memory from the lookup files
Indexed by key
Should be small enough to fit into physical memory









The Lookup Stage
Uses one or more key columns as an index into a table
Usually contains other values associated with each key.

The lookup table is created in memory before any lookup source rows are processed


Diagram: the key column of the source (state_code, e.g. TN) indexes into an in-memory lookup table whose index column holds codes (SC, SD, TN, TX, UT, VT, ...) and whose associated-value column holds the corresponding names (South Carolina, South Dakota, Tennessee, Texas, Utah, Vermont, ...).








Lookup from Sequential File Example
Reference link
Driver (Source)
link
(lookup table)









































Lookup Key Column in Sequential File

Lookup key







Lookup Stage Mappings
Source link
Reference link
Derivation for lookup key







Handling Lookup Failures
Select action





Lookup Failure Actions
If the lookup fails to find a matching key column, one of these actions
can be taken:
fail: the lookup Stage reports an error and the job fails immediately.
This is the default.

drop: the input row with the failed lookup(s) is dropped

continue: the input row is transferred to the output, together with the successful table
entries. The failed table entry(s) are not transferred, resulting in either default output
values or null output values.

reject: the input row with the failed lookup(s) is transferred to a second output link, the
"reject" link.




There is no option to capture unused table entries
Compare with the Join and Merge stages





Lookup Stage Behavior

We shall first use a simplest case, optimal input:



Two input links: "Source" as primary, "Lookup" as secondary
Sorted on key column (here "Citizen"), without duplicates on key

Source link (primary input):
Revolution | Citizen
1789 | Lefty
1776 | M_B_Dextrous

Lookup link (secondary input):
Citizen | Exchange
M_B_Dextrous | Nasdaq
Righty | NYSE





Lookup Stage

Output of Lookup with the continue option on key Citizen (same output as outer join and merge/keep; unmatched rows get an empty string or NULL for Exchange):
Revolution | Citizen | Exchange
1789 | Lefty | (empty or NULL)
1776 | M_B_Dextrous | Nasdaq

Output of Lookup with the drop option on key Citizen (same output as inner join and merge/drop):
Revolution | Citizen | Exchange
1776 | M_B_Dextrous | Nasdaq




The Lookup Stage



Lookup Tables should be small enough to fit into physical memory

On an MPP you should partition the lookup tables using the Entire partitioning method
or partition them by the same hash key as the source link
Entire results in multiple copies (one for each partition)

On an SMP, choose Entire or accept the default (which is Entire)
Entire does not result in multiple copies because memory is shared








































Join Stage






The Join Stage
Four types:




Inner
Left outer
Right outer
Full outer
2 or more sorted input links, 1 output link
"left" on primary input, "right" on secondary input
Pre-sort makes joins "lightweight": few rows need to be in RAM
Follow the RDBMS-style relational model
Cross-products in case of duplicates
Matching entries are reusable for multiple matches
Non-matching entries can be captured (Left, Right, Full)
No fail/reject option for missed matches









Join Stage Editor
Link Order is immaterial for Inner and Full Outer joins, but very important for Left/Right Outer joins
One of four variants:
Inner
Left Outer
Right Outer
Full Outer
Multiple key columns allowed





Join Stage Behavior
We shall first use the simplest case, optimal input:
Two input links: "left" as primary, "right" as secondary
Sorted on key column (here "Citizen"), without duplicates on key

Left link (primary input):
Revolution | Citizen
1789 | Lefty
1776 | M_B_Dextrous

Right link (secondary input):
Citizen | Exchange
M_B_Dextrous | Nasdaq
Righty | NYSE





Inner Join
Transfers rows from both data sets whose key columns contain equal values to the output link
Treats both inputs symmetrically

Output of inner join on key Citizen (same output as lookup/reject and merge/drop):
Revolution | Citizen | Exchange
1776 | M_B_Dextrous | Nasdaq




Left Outer Join

Transfers all values from the left link and transfers values from the right link only where key columns match
Same output as lookup/continue and merge/keep:
Revolution | Citizen | Exchange
1789 | Lefty | (empty or NULL)
1776 | M_B_Dextrous | Nasdaq






Left Outer Join
Check the Link Ordering tab to make sure the intended Primary is listed first





Right Outer Join
Transfers all values from the right link and transfers values from the left link only where key columns match

Revolution | Citizen | Exchange
1776 | M_B_Dextrous | Nasdaq
(NULL or 0) | Righty | NYSE




Full Outer Join
Transfers rows from both data sets whose key columns contain equal values to the output link

It also transfers rows whose key columns contain unequal values, from both input links, to the output link

Treats both inputs symmetrically

Creates new columns, with new column names!

Revolution | leftRec_Citizen | rightRec_Citizen | Exchange
1789 | Lefty | |
1776 | M_B_Dextrous | M_B_Dextrous | Nasdaq
0 | | Righty | NYSE




Merge Stage









































Merge Stage Job




































































The Merge Stage







Allows composite keys

Multiple update links

Matched update rows are consumed

Unmatched updates in input port n can be captured in output port n

Lightweight

Diagram: a Master input (port 0) and one or more Update inputs (ports 1, 2, ...) feed the Merge stage, which produces one Output plus optional Reject outputs (one per Update link).







Merge Stage Editor
Unmatched Master rows (one of two options):
Keep [default]
Drop
(Capture in reject link is NOT an option)

Unmatched Update rows option:
Capture in reject link(s), implemented by adding outgoing links





Comparison: Joins, Lookup, Merge


Model: Joins = RDBMS-style relational; Lookup = Source, with in-RAM lookup table; Merge = Master and Update(s)
Memory usage: Joins = light; Lookup = heavy; Merge = light
Number and names of inputs: Joins = 2 or more (left, right); Lookup = 1 Source, N Lookup Tables; Merge = 1 Master, N Update(s)
Mandatory input sort: Joins = all inputs; Lookup = no; Merge = all inputs
Duplicates in primary input: Joins = OK (cross-product); Lookup = OK; Merge = Warning!
Duplicates in secondary input(s): Joins = OK (cross-product); Lookup = Warning!; Merge = OK only when N = 1
Options on unmatched primary: Joins = Keep (left outer), Drop (inner); Lookup = fail [default], continue, drop, reject; Merge = keep [default], drop
Options on unmatched secondary: Joins = Keep (right outer), Drop (inner); Lookup = NONE; Merge = capture in reject set(s)
On match, secondary entries are: Joins = captured; Lookup = captured; Merge = consumed
Number of outputs: Joins = 1; Lookup = 1 out (plus 1 reject); Merge = 1 out (plus N rejects)
Captured in reject set(s): Joins = nothing (N/A); Lookup = unmatched primary entries; Merge = unmatched secondary entries




Funnel Stage





What is a Funnel Stage?
A processing stage that combines data from multiple input links to a
single output link

Useful to combine data from several identical data sources into a single
large dataset

Operates in three modes
Continuous
SortFunnel
Sequence







Three Funnel modes
Continuous:
Combines the records of the input link in no guaranteed order.
It takes one record from each input link in turn. If data is not available on an input link,
the stage skips to the next link rather than waiting.
Does not attempt to impose any order on the data it is processing.

Sort Funnel: Combines the input records in the order defined by the value(s) of one or
more key columns and the order of the output records is determined by these sorting
keys.

Sequence: Copies all records from the first input link to the output link, then all the
records from the second input link and so on.







Sort Funnel Method



Produces a sorted output (assuming input links are all sorted on key)
Data from all input links must be sorted on the same key column
Typically data from all input links are hash partitioned before they are sorted
Selecting Auto partition type under Input Partitioning tab defaults to this
Hash partitioning guarantees that all the records with same key column
values are located in the same partition and are processed on the same
node.
Allows for multiple key columns
1 primary key column, n secondary key columns
Funnel stage first examines the primary key in each input record.
For multiple records with the same primary key value, it then
examines the secondary keys to determine the order of the records it outputs








Funnel Stage Example







Funnel Stage Properties





Lab Exercises
Conceptual Lab 06A
Use a Lookup stage
Handle lookup failures
Use a Merge stage
Use a Join stage
Use a Funnel stage

Solution Lab 06B (Build Warehouse_02 Job)
Use a Join stage







































IBM WebSphere DataStage
Module 07: Sorting and Aggregating Data





Module Objectives





Sort data using in-stage sorts and Sort stage

Combine data using Aggregator stage

Combine data using the Remove Duplicates stage





Sort Stage





Sorting Data
Uses
Some stages require sorted input
Join, merge stages require sorted input
Some stages use less memory with sorted input
E.g., Aggregator

Sorts can be done:
Within stages
On input link Partitioning tab, set partitioning to anything other than Auto
In a separate Sort stage
Makes sort more visible on diagram
Has more options










Sorting Alternatives
Sort stage
Sort within stage







In-Stage Sorting
Sorts are specified on the input link Partitioning tab:
Do sort
Preserve non-key ordering
Remove dups
Sort key
The partitioning type can't be Auto when sorting







Sort Stage
Sort key
Sort options





Sort keys



Add one or more keys

Specify sort mode for each key
Sort: Sort by this key
Don't sort (previously sorted):
Assume the data has already been sorted by this key
Continue sorting by any secondary keys

Specify sort order: ascending / descending

Specify case sensitive or not








Sort Options
Sort Utility
DataStage (the default)
Unix: Don't use; slower than the DataStage sort utility
Stable
Allow duplicates
Memory usage
Sorting takes advantage of the available memory for increased performance
Uses disk if necessary
Increasing the amount of memory can improve performance
Create key change column
Add a column with a value of 1 / 0
1 indicates that the key value has changed
0 means that the key value hasn't changed
Useful for processing groups of rows in a Transformer











Sort Stage Mapping Tab






Partitioning V. Sorting Keys
Partitioning keys are often different than Sorting keys
Keyed partitioning (e.g., Hash) is used to group related records into the
same partition
Sort keys are used to establish order within each partition

For example, partition on HouseHoldID, sort on HouseHoldID,
PaymentDate
Important when removing duplicates: sorting within each partition is used to
establish order for duplicate retention (first or last in the group)






Aggregator Stage





Aggregator Stage
Purpose: Perform data aggregations
Specify:
Zero or more key columns that define the aggregation units (or groups)
Columns to be aggregated
Aggregation functions, including among many others:
count (nulls/non-nulls)
Sum
Max / Min / Range
The grouping method (hash table or pre-sort) is a performance issue







Job with Aggregator Stage
Aggregator stage







Aggregator Stage Properties
Group columns
Group method
Aggregation
functions





Aggregator Functions
Aggregation type = Count rows
Count rows in each group
Put result in a specified output column

Aggregation type = Calculation
Select column
Put result of calculation in a specified output column
Calculations include:








Sum
Count
Min, max
Mean
Missing value count
Non-missing value count
Percent coefficient of variation





Grouping Methods
Hash (default)
Intermediate results for each group are stored in a hash table
Final results are written out after all input has been processed
No sort required
Use when number of unique groups is small
Running tally for each group's aggregate calculations needs to fit into
memory. Requires about 1K RAM / group
E.g., average family income by state requires about 0.05 MB of RAM

Sort
Only a single aggregation group is kept in memory
When a new group is seen, the current group is written out
Requires input to be sorted by grouping keys
Can handle unlimited numbers of groups
Example: average daily balance by credit card
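The 0.05 MB figure above follows directly from the per-group estimate: roughly 50 state groups at about 1 KB each is about 50 KB, or 0.05 MB, which is why the hash method is comfortable whenever the number of unique groups stays small.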








Aggregation Types
























Calculation types




Remove Duplicates Stage





Removing Duplicates
Can be done by Sort stage
Use unique option



No choice on which to keep
Stable sort always retains the first row in the group
Non-stable sort is indeterminate
OR
Remove Duplicates stage
Has more sophisticated ways to remove duplicates
Can choose to retain first or last







Remove Duplicates Stage Job
Remove Duplicates
stage









































Remove Duplicates Stage Properties
Key that defines
duplicates
Retain first or last
duplicate




Lab Exercises
Solution Development Lab 07A
Use Sort stage
Use Aggregator stage
Use RemoveDuplicates stage
(Build Warehouse_03 job)




IBM WebSphere DataStage
Module 08: Transforming Data






Module Objectives



Understand ways DataStage allows you to transform data

Use this understanding to:
Create column derivations using user-defined code and system functions
Filter records based on business criteria
Control data flow based on data conditions





Transformed Data
Derivations may include incoming fields or parts of incoming fields
Derivations may reference system variables and constants
Frequently used functions performed on incoming values:
Date and time
Mathematical
Logical
Null handling
More





Stages Review
Stages that can transform data
Transformer
Modify
Aggregator
Stages that do not transform data
File stages: Sequential, Dataset, Peek, etc.
Sort
Remove Duplicates
Copy
Filter
Funnel






Transformer Stage


Column mappings
Derivations
Written in Basic
Final compiled code is C++ generated object code
Constraints
Filter data
Direct data down different output links
For different processing or storage
Expressions for constraints and derivations can reference
Input columns
Job parameters
Functions
System variables and constants
Stage variables
External routines









Transformer Stage Uses
Transformer with
multiple outputs
Control data flow
Constrain data
Direct data

Derivations







Inside the Transformer
Input columns
Stage
Stage variables
Output columns
Constraints

Derivations / Mappings
Input / Output column defs











Output






Defining a Constraint
Input column
Job parameter







Defining a Derivation
Input column
String in quotes Concatenation
operator (:)




IF THEN ELSE Derivation



Use IF THEN ELSE to conditionally derive a value

Format:
IF <condition> THEN <expression1> ELSE <expression2>
If the condition evaluates to true then the result of expression1 will be copied
to the target column or stage variable
If the condition evaluates to false then the result of expression2 will be
copied to the target column or stage variable

Example:
Suppose the source column is named In.OrderID and the target column is
named Out.OrderID
Replace In.OrderID values of 3000 by 4000
IF In.OrderID = 3000 THEN 4000 ELSE In.OrderID






String Functions and Operators
Substring operator
Format: String[loc, length]
Example:
Suppose In.Description contains the string "Orange Juice"
In.Description[8,5] returns "Juice"

UpCase(<string>) / DownCase(<string>)
Example: UpCase(In.Description) returns "ORANGE JUICE"

Len(<string>)
Example: Len(In.Description) returns 12









Checking for NULLs
Nulls can be introduced into the data flow from
lookups
Mismatches (lookup failures) can produce nulls

Can be handled in constraints, derivations,
stage variables, or a combination of these

NULL functions
Testing for NULL




IsNull(<column>)
IsNotNull(<column>)
Replace NULL with a value
NullToValue(<column>, <value>)
Set to NULL: SetNull()
Example: IF In.Col = 5 THEN SetNull()
ELSE In.Col
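Putting these together, a typical derivation that guards a column arriving from a reference link (the column name In.Region and the default string are illustrative, not from the labs) is:
IF IsNull(In.Region) THEN "UNKNOWN" ELSE In.Region
This is the usual pattern when the Lookup stage uses the Continue failure option, which leaves the reference columns NULL for unmatched rows.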





Transformer Functions











Date & Time

Logical

Null Handling

Number

String

Type Conversion





Transformer Execution Order







Derivations in stage variables are executed first

Constraints are executed before derivations

Column derivations in earlier links are executed before later links

Derivations in higher columns are executed before lower columns





Transformer Stage Variables
Derivations execute in order from top to bottom
Later stage variables can reference earlier stage variables
If an earlier stage variable references a later one, it picks up the value derived
from the previous row that came into the Transformer

Multi-purpose
Counters
Store values from previous rows to make comparisons
Store derived values to be used in multiple target field derivations
Can be used to control execution of constraints








Stage Variables Toggle
Show/Hide button










Transformer Reject Links
Reject link
Convert link to a
Reject link







Otherwise Link
Otherwise link







Defining an Otherwise Link
Check to create
otherwise link Can specify abort
condition







Specifying Link Ordering
Link ordering toolbar icon
Last in
order





Transformer Stage Tips
Suggestions:
Include reject links
Test for NULL values before using a column in a function
Use RCP (Runtime Column Propagation)
Map columns that have derivations (not just copies)
More on RCP later
Be aware of column and stage variable data types
Often developers do not pay attention to stage variable types
Avoid type conversions
Try to maintain the data type as imported





Modify Stage





Modify Stage



Modify column types

Perform some types of derivations
Null handling
Date / time handling
String handling

Add or drop columns







Job With Modify Stage
Modify stage







Specifying a Column Conversion

Derivation / Conversion
New column
Specification
property





Lab Exercises
Conceptual Lab 08A
Add a Transformer to a job
Define a constraint
Work with null values
Define a rejects link
Define a stage variable
Define a derivation



IBM WebSphere DataStage
Module 09: Standards and Techniques






Module Objectives









Establish standard techniques for Parallel job development
Job documentation
Naming conventions for jobs, links, and stages
Iterative job design
Useful stages for job development
Using configuration files for development
Using environmental variables
Job parameters
Containers







Job Presentation
Document using the job properties short and long descriptions
Document using the Annotation stage






Job Properties Documentation
Organize jobs into
categories
Description is displayed in
Manager and MetaStage





Naming Conventions
Stages named after the
Data they access
Function they perform
DO NOT leave default stage names like Sequential_File_0
One possible convention:
Use 2-character prefixes to indicate stage type, e.g.,



SF_ for Sequential File stage
DS_ for Dataset stage
CP_ for Copy stage
Links named for the data they carry
DO NOT leave default link names like DSLink3
One possible convention:
Prefix all link names with lnk_
Name links after the data flowing through them







Stage and Link Names
Name stages and links for the data they handle




Iterative Job Design



Use Copy and Peek stages as stubs

Test job in phases
Small sections first, then increasing in complexity

Use Peek stage to examine records
Check data at various locations
Check before and after processing stages








Copy Stage Stub Example
Copy stage







Copy Stage Example
With 1 link in and 1 link out, the Copy stage is the ultimate "no-op" (place-holder)
Operations can be placed on:
The input link (Partitioning tab): partitioners, sort, remove duplicates
The output link (Mapping page): rename, drop columns
Sometimes replaces the Transformer:
Rename
Drop
Implicit type conversions
Link constraint: break up schema





Developing Jobs

1. Keep it simple
a) Jobs with many stages are hard to debug and maintain
2. Start small and build to the final solution
a) Use view data, Copy, and Peek
b) Start from the source and work out
c) Develop with a 1-node configuration file
3. Solve the business problem before the performance problem
a) Don't worry too much about partitioning until the sequential flow works as expected
4. If you land data in order to break complex jobs into smaller sets of jobs for purposes of restartability or maintainability, use persistent datasets
a) Retains partitioning and internal data types
b) This is true only as long as you don't need to read the data outside of DataStage







Final Result






Good Things to Have in Each Job

Job parameters

Useful environment variables to add to job parameters:
$APT_DUMP_SCORE
Report OSH to message log
$APT_CONFIG_FILE
Establishes runtime parameters to the EE engine
Establishes degree of parallelization
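A hedged shell sketch of setting these for a command-line run (the configuration file path is illustrative; inside Designer they are normally added as job parameters instead):

# Point the run at a specific configuration file (path is an assumption)
export APT_CONFIG_FILE=/opt/datastage/configs/4node.apt
# Have the engine write the score / OSH report to the job log (commonly set to 1 or True)
export APT_DUMP_SCORE=1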







Setting Job Parameters
























Click to add
environment
variables











DUMP SCORE Output
Setting APT_DUMP_SCORE yields a report showing the partitioners and collectors used and the node-to-partition mapping (double-click the log entry to see the detail).





Use Multiple Configuration Files





Make a set for 1X, 2X, ...

Use different ones for test versus production

Include as a parameter in each job





Containers
Two varieties
Local
Shared

Local
Simplifies a large, complex diagram

Shared
Creates reusable object that many jobs within the project can
include









Reusable Job Components
Use Shared Containers for repeatedly used components
Container







Creating a Container




Create a job
Select the portions to containerize
Edit > Construct container > local or shared






Lab Exercises
Conceptual Lab 07A
Apply best practices when naming links and stages



IBM WebSphere DataStage
Module 10: Accessing Relational Data






Module Objectives
Understand how DataStage jobs read and write records to RDBMS tables

Import relational table definitions

Read from and write to database tables

Use database tables to look up data












































































Parallel Database Connectivity

Diagram: Traditional client-server (multiple Client, Sort, and Load processes, each with a single connection to a Parallel RDBMS) versus Enterprise Edition (a parallel server with parallel connections to the Parallel RDBMS).

Traditional client-server:
Only the RDBMS is running in parallel
Each application has only one connection
Suitable only for small data volumes

Enterprise Edition:
Parallel server runs APPLICATIONS
Application has parallel connections to the RDBMS
Suitable for large data volumes
Higher levels of integration possible






Supported Database Access
Enterprise Edition provides high performance / scalable interfaces for:











DB2 / UDB

Informix

Oracle

Teradata

SQL Server

ODBC




Importing Table Definitions
Can import using ODBC or using Orchestrate schema definitions
Orchestrate schema imports are better because the data types are more
accurate

Import>Table Definitions>Orchestrate Schema Definitions

Import>Table Definitions>ODBC Table Definitions










Orchestrate Schema Import







ODBC Import
Select ODBC data
source name




RDBMS Access
Automatically convert RDBMS table layouts to/from DataStage Table
Definitions

RDBMS NULLs converted to/from DataStage NULLs

Support for standard SQL syntax for specifying:
SELECT clause list
WHERE clause filter condition
INSERT / UPDATE
Supports user-defined queries













Native Parallel RDBMS Stages











DB2/UDB Enterprise

Informix Enterprise

Oracle Enterprise

Teradata Enterprise

ODBC Enterprise
SQL Server Enterprise





RDBMS Usage
As a source
Extract data from table (stream link)
Read methods include: Table, Generated SQL SELECT, or User-
defined SQL
User-defined can perform joins, access views
Lookup (reference link)



Normal lookup is memory-based (all table data read into memory)
Can perform one lookup at a time in DBMS (sparse option)
Continue/drop/fail options
As a target
Inserts
Upserts (Inserts and updates)
Loader









DB2 Enterprise Stage Source
Auto-generated
SELECT
Connection
information
Job example







Sourcing with User-Defined SQL
User-defined
read method
Columns in SQL must
match definitions on
Columns tab







DBMS Source Lookup
Reference
link







DBMS as a Target





Write Methods
Write methods
Delete
Load
Uses database load utility
Upsert
INSERT followed by an UPDATE
Write (DB2)
INSERT
Write modes
Truncate: Empty the table before writing
Create: Create a new table
Replace: Drop the existing table (if it exists) then create a new one
Append








DB2 Stage Target Properties
SQL INSERT
Drop table and
create
Database specified
by job parameter
Optional CLOSE command







DB2 Target Stage Upsert
SQL INSERT
SQL UPDATE
Upsert method








Generated OSH Primer
Comment blocks introduce each operator
Operator order is determined by the order stages were added to the canvas
OSH uses the familiar syntax of the UNIX shell:
Operator name
Schema (may include modify)
Operator options (-name value format)
Input (indicated by n< where n is the input #)
Output (indicated by n> where n is the output #)

For every operator, input and/or output datasets are numbered sequentially starting from 0, e.g.:
op1 0> dst
op1 1< src
Virtual datasets are generated to connect operators

Generated OSH for the first 2 stages:


###################
## Operator
## Operator options

####################################################
#### STAGE: Row_Generator_0
## Operator
generator
## Operator options
-schema record
(
a:int32;
b:string[max=12];
c:nullable decimal[10,2] {nulls=10};
)
-records 50000
## General options
[ident('Row_Generator_0'); jobmon_ident('Row_Generator_0')]
## Outputs
0> [] 'Row_Generator_0:lnk_gen.v'
;

#### STAGE: SortSt
## Operator
tsort
## Operator options
-key 'a'
-asc
## General options
[ident('SortSt'); jobmon_ident('SortSt'); par]
## Inputs
0< 'Row_Generator_0:lnk_gen.v'
## Outputs
0> [modify (
keep
a,b,c;
)] 'SortSt:lnk_sorted.v'
;

Note: the virtual dataset 'Row_Generator_0:lnk_gen.v' is used to connect the output of one operator to the input of another.




Framework v. DataStage Terminology

Framework term -> DataStage term:
schema -> table definition
property -> format
type -> SQL type and length
virtual dataset -> link
record / field -> row / column
operator -> stage
step, flow, OSH command -> job
Framework -> DS Parallel Engine

The GUI uses both terminologies
Log messages (info, warnings, errors) use Framework terminology





Elements of a Framework Program

Operators

Virtual datasets: sets of rows processed by the Framework

Schema: data description (metadata) for datasets and links





Enterprise Edition Runtime Architecture




Enterprise Edition Job Startup
Generated OSH and configuration file are used to compose a job
Score
Think of Score as in musical score, not game score
Similar to the way an RDBMS builds a query optimization plan
Identifies degree of parallelism and node assignments for each operator
Inserts sorts and partitioners as needed to ensure correct results
Defines connection topology (virtual datasets) between adjacent operators
Inserts buffer operators to prevent deadlocks
E.g., in fork-joins
Defines number of actual OS processes
Where possible, multiple operators are combined within a single OS process
to improve performance and optimize resource requirements
Job Score is used to fork processes with communication interconnects for
data, message, and control
Set $APT_STARTUP_STATUS to show each step of job startup
Set $APT_PM_SHOW_PIDS to show process IDs in DataStage log







Enterprise Edition Runtime
It is only after the job Score and processes are created that
processing begins
Startup overhead of an EE job
Job processing ends when either:
Last row of data is processed by final operator
A fatal error is encountered by any operator
Job is halted (SIGINT) by DataStage Job Control or human intervention
(e.g. DataStage Director STOP)






Viewing the Job Score



Set $APT_DUMP_SCORE to output the Score to the job log

For each job run, 2 separate Score dumps are written

First score is for the license operator

Second score entry is the real job score

To identify the Score dump, look for "main program: This step ..."
The word "Score" itself does not appear anywhere in the entry

License operator job score
Job score







Example Job Score
Job scores are divided into two
sections
Datasets
partitioning and collecting
Operators
node/operator mapping
Both sections identify sequential or
parallel processing
Why 9 Unix processes?
















Q & A










Thank You
