Module 01: Introduction Module 02: Setting Up Your DataStage Environment Module 03: Creating Parallel Jobs Module 04: Accessing Sequential Data Module 05: Platform Architecture Module 06: Combining Data Module 07: Sorting and Aggregating Data Module 08: Transforming Data Module 09: Standards and Techniques
Module 10: Accessing Relational Data Module 11: Compilation and Execution Module 12: Testing and Debugging Module 13: Metadata in Enterprise Edition Module 14: Job Control
Course Objectives
DataStage Clients and Server
Setting up the parallel environment
Importing metadata
Building DataStage jobs
Loading metadata into job stages
Accessing Sequential data
Accessing Relational data
Introducing the Parallel framework architecture
Transforming data
Sorting and aggregating data
Merging data
Configuration files
Creating job sequences
IBM WebSphere DataStage Module 01: Introduction
What is IBM WebSphere DataStage?
Design jobs for Extraction, Transformation, and Loading (ETL)
Ideal tool for data integration projects such as data warehouses, data marts, and system migrations
Import, export, create, and manage metadata for use within jobs
Schedule, run, and monitor jobs all within DataStage
Administer your DataStage development and execution environments
Create batch (controlling) jobs
DataStage Server and Clients: the DataStage Server runs on Windows or Unix; the clients run on Microsoft Windows
Client Logon
DataStage Administrator
DataStage Manager
DataStage Designer
DataStage Director
Developing in DataStage
Define global and project properties in Administrator
Import metadata into the Repository Manager Designer Repository View
Build job in Designer
Compile job in Designer
Run and monitor job in Director
DataStage Projects
DataStage Jobs
Parallel jobs
  Executed under control of the DataStage Server runtime environment
  Built-in functionality for pipeline and partitioning parallelism
  Compiled into OSH (Orchestrate Scripting Language); OSH executes Operators, which are executable C++ class instances
  Runtime monitoring in DataStage Director
Job Sequences (Batch jobs, Controlling jobs)
  Master Server jobs that kick off jobs and other activities
  Can kick off Server or Parallel jobs
  Runtime monitoring in DataStage Director
Server jobs (requires Server Edition license)
  Executed by the DataStage Server Edition
  Compiled into BASIC (interpreted pseudo-code)
  Runtime monitoring in DataStage Director
Mainframe jobs (requires Mainframe Edition license)
  Compiled into COBOL
  Executed on the mainframe, outside of DataStage
Design Elements of Parallel Jobs
Stages
  Implemented as OSH operators (pre-built components)
  Passive stages (E and L of ETL)
    Read data, write data
    E.g., Sequential File, Oracle, Peek stages
  Processor (active) stages (T of ETL)
    Transform, filter, aggregate, generate, split / merge data
    E.g., Transformer, Aggregator, Join, Sort stages
Links
  Pipes through which the data moves from stage to stage
Quiz True or False?
DataStage Designer is used to build and compile your ETL jobs
Manager is used to execute your jobs after you build them
Director is used to execute your jobs after you build them
Administrator is used to set global and project properties
Introduction to the Lab Exercises
Two types of exercises in this course:
Conceptual exercises
  Designed to reinforce a specific module's topics
  Provide hands-on experience with DataStage
  Introduced by the word Concept, e.g., Conceptual Lab 01A
Solution Development exercises
  Based on production applications
  Provide development examples
  Introduced by the word Solution, e.g., Solution Lab 05A
The Solution Development exercises are introduced and discussed in a later module
Lab Exercises Conceptual Lab 01A Install DataStage clients Test connection to the DataStage Server Install lab files
IBM WebSphere DataStage Module 02: Setting Up Your DataStage Environment
Module Objectives
Setting project properties in Administrator
Defining Environment Variables
Importing / Exporting DataStage objects in Manager
Importing Table Definitions defining sources and targets in Manager
Setting Project Properties
Project Properties Projects can be created and deleted in Administrator Each project is associated with a directory on the DataStage Server
Project properties, defaults, and environmental variables are specified in Administrator Can be overridden at the job level
Setting Project Properties To set project properties, log onto Administrator, select your project, and then click Properties
Project Properties General Tab
Environment Variables
Permissions Tab
Tracing Tab
Parallel Tab
Sequence Tab
Importing and Exporting DataStage Objects
What Is Metadata?
(Diagram: data flows from Source through a Transform step to the Target; metadata describing the source, the transform, and the target is stored in the Metadata Repository)
DataStage Manager
Manager Contents Metadata Describing sources and targets: Table definitions Describing inputs / outputs from external routines Describing inputs and outputs to BuildOp and CustomOp stages
DataStage objects Jobs Routines Compiled jobs / objects Stages
Import and Export
Any object in Manager can be exported to a file
Can export whole projects
Use for backup
Sometimes used for version control
Can be used to move DataStage objects from one project to another
Use to share DataStage jobs and projects with other developers
Export
Procedure
In Manager, click Export>DataStage Components
Select DataStage objects for export
Specify type of export: DSX: Default format XML: Enables processing of export file by XML applications, e.g., for generating reports
Specify file path on client machine
Quiz - True or False? You can export DataStage objects such as jobs, but you can't export metadata, such as field definitions of a sequential file.
Quiz - True or False? The directory to which you export is on the DataStage client machine, not on the DataStage server machine.
Exporting DataStage Objects
Select Objects for Export
Options Tab Select by folder or individual object
Import Procedure In Manager, click Import>DataStage Components, or Import>DataStage Components (XML) if you are importing an XML-format export file
Select DataStage objects for import
Importing DataStage Objects
Import Options
Importing Metadata
Metadata Import
Import format and column definitions from sequential files
Import relational table column definitions
Imported as Table Definitions
Table definitions can be loaded into job stages
Table definitions can be used to define Routine and Stage interfaces
Sequential File Import Procedure
In Manager, click Import>Table Definitions>Sequential File Definitions
Select directory containing sequential file and then the file
Select Manager category
Examine the format and column definitions and edit if necessary
Importing Sequential Metadata
Sequential Import Window
Specify Format
Specify Column Names and Types Double-click to define extended properties
Extended Properties window Property categories Available properties
Table Definition General Tab Second level category Top level category
Table Definition Columns Tab
Table Definition Parallel Tab
Table Definition Format Tab
Lab Exercises Conceptual Lab 02A Set up your DataStage environment
Conceptual Lab 02B Import a sequential file Table Definition
IBM WebSphere DataStage Module 03: Creating Parallel Jobs
Module Objectives
Design a simple Parallel job in Designer
Compile your job
Run your job in Director
View the job log
Creating Parallel Jobs
What Is a Parallel Job?
Executable DataStage program
Created in DataStage Designer Can use components from Manager Repository
Built using a graphical user interface
Compiles into Orchestrate shell language (OSH) and object code (from generated C++)
Job Development Overview Import metadata defining sources and targets Can be done within Designer or Manager
In Designer, add stages defining data extractions and loads
Add processing stages to define data transformations
Add links defining the flow of data from sources to targets
Compile the job
In Director, validate, run, and monitor your job Can also run the job in Designer Can only view the job log in Director
Designer Work Area Canvas Repository Tools Palette
Designer Toolbar Provides quick access to the main functions of Designer Show/hide metadata markers Run Job properties Compile
Tools Palette
Adding Stages and Links Drag stages from the Tools Palette to the diagram Can also be dragged from Stage Type branch to the diagram
Draw links from source to target stage Right mouse over source stage Release mouse button over target stage
Job Creation Example Sequence
Brief walkthrough of procedure
Assumes table definition of source already exists in the repository
Create New Job
Drag Stages and Links From Palette Peek Row Generator Annotation
Renaming Links and Stages
Click on a stage or link to rename it
Meaningful names have many benefits Documentation Clarity Fewer development errors
Row Generator Stage
Produces mock data for specified columns
No input links; single output link
On Properties tab, specify number of rows
On Columns tab, load or specify column definitions Click Edit Row over a column to specify the values to be generated for that column A number of algorithms for generating values are available depending on the data type
Algorithms for Integer type Random: seed, limit Cycle: Initial value, increment
Algorithms for string type: Cycle , alphabet
Algorithms for date type: Random, cycle
Inside the Row Generator Stage
Properties tab Set property value Property
Columns Tab View data Load a Table definition Select Table Definition
Extended Properties Specified properties and their values Additional properties to add
Peek Stage
Displays field values
  Displayed in the job log or sent to a file
  Skip records option
  Can control number of records to be displayed
  Shows data in each partition, labeled 0, 1, 2, ...
Useful stub stage for iterative job development Develop job to a stopping point and check the data
Peek Stage Properties Output to job log
Job Parameters
Defined in Job Properties window Makes the job more flexible Parameters can be: Used in directory and file names Used to specify property values Used in constraints and derivations Parameter values are determined at run time
When used for directory and file names and names of properties, surround with pound signs (#), e.g., #NumRows#
Job parameters can reference DataStage and system environment variables $PROJDEF $ENV
Defining a Job Parameter Parameters tab Parameter
Using a Job Parameter in a Stage Job parameter surrounded with pound signs
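As a small illustration (the parameter names and file name here are hypothetical, not taken from the course labs), a job parameter can appear anywhere a property value is typed:

  Number of Records = #NumRows#
  File = #SourceDir#/customers.txt

At run time, DataStage substitutes the current parameter values before the property is used.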
Adding Job Documentation Job Properties Short and long descriptions Shows in Manager
Annotation stage Added from the Tools Palette Display formatted text descriptions on diagram
Job Properties Documentation Documentation
Annotation Stage Properties
Compiling a Job Compile
Errors or Successful Message Highlight stage with error Click for more info
Running Jobs and Viewing the Job Log in Designer
Prerequisite to Job Execution
DataStage Director
Use to run and schedule jobs
View runtime messages
Can invoke from DataStage Manager or Designer Tools > Run Director
Run Options Stop after number of warnings Stop after number of rows
Director Log View Click the open book icon to view log messages Peek messages
Message Details
Other Director Functions
Schedule job to run on a particular date/time
Clear job log of messages
Set job log purging conditions
Set Director options Row limits Abort after x warnings
Running Jobs from Command Line
Use dsjob -run to run a job
Use dsjob -logsum to display messages in the log
Documented in the Parallel Job Advanced Developer's Guide, ch. 7
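A minimal command-line sketch (the project and job names are placeholders; authentication and path options vary by installation):

  # Run the job, pass a job parameter, and wait for the completion status
  dsjob -run -jobstatus -param NumRows=100 MyProject GenDataJob

  # Summarize the entries in the job's log
  dsjob -logsum MyProject GenDataJob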
Lab Exercises Conceptual Lab 03A Design a simple job in Designer Define a job parameter Document the job Compile Run Monitor the job in Director
IBM WebSphere DataStage Module 04: Accessing Sequential Data
Module Objectives
Understand the stages for accessing different kinds of sequential data
Sequential File stage
Data Set stage
Complex Flat File stage
Create jobs that read from and write to sequential files
Read from multiple files using file patterns
Use multiple readers
Types of Sequential Data Stages Sequential Fixed or variable length
Data Set
Complex Flat File
The Framework and Sequential Data
The EE Framework processes only datasets
For files other than datasets, such as sequential flat files, import and export operations are done Import and export OSH operators are generated by Sequential and Complex Flat File stages
During import or export DataStage performs format translations into, or out of, the EE internal format
Internally, the format of the data is described by schemas (like Table Definitions)
Using the Sequential File Stage
Both import (reading) and export (writing) of general files (text, binary) are performed by the Sequential File stage
  Data import: the external format is converted into the EE internal format
  Data export: the EE internal format is converted back to the external format
Features of Sequential File Stage
Normally executes in sequential mode
Executes in parallel when reading multiple files
Can use multiple readers within a node Reads chunks of a single file in parallel
The stage needs to be told: How file is divided into rows (record format) How row is divided into columns (column format)
File Format Example
Record delimiter: newline (nl)
Field delimiter: comma
Final delimiter: end (or comma)
Example record:  Field 1 , Field 2 , Field 3 , ... , Last field  nl
Sequential File Stage Rules
One input link
One stream output link
Optionally, one reject link. Will reject any records not matching the metadata in the column definitions. Example: you specify three columns separated by commas, but the row that's read has no commas in it
Job Design Using Sequential Stages Reject link
Sequential Source Columns Tab View data Load Table Definition Save as a new Table Definition
Input Sequential Stage Properties Output tab File to access Column names in first row Click to add more files having the same format
Format Tab Record format Column format
Reading Using a File Pattern Use wild cards Select File Pattern
Properties - Multiple Readers Multiple readers option allows you to set number of readers per node
Sequential Stage As a Target Input Tab Append / Overwrite
Reject Link Reject mode = Continue: Continue reading records Fail: Abort job Output: Send down output link In a source stage All records not matching the metadata (column definitions) are rejected In a target stage All records that fail to be written for any reason
Rejected records consist of one column, datatype = raw
Reject mode property
Inside the Copy Stage Column mappings
DataSet Stage
Data Set
Operating system (Framework) file
Preserves partitioning Component dataset files are written to on each partition
Suffixed by .ds
Referred to by a header file
Managed by Data Set Management utility from GUI (Manager, Designer, Director)
Represents persistent data
Key to good performance in set of linked jobs No import / export conversions are needed No repartitioning needed
Persistent Datasets
Accessed using DataSet Stage.
Two parts:
Descriptor file (e.g., input.ds): contains the metadata (schema, e.g., record ( partno: int32; description: string; )) and the data location, but NOT the data itself
Data file(s): contain the data; multiple Unix files (one per node, e.g., node1:/local/disk1/, node2:/local/disk2/), accessible in parallel
Data Translation Occurs on import From sequential files or file sets From RDBMS
Occurs on export From datasets to file sets or sequential files From datasets to RDBMS
DataStage engine is most efficient when processing internally formatted records (i.e. datasets)
File Set Stage
Can read or write file sets
Files suffixed by .fs
File set consists of: Descriptor file contains location of raw data files + metadata Individual raw data files
Can be processed in parallel
Similar to a dataset. Main difference is that file sets are not in the internal format and are therefore more accessible to external applications
File Set Stage Example Descriptor file
Lab Exercises Conceptual Lab 04A Read and write to a sequential file Create reject links Create a data set
Conceptual Lab 04B Read multiple files using a file path
Conceptual Lab 04C Read a file using multiple readers
DataStage Data Types
Standard types Complex types
Char
VarChar
Integer
Decimal (Numeric)
Floating point
Date
Time
Timestamp
VarBinary (raw)
Vector (array, occurs) Subrecord (group)
Standard Types Char Fixed length string VarChar Variable length string Specify maximum length Integer Decimal (Numeric) Precision (length including numbers after the decimal point) Scale (number of digits after the decimal point) Floating point Date
Complex Data Types Vector A one-dimensional array Elements are numbered 0 to n Elements can be of any single type All elements must have the same type Can have fixed or variable number of elements Subrecord A group or structure of elements Elements of the subrecord can be of any type Subrecords can be embedded
Schema With Complex Types subrecord vector
Table Definition with complex types
Authors is a subrecord
Books is a vector of 3 strings of length 5
Complex Types Column Definitions subrecord Elements of subrecord Vector
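As a rough sketch of the equivalent schema notation for the Table Definition described above (the Title field and the exact lengths inside the subrecord are illustrative assumptions):

  record (
    Title: string[30];
    Authors: subrec (
      FirstName: string[15];
      LastName: string[20];
    );
    Books[3]: string[5];
  )

Here Authors is a subrecord (group) and Books is a fixed-length vector of 3 strings of length 5, matching the column definitions shown on the Columns tab.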
Reading and Writing Complex Data Complex Flat File target stage Complex Flat File source stage
Importing Cobol Copybooks Click Import>Table Definitions>COBOL File Definitions to begin the import
Each level 01 item begins a Table Definition
Specify position of level 01 items
Level 01 start position Path to copybook file Where to store the Table Definition
Reading and Writing NULL Values
Working with NULLs Internally, NULL is represented by a special value outside the range of any existing, legitimate values
If NULL is written to a non-nullable column, the job will abort
Columns can be specified as nullable NULLs can be written to nullable columns
You must handle NULLs written to non-nullable columns in a Sequential File stage You need to tell DataStage what value to write to the file Unhandled rows are rejected
In a Sequential source stage, you can specify values you want DataStage to convert to NULLs
Specifying a Value for NULL Nullable column Added property
Managing DataSets
Managing DataSets
GUI (Manager, Designer, Director): Tools > Data Set Management
Data set management from the system command line:
orchadmin
  Unix command-line utility
  List records
  Remove datasets (removes all component files, not just the descriptor/header file)
dsrecords
  Lists the number of records in a dataset
Displaying Data and Schema Display data Schema
Dsrecords Gives record count Unix command-line utility $ dsrecords ds_name E.g., $ dsrecords myDS.ds 156999 records Orchadmin Manages EE persistent data sets Unix command-line utility E.g., $ orchadmin delete myDataSet.ds
Manage Datasets from the System Command Line
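A few more command-line sketches (subcommand availability and options can vary by release, and $APT_CONFIG_FILE must point to a valid configuration file when these run):

  # Describe a data set (schema, component files)
  orchadmin describe myDS.ds

  # Print the records of a data set to the terminal
  orchadmin dump myDS.ds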
Lab Exercises Conceptual Lab 04D Use the dsrecords utility Use Data Set Management tool
Conceptual Lab 04E Reading and Writing NULLs
IBM WebSphere DataStage Module 05: Platform Architecture
Module Objectives
Parallel processing architecture
Pipeline parallelism
Partition parallelism
Partitioning and collecting
Configuration files
Key EE Concepts Parallel processing: Executing the job on multiple CPUs
Scalable processing: Add more resources (CPUs and disks) to increase system performance
Example system: 6 CPUs (processing nodes) and disks Scale up by adding more CPUs Add CPUs as individual nodes or to an SMP system
Scalable Hardware Environments
SMP Multi-CPU (2-64+) Shared memory & disk GRID / Clusters Multiple, multi-CPU systems Dedicated memory per node Typically SAN-based shared storage
Pipeline Parallelism: like a conveyor belt moving rows from process to process; start the downstream process while the upstream process is still running
Advantages: Reduces disk usage for staging areas Keeps processors busy
Still has limits on scalability
Partition Parallelism Divide the incoming stream of data into subsets to be separately processed by an operation Subsets are called partitions (nodes) Each partition of data is processed by the same operation E.g., if operation is Filter, each partition will be filtered in exactly the same way Facilitates near-linear scalability 8 times faster on 8 processors 24 times faster on 24 processors This assumes the data is evenly distributed
Three-Node Partitioning
(Diagram: the Data is divided into subset1, subset2, and subset3; Node 1, Node 2, and Node 3 each run the same Operation on one subset)
Here the data is partitioned into three partitions The operation is performed on each partition of data separately and in parallel If the data is evenly distributed, the data will be processed three times faster
EE Combines Partitioning and Pipelining Within EE, pipelining, partitioning, and repartitioning are automatic Job developer only identifies:
Sequential vs. Parallel operations (by stage) Method of data partitioning Configuration file (which identifies resources) Advanced stage options (buffer tuning, operator combining, etc.)
Job Design v. Execution: the user assembles the flow using DataStage Designer; at runtime, the job runs in parallel for any configuration (1 node, 4 nodes, N nodes) with no need to modify or recompile the job design!
Configuration File Configuration file separates configuration (hardware / software) from job design Specified per job at runtime by $APT_CONFIG_FILE Change hardware and resources without changing job design
Defines number of nodes (logical processing units) with their resources (need not match physical CPUs) Dataset, Scratch, Buffer disk (file systems) Optional resources (Database, SAS, etc.) Advanced resource optimizations Pools (named subsets of nodes)
Multiple configuration files can be used at runtime Optimizes overall throughput and matches job characteristics to overall hardware resources Allows runtime constraints on resource usage on a per job basis
Example Configuration File: key points
1. Number of nodes defined
2. Resources assigned to each node; their order is significant
3. Advanced resource optimizations and configuration (named pools, database, SAS)
{ node "n1" { fastname "s1" pool "" "n1" "s1" "app2" "sort" resource disk "/orch/n1/d1" {} resource disk "/orch/n1/d2" {"bigdata"} resource scratchdisk "/temp" {"sort"} } node "n2" { fastname "s2" pool "" "n2" "s2" "app1" resource disk "/orch/n2/d1" {} resource disk "/orch/n2/d2" {"bigdata"} resource scratchdisk "/temp" {} } node "n3" { fastname "s3" pool "" "n3" "s3" "app1" resource disk "/orch/n3/d1" {} resource scratchdisk "/temp" {} } node "n4" { fastname "s4" pool "" "n4" "s4" "app1" resource disk "/orch/n4/d1" {} resource scratchdisk "/temp" {} } }
Partitioning and Collecting
Partitioning and Collecting
Partitioning breaks incoming rows into sets (partitions) of rows
Each partition of rows is processed separately by the stage/operator If the hardware and configuration file supports parallel processing, partitions of rows will be processed in parallel
Collecting returns partitioned data back to a single stream
Partitioning / Collecting occurs on stage Input links
Partitioning / Collecting is implemented automatically Based on stage and stage properties How the data is partitioned / collected can be specified
Partitioning / Collecting Algorithms
Partitioning algorithms include:
  Round robin
  Hash: determine the partition based on key value; requires key specification
  Entire: send all rows down all partitions
  Same: preserve the existing partitioning
  Auto: let DataStage choose the algorithm
Collecting algorithms include:
  Round robin
  Sort Merge: read in by key; presumes data is sorted by the key in each partition; builds a single sorted stream based on the key
  Ordered: read all records from the first partition, then the second, and so on
Keyless vs. Keyed Partitioning Algorithms
Keyless: rows are distributed independently of data values
  Round Robin
  Entire
  Same
Keyed: rows are distributed based on values in the specified key
  Hash: partition based on key. Example: key is State; all CA rows go into the same partition, all MA rows go into the same partition; two rows with the same state never go into different partitions
  Modulus: partition based on the modulus of the key divided by the number of partitions; the key is a numeric type. Example: key is OrderNumber (numeric); rows with the same order number will all go into the same partition
  DB2: matches DB2 EEE partitioning
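For example, with Modulus partitioning on a 4-node configuration, the partition number is simply the key value modulo the number of partitions:

  partition = OrderNumber mod 4
  OrderNumber 1001 -> partition 1
  OrderNumber 1002 -> partition 2
  OrderNumber 1006 -> partition 2   (1006 mod 4 = 2, so it lands with 1002)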
Partitioning Requirements for Related Records Misplaced records Using Aggregator stage to sum customer sales by customer number If there are 25 customers, 25 records should be output But suppose records with the same customer numbers are spread across partitions This will produce more than 25 groups (records) Solution: Use hash partitioning algorithm Partition imbalances Peek stage shows number of records going down each partition
Unequal Distribution Example
Hash on LName, with a 2-node config file: the same key values are assigned to the same partition.

Source Data:
ID  LName  FName    Address
1   Ford   Henry    66 Edison Avenue
2   Ford   Clara    66 Edison Avenue
3   Ford   Edsel    7900 Jefferson
4   Ford   Eleanor  7900 Jefferson
5   Dodge  Horace   17840 Jefferson
6   Dodge  John     75 Boston Boulevard
7   Ford   Henry    4901 Evergreen
8   Ford   Clara    4901 Evergreen
9   Ford   Edsel    1100 Lakeshore
10  Ford   Eleanor  1100 Lakeshore

Partition 0:
ID  LName  FName    Address
5   Dodge  Horace   17840 Jefferson
6   Dodge  John     75 Boston Boulevard

Partition 1:
ID  LName  FName    Address
1   Ford   Henry    66 Edison Avenue
2   Ford   Clara    66 Edison Avenue
3   Ford   Edsel    7900 Jefferson
4   Ford   Eleanor  7900 Jefferson
7   Ford   Henry    4901 Evergreen
8   Ford   Clara    4901 Evergreen
9   Ford   Edsel    1100 Lakeshore
10  Ford   Eleanor  1100 Lakeshore
Partitioning / Collecting Link Icons Partitioning icon Collecting icon
More Partitioning Icons: fan-out (Sequential to Parallel), SAME partitioner, AUTO partitioner, Re-partition (watch for this!)
Quiz True or False? Everything that has been data-partitioned must be collected in same job
Data Set Stage Is the data partitioned?
Introduction to the Solution Development Exercises
Solution Development Jobs
Series of 4 jobs extracted from production jobs
Use a variety of stages in interesting, realistic configurations Sort, Aggregator stages Join, lookup stage Peek, Filter stages Modify stage Oracle stage
Contain useful techniques Use of Peeks Datasets used to connect jobs Use of project environment variables in job parameters Fork Joins Lookups for auditing
Warehouse Job 01
Glimpse Into the Sort Stage Algorithms Sort key to add
Copy Stage With Multiple Output Links Select output link
Filter Stage
Used with Peek stage to select a portion of data for checking
On Properties tab, specify a Where clause to filter the data
On Mapping tab, map input columns to output columns
Setting the Filtering Condition Filtering condition
Warehouse Job 02
Warehouse Job 03
Warehouse Job 04
Warehouse Job 02 With Lookup
Lab Exercises Conceptual Lab 05A Experiment with partitioning / collecting Solution Lab 05B (Build Warehouse_01 Job) Add environment variables as job parameters Read multiple sequential files Use the Sort stage Use Filter and Peek stages Write to a DataSet stage
IBM WebSphere DataStage Module 06: Combining Data
Module Objectives
Combine data using the Lookup stage
Combine data using Merge stage
Combine data using the Join stage
Combine data using the Funnel stage
Ways to combine data: Horizontally: Multiple input links One output link made of columns from different input links. Joins Lookup Merge
Vertically: One input link, one output link combining groups of related records into a single record Aggregator Remove Duplicates
Funneling: Multiple input streams funneled into a single output stream Funnel stage
Combining Data
Lookup, Merge, Join Stages These stages combine two or more input links Data is combined by designated "key" column(s)
These stages differ mainly in: Memory usage Treatment of rows with unmatched key values Input requirements (sorted, de-duplicated)
Not All Links Are Created Equal
DataStage distinguishes between:
  The Primary input (Framework port 0)
  Secondary inputs (other Framework ports), in some cases called "Reference" inputs
Conventions:
                              Joins   Lookup            Merge
  Primary input (port 0):     Left    Source            Master
  Secondary inputs (1, ...):  Right   Lookup table(s)   Update(s)
Tip: check the "Link Ordering" tab to make sure the intended Primary is listed first
Lookup Stage
Lookup Features
One Stream Input link (Source)
Multiple Reference links (Lookup files)
One output link
Optional Reject link Only one per Lookup stage, regardless of number of reference links
Hash tables are built in memory from the lookup files Indexed by key Should be small enough to fit into physical memory
The Lookup Stage Uses one or more key columns as an index into a table Usually contains other values associated with each key.
The lookup table is created in memory before any lookup source rows are processed
Lookup table
Lookup table (built in memory, indexed by the key column):
  Index (key)   Associated value
  SC            South Carolina
  SD            South Dakota
  TN            Tennessee
  TX            Texas
  UT            Utah
  VT            Vermont
Key column of source: state_code (e.g., TN)
Lookup from Sequential File Example Reference link Driver (Source) link (lookup table)
Lookup Key Column in Sequential File
Lookup key
Lookup Stage Mappings Source link Reference link Derivation for lookup key
Handling Lookup Failures Select action
Lookup Failure Actions If the lookup fails to find a matching key column, one of these actions can be taken: fail: the lookup Stage reports an error and the job fails immediately. This is the default.
drop: the input row with the failed lookup(s) is dropped
continue: the input row is transferred to the output, together with the successful table entries. The failed table entry(s) are not transferred, resulting in either default output values or null output values.
reject: the input row with the failed lookup(s) is transferred to a second output link, the "reject" link.
There is no option to capture unused table entries Compare with the Join and Merge stages
Lookup Stage Behavior
We shall first use the simplest case, with optimal input:
Two input links: "Source" as primary, "Lookup" as secondary, both sorted on the key column (here "Citizen"), without duplicates on the key
  Source link (primary input)
  Lookup link (secondary input)
Output of Lookup with the continue option on key Citizen: same output as outer join and merge/keep (missing values become empty string or NULL)
Output of Lookup with the drop option on key Citizen: same output as inner join and merge/drop
Lookup Tables should be small enough to fit into physical memory
On an MPP you should partition the lookup tables using the Entire partitioning method, or partition them by the same hash key as the source link. Entire results in multiple copies (one for each partition).
On an SMP, choose Entire or accept the default (which is Entire). Entire does not result in multiple copies because memory is shared.
Join Stage
The Join Stage Four types:
Inner, Left outer, Right outer, Full outer
2 or more sorted input links, 1 output link
  "left" on the primary input, "right" on the secondary input
Pre-sort makes joins "lightweight": few rows need to be in RAM
Follows the RDBMS-style relational model
  Cross-products in case of duplicates
  Matching entries are reusable for multiple matches
  Non-matching entries can be captured (Left, Right, Full)
  No fail/reject option for missed matches
Join Stage Editor (link order is immaterial for Inner and Full Outer joins, but very important for Left/Right Outer joins). One of four variants:
Inner Left Outer Right Outer Full Outer Multiple key columns allowed
Join Stage Behavior
We shall first use the simplest case, with optimal input:
Two input links: "left" as primary, "right" as secondary, both sorted on the key column (here "Citizen"), without duplicates on the key
  Left link (primary input)
  Right link (secondary input)
Left Outer Join: transfers all values from the left link and transfers values from the right link only where key columns match. Same output as lookup/continue and merge/keep.
Revolution   Citizen        Exchange
1789         Lefty          (empty or NULL)
1776         M_B_Dextrous   Nasdaq
Left Outer Join: check the Link Ordering tab to make sure the intended Primary is listed first
Right Outer Join Transfers all values from the right link and transfers values from the left link only where key columns match.
Revolution   Citizen        Exchange
1776         M_B_Dextrous   Nasdaq
Null or 0    Righty         NYSE
Full Outer Join Transfers rows from both data sets, whose key columns contain equal values, to the output link.
It also transfers rows, whose key columns contain unequal values, from both input links to the output link.
The Merge Stage
Unmatched updates in input port n can be captured in output port n
Lightweight
Inputs: one Master (port 0) and one or more Updates (ports 1, 2, ...)
Outputs: the merged Output (port 0) plus Merge Rejects (one per Update link)
Merge Stage Editor
Unmatched Master rows (one of two options):
  Keep [default]
  Drop
  (Capture in reject link is NOT an option)
Unmatched Update rows option:
  Capture in reject link(s); implemented by adding outgoing reject links
Comparison: Joins, Lookup, Merge

                                    Joins                              Lookup                                Merge
Model                               RDBMS-style relational             Source - in-RAM LU Table              Master - Update(s)
Memory usage                        light                              heavy                                 light
Number and names of inputs          2 or more: left, right             1 Source, N LU Tables                 1 Master, N Update(s)
Mandatory input sort                all inputs                         no                                    all inputs
Duplicates in primary input         OK (x-product)                     OK                                    Warning!
Duplicates in secondary input(s)    OK (x-product)                     Warning!                              OK only when N = 1
Options on unmatched primary        Keep (left outer), Drop (inner)    [fail] | continue | drop | reject     [keep] | drop
Options on unmatched secondary      Keep (right outer), Drop (inner)   NONE                                  capture in reject set(s)
On match, secondary entries are     captured                           captured                              consumed
Number of outputs                   1                                  1 output, (1 reject)                  1 output, (N rejects)
Captured in reject set(s)           nothing (N/A)                      unmatched primary entries             unmatched secondary entries
What is a Funnel Stage? A processing stage that combines data from multiple input links to a single output link
Useful to combine data from several identical data sources into a single large dataset
Operates in three modes: Continuous, Sort Funnel, Sequence
Three Funnel modes
Continuous: Combines the records of the input links in no guaranteed order. It takes one record from each input link in turn. If data is not available on an input link, the stage skips to the next link rather than waiting. Does not attempt to impose any order on the data it is processing.
Sort Funnel: Combines the input records in the order defined by the value(s) of one or more key columns and the order of the output records is determined by these sorting keys.
Sequence: Copies all records from the first input link to the output link, then all the records from the second input link and so on.
Sort Funnel Method
Produces a sorted output (assuming the input links are all sorted on the key)
Data from all input links must be sorted on the same key column
Typically data from all input links is hash partitioned before it is sorted; selecting the Auto partition type on the Input Partitioning tab defaults to this
Hash partitioning guarantees that all records with the same key column values are located in the same partition and are processed on the same node
Allows for multiple key columns: 1 primary key column, n secondary key columns
The Funnel stage first examines the primary key in each input record; for multiple records with the same primary key value, it then examines the secondary keys to determine the order of the records it will output
Funnel Stage Example
Funnel Stage Properties
Lab Exercises Conceptual Lab 06A Use a Lookup stage Handle lookup failures Use a Merge stage Use a Join stage Use a Funnel stage
Solution Lab 06B (Build Warehouse_02 Job) Use a Join stage
IBM WebSphere DataStage Module 07: Sorting and Aggregating Data
Module Objectives
Sort data using in-stage sorts and Sort stage
Combine data using the Aggregator stage
Combine data using the Remove Duplicates stage
Sort Stage
Sorting Data Uses Some stages require sorted input Join, merge stages require sorted input Some stages use less memory with sorted input E.g., Aggregator
Sorts can be done: Within stages On input link Partitioning tab, set partitioning to anything other than Auto In a separate Sort stage Makes sort more visible on diagram Has more options
Sorting Alternatives Sort stage Sort within stage
In-Stage Sorting (Partitioning tab)
  Do sorts
  Sort key
  Preserve non-key ordering
  Remove dups
  Partitioning type can't be Auto when sorting
Sort Stage Sort key Sort options
Sort keys
Add one or more keys
Specify the sort mode for each key
  Sort: sort by this key
  Don't sort (previously sorted): assume the data has already been sorted by this key; continue sorting by any secondary keys
Specify sort order: ascending / descending
Specify case sensitive or not
Sort Options
Sort Utility
  DataStage: the default
  Unix: don't use; slower than the DataStage sort utility
Stable
Allow duplicates
Memory usage
  Sorting takes advantage of the available memory for increased performance; uses disk if necessary
  Increasing the amount of memory can improve performance
Create key change column (example below)
  Adds a column with a value of 1 / 0
  1 indicates that the key value has changed; 0 means that the key value hasn't changed
  Useful for processing groups of rows in a Transformer
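For example (illustrative key values), a key change column on data sorted by CustID would be populated like this:

  CustID   keyChange
  100      1
  100      0
  100      0
  200      1
  300      1
  300      0

A downstream Transformer can then test keyChange = 1 to detect the start of each group.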
Sort Stage Mapping Tab
Partitioning vs. Sorting Keys
Partitioning keys are often different from sorting keys
  Keyed partitioning (e.g., Hash) is used to group related records into the same partition
  Sort keys are used to establish order within each partition
For example, partition on HouseHoldID, sort on HouseHoldID, PaymentDate
This is important when removing duplicates: sorting within each partition is used to establish order for duplicate retention (first or last in the group)
Aggregator Stage
Aggregator Stage
Purpose: perform data aggregations
Specify:
  Zero or more key columns that define the aggregation units (or groups)
  Columns to be aggregated
  Aggregation functions, which include (among many others): count (nulls/non-nulls), sum, max / min / range
The grouping method (hash table or pre-sort) is a performance issue
Job with Aggregator Stage Aggregator stage
Aggregator Stage Properties Group columns Group method Aggregation functions
Aggregator Functions Aggregation type = Count rows Count rows in each group Put result in a specified output column
Aggregation type = Calculation Select column Put result of calculation in a specified output column Calculations include:
Sum Count Min, max Mean Missing value count Non-missing value count Percent coefficient of variation
Grouping Methods
Hash (default)
  Intermediate results for each group are stored in a hash table
  Final results are written out after all input has been processed
  No sort required
  Use when the number of unique groups is small
  The running tally for each group's aggregate calculations needs to fit into memory; requires about 1 KB of RAM per group
  E.g., average family income by state requires about 0.05 MB of RAM (see the worked estimate below)
Sort
  Only a single aggregation group is kept in memory; when a new group is seen, the current group is written out
  Requires input to be sorted by the grouping keys
  Can handle an unlimited number of groups
  Example: average daily balance by credit card
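The memory estimate for the Hash method above is simple arithmetic:

  ~50 states x ~1 KB per group ≈ 50 KB ≈ 0.05 MB

so a small number of groups fits comfortably in memory, while millions of groups (e.g., one per credit card) would not, which is why the Sort method is used in that case.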
Aggregation Types
Calculation types
Remove Duplicates Stage
Removing Duplicates
Can be done by the Sort stage
  Use the Unique option
  No choice on which duplicate to keep: a stable sort always retains the first row in the group; a non-stable sort is indeterminate
OR
Remove Duplicates stage
  Has more sophisticated ways to remove duplicates
  Can choose to retain the first or last duplicate
Remove Duplicates Stage Properties Key that defines duplicates Retain first or last duplicate
Lab Exercises Solution Development Lab 07A Use Sort stage Use Aggregator stage Use RemoveDuplicates stage (Build Warehouse_03 job)
IBM WebSphere DataStage Module 08: Transforming Data
Module Objectives
Understand ways DataStage allows you to transform data
Use this understanding to: Create column derivations using user-defined code and system functions Filter records based on business criteria Control data flow based on data conditions
Transformed Data
Derivations may include incoming fields or parts of incoming fields
Derivations may reference system variables and constants
Functions frequently performed on incoming values include: date and time, mathematical, logical, null handling, and more
Stages Review Stages that can transform data Transformer Modify Aggregator Stages that do not transform data File stages: Sequential, Dataset, Peek, etc. Sort Remove Duplicates Copy Filter Funnel
Transformer Stage
Column mappings Derivations Written in Basic Final compiled code is C++ generated object code Constraints Filter data Direct data down different output links For different processing or storage Expressions for constraints and derivations can reference Input columns Job parameters Functions System variables and constants Stage variables External routines
Transformer Stage Uses Transformer with multiple outputs Control data flow Constrain data Direct data
Defining a Derivation Input column String in quotes Concatenation operator (:)
IF THEN ELSE Derivation
Use IF THEN ELSE to conditionally derive a value
Format: IF <condition> THEN <expression1> ELSE <expression2>. If the condition evaluates to true, the result of expression1 is copied to the target column or stage variable. If the condition evaluates to false, the result of expression2 is copied to the target column or stage variable.
Example: Suppose the source column is named In.OrderID and the target column is named Out.OrderID. To replace In.OrderID values of 3000 by 4000: IF In.OrderID = 3000 THEN 4000 ELSE In.OrderID
String Functions and Operators
Substring operator
  Format: String[loc, length]
  Example: suppose In.Description contains the string "Orange Juice"; In.Description[8,5] returns "Juice"
UpCase(<string>) / DownCase(<string>) Example: UpCase(In.Description) ORANGE JUICE
Len(<string>) Example: Len(In.Description) 12
Checking for NULLs Nulls can be introduced into the data flow from lookups Mismatches (lookup failures) can produce nulls
Can be handled in constraints, derivations, stage variables, or a combination of these
NULL functions Testing for NULL
IsNull(<column>) IsNotNull(<column>) Replace NULL with a value NullToValue(<column>, <value>) Set to NULL: SetNull() Example: IF In.Col = 5 THEN SetNull() ELSE In.Col
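A small derivation sketch (the column and link names are illustrative) for replacing a possibly-null lookup result before it reaches a non-nullable target column:

  IF IsNull(lnk_lookup.Region) THEN "UNKNOWN" ELSE lnk_lookup.Region

  NullToValue(lnk_lookup.Region, "UNKNOWN")

Both expressions yield the same result here; the second form is simply more compact.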
Transformer Functions
Date & Time
Logical
Null Handling
Number
String
Type Conversion
Transformer Execution Order
Derivations in stage variables are executed first
Constraints are executed before derivations
Column derivations in earlier links are executed before later links
Derivations in higher columns are executed before lower columns
Transformer Stage Variables
Derivations execute in order from top to bottom
Later stage variables can reference earlier stage variables
Earlier stage variables can reference later stage variables; these will contain a value derived from the previous row that came into the Transformer
Multi-purpose Counters Store values from previous rows to make comparisons Store derived values to be used in multiple target field derivations Can be used to control execution of constraints
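A common sketch of this pattern (names are illustrative; the variables are evaluated top to bottom for each row, so svPrevCustID still holds the previous row's value when the first two derivations run):

  svIsNewGroup   IF In.CustID <> svPrevCustID THEN 1 ELSE 0
  svGroupCount   IF svIsNewGroup = 1 THEN 1 ELSE svGroupCount + 1
  svPrevCustID   In.CustID

svIsNewGroup can then drive a constraint or an output derivation, and svGroupCount numbers the rows within each group.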
Stage Variables Toggle Show/Hide button
Transformer Reject Links Reject link Convert link to a Reject link
Otherwise Link Otherwise link
Defining an Otherwise Link Check to create otherwise link Can specify abort condition
Specifying Link Ordering Link ordering toolbar icon Last in order
Transformer Stage Tips
Suggestions:
  Include reject links
  Test for NULL values before using a column in a function
  Use RCP (Runtime Column Propagation); map columns that have derivations (not just copies). More on RCP later.
  Be aware of column and stage variable data types; often developers do not pay attention to stage variable types
  Avoid type conversions; try to maintain the data type as imported
Modify Stage
Modify Stage
Modify column types
Perform some types of derivations Null handling Date / time handling String handling
Add or drop columns
Job With Modify Stage Modify stage
Specifying a Column Conversion
Derivation / Conversion New column Specification property
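A rough sketch of entries that might appear in the Specification property (function names and exact syntax are assumptions; check the Parallel Job Developer's Guide for your release):

  DROP UnusedColumn
  CustomerID = CustID
  Balance = handle_null(Balance, 0)

The first line drops a column, the second renames CustID to CustomerID, and the third replaces NULLs in Balance with 0.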
Lab Exercises Conceptual Lab 08A Add a Transformer to a job Define a constraint Work with null values Define a rejects link Define a stage variable Define a derivation
IBM WebSphere DataStage Module 09: Standards and Techniques
Module Objectives
Establish standard techniques for Parallel job development Job documentation Naming conventions for jobs, links, and stages Iterative job design Useful stages for job development Using configuration files for development Using environmental variables Job parameters Containers
Job Presentation
Document using the Annotation stage
Job Properties Documentation Organize jobs into categories Description is displayed in Manager and MetaStage
Naming Conventions Stages named after the Data they access Function they perform DO NOT leave default stage names like Sequential_File_0 One possible convention: Use 2-character prefixes to indicate stage type, e.g.,
SF_ for Sequential File stage DS_ for Dataset stage CP_ for Copy stage Links named for the data they carry DO NOT leave default link names like DSLink3 One possible convention: Prefix all link names with lnk_ Name links after the data flowing through them
Stage and Link Names
Name stages and links for the data they handle
Iterative Job Design
Use Copy and Peek stages as stubs
Test job in phases Small sections first, then increasing in complexity
Use Peek stage to examine records Check data at various locations Check before and after processing stages
Copy Stage Stub Example Copy stage
Copy Stage Example
With 1 link in and 1 link out, the Copy stage is the ultimate "no-op" (place-holder)
Operations can be placed on its links:
  Input link (Partitioning tab): Partitioners, Sort, Remove Duplicates
  Output link (Mapping page): Rename, Drop column
Sometimes replaces the Transformer: Rename, Drop, implicit type conversions
Other uses: link constraints, breaking up a schema
Developing Jobs
1. Keep it simple
   a) Jobs with many stages are hard to debug and maintain
2. Start small and build to the final solution
   a) Use view data, Copy, and Peek
   b) Start from the source and work out
   c) Develop with a 1-node configuration file (a minimal example follows this list)
3. Solve the business problem before the performance problem
   a) Don't worry too much about partitioning until the sequential flow works as expected
4. If you land data in order to break complex jobs into smaller sets of jobs for restartability or maintainability, use persistent datasets
   a) Retains partitioning and internal data types
   b) This is true only as long as you don't need to read the data outside of DataStage
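A minimal one-node configuration file for development might look like the sketch below (the hostname and directory paths are placeholders; follow the resource layout your administrator has set up):

  {
    node "node1" {
      fastname "devserver"
      pool ""
      resource disk "/orch/dev/d1" {}
      resource scratchdisk "/orch/dev/temp" {}
    }
  }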
Final Result
Good Things to Have in each Job
Job parameters
Useful environment variables to add to the job parameters:
  $APT_DUMP_SCORE: report the OSH score to the message log
  $APT_CONFIG_FILE: establishes the runtime configuration for the EE engine and the degree of parallelization
Reusable Job Components
Use Shared Containers for repeatedly used components
A Shared Container creates a reusable object that many jobs within the project can include
Creating a Container
Create a job, select the portion to containerize, then click Edit > Construct Container > Local or Shared
Lab Exercises Conceptual Lab 07A Apply best practices when naming links and stages
IBM WebSphere DataStage Module 10: Accessing Relational Data
Module Objectives
Understand how DataStage jobs read and write records to RDBMS tables
Import relational table definitions
Read from and write to database tables Use database tables to lookup data
Parallel Database Connectivity
Traditional client-server:
  Client load
  Only the RDBMS is running in parallel
  Each application has only one connection
  Suitable only for small data volumes
Enterprise Edition:
  Parallel server runs the applications
  Application has parallel connections to the RDBMS
  Suitable for large data volumes
  Higher levels of integration possible
Importing Table Definitions Can import using ODBC or using Orchestrate schema definitions Orchestrate schema imports are better because the data types are more accurate
Support for standard SQL syntax for specifying: SELECT clause list WHERE clause filter condition INSERT / UPDATE Supports user-defined queries
Native Parallel RDBMS Stages
DB2/UDB Enterprise
Informix Enterprise
Oracle Enterprise
Teradata Enterprise
ODBC Enterprise SQL Server Enterprise
RDBMS Usage As a source Extract data from table (stream link) Read methods include: Table, Generated SQL SELECT, or User- defined SQL User-defined can perform joins, access views Lookup (reference link)
Normal lookup is memory-based (all table data read into memory) Can perform one lookup at a time in DBMS (sparse option) Continue/drop/fail options As a target Inserts Upserts (Inserts and updates) Loader
DB2 Enterprise Stage Source Auto-generated SELECT Connection information Job example
Sourcing with User-Defined SQL User-defined read method Columns in SQL must match definitions on Columns tab
DBMS Source Lookup Reference link
DBMS as a Target
Write Methods
  Delete
  Load: uses the database load utility
  Upsert: INSERT followed by an UPDATE
  Write (DB2): INSERT
Write modes
  Truncate: empty the table before writing
  Create: create a new table
  Replace: drop the existing table (if it exists), then create a new one
  Append
DB2 Stage Target Properties SQL INSERT Drop table and create Database specified by job parameter Optional CLOSE command
Generated OSH Primer
Comment blocks introduce each operator
Operator order is determined by the order stages were added to the canvas
OSH uses the familiar syntax of the UNIX shell:
  Operator name
  Schema
  Operator options ("-name value" format)
  Input (indicated by n< where n is the input #)
  Output (indicated by n> where n is the output #); may include modify
For every operator, input and/or output datasets are numbered sequentially starting from 0. E.g.: op1 0> dst op1 1< src Virtual datasets are generated to connect operators
Generated OSH for the first 2 stages (the virtual dataset, here 'Row_Generator_0:lnk_gen.v', is used to connect the output of one operator to the input of the next):

  ## General options
  [ident('Row_Generator_0'); jobmon_ident('Row_Generator_0')]
  ## Outputs
  0> [] 'Row_Generator_0:lnk_gen.v'
  ;
  #### STAGE: SortSt
  tsort
    -key 'a'
      -asc
Framework vs. DataStage Terminology

  Framework                  DataStage
  schema                     table definition
  property                   format
  type                       SQL type and length
  virtual dataset            link
  record / field             row / column
  operator                   stage
  step, flow, OSH command    job
DS Parallel Engine GUI uses both terminologies Log messages (info, warnings, errors) use Framework terminology
Elements of a Framework Program
Operators
Virtual datasets: sets of rows processed by the Framework
Schema: data description (metadata) for datasets and links
Enterprise Edition Runtime Architecture
Enterprise Edition Job Startup
Generated OSH and the configuration file are used to compose a job Score (think of "Score" as in musical score, not game score)
  Similar to the way an RDBMS builds a query optimization plan
  Identifies degree of parallelism and node assignments for each operator
  Inserts sorts and partitioners as needed to ensure correct results
  Defines connection topology (virtual datasets) between adjacent operators
  Inserts buffer operators to prevent deadlocks, e.g., in fork-joins
  Defines the number of actual OS processes; where possible, multiple operators are combined within a single OS process to improve performance and optimize resource requirements
The Job Score is used to fork processes with communication interconnects for data, message, and control
Set $APT_STARTUP_STATUS to show each step of job startup
Set $APT_PM_SHOW_PIDS to show process IDs in the DataStage log
Enterprise Edition Runtime
It is only after the job Score and processes are created that processing begins; this is the startup overhead of an EE job
Job processing ends when either:
  The last row of data is processed by the final operator
  A fatal error is encountered by any operator
  The job is halted (SIGINT) by DataStage Job Control or by human intervention (e.g., DataStage Director STOP)
Viewing the Job Score
Set $APT_DUMP_SCORE to output the Score to the job log
For each job run, 2 separate Score dumps are written
First score is for the license operator
Second score entry is the real job score
To identify the Score dump, look for "main program: This step ..."; the word "Score" does not actually appear anywhere in the dump
License operator job score Job score
Example Job Score Job scores are divided into two sections Datasets partitioning and collecting Operators node/operator mapping Both sections identify sequential or parallel processing Why 9 Unix processes?