
DataStage – Some More Stages

Data Set Stage

• Used to create, write to (append/overwrite) and read from Data Sets
• Data Sets:
• Persistent data that captures data exactly as it is carried along DataStage links
• An operating system file in a DataStage-specific format
• Cannot be shared with other software packages
• Used to pass data between DataStage jobs
• Preserves partitioning
• Component dataset files are written on each partition (modeled in the sketch below)
• Key to good performance in a set of linked jobs
• No import/export conversions are needed
• No repartitioning needed
• The DataStage engine is most efficient when processing internally formatted records (i.e. datasets)
• Can be managed with the Data Set Management utility from the GUI (Manager, Designer, Director)
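For intuition only, here is a minimal Python sketch of the component-file idea: rows are partitioned and each partition is written to its own file, so a later read gets the data back with its partitioning preserved. The file names, JSON format and hash-by-key scheme are illustrative assumptions, not the real data set format.

# Conceptual model only: a parallel data set as one component file per partition.
# File names and the hash-by-key partitioning are illustrative, not the real on-disk format.
import json, os, tempfile

def write_dataset(rows, key, n_partitions, directory):
    """Hash-partition rows by `key` and write one component file per partition."""
    parts = [[] for _ in range(n_partitions)]
    for row in rows:
        parts[hash(row[key]) % n_partitions].append(row)
    for p, part_rows in enumerate(parts):
        with open(os.path.join(directory, f"part{p}.json"), "w") as f:
            json.dump(part_rows, f)

def read_dataset(directory, n_partitions):
    """Read the component files back; partitioning is preserved as-is."""
    for p in range(n_partitions):
        with open(os.path.join(directory, f"part{p}.json")) as f:
            yield p, json.load(f)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        write_dataset([{"id": i, "val": i * 10} for i in range(8)], "id", 2, d)
        for p, rows in read_dataset(d, 2):
            print(p, rows)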

Data Set Stage
• Managing Data Sets
• GUI (Manager, Designer, Director) – tools > data set management
• View schema & partitions
• Open
• Copy, delete

Important:
Avoid manipulating individual files (e.g. copy, rename, delete) outside the DataStage environment, as the structure is complex and may span multiple files.
• Referred to by a header file
• Usually suffixed by .ds
If required, the entire directory where the file(s) are kept may be moved, copied, or deleted.

Accessing Relational Data
Database Access
• Enterprise Database Stages
• High-performance, scalable interfaces
• DB2 UDB Enterprise Stage
• Informix Enterprise Stage
• Oracle Enterprise Stage
• Teradata Enterprise Stage
• ODBC Enterprise Stage is also provided

Importing Table Definitions


• Using ODBC or using Orchestrate schema definitions
• Orchestrate schema imports are better because the data types are
more accurate

DB2/UDB Enterprise Stage

As a source

• Extract data from table (stream link)


• Read methods include: Table, Generated SQL SELECT, or
User-defined SQL
• User-defined can perform joins, access views

• Lookup (reference link)


• Normally the lookup is memory-based (all the table data is read into memory)
• Can also perform one lookup at a time in the DBMS (sparse option); both approaches are contrasted in the sketch below
• Continue/drop/fail options
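To make the normal versus sparse distinction concrete, here is a hedged Python sketch: the normal lookup reads the whole reference table into memory once, while the sparse option issues one query per incoming row. sqlite3 merely stands in for DB2, and the customer table and columns are invented for the example.

# Illustrative contrast between a normal (in-memory) and a sparse lookup.
# sqlite3 stands in for DB2 here; the customer table and columns are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

stream_rows = [{"cust_id": 1}, {"cust_id": 3}, {"cust_id": 2}]

# Normal lookup: the whole reference table is read into memory once.
ref = dict(conn.execute("SELECT cust_id, name FROM customer"))
for row in stream_rows:
    name = ref.get(row["cust_id"])          # None -> unmatched: continue/drop/fail options
    print("memory lookup:", row["cust_id"], name)

# Sparse lookup: one query per incoming row, performed in the database itself.
for row in stream_rows:
    hit = conn.execute("SELECT name FROM customer WHERE cust_id = ?",
                       (row["cust_id"],)).fetchone()
    print("sparse lookup:", row["cust_id"], hit[0] if hit else None)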

DB2/UDB Enterprise Stage

• Write Method
  • Write
    • Inserts data (Auto-generated / User-defined)
  • Load
    • Invokes the DB2 loader for fast load
    • Needs DBADM privilege
  • Delete Rows (Auto-generated / User-defined)
  • Upsert Mode
    • Update & Insert
    • Uses DB2 CLI for enhanced performance
  • Upsert Modes (the update-then-insert pattern is sketched below)
    • Auto-generated Update & Insert
    • Auto-generated Update Only (Insert if this fails)
    • User-defined Update & Insert
    • User-defined Update Only (Insert if this fails)
• Write Mode
  • Available for the Write & Load methods
  • Append
  • Create
    • fails if the table exists
    • cannot create a table with primary keys
  • Replace
    • Drops & creates the table
    • cannot replace a table with primary keys
  • Truncate
    • Retains table attributes and the DB2 partitioning keys
    • Deletes records & appends new records
• Input Reject Link
  • Optional; catches records whose write fails
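The "Update Only (Insert if this fails)" upsert behaviour can be pictured as "try an UPDATE; if it touches no rows, INSERT instead". Below is a minimal sketch of that pattern, using sqlite3 as a stand-in for DB2 and an invented account table.

# Sketch of the "update, then insert if the update hits no rows" upsert pattern.
# sqlite3 is only a stand-in; the account table and columns are invented for the example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (acct_id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO account VALUES (1, 100.0)")

def upsert(acct_id, balance):
    cur = conn.execute("UPDATE account SET balance = ? WHERE acct_id = ?",
                       (balance, acct_id))
    if cur.rowcount == 0:                       # update matched nothing -> insert
        conn.execute("INSERT INTO account VALUES (?, ?)", (acct_id, balance))

upsert(1, 150.0)   # existing row -> updated
upsert(2, 75.0)    # new key      -> inserted
print(conn.execute("SELECT * FROM account ORDER BY acct_id").fetchall())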

DB2/UDB Enterprise Stage

• DataStage & DB2 Administrators

• Database Configuration
• Remote DB2 server(s)
• The DB2 client must be installed on the DataStage server
• Both the DB2 client and the DB2 server must be configured to support remote connections
• To use DataStage EE's optimized parallel capability with DB2, the DB2 database must be installed with the partitioning feature enabled
• Appropriate privileges must be granted to the users

• DataStage configuration for DB2 access


• The DataStage configuration file needs to be modified to recognize the nodes on which DB2 is running; each DB2 server that will be accessed must have an entry
• DataStage environment variables define the DB2 environment (including database name, commit point, etc.)
• An individual DB2 Enterprise Stage can override the database name & instance

Transforming Data

• Stages that can transform data


• Transformer (already covered)
• Modify
• Aggregator (already covered)

• Stages that do not transform data


• File stages: Sequential, Dataset, Peek, etc.
• Sort
• Remove Duplicates
• Copy (already covered)
• Filter
• Funnel

Modify Stage

Features of Modify Stage

• Modify column types


• Perform some types of derivations
• Null handling
• Date / time handling
• String handling
• Add or drop columns
• Less overhead than the Transformer in earlier versions; in version 8 the Transformer's extra overhead has been overcome
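As a rough illustration of the kinds of per-column operations Modify performs (null handling, a type conversion, string and date handling, dropping a column), here is a small Python sketch; the column names and rules are invented.

# Conceptual sketch of Modify-style column operations: null handling, a type
# conversion, simple string/date handling, and dropping a column. Column names are invented.
from datetime import datetime

def modify(row):
    out = dict(row)
    out["qty"] = int(out["qty"]) if out["qty"] is not None else 0      # handle null + type change
    out["name"] = out["name"].strip().upper()                          # string handling
    out["order_date"] = datetime.strptime(out["order_date"], "%Y-%m-%d").date()  # date handling
    out.pop("unused_col", None)                                        # drop a column
    return out

print(modify({"qty": None, "name": " widget ", "order_date": "2021-08-07", "unused_col": 1}))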

Sorting
• Why Sort?
• Some stages, such as Join and Merge, require sorted input
• Within a Transformer, sorted data gives meaning to comparing a record with the previous record(s)
• Sorts can be done:
• Implicitly (automatically inserted) on input links (for Join etc.)
• Explicitly within stages
• On the input link Partitioning tab, by setting partitioning to anything other than Auto
• In a separate Sort Stage
• Makes the sort more visible on the diagram
• Has more options
• Sort Stage options include
• Ascending/Descending
• Case sensitivity
• Duplicate handling
• Key Change Flag setting (sketched below)
• Sort algorithm type
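A small Python sketch of an explicit sort with the key-change flag option: rows are sorted on a key and a keyChange column is set to 1 whenever the key value differs from the previous row. The field names are invented.

# Sketch of sorting with a key-change flag: rows are sorted on the key and a
# keyChange column is set to 1 on the first row of each new key value.
rows = [{"cust": "B", "amt": 5}, {"cust": "A", "amt": 7}, {"cust": "A", "amt": 2}]

sorted_rows = sorted(rows, key=lambda r: r["cust"])        # ascending, case-sensitive
prev_key = object()                                        # sentinel: never equals a real key
for r in sorted_rows:
    r["keyChange"] = 1 if r["cust"] != prev_key else 0
    prev_key = r["cust"]
    print(r)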

Remove Duplicates

• Can be done by
• Sort stage – use the Unique option
• Aggregator Stage – select first or last
• OR
• Remove Duplicates Stage
• Has more sophisticated ways to remove duplicates
• Specify the key that defines duplicates
• Option to retain the first or last duplicate (sketched below)
• More clearly noticeable within a job flow
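A minimal sketch of keyed duplicate removal with a retain-first/retain-last option, assuming the input is already sorted on the key as the stage expects; the field names are invented.

# Sketch of duplicate removal on a key with a "retain first or last" option.
def remove_duplicates(sorted_rows, key, keep="first"):
    result = {}
    for row in sorted_rows:
        k = row[key]
        if keep == "last" or k not in result:   # "first" keeps the first seen, "last" overwrites
            result[k] = row
    return list(result.values())

rows = [{"id": 1, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}]
print(remove_duplicates(rows, "id", keep="first"))
print(remove_duplicates(rows, "id", keep="last"))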

Copy Stage

Features of Copy Stage


• Copies a single input dataset to a number of output datasets

• Records can be copied with or without modifications

• Modifications can be (illustrated in the sketch below):


• Drop columns
• Change the order of columns
• Rename Columns
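A short sketch of the Copy idea: one input record is copied to several outputs, each with its own drop/reorder/rename choices. The column names and the three particular outputs are invented.

# Sketch of Copy-style output: one input record copied to several outputs,
# each optionally dropping, reordering, or renaming columns.
def copy_stage(row):
    out1 = dict(row)                                            # straight copy
    out2 = {"id": row["id"], "name": row["name"]}               # drop "amount", reorder
    out3 = {"customer_id": row["id"], "amount": row["amount"]}  # rename "id"
    return out1, out2, out3

for out in copy_stage({"name": "Ann", "id": 7, "amount": 12.5}):
    print(out)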

Filter Stage

Features of Filter Stage


• Supports a single input link, multiple output links, and an optional reject link
• Transfers input records that satisfy the specified requirements
• Different requirements can be specified to route rows to different output links (see the sketch below)
• Records that satisfy none of the requirements can be routed to the reject link
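A minimal sketch of Filter-style routing: each output link has its own condition and rows matching none of them go to the reject link. The conditions and field names are invented, and this simplified version sends each row to the first matching link only.

# Sketch of Filter-style routing with per-link conditions and a reject link.
rows = [{"amt": 5}, {"amt": 50}, {"amt": -1}]
high, low, reject = [], [], []

for row in rows:
    if row["amt"] >= 10:        # condition for output link 1
        high.append(row)
    elif 0 <= row["amt"] < 10:  # condition for output link 2
        low.append(row)
    else:                       # satisfied no condition -> reject link
        reject.append(row)

print("high:", high, "low:", low, "reject:", reject)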

Combining Data

Two ways to combine data:

• Horizontally:
• Multiple input links
• One output link (+ optional rejects) made of columns from different input links.
• Joins (already covered)
• Lookup
• Merge
• Funnel
• Vertically:
• One input link, one output link whose columns combine values from all input rows.
• Aggregator (already covered)
• Remove Duplicates

Lookup, Merge and Join Stages

• These stages combine two or more input links


• Data is combined by designated "key" column(s)

• These stages differ mainly in:


• Memory usage
• Treatment of rows with unmatched key values
• Input requirements (sorted, de-duplicated)

• Join
• RDBMS-style Joins supported
• No rejection handling
• Data presorted, partitioned
• 2 or more input links, one output link
• Duplicates may occur on any of the links

Lookup, Merge and Join Stages

• Merge
• 1 master link
• 1 or more update links
• One output link
• Master link rows with no match on the update links
  • may be dropped or
  • passed on to the output
• Update link rows with no match in the master
  • may be ignored or
  • captured through a reject link – one for each update link
• Data partitioned & pre-sorted on the key
• Column names matter
• No duplicates on the merge key in any link except the last update link
• Link ordering matters & can be changed
• The first link is used as the master link

Lookup, Merge and Join Stages

• Merge contd.

• All key fields & values from the master stream are available for output
• All fields & values that appear on only a single input link are available for output
• For fields appearing on more than one input link, the first (available) link's value is output (see the sketch below)
[Diagram: merge key fields from the master, single-link fields, and common fields whose output value comes from the first available link – shown for link order Master > Stream_1 > Stream_2 and for link order Master > Stream_2 > Stream_1.]
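A simplified Python sketch of the merge behaviour described above, with one master link and one update link: matches are combined with the master's value winning for common columns, unmatched master rows are kept or dropped, and unmatched update rows go to a reject list. Field names are invented and only a single update link is modelled.

# Simplified sketch of a Merge: one sorted master link and one update link, matched on a key.
def merge(master, update, key, keep_unmatched_master=True):
    upd = {u[key]: u for u in update}            # no duplicate keys expected on the update link
    output, matched = [], set()
    for m in master:
        u = upd.get(m[key])
        if u is not None:
            matched.add(m[key])
            output.append({**u, **m})            # master values win for common columns
        elif keep_unmatched_master:
            output.append(m)
    update_rejects = [u for k, u in upd.items() if k not in matched]
    return output, update_rejects

out, rej = merge([{"id": 1, "a": 1}, {"id": 2, "a": 2}],
                 [{"id": 2, "b": 20}, {"id": 3, "b": 30}], "id")
print(out)
print(rej)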
Lookup, Merge and Join Stages
Features of Lookup Stage
• Single stream input link (straight line)

• Multiple reference input links (dotted line)

• One output link

• Optional reject link – only one, regardless of the number of reference links

• If required, can return multiple matching rows from any one reference link; otherwise the first matching row is returned

• A hash table is built in memory internally from the reference (lookup) data


• Indexed by key
• Should be small enough to fit into physical memory

• No need for the input/reference links to be sorted

• No need to have the same names for lookup key columns in input/reference links

• Case-insensitive (caseless) matching is supported

Lookup Stage

Lookup Stage Implementation

Dotted link for reference input

Lookup Stage

Lookup Stage Implementation

Link look-up keys. Can also use returned values from the previous lookup link as the key into the next lookup data set.

Lookup, Merge and Join Stages

Features of Lookup Stage


• Ensure data being looked up in lookup table is in the same partition as the input data
referencing it.

• Alternatively, use the Entire partitioning method on the lookup link.

• Look-up key can use value(s) returned from a previously looked-up record

• Link ordering matters in this case

• Can save look-up data as a persistent data set

• Conditions can be set for (see the sketch below):
  • conditional lookup – the look-up operation on a link is performed only if the condition is met
  • lookup failure – Continue, Fail, Drop, Reject
  • handling duplicates in the reference link
    • No settings => a warning is logged & the first record is returned
    • Up to one reference link can be set to return duplicates, if required
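A hedged sketch of conditional lookup and the failure actions: the lookup on a link is skipped when its condition is not met, and an unmatched row is continued, dropped, rejected, or causes a failure. The reference data, condition flag and field names are invented.

# Sketch of per-link lookup conditions and lookup-failure actions (Continue, Drop, Fail, Reject).
ref = {1: "Ann", 2: "Bob"}                       # in-memory reference data, keyed by id
rejects = []

def do_lookup(row, failure="continue"):
    if row.get("skip_lookup"):                   # condition not met -> lookup not performed
        return row
    name = ref.get(row["id"])
    if name is None:                             # lookup failure
        if failure == "fail":
            raise RuntimeError(f"no match for {row['id']}")
        if failure == "drop":
            return None                          # row silently dropped
        if failure == "reject":
            rejects.append(row)
            return None
        # "continue": pass the row on with the lookup column left as None
    return {**row, "name": name}

for r in [{"id": 1}, {"id": 9}, {"id": 2, "skip_lookup": True}]:
    out = do_lookup(r, failure="reject")
    if out is not None:
        print(out)
print("rejects:", rejects)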

Lookup Stage
Lookup Stage Implementation

• Execution order
• Open the conditions dialog box for the stage
• Up to 1 link can return multiple records
• Condition for lookup – can use the previous look-up's return value
• Describe conditions for condition failure & lookup failure

Lookup, Merge and Join Stages - Comparison

• Stream input
  • Merge: 2 to N
  • Join: 2 to N
  • Lookup: 1
• Reference input
  • Merge: N/A
  • Join: N/A
  • Lookup: 1 to N
• Output
  • Merge: merged data (master/update type)
  • Join: SQL-type joined data
  • Lookup: if no duplicates are expected in the lookup data, one row for every input stream record; else, if one reference stream provides legitimate duplicates, multiple rows for those records
• Sorting requirements
  • Merge: all inputs
  • Join: all inputs
  • Lookup: stream input only
• Duplicates
  • Merge: not allowed, except in the last update link
  • Join: allowed
  • Lookup: allowed in the stream input; up to 1 reference link can handle duplicates, in the others the single (first) value is returned
• Partitioning
  • Merge: merge key
  • Join: join key
  • Lookup: usually "Entire" for the lookup data
• Unmatched rows
  • Merge: master – drop/keep, with warning/no warning; update – drop/reject
  • Join: depends on join type (NULL values on outer join)
  • Lookup: unmatched stream – reject/keep
• Memory
  • Merge: very few rows in memory, as data is sorted & no duplicates are expected
  • Join: few rows, as data is sorted; higher (sequential, optimized) I/O for the high-speed sort on input & reference data sets
  • Lookup: lookup data held in memory – may page for large volumes; not suitable for large reference data; when looking up against a database, the DB stage can be set to provide sparse look-up support
• Use when
  • Merge: larger sorted data
  • Join: large data
  • Lookup: small reference data look-up

Funnel Stage

Features of Funnel Stage


• A processing stage that combines data from multiple input links to a single output link
• Useful to combine data from several identical data sources into a single large dataset
• Operates in three modes (contrasted in the sketch below)
• Continuous
• Sort Funnel
• Sequence
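To contrast the three modes, here is a small Python sketch over two invented input links: Continuous interleaves rows as they become available (round-robin here), Sort Funnel performs a key-ordered merge of inputs that are sorted on that key, and Sequence copies the links one after another in link order.

# Sketch of the three Funnel modes over two invented input links.
import heapq
from itertools import chain, zip_longest

link1 = [{"k": 1}, {"k": 3}]
link2 = [{"k": 2}, {"k": 4}]

# Continuous: rows taken from the inputs as they arrive (round-robin here).
continuous = [r for pair in zip_longest(link1, link2) for r in pair if r is not None]

# Sort Funnel: a key-ordered merge of inputs that are each sorted on "k".
sort_funnel = list(heapq.merge(link1, link2, key=lambda r: r["k"]))

# Sequence: all of link1, then all of link2, in link order.
sequence = list(chain(link1, link2))

print(continuous, sort_funnel, sequence, sep="\n")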

Surrogate Key Stage

Features of Surrogate Key Stage


• Generates key columns for an existing dataset
• Processing stage that has single input and single output link
• Generates sequentially incrementing unique integers from a given starting point
• Other existing columns of dataset are passed straight through the stage
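A minimal sketch of the surrogate key idea: a sequentially incrementing integer starting from a given value is added to each row while all other columns pass straight through. The column name and start value are arbitrary.

# Sketch of surrogate key generation: sequential integers from a start value,
# other columns passed straight through.
from itertools import count

def add_surrogate_key(rows, start=1, column="surr_key"):
    keys = count(start)
    for row in rows:
        yield {column: next(keys), **row}

for out in add_surrogate_key([{"name": "Ann"}, {"name": "Bob"}], start=100):
    print(out)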

