C-Some More Stages
• The Data Set stage is used to create, write to (append or overwrite) and read from Data Sets
• Data Sets:
• Persistent data that can capture the data as carried through the DataStage Links
• An operating system file in a DataStage-specific format
• Cannot be shared with other software packages.
• Used to pass data files between DataStage jobs
• Preserves partitioning
• Component data files are written on each partition
• Key to good performance in a set of linked jobs
• No import / export conversions are needed
• No repartitioning needed
• DataStage engine is most efficient when processing internally formatted records (i.e.
datasets)
• Can be managed with the Data Set Management utility from the GUI (Manager, Designer, Director)
August 7, 2021 2
Data Set Stage
• Managing Data Sets
• GUI (Manager, Designer, Director) – Tools > Data Set Management
• View schema & partitions
• Open
• Copy, delete
Important:
A Data Set is referred to by a header file, usually suffixed by .ds.
Avoid manipulating individual files (e.g. copy, rename, delete) outside the DataStage environment, as the structure is complex and may span multiple files.
If required, the entire directory where the file(s) are kept may be moved, copied or deleted.
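The reason for this warning can be illustrated with a toy model. The sketch below is not the real DataStage on-disk format; it only assumes, as the slide states, that a header file refers to data files spread over several partitions. The function names and the `.partN` file naming are hypothetical.

```python
import os
import tempfile


def write_toy_dataset(directory, name, partitions):
    """Write one data file per partition plus a header file that lists them."""
    data_paths = []
    for index, rows in enumerate(partitions):
        # Hypothetical naming scheme, chosen only for this illustration.
        path = os.path.join(directory, f"{name}.part{index}")
        with open(path, "w") as handle:
            handle.write("\n".join(rows))
        data_paths.append(path)
    header_path = os.path.join(directory, f"{name}.ds")
    with open(header_path, "w") as handle:
        handle.write("\n".join(data_paths))
    return header_path


def read_toy_dataset(header_path):
    """Follow the header to every data file and return the rows per partition."""
    with open(header_path) as handle:
        data_paths = handle.read().splitlines()
    partitions = []
    for path in data_paths:
        with open(path) as handle:
            partitions.append(handle.read().splitlines())
    return partitions


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    header = write_toy_dataset(workdir, "customers", [["alice", "bob"], ["carol"]])
    print(read_toy_dataset(header))
    # Deleting only the header leaves the data files behind with nothing
    # that refers to them any more.
    os.remove(header)
    print(sorted(os.listdir(workdir)))
```

In this model, removing or renaming a single file by hand either orphans the data files or leaves the header pointing at files that no longer exist, which is why such operations are best left to the Data Set Management utility.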
Accessing Relational Data
Database Access
• Enterprise Database Stages
• high performance / scalable interfaces
• DB2 UDB Enterprise Stage
• Informix Enterprise Stage
• Oracle Enterprise Stage
• Teradata Enterprise Stage
• ODBC Enterprise Stage is also provided
DB2/UDB Enterprise Stage
As a source
DB2/UDB Enterprise Stage
• Database Configuration
• Remote DB2 server(s)
• The DB2 client must be installed on the DataStage server
• The DB2 client as well as the server must be configured to support remote connections.
• To use DataStage EE’s optimized parallel capability with DB2, the DB2 database must be installed with partitioning features enabled
• Appropriate privileges must be given to the users
Transforming Data
Modify Stage
Sorting
• Why Sort?
• Some stages, such as Join and Merge, require sorted input
• Within a Transformer, sorted data gives meaning to comparisons between a record and the previous record(s)
• Sorts can be done:
• Implicitly (automatically inserted) on input links (for join etc.)
• Explicitly within stages
• On the input link's Partitioning tab, set partitioning to anything other than Auto so that a sort can be specified on the link
• In a separate Sort Stage
• Makes sort more visible on diagram
• Has more options
• Sort Stage Options include
• Ascending/Descending
• Case sensitivity
• Duplicate handling
• Key Change Flag Setting
• Sort algorithm type
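A few of these options can be pictured with a small sketch in plain Python. This is not DataStage code: the function name, the parameter names and the `key_change` column name are assumptions made for the illustration. It sorts on one key, optionally ignoring case or descending, and adds a flag that is 1 on the first row of each key group and 0 otherwise.

```python
def sort_with_key_change(rows, key, descending=False, case_sensitive=True):
    """Sort rows on one key column and flag the first row of each key group."""

    def sort_key(row):
        value = row[key]
        return value if case_sensitive else value.lower()

    # sorted() is stable, so rows with equal keys keep their input order.
    ordered = sorted(rows, key=sort_key, reverse=descending)

    result = []
    previous = object()  # sentinel that never equals a real key value
    for row in ordered:
        current = sort_key(row)
        flag = 1 if current != previous else 0
        result.append({**row, "key_change": flag})  # hypothetical column name
        previous = current
    return result


if __name__ == "__main__":
    cities = [
        {"id": 1, "city": "Pune"},
        {"id": 2, "city": "agra"},
        {"id": 3, "city": "pune"},
    ]
    for row in sort_with_key_change(cities, "city", case_sensitive=False):
        print(row)
```

A downstream step can use the flag to detect group boundaries without comparing each record with its predecessor.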
Remove Duplicates
• Can be done by
• Sort stage – use the unique option
• Aggregator Stage – select first or last
• OR
• Remove Duplicates Stage
• Has more sophisticated ways to remove duplicates
• Specify Key that defines duplicates
• Option to retain first or last duplicate
• More clearly noticeable within a job flow
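The "retain first or last" choice can be sketched as follows. This is an illustrative Python function, not the stage's implementation; the function and parameter names are assumptions. It uses a dictionary keyed on the duplicate-defining columns, so unlike a sort-based approach it does not need sorted input.

```python
def remove_duplicates(rows, key_columns, retain="first"):
    """Keep one row per key, either the first or the last one encountered."""
    if retain not in ("first", "last"):
        raise ValueError("retain must be 'first' or 'last'")
    kept = {}
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        # For "first", only the first row seen for a key is stored.
        # For "last", each later row with the same key replaces the earlier one.
        if retain == "last" or key not in kept:
            kept[key] = row
    return list(kept.values())


if __name__ == "__main__":
    data = [
        {"id": 1, "value": "a"},
        {"id": 1, "value": "b"},
        {"id": 2, "value": "c"},
    ]
    print(remove_duplicates(data, ["id"], retain="first"))
    print(remove_duplicates(data, ["id"], retain="last"))
```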
Copy Stage
Filter Stage
Combining Data
• Horizontally:
• Multiple input links
• One output link (+ optional rejects) made of columns from different input links.
• Join (already covered)
• Lookup
• Merge
• Funnel
• Vertically:
• One input link, one output link with columns combining values from all input rows.
• Aggregator (already covered)
• Remove Duplicates
Lookup, Merge and Join Stages
• Join
• RDBMS-style Joins supported
• No reject handling
• Data must be presorted and partitioned on the join key
• 2 or more input links, one output link
• Duplicates may occur on any of the links
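The RDBMS-style behaviour, including what happens with duplicate keys, can be sketched in plain Python. This is an illustration only: the function and parameter names are assumptions, and it groups the right-hand rows in a dictionary for brevity rather than walking presorted inputs.

```python
from collections import defaultdict


def join(left_rows, right_rows, key, how="inner"):
    """Inner or left outer join of two lists of dict rows on one key column."""
    if how not in ("inner", "left_outer"):
        raise ValueError("how must be 'inner' or 'left_outer'")

    right_by_key = defaultdict(list)
    for row in right_rows:
        right_by_key[row[key]].append(row)
    right_columns = sorted({c for row in right_rows for c in row if c != key})

    output = []
    for left in left_rows:
        matches = right_by_key.get(left[key], [])
        if matches:
            # Duplicate keys on the right produce one output row per match.
            for right in matches:
                output.append({**left, **right})
        elif how == "left_outer":
            # No match: keep the left row and fill right-hand columns with None,
            # standing in for NULL values.
            output.append({**left, **{c: None for c in right_columns}})
    return output


if __name__ == "__main__":
    customers = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
    orders = [{"id": 1, "amount": 10}, {"id": 1, "amount": 20}]
    print(join(customers, orders, "id"))
    print(join(customers, orders, "id", how="left_outer"))
```

There is no reject output in this sketch, mirroring the point that Join has no reject handling: an unmatched row is either absent (inner) or padded with nulls (outer).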
Lookup, Merge and Join Stages
• Merge
• 1 Master link
• 1 or more update links
• One output link
• A master link row with no match on the update links
• may be dropped or
• passed on to the output
• Update link data with no match in the master
• may be ignored or
• captured through a reject link – one for each update link
• Data must be partitioned & pre-sorted on the key
• Column names matter
• No duplicates are allowed on the merge key in any link except the last update link
Lookup, Merge and Join Stages
• Merge contd.
• All key fields & values from the Master Stream are available for output
• Fields & values that appear on only one input link are available for output
• For fields appearing in more than one input link, the first (available) link’s value is output
[Figure: merge output with the merge key fields taken from the Master link, shown for two link orders: Master > Stream_1 > Stream_2 and Master > Stream_2 > Stream_1]
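The merge rules above (unmatched master rows, rejected update rows, and link order deciding which value wins) can be sketched in plain Python. This is not the stage's implementation; the function and parameter names are assumptions, and each update link is assumed to hold at most one row per key.

```python
def merge(master_rows, update_links, key, keep_unmatched_masters=True):
    """Merge a master link with ordered update links.

    Returns (output_rows, rejects), where rejects holds one list per update link.
    """
    master_keys = {row[key] for row in master_rows}
    # Update rows whose key is absent from the master go to that link's rejects.
    rejects = [
        [row for row in link if row[key] not in master_keys]
        for link in update_links
    ]
    indexed_links = [{row[key]: row for row in link} for link in update_links]

    output = []
    for master in master_rows:
        merged = dict(master)
        matched = False
        for index in indexed_links:  # link order decides precedence
            update = index.get(master[key])
            if update is None:
                continue
            matched = True
            for column, value in update.items():
                # A column already supplied by the master or an earlier link is kept.
                merged.setdefault(column, value)
        if matched or keep_unmatched_masters:
            output.append(merged)
    return output, rejects


if __name__ == "__main__":
    master = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
    stream_1 = [{"id": 1, "city": "X", "tier": "gold"}]
    stream_2 = [{"id": 1, "city": "Y"}, {"id": 3, "city": "Z"}]
    print(merge(master, [stream_1, stream_2], "id"))
    print(merge(master, [stream_2, stream_1], "id"))
```

Swapping the two update links changes which `city` value reaches the output, which is the effect of link order on fields present in more than one input.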
Lookup, Merge and Join Stages
Features of Lookup Stage
• Single stream input link (straight line)
• Can return multiple matching rows from any one reference link if required; otherwise returns the first matching row
• No need to have the same names for lookup key columns in input/reference links
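These features can be sketched with an in-memory lookup in plain Python. The function and parameter names, as well as the sample column names, are assumptions made for the illustration. The reference rows are held in a dictionary, the stream and reference key columns have different names, and a flag chooses between the first match and all matches.

```python
from collections import defaultdict


def lookup(stream_rows, reference_rows, stream_key, reference_key,
           return_multiple=False):
    """Look up each stream row in an in-memory reference table.

    Returns (output_rows, rejects); rejects holds stream rows with no match.
    """
    table = defaultdict(list)
    for reference in reference_rows:
        table[reference[reference_key]].append(reference)

    output, rejects = [], []
    for row in stream_rows:
        matches = table.get(row[stream_key], [])
        if not matches:
            rejects.append(row)
            continue
        chosen = matches if return_multiple else matches[:1]
        for reference in chosen:
            # Stream columns win if a column name exists on both sides.
            output.append({**reference, **row})
    return output, rejects


if __name__ == "__main__":
    stream = [{"cust_id": 1}, {"cust_id": 9}]
    reference = [
        {"customer_key": 1, "region": "N"},
        {"customer_key": 1, "region": "S"},
    ]
    print(lookup(stream, reference, "cust_id", "customer_key"))
    print(lookup(stream, reference, "cust_id", "customer_key", return_multiple=True))
```

Sending unmatched stream rows to a reject list is only one possible policy; the same loop could keep them in the output with empty reference columns instead.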
Lookup Stage
Lookup, Merge and Join Stages
• Look-up key can use value(s) returned from a previously looked-up record
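A chained lookup of this kind can be sketched in plain Python. The table and column names below (`customer_id`, `region_code`, `region_name`) are hypothetical and exist only for the illustration: the first lookup returns a region code, which then serves as the key for the second lookup.

```python
def chained_lookup(orders, customers_by_id, regions_by_code):
    """Enrich orders via two lookups, the second keyed on the first's result."""
    output = []
    for order in orders:
        enriched = dict(order)
        customer = customers_by_id.get(order["customer_id"])
        # The second lookup key is a value returned by the first lookup.
        region_code = customer["region_code"] if customer else None
        region = regions_by_code.get(region_code) if region_code is not None else None
        enriched["region_code"] = region_code
        enriched["region_name"] = region["name"] if region else None
        output.append(enriched)
    return output


if __name__ == "__main__":
    orders = [
        {"order_id": 100, "customer_id": 1},
        {"order_id": 101, "customer_id": 2},
    ]
    customers_by_id = {1: {"region_code": "N"}}
    regions_by_code = {"N": {"name": "North"}}
    for row in chained_lookup(orders, customers_by_id, regions_by_code):
        print(row)
```

Because the second key depends on the first result, the order in which the reference links are evaluated matters.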
Lookup Stage
Lookup Stage Implementation
Execution Order
Lookup, Merge and Join Stages - Comparison
| | Merge | Join | Lookup |
| Stream Input | 2 to N | 2 to N | 1 |
| Reference Input | NA | NA | 1 to N |
| Output | Merged data (master/update type) | SQL-type joined data | If no duplicates are expected in the lookup data, one row for every input stream record. If one reference stream provides legitimate duplicates, multiple rows for those records. |
| Sorting requirements | All inputs | All inputs | Stream input only |
| Duplicates | Not allowed except in the last update link | Allowed | Allowed in the stream input. Up to 1 reference link can handle duplicates; in the others, a single (first) value is returned. |
| Partition | Merge key | Join key | Usually set to "Entire" for lookup data |
| Unmatched Rows | Master: drop/keep, warning/no warning. Update: drop/reject. | Depends on join type; NULL values on outer join | Unmatched stream: reject/keep |
| Memory | Very few rows in memory, as data is sorted & no duplicates are expected | Few rows, as data is sorted. Higher (sequential, optimized) I/O for high-speed sort on input & reference data sets. | Lookup data is held in memory and may page for large volumes; not suitable for large reference data. When looking up against a database, the DB stage can be set to provide sparse lookup support. |
| Use When | Larger sorted data | Large data | Small reference data lookup |
Funnel Stage
Surrogate Key Stage