1 Core Modules
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Java, JDBC, and all Java-based trademarks are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. All other product or brand names may be trademarks of their respective companies. All information contained in this document has not been submitted to any formal IBM test and is distributed on an "as is" basis without any warranty either express or implied. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will result elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk. The original repository material for this course has been certified as being Year 2000 compliant. This document may not be reproduced in whole or in part without the prior written permission of IBM. Note to U.S. Government Users: Documentation related to restricted rights. Use, duplication, or disclosure is subject to restrictions set forth in the GSA ADP Schedule Contract with IBM Corp.
Course Contents
Mod 01: Introduction
Mod 02: Deployment
Mod 03: Administering DataStage
Mod 04: DataStage Designer
Mod 05: Creating Parallel Jobs
Mod 06: Accessing Sequential Data
Mod 07: Platform Architecture
Mod 08: Combining Data
Mod 09: Sorting and Aggregating Data
Mod 10: Transforming Data
Mod 11: Repository Functions
Mod 12: Working With Relational Data
Mod 13: Metadata in the Parallel Framework
Mod 14: Job Control
Appendix A: Information Server on zLinux
Unit objectives
After completing this unit, you should be able to: List and describe the uses of DataStage List and describe the DataStage clients Describe the DataStage workflow List and compare the different types of DataStage jobs Describe the two types of parallelism exhibited by DataStage parallel jobs
Business Glossary
Information Analyzer
DataStage
QualityStage
DataStage Architecture
Clients
Parallel engine
Engines
DataStage Administrator
DataStage Designer
DataStage Director
Log message events
Developing in DataStage
Define global and project properties in Administrator Import metadata into the Repository Build job in Designer Compile job in Designer Run and monitor job log messages in Director
Jobs can also be run in Designer, but job log messages cannot be viewed in Designer
Parallel jobs
Master Server jobs that kick off jobs and other activities Can kick off Server or Parallel jobs Runtime monitoring in DataStage Director Executed by the Server engine
Links
Pipes through which the data moves from stage to stage
Pipeline Parallelism
Transform, clean, load processes execute simultaneously Like a conveyor belt moving rows from process to process
Start downstream process while upstream process is running
Advantages:
Reduces disk usage for staging areas Keeps processors busy
Partition Parallelism
Divide the incoming stream of data into subsets to be separately processed by an operator
Subsets are called partitions
Three-Node Partitioning
[Diagram: the incoming data is divided into subset1, subset2, and subset3, and the operation runs on each subset on Node 1, Node 2, and Node 3 in parallel]
Here the data is partitioned into three partitions The operation is performed on each partition of data separately and in parallel If the data is evenly distributed, the data will be processed three times faster
At runtime, this job runs in parallel for any configuration of partitions (called nodes)
Checkpoint
1. True or false: DataStage Director is used to build and compile your ETL jobs 2. True or false: Use Designer to monitor your job during execution 3. True or false: Administrator is used to set global and project properties
Checkpoint solutions
1. False. 2. False. 3. True.
Unit summary
Having completed this unit, you should be able to: List and describe the uses of DataStage List and describe the DataStage clients Describe the DataStage workflow List and compare the different types of DataStage jobs Describe the two types of parallelism exhibited by DataStage parallel jobs
Unit objectives
After completing this unit, you should be able to: Identify the components of Information Server that need to be installed Describe what a deployment domain consists of Describe different domain deployment options Describe the installation process Start the Information Server
One DB2 UDB instance containing the Repository database Information Server clients
Administration console Reporting console DataStage clients
Administrator Designer Director
Selected layers are installed on the machine local to the installation Already existing components can be configured and used
E.g., DB2, InfoSphere Application Server
Start the ASB agent From Windows Start menu, click Start the agent after selecting the Information Server folder Only required if DataStage and the MetaData Server are on different machines To begin work in DataStage, double-click on a DataStage client icon To begin work in the Administration and Reporting consoles, double-click on the Web Console for Information Server icon
Profile Profile
Startup command
Checkpoint
1. What application components make up a domain? 2. Can a domain contain multiple DataStage servers? 3. Does the DB2 instance and the Repository database need to be on the same machine as the Application Server? 4. Suppose DataStage is on a separate machine from the Application Server. What two components need to be running before you can log onto DataStage?
Checkpoint solutions
1. Metadata server hosted by the Application Server. One or more DataStage servers. One DB2/UDB instance containing the Suite Repository database. 2. Yes. The DataStage servers must be on separate machines. 3. No. The DB2 instance with the Repository can reside on a separate machine than the Application Server. 4. The Application Server and the ASB agent.
Unit summary
Having completed this unit, you should be able to: Identify the components of Information Server that need to be installed Describe what a deployment domain consists of Describe different domain deployment options Describe the installation process Start the Information Server
Unit objectives
After completing this unit, you should be able to: Open the Administrative console Create new users and groups Assign Suite roles and Product roles to users and groups Give users DataStage credentials Log onto DataStage Administrator Add a DataStage user on the Permissions tab and specify the user's role Specify DataStage global and project defaults List and describe important environment variables
Suite roles
DataStage administrator
Users
User ID
Users
Users given DataStage Suite roles in the Suite Administration console do not automatically receive DataStage credentials
Users need to be mapped to a user who has DataStage credentials on the DataStage Server machine This DataStage user must have file access permission to the DataStage engine/project files or Administrator rights on the operating system
OS ID)
DataStage Administrator
Environment Variables
Parallel job variables
User-defined variables
Display Score
Display OSH
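For reference, the Score display is controlled by the reporting environment variable $APT_DUMP_SCORE: when it is set to True (for example, as a project default in Administrator), the Score is written to the job log at run time.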
Checkpoint
1. Authorizations can be assigned to what two items? 2. What two types of authorization roles can be assigned to a user or group? 3. In addition to Suite authorization to log onto DataStage, what else does a DataStage developer require to work in DataStage?
Checkpoint solutions
1. Users and groups. Members of a group acquire the authorizations of the group. 2. Suite roles and Component roles. 3. Must be mapped to a user with DataStage credentials.
Unit summary
Having completed this unit, you should be able to: Open the Administrative console Create new users and groups Assign Suite roles and Product roles to users and groups Give users DataStage credentials Log onto DataStage Administrator Add a DataStage user on the Permissions tab and specify the user's role Specify DataStage global and project defaults List and describe important environment variables
Unit objectives
After completing this unit, you should be able to: Log onto DataStage Navigate around DataStage Designer Import and export DataStage objects to a file Import a Table Definition for a sequential file
Parallel canvas
Palette
Repository Window
Search for objects in the project
Export Procedure
Click Export>DataStage Components Select DataStage objects for export Specify type of export:
DSX: Default format XML: Enables processing of export file by XML applications, e.g., for generating reports
Export Window
Selected objects
Export type
Import Procedure
Click Import>DataStage Components
Or Import>DataStage Components (XML) if you are importing an XML-format export file
Import Options
Import all objects in the file
Invoke through Programs > IBM Information Server > Information Server Command Line Interface
Examples:
istool -hostname domain:9080 -username dsuser -password inf0server -deployment-package importfile.dsx import
istool -hostname domain:9080 -username dsuser -password inf0server -job GenDataJob -filename exportfile.dsx export
Adding a Domain
Created export archive
DataStage Repository: shows all DataStage servers and projects
Builds
Objects
Specify Format
Edit columns Select if first row has column names
Delimiter
Parallel properties
Property categories
Available properties
Type
Source
Checkpoint
1. True or False? The directory to which you export is on the DataStage client machine, not on the DataStage server machine.
2. Can you import Table Definitions for sequential files with fixed-length record formats?
Checkpoint solutions
1. True. 2. Yes. Record lengths are determined by the lengths of the individual columns.
Unit summary
Having completed this unit, you should be able to: Log onto DataStage Navigate around DataStage Designer Import and export DataStage objects to a file Import a Table Definition for a sequential file
Unit objectives
After completing this unit, you should be able to: Design a simple Parallel job in Designer Define a job parameter Use the Row Generator, Peek, and Annotation stages in a job Compile your job Run your job in Director View the job log
Built using a graphical user interface Compiles into Orchestrate script language (OSH) and object code (from generated C++)
In Designer, add stages defining data extractions and loads Add processing stages to define data transformations Add links defining the flow of data from sources to targets Compile the job In Director, validate, run, and monitor your job
Can also run the job in Designer Can only view the job log in Director
Tools Palette
Stage categories
Stages
Peek
RowGenerator Stage
Produces mock data for specified columns No input links; single output link On Properties tab, specify number of rows On Columns tab, load or specify column definitions
Open Extended Properties window to specify the values to be generated for that column A number of algorithms for generating values are available depending on the data type
Algorithms for string type: Cycle, Alphabet
Algorithms for date type: Random, Cycle
Property
Columns Tab
Double-click to specify extended properties
View data
Extended Properties
Peek Stage
Displays field values
Displayed in job log or sent to a file Skip records option Can control number of records to be displayed Shows data in each partition, labeled 0, 1, 2, and so on
Job Parameters
Defined in Job Properties window Makes the job more flexible Parameters can be:
Used in directory and file names Used to specify property values Used in constraints and derivations
Parameter values are determined at run time When used for directory and file names and property values, they are surrounded with pound signs (#)
E.g., #NumRows#
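For instance, a Sequential File stage's File property might be set to #SourceDir#/#FileName# (both parameter names here are hypothetical); at run time each #...# reference is replaced by the value supplied for that parameter.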
Job parameter
Annotation stage
Added from the Tools Palette Displays formatted text descriptions on diagram
Documentation
Compiling a Job
Run Compile
DataStage Director
Use to run and schedule jobs View runtime messages Can invoke directly from Designer
Tools > Run Director
Run Options
Run Options
Peek messages
Message Details
Checkpoint
1. Which stage can be used to display output data in the job log? 2. Which stage is used for documenting your job on the job canvas? 3. What command is used to run jobs from the operating system command line?
Checkpoint solutions
1. Peek stage 2. Annotation stage 3. dsjob -run
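As an illustration of the dsjob command (project and job names below are placeholders), a typical invocation might be:
dsjob -run -jobstatus -param NumRows=100 MyProject GenDataJob
where -jobstatus makes dsjob wait for the job to finish and return an exit code based on the job's status.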
Unit summary
Having completed this unit, you should be able to: Design a simple Parallel job in Designer Define a job parameter Use the Row Generator, Peek, and Annotation stages in a job Compile your job Run your job in Director View the job log
Unit objectives
After completing this unit, you should be able to: Understand the stages for accessing different kinds of file data Sequential File stage Data Set stage Create jobs that read from and write to sequential files Create Reject links Work with NULLs in sequential files Read from multiple files using file patterns Use multiple readers
Import operators convert data from the external format, as described by the Table Definition, to the framework internal format
Internally, the format of data is described by schemas
Export operators reverse the process Messages in the job log use the import / export terminology
E.g., 100 records imported successfully; 2 rejected E.g., 100 records exported successfully; 0 rejected Records get rejected when they cannot be converted correctly during the import or export
Importing/Exporting Data
Data import:
Internal format
Data export
Internal format
Field 1
2 Field 1
Field 1 3
, Last field
, nl
Stream link
Format Tab
Record format
Column format
View data Load from Table Definition Save as a new Table Definition
Multiple readers option allows you to set number of readers per node
Append / Overwrite
Reject Link
Reject mode =
Continue: continue reading records
Fail: abort job
Output: send down output link
In a source stage: all records not matching the metadata (column definitions) are rejected
In a target stage: all records that fail to be written for any reason are rejected
Column mappings
You must handle NULLs written to nullable columns in a Sequential File stage
You need to tell DataStage what value to write to the file Unhandled rows are rejected
In a Sequential File source stage, you can specify values you want DataStage to convert to NULLs
Nullable column
Added property
DataSet Stage
Data Set
Binary data file Preserves partitioning
Component dataset files are written to each partition
Suffixed by .ds Referred to by a header file Managed by Data Set Management utility from GUI (Manager, Designer, Director) Represents persistent data Key to good performance in set of linked jobs
No import / export conversions are needed No repartitioning needed
Accessed using DataSet stage
Implemented with two types of components:
Descriptor file: contains metadata and the data location, but NOT the data itself
Data file(s): contain the data; multiple files, one per partition (node)
Use to read and write to filesets Files suffixed by .fs Files are similar to a dataset
Partitioned Implemented with header file and data files
Checkpoint
1. List three types of file data. 2. What makes datasets perform better than other types of files in parallel jobs? 3. What is the difference between a data set and a file set?
Checkpoint solutions
1. Sequential, dataset, & complex flat files. 2. They are partitioned and they store data in the native parallel format. 3. Both are partitioned. Data sets store data in a binary format not readable by user applications. File sets are readable.
Unit summary
Having completed this unit, you should be able to: Understand the stages for accessing different kinds of file data Sequential File stage Data Set stage Create jobs that read from and write to sequential files Create Reject links Work with NULLs in sequential files Read from multiple files using file patterns Use multiple readers
Unit objectives
After completing this unit, you should be able to: Describe parallel processing architecture Describe pipeline parallelism Describe partition parallelism List and describe partitioning and collecting algorithms Describe configuration files Describe the parallel job compilation process Explain OSH Explain the Score
Scalable processing:
Add more resources (CPUs and disks) to increase system performance
Example system: 6 CPUs (processing nodes) and disks Scale up by adding more CPUs Add CPUs as individual nodes or to an SMP system
GRID / Clusters
Multiple, multi-CPU systems Dedicated memory per node Typically SAN-based shared storage Multiple nodes with dedicated memory, storage
MPP
2 to 1000s of CPUs
Pipeline Parallelism
Transform, clean, load processes execute simultaneously Like a conveyor belt moving rows from process to process
Start downstream process while upstream process is running
Advantages:
Reduces disk usage for staging areas Keeps processors busy
Partition Parallelism
Divide the incoming stream of data into subsets to be separately processed by an operation
Subsets are called partitions (nodes)
8 times faster on 8 processors 24 times faster on 24 processors This assumes the data is evenly distributed
Three-Node Partitioning
[Diagram: the incoming data is divided into subset1, subset2, and subset3, and the operation runs on each subset on Node 1, Node 2, and Node 3 in parallel]
Here the data is partitioned into three partitions The operation is performed on each partition of data separately and in parallel If the data is evenly distributed, the data will be processed three times faster
Within parallel jobs pipelining, partitioning, and repartitioning are automatic Job developer only identifies: Sequential vs. parallel mode (by stage) Default for most stages is parallel mode Method of data partitioning The method to use to distribute the data into the available nodes Method of data collecting The method to use to collect the data from the nodes into a single node Configuration file Specifies the nodes and resources
At runtime, this job runs in parallel for any configuration (1 node, 4 nodes, N nodes)
Configuration File
Configuration file separates configuration (hardware / software) from job design
Specified per job at runtime by $APT_CONFIG_FILE Change hardware and resources without changing job design
Defines nodes with their resources (need not match physical CPUs)
Dataset, Scratch, Buffer disk (file systems) Optional resources (Database, SAS, etc.) Advanced resource optimizations
Pools (named subsets of nodes)
Key points:
1. Number of nodes defined
2. Resources assigned to each node; their order is significant
3. Advanced resource optimizations and configuration (named pools, database, SAS)
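A minimal sketch of what a two-node configuration file might look like (the fastname and the directory paths are placeholders):

{
  node "node1"
  {
    fastname "etl_server"
    pools ""
    resource disk "/ibm/ds/data" {pools ""}
    resource scratchdisk "/ibm/ds/scratch" {pools ""}
  }
  node "node2"
  {
    fastname "etl_server"
    pools ""
    resource disk "/ibm/ds/data" {pools ""}
    resource scratchdisk "/ibm/ds/scratch" {pools ""}
  }
}

A job that points at this file through $APT_CONFIG_FILE runs two-way parallel; adding node entries increases the degree of parallelism without changing the job design.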
Fairly low overhead Round Robin assigns rows to partitions like dealing cards
Row/Partition assignment will be the same for a given $APT_CONFIG_FILE
Example: rows 0 through 8 are dealt round robin into three partitions as (0, 3, 6), (1, 4, 7), and (2, 5, 8)
Random has slightly higher overhead, but assigns rows in a non-deterministic fashion between job runs
ENTIRE Partitioning
Each partition gets a complete copy of the data
Useful for distributing lookup and reference data May have performance impact in MPP / clustered environments On SMP platforms, Lookup stage (only) uses shared memory instead of duplicating ENTIRE reference data On MPP platforms, each server uses shared memory for a single local copy
Keyless
Example: rows 0 through 8 are copied in full to every partition
ENTIRE is the default partitioning for Lookup reference links with Auto partitioning
On SMP platforms, it is a good practice to set this explicitly on the Normal Lookup reference link(s)
HASH Partitioning
Keyed partitioning method Rows are distributed according to the values in key columns
Guarantees that rows with same key values go into the same partition Needed to prevent matching rows from hiding in other partitions
Example: key column values 0 3 2 1 0 2 3 2 1 1 are hashed into three partitions as (0, 3, 0, 3), (1, 1, 1), and (2, 2, 2), so all rows with the same key value land in the same partition
Partition distribution is relatively equal if the data across the source key column(s) is evenly distributed
Modulus Partitioning
Keyed partitioning method Rows are distributed according to the values in one integer key column
Uses modulus
partition = MOD(key_value, #partitions)
Example: key column values 0 3 2 1 0 2 3 2 1 1 with three partitions are assigned as (0, 3, 0, 3), (1, 1, 1), and (2, 2, 2)
Faster than HASH Guarantees that rows with identical key values go in the same partition Partition size is relatively equal if the data within the key column is evenly distributed
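For example, with three partitions a key value of 3 goes to partition MOD(3, 3) = 0 and a key value of 2 goes to partition MOD(2, 3) = 2, which is how the assignment shown above is produced.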
Auto Partitioning
DataStage inserts partition components as necessary to ensure correct results
Before any stage with Auto partitioning
Generally chooses ROUND-ROBIN or SAME
Inserts HASH on stages that require matched key values (e.g., Join, Merge, Remove Duplicates)
Inserts ENTIRE on Normal (not Sparse) Lookup reference links
NOT always appropriate for MPP/clusters
Since DataStage has limited awareness of your data and business rules, explicitly specify HASH partitioning when needed
DataStage has no visibility into Transformer logic
Hash partitioning is required before Sort and Aggregator stages
DataStage sometimes inserts unneeded partitioning; check the log
Partition imbalances
If all the records are going down only one of the nodes, then the job is in effect running sequentially
Example of a partition imbalance when partitioning on LName:

Source data:
LName   FName     Address
Ford    Henry     66 Edison Avenue
Ford    Clara     66 Edison Avenue
Ford    Edsel     7900 Jefferson
Ford    Eleanor   7900 Jefferson
Dodge   Horace    17840 Jefferson
Dodge   John      75 Boston Boulevard
Ford    Henry     4901 Evergreen
Ford    Clara     4901 Evergreen
Ford    Edsel     1100 Lakeshore
Ford    Eleanor   1100 Lakeshore

One partition receives only the two Dodge rows:
LName   FName     Address
Dodge   Horace    17840 Jefferson
Dodge   John      75 Boston Boulevard

Partition 1 receives all eight Ford rows:
ID   LName   FName     Address
1    Ford    Henry     66 Edison Avenue
2    Ford    Clara     66 Edison Avenue
3    Ford    Edsel     7900 Jefferson
4    Ford    Eleanor   7900 Jefferson
7    Ford    Henry     4901 Evergreen
8    Ford    Clara     4901 Evergreen
9    Ford    Edsel     1100 Lakeshore
10   Ford    Eleanor   1100 Lakeshore
Partitioning icon
Collecting icon
SAME partitioner
Re-partition
watch for this!
AUTO partitioner
Partitioning Tab
Input tab Key specification
Algorithms
Collecting Specification
Key specification
Algorithms
Runtime Architecture
Compile
DataStage server
A custom operator built during the compile Compiled into C++ and then to corresponding native operators
Thus a C++ compiler is needed to compile jobs with a Transformer stage
Executable Job
Transformer Components
Generated OSH
Enable viewing of generated OSH in Administrator:
DataSet: copy
Sort: tsort
Aggregator: group
Row Generator, Column Generator, Surrogate Key Generator: generator
Oracle:
Source: oraread
Sparse Lookup: oralookup
Target Load: orawrite
Target Upsert: oraupsert
Syntax
####################################################
#### STAGE: Row_Generator_0
## Operator
generator
## Operator options
-schema record
(
  a:int32;
  b:string[max=12];
  c:nullable decimal[10,2] {nulls=10};
)
-records 50000
## General options
[ident('Row_Generator_0'); jobmon_ident('Row_Generator_0')]
## Outputs
0> [] 'Row_Generator_0:lnk_gen.v'
;
####################################################
#### STAGE: SortSt
## Operator
tsort
## Operator options
-key 'a' -asc
## General options
[ident('SortSt'); jobmon_ident('SortSt'); par]
## Inputs
0< 'Row_Generator_0:lnk_gen.v'
## Outputs
0> [modify (
  keep a,b,c;
)] 'SortSt:lnk_sorted.v'
;
For every operator, input and/or output datasets are numbered sequentially starting from 0, e.g.:
op1 0> dst
op1 1< src
Virtual datasets are generated to connect operators
Job Score
Generated from OSH and configuration file Think of Score as in musical score, not game score Assigns nodes for each operator Inserts sorts and partitioners as needed Defines connection topology (virtual datasets) between adjacent operators Inserts buffer operators to prevent deadlocks Defines the actual processes
[Diagram: virtual datasets connect the operators; each processing node runs a Section Leader (SL) that manages the players on that node]
Players
The actual processes associated with stages Combined players: one process only Sends stderr, stdout to Section Leader
Establish connections to other players for data flow Clean up upon completion
[Diagram: one Section Leader per partition (Section Leader,0 through Section Leader,2), each managing its players: generator,0 through generator,2 and copy,0 through copy,2]
Checkpoint
1. What file defines the degree of parallelism a job runs under? 2. Name two partitioning algorithms that partition based on key values? 3. What partitioning algorithm produces the most even distribution of data in the various partitions?
Checkpoint solutions
1. Configuration file. 2. Hash, Modulus, DB2. 3. Round robin or Entire.
Unit summary
Having completed this unit, you should be able to: Describe parallel processing architecture Describe pipeline parallelism Describe partition parallelism List and describe partitioning and collecting algorithms Describe configuration files Describe the parallel job compilation process Explain OSH Explain the Score
Unit objectives
After completing this unit, you should be able to: Combine data using the Lookup stage Define range lookups Combine data using Merge stage Combine data using the Join stage Combine data using the Funnel stage
Combining Data
Ways to combine data: Horizontally:
Multiple input links One output link made of columns from different input links. Joins Lookup Merge
Vertically:
One input link, one output link combining groups of related records into a single record Aggregator Remove Duplicates
Lookup Stage
Lookup Features
One stream input link (source link)
Multiple reference links
One output link
Reject link: only available when the lookup failure option is set to Reject
Lookup failure options: Continue, Drop, Fail, Reject
Can return multiple matching rows
Hash file is built in memory from the lookup files
Indexed by key
Should be small enough to fit into physical memory
Lookup Types
Equality match
Match exactly values in the lookup key column of the reference link to selected values in the source row Return row or rows (if multiple matches are to be returned) that match
Caseless match
Like an equality match except that it's caseless, e.g., abc matches AbC
Lookup Example
Column definitions
Source link
Revolution   Citizen
1789         Lefty
1776         M_B_Dextrous
Reference link
Citizen        Exchange
M_B_Dextrous   Nasdaq
Righty         NYSE
Lookup Stage
Output of Lookup with Continue option
Revolution   Citizen        Exchange
1789         Lefty
1776         M_B_Dextrous   Nasdaq
Lookup stage
Retrieve description
Source values
Reference link
Select operators
Join Stage
Join Stage
Four types of joins:
Inner
Left outer
Right outer
Full outer
Left link and a right link; supports additional intermediate links
Little memory required, because of the sort requirement
Key column names for each input link must match
Join stage
Inner Join
Transfers rows from both data sets whose key columns have matching values
Output of inner join on key Citizen:
Revolution   Citizen        Exchange
1776         M_B_Dextrous   Nasdaq
Merge Stage
Merge Stage
Light-weight
Follows the Master-Update model:
A master row and one or more update rows are merged iff they have the same values in the user-specified key column(s)
Matched update rows are consumed
Unmatched master rows can be kept or dropped
Unmatched ("bad") update rows in input link n can be captured in a "reject" link in the corresponding output link n
If a non-key column name occurs in several inputs, the lowest input port number prevails (e.g., master over update); the update values are ignored
Merge stage
[Diagram: Merge stage with the Master link on input port 0 and an Update link on input port 1; merged Output on output port 0, Rejects on output port 1]
Allows composite keys
Multiple update links
Matched update rows are consumed
Unmatched updates in input port n can be captured in output port n
Lightweight
Comparison of Joins, Lookup, and Merge:

                                  Joins                              Lookup                              Merge
Model                             RDBMS-style relational             Source - in RAM LU Table            Master - Update(s)
Memory usage                      light                              heavy                               light
# and names of inputs             2 or more: left, right             1 Source, N LU Tables               1 Master, N Update(s)
Mandatory input sort              all inputs                         no                                  all inputs
Duplicates in primary input       OK                                 OK                                  Warning!
Duplicates in secondary input(s)  OK                                 Warning!                            OK only when N = 1
Options on unmatched primary      Keep (left outer), Drop (Inner)    [fail] | continue | drop | reject   [keep] | drop
Options on unmatched secondary    Keep (right outer), Drop (Inner)   NONE                                capture in reject set(s)
On match, secondary entries are   captured                           captured                            consumed
# Outputs                         1                                  1 out, (1 reject)                   1 out, (N rejects)
Captured in reject set(s)         Nothing (N/A)                      unmatched primary entries           unmatched secondary entries
Funnel Stage
Checkpoint
1. Name three stages that horizontally join data? 2. Which stage uses the least amount of memory? Join or Lookup? 3. Which stage requires that the input data is sorted? Join or Lookup?
Checkpoint solutions
1. Lookup, Merge, Join 2. Join 3. Join
Unit summary
Having completed this unit, you should be able to: Combine data using the Lookup stage Define range lookups Combine data using Merge stage Combine data using the Join stage Combine data using the Funnel stage
Unit objectives
After completing this unit, you should be able to: Sort data using in-stage sorts and Sort stage Combine data using Aggregator stage Combine data Remove Duplicates stage
Sort Stage
Sorting Data
Uses
Some stages require sorted input: the Join and Merge stages require sorted input
Some stages use less memory with sorted input, e.g., Aggregator
Sorting Alternatives
Sort stage
In-Stage Sorting
Partitioning tab
Sort key
Sort Stage
Sort key
Sort options
Sort keys
Add one or more keys
Specify the sort mode for each key:
Sort: sort by this key
Don't sort (previously sorted): assumes the data has already been sorted on this key; continue sorting by any secondary keys
Sort Options
Sort Utility
DataStage: the default
UNIX: don't use; slower than the DataStage sort utility
Stable Sort
Input:
Key  Col
4    X
3    Y
1    K
3    C
2    P
3    D
1    A
2    L

After a stable sort on Key (rows with equal key values keep their input order):
Key  Col
1    K
1    A
2    P
2    L
3    Y
3    C
3    D
4    X

Same output with a key-change column (K_C = 1 on the first row of each key group):
Key  Col  K_C
1    K    1
1    A    0
2    P    1
2    L    0
3    Y    1
3    C    0
3    D    0
4    X    1
Example:
Partition on HouseHoldID, sort on HouseHoldID, EntryDate Partitioning on HouseHoldID ensures that the same ID will not be spread across multiple partitions Sorting orders the records with the same ID by entry date
Useful for deciding which of a group of duplicate records with the same ID should be retained
Aggregator Stage
Aggregator Stage
Purpose: Perform data aggregations Specify: One or more key columns that define the aggregation units (or groups) Columns to be aggregated Aggregation functions include, among many others:
Count (nulls/non-nulls) Sum Max / Min / Range
Aggregator stage
Aggregation Types
Count rows
Count rows in each group Put result in a specified output column
Calculation
Select a column
Put the result of the calculation in a specified output column
Calculations include: Sum; Count; Min, max; Mean; Missing value count; Non-missing value count; Percent coefficient of variation
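As a small illustration (hypothetical data): grouping on a Store column and summing an Amount column turns the input rows (A, 10), (A, 25), (B, 5) into the output rows (A, 35), (B, 5).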
Grouping Methods
Hash (default)
Calculations are made for all groups and stored in memory (hash table structure, hence the name)
Results are written out after all input has been processed
Input does not need to be sorted
Useful when the number of unique groups is small: the running tally for each group's aggregations needs to fit into memory
Sort
Requires the input data to be sorted by the grouping keys
Does not perform the sort itself; it expects the input to arrive already sorted
Only a single aggregation group is kept in memory; when a new group is seen, the current group is written out
Can handle unlimited numbers of groups
[Diagram: the hash method holds all groups in memory at once: (1K, 1A), (2P, 2L), (3Y, 3C, 3D), (4X); the sort method processes the sorted rows 1K, 1A, 2P, 2L, 3Y, 3C, 3D, 4X one group at a time]
Removing Duplicates
Can be done by Sort stage
Use the unique option
No choice on which duplicate to keep:
Stable sort always retains the first row in the group
Non-stable sort is indeterminate
OR
Checkpoint
1. What stage is used to perform calculations of column values grouped in specified ways? 2. In what two ways can sorts be performed? 3. What is a stable sort?
Checkpoint solutions
1. Aggregator stage 2. Using the Sort stage. In-stage sorts. 3. Stable sort preserves the order of non-key values.
Unit summary
Having completed this unit, you should be able to: Sort data using in-stage sorts and Sort stage Combine data using Aggregator stage Combine data Remove Duplicates stage
Unit objectives
After completing this unit, you should be able to: Use the Transformer stage in parallel jobs Define constraints Define derivations Use stage variables Create a parameter set and use its parameters in constraints and derivations
Transformer Stage
Transformer Stage
Column mappings Derivations Constraints
Written in Basic; final compiled code is C++ generated object code
Uses:
Filter data
Direct data down different output links for different processing or storage
Expressions can reference:
Input columns
Job parameters
Functions
System variables and constants
Stage variables
External routines
Single input
Reject link
Multiple outputs
Defining a Constraint
Input column
Defining a Derivation
Input column
String in quotes
Example:
Suppose the source column is named In.OrderID and the target column is named Out.OrderID
To replace In.OrderID values of 3000 by 4000, the derivation for Out.OrderID would be:
IF In.OrderID = 3000 THEN 4000 ELSE In.OrderID
UpCase(<string>) / DownCase(<string>)
Example: UpCase(In.Description) returns ORANGE JUICE
Len(<string>)
Example: Len(In.Description) returns 12
Can be handled in constraints, derivations, stage variables, or a combination of these
NULL functions:
Testing for NULL: IsNull(<column>), IsNotNull(<column>)
NullToValue(<column>, <value>)
Example: IF In.Col = 5 THEN SetNull() ELSE In.Col
Transformer Functions
Date & Time Logical Null Handling Number String Type Conversion
Multi-purpose
Counters
Store values from previously read rows to make comparisons with the currently read row
Store derived values to be used in multiple target field derivations
Can be used to control execution of constraints
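A sketch of how stage variables might be used to compare the current row with the previous one (column and variable names are hypothetical). Because stage variables are evaluated top to bottom for each row, the comparison variable is listed before the variable that stores the current key:

svIsNewCust:  IF In.CustID = svPrevCustID THEN 0 ELSE 1
svPrevCustID: In.CustID

svIsNewCust can then be referenced in an output constraint or derivation, for example to pass only the first row for each customer.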
Reject link
When the result of a link constraint or output derivation is NULL, the Transformer will output that row to the reject link. If there is no reject link, the job will abort!
Standard practices:
Always create a reject link
Always test for NULL values in expressions with nullable columns: IF IsNull(link.col) THEN ... ELSE ...
The Transformer issues warnings for rejects
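For example (column name hypothetical), a derivation feeding a non-nullable target column might read: IF IsNull(link.Amount) THEN 0 ELSE link.Amount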
Otherwise Link
Otherwise link
Last in order
Parameter Sets
Parameter Sets
Store a collection of parameters in a named object One or more values files can be named and specified
A values file stores values for specified parameters Values are picked up at runtime
Parameter Sets can be added to the job parameters specified on the Parameters tab in the job properties
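Once a set is added, an individual parameter from it is referenced as #SetName.ParameterName#, for example #MySet.NumRows# (the set and parameter names here are hypothetical).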
Parameters Tab
Specify parameters
Values Tab
Checkpoint
1. What occurs first? Derivations or constraints? 2. Can stage variables be referenced in constraints? 3. Where should you test for NULLS within a Transformer? Stage Variable derivations or output column derivations?
Checkpoint solutions
1. Constraints. 2. Yes. 3. Stage variable derivations. Reference the stage variables in the output column derivations.
Unit summary
Having completed this unit, you should be able to: Use the Transformer stage in parallel jobs Define constraints Define derivations Use stage variables Create a parameter set and use its parameters in constraints and derivations
Unit objectives
After completing this unit, you should be able to: Perform a simple Find Perform an Advanced Find Perform an impact analysis Compare the differences between two Table Definitions Compare the differences between two jobs
Quick Find
Name with wild card character (*)
Execute Find
Found Results
Found item
Filter search
Options
Case sensitivity Search within last result set
Impact Analysis
Graphical functionality
Display the dependency path
Collapse selected objects
Move the graphical object
Bird's-eye view