August 7, 2021 2
Types of Parallelism
[Diagram: Operator 5, Operator 6, Operator 7]
• Partitioning: rows split across different processes, each performing the same logic
[Diagram: Data, Partitioned Data, Logical Unit P1]
Note that these options also depend on
• support from the hardware & OS
• server settings & configurations
DataStage Enterprise Edition
• Pipelining & Partitioning Combined
Pipeline processed for read, process & write
[Diagram labels: Oracle]
Pipelining
• By default, operators are combined into a single process where possible
• This can be overridden at a project or stage level, if we wish to create separate processes for each operator
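The idea of pipelining can be sketched with Python generators: each stage hands a row to the next stage as soon as it is ready, instead of waiting for the whole dataset. This is only an illustration inside one Python process, not how DataStage runs operators; the function names (read_rows, transform, write_rows, run_pipeline) are hypothetical.

```python
from typing import Iterable, Iterator, List


def read_rows(source: Iterable[str]) -> Iterator[str]:
    """Read stage: yield rows one at a time instead of loading them all."""
    for row in source:
        yield row


def transform(rows: Iterable[str]) -> Iterator[str]:
    """Process stage: handle each row as soon as the reader yields it."""
    for row in rows:
        yield row.strip().upper()


def write_rows(rows: Iterable[str], target: List[str]) -> int:
    """Write stage: consume rows as they arrive and return the row count."""
    count = 0
    for row in rows:
        target.append(row)
        count += 1
    return count


def run_pipeline(source: Iterable[str]) -> List[str]:
    """Chain read -> process -> write; no stage buffers the full dataset."""
    target: List[str] = []
    write_rows(transform(read_rows(source)), target)
    return target
```

Because the stages are chained lazily, the first row can reach the write stage before the last row has been read.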
Data Partitioning
Default
• The configuration file decides how many process instances of each operator are created,
e.g. if 4 nodes are defined, there is a 4-way partition of data
• By default, Auto-Partitioning is set
• DS chooses the optimum partitioning & repartitioning mechanism
• “Round-robin” is applied at the first level followed by “Same”
• If there is a need for key-based partition upstream or down-stream, then alternative
modes are chosen
• e.g. in the case of a join, the data in the input link is sorted & partitioned by the join
key
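The difference between round-robin and key-based partitioning can be sketched as follows. This is a minimal illustration, not DataStage's own algorithm: the function names are hypothetical, and the use of zlib.crc32 as the hash is an assumption chosen only because it is deterministic.

```python
import zlib
from typing import List, Sequence, Tuple

Row = Tuple[str, int]


def round_robin_partition(rows: Sequence[Row], n: int) -> List[List[Row]]:
    """Deal rows out in turn, giving evenly sized partitions."""
    parts: List[List[Row]] = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def hash_partition(rows: Sequence[Row], n: int, key_index: int = 0) -> List[List[Row]]:
    """Send every row with the same key value to the same partition."""
    parts: List[List[Row]] = [[] for _ in range(n)]
    for row in rows:
        # crc32 is an assumed stand-in for whatever hash the engine uses.
        digest = zlib.crc32(str(row[key_index]).encode("utf-8"))
        parts[digest % n].append(row)
    return parts


def keys_are_colocated(parts: List[List[Row]], key_index: int = 0) -> bool:
    """Return True if no key value appears in more than one partition."""
    seen = {}
    for p, part in enumerate(parts):
        for row in part:
            if seen.setdefault(row[key_index], p) != p:
                return False
    return True
```

Round-robin balances the row counts but scatters a key across partitions, which is why a join needs key-based partitioning on the join key.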
Degree of parallelism
• This is decided by the Configuration file.
• The configuration file used can be varied at the job-level to suit different job
requirements
• Individual stages may also be executed on selected nodes by specifying the node map
constraints
• Where the overhead of partitioning is not worth the performance improvement, the entire
job or a specific stage may be executed sequentially.
• System size & configuration details are maintained external to the job design
• They can be modified to suit development & production environments, handle hardware
upgrades, etc. without redesigning/recompiling jobs
• The configuration file describes available processing power in terms of processing
nodes
• It determines how many instances of a process will be produced when you
compile a parallel job.
• Minimum recommended #Nodes = ½ × #CPUs
• Usual starting point: #Nodes = #CPUs
• #Nodes < #CPUs if some CPUs are left free for the OS, DB and other
applications
• #Nodes > #CPUs for I/O-intensive streams with poor CPU usage
Configuration File
Sample
{
    node "node1"
    {
        fastname "ibmsceai"
        pools ""
        resource disk "/home/dsadm/Ascential/DataStage/Datasets" {pools ""}
        resource scratchdisk "/home/dsadm/Ascential/DataStage/Scratch" {pools ""}
    }
    node "node2"
    {
        fastname "ibmsceai"
        pools ""
        resource disk "/home/dsadm/Ascential/DataStage/Datasets" {pools ""}
        resource scratchdisk "/home/dsadm/Ascential/DataStage/Scratch" {pools ""}
    }
}
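Since the number of nodes in the configuration file sets the degree of parallelism, a file for a different node count can be generated instead of edited by hand. The sketch below emits text in the same layout as the sample; build_config and its parameters are hypothetical helpers, not part of DataStage, and every node is assumed to share one host and the same disk paths.

```python
def build_config(node_count: int, fastname: str, disk: str, scratch: str) -> str:
    """Return configuration-file text with node_count identical nodes."""
    lines = ["{"]
    for i in range(1, node_count + 1):
        lines += [
            f'    node "node{i}"',
            "    {",
            f'        fastname "{fastname}"',
            '        pools ""',
            f'        resource disk "{disk}" {{pools ""}}',
            f'        resource scratchdisk "{scratch}" {{pools ""}}',
            "    }",
        ]
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Example: a 4-node file on the host used in the sample above.
    print(build_config(
        4,
        "ibmsceai",
        "/home/dsadm/Ascential/DataStage/Datasets",
        "/home/dsadm/Ascential/DataStage/Scratch",
    ))
```

Keeping such files outside the job design is what allows the same job to run 2-way in development and 4-way in production without recompiling.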
Partitioning & Collecting
Collecting
• Round Robin
• Ordered – Read all records from first partition, then from second and so on
• Sorted Merge – Read records in the order defined by one or more columns (collecting key)
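The three collection methods can be sketched in Python as ways of combining several partitions into one stream. The function names are hypothetical, and the sorted-merge sketch assumes each partition is already sorted on the collecting key.

```python
import heapq
from itertools import chain, zip_longest
from typing import Any, Callable, List, Sequence

_MISSING = object()  # marker for partitions that have run out of rows


def collect_round_robin(parts: Sequence[Sequence[Any]]) -> List[Any]:
    """Take one record from each partition in turn."""
    out: List[Any] = []
    for group in zip_longest(*parts, fillvalue=_MISSING):
        out.extend(r for r in group if r is not _MISSING)
    return out


def collect_ordered(parts: Sequence[Sequence[Any]]) -> List[Any]:
    """Read all records from the first partition, then the second, and so on."""
    return list(chain.from_iterable(parts))


def collect_sorted_merge(parts: Sequence[Sequence[Any]],
                         key: Callable[[Any], Any]) -> List[Any]:
    """Merge partitions that are each pre-sorted on the collecting key."""
    return list(heapq.merge(*parts, key=key))
```

Only the sorted merge preserves a global sort order, and only if every input partition was sorted on the same key beforehand.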
Discussion on Partitioning Data Within A Job
Partitioning
Partitioning
Sequential Files - Read
• Normally read in sequence, i.e. one process reads the data & passes it to the next stage
• Output data is partitioned round-robin into <#Nodes> partitions
Processing Stages
• Receive partitioned data & propagate it using the “Same” method
Partitioning
Sequential Files
Partitioning
• Back to EE_TRG_Demo_1
• Note that we did not set any specific options for parallelism or partitioning
Partitioning
** Note that in this case the data output from the aggregator is not partitioned again, since it is already in the
required partitioning format. It is only sorted.
Partitioning
• Look at the link icons to identify where partitioning, explicit repartitioning & collection have
occurred
• Tune parallelism
• through the configuration file
• by running specific stages sequentially or on selected node pool(s)
• by changing the partition mode
• by enabling or disabling Operator Combinability, etc.
Partitioning
Sequential/Parallel
Partitioning
• If a stage is executed sequentially & the preceding stage is parallel, then the “Collection” options are available
Partitioning
Grp Key  Amt Val    Grp Key  Amt Val
A        40         A        40
B        50         B        120
B        60         A        60
A        20         A        60
B        70         B        140
A        40         B        140
B        80         B        60
B        80
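Why a grouped total needs key-based partitioning can be sketched as follows. The rows are illustrative values in the style of the Grp Key / Amt Val columns, not a reconstruction of the slide, and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Row = Tuple[str, int]

# Illustrative (Grp Key, Amt Val) rows; assumed for this sketch.
ROWS: List[Row] = [("A", 40), ("B", 50), ("B", 60), ("A", 20),
                   ("B", 70), ("A", 40), ("B", 80)]


def aggregate(partition: Sequence[Row]) -> Dict[str, int]:
    """Sum Amt Val per Grp Key within a single partition."""
    totals: Dict[str, int] = defaultdict(int)
    for key, amount in partition:
        totals[key] += amount
    return dict(totals)


def split_by_position(rows: Sequence[Row], n: int) -> List[List[Row]]:
    """Round-robin style split that ignores the group key."""
    return [list(rows[i::n]) for i in range(n)]


def split_by_key(rows: Sequence[Row], n: int) -> List[List[Row]]:
    """Key-based split: all rows of one group land in one partition."""
    ordered_keys = sorted({key for key, _ in rows})
    slot = {key: i % n for i, key in enumerate(ordered_keys)}
    parts: List[List[Row]] = [[] for _ in range(n)]
    for row in rows:
        parts[slot[row[0]]].append(row)
    return parts
```

With two partitions, the position-based split leaves partial totals for both keys in each partition, while the key-based split yields the complete totals A = 100 and B = 260, one key per partition.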
Case Study 2