Parallelism Partitioning Techniques

Parallelism
 Component parallelism
 Pipeline parallelism
 Data parallelism
Component Parallelism
Sorting Customers
Sorting Transactions
Component Parallelism
 Comes “for free” with graph programming.
 Limitation:
– Scales to number of “branches” in a graph.
Pipeline Parallelism
Processing Record: 100
Processing Record: 99
Pipeline Parallelism
 Comes “for free” with graph programming.
 Limitations:
– Scales to length of “branches” in a graph.
– Some operations, like sorting, do not pipeline.
Data Parallelism
[Figure label: Partitions]
Two Ways of Looking at Data Parallelism
Expanded View:
Global View:
Data Parallelism
 Scales with data.
 Requires data partitioning.
 Different partitioning methods for different operations.
Data Partitioning
Expanded View:
Global View:
Data Partitioning:
The Global View
Degree of Parallelism
Fan-out Flow
Partitioning
Partitioning Review
Fan-out Flow
 For the various partitioning components:
– Is it key-based? Does the problem require a key-based partition?
– Performance: Are the partitions balanced or skewed?
Partitioning: Performance
[Figure: Partitions 0 through 3, shown once balanced and once skewed]
Balanced: Processors get neither too much nor too little.
Skewed: Some processors get too much, others too little.
Sample Data to be Partitioned
 Customers
42John 02116 30
43Mark 02114 9
44Bob 02116 8
45Sue 02241 92
46Rick 02116 23
47Bill 02114 14
48Mary 02116 38
49Jane 02241 2

record
  decimal(2) id;
  string(5) name;
  decimal(5) zipcode;
  decimal(3) amount;
  string(1) newline;
end
Partition by Round-robin
Partition 0        Partition 1        Partition 2
Customers          Customers          Customers
42John 02116 30    43Mark 02114 9     44Bob 02116 8
45Sue 02241 92     46Rick 02116 23    47Bill 02114 14
48Mary 02116 38    49Jane 02241 2
Partition by Round-robin
 Not key-based.
 Results in very well balanced data, especially with block-size of 1.
 Useful for record-independent parallelism.
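The round-robin distribution shown on the previous slide can be sketched in Python. This is only an illustration of the idea, not Ab Initio code: the function name `partition_round_robin` and the `block_size` argument are assumptions chosen to mirror the "block-size" mentioned above.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def partition_round_robin(records: Sequence[T], n_partitions: int,
                          block_size: int = 1) -> List[List[T]]:
    """Deal records to partitions in turn, block_size records at a time.

    Hypothetical helper for illustration; no key is inspected.
    """
    if n_partitions < 1 or block_size < 1:
        raise ValueError("n_partitions and block_size must be positive")
    partitions: List[List[T]] = [[] for _ in range(n_partitions)]
    for position, record in enumerate(records):
        # Each block of block_size consecutive records goes to the next partition.
        partitions[(position // block_size) % n_partitions].append(record)
    return partitions


customers = [
    "42John 02116 30", "43Mark 02114 9", "44Bob 02116 8", "45Sue 02241 92",
    "46Rick 02116 23", "47Bill 02114 14", "48Mary 02116 38", "49Jane 02241 2",
]

parts = partition_round_robin(customers, 3)
for number, partition in enumerate(parts):
    print(f"Partition {number}: {partition}")
```

With a block size of 1 the partition sizes differ by at most one record, which is why the balance is so good.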
Partition by Key
partition on zipcode:
Customers Customers
43Mark 02114 9 42John 02116 30
45Sue 02241 92 44Bob 02116 8
47Bill 02114 14 46Rick 02116 23
49Jane 02241 2 48Mary 02116 38
Partition by Key often followed by a Sort
Sort on zipcode:
Customers Customers
43Mark 02114 9 42John 02116 30
47Bill 02114 14 44Bob 02116 8
45Sue 02241 92 46Rick 02116 23
49Jane 02241 2 48Mary 02116 38
Rollup by zipcode:
Totals by Zipcode Totals by Zipcode
02114 23 02116 99
02241 94
Partition by Key
 Key-based.
 Usually results in well balanced data.
 Useful for key-dependent parallelism.
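A minimal Python sketch of key-based partitioning followed by a per-partition sort and rollup. The hash function (`zlib.crc32`) and the helper names are assumptions for illustration; the source does not say how Ab Initio assigns keys to partitions, so the partition numbers here need not match the slide. What matters is that all records with the same zipcode land in the same partition.

```python
import zlib
from collections import defaultdict

# (id, name, zipcode, amount) tuples taken from the sample data.
customers = [
    (42, "John", "02116", 30), (43, "Mark", "02114", 9),
    (44, "Bob", "02116", 8), (45, "Sue", "02241", 92),
    (46, "Rick", "02116", 23), (47, "Bill", "02114", 14),
    (48, "Mary", "02116", 38), (49, "Jane", "02241", 2),
]


def partition_by_key(records, key_index, n_partitions):
    """Send each record to the partition chosen by a hash of its key (assumed scheme)."""
    partitions = [[] for _ in range(n_partitions)]
    for record in records:
        key_bytes = str(record[key_index]).encode("utf-8")
        partitions[zlib.crc32(key_bytes) % n_partitions].append(record)
    return partitions


def rollup_amount_by_zipcode(partition):
    """Sort one partition on zipcode, then total the amounts per zipcode."""
    totals = defaultdict(int)
    for _id, _name, zipcode, amount in sorted(partition, key=lambda r: r[2]):
        totals[zipcode] += amount
    return dict(totals)


parts = partition_by_key(customers, key_index=2, n_partitions=2)
totals_per_partition = [rollup_amount_by_zipcode(p) for p in parts]
print(totals_per_partition)
```

Because no zipcode is split across partitions, each partition can compute its totals independently and the per-partition results are already final.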
Partition by Expression
Expression: amount/33
42John 02116 30    48Mary 02116 38    45Sue 02241 92
43Mark 02114 9
44Bob 02116 8
46Rick 02116 23
47Bill 02114 14
49Jane 02241 2
Partition by Expression
 Key-based, depending on the expression.
 Resulting balance very dependent on expression and on data.
 Various application-dependent uses.
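The `amount/33` example can be reproduced with a short Python sketch. It assumes the expression uses truncating integer division, which is what the slide's result implies; the helper name `partition_by_expression` is hypothetical.

```python
# (id, name, zipcode, amount) tuples taken from the sample data.
customers = [
    (42, "John", "02116", 30), (43, "Mark", "02114", 9),
    (44, "Bob", "02116", 8), (45, "Sue", "02241", 92),
    (46, "Rick", "02116", 23), (47, "Bill", "02114", 14),
    (48, "Mary", "02116", 38), (49, "Jane", "02241", 2),
]


def partition_by_expression(records, expression, n_partitions):
    """Route each record to the partition number returned by expression(record)."""
    partitions = [[] for _ in range(n_partitions)]
    for record in records:
        index = expression(record)
        if not 0 <= index < n_partitions:
            raise ValueError(f"expression returned {index} for {record}")
        partitions[index].append(record)
    return partitions


# amount // 33 is the assumed integer-division reading of "amount/33".
parts = partition_by_expression(customers, lambda r: r[3] // 33, 3)
for number, partition in enumerate(parts):
    print(number, [f"{r[0]}{r[1]}" for r in partition])
```

Six of the eight records end up in partition 0, which shows how strongly the balance depends on both the expression and the data.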
Partition by Range
With splitter values of 9 and 23:

43Mark 02114 9     46Rick 02116 23    42John 02116 30
44Bob 02116 8      47Bill 02114 14    45Sue 02241 92
49Jane 02241 2                        48Mary 02116 38
Range+Sort: Global Ordering
Sort following a partition by range:

49Jane 02241 2     47Bill 02114 14    42John 02116 30
44Bob 02116 8      46Rick 02116 23    48Mary 02116 38
43Mark 02114 9                        45Sue 02241 92
Partition by Range
 Key-based.
 Resulting balance dependent on set of
splitters chosen.
 Useful for “binning” and global sorting.
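A Python sketch of range partitioning with the splitter values 9 and 23, followed by a sort within each partition. It assumes a record whose key equals a splitter goes to the lower partition, which matches the slide (amount 9 is in the first partition, amount 23 in the second); the helper name is hypothetical.

```python
from bisect import bisect_left

# (id, name, zipcode, amount) tuples taken from the sample data.
customers = [
    (42, "John", "02116", 30), (43, "Mark", "02114", 9),
    (44, "Bob", "02116", 8), (45, "Sue", "02241", 92),
    (46, "Rick", "02116", 23), (47, "Bill", "02114", 14),
    (48, "Mary", "02116", 38), (49, "Jane", "02241", 2),
]


def partition_by_range(records, key, splitters):
    """Partition i receives keys in (splitters[i-1], splitters[i]]; splitters must be sorted."""
    partitions = [[] for _ in range(len(splitters) + 1)]
    for record in records:
        partitions[bisect_left(splitters, key(record))].append(record)
    return partitions


def amount(record):
    return record[3]


parts = partition_by_range(customers, amount, [9, 23])
sorted_parts = [sorted(p, key=amount) for p in parts]

# Reading the sorted partitions in partition order yields a global ordering.
global_order = [record for partition in sorted_parts for record in partition]
print([amount(r) for r in global_order])
```

Each partition sorts only its own range, yet the overall result is globally ordered because the ranges do not overlap.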
Partition with Load Balance
if middle node highly loaded:

42John 02116 30 45Sue 02241 92 46Rick 02116 23
43Mark 02114 9 47Bill 02114 14
44Bob 02116 8 48Mary 02116 38
49Jane 02241 2
Partition by Load Balance
 Not key-based.
 Results in skewed data distribution to
complement skewed load.
Partition with Percentage
With percentages: 4, 20
42John 02116 30    46Rick 02116 23    ...
43Mark 02114 9     47Bill 02114 14
44Bob 02116 8      48Mary 02116 38
45Sue 02241 92     49Jane 02241 2
The next 16 records would go to the second partition, and the next 76 records would go to the third.
Partition by Percentage
 Not key-based
 Results in usually skewed data distribution
conforming to the provided percentages.
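One way to picture the percentage example is the Python sketch below. It assumes the percentages are applied to successive runs of 100 records (the first 4 to partition 0, the next 20 to partition 1, the remaining 76 to the last partition), which is how the slide lays the records out; the actual component may count differently, and the helper name is hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def partition_by_percentage(records, percentages):
    """Distribute records by position; the last partition takes the remainder.

    Assumed scheme: percentages are applied within each run of 100 records.
    """
    if any(p < 0 for p in percentages) or sum(percentages) > 100:
        raise ValueError("percentages must be non-negative and sum to at most 100")
    boundaries = list(accumulate(percentages))  # [4, 20] -> [4, 24]
    partitions = [[] for _ in range(len(percentages) + 1)]
    for position, record in enumerate(records):
        slot = position % 100
        partitions[bisect_right(boundaries, slot)].append(record)
    return partitions


customers = [
    "42John 02116 30", "43Mark 02114 9", "44Bob 02116 8", "45Sue 02241 92",
    "46Rick 02116 23", "47Bill 02114 14", "48Mary 02116 38", "49Jane 02241 2",
]

parts = partition_by_percentage(customers, [4, 20])
hundred = partition_by_percentage(list(range(100)), [4, 20])
print([len(p) for p in parts], [len(p) for p in hundred])
```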
Broadcast (as a Partitioner)
Unlike all other partitioners, which write a record to ONE output
flow, Broadcast writes each record to EVERY output flow.

42John 02116 30 42John 02116 30 42John 02116 30
43Mark 02114 9 43Mark 02114 9 43Mark 02114 9
44Bob 02116 8 44Bob 02116 8 44Bob 02116 8
45Sue 02241 92 45Sue 02241 92 45Sue 02241 92
46Rick 02116 23 46Rick 02116 23 46Rick 02116 23
47Bill 02114 14 47Bill 02114 14 47Bill 02114 14
48Mary 02116 38 48Mary 02116 38 48Mary 02116 38
49Jane 02241 2 49Jane 02241 2 49Jane 02241 2
Broadcast
 Not key-based
 Results in perfectly balanced partitions
De-Partitioning
Departitioning
Departitioning combines many flows of data to produce one flow. It is the opposite of partitioning.
Each departition component combines flows in a different manner.
Departitioning
Expanded View: [Figure: Score 1, Score 2, Score 3 -> Departition -> Output File]
Global View:
Departitioning
Fan-in Flow
 For the various departitioning components:
– Key-based?
– Result ordering?
– Effect on parallelism?
– Uses?
Concatenation
Globally ordered, partitioned data:
49Jane 02241 2     47Bill 02114 14    42John 02116 30
44Bob 02116 8      46Rick 02116 23    48Mary 02116 38
43Mark 02114 9                        45Sue 02241 92
Sorted data:
49Jane 02241 2
44Bob 02116 8
43Mark 02114 9
47Bill 02114 14
46Rick 02116 23
42John 02116 30
48Mary 02116 38
45Sue 02241 92
Concatenation
 Not key-based.
 Result ordering is by partition.
 Serializes pipelined computation.
 Useful for:
– creating serial flow from partitioned data
– appending headers and trailers
– writing DML
 Used infrequently
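Concatenation can be sketched in Python as reading each partition completely before moving to the next. The helper name `concatenate` is an assumption for illustration; the input is the globally ordered, partitioned data from the slide above.

```python
from itertools import chain

# Globally ordered partitions: (id, name, zipcode, amount).
partitions = [
    [(49, "Jane", "02241", 2), (44, "Bob", "02116", 8), (43, "Mark", "02114", 9)],
    [(47, "Bill", "02114", 14), (46, "Rick", "02116", 23)],
    [(42, "John", "02116", 30), (48, "Mary", "02116", 38), (45, "Sue", "02241", 92)],
]


def concatenate(parts):
    """Emit all of partition 0, then all of partition 1, and so on."""
    return list(chain.from_iterable(parts))


serial_flow = concatenate(partitions)
print([record[3] for record in serial_flow])
```

Nothing from partition 1 can be emitted until partition 0 is finished, which is the sense in which concatenation serializes the computation.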
Merge
Round-robin partitioned and sorted by amount:
42John 02116 30 49Jane 02241 2 44Bob 02116 8
48Mary 02116 38 43Mark 02114 9 47Bill 02114 14
45Sue 02241 92 46Rick 02116 23
Sorted data, following merge on amount:

49Jane 02241 2
44Bob 02116 8
43Mark 02114 9
47Bill 02114 14
46Rick 02116 23
42John 02116 30
48Mary 02116 38
45Sue 02241 92
Merge
 Key-based.
 Result ordering is sorted if each input is sorted.
 Possibly synchronizes pipelined computation; may
even serialize.
 Useful for creating ordered data flows.
 Used more than concatenate, but still infrequently
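The merge on `amount` can be sketched with Python's standard `heapq.merge`, which combines already-sorted inputs into one sorted stream. This illustrates the idea only and is not the Ab Initio component; the input is the round-robin partitioned, sorted data from the slide above.

```python
import heapq

# Each partition is already sorted by amount: (id, name, zipcode, amount).
partitions = [
    [(42, "John", "02116", 30), (48, "Mary", "02116", 38), (45, "Sue", "02241", 92)],
    [(49, "Jane", "02241", 2), (43, "Mark", "02114", 9), (46, "Rick", "02116", 23)],
    [(44, "Bob", "02116", 8), (47, "Bill", "02114", 14)],
]

# heapq.merge repeatedly takes the smallest head record among the inputs,
# so the output is sorted only if every input is sorted on the same key.
merged = list(heapq.merge(*partitions, key=lambda record: record[3]))
print([record[3] for record in merged])
```

The merge has to wait for the next record from every input before it can decide which one to emit, which is why it can synchronize the upstream pipelines.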
Interleave
Round-robin partitioned and scored:
42John 02116 30A 43Mark 02114 9C 44Bob 02116 8C
45Sue 02241 92A 46Rick 02116 23B 47Bill 02114 14B
48Mary 02116 38A 49Jane 02241 2C
Scored dataset in original order, following interleave:
42John 02116 30A
43Mark 02114 9C
44Bob 02116 8C
45Sue 02241 92A
46Rick 02116 23B
47Bill 02114 14B
48Mary 02116 38A
49Jane 02241 2C
Interleave
 Not key-based.
 Result ordering is inverse of round-robin.
 Synchronizes pipelined computation.
 Useful for restoring original order following a
record-independent parallel computation
partitioned by round-robin.
 Used in rare circumstances
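Interleave as the inverse of round-robin can be sketched in Python by taking one record from each partition in turn. The sketch assumes the data was partitioned round-robin with a block size of 1 and that each partition preserved its record order; the helper name is hypothetical.

```python
from itertools import zip_longest

# Round-robin partitioned and scored records from the slide above.
partitions = [
    ["42John 02116 30A", "45Sue 02241 92A", "48Mary 02116 38A"],
    ["43Mark 02114 9C", "46Rick 02116 23B", "49Jane 02241 2C"],
    ["44Bob 02116 8C", "47Bill 02114 14B"],
]

_MISSING = object()  # marks partitions that have run out of records


def interleave(parts):
    """Take one record from each partition in turn, skipping exhausted ones."""
    result = []
    for one_round in zip_longest(*parts, fillvalue=_MISSING):
        result.extend(record for record in one_round if record is not _MISSING)
    return result


restored = interleave(partitions)
print(restored)
```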
Gather
Round-robin partitioned and scored:

42John 02116 30A 43Mark 02114 9C 44Bob 02116 8C
45Sue 02241 92A 46Rick 02116 23B 47Bill 02114 14B
48Mary 02116 38A 49Jane 02241 2C
Scored dataset in random order, following gather:

43Mark 02114 9C
46Rick 02116 23B
42John 02116 30A
45Sue 02241 92A
48Mary 02116 38A
44Bob 02116 8C
47Bill 02114 14B
49Jane 02241 2C
Gather
 Not key-based.
 Result ordering is unpredictable.
 Neither serializes nor synchronizes pipelined
computation.
 Useful for efficient collection of data from multiple
partitions and for repartitioning.
 Used most frequently
Layout
 Layout determines the location of a resource.

 A layout is either serial or parallel.
 A serial layout specifies one node and one
directory.
 A parallel layout specifies multiple nodes and
multiple directories. It is permissible for the
same node to be repeated.
Layout
 The location of a Dataset is one or more places on one or more disks.
 The location of a computing component is one or more directories on one or more nodes. By default, the node and directory are unknown.
 Computing components propagate their layouts from neighbors, unless specifically given a layout by the user.
Joins
Join Types
• Inner join: sets the record-required parameters for all ports to True.
• Outer join: sets the record-required parameters for all ports to False.
• Explicit: allows you to set the record-required parameter for each port individually.
Join Types (continued)
Case 1: Inner Join join-type
Case 2: Full Outer Join join-type
Case 3: Explicit join-type: record-required0: false, record-required1: true
Case 4: Explicit join-type: record-required0: true, record-required1: false
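The four cases can be modelled with a small Python sketch of a two-input join. It is an interpretation, not the Join component itself: the function `join`, its arguments, and the use of `None` for a missing record are assumptions for illustration. A key value is kept only if every port marked as record-required has at least one record with that key.

```python
from collections import defaultdict


def join(in0, in1, key, record_required0=True, record_required1=True):
    """Return (record0, record1) pairs; a missing side is represented by None."""
    groups0, groups1 = defaultdict(list), defaultdict(list)
    for record in in0:
        groups0[record[key]].append(record)
    for record in in1:
        groups1[record[key]].append(record)

    output = []
    for value in sorted(set(groups0) | set(groups1)):
        rows0, rows1 = groups0.get(value, []), groups1.get(value, [])
        # Skip this key if a required port has no record for it.
        if (record_required0 and not rows0) or (record_required1 and not rows1):
            continue
        for r0 in rows0 or [None]:
            for r1 in rows1 or [None]:
                output.append((r0, r1))
    return output


in0 = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Raj"}]
in1 = [{"id": 2, "total": 10}, {"id": 3, "total": 20}]


def keys_of(pairs):
    return [(r0 or r1)["id"] for r0, r1 in pairs]


case1 = keys_of(join(in0, in1, "id", True, True))     # inner join
case2 = keys_of(join(in0, in1, "id", False, False))   # full outer join
case3 = keys_of(join(in0, in1, "id", False, True))    # explicit: in1 required
case4 = keys_of(join(in0, in1, "id", True, False))    # explicit: in0 required
print(case1, case2, case3, case4)
```

Cases 3 and 4 keep every record from the required port and fill in the other side only where a match exists.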
Some key Join Parameters
key
Name(s) of the field(s) in the input records that must have
matching values for Join to call the transform function.
driving
Number of the port to which you want to connect the driving
input. The driving input is the largest input. All other inputs are
read into memory.
The driving parameter is only available when the sorted-input
parameter is set to In memory: Input need not be sorted.
Some key Join Parameters
dedupn
Set the dedupn parameter to true to remove duplicates from the
corresponding inn port before joining (n is the port number). This
allows you to choose only one record from a group with matching key
values as the argument to the transform function.
Default is false, which does not remove duplicates.
override-keyn
Alternative name(s) for the key field(s) for a particular in port.
