
Parallelism

 Component parallelism

 Pipeline parallelism

 Data parallelism
Component Parallelism

Sorting Customers

Sorting Transactions
Component Parallelism
 Comes “for free” with graph programming.

 Limitation:
– Scales to the number of “branches” in a graph.
Pipeline Parallelism
Processing Record: 100

Processing Record: 99
Pipeline Parallelism
 Comes “for free” with graph programming.

 Limitations:
– Scales to length of “branches” in a graph.
– Some operations, like sorting, do not pipeline.
Data Parallelism

[Figure label: Partitions]
Two Ways of Looking at
Data Parallelism
Expanded View:

Global View:
Data Parallelism
 Scales with data.

 Requires data partitioning.

 Different partitioning methods for different operations.
Data Partitioning
Expanded View:

Global View:
Data Partitioning:
The Global View
Degree of Parallelism

Fan-out Flow
Partitioning
Partitioning Review
Fan-out Flow

 For the various partitioning components:


– Is it Key-based? Does the problem require a
key-based partition?
– Performance: Are the partitions balanced or
skewed?
Partitioning: Performance

[Figure: Partitions 0 to 3 shown twice, once balanced and once skewed]

Balanced: Processors get neither too much nor too little.
Skewed: Some processors get too much, others too little.
Sample Data to be Partitioned

Customers:
42John 02116 30
43Mark 02114 9
44Bob 02116 8
45Sue 02241 92
46Rick 02116 23
47Bill 02114 14
48Mary 02116 38
49Jane 02241 2

Record format:
record
  decimal(2) id;
  string(5) name;
  decimal(5) zipcode;
  decimal(3) amount;
  string(1) newline;
end
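
To make the record format concrete, here is a minimal Python sketch that slices one fixed-width line into its four fields. It assumes the data file pads each field to its declared width (2 + 5 + 5 + 3 characters plus a newline), so the sample rows shown above would have had their padding collapsed for display. The `Customer` type and `parse_customer` function are hypothetical names for illustration, and keeping `zipcode` as a string to preserve leading zeros is a choice made here, not something the slides specify.

```python
from typing import NamedTuple


class Customer(NamedTuple):
    id: int
    name: str
    zipcode: str  # kept as text so that "02116" keeps its leading zero
    amount: int


def parse_customer(line: str) -> Customer:
    """Slice one fixed-width record: id(2) name(5) zipcode(5) amount(3)."""
    body = line.rstrip("\n")
    return Customer(
        id=int(body[0:2]),
        name=body[2:7].strip(),
        zipcode=body[7:12],
        amount=int(body[12:15]),  # int() ignores the left padding
    )


first = parse_customer("42John 02116 30\n")
third = parse_customer("44Bob  02116  8\n")
print(first)
print(third)
```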
Partition by Round-robin

Partition 0 Partition 1 Partition 2

Customers Customers Customers


42John 02116 30 43Mark 02114 9 44Bob 02116 8
45Sue 02241 92 46Rick 02116 23 47Bill 02114 14
48Mary 02116 38 49Jane 02241 2
Partition by Round-robin

 Not key based.


 Results in very well balanced data, especially
with block-size of 1.
 Useful for record-independent parallelism.
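
The round-robin rule can be sketched in a few lines of Python. This is an illustration of the idea only, not the component's implementation; `partition_round_robin` and its `block_size` argument are hypothetical names chosen to mirror the slide.

```python
def partition_round_robin(records, n_partitions, block_size=1):
    """Deal records out in turn, block_size records at a time."""
    partitions = [[] for _ in range(n_partitions)]
    for i, rec in enumerate(records):
        partitions[(i // block_size) % n_partitions].append(rec)
    return partitions


customer_ids = [42, 43, 44, 45, 46, 47, 48, 49]
parts = partition_round_robin(customer_ids, 3)
for number, part in enumerate(parts):
    print(f"Partition {number}: {part}")
```

With three partitions and a block size of 1, the IDs land in the same partitions as in the example above.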
Partition by Key

partition on zipcode:
Customers Customers
43Mark 02114 9 42John 02116 30
45Sue 02241 92 44Bob 02116 8
47Bill 02114 14 46Rick 02116 23
49Jane 02241 2 48Mary 02116 38
Partition by Key often followed by a Sort

Sort on zipcode:
Customers           Customers
43Mark 02114 9      42John 02116 30
47Bill 02114 14     44Bob 02116 8
45Sue 02241 92      46Rick 02116 23
49Jane 02241 2      48Mary 02116 38

Rollup by zipcode:
Totals by Zipcode   Totals by Zipcode
02114 23            02116 99
02241 94
Partition by Key

 Key-based.
 Usually results in well balanced data.
 Useful for key-dependent parallelism.
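
The partition, sort, and rollup sequence can be sketched in Python as follows. The slides do not say which hash function the partitioner uses, so `zlib.crc32` stands in as an assumption; the partition a given zipcode lands in will therefore differ from the example above, but the property that matters (equal keys always share a partition) holds. `partition_by_key` and `rollup` are hypothetical helper names.

```python
import zlib
from collections import defaultdict

CUSTOMERS = [
    (42, "John", "02116", 30), (43, "Mark", "02114", 9),
    (44, "Bob", "02116", 8), (45, "Sue", "02241", 92),
    (46, "Rick", "02116", 23), (47, "Bill", "02114", 14),
    (48, "Mary", "02116", 38), (49, "Jane", "02241", 2),
]


def partition_by_key(records, n_partitions, key):
    """Send each record to the partition chosen by a hash of its key."""
    partitions = [[] for _ in range(n_partitions)]
    for rec in records:
        digest = zlib.crc32(str(key(rec)).encode("utf-8"))  # assumed hash
        partitions[digest % n_partitions].append(rec)
    return partitions


def rollup(partition, key, value):
    """Sort one partition on the key, then total the value per key."""
    totals = defaultdict(int)
    for rec in sorted(partition, key=key):
        totals[key(rec)] += value(rec)
    return dict(totals)


def zipcode(rec):
    return rec[2]


def amount(rec):
    return rec[3]


parts = partition_by_key(CUSTOMERS, 2, key=zipcode)
totals_per_partition = [rollup(p, zipcode, amount) for p in parts]
print(totals_per_partition)
```

Because every record for a zipcode is in one partition, each partition can compute its totals independently and no total is split across partitions.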
Partition by Expression

Expression: amount/33
Customers Customers Customers
42John 02116 30 48Mary 02116 38 45Sue 02241 92
43Mark 02114 9
44Bob 02116 8
46Rick 02116 23
47Bill 02114 14
49Jane 02241 2
Partition by Expression

 Key-based, depending on the expression.


 Resulting balance very dependent on
expression and on data.
 Various application-dependent uses.
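
As an illustration of the `amount/33` example, the sketch below uses the integer result of the expression directly as the partition number. That an out-of-range result simply fails is an assumption of this sketch; `partition_by_expression` is a hypothetical name.

```python
CUSTOMERS = [
    (42, "John", "02116", 30), (43, "Mark", "02114", 9),
    (44, "Bob", "02116", 8), (45, "Sue", "02241", 92),
    (46, "Rick", "02116", 23), (47, "Bill", "02114", 14),
    (48, "Mary", "02116", 38), (49, "Jane", "02241", 2),
]


def partition_by_expression(records, n_partitions, expr):
    """Use the value of expr(record) as the partition number."""
    partitions = [[] for _ in range(n_partitions)]
    for rec in records:
        # Raises IndexError if the expression falls outside the partitions.
        partitions[expr(rec)].append(rec)
    return partitions


parts = partition_by_expression(CUSTOMERS, 3, lambda rec: rec[3] // 33)
ids_per_partition = [[rec[0] for rec in p] for p in parts]
print(ids_per_partition)
```

Six of the eight records fall into the first partition, which shows how strongly the balance depends on both the expression and the data.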
Partition by Range

With splitter values of 9 and 23:

Customers           Customers           Customers
43Mark 02114 9      46Rick 02116 23     42John 02116 30
44Bob 02116 8       47Bill 02114 14     45Sue 02241 92
49Jane 02241 2                          48Mary 02116 38
Range+Sort: Global Ordering

Sort following a partition by range:

Customers           Customers           Customers
49Jane 02241 2      47Bill 02114 14     42John 02116 30
44Bob 02116 8       46Rick 02116 23     48Mary 02116 38
43Mark 02114 9                          45Sue 02241 92
Partition by Range

 Key-based.
 Resulting balance dependent on set of
splitters chosen.
 Useful for “binning” and global sorting.
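
A minimal Python sketch of range partitioning with the splitter values 9 and 23 follows. It assumes a record whose key equals a splitter goes to the lower partition, which is what the example above shows (amounts 9 and 23). `partition_by_range` is a hypothetical name, and `bisect_left` from the standard library does the lookup.

```python
from bisect import bisect_left

CUSTOMERS = [
    (42, "John", "02116", 30), (43, "Mark", "02114", 9),
    (44, "Bob", "02116", 8), (45, "Sue", "02241", 92),
    (46, "Rick", "02116", 23), (47, "Bill", "02114", 14),
    (48, "Mary", "02116", 38), (49, "Jane", "02241", 2),
]


def partition_by_range(records, splitters, key):
    """Partition i receives keys in (splitters[i-1], splitters[i]]."""
    partitions = [[] for _ in range(len(splitters) + 1)]
    for rec in records:
        partitions[bisect_left(splitters, key(rec))].append(rec)
    return partitions


def amount(rec):
    return rec[3]


parts = partition_by_range(CUSTOMERS, [9, 23], key=amount)
sorted_parts = [sorted(p, key=amount) for p in parts]

# Reading the sorted partitions in partition order gives a global ordering.
global_order = [amount(rec) for p in sorted_parts for rec in p]
print(global_order)
```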
Partition with Load Balance

If the middle node is highly loaded:

Customers           Customers           Customers
42John 02116 30     45Sue 02241 92      46Rick 02116 23
43Mark 02114 9                          47Bill 02114 14
44Bob 02116 8                           48Mary 02116 38
49Jane 02241 2
Partition by Load Balance

 Not key-based.
 Results in skewed data distribution to
complement skewed load.
 Useful for record-independent parallelism.
Partition with Percentage
With percentages: 4, 20

Customers           Customers           Customers
42John 02116 30     46Rick 02116 23     ...
43Mark 02114 9      47Bill 02114 14
44Bob 02116 8       48Mary 02116 38
45Sue 02241 92      49Jane 02241 2
                    ...

The next 16 records would go to the second output, and the next 76 records would go to the third.
Partition by Percentage

 Not key-based.
 Usually results in a skewed data distribution that conforms to the provided percentages.
 Useful for record-independent parallelism.
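
The percentage split from the example (4 and 20, with the remainder going to the last output) can be sketched as below. For simplicity the sketch assumes the full record list is available up front so the total count is known; how the actual component handles a stream is not described in the slides. `partition_by_percentage` is a hypothetical name.

```python
def partition_by_percentage(records, percentages):
    """Give each listed percentage of the records to successive partitions;
    the final partition receives whatever remains."""
    total = len(records)
    cut_points = []
    running = 0
    for pct in percentages:
        running += pct
        cut_points.append(total * running // 100)

    partitions = []
    start = 0
    for end in cut_points + [total]:
        partitions.append(records[start:end])
        start = end
    return partitions


records = list(range(100))  # 100 placeholder records
parts = partition_by_percentage(records, [4, 20])
sizes = [len(p) for p in parts]
print(sizes)
```

For 100 records this yields partitions of 4, 20, and 76 records, matching the counts in the example.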
Broadcast (as a Partitioner)
Unlike all other partitioners which write a record to ONE output
flow, Broadcast writes each record to EVERY output flow.

Customers Customers Customers


42John 02116 30 42John 02116 30 42John 02116 30
43Mark 02114 9 43Mark 02114 9 43Mark 02114 9
44Bob 02116 8 44Bob 02116 8 44Bob 02116 8
45Sue 02241 92 45Sue 02241 92 45Sue 02241 92
46Rick 02116 23 46Rick 02116 23 46Rick 02116 23
47Bill 02114 14 47Bill 02114 14 47Bill 02114 14
48Mary 02116 38 48Mary 02116 38 48Mary 02116 38
49Jane 02241 2 49Jane 02241 2 49Jane 02241 2
Broadcast

 Not key-based
 Results in perfectly balanced partitions
 Useful for record-independent parallelism.
De-Partitioning
Departitioning

Departitioning combines many flows of data to produce one flow. It is the opposite of partitioning.

Each departition component combines flows in a different manner.
Departitioning
Expanded View:
[Figure: Score 1, Score 2, Score 3 -> Departition -> Output File]

Global View:
Departitioning
Fan-in Flow

 For the various departitioning components:


– Key-based?
– Result ordering?
– Effect on parallelism?
– Uses?
Concatenation
Globally ordered, partitioned data:
49Jane 02241 2      47Bill 02114 14     42John 02116 30
44Bob 02116 8       46Rick 02116 23     48Mary 02116 38
43Mark 02114 9                          45Sue 02241 92

Sorted data:
49Jane 02241 2
44Bob 02116 8
43Mark 02114 9
47Bill 02114 14
46Rick 02116 23
42John 02116 30
48Mary 02116 38
45Sue 02241 92
Concatenation
 Not key-based.
 Result ordering is by partition.
 Serializes pipelined computation.
 Useful for:
– creating serial flow from partitioned data
– appending headers and trailers
– writing DML
 Used infrequently
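
Concatenation amounts to reading partition 0 to the end, then partition 1, and so on. The sketch below illustrates this with the globally ordered partitions from the example; `concatenate` is a hypothetical name.

```python
from itertools import chain

# Range-partitioned and sorted on amount, as in the example above.
partitions = [
    [(49, "Jane", "02241", 2), (44, "Bob", "02116", 8), (43, "Mark", "02114", 9)],
    [(47, "Bill", "02114", 14), (46, "Rick", "02116", 23)],
    [(42, "John", "02116", 30), (48, "Mary", "02116", 38), (45, "Sue", "02241", 92)],
]


def concatenate(partitions):
    """Append the partitions one after another, in partition order."""
    return list(chain.from_iterable(partitions))


serial = concatenate(partitions)
print([rec[3] for rec in serial])
```

No record from a later partition can be emitted until every earlier partition is exhausted, which is why concatenation serializes a pipelined computation.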
Merge
Round-robin partitioned and sorted by amount:
42John 02116 30 49Jane 02241 2 44Bob 02116 8
48Mary 02116 38 43Mark 02114 9 47Bill 02114 14
45Sue 02241 92 46Rick 02116 23

Sorted data, following merge on amount:


49Jane 02241 2
44Bob 02116 8
43Mark 02114 9
47Bill 02114 14
46Rick 02116 23
42John 02116 30
48Mary 02116 38
45Sue 02241 92
Merge
 Key-based.
 Result ordering is sorted if each input is sorted.
 Possibly synchronizes pipelined computation; may
even serialize.
 Useful for creating ordered data flows.
 Used more than concatenate, but still infrequently
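
A merge of sorted partitions can be sketched with `heapq.merge` from the Python standard library, which repeatedly takes the smallest head record among the inputs. The sketch is an illustration of the technique rather than of the component itself; `merge_partitions` is a hypothetical name.

```python
import heapq

# Round-robin partitioned, then each partition sorted on amount.
partitions = [
    [(42, "John", "02116", 30), (48, "Mary", "02116", 38), (45, "Sue", "02241", 92)],
    [(49, "Jane", "02241", 2), (43, "Mark", "02114", 9), (46, "Rick", "02116", 23)],
    [(44, "Bob", "02116", 8), (47, "Bill", "02114", 14)],
]


def merge_partitions(partitions, key):
    """Combine already-sorted partitions into one sorted flow."""
    return list(heapq.merge(*partitions, key=key))


merged = merge_partitions(partitions, key=lambda rec: rec[3])
print([rec[3] for rec in merged])
```

The merge has to wait for the next record on every input before it can pick the smallest, which is the synchronization effect noted above.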
Interleave
Round-robin partitioned and scored:
42John 02116 30A 43Mark 02114 9C 44Bob 02116 8C
45Sue 02241 92A 46Rick 02116 23B 47Bill 02114 14B
48Mary 02116 38A 49Jane 02241 2C
Scored dataset in original order, following interleave:
42John 02116 30A
43Mark 02114 9C
44Bob 02116 8C
45Sue 02241 92A
46Rick 02116 23B
47Bill 02114 14B
48Mary 02116 38A
49Jane 02241 2C
Interleave
 Not key-based.
 Result ordering is inverse of round-robin.
 Synchronizes pipelined computation.
 Useful for restoring original order following a
record-independent parallel computation
partitioned by round-robin.
 Used in rare circumstances
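
Interleave can be pictured as taking one record from each partition in turn, which undoes a round-robin split with block size 1. The following Python sketch illustrates that; `interleave` is a hypothetical name and the scores are the ones from the example.

```python
from itertools import zip_longest

# Round-robin partitioned and scored, as in the example above.
partitions = [
    [(42, "A"), (45, "A"), (48, "A")],
    [(43, "C"), (46, "B"), (49, "C")],
    [(44, "C"), (47, "B")],
]

_MISSING = object()  # marks the gap where a shorter partition has run out


def interleave(partitions):
    """Take one record from each partition in turn, skipping exhausted ones."""
    out = []
    for row in zip_longest(*partitions, fillvalue=_MISSING):
        out.extend(rec for rec in row if rec is not _MISSING)
    return out


restored = interleave(partitions)
print(restored)
```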
Gather

Round-robin partitioned and scored:


42John 02116 30A 43Mark 02114 9C 44Bob 02116 8C
45Sue 02241 92A 46Rick 02116 23B 47Bill 02114 14B
48Mary 02116 38A 49Jane 02241 2C

Scored dataset in random order, following gather:


43Mark 02114 9C
46Rick 02116 23B
42John 02116 30A
45Sue 02241 92A
48Mary 02116 38A
44Bob 02116 8C
47Bill 02114 14B
49Jane 02241 2C
Gather

 Not key-based.
 Result ordering is unpredictable.
 Neither serializes nor synchronizes pipelined
computation.
 Useful for efficient collection of data from multiple
partitions and for repartitioning.
 Used most frequently
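
Gather's behavior can be approximated in Python with one thread per partition writing into a shared queue: records arrive in whatever order the producers happen to run. Threads and `queue.Queue` are stand-ins chosen for this sketch, not a description of how the component is built; `gather` is a hypothetical name.

```python
import queue
import threading

partitions = [
    [(42, "A"), (45, "A"), (48, "A")],
    [(43, "C"), (46, "B"), (49, "C")],
    [(44, "C"), (47, "B")],
]


def gather(partitions):
    """Collect records from all partitions in arrival order."""
    collected = queue.Queue()

    def drain(partition):
        for rec in partition:
            collected.put(rec)

    workers = [threading.Thread(target=drain, args=(p,)) for p in partitions]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()

    # All producers have finished, so the queue size is now exact.
    return [collected.get() for _ in range(collected.qsize())]


gathered = gather(partitions)
print(gathered)  # order may differ from run to run
```

Every record appears exactly once, but no ordering across partitions is guaranteed.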
Layout

 Layout determines the location of a resource.


 A layout is either serial or parallel.
 A serial layout specifies one node and one
directory.
 A parallel layout specifies multiple nodes and
multiple directories. It is permissible for the
same node to be repeated.
Layout
 The location of a dataset is one or more places on one or more disks.
 The location of a computing component is one or more directories on one or more nodes. By default, the node and directory are unknown.
 Computing components propagate their layouts from neighbors, unless specifically given a layout by the user.
Joins
Join Types
•Inner join — sets the record-required parameters for all ports
to True.
•Outer join — sets the record-required parameters for all ports
to False.
•Explicit — allows you to set the record-required parameter
for each port individually.
Join Types (continued)
Case 1: join-type: Inner Join
Case 2: join-type: Full Outer Join
Case 3: join-type: Explicit, with record-required0: false and record-required1: true
Case 4: join-type: Explicit, with record-required0: true and record-required1: false
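
The effect of the record-required settings in the four cases can be sketched in Python for a two-input join. This is an illustration under assumptions: the real component calls a transform function for each match, whereas the sketch simply emits pairs and uses `None` for a missing side. The function `join` and its arguments `record_required0` and `record_required1` are hypothetical names mirroring the parameters above.

```python
def join(in0, in1, key, record_required0=True, record_required1=True):
    """Pair records with equal keys; a port whose record-required flag is
    False may be absent (None) in an output pair."""
    index1 = {}
    for rec in in1:
        index1.setdefault(key(rec), []).append(rec)

    matched_keys = set()
    out = []
    for rec0 in in0:
        k = key(rec0)
        if k in index1:
            matched_keys.add(k)
            out.extend((rec0, rec1) for rec1 in index1[k])
        elif not record_required1:
            out.append((rec0, None))

    if not record_required0:
        for k, recs in index1.items():
            if k not in matched_keys:
                out.extend((None, rec1) for rec1 in recs)
    return out


def first_field(rec):
    return rec[0]


in0 = [(1, "a"), (2, "b")]
in1 = [(2, "x"), (3, "y")]

inner = join(in0, in1, first_field)                # Case 1
outer = join(in0, in1, first_field, False, False)  # Case 2
case3 = join(in0, in1, first_field, False, True)   # Case 3
case4 = join(in0, in1, first_field, True, False)   # Case 4
print(inner, outer, case3, case4, sep="\n")
```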
Some key Join Parameters
key
Name(s) of the field(s) in the input records that must have
matching values for Join to call the transform function.
driving
Number of the port to which you want to connect the driving
input. The driving input is the largest input. All other inputs are
read into memory.
The driving parameter is only available when the sorted-input
parameter is set to In memory: Input need not be sorted.
Some key Join Parameters

dedupn
Set the dedupn parameter to true to remove duplicates from the corresponding inn port before joining, where n is the port number (as in record-required0 and record-required1). This allows you to choose only one record from a group with matching key values as the argument to the transform function.
The default is false, which does not remove duplicates.
override-keyn
Alternative name(s) for the key field(s) for a particular in port.
