Component parallelism
Pipeline parallelism
Data parallelism
Component Parallelism
Sorting Customers
Sorting Transactions
Component Parallelism
Comes “for free” with graph programming.
Limitation:
– Scales to the number of “branches” in a graph.
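A rough Python sketch of component parallelism, mirroring the two independent branches in the figure (sorting customers and sorting transactions). The input values and the use of a thread pool are illustrative assumptions; this is not how Ab Initio schedules components.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical inputs for the two independent branches of the graph.
customers = ["Sue", "Bob", "Mark", "John"]
transactions = [92, 8, 9, 30]

with ThreadPoolExecutor(max_workers=2) as pool:
    # Each branch is submitted on its own, so the two sorts may overlap in time.
    customers_future = pool.submit(sorted, customers)
    transactions_future = pool.submit(sorted, transactions)
    sorted_customers = customers_future.result()
    sorted_transactions = transactions_future.result()

print(sorted_customers)
print(sorted_transactions)
```

With only two branches there is never work for more than two workers, which is the limitation noted above.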
Pipeline Parallelism
Processing Record: 100
Processing Record: 99
Pipeline Parallelism
Comes “for free” with graph programming.
Limitations:
– Scales to length of “branches” in a graph.
– Some operations, like sorting, do not pipeline.
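The record-at-a-time flow behind pipeline parallelism can be modeled with Python generators. This is only a sketch: generators interleave stages rather than run them on separate processors, and the stage names `read` and `double` are hypothetical. It shows that a downstream stage can start on a record before the upstream stage has read the next one, and that a sort in the middle forces all records to be read first.

```python
def read(records, log):
    # Upstream stage: emits one record at a time.
    for rec in records:
        log.append(f"read {rec}")
        yield rec

def double(stream, log):
    # Downstream stage: processes each record as soon as it arrives.
    for rec in stream:
        log.append(f"double {rec}")
        yield rec * 2

# Pipelined: the two stages alternate record by record.
pipelined_log = []
pipelined_out = list(double(read([3, 1, 2], pipelined_log), pipelined_log))

# With a sort in between: sorted() must consume every record
# before it can hand the first one downstream.
sorted_log = []
sorted_out = list(double(sorted(read([3, 1, 2], sorted_log)), sorted_log))

print(pipelined_log)
print(sorted_log)
```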
Data Parallelism
Partitions
Two Ways of Looking at Data Parallelism
Expanded View:
Global View:
Data Parallelism
Scales with data.
Global View:
Data Partitioning:
The Global View
Degree of Parallelism
Fan-out Flow
Partitioning
Partitioning Review
Fan-out Flow
Partitions 0 to 3, shown in two cases:
Balanced: Processors get neither too much nor too little.
Skewed: Some processors get too much, others too little.
Sample Data to be Partitioned
Customers
42John 02116 30
43Mark 02114 9
44Bob 02116 8
45Sue 02241 92
46Rick 02116 23
47Bill 02114 14
48Mary 02116 38
49Jane 02241 2

record
  decimal(2) id;
  string(5) name;
  decimal(5) zipcode;
  decimal(3) amount;
  string(1) newline;
end
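The record format above describes fixed-width fields (2 + 5 + 5 + 3 + 1 = 16 characters per record). A minimal Python sketch of reading such records follows. It assumes that names are right-padded and amounts left-padded with spaces to their declared widths, which the slide does not state; `parse_customer` is a hypothetical helper name.

```python
RECORD_LENGTH = 16  # id(2) + name(5) + zipcode(5) + amount(3) + newline(1)

def parse_customer(record):
    # Slice each field at the offsets implied by the declared widths.
    return {
        "id": int(record[0:2]),
        "name": record[2:7].rstrip(),
        "zipcode": record[7:12],       # kept as text to preserve the leading zero
        "amount": int(record[12:15]),  # int() ignores the assumed space padding
    }

# Two sample records, padded to the assumed fixed widths.
raw = "42John 02116 30\n43Mark 02114  9\n"
customers = [
    parse_customer(raw[start:start + RECORD_LENGTH])
    for start in range(0, len(raw), RECORD_LENGTH)
]
print(customers)
```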
Partition by Round-robin
partition on zipcode:
Customers | Customers
43Mark 02114 9 | 42John 02116 30
45Sue 02241 92 | 44Bob 02116 8
47Bill 02114 14 | 46Rick 02116 23
49Jane 02241 2 | 48Mary 02116 38
Partition by Key, often followed by a Sort
Sort on zipcode:
Customers | Customers
43Mark 02114 9 | 42John 02116 30
47Bill 02114 14 | 44Bob 02116 8
45Sue 02241 92 | 46Rick 02116 23
49Jane 02241 2 | 48Mary 02116 38
Rollup by zipcode:
Totals by Zipcode | Totals by Zipcode
02114 23 | 02116 99
02241 94 |
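The partition, sort, and rollup sequence can be sketched in Python as follows. The hash function (`zlib.crc32`) is an assumption chosen only because it is deterministic; which partition a given zipcode lands in may differ from the slide. The property that matters is that equal keys always share a partition, so each partition can be rolled up independently.

```python
import zlib
from itertools import groupby

# (id, name, zipcode, amount), taken from the sample data.
customers = [
    (42, "John", "02116", 30), (43, "Mark", "02114", 9),
    (44, "Bob", "02116", 8), (45, "Sue", "02241", 92),
    (46, "Rick", "02116", 23), (47, "Bill", "02114", 14),
    (48, "Mary", "02116", 38), (49, "Jane", "02241", 2),
]

def zipcode(rec):
    return rec[2]

def partition_by_key(records, key, count):
    parts = [[] for _ in range(count)]
    for rec in records:
        # A stable hash sends every record with the same key to the same partition.
        parts[zlib.crc32(key(rec).encode()) % count].append(rec)
    return parts

partitions = partition_by_key(customers, zipcode, 2)

totals = []
for part in partitions:
    part.sort(key=zipcode)  # sort within the partition so groupby sees whole groups
    totals.append({z: sum(rec[3] for rec in group)
                   for z, group in groupby(part, key=zipcode)})

print(totals)
```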
Partition by Key
Key-based.
Usually results in well balanced data.
Useful for key-dependent parallelism.
Partition by Expression
Expression: amount/33
Customers (amount/33 = 0):
42John 02116 30
43Mark 02114 9
44Bob 02116 8
46Rick 02116 23
47Bill 02114 14
49Jane 02241 2
Customers (amount/33 = 1):
48Mary 02116 38
Customers (amount/33 = 2):
45Sue 02241 92
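A short Python sketch of the example above, assuming that amount/33 is evaluated as integer division and that the result is used directly as the partition number. `partition_by_expression` is a hypothetical helper, not an Ab Initio API.

```python
# (id, name, zipcode, amount), taken from the sample data.
customers = [
    (42, "John", "02116", 30), (43, "Mark", "02114", 9),
    (44, "Bob", "02116", 8), (45, "Sue", "02241", 92),
    (46, "Rick", "02116", 23), (47, "Bill", "02114", 14),
    (48, "Mary", "02116", 38), (49, "Jane", "02241", 2),
]

def partition_by_expression(records, expression, count):
    parts = [[] for _ in range(count)]
    for rec in records:
        # The expression result is the index of the output partition.
        parts[expression(rec)].append(rec)
    return parts

partitions = partition_by_expression(customers, lambda rec: rec[3] // 33, 3)
names = [[rec[1] for rec in part] for part in partitions]
print(names)
```

Six of the eight records land in partition 0, which shows how strongly the balance depends on the expression and the data.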
Partition by Expression
Key-based.
Resulting balance dependent on set of
splitters chosen.
Useful for “binning” and global sorting.
Partition with Load Balance
Not key-based.
Results in skewed data distribution to
complement skewed load.
Useful for record-independent parallelism.
Partition with Percentage
With percentages: 4, 20
Customers | Customers | Customers
42John 02116 30 | 46Rick 02116 23 | ...
43Mark 02114 9 | 47Bill 02114 14 |
44Bob 02116 8 | 48Mary 02116 38 |
45Sue 02241 92 | 49Jane 02241 2 |
Not key-based.
Usually results in a skewed data distribution
conforming to the provided percentages.
Useful for record-independent parallelism.
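One way to read the example with percentages 4 and 20: out of every 100 records, the first 4 go to the first flow, the next 20 to the second, and the remainder to the last flow. That per-100 cycle is an assumption made for this Python sketch (it is consistent with the eight sample records shown, but the slide does not spell it out), and `partition_by_percentage` is a hypothetical helper.

```python
from bisect import bisect_right
from itertools import accumulate

def partition_by_percentage(records, percentages):
    # Cumulative boundaries within each block of 100 records: [4, 20] -> [4, 24].
    bounds = list(accumulate(percentages))
    # One extra flow receives whatever the percentages leave over (assumption).
    parts = [[] for _ in range(len(percentages) + 1)]
    for index, rec in enumerate(records):
        position = index % 100
        parts[bisect_right(bounds, position)].append(rec)
    return parts

customer_ids = list(range(42, 50))  # ids 42..49 from the sample data
parts = partition_by_percentage(customer_ids, [4, 20])
print(parts)

# With a full block of 100 records the split follows the percentages.
sizes = [len(p) for p in partition_by_percentage(list(range(100)), [4, 20])]
print(sizes)
```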
Broadcast (as a Partitioner)
Unlike all other partitioners, which write each record to ONE output
flow, Broadcast writes each record to EVERY output flow.
Not key-based.
Results in perfectly balanced partitions.
Useful for record-independent parallelism.
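For contrast with the other partitioners, a minimal Python sketch of broadcasting: every output flow receives its own full copy of the input. The function name `broadcast` and the sample values are illustrative.

```python
def broadcast(records, flow_count):
    # Each output flow gets an independent copy of every record.
    return [list(records) for _ in range(flow_count)]

lookup_rows = ["02114", "02116", "02241"]
flows = broadcast(lookup_rows, 3)
print(flows)
```

The partitions are perfectly balanced because they are identical, at the cost of multiplying the data volume by the number of flows.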
De-Partitioning
Departitioning
Score 1, Score 2, Score 3 -> Departition -> Output File
Global View:
Departitioning
Fan-in Flow
Sorted data:
49Jane 02241 2
44Bob 02116 8
43Mark 02114 9
47Bill 02114 14
46Rick 02116 23
42John 02116 30
48Mary 02116 38
45Sue 02241 92
Concatenation
Not key-based.
Result ordering is by partition.
Serializes pipelined computation.
Useful for:
– creating serial flow from partitioned data
– appending headers and trailers
– writing DML
Used infrequently
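Concatenation can be pictured in Python with `itertools.chain`, which exhausts each input in turn. The header and trailer strings and the shortened record labels are hypothetical; the point is that the output order follows the partition order.

```python
from itertools import chain

header = ["HEADER"]
partitions = [
    ["42John", "48Mary", "45Sue"],
    ["49Jane", "43Mark", "46Rick"],
    ["44Bob", "47Bill"],
]
trailer = ["TRAILER"]

# chain() reads the first input to the end before touching the next one,
# so later partitions wait on earlier ones.
output = list(chain(header, *partitions, trailer))
print(output)
```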
Merge
Round-robin partitioned and sorted by amount:
42John 02116 30 | 49Jane 02241 2 | 44Bob 02116 8
48Mary 02116 38 | 43Mark 02114 9 | 47Bill 02114 14
45Sue 02241 92 | 46Rick 02116 23 |
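The three partitions above, and the fully sorted result of merging them, can be reproduced with a short Python sketch. It uses the standard library's `heapq.merge` as a stand-in for a merge of sorted flows; it illustrates the idea and makes no claim about Ab Initio's implementation.

```python
import heapq

# (id, name, zipcode, amount), taken from the sample data.
customers = [
    (42, "John", "02116", 30), (43, "Mark", "02114", 9),
    (44, "Bob", "02116", 8), (45, "Sue", "02241", 92),
    (46, "Rick", "02116", 23), (47, "Bill", "02114", 14),
    (48, "Mary", "02116", 38), (49, "Jane", "02241", 2),
]

def amount(rec):
    return rec[3]

PARTITION_COUNT = 3

# Round-robin: record i goes to partition i modulo the partition count.
partitions = [customers[start::PARTITION_COUNT] for start in range(PARTITION_COUNT)]

# Sort each partition independently by amount.
for part in partitions:
    part.sort(key=amount)

# Merge the sorted partitions into one sorted flow.
merged = list(heapq.merge(*partitions, key=amount))
print([rec[1] for rec in merged])
```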
Gather
Not key-based.
Result ordering is unpredictable.
Neither serializes nor synchronizes pipelined
computation.
Useful for efficient collection of data from multiple
partitions and for repartitioning.
Used most frequently
Layout
dedupN (N is the number of the in port)
Set the dedupN parameter to true to remove duplicates from the
corresponding inN port before joining. This allows you to choose
only one record from a group with matching key values as the
argument to the transform function.
Default is false, which does not remove duplicates.
override-keyN
Alternative name(s) for the key field(s) for a particular in port.