Parralel Demro 002
FUNDAMENTAL GPU
ALGORITHMS
More Patterns
Reduce
Scan
OUTLINES
Efficiency Measure
Reduce primitive
Reduce model
Reduce Implementation and complexity analysis
Scan primitive
Scan model
Scan Implementation and complexity analysis
Histogram primitive
Sequential vs. Parallel sum of 6 0 2 5 1 7 (result 21)
             Sequential   Parallel
Threads      1            3
Steps        5            3
Total Work   5            5
If Total Work (parallel) = Total Work (sequential): Work Efficient
REDUCE PATTERN
REDUCE DEFINITION
Set of elements
Reduction operator
Binary
Associative operation
e.g. 6 + 0 + 2 + 5 + …
QUIZ
Which operators can be used for reduce (binary and associative)?
✓❑ Multiply ( a*b )
✓❑ Minimum
❑ Factorial ( a! )
✓❑ Logical Or ( a||b )
✓❑ Logical And ( a&&b )
❑ Division ( a/b )
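The quiz answers can be spot-checked by brute force. The sketch below is only an illustration: `is_associative` and the sample values are hypothetical, and passing on a few samples does not prove associativity in general (a failure, however, does disprove it). Factorial is left out because it takes one argument, so it is not a binary operator at all.

```python
from itertools import product

def is_associative(op, samples):
    """True if op(op(a, b), c) == op(a, op(b, c)) for every sampled triple."""
    return all(op(op(a, b), c) == op(a, op(b, c))
               for a, b, c in product(samples, repeat=3))

ints = [1, 2, 3, 4, 8]      # sample integers (assumption)
bools = [False, True]

results = {
    "multiply": is_associative(lambda a, b: a * b, ints),
    "minimum": is_associative(min, ints),
    "logical or": is_associative(lambda a, b: a or b, bools),
    "logical and": is_associative(lambda a, b: a and b, bools),
    "division": is_associative(lambda a, b: a / b, ints),
}
```

Division fails because, for example, (1/2)/3 differs from 1/(2/3).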
REDUCE SERIAL (SEQUENTIAL)
IMPLEMENTATION
sum = 0;
for (i = 0; i < n; i++) {
    sum = sum + array[i];
}
return sum;
Example: ((6 + 0) + 2) + 5
Steps = 3
Total work = 3
REDUCE SERIAL
COMPLEXITY
Which is true when reducing n elements?
❑ It takes n operations
REDUCE PARALLEL
IMPLEMENTATION
((a+b)+c)+d = (a+b)+(c+d)
Example: (6 + 0) + (2 + 5)
Steps = 2
Total work = 3
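A minimal sketch of the pairwise (tree) reduction, written serially in Python so the step and work counts can be checked. Each pass of the `while` loop stands for one parallel step in which all pairs would be combined at once; the function name and the returned tuple are illustrative assumptions, not part of the slides.

```python
from operator import add

def tree_reduce(values, op):
    """Pairwise reduction of a non-empty list; returns (result, steps, total_work)."""
    level = list(values)
    steps = work = 0
    while len(level) > 1:
        # One parallel step: combine neighbours (0,1), (2,3), ...
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        work += len(nxt)
        if len(level) % 2:          # an odd leftover element is carried up unchanged
            nxt.append(level[-1])
        level = nxt
        steps += 1
    return level[0], steps, work

four = tree_reduce([6, 0, 2, 5], add)                 # (13, 2, 3)
eight = tree_reduce([6, 0, 2, 5, 3, 7, 1, 5], add)    # (29, 3, 7)
```

For n elements this gives about log2(n) steps and n-1 operations, the same total work as the serial loop.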
REDUCE PARALLEL
COMPLEXITY
[Figure: tree reduction of 6 0 2 5 3 7 1 5, with four additions in the first step and two in the second]
Transaction   Balance
+10           10
-5            5
+4            9
+3            12
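The balance column is a running total of the transactions, which is an inclusive sum scan. A small check using Python's standard `itertools.accumulate` (the variable names are illustrative):

```python
from itertools import accumulate

transactions = [10, -5, 4, 3]
balances = list(accumulate(transactions))   # running total after each transaction
```

Here `balances` is `[10, 5, 9, 12]`, matching the table.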
SCAN
DEFINITION
Set of elements
Reduction operator (op)
Binary
Associative operation (we assume here it is also
commutative, e.g. x+y = y+x)
Identity element I [I op a = a]
op                    I      Because
+                     0      a+0 = a
*                     1      a*1 = a
min (unsigned char)   0xFF   min(0xFF, a) = a
SCAN
DEFINITION CONT.
Exclusive:
in : 1 0 4 2 3
Op = +, I = 0
Out: 0 1 1 5 7
Inclusive:
in : 1 0 4 2 3
Op = +, I = 0
Out: 1 1 5 7 10
SCAN SERIAL IMPLEMENTATION AND
COMPLEXITY
Inclusive:
acc = identity;
for (i = 0; i < n; i++) {
    acc = acc + array[i];
    out[i] = acc;
}
return out;
Exclusive:
acc = identity;
for (i = 0; i < n; i++) {
    out[i] = acc;
    acc = acc + array[i];
}
return out;
Steps = n
Total work = n
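The two serial loops can be sketched in Python with the operator and identity passed in, so the same code covers +, * or min. The function names are assumptions made for this example.

```python
def inclusive_scan(values, op, identity):
    """out[i] combines values[0..i]."""
    acc, out = identity, []
    for x in values:
        acc = op(acc, x)
        out.append(acc)
    return out

def exclusive_scan(values, op, identity):
    """out[i] combines values[0..i-1]; out[0] is the identity."""
    acc, out = identity, []
    for x in values:
        out.append(acc)
        acc = op(acc, x)
    return out

data = [1, 0, 4, 2, 3]
inc = inclusive_scan(data, lambda a, b: a + b, 0)   # [1, 1, 5, 7, 10]
exc = exclusive_scan(data, lambda a, b: a + b, 0)   # [0, 1, 1, 5, 7]
```

The only difference between the two is whether the output is written before or after the accumulator is updated.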
SCAN PARALLEL
IMPLEMENTATION AND
COMPLEXITY
Inclusive (+ Scan):
in : 1 0 4 2 3
Out: 1 1 5 7 10
SCAN PARALLEL
IMPLEMENTATION AND
COMPLEXITY (CONT.2)
Hillis & Steele (inclusive sum scan):
Input:          1 2 3 4 5 6 7 8
After step 0:   1 3 5 7 9 11 13 15
After step 1:   1 3 6 10 14 18 22 26
After step 2:   1 3 6 10 15 21 28 36
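A serial Python sketch of the Hillis & Steele steps. Each pass of the loop stands for one parallel step; building a new list imitates all threads reading their inputs before any thread writes. The step and work counters are additions for illustration.

```python
def hillis_steele_scan(values):
    """Inclusive sum scan; returns (result, steps, total_work)."""
    data = list(values)
    n = len(data)
    steps = work = 0
    offset = 1
    while offset < n:
        # Element i adds the value `offset` places to its left, if there is one.
        data = [data[i] + data[i - offset] if i >= offset else data[i]
                for i in range(n)]
        work += n - offset      # number of additions performed in this step
        steps += 1
        offset *= 2
    return data, steps, work

result, steps, work = hillis_steele_scan([1, 2, 3, 4, 5, 6, 7, 8])
```

For the 8-element input this takes 3 steps and 7 + 6 + 4 = 17 additions, which illustrates the O(log n) steps and O(n log n) work of this algorithm.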
Downsweep
Rule: L R becomes R L+R
36 (root after the upsweep, replaced by the identity 0)
10 0
3 0 11 10
1 0 3 3 5 10 7 21
0 1 3 6 10 15 21 28
Step = log n
Work = n
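A serial sketch of the upsweep/downsweep exclusive scan (this work-efficient scheme is commonly attributed to Blelloch; the slides here only label the downsweep). It assumes a power-of-two input length, and each inner `for` loop stands for one parallel step.

```python
def work_efficient_scan(values):
    """Exclusive sum scan via upsweep (reduce) then downsweep."""
    data = list(values)
    n = len(data)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two length"

    # Upsweep: build partial sums in place; the total ends up in data[n-1].
    stride = 1
    while stride < n:
        for i in range(2 * stride - 1, n, 2 * stride):
            data[i] += data[i - stride]
        stride *= 2

    # Downsweep: replace the root with the identity, then push values down.
    data[n - 1] = 0
    stride = n // 2
    while stride >= 1:
        for i in range(2 * stride - 1, n, 2 * stride):
            left = data[i - stride]
            data[i - stride] = data[i]   # left child receives the parent value
            data[i] += left              # right child receives left + parent
        stride //= 2
    return data

scanned = work_efficient_scan([1, 2, 3, 4, 5, 6, 7, 8])
```

The result is `[0, 1, 3, 6, 10, 15, 21, 28]`, and the intermediate arrays match the rows of the downsweep figure.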
QUIZ
Which Scan algorithm to use?
512 elements, 512 processors ✓
1M elements, 512 processors ✓
128K elements, 1 processor ✓
HISTOGRAM
[Figure: example histogram with bar values 38, 34, 16, 12]
HISTOGRAM
PARALLEL NAÏVE
HISTOGRAM PARALLEL
NAÏVE (CONT.)
Example
128 elements, 8 threads, 3 bins
Memory 15 15 16 17 16 6 7 8
The bin in memory holds 5:
             Thread 1   Thread 2
Read         5          5
Increment    5 + 1      5 + 1
Write        6          6
Both threads write 6, so one increment is lost.
Race condition
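The lost update can be replayed deterministically. This sketch does not use real threads; it simply executes the two read-increment-write sequences in the interleaved order that causes the problem. All names are illustrative.

```python
memory = {"bin": 5}

# Interleaving: both "threads" read before either one writes back.
reg1 = memory["bin"]      # thread 1 reads 5 into its register
reg2 = memory["bin"]      # thread 2 reads 5 into its register
reg1 += 1                 # thread 1 increments its copy
reg2 += 1                 # thread 2 increments its copy
memory["bin"] = reg1      # thread 1 writes 6
memory["bin"] = reg2      # thread 2 writes 6 again

lost_update_result = memory["bin"]
```

Two increments were issued, yet the bin ends at 6 instead of 7.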
HISTOGRAM
PARALLEL SIMPLE
Use atomic operations.
However, this serializes the updates to each bin.
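Python has no GPU-style atomic add, so the sketch below uses a `threading.Lock` as a stand-in: the read-increment-write sequence becomes indivisible, at the cost of threads waiting on each other. A single lock for all bins is coarser than a per-address atomic, and the binning rule `x % num_bins` is an assumption chosen only to have something to count.

```python
import threading

def histogram_with_lock(data, num_bins, num_threads=8):
    """Histogram built by several threads; a lock protects every increment."""
    bins = [0] * num_bins
    lock = threading.Lock()

    def worker(chunk):
        for x in chunk:
            with lock:                    # stand-in for an atomic increment
                bins[x % num_bins] += 1   # hypothetical binning rule

    chunks = [data[t::num_threads] for t in range(num_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bins

counts = histogram_with_lock(list(range(128)), 3)
```

With 128 elements, 8 threads and 3 bins this yields `[43, 43, 42]` on every run.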
HISTOGRAM PARALLEL
REDUCE-BASED
Example
128 elements, 8 threads, 3 bins
First we sort
Memory … 0 0 0 1 1 1 2 2 …
Then we reduce
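A small serial sketch of the sort-then-reduce idea: after sorting the bin ids, equal ids sit next to each other, so each bin count is a reduction over one contiguous run. The function name is an assumption; `itertools.groupby` is used here to find the runs.

```python
from itertools import groupby

def histogram_sort_reduce(bin_ids, num_bins):
    """Sort the bin ids, then reduce (count) each run of equal ids."""
    counts = [0] * num_bins
    for bin_id, run in groupby(sorted(bin_ids)):
        counts[bin_id] = sum(1 for _ in run)   # reduce within one segment
    return counts

hist = histogram_sort_reduce([2, 0, 1, 0, 2, 1, 0, 1], 3)
```

No two workers ever update the same counter here, so no atomics are needed.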
TONE MAPPING
Using reduce, scan and histogram
PARALLEL PROCESSING
QUIZ
For a scan of n elements, the amount of work is:
❑ O(log n)
✓❑ O(n)
❑ O(n log n)
❑ O(n^2)
Compact
Compact-like
Segmented scan
Sorting
Merge Sort
Radix Sort
Quick Sort
COMPACT
(FILTER)
COMPACT MODEL
Input: a b c d e …
Predicate: T F T F T …
e.g.
“Is my index even?”
Output (sparse): a - c - e …
Output (dense): a c e …
WHY WE PREFER
DENSE COMPACT?
Suppose we want to apply
toGray() on every red object.
//Sparse
if (isRed(object)) {
    toGray(object)
}
15 threads
//Dense
cc = compact(objects, isRed())
map(cc, toGray())
15 threads (compact), 4 threads (map)
QUIZ
When to use compact?
✓❑Large
COMPACT PARALLELIZATION
How to compute the scatter address/index in
parallel?
Input: a b c d e …
Predicate: T F T F T …
e.g.
“Is my index even?”
Output (sparse): a - c - e …
Output (dense): a c e …
COMPACT
PARALLELIZATION
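We can rephrase the problem as follows:
Input: T F F T T F
Output: 0 - - 1 2 -
Exclusive scan of the predicate (T = 1, F = 0): 0 1 1 1 2 3
The scan value at each T position is its output address; the values at F positions are ignored.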
COMPACT PARALLEL
ALGORITHM
1. Generate predicate array (map):
predicate_array = predicate_function(input_array)
2. Generate scan-in array (1s and 0s):
scan_in_array = convert(predicate_array)
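One way the whole pipeline might look, sketched serially in Python. Steps 1 and 2 follow the list above; the exclusive scan and the final scatter are filled in from the preceding slides and are assumptions about how the list continues. `exclusive_sum_scan` and `compact` are illustrative names.

```python
def exclusive_sum_scan(values):
    out, acc = [], 0
    for v in values:
        out.append(acc)
        acc += v
    return out

def compact(items, predicate):
    """Return (dense_output, scatter_addresses)."""
    predicate_array = [predicate(x) for x in items]        # 1. map
    scan_in = [1 if p else 0 for p in predicate_array]     # 2. 1s and 0s
    addresses = exclusive_sum_scan(scan_in)                # assumed step: scan
    out = [None] * sum(scan_in)
    for x, keep, addr in zip(items, scan_in, addresses):   # assumed step: scatter
        if keep:
            out[addr] = x
    return out, addresses

kept = {"a", "d", "e"}                                     # predicate pattern T F F T T F
dense, addresses = compact(list("abcdef"), lambda x: x in kept)
```

Each element's write address depends only on the scan result, so on a GPU every scatter could run in its own thread.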
QUIZ
Suppose we need to compact 1M elements with the
following predicate functions:
A: isDivisiableBy_17( ) Few elements
B: isNotDivisiableBy_34( ) Many elements
RECAP COMPACT
ALLOCATION
Compact allocates 1 output for each true element
and 0 outputs for each false element.
CLIPPING
Suppose a set of triangles is sent as input to a
computer graphics pipeline.
CLIPPING
PROBLEM
BAD
SOLUTION
Input: a c e d f b g
Allocate the maximum possible space in an intermediate
array:
5 for each triangle in our case
Intermediate: a ? ? ? ? b ? ---
Apply compact
Disadvantages:
Wasteful in space
Input: a c e d f b g
Apply scan
Addresses: 0 0 1 1 2 4 5
Allocate the output array according to the maximum
scan value, and then apply scatter
Output: 0 1 2 3 4 5
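A sketch of allocation by scan. The per-item output counts below are an assumption: they are not given on the slide, but were chosen because their exclusive scan reproduces the addresses 0 0 1 1 2 4 5 shown above. The `(item, k)` pairs stand in for whatever each input actually produces.

```python
def allocate_and_scatter(items, counts):
    """Each item requests counts[i] output slots; a scan assigns their start."""
    addresses, total = [], 0
    for c in counts:                 # exclusive sum scan of the requests
        addresses.append(total)
        total += c
    output = [None] * total          # exactly as large as needed
    for item, c, addr in zip(items, counts, addresses):
        for k in range(c):           # scatter this item's k-th output
            output[addr + k] = (item, k)
    return addresses, output

items = "a c e d f b g".split()
counts = [0, 1, 0, 1, 2, 1, 1]       # hypothetical outputs per input
addresses, output = allocate_and_scatter(items, counts)
```

The output has 6 slots (indices 0 to 5), with no space wasted on inputs that produce nothing.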
OTHER APPLICATIONS OF SCAN
Data Compression
Collision Detection
SEGMENTED SCAN
Input: 1 2 3 4 5 6 7 8
Scan: 0 1 3 6 10 15 21 28
Heads: 1 0 1 0 0 1 0 0
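A serial sketch of a segmented exclusive sum scan: the running sum restarts wherever the heads array holds a 1. The function name is an assumption.

```python
def segmented_exclusive_scan(values, heads):
    """Exclusive sum scan that restarts at every segment head (heads[i] == 1)."""
    out, acc = [], 0
    for v, h in zip(values, heads):
        if h:
            acc = 0          # a new segment starts here
        out.append(acc)
        acc += v
    return out

seg = segmented_exclusive_scan([1, 2, 3, 4, 5, 6, 7, 8],
                               [1, 0, 1, 0, 0, 1, 0, 0])
```

With the heads above, the segments are (1 2), (3 4 5) and (6 7 8), and the result is `[0, 1, 0, 3, 7, 0, 6, 13]`.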
1*0 + 0*1 + 3*2 = 6
2*0 + 4*1 + (-1)*2 = 2
0*0 + 1*1 + 5*2 = 11
COMPRESSED SPARSE
ROW
We can represent a sparse matrix in CSR format as
follows:
a 0 b
c d e
0 0 f
Value [a b c d e f]
Column [0 2 0 1 2 2]
RowPtr [0 2 5]
CSR
MULTIPLICATION
Multiply a sparse matrix in CSR format by the vector [x y z] (indices 0 1 2):
Value: [a b c d e f]
Column: [0 2 0 1 2 2]
RowPtr: [0 2 5]
1. Create segments with Value and RowPtr
[a b | c d e | f]
2. Gather vector values using Column
Column: [0 2 0 1 2 2]
Gathered: [x z x y z z]
3. Pairwise multiply the results of 1 and 2
[a*x b*z c*x d*y e*z f*z]
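A sketch of the three steps, followed by a per-segment sum, which is an assumed final step needed to turn the products into one value per row. `row_ptr` holds only the row starts, as in the RowPtr array above. The numbers are the dense 3x3 example with vector [0 1 2] and result 6, 2, 11.

```python
def csr_matvec(values, columns, row_ptr, x):
    """Sparse matrix-vector product; row_ptr holds the start of each row."""
    gathered = [x[c] for c in columns]                       # 2. gather
    products = [v * g for v, g in zip(values, gathered)]     # 3. pairwise multiply
    row_ends = row_ptr[1:] + [len(values)]                   # 1. segment boundaries
    # Assumed final step: sum the products inside each segment (row).
    return [sum(products[s:e]) for s, e in zip(row_ptr, row_ends)]

# Dense matrix [[1, 0, 3], [2, 4, -1], [0, 1, 5]] stored in CSR form.
values = [1, 3, 2, 4, -1, 1, 5]
columns = [0, 2, 0, 1, 2, 1, 2]
row_ptr = [0, 2, 5]
y = csr_matvec(values, columns, row_ptr, [0, 1, 2])
```

On a GPU the per-row sums would typically come from a segmented scan or reduce rather than a Python slice.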
5 1 4 2 3
1 5 2 4 3
1 2 5 3 4
1 2 3 5 4
1 2 3 4 5
Step? Work?
❑ O(1)
✓❑ O(n) (Step)
❑ O(log n)
❑ O(n log n)
✓❑ O(n^2) (Work)
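The sequence of arrays above is what an odd-even transposition (brick) sort produces; the sort's name is not on the slide and is inferred from the pattern. In the sketch each pass stands for one parallel step, since all compared pairs within a pass are disjoint.

```python
def odd_even_sort(values):
    """Odd-even transposition sort; returns (sorted_list, array_after_each_pass)."""
    data = list(values)
    n = len(data)
    passes = []
    for step in range(n):                    # n passes are enough to sort n items
        start = step % 2                     # alternate between even and odd pairs
        for i in range(start, n - 1, 2):     # disjoint pairs: parallel on a GPU
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
        passes.append(list(data))
    return data, passes

sorted_vals, passes = odd_even_sort([5, 1, 4, 2, 3])
```

There are n passes of about n/2 comparisons each, which gives O(n) steps and O(n^2) work.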
MERGE
SORT
Merge tree (top to bottom):
1M
512K 512K
⋮
2048 2048
1024 1024 1024 1024
512 512 512 512
⋮
1 1 1 ⋯ 1 1 1
Stage 3:
• We have only 1 merge task
• 2 huge sorted lists
• Choose splitters (256th element) in each list and sort them
• Use the merge task from stage 2 to merge the elements between splitters
Stage 2:
• Bunch of medium merge tasks
• 1 merge to 1 block
Stage 1:
• Tons of small merge tasks
• 1 merge to 1 thread
• We can use shared memory
• Also we can use a serial algorithm to sort small blocks of elements
MERGE TASK
A: 0 1 5 12 34
B: 3 4 15 59 102
Serial Algorithm
Compare the heads of the two sorted lists
Sorted: 0 1 3 4 5 12 15 34 59 102
Parallel Algorithm
for-each (element I ∈ A)
1. Find position of I in A with thread ID
2. Find position of I in B with binary search (log n)
3. Sum results of 1, 2 to get the position in the output
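A sketch of the parallel merge, written as serial loops in which every iteration is independent and could be one thread. The same procedure is applied to the elements of B. Using `bisect_left` for A and `bisect_right` for B is a design choice made here so that equal keys never land in the same slot.

```python
from bisect import bisect_left, bisect_right

def rank_merge(a, b):
    """Merge two sorted lists by computing each element's output position."""
    out = [None] * (len(a) + len(b))
    for i, x in enumerate(a):
        # position in A (index i) + position in B (binary search)
        out[i + bisect_left(b, x)] = x
    for j, y in enumerate(b):
        # ties are broken so that elements of A come before equal elements of B
        out[j + bisect_right(a, y)] = y
    return out

merged = rank_merge([0, 1, 5, 12, 34], [3, 4, 15, 59, 102])
```

Each element needs one binary search, so the work is O(n log n) for the merge but every element is placed in O(log n) steps.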
MERGE TASK
(CONT.)
List 2 splitters: E F G H
Merged splitters: E A B F C G D H
Merge tree (height log n, width n):
1M
512K 512K
⋮
2048 2048
1024 1024 1024 1024
512 512 512 512
⋮
1 1 1 ⋯ 1 1 1
Step? Work?
❑ O(1)
❑ O(n)
✓❑ O(log n) (Step)
✓❑ O(n log n) (Work)
❑ O(n^2)
RADIX
SORT
[Figure: radix sort example on 3-bit binary keys]
What is the predicate?
(i&1)==0
RADIX SORT
ANALYSIS
k: # of bits n: # of elements
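A sketch of radix sort as repeated stable splits, one per bit, starting from the least significant bit. Each split is two compacts: keys whose current bit is 0 (the predicate `(i&1)==0` for the first pass) keep their order and go first, followed by the keys whose bit is 1. The sample keys and function names are assumptions.

```python
def split_by_bit(keys, bit):
    """Stable split: keys with a 0 in `bit` first, then keys with a 1."""
    zeros = [k for k in keys if ((k >> bit) & 1) == 0]   # compact with the predicate
    ones = [k for k in keys if ((k >> bit) & 1) == 1]    # compact with its negation
    return zeros + ones

def radix_sort(keys, num_bits):
    for bit in range(num_bits):          # k passes, least significant bit first
        keys = split_by_bit(keys, bit)
    return keys

keys = [2, 7, 1, 6, 5]                   # hypothetical 3-bit keys
after_first_pass = split_by_bit(keys, 0)
sorted_keys = radix_sort(keys, 3)
```

With k bits and n elements there are k passes of O(n) work each, so the total work is O(kn).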
QUICK
SORT
3 5 2 4 1 P=3
<3: 2 1   =3: 3   >3: 5 4
2 1 P=2
<2: 1   =2: 2   >2: (none)
Sorted left part: 1 2
QUICK SORT
PARALLELIZATION
3 5 2 4 1 P=3
2 1 3 5 4
compact: <3
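A sketch of quick sort in which every partition is expressed as three compacts (less than, equal to, greater than the pivot). The list comprehensions stand in for parallel compact operations, and choosing the first element as pivot follows the example above. The recursion itself is an illustrative simplification.

```python
def quicksort_compact(values):
    """Quick sort where each partition is built from three compacts."""
    if len(values) <= 1:
        return list(values)
    pivot = values[0]
    less = [v for v in values if v < pivot]      # compact: < pivot
    equal = [v for v in values if v == pivot]    # compact: = pivot
    greater = [v for v in values if v > pivot]   # compact: > pivot
    return quicksort_compact(less) + equal + quicksort_compact(greater)

first_partition = ([v for v in [3, 5, 2, 4, 1] if v < 3],
                   [v for v in [3, 5, 2, 4, 1] if v > 3])
result = quicksort_compact([3, 5, 2, 4, 1])
```

For the input 3 5 2 4 1 with P=3 the first compacts give 2 1 and 5 4, as on the slide.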
RED EYE
REMOVAL
Stencil
Sort
Map