
PARALLEL PROCESSING

FUNDAMENTAL GPU
ALGORITHMS
 More Patterns
 Reduce
 Scan

OUTLINES
 Efficiency Measure
 Reduce primitive
 Reduce model
 Reduce implementation and complexity analysis
 Scan primitive
 Scan model
 Scan implementation and complexity analysis
 Histogram primitive
 Histogram with atomics
 Histogram with atomics and reduce
EFFICIENCY MEASURE
SEQUENTIAL VS. PARALLEL EFFICIENCY?
 Example: sum the 6 elements 6 0 2 5 1 7 (total = 21).

             Sequential (1 thread)   Parallel (3 threads)
Steps        5                       3
Total work   5                       5

 If the parallel total work equals the sequential total
work, the parallel algorithm is called Work Efficient.
REDUCE PATTERN
REDUCE DEFINITION
 Set of elements
 Reduction operator
 Binary
 Associative operation

 Example: 6 + 0 + 2 + 5 + …
QUIZ
Which of these are valid reduction operators?
✓ Multiply ( a*b )
✓ Minimum
❑ Factorial ( a! ) (not binary)
✓ Logical Or ( a||b )
✓ Logical And ( a&&b )
❑ Division ( a/b ) (not associative)
REDUCE SERIAL (SEQUENTIAL) IMPLEMENTATION

sum = 0;
for (i = 0; i < n; i++) {
    sum = sum + array[i];
}
return sum;

 Example: 6 + 0 + 2 + 5
 Steps = 3
 Total work = 3
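The serial loop above can be sketched as a runnable Python reference (the function name and the generic `op`/`identity` parameters are illustrative choices, not from the slides):

```python
def reduce_serial(arr, op=lambda a, b: a + b, identity=0):
    """Serial reduce: n-1 operations and n-1 steps for n elements."""
    acc = identity
    for x in arr:
        acc = op(acc, x)   # one operation per element
    return acc
```

For the slide's 4-element example the loop performs 3 additions.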
REDUCE SERIAL COMPLEXITY
Which is true to reduce n elements?
❑ It takes n operations
✓ It takes n-1 operations
✓ Its work complexity is O(n)
✓ Its step complexity is O(n)
REDUCE PARALLEL IMPLEMENTATION
 Associativity lets us regroup:
((a + b) + c) + d = (a + b) + (c + d)

 Example: (6 + 0) + (2 + 5)
 Steps = 2
 Total work = 3
REDUCE PARALLEL COMPLEXITY
 Example: 6 0 2 5 3 7 1 5 reduced pairwise as a tree.

n    Steps       Total work
2    1           1
4    2           3
8    3           7
n    lg n        n-1
     O(log n)    O(n)

 Actually, if we have n elements we need n/2 threads at the
first step. But this is not possible because we have only p
processors. Thus O(log n) is not accurate and we need another
calculation; this is called Brent's theorem.
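The tree reduction in the table can be sketched serially in Python; each `while` iteration stands for one parallel step (the function name and the steps counter are illustrative additions):

```python
def reduce_tree(arr, op=lambda a, b: a + b):
    """Pairwise tree reduce: lg n steps, n-1 total operations.
    Returns (result, number_of_steps)."""
    assert len(arr) > 0
    level = list(arr)
    steps = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(op(level[i], level[i + 1]))  # done in parallel on a GPU
        if len(level) % 2:          # odd leftover element carries over
            nxt.append(level[-1])
        level = nxt
        steps += 1
    return level[0], steps
```

For the slide's 8-element example this takes lg 8 = 3 steps and 7 additions.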
REDUCE IN ACTION
 Suppose we have 2^20 (1M) elements.
 Stage 1: 1024 blocks * 1024 threads; each block reduces its
1024 elements to one partial sum.
 Stage 2: 1 block reduces the 1024 partial sums.
 PS: using shared memory increases performance about 3 times.
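The two-stage idea can be sketched serially in Python (each list-comprehension entry stands for one block's work; names are illustrative, and real CUDA code would use shared memory inside each block):

```python
def reduce_two_stage(arr, block_size):
    """Stage 1: each 'block' reduces its chunk to one partial sum.
    Stage 2: a single 'block' reduces the partial sums."""
    partials = [sum(arr[i:i + block_size])          # stage 1
                for i in range(0, len(arr), block_size)]
    return sum(partials)                            # stage 2
```

With 2^20 elements and block_size = 1024, stage 1 produces exactly 1024 partials, which one block can finish.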
SCAN PATTERN
SCAN
Input:  1 2 3 4
Op:  add
Output:  1 3 6 10

 Example: running balance of transactions

Transaction   Balance
+10           10
-5            5
+4            9
+3            12
SCAN DEFINITION
 Set of elements
 Reduction operator (op)
 Binary
 Associative operation (we assume here it's also
commutative, e.g. x+y=y+x)
 Identity element [I op a = a]

op                   I      Because
+                    0      a+0=a
*                    1      a*1=a
min (unsigned char)  0xFF   min(0xFF,a)=a
SCAN DEFINITION CONT.
 Exclusive:
in :  1 0 4 2 3
Op = +  I = 0
Out:  0 1 1 5 7

 Inclusive:
in :  1 0 4 2 3
Op = +  I = 0
Out:  1 1 5 7 10
SCAN SERIAL IMPLEMENTATION AND COMPLEXITY

 Inclusive:
acc = identity;
for (i = 0; i < n; i++) {
    acc = acc + array[i];
    out[i] = acc;
}
return out;

 Exclusive:
acc = identity;
for (i = 0; i < n; i++) {
    out[i] = acc;
    acc = acc + array[i];
}
return out;

 Steps = n
 Total work = n
SCAN PARALLEL IMPLEMENTATION AND COMPLEXITY
 Inclusive (+ scan):
in :  1 0 4 2 3
Out:  1 1 5 7 10

 So if we consider the problem a set of reduce problems
with different n then:

n    Step       Work
1    lg 1       0
2    lg 2       1
…    …          …
n    lg n       n-1
     = O(log n)   = O(n²)
SCAN PARALLEL IMPLEMENTATION AND COMPLEXITY (CONT.1)

Method            Step        Work
Hillis & Steele   O(log n)    O(n log n)
Blelloch          O(log n)    O(n)
SCAN PARALLEL IMPLEMENTATION AND COMPLEXITY (CONT.2)
 Hillis & Steele (inclusive sum scan):

Step 0:  1  2  3  4  5  6  7  8
Step 1:  1  3  5  7  9 11 13 15
Step 2:  1  3  6 10 14 18 22 26
Step 3:  1  3  6 10 15 21 28 36

Steps = O(log n)   Work = area of the matrix = O(n log n)
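The steps above can be sketched serially in Python; each pass of the `while` loop is one parallel step in which every element adds its neighbor `d` positions to the left (function name illustrative):

```python
def scan_hillis_steele(arr):
    """Inclusive sum scan: lg n steps, O(n log n) total work."""
    out = list(arr)
    d = 1
    while d < len(out):
        # one parallel step: each position i >= d adds out[i - d]
        out = [out[i] + out[i - d] if i >= d else out[i]
               for i in range(len(out))]
        d *= 2
    return out
```

Note the comprehension reads only the previous step's values, mirroring the double-buffering a GPU implementation needs.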
SCAN PARALLEL IMPLEMENTATION AND COMPLEXITY (CONT.3)
 Blelloch (exclusive sum scan):

 Reduce (upsweep):  Steps = log n,  Work = O(n)
1  2  3  4  5  6  7  8
   3     7    11    15
        10          26
                    36

 Downsweep:  Steps = log n,  Work = O(n)
 Replace the root with the identity 0; then at every node
with left value L and parent value R, the left child takes R
and the right child takes L + R:

10  0
3  0  11  10
1  0  3  3  5  10  7  21
Final: 0  1  3  6  10  15  21  28
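The upsweep/downsweep pair can be sketched in place in Python (serial sketch of the parallel algorithm; each `for` loop body would run as one thread per node on a GPU, and the function name is illustrative):

```python
def scan_blelloch(arr):
    """Exclusive sum scan: ~2 lg n steps, O(n) work.
    Assumes len(arr) is a power of two."""
    n = len(arr)
    x = list(arr)
    d = 1
    while d < n:                         # upsweep (reduce)
        for i in range(2 * d - 1, n, 2 * d):
            x[i] += x[i - d]
        d *= 2
    x[n - 1] = 0                         # identity at the root
    d = n // 2
    while d >= 1:                        # downsweep
        for i in range(2 * d - 1, n, 2 * d):
            left = x[i - d]
            x[i - d] = x[i]              # left child takes parent's value
            x[i] += left                 # right child takes old left + parent
        d //= 2
    return x
```

Running it on the slide's input reproduces the intermediate row 1 0 3 3 5 10 7 21 before the last downsweep step.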
QUIZ
 Which scan algorithm to use?

                  Serial   Hillis & Steele   Blelloch
512 elements,
512 processors               ✓
1M elements,
512 processors                                ✓
128k elements,
1 processor        ✓
HISTOGRAM
 Measure the students' heights in your class and count them
into bins:

 Bins: <161   161-175   176-180   >180
 (bar chart of the count per bin)

 How many are shorter than 175? Sum the first two bins.
 Cumulative distribution (Scan)
HISTOGRAM SERIAL IMPLEMENTATION

In: measurements[], n_elements
Out: result[]

for (i = 0; i < bin_count; i++)
    result[i] = 0;

for (i = 0; i < n_elements; i++)
    result[computeBin(measurements[i])]++;
HISTOGRAM PARALLEL NAÏVE

In: measurements[], n_elements
Out: result[]

for (i = 0; i < bin_count; i++)
    result[i] = 0;

for (i = 0; i < n_elements; i++)
    result[computeBin(measurements[i])]++;

 Running the second loop with one thread per element gives
us a synchronization problem.
HISTOGRAM PARALLEL NAÏVE (CONT.)
 Example
 128 elements, 8 threads, 3 bins

 Thread 1 and Thread 2 both run read-increment-write on the
same bin whose value in memory is 5:
 Read: both read 5
 Increment: both compute 5 + 1 in their registers
 Write: both write 6; one increment is lost

 Race condition
HISTOGRAM PARALLEL SIMPLE
 Use atomic operations
 However, this serializes the problem.
HISTOGRAM PARALLEL REDUCE BASED
 Example
 128 elements, 8 threads, 3 bins

 Use local bins: every thread has 3 local bins.

Bin \ Thread   0   1   ----   7
0
1
2

 Each thread accumulates 16 items into its local bins.
 Then apply a reduce back into global memory, once per bin
(3 reduces here, which is bad).
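The local-bins idea can be sketched serially in Python (each `t` plays the role of one thread working on its private bins, so no atomics are needed; names are illustrative):

```python
def histogram_local_bins(data, n_threads, n_bins, compute_bin):
    """Each 'thread' fills its own local bins over a chunk of the
    input, then the local bins are reduced (summed) per bin."""
    chunk = (len(data) + n_threads - 1) // n_threads
    local = [[0] * n_bins for _ in range(n_threads)]
    for t in range(n_threads):
        for x in data[t * chunk:(t + 1) * chunk]:
            local[t][compute_bin(x)] += 1        # private bins: no race
    return [sum(local[t][b] for t in range(n_threads))  # reduce per bin
            for b in range(n_bins)]
```

With 128 elements, 8 threads, and 3 bins, each thread handles 16 items and the final reduce runs once per bin.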
HISTOGRAM PARALLEL SORT & REDUCE
 Example
 128 elements, 8 threads, 3 bins

Memory: …. 1 0 0 2 1 2 1 0 ….

 First we sort by bin:
Memory: …. 0 0 0 1 1 1 2 2 ….

 Then we reduce each run of equal bins.
TONE MAPPING
 Using reduce, scan and histogram
QUIZ
On a scan of n elements the amount of work is:
❑ O(log n)
✓ O(n)
❑ O(n log n)
❑ O(n²)

On a scan of n elements the number of steps is:
✓ O(log n)
❑ O(n)
❑ O(n log n)
❑ O(n²)
OUTLINES
 Compact
 Compact-like
 Segmented scan
 Sorting
 Odd-Even Sort
 Merge Sort
 Radix Sort
 Quick Sort
COMPACT (FILTER)
 Filter red color objects only
 This helps us keep the objects we care about and ignore
the others.
 This saves space and processing time.
COMPACT MODEL
Input:      a b c d e …

Predicate:  T F T F T …
e.g. "Is my index even?"

Output:     a - c - e …   Sparse

            a c e …       Dense
WHY WE PREFER DENSE COMPACT?
 Suppose we want to apply toGray() on every red object.

//Sparse
if (isRed(object)) {
    toGray(object)
}

//Dense
cc = compact(objects, isRed())
map(cc, toGray())

 Sparse launches one thread per object (e.g. 15 threads,
most of them idle); dense launches threads only for the
compacted elements (e.g. 4 threads).
QUIZ
When to use compact?

When the number of elements is …
❑ Small
✓ Large

When the operations on these elements are …
✓ Expensive (complex)
❑ Cheap (Simple)
COMPACT PARALLELIZATION
 How to compute the scatter address/index in parallel?

Input:      a b c d e …

Predicate:  T F T F T …
e.g. "Is my index even?"

Output:     a - c - e …   Sparse

            a c e …       Dense
COMPACT PARALLELIZATION
 We can paraphrase the problem as follows:

Input:  T F F T T F
Output: 0 - - 1 2 -

 And for the computer we paraphrase it again as follows:

Input:  1 0 0 1 1 0
Output: 0 1 1 1 2 3

 This is an Exclusive Scan.
COMPACT PARALLEL ALGORITHM

1. Generate the predicate array (map):
   predicate_array = predicate_function(input_array)
2. Generate the scan-in array of 1s and 0s:
   scan_in_array = convert(predicate_array)
3. Generate the scatter-addresses array (scan):
   addresses_array = exclusive_sum_scan(scan_in_array)
4. Scatter input elements to their addresses (scatter):
   output = scatter(addresses_array, input_array)
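The four steps above can be sketched serially in Python (each loop would be a parallel map/scan/scatter kernel on a GPU; function names are illustrative):

```python
def exclusive_scan(xs):
    """Serial exclusive sum scan."""
    out, acc = [], 0
    for x in xs:
        out.append(acc)
        acc += x
    return out

def compact(inp, predicate):
    """Compact = map -> scan -> scatter."""
    pred = [predicate(x) for x in inp]          # 1. map: predicate array
    scan_in = [1 if p else 0 for p in pred]     # 2. 1s and 0s
    addr = exclusive_scan(scan_in)              # 3. scatter addresses
    n_out = (addr[-1] + scan_in[-1]) if inp else 0
    out = [None] * n_out
    for i, x in enumerate(inp):                 # 4. scatter
        if pred[i]:
            out[addr[i]] = x
    return out
```

The output length is the last address plus the last scan-in flag, i.e. the reduce of the 1s-and-0s array.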
QUIZ
 Suppose we need to compact 1M elements with the following
predicate functions:
 A: isDivisiableBy_17( )      Few elements pass
 B: isNotDivisiableBy_34( )   Many elements pass

          A Faster   A=B   B Faster
Map                   ✓
Scan                  ✓
Scatter      ✓
COMPACT-LIKE

RECAP: COMPACT ALLOCATION
 Compact allocates 1 output slot for each true element and
0 output slots for each false element.

 Can we generalize? (not only 1s and 0s)
 The number of outputs can be computed dynamically for each
input item.
CLIPPING
 Suppose a set of triangles is sent as input to a computer
graphics pipeline.

CLIPPING PROBLEM
 How to clip triangles at the boundaries?
BAD SOLUTION

Input: a c e d f b g
 Allocate the maximum possible space in an intermediate
array
 5 for each triangle in our case
 Intermediate: a ? ? ? ? b ? ---
 Apply compact

Disadvantages:
 Wasteful in space
 Scanning a large intermediate array
GOOD SOLUTION (GENERAL COMPACT)

Input:     a c e d f b g
 Allocation requests per input element:
Requests:  0 1 0 1 2 1 1
 Apply exclusive scan:
Addresses: 0 0 1 1 2 4 5
 Allocate the output array with size = last address + last
request, then scatter:
Output slots: 0 1 2 3 4 5
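The generalized allocation can be sketched serially in Python (the `num_outputs` and `expand` callbacks are illustrative names for "requests per element" and "produce the k-th output of an element"):

```python
def allocate_and_scatter(inp, num_outputs, expand):
    """Variable allocation: exclusive-scan the per-element requests,
    then scatter each element's outputs starting at its address."""
    req = [num_outputs(x) for x in inp]
    addr, acc = [], 0
    for r in req:                 # exclusive sum scan of the requests
        addr.append(acc)
        acc += r
    out = [None] * acc            # total size = reduce of the requests
    for x, a, r in zip(inp, addr, req):
        for k in range(r):
            out[a + k] = expand(x, k)
    return out
```

Using the slide's request pattern 0 1 0 1 2 1 1 yields addresses 0 0 1 1 2 4 5 and a 6-slot output.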
OTHER APPLICATIONS OF SCAN
 Data Compression
 Collision Detection
SEGMENTED SCAN
 Many small scans
 Launch each independently
 Combine as segments
 Remember we pack all segments into one big array to be
processed by one kernel, instead of running a separate kernel
over each segment, to gain max benefit from GPU power.
 We use a separate array to indicate the segments' heads.

Input: 1 2 3 4 5 6 7 8   Exclusive Scan →  0 1 3 6 10 15 21 28

Heads: 1 0 1 0 0 1 0 0
Input: 1 2 3 4 5 6 7 8   Segmented Scan →  0 1 0 3 7 0 6 13
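The example above can be sketched serially in Python: the head flag simply resets the accumulator to the identity (function name illustrative; a GPU version would do this with a modified scan operator in one kernel):

```python
def segmented_exclusive_scan(xs, heads):
    """heads[i] == 1 marks the start of a new segment; each segment
    gets its own exclusive sum scan, all in one pass."""
    out, acc = [], 0
    for x, h in zip(xs, heads):
        if h:
            acc = 0               # identity restarts at a segment head
        out.append(acc)
        acc += x
    return out
```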
SEGMENTED SCAN APPLICATION
 "Sparse matrix * dense vector" multiplication (SpMv)
 Sparse matrix: contains a lot of zeros
 Dense matrix: doesn't contain zeros
 Multiplying a sparse matrix densely comes with a lot of
unnecessary multiplications
 E.g. Google PageRank over all web pages:
 a non-zero value indicates there is a link between the
webpage indicated by the column index and the webpage
indicated by the row index.
REVIEW MATRIX MULTIPLICATION
| 1  0  3 |   | 0 |
| 2  4 -1 | x | 1 | =
| 0  1  5 |   | 2 |

(1*0) + (0*1) + (3*2)      6
(2*0) + (4*1) + (-1*2)  =  2
(0*0) + (1*1) + (5*2)      11
COMPRESSED SPARSE ROW
 We can represent a sparse matrix in CSR format as follows:
| a 0 b |
| c d e |
| 0 0 f |

         H     H       H
Value  [ a b   c d e   f ]
Column [ 0 2   0 1 2   2 ]
RowPtr [ 0     2       5 ]
CSR MULTIPLICATION
 Multiply the CSR matrix by the vector [x y z] (rows 0, 1, 2):
 Value:  [a b c d e f]
 Column: [0 2 0 1 2 2]
 RowPtr: [0 2 5]

1. Create segments from Value and RowPtr:
   [a b | c d e | f]
2. Gather vector values using Column:
   [x z x y z z]
3. Pairwise multiply the results of 1 and 2:
   [a*x b*z c*x d*y e*z f*z]
4. Apply a backward segmented inclusive sum scan; each row's
result lands at its segment head: out(0) out(1) out(2)

 Can we apply a segmented reduce instead?
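The per-row work (gather, multiply, segment reduce) can be sketched serially in Python over the three CSR arrays (function name illustrative; here each row's segment is reduced directly instead of scanned):

```python
def csr_spmv(value, column, row_ptr, vec):
    """Sparse matrix * dense vector using CSR arrays.
    row_ptr[r] is the index in `value` where row r starts."""
    bounds = list(row_ptr) + [len(value)]
    out = []
    for r in range(len(row_ptr)):
        s = 0
        for j in range(bounds[r], bounds[r + 1]):
            s += value[j] * vec[column[j]]   # gather + pairwise multiply
        out.append(s)                        # segment reduce
    return out
```

Feeding it the review slide's matrix reproduces the result vector [6, 2, 11].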


SORTING
ODD-EVEN (BRICK) SORT
 It's the parallel version of bubble sort

5 1 4 2 3
1 5 2 4 3
1 2 5 3 4
1 2 3 5 4
1 2 3 4 5

 Steps: O(n) ✓
 Work: O(n²) ✓
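The phases above can be sketched in Python; the inner loop's compare-swaps are independent, so on a GPU each pair gets its own thread (function name illustrative):

```python
def odd_even_sort(arr):
    """Parallel bubble sort: alternate compare-swap over even-indexed
    pairs and odd-indexed pairs; n phases suffice for n elements."""
    a = list(arr)
    n = len(a)
    for phase in range(n):
        start = phase % 2                 # even phase, then odd phase
        for i in range(start, n - 1, 2):  # all pairs run in parallel
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```

O(n) phases of O(n) comparisons each gives the O(n²) work noted above.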
MERGE SORT
 Example: sorting 1M elements bottom-up.

Stage 1 (bottom, 1-element lists up to blocks of 512):
 Tons of small merge tasks
 1 merge to 1 thread
 We can use shared memory
 Also we can use a serial algorithm to sort a small block
of elements

Stage 2 (middle, e.g. 512 + 512 → 1024 … 2048):
 A bunch of medium merge tasks
 1 merge to 1 block

Stage 3 (top, 512K + 512K → 1M):
 We have only 1 merge task over 2 huge sorted lists
 Choose splitters (every 256th element) in each list and
sort them
 Use the merge task of stage 2 to merge the elements
between splitters
MERGE TASK
 Serial algorithm: compare the heads of the two sorted
lists and output the smaller.

A: 0 1 5 12 34    B: 3 4 15 59 102
Out: 0 1 3 4 5 12 15 34 59 102

 Parallel algorithm:
for-each (element i ∈ A), in parallel:
1. Find the position of i in A: its own thread ID
2. Find the position of i in B: binary search, O(log n)
3. Sum the results of 1 and 2 to get the position in the
output
(and symmetrically for the elements of B)
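The rank-based merge can be sketched in Python with the standard library's binary search; using `bisect_left` for one list and `bisect_right` for the other breaks ties so equal keys never collide (function name illustrative):

```python
import bisect

def merge_parallel(a, b):
    """Each element's output position = its index in its own list
    + its rank in the other list, found by binary search."""
    out = [None] * (len(a) + len(b))
    for i, x in enumerate(a):                   # one thread per element
        out[i + bisect.bisect_left(b, x)] = x
    for j, y in enumerate(b):
        out[j + bisect.bisect_right(a, y)] = y  # ties go after a's copy
    return out
```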
MERGE TASK (CONT.)
 How to merge two huge lists in parallel?
 No one block can do all of this work.

List 1: take every nth (e.g. 256th) element as splitters:
A B C D
List 2: take every nth element as splitters: E F G H

 Merge the splitters: E A B F C G D H
 Merge the elements between every two consecutive
splitters, e.g. F and C:
1. Find F's position in list 1
2. Find C's position in list 2
3. Merge the elements between these two positions in each
list.
MERGE SORT ANALYSIS
 log n levels of merging, n work per level:

 Steps: O(log n) ✓
 Work: O(n log n) ✓
RADIX SORT
1. Start with the LSB
2. Split the input into 2 sets based on the current bit
(stable: zeros first, then ones)
3. Move to the next bit toward the MSB, and repeat 2

Input:        0    5    4    7    1    6
Binary:       000  101  100  111  001  110
After bit 0:  000  100  110  101  111  001
After bit 1:  000  100  101  001  110  111
After bit 2:  000  001  100  101  110  111
Output:       0    1    4    5    6    7
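The stable split per bit can be sketched in Python; each pass is the compact described on the next slide, run once with the zeros predicate and once with the ones predicate (function name illustrative):

```python
def radix_sort(arr, n_bits):
    """LSD radix sort: k stable splits (zeros then ones), one per
    bit. Work O(kn), steps O(k)."""
    a = list(arr)
    for bit in range(n_bits):
        zeros = [x for x in a if (x >> bit) & 1 == 0]
        ones = [x for x in a if (x >> bit) & 1 == 1]
        a = zeros + ones          # stable: relative order preserved
    return a
```

The intermediate passes reproduce the columns in the table above.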
RADIX SORT PARALLELIZATION

Input:        0    5    4    7    1    6
Binary:       000  101  100  111  001  110
After bit 0:  000  100  110  101  111  001

What is this algorithm?  Compact
What is the predicate?  (i & 1) == 0
RADIX SORT ANALYSIS
 Work: O(kn), linear
 k: # of bits   n: # of elements
 Steps: O(k)
QUICK SORT
 Choose a pivot element
 Compare all elements with the pivot
 Split into 3 arrays: <p, =p, >p
 Recurse on each array

3 5 2 4 1      P=3
<3: 2 1        =3: 3        >3: 5 4
then 2 1 with P=2:
<2: 1          =2: 2        >2: (empty)
QUICK SORT PARALLELIZATION
 Old GPUs don't support recursion

3 5 2 4 1      P=3
compact: <3    compact: =3    compact: >3
2 1            3              5 4

compact: <2    …

 Current GPUs support recursion
NOTE
 All sort algorithms that we have studied are key-value
sorts, where we usually depend on an integer key to sort.
 However, if you have items with a different data layout
(e.g. a structure with many values):
 Use a key or a pointer to this value to apply the sorting.
RED EYE REMOVAL
 Stencil
 Sort
 Map
