
PARALLEL PROCESSING

FUNDAMENTAL GPU
ALGORITHMS
 More Patterns
 Reduce
 Scan

OUTLINES
 Efficiency Measure
 Reduce primitive
 Reduce model
 Reduce implementation and complexity analysis
 Scan primitive
 Scan model
 Scan implementation and complexity analysis
 Histogram primitive
 Histogram with atomics
 Histogram with atomics and reduce
EFFICIENCY MEASURE
SEQUENTIAL VS. PARALLEL EFFICIENCY?
 Example: sum the 6 elements 6 0 2 5 1 7 (total = 21).

             Sequential (1 thread)   Parallel (3 threads)
Steps        5                       3
Total work   5                       5

 If the parallel total work equals the sequential total
work, the parallel algorithm is called Work Efficient.
REDUCE PATTERN
REDUCE DEFINITION
 Set of elements
 Reduction operator
 Binary
 Associative operation

 Example: 6 + 0 + 2 + 5 + …
QUIZ
Which of these are valid reduction operators?
✓ Multiply ( a*b )
✓ Minimum
❑ Factorial ( a! ) (not binary)
✓ Logical Or ( a||b )
✓ Logical And ( a&&b )
❑ Division ( a/b ) (not associative)
REDUCE SERIAL (SEQUENTIAL) IMPLEMENTATION

sum = 0;
for (i = 0; i < n; i++) {
    sum = sum + array[i];
}
return sum;

 Example: 6 + 0 + 2 + 5
 Steps = 3
 Total work = 3
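The serial loop above can be sketched as a runnable Python reference (the function name and the generic `op`/`identity` parameters are illustrative choices, not from the slides):

```python
def reduce_serial(arr, op=lambda a, b: a + b, identity=0):
    """Serial reduce: n-1 operations and n-1 steps for n elements."""
    acc = identity
    for x in arr:
        acc = op(acc, x)   # one operation per element
    return acc
```

For the slide's 4-element example the loop performs 3 additions.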
REDUCE SERIAL COMPLEXITY
Which is true to reduce n elements?
❑ It takes n operations
✓ It takes n-1 operations
✓ Its work complexity is O(n)
✓ Its step complexity is O(n)
REDUCE PARALLEL IMPLEMENTATION
 Associativity lets us regroup:
((a + b) + c) + d = (a + b) + (c + d)

 Example: (6 + 0) + (2 + 5)
 Steps = 2
 Total work = 3
REDUCE PARALLEL COMPLEXITY
 Example: 6 0 2 5 3 7 1 5 reduced pairwise as a tree.

n    Steps       Total work
2    1           1
4    2           3
8    3           7
n    lg n        n-1
     O(log n)    O(n)

 Actually, if we have n elements we need n/2 threads at the
first step. But this is not possible because we have only p
processors. Thus O(log n) is not accurate and we need another
calculation; this is called Brent's theorem.
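The tree reduction in the table can be sketched serially in Python; each `while` iteration stands for one parallel step (the function name and the steps counter are illustrative additions):

```python
def reduce_tree(arr, op=lambda a, b: a + b):
    """Pairwise tree reduce: lg n steps, n-1 total operations.
    Returns (result, number_of_steps)."""
    assert len(arr) > 0
    level = list(arr)
    steps = 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(op(level[i], level[i + 1]))  # done in parallel on a GPU
        if len(level) % 2:          # odd leftover element carries over
            nxt.append(level[-1])
        level = nxt
        steps += 1
    return level[0], steps
```

For the slide's 8-element example this takes lg 8 = 3 steps and 7 additions.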
REDUCE IN ACTION
 Suppose we have 2^20 (1M) elements.
 Stage 1: 1024 blocks * 1024 threads; each block reduces its
1024 elements to one partial sum.
 Stage 2: 1 block reduces the 1024 partial sums.
 PS: using shared memory increases performance about 3 times.
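The two-stage idea can be sketched serially in Python (each list-comprehension entry stands for one block's work; names are illustrative, and real CUDA code would use shared memory inside each block):

```python
def reduce_two_stage(arr, block_size):
    """Stage 1: each 'block' reduces its chunk to one partial sum.
    Stage 2: a single 'block' reduces the partial sums."""
    partials = [sum(arr[i:i + block_size])          # stage 1
                for i in range(0, len(arr), block_size)]
    return sum(partials)                            # stage 2
```

With 2^20 elements and block_size = 1024, stage 1 produces exactly 1024 partials, which one block can finish.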
SCAN PATTERN
SCAN
Input:  1 2 3 4
Op:  add
Output:  1 3 6 10

 Example: running balance of transactions

Transaction   Balance
+10           10
-5            5
+4            9
+3            12
SCAN DEFINITION
 Set of elements
 Reduction operator (op)
 Binary
 Associative operation (we assume here it's also
commutative, e.g. x+y=y+x)
 Identity element [I op a = a]

op                   I      Because
+                    0      a+0=a
*                    1      a*1=a
min (unsigned char)  0xFF   min(0xFF,a)=a
SCAN DEFINITION CONT.
 Exclusive:
in :  1 0 4 2 3
Op = +  I = 0
Out:  0 1 1 5 7

 Inclusive:
in :  1 0 4 2 3
Op = +  I = 0
Out:  1 1 5 7 10
SCAN SERIAL IMPLEMENTATION AND COMPLEXITY

 Inclusive:
acc = identity;
for (i = 0; i < n; i++) {
    acc = acc + array[i];
    out[i] = acc;
}
return out;

 Exclusive:
acc = identity;
for (i = 0; i < n; i++) {
    out[i] = acc;
    acc = acc + array[i];
}
return out;

 Steps = n
 Total work = n
SCAN PARALLEL IMPLEMENTATION AND COMPLEXITY
 Inclusive (+ scan):
in :  1 0 4 2 3
Out:  1 1 5 7 10

 So if we consider the problem a set of reduce problems
with different n then:

n    Step       Work
1    lg 1       0
2    lg 2       1
…    …          …
n    lg n       n-1
     = O(log n)   = O(n²)
SCAN PARALLEL IMPLEMENTATION AND COMPLEXITY (CONT.1)

Method            Step        Work
Hillis & Steele   O(log n)    O(n log n)
Blelloch          O(log n)    O(n)
SCAN PARALLEL IMPLEMENTATION AND COMPLEXITY (CONT.2)
 Hillis & Steele (inclusive sum scan):

Step 0:  1  2  3  4  5  6  7  8
Step 1:  1  3  5  7  9 11 13 15
Step 2:  1  3  6 10 14 18 22 26
Step 3:  1  3  6 10 15 21 28 36

Steps = O(log n)   Work = area of the matrix = O(n log n)
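The steps above can be sketched serially in Python; each pass of the `while` loop is one parallel step in which every element adds its neighbor `d` positions to the left (function name illustrative):

```python
def scan_hillis_steele(arr):
    """Inclusive sum scan: lg n steps, O(n log n) total work."""
    out = list(arr)
    d = 1
    while d < len(out):
        # one parallel step: each position i >= d adds out[i - d]
        out = [out[i] + out[i - d] if i >= d else out[i]
               for i in range(len(out))]
        d *= 2
    return out
```

Note the comprehension reads only the previous step's values, mirroring the double-buffering a GPU implementation needs.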
SCAN PARALLEL IMPLEMENTATION AND COMPLEXITY (CONT.3)
 Blelloch (exclusive sum scan):

 Reduce (upsweep):  Steps = log n,  Work = O(n)
1  2  3  4  5  6  7  8
   3     7    11    15
        10          26
                    36

 Downsweep:  Steps = log n,  Work = O(n)
 Replace the root with the identity 0; then at every node
with left value L and parent value R, the left child takes R
and the right child takes L + R:

10  0
3  0  11  10
1  0  3  3  5  10  7  21
Final: 0  1  3  6  10  15  21  28
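The upsweep/downsweep pair can be sketched in place in Python (serial sketch of the parallel algorithm; each `for` loop body would run as one thread per node on a GPU, and the function name is illustrative):

```python
def scan_blelloch(arr):
    """Exclusive sum scan: ~2 lg n steps, O(n) work.
    Assumes len(arr) is a power of two."""
    n = len(arr)
    x = list(arr)
    d = 1
    while d < n:                         # upsweep (reduce)
        for i in range(2 * d - 1, n, 2 * d):
            x[i] += x[i - d]
        d *= 2
    x[n - 1] = 0                         # identity at the root
    d = n // 2
    while d >= 1:                        # downsweep
        for i in range(2 * d - 1, n, 2 * d):
            left = x[i - d]
            x[i - d] = x[i]              # left child takes parent's value
            x[i] += left                 # right child takes old left + parent
        d //= 2
    return x
```

Running it on the slide's input reproduces the intermediate row 1 0 3 3 5 10 7 21 before the last downsweep step.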
QUIZ
 Which scan algorithm to use?

                  Serial   Hillis & Steele   Blelloch
512 elements,
512 processors               ✓
1M elements,
512 processors                                ✓
128k elements,
1 processor        ✓
HISTOGRAM
 Measure the students' heights in your class and count them
into bins:

 Bins: <161   161-175   176-180   >180
 (bar chart of the count per bin)

 How many are shorter than 175? Sum the first two bins.
 Cumulative distribution (Scan)
HISTOGRAM SERIAL IMPLEMENTATION

In: measurements[], n_elements
Out: result[]

for (i = 0; i < bin_count; i++)
    result[i] = 0;

for (i = 0; i < n_elements; i++)
    result[computeBin(measurements[i])]++;
HISTOGRAM PARALLEL NAÏVE

In: measurements[], n_elements
Out: result[]

for (i = 0; i < bin_count; i++)
    result[i] = 0;

for (i = 0; i < n_elements; i++)
    result[computeBin(measurements[i])]++;

 Running the second loop with one thread per element gives
us a synchronization problem.
HISTOGRAM PARALLEL NAÏVE (CONT.)
 Example
 128 elements, 8 threads, 3 bins

 Thread 1 and Thread 2 both run read-increment-write on the
same bin whose value in memory is 5:
 Read: both read 5
 Increment: both compute 5 + 1 in their registers
 Write: both write 6; one increment is lost

 Race condition
HISTOGRAM PARALLEL SIMPLE
 Use atomic operations
 However, this serializes the problem.
HISTOGRAM PARALLEL REDUCE BASED
 Example
 128 elements, 8 threads, 3 bins

 Use local bins: every thread has 3 local bins.

Bin \ Thread   0   1   ----   7
0
1
2

 Each thread accumulates 16 items into its local bins.
 Then apply a reduce back into global memory, once per bin
(3 reduces here, which is bad).
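The local-bins idea can be sketched serially in Python (each `t` plays the role of one thread working on its private bins, so no atomics are needed; names are illustrative):

```python
def histogram_local_bins(data, n_threads, n_bins, compute_bin):
    """Each 'thread' fills its own local bins over a chunk of the
    input, then the local bins are reduced (summed) per bin."""
    chunk = (len(data) + n_threads - 1) // n_threads
    local = [[0] * n_bins for _ in range(n_threads)]
    for t in range(n_threads):
        for x in data[t * chunk:(t + 1) * chunk]:
            local[t][compute_bin(x)] += 1        # private bins: no race
    return [sum(local[t][b] for t in range(n_threads))  # reduce per bin
            for b in range(n_bins)]
```

With 128 elements, 8 threads, and 3 bins, each thread handles 16 items and the final reduce runs once per bin.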
HISTOGRAM PARALLEL SORT & REDUCE
 Example
 128 elements, 8 threads, 3 bins

Memory: …. 1 0 0 2 1 2 1 0 ….

 First we sort by bin:
Memory: …. 0 0 0 1 1 1 2 2 ….

 Then we reduce each run of equal bins.
TONE MAPPING
 Using reduce, scan and histogram
QUIZ
On a scan of n elements the amount of work is:
❑ O(log n)
✓ O(n)
❑ O(n log n)
❑ O(n²)

On a scan of n elements the number of steps is:
✓ O(log n)
❑ O(n)
❑ O(n log n)
❑ O(n²)
OUTLINES
 Compact
 Compact-like
 Segmented scan
 Sorting
 Odd-Even Sort
 Merge Sort
 Radix Sort
 Quick Sort
COMPACT (FILTER)
 Filter red color objects only
 This helps us keep the objects we care about and ignore
the others.
 This saves space and processing time.
COMPACT MODEL
Input:      a b c d e …

Predicate:  T F T F T …
e.g. "Is my index even?"

Output:     a - c - e …   Sparse

            a c e …       Dense
WHY WE PREFER DENSE COMPACT?
 Suppose we want to apply toGray() on every red object.

//Sparse
if (isRed(object)) {
    toGray(object)
}

//Dense
cc = compact(objects, isRed())
map(cc, toGray())

 Sparse launches one thread per object (e.g. 15 threads,
most of them idle); dense launches threads only for the
compacted elements (e.g. 4 threads).
QUIZ
When to use compact?

When the number of elements is …
❑ Small
✓ Large

When the operations on these elements are …
✓ Expensive (complex)
❑ Cheap (Simple)
COMPACT PARALLELIZATION
 How to compute the scatter address/index in parallel?

Input:      a b c d e …

Predicate:  T F T F T …
e.g. "Is my index even?"

Output:     a - c - e …   Sparse

            a c e …       Dense
COMPACT PARALLELIZATION
 We can paraphrase the problem as follows:

Input:  T F F T T F
Output: 0 - - 1 2 -

 And for the computer we paraphrase it again as follows:

Input:  1 0 0 1 1 0
Output: 0 1 1 1 2 3

 This is an Exclusive Scan.
COMPACT PARALLEL ALGORITHM

1. Generate the predicate array (map):
   predicate_array = predicate_function(input_array)
2. Generate the scan-in array of 1s and 0s:
   scan_in_array = convert(predicate_array)
3. Generate the scatter-addresses array (scan):
   addresses_array = exclusive_sum_scan(scan_in_array)
4. Scatter input elements to their addresses (scatter):
   output = scatter(addresses_array, input_array)
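The four steps above can be sketched serially in Python (each loop would be a parallel map/scan/scatter kernel on a GPU; function names are illustrative):

```python
def exclusive_scan(xs):
    """Serial exclusive sum scan."""
    out, acc = [], 0
    for x in xs:
        out.append(acc)
        acc += x
    return out

def compact(inp, predicate):
    """Compact = map -> scan -> scatter."""
    pred = [predicate(x) for x in inp]          # 1. map: predicate array
    scan_in = [1 if p else 0 for p in pred]     # 2. 1s and 0s
    addr = exclusive_scan(scan_in)              # 3. scatter addresses
    n_out = (addr[-1] + scan_in[-1]) if inp else 0
    out = [None] * n_out
    for i, x in enumerate(inp):                 # 4. scatter
        if pred[i]:
            out[addr[i]] = x
    return out
```

The output length is the last address plus the last scan-in flag, i.e. the reduce of the 1s-and-0s array.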
QUIZ
 Suppose we need to compact 1M elements with the following
predicate functions:
 A: isDivisiableBy_17( )      Few elements pass
 B: isNotDivisiableBy_34( )   Many elements pass

          A Faster   A=B   B Faster
Map                   ✓
Scan                  ✓
Scatter      ✓
COMPACT-LIKE

RECAP: COMPACT ALLOCATION
 Compact allocates 1 output slot for each true element and
0 output slots for each false element.

 Can we generalize? (not only 1s and 0s)
 The number of outputs can be computed dynamically for each
input item.
CLIPPING
 Suppose a set of triangles is sent as input to a computer
graphics pipeline.

CLIPPING PROBLEM
 How to clip triangles at the boundaries?
BAD SOLUTION

Input: a c e d f b g
 Allocate the maximum possible space in an intermediate
array
 5 for each triangle in our case
 Intermediate: a ? ? ? ? b ? ---
 Apply compact

Disadvantages:
 Wasteful in space
 Scanning a large intermediate array
GOOD SOLUTION (GENERAL COMPACT)

Input:     a c e d f b g
 Allocation requests per input element:
Requests:  0 1 0 1 2 1 1
 Apply exclusive scan:
Addresses: 0 0 1 1 2 4 5
 Allocate the output array with size = last address + last
request, then scatter:
Output slots: 0 1 2 3 4 5
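The generalized allocation can be sketched serially in Python (the `num_outputs` and `expand` callbacks are illustrative names for "requests per element" and "produce the k-th output of an element"):

```python
def allocate_and_scatter(inp, num_outputs, expand):
    """Variable allocation: exclusive-scan the per-element requests,
    then scatter each element's outputs starting at its address."""
    req = [num_outputs(x) for x in inp]
    addr, acc = [], 0
    for r in req:                 # exclusive sum scan of the requests
        addr.append(acc)
        acc += r
    out = [None] * acc            # total size = reduce of the requests
    for x, a, r in zip(inp, addr, req):
        for k in range(r):
            out[a + k] = expand(x, k)
    return out
```

Using the slide's request pattern 0 1 0 1 2 1 1 yields addresses 0 0 1 1 2 4 5 and a 6-slot output.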
OTHER APPLICATIONS OF SCAN
 Data Compression
 Collision Detection
SEGMENTED SCAN
 Many small scans
 Launch each independently
 Combine as segments
 Remember we pack all segments into one big array to be
processed by one kernel, instead of running a separate kernel
over each segment, to gain max benefit from GPU power.
 We use a separate array to indicate the segments' heads.

Input: 1 2 3 4 5 6 7 8   Exclusive Scan →  0 1 3 6 10 15 21 28

Heads: 1 0 1 0 0 1 0 0
Input: 1 2 3 4 5 6 7 8   Segmented Scan →  0 1 0 3 7 0 6 13
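The example above can be sketched serially in Python: the head flag simply resets the accumulator to the identity (function name illustrative; a GPU version would do this with a modified scan operator in one kernel):

```python
def segmented_exclusive_scan(xs, heads):
    """heads[i] == 1 marks the start of a new segment; each segment
    gets its own exclusive sum scan, all in one pass."""
    out, acc = [], 0
    for x, h in zip(xs, heads):
        if h:
            acc = 0               # identity restarts at a segment head
        out.append(acc)
        acc += x
    return out
```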
SEGMENTED SCAN APPLICATION
 "Sparse matrix * dense vector" multiplication (SpMv)
 Sparse matrix: contains a lot of zeros
 Dense matrix: doesn't contain zeros
 Multiplying a sparse matrix densely comes with a lot of
unnecessary multiplications
 E.g. Google PageRank over all web pages:
 a non-zero value indicates there is a link between the
webpage indicated by the column index and the webpage
indicated by the row index.
REVIEW MATRIX MULTIPLICATION
| 1  0  3 |   | 0 |
| 2  4 -1 | x | 1 | =
| 0  1  5 |   | 2 |

(1*0) + (0*1) + (3*2)      6
(2*0) + (4*1) + (-1*2)  =  2
(0*0) + (1*1) + (5*2)      11
COMPRESSED SPARSE ROW
 We can represent a sparse matrix in CSR format as follows:
| a 0 b |
| c d e |
| 0 0 f |

         H     H       H
Value  [ a b   c d e   f ]
Column [ 0 2   0 1 2   2 ]
RowPtr [ 0     2       5 ]
CSR MULTIPLICATION
 Multiply the CSR matrix by the vector [x y z] (rows 0, 1, 2):
 Value:  [a b c d e f]
 Column: [0 2 0 1 2 2]
 RowPtr: [0 2 5]

1. Create segments from Value and RowPtr:
   [a b | c d e | f]
2. Gather vector values using Column:
   [x z x y z z]
3. Pairwise multiply the results of 1 and 2:
   [a*x b*z c*x d*y e*z f*z]
4. Apply a backward segmented inclusive sum scan; each row's
result lands at its segment head: out(0) out(1) out(2)

 Can we apply a segmented reduce instead?
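The per-row work (gather, multiply, segment reduce) can be sketched serially in Python over the three CSR arrays (function name illustrative; here each row's segment is reduced directly instead of scanned):

```python
def csr_spmv(value, column, row_ptr, vec):
    """Sparse matrix * dense vector using CSR arrays.
    row_ptr[r] is the index in `value` where row r starts."""
    bounds = list(row_ptr) + [len(value)]
    out = []
    for r in range(len(row_ptr)):
        s = 0
        for j in range(bounds[r], bounds[r + 1]):
            s += value[j] * vec[column[j]]   # gather + pairwise multiply
        out.append(s)                        # segment reduce
    return out
```

Feeding it the review slide's matrix reproduces the result vector [6, 2, 11].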


SORTING
ODD-EVEN (BRICK) SORT
 It's the parallel version of bubble sort

5 1 4 2 3
1 5 2 4 3
1 2 5 3 4
1 2 3 5 4
1 2 3 4 5

 Steps: O(n) ✓
 Work: O(n²) ✓
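The phases above can be sketched in Python; the inner loop's compare-swaps are independent, so on a GPU each pair gets its own thread (function name illustrative):

```python
def odd_even_sort(arr):
    """Parallel bubble sort: alternate compare-swap over even-indexed
    pairs and odd-indexed pairs; n phases suffice for n elements."""
    a = list(arr)
    n = len(a)
    for phase in range(n):
        start = phase % 2                 # even phase, then odd phase
        for i in range(start, n - 1, 2):  # all pairs run in parallel
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```

O(n) phases of O(n) comparisons each gives the O(n²) work noted above.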
MERGE SORT
 Example: sorting 1M elements bottom-up.

Stage 1 (bottom, 1-element lists up to blocks of 512):
 Tons of small merge tasks
 1 merge to 1 thread
 We can use shared memory
 Also we can use a serial algorithm to sort a small block
of elements

Stage 2 (middle, e.g. 512 + 512 → 1024 … 2048):
 A bunch of medium merge tasks
 1 merge to 1 block

Stage 3 (top, 512K + 512K → 1M):
 We have only 1 merge task over 2 huge sorted lists
 Choose splitters (every 256th element) in each list and
sort them
 Use the merge task of stage 2 to merge the elements
between splitters
MERGE TASK
 Serial algorithm: compare the heads of the two sorted
lists and output the smaller.

A: 0 1 5 12 34    B: 3 4 15 59 102
Out: 0 1 3 4 5 12 15 34 59 102

 Parallel algorithm:
for-each (element i ∈ A), in parallel:
1. Find the position of i in A: its own thread ID
2. Find the position of i in B: binary search, O(log n)
3. Sum the results of 1 and 2 to get the position in the
output
(and symmetrically for the elements of B)
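The rank-based merge can be sketched in Python with the standard library's binary search; using `bisect_left` for one list and `bisect_right` for the other breaks ties so equal keys never collide (function name illustrative):

```python
import bisect

def merge_parallel(a, b):
    """Each element's output position = its index in its own list
    + its rank in the other list, found by binary search."""
    out = [None] * (len(a) + len(b))
    for i, x in enumerate(a):                   # one thread per element
        out[i + bisect.bisect_left(b, x)] = x
    for j, y in enumerate(b):
        out[j + bisect.bisect_right(a, y)] = y  # ties go after a's copy
    return out
```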
MERGE TASK (CONT.)
 How to merge two huge lists in parallel?
 No one block can do all of this work.

List 1: take every nth (e.g. 256th) element as splitters:
A B C D
List 2: take every nth element as splitters: E F G H

 Merge the splitters: E A B F C G D H
 Merge the elements between every two consecutive
splitters, e.g. F and C:
1. Find F's position in list 1
2. Find C's position in list 2
3. Merge the elements between these two positions in each
list.
MERGE SORT ANALYSIS
 log n levels of merging, n work per level:

 Steps: O(log n) ✓
 Work: O(n log n) ✓
RADIX SORT
1. Start with the LSB
2. Split the input into 2 sets based on the current bit
(stable: zeros first, then ones)
3. Move to the next bit toward the MSB, and repeat 2

Input:        0    5    4    7    1    6
Binary:       000  101  100  111  001  110
After bit 0:  000  100  110  101  111  001
After bit 1:  000  100  101  001  110  111
After bit 2:  000  001  100  101  110  111
Output:       0    1    4    5    6    7
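The stable split per bit can be sketched in Python; each pass is the compact described on the next slide, run once with the zeros predicate and once with the ones predicate (function name illustrative):

```python
def radix_sort(arr, n_bits):
    """LSD radix sort: k stable splits (zeros then ones), one per
    bit. Work O(kn), steps O(k)."""
    a = list(arr)
    for bit in range(n_bits):
        zeros = [x for x in a if (x >> bit) & 1 == 0]
        ones = [x for x in a if (x >> bit) & 1 == 1]
        a = zeros + ones          # stable: relative order preserved
    return a
```

The intermediate passes reproduce the columns in the table above.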
RADIX SORT PARALLELIZATION

Input:        0    5    4    7    1    6
Binary:       000  101  100  111  001  110
After bit 0:  000  100  110  101  111  001

What is this algorithm?  Compact
What is the predicate?  (i & 1) == 0
RADIX SORT ANALYSIS
 Work: O(kn), linear
 k: # of bits   n: # of elements
 Steps: O(k)
QUICK SORT
 Choose a pivot element
 Compare all elements with the pivot
 Split into 3 arrays: <p, =p, >p
 Recurse on each array

3 5 2 4 1      P=3
<3: 2 1        =3: 3        >3: 5 4
then 2 1 with P=2:
<2: 1          =2: 2        >2: (empty)
QUICK SORT PARALLELIZATION
 Old GPUs don't support recursion

3 5 2 4 1      P=3
compact: <3    compact: =3    compact: >3
2 1            3              5 4

compact: <2    …

 Current GPUs support recursion
NOTE
 All sort algorithms that we have studied are key-value
sorts, where we usually depend on an integer key to sort.
 However, if you have items with a different data layout
(e.g. a structure with many values):
 Use a key or a pointer to this value to apply the sorting.
RED EYE REMOVAL
 Stencil
 Sort
 Map
