Professional Documents
Culture Documents
Pierre Vignras
Parallel Processing
http://www.vigneras.name/pierre
Dr. Pierre Vignras
This work is licensed under a Creative Commons
Attribution-Share Alike 2.0 France.
See
http://creativecommons.or/licenses/b!-sa/2.0/"r/
"or details
Dr. Pierre Vignras
Text & Reference Books
B
o
o
k
s
Parallel Programming (2
nd
edition)
*arr! 4ilkinson and 2ichael Allen
'earson/'rentice #all
)S*+: 0--3--50673-2
Dr. Pierre Vignras
Announced Quiz
Quiz: 5 %
Assignments: 5%
Project: 15%
Mid: 25 %
Final: 50 %
G
r
a
d
i
n
g
P
o
l
i
c
y
Dr. Pierre Vignras
Outline
(Important points)
Basics
[SI|MI][SD|MD],
Interconnection Networks
Parallel Complexity
SIMD Programming
Gauss Elimination
Dr. Pierre Vignras 7
O
u
t
l
i
n
e
Basics
Flynn's Classification
Interconnection Networks
Since
B
a
s
i
c
s
Dr. Pierre Vignras
!efinitions
Architecture,
Languages,
Algorithm
B
a
s
i
c
s
Dr. Pierre Vignras
"lynn #rc$itecture
C$aracterization
S: Single, M: Multiple
I: Instruction, D: Data
Four possibilities
B
a
s
i
c
s
Dr. Pierre Vignras
%&%! #rc$itecture
Interruption
IO processor
Pipeline
)
,
)
.
)
7
)
6
)F 8F (9 ST:
Flow o" )nstructions
)
-
)
2
)
3
)
5
5-staes pipeline
I
B
a
s
i
c
s
Dr. Pierre Vignras
Pro'lem (it$ )i)elines
Dependencies
Solutions:
Consequences:
Performance
I
B
a
s
i
c
s
Dr. Pierre Vignras
Exercises* Real +ife Exam)les
Intel:
Pentium 4:
AMD
Athlon:
Opteron:
IBM
Power 6:
Cell:
Sun
UltraSparc:
T1:
I
B
a
s
i
c
s
Dr. Pierre Vignras
&nstruction +e,el Parallelism
Reordering of instructions
By the compiler
By the hardware
B
a
s
i
c
s
Dr. Pierre Vignras
Reordering* Exam)le
EXP=A+B+C+(DxExF)+G+H
;
;
;
;
;
<
<
#
&
F
( = * A
C
#
e
i
h
t
>
6
6 Steps re?uired i"
hardware can do - add and
- multiplication simultaneousl!
;
;
;
; ;
<
<
#
&
F
( = * A C
#
e
i
h
t
>
5
5 Steps re?uired i"
hardware can do 2 adds and
- multiplication simultaneousl!
I
B
a
s
i
c
s
Dr. Pierre Vignras
%&-! #rc$itecture
Two versions
B
a
s
i
c
s
Dr. Pierre Vignras
Real +ife Exam)les
AMD: 3dnow
IBM: Altivec
Sun: VSI
Very efficient
B
a
s
i
c
s
Dr. Pierre Vignras
-&-! #rc$itecture
Two forms
Usage:
Need cooperation!
I
B
a
s
i
c
s
Dr. Pierre Vignras
.arning* !efinitions
Multiprocessing:
B
a
s
i
c
s
Dr. Pierre Vignras
%$ared ,s !istri'uted
2emor!
Switch
Shared 2emor!
C'@ C'@ C'@
C'@
2
C'@ 2 C'@ 2 Switch
=istributed 2emor!
I
B
a
s
i
c
s
Dr. Pierre Vignras
-&-!
Coo)eration/Communication
Shared memory
Variable Latency
Distributed memory
B
a
s
i
c
s
Dr. Pierre Vignras
&nterconnection 0et(orks
I
B
a
s
i
c
s
Dr. Pierre Vignras
1ard2 0et(ork -etrics
4 components
B
a
s
i
c
s
Dr. Pierre Vignras
1%oft2 0et(ork -etrics
d =1
r
d.N
d
I
B
a
s
i
c
s
Dr. Pierre Vignras
&nfluence of t$e net(ork
Bus
low cost,
Fully connected
Star
B
a
s
i
c
s
Dr. Pierre Vignras
Parallel Res)onsi'ilities
'arallel Alorithm
Aanuae
'arallel 'roram
'arallel 'rocess
#ardware
#uman writes a
#uman implements usin a
Compiler translates into a
:untime/Aibraries are re?uired in a
#uman % 8peratin S!stem map to the
'arallel (<ecution The hardware is leadin the
2an! actors are involved in the e<ecution
o" a proram Beven se?uentialC.
'arallelism can be introduced
automaticall! b! compilersD
runtime/librariesD 8S and in the hardware
4e usuall! consider
explicit parallelism:
when parallelism is
introduced at the
language level
I
B
a
s
i
c
s
Dr. Pierre Vignras
%&-! Pseudo3Code
SIMD
<indexed variable> = <indexed expression>, (<index range>);
Example:
C[i,j] == A[i,j] + [i,j], (! " i " #$%)
C
j
i
A
j
i
*
j
i
+
=
+
j = N-1
CEiF AEiF *EiF
=
+
j = 1
CEiF AEiF *EiF
=
+
j = 0
CEiF AEiF *EiF
=
...
I
B
a
s
i
c
s
Dr. Pierre Vignras
%&-! -atrix -ulti)lication
General Formula
&or (i = !; i < #; i++) '
(( )*+, *ni-ialisa-ion
C[i,j] = !, (! <= j < #);
&or (. = !; . < #; .++) '
(( )*+, /omp0-a-ion
C[i,j]=C[i,j]+A[i,.]1[.,j], (! <= j < #);
2
2
c
ij
=
k =0
N1
a
ik
b
kj
I
B
a
s
i
c
s
Dr. Pierre Vignras
%&-! C Code 4GCC5
B
a
s
i
c
s
Dr. Pierre Vignras
-&-! %$ared -emory
Pseudo Code
New instructions:
B
a
s
i
c
s
Dr. Pierre Vignras
-&-! %$ared -emory -atrix
-ulti)lication
priva-e i,j,.;
s4ared A[#,#], [#,#], C[#,#], #;
&or (j = !; j < # $ ;; j++) &or. ,>C>?;
(( >nl3 -4e >riginal pro/ess rea/4 -4is poin-
j = #$%;
,>C>?: (( Exe/0-ed b3 # pro/ess &or ea/4 /ol0mn
&or (i = !; i < #; i++) '
C[i,j] = !;
&or (. = !; . < #; .++) '
(( )*+, /omp0-a-ion
C[i,j]=C[i,j]+A[i,.]1[.,j];
2
2
join #; (( @ai- &or -4e o-4er pro/ess
I
B
a
s
i
c
s
Dr. Pierre Vignras
-&-! %$ared -emory C Code
Process based:
fork()/wait()
Thread Based:
Pthread library
B
a
s
i
c
s
Dr. Pierre Vignras
-&-! !istri'uted -emory
Pseudo Code
Many possibilities:
Grouped communications
B
a
s
i
c
s
Dr. Pierre Vignras
-&-! !istri'uted -emory
C Code
Traditional Sockets
Bad Performance
B
a
s
i
c
s
Dr. Pierre Vignras 35
O
u
t
l
i
n
e
Perfor!ance
s = A[!];
&or (i = %; i < #; i++) '
s = s + A[i];
2
GE0F GE-F GE2F GE+--F
;
;
GE3F
;
;
s
...
=ata =ependence &raph
P
e
r
f
o
r
!
a
n
c
e
Dr. Pierre Vignras 37
GE0F GE-F GE2F GE+--F
; ;
GE3F
;
;
Parallel Program
C$aracteristics
GE+-2F
;
;
s
P
e
r
f
o
r
!
a
n
c
e
Dr. Pierre Vignras 3.
Parallel Program
C$aracteristics
etc...
P
e
r
f
o
r
!
a
n
c
e
Dr. Pierre Vignras 3,
P
e
r
f
o
r
!
a
n
c
e
Prefix Pro'lem
GE0F GE-F GE2F GE+--F GE3F
...
#ow to make this
proram parallelI
=epth >+--
Dr. Pierre Vignras 3/
Prefix %olution 6
7))er/+o(er Prefix #lgorit$m
Proof by induction
Example: N=1,048,576=2
20
P
ul
requires N/2 processors available!
Size = 2N-log
2
(N)-2 (N=2
k
)
Depth = 2log
2
(N)-2
Size = 4N-4.96N
0.69
+ 1
Depth = log
2
(N)
Classification:
Execution time
Generally in seconds
Real time
Instruction Count
Mflop/s, Gflop/s
Three parameters:
Capacity,
Latency,
Bandwidth
Dr. Pierre Vignras 5.
Parallelism and &nteraction
O,er$ead
T
comp
: time of the effective computation
T
par
: parallel overhead
T
inter
: interaction overhead
t(m) = t
0
+ m/r
t
0
: start-up time
: asymptotic bandwidth
m
t
t
0
)deal SituationJ
Topolo! metrics
+etwork Contention
Dr. Pierre Vignras 62
Collecti,e Communication
( ) n
Se?uentialI
Topolo! metrics
+etwork Contention
Dr. Pierre Vignras
%)eedu)
4Pro'lem3oriented5
For a given algorithm, let T
p
be time to perform
the computation using p processors
Speedup: S
p
=T
1
/T
p
Issue: what is T
1
?
Conditions
No overhead at all
Clearly biased!
Superlinear speedup
Efficiency: E
p
=S
p
/p (expressed in %)
Bounded by 100%
W = W + (1-)W
percent of W must be executed sequentially , and the
remaining 1- can be executed by n nodes simultaneously with
100% of efficiency
T
s
(W)=T
s
(W+(1-)W)=T
s
(W)+T
s
((1-)W)=T
s
(W)+(1-)T
s
(W)
T
p
(W)=T
p
(W+(1-)W)=T
s
(W)+T
p
((1-)W)=T
s
(W)+(1-)T
s
(W)/p
S
p
=
T
s
T
s
1T
s
p
=
p
1 p1
1
p
Dr. Pierre Vignras 70
#md$al9s +a(* fixed )ro'lem
size
...
serial section
paralleliHable sections
MT
s
B--MCT
s
T
s
...
B--MCT
s
/p
T
p
'
p
r
o
c
e
s
s
o
r
s
Dr. Pierre Vignras 7-
#md$al9s +a( Conse8uences
M>20N
M>50N
M>-0N
M>6N
SBpC
SBMC
p>-00
p>-0
Dr. Pierre Vignras 72
1 T
s
p
T
o
=
p
1 p1
pT
o
T
s
S
p
T
o
T
s
p
Dr. Pierre Vignras 73
T
s
(W)=T
s
(W)+(1-)T
s
(W)
T
p
(W)=T
s
(W)+(1-)T
s
(W)/p
T
s
(W')= T
s
(W)+p(1-)T
s
(W)
T
o
: "unction o" the number o" processors
Gauss %li!ination
Dr. Pierre Vignras 7/
-atrix >oca'ulary
k =0
l 1
a
i , k
b
k , j
0in , 0 jm
Complexity:
n.m initialisations
n
2
/p initialisations BpR>nC
n
3
/p additions and multiplications BpR>nC
!"er#ead: 2emor! inde<inJ
)nitialisation in 8B-C
Computation in 8BnC
(lgorit#m in !$lg$n'' )it# p=n
+
Aowest bound
Dr. Pierre Vignras .3
Gauss Elimination
a
n1,0
a
n1,1
a
n1,2
... a
n1, n1
b
n1
a
n2,0
a
n2,1
a
n2,2
... a
n2, n1
b
n2
... ... ... ... ... ...
a
1,0
a
1,1
a
1,2
... a
1, n1
b
1
a
0,0
a
0,1
a
0,2
... a
0, n1
b
0
:epresented b!
Dr. Pierre Vignras .5
Gauss Elimination Princi)le
Image Processing
Molecular Dynamics
Semi-Conductor Design
VLSI-Design
...
Dr. Pierre Vignras ..
O
u
t
l
i
n
e
&I&# S& Progra!!ing
Pre- ) self-sc$eduling
Co!!on Issues
Open&P
Dr. Pierre Vignras .,
Process, T$reads, "i'ers
Fairness?
Advantages:
Disadvantages:
priva-e i,j,.;
s4ared A[#,#], [#,#], C[#,#], #;
&or (i = !; i < # $ ;; i++) &or. ,>?*#E;
(( >nl3 -4e >riginal pro/ess rea/4 -4is poin-
i = #$%; (( Beall3 BeC0iredD
,>?*#E: (( Exe/0-ed b3 # pro/ess &or ea/4 line
&or (j = !; j < #; j++) '
(( )*+, /omp0-a-ion
C[i,j]=A[i,j]+[i,j];
2
2
join #; (( @ai- &or -4e o-4er pro/ess
Complexity: !$n
*
.p'
,% S,-D is sed: !$n
*
.$p.m''
m:nm/er o% S,-D P0s
,sse: i% n 1 p2 self-scheduling.
3#e !S de&ides )#o )ill do )#at. $Dynamic'
,n t#is &ase2 overhead
Dr. Pierre Vignras ,6
-&-! %- -atrix #ddition
Pre3%c$eduling
Challenge
Given
Ten-processor multiprocessor
Goal
-0
/
-0
-0
2U-0
/
-
'
0
'
-
'
/
Code for thread i:
void thread(int i) {
for (j = i*10
9
+1, j<(i+1)*10
9
; j++) {
if (isPrime(j)) print(j);
Thread workloads
Uneven
Hard to predict
8
L
"
o
r
u
n
i
p
r
o
c
e
s
s
o
r
D
n
o
t
"
o
r
m
u
l
t
i
p
r
o
c
e
s
s
o
r
Dr. Pierre Vignras
Counter &m)lementation
// Global (shared) variable
stati! "on# va"$e = 1;
"on# in!() {
ret$rn va"$e++;
temp = value;
value = value + 1;
return temp;
Dr. Pierre Vignras
7$3O$
time
Value
read -
read -
write 2
read 2
write 3
write 2
2 3 2
Dr. Pierre Vignras
C$allenge
// Global (shared) variable
stati! "on# va"$e = 1;
"on# in!() {
temp = va"$e;
va"$e = temp + 1;
ret$rn temp;
Critical section
Dr. Pierre Vignras
Correctness
Mutual Exclusion
No Lockout (lockout-freedom)
Coordination
Cancellation (interruption)
Asynchronous (immediate)
Deferred
Preprocessor directives:
#pragma omp <directive> [arguments]
Dr. Pierre Vignras -0-
O)en-P
Dr. Pierre Vignras -02
O)en-P
T$read Creation
No barrier on entry
Work-Sharing Constructs
Sections
Loop
Single
Workshare (Fortran)
Dr. Pierre Vignras -05
O)en-P
.ork %$aring 3 %ection
double e, pi, factorial, product;
int i;
e = 1; // Compute e using Taylor expansion
factorial = 1;
for (i = 1; i<num_steps; i++) {
factorial *= i;
e += 1.0/factorial;
} // e done
pi = 0; // Compute pi/4 using Taylor expansion
for (i = 0; i < num_steps*10; i++) {
pi += 1.0/(i*4.0 + 1.0);
pi -= 1.0/(i*4.0 + 3.0);
}
pi = pi * 4.0; // pi done
return e * pi;
)ndependent Aoops
)deal CaseJ
Dr. Pierre Vignras -06
double e, pi, factorial, product; int i;
#pragma omp parallel sections shared(e, pi) {
#pragma omp section {
e = 1; factorial = 1;
for (i = 1; i<num_steps; i++) {
factorial *= i; e += 1.0/factorial;
}
} /* e section */
#pragma omp section {
pi = 0;
for (i = 0; i < num_steps*10; i++) {
pi += 1.0/(i*4.0 + 1.0);
pi -= 1.0/(i*4.0 + 3.0);
}
pi = pi * 4.0;
} /* pi section */
} /* omp sections */
return e * pi;
)ndependent
sections are
declared
'rivate Copies
(<cept "or
shared variables
)mplicit *arrier
on e<it
O)en-P
.ork %$aring 3 %ection
Dr. Pierre Vignras -07
Talyor Exam)le #nalysis
System: Gentoo/Linux-2.6.18
reduction(op: list)
e.g: reduction(*: result)
Master directive:
#pragma omp master
{...}
Critical directive:
#pragma omp critical [name]
{...}
Critical directive:
#pragma omp barrier
{...}
Restrictions
If (i < n)
#pragma omp barrier
// Syntax error!
Dr. Pierre Vignras --6
O)en-P
%ync$ronisation 33 #tomic
Critical directive:
#pragma omp atomic
expr-stmt
Critical directive:
#pragma omp flush [(list)]
Implied flush:
...
Process Creation
Send.Recei/e
&PI
Dr. Pierre Vignras --,
-essage Passing %olutions
ProActive, JavaParty
Using a library
Process creation
Message Passing
Blocking or Unblocking
Grouped Communication
Dr. Pierre Vignras -20
Process Creation
Multiple Program, Multiple Data
Model
(MPMD)
Source
File
(<ecutable
File
'rocessor 0
Source
File
(<ecutable
File
'rocessor p-1
Compile to
suit processor
Dr. Pierre Vignras -2-
Process Creation
Single Program, Multiple Data Model
(SPMD)
Source
File
(<ecutable
File
'rocessor 0
(<ecutable
File
'rocessor p-1
Compile to
suit processor
Dr. Pierre Vignras -22
%P-! Code %am)le
int pid = getPID();
if (pid == MASTER_PID) {
execute_master_code();
}else{
execute_slave_code();
}
2aster/Slave Architecture
Compilation "or
heteroeneous
processors still re?uired
Dr. Pierre Vignras -23
!ynamic Process Creation
:
:
:
spawnBC
:
:
:
:
'rocess -
:
:
:
:
:
:
:
'rocess 2
Primiti"e re5ired:
spawn(location, executable)
Dr. Pierre Vignras -25
%end/Recei,e* Basic
Basic primitives:
send(dst, msg);
receive(src, msg);
Questions:
Partitionning
Dr. Pierre Vignras -33
Em'arrassingly Parallel
Com)utations
Ideal situation!
Master/Slave architecture
2 Partitioning Possibilities
Em'arrassingly Parallel
Com)utations Ty)ical Exam)le*
&mage Processing
750
5
,
0
750
5
,
0
'rocess 'rocess
Dr. Pierre Vignras -36
&mage %$ift Pseudo3Code
Master
h=Height/slaves;
for(i=0,row=0;i<slaves;i++,row+=h)
send(row, P
i
);
for(i=0; i<Height*Width;i++){
recv(oldrow, oldcol,
newrow, newcol, P
ANY
);
map[newrow,newcol]=map[oldrow,oldcol];
}
recv(row, P
master
)
for(oldrow=row;oldrow<(row+h);oldrow++)
for(oldcol=0;oldcol<Width;oldcol++) {
newrow=oldrow+delta_x;
newcol=oldcol+delta_y;
send(oldrow,oldcol,newrow,newcol,P
master
);
}
Slave
Dr. Pierre Vignras -37
Hypothesis:
2 computational steps/pixel
n*n pixels
p processes
Sequential: T
s
=2n
2
=O(n
2
)
Parallel: T
//
=O(p+n
2
)+O(n
2
/p)=O(n
2
) (p fixed)
Communication: T
comm
=T
startup
+mT
data
=p(T
startup
+1*T
data
)+n
2
(T
startup
+4T
data
)=O(p+n
2
)
Computation: T
comp
=2(n
2
/p)=O(n
2
/p)
&mage %$ift Pseudo3Code
Com)lexity
Dr. Pierre Vignras -3.
Speedup
Computation/Communication:
Broadcasting is better
2n
2
/ p
pT
startup
T
data
n
2
T
startup
'T
data
=O
n
2
/ p
pn
2
if for (k == n), |z
k
|<2 we assume the sequence
converge we plot c in a black color.
z
0
=0
z
k1
=z
k
c
Dr. Pierre Vignras -3/
-andel'rot %et* se8uential
code
int calculate(double cr, double ci) {
double bx = cr, by = ci;
double xsq, ysq;
int cnt = 0;
while (true) {
xsq = bx * bx;
ysq = by * by;
if (xsq + ysq >= 4.0) break;
by = (2 * bx * by) + ci;
bx = xsq - ysq + cr;
cnt++;
if (cnt >= max_iterations) break;
}
return cnt;
}
Computes the number o" iterations
"or the iven comple< point BcrD ciC
Dr. Pierre Vignras -50
-andel'rot %et* se8uential
code
y = startDoubleY;
for (int j = 0; j < height; j++) {
double x = startDoubleX;
int offset = j * width;
for (int i = 0; i < width; i++) {
int iter = calculate(x, y);
pixels[i + offset] = colors[iter];
x += dx;
}
y -= dy;
}
Dr. Pierre Vignras -5-
-andel'rot %et*
)arallelization
Assign n
2
/p pixels per processor in a round robin or
random manner
Left as an exercice
Dr. Pierre Vignras -52
-andel'rot set* load
'alancing
Problems:
Task == pixel
Task == row
T
comp
n
2
p
Max
itr
T
s
T
p
=
Max
itr
. n
2
Max
itr
. n
2
p
n
2
t
data
2np1t
startup
t
data
T
comp
T
comm
=
Max
itr
. n
2
p
n
2
t
data
2np1t
startup
t
data
Steps:
Two ways
Data partitioning
Functional partitioning
We want to compute:
a
b
! x dx
Dr. Pierre Vignras -5/
O
u
t
l
i
n
e
0ui""
Dr. Pierre Vignras -60
Readings
Fork/Join
Cobegin
Forall
Aggregate