An introduction to OpenMP* programming
Tim Mattson
Intel Corp.
timothy.g.mattson@intel.com
1
* The name “OpenMP” is the property of the OpenMP Architecture Review Board. 1
Introduction
Figure from “An Introduction to Concurrency in Programming Languages” by Matthew J. Sottile, Timothy G. Mattson, and Craig E. Rasmussen, 2010
Concurrency vs. Parallelism
• Two important definitions:
  – Concurrency: A condition of a system in which multiple tasks are logically active at one time.
  – Parallelism: A condition of a system in which multiple tasks are actually active at one time.
[Figure: parallel programs are a subset of concurrent programs.]
Figure from “An Introduction to Concurrency in Programming Languages” by Matthew J. Sottile, Timothy G. Mattson, and Craig E. Rasmussen, 2010
Parallel computing: It’s old
[Timeline figure: vector computers in the late 70’s; SMP computers, the Cosmic Cube (1983), and early clusters in the late 80’s; the Paragon (1993), Linux PC clusters (~1995), ASCI Red (1997), and cluster computers in the late 90’s.]
5
The Parallel Programming Problem
And some really beautiful architectural dead-ends …
Very few people wrote parallel software.
[Pie chart (2017): an 86% / 14% split; the footnote below suggests a breakdown of the top500 list by architecture class.]
*Constellation: A cluster for which the number of processors on a node is greater than the number of nodes in the cluster. I’ve never seen anyone use this term outside of the top500 list.
Parallel computing: 2015-2020
source: https://www.indiamart.com/nanohub/computer-servers.html
• Why OpenMP?
– It’s the easiest to use and requires the minimum learning overhead
– Most key parallel design patterns can be learned with OpenMP.
– You all have processors that can use OpenMP (multicore CPUs)
Tim Mattson
Senior Principal Engineer
Intel Labs
timothy.g.mattson@intel.com
11
* The name “OpenMP” is the property of the OpenMP Architecture Review Board. 11
Preliminaries: Part 1
• Disclosures
– The views expressed in this tutorial are those of the
people delivering the tutorial.
– We are not speaking for our employers.
– We are not speaking for the OpenMP ARB
• We take these tutorials VERY seriously:
– Help us improve … tell us how you would make this
tutorial better.
12
Preliminaries: Part 2
13
Outline
• Introduction to OpenMP
• Creating Threads
• Quantifying Performance and Amdahl’s law
• Synchronization
• Parallel Loops
• Data environment
• Memory model
• Irregular Parallelism and tasks
• Recap and the fundamental design patterns of parallel
programming
14
OpenMP* overview:
[Figure, the OpenMP solution stack: the end user runs an application; the application reaches OpenMP through compiler directives, the OpenMP library, and environment variables, layered on the compiler and runtime.]
16
OpenMP core syntax
• Most of the constructs in OpenMP are compiler directives.
#pragma omp construct [clause [clause]…]
– Example
#pragma omp parallel private(x)
17
Exercise 1, Part A: Hello world
Verify that your environment works
• Write a program that prints “hello world”.

#include <stdio.h>
int main()
{
   printf(" hello world \n");
   return 0;
}
18
Exercise 1, Part B: Hello world
Verify that your OpenMP environment works
• Write a multithreaded program that prints “hello world”.
19
Exercise 1: Solution
A multi-threaded “Hello world” program
• Write a multithreaded program where each thread prints “hello world”.

#include <omp.h>             // OpenMP include file
#include <stdio.h>
int main()
{
   #pragma omp parallel      // parallel region with default number of threads
   {
      printf(" hello ");
      printf(" world \n");
   }                         // end of the parallel region
}

Sample Output (the threads’ output interleaves):
hello hello world
world
hello hello world
world
20
Outline
• Introduction to OpenMP
• Creating Threads
• Quantifying Performance and Amdahl’s law
• Synchronization
• Parallel Loops
• Data environment
• Memory model
• Irregular Parallelism and tasks
• Recap and the fundamental design patterns of parallel
programming
21
OpenMP programming model:
Fork-Join Parallelism:
• Master thread spawns a team of threads as needed.
• Parallelism is added incrementally until performance goals are met, i.e., the sequential program evolves into a parallel program.
[Figure: the master thread runs the sequential parts; at each parallel region (one of them nested) it forks a team of threads, and the team joins back to the master thread at the end of the region.]
22
Thread creation: Parallel regions
• Each thread executes the same code redundantly.

double A[1000];
omp_set_num_threads(4);
#pragma omp parallel
{
   int ID = omp_get_thread_num();
   pooh(ID, A);
}
printf("all done\n");

A single copy of A is shared between all threads: each thread calls pooh(ID, A) with its own ID, i.e., pooh(0,A), pooh(1,A), pooh(2,A), pooh(3,A).
Threads wait at the end of the parallel region for all threads to finish before proceeding (i.e., a barrier); only then is printf("all done\n") executed.
* The name “OpenMP” is the property of the OpenMP Architecture Review Board
24
Thread creation: How many threads did
you actually get?
• You create a team of threads in OpenMP* with the parallel construct.
• You can request a number of threads with omp_set_num_threads()
• But is the number of threads requested the number you actually get?
– NO! An implementation can silently decide to give you a team with fewer threads.
– Once a team of threads is established … the system will not reduce the size of the team.
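A minimal sketch of checking what you actually got (the request of 4 threads is just an illustration):

#include <omp.h>
#include <stdio.h>
int main(void)
{
   omp_set_num_threads(4);            // request 4 threads ...
   #pragma omp parallel
   {
      #pragma omp single              // ... and have one thread report the team size actually created
      printf("requested 4, got %d threads\n", omp_get_num_threads());
   }
   return 0;
}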
The pi programs in the following exercises rest on numerical integration. Mathematically,

$$\int_0^1 \frac{4.0}{1+x^2}\,dx = \pi,$$

and the integral can be approximated as a sum of rectangles, each of width Δx and height F(x_i) at the middle of interval i:

$$\sum_{i} F(x_i)\,\Delta x \approx \pi$$

26
Serial PI program
See OMP_exercises/pi.c 27
Serial PI program
#include <omp.h>
#include <stdio.h>
static long num_steps = 100000;
double step;
int main ()
{  int i; double x, pi, sum = 0.0, tdata;
   step = 1.0/(double) num_steps;   tdata = omp_get_wtime();
   for (i=0; i<num_steps; i++){ x = (i+0.5)*step; sum += 4.0/(1.0+x*x); }
   pi = step * sum;   tdata = omp_get_wtime() - tdata;
   printf(" pi = %f computed in %f secs\n", pi, tdata);
}
29
Exercise 2 (hints)
• Use a parallel construct:
#pragma omp parallel.
• The challenge is to:
– divide loop iterations between threads (use the thread ID and the
number of threads).
– Create an accumulator for each thread to hold partial sums that you
can later combine to generate the global sum.
• In addition to a parallel construct, you will need the runtime
library routines
– void omp_set_num_threads(int);
– int omp_get_num_threads();
– int omp_get_thread_num();
– double omp_get_wtime();
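A skeleton of the approach these hints describe is sketched below; the count[] accumulator and the 64-thread bound are illustrative stand-ins for whatever partial sums your pi program needs.

#include <omp.h>
#include <stdio.h>
#define N 1000
int main(void)
{
   long count[64] = {0};                     // one accumulator per thread (assumes at most 64 threads)
   #pragma omp parallel
   {
      int id     = omp_get_thread_num();
      int nthrds = omp_get_num_threads();
      for (int i = id; i < N; i += nthrds)   // cyclic distribution: thread id handles i = id, id+nthrds, ...
         count[id]++;                        // stand-in for the real per-iteration work
   }
   long total = 0;
   for (int t = 0; t < 64; t++) total += count[t];   // combine the per-thread results
   printf("processed %ld of %d iterations\n", total, N);
   return 0;
}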
30
Results*
• Original serial pi program with 100000000 steps ran in 1.83 seconds.

threads    1st SPMD (secs)
   1         1.86
   2         1.03
   3         1.08
   4         0.97

*Intel compiler (icpc) with no optimization on Apple OS X 10.7.3 with a dual core (four HW thread) Intel® Core™ i5 processor at 1.7 GHz and 4 GByte DDR3 memory at 1.333 GHz.
31
Outline
• Introduction to OpenMP
• Creating Threads
• Quantifying Performance and Amdahl’s law
• Synchronization
• Parallel Loops
• Data environment
• Memory model
• Irregular Parallelism and tasks
• Recap and the fundamental design patterns of parallel
programming
32
Consider performance of parallel programs
We measure our success as parallel programmers by how close we come to ideal linear speedup. A good parallel programmer always figures out when you fall off the linear speedup curve and why that has occurred.
35
Amdahl’s Law
• What is the maximum speedup you can expect from a parallel program?
• Approximate the runtime as a part that can be sped up with additional processors and a part that is fundamentally serial.

$$Time_{par}(P) = \left( serial\_fraction + \frac{parallel\_fraction}{P} \right) \cdot Time_{seq}$$

• If serial_fraction is α and parallel_fraction is (1 − α), then the speedup is:

$$S(P) = \frac{Time_{seq}}{Time_{par}(P)} = \frac{Time_{seq}}{\left( \alpha + \frac{1-\alpha}{P} \right) Time_{seq}} = \frac{1}{\alpha + \frac{1-\alpha}{P}}$$
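For example, with a serial fraction of α = 0.1 on P = 4 processors,

$$S(4) = \frac{1}{0.1 + \frac{0.9}{4}} = \frac{1}{0.325} \approx 3.1, \qquad \lim_{P\to\infty} S(P) = \frac{1}{\alpha} = 10,$$

so no matter how many processors you add, a 10% serial fraction caps the speedup at 10.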
– int omp_get_num_threads();
– int omp_get_thread_num();
– double omp_get_wtime();
– void omp_set_num_threads(int);
– export OMP_NUM_THREADS=N (an environment variable to request N threads)
38
Results*
• Original serial pi program with 100000000 steps ran in 1.83 seconds.

threads    1st SPMD (secs)
   1         1.86
   2         1.03
   3         1.08
   4         0.97

*Intel compiler (icpc) with no optimization on Apple OS X 10.7.3 with a dual core (four HW thread) Intel® Core™ i5 processor at 1.7 GHz and 4 GByte DDR3 memory at 1.333 GHz.
Why such poor scaling? False sharing
• If independent data elements happen to sit on the same cache line, each
update will cause the cache lines to “slosh back and forth” between threads
… This is called “false sharing”.
[Figure: two cores’ L1 cache lines holding the same data; each update forces the line to move back and forth between the caches.]
*Intel compiler (icpc) with no optimization on Apple OS X 10.7.3 with a dual core (four HW
thread) Intel® CoreTM i5 processor at 1.7 Ghz and 4 Gbyte DDR3 memory at 1.333 Ghz.
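One common remedy, not shown on this slide, is to pad the per-thread accumulator array so each thread’s element lands on its own cache line; a minimal sketch, assuming 64-byte L1 cache lines:

#define PAD 8                        // 8 doubles = 64 bytes, the assumed L1 cache-line size
#define NUM_THREADS 4
double sum[NUM_THREADS][PAD];        // each thread only touches sum[id][0]; the padding keeps
                                     // different threads’ accumulators on different cache lines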
42
Outline
• Introduction to OpenMP
• Creating Threads
• Quantifying Performance and Amdahl’s law
• Synchronization
• Parallel Loops
• Data environment
• Memory model
• Irregular Parallelism and tasks
• Recap and the fundamental design patterns of parallel
programming
43
Synchronization
44
Synchronization: critical
• Mutual exclusion: Only one thread at a time can enter a critical region.

float res;
#pragma omp parallel
{
   float B;  int i, id, nthrds;
   id = omp_get_thread_num();
   nthrds = omp_get_num_threads();
   for(i=id; i<niters; i+=nthrds){
      B = big_job(i);
      #pragma omp critical
         res += consume(B);
   }
}

Threads wait their turn – only one at a time calls consume().
45
Synchronization: barrier
• Barrier: a point in a program all threads must reach before any threads are allowed to proceed.

double Arr[8], Brr[8];  int numthrds;
omp_set_num_threads(8);
#pragma omp parallel
{
   int id, nthrds;
   id = omp_get_thread_num();
   nthrds = omp_get_num_threads();
   if (id==0) numthrds = nthrds;       // one thread saves a copy of nthrds into the global variable numthrds
   Arr[id] = big_ugly_calc(id, nthrds);
   #pragma omp barrier                 // threads wait until all threads hit the barrier; then they can go on
   Brr[id] = really_big_and_ugly(id, nthrds, Arr);
}
46
Exercise 5
47
Pi program with false sharing*
• Original serial pi program with 100000000 steps ran in 1.83 seconds.

threads    1st SPMD (secs)
   1         1.86
   2         1.03
   3         1.08
   4         0.97

*Intel compiler (icpc) with no optimization on Apple OS X 10.7.3 with a dual core (four HW thread) Intel® Core™ i5 processor at 1.7 GHz and 4 GByte DDR3 memory at 1.333 GHz.
48
Example: Using a critical section to remove impact of false sharing

#include <omp.h>
static long num_steps = 100000;   double step;
#define NUM_THREADS 2
void main ()
{  int nthreads;  double pi=0.0;
   step = 1.0/(double) num_steps;
   omp_set_num_threads(NUM_THREADS);
   #pragma omp parallel
   {
      int i, id, nthrds;
      double x, sum;                    // a scalar local to each thread to accumulate partial sums
      id = omp_get_thread_num();        // (no array, so no false sharing)
      nthrds = omp_get_num_threads();
      if (id == 0) nthreads = nthrds;
      for (i=id, sum=0.0; i<num_steps; i=i+nthrds){
         x = (i+0.5)*step;
         sum += 4.0/(1.0+x*x);
      }
      #pragma omp critical              // sum goes “out of scope” beyond the parallel region, so it must be
         pi += sum * step;              // added to pi here; the critical region keeps the updates from conflicting
   }
}
49
Results*: pi program critical section
• Original Serial pi program with 100000000 steps ran in 1.83 seconds.
*Intel compiler (icpc) with no optimization on Apple OS X 10.7.3 with a dual core (four HW
thread) Intel® CoreTM i5 processor at 1.7 Ghz and 4 Gbyte DDR3 memory at 1.333 Ghz.
50
Example: Using a critical section to remove impact of false sharing

#include <omp.h>
static long num_steps = 100000;   double step;
#define NUM_THREADS 2
void main ()
{  int nthreads;  double pi=0.0;
   step = 1.0/(double) num_steps;
   omp_set_num_threads(NUM_THREADS);
   #pragma omp parallel
   {
      int i, id, nthrds;  double x;
      id = omp_get_thread_num();
      nthrds = omp_get_num_threads();
      if (id == 0) nthreads = nthrds;
      for (i=id; i<num_steps; i=i+nthrds){
         x = (i+0.5)*step;
         #pragma omp critical           // Be careful where you put a critical section.
            pi += 4.0/(1.0+x*x);        // What would happen if you put the critical section
      }                                 // inside the loop, as is done here?
   }
   pi *= step;
}
51
Outline
• Introduction to OpenMP
• Creating Threads
• Quantifying Performance and Amdahl’s law
• Synchronization
• Parallel Loops
• Data environment
• Memory model
• Irregular Parallelism and tasks
• Recap and the fundamental design patterns of parallel
programming
52
The loop worksharing constructs
53
Loop worksharing constructs
A motivating example

// OpenMP parallel region and a worksharing for construct:
#pragma omp parallel
#pragma omp for
for(i=0; i<N; i++) { a[i] = a[i] + b[i]; }
54
Loop worksharing constructs:
The schedule clause
• The schedule clause affects how loop iterations are mapped onto threads.
  – schedule(static [,chunk]): deal out blocks of iterations of size “chunk” to each thread.
  – schedule(dynamic [,chunk]): each thread grabs “chunk” iterations off a queue until all iterations have been handled.
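For example, a sketch with an irregular per-iteration cost (big_irregular_job(), result[], and N are placeholders, not code from this tutorial):

#pragma omp parallel for schedule(dynamic, 4)   // hand out iterations in chunks of 4 from a queue,
for (int i = 0; i < N; i++)                     // so threads that finish early pick up more work
   result[i] = big_irregular_job(i);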
56
Working with loops
• Basic approach
  – Find compute-intensive loops
  – Make the loop iterations independent, so they can safely execute in any order without loop-carried dependencies (see the sketch below)
  – Place the appropriate OpenMP directive and test
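A sketch of the second step, removing a loop-carried dependency so the iterations become independent (big(), A[], and MAX are placeholders):

// Serial version: each iteration depends on the j produced by the previous one.
int j = 5;
for (int i = 0; i < MAX; i++) {
   j += 2;
   A[i] = big(j);
}

// Reworked version: compute j from i, so iterations can safely run in any order.
#pragma omp parallel for
for (int i = 0; i < MAX; i++) {
   int j = 5 + 2*(i + 1);
   A[i] = big(j);
}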
58
Reduction
• OpenMP reduction clause:
reduction (op : list)
• Inside a parallel or a work-sharing construct:
– A local copy of each list variable is made and initialized depending
on the “op” (e.g. 0 for “+”).
– Updates occur on the local copy.
– Local copies are reduced into a single value and combined with
the original global value.
• The variables in “list” must be shared in the enclosing
parallel region.
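For example, a minimal sketch that sums an array (a[] and N are placeholders):

double total = 0.0;
#pragma omp parallel for reduction(+:total)
for (int i = 0; i < N; i++)
   total += a[i];        // each thread accumulates into its own private copy of total; the private
                         // copies are combined with “+” into the shared total at the end of the loop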
61
Example: Pi with a loop and a reduction

#include <omp.h>
static long num_steps = 100000;   double step;
void main ()
{  int i;  double x, pi, sum = 0.0;
   step = 1.0/(double) num_steps;
   #pragma omp parallel              // create a team of threads … without a parallel construct,
   {                                 // you’ll never have more than one thread
      double x;                      // a scalar local to each thread to hold the value of x
                                     // at the center of each interval
      #pragma omp for reduction(+:sum)
      for (i=0; i<num_steps; i++){
         x = (i+0.5)*step;
         sum = sum + 4.0/(1.0+x*x);
      }
   }
   pi = step * sum;
}
*Intel compiler (icpc) with no optimization on Apple OS X 10.7.3 with a dual core (four HW thread) Intel® Core™ i5 processor at 1.7 GHz and 4 GByte DDR3 memory at 1.333 GHz.
63
Barriers, for loops and the nowait clause
• Barrier: Each thread waits until all threads arrive.
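A minimal sketch of the nowait clause on a worksharing loop (big_calc1(), big_calc2(), A, B, and N are placeholders):

#pragma omp parallel
{
   #pragma omp for nowait             // no implied barrier at the end of this loop, so threads that
   for (int i = 0; i < N; i++)        // finish early move straight on to the next loop
      A[i] = big_calc1(i);

   #pragma omp for                    // this loop still ends with the usual implied barrier
   for (int i = 0; i < N; i++)
      B[i] = big_calc2(i);
}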
Outline
• Introduction to OpenMP
• Creating Threads
• Quantifying Performance and Amdahl’s law
• Synchronization
• Parallel Loops
• Data environment
• Memory model
• Irregular Parallelism and tasks
• Recap and the fundamental design patterns of parallel
programming
65
Data environment:
Default storage attributes
66
Data sharing: Examples

void wrong() {
   int tmp = 0;
   #pragma omp parallel for private(tmp)
   for (int j = 0; j < 1000; ++j)
      tmp += j;               // the private tmp was not initialized
   printf("%d\n", tmp);       // tmp is 0 here (the original tmp, untouched by the loop)
}
69
Data sharing: Private clause
When is the original variable valid?
• The original variable’s value is unspecified if it is referenced outside of the construct.
  – Implementations may reference the original variable or a copy ….. a dangerous programming practice!
  – For example, consider what would happen if the compiler inlined work()?

int tmp;
void danger() {
   tmp = 0;
   #pragma omp parallel private(tmp)
      work();
   printf("%d\n", tmp);       // tmp has an unspecified value here
}

extern int tmp;
void work() {
   tmp = 5;                   // unspecified which copy of tmp this writes
}
70
Firstprivate clause
• Variables are initialized from the shared variable: each thread gets its own copy of incr, initialized with the value (0) that incr had when the construct was encountered.

incr = 0;
#pragma omp parallel for firstprivate(incr)
for (i = 0; i <= MAX; i++) {
   if ((i%2)==0) incr++;
   A[i] = incr;
}
71
Data sharing:
A data environment test
• Consider this example of PRIVATE and FIRSTPRIVATE
variables: A = 1,B = 1, C = 1
#pragma omp parallel private(B) firstprivate(C)
• Are A,B,C private to each thread or shared inside the parallel region?
• What are their initial values inside and values after the parallel region?
73
Exercise 7: Mandelbrot set area
74
Outline
• Introduction to OpenMP
• Creating Threads
• Quantifying Performance and Amdahl’s law
• Synchronization
• Parallel Loops
• Data environment
• Memory model
• Irregular Parallelism and tasks
• Recap and the fundamental design patterns of parallel
programming
75
OpenMP memory model
[Figure: N processors (proc1 … procN), each with its own cache (cache1 … cacheN), attached to a single shared memory; a variable “a” lives in shared memory and may also have a copy in one or more caches.]
76
OpenMP and relaxed consistency
• OpenMP supports a relaxed-consistency
shared memory model
– Threads can maintain a temporary view of shared memory
that is not consistent with that of other threads
– These temporary views are made consistent only at certain
points in the program
– The operation that enforces consistency is called the flush operation
77
Flush operation
78
Synchronization: flush example
• Flush forces data to be updated in memory so other threads see the most recent value
double A;
A = compute();
#pragma omp flush(A)
// flush to memory to make sure other
// threads can pick up the right value
79
What is the BIG DEAL with flush?
• Compilers routinely reorder instructions implementing a
program
– Can better exploit the functional units, keep the machine busy, hide
memory latencies, etc.
• Compiler generally cannot move instructions:
– Past a barrier
– Past a flush on all variables
• But it can move them past a flush with a list of variables so
long as those variables are not accessed
• Keeping track of consistency when flushes are used can be
confusing … especially if “flush(list)” is used.
WARNING:
If you find yourself wanting to write code with explicit flushes, stop and get help. It is very difficult to manage flushes on your own; even experts often get them wrong.
81
Example: prod_cons.c
• Parallelize a producer/consumer program
  – One thread produces values that another thread consumes.
  – Often used with a stream of produced values to implement “pipeline parallelism”.
  – The key is to implement pairwise synchronization between threads.

int main()
{
   double *A, sum, runtime;   int flag = 0;
   A = (double *) malloc(N*sizeof(double));

   runtime = omp_get_wtime();
   fill_rand(N, A);                 // Producer: fill an array of data
   sum = Sum_array(N, A);           // Consumer: sum the array
   runtime = omp_get_wtime() - runtime;

   printf(" In %lf secs, The sum is %lf \n", runtime, sum);
}
82
Pairwise synchronization in OpenMP
• OpenMP lacks synchronization constructs that work between
pairs of threads.
• When needed, you have to build it yourself.
• Pairwise synchronization
– Use a shared flag variable
– Reader spins waiting for the new flag value
– Use flushes to force updates to and from memory
83
Exercise: Producer/consumer

int main()
{
   double *A, sum, runtime;   int numthreads, flag = 0;
   A = (double *)malloc(N*sizeof(double));
   #pragma omp parallel sections
   {
      #pragma omp section
      {
         fill_rand(N, A);
         flag = 1;
      }
      #pragma omp section
      {
         while (flag == 0){
         }
         sum = Sum_array(N, A);
      }
   }
}

Put the flushes in the right places to make this program race-free. Do you need any other synchronization constructs to make this work?
84
Solution (try 1): Producer/consumer

int main()
{
   double *A, sum, runtime;   int numthreads, flag = 0;
   A = (double *)malloc(N*sizeof(double));
   #pragma omp parallel sections
   {
      #pragma omp section
      {
         fill_rand(N, A);
         #pragma omp flush
         flag = 1;                       // use flag to signal when the “produced” value is ready
         #pragma omp flush (flag)        // flush forces a refresh to memory; it guarantees that
      }                                  // the other thread sees the new value of A
      #pragma omp section
      {
         #pragma omp flush (flag)        // flush needed on both “reader” and “writer” sides of the communication
         while (flag == 0){
            #pragma omp flush (flag)     // notice you must put the flush inside the while loop
         }                               // to make sure the updated flag variable is seen
         #pragma omp flush
         sum = Sum_array(N, A);
      }
   }
}

This program works with the x86 memory model (loads and stores use relaxed atomics), but it technically has a race … on the store and later load of flag.
85
Synchronization: atomic

#pragma omp parallel
{
   double tmp, B;
   B = DOIT();
   …
86
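A minimal sketch of the atomic construct in the same spirit, protecting only the update of a shared accumulator (DOIT(), big_ugly(), and X are placeholders):

double X = 0.0;                 // shared accumulator
#pragma omp parallel
{
   double tmp, B;
   B = DOIT();                  // independent work, done in parallel
   tmp = big_ugly(B);           // more independent work
   #pragma omp atomic
   X += tmp;                    // only this read-modify-write of X is serialized
}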
Synchronization: atomic
88
The OpenMP 3.1 atomics (2 of 2)
• Atomic can protect the assignment of a value (its capture) AND an
associated update operation:
# pragma omp atomic capture
statement or structured block
• Where the statement is one of the following forms:
v = x++;   v = ++x;   v = x--;   v = --x;   v = x binop expr;
• Where the structured block is one of the following forms:
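A representative subset of those structured-block forms, per the OpenMP 3.1 specification:

{ v = x; x binop= expr; }      { x binop= expr; v = x; }
{ v = x; x = x binop expr; }   { v = x; x++; }
{ x++; v = x; }                { v = x; x--; }      // … plus the remaining ++/-- orderings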
Outline
• Introduction to OpenMP
• Creating Threads
• Quantifying Performance and Amdahl’s law
• Synchronization
• Parallel Loops
• Data environment
• Memory model
• Irregular Parallelism and tasks
• Recap and the fundamental design patterns of parallel
programming
93
What are tasks?
A common pattern is to have one thread create the tasks while the other threads wait at a barrier and execute the tasks (using the single worksharing construct).
96
Task Directive
#pragma omp task [clauses]
structured-block
• The taskwait directive:
   #pragma omp taskwait
  – i.e., wait until all tasks defined in the current task have completed.
  – Note: applies only to tasks generated in the current task, not to “descendants”.
99
Example

#pragma omp parallel
{
   #pragma omp single
   {
      #pragma omp task
         fred();
      #pragma omp task
         daisy();
      #pragma omp taskwait      // fred() and daisy() must complete before billy() starts
      #pragma omp task
         billy();
   }
}
100
Linked list traversal
p = listhead ;
while (p) {
process(p);
p=next(p) ;
}
101
Parallel linked list traversal

#pragma omp parallel
{
   #pragma omp single                     // only one thread packages tasks
   {
      p = listhead;
      while (p) {
         #pragma omp task firstprivate(p) // firstprivate(p) makes a copy of p
         {                                // when the task is packaged
            process(p);
         }
         p = next(p);
      }
   }
}
102
Parallel linked list traversal
One thread packages the tasks:
   p = listhead;
   while (p) {
      < package up task >
      p = next(p);
   }
Meanwhile, every other thread loops, executing tasks as they become available:
   while (tasks_to_do) {
      < execute task >
   }
103
Data scoping with tasks
• Variables can be shared, private or firstprivate with respect to
task
• These concepts are a little bit different compared with
threads:
– If a variable is shared on a task construct, the references to it inside
the construct are to the storage with that name at the point where the
task was encountered
– If a variable is private on a task construct, the references to it inside
the construct are to new uninitialized storage that is created when the
task is executed
– If a variable is firstprivate on a construct, the references to it inside the
construct are to new storage that is created and initialized with the
value of the existing storage of that name when the task is
encountered
104
Data scoping defaults
• The behavior you want for tasks is usually firstprivate, because the task
may not be executed until later (and variables may have gone out of
scope)
– Variables that are private when the task construct is encountered are firstprivate by
default
• Variables that are shared in all constructs starting from the innermost
enclosing parallel construct are shared by default
For example, the serial Fibonacci program that the next slide parallelizes:

int fib (int n)
{  int x, y;
   if (n < 2) return n;
   x = fib(n-1);
   y = fib(n-2);
   return (x+y);
}

int main()
{  int NW = 5000;
   fib(NW);
}
Parallel Fibonacci

int fib (int n)
{  int x, y;
   if (n < 2) return n;
   #pragma omp task shared(x)
      x = fib(n-1);
   #pragma omp task shared(y)
      y = fib(n-2);
   #pragma omp taskwait
   return (x+y);
}

int main()
{  int NW = 5000;
   #pragma omp parallel
   {
      #pragma omp single
         fib(NW);
   }
}

• Binary tree of tasks, traversed using a recursive function.
• A task cannot complete until all tasks below it in the tree are complete (enforced with taskwait).
• x and y are local, and so by default they are private to the current task; they must be shared on the child tasks so the children don’t create their own firstprivate copies at this level!
107
Using tasks
• Getting the data attribute scoping right can be quite tricky
– default scoping rules different from other constructs
– as usual, using default(none) is a good idea
109
Outline
• Introduction to OpenMP
• Creating Threads
• Quantifying Performance and Amdahl’s law
• Synchronization
• Parallel Loops
• Data environment
• Memory model
• Irregular Parallelism and tasks
• Recap and the fundamental design patterns of parallel
programming
110
The OpenMP Common Core: Most OpenMP programs only use these 16 constructs
113
Fork-join
• Use when:
– Target platform has a shared address space
– Dynamic task parallelism
• Particularly useful when you have a serial program to
transform incrementally into a parallel program
• Solution:
1. A computation begins and ends as a single thread.
2. When concurrent tasks are desired, additional threads are forked.
3. The threads carry out the indicated tasks.
4. The set of threads recombine (join).
This pattern is very general and has been used to support most (if not all) the
algorithm strategy patterns.
MPI programs almost always use this pattern … it is probably the most
commonly used pattern in the history of parallel programming.
A simple SPMD pi program
#include <omp.h>
static long num_steps = 100000; double step;
#define NUM_THREADS 2
void main ()
{ int i, nthreads; double pi, sum[NUM_THREADS];
step = 1.0/(double) num_steps;
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
{
int i, id,nthrds;
double x;
id = omp_get_thread_num();
nthrds = omp_get_num_threads();
if (id == 0) nthreads = nthrds;
for (i=id, sum[id]=0.0;i< num_steps; i=i+nthrds) {
x = (i+0.5)*step;
sum[id] += 4.0/(1.0+x*x);
}
}
for(i=0, pi=0.0;i<nthreads;i++)pi += sum[i] * step;
} 116
Loop-level parallelism
• Collections of tasks are defined as iterations of one or more
loops.
• Loop iterations are divided between a collection of processing
elements to compute tasks concurrently. Key elements:
– identify compute intensive loops
– Expose concurrency by removing/managing loop carried dependencies
– Exploit concurrency for parallel execution usually using a parallel loop
construct/directive.
This design pattern is also heavily used with data parallel design
patterns. OpenMP programmers commonly use this pattern.
Loop-level parallelism pi program

#include <omp.h>
static long num_steps = 100000;   double step;
void main ()
{  int i;  double x, pi, sum = 0.0;
   step = 1.0/(double) num_steps;
   #pragma omp parallel for private(x) reduction(+:sum)
   for (i=0; i<num_steps; i++){       // i is private by default
      x = (i+0.5)*step;
      sum = sum + 4.0/(1.0+x*x);
   }
   pi = step * sum;
}

For good OpenMP implementations, reduction is more scalable than critical.
Divide and conquer pattern
• Use when:
– A problem includes a method to divide the problem into
subproblems and a way to recombine solutions of
subproblems into a global solution.
• Solution
– Define a split operation
– Continue to split the problem until subproblems are small
enough to solve directly.
– Recombine solutions to subproblems to solve original
global problem.
• Note:
– Computing may occur at each phase (split, leaves,
recombine).
Divide and conquer
• Split the problem into smaller sub-problems; continue until the sub-problems can be solved directly.
n 3 Options:
¨ Do work as you split
into sub-problems
¨ Do work only at the
leaves
¨ Do work as you
recombine
Program: OpenMP tasks (divide and conquer pattern)

#include <omp.h>
static long num_steps = 100000000;
#define MIN_BLK 10000000

double pi_comp(int Nstart, int Nfinish, double step)
{  int i, iblk;
   double x, sum = 0.0, sum1, sum2;
   if (Nfinish-Nstart < MIN_BLK){
      for (i=Nstart; i<Nfinish; i++){
         x = (i+0.5)*step;
         sum = sum + 4.0/(1.0+x*x);
      }
   }
   else{
      iblk = Nfinish-Nstart;
      #pragma omp task shared(sum1)
         sum1 = pi_comp(Nstart, Nfinish-iblk/2, step);
      #pragma omp task shared(sum2)
         sum2 = pi_comp(Nfinish-iblk/2, Nfinish, step);
      #pragma omp taskwait
      sum = sum1 + sum2;
   }
   return sum;
}

int main ()
{  int i;
   double step, pi, sum;
   step = 1.0/(double) num_steps;
   #pragma omp parallel
   {
      #pragma omp single
         sum = pi_comp(0, num_steps, step);
   }
   pi = step * sum;
}
121
Geometric Decomposition
We’ll cover this when we discuss MPI.
• Use when:
– The problem is organized around a central data structure that can be
decomposed into smaller segments (chunks) that can be updated
concurrently.
• Solution
– Typically, the data structure is updated iteratively where a new value
for one chunk depends on neighboring chunks.
– The computation breaks down into three components: (1) exchange boundary data, (2) update the interiors of each chunk, and (3) update boundary regions. The optimal size of the chunks is dictated by the properties of the memory hierarchy.
• Note:
– This pattern is often used with the Structured Mesh and linear algebra
computational patterns.
SIMT (Data-parallel / index-space)
We’ll cover this when we discuss GPUs.
Note: This is basically a fine-grained, extreme form of the SPMD pattern.
“Patterns make you smarter”
Prof. Kurt Keutzer of UC Berkeley
124
“Patterns make you smarter”
Prof. Kurt Keutzer of UC Berkeley
125
Backup materials
• Challenge problems
• Appendices
126
Challenge problems
127
Challenge 1: Molecular dynamics
128
Challenge 1: MD (cont.)
• Once you have a working version, move the parallel region
out to encompass the iteration loop in main.c
– Code other than the forces loop must be executed by a single thread
(or workshared).
– How does the data sharing change?
• The atomics are a bottleneck on most systems.
– This can be avoided by introducing a temporary array for the force
accumulation, with an extra dimension indexed by thread number
– Which thread(s) should do the final accumulation into f?
129
Challenge 1 MD: (cont.)
• Another option is to use locks
– Declare an array of locks
– Associate each lock with some subset of the particles
– Any thread that updates the force on a particle must hold the
corresponding lock
– Try to avoid unnecessary acquires/releases
– What is the best number of particles per lock?
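A minimal sketch of the lock-array idea using the OpenMP lock API (NLOCKS, the force array f[], and the particle-to-lock mapping are illustrative choices, not the tutorial’s solution):

#include <omp.h>
#define NLOCKS 256
omp_lock_t lck[NLOCKS];

void init_locks(void) {
   for (int k = 0; k < NLOCKS; k++) omp_init_lock(&lck[k]);
}

void add_force(double *f, int particle, double df) {
   int k = particle % NLOCKS;          // map each particle to one lock
   omp_set_lock(&lck[k]);              // only one thread at a time may update this subset
   f[particle] += df;
   omp_unset_lock(&lck[k]);
}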
130
Challenge 2: Monte Carlo calculations
Using random numbers to solve tough problems
132
Challenge 3: Matrix multiplication
133
Challenge 4: Traversing linked lists
• Consider the program linked.c
– Traverses a linked list, computing a sequence of Fibonacci numbers
at each node
• Parallelize this program two different ways
1. Use OpenMP tasks
2. Use anything you choose in OpenMP other than tasks.
• The second approach (no tasks) can be difficult and may take considerable creativity in how you approach the problem (which is why it’s such a pedagogically valuable problem).
134
Challenge 5: Recursive matrix multiplication
135
Challenge 5: Recursive matrix multiplication
A B C
C1,1 = A1,1·B1,1 + A1,2·B2,1 C1,2 = A1,1·B1,2 + A1,2·B2,2
C2,1 = A2,1·B1,1 + A2,2·B2,1 C2,2 = A2,1·B1,2 + A2,2·B2,2
136
Challenge 5: Recursive matrix multiplication
How to multiply submatrices?
• Use the same routine that is computing the full matrix
multiplication
– Quarter each input submatrix and output submatrix
– Treat each sub-submatrix as a single element and multiply
[Figure: A, B, and C are each quartered into sub-blocks A11, A12, A21, A22 (and likewise B11 … B22 and C11 … C22); each sub-block is treated as a single element of a 2×2 matrix multiply.]
// C11 += A11*B11
matmultrec(mf, mf+(ml-mf)/2, nf, nf+(nl-nf)/2, pf, pf+(pl-pf)/2, A,B,C);
// C11 += A12*B21
matmultrec(mf, mf+(ml-mf)/2, nf, nf+(nl-nf)/2, pf+(pl-pf)/2, pl, A,B,C);
. . .
}
139
OpenMP organizations
143
OpenMP Papers (continued)
• Jost G., Labarta J., Gimenez J., What Multilevel Parallel Programs do when you are
not watching: a Performance analysis case study comparing MPI/OpenMP, MLP, and
Nested OpenMP, Shared Memory Parallel Programming with OpenMP, Lecture notes
in Computer Science, Vol. 3349, P. 29, 2005
• Gonzalez M, Serra A, Martorell X, Oliver J, Ayguade E, Labarta J, Navarro N.
Applying interposition techniques for performance analysis of OPENMP parallel
applications. Proceedings 14th International Parallel and Distributed Processing
Symposium. IPDPS 2000. IEEE Comput. Soc. 2000, pp.235-40.
• Chapman B, Mehrotra P, Zima H. Enhancing OpenMP with features for locality
control. Proceedings of Eighth ECMWF Workshop on the Use of Parallel Processors
in Meteorology. Towards Teracomputing. World Scientific Publishing. 1999, pp.301-
13. Singapore.
• Steve W. Bova, Clay P. Breshears, Henry Gabb, Rudolf Eigenmann, Greg Gaertner,
Bob Kuhn, Bill Magro, Stefano Salvini. Parallel Programming with Message Passing
and Directives; SIAM News, Volume 32, No 9, Nov. 1999.
• Cappello F, Richard O, Etiemble D. Performance of the NAS benchmarks on a cluster
of SMP PCs using a parallelization of the MPI programs with OpenMP. Lecture Notes
in Computer Science Vol.1662. Springer-Verlag. 1999, pp.339-50.
• Liu Z., Huang L., Chapman B., Weng T., Efficient Implementation of OpenMP for
Clusters with Implicit Data Distribution, Shared Memory Parallel Programming with
OpenMP, Lecture notes in Computer Science, Vol. 3349, P. 121, 2005
144
OpenMP Papers (continued)
145
Appendices
• Sources for additional information
• OpenMP History
• Solutions to exercises
– Exercise 1: hello world
– Exercise 2: Simple SPMD pi program
– Exercise 3: SPMD pi without false sharing
– Exercise 4: Loop level pi
– Exercise 5: Mandelbrot Set area
– Exercise 6: Recursive pi program
• Challenge Problems
– Challenge 1: Molecular dynamics
– Challenge 2: Monte Carlo pi and random numbers
– Challenge 3: Matrix multiplication
– Challenge 4: Linked lists
– Challenge 5: Recursive matrix multiplication
• Fortran and OpenMP
• Mixing OpenMP and MPI
• Compiler notes
146
OpenMP pre-history
[Figure: SGI and Cray merged and needed commonality across products; the ISV KAI needed a larger market; a rough-draft straw-man SMP API was written and other vendors (HP, IBM, Intel, …) were invited to join; the ASCI program, tired of recoding for SMPs, urged the vendors to standardize. The result appeared in 1997.]
148
OpenMP Release History
[Timeline figure; the surviving label notes GPGPU support, user-defined reductions, and more in the later releases.]
150
Appendices
• Sources for Additional information
• OpenMP History
• Solutions to exercises
– Exercise 1: hello world
– Exercise 2: Simple SPMD Pi program
– Exercise 3: SPMD Pi without false sharing
– Exercise 4: Loop level Pi
– Exercise 5: Mandelbrot Set area
– Exercise 6: Recursive pi program
• Challenge Problems
– Challenge 1: Molecular dynamics
– Challenge 2: Monte Carlo pi and random numbers
– Challenge 3: Matrix multiplication
– Challenge 4: linked lists
– Challenge 5: Recursive matrix multiplication
• Fortran and OpenMP
• Mixing OpenMP and MPI
• Compiler Notes
151
Exercise 1: Solution
A multi-threaded “Hello world” program
• Write a multithreaded program where each thread prints “hello world”.

#include "omp.h"                 // OpenMP include file
void main()
{
   #pragma omp parallel          // parallel region with default number of threads
   {
      int ID = omp_get_thread_num();
      printf(" hello(%d) ", ID);
      printf(" world(%d) \n", ID);
   }                             // end of the parallel region
}

Sample Output: hello(1) hello(0) world(1) world(0) …
152
Appendices
• Sources for Additional information
• OpenMP History
• Solutions to exercises
– Exercise 1: hello world
– Exercise 2: Simple SPMD Pi program
– Exercise 3: SPMD Pi without false sharing
– Exercise 4: Loop level Pi
– Exercise 5: Mandelbrot Set area
– Exercise 6: Recursive pi program
• Challenge Problems
– Challenge 1: Molecular dynamics
– Challenge 2: Monte Carlo pi and random numbers
– Challenge 3: Matrix multiplication
– Challenge 4: linked lists
– Challenge 5: Recursive matrix multiplication
• Fortran and OpenMP
• Mixing OpenMP and MPI
• Compiler Notes
153
The SPMD pattern
154
Exercise 2: A simple SPMD pi program

#include <omp.h>
static long num_steps = 100000;   double step;
#define NUM_THREADS 2
void main ()
{  int i, nthreads;  double pi, sum[NUM_THREADS];  // promote the scalar sum to an array dimensioned by the
   step = 1.0/(double) num_steps;                  // number of threads to avoid a race condition
   omp_set_num_threads(NUM_THREADS);
   #pragma omp parallel
   {
      int i, id, nthrds;
      double x;
      id = omp_get_thread_num();
      nthrds = omp_get_num_threads();
      if (id == 0) nthreads = nthrds;      // only one thread copies the number of threads to the global
                                           // value, so multiple threads writing the same address don’t conflict
      for (i=id, sum[id]=0.0; i<num_steps; i=i+nthrds) {   // a common SPMD trick: a cyclic
         x = (i+0.5)*step;                                  // distribution of loop iterations
         sum[id] += 4.0/(1.0+x*x);
      }
   }
   for(i=0, pi=0.0; i<nthreads; i++) pi += sum[i] * step;
}
155
Appendices
• Sources for Additional information
• OpenMP History
• Solutions to exercises
– Exercise 1: hello world
– Exercise 2: Simple SPMD Pi program
– Exercise 3: SPMD Pi without false sharing
– Exercise 4: Loop level Pi
– Exercise 5: Mandelbrot Set area
– Exercise 6: Recursive pi program
• Challenge Problems
– Challenge 1: molecular dynamics
– Challenge 2: Monte Carlo pi and random numbers
– Challenge 3: Matrix multiplication
– Challenge 4: linked lists
– Challenge 5: Recursive matrix multiplication
• Fortran and OpenMP
• Mixing OpenMP and MPI
• Compiler Notes
156
False sharing
157
Exercise 3: SPMD pi without false sharing

#include <omp.h>
static long num_steps = 100000;   double step;
#define NUM_THREADS 2
void main ()
{  int nthreads;  double pi = 0.0;
   step = 1.0/(double) num_steps;
   omp_set_num_threads(NUM_THREADS);
   #pragma omp parallel
   {
      int i, id, nthrds;
      double x, sum;                     // a scalar local to each thread to accumulate partial sums
      id = omp_get_thread_num();         // (no array, so no false sharing)
      nthrds = omp_get_num_threads();
      if (id == 0) nthreads = nthrds;
      for (i=id, sum=0.0; i<num_steps; i=i+nthrds){
         x = (i+0.5)*step;
         sum += 4.0/(1.0+x*x);
      }
      #pragma omp critical               // sum goes “out of scope” beyond the parallel region, so it must be
         pi += sum * step;               // added to pi here; the critical region keeps the updates from conflicting
   }
}
158
Appendices
• Sources for Additional information
• OpenMP History
• Solutions to exercises
– Exercise 1: hello world
– Exercise 2: Simple SPMD Pi program
– Exercise 3: SPMD Pi without false sharing
– Exercise 4: Loop level Pi
– Exercise 5: Mandelbrot Set area
– Exercise 6: Recursive pi program
• Challenge Problems
– Challenge 1: molecular dynamics
– Challenge 2: Monte Carlo pi and random numbers
– Challenge 3: Matrix multiplication
– Challenge 4: linked lists
– Challenge 5: Recursive matrix multiplication
• Fortran and OpenMP
• Mixing OpenMP and MPI
• Compiler Notes
159
Exercise 4: Solution
#include <omp.h>
static long num_steps = 100000; double step;
void main ()
{ int i; double x, pi, sum = 0.0;
step = 1.0/(double) num_steps;
#pragma omp parallel
{
double x;
#pragma omp for reduction(+:sum)
for (i=0;i< num_steps; i++){
x = (i+0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}
}
pi = step * sum;
}
160
Exercise 4: Solution

#include <omp.h>
static long num_steps = 100000;   double step;
void main ()
{  int i;  double x, pi, sum = 0.0;
   step = 1.0/(double) num_steps;
   #pragma omp parallel for private(x) reduction(+:sum)
   for (i=0; i<num_steps; i++){        // i is private by default
      x = (i+0.5)*step;
      sum = sum + 4.0/(1.0+x*x);
   }
   pi = step * sum;
}

For good OpenMP implementations, reduction is more scalable than critical.
162
Exercise 5: The Mandelbrot area program

#include <omp.h>
#define NPOINTS 1000
#define MXITR 1000
void testpoint(void);
struct d_complex{ double r; double i; };
struct d_complex c;
int numoutside = 0;

int main(){
   int i, j;
   double area, error, eps = 1.0e-5;
   #pragma omp parallel for default(shared) private(c, eps)
   for (i=0; i<NPOINTS; i++) {
      for (j=0; j<NPOINTS; j++) {
         c.r = -2.0+2.5*(double)(i)/(double)(NPOINTS)+eps;
         c.i = 1.125*(double)(j)/(double)(NPOINTS)+eps;
         testpoint();
      }
   }
   area = 2.0*2.5*1.125*(double)(NPOINTS*NPOINTS-numoutside)/(double)(NPOINTS*NPOINTS);
   error = area/(double)NPOINTS;
}

void testpoint(void){
   struct d_complex z;
   int iter;
   double temp;
   z = c;
   for (iter=0; iter<MXITR; iter++){
      temp = (z.r*z.r)-(z.i*z.i)+c.r;
      z.i = z.r*z.i*2+c.i;
      z.r = temp;
      if ((z.r*z.r+z.i*z.i)>4.0) {
         numoutside++;
         break;
      }
   }
}

When I run this program, I get a different incorrect answer each time I run it … there is a race condition!!!!
163
Exercise 5: Area of a Mandelbrot set
164
Debugging parallel programs
• Find tools that work with your environment and learn to use
them; a good parallel debugger can make a huge difference
• But parallel debuggers are not portable and you will
assuredly need to debug “by hand” at some point
• There are tricks to help you; the most important is to use the
default(none) pragma
165
Exercise 5: The Mandelbrot area program

#include <omp.h>
#define NPOINTS 1000
#define MXITR 1000
struct d_complex{ double r; double i; };
void testpoint(struct d_complex);
struct d_complex c;
int numoutside = 0;

int main(){
   int i, j;
   double area, error, eps = 1.0e-5;
   #pragma omp parallel for default(shared) private(c, j) firstprivate(eps)
   for (i=0; i<NPOINTS; i++) {
      for (j=0; j<NPOINTS; j++) {
         c.r = -2.0+2.5*(double)(i)/(double)(NPOINTS)+eps;
         c.i = 1.125*(double)(j)/(double)(NPOINTS)+eps;
         testpoint(c);
      }
   }
   area = 2.0*2.5*1.125*(double)(NPOINTS*NPOINTS-numoutside)/(double)(NPOINTS*NPOINTS);
   error = area/(double)NPOINTS;
}

void testpoint(struct d_complex c){
   struct d_complex z;
   int iter;
   double temp;
   z = c;
   for (iter=0; iter<MXITR; iter++){
      temp = (z.r*z.r)-(z.i*z.i)+c.r;
      z.i = z.r*z.i*2+c.i;
      z.r = temp;
      if ((z.r*z.r+z.i*z.i)>4.0) {
         #pragma omp atomic
         numoutside++;
         break;
      }
   }
}

Other errors found using a debugger or by inspection:
• eps was not initialized (now firstprivate)
• Protect updates of numoutside (now atomic)
• Which value of c does testpoint() see? Global or private? (c is now passed as an argument)
166
Appendices
• Sources for Additional information
• OpenMP History
• Solutions to exercises
– Exercise 1: hello world
– Exercise 2: Simple SPMD Pi program
– Exercise 3: SPMD Pi without false sharing
– Exercise 4: Loop level Pi
– Exercise 5: Mandelbrot Set area
– Exercise 6: Recursive pi program
• Challenge Problems
– Challenge 1: molecular dynamics
– Challenge 2: Monte Carlo pi and random numbers
– Challenge 3: Matrix multiplication
– Challenge 4: linked lists
– Challenge 5: Recursive matrix multiplication
• Fortran and OpenMP
• Mixing OpenMP and MPI
• Compiler Notes
167
Divide and conquer pattern
• Use when:
– A problem includes a method to divide into subproblems
and a way to recombine solutions of subproblems into a
global solution
• Solution
– Define a split operation
– Continue to split the problem until subproblems are small
enough to solve directly
– Recombine solutions to subproblems to solve original
global problem
• Note:
– Computing may occur at each phase (split, leaves,
recombine)
Divide and conquer
• Split the problem into smaller sub-problems; continue until the sub-problems can be solved directly.
n 3 Options:
¨ Do work as you split
into sub-problems
¨ Do work only at the
leaves
¨ Do work as you
recombine
Program: OpenMP tasks (divide and conquer pattern)

#include <omp.h>
static long num_steps = 100000000;
#define MIN_BLK 10000000

double pi_comp(int Nstart, int Nfinish, double step)
{  int i, iblk;
   double x, sum = 0.0, sum1, sum2;
   if (Nfinish-Nstart < MIN_BLK){
      for (i=Nstart; i<Nfinish; i++){
         x = (i+0.5)*step;
         sum = sum + 4.0/(1.0+x*x);
      }
   }
   else{
      iblk = Nfinish-Nstart;
      #pragma omp task shared(sum1)
         sum1 = pi_comp(Nstart, Nfinish-iblk/2, step);
      #pragma omp task shared(sum2)
         sum2 = pi_comp(Nfinish-iblk/2, Nfinish, step);
      #pragma omp taskwait
      sum = sum1 + sum2;
   }
   return sum;
}

int main ()
{  int i;
   double step, pi, sum;
   step = 1.0/(double) num_steps;
   #pragma omp parallel
   {
      #pragma omp single
         sum = pi_comp(0, num_steps, step);
   }
   pi = step * sum;
}
170
Results*: pi with tasks
*Intel compiler (icpc) with no optimization on Apple OS X 10.7.3 with a dual core (four HW
thread) Intel® CoreTM i5 processor at 1.7 Ghz and 4 Gbyte DDR3 memory at 1.333 Ghz.
171
Appendices
• Sources for Additional information
• OpenMP History
• Solutions to exercises
– Exercise 1: hello world
– Exercise 2: Simple SPMD Pi program
– Exercise 3: SPMD Pi without false sharing
– Exercise 4: Loop level Pi
– Exercise 5: Mandelbrot Set area
– Exercise 6: Recursive pi program
• Challenge Problems
– Challenge 1: molecular dynamics
– Challenge 2: Monte Carlo pi and random numbers
– Challenge 3: Matrix multiplication
– Challenge 4: linked lists
– Challenge 5: Recursive matrix multiplication
• Fortran and OpenMP
• Mixing OpenMP and MPI
• Compiler Notes
172
Appendices
• Sources for Additional information
• OpenMP History
• Solutions to exercises
– Exercise 1: hello world
– Exercise 2: Simple SPMD Pi program
– Exercise 3: SPMD Pi without false sharing
– Exercise 4: Loop level Pi
– Exercise 5: Mandelbrot Set area
– Exercise 6: Recursive pi program
• Challenge Problems
– Challenge 1: molecular dynamics
– Challenge 2: Monte Carlo pi and random numbers
– Challenge 3: Matrix multiplication
– Challenge 4: linked lists
– Challenge 5: Recursive matrix multiplication
• Fortran and OpenMP
• Mixing OpenMP and MPI
• Compiler Notes
173
Challenge 1: Solution
(The compiler will warn you if you have missed some variables.)

      ftemp[myid][j]   -= forcex;
      ftemp[myid][j+1] -= forcey;
      ftemp[myid][j+2] -= forcez;
     }
   }
   ftemp[myid][i]   += fxi;       // replace the atomics with accumulation into an array
   ftemp[myid][i+1] += fyi;       // that has an extra dimension indexed by thread id
   ftemp[myid][i+2] += fzi;
}
Challenge 1: With array reduction
….
#pragma omp for                         // the reduction can be done in parallel
for(int i=0; i<(npart*3); i++){
   for(int id=0; id<nthreads; id++){
      f[i] += ftemp[id][i];
      ftemp[id][i] = 0.0;               // zero ftemp for the next time round
   }
}
Appendices
• Sources for Additional information
• OpenMP History
• Solutions to exercises
– Exercise 1: hello world
– Exercise 2: Simple SPMD Pi program
– Exercise 3: SPMD Pi without false sharing
– Exercise 4: Loop level Pi
– Exercise 5: Mandelbrot Set area
– Exercise 6: Recursive pi program
• Challenge Problems
– Challenge 1: molecular dynamics
– Challenge 2: Monte Carlo pi and random numbers
– Challenge 3: Matrix multiplication
– Challenge 4: linked lists
– Challenge 5: Recursive matrix multiplication
• Fortran and OpenMP
• Mixing OpenMP and MPI
• Compiler Notes
179
Computers and random numbers
• We use “dice” to make random numbers:
– Given previous values, you cannot predict the next value.
– There are no patterns in the series … and it goes on forever.
• Computers are deterministic machines … set an initial state,
run a sequence of predefined instructions, and you get a
deterministic answer
– By design, computers are not random and cannot produce random
numbers.
• However, with some very clever programming, we can make
“pseudo random” numbers that are as random as you need
them to be … but only if you are very careful.
• Why do I care? Random numbers drive statistical methods
used in countless applications:
– Sample a large space of alternatives to find statistically good answers
(Monte Carlo methods).
180
Monte Carlo Calculations
Using Random numbers to solve tough problems
pi = 4.0 * ((double)Ncirc/(double)num_trials);
printf("\n %d trials, pi is %f \n",num_trials, pi);
}
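A minimal sketch of the Monte Carlo loop that precedes the lines above, assuming the random() generator defined on the LCG slides (it returns values in [0,1)):

static long num_trials = 10000;        // number of darts thrown (illustrative value)
long   Ncirc = 0;
double r = 1.0;                        // radius of the circle; the enclosing square has side 2*r
double pi, x, y;

for (long i = 0; i < num_trials; i++) {
   x = random();                       // throw a dart at the unit square
   y = random();
   if (x*x + y*y <= r*r) Ncirc++;      // count darts landing inside the quarter circle
}
// … followed by the two lines shown above (pi = 4.0*Ncirc/num_trials and the printf).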
182
Linear Congruential Generator (LCG)
• LCG: Easy to write, cheap to compute, portable, OK quality
• If you pick the multiplier and addend correctly, LCG has a period of PMOD.
• Picking good LCG parameters is complicated, so look it up (Numerical Recipes is a good source). I used the following:
  – MULTIPLIER = 1366
  – ADDEND = 150889
  – PMOD = 714025
183
LCG code

static long MULTIPLIER  = 1366;
static long ADDEND      = 150889;
static long PMOD        = 714025;
long random_last = 0;        // seed the pseudo-random sequence by setting random_last
double random ()
{
   long random_next;
   random_next = (MULTIPLIER * random_last + ADDEND) % PMOD;
   random_last = random_next;
   return ((double)random_next/(double)PMOD);
}
184
Running the PI_MC program with the LCG generator
[Plot: log10 relative error vs. log10 number of samples for the one-thread LCG run and three 4-thread LCG trials.]
Run the same program the same way and get different answers! That is not acceptable!
Issue: my LCG generator is not threadsafe.
Program written using the Intel C/C++ compiler (10.0.659.2005) in Microsoft Visual Studio 2005 (8.0.50727.42) and running on a dual-core laptop (Intel T2400 @ 1.83 GHz with 2 GB RAM) running Microsoft Windows XP.
185
LCG code: threadsafe version

static long MULTIPLIER  = 1366;            // random_last carries state between
static long ADDEND      = 150889;          // random number computations
static long PMOD        = 714025;
long random_last = 0;
#pragma omp threadprivate(random_last)     // to make the generator threadsafe, make random_last
double random ()                           // threadprivate so each thread has its own copy
{
   long random_next;
   random_next = (MULTIPLIER * random_last + ADDEND) % PMOD;
   random_last = random_next;
   return ((double)random_next/(double)PMOD);
}
186
Thread safe random number generators
[Plot: log10 relative error vs. log10 number of samples for the one-thread program, three 4-thread LCG trials, and the thread-safe 4-thread LCG.]
The thread-safe version gives the same answer each time you run the program. But for a large number of samples, its quality is lower than the one-thread result! Why?
187
Pseudo Random Sequences
• Random number Generators (RNGs) define a sequence of pseudo-random
numbers of length equal to the period of the RNG
[Figure: one long pseudo-random sequence partitioned into sub-sequences for thread 1, thread 2, and thread 3.]
189
MKL Random number generators (RNG)
• MKL includes several families of RNGs in its vector statistics library.
• They are specialized to efficiently generate vectors of random numbers.
190
Wichmann-Hill generators (WH)
VSLStreamStatePtr stream;
#pragma omp threadprivate(stream)
…
vslNewStream(&stream, VSL_BRNG_WH+Thrd_ID, (int)seed);
191
Independent generator for each thread
[Plot: log10 relative error vs. log10 number of samples with an independent Wichmann-Hill generator per thread. Notice that once you get beyond the high-error, small-sample region, adding threads does not degrade quality.]
192
Leap Frog method
• Interleave samples in the sequence of pseudo random numbers:
– Thread i starts at the ith number in the sequence
– Stride through sequence, stride length = number of threads.
• Result … the same sequence of values regardless of the number of
threads.
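A minimal sketch of a leapfrogging version of the LCG from the earlier slides (this reformulation is an illustration, not the MKL code used for the plots): thread id is seeded at element id of the one-thread sequence and each call then jumps ahead by nthrds elements.

#include <omp.h>
#define MULTIPLIER 1366LL
#define ADDEND     150889LL
#define PMOD       714025LL

static long long state, mult_n, add_n;           // per-thread state and “jump by nthrds” constants
#pragma omp threadprivate(state, mult_n, add_n)

void leapfrog_seed(long long seed)               // call inside the parallel region
{
   int id = omp_get_thread_num(), nthrds = omp_get_num_threads();
   state = seed % PMOD;
   for (int i = 0; i < id; i++)                  // advance to this thread’s starting element
      state = (MULTIPLIER * state + ADDEND) % PMOD;
   mult_n = 1;  add_n = 0;                       // compose x -> MULTIPLIER*x + ADDEND with itself
   for (int i = 0; i < nthrds; i++) {            // nthrds times:  x' = mult_n*x + add_n  (mod PMOD)
      add_n  = (MULTIPLIER * add_n + ADDEND) % PMOD;
      mult_n = (MULTIPLIER * mult_n) % PMOD;
   }
}

double leapfrog_random(void)                     // next value in this thread’s sub-sequence
{
   state = (mult_n * state + add_n) % PMOD;
   return (double)state / (double)PMOD;
}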
Used the MKL library with two generator streams per computation: one for the x values (WH) and one for the
y values (WH+1). Also used the leapfrog method to deal out iterations among threads.
194
Appendices
• Sources for Additional information
• OpenMP History
• Solutions to exercises
– Exercise 1: hello world
– Exercise 2: Simple SPMD Pi program
– Exercise 3: SPMD Pi without false sharing
– Exercise 4: Loop level Pi
– Exercise 5: Mandelbrot Set area
– Exercise 6: Recursive pi program
• Challenge Problems
– Challenge 1: molecular dynamics
– Challenge 2: Monte Carlo Pi and random numbers
– Challenge 3: Matrix multiplication
– Challenge 4: linked lists
– Challenge 5: Recursive matrix multiplication
• Fortran and OpenMP
• Mixing OpenMP and MPI
• Compiler Notes
195
Challenge 3: Matrix Multiplication
196
Matrix multiplication
Results on an Intel dual core 1.83 GHz CPU, Intel IA-32 compiler 10.1 build 2 197
Appendices
• Sources for Additional information
• OpenMP History
• Solutions to exercises
– Exercise 1: hello world
– Exercise 2: Simple SPMD Pi program
– Exercise 3: SPMD Pi without false sharing
– Exercise 4: Loop level Pi
– Exercise 5: Mandelbrot Set area
– Exercise 6: Recursive pi program
• Challenge Problems
– Challenge 1: molecular dynamics
– Challenge 2: Monte Carlo Pi and random numbers
– Challenge 3: Matrix multiplication
– Challenge 4: linked lists
– Challenge 5: Recursive matrix multiplication
• Fortran and OpenMP
• Mixing OpenMP and MPI
• Compiler Notes
198
Challenge 4: traversing linked lists
• Consider the program linked.c
– Traverses a linked list computing a sequence of Fibonacci numbers at
each node.
• Parallelize this program two different ways
1. Use OpenMP tasks
2. Use anything you choose in OpenMP other than tasks.
• The second approach (no tasks) can be difficult and may take considerable creativity in how you approach the problem (hence why it’s such a pedagogically valuable problem).
199
Linked lists with tasks (OpenMP 3)
• See the file Linked_omp3_tasks.c
200
Challenge 4: traversing linked lists
• Consider the program linked.c
– Traverses a linked list computing a sequence of Fibonacci numbers at
each node.
• Parallelize this program two different ways
1. Use OpenMP tasks
2. Use anything you choose in OpenMP other than tasks.
• The second approach (no tasks) can be difficult and may take considerable creativity in how you approach the problem (hence why it’s such a pedagogically valuable problem).
201
Linked lists without tasks
• See the file Linked_omp25.c

while (p != NULL) {
   p = p->next;                        // count the number of items in the linked list
   count++;
}
p = head;
for(i=0; i<count; i++) {
   parr[i] = p;                        // copy a pointer to each node into an array
   p = p->next;
}
#pragma omp parallel
{
   #pragma omp for schedule(static,1)
   for(i=0; i<count; i++)              // process nodes in parallel with a for loop
      processwork(parr[i]);
}

                Default schedule     schedule(static,1)
One thread      48 seconds           45 seconds
Two threads     39 seconds           28 seconds

Results on an Intel dual core 1.83 GHz CPU, Intel IA-32 compiler 10.1 build 2
202
Linked lists without tasks: C++ STL
• See the file Linked_cpp.cpp

int j = (int)nodelist.size();                  // number of items in the linked list
#pragma omp parallel for schedule(static,1)    // process nodes in parallel with a for loop
for (int i = 0; i < j; ++i)
   processwork(nodelist[i]);
204
Recursive matrix multiplication
• Could be executed in parallel as 4 tasks
– Each task executes the two calls for the same output submatrix of C
• However, the same number of multiplication operations is needed
#define THRESHOLD 32768 // product size below which simple matmult code is called
void matmultrec(int mf, int ml, int nf, int nl, int pf, int pl,
double **A, double **B, double **C)
{
if ((ml-mf)*(nl-nf)*(pl-pf) < THRESHOLD)
matmult (mf, ml, nf, nl, pf, pl, A, B, C);
else
{
#pragma omp task firstprivate(mf,ml,nf,nl,pf,pl)
{
matmultrec(mf, mf+(ml-mf)/2, nf, nf+(nl-nf)/2, pf, pf+(pl-pf)/2, A, B, C); // C11 += A11*B11
matmultrec(mf, mf+(ml-mf)/2, nf, nf+(nl-nf)/2, pf+(pl-pf)/2, pl, A, B, C); // C11 += A12*B21
}
#pragma omp task firstprivate(mf,ml,nf,nl,pf,pl)
{
matmultrec(mf, mf+(ml-mf)/2, nf+(nl-nf)/2, nl, pf, pf+(pl-pf)/2, A, B, C); // C12 += A11*B12
matmultrec(mf, mf+(ml-mf)/2, nf+(nl-nf)/2, nl, pf+(pl-pf)/2, pl, A, B, C); // C12 += A12*B22
}
#pragma omp task firstprivate(mf,ml,nf,nl,pf,pl)
{
matmultrec(mf+(ml-mf)/2, ml, nf, nf+(nl-nf)/2, pf, pf+(pl-pf)/2, A, B, C); // C21 += A21*B11
matmultrec(mf+(ml-mf)/2, ml, nf, nf+(nl-nf)/2, pf+(pl-pf)/2, pl, A, B, C); // C21 += A22*B21
}
#pragma omp task firstprivate(mf,ml,nf,nl,pf,pl)
{
matmultrec(mf+(ml-mf)/2, ml, nf+(nl-nf)/2, nl, pf, pf+(pl-pf)/2, A, B, C); // C22 += A21*B12
matmultrec(mf+(ml-mf)/2, ml, nf+(nl-nf)/2, nl, pf+(pl-pf)/2, pl, A, B, C); // C22 += A22*B22
}
#pragma omp taskwait
}
} 205
Appendices
• Sources for Additional information
• OpenMP History
• Solutions to exercises
– Exercise 1: hello world
– Exercise 2: Simple SPMD Pi program
– Exercise 3: SPMD Pi without false sharing
– Exercise 4: Loop level Pi
– Exercise 5: Mandelbrot Set area
– Exercise 6: Recursive pi program
• Challenge Problems
– Challenge 1: molecular dynamics
– Challenge 2: Monte Carlo Pi and random numbers
– Challenge 3: Matrix multiplication
– Challenge 4: linked lists
– Challenge 5: Recursive matrix multiplication
• Fortran and OpenMP
• Mixing OpenMP and MPI
• Compiler Notes
206
Fortran and OpenMP
207
OpenMP:
Some syntax details for Fortran programmers
• Most of the constructs in OpenMP are compiler directives.
– For Fortran, the directives take one of the forms:
C$OMP construct [clause [clause]…]
!$OMP construct [clause [clause]…]
*$OMP construct [clause [clause]…]
• The OpenMP include file and lib module
use omp_lib
Include omp_lib.h
OpenMP:
Structured blocks (Fortran)
– Most OpenMP constructs apply to structured blocks.
– Structured block: a block of code with one point of entry at the top and one point of exit at the bottom.
– The only “branches” allowed are STOP statements in Fortran and exit() in C/C++.

C$OMP PARALLEL DO
      do I=1,N
         res(I)=bigComp(I)
      end do
C$OMP END PARALLEL DO

C$OMP PARALLEL
10    wrk(id) = garbage(id)
      res(id) = wrk(id)**2
      if(conv(res(id))) goto 10
C$OMP END PARALLEL
213
Pi program with MPI and OpenMP

#include <mpi.h>
#include "omp.h"
void main (int argc, char *argv[])
{
   int i, my_id, numprocs;  double x, pi, step, sum = 0.0;
   step = 1.0/(double) num_steps;
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_id);         // Get the MPI part done first,
   MPI_Comm_size(MPI_COMM_WORLD, &numprocs);      // then add the OpenMP pragma
   my_steps = num_steps/numprocs;                 // where it makes sense to do so
   #pragma omp parallel for reduction(+:sum) private(x)
   for (i=my_id*my_steps; i<(my_id+1)*my_steps; i++)
   {
      x = (i+0.5)*step;
      sum += 4.0/(1.0+x*x);
   }
   sum *= step;
   MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
}
214
Key issues when mixing OpenMP and MPI
1. Messages are sent to a process, not to a particular thread.
   – Not all MPIs are threadsafe. MPI 2.0 defines threading modes:
       – MPI_THREAD_SINGLE: no support for multiple threads
       – MPI_THREAD_FUNNELED: multiple threads, only the master calls MPI
       – MPI_THREAD_SERIALIZED: multiple threads each calling MPI, but they do it one at a time
       – MPI_THREAD_MULTIPLE: multiple threads without any restrictions
   – Request and test thread modes with the function:
       MPI_Init_thread(desired_mode, delivered_mode, ierr)
2. Environment variables are not propagated by mpirun. You’ll need to broadcast OpenMP parameters and set them with the library routines (see the sketch below).
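A minimal sketch of that broadcast (my_id is the MPI rank; assumes <mpi.h>, <omp.h>, and <stdlib.h> are included; the fallback of 4 threads is an arbitrary choice):

int nthreads = 0;
if (my_id == 0) {                                // rank 0 reads the environment where mpirun was launched
   const char *s = getenv("OMP_NUM_THREADS");
   nthreads = s ? atoi(s) : 4;
}
MPI_Bcast(&nthreads, 1, MPI_INT, 0, MPI_COMM_WORLD);
omp_set_num_threads(nthreads);                   // every process now uses the same thread count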
215
Dangerous Mixing of MPI and OpenMP
• The following will work only if MPI_THREAD_MULTIPLE is supported … a level of support I wouldn’t depend on.
MPI_Comm_Rank(MPI_COMM_WORLD, &mpi_id) ;
#pragma omp parallel
{
int tag, swap_neigh, stat, omp_id = omp_get_thread_num();
long buffer [BUFF_SIZE], incoming [BUFF_SIZE];
big_ugly_calc1(omp_id, mpi_id, buffer);
// Finds MPI id and tag so
neighbor(omp_id, mpi_id, &swap_neigh, &tag); // messages don’t conflict
217
Safe Mixing of MPI and OpenMP
Put MPI in sequential regions
MPI_Init(&argc, &argv) ; MPI_Comm_Rank(MPI_COMM_WORLD, &mpi_id) ;
219
Hybrid OpenMP/MPI works, but is it worth it?
221
Compiler notes: Intel on Windows
• Intel compiler:
– Launch SW dev environment … on my laptop I use:
– start/intel software development tools/intel C++ compiler 11.0/C+ build
environment for 32 bit apps
– cd to the directory that holds your source code
– Build software for program foo.c
– icl /Qopenmp foo.c
– Set number of threads environment variable
– set OMP_NUM_THREADS=4
– Run your program
– foo.exe
223
Compiler notes: Other
for the Bash shell
• Linux and OS X with gcc:
> gcc -fopenmp foo.c
> export OMP_NUM_THREADS=4
> ./a.out
• Linux and OS X with PGI:
> pgcc -mp foo.c
> export OMP_NUM_THREADS=4
> ./a.out
224
OpenMP constructs
• #pragma omp parallel
• #pragma omp for
• #pragma omp critical
• #pragma omp atomic
• #pragma omp barrier
• Data environment clauses
  – private (variable_list)
  – firstprivate (variable_list)
  – lastprivate (variable_list)
  – reduction(+:variable_list)
• Tasks (remember … private data is made firstprivate by default)
  – #pragma omp task
  – #pragma omp taskwait
• #pragma omp threadprivate(variable_list)

where variable_list is a comma-separated list of variables.
Print the value of the macro _OPENMP: its value is yyyymm, the year and month of the spec the implementation used.
225
OMP Construct : Concepts
• #pragma omp parallel : parallel region, teams, structured block, interleaved execution across threads
• omp_get_wtime() : speedup, Amdahl’s law and other performance issues
• omp_get_num_threads(), omp_get_thread_num() : the SPMD pattern; false sharing
• barrier, critical : barrier synchronization and race conditions; revisit interleaved execution; teach the concept of flush but not the construct
• #pragma omp for : worksharing, parallel loops, loop-carried dependencies
• schedule(dynamic [,chunk]), schedule(static [,chunk]) : loop schedules, loop overheads and load balance
• private, firstprivate, shared, reduction : data environment
• nowait, #pragma omp single : synchronization overhead, the high cost of barriers
• OMP_NUM_THREADS : internal control variables

Features just beyond the common core
• atomic, flush, locks : details on synchronization, including spin locks and locking protocols
• #pragma omp task : tasks, including the data environment for tasks
• OMP_PLACES, OMP_PROC_BIND : NUMA issues
OMP Construct : Concepts : Exercises
• #pragma omp parallel : parallel region, teams, structured block, interleaved execution across threads : hello world
• int omp_get_thread_num(), int omp_get_num_threads() : the SPMD pattern : pi SPMD
• double omp_get_wtime() : speedup and Amdahl’s law; false sharing and other performance issues
• setenv OMP_NUM_THREADS N : internal control variables : pi SPMD on many threads
• #pragma omp barrier, #pragma omp critical : synchronization and race conditions; revisit interleaved execution; teach the concept of flush but not the construct : pi SPMD without sum array
• #pragma omp for : worksharing, parallel loops, loop-carried dependencies : pi par loop
• reduction(op:list) : reductions of values across a team of threads : pi par loop
• #pragma omp task, #pragma omp taskwait : tasks, including the data environment for tasks : linked lists
Advertised Plan for the course
(in four units ranging from 90 minutes to 2 hours)
[Plots (Host): runtime and serial fraction vs. matrix order for small orders, separating the O(N^2) and O(N^3) terms.]
What if the problem size grows?
• Consider dense linear algebra problems.
• A key feature of many of these operations on matrices (such as LU factorization or matrix multiplication) is that the work scales as the cube of the order of the matrix.
• Assume we can parallelize the linear algebra operation (O(N^3)) but not the loading of the matrices from memory (O(N^2)). How does the serial fraction vary with matrix order (assuming loading from memory is much slower than a floating-point op)?
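Writing the O(N^2) load time as $k_2 N^2$ and the O(N^3) compute time as $k_3 N^3$ (the constants $k_2$, $k_3$ are just placeholders), the serial fraction is

$$\alpha(N) \;=\; \frac{k_2 N^2}{k_2 N^2 + k_3 N^3} \;=\; \frac{1}{1 + (k_3/k_2)\,N} \;\longrightarrow\; 0 \quad\text{as } N \to \infty,$$

so the larger the matrices, the smaller the serial fraction and the better the achievable speedup.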
[Plots: runtime vs. matrix order and serial fraction vs. matrix order for orders up to 2000, separating the O(N^2) and O(N^3) terms; as the order grows, the O(N^2) part becomes a vanishing fraction of the total.]
For much larger matrix orders …
Weak scaling: a response to Amdahl
• Gary Montry and John Gustafson (1988, Sandia National Laboratories) observed that for many problems the serial fraction is a function of the problem size (N) and decreases as N grows:

$$S(P,N) = \frac{T_{seq}(1)}{\left( \alpha(N) + \frac{1-\alpha(N)}{P} \right) T_{seq}(1)}, \qquad \lim_{N \to N_{large}} \alpha(N) = 0 \;\Rightarrow\; S(P, N_{large}) \to P$$
Example of weak scaling
[Plot: execution time (secs) vs. number of processing elements for a time-dependent quantum simulation of helium atoms with 20 grid units per processing element.]
http://www.spscicomp.org/ScicomP16/presentations/PRACE_ScicomP.pdf