
5: Introductory OpenMP

John Burkardt
Virginia Tech

10-12 June 2008



Introductory OpenMP

OpenMP enables parallel execution on shared memory systems.


The user inserts special comment statements, which are actually
OpenMP directives.
These directives indicate which parts of the program can be done
in parallel, and whether any data needs special treatment.
The user decides whether the OpenMP directives should be used
(for parallel execution) or ignored (in which case the program runs
sequentially).



Introductory OpenMP

OpenMP enables cooperative execution of a user program by processors sharing a common memory.

OpenMP works well with:

multicore chips;
NUMA systems: multiple chips with separate, but linked, memories.

OpenMP can also be used in combination with MPI, combining the power of fast chips and networking, but this is an advanced topic!



Introductory OpenMP
OpenMP is an efficient language for modern multicore chips.



Introductory OpenMP

OpenMP is available for C, C++, FORTRAN77 and FORTRAN90 programmers.
OpenMP is mainly a series of “comments” to the language compiler, pointing out opportunities for parallel execution.
At compile time, the user declares whether the program should be prepared for sequential or parallel execution.
If the program is compiled for parallel execution, many details are only decided at execution time (how many processors will be used, and which processors do what).
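As a concrete sketch (assuming the GNU compiler and a Unix-style shell; the file name myprog.c is hypothetical):

gcc -fopenmp myprog.c -o myprog      # directives honored: parallel executable
gcc          myprog.c -o myprog      # same source: directives ignored
OMP_NUM_THREADS=4 ./myprog           # thread count chosen at execution time

OMP_NUM_THREADS is the standard OpenMP environment variable for requesting a number of threads at run time.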



Introductory OpenMP

Glendower: “I can call spirits from the vasty deep.”
Hotspur: “Why, so can I, or so can any man. But will they come when you do call for them?”
Shakespeare, Henry IV, Part 1.

Any C/C++/Fortran program can “call” OpenMP...
...but you only get parallel code if the compiler supports OpenMP.
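A portable way to find out is the standard _OPENMP preprocessor symbol, which a compiler defines when OpenMP support is enabled. A minimal sketch:

#include <stdio.h>

int main ( int argc, char *argv[] )
{
#ifdef _OPENMP
  printf ( "OpenMP is enabled (_OPENMP = %d).\n", _OPENMP );
#else
  printf ( "OpenMP is not enabled; the program runs sequentially.\n" );
#endif
  return 0;
}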



Compilers Support OpenMP

Compiler writers do support OpenMP:

Gnu gcc/g++ 4.2, gfortran 2.0;
IBM xlc, xlf;
Intel icc, ifort;
Microsoft Visual C++ (2005 Professional edition);
Portland C/C++/Fortran, pgcc, pgf95;
Sun Studio C/C++/Fortran.
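Each compiler turns on OpenMP with its own flag; typical invocations (flags vary by version, so check your compiler's documentation) look like:

gcc -fopenmp myprog.c          (Gnu)
icc -openmp myprog.c           (Intel)
xlc -qsmp=omp myprog.c         (IBM)
pgcc -mp myprog.c              (Portland)
cc -xopenmp myprog.c           (Sun Studio)
cl /openmp myprog.c            (Microsoft)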



The OpenMP “Language”

OpenMP can be divided into five parts:

parallel control structures;
data classification (private/shared);
environment inquiry and setting;
work sharing;
synchronization.

We will concentrate on the first three parts.
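As a first taste of the parallel control and environment parts, here is a minimal sketch using functions from the standard OpenMP runtime library:

#include <stdio.h>
#include <omp.h>

int main ( int argc, char *argv[] )
{
  omp_set_num_threads ( 4 );              /* environment setting */

#pragma omp parallel                      /* parallel control */
  {
    int id = omp_get_thread_num ( );      /* environment inquiry */
    printf ( "Thread %d of %d says hello.\n",
      id, omp_get_num_threads ( ) );
  }
  return 0;
}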



OpenMP Ideas: Threads

OpenMP uses the idea of threads to describe the multiple agents that are cooperating.
A thread is an entity that can carry out some part of a program,
with its own list of tasks and data.
In any computation, a single master thread begins the
computation.
When a parallel loop is encountered, the master thread calls up
more threads. Technically, the single thread of execution forks into
multiple threads.
The threads cooperate until the parallel work is done. Then the
multiple threads join into a single thread again.



OpenMP Ideas: Loops

OpenMP assumes that the parallel work is done in the for loops
or do loops in the program.
The user marks which loops are to be executed in parallel.
So the program begins as a single thread of execution.
When a parallel loop is reached, extra threads of execution are
generated.
Each thread will execute some loop iterations.
When the loop is complete, the extra threads disappear.



OpenMP Ideas: Shared and Private Data

In the ideal case, each iteration of the loop uses data in a way that
doesn’t depend on other iterations. Loosely, this is the meaning of
the term shared data.
For instance, in the SAXPY task, each iteration is
y(i) = s * x(i) + y(i)
You can imagine that any number of processors could cooperate in
a calculation like this, with no interference.
We will start by assuming that all data can be shared, and wait for
reality to correct us!
This is OpenMP’s default assumption, as well.



SAXPY: Sequential Version

#include <stdlib.h>
#include <stdio.h>

int main ( int argc, char *argv[] )
{
  int i, n = 1000;
  double x[1000], y[1000], s;

  s = 123.456;

  for ( i = 0; i < n; i++ )
  {
    x[i] = ( double ) rand ( ) / ( double ) RAND_MAX;
    y[i] = ( double ) rand ( ) / ( double ) RAND_MAX;
  }

  for ( i = 0; i < n; i++ )
  {
    y[i] = y[i] + s * x[i];
  }
  return 0;
}



The SAXPY task

The SAXPY task adds a multiple of vector X to vector Y.

The arrays X and Y can be shared, because only the thread associated with loop index I needs to look at the I-th entries.
Each thread will need to know the value of S, but they can all agree on what that value is. (They “share” the same value.)
This is a “perfect” parallel application: no private data, no memory contention.
SAXPY is a common low-level task, as in Gaussian elimination.



SAXPY: with OpenMP Directives

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main ( int argc, char *argv[] )
{
  int i, n = 1000;
  double x[1000], y[1000], s;

  s = 123.456;

  for ( i = 0; i < n; i++ )
  {
    x[i] = ( double ) rand ( ) / ( double ) RAND_MAX;
    y[i] = ( double ) rand ( ) / ( double ) RAND_MAX;
  }

#pragma omp parallel for
  for ( i = 0; i < n; i++ )
  {
    y[i] = y[i] + s * x[i];
  }
  return 0;
}



OpenMP C Syntax

The #pragma omp string is a marker that indicates to the compiler that this is an OpenMP directive.
The parallel directive indicates that parallel execution is being requested.
The for directive indicates that the parallel execution will take place in a for loop.



OpenMP Fortran Syntax

The rules for Fortran are similar.

The marker string is !$omp.
The parallel directive indicates that parallel execution is being requested.
The do directive indicates that the parallel execution will take place in a do loop.
In Fortran, but not C, the end of the parallel loop is marked by a matching !$omp end parallel do directive.



SAXPY: with OpenMP Directives

program main

  integer i, n
  double precision x(1000), y(1000), s

  n = 1000
  s = 123.456

  do i = 1, n
    x(i) = rand ( )
    y(i) = rand ( )
  end do

!$omp parallel do
  do i = 1, n
    y(i) = y(i) + s * x(i)
  end do
!$omp end parallel do

  stop
end



The SAXPY task - How To Think About Threads

To visualize parallel execution, suppose 4 threads will execute the SAXPY loop.
OpenMP might assign the first 250 iterations to thread 1, and so
on.
Then you also have to imagine that the four threads each execute
their loop more or less simultaneously.
Even this simple model of what’s going on will suggest some of the
things that can go wrong in a parallel program!



The SAXPY loop, as OpenMP might think of it

if ( thread_id == 1 ) then
  do i = 1, 250
    y(i) = y(i) + s * x(i)
  end do
else if ( thread_id == 2 ) then
  do i = 251, 500
    y(i) = y(i) + s * x(i)
  end do
else if ( thread_id == 3 ) then
  do i = 501, 750
    y(i) = y(i) + s * x(i)
  end do
else if ( thread_id == 4 ) then
  do i = 751, 1000
    y(i) = y(i) + s * x(i)
  end do
end if



The SAXPY task - Comments

What about the loop that initializes X and Y? The problem here is that we’re calling the rand function.
Normally, inside a parallel loop, you can call a function and it will
also run in parallel. However, the function cannot have side effects.
The rand function is a special case; it has an internal “static” or
“saved” variable whose value is changed and remembered
internally.
Getting random numbers in a parallel loop requires care. We will
leave this topic for later discussion.
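As a preview, one common workaround (a sketch assuming the POSIX rand_r function is available) gives each thread its own seed, so no hidden state is shared between threads:

#include <stdlib.h>
#include <omp.h>

int main ( int argc, char *argv[] )
{
  int i, n = 1000;
  double x[1000];

#pragma omp parallel
  {
    /* Each thread owns a separate seed variable. */
    unsigned int seed = 123456789 + omp_get_thread_num ( );

#pragma omp for
    for ( i = 0; i < n; i++ )
    {
      x[i] = ( double ) rand_r ( &seed ) / ( double ) RAND_MAX;
    }
  }
  return 0;
}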



DOT: Sequential Version

#include <stdlib.h>
#include <stdio.h>

int main ( int argc, char *argv[] )
{
  int i, n = 1000;
  double x[1000], y[1000], xdoty;

  for ( i = 0; i < n; i++ )
  {
    x[i] = ( double ) rand ( ) / ( double ) RAND_MAX;
    y[i] = ( double ) rand ( ) / ( double ) RAND_MAX;
  }

  xdoty = 0.0;
  for ( i = 0; i < n; i++ )
  {
    xdoty = xdoty + x[i] * y[i];
  }
  printf ( "XDOTY = %e\n", xdoty );
  return 0;
}



The DOT task

The DOT task seems similar to the SAXPY task, but there’s an
important difference:
The arrays X and Y can still be shared: only the thread associated with loop index I needs to look at the I-th entries.
But the results of every computation are now added to the same, single variable, XDOTY.
The threads doing iterations I1 and I2 will both want to “read” the current copy of XDOTY, add their increment to it, and “write” the updated copy.
Uncontrolled sequences of reads and writes cause errors!



The DOT task

The dot product is one example of a reduction operation. Examples include computing:

the sum of the entries of a vector;
the product of the entries of a vector;
the maximum or minimum of a vector;
the Euclidean norm of a vector.

Reduction operations, if recognized, can be carried out in parallel.
In OpenMP, this is done with a reduction declaration, which tells the compiler what it needs to know so that the result is computed in an orderly and correct way.



DOT: with OpenMP directives

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main ( int argc, char *argv[] )
{
  int i, n = 1000;
  double x[1000], y[1000], xdoty;

  for ( i = 0; i < n; i++ )
  {
    x[i] = ( double ) rand ( ) / ( double ) RAND_MAX;
    y[i] = ( double ) rand ( ) / ( double ) RAND_MAX;
  }

  xdoty = 0.0;
#pragma omp parallel for reduction ( + : xdoty )
  for ( i = 0; i < n; i++ )
  {
    xdoty = xdoty + x[i] * y[i];
  }
  printf ( "XDOTY = %e\n", xdoty );
  return 0;
}



OpenMP Ideas: Private and Shared Data

OpenMP assumes that most of the problem data is shared; that is, there is a single copy, to which all threads have access.
If data is only “read”, it’s obviously shared.
Some data may be shared even though it is written. An example is
an array A. If entry A[I] is only written during loop iteration I,
then the array A can probably remain shared.
Examples of private data include temporary variables that are
assigned during a loop iteration.
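For instance, a scratch value like T below must be declared private, so that each thread gets its own copy (a minimal sketch; the arrays and arithmetic are illustrative):

#include <omp.h>

int main ( int argc, char *argv[] )
{
  int i, n = 1000;
  double t, x[1000], y[1000];

  for ( i = 0; i < n; i++ )
  {
    x[i] = ( double ) i;
  }

#pragma omp parallel for private ( t )
  for ( i = 0; i < n; i++ )
  {
    t = x[i] * x[i];    /* T is a temporary: each thread needs its own copy */
    y[i] = t + 1.0;     /* Y[I] is written only by iteration I, so Y stays shared */
  }
  return 0;
}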



The PRIME SUM task

The PRIME SUM task will illustrate the concept of private and
shared variables.
Our task is to compute the sum of the prime numbers from 2 to N.
A natural formulation stores the result in TOTAL, then checks
each number I from 2 to N.
To check if the number I is prime, we ask whether it can be evenly
divided by any of the numbers J from 2 to I − 1.
We can use a temporary variable PRIME to help us.



The PRIME SUM task: Sequential Version
#include <cstdlib>
#include <iostream>
using namespace std;

int main ( int argc, char *argv[] )
{
  int i, j, total;
  int n = 1000;
  bool prime;

  total = 0;
  for ( i = 2; i <= n; i++ )
  {
    prime = true;

    for ( j = 2; j < i; j++ )
    {
      if ( i % j == 0 )
      {
        prime = false;
        break;
      }
    }
    if ( prime )
    {
      total = total + i;
    }
  }
  cout << "PRIME SUM (2:" << n << ") = " << total << "\n";
  return 0;
}



The PRIME SUM Task: Eliminating Data Conflicts!

Let’s imagine we parallelize the I loop. So each thread:

works on an integer I;
initializes PRIME to be TRUE;
checks whether any J can divide I, and resets PRIME if necessary;
if PRIME ends up TRUE, adds I to TOTAL.

The variables J, PRIME and TOTAL represent possible data conflicts that we must resolve.



PRIME SUM: With OpenMP Directives
#include <cstdlib>
#include <iostream>
using namespace std;

int main ( int argc, char *argv[] )
{
  int i, j, n = 1000, total = 0;
  bool prime;

#pragma omp parallel for private ( i, prime, j ) shared ( n ) \
  reduction ( + : total )
  for ( i = 2; i <= n; i++ )
  {
    prime = true;

    for ( j = 2; j < i; j++ )
    {
      if ( i % j == 0 )
      {
        prime = false;
        break;
      }
    }
    if ( prime )
    {
      total = total + i;
    }
  }
  cout << "PRIME SUM (2:" << n << ") = " << total << "\n";
  return 0;
}



STEPS

For the STEPS task, we suppose we want to evaluate a function at equally spaced points in a rectangle.
A common way starts (X,Y) at (0,0) and then increments X by DX. If we exceed the maximum value of X, we reset X to 0 and increment Y by DY.
In sequential computing, this is a natural way to “visit” every
point.
This simple idea won’t work in parallel without some changes.
Each thread needs a separate copy of (X,Y), but the value of
these variables depends on having all the previous iterations.
This is a simple example of a common problem, data
dependence. Some cases are curable, some not.



STEPS: Sequential Version

program main

  integer i, j, m, n
  real dx, dy, total, x, y

! (Assume m, n, dx and dy have been set, and a function f(x,y) is defined.)

  total = 0.0
  y = 0.0
  do j = 1, n
    x = 0.0
    do i = 1, m
      total = total + f ( x, y )
      x = x + dx
    end do
    y = y + dy
  end do

  stop
end



STEPS

To make the loop iterations independent, we could either:

precompute X(1:M) and Y(1:N) in arrays, or
notice that X = I/M and Y = J/N.

The second solution is simple, and saves us a separate preparation loop and extra storage.
The first solution, converting some temporary scalar variables to vectors and precomputing them, may be able to help you parallelize a stubborn loop (a sketch of this idea appears below).
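Here is a sketch of that first solution in C (the arrays xv and yv, the problem sizes, and the function f are illustrative assumptions):

#include <omp.h>

#define M 10
#define N 10

double f ( double x, double y )    /* an illustrative integrand */
{
  return x * x + y * y;
}

int main ( int argc, char *argv[] )
{
  int i, j;
  double total = 0.0, xv[M], yv[N];

/*  Precompute the coordinates, so every (I,J) iteration is independent. */
  for ( i = 0; i < M; i++ ) xv[i] = ( double ) ( i + 1 ) / ( double ) M;
  for ( j = 0; j < N; j++ ) yv[j] = ( double ) ( j + 1 ) / ( double ) N;

#pragma omp parallel for private ( i ) reduction ( + : total )
  for ( j = 0; j < N; j++ )
  {
    for ( i = 0; i < M; i++ )
    {
      total = total + f ( xv[i], yv[j] );
    }
  }
  return 0;
}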



STEPS: With OpenMP directives

program main

  integer i, j, m, n
  real total, x, y

! (Assume m and n have been set, and a function f(x,y) is defined.)

  total = 0.0
!$omp parallel do private ( i, j, x, y ) shared ( m, n ) reduction ( + : total )
  do j = 1, n
    y = j / real ( n )
    do i = 1, m
      x = i / real ( m )
      total = total + f ( x, y )
    end do
  end do
!$omp end parallel do

  stop
end



STEPS

Another issue pops up in the STEPS program. What happens when you call the function f(x,y) inside the loop?
Notice that f is not a variable, it’s a function, so it is not declared
private or shared.
The function might have internal variables, loops, might call other
functions, and so on.
OpenMP works in such a way that a function called within a
parallel loop will also participate in the parallel execution. We
don’t have to make any declarations about the function or its
internal variables at all.
Each thread runs a separate copy of f.
(But if f includes static or saved variables, trouble!)
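Here is a sketch (the function is hypothetical) of the kind of trouble a static variable causes:

double f ( double x, double y )
{
  static int calls = 0;   /* ONE copy, shared by ALL the threads! */

  calls = calls + 1;      /* unsynchronized updates: a race condition */

  return x * x + y * y;
}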



SUMMARY

I hope you have started to become familiar with some of the basic
concepts of OpenMP.
We will come back to OpenMP in more detail for:

a few more useful features of the language;
some obstacles to parallelization;
how to get or set the number of threads;
how to measure the improvement in performance (a small preview appears below).
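As a small preview of those last two items, a sketch using two standard runtime functions:

#include <stdio.h>
#include <omp.h>

int main ( int argc, char *argv[] )
{
  double wtime;

  omp_set_num_threads ( 4 );        /* request 4 threads */

  wtime = omp_get_wtime ( );        /* start the wall-clock timer */

  /* ... the parallel work would go here ... */

  wtime = omp_get_wtime ( ) - wtime;
  printf ( "Work took %f seconds.\n", wtime );
  return 0;
}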

