
NSCET

E-LEARNING
PRESENTATION
LISTEN … LEARN… LEAD…
COMPUTER SCIENCE AND ENGINEERING

IV YEAR / VIII SEMESTER

CS6801 – MULTICORE ARCHITECTURES AND PROGRAMMING

P. MAHALAKSHMI, M.E., MISTE
ASSISTANT PROFESSOR
Nadar Saraswathi College of Engineering & Technology,
Vadapudupatti, Annanji (po), Theni – 625531.
UNIT III
SHARED MEMORY
PROGRAMMING WITH OpenMP
Introduction
OpenMP - Open Multi-Processing
 OpenMP is an API for shared-memory parallel programming. The “MP” in OpenMP
stands for “multiprocessing,” a term that is synonymous with shared-memory parallel
computing.
 OpenMP is designed for systems in which each thread or process can potentially have
access to all available memory. When programming with OpenMP, the system is viewed as a
collection of cores or CPUs, all of which have access to main memory.
 OpenMP is a set of compiler directives as well as an API for programs written in C, C++,
or FORTRAN that provides support for parallel programming in shared-memory
environments.
 OpenMP is available on several open-source and commercial compilers for Linux,
Windows, and Mac OS X systems.



Steps To Create A Parallel Program
1. Include the header file
Include the OpenMP header for our program along with the standard header files.
#include <omp.h>
2. Specify the parallel region
 OpenMP identifies parallel regions as blocks of code that may run in parallel, marked with the
directive #pragma omp parallel.
 The #pragma omp parallel directive is used to fork additional threads to carry out the work
enclosed in the parallel region.
 The original thread is denoted as the master thread, with thread ID 0.
Example
#pragma omp parallel
{
printf("Hello World... from thread");
}



3. Set the number of threads
 Set the number of threads to execute the program using the OMP_NUM_THREADS environment
variable.
export OMP_NUM_THREADS=5
 Once the parallel region is encountered at run time, the master thread (the thread with
thread ID 0) forks into the specified number of threads (here, 5 threads).
 The parallel region is executed by all threads concurrently.
 Once the parallel region has ended, all threads merge back into the master thread.
 The order of statement execution in the parallel region will not be the same for every run.
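As a side note (a sketch, not part of the original three steps), the thread count can also be requested from inside the program instead of through the environment variable, using the omp_set_num_threads() routine or the num_threads clause:
#include <omp.h>
int main(void)
{
    omp_set_num_threads(5);             // library routine, called before the parallel region
    #pragma omp parallel num_threads(5) // or the num_threads clause on the directive itself
    {
        /* parallel work */
    }
}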



OpenMP program to print Hello World using C language
// OpenMP header
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char* argv[])
{
// Beginning of parallel region
#pragma omp parallel
{
printf("Hello World... from thread = %d\n",
omp_get_thread_num());
}
// Ending of parallel region
}





Topic
OpenMP Execution Model
Introduction
 OpenMP uses the fork-join model of parallel execution. When a thread encounters a
parallel construct, the thread creates a team composed of itself and some additional
(possibly zero) helper threads.
 The encountering thread becomes the master of the new team. All team members
execute the code in the parallel region.
 When a thread finishes its work within the parallel region, it waits at an implicit barrier
at the end of the parallel region. When all team members have arrived at the barrier, the
threads can leave the barrier.
 The master thread continues execution of user code in the program beyond the end of
the parallel construct, while the helper threads wait to be summoned to join other
teams.
 OpenMP parallel regions can be nested inside each other. If nested parallelism is
disabled, then the team executing a nested parallel region consists of one thread only
(the thread that encountered the nested parallel construct).
 If nested parallelism is enabled, then the new team may consist of more than one thread.



Fork Join Model
OpenMP uses the fork-join model of parallel execution.

 All OpenMP programs begin as a single process (the master thread) until a parallel region is
encountered.
 FORK – the master thread creates a team of parallel threads.
 JOIN – when the team threads complete, they synchronize and terminate, leaving only the master
thread.
 The number of parallel regions and the threads that comprise them are arbitrary.



 The OpenMP runtime library maintains a pool of helper threads that can be used to work on
parallel regions. When a thread encounters a parallel construct and requests a team of more
than one thread, it checks the pool and grabs idle threads from the pool, making them part of
the team. The encountering thread might get fewer helper threads than it requests if the pool
does not contain a sufficient number of idle threads. When the team finishes executing the
parallel region, the helper threads are returned to the pool.



OpenMP Programming Model
 The OpenMP programming model is SMP (symmetric multi-processing, or shared-memory
processing): when programming with OpenMP, all threads share memory and data.
i) Shared Memory Model
 Designed for multi-processor, shared-memory machines.
ii) Thread Based Parallelism
 OpenMP accomplishes parallelism through the use of threads.
 A thread is the smallest unit of processing that can be scheduled by an OS.



Topic
Memory Model
OpenMP memory model
OpenMP supports a relaxed-consistency shared memory model.
 Threads can maintain a temporary view of shared memory which is not consistent
with that of other threads.
 These temporary views are made consistent only at certain points in the program.
 The operation which enforces consistency is called the flush operation.
 OpenMP assumes that there is a place for storing and retrieving data that is available
to all threads, called the memory.
 Each thread may have a temporary view of memory that it can use instead of memory
to store data temporarily when the data need not be seen by other threads.
 Data can move between memory and a thread's temporary view, but can never move
between temporary views directly, without going through memory.



 Each variable used within a parallel region is either shared or private. The variable
names used within a parallel construct relate to the program variables visible at the
point of the parallel directive, referred to as their "original variables".
 Each shared variable reference inside the construct refers to the original variable of the
same name.
 For each private variable, a reference to the variable name inside the construct refers to
a variable of the same type and size as the original variable, but private to the thread.
That is, it is not accessible by other threads.
Aspects Of Memory System Behavior
There are two aspects of memory system behavior relating to shared memory parallel
programs:
Coherence - Coherence refers to the behavior of the memory system when a single
memory location is accessed by multiple threads.
Consistency - Consistency refers to the ordering of accesses to different memory locations,
observable from various threads in the system.



 OpenMP doesn't specify any coherence behavior of the memory system. That is left to
the underlying base language and computer system.
 OpenMP does not guarantee anything about the result of memory operations that
constitute data races within a program.
 A data race in this context is defined to be accesses to a single variable by at least two
threads, at least one of which is a write, not separated by a synchronization operation.
 OpenMP does guarantee certain consistency behavior, however. That behavior is based
on the OpenMP flush operation.
Flush Operation
 The OpenMP flush operation is applied to a set of variables called the flush set.
 Memory operations for variables in the flush set that precede the flush in program
execution order must be firmly lodged in memory and available to all threads before
the flush completes, and memory operations for variables in the flush set, that follow a
flush in program order cannot start until the flush completes.



 A flush also causes any values of the flush set variables that were captured in the
temporary view, to be discarded, so that later reads for those variables will come
directly from memory.
 A flush without a list of variable names flushes all variables visible at that point in the
program. A flush with a list flushes only the variables in the list.
 The OpenMP flush operation is the only way in an OpenMP program to guarantee that
a value will move between two threads.
 In order to move a value from one thread to a second thread, OpenMP requires these
four actions in exactly the following order (a small sketch follows the list):
1. The first thread writes the value to the shared variable.
2. The first thread flushes the variable.
3. The second thread flushes the variable.
4. The second thread reads the variable.
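The fragment below is a minimal sketch of that four-step sequence (it is not taken from the slides; the flag variable and the two-thread team are assumptions added only for illustration):
#include <stdio.h>
#include <omp.h>
int main(void)
{
    int data = 0, flag = 0;
    #pragma omp parallel num_threads(2) shared(data, flag)
    {
        if (omp_get_thread_num() == 0) {
            data = 42;                     // 1. first thread writes the shared variable
            #pragma omp flush(data)        // 2. first thread flushes it
            flag = 1;
            #pragma omp flush(flag)        // make the signal visible
        } else {
            int ready = 0;
            while (!ready) {               // spin until the producer signals
                #pragma omp flush(flag)
                ready = flag;
            }
            #pragma omp flush(data)        // 3. second thread flushes the variable
            printf("received %d\n", data); // 4. second thread reads it
        }
    }
}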



Fig. 1: A write to shared variable A may complete as soon as point 1, and as late as point 2.
 The flush operation and the temporary view allow OpenMP implementations to
optimize reads and writes of shared variables.
Example
 Consider the program fragment in Figure 1. The write to variable A may complete as soon
as point 1.
 However, the OpenMP implementation is allowed to execute the computation denoted
as “…” in the figure, before the write to A completes.
 The write need not complete until point 2, when it must be firmly lodged in memory
and available to all other threads.



 If an OpenMP implementation uses a temporary view, then a read of A during the “…”
computation in Figure 1 can be satisfied from the temporary view, instead of going all the
way to memory for the value.
 So, flush and the temporary view together allow an implementation to hide both write
and read latency.
 A flush of all visible variables is implied
1) In a barrier region
2) At entry and exit from parallel, critical and ordered regions
3) At entry and exit from combined parallel work-sharing regions, and
4) During lock API routines.
 A flush with a list is implied at entry to and exit from atomic regions, where the list
contains the object being updated.
 The C and C++ languages include the volatile qualifier, which provides a consistency
mechanism for C and C++ that is related to the OpenMP consistency mechanism.



OpenMP pros and Cons
Pros
 Simple
 Incremental Parallelism.
 Decomposition is handled automatically.
 Unified code for both serial and parallel applications.
Cons
 Runs only on shared-memory multiprocessors.
 Scalability is limited by memory architecture.
 Reliable error handling is missing.



Topic
OpenMP Directives
Introduction
 OpenMP directives exploit shared memory parallelism by defining various types of parallel
regions. Parallel regions can include both iterative and non-iterative segments of program
code.
 OpenMP provides what’s known as a “directives-based” shared-memory API
Pragmas
 Special preprocessor instructions (#pragma).
 Compilers that don’t support the pragmas ignore them.
 Each directive starts with #pragma omp.
 The remainder of the directive follows the conventions of the C and C++ standards for
compiler directives. A structured block is a single statement or a compound statement with a
single entry at the top and a single exit at the bottom.



Pragmas fall into these general categories
1. Pragmas that let you define parallel regions in which work is done by threads in
parallel. Most of the OpenMP directives either statically or dynamically bind to an
enclosing parallel region.
#pragma omp parallel - Defines a parallel region, which is code that will be executed by
multiple threads in parallel.
2. Pragmas that let you define how work is distributed or shared across the threads in a
parallel region
#pragma omp section - Identifies code sections to be divided among all threads.
#pragma omp for - Causes the work done in a for loop inside a parallel region to be
divided among threads.
#pragma omp single - Lets you specify that a section of code should be executed on a
single thread.
#pragma omp task - Defines an explicit task whose code may be executed by any thread in the team.
3. Pragmas that let you control synchronization among threads
#pragma omp atomic - Specifies that a memory location will be updated atomically.
#pragma omp master - Specifies that only the master thread should execute a section of the
program.
#pragma omp barrier - Synchronizes all threads in a team; all threads pause at the barrier
until every thread has reached it.
#pragma omp critical - Specifies that code is only executed on one thread at a time.
#pragma omp flush - Specifies that all threads have the same view of memory for all shared
objects.
#pragma omp ordered - Specifies that code under a parallelized for loop should be executed
like a sequential loop.
4. Pragmas that let you define the scope of data visibility across threads
#pragma omp threadprivate - Specifies that a variable is private to a thread.
5. Pragmas for task synchronization
#pragma omp taskwait - Waits for the completion of the child tasks generated by the current task.
#pragma omp barrier - Synchronizes all threads in the team.
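As a small illustration (a sketch, not part of the original list) of the synchronization pragmas in category 3, the fragment below updates a shared counter atomically, waits at a barrier, and prints from a critical section:
#include <stdio.h>
#include <omp.h>
int main(void)
{
    int sum = 0;
    #pragma omp parallel
    {
        #pragma omp atomic           // atomic update of a single memory location
        sum += 1;
        #pragma omp barrier          // every thread waits here
        #pragma omp critical         // one thread at a time in this block
        { printf("thread %d sees sum = %d\n", omp_get_thread_num(), sum); }
    }
    return 0;
}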



OpenMP directive syntax
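The general form of a directive in C/C++ (a brief restatement of the convention described earlier) is:
#pragma omp directive-name [clause[ [,] clause] ...] new-line
structured-block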

parallel
Defines a parallel region, which is code that will be executed by multiple threads in parallel.
Syntax
#pragma omp parallel [clauses]
{
code_block
}



Remarks
The parallel directive supports the following OpenMP clauses
 copyin
 default
 firstprivate
 if
 num_threads
 private
 reduction
 shared
parallel can also be used with the sections and for directives.
Example
The following sample shows how to set the number of threads and define a parallel region.
By default, the number of threads is equal to the number of logical processors on the
machine.
For example, if you have a machine with one physical processor that has hyperthreading
enabled, it will have two logical processors and, therefore, two threads.



//omp_hello.c
#include <stdio.h>
#include <omp.h>
int main()
{
    #pragma omp parallel num_threads(4)
    {
        int i = omp_get_thread_num();
        int j = omp_get_num_threads();
        printf("Hello from thread %d of %d\n", i, j);
    }
}

Compiling and running the parallel program
To compile:
$ gcc -g -Wall -fopenmp -o omp_hello omp_hello.c
To run the program (the num_threads(4) clause requests four threads):
$ ./omp_hello

Output
Hello from thread 0 of 4
Hello from thread 1 of 4
Hello from thread 2 of 4
Hello from thread 3 of 4



Syntax for other directives
a) for
Specifies that the iterations of associated loops will be executed in parallel by threads in the
team in the context of their implicit tasks.
#pragma omp [parallel] for [clauses]
clause
private(list), firstprivate(list), lastprivate(list), reduction(reduction-identifier : list),
schedule( [modifier [, modifier] : ] kind[, chunk_size]), collapse(n), ordered[ (n) ], nowait
b) sections
A noniterative work-sharing construct that contains a set of structured blocks that are to be
distributed among and executed by the threads in a team.
#pragma omp [parallel] sections [clauses]
{
#pragma omp section
{
code_block
}
}
clause:
private(list), firstprivate(list), lastprivate(list), reduction(reduction-identifier: list), nowait
c) single
Specifies that the associated structured block is executed by only one of the threads in the
team.
#pragma omp single [clauses]
structured-block
clause
private(list), firstprivate(list), copyprivate(list), nowait
d) master
Specifies a structured block that is executed by the master thread of the team.
#pragma omp master
structured-block
e) critical
Restricts execution of the associated structured block to a single thread at a time.
#pragma omp critical [(name)]
structured-block
f) flush
Executes the OpenMP flush operation, which makes a thread’s temporary view of memory
consistent with memory, and enforces an order on the memory operations of the variables.
#pragma omp flush [(list)]
g) barrier
Specifies an explicit barrier at the point at which the construct appears.
#pragma omp barrier
h) atomic
Ensures that a specific storage location is accessed atomically.
#pragma omp atomic
expression
i) parallel for
Shortcut for specifying a parallel construct containing one or more associated loops and no
other statements.
#pragma omp parallel for [clauses]
for-loop
clause
Any accepted by the parallel or for directives, except the nowait clause, with identical meanings and
restrictions
j) parallel sections
Shortcut for specifying a parallel construct containing one sections construct and no other
statements.
#pragma omp parallel sections [clauses]
{
[#pragma omp section]
structured-block
[#pragma omp section
structured-block]
...
}
clause
Any accepted by the parallel or sections directives, except the nowait clause, with identical
meanings and restrictions.



Topic
Work-sharing Constructs
Introduction
 A work-sharing construct distributes the execution of the associated statement among the
members of the team that encounter it.
 The work-sharing directives do not launch new threads, and there is no implied barrier on entry to
a work-sharing construct.
 The sequence of work-sharing constructs and barrier directives encountered must be the same for
every thread in a team.
 OpenMP defines the following work-sharing constructs, and these are described in the sections
that follow: the for directive, the sections directive, and the single directive.



1. Parallel for loop
for directive
 The for directive identifies an iterative work-sharing construct that specifies that the
iterations of the associated loop will be executed in parallel.
 The iterations of the for loop are distributed across threads that already exist in the
team executing the parallel construct to which it binds.



Example
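A minimal sketch of the for directive (the array a, its length n, and the scale factor are assumptions made for illustration):
#include <omp.h>
void scale(double *a, int n, double factor)
{
    #pragma omp parallel
    {
        // the iterations are divided among the threads of the enclosing team
        #pragma omp for
        for (int i = 0; i < n; i++)
            a[i] = a[i] * factor;
    }
}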



2. Parallel sections
sections directive
 The sections directive identifies a non iterative work-sharing construct that specifies a set of constructs
that are to be divided among threads in a team.
 Each section is executed once by a thread in the team.
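A small sketch of the sections construct (the function names funcA and funcB are placeholders): the two blocks are independent, and each is executed exactly once by some thread of the team.
#include <stdio.h>
#include <omp.h>
void funcA(void) { printf("A on thread %d\n", omp_get_thread_num()); }
void funcB(void) { printf("B on thread %d\n", omp_get_thread_num()); }
int main(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        funcA();                 // executed once by one thread
        #pragma omp section
        funcB();                 // executed once, possibly by another thread
    }
}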



3. Parallel single
single directive
 The single directive identifies a construct that specifies that the associated structured block is
executed by only one thread in the team (not necessarily the master thread).
Syntax
#pragma omp single [clause,clause , ...]
structured-block
Arguments
clause - Can be one or more of the following clauses
 copyprivate(list) - Provides a mechanism to use a private variable in list to broadcast a
value from the data environment of one implicit task to the data environments of the other
implicit tasks belonging to the parallel region.
 firstprivate(list) - Provides a superset of the functionality provided by the private clause.
Each private data object is initialized with the value of the original object.
 nowait - Indicates that an implementation may omit the barrier at the end of the
worksharing region.
 private(list) - Declares variables to be private to each thread in a team.



Description
 Only one thread will be allowed to execute the structured block. The rest of the threads
in the team wait at the implicit barrier at the end of the single construct, unless nowait
is specified.
 If nowait is specified, then the rest of the threads in the team immediately execute the
code after the structured block.
 The following example demonstrates how to use this pragma to make sure that the
printf function is executed only once.
 All the threads in the team that do not execute the function proceed immediately to the
following calculations.
Example
#include <stdio.h>
#include <omp.h>
int main(void) {
    #pragma omp parallel
    {
        #pragma omp single nowait
        { printf("Starting calculation\n"); }
        // Do some calculation
    }
}



Topic

Library functions
Introduction
 OpenMP provides several run-time library routines to manage programs running in parallel mode.
 Many of these run-time library routines have corresponding environment variables that can be set
as defaults.
 The run-time library routines let you dynamically change these factors to assist in controlling your
program. In all cases, a call to a run-time library routine overrides any corresponding environment
variable.
Execution Environment Routines
Function Description
omp_set_num_threads(nthreads) Sets the number of threads to use for subsequent parallel
regions.
omp_get_num_threads( ) Returns the number of threads that are being used in the
current parallel region.
omp_get_max_threads( ) Returns the maximum number of threads that are available
for parallel execution.



Function Description
omp_get_thread_num( ) Returns the unique thread number of the thread currently executing
this section of code.
omp_get_num_procs( ) Returns the number of processors available to the program.
omp_in_parallel( ) Returns TRUE if called within the dynamic extent of a parallel region
executing in parallel; otherwise returns FALSE.
omp_set_dynamic(dynamic_threads) Enables or disables dynamic adjustment of the number of
threads used to execute a parallel region. If dynamic_threads is TRUE,
dynamic threads are enabled. If dynamic_threads is FALSE, dynamic
threads are disabled. Dynamic threads are disabled by default.
omp_get_dynamic( ) Returns TRUE if dynamic thread adjustment is enabled, otherwise
returns FALSE.
omp_set_nested(nested) Enables or disables nested parallelism. If nested is TRUE, nested
parallelism is enabled. If nested is FALSE, nested parallelism is
disabled. Nested parallelism is disabled by default.



Lock Routines
Function Description
omp_init_lock(lock) Initializes the lock associated with lock for use in subsequent calls.
omp_destroy_lock(lock) Causes the lock associated with lock to become undefined.
omp_set_lock(lock) Forces the executing thread to wait until the lock associated with lock
is available. The thread is granted ownership of the lock when it
becomes available.
omp_unset_lock(lock) Releases the executing thread from ownership of the lock associated
with lock. The behavior is undefined if the executing thread does not
own the lock associated with lock.
omp_test_lock(lock) Attempts to set the lock associated with lock. If successful, returns
TRUE, otherwise returns FALSE.
omp_init_nest_lock(lock) Initializes the nested lock associated with lock for use in the
subsequent calls.



Function Description
omp_destroy_nest_lock(lock) Causes the nested lock associated with lock to become undefined.
omp_set_nest_lock(lock) Forces the executing thread to wait until the nested lock associated
with lock is available. The thread is granted ownership of the nested
lock when it becomes available.
omp_unset_nest_lock(lock) Releases the executing thread from ownership of the nested lock
associated with lock if the nesting count is zero. Behavior is undefined if
the executing thread does not own the nested lock associated with lock.
omp_test_nest_lock(lock) Attempts to set the nested lock associated with lock. If successful,
returns the nesting count, otherwise returns zero.

Timing Routines
omp_get_wtime( ) Returns a double-precision value equal to the elapsed wall clock time
(in seconds) relative to an arbitrary reference time. The reference time
does not change during program execution.
omp_get_wtick( ) Returns a double-precision value equal to the number of seconds
between successive clock ticks.
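A brief sketch (not from the slides) that exercises a few of these routines: a simple lock protects a shared counter, and omp_get_wtime() times the parallel region.
#include <stdio.h>
#include <omp.h>
int main(void)
{
    omp_lock_t lock;
    omp_init_lock(&lock);
    int count = 0;
    double start = omp_get_wtime();
    #pragma omp parallel
    {
        omp_set_lock(&lock);        // wait until the lock is available
        count += 1;                 // serialized update of the shared counter
        omp_unset_lock(&lock);      // release ownership
    }
    double elapsed = omp_get_wtime() - start;
    printf("max threads: %d, count = %d, time = %f s\n",
           omp_get_max_threads(), count, elapsed);
    omp_destroy_lock(&lock);
}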



Topic
Handling Data and Functional
Parallelism
Handling Data and Functional Parallelism
Data Parallelism
 Perform the same operation on different items of data at the same time
 Parallelism grows with the size of the data.
Example
Adding two long arrays of doubles to produce another array of doubles.
Task Parallelism
 Perform distinct computations -- or tasks -- at the same time.
 If the number of tasks is fixed, the parallelism is not scalable
 A collection of tasks that need to be completed.
Example
Reading the newspaper



OpenMP Data Parallel Construct: Parallel Loop
 All pragmas begin: #pragma
 Compiler calculates loop bounds for each thread directly from serial source
(computation decomposition)
 Compiler also manages data partitioning
 Synchronization also automatic (barrier)
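A minimal sketch of such a data-parallel loop, the array-addition example mentioned earlier (the array size N is an assumption):
#include <omp.h>
#define N 1000000
double a[N], b[N], c[N];
int main(void)
{
    // the compiler/runtime decompose the iteration space across the threads
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];      // same operation applied to different data items
}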



Function parallelism(Task or Control Parallelism)
 The simplest way to create parallelism in OpenMP is to use the parallel pragma.
 A block preceded by the omp parallel pragma is called a parallel region; it is executed by a newly
created team of threads. This is the SPMD model: all threads execute the same segment of code.
#pragma omp parallel
{
    // this is executed by a team of threads
}
For instance, if the program computes
result = f(x) + g(x) + h(x)
you could parallelize this as
double result, fresult, gresult, hresult;
#pragma omp parallel
{
    int num = omp_get_thread_num();
    if (num==0) fresult = f(x);
    else if (num==1) gresult = g(x);
    else if (num==2) hresult = h(x);
}
result = fresult + gresult + hresult;



Nested parallelism
 Consider if you call a function from inside a parallel region, and that function itself contains a parallel region.
int main() {
    ...
    #pragma omp parallel
    {
        ...
        func(...);
        ...
    }
} // end of main

void func(...) {
    #pragma omp parallel
    {
        ...
    }
}
 By default, the nested parallel region will have only one thread. To allow nested thread creation, set
OMP_NESTED=true
or
omp_set_nested(1)



Topic
Handling Loops
Introduction
 Loop parallelism is a very common type of parallelism in scientific codes, so OpenMP
has an easy mechanism for it.
 OpenMP parallel loops are a first example of OpenMP ‘worksharing’ constructs: constructs
that take an amount of work and distribute it over the available threads in a parallel region.
 The parallel execution of a loop can be handled a number of different ways. For
instance, create a parallel region around the loop, and adjust the loop bounds.
#pragma omp parallel
{
int threadnum = omp_get_thread_num(),
numthreads = omp_get_num_threads();
int low = N*threadnum/numthreads,
high = N*(threadnum+1)/numthreads;
for (i=low; i<high; i++)
// do something with i
}



A more natural option is to use the parallel for pragma:
#pragma omp parallel
#pragma omp for
for (i=0; i<N; i++)
{
// do something with i
}
 This has several advantages. For one, you don't have to calculate the loop bounds for the threads
yourself, but you can also tell OpenMP to assign the loop iterations according to different
schedules.
The execution on four threads is as follows:
#pragma omp parallel
{
code1( );
#pragma omp for
for (i=1; i<=4*N; i++) {
code2( );
}
code3( );
}
 The code before and after the loop is executed identically in each thread; the loop iterations are
spread over the four threads.



 Note that the parallel do and parallel for pragmas do not create a team of threads: they take the
team of threads that is active, and divide the loop iterations over them.
 This means that the omp for or omp do directive needs to be inside a parallel region. It is also
possible to have a combined omp parallel for or omp parallel do directive.
 If your parallel region only contains a loop, you can combine the pragmas for the parallel region
and distribution of the loop iterations:
#pragma omp parallel for
for (i=0; .....
Loop schedules
 Usually there are many more iterations in a loop than there are threads. Thus, there are
several ways you can assign your loop iterations to the threads.
#pragma omp for schedule(....)
 The first distinction we now have to make is between static and dynamic schedules.
 With static schedules, the iterations are assigned purely based on the number of
iterations and the number of threads (and the chunk parameter; see later).
 In dynamic schedules, on the other hand, iterations are assigned to threads that are
unoccupied.
 Dynamic schedules are a good idea if iterations take an unpredictable amount of time,
so that load balancing is needed.



Figure 2 illustrates this: assume that each core gets assigned two (blocks of) iterations and
these blocks take gradually less and less time. You see from the left picture that thread 1 gets
two fairly long blocks, whereas thread 4 gets two short blocks, thus finishing much earlier. On
the other hand, in the right picture thread 4 gets block 5, since it finishes its first set of blocks
early. The effect is a perfect load balancing.
i) Static
 The schedule(static, chunk-size) clause of the loop construct specifies that the for loop has
the static scheduling type. OpenMP divides the iterations into chunks of size chunk-size and
distributes the chunks to threads in a circular order.
 When no chunk-size is specified, OpenMP divides iterations into chunks that are
approximately equal in size and distributes at most one chunk to each thread.
 The static scheduling type is appropriate when all iterations have the same computational
cost.



Example
// default (no schedule clause)
a( );
#pragma omp parallel for
for (int i = 0; i < 16; ++i)
{
    w(i);
}
z( );

// with schedule(static,2): chunks of two iterations, distributed round-robin
a( );
#pragma omp parallel for schedule(static,2)
for (int i = 0; i < 16; ++i)
{
    w(i);
}
z( );



 Default scheduling and static scheduling are very efficient: there is no need for any
communication between the threads.
 When the loop starts, each thread will immediately know which iterations of the loop it
will execute.
 The only synchronization is at the end of the entire loop, where we wait for all
threads to finish.
ii) Dynamic
 The schedule(dynamic, chunk-size) clause of the loop construct specifies that the for
loop has the dynamic scheduling type.
 OpenMP divides the iterations into chunks of size chunk-size. Each thread executes a
chunk of iterations and then requests another chunk until there are no more chunks
available.
 There is no particular order in which the chunks are distributed to the threads. The
order changes each time when we execute the for loop.



Example
// default (no schedule clause)
a( );
#pragma omp parallel for
for (int i = 0; i < 16; ++i)
{
    v(i);
}
z( );

// with schedule(dynamic,1): each thread grabs one iteration at a time
a( );
#pragma omp parallel for schedule(dynamic,1)
for (int i = 0; i < 16; ++i)
{
    v(i);
}
z( );

Dynamic scheduling is expensive: there is some communication between the threads after each iteration
of the loop.
iii) Guided
 The guided scheduling type is similar to the dynamic scheduling type. OpenMP again
divides the iterations into chunks.
 Each thread executes a chunk of iterations and then requests another chunk until there
are no more chunks available.
 The difference with the dynamic scheduling type is in the size of chunks.
 The size of a chunk is proportional to the number of unassigned iterations divided by
the number of the threads. Therefore the size of the chunks decreases.

iv) Auto
 The auto scheduling type delegates the decision of the scheduling to the compiler
and/or runtime system.
v) Runtime
 The runtime scheduling type defers the decision about the scheduling until the runtime.
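A short sketch of the runtime scheduling type (the loop body work(i) is a placeholder assumed to be defined elsewhere): the schedule kind is read at run time, for example from the OMP_SCHEDULE environment variable, e.g. OMP_SCHEDULE="guided,4".
#include <omp.h>
void work(int i);                 /* assumed to be defined elsewhere */
void run(int n)
{
    /* schedule kind and chunk size are taken from OMP_SCHEDULE (or omp_set_schedule) */
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < n; i++)
        work(i);
}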



NoWait clause
The implicit barrier at the end of a work sharing construct can be cancelled with a nowait clause. This has
the effect that threads that are finished can continue with the next code in the parallel region:
#pragma omp parallel
{
    #pragma omp for nowait
    for (i=0; i<N; i++) { ... }
    // more parallel code
}
While loops
OpenMP can only handle ‘for’ loops: while loops can not be parallelized, so we have to find a way
around that. While loops are, for instance, used to search through data:
while ( a[i]!=0 && i<imax ) {
    i++;
}
// now i is the first index for which a[i] is zero
We replace the while loop by a for loop that examines all locations:
result = -1;
#pragma omp parallel for
for (i=0; i<imax; i++) {
    if (a[i]==0 && result<0) result = i;
}



Topic
Performance Considerations
General Performance Recommendations
Minimize synchronization:
 Avoid or minimize the use of synchronizations such as barrier, critical, ordered,
taskwait, and locks.
 Use the nowait clause where possible to eliminate redundant or unnecessary barriers.
 For example, there is always an implied barrier at the end of a parallel region. Adding
nowait to a work-sharing loop in the region that is not followed by any code in the
region eliminates one redundant barrier.
 Use named critical sections for fine-grained locking where appropriate so that not all
critical sections in the program will use the same, default lock.
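A small sketch of the named-critical advice above (the variable names are placeholders): the two updates use differently named critical sections, so they take different locks and do not serialize against each other.
#include <omp.h>
void record(int *hist, int bin, double *total, double value)
{
    #pragma omp critical(update_histogram)
    hist[bin] += 1;
    #pragma omp critical(update_total)
    *total += value;
}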
OMP_WAIT_POLICY
 Use the OMP_WAIT_POLICY environment variable to control the behavior of waiting
threads. By default, idle threads will be put to sleep after a certain timeout period. If a
thread does not find work by the end of the timeout period, it will go to sleep, thus
avoiding wasting processor cycles at the expense of other threads.
 The default timeout period might not be appropriate for your application, causing the
threads to go to sleep too soon or too late.
 In general, if an application has dedicated processors to run on, then an active wait
policy that would make waiting threads spin would give better performance.
 If an application runs simultaneously with other applications, then a passive wait policy
that would put waiting threads to sleep would be better for system throughput.
 Parallelize at the highest level possible, such as outermost loops. Enclose multiple loops
in one parallel region. In general, make parallel regions as large as possible to reduce
parallelization overhead.
Use master instead of single where possible.
 The master directive is implemented as an if statement with no implicit barrier:
if (omp_get_thread_num() == 0) {...}
 The single construct is implemented similarly to other work-sharing constructs.
Keeping track of which thread reaches single first adds additional runtime overhead.
Moreover, there is an implicit barrier if nowait is not specified, which is less efficient.
 Use explicit flush with care. A flush causes data to be stored to memory, and subsequent
data accesses may require reload from memory, all of which decrease efficiency.



Avoiding False sharing
 False sharing occurs when threads on different processors modify variables that reside
on the same cache line. This situation is called false sharing because the threads are not
accessing the same variable, but rather are accessing different variables that happen to
reside on the same cache line.
 If false sharing occurs frequently, interconnect traffic increases, and the performance
and scalability of an OpenMP application suffer significantly. False sharing degrades
performance when all the following conditions occur:
1) Shared data is modified by multiple threads
2) Multiple threads modify data within the same cache line
3) Data is modified very frequently (as in a tight loop)
 False sharing can typically be detected when accesses to certain variables seem
particularly expensive.
 Careful analysis of parallel loops that play a major part in the execution of an
application can reveal performance scalability problems caused by false sharing.



In general, false sharing can be reduced using the following techniques:
 Make use of private or thread private data as much as possible.
 Use the compiler’s optimization features to eliminate memory loads and stores.
 Pad data structures so that each thread's data resides on a different cache line. The size
of the padding is system-dependent, and is the size needed to push a thread's data onto
a separate cache line.
 Modify data structures so there is less sharing of data among the threads.
 Techniques for tackling false sharing are very much dependent on the particular
application.
In some cases, a change in the way the data is allocated can reduce false sharing. In other
cases, changing the mapping of iterations to threads by giving each thread more work per
chunk (by changing the chunk_size value) can also lead to a reduction in false sharing.
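A sketch of the padding technique listed above, assuming 64-byte cache lines and at most 64 threads (both are assumptions): each thread's counter is padded so that it occupies its own cache line.
#include <omp.h>
#define CACHE_LINE 64                          /* assumed cache-line size */
typedef struct {
    long value;
    char pad[CACHE_LINE - sizeof(long)];       /* pad each counter to a full line */
} padded_counter_t;

long count_up_to(long n)
{
    padded_counter_t local[64] = {{0}};        /* one padded slot per thread (max 64 assumed) */
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < n; i++)
            local[id].value += 1;              /* each thread writes only its own cache line */
    }
    long total = 0;
    for (int t = 0; t < 64; t++)
        total += local[t].value;
    return total;
}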

