• This tutorial includes material kindly provided by several OpenMP experts.
• Example: sum reduction over an array

double a[N];
double l, s = 0;
#pragma omp parallel for reduction(+:s) \
        private(l) schedule(static,4)

[Figure: the shared s = 0 is split into private copies s' = s'' = s''' = s'''' = 0; the work is distributed; at the barrier the partial sums are combined (e.g. s' += s''') and finally s = s']
• Worksharing constructs do not compose well (or at least not as well as we would like)
• Pathological example: parallel daxpy in MKL

// daxpy is either parallel or sequential,
// but has no orphaned worksharing
void example1() {
    #pragma omp parallel
    {
        compute_in_parallel_this(A);   // for, sections, ...
        compute_in_parallel_that(B);   // for, sections, ...
        cblas_daxpy(n, x, A, incx, B, incy);
    }
}

// parallelism inside this/that, and the parallel MKL version
void example2() {
    compute_in_parallel_this(A);
    compute_in_parallel_that(B);
    cblas_daxpy( <...> );
}
The Execution Model
Dr. Volker Weinberg | LRZ

• Task pool / task queue
• A task defers (or not) a unit of work, executable by any member of the team
• Several scenarios are possible:
  • single creator, multiple creators, nested tasks (tasks & worksharing)
• All threads in the team are candidates to execute tasks
• Clauses of the task construct:
  Data environment:       private(list), firstprivate(list), shared(list),
                          default(shared | none), in_reduction(r-id: list)*
  Cutoff strategies:      if(scalar-expression), final(scalar-expression), mergeable
  Dependencies:           depend(dep-type: list)
  Scheduling restriction: untied
• Typical task generation loop over a linked list:

typedef struct list {
    struct list *next;
    contents *data;
} list;
• User-directed optimization:
• avoid overhead for deferring small tasks
• cache locality / memory affinity may be lost by doing so
• The taskgroup construct (deep task synchronization, introduced with OpenMP 4.0)
  • attached to a structured block; waits for completion of all descendants of the current task; task scheduling point (TSP) at the end
#pragma omp taskgroup [clause[[,] clause]...]
{structured-block}
taskgroup: wait on the completion of the child tasks of the current task, and of all their descendants.
• Example:
  • ensure leaf-to-root traversal of a binary tree
• Default behaviour:
  • a task assigned to a thread must (eventually) be completed by that thread: the task is tied to the thread
• Change this via the untied clause:
#pragma omp task untied
{structured-block}
• On the following slides we will discuss three approaches to parallelize this recursive code
with Tasking.
[Figure: Speedup of Fibonacci with Tasks — optimal vs. omp-v1, for 1, 2, 4 and 8 threads]
• Improvement: don't create yet another task once a certain (small enough) n is reached
• Speedup is ok, but we still have some overhead when running with 4 or 8 threads
Speedup of Fibonacci with Tasks
[Figure: speedup — optimal vs. omp-v1 and omp-v2, for 1, 2, 4 and 8 threads]
• Everything ok now
Speedup of Fibonacci with Tasks
[Figure: speedup — optimal vs. omp-v1, omp-v2 and omp-v3, for 1, 2, 4 and 8 threads]
• The data attribute clauses of the task directive (shared, private, firstprivate, default) define the data environment of the task.
• The data-sharing attributes of variables that are not listed in data attribute clauses of a
task construct, and are not predetermined according to the OpenMP rules, are implicitly
determined as follows:
(a) In a task construct, if no default clause is present, a variable that is determined to be
shared in all enclosing constructs, up to and including the innermost enclosing
parallel construct, is shared.
(b) In a task construct, if no default clause is present, a variable whose data-sharing
attribute is not determined by rule (a) is firstprivate.
• It follows that:
(a) If a task construct is lexically enclosed in a parallel construct, then variables that are
shared in all scopes enclosing the task construct remain shared in the generated
task. Otherwise, variables are implicitly determined firstprivate.
(b) If a task construct is orphaned, then variables are implicitly determined
firstprivate.
int a = 1;
void foo()
{
    int b = 2, c = 3;
    #pragma omp parallel shared(b)
    #pragma omp parallel private(b)
    {
        int d = 4;
        #pragma omp task
        {
            int e = 5;
            // Scope of a:
            // Scope of b:
            // Scope of c:
            // Scope of d:
            // Scope of e:
        }
    }
}
Example: Data Scoping
int a = 1;
void foo()
{
    int b = 2, c = 3;
    #pragma omp parallel shared(b) default(none)
    #pragma omp parallel private(b) default(none)
    {
        int d = 4;
        #pragma omp task
        {
            int e = 5;
            // Scope of a: shared
            // Scope of b: firstprivate
            // Scope of c: shared
            // Scope of d: firstprivate
            // Scope of e: private
        }
    }
}

• Hint: use default(none) to be forced to think about the data-sharing attribute of every variable if you are unsure.
• Reduction operations with tasks:
  • perform some forms of recurrence calculations
  • associative and commutative operators
• The (taskgroup) scoping reduction clause:
#pragma omp taskgroup task_reduction(op: list)
{structured-block}
  • registers a new reduction at [1]
  • computes the final result after [3]
• The (task) in_reduction clause:
  • the task participates in a reduction operation [2]
#pragma omp task in_reduction(op: list)
{structured-block}
• Example:

int res = 0;
node_t* node = NULL;
...
#pragma omp parallel
{
    #pragma omp single
    {
        #pragma omp taskgroup task_reduction(+: res)
        { // [1]
            while (node) {
                #pragma omp task in_reduction(+: res) \
                        firstprivate(node)
                { // [2]
                    res += node->value;
                }
                node = node->next;
            }
        } // [3]
    }
}
OpenMP 5.0
Task Loops
• Syntax (C/C++)
#pragma omp taskloop [simd] [clause[[,] clause],…]
for-loops
• Syntax (Fortran)
!$omp taskloop [simd] [clause[[,] clause],…]
do-loops
[!$omp end taskloop [simd]]
• Loop iterations are distributed over the tasks and (if simd) the resulting loop uses SIMD
instructions.
Clauses for the taskloop Construct
• Taskloop constructs inherit clauses both from worksharing constructs and the task
construct
• shared, private
• firstprivate, lastprivate
• default
• collapse
• final, untied, mergeable
• allocate
• in_reduction / reduction
• grainsize(grain-size)
Each chunk has at least grain-size and at most 2×grain-size loop iterations
• num_tasks(num-tasks)
Creates num-tasks tasks for the iterations of the loop
Example: saxpy Kernel with OpenMP taskloop

Worksharing version (!$omp do):

x = 0
!$omp parallel shared(x) num_threads(T)
!$omp do
do i = 1,N
  !$omp atomic
  x = x + 1
  !$omp end atomic
end do
!$omp end do
!$omp end parallel
write (*,'(A,I0)') 'x = ', x
end subroutine

Taskloop version:

x = 0
!$omp parallel shared(x) num_threads(T)
!$omp single
!$omp taskloop
do i = 1,N
  !$omp atomic
  x = x + 1
  !$omp end atomic
end do
!$omp end taskloop
!$omp end single
!$omp end parallel
write (*,'(A,I0)') 'x = ', x
end subroutine
[Figure: task dependency graph for building a house, with tasks such as Build Walls, Install Electrical, Plaster Wall, Put up Roof, Tiling of the Roof, Exterior Siding and Garden]
C/C++
#pragma omp task depend(dependency-type: list)
... structured block ...
• Complex synchronization patterns:
  • splitting computational phases with taskwait or taskgroup
  • needs complex code analysis
  • may perform a bit better than regular OpenMP worksharing
• Is this the best solution we can come up with?

void cholesky(int ts, int nt, double* a[nt][nt]) {
    for (int k = 0; k < nt; k++) {
        // Diagonal block factorization
        potrf(a[k][k], ts, ts);
        // Triangular systems
        for (int i = k + 1; i < nt; i++) {
            #pragma omp task
            trsm(a[k][k], a[k][i], ts, ts);
        }
        #pragma omp taskwait
        // Update trailing matrix
        for (int i = k + 1; i < nt; i++) {
            for (int j = k + 1; j < i; j++) {
                #pragma omp task
                dgemm(a[k][i], a[k][j], a[j][i], ts, ts);
            }
            #pragma omp task
            syrk(a[k][i], a[i][i], ts, ts);
        }
        #pragma omp taskwait
    }
}
OpenMP 3.1 (taskwait-based):

void cholesky(int ts, int nt, double* a[nt][nt]) {
    for (int k = 0; k < nt; k++) {
        // Diagonal block factorization
        potrf(a[k][k], ts, ts);
        // Triangular systems
        for (int i = k + 1; i < nt; i++) {
            #pragma omp task
            trsm(a[k][k], a[k][i], ts, ts);
        }
        #pragma omp taskwait
        // Update trailing matrix
        for (int i = k + 1; i < nt; i++) {
            for (int j = k + 1; j < i; j++) {
                #pragma omp task
                dgemm(a[k][i], a[k][j], a[j][i], ts, ts);
            }
            #pragma omp task
            syrk(a[k][i], a[i][i], ts, ts);
        }
        #pragma omp taskwait
    }
}

OpenMP 4.0 (dependencies replace the taskwaits):

void cholesky(int ts, int nt, double* a[nt][nt]) {
    for (int k = 0; k < nt; k++) {
        // Diagonal block factorization
        #pragma omp task depend(inout: a[k][k])
        potrf(a[k][k], ts, ts);
        // Triangular systems
        for (int i = k + 1; i < nt; i++) {
            #pragma omp task depend(in: a[k][k]) depend(inout: a[k][i])
            trsm(a[k][k], a[k][i], ts, ts);
        }
        // Update trailing matrix
        for (int i = k + 1; i < nt; i++) {
            for (int j = k + 1; j < i; j++) {
                #pragma omp task depend(inout: a[j][i]) depend(in: a[k][i], a[k][j])
                dgemm(a[k][i], a[k][j], a[j][i], ts, ts);
            }
            #pragma omp task depend(inout: a[i][i]) depend(in: a[k][i])
            syrk(a[k][i], a[i][i], ts, ts);
        }
    }
}
Task Affinity

[Figure: two NUMA domains, each with four cores, per-core on-chip caches, an interconnect, and local memory]

Without an affinity clause, the second task may execute far from the memory in which B was initialized:

void task_affinity() {
    double* B;
    #pragma omp task shared(B)
    {
        B = init_B_and_important_computation(A);
    }
    #pragma omp task firstprivate(B)
    {
        important_computation_too(B);
    }
    #pragma omp taskwait
}

With the affinity clause (a hint, OpenMP 5.0), each task is steered towards the data it touches:

void task_affinity() {
    double* B;
    #pragma omp task shared(B) affinity(A[0:N])
    {
        B = init_B_and_important_computation(A);
    }
    #pragma omp task firstprivate(B) affinity(B[0:N])
    {
        important_computation_too(B);
    }
    #pragma omp taskwait
}
• Syntax:
  C/C++:   #pragma omp task priority(priority_value)
  Fortran: !$omp task priority(priority_value)