
Advanced OpenMP Tutorial: Tasking

Dr. Volker Weinberg (LRZ, Garching near Munich, Germany)


OpenMPCon / IWOMP 2019 | Auckland | New Zealand
Acknowledgements

• This tutorial includes material kindly provided by the following OpenMP experts:

• Reinhold Bader (LRZ)

• Georg Hager (RRZE)

• Michael Klemm (Intel Corp.)

• Christian Terboven (RWTH Aachen)

Dr. Volker Weinberg | LRZ


Topics covered in this Tutorial

• OpenMP Worksharing and Tasks


• Task Synchronization
• Task Data Scoping
• Task Reductions
• Task Loops
• Task Dependencies
• Task Affinity
• Task Priorities



OpenMP Worksharing and Tasks



OpenMP Worksharing

#pragma omp parallel              // fork
{
  #pragma omp for                 // distribute work
  for (i = 0; i < N; i++) {
    …
  }                               // barrier

  #pragma omp for                 // distribute work
  for (i = 0; i < N; i++) {
    …
  }                               // barrier
}                                 // join


OpenMP Worksharing

double a[N];
double l, s = 0;

#pragma omp parallel for reduction(+:s) \
        private(l) schedule(static,4)    // distribute work
for (i = 0; i < N; i++)
{
  l = log(a[i]);
  s += l;
}

[Diagram: s = 0 is split into per-thread partial sums s' = s'' = s''' = s'''' = 0; each thread accumulates its chunk; at the barrier the partial sums are combined pairwise and the final result is stored back into s]


Traditional Worksharing

• Worksharing constructs do not compose well (or at least: do not compose as well as we want)
• Pathological example: parallel daxpy in MKL

void example1() {
  #pragma omp parallel
  {
    compute_in_parallel_this(A);   // for, sections, …
    compute_in_parallel_that(B);   // for, sections, …
    // daxpy is either parallel or sequential,
    // but has no orphaned worksharing
    cblas_daxpy (n, x, A, incx, B, incy);
  }
}

void example2() {
  // parallel within: this/that
  compute_in_parallel_this(A);
  compute_in_parallel_that(B);
  // parallel MKL version
  cblas_daxpy ( <...> );
}

• Writing such codes either:
  • oversubscribes the system (creating more OpenMP threads than cores),
  • yields bad performance due to OpenMP overheads, or
  • needs a lot of glue code to use the sequential daxpy only for sub-arrays


Task Execution Model

• Supports unstructured parallelism:
  • unbounded loops:

    while ( <expr> ) {
      ...
    }

  • recursive functions:

    void myfunc( <args> )
    {
      ...; myfunc( <newargs> ); ...;
    }

• Example (unstructured parallelism):

#pragma omp parallel
#pragma omp master
while (elem != NULL) {
  #pragma omp task
  compute(elem);
  elem = elem->next;
}

[Diagram: the parallel team of threads executes tasks from a shared task pool]

• Several scenarios are possible:
  • single creator, multiple creators, nested tasks (tasks & worksharing)
• All threads in the team are candidates to execute tasks
The Execution Model

[Diagram: tasks are placed into a task queue, from which the threads of the team pick them up for execution]


The task Construct

• Deferring (or not) a unit of work (executable by any member of the team)

C/C++:
#pragma omp task [clause[[,] clause]...]
{structured-block}

Fortran:
!$omp task [clause[[,] clause]...]
…structured-block…
!$omp end task

Clauses (* = new in OpenMP 5.0):
• Data environment: private(list), firstprivate(list), shared(list), default(shared | none), in_reduction(r-id: list)*, allocate([allocator:] list)*
• Cutoff strategies: if(scalar-expression), mergeable, final(scalar-expression)
• Dependencies: depend(dep-type: list)
• Scheduling restriction: untied
• Scheduler hints: priority(priority-value), affinity(list)*
• Miscellaneous: detach(event-handler)*
What is a Task?

• Make OpenMP worksharing more flexible:
  • allow the programmer to package code blocks and data items for execution → this by definition is a task
  • and assign these to an encountering thread
  • possibly defer execution to a later time ("work queue")
• Introduced with OpenMP 3.0 (May 2008!)
  • some additional features added in later versions
• When a thread encounters a task construct, a task is generated from the code of the associated structured block.
• The data environment of the task is created (according to the data-sharing attributes, defaults, …) → "packaging of data"
• The encountering thread may immediately execute the task, or defer its execution. In the latter case, any thread in the team may be assigned the task.
Example: Processing a Linked List

typedef struct {
  list *next;
  contents *data;
} list;

void process_list(list *head) {
  #pragma omp parallel
  {
    #pragma omp single
    {
      list *p = head;
      while (p) {
        #pragma omp task
        { do_work(p->data); }
        p = p->next;
      }
    } /* all tasks done */
  }
}

• Typical task generation loop:

#pragma omp parallel
{
  #pragma omp single
  {
    while (p) {
      #pragma omp task
      {…}
    }
  } /* all tasks done */
}


Example: Processing a Linked List

typedef struct {
  list *next;
  contents *data;
} list;

void process_list(list *head) {
  #pragma omp parallel
  {
    #pragma omp single
    {
      list *p = head;
      while (p) {
        #pragma omp task
        { do_work(p->data); }
        p = p->next;
      }
    } /* all tasks done */
  }
}

• Features of this example:
  • one of the threads has the job of generating all tasks
  • synchronization: at the end of the single block for all tasks created inside it
  • no particular order between tasks is enforced here
  • data scoping default for the task block: firstprivate → iterating through p is fine (this is the "packaging of data" mentioned earlier)
  • the task region includes the call of do_work()


The if Clause

• When the if argument is false:
  • the task body is executed immediately by the encountering thread (an "undeferred task")
  • but otherwise the semantics (data environment, synchronization) are the same as for a "deferred" task

#pragma omp task if (sizeof(p->data) > threshold)
{ do_work(p->data); }

• User-directed optimization:
  • avoids the overhead of deferring small tasks
  • cache locality / memory affinity may be lost by deferring


Task Synchronization



Task Synchronization with barrier and taskwait

• OpenMP barrier (implicit or explicit):
  • all tasks created by any thread of the current team are guaranteed to be completed at barrier exit

C/C++:
#pragma omp barrier

• Task barrier: taskwait
  • the encountering task is suspended until its child tasks are complete
  • applies only to direct children, not to further descendants!

C/C++:
#pragma omp taskwait


Task Synchronization with taskgroup

• The taskgroup construct (deep task synchronization), introduced with OpenMP 4.0
  • attached to a structured block; waits for completion of all descendants of the current task; task scheduling point (TSP) at the end

#pragma omp taskgroup [clause[[,] clause]...]
{structured-block}

• where clause (can only be): reduction(reduction-identifier: list-items) (≥ OpenMP 5.0)

#pragma omp parallel
#pragma omp single        // task A
{
  #pragma omp taskgroup
  {
    #pragma omp task      // task B
    { … }
    #pragma omp task      // task C
    { … #C.1; #C.2; … }
  } // end of taskgroup: waits for B, C and their child tasks C.1, C.2
}


Task Synchronization Constructs: Comparison

Task Synchronization Construct | Description
barrier                        | Either an implicit or an explicit barrier.
taskwait                       | Wait on the completion of the child tasks of the current task.
taskgroup                      | Wait on the completion of the child tasks of the current task, and of their descendants.


Example: Task Synchronization with taskwait

• Example:
  • ensure leaf-to-root traversal of a binary tree

void process_tree(tree *root) {
  if (root->left) {
    #pragma omp task
    { process_tree(root->left); }
  }
  if (root->right) {
    #pragma omp task
    { process_tree(root->right); }
  }
  #pragma omp taskwait
  do_work(root->data);
}

• What if we run out of threads? → Do we hang?


Task Switching

• It is allowed for a thread to:
  • suspend a task during execution
  • start (or resume) execution of another task (assigned to the same team)
  • resume the original task later
• Pre-condition:
  • a task scheduling point is reached
• Example from the previous slide:
  • the taskwait directive implies a task scheduling point
• Another example:
  • very many tasks are generated → the implementation can suspend generation of tasks and start processing the existing queue
• With the exception of the taskyield construct, all of these scheduling points are implied: they are there, but not visible. taskyield was added to allow the user to insert an explicit task scheduling point.


Task Scheduling Points

• The point immediately following the generation of an explicit task.


• After the point of completion of a task region.
• In a taskyield region.
• In a taskwait region.
• At the end of a taskgroup region.
• In an implicit or explicit barrier region.
• When using the support for accelerators:
• The point immediately following the generation of the target region.
• At the beginning and end of a target data region.
• In a target update region.
• In a target enter data region.
• In a target exit data region.
• In the omp_target_memcpy() runtime function.
• In the omp_target_memcpy_rect() runtime function.
Example: Dealing with a Large Number of Tasks

#define LARGE_NUMBER 10000000
double item[LARGE_NUMBER];
extern void process(double);

int main() {
  #pragma omp parallel
  {
    #pragma omp single
    {
      int i;
      for (i = 0; i < LARGE_NUMBER; i++)
        #pragma omp task          // task scheduling point
        process(item[i]);         // i is firstprivate, item is shared
    }
  }
}

• Features of this example:
  • generates a large number of tasks with one thread and executes them with the threads in the team
  • the implementation may reach its limit on unassigned tasks
  • if it does, the implementation is allowed to cause the thread executing the task-generating loop to suspend its task at the scheduling point and start executing unassigned tasks
  • once the number of unassigned tasks is sufficiently low, the thread may resume execution of the task-generating loop



Example: The taskyield Directive

• Introduced with OpenMP 3.1

#include <omp.h>

void something_useful();
void something_critical();

void foo(omp_lock_t * lock, int n)
{
  for(int i = 0; i < n; i++)
    #pragma omp task
    {
      something_useful();
      while( !omp_test_lock(lock) ) {
        #pragma omp taskyield
      }
      something_critical();
      omp_unset_lock(lock);
    }
}

• The waiting task may be suspended at the taskyield and allow the executing thread to perform other work. This may also avoid deadlock situations.


Thread Switching

• Default behaviour:
  • a task assigned to a thread must (eventually) be completed by that thread → the task is tied to the thread
• Change this via the untied clause:

#pragma omp task untied
structured-block

  • execution of the task block may change to another thread of the team at any task scheduling point
  • the implementation may add task scheduling points beyond those previously defined (outside the programmer's control!)
• Deployment of untied tasks:
  • starvation scenarios: running out of tasks while the generating thread is still working on something
• Dangers:
  • more care required (compared with tied tasks) wrt. scoping and synchronization
Final and mergeable Tasks

• Final tasks:
  • use a final clause with a condition
  • always undeferred, i.e. executed immediately by the encountering thread
  • reduce the overhead of placing tasks in the "task pool"
  • all tasks created inside a final task region are also final (→ different from an if clause)
  • inside a task block, omp_in_final() can be used to check whether the task is final
• Merged tasks:
  • a mergeable clause may create a merged task if the task is undeferred or final
  • a merged task has the same data environment as its creating task region
  • the clause was introduced to reduce data / memory requirements
• final and/or mergeable:
  • can be used for optimization purposes, e.g. to optimize the wind-down phase of a recursive algorithm


Task Scheduling Points: Pitfalls

• Use of threadprivate data:
  • the value of threadprivate variables cannot be assumed to be unchanged across a task scheduling point; it can be modified by another task executed by the same thread

• Tasks and locks:
  • if a lock is held across a task scheduling point, interleaved code trying to acquire it (on the same thread) may cause deadlock
  • this is especially a problem for untied tasks

• Tasks and critical regions:
  • a similar issue arises if suspension of a task happens inside a critical region and the same thread tries to enter a critical region in another scheduled task


Example: Fibonacci Numbers



Recursive Approach to Compute Fibonacci

int fib(int n) {
  if (n < 2) return n;
  int x = fib(n - 1);
  int y = fib(n - 2);
  return x + y;
}

int main(int argc,
         char* argv[])
{
  [...]
  fib(input);
  [...]
}

• On the following slides we will discuss three approaches to parallelize this recursive code with tasking.


First Version Parallelized with Tasking (omp-v1)

int fib(int n) {
  if (n < 2) return n;
  int x, y;
  #pragma omp task shared(x)
  {
    x = fib(n - 1);
  }
  #pragma omp task shared(y)
  {
    y = fib(n - 2);
  }
  #pragma omp taskwait
  return x + y;
}

int main(int argc,
         char* argv[])
{
  [...]
  #pragma omp parallel
  {
    #pragma omp single
    {
      fib(input);
    }
  }
  [...]
}

• Only one task / thread enters fib() from main(); it is responsible for creating the two initial work tasks
• The taskwait is required, as otherwise x and y would be lost


Scalability Measurements (1/3)

• Overhead of task creation prevents better scalability!

[Plot: speedup of Fibonacci with tasks over 1, 2, 4, 8 threads; omp-v1 stays far below the optimal (linear) speedup]


Improved Parallelization with Tasking (omp-v2)

• Improvement: don't create yet another task once a certain (small enough) n is reached

int fib(int n) {
  if (n < 2) return n;
  int x, y;
  #pragma omp task shared(x) \
          if(n > 30)
  {
    x = fib(n - 1);
  }
  #pragma omp task shared(y) \
          if(n > 30)
  {
    y = fib(n - 2);
  }
  #pragma omp taskwait
  return x + y;
}

int main(int argc,
         char* argv[])
{
  [...]
  #pragma omp parallel
  {
    #pragma omp single
    {
      fib(input);
    }
  }
  [...]
}


Scalability Measurements (2/3)

• Speedup is OK, but we still have some overhead when running with 4 or 8 threads

[Plot: speedup of Fibonacci with tasks over 1, 2, 4, 8 threads; omp-v2 is much closer to the optimal speedup than omp-v1, but still falls short at 4 and 8 threads]


Improved Parallelization with Tasking (omp-v3)

• Improvement: skip the OpenMP overhead entirely once a certain n is reached

int fib(int n) {
  if (n < 2) return n;
  if (n <= 30)
    return serfib(n);
  int x, y;
  #pragma omp task shared(x)
  {
    x = fib(n - 1);
  }
  #pragma omp task shared(y)
  {
    y = fib(n - 2);
  }
  #pragma omp taskwait
  return x + y;
}

int main(int argc,
         char* argv[])
{
  [...]
  #pragma omp parallel
  {
    #pragma omp single
    {
      fib(input);
    }
  }
  [...]
}


Scalability Measurements (3/3)

• Everything is OK now :-)

[Plot: speedup of Fibonacci with tasks over 1, 2, 4, 8 threads; omp-v3 now matches the optimal speedup]


Task Data Scoping



Data Environment

• The task directive takes the following data attribute clauses that define the data
environment of the task:

• default (private | firstprivate | shared | none)


• private (list)
• firstprivate (list)
• shared (list)



Data Scoping

• Some rules from Parallel Regions apply:


• Static and Global variables are shared
• Automatic Storage (local) variables are private

• The data-sharing attributes of variables that are not listed in data attribute clauses of a
task construct, and are not predetermined according to the OpenMP rules, are implicitly
determined as follows:
(a) In a task construct, if no default clause is present, a variable that is determined to be
shared in all enclosing constructs, up to and including the innermost enclosing
parallel construct, is shared.
(b) In a task construct, if no default clause is present, a variable whose data-sharing
attribute is not determined by rule (a) is firstprivate.



Data Scoping

• It follows that:
  (a) If a task construct is lexically enclosed in a parallel construct, then variables that are shared in all scopes enclosing the task construct remain shared in the generated task. Otherwise, variables are implicitly determined firstprivate.
  (b) If a task construct is orphaned, then variables are implicitly determined firstprivate.

• Shared entities in a task region:
  • may become invalid while some tasks are unfinished → the programmer must add synchronization to prevent this


Example: Data Scoping

int a = 1;
void foo()
{
  int b = 2, c = 3;
  #pragma omp parallel shared(b)
  #pragma omp parallel private(b)
  {
    int d = 4;
    #pragma omp task
    {
      int e = 5;

      // Scope of a:
      // Scope of b:
      // Scope of c:
      // Scope of d:
      // Scope of e:
    }
  }
}
Example: Data Scoping

int a = 1;
void foo()
{
  int b = 2, c = 3;
  #pragma omp parallel shared(b)
  #pragma omp parallel private(b)
  {
    int d = 4;
    #pragma omp task
    {
      int e = 5;

      // Scope of a: shared        value of a: 1
      // Scope of b: firstprivate  value of b: 0 / undefined
      // Scope of c: shared        value of c: 3
      // Scope of d: firstprivate  value of d: 4
      // Scope of e: private       value of e: 5
    }
  }
}
Example: Data Scoping

int a = 1;
void foo()
{
  int b = 2, c = 3;
  #pragma omp parallel shared(b) default(none)
  #pragma omp parallel private(b) default(none)
  {
    int d = 4;
    #pragma omp task
    {
      int e = 5;

      // Scope of a: shared
      // Scope of b: firstprivate
      // Scope of c: shared
      // Scope of d: firstprivate
      // Scope of e: private
    }
  }
}

• Hint: use default(none) to be forced to think about every variable if you are not sure.


Task Reductions



Reminder:
Task Synchronization: The taskgroup Construct

• The taskgroup construct (deep task synchronization), introduced with OpenMP 4.0
  • attached to a structured block; waits for completion of all descendants of the current task; task scheduling point (TSP) at the end

#pragma omp taskgroup [clause[[,] clause]...]
{structured-block}

• where clause (can only be): reduction(reduction-identifier: list-items) (≥ OpenMP 5.0)

#pragma omp parallel
#pragma omp single        // task A
{
  #pragma omp taskgroup
  {
    #pragma omp task      // task B
    { … }
    #pragma omp task      // task C
    { … #C.1; #C.2; … }
  } // end of taskgroup: waits for B, C and their child tasks C.1, C.2
}


Task Reductions using the taskgroup Construct

• Reduction operation:
  • performs some forms of recurrence calculations
  • associative and commutative operators
• The (taskgroup) scoping reduction clause:

#pragma omp taskgroup task_reduction(op: list)
{structured-block}

  • registers a new reduction at [1]
  • computes the final result after [3]
• The (task) in_reduction clause (OpenMP 5.0):

#pragma omp task in_reduction(op: list)
{structured-block}

  • the task participates in the reduction operation [2]

int res = 0;
node_t* node = NULL;
...
#pragma omp parallel
{
  #pragma omp single
  {
    #pragma omp taskgroup task_reduction(+: res)
    { // [1]
      while (node) {
        #pragma omp task in_reduction(+: res) \
                firstprivate(node)
        { // [2]
          res += node->value;
        }
        node = node->next;
      }
    } // [3]
  }
}
Task Loops



Example: saxpy Kernel with OpenMP task

for ( i = 0; i<SIZE; i+=1) {
  A[i]=A[i]*B[i]*S;
}

• Difficult to determine the grain:
  • 1 single iteration → too fine
  • the whole loop → no parallelism

• Manually transform the code (blocking techniques):

for ( i = 0; i<SIZE; i+=TS) {          // outer loop iterates across the tiles
  UB = SIZE < (i+TS)?SIZE:i+TS;
  for ( ii=i; ii<UB; ii++) {           // inner loop iterates within a tile
    A[ii]=A[ii]*B[ii]*S;
  }
}

• With tasks:

#pragma omp parallel
#pragma omp single
for ( i = 0; i<SIZE; i+=TS) {
  UB = SIZE < (i+TS)?SIZE:i+TS;
  #pragma omp task private(ii) \
          firstprivate(i,UB) shared(S,A,B)
  for ( ii=i; ii<UB; ii++) {
    A[ii]=A[ii]*B[ii]*S;
  }
}

• You have to rename all your loop indices!
• Improving programmability: OpenMP taskloop
The taskloop Construct

• Introduced with OpenMP 4.5
  • parallelize a loop using OpenMP tasks
  • cut the loop into chunks
  • create a task for each loop chunk

• Syntax (C/C++):
#pragma omp taskloop [simd] [clause[[,] clause],…]
for-loops

• Syntax (Fortran):
!$omp taskloop [simd] [clause[[,] clause],…]
do-loops
[!$omp end taskloop [simd]]

• Loop iterations are distributed over the tasks and (if simd) the resulting loop uses SIMD instructions.
Clauses for the taskloop Construct

• Taskloop constructs inherit clauses both from worksharing constructs and the task
construct
• shared, private
• firstprivate, lastprivate
• default
• collapse
• final, untied, mergeable
• allocate
• in_reduction / reduction

• grainsize(grain-size)
Chunks have at least grain-size and max 2*grain-size loop iterations

• num_tasks(num-tasks)
Create num-tasks tasks for iterations of the loop
Example: saxpy Kernel with OpenMP taskloop

for ( i = 0; i<SIZE; i+=1) {
  A[i]=A[i]*B[i]*S;
}

• blocking:

for ( i = 0; i<SIZE; i+=TS) {
  UB = SIZE < (i+TS)?SIZE:i+TS;
  for ( ii=i; ii<UB; ii++) {
    A[ii]=A[ii]*B[ii]*S;
  }
}

for ( i = 0; i<SIZE; i+=TS) {
  UB = SIZE < (i+TS)?SIZE:i+TS;
  #pragma omp task private(ii) \
          firstprivate(i,UB) shared(S,A,B)
  for ( ii=i; ii<UB; ii++) {
    A[ii]=A[ii]*B[ii]*S;
  }
}

• taskloop:

#pragma omp taskloop grainsize(TS)
for ( i = 0; i<SIZE; i+=1) {
  A[i]=A[i]*B[i]*S;
}

• Easier to apply than manual blocking:
  • the compiler implements the mechanical transformation
  • less error-prone, more productive


Worksharing vs. taskloop Constructs

subroutine worksharing
  integer :: x
  integer :: i
  integer, parameter :: T = 16
  integer, parameter :: N = 1024

  x = 0
  !$omp parallel shared(x) num_threads(T)
  !$omp do
  do i = 1,N
    !$omp atomic
    x = x + 1
    !$omp end atomic
  end do
  !$omp end do
  !$omp end parallel

  write (*,'(A,I0)') 'x = ', x
end subroutine

subroutine taskloop
  integer :: x
  integer :: i
  integer, parameter :: T = 16
  integer, parameter :: N = 1024

  x = 0
  !$omp parallel shared(x) num_threads(T)
  !$omp taskloop
  do i = 1,N
    !$omp atomic
    x = x + 1
    !$omp end atomic
  end do
  !$omp end taskloop
  !$omp end parallel

  write (*,'(A,I0)') 'x = ', x
end subroutine


Worksharing vs. taskloop Constructs

• The taskloop version needs an enclosing single construct, so that only one thread of the team generates the tasks:

subroutine taskloop
  integer :: x
  integer :: i
  integer, parameter :: T = 16
  integer, parameter :: N = 1024

  x = 0
  !$omp parallel shared(x) num_threads(T)
  !$omp single
  !$omp taskloop
  do i = 1,N
    !$omp atomic
    x = x + 1
    !$omp end atomic
  end do
  !$omp end taskloop
  !$omp end single
  !$omp end parallel

  write (*,'(A,I0)') 'x = ', x
end subroutine


Task Dependencies



Catchy example: Building a house

[Diagram: task dependency graph for building a house, with the tasks Build Walls, Put up Roof, Tiling of the Roof, Install Electrical, Plaster Wall, Exterior Siding, and Garden]


The depend Clause

C/C++:
#pragma omp task depend(dependency-type: list)
... structured block ...

• Introduced with OpenMP 4.0
• A task dependence is fulfilled when the predecessor task has completed

• in dependency-type: the generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an out or inout clause.
• out and inout dependency-type: the generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an in, out, or inout clause.
• mutexinoutset dependency-type: added in OpenMP 5.0 to support mutually exclusive inout sets
• The list items in a depend clause may include array sections.


Task Synchronization with Dependencies

OpenMP 3.1:

int x = 0;
#pragma omp parallel
#pragma omp single
{
  #pragma omp task
  std::cout << x << std::endl;       // t1

  #pragma omp task
  long_running_task();               // t2

  #pragma omp taskwait

  #pragma omp task
  x++;                               // t3
}

OpenMP 4.0:

int x = 0;
#pragma omp parallel
#pragma omp single
{
  #pragma omp task depend(in: x)
  std::cout << x << std::endl;       // t1

  #pragma omp task
  long_running_task();               // t2

  #pragma omp task depend(inout: x)
  x++;                               // t3
}

[Diagram: with taskwait, t3 waits for both t1 and t2; with depend clauses, t3 only waits for t1, so t2 can overlap with t3]


Example: Cholesky Factorization

void cholesky(int ts, int nt, double* a[nt][nt]) {
  for (int k = 0; k < nt; k++) {
    // Diagonal block factorization
    potrf(a[k][k], ts, ts);
    // Triangular systems
    for (int i = k + 1; i < nt; i++) {
      #pragma omp task
      trsm(a[k][k], a[k][i], ts, ts);
    }
    #pragma omp taskwait
    // Update trailing matrix
    for (int i = k + 1; i < nt; i++) {
      for (int j = k + 1; j < i; j++) {
        #pragma omp task
        dgemm(a[k][i], a[k][j], a[j][i], ts, ts);
      }
      #pragma omp task
      syrk(a[k][i], a[i][i], ts, ts);
    }
    #pragma omp taskwait
  }
}

• Complex synchronization patterns
• Splitting computational phases: taskwait or taskgroup
• Needs complex code analysis
• May perform a bit better than regular OpenMP worksharing
• Is this the best solution we can come up with?


Example: Cholesky Factorization

• OpenMP 4.0: the taskwait barriers between phases are replaced by fine-grained task dependencies:

void cholesky(int ts, int nt, double* a[nt][nt]) {
  for (int k = 0; k < nt; k++) {
    // Diagonal block factorization
    #pragma omp task depend(inout: a[k][k])
    potrf(a[k][k], ts, ts);

    // Triangular systems
    for (int i = k + 1; i < nt; i++) {
      #pragma omp task depend(in: a[k][k]) \
              depend(inout: a[k][i])
      trsm(a[k][k], a[k][i], ts, ts);
    }

    // Update trailing matrix
    for (int i = k + 1; i < nt; i++) {
      for (int j = k + 1; j < i; j++) {
        #pragma omp task depend(inout: a[j][i]) \
                depend(in: a[k][i], a[k][j])
        dgemm(a[k][i], a[k][j], a[j][i], ts, ts);
      }
      #pragma omp task depend(inout: a[i][i]) \
              depend(in: a[k][i])
      syrk(a[k][i], a[i][i], ts, ts);
    }
  }
}
Task Affinity



OpenMP Task Affinity

void task_affinity() {
  double* B;
  #pragma omp task shared(B)
  {
    B = init_B_and_important_computation(A);
  }
  #pragma omp task firstprivate(B)
  {
    important_computation_too(B);
  }
  #pragma omp taskwait
}

[Diagram: two NUMA domains, each with cores and on-chip caches, connected by an interconnect; one memory holds A[0] … A[N], the other holds B[0] … B[N]]


OpenMP Task Affinity

void task_affinity() {
  double* B;
  #pragma omp task shared(B) affinity(A[0:N])
  {
    B = init_B_and_important_computation(A);
  }
  #pragma omp task firstprivate(B) affinity(B[0:N])
  {
    important_computation_too(B);
  }
  #pragma omp taskwait
}

• Introduced with OpenMP 5.0: affinity(locator-list)
• The affinity clause is a hint to indicate data affinity of the generated task. The task is recommended to execute close to the location of the list items.

[Diagram: the same two NUMA domains; the affinity hints place each task near the memory holding A[0] … A[N] and B[0] … B[N], respectively]


Task Priorities



Task Priority

• Syntax:
!$omp task priority(priority_value)        Fortran
#pragma omp task priority(priority_value)  C/C++

• Introduced with OpenMP 4.5
• Semantics:
  • provides a hint to the runtime on prioritizing (ordering) task execution
  • the priority value must be a non-negative integer; higher values correspond to higher priorities; the maximum value is omp_get_max_task_priority()
  • do not rely on a particular ordering of tasks imposed by specifying a priority

#pragma omp task priority(9999)
participants.take_break(five_minutes);


Thank you for your attention!
