• This tutorial includes material kindly provided by several OpenMP experts.
• Example: sum reduction over an array

double a[N];
double l, s = 0;
#pragma omp parallel for reduction(+:s) \
        private(l) schedule(static,4)

[Figure: the shared s = 0 is split into private copies s' = s'' = s''' = s'''' = 0; the work is distributed; at the barrier the partial sums are combined (e.g. s' += s''') and finally s = s']
• Worksharing constructs do not compose well (or at least not as well as we would like)
• Pathological example: parallel daxpy in MKL

// daxpy is either parallel or sequential,
// but has no orphaned worksharing
void example1() {
    #pragma omp parallel
    {
        compute_in_parallel_this(A);   // for, sections, ...
        compute_in_parallel_that(B);   // for, sections, ...
        cblas_daxpy(n, x, A, incx, B, incy);
    }
}

// parallelism inside this/that, and the parallel MKL version
void example2() {
    compute_in_parallel_this(A);
    compute_in_parallel_that(B);
    cblas_daxpy( <...> );
}
The Execution Model
Dr. Volker Weinberg | LRZ

• Task pool / task queue
• A task defers (or not) a unit of work, executable by any member of the team
• Several scenarios are possible:
  • single creator, multiple creators, nested tasks (tasks & worksharing)
• All threads in the team are candidates to execute tasks
• Clauses of the task construct:
  Data environment:       private(list), firstprivate(list), shared(list),
                          default(shared | none), in_reduction(r-id: list)*
  Cutoff strategies:      if(scalar-expression), final(scalar-expression), mergeable
  Dependencies:           depend(dep-type: list)
  Scheduling restriction: untied
• Typical task generation loop over a linked list:

typedef struct list {
    struct list *next;
    contents *data;
} list;
• User-directed optimization:
• avoid overhead for deferring small tasks
• cache locality / memory affinity may be lost by doing so
• The taskgroup construct (deep task synchronization, introduced with OpenMP 4.0)
  • attached to a structured block; waits for completion of all descendants of the current task; task scheduling point (TSP) at the end
#pragma omp taskgroup [clause[[,] clause]...]
{structured-block}
taskgroup: wait on the completion of the child tasks of the current task, and of all their descendants.
• Example:
  • ensure leaf-to-root traversal of a binary tree
• Default behaviour:
  • a task assigned to a thread must (eventually) be completed by that thread: the task is tied to the thread
• Change this via the untied clause:
#pragma omp task untied
{structured-block}
• On the following slides we will discuss three approaches to parallelize this recursive code
with Tasking.
[Figure: Speedup of Fibonacci with Tasks — optimal vs. omp-v1, for 1, 2, 4 and 8 threads]
• Improvement: don't create yet another task once a certain (small enough) n is reached
• Speedup is ok, but we still have some overhead when running with 4 or 8 threads
Speedup of Fibonacci with Tasks
[Figure: speedup — optimal vs. omp-v1 and omp-v2, for 1, 2, 4 and 8 threads]
• Everything ok now
Speedup of Fibonacci with Tasks
[Figure: speedup — optimal vs. omp-v1, omp-v2 and omp-v3, for 1, 2, 4 and 8 threads]
• The data attribute clauses of the task directive (shared, private, firstprivate, default) define the data environment of the task.
• The data-sharing attributes of variables that are not listed in data attribute clauses of a
task construct, and are not predetermined according to the OpenMP rules, are implicitly
determined as follows:
(a) In a task construct, if no default clause is present, a variable that is determined to be
shared in all enclosing constructs, up to and including the innermost enclosing
parallel construct, is shared.
(b) In a task construct, if no default clause is present, a variable whose data-sharing
attribute is not determined by rule (a) is firstprivate.
• It follows that:
(a) If a task construct is lexically enclosed in a parallel construct, then variables that are
shared in all scopes enclosing the task construct remain shared in the generated
task. Otherwise, variables are implicitly determined firstprivate.
(b) If a task construct is orphaned, then variables are implicitly determined
firstprivate.
int a = 1;
void foo()
{
    int b = 2, c = 3;
    #pragma omp parallel shared(b)
    #pragma omp parallel private(b)
    {
        int d = 4;
        #pragma omp task
        {
            int e = 5;
            // Scope of a:
            // Scope of b:
            // Scope of c:
            // Scope of d:
            // Scope of e:
        }
    }
}
Example: Data Scoping
int a = 1;
void foo()
{
    int b = 2, c = 3;
    #pragma omp parallel shared(b) default(none)
    #pragma omp parallel private(b) default(none)
    {
        int d = 4;
        #pragma omp task
        {
            int e = 5;
            // Scope of a: shared
            // Scope of b: firstprivate
            // Scope of c: shared
            // Scope of d: firstprivate
            // Scope of e: private
        }
    }
}

• Hint: use default(none) to be forced to think about the data-sharing attribute of every variable if you are unsure.
• Reduction operations with tasks:
  • perform some forms of recurrence calculations
  • associative and commutative operators
• The (taskgroup) scoping reduction clause:
#pragma omp taskgroup task_reduction(op: list)
{structured-block}
  • registers a new reduction at [1]
  • computes the final result after [3]
• The (task) in_reduction clause:
  • the task participates in a reduction operation [2]
#pragma omp task in_reduction(op: list)
{structured-block}
• Example:

int res = 0;
node_t* node = NULL;
...
#pragma omp parallel
{
    #pragma omp single
    {
        #pragma omp taskgroup task_reduction(+: res)
        { // [1]
            while (node) {
                #pragma omp task in_reduction(+: res) \
                        firstprivate(node)
                { // [2]
                    res += node->value;
                }
                node = node->next;
            }
        } // [3]
    }
}
OpenMP 5.0
Task Loops
• Syntax (C/C++)
#pragma omp taskloop [simd] [clause[[,] clause],…]
for-loops
• Syntax (Fortran)
!$omp taskloop [simd] [clause[[,] clause],…]
do-loops
[!$omp end taskloop [simd]]
• Loop iterations are distributed over the tasks and (if simd) the resulting loop uses SIMD
instructions.
Clauses for the taskloop Construct
• Taskloop constructs inherit clauses both from worksharing constructs and the task
construct
• shared, private
• firstprivate, lastprivate
• default
• collapse
• final, untied, mergeable
• allocate
• in_reduction / reduction
• grainsize(grain-size)
Each chunk has at least grain-size and at most 2×grain-size loop iterations
• num_tasks(num-tasks)
Creates num-tasks tasks for the iterations of the loop
Example: saxpy Kernel with OpenMP taskloop

Worksharing version (!$omp do):

x = 0
!$omp parallel shared(x) num_threads(T)
!$omp do
do i = 1,N
  !$omp atomic
  x = x + 1
  !$omp end atomic
end do
!$omp end do
!$omp end parallel
write (*,'(A,I0)') 'x = ', x
end subroutine

Taskloop version:

x = 0
!$omp parallel shared(x) num_threads(T)
!$omp single
!$omp taskloop
do i = 1,N
  !$omp atomic
  x = x + 1
  !$omp end atomic
end do
!$omp end taskloop
!$omp end single
!$omp end parallel
write (*,'(A,I0)') 'x = ', x
end subroutine
[Figure: task dependency graph for building a house, with tasks such as Build Walls, Install Electrical, Plaster Wall, Put up Roof, Tiling of the Roof, Exterior Siding and Garden]
C/C++
#pragma omp task depend(dependency-type: list)
... structured block ...
• Complex synchronization patterns:
  • splitting computational phases with taskwait or taskgroup
  • needs complex code analysis
  • may perform a bit better than regular OpenMP worksharing
• Is this the best solution we can come up with?

void cholesky(int ts, int nt, double* a[nt][nt]) {
    for (int k = 0; k < nt; k++) {
        // Diagonal block factorization
        potrf(a[k][k], ts, ts);
        // Triangular systems
        for (int i = k + 1; i < nt; i++) {
            #pragma omp task
            trsm(a[k][k], a[k][i], ts, ts);
        }
        #pragma omp taskwait
        // Update trailing matrix
        for (int i = k + 1; i < nt; i++) {
            for (int j = k + 1; j < i; j++) {
                #pragma omp task
                dgemm(a[k][i], a[k][j], a[j][i], ts, ts);
            }
            #pragma omp task
            syrk(a[k][i], a[i][i], ts, ts);
        }
        #pragma omp taskwait
    }
}
OpenMP 3.1 (taskwait-based):

void cholesky(int ts, int nt, double* a[nt][nt]) {
    for (int k = 0; k < nt; k++) {
        // Diagonal block factorization
        potrf(a[k][k], ts, ts);
        // Triangular systems
        for (int i = k + 1; i < nt; i++) {
            #pragma omp task
            trsm(a[k][k], a[k][i], ts, ts);
        }
        #pragma omp taskwait
        // Update trailing matrix
        for (int i = k + 1; i < nt; i++) {
            for (int j = k + 1; j < i; j++) {
                #pragma omp task
                dgemm(a[k][i], a[k][j], a[j][i], ts, ts);
            }
            #pragma omp task
            syrk(a[k][i], a[i][i], ts, ts);
        }
        #pragma omp taskwait
    }
}

OpenMP 4.0 (dependencies replace the taskwaits):

void cholesky(int ts, int nt, double* a[nt][nt]) {
    for (int k = 0; k < nt; k++) {
        // Diagonal block factorization
        #pragma omp task depend(inout: a[k][k])
        potrf(a[k][k], ts, ts);
        // Triangular systems
        for (int i = k + 1; i < nt; i++) {
            #pragma omp task depend(in: a[k][k]) depend(inout: a[k][i])
            trsm(a[k][k], a[k][i], ts, ts);
        }
        // Update trailing matrix
        for (int i = k + 1; i < nt; i++) {
            for (int j = k + 1; j < i; j++) {
                #pragma omp task depend(inout: a[j][i]) depend(in: a[k][i], a[k][j])
                dgemm(a[k][i], a[k][j], a[j][i], ts, ts);
            }
            #pragma omp task depend(inout: a[i][i]) depend(in: a[k][i])
            syrk(a[k][i], a[i][i], ts, ts);
        }
    }
}
Task Affinity

[Figure: two NUMA domains, each with four cores, per-core on-chip caches, an interconnect, and local memory]

Without an affinity clause, the second task may execute far from the memory in which B was initialized:

void task_affinity() {
    double* B;
    #pragma omp task shared(B)
    {
        B = init_B_and_important_computation(A);
    }
    #pragma omp task firstprivate(B)
    {
        important_computation_too(B);
    }
    #pragma omp taskwait
}

With the affinity clause (a hint, OpenMP 5.0), each task is steered towards the data it touches:

void task_affinity() {
    double* B;
    #pragma omp task shared(B) affinity(A[0:N])
    {
        B = init_B_and_important_computation(A);
    }
    #pragma omp task firstprivate(B) affinity(B[0:N])
    {
        important_computation_too(B);
    }
    #pragma omp taskwait
}
• Syntax:
  C/C++:   #pragma omp task priority(priority_value)
  Fortran: !$omp task priority(priority_value)