
OpenACC Fundamentals

Jeff Larkin <jlarkin@nvidia.com>, November 16, 2016


AGENDA

What is OpenACC?
OpenACC by Example
3 Ways to Program GPUs

Applications can be accelerated through:

Libraries - "drop-in" acceleration
Compiler Directives - easily accelerate applications
Programming Languages - maximum flexibility
OpenACC Directives

#pragma acc data copyin(x,y) copyout(z)     Manage data movement
{
  ...
  #pragma acc parallel                      Initiate parallel execution
  {
    #pragma acc loop gang vector            Optimize loop mappings
    for (i = 0; i < n; ++i) {
      z[i] = x[i] + y[i];
      ...
    }
  }
  ...
}

Incremental, single source, interoperable, performance portable (CPU, GPU, MIC)
Accelerated Computing
10x Performance & 5x Energy Efficiency for HPC

CPU: optimized for serial tasks
GPU accelerator: optimized for parallel tasks
What is Accelerated Computing?

The application code is split between GPU and CPU: the compute-intensive
functions (a few % of the code, but a large % of the time) run on the GPU,
while the rest of the sequential code runs on the CPU.
OpenACC Example

#pragma acc data copy(b[0:n][0:m]) create(a[0:n][0:m])
{
  for (iter = 1; iter <= p; ++iter){
    #pragma acc kernels
    {
      for (i = 1; i < n-1; ++i){
        for (j = 1; j < m-1; ++j){
          a[i][j] = w0*b[i][j] +
                    w1*(b[i-1][j]+b[i+1][j]+b[i][j-1]+b[i][j+1]) +
                    w2*(b[i-1][j-1]+b[i-1][j+1]+b[i+1][j-1]+b[i+1][j+1]);
        }
      }
      for( i = 1; i < n-1; ++i )
        for( j = 1; j < m-1; ++j )
          b[i][j] = a[i][j];
    }
  }
}

(Figure: b is copied between host memory and GPU memory at the data region
boundaries; a is created only in GPU memory.)
Example: Jacobi Iteration

Iteratively converges to the correct value (e.g. temperature) by computing new
values at each point from the average of neighboring points.
Common, useful algorithm.

Example: Solve the Laplace equation in 2D: $\nabla^2 f(x,y) = 0$

(Stencil: A(i,j) is updated from its four neighbors A(i-1,j), A(i+1,j),
A(i,j-1), and A(i,j+1).)

$A_{k+1}(i,j) = \frac{A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1)}{4}$
Jacobi Iteration: C Code

while ( err > tol && iter < iter_max ) {        // Iterate until converged
  err=0.0;

  for( int j = 1; j < n-1; j++) {               // Iterate across matrix elements
    for( int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +   // Calculate new value
                           A[j-1][i] + A[j+1][i]);   // from neighbors
      err = max(err, abs(Anew[j][i] - A[j][i]));     // Compute max error for convergence
    }
  }

  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];                     // Swap input/output arrays
    }
  }

  iter++;
}
Look For Parallelism

while ( err > tol && iter < iter_max ) {        // Data dependency between iterations
  err=0.0;

  for( int j = 1; j < n-1; j++) {               // Independent loop iterations
    for( int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      err = max(err, abs(Anew[j][i] - A[j][i]));     // Max reduction required
    }
  }

  for( int j = 1; j < n-1; j++) {               // Independent loop iterations
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}
OPENACC DIRECTIVE SYNTAX

C/C++
#pragma acc directive [clause [[,] clause]...]
...often followed by a structured code block

Fortran
!$acc directive [clause [[,] clause]...]
...often paired with a matching end directive surrounding a structured code block:
!$acc end directive

Don't forget the acc prefix!
OpenACC Parallel Directive
Generates parallelism

#pragma acc parallel
{
}

When encountering the parallel directive, the compiler will generate one or
more parallel gangs, which execute redundantly.
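To make the redundant execution concrete, here is a minimal sketch (not from
the slides; the array, sizes, and values are illustrative assumptions). It
also previews the loop directive shown on the next slide:

#include <stdio.h>

int main(void)
{
    int n = 1024;
    float x[1024];

    #pragma acc parallel
    {
        // Statements here, outside any loop directive, are executed
        // redundantly by every gang.
        float scale = 2.0f;

        // The loop directive switches to work-sharing: iterations are
        // divided among the gangs instead of repeated by each one.
        #pragma acc loop
        for (int i = 0; i < n; ++i)
            x[i] = scale * i;
    }

    printf("%f\n", x[10]);   // prints 20.000000
    return 0;
}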
OpenACC Loop Directive
Identifies loops to run in parallel

#pragma acc parallel
{
  #pragma acc loop
  for (i=0;i<N;i++)
  {
  }
}

The loop directive informs the compiler which loops to parallelize.
OpenACC Parallel Loop Directive
Generates parallelism and identifies the loop in one directive

#pragma acc parallel loop
for (i=0;i<N;i++)
{
}

The parallel and loop directives are frequently combined into one.
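For example, a minimal SAXPY sketch (not from the slides; the function and
argument names are illustrative) using the combined directive:

void saxpy(int n, float a, float *restrict x, float *restrict y)
{
    // Create gangs and distribute the loop iterations in a single directive.
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}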
PARALLELIZE WITH OPENACC

while ( err > tol && iter < iter_max ) {
  err=0.0;

  #pragma acc parallel loop reduction(max:err)   // Parallelize loop on accelerator
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      err = max(err, abs(Anew[j][i] - A[j][i]));
    }
  }

  #pragma acc parallel loop                      // Parallelize loop on accelerator
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}

* A reduction means that all of the N*M values for err will be reduced to just
  one, the max.
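For comparison, a hedged sketch (not from the slides) of the same reduction
clause with a sum instead of a max; the function and variable names are
assumptions:

double sum_array(int n, const double *restrict v)
{
    double total = 0.0;

    // Each gang/worker/vector lane accumulates a partial sum; the reduction
    // clause combines the partials into a single value of total.
    #pragma acc parallel loop reduction(+:total)
    for (int i = 0; i < n; ++i)
        total += v[i];

    return total;
}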
BUILDING THE CODE
$ pgcc -fast -acc -ta=tesla -Minfo=all laplace2d.c
main:
40, Loop not fused: function call before adjacent loop
Generated vector sse code for the loop
51, Loop not vectorized/parallelized: potential early exits
55, Accelerator kernel generated
55, Max reduction generated for error
56, #pragma acc loop gang /* blockIdx.x */
58, #pragma acc loop vector(256) /* threadIdx.x */
55, Generating copyout(Anew[1:4094][1:4094])
Generating copyin(A[:][:])
Generating Tesla code
58, Loop is parallelizable
66, Accelerator kernel generated
67, #pragma acc loop gang /* blockIdx.x */
69, #pragma acc loop vector(256) /* threadIdx.x */
66, Generating copyin(Anew[1:4094][1:4094])
Generating copyout(A[1:4094][1:4094])
Generating Tesla code
69, Loop is parallelizable

Speed-up (Higher is Better)
Intel Xeon E5-2698 v3 @ 2.30GHz (Haswell) vs. NVIDIA Tesla K40 & P100

Single Thread     1.00X
2 Threads         1.94X
4 Threads         3.69X
6 Threads         4.59X
8 Threads         5.00X
OpenACC (K40)     0.61X
OpenACC (P100)    0.66X

Why did OpenACC slow down here?

Compiler: PGI 16.10
Very low Compute/Memcpy ratio

Compute       4 seconds
Memory Copy  51 seconds
(Profiler timeline: 112ms/iteration, dominated by PCIe copies.)
Excessive Data Transfers

while ( err > tol && iter < iter_max )
{
  err=0.0;
                                        // Copy: A, Anew resident on host ->
                                        //       A, Anew resident on accelerator
  #pragma acc parallel loop
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      err = max(err, abs(Anew[j][i] - A[j][i]));
    }
  }
                                        // Copy: A, Anew resident on accelerator ->
                                        //       A, Anew resident on host
  ...
}

These copies happen every iteration of the outer while loop!
Evaluate Data Locality

while ( err > tol && iter < iter_max ) {
  err=0.0;

  #pragma acc parallel loop
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      err = max(err, abs(Anew[j][i] - A[j][i]));
    }
  }
                               // Does the CPU need the data between these loop nests?
  #pragma acc parallel loop
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}                              // Does the CPU need the data between iterations
                               // of the convergence loop?
Data regions
The data directive defines a region of code in which GPU arrays remain on
the GPU and are shared among all kernels in that region.

#pragma acc data
{
  #pragma acc parallel loop
  ...

  #pragma acc parallel loop
  ...
}

Arrays used within the data region will remain on the GPU until the end of
the data region.
Data Clauses
copy ( list ) Allocates memory on GPU and copies data from host to GPU
when entering region and copies data to the host when
exiting region.

copyin ( list ) Allocates memory on GPU and copies data from host to GPU
when entering region.

copyout ( list ) Allocates memory on GPU and copies data to the host when
exiting region.

create ( list ) Allocates memory on GPU but does not copy.

present ( list ) Data is already present on GPU from another containing
                 data region.

deviceptr( list ) The variable is a device pointer (e.g. CUDA) and can be
                  used directly on the device.
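As a hedged sketch (not from the slides; the function and the array names in,
out, and tmp are illustrative), several of these clauses combined on one data
region:

#include <stdlib.h>

void scale_and_add(int n, const float *restrict in, float *restrict out)
{
    float *tmp = (float *)malloc(n * sizeof(float));

    // in is only read on the device (copyin), out is only written back
    // (copyout), and tmp lives entirely on the device (create).
    #pragma acc data copyin(in[0:n]) copyout(out[0:n]) create(tmp[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            tmp[i] = 2.0f * in[i];

        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            out[i] = tmp[i] + in[i];
    }

    free(tmp);
}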
Array Shaping

The compiler sometimes cannot determine the size of arrays, so it must be
specified explicitly using data clauses and an array "shape".

C/C++
#pragma acc data copyin(a[0:nelem]) copyout(b[s/4:3*s/4])

Fortran
!$acc data copyin(a(1:end)) copyout(b(s/4:3*s/4))

Note: data clauses can be used on data, parallel, or kernels
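This matters most for dynamically allocated arrays, whose extent the compiler
cannot infer. A minimal sketch (not from the slides; the function, sizes, and
factor are assumptions):

#include <stdlib.h>

void scale(int n, int m, float factor)
{
    float *a = (float *)malloc((size_t)n * m * sizeof(float));
    /* ... initialize a ... */

    // a is a bare pointer, so its shape must be given as [start:length].
    #pragma acc data copy(a[0:n*m])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n * m; ++i)
            a[i] *= factor;
    }

    free(a);
}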
Add Data Clauses

#pragma acc data copy(A) create(Anew)      // Copy A to/from the accelerator only
while ( err > tol && iter < iter_max ) {   // when needed; create Anew as a
  err=0.0;                                 // device temporary.

  #pragma acc parallel loop
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      err = max(err, abs(Anew[j][i] - A[j][i]));
    }
  }

  #pragma acc parallel loop
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }
  iter++;
}
Rebuilding the code
$ pgcc -fast -acc -ta=tesla -Minfo=all laplace2d.c
main:
40, Loop not fused: function call before adjacent loop
Generated vector sse code for the loop
51, Generating copy(A[:][:])
Generating create(Anew[:][:])
Loop not vectorized/parallelized: potential early exits
56, Accelerator kernel generated
56, Max reduction generated for error
57, #pragma acc loop gang /* blockIdx.x */
59, #pragma acc loop vector(256) /* threadIdx.x */
56, Generating Tesla code
59, Loop is parallelizable
67, Accelerator kernel generated
68, #pragma acc loop gang /* blockIdx.x */
70, #pragma acc loop vector(256) /* threadIdx.x */
67, Generating Tesla code
70, Loop is parallelizable

Visual Profiler: Data Region

Data movement now occurs only at the beginning and end of the data region.
Visual Profiler: Data Region

(Profiler timeline showing Iteration 1 and Iteration 2; each iteration was
112ms before the data region was added.)
Speed-Up (Higher is Better)
Intel Xeon E5-2698 v3 @ 2.30GHz (Haswell) vs. NVIDIA Tesla K40 & Tesla P100

Single Thread       1.00X
2 Threads           1.94X
4 Threads           3.69X
6 Threads           4.59X
8 Threads           5.00X
OpenACC (K40)      14.92X   (Socket/Socket: 3X)
OpenACC (P100)     34.71X   (Socket/Socket: 7X)

Compiler: PGI 16.10


The loop Directive

The loop directive gives the compiler additional information about the next loop in
the source code through several clauses.
• independent – all iterations of the loop are independent
• collapse(N) – turn the next N loops into one, flattened loop
• tile(N[,M,…]) - break the next 1 or more loops into tiles based on the
provided dimensions.

These clauses and more will be discussed in greater detail in a later class.

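As a brief illustration of one of these clauses, a hedged sketch (not from the
slides) of collapse(2) applied to the copy loop nest from the Jacobi example
above:

// collapse(2) flattens the j and i loops into a single iteration space of
// (n-2)*(m-2) independent iterations, exposing more parallelism.
#pragma acc parallel loop collapse(2)
for( int j = 1; j < n-1; j++ )
    for( int i = 1; i < m-1; i++ )
        A[j][i] = Anew[j][i];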
Optimize Loop Performance

#pragma acc data copy(A) create(Anew)
while ( err > tol && iter < iter_max ) {
  err=0.0;

  #pragma acc parallel loop device_type(nvidia) tile(32,4)   // "Tile" the next two loops
  for( int j = 1; j < n-1; j++) {                            // into 32x4 blocks, but only
    for( int i = 1; i < m-1; i++) {                          // on NVIDIA GPUs.
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      err = max(err, abs(Anew[j][i] - A[j][i]));
    }
  }

  #pragma acc parallel loop device_type(nvidia) tile(32,4)
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }
  iter++;
}
Speed-Up (Higher is Better)
Intel Xeon E5-2698 v3 @ 2.30GHz (Haswell) vs. NVIDIA Tesla K40 & Tesla P100

Single Thread          1.00X
2 Threads              1.94X
4 Threads              3.69X
6 Threads              4.59X
8 Threads              5.00X
OpenACC (K40)         14.92X
OpenACC Tuned (K40)   15.46X
OpenACC (P100)        34.71X
OpenACC Tuned (P100)  36.78X

Compiler: PGI 16.10


Speed-Up (Higher is Better)
Intel Xeon E5-2698 v3 @ 2.30GHz (Haswell) vs. NVIDIA Tesla K40 & Tesla P100

(Chart: Single Thread 1.00X; Intel OpenMP (Best), PGI OpenMP (Best), and
PGI OpenACC (Best) multicore results between 5.25X and 6.31X;
OpenACC K40 15.46X; OpenACC P100 36.58X.)

Intel C Compiler 16, PGI 16.10 (OpenMP, K40, & P100), PGI 15.10 (multicore)
Next Lecture

Friday – OpenACC Pipelining

