
INTRODUCTION TO OPENACC

Lecture 1: Analyzing and Parallelizing with OpenACC, October 26, 2016


Course Objective:

Enable you to accelerate your applications with OpenACC.

2
Course Syllabus

Oct 26: Analyzing and Parallelizing with OpenACC
Nov 2: OpenACC Optimizations
Nov 9: Advanced OpenACC

Recordings: https://developer.nvidia.com/intro-to-openacc-course-2016
3
ANALYZING AND PARALLELIZING
WITH OPENACC
Lecture 1: Jeff Larkin, NVIDIA
Today’s Objectives

[Diagram: Analyze → Parallelize → Optimize cycle]

Understand what OpenACC is and why to use it
Understand some of the differences between CPU and GPU hardware
Know how to obtain an application profile using PGProf
Know how to add OpenACC directives to existing loops and build with OpenACC using PGI

5
Why OpenACC?

6
University of Illinois
PowerGrid - MRI Reconstruction
70x Speed-Up
2 Days of Effort

main()
{
  <serial code>
  #pragma acc kernels   //automatically runs on GPU
  {
    <parallel code>
  }
}

OpenACC: Simple | Powerful | Portable

Fueling the Next Wave of Scientific Discoveries in HPC

RIKEN Japan
NICAM - Climate Modeling
7-8x Speed-Up
5% of Code Modified

http://www.cray.com/sites/default/files/resources/OpenACC_213462.12_OpenACC_Cosmo_CS_FNL.pdf
http://www.hpcwire.com/off-the-wire/first-round-of-2015-hackathons-gets-underway
http://on-demand.gputechconf.com/gtc/2015/presentation/S5297-Hisashi-Yashiro.pdf
http://www.openacc.org/content/experiences-porting-molecular-dynamics-code-gpus-cray-xk7
7
LS-DALTON
Large-scale application for calculating high-accuracy molecular energies

Minimal Effort
Lines of Code Modified: <100 Lines
# of Weeks Required:    1 Week
# of Codes to Maintain: 1 Source

Big Performance
LS-DALTON CCSD(T) Module
Benchmarked on Titan Supercomputer (AMD CPU vs Tesla K20X)
[Chart: speedup for Alanine-1 (13 Atoms), Alanine-2 (23 Atoms), Alanine-3 (33 Atoms); y-axis up to 12x]

“OpenACC makes GPU computing approachable for domain scientists. Initial OpenACC implementation required only minor effort, and more importantly, no modifications of our existing CPU implementation.”
Janus Juul Eriksen, PhD Fellow
qLEAP Center for Theoretical Chemistry, Aarhus University
8
OpenACC Directives

#pragma acc data copyin(x,y) copyout(z)   // Manage Data Movement
{
  ...
  #pragma acc parallel                    // Initiate Parallel Execution
  {
    #pragma acc loop gang vector          // Optimize Loop Mappings
    for (i = 0; i < n; ++i) {
      z[i] = x[i] + y[i];
      ...
    }
  }
  ...
}

Incremental | Single source | Interoperable | Performance portable

9
OpenACC Performance Portability: CloverLeaf
Hydrodynamics Application

[Chart: OpenACC Performance Portability, Speedup vs 1 CPU Core]

“We were extremely impressed that we can run OpenACC on a CPU with no code change and get equivalent performance to our OpenMP/MPI implementation.”
Wayne Gaudin and Oliver Perks
Atomic Weapons Establishment, UK

Benchmarked Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz, Accelerator: Tesla K80

10
CloverLeaf on Dual Haswell vs Tesla K80
[Chart: Speedup vs Single Haswell Core. Haswell: Intel OpenMP 7x, Haswell: PGI OpenACC 8x, Tesla K80: PGI OpenACC 14x]

CPU: Intel Xeon E5-2698 v3, 2 sockets, 32 cores, 2.30 GHz, HT disabled
GPU: NVIDIA Tesla K80 (single GPU)
OS: CentOS 6.6, Compiler: PGI 16.5
11
CloverLeaf on Tesla P100 Pascal
Migrating from multicore CPU to K80 to P100 requires only changing a compiler flag.

[Chart: Speedup vs Single Haswell Core. Haswell: Intel OpenMP 7x, Haswell: PGI OpenACC 8x, Tesla K80: PGI OpenACC 14x, Tesla P100: PGI OpenACC 40x]

CPU: Intel Xeon E5-2698 v3, 2 sockets, 32 cores, 2.30 GHz, HT disabled
GPU: NVIDIA Tesla K80 (single GPU), NVIDIA Tesla P100 (single GPU)
OS: CentOS 6.6, Compiler: PGI 16.5
12
3 Steps to Accelerate with OpenACC

[Diagram: Analyze → Parallelize → Optimize cycle]

13

Case Study: Conjugate Gradient

A sample code implementing the conjugate gradient method has been provided in C/C++ and Fortran.
• To save space, only the C version will be shown in the slides.
You do not need to understand the algorithm to proceed, but you should be able to read C, C++, or Fortran.
For more information on the CG method, see
https://en.wikipedia.org/wiki/Conjugate_gradient_method
14
Analyze

15
Analyze

Obtain a performance profile
Read compiler feedback
Understand the code

16
Obtain a Profile

An application profile helps us understand where time is spent
What routines are hotspots?
Focusing on the hotspots delivers the greatest performance impact
A variety of profiling tools are available: gprof, nvprof, CrayPAT, TAU, Vampir
We’ll use PGProf, which comes with the PGI compiler:

$ pgprof &
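As a rough sketch of the workflow (the source and binary names main.cpp and cg are hypothetical, not from the course material), one might build the serial code with compiler feedback enabled and then inspect it in PGProf:

$ pgc++ -fast -Minfo=all,ccff -o cg main.cpp    # build with optimization and CCFF feedback
$ ./cg                                          # run once to verify correct behavior
$ pgprof &                                      # then create a new PGProf session for ./cg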

17
PGPROF Profiler

[Slides 18-26: PGProf GUI screenshots stepping through the application profile; one slide is annotated “Double Click”.]

26
Compiler Feedback

Before we can make changes to the code, we need to understand how the compiler is optimizing.
With PGI, this can be done with the -Minfo and -Mneginfo flags:

$ pgc++ -Minfo=all,ccff -Mneginfo

Example feedback:
matvec(const matrix &, const vector &, const vector &):
     23, include "matrix_functions.h"
         Generated 2 alternate versions of the loop
         Generated vector sse code for the loop
         Generated 2 prefetch instructions for the loop

27
Compiler Feedback in PGProf

28
Parallelize

29
Parallelize

Insert OpenACC directives around important loops
Enable OpenACC in the compiler
Run on a parallel platform

30
Accelerated Computing
10x Performance & 5x Energy Efficiency for HPC
CPU: Optimized for Serial Tasks
GPU Accelerator: Optimized for Parallel Tasks

31
Accelerated Computing
10x Performance & 5x Energy Efficiency for HPC
CPU Strengths
• Very large main memory
• Very fast clock speeds
• Latency optimized via large caches
• Small number of threads can run very quickly

CPU Weaknesses
• Relatively low memory bandwidth
• Cache misses very costly
• Low performance/watt

32
Accelerated Computing
10x Performance & 5x Energy Efficiency for HPC
GPU Strengths
• High bandwidth main memory
• Significantly more compute resources
• Latency tolerant via parallelism
• High throughput
• High performance/watt

GPU Weaknesses
• Relatively low memory capacity
• Low per-thread performance

33
Speed v. Throughput
Speed Throughput

Which is better depends on your needs…

*Images from Wikimedia Commons via Creative Commons 34


Accelerator Nodes

[Diagram: CPU with its RAM and GPU with its RAM, connected via PCIe]

CPU and GPU communicate via PCIe
• Data must be copied between these memories over PCIe
• PCIe bandwidth is much lower than either memory's bandwidth

Obtaining high performance on GPU nodes often requires reducing PCIe copies to a minimum; the data-region sketch below shows one way to do that.
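As a hedged sketch (array names and sizes are illustrative, not from the lecture), a structured data region keeps arrays resident in GPU memory across several compute regions so they cross PCIe only once in each direction:

// x and y are copied to the GPU once and z is copied back once,
// even though two compute regions execute inside the data region.
#pragma acc data copyin(x[0:n], y[0:n]) copyout(z[0:n])
{
  #pragma acc parallel loop
  for (int i = 0; i < n; i++)
    z[i] = x[i] + y[i];

  #pragma acc parallel loop
  for (int i = 0; i < n; i++)
    z[i] = 2.0 * z[i] + y[i];
}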
35
CUDA Unified Memory
Simplified Developer Effort

Sometimes referred to as “managed memory.”

[Diagram: Without Unified Memory, separate System Memory and GPU Memory; With Unified Memory, one Unified Memory]

New “Pascal” GPUs handle Unified Memory in hardware.
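A minimal sketch (not from the slides; names and sizes are illustrative) of how managed memory simplifies things: ordinary heap allocations plus a parallel loop (the directive is introduced on the following slides), with no data clauses, compiled with the -ta=tesla:managed flag used later in this lecture:

#include <stdlib.h>

int main(void) {
  int n = 1 << 20;
  double *x = (double *)malloc(n * sizeof(double));   // migrated automatically
  double *y = (double *)malloc(n * sizeof(double));   // under -ta=tesla:managed

  for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

  #pragma acc parallel loop      // no copyin/copyout clauses needed
  for (int i = 0; i < n; i++)
    y[i] = 2.0 * x[i] + y[i];

  free(x); free(y);
  return 0;
}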
36
OpenACC Parallel Directive
Generates parallelism

#pragma acc parallel
{

}

When encountering the parallel directive, the compiler will generate 1 or more parallel gangs, which execute redundantly.
37
OpenACC Loop Directive
Identifies loops to run in parallel

#pragma acc parallel
{
  #pragma acc loop
  for (i = 0; i < N; i++)
  {

  }
}

The loop directive informs the compiler which loops to parallelize.
39
OpenACC Parallel Loop Directive
Generates parallelism and identifies loop in one directive

#pragma acc parallel loop
for (i = 0; i < N; i++)
{

}

The parallel and loop directives are frequently combined into one.
41
Case Study: Parallelize

Normally we would start with the most time-consuming routine to deliver the greatest performance impact.

In order to ease you into writing parallel code, I will instead start with the simplest routine.

42
Parallelize Waxpby

void waxpby(...) {
  #pragma acc parallel loop
  for(int i=0;i<n;i++) {
    wcoefs[i] = alpha*xcoefs[i] + beta*ycoefs[i];
  }
}

Adding a parallel loop around the waxpby loop informs the compiler to:
• Generate parallel gangs on which to execute
• Parallelize the loop iterations across the parallel gangs
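For reference, a self-contained version of this routine might look like the following; the argument list is an assumption, since the slide elides it with "(...)":

// Hedged sketch of w = alpha*x + beta*y with assumed raw-pointer arguments.
void waxpby(int n, double alpha, const double *xcoefs,
            double beta, const double *ycoefs, double *wcoefs) {
  #pragma acc parallel loop
  for (int i = 0; i < n; i++) {
    wcoefs[i] = alpha * xcoefs[i] + beta * ycoefs[i];
  }
}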
43
Build With OpenACC

The PGI -ta flag enables OpenACC and chooses a target accelerator.
We’ll add the following to our compiler flags:

-ta=tesla:managed

Compiler feedback now:
waxpby(double, const vector &, double, const vector &, const vector &):
      6, include "vector_functions.h"
         22, Generating implicit copyout(wcoefs[:n])
             Generating implicit copyin(xcoefs[:n],ycoefs[:n])
             Accelerator kernel generated
             Generating Tesla code
             25, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
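Putting the flags from this lecture together, a full build line might look like the following sketch (the source and output names are hypothetical):

$ pgc++ -fast -ta=tesla:managed -Minfo=all,ccff -Mneginfo -o cg main.cpp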

44
PGPROF with Parallel waxpby

A significant portion
of the time is now
spent migrating data
between the host and
device.

45
PGPROF with Parallel waxpby

In order to improve
performance, we
need to parallelize
the remaining
functions.

46
Parallelize Dot

double dot(...) {
  double sum = 0.0;
  #pragma acc parallel loop reduction(+:sum)
  for(int i=0;i<n;i++) {
    sum += xcoefs[i]*ycoefs[i];
  }
  return sum;
}

Because each iteration of the loop adds to the variable sum, we must declare a reduction.
A parallel reduction may return a slightly different result than a sequential addition due to floating point limitations.
47
Parallelize Matvec
void matvec(...) {
  #pragma acc parallel loop
  for(int i=0;i<num_rows;i++) {
    double sum=0;
    int row_start=row_offsets[i];
    int row_end=row_offsets[i+1];
    #pragma acc loop reduction(+:sum)
    for(int j=row_start;j<row_end;j++) {
      unsigned int Acol=cols[j];
      double Acoef=Acoefs[j];
      double xcoef=xcoefs[Acol];
      sum+=Acoef*xcoef;
    }
    ycoefs[i]=sum;
  }
}

The outer parallel loop generates parallelism and parallelizes the “i” loop.
The inner loop directive declares the iterations of “j” independent and the reduction on “sum”.
48
Final PGPROF Profile for Lecture 1

Now data migration has been eliminated during the computation.

49
OpenACC Profiling in PGPROF

PGPROF will show you where in your code to find an OpenACC region.

We’ll optimize this loop next week!
50
OpenACC Performance So Far…
[Chart: Speed-up from serial for Serial, Multicore, K80 (single), and P100; y-axis up to 10x]

Source: PGI 16.9, Multicore: Intel Xeon CPU E5-2698 v3 @ 2.30GHz
51


Where we’re going next week…
[Chart: Speed-up from serial for Serial, Multicore, K80 (single), and P100 after next week’s optimizations; y-axis up to 25x]

Source: PGI 16.9, Multicore: Intel Xeon CPU E5-2698 v3 @ 2.30GHz
52


Optimize (Next Week)

53
Optimize (Next Week)

Get new performance data from parallel execution
Remove unnecessary data transfer to/from GPU
Guide the compiler to better loop decomposition
Refactor the code to make it more parallel
54
Using QwikLabs

55
Getting access

1. Create an account with NVIDIA qwikLABS: https://developer.nvidia.com/qwiklabs-signup
2. Enter the promo code OPENACC before submitting the form
3. Free credits will be added to your account
4. Start using OpenACC!

56
CERTIFICATION
Available after November 9th
1. Attend live lectures
2. Complete the test
3. Enter for a chance to win a Titan X or an OpenACC Book
Official rules:
http://developer.download.nvidia.com/compute/OpenACC-Toolkit/docs/TITANX-GIVEAWAY-OPENACC-Official-Rules-2016.pdf

OPENACC TOOLKIT
Free for Academia
Download link: https://developer.nvidia.com/openacc-toolkit

NEW OPENACC BOOK
Parallel Programming with OpenACC
Available starting Nov 1st, 2016:
http://store.elsevier.com/Parallel-Programming-with-OpenACC/Rob-Farber/isbn-9780124103979/
57
Where to find help
• OpenACC Course Recordings - https://developer.nvidia.com/openacc-courses
• PGI Website - http://www.pgroup.com/resources
• OpenACC on StackOverflow - http://stackoverflow.com/questions/tagged/openacc
• OpenACC Toolkit - http://developer.nvidia.com/openacc-toolkit
• Parallel Forall Blog - http://devblogs.nvidia.com/parallelforall/
• GPU Technology Conference - http://www.gputechconf.com/
• OpenACC Website - http://openacc.org/

Questions? Email openacc@nvidia.com 58


Course Syllabus

Oct 26: Analyzing and Parallelizing with OpenACC
Nov 2: OpenACC Optimizations
Nov 9: Advanced OpenACC

Recordings: https://developer.nvidia.com/intro-to-openacc-course-2016
Questions? Email openacc@nvidia.com
59
Additional Material

60
OpenACC kernels Directive
Identifies a region of code where I think the compiler can turn loops into kernels

#pragma acc kernels
{
  for(int i=0; i<N; i++)
  {
    x[i] = 1.0;              // kernel 1
    y[i] = 2.0;
  }
  for(int i=0; i<N; i++)
  {
    y[i] = a*x[i] + y[i];    // kernel 2
  }
}

The compiler identifies 2 parallel loops and generates 2 kernels.

61
Loops vs. Kernels

for (int i = 0; i < 16384; i++)        function loopBody(A, B, C, i)
{                                      {
  C[i] = A[i] + B[i];                    C[i] = A[i] + B[i];
}                                      }

62
Loops vs. Kernels

The loop computes elements 0 through 16383 in order, one at a time.
The kernel form expresses the same work as independent instances of loopBody(A, B, C, i): many indices are computed at once, in parallel batches and in no guaranteed order, until all 16384 elements are done.

65
Parallelize Matvec with kernels
void matvec(...) {
  double *restrict ycoefs=y.coefs;
  #pragma acc kernels
  for(int i=0;i<num_rows;i++) {
    double sum=0;
    int row_start=row_offsets[i];
    int row_end=row_offsets[i+1];
    #pragma acc loop reduction(+:sum)
    for(int j=row_start;j<row_end;j++) {
      unsigned int Acol=cols[j];
      double Acoef=Acoefs[j];
      double xcoef=xcoefs[Acol];
      sum+=Acoef*xcoef;
    }
    ycoefs[i]=sum;
  }
}

With the kernels directive, the compiler will detect a (false) data dependency on ycoefs.
It’s necessary to either mark the loop as independent or add the restrict keyword to get parallelization.
66
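As a hedged alternative to restrict (a sketch, not the lecture's code), the outer loop can instead be asserted independent inside the kernels region:

void matvec(...) {
  // no restrict pointer needed when the loop is asserted independent
  #pragma acc kernels
  #pragma acc loop independent
  for(int i=0;i<num_rows;i++) {
    ...   // loop body as above
  }
}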
OpenACC parallel loop vs. kernels
PARALLEL LOOP
• Programmer’s responsibility to ensure safe parallelism
• Will parallelize what a compiler may miss
• Straightforward path from OpenMP

KERNELS
• Compiler’s responsibility to analyze the code and parallelize what is safe
• Can cover a larger area of code with a single directive
• Gives the compiler additional leeway to optimize
• Compiler sometimes gets it wrong

Both approaches are equally valid and can perform equally well.
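For a concrete, hedged comparison (reusing the loops from slide 61), the same pair of loops can be written either way:

// parallel loop: the programmer asserts that each loop is safe to parallelize
#pragma acc parallel loop
for(int i=0; i<N; i++) { x[i] = 1.0; y[i] = 2.0; }
#pragma acc parallel loop
for(int i=0; i<N; i++) { y[i] = a*x[i] + y[i]; }

// kernels: one directive covers both loops; the compiler decides what is safe
#pragma acc kernels
{
  for(int i=0; i<N; i++) { x[i] = 1.0; y[i] = 2.0; }
  for(int i=0; i<N; i++) { y[i] = a*x[i] + y[i]; }
}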
67
OpenACC Performance So Far… (kernels)
[Chart: Speed-up from serial for Serial, Multicore, K80 (single), and P100 using the kernels version; y-axis up to 15x]

Source: PGI 16.9, Multicore: Intel Xeon CPU E5-2698 v3 @ 2.30GHz
68
