Professional Documents
Culture Documents
2
Oct 26: Analyzing and Parallelizing with OpenACC
Recordings: 3
https://developer.nvidia.com/intro-to-openacc-course-2016
ANALYZING AND PARALLELIZING
WITH OPENACC
Lecture 1: Jeff Larkin, NVIDIA
Today’s Objectives
Analyze
Understand what OpenACC is and why to use it
Understand some of the differences between CPU
and GPU hardware.
Know how to obtain an application profile using
PGProf
Optimize Parallelize
Know how to add OpenACC directives to existing
loops and build with OpenACC using PGI
5
Why OpenACC?
6
University of Illinois
main() PowerGrid- MRI Reconstruction
{
<serial code>
#pragma acc kernels
7-8x Speed-Up
5% of Code Modified
http://w w w .cr ay.com/sites/default/files/r esources/OpenACC_213462.12_OpenACC_Cosmo_CS_FNL.pdf
http://w w w .hpcw ire.com/off-the-w ir e/first-round-of-2015-hackathons-gets-under w ay
7
http://on-demand.gputechconf.com/gtc/2015/pr esentation/S5297-Hisashi-Yashir o.pdf
http://w w w .openacc.or g/content/ex periences-por ting-molecular -dynamics-code-gpus-cr ay-x k7
Minimal Effort
LS-DALTON
Lines of Code # of Weeks # of Codes to
Large-scale application for calculating high- Modified Required Maintain
accuracy molecular energies
<100 Lines 1 Week 1 Source
Big Performance
LS-DALTON CCSD(T) Module
Benchmarked on Titan Supercomputer (AMD CPU vs Tesla K20X)
12.0x
10.0x
8.0x
9
OpenACC Performance Portability: CloverLeaf
Hydrodynamics Application OpenACC Performance Portability
10
CloverLeaf on Dual Haswell vs Tesla K80
20
15 14x
Speedup vs Single Haswell Core
10
7x 8x
0
Haswell: Intel OpenMP Haswell: PGI OpenACC Tesla K80: PGI OpenACC
CPU: Intel Xeon E5-2698 v3, 2 sockets, 32 cores, 2.30 GHz, HT disabled
GPU: NVIDIA Tesla K80 (single GPU)
OS: CentOS 6.6, Compiler: PGI 16.5 11
CloverLeaf on Tesla P100 Pascal
45
40x
40
Migrating from multicore CPU to K80 to
35
P100 requires only changing a compiler
30 flag.
25
Speedup vs Single Haswell Core
20
14x
15
10 7x 8x
0
Haswell: Intel OpenMP Haswell: PGI OpenACC Tesla K80: PGI OpenACC Tesla P100: PGI OpenACC
CPU: Intel Xeon E5-2698 v3, 2 sockets, 32 cores, 2.30 GHz, HT disabled
GPU: NVIDIA Tesla K80 (single GPU) , NVIDIA Tesla P100 (Single GPU)
OS: CentOS 6.6, Compiler: PGI 16.5 12
3 Steps to Accelerate with OpenACC
Analyze
Optimize Parallelize
13
10/26/2016
15
Analyze
16
Obtain a Profile
17
PGPROF Profiler
18
PGPROF Profiler
19
PGPROF Profiler
20
PGPROF Profiler
21
PGPROF Profiler
22
PGPROF Profiler
23
PGPROF Profiler
24
PGPROF Profiler
Double Click
25
PGPROF Profiler
26
Compiler Feedback
Before we can make changes to matvec(const matrix &, const vector &, const
vector &):
the code, we need to 23, include "matrix_functions.h"
understand how the compiler is Generated 2 alternate versions of the loop
Generated vector sse code for the loop
optimizing Generated 2 prefetch instructions for the
loop
With PGI, this can be done with
the –Minfo and –Mneginfo flags
27
Compiler Feedback in PGProf
28
Parallelize
29
Parallelize
30
Accelerated Computing
10x Performance & 5x Energy Efficiency for HPC
GPU Accelerator
CPU Optimized for
Optimized for Parallel Tasks
Serial Tasks
31
Accelerated Computing
10x Performance & 5x Energy Efficiency for HPC
CPU
GPU Strengths
Accelerator
CPU Optimized for
• Very largeParallel
main memory
Tasks
Optimized for • Very fast clock speeds
Serial Tasks • Latency optimized via large caches
• Small number of threads can run
very quickly
CPU Weaknesses
32
Accelerated Computing
10x Performance & 5x Energy Efficiency for HPC
GPU Strengths GPU Accelerator
Optimized for
• High bandwidthCPU
main memory Parallel Tasks
Optimized
• Significantly for
more compute
resources Serial Tasks
• Latency tolerant via parallelism
• High throughput
• High performance/watt
GPU Weaknesses
33
Speed v. Throughput
Speed Throughput
PCIe
35
CUDA Unified Memory Sometimes referred to as
“managed memory.”
Simplified Developer Effort
}
37
OpenACC Parallel Directive
Generates parallelism
}
38
OpenACC Loop Directive
Identifies loops to run in parallel
42
Parallelize Waxpby
44
PGPROF with Parallel waxpby
A significant portion
of the time is now
spent migrating data
between the host and
device.
45
PGPROF with Parallel waxpby
In order to improve
performance, we
need to parallelize
the remaining
functions.
46
Parallelize Dot
49
OpenACC Profiling in PGPROF
5.00X
Speed-up from serial
0.00X
Serial Multicore K80 (single) P100
20.00X
15.00X
10.00X
Speed-up from serial
5.00X
0.00X
Serial Multicore K80 (single) P100
53
Optimize (Next Week)
55
Getting access
56
CERTIFICATION OPENACC TOOLKIT
Available after November 9th Free for Academia
60
OpenACC kernels Directive
Identifies a region of code where I think the compiler can turn loops
into kernels
61 61
Loops vs. Kernels
for (int i = 0; i < 16384; i++) function loopBody(A, B, C, i)
{ {
C[i] = A[i] + B[i]; C[i] = A[i] + B[i];
} }
62
Loops vs. Kernels
for (int i = 0; i < 16384; i++) function loopBody(A, B, C, i)
{ {
C[i] = A[i] + B[i]; C[i] = A[i] + B[i];
} }
63
Loops vs. Kernels
for (int i = 0; i < 16384; i++) function loopBody(A, B, C, i)
{ {
C[i] = A[i] + B[i]; C[i] = A[i] + B[i];
} }
Calculate 0
64
Loops vs. Kernels
for (int i = 0; i < 16384; i++) function loopBody(A, B, C, i)
{ {
C[i] = A[i] + B[i]; C[i] = A[i] + B[i];
} }
Calculate 0
Calculate 1
Calculate 2
Calculate 3
Calculate 0
Calculate 4
Calculate 1
Calculate 2
Calculate 3
Calculate
Calculate 0 9
Calculate 1
Calculate 0 -16383 in order. Calculate 2
Calculate 3
Calculate
Calculate 0 14
Calculate 1
Calculate 2
Calculate 3
Calculate
Calculate 0 …
Calculate 1
Calculate 2
Calculate 3
Calculate 16383
65
Parallelize Matvec with kernels
void matvec(...) {
double *restrict ycoefs=y.coefs;
#pragma acc kernels
for(int i=0;i<num_rows;i++) {
double sum=0;
With the kernels directive, the
int row_start=row_offsets[i]; compiler will detect a (false)
int row_end=row_offsets[i+1]; data dependency on ycoefs.
#pragma acc loop reduction(+:sum)
for(int It’s necessary to either mark the
j=row_start;j<row_end;j++) { loop as independent or add the
unsigned int Acol=cols[j]; restrict keyword to get
double Acoef=Acoefs[j];
double xcoef=xcoefs[Acol]; parallelization.
sum+=Acoef*xcoef;
}
ycoefs[i]=sum;
}
66
}
OpenACC parallel loop vs. kernels
PARALLEL LOOP KERNELS
Programmer’s responsibility to ensure Compiler’s responsibility to analyze the
safe parallelism code and parallelize what is safe.
Will parallelize what a compiler may Can cover larger area of code with
miss single directive
Straightforward path from OpenMP Gives compiler additional leeway to
optimize.
Compiler sometimes gets it wrong.
Both approaches are equally valid and can perform equally well.
67
OpenACC Performance So Far… (kernels)
Serial Multicore K80 (single) P100
15.00X
10.00X
Speed-up from serial
5.00X
0.00X
Serial Multicore K80 (single) P100