
Analysis of Programs for GPGPU Architectures

Lakshmi M. Gadhikar
Computer Engineering Department
S.P.I.T., Andheri (W)
Mumbai, India
lmgadhikar@gmail.com

Dr. Y. S. Rao
Electronics and Telecommunications Engineering Department
S.P.I.T., Andheri (W)
Mumbai, India
ysrao@spit.ac.in

Abstract— Over the past decade, General Purpose Graphics Processing Units (GPGPUs) have emerged as a platform for general purpose and scientific computing applications. Languages used for programming GPGPUs include CUDA, OpenCL and DirectCompute. As GPUs are widely used in scientific computing, optimization of GPU programs has become a major concern. Producing accurate results in reduced time is increasingly becoming an important factor in the usability of GPU applications in the concerned domains. Program optimization requires analysis of the program to find the program features or parameters that impact its execution and storage efficiency. In this work, we analyze GPGPU programs written in CUDA to extract important features that will be useful for further optimization.

Keywords—GPGPU; CUDA; Program Analysis.

I. INTRODUCTION

Analysis of programs may be done at two levels: static and dynamic. In the context of GPGPUs, dynamic analysis requires procuring costly hardware, namely a GPU on which to run the programs and observe their runtime behavior. Static analysis is performed at compile time, without the need for a specific hardware device or a runtime environment in which to execute the program. In this work, we therefore focus on static analysis of GPGPU programs. We present a comprehensive analysis of CUDA programs from the Rodinia benchmark suite [1] and the Nvidia CUDA SDK [2]. We perform automated program analysis to extract useful parameters from GPGPU programs. Static analysis of GPGPU programs may be used to predict their performance on a specific architecture and to find the resource usage of a program, for example in terms of global and shared memory usage.

All of the experiments were performed on a system with an Intel Core i7 running Ubuntu 14.04 x86-64, equipped with an NVIDIA GeForce GTX 780 GPU.

The rest of the paper is organized as follows. Section II describes Graphics Processing Units. Section III gives an overview of the performance parameters of a CUDA GPGPU program analyzed in the proposed system. Section IV describes the proposed system for the analysis of GPGPU programs. Results are presented in Section V. Related work is discussed in Section VI and the conclusion in Section VII.

II. GRAPHICS PROCESSING UNITS

A Graphics Processing Unit (GPU) consists of a number of symmetric multiprocessors, known as SMX in the Nvidia Kepler architecture. Each SMX contains a number of CUDA cores, 192 per SMX in Kepler [9]. CUDA cores are the processing elements which process the data given as input to the GPU in parallel. Each SMX contains a set of registers and an L1 cache plus shared memory that is shared by all the threads running on that SMX. There is also an L2 cache, outside the SMX but on the GPU device itself. The memory system of Nvidia GPUs is organized in a six-level hierarchy. Registers and shared memory reside inside the SMX, and registers are the fastest of all memories. The other device memories are global memory, local memory, constant memory and texture memory.

III. OVERVIEW OF PERFORMANCE PARAMETERS IN THE PROPOSED SYSTEM

Instructions in a thread may be divided into computation instructions, memory instructions, control flow instructions and synchronization instructions. The time required to execute an instruction is affected to a great extent by its type. For example, for CUDA compute capability 3.5 [3], the throughput of a single-precision floating-point add, multiply or multiply-add instruction is 192 operations per clock cycle per multiprocessor with a latency of 18-22 clock cycles, the throughput of a memory instruction is around 8 operations per clock cycle with a latency of 400-800 clock cycles, whereas __syncthreads() gives a throughput of 128 operations per clock cycle [3].
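To put these figures in perspective, the short Python fragment below (an illustration only, not part of the proposed system; the numbers are simply those quoted from [3]) compares the minimum number of cycles a single multiprocessor needs just to issue the same number of operations of each type:

    # Throughputs cited above for compute capability 3.5 (operations per clock
    # cycle per multiprocessor). Latencies are ignored in this rough comparison.
    THROUGHPUT = {"fp32 arithmetic": 192, "memory": 8, "__syncthreads()": 128}

    def issue_cycles(op_type, n_ops):
        """Minimum cycles needed by one multiprocessor to issue n_ops operations."""
        return n_ops / THROUGHPUT[op_type]

    for op in THROUGHPUT:
        # e.g. 1024 arithmetic operations need about 5.3 cycles to issue,
        # whereas 1024 memory operations need 128 cycles.
        print(op, issue_cycles(op, 1024))

Even before latency is taken into account, the instruction mix of a kernel therefore has a direct impact on its execution time, which is why the proposed system counts instructions of each type separately.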
The performance parameters analyzed in the proposed system are explained below.

A. Global and Device Kernels
In CUDA, a kernel is a function that executes the same set of operations on multiple cores of a GPU in a SIMT (Single Instruction Multiple Threads) manner, i.e. the same set of instructions is executed by multiple threads in parallel on multiple cores of the GPU. There are two types of kernels, viz. global and device kernels. A global kernel is launched from the host code and executes on the device, whereas a device kernel cannot be called from the host program directly; it has to be invoked from within a global kernel. A global kernel may invoke another global kernel or a device kernel.

B. Number of arithmetic operations
CUDA supports different types of arithmetic operations such as addition, subtraction, multiplication, division, a number of transcendental operations etc. Counting the number of arithmetic operations of each type is important, as each type of operation consumes a different number of cycles.

C. Number and type of memory operations
As explained in Section II, the memory system of CUDA is hierarchically organized into different levels such as registers, global, shared, constant etc. As there is a considerable difference in the number of cycles required for global and shared memory accesses, in this work we concentrate on the analysis of global and shared memories. Also, as the number of cycles required for read and write accesses is the same, we do not count reads and writes separately; they are counted together as the total number of global or shared memory accesses.

D. Number of intra-block synchronization operations
The intra-block synchronization operation __syncthreads() [7] acts as a barrier at which all threads in a block must wait before they proceed to the next instruction.
IV. ANALYSIS OF GPGPU PROGRAMS

We focus on the analysis of GPGPU programs written in CUDA. A typical CUDA program has two parts:
i. the sequential portion of the code, which runs on the CPU (host), and
ii. the parallel portion, which runs on the GPU (device).
The parallelism in a CUDA program is expressed inside a kernel. A kernel is a SIMD (Single Instruction Multiple Data) function which executes in parallel on multiple cores of a GPU. Our work focuses on the analysis of CUDA kernels from multiple application domains.

We developed a Python program to automatically scan through various CUDA programs. It performs the following two functions:
1. Scanner
2. Parameter extractor and Output generator

1. Scanner
The scanner scans the entire CUDA program given as input to find the various parameters that affect the execution performance of a GPGPU program. The program is scanned line by line from the beginning to the end of the program file. We use regular expressions to search for the keywords of interest such as "__global__", "__shared__" etc.
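As an illustration, such keyword searches can be expressed with Python's re module. The exact expressions used by the tool are not reproduced in this paper; the patterns below are a simplified sketch.

    import re

    # Illustrative patterns for the keywords of interest; a sketch only, not the
    # exact expressions used by the tool.
    PATTERNS = {
        "global_kernel": re.compile(r"^\s*__global__\s+\w+\s+(\w+)\s*\("),
        "device_kernel": re.compile(r"^\s*__device__\s+\w+\s+(\w+)\s*\("),
        "shared_decl":   re.compile(r"__shared__"),
        "syncthreads":   re.compile(r"__syncthreads\s*\(\s*\)"),
    }

    def scan_line(line):
        """Return the names of all patterns that match one source line."""
        return [name for name, pattern in PATTERNS.items() if pattern.search(line)]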
Parameters that impact the execution performance of CUDA programs are the number and type of arithmetic and memory operations, the number of intra-block synchronization instructions etc., as explained below.

2. Parameter extractor and Output generator
In this phase, we extract the following parameters, which affect the performance of a CUDA program, based on the patterns found in the first step:
1. Number of global and device kernels
2. Definitions of global and device kernels
3. Number and type of arithmetic instructions per kernel
4. Number and type of memory instructions per kernel, i.e. the number of global memory accesses and the number of shared memory accesses per kernel
5. Number of intra-block synchronization instructions

Thus, the output of this stage is the set of parameters listed above, which affect the execution performance of a GPGPU program. The extracted parameters are stored in respective files created at runtime.

3. Output generator
This phase generates an output file containing the parameters mentioned in phase 2.

Pseudo code to find the number of global kernels together with their definitions is given below.

    kernel_count = 0
    FOR each line in the source CUDA file DO
        Search for a line starting with the PATTERN "__global__"
        IF FOUND
            Extract the entire line containing the PATTERN
            followed by its definition and
            store it in the file global_kernel_count_defination
            kernel_count = kernel_count + 1
    ENDDO

Code is written in a similar way for the extraction of the count and definitions of device kernels, by replacing the PATTERN "__global__" with the PATTERN "__device__". Similarly, we search for the other parameters of interest and store the extracted parameters in the output file.
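A minimal Python sketch corresponding to the pseudo code above is shown below. It locates each __global__ kernel, copies its signature and body into the file global_kernel_count_defination, and returns the kernel count. The body is located here by simple brace matching, which is a simplifying assumption for illustration rather than a detail of the actual tool.

    import re

    def extract_global_kernels(cuda_file, out_file="global_kernel_count_defination"):
        """Count __global__ kernels in cuda_file and store their definitions."""
        kernel_count = 0
        lines = open(cuda_file).readlines()
        with open(out_file, "w") as out:
            for i, line in enumerate(lines):
                if re.match(r"\s*__global__\b", line):       # line starts with the PATTERN
                    kernel_count += 1
                    out.write("Global kernel %d\n" % kernel_count)
                    depth, seen_brace = 0, False
                    for body_line in lines[i:]:               # copy the signature and its body
                        out.write(body_line)
                        depth += body_line.count("{") - body_line.count("}")
                        if "{" in body_line:
                            seen_brace = True
                        if seen_brace and depth == 0:         # matching closing brace reached
                            break
        return kernel_count

Replacing "__global__" with "__device__" (and the output file with device_kernel_count_defination) gives the corresponding extractor for device kernels.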

V. RESULTS
The contents of the file global_kernel_count_defination for the fluidsGL_kernels program from NVIDIA_CUDA-7.0_Samples in the NVIDIA SDK Toolkit are shown below.

Global kernel 1
__global__ void addForces_k(cData *v, int dx, int dy, int spx, int spy, float fx, float fy, int r, size_t pitch)
{
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    cData *fj = (cData *)((char *)v + (ty + spy) * pitch) + tx + spx;
    :
    :
}

Global kernel 2
__global__ void advectVelocity_k(cData *v, float *vx, float *vy, int dx, int pdx, int dy, float dt, int lb)
{
    int gtidx = blockIdx.x * blockDim.x + threadIdx.x;
    int gtidy = blockIdx.y * (lb * blockDim.y) + threadIdx.y * lb;
    :
    :
}
:
:
Global kernel 5
__global__ void advectParticles_k(cData *part, cData *v, int dx, int dy, float dt, int lb, size_t pitch)
{
    int gtidx = blockIdx.x * blockDim.x + threadIdx.x;
    int gtidy = blockIdx.y * (lb * blockDim.y) + threadIdx.y * lb;
    :
    :
}

The information generated for each of the global kernels (stored in the global_kernel_count_defination file) and for the device kernels (stored in the device_kernel_count_defination file) is summarized in Table 1. We present the results for two programs, fluidsGL_kernels and MonteCarloGPU kernels, from NVIDIA_CUDA-7.0_Samples and for two programs, LUD kernels and Discrete Wavelet Transform (DWT) kernels, from the Rodinia 3.1 benchmark suite.

As both memory load and memory store instructions require the same execution time, we do not discriminate between memory load and memory store instructions. However, there is a considerable difference in the time required for the execution of global and shared memory operations, as explained in Section III, so we count the total number of global memory instructions and the total number of shared memory instructions separately.
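To separate the two counts, one possible static heuristic, given here only as an illustrative sketch and not necessarily the exact rule applied by the scanner, is to record the names declared with __shared__ inside a kernel and to classify indexed accesses to those names as shared memory operations and all other indexed accesses (typically through pointer parameters) as global memory operations:

    import re

    # Illustrative classification of indexed accesses in one kernel body: names
    # declared __shared__ count as shared memory accesses, everything else as
    # global memory accesses. A sketch only; the declarations themselves are also
    # matched, and a real implementation would filter them out.
    def classify_memory_accesses(kernel_body):
        shared_names = set(re.findall(r"__shared__\s+\w+\s+(\w+)", kernel_body))
        shared, global_ = 0, 0
        for name in re.findall(r"(\w+)\s*\[", kernel_body):   # every access of the form name[...]
            if name in shared_names:
                shared += 1
            else:
                global_ += 1
        return shared, global_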
GPGPU Parameter                                     FluidsGL GPU   MonteCarlo   LUD   DWT
Number of global kernels                                       5            2     2     4
Number of device kernels                                       0            2     0    10
Number of arithmetic operations                              300           20    92   262
Total number of memory operations                             10            4    40    87
Number of global memory operations                            10            0    11    24
Number of shared memory operations                             0            4    29    63
Number of intra-block synchronization operations               0            0     6    31

Table 1. Result of extraction of parameters from CUDA programs.

VI. RELATED WORK

Lynx [4] is a dynamic binary instrumentation infrastructure used for constructing customizable program analysis tools for GPU-based parallel architectures. It provides a set of C-based language constructs which may be used to build program analysis tools, and it uses a JIT compiler to translate, insert and optimize instrumentation code. Lynx modifies the applications to be analyzed by inserting code into the application program at the instruction level. This may introduce errors into the original application code, which then require debugging, and it also causes runtime overhead. Our approach does not require any modification of the programs to be analyzed and generates information about the code without the need to compile and run the application under analysis.

Yao Zhang and John D. Owens [5] have developed a microbenchmark-based performance model for NVIDIA GeForce 200-series GPUs. Their model identifies bottlenecks in GPU application programs and provides quantitative performance analysis which can be used to predict the benefits of potential program optimizations and architectural improvements. The model also suggests improvements in hardware resource allocation, the avoidance of bank conflicts, block scheduling and memory transaction granularity. The purpose of our work is pure analysis of programs, to extract useful features of a GPGPU program which may then be optimized in a variety of ways depending on the requirements of applications in individual domains.

Michael Boyer et al. [6] have developed a technique for automated dynamic analysis of CUDA programs to find two major classes of bugs, namely race conditions and shared memory bank conflicts, which impact program correctness and program performance respectively. This automated analysis finds and reports these classes of problems, which are otherwise difficult to detect manually. They use a dynamic approach to program analysis, whereas we perform static analysis of programs without the need to run them, thereby avoiding the runtime overhead present in Boyer's approach.

VII. CONCLUSION

In this paper we have presented an analysis of programs from different application domains taken from the Nvidia SDK toolkit and the Rodinia benchmark suite 3.1. The purpose of the analysis is to automatically extract some of the important parameters that impact the performance of a GPGPU program. The extracted parameters may be used for predicting the execution time of a CUDA GPGPU program and also for optimization purposes.

ACKNOWLEDGEMENT

We thank Prof. S. Biswas from the Department of CSE, IIT, Mumbai, for his expert advice, guidance and continuous support during this research work.

REFERENCES
[1] S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, L. Wang, and K. Skadron, "A Characterization of the Rodinia Benchmark Suite with Comparison to Contemporary CMP Workloads," in Proceedings of the IEEE International Symposium on Workload Characterization, Dec. 2010.
[2] NVIDIA, NVIDIA CUDA Tools SDK CUPTI, NVIDIA Corporation, Santa Clara, California, 1.0 edition, February 2011.
[3] NVIDIA, "Chapter 5: Performance Guidelines," CUDA C Programming Guide (v9.0.176), http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html [accessed on 14/11/2017].
[4] N. Farooqui, A. Kerr, G. Eisenhauer, K. Schwan, and S. Yalamanchili, "Lynx: A dynamic instrumentation system for data-parallel applications on GPGPU architectures," IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2012.
[5] Y. Zhang and J. D. Owens, "A Quantitative Performance Analysis Model for GPU Architectures," IEEE 17th International Symposium on High Performance Computer Architecture (HPCA), San Antonio, TX, USA, Feb. 2011.
[6] M. Boyer, K. Skadron, and W. Weimer, "Automated dynamic analysis of CUDA programs," Third Workshop on Software Tools for MultiCore Systems (STMCS), 2008.
[7] NVIDIA Corporation, NVIDIA CUDA Programming Guide, Version 9.1.85, http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html.
[8] J. Walker, "Two GPUs, One Insane Graphics Card: Introducing the GeForce GTX TITAN Z," Nvidia Blog, Nvidia Corporation, 25 March 2014.
[9] NVIDIA, "NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110," Whitepaper, http://www.nvidia.in/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf.
