Contents of the file global_kernel_count_defination for the fluidsGL_kernels program from NVIDIA_CUDA-7.0_Samples in the NVIDIA SDK Toolkit are as shown below.

Global kernel 1
__global__ void addForces_k(cData *v, int dx, int dy,
        int spx, int spy, float fx, float fy, int r, size_t pitch)
{
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    cData *fj = (cData *)((char *)v +
        (ty + spy) * pitch) + tx + spx;
    :
    :
}
Global kernel 2
__global__ void advectVelocity_k(cData *v, float *vx,
        float *vy, int dx, int pdx, int dy, float dt, int lb)
{
    int gtidx = blockIdx.x * blockDim.x + threadIdx.x;
    int gtidy = blockIdx.y * (lb * blockDim.y) + threadIdx.y * lb;
    :
    :
}
:
:
Global kernel 5
__global__ void advectParticles_k(cData *part, cData *v,
        int dx, int dy, float dt, int lb, size_t pitch)
{
    int gtidx = blockIdx.x * blockDim.x + threadIdx.x;
    int gtidy = blockIdx.y * (lb * blockDim.y) + threadIdx.y * lb;
    :
    :
}

Information generated for each of the global kernels in the global_kernel_count_defination file and for the device kernels in the device_kernel_count_defination file is shown in Table 1. We present the results for two programs, fluidsGL_kernels and MonteCarloGPU kernels, from NVIDIA_CUDA-7.0_Samples, and two programs, LUD kernels and Discrete Wavelet Transform (DWT) kernels, from the Rodinia 3.1 benchmark suite.

GPGPU Parameter                                    FluidsGL  MonteCarloGPU  LUD  DWT
Number of global kernels                               5           2          2    4
Number of device kernels                               0           2          0   10
Number of arithmetic operations                      300          20         92  262
Total number of memory operations                     10           4         40   87
Number of global memory operations                    10           0         11   24
Number of shared memory operations                     0           4         29   63
Number of intra-block synchronization operations       0           0          6   31

Table 1. Result of extraction of parameters from CUDA programs.

As memory load and memory store instructions require the same execution time, we do not discriminate between them. However, there is a considerable difference in the execution time of global and shared memory operations, as explained in Section III, so we count the total number of global memory instructions and the total number of shared memory instructions separately.

VI. RELATED WORK

Lynx [4] is a dynamic binary instrumentation infrastructure used to construct customizable program analysis tools for GPU-based parallel architectures. It provides a set of C-based language constructs that may be used to build program analysis tools, and it uses a JIT compiler to translate, insert, and optimize instrumentation code. Lynx modifies the applications being analyzed by inserting code into the application program at the instruction level. This may introduce errors into the original application code, which then require debugging, and it also incurs runtime overhead. Our approach requires no modification of the programs being analyzed and generates information about the code without the need to compile and run the application under analysis.

Yao Zhang and John D. Owens [5] have developed a microbenchmark-based performance model for NVIDIA GeForce 200-series GPUs. Their model identifies bottlenecks in GPU application programs and provides quantitative performance analysis that can be used to predict the benefits of potential program optimizations and architectural improvements. The model also suggests improvements in hardware resource allocation, avoidance of bank conflicts, block scheduling, and memory transaction granularity. The purpose of our work is pure analysis of programs to extract useful features of a GPGPU program, which may then be optimized in a variety of ways depending on the requirements of an application in an individual domain.

Michael Boyer et al. [6] have developed a technique for automated dynamic analysis of CUDA programs that finds two major classes of bugs in CUDA programs, namely race conditions and shared memory bank conflicts, which impact program correctness and program performance respectively. This automated analysis finds and reports these classes of problems in CUDA programs, which are otherwise difficult to detect manually. They use a dynamic approach to program analysis, whereas we perform static analysis of programs without the need to run them, thereby avoiding the runtime overhead present in Boyer's approach.
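The static extraction described above can be illustrated with a small textual scan over CUDA source. The following is not the authors' implementation (their extractor is more elaborate, and the function names and regular expressions here are our own illustrative assumptions); it is a minimal Python sketch of counting a few of the Table 1 parameters — global and device kernel declarations, intra-block synchronization points, and shared-memory declarations — without compiling or running the program.

```python
import re

def scan_cuda_source(src: str) -> dict:
    """Statically count a few Table 1-style parameters in CUDA source text.

    Rough and illustrative only: an accurate count of arithmetic and
    memory operations would require a real parser, not pattern matching.
    """
    return {
        # __global__ kernels always return void, so this pattern is safe.
        "global_kernels": len(re.findall(r"__global__\s+void\s+\w+\s*\(", src)),
        # Crude: counts every __device__ qualifier as one device kernel.
        "device_kernels": len(re.findall(r"__device__", src)),
        # Intra-block synchronization points are explicit __syncthreads() calls.
        "sync_operations": len(re.findall(r"__syncthreads\s*\(\s*\)", src)),
        # Shared-memory declarations hint at shared (vs. global) memory traffic.
        "shared_declarations": len(re.findall(r"__shared__", src)),
    }

sample = """
__global__ void addForces_k(float *v, int dx) {
    __shared__ float tile[16];
    __syncthreads();
}
__device__ float helper(float x) { return x * 2.0f; }
"""
print(scan_cuda_source(sample))
# {'global_kernels': 1, 'device_kernels': 1, 'sync_operations': 1, 'shared_declarations': 1}
```

Because the scan never executes the kernels, it carries none of the runtime overhead of instrumentation-based approaches, which is the property the comparison above emphasizes.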
VII. CONCLUSION
ACKNOWLEDGEMENT
REFERENCES
[1] S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, L. Wang, and K. Skadron.
A Characterization of the Rodinia Benchmark Suite with Comparison to
Contemporary CMP Workloads. In Proceedings of the IEEE International
Symposium on Workload Characterization, Dec. 2010.
[2] NVIDIA. NVIDIA CUDA Tools SDK CUPTI. NVIDIA Corporation, Santa
Clara, California, 1.0 edition, February 2011.
[3] Chapter 5, Performance Guidelines, CUDA C Programming Guide Version 4.0 (v9.0.176),
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html [accessed on: 14/11/2017]
[4] Naila Farooqui, Andrew Kerr, Greg Eisenhauer, Karsten Schwan, and Sudhakar
Yalamanchili. Lynx: A Dynamic Instrumentation System for Data-Parallel
Applications on GPGPU Architectures. IEEE International Symposium on
Performance Analysis of Systems and Software (ISPASS), 2012.
[5] Yao Zhang and John D. Owens. A Quantitative Performance Analysis Model
for GPU Architectures. IEEE 17th International Symposium on High
Performance Computer Architecture (HPCA), San Antonio, TX, USA, Feb.
2011.
[6] Michael Boyer, Kevin Skadron, and Westley Weimer. Automated Dynamic
Analysis of CUDA Programs. Third Workshop on Software Tools for MultiCore
Systems (STMCS), 2008.
[7] NVIDIA Corporation. NVIDIA CUDA Programming Guide, Version 9.1.85
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html.
[8] Walker, Justin (25 March 2014). "Two GPUs, One Insane Graphics Card:
Introducing the GeForce GTX TITAN Z". Nvidia Blog. Nvidia Corporation.
Retrieved 25 March 2014.
[9] NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110,
Whitepaper, http://www.nvidia.in/content/PDF/kepler/NVIDIA-Kepler-
GK110-Architecture-Whitepaper.pdf