accelerated CUDA programming

I have uploaded this document for understanding the CUDA hardware model and CUDA programming Model.
Published by: amit_per on Apr 23, 2010
Copyright: Attribution Non-commercial



hiCUDA: A High-level Directive-based Language for GPU Programming

Tianyi David Han
The Edward S. Rogers Sr. Department of Electrical and Computer Engineering
University of Toronto
Toronto, Ontario, Canada M5S 3G4
han@eecg.toronto.edu

Tarek S. Abdelrahman
The Edward S. Rogers Sr. Department of Electrical and Computer Engineering
University of Toronto
Toronto, Ontario, Canada M5S 3G4
tsa@eecg.toronto.edu

ABSTRACT

The Compute Unified Device Architecture (CUDA) has become a de facto standard for programming NVIDIA GPUs. However, CUDA places on the programmer the burden of packaging GPU code in separate functions, of explicitly managing data transfer between the host memory and various components of the GPU memory, and of manually optimizing the utilization of the GPU memory. Practical experience shows that the programmer needs to make significant code changes, which are often tedious and error-prone, before getting an optimized program. We have designed hiCUDA, a high-level directive-based language for CUDA programming. It allows programmers to perform these tedious tasks in a simpler manner, by applying directives directly to the sequential code. Nonetheless, it supports the same programming paradigm already familiar to CUDA programmers. We have prototyped a source-to-source compiler that translates a hiCUDA program to a CUDA program. Experiments using five standard CUDA benchmarks show that the simplicity and flexibility hiCUDA provides come at no expense to performance.

Categories and Subject Descriptors
D.3.2 [Programming Languages]: Language Classifications - concurrent, distributed, and parallel languages, specialized application languages, very high-level languages

General Terms
Languages, Design, Documentation, Experimentation, Measurement

Keywords
CUDA, GPGPU, Data parallel programming

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
GPGPU ’09 March 8, 2009, Washington, D.C. USA
Copyright 2009 ACM 978-1-60558-517-8 ...$5.00.

1. INTRODUCTION

The Compute Unified Device Architecture (CUDA) has become a de facto standard for programming NVIDIA GPUs. Although it is a simple extension to C, CUDA places on the programmer the burden of packaging GPU code in separate functions, of explicitly managing data transfer between the host memory and various GPU memories, and of manually optimizing the utilization of the GPU memory. Further, the complexity of the underlying GPU architecture demands that the programmer experiment with many configurations of the code to obtain the best performance. The experiments involve different schemes of partitioning computation among GPU threads, of optimizing single-thread code, and of utilizing the GPU memory. As a result, the programmer has to make significant code changes, possibly many times, before achieving the desired performance. Practical experience shows that this process is very tedious and error-prone. Nonetheless, many of the tasks involved are mechanical, and we believe they can be performed by a compiler.

Therefore, we have defined a directive-based language called hiCUDA (for high-level CUDA) for programming NVIDIA GPUs. The language provides a programmer with high-level abstractions to carry out the tasks mentioned above in a simple manner, by applying directives directly to the original sequential code. The use of hiCUDA directives makes it easier to experiment with different ways of identifying and extracting GPU computation, and of managing the GPU memory. We have implemented a prototype compiler that translates a hiCUDA program to an equivalent CUDA program. Our experiments with five standard CUDA benchmarks show that the simplicity and flexibility hiCUDA provides come at no expense to performance. For each benchmark, the execution time of the hiCUDA-compiler-generated code is within 2% of that of the hand-written CUDA version.

The remainder of this paper is organized as follows. Section 2 provides some background on CUDA programming. Section 3 introduces our hiCUDA directives using a simple example. Section 4 specifies the hiCUDA language in detail. Section 5 describes our compiler implementation of hiCUDA and gives an experimental evaluation of the compiler-generated code. Section 6 reviews related work. Finally, Section 7 presents concluding remarks and directions for future work.

2. CUDA PROGRAMMING

CUDA provides a programming model that is ANSI C, extended with several keywords and constructs. The programmer writes a single source program that contains both the host (CPU) code and the device (GPU) code. These two parts will be automatically separated and compiled by the CUDA compiler tool chain [8].

Figure 1: Kernel definition and invocation in CUDA.

CUDA allows the programmer to write device code in C functions called kernels. A kernel is different from a regular function in that it is executed by many GPU threads in a Single Instruction Multiple Data (SIMD) fashion. Each thread executes the entire kernel once. Figure 1 shows an example that performs vector addition on the GPU. Launching a kernel for GPU execution is similar to calling the kernel function, except that the programmer needs to specify the space of GPU threads that execute it, called a grid. A grid contains multiple thread blocks, organized in a two-dimensional space (or one-dimensional if the size of the second dimension is 1). Each block contains multiple threads, organized in a three-dimensional space. In the example (Figure 1), the grid contains 2 blocks, each containing 3 threads, so the kernel is executed by 6 threads in total. Each GPU thread is given a unique thread ID that is accessible within the kernel, through the built-in variables blockIdx and threadIdx. They are vectors that specify an index into the block space (that forms the grid) and the thread space (that forms a block) respectively. In the example, each thread uses its ID to select a distinct vector element for addition. It is worth noting that blocks are required to execute independently because the GPU does not guarantee any execution order among them. However, threads within a block can synchronize through a barrier [7].

GPU threads have access to multiple GPU memories during kernel execution. Each thread can read and/or write its private registers and local memory (for spilled registers). With single-cycle access time, registers are the fastest in the GPU memory hierarchy. In contrast, local memory is the slowest in the hierarchy, with more than 200-cycle latency. Each block has its private shared memory. All threads in the block have read and write access to this shared memory, which is as fast as registers. Globally, all threads have read and write access to the global memory, and read-only access to the constant memory and the texture memory. These three memories have the same access latency as the local memory.

Local variables in a kernel function are automatically allocated in registers (or local memory). Variables in other GPU memories must be created and managed explicitly, through the CUDA runtime API. The global, constant and texture memory are also accessible from the host. The data needed by a kernel must be transferred into these memories before it is launched. Note that these data are persistent across kernel launches. The shared memory is essentially a cache for the global memory, and it requires explicit management in the kernel. In contrast, the constant and texture memory have caches that are managed by the hardware.

To write a CUDA program, the programmer typically starts from a sequential version and proceeds through the following steps:

1. Identify a kernel, and package it as a separate function.
2. Specify the grid of GPU threads that executes it, and partition the kernel computation among these threads, by using blockIdx and threadIdx inside the kernel function.
3. Manage data transfer between the host memory and the GPU memories (global, constant and texture), before and after the kernel invocation. This includes redirecting variable accesses in the kernel to the corresponding copies allocated in the GPU memories.
4. Perform memory optimizations in the kernel, such as utilizing the shared memory and coalescing accesses to the global memory [7, 13].
5. Perform other optimizations in the kernel in order to achieve an optimal balance between single-thread performance and the level of parallelism [12].

Note that a CUDA program may contain multiple kernels, in which case the procedure above needs to be applied to each of them.

Most of the steps in the procedure above involve significant code changes that are tedious and error-prone, not to mention the difficulty in finding the “right” set of optimizations to achieve the best performance [12]. This not only increases development time, but also makes the pro-
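
To make steps 1 to 3 of the procedure in Section 2 concrete, here is a minimal sketch of a vector-addition program using the configuration described for the Figure 1 example (a grid of 2 blocks, each with 3 threads). It is not the code of Figure 1, which is not reproduced in this text. The names vecAdd and vecAddOnGpu, the bounds check in the kernel, and the error-handling style are assumptions made for illustration; only the CUDA runtime calls (cudaMalloc, cudaMemcpy, cudaFree, cudaGetLastError, cudaGetErrorString) are real API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Step 1: the GPU computation is packaged as a separate kernel function.
// Step 2: each thread derives one element index from blockIdx and threadIdx.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard: the grid may hold more threads than elements
        c[i] = a[i] + b[i];
}

// Hypothetical host wrapper (not from the paper).
// Step 3: allocate global memory, copy inputs in, launch, copy the result out.
cudaError_t vecAddOnGpu(const float *a, const float *b, float *c,
                        int n, int threadsPerBlock)
{
    size_t bytes = (size_t)n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;

    cudaError_t err = cudaMalloc((void **)&dA, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&dB, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&dC, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dA, a, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dB, b, bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        // Round up so that every element is covered by some thread.
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        vecAdd<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);
        err = cudaGetLastError();      // reports launch-configuration errors
    }
    // This copy waits for the kernel to finish before reading the result.
    if (err == cudaSuccess) err = cudaMemcpy(c, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);   // freeing a null pointer is a no-op
    cudaFree(dB);
    cudaFree(dC);
    return err;
}

int main()
{
    const int n = 6;
    float a[n] = {1, 2, 3, 4, 5, 6};
    float b[n] = {10, 20, 30, 40, 50, 60};
    float c[n] = {0};

    cudaError_t err = vecAddOnGpu(a, b, c, n, 3);   // 2 blocks of 3 threads
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < n; ++i)
        printf("%g + %g = %g\n", a[i], b[i], c[i]);
    return 0;
}
```

Even in this small case, most of the host code is memory management and data transfer around a one-line computation, which is the kind of mechanical work the paper argues a compiler can take over.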

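Section 2 states that shared memory must be managed explicitly in the kernel and that threads within a block can synchronize through a barrier. The sketch below shows both under simple assumptions: each block stages a small tile of global data in shared memory, waits at a barrier, and then writes the tile back in reverse order. The kernel reverseTiles, the wrapper reverseTilesOnGpu, and the tile size TILE are hypothetical names chosen for this example, and the input length is assumed to be a multiple of TILE.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 4   // hypothetical tile size; also the number of threads per block

// Each block reverses its own TILE-element segment of the input.
__global__ void reverseTiles(const int *in, int *out)
{
    __shared__ int tile[TILE];           // one copy per block, shared by its threads
    int t = threadIdx.x;
    int base = blockIdx.x * TILE;

    tile[t] = in[base + t];              // explicitly stage global data in shared memory
    __syncthreads();                     // barrier: the whole tile is loaded past this point
    out[base + t] = tile[TILE - 1 - t];  // read an element written by another thread
}

// Hypothetical host wrapper; n is assumed to be a positive multiple of TILE.
cudaError_t reverseTilesOnGpu(const int *in, int *out, int n)
{
    size_t bytes = (size_t)n * sizeof(int);
    int *dIn = nullptr, *dOut = nullptr;

    cudaError_t err = cudaMalloc((void **)&dIn, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&dOut, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dIn, in, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        reverseTiles<<<n / TILE, TILE>>>(dIn, dOut);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaMemcpy(out, dOut, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dOut);
    return err;
}

int main()
{
    const int n = 8;
    int in[n] = {0, 1, 2, 3, 4, 5, 6, 7};
    int out[n] = {0};

    cudaError_t err = reverseTilesOnGpu(in, out, n);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < n; ++i)
        printf("%d ", out[i]);
    printf("\n");
    return 0;
}
```

Without the barrier, a thread could read a tile slot before the thread responsible for it has written it. No synchronization is attempted across blocks, consistent with the requirement that blocks execute independently.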