Object-Oriented Stream Programming using Aspects

Mingliang Wang and Manish Parashar
NSF Center for Autonomic Computing, The Applied Software System Laboratory (TASSL), Rutgers University, 94 Brett Road, Piscataway, NJ 08854, USA
{mlwang, parashar}@cac.rutgers.edu
Abstract—High-performance parallel programs that efficiently utilize heterogeneous CPU+GPU accelerator systems require tuned coordination among multiple program units. However, using current programming frameworks such as CUDA leads to tangled source code that combines code for the core computation with code for device and computational kernel management, data transfers between memory spaces, and various optimizations. In this paper, we propose a programming system based on the principles of Aspect-Oriented Programming to un-clutter the code and to improve the programmability of these heterogeneous parallel systems. Specifically, we use standard C++ to describe the core computations and aspects to encapsulate all other supporting parts. An aspect-weaving compiler is then used to combine these two pieces of code to generate the final program. The system modularizes concerns that are hard to manage using conventional programming frameworks such as CUDA, has a small impact on existing program structure as well as on performance, and as a result simplifies the programming of accelerator-based heterogeneous parallel systems. We also present an options pricing and an n-body simulation example program to demonstrate that programs written using this system can be successfully translated to CUDA programs for NVIDIA GPU hardware and to OpenCL programs for multicore CPUs with comparable performance. For both examples, the performance of the translated code reached approximately 80% of that of the hand-coded CUDA programs.

I. INTRODUCTION

Parallel processing systems have become mainstream, as multicore CPUs and manycore GPUs are commonly seen in today's personal computing systems. To achieve power efficiency, highly parallel manycore systems must use power-efficient designs. Current CPU cores are, however, power inefficient, as they are very complex and devote many transistors to exploiting instruction-level parallelism and to a large on-chip cache [1]. Special-purpose processors, such as GPUs, take a different approach by using simpler cores and relying on parallelism to hide latency [2]. This trend has led to heterogeneous processing systems in which both complex cores and simple special-purpose cores coexist. A hybrid system with a CPU and several GPU accelerators is schematically shown in Figure 1. While available hardware parallelism continues to scale up with Moore's Law, the pressing challenge is how to efficiently program these hybrid and heterogeneous parallel systems so that today's programs can transparently leverage the increasing number of processing cores and continue to deliver higher performance on future hardware with increasing parallelism.

Figure 1: A hybrid CPU host + GPU accelerators system.

Popularized by the General-Purpose GPU programming community, Stream Programming (SP) has demonstrated its strength and successes in mapping various algorithms to the highly parallel GPU architecture [3]. SP is a type of data-parallel programming that emphasizes improved arithmetic intensity, the average number of arithmetic operations per memory operation in data-parallel programs.

CUDA [4][5] is among the most widely used programming systems for SP so far. It defines a few C language extensions to provide an SP-based abstraction for NVIDIA GPU hardware. OpenCL [6] adopts a very similar programming model and, although in its early stages, is being accepted as an open standard that provides uniform APIs for using parallel hardware as accelerators. Programs written in these programming systems have, however, become much more complex than their sequential counterparts. Admittedly, parallel programming is inherently harder than sequential programming, and more complex programs are to be expected. However, not all complexities should be equally accepted. Accidental complexities [7] due to insufficient programming support should be reduced or remedied. In CUDA programming, for example, much of the complexity is due to (1) management of GPU devices and computational kernels, (2) transfers of data between GPU and CPU memories, (3) optimization of data access patterns, and (4) placement of data in the GPU memory hierarchy. Code fragments for these supporting functions are tangled with those for the core computations. The example in Figure 2 illustrates this problem with a simple matrix multiplication program written using the CUDA programming system. As shown in the figure, a large number of lines of code (shown in red) are dedicated to the complexities listed above. These code fragments distract algorithm designers and are hard to manage, as they are mixed with code for the core computation and are scattered across many different places in a program.

The root cause of this code-tangling problem is that these programming concerns are not units of functional decomposition, as they cut across multiple components of a system. Furthermore, these extra code fragments are repeated for every kernel computation and cannot be modularized in a functional decomposition framework of a function, class or module. Instead, they are properties that affect the performance or semantics of the components in a systematic way, and therefore cannot be cleanly encapsulated as procedures, classes or packages.

This problem, however, can be addressed using the Aspect-Oriented Programming (AOP) [8] paradigm, where these crosscutting concerns, which do not naturally decompose into functional units, can be modularized as aspects. The normal code that fits into the functional decomposition scheme is termed component (or base, mainline) code. A special compiler, the aspect weaver, then inserts program fragments, as necessary, in the component code. The places of insertion are called join points and are specified by a language construct called pointcut, which identifies a set of join points in the component code. The aspect code to be inserted is specified by another language construct called advice.

Figure 2: Code tangling in a typical CUDA program. The left half (run_cpu) is code for a sequential version on the CPU: it allocates memory buffers to store matrix data, initializes them with random data, and then calls matrixMul_cpu to do matrix-matrix multiplication. The right half (run_CUDA) is code for a CUDA version that offloads the computation to a GPU; the computation kernel is matrixMul_cuda. The lines in red are extra code introduced into the mainline compute code to manage the GPU device, allocate/de-allocate memory on the GPU, transfer data between CPU and GPU memories, and set up call parameters; this code clutters the original program.
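The code listings inside Figure 2 do not survive in this text, so the following is a minimal, hand-written sketch (not the paper's exact listing) of the kind of host-side CUDA code the figure shows; the block size and the assumption that matrix dimensions divide evenly are ours, and only the matrixMul_cuda kernel name comes from the figure.

// Hypothetical host-side driver illustrating the tangling described above.
// Core computation: C = A * B. Everything except the kernel call is support code.
void run_CUDA_sketch(const float* h_A, const float* h_B, float* h_C,
                     int WA, int HA, int WB) {
    size_t size_A = sizeof(float) * WA * HA;
    size_t size_B = sizeof(float) * WB * WA;   // HB == WA for a valid product
    size_t size_C = sizeof(float) * WB * HA;

    // (1) device memory management
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size_A);
    cudaMalloc(&d_B, size_B);
    cudaMalloc(&d_C, size_C);

    // (2) data transfers between memory spaces
    cudaMemcpy(d_A, h_A, size_A, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size_B, cudaMemcpyHostToDevice);

    // (1) kernel launch configuration
    dim3 threads(16, 16);
    dim3 grid(WB / threads.x, HA / threads.y);

    // the core computation itself is this single line
    matrixMul_cuda<<<grid, threads>>>(d_C, d_A, d_B, WA, WB);

    // (2) copy the result back and release device memory
    cudaMemcpy(h_C, d_C, size_C, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}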

In this paper, we propose to improve on this situation by introducing AOP into SP for accelerators in order to modularize these crosscutting concerns. We describe an aspect-oriented programming system that addresses the afore-mentioned issues. The system uses standard C++ to program the core computational functions of a program. It then introduces a small aspect language for describing the crosscutting concerns in accelerator programming and enforces a formal separation between core functional modules and secondary aspects for supporting purposes. The issues dealt with in the aspect portion of the framework are: (1) deciding the layout of the thread hierarchy for a kernel launch, (2) transferring data between memory spaces, (3) mapping data to the thread hierarchy, and (4) the placement of data within a memory hierarchy. This arrangement leads to a clean separation and modularization of these crosscutting concerns.

The organization of aspect-oriented programs and the process of weaving aspect code into component code are shown in Figure 3. The final woven program is highly tangled with component and aspect code fragments, but the component and aspect source programs are cleanly modularized.

Figure 3: Weaving component code with aspect code. Component code A and B and aspect code A and B are fed to the aspect weaver, which produces the woven component code.

The rest of the paper is structured as follows. In the next section, we discuss related work in GPU accelerator programming and compare and contrast it with the work presented in this paper. Section III describes AOP-based abstractions and mechanisms for modularizing GPU programming, and illustrates their use with examples. Section IV discusses the compilation process. Section V presents examples and evaluates the programming system. In Sections VI and VII, conclusions are presented and future avenues of this research are discussed.

II. RELATED WORK

Modern GPUs for general-purpose computation are often modeled as a type of data-parallel computation called stream processing [9]. In this model, a program is divided into sequential portions that run on the CPU and parallel portions that run on GPUs. Computation in this model is organized as a sequence of steps called kernels, which are functions independently applied to collections of data elements and which are executed in parallel on GPUs. Within a kernel, many computations can be combined to increase its arithmetic intensity [10], which is the ratio between the arithmetic operations and the memory operations. Increasing this ratio is crucial in achieving efficient computation because, in modern computers, processors have at least an order of magnitude more computational power than memory bandwidth. This model scales well into the massively parallel future, as data collections are usually very large and can be divided into an appropriate number of chunks to match the number of cores. This model can also be made deterministic, as threads executing a kernel do not communicate, and collective operations for communication are entirely delegated to system components.

Recent years have seen many programming systems that intend to address the complexity of programming GPUs for general-purpose computing. NVIDIA CUDA is a programming system based on extensions to the C language for programming their GPUs as accelerators [5]. It supports a set of memory and device management APIs on the CPU side through a library mechanism. CUDA is so far the most widely used GPU programming system, as it is flexible and widely available on major OS platforms. OpenCL is an ongoing effort [6] to construct a foundation-level acceleration API for programming GPU-like massively parallel hardware from different vendors, and it insulates programmers from dealing with a particular hardware platform. Its programming model is similar to CUDA, as computation is expressed as C function-like kernels executed in parallel over collections of data stored in memory buffers. It, however, focuses on cross-vendor compatibility and portability and defines a more complete model of devices, memories, program execution and other aspects of accelerator programming. Both CUDA and OpenCL are low-level, and using them requires handling details such as device management, manual memory management, data transfers between host and GPU memories, and the creation of kernel functions. In this respect, they are more suitable as compilation targets than for direct manipulation by programmers. Moreover, CUDA and OpenCL are programming systems based on conventional functional decomposition, and hence provide no support to cleanly and efficiently capture these aspects.

Another major insufficiency in current programming systems is that they require that computation kernels be programmed in separate compilation units and called as external C global functions or dynamically loaded at runtime. This model implies moving existing functions out to additional program source files, and as a result negatively impacts existing program structures.

Some systems provide high-level abstractions that entirely insulate programmers from interactions with the low-level GPU programming interface. These systems include new programming languages such as the Brook [11] system and its variant Brook+ from AMD. The base part of the language is similar to C. To support stream programming, it provides three abstractions: a stream construct to represent a collection of data that can be operated on in parallel, a special kernel function type that operates on streams, and also a reduction primitive for calculating a single value out of a set of data elements. The RapidMind [12] system (now part of Intel) takes the approach of an embedded programming language, providing an embedded data-parallel programming environment in C++. This platform allows computational kernels to be directly specified in the C++ language, and its metaprogramming feature supports code tuning and parameterization. The programming model consists of three C++ data types: Value<N, T>, Array<D, T>, and Program. An Array abstracts the concept of data collections and a Value that of data elements. A Program stores a sequence of operations that can operate on the Values of Arrays in parallel. Intel Ct [13] is another embedded programming system in C++, with canned class templates to support data-parallel programming; additionally, it includes direct support for nested parallelism. Both the new-language and the embedded-language approaches entirely rely on the underlying runtime systems to manage the details of moving computation to GPUs. They have gained limited acceptance due to issues such as legacy compatibility, efficiency and expressiveness. Programmers have to use the canned library to craft their applications, hence limiting the general-purpose goal of modern GPU computation.

CuPP [14] provides a C++ framework to easily integrate CUDA into C++ applications. It provides a C++ interface to manage GPU devices and memory, to make C++ kernel calls, and to create data structures that support transparent access from both CPU and GPU sides. This is essentially an attempt to wrap the low-level CUDA interface with a higher-level object-oriented interface. Accelerator [15] is a data-parallel library from Microsoft to exploit GPUs as accelerators. Arithmetic-intense kernels are inferred by inspecting programs and are then offloaded onto GPUs. PGI [16] provides a set of compiler directives, similar to OpenMP, for offloading computation to GPUs. A similar approach is also taken by the HMPP system [17], which defines its own set of compiler directives to generate native GPU code. These compiler-directive based approaches essentially bear an aspect-oriented flavor, as the inserted compiler directives actually play the role of aspect code, trying to alleviate the code clutter due to crosscutting concerns, but without the advantage of modularization provided by the language construct aspect: compiler directives can spread over the entire system.

Compared to the previous approaches, our method has the following advantages. 1) It provides the low-level flexibility and control power of frameworks such as CUDA, but also avoids the tedious coding and maintenance of crosscutting concerns in them. Auxiliary functions such as device management, determining the thread hierarchy for kernel execution, and data transfers are expressed in separate aspect code, leading to clean and better modularized code. This reduces programming effort and improves code quality. 2) Furthermore, code structure, and hence the overall maintainability of a program, is improved, as GPU kernels are automatically extracted from the component code instead of being manually maintained in separate compilation units. 3) This method also allows for better engineering practice. The separation of code into a component portion in standard C++ and an aspect portion containing all the add-ons for enabling the AOP abstraction and GPU programming clearly demarcates the domains for different roles of programmers. For instance, numerical algorithm developers can focus on the component code, and GPU accelerator experts can work on the aspect code for performance tuning.

III. STREAM PROGRAMMING WITH ASPECTS

The programming system described in this section supports accelerator programming using conventional C++ and a small aspect language. The small aspect language is largely based on AspectC++ [18] but is specifically customized for expressing the aspects that arise in accelerator programming. This system provides the following features to help with programming GPU accelerators in a high-level and more modularized way. This is achieved by exploiting the abstraction power of the AOP paradigm, and as a result addresses the code-tangling problem.

• Computation kernels are organized as an integral part of the source code, without requiring them to be placed in separate compilation units, and they can be class member methods. As class members, they can also be virtual functions so as to support polymorphism of computation kernels.
• GPU main memory is automatically managed through intrinsic data types. Programmers give declarations about memory synchronization requirements and hints about changed memory buffers in computation kernels, and then the compiler derives code to start data transfers when required. Memory transfer code is managed by the compiler.
• Better modularization of the program code: auxiliary concerns are enclosed in aspects, so code structure is improved and hence the overall maintainability of a program.

The code listed in Listing 1 is a simple Vector class in C++ (the class is only meant to illustrate the concepts of our programming system, and does not represent a good design of a full-fledged vector class). We will use it to illustrate the language concepts in the proposed system. The increase method increases each element of the vector by a constant inc, and the shift method rotates the elements by an offset. The computations in these methods are data-parallel and can be offloaded to an accelerator. The data member m_data is the memory buffer for the elements. It is of the type MemBlock<float>, a class template provided by our programming system, featuring managed memory across memory spaces; details of this mechanism will be discussed in Section III.C.

class Vector {
public:
    void increase(float inc);
    void shift(int offset);
    // other members omitted ...
private:
    unsigned int m_size;
    MemBlock<float> m_data;
};

Listing 1: A C++ Vector class (incomplete) for parallel execution on an accelerator.
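To make the intended workflow concrete, the following is a small usage sketch of our own (not a listing from the paper): the component code calls increase like any other member function, and whether the call runs on the CPU or is offloaded to an accelerator is decided entirely by the aspect code described later. The constructor and fill step are assumptions, since Listing 1 omits them.

// Hypothetical host-side component code using the Vector class above.
// Nothing here mentions devices, transfers, or kernels; those concerns
// live in the aspect code, so this file stays plain C++.
int main() {
    Vector v;            // assume a constructor that sizes m_data appropriately
    // ... fill v through its public interface (omitted, as in Listing 1) ...
    v.increase(3.0f);    // a join point; kernel advice may offload this call
    v.shift(1);          // another join point named by the same pointcut
    return 0;
}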

In the following subsections, we first briefly introduce the execution model of a CPU-GPU accelerator system, and then focus on the important mechanisms of the system that support stream programming with aspects.

A. Execution Model

An execution model conceptually describes how a program makes progress at runtime. In a CPU+GPU heterogeneous parallel processing system, a computation is composed of a sequence of serial actions on the CPU and parallel computational kernels on the GPU. This execution model is illustrated in Figure 4.

Figure 4: Execution model of a heterogeneous CPU + GPU system: sequential actions on the CPU (s1, ...) alternate with parallel kernel methods on the GPU (Kernel_method1, Kernel_method2, Kernel_method3).

The execution of a computation kernel is organized in a hierarchy of threads. In a two-level hierarchy, a typical case for GPU accelerators, threads are organized as thread groups, and thread groups are further organized into a grid. Threads across groups cannot communicate directly, while within the same group there exist shared memories and synchronization mechanisms. A two-level thread hierarchy that is commonly found in modern GPUs is illustrated in Figure 5. The sizes of the grid and the thread groups are determined during the launch of each kernel method. Within each kernel method, this geometric information and the index of the current thread executing a kernel instance are available, so that each kernel can determine its share of data elements to process. A thread grid may be much larger than the actual number of available processing elements on a particular GPU accelerator, so that multiple threads often need to be temporally multiplexed. This is actually a desired property, as it allows the program to scale up its performance when more processing cores are available in future generations of hardware.

Figure 5: Parallel methods executed as a grid of thread blocks. A thread block is indexed by its location (X, Y) in the grid, and threads within a thread block are further indexed by their location (x, y) in the block.

B. Join Point Model for Accelerator Programming

A Join Point Model (JPM) is central to any AOP framework, and it defines three components. 1) What can be the join points? This determines those points in a program where aspect code can be inserted. 2) How are join points specified? A pointcut defines a set of join points, and a language is required to specify it. 3) How is the advice code to run at the join points specified?

Our JPM is based on the model of AspectC++ [18], but modified to facilitate accelerator programming. In this niche AOP system, the JPM is relatively simple. The only join points are function executions. Semantically, we define a function execution join point to encompass all the actions that comprise a function, including argument evaluations, the actions in the function body, and passing the return values back to the call sites. We also do not allow constructs that modify a program's static structure, such as the members of its classes and the relationships between classes.

Since the only allowed join points in our system are function executions, we only need a subset of the existing pointcut specification mechanism in AspectC++. An example of a pointcut designation is shown in the following.

pointcut pt_inc() = execution( "Vect::increase(float)" || "Vect::shift(int)" );

This example defines a pointcut named "pt_inc", which includes the executions of the member methods increase(float) and shift(int) of the class Vect. AspectC++ provides a fairly complex language to specify pointcuts; more complex pointcut expressions are allowed, and for details, please refer to [19].
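As a concrete illustration of a more complex pointcut expression, the following sketch uses the AspectC++-style wildcard match syntax; whether the subset adopted in this system accepts exactly this form is our assumption, not something stated in the paper.

// Hypothetical pointcut: match the execution of every member method of Vect,
// regardless of name, return type, or argument list.
pointcut all_vect_kernels() = execution( "% Vect::%(...)" );

// Advice attached to it would then treat every matched method as a kernel,
// applying the same launch configuration to all of them.
advice all_vect_kernels() : kernel() {
    device = dev;
    thread.x = 32;
    grid.x = 100;
}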

y = 1. The following shows its usage in a computation kernel.x + threadIdx. The type is. } advice pt_inc: memory() { sync_mem(ON_ENTRY).x = 100. and shared memory sizes for thread groups. A memory advice specifies data synchronization information for non-scalar variables that are referenced in a computation kernel. thread. A usage of this shared memory in a kernel function is shown in the following. Aspects are language constructs that modularize units of crosscutting concerns. advice integrate(): kernel() { thread. // memory advice advice pt_cknls(): memory() { sync_mem(ON_EXIT). the threads are organized as a 1-D array of size 32. the advice code of kernel and memory needs to access the context of the join point that is currently activated. An example of this usage is shown below. //kernel advice advice pt_cknls(): kernel() { grid. An aspect is similar to a class in that it can include the regular members allowed in a class definition. In the CPU side code.x + blockDim. and we provide this information through pointcut parameters and access to data members of the activated class instance.y. specifies that memory buffers should be synchronized when this kernel finishes. the kernel advice chooses the device to use for this kernel launch.x = that->m_numBodies/that->m_p. these types must be accessed through an access method T* MemBock<T>. int l_index = threadIdx. Note that memory advice is declarative and statements in it do not directly correspond to actual actions.x. however. and that the data member m_data is changed in the execution of the kernel and the new version is on the GPU side. in addition to pointcut and advice definitions. The memory advice on the other hand. The following shows definitions for kernel and memory advice.x = 32.ptr(). grid. and sets up the dimensions of the 2-level thread hierarchy. Borrowing the idea of this pointer. thread layouts. GPU_IS_NEW). and within each thread block. Intrinsic Data Types Memory regions that are accessed in a computation kernel should be treated differently from regular CPU memory as they live in a separate address space and cannot be accessed directly from a CPU side thread.computation in the function should be offloaded to an accelerator device. // device used pointcut pt_inc() = execution( “Vect::increase(float)” || “Vect::shift(int)” ) advice pt_inc: kernel() { device = dev. Note that the size of shared memory is controlled by thread launch parameters in the advice code of this kernel method. float4* h_ptr = m_Pos.y = that-m_q. In this example. } aspect GPU_Vect { Device dev. int index = blockIdx.x*threadIdx. .x = 100. thread. with overloaded subscript operators and automatic memory management. } } Listing 2: Aspect for class Vector. The example in Listing 2 defines an aspect for the Vector class. Its usage will be translated by our compiler. it is specified that the thread layout be a 1-D array of thread block with size 100. We provide a C++ class template to model this memory system. understood by our compiler as an intrinsic type. float4 pos = m_Pos[index]. SharedMem<float> cache_Pos. grid. Additional code in the advice specifies launch parameters.x * blockDim. } It is a class template for 1-D arrays of basic data types. N> m_Pos. MemBlock<float4. oosp::syncThreadsBlock(). With this piece of information. and exposed as another C++ class template SharedMem. } C. respectively. This access method returns the CPU side pointer of the data type and may initialize a data transfer from GPU memory to CPU memory to update the stale data. 
sync_mem(ON_EXIT). proper data transfer requests can be inserted by the aspect weaving compiler. that is the pointer to the current object when the join point is activated. such as the device to use. The memory advice specifies that memory synchronization is required on both entry and exit of this kernel. grid. on_exit(that->m_data. In many cases. thread. The following example shows utilization of this context information in a kernel advice to determine the thread layout for a computation kernel.x = that->m_p. The following code declares a GPU memory region to be an array of float4. In the kernel advice.x = 32. The on-chip shared memory is also represented as intrinsic data types.ptr(). cache_Pos[l_index] = m_Pos[index].
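The paper describes how MemBlock behaves but not how it is implemented. The following is a minimal sketch, under our own assumptions, of how such an intrinsic type could track buffer staleness and trigger a device-to-host transfer from ptr(); the CUDA runtime calls stand in for whatever runtime library the generated code actually uses, and the helper names follow those seen later in Listing 3.

// Hypothetical illustration of a MemBlock-like intrinsic type (not the paper's code).
#include <cstddef>
#include <cuda_runtime.h>

template <typename T>
class MemBlockSketch {
public:
    explicit MemBlockSketch(std::size_t n) : m_size(n), m_deviceHasNew(false) {
        m_host = new T[n];
        cudaMalloc(&m_device, n * sizeof(T));
    }
    ~MemBlockSketch() { delete[] m_host; cudaFree(m_device); }

    // CPU-side access: refresh the host copy first if the device copy is newer.
    T* ptr() {
        if (m_deviceHasNew) {
            cudaMemcpy(m_host, m_device, m_size * sizeof(T), cudaMemcpyDeviceToHost);
            m_deviceHasNew = false;
        }
        return m_host;
    }

    // Helpers the woven host code would call (names follow Listing 3).
    void syncMem() {
        // push the host copy to the device; a real implementation would
        // skip the copy when the device copy is already current
        cudaMemcpy(m_device, m_host, m_size * sizeof(T), cudaMemcpyHostToDevice);
    }
    void setDeviceHasNew() { m_deviceHasNew = true; }
    T* d_ptr() { return m_device; }

private:
    std::size_t m_size;
    T* m_host;
    T* m_device;
    bool m_deviceHasNew;   // the update-status tracking mentioned in Section IV
};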

D. Computation Kernels and Execution

Computation kernels are defined as regular C++ functions with added semantics from aspect code. They can be global or namespace functions or class member methods. They are organized the same way as in standard C++, without requiring them to be placed in separate compilation units as in CUDA. Kernel definitions require no language extensions in the component portion of the code. Calling a kernel starts a hierarchy of threads, with each thread executing an instance of this kernel. The actual layout of this thread hierarchy is specified in its kernel advice. An example of a kernel definition is shown below. It defines the body of the member void Vector::increase(float).

void Vector::increase(float val) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    m_data[index] += val;
}

Notice that in each kernel method, a programmer has access to the variables blockIdx, blockDim, and threadIdx. They refer to the index of the current thread block, the dimensions of the thread block, and the local index within the thread block, respectively.

Kernels can access variables from a variety of places: variables defined in the kernel, in global or namespace scope, or as class members. The scoping rules of the conventional C++ language apply, but the data types and their allowed accesses are restricted. The allowed types and accesses are listed in the following.

• Scalar data types. Note, however, that only scalar variables defined within a kernel are writeable, all others being read only.
• Intrinsic data types for the global memory of a GPU accelerator. Read and write accesses are allowed.
• Intrinsic data types for shared memory within a thread block. Read and write accesses are allowed. SharedMem is a simpler data type for programmers, as it is only usable in a computation kernel. It also has overloaded subscript operators in order to be used as a random-access 1-D array. Its purpose is to tag data types so that the compiler can map them to on-chip shared memory on a GPU accelerator.

Programs that do not conform to these rules will trigger compilation errors.

E. Data Transfers

We define data transfer requirements using intrinsic functions in the memory advice code. Memory synchronization requirements are declaratively specified. Programmers do not specify memory transfer operations; instead, they declare whether memories need to be synchronized and which memory space has the latest version on a kernel's completion. The specification includes: 1) whether synchronization is required, 2) when it is required, and 3) what data should be synchronized. The following is an example aspect specification for the previous computation kernel. In this example, we declare synchronization requirements both on entry and on exit of the computational kernel; the memory region affected is the data member m_data.

aspect ACC_Vect {
    Device dev;
    pointcut pt_inc() = execution( "Vect::increase(float)" );
    advice pt_inc() : kernel() {
        device = dev;
        thread.x = 32;
        grid.x = 100;
    }
    advice pt_inc() : memory() {
        sync_mem(ON_ENTRY);
        sync_mem(ON_EXIT);
        on_exit(that->m_data, GPU_IS_NEW);
    }
};

IV. COMPILATION

The compiler is a source-to-source translator for C++ programs. It has two high-level tasks: 1) identify all the kernel function execution join points and then weave in the kernel and memory advice code, and 2) generate computation kernels for the target accelerator APIs and insert calls to these APIs in the CPU-side program. As discussed earlier, the JPM in our programming system is simpler than that of a generic AOP system. The only type of join point is function execution, and the advice types are restricted to kernel and memory advice. The interaction of advice at a join point is therefore also simple and well-defined, and the correctness of the coordination due to aspect code is relatively easy to maintain. The following discusses the details of the compiling process.

A. Compilation Architecture

The compilation architecture is built on the PUMA code analysis and transformation framework [20], which is also used in the AspectC++ aspect-weaving compiler. The flow of this compilation architecture is illustrated in Figure 6. The aspect weaver takes in the C++ mainline code and the advice code in the aspect code to generate standard C++ host-side programs. The kernel code generator also takes in the C++ mainline code and aspect code, but uses them to generate the GPU kernel source.

Figure 6: Compiling architecture of the accelerator programming system. The C++ mainline code and OOSP aspect code are processed by pointcut expression evaluation to identify join points; the weaver produces the C++ host-side code, and the kernel code generator produces the GPU kernel code.

B. Code Weaving

The weaving process is relatively simple, as it only involves function execution join points. The pointcut expressions in the aspect code are evaluated to identify all join points for each pointcut. For each identified kernel function join point, the changes are restricted to its body, incorporating all the actions in the kernel advice code. The original function body is replaced with code to set up kernel launch parameters, to call the generated GPU kernel function, and to call memory transfer functions. This includes loading the kernel code to the device, setting up the parameters to be passed in, and actually starting the execution of the kernel. Specifically, the weaver inserts:

• Code to set up kernel launch parameters. Kernel launch parameters in the host-side code decide shared memory sizes.
• Invocation of the offloaded kernel functions.
• Code fragments to initiate memory transfers. This code is derived from the memory advice code. These memory transfer code fragments appear both before and after kernel invocations.

The memory management functions for GPU main memory rely on a runtime library. This reduces the amount of code to be directly inserted in the weaving process and also eases re-targeting to different accelerator APIs. With this convenience, the weaver is allowed to insert the same code for both the OpenCL and CUDA targets. The intrinsic data types require the maintenance of memory buffers in both the CPU and GPU address spaces; we built a C++ template library to help with this purpose. This library also helps with tracking the update status of the memory buffers for the intrinsic data types.
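Before the real weaving output shown in Listing 3 below, here is a smaller, hypothetical sketch of what the compiler could produce for Vector::increase under the ACC_Vect aspect above: a generated CUDA kernel plus a rewritten host-side body. The generated names and the exact runtime calls are our assumptions; only the overall shape (launch setup, synchronization, kernel call, update-status marking) follows the paper's description.

// Hypothetical generated GPU kernel (CUDA target): the intrinsic MemBlock member
// becomes a raw array parameter, and the non-local scalar 'val' is passed in.
__global__ void Vector__increase(float* m_data, float val) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    m_data[index] += val;
}

// Hypothetical rewritten host-side body, following the pattern of Listing 3.
void Vector::increase(float val) {
    //OOSP_START
    dim3 threads(32, 1, 1);            // thread.x = 32 from the kernel advice
    dim3 grid(100, 1, 1);              // grid.x = 100 from the kernel advice
    m_data.syncMem();                  // sync_mem(ON_ENTRY) in the memory advice
    Vector__increase<<<grid, threads>>>(m_data.d_ptr(), val);
    m_data.setDeviceHasNew();          // on_exit(that->m_data, GPU_IS_NEW)
    //OOSP_END
}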

The code shown in Listing 3 is an example output of the code weaving process. It is from one of the kernels in the n-body simulation problem that we will discuss later.

void BodySystemOOSP::_integrateNbodySystem(float deltaTime) {
    ///////////////////////
    //OOSP_START
    int sharedMemSize = m_p * m_q * sizeof(float4);
    dim3 threads(m_p, m_q, 1);
    dim3 grid(m_numBodies/m_p, 1, 1);
    m_oldPos.syncMem();
    m_oldVel.syncMem();
    BodySystemOOSP__integrateNbodySystem<<<grid, threads, sharedMemSize>>>(
        deltaTime, m_damping, m_softeningSquared, m_numBodies,
        m_oldPos.d_ptr(), m_oldVel.d_ptr(), m_newPos.d_ptr(), m_newVel.d_ptr());
    m_newPos.setDeviceHasNew();
    m_newVel.setDeviceHasNew();
    ///////////////////////
    //OOSP_END
}

Listing 3: Code weaving for the CUDA target.

C. Computation Kernel Generation

A piece of kernel advice for a pointcut identifies all the kernel methods to be transformed. For each kernel method identified, its function body is used to generate the body of a kernel function for a given target, e.g., CUDA. The parameter list of the function is changed to allow passing parameters from a host thread to a GPU kernel. The following lists the major points in kernel generation.

• All references to intrinsic data types for GPU global memory (MemBlock, which is used to model GPU global memory) are translated to references to array data types, as both CUDA and OpenCL directly support them. Additional array parameters are added to the kernel function.
• All references to non-local scalar variables are converted to references to parameters of the generated kernel function. Additional parameters are generated to pass in these non-local scalars.
• References to intrinsic types for on-chip shared memory are transformed into references to raw arrays of on-chip memory.
• Index variables and geometric size variables of the thread hierarchy are directly translated to the corresponding variables in CUDA or to function calls in OpenCL.

V. EXPERIMENTAL RESULTS AND EVALUATION

In this section, we present results with two examples programmed in the proposed programming system. The first one is an options pricing application and the second one is an n-body simulation. For both applications, we report the results of a performance comparison with hand-coded CUDA versions; for the second example, we also discuss the improvement in program structure.

A. Options Pricing

The Black-Scholes model for options pricing [21] provides a partial differential equation for the evolution of option prices. We use the same computation method used in the example project BlackScholes from the CUDA SDK [4]. The option parameters and pricing data are stored as class members and are of the intrinsic type MemBlock<float>. The original code is a procedural CUDA C program, and the OOSP version is a C++ program. The runtime performance of the OOSP version is compared against the CUDA SDK version and is shown in Figure 7.
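The paper does not show the OOSP source for this example, so the following is a hypothetical sketch of how the component side might look given the description above: option data held in MemBlock<float> members and the per-option pricing written as an ordinary member method that an aspect later designates as a kernel. The member names, the restriction to call options, and the Gaussian CDF helper are ours, not the paper's.

// Hypothetical OOSP-style component code for Black-Scholes call pricing.
class BlackScholesOOSP {
public:
    void priceCalls(float riskFree, float volatility);   // kernel method
private:
    unsigned int m_numOptions;
    MemBlock<float> m_stockPrice, m_strike, m_expiry;     // inputs
    MemBlock<float> m_callResult;                         // output
};

// Body written as a regular C++ member; the thread index picks the option to price.
void BlackScholesOOSP::priceCalls(float riskFree, float volatility) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float S = m_stockPrice[i], K = m_strike[i], T = m_expiry[i];
    float sqrtT = sqrtf(T);
    float d1 = (logf(S / K) + (riskFree + 0.5f * volatility * volatility) * T)
               / (volatility * sqrtT);
    float d2 = d1 - volatility * sqrtT;
    float Nd1 = 0.5f * erfcf(-d1 * 0.70710678f);   // cumulative normal via erfc
    float Nd2 = 0.5f * erfcf(-d2 * 0.70710678f);
    m_callResult[i] = S * Nd1 - K * expf(-riskFree * T) * Nd2;
}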

The giga-options per second numbers we achieved with OOSP are between 79% and 84% of those of the CUDA SDK example, indicating roughly a 20% degradation of performance.

Figure 7: Performance comparison of options pricing with the Black-Scholes method. X axis: Kilo-Options Evaluated; Y axis: Giga-Options / Sec; series: CUDA and OOSP. Environment: GTX 280 NVIDIA GPU, CUDA SDK version 2.1, Linux 2.6, x86_64.

B. N-Body Problem

An n-body problem numerically simulates the continuous interactions among a system of bodies. It is often used in the computation of interactions among close-range entities. A survey of the methods developed for n-body problems is available at [22]. This example is an extension of the n-body CUDA SDK example. It uses the all-pairs method, which is a brute-force method that evaluates all the pair-wise interactions among N bodies.

On top of the existing C++ class hierarchy, we added another body system implementation, BodySystemOOSP, to modularize the implementation in the OOSP programming framework. The class diagram is shown in Figure 8. The OOSP version of the n-body program we developed was translated for both an NVIDIA GPU and a multicore CPU. The multicore CPU version is obtained by generating code for OpenCL with the AMD OpenCL implementation [23]. The performance comparison with the hand-coded version in the CUDA SDK is shown in TABLE I.

Figure 8: Class diagram of the n-body simulation: BodySystem with the subclasses BodySystemCPU, BodySystemCUDA, and BodySystemOOSP.

TABLE I. RUNTIME COMPARISON OF N-BODY SIMULATION (GTX 280 NVIDIA GPU, CUDA SDK version 2.1, Linux 2.6, x86_64)

        Number of Bodies   G-FLOPS   Interactions (billions/s)   Perf Percentage
CUDA    30720              436.0     21.8                        100%
OOSP    30720              358.6     17.9                        80%

1) Structural Changes

In this example, we emphasize studying the impact on existing program structures. Structural changes to the original C++ n-body program are investigated from the counts of additional global functions and class members introduced. We compare the OOSP implementation with the CPU and CUDA versions. The result is shown in TABLE II.

TABLE II. STRUCTURAL CHANGES COMPARISON

        Number of Classes   Class Members                            Non-Member Functions
CPU     1 C++               7 data members, 4 non-public methods     1 computation function
CUDA    1 C++               10 data members, 2 non-public methods    10 CPU side fun's, 4 GPU side fun's
OOSP    1 C++, 1 aspect     12 data members, 6 non-public methods    2 aspect advice fun's

The OOSP program improves on the CUDA version in the following way: it reduces the number of non-member functions from 14 (10 CPU + 4 GPU) to 6 regular C++ member functions and 2 additional aspect advice functions. The 14 non-member functions in CUDA are GPU kernels and wrappers, which negatively impact the structure of the original CPU version. The OOSP version, however, keeps all the computation functions as class members and adds 2 aspect advice functions, which are enclosed in an aspect, to manage memory transfers and the layout of the thread hierarchy.

We also show the Lines-Of-Code (LOC) metric in TABLE III. As shown, OOSP has a smaller code size than the CUDA version. More importantly, program structure is greatly improved: the final OOSP program is very similar to the original CPU version, and the code clutter due to offloading computation to the GPU device is eliminated. The core C++ code is very similar to the original sequential C++ program and contains only computation code, yielding a much cleaner program structure.

TABLE III. LINES-OF-CODE COMPARISON (LOC numbers exclude comments in the code)

          CPU     CUDA                          OOSP
.h        40      55                            190
.cpp      200     190                           180
.cu       -       kernel: 170, wrapper: 100     -
Aspect    -       -                             40
Total     ~240    ~515                          ~410

VI. CONCLUSION

In this paper we described a system to support programming heterogeneous CPU + GPU accelerator systems. We noticed that a significant amount of the complexity in existing programming frameworks, such as CUDA, is due to crosscutting concerns such as managing GPU devices, data memory allocation/de-allocation, data transfers between host and device memories, and the management of the thread hierarchies of kernel functions.

We hence proposed to apply the principles of Aspect-Oriented Programming to modularize these crosscutting concerns. This method removes clutter in the source code and helps to achieve better modularization. To help with programming these very specific aspects of accelerator programming, we provided a dedicated aspect language to improve the ease of use with respect to a general-purpose AOP framework, and adapted the activation model for offloading computation to accelerators. Additionally, in view of the complexity resulting from manually managing the computational kernel code of a GPU, we introduced a code transformation that automatically extracts kernel code from the component code, further reducing the structural change required.

We also conducted case studies with an options pricing and an n-body simulation program. We programmed the OOSP versions of both programs and successfully translated them into source programs for both CUDA and OpenCL targets. Comparisons were made, and we observed reduced program structure changes, uncluttered code, and runtime performance close to the hand-coded CUDA versions for the GPU implementations.

VII. FUTURE WORK

The compiler has not been fully completed, so one of our major ongoing tasks is to deliver a full-fledged compiler. We plan to deliver the system to application programmers for more complex applications, so that we can gather practical experience and feedback to further improve our system.

The current aspect language is based on AspectC++, which was designed for generic AOP. Even though we added two advice types, for kernel and memory advice, we are still not sure whether they are convenient or powerful enough to handle the situations arising in more complex applications. This is a major aspect of the system that will continue evolving.

Integrated Development Environment (IDE) integration is important for large-scale development. We will develop plugins for popular IDEs and will provide visual hints about join points, pointcuts, advice code, and their relationships, so that a programmer can easily spot and navigate among them.

A performance model provides a structured and repeatable approach to modeling the performance of a program. An easy-to-access model helps to create efficient programs that maximize the return on investment in hardware. We have developed a performance model based on Bulk Synchronous Parallelism (BSP) [24][25] and are now applying it to modeling OOSP programs.

An important optimization in GPU accelerator programming is adjusting memory access patterns to achieve coalesced accesses. In the current CUDA programming framework, this is achieved by adjusting the mapping from thread index to stream data elements. This is actually another source of cluttered code and has not been addressed in our system. We are working on providing a mechanism to reduce code clutter due to this cause.

ACKNOWLEDGMENT

The research presented in this paper is supported in part by The Extreme Scale Systems Center at ORNL and the Department of Defense, and by an IBM Faculty Award, and was conducted as part of the Center for Autonomic Computing at Rutgers University. This material was based, in part, on work supported by the National Science Foundation, while working at the Foundation. Any opinion, finding, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.

REFERENCES

[1] R. Kumar et al., "Single-ISA heterogeneous multi-core architectures: the potential for processor power reduction," in Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-36), 2003, pp. 81-92.
[2] E. Lindholm et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, vol. 28, no. 2, 2008, pp. 39-55.
[3] M. McCool, "Scalable Programming Models for Massively Multicore Processors," Proceedings of the IEEE, vol. 96, no. 5, 2008, pp. 816-831.
[4] "NVIDIA CUDA Zone," http://www.nvidia.com/cuda/
[5] J. Nickolls, I. Buck, M. Garland, and K. Skadron, "Scalable parallel programming with CUDA," ACM Queue, vol. 6, no. 2, 2008.
[6] Khronos Group, "OpenCL 1.0 Specification," Dec. 2008, http://www.khronos.org/registry/cl/
[7] F. Brooks, "The Mythical Man-Month: Essays on Software Engineering," Anniversary Edition, Addison-Wesley Professional, 1995.
[8] G. Kiczales et al., "Aspect-Oriented Programming," Proceedings of the European Conference on Object-Oriented Programming, 1997, pp. 220-242.
[9] W. Dally et al., "Merrimac: Supercomputing with Streams," in Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, IEEE Computer Society, 2003.
[10] M. Erez et al., "Executing irregular scientific applications on stream architectures," in Proceedings of the 21st Annual International Conference on Supercomputing, Seattle, Washington: ACM, 2007, pp. 93-104.
[11] I. Buck et al., "Brook for GPUs: Stream Computing on Graphics Hardware," ACM Transactions on Graphics, vol. 23, 2004, pp. 777-786.
[12] M. McCool, "Data-parallel Programming on the Cell BE and the GPU using the RapidMind Development Platform," GSPx Multicore Applications Conference, Santa Clara, 2006.
[13] A. Ghuloum et al., "Future-Proof Data Parallel Algorithms and Software on Intel Multi-Core Architecture," Intel Technology Journal, 2007.
[14] J. Breitbart, "CuPP -- A framework for easy CUDA integration," 14th International Workshop on High-Level Parallel Programming Models and Supportive Environments, Roma, Italy, 2009.
[15] D. Tarditi, S. Puri, and J. Oglesby, "Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses," Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, California: ACM, 2006, pp. 325-335.
[16] M. Wolfe, "Compilers and More: A GPU and Accelerator Programming Model," HPCWire online newsletter, http://www.hpcwire.com/, Dec. 2008.
[17] R. Dolbeau, S. Bihan, and F. Bodin, "HMPP: A Hybrid Multi-core Parallel Programming Environment," Workshop on General Purpose Processing on Graphics Processing Units, Boston, MA, 2007.
[18] O. Spinczyk, A. Gal, and W. Schröder-Preikschat, "AspectC++: An Aspect-Oriented Extension to the C++ Programming Language," Proceedings of the Fortieth International Conference on Tools Pacific: Objects for internet, mobile and embedded applications, Sydney, Australia: Australian Computer Society, 2002, pp. 53-60.
[19] O. Spinczyk, D. Lohmann, and M. Urban, "Advances in AOP with AspectC++," New Trends in Software Methodologies, Tools and Techniques, Tokyo, Japan: IOS Press, 2005, pp. 33-53.
[20] PUMA -- The PURE Manipulator, http://ivs.cs.uni-magdeburg.de/~puma/, accessed September 2009.
[21] F. Black and M. Scholes, "The Pricing of Options and Corporate Liabilities," The Journal of Political Economy, vol. 81, no. 3, 1973, pp. 637-654.
[22] N-Body Particle Simulation Methods, http://www.amara.com/papers/nbody.html, accessed September 2009.
[23] "OpenCL: The Open Standard for Parallel Programming of GPUs and Multi-core CPUs," http://ati.amd.com/technology/streamcomputing/opencl.html, accessed September 2009.
[24] L. Valiant, "A bridging model for parallel computation," Communications of the ACM, vol. 33, 1990, pp. 103-111.
[25] D. Skillicorn, J. Hill, and W. McColl, "Questions and Answers about BSP," Scientific Programming, vol. 6, no. 3, 1997, pp. 249-274.
