COMPUTE UNIFIED DEVICE ARCHITECTURE

ABSTRACT
CUDA (Compute Unified Device Architecture) is a compiler and set of development tools that enable programmers to use a variation of the C language to code algorithms for execution on the graphics processing unit (GPU). CUDA was developed by NVIDIA, and using this architecture requires an NVIDIA GPU and the appropriate drivers.


1. Introduction
CUDA is a scalable parallel programming model and a software environment for parallel computing. The latest NVIDIA drivers all contain the necessary CUDA components. CUDA works with all NVIDIA GPUs from the G8X series onwards, including the GeForce, Quadro and Tesla lines. NVIDIA states that programs developed for the GeForce 8 series will also work without modification on all future NVIDIA video cards, due to binary compatibility. CUDA gives developers access to the native instruction set and memory of the massively parallel computational elements in CUDA GPUs. Using CUDA, the latest NVIDIA GPUs effectively become open architectures like CPUs (Central Processing Units). Unlike CPUs, however, GPUs have a parallel "many-core" architecture, each core capable of running thousands of threads simultaneously; if an application is suited to this kind of architecture, the GPU can offer large performance benefits. CUDA provides both a low-level API and a higher-level API. The initial CUDA SDK was made public on 15 February 2007. The compiler in CUDA is based on the PathScale C compiler, which is in turn based on Open64. NVIDIA released a CUDA API beta for Mac OS X on 19 August 2008.
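As a concrete illustration of this programming model, here is a minimal sketch of a complete CUDA C program using the runtime API: a kernel that adds two vectors, launched with one thread per element. The kernel name, array size and launch configuration are illustrative choices, not taken from any particular SDK sample.

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Kernel: runs on the GPU, one thread per array element. */
    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* each thread computes its own index */
        if (i < n)                                      /* guard against surplus threads */
            c[i] = a[i] + b[i];
    }

    int main(void)
    {
        const int n = 1024;
        size_t bytes = n * sizeof(float);
        float *ha = (float *)malloc(bytes);
        float *hb = (float *)malloc(bytes);
        float *hc = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) { ha[i] = (float)i; hb[i] = 2.0f * i; }

        float *da, *db, *dc;                  /* device (GPU) copies of the arrays */
        cudaMalloc((void **)&da, bytes);
        cudaMalloc((void **)&db, bytes);
        cudaMalloc((void **)&dc, bytes);
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

        /* Launch: 4 blocks of 256 threads = 1024 threads, one per element. */
        vecAdd<<<4, 256>>>(da, db, dc, n);
        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

        printf("c[10] = %f\n", hc[10]);       /* expect 30.0 */
        cudaFree(da); cudaFree(db); cudaFree(dc);
        free(ha); free(hb); free(hc);
        return 0;
    }

Compiled with nvcc, the same source file contains both the host code and the GPU kernel; this is the "variation of C" that the abstract refers to.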

2. GPU (Graphics Processing Unit)
A graphics processing unit or GPU (also occasionally called visual processing unit or VPU) is a dedicated graphics rendering device for a personal computer, workstation, or game console. Modern GPUs are very efficient at manipulating and displaying computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for a range of complex algorithms. A GPU can sit on a video card, or it can be integrated directly into the motherboard. More than 90% of new desktop and notebook computers have integrated GPUs, which are usually far less powerful than those on a dedicated video card. Graphics processors have evolved greatly in the past few years; today they are capable of calculating things other than pixels in video games. However, it is important to know how to use them efficiently for such other tasks.

3. Difference Between CPU and GPU
During the last couple of years, GPU calculation power has improved exponentially, much faster than that of the CPU. This doesn't mean that GPUs have simply evolved faster, however; the two are designed with different priorities. GPU computing with CUDA brings parallel computing to the masses: over 85,000,000 CUDA-capable GPUs have been sold, data-parallel supercomputers are everywhere, and we are already seeing innovations in data-parallel computing. Massively parallel computing has become a commodity technology. Due to the nature of graphics computations, GPU chips are customized to handle streaming data. This means that the data is already sequential, or cache-coherent, and thus GPU chips do not need the significant amount of cache space that dominates CPU chips; the die real estate can instead be re-targeted to provide more processing power. A CPU is expected to process a single task as fast as possible, whereas a GPU must be capable of processing a maximum of tasks, or, to be more accurate, one task applied to a maximum of data in a minimum period of time. Of course, a GPU also has to be fast and a CPU must be able to process several tasks, but to date the development of their respective architectures has reflected these priorities: multiplying processing units for GPUs, and, for CPUs, making control units more complex and increasing embedded cache memory.
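To make the streaming-versus-caching contrast concrete, here is a hedged host-side sketch in plain C (the array size and stride are arbitrary illustrative values): a sequential pass streams through memory the way a GPU expects its data, while a large-stride pass over the same array defeats the CPU's caches by touching a new cache line on almost every access.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)          /* 16M ints, far larger than any CPU cache */

    int main(void)
    {
        int *a = (int *)calloc(N, sizeof(int));
        long long sum = 0;
        clock_t t;

        /* Sequential (streaming) traversal: consecutive addresses. */
        t = clock();
        for (int i = 0; i < N; ++i) sum += a[i];
        printf("sequential: %f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        /* Strided traversal: same number of additions, but each access
           lands roughly one cache line (or page) away from the last. */
        t = clock();
        for (int s = 0; s < 4096; ++s)
            for (int i = s; i < N; i += 4096) sum += a[i];
        printf("strided:    %f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

        printf("%lld\n", sum);   /* keep the compiler from removing the loops */
        free(a);
        return 0;
    }

On a typical CPU the strided pass runs several times slower even though both loops do the same amount of work; this is exactly the kind of penalty that large CPU caches exist to soften, and that streaming GPU workloads largely avoid.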


4. Cache Memory
Cache (pronounced "cash") memory is extremely fast memory that is built into a computer's central processing unit (CPU), or located next to it on a separate chip. The CPU uses cache memory to store instructions that are repeatedly required to run programs, improving overall system speed. The advantage of cache memory is that the CPU does not have to use the motherboard's system bus for data transfer; whenever data must be passed through the system bus, the transfer speed slows to the motherboard's capability. The CPU can process data much faster by avoiding the bottleneck created by the system bus. As it happens, once most programs are open and running, they use very few resources; when these resources are kept in cache, programs can operate more quickly and efficiently. All else being equal, cache is so effective for system performance that a computer running a fast CPU with little cache can post lower benchmarks than a system running a somewhat slower CPU with more cache.

Cache built into the CPU itself is referred to as Level 1 (L1) cache. Cache that resides on a separate chip next to the CPU is called Level 2 (L2) cache. Some CPUs have both L1 and L2 cache built in and designate the separate cache chip as Level 3 (L3) cache. Cache that is built into the CPU is faster than separate cache, running at the speed of the microprocessor itself; however, separate cache is still roughly twice as fast as Random Access Memory (RAM). Cache is more expensive than RAM, but it is well worth getting a CPU and motherboard with built-in cache in order to maximize system performance.

Disk caching applies the same principle to the hard disk that memory caching applies to the CPU: frequently accessed hard disk data is stored in a separate segment of RAM in order to avoid having to retrieve it from the hard disk over and over, RAM being far faster than the platter technology used in conventional hard disks. This situation will change, however, as hybrid hard disks become ubiquitous. These disks have built-in flash memory caches, flash memory being much faster than the platters. Eventually, hard drives will be 100% flash, eliminating the need for RAM disk caching.

5. CUDA API
CUDA, or Compute Unified Device Architecture, is the architecture that allows the calculation power of GeForce 8 GPUs to be exploited by having them process kernels (programs) across a large number of threads. CUDA is in consequence a driver, a runtime, an API based on an extension of the C programming language together with an accompanying compiler (which redirects the parts not executed by the GPU to the system's default compiler), and libraries (an implementation of BLAS, amongst other things). While CUDA partly involves the GPU itself, in practice it mainly concerns software, meaning that it largely disregards the hardware, even though taking the hardware specifications into account is required to obtain high performance.

The driver is the component intended for the system that makes it possible to control the GPU(s). With CUDA it is possible either to use the API runtime or to access the API driver directly. The CUDA driver acts as an intermediary between the compiled code and the GPU; the CUDA runtime is an intermediary between the developer and the driver, facilitating programming by masking some of the details. It is possible to see the API runtime as a high-level language and the API driver as an intermediate between high and low level. The driver mode, however, isn't that different: it simply has more options and less automation, allowing a manual and deeper optimization of the code. By contrast, AMD's CTM is a low-level API, although AMD offers the possibility of writing kernels in HLSL instead of machine language to facilitate programming. While both stick to their initial choices, Nvidia and AMD each try to move a little in the opposite direction. This roughly means that it is easier to program with CUDA, whereas it is easier to fully optimize code with CTM.

For this first look at CUDA, we focused on the API runtime. This particular API consists of a couple of extensions to the C language: a component that runs on the host system, another that runs on the GPU, and a common component that includes the vector types and a group of functions from the standard C library which can be executed on the system as well as on the GPU. Without going into all the details of the added extensions, the main ones needed to understand the functioning of CUDA appear throughout the examples in this article.

6. CUDA Architecture
The GPU contains some number of MultiProcessors (MPs), depending on the model:
- The NVIDIA 8800 comes in 2 models, with either 12 or 16 MPs.
- The NVIDIA 8600 has 4 MPs.
- Each MP has 8 independent processors.
- There are 16 KB of Shared Memory per MP, arranged in 16 banks.
- There are 64 KB of Constant Memory.
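As a hedged sketch of how the 16 KB of per-MP shared memory is used in practice, the kernel below stages a tile of data in a __shared__ buffer and reduces it to a single partial sum per block; the block size of 256 is an illustrative choice (256 floats occupy 1 KB of the 16 KB region).

    #include <cuda_runtime.h>

    #define BLOCK 256

    /* Each block sums BLOCK elements in shared memory, then writes one partial sum. */
    __global__ void blockSum(const float *in, float *partial, int n)
    {
        __shared__ float buf[BLOCK];              /* lives in the MP's shared memory */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                          /* wait until the whole tile is loaded */

        /* Tree reduction within the block; consecutive threads address
           consecutive words, which map onto distinct banks. */
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                buf[threadIdx.x] += buf[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            partial[blockIdx.x] = buf[0];
    }

The host would launch blockSum with n/BLOCK blocks and then combine the per-block partial sums, either on the CPU or with a second kernel pass.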

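For comparison with the runtime-API examples used elsewhere in this article, here is a hedged sketch of the same kind of launch written against the lower-level driver API described in section 5. It assumes the kernel has been compiled separately into a module file named vecadd.ptx (an illustrative name), omits error checking, and uses cuLaunchKernel, the modern driver-API launch entry point; the driver API of the 2007-era SDK exposed the same steps through cuParamSet*/cuLaunchGrid instead.

    #include <cuda.h>

    int main(void)
    {
        /* Driver-API boilerplate that the runtime API normally hides. */
        cuInit(0);
        CUdevice dev;   cuDeviceGet(&dev, 0);
        CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);
        CUmodule mod;   cuModuleLoad(&mod, "vecadd.ptx");   /* pre-compiled kernel */
        CUfunction fn;  cuModuleGetFunction(&fn, mod, "vecAdd");

        int n = 1024;
        CUdeviceptr da, db, dc;
        cuMemAlloc(&da, n * sizeof(float));
        cuMemAlloc(&db, n * sizeof(float));
        cuMemAlloc(&dc, n * sizeof(float));
        /* ... cuMemcpyHtoD(da, host_a, n * sizeof(float)); and so on ... */

        void *args[] = { &da, &db, &dc, &n };
        cuLaunchKernel(fn, 4, 1, 1,       /* grid:  4 blocks        */
                           256, 1, 1,     /* block: 256 threads     */
                           0, 0,          /* shared memory, stream  */
                           args, 0);
        cuCtxSynchronize();

        cuMemFree(da); cuMemFree(db); cuMemFree(dc);
        cuCtxDestroy(ctx);
        return 0;
    }

Every step the runtime API performs implicitly (context creation, module loading, argument marshalling) is explicit here, which is what gives the driver mode its extra options and its reputation for manual, deeper optimization.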
7. Advantages
CUDA has several advantages over traditional general-purpose computation on GPUs (GPGPU) using graphics APIs:
- It uses the standard C language, with some simple extensions.
- Scattered writes: code can write to arbitrary addresses in memory.
- Shared memory: CUDA exposes a fast shared memory region (16 KB in size) that can be shared amongst threads. This can be used as a user-managed cache, enabling higher bandwidth than is possible using texture lookups.[6]
- Faster downloads and readbacks to and from the GPU.
- Full support for integer and bitwise operations.

8. CUDA Efficiency
CUDA expects not just to have a few threads, but to have thousands of them. All threads execute the same code (called the kernel), but operate on different data, and each thread can determine which one it is. Think of all the threads as living in a "pool", waiting to be executed; all processors start by grabbing a thread from the pool. When a thread gets blocked somehow (by a memory access, by waiting for information from another thread, etc.), the processor quickly returns that thread to the pool and grabs another one to work on. This thread-swap happens within a single cycle, whereas a full memory access requires 200 instruction cycles to complete.

Due to the highly parallel nature of vector processors, GPU-assisted hardware stream processing can have a huge impact in specific data processing applications. It is anticipated in the computer gaming industry that graphics cards may be used in future game physics calculations (physical effects like debris, smoke, fire, and fluids). CUDA has also been used to accelerate non-graphical applications in computational biology and other fields by an order of magnitude or more. A sketch of the many-threads style this implies is given below.
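As a hedged sketch of this style, the kernel below uses the common grid-stride idiom: it is launched with far more threads than there are processors, and each thread walks the array in steps of the total grid size, so the scheduler always has runnable threads with which to hide the roughly 200-cycle memory accesses. The launch configuration is an illustrative choice.

    #include <cuda_runtime.h>

    /* Scale an array in place; correct for any grid/block configuration. */
    __global__ void scale(float *x, float alpha, int n)
    {
        int stride = gridDim.x * blockDim.x;            /* total number of threads */
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            x[i] = alpha * x[i];                        /* while this thread waits on
                                                           memory, others execute */
    }

    /* Example launch: thousands of threads regardless of array size.  */
    /* scale<<<128, 256>>>(d_x, 2.0f, n);   => 32,768 threads in flight */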
9. Hardware Support
The 8-Series (G8X) GPUs from NVIDIA, found in the GeForce, Quadro and Tesla lines, are the first series of GPUs to support the CUDA SDK. The 8-Series (G8X) GPUs feature hardware support for 32-bit (single precision) floating point vector processors, using the CUDA SDK as API. (CUDA supports the 64-bit C "double" data type; on G8X series GPUs, however, these types get demoted to 32-bit floats.) NVIDIA recently announced its new GT200 architecture, which now supports 64-bit (double precision) floating point. A hedged sketch of how a program can detect these capabilities at run time follows.
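This is a minimal sketch using the runtime API's device-property query; double precision support corresponds to compute capability 1.3 and above (GT200 class), while G8X parts report 1.0 or 1.1.

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, d);
            printf("device %d: %s, compute capability %d.%d, %d multiprocessors\n",
                   d, p.name, p.major, p.minor, p.multiProcessorCount);
            if (p.major > 1 || (p.major == 1 && p.minor >= 3))
                printf("  double precision supported\n");      /* GT200 class or newer */
            else
                printf("  double demoted to float\n");          /* G8X/G9X class */
        }
        return 0;
    }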

10. Limitations
- Texture rendering is not supported.
- Recursive functions are not supported and must be converted to loops.
- There are various deviations from the IEEE 754 standard: denormals and signalling NaNs are not supported; only two IEEE rounding modes are supported (chop and round-to-nearest even), and those are specified on a per-instruction basis rather than in a control word (whether this is a limitation is arguable); and the precision of division and square root is slightly lower than single precision.
- The bus bandwidth and latency between the CPU and the GPU may be a bottleneck.
- Threads should be run in groups of at least 32 for best performance. Branches in the program code do not impact performance significantly, provided that each group of 32 threads takes the same execution path; the SIMD execution model becomes a significant limitation for any inherently divergent task (e.g. traversing a ray tracing acceleration data structure).
- CUDA-enabled GPUs are only available from Nvidia (GeForce 8 series and above, Quadro and Tesla[1]).
- Currently, CUDA's compiler support under Windows covers only Visual Studio 7 and 8, not Visual Studio 9 (2008). Support for VS2008 is expected by the end of 2008.

11. CUDA Evolves
It is now clear that CUDA wasn't simply an Nvidia marketing ploy, or a way to see whether the market is ready for such a thing. Rather, it is a long-term strategy based on the feeling that an accelerator market is starting to form and is expected to grow quickly in the coming years, a market from which CUDA is starting to benefit. CUDA's team is therefore working hard to evolve the language, improve the compiler, make its use more flexible, and so on. Since the 0.8 beta version came out in February, versions 0.9 beta and finally 1.0 have allowed CUDA to viably make use of GPUs as coprocessors, and these regular evolutions have also increased the feeling of confidence in the platform.

Although version 0.8 was already very promising, it suffered from a large limitation: as we explained in our previous article, once the CPU sent work to the GPU, it was blocked until the GPU sent the results back. The CPU and GPU therefore couldn't work at the same time, which was a big brake on performance. Nvidia of course knew this, and the synchronous functioning of the first CUDA versions was probably chosen to facilitate a rapid release of a functional version without focusing on the more delicate details. More flexibility and robustness were necessary, and two main evolutions stand out.

The first is the asynchronous functioning of CUDA. With CUDA 0.9 and 1.0 this problem disappeared, and the CPU is free once it has sent the program to be executed to the GPU (except when access to textures is used); note that there is a function that can force synchronous functioning if this is necessary. A related problem arose when a calculation system was equipped with several GPUs: under the synchronous model a CPU core per GPU was needed, which isn't too efficient in practice. It remains necessary to create a CPU thread per GPU, because CUDA does not authorize the piloting of two GPUs from the same thread, but this is not a big problem.

The second main innovation on the functional level is the appearance of atomic functions: reading data in memory, using it in an operation and writing the result, without any other access to this memory space until the operation is fully completed. This allows avoiding (or at least reducing) certain problems, such as a thread trying to read a value without knowing whether it has already been modified or not. A hedged sketch of such an atomic update is given below.
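As an illustration of these atomic functions, the kernel below builds a small histogram with atomicAdd, so concurrent threads that land on the same bin cannot lose updates. The bin count and data type are illustrative choices, and global-memory integer atomics require compute capability 1.1 or above.

    #include <cuda_runtime.h>

    #define BINS 64

    /* Each thread classifies one element and bumps the matching bin.     */
    /* atomicAdd performs the read-modify-write as one indivisible step,  */
    /* so two threads updating the same bin can never lose an increment.  */
    __global__ void histogram(const unsigned char *data, int n, unsigned int *bins)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[data[i] % BINS], 1u);
    }

Without the atomic, two threads could both read the same bin value, both add one, and both write back, losing one of the two increments; this is exactly the read-then-write hazard the text describes.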

12. Conclusion
It would be a mistake to limit our view of a GPU such as the GeForce 8800 to a cluster of 128 cores. Two points stand out. The first is that a high-end GPU comes with a memory bandwidth of 100 GB/s, whereas the four cores of a CPU have to share 10 GB/s; this is a significant difference which limits CPUs in certain cases. The second is that a GPU is designed to maximize its throughput: it will automatically use a very high number of threads to do so, whereas a CPU will more often end up waiting many cycles. A GeForce 8800 isn't just 128 processors, but rather, first of all, up to 25,000 threads in flight! We therefore have to provide the GPU with an enormous number of threads, which we should obtain by segmenting an algorithm, keeping the threads within the hardware limits for maximum productivity, and letting the GPU execute them efficiently. These two reasons give GPUs an enormous advantage over CPUs in certain applications, such as the one we tested.

This doesn't involve making a task parallel in order to use several cores (as is currently the case with CPUs), but rather implementing a task which is naturally massively parallel, and such tasks are numerous. It is therefore all about making a massively parallel algorithm efficient on a given architecture. The GPU is not destined to replace the CPU, but rather to help it in certain specific tasks: a race car isn't used to transport cattle, and we don't drive a tractor in an F1 race; the same goes for GPUs. Using a GPU with CUDA may seem very difficult, even adventurous, at first, but it is much simpler than most people think.

13. References
1. "PyCUDA".
2. Silberstein, Mark (2007). "Efficient computation of Sum-products on GPUs".
3. Manavski, Svetlin A.; Valle, Giorgio (2008). "CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment". BMC Bioinformatics 9(Suppl 2): S10. doi:10.1186/1471-2105-9-S2-S10.
4. Schatz, M.C.; Trapnell, C.; Delcher, A.L.; Varshney, A. (2007). "High-throughput sequence alignment using Graphics Processing Units". BMC Bioinformatics 8: 474. doi:10.1186/1471-2105-8-474.
5. NVIDIA on DailyTech.
6. NVIDIA CUDA 2.0.
7. CUDA for GPU Computing.

