

CUDA (Compute Unified Device Architecture) is a compiler and set of development tools that enable programmers to use a variation of C to code algorithms for execution on the graphics processing unit (GPU). CUDA was developed by NVIDIA, and using it requires an NVIDIA GPU and drivers.

1. Introduction
CUDA is a scalable parallel programming model and a software environment for parallel computing. The latest drivers all contain the necessary CUDA components. CUDA works with all NVIDIA GPUs from the G8X series onwards, including the GeForce, Quadro and Tesla lines. NVIDIA states that programs developed for the GeForce 8 series will also work without modification on all future NVIDIA video cards, due to binary compatibility. CUDA gives developers access to the native instruction set and memory of the massively parallel computational elements in CUDA GPUs. Using CUDA, the latest NVIDIA GPUs effectively become open architectures like CPUs (Central Processing Units). Unlike CPUs, however, GPUs have a parallel "many-core" architecture, each core capable of running thousands of threads simultaneously; if an application is suited to this kind of architecture, the GPU can offer large performance benefits. CUDA provides both a deterministic low-level API and a higher-level API. The initial CUDA SDK was made public on 15 February 2007. The compiler in CUDA is based on the PathScale C compiler, which is based on Open64. NVIDIA released a CUDA API beta for Mac OS X on 19 August 2008.

2. GPU (Graphics Processing Unit)
A graphics processing unit or GPU (also occasionally called visual processing unit or VPU) is a dedicated graphics rendering device for a personal computer, workstation, or game console. Modern GPUs are very efficient at manipulating and displaying computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for a range of complex algorithms. A GPU can sit on a video card, or it can be integrated directly into the motherboard. More than 90% of new desktop and notebook computers have integrated GPUs, which are usually far less powerful than those on a video card. GPUs have evolved considerably in the past few years. Today they are capable of calculating things other than pixels in video games; however, it is important to know how to use them efficiently for other tasks.

3. Difference Between CPU and GPU
During the last couple of years, GPU calculation power has improved exponentially and much faster than that of the CPU. However, this doesn't mean that GPUs have evolved faster.

GPU computing with CUDA brings parallel computing to the masses: over 85,000,000 CUDA-capable GPUs have been sold. Data-parallel supercomputers are everywhere, innovations in data-parallel computing are already appearing, and massively parallel computing has become a commodity technology.

Due to the nature of graphics computations, GPU chips are customized to handle streaming data. This means that the data is already sequential, or cache-coherent, and thus the GPU chips do not need the significant amount of cache space that dominates CPU chips. The GPU die real estate can then be re-targeted to produce more processing power.

A CPU is expected to process a task as fast as possible, whereas a GPU must be capable of processing a maximum of tasks or, to be more accurate, one task for a maximum of data in a minimum period of time. Of course, a GPU also has to be fast and a CPU must be able to process several tasks, but up to this date the development of their respective architectures has shown the above priority. For GPUs, this has meant multiplying processing units.


For CPUs, these priorities have instead meant making control units more complex and increasing embedded cache memory.

4. CUDA API
CUDA, or Compute Unified Device Architecture, is the architecture that allows the exploitation of GeForce 8 GPU calculation power by allowing it to process kernels (programs) on a certain number of threads. Although CUDA also partly involves the GPU itself, since the hardware has more and more optimizations to facilitate non-graphics calculations, in practice it mainly concerns software. CUDA is consequently a driver, a runtime, libraries (an implementation of BLAS, amongst other things), an API based on an extension of the C programming language, and an accompanying compiler (which redirects the part not executed by the GPU to the system's default compiler). CUDA is a high-level API, meaning that it largely abstracts away the hardware.

5. Cache Memory
Cache (pronounced cash) memory is extremely fast memory that is built into a computer's central processing unit (CPU), or located next to it on a separate chip. The CPU uses cache memory to store instructions that are repeatedly required to run programs, improving overall system speed. The advantage of cache memory is that the CPU does not have to use the motherboard's system bus for data transfer. Whenever data must be passed through the system bus, the data transfer speed slows to the motherboard's capability. The CPU can process data much faster by avoiding the bottleneck created by the system bus.

As it happens, once most programs are open and running, they use very few resources. When these resources are kept in cache, programs can operate more quickly and efficiently. All else being equal, cache is so effective in system performance that a computer running a fast CPU with little cache can have lower benchmarks than a system running a somewhat slower CPU with more cache.

Cache built into the CPU itself is referred to as Level 1 (L1) cache. Cache that resides on a separate chip next to the CPU is called Level 2 (L2) cache. Some CPUs have both L1 and L2 cache built-in and designate the separate cache chip as Level 3 (L3) cache. Cache that is built into the CPU is faster than separate cache, running at the speed of the microprocessor itself. However, separate cache is still roughly twice as fast as Random Access Memory (RAM). Cache is more expensive than RAM, but it is well worth getting a CPU and motherboard with built-in cache in order to maximize system performance.

Disk caching applies the same principle to the hard disk that memory caching applies to the CPU. Frequently accessed hard disk data is stored in a separate segment of RAM in order to avoid having to retrieve it from the hard disk over and over. In this case, RAM is faster than the platter technology used in conventional hard disks. This situation will change, however, as hybrid hard disks become ubiquitous. These disks have built-in flash memory caches. Eventually, hard drives will be 100% flash, eliminating the need for RAM disk caching, as flash memory is faster than RAM.

6. CUDA Architecture
The GPU has some number of MultiProcessors (MPs), depending on the model.
• The NVIDIA 8800 comes in 2 models: either 12 or 16 MPs
• The NVIDIA 8600 has 4 MPs
• Each MP has 8 independent processors
• There are 16 KB of Shared Memory per MP, arranged in 16 banks
• There are 64 KB of Constant Memory
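As a rough illustration of the figures listed for the CUDA architecture, the following Python sketch works out the total processor count for each model and shows how shared-memory addresses could map onto the 16 banks. The function names are hypothetical. The mapping assumes that consecutive 32-bit words fall in consecutive banks and that 16 threads access shared memory together; the text states neither detail, so both should be read as assumptions rather than as a description of the hardware.

```python
from collections import Counter

# Figures taken from the architecture list (G8X generation).
PROCESSORS_PER_MP = 8
SHARED_MEMORY_BYTES = 16 * 1024
NUM_BANKS = 16

# Assumptions made for illustration only (not stated in the text):
WORD_BYTES = 4           # banks are interleaved in 32-bit words
THREADS_PER_ACCESS = 16  # threads that access shared memory together


def total_processors(mp_count):
    """Number of processors on a GPU with mp_count multiprocessors."""
    return mp_count * PROCESSORS_PER_MP


def bank_of(byte_address):
    """Bank that would hold the word at byte_address in shared memory."""
    if not 0 <= byte_address < SHARED_MEMORY_BYTES:
        raise ValueError("address outside the 16 KB shared memory")
    return (byte_address // WORD_BYTES) % NUM_BANKS


def conflict_degree(stride_words):
    """Largest number of threads landing in one bank when thread t
    reads the word at index t * stride_words (stride_words >= 1)."""
    banks = Counter(
        bank_of(t * stride_words * WORD_BYTES)
        for t in range(THREADS_PER_ACCESS)
    )
    return max(banks.values())


if __name__ == "__main__":
    for name, mps in [("8800, 12 MPs", 12), ("8800, 16 MPs", 16), ("8600", 4)]:
        print(name, "->", total_processors(mps), "processors")
    for stride in (1, 2, 3, 16):
        print("stride", stride, "->", conflict_degree(stride), "thread(s) per bank")
```

Under these assumptions, a stride of 1 or 3 words spreads the 16 threads over 16 different banks, whereas a stride of 16 words sends every thread to the same bank. This is the kind of layout to avoid when placing data in shared memory.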

Although CUDA is a high-level API, taking the hardware specifications into account is still required to obtain high performance. In the opposite direction, AMD, with CTM, has a low-level API. This roughly means that it is easier to program with CUDA, whereas it is easier to fully optimize the code with CTM. While both stick to their initial choices, NVIDIA and AMD are each moving a little in the opposite direction. AMD gives the possibility of writing kernels in HLSL instead of machine language to facilitate programming. With CUDA it is possible either to use the runtime API or to access the driver API directly. The runtime API can be seen as a high-level language and the driver API as an intermediate between high and low level, allowing a manual and deeper optimization of the code.

The CUDA driver acts as an intermediate element between the compiled code and the GPU. The CUDA runtime is an intermediate between the developer and the driver, facilitating programming by masking some of the details.

For this first look at CUDA, we focused on the runtime API. The driver mode, however, isn't that different; it only has more options and less automation. The runtime API consists of a couple of extensions of the C language, a component intended for the system that makes it possible to control the GPU(s), another that runs on the GPU, and a common component that includes vector types and a group of functions of the standard C library, which can be executed on the system as well as on the GPU. Without going into all the details of the added extensions, we give only the main ones needed to understand how CUDA works.

7. CUDA Efficiency
CUDA expects you not just to have a few threads, but to have thousands of them! All threads execute the same code (called the kernel), but operate on different data. Each thread can determine which one it is. Think of all the threads as living in a "pool", waiting to be executed. All processors start by grabbing a thread from the pool. When a thread gets blocked somehow (a memory access, waiting for information from another thread, etc.), the processor quickly returns the thread to the pool and grabs another one to work on. This thread-swap happens within a single cycle. A full memory access requires 200 instruction cycles to complete.

8. Hardware Support
The 8-Series (G8X) GPU from NVIDIA, found in the GeForce, Quadro and Tesla lines, is the first series of GPUs to support the CUDA SDK. The 8-Series (G8X) GPUs feature hardware support for 32-bit (single precision) floating point vector processors, using the CUDA SDK as API. (CUDA supports the 64-bit C "double" data type; however, on G8X series GPUs these types will get demoted to 32-bit floats.) NVIDIA recently announced its new GT200 architecture, which supports 64-bit (double precision) floating point. Due to the highly parallel nature of vector processors, GPU-assisted hardware stream processing can have a huge impact in specific data processing applications. It is anticipated in the computer gaming industry that graphics cards may be used in future game physics calculations (physical effects like debris, smoke, fire, fluids). CUDA has also been used to accelerate non-graphical applications in computational biology and other fields by an order of magnitude or more.

9. Advantages
CUDA has several advantages over traditional general purpose computation on GPUs (GPGPU) using graphics APIs.
• It uses the standard C language, with some simple extensions.
• Scattered writes: code can write to arbitrary addresses in memory.
• Shared memory: CUDA exposes a fast shared memory region (16 KB in size) that can be shared amongst threads. This can be used as a user-managed cache, enabling higher bandwidth than is possible using texture lookups.[6]
• Faster downloads and readbacks to and from the GPU
• Full support for integer and bitwise operations

10. Limitations
• Texture rendering is not supported.

• Recursive functions are not supported and must be converted to loops.
• Various deviations from the IEEE 754 standard. Denormals and signalling NaNs are not supported, only two IEEE rounding modes are supported (chop and round-to-nearest even), and those are specified on a per-instruction basis rather than in a control word (whether this is a limitation is arguable), and the precision of division/square root is slightly lower than single precision.
• The bus bandwidth and latency between the CPU and the GPU may be a bottleneck.
• Threads should be run in groups of at least 32 for best performance. Branches in the program code do not impact performance significantly, provided that each of 32 threads takes the same execution path; the SIMD execution model becomes a significant limitation for any inherently divergent task (e.g., traversing a ray tracing acceleration data structure).
• CUDA-enabled GPUs are only available from NVIDIA (GeForce 8 series and above, Quadro and Tesla[1]).
• Currently CUDA's compiler support under Windows is only for Visual Studio 7 and 8, not Visual Studio 9 (2008). Support for VS2008 is expected by end of 2008.

11. CUDA Evolves
It's now clear that CUDA wasn't simply an NVIDIA marketing ploy or a way to see whether the market was ready for such a thing. Rather, it's a long-term strategy based on the feeling that an accelerator market is starting to form and is expected to grow quickly in the coming years. CUDA's team is therefore working hard to evolve the language, improve the compiler, make use more flexible, etc. Since the 0.8 beta version came out in February, versions 0.9 and 1.0 have made the use of GPUs as coprocessors viable. More flexibility and robustness were necessary, although version 0.8 was already very promising. These regular evolutions have also increased confidence in CUDA.

Two main evolutions stand out. The first is the asynchronous functioning of CUDA. Version 0.8 suffered from a large limitation: once the CPU had sent the work to the GPU, it was blocked until the GPU sent the results back. The CPU and GPU therefore couldn't work at the same time, which was a big brake on performance. Another problem was found in the case where a system was equipped with several GPUs. A CPU core per GPU was needed, which isn't too efficient in practice.

NVIDIA of course knew this, and the synchronous functioning of the first CUDA versions was probably used to facilitate a rapid release of a functional version without focusing on the more delicate details. With CUDA 0.9 beta and finally 1.0, this problem disappeared, and the CPU is free once it has sent the program to be executed to the GPU (except when access to textures is used). Where several GPUs are used, it is however necessary to create a CPU thread per GPU, because CUDA does not allow two GPUs to be driven from the same thread. This is not a big problem. Note that there is a function that can force synchronous functioning if this is necessary.

The second main innovation on the functional level is the appearance of atomic functions, which means reading data in memory, using it in an operation and writing the result without any other access to this memory space until the operation is fully completed. This avoids (or at least reduces) certain current problems, such as a thread trying to read a value without knowing whether it has been modified.

12. Conclusion
A first point to note is that a high-end GPU comes with a memory bandwidth of 100 GB/s.

By contrast, the four cores of a CPU have to share 10 GB/s. This is a significant difference which limits CPUs in certain cases. The second point is that a GPU is designed to maximize its throughput. It will therefore automatically use a very high number of threads to maximize its throughput, whereas a CPU will more often end up waiting many cycles. These two reasons allow GPUs to have an enormous advantage over CPUs in certain applications.

Using a GPU with CUDA may seem very difficult, even adventurous, at first, but it is much simpler than most people think. The reason is that a GPU isn't destined to replace a CPU, but rather to help it in certain specific tasks. This doesn't involve making a task parallel in order to use several cores (as is currently the case with CPUs), but implementing a task which is naturally massively parallel, and such tasks are numerous. A race car isn't used to transport cattle and we don't drive a tractor in an F1 race. The same is true for GPUs. Therefore, it's all about making a massively parallel algorithm efficient on a given architecture.

It would be a mistake to limit our view of a GPU such as the GeForce 8800 to a cluster of 128 cores, which we should try to take advantage of by segmenting an algorithm. A GeForce 8800 isn't just 128 processors, but rather, first of all, up to 25,000 threads in flight! We therefore have to provide the GPU with an enormous number of threads, keeping them within the hardware limits for maximum productivity, and let the GPU execute them efficiently.

13. References
NVIDIA on DailyTech.
NVIDIA CUDA 2.0.
Schatz, M.C.; Trapnell, C.; Delcher, A.L.; Varshney, A. (2007). "High-throughput sequence alignment using Graphics Processing Units". BMC Bioinformatics 8:474. doi:10.1186/1471-2105-8-474.
Manavski, Svetlin A.; Giorgio Valle (2008). "CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment". BMC Bioinformatics 9(Suppl 2):S10. doi:10.1186/1471-2105-9-S2-S10.
Silberstein, Mark (2007). "Efficient computation of Sum-products on GPUs".
"PyCUDA".
CUDA for GPU Computing.
