Introduction

CUDA stands for Compute Unified Device Architecture and is a new hardware and software architecture for issuing and managing computations on the GPU as a data-parallel computing device, without the need of mapping them to a graphics API.

Contrary to CPUs, the GPU (graphics processing unit) is specialized for compute-intensive, highly parallel computation: more transistors are devoted to data processing rather than to data caching and flow control.

Background

In just a few years, GPUs have undergone a great evolution.

Problems:
- the GPU could only be programmed using a graphics API;
- the GPU DRAM could be read in a general way, but could not be written in a general way;
- some applications were bottlenecked by the DRAM memory bandwidth.

Hardware and software

The NVIDIA CUDA SDK has been designed for running parallel computations on the device hardware: it consists of a compiler, an API and its runtime, and two higher-level mathematical libraries of common usage.

The CUDA software stack is composed of several layers: a hardware driver, an API and its runtime (host and device runtime libraries and a driver API), and the two higher-level mathematical libraries.

Access memory

CUDA provides general DRAM memory addressing for both scatter and gather memory operations, just like on a CPU.
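As a minimal sketch of what "general addressing" buys you, the kernel below performs both a gather (an indexed read) and a scatter (an indexed write); the names `gather_scatter`, `in`, `out`, and `idx` are illustrative, not from the slides:

```cuda
// Illustrative gather + scatter kernel.
// Assumes idx holds a permutation of 0..n-1, so the scattered
// writes never collide with each other.
__global__ void gather_scatter(float *out, const float *in,
                               const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[idx[i]];    // gather: read from an arbitrary address
        out[idx[i]] = v * 2.0f;  // scatter: write to an arbitrary address
    }
}
```

Before CUDA, the scatter half of this pattern was exactly what the graphics API could not express in a general way.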

Shared Memory

In addition, it features a parallel data cache (on-chip shared memory) with very fast general read and write access, used by threads to share data with each other.

(Figure: memory access patterns without and with shared memory.)
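A hedged sketch of using the on-chip shared memory: each block stages a tile of the input in shared memory, synchronizes, then reads the tile back in reverse order. The kernel name and tile size are assumptions for illustration only.

```cuda
// Reverses each 256-element tile of d in place, one tile per block.
// Assumes a launch with 256-thread blocks and n a multiple of 256.
__global__ void reverse_block(float *d, int n)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = d[i];        // stage in fast on-chip memory
    __syncthreads();                 // make every write visible to the block
    d[i] = tile[blockDim.x - 1 - threadIdx.x];
}
```

Without shared memory, the reversed read would go back to DRAM; with it, each element is fetched from DRAM only once.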

Programming Model & Hardware Implementation

The GPU can be viewed as a coprocessor to the main CPU, called the host.

A kernel is a function that is executed on the graphics device as many different threads.

A thread block is a batch of threads that cooperate together by sharing data through some fast shared memory and synchronizing their execution to coordinate memory accesses.

Each thread is identified by a thread ID that can be:
- a number in ascending order;
- a 2 (or 3) component index composed using the (x, y) position of the specific thread inside the block.
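The model above can be sketched in a few lines, assuming illustrative names (`add_one`, `d_a`): the kernel runs on the device as a grid of thread blocks, and each thread derives its ascending thread ID from the built-in `blockIdx`, `blockDim`, and `threadIdx` variables.

```cuda
// Each thread handles one array element, selected by its thread ID.
__global__ void add_one(float *a, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x; // ascending ID
    if (tid < n)        // guard: the grid may be larger than n
        a[tid] += 1.0f;
}

// Host side: launch 4 blocks of 256 threads each on device memory d_a.
// add_one<<<4, 256>>>(d_a, 1000);
```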

Programming Model & Hardware Implementation

Threads in different blocks from the same grid cannot communicate and synchronize with each other.

A thread that executes on the device has access only to the device's DRAM and on-chip memory, through different memory spaces.

The global, constant and texture memory spaces can be read from or written to by the host and are persistent across kernel launches by the same application.
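Since the device only sees its own DRAM, the host must allocate global memory and copy data across before and after each launch. A minimal sketch, with illustrative names (`scale`, `h`, `d`):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

int main(void)
{
    const int n = 1024;
    float h[1024];
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));  // device global memory
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, n);       // kernel launch
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return 0;
}
```

The contents of `d` would persist across further kernel launches by this application, as the slide notes.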

Programming Model & Hardware Implementation

The GPU device is implemented as a set of multiprocessors, each one having a SIMD architecture and on-chip memory of the four following types:
- one set of local 32-bit registers per processor;
- a parallel data cache, shared by all the processors;
- a read-only constant cache, shared by all the processors;
- a read-only texture cache, shared by all the processors.

Application Programming Interface

The CUDA programming interface consists of:
- a set of extensions to the C language;
- a runtime library:
  - a host component that runs on the host and provides functions to control and access one or more compute devices from the host;
  - a device component that runs on the device and provides device-specific functions;
  - a common component that provides built-in vector types and a subset of the C standard library, supported in both host and device code.

The fundamental extensions of the C language made by CUDA are:
- Function type qualifiers (__device__, __global__, __host__)
- Variable type qualifiers (__device__, __constant__, __shared__)
- Synchronization (__syncthreads()): when some threads within a block access the same addresses in shared or global memory, there are potential RAW, WAR or WAW hazards for some of these accesses; they can be avoided by synchronizing the threads in between these accesses.
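The extensions above can be seen together in one short sketch (names are illustrative): a `__device__` helper, a `__global__` kernel, a `__shared__` buffer, and a `__syncthreads()` call separating the write phase from the read phase so the read-after-write hazard cannot occur.

```cuda
__device__ float square(float x) { return x * x; }

// Each thread writes its square to shared memory, then reads its
// left neighbor's value; __syncthreads() guarantees all writes have
// completed before any thread reads. Assumes 256-thread blocks.
__global__ void neighbor_sum(float *d, int n)
{
    __shared__ float buf[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[threadIdx.x] = square(d[i]);
    __syncthreads();               // avoids the RAW hazard below
    if (i < n && threadIdx.x > 0)
        d[i] = buf[threadIdx.x] + buf[threadIdx.x - 1];
}
```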

Optimization guidelines

- Maximizing the use of the available bandwidth for each category of memory
- Minimizing the use of instructions with low throughput
- Allowing the thread scheduler to overlap memory transactions with mathematical computations as much as possible, to achieve high memory bandwidth

Example: shared memory is divided into equally sized memory modules called banks, so bank conflicts may arise!

__shared__ float shared[32];
float data = shared[BaseIndex + s * tid];

s: stride (number of locations in between successive elements)
m: total number of banks
d: highest common factor between s and m
Threads tid and tid+n are in conflict if n is a multiple of m/d.

(Figure: access without bank conflicts vs. an 8-way bank conflict.)
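Applying the m/d rule above, a hedged sketch assuming m = 16 banks (the half-warp size on early CUDA hardware) and an illustrative kernel name:

```cuda
// Contrast a conflict-free stride with a conflicting one.
// Assumes at most 32 threads per block and m = 16 shared-memory banks.
__global__ void stride_demo(float *out)
{
    __shared__ float shared[256];
    int tid = threadIdx.x;
    shared[tid] = (float)tid;
    __syncthreads();
    // s = 1: d = hcf(1, 16) = 1, threads conflict only if n is a
    // multiple of 16/1 = 16 -- no conflict within a half-warp.
    float ok  = shared[tid];
    // s = 8: d = hcf(8, 16) = 8, threads conflict when n is a
    // multiple of 16/8 = 2 -- an 8-way bank conflict.
    float bad = shared[tid * 8];
    out[tid] = ok + bad;
}
```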
