[Figure: peak GFLOPS over time, GPU vs. CPU. G80GL = Quadro FX 5600, G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra]
The fast-growing video game industry exerts strong economic pressure that forces constant innovation
Same computation means lower requirement for sophisticated flow control
High arithmetic intensity and many data elements mean that memory access latency can be hidden with calculations instead of big data caches
[Diagram: gather is supported, ALUs behind control/cache units can read arbitrary data elements d0 through d7 from DRAM]
No scatter:
[Diagram: the same ALUs cannot write to arbitrary DRAM locations d0 through d7]
Enter CUDA!
Compute Unified Device Architecture: a new hardware and software architecture for issuing and managing computations on the GPU
Available for GeForce 8800 Series, Quadro FX 5600/4600 and beyond
No need to go through a graphics API
Dedicated features for general computing
[Diagram: CUDA software stack, the CPU application calls the CUDA Runtime, which sits on the CUDA Driver, which drives the GPU]
[Diagram: with CUDA, ALUs can gather from arbitrary DRAM locations d0 through d7]
And scatter:
[Diagram: with CUDA, ALUs can also scatter, writing to arbitrary DRAM locations d0 through d7]
[Diagram: on-chip shared memory, the Thread Execution Manager feeds multiprocessors whose ALUs cooperate on shared data P1 through P5, e.g. computing Pn = P1 + P2 + P3 + P4 without round trips to DRAM]
CUDA Performance
[Chart: CUDA speedups over CPU implementations of 10x, 20x, 47x, and 197x across sample applications; FFT: 52 GFLOPS (GFLOPS as defined by benchFFT)]
Overview
CUDA programming model
Hardware implementation
CUDA application programming interface
Tesla / GeForce 8 Series / Quadro 5600/4600 technical specifications
CUDA libraries
Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads
Differences between GPU and CPU threads:
GPU threads are extremely lightweight
Very little creation overhead
[Diagram: Kernel 2 executes as Grid 2, a grid of thread blocks; Block (1, 1) expands into a 5 x 3 array of threads, Thread (0, 0) through Thread (4, 2)]
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D
Simplifies memory addressing when processing multidimensional data:
Image processing
Solving PDEs on volumes
[Diagram: each thread of Block (0, 0) and Block (1, 0) has its own Local Memory]
The host can read/write global, constant, and texture memory (stored in DRAM)
So, a common way of scheduling some computation on the device is to block it up to take advantage of fast shared memory:
Partition the data set into data subsets that fit into shared memory
Handle each data subset with one thread block by:
Loading the subset from global memory to shared memory
Performing the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element
Copying results from shared memory back to global memory
(a minimal kernel following this pattern is sketched below)
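As an illustration, here is a minimal sketch of the pattern, assuming 256-thread blocks and an array length that is a multiple of 256; the kernel name and the operation (reversing each tile) are hypothetical, not from the original slides:

    __global__ void reverseTiles(float *data)
    {
        __shared__ float tile[256];                // one data subset per thread block
        int base = blockIdx.x * blockDim.x;

        // 1) load the subset from global memory to shared memory
        tile[threadIdx.x] = data[base + threadIdx.x];
        __syncthreads();                           // wait until the whole tile is loaded

        // 2) compute on the subset from shared memory, 3) write results to global memory
        data[base + threadIdx.x] = tile[blockDim.x - 1 - threadIdx.x];
    }

Launched as reverseTiles<<<n/256, 256>>>(d_data), each block reverses its own 256-element tile entirely in on-chip memory.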
[Diagram: tiled matrix multiplication, each thread block computes one BLOCK_SIZE x BLOCK_SIZE sub-matrix Psub of a WIDTH x WIDTH result matrix]
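For reference, a hedged sketch of the tiled kernel this diagram illustrates, assuming square matrices with WIDTH a multiple of BLOCK_SIZE (names are illustrative, not the original slide code):

    #define BLOCK_SIZE 16

    __global__ void matMul(const float *A, const float *B, float *C, int width)
    {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
        int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
        float sum = 0.0f;

        // march BLOCK_SIZE-wide tiles across a row strip of A and a column strip of B
        for (int t = 0; t < width / BLOCK_SIZE; ++t) {
            As[threadIdx.y][threadIdx.x] = A[row * width + t * BLOCK_SIZE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * width + col];
            __syncthreads();                       // both tiles fully loaded

            for (int k = 0; k < BLOCK_SIZE; ++k)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                       // done reading these tiles
        }

        C[row * width + col] = sum;                // one element of Psub per thread
    }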
The local, global, constant, and texture spaces are regions of device memory
Each multiprocessor has:
A set of 32-bit registers per processor
On-chip shared memory, where the shared memory space resides
[Diagram: Multiprocessors 1 through N, each containing several Processors with per-processor Registers and one Shared Memory, all backed by off-chip Device memory]
A quick review
device = GPU = set of multiprocessors
Multiprocessor = set of processors & shared memory
Kernel = GPU program
Grid = array of thread blocks that execute a kernel
Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory
Memory     Location   Cached   Access       Who
Local      Off-chip   No       Read/write   One thread
Shared     On-chip    N/A      Read/write   All threads in a block
Global     Off-chip   No       Read/write   All threads + host
Constant   Off-chip   Yes      Read         All threads + host
Texture    Off-chip   Yes      Read         All threads + host
    // copy host memory to device
    cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

    // execute the kernel on N/256 blocks of 256 threads each
    vectorAdd<<<N/256, 256>>>(d_A, d_B, d_C);
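For context, a self-contained version of this example might look as follows; the kernel body and the allocation calls are a sketch, not the original slide code, and N is assumed to be a multiple of 256:

    __global__ void vectorAdd(const float *A, const float *B, float *C)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
        C[i] = A[i] + B[i];
    }

    // host side: allocate device memory before the copies shown above
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, N * sizeof(float));
    cudaMalloc((void**)&d_B, N * sizeof(float));
    cudaMalloc((void**)&d_C, N * sizeof(float));
    // ... copies and kernel launch as above ...
    cudaMemcpy(h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);  // read back result
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);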
__device__ and __host__ can be used together
__device__ functions cannot have their address taken
For functions executed on the device:
No recursion No static variable declarations inside the function No variable number of arguments
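A short illustration of these function qualifiers (function names are hypothetical):

    // compiled for both host and device
    __host__ __device__ float square(float x) { return x * x; }

    // device-only helper: no recursion, no static locals, no varargs
    __device__ float lerp(float a, float b, float t) { return a + t * (b - a); }

    // kernel: executes on the device, launched from the host
    __global__ void squareAll(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = square(data[i]);
    }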
__device__ is optional when used with __shared__ or __constant__
Automatic variables without any qualifier reside in a register
Except for large structures or arrays, which reside in local memory
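A small sketch of the variable qualifiers (names are illustrative):

    __constant__ float coeffs[16];        // constant memory; __device__ is implied

    __global__ void fill(float *out)
    {
        __shared__ float tile[256];       // on-chip shared memory; __device__ is implied
        int i = threadIdx.x;              // automatic scalar: lives in a register
        tile[i] = coeffs[i % 16];
        __syncthreads();
        out[blockIdx.x * blockDim.x + i] = tile[i];
    }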
dim3 blockDim;    // dimensions of the block in threads
dim3 blockIdx;    // block index within the grid
dim3 threadIdx;   // thread index within the block
dim3 is based on uint3 and is used to specify dimensions
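Together these built-ins give each thread its global position; for example, a hypothetical kernel processing a width x height image:

    // one thread per pixel
    __global__ void invert(unsigned char *img, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
        if (x < width && y < height)
            img[y * width + x] = 255 - img[y * width + x];
    }

    // launch: 16 x 16 threads per block, enough blocks to cover the image
    dim3 block(16, 16);
    dim3 grid((width + 15) / 16, (height + 15) / 16);
    invert<<<grid, block>>>(d_img, width, height);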
Initializes the first time a runtime function is called
A host thread can execute device code on only one device:
Multiple host threads are required to run on multiple devices
Device selection: cudaChooseDevice(), cudaSetDevice() (a selection sketch follows this list)
Memory addressing: cudaGetSymbolAddress()
Texture management: cudaBindTexture(), cudaUnbindTexture()
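As a sketch of selecting a device by hand (error checking omitted; picking the highest compute capability is just one possible policy):

    int count = 0, best = 0, bestMajor = 0, bestMinor = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        if (prop.major > bestMajor ||
            (prop.major == bestMajor && prop.minor > bestMinor)) {
            bestMajor = prop.major;
            bestMinor = prop.minor;
            best = dev;
        }
    }
    cudaSetDevice(best);   // later CUDA calls in this host thread target this device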
Compilation
Any source file containing CUDA language extensions must be compiled with nvcc
NVCC is a compiler driver: it works by invoking all the necessary tools and compilers, such as cudacc, g++, cl, ...
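For a single-file application, the invocation can be as simple as the following (file and program names are hypothetical):

    nvcc -o vectorAdd vectorAdd.cu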
Compiling CUDA
[Diagram: NVCC splits a C/C++ CUDA application into CPU code and PTX code; a PTX-to-target compiler then produces target code for each GPU, e.g. G80]
[Diagram: the same compilation flow, marking PTX as the virtual ISA and the per-GPU target code as the physical ISA]
[Diagram: inside NVCC, EDG separates GPU code from CPU code, and Open64 generates GPU PTX assembly such as ld.global.v4.f32 and mad.f32]
Role of Open64
The Open64 compiler gives us:
A complete C/C++ compiler framework; forward looking, so we do not need to add infrastructure as our hardware architecture advances over time
A good collection of high-level, architecture-independent optimizations; all GPU code is in the inner loop
Compiler infrastructure that interacts well with other related standardized tools
Maximum number of threads per block: 512
Maximum size of each dimension of a grid: 65535
Warp size: 32 threads
Number of registers per multiprocessor: 8192
Shared memory per multiprocessor: 16 KB, divided into 16 banks
Constant memory: 64 KB
Conclusion
The GPU's highly multithreaded architecture is very well suited to solving data-parallel problems, and it does so with ever-increasing performance
The GeForce 8800 Series and Quadro 5600/4600 go one step further by streamlining the programming model and introducing on-chip shared memory for efficient inter-thread communication
As a simple extension to the C programming language, CUDA lets you leverage that flexibility and power with a minimal learning curve
Extra Slides
CUDA Libraries
CUBLAS
CUDA Basic Linear Algebra Subprograms
Implementation of the BLAS standard on CUDA
For details see cublas_library.pdf and cublas.h
CUFFT
CUDA Fast Fourier Transform (FFT) library
The FFT is one of the most important and widely used numerical algorithms
For details see cufft_library.pdf and cufft.h
CUBLAS Library
Self-contained at the API level
The application needs no direct interaction with the CUDA driver
Currently only a subset of the CUBLAS core functions is implemented
Simple to use (a minimal call sequence is sketched below):
Create matrix and vector objects in GPU memory
Fill them with data
Call a sequence of CUBLAS functions
Read the results back from GPU memory to the host
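A minimal sketch of that sequence using the legacy cublas.h API, computing y = alpha*x + y with SAXPY (error checking omitted; the wrapper function name is hypothetical):

    #include <cublas.h>

    void saxpy_gpu(int n, float alpha, const float *x, float *y)
    {
        float *d_x, *d_y;
        cublasInit();                                      // initialize CUBLAS
        cublasAlloc(n, sizeof(float), (void**)&d_x);       // vector objects in GPU memory
        cublasAlloc(n, sizeof(float), (void**)&d_y);
        cublasSetVector(n, sizeof(float), x, 1, d_x, 1);   // fill them with data
        cublasSetVector(n, sizeof(float), y, 1, d_y, 1);
        cublasSaxpy(n, alpha, d_x, 1, d_y, 1);             // call a CUBLAS function
        cublasGetVector(n, sizeof(float), d_y, 1, y, 1);   // read results back to host
        cublasFree(d_x);
        cublasFree(d_y);
        cublasShutdown();
    }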
CUFFT Library
Efficient implementation of the FFT on CUDA
Features (a minimal usage sketch follows):
1D, 2D, and 3D FFTs of complex-valued signal data
Batch execution for multiple 1D transforms in parallel
Transform sizes (in any dimension) in the range [2, 16384]
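A minimal usage sketch with the cufft.h API: an in-place forward 1D transform of N complex points (error checking omitted; d_signal is assumed to already hold the input data in GPU memory, and the wrapper function name is hypothetical):

    #include <cufft.h>

    void fft_forward(cufftComplex *d_signal, int N)
    {
        cufftHandle plan;
        cufftPlan1d(&plan, N, CUFFT_C2C, 1);                    // 1D C2C plan, batch of 1
        cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);  // in-place forward FFT
        cufftDestroy(plan);
    }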