
The Architecture of the Graphics Processing Unit - GPU

P. Bakowski


Evolution of parallel architectures


We can distinguish 3 generations of massively parallel architectures (scientific computing):
(1) Super-computers with special processors for vector calculation (Single Instruction Multiple Data).
The Cray-1 (1976) contained 200,000 integrated circuits and could perform 100 million floating-point operations per second (100 MFLOPS).
Price: $5 - $8.8 million. Number of units sold: 85.

Evolution of parallel architectures


(2) Super-computers with standard microprocessors adapted for massive multiprocessing, operating as Multiple Instruction Multiple Data computers.
IBM Roadrunner: PowerXCell 8i CPUs plus 6,480 dual-core AMD Opterons, running Linux.
Consumption: 2.35 MW. Surface: 296 racks, 560 m2. Memory: 103.6 TiB. Performance: 1.042 petaflops. Price: USD $125M.

Evolution of GPU architectures


(3) General-Purpose computing on Graphics Processing Units (GPGPU): technology based on the circuits integrated into graphics cards.


GPU based processing


The GPUs (Graphics Processing Units) contain hundreds or thousands of arithmetic units. These capacities may be used to accelerate a wide range of computing applications.
Example - nVIDIA GT200, 300, 400, 500 series: up to 48 CUDA cores per streaming multiprocessor.


CPUs and SSE extensions


Modern CPUs integrate specific SIMD units for graphics processing. These units implement the SSE2, SSE3 and SSE4 instructions and contain 4 arithmetic units that may operate in parallel on 4 fixed-point or floating-point data elements.
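As an illustration (not from the slides), a minimal C sketch using SSE intrinsics; _mm_add_ps performs four floating-point additions with a single instruction:

#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void)
{
    float a[4] = { 1.0f,  2.0f,  3.0f,  4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float r[4];

    __m128 va = _mm_loadu_ps(a);    /* load 4 floats into a 128-bit register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vr = _mm_add_ps(va, vb); /* one instruction, 4 parallel additions */
    _mm_storeu_ps(r, vr);

    printf("%.1f %.1f %.1f %.1f\n", r[0], r[1], r[2], r[3]);
    return 0;
}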


CPUs and GPUs


The GPUs are based on multiple processing units with multiple processing cores (8/16/32 cores per processing unit); they contain register files and shared memories. A graphics card contains a global memory that can be used by all processors (including the CPU), a local memory for each processing unit, and special memories for constant values.


GPUs : streaming multi-processors


The streaming multiprocessors (SMs) integrated in GPUs are SIMD blocks with several arithmetic cores. Each core contains one Floating Point unit and one INTeger unit.

8/16/32/48 cores per SM



CPUs and cache memories


CPUs use cache memories to reduce the access latency to main memory. CPU caches occupy more and more of the processor die area and consume a lot of energy.


Cache memory : latency


GPUs and cache memories


GPUs use caches or shared memories to increase the effective memory bandwidth.


GPU memory : transfer data rate


Each GPU multiprocessor has its own memory controller. For example, the memory controllers of the nVIDIA GT200 chip provide 8 64-bit communication channels.

8 * 64-bit channels

GPU memory : transfer data rate


data_rate = interface_width/8 * memory_clock * 2
two reads/writes per clock cycle (DDR2-type double data rate memory)
For the GTX 275:
number of bytes on the bus: 448 bits / 8 = 56
data rate in bytes: 56 * 1224 MHz = 68,544 MB/s
68,544 MB/s * 2 = 137,088 MB/s = 137.1 GB/s
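The same arithmetic can be checked with a few lines of C (the values are the GTX 275 figures quoted above):

#include <stdio.h>

int main(void)
{
    int interface_width = 448;   /* memory bus width in bits */
    int memory_clock    = 1224;  /* memory clock in MHz */

    int  bus_bytes = interface_width / 8;                  /* 56 bytes */
    long data_rate = (long)bus_bytes * memory_clock * 2;   /* DDR: 2 transfers per cycle */

    printf("data rate: %ld MB/s = %.1f GB/s\n", data_rate, data_rate / 1000.0);
    return 0;
}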


CPU/GPU : execution threads


The GPU hides memory latency by the simultaneous execution of thousands of threads; if one thread waits on a memory access, another one may be executed in the meantime.
(figure: timeline - thread executes, thread waits, thread executes)


CPU/GPU : execution threads


A CPU may execute 1-2 threads per core; a GPU multiprocessor may maintain up to 1024 threads. The cost of a thread context switch on a CPU core is tens or hundreds of clock cycles; a GPU may switch between several threads per clock cycle.


SIMD versus SIMT


SIMD: the CPUs exploit vector processing units for SIMD processing (a single instruction is executed on multiple data elements) - a single execution thread!

SIMT: the GPUs use the SIMT operational mode, where a single instruction is executed by multiple threads. SIMT processing does not require transforming the data into vectors, and it allows arbitrary branches in the threads (see the sketch below).
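A small CUDA sketch (illustrative, not from the slides) of the per-thread branching that SIMT permits:

// Threads of the same warp may take different branches; the hardware
// serializes the divergent paths, and no vectorization of the data is needed.
__global__ void branchy(int *data)
{
    int i = threadIdx.x;
    if (data[i] % 2 == 0)            // each thread decides independently
        data[i] = data[i] / 2;
    else
        data[i] = 3 * data[i] + 1;
}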


GPUs and high density computing


The GPUs give excellent results when the same sequence of operations is applied to a large number of data elements. The best results are obtained when the number of arithmetic operations greatly exceeds the number of memory accesses. Such high computational density does not require the large cache memory that is necessary in CPUs.

(figure: computational density - the ratio of calculations, high or low, to memory accesses)


GPUs : performance


GPU based calculus


In several cases the performance of GPU-based processing is 5-30 times greater than CPU-based processing. The biggest difference - a performance gain of up to 100 times! - relates to code that is not adapted to SSE instructions but suits the GPU functions well.


GPU based calculus


Some examples of synthetic code accelerated by the use of GPUs, compared to the same code vectorized for SSE:
processing for fluorescence microscopy: 12x
modeling of molecular dynamics: 8-16x
modeling of electrostatic fields: 40-120x and 7x


GPU based calculus: speed-up


The comparison of the speed-up relative to SSE


From GeForce8 to Tesla


From GeForce8 to Tesla


8-16 CUDA cores


From GeForce8 to Tesla

How many CUDA cores ?


From GeForce8 to Tesla


Tesla system S1070


NVIDIA and CUDA


CUDA technology is a software architecture based on nVIDIA hardware. The CUDA language is an extension of the C programming language. It gives access to GPU instructions and to the video memory for parallel calculations. CUDA allows implementing algorithms that can be run on GeForce 8 cards and on all more recent GPU chips (GeForce 9, GeForce 200, GeForce 300, GeForce 400, GeForce 500), Quadro and Tesla.


NVIDIA and CUDA


NVIDIA and CUDA


The CUDA Toolkit contains:
compiler: nvcc
FFT and BLAS libraries
profiler
gdb debugger for the GPU
CUDA runtime driver, included in the nVIDIA drivers
programming guide
SDK for CUDA developers, with source code examples and documentation


CUDA : compilation phases

The CUDA C code is compiled with nvcc, which is a driver script that activates other programs: cudacc, g++, cl, etc.

CUDA : compilation phases


nvcc generates: the CPU code, written in pure C and compiled together with the other parts of the application, and the PTX object code for the GPU.


CUDA : compilation phases


The executable files with CUDA code require: the CUDA runtime library (cudart) and the base CUDA library.


CUDA : advantages
The main CUDA advantage for GPGPU computing results from the new GPU architecture, designed for the efficient implementation of non-graphics calculations, and from the use of the C programming language. There is no need to convert the algorithms into the pipelined format required for graphics calculations. GPGPU code does not use the graphics API or the corresponding drivers.


CUDA : advantages
CUDA provides:
access to 16 KB of memory per SM, shared by the SM's threads
an efficient transfer of data between the system memory and the video memory (global GPU memory)
a memory with a linear addressing scheme and random access to any memory location
hardware-implemented operations on floating-point numbers, integers and bits


CUDA : limitations
Limitations:
no recursive functions (no stack)
processing blocks of minimum 32 threads (a warp)
CUDA is a proprietary architecture of nVIDIA


CUDA : programming model


The CUDA programming model is based on groups of threads. Threads are grouped into blocks, and blocks into grids of one or two dimensions; the threads of a block cooperate via shared memory and synchronization points. A kernel program is executed in a grid of thread blocks, and only one grid of thread blocks is executed at a time. Each block may be built in one, two or three dimensions, and may contain up to 512 threads (see the sketch below).
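A minimal sketch of such a launch configuration (the kernel name and the sizes are illustrative assumptions):

__global__ void inc(float *data)
{
    // the global index combines the block index and the thread index
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += 1.0f;
}

void launch(float *d_data)
{
    dim3 block(256);               // a 1-D block of 256 threads (up to 512 allowed)
    dim3 grid(64);                 // a 1-D grid of 64 blocks
    inc<<<grid, block>>>(d_data);  // one kernel executed over one grid
}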


CUDA : programming model


The blocks of threads are executed in groups of 32 threads called warps. A warp is the minimal volume of work that is processed by the streaming processors. CUDA works with blocks of threads containing from 32 to 512 threads.


CUDA : memory model


Local and Global Memory are not cached. Local and Global Memory are implemented in separate circuits. The access time to Local and Global Memory is much longer than the register access time.


CUDA : memory model


There are 1024 register entries per SM. Access to these registers is very fast. Each register may store one 32-bit integer or floating-point number.


CUDA : memory model


Global Memory: from 256 MB to 2 GB (up to 4 GB in Tesla). The data bandwidth may be over 100 GB/s, but the latency is high (several hundred clock cycles). There is no cache memory for Global Memory. Global Memory is used for global data and instructions.


CUDA : memory model


Shared Memory: 16 KB of shared memory for all cores in a block of threads. Shared Memory is as fast as the registers.


CUDA : memory model


Constant Memory: 64 KB, read-only for all SM units. Constant Memory is a high-latency memory, with an access time of several hundred clock cycles.


CUDA : memory model


That is why the Constant Memory data are cached in blocks of 8 KB for each SM (see the sketch below).
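A minimal sketch of Constant Memory usage (the coefficient table and the kernel are hypothetical):

__constant__ float coeff[4];       // lives in the 64-KB Constant Memory space

__global__ void scale(float *data)
{
    int i = threadIdx.x;
    data[i] *= coeff[i % 4];       // read served by the 8-KB per-SM constant cache
}

// host side: Constant Memory is written with cudaMemcpyToSymbol
// float h_coeff[4] = {0.5f, 1.0f, 2.0f, 4.0f};
// cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));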


CUDA : memory model


Texture Memory is accessible (read-only) to all SMs. Texture data are used directly by the GPU; they may be interpolated linearly without additional operations.


CUDA : memory model


Texture Memory has a long access latency and is cached.


CUDA : memory model


Typical use of CUDA memories (see the sketch below):
divide the task into several sub-tasks
decompose the input data into blocks that correspond to the Shared Memory size
let each block of data be processed by a block of threads
load the data blocks from the Global Memory into the Shared Memory
process the data in the Shared Memory
copy the results from the Shared Memory to the Global Memory
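A minimal kernel sketch of this pattern, assuming a hypothetical task that reverses each 256-element block of an array:

#define BLOCK 256

__global__ void reverseBlocks(const float *in, float *out)
{
    __shared__ float tile[BLOCK];    // fast on-chip Shared Memory (16 KB per SM)

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];       // load Global Memory -> Shared Memory
    __syncthreads();                 // synchronization point: wait for the whole tile

    // process in Shared Memory and copy the result back to Global Memory
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}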

CUDA : program example


main() - function at the CPU side
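The slide's listing is a figure and is not reproduced here; below is a minimal host-side sketch in the same spirit, assuming the 10-element vector addition used on the following slides (array names are illustrative):

#include <stdio.h>
#include <cuda_runtime.h>

#define N 10

// the GPU-side kernel, discussed on the kernel slide; repeated here
// so that the sketch compiles as a single file
__global__ void add(const int *a, const int *b, int *c)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

int main(void)
{
    int h_a[N], h_b[N], h_c[N];
    int *d_a, *d_b, *d_c;

    for (int i = 0; i < N; i++) { h_a[i] = i; h_b[i] = 10 * i; }

    // allocate Global Memory on the graphics card
    cudaMalloc((void**)&d_a, N * sizeof(int));
    cudaMalloc((void**)&d_b, N * sizeof(int));
    cudaMalloc((void**)&d_c, N * sizeof(int));

    // transfer the input data from system memory to video memory
    cudaMemcpy(d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<1, N>>>(d_a, d_b, d_c);   // one block of 10 threads

    // copy the result back and free the GPU memory
    cudaMemcpy(h_c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) printf("%d\n", h_c[i]);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}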


CUDA : program example


main() - function at the CPU side (cont.)


CUDA : program example


main() - function at the CPU side (cont.)


CUDA : program example


kernel function: at the GPU side

10 threads: no loop, but several parallel threads; each thread performs one of the ten additions and carries its own index threadIdx.x (see the sketch below).
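A hedged reconstruction of the kernel the slide describes (matching the 10-element add() sketched on the main() slide above):

// GPU-side kernel: no loop - the ten additions are performed by
// ten parallel threads; threadIdx.x selects each thread's element
__global__ void add(const int *a, const int *b, int *c)
{
    int i = threadIdx.x;    // thread index within the block: 0..9
    c[i] = a[i] + b[i];     // one addition per thread
}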

CUDA and graphic APIs


CUDA programs may exploit the graphics functions provided by graphics APIs (DirectX, OpenGL). These functions provide the image-processing operations necessary for rasterization, shading and rendering of the images on the screen. The proposed module does not deal with these primitives. However, some OpenGL operations may be used in the practical classes to display images directly from GPU memory.


Summary
Evolution of multiprocessing
CPUs and GPUs
SIMD and SIMT processing modes
Performance of GPUs
NVIDIA and CUDA
CUDA processing model
CUDA memory model
A simple example

