Graphics
Processing
Unit
The GPU
Université de Mons
Table of contents
1. History & Summary
2. GPU and 3D rendering
3. Architecture of a GPU
4. GPU programming
5. CUDA
6. Conclusion
Université de Mons
What is a GPU?
The GPU is a processor specialized in 3D tasks
Offloads several tasks from the CPU (Central Processing Unit)
Its highly parallel structure is more effective than a CPU for a range of complex algorithms
Floating-point computation
Université de Mons
Central Processing Unit : CPU
• An essential component of a computer.
• Interprets the instructions and processes the data of a program.
• Sequential processing (not much data, but higher complexity)
• Needs to process more and more data for multimedia applications (games, CAD, …)
Université de Mons
Evolution of the CPU
• Multimedia applications rely on dedicated algorithms
• Linear algorithms apply the same instructions to a large amount of data: we speak of « vector calculus »
• CPU architectures were adapted with multimedia extensions:
Intel Pentium MMX, AMD 3DNow!
Université de Mons
Limitation of the CPU
• Each new CPU generation with higher performance means more features and functions for the users
• Users want more and more functions and expect the technology to follow their desires
• But the technology is limited, because the internal clock frequency of a CPU is physically bounded
Université de Mons
Solutions to work around the problem
• Multi-core: combine several CPU cores into one CPU
• Add a processor dedicated to multimedia applications: the GPU
Université de Mons
Multi-core CPU
• Classic programming is not suited to multi-core architectures, because sequential programming uses one core and no more
• Classic programming + multi-core brings no improvement!
• Parallel programming is needed: the problem is divided into elementary tasks which are processed simultaneously by several cores to decrease computation time
Université de Mons
Multi-core CPU
• Parallel programming is complex
• Parallel programming is already used by scientists on supercomputers
• Multi-core CPUs are good, but not enough compared to GPUs
Université de Mons
GPU vs CPU
• Comparison of FLOPS performance (FLoating-point Operations Per Second)
Université de Mons
Origin of GPU
• Need to display a 2D projection of a 3D model in real time
CAD: to visualize a virtual object in 3D
Video games: to represent a virtual world
• 2 techniques: ray tracing, rasterization
Université de Mons
The graphics card is often called « GPU »
• The graphics card is an important part of the computer
• Composed of memory areas, processors, registers and communication chipsets
• GPU = the graphics processors on this card
• Up to 240 parallel stream processors on a GPU @ 1500 MHz
• Single Instruction, Multiple Data [SIMD]
Université de Mons
The graphics card is often called « GPU »
Languages
Shading languages
Université de Mons
GPGPU languages
CUDA
OpenCL
Accelerator
…..
Université de Mons
Programming Model
Array = texture
Kernel = fragment shader
Computation = graphics rendering
Feedback
GPGPU complexity:
Memory access
Bandwidth
…..
Université de Mons
Table of contents
1. History & Summary
2. GPU and 3D rendering
3. Architecture of a GPU
4. GPU programming
5. CUDA
6. Conclusion
Université de Mons
Basic need?
Show, in real time, a 2D projection (on the screen) of a 3D model
Ray tracing
Rasterization
Université de Mons
There is a specific vocabulary for
the GPU
Vertex
Texture
Pixel & fragment
Shader
Pipeline
Université de Mons
Vertex
A vertex (plural: vertices) is
commonly used to define the
corners of surfaces in 3D
models, where each such point
is given as a vector.
A vertex is represented by its
X, Y and Z coordinates.
Université de Mons
Texture
A texture is a 2D image which
is applied to a 3D object to
define the perceived quality of
its surface.
Université de Mons
Pixel & Fragment
• A pixel is the smallest item of information in an image seen by the viewer
• A fragment is the candidate data for one pixel (color, depth, texture coordinates) produced by rasterization; several fragments may contribute to the same pixel
Université de Mons
Shader
A shader is a small program that describes the traits of either a vertex or a pixel (via the fragments).
It allows control of a subset of the GPU processors.
Many special shading functions are defined by the major graphics software libraries (OpenGL and Direct3D).
3 types of shaders:
Vertex shader: runs for each vertex given to the processor; transforms each vertex's 3D position into the 2D coordinates of the screen
Pixel shader: calculates the color of individual pixels; lighting/shadow effects
Geometry shader: adds and removes vertices; a new shader (not present on every GPU)
Université de Mons
Pipeline
A pipeline is an ordered sequence of stages.
Each stage gets the data from the previous one,
performs its own operation and sends the
results to the next one.
A pipeline is « full » when every stage is working
simultaneously → optimal use
Université de Mons
The current graphics pipeline
Université de Mons
Vertices flow from the CPU to the GPU
Université de Mons
Pre-stage : Tessellation
Université de Mons
Stage 1 : Vertex shader (Programmable)
Université de Mons
Stage 2: Geometry shader (Programmable)
• Displacement mapping
Stage 3: Rasterization (1)
From a 2D vector description to a raster
Polygon = set of triangles
Triangle = 3 vertices in 3D space
Université de Mons
Stage 3: Rasterization (2)
The rasterization algorithm has at least 3 steps:
5. Further refinements
Université de Mons
Stage 3: Rasterization (3)
Ex: Translation matrix:
Université de Mons
Stage 3: Rasterization (5)
4. Acceleration techniques
I. Backface culling: determines whether a polygon of a graphical object is visible; if not (it shows its back to the camera), it is culled
II. Spatial data structures
Université de Mons
Stage 4: Fragment shader (Programmable)
Fragment shader (OpenGL terminology) = Pixel shader (Direct3D terminology)
Université de Mons
Output of the pipeline
• The fragment stream can:
Université de Mons
Summary
Université de Mons
The unified architecture arrived with the 6th generation
of GPUs
Université de Mons
GPU evolution through the different generations
Gen | Year | Nvidia | AMD/ATI | Particularities
1 | '96 | TNT2 | Rage | DirectX 6 = standard; rasterization of triangles and textures; limitation: no vertex processing; other vendor: 3dfx (Voodoo)
2 | '99 | GeForce 256 | Radeon 7500 | OpenGL supported; vertex processing supported
3 | '01 | GeForce 3 | Radeon 8500 | Nvidia buys 3dfx; vertex processing programmable
  | '02 | GeForce 4 | |
4 | '02 | GeForce FX | Radeon 9700 | Fragment processing programmable; first GPGPU operations
5 | '04 | GeForce 6 | Radeon X800 | Processing speed increases; GPGPU operations developed
  | '05 | GeForce 7 | Radeon X1800 |
6 | '06 | GeForce 8 | Radeon HD 2000 | Geometry shader appears; unified architecture
  | '07 | | Radeon HD 3000 | Nvidia creates the CUDA language
  | '08 | GeForce 9 | |
7 | '08 | GeForce 200 | Radeon HD 4000 | Not very widespread yet; technical improvements (frequency, memory, number of processors, bandwidth, …)
Université de Mons
Table of contents
1. History & Summary
2. GPU and 3D rendering
3. Architecture of a GPU
4. GPU programming
5. CUDA
6. Conclusion
Université de Mons
Architecture of a GPU
Université de Mons
Overview
Quick reminder:
Architecture of a CPU
The CPU and its evolution over time
Drawbacks
Architecture of a GPU:
Needs
SIMD/MIMD
A short word on data management:
Gathering/scattering and PRAM
Université de Mons
Architecture of a CPU
[Diagram: CONTROL unit, four ALUs, CACHE, DRAM]
Arithmetic Logic Unit (ALU), or calculation unit:
• performs all the operations
Control unit:
• manages all the instructions
Cache:
Université de Mons
CPU processing
For a computer:
Program = several sequential instructions
[Program code: Instruction1 → Instruction2 → Instruction3]
Université de Mons
CPU and its evolution
At first: SISD
In-order processors
Out-of-order processors (↑ performance)
Instructions are dispatched to an instruction queue
The results are queued
The process is still sequential
High volume of cache memory
Needed for fast access to instructions and data
Lots of « back and forth » on the data
Université de Mons
Evolution and drawbacks
Evolution (Pentium 3)
SIMD (Single Instruction, Multiple Data)
↑ vector-calculus performance
Reasons:
Only a few « back and forth » passes on the data
The complexity of the algorithms is low
Large caches and out-of-order execution are superfluous for multimedia applications
Université de Mons
Architecture of a GPU
A GPU is a SIMD processor
→ able to process a lot of data
Université de Mons
Needs of the GPU
A high memory bandwidth
≈ 10 × the CPU bandwidth, to process lots of data in real time
Université de Mons
Still to do
Talk about the new generation of GPUs
MIMD (Multiple Instruction, Multiple Data)
Compare MIMD and SIMD
Talk about data management
Gathering
Scattering
Talk about the PRAM model used in GPUs
Université de Mons
Table of contents
1. History & Summary
2. GPU and 3D rendering
3. Architecture of a GPU
4. GPU programming
5. CUDA
6. Conclusion
Université de Mons
CUDA
• CUDA (Compute Unified Device Architecture) is a
development library created by NVIDIA in 2007.
• It allows the use of the power of a compatible graphics
card for general-purpose computing.
• Programmers can use C, C++ or Fortran to develop
applications using CUDA.
• Interfaces (wrappers) enable the use of high-level
languages such as Java, .NET or Python.
Université de Mons
Different components of CUDA
• CUDA consists of a set of software layers to
communicate with the GPU: a driver, a runtime and
a few libraries.
Université de Mons
CUDA Libraries
• They include the code of all the functions to be
executed on the GPU.
• Using these libraries, developers can only call
a set of predefined functions.
• They do not have direct access to the GPU.
• Examples:
• CUBLAS, which offers a set of building blocks for linear algebra computations on the GPU
• CUFFT, which handles the computation of Fourier transforms
Université de Mons
High Level API : CUDA Runtime
• Also called « C for CUDA »
• The high-level API is implemented “above” the low-
level API, each call to a function of the Runtime is
broken down into more basic instructions managed
by the Driver API
• The term “high-level API” is relative. Even the
Runtime API is still what a lot of people would
consider very low-level; yet it still offers functions
that are highly practical for initialization.
Université de Mons
Low Level API : CUDA Driver
• The Driver API is more complex to manage; it
requires more work to launch processing on the
GPU.
• The upside is that it’s more flexible, giving the
programmer additional control.
• Note that the high-level and Low-level APIs are
mutually exclusive – the programmer must use one
or the other, but it’s not possible to mix function calls
from both.
Université de Mons
CUDA from the Hardware
Point of View
• Nvidia’s shader core is made up of several clusters that Nvidia calls Texture
Processor Clusters.
• Each cluster is made up of a texture unit and 2 streaming multiprocessors.
Université de Mons
The streaming Multiprocessor
• These processors consist of a front
end that reads/decodes and launches
instructions, and a back end made up
of a group of eight calculating units
and two SFUs (Special Function Units),
where the instructions are executed
in SIMD fashion.
• The same instruction is applied to all
the threads in the warp. Nvidia calls
this mode of execution SIMT (for
Single Instruction, Multiple Threads).
• The back end operates at double
the frequency of the front end.
Université de Mons
Streaming multiprocessors’
operating mode
• At each cycle, a warp ready for execution is
selected by the front end, which launches
execution of an instruction.
• To apply the instruction to all 32 threads in the
warp, the back end takes four cycles; but since it
operates at double the frequency of the front end,
from the front end’s point of view only two cycles
are spent.
• To avoid leaving the front end unused for
one cycle, the ideal is to alternate instruction
types every cycle: a classic instruction for
one cycle and an SFU instruction for the other.
Université de Mons
Shared Memory
• Each multiprocessor has a small
memory area called shared memory,
with a size of 16 KB per multiprocessor.
• This memory area provides a way for
threads in the same block to
communicate. All the threads of a given
block are executed by the same
multiprocessor.
• The assignment of blocks to the
different multiprocessors is completely
undefined, meaning that two threads
from different blocks cannot
communicate during their execution.
Université de Mons
Cache Memory - Registers
• To limit overly frequent access to
shared memory, Nvidia has also
provided its multiprocessors with a
cache (approximately 8 KB per
multiprocessor) for access to constants
and textures.
• The multiprocessors also have 8,192
registers that are shared among all the
threads of all the blocks active on that
multiprocessor. The number of active
blocks per multiprocessor cannot exceed
eight, and the number of active warps
is limited to 24 (768 threads).
Université de Mons
Optimizing a CUDA program
• Finding the optimum balance between the number of blocks and
their size: more threads per block help mask the
latency of memory operations, but at the same time the
number of registers available per thread is reduced.
• Blocks of 512 threads would be particularly inefficient, since only
one such block could be active on a multiprocessor, potentially wasting
256 threads of capacity. Nvidia therefore advises using blocks of 128 to 256 threads,
which offers the best compromise between masking latency and
the number of registers needed for most kernels.
Université de Mons
Definitions
• Host : CPU
• Device : GPU
• Kernel : Function executed
on the GPU
• Thread : the basic execution element, processing
one piece of data (very lightweight)
• Warp : group of 32 threads
• Block : set of 64 to 512 threads
• Grid : Array of blocks
Université de Mons
Definitions (2)
Université de Mons
CUDA from a Software
Point of View
CUDA = a set of extensions to the C language
Type qualifiers for functions:
__global__ void function()
Function called by the CPU, executed on the GPU
__device__ void function()
Function called by and executed on the GPU
__host__ void function()
Standard function (executed on the CPU)
Université de Mons
Software Point of View (2)
Restrictions on __device__ and __global__ :
1. Cannot be recursive
2. Must have a fixed number of arguments
Université de Mons
A few application examples
Université de Mons
The ATI equivalent to Nvidia’s CUDA
Université de Mons