
Faculté Polytechnique

Graphics Processing Unit

Université de Mons
Table of contents
1. History & Summary
2. GPU and 3D rendering
3. Architecture of a GPU
4. GPU programming
5. CUDA
6. Conclusion

Université de Mons
What is a GPU ?
The GPU is a processor specialized in 3D tasks
It offloads several tasks from the CPU (central processing unit)
Its highly parallel structure makes it more effective than a CPU for a range of complex algorithms
Floating-point calculation

Université de Mons
Central Processing Unit : CPU
• An essential component of a computer.
• Interprets instructions and processes the data of a program.
• Sequential processing (not much data, but higher complexity).
• Needs to process more and more data for multimedia applications (games, CAD, …)

Université de Mons 5
Evolution of the CPU
• Multimedia applications use dedicated algorithms
• A linear algorithm applies the same instructions to a large amount of data: this is known as « vector processing »
• CPU architectures were adapted to handle multimedia workloads:
Intel Pentium MMX, AMD Opteron (3DNow!)

Université de Mons 6
Limitation of the CPU
• Each new generation of CPU with higher performance means more features and functions for the users
• Users want more and more functions, and they expect the technology to follow their desires
• But the technology is limited, because the internal clock frequency of a CPU is physically limited

Université de Mons 7
Solutions to work around the problem
• Multi-core: combine several CPU cores into one CPU
• Add a processor dedicated to multimedia applications → the GPU

BUT both need parallel programming

Université de Mons 8
Multi-core CPU
• Classic programming is not adapted to multi-core architectures, because sequential programming uses one core and no more
• Classic programming + multi-core brings no improvement!
• Parallel programming is needed: the problem is divided into elementary tasks which are processed simultaneously by several cores to decrease computation time
Université de Mons 9
Multi-core CPU
• Parallel programming is more complex programming
• Parallel programming is already used by scientists on supercomputers
• Multi-core CPUs are good, but not enough compared to GPUs

Université de Mons 10
GPU Vs CPU
• Comparison of FLOPS performance (floating-point operations per second)

Université de Mons 11
Origin of GPU
• Need to display a 2D projection of a 3D model in real time
CAD: to visualize a virtual object in 3D
Video games: to represent a virtual world
• 2 techniques: ray tracing and rasterization

Université de Mons 12
The graphics card is often called the GPU
• The graphics card is an important part of the computer
• Composed of memory, processors, registers and communication chipsets
• GPU = the graphics processors on this card
• Up to 240 parallel stream processors on a GPU @ 1500 MHz
• Single Instruction on Multiple Data [SIMD]

Université de Mons 13
The graphics card is often called the GPU

• GPU processors are organized in a pipeline


Université de Mons 14
GPU Programming

Languages
Shading Language

Université de Mons 15
GPGPU languages

CUDA
OpenCL
Accelerator
…..

Université de Mons 16
Programming Model
Array = texture
Kernel = fragment shader
Computation = graphics rendering
Feedback
GPGPU complexity:
Memory access
Bandwidth
…
Université de Mons 17
Table of contents
1. History & Summary
2. GPU and 3D rendering
3. Architecture of a GPU
4. GPU programming
5. CUDA
6. Conclusion

Université de Mons
Basic need ?
Show, in real time, a 2D projection (on the screen) of a 3D model
→ Ray tracing
→ Rasterization

Université de Mons
There is a specific vocabulary for
the GPU
Vertex
Texture
Pixel & fragment
Shader
Pipeline

Université de Mons
Vertex
A vertex (plural: vertices) is a point commonly used to define the corners of surfaces in 3D models, where each such point is given as a vector.
A vertex is represented by its coordinates X, Y and Z.

This cube has 8 vertices
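
As a small illustration (not part of the original slides), the 8 vertices of a unit cube could be stored as an array of X, Y, Z coordinates:

    /* Hypothetical example: the 8 vertices of a unit cube, one (X, Y, Z) triple each. */
    typedef struct { float x, y, z; } Vertex;

    static const Vertex cube[8] = {
        {0,0,0}, {1,0,0}, {1,1,0}, {0,1,0},   /* bottom face */
        {0,0,1}, {1,0,1}, {1,1,1}, {0,1,1}    /* top face    */
    };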

Université de Mons
Texture
A texture is a 2D image which is applied to a 3D object; it defines the perceived surface quality of the object.

Université de Mons
Pixel & Fragment
• A pixel is the smallest item of information in an image
→ seen by the viewer

• A fragment is the data necessary to generate a single pixel of a drawing primitive. It consists of:
→ some coordinates X, Y, Z
→ a color
→ a visibility depth
→ NOT seen by the viewer

Université de Mons
Shader
A shader is a small program that describes the traits of either a vertex or a pixel (via fragments).
It allows control of a subset of the GPU's processors.
Many special shading functions are defined by the major graphics libraries (OpenGL and Direct3D).
3 types of shaders:

Vertex shader: runs for each vertex given to the processor; transforms each vertex's 3D position into 2D screen coordinates.
Pixel (or fragment) shader: calculates the color of individual pixels → lighting/shadow effects.
Geometry shader: can add and remove vertices; the newest type of shader (not present on every GPU).

Université de Mons
Pipeline
A pipeline is an ordered sequence of stages.
Each stage gets the data from the previous one, performs its own operation and sends the result to the next one.
A pipeline is « full » when every stage is working simultaneously → optimal use.

Université de Mons
The current graphics pipeline

The graphics pipeline typically accepts some representation of a three-dimensional scene as input and produces a 2D raster image (an image made of pixels) as output.

OpenGL and Direct3D are two notable graphics pipeline models accepted as widespread industry standards.

The graphics pipeline contains 4 stages:
3 programmable stages → driven by the shaders
1 non-programmable stage → the rasterizer

Université de Mons
Vertex flow from the CPU to the GPU

Université de Mons
Pre-stage : Tessellation

Université de Mons
Stage 1: Vertex shader (programmable)

• Objects are transformed from 3D world-space coordinates into a 3D coordinate system based on the position and orientation of a virtual camera

• Used to add special effects to objects in a 3D environment

• Runs once for each vertex given to the GPU

• Can change a vertex's properties such as: position, color, texture coordinates, …

• One element in / one element out

• Cannot create new vertices
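
As a rough sketch (plain C, not real shader code; the names are illustrative), the core job of a vertex shader amounts to multiplying each vertex, in homogeneous coordinates, by a 4 × 4 transformation matrix:

    /* Illustrative sketch only: apply a 4x4 row-major matrix M to a vertex
       given in homogeneous coordinates (x, y, z, 1): one element in, one element out. */
    void transform_vertex(const float M[16], const float in[4], float out[4])
    {
        for (int row = 0; row < 4; ++row) {
            out[row] = M[4*row + 0] * in[0] + M[4*row + 1] * in[1]
                     + M[4*row + 2] * in[2] + M[4*row + 3] * in[3];
        }
    }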

Université de Mons
Stage 2: Geometry shader (programmable)

• One element in / 0 to ~100 elements out

• Can add and remove vertices

• Can be used to add volumetric detail (too costly for the CPU) or to refine the mesh size
  Ex: 20 triangles → 100 smaller triangles

• Displacement mapping

• The most recently introduced type of shader (not always present in the pipeline)

Mesh size = the fineness of the mesh (how small its polygons are)


Université de Mons
Stage 3: Rasterization (non-programmable) (1)
Vector image (vertices) → raster image (fragments)

• The most popular technique for producing real-time 3D computer graphics (faster than ray tracing)

• Projection of the polygons of the 3D scene onto a 2D grid of the size of the output image

• The output fragments have their final image coordinates

2D vector to raster
Polygon = set of triangles
Triangle = 3 vertices in 3D space
Université de Mons
Stage 3 : Rasterization (2)
The rasterization algorithm has at least 3 steps:

1. Calculation of the 2D coordinates (transformation)

2. Filtering of the vertices (clipping)

3. Rasterization itself (scan conversion)

4. Acceleration techniques (optional)

5. Further refinements

Université de Mons
Stage 3: Rasterization (3)

The rasterization algorithm has at least 3 steps:

1. Calculation of the 2D coordinates (transformation)

→ A set of mathematical transformations:

• Translation, scaling, rotation: to put the 3D figure at the desired location (e.g., the origin)
• Projection: from 3D to 2D (orthogonal projection (drop the z-component), perspective projection)

→ These operations are done by multiplying the vertex's augmented (homogeneous) coordinates by different matrices

Ex: translation matrix (see the sketch below)

Ex: a man turning his head
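
For illustration (the matrix itself is not reproduced in this transcription), a homogeneous translation matrix by (tx, ty, tz) acts on a vertex (x, y, z, 1) as follows:

    | 1 0 0 tx |   | x |   | x + tx |
    | 0 1 0 ty | * | y | = | y + ty |
    | 0 0 1 tz |   | z |   | z + tz |
    | 0 0 0 1  |   | 1 |   |   1    |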


Université de Mons
Stage 3: Rasterization (4)
2. Filtering of the vertices (clipping)
• The triangles' 2D vertex locations have been calculated, BUT they may lie outside of the window (the area of the screen where the pixels will be written)
• Clipping is the process of truncating triangles to fit them inside the viewing area.

3. Rasterization itself (scan conversion)
• Fill in, in pixels, the 2D triangles that are now in the image plane
• Example: treatment of a line (coordinates (1,1) to (5,1), color graded from blue to green); see the sketch after this list
→ will fill pixels (1,1), (2,1), (3,1), (4,1) and (5,1);
→ for each pixel, one has to determine its characteristics with a good balance:
(1,1) being totally blue, (2,1) less blue, (3,1) blue-green, …
• This is much more complicated for shapes like triangles, but the principle remains the same

• Difficulty: pixel aliasing
→ use of a Z-buffer to see which pixel is closest to the camera
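
A small C sketch of the line example above (illustrative only; set_pixel is a hypothetical helper):

    /* Fill pixels (x0,y)..(x1,y) with a blue-to-green gradient, as in the example. */
    void set_pixel(int x, int y, float r, float g, float b);   /* hypothetical image writer */

    void rasterize_horizontal_line(int x0, int x1, int y)
    {
        for (int x = x0; x <= x1; ++x) {
            float t = (x1 == x0) ? 0.0f : (float)(x - x0) / (float)(x1 - x0);
            set_pixel(x, y, 0.0f, t, 1.0f - t);   /* green grows, blue fades along the line */
        }
    }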

Université de Mons
Stage 3: Rasterization (5)

4. Acceleration techniques
I. Backface culling: determines whether a polygon of a graphical object is visible; if not (it shows its back to the camera) → cull it (see the sketch below)
II. Spatial data structures
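
A hedged C sketch of the backface-culling test (the sign and winding conventions are assumptions):

    /* A triangle can be culled when its normal points away from the camera,
       i.e. the dot product between the normal and the viewing direction is >= 0. */
    int is_backfacing(const float normal[3], const float view[3])
    {
        float d = normal[0]*view[0] + normal[1]*view[1] + normal[2]*view[2];
        return d >= 0.0f;   /* shows its back to the camera -> cull */
    }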

Université de Mons
Stage 4: Fragment shader (programmable)
Fragment shader (OpenGL) = pixel shader (Direct3D)

• Gives each pixel its final color (as a function of lighting, reflection or refraction of the light, …)
• Uses the biggest share of the computational resources
• Performs complex per-pixel effects and refinement techniques such as:
I. Texture filtering: to create clean images at any distance
II. Environment mapping: a form of texture mapping in which the texture coordinates are view-dependent, to simulate reflections on a shiny object
III. Shadows: traditionally not processed in the rasterizer → modern techniques

Université de Mons
Exit of the pipeline
• The fragment flow can:

→ either be written to a framebuffer and then displayed on the screen,

→ or, if it needs more processing, be written to a texture and then read back by the CPU.

Université de Mons
Summary

Université de Mons
The unified architecture arrived with the 6th generation of GPUs

Before: 2 types of processors in the GPU
→ vertex units
→ fragment units
This created a bottleneck when one type was overloaded → not optimal

Since the GeForce 8, processors are no longer specialized
→ optimal use of the pipeline: unified architecture

Université de Mons
GPU evolution through the different generations

Gen | Year | Nvidia | AMD/ATI | Particularities
1 | 96 | TNT2 | Rage | DirectX 6 = standard; rasterization of triangles and textures; limitation: no vertex processing; other vendor: 3dfx (Voodoo)
2 | 99 | GeForce 256 | Radeon 7500 | OpenGL supported; vertex processing supported
3 | 01 | GeForce 3 | Radeon 8500 | Nvidia buys 3dfx; vertex processing programmable
  | 02 | GeForce 4 | |
4 | 02 | GeForce FX | Radeon 9700 | Fragment processing programmable; first GPGPU operations
5 | 04 | GeForce 6 | Radeon X800 | Processing speed increases; GPGPU operations developed
  | 05 | GeForce 7 | Radeon X1800 |
6 | 06 | GeForce 8 | Radeon HD200 | Geometry shader appears; unified architecture
  | 07 | | Radeon HD300 | Nvidia creates the CUDA language
  | 08 | GeForce 9 | |
7 | 08 | GeForce 200 | Radeon HD400 | Not very widespread yet; technical improvements (frequency, memory, number of processors, bandwidth, …)
Université de Mons
Table of contents
1. History & Summary
2. GPU and 3D rendering
3. Architecture of a GPU
4. GPU programming
5. CUDA
6. Conclusion

Université de Mons
Architecture of a GPU

Université de Mons 42
Overview
A quick reminder:
→ architecture of a CPU
→ the CPU and its evolution
→ drawbacks
Architecture of a GPU:
→ needs
→ SIMD/MIMD
A short word on data management:
→ gathering/scattering and PRAM

Université de Mons 43
Architecture of a CPU
[Diagram: CPU layout showing the control unit, four ALUs, the cache and the DRAM]

Arithmetic Logic Unit (ALU), or calculation unit:
• handles all operations

Control unit:
• handles all instructions

Cache:
• fast memory access
• expensive
• present in high volume on a CPU

DRAM:
• dynamic random access memory
• cheap, but needs to be refreshed

Control → brain
ALU → hands
Memory → tools
Université de Mons 44
CPU processing
For a computer:
Program = several sequential instructions

Program code: Instruction 1 → Instruction 2 → Instruction 3 → …

A simple CPU is SISD (single instruction, single data):

• Instructions are computed one by one

• On a single data item at a time

Université de Mons 45
CPU and its evolution
At first: SISD
→ in-order processors
→ out-of-order processors (improved performance)
• instructions are dispatched to an instruction queue
• the results are queued
• the process is still sequential
→ high volume of cache memory
• needed for fast access to instructions and data
• lots of back-and-forth passes over the data

Université de Mons 46
Evolution and drawbacks
Evolution (Pentium 3)
→ SIMD (single instruction, multiple data)
→ better vector-processing performance

A CPU is perfect for sequential programs, but is weak for multimedia applications

→ Reasons:
• only a few back-and-forth passes over the data
• the complexity of the algorithms is low
• a high volume of cache memory and out-of-order execution are superfluous for multimedia applications

Université de Mons 47
Architecture of a GPU
A GPU is a SIMD processor
→ to be able to process a lot of data
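
As a minimal sketch of this idea (in CUDA, which is introduced later in these slides; names are illustrative), the same instruction stream is applied by many threads, each to its own data element:

    // One instruction stream, many data elements: each thread handles one element.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }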

Université de Mons 48
Needs of the GPU
A high memory bandwidth
→ about 10 × the CPU bandwidth, to process lots of data in real time

Université de Mons 49
Still to do
→ Discuss the new generation of GPUs
→ MIMD (multiple instruction, multiple data)
→ Compare MIMD and SIMD
→ Discuss data management
→ Gathering
→ Scattering
→ Discuss the PRAM model used in GPUs

Université de Mons 50
Table of contents
1. History & Summary
2. GPU and 3D rendering
3. Architecture of a GPU
4. GPU programming
5. CUDA
6. Conclusion

Université de Mons
CUDA
• CUDA (Compute Unified Device Architecture) is a development library created by NVIDIA in 2007.
• It makes it possible to use the power of a compatible graphics card for general-purpose computing.
• Programmers can use C, C++ or Fortran to develop applications using CUDA.
• Interfaces (wrappers) make it usable from high-level languages such as Java, .NET or Python.

Université de Mons 52
Different components of CUDA
• CUDA is made up of a set of software layers to communicate with the GPU: a Driver, a Runtime and a few libraries.

Université de Mons 53
CUDA Libraries
• They include the code of all the functions to be executed on the GPU.
• Using these libraries, developers can only use a set of predefined functions.
• They do not have direct access to the GPU itself.
• Examples:
• CUBLAS, which provides a set of building blocks for linear-algebra calculations on the GPU
• CUFFT, which handles the calculation of Fourier transforms
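
A hedged sketch of a CUBLAS call (it assumes the device arrays d_x and d_y were already allocated and filled; the GPU kernel itself is provided by the library):

    #include <cublas_v2.h>

    /* Sketch only: computes y = a*x + y on the GPU through CUBLAS. */
    void saxpy_with_cublas(int n, float a, const float *d_x, float *d_y)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);                       /* initialise the library   */
        cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);  /* single-precision a*x + y */
        cublasDestroy(handle);
    }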

Université de Mons 54
High Level API : CUDA Runtime
• Also called « C for CUDA »
• The high-level API is implemented “above” the low-level API: each call to a Runtime function is broken down into more basic instructions managed by the Driver API.
• The term “high-level API” is relative. Even the
Runtime API is still what a lot of people would
consider very low-level; yet it still offers functions
that are highly practical for initialization.
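
A minimal sketch of the usual Runtime API pattern on the host side (allocate, copy, launch, copy back); my_kernel is a hypothetical __global__ function:

    #include <cuda_runtime.h>

    __global__ void my_kernel(float *data, int n);   /* defined elsewhere (hypothetical) */

    void run_on_gpu(float *h_data, int n)
    {
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));                                 /* allocate on the device */
        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);  /* copy host -> device    */
        my_kernel<<<(n + 255) / 256, 256>>>(d_data, n);                         /* launch the kernel      */
        cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);  /* copy the results back  */
        cudaFree(d_data);                                                       /* free the device memory */
    }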

Université de Mons 55
Low Level API : CUDA Driver
• The Driver API is more complex to manage; it
requires more work to launch processing on the
GPU.
• The upside is that it’s more flexible, giving the
programmer additional control.
• Note that the high-level and Low-level APIs are
mutually exclusive – the programmer must use one
or the other, but it’s not possible to mix function calls
from both.

Université de Mons 56
CUDA from the Hardware
Point of View
• Nvidia’s Shader Core is made up of several clusters Nvidia calls Texture
Processor Clusters.
• Each cluster is made up of a texture unit and 2 streaming multiprocessors.

Université de Mons 57
The streaming Multiprocessor
• These processors consist of a front end that reads/decodes and launches instructions, and a back end made up of a group of eight calculating units and two SFUs (Special Function Units), where the instructions are executed in SIMD fashion.
• The same instruction is applied to all the threads in the warp. Nvidia calls this mode of execution SIMT (single instruction, multiple threads).
• The back end operates at double the frequency of the front end.

Université de Mons 58
Streaming multiprocessors’
operating mode
• At each cycle, a warp ready for execution is
selected by the front end, which launches
execution of an instruction.
• To apply the instruction to all 32 threads in the
warp, the backend will take four cycles, but since it
operates at double the frequency of the front end,
from its point of view only two cycles will be
executed.
• To avoid having the front end remain unused for one cycle, the ideal is to alternate types of instructions every cycle: a classic instruction for one cycle and an SFU instruction for the other.

Université de Mons 59
Shared Memory
• Each multiprocessor has a small memory area called shared memory, with a size of 16 KB per multiprocessor.
• This memory area provides a way for the threads in the same block to communicate. All the threads in a given block are executed by the same multiprocessor.
• The assignment of blocks to the different multiprocessors is completely undefined, meaning that two threads from different blocks cannot communicate during their execution.

Université de Mons 60
Cache Memory - Registers
• To limit too-frequent access to the
shared memory, Nvidia has also
provided its multiprocessors with a
cache (approximately 8 KB per
multiprocessor) for access to constants
and textures.
• The multiprocessors also have 8,192 registers that are shared among all the threads of all the blocks active on that multiprocessor. The number of active blocks per multiprocessor cannot exceed eight, and the number of active warps is limited to 24 (768 threads).

Université de Mons 61
Optimizing a CUDA program
• Finding the optimum balance between the number of blocks and their size: more threads per block help to mask the latency of memory operations, but at the same time the number of registers available per thread is reduced.
• Blocks of 512 threads would be particularly inefficient, since only one such block can be active on a multiprocessor, potentially wasting 256 thread slots. So Nvidia advises using blocks of 128 to 256 threads, which offers the best compromise between masking latency and the number of registers needed for most kernels.
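
A rough worked example using the figures from the previous slides (8,192 registers and at most 768 active threads per multiprocessor):

With blocks of 256 threads: 3 blocks can be active (3 × 256 = 768 threads), leaving about 8,192 / 768 ≈ 10 registers per thread.
With blocks of 512 threads: only one block fits (2 × 512 = 1,024 > 768 active threads), so 768 - 512 = 256 of the available thread slots are wasted.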

Université de Mons 62
Definitions
• Host : CPU
• Device : GPU
• Kernel : Function executed
on the GPU
• Thread : basic element of the data
to be processed (very lightweight)
• Warp : group of 32 threads
• Block : set of 64 to 512 threads
• Grid : Array of blocks
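
A short illustrative CUDA sketch tying these terms together: the host launches a kernel over a grid of blocks of threads, and each thread computes its own index to find its element:

    // Kernel: executed on the device, one thread per array element (names are illustrative).
    __global__ void scale(float *data, int n, float factor)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's global index
        if (i < n)
            data[i] *= factor;
    }

    // Host side: blocks of 256 threads, enough blocks to cover all n elements.
    //   scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);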
Université de Mons
Definitions (2)


Université de Mons 64
CUDA from a Software
Point of View
CUDA = set of extensions to the C language
Type qualifiers for functions:
__global__ void function()
→ function called by the CPU, executed on the GPU
__device__ void function()
→ function called by and executed on the GPU
__host__ void function()
→ standard function (executed on the CPU)
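
A small sketch combining the three qualifiers (function names are illustrative):

    __device__ float square(float x)                 // called from and executed on the GPU
    {
        return x * x;
    }

    __global__ void square_all(float *data, int n)   // called by the CPU, executed on the GPU
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = square(data[i]);
    }

    __host__ void report(void)                       // standard function, executed on the CPU
    {
        /* ordinary CPU code */
    }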

Université de Mons 65
Software Point of View (2)
Restrictions on __device__ and __global__ functions:
1. They cannot be recursive
2. They must have a fixed number of arguments

Type qualifier for variables:

__shared__ variable
→ the variable will be stored in the multiprocessor's shared memory
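
A sketch of threads in one block cooperating through shared memory (it assumes blocks of 256 threads; the kernel name is illustrative):

    // Each block sums its 256 input elements using a buffer in shared memory.
    __global__ void block_sum(const float *in, float *block_results)
    {
        __shared__ float buffer[256];                 // lives in the multiprocessor's shared memory
        int tid = threadIdx.x;

        buffer[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();                              // wait until the whole block has written

        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                buffer[tid] += buffer[tid + stride];
            __syncthreads();
        }
        if (tid == 0)
            block_results[blockIdx.x] = buffer[0];    // one partial sum per block
    }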
Université de Mons 66
Compilation
1. The CPU code is extracted and handed to the standard compiler
2. The GPU code is converted into PTX code (assembly code) and scanned for inefficiencies
3. The PTX is translated into GPU-specific commands that are encapsulated in the executable
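
For illustration, the usual entry point to this tool chain is the nvcc compiler driver (the file names are examples):

    nvcc -o program program.cu    # splits host and device code and produces an executable
    nvcc -ptx program.cu          # stops after generating the PTX assembly, for inspection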

Université de Mons 67
A few application examples

Université de Mons 68
ATI equivalent to Nvidia’s CUDA

Université de Mons 69
