
Faculté Polytechnique

Graphics Processing Unit

Université de Mons
Table of contents
1. History & Summary
2. GPU and 3D rendering
3. Architecture of a GPU
4. GPU programming
5. CUDA
6. Conclusion

Université de Mons
What is a GPU ?
The GPU is a processor specialized in 3D tasks
It offloads several tasks from the CPU (central processing unit)
Its highly parallel structure makes it more effective than a CPU for a range of complex algorithms
Floating-point calculation

Université de Mons
Central Processing Unit : CPU
• An essential component of a computer.
• Interprets instructions and processes the data of a program.
• Sequential processing (not much data, but higher complexity).
• Needs to process more and more data for multimedia applications (games, CAD, …)

Université de Mons 5
Evolution of the CPU
• Multimedia applications use dedicated algorithms
• A linear algorithm applies the same instructions to a large amount of data: this is known as « vector processing »
• CPU architectures were adapted to handle multimedia workloads:
Intel Pentium MMX, AMD Opteron (3DNow!)

Université de Mons 6
Limitation of the CPU
• Each new generation of CPU with higher performance means more features and functions for the users
• Users want more and more functions, and they expect the technology to follow their desires
• But the technology is limited, because the internal clock frequency of a CPU is physically limited

Université de Mons 7
Solutions to work around the problem
• Multi-core: combine several CPU cores into one CPU
• Add a processor dedicated to multimedia applications → the GPU

BUT both need parallel programming

Université de Mons 8
Multi-core CPU
• Classic programming is not adapted to multi-core architectures, because sequential programming uses one core and no more
• Classic programming + multi-core brings no improvement!
• Parallel programming is needed: the problem is divided into elementary tasks which are processed simultaneously by several cores to decrease computation time
Université de Mons 9
Multi-core CPU
• Parallel programming is more complex programming
• Parallel programming is already used by scientists on supercomputers
• Multi-core CPUs are good, but not enough compared to GPUs

Université de Mons 10
GPU Vs CPU
• Comparison of FLOPS performance (floating-point operations per second)

Université de Mons 11
Origin of GPU
• Need to display a 2D projection of a 3D model in real time
CAD: to visualize a virtual object in 3D
Video games: to represent a virtual world
• 2 techniques: ray tracing and rasterization

Université de Mons 12
The graphics card is often called the GPU
• The graphics card is an important part of the computer
• Composed of memory, processors, registers and communication chipsets
• GPU = the graphics processors on this card
• Up to 240 parallel stream processors on a GPU @ 1500 MHz
• Single Instruction on Multiple Data [SIMD]

Université de Mons 13
The graphics card is often called the GPU

• GPU processors are organized in a pipeline


Université de Mons 14
GPU Programming

Languages
Shading Language

Université de Mons 15
GPGPU languages

CUDA
OpenCL
Accelerator
…..

Université de Mons 16
Programming Model
Array = texture
Kernel = fragment shader
Computation = graphics rendering
Feedback
GPGPU complexity:
Memory access
Bandwidth
…
Université de Mons 17
Table of contents
1. History & Summary
2. GPU and 3D rendering
3. Architecture of a GPU
4. GPU programming
5. CUDA
6. Conclusion

Université de Mons
Basic need ?
Show, in real time, a 2D projection (on the screen) of a 3D model
→ Ray tracing
→ Rasterization

Université de Mons
There is a specific vocabulary for
the GPU
Vertex
Texture
Pixel & fragment
Shader
Pipeline

Université de Mons
Vertex
A vertex (plural: vertices) is a point commonly used to define the corners of surfaces in 3D models, where each such point is given as a vector.
A vertex is represented by its coordinates X, Y and Z.

This cube has 8 vertices
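
As a small illustration (not part of the original slides), the 8 vertices of a unit cube could be stored as an array of X, Y, Z coordinates:

    /* Hypothetical example: the 8 vertices of a unit cube, one (X, Y, Z) triple each. */
    typedef struct { float x, y, z; } Vertex;

    static const Vertex cube[8] = {
        {0,0,0}, {1,0,0}, {1,1,0}, {0,1,0},   /* bottom face */
        {0,0,1}, {1,0,1}, {1,1,1}, {0,1,1}    /* top face    */
    };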

Université de Mons
Texture
A texture is a 2D image which is applied to a 3D object; it defines the perceived surface quality of the object.

Université de Mons
Pixel & Fragment
• A pixel is the smallest item of information in an image
→ seen by the viewer

• A fragment is the data necessary to generate a single pixel of a drawing primitive. It consists of:
→ some coordinates X, Y, Z
→ a color
→ a visibility depth
→ NOT seen by the viewer

Université de Mons
Shader
A shader is a small program that describes the traits of either a vertex or a pixel (via fragments).
It allows control of a subset of the GPU's processors.
Many special shading functions are defined by the major graphics libraries (OpenGL and Direct3D).
3 types of shaders:

Vertex shader: runs for each vertex given to the processor; transforms each vertex's 3D position into 2D screen coordinates.
Pixel (or fragment) shader: calculates the color of individual pixels → lighting/shadow effects.
Geometry shader: can add and remove vertices; the newest type of shader (not present on every GPU).

Université de Mons
Pipeline
A pipeline is an ordered sequence of stages.
Each stage gets the data from the previous one, performs its own operation and sends the result to the next one.
A pipeline is « full » when every stage is working simultaneously → optimal use.

Université de Mons
The current graphics pipeline

The graphics pipeline typically accepts some representation of a three-dimensional scene as input and produces a 2D raster image (an image made of pixels) as output.

OpenGL and Direct3D are two notable graphics pipeline models accepted as widespread industry standards.

The graphics pipeline contains 4 stages:
3 programmable stages → driven by the shaders
1 non-programmable stage → the rasterizer

Université de Mons
Vertex flow from the CPU to the GPU

Université de Mons
Pre-stage : Tessellation

Université de Mons
Stage 1: Vertex shader (programmable)

• Objects are transformed from 3D world-space coordinates into a 3D coordinate system based on the position and orientation of a virtual camera

• Used to add special effects to objects in a 3D environment

• Runs once for each vertex given to the GPU

• Can change a vertex's properties such as: position, color, texture coordinates, …

• One element in / one element out

• Cannot create new vertices
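
As a rough sketch (plain C, not real shader code; the names are illustrative), the core job of a vertex shader amounts to multiplying each vertex, in homogeneous coordinates, by a 4 × 4 transformation matrix:

    /* Illustrative sketch only: apply a 4x4 row-major matrix M to a vertex
       given in homogeneous coordinates (x, y, z, 1): one element in, one element out. */
    void transform_vertex(const float M[16], const float in[4], float out[4])
    {
        for (int row = 0; row < 4; ++row) {
            out[row] = M[4*row + 0] * in[0] + M[4*row + 1] * in[1]
                     + M[4*row + 2] * in[2] + M[4*row + 3] * in[3];
        }
    }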

Université de Mons
Stage 2: Geometry shader (programmable)

• One element in / 0 to ~100 elements out

• Can add and remove vertices

• Can be used to add volumetric detail (too costly for the CPU) or to refine the mesh size
  Ex: 20 triangles → 100 smaller triangles

• Displacement mapping

• The most recently introduced type of shader (not always present in the pipeline)

Mesh size = the fineness of the mesh (how small its polygons are)


Université de Mons
Stage 3: Rasterization (non-programmable) (1)
Vector image (vertices) → raster image (fragments)

• The most popular technique for producing real-time 3D computer graphics (faster than ray tracing)

• Projection of the polygons of the 3D scene onto a 2D grid of the size of the output image

• The output fragments have their final image coordinates

2D vector to raster
Polygon = set of triangles
Triangle = 3 vertices in 3D space
Université de Mons
Stage 3 : Rasterization (2)
The rasterization algorithm has at least 3 steps:

1. Calculation of the 2D coordinates (transformation)

2. Filtering of the vertices (clipping)

3. Rasterization itself (scan conversion)

4. Acceleration techniques (optional)

5. Further refinements

Université de Mons
Stage 3: Rasterization (3)

The rasterization algorithm has at least 3 steps:

1. Calculation of the 2D coordinates (transformation)

→ A set of mathematical transformations:

• Translation, scaling, rotation: to put the 3D figure at the desired location (e.g., the origin)
• Projection: from 3D to 2D (orthogonal projection (drop the z-component), perspective projection)

→ These operations are done by multiplying the vertex's augmented (homogeneous) coordinates by different matrices

Ex: translation matrix (see the sketch below)

Ex: a man turning his head
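
For illustration (the matrix itself is not reproduced in this transcription), a homogeneous translation matrix by (tx, ty, tz) acts on a vertex (x, y, z, 1) as follows:

    | 1 0 0 tx |   | x |   | x + tx |
    | 0 1 0 ty | * | y | = | y + ty |
    | 0 0 1 tz |   | z |   | z + tz |
    | 0 0 0 1  |   | 1 |   |   1    |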


Université de Mons
Stage 3: Rasterization (4)
2. Filtering of the vertices (clipping)
• The triangles' 2D vertex locations have been calculated, BUT they may lie outside of the window (the area of the screen where the pixels will be written)
• Clipping is the process of truncating triangles to fit them inside the viewing area.

3. Rasterization itself (scan conversion)
• Fill in, in pixels, the 2D triangles that are now in the image plane
• Example: treatment of a line (coordinates (1,1) to (5,1), color graded from blue to green); see the sketch after this list
→ will fill pixels (1,1), (2,1), (3,1), (4,1) and (5,1);
→ for each pixel, one has to determine its characteristics with a good balance:
(1,1) being totally blue, (2,1) less blue, (3,1) blue-green, …
• This is much more complicated for shapes like triangles, but the principle remains the same

• Difficulty: pixel aliasing
→ use of a Z-buffer to see which pixel is closest to the camera
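
A small C sketch of the line example above (illustrative only; set_pixel is a hypothetical helper):

    /* Fill pixels (x0,y)..(x1,y) with a blue-to-green gradient, as in the example. */
    void set_pixel(int x, int y, float r, float g, float b);   /* hypothetical image writer */

    void rasterize_horizontal_line(int x0, int x1, int y)
    {
        for (int x = x0; x <= x1; ++x) {
            float t = (x1 == x0) ? 0.0f : (float)(x - x0) / (float)(x1 - x0);
            set_pixel(x, y, 0.0f, t, 1.0f - t);   /* green grows, blue fades along the line */
        }
    }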

Université de Mons
Stage 3: Rasterization (5)

4. Acceleration techniques
I. Backface culling: determines whether a polygon of a graphical object is visible; if not (it shows its back to the camera) → cull it (see the sketch below)
II. Spatial data structures
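
A hedged C sketch of the backface-culling test (the sign and winding conventions are assumptions):

    /* A triangle can be culled when its normal points away from the camera,
       i.e. the dot product between the normal and the viewing direction is >= 0. */
    int is_backfacing(const float normal[3], const float view[3])
    {
        float d = normal[0]*view[0] + normal[1]*view[1] + normal[2]*view[2];
        return d >= 0.0f;   /* shows its back to the camera -> cull */
    }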

Université de Mons
Stage 4: Fragment shader (programmable)
Fragment shader (OpenGL) = pixel shader (Direct3D)

• Gives each pixel its final color (as a function of lighting, reflection or refraction of the light, …)
• Uses the biggest share of the computational resources
• Performs complex per-pixel effects and refinement techniques such as:
I. Texture filtering: to create clean images at any distance
II. Environment mapping: a form of texture mapping in which the texture coordinates are view-dependent, to simulate reflections on a shiny object
III. Shadows: traditionally not processed in the rasterizer → modern techniques

Université de Mons
Exit of the pipeline
• The fragment flow can:

→ either be written to a framebuffer and then displayed on the screen,

→ or, if it needs more processing, be written to a texture and then read back by the CPU.

Université de Mons
Summary

Université de Mons
The unified architecture arrived with the 6th generation of GPUs

Before: 2 types of processors in the GPU
→ vertex units
→ fragment units
This created a bottleneck when one type was overloaded → not optimal

Since the GeForce 8, processors are no longer specialized
→ optimal use of the pipeline: unified architecture

Université de Mons
GPU evolution through the different generations

Gen | Year | Nvidia | AMD/ATI | Particularities
1 | 96 | TNT2 | Rage | DirectX 6 = standard; rasterization of triangles and textures; limitation: no vertex processing; other vendor: 3dfx (Voodoo)
2 | 99 | GeForce 256 | Radeon 7500 | OpenGL supported; vertex processing supported
3 | 01 | GeForce 3 | Radeon 8500 | Nvidia buys 3dfx; vertex processing programmable
  | 02 | GeForce 4 | |
4 | 02 | GeForce FX | Radeon 9700 | Fragment processing programmable; first GPGPU operations
5 | 04 | GeForce 6 | Radeon X800 | Processing speed increases; GPGPU operations developed
  | 05 | GeForce 7 | Radeon X1800 |
6 | 06 | GeForce 8 | Radeon HD200 | Geometry shader appears; unified architecture
  | 07 | | Radeon HD300 | Nvidia creates the CUDA language
  | 08 | GeForce 9 | |
7 | 08 | GeForce 200 | Radeon HD400 | Not very widespread yet; technical improvements (frequency, memory, number of processors, bandwidth, …)
Université de Mons
Table of contents
1. History & Summary
2. GPU and 3D rendering
3. Architecture of a GPU
4. GPU programming
5. CUDA
6. Conclusion

Université de Mons
Architecture of a GPU

Université de Mons 42
Overview
A quick reminder:
→ architecture of a CPU
→ the CPU and its evolution
→ drawbacks
Architecture of a GPU:
→ needs
→ SIMD/MIMD
A short word on data management:
→ gathering/scattering and PRAM

Université de Mons 43
Architecture of a CPU
[Diagram: CPU layout showing the control unit, four ALUs, the cache and the DRAM]

Arithmetic Logic Unit (ALU), or calculation unit:
• handles all operations

Control unit:
• handles all instructions

Cache:
• fast memory access
• expensive
• present in high volume on a CPU

DRAM:
• dynamic random access memory
• cheap, but needs to be refreshed

Control → brain
ALU → hands
Memory → tools
Université de Mons 44
CPU processing
For a computer:
Program = several sequential instructions

Program code: Instruction 1 → Instruction 2 → Instruction 3 → …

A simple CPU is SISD (single instruction, single data):

• Instructions are computed one by one

• On a single data item at a time

Université de Mons 45
CPU and its evolution
At first: SISD
→ in-order processors
→ out-of-order processors (improved performance)
• instructions are dispatched to an instruction queue
• the results are queued
• the process is still sequential
→ high volume of cache memory
• needed for fast access to instructions and data
• lots of back-and-forth passes over the data

Université de Mons 46
Evolution and drawbacks
Evolution (Pentium 3)
→ SIMD (single instruction, multiple data)
→ better vector-processing performance

A CPU is perfect for sequential programs, but is weak for multimedia applications

→ Reasons:
• only a few back-and-forth passes over the data
• the complexity of the algorithms is low
• a high volume of cache memory and out-of-order execution are superfluous for multimedia applications

Université de Mons 47
Architecture of a GPU
A GPU is a SIMD processor
→ to be able to process a lot of data
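
As a minimal sketch of this idea (in CUDA, which is introduced later in these slides; names are illustrative), the same instruction stream is applied by many threads, each to its own data element:

    // One instruction stream, many data elements: each thread handles one element.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }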

Université de Mons 48
Needs of the GPU
A high memory bandwidth
→ about 10 × the CPU bandwidth, to process lots of data in real time

Université de Mons 49
Still to do
→ Discuss the new generation of GPUs
→ MIMD (multiple instruction, multiple data)
→ Compare MIMD and SIMD
→ Discuss data management
→ Gathering
→ Scattering
→ Discuss the PRAM model used in GPUs

Université de Mons 50
Table of contents
1. History & Summary
2. GPU and 3D rendering
3. Architecture of a GPU
4. GPU programming
5. CUDA
6. Conclusion

Université de Mons
CUDA
• CUDA (Compute Unified Device Architecture) is a development library created by NVIDIA in 2007.
• It makes it possible to use the power of a compatible graphics card for general-purpose computing.
• Programmers can use C, C++ or Fortran to develop applications using CUDA.
• Interfaces (wrappers) make it usable from high-level languages such as Java, .NET or Python.

Université de Mons 52
Different components of CUDA
• CUDA is made up of a set of software layers to communicate with the GPU: a Driver, a Runtime and a few libraries.

Université de Mons 53
CUDA Libraries
• They include the code of all the functions to be executed on the GPU.
• Using these libraries, developers can only use a set of predefined functions.
• They do not have direct access to the GPU itself.
• Examples:
• CUBLAS, which provides a set of building blocks for linear-algebra calculations on the GPU
• CUFFT, which handles the calculation of Fourier transforms
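
A hedged sketch of a CUBLAS call (it assumes the device arrays d_x and d_y were already allocated and filled; the GPU kernel itself is provided by the library):

    #include <cublas_v2.h>

    /* Sketch only: computes y = a*x + y on the GPU through CUBLAS. */
    void saxpy_with_cublas(int n, float a, const float *d_x, float *d_y)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);                       /* initialise the library   */
        cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);  /* single-precision a*x + y */
        cublasDestroy(handle);
    }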

Université de Mons 54
High Level API : CUDA Runtime
• Also called « C for CUDA »
• The high-level API is implemented “above” the low-level API: each call to a Runtime function is broken down into more basic instructions managed by the Driver API.
• The term “high-level API” is relative. Even the
Runtime API is still what a lot of people would
consider very low-level; yet it still offers functions
that are highly practical for initialization.
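
A minimal sketch of the usual Runtime API pattern on the host side (allocate, copy, launch, copy back); my_kernel is a hypothetical __global__ function:

    #include <cuda_runtime.h>

    __global__ void my_kernel(float *data, int n);   /* defined elsewhere (hypothetical) */

    void run_on_gpu(float *h_data, int n)
    {
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));                                 /* allocate on the device */
        cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);  /* copy host -> device    */
        my_kernel<<<(n + 255) / 256, 256>>>(d_data, n);                         /* launch the kernel      */
        cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);  /* copy the results back  */
        cudaFree(d_data);                                                       /* free the device memory */
    }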

Université de Mons 55
Low Level API : CUDA Driver
• The Driver API is more complex to manage; it
requires more work to launch processing on the
GPU.
• The upside is that it’s more flexible, giving the
programmer additional control.
• Note that the high-level and Low-level APIs are
mutually exclusive – the programmer must use one
or the other, but it’s not possible to mix function calls
from both.

Université de Mons 56
CUDA from the Hardware
Point of View
• Nvidia’s Shader Core is made up of several clusters Nvidia calls Texture
Processor Clusters.
• Each cluster is made up of a texture unit and 2 streaming multiprocessors.

Université de Mons 57
The streaming Multiprocessor
• These processors consist of a front end that reads/decodes and launches instructions, and a back end made up of a group of eight calculating units and two SFUs (Special Function Units), where the instructions are executed in SIMD fashion.
• The same instruction is applied to all the threads in the warp. Nvidia calls this mode of execution SIMT (single instruction, multiple threads).
• The back end operates at double the frequency of the front end.

Université de Mons 58
Streaming multiprocessors’
operating mode
• At each cycle, a warp ready for execution is
selected by the front end, which launches
execution of an instruction.
• To apply the instruction to all 32 threads in the
warp, the backend will take four cycles, but since it
operates at double the frequency of the front end,
from its point of view only two cycles will be
executed.
• To avoid having the front end remain unused for one cycle, the ideal is to alternate types of instructions every cycle: a classic instruction for one cycle and an SFU instruction for the other.

Université de Mons 59
Shared Memory
• Each multiprocessor has a small memory area called shared memory, with a size of 16 KB per multiprocessor.
• This memory area provides a way for the threads in the same block to communicate. All the threads in a given block are executed by the same multiprocessor.
• The assignment of blocks to the different multiprocessors is completely undefined, meaning that two threads from different blocks cannot communicate during their execution.

Université de Mons 60
Cache Memory - Registers
• To limit too-frequent access to the
shared memory, Nvidia has also
provided its multiprocessors with a
cache (approximately 8 KB per
multiprocessor) for access to constants
and textures.
• The multiprocessors also have 8,192 registers that are shared among all the threads of all the blocks active on that multiprocessor. The number of active blocks per multiprocessor cannot exceed eight, and the number of active warps is limited to 24 (768 threads).

Université de Mons 61
Optimizing a CUDA program
• Finding the optimum balance between the number of blocks and their size: more threads per block help to mask the latency of memory operations, but at the same time the number of registers available per thread is reduced.
• Blocks of 512 threads would be particularly inefficient, since only one such block can be active on a multiprocessor, potentially wasting 256 thread slots. So Nvidia advises using blocks of 128 to 256 threads, which offers the best compromise between masking latency and the number of registers needed for most kernels.
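
A rough worked example using the figures from the previous slides (8,192 registers and at most 768 active threads per multiprocessor):

With blocks of 256 threads: 3 blocks can be active (3 × 256 = 768 threads), leaving about 8,192 / 768 ≈ 10 registers per thread.
With blocks of 512 threads: only one block fits (2 × 512 = 1,024 > 768 active threads), so 768 - 512 = 256 of the available thread slots are wasted.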

Université de Mons 62
Definitions
• Host : CPU
• Device : GPU
• Kernel : Function executed
on the GPU
• Thread : basic element of the data
to be processed (very lightweight)
• Warp : group of 32 threads
• Block : set of 64 to 512 threads
• Grid : Array of blocks
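
A short illustrative CUDA sketch tying these terms together: the host launches a kernel over a grid of blocks of threads, and each thread computes its own index to find its element:

    // Kernel: executed on the device, one thread per array element (names are illustrative).
    __global__ void scale(float *data, int n, float factor)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's global index
        if (i < n)
            data[i] *= factor;
    }

    // Host side: blocks of 256 threads, enough blocks to cover all n elements.
    //   scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);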
Université de Mons
Definitions (2)


Université de Mons 64
CUDA from a Software
Point of View
CUDA = set of extensions to the C language
Type qualifiers for functions:
__global__ void function()
→ function called by the CPU, executed on the GPU
__device__ void function()
→ function called by and executed on the GPU
__host__ void function()
→ standard function (executed on the CPU)
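
A small sketch combining the three qualifiers (function names are illustrative):

    __device__ float square(float x)                 // called from and executed on the GPU
    {
        return x * x;
    }

    __global__ void square_all(float *data, int n)   // called by the CPU, executed on the GPU
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = square(data[i]);
    }

    __host__ void report(void)                       // standard function, executed on the CPU
    {
        /* ordinary CPU code */
    }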

Université de Mons 65
Software Point of View (2)
Restrictions on __device__ and __global__ functions:
1. They cannot be recursive
2. They must have a fixed number of arguments

Type qualifier for variables:

__shared__ variable
→ the variable will be stored in the multiprocessor's shared memory
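
A sketch of threads in one block cooperating through shared memory (it assumes blocks of 256 threads; the kernel name is illustrative):

    // Each block sums its 256 input elements using a buffer in shared memory.
    __global__ void block_sum(const float *in, float *block_results)
    {
        __shared__ float buffer[256];                 // lives in the multiprocessor's shared memory
        int tid = threadIdx.x;

        buffer[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();                              // wait until the whole block has written

        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                buffer[tid] += buffer[tid + stride];
            __syncthreads();
        }
        if (tid == 0)
            block_results[blockIdx.x] = buffer[0];    // one partial sum per block
    }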
Université de Mons 66
Compilation
1. The CPU code is extracted and handed to the standard compiler
2. The GPU code is converted into PTX code (assembly code) and scanned for inefficiencies
3. The PTX is translated into GPU-specific commands that are encapsulated in the executable
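
For illustration, the usual entry point to this tool chain is the nvcc compiler driver (the file names are examples):

    nvcc -o program program.cu    # splits host and device code and produces an executable
    nvcc -ptx program.cu          # stops after generating the PTX assembly, for inspection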

Université de Mons 67
A few application examples

Université de Mons 68
ATI equivalent to Nvidia’s CUDA

Université de Mons 69
