University of Chinese Academy of Sciences, School of Computer Science — elective course for master's students
GPU Architecture and Programming
Lecture 4: Floating-Point Considerations
Zhao Di
Institute of Computing Technology, Chinese Academy of Sciences
Fall 2023
Lecture Outline
- Floating-Point Precision and Accuracy
- Numerical Stability
Normalized Representation
- Requiring 1.0B ≤ M < 10.0B makes the mantissa value of each floating-point number unique.
- For example, the only mantissa value allowed for 0.5D is M = 1.0:
  0.5D = 1.0B × 2^-1
- Neither 10.0B × 2^-2 nor 0.1B × 2^0 qualifies.
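The same uniqueness can be checked directly on the IEEE-754 single-precision encoding. A minimal Python sketch (not part of the slides), unpacking the bit pattern of 0.5:

```python
import struct

# Unpack the IEEE-754 single-precision bit pattern of 0.5
bits = struct.unpack(">I", struct.pack(">f", 0.5))[0]
sign     = bits >> 31            # 0: positive
exponent = (bits >> 23) & 0xFF   # stored (biased) exponent
mantissa = bits & 0x7FFFFF       # explicit fraction bits (the leading 1. is implicit)

# 0.5 = 1.0B * 2^-1, so the stored exponent is -1 + 127 = 126
# and all explicit mantissa bits are zero
print(sign, exponent, mantissa)
```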
Exponent Representation
- In an n-bit exponent representation, 2^(n-1) - 1 is added to the two's-complement representation to form the excess representation.
- The table below shows a 3-bit exponent representation.
- A simple unsigned integer comparator can then be used to compare the magnitudes of two FP numbers.
- The range is symmetric for positive and negative exponents (111 is reserved).

  2's complement    Actual decimal        Excess-3
  000               0                     011
  001               1                     100
  010               2                     101
  011               3                     110
  100               (reserved pattern)    111
  101               -3                    000
  110               -2                    001
  111               -1                    010

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.
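The excess encoding is just an added bias, which the table can be reproduced from. A small Python sketch (an illustration, not from the slides):

```python
def to_excess(e, n=3):
    # Add the bias 2^(n-1) - 1 to a two's-complement exponent value
    return e + (2 ** (n - 1) - 1)

# Reproduce the 3-bit table: -3..3 map to 000..110 (111 is reserved)
table = {e: format(to_excess(e), "03b") for e in range(-3, 4)}

# Because the mapping is monotonic, unsigned comparison of the excess
# patterns orders the exponents correctly
assert table[-3] == "000" and table[0] == "011" and table[3] == "110"
```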
Representable Numbers
Flush to Zero
- Treat all bit patterns with E = 0 as 0.0.
- This takes away several representable numbers near zero and lumps them all into 0.0.
- For a representation with a large M, a large number of representable numbers is removed.
(Figure: representable numbers near zero under the no-zero, flush-to-zero, and denormalized schemes.)
Denormalized Numbers
- The method actually adopted by the IEEE standard is called "denormalized numbers" or "gradual underflow".
- It relaxes the normalization requirement for numbers very close to 0.
- Whenever E = 0, the mantissa is no longer assumed to be of the form 1.XX; rather, it is assumed to be 0.XX. In general, if the n-bit exponent is 0, the value is 0.M × 2^(-2^(n-1) + 2).
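The formula can be checked against real single precision (n = 8 exponent bits, 23 fraction bits). A Python sketch, added as an illustration:

```python
import struct

# With E = 0 the value is 0.M * 2^(-2^(n-1) + 2).  For single precision
# (n = 8), the exponent factor is 2^(-2^7 + 2) = 2^-126, and the smallest
# nonzero mantissa is 0.00...1B = 2^-23, giving 2^-149 overall.
smallest = struct.unpack(">f", (1).to_bytes(4, "big"))[0]  # bit pattern 0x00000001
assert smallest == 2.0 ** -23 * 2.0 ** (-2 ** 7 + 2) == 2.0 ** -149
```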
Denormalization
(Figure: representable numbers near zero under the no-zero, flush-to-zero, and denormalized schemes.)
- Double precision: 1-bit sign, 11-bit exponent (excess-1023), 52-bit fraction.
- The largest error for representing a number is reduced to 1/2^29 of that of the single precision representation.
Exponent is 00 → denorm
- The hardware needs to shift the mantissa bits in order to align the correct bits with equal place value.
- The ideal result would be 1.001 × 2^1 (0, 10, 001), but this would require 3 mantissa bits!
Error Measure
- If a hardware adder has at least two more bit positions than the total (both implicit and explicit) number of mantissa bits, the error will never be more than half of the place value of the mantissa:
  - 0.001 in our 5-bit format.
Algorithm Considerations
- Sequential sum:
  1.00×2^0 + 1.00×2^0 + 1.00×2^-2 + 1.00×2^-2
  = 1.00×2^1 + 1.00×2^-2 + 1.00×2^-2
  = 1.00×2^1 + 1.00×2^-2
  = 1.00×2^1
- Parallel reduction:
  (1.00×2^0 + 1.00×2^0) + (1.00×2^-2 + 1.00×2^-2)
  = 1.00×2^1 + 1.00×2^-1
  = 1.01×2^1
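The same effect is easy to reproduce in IEEE double precision. A Python sketch (an illustration using 2^-52 instead of the slide's 2^-2, so that rounding actually occurs in 52-bit arithmetic):

```python
# Adding two values of 2^-52 to 2.0 one at a time loses them to rounding,
# while pairwise (parallel-reduction style) addition preserves their sum.
terms = [1.0, 1.0, 2.0 ** -52, 2.0 ** -52]

sequential = ((terms[0] + terms[1]) + terms[2]) + terms[3]
parallel   = (terms[0] + terms[1]) + (terms[2] + terms[3])

assert sequential == 2.0                  # small terms absorbed one by one
assert parallel == 2.0 + 2.0 ** -51       # pairwise partial sum survives
```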
Lecture Outline
- Floating-Point Precision and Accuracy
- Numerical Stability
Numerical Stability
- Linear system solvers may require different orderings of floating-point operations for different input values in order to find a solution.
- An algorithm that can always find an appropriate operation order, and thus a solution to the problem, is a numerically stable algorithm.
- An algorithm that falls short is numerically unstable.
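A classic instance of such reordering is partial pivoting in Gaussian elimination. A minimal 2×2 Python sketch (an illustration, not from the slides):

```python
def solve2(A, b):
    """Solve a 2x2 system by Gaussian elimination with partial pivoting:
    reorder the rows so the pivot has the larger magnitude, which avoids
    dividing by a (near-)zero pivot."""
    if abs(A[0][0]) < abs(A[1][0]):
        A[0], A[1] = A[1], A[0]
        b[0], b[1] = b[1], b[0]
    f = A[1][0] / A[0][0]           # eliminate the sub-diagonal entry
    a22 = A[1][1] - f * A[0][1]
    y = (b[1] - f * b[0]) / a22     # back-substitute
    x = (b[0] - A[0][1] * y) / A[0][0]
    return x, y

# Without the row swap this system hits a zero pivot immediately;
# with pivoting it solves cleanly: x = 1, y = 1
solution = solve2([[0.0, 1.0], [1.0, 1.0]], [1.0, 2.0])
```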
University of Chinese Academy of Sciences, School of Computer Science — elective course for master's students
GPU Architecture and Programming
Lecture 4: GPU as Part of the PC Architecture
Zhao Di
Institute of Computing Technology, Chinese Academy of Sciences
Fall 2023
Lecture Outline
- GPU as Part of the PC Architecture
Review: Typical Structure of a CUDA Program
- Global variables declaration
- Function prototypes
  - __global__ void kernelOne(…)
- main()
  - allocate memory space on the device: cudaMalloc(&d_GlblVarPtr, bytes)
  - transfer data from host to device: cudaMemcpy(d_GlblVarPtr, h_Gl…)
  - (repeat as needed)
    - execution configuration setup as needed
    - kernel call: kernelOne<<<execution configuration>>>(args…);
    - transfer results from device to host: cudaMemcpy(h_GlblVarPtr, …)
    - optional: compare against golden (host-computed) solution
- Kernel: void kernelOne(type args, …)
  - variables declaration: __local__, __shared__
    - automatic variables transparently assigned to registers or local memory
  - __syncthreads() …
Bandwidth: Gravity of Modern Computer Systems
PCIe PC Architecture
- PCIe forms the interconnect backbone.
- Northbridge and Southbridge are both PCIe switches.
- Some Southbridge designs have a built-in PCI-PCIe bridge to allow old PCI cards.
- Some PCIe I/O cards are PCI cards with a PCI-PCIe bridge.
- Source: Jon Stokes, PCI Express: An Overview, http://arstechnica.com/articles/paedia/hardware/pcie.ars
PCIe 3
- A total of 8 Giga Transfers per second in each direction.
- No more 8b/10b encoding; instead a polynomial transformation (scrambling) at the transmitter and its inverse at the receiver achieves the same effect.
- So the effective bandwidth is roughly double that of PCIe 2.
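The "roughly double" claim follows from the published line rates and encodings (these numbers are the PCIe spec values, not from the slides). A quick per-lane arithmetic check:

```python
# PCIe 2: 5 GT/s per lane with 8b/10b encoding (8 payload bits per 10 sent)
# PCIe 3: 8 GT/s per lane with 128b/130b framing
gen2_bytes_per_s = 5e9 * (8 / 10) / 8     # 500 MB/s per lane
gen3_bytes_per_s = 8e9 * (128 / 130) / 8  # ~985 MB/s per lane

ratio = gen3_bytes_per_s / gen2_bytes_per_s  # ~1.97, i.e. roughly double
```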
Pinned Memory
- DMA uses physical addresses.
- The OS could accidentally page out the data that is being read or written by a DMA and page another virtual page into the same location.
- Pinned memory cannot be paged out.
- If a source or destination of a cudaMemcpy() in host memory is not pinned, it needs to be first copied to a pinned buffer: extra overhead.
- cudaMemcpy() is much faster with a pinned host memory source or destination.
- cudaFreeHost()
  - One parameter: pointer to the memory to be freed.
Important Trends
University of Chinese Academy of Sciences, School of Computer Science — elective course for master's students
GPU Architecture and Programming
Lecture 4: Efficient Host-Device Data Transfer
Zhao Di
Institute of Computing Technology, Chinese Academy of Sciences
Fall 2023
Lecture Outline
- Pinned Host Memory
- Task Parallelism in CUDA
- Overlapping Data Transfer with Computation
- CUDA Unified Memory
(Figure: DMA transfers between host memory and GPU global memory over PCIe; the DMA engine sits on the GPU card, as on other I/O cards.)
Lecture Outline
- Pinned Host Memory
- Task Parallelism in CUDA
- Overlapping Data Transfer with Computation
- CUDA Unified Memory
Device Overlap
- Some CUDA devices support device overlap: simultaneously executing a kernel while copying data between device and host memory.

int dev_count;
cudaDeviceProp prop;
cudaGetDeviceCount(&dev_count);
for (int i = 0; i < dev_count; i++) {
    cudaGetDeviceProperties(&prop, i);
    if (prop.deviceOverlap) …
}
CUDA Streams
- CUDA supports parallel execution of kernels and cudaMemcpy() with "streams".
- Each stream is a queue of operations (kernel launches and cudaMemcpy() calls).
- Operations (tasks) in different streams can go in parallel: "task parallelism".
Streams
- Requests made from the host code are put into First-In-First-Out queues.
- Queues are read and processed asynchronously by the driver and device.
- The driver ensures that commands in a queue are processed in sequence, e.g., memory copies end before the kernel launch, etc.
(Figure: the host thread enqueues cudaMemcpy(), kernel launch, device sync, and cudaMemcpy() into a FIFO that the device driver reads.)
Streams (cont.)
- To allow concurrent copying and kernel execution, use multiple queues, called "streams".
- CUDA "events" allow the host thread to query and synchronize with individual queues (i.e., streams).
(Figure: the host thread feeds Stream 0 and Stream 1, with events, through the device driver to the PCIe up and down engines.)
Lecture Outline
- Pinned Host Memory
- Task Parallelism in CUDA
- Overlapping Data Transfer with Computation
- CUDA Unified Memory
(Figure: Stream 0 and Stream 1 operations scheduled onto the copy engine and the kernel engine, with PCIe up and down transfers overlapping kernel execution.)
Hyper Queues
- Provide multiple queues for each engine.
- Allow more concurrency by allowing some streams to make progress for an engine while others are blocked.
(Figure: streams A--B--C, P--Q--R, and X--Y--Z each map onto their own hardware work queue instead of being serialized into a single queue.)
Lecture Outline
- Pinned Host Memory
- Task Parallelism in CUDA
- Overlapping Data Transfer with Computation
- CUDA Unified Memory
- cudaMemAdviseSetReadMostly:
  - This tells the UM driver that the memory region is mostly for reading and only occasionally for writing.
  - All read accesses from a processor will create a read-only copy to be accessed.
  - The device argument is ignored.
  - When used in conjunction with cudaMemPrefetchAsync(), a read-only copy will be created on the specified device.
- cudaMemAdviseSetPreferredLocation:
  - However, this does not cause immediate data migration to the location; it only guides the migration policy of the UM driver.
  - If the destination processor is a GPU, then that GPU needs a nonzero value for the device attribute cudaDevAttrConcurrentManagedAccess.
University of Chinese Academy of Sciences, School of Computer Science — elective course for master's students
GPU Architecture and Programming
Lecture 4: Computational Thinking
Zhao Di
Institute of Computing Technology, Chinese Academy of Sciences
Fall 2023
Lecture Outline
- Introduction to Computational Thinking
Data Sharing
- Data sharing can be a double-edged sword.
- Excessive data sharing drastically reduces the advantage of parallel execution.
- Localized sharing can improve memory bandwidth efficiency.

Synchronization
- Synchronization == control sharing.
- Barriers make threads wait until all threads catch up.
- Waiting is a lost opportunity for work.
- Atomic operations may reduce waiting; watch out for serialization.
Master/Worker, Shared Queue, Loop Parallelism, Distributed Array, Fork/Join: these are not necessarily mutually exclusive.
Program Models
- SPMD (Single Program, Multiple Data)
  - All PEs (Processing Elements) execute the same program in parallel, but each has its own data.
  - Each PE uses a unique ID to access its portion of the data.
  - Different PEs can follow different paths through the same code.
  - This is essentially the CUDA Grid model (also OpenCL, MPI).
  - SIMD is a special case: warps are used for efficiency.
- Master/Worker
- Loop Parallelism
- Fork/Join
Program Models
- SPMD (Single Program, Multiple Data)
- Master/Worker (OpenMP, OpenACC, TBB)
  - A master thread sets up a pool of worker threads and a bag of tasks.
  - Workers execute concurrently, removing tasks until done.

Algorithm Structure
(Figure: decision flow for choosing an algorithm structure, beginning at "Start".)
More on SPMD
- Dominant coding style of scalable parallel computing
  - MPI code is mostly developed in SPMD style.
  - Much OpenMP code is also SPMD (next to loop parallelism).
  - Particularly suitable for algorithms based on task parallelism and geometric decomposition.
- Main advantage
  - Tasks and their interactions are visible in one piece of source code; no need to correlate multiple sources.
- Distribute data
  - Decompose global data into chunks and localize them, or
  - Share/replicate major data structures, using the thread ID to associate a subset of the data with each thread.
- Finalize
  - Reconcile global data structures; prepare for the next major iteration.
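The "decompose global data into chunks and localize them" step can be sketched as a pure ID-to-slice computation, the SPMD idiom used by CUDA threads and MPI ranks alike. A Python illustration (names are hypothetical):

```python
def my_chunk(pe_id, num_pes, n):
    # Geometric decomposition: each PE derives its slice of 0..n-1 from its
    # unique ID; remainders are spread over the first few PEs.
    base, extra = divmod(n, num_pes)
    start = pe_id * base + min(pe_id, extra)
    end = start + base + (1 if pe_id < extra else 0)
    return start, end

# Three PEs cover 10 items without gaps or overlap
chunks = [my_chunk(i, 3, 10) for i in range(3)]
```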
Future Studies
- More complex data structures
- More scalable algorithms and building blocks
- More scalable math models
- Thread-aware approaches: more available parallelism
- Locality-aware approaches: computing is becoming bigger, and everything is further away
University of Chinese Academy of Sciences, School of Computer Science — elective course for master's students
GPU Architecture and Programming
Lecture 4: Dynamic Parallelism
Zhao Di
Institute of Computing Technology, Chinese Academy of Sciences
Fall 2023
Lecture Outline
- In-depth study of Dynamic Parallelism
Dynamic Parallelism
University of Chinese Academy of Sciences, School of Computer Science — elective course for master's students
GPU Architecture and Programming
Lecture 4: CUDA Libraries
Zhao Di
Institute of Computing Technology, Chinese Academy of Sciences
Fall 2023
Lecture Outline
- cuBLAS
- cuSOLVER
- cuFFT
- Thrust
- cuBLAS uses column-major order with 1-based indexing for compatibility with Fortran numeric libraries.
(Figure: matrix A laid out in column-major order with 1-based indexing, and the corresponding column-major addressing.)
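Column-major, 1-based addressing reduces to a single offset formula. A Python sketch of the index arithmetic (an illustration; `ld` is the leading dimension, as in cuBLAS calls):

```python
def col_major_offset(i, j, ld):
    # Flat 0-based offset of element (i, j), with 1-based Fortran-style
    # indices, in a column-major matrix whose leading dimension is ld
    return (j - 1) * ld + (i - 1)

# For a 3x2 matrix stored with ld = 3, each column is contiguous:
# (1,1)->0, (2,1)->1, (3,1)->2, then (1,2)->3, ...
first_column = [col_major_offset(i, 1, 3) for i in (1, 2, 3)]
```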
- Example:
  cublasGetVector(2, sizeof(float), A, 2, y, 1);
Lecture Outline
- cuBLAS
- cuSOLVER
- cuFFT
- Thrust
cuSOLVER
- cuSOLVER is a linear algebra library with LAPACK-like features, focused on:
  - Direct factorization methods for dense matrices
  - Linear system solving methods for dense and sparse matrices
  - Eigenvalue problems for dense and sparse matrices
  - Refactorization techniques for sparse matrices
  - Least squares problems for sparse matrices
- In cases where the sparsity pattern produces low GPU utilization, cuSOLVER provides CPU routines to handle those sparse matrices.
- Eigenvalue problems:
  - gesvd for SVD decomposition
  - syevd for eigenvalue decomposition
  - sygvd for generalized eigenvalue decomposition
Lecture Outline
- cuBLAS
- cuSOLVER
- cuFFT
- Thrust
The discrete Fourier transform:

  X_k = Σ_{n=0}^{N-1} x_n · e^(-iωkn),  with ω = 2π/N
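The formula above can be evaluated directly (in O(N²), unlike the FFT cuFFT implements). A Python sketch added for illustration:

```python
import cmath

def dft(x):
    # Direct evaluation of X_k = sum_n x_n * exp(-i * (2*pi/N) * k * n)
    N = len(x)
    w = 2 * cmath.pi / N
    return [sum(x[n] * cmath.exp(-1j * w * k * n) for n in range(N))
            for k in range(N)]

# The DFT of an impulse is flat: every X_k equals 1
X = dft([1.0, 0.0, 0.0, 0.0])
```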
- Plans need to be created before executing the transform and destroyed after the user is done using them:
  - cufftCreate, cufftMakePlan1d, cufftMakePlan2d, cufftMakePlan3d, cufftMakePlanMany, cufftMakePlanMany64
  - Aimed at customization and extensibility
- For simplicity, the rest of the slides will use real and complex to refer to cufftReal/cufftDoubleReal and cufftComplex/cufftDoubleComplex respectively.
- For memory size considerations it is important to note that:
  Transform    Input size                     Output size
  C2C          N × sizeof(complex)            N × sizeof(complex)
  C2R          (N/2 + 1) × sizeof(complex)    N × sizeof(real)
  R2C          N × sizeof(real)               (N/2 + 1) × sizeof(complex)

- The sizes of C2R and R2C come from the transform needing to hold at most N/2 + 1 complex numbers.
- In R2C, only the first N real entries are filled with input data; the rest of the array is allocated for the output.
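The table reduces to simple arithmetic; the N/2 + 1 comes from the conjugate symmetry of a real signal's spectrum. A Python sketch of the single-precision sizes (an illustration; `cufft_sizes` is a hypothetical helper, not a cuFFT API):

```python
def cufft_sizes(N, kind, real=4, cplx=8):
    # (input bytes, output bytes) for single-precision cuFFT transforms.
    # R2C/C2R only need N//2 + 1 complex values because the spectrum of a
    # real-valued signal is conjugate-symmetric.
    if kind == "C2C":
        return N * cplx, N * cplx
    if kind == "R2C":
        return N * real, (N // 2 + 1) * cplx
    if kind == "C2R":
        return (N // 2 + 1) * cplx, N * real
    raise ValueError(kind)

sizes = {k: cufft_sizes(8, k) for k in ("C2C", "R2C", "C2R")}
```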
- cufftResult: the type used by cuFFT for reporting errors.
- Every cuFFT call, e.g. cufftMakePlan*(), returns an error status of type cufftResult.
cufftHandle plan;
Lecture Outline
- cuBLAS
- cuSOLVER
- cuFFT
- Thrust
C++ STL
- The C++ Standard Template Library provides a set of type-agnostic features as a way of simplifying C++ programming. Some examples of these features are:
  - Type-agnostic data structures: lists, maps, vectors, etc.
  - Type-agnostic algorithms: sort, fill, max, etc.
- Example:

#include <algorithm>
#include <vector>
…
// Creates int vector with 10 elements
std::vector<int> a(10);
// Fill vector with 10 9 8 … 1
for (int i = 0; i < a.size(); ++i)
    a[i] = a.size() - i;
Thrust
- Thrust is a C++ template library written for CUDA devices and based on the C++ STL, with the intention of simplifying certain aspects of the CUDA programming model.
- Container features:
  - Interoperability between host_vector<T> and device_vector<T>, specifically:
    - Memory transfers using the assignment operator and constructors
  - Interoperability with STL containers through iterators
  - API similar to the STL std::vector
  - Dynamic resizing of the container
#include <thrust/device_vector.h>
#include <list>

std::list<float> stl_list;
stl_list.push_back(3.14);
stl_list.push_back(2.71);
stl_list.push_back(0.);
// Initialize thrust device vector with the STL list containing {3.14, 2.71, 0.}
thrust::device_vector<float> d_vec(stl_list.begin(), stl_list.end());
Thrust Algorithms
- Provides common parallel algorithms like:
  - Reductions
  - Sorting
  - Reorderings
  - Prefix sums
  - Transformations
- All Thrust algorithms have host and device implementations.
- If an algorithm is invoked with an iterator, Thrust will execute the operation on the host or device according to the iterator's memory region.
- Except for thrust::copy, which can copy data between host and device, all other routines must have all their arguments reside in the same place.
Thrust Transformations
- Transformations apply a function to each element in the input range and store the result in the output.
  - thrust::transform applies a function out-of-place to each element in a data range.
  - thrust::transform_if applies a function out-of-place if a condition is met.
  - thrust::for_each applies a unary function in place.
- To use Thrust transformation algorithms, the user needs to #include <thrust/transform.h>.

#include <thrust/transform.h>
…
struct saxpy_functional {
    __host__ __device__ float operator()(const float &x, const float &y) {
        return 2.0f * x + y;  // functor body reconstructed from the comment below
    }
};
…
thrust::device_vector<float> a(1024);
thrust::device_vector<float> b(1024);
thrust::device_vector<float> c(1024);
// Initialize a and b
saxpy_functional saxpy;
// Perform c[i] = 2 * a[i] + b[i]
thrust::transform(a.begin(), a.end(), b.begin(), c.begin(), saxpy);
Lecture Outline
- Walkthrough of Course Assignment 1, Problem 2
Overview
- Using CUDA, implement CNN inference and training (bonus) and test it on the MNIST dataset (URL: https://github.com/zalandoresearch/fashion-mnist); calling NVIDIA libraries is not allowed.
- Tuning: compared with the NVIDIA libraries, how close does your implementation get?
- Deadline: the midterm exam (October 25); multiple submissions are allowed before the deadline.
- Each student submits and is graded individually.
- Reference code: https://github.com/KiveeDong/LeNet5-Matlab

Problem 2 (bonus): Training
- Problem 2: using CUDA, implement CNN training; the specific implementation approach is not restricted (the network must be the same as in Problem 1).
- Modules to include: gradient descent, other model components related to the network, etc.
- Dataset: must be the same as in Problem 1.
- Training code: must be submitted and will be assessed.
- Metrics: accuracy of running the Problem 1 inference code, and running time.
- Assessment platform: A100.
- Assessors: teaching assistants.
- Grading:
  - Correct functionality: 10 points (full marks).
Training by SGD
- Key difficulty: back-propagating gradients through the fully connected layer, convolution layer, pooling layer, and activation layer.
- Reference material:
  - CMU 10-301 + 10-601 Introduction to Machine Learning: https://www.cs.cmu.edu/~mgormley/courses/10601/
  - CMU 11-785 Introduction to Deep Learning: https://deeplearning.cs.cmu.edu/F23/

Training by SGD
- Initialize: w = 0
- For epoch = 1, 2, …, I:
  - For ⟨x, y⟩ ∈ D:
    - w ← w − η ∇_w L
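The update loop above can be sketched on a toy one-parameter least-squares model (a Python illustration only; the assignment requires a CUDA implementation for a CNN):

```python
def sgd(data, lr=0.5, epochs=100):
    # Stochastic gradient descent on a one-parameter model w*x:
    # per-sample loss L(w) = 0.5 * (w*x - y)^2, gradient dL/dw = (w*x - y) * x
    w = 0.0                       # initialize: w = 0
    for _ in range(epochs):       # for epoch = 1..I
        for x, y in data:         # for <x, y> in D
            w -= lr * (w * x - y) * x   # w <- w - eta * grad_w L
    return w

# A single sample (x = 1, y = 2): w converges to 2
w = sgd([(1.0, 2.0)])
```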