
Elective Course for Master's Students, School of Computer Science, University of Chinese Academy of Sciences

GPU Architecture and Programming
Lecture 4: Floating-Point Considerations
Zhao Di
Institute of Computing Technology, Chinese Academy of Sciences
Fall Semester 2023

Lecture Contents
ØFloating-Point Precision and Accuracy
ØNumerical Stability


What is the IEEE floating-point format?

– An industry-wide standard for representing floating-point numbers, ensuring that hardware from different vendors generates results that are consistent with each other
– A floating-point binary number consists of three parts:
  – sign (S), exponent (E), and mantissa (M)
  – Each (S, E, M) pattern uniquely identifies a floating-point number
– For each bit pattern, its IEEE floating-point value is derived as:
  – value = (-1)^S * M * 2^E, where 1.0_B ≤ M < 10.0_B
– The interpretation of S is simple: S=0 results in a positive number and S=1 a negative number
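
To make the (S, E, M) decomposition concrete, here is a minimal C sketch (an illustration added here, not part of the original slides) that extracts the three fields from an IEEE-754 single-precision value:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float f = 0.5f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);        /* reinterpret the float's bit pattern */

    unsigned s = bits >> 31;               /* 1-bit sign */
    unsigned e = (bits >> 23) & 0xFFu;     /* 8-bit biased exponent (excess-127) */
    unsigned m = bits & 0x7FFFFFu;         /* 23-bit fraction; the leading 1 is implied */

    /* For normalized numbers: value = (-1)^S * 1.M * 2^(E-127) */
    printf("S=%u  E=%u (unbiased %d)  fraction=0x%06X\n", s, e, (int)e - 127, m);
    return 0;   /* prints: S=0  E=126 (unbiased -1)  fraction=0x000000 */
}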
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under the Creative Commons Attribution-NonCommercial 4.0 International License.

Normalized Representation

– Specifying that 1.0_B ≤ M < 10.0_B makes the mantissa value for each floating-point number unique.
  – For example, the only mantissa value allowed for 0.5_D is M = 1.0:
    0.5_D = 1.0_B * 2^-1
  – Neither 10.0_B * 2^-2 nor 0.1_B * 2^0 qualifies
– Because all mantissa values are of the form 1.XX…, one can omit the "1." part in the representation.
  – The mantissa value of 0.5_D in a 2-bit mantissa is 00, which is derived by omitting the "1." from 1.00.
  – The mantissa without the implied 1 is called the fraction.


Exponent Representation

– In an n-bit exponent representation, 2^(n-1) - 1 is added to the 2's complement representation to form the excess representation.
  – See the table below for a 3-bit exponent representation
– A simple unsigned integer comparator can then be used to compare the magnitude of two FP numbers
– Symmetric range for +/- exponents (excess pattern 111 reserved)

2's complement    Actual decimal        Excess-3
000               0                     011
001               1                     100
010               2                     101
011               3                     110
100               (reserved pattern)    111
101               -3                    000
110               -2                    001
111               -1                    010

A simple, hypothetical 5-bit FP format

● Assume 1-bit S, 2-bit E, and 2-bit M
● 0.5_D = 1.00_B * 2^-1
● 0.5_D = 0 00 00, where S = 0, E = 00, and M = (1.)00

2's complement    Actual decimal        Excess-1
00                0                     01
01                1                     10
10                (reserved pattern)    11
11                -1                    00


Representable Numbers

● The representable numbers of a given format is the set of all numbers that can be exactly represented in the format.
● See the table below for the representable numbers of an unsigned 3-bit integer format:

000    0
001    1
010    2
011    3
100    4
101    5
110    6
111    7

Representable Numbers of a 5-bit Hypothetical IEEE Format

E    M     Non-zero (S=0 / S=1)              Abrupt underflow (S=0 / S=1)    Gradual underflow (S=0 / S=1)
00   00    2^-1 / -(2^-1)                    0 / 0                           0 / 0
00   01    2^-1+1*2^-3 / -(2^-1+1*2^-3)      0 / 0                           1*2^-2 / -1*2^-2
00   10    2^-1+2*2^-3 / -(2^-1+2*2^-3)      0 / 0                           2*2^-2 / -2*2^-2
00   11    2^-1+3*2^-3 / -(2^-1+3*2^-3)      0 / 0                           3*2^-2 / -3*2^-2
01   00    2^0 / -(2^0)                      (same as Non-zero)              (same as Non-zero)
01   01    2^0+1*2^-2 / -(2^0+1*2^-2)        (same)                          (same)
01   10    2^0+2*2^-2 / -(2^0+2*2^-2)        (same)                          (same)
01   11    2^0+3*2^-2 / -(2^0+3*2^-2)        (same)                          (same)
10   00    2^1 / -(2^1)                      (same)                          (same)
10   01    2^1+1*2^-1 / -(2^1+1*2^-1)        (same)                          (same)
10   10    2^1+2*2^-1 / -(2^1+2*2^-1)        (same)                          (same)
10   11    2^1+3*2^-1 / -(2^1+3*2^-1)        (same)                          (same)
11   --    Reserved pattern

(The three modes differ only for E = 00.)


Flush to Zero

– Treat all bit patterns with E=0 as 0.0
  – This takes away several representable numbers near zero and lumps them all into 0.0
  – For a representation with a large M, a large number of representable numbers will be removed

Flush to Zero

E    M     No-zero (S=0 / S=1)               Flush to Zero (S=0 / S=1)       Denormalized (S=0 / S=1)
00   00    2^-1 / -(2^-1)                    0 / 0                           0 / 0
00   01    2^-1+1*2^-3 / -(2^-1+1*2^-3)      0 / 0                           1*2^-2 / -1*2^-2
00   10    2^-1+2*2^-3 / -(2^-1+2*2^-3)      0 / 0                           2*2^-2 / -2*2^-2
00   11    2^-1+3*2^-3 / -(2^-1+3*2^-3)      0 / 0                           3*2^-2 / -3*2^-2
01   00    2^0 / -(2^0)                      (same as No-zero)               (same as No-zero)
01   01    2^0+1*2^-2 / -(2^0+1*2^-2)        (same)                          (same)
01   10    2^0+2*2^-2 / -(2^0+2*2^-2)        (same)                          (same)
01   11    2^0+3*2^-2 / -(2^0+3*2^-2)        (same)                          (same)
10   00    2^1 / -(2^1)                      (same)                          (same)
10   01    2^1+1*2^-1 / -(2^1+1*2^-1)        (same)                          (same)
10   10    2^1+2*2^-1 / -(2^1+2*2^-1)        (same)                          (same)
10   11    2^1+3*2^-1 / -(2^1+3*2^-1)        (same)                          (same)
11   --    Reserved pattern


Why is flushing to zero problematic?

– Many physical model calculations work on values that are very close to zero
  – Dark (but not totally black) sky in movie rendering
  – Small distance fields in electrostatic potential calculation
  – …
– Without denormalization, these calculations tend to create artifacts that compromise the integrity of the models

Denormalized Numbers

– The actual method adopted by the IEEE standard is called "denormalized numbers" or "gradual underflow".
– The method relaxes the normalization requirement for numbers very close to 0.
– Whenever E=0, the mantissa is no longer assumed to be of the form 1.XX; rather, it is assumed to be 0.XX. In general, if the n-bit exponent is 0, the value is 0.M * 2^(-2^(n-1) + 2).
  – In our 5-bit format (n = 2), E=00 gives 0.M * 2^0, so (0, 00, 01) represents 0.01_B = 1*2^-2.


Denormalization

E    M     No-zero (S=0 / S=1)               Flush to Zero (S=0 / S=1)       Denormalized (S=0 / S=1)
00   00    2^-1 / -(2^-1)                    0 / 0                           0 / 0
00   01    2^-1+1*2^-3 / -(2^-1+1*2^-3)      0 / 0                           1*2^-2 / -1*2^-2
00   10    2^-1+2*2^-3 / -(2^-1+2*2^-3)      0 / 0                           2*2^-2 / -2*2^-2
00   11    2^-1+3*2^-3 / -(2^-1+3*2^-3)      0 / 0                           3*2^-2 / -3*2^-2
01   00    2^0 / -(2^0)                      (same as No-zero)               (same as No-zero)
01   01    2^0+1*2^-2 / -(2^0+1*2^-2)        (same)                          (same)
01   10    2^0+2*2^-2 / -(2^0+2*2^-2)        (same)                          (same)
01   11    2^0+3*2^-2 / -(2^0+3*2^-2)        (same)                          (same)
10   00    2^1 / -(2^1)                      (same)                          (same)
10   01    2^1+1*2^-1 / -(2^1+1*2^-1)        (same)                          (same)
10   10    2^1+2*2^-1 / -(2^1+2*2^-1)        (same)                          (same)
10   11    2^1+3*2^-1 / -(2^1+3*2^-1)        (same)                          (same)
11   --    Reserved pattern

IEEE 754 Format and Precision

– Single precision
  – 1-bit sign, 8-bit exponent (excess-127), 23-bit fraction
– Double precision
  – 1-bit sign, 11-bit exponent (excess-1023), 52-bit fraction
  – The largest error for representing a number is reduced to 1/2^29 of the single-precision representation (29 more fraction bits)


Special Bit Patterns

exponent    mantissa    meaning
11…1        ≠ 0         NaN
11…1        = 0         (-1)^S * ∞
00…0        ≠ 0         denormalized
00…0        = 0         0

– An ∞ can be created by overflow, e.g., division by zero. Any representable number divided by +∞ or -∞ results in 0.
– NaN (Not a Number) is generated by operations whose input values do not make sense, for example, 0/0, 0*∞, ∞/∞, ∞ - ∞.
  – Also used for data that has not been properly initialized in a program.
  – Signaling NaNs (SNaNs) are represented with the most significant mantissa bit cleared, whereas quiet NaNs are represented with the most significant mantissa bit set.

Floating Point Accuracy and Rounding


– The accuracy of a floating point arithmetic
operation is measured by the maximal error
introduced by the operation.
– The most common source of error in floating
point arithmetic is when the operation
generates a result that cannot be exactly
represented and thus requires rounding.
– Rounding occurs if the mantissa of the result
value needs too many bits to be represented
exactly.


Rounding and Error

– Assume our 5-bit representation, and consider

  1.0*2^-2 (0, 00, 01) + 1.00*2^1 (0, 10, 00)

  (the first operand's exponent is 00 → denormalized)

– The hardware needs to shift the mantissa bits in order to align the correct bits with equal place value:

  0.001*2^1 (0, 00, 0001) + 1.00*2^1 (0, 10, 00)

– The ideal result would be 1.001*2^1 (0, 10, 001), but this would require 3 mantissa bits!

Rounding and Error

– In some cases, the hardware may only perform the operation on a limited number of bits for speed and area cost reasons
  – An adder may only have 3 bit positions in our example, so the first operand would be treated as 0.00:

  0.001*2^1 (0, 00, 0001) + 1.00*2^1 (0, 10, 00)


Error Measure

– If a hardware adder has at least two more bit positions than the total (both implicit and explicit) number of mantissa bits, the error would never be more than half of the place value of the mantissa
  – 0.001 in our 5-bit format
– We refer to this as 0.5 ULP (Units in the Last Place)
  – If the hardware is designed to perform arithmetic and rounding operations perfectly, the most error one should introduce is no more than 0.5 ULP
  – The error is limited by the precision in this case.

Order of Operations Matters

– Floating-point operations are not strictly associative
– The root cause is that sometimes a very small number can disappear when added to or subtracted from a very large number:

  (Large + Small) + Small ≠ Large + (Small + Small)
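
A minimal C sketch of this effect (the values are illustrative and added here; any pair with a large enough magnitude gap shows it):

#include <stdio.h>

int main(void) {
    float big = 1.0e20f;

    float x = (big + -big) + 1.0f;   /* = 0.0f + 1.0f = 1.0f */
    float y = big + (-big + 1.0f);   /* -big + 1.0f rounds back to -big, so y = 0.0f */

    printf("x = %f, y = %f\n", x, y);  /* prints 1.000000 vs 0.000000 */
    return 0;
}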



Algorithm Considerations

– Sequential sum
  1.00*2^0 + 1.00*2^0 + 1.00*2^-2 + 1.00*2^-2
  = 1.00*2^1 + 1.00*2^-2 + 1.00*2^-2
  = 1.00*2^1 + 1.00*2^-2
  = 1.00*2^1

– Parallel reduction
  (1.00*2^0 + 1.00*2^0) + (1.00*2^-2 + 1.00*2^-2)
  = 1.00*2^1 + 1.00*2^-1
  = 1.01*2^1

Runtime Math Library

– There are two types of runtime math operations
  – __func(): direct mapping to hardware ISA
    – Fast but lower accuracy
    – Examples: __sinf(x), __expf(x), __powf(x,y)
  – func(): compiles to multiple instructions
    – Slower but higher accuracy (0.5 ulp, units in the last place, or less)
    – Examples: sin(x), exp(x), pow(x,y)
– The -use_fast_math compiler option forces every func() to compile to __func()
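
A minimal CUDA sketch contrasting the two (an added illustration; __sinf is the single-precision hardware intrinsic):

__global__ void compareSin(const float *x, float *accurate, float *fast, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = sinf(x[i]);    // compiles to multiple instructions, higher accuracy
        fast[i]     = __sinf(x[i]);  // maps to the special function unit: fast, lower accuracy
    }
}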



Make your program float-safe!

– Double precision likely has a performance cost
  – Careless use of double or undeclared types may run more slowly
– Be explicit whenever you want single precision, to avoid using double precision where it is not needed
  – Add the 'f' specifier on float literals:
    – foo = bar * 0.123;  // double assumed
    – foo = bar * 0.123f; // float explicit
  – Use the float versions of the standard library functions:
    – foo = sin(bar);  // double assumed
    – foo = sinf(bar); // single precision explicit
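
A small kernel combining both rules (a sketch added here, not from the slides):

__global__ void scale(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // out[i] = sin(in[i]) * 0.123;   // double literal + double sin(): slow path
        out[i] = sinf(in[i]) * 0.123f;    // stays entirely in single precision
    }
}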

Lecture Contents
ØFloating-Point Precision and Accuracy
ØNumerical Stability


Numerical Stability
–Linear system solvers may require
different ordering of floating-point
operations for different input values in
order to find a solution
–An algorithm that can always find an
appropriate operation order and thus a
solution to the problem is a numerically
stable algorithm
– An algorithm that falls short is numerically unstable

Elective Course for Master's Students, School of Computer Science, University of Chinese Academy of Sciences

GPU Architecture and Programming
Lecture 4: GPU as Part of the PC Architecture
Zhao Di
Institute of Computing Technology, Chinese Academy of Sciences
Fall Semester 2023


Lecture Contents
ØGPU as Part of the PC Architecture

Review – Typical Structure of a CUDA Program

– Global variables declaration
– Function prototypes
  – __global__ void kernelOne(…)
– main()
  – allocate memory space on the device – cudaMalloc(&d_GlblVarPtr, bytes)
  – transfer data from host to device – cudaMemcpy(d_GlblVarPtr, h_Gl…)
  – execution configuration setup
  – kernel call – kernelOne<<<execution configuration>>>(args…);
  – transfer results from device to host – cudaMemcpy(h_GlblVarPtr,…)
    (the setup/call/transfer steps repeat as needed)
  – optional: compare against golden (host computed) solution
– Kernel – void kernelOne(type args,…)
  – variables declaration – __local__, __shared__
    – automatic variables transparently assigned to registers or local memory
  – __syncthreads()…


Bandwidth – Gravity of Modern Computer Systems

– The bandwidth between key components ultimately dictates system performance
  – Especially true for massively parallel systems processing massive amounts of data
  – Tricks like buffering, reordering and caching can temporarily defy the rules in some cases
  – Ultimately, the performance falls back to what the "speeds and feeds" dictate

History: Classic PC architecture

– Northbridge connects 3 components that must communicate at high speed
  – CPU, DRAM, video
  – Video also needs to have 1st-class access to DRAM
  – Previous NVIDIA cards are connected to AGP, up to 2 GB/s transfers
– Southbridge serves as a concentrator for slower I/O devices

[Figure: CPU, DRAM and video connected through the Northbridge; slower I/O devices hang off the Southbridge (the core logic chipset)]


(Original) PCI Bus Specification

– Connected to the Southbridge
  – Originally 33 MHz, 32-bit wide, 132 MB/second peak transfer rate
  – More recently 66 MHz, 64-bit, 528 MB/second peak
  – Upstream bandwidth remains slow for devices (~256 MB/s peak)
  – Shared bus with arbitration
    – Winner of arbitration becomes bus master and can connect to CPU or DRAM through the Southbridge and Northbridge

PCI as Memory Mapped I/O

– PCI device registers are mapped into the CPU's physical address space
  – Accessed through loads/stores (kernel mode)
– Addresses are assigned to the PCI devices at boot time
  – All devices listen for their addresses


PCI Express (PCIe)

– Switched, point-to-point connection
  – Each card has a dedicated "link" to the central switch, no bus arbitration
  – Packet-switched messages form virtual channels
  – Prioritized packets for QoS
    – E.g., real-time video streaming

PCIe Links and Lanes

– Each link consists of one or more lanes
  – Each lane is 1-bit wide (4 wires; each 2-wire pair can transmit 2.5 Gb/s in one direction at Gen 1 rates)
  – Upstream and downstream are simultaneous and symmetric
  – Each link can combine 1, 2, 4, 8, 12, or 16 lanes – x1, x2, etc.
  – Each data byte is 8b/10b encoded into 10 bits with an equal number of 1s and 0s; net data rate 2 Gb/s per lane each way
  – Thus, the net data rates are 250 MB/s (x1), 500 MB/s (x2), 1 GB/s (x4), 2 GB/s (x8), 4 GB/s (x16), each way


8b/10b encoding

– Goal is to maintain DC balance while having sufficient state transitions for clock recovery
  – The difference between the number of 1s and 0s in a 20-bit stream should be ≤ 2
  – There should be no more than 5 consecutive 1s or 0s in any stream
– Find 256 good patterns among the 1024 total 10-bit patterns to encode 8-bit data – 20% overhead
  – Bad patterns: 00000000, 00000111, 11000001
  – Good patterns: 01010101, 11001100

PCIe PC Architecture

– PCIe forms the interconnect backbone
  – Northbridge and Southbridge are both PCIe switches
  – Some Southbridge designs have a built-in PCI-PCIe bridge to allow old PCI cards
  – Some PCIe I/O cards are PCI cards with a PCI-PCIe bridge
– Source: Jon Stokes, PCI Express: An Overview
  – http://arstechnica.com/articles/paedia/hardware/pcie.ars


PCIe 3

– 8 giga-transfers per second per lane in each direction
– No more 8b/10b encoding; instead, a polynomial transformation (scrambling) at the transmitter and its inverse at the receiver achieve the same effect
– So the effective bandwidth is double that of PCIe 2

PCIe Data Transfer using DMA

– DMA (Direct Memory Access) is used to fully utilize the bandwidth of an I/O bus
  – DMA uses physical addresses for source and destination
  – Transfers a number of bytes requested by the OS
  – Needs pinned memory
– DMA hardware is much faster than CPU software and frees the CPU for other tasks during the data transfer

[Figure: CPU and Main Memory (DRAM) on one side; the GPU card (or other I/O cards) with its Global Memory and DMA engine on the other]


Pinned Memory

– DMA uses physical addresses
  – The OS could accidentally page out the data that is being read or written by a DMA and page in another virtual page into the same location
  – Pinned memory cannot be paged out
– If a source or destination of a cudaMemcpy() in the host memory is not pinned, it needs to be first copied to pinned memory – extra overhead
– cudaMemcpy is much faster with a pinned host memory source or destination

Allocate/Free Pinned Memory (a.k.a. Page Locked Memory)

– cudaHostAlloc()
  – Three parameters
    – Address of pointer to the allocated memory
    – Size of the allocated memory in bytes
    – Option – use cudaHostAllocDefault for now
– cudaFreeHost()
  – One parameter
    – Pointer to the memory to be freed


Using Pinned Memory

– Use the allocated memory and its pointer the same way as those returned by malloc()
– The only difference is that the allocated memory cannot be paged out by the OS
– The cudaMemcpy function should be about 2X faster with pinned memory
– Pinned memory is a limited resource whose over-subscription can have serious consequences

Important Trends

–Knowing yesterday, today, and


tomorrow
–The PC world is becoming flatter
–CPU and GPU are being fused together
–Outsourcing of computation is
becoming easier…


Elective Course for Master's Students, School of Computer Science, University of Chinese Academy of Sciences

GPU Architecture and Programming
Lecture 4: Efficient Host-Device Data Transfer
Zhao Di
Institute of Computing Technology, Chinese Academy of Sciences
Fall Semester 2023

Lecture Contents
ØPinned Host Memory
ØTask Parallelism in CUDA
ØOverlapping Data Transfer with Computation
ØCUDA Unified Memory


CPU-GPU Data Transfer using DMA

– DMA (Direct Memory Access) hardware is used by cudaMemcpy() for better efficiency
  – Frees the CPU for other tasks
  – Hardware unit specialized to transfer a number of bytes requested by the OS
  – Between physical memory address space regions (some can be mapped I/O memory locations)
  – Uses the system interconnect, typically PCIe in today's systems

[Figure: CPU and Main Memory (DRAM) connected over PCIe to the GPU card (or other I/O cards) with its Global Memory and DMA engine]

Virtual Memory Management

– Modern computers use virtual memory management
  – Many virtual memory spaces mapped into a single physical memory
  – Virtual addresses (pointer values) are translated into physical addresses
– Not all variables and data structures are always in the physical memory
  – Each virtual address space is divided into pages that are mapped into and out of the physical memory
  – Virtual memory pages can be mapped out of the physical memory (page-out) to make room
  – Whether or not a variable is in the physical memory is checked at address translation time


Data Transfer and Virtual Memory


– DMA uses physical addresses
– When cudaMemcpy() copies an array, it is implemented as
one or more DMA transfers
– Address is translated and page presence checked for the
entire source and destination regions at the beginning of
each DMA transfer
– No address translation for the rest of the same DMA
transfer so that high efficiency can be achieved
– The OS could accidentally page-out the data that is
being read or written by a DMA and page-in another
virtual page into the same physical location

Pinned Memory and DMA Data Transfer

– Pinned memory consists of virtual memory pages that are specially marked so that they cannot be paged out
  – Allocated with a special system API function call
  – a.k.a. Page Locked Memory, Locked Pages, etc.
– CPU memory that serves as the source or destination of a DMA transfer must be allocated as pinned memory


CUDA data transfer uses pinned memory

– The DMA used by cudaMemcpy() requires that any source or destination in the host memory be allocated as pinned memory
– If a source or destination of a cudaMemcpy() in the host memory is not allocated in pinned memory, it needs to be first copied to pinned memory – extra overhead
– cudaMemcpy() is faster if the host memory source or destination is allocated in pinned memory, since no extra copy is needed

Allocate/Free Pinned Memory

– cudaHostAlloc(), three parameters
  – Address of pointer to the allocated memory
  – Size of the allocated memory in bytes
  – Option – use cudaHostAllocDefault for now
– cudaFreeHost(), one parameter
  – Pointer to the memory to be freed


Using Pinned Memory in CUDA

– Use the allocated pinned memory and its pointer the same way as those returned by malloc()
– The only difference is that the allocated memory cannot be paged out by the OS
– The cudaMemcpy() function should be about 2X faster with pinned memory
– Pinned memory is a limited resource
  – over-subscription can have serious consequences

Putting It Together – Vector Addition Host Code Example

int main()
{
    const int N = 1024;  // example size; any vector length works
    float *h_A, *h_B, *h_C;

    cudaHostAlloc((void **) &h_A, N * sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void **) &h_B, N * sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void **) &h_C, N * sizeof(float), cudaHostAllocDefault);

    // cudaMemcpy() to/from these buffers now runs about 2X faster

    cudaFreeHost(h_A);
    cudaFreeHost(h_B);
    cudaFreeHost(h_C);
    return 0;
}


Lecture Contents
ØPinned Host Memory
ØTask Parallelism in CUDA
ØOverlapping Data Transfer with Computation
ØCUDA Unified Memory

Serialized Data Transfer and Computation

– So far, the way we use cudaMemcpy serializes data transfer and GPU computation for VecAddKernel()

  Trans. A → Trans. B → Comp → Trans. C        (time →)

  During each transfer only one PCIe direction is used and the GPU is idle; during the computation the PCIe bus is idle


Device Overlap

– Some CUDA devices support device overlap
  – Simultaneously execute a kernel while copying data between device and host memory

int dev_count;
cudaDeviceProp prop;

cudaGetDeviceCount(&dev_count);
for (int i = 0; i < dev_count; i++) {
    cudaGetDeviceProperties(&prop, i);
    if (prop.deviceOverlap) …
}

Ideal, Pipelined Timing

– Divide large vectors into segments
– Overlap transfer and compute of adjacent segments

  Trans A.0 | Trans B.0 | Comp C.0 = A.0 + B.0 | Trans C.0
              Trans A.1 | Trans B.1 | Comp C.1 = A.1 + B.1 | Trans C.1
                          Trans A.2 | Trans B.2 | Comp C.2 = A.2 + B.2
                                      Trans A.3 | Trans B.3 | …


CUDA Streams

– CUDA supports parallel execution of kernels and cudaMemcpy() with "streams"
– Each stream is a queue of operations (kernel launches and cudaMemcpy() calls)
– Operations (tasks) in different streams can go in parallel
  – "Task parallelism"

Streams

– Requests made from the host code are put into First-In-First-Out queues
  – Queues are read and processed asynchronously by the driver and device
  – The driver ensures that commands in a queue are processed in sequence, e.g., memory copies end before the kernel launch, etc.

[Figure: the host thread enqueues cudaMemcpy() calls, kernel launches, and device syncs into a FIFO that the device driver drains in order]


Streams cont.

– To allow concurrent copying and kernel execution, use multiple queues, called "streams"
  – CUDA "events" allow the host thread to query and synchronize with individual queues (i.e., streams)

[Figure: the host thread feeds Stream 0 and Stream 1 queues, with an event marking a point in a stream, processed by the device driver]

Conceptual View of Streams

[Figure: the Copy Engine (PCIe up/down) queues MemCpy A.0, B.0, C.0 for Stream 0 and MemCpy A.1, B.1, C.1 for Stream 1; the Kernel Engine queues Kernel 0 and Kernel 1. Operations are kernel launches and cudaMemcpy() calls.]


Lecture Contents
ØPinned Host Memory
ØTask Parallelism in CUDA
ØOverlapping Data Transfer with Computation
ØCUDA Unified Memory

Simple Multi-Stream Host Code

cudaStream_t stream0, stream1;
cudaStreamCreate(&stream0);
cudaStreamCreate(&stream1);

float *d_A0, *d_B0, *d_C0; // device memory for stream 0
float *d_A1, *d_B1, *d_C1; // device memory for stream 1

// cudaMalloc() calls for d_A0, d_B0, d_C0, d_A1, d_B1, d_C1 go here


Simple Multi-Stream Host Code (Cont.)

for (int i = 0; i < n; i += SegSize * 2) {
    cudaMemcpyAsync(d_A0, h_A+i, SegSize*sizeof(float), …, stream0);
    cudaMemcpyAsync(d_B0, h_B+i, SegSize*sizeof(float), …, stream0);
    vecAdd<<<SegSize/256, 256, 0, stream0>>>(d_A0, d_B0, …);
    cudaMemcpyAsync(h_C+i, d_C0, SegSize*sizeof(float), …, stream0);

    cudaMemcpyAsync(d_A1, h_A+i+SegSize, SegSize*sizeof(float), …, stream1);
    cudaMemcpyAsync(d_B1, h_B+i+SegSize, SegSize*sizeof(float), …, stream1);
    vecAdd<<<SegSize/256, 256, 0, stream1>>>(d_A1, d_B1, …);
    cudaMemcpyAsync(h_C+i+SegSize, d_C1, SegSize*sizeof(float), …, stream1);
}

A View Closer to Reality in Previous GPUs

[Figure: a single Copy Engine queue (PCIe up/down) holds MemCpy A.0, B.0, C.0, A.1, B.1, C.1 in issue order, while the Kernel Engine queue holds Kernel 0 and Kernel 1]


Not quite the overlap we want in some GPUs

– C.0 blocks A.1 and B.1 in the copy engine queue

  Trans A.0 | Trans B.0 | Comp C.0 = A.0 + B.0 | Trans C.0
                                                  Trans A.1 | Trans B.1 | Comp C.1 = A.1 + B.1 | Trans C.1

Better Multi-Stream Host Code

for (int i = 0; i < n; i += SegSize * 2) {
    cudaMemcpyAsync(d_A0, h_A+i, SegSize*sizeof(float), …, stream0);
    cudaMemcpyAsync(d_B0, h_B+i, SegSize*sizeof(float), …, stream0);
    cudaMemcpyAsync(d_A1, h_A+i+SegSize, SegSize*sizeof(float), …, stream1);
    cudaMemcpyAsync(d_B1, h_B+i+SegSize, SegSize*sizeof(float), …, stream1);

    vecAdd<<<SegSize/256, 256, 0, stream0>>>(d_A0, d_B0, …);
    vecAdd<<<SegSize/256, 256, 0, stream1>>>(d_A1, d_B1, …);

    cudaMemcpyAsync(h_C+i, d_C0, SegSize*sizeof(float), …, stream0);
    cudaMemcpyAsync(h_C+i+SegSize, d_C1, SegSize*sizeof(float), …, stream1);
}


C.0 no longer blocks A.1 and B.1

[Figure: the Copy Engine queue now holds MemCpy A.0, B.0, A.1, B.1, C.0, C.1 in that order; the Kernel Engine queue holds Kernel 0 and Kernel 1]

Better, but not quite the best overlap

– C.1 blocks the next iteration's A.0 and B.0 in the copy engine queue

  Iteration n:    Trans A.0 | Trans B.0 | Comp C.0 = A.0 + B.0 | Trans C.0
                              Trans A.1 | Trans B.1 | Comp C.1 = A.1 + B.1 | Trans C.1
  Iteration n+1:                                                  Trans A.2 | Trans B.2 | Comp C.2 = A.2 + B.2 | …


Ideal, Pipelined Timing

– Will need at least three buffers for each original A, B, and C; the code is more complicated

  Trans A.0 | Trans B.0 | Comp C.0 = A.0 + B.0 | Trans C.0
              Trans A.1 | Trans B.1 | Comp C.1 = A.1 + B.1 | Trans C.1
                          Trans A.2 | Trans B.2 | Comp C.2 = A.2 + B.2
                                      Trans A.3 | Trans B.3 | …

Hyper Queues

– Provide multiple queues for each engine
– Allow more concurrency by allowing some streams to make progress for an engine while others are blocked

[Figure: Stream 0 (A–B–C), Stream 1 (P–Q–R) and Stream 2 (X–Y–Z) each feed their own hardware work queue instead of sharing a single one]


Wait until all tasks have completed

– cudaStreamSynchronize(stream_id)
  – Used in host code
  – Takes one parameter – a stream identifier
  – Waits until all tasks in that stream have completed
  – E.g., cudaStreamSynchronize(stream0) in host code ensures that all tasks in the queues of stream0 have completed
– This is different from cudaDeviceSynchronize()
  – Also used in host code
  – No parameter
  – cudaDeviceSynchronize() waits until all tasks in all streams have completed for the current device
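
A minimal sketch of the two calls side by side (an added illustration; assumes stream0 was created earlier with cudaStreamCreate):

void waitForWork(cudaStream_t stream0) {
    cudaStreamSynchronize(stream0);  // host blocks until all tasks queued in stream0 complete
    cudaDeviceSynchronize();         // host blocks until all tasks in all streams complete
}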


Lecture Contents
ØPinned Host Memory
ØTask Parallelism in CUDA
ØOverlapping Data Transfer with Computation
ØCUDA Unified Memory


Data Prefetching in CUDA Unified Memory

● Main purpose is to avoid page faults
● Establishes data locality, providing a mechanism to improve the performance of the application
● Used to migrate data to a device or the host and map page tables onto the processor before it begins using the data
● Most useful when the data is accessed from a single processor

Data Prefetching CUDA API

● cudaMemPrefetchAsync()
● Four parameters:
  ● Unified Memory address of the memory region to prefetch
  ● Size of the region to prefetch in bytes
  ● Destination processor
  ● CUDA stream
● Prefetching is an asynchronous operation with respect to the device; however, the call may not be fully asynchronous with respect to the host.
● The destination processor must be a valid device ID or cudaCpuDeviceId; the latter will prefetch the memory to the host.
● If the prefetch destination is a GPU, the device property cudaDevAttrConcurrentManagedAccess must be nonzero.


Putting it all together

void function(int *data, cudaStream_t stream) {
    // data must've been allocated with cudaMallocManaged((void**)&data, N * sizeof(int));
    init(data, N);                                           // Init data on the host
    cudaMemPrefetchAsync(data, N * sizeof(int), myGpuId,
                         stream);                            // Prefetch to the device
    kernel<<<..., stream>>>(data, ...);                      // Execute on the device
    cudaMemPrefetchAsync(data, N * sizeof(int), cudaCpuDeviceId,
                         stream);                            // Prefetch to the host
    cudaStreamSynchronize(stream);
    hostFunction(data, N);
}

CUDA Unified Memory – Memory Advisor

– Hints that can be provided to the driver on how data will be used during runtime
– One example of when we can give advice is when data will be accessed from multiple processors at the same time


Memory Advisor CUDA API

● cudaMemAdvise()
● Four parameters:
  ● Unified Memory address of the memory region to advise
  ● Size of the region to advise in bytes
  ● Memory advice
  ● Destination processor
● The destination processor must be a valid device ID or cudaCpuDeviceId.
● Available memory hints:
  ● cudaMemAdviseSetReadMostly
  ● cudaMemAdviseSetPreferredLocation
  ● cudaMemAdviseSetAccessedBy
● To unset the advice:
  ● cudaMemAdviseUnsetReadMostly
  ● cudaMemAdviseUnsetPreferredLocation
  ● cudaMemAdviseUnsetAccessedBy

Memory Advisor CUDA API

● cudaMemAdviseSetReadMostly:
  ● Tells the UM driver that the memory region is mostly for reading purposes and only occasionally for writing.
  ● All read accesses from a processor will create a read-only copy to be accessed.
  ● The device argument is ignored.
  ● When used in conjunction with cudaMemPrefetchAsync(), a read-only copy will be created on the specified device.
● cudaMemAdviseSetPreferredLocation:
  ● Tells the UM driver the preferred location of the data.
  ● However, this does not cause immediate data migration to the location; it only guides the migration policy of the UM driver.
  ● If the destination processor is a GPU, that GPU needs a nonzero value for the device flag cudaDevAttrConcurrentManagedAccess.


Memory Advisor CUDA API

● cudaMemAdviseSetAccessedBy:
  ● Tells the UM driver that the data will be accessed by the specified processor.
  ● This advice does not cause immediate data migration to the location; it only causes the data to always be mapped in the specified processor's page tables.
  ● Useful for avoiding page faults, e.g., when in a multi-GPU system one device wants to access another GPU's memory but migrating the data may be more expensive than just reading it through the PCIe link.

Putting it all together

void function(int *data, cudaStream_t stream1, cudaStream_t stream2) {
    // data must be addressable by the Unified Memory driver.
    init(data, N);                                           // Init data on the host
    cudaMemAdvise(data, N * sizeof(int), cudaMemAdviseSetReadMostly,
                  0);                                        // Set the advice for read only.
    cudaMemPrefetchAsync(data, N * sizeof(int), myGpuId1,
                         stream1);                           // Prefetch to the 1st device.
    cudaMemPrefetchAsync(data, N * sizeof(int), myGpuId2,
                         stream2);                           // Prefetch to the 2nd device.
    cudaSetDevice(myGpuId1);
    kernel<<<..., stream1>>>(data, ...);                     // Execute read-only operations on the 1st device.
    cudaSetDevice(myGpuId2);
    kernel<<<..., stream2>>>(data, ...);                     // Execute read-only operations on the 2nd device.
}


Elective Course for Master's Students, School of Computer Science, University of Chinese Academy of Sciences

GPU Architecture and Programming
Lecture 4: Computational Thinking
Zhao Di
Institute of Computing Technology, Chinese Academy of Sciences
Fall Semester 2023

Lecture Contents
ØIntroduction to Computational Thinking


Fundamentals of Parallel Computing

– Parallel computing requires that
  – The problem can be decomposed into sub-problems that can be safely solved at the same time
  – The programmer structures the code and data to solve these sub-problems concurrently
– The goals of parallel computing are
  – To solve problems in less time (strong scaling), and/or
  – To solve bigger problems (weak scaling), and/or
  – To achieve better solutions (advancing science)

The problems must be large enough to justify parallel computing and to exhibit exploitable concurrency.

Shared Memory vs. Message Passing

– We have focused on shared memory parallel programming
  – This is what CUDA (and OpenMP, OpenCL) is based on
  – Future massively parallel microprocessors are expected to support shared memory at the chip level
– The programming considerations of the message passing model are quite different!
  – However, you will find parallels for almost every technique you learned in this course
  – Need to be aware of space-time constraints


Data Sharing

– Data sharing can be a double-edged sword
  – Excessive data sharing drastically reduces the advantage of parallel execution
  – Localized sharing can improve memory bandwidth efficiency
– Efficient memory bandwidth usage can be achieved by synchronizing the execution of task groups and coordinating their usage of memory data
  – Efficient use of on-chip, shared storage and datapaths
– Read-only sharing can usually be done at much higher efficiency than read-write sharing, which often requires more synchronization
  – Many:Many, One:Many, Many:One, One:One

Synchronization
– Synchronization == Control Sharing
– Barriers make threads wait until all threads
catch up
– Waiting is lost opportunity for work
– Atomic operations may reduce waiting
– Watch out for serialization

– Important: be aware of which items of work


are truly independent


Parallel Programming Coding Styles – Program and Data Models

Program Models       Data Models
SPMD                 Shared Data
Master/Worker        Shared Queue
Loop Parallelism     Distributed Array
Fork/Join

These are not necessarily mutually exclusive.

Program Models

– SPMD (Single Program, Multiple Data)
  – All PEs (Processor Elements) execute the same program in parallel, but each has its own data
  – Each PE uses a unique ID to access its portion of data
  – Different PEs can follow different paths through the same code
  – This is essentially the CUDA grid model (also OpenCL, MPI)
  – SIMD is a special case – the WARP is used for efficiency
– Master/Worker
– Loop Parallelism
– Fork/Join


Program Models

– SPMD (Single Program, Multiple Data)
– Master/Worker (OpenMP, OpenACC, TBB)
  – A master thread sets up a pool of worker threads and a bag of tasks
  – Workers execute concurrently, removing tasks until done
– Loop Parallelism (OpenMP, OpenACC, C++AMP)
  – Loop iterations execute in parallel
  – FORTRAN do-all (truly parallel), do-across (with dependence)
– Fork/Join (POSIX p-threads)
  – Most general, generic way of creating threads

Algorithm Structure

Start
├─ Organize by Task
│   ├─ Linear → Task Parallelism
│   └─ Recursive → Divide and Conquer
├─ Organize by Data
│   ├─ Linear → Geometric Decomposition
│   └─ Recursive → Recursive Data
└─ Organize by Data Flow
    ├─ Regular → Pipeline
    └─ Irregular → Event Driven


More on SPMD

– Dominant coding style of scalable parallel computing
  – MPI code is mostly developed in SPMD style
  – Much OpenMP code is also in SPMD (next to loop parallelism)
  – Particularly suitable for algorithms based on task parallelism and geometric decomposition
– Main advantage
  – Tasks and their interactions are visible in one piece of source code; no need to correlate multiple sources

SPMD is by far the most commonly used pattern for structuring massively parallel programs.

Typical SPMD Program Phases

– Initialize
  – Establish localized data structures and communication channels
– Obtain a unique identifier
  – Each thread acquires a unique identifier, typically ranging from 0 to N-1, where N is the number of threads
  – Both OpenMP and CUDA have built-in support for this
– Distribute data
  – Decompose global data into chunks and localize them, or
  – Share/replicate major data structures, using the thread ID to associate a subset of the data with each thread
– Run the core computation
  – More details on the next slide…
– Finalize
  – Reconcile global data structures, prepare for the next major iteration


Core Computation Phase

– Thread IDs are used to differentiate the behavior of threads
  – Use the thread ID in loop index calculations to split loop iterations among threads
    – Potential for memory/data divergence
  – Use the thread ID, or conditions based on the thread ID, to branch to thread-specific actions
    – Potential for instruction/execution divergence

Both can have very different performance results and code complexity depending on the way they are done (see the sketch below).
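
A minimal CUDA sketch of the loop-splitting technique (an added illustration using the common grid-stride idiom):

__global__ void spmdLoop(float *data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // unique thread ID
    int stride = gridDim.x * blockDim.x;               // total number of threads
    for (int i = tid; i < n; i += stride)              // thread ID splits the iterations
        data[i] *= 2.0f;
}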

Making Science Better, not just Faster

or… in other words:


There will be no Nobel Prizes or Turing
Awards awarded for “just recompile” or
using more threads



As in Many Computation-hungry Applications

– Three-step approach:
  – Restructure the mathematical formulation
  – Innovate at the algorithm level
  – Tune core software for the hardware architecture

Conclusion: Three Options

– Good: "Accelerate" Legacy Codes
  – Recompile/run
  – Call CUBLAS/CUFFT/thrust/matlab/PGI pragmas/etc.
  – => good work for domain scientists (minimal CS required)
– Better: Rewrite / Create new codes
  – Opportunity for clever algorithmic thinking
  – => good work for computer scientists (minimal domain knowledge required)
– Best: Rethink Numerical Methods & Algorithms
  – Potential for the biggest performance advantage
  – => Interdisciplinary: requires CS and domain insight
  – => Exciting time to be a computational scientist


Think, Understand… then, Program

– Think about the problem you are trying to solve
– Understand the structure of the problem
  – Apply mathematical techniques to find a solution
  – Map the problem to an algorithmic approach
– Plan the structure of computation
  – Be aware of in/dependence, interactions, bottlenecks
– Plan the organization of data
  – Be explicitly aware of locality, and minimize global data
– Finally, write some code! (this is the easy part)

Future Studies

– More complex data structures
– More scalable algorithms and building blocks
– More scalable math models
– Thread-aware approaches
  – More available parallelism
– Locality-aware approaches
  – Computing is becoming bigger, and everything is further away


Elective Course for Master's Students, School of Computer Science, University of Chinese Academy of Sciences

GPU Architecture and Programming
Lecture 4: Dynamic Parallelism
Zhao Di
Institute of Computing Technology, Chinese Academy of Sciences
Fall Semester 2023

Lecture Contents
ØIn depth study of Dynamic Parallelism


Dynamic Parallelism

● CUDA dynamic parallelism refers to the ability of threads executing on the GPU to launch new grids

Nested Parallelism

● Dynamic parallelism is useful for programming applications with nested parallelism, where each thread discovers more work that can be parallelized
● Dynamic parallelism is particularly useful when the amount of nested work is dynamically determined at execution time, so enough threads cannot be launched up front
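
A minimal sketch of a device-side launch (an added illustration; the kernels are hypothetical, and the code must be compiled with -rdc=true on a GPU that supports dynamic parallelism):

__global__ void childKernel(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

__global__ void parentKernel(int *data, const int *workPerThread) {
    int n = workPerThread[threadIdx.x];        // nested work discovered at execution time
    if (n > 0)                                 // each thread launches a grid sized to its work
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
}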



Applications of Dynamic Parallelism

● Applications whose amount of nested work may be unknown before execution time:
  ● Nested parallel work is irregular (varies across threads)
    ● e.g., graph algorithms (each vertex has a different number of neighbors)
    ● e.g., Bézier curves (each curve needs a different number of points to draw)
  ● Nested parallel work is recursive with data-dependent depth
    ● e.g., tree traversal algorithms (e.g., quadtrees and octrees)
    ● e.g., divide and conquer algorithms (e.g., quicksort)

Elective Course for Master's Students, School of Computer Science, University of Chinese Academy of Sciences

GPU Architecture and Programming
Lecture 4: CUDA Libraries
Zhao Di
Institute of Computing Technology, Chinese Academy of Sciences
Fall Semester 2023


Lecture Contents
ØcuBLAS
ØcuSOLVER
ØcuFFT
ØThrust

BLAS and cuBLAS

● BLAS (Basic Linear Algebra Subprograms) is a low-level linear algebra library originally written in Fortran and standardized by the BLAS Technical Forum
● It provides three levels of routines:
  ● Level 1: Scalar and vector-vector operations
    ● Example: dot product and SAXPY
  ● Level 2: Matrix-vector operations
    ● Example: matrix-vector multiplication, solving a triangular system
  ● Level 3: Matrix-matrix operations
    ● Example: GEMM
● cuBLAS is a BLAS implementation for CUDA devices


Data layout in cuBLAS

● There are two natural data layouts to linearly store dense matrices:
  ● Row major order: each row is stored contiguously
    ● With 0-based indexing, A(i,j) = A[i * cols + j]; a 2x2 matrix is stored as A(1,1), A(1,2), A(2,1), A(2,2)
    ● Used by C, C++ and derivatives
  ● Column major order: each column is stored contiguously
    ● With 0-based indexing, A(i,j) = A[j * rows + i]; a 2x2 matrix is stored as A(1,1), A(2,1), A(1,2), A(2,2)
    ● Used by Fortran and derivatives
● cuBLAS uses column major order with 1-based indexing for compatibility with Fortran numeric libraries

Creation and destruction of a cuBLAS environment

● cuBLAS needs an execution context to store internal resources. This context needs to be created before executing any cuBLAS routine.
● After all cuBLAS executions are finished, the context needs to be destroyed to free resources.
● Creating and destroying contexts should be considered an expensive operation. It is recommended that each thread and each device have its own context.
● To create a context on a specific device, call cudaSetDevice before the creation.
● cublasHandle_t
  ● Type used by cuBLAS to store contexts
● cublasCreate(cublasHandle_t* handle)
  ● Creates a cuBLAS context
  ● Parameter: pointer to the cuBLAS handle to create
● cublasDestroy(cublasHandle_t handle)
  ● Destroys a cuBLAS context
  ● Parameter: cuBLAS handle with the context to destroy
● cublasStatus_t
  ● Type used by cuBLAS for reporting errors
  ● Every cuBLAS routine returns an error status
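
A minimal sketch of the context lifecycle (an added illustration):

#include <cublas_v2.h>

void cublasContextDemo(void) {
    cublasHandle_t handle;
    cudaSetDevice(0);                                    // select the device before creating the context
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS)
        return;                                          // creation failed; nothing to destroy
    // … call cuBLAS routines with `handle` here …
    cublasDestroy(handle);                               // release the context's resources
}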


cuBLAS streams API and thread safety

● cuBLAS permits the use of CUDA streams (cudaStream_t) for increasing resource usage and introducing other levels of parallelism
● cuBLAS is a thread-safe library, meaning that the cuBLAS host functions can be called from multiple threads safely
● cublasSetStream():
  ● Sets the stream to be used by cuBLAS for subsequent computations
  ● Parameters:
    ● cuBLAS handle to set the stream on
    ● CUDA stream to use
● cublasGetStream():
  ● Gets the stream being used by cuBLAS
  ● Parameters:
    ● cuBLAS handle to get the stream from
    ● pointer to a CUDA stream
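
A minimal sketch of routing cuBLAS work through a stream (an added illustration):

void cublasStreamDemo(cublasHandle_t handle) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cublasSetStream(handle, stream);   // subsequent cuBLAS calls on `handle` enqueue into `stream`
    // … cuBLAS calls …
    cudaStreamDestroy(stream);
}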

cuBLAS API naming convention for BLAS routines

● Each of the three levels of BLAS routines in cuBLAS has multiple interfaces for the same operation, following the naming convention cublas<t>operation, where <t> is one of:
  ● S for float parameters
  ● D for double parameters
  ● C for complex float parameters
  ● Z for complex double parameters
● Example: For the axpy operation (y[i] = α*x[i] + y[i]), the available functions are:
  ● cublasSaxpy, cublasDaxpy, cublasCaxpy, cublasZaxpy
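
A usage sketch (an added illustration; d_x and d_y are assumed to be device vectors already allocated and filled):

void saxpyDemo(cublasHandle_t handle, int n, const float *d_x, float *d_y) {
    const float alpha = 2.0f;                        // y[i] = alpha * x[i] + y[i]
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);  // strides of 1: contiguous vectors
}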


cuBLAS memory API

● cuBLAS offers specialized data migration and copy functions for strided matrix and vector transfers
● Available functions:
  ● cublasGetVector & cublasGetMatrix for device-to-host transfers
  ● cublasSetVector & cublasSetMatrix for host-to-device transfers
  ● cublas<t>copy for device-to-device transfers
● Useful for obtaining a row in a matrix stored in column major order

cuBLAS memory API


● cublasGetVector():
  ● Parameters:
    ● Number of elements to transfer
    ● Element size in bytes
    ● Source device pointer
    ● Stride to use for the source vector
    ● Destination host pointer
    ● Stride to use for the destination vector
● Example: a 2x2 matrix A is stored on the device in column-major order, so A[0]=A(1,1), A[1]=A(2,1), A[2]=A(1,2), A[3]=A(2,2). Reading 2 elements with source stride 2 transfers the first row into y, i.e. y[0]=A(1,1), y[1]=A(1,2):
  ● cublasGetVector(2, sizeof(float), A, 2, y, 1);

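A fuller sketch of the round trip (h_A and row are hypothetical host arrays; d_A is a device buffer of 4 floats):

float h_A[4] = {1, 2, 3, 4};   // column-major 2x2: A(1,1)=1, A(2,1)=2, A(1,2)=3, A(2,2)=4
float row[2];
float *d_A;
cudaMalloc((void**)&d_A, 4 * sizeof(float));
// Copy the 2x2 matrix to the device (both leading dimensions are 2)
cublasSetMatrix(2, 2, sizeof(float), h_A, 2, d_A, 2);
// Fetch the first row: stride 2 on the source, stride 1 on the destination
cublasGetVector(2, sizeof(float), d_A, 2, row, 1);   // row = {1, 3}
cudaFree(d_A);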


cuBLAS example: Conjugate Gradient

● The Conjugate Gradient method is an iterative method to compute an approximation to the solution of the linear algebraic system Ax = b
● Assumptions:
  ● Let A be an n x n symmetric and positive definite matrix (all its eigenvalues are positive)
  ● Let b be an n-dimensional vector

Algorithm:
1. $r_0 = b - A x_0$
2. $p_0 = r_0$
3. $k = 0$
4. loop
   1. $\alpha_k = \dfrac{r_k^T r_k}{p_k^T A p_k}$
   2. $x_{k+1} = x_k + \alpha_k p_k$
   3. $r_{k+1} = r_k - \alpha_k A p_k$
   4. if $\|r_{k+1}\|_2 < \varepsilon$, break the loop
   5. $\beta_k = \dfrac{r_{k+1}^T r_{k+1}}{r_k^T r_k}$
   6. $p_{k+1} = r_{k+1} + \beta_k p_k$
   7. $k = k + 1$
5. return $x_{k+1}$

cuBLAS implementation of the Conjugate Gradient method
double zero = 0, minusOne = -1, one = 1, alpha = 0, beta = 0, rxr = 0, tmp;
cublasDcopy(handle, n, b, 1, r, 1);                                             // r_0 = b
cublasDgemv(handle, CUBLAS_OP_N, n, n, &minusOne, A, rows, x, 1, &one, r, 1);   // r_0 = b - A * x_0
cublasDcopy(handle, n, r, 1, p, 1);                                             // p_0 = r_0
cublasDdot(handle, n, r, 1, r, 1, &rxr);                                        // r_k^T * r_k
while (k < maxit) {
    cublasDgemv(handle, CUBLAS_OP_N, n, n, &one, A, rows, p, 1, &zero, Axp, 1); // Axp = A * p_k
    cublasDdot(handle, n, p, 1, Axp, 1, &tmp);                                  // p_k^T * A * p_k
    alpha = rxr / tmp;
    cublasDaxpy(handle, n, &alpha, p, 1, x, 1);                                 // x_{k+1} = x_k + alpha_k * p_k


cuBLAS implementation of the Conjugate Gradient method (continued)
    tmp = -alpha;
    cublasDaxpy(handle, n, &tmp, Axp, 1, r, 1);   // r_{k+1} = r_k - alpha_k * A * p_k
    cublasDdot(handle, n, r, 1, r, 1, &tmp);      // r_{k+1}^T * r_{k+1}
    if (sqrt(tmp) < epsilon) break;
    beta = tmp / rxr;                             // beta_k = (r_{k+1}^T r_{k+1}) / (r_k^T r_k)
    rxr = tmp;
    cublasDscal(handle, n, &beta, p, 1);          // beta_k * p_k
    cublasDaxpy(handle, n, &one, r, 1, p, 1);     // p_{k+1} = r_{k+1} + beta_k * p_k
    k += 1;
}
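These snippets assume the device buffers were set up beforehand; a minimal sketch of that setup (the names n, b, x, r, p, Axp match the snippets; h_b and h_x are hypothetical host arrays):

double *r, *p, *Axp;                       // device work vectors used by the loop above
cudaMalloc((void**)&r,   n * sizeof(double));
cudaMalloc((void**)&p,   n * sizeof(double));
cudaMalloc((void**)&Axp, n * sizeof(double));
// b and x are device vectors initialized from the host, e.g.:
cublasSetVector(n, sizeof(double), h_b, 1, b, 1);   // right-hand side
cublasSetVector(n, sizeof(double), h_x, 1, x, 1);   // initial guess x_0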

Lecture Contents
ØcuBLAS
ØcuSOLVER
ØcuFFT
ØThrust


cuSOLVER
●cuSOLVER is a linear algebra library with LAPACK-like features focused on:
● Direct factorization methods for dense matrices
● Linear-system solving methods for dense and sparse matrices
● Eigenvalue problems for dense and sparse matrices
● Refactorization techniques for sparse matrices
● Least-squares problems for sparse matrices
●In cases where the sparsity pattern produces low GPU utilization, cuSOLVER provides CPU routines to handle those sparse matrices

cuSOLVER dense API I

● cuSOLVER dense-matrix routines are available under the cuSolverDN API
● cuSolverDN provides two different APIs:
  ● Legacy naming convention: cusolverDn<T><operation>
  ● Generic naming convention: cusolverDn<operation>
● <T> is the data type used by the matrices:

<T>   Type
S     float
D     double
C     cuComplex
Z     cuDoubleComplex
X     Generic type

● <operation> is the operation to be performed; some examples are:
  ● potrf for Cholesky factorization
  ● gesvd for the SVD decomposition
● Example: cusolverDnDpotrf calls the routine performing the Cholesky decomposition for matrices of type double


cuSolverDN dense API II


●cuSolverDN assumes that matrices are stored in
column-major format
●The cuSolverDN generic API provides 64-bit support for integer parameters
●The cuSolverDN generic API uses the type cudaDataType in its routine arguments to specify the type of the input matrix
cudaDataType Type
CUDA_R_32F Single precision real number
CUDA_R_64F Double precision real number
CUDA_C_32F Single precision complex number
CUDA_C_64F Double precision complex number

cuSOLVER dense environment

● cuSolverDN needs an execution context to store internal resources. This context needs to be created before executing any cuSolverDN routine
● After all cuSolverDN executions are finished, the context needs to be destroyed to free resources
● To create a context on a specific device, call cudaSetDevice before the creation

● cusolverDnHandle_t
  ● Type used by cuSolverDN to store context handles
● cusolverDnCreate(cusolverDnHandle_t* handle)
  ● Creates a cuSolverDN context
  ● Parameters:
    ● Pointer to the cuSolverDN handle to create
● cusolverDnDestroy(cusolverDnHandle_t handle)
  ● Destroys a cuSolverDN context
  ● Parameters:
    ● cuSolverDN handle with the context to destroy
● cusolverStatus_t
  ● Type used by cuSOLVER for reporting errors


cuSOLVER dense environment


● cuSolverDN routines don't allocate workspace memory by themselves; the user needs to allocate the device workspace
● To find the size in bytes needed by a cuSolverDN routine, the user must call:
  ● <routine>_bufferSize()
  ● The arguments of the function vary according to the routine
  ● <routine> is the name of the cuSolverDN routine
● Once the workspace region has been allocated, the user can call the routine
● cuSolverDN routines accept an info parameter; if info = -i (less than 0), the i-th parameter (not counting the handle) is invalid

cuSolverDN streams API and thread safety


● cuSolverDN permits the use of CUDA streams (cudaStream_t) to increase resource usage and to introduce additional levels of parallelism
● cuSolverDN is a thread-safe library, meaning that the cuSolverDN host functions can be called from multiple threads safely

● cusolverDnSetStream():
  ● Sets the stream to be used by cuSolverDN for subsequent computations
  ● Parameters:
    ● cuSolverDN handle to set the stream
    ● CUDA stream to use
● cusolverDnGetStream():
  ● Gets the stream being used by cuSolverDN
  ● Parameters:
    ● cuSolverDN handle to get the stream
    ● Pointer to the CUDA stream



cuSolverDN LU factorization with generic API

●Given a matrix A, the LU factorization is given by:


P * A = L * U
where P is a permutation matrix produced by the algorithm
with the row pivots. L is a lower triangular matrix with unit
diagonal and U is an upper triangular matrix.
●The matrix A is assumed to be in column-major order
●If info = i with i positive, the factorization failed and U(i,i) = 0

cuSolverDN LU factorization with generic API


cusolverDnHandle_t handle;   // cuSolverDN handle
cusolverDnParams_t params;   // Routine options
void *A;                     // Matrix to factorize
int64_t m, n;                // Rows and columns respectively
int64_t *ipiv;               // Row pivots, vector of m elements
size_t workSize;             // Workspace size in bytes
void *buffer;                // Workspace buffer
cudaDataType typeA;          // Type of matrix A
int info = 0;                // Info parameter (note: cuSOLVER expects this in device memory in a real program)

// Create and initialize the options struct:
cusolverDnCreateParams(&params);
cusolverDnSetAdvOptions(params, CUSOLVERDN_GETRF, CUSOLVER_ALG_0);
// Get the buffer size and allocate it:
cusolverDnGetrf_bufferSize(handle, params, m, n, typeA, A, n, typeA, &workSize);
cudaMalloc((void**)&buffer, workSize);
// Compute the factorization
cusolverDnGetrf(handle, params, m, n, typeA, A, n, ipiv, typeA, buffer, workSize, &info);
// Destroy the options struct
cusolverDnDestroyParams(params);


cuSolverDN Cholesky factorization with legacy API
● Given a matrix A, the Cholesky factorization is given by either of:
  A = L * L^H    or    A = U^H * U
  where L is a lower triangular matrix and U is an upper triangular matrix
● cuSolverDN uses a fill-mode parameter to determine which decomposition to compute
● If info = i with i positive, the factorization failed and U(i,i) = 0

cusolverDnHandle_t handle;   // cuSolverDN handle
double *A;                   // Matrix to factorize
cublasFillMode_t uplo;       // Fill mode selecting the decomposition
int n;                       // Number of rows and columns
int workSize;                // Workspace size in elements
void *buffer;                // Workspace buffer
int info = 0;                // Info parameter

// Get the buffer size and allocate it:
cusolverDnDpotrf_bufferSize(handle, uplo, n, A, n, &workSize);
cudaMalloc((void**)&buffer, workSize * sizeof(double));
// Compute the factorization
cusolverDnDpotrf(handle, uplo, n, A, n, (double*)buffer, workSize, &info);

List of some available routines in cuSolverDN
●Factorization:
● potrf, potrs for Cholesky factorization and linear system solving, see
potrfBatched and potrsBatched for batched versions
● getrf, getrs for LU factorization and linear system solving
● geqrf for QR factorization
● sytrf for LDL factorization

●Eigenvalue problems:
● gesvd for SVD decomposition
● syevd for eigenvalue decomposition
● sygvd for generalized eigenvalue decomposition (see the sketch below for a factor-and-solve pair)
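As an illustration of how the factorization and solve routines pair up, a hedged sketch of a Cholesky factor-and-solve with the legacy API (d_A, d_b, workspace, workSize and devInfo are assumed to be already allocated, as on the previous slide):

// Factor A (n x n, column-major, lower triangle) in place: A = L * L^T
cusolverDnDpotrf(handle, CUBLAS_FILL_MODE_LOWER, n, d_A, n, workspace, workSize, devInfo);
// Solve A * x = b for one right-hand side; d_b is overwritten with the solution
cusolverDnDpotrs(handle, CUBLAS_FILL_MODE_LOWER, n, 1, d_A, n, d_b, n, devInfo);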


Compiling and linking against cuSOLVER


●In order to compile and link against cuSOLVER, the
user needs to:
● Include the appropriate header in the required files
● #include <cusolverDn.h> for cuSOLVER dense functionality
● #include <cusolverSp.h> for cuSOLVER sparse functionality
● #include <cusolverRf.h> for cuSOLVER refactorization
functionality
● Link against the cuSOLVER library:
● For dynamic linking use the flag -lcusolver
● For static linking use the flags -lcusolver_static -llapack_static


Lecture Contents
ØcuBLAS
ØcuSOLVER
ØcuFFT
ØThrust


Discrete Fourier Transform (DFT)

● Widely used in signal processing, from astronomical radio signals to image processing and many other areas
● The forward Discrete Fourier Transform is given by:

  $X_k = \sum_{n=0}^{N-1} x_n e^{-i \omega k n}$

  where $(x_0, \dots, x_{N-1})$ is the input signal, $(X_0, \dots, X_{N-1})$ is the transformed signal, $\omega = \frac{2\pi}{N}$, and $i$ is the imaginary unit
● Given the transformed signal it is possible to recover the input signal; this process is known as the inverse transform

Fast Fourier Transform (FFT) & cuFFT

● A fast method to compute the DFT of a signal, with computational complexity $O(N \log_2 N)$, whereas the direct DFT has $O(N^2)$ complexity
● A cornerstone of numeric algorithms thanks to its many applications and speed
● cuFFT is a CUDA implementation of several FFT algorithms
● cuFFT is especially fast for input sizes of $2^a 3^b 5^c 7^d$ elements
● cuFFT provides 1D, 2D and 3D transforms, as well as:
  ● C2C: complex input to complex output
  ● R2C: real input to complex output
  ● C2R: complex input to real output
● cuFFT provides multi-GPU support through the cuFFT Xt API


cuFFT plans & environment

● Plans contain the information necessary to compute a transform, such as:
  ● The best algorithm to use for the specified size
  ● Memory resources needed by cuFFT to compute the transform
● Plans need to be created before executing the transform and destroyed after the user is done using them
● There are two main ways to create plans:
  ● Through the basic plan API:
    ● cufftPlan1d, cufftPlan2d, cufftPlan3d, cufftPlanMany
    ● Aimed at easy setup
  ● Through the extensible plan API:
    ● cufftCreate, cufftMakePlan1d, cufftMakePlan2d, cufftMakePlan3d, cufftMakePlanMany, cufftMakePlanMany64
    ● Aimed at customization and extensibility

cuFFT numeric data types

●cuFFT offers the following numeric types:
● cufftReal for single-precision real numbers
● cufftDoubleReal for double-precision real numbers
● cufftComplex for single-precision complex numbers
● cufftDoubleComplex for double-precision complex numbers

●For simplicity, the rest of the slides will use real and complex to refer to cufftReal/cufftDoubleReal and cufftComplex/cufftDoubleComplex respectively
●For memory-size considerations it is important to note that:

sizeof(complex) = 2 * sizeof(real)


Data layout for 1D out-of-place transforms

● Out-of-place transforms use different memory regions for the input and output signals
● Input and output sizes for each transform:

Type   Input size in bytes             Output size in bytes
C2C    N * sizeof(complex)             N * sizeof(complex)
C2R    (N/2 + 1) * sizeof(complex)     N * sizeof(real)
R2C    N * sizeof(real)                (N/2 + 1) * sizeof(complex)

● Custom strides in the layout cause cuFFT to overwrite the input array in the C2R transform
● The output array in R2C consists of N/2 + 1 complex numbers due to the Hermitian redundancy property of the transform

Data layout for 1D in-place transforms

●In-place transforms store the output result in the input array
●Array sizes for each transform:

Type   Array size in bytes
C2C    N * sizeof(complex)
C2R    (N/2 + 1) * sizeof(complex)
R2C    (N/2 + 1) * sizeof(complex)

● The sizes for C2R and R2C follow from the fact that the transform needs to hold at most N/2 + 1 complex numbers
● In R2C only the first N real entries are filled with input data; the rest of the array is allocated for the output
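A sketch of allocating a buffer for an in-place R2C transform (the real input aliases the start of the complex buffer; N is an assumed signal length):

int N = 1024;
cufftComplex *data;                                   // holds the N/2 + 1 complex outputs
cudaMalloc((void**)&data, (N / 2 + 1) * sizeof(cufftComplex));
cufftReal *in = (cufftReal*)data;                     // the first N real entries hold the input
// ... fill in[0..N-1], then run an in-place R2C transform on (in, data) ...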


cuFFT extensible API

● Provided for customization and extensibility compared to the basic API
● Typical usage of the extensible API:
  ● Initialize a cufftHandle
  ● Make a plan
  ● Compute transforms
  ● Destroy the plan
● Every function returns an error status of type cufftResult

● cufftHandle
  ● Type used by cuFFT to store plans
● cufftCreate(cufftHandle* handle)
  ● Creates a cuFFT handle
  ● Parameters:
    ● Pointer to the cuFFT handle to create
● cufftDestroy(cufftHandle handle)
  ● Destroys the resources of a cuFFT plan
  ● Parameters:
    ● cuFFT handle of the plan to destroy
● cufftResult
  ● Type used by cuFFT for reporting errors
  ● Every cuFFT routine returns an error status

cuFFT workspace memory

● By default, cuFFT automatically allocates the workspace memory needed for computations at plan creation
● cuFFT also allows the user to provide the memory region to be used for the workspace
● To disable auto-allocation, the user needs to call cufftSetAutoAllocation() after cufftCreate() and before cufftMakePlan*()
● cufftSetAutoAllocation():
  ● Sets whether cuFFT should use automatic workspace allocation
  ● Parameters:
    ● cufftHandle plan handle
    ● Integer indicating whether to allocate the workspace area, 0: false & 1: true


cuFFT workspace memory

● If auto-allocation has been disabled, the user needs to:
  ● Get the size of the workspace area needed by cuFFT
  ● Allocate device memory of the size specified by cuFFT
  ● Set the memory region to be used by the plan
● Allocating and setting the workspace memory needs to be done after plan creation and before plan execution

● cufftGetSize():
  ● Gets the size in bytes needed by the cuFFT plan
  ● Parameters:
    ● cufftHandle plan handle
    ● Pointer to a size_t variable to store the size
● cufftSetWorkArea():
  ● Sets the memory region to be used by the cuFFT plan
  ● Parameters:
    ● cufftHandle plan handle
    ● Pointer to the device memory region

Creating cuFFT plans

● cuFFT allows 1-, 2- and 3-dimensional transforms
● Available planning routines:
  ● cufftMakePlan1d: for 1D transforms
  ● cufftMakePlan2d: for 2D transforms
  ● cufftMakePlan3d: for 3D transforms
  ● cufftMakePlanMany: for 1, 2 and 3D transforms with support for strided input and output memory layouts
  ● cufftMakePlanMany64: same as cufftMakePlanMany but with 64-bit support for sizes and strides
● cufftMakePlan1d():
  ● Creates a plan for a 1D transform
  ● Parameters:
    ● cufftHandle plan handle
    ● Size of the transform
    ● cufftType type of the transform
    ● Number of transforms to perform in batch
    ● size_t pointer to the size in bytes of the workspace area
● cufftType types:
  ● CUFFT_R2C, CUFFT_C2R, CUFFT_C2C for single-precision transforms
  ● CUFFT_D2Z, CUFFT_Z2D, CUFFT_Z2Z for double-precision transforms


cuFFT plan and transform execution

● After plan creation the user can execute the transform
● The execution function depends on the type of the transform:
  ● cufftExecC2C and cufftExecZ2Z for complex-to-complex transforms
  ● cufftExecR2C and cufftExecD2Z for real-to-complex transforms
  ● cufftExecC2R and cufftExecZ2D for complex-to-real transforms
● cufftExec<t>():
  ● Executes the transform specified by the plan
  ● Parameters:
    ● cufftHandle plan handle
    ● Pointer to the input array
    ● Pointer to the output array
    ● (*) Transform direction:
      ● CUFFT_FORWARD for the forward transform
      ● CUFFT_INVERSE for the inverse transform
      ● (*) This argument only exists when <t> is either C2C or Z2Z
● If the input and output pointers are the same, cuFFT will perform an in-place transform

cuFFT streams API and thread safety


● cuFFT permits the use of CUDA streams (cudaStream_t) to increase resource usage and to introduce additional levels of parallelism
● cuFFT is a thread-safe library as long as different host threads execute computations using different plans and different output memory regions
● cufftSetStream():
  ● Sets the stream to be used by cuFFT for subsequent computations
  ● Parameters:
    ● cuFFT handle to set the stream
    ● CUDA stream to use
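A short sketch of attaching a stream (standard cuFFT/CUDA runtime calls; plan is an already created cufftHandle):

cudaStream_t stream;
cudaStreamCreate(&stream);
cufftSetStream(plan, stream);    // transforms executed with this plan now run on the stream
// ... cufftExec* calls ...
cudaStreamSynchronize(stream);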


Example: Cross correlation of two signals

cufftComplex *x, *y, *z;
size_t workSize;
// Allocate and initialize the signals
cufftHandle plan;
cufftCreate(&plan);                                   // Create the handle
cufftMakePlan1d(plan, n, CUFFT_C2C, 1, &workSize);    // Create the plan
cufftExecC2C(plan, x, x, CUFFT_FORWARD);              // Compute the forward x transform
cufftExecC2C(plan, y, y, CUFFT_FORWARD);              // Compute the forward y transform
// Launch a kernel executing z[i] = x[i] * conj(y[i])
cufftExecC2C(plan, z, z, CUFFT_INVERSE);              // The inverse transform yields the correlation
cufftDestroy(plan);                                   // Destroy the plan
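The pointwise-multiplication kernel referenced in the example is not part of cuFFT; a minimal sketch of one (the kernel name and launch configuration are illustrative):

#include <cuComplex.h>

// z[i] = x[i] * conj(y[i]), as used by the cross-correlation theorem
__global__ void pointwiseMulConj(const cufftComplex *x, const cufftComplex *y,
                                 cufftComplex *z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        z[i] = cuCmulf(x[i], cuConjf(y[i]));
}

// Launched between the forward and inverse transforms, e.g.:
// pointwiseMulConj<<<(n + 255) / 256, 256>>>(x, y, z, n);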

Compiling and linking against cuFFT

●In order to compile and link against cuFFT, the user needs to:
the user needs to:
●Include the header “#include <cufft.h>” in the
appropriate files
●Link against the cuFFT library:
●For dynamic linking use the flag -lcufft
●For static linking with CUDA 9.0 or later use the flags -lcufft_static -lculibos



Lecture Contents
ØcuBLAS
ØcuSOLVER
ØcuFFT
ØThrust

C++ STL

● The C++ Standard Template Library provides a set of type-agnostic features as a way of simplifying C++ programming; some examples of these features are:
  ● Type-agnostic data structures: lists, maps, vectors, etc.
  ● Type-agnostic algorithms: sort, fill, max, etc.
  ● Iterators for the data structures
● C++ STL features reside inside the C++ namespace std
● Example:

#include <algorithm>
#include <vector>

// Creates an int vector with 10 elements
std::vector<int> a(10);
// Fill the vector with 10 9 8 ... 1
for (int i = 0; i < a.size(); ++i)
    a[i] = a.size() - i;
// Sort the vector
std::sort(a.begin(), a.end());
// a now contains 1 2 3 ... 10


Thrust
● Thrust is a C++ template library written for CUDA devices and based on the
C++ STL with the intention of simplifying certain aspects of the CUDA
programming model

● Thrust provides users with three main functionalities:


● The host and device vector containers
● A collection of parallel primitives, such as sort, reduce and transformations
● Fancy iterators
● All Thrust functions and containers reside inside the thrust namespace

● Thrust allows either host or device execution

● Since Thrust is a header-only template library, it does not need to be linked against a separate library



Thrust vector containers


● Thrust provides 2 vector containers:
● thrust::host_vector<T> for a vector stored in host memory
● thrust::device_v ector<T> for a vector stored in device memory

● Container features:
  ● Interoperability between host_vector<T> and device_vector<T>, specifically:
    ● Memory transfers using the assignment operator and constructors
  ● Interoperability with STL containers through iterators
  ● API similar to the STL std::vector
  ● Dynamic resizing of the container
● To use them include:
  ● #include <thrust/host_vector.h>
  ● #include <thrust/device_vector.h>


STL and Thrust Iterators

● Iterators provide a high-level abstraction for data access to containers; specifically, an iterator:
  ● Contains information about a specific element in the container
  ● Contains information on how to access other elements in the container
● There are several kinds of iterators, for example:
  ● Random-access iterators allow access to any element in the container
  ● Bidirectional iterators allow access to the previous and next elements
  ● Forward iterators allow access to the next element
● Example:

thrust::host_vector<int> a(10);
// Fill a with 0 1 2 ... 9
auto it = a.begin();   // Iterator pointing to the first element
int tmp = *it;         // Dereference the iterator; tmp holds 0
it = a.begin() + 5;    // Iterator pointing to a[5]
++it;                  // Iterator pointing to a[6]
*it = tmp;             // Equivalent to performing a[6] = tmp
for (it = a.begin(); it != a.end(); ++it)
    std::cout << *it << std::endl;   // Print a's contents

*** It is important to observe that a.end() doesn't point to a[9]; it refers to a position past the final element

Example: Thrust vector containers interoperability
#include <thrust/host_vector.h>

#include <thrust/device_vector.h>

#include <list>

// Declare and initialize STL list container

std::list<float> stl_list;

stl_list.push_back(3.14);

stl_list.push_back(2.71);

stl_list.push_back(0.);

// Initialize thrust device vector with the STL list containing {3.14, 2.71, 0.}

thrust::device_vector<float> vector_D(stl_list.begin(), stl_list.end());

// Perform computations on vector_D

// Create host vector and copy back results to host memory

thrust::host_vector<float> vector_H = vector_D;




Thrust pointers and memory

● Thrust provides a pointer interface, thrust::device_ptr<T>, to be used when data was allocated using cudaMalloc or similar mechanisms
● The thrust::device_ptr<T> interface is compatible with all Thrust algorithms and allows semantics similar to iterators
● To use Thrust device pointers the user needs to include:
  ● #include <thrust/device_ptr.h>
● Example:

#include <thrust/device_ptr.h>
...
float *a;
// Allocate device memory
cudaMalloc((void**)&a, 1024 * sizeof(float));
// Initialize a thrust pointer with a
thrust::device_ptr<float> a_thrust(a);
// Assign to a_thrust the sequence 0, 2, 4, ..., 2046
thrust::sequence(a_thrust, a_thrust + 1024, 0, 2);
// Extract the raw pointer being used by a_thrust
float *b = thrust::raw_pointer_cast(a_thrust);

Thrust algorithms
● Provides common parallel algorithms like:
● Reductions
● Sorting
● Reorderings
● Prefix-sums
● Transformations
● All Thrust algorithms have host and device implementations
● If an algorithm is invoked with an iterator, Thrust will execute the operation on the host or the device according to the iterator's memory region
● Except for thrust::copy, which can copy data between host and device, all other routines must have all of their arguments reside in the same place
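For instance, a reduction over a device vector dispatches to the GPU simply because its iterators refer to device memory (a small sketch):

#include <thrust/device_vector.h>
#include <thrust/reduce.h>

thrust::device_vector<int> v(1024, 1);             // 1024 ones in device memory
int sum = thrust::reduce(v.begin(), v.end(), 0);   // runs on the device; sum == 1024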


Thrust transformations

● Transformations apply a function to each element in the input range and store the result in the output range
● A diverse set of transformations is available, such as:
  ● thrust::transform applies a function out-of-place to each element in a data range
  ● thrust::transform_if applies a function out-of-place if a condition is met
  ● thrust::for_each applies a unary function in-place
● To use the Thrust transformation algorithms the user needs to include:
  ● #include <thrust/transform.h>
● Example:

#include <thrust/transform.h>

struct saxpy_functional {
    float alpha;
    __host__ __device__
    float operator()(const float &x, const float &y) const {
        return alpha * x + y;
    }
};

thrust::device_vector<float> a(1024);
thrust::device_vector<float> b(1024);
thrust::device_vector<float> c(1024);
// Initialize a and b
saxpy_functional saxpy;
saxpy.alpha = 2;
// Perform c[i] = 2 * a[i] + b[i]
thrust::transform(a.begin(), a.end(), b.begin(), c.begin(), saxpy);

Thrust function objects

● Thrust provides a collection of predefined operators usable by Thrust algorithms like transform and reduce
● Available operator categories are:
  ● Arithmetic operators
  ● Comparison operators
  ● Logical operators
  ● Bitwise operators
  ● Generalized identity operators
● Example:

#include <thrust/functional.h>
#include <thrust/transform_reduce.h>
...
thrust::device_vector<double> x;
double zero = 0;
double norm;
// Initialize data
// Compute the norm of x using the thrust::square (x * x)
// and thrust::plus (x + y) operators
norm = std::sqrt(thrust::transform_reduce(x.begin(), x.end(),
                 thrust::square<double>(), zero, thrust::plus<double>()));


Thrust sorting routines

● Thrust provides sorting algorithms for the GPU, namely:
  ● thrust::sort for sorting an array
  ● thrust::sort_by_key for sorting an array by a key array
● Stable sorting routines are also available:
  ● thrust::stable_sort
  ● thrust::stable_sort_by_key
● To use the Thrust sorting algorithms the user needs to include:
  ● #include <thrust/sort.h>
● Example:

#include <thrust/sort.h>

thrust::device_vector<int> keys(4);
thrust::device_vector<double> values(4);
// Initialize data with:
// keys   = {3, 1, 2, 7}
// values = {1., 2., 3., 4.}
// Launch the thrust sort-by-key algorithm
thrust::sort_by_key(keys.begin(), keys.end(), values.begin());
// Values after sorting:
// keys   = {1, 2, 3, 7}
// values = {2., 3., 1., 4.}

Thrust fancy iterators

● Thrust provides a collection of special iterators to be used by Thrust algorithms, enhancing language programmability:
  ● thrust::constant_iterator<T> is an iterator with a constant value
  ● thrust::counting_iterator<T> is an iterator addressing a range of numbers
  ● thrust::zip_iterator is an iterator zipping two memory regions together into a single object of pairs
● Example:

#include <thrust/iterator/zip_iterator.h>
...
thrust::device_vector<int> A;
thrust::device_vector<float> B;
...
auto begin = thrust::make_zip_iterator(thrust::make_tuple(A.begin(), B.begin()));
auto end   = thrust::make_zip_iterator(thrust::make_tuple(A.end(),   B.end()));
thrust::maximum<thrust::tuple<int, float>> max_op;
auto init = thrust::make_tuple(INT_MIN, -FLT_MAX);   // identity element for the maximum
thrust::reduce(begin, end, init, max_op);


Thrust execution policies and streams

● Execution policies give users control over certain runtime execution decisions
● Common execution policies are:
  ● thrust::seq for sequential execution
  ● thrust::omp::par for parallel execution using the OpenMP backend
  ● thrust::cuda::par for CUDA execution
  ● thrust::host for host execution
  ● thrust::device for device execution
● Thrust allows the use of CUDA streams through the execution policy:
  ● thrust::cuda::par.on(stream):
    ● Sets the stream to be used by the Thrust algorithm
    ● Parameters:
      ● Stream to use
  ● To use this policy the user needs to include:
    ● #include <thrust/system/cuda/execution_policy.h>
● Example:

cudaStream_t stream;
...
thrust::sort(thrust::cuda::par.on(stream),
             dev_vector.begin(), dev_vector.end());

Lecture Contents
ØCourse Project (1), Problem 2: Walkthrough


Main Contents
üUsing CUDA, implement CNN inference and training (the latter is a bonus problem) and test on the MNIST dataset (URL: https://github.com/zalandoresearch/fashion-mnist); calling NVIDIA libraries is not allowed;
üTuning: compared with the NVIDIA libraries, what fraction of their performance does your implementation reach?
üDeadline: the midterm exam (October 25); multiple submissions are allowed before the deadline.
üEach student submits and is graded individually.
üReference code: https://github.com/KiveeDong/LeNet5-Matlab

Problem 2 (bonus): Training
üProblem 2: using CUDA, implement CNN training; the concrete implementation approach is not restricted (the network must be the same as in Problem 1);
üModules to include: gradient descent, the other model components related to the network, etc.;
üDataset: must be the same as in Problem 1;
üTraining code: must be submitted and will be assessed;
üMetrics: the accuracy obtained when running the Problem 1 inference code, and the running time;
üEvaluation platform: A100;
üEvaluators: teaching assistants;
üGrading:
üCorrect functionality: 10 points (full marks);


Training by SGD
üKey point / difficulty: back-propagating gradients through the fully connected layer, convolution layer, pooling layer, and activation layer
üReference material:
üCMU CS 10-301 + 10-601 Introduction to Machine Learning
https://www.cs.cmu.edu/~mgormley/courses/10601/
üCMU CS 11-785 Introduction to Deep Learning
https://deeplearning.cs.cmu.edu/F23/

Training by SGD
üInitialize: $\theta = 0$
üFor epoch $= 1, 2, \dots, I$:
ü  For $\langle x, y \rangle \in D$:
ü    $\theta = \theta - \eta \nabla_{\theta} L(x, y; \theta)$
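As a concrete illustration of the update rule (a minimal sketch, not part of the assignment's required interface), the per-parameter SGD step maps naturally onto one CUDA thread per weight:

// theta[i] = theta[i] - eta * grad[i] for every parameter, after backprop
// has filled grad with dL/dtheta for the current <x, y> sample
__global__ void sgd_step(float *theta, const float *grad, float eta, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        theta[i] -= eta * grad[i];
}

// Hypothetical launch: sgd_step<<<(n + 255) / 256, 256>>>(theta, grad, 0.01f, n);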


(Slides 81–98 of the original deck reproduce material from CMU CS 10-301 + 10-601 Introduction to Machine Learning and CMU CS 11-785 Introduction to Deep Learning; see the links above.)
