Professional Documents
Culture Documents
Concurrency
The ability to perform multiple CUDA operations simultaneously
(beyond multi-threaded parallelism) CUDA Kernel !!! cuda"emcpyAsync (#ostToDe$ice) cuda"emcpyAsync (De$iceTo#ost) %perations on the C&U
Streams
Stream A se/uence of operations that e0ecute in issue-order on the .&U &ro1rammin1 model used to effect concurrency CUDA operations in different streams may run concurrently CUDA operations from different streams may be interlea$ed
Concurrency 20ample
Serial
cudaMemcpyAsync(H2D) Ke ne!"""### cudaMemcpyAsync(D2H)
time
st eams
time
Amount of Concurrency
Serial (1%)
cudaMemcpyAsync(H2D)
K1
HD2
DH1
K2
HD3
DH2
K3
DH3
K4 'n )*+
K1
DH1
K2
DH2
K3
DH3
K4
DH4
6+ 4ay concurrency
HD1 K1$1 K1$2 K1$3 DH1 HD2 K2$1 K2$2 K2$3 DH2 HD3 K3$1 K3$2 K3$3 DH3 HD4 K4$1 K4$2 K4$3 DH4 HD- K-$1 K-$2 K-$3 DHHD. K.$1 K.$2 K.$3 DH.
K1
HD2
DH1
K2
HD3
DH2
K3
HD4
DH3
K4
.&U + C&U
6-4ay con)7 (;( .flops (,),0) Up to 55* .flops for lar1er ran-
%btain ma0imum performance by le$era1in1 concurrency All communication hidden 3 effecti$ely remo$es de$ice memory si<e limitation
cudaMa!!'c ( <dev15 si4e ) = d'u>!e? h'st1 9 (d'u>!e?) ma!!'c ( <h'st15 si4e ) = @ cudaMemcpy ( dev15 h'st15 si4e5 H2D ) = ;e ne!2 """ A id5 >!'c;5 0 ### ( @5 dev25 @ ) = ;e ne!3 """ A id5 >!'c;5 0 ### ( @5 dev35 @ ) = cudaMemcpy ( h'st45 dev45 si4e5 D2H ) = $$$
completely synchronous
cudaMa!!'c ( <dev15 si4e ) = d'u>!e? h'st1 9 (d'u>!e?) ma!!'c ( <h'st15 si4e ) = @ cudaMemcpy ( dev15 h'st15 si4e5 H2D ) = ;e ne!2 """ A id5 >!'c; ### ( @5 dev25 @ ) = s'meB)*+Bmeth'd ()= ;e ne!3 """ A id5 >!'c; ### ( @5 dev35 @ ) = cudaMemcpy ( h'st45 dev45 si4e5 D2H ) = $$$
potentially o$erlapped
cudaSt eamBt st eam15 st eam25 st eam35 st eam4 = cudaSt eam) eate ( <st eam1) = $$$ cudaMa!!'c ( <dev15 si4e ) = cudaMa!!'cH'st ( <h'st15 si4e ) = @ cudaMemcpyAsync ( dev15 h'st15 si4e5 H2D5 st eam1 ) = ;e ne!2 """ A id5 >!'c;5 05 st eam2 ### ( @5 dev25 @ ) = ;e ne!3 """ A id5 >!'c;5 05 st eam3 ### ( @5 dev35 @ ) = cudaMemcpyAsync ( h'st45 dev45 si4e5 D2H5 st eam4 ) = s'meB)*+Bmeth'd ()= $$$
potentially o$erlapped
20plicit Synchroni<ation
Synchroni<e e$erythin1
cudaDe$iceSynchroni<e () Aloc-s host until all issued CUDA calls are complete
Fmplicit Synchroni<ation
These operations implicitly synchroni<e all other CUDA operations
&a1e-loc-ed memory allocation
cuda"alloc#ost cuda#ostAlloc
Stream Schedulin1
'ermi hard4are has 5 /ueues
+ Compute 2n1ine /ueue ( Copy 2n1ine /ueues 3 one for #(D and one for D(#
=ote a bloc-ed operation bloc-s all other operations in the /ueueC e$en in other streams
D2H Dueue
e%ecuti'n
HDa1
issue ' de
HDa1 HD>1
)+DA 'pe ati'ns Aet added t' Dueues in issue ' de
K1
DH1
time
HDa1 HD>1 K1
DH2
SiAna!s >etFeen Dueues en&' ce synch 'ni4ati'n
DH1 DH2
untime 9 -
D2H Dueue
e%ecuti'n
concurrent
DH2
issue ' de
HDa1 HD>1
)+DA 'pe ati'ns Aet added t' Dueues in issue ' de
K1
DH2
time
HDa1 HD>1 K1
DH2
DH1
SiAna!s >etFeen Dueues en&' ce synch 'ni4ati'n
DH1
untime 9 4
Ka1
issue ' de time
Ka1
issue ' de
Ka1 K>1
Ka2 K>2
K>1
untime 9 3
untime 9 2
Depth first
c'mpute Dueue issue ' de e%ecuti'n
Areadth first
c'mpute Dueue issue ' de
Ka1 K>1
Kc2 Kd2
Ka1
time
Ka1 K>1
Kc2 Kd2
untime 9 -
untime 9 4
untime 9 3
D2H Dueue
e%ecuti'n
K1 K2 K3
DH1
time
issue ' de
DH2 DH3
K1 K2 K3
>!'c;inA
D2H Dueue
e%ecuti'n
K1 K2 K3
Ke ne!s n' !'nAe issued seDuentia!!y
DH1
time
issue ' de
DH2 DH3
K1 K2 K3
untime 9 -
&re$ious Architectures
Compute Capability +)*+
Support for .&U / C&U concurrency
Additional Details
Ft is difficult to 1et more than 6 -ernels to run concurrently Concurrency can be disabled 4ith en$ironment $ariable
CUDAMGAU=C#MAG%CKF=.
cudaStreamEuery can be used to separate se/uential -ernels and pre$ent delayin1 si1nals Kernels usin1 more than ; te0tures cannot run concurrently S4itchin1 G+/Shared confi1uration 4ill brea- concurrency To run concurrentlyC CUDA operations must ha$e no more than ,( inter$enin1 CUDA operations
That isC in ?issue order? they must not be separated by more than ,( other issues 'urther operations are seriali<ed
Concurrency .uidelines
Code to pro1rammin1 model 3 Streams
'uture de$ices 4ill continually impro$e #D representation of streams model
Than-
Oou
Euestions
+) >erify that cuda"emcpyAsync() follo4ed by Kernel !!! in Stream*C the memcpy 4ill bloc- the -ernelC but neither 4ill bloc- the host) () Do the follo4in1 operations (or similar) contribute to the ,6 ?out-ofissue-order? limitP
CudaStreamEuery cudaDait2$ent
5) F understand that the ?/uery? operations cudaStraemEuery() can be placed in either the en1ine or the copy /ueuesC that 4hich /ueue any /uery actually 1oes in is difficult (P) to determineC and that this can result in some bloc-in1)