
CUDA C/C++ Streams and Concurrency

Steve Rennich, NVIDIA

Concurrency
The ability to perform multiple CUDA operations simultaneously (beyond multi-threaded parallelism):
- CUDA kernel <<<>>>
- cudaMemcpyAsync (HostToDevice)
- cudaMemcpyAsync (DeviceToHost)
- Operations on the CPU

Fermi architecture (compute capability 2.0+) can simultaneously support:
- Up to 16 CUDA kernels on the GPU
- 2 cudaMemcpyAsyncs (must be in different directions)
- Computation on the CPU

Streams
Stream: a sequence of operations that execute in issue-order on the GPU
- Programming model used to effect concurrency
- CUDA operations in different streams may run concurrently
- CUDA operations from different streams may be interleaved

Concurrency Example

Serial:
[Figure: timeline - cudaMemcpyAsync(H2D), Kernel <<<>>>, cudaMemcpyAsync(D2H) executed back-to-back]

Concurrent - overlap kernel and D2H copy (4 streams):
[Figure: timeline - cudaMemcpyAsync(H2D), then K1/DH1 through K4/DH4 overlapped across streams]
1.33x performance improvement

Amount of Concurrency

[Figure: timelines comparing levels of concurrency -
- Serial (1x): cudaMemcpyAsync(H2D), Kernel <<<>>>, cudaMemcpyAsync(D2H) back-to-back
- 2-way concurrency (up to 2x): HD1, K1/HD2, DH1/K2/HD3, DH2/K3, DH3/K4, DH4
- 3-way concurrency (up to 3x): H2D copies, kernels, and D2H copies all overlapped across streams
- 4-way concurrency (3x+): 3-way plus a kernel on the CPU
- 4+ way concurrency: several kernels per stream, e.g. HD1 K1.1 K1.2 K1.3 DH1, HD2 K2.1 K2.2 K2.3 DH2, ... HD6 K6.1 K6.2 K6.3 DH6]

Example - Tiled DGEMM
CPU (4-core Westmere x5670 @ 2.93 GHz, MKL): 43 Gflops
GPU (C2070), DGEMM: m=n=8192, k=288
- Serial: 125 Gflops (2.9x)
- 2-way: 177 Gflops (4.1x)
- 3-way: 262 Gflops (6.1x)
GPU + CPU
- 4-way con.: 282 Gflops (6.6x)
- Up to 330 Gflops for larger rank

[Figure: Nvidia Visual Profiler (nvvp) timeline showing the default stream, streams 1-4, and the CPU]

Obtain maximum performance by leveraging concurrency
- All communication hidden - effectively removes device memory size limitation

Default Stream (aka Stream '0')

- Stream used when no stream is specified
- Completely synchronous w.r.t. host and device
  - As if cudaDeviceSynchronize() were inserted before and after every CUDA operation
- Exceptions - asynchronous w.r.t. host:
  - Kernel launches in the default stream
  - cudaMemcpy*Async
  - cudaMemset*Async
  - cudaMemcpy within the same device
  - H2D cudaMemcpy of 64 kB or less
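A minimal sketch of these semantics, assuming a hypothetical scale kernel (error checking omitted):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float *d, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *h = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice); // synchronous: host waits

    // A kernel launch in the default stream returns immediately: it is
    // asynchronous w.r.t. the host, so CPU work here overlaps the kernel.
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    printf("host continues while kernel runs\n");

    // A plain cudaMemcpy is synchronous w.r.t. host and device: it waits
    // for the kernel to finish, and the host waits for the copy.
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0] = %f\n", h[0]);

    cudaFree(d);
    free(h);
    return 0;
}
```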

Requirements for Concurrency

- CUDA operations must be in different, non-0, streams
- cudaMemcpyAsync with host from 'pinned' memory
  - Page-locked memory, allocated using cudaMallocHost() or cudaHostAlloc()
- Sufficient resources must be available
  - cudaMemcpyAsyncs in different directions
  - Device resources (SMEM, registers, blocks, etc.)
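A sketch of meeting these requirements - pinned host memory, a non-default stream, and a check of the device's copy-engine count (error handling omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Check copy-engine resources: 1 async engine allows one copy direction
    // at a time, 2 engines allow bidirectional overlap.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("asyncEngineCount = %d\n", prop.asyncEngineCount);

    const size_t size = 1 << 20;
    float *host1, *dev1;
    cudaMallocHost(&host1, size);   // page-locked (pinned) host memory
    cudaMalloc(&dev1, size);

    cudaStream_t stream1;
    cudaStreamCreate(&stream1);     // non-default stream

    // Only truly asynchronous because host1 is pinned; from pageable
    // memory this call would degrade to a synchronous copy.
    cudaMemcpyAsync(dev1, host1, size, cudaMemcpyHostToDevice, stream1);

    cudaStreamSynchronize(stream1);
    cudaStreamDestroy(stream1);
    cudaFreeHost(host1);
    cudaFree(dev1);
    return 0;
}
```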

Simple Example: Synchronous

    cudaMalloc ( &dev1, size ) ;
    double *host1 = (double*) malloc ( size ) ;
    ...
    cudaMemcpy ( dev1, host1, size, H2D ) ;            // completely synchronous
    kernel2 <<< grid, block, 0 >>> ( ..., dev2, ... ) ;
    kernel3 <<< grid, block, 0 >>> ( ..., dev3, ... ) ;
    cudaMemcpy ( host4, dev4, size, D2H ) ;
    ...

All CUDA operations in the default stream are synchronous

Simple Example: Asynchronous, No Streams

    cudaMalloc ( &dev1, size ) ;
    double *host1 = (double*) malloc ( size ) ;
    ...
    cudaMemcpy ( dev1, host1, size, H2D ) ;
    kernel2 <<< grid, block >>> ( ..., dev2, ... ) ;   // potentially
    some_CPU_method ();                                // overlapped
    kernel3 <<< grid, block >>> ( ..., dev3, ... ) ;
    cudaMemcpy ( host4, dev4, size, D2H ) ;
    ...

GPU kernels are asynchronous with host by default

Simple Example: Asynchronous with Streams

    cudaStream_t stream1, stream2, stream3, stream4 ;
    cudaStreamCreate ( &stream1 ) ;
    ...
    cudaMalloc ( &dev1, size ) ;
    cudaMallocHost ( &host1, size ) ;   // pinned memory required on host
    ...
    cudaMemcpyAsync ( dev1, host1, size, H2D, stream1 ) ;      // potentially
    kernel2 <<< grid, block, 0, stream2 >>> ( ..., dev2, ... ) ; // overlapped
    kernel3 <<< grid, block, 0, stream3 >>> ( ..., dev3, ... ) ;
    cudaMemcpyAsync ( host4, dev4, size, D2H, stream4 ) ;
    some_CPU_method ();
    ...

Fully asynchronous / concurrent
Data used by concurrent operations should be independent

Explicit Synchronization

Synchronize everything:
- cudaDeviceSynchronize () - blocks host until all issued CUDA calls are complete

Synchronize w.r.t. a specific stream:
- cudaStreamSynchronize ( streamid ) - blocks host until all CUDA calls in streamid are complete

Synchronize using Events:
- Create specific 'Events', within streams, to use for synchronization
- cudaEventRecord ( event, streamid )
- cudaEventSynchronize ( event )
- cudaStreamWaitEvent ( stream, event )
- cudaEventQuery ( event )

Explicit Synchronization Example

Resolve an inter-stream dependency using an event:

    {
        cudaEvent_t event ;
        cudaEventCreate ( &event ) ;                           // create event

        cudaMemcpyAsync ( d_in, in, size, H2D, stream1 ) ;     // 1) H2D copy of new input
        cudaEventRecord ( event, stream1 ) ;                   // record event

        cudaMemcpyAsync ( out, d_out, size, D2H, stream2 ) ;   // 2) D2H copy of previous result

        cudaStreamWaitEvent ( stream2, event ) ;               // wait for event in stream1
        kernel <<< , , , stream2 >>> ( d_in, d_out ) ;         // 3) must wait for 1 and 2

        asynchronousCPUmethod ( ... ) ;                        // Async CPU method
    }

Implicit Synchronization

These operations implicitly synchronize all other CUDA operations:
- Page-locked memory allocation
  - cudaMallocHost
  - cudaHostAlloc
- Device memory allocation
  - cudaMalloc
- Non-Async versions of memory operations
  - cudaMemcpy* (no Async suffix)
  - cudaMemset* (no Async suffix)
- Change to L1/shared memory configuration
  - cudaDeviceSetCacheConfig

Stream Scheduling

Fermi hardware has 3 queues:
- 1 Compute Engine queue
- 2 Copy Engine queues - one for H2D and one for D2H

CUDA operations are dispatched to hardware in the sequence they were issued
- Placed in the relevant queue
- Stream dependencies between engine queues are maintained, but lost within an engine queue

A CUDA operation is dispatched from the engine queue if:
- Preceding calls in the same stream have completed,
- Preceding calls in the same queue have been dispatched, and
- Resources are available

CUDA kernels may be executed concurrently if they are in different streams
- Thread blocks for a given kernel are scheduled if all thread blocks for preceding kernels have been scheduled and there are still SM resources available

Note: a blocked operation blocks all other operations in the queue, even in other streams

Example - Blocked Queue

Two streams, stream 1 is issued first:
- Stream 1: HDa1, HDb1, K1, DH1 (issued first)
- Stream 2: DH2 (completely independent of stream 1)

[Figure: H2D, compute, and D2H queues vs. execution timeline. CUDA operations get added to queues in issue order; signals between queues enforce synchronization, but within queues stream dependencies are lost. DH1 blocks the completely independent DH2. Runtime = 5]

Example - Blocked Queue

Two streams, stream 2 is issued first:
- Stream 1: HDa1, HDb1, K1, DH1
- Stream 2: DH2 (issued first)

Issue order matters!

[Figure: same queues as above, but DH2 is enqueued ahead of DH1, so it runs concurrently with the stream-1 work. Runtime = 4]

Example - Blocked Kernel

Two streams - just issuing CUDA kernels:
- Stream 1: Ka1, Kb1
- Stream 2: Ka2, Kb2
- Kernels are similar size, fill 1/2 of the SM resources

Issue order matters!

[Figure: compute queue vs. execution timeline -
- Issue depth first (Ka1, Kb1, Ka2, Kb2): Kb1 blocks Ka2 in the queue. Runtime = 3
- Issue breadth first (Ka1, Ka2, Kb1, Kb2): kernels from different streams run concurrently. Runtime = 2]

Example - Optimal Concurrency can Depend on Kernel Execution Time

Two streams - just issuing CUDA kernels - but kernels are different 'sizes':
- Stream 1: Ka1 {2}, Kb1 {1}
- Stream 2: Kc2 {1}, Kd2 {2}
- Kernels fill 1/2 of the SM resources

Issue order matters! Execution time matters!

[Figure: compute queue vs. execution timeline for three issue orders -
- Depth first (Ka1, Kb1, Kc2, Kd2): runtime = 5
- Breadth first (Ka1, Kc2, Kb1, Kd2): runtime = 4
- Custom (Ka1, Kc2, Kd2, Kb1): runtime = 3]

Concurrent Kernel Scheduling

Concurrent kernel scheduling is special:
- Normally, a signal is inserted into the queues, after the operation, to launch the next operation in the same stream
- For the compute engine queue, to enable concurrent kernels, when compute kernels are issued sequentially, this signal is delayed until after the last sequential compute kernel
- In some situations this delay of signals can block other queues

Example - Concurrent Kernels and Blocking

Three streams, each performing (HD, K, DH); breadth-first issue:
- Program order: HD1 HD2 HD3 K1 K2 K3 DH1 DH2 DH3
- Sequentially issued kernels delay signals and block cudaMemcpy(D2H)

[Figure: H2D, compute, and D2H queues vs. execution timeline. Signals between sequentially issued kernels are delayed, blocking the D2H queue. Runtime = 7]

Example - Concurrent Kernels and Blocking

Three streams, each performing (HD, K, DH); depth-first issue:
- Program order: HD1 K1 DH1 HD2 K2 DH2 HD3 K3 DH3
- 'Usually' best for Fermi

[Figure: H2D, compute, and D2H queues vs. execution timeline. Kernels are no longer issued sequentially, so the D2H copies are not blocked. Runtime = 5]
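The depth-first pattern above can be sketched as a single loop over streams; the kernel name, chunk size, and buffers here are hypothetical:

```cuda
// Depth-first issue: each stream's full HD -> K -> DH pipeline is issued
// before moving to the next stream, so compute kernels are not adjacent
// in the compute queue and the delayed-signal blocking above is avoided.
for (int s = 0; s < nStreams; ++s) {
    size_t off = s * chunk;
    cudaMemcpyAsync(d_buf + off, h_in + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, stream[s]);          // HDs
    kernel<<<grid, block, 0, stream[s]>>>(d_buf + off, chunk);   // Ks
    cudaMemcpyAsync(h_out + off, d_buf + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, stream[s]);          // DHs
}
```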

Previous Architectures

Compute Capability 1.0
- Support for GPU / CPU concurrency

Compute Capability 1.1 (i.e. C1060)
- Adds support for asynchronous memcopies (single engine)
  (some exceptions - check using the asyncEngineCount device property)

Compute Capability 2.0 (i.e. C2050)
- Adds support for concurrent GPU kernels
  (some exceptions - check using the concurrentKernels device property)
- Adds second copy engine to support bidirectional memcopies
  (some exceptions - check using the asyncEngineCount device property)

Additional Details

- It is difficult to get more than 4 kernels to run concurrently
- Concurrency can be disabled with the environment variable CUDA_LAUNCH_BLOCKING
- cudaStreamQuery can be used to separate sequential kernels and prevent delaying signals
- Kernels using more than 8 textures cannot run concurrently
- Switching L1/Shared configuration will break concurrency
- To run concurrently, CUDA operations must have no more than 62 intervening CUDA operations
  - That is, in 'issue order' they must not be separated by more than 62 other issues
  - Further operations are serialized
- cudaEvent_t is useful for timing, but for performance use
  cudaEventCreateWithFlags ( &event, cudaEventDisableTiming )
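A sketch contrasting the two event uses - timing with default events versus a synchronization-only event created with cudaEventDisableTiming; the kernel, its arguments, and the streams are assumed to exist:

```cuda
// Timing: default events record timestamps that can be read back with
// cudaEventElapsedTime.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, stream1);
kernel<<<grid, block, 0, stream1>>>(d_data, n);   // hypothetical kernel
cudaEventRecord(stop, stream1);
cudaEventSynchronize(stop);
float ms;
cudaEventElapsedTime(&ms, start, stop);

// Performance: when an event exists only to order streams, create it with
// cudaEventDisableTiming so no timestamp is taken.
cudaEvent_t sync;
cudaEventCreateWithFlags(&sync, cudaEventDisableTiming);
cudaEventRecord(sync, stream1);
cudaStreamWaitEvent(stream2, sync, 0);            // stream2 waits on stream1
```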

Concurrency Guidelines

- Code to the programming model - streams
  - Future devices will continually improve the hardware representation of the streams model
- Pay attention to issue order
  - Can make a difference
- Pay attention to resources and operations which can break concurrency
  - Anything in the default stream
  - Events & synchronization
  - Stream queries
  - L1/Shared configuration changes
  - 8+ textures
- Use tools (Visual Profiler, Parallel Nsight) to visualize concurrency
  - (but these don't currently show concurrent kernels)

Thank You

Questions
1. Verify: with cudaMemcpyAsync() followed by Kernel <<<>>> in Stream 0, the memcpy will block the kernel, but neither will block the host.
2. Do the following operations (or similar) contribute to the 64 'out-of-issue-order' limit?
   - cudaStreamQuery
   - cudaWaitEvent
3. I understand that the 'query' operations cudaStreamQuery() can be placed in either the engine or the copy queues, that which queue any query actually goes in is difficult (?) to determine, and that this can result in some blocking.
