
CUDA C/C++ Streams and Concurrency

Steve Rennich, NVIDIA

Concurrency
The ability to perform multiple CUDA operations simultaneously (beyond multi-threaded parallelism):
- CUDA kernel <<<>>>
- cudaMemcpyAsync (HostToDevice)
- cudaMemcpyAsync (DeviceToHost)
- Operations on the CPU

Fermi architecture (compute capability 2.0+) can simultaneously support:
- Up to 16 CUDA kernels on the GPU
- 2 cudaMemcpyAsyncs (must be in different directions)
- Computation on the CPU

Streams
Stream: a sequence of operations that execute in issue-order on the GPU
- Programming model used to effect concurrency
- CUDA operations in different streams may run concurrently
- CUDA operations from different streams may be interleaved

Concurrency Example

Serial:
[Figure: timeline - cudaMemcpyAsync(H2D), Kernel <<<>>>, cudaMemcpyAsync(D2H) executed back-to-back]

Concurrent - overlap kernel and D2H copy (4 streams):
[Figure: timeline - cudaMemcpyAsync(H2D), then K1/DH1 through K4/DH4 overlapped across streams]
1.33x performance improvement

Amount of Concurrency

[Figure: timelines comparing levels of concurrency -
- Serial (1x): cudaMemcpyAsync(H2D), Kernel <<<>>>, cudaMemcpyAsync(D2H) back-to-back
- 2-way concurrency (up to 2x): HD1, K1/HD2, DH1/K2/HD3, DH2/K3, DH3/K4, DH4
- 3-way concurrency (up to 3x): H2D copies, kernels, and D2H copies all overlapped across streams
- 4-way concurrency (3x+): 3-way plus a kernel on the CPU
- 4+ way concurrency: several kernels per stream, e.g. HD1 K1.1 K1.2 K1.3 DH1, HD2 K2.1 K2.2 K2.3 DH2, ... HD6 K6.1 K6.2 K6.3 DH6]

Example - Tiled DGEMM
CPU (4-core Westmere x5670 @ 2.93 GHz, MKL): 43 Gflops
GPU (C2070), DGEMM: m=n=8192, k=288
- Serial: 125 Gflops (2.9x)
- 2-way: 177 Gflops (4.1x)
- 3-way: 262 Gflops (6.1x)
GPU + CPU
- 4-way con.: 282 Gflops (6.6x)
- Up to 330 Gflops for larger rank

[Figure: Nvidia Visual Profiler (nvvp) timeline showing the default stream, streams 1-4, and the CPU]

Obtain maximum performance by leveraging concurrency
- All communication hidden - effectively removes device memory size limitation

Default Stream (aka Stream '0')

- Stream used when no stream is specified
- Completely synchronous w.r.t. host and device
  - As if cudaDeviceSynchronize() were inserted before and after every CUDA operation
- Exceptions - asynchronous w.r.t. host:
  - Kernel launches in the default stream
  - cudaMemcpy*Async
  - cudaMemset*Async
  - cudaMemcpy within the same device
  - H2D cudaMemcpy of 64 kB or less
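A minimal sketch of these semantics, assuming a hypothetical scale kernel (error checking omitted):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float *d, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *h = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice); // synchronous: host waits

    // A kernel launch in the default stream returns immediately: it is
    // asynchronous w.r.t. the host, so CPU work here overlaps the kernel.
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    printf("host continues while kernel runs\n");

    // A plain cudaMemcpy is synchronous w.r.t. host and device: it waits
    // for the kernel to finish, and the host waits for the copy.
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[0] = %f\n", h[0]);

    cudaFree(d);
    free(h);
    return 0;
}
```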

Requirements for Concurrency

- CUDA operations must be in different, non-0, streams
- cudaMemcpyAsync with host from 'pinned' memory
  - Page-locked memory, allocated using cudaMallocHost() or cudaHostAlloc()
- Sufficient resources must be available
  - cudaMemcpyAsyncs in different directions
  - Device resources (SMEM, registers, blocks, etc.)
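A sketch of meeting these requirements - pinned host memory, a non-default stream, and a check of the device's copy-engine count (error handling omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Check copy-engine resources: 1 async engine allows one copy direction
    // at a time, 2 engines allow bidirectional overlap.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("asyncEngineCount = %d\n", prop.asyncEngineCount);

    const size_t size = 1 << 20;
    float *host1, *dev1;
    cudaMallocHost(&host1, size);   // page-locked (pinned) host memory
    cudaMalloc(&dev1, size);

    cudaStream_t stream1;
    cudaStreamCreate(&stream1);     // non-default stream

    // Only truly asynchronous because host1 is pinned; from pageable
    // memory this call would degrade to a synchronous copy.
    cudaMemcpyAsync(dev1, host1, size, cudaMemcpyHostToDevice, stream1);

    cudaStreamSynchronize(stream1);
    cudaStreamDestroy(stream1);
    cudaFreeHost(host1);
    cudaFree(dev1);
    return 0;
}
```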

Simple Example: Synchronous

    cudaMalloc ( &dev1, size ) ;
    double *host1 = (double*) malloc ( size ) ;
    ...
    cudaMemcpy ( dev1, host1, size, H2D ) ;            // completely synchronous
    kernel2 <<< grid, block, 0 >>> ( ..., dev2, ... ) ;
    kernel3 <<< grid, block, 0 >>> ( ..., dev3, ... ) ;
    cudaMemcpy ( host4, dev4, size, D2H ) ;
    ...

All CUDA operations in the default stream are synchronous

Simple Example: Asynchronous, No Streams

    cudaMalloc ( &dev1, size ) ;
    double *host1 = (double*) malloc ( size ) ;
    ...
    cudaMemcpy ( dev1, host1, size, H2D ) ;
    kernel2 <<< grid, block >>> ( ..., dev2, ... ) ;   // potentially
    some_CPU_method ();                                // overlapped
    kernel3 <<< grid, block >>> ( ..., dev3, ... ) ;
    cudaMemcpy ( host4, dev4, size, D2H ) ;
    ...

GPU kernels are asynchronous with host by default

Simple Example: Asynchronous with Streams

    cudaStream_t stream1, stream2, stream3, stream4 ;
    cudaStreamCreate ( &stream1 ) ;
    ...
    cudaMalloc ( &dev1, size ) ;
    cudaMallocHost ( &host1, size ) ;   // pinned memory required on host
    ...
    cudaMemcpyAsync ( dev1, host1, size, H2D, stream1 ) ;      // potentially
    kernel2 <<< grid, block, 0, stream2 >>> ( ..., dev2, ... ) ; // overlapped
    kernel3 <<< grid, block, 0, stream3 >>> ( ..., dev3, ... ) ;
    cudaMemcpyAsync ( host4, dev4, size, D2H, stream4 ) ;
    some_CPU_method ();
    ...

Fully asynchronous / concurrent
Data used by concurrent operations should be independent

Explicit Synchronization

Synchronize everything:
- cudaDeviceSynchronize () - blocks host until all issued CUDA calls are complete

Synchronize w.r.t. a specific stream:
- cudaStreamSynchronize ( streamid ) - blocks host until all CUDA calls in streamid are complete

Synchronize using Events:
- Create specific 'Events', within streams, to use for synchronization
- cudaEventRecord ( event, streamid )
- cudaEventSynchronize ( event )
- cudaStreamWaitEvent ( stream, event )
- cudaEventQuery ( event )

Explicit Synchronization Example

Resolve an inter-stream dependency using an event:

    {
        cudaEvent_t event ;
        cudaEventCreate ( &event ) ;                           // create event

        cudaMemcpyAsync ( d_in, in, size, H2D, stream1 ) ;     // 1) H2D copy of new input
        cudaEventRecord ( event, stream1 ) ;                   // record event

        cudaMemcpyAsync ( out, d_out, size, D2H, stream2 ) ;   // 2) D2H copy of previous result

        cudaStreamWaitEvent ( stream2, event ) ;               // wait for event in stream1
        kernel <<< , , , stream2 >>> ( d_in, d_out ) ;         // 3) must wait for 1 and 2

        asynchronousCPUmethod ( ... ) ;                        // Async CPU method
    }

Implicit Synchronization

These operations implicitly synchronize all other CUDA operations:
- Page-locked memory allocation
  - cudaMallocHost
  - cudaHostAlloc
- Device memory allocation
  - cudaMalloc
- Non-Async versions of memory operations
  - cudaMemcpy* (no Async suffix)
  - cudaMemset* (no Async suffix)
- Change to L1/shared memory configuration
  - cudaDeviceSetCacheConfig

Stream Scheduling

Fermi hardware has 3 queues:
- 1 Compute Engine queue
- 2 Copy Engine queues - one for H2D and one for D2H

CUDA operations are dispatched to hardware in the sequence they were issued
- Placed in the relevant queue
- Stream dependencies between engine queues are maintained, but lost within an engine queue

A CUDA operation is dispatched from the engine queue if:
- Preceding calls in the same stream have completed,
- Preceding calls in the same queue have been dispatched, and
- Resources are available

CUDA kernels may be executed concurrently if they are in different streams
- Thread blocks for a given kernel are scheduled if all thread blocks for preceding kernels have been scheduled and there are still SM resources available

Note: a blocked operation blocks all other operations in the queue, even in other streams

Example - Blocked Queue

Two streams, stream 1 is issued first:
- Stream 1: HDa1, HDb1, K1, DH1 (issued first)
- Stream 2: DH2 (completely independent of stream 1)

[Figure: H2D, compute, and D2H queues vs. execution timeline. CUDA operations get added to queues in issue order; signals between queues enforce synchronization, but within queues stream dependencies are lost. DH1 blocks the completely independent DH2. Runtime = 5]

Example - Blocked Queue

Two streams, stream 2 is issued first:
- Stream 1: HDa1, HDb1, K1, DH1
- Stream 2: DH2 (issued first)

Issue order matters!

[Figure: same queues as above, but DH2 is enqueued ahead of DH1, so it runs concurrently with the stream-1 work. Runtime = 4]

Example - Blocked Kernel

Two streams - just issuing CUDA kernels:
- Stream 1: Ka1, Kb1
- Stream 2: Ka2, Kb2
- Kernels are similar size, fill 1/2 of the SM resources

Issue order matters!

[Figure: compute queue vs. execution timeline -
- Issue depth first (Ka1, Kb1, Ka2, Kb2): Kb1 blocks Ka2 in the queue. Runtime = 3
- Issue breadth first (Ka1, Ka2, Kb1, Kb2): kernels from different streams run concurrently. Runtime = 2]

Example - Optimal Concurrency can Depend on Kernel Execution Time

Two streams - just issuing CUDA kernels - but kernels are different 'sizes':
- Stream 1: Ka1 {2}, Kb1 {1}
- Stream 2: Kc2 {1}, Kd2 {2}
- Kernels fill 1/2 of the SM resources

Issue order matters! Execution time matters!

[Figure: compute queue vs. execution timeline for three issue orders -
- Depth first (Ka1, Kb1, Kc2, Kd2): runtime = 5
- Breadth first (Ka1, Kc2, Kb1, Kd2): runtime = 4
- Custom (Ka1, Kc2, Kd2, Kb1): runtime = 3]

Concurrent Kernel Scheduling

Concurrent kernel scheduling is special:
- Normally, a signal is inserted into the queues, after the operation, to launch the next operation in the same stream
- For the compute engine queue, to enable concurrent kernels, when compute kernels are issued sequentially, this signal is delayed until after the last sequential compute kernel
- In some situations this delay of signals can block other queues

Example - Concurrent Kernels and Blocking

Three streams, each performing (HD, K, DH); breadth-first issue:
- Program order: HD1 HD2 HD3 K1 K2 K3 DH1 DH2 DH3
- Sequentially issued kernels delay signals and block cudaMemcpy(D2H)

[Figure: H2D, compute, and D2H queues vs. execution timeline. Signals between sequentially issued kernels are delayed, blocking the D2H queue. Runtime = 7]

Example - Concurrent Kernels and Blocking

Three streams, each performing (HD, K, DH); depth-first issue:
- Program order: HD1 K1 DH1 HD2 K2 DH2 HD3 K3 DH3
- 'Usually' best for Fermi

[Figure: H2D, compute, and D2H queues vs. execution timeline. Kernels are no longer issued sequentially, so the D2H copies are not blocked. Runtime = 5]
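The depth-first pattern above can be sketched as a single loop over streams; the kernel name, chunk size, and buffers here are hypothetical:

```cuda
// Depth-first issue: each stream's full HD -> K -> DH pipeline is issued
// before moving to the next stream, so compute kernels are not adjacent
// in the compute queue and the delayed-signal blocking above is avoided.
for (int s = 0; s < nStreams; ++s) {
    size_t off = s * chunk;
    cudaMemcpyAsync(d_buf + off, h_in + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, stream[s]);          // HDs
    kernel<<<grid, block, 0, stream[s]>>>(d_buf + off, chunk);   // Ks
    cudaMemcpyAsync(h_out + off, d_buf + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, stream[s]);          // DHs
}
```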

Previous Architectures

Compute Capability 1.0
- Support for GPU / CPU concurrency

Compute Capability 1.1 (i.e. C1060)
- Adds support for asynchronous memcopies (single engine)
  (some exceptions - check using the asyncEngineCount device property)

Compute Capability 2.0 (i.e. C2050)
- Adds support for concurrent GPU kernels
  (some exceptions - check using the concurrentKernels device property)
- Adds second copy engine to support bidirectional memcopies
  (some exceptions - check using the asyncEngineCount device property)

Additional Details

- It is difficult to get more than 4 kernels to run concurrently
- Concurrency can be disabled with the environment variable CUDA_LAUNCH_BLOCKING
- cudaStreamQuery can be used to separate sequential kernels and prevent delaying signals
- Kernels using more than 8 textures cannot run concurrently
- Switching L1/Shared configuration will break concurrency
- To run concurrently, CUDA operations must have no more than 62 intervening CUDA operations
  - That is, in 'issue order' they must not be separated by more than 62 other issues
  - Further operations are serialized
- cudaEvent_t is useful for timing, but for performance use
  cudaEventCreateWithFlags ( &event, cudaEventDisableTiming )
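A sketch contrasting the two event uses - timing with default events versus a synchronization-only event created with cudaEventDisableTiming; the kernel, its arguments, and the streams are assumed to exist:

```cuda
// Timing: default events record timestamps that can be read back with
// cudaEventElapsedTime.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, stream1);
kernel<<<grid, block, 0, stream1>>>(d_data, n);   // hypothetical kernel
cudaEventRecord(stop, stream1);
cudaEventSynchronize(stop);
float ms;
cudaEventElapsedTime(&ms, start, stop);

// Performance: when an event exists only to order streams, create it with
// cudaEventDisableTiming so no timestamp is taken.
cudaEvent_t sync;
cudaEventCreateWithFlags(&sync, cudaEventDisableTiming);
cudaEventRecord(sync, stream1);
cudaStreamWaitEvent(stream2, sync, 0);            // stream2 waits on stream1
```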

Concurrency Guidelines

- Code to the programming model - streams
  - Future devices will continually improve the hardware representation of the streams model
- Pay attention to issue order
  - Can make a difference
- Pay attention to resources and operations which can break concurrency
  - Anything in the default stream
  - Events & synchronization
  - Stream queries
  - L1/Shared configuration changes
  - 8+ textures
- Use tools (Visual Profiler, Parallel Nsight) to visualize concurrency
  - (but these don't currently show concurrent kernels)

Thank You

Questions
1. Verify: with cudaMemcpyAsync() followed by Kernel <<<>>> in Stream 0, the memcpy will block the kernel, but neither will block the host.
2. Do the following operations (or similar) contribute to the 64 'out-of-issue-order' limit?
   - cudaStreamQuery
   - cudaWaitEvent
3. I understand that the 'query' operations cudaStreamQuery() can be placed in either the engine or the copy queues, that which queue any query actually goes in is difficult (?) to determine, and that this can result in some blocking.
