CUDA by Example



CUDA by Example
AN INTRODUCTION TO
GENERAL-PURPOSE
GPU PROGRAMMING

JASON SANDERS
EDWARD KANDROT

Upper Saddle River, NJ • Boston • Indianapolis • San Francisco
New York • Toronto • Montreal • London • Munich • Paris • Madrid
Capetown • Sydney • Tokyo • Singapore • Mexico City



Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals.
The authors and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.
NVIDIA makes no warranty or representation that the techniques described herein are free from any Intellectual Property claims. The reader assumes all risk of any such claims based on his or her use of these techniques.
The publisher offers excellent discounts on this book when ordered in quantity for bulk purchases or special sales, which may include electronic versions and/or custom covers and content particular to your business, training goals, marketing focus, and branding interests. For more information, please contact:
U.S. Corporate and Government Sales
(800) 382-3419
corpsales@pearsontechgroup.com
For sales outside the United States, please contact:
International Sales
international@pearson.com
Visit us on the Web: informit.com/aw
Library of Congress Cataloging-in-Publication Data
Sanders, Jason.
CUDA by example : an introduction to general-purpose GPU programming / Jason Sanders, Edward Kandrot.
p. cm.
Includes index.
ISBN (pbk. : alk. paper)
1. Application software--Development. 2. Computer architecture.
3. Parallel programming (Computer science) I. Kandrot, Edward. II. Title.

Copyright © NVIDIA Corporation
All rights reserved. Printed in the United States of America. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, write to:
Pearson Education, Inc.
Rights and Contracts Department
Boylston Street, Suite
Boston, MA
Fax
ISBN
ISBN
Text printed in the United States on recycled paper at Edwards Brothers in Ann Arbor, Michigan.
First printing, July



To our families and friends, who gave us endless support.
To our readers, who will bring us the future.
And to the teachers who taught our readers to read.



Contents

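Foreword . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii

About the Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix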

1 WHY CUDA? WHY NOW? 1

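1.1 Chapter Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 The Age of Parallel Processing . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Central Processing Units . . . . . . . . . . . . . . . . . . . . . . 2
1.3 The Rise of GPU Computing . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.1 A Brief History of GPUs . . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 Early GPU Computing . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 CUDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4.1 What Is the CUDA Architecture? . . . . . . . . . . . . . . . . . 7
1.4.2 Using the CUDA Architecture . . . . . . . . . . . . . . . . . . . 7
1.5 Applications of CUDA . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5.1 Medical Imaging . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5.2 Computational Fluid Dynamics . . . . . . . . . . . . . . . . . . 9
1.5.3 Environmental Science . . . . . . . . . . . . . . . . . . . . . . . 10
1.6 Chapter Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11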


2 GETTING STARTED 13

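2.1 Chapter Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Development Environment . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 CUDA-Enabled Graphics Processors . . . . . . . . . . . . . . . . 14
2.2.2 NVIDIA Device Driver . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.3 CUDA Development Toolkit . . . . . . . . . . . . . . . . . . . . 16
2.2.4 Standard C Compiler . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Chapter Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19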

3 INTRODUCTION TO CUDA C 21

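3.1 Chapter Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 A First Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.1 Hello, World! . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.2 A Kernel Call . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.3 Passing Parameters . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Querying Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 Using Device Properties . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5 Chapter Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35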

4 PARALLEL PROGRAMMING IN CUDA C 37

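4.1 Chapter Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2 CUDA Parallel Programming . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.1 Summing Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.2 A Fun Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.3 Chapter Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57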


5 THREAD COOPERATION 59

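5.1 Chapter Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.2 Splitting Parallel Blocks . . . . . . . . . . . . . . . . . . . . . . . . 60
5.2.1 Vector Sums: Redux . . . . . . . . . . . . . . . . . . . . . . . . 60
5.2.2 GPU Ripple Using Threads . . . . . . . . . . . . . . . . . . . . . 69
5.3 Shared Memory and Synchronization . . . . . . . . . . . . . . . . . . . 75
5.3.1 Dot Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3.2 Dot Product Optimized (Incorrectly) . . . . . . . . . . . . . . . 87
5.3.3 Shared Memory Bitmap . . . . . . . . . . . . . . . . . . . . . . 90
5.4 Chapter Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94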

6 CONSTANT MEMORY AND EVENTS 95

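6.1 Chapter Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2 Constant Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2.1 Ray Tracing Introduction . . . . . . . . . . . . . . . . . . . . . . 96
6.2.2 Ray Tracing on the GPU . . . . . . . . . . . . . . . . . . . . . . 98
6.2.3 Ray Tracing with Constant Memory . . . . . . . . . . . . . . . . 104
6.2.4 Performance with Constant Memory . . . . . . . . . . . . . . . 106
6.3 Measuring Performance with Events . . . . . . . . . . . . . . . . . . . 108
6.3.1 Measuring Ray Tracer Performance . . . . . . . . . . . . . . . . 110
6.4 Chapter Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114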

7 TEXTURE MEMORY 115

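7.1 Chapter Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.2 Texture Memory Overview . . . . . . . . . . . . . . . . . . . . . . . . 116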


7.3 Simulating Heat Transfer . . . . . . . . . . . . . . . . . . . . . . . . 117


7.3.1 Simple Heating Model . . . . . . . . . . . . . . . . . . . . . . . . 117
7.3.2 Computing Temperature Updates . . . . . . . . . . . . . . . . . 119
7.3.3 Animating the Simulation . . . . . . . . . . . . . . . . . . . . . . 121
7.3.4 Using Texture Memory . . . . . . . . . . . . . . . . . . . . . . . . 125
7.3.5 Using Two-Dimensional Texture Memory . . . . . . . . . . . . . 131

7.4 Chapter Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

8 GRAPHICS INTEROPERABILITY 139

8.1 Chapter Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140


8.2 Graphics Interoperation . . . . . . . . . . . . . . . . . . . . . . . . . 140
8.3 GPU Ripple with Graphics Interoperability . . . . . . . . . . . . . . 147
8.3.1 The GPUAnimBitmap Structure . . . . . . . . . . . . . . . . . . 148
8.3.2 GPU Ripple Redux . . . . . . . . . . . . . . . . . . . . . . . . . . 152
8.4 Heat Transfer with Graphics Interop . . . . . . . . . . . . . . . . . . 154
8.5 DirectX Interoperability . . . . . . . . . . . . . . . . . . . . . . . . . 160

8.6 Chapter Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

9 ATOMICS 163

9.1 Chapter Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164


9.2 Compute Capability . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
9.2.1 The Compute Capability of NVIDIA GPUs . . . . . . . . . . . . . 164
9.2.2 Compiling for a Minimum Compute Capability . . . . . . . . . . 167
9.3 Atomic Operations Overview . . . . . . . . . . . . . . . . . . . . . . 168
9.4 Computing Histograms . . . . . . . . . . . . . . . . . . . . . . . . . 170
9.4.1 CPU Histogram Computation . . . . . . . . . . . . . . . . . . . . 171
9.4.2 GPU Histogram Computation . . . . . . . . . . . . . . . . . . . . 173

9.5 Chapter Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183



10 STREAMS 185

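10.1 Chapter Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
10.2 Page-Locked Host Memory . . . . . . . . . . . . . . . . . . . . . . . . 186
10.3 CUDA Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
10.4 Using a Single CUDA Stream . . . . . . . . . . . . . . . . . . . . . . 192
10.5 Using Multiple CUDA Streams . . . . . . . . . . . . . . . . . . . . . . 198
10.6 GPU Work Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . 205
10.7 Using Multiple CUDA Streams Effectively . . . . . . . . . . . . . . . 208
10.8 Chapter Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211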

11 CUDA C ON MULTIPLE GPUS 213

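11.1 Chapter Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
11.2 Zero-Copy Host Memory . . . . . . . . . . . . . . . . . . . . . . . . . 214
11.2.1 Zero-Copy Dot Product . . . . . . . . . . . . . . . . . . . . . . 214
11.2.2 Zero-Copy Performance . . . . . . . . . . . . . . . . . . . . . . 222
11.3 Using Multiple GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
11.4 Portable Pinned Memory . . . . . . . . . . . . . . . . . . . . . . . . . 230
11.5 Chapter Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235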

12 THE FINAL COUNTDOWN 237

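12.1 Chapter Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
12.2 CUDA Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
12.2.1 CUDA Toolkit . . . . . . . . . . . . . . . . . . . . . . . . . . . 238
12.2.2 CUFFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
12.2.3 CUBLAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
12.2.4 NVIDIA GPU Computing SDK . . . . . . . . . . . . . . . . . . . 240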


12.2.5 NVIDIA Performance Primitives . . . . . . . . . . . . . . . . . 241

12.2.6 Debugging CUDA C . . . . . . . . . . . . . . . . . . . . . . . . . 241

12.2.7 CUDA Visual Profiler . . . . . . . . . . . . . . . . . . . . . . . . 243

12.3 Written Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244

12.3.1 Programming Massively Parallel Processors:
A Hands-On Approach . . . . . . . . . . . . . . . . . . . . . . . 244

12.3.2 CUDA U . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245

12.3.3 NVIDIA Forums . . . . . . . . . . . . . . . . . . . . . . . . . . . 246

12.4 Code Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246

12.4.1 CUDA Data Parallel Primitives Library . . . . . . . . . . . . . 247

12.4.2 CULAtools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247

12.4.3 Language Wrappers . . . . . . . . . . . . . . . . . . . . . . . . 247

12.5 Chapter Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248

A ADVANCED ATOMICS 249

A.1 Dot Product Revisited . . . . . . . . . . . . . . . . . . . . . . . . . . 250

A.1.1 Atomic Locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251

A.1.2 Dot Product Redux: Atomic Locks . . . . . . . . . . . . . . . . 254

A.2 Implementing a Hash Table . . . . . . . . . . . . . . . . . . . . . . . 258

A.2.1 Hash Table Overview . . . . . . . . . . . . . . . . . . . . . . . . 259

A.2.2 A CPU Hash Table . . . . . . . . . . . . . . . . . . . . . . . . . . 261

A.2.3 Multithreaded Hash Table . . . . . . . . . . . . . . . . . . . . . 267

A.2.4 A GPU Hash Table . . . . . . . . . . . . . . . . . . . . . . . . . . 268

A.2.5 Hash Table Performance . . . . . . . . . . . . . . . . . . . . . 276

A.3 Appendix Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279



Foreword

Recent activities of major chip manufacturers such as NVIDIA make it more evident than ever that future designs of microprocessors and large HPC systems will be hybrid/heterogeneous in nature. These heterogeneous systems will rely on the integration of two major types of components in varying proportions:

• Multi- and many-core CPU technology: The number of cores will continue to escalate because of the desire to pack more and more components on a chip while avoiding the power wall, the instruction-level parallelism wall, and the memory wall.

• Special-purpose hardware and massively parallel accelerators: For example, GPUs from NVIDIA have outpaced standard CPUs in floating-point performance in recent years. Furthermore, they have arguably become as easy, if not easier, to program than multicore CPUs.

The relative balance between these component types in future designs is not clear and will likely vary over time. There seems to be no doubt that future generations of computer systems, ranging from laptops to supercomputers, will consist of a composition of heterogeneous components. Indeed, the petaflop (10¹⁵ floating-point operations per second) performance barrier was breached by such a system.

And yet the problems and the challenges for developers in the new computational landscape of hybrid processors remain daunting. Critical parts of the software infrastructure are already having a very difficult time keeping up with the pace of change. In some cases, performance cannot scale with the number of cores because an increasingly large portion of time is spent on data movement rather than arithmetic. In other cases, software tuned for performance is delivered years after the hardware arrives and so is obsolete on delivery. And in some cases, as on some recent GPUs, software will not run at all because programming environments have changed too much.


CUDA by Example addresses the heart of the software development challenge by leveraging one of the most innovative and powerful solutions to the problem of programming the massively parallel accelerators in recent years.

This book introduces you to programming in CUDA C by providing examples and insight into the process of constructing and effectively using NVIDIA GPUs. It presents introductory concepts of parallel computing, from simple examples to debugging (both logical and performance), and also covers advanced topics and issues related to using and building many applications. Throughout the book, programming examples reinforce the concepts that have been presented.

The book is required reading for anyone working with accelerator-based computing systems. It explores parallel computing in depth and provides an approach to many problems that may be encountered. It is especially useful for application developers, numerical library writers, and students and teachers of parallel computing.

I have enjoyed and learned from this book, and I feel confident that you will as well.

Jack Dongarra
University Distinguished Professor, University of Tennessee
Distinguished Research Staff Member, Oak Ridge National Laboratory



Preface

This book shows how, by harnessing the power of your computer's graphics processing unit (GPU), you can write high-performance software for a wide range of applications. Although originally designed to render computer graphics on a monitor (and still used for this purpose), GPUs are increasingly being called upon for equally demanding programs in science, engineering, and finance, among other domains. We refer collectively to GPU programs that address problems in nongraphics domains as general-purpose. Happily, although you need to have some experience working in C or C++ to benefit from this book, you need not have any knowledge of computer graphics. None whatsoever! GPU programming simply offers you an opportunity to build, and to build mightily, on your existing programming skills.

To program NVIDIA GPUs to perform general-purpose computing tasks, you will want to know what CUDA is. NVIDIA GPUs are built on what's known as the CUDA Architecture. You can think of the CUDA Architecture as the scheme by which NVIDIA has built GPUs that can perform both traditional graphics-rendering tasks and general-purpose tasks. To program CUDA GPUs, we will be using a language known as CUDA C. As you will see very early in this book, CUDA C is essentially C with a handful of extensions to allow programming of massively parallel machines like NVIDIA GPUs.

We've geared CUDA by Example toward experienced C or C++ programmers who have enough familiarity with C such that they are comfortable reading and writing code in C. This book builds on your experience with C and intends to serve as an example-driven, "quick-start" guide to using NVIDIA's CUDA C programming language. By no means do you need to have done large-scale software architecture, to have written a C compiler or an operating system kernel, or to know all the ins and outs of the ANSI C standards. However, we do not spend time reviewing C syntax or common C library routines such as malloc() or memcpy(), so we will assume that you are already reasonably familiar with these topics.


You will encounter some techniques that can be considered general parallel programming paradigms, although this book does not aim to teach general parallel programming techniques. Also, while we will look at nearly every part of the CUDA API, this book does not serve as an extensive API reference, nor will it go into gory detail about every tool that you can use to help develop your CUDA C software. Consequently, we highly recommend that this book be used in conjunction with NVIDIA's freely available documentation, in particular, the NVIDIA CUDA Programming Guide and the NVIDIA CUDA Best Practices Guide. But don't stress out about collecting all these documents, because we'll walk you through everything you need to do.

Without further ado, the world of programming NVIDIA GPUs with CUDA C awaits!



Acknowledgments

It's been said that it takes a village to write a technical book, and CUDA by Example is no exception to this adage. The authors owe debts of gratitude to many people, some of whom we would like to thank here.

Ian Buck, NVIDIA's senior director of GPU computing software, has been immeasurably helpful in every stage of the development of this book, from championing the idea to managing many of the details. We also owe Tim Murray, our always-smiling reviewer, much of the credit for this book possessing even a modicum of technical accuracy and readability. Many thanks also go to our designer, Darwin Tat, who created fantastic cover art and figures on an extremely tight schedule. Finally, we are much obliged to John Park, who helped guide this project through the delicate legal process required of published work.

Without help from Addison-Wesley's staff, this book would still be nothing more than a twinkle in the eyes of the authors. Peter Gordon, Kim Boedigheimer, and Julie Nahil have all shown unbounded patience and professionalism and have genuinely made the publication of this book a painless process. Additionally, Molly Sharp's production work and Kim Wimpsett's copyediting have utterly transformed this text from a pile of documents riddled with errors to the volume you're reading today.

Some of the content of this book could not have been included without the help of other contributors. Specifically, Nadeem Mohammad was instrumental in researching the CUDA case studies we present in Chapter 1, and Nathan Whitehead generously provided code that we incorporated into examples throughout the book.

We would be remiss if we didn't thank the others who read early drafts of this text and provided helpful feedback, including Genevieve Breed and Kurt Wall. Many of the NVIDIA software engineers provided invaluable technical assistance during the course of developing the content for CUDA by Example, including Mark Hairgrove, who scoured the book, uncovering all manner of inconsistencies: technical, typographical, and grammatical. Steve Hines, Nicholas Wilt, and Stephen Jones consulted on specific sections of the CUDA API, helping elucidate nuances that the authors would have otherwise overlooked. Thanks also go out to Randima Fernando, who helped to get this project off the ground, and to Michael Schidlowsky for acknowledging Jason in his book.

And what acknowledgments section would be complete without a heartfelt expression of gratitude to parents and siblings? It is here that we would like to thank our families, who have been with us through everything and have made this all possible. With that said, we would like to extend special thanks to loving parents, Edward and Kathleen Kandrot and Stephen and Helen Sanders. Thanks also go to our brothers, Kenneth Kandrot and Corey Sanders. Thank you all for your unwavering support.



About the Authors

Jason Sanders is a senior software engineer in the CUDA Platform group at NVIDIA. While at NVIDIA, he helped develop early releases of CUDA system software and contributed to the OpenCL Specification, an industry standard for heterogeneous computing. Jason received his master's degree in computer science from the University of California, Berkeley, where he published research in GPU computing, and he holds a bachelor's degree in electrical engineering from Princeton University. Prior to joining NVIDIA, he previously held positions at ATI Technologies, Apple, and Novell. When he's not writing books, Jason is typically working out, playing soccer, or shooting photos.

Edward Kandrot is a senior software engineer on the CUDA Algorithms team at NVIDIA. He has more than 20 years of industry experience focused on optimizing code and improving performance, including for Photoshop and Mozilla. Kandrot has worked for Adobe, Microsoft, and Google, and he has been a consultant at many companies, including Apple and Autodesk. When not coding, he can be found playing World of Warcraft or visiting Las Vegas for the amazing food.



Chapter 1
Why CUDA? Why Now?

There was a time in the not-so-distant past when parallel computing was looked upon as an "exotic" pursuit and typically got compartmentalized as a specialty within the field of computer science. This perception has changed in profound ways in recent years. The computing world has shifted to the point where, far from being an esoteric pursuit, nearly every aspiring programmer needs training in parallel programming to be fully effective in computer science. Perhaps you've picked this book up unconvinced about the importance of parallel programming in the computing world today and the increasingly large role it will play in the years to come. This introductory chapter will examine recent trends in the hardware that does the heavy lifting for the software that we as programmers write. In doing so, we hope to convince you that the parallel computing revolution has already happened and that, by learning CUDA C, you'll be well positioned to write high-performance applications for heterogeneous platforms that contain both central and graphics processing units.


1.1 Chapter Objectives
Through the course of this chapter, you will accomplish the following:

• You will learn about the increasingly important role of parallel computing.

• You will learn a brief history of GPU computing and CUDA.

• You will learn about some successful applications that use CUDA C.

1.2 The Age of Parallel Processing
In recent years, much has been made of the computing industry's widespread shift to parallel computing. Nearly all consumer computers in the year 2010 will ship with multicore central processors. From the introduction of dual-core, low-end netbook machines to 8- and 16-core workstation computers, no longer will parallel computing be relegated to exotic supercomputers or mainframes. Moreover, electronic devices such as mobile phones and portable music players have begun to incorporate parallel computing capabilities in an effort to provide functionality well beyond that of their predecessors.

More and more software developers will need to cope with a variety of parallel computing platforms and technologies in order to provide novel and rich experiences for an increasingly sophisticated base of users. Command prompts are out; multithreaded graphical interfaces are in. Cellular phones that only make calls are out; phones that can simultaneously play music, browse the Web, and provide GPS services are in.

1.2.1 CENTRAL PROCESSING UNITS


For years, one of the important methods for improving the performance of consumer computing devices has been to increase the speed at which the processor's clock operated. Starting with the first personal computers of the early 1980s, consumer central processing units (CPUs) ran with internal clocks operating around 1MHz. About 30 years later, most desktop processors have clock speeds between 1GHz and 4GHz, roughly 1,000 to 4,000 times faster than the clock on the original personal computer. Although increasing the CPU clock speed is certainly not the only method by which computing performance has been improved, it has always been a reliable source for improved performance.

In recent years, however, manufacturers have been forced to look for alternatives to this traditional source of increased computational power. Because of various fundamental limitations in the fabrication of integrated circuits, it is no longer feasible to rely on upward-spiraling processor clock speeds as a means for extracting additional power from existing architectures. Because of power and heat restrictions as well as a rapidly approaching physical limit to transistor size, researchers and manufacturers have begun to look elsewhere.

Outside the world of consumer computing, supercomputers have for decades extracted massive performance gains in similar ways. The performance of a processor used in a supercomputer has climbed astronomically, similar to the improvements in the personal computer CPU. However, in addition to dramatic improvements in the performance of a single processor, supercomputer manufacturers have also extracted massive leaps in performance by steadily increasing the number of processors. It is not uncommon for the fastest supercomputers to have tens or hundreds of thousands of processor cores working in tandem.

In the search for additional processing power for personal computers, the improvement in supercomputers raises a very good question: Rather than solely looking to increase the performance of a single processing core, why not put more than one in a personal computer? In this way, personal computers could continue to improve in performance without the need for continuing increases in processor clock speed.

In 2005, faced with an increasingly competitive marketplace and few alternatives, leading CPU manufacturers began offering processors with two computing cores instead of one. Over the following years, they followed this development with the release of three-, four-, six-, and eight-core central processor units. Sometimes referred to as the multicore revolution, this trend has marked a huge shift in the evolution of the consumer computing market.

Today, it is relatively challenging to purchase a desktop computer with a CPU containing but a single computing core. Even low-end, low-power central processors ship with two or more cores per die. Leading CPU manufacturers have already announced plans for 12- and 16-core CPUs, further confirming that parallel computing has arrived for good.


1.3 The Rise of GPU Computing
In comparison to the central processor's traditional data processing pipeline, performing general-purpose computations on a graphics processing unit (GPU) is a new concept. In fact, the GPU itself is relatively new compared to the computing field at large. However, the idea of computing on graphics processors is not as new as you might believe.

1.3.1 A BRIEF HISTORY OF GPUS
We have already looked at how central processors evolved in both clock speeds and core count. In the meantime, the state of graphics processing underwent a dramatic revolution. In the late 1980s and early 1990s, the growth in popularity of graphically driven operating systems such as Microsoft Windows helped create a market for a new type of processor. In the early 1990s, users began purchasing 2D display accelerators for their personal computers. These display accelerators offered hardware-assisted bitmap operations to assist in the display and usability of graphical operating systems.

Around the same time, in the world of professional computing, a company by the name of Silicon Graphics spent the 1980s popularizing the use of three-dimensional graphics in a variety of markets, including government and defense applications and scientific and technical visualization, as well as providing the tools to create stunning cinematic effects. In 1992, Silicon Graphics opened the programming interface to its hardware by releasing the OpenGL library. Silicon Graphics intended OpenGL to be used as a standardized, platform-independent method for writing 3D graphics applications. As with parallel processing and CPUs, it would only be a matter of time before the technologies found their way into consumer applications.

By the mid-1990s, the demand for consumer applications employing 3D graphics had escalated rapidly, setting the stage for two fairly significant developments. First, the release of immersive, first-person games such as Doom, Duke Nukem 3D, and Quake helped ignite a quest to create progressively more realistic 3D environments for PC gaming. Although 3D graphics would eventually work their way into nearly all computer games, the popularity of the nascent first-person shooter genre would significantly accelerate the adoption of 3D graphics in consumer computing. At the same time, companies such as NVIDIA, ATI Technologies, and 3dfx Interactive began releasing graphics accelerators that were affordable enough to attract widespread attention. These developments cemented 3D graphics as a technology that would figure prominently for years to come.

The release of NVIDIA's GeForce 256 further pushed the capabilities of consumer graphics hardware. For the first time, transform and lighting computations could be performed directly on the graphics processor, thereby enhancing the potential for even more visually interesting applications. Since transform and lighting were already integral parts of the OpenGL graphics pipeline, the GeForce 256 marked the beginning of a natural progression where increasingly more of the graphics pipeline would be implemented directly on the graphics processor.

From a parallel-computing standpoint, NVIDIA's release of the GeForce 3 series in 2001 represents arguably the most important breakthrough in GPU technology. The GeForce 3 series was the computing industry's first chip to implement Microsoft's then-new DirectX 8.0 standard. This standard required that compliant hardware contain both programmable vertex and programmable pixel shading stages. For the first time, developers had some control over the exact computations that would be performed on their GPUs.

1.3.2 EARLY GPU COMPUTING


The release of GPUs that possessed programmable pipelines attracted many researchers to the possibility of using graphics hardware for more than simply OpenGL- or DirectX-based rendering. The general approach in the early days of GPU computing was extraordinarily convoluted. Because standard graphics APIs such as OpenGL and DirectX were still the only way to interact with a GPU, any attempt to perform arbitrary computations on a GPU would still be subject to the constraints of programming within a graphics API. Because of this, researchers explored general-purpose computation through graphics APIs by trying to make their problems appear to the GPU to be traditional rendering.

Essentially, the GPUs of the early 2000s were designed to produce a color for every pixel on the screen using programmable arithmetic units known as pixel shaders. In general, a pixel shader uses its (x,y) position on the screen as well as some additional information to combine various inputs in computing a final color. The additional information could be input colors, texture coordinates, or other attributes that would be passed to the shader when it ran. But because the arithmetic being performed on the input colors and textures was completely controlled by the programmer, researchers observed that these input "colors" could actually be any data.


So if the inputs were actually numerical data signifying something other than color, programmers could then program the pixel shaders to perform arbitrary computations on this data. The results would be handed back to the GPU as the final pixel "color," although the colors would simply be the result of whatever computations the programmer had instructed the GPU to perform on their inputs. This data could be read back by the researchers, and the GPU would never be the wiser. In essence, the GPU was being tricked into performing nonrendering tasks by making those tasks appear as if they were a standard rendering. This trickery was very clever but also very convoluted.

Because of the high arithmetic throughput of GPUs, initial results from these experiments promised a bright future for GPU computing. However, the programming model was still far too restrictive for any critical mass of developers to form. There were tight resource constraints, since programs could receive input data only from a handful of input colors and a handful of texture units. There were serious limitations on how and where the programmer could write results to memory, so algorithms requiring the ability to write to arbitrary locations in memory (scatter) could not run on a GPU. Moreover, it was nearly impossible to predict how your particular GPU would deal with floating-point data, if it handled floating-point data at all, so most scientific computations would be unable to use a GPU. Finally, when the program inevitably computed the incorrect results, failed to terminate, or simply hung the machine, there existed no reasonably good method to debug any code that was being executed on the GPU.

As if the limitations weren't severe enough, anyone who still wanted to use a GPU to perform general-purpose computations would need to learn OpenGL or DirectX, since these remained the only means by which one could interact with a GPU. Not only did this mean storing data in graphics textures and executing computations by calling OpenGL or DirectX functions, but it meant writing the computations themselves in special graphics-only programming languages known as shading languages. Asking researchers to both cope with severe resource and programming restrictions as well as to learn computer graphics and shading languages before attempting to harness the computing power of their GPU proved too large a hurdle for wide acceptance.

1.4 CUDA
It would not be until five years after the release of the GeForce 3 series that GPU computing would be ready for prime time. In November 2006, NVIDIA unveiled the industry's first DirectX 10 GPU, the GeForce 8800 GTX. The GeForce 8800 GTX was also the first GPU to be built with NVIDIA's CUDA Architecture. This architecture included several new components designed strictly for GPU computing and aimed to alleviate many of the limitations that prevented previous graphics processors from being legitimately useful for general-purpose computation.

WHAT IS THE CUDA ARCHITECTURE?
Unlike previous generations that partitioned computing resources into vertex and pixel shaders, the CUDA Architecture included a unified shader pipeline, allowing each and every arithmetic logic unit (ALU) on the chip to be marshaled by a program intending to perform general-purpose computations. Because NVIDIA intended this new family of graphics processors to be used for general-purpose computing, these ALUs were built to comply with the IEEE 754 requirements for single-precision floating-point arithmetic and were designed to use an instruction set tailored for general computation rather than specifically for graphics. Furthermore, the execution units on the GPU were allowed arbitrary read and write access to memory as well as access to a software-managed cache known as shared memory. All of these features of the CUDA Architecture were added in order to create a GPU that would excel at computation in addition to performing well at traditional graphics tasks.
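To make shared memory and arbitrary memory access a little more concrete, here is a minimal sketch in CUDA C. It is not code from the book: the kernel reverse_in_block(), the helper reverse_on_gpu(), and the size N are hypothetical, and the sketch assumes a single block of 256 threads. Each thread copies one element into shared memory, waits at the __syncthreads() barrier, and then reads an element that a different thread staged, writing the result through an ordinary pointer into global memory.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 256  // hypothetical size: one block of N threads, one element per thread

__global__ void reverse_in_block(const int *in, int *out) {
    __shared__ int tile[N];      // software-managed cache shared by the whole block
    int i = threadIdx.x;
    tile[i] = in[i];             // each thread stages one element
    __syncthreads();             // wait until every thread has written its element
    out[i] = tile[N - 1 - i];    // read an element staged by a different thread
}

// Reverses h_in into h_out on the GPU; returns cudaSuccess or the first error seen.
cudaError_t reverse_on_gpu(const int *h_in, int *h_out) {
    int *d_in = NULL, *d_out = NULL;
    size_t bytes = N * sizeof(int);
    cudaError_t err = cudaMalloc((void**)&d_in, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void**)&d_out, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        reverse_in_block<<<1, N>>>(d_in, d_out);
        err = cudaGetLastError();  // catches a rejected launch
    }
    if (err == cudaSuccess) err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);                // cudaFree(NULL) is a no-op, so this is safe on failure
    cudaFree(d_out);
    return err;
}

int main(void) {
    int in[N], out[N];
    for (int i = 0; i < N; i++) in[i] = i;
    cudaError_t err = reverse_on_gpu(in, out);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("out[0] = %d, out[%d] = %d\n", out[0], N - 1, out[N - 1]);
    return 0;
}
```

Launching only one block keeps the sketch simple. Shared memory is private to each block, so reversing a much larger array would call for a different decomposition.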

USING THE CUDA ARCHITECTURE
The effort by NVIDIA to provide consumers with a product for both computation and graphics could not stop at producing hardware incorporating the CUDA Architecture, though. Regardless of how many features NVIDIA added to its chips to facilitate computing, there continued to be no way to access these features without using OpenGL or DirectX. Not only would this have required users to continue to disguise their computations as graphics problems, but they would have needed to continue writing their computations in a graphics-oriented shading language such as OpenGL's GLSL or Microsoft's HLSL.

To reach the maximum number of developers possible, NVIDIA took industry-standard C and added a relatively small number of keywords in order to harness some of the special features of the CUDA Architecture. A few months after the launch of the GeForce 8800 GTX, NVIDIA made public a compiler for this language, CUDA C. And with that, CUDA C became the first language specifically designed by a GPU company to facilitate general-purpose computing on GPUs.


In addition to creating a language to write code for the GPU, NVIDIA also provides a specialized hardware driver to exploit the CUDA Architecture's massive computational power. Users are no longer required to have any knowledge of the OpenGL or DirectX graphics programming interfaces, nor are they required to force their problem to look like a computer graphics task.

Applications of CUDA
Since its debut in early 2007, a variety of industries and applications have enjoyed a great deal of success by choosing to build applications in CUDA C. The benefits often include orders-of-magnitude performance improvements over the previous state-of-the-art implementations. Furthermore, applications running on NVIDIA graphics processors enjoy better performance per dollar and performance per watt than implementations built exclusively on traditional central processing technologies. The following represent just a few of the ways in which people have put CUDA C and the CUDA Architecture into successful use.

 MEDICAL IMAGING


The number of people who have been affected by the tragedy of breast cancer has dramatically risen over the course of the past 20 years. Thanks in a large part to the tireless efforts of many, awareness of and research into preventing and curing this terrible disease have similarly risen in recent years. Ultimately, every case of breast cancer should be caught early enough to prevent the ravaging side effects of radiation and chemotherapy, the permanent reminders left by surgery, and the deadly consequences in cases that fail to respond to treatment. As a result, researchers share a strong desire to find fast, accurate, and minimally invasive ways to identify the early signs of breast cancer.

The mammogram, one of the current best techniques for the early detection of breast cancer, has several significant limitations. Two or more images need to be taken, and the film needs to be developed and read by a skilled doctor to identify potential tumors. Additionally, this X-ray procedure carries with it all the risks of repeatedly radiating a patient's chest. After careful study, doctors often require further, more specific imaging (and even biopsy) in an attempt to eliminate the possibility of cancer. These false positives incur expensive follow-up work and cause undue stress to the patient until final conclusions can be drawn.


Ultrasound imaging is safer than X-ray imaging, so doctors often use it in conjunction with mammography to assist in breast cancer care and diagnosis. But conventional breast ultrasound has its limitations as well. As a result, TechniScan Medical Systems was born. TechniScan has developed a promising, three-dimensional, ultrasound imaging method, but its solution had not been put into practice for a very simple reason: computation limitations. Simply put, converting the gathered ultrasound data into the three-dimensional imagery required computation considered prohibitively time-consuming and expensive for practical use.

The introduction of NVIDIA's first GPU based on the CUDA Architecture along with its CUDA C programming language provided a platform on which TechniScan could convert the dreams of its founders into reality. As the name indicates, its Svara ultrasound imaging system uses ultrasonic waves to image the patient's chest. The TechniScan Svara system relies on two NVIDIA Tesla C1060 processors in order to process the 35GB of data generated by a 15-minute scan. Thanks to the computational horsepower of the Tesla C1060, within 20 minutes the doctor can manipulate a highly detailed, three-dimensional image of the woman's breast. TechniScan expects wide deployment of its Svara system starting in 2010.

COMPUTATIONAL FLUID DYNAMICS
For many years, the design of highly efficient rotors and blades remained a black art of sorts. The astonishingly complex movement of air and fluids around these devices cannot be effectively modeled by simple formulations, so accurate simulations prove far too computationally expensive to be realistic. Only the largest supercomputers in the world could hope to offer computational resources on par with the sophisticated numerical models required to develop and validate designs. Since few have access to such machines, innovation in the design of such machines continued to stagnate.

The University of Cambridge, in a great tradition started by Charles Babbage, is home to active research into advanced parallel computing. Dr. Graham Pullan and Dr. Tobias Brandvik of the "many-core group" correctly identified the potential in NVIDIA's CUDA Architecture to accelerate computational fluid dynamics to unprecedented levels. Their initial investigations indicated that acceptable levels of performance could be delivered by GPU-powered, personal workstations. Later, the use of a small GPU cluster easily outperformed their much more costly supercomputers and further confirmed their suspicions that the capabilities of NVIDIA's GPUs matched extremely well with the problems they wanted to solve.


For the researchers at Cambridge, the massive performance gains offered by CUDA C represent more than a simple, incremental boost to their supercomputing resources. The availability of copious amounts of low-cost GPU computation empowered the Cambridge researchers to perform rapid experimentation. Receiving experimental results within seconds streamlined the feedback process on which researchers rely in order to arrive at breakthroughs. As a result, the use of GPU clusters has fundamentally transformed the way they approach their research. Nearly interactive simulation has unleashed new opportunities for innovation and creativity in a previously stifled field of research.

 ENVIRONMENTAL SCIENCE


The increasing need for environmentally sound consumer goods has arisen as a natural consequence of the rapidly escalating industrialization of the global economy. Growing concerns over climate change, the spiraling prices of fuel, and the growing level of pollutants in our air and water have brought into sharp relief the collateral damage of such successful advances in industrial output. Detergents and cleaning agents have long been some of the most necessary yet potentially calamitous consumer products in regular use. As a result, many scientists have begun exploring methods for reducing the environmental impact of such detergents without reducing their efficacy. Gaining something for nothing can be a tricky proposition, however.

The key components to cleaning agents are known as surfactants. Surfactant molecules determine the cleaning capacity and texture of detergents and shampoos, but they are often implicated as the most environmentally devastating component of cleaning products. These molecules attach themselves to dirt and then mix with water such that the surfactants can be rinsed away along with the dirt. Traditionally, measuring the cleaning value of a new surfactant would require extensive laboratory testing involving numerous combinations of materials and impurities to be cleaned. This process, not surprisingly, can be very slow and expensive.

Temple University has been working with industry leader Procter & Gamble to use molecular simulation of surfactant interactions with dirt, water, and other materials. The introduction of computer simulations serves not just to accelerate a traditional lab approach; it extends the breadth of testing to numerous variants of environmental conditions, far more than could be practically tested in the past. Temple researchers used the GPU-accelerated Highly Optimized Object-oriented Many-particle Dynamics (HOOMD) simulation software written by the Department of Energy's Ames Laboratory. By splitting their simulation across two NVIDIA Tesla GPUs, they were able to achieve performance equivalent to that of the 128 CPU cores of the Cray XT3 and the 1024 CPUs of an IBM BlueGene/L machine. By increasing the number of Tesla GPUs in their solution, they are already simulating surfactant interactions at 16 times the performance of previous platforms. Since NVIDIA's CUDA has reduced the time to complete such comprehensive simulations from several weeks to a few hours, the years to come should offer a dramatic rise in products that have both increased effectiveness and reduced environmental impact.

Chapter Review
The computing industry is at the precipice of a parallel computing revolution, and NVIDIA's CUDA C has thus far been one of the most successful languages ever designed for parallel computing. Throughout the course of this book, we will help you learn how to write your own code in CUDA C. We will help you learn the special extensions to C and the application programming interfaces that NVIDIA has created in service of GPU computing. You are not expected to know OpenGL or DirectX, nor are you expected to have any background in computer graphics.

We will not be covering the basics of programming in C, so we do not recommend this book to people completely new to computer programming. Some familiarity with parallel programming might help, although we do not expect you to have done any parallel programming. Any terms or concepts related to parallel programming that you will need to understand will be explained in the text. In fact, there may be some occasions when you find that knowledge of traditional parallel programming will cause you to make assumptions about GPU computing that prove untrue. So in reality, a moderate amount of experience with C or C++ programming is the only prerequisite to making it through this book.

In the next chapter, we will help you set up your machine for GPU computing, ensuring that you have both the hardware and the software components necessary to get started. After that, you'll be ready to get your hands dirty with CUDA C. If you already have some experience with CUDA C or you're sure that your system has been properly set up to do development in CUDA C, you can skip to Chapter 3.



Chapter 2
Getting Started

We hope that Chapter 1 has gotten you excited to get started learning CUDA C. Since this book intends to teach you the language through a series of coding examples, you'll need a functioning development environment. Sure, you could stand on the sideline and watch, but we think you'll have more fun and stay interested longer if you jump in and get some practical experience hacking CUDA C code as soon as possible. In this vein, this chapter will walk you through some of the hardware and software components you'll need in order to get started. The good news is that you can obtain all of the software you'll need for free, leaving you more money for whatever tickles your fancy.

Chapter Objectives
Through the course of this chapter, you will accomplish the following:

• You will download all the software components required to work through this book.

• You will set up an environment in which you can build code written in CUDA C.

Development Environment
Before embarking on this journey, you will need to set up an environment in which you can develop using CUDA C. The prerequisites to developing code in CUDA C are as follows:

• A CUDA-enabled graphics processor

• An NVIDIA device driver

• A CUDA development toolkit

• A standard C compiler

To make this chapter as painless as possible, we'll walk through each of these prerequisites now.

CUDA-ENABLED GRAPHICS PROCESSORS
Fortunately, it should be easy to find yourself a graphics processor that has been built on the CUDA Architecture, because every NVIDIA GPU since the release of the GeForce 8800 GTX has been CUDA-enabled. Since NVIDIA regularly releases new GPUs based on the CUDA Architecture, the following will undoubtedly be only a partial list of CUDA-enabled GPUs. Nevertheless, the GPUs listed in Table 2.1 are all CUDA-capable.

For a complete list, you should consult the NVIDIA website at www.nvidia.com/cuda, although it is safe to assume that all recent GPUs (GPUs from 2007 on) with more than 256MB of graphics memory can be used to develop and run code written with CUDA C.


Table 2.1 CUDA-enabled GPUs

[Table: CUDA-enabled GeForce, Tesla, and Quadro desktop products; Quadro mobile products; GeForce mobile products]

 NVIDIA DEVICE DRIVER


NVIDIA provides system software that allows your programs to communicate with the CUDA-enabled hardware. If you have installed your NVIDIA GPU properly, you likely already have this software installed on your machine. It never hurts to ensure you have the most recent drivers, so we recommend that you visit www.nvidia.com/cuda and click the Download Drivers link. Select the options that match the graphics card and operating system on which you plan to do development. After you follow the installation instructions for the platform of your choice, your system will be up-to-date with the latest NVIDIA system software.

 CUDA DEVELOPMENT TOOLKIT


If you have a CUDA-enabled GPU and NVIDIA's device driver, you are ready to run compiled CUDA C code. This means that you can download CUDA-powered applications, and they will be able to successfully execute their code on your graphics processor. However, we assume that you want to do more than just run code because, otherwise, this book isn't really necessary. If you want to develop code for NVIDIA GPUs using CUDA C, you will need additional software. But as promised earlier, none of it will cost you a penny.

You will learn these details in the next chapter, but since your CUDA C applications are going to be computing on two different processors, you are consequently going to need two compilers. One compiler will compile code for your GPU, and one will compile code for your CPU. NVIDIA provides the compiler for your GPU code. As with the NVIDIA device driver, you can download the CUDA Toolkit at http://developer.nvidia.com/object/gpucomputing.html. Click the CUDA Toolkit link to reach the download page shown in Figure 2.1.

Figure 2.1 The CUDA download page

You will again be asked to select your platform from among 32- and 64-bit versions of Windows XP, Windows Vista, Windows 7, Linux, and Mac OS. From the available downloads, you need to download the CUDA Toolkit in order to build the code examples contained in this book. Additionally, you are encouraged, although not required, to download the GPU Computing SDK code samples, which contain dozens of helpful example programs. The GPU Computing SDK code samples will not be covered in this book, but they nicely complement the material we intend to cover, and as with learning any style of programming, the more examples, the better. You should also take note that although nearly all the code in this book will work on the Linux, Windows, and Mac OS platforms, we have targeted the applications toward Linux and Windows. If you are using Mac OS X, you will be living dangerously and using unsupported code examples.

 STANDARD C COMPILER


As we mentioned, you will need a compiler for GPU code and a compiler for CPU code. If you downloaded and installed the CUDA Toolkit as suggested in the previous section, you have a compiler for GPU code. A compiler for CPU code is the only component that remains on our CUDA checklist, so let's address that issue so we can get to the interesting stuff.

WINDOWS
On Microsoft Windows platforms, including Windows XP, Windows Vista, Windows Server 2008, and Windows 7, we recommend using the Microsoft Visual Studio C compiler. NVIDIA currently supports both the Visual Studio 2005 and Visual Studio 2008 families of products. As Microsoft releases new versions, NVIDIA will likely add support for newer editions of Visual Studio while dropping support for older versions. Many C and C++ developers already have Visual Studio 2005 or Visual Studio 2008 installed on their machine, so if this applies to you, you can safely skip this subsection.

If you do not have access to a supported version of Visual Studio and aren't ready to invest in a copy, Microsoft does provide free downloads of the Visual Studio Express edition on its website. Although typically unsuitable for commercial software development, the Visual Studio Express editions are an excellent way to get started developing CUDA C on Windows platforms without investing money in software licenses. So, head on over to www.microsoft.com/visualstudio if you're in need of Visual Studio.


LINUX
Most Linux distributions ship with a version of the GNU C compiler (gcc) installed. As of CUDA 3.0, the following Linux distributions shipped with supported versions of gcc installed:

• Red Hat Enterprise Linux 4.8

• Red Hat Enterprise Linux 5.3

• OpenSUSE 11.1

• SUSE Linux Enterprise Desktop 11

• Ubuntu 9.04

• Fedora 10

If you're a die-hard Linux user, you're probably aware that many Linux software packages work on far more than just the "supported" platforms. The CUDA Toolkit is no exception, so even if your favorite distribution is not listed here, it may be worth trying it anyway. The distribution's kernel, gcc, and glibc versions will in a large part determine whether the distribution is compatible.

MACINTOSH OS X
If you want to develop on Mac OS X, you will need to ensure that your machine has at least version 10.5.7 of Mac OS X. This includes version 10.6, Mac OS X "Snow Leopard." Furthermore, you will need to install gcc by downloading and installing Apple's Xcode. This software is provided free to Apple Developer Connection (ADC) members and can be downloaded from http://developer.apple.com/tools/Xcode. The code in this book was developed on Linux and Windows platforms but should work without modification on Mac OS X systems.

Chapter Review
If you have followed the steps in this chapter, you are ready to start developing code in CUDA C. Perhaps you have even played around with some of the NVIDIA GPU Computing SDK code samples you downloaded from NVIDIA's website. If so, we applaud your willingness to tinker! If not, don't worry. Everything you need is right here in this book. Either way, you're probably ready to start writing your first program in CUDA C, so let's get started.



Chapter 3
Introduction to CUDA C

If you read Chapter 1, we hope we have convinced you of both the immense computational power of graphics processors and that you are just the programmer to harness it. And if you continued through Chapter 2, you should have a functioning environment set up in order to compile and run the code you'll be writing in CUDA C. If you skipped the first chapters, perhaps you're just skimming for code samples, perhaps you randomly opened to this page while browsing at a bookstore, or maybe you're just dying to get started; that's OK, too (we won't tell). Either way, you're ready to get started with the first code examples, so let's go.

Chapter Objectives
Through the course of this chapter, you will accomplish the following:

• You will write your first lines of code in CUDA C.

• You will learn the difference between code written for the host and code written for a device.

• You will learn how to run device code from the host.

• You will learn about the ways device memory can be used on CUDA-capable devices.

• You will learn how to query your system for information on its CUDA-capable devices.

A First Program
Since we intend to learn CUDA C by example, let's take a look at our first example of CUDA C. In accordance with the laws governing written works of computer programming, we begin by examining a "Hello, World!" example.

HELLO, WORLD!

#include "../common/book.h"

int main( void ) {
    printf( "Hello, World!\n" );
    return 0;
}

At this point, no doubt you're wondering whether this book is a scam. Is this just C? Does CUDA C even exist? The answers to these questions are both in the affirmative; this book is not an elaborate ruse. This simple "Hello, World!" example is meant to illustrate that, at its most basic, there is no difference between CUDA C and the standard C to which you have grown accustomed.

The simplicity of this example stems from the fact that it runs entirely on the host. This will be one of the important distinctions made in this book; we refer to the CPU and the system's memory as the host and refer to the GPU and its memory as the device. This example resembles almost all the code you have ever written because it simply ignores any computing devices outside the host.

To remedy that sinking feeling that you've invested in nothing more than an expensive collection of trivialities, we will gradually build upon this simple example. Let's look at something that uses the GPU (a device) to execute code. A function that executes on the device is typically called a kernel.

 A KERNEL CALL


Now we will build upon our example with some code that should look more foreign than our plain-vanilla "Hello, World!" program.

#include <stdio.h>   // declares printf()

__global__ void kernel( void ) {
}

int main( void ) {
    kernel<<<1,1>>>();
    printf( "Hello, World!\n" );
    return 0;
}

This program makes two notable additions to the original "Hello, World!" example:

• An empty function named kernel() qualified with __global__

• A call to the empty function, embellished with <<<1,1>>>

As we saw in the previous section, code is compiled by your system's standard C compiler by default. For example, GNU gcc might compile your host code on Linux operating systems, while Microsoft Visual C++ compiles it on Windows systems. The NVIDIA tools simply feed this host compiler your code, and everything behaves as it would in a world without CUDA.

Now we see that CUDA C adds the __global__ qualifier to standard C. This mechanism alerts the compiler that a function should be compiled to run on a device instead of the host. In this simple example, nvcc gives the function kernel() to the compiler that handles device code, and it feeds main() to the host compiler as it did in the previous example.

So, what is the mysterious call to kernel(), and why must we vandalize our standard C with angle brackets and a numeric tuple? Brace yourself, because this is where the magic happens.

We have seen that CUDA C needed a linguistic method for marking a function as device code. There is nothing special about this; it is shorthand to send host code to one compiler and device code to another compiler. The trick is actually in calling the device code from the host code. One of the benefits of CUDA C is that it provides this language integration so that device function calls look very much like host function calls. Later we will discuss what actually happens behind the scenes, but suffice it to say that the CUDA compiler and runtime take care of the messy business of invoking device code from the host.

So, the mysterious-looking call invokes device code, but why the angle brackets and numbers? The angle brackets denote arguments we plan to pass to the runtime system. These are not arguments to the device code but are parameters that will influence how the runtime will launch our device code. We will learn about these parameters to the runtime in the next chapter. Arguments to the device code itself get passed within the parentheses, just like any other function invocation.
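A minimal sketch of that split follows, assuming CUDA 4.0 or later (for cudaDeviceSynchronize()); the helper launch() is hypothetical. The two values inside <<< >>> (here named blocks and threads) go to the runtime rather than to kernel(). A configuration the runtime rejects does not crash the program or return a value from the call; the launch simply does not happen, and the error is reported by cudaGetLastError().

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void kernel( void ) {
}

// Launches kernel<<<blocks, threads>>>() and returns the first error reported.
cudaError_t launch( unsigned int blocks, unsigned int threads ) {
    kernel<<<blocks, threads>>>();          // launch configuration goes to the runtime
    cudaError_t err = cudaGetLastError();    // errors detected when the launch was issued
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();          // errors raised while the kernel ran
}

int main( void ) {
    printf( "<<<1,1>>>: %s\n", cudaGetErrorString( launch( 1, 1 ) ) );
    // No CUDA GPU accepts a million threads in a single block, so this launch is rejected.
    printf( "<<<1,1000000>>>: %s\n", cudaGetErrorString( launch( 1, 1000000 ) ) );
    return 0;
}
```

The per-block thread limit that the second launch exceeds is reported by the maxThreadsPerBlock device property, which appears later in this chapter.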

 PASSING PARAMETERS


We've promised the ability to pass parameters to our kernel, and the time has come for us to make good on that promise. Consider the following enhancement to our "Hello, World!" application:


#include <iostream>
#include "../common/book.h"

__global__ void add( int a, int b, int *c ) {
    *c = a + b;
}

int main( void ) {
    int c;
    int *dev_c;
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c, sizeof(int) ) );

    add<<<1,1>>>( 2, 7, dev_c );

    HANDLE_ERROR( cudaMemcpy( &c,
                              dev_c,
                              sizeof(int),
                              cudaMemcpyDeviceToHost ) );
    printf( "2 + 7 = %d\n", c );
    cudaFree( dev_c );

    return 0;
}

You will notice a handful of new lines here, but these changes introduce only two concepts:

• We can pass parameters to a kernel as we would with any C function.

• We need to allocate memory to do anything useful on a device, such as return values to the host.

There is nothing special about passing parameters to a kernel. The angle-bracket syntax notwithstanding, a kernel call looks and acts exactly like any function call in standard C. The runtime system takes care of any complexity introduced by the fact that these parameters need to get from the host to the device.


The more interesting addition is the allocation of memory using cudaMalloc(). This call behaves very similarly to the standard C call malloc(), but it tells the CUDA runtime to allocate the memory on the device. The first argument is a pointer to the pointer you want to hold the address of the newly allocated memory, and the second parameter is the size of the allocation you want to make. Apart from the fact that the allocated memory pointer comes back through that first argument rather than as the function's return value (cudaMalloc() itself returns an error code of type cudaError_t), this is identical behavior to malloc(), right down to the allocated pointer having type void*. The HANDLE_ERROR() that surrounds these calls is a utility macro that we have provided as part of this book's support code. It simply detects that the call has returned an error, prints the associated error message, and exits the application with an EXIT_FAILURE code. Although you are free to use this code in your own applications, it is highly likely that this error-handling code will be insufficient in production code.
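If you would rather not depend on the book's support code, a macro in the same spirit takes only a few lines. The sketch below is an assumption about how such a wrapper can be written, not a copy of book.h; the names check_cuda() and CHECK_CUDA() are hypothetical. It relies on the runtime call cudaGetErrorString() to turn a cudaError_t into readable text.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper: report a failed CUDA runtime call and exit.
static void check_cuda( cudaError_t err, const char *file, int line ) {
    if (err != cudaSuccess) {
        fprintf( stderr, "%s in %s at line %d\n",
                 cudaGetErrorString( err ), file, line );
        exit( EXIT_FAILURE );
    }
}

// Wrap any CUDA runtime call that returns cudaError_t.
#define CHECK_CUDA( call ) check_cuda( (call), __FILE__, __LINE__ )

int main( void ) {
    int *dev_c = NULL;
    CHECK_CUDA( cudaMalloc( (void**)&dev_c, sizeof(int) ) );  // pointer returned via &dev_c
    CHECK_CUDA( cudaFree( dev_c ) );
    printf( "cudaMalloc() and cudaFree() succeeded\n" );
    return 0;
}
```

Exiting on the first error is fine for small examples. Production code more often returns the error to its caller or releases resources before giving up, which is why a one-size-fits-all macro tends to fall short there.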

This raises a subtle but important point. Much of the simplicity and power of CUDA C derives from the ability to blur the line between host and device code. However, it is the responsibility of the programmer not to dereference the pointer returned by cudaMalloc() from code that executes on the host. Host code may pass this pointer around, perform arithmetic on it, or even cast it to a different type. But you cannot use it to read from or write to memory.

Unfortunately, the compiler cannot protect you from this mistake, either. It will be perfectly happy to allow dereferences of device pointers in your host code, because a device pointer looks like any other pointer in the application. We can summarize the restrictions on the usage of device pointers as follows:

• You can pass pointers allocated with cudaMalloc() to functions that execute on the device.

• You can use pointers allocated with cudaMalloc() to read or write memory from code that executes on the device.

• You can pass pointers allocated with cudaMalloc() to functions that execute on the host.

• You cannot use pointers allocated with cudaMalloc() to read or write memory from code that executes on the host.

If you've been reading carefully, you might have anticipated the next lesson: We can't use standard C's free() function to release memory we've allocated with cudaMalloc(). To free memory we've allocated with cudaMalloc(), we need to use a call to cudaFree(), which behaves exactly like free() does.
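The pointer rules and the cudaFree() requirement can be seen together in a small variation on the add() program. This is a sketch, and add_on_device() is a hypothetical helper. The host function accepts and forwards the device pointer, which is allowed, but it never dereferences it; the result comes back through cudaMemcpy(), which is discussed shortly.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void add( int a, int b, int *c ) {
    *c = a + b;                   // allowed: device code dereferences a device pointer
}

// Allowed: host code receives a device pointer and passes it along to a kernel.
int add_on_device( int a, int b, int *dev_c ) {
    int c = 0;
    add<<<1,1>>>( a, b, dev_c );
    // Not allowed: reading *dev_c here would dereference device memory on the host.
    // Copy the value back to host memory instead:
    cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );
    return c;
}

int main( void ) {
    int *dev_c = NULL;
    if (cudaMalloc( (void**)&dev_c, sizeof(int) ) != cudaSuccess) return 1;
    printf( "2 + 7 = %d\n", add_on_device( 2, 7, dev_c ) );
    cudaFree( dev_c );            // cudaMalloc() memory is released with cudaFree(), not free()
    return 0;
}
```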


We've seen how to use the host to allocate and free memory on the device, but we've also made it painfully clear that you cannot modify this memory from the host. The remaining two lines of the sample program illustrate two of the most common methods for accessing device memory: by using device pointers from within device code and by using calls to cudaMemcpy().

We use pointers from within device code exactly the same way we use them in standard C host code. The statement *c = a + b is as simple as it looks. It adds the parameters a and b together and stores the result in the memory pointed to by c. We hope this is almost too easy to even be interesting.

We listed the ways in which we can and cannot use device pointers from within device and host code. These caveats translate exactly as one might imagine when considering host pointers. Although we are free to pass host pointers around in device code, we run into trouble when we attempt to use a host pointer to access memory from within device code. To summarize, host pointers can access memory from host code, and device pointers can access memory from device code.

As promised, we can also access memory on a device through calls to cudaMemcpy() from host code. These calls behave exactly like standard C memcpy() with an additional parameter to specify which of the source and destination pointers point to device memory. In the example, notice that the last parameter to cudaMemcpy() is cudaMemcpyDeviceToHost, instructing the runtime that the source pointer is a device pointer and the destination pointer is a host pointer.

Unsurprisingly, cudaMemcpyHostToDevice would indicate the opposite situation, where the source data is on the host and the destination is an address on the device. Finally, we can even specify that both pointers are on the device by passing cudaMemcpyDeviceToDevice. If the source and destination pointers are both on the host, we would simply use standard C's memcpy() routine to copy between them.
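The three cudaMemcpy() directions can be exercised in one round trip, sketched below under the assumption of a working CUDA device (round_trip() is a hypothetical helper). As with memcpy(), the destination pointer comes first and the source second; only the final argument says which side of the bus each pointer lives on.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Copies n ints from host array src to host array dst by way of two device
// buffers, using each of the three host/device copy directions once.
cudaError_t round_trip( const int *src, int *dst, size_t n ) {
    int *dev_a = NULL, *dev_b = NULL;
    size_t bytes = n * sizeof(int);
    cudaError_t err = cudaMalloc( (void**)&dev_a, bytes );
    if (err == cudaSuccess) err = cudaMalloc( (void**)&dev_b, bytes );
    if (err == cudaSuccess)   // source on the host, destination on the device
        err = cudaMemcpy( dev_a, src, bytes, cudaMemcpyHostToDevice );
    if (err == cudaSuccess)   // both pointers on the device
        err = cudaMemcpy( dev_b, dev_a, bytes, cudaMemcpyDeviceToDevice );
    if (err == cudaSuccess)   // source on the device, destination on the host
        err = cudaMemcpy( dst, dev_b, bytes, cudaMemcpyDeviceToHost );
    cudaFree( dev_a );        // cudaFree(NULL) does nothing, so this is safe on failure
    cudaFree( dev_b );
    return err;
}

int main( void ) {
    int src[4] = { 1, 2, 3, 4 };
    int dst[4] = { 0, 0, 0, 0 };
    cudaError_t err = round_trip( src, dst, 4 );
    printf( "%s: %d %d %d %d\n", cudaGetErrorString( err ),
            dst[0], dst[1], dst[2], dst[3] );
    return err == cudaSuccess ? 0 : 1;
}
```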

Querying Devices
Since we would like to be allocating memory and executing code on our device, it would be useful if our program had a way of knowing how much memory and what types of capabilities the device had. Furthermore, it is relatively common for people to have more than one CUDA-capable device per computer. In situations like this, we will definitely want a way to determine which processor is which.

For example, many motherboards ship with integrated NVIDIA graphics processors. When a manufacturer or user adds a discrete graphics processor to this computer, it then possesses two CUDA-capable processors. Some NVIDIA products, like the GeForce GTX 295, ship with two GPUs on a single card. Computers that contain products such as this will also show two CUDA-capable processors.

Before we get too deep into writing device code, we would love to have a mechanism for determining which devices (if any) are present and what capabilities each device supports. Fortunately, there is a very easy interface to determine this information. First, we will want to know how many devices in the system were built on the CUDA Architecture. These devices will be capable of executing kernels written in CUDA C. To get the count of CUDA devices, we call cudaGetDeviceCount(). Needless to say, we anticipate receiving an award for Most Creative Function Name.

int count;
HANDLE_ERROR( cudaGetDeviceCount( &count ) );
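Because cudaGetDeviceCount() reports failure through its return value, a program that should fall back to the CPU rather than exit (as HANDLE_ERROR() would) can check the result itself. On a machine with no CUDA-capable device or no suitable driver, current toolkits return an error code such as cudaErrorNoDevice instead of cudaSuccess. A minimal sketch, with the hypothetical helper usable_device_count():

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Returns the number of CUDA devices, or 0 if the runtime reports an error
// (for example, no CUDA-capable device or no suitable driver).
int usable_device_count( void ) {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount( &count );
    if (err != cudaSuccess) {
        fprintf( stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString( err ) );
        return 0;
    }
    return count;
}

int main( void ) {
    int count = usable_device_count();
    printf( "%d CUDA device(s) found\n", count );
    return count > 0 ? 0 : 1;
}
```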

After calling cudaGetDeviceCount(), we can then iterate through the devices and query relevant information about each. The CUDA runtime returns us these properties in a structure of type cudaDeviceProp. What kind of properties can we retrieve? As of CUDA 3.0, the cudaDeviceProp structure contains the following:

struct cudaDeviceProp {
    char name[256];
    size_t totalGlobalMem;
    size_t sharedMemPerBlock;
    int regsPerBlock;
    int warpSize;
    size_t memPitch;
    int maxThreadsPerBlock;
    int maxThreadsDim[3];
    int maxGridSize[3];
    size_t totalConstMem;
    int major;
    int minor;
    int clockRate;
    size_t textureAlignment;
    int deviceOverlap;
    int multiProcessorCount;
    int kernelExecTimeoutEnabled;
    int integrated;
    int canMapHostMemory;
    int computeMode;
    int maxTexture1D;
    int maxTexture2D[2];
    int maxTexture3D[3];
    int maxTexture2DArray[3];
    int concurrentKernels;
};

Some of these are self-explanatory; others bear some additional description (see Table 3.1).

Table 3.1 CUDA Device Properties

DEVICE PROPERTY               DESCRIPTION
char name[256]                An ASCII string identifying the device (e.g., "GeForce GTX 280")
size_t totalGlobalMem         The amount of global memory on the device, in bytes
size_t sharedMemPerBlock      The maximum amount of shared memory a single block may use, in bytes
int regsPerBlock              The number of 32-bit registers available per block
int warpSize                  The number of threads in a warp
size_t memPitch               The maximum pitch allowed for memory copies, in bytes
int maxThreadsPerBlock        The maximum number of threads that a block may contain
int maxThreadsDim[3]          The maximum number of threads allowed along each dimension of a block
int maxGridSize[3]            The maximum number of blocks allowed along each dimension of a grid
size_t totalConstMem          The amount of available constant memory, in bytes
int major                     The major revision of the device's compute capability
int minor                     The minor revision of the device's compute capability
size_t textureAlignment       The device's requirement for texture alignment
int deviceOverlap             A boolean value representing whether the device can simultaneously perform a cudaMemcpy() and kernel execution
int multiProcessorCount       The number of multiprocessors on the device
int kernelExecTimeoutEnabled  A boolean value representing whether there is a runtime limit for kernels executed on this device
int integrated                A boolean value representing whether the device is an integrated GPU (i.e., part of the chipset and not a discrete GPU)
int canMapHostMemory          A boolean value representing whether the device can map host memory into the CUDA device address space
int computeMode               A value representing the device's computing mode: default, exclusive, or prohibited
int maxTexture1D              The maximum size supported for 1D textures
int maxTexture2D[2]           The maximum dimensions supported for 2D textures
int maxTexture3D[3]           The maximum dimensions supported for 3D textures
int maxTexture2DArray[3]      The maximum dimensions supported for 2D texture arrays
int concurrentKernels         A boolean value representing whether the device supports executing multiple kernels within the same context simultaneously

We'd like to avoid going too far too fast down our rabbit hole, so we will not go into extensive detail about these properties now. In fact, the previous list is missing some important details about some of these properties, so you will want to consult the NVIDIA CUDA Programming Guide for more information. When you move on to write your own applications, these properties will prove extremely useful. However, for now we will simply show how to query each device and report the properties of each. So far, our device query looks something like this:

#include "../common/book.h"

int main( void ) {
    cudaDeviceProp prop;

    int count;
    HANDLE_ERROR( cudaGetDeviceCount( &count ) );
    for (int i=0; i< count; i++) {
        HANDLE_ERROR( cudaGetDeviceProperties( &prop, i ) );

        //Do something with our device's properties

    }
}


Now that we know each of the fields available to us, we can expand on the ambiguous "Do something..." section and implement something marginally less trivial:

#include "../common/book.h"

int main( void ) {
    cudaDeviceProp prop;

    int count;
    HANDLE_ERROR( cudaGetDeviceCount( &count ) );
    for (int i=0; i< count; i++) {
        HANDLE_ERROR( cudaGetDeviceProperties( &prop, i ) );
        printf( " --- General Information for device %d ---\n", i );
        printf( "Name: %s\n", prop.name );
        printf( "Compute capability: %d.%d\n", prop.major, prop.minor );
        printf( "Clock rate: %d kHz\n", prop.clockRate );
        printf( "Device copy overlap: " );
        if (prop.deviceOverlap)
            printf( "Enabled\n" );
        else
            printf( "Disabled\n" );
        printf( "Kernel execution timeout : " );
        if (prop.kernelExecTimeoutEnabled)
            printf( "Enabled\n" );
        else
            printf( "Disabled\n" );

        printf( " --- Memory Information for device %d ---\n", i );
        // size_t fields are printed with %zu
        printf( "Total global mem: %zu\n", prop.totalGlobalMem );
        printf( "Total constant Mem: %zu\n", prop.totalConstMem );
        printf( "Max mem pitch: %zu\n", prop.memPitch );
        printf( "Texture Alignment: %zu\n", prop.textureAlignment );

        printf( " --- MP Information for device %d ---\n", i );
        printf( "Multiprocessor count: %d\n",
                prop.multiProcessorCount );
        printf( "Shared mem per block: %zu\n", prop.sharedMemPerBlock );
        printf( "Registers per block: %d\n", prop.regsPerBlock );
        printf( "Threads in warp: %d\n", prop.warpSize );
        printf( "Max threads per block: %d\n",
                prop.maxThreadsPerBlock );
        printf( "Max thread dimensions: (%d, %d, %d)\n",
                prop.maxThreadsDim[0], prop.maxThreadsDim[1],
                prop.maxThreadsDim[2] );
        printf( "Max grid dimensions: (%d, %d, %d)\n",
                prop.maxGridSize[0], prop.maxGridSize[1],
                prop.maxGridSize[2] );
        printf( "\n" );
    }
}

Using Device Properties
Other than writing an application that handily prints every detail of every CUDA-capable card, why might we be interested in the properties of each device in our system? Since we as software developers want everyone to think our software is fast, we might be interested in choosing the GPU with the most multiprocessors on which to run our code. Or if the kernel needs close interaction with the CPU, we might be interested in running our code on the integrated GPU that shares system memory with the CPU. These are both properties we can query with cudaGetDeviceProperties().
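As a minimal sketch of the first idea, the loop below visits every device and keeps the one that reports the most multiprocessors, optionally considering only integrated GPUs. The helper pickDevice() and its onlyIntegrated flag are hypothetical names introduced for illustration; the multiProcessorCount and integrated fields are the ones listed in Table 3.1. Multiprocessor count is only a rough stand-in for speed, so treat this as a demonstration of the query pattern rather than a performance model.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns the ID of the device with the most
// multiprocessors, or -1 if none qualifies. When onlyIntegrated is true,
// discrete GPUs (prop.integrated == 0) are skipped.
int pickDevice( bool onlyIntegrated ) {
    int count = 0;
    if (cudaGetDeviceCount( &count ) != cudaSuccess)
        return -1;

    int best = -1;
    int bestMPs = 0;
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties( &prop, i ) != cudaSuccess)
            continue;
        if (onlyIntegrated && !prop.integrated)
            continue;
        if (prop.multiProcessorCount > bestMPs) {
            bestMPs = prop.multiProcessorCount;
            best = i;
        }
    }
    return best;
}

int main( void ) {
    int dev = pickDevice( false );
    if (dev < 0) {
        printf( "No suitable CUDA device found\n" );
        return 1;
    }
    if (cudaSetDevice( dev ) != cudaSuccess)
        return 1;
    printf( "Running on device %d\n", dev );
    return 0;
}
```

Passing true asks for the integrated GPU that shares system memory with the CPU; because the function returns -1 when no device qualifies, the caller can fall back gracefully instead of failing later.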

Suppose that we are writing an application that depends on having double-precision floating-point support. After a quick consultation with Appendix A of the NVIDIA CUDA Programming Guide, we know that cards that have compute capability 1.3 or higher support double-precision floating-point math. So to successfully run the double-precision application that we've written, we need to find at least one device of compute capability 1.3 or higher.


Based on what we have seen with cudaGetDeviceCount() and cudaGetDeviceProperties(), we could iterate through each device and look for one that either has a major version greater than 1 or has a major version of 1 and minor version greater than or equal to 3. But since this relatively common procedure is also relatively annoying to perform, the CUDA runtime offers us an automated way to do this. We first fill a cudaDeviceProp structure with the properties we need our device to have:
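The manual search just described might look like the following sketch. meetsCapability() and findDevice13() are hypothetical helpers, not CUDA runtime functions; the comparison encodes exactly the rule stated above: a major version greater than 1, or a major version of 1 with a minor version of at least 3.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// True when compute capability major.minor is at least reqMajor.reqMinor.
bool meetsCapability( int major, int minor, int reqMajor, int reqMinor ) {
    return major > reqMajor || (major == reqMajor && minor >= reqMinor);
}

// Hypothetical helper: first device with compute capability 1.3 or higher,
// or -1 if none is found.
int findDevice13( void ) {
    int count = 0;
    if (cudaGetDeviceCount( &count ) != cudaSuccess)
        return -1;
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties( &prop, i ) != cudaSuccess)
            continue;
        if (meetsCapability( prop.major, prop.minor, 1, 3 ))
            return i;
    }
    return -1;
}

int main( void ) {
    int dev = findDevice13();
    if (dev < 0) {
        printf( "No device of compute capability 1.3 or higher\n" );
        return 1;
    }
    printf( "Using device %d\n", dev );
    return cudaSetDevice( dev ) == cudaSuccess ? 0 : 1;
}
```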

cudaDeviceProp prop;
memset( &prop, 0, sizeof( cudaDeviceProp ) );
prop.major = 1;
prop.minor = 3;

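After filling a cudaDeviceProp structure, we pass it to cudaChooseDevice() to have the CUDA runtime find the device that most closely matches these properties. Because the result is the closest match rather than a guaranteed one, it can be worth checking the chosen device's properties before relying on them. The call to cudaChooseDevice() returns a device ID that we can then pass to cudaSetDevice(). From this point forward, all device operations will take place on the device we found in cudaChooseDevice():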

#include "../common/book.h"

int main( void ) {
    cudaDeviceProp prop;
    int dev;

    HANDLE_ERROR( cudaGetDevice( &dev ) );
    printf( "ID of current CUDA device: %d\n", dev );

    memset( &prop, 0, sizeof( cudaDeviceProp ) );
    prop.major = 1;
    prop.minor = 3;
    HANDLE_ERROR( cudaChooseDevice( &dev, &prop ) );
    printf( "ID of CUDA device closest to revision 1.3: %d\n", dev );
    HANDLE_ERROR( cudaSetDevice( dev ) );
}


Systems with multiple GPUs are becoming more and more common. For example, many of NVIDIA's motherboard chipsets contain integrated, CUDA-capable GPUs. When a discrete GPU is added to one of these systems, you suddenly have a multi-GPU platform. Moreover, NVIDIA's SLI technology allows multiple discrete GPUs to be installed side by side. In either of these cases, your application may have a preference for one GPU over another. If your application depends on certain features of the GPU or depends on having the fastest GPU in the system, you should familiarize yourself with this API because there is no guarantee that the CUDA runtime will choose the best or most appropriate GPU for your application.

Chapter Review
We've finally gotten our hands dirty writing CUDA C, and ideally it has been less painful than you might have suspected. Fundamentally, CUDA C is standard C with some ornamentation to allow us to specify which code should run on the device and which should run on the host. By adding the keyword __global__ before a function, we indicated to the compiler that we intend to run the function on the GPU. To use the GPU's dedicated memory, we also learned a CUDA API similar to C's malloc(), memcpy(), and free() APIs. The CUDA versions of these functions, cudaMalloc(), cudaMemcpy(), and cudaFree(), allow us to allocate device memory, copy data between the device and host, and free the device memory when we've finished with it.

As we progress through this book, we will see more interesting examples of how we can effectively use the device as a massively parallel coprocessor. For now, you should know how easy it is to get started with CUDA C, and in the next chapter we will see how easy it is to execute parallel code on the GPU.



Chapter 4
Parallel Programming in CUDA C

In the previous chapter, we saw how simple it can be to write code that executes on the GPU. We have even gone so far as to learn how to add two numbers together, albeit just a single, fixed pair of numbers. Admittedly, that example was not immensely impressive, nor was it incredibly interesting. But we hope you are convinced that it is easy to get started with CUDA C and you're excited to learn more. Much of the promise of GPU computing lies in exploiting the massively parallel structure of many problems. In this vein, we intend to spend this chapter examining how to execute parallel code on the GPU using CUDA C.

Chapter Objectives
Through the course of this chapter, you will accomplish the following:

• You will learn one of the fundamental ways CUDA exposes its parallelism.

• You will write your first parallel code with CUDA C.

CUDA Parallel Programming
Previously, we saw how easy it was to get a standard C function to start running on a device. By adding the __global__ qualifier to the function and by calling it using a special angle bracket syntax, we executed the function on our GPU. Although this was extremely simple, it was also extremely inefficient because NVIDIA's hardware engineering minions have optimized their graphics processors to perform hundreds of computations in parallel. However, thus far we have only ever launched a kernel that runs serially on the GPU. In this chapter, we see how straightforward it is to launch a device kernel that performs its computations in parallel.

SUMMING VECTORS

We will contrive a simple example to illustrate parallel execution and how we use it in code written with CUDA C. Imagine having two lists of numbers where we want to sum corresponding elements of each list and store the result in a third list. Figure 4.1 shows this process. If you have any background in linear algebra, you will recognize this operation as summing two vectors.

Figure 4.1 Summing two vectors

CPU VECTOR SUMS


First, we'll look at one way this addition can be accomplished with traditional C code:

#include "../common/book.h"

#define N 10

void add( int *a, int *b, int *c ) {
    int tid = 0;    // this is CPU zero, so we start at zero
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += 1;   // we have one CPU, so we increment by one
    }
}

int main( void ) {
    int a[N], b[N], c[N];

    // fill the arrays 'a' and 'b' on the CPU
    for (int i=0; i<N; i++) {
        a[i] = -i;
        b[i] = i * i;
    }

    add( a, b, c );

    // display the results
    for (int i=0; i<N; i++) {
        printf( "%d + %d = %d\n", a[i], b[i], c[i] );
    }

    return 0;
}

Most of this example bears almost no explanation, but we will briefly look at the add() function to explain why we overly complicated it.

void add( int *a, int *b, int *c ) {
    int tid = 0;    // this is CPU zero, so we start at zero
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += 1;   // we have one CPU, so we increment by one
    }
}

We compute the sum within a while loop where the index tid ranges from 0 to N-1. We add corresponding elements of a[] and b[], placing the result in the corresponding element of c[]. One would typically code this in a slightly simpler manner, like so:

void add( int *a, int *b, int *c ) {
    for (int i=0; i < N; i++) {
        c[i] = a[i] + b[i];
    }
}

Our slightly more convoluted method was intended to suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores. For example, with a dual-core processor, one could change the increment to 2 and have one core initialize the loop with tid = 0 and another with tid = 1. The first core would add the even-indexed elements, and the second core would add the odd-indexed elements. This amounts to executing the following code on each of the two CPU cores:

CPU CORE 1

void add( int *a, int *b, int *c )
{
    int tid = 0;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += 2;
    }
}

CPU CORE 2

void add( int *a, int *b, int *c )
{
    int tid = 1;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += 2;
    }
}

Of course, doing this on a CPU would require considerably more code than we have included in this example. You would need to provide a reasonable amount of infrastructure to create the worker threads that execute the function add() as well as make the assumption that each thread would execute in parallel, a scheduling assumption that is unfortunately not always true.
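To make that infrastructure concrete, here is a minimal sketch using C++11 std::thread, which is our assumption rather than anything the example prescribes. Each worker runs the strided loop from the two-core version with its own starting index; whether the workers truly run at the same time is left to the operating system's scheduler, which is exactly the caveat above. On Linux, the host compiler may need the -pthread flag.

```cuda
#include <cassert>
#include <cstdio>
#include <thread>
#include <vector>

#define N 10

// The strided loop from the two-core example, with the start index and
// stride passed in instead of hard-coded.
void addStrided( const int *a, const int *b, int *c, int start, int stride ) {
    for (int tid = start; tid < N; tid += stride)
        c[tid] = a[tid] + b[tid];
}

// Start one host thread per worker, then wait for all of them to finish.
void parallelAdd( const int *a, const int *b, int *c, int numWorkers ) {
    std::vector<std::thread> workers;
    for (int w = 0; w < numWorkers; w++)
        workers.emplace_back( addStrided, a, b, c, w, numWorkers );
    for (std::thread &t : workers)
        t.join();
}

int main( void ) {
    int a[N], b[N], c[N];
    for (int i = 0; i < N; i++) {
        a[i] = -i;
        b[i] = i * i;
    }
    parallelAdd( a, b, c, 2 );   // worker 0: even indices, worker 1: odd indices
    for (int i = 0; i < N; i++)
        printf( "%d + %d = %d\n", a[i], b[i], c[i] );
    return 0;
}
```

Even this small version needs thread creation, argument passing, and a join step, which is the bookkeeping the GPU version below lets the runtime handle.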

GPU VECTOR SUMS


We can accomplish the same addition very similarly on a GPU by writing add() as a device function. This should look similar to code you saw in the previous chapter. But before we look at the device code, we present main(). Although the GPU implementation of main() is different from the corresponding CPU version, nothing here should look new:

#include "../common/book.h"

#define N 10

int main( void ) {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    // allocate the memory on the GPU
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c, N * sizeof(int) ) );

    // fill the arrays 'a' and 'b' on the CPU
    for (int i=0; i<N; i++) {
        a[i] = -i;
        b[i] = i * i;
    }

    // copy the arrays 'a' and 'b' to the GPU
    HANDLE_ERROR( cudaMemcpy( dev_a, a, N * sizeof(int),
                              cudaMemcpyHostToDevice ) );
    HANDLE_ERROR( cudaMemcpy( dev_b, b, N * sizeof(int),
                              cudaMemcpyHostToDevice ) );

    add<<<N,1>>>( dev_a, dev_b, dev_c );

    // copy the array 'c' back from the GPU to the CPU
    HANDLE_ERROR( cudaMemcpy( c, dev_c, N * sizeof(int),
                              cudaMemcpyDeviceToHost ) );

    // display the results
    for (int i=0; i<N; i++) {
        printf( "%d + %d = %d\n", a[i], b[i], c[i] );
    }

    // free the memory allocated on the GPU
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );

    return 0;
}

You will notice some common patterns that we employ again:

• We allocate three arrays on the device using calls to cudaMalloc(): two arrays, dev_a and dev_b, to hold inputs, and one array, dev_c, to hold the result.

• Because we are environmentally conscientious coders, we clean up after ourselves with cudaFree().

• Using cudaMemcpy(), we copy the input data to the device with the parameter cudaMemcpyHostToDevice and copy the result data back to the host with cudaMemcpyDeviceToHost.

• We execute the device code in add() from the host code in main() using the triple angle bracket syntax.

As an aside, you may be wondering why we fill the input arrays on the CPU. There is no reason in particular why we need to do this. In fact, the performance of this step would be faster if we filled the arrays on the GPU. But we intend to show how a particular operation, namely, the addition of two vectors, can be implemented on a graphics processor. As a result, we ask you to imagine that this is but one step of a larger application where the input arrays a[] and b[] have been generated by some other algorithm or loaded from the hard drive by the user. In summary, it will suffice to pretend that this data appeared out of nowhere and now we need to do something with it.
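If the inputs really were produced on the GPU, a kernel could generate them in place and the two host-to-device copies would disappear. The sketch below reuses the one-block-per-element pattern of add(); fill() and fillOnDevice() are hypothetical names, and we have not measured whether this is actually faster for a vector as short as N = 10.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

#define N 10

// One block per element, as in add(): block tid writes the same values
// that main() previously computed on the CPU.
__global__ void fill( int *a, int *b ) {
    int tid = blockIdx.x;
    if (tid < N) {
        a[tid] = -tid;
        b[tid] = tid * tid;
    }
}

// Hypothetical wrapper: fills the inputs on the device, then copies them
// back only so the caller can inspect them. Returns false on any CUDA error.
bool fillOnDevice( int *a, int *b ) {
    int *dev_a = NULL, *dev_b = NULL;
    bool ok = cudaMalloc( (void**)&dev_a, N * sizeof(int) ) == cudaSuccess &&
              cudaMalloc( (void**)&dev_b, N * sizeof(int) ) == cudaSuccess;
    if (ok) {
        fill<<<N,1>>>( dev_a, dev_b );
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy( a, dev_a, N * sizeof(int),
                         cudaMemcpyDeviceToHost ) == cudaSuccess &&
             cudaMemcpy( b, dev_b, N * sizeof(int),
                         cudaMemcpyDeviceToHost ) == cudaSuccess;
    }
    cudaFree( dev_a );   // cudaFree(NULL) is a harmless no-op
    cudaFree( dev_b );
    return ok;
}

int main( void ) {
    int a[N], b[N];
    if (!fillOnDevice( a, b )) {
        printf( "CUDA call failed\n" );
        return 1;
    }
    for (int i = 0; i < N; i++)
        printf( "a[%d] = %d, b[%d] = %d\n", i, a[i], i, b[i] );
    return 0;
}
```

In a real pipeline, dev_a and dev_b would be handed straight to add() instead of being copied back; the copies here exist only to show the results.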

Moving on, our add() routine looks similar to its corresponding CPU implementation:

__global__ void add( int *a, int *b, int *c ) {
    int tid = blockIdx.x;    // handle the data at this index
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

Again, we see a common pattern with the function add():

• We have written a function called add() that executes on the device. We accomplished this by taking C code and adding a __global__ qualifier to the function's declaration.

So far, there is nothing new in this example except that it can do more than add a single, fixed pair of numbers. However, there are two noteworthy components of this example: The parameters within the triple angle brackets and the code contained in the kernel itself both introduce new concepts.

Up to this point, we have always seen kernels launched in the following form:
kernel<<<1,1>>>( param1, param2, … );

But in this example we are launching with a number in the angle brackets that is not 1:
add<<<N,1>>>( dev_a, dev_b, dev_c );

What gives?


Recall that we left those two numbers in the angle brackets unexplained; we stated vaguely that they were parameters to the runtime that describe how to launch the kernel. Well, the first number in those parameters represents the number of parallel blocks in which we would like the device to execute our kernel. In this case, we're passing the value N for this parameter.

For example, if we launch with kernel<<<2,1>>>(), you can think of the runtime creating two copies of the kernel and running them in parallel. We call each of these parallel invocations a block. With kernel<<<256,1>>>(), you would get 256 blocks running on the GPU. Parallel programming has never been easier.

But this raises an excellent question: The GPU runs N copies of our kernel code, but how can we tell from within the code which block is currently running? This question brings us to the second new feature of the example, the kernel code itself. Specifically, it brings us to the variable blockIdx.x:

__global__ void add( int *a, int *b, int *c ) {
    int tid = blockIdx.x;    // handle the data at this index
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

At first glance, it looks like this variable should cause an error at compile time, since we use it to assign the value of tid but we have never defined it. However, there is no need to define the variable blockIdx; this is one of the built-in variables that the CUDA runtime defines for us. Furthermore, we use this variable for exactly what it sounds like it means. It contains the value of the block index for whichever block is currently running the device code.
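A quick way to see blockIdx.x in action is to have every block write its own index into an array. In this sketch, whoAmI() and recordBlockIds() are hypothetical names. The blocks may run in any order, but because each one writes only to the slot named by its own blockIdx.x, the array always comes back as 0, 1, 2, 3.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCKS 4

// Every block stores its own index in the slot with that index.
__global__ void whoAmI( int *ids ) {
    ids[blockIdx.x] = blockIdx.x;
}

// Hypothetical wrapper: launches BLOCKS blocks and copies the ids back.
bool recordBlockIds( int *ids ) {
    int *dev_ids = NULL;
    if (cudaMalloc( (void**)&dev_ids, BLOCKS * sizeof(int) ) != cudaSuccess)
        return false;
    whoAmI<<<BLOCKS,1>>>( dev_ids );
    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaMemcpy( ids, dev_ids, BLOCKS * sizeof(int),
                          cudaMemcpyDeviceToHost ) == cudaSuccess;
    cudaFree( dev_ids );
    return ok;
}

int main( void ) {
    int ids[BLOCKS];
    if (!recordBlockIds( ids )) {
        printf( "CUDA call failed\n" );
        return 1;
    }
    for (int i = 0; i < BLOCKS; i++)
        printf( "slot %d was written by block %d\n", i, ids[i] );
    return 0;
}
```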

Why, you may then ask, is it not just blockIdx? Why blockIdx.x? As it turns out, CUDA C allows you to define a group of blocks in two dimensions. For problems with two-dimensional domains, such as matrix math or image processing, it is often convenient to use two-dimensional indexing to avoid annoying translations from linear to rectangular indices. Don't worry if you aren't familiar with these problem types; just know that using two-dimensional indexing can sometimes be more convenient than one-dimensional indexing. But you never have to use it. We won't be offended.


When we launched the kernel, we specified N as the number of parallel blocks. We call the collection of parallel blocks a grid. This specifies to the runtime system that we want a one-dimensional grid of N blocks (scalar values are interpreted as one-dimensional). These blocks will have varying values for blockIdx.x, the first taking value 0 and the last taking value N-1. So, imagine four blocks, all running through the same copy of the device code but having different values for the variable blockIdx.x. This is what the actual code being executed in each of the four parallel blocks looks like after the runtime substitutes the appropriate block index for blockIdx.x:

BLOCK 1

__global__ void
add( int *a, int *b, int *c ) {
    int tid = 0;
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

BLOCK 2

__global__ void
add( int *a, int *b, int *c ) {
    int tid = 1;
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

BLOCK 3

__global__ void
add( int *a, int *b, int *c ) {
    int tid = 2;
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

BLOCK 4

__global__ void
add( int *a, int *b, int *c ) {
    int tid = 3;
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

Recall that in the CPU-based example with which we began, we needed to walk through indices from 0 to N-1 in order to sum the two vectors. Since the runtime system is already launching a kernel where each block will have one of these indices, nearly all of this work has already been done for us. Because we're something of a lazy lot, this is a good thing. It affords us more time to blog, probably about how lazy we are.

The last remaining question to be answered is, why do we check whether tid is less than N? It should always be less than N, since we've specifically launched our kernel such that this assumption holds. But our desire to be lazy also makes us paranoid about someone breaking an assumption we've made in our code. Breaking code assumptions means broken code. This means bug reports, late nights tracking down bad behavior, and generally lots of activities that stand between us and our blog. If we didn't check that tid is less than N and subsequently fetched memory that wasn't ours, this would be bad. In fact, it could possibly kill the execution of your kernel, since GPUs have sophisticated memory management units that kill processes that seem to be violating memory rules.
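One common way the assumption gets broken is rounding the number of blocks up, for example to a multiple of some convenient size. The sketch below launches 16 blocks for N = 10 (rounding up to a multiple of 8 is an arbitrary, illustrative choice); the tid < N guard makes the six extra blocks do nothing instead of touching memory past the end of the three N-element arrays. addPadded() is a hypothetical wrapper name.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

#define N 10

__global__ void add( int *a, int *b, int *c ) {
    int tid = blockIdx.x;
    if (tid < N)            // blocks N..gridDim.x-1 fall through and do nothing
        c[tid] = a[tid] + b[tid];
}

// Hypothetical wrapper: copies a and b in, launches a padded grid,
// copies c out. Returns false on any CUDA error.
bool addPadded( const int *a, const int *b, int *c ) {
    int *dev_a = NULL, *dev_b = NULL, *dev_c = NULL;
    size_t bytes = N * sizeof(int);
    bool ok = cudaMalloc( (void**)&dev_a, bytes ) == cudaSuccess &&
              cudaMalloc( (void**)&dev_b, bytes ) == cudaSuccess &&
              cudaMalloc( (void**)&dev_c, bytes ) == cudaSuccess &&
              cudaMemcpy( dev_a, a, bytes, cudaMemcpyHostToDevice ) == cudaSuccess &&
              cudaMemcpy( dev_b, b, bytes, cudaMemcpyHostToDevice ) == cudaSuccess;
    if (ok) {
        int blocks = (N + 7) / 8 * 8;   // 16 blocks for N = 10
        add<<<blocks,1>>>( dev_a, dev_b, dev_c );
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy( c, dev_c, bytes, cudaMemcpyDeviceToHost ) == cudaSuccess;
    }
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );
    return ok;
}

int main( void ) {
    int a[N], b[N], c[N];
    for (int i = 0; i < N; i++) {
        a[i] = -i;
        b[i] = i * i;
    }
    if (!addPadded( a, b, c )) {
        printf( "CUDA call failed\n" );
        return 1;
    }
    for (int i = 0; i < N; i++)
        printf( "%d + %d = %d\n", a[i], b[i], c[i] );
    return 0;
}
```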

If you encounter problems like the ones just mentioned, one of the HANDLE_ERROR() macros that we've sprinkled so liberally throughout the code will detect and alert you to the situation. As with traditional C programming, the lesson here is that functions return error codes for a reason. Although it is always tempting to ignore these error codes, we would love to save you the hours of pain through which we have suffered by urging that you check the results of every operation that can fail. As is often the case, the presence of these errors will not prevent you from continuing the execution of your application, but they will most certainly cause all manner of unpredictable and unsavory side effects downstream.
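A kernel launch such as add<<<N,1>>>() does not return an error code of its own. A common pattern, sketched below with a hypothetical CHECK() macro and checkCuda() helper (book.h's HANDLE_ERROR() serves a similar purpose, but this is not its definition), is to call cudaGetLastError() immediately after the launch to catch problems with the launch itself, and cudaDeviceSynchronize() to surface errors raised while the kernel ran. Out-of-bounds accesses are not guaranteed to be reported, so a guard such as tid < N remains necessary.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Prints a readable message for a failed CUDA call; returns true on success.
bool checkCuda( cudaError_t err, const char *file, int line ) {
    if (err != cudaSuccess) {
        fprintf( stderr, "%s in %s at line %d\n",
                 cudaGetErrorString( err ), file, line );
        return false;
    }
    return true;
}

// Hypothetical macro: abort the program if the wrapped call fails.
#define CHECK( call ) \
    do { if (!checkCuda( (call), __FILE__, __LINE__ )) exit( EXIT_FAILURE ); } while (0)

__global__ void setOnes( int *p ) {
    p[blockIdx.x] = 1;
}

int main( void ) {
    const int blocks = 4;
    int *dev_p;
    CHECK( cudaMalloc( (void**)&dev_p, blocks * sizeof(int) ) );

    setOnes<<<blocks,1>>>( dev_p );
    CHECK( cudaGetLastError() );       // problems with the launch itself
    CHECK( cudaDeviceSynchronize() );  // problems raised while the kernel ran

    CHECK( cudaFree( dev_p ) );
    printf( "no errors reported\n" );
    return 0;
}
```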

At this point, you're running code in parallel on the GPU! Perhaps you had heard this was tricky or that you had to understand computer graphics to do general-purpose programming on a graphics processor. We hope you are starting to see how CUDA C makes it much easier to get started writing parallel code on a GPU. We used the example only to sum vectors of length 10. If you would like to see how easy it is to generate a massively parallel application, try raising the 10 in the line #define N 10 into the tens of thousands to launch that many parallel blocks. Be warned, though: No dimension of your launch of blocks may exceed 65,535 on the GPUs this chapter targets (your device reports its actual limits in the maxGridSize field of cudaDeviceProp). This is simply a hardware-imposed limit, so you will start to see failures if you attempt launches with more blocks than this. In the next chapter, we will see how to work within this limitation.
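Rather than discovering the limit through failed launches, a program can compare the number of blocks it wants against the maxGridSize field from Table 3.1 before launching. In this sketch, fitsInGrid() is a hypothetical helper; a one-dimensional launch such as add<<<N,1>>>() is constrained by maxGridSize[0], and the three sample sizes are arbitrary.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if a one-dimensional launch of `blocks` blocks
// stays within the device's limit for the x dimension of a grid.
bool fitsInGrid( long long blocks, const cudaDeviceProp &prop ) {
    return blocks > 0 && blocks <= prop.maxGridSize[0];
}

int main( void ) {
    int dev;
    cudaDeviceProp prop;
    if (cudaGetDevice( &dev ) != cudaSuccess ||
        cudaGetDeviceProperties( &prop, dev ) != cudaSuccess) {
        printf( "Could not query the current device\n" );
        return 1;
    }
    printf( "maxGridSize[0] = %d\n", prop.maxGridSize[0] );

    const long long wanted[] = { 10, 50000, 100000 };
    for (long long n : wanted)
        printf( "N = %lld: %s\n", n,
                fitsInGrid( n, prop ) ? "fits" : "too many blocks" );
    return 0;
}
```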

A FUN EXAMPLE
We don't mean to imply that adding vectors is anything less than fun, but the following example will satisfy those looking for some flashy examples of parallel CUDA C.

The following example will demonstrate code to draw slices of the Julia Set. For the uninitiated, the Julia Set is the boundary of a certain class of functions over complex numbers. Undoubtedly, this sounds even less fun than vector addition and matrix multiplication. However, for almost all values of the function's parameters, this boundary forms a fractal, one of the most interesting and beautiful curiosities of mathematics.

The calculations involved in generating such a set are quite simple. At its heart, generating the Julia Set means evaluating a simple iterative equation for points in the complex plane. A point is not in the set if the process of iterating the equation diverges for that point. That is, if the sequence of values produced by iterating the equation grows toward infinity, a point is considered outside the set. Conversely, if the values taken by the equation remain bounded, the point is in the set.

Computationally, the iterative equation in question is remarkably simple, as shown in Equation 4.1.

Equation 4.1

$$Z_{n+1} = Z_n^2 + C$$

Computing an iteration of Equation 4.1 would therefore involve squaring the current value and adding a constant to get the next value of the equation.
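Writing $Z_n = x + iy$ and $C = c_r + i\,c_i$ (notation introduced here for the derivation), one iteration splits into a real and an imaginary part. These are the same formulas the cuComplex operators later in this chapter implement: the real part of a product is r*a.r - i*a.i and the imaginary part is i*a.r + r*a.i. As a worked step, starting from $Z_0 = 0$ (the point at the center of the image) with the constant $C = -0.8 + 0.156i$ used later in the chapter:

$$
\begin{aligned}
Z_n^2 + C &= (x + iy)^2 + (c_r + i\,c_i) = \left(x^2 - y^2 + c_r\right) + i\left(2xy + c_i\right) \\
Z_1 &= 0^2 + C = -0.8 + 0.156i \\
Z_2 &= \left((-0.8)^2 - 0.156^2 - 0.8\right) + i\left(2(-0.8)(0.156) + 0.156\right) = -0.184336 - 0.0936i
\end{aligned}
$$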

CPU JULIA SET


We will examine a source listing now that will compute and visualize the Julia Set. Since this is a more complicated program than we have studied so far, we will split it into pieces here. Later in the chapter, you will see the entire source listing.

int main( void ) {
    CPUBitmap bitmap( DIM, DIM );
    unsigned char *ptr = bitmap.get_ptr();

    kernel( ptr );

    bitmap.display_and_exit();
}

Our main routine is remarkably simple. It creates an appropriately sized bitmap image using a provided utility library. Next, it passes a pointer to the bitmap data to the kernel function, and finally it displays the result.


void kernel( unsigned char *ptr ){
    for (int y=0; y<DIM; y++) {
        for (int x=0; x<DIM; x++) {
            int offset = x + y * DIM;

            int juliaValue = julia( x, y );
            ptr[offset*4 + 0] = 255 * juliaValue;
            ptr[offset*4 + 1] = 0;
            ptr[offset*4 + 2] = 0;
            ptr[offset*4 + 3] = 255;
        }
    }
}

The computation kernel does nothing more than iterate through all points we care to render, calling julia() on each to determine membership in the Julia Set. The function julia() will return 1 if the point is in the set and 0 if it is not in the set. We set the point's color to be red if julia() returns 1 and black if it returns 0. These colors are arbitrary, and you should feel free to choose a color scheme that matches your personal aesthetics.

int julia( int x, int y ) {
    const float scale = 1.5;
    float jx = scale * (float)(DIM/2 - x)/(DIM/2);
    float jy = scale * (float)(DIM/2 - y)/(DIM/2);

    cuComplex c(-0.8, 0.156);
    cuComplex a(jx, jy);

    int i = 0;
    for (i=0; i<200; i++) {
        a = a * a + c;
        if (a.magnitude2() > 1000)
            return 0;
    }

    return 1;
}


This function is the meat of the example. We begin by translating our pixel coordinate to a coordinate in complex space. To center the complex plane at the image center, we shift by DIM/2. Then, to ensure that the image spans the range of -1.0 to 1.0, we scale the image coordinate by DIM/2. Thus, given an image point at (x,y), we get a point in complex space at ( (DIM/2 - x)/(DIM/2), (DIM/2 - y)/(DIM/2) ).

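Then, to potentially zoom in or out, we introduce a scale factor. Currently, the scale is hard-coded to be 1.5, but you should tweak this parameter to zoom in or out: because jx and jy are multiplied by scale, a larger value covers more of the complex plane (zooming out), and a smaller value covers less (zooming in). If you are feeling really ambitious, you could make this a command-line parameter.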
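The coordinate mapping and the scale factor can be pulled out into a small sketch. toComplexCoord() is a hypothetical helper that repeats the jx/jy arithmetic from julia(), marked __host__ __device__ so the same code could serve both the CPU and GPU versions, and main() shows one way to read an optional scale from the command line with strtof(). To use such a value in the GPU version, you would also need to pass it to kernel() and julia() as an extra parameter, which the listing in this chapter does not do.

```cuda
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define DIM 1000

// Same arithmetic as jx and jy in julia(): shift by DIM/2 so the image
// center maps to 0, divide by DIM/2 so the image spans roughly -1.0 to 1.0,
// then multiply by scale.
__host__ __device__ float toComplexCoord( int p, float scale ) {
    return scale * (float)(DIM/2 - p) / (DIM/2);
}

int main( int argc, char *argv[] ) {
    // Optional first argument overrides the default scale of 1.5.
    float scale = 1.5f;
    if (argc > 1) {
        float s = strtof( argv[1], NULL );
        if (s > 0.0f)
            scale = s;
    }
    printf( "scale %.2f: pixel 0 -> %.3f, pixel %d -> %.3f, pixel %d -> %.3f\n",
            scale, toComplexCoord( 0, scale ),
            DIM/2, toComplexCoord( DIM/2, scale ),
            DIM - 1, toComplexCoord( DIM - 1, scale ) );
    return 0;
}
```

Run without arguments, it should map pixel 0 to 1.500, pixel 500 to 0.000, and pixel 999 to -1.497; passing 0.5 instead narrows the visible range by a factor of three, which zooms in.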

After obtaining the point in complex space, we then need to determine whether the point is in or out of the Julia Set. If you recall the previous section, we do this by computing the values of the iterative equation $Z_{n+1} = Z_n^2 + C$. Since C is some arbitrary complex-valued constant, we have chosen -0.8 + 0.156i because it happens to yield an interesting picture. You should play with this constant if you want to see other versions of the Julia Set.

In the example, we compute 200 iterations of this function. After each iteration, we check whether the squared magnitude of the result, as returned by magnitude2(), exceeds some threshold (1,000 for our purposes). If so, the equation is diverging, and we can return 0 to indicate that the point is not in the set. On the other hand, if we finish all 200 iterations and the squared magnitude is still bounded under 1,000, we assume that the point is in the set, and we return 1 to the caller, kernel().

Since all the computations are being performed on complex numbers, we define a generic structure to store complex numbers.

struct cuComplex {
    float   r;
    float   i;
    cuComplex( float a, float b ) : r(a), i(b) {}
    float magnitude2( void ) { return r * r + i * i; }
    cuComplex operator*(const cuComplex& a) {
        return cuComplex(r*a.r - i*a.i, i*a.r + r*a.i);
    }
    cuComplex operator+(const cuComplex& a) {
        return cuComplex(r+a.r, i+a.i);
    }
};


The class represents complex numbers with two data elements: a single-precision real component r and a single-precision imaginary component i. The class defines addition and multiplication operators that combine complex numbers as expected. (If you are completely unfamiliar with complex numbers, you can get a quick primer online.) Finally, we define a method, magnitude2(), that returns the squared magnitude of the complex number.

GPU JULIA SET


The device implementation is remarkably similar to the CPU version, continuing a trend you may have noticed.

int main( void ) {
    CPUBitmap bitmap( DIM, DIM );
    unsigned char *dev_bitmap;

    HANDLE_ERROR( cudaMalloc( (void**)&dev_bitmap,
                              bitmap.image_size() ) );

    dim3 grid(DIM,DIM);
    kernel<<<grid,1>>>( dev_bitmap );

    HANDLE_ERROR( cudaMemcpy( bitmap.get_ptr(),
                              dev_bitmap,
                              bitmap.image_size(),
                              cudaMemcpyDeviceToHost ) );
    bitmap.display_and_exit();

    cudaFree( dev_bitmap );
}

This version of main() looks much more complicated than the CPU version, but the flow is actually identical. As with the CPU version, we create a DIM x DIM bitmap image using our utility library. But because we will be doing computation on a GPU, we also declare a pointer called dev_bitmap to hold a copy of the data on the device. And to hold data, we need to allocate memory using cudaMalloc().

We then run our kernel() function exactly like in the CPU version, although now it is a __global__ function, meaning it will run on the GPU. As with the CPU example, we pass kernel() the pointer we allocated in the previous line to store the results. The only difference is that the memory resides on the GPU now, not on the host system.

The most significant difference is that we specify how many parallel blocks on which to execute the function kernel(). Because each point can be computed independently of every other point, we simply specify one copy of the function for each point we want to compute. We mentioned that for some problem domains, it helps to use two-dimensional indexing. Unsurprisingly, computing function values over a two-dimensional domain such as the complex plane is one of these problems. So we specify a two-dimensional grid of blocks in this line:
dim3 grid(DIM,DIM);

The type dim3 is not a standard C type, lest you feared you had forgotten some key pieces of information. Rather, the CUDA runtime header files define some convenience types to encapsulate multidimensional tuples. The type dim3 represents a three-dimensional tuple that will be used to specify the size of our launch. But why do we use a three-dimensional value when we oh-so-clearly stated that our launch is a two-dimensional grid?

Frankly, we do this because a three-dimensional dim3 value is what the CUDA runtime expects. Although a three-dimensional launch grid is not supported on the GPUs this chapter targets, the CUDA runtime still expects a dim3 variable where the last component equals 1. When we initialize it with only two values, as we do in the statement dim3 grid(DIM,DIM), the dim3 constructor automatically fills the third dimension with the value 1, so everything here will work as expected. NVIDIA did later add three-dimensional grids on devices of compute capability 2.0 and higher, but for now we'll just play nicely with the kernel launch API, because when coders and APIs fight, the API always wins.
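A short host-only sketch confirms how unspecified dim3 components are filled in. dim3 and its x, y, and z members come from the CUDA runtime headers; the variable names are arbitrary. It should report (1000, 1000, 1), (256, 1, 1), and (1, 1, 1).

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

int main( void ) {
    dim3 grid( 1000, 1000 );  // z not given: the constructor supplies 1
    dim3 row( 256 );          // y and z default to 1
    dim3 one;                 // all three components default to 1
    printf( "grid = (%u, %u, %u)\n", grid.x, grid.y, grid.z );
    printf( "row  = (%u, %u, %u)\n", row.x, row.y, row.z );
    printf( "one  = (%u, %u, %u)\n", one.x, one.y, one.z );
    return 0;
}
```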


We then pass our dim3 variable grid to the CUDA runtime in this line:
kernel<<<grid,1>>>( dev_bitmap );

Finally, a consequence of the results residing on the device is that after executing kernel(), we have to copy the results back to the host. As we learned in previous chapters, we accomplish this with a call to cudaMemcpy(), specifying the direction cudaMemcpyDeviceToHost as the last argument.

HANDLE_ERROR( cudaMemcpy( bitmap.get_ptr(),
                          dev_bitmap,
                          bitmap.image_size(),
                          cudaMemcpyDeviceToHost ) );

One of the last wrinkles separating the two implementations comes in kernel() itself:

__global__ void kernel( unsigned char *ptr ) {
    // map from blockIdx to pixel position
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + y * gridDim.x;

    // now calculate the value at that position
    int juliaValue = julia( x, y );
    ptr[offset*4 + 0] = 255 * juliaValue;
    ptr[offset*4 + 1] = 0;
    ptr[offset*4 + 2] = 0;
    ptr[offset*4 + 3] = 255;
}

First, we need kernel() to be declared as a __global__ function so it runs on the device but can be called from the host. Unlike the CPU version, we no longer need nested for() loops to generate the pixel indices that get passed to julia(). As with the vector addition example, the CUDA runtime generates these indices for us in the variable blockIdx. This works because we declared our grid of blocks to have the same dimensions as our image, so we get one block for each pair of integers (x,y) between (0,0) and (DIM-1, DIM-1).

Next, the only additional information we need is a linear offset into our output buffer, ptr. This gets computed using another built-in variable, gridDim. This variable is a constant across all blocks and simply holds the dimensions of the grid that was launched. In this example, it will always be the value (DIM, DIM). So, multiplying the row index by the grid width and adding the column index will give us a unique index into ptr that ranges from 0 to (DIM*DIM-1).
int offset = x + y * gridDim.x;
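To convince yourself that x + y * gridDim.x really gives every block its own pixel, the sketch below factors the expression into a hypothetical pixelOffset() helper and checks, on a small grid, that no two (x, y) pairs collide. It also shows why the kernel multiplies by 4: the bitmap stores four bytes per pixel. It should print pixel (3, 2) starts at byte 76.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Row-major linear index, as in kernel(): x is the column, y is the row,
// and width plays the role of gridDim.x.
__host__ __device__ int pixelOffset( int x, int y, int width ) {
    return x + y * width;
}

int main( void ) {
    // On a small W x H grid, check that every (x, y) lands on a distinct
    // offset in [0, W*H - 1], so no two blocks write the same pixel.
    const int W = 8, H = 5;
    bool seen[W * H] = { false };
    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) {
            int off = pixelOffset( x, y, W );
            assert( off >= 0 && off < W * H && !seen[off] );
            seen[off] = true;
        }
    }
    // The bitmap stores four bytes per pixel, hence ptr[offset*4 + k].
    printf( "pixel (3, 2) starts at byte %d\n", pixelOffset( 3, 2, W ) * 4 );
    return 0;
}
```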

Finally, we examine the actual code that determines whether a point is in or out of the Julia Set. This code should look nearly identical to the CPU version, continuing a trend we have seen in many examples now:

__device__ int julia( int x, int y ) {
    const float scale = 1.5;
    float jx = scale * (float)(DIM/2 - x)/(DIM/2);
    float jy = scale * (float)(DIM/2 - y)/(DIM/2);

    cuComplex c(-0.8, 0.156);
    cuComplex a(jx, jy);

    int i = 0;
    for (i=0; i<200; i++) {
        a = a * a + c;
        if (a.magnitude2() > 1000)
            return 0;
    }

    return 1;
}


Again, we define a cuComplex structure that stores a complex number as two single-precision floating-point components. The structure also defines addition and multiplication operators as well as a function, magnitude2(), that returns the squared magnitude of the complex value.

struct cuComplex {
    float   r;
    float   i;
    // also called from device code (julia() and the operators), so it needs __device__
    __device__ cuComplex( float a, float b ) : r(a), i(b) {}
    __device__ float magnitude2( void ) {
        return r * r + i * i;
    }
    __device__ cuComplex operator*(const cuComplex& a) {
        return cuComplex(r*a.r - i*a.i, i*a.r + r*a.i);
    }
    __device__ cuComplex operator+(const cuComplex& a) {
        return cuComplex(r+a.r, i+a.i);
    }
};

Notice that we use the same language constructs in CUDA C that we use in our CPU version. The one difference is the qualifier __device__, which indicates that this code will run on a GPU and not on the host. Recall that because these functions are declared as __device__ functions, they will be callable only from other __device__ functions or from __global__ functions.
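Because __device__ members can be called only from device code, the CPU and GPU versions of cuComplex in this chapter are two separate definitions. CUDA C also allows the qualifiers to be combined as __host__ __device__, which asks the compiler to build a function for both the CPU and the GPU. The sketch below shows that variant; it is an alternative to the chapter's approach, not the listing the chapter uses, and squareInPlace() is a hypothetical kernel included only to show that device code can use the same operators. Running it should print (1+2i)^2 = -3 + 4i, magnitude2 = 25.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// One definition for both sides: __host__ __device__ asks nvcc to compile
// each member for the CPU and for the GPU.
struct cuComplex {
    float r;
    float i;
    __host__ __device__ cuComplex( float a, float b ) : r(a), i(b) {}
    __host__ __device__ float magnitude2( void ) const { return r * r + i * i; }
    __host__ __device__ cuComplex operator*( const cuComplex &a ) const {
        return cuComplex( r*a.r - i*a.i, i*a.r + r*a.i );
    }
    __host__ __device__ cuComplex operator+( const cuComplex &a ) const {
        return cuComplex( r + a.r, i + a.i );
    }
};

// Device code can use exactly the same operators.
__global__ void squareInPlace( cuComplex *z ) {
    *z = (*z) * (*z);
}

int main( void ) {
    cuComplex z( 1.0f, 2.0f );
    cuComplex sq = z * z;    // (1 + 2i)^2 = -3 + 4i, evaluated on the host
    printf( "(1+2i)^2 = %g + %gi, magnitude2 = %g\n",
            sq.r, sq.i, sq.magnitude2() );
    return 0;
}
```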

Since we've interrupted the code with commentary so frequently, here is the entire source listing from start to finish:

#include "../common/book.h"
#include "../common/cpu_bitmap.h"

#define DIM 1000

struct cuComplex {
    float   r;
    float   i;
    // also called from device code (julia() and the operators), so it needs __device__
    __device__ cuComplex( float a, float b ) : r(a), i(b) {}
    __device__ float magnitude2( void ) {
        return r * r + i * i;
    }
    __device__ cuComplex operator*(const cuComplex& a) {
        return cuComplex(r*a.r - i*a.i, i*a.r + r*a.i);
    }
    __device__ cuComplex operator+(const cuComplex& a) {
        return cuComplex(r+a.r, i+a.i);
    }
};

__device__ int julia( int x, int y ) {
    const float scale = 1.5;
    float jx = scale * (float)(DIM/2 - x)/(DIM/2);
    float jy = scale * (float)(DIM/2 - y)/(DIM/2);

    cuComplex c(-0.8, 0.156);
    cuComplex a(jx, jy);

    int i = 0;
    for (i=0; i<200; i++) {
        a = a * a + c;
        if (a.magnitude2() > 1000)
            return 0;
    }

    return 1;
}

__global__ void kernel( unsigned char *ptr ) {
    // map from blockIdx to pixel position
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + y * gridDim.x;

    // now calculate the value at that position
    int juliaValue = julia( x, y );
    ptr[offset*4 + 0] = 255 * juliaValue;
    ptr[offset*4 + 1] = 0;
    ptr[offset*4 + 2] = 0;
    ptr[offset*4 + 3] = 255;
}

int main( void ) {
    CPUBitmap bitmap( DIM, DIM );
    unsigned char *dev_bitmap;

    HANDLE_ERROR( cudaMalloc( (void**)&dev_bitmap,
                              bitmap.image_size() ) );

    dim3 grid(DIM,DIM);
    kernel<<<grid,1>>>( dev_bitmap );

    HANDLE_ERROR( cudaMemcpy( bitmap.get_ptr(), dev_bitmap,
                              bitmap.image_size(),
                              cudaMemcpyDeviceToHost ) );
    bitmap.display_and_exit();

    HANDLE_ERROR( cudaFree( dev_bitmap ) );
}

When you run the application, you should see a visualization of the Julia Set. To convince you that it has earned the title "A Fun Example," Figure 4.2 shows a screenshot taken from this application.

Figure 4.2 A screenshot from the GPU Julia Set application

Chapter Review
Congratulations, you can now write, compile, and run massively parallel code on a graphics processor! You should go brag to your friends. And if they are still under the misconception that GPU computing is exotic and difficult to master, they will be most impressed. The ease with which you accomplished it will be our secret. If they're people you trust with your secrets, suggest that they buy the book, too.

We have so far looked at how to instruct the CUDA runtime to execute multiple copies of our program in parallel on what we called blocks. We called the collection of blocks we launch on the GPU a grid. As the name might imply, a grid can be either a one- or two-dimensional collection of blocks. Each copy of the kernel can determine which block it is executing with the built-in variable blockIdx. Likewise, it can determine the size of the grid by using the built-in variable gridDim. Both of these built-in variables proved useful within our kernel to calculate the data index for which each block is responsible.



Chapter 5
Thread Cooperation

We have now written our first program using CUDA C and have seen how to write code that executes in parallel on a GPU. This is an excellent start! But arguably one of the most important components of parallel programming is the means by which the parallel processing elements cooperate on solving a problem. Rare are the problems where every processor can compute results and terminate execution without a passing thought as to what the other processors are doing. For even moderately sophisticated algorithms, we will need the parallel copies of our code to communicate and cooperate. So far, we have not seen any mechanisms for accomplishing this communication between sections of CUDA C code executing in parallel. Fortunately, there is a solution, one that we will begin to explore in this chapter.

Chapter Objectives
Through the course of this chapter, you will accomplish the following:

• You will learn about what CUDA C calls threads.

• You will learn a mechanism for different threads to communicate with each other.

• You will learn a mechanism to synchronize the parallel execution of different threads.

Splitting Parallel Blocks
In the previous chapter, we looked at how to launch parallel code on the GPU. We did this by instructing the CUDA runtime system on how many parallel copies of our kernel to launch. We call these parallel copies blocks.

The CUDA runtime allows these blocks to be split into threads. Recall that when we launched multiple parallel blocks, we changed the first argument in the angle brackets from 1 to the number of blocks we wanted to launch. For example, when we studied vector addition, we launched a block for each element in the vector of size N by calling this:
add<<<N,1>>>( dev_a, dev_b, dev_c );

Inside the angle brackets, the second parameter actually represents the number of threads per block we want the CUDA runtime to create on our behalf. To this point, we have only ever launched one thread per block. In the previous example, we launched the following:

N blocks x 1 thread/block = N parallel threads

So really, we could have launched N/2 blocks with two threads per block, N/4 blocks with four threads per block, and so on. Let's revisit our vector addition example armed with this new information about the capabilities of CUDA C.

VECTOR SUMS: REDUX
We endeavor to accomplish the same task as we did in the previous chapter. That is, we want to take two input vectors and store their sum in a third output vector. However, this time we will use threads instead of blocks to accomplish this.

You may be wondering, what is the advantage of using threads rather than blocks? Well, for now, there is no advantage worth discussing. But parallel threads within a block will have the ability to do things that parallel blocks cannot do. So for now, be patient and humor us while we walk through a parallel thread version of the parallel block example from the previous chapter.

GPU VECTOR SUMS USING THREADS
We will start by addressing the two changes of note when moving from parallel blocks to parallel threads. Our kernel invocation will change from one that launches N blocks of one thread apiece:
add<<<N,1>>>( dev_a, dev_b, dev_c );

to a version that launches N threads, all within one block:
add<<<1,N>>>( dev_a, dev_b, dev_c );

The only other change arises in the method by which we index our data. Previously, within our kernel we indexed the input and output data by block index:
int tid = blockIdx.x;

The punchline here should not be a surprise: now that we have only a single block, we have to index the data by thread index:
int tid = threadIdx.x;
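The following sketch checks that the two indexing schemes cover the same elements. It is a minimal host-side C++ sketch (it also compiles as CUDA C host code) that emulates each launch serially, with loop counters standing in for blockIdx.x and threadIdx.x. The function names are illustrative only, not part of the book's code. A real GPU runs these iterations in parallel.

```cpp
#include <cassert>
#include <vector>

// Serial stand-in for add<<<N,1>>>: one thread per block, indexed by block.
std::vector<int> emulate_add_blocks(const std::vector<int>& a,
                                    const std::vector<int>& b) {
    assert(a.size() == b.size());
    const int N = static_cast<int>(a.size());
    std::vector<int> c(N, 0);
    for (int block_idx = 0; block_idx < N; ++block_idx) {
        int tid = block_idx;                 // int tid = blockIdx.x;
        if (tid < N) c[tid] = a[tid] + b[tid];
    }
    return c;
}

// Serial stand-in for add<<<1,N>>>: one block of N threads, indexed by thread.
std::vector<int> emulate_add_threads(const std::vector<int>& a,
                                     const std::vector<int>& b) {
    assert(a.size() == b.size());
    const int N = static_cast<int>(a.size());
    std::vector<int> c(N, 0);
    for (int thread_idx = 0; thread_idx < N; ++thread_idx) {
        int tid = thread_idx;                // int tid = threadIdx.x;
        if (tid < N) c[tid] = a[tid] + b[tid];
    }
    return c;
}
```

Both functions should return identical vectors for the same inputs. The only difference is which emulated index plays the role of tid.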

These are the only two changes required to move from a parallel block implementation to a parallel thread implementation. For completeness, here is the entire source listing. The only changed lines are the threadIdx.x index inside add() and the add<<<1,N>>> launch:

#include "../common/book.h"

#define N   10

__global__ void add( int *a, int *b, int *c ) {
    int tid = threadIdx.x;    // one block, so index by thread
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

int main( void ) {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    // allocate the memory on the GPU
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c, N * sizeof(int) ) );

    // fill the arrays 'a' and 'b' on the CPU
    for (int i=0; i<N; i++) {
        a[i] = i;
        b[i] = i * i;
    }

    // copy the arrays 'a' and 'b' to the GPU
    HANDLE_ERROR( cudaMemcpy( dev_a, a, N * sizeof(int),
                              cudaMemcpyHostToDevice ) );
    HANDLE_ERROR( cudaMemcpy( dev_b, b, N * sizeof(int),
                              cudaMemcpyHostToDevice ) );

    // launch N threads, all within a single block
    add<<<1,N>>>( dev_a, dev_b, dev_c );

    // copy the array 'c' back from the GPU to the CPU
    HANDLE_ERROR( cudaMemcpy( c, dev_c, N * sizeof(int),
                              cudaMemcpyDeviceToHost ) );

    // display the results
    for (int i=0; i<N; i++) {
        printf( "%d + %d = %d\n", a[i], b[i], c[i] );
    }

    // free the memory allocated on the GPU
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );

    return 0;
}

Pretty simple stuff, right? In the next section, we'll see one of the limitations of this thread-only approach. And of course, later we'll see why we would even bother splitting blocks into other parallel components.

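GPU SUMS OF A LONGER VECTOR
In the previous chapter, we noted that the hardware limits the number of blocks in a single launch to 65,535. Similarly, the hardware limits the number of threads per block with which we can launch a kernel. Specifically, this number cannot exceed the value specified by the maxThreadsPerBlock field of the device properties structure we looked at in Chapter 3. For many of the graphics processors currently available, this limit is 512 threads per block. So how would we use a thread-based approach to add two vectors of size greater than 512? (Devices of compute capability 2.0 and later raise the per-block limit to 1,024 threads, but the question is the same for any fixed limit.) We will have to use a combination of threads and blocks to accomplish this.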

As before, this will require two changes: we will have to change the index computation within the kernel, and we will have to change the kernel launch itself.

Now that we have multiple blocks and threads, the indexing will start to look similar to the standard method for converting from a two-dimensional index space to a linear space:

int tid = threadIdx.x + blockIdx.x * blockDim.x;

This assignment uses a new built-in variable, blockDim. This variable is a constant for all blocks and stores the number of threads along each dimension of the block. Since we are using a one-dimensional block, we refer only to blockDim.x. If you recall, gridDim stored a similar value, but it stored the number of blocks along each dimension of the entire grid. Moreover, gridDim is two-dimensional, whereas blockDim is actually three-dimensional. That is, the CUDA runtime allows you to launch a two-dimensional grid of blocks where each block is a three-dimensional array of threads. (This describes the hardware this chapter targets. Devices of compute capability 2.0 and later also accept three-dimensional grids, so gridDim.z can exceed 1.) Yes, this is a lot of dimensions, and it is unlikely you will regularly need the five degrees of indexing freedom afforded you, but they are available if so desired.

Indexing the data in a linear array using the previous assignment actually is quite intuitive. If you disagree, it may help to think about your collection of blocks of threads spatially, similar to a two-dimensional array of pixels. We depict this arrangement in Figure 5.1.

If the threads represent columns and the blocks represent rows, we can get a unique index by taking the product of the block index with the number of threads in each block and adding the thread index within the block. This is identical to the method we used to linearize the two-dimensional image index in the Julia Set example:
int offset = x + y * DIM;

Here, DIM is the block dimension (measured in threads), y is the block index, and x is the thread index within the block. Hence, we arrive at the index:
int tid = threadIdx.x + blockIdx.x * blockDim.x;
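The following sketch checks that this formula never assigns two threads the same index and never skips one. It is a minimal host-side C++ sketch that emulates a one-dimensional <<<blocks, threads>>> launch and counts how often each global index is produced. The function name count_global_indices is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Emulates a 1D launch with grid_dim blocks of block_dim threads and counts,
// for every global index tid = threadIdx.x + blockIdx.x * blockDim.x,
// how many (block, thread) pairs produced it.
std::vector<int> count_global_indices(int grid_dim, int block_dim) {
    assert(grid_dim > 0 && block_dim > 0);
    std::vector<int> hits(grid_dim * block_dim, 0);
    for (int block_idx = 0; block_idx < grid_dim; ++block_idx) {
        for (int thread_idx = 0; thread_idx < block_dim; ++thread_idx) {
            int tid = thread_idx + block_idx * block_dim;  // row-major linearization
            hits[tid] += 1;
        }
    }
    return hits;
}
```

For 4 blocks of 128 threads, every index from 0 to 511 should be hit exactly once. The block index plays the role of the row, and the thread index plays the role of the column.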

The other change is to the kernel launch itself. We still need N parallel threads to launch, but we want them to launch across multiple blocks so we do not hit the 512-thread limitation imposed upon us. One solution is to arbitrarily set the block size to some fixed number of threads; for this example, let's use 128 threads per block. Then we can just launch N/128 blocks to get our total of N threads running.

The wrinkle here is that N/128 is an integer division. This implies that if N were 127, N/128 would be zero, and we will not actually compute anything if we launch zero threads. In fact, we will launch too few threads whenever N is not an exact multiple of 128. This is bad. We actually want this division to round up.

Figure 5.1 A two-dimensional arrangement of a collection of blocks and threads

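There is a common trick to accomplish this in integer division without calling ceil(): we actually compute (N+127)/128 instead of N/128. This gives N/128 rounded up, so the resulting number of blocks times 128 is the smallest multiple of 128 greater than or equal to N. Either you can take our word for this, or you can take a moment now to convince yourself of it. Adding 127 carries the numerator into the next multiple of 128 exactly when N leaves a nonzero remainder.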
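The round-up trick can be checked directly. This minimal C++ sketch wraps it in a small helper. The name blocks_needed is hypothetical, and the helper assumes non-negative n, a positive block size, and no overflow in n + per_block - 1.

```cpp
#include <cassert>

// ceil(n / per_block) using integer arithmetic only: the number of blocks of
// per_block threads needed so that at least n threads are launched.
// Assumes n >= 0, per_block > 0, and that n + per_block - 1 does not overflow.
int blocks_needed(int n, int per_block) {
    assert(n >= 0 && per_block > 0);
    return (n + per_block - 1) / per_block;
}
```

For N = 1000, the helper returns 8 blocks, or 1,024 threads. That is the smallest multiple of 128 that is at least 1,000. Plain N/128 would return 7 blocks and leave the last 104 elements unprocessed.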

We have chosen 128 threads per block and therefore use the following kernel launch:

add<<< (N+127)/128, 128 >>>( dev_a, dev_b, dev_c );

Because we changed the division to ensure we launch enough threads, we will actually now launch too many threads when N is not an exact multiple of 128. But there is a simple remedy to this problem, and our kernel already takes care of it. We have to check whether a thread's offset is actually between 0 and N-1 (that is, tid < N) before we use it to access our input and output arrays:

if (tid < N)
c[tid] = a[tid] + b[tid];

Thus, when our index overshoots the end of our array, as will always happen when N is not a multiple of 128, we automatically refrain from performing the calculation. More important, we refrain from reading and writing memory off the end of our array.
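The following sketch shows the guard at work by emulating the whole launch add<<<(N+127)/128, 128>>> on the host. It is a minimal C++ sketch; the function name guarded_add is hypothetical. It counts how many emulated threads actually touch the arrays, which should be exactly N even though more threads are launched.

```cpp
#include <cassert>
#include <vector>

// Emulates add<<<(N+127)/128, 128>>> including the `if (tid < N)` guard.
// Returns the number of emulated threads that actually read and wrote memory.
int guarded_add(const std::vector<int>& a, const std::vector<int>& b,
                std::vector<int>& c) {
    assert(a.size() == b.size());
    const int N = static_cast<int>(a.size());
    const int threads_per_block = 128;
    const int blocks = (N + threads_per_block - 1) / threads_per_block;
    c.assign(N, 0);
    int active = 0;
    for (int block = 0; block < blocks; ++block) {
        for (int thread = 0; thread < threads_per_block; ++thread) {
            int tid = thread + block * threads_per_block;
            if (tid < N) {                  // without this, tid could pass the end
                c[tid] = a[tid] + b[tid];
                ++active;
            }
        }
    }
    return active;
}
```

With N = 300 the emulation launches 3 blocks (384 threads), but only 300 of them do any work.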

GPU SUMS OF ARBITRARILY LONG VECTORS
We were not completely forthcoming when we first discussed launching parallel blocks on a GPU. In addition to the limitation on thread count, there is also a hardware limitation on the number of blocks (albeit much greater than the thread limitation). As we've mentioned previously, neither dimension of a grid of blocks may exceed 65,535.

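So, this raises a problem with our current vector addition implementation. If we launch N/128 blocks to add our vectors, we will hit launch failures when our vectors exceed 65,535 * 128 = 8,388,480 elements. This seems like a large number, but with current memory capacities measured in gigabytes, high-end graphics processors can hold orders of magnitude more data than vectors with 8 million elements. (Devices of compute capability 3.0 and later raise the x-dimension grid limit to 2^31 - 1 blocks. The technique that follows is still useful because it decouples the launch size from the data size.)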
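The arithmetic behind that limit can be written out in a short C++ sketch. It uses the 65,535-block limit quoted in the text for this hardware. The helper names max_one_pass_elements and fits_one_pass are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Largest vector the one-element-per-thread launch add<<<(N+127)/128, 128>>>
// can cover when a grid may contain at most max_blocks blocks.
std::int64_t max_one_pass_elements(std::int64_t max_blocks,
                                   std::int64_t threads_per_block) {
    assert(max_blocks > 0 && threads_per_block > 0);
    return max_blocks * threads_per_block;
}

// True if n elements still fit in a single such launch (defaults assume the
// 65,535-block limit and 128 threads per block used in the text).
bool fits_one_pass(std::int64_t n, std::int64_t max_blocks = 65535,
                   std::int64_t threads_per_block = 128) {
    return n <= max_one_pass_elements(max_blocks, threads_per_block);
}
```

At 4 bytes per int, a vector of 8,388,480 elements takes 33,553,920 bytes, just under 32 MiB. That is small next to a card with gigabytes of memory.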

Fortunately, the solution to this issue is extremely simple. We first make a change to our kernel:

__global__ void add( int *a, int *b, int *c ) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;
    }
}

This looks remarkably like our original version of vector addition! In fact, compare it to the following CPU implementation from the previous chapter:

void add( int *a, int *b, int *c ) {
    int tid = 0;    // this is CPU zero, so we start at zero
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += 1;   // we have one CPU, so we increment by one
    }
}

Here we also used a while() loop to iterate through the data. Recall that we claimed that rather than incrementing the array index by 1, a multi-CPU or multi-core version could increment by the number of processors we wanted to use. We will now use that same principle in the GPU version.
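The multi-core CPU version mentioned here can be sketched with std::thread. Each worker starts at its own index and advances by the worker count, just as each GPU thread will advance by the total number of threads. This is a minimal plain C++ sketch, not code from the book. The function name add_strided and the choice of four workers in the test are assumptions for illustration.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Multi-core version of the strided loop: worker `id` starts at index `id`
// and advances by num_workers, so the workers never touch the same element.
void add_strided(const std::vector<int>& a, const std::vector<int>& b,
                 std::vector<int>& c, int num_workers) {
    assert(num_workers > 0 && a.size() == b.size());
    const int N = static_cast<int>(a.size());
    c.assign(N, 0);
    std::vector<std::thread> workers;
    for (int id = 0; id < num_workers; ++id) {
        workers.emplace_back([&, id] {
            for (int tid = id; tid < N; tid += num_workers)
                c[tid] = a[tid] + b[tid];
        });
    }
    for (auto& w : workers) w.join();   // wait for every worker to finish
}
```

Because the workers write disjoint elements of c, no locking is needed. Joining the threads before returning plays the role that kernel completion plays on the GPU.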

In the GPU implementation, we consider the number of parallel threads launched to be the number of processors. Although the actual GPU may have fewer (or more) processing units than this, we think of each thread as logically executing in parallel and then allow the hardware to schedule the actual execution. Decoupling the parallelization from the actual method of hardware execution is one of the burdens that CUDA C lifts off a software developer's shoulders. This should come as a relief, considering that current NVIDIA hardware can ship with anywhere from a handful to hundreds of arithmetic units per chip!

Now that we understand the principle behind this implementation, we just need to understand how we determine the initial index value for each parallel thread and how we determine the increment. We want each parallel thread to start on a different data index, so we just need to take our thread and block indexes and linearize them as we saw in the "GPU Sums of a Longer Vector" section. Each thread will start at an index given by the following:
int tid = threadIdx.x + blockIdx.x * blockDim.x;

After each thread finishes its work at the current index, we need to increment each of them by the total number of threads running in the grid. This is simply the number of threads per block multiplied by the number of blocks in the grid, or blockDim.x * gridDim.x. Hence, the increment step is as follows:
tid += blockDim.x * gridDim.x;
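This sketch checks that the start index and the stride together visit every element exactly once, even when N is larger than the number of threads. It is a minimal host-side C++ emulation of the grid-stride kernel. The function name grid_stride_visits is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Serial emulation of the grid-stride loop for a launch of grid_dim blocks of
// block_dim threads: each (block, thread) pair starts at its global index and
// advances by block_dim * grid_dim. Returns how often each element was visited.
std::vector<int> grid_stride_visits(int N, int grid_dim, int block_dim) {
    assert(N >= 0 && grid_dim > 0 && block_dim > 0);
    std::vector<int> visits(N, 0);
    for (int block_idx = 0; block_idx < grid_dim; ++block_idx) {
        for (int thread_idx = 0; thread_idx < block_dim; ++thread_idx) {
            int tid = thread_idx + block_idx * block_dim;   // starting index
            while (tid < N) {
                visits[tid] += 1;
                tid += block_dim * grid_dim;                // total thread count
            }
        }
    }
    return visits;
}
```

With 128 blocks of 128 threads (16,384 threads) and N = 33 * 1024, each thread handles two or three elements. Every element should still be visited once.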

We are almost there! The only remaining piece is to fix the launch itself. If you remember, we took this detour because the launch add<<<(N+127)/128,128>>>( dev_a, dev_b, dev_c ) will fail when (N+127)/128 is greater than 65,535. To ensure we never launch too many blocks, we will just fix the number of blocks to some reasonably small value. Since we like copying and pasting so much, we will use 128 blocks, each with 128 threads.
add<<<128,128>>>( dev_a, dev_b, dev_c );

You should feel free to adjust these values however you see fit, provided that your values remain within the limits we've discussed. Later in the book, we will discuss the potential performance implications of these choices, but for now it suffices to choose 128 threads per block and 128 blocks. Now we can add vectors of arbitrary length, limited only by the amount of RAM we have on our GPU. Here is the entire source listing:

#include "../common/book.h"

#define N   (33 * 1024)

__global__ void add( int *a, int *b, int *c ) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;   // stride by the total thread count
    }
}

int main( void ) {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    // allocate the memory on the GPU
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b, N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c, N * sizeof(int) ) );

    // fill the arrays 'a' and 'b' on the CPU
    for (int i=0; i<N; i++) {
        a[i] = i;
        b[i] = i * i;
    }

    // copy the arrays 'a' and 'b' to the GPU
    HANDLE_ERROR( cudaMemcpy( dev_a, a, N * sizeof(int),
                              cudaMemcpyHostToDevice ) );
    HANDLE_ERROR( cudaMemcpy( dev_b, b, N * sizeof(int),
                              cudaMemcpyHostToDevice ) );

    // a fixed 128 x 128 launch; the loop inside add() covers any N
    add<<<128,128>>>( dev_a, dev_b, dev_c );

    // copy the array 'c' back from the GPU to the CPU
    HANDLE_ERROR( cudaMemcpy( c, dev_c, N * sizeof(int),
                              cudaMemcpyDeviceToHost ) );

    // verify that the GPU did the work we requested
    bool success = true;
    for (int i=0; i<N; i++) {
        if ((a[i] + b[i]) != c[i]) {
            printf( "Error:  %d + %d != %d\n", a[i], b[i], c[i] );
            success = false;
        }
    }
    if (success)    printf( "We did it!\n" );

    // free the memory allocated on the GPU
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );

    return 0;
}

GPU RIPPLE USING THREADS
As with the previous chapter, we will reward your patience with vector addition by presenting a more fun example that demonstrates some of the techniques we've been using. We will again use our GPU computing power to generate pictures procedurally. But to make things even more interesting, this time we will animate them. But don't worry, we've packaged all the unrelated animation code into helper functions so you won't have to master any graphics or animation.

struct DataBlock {
    unsigned char   *dev_bitmap;
    CPUAnimBitmap  *bitmap;
};

// clean up memory allocated on the GPU
void cleanup( DataBlock *d ) {
    cudaFree( d->dev_bitmap );
}

// generate_frame(), discussed below, must be declared before main()
// in the complete source file
int main( void ) {
    DataBlock   data;
    CPUAnimBitmap  bitmap( DIM, DIM, &data );
    data.bitmap = &bitmap;

    HANDLE_ERROR( cudaMalloc( (void**)&data.dev_bitmap,
                              bitmap.image_size() ) );

    bitmap.anim_and_exit( (void (*)(void*,int))generate_frame,
                          (void (*)(void*))cleanup );
}

Most of the complexity of main() is hidden in the helper class CPUAnimBitmap. You will notice that we again have a pattern of doing a cudaMalloc(), executing device code that uses the allocated memory, and then cleaning up with cudaFree(). This should be old hat to you by now.

In this example, we have slightly convoluted the means by which we accomplish the middle step, "executing device code that uses the allocated memory." We pass the anim_and_exit() method a function pointer to generate_frame(). This function will be called by the class every time it wants to generate a new frame of the animation.
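The cast (void (*)(void*,int))generate_frame in main() works in practice on common platforms. Strictly, though, calling a function through a pointer to a different function type is undefined behavior in C and C++. The sketch below shows a type-safe alternative. It writes the callback with exactly the void (*)(void*, int) signature that the cast implies and converts the void* back inside the callback. FrameState, generate_frame_cb, and run_frames are hypothetical stand-ins, not the book's CPUAnimBitmap interface.

```cpp
#include <cassert>

// Hypothetical per-animation state (plays the role of the book's DataBlock).
struct FrameState {
    int frames_rendered = 0;
    int last_tick = -1;
};

// Callback with exactly the signature the driver expects, so no
// function-pointer cast is needed; the void* is converted back here.
void generate_frame_cb(void* user, int ticks) {
    FrameState* s = static_cast<FrameState*>(user);
    // A real version would launch the kernel and copy the bitmap back here.
    s->frames_rendered += 1;
    s->last_tick = ticks;
}

// Hypothetical minimal driver standing in for anim_and_exit():
// it calls the frame callback once per tick.
void run_frames(void (*frame)(void*, int), void* user, int num_frames) {
    for (int ticks = 0; ticks < num_frames; ++ticks)
        frame(user, ticks);
}
```

Calling run_frames(generate_frame_cb, &state, 10) invokes the callback ten times, with ticks running from 0 to 9.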

void generate_frame( DataBlock *d, int ticks ) {
    dim3    blocks(DIM/16,DIM/16);   // assumes DIM is a multiple of 16
    dim3    threads(16,16);
    kernel<<<blocks,threads>>>( d->dev_bitmap, ticks );

    HANDLE_ERROR( cudaMemcpy( d->bitmap->get_ptr(),
                              d->dev_bitmap,
                              d->bitmap->image_size(),
                              cudaMemcpyDeviceToHost ) );
}

Although this function consists only of four lines, they all involve important CUDA C concepts. First, we declare two two-dimensional variables, blocks and threads. As our naming convention makes painfully obvious, the variable blocks represents the number of parallel blocks we will launch in our grid. The variable threads represents the number of threads we will launch per block. Because we are generating an image, we use two-dimensional indexing so that each thread will have a unique (x,y) index that we can easily put into correspondence with a pixel in the output image. We have chosen to use blocks that consist of a 16 x 16 array of threads. If the image has DIM x DIM pixels, we need to launch DIM/16 x DIM/16 blocks to get one thread per pixel. This assumes DIM is a multiple of 16; otherwise the integer division leaves the last partial row and column of pixels without threads. Figure 5.2 shows how this block and thread configuration would look in a (ridiculously) small 48-pixel-wide, 32-pixel-high image.

Figure 5.2 A 2D hierarchy of blocks and threads that could be used to process a 48 x 32 pixel image using one thread per pixel

If you have done any multithreaded CPU programming, you may be wondering why we would launch so many threads. For example, to render a full high-definition animation at 1920 x 1080, this method would create more than 2 million threads. Although we routinely create and schedule this many threads on a GPU, one would not dream of creating this many threads on a CPU. Because CPU thread management and scheduling must be done in software, it simply cannot scale to the number of threads that a GPU can. Because we can simply create a thread for each data element we want to process, parallel programming on a GPU can be far simpler than on a CPU.
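The 1920 x 1080 figure can be made concrete with a small C++ sketch. It computes a 16 x 16-block grid for an arbitrary image, rounding up with the (n + k - 1)/k trick from the vector-sum sections, because 1080 is not a multiple of 16. Dim2, grid_for_image, and threads_launched are hypothetical names; Dim2 is only a stand-in for CUDA's dim3.

```cpp
#include <cassert>

struct Dim2 { int x, y; };  // hypothetical stand-in for dim3 (z omitted)

// Grid size for one thread per pixel with block_side x block_side blocks,
// rounded up so images whose sides are not multiples of block_side are covered.
Dim2 grid_for_image(int width, int height, int block_side = 16) {
    assert(width > 0 && height > 0 && block_side > 0);
    return { (width + block_side - 1) / block_side,
             (height + block_side - 1) / block_side };
}

// Total number of threads such a launch creates.
long long threads_launched(Dim2 grid, int block_side = 16) {
    return 1LL * grid.x * grid.y * block_side * block_side;
}
```

For 1920 x 1080, the sketch yields a 120 x 68 grid and 2,088,960 threads, a few more than the 2,073,600 pixels. A kernel launched this way would need an x < width && y < height guard, analogous to tid < N. The 48 x 32 image of Figure 5.2 gives a 3 x 2 grid.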

After declaring the variables that hold the dimensions of our launch, we simply launch the kernel that will compute our pixel values:
kernel<<< blocks,threads>>>( d->dev_bitmap, ticks );

The kernel will need two pieces of information that we pass as parameters. First, it needs a pointer to device memory that holds the output pixels. This pointer, d->dev_bitmap, points to the buffer we allocated with cudaMalloc() in main(). The pointer itself, however, lives in a host-side structure that device code cannot read. So we pass its value as a kernel parameter, which ensures that the CUDA runtime makes it available to our device code.

Second, our kernel will need to know the current animation time so it can generate the correct frame. The current time, ticks, is passed to the generate_frame() function from the infrastructure code in CPUAnimBitmap, so we can simply pass this on to our kernel.

And now, here's the kernel code itself:

__global__ void kernel( unsigned char *ptr, int ticks ) {
    // map from threadIdx/blockIdx to pixel position
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;

    // now calculate the value at that position
    float fx = x - DIM/2;
    float fy = y - DIM/2;
    float d = sqrtf( fx * fx + fy * fy );
    unsigned char grey = (unsigned char)(128.0f + 127.0f *
                                         cos(d/10.0f - ticks/7.0f) /
                                         (d/10.0f + 1.0f));
    ptr[offset*4 + 0] = grey;
    ptr[offset*4 + 1] = grey;
    ptr[offset*4 + 2] = grey;
    ptr[offset*4 + 3] = 255;
}

The first three lines are the most important lines in the kernel:

int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int offset = x + y * blockDim.x * gridDim.x;

In these lines, each thread takes its index within its block as well as the index of its block within the grid, and it translates this into a unique (x,y) index within the image. So when the thread at index (3, 5) in block (12, 8) begins executing, it knows that there are 12 entire blocks to the left of it and 8 entire blocks above it. Within its block, the thread at (3, 5) has three threads to the left and five above it. Because each block is 16 threads wide and 16 threads tall, this means the thread in question has the following:

3 threads + 12 blocks * 16 threads/block = 195 threads to the left of it

5 threads + 8 blocks * 16 threads/block = 133 threads above it

This computation is identical to the computation of x and y in the first two lines and is how we map the thread and block indices to image coordinates. Then we simply linearize these x and y values to get an offset into the output buffer; here blockDim.x * gridDim.x is the width of the image in pixels. Again, this is identical to what we did in the "GPU Sums of a Longer Vector" and "GPU Sums of Arbitrarily Long Vectors" sections:
int offset = x + y * blockDim.x * gridDim.x;
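The worked example above can be replayed on the host. This minimal C++ sketch reproduces the kernel's first three lines for a single thread. For illustration only, it assumes a 1,024-pixel-wide image, so gridDim.x = 1024/16 = 64; the text does not fix DIM here. PixelIndex and pixel_for_thread are hypothetical names.

```cpp
#include <cassert>

struct PixelIndex { int x, y, offset; };

// Host replay of the ripple kernel's index math for thread (thread_x, thread_y)
// in block (block_x, block_y), with block_dim x block_dim threads per block
// and grid_dim_x blocks per row of the image.
PixelIndex pixel_for_thread(int thread_x, int thread_y,
                            int block_x, int block_y,
                            int block_dim, int grid_dim_x) {
    int x = thread_x + block_x * block_dim;
    int y = thread_y + block_y * block_dim;
    int offset = x + y * block_dim * grid_dim_x;  // row width = blockDim.x * gridDim.x
    return {x, y, offset};
}
```

For thread (3, 5) in block (12, 8), this gives x = 195 and y = 133, matching the counts above. Under the assumed 1,024-pixel width, offset = 195 + 133 * 1024 = 136,387.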

Since we know which (x,y) pixel in the image the thread should compute and we know the time at which it needs to compute this value, we can compute any function of (x,y,t) and store this value in the output buffer. In this case, the function produces a time-varying sinusoidal "ripple":

float fx = x - DIM/2;
float fy = y - DIM/2;
float d = sqrtf( fx * fx + fy * fy );
unsigned char grey = (unsigned char)(128.0f + 127.0f *
                                     cos(d/10.0f - ticks/7.0f) /
                                     (d/10.0f + 1.0f));

We recommend that you not get too hung up on the computation of grey. It's essentially just a 2D function of time that makes a nice rippling effect when it's animated. A screenshot of one frame should look something like Figure 5.3.

Figure 5.3 A screenshot from the GPU ripple example

Shared Memory and Synchronization
So far, the motivation for splitting blocks into threads was simply one of working around hardware limitations to the number of blocks we can have in flight. This is fairly weak motivation, because this could easily be done behind the scenes by the CUDA runtime. Fortunately, there are other reasons one might want to split a block into threads.

CUDA C makes available a region of memory that we call shared memory. This region of memory brings along with it another extension to the C language akin to __device__ and __global__. As a programmer, you can modify your variable declarations with the CUDA C keyword __shared__ to make this variable resident in shared memory. But what's the point?

We're glad you asked. The CUDA C compiler treats variables in shared memory differently than typical variables. It creates a copy of the variable for each block that you launch on the GPU. Every thread in that block shares the memory, but threads cannot see or modify the copy of this variable that is seen within other blocks. This provides an excellent means by which threads within a block can communicate and collaborate on computations. Furthermore, shared memory buffers reside physically on the GPU chip itself, as opposed to residing in off-chip DRAM. Because of this, the latency to access shared memory tends to be far lower than for typical buffers, making shared memory effective as a per-block, software-managed cache or scratchpad.

The prospect of communication between threads should excite you. It excites us, too. But nothing in life is free, and interthread communication is no exception. If we expect to communicate between threads, we also need a mechanism for synchronizing between threads. For example, if thread A writes a value to shared memory and we want thread B to do something with this value, we can't have thread B start its work until we know the write from thread A is complete. Without synchronization, we have created a race condition where the correctness of the execution results depends on the nondeterministic details of the hardware.

Let's take a look at an example that uses these features.

DOT PRODUCT

Congratulations! We have graduated from vector addition and will now take a look at vector dot products (sometimes called an inner product). We will quickly review what a dot product is, just in case you are unfamiliar with vector mathematics (or it has been a few years). The computation consists of two steps. First, we multiply corresponding elements of the two input vectors. This is very similar to vector addition but utilizes multiplication instead of addition. However, instead of then storing these values to a third, output vector, we sum them all to produce a single scalar output.

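For example, if we take the dot product of two four-element vectors, we would get Equation 5.1.

$$(x_1, x_2, x_3, x_4) \cdot (y_1, y_2, y_3, y_4) = x_1 y_1 + x_2 y_2 + x_3 y_3 + x_4 y_4 \qquad \text{(Equation 5.1)}$$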
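Equation 5.1 translates directly into a serial loop. This minimal C++ sketch is a CPU reference that a GPU result could be checked against. The name dot_reference is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial reference for Equation 5.1: multiply corresponding entries, then sum.
float dot_reference(const std::vector<float>& x, const std::vector<float>& y) {
    assert(x.size() == y.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i)
        sum += x[i] * y[i];
    return sum;
}
```

For (1, 2, 3, 4) · (5, 6, 7, 8), the sum is 5 + 12 + 21 + 32 = 70.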

Perhaps the algorithm we tend to use is becoming obvious. We can do the first step exactly how we did vector addition. Each thread multiplies a pair of corresponding entries, and then every thread moves on to its next pair. Because the result needs to be the sum of all these pairwise products, each thread keeps a running sum of the products it has computed. Just like in the addition example, the threads increment their indices by the total number of threads to ensure we don't miss any elements and don't multiply a pair twice. Here is the first step of the dot product routine:

#include "../common/book.h"

// arguments parenthesized so expressions can be passed safely
#define imin(a,b) ((a)<(b)?(a):(b))

const int N = 33 * 1024;
const int threadsPerBlock = 256;

__global__ void dot( float *a, float *b, float *c ) {
    __shared__ float cache[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cacheIndex = threadIdx.x;

    float   temp = 0;
    while (tid < N) {
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }

    // set the cache values
    cache[cacheIndex] = temp;

As you can see, we have declared a buffer of shared memory named cache. This buffer will be used to store each thread's running sum. Soon we will see why we do this, but for now we will simply examine the mechanics by which we accomplish it. It is trivial to declare a variable to reside in shared memory, and it is identical to the means by which you declare a variable as static or volatile in standard C:
__shared__ float cache[threadsPerBlock];

We declare the array of size threadsPerBlock so each thread in the block has a place to store its temporary result. Recall that when we have allocated memory globally, we allocated enough for every thread that runs the kernel, or threadsPerBlock times the total number of blocks. But since the compiler will create a copy of the shared variables for each block, we need to allocate only enough memory such that each thread in the block has an entry.

After allocating the shared memory, we compute our data indices much like we have in the past:

int tid = threadIdx.x + blockIdx.x * blockDim.x;
int cacheIndex = threadIdx.x;

The computation for the variable tid should look familiar by now; we are just combining the block and thread indices to get a global offset into our input arrays. The offset into our shared memory cache is simply our thread index. Again, we don't need to incorporate our block index into this offset because each block has its own private copy of this shared memory.

Finally, every thread writes its value into the shared memory buffer, so that later we will be able to blindly sum the entire array without worrying whether a particular entry has valid data stored there:

// set the cache values
cache[cacheIndex] = temp;

Not every thread will have elements to process if there are more threads than input elements. In that case, the last block will contain threads whose starting tid is already at least N. Those threads skip the while() loop entirely, but they still store their initial temp value of 0 into cache[]. As a result, every entry of the buffer holds a valid, possibly zero, partial sum.

Each thread computes a running sum of the product of corresponding entries in a and b. After reaching the end of the array, each thread stores its temporary sum into the shared buffer:

float temp = 0;
while (tid < N) {
    temp += a[tid] * b[tid];
    tid += blockDim.x * gridDim.x;
}

// set the cache values
cache[cacheIndex] = temp;

At this point in the algorithm, we need to sum all the temporary values we've placed in the cache. To do this, we will need some of the threads to read the values that have been stored there. However, as we mentioned, this is a potentially dangerous operation. We need a method to guarantee that all of these writes to the shared array cache[] complete before anyone tries to read from this buffer. Fortunately, such a method exists:

// synchronize threads in this block
__syncthreads();

This call guarantees that every thread in the block has completed the instructions prior to the __syncthreads() before the hardware will execute the next instruction on any thread. In particular, the shared memory writes made before the barrier are visible to every thread in the block after it. This is exactly what we need! We now know that when the first thread executes the first instruction after our __syncthreads(), every other thread in the block has also finished executing up to the __syncthreads().

Now that we have guaranteed that our temporary cache has been filled, we can sum the values in it. We call the general process of taking an input array and performing some computations that produce a smaller array of results a reduction. Reductions arise often in parallel computing, which leads to the desire to give them a name.

The naïve way to accomplish this reduction would be to have one thread iterate over the shared memory and calculate a running sum. This will take us time proportional to the length of the array. However, since we have hundreds of threads available to do our work, we can do this reduction in parallel and take time that is proportional to the logarithm of the length of the array. At first, the following code will look convoluted; we'll break it down in a moment.

The general idea is that each thread will add two of the values in cache[] and store the result back to cache[]. Since each thread combines two entries into one, we complete this step with half as many entries as we started with. In the next step, we do the same thing on the remaining half. We continue in this fashion for log2(threadsPerBlock) steps until we have the sum of every entry in cache[]. For our example, we're using 256 threads per block, so it takes 8 iterations of this process to reduce the 256 entries in cache[] to a single sum.
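The halving scheme can be tried on the host first. This minimal C++ sketch emulates one block's reduction, with a std::vector standing in for cache[]. Each pass of the outer loop corresponds to one __syncthreads()-separated step. tree_reduce is a hypothetical name, and the sketch assumes the size is a power of two, as the kernel does.

```cpp
#include <cassert>
#include <vector>

// Host emulation of the shared-memory halving reduction for one block.
// In each step, "threads" with index < i add the entry i positions above them;
// those entries are never written in the same step, so serial order is safe.
// Returns the number of steps; the total ends up in cache[0].
int tree_reduce(std::vector<float>& cache) {
    const int n = static_cast<int>(cache.size());
    assert(n > 0 && (n & (n - 1)) == 0);          // power of two required
    int steps = 0;
    for (int i = n / 2; i != 0; i /= 2) {
        for (int cache_index = 0; cache_index < i; ++cache_index)
            cache[cache_index] += cache[cache_index + i];
        ++steps;                                   // __syncthreads() goes here
    }
    return steps;
}
```

Filling 256 entries with 1 through 256 takes 8 steps and leaves 32,896 (that is, 256 * 257 / 2) in cache[0].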

The code for this follows:

// for reductions, threadsPerBlock must be a power of 2
// because of the following code
int i = blockDim.x/2;
while (i != 0) {
    if (cacheIndex < i)
        cache[cacheIndex] += cache[cacheIndex + i];
    __syncthreads();
    i /= 2;
}

Figure 5.4 One step of a summation reduction

For the first step, we start with i as half the number of threadsPerBlock. We only want the threads with indices less than this value to do any work, so we conditionally add two entries of cache[] if the thread's index is less than i. We protect our addition within an if(cacheIndex < i) block. Each thread will take the entry at its index in cache[], add it to the entry at its index offset by i, and store this sum back to cache[].

Suppose there were eight entries in cache[] and, as a result, i had the value 4. One step of the reduction would look like Figure 5.4.
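The single step in Figure 5.4 can be checked by hand with a small host-side C++ sketch. The sketch also shows why the code comment insists on a power-of-two block size. reduction_step and unchecked_reduce are hypothetical names. unchecked_reduce deliberately omits the power-of-two check so that the failure is visible.

```cpp
#include <cassert>
#include <vector>

// One reduction step: for every index k below i, cache[k] += cache[k + i].
void reduction_step(std::vector<int>& cache, int i) {
    assert(i >= 0 && 2 * i <= static_cast<int>(cache.size()));
    for (int k = 0; k < i; ++k)
        cache[k] += cache[k + i];
}

// Full halving loop WITHOUT checking that the size is a power of two,
// to show what goes wrong otherwise.
int unchecked_reduce(std::vector<int> cache) {
    for (int i = static_cast<int>(cache.size()) / 2; i != 0; i /= 2)
        reduction_step(cache, i);
    return cache[0];
}
```

With entries 0 through 7 and i = 4, the first four entries become 4, 6, 8, 10, and the full loop yields 28. With six entries of 1, the loop adds index 3 into index 0, 4 into 1 and 5 into 2, then only index 1 into index 0. Entry 2 is never folded in, so the result is 4 instead of 6.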

After we have completed a step, we have the same restriction we did after computing all the pairwise products. Before we can read the values we just stored in cache[], we need to ensure that every thread that needs to write to cache[] has already done so. The __syncthreads() after the assignment ensures this condition is met.

After termination of this while() loop, each block has but a single number remaining. This number is sitting in the first entry of cache[] and is the sum of every pairwise product the threads in that block computed. We then store this single value to global memory and end our kernel:

if (cacheIndex == 0)
c[blockIdx.x] = cache[0];
}

Why do we do this global store only for the thread with cacheIndex == 0? Well, since there is only one number that needs writing to global memory, only a single thread needs to perform this operation. Conceivably, every thread could perform this write and the program would still work, but doing so would create an unnecessarily large amount of memory traffic to write a single value. For simplicity, we chose the thread with index 0, though you could conceivably have chosen any cacheIndex to write cache[0] to global memory. Finally, since each block will write exactly one value to the global array c[], we can simply index it by blockIdx.x.

We are left with an array c[], each entry of which contains the sum produced by one of the parallel blocks. The last step of the dot product is to sum the entries of c[]. Even though the dot product is not fully computed, we exit the kernel and return control to the host at this point. But why do we return to the host before the computation is complete?

Previously, we referred to an operation like a dot product as a reduction. Roughly speaking, this is because we produce fewer output data elements than we input. In the case of a dot product, we always produce exactly one output, regardless of the size of our input. It turns out that a massively parallel machine like a GPU tends to waste its resources when performing the last steps of a reduction, since the size of the data set is so small at that point; it is hard to utilize hundreds of arithmetic units to add 32 numbers!

For this reason, we return control to the host and let the CPU finish the final step of the addition, summing the array c[]. In a larger application, the GPU would now be free to start another dot product or work on another large computation. However, in this example, we are done with the GPU.
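Putting the pieces together, this host-side C++ sketch emulates the whole scheme. Each emulated block gives every "thread" a grid-strided running sum and stores it in a private per-block cache. The cache is then reduced, and cache[0] is written to that block's partial sum. Finally, the "CPU" adds the partial sums. It is a serial sketch, not the book's code: the real kernel runs blocks concurrently, and __syncthreads() enforces the ordering that these loops provide automatically. dot_emulated is a hypothetical name, and double replaces float so the check below can be exact.

```cpp
#include <cassert>
#include <vector>

// Serial emulation of dot<<<blocks_per_grid, threads_per_block>>> followed by
// the CPU-side sum of the per-block partial results.
double dot_emulated(const std::vector<double>& a, const std::vector<double>& b,
                    int blocks_per_grid, int threads_per_block) {
    assert(a.size() == b.size() && blocks_per_grid > 0);
    assert(threads_per_block > 0 &&
           (threads_per_block & (threads_per_block - 1)) == 0);
    const int N = static_cast<int>(a.size());
    std::vector<double> partial(blocks_per_grid, 0.0);
    for (int block = 0; block < blocks_per_grid; ++block) {
        std::vector<double> cache(threads_per_block);  // one copy per block
        for (int t = 0; t < threads_per_block; ++t) {
            double temp = 0;                            // idle threads store 0
            for (int tid = t + block * threads_per_block; tid < N;
                 tid += threads_per_block * blocks_per_grid)
                temp += a[tid] * b[tid];
            cache[t] = temp;
        }
        // __syncthreads() would go here, then the halving reduction
        for (int i = threads_per_block / 2; i != 0; i /= 2)
            for (int t = 0; t < i; ++t)
                cache[t] += cache[t + i];
        partial[block] = cache[0];                      // thread 0's store
    }
    double c = 0;                                       // final sum on the CPU
    for (double p : partial) c += p;
    return c;
}
```

With a[i] = i and b[i] = 2i, the exact answer is 2 * (N-1) * N * (2N-1) / 6. Every intermediate value here is an integer well below 2^53, so the double result matches that closed form exactly.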

In explaining this example, we broke with tradition and jumped right into the actual kernel computation. We hope you will have no trouble understanding the body of main() up to the kernel call, since it is overwhelmingly similar to what we have shown before.

const int blocksPerGrid =
            imin( 32, (N+threadsPerBlock-1) / threadsPerBlock );

int main( void ) {
    float   *a, *b, c, *partial_c;
    float   *dev_a, *dev_b, *dev_partial_c;

    // allocate memory on the CPU side
    a = new float[N];
    b = new float[N];
    partial_c = new float[blocksPerGrid];

    // allocate the memory on the GPU
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a,
                              N*sizeof(float) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b,
                              N*sizeof(float) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_partial_c,
                              blocksPerGrid*sizeof(float) ) );

    // fill in the host memory with data
    for (int i=0; i<N; i++) {
        a[i] = i;
        b[i] = i*2;
    }

    // copy the arrays 'a' and 'b' to the GPU
    HANDLE_ERROR( cudaMemcpy( dev_a, a, N*sizeof(float),
                              cudaMemcpyHostToDevice ) );
    HANDLE_ERROR( cudaMemcpy( dev_b, b, N*sizeof(float),
                              cudaMemcpyHostToDevice ) );

    dot<<<blocksPerGrid,threadsPerBlock>>>( dev_a,
                                            dev_b,
                                            dev_partial_c );

To avoid you passing out from boredom, we will quickly summarize this code:

1. Allocate host and device memory for input and output arrays.

2. Fill input arrays a[] and b[], and then copy these to the device using
   cudaMemcpy().

3. Call our dot product kernel using some predetermined number of threads
   per block and blocks per grid.


Despite most of this being commonplace to you now, it is worth examining the
computation for the number of blocks we launch. We discussed how the dot
product is a reduction and how each block launched will compute a partial sum.
The length of this list of partial sums should be something manageably small
for the CPU yet large enough that we have enough blocks in flight to keep even
the fastest GPUs busy. We have chosen 32 blocks, although this is a case where
you may notice better or worse performance for other choices, especially
depending on the relative speeds of your CPU and GPU.

But what if we are given a very short list and 32 blocks of 256 threads apiece
is too many? If we have N data elements, we need only N threads in order to
compute our dot product. So in this case, we need the smallest multiple of
threadsPerBlock that is greater than or equal to N. We have seen this once
before when we were adding vectors. The number of blocks that provides this
many threads is (N+(threadsPerBlock-1)) / threadsPerBlock: adding
threadsPerBlock-1 before the integer division makes the quotient round up
instead of down. As you may be able to tell, this is actually a fairly common
trick in integer math, so it is worth digesting this even if you spend most of
your time working outside the CUDA C realm.

Therefore, the number of blocks we launch should be either 32 or
(N+(threadsPerBlock-1)) / threadsPerBlock, whichever value is smaller:

const int blocksPerGrid =
            imin( 32, (N+threadsPerBlock-1) / threadsPerBlock );
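The launch-size rule can be checked on the host. This is a minimal sketch in plain C. It uses a parenthesized variant of the listing's imin() macro so that it is safe with expression arguments. The helpers blocks_needed and blocks_per_grid are hypothetical names.

```c
/* Parenthesized variant of the listing's imin() macro. */
#define imin(a, b) ((a) < (b) ? (a) : (b))

/* Number of blocks of threadsPerBlock threads needed to cover n elements:
   the integer form of ceil(n / threadsPerBlock) for n >= 0. */
int blocks_needed(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

/* Launch size used by the dot product: never more than 32 blocks. */
int blocks_per_grid(int n, int threadsPerBlock) {
    return imin(32, blocks_needed(n, threadsPerBlock));
}
```

With N = 33 * 1024 and 256 threads per block, 132 blocks would be needed, so the cap of 32 applies. A 1,000-element list would get only 4 blocks.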

Now it should be clear how we arrive at the code in main(). After the kernel
finishes, we still have to sum the result. But just as we copy our input to the
GPU before we launch a kernel, we need to copy our output back to the CPU before
we continue working with it. So after the kernel finishes, we copy back the list
of partial sums and complete the sum on the CPU:

    // copy the array 'c' back from the GPU to the CPU
    HANDLE_ERROR( cudaMemcpy( partial_c, dev_partial_c,
                              blocksPerGrid*sizeof(float),
                              cudaMemcpyDeviceToHost ) );


    // finish up on the CPU side
    c = 0;
    for (int i=0; i<blocksPerGrid; i++) {
        c += partial_c[i];
    }

Finally, we check our results and clean up the memory we've allocated on both
the CPU and GPU. Checking the results is made easier because we've filled the
inputs with predictable data. If you recall, a[] is filled with the integers
from 0 to N-1 and b[] is just 2*a[]:

    // fill in the host memory with data
    for (int i=0; i<N; i++) {
        a[i] = i;
        b[i] = i*2;
    }

Our dot product should be two times the sum of the squares of the integers
from 0 to N-1. For the reader who loves discrete mathematics (and what's not
to love?), it will be an amusing diversion to derive the closed-form solution
for this summation. For those with less patience or interest, we present the
closed form here, as well as the rest of the body of main():

#define sum_squares(x)  ((x)*((x)+1)*(2*(x)+1)/6)
    printf( "Does GPU value %.6g = %.6g?\n", c,
            2 * sum_squares( (float)(N - 1) ) );

    // free memory on the GPU side
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_partial_c );

    // free memory on the CPU side
    delete [] a;
    delete [] b;
    delete [] partial_c;
}
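The identity behind the final printf() is that the dot product equals 2 times the sum of i squared for i from 0 to N-1. The closed form evaluates this as 2 * (N-1) * N * (2N-1) / 6. Because the listing evaluates both sides in single-precision float, its check is only approximate. The following plain-C sketch compares a brute-force sum with the closed form in exact 64-bit integer arithmetic. The function names are hypothetical.

```c
#include <stdint.h>

/* Closed form for 0^2 + 1^2 + ... + x^2, exact in 64-bit integers
   (x*(x+1)*(2x+1) is always divisible by 6). */
int64_t sum_squares_exact(int64_t x) {
    return x * (x + 1) * (2 * x + 1) / 6;
}

/* Brute-force dot product of a[i] = i and b[i] = 2*i for i in [0, n). */
int64_t dot_brute_force(int64_t n) {
    int64_t total = 0;
    for (int64_t i = 0; i < n; i++)
        total += i * (2 * i);
    return total;
}
```

For N = 33 * 1024, both sides stay far below the int64_t limit, so the comparison is exact.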


If you found all our explanatory interruptions bothersome, here is the entire
source listing, sans commentary:

#include "../common/book.h"

#define imin(a,b) ((a)<(b)?(a):(b))

const int N = 33 * 1024;
const int threadsPerBlock = 256;
const int blocksPerGrid =
            imin( 32, (N+threadsPerBlock-1) / threadsPerBlock );

__global__ void dot( float *a, float *b, float *c ) {
    __shared__ float cache[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cacheIndex = threadIdx.x;

    float   temp = 0;
    while (tid < N) {
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }

    // set the cache values
    cache[cacheIndex] = temp;

    // synchronize threads in this block
    __syncthreads();

    // for reductions, threadsPerBlock must be a power of 2
    // because of the following code
    int i = blockDim.x/2;
    while (i != 0) {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
        i /= 2;
    }


    if (cacheIndex == 0)
        c[blockIdx.x] = cache[0];
}

int main( void ) {
    float   *a, *b, c, *partial_c;
    float   *dev_a, *dev_b, *dev_partial_c;

    // allocate memory on the CPU side
    a = (float*)malloc( N*sizeof(float) );
    b = (float*)malloc( N*sizeof(float) );
    partial_c = (float*)malloc( blocksPerGrid*sizeof(float) );

    // allocate the memory on the GPU
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a,
                              N*sizeof(float) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b,
                              N*sizeof(float) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_partial_c,
                              blocksPerGrid*sizeof(float) ) );

    // fill in the host memory with data
    for (int i=0; i<N; i++) {
        a[i] = i;
        b[i] = i*2;
    }

    // copy the arrays 'a' and 'b' to the GPU
    HANDLE_ERROR( cudaMemcpy( dev_a, a, N*sizeof(float),
                              cudaMemcpyHostToDevice ) );
    HANDLE_ERROR( cudaMemcpy( dev_b, b, N*sizeof(float),
                              cudaMemcpyHostToDevice ) );


    dot<<<blocksPerGrid,threadsPerBlock>>>( dev_a, dev_b,
                                            dev_partial_c );

    // copy the array 'c' back from the GPU to the CPU
    HANDLE_ERROR( cudaMemcpy( partial_c, dev_partial_c,
                              blocksPerGrid*sizeof(float),
                              cudaMemcpyDeviceToHost ) );

    // finish up on the CPU side
    c = 0;
    for (int i=0; i<blocksPerGrid; i++) {
        c += partial_c[i];
    }

#define sum_squares(x)  ((x)*((x)+1)*(2*(x)+1)/6)
    printf( "Does GPU value %.6g = %.6g?\n", c,
            2 * sum_squares( (float)(N - 1) ) );

    // free memory on the GPU side
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_partial_c );

    // free memory on the CPU side
    free( a );
    free( b );
    free( partial_c );
}

DOT PRODUCT OPTIMIZED (INCORRECTLY)
We quickly glossed over the second __syncthreads() in the dot product
example. Now we will take a closer look at it, as well as examining an attempt
to improve it. If you recall, we needed the second __syncthreads() because
we update our shared memory variable cache[] and need these updates to be
visible to every thread on the next iteration through the loop:

int i = blockDim.x/2;
while (i != 0) {
    if (cacheIndex < i)
        cache[cacheIndex] += cache[cacheIndex + i];
    __syncthreads();
    i /= 2;
}

Observe that we update our shared memory buffer cache[] only if cacheIndex
is less than i. Since cacheIndex is really just threadIdx.x, this means that
only some of the threads are updating entries in the shared memory cache. Since
we are using __syncthreads() only to ensure that these updates have taken
place before proceeding, it stands to reason that we might see a speed
improvement if we wait only for the threads that are actually writing to shared
memory. We do this by moving the synchronization call inside the if() block:

int i = blockDim.x/2;
while (i != 0) {
    if (cacheIndex < i) {
        cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
    }
    i /= 2;
}

Although this was a valiant effort at optimization, it will not actually work. In
fact, the situation is worse than that. This change to the kernel will actually
cause the GPU to stop responding, forcing you to kill your program. But what
could have gone so catastrophically wrong with such a seemingly innocuous
change?

To answer this question, it helps to imagine every thread in the block marching
through the code one line at a time. At each instruction in the program, every
thread executes the same instruction, but each can operate on different data.
But what happens when the instruction that every thread is supposed to execute
is inside a conditional block like an if()? Obviously, not every thread should
execute that instruction, right? For example, consider a kernel that contains the
following fragment of code that intends for odd-indexed threads to update the
value of some variable:

int myVar = 0;
if( threadIdx.x % 2 )
    myVar = threadIdx.x;

In the previous example, when the threads arrive at the assignment
myVar = threadIdx.x, only the threads with odd indices will execute it, since
the threads with even indices do not satisfy the condition
if( threadIdx.x % 2 ). The even-numbered threads simply do nothing while the
odd threads execute this instruction. When some of the threads need to execute
an instruction while others don't, this situation is known as thread divergence.
Under normal circumstances, divergent branches simply result in some threads
remaining idle, while the other threads actually execute the instructions in the
branch.

But in the case of __syncthreads(), the result is somewhat tragic. The
CUDA Architecture guarantees that no thread will advance to an instruction
beyond the __syncthreads() until every thread in the block has executed the
__syncthreads(). Unfortunately, if the __syncthreads() sits in a divergent
branch, some of the threads will never reach the __syncthreads(). Therefore,
because of the guarantee that no instruction after a __syncthreads() can be
executed before every thread has executed it, the hardware simply continues to
wait for these threads. And waits. And waits. Forever.

This is the situation in the dot product example when we move the
__syncthreads() call inside the if() block. Any thread with cacheIndex
greater than or equal to i will never execute the __syncthreads(). This
effectively hangs the processor because it results in the GPU waiting for
something that will never happen:

if (cacheIndex < i) {
    cache[cacheIndex] += cache[cacheIndex + i];
    __syncthreads();
}
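The counting argument can be made concrete with a toy model in plain C. It is only a model: it assumes that a block-wide barrier releases once every thread of the block has arrived, and it says nothing about how the hardware implements __syncthreads(). The function barrier_releases is hypothetical. For reference, current CUDA documentation permits __syncthreads() in conditional code only when the condition evaluates identically across the entire block.

```c
/* Toy model: returns 1 if one iteration of the reduction loop can get past
   its barrier, i.e. if all block_dim threads arrive at __syncthreads().
   sync_inside_if selects the "optimized" placement inside if (cacheIndex < i). */
int barrier_releases(int block_dim, int i, int sync_inside_if) {
    int arrived = 0;
    for (int cacheIndex = 0; cacheIndex < block_dim; cacheIndex++) {
        if (cacheIndex < i) {
            /* cache[cacheIndex] += cache[cacheIndex + i]; */
            if (sync_inside_if)
                arrived++;              /* barrier reached inside the branch */
        }
        if (!sync_inside_if)
            arrived++;                  /* barrier after the branch: everyone */
    }
    return arrived == block_dim;
}
```

With 256 threads and i = 128, only 128 threads reach a barrier placed inside the if(). In this model the barrier never releases.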


The moral of this story is that __syncthreads() is a powerful mechanism
for ensuring that your massively parallel application still computes the correct
results. But because of this potential for unintended consequences, we still
need to take care when using it.

SHARED MEMORY BITMAP
We have looked at examples that use shared memory and employed
__syncthreads() to ensure that data is ready before we continue.
In the name of speed, you may be tempted to live dangerously and omit
the __syncthreads(). We will now look at a graphical example that requires
__syncthreads() for correctness. We will show you screenshots of the
intended output and of the output when run without __syncthreads(). It
won't be pretty.

The body of main() is identical to the GPU Julia Set example, although this time
we launch multiple threads per block:

#include "cuda.h"
#include "../common/book.h"
#include "../common/cpu_bitmap.h"

#define DIM 1024
#define PI 3.1415926535897932f

int main( void ) {
    CPUBitmap bitmap( DIM, DIM );
    unsigned char    *dev_bitmap;

    HANDLE_ERROR( cudaMalloc( (void**)&dev_bitmap,
                              bitmap.image_size() ) );

    dim3    grids(DIM/16,DIM/16);
    dim3    threads(16,16);
    kernel<<<grids,threads>>>( dev_bitmap );

    HANDLE_ERROR( cudaMemcpy( bitmap.get_ptr(), dev_bitmap,
                              bitmap.image_size(),
                              cudaMemcpyDeviceToHost ) );
    bitmap.display_and_exit();

    cudaFree( dev_bitmap );
}

As with the Julia Set example, each thread will be computing a pixel value for a
single output location. The first thing that each thread does is compute its x and
y location in the output image. This computation is identical to the tid
computation in the vector addition example, although we compute it in two
dimensions this time:

__global__ void kernel( unsigned char *ptr ) {
    // map from threadIdx/blockIdx to pixel position
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;

Since we will be using a shared memory buffer to cache our computations, we
declare one such that each thread in our 16 x 16 block has an entry:

    __shared__ float shared[16][16];

Then, each thread computes a value to be stored into this buffer:

    // now calculate the value at that position
    const float period = 128.0f;

    shared[threadIdx.x][threadIdx.y] =
            255 * (sinf(x*2.0f*PI/ period) + 1.0f) *
                  (sinf(y*2.0f*PI/ period) + 1.0f) / 4.0f;


And lastly, we store these values back out to the pixel, reversing the order of x
and y:

    ptr[offset*4 + 0] = 0;
    ptr[offset*4 + 1] = shared[15-threadIdx.x][15-threadIdx.y];
    ptr[offset*4 + 2] = 0;
    ptr[offset*4 + 3] = 255;
}
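The reversal in this last store is what makes the kernel depend on other threads. The value that thread (threadIdx.x, threadIdx.y) reads from shared[15-threadIdx.x][15-threadIdx.y] was written by thread (15-threadIdx.x, 15-threadIdx.y). A small C sketch checks that in a 16 x 16 block no thread ever reads a value it wrote itself. Every read therefore depends on a different thread having finished its write. The type and function names are hypothetical.

```c
/* A thread's position within a 16 x 16 block. */
typedef struct { int x, y; } ThreadId;

/* The kernel writes shared[tx][ty] and reads shared[15-tx][15-ty], so the
   value read by thread (tx, ty) was written by thread (15-tx, 15-ty). */
ThreadId writer_of_value_read_by(int tx, int ty) {
    ThreadId w = { 15 - tx, 15 - ty };
    return w;
}

/* Counts threads in the block that read a value they wrote themselves. */
int self_reads(void) {
    int count = 0;
    for (int tx = 0; tx < 16; tx++) {
        for (int ty = 0; ty < 16; ty++) {
            ThreadId w = writer_of_value_read_by(tx, ty);
            if (w.x == tx && w.y == ty)
                count++;
        }
    }
    return count;
}
```

Since 15 - t equals t for no integer t, self_reads() is 0.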

Granted, these computations are somewhat arbitrary. We've simply come up with
something that will draw a grid of green spherical blobs. So after compiling and
running this kernel, we output an image like the one in Figure 5.5.

What happened here? As you may have guessed from the way we set up this
example, we're missing an important synchronization point. When a thread stores
the computed value in shared[][] to the pixel, it is possible that the
thread responsible for writing that value to shared[][] has not finished
writing it yet. The only way to guarantee that this does not happen is by using
__syncthreads(). Thus, the result is a corrupted picture of green blobs.

Figure 5.5 A screenshot rendered without proper synchronization


Although this may not be the end of the world, your application might be
computing more important values.

Instead, we need to add a synchronization point between the write to shared
memory and the subsequent read from it:

    shared[threadIdx.x][threadIdx.y] =
            255 * (sinf(x*2.0f*PI/ period) + 1.0f) *
                  (sinf(y*2.0f*PI/ period) + 1.0f) / 4.0f;

    __syncthreads();

    ptr[offset*4 + 0] = 0;
    ptr[offset*4 + 1] = shared[15-threadIdx.x][15-threadIdx.y];
    ptr[offset*4 + 2] = 0;
    ptr[offset*4 + 3] = 255;
}

With this __syncthreads() in place, we then get a far more predictable (and
aesthetically pleasing) result, as shown in Figure 5.6.

Figure 5.6 A screenshot after adding the correct synchronization



Chapter Review
We know how blocks can be subdivided into smaller parallel execution units
known as threads. We revisited the vector addition example of the previous
chapter to see how to perform addition of arbitrarily long vectors. We also showed
an example of reduction and how we use shared memory and synchronization to
accomplish this. In fact, this example showed how the GPU and CPU can
collaborate on computing results. Finally, we showed how perilous it can be to an
application when we neglect the need for synchronization.

You have learned most of the basics of CUDA C as well as some of the ways it
resembles standard C and a lot of the important ways it differs from standard
C. This would be an excellent time to consider some of the problems you have
encountered and which ones might lend themselves to parallel implementations
with CUDA C. As we progress, we will look at some of the other features we can
use to accomplish tasks on the GPU, as well as some of the more advanced API
features that CUDA provides to us.



Chapter 6
Constant Memory
and Events

We hope you have learned much about writing code that executes on the GPU.
You should know how to spawn parallel blocks to execute your kernels, and you
should know how to further split these blocks into parallel threads. You have also
seen ways to enable communication and synchronization between these threads.
But since the book is not over yet, you may have guessed that CUDA C has even
more features that might be useful to you.

This chapter will introduce you to a couple of these more advanced features.
Specifically, there exist ways in which you can exploit special regions of memory
on your GPU in order to accelerate your applications. In this chapter, we will
discuss one of these regions of memory: constant memory. In addition, because
we are looking at our first method for enhancing the performance of your CUDA C
applications, you will also learn how to measure the performance of your
applications using CUDA events. From these measurements, you will be able to
quantify the gain (or loss) from any enhancements you make.



CONSTANT MEMORY AND EVENTS

Chapter Objectives
Through the course of this chapter, you will accomplish the following:

• You will learn about using constant memory with CUDA C.

• You will learn about the performance characteristics of constant memory.

• You will learn how to use CUDA events to measure application performance.

Constant Memory
Previously, we discussed how modern GPUs are equipped with enormous
amounts of arithmetic processing power. In fact, the computational advantage
graphics processors have over CPUs helped precipitate the initial interest in using
graphics processors for general-purpose computing. With hundreds of arithmetic
units on the GPU, often the bottleneck is not the arithmetic throughput of the
chip but rather the memory bandwidth of the chip. There are so many ALUs on
graphics processors that sometimes we just can't keep the input coming to them
fast enough to sustain such high rates of computation. So, it is worth investigating
means by which we can reduce the amount of memory traffic required for a given
problem.

We have seen CUDA C programs that have used both global and shared memory
so far. However, the language makes available another kind of memory known
as constant memory. As the name may indicate, we use constant memory for
data that will not change over the course of a kernel execution. NVIDIA hardware
provides 64KB of constant memory that it treats differently than it treats standard
global memory. In some situations, using constant memory rather than global
memory will reduce the required memory bandwidth.

 RAY TRACING INTRODUCTION


We will look at one way of exploiting constant memory in the context of a simple
ray tracing application. First, we will give you some background in the major
concepts behind ray tracing. If you are already comfortable with the concepts
behind ray tracing, you can skip to the “Ray Tracing on the GPU” section.


CONSTANT MEMORY

Simply put, ray tracing is one way of producing a two-dimensional image of a
scene consisting of three-dimensional objects. But isn't this what GPUs were
originally designed for? How is this different from what OpenGL or DirectX
do when you play your favorite game? Well, GPUs do indeed solve this same
problem, but they use a technique known as rasterization. There are many
excellent books on rasterization, so we will not endeavor to explain the
differences here. It suffices to say that they are completely different methods
that solve the same problem.

So, how does ray tracing produce an image of a three-dimensional scene? The
idea is simple: We choose a spot in our scene to place an imaginary camera. This
simplified digital camera contains a light sensor, so to produce an image, we
need to determine what light would hit that sensor. Each pixel of the resulting
image should be the same color and intensity as the ray of light that hits that
spot on the sensor.

Since light incident at any point on the sensor can come from any place in our
scene, it turns out it's easier to work backward. That is, rather than trying to
figure out what light ray hits the pixel in question, what if we imagine shooting a
ray from the pixel and into the scene? In this way, each pixel behaves something
like an eye that is “looking” into the scene. Figure 6.1 illustrates these rays being
cast out of each pixel and into the scene.

Figure 6.1 A simple ray tracing scheme


We figure out what color is seen by each pixel by tracing a ray from the pixel in
question through the scene until it hits one of our objects. We then say that the
pixel would “see” this object and can assign its color based on the color of the
object it sees. Most of the computation required by ray tracing is in the
computation of these intersections of the ray with the objects in the scene.

Moreover, in more complex ray tracing models, shiny objects in the scene can
reflect rays, and translucent objects can refract the rays of light. This creates
secondary rays, tertiary rays, and so on. In fact, this is one of the attractive
features of ray tracing: It is very simple to get a basic ray tracer working, but we
can build models of more complex phenomena into the ray tracer in order to
produce more realistic images.

RAY TRACING ON THE GPU
Since APIs such as OpenGL and DirectX are not designed to allow ray-traced
rendering, we will have to use CUDA C to implement our basic ray tracer. Our
ray tracer will be extraordinarily simple so that we can concentrate on the use
of constant memory, so if you were expecting code that could form the basis of
a full-blown production renderer, you will be disappointed. Our basic ray tracer
will only support scenes of spheres, and the camera is restricted to the z-axis,
facing the origin. Moreover, we will not support any lighting of the scene to avoid
the complications of secondary rays. Instead of computing lighting effects, we will
simply assign each sphere a color and then shade them with some precomputed
function if they are visible.

So, what will the ray tracer do? It will fire a ray from each pixel and keep track of
which rays hit which spheres. It will also track the depth of each of these hits. In
the case where a ray passes through multiple spheres, only the sphere closest
to the camera can be seen. In essence, our “ray tracer” is not doing much more
than hiding surfaces that cannot be seen by the camera.

We will model our spheres with a data structure that stores the sphere's center
coordinate of (x, y, z), its radius, and its color of (r, b, g).


#define INF     2e10f

struct Sphere {
    float   r,b,g;
    float   radius;
    float   x,y,z;
    __device__ float hit( float ox, float oy, float *n ) {
        float dx = ox - x;
        float dy = oy - y;
        if (dx*dx + dy*dy < radius*radius) {
            float dz = sqrtf( radius*radius - dx*dx - dy*dy );
            *n = dz / sqrtf( radius * radius );
            return dz + z;
        }
        return -INF;
    }
};

You will also notice that the structure has a method called hit( float ox,
float oy, float *n ). Given a ray shot from the pixel at (ox, oy), this
method computes whether the ray intersects the sphere. If the ray does intersect
the sphere, the method returns the z coordinate at which the ray hits it (larger
values are closer to the camera) and stores a shading factor in *n; on a miss, it
returns -INF. We need this depth information for the reason mentioned before: In
the event that the ray hits more than one sphere, only the closest sphere can
actually be seen.

Our main() routine follows roughly the same sequence as our previous
image-generating examples:

#include "cuda.h"
#include "../common/book.h"
#include "../common/cpu_bitmap.h"

#define DIM 1024
#define rnd( x ) ((x) * rand() / RAND_MAX)
#define SPHERES 20

Sphere          *s;

int main( void ) {
    // capture the start time
    cudaEvent_t     start, stop;
    HANDLE_ERROR( cudaEventCreate( &start ) );
    HANDLE_ERROR( cudaEventCreate( &stop ) );
    HANDLE_ERROR( cudaEventRecord( start, 0 ) );

    CPUBitmap bitmap( DIM, DIM );
    unsigned char   *dev_bitmap;

    // allocate memory on the GPU for the output bitmap
    HANDLE_ERROR( cudaMalloc( (void**)&dev_bitmap,
                              bitmap.image_size() ) );
    // allocate memory for the Sphere dataset
    HANDLE_ERROR( cudaMalloc( (void**)&s,
                              sizeof(Sphere) * SPHERES ) );

We allocate memory for our input data, which is an array of spheres that compose
our scene. Since we need this data on the GPU but are generating it with the CPU,
we have to do both a cudaMalloc() and a malloc() to allocate memory on
both the GPU and the CPU. We also allocate a bitmap image that we will fill with
output pixel data as we ray trace our spheres on the GPU.

After allocating memory for input and output, we randomly generate the center
coordinate, color, and radius for our spheres:

    // allocate temp memory, initialize it, copy to
    // memory on the GPU, and then free our temp memory
    Sphere *temp_s = (Sphere*)malloc( sizeof(Sphere) * SPHERES );
    for (int i=0; i<SPHERES; i++) {
        temp_s[i].r = rnd( 1.0f );
        temp_s[i].g = rnd( 1.0f );
        temp_s[i].b = rnd( 1.0f );
        temp_s[i].x = rnd( 1000.0f ) - 500;
        temp_s[i].y = rnd( 1000.0f ) - 500;
        temp_s[i].z = rnd( 1000.0f ) - 500;
        temp_s[i].radius = rnd( 100.0f ) + 20;
    }


The program currently generates a random array of 20 spheres, but this quantity
is specified in a #define and can be adjusted accordingly.

We copy this array of spheres to the GPU using cudaMemcpy() and then free the
temporary buffer.

    HANDLE_ERROR( cudaMemcpy( s, temp_s,
                              sizeof(Sphere) * SPHERES,
                              cudaMemcpyHostToDevice ) );
    free( temp_s );

Now that our input is on the GPU and we have allocated space for the output, we
are ready to launch our kernel.

    // generate a bitmap from our sphere data
    dim3    grids(DIM/16,DIM/16);
    dim3    threads(16,16);
    kernel<<<grids,threads>>>( s, dev_bitmap );

We will examine the kernel itself in a moment, but for now you should take it
on faith that it ray traces the scene and generates pixel data for the input
scene of spheres. The kernel receives the device pointer s as an argument,
because device code cannot read the host variable s directly. Finally, we copy
the output image back from the GPU and display it. It should go without saying
that we free all allocated memory that hasn't already been freed.

    // copy our bitmap back from the GPU for display
    HANDLE_ERROR( cudaMemcpy( bitmap.get_ptr(), dev_bitmap,
                              bitmap.image_size(),
                              cudaMemcpyDeviceToHost ) );
    bitmap.display_and_exit();

    // free our memory
    cudaFree( dev_bitmap );
    cudaFree( s );
}

All of this should be commonplace to you now. So, how do we do the actual ray
tracing? Because we have settled on a very simple ray tracing model, our kernel
will be very easy to understand. Each thread is generating one pixel for our output
image, so we start in the usual manner by computing the x and y coordinates
for the thread as well as the linearized offset into our output buffer. We will
also shift our (x,y) image coordinates by DIM/2 so that the z-axis runs through
the center of the image.

__global__ void kernel( Sphere *s, unsigned char *ptr ) {
    // map from threadIdx/BlockIdx to pixel position
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;
    float   ox = (x - DIM/2);
    float   oy = (y - DIM/2);

Since each ray needs to check each sphere for intersection, we will now iterate
through the array of spheres, checking each for a hit:

    float   r=0, g=0, b=0;
    float   maxz = -INF;
    for(int i=0; i<SPHERES; i++) {
        float   n;
        float   t = s[i].hit( ox, oy, &n );
        if (t > maxz) {
            float fscale = n;
            r = s[i].r * fscale;
            g = s[i].g * fscale;
            b = s[i].b * fscale;
            maxz = t;   // remember the depth of the closest hit so far
        }
    }

Clearly, the majority of the interesting computation lies in the for() loop. We
iterate through each of the input spheres and call its hit() method to determine
whether the ray from our pixel “sees” the sphere. If the ray hits the current
sphere, we determine whether the hit is closer to the camera than the last sphere
we hit. If it is closer, we store this depth as our new closest sphere. In addition, we
store the color associated with this sphere so that when the loop has terminated,
the thread knows the color of the sphere that is closest to the camera. Since this
is the color that the ray from our pixel “sees,” we conclude that this is the color of
the pixel and store this value in our output image buffer.

After every sphere has been checked for intersection, we can store the current
color into the output image:

    ptr[offset*4 + 0] = (int)(r * 255);
    ptr[offset*4 + 1] = (int)(g * 255);
    ptr[offset*4 + 2] = (int)(b * 255);
    ptr[offset*4 + 3] = 255;
}
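The depth bookkeeping in this loop is easy to get wrong. The closest sphere wins only if maxz is raised to t each time a closer hit is accepted, as the paragraph above describes. The following C sketch uses hypothetical helper names and takes per-sphere depths as input rather than computing them with hit(). It contrasts that rule with a version that never updates maxz and therefore keeps whichever sphere was hit last.

```c
#define INF 2e10f

/* depth[i] is what sphere i's hit() returned for this pixel (-INF on a miss).
   Returns the index of the sphere the pixel sees, or -1 if nothing was hit. */
int closest_sphere(const float *depth, int count) {
    float maxz = -INF;
    int seen = -1;
    for (int i = 0; i < count; i++) {
        if (depth[i] > maxz) {
            maxz = depth[i];    /* larger z = closer to the camera */
            seen = i;
        }
    }
    return seen;
}

/* Deliberately wrong variant: maxz is never updated, so every hit passes
   the test and the LAST sphere hit wins instead of the closest one. */
int last_sphere_hit(const float *depth, int count) {
    const float maxz = -INF;
    int seen = -1;
    for (int i = 0; i < count; i++)
        if (depth[i] > maxz)
            seen = i;
    return seen;
}
```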

Note that if no spheres have been hit, the color that we store will be whatever
color we initialized the variables r, b, and g to. In this case, we set r, b, and g
to zero, so the background will be black. You can change these values to render
a different color background. Figure 6.2 shows an example of what the output
should look like when rendered with 20 spheres and a black background.

Figure 6.2 A screenshot from the ray tracing example



Since we randomly generated the sphere positions, colors, and sizes, we advise
you not to panic if your output doesn't match this image identically.

RAY TRACING WITH CONSTANT MEMORY
You may have noticed that we never mentioned constant memory in the ray
tracing example. Now it's time to improve this example using the benefits of
constant memory. Since we cannot modify constant memory, we clearly can't
use it for the output image data. And this example has only one input, the array
of spheres, so it should be pretty obvious what data we will store in constant
memory.

The mechanism for declaring memory constant is identical to the one we used for
declaring a buffer as shared memory. Instead of declaring our array like this:

Sphere *s;

we add the modifier __constant__ before it:

__constant__ Sphere s[SPHERES];

Notice that in the original example, we declared a pointer and then used
cudaMalloc() to allocate GPU memory for it. When we changed it to constant
memory, we also changed the declaration to statically allocate the space in
constant memory. We no longer need to worry about calling cudaMalloc() or
cudaFree() for our array of spheres, but we do need to commit to a size for this
array at compile time. For many applications, this is an acceptable trade-off for
the performance benefits of constant memory. We will talk about these benefits
momentarily, but first we will look at how the use of constant memory changes
our main() routine:

int main( void ) {
    CPUBitmap bitmap( DIM, DIM );
    unsigned char   *dev_bitmap;

    // allocate memory on the GPU for the output bitmap
    HANDLE_ERROR( cudaMalloc( (void**)&dev_bitmap,
                              bitmap.image_size() ) );

    // allocate temp memory, initialize it, copy to constant
    // memory on the GPU, and then free our temp memory
    Sphere *temp_s = (Sphere*)malloc( sizeof(Sphere) * SPHERES );
    for (int i=0; i<SPHERES; i++) {
        temp_s[i].r = rnd( 1.0f );
        temp_s[i].g = rnd( 1.0f );
        temp_s[i].b = rnd( 1.0f );
        temp_s[i].x = rnd( 1000.0f ) - 500;
        temp_s[i].y = rnd( 1000.0f ) - 500;
        temp_s[i].z = rnd( 1000.0f ) - 500;
        temp_s[i].radius = rnd( 100.0f ) + 20;
    }
    HANDLE_ERROR( cudaMemcpyToSymbol( s, temp_s,
                                      sizeof(Sphere) * SPHERES) );
    free( temp_s );

    // generate a bitmap from our sphere data
    dim3    grids(DIM/16,DIM/16);
    dim3    threads(16,16);
    kernel<<<grids,threads>>>( dev_bitmap );

    // copy our bitmap back from the GPU for display
    HANDLE_ERROR( cudaMemcpy( bitmap.get_ptr(), dev_bitmap,
                              bitmap.image_size(),
                              cudaMemcpyDeviceToHost ) );
    bitmap.display_and_exit();

    // free our memory
    cudaFree( dev_bitmap );
}

Largely, this is identical to the previous implementation of main(). As we
mentioned previously, we no longer need the call to cudaMalloc() to allocate
space for our array of spheres. The other change is the call that copies the
spheres into constant memory:

    HANDLE_ERROR( cudaMemcpyToSymbol( s, temp_s,
                                      sizeof(Sphere) * SPHERES ) );

We use this special version of cudaMemcpy() when we copy from host
memory to constant memory on the GPU. The only differences between
cudaMemcpyToSymbol() and cudaMemcpy() using cudaMemcpyHostToDevice
are that cudaMemcpyToSymbol() copies to a device symbol (here, the
__constant__ array s in constant memory), whereas cudaMemcpy() copies to the
global memory address given by a pointer.

Outside the __constant__ modifier and these changes to main(), the ray
tracing computation is the same in the versions with and without constant memory.

PERFORMANCE WITH CONSTANT MEMORY
Declaring memory as __constant__ constrains our usage to be read-only. In
taking on this constraint, we expect to get something in return. As we previously
mentioned, reading from constant memory can conserve memory bandwidth
when compared to reading the same data from global memory. There are two
reasons why reading from the 64KB of constant memory can save bandwidth over
standard reads of global memory:

• A single read from constant memory can be broadcast to other “nearby”
  threads, effectively saving up to 15 reads.

• Constant memory is cached, so consecutive reads of the same address will not
  incur any additional memory traffic.

What do we mean by the word nearby? To answer this question, we will need to
explain the concept of a warp. For those readers who are more familiar with Star
Trek than with weaving, a warp in this context has nothing to do with the speed
of travel through space. In the world of weaving, a warp refers to the group
of threads being woven together into fabric. In the CUDA Architecture, a warp
refers to a collection of 32 threads that are “woven together” and get executed in
lockstep. At every line in your program, each thread in a warp executes the same
instruction on different data.


When it comes to handling constant memory, NVIDIA hardware can broadcast
a single memory read to each half-warp. A half-warp, not nearly as creatively
named as a warp, is a group of 16 threads: half of a 32-thread warp. If every
thread in a half-warp requests data from the same address in constant memory,
your GPU will generate only a single read request and subsequently broadcast
the data to every thread. If you are reading a lot of data from constant memory,
you will generate only 1/16 (roughly 6 percent) of the memory traffic as you would
when using global memory.

But the savings don't stop at a 94 percent reduction in bandwidth when
reading constant memory! Because we have committed to leaving the memory
unchanged, the hardware can aggressively cache the constant data on the GPU.
So after the first read from an address in constant memory, other half-warps
requesting the same address, and therefore hitting the constant cache, will
generate no additional memory traffic.

In the case of our ray tracer, every thread in the launch reads the data
corresponding to the first sphere so the thread can test its ray for intersection.
After we modify our application to store the spheres in constant memory, the
hardware needs to make only a single request for this data. After caching the
data, every other thread avoids generating memory traffic as a result of one of
the two constant memory benefits:

• It receives the data in a half-warp broadcast.

• It retrieves the data from the constant memory cache.

Unfortunately, there can potentially be a downside to performance when using
constant memory. The half-warp broadcast feature is in actuality a double-edged
sword. Although it can dramatically accelerate performance when all 16 threads
are reading the same address, it actually slows performance to a crawl when all
16 threads read different addresses.

The trade-off to allowing the broadcast of a single read to 16 threads is that the
16 threads are allowed to place only a single read request at a time. For example,
if all 16 threads in a half-warp need different data from constant memory, the
16 different reads get serialized, effectively taking 16 times the amount of time
to place the request. If they were reading from conventional global memory, the
requests could be issued at the same time. In this case, reading from constant
memory would probably be slower than using global memory.

6.3 Measuring Performance with Events
Fully aware that there may be either positive or negative implications, you have changed your ray tracer to use constant memory. How do you determine how this has impacted the performance of your program? One of the simplest metrics involves answering this simple question: Which version takes less time to finish? We could use one of the CPU or operating system timers, but this will include latency and variation from any number of sources (operating system thread scheduling, availability of high-precision CPU timers, and so on). Furthermore, while the GPU kernel runs, we may be asynchronously performing computation on the host. The only way to time these host computations is using the CPU or operating system timing mechanism. So to measure the time a GPU spends on a task, we will use the CUDA event API.

An event in CUDA is essentially a GPU time stamp that is recorded at a user-specified point in time. Since the GPU itself is recording the time stamp, it eliminates a lot of the problems we might encounter when trying to time GPU execution with CPU timers. The API is relatively easy to use, since taking a time stamp consists of just two steps: creating an event and subsequently recording an event. For example, at the beginning of some sequence of code, we instruct the CUDA runtime to make a record of the current time. We do so by creating and then recording the event:

cudaEvent_t start;
cudaEventCreate(&start);
cudaEventRecord( start, 0 );

You will notice that when we instruct the runtime to record the event start, we also pass it a second argument. In the previous example, this argument is 0. The exact nature of this argument is unimportant for our purposes right now, so we intend to leave it mysteriously unexplained rather than open a new can of worms. If your curiosity is killing you, we intend to discuss this when we talk about streams.

To time a block of code, we will want to create both a start event and a stop event. We will have the CUDA runtime record when we start, tell it to do some other work on the GPU, and then tell it to record when we’ve stopped:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord( start, 0 );

// do some work on the GPU

cudaEventRecord( stop, 0 );

Unfortunately, there is still a problem with timing GPU code in this way. The fix will require only one line of code but will require some explanation. The trickiest part of using events arises as a consequence of the fact that some of the calls we make in CUDA C are actually asynchronous. For example, when we launched the kernel in our ray tracer, the GPU begins executing our code, but the CPU continues executing the next line of our program before the GPU finishes. This is excellent from a performance standpoint because it means we can be computing something on the GPU and CPU at the same time, but conceptually it makes timing tricky.

You should imagine calls to cudaEventRecord() as an instruction to record the current time being placed into the GPU’s pending queue of work. As a result, our event won’t actually be recorded until the GPU finishes everything prior to the call to cudaEventRecord(). In terms of having our stop event measure the correct time, this is precisely what we want. But we cannot safely read the value of the stop event until the GPU has completed its prior work and recorded the stop event. Fortunately, we have a way to instruct the CPU to synchronize on an event, the event API function cudaEventSynchronize():

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord( start, 0 );

// do some work on the GPU

cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );

Now, we have instructed the runtime to block the CPU from proceeding until the GPU has reached the stop event. When the call to cudaEventSynchronize() returns, we know that all GPU work before the stop event has completed, so it is safe to read the time stamp recorded in stop. It is worth noting that because CUDA events get implemented directly on the GPU, they are unsuitable for timing mixtures of device and host code. That is, you will get unreliable results if you attempt to use CUDA events to time more than kernel executions and memory copies involving the device.

6.3.1 MEASURING RAY TRACER PERFORMANCE
To time our ray tracer, we will need to create a start and stop event just as we did when learning about events. The following is a timing-enabled version of the ray tracer that does not use constant memory:

int main( void ) {
    // capture the start time
    cudaEvent_t start, stop;
    HANDLE_ERROR( cudaEventCreate( &start ) );
    HANDLE_ERROR( cudaEventCreate( &stop ) );
    HANDLE_ERROR( cudaEventRecord( start, 0 ) );

    CPUBitmap bitmap( DIM, DIM );
    unsigned char *dev_bitmap;

    // allocate memory on the GPU for the output bitmap
    HANDLE_ERROR( cudaMalloc( (void**)&dev_bitmap,
                              bitmap.image_size() ) );
    // allocate memory for the Sphere dataset
    // (s is the global Sphere pointer declared alongside the kernel)
    HANDLE_ERROR( cudaMalloc( (void**)&s,
                              sizeof(Sphere) * SPHERES ) );

    // allocate temp memory, initialize it, copy to
    // memory on the GPU, and then free our temp memory
    Sphere *temp_s = (Sphere*)malloc( sizeof(Sphere) * SPHERES );
    for (int i=0; i<SPHERES; i++) {
        temp_s[i].r = rnd( 1.0f );
        temp_s[i].g = rnd( 1.0f );
        temp_s[i].b = rnd( 1.0f );
        temp_s[i].x = rnd( 1000.0f ) - 500;
        temp_s[i].y = rnd( 1000.0f ) - 500;
        temp_s[i].z = rnd( 1000.0f ) - 500;
        temp_s[i].radius = rnd( 100.0f ) + 20;
    }
    HANDLE_ERROR( cudaMemcpy( s, temp_s,
                              sizeof(Sphere) * SPHERES,
                              cudaMemcpyHostToDevice ) );
    free( temp_s );

    // generate a bitmap from our sphere data
    dim3 grids(DIM/16,DIM/16);
    dim3 threads(16,16);
    kernel<<<grids,threads>>>( s, dev_bitmap );

    // copy our bitmap back from the GPU for display
    HANDLE_ERROR( cudaMemcpy( bitmap.get_ptr(), dev_bitmap,
                              bitmap.image_size(),
                              cudaMemcpyDeviceToHost ) );

    // get stop time, and display the timing results
    HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
    HANDLE_ERROR( cudaEventSynchronize( stop ) );

    float elapsedTime;
    HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime,
                                        start, stop ) );
    printf( "Time to generate: %3.1f ms\n", elapsedTime );

    HANDLE_ERROR( cudaEventDestroy( start ) );
    HANDLE_ERROR( cudaEventDestroy( stop ) );

    // display
    bitmap.display_and_exit();

    // free our memory
    cudaFree( dev_bitmap );
    cudaFree( s );
}

Notice that we have thrown two additional functions into the mix, the calls to cudaEventElapsedTime() and cudaEventDestroy(). The function cudaEventElapsedTime() is a utility that computes the elapsed time between two previously recorded events. The time in milliseconds elapsed between the two events is returned in the first argument, the address of a floating-point variable.

The call to cudaEventDestroy() needs to be made when we’re finished using an event created with cudaEventCreate(). This is analogous to calling free() on memory previously allocated with malloc(), so we needn’t stress how important it is to match every cudaEventCreate() with a cudaEventDestroy().

We can instrument the ray tracer that does use constant memory in the same fashion:

int main( void ) {
    // capture the start time
    cudaEvent_t start, stop;
    HANDLE_ERROR( cudaEventCreate( &start ) );
    HANDLE_ERROR( cudaEventCreate( &stop ) );
    HANDLE_ERROR( cudaEventRecord( start, 0 ) );

    CPUBitmap bitmap( DIM, DIM );
    unsigned char *dev_bitmap;

    // allocate memory on the GPU for the output bitmap
    HANDLE_ERROR( cudaMalloc( (void**)&dev_bitmap,
                              bitmap.image_size() ) );

    // allocate temp memory, initialize it, copy to constant
    // memory on the GPU, and then free our temp memory
    Sphere *temp_s = (Sphere*)malloc( sizeof(Sphere) * SPHERES );
    for (int i=0; i<SPHERES; i++) {
        temp_s[i].r = rnd( 1.0f );
        temp_s[i].g = rnd( 1.0f );
        temp_s[i].b = rnd( 1.0f );
        temp_s[i].x = rnd( 1000.0f ) - 500;
        temp_s[i].y = rnd( 1000.0f ) - 500;
        temp_s[i].z = rnd( 1000.0f ) - 500;
        temp_s[i].radius = rnd( 100.0f ) + 20;
    }
    // s is the __constant__ Sphere array declared alongside the kernel
    HANDLE_ERROR( cudaMemcpyToSymbol( s, temp_s,
                                      sizeof(Sphere) * SPHERES) );
    free( temp_s );

    // generate a bitmap from our sphere data
    dim3 grids(DIM/16,DIM/16);
    dim3 threads(16,16);
    kernel<<<grids,threads>>>( dev_bitmap );

    // copy our bitmap back from the GPU for display
    HANDLE_ERROR( cudaMemcpy( bitmap.get_ptr(), dev_bitmap,
                              bitmap.image_size(),
                              cudaMemcpyDeviceToHost ) );

    // get stop time, and display the timing results
    HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
    HANDLE_ERROR( cudaEventSynchronize( stop ) );
    float elapsedTime;
    HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime,
                                        start, stop ) );
    printf( "Time to generate: %3.1f ms\n", elapsedTime );

    HANDLE_ERROR( cudaEventDestroy( start ) );
    HANDLE_ERROR( cudaEventDestroy( stop ) );

    // display
    bitmap.display_and_exit();

    // free our memory
    cudaFree( dev_bitmap );
}

Now when we run our two versions of the ray tracer, we can compare the time it takes to complete the GPU work. This will tell us at a high level whether introducing constant memory has improved the performance of our application or worsened it. Fortunately, in this case, performance is improved dramatically by using constant memory. Our experiments on a GeForce GTX-series GPU show the constant memory ray tracer performing considerably faster than the version that uses global memory. On a different GPU, your mileage might vary, although the ray tracer that uses constant memory should always be at least as fast as the version without it.

6.4 Chapter Review
In addition to the global and shared memory we explored in previous chapters, NVIDIA hardware makes other types of memory available for our use. Constant memory comes with additional constraints over standard global memory, but in some cases, subjecting ourselves to these constraints can yield additional performance. Specifically, we can see additional performance when threads in a warp need access to the same read-only data. Using constant memory for data with this access pattern can conserve bandwidth both because of the capacity to broadcast reads across a half-warp and because of the presence of a constant memory cache on chip. Memory bandwidth bottlenecks a wide class of algorithms, so having mechanisms to ameliorate this situation can prove incredibly useful.

We also learned how to use CUDA events to request the runtime to record time stamps at specific points during GPU execution. We saw how to synchronize the CPU with the GPU on one of these events and then how to compute the time elapsed between two events. In doing so, we built up a method to compare the running time between two different methods for ray tracing spheres, concluding that, for the application at hand, using constant memory gained us a significant amount of performance.

Chapter 7
Texture Memory

When we looked at constant memory, we saw how exploiting special memory spaces under the right circumstances can dramatically accelerate applications. We also learned how to measure these performance gains in order to make informed decisions about performance choices. In this chapter, we will learn about how to allocate and use texture memory. Like constant memory, texture memory is another variety of read-only memory that can improve performance and reduce memory traffic when reads have certain access patterns. Although texture memory was originally designed for traditional graphics applications, it can also be used quite effectively in some GPU computing applications.

7.1 Chapter Objectives
Through the course of this chapter, you will accomplish the following:

• You will learn about the performance characteristics of texture memory.

• You will learn how to use one-dimensional texture memory with CUDA C.

• You will learn how to use two-dimensional texture memory with CUDA C.

7.2 Texture Memory Overview
If you read the introduction to this chapter, the secret is already out: There is yet another type of read-only memory that is available for use in your programs written in CUDA C. Readers familiar with the workings of graphics hardware will not be surprised, but the GPU’s sophisticated texture memory may also be used for general-purpose computing. Although NVIDIA designed the texture units for the classical OpenGL and DirectX rendering pipelines, texture memory has some properties that make it extremely useful for computing.

Like constant memory, texture memory is cached on chip, so in some situations it will provide higher effective bandwidth by reducing memory requests to off-chip DRAM. Specifically, texture caches are designed for graphics applications where memory access patterns exhibit a great deal of spatial locality. In a computing application, this roughly implies that a thread is likely to read from an address “near” the address that nearby threads read, as shown in Figure 7.1.

Figure 7.1 A mapping of threads into a two-dimensional region of memory (Threads 0 through 3 read nearby locations)

Arithmetically, the four addresses shown are not consecutive, so they would not be cached together in a typical CPU caching scheme. But since GPU texture caches are designed to accelerate access patterns such as this one, you can expect an increase in performance in this case when using texture memory instead of global memory. In fact, as we shall see, this sort of access pattern is fairly common in general-purpose computing.

7.3 Simulating Heat Transfer
Physical simulations can be among the most computationally challenging problems to solve. Fundamentally, there is often a trade-off between accuracy and computational complexity. Computer simulations have become more and more important in recent years, thanks in large part to the increased accuracy made possible by the parallel computing revolution. Since many physical simulations can be parallelized quite easily, we will look at a very simple simulation model in this example.

7.3.1 SIMPLE HEATING MODEL
To demonstrate a situation where you can effectively employ texture memory, we will construct a simple two-dimensional heat transfer simulation. We start by assuming that we have some rectangular room that we divide into a grid. Inside the grid, we will randomly scatter a handful of “heaters” with various fixed temperatures. Figure 7.2 shows an example of what this room might look like.

Figure 7.2 A room with “heaters” of various temperature

Figure 7.3 Heat dissipating from warm cells into cold cells

Given a rectangular grid and configuration of heaters, we are looking to simulate what happens to the temperature in every grid cell as time progresses. For simplicity, cells with heaters in them always remain a constant temperature. At every step in time, we will assume that heat “flows” between a cell and its neighbors. If a cell’s neighbor is warmer than it is, the warmer neighbor will tend to warm it up. Conversely, if a cell has a neighbor cooler than it is, it will cool off. Qualitatively, Figure 7.3 represents this flow of heat.

In our heat transfer model, we will compute the new temperature in a grid cell by adding to its old temperature the sum of the differences between its neighbors’ temperatures and its own, each scaled by a constant k. This update equation is shown in Equation 7.1.

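$$T_{\mathrm{NEW}} = T_{\mathrm{OLD}} + \sum_{\mathrm{NEIGHBORS}} k \left( T_{\mathrm{NEIGHBOR}} - T_{\mathrm{OLD}} \right)$$
Equation 7.1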

In the equation for updating a cell’s temperature, the constant k simply represents the rate at which heat flows through the simulation. A large value of k will drive the system to a constant temperature quickly, while a small value will allow the solution to retain large temperature gradients longer. Since we consider only four neighbors (top, bottom, left, right) and k and T_OLD remain constant in the equation, this update becomes the one shown in Equation 7.2.

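$$T_{\mathrm{NEW}} = T_{\mathrm{OLD}} + k \left( T_{\mathrm{TOP}} + T_{\mathrm{BOTTOM}} + T_{\mathrm{LEFT}} + T_{\mathrm{RIGHT}} - 4\,T_{\mathrm{OLD}} \right)$$
Equation 7.2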

As with the ray tracing example in the previous chapter, this model is not intended to be close to what might be used in industry (in fact, it is not really even an approximation of something physically accurate). We have simplified this model immensely in order to draw attention to the techniques at hand. With this in mind, let’s take a look at how the update given by Equation 7.2 can be computed on the GPU.
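As a quick worked example of Equation 7.2, take k = 0.25, the value of SPEED that the code later in this section uses. Consider a cold cell (temperature 0) whose top neighbor is at temperature 1, with every other neighboring value at 0. The first line updates the cold cell. The second line updates the warm cell above it, assuming that cell is not a heater; its own neighbors are all at 0, including the cold cell below it.

$$
\begin{aligned}
T_{\mathrm{NEW}} &= 0 + 0.25\,(1 + 0 + 0 + 0 - 4 \cdot 0) = 0.25 \\
T_{\mathrm{NEW}} &= 1 + 0.25\,(0 + 0 + 0 + 0 - 4 \cdot 1) = 0
\end{aligned}
$$

More generally, Equation 7.2 can be rewritten as $T_{\mathrm{NEW}} = (1 - 4k)\,T_{\mathrm{OLD}} + k\,(T_{\mathrm{TOP}} + T_{\mathrm{BOTTOM}} + T_{\mathrm{LEFT}} + T_{\mathrm{RIGHT}})$. With k = 0.25, the coefficient on $T_{\mathrm{OLD}}$ vanishes and each new temperature is simply the average of the four neighbors.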

7.3.2 COMPUTING TEMPERATURE UPDATES

We will cover the specifics of each step in a moment, but at a high level, our update process proceeds as follows:

1. Given some grid of input temperatures, copy the temperature of cells with heaters to this grid. This will overwrite any previously computed temperatures in these cells, thereby enforcing our restriction that “heating cells” remain at a constant temperature. This copy gets performed in copy_const_kernel().

2. Given the input temperature grid, compute the output temperatures based on the update in Equation 7.2. This update gets performed in blend_kernel().

3. Swap the input and output buffers in preparation for the next time step. The output temperature grid computed in step 2 will become the input temperature grid that we start with in step 1 when simulating the next time step.

Before beginning the simulation, we assume we have generated a grid of constants. Most of the entries in this grid are zero, but some entries contain nonzero temperatures that represent heaters at fixed temperatures. This buffer of constants will not change over the course of the simulation and gets read at each time step.

Because of the way we are modeling our heat transfer, we start with the output grid from the previous time step. Then, according to step 1, we copy the temperatures of the cells with heaters into this output grid, overwriting any previously computed temperatures. We do this because we have assumed that the temperature of these heater cells remains constant. We perform this copy of the constant grid onto the input grid with the following kernel:

__global__ void copy_const_kernel( float *iptr,
                                   const float *cptr ) {
    // map from threadIdx/BlockIdx to pixel position
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;

    if (cptr[offset] != 0) iptr[offset] = cptr[offset];
}

The first three lines should look familiar. The first two lines convert a thread’s threadIdx and blockIdx into an x and a y coordinate. The third line computes a linear offset into our constant and input buffers. The final line performs the copy of the heater temperature in cptr[] to the input grid in iptr[]. Notice that the copy is performed only if the cell in the constant grid is nonzero. We do this to preserve any values that were computed in the previous time step within cells that do not contain heaters. Cells with heaters will have nonzero entries in cptr[] and will therefore have their temperatures preserved from step to step thanks to this copy kernel.

Step 2 of the algorithm is the most computationally involved. To perform the updates, we can have each thread take responsibility for a single cell in our simulation. Each thread will read its cell’s temperature and the temperatures of its neighboring cells, perform the previous update computation, and then write its cell’s new temperature to the output grid. Much of this kernel resembles techniques you’ve used before.

__global__ void blend_kernel( float *outSrc,
                              const float *inSrc ) {
    // map from threadIdx/BlockIdx to pixel position
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;

    int left = offset - 1;
    int right = offset + 1;
    if (x == 0) left++;
    if (x == DIM-1) right--;

    int top = offset - DIM;
    int bottom = offset + DIM;
    if (y == 0) top += DIM;
    if (y == DIM-1) bottom -= DIM;

    outSrc[offset] = inSrc[offset] + SPEED * ( inSrc[top] +
                     inSrc[bottom] + inSrc[left] + inSrc[right] -
                     inSrc[offset]*4);
}

Notice that we start exactly as we did for the examples that produced images as
their output. However, instead of computing the color of a pixel, the threads are
computing temperatures of simulation grid cells. Nevertheless, they start by
converting their threadIdx and blockIdx into an x, y, and offset. You might
be able to recite these lines in your sleep by now (although for your sake, we hope
you aren’t actually reciting them in your sleep).

Next, we determine the offsets of our left, right, top, and bottom neighbors so
that we can read the temperatures of those cells. We will need those values to
compute the updated temperature in the current cell. The only complication here
is that we need to adjust indices on the border so that cells around the edges
do not wrap around. Finally, in the last statement of the kernel, we perform the update from
Equation 7.2, adding the old temperature and the scaled differences of that
temperature and the cell’s neighbors’ temperatures.

7.3.3 ANIMATING THE SIMULATION


The remainder of the code primarily sets up the grid and then displays an
animated output of the heat map. We will walk through that code now:

#include "cuda.h"
#include "../common/book.h"
#include "../common/cpu_anim.h"

#define DIM 1024
#define PI 3.1415926535897932f
#define MAX_TEMP 1.0f
#define MIN_TEMP 0.0001f
#define SPEED 0.25f

// globals needed by the update routine
struct DataBlock {
    unsigned char *output_bitmap;
    float *dev_inSrc;
    float *dev_outSrc;
    float *dev_constSrc;
    CPUAnimBitmap *bitmap;

    cudaEvent_t start, stop;
    float totalTime;
    float frames;
};

void anim_gpu( DataBlock *d, int ticks ) {
    HANDLE_ERROR( cudaEventRecord( d->start, 0 ) );
    dim3 blocks(DIM/16,DIM/16);
    dim3 threads(16,16);
    CPUAnimBitmap *bitmap = d->bitmap;

    for (int i=0; i<90; i++) {
        copy_const_kernel<<<blocks,threads>>>( d->dev_inSrc,
                                               d->dev_constSrc );
        blend_kernel<<<blocks,threads>>>( d->dev_outSrc,
                                          d->dev_inSrc );
        swap( d->dev_inSrc, d->dev_outSrc );
    }
    float_to_color<<<blocks,threads>>>( d->output_bitmap,
                                        d->dev_inSrc );

    HANDLE_ERROR( cudaMemcpy( bitmap->get_ptr(),
                              d->output_bitmap,
                              bitmap->image_size(),
                              cudaMemcpyDeviceToHost ) );

    HANDLE_ERROR( cudaEventRecord( d->stop, 0 ) );
    HANDLE_ERROR( cudaEventSynchronize( d->stop ) );
    float elapsedTime;
    HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime,
                                        d->start, d->stop ) );
    d->totalTime += elapsedTime;
    ++d->frames;
    printf( "Average Time per frame: %3.1f ms\n",
            d->totalTime/d->frames );
}

void anim_exit( DataBlock *d ) {
    cudaFree( d->dev_inSrc );
    cudaFree( d->dev_outSrc );
    cudaFree( d->dev_constSrc );

    HANDLE_ERROR( cudaEventDestroy( d->start ) );
    HANDLE_ERROR( cudaEventDestroy( d->stop ) );
}

We have equipped the code with event-based timing as we did in the previous chapter’s ray tracing example. The timing code serves the same purpose as it did previously. Since we will endeavor to accelerate the initial implementation, we have put in place a mechanism by which we can measure performance and convince ourselves that we have succeeded.

The function anim_gpu() gets called by the animation framework on every frame. The arguments to this function are a pointer to a DataBlock and the number of ticks of the animation that have elapsed. As with the animation examples, we use blocks of 256 threads, arranged as 16 x 16, in a two-dimensional grid of DIM/16 x DIM/16 blocks. Each iteration of the for() loop in anim_gpu() computes a single time step of the simulation as described by the three-step algorithm at the beginning of Section 7.3.2, Computing Temperature Updates. Since the DataBlock contains the constant buffer of heaters as well as the output of the last time step, it encapsulates the entire state of the animation, and consequently, anim_gpu() does not actually need to use the value of ticks anywhere.

You will notice that we have chosen to do 90 time steps per frame. This number is not magical but was determined somewhat experimentally as a reasonable trade-off between having to download a bitmap image for every time step and computing too many time steps per frame, resulting in a jerky animation. If you were more concerned with getting the output of each simulation step than you were with animating the results in real time, you could change this such that you computed only a single step on each frame.

After computing the 90 time steps since the previous frame, anim_gpu() is ready to copy a bitmap frame of the current animation back to the CPU. Because the loop swaps the input and output pointers after every time step, d->dev_inSrc holds the output of the 90th time step when the loop exits. We convert those temperatures to colors using the kernel float_to_color() and then copy the resultant image back to the CPU with a cudaMemcpy() that specifies the direction of copy as cudaMemcpyDeviceToHost. Since d->dev_inSrc already holds the newest grid, it is also in place to serve as input to the next time step, so no extra swap is needed before the next sequence of time steps.

int main( void ) {
    DataBlock data;
    CPUAnimBitmap bitmap( DIM, DIM, &data );
    data.bitmap = &bitmap;
    data.totalTime = 0;
    data.frames = 0;
    HANDLE_ERROR( cudaEventCreate( &data.start ) );
    HANDLE_ERROR( cudaEventCreate( &data.stop ) );

    HANDLE_ERROR( cudaMalloc( (void**)&data.output_bitmap,
                              bitmap.image_size() ) );

    // image_size() is DIM*DIM*4 bytes (one RGBA pixel per cell), which is
    // exactly DIM*DIM floats, assuming float == 4 chars in size
    HANDLE_ERROR( cudaMalloc( (void**)&data.dev_inSrc,
                              bitmap.image_size() ) );
    HANDLE_ERROR( cudaMalloc( (void**)&data.dev_outSrc,
                              bitmap.image_size() ) );
    HANDLE_ERROR( cudaMalloc( (void**)&data.dev_constSrc,
                              bitmap.image_size() ) );

    float *temp = (float*)malloc( bitmap.image_size() );
    for (int i=0; i<DIM*DIM; i++) {
        temp[i] = 0;
        int x = i % DIM;
        int y = i / DIM;
        if ((x>300) && (x<600) && (y>310) && (y<601))
            temp[i] = MAX_TEMP;
    }
    temp[DIM*100+100] = (MAX_TEMP + MIN_TEMP)/2;
    temp[DIM*700+100] = MIN_TEMP;
    temp[DIM*300+300] = MIN_TEMP;
    temp[DIM*200+700] = MIN_TEMP;
    for (int y=800; y<900; y++) {
        for (int x=400; x<500; x++) {
            temp[x+y*DIM] = MIN_TEMP;
        }
    }
    HANDLE_ERROR( cudaMemcpy( data.dev_constSrc, temp,
                              bitmap.image_size(),
                              cudaMemcpyHostToDevice ) );

    for (int y=800; y<DIM; y++) {
        for (int x=0; x<200; x++) {
            temp[x+y*DIM] = MAX_TEMP;
        }
    }
    HANDLE_ERROR( cudaMemcpy( data.dev_inSrc, temp,
                              bitmap.image_size(),
                              cudaMemcpyHostToDevice ) );
    free( temp );

    bitmap.anim_and_exit( (void (*)(void*,int))anim_gpu,
                          (void (*)(void*))anim_exit );
}

Figure 7.4 shows an example of what the output might look like. You will notice in
the image some of the “heaters” that appear to be pixel-sized islands that disrupt
the continuity of the temperature distribution.

7.3.4 USING TEXTURE MEMORY


There is a considerable amount of spatial locality in the memory access pattern required to perform the temperature update in each step. As we explained previously, this is exactly the type of access pattern that GPU texture memory is designed to accelerate. Given that we want to use texture memory, we need to learn the mechanics of doing so.

Figure 7.4 A screenshot from the animated heat transfer simulation

First, we will need to declare our inputs as texture references. We will use references to floating-point textures, since our temperature data is floating-point.

// these exist on the GPU side
texture<float> texConstSrc;
texture<float> texIn;
texture<float> texOut;

The next major difference is that after allocating GPU memory for these three buffers, we need to bind the references to the memory buffers using cudaBindTexture(). This basically tells the CUDA runtime two things:

• We intend to use the specified buffer as a texture.

• We intend to use the specified texture reference as the texture’s “name.”

After the three allocations in our heat transfer simulation, we bind the three allocations to the texture references declared earlier (texConstSrc, texIn, and texOut).

// imageSize holds bitmap.image_size(), the size in bytes of each buffer
HANDLE_ERROR( cudaMalloc( (void**)&data.dev_inSrc,
                          imageSize ) );
HANDLE_ERROR( cudaMalloc( (void**)&data.dev_outSrc,
                          imageSize ) );
HANDLE_ERROR( cudaMalloc( (void**)&data.dev_constSrc,
                          imageSize ) );

HANDLE_ERROR( cudaBindTexture( NULL, texConstSrc,
                               data.dev_constSrc,
                               imageSize ) );
HANDLE_ERROR( cudaBindTexture( NULL, texIn,
                               data.dev_inSrc,
                               imageSize ) );
HANDLE_ERROR( cudaBindTexture( NULL, texOut,
                               data.dev_outSrc,
                               imageSize ) );

At this point, our textures are completely set up, and we’re ready to launch our kernel. However, when we’re reading from textures in the kernel, we need to use special functions to instruct the GPU to route our requests through the texture unit and not through standard global memory. As a result, we can no longer simply use square brackets to read from buffers; we need to modify blend_kernel() to use tex1Dfetch() when reading from memory.

Additionally, there is another difference between using global and texture memory that requires us to make another change. Although it looks like a function, tex1Dfetch() is a compiler intrinsic. And since texture references must be declared globally at file scope, we can no longer pass the input and output buffers as parameters to blend_kernel() because the compiler needs to know at compile time which textures tex1Dfetch() should be sampling. Rather than passing pointers to input and output buffers as we previously did, we will pass to blend_kernel() a boolean flag dstOut that indicates which buffer to use as input and which to use as output. In the modified blend_kernel() that follows, every read now goes through tex1Dfetch(), and dstOut selects the texture it reads from:

__global__ void blend_kernel( float *dst,
                              bool dstOut ) {
    // map from threadIdx/BlockIdx to pixel position
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;

    int left = offset - 1;
    int right = offset + 1;
    if (x == 0) left++;
    if (x == DIM-1) right--;

    int top = offset - DIM;
    int bottom = offset + DIM;
    if (y == 0) top += DIM;
    if (y == DIM-1) bottom -= DIM;

    float t, l, c, r, b;
    if (dstOut) {
        // previous step lives in the buffer bound to texIn
        t = tex1Dfetch(texIn,top);
        l = tex1Dfetch(texIn,left);
        c = tex1Dfetch(texIn,offset);
        r = tex1Dfetch(texIn,right);
        b = tex1Dfetch(texIn,bottom);
    } else {
        // previous step lives in the buffer bound to texOut
        t = tex1Dfetch(texOut,top);
        l = tex1Dfetch(texOut,left);
        c = tex1Dfetch(texOut,offset);
        r = tex1Dfetch(texOut,right);
        b = tex1Dfetch(texOut,bottom);
    }
    dst[offset] = c + SPEED * (t + b + r + l - 4 * c);
}

Since the copy_const_kernel() kernel reads from our buffer that holds the heater positions and temperatures, we will need to make a similar modification there in order to read through texture memory instead of global memory:

__global__ void copy_const_kernel( float *iptr ) {
    // map from threadIdx/BlockIdx to pixel position
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;

    float c = tex1Dfetch(texConstSrc,offset);
    if (c != 0)
        iptr[offset] = c;
}

Since the signature of blend_kernel() changed to accept a flag that switches the buffers between input and output, we need a corresponding change to the anim_gpu() routine. Rather than swapping buffers, we set dstOut = !dstOut to toggle the flag after each series of calls:

void anim_gpu( DataBlock *d, int ticks ) {
    HANDLE_ERROR( cudaEventRecord( d->start, 0 ) );
    dim3 blocks(DIM/16,DIM/16);
    dim3 threads(16,16);
    CPUAnimBitmap *bitmap = d->bitmap;

    // since tex is global and bound, we have to use a flag to
    // select which is in/out per iteration
    volatile bool dstOut = true;
    for (int i=0; i<90; i++) {
        float *in, *out;
        if (dstOut) {
            in = d->dev_inSrc;
            out = d->dev_outSrc;
        } else {
            out = d->dev_inSrc;
            in = d->dev_outSrc;
        }
        copy_const_kernel<<<blocks,threads>>>( in );
        blend_kernel<<<blocks,threads>>>( out, dstOut );
        dstOut = !dstOut;
    }
    float_to_color<<<blocks,threads>>>( d->output_bitmap,
                                        d->dev_inSrc );

    HANDLE_ERROR( cudaMemcpy( bitmap->get_ptr(),
                              d->output_bitmap,
                              bitmap->image_size(),
                              cudaMemcpyDeviceToHost ) );

    HANDLE_ERROR( cudaEventRecord( d->stop, 0 ) );
    HANDLE_ERROR( cudaEventSynchronize( d->stop ) );
    float elapsedTime;
    HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime,
                                        d->start, d->stop ) );
    d->totalTime += elapsedTime;
    ++d->frames;
    printf( "Average Time per frame: %3.1f ms\n",
            d->totalTime/d->frames );
}

7KHȌQDOFKDQJHWRRXUKHDWWUDQVIHUURXWLQHLQYROYHVFOHDQLQJXSDWWKHHQGRI
WKHDSSOLFDWLRQǢVUXQ5DWKHUWKDQMXVWIUHHLQJWKHJOREDOEXIIHUVZHDOVRQHHGWR
XQELQGWH[WXUHV

130

Download from www.wowebook.com


7.3 S
L IM U E
AT INTGRHA NAST F ER

// clean up memory allocated on the GPU
void anim_exit( DataBlock *d ) {
    cudaUnbindTexture( texIn );
    cudaUnbindTexture( texOut );
    cudaUnbindTexture( texConstSrc );
    cudaFree( d->dev_inSrc );
    cudaFree( d->dev_outSrc );
    cudaFree( d->dev_constSrc );

    HANDLE_ERROR( cudaEventDestroy( d->start ) );
    HANDLE_ERROR( cudaEventDestroy( d->stop ) );
}

7.3.5 USING TWO-DIMENSIONAL TEXTURE MEMORY


Toward the beginning of this book, we mentioned how some problems have two-
dimensional domains, and therefore it can be convenient to use two-dimensional
blocks and grids at times. The same is true for texture memory. There are many
cases when having a two-dimensional memory region can be useful, a claim that
should come as no surprise to anyone familiar with multidimensional arrays in
standard C. Let’s look at how we can modify our heat transfer application to use
two-dimensional textures.

First, our texture reference declarations change. If unspecified, texture references are one-dimensional by default, so we add a dimensionality argument of 2 in order to declare two-dimensional textures.

texture<float,2> texConstSrc;
texture<float,2> texIn;
texture<float,2> texOut;

The simplification promised by converting to two-dimensional textures comes in blend_kernel(). Although we need to change our tex1Dfetch() calls to tex2D() calls, we no longer need to use the linearized offset variable to compute the set of offsets top, left, right, and bottom. When we switch to a two-dimensional texture, we can use x and y directly to address the texture.

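Furthermore, we no longer have to worry about bounds overflow when we switch to using tex2D(). If either x or y is less than zero, tex2D() will return the value at zero. Likewise, if either of these values is greater than the last valid index (width - 1 for x, height - 1 for y), tex2D() will return the value at that last index. This clamping is the default texture addressing mode. Note that in our application this behavior is ideal, but it's possible that other applications would desire other behavior.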
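The clamping rule can be checked on the host. Below is a minimal C sketch, an illustration rather than the texture hardware. fetch2d_clamp() emulates clamp-to-edge addressing on a row-major width x height array. neighbors_1d() computes the four neighbor offsets with the explicit edge checks that a linear (one-dimensional) fetch needs. Both helper names are hypothetical. The edge handling in neighbors_1d() is an assumption about what the one-dimensional version must do: each out-of-range neighbor falls back to the center cell.

```c
#include <assert.h>

/* Emulates clamp-to-edge addressing on a w x h row-major array:
   out-of-range coordinates are clamped to the nearest valid row/column. */
float fetch2d_clamp(const float *a, int w, int h, int x, int y) {
    assert(w > 0 && h > 0);
    if (x < 0) x = 0;
    if (x > w - 1) x = w - 1;
    if (y < 0) y = 0;
    if (y > h - 1) y = h - 1;
    return a[x + y * w];
}

/* Linear offsets of the four neighbors of (x, y), with the explicit edge
   checks a one-dimensional fetch requires (assumed: fall back to center). */
void neighbors_1d(int w, int h, int x, int y,
                  int *top, int *left, int *right, int *bottom) {
    int offset = x + y * w;
    *left   = (x == 0)     ? offset : offset - 1;
    *right  = (x == w - 1) ? offset : offset + 1;
    *top    = (y == 0)     ? offset : offset - w;
    *bottom = (y == h - 1) ? offset : offset + w;
}
```

The two paths agree at every cell. That is why the two-dimensional kernel can drop its boundary logic without changing the simulation's edge behavior.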

As a result of these simplifications, our kernel cleans up nicely:

__global__ void blend_kernel( float *dst,
                              bool dstOut ) {
    // map from threadIdx/BlockIdx to pixel position
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;

    float t, l, c, r, b;
    if (dstOut) {
        t = tex2D(texIn,x,y-1);
        l = tex2D(texIn,x-1,y);
        c = tex2D(texIn,x,y);
        r = tex2D(texIn,x+1,y);
        b = tex2D(texIn,x,y+1);
    } else {
        t = tex2D(texOut,x,y-1);
        l = tex2D(texOut,x-1,y);
        c = tex2D(texOut,x,y);
        r = tex2D(texOut,x+1,y);
        b = tex2D(texOut,x,y+1);
    }
    dst[offset] = c + SPEED * (t + b + r + l - 4 * c);
}

Since all of our previous calls to tex1Dfetch() need to be changed to tex2D() calls, we make the corresponding change in copy_const_kernel(). As in blend_kernel(), we no longer need to use offset to address the texture; we simply use x and y to address the constant source:

__global__ void copy_const_kernel( float *iptr ) {
    // map from threadIdx/BlockIdx to pixel position
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;

    float c = tex2D(texConstSrc,x,y);
    if (c != 0)
        iptr[offset] = c;
}

The final change to the one-dimensional texture version of our heat transfer simulation is along the same lines as our previous changes. Specifically, in main(), we need to change our texture binding calls to instruct the runtime that the buffer we plan to use will be treated as a two-dimensional texture, not a one-dimensional one:

HANDLE_ERROR( cudaMalloc( (void**)&data.dev_inSrc,
                          imageSize ) );
HANDLE_ERROR( cudaMalloc( (void**)&data.dev_outSrc,
                          imageSize ) );
HANDLE_ERROR( cudaMalloc( (void**)&data.dev_constSrc,
                          imageSize ) );

cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
HANDLE_ERROR( cudaBindTexture2D( NULL, texConstSrc,
                                 data.dev_constSrc,
                                 desc, DIM, DIM,
                                 sizeof(float) * DIM ) );

HANDLE_ERROR( cudaBindTexture2D( NULL, texIn,
                                 data.dev_inSrc,
                                 desc, DIM, DIM,
                                 sizeof(float) * DIM ) );

HANDLE_ERROR( cudaBindTexture2D( NULL, texOut,
                                 data.dev_outSrc,
                                 desc, DIM, DIM,
                                 sizeof(float) * DIM ) );

As with the nontexture and one-dimensional texture versions, we begin by allocating storage for our input arrays. We deviate from the one-dimensional example because the CUDA runtime requires that we provide a cudaChannelFormatDesc when we bind two-dimensional textures. The previous listing includes a declaration of a channel format descriptor. In our case, we can accept the default parameters and simply need to specify that we require a floating-point descriptor. We then bind the three input buffers as two-dimensional textures using cudaBindTexture2D(). We pass it the dimensions of the texture (DIM x DIM), the channel format descriptor (desc), and the row pitch in bytes (sizeof(float) * DIM), which here is simply the width of one row of floats. The rest of main() remains the same:
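The last argument to cudaBindTexture2D() is the pitch: the distance in bytes from the start of one row to the start of the next. The C sketch below shows the address arithmetic a pitched layout implies; pitched_offset_bytes() is a hypothetical helper, not a CUDA API. It checks that, with the pitch sizeof(float) * DIM used here, element (x, y) lands at the same byte as the linear offset x + y * DIM from the one-dimensional version.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of element (x, y) in a 2D array whose rows start `pitch`
   bytes apart and whose elements are `elem_size` bytes each. */
size_t pitched_offset_bytes(size_t x, size_t y, size_t pitch, size_t elem_size) {
    assert(elem_size > 0 && pitch >= elem_size);
    return y * pitch + x * elem_size;
}
```

Rows might be padded, as can happen with allocations made by cudaMallocPitch(). In that case the pitch exceeds the row width, and only the pitched formula remains valid.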

int main( void ) {
    DataBlock data;
    CPUAnimBitmap bitmap( DIM, DIM, &data );
    data.bitmap = &bitmap;
    data.totalTime = 0;
    data.frames = 0;
    HANDLE_ERROR( cudaEventCreate( &data.start ) );
    HANDLE_ERROR( cudaEventCreate( &data.stop ) );

    int imageSize = bitmap.image_size();

    HANDLE_ERROR( cudaMalloc( (void**)&data.output_bitmap,
                              imageSize ) );

    // assume float == 4 chars in size (i.e., rgba)
    HANDLE_ERROR( cudaMalloc( (void**)&data.dev_inSrc,
                              imageSize ) );
    HANDLE_ERROR( cudaMalloc( (void**)&data.dev_outSrc,
                              imageSize ) );
    HANDLE_ERROR( cudaMalloc( (void**)&data.dev_constSrc,
                              imageSize ) );

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    HANDLE_ERROR( cudaBindTexture2D( NULL, texConstSrc,
                                     data.dev_constSrc,
                                     desc, DIM, DIM,
                                     sizeof(float) * DIM ) );

    HANDLE_ERROR( cudaBindTexture2D( NULL, texIn,
                                     data.dev_inSrc,
                                     desc, DIM, DIM,
                                     sizeof(float) * DIM ) );

    HANDLE_ERROR( cudaBindTexture2D( NULL, texOut,
                                     data.dev_outSrc,
                                     desc, DIM, DIM,
                                     sizeof(float) * DIM ) );

    // initialize the constant data
    float *temp = (float*)malloc( imageSize );
    for (int i=0; i<DIM*DIM; i++) {
        temp[i] = 0;
        int x = i % DIM;
        int y = i / DIM;
        if ((x>300) && (x<600) && (y>310) && (y<601))
            temp[i] = MAX_TEMP;
    }
    temp[DIM*100+100] = (MAX_TEMP + MIN_TEMP)/2;
    temp[DIM*700+100] = MIN_TEMP;
    temp[DIM*300+300] = MIN_TEMP;
    temp[DIM*200+700] = MIN_TEMP;
    for (int y=800; y<900; y++) {
        for (int x=400; x<500; x++) {
            temp[x+y*DIM] = MIN_TEMP;
        }
    }
    HANDLE_ERROR( cudaMemcpy( data.dev_constSrc, temp,
                              imageSize,
                              cudaMemcpyHostToDevice ) );

    // initialize the input data
    for (int y=800; y<DIM; y++) {
        for (int x=0; x<200; x++) {
            temp[x+y*DIM] = MAX_TEMP;
        }
    }
    HANDLE_ERROR( cudaMemcpy( data.dev_inSrc, temp,
                              imageSize,
                              cudaMemcpyHostToDevice ) );
    free( temp );

    bitmap.anim_and_exit( (void (*)(void*,int))anim_gpu,
                          (void (*)(void*))anim_exit );
}

Although we needed different functions to instruct the runtime to bind one-dimensional or two-dimensional textures, we use the same routine to unbind the texture: cudaUnbindTexture(). Because of this, our cleanup routine can remain unchanged:

// clean up memory allocated on the GPU
void anim_exit( DataBlock *d ) {
    cudaUnbindTexture( texIn );
    cudaUnbindTexture( texOut );
    cudaUnbindTexture( texConstSrc );
    cudaFree( d->dev_inSrc );
    cudaFree( d->dev_outSrc );
    cudaFree( d->dev_constSrc );

    HANDLE_ERROR( cudaEventDestroy( d->start ) );
    HANDLE_ERROR( cudaEventDestroy( d->stop ) );
}

The version of our heat transfer simulation that uses two-dimensional textures has essentially identical performance characteristics to the version that uses one-dimensional textures. So from a performance standpoint, the decision between one- and two-dimensional textures is likely to be inconsequential. For our particular application, the code is a little simpler when using two-dimensional textures because we happen to be simulating a two-dimensional domain. But in general, since this is not always the case, we suggest you make the decision between one- and two-dimensional textures on a case-by-case basis.

Chapter Review
As we saw in the previous chapter with constant memory, some of the benefit of texture memory comes as the result of on-chip caching. This is especially noticeable in applications such as our heat transfer simulation, that is, applications that have some spatial coherence to their data access patterns. We saw how either one- or two-dimensional textures can be used, both having similar performance characteristics. As with a block or grid shape, the choice of one- or two-dimensional texture is largely one of convenience. The code became somewhat cleaner when we switched to two-dimensional textures, and the borders are handled automatically. For those reasons we would probably advocate the use of a 2D texture in our heat transfer application. But as you saw, it will work fine either way.

Texture memory can provide additional speedups if we utilize some of the conversions that texture samplers can perform automatically. Examples include unpacking packed data into separate variables and converting 8- and 16-bit integers to normalized floating-point numbers. We didn't explore either of these capabilities in the heat transfer application, but they might be useful to you!
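As an illustration of the second conversion, consider a texture read in normalized-float mode (cudaReadModeNormalizedFloat in the CUDA runtime). Such a read returns unsigned 8- and 16-bit texels as floats in [0.0, 1.0], dividing by the largest representable value. The C sketch below reproduces only that arithmetic on the host. unorm_to_float() is a hypothetical helper. Signed formats, which map to [-1.0, 1.0], are left out.

```c
#include <assert.h>

/* Maps an unsigned integer texel with `bits` bits (8 or 16) to [0.0, 1.0],
   the range a normalized-float texture read returns for unsigned formats. */
float unorm_to_float(unsigned v, unsigned bits) {
    assert(bits == 8 || bits == 16);
    unsigned max = (1u << bits) - 1u;   /* 255 or 65535 */
    assert(v <= max);
    return (float)v / (float)max;
}
```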


Chapter 8
Graphics Interoperability

Since this book has focused on general-purpose computation, for the most part we've ignored that GPUs contain some special-purpose components as well. The GPU owes its success to its ability to perform complex rendering tasks in real time, freeing the rest of the system to concentrate on other work. This leads us to the obvious question: Can we use the GPU for both rendering and general-purpose computation in the same application? What if the images we want to render rely on the results of our computations? Or what if we want to take the frame we've rendered and perform some image processing or statistics computations on it?

Fortunately, not only is this interaction between general-purpose computation and rendering modes possible, but it's fairly easy to accomplish given what you already know. CUDA C applications can seamlessly interoperate with either of the two most popular real-time rendering APIs, OpenGL and DirectX. This chapter will look at the mechanics by which you can enable this functionality.

The examples in this chapter deviate somewhat from the precedents we've set in previous chapters. In particular, this chapter assumes a significant amount about your background with other technologies. Specifically, we have included a considerable amount of OpenGL and GLUT code in these examples, almost none of which we will explain in great depth. There are many superb resources for learning graphics APIs, both online and in bookstores, but these topics are well beyond the intended scope of this book. Rather, this chapter intends to focus on CUDA C and the facilities it offers to incorporate it into your graphics applications. If you are unfamiliar with OpenGL or DirectX, you are unlikely to derive much benefit from this chapter and may want to skip to the next.

Chapter Objectives
Through the course of this chapter, you will accomplish the following:

• You will learn what graphics interoperability is and why you might use it.

• You will learn how to set up a CUDA device for graphics interoperability.

• You will learn how to share data between your CUDA C kernels and OpenGL rendering.

Graphics Interoperation
To demonstrate the mechanics of interoperation between graphics and CUDA C, we'll write an application that works in two steps. The first step uses a CUDA C kernel to generate image data. In the second step, the application passes this data to the OpenGL driver to render. To accomplish this, we will use much of the CUDA C we have seen in previous chapters along with some OpenGL and GLUT calls.

To start our application, we include the relevant GLUT and CUDA headers in order to ensure the correct functions and enumerations are defined. We also define the size of the window into which our application plans to render. At 512 x 512 pixels, we will do relatively small drawings.

#define GL_GLEXT_PROTOTYPES
#include "GL/glut.h"
#include "cuda.h"
#include "cuda_gl_interop.h"
#include "../common/book.h"
#include "../common/cpu_bitmap.h"

#define DIM 512

Additionally, we declare two global variables that will store handles to the data we intend to share between OpenGL and CUDA. We will see momentarily how we use these two variables, but they will store different handles to the same buffer. We need two separate variables because OpenGL and CUDA each have a different "name" for the buffer. The variable bufferObj will be OpenGL's name for the data, and the variable resource will be the CUDA C name for it.

GLuint bufferObj;
cudaGraphicsResource *resource;

Now let's take a look at the actual application. The first thing we do is select a CUDA device on which to run our application. On many systems this is not a complicated process, since they will often contain only a single CUDA-enabled GPU. However, an increasing number of systems contain more than one CUDA-enabled GPU, so we need a method to choose one. Fortunately, the CUDA runtime provides such a facility:

int main( int argc, char **argv ) {
    cudaDeviceProp prop;
    int dev;

    memset( &prop, 0, sizeof( cudaDeviceProp ) );
    prop.major = 1;
    prop.minor = 0;
    HANDLE_ERROR( cudaChooseDevice( &dev, &prop ) );

You may recall that we saw cudaChooseDevice() in Chapter 3, but since it was something of an ancillary point, we'll review it again now. Essentially, this code tells the runtime to select any GPU that has a compute capability of version 1.0 or better. It accomplishes this by first creating and clearing a cudaDeviceProp structure and then setting its major version to 1 and minor version to 0. It passes this information to cudaChooseDevice(), which instructs the runtime to select a GPU in the system that satisfies the constraints specified by the cudaDeviceProp structure. In the next chapter, we will look more at what is meant by a GPU's compute capability. For now it suffices to say that it roughly indicates the features a GPU supports. All CUDA-capable GPUs have at least compute capability 1.0, so the net effect of this call is that the runtime will select any CUDA-capable device and return an identifier for this device in the variable dev. There is no guarantee that this device is the best or fastest GPU. Nor is there a guarantee that the device will be the same GPU from version to version of the CUDA runtime.

If the result of device selection is so seemingly underwhelming, why do we bother with all this effort to fill a cudaDeviceProp structure and call cudaChooseDevice() to get a valid device ID? Furthermore, we never hassled with this tomfoolery before, so why now? These are good questions. It turns out that we need to know the CUDA device ID so that we can tell the CUDA runtime that we intend to use the device for CUDA and OpenGL. We achieve this with a call to cudaGLSetGLDevice(), passing the device ID dev we obtained from cudaChooseDevice():

HANDLE_ERROR( cudaGLSetGLDevice( dev ) );

After the CUDA runtime initialization, we can proceed to initialize the OpenGL driver by calling our GL Utility Toolkit (GLUT) setup functions. This sequence of calls should look relatively familiar if you've used GLUT before:

// these GLUT calls need to be made before the other GL calls
glutInit( &argc, argv );
glutInitDisplayMode( GLUT_DOUBLE | GLUT_RGBA );
glutInitWindowSize( DIM, DIM );
glutCreateWindow( "bitmap" );

At this point in main(), we've prepared our CUDA runtime to play nicely with the OpenGL driver by calling cudaGLSetGLDevice(). Then we initialized GLUT and created a window named "bitmap" in which to draw our results. Now we can get on to the actual OpenGL interoperation.

Shared data buffers are the key component to interoperation between CUDA C kernels and OpenGL rendering. To pass data between OpenGL and CUDA, we will first need to create a buffer that can be used with both APIs. We start this process by creating a pixel buffer object in OpenGL and storing the handle in our global variable GLuint bufferObj:

glGenBuffers( 1, &bufferObj );
glBindBuffer( GL_PIXEL_UNPACK_BUFFER_ARB, bufferObj );
glBufferData( GL_PIXEL_UNPACK_BUFFER_ARB, DIM * DIM * 4,
NULL, GL_DYNAMIC_DRAW_ARB );

If you have never used a pixel buffer object (PBO) in OpenGL, the typical recipe has three steps. First, we generate a buffer handle with glGenBuffers(). Then, we bind the handle to a pixel buffer with glBindBuffer(). Finally, we request the OpenGL driver to allocate a buffer for us with glBufferData(). In this example, we request a buffer to hold DIM x DIM 32-bit values (DIM * DIM * 4 bytes). We use the enumerant GL_DYNAMIC_DRAW_ARB to indicate that the buffer will be modified repeatedly by the application. Since we have no data to preload the buffer with, we pass NULL as the penultimate argument to glBufferData().

All that remains in our quest to set up graphics interoperability is notifying the CUDA runtime that we intend to share the OpenGL buffer named bufferObj with CUDA. We do this by registering bufferObj with the CUDA runtime as a graphics resource.

HANDLE_ERROR(
cudaGraphicsGLRegisterBuffer( &resource,
bufferObj,
cudaGraphicsMapFlagsNone )
);

We specify to the CUDA runtime that we intend to use the OpenGL PBO bufferObj with both OpenGL and CUDA by calling cudaGraphicsGLRegisterBuffer(). The CUDA runtime returns a CUDA-friendly handle to the buffer in the variable resource. This handle will be used to refer to bufferObj in subsequent calls to the CUDA runtime.

The flag cudaGraphicsMapFlagsNone specifies that there is no particular behavior of this buffer that we want to specify. We do have the option to specify with cudaGraphicsMapFlagsReadOnly that the buffer will be read-only. We could also use cudaGraphicsMapFlagsWriteDiscard to specify that the previous contents will be discarded, making the buffer essentially write-only. These flags allow the CUDA and OpenGL drivers to optimize the hardware settings for buffers with restricted access patterns, although they are not required to be set.

Effectively, the call to glBufferData() requests the OpenGL driver to allocate a buffer large enough to hold DIM x DIM 32-bit values. In subsequent OpenGL calls, we'll refer to this buffer with the handle bufferObj, while in CUDA runtime calls, we'll refer to this buffer with the pointer resource. Since we would like to read from and write to this buffer from our CUDA C kernels, we will need more than just a handle to the object. We will need an actual address in device memory that can be passed to our kernel. We achieve this by instructing the CUDA runtime to map the shared resource and then by requesting a pointer to the mapped resource.

uchar4* devPtr;
size_t size;
HANDLE_ERROR( cudaGraphicsMapResources( 1, &resource, NULL ) );
HANDLE_ERROR(
cudaGraphicsResourceGetMappedPointer( (void**)&devPtr,
&size,
resource )
);

We can then use devPtr as we would use any device pointer, except that the data can also be used by OpenGL as a pixel source. After all these setup shenanigans, the rest of main() proceeds as follows. First, we launch our kernel, passing it the pointer to our shared buffer. This kernel, the code of which we have not seen yet, generates image data to be rendered. Next, we unmap our shared resource. This call is important to make prior to performing rendering tasks because it provides synchronization between the CUDA and graphics portions of the application. Specifically, it implies that all CUDA operations performed prior to the call to cudaGraphicsUnmapResources() will complete before ensuing graphics calls begin.

Lastly, we register our keyboard and display callback functions with GLUT (key_func and draw_func), and we relinquish control to the GLUT rendering loop with glutMainLoop():

dim3 grids(DIM/16,DIM/16);
dim3 threads(16,16);
kernel<<<grids,threads>>>( devPtr );

HANDLE_ERROR( cudaGraphicsUnmapResources( 1, &resource, NULL ) );

// set up GLUT and kick off main loop
glutKeyboardFunc( key_func );
glutDisplayFunc( draw_func );
glutMainLoop();
}

The remainder of the application consists of the three functions we just highlighted: kernel(), key_func(), and draw_func(). So let's take a look at those.

The kernel function takes a device pointer and generates image data. In the following example, we're using a kernel inspired by the ripple example in Chapter 5:

// based on ripple code, but uses uchar4, which is the
// type of data graphics interop uses
__global__ void kernel( uchar4 *ptr ) {
    // map from threadIdx/BlockIdx to pixel position
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;

    // now calculate the value at that position
    float fx = x/(float)DIM - 0.5f;
    float fy = y/(float)DIM - 0.5f;
    unsigned char green = 128 + 127 *
                          sin( abs(fx*100) - abs(fy*100) );

    // accessing uchar4 vs. unsigned char*
    ptr[offset].x = 0;
    ptr[offset].y = green;
    ptr[offset].z = 0;
    ptr[offset].w = 255;
}

Many familiar concepts are at work here. The method for turning thread and block indices into x- and y-coordinates and a linear offset has been examined several times. We then perform some reasonably arbitrary computations to determine the color for the pixel at that (x,y) location, and we store those values to memory. We're again using CUDA C to procedurally generate an image on the GPU. The important thing to realize is that this image will then be handed directly to OpenGL for rendering without the CPU ever getting involved. By contrast, in the ripple example of Chapter 5, we generated image data on the GPU very much like this, but our application then copied the buffer back to the CPU for display.
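The index mapping the kernel relies on can be checked on the host. The sketch below is a minimal C emulation, an illustration rather than how CUDA schedules threads. It replaces the parallel launch with nested loops over blockIdx and threadIdx and counts how often each linear offset is produced. With a (DIM/16) x (DIM/16) grid of 16 x 16 blocks, every pixel should be written exactly once. count_offsets() is a hypothetical helper name.

```c
#include <assert.h>

/* Emulates a launch of (gx x gy) blocks of (bdx x bdy) threads with nested
   loops and counts how often each linear offset is produced in hits[]. */
void count_offsets(int gx, int gy, int bdx, int bdy, int *hits) {
    assert(gx > 0 && gy > 0 && bdx > 0 && bdy > 0);
    for (int blockY = 0; blockY < gy; blockY++)
        for (int blockX = 0; blockX < gx; blockX++)
            for (int threadY = 0; threadY < bdy; threadY++)
                for (int threadX = 0; threadX < bdx; threadX++) {
                    int x = threadX + blockX * bdx;   /* threadIdx.x + blockIdx.x * blockDim.x */
                    int y = threadY + blockY * bdy;   /* threadIdx.y + blockIdx.y * blockDim.y */
                    int offset = x + y * bdx * gx;    /* x + y * blockDim.x * gridDim.x */
                    hits[offset]++;
                }
}
```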

So how do we draw the CUDA-generated buffer using OpenGL? Well, if you recall the setup we performed in main(), you'll remember the following:
glBindBuffer( GL_PIXEL_UNPACK_BUFFER_ARB, bufferObj );

This call bound the shared buffer as a pixel source for the OpenGL driver to use in all subsequent calls to glDrawPixels(). Essentially, this means that a call to glDrawPixels() is all that we need in order to render the image data our CUDA C kernel generated. Consequently, the following is all that our draw_func() needs to do:

static void draw_func( void ) {
    glDrawPixels( DIM, DIM, GL_RGBA, GL_UNSIGNED_BYTE, 0 );
    glutSwapBuffers();
}

It's possible you've seen glDrawPixels() with a buffer pointer as the last argument. The OpenGL driver will copy from this buffer if no buffer is bound as a GL_PIXEL_UNPACK_BUFFER_ARB source. However, our data is already on the GPU, and we have bound our shared buffer as the GL_PIXEL_UNPACK_BUFFER_ARB source. So this last parameter instead becomes an offset into the bound buffer. Because we want to render the entire buffer, this offset is zero for our application.

The last component to this example seems somewhat anticlimactic, but we've decided to give our users a method to exit the application. In this vein, our key_func() callback responds only to the Esc key and uses this as a signal to clean up and exit:

static void key_func( unsigned char key, int x, int y ) {
    switch (key) {
        case 27:
            // clean up OpenGL and CUDA
            HANDLE_ERROR( cudaGraphicsUnregisterResource( resource ) );
            glBindBuffer( GL_PIXEL_UNPACK_BUFFER_ARB, 0 );
            glDeleteBuffers( 1, &bufferObj );
            exit(0);
    }
}

[Figure 8.1: A screenshot of the hypnotic graphics interoperation example]

When run, this example draws a mesmerizing picture in "NVIDIA Green" and black, shown in Figure 8.1. Try using it to hypnotize your friends (or enemies).

GPU Ripple with Graphics Interoperability
In the section "Graphics Interoperation," we referred to Chapter 5's GPU ripple example a few times. If you recall, that application created a CPUAnimBitmap and passed it a function to be called whenever a frame needed to be generated:

int main( void ) {
    DataBlock data;
    CPUAnimBitmap bitmap( DIM, DIM, &data );
    data.bitmap = &bitmap;

    HANDLE_ERROR( cudaMalloc( (void**)&data.dev_bitmap,
                              bitmap.image_size() ) );

    bitmap.anim_and_exit( (void (*)(void*,int))generate_frame,
                          (void (*)(void*))cleanup );
}

With the techniques we've learned in the previous section, we intend to create a GPUAnimBitmap structure. This structure will serve the same purpose as the CPUAnimBitmap, but in this improved version, the CUDA and OpenGL components will cooperate without CPU intervention. When we're done, the application will use a GPUAnimBitmap so that main() will become simply as follows:

int main( void ) {
    GPUAnimBitmap bitmap( DIM, DIM, NULL );

    bitmap.anim_and_exit(
        (void (*)(uchar4*,void*,int))generate_frame, NULL );
}

The GPUAnimBitmap structure uses the same calls we just examined in the section "Graphics Interoperation." However, now these calls will be abstracted away in a GPUAnimBitmap structure so that future examples (and potentially your own applications) will be cleaner.

THE GPUANIMBITMAP STRUCTURE
Several of the data members for our GPUAnimBitmap will look familiar to you from the section "Graphics Interoperation":

struct GPUAnimBitmap {
    GLuint bufferObj;
    cudaGraphicsResource *resource;
    int width, height;
    void *dataBlock;
    void (*fAnim)(uchar4*,void*,int);
    void (*animExit)(void*);
    void (*clickDrag)(void*,int,int,int,int);
    int dragStartX, dragStartY;

We know that OpenGL and the CUDA runtime will have different names for our GPU buffer. We also know that we will need to refer to both of these names, depending on whether we are making OpenGL or CUDA C calls. Therefore, our structure will store both OpenGL's bufferObj name and the CUDA runtime's resource name. Since we are dealing with a bitmap image that we intend to display, we know that the image will have a width and height to it.

To allow users of our GPUAnimBitmap to register for certain callback events, we will also store a void* pointer to arbitrary user data in dataBlock. Our class will never look at this data but will simply pass it back to any registered callback functions. The callbacks that a user may register are stored in fAnim, animExit, and clickDrag. The function fAnim() is called from the idle callback that GPUAnimBitmap registers with glutIdleFunc(). This function is responsible for producing the image data that will be rendered in the animation. The function animExit() will be called once, when the animation exits. This is where the user should implement cleanup code that needs to be executed when the animation ends. Finally, clickDrag(), an optional function, implements the user's response to mouse click/drag events. If the user registers this function, it gets called after every sequence of mouse button press, drag, and release events. The location of the initial mouse click in this sequence is stored in (dragStartX, dragStartY). That way the start and end points of the click/drag event can be passed to the user when the mouse button is released. This can be used to implement interactive animations that will impress your friends.
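The click/drag bookkeeping described above amounts to a small state machine. The C sketch below is a hypothetical stand-in, not GPUAnimBitmap's actual mouse handler. DragState keeps only the members involved (dataBlock, clickDrag, dragStartX, dragStartY). on_mouse_button() takes an assumed pressed flag in place of GLUT's button and state arguments. record_drag() is an example callback with the same signature as clickDrag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of GPUAnimBitmap's click/drag state. */
typedef struct {
    void *dataBlock;
    void (*clickDrag)(void *, int, int, int, int);
    int dragStartX, dragStartY;
} DragState;

/* pressed != 0 for a button press, 0 for a release (assumed encoding).
   A press records the start; a release reports start and end points. */
void on_mouse_button(DragState *s, int pressed, int x, int y) {
    assert(s != NULL);
    if (pressed) {
        s->dragStartX = x;
        s->dragStartY = y;
    } else if (s->clickDrag != NULL) {
        s->clickDrag(s->dataBlock, s->dragStartX, s->dragStartY, x, y);
    }
}

/* Example callback: stores the drag's endpoints into an int[4] in dataBlock. */
void record_drag(void *data, int x0, int y0, int x1, int y1) {
    int *out = (int *)data;
    out[0] = x0; out[1] = y0; out[2] = x1; out[3] = y1;
}
```

Because clickDrag is optional, a release with no registered callback simply does nothing.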

Initializing a GPUAnimBitmap follows the same sequence of code that we saw in our previous example. After stashing away arguments in the appropriate structure members, we start by querying the CUDA runtime for a suitable CUDA device:

GPUAnimBitmap( int w, int h, void *d ) {
    width = w;
    height = h;
    dataBlock = d;
    clickDrag = NULL;

    // first, find a CUDA device and set it to graphic interop
    cudaDeviceProp prop;
    int dev;
    memset( &prop, 0, sizeof( cudaDeviceProp ) );
    prop.major = 1;
    prop.minor = 0;
    HANDLE_ERROR( cudaChooseDevice( &dev, &prop ) );

After finding a compatible CUDA device, we make the important cudaGLSetGLDevice() call to the CUDA runtime in order to notify it that we intend to use dev as a device for interoperation with OpenGL:

cudaGLSetGLDevice( dev );

Since our framework uses GLUT to create a windowed rendering environment, we need to initialize GLUT. This is unfortunately a bit awkward, since glutInit() wants command-line arguments to pass to the windowing system. Since we have none we want to pass, we would like to simply specify zero command-line arguments. Unfortunately, some versions of GLUT have a bug that causes applications to crash when zero arguments are given. So we trick GLUT into thinking that we're passing an argument, and as a result, life is good.

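int c=1;
// a modifiable buffer: binding a string literal to char* is not valid C++
static char arg0[] = "name";
char *foo = arg0;
glutInit( &c, &foo );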

We continue initializing GLUT exactly as we did in the previous example. We create a window in which to render, specifying a title with the string "bitmap." If you'd like to name your window something more interesting, be our guest.

glutInitDisplayMode( GLUT_DOUBLE | GLUT_RGBA );
glutInitWindowSize( width, height );
glutCreateWindow( "bitmap" );

Next, we ask the OpenGL driver to allocate a buffer handle. We immediately bind it to the GL_PIXEL_UNPACK_BUFFER_ARB target so that future calls to glDrawPixels() take their pixel data from our interop buffer:

glGenBuffers( 1, &bufferObj );
glBindBuffer( GL_PIXEL_UNPACK_BUFFER_ARB, bufferObj );

Last, but most certainly not least, we request that the OpenGL driver allocate a region of GPU memory for us. Once this is done, we inform the CUDA runtime of this buffer and request a CUDA C name for this buffer by registering bufferObj with cudaGraphicsGLRegisterBuffer():

glBufferData( GL_PIXEL_UNPACK_BUFFER_ARB, width * height * 4,
              NULL, GL_DYNAMIC_DRAW_ARB );

HANDLE_ERROR(
    cudaGraphicsGLRegisterBuffer( &resource,
                                  bufferObj,
                                  cudaGraphicsMapFlagsNone ) );
}

With the GPUAnimBitmap set up, the only remaining concern is exactly how we perform the rendering. The meat of the rendering will be done in our GLUT idle function, idle_func(), which is registered with glutIdleFunc(). This function will essentially do three things. First, it maps our shared buffer and retrieves a GPU pointer for this buffer:

// static method used for GLUT callbacks
static void idle_func( void ) {
    static int ticks = 1;
    GPUAnimBitmap* bitmap = *(get_bitmap_ptr());
    uchar4* devPtr;
    size_t size;

    HANDLE_ERROR(
        cudaGraphicsMapResources( 1, &(bitmap->resource), NULL )
    );
    HANDLE_ERROR(
        cudaGraphicsResourceGetMappedPointer( (void**)&devPtr,
                                              &size,
                                              bitmap->resource )
    );

Second, it calls the user-specified function fAnim(), which presumably will launch a CUDA C kernel to fill the buffer at devPtr with image data:

bitmap->fAnim( devPtr, bitmap->dataBlock, ticks++ );

And lastly, it unmaps the shared resource, which releases the buffer for use by the OpenGL driver in rendering. This rendering will be triggered by a call to glutPostRedisplay():

HANDLE_ERROR(
cudaGraphicsUnmapResources( 1,
&(bitmap->resource),
NULL ) );

glutPostRedisplay();
}
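Stripped of the CUDA and GLUT calls, idle_func() is a dispatcher. It hands the mapped pointer, the user's dataBlock, and a tick counter starting at 1 to fAnim(). The host-side C sketch below models just that control flow. AnimState, idle_step(), and record_tick() are hypothetical names. The pixel pointer is passed as void* instead of uchar4*, and the map and unmap steps appear only as comments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of GPUAnimBitmap: the user data and the frame callback. */
typedef struct {
    void *dataBlock;
    void (*fAnim)(void *pixels, void *dataBlock, int ticks);
} AnimState;

/* Models one idle event. As in idle_func(), the tick counter is a
   function-level static, so it starts at 1 and is shared by all callers. */
void idle_step(AnimState *s, void *pixels) {
    static int ticks = 1;
    assert(s != NULL && s->fAnim != NULL);
    /* map the shared resource and obtain the device pointer here */
    s->fAnim(pixels, s->dataBlock, ticks++);
    /* unmap the resource and call glutPostRedisplay() here */
}

/* Example callback: stores the tick value it was given into *dataBlock. */
void record_tick(void *pixels, void *dataBlock, int ticks) {
    (void)pixels;
    *(int *)dataBlock = ticks;
}
```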

The remainder of the GPUAnimBitmap structure consists of important but somewhat tangential infrastructure code. If you have an interest in it, you should by all means examine it. But we feel that you'll be able to proceed successfully even if you lack the time or interest to digest the rest of the code in GPUAnimBitmap.

GPU RIPPLE REDUX
Now that we have a GPU version of CPUAnimBitmap, we can proceed to retrofit our GPU ripple application to perform its animation entirely on the GPU. To begin, we will include gpu_anim.h, the home of our implementation of GPUAnimBitmap. We also include nearly the same kernel as we examined in Chapter 5:

#include "../common/book.h"
#include "../common/gpu_anim.h"

#define DIM 1024

__global__ void kernel( uchar4 *ptr, int ticks ) {
    // map from threadIdx/BlockIdx to pixel position
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;

    // now calculate the value at that position
    float fx = x - DIM/2;
    float fy = y - DIM/2;
    float d = sqrtf( fx * fx + fy * fy );
    unsigned char grey = (unsigned char)(128.0f + 127.0f *
                                         cos(d/10.0f - ticks/7.0f) /
                                         (d/10.0f + 1.0f));
    ptr[offset].x = grey;
    ptr[offset].y = grey;
    ptr[offset].z = grey;
    ptr[offset].w = 255;
}

The one and only change from the Chapter 5 kernel is that ptr is now a uchar4* and each pixel is written through its x, y, z, and w fields. This change is needed because OpenGL interoperation requires that our shared surfaces be "graphics friendly." Real-time rendering typically uses arrays of four-component (red/green/blue/alpha) data elements. As a result, our target buffer is no longer simply an array of unsigned char as it previously was; it's now required to be an array of type uchar4. In reality, we treated our buffer in Chapter 5 as a four-component buffer, so we always indexed it with ptr[offset*4+k], where k indicates the component from 0 to 3. But now, the four-component nature of the data is made explicit with the switch to a uchar4 type.
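The relationship between the two indexing styles can be shown on the host. The C sketch below uses a hypothetical stand-in struct, uchar4_host, with the same four unsigned char fields as CUDA's uchar4 (the real type is also declared with 4-byte alignment). The sketch assumes the struct has no padding, which holds on mainstream compilers. Under that assumption, ptr[offset].y and the byte at ptr[offset*4+1] name the same memory. put_pixel() is likewise a hypothetical helper.

```c
#include <assert.h>

/* Host stand-in for CUDA's built-in uchar4 (four unsigned chars x, y, z, w). */
typedef struct { unsigned char x, y, z, w; } uchar4_host;

/* Writes one RGBA pixel through the struct view, like ptr[offset].x = ... */
void put_pixel(uchar4_host *ptr, int offset,
               unsigned char r, unsigned char g, unsigned char b, unsigned char a) {
    assert(offset >= 0);
    ptr[offset].x = r;
    ptr[offset].y = g;
    ptr[offset].z = b;
    ptr[offset].w = a;
}
```

Reading the same buffer through an unsigned char* then shows the components at byte offsets offset*4 + 0 through offset*4 + 3, exactly the layout the Chapter 5 kernel wrote by hand.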

Since kernel() is a CUDA C function that generates image data, all that remains is writing a host function that will be used as a callback in the idle_func() member of GPUAnimBitmap. For our current application, all this function does is launch the CUDA C kernel:

void generate_frame( uchar4 *pixels, void*, int ticks ) {
    dim3 grids(DIM/16,DIM/16);
    dim3 threads(16,16);
    kernel<<<grids,threads>>>( pixels, ticks );
}

That's basically everything we need, since all of the heavy lifting was done in the GPUAnimBitmap structure. To get this party started, we just create a GPUAnimBitmap and register our animation callback function, generate_frame():

int main( void ) {
    GPUAnimBitmap bitmap( DIM, DIM, NULL );

    bitmap.anim_and_exit(
        (void (*)(uchar4*,void*,int))generate_frame, NULL );
}

Heat Transfer with Graphics Interop
So, what has been the point of doing all of this? Look at the internals of the CPUAnimBitmap, the structure we used for previous animation examples. You would see that it works almost exactly like the rendering code in the section "Graphics Interoperation."

Almost.

The key difference between the CPUAnimBitmap and the previous example is buried in the call to glDrawPixels():

glDrawPixels( bitmap->x,
bitmap->y,
GL_RGBA,
GL_UNSIGNED_BYTE,
bitmap->pixels );

We remarked in the first example of this chapter that you may have previously seen calls to glDrawPixels() with a buffer pointer as the last argument. Well, if you hadn't before, you have now. This call in the Draw() routine of CPUAnimBitmap triggers a copy of the CPU buffer in bitmap->pixels to the GPU for rendering. To do this, the CPU needs to stop what it's doing and initiate a copy onto the GPU for every frame. This requires synchronization between the CPU and GPU. It also adds latency to initiate and complete a transfer over the PCI Express bus. Since the call to glDrawPixels() expects a host pointer in the last argument, this also means something more. After generating a frame of image data with a CUDA C kernel, our Chapter 5 ripple application needed to copy the frame from the GPU to the CPU with a cudaMemcpy():

void generate_frame( DataBlock *d, int ticks ) {
    dim3 grids(DIM/16,DIM/16);
    dim3 threads(16,16);
    kernel<<<grids,threads>>>( d->dev_bitmap, ticks );

    HANDLE_ERROR( cudaMemcpy( d->bitmap->get_ptr(),
                              d->dev_bitmap,
                              d->bitmap->image_size(),
                              cudaMemcpyDeviceToHost ) );
}

Taken together, these facts mean that our original GPU ripple application was more than a little silly. We used CUDA C to compute image values for our rendering in each frame. But after the computations were done, we copied the buffer to the CPU, which then copied the buffer back to the GPU for display. This means that we introduced unnecessary data transfers between the host and the device that stood between us and maximum performance. Let's revisit a compute-intensive animation application that might see its performance improve by migrating it to use graphics interoperation for its rendering.
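To put a number on those unnecessary transfers, the C sketch below computes the bytes moved per frame by the copy-back path for an RGBA frame at 4 bytes per pixel. That path consists of one device-to-host cudaMemcpy() plus one host-to-device upload inside glDrawPixels(). frame_bytes() and round_trip_bytes() are hypothetical helpers, and the arithmetic ignores any driver-side staging copies.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes in one width x height RGBA frame at 4 bytes (one uchar4) per pixel. */
size_t frame_bytes(size_t width, size_t height) {
    assert(width > 0 && height > 0);
    return width * height * 4;
}

/* Bytes moved per frame when the frame is copied GPU -> CPU with
   cudaMemcpy() and then uploaded CPU -> GPU again by glDrawPixels(). */
size_t round_trip_bytes(size_t width, size_t height) {
    return 2 * frame_bytes(width, height);
}
```

At the ripple example's DIM of 1024, this is 4 MiB in each direction, or 8 MiB per frame. At an assumed 60 frames per second, that is 480 MiB/s of bus traffic that graphics interoperation avoids.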

If you recall the previous chapter's heat simulation application, you will remember that it also used CPUAnimBitmap in order to display the output of its simulation computations. We will modify this application to use our newly implemented GPUAnimBitmap structure and look at how the resulting performance changes. As with the ripple example, our GPUAnimBitmap is almost a perfect drop-in replacement for CPUAnimBitmap, with the exception of the unsigned char to uchar4 change. So, the signature of our animation routine changes in order to accommodate this shift in data types:

void anim_gpu( uchar4* outputBitmap, DataBlock *d, int ticks ) {


HANDLE_ERROR( cudaEventRecord( d->start, 0 ) );
dim3 blocks(DIM/16,DIM/16);
dim3 threads(16,16);

// since tex is global and bound, we have to use a flag to


// select which is in/out per iteration
volatile bool dstOut = true;
for (int i=0; i<90; i++) {
float *in, *out;
if (dstOut) {
in = d->dev_inSrc;
out = d->dev_outSrc;
} else {
out = d->dev_inSrc;
in = d->dev_outSrc;
}
copy_const_kernel<<<blocks,threads>>>( in );
blend_kernel<<<blocks,threads>>>( out, dstOut );
dstOut = !dstOut;
}
float_to_color<<<blocks,threads>>>( outputBitmap,
d->dev_inSrc );

156

Download from www.wowebook.com


 +( $77 5 $ 1 6) (5: , 7 +*5 $ 3 +,& 6,1 7 (523

HANDLE_ERROR( cudaEventRecord( d->stop, 0 ) );


HANDLE_ERROR( cudaEventSynchronize( d->stop ) );
float elapsedTime;
HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime,
d->start, d->stop ) );
d->totalTime += elapsedTime;
++d->frames;
printf( "Average Time per frame: %3.1f ms\n",
d->totalTime/d->frames );
}
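The dstOut flag in anim_gpu() implements a ping-pong between dev_inSrc and dev_outSrc. The host-only sketch below replays that flag logic to show which buffer receives the last blend_kernel() output; the enum and function names are hypothetical stand-ins for the two device buffers, and no CUDA calls are made. With an even iteration count such as 90, the final write lands in dev_inSrc, which is consistent with float_to_color() reading d->dev_inSrc.

```cpp
#include <cassert>

// Hypothetical labels for the two device buffers swapped by anim_gpu().
enum class Buffer { InSrc, OutSrc };

// Replays the dstOut flag logic from anim_gpu() and returns the buffer
// that blend_kernel() wrote during the final iteration.
Buffer last_blend_output(int iterations) {
    assert(iterations > 0);
    bool dstOut = true;
    Buffer out = Buffer::OutSrc;
    for (int i = 0; i < iterations; i++) {
        // Same selection as the loop in anim_gpu(): dstOut picks "out".
        out = dstOut ? Buffer::OutSrc : Buffer::InSrc;
        dstOut = !dstOut;
    }
    return out;
}
```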

Since the float_to_color() kernel is the only function that actually uses the outputBitmap, it's the only other function that needs modification as a result of our shift to uchar4. This function was simply considered utility code in the previous chapter, and we will continue to consider it utility code. However, we have overloaded this function and included both unsigned char and uchar4 versions in book.h. You will notice that the differences between these functions are identical to the differences between kernel() in the CPU-animated and GPU-animated versions of GPU ripple. Most of the code for the float_to_color() kernels has been omitted for clarity, but we encourage you to consult book.h if you're dying to see the details.
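The two overloads differ only in how they address the four color channels. The host-side sketch below uses a hypothetical Uchar4 struct as a stand-in for CUDA's uchar4 (which likewise has four unsigned char members x, y, z, and w) to show that writing optr[offset].x through .w touches the same bytes as optr[offset*4 + 0] through optr[offset*4 + 3] in the unsigned char version. This relies on the struct having no padding, which the static_assert checks.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical host-side stand-in for CUDA's uchar4 vector type.
struct alignas(4) Uchar4 {
    unsigned char x, y, z, w;
};
static_assert(sizeof(Uchar4) == 4, "Uchar4 must be exactly 4 bytes");

// Writes one pixel the way the unsigned char overload does.
void write_bytes(unsigned char *optr, int offset,
                 unsigned char r, unsigned char g, unsigned char b) {
    optr[offset * 4 + 0] = r;
    optr[offset * 4 + 1] = g;
    optr[offset * 4 + 2] = b;
    optr[offset * 4 + 3] = 255;
}

// Writes one pixel the way the uchar4 overload does.
void write_vec(Uchar4 *optr, int offset,
               unsigned char r, unsigned char g, unsigned char b) {
    optr[offset].x = r;
    optr[offset].y = g;
    optr[offset].z = b;
    optr[offset].w = 255;
}

// Returns true if both addressing schemes produce identical bytes.
bool layouts_match() {
    unsigned char bytes[16] = {0};
    Uchar4 vecs[4] = {};
    for (int i = 0; i < 4; i++) {
        unsigned char r = static_cast<unsigned char>(10 * i);
        unsigned char g = static_cast<unsigned char>(20 * i);
        unsigned char b = static_cast<unsigned char>(30 * i);
        write_bytes(bytes, i, r, g, b);
        write_vec(vecs, i, r, g, b);
    }
    return std::memcmp(bytes, vecs, sizeof(bytes)) == 0;
}
```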

__global__ void float_to_color( unsigned char *optr,
                                const float *outSrc ) {
    // convert floating-point value to 4-component color
    // (remaining conversion code omitted; see book.h)
    optr[offset*4 + 0] = value( m1, m2, h+120 );
    optr[offset*4 + 1] = value( m1, m2, h );
    optr[offset*4 + 2] = value( m1, m2, h -120 );
    optr[offset*4 + 3] = 255;
}

__global__ void float_to_color( uchar4 *optr,
                                const float *outSrc ) {
    // convert floating-point value to 4-component color
    // (remaining conversion code omitted; see book.h)
    optr[offset].x = value( m1, m2, h+120 );
    optr[offset].y = value( m1, m2, h );
    optr[offset].z = value( m1, m2, h -120 );
    optr[offset].w = 255;
}

Outside of these changes, the only major difference is in the change from CPUAnimBitmap to GPUAnimBitmap to perform animation.

int main( void ) {
    DataBlock data;
    GPUAnimBitmap bitmap( DIM, DIM, &data );
    data.totalTime = 0;
    data.frames = 0;
    HANDLE_ERROR( cudaEventCreate( &data.start ) );
    HANDLE_ERROR( cudaEventCreate( &data.stop ) );

    int imageSize = bitmap.image_size();

    // assume float == 4 chars in size (i.e., rgba)
    HANDLE_ERROR( cudaMalloc( (void**)&data.dev_inSrc,
                              imageSize ) );
    HANDLE_ERROR( cudaMalloc( (void**)&data.dev_outSrc,
                              imageSize ) );
    HANDLE_ERROR( cudaMalloc( (void**)&data.dev_constSrc,
                              imageSize ) );

    HANDLE_ERROR( cudaBindTexture( NULL, texConstSrc,
                                   data.dev_constSrc,
                                   imageSize ) );

    HANDLE_ERROR( cudaBindTexture( NULL, texIn,
                                   data.dev_inSrc,
                                   imageSize ) );

    HANDLE_ERROR( cudaBindTexture( NULL, texOut,
                                   data.dev_outSrc,
                                   imageSize ) );

    // initialize the constant data
    float *temp = (float*)malloc( imageSize );
    for (int i=0; i<DIM*DIM; i++) {
        temp[i] = 0;
        int x = i % DIM;
        int y = i / DIM;
        if ((x>300) && (x<600) && (y>310) && (y<601))
            temp[i] = MAX_TEMP;
    }
    temp[DIM*100+100] = (MAX_TEMP + MIN_TEMP)/2;
    temp[DIM*700+100] = MIN_TEMP;
    temp[DIM*300+300] = MIN_TEMP;
    temp[DIM*200+700] = MIN_TEMP;
    for (int y=800; y<900; y++) {
        for (int x=400; x<500; x++) {
            temp[x+y*DIM] = MIN_TEMP;
        }
    }
    HANDLE_ERROR( cudaMemcpy( data.dev_constSrc, temp,
                              imageSize,
                              cudaMemcpyHostToDevice ) );

    // initialize the input data
    for (int y=800; y<DIM; y++) {
        for (int x=0; x<200; x++) {
            temp[x+y*DIM] = MAX_TEMP;
        }
    }
    HANDLE_ERROR( cudaMemcpy( data.dev_inSrc, temp,
                              imageSize,
                              cudaMemcpyHostToDevice ) );
    free( temp );

    bitmap.anim_and_exit( (void (*)(uchar4*,void*,int))anim_gpu,
                          (void (*)(void*))anim_exit );
}

Although it might be instructive to take a glance at the rest of this enhanced heat simulation application, it is not sufficiently different from the previous chapter's version to warrant more description. The important component is answering the question, how does performance change now that we've completely migrated the application to the GPU? Without having to copy every frame back to the host for display, the situation should be much happier than it was previously.

So exactly how much better is it to use the graphics interoperability to perform the rendering? Previously, the heat transfer example consumed about 25.3 ms per frame on our GeForce GTX 285-based test machine. After converting the application to use graphics interoperability, this drops by 15 percent to 21.6 ms per frame. The net result is that our rendering loop takes about 15 percent less time per frame and no longer requires intervention from the host every time we want to display a frame. That's not bad for a day's work!

8.5 DirectX Interoperability
Although we've looked only at examples that use interoperation with the OpenGL rendering system, DirectX interoperation is nearly identical. You will still use a cudaGraphicsResource to refer to buffers that you share between DirectX and CUDA, and you will still use calls to cudaGraphicsMapResources() and cudaGraphicsResourceGetMappedPointer() to retrieve CUDA-friendly pointers to these shared resources.

For the most part, the calls that differ between OpenGL and DirectX interoperability have embarrassingly simple translations to DirectX. For example, rather than calling cudaGLSetGLDevice(), we call cudaD3D9SetDirect3DDevice() to specify that a CUDA device should be enabled for Direct3D 9 interoperability. Likewise, cudaD3D10SetDirect3DDevice() enables a device for Direct3D 10 interoperation, and cudaD3D11SetDirect3DDevice() does the same for Direct3D 11.

The details of DirectX interoperability probably will not surprise you if you've worked through this chapter's OpenGL examples. But if you want to use DirectX interoperation and want a small project to get started, we suggest that you migrate this chapter's examples to use DirectX. To get started, we recommend consulting the NVIDIA CUDA Programming Guide for a reference on the API and taking a look at the GPU Computing SDK code samples on DirectX interoperability.

8.6 Chapter Review
Although much of this book has been devoted to using the GPU for parallel, general-purpose computing, we can't forget the GPU's successful day job as a rendering engine. Many applications require or would benefit from the use of standard computer graphics rendering. Since the GPU is master of the rendering domain, all that stood between us and the exploitation of these resources was a lack of understanding of the mechanics in convincing the CUDA runtime and graphics drivers to cooperate. Now that we have seen how this is done, we no longer need the host to intervene in displaying the graphical results of our computations. This simultaneously accelerates the application's rendering loop and frees the host to perform other computations in the meantime. Otherwise, if there are no other computations to be performed, it leaves our system more responsive to other events or applications.

There are many other ways to use graphics interoperability that we left unexplored. We looked primarily at using a CUDA C kernel to write into a pixel buffer object for display in a window. This image data can also be used as a texture that can be applied to any surface in the scene. In addition to modifying pixel buffer objects, you can also share vertex buffer objects between CUDA and the graphics engine. Among other things, this allows you to write CUDA C kernels that perform collision detection between objects or compute vertex displacement maps to be used to render objects or surfaces that interact with the user or their surroundings. If you're interested in computer graphics, CUDA C's graphics interoperability API enables a slew of new possibilities for your applications.


Chapter 9
Atomics

In the first half of the book, we saw many occasions where something complicated to accomplish with a single-threaded application becomes quite easy when implemented using CUDA C. For example, thanks to the behind-the-scenes work of the CUDA runtime, we no longer needed for() loops in order to do per-pixel updates in our animations or heat simulations. Likewise, thousands of parallel blocks and threads get created and automatically enumerated with thread and block indices simply by calling a __global__ function from host code.

On the other hand, there are some situations where something incredibly simple in single-threaded applications actually presents a serious problem when we try to implement the same algorithm on a massively parallel architecture. In this chapter, we'll take a look at some of the situations where we need to use special primitives in order to safely accomplish things that can be quite trivial to do in a traditional single-threaded application.


9.1 Chapter Objectives
Through the course of this chapter, you will accomplish the following:

• You will learn about the compute capability of various NVIDIA GPUs.

• You will learn about what atomic operations are and why you might need them.

• You will learn how to perform arithmetic with atomic operations in your CUDA C kernels.

9.2 Compute Capability
All of the topics we have covered to this point involve capabilities that every CUDA-enabled GPU possesses. For example, every GPU built on the CUDA Architecture can launch kernels, access global memory, and read from constant and texture memories. But just like different models of CPUs have varying capabilities and instruction sets (for example, MMX, SSE, or SSE2), so too do CUDA-enabled graphics processors. NVIDIA refers to the supported features of a GPU as its compute capability.

9.2.1 THE COMPUTE CAPABILITY OF NVIDIA GPUs
As of press time, NVIDIA GPUs could potentially support compute capabilities 1.0, 1.1, 1.2, 1.3, or 2.0. Higher-capability versions represent supersets of the versions below them, implementing a "layered onion" or "Russian nesting doll" hierarchy (depending on your metaphorical preference). For example, a GPU with compute capability 1.2 supports all the features of compute capabilities 1.0 and 1.1. The NVIDIA CUDA Programming Guide contains an up-to-date list of all CUDA-capable GPUs and their corresponding compute capability. Table 9.1 lists the NVIDIA GPUs available at press time. The compute capability supported by each GPU is listed next to the device's name.


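Table 9.1 Selected CUDA-Enabled GPUs and Their Corresponding Compute Capabilities

GPU | COMPUTE CAPABILITY
GeForce GTX 480, GTX 470 | 2.0
GeForce GTX 295 | 1.3
GeForce GTX 285, GTX 280 | 1.3
GeForce GTX 260 | 1.3
GeForce 9800 GX2 | 1.1
GeForce GTS 250, GTS 150, 9800 GTX, 9800 GTX+, 8800 GTS 512 | 1.1
GeForce 8800 Ultra, 8800 GTX | 1.0
GeForce 9800 GT, 8800 GT, GTX 280M, 9800M GTX | 1.1
GeForce GT 130, 9600 GSO, 8800 GS, 8800M GTX, GTX 260M, 9800M GT | 1.1
GeForce 8800 GTS | 1.0
GeForce 9600 GT, 8800M GTS, 9800M GTS | 1.1
GeForce 9700M GT | 1.1
GeForce GT 120, 9500 GT, 8600 GTS, 8600 GT, 9700M GT, 9650M GS, 9600M GT, 9600M GS, 9500M GS, 8700M GT, 8600M GT, 8600M GS | 1.1
GeForce G100, 8500 GT, 8400 GS, 8400M GT, 9500M G, 9300M G, 8400M GS, 9400 mGPU, 9300 mGPU, 8300 mGPU, 8200 mGPU, 8100 mGPU | 1.1
GeForce 9300M GS, 9200M GS, 9100M G, 8400M G | 1.1
Tesla S2050, S2070, C2050, C2070 | 2.0
Tesla S1070, C1060 | 1.3
Tesla S870, D870, C870 | 1.0
Quadro Plex 2200 D2 | 1.3
Quadro Plex 2100 D4 | 1.1
Quadro Plex 2100 Model S4 | 1.0
Quadro Plex 1000 Model IV | 1.0
Quadro FX 5800 | 1.3
Quadro FX 4800 | 1.3
Quadro FX 4700 X2 | 1.1
Quadro FX 3700M | 1.1
Quadro FX 5600 | 1.0
Quadro FX 3700 | 1.1
Quadro FX 3600M | 1.1
Quadro FX 4600 | 1.0
Quadro FX 2700M | 1.1
Quadro FX 1700, FX 570, NVS 320M, FX 1700M, FX 1600M, FX 770M, FX 570M | 1.1
Quadro FX 370, NVS 290, NVS 140M, NVS 135M, FX 360M | 1.1
Quadro FX 370M, NVS 130M | 1.1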


Of course, since NVIDIA releases new graphics processors all the time, this table will undoubtedly be out-of-date the moment this book is published. Fortunately, NVIDIA has a website, and on this website you will find the CUDA Zone. Among other things, the CUDA Zone is home to the most up-to-date list of supported CUDA devices. We recommend that you consult this list before doing anything drastic as a result of being unable to find your new GPU in Table 9.1. Or you can simply run the example from Chapter 3 that prints the compute capability of each CUDA device in the system.

Because this is the chapter on atomics, of particular relevance is the hardware capability to perform atomic operations on memory. Before we look at what atomic operations are and why you care, you should know that atomic operations on global memory are supported only on GPUs of compute capability 1.1 or higher. Furthermore, atomic operations on shared memory require a GPU of compute capability 1.2 or higher. Because of the superset nature of compute capability versions, GPUs of compute capability 1.2 therefore support both shared memory atomics and global memory atomics. Similarly, GPUs of compute capability 1.3 support both of these as well.
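As a sketch of how a program might act on these rules at run time, the helpers below encode the two thresholds stated above: global memory atomics need compute capability 1.1 or higher, and shared memory atomics need 1.2 or higher. They take the major and minor version numbers, which in a real program would come from the major and minor fields of the cudaDeviceProp structure filled in by cudaGetDeviceProperties(). The helper names are hypothetical, and the code makes no CUDA calls so that it runs on any host.

```cpp
#include <cassert>

// True if compute capability major.minor is at least req_major.req_minor.
bool at_least(int major, int minor, int req_major, int req_minor) {
    assert(major >= 0 && minor >= 0);
    return major > req_major || (major == req_major && minor >= req_minor);
}

// Global memory atomics: compute capability 1.1 or higher.
bool supports_global_atomics(int major, int minor) {
    return at_least(major, minor, 1, 1);
}

// Shared memory atomics: compute capability 1.2 or higher.
bool supports_shared_atomics(int major, int minor) {
    return at_least(major, minor, 1, 2);
}
```

Because each version is a superset of the ones below it, a single "at least" comparison is enough; there is no need to list supported versions individually.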

If it turns out that your GPU is of compute capability 1.0 and it doesn't support atomic operations on global memory, well, maybe we've just given you the perfect excuse to upgrade! If you decide you're not ready to splurge on a new atomics-enabled graphics processor, you can continue to read about atomic operations and the situations in which you might want to use them. But if you find it too heartbreaking that you won't be able to run the examples, feel free to skip to the next chapter.

9.2.2 COMPILING FOR A MINIMUM COMPUTE CAPABILITY
Suppose that we have written code that requires a certain minimum compute capability. For example, imagine that you've finished this chapter and go off to write an application that relies heavily on global memory atomics. Having studied this text extensively, you know that global memory atomics require a compute capability of 1.1. To compile your code, you need to inform the compiler that the kernel cannot run on hardware with a capability less than 1.1. Moreover, in telling the compiler this, you're also giving it the freedom to make other optimizations that may be available only on GPUs of compute capability 1.1 or greater. Informing the compiler of this is as simple as adding a command-line option to your invocation of nvcc:
nvcc -arch=sm_11

Similarly, to build a kernel that relies on shared memory atomics, you need to inform the compiler that the code requires compute capability 1.2 or greater:
nvcc -arch=sm_12

9.3 Atomic Operations Overview
Programmers typically never need to use atomic operations when writing traditional single-threaded applications. If this is the situation with you, don't worry; we plan to explain what they are and why we might need them in a multithreaded application. To clarify atomic operations, we'll look at one of the first things you learned when learning C or C++, the increment operator:
x++;

This is a single expression in standard C, and after executing this expression, the value in x should be one greater than it was prior to executing the increment. But what sequence of operations does this imply? To add one to the value of x, we first need to know what value is currently in x. After reading the value of x, we can modify it. And finally, we need to write this value back to x.

So the three steps in this operation are as follows:

1. Read the value in x.

2. Add 1 to the value read in step 1.

3. Write the result back to x.

This process is generally called a read-modify-write operation, since step 2 can consist of any operation that changes the value that was read from x.
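Written out explicitly, the single expression x++ corresponds to the three steps listed above. The function below is an illustrative sketch (its name is hypothetical) that spells them out with a local variable standing in for a register. In single-threaded code it behaves exactly like x++, which is precisely why the hazard described next is easy to overlook.

```cpp
#include <cassert>

// Performs x++ as an explicit read-modify-write sequence and returns the
// old value, just as the postfix x++ expression does.
int increment_rmw(int *x) {
    assert(x != nullptr);
    int old = *x;            // step 1: read the value in x
    int updated = old + 1;   // step 2: add 1 to the value read in step 1
    *x = updated;            // step 3: write the result back to x
    return old;
}
```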

Now consider a situation where two threads need to perform this increment on the value in x. Let's call these threads A and B. For A and B to both increment the value in x, both threads need to perform the three operations we've described. Let's suppose x starts with the value 7. Ideally we would like thread A and thread B to do the steps shown in Table 9.2.


Table 9.2 Two threads incrementing the value in x

STEP | EXAMPLE
Thread A reads the value in x. | A reads 7 from x
Thread A adds 1 to the value it read. | A computes 8
Thread A writes the result back to x. | x <- 8
Thread B reads the value in x. | B reads 8 from x
Thread B adds 1 to the value it read. | B computes 9
Thread B writes the result back to x. | x <- 9

Since x starts with the value 7 and gets incremented by two threads, we would expect it to hold the value 9 after they've completed. In the previous sequence of operations, this is indeed the result we obtain. Unfortunately, there are many other orderings of these steps that produce the wrong value. For example, consider the ordering shown in Table 9.3, where thread A's and thread B's operations become interleaved with each other.

Table 9.3 Two threads incrementing the value in x with interleaved operations

STEP | EXAMPLE
Thread A reads the value in x. | A reads 7 from x
Thread B reads the value in x. | B reads 7 from x
Thread A adds 1 to the value it read. | A computes 8
Thread B adds 1 to the value it read. | B computes 8
Thread A writes the result back to x. | x <- 8
Thread B writes the result back to x. | x <- 8
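The two schedules can be replayed deterministically on the CPU. In the sketch below (function names are hypothetical), each simulated thread keeps its own copy of the value it read, standing in for a register. Running the steps in the order of Table 9.2 and then in the order of Table 9.3 reproduces the final values 9 and 8.

```cpp
#include <cassert>

// Replays Table 9.2: thread A finishes all three steps before B starts.
int sequential_schedule(int x) {
    int a = x;   // A reads x
    a = a + 1;   // A adds 1
    x = a;       // A writes back
    int b = x;   // B reads x (and sees A's result)
    b = b + 1;   // B adds 1
    x = b;       // B writes back
    return x;
}

// Replays Table 9.3: both threads read before either one writes.
int interleaved_schedule(int x) {
    int a = x;   // A reads x
    int b = x;   // B reads the same, soon-to-be-stale value
    a = a + 1;   // A adds 1
    b = b + 1;   // B adds 1
    x = a;       // A writes back
    x = b;       // B overwrites A's result
    return x;
}
```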


Therefore, if our threads get scheduled unfavorably, we end up computing the wrong result. There are many other orderings for these six operations, some of which produce correct results and some of which do not. When moving from a single-threaded to a multithreaded version of this application, we suddenly have potential for unpredictable results if multiple threads need to read or write shared values.

In the previous example, we need a way to perform the read-modify-write without being interrupted by another thread. Or more specifically, no other thread can read or write the value of x until we have completed our operation. Because the execution of these operations cannot be broken into smaller parts by other threads, we call operations that satisfy this constraint atomic. CUDA C supports several atomic operations that allow you to operate safely on memory, even when thousands of threads are potentially competing for access.
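CUDA C's atomic functions run on the device, but the same idea exists in standard C++. The host-only sketch below uses std::atomic and std::thread rather than anything from CUDA: several threads increment one shared counter with fetch_add(), an indivisible read-modify-write, so no increment can be lost and the final count is always exact. The function name and the thread and iteration counts are illustrative assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Each of num_threads threads adds 1 to a shared counter iters times using
// an atomic read-modify-write, so interleavings cannot lose an update.
long count_with_atomics(int num_threads, int iters) {
    assert(num_threads > 0 && iters >= 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; t++) {
        workers.emplace_back([&counter, iters] {
            for (int i = 0; i < iters; i++)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto &w : workers)
        w.join();  // join() makes every increment visible to this thread
    return counter.load();
}
```

Replacing std::atomic<long> with a plain long here would be a data race, which C++ treats as undefined behavior; in practice such code tends to lose updates exactly as in Table 9.3.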

Now we'll take a look at an example that requires the use of atomic operations to compute correct results.

9.4 Computing Histograms
Oftentimes, algorithms require the computation of a histogram of some set of data. If you haven't had any experience with histograms in the past, that's not a big deal. Essentially, given a data set that consists of some set of elements, a histogram represents a count of the frequency of each element. For example, if we created a histogram of the letters in the phrase Programming with CUDA C, we would end up with the result shown in Figure 9.1.

Although simple to describe and understand, computing histograms of data arises surprisingly often in computer science. It's used in algorithms for image processing, data compression, computer vision, machine learning, audio encoding, and many others. We will use histogram computation as the algorithm for the following code examples.

Figure 9.1 Letter frequency histogram built from the string Programming with CUDA C
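As a small CPU sketch of the same idea, the function below counts the letters in a string. It folds letters to upper case and skips spaces, which is an assumption about how Figure 9.1 treats case; with that convention, the phrase Programming with CUDA C contains 20 letters, with A, C, G, I, M, and R each appearing twice and every other letter at most once.

```cpp
#include <array>
#include <cassert>
#include <cctype>
#include <string>

// Counts occurrences of each letter A-Z, folding lower case to upper case.
// Non-letters (spaces, digits, punctuation) are skipped.
std::array<int, 26> letter_histogram(const std::string &text) {
    std::array<int, 26> bins{};  // one bin per letter, all starting at zero
    for (unsigned char c : text) {
        if (std::isalpha(c))
            bins[std::toupper(c) - 'A']++;
    }
    return bins;
}
```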


9.4.1 CPU HISTOGRAM COMPUTATION
Because the computation of a histogram may not be familiar to all readers, we'll start with an example of how to compute a histogram on the CPU. This example will also serve to illustrate how computing a histogram is relatively simple in a single-threaded CPU application. The application will be given some large stream of data. In an actual application, the data might signify anything from pixel colors to audio samples, but in our sample application, it will be a stream of randomly generated bytes. We can create this random stream of bytes using a utility function we have provided called big_random_block(). In our application, we create 100MB of random data.

#include "../common/book.h"

#define SIZE (100*1024*1024)

int main( void ) {
    unsigned char *buffer = (unsigned char*)big_random_block( SIZE );

Since each random 8-bit byte can be any of 256 different values (from 0x00 to 0xFF), our histogram needs to contain 256 bins in order to keep track of the number of times each value has been seen in the data. We create a 256-bin array and initialize all the bin counts to zero.

    unsigned int histo[256];
    for (int i=0; i<256; i++)
        histo[i] = 0;

Once our histogram has been created and all the bins are initialized to zero, we need to tabulate the frequency with which each value appears in the data contained in buffer[]. The idea here is that whenever we see some value z in the array buffer[], we want to increment the value in bin z of our histogram. This way, we're counting the number of times we have seen an occurrence of the value z.


If buffer[i] is the current value we are looking at, we want to increment the count we have in the bin numbered buffer[i]. Since bin buffer[i] is located at histo[buffer[i]], we can increment the appropriate counter in a single line of code:
histo[buffer[i]]++;

We do this for each element in buffer[] with a simple for() loop:

    for (int i=0; i<SIZE; i++)
        histo[buffer[i]]++;

At this point, we've completed our histogram of the input data. In a full application, this histogram might be the input to the next step of computation. In our simple example, however, this is all we care to compute, so we end the application by verifying that all the bins of our histogram sum to the expected value.

    long histoCount = 0;
    for (int i=0; i<256; i++) {
        histoCount += histo[i];
    }
    printf( "Histogram Sum: %ld\n", histoCount );

If you've followed closely, you will realize that this sum will always be the same, regardless of the random input array. Each bin counts the number of times we have seen the corresponding data element, so the sum of all of these bins should be the total number of data elements we've examined. In our case, this will be the value SIZE.

And needless to say (but we will anyway), we clean up after ourselves and return.

    free( buffer );
    return 0;
}
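For readers without the book's book.h helpers, the following self-contained sketch mirrors the CPU histogram above. It substitutes std::mt19937 for big_random_block(), a stand-in chosen here rather than the book's utility, and returns the 256 bin counts so that the sum check can be expressed directly. All function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Fills a buffer with pseudo-random bytes (stand-in for big_random_block()).
std::vector<unsigned char> random_bytes(std::size_t size, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> dist(0, 255);
    std::vector<unsigned char> buf(size);
    for (auto &b : buf)
        b = static_cast<unsigned char>(dist(gen));
    return buf;
}

// Single-threaded 256-bin histogram, as in the CPU version above.
std::array<unsigned int, 256> cpu_histogram(const std::vector<unsigned char> &buf) {
    std::array<unsigned int, 256> histo{};  // all bins start at zero
    for (unsigned char v : buf)
        histo[v]++;
    return histo;
}

// Sum of all bins; always equals the number of bytes examined.
long histogram_sum(const std::array<unsigned int, 256> &histo) {
    long total = 0;
    for (unsigned int n : histo)
        total += n;
    return total;
}
```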


On our benchmark machine, a Core 2 Duo, the histogram of this 100MB array of data can be constructed in 0.416 seconds. This will provide a baseline performance for the GPU version we intend to write.

9.4.2 GPU HISTOGRAM COMPUTATION
We would like to adapt the histogram computation example to run on the GPU. If our input array is large enough, it might save a considerable amount of time to have different threads examining different parts of the buffer. Having different threads read different parts of the input should be easy enough. After all, it's very similar to things we have seen so far. The problem with computing a histogram from the input data arises from the fact that multiple threads may want to increment the same bin of the output histogram at the same time. In this situation, we will need to use atomic increments to avoid a situation like the one described in Section 9.3: Atomic Operations Overview.

Our main() routine looks very similar to the CPU version, although we will need to add some of the CUDA C plumbing in order to get input to the GPU and results from the GPU. However, we start exactly as we did on the CPU:

int main( void ) {
    unsigned char *buffer = (unsigned char*)big_random_block( SIZE );

We will be interested in measuring how our code performs, so we initialize events for timing exactly like we always have.

    cudaEvent_t start, stop;
    HANDLE_ERROR( cudaEventCreate( &start ) );
    HANDLE_ERROR( cudaEventCreate( &stop ) );
    HANDLE_ERROR( cudaEventRecord( start, 0 ) );

After setting up our input data and events, we look to GPU memory. We will need to allocate space for our random input data and our output histogram. After allocating the input buffer, we copy the array we generated with big_random_block() to the GPU. Likewise, after allocating the histogram, we initialize it to zero just like we did in the CPU version.

    // allocate memory on the GPU for the input data
    unsigned char *dev_buffer;
    unsigned int *dev_histo;
    HANDLE_ERROR( cudaMalloc( (void**)&dev_buffer, SIZE ) );
    HANDLE_ERROR( cudaMemcpy( dev_buffer, buffer, SIZE,
                              cudaMemcpyHostToDevice ) );

    // 256 bins of unsigned int, matching the type of dev_histo
    HANDLE_ERROR( cudaMalloc( (void**)&dev_histo,
                              256 * sizeof( int ) ) );
    HANDLE_ERROR( cudaMemset( dev_histo, 0,
                              256 * sizeof( int ) ) );

You may notice that we slipped in a new CUDA runtime function, cudaMemset(). This function has a similar signature to the standard C function memset(), and the two functions behave nearly identically. The difference in signature between these functions is that cudaMemset() returns an error code while the C library function memset() does not. This error code will inform the caller whether anything bad happened while attempting to set GPU memory. Aside from the error code return, the only difference is that cudaMemset() operates on GPU memory while memset() operates on host memory.
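One property the two functions share is worth keeping in mind: both fill memory byte by byte, so they clear an array of unsigned int to zero correctly but cannot set each element to an arbitrary integer such as 1. The host sketch below demonstrates this with memset(), assuming a 32-bit unsigned int as on common platforms; the same byte-wise behavior applies to cudaMemset() on device memory, which is why the histogram uses it only to zero the bins. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstring>

// Fills a small unsigned int array with memset() and returns the first
// element, showing that memset sets every byte rather than every int.
unsigned int fill_with_memset(int byte_value) {
    unsigned int bins[4];
    std::memset(bins, byte_value, sizeof(bins));
    return bins[0];
}
```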

After initializing the input and output buffers, we are ready to compute our histogram. You will see how we prepare and launch the histogram kernel momentarily. For the time being, assume that we have computed the histogram on the GPU. After finishing, we need to copy the histogram back to the CPU, so we allocate a 256-entry array and perform a copy from device to host.

    unsigned int histo[256];
    HANDLE_ERROR( cudaMemcpy( histo, dev_histo,
                              256 * sizeof( int ),
                              cudaMemcpyDeviceToHost ) );


At this point, we are done with the histogram computation, so we can stop our timers and display the elapsed time. Just like the previous event code, this is identical to the timing code we've used for several chapters.

    // get stop time, and display the timing results
    HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
    HANDLE_ERROR( cudaEventSynchronize( stop ) );
    float elapsedTime;
    HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime,
                                        start, stop ) );
    printf( "Time to generate: %3.1f ms\n", elapsedTime );

At this point, we could pass the histogram as input to another stage in the algorithm, but since we are not using the histogram for anything else, we will simply verify that the computed GPU histogram matches what we get on the CPU. First, we verify that the histogram sum matches what we expect. This is identical to the CPU code shown here:

    long histoCount = 0;
    for (int i=0; i<256; i++) {
        histoCount += histo[i];
    }
    printf( "Histogram Sum: %ld\n", histoCount );

To fully verify the GPU histogram, though, we will use the CPU to compute the same histogram. The obvious way to do this would be to allocate a new histogram array, compute a histogram from the input using the code from Section 9.4.1: CPU Histogram Computation, and, finally, ensure that each bin in the GPU and CPU versions matches. But rather than allocate a new histogram array, we'll opt to start with the GPU histogram and compute the CPU histogram "in reverse."

By computing the histogram "in reverse," we mean that rather than starting at zero and incrementing bin values when we see data elements, we will start with the GPU histogram and decrement the bin's value when the CPU sees data elements. Therefore, the CPU has computed the same histogram as the GPU if and only if every bin has the value zero when we are finished. In some sense, we are computing the difference between these two histograms. The code will look remarkably like the CPU histogram computation but with a decrement operator instead of an increment operator.

    // verify that we have the same counts via CPU
    for (int i=0; i<SIZE; i++)
        histo[buffer[i]]--;
    for (int i=0; i<256; i++) {
        if (histo[i] != 0)
            printf( "Failure at %d!\n", i );
    }
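The same reverse check can be packaged as a reusable function. In the host-only sketch below (names are hypothetical), the caller passes the histogram under test by value; every byte of the input decrements its bin, and the check succeeds only if all 256 bins end at exactly zero. Because the bins are unsigned, a bin that was too small wraps around to a large value rather than going negative, which still fails the check.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Returns true if histo is exactly the histogram of buf.
// histo is taken by value so the caller's copy is left untouched.
bool verify_in_reverse(std::array<unsigned int, 256> histo,
                       const std::vector<unsigned char> &buf) {
    for (unsigned char v : buf)
        histo[v]--;              // undo one count per data element
    for (unsigned int n : histo)
        if (n != 0)
            return false;        // some bin was over- or under-counted
    return true;
}
```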

As usual, the finale involves cleaning up our allocated CUDA events, GPU memory, and host memory.

    HANDLE_ERROR( cudaEventDestroy( start ) );
    HANDLE_ERROR( cudaEventDestroy( stop ) );
    cudaFree( dev_histo );
    cudaFree( dev_buffer );
    free( buffer );
    return 0;
}

Before, we assumed that we had launched a kernel that computed our histogram and then pressed on to discuss the aftermath. Our kernel launch is slightly more complicated than usual because of performance concerns. Because the histogram contains 256 bins, using 256 threads per block is convenient and also results in high performance. But we have a lot of flexibility in terms of the number of blocks we launch. For example, with 100MB of data, we have 104,857,600 bytes of data. We could launch a single block and have each thread examine 409,600 data elements. Likewise, we could launch 409,600 blocks and have each thread examine a single data element.

As you might have guessed, the optimal solution is at a point between these two extremes. In our performance experiments, the best performance came when the number of blocks we launch is exactly twice the number of multiprocessors our GPU contains. For example, a GeForce GTX 280 has 30 multiprocessors, so our histogram kernel happens to run fastest on a GeForce GTX 280 when launched with 60 parallel blocks.
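The arithmetic behind these configurations is easy to check. The sketch below uses a hypothetical helper, not part of the book's code, to compute how many bytes each thread examines (rounded up) for a given number of blocks of 256 threads; 30 multiprocessors is the figure quoted above.

```cpp
#include <cassert>

// Bytes each thread must examine (rounded up) when size bytes are split
// across blocks blocks of threads_per_block threads.
long long bytes_per_thread(long long size, long long blocks,
                           long long threads_per_block) {
    assert(blocks > 0 && threads_per_block > 0);
    long long total_threads = blocks * threads_per_block;
    return (size + total_threads - 1) / total_threads;
}
```

With 60 blocks, two-thirds of the threads visit 6,827 bytes and the rest visit 6,826; the kernel's stride-based loop absorbs this uneven split without any special handling.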


In Chapter 3, we discussed a method for querying various properties of the hardware on which our program is running. We will need to use one of these device properties if we intend to dynamically size our launch based on our current hardware platform. To accomplish this, we will use the following code segment. Although you haven't yet seen the kernel implementation, you should still be able to follow what is going on.

    cudaDeviceProp prop;
    HANDLE_ERROR( cudaGetDeviceProperties( &prop, 0 ) );
    int blocks = prop.multiProcessorCount;
    histo_kernel<<<blocks*2,256>>>( dev_buffer, SIZE, dev_histo );

Since our walk-through of main() has been somewhat fragmented, here is the entire routine from start to finish:

int main( void ) {
    unsigned char *buffer =
                  (unsigned char*)big_random_block( SIZE );

    cudaEvent_t start, stop;
    HANDLE_ERROR( cudaEventCreate( &start ) );
    HANDLE_ERROR( cudaEventCreate( &stop ) );
    HANDLE_ERROR( cudaEventRecord( start, 0 ) );

    // allocate memory on the GPU for the input data
    unsigned char *dev_buffer;
    unsigned int *dev_histo;
    HANDLE_ERROR( cudaMalloc( (void**)&dev_buffer, SIZE ) );
    HANDLE_ERROR( cudaMemcpy( dev_buffer, buffer, SIZE,
                              cudaMemcpyHostToDevice ) );

    HANDLE_ERROR( cudaMalloc( (void**)&dev_histo,
                              256 * sizeof( int ) ) );
    HANDLE_ERROR( cudaMemset( dev_histo, 0,
                              256 * sizeof( int ) ) );

    cudaDeviceProp prop;
    HANDLE_ERROR( cudaGetDeviceProperties( &prop, 0 ) );
    int blocks = prop.multiProcessorCount;
    histo_kernel<<<blocks*2,256>>>( dev_buffer, SIZE, dev_histo );

    unsigned int histo[256];
    HANDLE_ERROR( cudaMemcpy( histo, dev_histo,
                              256 * sizeof( int ),
                              cudaMemcpyDeviceToHost ) );

    // get stop time, and display the timing results
    HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
    HANDLE_ERROR( cudaEventSynchronize( stop ) );
    float elapsedTime;
    HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime,
                                        start, stop ) );
    printf( "Time to generate: %3.1f ms\n", elapsedTime );

    long histoCount = 0;
    for (int i=0; i<256; i++) {
        histoCount += histo[i];
    }
    printf( "Histogram Sum: %ld\n", histoCount );

    // verify that we have the same counts via CPU
    for (int i=0; i<SIZE; i++)
        histo[buffer[i]]--;
    for (int i=0; i<256; i++) {
        if (histo[i] != 0)
            printf( "Failure at %d!\n", i );
    }

    HANDLE_ERROR( cudaEventDestroy( start ) );
    HANDLE_ERROR( cudaEventDestroy( stop ) );
    cudaFree( dev_histo );
    cudaFree( dev_buffer );
    free( buffer );
    return 0;
}

HISTOGRAM KERNEL USING GLOBAL MEMORY ATOMICS
And now for the fun part: the GPU code that computes the histogram! The kernel that computes the histogram itself needs to be given a pointer to the input data array, the length of the input array, and a pointer to the output histogram. The first thing our kernel needs to compute is a linearized offset into the input data array. Each thread will start with an offset between 0 and the number of threads minus 1. It will then stride by the total number of threads that have been launched. We hope you remember this technique; we used the same logic to add vectors of arbitrary length when you first learned about threads.

#include "../common/book.h"

#define SIZE (100*1024*1024)

__global__ void histo_kernel( unsigned char *buffer,
                              long size,
                              unsigned int *histo ) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;

Once each thread knows its starting offset i and the stride it should use, the code walks through the input array incrementing the corresponding histogram bin.

    while (i < size) {
        atomicAdd( &(histo[buffer[i]]), 1 );
        i += stride;
    }
}
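To see why this indexing scheme touches every element exactly once, the host-only sketch below enumerates every (block, thread) pair, runs the same while loop, and counts visits per element. The function name is hypothetical, and the grid and block sizes in the checks are small illustrative values.

```cpp
#include <cassert>
#include <vector>

// Simulates the stride loop of histo_kernel() on the host and returns how
// many times each of the size elements is visited.
std::vector<int> grid_stride_visits(int grid_dim, int block_dim, int size) {
    assert(grid_dim > 0 && block_dim > 0 && size >= 0);
    std::vector<int> visits(size, 0);
    int stride = block_dim * grid_dim;  // total threads launched
    for (int block = 0; block < grid_dim; block++) {
        for (int thread = 0; thread < block_dim; thread++) {
            int i = thread + block * block_dim;  // starting offset
            while (i < size) {
                visits[i]++;
                i += stride;
            }
        }
    }
    return visits;
}
```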


The atomicAdd() call inside the while loop represents the way we use atomic operations in CUDA C. The call atomicAdd( addr, y ); generates an atomic sequence of operations that reads the value at address addr, adds y to that value, and stores the result back to the memory address addr. The hardware guarantees us that no other thread can read or write the value at address addr while we perform these operations, thus ensuring predictable results. In our example, the address in question is the location of the histogram bin that corresponds to the current byte. If the current byte is buffer[i], just like we saw in the CPU version, the corresponding histogram bin is histo[buffer[i]]. The atomic operation needs the address of this bin, so the first argument is therefore &(histo[buffer[i]]). Since we simply want to increment the value in that bin by one, the second argument is 1.

So after all that hullabaloo, our GPU histogram computation is fairly similar to the corresponding CPU version.

#include "../common/book.h"

#define SIZE (100*1024*1024)

__global__ void histo_kernel( unsigned char *buffer,
                              long size,
                              unsigned int *histo ) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd( &(histo[buffer[i]]), 1 );
        i += stride;
    }
}
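For comparison, here is a host-side analogue of histo_kernel() written with standard C++ threads and atomics rather than CUDA. It only illustrates the pattern (each worker starts at its own index, strides by the worker count, and updates shared bins with an atomic fetch_add()); it is not a model of GPU performance, and the function name and worker count are assumptions.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Builds a 256-bin histogram with `workers` threads that share one set of
// atomic bins, mirroring the stride loop and atomicAdd() of histo_kernel().
std::array<unsigned int, 256> threaded_histogram(
        const std::vector<unsigned char> &buf, int workers) {
    assert(workers > 0);
    std::array<std::atomic<unsigned int>, 256> bins;
    for (auto &b : bins)
        b.store(0);
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; w++) {
        pool.emplace_back([&buf, &bins, w, workers] {
            for (std::size_t i = w; i < buf.size(); i += workers)
                bins[buf[i]].fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto &t : pool)
        t.join();
    std::array<unsigned int, 256> result{};
    for (int k = 0; k < 256; k++)
        result[k] = bins[k].load();
    return result;
}
```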

However, we need to save the celebrations for later. After running this example, we discover that a GeForce GTX 285 can construct a histogram from 100MB of input data in 1.752 seconds. If you read the section on CPU-based histograms, you will realize that this performance is terrible. In fact, this is more than four times slower than the CPU version! But this is why we always measure our baseline performance. It would be a shame to settle for such a low-performance implementation simply because it runs on the GPU.


Since we do very little work in the kernel, it is quite likely that the atomic operation on global memory is causing the problem. Essentially, when thousands of threads are trying to access a handful of memory locations, a great deal of contention for our 256 histogram bins can occur. To ensure atomicity of the increment operations, the hardware needs to serialize operations to the same memory location. This can result in a long queue of pending operations, and any performance gain we might have had will vanish. We will need to improve the algorithm itself in order to recover this performance.

HISTOGRAM KERNEL USING SHARED AND GLOBAL MEMORY ATOMICS
Ironically, despite the fact that the atomic operations cause this performance degradation, alleviating the slowdown actually involves using more atomics, not fewer. The core problem was not the use of atomics so much as the fact that thousands of threads were competing for access to a relatively small number of memory addresses. To address this issue, we will split our histogram computation into two phases.

In phase one, each parallel block will compute a separate histogram of the data that its constituent threads examine. Since each block does this independently, we can compute these histograms in shared memory, saving us the time of sending each write off-chip to DRAM. Doing this does not free us from needing atomic operations, though, since multiple threads within the block can still examine data elements with the same value. However, the fact that only 256 threads will now be competing for 256 addresses will reduce contention from the global version where thousands of threads were competing.

The first phase, then, involves allocating and zeroing a shared memory buffer to hold each block's intermediate histogram. Recall from our earlier discussion of shared memory that the subsequent step will read and modify this buffer. We therefore need a __syncthreads() call to ensure that every thread's write has completed before we continue.

__global__ void histo_kernel( unsigned char *buffer,
                              long size,
                              unsigned int *histo ) {
    __shared__ unsigned int temp[256];
    temp[threadIdx.x] = 0;
    __syncthreads();

After zeroing the histogram, the next step is remarkably similar to our original GPU histogram. There are only two differences. First, we use the shared memory buffer temp[] instead of the global memory buffer histo[]. Second, we need a subsequent call to __syncthreads() to ensure that the last of our writes has been committed.

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd( &temp[buffer[i]], 1);
        i += offset;
    }
    __syncthreads();

The last step in our modified histogram example requires that we merge each block's temporary histogram into the global buffer histo[]. Suppose we split the input in half, and two threads each look at a different half and compute separate histograms. If thread A sees byte 0xFC x times in the input and thread B sees byte 0xFC y times, then the byte 0xFC must have appeared x + y times in the input. Likewise, each bin of the final histogram is just the sum of the corresponding bin in thread A's histogram and the corresponding bin in thread B's histogram. This logic extends to any number of threads. Merging every block's histogram into a single final histogram therefore means adding each entry in the block's histogram to the corresponding entry in the final histogram. For all the reasons we've seen already, this needs to be done atomically:

    atomicAdd( &(histo[threadIdx.x]), temp[threadIdx.x] );
}

Since we have decided to use 256 threads and have 256 histogram bins, each thread atomically adds a single bin to the final histogram's total. If these numbers didn't match, this phase would be more complicated. Note that we have no guarantees about the order in which the blocks add their values to the final histogram. Integer addition is commutative and associative, however, so we will always get the same answer provided that the additions occur atomically.
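
The two-phase idea can be modeled on the CPU in the same way. This hypothetical plain-C sketch uses assumed names: two_phase_histogram(), MAX_BLOCKS and the reverse_merge flag. Each simulated block first fills a private 256-bin array that stands in for the __shared__ temp[] buffer. The private arrays are then added into the global histogram one bin at a time, as the final atomicAdd() does. Setting reverse_merge merges the blocks in the opposite order, which should not change the result.

```c
#include <assert.h>
#include <string.h>

#define BINS 256        /* one bin per byte value, as in the kernel */
#define MAX_BLOCKS 64   /* hypothetical upper bound on simulated blocks */

/* Hypothetical CPU model of the shared-plus-global-atomics kernel.
   Assumes 256 threads per block, one per bin, as the kernel does. */
void two_phase_histogram(const unsigned char *buffer, long size,
                         int grid_dim, unsigned int histo[BINS],
                         int reverse_merge) {
    static unsigned int temp[MAX_BLOCKS][BINS];  /* stands in for __shared__ temp[] */
    const int block_dim = BINS;
    long stride;

    assert(grid_dim > 0 && grid_dim <= MAX_BLOCKS);
    stride = (long)block_dim * grid_dim;

    /* phase one: each block histograms the elements its threads visit */
    memset(temp, 0, sizeof(temp));
    for (int block = 0; block < grid_dim; block++)
        for (int t = 0; t < block_dim; t++)
            for (long i = t + (long)block * block_dim; i < size; i += stride)
                temp[block][buffer[i]] += 1;   /* shared atomicAdd() on the GPU */

    /* phase two: add every block's bin t into histo[t] */
    memset(histo, 0, BINS * sizeof(unsigned int));
    for (int k = 0; k < grid_dim; k++) {
        int block = reverse_merge ? grid_dim - 1 - k : k;
        for (int t = 0; t < BINS; t++)
            histo[t] += temp[block][t];        /* global atomicAdd() on the GPU */
    }
}
```

A quick way to convince yourself that splitting the work and summing per-block counts loses nothing: compare the output against a histogram computed directly over the buffer, for both merge orders.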

And with this, our two-phase histogram computation kernel is complete. Here it is from start to finish:

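__global__ void histo_kernel( unsigned char *buffer,
                              long size,
                              unsigned int *histo ) {
    // per-block histogram in shared memory; assumes 256 threads
    // per block, one thread per histogram bin
    __shared__ unsigned int temp[256];
    temp[threadIdx.x] = 0;
    __syncthreads();

    // phase one: build this block's histogram with shared memory atomics
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd( &temp[buffer[i]], 1);
        i += offset;
    }

    // phase two: wait for all shared writes, then merge bin
    // threadIdx.x into the global histogram
    __syncthreads();
    atomicAdd( &(histo[threadIdx.x]), temp[threadIdx.x] );
}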

This version of our histogram example improves dramatically over the previous GPU version. Adding the shared memory component sharply reduces our running time on a GeForce GTX GPU. Not only is this significantly better than the version that used only global memory atomics, it also beats our original CPU implementation by nearly an order of magnitude, a greater than sevenfold boost in speed. So despite the early setback in adapting the histogram to a GPU implementation, our version that uses both shared and global atomics should be considered a success.

Chapter Review
Although we have frequently spoken at length about how easy parallel programming can be with CUDA C, we have largely ignored some of the situations when massively parallel architectures such as the GPU can make our lives as programmers more difficult. Trying to cope with potentially tens of thousands of threads simultaneously modifying the same memory addresses is a common situation where a massively parallel machine can seem burdensome. Fortunately, we have hardware-supported atomic operations available to help ease this pain.

However, as you saw with the histogram computation, sometimes reliance on atomic operations introduces performance issues that can be resolved only by rethinking parts of the algorithm. In the histogram example, we moved to a two-stage algorithm that alleviated contention for global memory addresses. In general, this strategy of looking to lessen memory contention tends to work well, and you should keep it in mind when using atomics in your own applications.

Chapter 10
Streams

Time and time again in this book we have seen how the massively data-parallel execution engine on a GPU can provide stunning performance gains over comparable CPU code. However, there is yet another class of parallelism to be exploited on NVIDIA graphics processors. This parallelism is similar to the task parallelism that is found in multithreaded CPU applications. Rather than simultaneously computing the same function on lots of data elements, as one does with data parallelism, task parallelism involves doing two or more completely different tasks in parallel.

In the context of parallelism, a task could be any number of things. For example, an application could be executing two tasks: redrawing its GUI with one thread while downloading an update over the network with another thread. These tasks proceed in parallel, despite having nothing in common. Although the task parallelism on GPUs is not currently as flexible as a general-purpose processor's, it still provides opportunities for us as programmers to extract even more speed from our GPU-based implementations. In this chapter, we will look at CUDA streams and the ways in which their careful use will enable us to execute certain operations simultaneously on the GPU.

Chapter Objectives
Through the course of this chapter, you will accomplish the following:

• You will learn about allocating page-locked host memory.
• You will learn what CUDA streams are.
• You will learn how to use CUDA streams to accelerate your applications.

Page-Locked Host Memory
In every example over the course of nine chapters, you have seen us allocate memory on the GPU with cudaMalloc(). On the host, we have always allocated memory with the vanilla C library routine malloc(). However, the CUDA runtime offers its own mechanism for allocating host memory: cudaHostAlloc(). Why would you bother using this function when malloc() has served you quite well since day one of your life as a C programmer?

In fact, there is a significant difference between the memory that malloc() will allocate and the memory that cudaHostAlloc() allocates. The C library function malloc() allocates standard, pageable host memory, while cudaHostAlloc() allocates a buffer of page-locked host memory. Page-locked buffers, sometimes called pinned memory, have an important property: the operating system guarantees us that it will never page this memory out to disk, which ensures its residency in physical memory. The corollary to this is that it becomes safe for the OS to allow an application access to the physical address of the memory, since the buffer will not be evicted or relocated.

Knowing the physical address of a buffer, the GPU can then use direct memory access (DMA) to copy data to or from the host. DMA copies proceed without intervention from the CPU. If the buffer were pageable, the CPU could therefore page it out to disk, or relocate its physical address by updating the operating system's page tables, while the copy was still in flight. Because the CPU may move pageable data, using pinned memory for a DMA copy is essential. In fact, even when you attempt to perform a memory copy with pageable memory, the CUDA driver still uses DMA to transfer the buffer to the GPU. Therefore, your copy happens twice: first from a pageable system buffer to a page-locked "staging" buffer, and then from the page-locked system buffer to the GPU.
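
The double copy can be pictured with a small host-only model. This is a hypothetical plain-C sketch, not the CUDA driver's implementation. staged_copy() moves a pageable source through a fixed-size staging buffer. One memcpy() performs the CPU-managed copy into staging, and a second memcpy() stands in for the DMA transfer to the GPU. The function name and the STAGING_BYTES size are assumptions. The function returns the total number of bytes moved, which makes the extra copy visible.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STAGING_BYTES 4096  /* hypothetical staging-buffer size */

/* Hypothetical model of a copy from pageable memory: every chunk is
   copied twice, so the return value is twice the payload size. */
size_t staged_copy(void *dst, const void *pageable_src, size_t bytes) {
    static unsigned char staging[STAGING_BYTES];  /* stands in for a page-locked buffer */
    const unsigned char *src = (const unsigned char *)pageable_src;
    unsigned char *out = (unsigned char *)dst;
    size_t moved = 0;

    for (size_t off = 0; off < bytes; off += STAGING_BYTES) {
        size_t n = bytes - off < STAGING_BYTES ? bytes - off : STAGING_BYTES;
        memcpy(staging, src + off, n);   /* copy 1: pageable -> staging (CPU) */
        moved += n;
        memcpy(out + off, staging, n);   /* copy 2: staging -> GPU (DMA in reality) */
        moved += n;
    }
    return moved;
}
```

For a 10,000-byte buffer the model reports 20,000 bytes moved, twice the payload. In this model, a copy from a buffer that is already page-locked would skip the first memcpy() entirely.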

As a result, whenever you perform memory copies from pageable memory, you guarantee that the copy speed will be bounded by the lower of the PCI Express transfer speed and the system front-side bus speed. In some systems, a large disparity in bandwidth between these buses ensures that page-locked host memory enjoys roughly a twofold performance advantage over standard pageable memory when used for copying data between the GPU and the host. But even in a world where PCI Express and front-side bus speeds were identical, pageable buffers would still incur the overhead of an additional CPU-managed copy.

However, you should resist the temptation to simply do a search-and-replace on malloc to convert every one of your calls to use cudaHostAlloc(). Using pinned memory is a double-edged sword. By doing so, you have effectively opted out of all the nice features of virtual memory. Specifically, the computer running the application needs to have available physical memory for every page-locked buffer, since these buffers can never be swapped out to disk. This means that your system will run out of memory much faster than it would if you stuck to standard malloc() calls. Not only does this mean that your application might start to fail on machines with smaller amounts of physical memory, but it means that your application can affect the performance of other applications running on the system.

These warnings are not meant to scare you out of using cudaHostAlloc(), but you should remain aware of the implications of page-locking buffers. We suggest trying to restrict their use to memory that will be used as a source or destination in calls to cudaMemcpy() and freeing them when they are no longer needed, rather than waiting until application shutdown to release the memory. The use of cudaHostAlloc() should be no more difficult than anything else you've studied so far, but let's take a look at an example that will both illustrate how pinned memory is allocated and demonstrate its performance advantage over standard pageable memory.

Our application will be very simple and serves primarily to benchmark cudaMemcpy() performance with both pageable and page-locked memory. All we endeavor to do is allocate a GPU buffer and a host buffer of matching sizes and then execute some number of copies between these two buffers. We'll allow the user of this benchmark to specify the direction of the copy, either "up" (from host to device) or "down" (from device to host). You will also notice that, in order to obtain accurate timings, we set up CUDA events for the start and stop of the sequence of copies. You probably remember how to do this from previous performance-testing examples, but in case you've forgotten, the following will jog your memory:

float cuda_malloc_test( int size, bool up ) {
    cudaEvent_t start, stop;
    int *a, *dev_a;
    float elapsedTime;

    HANDLE_ERROR( cudaEventCreate( &start ) );
    HANDLE_ERROR( cudaEventCreate( &stop ) );

    a = (int*)malloc( size * sizeof( *a ) );
    HANDLE_NULL( a );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a,
                              size * sizeof( *dev_a ) ) );

Independent of the direction of the copies, we start by allocating a host buffer and a GPU buffer, each holding size integers. We then perform 100 copies in the direction specified by the argument up, and stop the timer after we've finished copying.

    HANDLE_ERROR( cudaEventRecord( start, 0 ) );
    for (int i=0; i<100; i++) {
        if (up)
            HANDLE_ERROR( cudaMemcpy( dev_a, a,
                                      size * sizeof( *dev_a ),
                                      cudaMemcpyHostToDevice ) );
        else
            HANDLE_ERROR( cudaMemcpy( a, dev_a,
                                      size * sizeof( *dev_a ),
                                      cudaMemcpyDeviceToHost ) );
    }
    HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
    HANDLE_ERROR( cudaEventSynchronize( stop ) );
    HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime,
                                        start, stop ) );

After the copies, we clean up by freeing the host and GPU buffers as well as destroying our timing events.

    free( a );
    HANDLE_ERROR( cudaFree( dev_a ) );
    HANDLE_ERROR( cudaEventDestroy( start ) );
    HANDLE_ERROR( cudaEventDestroy( stop ) );

    return elapsedTime;
}

In case you didn't notice, the function cuda_malloc_test() allocated pageable host memory with the standard C malloc() routine. The pinned memory version uses cudaHostAlloc() to allocate a page-locked buffer.

float cuda_host_alloc_test( int size, bool up ) {
    cudaEvent_t start, stop;
    int *a, *dev_a;
    float elapsedTime;

    HANDLE_ERROR( cudaEventCreate( &start ) );
    HANDLE_ERROR( cudaEventCreate( &stop ) );

    HANDLE_ERROR( cudaHostAlloc( (void**)&a,
                                 size * sizeof( *a ),
                                 cudaHostAllocDefault ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a,
                              size * sizeof( *dev_a ) ) );

    HANDLE_ERROR( cudaEventRecord( start, 0 ) );
    for (int i=0; i<100; i++) {
        if (up)
            HANDLE_ERROR( cudaMemcpy( dev_a, a,
                                      size * sizeof( *a ),
                                      cudaMemcpyHostToDevice ) );
        else
            HANDLE_ERROR( cudaMemcpy( a, dev_a,
                                      size * sizeof( *a ),
                                      cudaMemcpyDeviceToHost ) );
    }
    HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
    HANDLE_ERROR( cudaEventSynchronize( stop ) );
    HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime,
                                        start, stop ) );

    HANDLE_ERROR( cudaFreeHost( a ) );
    HANDLE_ERROR( cudaFree( dev_a ) );
    HANDLE_ERROR( cudaEventDestroy( start ) );
    HANDLE_ERROR( cudaEventDestroy( stop ) );

    return elapsedTime;
}

As you can see, the buffer allocated by cudaHostAlloc() is used in the same way as a buffer allocated by malloc(). The other change from using malloc() lies in the last argument, the value cudaHostAllocDefault. This last argument stores a collection of flags that we can use to modify the behavior of cudaHostAlloc() in order to allocate other varieties of pinned host memory. In the next chapter, we'll see how to use the other possible values of these flags, but for now we're content to use the default, page-locked memory, so we pass cudaHostAllocDefault in order to get the default behavior. To free a buffer that was allocated with cudaHostAlloc(), we have to use cudaFreeHost(). That is, every malloc() needs a free(), and every cudaHostAlloc() needs a cudaFreeHost().

The body of main() proceeds not unlike what you would expect:

#include "../common/book.h"

#define SIZE (10*1024*1024)

int main( void ) {
    float elapsedTime;
    float MB = (float)100*SIZE*sizeof(int)/1024/1024;

    elapsedTime = cuda_malloc_test( SIZE, true );
    printf( "Time using cudaMalloc: %3.1f ms\n",
            elapsedTime );
    printf( "\tMB/s during copy up: %3.1f\n",
            MB/(elapsedTime/1000) );

Because the up argument to cuda_malloc_test() is true, the previous call tests the performance of copies from host to device, or "up" to the device. To benchmark the calls in the opposite direction, we execute the same calls but with false as the second argument:

    elapsedTime = cuda_malloc_test( SIZE, false );
    printf( "Time using cudaMalloc: %3.1f ms\n",
            elapsedTime );
    printf( "\tMB/s during copy down: %3.1f\n",
            MB/(elapsedTime/1000) );

We perform the same set of steps to test the performance of cudaHostAlloc(). We call cuda_host_alloc_test() twice, once with up as true and once with it false:

    elapsedTime = cuda_host_alloc_test( SIZE, true );
    printf( "Time using cudaHostAlloc: %3.1f ms\n",
            elapsedTime );
    printf( "\tMB/s during copy up: %3.1f\n",
            MB/(elapsedTime/1000) );

    elapsedTime = cuda_host_alloc_test( SIZE, false );
    printf( "Time using cudaHostAlloc: %3.1f ms\n",
            elapsedTime );
    printf( "\tMB/s during copy down: %3.1f\n",
            MB/(elapsedTime/1000) );
}
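
The MB/s figures printed by main() are simple arithmetic. Each test performs 100 copies of SIZE integers, so the total volume is 100 * SIZE * sizeof(int) / 1024 / 1024 megabytes. cudaEventElapsedTime() reports milliseconds, hence the division by 1000. The plain-C helpers below restate that computation so the numbers can be checked by hand. The function names are hypothetical, and the worked values assume a 4-byte int.

```c
#include <assert.h>

#define SIZE (10*1024*1024)
#define COPIES 100

/* Total megabytes moved by COPIES copies of SIZE ints (same as main()'s MB). */
float total_megabytes(void) {
    return (float)COPIES * SIZE * sizeof(int) / 1024 / 1024;
}

/* Bandwidth in MB/s, given an elapsed time in milliseconds. */
float megabytes_per_second(float elapsed_ms) {
    return total_megabytes() / (elapsed_ms / 1000.0f);
}
```

With SIZE set to 10*1024*1024, total_megabytes() returns 4000. A hypothetical elapsed time of 1000 ms would therefore be reported as 4000 MB/s, and 2000 ms as 2000 MB/s.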

On a GeForce GTX GPU, the bandwidth of copies from host to device rose substantially when we used pinned memory instead of pageable memory, and copies from the device down to the host improved similarly. So for most PCI Express bandwidth-limited applications, you will notice a marked improvement when using pinned memory versus standard pageable memory. But page-locked memory is not solely for performance enhancements. As we'll see in the next sections, there are situations where we are required to use page-locked memory.

CUDA Streams
In an earlier chapter, we introduced the concept of CUDA events. In doing so, we postponed an in-depth discussion of the second argument to cudaEventRecord(), mentioning only that it specified the stream into which we were inserting the event:

cudaEvent_t start;
cudaEventCreate(&start);
cudaEventRecord( start, 0 );

CUDA streams can play an important role in accelerating your applications. A CUDA stream represents a queue of GPU operations that get executed in a specific order. We can add operations such as kernel launches, memory copies, and event starts and stops into a stream. The order in which operations are added to the stream specifies the order in which they will be executed. You can think of each stream as a task on the GPU, and there are opportunities for these tasks to execute in parallel. We'll first see how streams are used, and then we'll look at how you can use streams to accelerate your applications.
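
A toy model of a stream as a first-in, first-out queue makes the in-order guarantee concrete. This is a hypothetical plain-C sketch, not the CUDA runtime. model_enqueue() only records an operation and returns at once, much as cudaMemcpyAsync() and stream kernel launches return before the work is done. model_stream_synchronize() plays the role of cudaStreamSynchronize() by running everything in the order it was added. All names here are assumptions.

```c
#include <assert.h>

#define MAX_OPS 16

typedef void (*op_fn)(int *state);

/* Hypothetical model of a CUDA stream: an ordered queue of operations. */
typedef struct {
    op_fn ops[MAX_OPS];
    int count;
} model_stream;

/* Enqueue returns immediately; nothing has executed yet. */
void model_enqueue(model_stream *s, op_fn op) {
    assert(s->count < MAX_OPS);
    s->ops[s->count++] = op;
}

/* Run every queued operation in insertion order, then empty the queue. */
void model_stream_synchronize(model_stream *s, int *state) {
    for (int i = 0; i < s->count; i++)
        s->ops[i](state);
    s->count = 0;
}

/* Three example operations whose combined result depends on their order. */
void copy_in(int *state)    { *state = *state * 10 + 1; }
void run_kernel(int *state) { *state = *state * 10 + 2; }
void copy_out(int *state)   { *state = *state * 10 + 3; }
```

After copy_in, run_kernel and copy_out are enqueued, the state stays at 0 until the synchronize call. After that call it is 123, which reflects the insertion order. The real runtime may begin executing queued work at any time after it is enqueued. Deferring everything until synchronization is a simplification that keeps only the ordering guarantee.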

Using a Single CUDA Stream
As we'll see later, the real power of streams becomes apparent only when we use more than one of them, but we'll begin to illustrate the mechanics of their use within an application that employs just a single stream. Imagine that we have a CUDA C kernel that will take two input buffers of data, a and b. The kernel will compute some result based on a combination of values in these buffers to produce an output buffer c. Our vector addition example did something along these lines, but in this example we'll compute an average of three values in a and three values in b:

#include "../common/book.h"

#define N (1024*1024)
#define FULL_DATA_SIZE (N*20)

__global__ void kernel( int *a, int *b, int *c ) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) {
        int idx1 = (idx + 1) % 256;
        int idx2 = (idx + 2) % 256;
        float as = (a[idx] + a[idx1] + a[idx2]) / 3.0f;
        float bs = (b[idx] + b[idx1] + b[idx2]) / 3.0f;
        c[idx] = (as + bs) / 2;
    }
}

This kernel is not incredibly important, so don't get too hung up on it if you aren't sure exactly what it's supposed to be computing. It's something of a placeholder, since the important, stream-related component of this example resides in main().

int main( void ) {
    cudaDeviceProp prop;
    int whichDevice;
    HANDLE_ERROR( cudaGetDevice( &whichDevice ) );
    HANDLE_ERROR( cudaGetDeviceProperties( &prop, whichDevice ) );
    if (!prop.deviceOverlap) {
        printf( "Device will not handle overlaps, so no "
                "speed up from streams\n" );
        return 0;
    }

The first thing we do is query the current device and check whether it supports a feature known as device overlap. A GPU that supports device overlap can execute a CUDA C kernel while simultaneously performing a copy between device and host memory. As we've promised before, we'll use multiple streams to achieve this overlap of computation and data transfer, but first we'll see how to create and use a single stream. As with all of our examples that aim to measure performance improvements (or regressions), we begin by creating and starting an event timer:

    cudaEvent_t start, stop;
    float elapsedTime;

    // start the timers
    HANDLE_ERROR( cudaEventCreate( &start ) );
    HANDLE_ERROR( cudaEventCreate( &stop ) );
    HANDLE_ERROR( cudaEventRecord( start, 0 ) );

After starting our timer, we create the stream we want to use for this application:

    // initialize the stream
    cudaStream_t stream;
    HANDLE_ERROR( cudaStreamCreate( &stream ) );

Yeah, that's pretty much all it takes to create a stream. It's not really worth dwelling on, so let's press on to the data allocation.

    int *host_a, *host_b, *host_c;
    int *dev_a, *dev_b, *dev_c;

    // allocate the memory on the GPU
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a,
                              N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b,
                              N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c,
                              N * sizeof(int) ) );

    // allocate page-locked memory, used to stream
    HANDLE_ERROR( cudaHostAlloc( (void**)&host_a,
                                 FULL_DATA_SIZE * sizeof(int),
                                 cudaHostAllocDefault ) );
    HANDLE_ERROR( cudaHostAlloc( (void**)&host_b,
                                 FULL_DATA_SIZE * sizeof(int),
                                 cudaHostAllocDefault ) );
    HANDLE_ERROR( cudaHostAlloc( (void**)&host_c,
                                 FULL_DATA_SIZE * sizeof(int),
                                 cudaHostAllocDefault ) );

    for (int i=0; i<FULL_DATA_SIZE; i++) {
        host_a[i] = rand();
        host_b[i] = rand();
    }

We have allocated our input and output buffers on both the GPU and the host. Notice that we've decided to use pinned memory on the host, using cudaHostAlloc() to perform the allocations. There is a very good reason for using pinned memory, and it's not strictly because it makes copies faster. We'll see the details in a moment, but we will be using a new kind of cudaMemcpy() function, and this new function needs page-locked host memory in order to run asynchronously. After allocating the input buffers, we fill the host allocations with random integers using the C library call rand().

With our stream and our timing events created and our device and host buffers allocated, we're ready to perform some computations. Typically we blast through this stage by copying the two input buffers to the GPU, launching our kernel, and copying the output buffer back to the host. We will follow this pattern again, but this time with some small changes.

First, we will opt not to copy the input buffers in their entirety to the GPU. Rather, we will split our inputs into smaller chunks and perform the three-step process on each chunk. That is, we will take some fraction of the input buffers, copy them to the GPU, execute our kernel on that fraction of the buffers, and copy the resulting fraction of the output buffer back to the host. Imagine that we need to do this because our GPU has much less memory than our host does, so the computation needs to be staged in chunks because the entire buffer can't fit on the GPU at once. The code to perform this "chunkified" sequence of computations will look like this:

    // now loop over full data, in bite-sized chunks
    for (int i=0; i<FULL_DATA_SIZE; i+= N) {
        // copy the locked memory to the device, async
        HANDLE_ERROR( cudaMemcpyAsync( dev_a, host_a+i,
                                       N * sizeof(int),
                                       cudaMemcpyHostToDevice,
                                       stream ) );
        HANDLE_ERROR( cudaMemcpyAsync( dev_b, host_b+i,
                                       N * sizeof(int),
                                       cudaMemcpyHostToDevice,
                                       stream ) );

        kernel<<<N/256,256,0,stream>>>( dev_a, dev_b, dev_c );

        // copy the data from device to locked memory
        HANDLE_ERROR( cudaMemcpyAsync( host_c+i, dev_c,
                                       N * sizeof(int),
                                       cudaMemcpyDeviceToHost,
                                       stream ) );
    }

But you will notice two other unexpected shifts from the norm in the preceding excerpt. First, instead of using the familiar cudaMemcpy(), we're copying the data to and from the GPU with a new routine, cudaMemcpyAsync(). The difference between these functions is subtle yet significant. The original cudaMemcpy() behaves like the C library function memcpy(). Specifically, this function executes synchronously, meaning that when the function returns, the copy has completed, and the output buffer now contains the contents that were supposed to be copied into it.

The opposite of a synchronous function is an asynchronous function, which inspired the name cudaMemcpyAsync(). The call to cudaMemcpyAsync() simply places a request to perform a memory copy into the stream specified by the argument stream. When the call returns, there is no guarantee that the copy has even started yet, much less that it has finished. The guarantee that we have is that the copy will definitely be performed before the next operation placed into the same stream. For the copy to be truly asynchronous, any host memory pointers passed to cudaMemcpyAsync() must refer to page-locked memory, such as memory allocated by cudaHostAlloc(). That is, you can schedule asynchronous copies only to or from page-locked memory.

Notice that the angle-bracketed kernel launch also takes an optional stream argument. It is the fourth launch parameter; the third, 0 here, is the amount of dynamically allocated shared memory. This kernel launch is asynchronous, just like the preceding two memory copies to the GPU and the trailing memory copy back from the GPU. Technically, we can end an iteration of this loop without having actually started any of the memory copies or the kernel execution. As we mentioned, all that we are guaranteed is that the first copy placed into the stream will execute before the second copy. Moreover, the second copy will complete before the kernel starts, and the kernel will complete before the third copy starts. So as we mentioned earlier in this chapter, a stream acts just like an ordered queue of work for the GPU to perform.

When the for() loop has terminated, there could still be quite a bit of work queued up for the GPU to finish. If we would like to guarantee that the GPU is done with its computations and memory copies, we need to synchronize it with the host. That is, we basically want to tell the host to sit around and wait for the GPU to finish before proceeding. We accomplish that by calling cudaStreamSynchronize() and specifying the stream that we want to wait for:

    // block the host until all work queued in the stream has completed
    HANDLE_ERROR( cudaStreamSynchronize( stream ) );

Since the computations and copies have completed after synchronizing stream with the host, we can stop our timer, collect our performance data, and free our input and output buffers.

    HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
    HANDLE_ERROR( cudaEventSynchronize( stop ) );
    HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime,
                                        start, stop ) );
    printf( "Time taken: %3.1f ms\n", elapsedTime );

    // cleanup the streams and memory
    HANDLE_ERROR( cudaFreeHost( host_a ) );
    HANDLE_ERROR( cudaFreeHost( host_b ) );
    HANDLE_ERROR( cudaFreeHost( host_c ) );
    HANDLE_ERROR( cudaFree( dev_a ) );
    HANDLE_ERROR( cudaFree( dev_b ) );
    HANDLE_ERROR( cudaFree( dev_c ) );

Finally, before exiting the application, we destroy the stream that we were using to queue the GPU operations.

    HANDLE_ERROR( cudaStreamDestroy( stream ) );

    return 0;
}

To be honest, this example has done very little to demonstrate the power of streams. Of course, even using a single stream can help speed up an application if we have work we want to complete on the host while the GPU is busy churning through the work we've stuffed into a stream. But assuming that we don't have much to do on the host, we can still speed up applications by using streams, and in the next section we'll take a look at how this can be accomplished.
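
The chunked loop of the single-stream example can be checked with a CPU-only model in plain C. In this hypothetical sketch, CHUNK-sized local arrays stand in for the GPU allocations dev_a, dev_b and dev_c. process_chunked() walks the full host arrays CHUNK elements at a time: copy in, compute, copy out, in the same order the stream executes them. The computation is simplified to (a + b) / 2 rather than the three-value average in kernel(), and all names are assumptions.

```c
#include <assert.h>
#include <string.h>

#define CHUNK 4  /* stands in for N; tiny so the example is easy to trace */

/* Hypothetical CPU model of the single-stream chunked loop.
   dev_a, dev_b, dev_c are CHUNK-sized, like the GPU allocations;
   the host arrays hold the full data set. */
void process_chunked(const int *host_a, const int *host_b, int *host_c,
                     int full_size) {
    int dev_a[CHUNK], dev_b[CHUNK], dev_c[CHUNK];

    assert(full_size % CHUNK == 0);  /* the book's sizes divide evenly too */
    for (int i = 0; i < full_size; i += CHUNK) {
        memcpy(dev_a, host_a + i, sizeof(dev_a));   /* "copy up" a */
        memcpy(dev_b, host_b + i, sizeof(dev_b));   /* "copy up" b */
        for (int idx = 0; idx < CHUNK; idx++)       /* "kernel" */
            dev_c[idx] = (dev_a[idx] + dev_b[idx]) / 2;
        memcpy(host_c + i, dev_c, sizeof(dev_c));   /* "copy down" c */
    }
}
```

With a[i] = i and b[i] = 3 * i over 20 elements, every c[i] comes out as 2 * i. That is exactly what processing the whole arrays at once would give, even though only CHUNK elements are ever resident in the stand-in device buffers.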

Using Multiple CUDA Streams
Let's adapt the single-stream example from the earlier section, "Using a Single CUDA Stream," to perform its work in two different streams. At the beginning of the previous example, we checked that the device indeed supported overlap and broke the computation into chunks. The idea underlying the improved version of this application is simple and relies on two things: the "chunked" computation and the overlap of memory copies with kernel execution. We endeavor to get stream 0 to copy its input buffers to the GPU while stream 1 is executing its kernel. Then stream 0 will execute its kernel while stream 1 copies its results to the host. Stream 0 will then copy its results to the host while stream 1 begins executing its kernel on the next chunk of data. Assuming that our memory copies and kernel executions take roughly the same amount of time, our application's execution timeline might look something like Figure 10.1. The figure assumes that the GPU can perform a memory copy and a kernel execution at the same time. Empty boxes therefore represent time when one stream is waiting to execute an operation that it cannot overlap with the other stream's operation. Note also that calls to cudaMemcpyAsync() are abbreviated in the remaining figures in this chapter, represented simply as "memcpy."

Figure 10.1 Timeline of intended application execution using two independent streams

In fact, the execution timeline can be even more favorable than this; some newer NVIDIA GPUs support simultaneous kernel execution and two memory copies, one to the device and one from the device. But on any device that supports the overlap of memory copies and kernel execution, the overall application should accelerate when we use multiple streams.

Despite these grand plans to accelerate our application, the computation kernel will remain unchanged.

#include "../common/book.h"

#define N (1024*1024)
#define FULL_DATA_SIZE (N*20)

__global__ void kernel( int *a, int *b, int *c ) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) {
        int idx1 = (idx + 1) % 256;
        int idx2 = (idx + 2) % 256;
        float as = (a[idx] + a[idx1] + a[idx2]) / 3.0f;
        float bs = (b[idx] + b[idx1] + b[idx2]) / 3.0f;
        c[idx] = (as + bs) / 2;
    }
}

As with the single-stream version, we will check that the device supports overlapping computation with memory copy. If the device does support overlap, we proceed as we did before by creating CUDA events to time the application.

int main( void ) {
    cudaDeviceProp prop;
    int whichDevice;
    HANDLE_ERROR( cudaGetDevice( &whichDevice ) );
    HANDLE_ERROR( cudaGetDeviceProperties( &prop, whichDevice ) );
    if (!prop.deviceOverlap) {
        printf( "Device will not handle overlaps, so no "
                "speed up from streams\n" );
        return 0;
    }

    cudaEvent_t start, stop;
    float elapsedTime;

    // start the timers
    HANDLE_ERROR( cudaEventCreate( &start ) );
    HANDLE_ERROR( cudaEventCreate( &stop ) );
    HANDLE_ERROR( cudaEventRecord( start, 0 ) );

Next, we create our two streams exactly as we created the single stream in the previous section's version of the code.

    // initialize the streams
    cudaStream_t stream0, stream1;
    HANDLE_ERROR( cudaStreamCreate( &stream0 ) );
    HANDLE_ERROR( cudaStreamCreate( &stream1 ) );

We will assume that we still have two input buffers and a single output buffer on the host. The input buffers are filled with random data exactly as they were in the single-stream version of this application. However, now that we intend to use two streams to process the data, we allocate two identical sets of GPU buffers so that each stream can independently work on chunks of the input.

    int *host_a, *host_b, *host_c;
    int *dev_a0, *dev_b0, *dev_c0;   // GPU buffers for stream0
    int *dev_a1, *dev_b1, *dev_c1;   // GPU buffers for stream1

    // allocate the memory on the GPU
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a0,
                              N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b0,
                              N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c0,
                              N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a1,
                              N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b1,
                              N * sizeof(int) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c1,
                              N * sizeof(int) ) );

    // allocate page-locked memory, used to stream
    HANDLE_ERROR( cudaHostAlloc( (void**)&host_a,
                                 FULL_DATA_SIZE * sizeof(int),
                                 cudaHostAllocDefault ) );
    HANDLE_ERROR( cudaHostAlloc( (void**)&host_b,
                                 FULL_DATA_SIZE * sizeof(int),
                                 cudaHostAllocDefault ) );
    HANDLE_ERROR( cudaHostAlloc( (void**)&host_c,
                                 FULL_DATA_SIZE * sizeof(int),
                                 cudaHostAllocDefault ) );

    for (int i=0; i<FULL_DATA_SIZE; i++) {
        host_a[i] = rand();
        host_b[i] = rand();
    }

We then loop over the chunks of input exactly as we did in the first attempt at this application. But now that we're using two streams, we process twice as much data in each iteration of the for() loop. In stream0, we queue asynchronous copies of a and b to the GPU, queue a kernel execution, and then queue a copy back to c:

    // now loop over full data, in bite-sized chunks
    for (int i=0; i<FULL_DATA_SIZE; i+= N*2) {
        // copy the locked memory to the device, async
        HANDLE_ERROR( cudaMemcpyAsync( dev_a0, host_a+i,
                                       N * sizeof(int),
                                       cudaMemcpyHostToDevice,
                                       stream0 ) );
        HANDLE_ERROR( cudaMemcpyAsync( dev_b0, host_b+i,
                                       N * sizeof(int),
                                       cudaMemcpyHostToDevice,
                                       stream0 ) );

        kernel<<<N/256,256,0,stream0>>>( dev_a0, dev_b0, dev_c0 );

        // copy the data from device to locked memory
        HANDLE_ERROR( cudaMemcpyAsync( host_c+i, dev_c0,
                                       N * sizeof(int),
                                       cudaMemcpyDeviceToHost,
                                       stream0 ) );

After queuing these operations in stream0, we queue identical operations on the next chunk of data, but this time in stream1.

        // copy the locked memory to the device, async
        HANDLE_ERROR( cudaMemcpyAsync( dev_a1, host_a+i+N,
                                       N * sizeof(int),
                                       cudaMemcpyHostToDevice,
                                       stream1 ) );
        HANDLE_ERROR( cudaMemcpyAsync( dev_b1, host_b+i+N,
                                       N * sizeof(int),
                                       cudaMemcpyHostToDevice,
                                       stream1 ) );

        kernel<<<N/256,256,0,stream1>>>( dev_a1, dev_b1, dev_c1 );

        // copy the data from device to locked memory
        HANDLE_ERROR( cudaMemcpyAsync( host_c+i+N, dev_c1,
                                       N * sizeof(int),
                                       cudaMemcpyDeviceToHost,
                                       stream1 ) );
    }

And so our for() loop proceeds, alternating the streams to which it queues each chunk of data until it has queued every piece of input data for processing. After terminating the for() loop, we synchronize the GPU with the CPU before we stop our application timers. Since we are working in two streams, we need to synchronize both.

    HANDLE_ERROR( cudaStreamSynchronize( stream0 ) );
    HANDLE_ERROR( cudaStreamSynchronize( stream1 ) );

We wrap up main() the same way we concluded our single-stream implementation. We stop our timers, display the elapsed time, and clean up after ourselves. Of course, we remember that we now need to destroy two streams and free twice as many GPU buffers, but aside from that, this code is identical to what we've seen already:

    HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
    HANDLE_ERROR( cudaEventSynchronize( stop ) );
    HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime,
                                        start, stop ) );
    printf( "Time taken: %3.1f ms\n", elapsedTime );

    // cleanup the streams and memory
    HANDLE_ERROR( cudaFreeHost( host_a ) );
    HANDLE_ERROR( cudaFreeHost( host_b ) );
    HANDLE_ERROR( cudaFreeHost( host_c ) );
    HANDLE_ERROR( cudaFree( dev_a0 ) );
    HANDLE_ERROR( cudaFree( dev_b0 ) );
    HANDLE_ERROR( cudaFree( dev_c0 ) );
    HANDLE_ERROR( cudaFree( dev_a1 ) );
    HANDLE_ERROR( cudaFree( dev_b1 ) );
    HANDLE_ERROR( cudaFree( dev_c1 ) );
    HANDLE_ERROR( cudaStreamDestroy( stream0 ) );
    HANDLE_ERROR( cudaStreamDestroy( stream1 ) );

    return 0;
}
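
The offset arithmetic of the two-stream loop is easy to get wrong, so here is a hypothetical plain-C model of it. Each iteration hands the chunk at offset i to the stream0 buffers and the chunk at offset i + N to the stream1 buffers, then advances by 2 * N. The function assign_chunks() and the CHUNK constant, which stands in for N, are assumptions. The model records which stream handled each chunk and how many times.

```c
#include <assert.h>

#define CHUNK 4  /* stands in for N */

/* Hypothetical model of the two-stream loop's indexing.
   owner[k] receives the stream (0 or 1) that processed chunk k,
   and hits[k] counts how many times chunk k was processed. */
void assign_chunks(int full_size, int *owner, int *hits) {
    int chunks = full_size / CHUNK;

    assert(full_size % (2 * CHUNK) == 0);  /* loop assumes an even number of chunks */
    for (int k = 0; k < chunks; k++)
        hits[k] = 0;
    for (int i = 0; i < full_size; i += 2 * CHUNK) {
        int k0 = i / CHUNK;            /* host_a + i     -> stream0 */
        int k1 = (i + CHUNK) / CHUNK;  /* host_a + i + N -> stream1 */
        owner[k0] = 0; hits[k0]++;
        owner[k1] = 1; hits[k1]++;
    }
}
```

For 40 elements in chunks of 4, each of the 10 chunks is processed exactly once. Even-numbered chunks go to stream0 and odd-numbered chunks go to stream1. The assert makes explicit an assumption the loop in main() relies on silently: FULL_DATA_SIZE must be a multiple of 2 * N, which holds for N*20.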

We benchmarked both the original, single-stream implementation from the earlier section, "Using a Single CUDA Stream," and the improved double-stream version on a GeForce GTX GPU. After we modified the application to use two streams, its running time was essentially unchanged from the original version's.

Uh-oh.

Well, the good news is that this is the reason we bother to time our applications. Sometimes, our most well-intended performance "enhancements" do nothing more than introduce unnecessary complications to the code.

But why didn't this application get any faster? We even said that it would get faster! Don't lose hope yet, though, because we actually can accelerate the single-stream version with a second stream, but we need to understand a bit more about how streams are handled by the CUDA driver in order to reap the rewards of device overlap. To understand how streams work behind the scenes, we'll need to look at both the CUDA driver and how the CUDA hardware architecture works.

GPU Work Scheduling
Although streams are logically independent queues of operations to be executed on the GPU, it turns out that this abstraction does not exactly match the GPU’s queuing mechanism. As programmers, we think about our streams as ordered sequences of operations composed of a mixture of memory copies and kernel invocations. However, the hardware has no notion of streams. Rather, it has one or more engines to perform memory copies and an engine to execute kernels. These engines queue commands independently from each other, resulting in a task-scheduling scenario like the one shown in Figure 10.2. The arrows in the figure illustrate how operations that have been queued into streams get scheduled on the hardware engines that actually execute them.

So, the user and the hardware have somewhat orthogonal notions of how to queue GPU work, and the burden of keeping both the user and hardware sides of this equation happy falls on the CUDA driver. First and foremost, the order in which operations are added to streams specifies important dependencies. For example, in Figure 10.2, a stream’s memory copy of A needs to be completed before its memory copy of B, which in turn needs to be completed before kernel A is launched. But once these operations are placed into the hardware’s copy engine and kernel engine queues, these dependencies are lost. The CUDA driver therefore needs to keep everyone happy by ensuring that the hardware’s execution units still satisfy the intrastream dependencies.
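A small host-only C++ model can make this mapping concrete. The sketch below is hypothetical: Op, Engine, and engineQueues() are not CUDA API names. As in the figure, it assumes one copy engine and one kernel engine. Operations arrive in program order, each tagged with its stream, and the hardware side keeps only one FIFO per engine.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model: which hardware engine executes an operation.
enum Engine { COPY_ENGINE = 0, KERNEL_ENGINE = 1 };

struct Op {
    std::string name;  // e.g. "s0:A" for stream 0's copy of A
    int stream;        // the stream the op was issued into
    Engine engine;     // the engine that will execute it
};

// Splits a program-order list of operations into one FIFO per engine.
// The stream survives only as a field on each op; the engine queues
// themselves are ordered purely by issue order, not by stream.
std::vector<std::vector<Op>> engineQueues(const std::vector<Op>& issueOrder) {
    std::vector<std::vector<Op>> queues(2);
    for (const Op& op : issueOrder) {
        assert(op.engine == COPY_ENGINE || op.engine == KERNEL_ENGINE);
        queues[op.engine].push_back(op);
    }
    return queues;
}
```

In the check below, stream 1’s copy of A lands behind stream 0’s copy of C in the copy engine’s FIFO, even though the two streams are logically independent.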

Figure 10.2 Mapping of CUDA streams onto GPU engines


What does this mean to us? Well, let’s look at what’s actually happening with our example in the section “Using Multiple CUDA Streams.” If we review the code, we see that our application basically amounts to a cudaMemcpyAsync() of a, a cudaMemcpyAsync() of b, our kernel execution, and then a cudaMemcpyAsync() of c back to the host. The application enqueues all the operations from stream 0 followed by all the operations from stream 1. The CUDA driver schedules these operations on the hardware for us in the order they were specified, keeping the inter-engine dependencies straight. Figure 10.3 illustrates these dependencies. An arrow from a copy to a kernel indicates that the copy cannot begin until the kernel completes execution.

Given our newfound understanding of how the GPU schedules work, we can look at a timeline of how these operations get executed on the hardware in Figure 10.4.

Stream 0’s copy of c back to the host depends on its kernel execution completing. Because the GPU’s engines execute work in the order it’s provided, stream 1’s completely independent copies of a and b to the GPU get blocked behind that copy. This inefficiency explains why the two-stream version of our application showed absolutely no speedup. The lack of improvement is a direct result of our assumption that the hardware works in the same manner as the CUDA stream programming model implies.
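A toy timing model reproduces the blocking. This is a hypothetical, host-only C++ sketch, not the driver’s actual scheduler. It assumes:

- one copy engine and one kernel engine;
- in-order execution on each engine;
- one time unit per operation.

An operation starts once its engine has finished the previous entry in its queue and its own stream’s previous operation has finished.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

enum Engine { COPY_ENGINE = 0, KERNEL_ENGINE = 1 };

struct Op {
    int stream;     // stream the op was issued into
    Engine engine;  // engine that executes it
    int duration;   // arbitrary time units (assumed, not measured)
};

// Returns the finish time of the last op. Ops are processed in issue order;
// each waits for (a) its engine's FIFO and (b) its stream's previous op.
int makespan(const std::vector<Op>& issueOrder, int numStreams) {
    assert(numStreams > 0);
    int engineFree[2] = {0, 0};
    std::vector<int> streamDone(numStreams, 0);
    int last = 0;
    for (const Op& op : issueOrder) {
        int start = std::max(engineFree[op.engine], streamDone[op.stream]);
        int end = start + op.duration;
        engineFree[op.engine] = end;
        streamDone[op.stream] = end;
        last = std::max(last, end);
    }
    return last;
}

// Depth-first: all of stream 0's work (copy a, copy b, kernel, copy c),
// then all of stream 1's, as in the naive two-stream loop.
std::vector<Op> depthFirst(int numStreams) {
    std::vector<Op> ops;
    for (int s = 0; s < numStreams; ++s) {
        ops.push_back({s, COPY_ENGINE, 1});    // copy a
        ops.push_back({s, COPY_ENGINE, 1});    // copy b
        ops.push_back({s, KERNEL_ENGINE, 1});  // kernel
        ops.push_back({s, COPY_ENGINE, 1});    // copy c back
    }
    return ops;
}
```

Under these assumptions, the depth-first order finishes at time 8, the same as putting all eight operations into a single stream. Stream 1’s first copy cannot begin until stream 0’s copy of c has left the copy engine, and that copy itself waits on stream 0’s kernel.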

Figure 10.3 Arrows depicting the dependency of cudaMemcpyAsync() calls on kernel executions in the example from the section “Using Multiple CUDA Streams”


Figure 10.4 Execution timeline of the example from the section “Using Multiple CUDA Streams”

The moral of this story is that we as programmers need to help out when it comes to ensuring that independent streams actually get executed in parallel. The hardware has independent engines that handle memory copies and kernel executions. With that in mind, we need to remain aware that the order in which we enqueue these operations in our streams will affect the way in which the CUDA driver schedules them for execution. In the next section, we’ll see how to help the hardware achieve overlap of memory copies and kernel execution.

Using Multiple CUDA Streams Effectively
As we saw in the previous section, if we schedule all of a particular stream’s operations at once, it’s very easy to inadvertently block the copies or kernel executions of another stream. To alleviate this problem, it suffices to enqueue our operations breadth-first across streams rather than depth-first. That is, rather than add the copy of a, copy of b, kernel execution, and copy of c to stream 0 before starting to schedule on stream 1, we bounce back and forth between the streams assigning work. We add the copy of a to stream 0, and then we add the copy of a to stream 1. Then we add the copy of b to stream 0, and then we add the copy of b to stream 1. We enqueue the kernel invocation in stream 0, and then we enqueue one in stream 1. Finally, we enqueue the copy of c back to the host in stream 0 followed by the copy of c in stream 1.
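The round-robin rule extends to any number of streams. The hypothetical helper below is a host-only C++ sketch that only lists the issue order for one chunk per stream. In the real loop, each entry would be a cudaMemcpyAsync() call or a kernel launch on the named stream.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Lists the order in which one loop iteration issues its work when the
// operations are assigned breadth-first (round-robin) across streams:
// every stream's copy of a, then every stream's copy of b, then every
// stream's kernel, then every stream's copy of c.
std::vector<std::string> breadthFirstOrder(int numStreams) {
    assert(numStreams > 0);
    const char* steps[] = {"copy a", "copy b", "kernel", "copy c"};
    std::vector<std::string> order;
    for (const char* step : steps)
        for (int s = 0; s < numStreams; ++s)
            order.push_back("stream" + std::to_string(s) + ": " + step);
    return order;
}
```

With two streams, this produces exactly the eight-step sequence described above.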

To make this more concrete, let’s take a look at the code. All we’ve changed is the order in which operations get assigned to each of our two streams, so this will be strictly a copy-and-paste optimization. Everything else in the application will remain unchanged, which means that our improvements are localized to the for() loop. The new, breadth-first assignment to the two streams looks like this:

    for (int i=0; i<FULL_DATA_SIZE; i+= N*2) {
        // enqueue copies of a in stream0 and stream1
        HANDLE_ERROR( cudaMemcpyAsync( dev_a0, host_a+i,
                                       N * sizeof(int),
                                       cudaMemcpyHostToDevice,
                                       stream0 ) );
        HANDLE_ERROR( cudaMemcpyAsync( dev_a1, host_a+i+N,
                                       N * sizeof(int),
                                       cudaMemcpyHostToDevice,
                                       stream1 ) );
        // enqueue copies of b in stream0 and stream1
        HANDLE_ERROR( cudaMemcpyAsync( dev_b0, host_b+i,
                                       N * sizeof(int),
                                       cudaMemcpyHostToDevice,
                                       stream0 ) );
        HANDLE_ERROR( cudaMemcpyAsync( dev_b1, host_b+i+N,
                                       N * sizeof(int),
                                       cudaMemcpyHostToDevice,
                                       stream1 ) );

        // enqueue kernels in stream0 and stream1
        kernel<<<N/256,256,0,stream0>>>( dev_a0, dev_b0, dev_c0 );
        kernel<<<N/256,256,0,stream1>>>( dev_a1, dev_b1, dev_c1 );

        // enqueue copies of c from device to locked memory
        HANDLE_ERROR( cudaMemcpyAsync( host_c+i, dev_c0,
                                       N * sizeof(int),
                                       cudaMemcpyDeviceToHost,
                                       stream0 ) );
        HANDLE_ERROR( cudaMemcpyAsync( host_c+i+N, dev_c1,
                                       N * sizeof(int),
                                       cudaMemcpyDeviceToHost,
                                       stream1 ) );
    }

If we assume that our memory copies and kernel executions are roughly comparable in execution time, our new execution timeline will look like Figure 10.5. The inter-engine dependencies are highlighted with arrows simply to illustrate that this new scheduling order still satisfies them.

Because we have queued our operations breadth-first across streams, stream 0’s copy of c no longer blocks stream 1’s initial memory copies of a and b. This allows the GPU to execute copies and kernels in parallel, so our application runs significantly faster than our original, naïve double-stream implementation. For applications that can overlap nearly all computation and memory copies, you can approach a nearly twofold improvement in performance because the copy and kernel engines will be cranking the entire time.
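Feeding the breadth-first order into the same kind of toy model shows the effect of the reordering. As before, this is a hypothetical host-only C++ sketch. It assumes one copy engine, one kernel engine, in-order engine queues, and unit durations. The resulting numbers reflect those assumptions, not measurements.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

enum Engine { COPY_ENGINE = 0, KERNEL_ENGINE = 1 };

struct Op {
    int stream;     // stream the op was issued into
    Engine engine;  // engine that executes it
    int duration;   // arbitrary time units (assumed)
};

// Finish time of the last op; each op waits for its engine's FIFO
// and for the previous op in its own stream.
int makespan(const std::vector<Op>& issueOrder, int numStreams) {
    assert(numStreams > 0);
    int engineFree[2] = {0, 0};
    std::vector<int> streamDone(numStreams, 0);
    int last = 0;
    for (const Op& op : issueOrder) {
        int start = std::max(engineFree[op.engine], streamDone[op.stream]);
        int end = start + op.duration;
        engineFree[op.engine] = end;
        streamDone[op.stream] = end;
        last = std::max(last, end);
    }
    return last;
}

// The four steps of one chunk: copy a, copy b, kernel, copy c back.
const Engine kSteps[] = {COPY_ENGINE, COPY_ENGINE, KERNEL_ENGINE, COPY_ENGINE};

std::vector<Op> depthFirst(int numStreams) {    // stream by stream
    std::vector<Op> ops;
    for (int s = 0; s < numStreams; ++s)
        for (Engine e : kSteps) ops.push_back({s, e, 1});
    return ops;
}

std::vector<Op> breadthFirst(int numStreams) {  // step by step
    std::vector<Op> ops;
    for (Engine e : kSteps)
        for (int s = 0; s < numStreams; ++s) ops.push_back({s, e, 1});
    return ops;
}
```

In the model, the breadth-first order finishes at time 6 instead of 8. The copy engine is busy for the entire run, so with three copies per kernel the model is copy-bound. The nearly twofold bound mentioned above applies when copy time and kernel time are balanced.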

Figure 10.5 Execution timeline of the improved example with arrows indicating inter-engine dependencies

Chapter Review
In this chapter, we looked at a method for achieving a kind of task-level parallelism in CUDA C applications. By using two (or more) CUDA streams, we can allow the GPU to simultaneously execute a kernel while performing a copy between the host and GPU. We need to be careful about two things when we endeavor to do this, though. First, the host memory involved needs to be allocated using cudaHostAlloc(). We will queue our memory copies with cudaMemcpyAsync(), and asynchronous copies need to be performed with pinned buffers. Second, we need to be aware that the order in which we add operations to our streams will affect our capacity to overlap copies and kernel executions. The general guideline is a breadth-first, or round-robin, assignment of work to the streams you intend to use. This can be counterintuitive if you don’t understand how the hardware queuing works, so it’s a good thing to remember when you go about writing your own applications.



Chapter 11
CUDA C on Multiple GPUs

There is an old saying that goes something like this: “The only thing better than computing on a GPU is computing on two GPUs.” Systems containing multiple graphics processors have become more and more common in recent years. Of course, in some ways multi-GPU systems are similar to multi-CPU systems in that they are still far from the common system configuration, but it has gotten quite easy to end up with more than one GPU in your system. Some GeForce GTX products contain two GPUs on a single card. One of NVIDIA’s Tesla S-series units contains a whopping four CUDA-capable graphics processors. Systems built around a recent NVIDIA chipset will have an integrated CUDA-capable GPU on the motherboard. Adding a discrete NVIDIA GPU in one of the PCI Express slots will make this system multi-GPU. Neither of these scenarios is very far-fetched, so we would be best served by learning to exploit the resources of a system with multiple GPUs in it.


Chapter Objectives
Through the course of this chapter, you will accomplish the following:

• You will learn how to allocate and use zero-copy memory.

• You will learn how to use multiple GPUs within the same application.

• You will learn how to allocate and use portable pinned memory.

Zero-Copy Host Memory
In Chapter 10, we examined pinned or page-locked memory, a new type of host memory that came with the guarantee that the buffer would never be swapped out of physical memory. If you recall, we allocated this memory by making a call to cudaHostAlloc() and passing cudaHostAllocDefault to get default, pinned memory. We promised that in the next chapter, you would see other, more exciting means by which you can allocate pinned memory. Assuming that this is the only reason you’ve continued reading, you will be glad to know that the wait is over. The flag cudaHostAllocMapped can be passed instead of cudaHostAllocDefault. The host memory allocated using cudaHostAllocMapped is pinned in the same sense that memory allocated with cudaHostAllocDefault is pinned: it cannot be paged out of or relocated within physical memory. But this new kind of host memory does more than serve the host for memory copies to and from the GPU. It also allows us to violate one of the first rules we presented concerning host memory: We can access this host memory directly from within CUDA C kernels. Because this memory does not require copies to and from the GPU, we refer to it as zero-copy memory.

ZERO-COPY DOT PRODUCT
Typically, our GPU accesses only GPU memory, and our CPU accesses only host memory. But in some circumstances, it’s better to break these rules. To see an instance where it’s better to have the GPU manipulate host memory, we’ll revisit our favorite reduction: the vector dot product. If you’ve managed to read this entire book, you may recall our first attempt at the dot product. We copied the two input vectors to the GPU, performed the computation, copied the intermediate results back to the host, and completed the computation on the CPU.


In this version, we’ll skip the explicit copies of our input up to the GPU and instead use zero-copy memory to access the data directly from the GPU. This version of dot product will be set up exactly like our pinned memory test. Specifically, we’ll write two functions. One will perform the test with standard host memory, and the other will run the same GPU computation using zero-copy memory to hold the input and output buffers. First, let’s take a look at the standard host memory version of the dot product. We start in the usual fashion by creating timing events, allocating input and output buffers, and filling our input buffers with data.

float malloc_test( int size ) {
    cudaEvent_t start, stop;
    float *a, *b, c, *partial_c;
    float *dev_a, *dev_b, *dev_partial_c;
    float elapsedTime;

    HANDLE_ERROR( cudaEventCreate( &start ) );
    HANDLE_ERROR( cudaEventCreate( &stop ) );

    // allocate memory on the CPU side
    a = (float*)malloc( size*sizeof(float) );
    b = (float*)malloc( size*sizeof(float) );
    partial_c = (float*)malloc( blocksPerGrid*sizeof(float) );

    // allocate the memory on the GPU
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a,
                              size*sizeof(float) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b,
                              size*sizeof(float) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_partial_c,
                              blocksPerGrid*sizeof(float) ) );

    // fill in the host memory with data
    for (int i=0; i<size; i++) {
        a[i] = i;
        b[i] = i*2;
    }


After the allocations and data creation, we can begin the computations. We start our timer, copy our inputs to the GPU, execute the dot product kernel, and copy the partial results back to the host.

    HANDLE_ERROR( cudaEventRecord( start, 0 ) );
    // copy the arrays 'a' and 'b' to the GPU
    HANDLE_ERROR( cudaMemcpy( dev_a, a, size*sizeof(float),
                              cudaMemcpyHostToDevice ) );
    HANDLE_ERROR( cudaMemcpy( dev_b, b, size*sizeof(float),
                              cudaMemcpyHostToDevice ) );

    dot<<<blocksPerGrid,threadsPerBlock>>>( size, dev_a, dev_b,
                                            dev_partial_c );

    // copy the array 'c' back from the GPU to the CPU
    HANDLE_ERROR( cudaMemcpy( partial_c, dev_partial_c,
                              blocksPerGrid*sizeof(float),
                              cudaMemcpyDeviceToHost ) );

Now we need to finish up our computations on the CPU, as we did in our original dot product. Before doing this, we’ll stop our event timer because it only measures work that’s being performed on the GPU.

    HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
    HANDLE_ERROR( cudaEventSynchronize( stop ) );
    HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime,
                                        start, stop ) );

Finally, we sum our partial results and free our input and output buffers.

    // finish up on the CPU side
    c = 0;
    for (int i=0; i<blocksPerGrid; i++) {
        c += partial_c[i];
    }

    HANDLE_ERROR( cudaFree( dev_a ) );
    HANDLE_ERROR( cudaFree( dev_b ) );
    HANDLE_ERROR( cudaFree( dev_partial_c ) );

    // free memory on the CPU side
    free( a );
    free( b );
    free( partial_c );

    // free events
    HANDLE_ERROR( cudaEventDestroy( start ) );
    HANDLE_ERROR( cudaEventDestroy( stop ) );

    printf( "Value calculated: %f\n", c );

    return elapsedTime;
}

The version that uses zero-copy memory will be remarkably similar, with the exception of memory allocation. So, we start by allocating our input and output and filling the input memory with data as before:

float cuda_host_alloc_test( int size ) {
    cudaEvent_t start, stop;
    float *a, *b, c, *partial_c;
    float *dev_a, *dev_b, *dev_partial_c;
    float elapsedTime;

    HANDLE_ERROR( cudaEventCreate( &start ) );
    HANDLE_ERROR( cudaEventCreate( &stop ) );

    // allocate the memory on the CPU
    HANDLE_ERROR( cudaHostAlloc( (void**)&a,
                                 size*sizeof(float),
                                 cudaHostAllocWriteCombined |
                                 cudaHostAllocMapped ) );
    HANDLE_ERROR( cudaHostAlloc( (void**)&b,
                                 size*sizeof(float),
                                 cudaHostAllocWriteCombined |
                                 cudaHostAllocMapped ) );
    HANDLE_ERROR( cudaHostAlloc( (void**)&partial_c,
                                 blocksPerGrid*sizeof(float),
                                 cudaHostAllocMapped ) );

    // fill in the host memory with data
    for (int i=0; i<size; i++) {
        a[i] = i;
        b[i] = i*2;
    }

As in Chapter 10, we see cudaHostAlloc() in action again, although we’re now using the flags argument to specify more than just default behavior. The flag cudaHostAllocMapped tells the runtime that we intend to access this buffer from the GPU. In other words, this flag is what makes our buffer zero-copy. For the two input buffers, we specify the flag cudaHostAllocWriteCombined. This flag indicates that the runtime should allocate the buffer as write-combined with respect to the CPU cache. It will not change functionality in our application, but it represents an important performance enhancement for buffers that the CPU writes and the GPU only reads. However, write-combined memory can be extremely inefficient in scenarios where the CPU also needs to read from the buffer, so you will have to consider your application’s likely access patterns when making this decision. That is why partial_c, which the CPU reads back to finish the reduction, is allocated with cudaHostAllocMapped alone.

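Since we’ve allocated our host memory with the flag cudaHostAllocMapped, the buffers can be accessed from the GPU. However, the GPU has a different virtual memory space than the CPU, so the buffers can have different addresses when they’re accessed on the GPU than on the CPU. The call to cudaHostAlloc() returns the CPU pointer for the memory, so we need to call cudaHostGetDevicePointer() in order to get a valid GPU pointer for the memory. These pointers will be passed to the kernel and then used by the GPU to read from and write to our host allocations. (On platforms with unified virtual addressing, the two pointers can be identical; calling cudaHostGetDevicePointer() remains the portable approach.)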


    HANDLE_ERROR( cudaHostGetDevicePointer( &dev_a, a, 0 ) );
    HANDLE_ERROR( cudaHostGetDevicePointer( &dev_b, b, 0 ) );
    HANDLE_ERROR( cudaHostGetDevicePointer( &dev_partial_c,
                                            partial_c, 0 ) );

With valid device pointers in hand, we’re ready to start our timer and launch our kernel.

    HANDLE_ERROR( cudaEventRecord( start, 0 ) );

    dot<<<blocksPerGrid,threadsPerBlock>>>( size, dev_a, dev_b,
                                            dev_partial_c );
    HANDLE_ERROR( cudaThreadSynchronize() );

Even though the buffers that dev_a, dev_b, and dev_partial_c point to all reside in host memory, they will look to our kernel as if they are GPU memory, thanks to our calls to cudaHostGetDevicePointer(). Since our partial results are already on the host, we don’t need to bother with a cudaMemcpy() from the device. However, you will notice that we’re synchronizing the CPU with the GPU by calling cudaThreadSynchronize(). A kernel launch returns control to the CPU before the kernel finishes, and the contents of zero-copy memory are undefined during the execution of a kernel that potentially changes them. After synchronizing, we’re sure that the kernel has completed and that our zero-copy buffer contains the results, so we can stop our timer and finish the computation on the CPU as we did before. (In current CUDA releases, cudaThreadSynchronize() is deprecated in favor of cudaDeviceSynchronize().)

    HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
    HANDLE_ERROR( cudaEventSynchronize( stop ) );
    HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime,
                                        start, stop ) );

    // finish up on the CPU side
    c = 0;
    for (int i=0; i<blocksPerGrid; i++) {
        c += partial_c[i];
    }


The only thing remaining in the cudaHostAlloc() version of the dot product is cleanup:

    HANDLE_ERROR( cudaFreeHost( a ) );
    HANDLE_ERROR( cudaFreeHost( b ) );
    HANDLE_ERROR( cudaFreeHost( partial_c ) );

    // free events
    HANDLE_ERROR( cudaEventDestroy( start ) );
    HANDLE_ERROR( cudaEventDestroy( stop ) );

    printf( "Value calculated: %f\n", c );

    return elapsedTime;
}

You will notice that no matter what flags we use with cudaHostAlloc(), the memory always gets freed in the same way. Specifically, a call to cudaFreeHost() does the trick.

And that’s that! All that remains is to look at how main() ties all of this together. The first thing we need to check is whether our device supports mapping host memory. We do this the same way we checked for device overlap in the previous chapter, with a call to cudaGetDeviceProperties():

int main( void ) {
    cudaDeviceProp prop;
    int whichDevice;
    HANDLE_ERROR( cudaGetDevice( &whichDevice ) );
    HANDLE_ERROR( cudaGetDeviceProperties( &prop, whichDevice ) );
    if (prop.canMapHostMemory != 1) {
        printf( "Device cannot map memory.\n" );
        return 0;
    }


Assuming that our device supports zero-copy memory, we place the runtime into a state where it will be able to allocate zero-copy buffers for us. We accomplish this by calling cudaSetDeviceFlags() and passing the flag cudaDeviceMapHost to indicate that we want the device to be allowed to map host memory:

    HANDLE_ERROR( cudaSetDeviceFlags( cudaDeviceMapHost ) );

That’s really all there is to main(). We run our two tests, display the elapsed time, and exit the application:

    float elapsedTime = malloc_test( N );
    printf( "Time using cudaMalloc: %3.1f ms\n",
            elapsedTime );

    elapsedTime = cuda_host_alloc_test( N );
    printf( "Time using cudaHostAlloc: %3.1f ms\n",
            elapsedTime );
}

The kernel itself is unchanged from our earlier dot product, but for the sake of completeness, here it is in its entirety:

#define imin(a,b) (a<b?a:b)

const int N = 33 * 1024 * 1024;
const int threadsPerBlock = 256;
const int blocksPerGrid =
            imin( 32, (N+threadsPerBlock-1) / threadsPerBlock );

__global__ void dot( int size, float *a, float *b, float *c ) {
    __shared__ float cache[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cacheIndex = threadIdx.x;

    float temp = 0;
    while (tid < size) {
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }

    // set the cache values
    cache[cacheIndex] = temp;

    // synchronize threads in this block
    __syncthreads();

    // for reductions, threadsPerBlock must be a power of 2
    // because of the following code
    int i = blockDim.x/2;
    while (i != 0) {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
        i /= 2;
    }

    if (cacheIndex == 0)
        c[blockIdx.x] = cache[0];
}
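To check results from either dot product version, it can help to have a host reference that mirrors the kernel’s structure. Each thread accumulates a grid-stride partial sum, and each block then performs the same power-of-two tree reduction. The helper below is a hypothetical host-only C++ sketch, not part of the book’s code. It uses double rather than float. For small integer inputs such as a[i] = i and b[i] = i*2, every partial sum is then exact, and the total can be compared with the closed form 2·(n−1)·n·(2n−1)/6. A float accumulation over 33 × 1024 × 1024 elements on the GPU will generally not match such a value exactly.

```cpp
#include <cassert>
#include <vector>

// Host-only emulation of the dot() kernel's structure. Returns one partial
// sum per block, which the caller adds up, just as the host code does.
std::vector<double> emulateDot(const std::vector<double>& a,
                               const std::vector<double>& b,
                               int blocksPerGrid, int threadsPerBlock) {
    assert(a.size() == b.size());
    // the tree reduction below requires a power-of-two block size
    assert(threadsPerBlock > 0 && (threadsPerBlock & (threadsPerBlock - 1)) == 0);
    const long size = (long)a.size();
    const long stride = (long)blocksPerGrid * threadsPerBlock;  // blockDim.x * gridDim.x
    std::vector<double> partial(blocksPerGrid);
    for (int block = 0; block < blocksPerGrid; ++block) {
        std::vector<double> cache(threadsPerBlock, 0.0);  // stands in for __shared__ cache
        for (int t = 0; t < threadsPerBlock; ++t)
            for (long tid = t + (long)block * threadsPerBlock; tid < size; tid += stride)
                cache[t] += a[tid] * b[tid];
        // same halving reduction as the kernel's while (i != 0) loop
        for (int i = threadsPerBlock / 2; i != 0; i /= 2)
            for (int t = 0; t < i; ++t)
                cache[t] += cache[t + i];
        partial[block] = cache[0];
    }
    return partial;
}
```

For n = 1000 and threadsPerBlock = 256, the book’s formula imin(32, (n+255)/256) gives 4 blocks. Using fewer blocks exercises the grid-stride loop, and the total must be the same either way.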

ZERO-COPY PERFORMANCE
What should we expect to gain from using zero-copy memory? The answer to this question is different for discrete GPUs and integrated GPUs. Discrete GPUs are graphics processors that have their own dedicated DRAMs and typically sit on separate circuit boards from the CPU. For example, if you have ever installed a graphics card into your desktop, this GPU is a discrete GPU. Integrated GPUs are graphics processors built into a system’s chipset, and they usually share regular system memory with the CPU. Many recent systems built with NVIDIA’s nForce media and communications processors (MCPs) contain CUDA-capable integrated GPUs. In addition to nForce MCPs, all the netbook, notebook, and desktop computers based on NVIDIA’s new ION platform contain integrated CUDA-capable GPUs. For integrated GPUs, the use of zero-copy memory is always a performance win because the memory is physically shared with the host anyway. Declaring a buffer as zero-copy has the sole effect of preventing unnecessary copies of data. But remember that nothing is free and that zero-copy buffers are still constrained in the same way that all pinned memory allocations are constrained: Each pinned allocation carves into the system’s available physical memory, which will eventually degrade system performance.

In cases where inputs and outputs are used exactly once, we will even see a performance enhancement when using zero-copy memory with a discrete GPU. GPUs are designed to excel at hiding the latencies associated with memory access. The same mechanism can partly hide the cost of reads and writes over the PCI Express bus, yielding a noticeable performance advantage. But since the zero-copy memory is not cached on the GPU, in situations where the memory gets read multiple times, we will end up paying a large penalty that could be avoided by simply copying the data to the GPU first.

How do you determine whether a GPU is integrated or discrete? Well, you can open up your computer and look, but this solution is fairly unworkable for your CUDA C application. Your code can check this property of a GPU by, not surprisingly, looking at the structure returned by cudaGetDeviceProperties(). This structure has an integer field named integrated, which will be 1 if the device is an integrated GPU and 0 if it’s not.
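The guidance in the last few paragraphs reduces to a short rule of thumb. The function below is a hypothetical sketch. Here integrated stands for the value of the integrated field that cudaGetDeviceProperties() fills in, and accessesPerElement is something you would know from your own kernel. The sketch does not model the physical-memory cost of pinned allocations mentioned above, which applies to both kinds of GPU.

```cpp
#include <cassert>

// Hypothetical helper encoding the text's rule of thumb.
// integrated:         value of cudaDeviceProp::integrated (nonzero = integrated GPU)
// accessesPerElement: how many times the kernel reads or writes each element
bool preferZeroCopy(int integrated, int accessesPerElement) {
    assert(accessesPerElement >= 1);
    if (integrated != 0)
        return true;                 // memory is physically shared with the host anyway
    return accessesPerElement == 1;  // discrete GPU: zero-copy memory is uncached there
}
```

For the dot product, each input element is read once, so the helper favors zero-copy memory on either kind of GPU. A kernel that rereads its inputs on a discrete GPU would be better served by an explicit copy.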

Since our dot product application satisfies the “read and/or write exactly once” constraint, it’s possible that it will enjoy a performance boost when run with zero-copy memory. And in fact, it does. On one GeForce GTX GPU, the execution time drops measurably when the buffers are migrated to zero-copy memory, and a second GeForce GTX model enjoys a similar improvement. Of course, different GPUs will exhibit different performance characteristics because of varying ratios of computation to bandwidth, as well as because of variations in effective PCI Express bandwidth across chipsets.


Using Multiple GPUs
In the previous section, we mentioned that devices are either integrated or discrete GPUs. The former is built into the system’s chipset, and the latter is typically an expansion card in a PCI Express slot. More and more systems contain both integrated and discrete GPUs, meaning that they also have multiple CUDA-capable processors. NVIDIA also sells products, such as dual-GPU GeForce GTX cards, that contain more than one GPU. Such a card physically occupies a single expansion slot, yet it will appear to your CUDA applications as two separate GPUs. Furthermore, users can also add multiple GPUs to separate PCI Express slots, connecting them with bridges using NVIDIA’s scalable link interface (SLI) technology. As a result of these trends, it has become relatively common to have a CUDA application running on a system with multiple graphics processors. Since our CUDA applications tend to be very parallelizable to begin with, it would be excellent if we could use every CUDA device in the system to achieve maximum throughput. So, let’s figure out how we can accomplish this.

To avoid learning a new example, let’s convert our dot product to use multiple GPUs. To make our lives easier, we will summarize all the data necessary to compute a dot product in a single structure. You’ll see momentarily exactly why this will make our lives easier.

struct DataStruct {
    int deviceID;
    int size;
    float *a;
    float *b;
    float returnValue;
};

This structure contains the identification for the device on which the dot product will be computed. It also contains the size of the input buffers as well as pointers to the two inputs, a and b. Finally, it has an entry to store the value computed as the dot product of a and b.

To use k GPUs, we first would like to know exactly what value of k we’re dealing with. So, we start our application with a call to cudaGetDeviceCount() in order to determine how many CUDA-capable processors have been installed in our system:

int main( void ) {
    int deviceCount;
    HANDLE_ERROR( cudaGetDeviceCount( &deviceCount ) );
    if (deviceCount < 2) {
        printf( "We need at least two compute 1.0 or greater "
                "devices, but only found %d\n", deviceCount );
        return 0;
    }

This example is designed to show multi-GPU usage, so you’ll notice that we simply exit if the system has only one CUDA device (not that there’s anything wrong with that). This is not encouraged as a best practice for obvious reasons. To keep things as simple as possible, we’ll allocate standard host memory for our inputs and fill them with data exactly as we’ve done in the past.

    float *a = (float*)malloc( sizeof(float) * N );
    HANDLE_NULL( a );
    float *b = (float*)malloc( sizeof(float) * N );
    HANDLE_NULL( b );

    // fill in the host memory with data
    for (int i=0; i<N; i++) {
        a[i] = i;
        b[i] = i*2;
    }

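We’re now ready to dive into the multi-GPU code. The trick to using multiple GPUs with the CUDA runtime API is realizing that each GPU needs to be controlled by a different CPU thread. (This was a requirement of the runtime API when this example was written. Starting with CUDA 4.0, a single host thread can switch between devices by calling cudaSetDevice(), although one thread per GPU remains a perfectly workable design.) Since we have used only a single GPU before, we haven’t needed to worry about this. We have moved a lot of the annoyance of multithreaded code to our file of auxiliary code, book.h. With this code tucked away, all we need to do is fill a structure with data necessary to perform the computations. Although the system could have any number of GPUs greater than one, we will use only two of them for clarity: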

    DataStruct data[2];

    data[0].deviceID = 0;
    data[0].size = N/2;
    data[0].a = a;
    data[0].b = b;

    data[1].deviceID = 1;
    data[1].size = N/2;
    data[1].a = a + N/2;
    data[1].b = b + N/2;
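The data[] setup above hard-codes an even split between devices 0 and 1. The hypothetical partition() helper below sketches how the same setup could cover every device that cudaGetDeviceCount() reports. It fills one DataStruct per device and gives the first n % deviceCount devices one extra element, so no element is lost when n does not divide evenly. The struct is restated so the snippet compiles on its own, and it makes no CUDA calls.

```cpp
#include <cassert>
#include <vector>

struct DataStruct {  // same fields as the book's DataStruct
    int deviceID;
    int size;
    float *a;
    float *b;
    float returnValue;
};

// Hypothetical generalization of the two-GPU setup: splits n elements
// across deviceCount devices, with contiguous, non-overlapping chunks.
std::vector<DataStruct> partition(float *a, float *b, int n, int deviceCount) {
    assert(deviceCount > 0 && n >= 0);
    std::vector<DataStruct> data(deviceCount);
    int offset = 0;
    for (int d = 0; d < deviceCount; ++d) {
        int size = n / deviceCount + (d < n % deviceCount ? 1 : 0);
        data[d].deviceID = d;
        data[d].size = size;
        data[d].a = a + offset;  // this device's slice of the inputs
        data[d].b = b + offset;
        data[d].returnValue = 0.0f;
        offset += size;
    }
    return data;
}
```

With n = 10 and three devices, the sizes come out as 4, 3, and 3, and the slices start at offsets 0, 4, and 7.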

To proceed, we pass one of the DataStruct variables to a utility function we’ve named start_thread(). We also pass start_thread() a pointer to a function to be called by the newly created thread; this example’s thread function is called routine(). The function start_thread() will create a new thread that then calls the specified function, passing the DataStruct to this function. The other call to routine() gets made from the default application thread (so we’ve created only one additional thread).

    CUTThread thread = start_thread( routine, &(data[0]) );
    routine( &(data[1]) );

Before we proceed, we have the main application thread wait for the other thread to finish by calling end_thread().

    end_thread( thread );

Since both threads have completed at this point in main(), it’s safe to clean up and display the result.


    free( a );
    free( b );

    printf( "Value calculated: %f\n",
            data[0].returnValue + data[1].returnValue );

    return 0;
}

Notice that we sum the results computed by each thread. This is the last step in our dot product reduction. In another algorithm, this combination of multiple results may involve other steps. In fact, in some applications, the two GPUs may be executing completely different code on completely different data sets. For simplicity’s sake, this is not the case in our dot product example.
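If you would rather not depend on book.h, the same one-thread-per-GPU structure can be expressed with the C++11 standard library. In this hypothetical sketch:

- std::thread stands in for start_thread(), and join() for end_thread().
- routine() sums its share on the CPU instead of calling cudaSetDevice() and launching the dot kernel, so the structure compiles and runs without a GPU.
- Unlike the book’s version, device 0’s share runs on the calling thread.

```cpp
#include <cassert>
#include <thread>
#include <vector>

struct DataStruct {  // same fields as the book's DataStruct
    int deviceID;
    int size;
    float *a;
    float *b;
    float returnValue;
};

// CPU stand-in for the book's routine(). A real version would call
// cudaSetDevice(data->deviceID) here and run the GPU dot product.
void routine(DataStruct *data) {
    assert(data != nullptr && data->size >= 0);
    float sum = 0.0f;
    for (int i = 0; i < data->size; ++i)
        sum += data->a[i] * data->b[i];
    data->returnValue = sum;
}

// One worker thread per additional device; device 0's share runs on the
// calling thread. join() plays the role of end_thread().
float runOnAllDevices(std::vector<DataStruct>& data) {
    std::vector<std::thread> workers;
    for (size_t d = 1; d < data.size(); ++d)
        workers.emplace_back(routine, &data[d]);
    if (!data.empty())
        routine(&data[0]);
    for (std::thread& t : workers)
        t.join();
    float total = 0.0f;  // final step of the reduction, as in main()
    for (const DataStruct& ds : data)
        total += ds.returnValue;
    return total;
}
```

Each thread writes only its own returnValue, and the sum is taken after every join() returns, so no locking is needed.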

Since the dot product routine is identical to the other versions you’ve seen, we’ll omit it from this section. However, the contents of routine() may be of interest. We declare routine() as taking and returning a void* so that you can reuse the start_thread() code with arbitrary implementations of a thread function. Although we’d love to take credit for this idea, it’s fairly standard procedure for callback functions in C:

void* routine( void *pvoidData ) {
    DataStruct *data = (DataStruct*)pvoidData;
    HANDLE_ERROR( cudaSetDevice( data->deviceID ) );

Each thread calls cudaSetDevice(), and each passes a different ID to this function. As a result, we know each thread will be manipulating a different GPU. These GPUs may have identical performance, as with a dual-GPU GeForce GTX card, or they may be different GPUs, as would be the case in a system that has both an integrated GPU and a discrete GPU. These details are not important to our application, though they might be of interest to you. In particular, these details prove useful if you depend on a certain minimum compute capability to launch your kernels or if you have a serious desire to load balance your application across the system’s GPUs. If the GPUs are different, you will need to do some work to partition the computations so that each GPU is occupied for roughly the same amount of time. For our purposes in this example, however, these are piddling details that we won’t worry about.

Apart from the call to cudaSetDevice() that specifies which CUDA device we intend to use, this implementation of routine() is remarkably similar to the vanilla malloc_test() from the “Zero-Copy Dot Product” section. We allocate buffers for our GPU copies of the input and a buffer for our partial results, followed by a cudaMemcpy() of each input array to the GPU:

    int size = data->size;
    float *a, *b, c, *partial_c;
    float *dev_a, *dev_b, *dev_partial_c;

    // allocate memory on the CPU side
    a = data->a;
    b = data->b;
    partial_c = (float*)malloc( blocksPerGrid*sizeof(float) );

    // allocate the memory on the GPU
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a,
                              size*sizeof(float) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b,
                              size*sizeof(float) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_partial_c,
                              blocksPerGrid*sizeof(float) ) );

    // copy the arrays 'a' and 'b' to the GPU
    HANDLE_ERROR( cudaMemcpy( dev_a, a, size*sizeof(float),
                              cudaMemcpyHostToDevice ) );
    HANDLE_ERROR( cudaMemcpy( dev_b, b, size*sizeof(float),
                              cudaMemcpyHostToDevice ) );

We then launch our dot product kernel, copy the results back, and finish the computation on the CPU:


    dot<<<blocksPerGrid,threadsPerBlock>>>( size, dev_a, dev_b,
                                            dev_partial_c );
    // copy the array 'c' back from the GPU to the CPU
    HANDLE_ERROR( cudaMemcpy( partial_c, dev_partial_c,
                              blocksPerGrid*sizeof(float),
                              cudaMemcpyDeviceToHost ) );

    // finish up on the CPU side
    c = 0;
    for (int i=0; i<blocksPerGrid; i++) {
        c += partial_c[i];
    }

As usual, we clean up our GPU buffers and return the dot product we’ve computed in the returnValue field of our DataStruct:

    HANDLE_ERROR( cudaFree( dev_a ) );
    HANDLE_ERROR( cudaFree( dev_b ) );
    HANDLE_ERROR( cudaFree( dev_partial_c ) );

    // free memory on the CPU side
    free( partial_c );

    data->returnValue = c;
    return 0;
}

So when we get down to it, outside of the host thread management issue, using multiple GPUs is not too much tougher than using a single GPU. With our helper code to create a thread and execute a function on that thread, this becomes significantly more manageable. If you have your own thread libraries, you should feel free to use them in your own applications. You just need to remember that each GPU gets its own thread, and everything else is cream cheese.


Portable Pinned Memory
The last important piece to using multiple GPUs involves the use of pinned memory. We learned in Chapter 10 that pinned memory is actually host memory that has its pages locked in physical memory to prevent it from being paged out or relocated. However, it turns out that pages can appear pinned to a single CPU thread only. That is, they will remain page-locked if any thread has allocated them as pinned memory, but they will only appear page-locked to the thread that allocated them. If the pointer to this memory is shared between threads, the other threads will see the buffer as standard, pageable data.

As a side effect of this behavior, when a thread that did not allocate a pinned buffer attempts to perform a cudaMemcpy() using it, the copy will be performed at standard pageable memory speeds. As we saw in Chapter 10, this speed can be a fraction of the maximum attainable transfer speed. What’s worse, if the thread attempts to enqueue a cudaMemcpyAsync() call into a CUDA stream, this operation will fail because it requires a pinned buffer to proceed. Since the buffer appears pageable from the thread that didn’t allocate it, the call dies a grisly death. Even in the future nothing works.

But there is a remedy to this problem. We can allocate pinned memory as portable, meaning that we will be allowed to migrate it between host threads and allow any thread to view it as a pinned buffer. To do so, we use our trusty cudaHostAlloc() to allocate the memory, but we call it with a new flag: cudaHostAllocPortable. This flag can be used in concert with the other flags you’ve seen, such as cudaHostAllocWriteCombined and cudaHostAllocMapped. This means that you can allocate your host buffers as any combination of portable, zero-copy, and write-combined.

To demonstrate portable pinned memory, we’ll enhance our multi-GPU dot product application. We’ll adapt our original zero-copy version of the dot product, so this version begins as something of a mash-up of the zero-copy and multi-GPU versions. As we have throughout this chapter, we need to verify that there are at least two CUDA-capable GPUs and that both can handle zero-copy buffers:


int main( void ) {
    int deviceCount;
    HANDLE_ERROR( cudaGetDeviceCount( &deviceCount ) );
    if (deviceCount < 2) {
        printf( "We need at least two compute 1.0 or greater "
                "devices, but only found %d\n", deviceCount );
        return 0;
    }

    cudaDeviceProp prop;
    for (int i=0; i<2; i++) {
        HANDLE_ERROR( cudaGetDeviceProperties( &prop, i ) );
        if (prop.canMapHostMemory != 1) {
            printf( "Device %d cannot map memory.\n", i );
            return 0;
        }
    }

In previous examples, we'd be ready to start allocating memory on the host to hold our input vectors. To allocate portable pinned memory, however, it's necessary to first set the CUDA device on which we intend to run. Since we intend to use the device for zero-copy memory as well, we follow the cudaSetDevice() call with a call to cudaSetDeviceFlags(), as we did in the Zero-Copy Dot Product section.

    float *a, *b;

    HANDLE_ERROR( cudaSetDevice( 0 ) );
    HANDLE_ERROR( cudaSetDeviceFlags( cudaDeviceMapHost ) );
    HANDLE_ERROR( cudaHostAlloc( (void**)&a, N*sizeof(float),
                                 cudaHostAllocWriteCombined |
                                 cudaHostAllocPortable |
                                 cudaHostAllocMapped ) );
    HANDLE_ERROR( cudaHostAlloc( (void**)&b, N*sizeof(float),
                                 cudaHostAllocWriteCombined |
                                 cudaHostAllocPortable |
                                 cudaHostAllocMapped ) );

Earlier in this chapter, we called cudaSetDevice() but not until we had already allocated our memory and created our threads. One of the requirements of allocating page-locked memory with cudaHostAlloc(), though, is that we have initialized the device first by calling cudaSetDevice(). You will also notice that we pass our newly learned flag, cudaHostAllocPortable, to both allocations. Since these were allocated after calling cudaSetDevice(0), only CUDA device zero would see these buffers as pinned memory if we had not specified that they were to be portable allocations.

We continue the application as we have in the past, generating data for our input vectors and preparing our DataStruct structures as we did in the multi-GPU example earlier in this chapter.

    // fill in the host memory with data
    for (int i=0; i<N; i++) {
        a[i] = i;
        b[i] = i*2;
    }

    // prepare for multithread
    DataStruct data[2];
    data[0].deviceID = 0;
    data[0].offset = 0;
    data[0].size = N/2;
    data[0].a = a;
    data[0].b = b;

    data[1].deviceID = 1;
    data[1].offset = N/2;
    data[1].size = N/2;
    data[1].a = a;
    data[1].b = b;

We can then create our secondary thread and call routine() to begin computing on each device.

    CUTThread thread = start_thread( routine, &(data[1]) );
    routine( &(data[0]) );
    end_thread( thread );

Because our host memory was allocated by the CUDA runtime, we use cudaFreeHost() to free it. Other than no longer calling free(), we have seen all there is to see in main().

    // free memory on the CPU side
    HANDLE_ERROR( cudaFreeHost( a ) );
    HANDLE_ERROR( cudaFreeHost( b ) );

    printf( "Value calculated: %f\n",
            data[0].returnValue + data[1].returnValue );

    return 0;
}

To support portable pinned memory and zero-copy memory in our multi-GPU application, we need to make two notable changes in the code for routine(). The first is a bit subtle, and in no way should this have been obvious.

void* routine( void *pvoidData ) {
    DataStruct *data = (DataStruct*)pvoidData;
    // main() already selected device 0 (and set its flags) on the main thread
    // before allocating the portable buffers, so only the other device needs setup
    if (data->deviceID != 0) {
        HANDLE_ERROR( cudaSetDevice( data->deviceID ) );
        HANDLE_ERROR( cudaSetDeviceFlags( cudaDeviceMapHost ) );
    }

You may recall that in our multi-GPU version of this code, we needed a call to cudaSetDevice() in routine() in order to ensure that each participating thread controls a different GPU. In this example, on the other hand, we have already made a call to cudaSetDevice() from the main thread. We did so in order to allocate pinned memory in main(). As a result, we only want to call cudaSetDevice() and cudaSetDeviceFlags() on devices where we have not made this call. That is, we call these two functions if the deviceID is not zero. Although it would yield cleaner code to simply repeat these calls on device zero, it turns out that this is in fact an error. With the CUDA runtime this book targets, once you have set the device on a particular thread, you cannot call cudaSetDevice() again, even if you pass the same device identifier. The if() statement at the top of routine() helps us avoid this little nasty-gram from the CUDA runtime, so we move on to the next important change to routine().

In addition to using portable pinned memory for the host-side memory, we are using zero-copy in order to access these buffers directly from the GPU. Consequently, we no longer use cudaMemcpy() as we did in the original multi-GPU application. Instead, we use cudaHostGetDevicePointer() to get valid device pointers for the host memory, as we did in the zero-copy example. However, you will notice that we use standard GPU memory for the partial results. As always, this memory gets allocated using cudaMalloc().

    int size = data->size;
    float *a, *b, c, *partial_c;
    float *dev_a, *dev_b, *dev_partial_c;

    // a and b point into the portable pinned buffers allocated in main();
    // only the per-block partial sums need a new CPU-side allocation
    a = data->a;
    b = data->b;
    partial_c = (float*)malloc( blocksPerGrid*sizeof(float) );

    // map the zero-copy host buffers into this device's address space
    HANDLE_ERROR( cudaHostGetDevicePointer( &dev_a, a, 0 ) );
    HANDLE_ERROR( cudaHostGetDevicePointer( &dev_b, b, 0 ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_partial_c,
                              blocksPerGrid*sizeof(float) ) );

    // offset 'a' and 'b' to where this GPU gets its data
    dev_a += data->offset;
    dev_b += data->offset;

At this point, we're pretty much ready to go, so we launch our kernel and copy our results back from the GPU.

    dot<<<blocksPerGrid,threadsPerBlock>>>( size, dev_a, dev_b,
                                            dev_partial_c );
    // copy the per-block partial sums back from the GPU to the CPU
    HANDLE_ERROR( cudaMemcpy( partial_c, dev_partial_c,
                              blocksPerGrid*sizeof(float),
                              cudaMemcpyDeviceToHost ) );

We conclude, as we always have in our dot product example, by summing our partial results on the CPU, freeing our temporary storage, and returning to main().

    // finish up on the CPU side
    c = 0;
    for (int i=0; i<blocksPerGrid; i++) {
        c += partial_c[i];
    }

    HANDLE_ERROR( cudaFree( dev_partial_c ) );

    // free memory on the CPU side
    free( partial_c );

    data->returnValue = c;
    return 0;
}

Chapter Review
We have seen some new types of host memory allocations, all of which get allocated with a single call, cudaHostAlloc(). Using a combination of this one entry point and a set of argument flags, we can allocate memory as any combination of zero-copy, portable, and/or write-combined. We used zero-copy buffers to avoid making explicit copies of data to and from the GPU, a maneuver that potentially speeds up a wide class of applications. Using a support library for threading, we manipulated multiple GPUs from the same application, allowing our dot product computation to be performed across multiple devices. Finally, we saw how multiple GPUs could share pinned memory allocations by allocating them as portable pinned memory. Our last example used portable pinned memory, multiple GPUs, and zero-copy buffers in order to demonstrate a turbocharged version of the dot product we started toying with back in Chapter 5. As multiple-device systems gain popularity, these techniques should serve you well in harnessing the computational power of your target platform in its entirety.

Chapter 12
The Final Countdown

Congratulations! We hope you've enjoyed learning about CUDA C and experimenting some with GPU computing. It's been a long trip, so let's take a moment to review where we started and how much ground we've covered. Starting with a background in C or C++ programming, we've learned how to use the CUDA runtime's angle bracket syntax to easily launch multiple copies of kernels across any number of multiprocessors. We expanded these concepts to use collections of threads and blocks, operating on arbitrarily large inputs. These more complex launches exploited inter-thread communication using the GPU's special, on-chip shared memory, and they employed dedicated synchronization primitives to ensure correct operation in an environment that supports (and encourages) thousands upon thousands of parallel threads.

Armed with basic concepts about parallel programming using CUDA C on NVIDIA's CUDA Architecture, we explored some of the more advanced concepts and APIs that NVIDIA provides. The GPU's dedicated graphics hardware proves useful for GPU computing, so we learned how to exploit texture memory to accelerate some common patterns of memory access. Because many users add GPU computing to their interactive graphics applications, we explored the interoperation of CUDA C kernels with industry-standard graphics APIs such as OpenGL and DirectX. Atomic operations on both global and shared memory allowed safe, multithreaded access to common memory locations. Moving steadily into more and more advanced topics, streams enabled us to keep our entire system as busy as possible, allowing kernels to execute simultaneously with memory copies between the host and GPU. Finally, we looked at the ways in which we could allocate and use zero-copy memory to accelerate applications on integrated GPUs. Moreover, we learned to initialize multiple devices and allocate portable pinned memory in order to write CUDA C that fully utilizes increasingly common, multi-GPU environments.

Chapter Objectives
Through the course of this chapter, you will accomplish the following:

• You will learn about some of the tools available to aid your CUDA C development.

• You will learn about additional written and code resources to take your CUDA C development to the next level.

CUDA Tools
Through the course of this book, we have relied upon several components of the CUDA C software system. The applications we wrote made heavy use of the CUDA C compiler in order to convert our CUDA C kernels into code that could be executed on NVIDIA GPUs. We also used the CUDA runtime in order to perform much of the setup and dirty work behind launching kernels and communicating with the GPU. The CUDA runtime, in turn, uses the CUDA driver to talk directly to the hardware in your system. In addition to these components that we have already used at length, NVIDIA makes available a host of other software in order to ease the development of CUDA C applications. This section does not serve well as a user's manual to these products; rather, it aims solely to inform you of the existence and utility of these packages.

CUDA TOOLKIT
You almost certainly already have the CUDA Toolkit collection of software on your development machine. We can be so sure of this because the set of CUDA C compiler tools comprises one of the principal components of this package. If you don't have the CUDA Toolkit on your machine, then it's a veritable certainty that you haven't tried to write or compile any CUDA C code. We're on to you now, sucker! Actually, this is no big deal (but it does make us wonder why you've read this entire book). On the other hand, if you have been working through the examples in this book, then you should possess the libraries we're about to discuss.

CUFFT
The CUDA Toolkit comes with two very important utility libraries if you plan to pursue GPU computing in your own applications. First, NVIDIA provides a tuned Fast Fourier Transform library known as CUFFT. As of release 3.0, the CUFFT library supports a number of useful features, including the following:

• One-, two-, and three-dimensional transforms of both real-valued and complex-valued input data

• Batch execution for performing multiple one-dimensional transforms in parallel

• 2D and 3D transforms with sizes ranging from 2 to 16,384 in any dimension

• 1D transforms of inputs up to 8 million elements in size

• In-place and out-of-place transforms for both real-valued and complex-valued data

NVIDIA provides the CUFFT library free of charge with an accompanying license that allows for use in any application, regardless of whether it's for personal, academic, or professional development.
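The following minimal sketch makes the batch feature concrete. It assumes the cufft.h header and linking with -lcufft. One cufftPlan1d() call plans several forward complex-to-complex transforms of the same length, and cufftExecC2C() executes them in place. The wrapper name batchedForwardFFT() is hypothetical, and most error checks on the copies are omitted.

```cuda
#include <assert.h>
#include <math.h>
#include <cuda_runtime.h>
#include <cufft.h>

// Sketch: 'batch' independent forward FFTs of length nx, computed in place.
// 'host' holds the signals back to back (nx * batch values) and receives the
// unnormalized spectra in the same layout.
cufftResult batchedForwardFFT(cufftComplex *host, int nx, int batch) {
    assert(host != NULL && nx > 0 && batch > 0);
    size_t bytes = sizeof(cufftComplex) * (size_t)nx * (size_t)batch;
    cufftComplex *dev = NULL;
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess)
        return CUFFT_ALLOC_FAILED;
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

    cufftHandle plan;
    // One plan describes all 'batch' transforms of length nx.
    cufftResult r = cufftPlan1d(&plan, nx, CUFFT_C2C, batch);
    if (r == CUFFT_SUCCESS) {
        // Passing the same pointer as input and output requests an in-place transform.
        r = cufftExecC2C(plan, dev, dev, CUFFT_FORWARD);
        if (r == CUFFT_SUCCESS)   // this copy waits for the transform to finish
            cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
        cufftDestroy(plan);
    }
    cudaFree(dev);
    return r;
}
```

CUFFT's forward transform is unnormalized. A length-8 signal of all ones therefore yields 8 in bin 0 and zeros elsewhere. Following it with the inverse transform returns the original data scaled by nx.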

CUBLAS
In addition to a Fast Fourier Transform library, NVIDIA also provides a library of linear algebra routines that implements the well-known package of Basic Linear Algebra Subprograms (BLAS). This library, named CUBLAS, is also freely available and supports a large subset of the full BLAS package. This includes versions of each routine that accept both single- and double-precision inputs as well as real- and complex-valued data. Because BLAS was originally a FORTRAN-implemented library of linear algebra routines, NVIDIA attempts to maximize compatibility with the requirements and expectations of these implementations. Specifically, the CUBLAS library uses a column-major storage layout for arrays, rather than the row-major layout natively used by C and C++. In practice, this is not typically a concern, and it allows current users of BLAS to adapt their applications to exploit the GPU-accelerated CUBLAS with minimal effort. NVIDIA also distributes FORTRAN bindings to CUBLAS in order to demonstrate how to link existing FORTRAN applications to CUDA libraries.
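The column-major convention is the detail most likely to trip up a C programmer. The following host-side sketch shows what it means. IDX2C() follows the indexing macro commonly used with CUBLAS. IDX2R() and rowMajorToColMajor() are hypothetical helpers for this illustration.

```cuda
#include <assert.h>

// Column-major offset of element (row i, column j) when the leading
// dimension (the stored column height) is ld. This is the layout CUBLAS uses.
#define IDX2C(i, j, ld) (((j) * (ld)) + (i))

// Row-major offset of the same element in a C array with ncols columns.
#define IDX2R(i, j, ncols) (((i) * (ncols)) + (j))

// Copies a rows x cols row-major matrix into a column-major buffer, the
// form a column-major (FORTRAN-style) library expects.
void rowMajorToColMajor(const float *src, float *dst, int rows, int cols) {
    assert(rows > 0 && cols > 0);
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            dst[IDX2C(i, j, rows)] = src[IDX2R(i, j, cols)];
}
```

Copying is not always necessary. A row-major matrix read as column-major is simply its transpose. You can often pass the original buffer to a BLAS routine and ask it to treat that operand as transposed.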

NVIDIA GPU COMPUTING SDK
Available separately from the NVIDIA drivers and CUDA Toolkit, the optional GPU Computing SDK download contains a package of dozens and dozens of sample GPU computing applications. We mentioned this SDK earlier in the book because its samples serve as an excellent complement to the material we've covered in the first 11 chapters. But if you haven't taken a look yet, note that NVIDIA has geared these samples toward varying levels of CUDA C competency and spread them over a broad spectrum of subject material. The samples are roughly categorized into the following sections:

• CUDA Basic Topics
• CUDA Advanced Topics
• CUDA Systems Integration
• Data-Parallel Algorithms
• Graphics Interoperability
• Texture
• Performance Strategies
• Linear Algebra
• Image/Video Processing
• Computational Finance
• Data Compression
• Physically-Based Simulation

The examples work on any platform that CUDA C works on and can serve as excellent jumping-off points for your own applications. For readers who have considerable experience in some of these areas, we warn you against expecting to see state-of-the-art implementations of your favorite algorithms in the NVIDIA GPU Computing SDK. These code samples should not be treated as production-worthy library code but rather as educational illustrations of functioning CUDA C programs, not unlike the examples in this book.

NVIDIA PERFORMANCE PRIMITIVES
In addition to the routines offered in the CUFFT and CUBLAS libraries, NVIDIA also maintains a library of functions for performing CUDA-accelerated data processing known as the NVIDIA Performance Primitives (NPP). Currently, NPP's initial set of functionality focuses specifically on imaging and video processing and is widely applicable for developers in these areas. NVIDIA intends for NPP to evolve over time to address a greater number of computing tasks in a wider range of domains. If you have an interest in high-performance imaging or video applications, you should make it a priority to look into NPP, available as a free download at www.nvidia.com/object/npp.html (or accessible from your favorite web search engine).

DEBUGGING CUDA C
We have heard from a variety of sources that, in rare instances, computer software does not work exactly as intended when first executed. Some code computes incorrect values, some fails to terminate execution, and some code even puts the computer into a state that only a flip of the power switch can remedy. Although having clearly never written code like this personally, the authors of this book recognize that some software engineers may desire resources to debug their CUDA C kernels. Fortunately, NVIDIA provides tools to make this painful process significantly less troublesome.

CUDA-GDB
A tool known as CUDA-GDB is one of the most useful CUDA downloads available to CUDA C programmers who develop their code on Linux-based systems. NVIDIA extended the open source GNU debugger (gdb) to transparently support debugging device code in real time while maintaining the familiar interface of gdb. Prior to CUDA-GDB, there existed no good way to debug device code outside of using the CPU to simulate the way in which it was expected to run. This method yielded extremely slow debugging, and in fact, it was frequently a very poor approximation of the exact GPU execution of the kernel. NVIDIA's CUDA-GDB enables programmers to debug their kernels directly on the GPU, affording them all of the control that they've grown accustomed to with CPU debuggers. Some of the highlights of CUDA-GDB include the following:

• Viewing CUDA state, such as information regarding installed GPUs and their capabilities

• Setting breakpoints in CUDA C source code

• Inspecting GPU memory, including all global and shared memory

• Inspecting the blocks and threads currently resident on the GPU

• Single-stepping a warp of threads

• Breaking into currently running applications, including hung or deadlocked applications

Along with the debugger, NVIDIA provides the CUDA Memory Checker, whose functionality can be accessed through CUDA-GDB or the stand-alone tool cuda-memcheck. Because the CUDA Architecture includes a sophisticated memory management unit built directly into the hardware, all illegal memory accesses will be detected and prevented by the hardware. As a result of a memory violation, your program will cease functioning as expected, so you will certainly want visibility into these types of errors. When enabled, the CUDA Memory Checker will detect any global memory violations or misaligned global memory accesses that your kernel attempts to make, reporting them to you in a far more helpful and verbose manner than previously possible.
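Below is a minimal sketch of the kind of kernel these tools are aimed at, under the assumptions noted in its comments. The guard in addOne() keeps it correct. Deleting the guard produces exactly the out-of-bounds writes described above whenever n is not a multiple of the block size. The names are hypothetical, and cudaDeviceSynchronize() assumes a CUDA 4.0 or later runtime.

```cuda
#include <assert.h>
#include <cuda_runtime.h>

// Build with host (-g) and device (-G) debug information, for example:
//   nvcc -g -G add_one.cu -o add_one
// then run it under cuda-gdb, or under cuda-memcheck to check memory accesses.

// Adds 1 to each of the n elements. Without the (i < n) guard, threads in the
// last block whose index is >= n would write past the end of 'data', which is
// the kind of out-of-bounds global write the memory checker reports.
__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}

// Runs addOne over a host array and returns the first error observed.
cudaError_t runAddOne(int *host, int n) {
    assert(host != NULL && n > 0);
    int *dev = NULL;
    size_t bytes = (size_t)n * sizeof(int);
    cudaError_t err = cudaMalloc((void **)&dev, bytes);
    if (err != cudaSuccess) return err;
    err = cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        addOne<<<(n + 127) / 128, 128>>>(dev, n);
        err = cudaGetLastError();                      // bad launch configuration
    }
    if (err == cudaSuccess) err = cudaDeviceSynchronize(); // faults while running
    if (err == cudaSuccess) err = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err;
}
```

On recent CUDA toolkits, compute-sanitizer has superseded cuda-memcheck and performs the same class of checks.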

NVIDIA PARALLEL NSIGHT
Although CUDA-GDB is a mature and fantastic tool for debugging your CUDA C kernels on hardware in real time, NVIDIA recognizes that not every developer is over the moon about Linux. So, unless Windows users are hedging their bets by saving up to open their own pet stores, they need a way to debug their applications, too. Toward the end of 2009, NVIDIA introduced NVIDIA Parallel Nsight (originally code-named Nexus), the first integrated GPU/CPU debugger for Microsoft Visual Studio. Like CUDA-GDB, Parallel Nsight supports debugging CUDA applications with thousands of threads. Users can place breakpoints anywhere in their CUDA C source code, including breakpoints that trigger on writes to arbitrary memory locations. They can inspect GPU memory directly from the Visual Studio Memory window and check for out-of-bounds memory accesses. This tool has been made publicly available in a beta program as of press time, and the final version should be released shortly.

CUDA VISUAL PROFILER
We often tout the CUDA Architecture as a wonderful foundation for high-performance computing applications. Unfortunately, the reality is that after ferreting out all the bugs from your applications, even the most well-meaning "high-performance computing" applications are more accurately referred to as simply "computing" applications. We have often been in the position where we wonder, "Why in the Sam Hill is my code performing so poorly?" In situations like this, it helps to be able to execute the kernels in question under the watchful gaze of a profiling tool. NVIDIA provides just such a tool, available as a separate download on the CUDA Zone website. Figure 12.1 shows the Visual Profiler being used to compare two implementations of a matrix transpose operation. Without looking at a single line of code, we can easily determine that both the memory and the instruction throughput of the transpose() kernel outstrip those of the transpose_naive() kernel. (But then again, it would be unfair to expect much more from a function with naive in the name.)

Figure 12.1 The CUDA Visual Profiler being used to profile a matrix transpose application

The CUDA Visual Profiler will execute your application, examining special performance counters built into the GPU. After execution, the profiler can compile data based on these counters and present you with reports based on what it observed. It can verify how long your application spends executing each kernel as well as determine the number of blocks launched, whether your kernel's memory accesses are coalesced, the number of divergent branches the warps in your code execute, and so on. We encourage you to look into the CUDA Visual Profiler if you have some subtle performance problems in need of resolution.
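You can reproduce the spirit of the transpose comparison in Figure 12.1 with the sketch below. It is not the SDK's transpose() or transpose_naive(). It is a hypothetical pair of kernels that illustrates the usual difference between the two approaches. The naive version writes with a large stride. The tiled version stages a tile in shared memory so that both global reads and global writes are coalesced. Profiling the two launches should show the kind of gap the text describes, though exact numbers depend on the GPU.

```cuda
#include <assert.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define TILE 16   // each block handles a TILE x TILE tile with TILE x TILE threads

// Naive transpose: reads along rows of 'in' are coalesced, but neighboring
// threads write elements that are n floats apart in 'out'.
__global__ void transposeNaiveSketch(float *out, const float *in, int n) {
    int x = blockIdx.x * TILE + threadIdx.x;   // column in 'in'
    int y = blockIdx.y * TILE + threadIdx.y;   // row in 'in'
    if (x < n && y < n)
        out[x * n + y] = in[y * n + x];
}

// Tiled transpose: stage a tile in shared memory so that both the global read
// and the global write walk along rows. The extra padding column staggers the
// tile's columns across shared-memory banks, reducing bank conflicts when a
// column of the tile is read.
__global__ void transposeTiledSketch(float *out, const float *in, int n) {
    __shared__ float tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;       // swapped block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}

// Transposes an n x n matrix on the GPU (tiled != 0 selects the tiled kernel)
// and returns 1 if every element matches a CPU transpose.
int transposeMatchesCPU(int n, int tiled) {
    size_t bytes = (size_t)n * n * sizeof(float);
    float *h_in = (float *)malloc(bytes), *h_out = (float *)malloc(bytes);
    float *d_in = NULL, *d_out = NULL;
    assert(h_in != NULL && h_out != NULL);
    for (int i = 0; i < n * n; i++) h_in[i] = (float)i;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    if (tiled) transposeTiledSketch<<<grid, block>>>(d_out, d_in, n);
    else       transposeNaiveSketch<<<grid, block>>>(d_out, d_in, n);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    int ok = 1;
    for (int r = 0; r < n && ok; r++)
        for (int c = 0; c < n && ok; c++)
            ok = (h_out[r * n + c] == h_in[c * n + r]);
    cudaFree(d_in);
    cudaFree(d_out);
    free(h_in);
    free(h_out);
    return ok;
}
```

Using a matrix size that is not a multiple of TILE, such as 100, also exercises the bounds checks in both kernels.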

Written Resources
If you haven't already grown queasy from all the prose in this book, then it's possible you might actually be interested in reading more. We know that some of you are more likely to want to play with code in order to continue your learning, but for the rest of you, there are additional written resources to maintain your growth as a CUDA C coder.

PROGRAMMING MASSIVELY PARALLEL PROCESSORS: A HANDS-ON APPROACH
Early in this book, we assured you that this book was most decidedly not a textbook on parallel architectures. Sure, we bandied about terms such as multiprocessor and warp, but this book strives to teach the softer side of programming with CUDA C and its attendant APIs. We learned the CUDA C language within the programming model set forth in the NVIDIA CUDA Programming Guide, largely ignoring the way NVIDIA's hardware actually accomplishes the tasks we give it.

But to truly become an advanced, well-rounded CUDA C programmer, you will need a more intimate familiarity with the CUDA Architecture and some of the nuances of how NVIDIA GPUs work behind the scenes. To accomplish this, we recommend working your way through Programming Massively Parallel Processors: A Hands-on Approach. To write it, David Kirk, formerly NVIDIA's chief scientist, collaborated with Wen-mei W. Hwu, the W.J. Sanders III chairman in electrical and computer engineering at the University of Illinois. You'll encounter a number of familiar terms and concepts, but you will learn about the gritty details of NVIDIA's CUDA Architecture, including thread scheduling and latency tolerance, memory bandwidth usage and efficiency, specifics on floating-point handling, and much more. The book also addresses parallel programming in a more general sense than this book, so you will gain a better overall understanding of how to engineer parallel solutions to large, complex problems.

CUDA U
Some of us were unlucky enough to have attended university prior to the exciting world of GPU computing. For those who are fortunate enough to be attending college now or in the near future, many universities across the world currently teach courses involving CUDA. But before you start a crash diet to fit back into your college gear, there's an alternative! On the CUDA Zone website, you will find a link for CUDA U, which is essentially an online university for CUDA education. (Or you can navigate directly there with the URL www.nvidia.com/object/cuda_education.) Although you will be able to learn quite a bit about GPU computing if you attend some of the online lectures at CUDA U, as of press time there are still no online fraternities for partying after class.

UNIVERSITY COURSE MATERIALS
Among the myriad sources of CUDA education, one of the highlights includes an entire course from the University of Illinois on programming in CUDA C. NVIDIA and the University of Illinois provide this content free of charge in the M4V video format for your iPods, iPhones, or compatible video players. We know what you're thinking: "Finally, a way to learn CUDA while I wait in line at the Department of Motor Vehicles!" You may also be wondering why we waited until the very end of this book to inform you of the existence of what is essentially a movie version of this book. We're sorry for holding out on you, but the movie is hardly ever as good as the book anyway, right? In addition to actual course materials from the University of Illinois and from the University of California, Davis, you will also find materials from CUDA Training Podcasts and links to third-party training and consultancy services.

DR. DOBB’S
For more than 30 years, Dr. Dobb's has covered nearly every major development in computing technology, and NVIDIA's CUDA is no exception. As part of an ongoing effort, Dr. Dobb's has published an extensive series of articles cutting a broad swath through the CUDA landscape. Entitled CUDA, Supercomputing for the Masses, the series starts with an introduction to GPU computing and progresses quickly from a first kernel to other pieces of the CUDA programming model. The articles in Dr. Dobb's cover error handling, global memory performance, shared memory, the CUDA Visual Profiler, texture memory, CUDA-GDB, and the CUDPP library of data-parallel CUDA primitives, as well as many other topics. This series of articles is an excellent place to get additional information about some of the material we've attempted to convey in this book. Furthermore, you'll find practical information concerning some of the tools that we've only had time to glance over in this text, such as the profiling and debugging options available to you. The series of articles is linked from the CUDA Zone web page but is readily accessible through a web search for Dr Dobbs CUDA.

NVIDIA FORUMS
Even after digging around all of NVIDIA's documentation, you may find yourself with an unanswered or particularly intriguing question. Perhaps you're wondering whether anyone else has seen some funky behavior you're experiencing. Or maybe you're throwing a CUDA celebration party and want to assemble a group of like-minded individuals. For anything you're interested in asking, we strongly recommend the forums on NVIDIA's website. Located at http://forums.nvidia.com, the forums are a great place to ask questions of other CUDA users. In fact, after reading this book, you're in a position to potentially help others if you want! NVIDIA employees regularly prowl the forums, too, so the trickiest questions will prompt authoritative advice right from the source. We also love to get suggestions for new features and feedback on the good, bad, and ugly things that we at NVIDIA do.

Code Resources
Although the NVIDIA GPU Computing SDK is a treasure trove of how-to samples, it's not designed to be used for much more than pedagogy. If you're hunting for production-caliber, CUDA-powered libraries or source code, you'll need to look a bit further. Fortunately, there is a large community of CUDA developers who have produced top-notch solutions. A couple of these tools and libraries are presented here, but you are encouraged to search the Web for whatever solutions you need. And hey, maybe you'll contribute some of your own to the CUDA C community someday!

CUDA DATA PARALLEL PRIMITIVES LIBRARY
NVIDIA, with the help of researchers at the University of California, Davis, has released a library known as the CUDA Data Parallel Primitives Library (CUDPP). CUDPP, as the name indicates, is a library of data-parallel algorithm primitives. Some of these primitives include parallel prefix-sum (scan), parallel sort, and parallel reduction. Primitives such as these form the foundation of a wide variety of data-parallel algorithms, including sorting, stream compaction, building data structures, and many others. If you're looking to write an even moderately complex algorithm, chances are good that either CUDPP already has what you need or it can get you significantly closer to where you want to be. Download it at http://code.google.com/p/cudpp.
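The following minimal one-block sketch of an inclusive scan makes "prefix-sum (scan)" concrete. It is not CUDPP's API or algorithm. CUDPP provides work-efficient scans that span many blocks. This Hillis-Steele version does more additions than necessary and handles at most SCAN_MAX elements. The kernel and wrapper names are hypothetical.

```cuda
#include <assert.h>
#include <cuda_runtime.h>

#define SCAN_MAX 512   // this one-block sketch handles at most 512 elements

// Inclusive prefix sum (Hillis-Steele) of n <= SCAN_MAX ints, one thread per
// element, launched as a single block. On each pass, element t adds the
// element 'offset' places to its left. Because offset doubles each pass,
// after ceil(log2(n)) passes every element holds the sum of all elements up
// to and including itself. Two shared buffers keep a pass from reading values
// it is overwriting.
__global__ void inclusiveScanOneBlock(int *out, const int *in, int n) {
    __shared__ int buf[2][SCAN_MAX];
    int t = threadIdx.x;
    int src = 0;
    if (t < n) buf[src][t] = in[t];
    __syncthreads();
    for (int offset = 1; offset < n; offset *= 2) {
        int dst = 1 - src;
        if (t < n)
            buf[dst][t] = buf[src][t] + (t >= offset ? buf[src][t - offset] : 0);
        __syncthreads();
        src = dst;
    }
    if (t < n) out[t] = buf[src][t];
}

// Scans n ints in place through the GPU; returns the first error observed.
cudaError_t scanOnGPU(int *host, int n) {
    assert(host != NULL);
    if (n <= 0 || n > SCAN_MAX) return cudaErrorInvalidValue;
    int *d_in = NULL, *d_out = NULL;
    size_t bytes = (size_t)n * sizeof(int);
    cudaError_t err = cudaMalloc((void **)&d_in, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_out, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(d_in, host, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        inclusiveScanOneBlock<<<1, n>>>(d_out, d_in, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaMemcpy(host, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err;
}
```

For the input {3, 1, 7, 0, 4, 1, 6, 3}, the scan produces {3, 4, 11, 11, 15, 16, 22, 25}.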

CULATOOLS
As we mentioned in the CUBLAS section, NVIDIA provides an implementation of the BLAS packaged along with the CUDA Toolkit download. For readers who need a broader solution for linear algebra, take a look at EM Photonics' CUDA implementation of the industry-standard Linear Algebra Package (LAPACK). Its LAPACK implementation is known as CULAtools and offers more complex linear algebra routines that are built on NVIDIA's CUBLAS technology. The freely available Basic package offers LU decomposition, QR factorization, a linear system solver, and singular value decomposition, as well as least squares and constrained least squares solvers. You can obtain the Basic download at www.culatools.com/versions/basic. You will also notice that EM Photonics offers Premium and Commercial licenses, which contain a far greater fraction of the LAPACK routines, as well as licensing terms that will allow you to distribute your own commercial applications based on CULAtools.

LANGUAGE WRAPPERS
This book has primarily been concerned with C and C++, but clearly hundreds of projects exist that don't employ these languages. Fortunately, third parties have written wrappers to allow access to CUDA technology from languages not officially supported by NVIDIA. NVIDIA itself provides FORTRAN bindings for its CUBLAS library, but you can also find Java bindings for several of the CUDA libraries at www.jcuda.org. Likewise, Python wrappers to allow the use of CUDA C kernels from Python applications are available from the PyCUDA project at http://mathema.tician.de/software/pycuda. Finally, there are bindings for the Microsoft .NET environment available from the CUDA.NET project at www.hoopoe-cloud.com/Solutions/CUDA.NET.

Although these projects are not officially supported by NVIDIA, they have been around for several versions of CUDA, are all freely available, and each has many successful customers. The moral of this story is, if your language of choice (or your boss's choice) is not C or C++, you should not rule out GPU computing until you've first looked to see whether the necessary bindings are available.

Chapter Review
And there you have it. Even after 11 chapters of CUDA C, there are still loads of resources to download, read, watch, and compile. This is a remarkably interesting time to be learning GPU computing, as the era of heterogeneous computing platforms matures. We hope that you have enjoyed learning about one of the most pervasive parallel programming environments in existence. Moreover, we hope that you leave this experience excited about the possibilities to develop new and exciting means for interacting with computers and for processing the ever-increasing amount of information available to your software. It's your ideas and the amazing technologies you develop that will push GPU computing to the next level.

Appendix
Advanced Atomics

Chapter 9 covered some of the ways in which we can use atomic operations to enable hundreds of threads to safely make concurrent modifications to shared data. In this appendix, we'll look at an advanced method for using atomics to implement locking data structures. On its surface, this topic does not seem much more complicated than anything else we've examined. And in reality, this is accurate. You've learned a lot of complex topics through this book, and locking data structures are no more challenging than these. So, why is this material hiding in the appendix? We don't want to reveal any spoilers, so if you're intrigued, read on, and we'll discuss this through the course of the appendix.

Dot Product Revisited
In Chapter 5, we looked at the implementation of a vector dot product using CUDA C. This algorithm was one of a large family of algorithms known as reductions. If you recall, the algorithm computed the dot product of two input vectors by doing the following:

1. Each thread in each block multiplies two corresponding elements of the input vectors and stores the products in shared memory.

2. While a block still has more than one product, each active thread adds two of the products and stores the result back to shared memory. Each step results in half as many values as it started with (this is where the term reduction comes from).

3. When every block has a final sum, each one writes its value to global memory and exits.

4. If the kernel ran with N parallel blocks, the CPU sums these remaining N values to generate the final dot product.

This high-level look at the dot product algorithm is intended to be review, so if it's been a while or you've had a couple glasses of Chardonnay, it may be worth the time to review Chapter 5. If you feel comfortable enough with the dot product code to continue, draw your attention to step 4 in the algorithm. Although it doesn't involve copying much data to the host or performing many calculations on the CPU, moving the computation back to the CPU to finish is indeed as awkward as it sounds.
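For reference, here is a compact sketch in the spirit of the Chapter 5 kernel, with the four steps marked. It assumes the block size equals THREADS_PER_BLOCK, a power of two, and it omits error checking. The names are hypothetical. The loop at the end of dotWithCPUFinish() is the step 4 round-trip that this appendix sets out to eliminate.

```cuda
#include <assert.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256   // a power of two, as the halving loop requires

// Steps 1-3: every thread multiplies (and accumulates) pairs of elements, the
// block halves its shared-memory values until one sum remains, and thread 0
// writes that per-block sum to global memory.
__global__ void dotPartialSums(int n, const float *a, const float *b, float *partial) {
    __shared__ float cache[THREADS_PER_BLOCK];
    float temp = 0.0f;
    for (int tid = blockIdx.x * blockDim.x + threadIdx.x; tid < n;
         tid += blockDim.x * gridDim.x)
        temp += a[tid] * b[tid];
    cache[threadIdx.x] = temp;
    __syncthreads();
    for (int i = blockDim.x / 2; i != 0; i /= 2) {   // step 2: halve each pass
        if (threadIdx.x < i)
            cache[threadIdx.x] += cache[threadIdx.x + i];
        __syncthreads();
    }
    if (threadIdx.x == 0)                             // step 3
        partial[blockIdx.x] = cache[0];
}

// Step 4: copy the per-block sums back and let the CPU finish the reduction.
float dotWithCPUFinish(const float *a, const float *b, int n, int blocks) {
    float *d_a, *d_b, *d_partial;
    float *partial = (float *)malloc(blocks * sizeof(float));
    assert(blocks > 0 && partial != NULL);
    cudaMalloc((void **)&d_a, n * sizeof(float));
    cudaMalloc((void **)&d_b, n * sizeof(float));
    cudaMalloc((void **)&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_a, a, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, n * sizeof(float), cudaMemcpyHostToDevice);
    dotPartialSums<<<blocks, THREADS_PER_BLOCK>>>(n, d_a, d_b, d_partial);
    cudaMemcpy(partial, d_partial, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    float c = 0.0f;
    for (int i = 0; i < blocks; i++)   // the host round-trip in question
        c += partial[i];
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_partial);
    free(partial);
    return c;
}
```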

But it's more than an issue of an awkward step in the algorithm or the inelegance of the solution. Consider a scenario where a dot product computation is just one step in a long sequence of operations. If you want to perform every operation on the GPU because your CPU is busy with other tasks or computations, you're out of luck. As it stands, you'll be forced to stop computing on the GPU, copy intermediate results back to the host, finish the computation with the CPU, and finally upload that result back to the GPU and resume computing with your next kernel.

Since this is an appendix on atomics and we have gone to such lengths to explain what a pain our original dot product algorithm is, you should see where we're heading. We intend to fix our dot product using atomics so the entire computation can stay on the GPU, leaving your CPU free to perform other tasks. Ideally, instead of exiting the kernel in step 3 and returning to the CPU in step 4, we want each block to add its final result to a total in global memory. If each value were added atomically, we would not have to worry about potential collisions or indeterminate results. Since we have already used an atomicAdd() operation in the histogram operation, this seems like an obvious choice.

Unfortunately, prior to compute capability 2.0, atomicAdd() operated only on integers. Although this might be fine if you plan to compute dot products of vectors with integer components, it is significantly more common to use floating-point components. And at the time of writing, the majority of NVIDIA hardware does not support atomic arithmetic on floating-point numbers! But there's a reasonable explanation for this, so don't throw your GPU in the garbage just yet.

Atomic operations on a value in memory guarantee only that each thread's read-modify-write sequence will complete without other threads reading or writing the target value while in process. There is no stipulation about the order in which the threads will perform their operations, so in the case of three threads performing addition, sometimes the hardware will perform (A+B)+C and sometimes it will compute A+(B+C). This is acceptable for integers because integer math is associative, so (A+B)+C = A+(B+C). Floating-point arithmetic is not associative because of the rounding of intermediate results, so (A+B)+C is not guaranteed to equal A+(B+C). As a result, atomic arithmetic on floating-point values is of dubious utility because it gives rise to nondeterministic results in a highly multithreaded environment such as on the GPU. There are many applications where it is simply unacceptable to get two different results from two runs of an application, so the support of floating-point atomic arithmetic was not a priority for earlier hardware.
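The following sketch is a tiny, deterministic illustration of why the grouping matters. It assumes IEEE single-precision evaluation, as on the GPU or with a typical x86-64 host compiler without fast-math options. The function names are hypothetical.

```cuda
#include <assert.h>

// The same three-term sum with two different groupings. Each float addition
// rounds to the nearest representable single-precision value, so the point
// at which rounding happens can change the final answer.
__host__ __device__ float sumLeftFirst(float a, float b, float c) {
    float t = a + b;   // rounded here
    return t + c;
}

__host__ __device__ float sumRightFirst(float a, float b, float c) {
    float t = b + c;   // rounded here instead
    return a + t;
}

// Integer addition (without overflow) is associative: grouping never matters.
__host__ __device__ int isumLeftFirst(int a, int b, int c)  { return (a + b) + c; }
__host__ __device__ int isumRightFirst(int a, int b, int c) { return a + (b + c); }
```

Near 10^8, adjacent single-precision values are 8 apart. Adding the right-hand pair first therefore rounds -1e8f + 1.0f back to -1e8f, and the 1 disappears. Adding the left-hand pair first cancels the large terms exactly, so the 1 survives. With atomics, the hardware chooses the grouping, so either outcome is possible from run to run.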

However, if we are willing to tolerate some nondeterminism in the results, we can still accomplish the reduction entirely on the GPU. But we'll first need to develop a way to work around the lack of atomic floating-point arithmetic. The solution will still use atomic operations, but not for the arithmetic itself.

ATOMIC LOCKS
The atomicAdd() function we used to build GPU histograms performed a read-modify-write operation without interruption from other threads. At a low level, you can imagine the hardware locking the target memory location while this operation is underway, and while locked, no other threads can read or write the value at the location. If we had a way of emulating this lock in our CUDA C kernels, we could perform arbitrary operations on an associated memory location or data structure. The locking mechanism itself will operate exactly like a typical CPU mutex. If you are unfamiliar with mutual exclusion (mutex), don't fret. It's not any more complicated than the things you've already learned.

The basic idea is that we allocate a small piece of memory to be used as a mutex. The mutex will act like something of a traffic signal that governs access to some resource. The resource could be a data structure, a buffer, or simply a memory location we want to modify atomically. When a thread reads a 0 from the mutex, it interprets this value as a "green light" indicating that no other thread is using the memory. Therefore, the thread is free to lock the memory and make whatever changes it desires, free of interference from other threads. To lock the memory location in question, the thread writes a 1 to the mutex. This will act as a "red light" for potentially competing threads. The competing threads must then wait until the owner has written a 0 to the mutex before they can attempt to modify the locked memory.

A simple code sequence to accomplish this locking process might look like this:

void lock( void ) {
    if( *mutex == 0 ) {
        *mutex = 1; //store a 1 to lock
    }
}

Unfortunately, there's a problem with this code. Fortunately, it's a familiar problem: What happens if another thread writes a 1 to the mutex after our thread has read the value to be zero? That is, both threads check the value at mutex and see that it's zero. They then both write a 1 to this location to signify to other threads that the structure is locked and unavailable for modification. After doing so, both threads think they own the associated memory or data structure and begin making unsafe modifications. Catastrophe ensues!

The operation we want to complete is fairly simple: We need to compare the value at mutex to 0 and store a 1 at that location if and only if the mutex was 0. To accomplish this correctly, this entire operation needs to be performed atomically so we know that no other thread can interfere while our thread examines and updates the value at mutex. In CUDA C, this operation can be performed with the function atomicCAS(), an atomic compare-and-swap. The function atomicCAS() takes a pointer to memory, a value with which to compare the value at that location, and a value to store in that location if the comparison is successful. Using this operation, we can implement a GPU lock function as follows:


__device__ void lock( void ) {
    while( atomicCAS( mutex, 0, 1 ) != 0 );
}

The call to atomicCAS() returns the value that it found at the address mutex. As a result, the while() loop will continue to run until atomicCAS() sees a 0 at mutex. When it sees a 0, the comparison is successful, and the thread writes a 1 to mutex. Essentially, the thread will spin in the while() loop until it has successfully locked the data structure. We'll use this locking mechanism to implement our GPU hash table. But first, we dress the code up in a structure so it will be cleaner to use in the dot product application:

struct Lock {
    int *mutex;
    Lock( void ) {
        int state = 0;
        HANDLE_ERROR( cudaMalloc( (void**)& mutex,
                                  sizeof(int) ) );
        HANDLE_ERROR( cudaMemcpy( mutex, &state, sizeof(int),
                                  cudaMemcpyHostToDevice ) );
    }

    ~Lock( void ) {
        cudaFree( mutex );
    }

    __device__ void lock( void ) {
        while( atomicCAS( mutex, 0, 1 ) != 0 );
    }

    __device__ void unlock( void ) {
        atomicExch( mutex, 0 );   // store a 0 to release the lock
    }
};

Notice that we restore the value of mutex with atomicExch( mutex, 0 ). The function atomicExch() reads the value that is located at mutex, exchanges it with the second argument (a 0 in this case), and returns the original value it read. Why would we use an atomic function for this rather than the more obvious method to reset the value at mutex?
*mutex = 0;

If you're expecting some subtle, hidden reason why this method fails, we hate to disappoint you, but this would work as well. So, why not use this more obvious method? Atomic transactions and generic global memory operations follow different paths through the GPU. Using both atomics and standard global memory operations could therefore lead to an unlock() seeming out of sync with a subsequent attempt to lock() the mutex. The behavior would still be functionally correct. To ensure consistently intuitive behavior from the application's perspective, however, it's best to use the same pathway for all accesses to the mutex. Because we're required to use an atomic to lock the resource, we have chosen to also use an atomic to unlock the resource.

A.1.2 DOT PRODUCT REDUX: ATOMIC LOCKS
The only piece of our earlier dot product example that we endeavor to change is the final CPU-based portion of the reduction. In the previous section, we described how we implement a mutex on the GPU. The Lock structure that implements this mutex is located in lock.h and included at the beginning of our improved dot product example:

#include "../common/book.h"
#include "lock.h"

#define imin(a,b) (a<b?a:b)

const int N = 33 * 1024 * 1024;
const int threadsPerBlock = 256;
const int blocksPerGrid =
    imin( 32, (N+threadsPerBlock-1) / threadsPerBlock );

With two exceptions, the beginning of our dot product kernel is identical to the kernel we used in Chapter 5. Both exceptions involve the kernel's signature:
__global__ void dot( Lock lock, float *a, float *b, float *c )


In our updated dot product, we pass a Lock to the kernel in addition to the input vectors and the output buffer. The Lock will govern access to the output buffer during the final accumulation step. The other change is not visible in the signature itself but concerns the meaning of one of its parameters. Previously, the float *c argument was a buffer holding one float per block, where each block could store its partial result. This buffer was copied back to the CPU to compute the final sum. Now, the argument c no longer points to a temporary buffer but to a single floating-point value that will store the dot product of the vectors in a and b. But even with these changes, the kernel starts out exactly as it did in Chapter 5:

__global__ void dot( Lock lock, float *a,
                     float *b, float *c ) {
    __shared__ float cache[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cacheIndex = threadIdx.x;

    float temp = 0;
    while (tid < N) {
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }

    // set the cache values
    cache[cacheIndex] = temp;

    // synchronize threads in this block
    __syncthreads();

    // for reductions, threadsPerBlock must be a power of 2
    // because of the following code
    int i = blockDim.x/2;
    while (i != 0) {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
        i /= 2;
    }


At this point in execution, the threads in each block have summed their pairwise products and computed a single value that's sitting in cache[0]. Each thread block now needs to add its final value to the value at c. To do this safely, we'll use the lock to govern access to this memory location. In each block, the thread holding the block's result (the one with cacheIndex == 0) acquires the lock before updating the value *c. After adding the block's partial sum to the value at c, it unlocks the mutex so other threads can accumulate their values. After adding its value to the final result, the block has nothing remaining to compute and can return from the kernel.

    if (cacheIndex == 0) {
        lock.lock();
        *c += cache[0];
        lock.unlock();
    }
}

The main() routine is very similar to our original implementation, though it does have a couple of differences. First, we no longer need to allocate a buffer for partial results as we did in Chapter 5. We now allocate space for only a single floating-point result:

int main( void ) {
    float *a, *b, c = 0;
    float *dev_a, *dev_b, *dev_c;

    // allocate memory on the CPU side
    a = (float*)malloc( N*sizeof(float) );
    b = (float*)malloc( N*sizeof(float) );

    // allocate the memory on the GPU
    HANDLE_ERROR( cudaMalloc( (void**)&dev_a,
                              N*sizeof(float) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_b,
                              N*sizeof(float) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&dev_c,
                              sizeof(float) ) );


As we did in Chapter 5, we initialize our input arrays and copy them to the GPU. But you'll notice an additional copy in this example: We're also copying a zero to dev_c, the location that we intend to use to accumulate our final dot product. Since each block wants to read this value, add its partial sum, and store the result back, we need the initial value to be zero in order to get the correct result.

    // fill in the host memory with data
    for (int i=0; i<N; i++) {
        a[i] = i;
        b[i] = i*2;
    }

    // copy the arrays 'a' and 'b' to the GPU
    HANDLE_ERROR( cudaMemcpy( dev_a, a, N*sizeof(float),
                              cudaMemcpyHostToDevice ) );
    HANDLE_ERROR( cudaMemcpy( dev_b, b, N*sizeof(float),
                              cudaMemcpyHostToDevice ) );
    HANDLE_ERROR( cudaMemcpy( dev_c, &c, sizeof(float),
                              cudaMemcpyHostToDevice ) );

All that remains is declaring our Lock, invoking the kernel, and copying the result back to the CPU:

    Lock lock;
    dot<<<blocksPerGrid,threadsPerBlock>>>( lock, dev_a,
                                            dev_b, dev_c );

    // copy c back from the GPU to the CPU
    HANDLE_ERROR( cudaMemcpy( &c, dev_c,
                              sizeof(float),
                              cudaMemcpyDeviceToHost ) );


In Chapter 5, this is when we would do a final for() loop to add the partial sums. Since this is done on the GPU using atomic locks, we can skip right to the answer-checking and cleanup code:

    // arguments parenthesized so the macro is safe for any expression
    #define sum_squares(x)  ((x)*((x)+1)*(2*(x)+1)/6)
    printf( "Does GPU value %.6g = %.6g?\n", c,
            2 * sum_squares( (float)(N - 1) ) );

    // free memory on the GPU side
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );

    // free memory on the CPU side
    free( a );
    free( b );
}
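The reference value that main() prints follows from a closed form. With a[i] = i and b[i] = 2i, the dot product reduces to twice a sum of squares. This short derivation, added for clarity, shows why the code evaluates 2 * sum_squares(N - 1), where sum_squares(x) = x(x+1)(2x+1)/6 and x = N - 1:

```latex
\sum_{i=0}^{N-1} a_i b_i
  = \sum_{i=0}^{N-1} i \cdot 2i
  = 2 \sum_{i=0}^{N-1} i^2
  = 2 \cdot \frac{(N-1)\,N\,\bigl(2(N-1)+1\bigr)}{6}
  = 2 \cdot \frac{(N-1)\,N\,(2N-1)}{6}
```

For N = 33 * 1024 * 1024 the result is on the order of 10^22, far more than a float can hold exactly. Both the GPU sum and the reference value are therefore rounded approximations, and the printf compares them to six significant digits (%.6g).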

Because there is no way to precisely predict the order in which each block will add its partial sum to the final total, it is very likely (almost certain) that the final result will be summed in a different order than the CPU will sum it. Because of the nonassociativity of floating-point addition, it's therefore quite probable that the final result will be slightly different between the GPU and CPU. There is not much that can be done about this without adding a nontrivial chunk of code to ensure that the blocks acquire the lock in a deterministic order that matches the summation order on the CPU. If you feel extraordinarily motivated, give this a try. Otherwise, we'll move on to see how these atomic locks can be used to implement a multithreaded data structure.

A.2 Implementing a Hash Table
The hash table is one of the most important and commonly used data structures in computer science, playing an important role in a wide variety of applications. For readers not already familiar with hash tables, we'll provide a quick primer here. Data structures warrant more in-depth study than we intend to provide, but in the interest of making forward progress, we will keep this brief. If you already feel comfortable with the concepts behind hash tables, you should skip to the hash table implementation in Section A.2.2: A CPU Hash Table.

A.2.1 HASH TABLE OVERVIEW
A hash table is essentially a structure that is designed to store pairs of keys and values. For example, you could think of a dictionary as a hash table. Every word in the dictionary is a key, and each word has a definition associated with it. The definition is the value associated with the word, and thus every word and definition in the dictionary form a key/value pair. For this data structure to be useful, though, it is important that we minimize the time it takes to find a particular value if we're given a key. In general, this should be a constant amount of time. That is, the time to look up a value given a key should be the same, regardless of how many key/value pairs are in the hash table.

At an abstract level, our hash table will place values in "buckets" based on the value's corresponding key. The method by which we map keys to buckets is often called the hash function. A good hash function will map the set of possible keys uniformly across all the buckets. This helps satisfy our requirement that it take constant time to find any value, regardless of the number of values we've added to the hash table.

For example, consider our dictionary hash table. One obvious hash function would involve using 26 buckets, one for each letter of the alphabet. This simple hash function might simply look at the first letter of the key and put the value in one of the 26 buckets based on this letter. Figure A.1 shows how this hash function would assign a few sample words.

Figure A.1 Hashing of words into buckets


Given what we know about the distribution of words in the English language, this hash function leaves much to be desired because it will not map words uniformly across the 26 buckets. Some of the buckets will contain very few key/value pairs, and some of the buckets will contain a large number of pairs. Accordingly, it will take much longer to find the value associated with a word that begins with a common letter such as S than it would take to find the value associated with a word that begins with the letter X. Since we are looking for hash functions that will give us constant-time retrieval of any value, this consequence is fairly undesirable. An immense amount of research has gone into the study of hash functions, but even a brief survey of these techniques is beyond the scope of this book.

The last component of our hash table data structure involves the buckets. If we had a perfect hash function, every key would map to a different bucket. In this case, we can simply store the key/value pairs in an array where each entry in the array is what we've been calling a bucket. However, even with an excellent hash function, in most situations we will have to deal with collisions. A collision occurs when more than one key maps to a bucket, such as when we add both the words avocado and aardvark to our dictionary hash table. The simplest way to store all of the values that map to a given bucket is simply to maintain a list of values in the bucket. When we encounter a collision, such as adding aardvark to a dictionary that already contains avocado, we put the value associated with aardvark at the end of the list we're maintaining in the "A" bucket, as shown in Figures A.2 and A.3.

After adding the word avocado in Figure A.2, the first bucket has a single key/value pair in its list. Later in this imaginary application we add the word aardvark, a word that collides with avocado because they both start with the letter A. You will notice in Figure A.3 that it simply gets placed at the end of the list in the first bucket.

Figure A.2 Inserting the word avocado into the hash table


Figure A.3 Resolving the conflict when adding the word aardvark

Armed with some background on the notions of a hash function and collision resolution, we're ready to take a look at implementing our own hash table.

A.2.2 A CPU HASH TABLE
As described in the previous section, our hash table will consist of essentially two parts: a hash function and a data structure of buckets. Our buckets will be implemented exactly as before: We will allocate an array of length N, and each entry in the array holds a list of key/value pairs. Before concerning ourselves with a hash function, we will take a look at the data structures involved:

#include "../common/book.h"

struct Entry {
    unsigned int key;
    void* value;
    Entry *next;
};

struct Table {
    size_t count;
    Entry **entries;
    Entry *pool;
    Entry *firstFree;
};


As described in the introductory section, the structure Entry holds both a key and a value. In our application, we will use unsigned integer keys to store our key/value pairs. The value associated with this key can be any data, so we have declared value as a void* to indicate this. Our application will primarily be concerned with creating the hash table data structure, so we won't actually store anything in the value field. We have included it in the structure for completeness, in case you want to use this code in your own applications. The last piece of data in our hash table Entry is a pointer to the next Entry. After collisions, we'll have multiple entries in the same bucket, and we have decided to store these entries as a list. So, each entry will point to the next entry in the bucket, thereby forming a list of entries that have hashed to the same location in the table. The last entry will have a NULL next pointer.

At its heart, the Table structure itself is an array of "buckets." This bucket array is just an array of length count, where each bucket in entries is just a pointer to an Entry. Allocating memory every time we want to add an Entry to the table would add complication and a performance hit. To avoid that, the table will maintain a large array of available entries in pool. The field firstFree points to the next available Entry for use. When we need to add an entry to the table, we can simply use the Entry pointed to by firstFree and increment that pointer. Note that this will also simplify our cleanup code because we can free all of these entries with a single call to free(). If we had allocated every entry as we went, we would have to walk through the table and free every entry one by one.

After understanding the data structures involved, let's take a look at some of the other support code:

void initialize_table( Table &table, int entries,
                       int elements ) {
    table.count = entries;
    table.entries = (Entry**)calloc( entries, sizeof(Entry*) );
    table.pool = (Entry*)malloc( elements * sizeof( Entry ) );
    table.firstFree = table.pool;
}


Table initialization consists primarily of allocating memory and clearing memory for the bucket array entries. We also allocate storage for a pool of entries and initialize the firstFree pointer to be the first entry in the pool array.

At the end of the application, we'll want to free the memory we've allocated, so our cleanup routine frees the bucket array and the pool of free entries:

void free_table( Table &table ) {
    free( table.entries );
    free( table.pool );
}

In our introduction, we spoke quite a bit about the hash function. Specifically, we discussed how a good hash function can make the difference between an excellent hash table implementation and a poor one. In this example, we're using unsigned integers as our keys, and we need to map these to the indices of our bucket array. The simplest way to do this would be to select the bucket with an index equal to the key. That is, we could store the entry e in table.entries[e.key]. However, we have no way of guaranteeing that every key will be less than the length of the array of buckets. Fortunately, this problem can be solved relatively painlessly:

size_t hash( unsigned int key, size_t count ) {
    return key % count;
}

If the hash function is so important, how can we get away with such a simple one? Ideally, we want the keys to map uniformly across all the buckets in our table, and all we're doing here is taking the key modulo the array length. In reality, hash functions may not normally be this simple, but because this is just an example program, we will be randomly generating our keys. If we assume that the random number generator generates values roughly uniformly, this hash function should map these keys uniformly across all of the buckets of the hash table. In your own hash table implementation, you may require a more complicated hash function.


Having seen the hash table structures and the hash function, we're ready to look at the process of adding a key/value pair to the table. The process involves three basic steps:

1. Compute the hash function on the input key to determine the new entry's bucket.

2. Take a preallocated Entry from the pool and initialize its key and value fields.

3. Insert the entry at the front of the proper bucket's list.

We translate these steps to code in a fairly straightforward way:

void add_to_table( Table &table, unsigned int key, void* value )
{
    //Step 1
    size_t hashValue = hash( key, table.count );

    //Step 2
    Entry *location = table.firstFree++;
    location->key = key;
    location->value = value;

    //Step 3
    location->next = table.entries[hashValue];
    table.entries[hashValue] = location;
}

If you have never seen linked lists (or it's been a while), step 3 may be tricky to understand at first. The existing list has its first node stored at table.entries[hashValue]. With this in mind, we can insert a new node at the head of the list in two steps: First, we set our new entry's next pointer to point to the first node in the existing list. Then, we store the new entry in the bucket array so it becomes the first node of the new list.


Since it's wise to check whether the code you've written works, we've implemented a routine to perform a sanity check on a hash table. The check involves first walking through the table and examining every node. We compute the hash function on the node's key and confirm that the node is stored in the correct bucket. After checking every node, we verify that the number of nodes actually in the table is indeed equal to the number of elements we intended to add to the table. If these numbers don't agree, then either we've added a node accidentally to multiple buckets or we haven't inserted it correctly.

#define SIZE        (100*1024*1024)
#define ELEMENTS    (SIZE / sizeof(unsigned int))

void verify_table( const Table &table ) {
    int count = 0;
    for (size_t i=0; i<table.count; i++) {
        Entry *current = table.entries[i];
        while (current != NULL) {
            ++count;
            // hash the key (not the value) to find the bucket it belongs in
            if (hash( current->key, table.count ) != i)
                printf( "%u hashed to %zu, but was located "
                        "at %zu\n", current->key,
                        hash( current->key, table.count ), i );
            current = current->next;
        }
    }
    if (count != ELEMENTS)
        printf( "%d elements found in hash table.  Should be %zu\n",
                count, ELEMENTS );
    else
        printf( "All %d elements found in hash table.\n", count);
}


With all the infrastructure code out of the way, we can look at main(). As with many of this book's examples, a lot of the heavy lifting has been done in helper functions, so we hope that main() will be relatively easy to follow:

#define HASH_ENTRIES     1024

int main( void ) {
    unsigned int *buffer =
                     (unsigned int*)big_random_block( SIZE );

    clock_t start, stop;
    start = clock();

    Table table;
    initialize_table( table, HASH_ENTRIES, ELEMENTS );

    for (int i=0; i<ELEMENTS; i++) {
        add_to_table( table, buffer[i], (void*)NULL );
    }

    stop = clock();
    float elapsedTime = (float)(stop - start) /
                        (float)CLOCKS_PER_SEC * 1000.0f;
    printf( "Time to hash:  %3.1f ms\n", elapsedTime );

    verify_table( table );

    free_table( table );
    free( buffer );
    return 0;
}

As you can see, we start by allocating a big chunk of random numbers. These randomly generated unsigned integers will be the keys we insert into our hash table. After generating the numbers, we read the processor clock with clock() in order to measure the performance of our implementation. We initialize the hash table and then insert each random key into the table using a for() loop. After adding all the keys, we read the clock again to compute the elapsed time to initialize and add the keys. Finally, we verify the hash table with our sanity check routine and free the buffers we've allocated.
You probably noticed that we are using NULL as the value for every key/value pair. In a typical application, you would likely store some useful data with the key, but because we are primarily concerned with the hash table implementation itself, we're storing a meaningless value with each key.

A.2.3 MULTITHREADED HASH TABLE
There are some assumptions built into our CPU hash table that will no longer be valid when we move to the GPU. First, we have assumed that only one node can be added to the table at a time in order to make the addition of a node simpler. If more than one thread were trying to add a node to the table at once, we could end up with problems similar to the multithreaded addition problems in the example from Chapter 9.

For example, let's revisit our "avocado and aardvark" example and imagine that threads A and B are trying to add these entries to the table. Thread A computes a hash function on avocado, and thread B computes the function on aardvark. They both decide their keys belong in the same bucket. To add the new entry to the list, threads A and B start by setting their new entry's next pointer to the first node of the existing list, as in Figure A.4.

Then, both threads try to replace the entry in the bucket array with their new entry. However, the thread that finishes second is the only thread that has its update preserved because it overwrites the work of the previous thread. So consider the scenario where thread A replaces the entry altitude with its entry for avocado. Immediately after that, thread B replaces what it believes to be the entry for altitude with its entry for aardvark. Unfortunately, it's replacing avocado instead of altitude, resulting in the situation illustrated in Figure A.5.

Figure A.4 Multiple threads attempting to add a node to the same bucket


Figure A.5 The hash table after an unsuccessful concurrent modification by two threads

Thread A's entry is tragically "floating" outside of the hash table. Fortunately, our sanity check routine would catch this and alert us to the presence of a problem because it would count fewer nodes than we expected. But we still need to answer this question: How do we build a hash table on the GPU? The key observation here involves the fact that only one thread can safely make modifications to a bucket at a time. This is similar to our dot product example where only one thread at a time could safely add its value to the final result. If each bucket had an atomic lock associated with it, we could ensure that only a single thread was making changes to a given bucket at a time.

A.2.4 A GPU HASH TABLE
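Armed with a method to ensure safe multithreaded access to the hash table, we can proceed with a GPU implementation of the hash table application we wrote in Section A.2.2: A CPU Hash Table. We'll need to include lock.h, the implementation of our GPU Lock structure from Section A.1.1: Atomic Locks. We'll also need to declare the hash function as a __device__ function. Because verify_table() will also call hash() on the host, the listing declares it __device__ __host__. Aside from these changes, the fundamental data structures and hash function are identical to the CPU implementation. The one omission is the firstFree field of Table. The GPU version does not need it because each thread claims the pool entry at its own index (table.pool[tid]).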


#include "../common/book.h"
#include "lock.h"

struct Entry {
    unsigned int key;
    void* value;
    Entry *next;
};

struct Table {
    size_t count;
    Entry **entries;
    Entry *pool;
};

__device__ __host__ size_t hash( unsigned int value,
                                 size_t count ) {
    return value % count;
}

Initializing and freeing the hash table consists of the same steps as we performed on the CPU, but as with previous examples, we use CUDA runtime functions to accomplish this. We use cudaMalloc() to allocate a bucket array and a pool of entries, and we use cudaMemset() to set the bucket array entries to zero. To free the memory upon application completion, we use cudaFree().

void initialize_table( Table &table, int entries,
                       int elements ) {
    table.count = entries;
    HANDLE_ERROR( cudaMalloc( (void**)&table.entries,
                              entries * sizeof(Entry*)) );
    HANDLE_ERROR( cudaMemset( table.entries, 0,
                              entries * sizeof(Entry*) ) );
    HANDLE_ERROR( cudaMalloc( (void**)&table.pool,
                              elements * sizeof(Entry)) );
}


void free_table( Table &table ) {
    cudaFree( table.pool );
    cudaFree( table.entries );
}

We used a routine to check our hash table for correctness in the CPU implementation. We need a similar routine for the GPU version, so we have two options. We could write a GPU-based version of verify_table(), or we could use the same code we used in the CPU version and add a function that copies a hash table from the GPU to the CPU. Although either option gets us what we need, the second option seems superior for two reasons. First, it involves reusing our CPU version of verify_table(). As with code reuse in general, this saves time and ensures that future changes to the code would need to be made in only one place for both versions of the hash table. Second, implementing a copy function will uncover an interesting problem, the solution to which may be very useful to you in the future.

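As promised, verify_table() reuses the CPU implementation. The only additions are a call to copy_table_to_host() at the start, so that the checking loop runs on a host copy of the table, and the freeing of that copy at the end. It is reprinted here for your convenience: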

#define SIZE        (100*1024*1024)
#define ELEMENTS    (SIZE / sizeof(unsigned int))
#define HASH_ENTRIES     1024

// defined below; declared here so verify_table() can call it
void copy_table_to_host( const Table &table, Table &hostTable );

void verify_table( const Table &dev_table ) {
    Table table;
    copy_table_to_host( dev_table, table );

    int count = 0;
    for (size_t i=0; i<table.count; i++) {
        Entry *current = table.entries[i];
        while (current != NULL) {
            ++count;
            if (hash( current->key, table.count ) != i)
                printf( "%u hashed to %zu, but was located "
                        "at %zu\n", current->key,
                        hash(current->key, table.count), i );
            current = current->next;
        }
    }
    if (count != ELEMENTS)
        printf( "%d elements found in hash table.  Should be %zu\n",
                count, ELEMENTS );
    else
        printf( "All %d elements found in hash table.\n", count );

    free( table.pool );
    free( table.entries );
}

Since we chose to reuse our CPU implementation of verify_table(), we need a function to copy the table from GPU memory to host memory. There are three steps to this function: two relatively obvious steps and a third, trickier step. The first two steps involve allocating host memory for the hash table data and performing a copy of the GPU data structures into this memory with cudaMemcpy(). We have done this many times previously, so this should come as no surprise.

void copy_table_to_host( const Table &table, Table &hostTable) {
    hostTable.count = table.count;
    hostTable.entries = (Entry**)calloc( table.count,
                                         sizeof(Entry*) );
    hostTable.pool = (Entry*)malloc( ELEMENTS *
                                     sizeof( Entry ) );

    HANDLE_ERROR( cudaMemcpy( hostTable.entries, table.entries,
                              table.count * sizeof(Entry*),
                              cudaMemcpyDeviceToHost ) );
    HANDLE_ERROR( cudaMemcpy( hostTable.pool, table.pool,
                              ELEMENTS * sizeof( Entry ),
                              cudaMemcpyDeviceToHost ) );

The tricky portion of this routine involves the fact that some of the data we have copied are pointers. We cannot simply copy these pointers to the host because they are addresses on the GPU; they will no longer be valid pointers on the host. However, the relative offsets of the pointers will still be valid. Every GPU pointer to an Entry points somewhere within the table.pool[] array, but for the hash table to be usable on the host, we need them to point to the same Entry in the hostTable.pool[] array.

Given a GPU pointer X, we therefore need to add the pointer's offset from table.pool to hostTable.pool to get a valid host pointer. That is, the new pointer should be computed as follows:
(X - table.pool) + hostTable.pool

We perform this update for every Entry pointer we've copied from the GPU: the Entry pointers in hostTable.entries and the next pointer of every Entry in the table's pool of entries:

    for (int i=0; i<table.count; i++) {
        if (hostTable.entries[i] != NULL)
            hostTable.entries[i] =
                (Entry*)((size_t)hostTable.entries[i] -
                (size_t)table.pool + (size_t)hostTable.pool);
    }
    for (int i=0; i<ELEMENTS; i++) {
        if (hostTable.pool[i].next != NULL)
            hostTable.pool[i].next =
                (Entry*)((size_t)hostTable.pool[i].next -
                (size_t)table.pool + (size_t)hostTable.pool);
    }
}

Having seen the data structures, hash function, initialization, cleanup, and verification code, the most important piece remaining is the one that actually involves CUDA C atomics. As arguments, the add_to_table() kernel will take an array of keys and an array of values to be added to the hash table. Its next argument is the hash table itself, and the final argument is an array of locks that will be used to lock each of the table's buckets. Since our input is two arrays that our threads will need to index, we also need our all-too-common index linearization:

__global__ void add_to_table( unsigned int *keys, void **values,
                              Table table, Lock *lock ) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;

Our threads walk through the input arrays exactly like they did in the dot product example. For each key in the keys[] array, the thread will compute the hash function in order to determine which bucket the key/value pair belongs in. After determining the target bucket, the thread locks the bucket, adds its key/value pair, and unlocks the bucket.

    while (tid < ELEMENTS) {
        unsigned int key = keys[tid];
        size_t hashValue = hash( key, table.count );
        for (int i=0; i<32; i++) {
            if ((tid % 32) == i) {
                Entry *location = &(table.pool[tid]);
                location->key = key;
                location->value = values[tid];
                lock[hashValue].lock();
                location->next = table.entries[hashValue];
                table.entries[hashValue] = location;
                lock[hashValue].unlock();
            }
        }
        tid += stride;
    }
}

There is something remarkably peculiar about this bit of code, however. The for() loop and subsequent if() statement seem decidedly unnecessary. In Chapter 6, we introduced the concept of a warp. If you've forgotten, a warp is a collection of 32 threads that execute together in lockstep. The nuances of how this gets implemented in the GPU are beyond the scope of this book. What matters here is that only one thread in the warp can acquire the lock at a time, and we will suffer many a headache if we let all the threads in the warp contend for the lock simultaneously. In this situation, we've found that it's best to do some of the work in software and simply walk through each thread in the warp, giving each a chance to acquire the data structure's lock, do its work, and subsequently release the lock. In iteration i of the loop, only the thread whose tid % 32 equals i attempts to take a lock, so the threads of a warp take turns.

The flow of main() should look very similar to the CPU implementation. We start by allocating a large chunk of random data for our hash table keys. Then we create start and stop CUDA events and record the start event for our performance measurements. We proceed to allocate GPU memory for our array of random keys, copy the array up to the device, and initialize our hash table:

int main( void ) {
    unsigned int *buffer =
                     (unsigned int*)big_random_block( SIZE );

    cudaEvent_t     start, stop;
    HANDLE_ERROR( cudaEventCreate( &start ) );
    HANDLE_ERROR( cudaEventCreate( &stop ) );
    HANDLE_ERROR( cudaEventRecord( start, 0 ) );

    unsigned int *dev_keys;
    void         **dev_values;
    HANDLE_ERROR( cudaMalloc( (void**)&dev_keys, SIZE ) );
    // one void* per key; sizeof(void*) can exceed sizeof(unsigned int)
    HANDLE_ERROR( cudaMalloc( (void**)&dev_values,
                              ELEMENTS * sizeof(void*) ) );
    HANDLE_ERROR( cudaMemcpy( dev_keys, buffer, SIZE,
                              cudaMemcpyHostToDevice ) );

    // copy the values to dev_values here
    // filled in by user of this code example

    Table table;
    initialize_table( table, HASH_ENTRIES, ELEMENTS );

The last step of preparation to build our hash table involves preparing locks for the hash table's buckets. We allocate one lock for each bucket in the hash table. Conceivably, we could save some memory by using only one lock for the whole table. But doing so would utterly destroy performance because every thread would have to compete for the table lock whenever a group of threads tries to simultaneously add entries to the table. So we declare an array of locks, one for every bucket in the array. We then allocate a GPU array for the locks and copy them up to the device.


    Lock    lock[HASH_ENTRIES];
    Lock    *dev_lock;
    HANDLE_ERROR( cudaMalloc( (void**)&dev_lock,
                              HASH_ENTRIES * sizeof( Lock ) ) );
    HANDLE_ERROR( cudaMemcpy( dev_lock, lock,
                              HASH_ENTRIES * sizeof( Lock ),
                              cudaMemcpyHostToDevice ) );

The rest of main() is similar to the CPU version: We add all of our keys to the hash table, stop the performance timer, verify the correctness of the hash table, and clean up after ourselves.

    add_to_table<<<60,256>>>( dev_keys, dev_values,
                              table, dev_lock );

    HANDLE_ERROR( cudaEventRecord( stop, 0 ) );
    HANDLE_ERROR( cudaEventSynchronize( stop ) );
    float elapsedTime;
    HANDLE_ERROR( cudaEventElapsedTime( &elapsedTime,
                                        start, stop ) );
    printf( "Time to hash: %3.1f ms\n", elapsedTime );

    verify_table( table );

    HANDLE_ERROR( cudaEventDestroy( start ) );
    HANDLE_ERROR( cudaEventDestroy( stop ) );
    free_table( table );
    cudaFree( dev_lock );
    cudaFree( dev_keys );
    cudaFree( dev_values );
    free( buffer );
    return 0;
}
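To make "verify the correctness of the hash table" concrete, here is a hypothetical host-side sketch in C. It runs over a simplified chained table, and its Entry and Table layouts are assumptions for illustration, not the appendix's definitions. The sketch makes two checks: every entry must sit in the bucket its key hashes to (again assuming a modulo hash), and the total number of entries must equal the number of keys inserted.

```c
#include <stddef.h>

/* Hypothetical, simplified host-side chained hash table. */
typedef struct Entry {
    unsigned int  key;
    struct Entry *next;
} Entry;

typedef struct {
    size_t  count;     /* number of buckets */
    Entry **entries;   /* entries[i] is the head of bucket i's chain */
} Table;

/* Returns the number of problems found: one per entry stored in the
 * wrong bucket, plus one if the total entry count differs from
 * `expected`. A correct table returns 0. */
size_t check_table(const Table *t, size_t expected) {
    size_t errors = 0, found = 0;
    for (size_t i = 0; i < t->count; i++) {
        for (const Entry *e = t->entries[i]; e != NULL; e = e->next) {
            found++;
            if (e->key % t->count != i)   /* assumed modulo hash */
                errors++;
        }
    }
    if (found != expected)
        errors++;
    return errors;
}
```

Counting problems instead of stopping at the first one reports both kinds of failure (misplaced entries and lost or duplicated insertions) in a single pass.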


Hash Table Performance
Using an Intel Core 2 Duo, the CPU hash table example in the section "A CPU Hash Table" takes … ms to build a hash table from … MB of data. The code was built with the option -O3 to ensure maximally optimized CPU code. The multithreaded GPU hash table in the section "A GPU Hash Table" takes … ms to complete the same task. Differing by less than … percent, these are roughly comparable execution times, which raises an excellent question: Why would a massively parallel machine such as a GPU get beaten by a single-threaded CPU version of the same application? Frankly, this is because GPUs were not designed to excel at multithreaded access to complex data structures such as a hash table. For this reason, there are very few performance motivations to build a data structure such as a hash table on the GPU. So if all your application needs to do is build a hash table or a similar data structure, you would likely be better off doing this on your CPU.

On the other hand, you will sometimes find yourself in a situation where a long computation pipeline involves one or two stages in which the GPU does not enjoy a performance advantage over comparable CPU implementations. In these situations, you have three (somewhat obvious) options:

• Perform every step of the pipeline on the GPU.

• Perform every step of the pipeline on the CPU.

• Perform some pipeline steps on the GPU and some on the CPU.

The last option sounds like the best of both worlds; however, it implies that you will need to synchronize your CPU and GPU at any point in your application where you want to move computation from the GPU to the CPU or back. This synchronization and the subsequent data transfer between host and GPU can often kill any performance advantage you might have derived from employing a hybrid approach in the first place.

In such a situation, it may be worth your time to perform every phase of computation on the GPU, even if the GPU is not ideally suited for some steps of the algorithm. In this vein, the GPU hash table can potentially eliminate a CPU/GPU synchronization point, minimize data transfer between the host and GPU, and free the CPU to perform other computations. In such a scenario, it's possible that the overall performance of a GPU implementation would exceed that of a CPU/GPU hybrid approach, despite the GPU being no faster than the CPU on certain steps (or potentially even getting trounced by the CPU in some cases).
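A simple cost model makes the trade-off among the three options concrete. The C sketch below is hypothetical. It assumes each stage has a known CPU time and GPU time, and that every switch between devices costs a fixed switch_ms for synchronization plus data transfer. It ignores any initial or final copies. best_pipeline_ms() uses dynamic programming over the two possible devices per stage to find the cheapest assignment. single_device_ms() gives the all-CPU or all-GPU total.

```c
#include <stddef.h>

enum { CPU = 0, GPU = 1 };

/* Hypothetical cost model: stage k costs stage_ms[k][CPU] or
 * stage_ms[k][GPU]; each time consecutive stages run on different
 * devices, we pay switch_ms for synchronization plus data transfer.
 * Returns the cheapest total over all CPU/GPU assignments. */
double best_pipeline_ms(const double (*stage_ms)[2], size_t n,
                        double switch_ms) {
    double best[2], next[2];
    if (n == 0)
        return 0.0;
    /* best[d]: cheapest cost of stages 0..k with stage k on device d */
    best[CPU] = stage_ms[0][CPU];
    best[GPU] = stage_ms[0][GPU];
    for (size_t k = 1; k < n; k++) {
        for (int d = CPU; d <= GPU; d++) {
            double stay      = best[d];
            double switch_in = best[1 - d] + switch_ms;
            next[d] = (stay < switch_in ? stay : switch_in) + stage_ms[k][d];
        }
        best[CPU] = next[CPU];
        best[GPU] = next[GPU];
    }
    return best[CPU] < best[GPU] ? best[CPU] : best[GPU];
}

/* Every stage on one device: no switches, so no switch cost. */
double single_device_ms(const double (*stage_ms)[2], size_t n, int device) {
    double total = 0.0;
    for (size_t k = 0; k < n; k++)
        total += stage_ms[k][device];
    return total;
}
```

The tests use three made-up stages with (CPU, GPU) times of (100, 10), (40, 50), and (100, 10) ms, so the CPU is faster only on the middle stage. With a 25 ms switch cost, the all-GPU plan (70 ms) beats the hybrid plan (110 ms). If switching were free, the hybrid plan would win at 60 ms.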


Appendix Review
We saw how to use atomic compare-and-swap operations to implement a GPU mutex. Using a lock built with this mutex, we saw how to improve our original dot product application to run entirely on the GPU. We carried this idea further by implementing a multithreaded hash table that used an array of locks to prevent unsafe simultaneous modifications by multiple threads. In fact, the mutex we developed could be used for any manner of parallel data structures, and we hope that you'll find it useful in your own experimentation and application development. Of course, the performance of applications that use the GPU to implement mutex-based data structures needs careful study. Our GPU hash table gets beaten by a single-threaded CPU version of the same code, so it will make sense to use the GPU for this type of application only in certain situations. There is no blanket rule that can be used to determine whether a GPU-only, CPU-only, or hybrid approach will work best, but knowing how to use atomics will allow you to make that decision on a case-by-case basis.
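You can try the compare-and-swap idea behind the GPU mutex on the host. This sketch is a CPU analogue written with C11 <stdatomic.h>, not the book's CUDA Lock. Here atomic_compare_exchange_strong() plays the role of atomicCAS(), and atomic_exchange() plays the role of atomicExch(). The SpinLock type and the spin_* function names are hypothetical. The sketch does not model GPU-specific concerns, such as how threads in the same warp are scheduled while they spin.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Host-side analogue of the GPU mutex: 0 = unlocked, 1 = locked. */
typedef struct {
    atomic_int state;
} SpinLock;

void spin_init(SpinLock *l) { atomic_init(&l->state, 0); }

/* One compare-and-swap attempt: atomically change 0 to 1.
 * Succeeds only if the lock was free when we looked. */
bool spin_try_lock(SpinLock *l) {
    int expected = 0;
    return atomic_compare_exchange_strong(&l->state, &expected, 1);
}

/* Keep retrying the compare-and-swap until it wins (busy-waiting). */
void spin_lock(SpinLock *l) {
    while (!spin_try_lock(l))
        ;
}

/* Release by atomically writing 0 back. */
void spin_unlock(SpinLock *l) { atomic_exchange(&l->state, 0); }
```

Because the swap happens only when the lock still holds 0, at most one caller can move it from unlocked to locked, which is the property the GPU mutex relies on.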




