
CUDA-GDB

NVIDIA CUDA Debugger - 4.1 Release for Linux and Mac


DU-05227-001_V4.1 | January 10, 2012

User Manual

TABLE OF CONTENTS

1 Introduction
    What is CUDA-GDB?
    Supported features
    About this document
2 Release Notes
    GDB 7.2 Source Base
    Support For Simultaneous CUDA-GDB Sessions
    New Autostep Command
    Support For Multiple Contexts
    Support for Device Assertions
3 Getting Started
    Installation Instructions
    Setting Up the Debugger Environment
        Linux
        Mac OS X
    Compiling the Application
        Debug Compilation
        Compiling for Fermi GPUs
        Compiling for Fermi and Tesla GPUs
    Using the Debugger
        Single GPU Debugging
        Multi-GPU Debugging
        Remote Debugging
        Multiple Debuggers
        CUDA/OpenGL Interop Applications on Linux
4 CUDA-GDB Extensions
    Command Naming Convention
    Getting Help
    Initialization File
    GUI Integration
        Emacs
        DDD
5 Kernel Focus
    Software Coordinates vs. Hardware Coordinates
    Current Focus
    Switching Focus
6 Program Execution
    Interrupting the Application
    Single-Stepping
7 Breakpoints
    Symbolic Breakpoints
    Line Breakpoints
    Address Breakpoints
    Kernel Entry Breakpoints
    Conditional Breakpoints
8 Inspecting Program State
    Memory and Variables
    Variable Storage and Accessibility
    Inspecting Textures
    Info CUDA Commands
        info cuda devices
        info cuda sms
        info cuda warps
        info cuda lanes
        info cuda kernels
        info cuda blocks
        info cuda threads
9 Context and Kernel Events
    Display CUDA context events
    Display CUDA kernel events
    Examples of displayed events
10 Checking Memory Errors
    Checking Memory Errors
    Increasing the Precision of Memory Errors With Autostep
        Usage
        Related Commands
    GPU Error Reporting
11 Walk-through Examples
    Example 1: bitreverse
        Source Code
        Walking Through the Code
    Example 2: autostep
        Source Code
        Debugging With Autosteps
Appendix A: Supported Platforms
    Host Platform Requirements
        Mac OS
        Linux
    GPU Requirements
Appendix B: Known Issues

01 INTRODUCTION

This document introduces CUDA-GDB, the NVIDIA CUDA debugger, and describes what is new in version 4.1.

What is CUDA-GDB?
CUDA-GDB is the NVIDIA tool for debugging CUDA applications running on Linux and Mac. CUDA-GDB is an extension to the x86-64 port of GDB, the GNU Project debugger. The tool provides developers with a mechanism for debugging CUDA applications running on actual hardware. This enables developers to debug applications without the potential variations introduced by simulation and emulation environments. CUDA-GDB runs on Linux and Mac OS X, 32-bit and 64-bit. CUDA-GDB is based on GDB 7.2 on both Linux and Mac OS X.

Supported features
CUDA-GDB is designed to present the user with a seamless debugging environment that allows simultaneous debugging of both GPU and CPU code within the same application. Just as programming in CUDA C is an extension to C programming, debugging with CUDA-GDB is a natural extension to debugging with GDB. The existing GDB debugging features are inherently present for debugging the host code, and additional features have been provided to support debugging CUDA device code.

CUDA-GDB supports C and C++ CUDA applications. All the C++ features supported by the NVCC compiler can be debugged by CUDA-GDB.

CUDA-GDB allows the user to set breakpoints, to single-step CUDA applications, and also to inspect and modify the memory and variables of any given thread running on the hardware.

CUDA-GDB supports debugging all CUDA applications, whether they use the CUDA driver API, the CUDA runtime API, or both.


CUDA-GDB supports debugging kernels that have been compiled for specific CUDA architectures, such as sm_10 or sm_20, but also supports debugging kernels compiled at runtime, referred to as just-in-time compilation, or JIT compilation for short.

About this document


This document is the main documentation for CUDA-GDB and is organized more as a user manual than a reference manual. The rest of the document will describe how to install and use CUDA-GDB to debug CUDA kernels and how to use the new CUDA commands that have been added to GDB. Some walk-through examples are also provided. It is assumed that the user already knows the basic GDB commands used to debug host applications.


02 RELEASE NOTES

The following features have been added for the 4.1 release:

GDB 7.2 Source Base


Until now, CUDA-GDB was based on GDB 6.6 on Linux, and GDB 6.3.5 on Darwin (the Apple branch). Now, both versions of CUDA-GDB are using the same 7.2 source base. Also, CUDA-GDB supports newer versions of GCC (tested up to GCC 4.5), has better support for DWARF3 debug information, and better C++ debugging support.

Support For Simultaneous CUDA-GDB Sessions


With the 4.1 release, the single CUDA-GDB process restriction is lifted. Now, multiple CUDA-GDB sessions are allowed to coexist as long as the GPUs are not shared between the applications being debugged. For instance, one CUDA-GDB process can debug process foo using GPU 0 while another CUDA-GDB process debugs process bar using GPU 1. The exclusive use of GPUs can be enforced with the CUDA_VISIBLE_DEVICES environment variable.


New Autostep Command


A new autostep command was added. The command increases the precision of CUDA exceptions by automatically single-stepping through portions of code.

Under normal execution, the thread and instruction where an exception occurred may be imprecisely reported. However, the exact instruction that generates the exception can be determined if the program is being single-stepped when the exception occurs.

Manually single-stepping through a program is a slow and tedious process. Therefore autostep aids the user by allowing them to specify sections of code where they suspect an exception could occur. These sections are automatically single-stepped through when the program is running, and any exception that occurs within these sections is precisely reported.

Type help autostep from CUDA-GDB for the syntax and usage of the command.

Support For Multiple Contexts


On GPUs with compute capability of sm_20 or higher, debugging multiple contexts on the same GPU is now supported. It was a known limitation in previous releases.

Support for Device Assertions


The R285 driver released with the 4.1 version of the toolkit supports device assertions. CUDA-GDB supports the assertion call and stops the execution of the application when the assertion is hit. Then the variables and memory can be inspected as usual. The application can also be resumed past the assertion if needed. Use the set cuda hide_internal_frames option to expose/hide the system call frames (hidden by default).


03 GETTING STARTED

Included in this chapter are instructions for installing CUDA-GDB and for using NVCC, the NVIDIA CUDA compiler driver, to compile CUDA programs for debugging.

Installation Instructions
Follow these steps to install CUDA-GDB.

1  Visit the NVIDIA CUDA Zone download page:
http://www.nvidia.com/object/cuda_get.html.

2  Select the appropriate operating system, Mac OS X or Linux.
   (See "Host Platform Requirements" in Appendix A.)

3  Download and install the CUDA Driver.

4  Download and install the CUDA Toolkit.


Setting Up the Debugger Environment


Linux
Set up the PATH and LD_LIBRARY_PATH environment variables:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/lib:$LD_LIBRARY_PATH

Mac OS X
Set up the PATH and DYLD_LIBRARY_PATH environment variables:
export PATH=/usr/local/cuda/bin:$PATH
export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:$DYLD_LIBRARY_PATH

Also, if you are unable to execute CUDA-GDB or if you hit the "Unable to find Mach task port for process-id" error, try resetting the correct permissions with the following commands:


sudo chgrp procmod /usr/local/cuda/bin/cuda-binary-gdb
sudo chmod 2755 /usr/local/cuda/bin/cuda-binary-gdb
sudo chmod 755 /usr/local/cuda/bin/cuda-gdb

Temporary Directory
By default, CUDA-GDB uses /tmp as the directory to store temporary files. To select a different directory, set the $TMPDIR environment variable.
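For example (the path below is only an illustration; any writable directory works):

export TMPDIR=/home/user/cuda-gdb-tmp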


Compiling the Application


Debug Compilation
NVCC, the NVIDIA CUDA compiler driver, provides a mechanism for generating the debugging information necessary for CUDA-GDB to work properly. The -g -G option pair must be passed to NVCC when an application is compiled in order to debug with CUDA-GDB; for example,
nvcc -g -G foo.cu -o foo

Using this line to compile the CUDA application foo.cu:
- forces -O0 compilation, with the exception of very limited dead-code eliminations and register-spilling optimizations.
- makes the compiler include debug information in the executable.

Compiling for Fermi GPUs


For Fermi GPUs, add the following flag to target Fermi output when compiling the application:
-gencode arch=compute_20,code=sm_20

It will compile the kernels specifically for the Fermi architecture once and for all. If the flag is not specified, then the kernels must be recompiled at runtime every time.

Compiling for Fermi and Tesla GPUs


If you are targeting both Fermi and Tesla GPUs, include these two flags:

-gencode arch=compute_20,code=sm_20 -gencode arch=compute_10,code=sm_10

Note: It is highly recommended to use the -gencode flag whenever possible.


Using the Debugger


Debugging a CUDA GPU involves pausing that GPU. When the graphics desktop manager is running on the same GPU, then debugging that GPU freezes the GUI and makes the desktop unusable. To avoid this, use CUDA-GDB in the following system configurations:

Single GPU Debugging


In a single GPU system, CUDA-GDB can be used to debug CUDA applications only if no X11 server (on Linux) or no Aqua desktop manager (on Mac OS X) is running on that system. On Linux you can stop the X11 server by stopping the gdm service. On Mac OS X you can log in with >console as the user name in the desktop UI login screen. This allows CUDA applications to be executed and debugged in a single GPU configuration.
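As a sketch, on a Linux system whose desktop is managed by gdm (the service name and the way services are stopped vary between distributions, and my_app is a placeholder for your application):

sudo /etc/init.d/gdm stop
cuda-gdb my_app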

Multi-GPU Debugging
Multi-GPU debugging is not much different from single-GPU debugging except for a few additional CUDA-GDB commands that let you switch between the GPUs.

Any GPU hitting a breakpoint will pause all the GPUs running CUDA on that system. Once paused, you can use info cuda kernels to view all the active kernels and the GPUs they are running on. When any GPU is resumed, all the GPUs are resumed.
Note: If the CUDA_VISIBLE_DEVICES environment variable is used, only the specified devices are suspended and resumed.

All CUDA-capable GPUs may run one or more kernels. To switch to an active kernel, use cuda kernel <n>, where n is the ID of the kernel retrieved from info cuda kernels.
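For instance (the kernel ID 2 below is illustrative; use an ID reported by info cuda kernels in your own session):

(cuda-gdb) info cuda kernels
(cuda-gdb) cuda kernel 2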


Note: The same kernel can be loaded and used by different contexts and devices at the same time. When a breakpoint is set in such a kernel, by either name or file name and line number, it will be resolved arbitrarily to only one instance of that kernel. With the runtime API, the exact instance to which the breakpoint will be resolved cannot be controlled. With the driver API, the user can control the instance to which the breakpoint will be resolved by setting the breakpoint right after its module is loaded.


Multi-GPU Debugging in Console Mode


CUDA-GDB allows simultaneous debugging of applications running CUDA kernels on multiple GPUs. In console mode, CUDA-GDB can be used to pause and debug every GPU in the system. You can enable console mode as described above for the single-GPU console mode.

Multi-GPU Debugging with the Desktop Manager Running


This can be achieved by running the desktop GUI on one GPU and CUDA on the other GPU to avoid hanging the desktop GUI.

On Linux

The CUDA driver automatically excludes the GPU used by X11 from being visible to the application being debugged. This may change the behavior of the application since, if there are n GPUs in the system, then only n-1 GPUs will be visible to the application.

On Mac OS X

The CUDA driver exposes every CUDA-capable GPU in the system, including the one used by the Aqua desktop manager. To determine which GPU should be used for CUDA, run the deviceQuery app from the CUDA SDK samples. The output of deviceQuery as shown in Figure 3.1 indicates all the GPUs in the system.

For example, if you have two GPUs you will see Device 0: GeForce xxxx and Device 1: GeForce xxxx. Choose the Device <index> that is not rendering the desktop on your connected monitor. If Device 0 is rendering the desktop, then choose Device 1 for running and debugging the CUDA application. This exclusion of the desktop can be achieved by setting the CUDA_VISIBLE_DEVICES environment variable to 1:
export CUDA_VISIBLE_DEVICES=1


Figure 3.1  deviceQuery Output

Remote Debugging
To remotely debug an application, use SSH or VNC from the host system to connect to the target system. From there, CUDA-GDB can be launched in console mode.
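A minimal sketch (user name, host name, and application name are placeholders):

ssh user@target-system
cuda-gdb my_app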


Multiple Debuggers
In a multi-GPU environment, several debugging sessions may take place simultaneously as long as the CUDA devices are used exclusively. For instance, one instance of CUDA-GDB can debug a first application that uses the first GPU while another instance of CUDA-GDB debugs a second application that uses the second GPU. The exclusive use of a GPU is achieved by specifying which GPU is visible to the application by using the CUDA_VISIBLE_DEVICES environment variable.
CUDA_VISIBLE_DEVICES=1 cuda-gdb my_app

CUDA/OpenGL Interop Applications on Linux


Any CUDA application that uses OpenGL interoperability requires an active window server. Such applications will fail to run under console mode debugging on both Linux and Mac OS X. However, if the X server is running on Linux, the render GPU will not be enumerated when debugging, so the application could still fail, unless the application uses the OpenGL device enumeration to access the render GPU. But if the X session is running in non-interactive mode while using the debugger, the render GPU will be enumerated correctly.

Instructions
1  Launch your X session in non-interactive mode.
   a  Stop your X server.
   b  Edit /etc/X11/xorg.conf to contain the following line in the Device section corresponding to your display:
Option "Interactive" "off

   c  Restart your X server.

2  Log in remotely (SSH, etc.) and launch your application under CUDA-GDB.
   This setup works properly for single-GPU and multi-GPU configurations.

3  Ensure your DISPLAY environment variable is set appropriately. For example:
export DISPLAY=:0.0

Limitations
While X is in non-interactive mode, interacting with the X session can cause your debugging session to stall or terminate.


04 CUDA-GDB EXTENSIONS

Command Naming Convention


The existing GDB commands are unchanged. Every new CUDA command or option is prefixed with the CUDA keyword. As much as possible, CUDA-GDB command names are similar to the equivalent GDB commands used for debugging host code. For instance, the GDB commands to display the host threads and to switch to host thread 1 are, respectively:
(cuda-gdb) info threads
(cuda-gdb) thread 1

To display the CUDA threads and switch to CUDA thread 1, the user only has to type:
(cuda-gdb) info cuda threads
(cuda-gdb) cuda thread 1

Getting Help
As with GDB commands, the built-in help for the CUDA commands is accessible from the cuda-gdb command line by using the help command:
(cuda-gdb) help cuda name_of_the_cuda_command
(cuda-gdb) help set cuda name_of_the_cuda_option
(cuda-gdb) help info cuda name_of_the_info_cuda_command


Initialization File
The initialization file for CUDA-GDB is named .cuda-gdbinit and follows the same rules as the standard .gdbinit file used by GDB. The initialization file may contain any CUDA-GDB command. Those commands will be processed in order when CUDA-GDB is launched.
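A minimal sketch of a .cuda-gdbinit file is shown below; the commands are only illustrative, and any CUDA-GDB or GDB command could appear instead:

set cuda memcheck on
set cuda kernel_events 1
break main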

GUI Integration
Emacs
CUDA-GDB works with GUD in Emacs and XEmacs. No extra step is required other than pointing to the right binary.

To use CUDA-GDB, the gud-gdb-command-name variable must be set to "cuda-gdb annotate=3". Use M-x customize-variable to set the variable.

Ensure that cuda-gdb is present in the Emacs/XEmacs $PATH.

DDD
CUDA-GDB works with DDD. To use DDD with CUDA-GDB, launch DDD with the following command:
ddd --debugger cuda-gdb

cuda-gdb must be in your $PATH.


05 KERNEL FOCUS

A CUDA application may be running several host threads and many device threads. To simplify the visualization of information about the state of the application, commands are applied to the entity in focus.

When the focus is set to a host thread, the commands will apply only to that host thread (unless the application is fully resumed, for instance). On the device side, the focus is always set to the lowest granularity level: the device thread.

Software Coordinates vs. Hardware Coordinates


A device thread belongs to a block, which in turn belongs to a kernel. Thread, block, and kernel are the software coordinates of the focus. A device thread runs on a lane. A lane belongs to a warp, which belongs to an SM, which in turn belongs to a device. Lane, warp, SM, and device are the hardware coordinates of the focus. Software and hardware coordinates can be used interchangeably and simultaneously as long as they remain coherent.

Another software coordinate is sometimes used: the grid. The difference between a grid and a kernel is the scope. The grid ID is unique per GPU whereas the kernel ID is unique across all GPUs. Therefore there is a 1:1 mapping between a kernel and a (grid, device) tuple.

Current Focus
To inspect the current focus, use the cuda command followed by the coordinates of interest:
(cuda-gdb) cuda device sm warp lane block thread
block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0
(cuda-gdb) cuda kernel block thread
kernel 1, block (0,0,0), thread (0,0,0)
(cuda-gdb) cuda kernel
kernel 1


Switching Focus
To switch the current focus, use the cuda command followed by the coordinates to be changed:
(cuda-gdb) cuda device 0 sm 1 warp 2 lane 3
[Switching focus to CUDA kernel 1, grid 2, block (8,0,0), thread (67,0,0), device 0, sm 1, warp 2, lane 3]
374     int totalThreads = gridDim.x * blockDim.x;

If the specified focus is not fully defined by the command, the debugger will assume that the omitted coordinates are set to the coordinates in the current focus, including the subcoordinates of the block and thread.


(cuda-gdb) cuda thread (15)
[Switching focus to CUDA kernel 1, grid 2, block (8,0,0), thread (15,0,0), device 0, sm 1, warp 0, lane 15]
374     int totalThreads = gridDim.x * blockDim.x;

The parentheses for the block and thread arguments are optional.
(cuda-gdb) cuda block 1 thread 3
[Switching focus to CUDA kernel 1, grid 2, block (1,0,0), thread (3,0,0), device 0, sm 3, warp 0, lane 3]
374     int totalThreads = gridDim.x * blockDim.x;


06 PROGRAM EXECUTION

Applications are launched the same way in CUDA-GDB as they are with GDB, by using the run command. This chapter describes how to interrupt and single-step CUDA applications.

Interrupting the Application


If the CUDA application appears to be hanging or stuck in an infinite loop, it is possible to manually interrupt the application by pressing CTRL+C. When the signal is received, the GPUs are suspended and the cuda-gdb prompt will appear.

At that point, the program can be inspected, modified, single-stepped, resumed, or terminated at the user's discretion.

This feature is limited to applications running within the debugger. It is not possible to break into and debug applications that have been launched outside the debugger.

Single-Stepping
Single-stepping device code is supported. However, unlike host code single-stepping, device code single-stepping works at the warp level. This means that single-stepping a device kernel advances all the active threads in the warp currently in focus. The divergent threads in the warp are not single-stepped.

In order to advance the execution of more than one warp, a breakpoint must be set at the desired location and then the application must be fully resumed.

A special case is single-stepping over a thread barrier call: __syncthreads(). In this case, an implicit temporary breakpoint is set immediately after the barrier and all threads are resumed until the temporary breakpoint is hit.

On GPUs with sm_type lower than sm_20 it is not possible to step over a subroutine in the device code. Instead, CUDA-GDB always steps into the device function. On GPUs with sm_type sm_20 and higher, you can step in, over, or out of the device functions as long as they are not inlined. To force a function to not be inlined by the compiler, the __noinline__ keyword must be added to the function declaration.
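A minimal sketch of such a declaration is shown below (the function itself is hypothetical):

__device__ __noinline__ int scale(int x)
{
    return 2 * x;   // trivial body; the keyword only prevents inlining
}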


07 BREAKPOINTS

There are multiple ways to set a breakpoint on a CUDA application. These methods are described below. The commands to set a breakpoint on device code are the same as the commands used to set a breakpoint on host code.

If the breakpoint is set on device code, the breakpoint will be marked pending until the ELF image of the kernel is loaded. At that point, the breakpoint will be resolved and its address will be updated.

When a breakpoint is set, it forces all resident GPU threads to stop at this location when they hit the corresponding PC.

When a breakpoint is hit by one thread, there is no guarantee that the other threads will hit the breakpoint at the same time. Therefore the same breakpoint may be hit several times, and the user must be careful to check which thread(s) actually hit the breakpoint.

Symbolic Breakpoints
To set a breakpoint at the entry of a function, use the break command followed by the name of the function or method:
(cuda-gdb) break my_function
(cuda-gdb) break my_class::my_method

For templatized functions and methods, the full signature must be given:
(cuda-gdb) break int my_templatized_function<int>(int)


The mangled name of the function can also be used. To find the mangled name of a function, you can use the following commands:
(cuda-gdb) set demangle-style none
(cuda-gdb) info function my_function_name
(cuda-gdb) set demangle-style auto

Line Breakpoints
To set a breakpoint on a specific line number, use the following syntax:
(cuda-gdb) break my_file.cu:185

If the specified line corresponds to an instruction within templatized code, multiple breakpoints will be created, one for each instance of the templatized code.

Address Breakpoints
To set a breakpoint at a specific address, use the break command with the address as argument:
(cuda-gdb) break 0x1afe34d0

The address can be any address on the device or the host.

Kernel Entry Breakpoints


To break on the first instruction of every launched kernel, set the break_on_launch option to application:
(cuda-gdb) set cuda break_on_launch application

Possible options are:
- application: any kernel launched by the user application
- system: any kernel launched by the driver, such as memset
- all: any kernel, application and system
- none: no kernel, application or system

Those automatic breakpoints are not displayed by the info breakpoints command and are managed separately from individual breakpoints. Turning off the option will not delete other individual breakpoints set to the same address, and vice versa.


Conditional Breakpoints
To make the breakpoint conditional, use the optional if keyword or the cond command.
(cuda-gdb) break foo.cu:23 if threadIdx.x == 1 && i < 5
(cuda-gdb) cond 3 threadIdx.x == 1 && i < 5

Conditional expressions may refer to any variable, including built-in variables such as threadIdx and blockIdx. Function calls are not allowed in conditional expressions.

Note that conditional breakpoints are always hit and evaluated, but the debugger reports the breakpoint as being hit only if the conditional statement evaluates to TRUE. The process of hitting the breakpoint and evaluating the corresponding conditional statement is time-consuming. Therefore, running applications while using conditional breakpoints may slow down the debugging session. Moreover, if the conditional statement always evaluates to FALSE, the debugger may appear to be hanging or stuck, although it is not the case. You can interrupt the application with CTRL-C to verify that progress is being made.

Conditional breakpoints can only be set on code from CUDA modules that are already loaded. Otherwise, CUDA-GDB will report an error that it is unable to find symbols in the current context. If unsure, first set an unconditional breakpoint at the desired location and add the conditional statement the first time the breakpoint is hit by using the cond command.
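A sketch of that two-step approach (file name, line number, breakpoint number, and condition are illustrative):

(cuda-gdb) break my_kernel.cu:42
(cuda-gdb) run
(cuda-gdb) cond 1 blockIdx.x == 0 && threadIdx.x == 7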


08 INSPECTING PROGRAM STATE

Memory and Variables


The GDB print command has been extended to decipher the location of any program variable and can be used to display the contents of any CUDA program variable, including:
- data allocated via cudaMalloc()
- data that resides in various GPU memory regions, such as shared, local, and global memory
- special CUDA runtime variables, such as threadIdx

Variable Storage and Accessibility


Depending on the variable type and usage, variables can be stored either in registers or in local, shared, const, or global memory. You can print the address of any variable to find out where it is stored and directly access the associated memory.

The example below shows how the variable array, which is of type shared int*, can be directly accessed in order to see what the stored values are in the array.
(cuda-gdb) print &array $1 = (@shared int (*)[0]) 0x20 (cuda-gdb) print array[0]@4 $2 = {0, 128, 64, 192}

You can also access the shared memory indexed into the starting offset to see what the stored values are:
(cuda-gdb) print *(@shared int*)0x20
$3 = 0
(cuda-gdb) print *(@shared int*)0x24
$4 = 128
(cuda-gdb) print *(@shared int*)0x28
$5 = 64


The example below shows how to access the starting address of the input parameter to the kernel.
(cuda-gdb) print &data $6 = (const @global void * const @parameter *) 0x10 (cuda-gdb) print *(@global void * const @parameter *) 0x10 $7 = (@global void * const @parameter) 0x110000

Note: The debugger can always read/write the source variables when the PC is on the first assembly instruction of a source instruction. When doing assembly-level debugging, the value of source variables is not always accessible.

Inspecting Textures

To inspect a texture, use the print command while dereferencing the texture recast to the type of the array it is bound to. For instance, if texture tex is bound to array A of type float*, use:


(cuda-gdb) print *(@texture float *)tex

All the array operators, such as [], can be applied to (@texture float *)tex:
(cuda-gdb) print ((@texture float *)tex)[2]
(cuda-gdb) print ((@texture float *)tex)[2]@4


Info CUDA Commands


These are commands that display information about the GPU and the application's CUDA state. The available options are:
- devices: information about all the devices
- sms: information about all the SMs in the current device
- warps: information about all the warps in the current SM
- lanes: information about all the lanes in the current warp
- kernels: information about all the active kernels
- blocks: information about all the active blocks in the current kernel
- threads: information about all the active threads in the current kernel

A filter can be applied to every info cuda command. The filter restricts the scope of the command. A filter is composed of one or more restrictions. A restriction can be any of the following:


- device n
- sm n
- warp n
- lane n
- kernel n
- grid n
- block x[,y] or block (x[,y])
- thread x[,y[,z]] or thread (x[,y[,z]])

where n, x, y, z are integers, or one of the following special keywords: current, any, and all. current indicates that the corresponding value in the current focus should be used. any and all indicate that any value is acceptable.
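For instance, the following filtered queries are possible (the coordinate values are illustrative):

(cuda-gdb) info cuda warps device 0 sm 1
(cuda-gdb) info cuda threads block (1,0) thread (3,0,0)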

info cuda devices


This command enumerates all the GPUs in the system sorted by device index. A * indicates the device currently in focus. This command supports filters. The default is device all. This command prints "No CUDA Devices" if no GPUs are found.

(cuda-gdb) info cuda devices
  Dev  Description  SM Type  SMs  Warps/SM  Lanes/Warp  Max Regs/Lane  Active SMs Mask
*   0  gt200        sm_13     24        32          32            128       0x00ffffff


info cuda sms


This command shows all the SMs for the device and the associated active warps on the SMs. This command supports filters and the default is device current sm all. A * indicates the SM in focus. The results are grouped per device.

(cuda-gdb) info cuda sms
  SM  Active Warps Mask
Device 0
*  0  0xffffffffffffffff
   1  0xffffffffffffffff
   2  0xffffffffffffffff
   3  0xffffffffffffffff
   4  0xffffffffffffffff
   5  0xffffffffffffffff
   6  0xffffffffffffffff
   7  0xffffffffffffffff
   8  0xffffffffffffffff
...

info cuda warps


This command takes you one level deeper and prints all the warps information for the SM in focus. This command supports filters and the default is device current sm current warp all. The command can be used to display which warp executes what block.

(cuda-gdb) info cuda warps
  Wp  Active Lanes Mask  Divergent Lanes Mask  Active Physical PC  Kernel  BlockIdx
Device 0 SM 0
*  0  0xffffffff         0x00000000            0x000000000000001c       0  (0,0,0)
   1  0xffffffff         0x00000000            0x0000000000000000       0  (0,0,0)
   2  0xffffffff         0x00000000            0x0000000000000000       0  (0,0,0)
   3  0xffffffff         0x00000000            0x0000000000000000       0  (0,0,0)
   4  0xffffffff         0x00000000            0x0000000000000000       0  (0,0,0)
   5  0xffffffff         0x00000000            0x0000000000000000       0  (0,0,0)
   6  0xffffffff         0x00000000            0x0000000000000000       0  (0,0,0)
   7  0xffffffff         0x00000000            0x0000000000000000       0  (0,0,0)
...


info cuda lanes


This command displays all the lanes (threads) for the warp in focus. This command supports filters and the default is device current sm current warp current lane all. In the example below you can see that all the lanes are at the same physical PC. The command can be used to display which lane executes what thread.

(cuda-gdb) info cuda lanes
  Ln  State   Physical PC         ThreadIdx
Device 0 SM 0 Warp 0
*  0  active  0x000000000000008c  (0,0,0)
   1  active  0x000000000000008c  (1,0,0)
   2  active  0x000000000000008c  (2,0,0)
   3  active  0x000000000000008c  (3,0,0)
   4  active  0x000000000000008c  (4,0,0)
   5  active  0x000000000000008c  (5,0,0)
   6  active  0x000000000000008c  (6,0,0)
   7  active  0x000000000000008c  (7,0,0)
   8  active  0x000000000000008c  (8,0,0)
   9  active  0x000000000000008c  (9,0,0)
  10  active  0x000000000000008c  (10,0,0)
  11  active  0x000000000000008c  (11,0,0)
  12  active  0x000000000000008c  (12,0,0)
  13  active  0x000000000000008c  (13,0,0)
  14  active  0x000000000000008c  (14,0,0)
  15  active  0x000000000000008c  (15,0,0)
  16  active  0x000000000000008c  (16,0,0)
...

info cuda kernels


This command displays all the active kernels on the GPU in focus. It prints the SM mask, kernel ID, and grid ID for each kernel, with the associated dimensions and arguments. The kernel ID is unique across all GPUs whereas the grid ID is unique per GPU. This command supports filters and the default is kernel all.

(cuda-gdb) info cuda kernels
  Kernel  Dev  Grid  SMs Mask    GridDim    BlockDim   Name       Args
  ...     ...  ...   0x00ffffff  (240,1,1)  (128,1,1)  acos_main  parms={...}


info cuda blocks


This command displays all the active or running blocks for the kernel in focus. The results are grouped per kernel. This command supports filters and the default is kernel current block all. The outputs are coalesced by default.

(cuda-gdb) info cuda blocks
  BlockIdx  To BlockIdx  Count  State
Kernel 1
* (0,0,0)   (191,0,0)    192    running

Coalescing can be turned off as follows, in which case more information on the Device and the SM gets displayed:

(cuda-gdb) set cuda coalescing off

The following is the output of the same command when coalescing is turned off.

(cuda-gdb) info cuda blocks
  BlockIdx  State    Dev  SM
Kernel 1
* (0,0,0)   running  0     0
  (1,0,0)   running  0     3
  (2,0,0)   running  0     6
  (3,0,0)   running  0     9
  (4,0,0)   running  0    12
  (5,0,0)   running  0    15
  (6,0,0)   running  0    18
  (7,0,0)   running  0    21
  (8,0,0)   running  0     1
...


info cuda threads


This command displays the application's active CUDA blocks and threads with the total count of threads in those blocks. Also displayed are the virtual PC and the associated source file and line number information. The results are grouped per kernel. The command supports filters, with the default being kernel current block all thread all. The outputs are coalesced by default as follows:

(cuda-gdb) info cuda threads
  BlockIdx ThreadIdx  To BlockIdx ThreadIdx  Count  Virtual PC          Filename  Line
Device 0 SM 0
* (0,0,0) (0,0,0)     (0,0,0) (31,0,0)       32     0x000000000088f88c  acos.cu   376
  (0,0,0) (32,0,0)    (191,0,0) (127,0,0)    24544  0x000000000088f800  acos.cu   374
...

Coalescing can be turned off as follows, in which case more information is displayed with the output.

(cuda-gdb) info cuda threads
  BlockIdx  ThreadIdx  Virtual PC          Dev  SM  Wp  Ln  Filename  Line
Kernel 1
* (0,0,0)   (0,0,0)    0x000000000088f88c  0    0   0   0   acos.cu   376
  (0,0,0)   (1,0,0)    0x000000000088f88c  0    0   0   1   acos.cu   376
  (0,0,0)   (2,0,0)    0x000000000088f88c  0    0   0   2   acos.cu   376
  (0,0,0)   (3,0,0)    0x000000000088f88c  0    0   0   3   acos.cu   376
  (0,0,0)   (4,0,0)    0x000000000088f88c  0    0   0   4   acos.cu   376
  (0,0,0)   (5,0,0)    0x000000000088f88c  0    0   0   5   acos.cu   376
  (0,0,0)   (6,0,0)    0x000000000088f88c  0    0   0   6   acos.cu   376
  (0,0,0)   (7,0,0)    0x000000000088f88c  0    0   0   7   acos.cu   376
  (0,0,0)   (8,0,0)    0x000000000088f88c  0    0   0   8   acos.cu   376
  (0,0,0)   (9,0,0)    0x000000000088f88c  0    0   0   9   acos.cu   376
...

Note: In coalesced form, threads must be contiguous in order to be coalesced. If some threads are not currently running on the hardware, they will create "holes" in the thread ranges. For instance, if a kernel consists of 2 blocks of 16 threads, and only the 8 lowest threads are active, then 2 coalesced ranges will be printed: one range for block 0 thread 0 to 7, and one range for block 1 thread 0 to 7. Because threads 8-15 in block 0 are not running, the 2 ranges cannot be coalesced.


09 CONTEXT AND KERNEL EVENTS

Within CUDA-GDB, kernel refers to your device code that executes on the GPU, while context refers to the virtual address space on the GPU for your kernel. You can turn ON or OFF the display of CUDA context and kernel events to review the flow of the active contexts and kernels.

Display CUDA context events


(cuda-gdb) set cuda context_events 1

Display CUDA context events.
(cuda-gdb) set cuda context_events 0

Do not display CUDA context events.

Display CUDA kernel events


(cuda-gdb) set cuda kernel_events 1

Display CUDA kernel events.
(cuda-gdb) set cuda kernel_events 0

Do not display CUDA kernel events.


Examples of displayed events


The following are examples of context events displayed:
[Context Create of context 0xad2fe60 on Device 0]
[Context Pop of context 0xad2fe60 on Device 0]
[Context Destroy of context 0xad2fe60 on Device 0]

The following are examples of kernel events displayed:
[Launch of CUDA Kernel 1 (kernel3) on Device 0]
[Termination of CUDA Kernel 1 (kernel3) on Device 0]

Note: The kernel termination event is only displayed when a kernel is launched asynchronously, or when the debugger can safely assume that the kernel has terminated.


10 CHECKING MEMORY ERRORS

Checking Memory Errors


The CUDA memcheck feature detects global memory violations and misaligned global memory accesses. This feature is off by default and can be enabled using the following variable in CUDA-GDB before the application is run.
(cuda-gdb) set cuda memcheck on

Once CUDA memcheck is enabled, any detection of global memory violations and misaligned global memory accesses will be reported.

When CUDA memcheck is enabled, all the kernel launches are made blocking, as if the environment variable CUDA_LAUNCH_BLOCKING was set to 1. The host thread launching a kernel will therefore wait until the kernel has completed before proceeding. This may change the behavior of your application.

You can also run the CUDA memory checker as a standalone tool named CUDA-MEMCHECK. This tool is also part of the toolkit. Please read the related documentation for more information.

By default, CUDA-GDB will report any memory error. See the next section for a list of the memory errors. To increase the number of memory errors being reported and to increase the precision of the memory errors, CUDA memcheck must be turned on.
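A sketch of the standalone usage (my_app is a placeholder for your executable):

cuda-memcheck ./my_app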


Increasing the Precision of Memory Errors With Autostep


Autostep is a command to increase the precision of CUDA exceptions to the exact lane and instruction, when they would not have been otherwise.

Under normal execution, an exception may be reported several instructions after the exception occurred, or the exact thread where an exception occurred may not be known unless the exception is a lane error. However, the precise origin of the exception can be determined if the program is being single-stepped when the exception occurs. Single-stepping manually is a slow and tedious process; stepping takes much longer than normal execution and the user has to single-step each warp individually.

Autostep aids the user by allowing them to specify sections of code where they suspect an exception could occur, and these sections are automatically and transparently single-stepped while the program is running. The rest of the program is executed normally to minimize the slowdown caused by single-stepping. The precise origin of an exception will be reported if the exception occurs within these sections. Thus the exact instruction and thread where an exception occurred can be found quickly and with much less effort by using autostep.

Usage
autostep [LOCATION]
autostep [LOCATION] for LENGTH [lines|instructions]

- LOCATION may be anything that you use to specify the location of a breakpoint, such as a line number, function name, or an instruction address preceded by an asterisk. If no LOCATION is specified, then the current instruction address is used.
- LENGTH specifies the size of the autostep window in number of lines or instructions (lines and instructions can be shortened, e.g., l or i). If the length type is not specified, then lines is the default. If the for clause is omitted, then the default is 1 line.


- astep can be used as an alias for the autostep command.
- Calls to functions made during an autostep will be stepped over.
- In case of divergence, the length of the autostep window is determined by the number of lines or instructions the first active lane in each warp executes. Divergent lanes are also single-stepped, but the instructions they execute do not count towards the length of the autostep window.


- If a breakpoint occurs while inside an autostep window, the warp where the breakpoint was hit will not continue autostepping when the program is resumed. However, other warps may continue autostepping.
- Overlapping autosteps are not supported.


- If an autostep is encountered while another autostep is being executed, then the second autostep is ignored.
Note: Autostep requires Fermi GPUs or above.
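For instance (file name, line number, address, and window lengths are illustrative):

(cuda-gdb) autostep my_kernel.cu:25 for 10 lines
(cuda-gdb) autostep *0x1afe34d0 for 20 instructions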

Related Commands
Autosteps and breakpoints share the same numbering, so most commands that work with breakpoints will also work with autosteps.

info autosteps
Shows all breakpoints and autosteps. Similar to info breakpoints.
(cuda-gdb) info autosteps
Num  Type      Disp  Enb  Address            What
1    autostep  keep  y    0x0000000000401234 in merge at sort.cu:30 for 49 instructions
3    autostep  keep  y    0x0000000000489913 in bubble at sort.cu:94 for 11 lines

disable autosteps n
Disables an autostep. Equivalent to disable breakpoints n.

delete autosteps n
Deletes an autostep. Equivalent to delete breakpoints n.

ignore n i
Do not single-step the next i times the debugger enters the window for autostep n. This command already exists for breakpoints.
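For instance, to skip the next 5 entries into the window of autostep 1 (both numbers are illustrative):

(cuda-gdb) ignore 1 5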


GPU Error Reporting


With improved GPU error reporting in CUDA-GDB, application bugs are now easier to identify and easy to fix. The following table shows the new errors that are reported on GPUs with compute capability sm_20 and higher.

Note: Continuing the execution of your application after these errors are found can lead to application termination or indeterminate results.

Table 10.1  CUDA Exception Codes

CUDA_EXCEPTION_0 : Device Unknown Exception
  Precision of the error: Not precise
  Scope of the error: Global error on the GPU
  Description: This is a global GPU error caused by the application which does not match any of the listed error codes below. This should be a rare occurrence. Potentially, this may be due to Device Hardware Stack overflows or a kernel generating an exception very close to its termination.

CUDA_EXCEPTION_1 : Lane Illegal Address
  Precision of the error: Precise (requires memcheck on)
  Scope of the error: Per lane/thread error
  Description: This occurs when a thread accesses an illegal (out of bounds) global address.

CUDA_EXCEPTION_2 : Lane User Stack Overflow
  Precision of the error: Precise
  Scope of the error: Per lane/thread error
  Description: This occurs when a thread exceeds its stack memory limit.

CUDA_EXCEPTION_3 : Device Hardware Stack Overflow
  Precision of the error: Not precise
  Scope of the error: Global error on the GPU
  Description: This occurs when the application triggers a global hardware stack overflow. The main cause of this error is large amounts of divergence in the presence of function calls.

CUDA_EXCEPTION_4 : Warp Illegal Instruction
  Precision of the error: Not precise
  Scope of the error: Warp error
  Description: This occurs when any thread within a warp has executed an illegal instruction.

CUDA_EXCEPTION_5 : Warp Out-of-range Address
  Precision of the error: Not precise
  Scope of the error: Warp error
  Description: This occurs when any thread within a warp accesses an address that is outside the valid range of local or shared memory regions.

CUDA_EXCEPTION_6 : Warp Misaligned Address
  Precision of the error: Not precise
  Scope of the error: Warp error
  Description: This occurs when any thread within a warp accesses an address in the local or shared memory segments that is not correctly aligned.

CUDA_EXCEPTION_7 : Warp Invalid Address Space
  Precision of the error: Not precise
  Scope of the error: Warp error
  Description: This occurs when any thread within a warp executes an instruction that accesses a memory space not permitted for that instruction.

CUDA_EXCEPTION_8 : Warp Invalid PC
  Precision of the error: Not precise
  Scope of the error: Warp error
  Description: This occurs when any thread within a warp advances its PC beyond the 40-bit address space.

CUDA_EXCEPTION_9 : Warp Hardware Stack Overflow
  Precision of the error: Not precise
  Scope of the error: Warp error
  Description: This occurs when any thread in a warp triggers a hardware stack overflow. This should be a rare occurrence.

CUDA_EXCEPTION_10 : Device Illegal Address
  Precision of the error: Not precise
  Scope of the error: Global error
  Description: This occurs when a thread accesses an illegal (out of bounds) global address. For increased precision, use the cuda memcheck feature.

CUDA_EXCEPTION_11 : Lane Misaligned Address
  Precision of the error: Precise (requires memcheck on)
  Scope of the error: Per lane/thread error
  Description: This occurs when a thread accesses a global address that is not correctly aligned.

CUDA_EXCEPTION_12 : Warp Assert
  Precision of the error: Precise
  Scope of the error: Per warp
  Description: This occurs when any thread in the warp hits a device side assertion.


11 WALK-THROUGH EXAMPLES

The chapter contains two CUDA-GDB walk-through examples:
- Example 1: bitreverse
- Example 2: autostep

Example 1: bitreverse
This section presents a walk-through of CUDA-GDB by debugging a sample application called bitreverse that performs a simple 8-bit reversal on a data set.

Source Code
1   #include <stdio.h>
2   #include <stdlib.h>
3
4   // Simple 8-bit bit reversal Compute test
5
6   #define N 256
7
8   __global__ void bitreverse(void *data) {
9      unsigned int *idata = (unsigned int*)data;
10     extern __shared__ int array[];
11
12     array[threadIdx.x] = idata[threadIdx.x];
13
14     array[threadIdx.x] = ((0xf0f0f0f0 & array[threadIdx.x]) >> 4) |
15                          ((0x0f0f0f0f & array[threadIdx.x]) << 4);
16     array[threadIdx.x] = ((0xcccccccc & array[threadIdx.x]) >> 2) |
17                          ((0x33333333 & array[threadIdx.x]) << 2);
18     array[threadIdx.x] = ((0xaaaaaaaa & array[threadIdx.x]) >> 1) |
19                          ((0x55555555 & array[threadIdx.x]) << 1);
20
21     idata[threadIdx.x] = array[threadIdx.x];


22  }
23
24  int main(void) {
25     void *d = NULL; int i;
26     unsigned int idata[N], odata[N];
27
28     for (i = 0; i < N; i++)
29         idata[i] = (unsigned int)i;
30
31     cudaMalloc((void**)&d, sizeof(int)*N);
32     cudaMemcpy(d, idata, sizeof(int)*N,
33                cudaMemcpyHostToDevice);
34     bitreverse<<<1, N, N*sizeof(int)>>>(d);
35
36     cudaMemcpy(odata, d, sizeof(int)*N,
37                cudaMemcpyDeviceToHost);
38
39     for (i = 0; i < N; i++)
40         printf("%u -> %u\n", idata[i], odata[i]);
41
42     cudaFree((void*)d);
43
44     return 0;
45  }

Walking Through the Code


1  Begin by compiling the bitreverse.cu CUDA application for debugging by entering the following command at a shell prompt:
$ nvcc -g -G bitreverse.cu -o bitreverse

This command assumes that the source file name is bitreverse.cu and that no additional compiler flags are required for compilation. See also "Debug Compilation" in the Getting Started chapter.

2  Start the CUDA debugger by entering the following command at a shell prompt:
$ cuda-gdb bitreverse

3  Set breakpoints. Set both the host (main) and GPU (bitreverse) breakpoints here. Also, set a breakpoint at a particular line in the device function (bitreverse.cu:21).
(cuda-gdb) break main
Breakpoint 1 at 0x18e1: file bitreverse.cu, line 25.
(cuda-gdb) break bitreverse
Breakpoint 2 at 0x18a1: file bitreverse.cu, line 8.
(cuda-gdb) break 21
Breakpoint 3 at 0x18ac: file bitreverse.cu, line 21.


4  Run the CUDA application, and it executes until it reaches the first breakpoint (main) set in step 3.

(cuda-gdb) run
Starting program: /Users/CUDA_User1/docs/bitreverse
Reading symbols for shared libraries ..++........ done

Breakpoint 1, main () at bitreverse.cu:25
25        void *d = NULL; int i;

5  At this point, commands can be entered to advance execution or to print the program state. For this walk-through, let's continue until the device kernel is launched.

(cuda-gdb) continue
Continuing.
Reading symbols for shared libraries .. done
Reading symbols for shared libraries .. done
[Context Create of context 0x80f200 on Device 0]
[Launch of CUDA Kernel 0 (bitreverse<<<(1,1,1),(256,1,1)>>>) on Device 0]
Breakpoint 3 at 0x8667b8: file bitreverse.cu, line 21.
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

Breakpoint 2, bitreverse<<<(1,1,1),(256,1,1)>>> (data=0x110000) at bitreverse.cu:9
9         unsigned int *idata = (unsigned int*)data;

CUDA-GDB has detected that a CUDA device kernel has been reached. The debugger prints the current CUDA thread of focus.

6  Verify the CUDA thread of focus with the info cuda threads command and switch between the host thread and the CUDA threads:

(cuda-gdb) info cuda threads
  BlockIdx ThreadIdx  To BlockIdx ThreadIdx  Count  Virtual PC          Filename       Line
Kernel 0
* (0,0,0) (0,0,0)     (0,0,0) (255,0,0)      256    0x0000000000866400  bitreverse.cu  9
(cuda-gdb) thread
[Current thread is 1 (process 16738)]
(cuda-gdb) thread 1
[Switching to thread 1 (process 16738)]
#0  0x000019d5 in main () at bitreverse.cu:34
34        bitreverse<<<1, N, N*sizeof(int)>>>(d);
(cuda-gdb) backtrace
#0  0x000019d5 in main () at bitreverse.cu:34
(cuda-gdb) info cuda kernels
  Kernel  Dev  Grid  SMs Mask    GridDim  BlockDim   Name        Args
  0       0    1     0x00000001  (1,1,1)  (256,1,1)  bitreverse  data=0x110000


(cuda-gdb) cuda kernel 0
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
9         unsigned int *idata = (unsigned int*)data;
(cuda-gdb) backtrace
#0  bitreverse<<<(1,1,1),(256,1,1)>>> (data=0x110000) at bitreverse.cu:9

7  Corroborate this information by printing the block and thread indexes:

(cuda-gdb) print blockIdx
$1 = {x = 0, y = 0}
(cuda-gdb) print threadIdx
$2 = {x = 0, y = 0, z = 0}

8  The grid and block dimensions can also be printed:

(cuda-gdb) print gridDim
$3 = {x = 1, y = 1}
(cuda-gdb) print blockDim
$4 = {x = 256, y = 1, z = 1}

9  Advance kernel execution and verify some data:

(cuda-gdb) next
12        array[threadIdx.x] = idata[threadIdx.x];
(cuda-gdb) next
14        array[threadIdx.x] = ((0xf0f0f0f0 & array[threadIdx.x]) >> 4) |
(cuda-gdb) next
16        array[threadIdx.x] = ((0xcccccccc & array[threadIdx.x]) >> 2) |
(cuda-gdb) next
18        array[threadIdx.x] = ((0xaaaaaaaa & array[threadIdx.x]) >> 1) |
(cuda-gdb) next

Breakpoint 3, bitreverse<<<(1,1,1),(256,1,1)>>> (data=0x100000) at bitreverse.cu:21
21        idata[threadIdx.x] = array[threadIdx.x];
(cuda-gdb) print array[0]@12
$7 = {0, 128, 64, 192, 32, 160, 96, 224, 16, 144, 80, 208}
(cuda-gdb) print/x array[0]@12
$8 = {0x0, 0x80, 0x40, 0xc0, 0x20, 0xa0, 0x60, 0xe0, 0x10, 0x90, 0x50, 0xd0}
(cuda-gdb) print &data
$9 = (@global void * @parameter *) 0x10
(cuda-gdb) print *(@global void * @parameter *) 0x10
$10 = (@global void * @parameter) 0x100000

The resulting output depends on the current content of the memory location.


10  Since thread (0,0,0) reverses the value of 0, switch to a different thread to show more interesting data:

(cuda-gdb) cuda thread 170
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (170,0,0), device 0, sm 0, warp 5, lane 10]

11  Delete the breakpoints and continue the program to completion:

(cuda-gdb) delete breakpoints
Delete all breakpoints? (y or n) y
(cuda-gdb) continue
Continuing.

Program exited normally.
(cuda-gdb)


Example 2: autostep
This section shows how to use the autostep command and demonstrates how it helps increase the precision of memory error reporting.

Source Code
1   #define NUM_BLOCKS 8
2   #define THREADS_PER_BLOCK 64
3
4   __global__ void example(int **data) {
5     int value1, value2, value3, value4, value5;
6     int idx1, idx2, idx3;
7
8     idx1 = blockIdx.x * blockDim.x;
9     idx2 = threadIdx.x;
10    idx3 = idx1 + idx2;
11    value1 = *(data[idx1]);
12    value2 = *(data[idx2]);
13    value3 = value1 + value2;
14    value4 = value1 * value2;
15    value5 = value3 + value4;
16    *(data[idx3]) = value5;
17    *(data[idx1]) = value3;
18    *(data[idx2]) = value4;
19
20    idx1 = idx2 = idx3 = 0;
21  }
22
23  int main(int argc, char *argv[]) {
24    int *host_data[NUM_BLOCKS*THREADS_PER_BLOCK];
25    int **dev_data;
26    const int zero = 0;
27
28    /* Allocate an integer for each thread in each block */
29    for (int block = 0; block < NUM_BLOCKS; block++) {
30      for (int thread = 0; thread < THREADS_PER_BLOCK; thread++) {
31        int idx = thread + block * THREADS_PER_BLOCK;
32        cudaMalloc(&host_data[idx], sizeof(int));
33        cudaMemcpy(host_data[idx], &zero, sizeof(int), cudaMemcpyHostToDevice);
34      }
35    }
36    /* This inserts an error into block 3, thread 39 */
37    host_data[3*THREADS_PER_BLOCK + 39] = NULL;
38
39    /* Copy the array of pointers to the device */
40    cudaMalloc((void**)&dev_data, sizeof(host_data));


41    cudaMemcpy(dev_data, host_data, sizeof(host_data), cudaMemcpyHostToDevice);
42
43    /* Execute example */
44    example <<< NUM_BLOCKS, THREADS_PER_BLOCK >>> (dev_data);
45    cudaThreadSynchronize();
46  }
47

In this small example, we have an array of pointers to integers, and we want to do some operations on the integers. Suppose, however, that one of the pointers is NULL, as shown in line 37. This will cause CUDA_EXCEPTION_10 "Device Illegal Address" to be thrown when we try to access the integer that corresponds with block 3, thread 39. This exception should occur at line 16 when we try to write to that value.

Debugging With Autosteps


1  Compile the example and start CUDA-GDB as normal. We begin by running the program:

(cuda-gdb) run
Starting program: /home/jitud/cudagdb_test/autostep_ex/example
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff5688700 (LWP 9083)]
[Context Create of context 0x617270 on Device 0]
[Launch of CUDA Kernel 0 (example<<<(8,1,1),(64,1,1)>>>) on Device 0]

Program received signal CUDA_EXCEPTION_10, Device Illegal Address.
[Switching focus to CUDA kernel 0, grid 1, block (1,0,0), thread (0,0,0), device 0, sm 1, warp 0, lane 0]
0x0000000000796f60 in example (data=0x200300000) at example.cu:17
17        *(data[idx1]) = value3;

As expected, we received a CUDA_EXCEPTION_10. However, the reported thread is block 1, thread 0, and the line is 17. Since CUDA_EXCEPTION_10 is a global error, there is no thread information that is reported, so we would manually have to inspect all 512 threads.

2  Set autosteps. To get more accurate information, we reason that since CUDA_EXCEPTION_10 is a memory access error, it must occur on code that accesses memory. This happens on lines 11, 12, 16, 17, and 18, so we set two autostep windows for those areas:

(cuda-gdb) autostep 11 for 2 lines
Breakpoint 1 at 0x796d18: file example.cu, line 11.
Created autostep of length 2 lines
(cuda-gdb) autostep 16 for 3 lines
Breakpoint 2 at 0x796e90: file example.cu, line 16.
Created autostep of length 3 lines


3  Finally, we run the program again with these autosteps:

(cuda-gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
[Termination of CUDA Kernel 0 (example<<<(8,1,1),(64,1,1)>>>) on Device 0]
Starting program: /home/jitud/cudagdb_test/autostep_ex/example
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff5688700 (LWP 9089)]
[Context Create of context 0x617270 on Device 0]
[Launch of CUDA Kernel 1 (example<<<(8,1,1),(64,1,1)>>>) on Device 0]
[Switching focus to CUDA kernel 1, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

Program received signal CUDA_EXCEPTION_10, Device Illegal Address.
[Current focus set to CUDA kernel 1, grid 1, block (3,0,0), thread (32,0,0), device 0, sm 1, warp 3, lane 0]
Autostep precisely caught exception at example.cu:16 (0x796e90)

This time we correctly caught the exception at line 16. Even though CUDA_EXCEPTION_10 is a global error, we have now narrowed it down to a warp error, so we now know that the thread that threw the exception must have been in the same warp as block 3, thread 32.

In this example, we have narrowed down the scope of the error from 512 threads down to 32 threads just by setting two autosteps and re-running the program.


APPENDIX A SUPPORTED PLATFORMS

The general platform and GPU requirements for running NVIDIA CUDA-GDB are described in this section.

Host Platform Requirements


Mac OS
CUDA-GDB is supported on both 32-bit and 64-bit editions of the following Mac OS versions:
- Mac OS X 10.6
- Mac OS X 10.7

Linux
CUDA-GDB is supported on both 32-bit and 64-bit editions of the following Linux distributions:
- Red Hat Enterprise Linux 4.8 (64-bit only)
- Red Hat Enterprise Linux 5.5, 5.6, and 5.7
- Red Hat Enterprise Linux 6.0 (64-bit only) and 6.1 (64-bit only)
- Ubuntu 10.04, 10.10, and 11.04
- Fedora 13 and 14
- OpenSUSE 11.2
- SUSE Linux Enterprise Server 11.1


GPU Requirements
Debugging is supported on all CUDA-capable GPUs with a compute capability of 1.1 or later. Compute capability is a device attribute that a CUDA application can query; for more information, see the latest NVIDIA CUDA Programming Guide on the NVIDIA CUDA Zone Web site: http://developer.nvidia.com/object/gpucomputing.html.

These GPUs have a compute capability of 1.0 and are not supported:
- GeForce 8800 GTS
- GeForce 8800 GTX
- GeForce 8800 Ultra
- Quadro Plex 1000 Model IV
- Quadro Plex 2100 Model S4
- Quadro FX 4600
- Quadro FX 5600
- Tesla C870
- Tesla D870
- Tesla S870


APPENDIX B KNOWN ISSUES

The following are known issues with the current release.

- Setting the cuda memcheck option ON will make all the launches blocking.
- Conditional breakpoints can only be set after the CUDA module is loaded.
- Device memory allocated via cudaMalloc() is not visible outside of the kernel function.
- On GPUs with sm_type lower than sm_20 it is not possible to step over a subroutine in the device code.
- Requesting to read or write GPU memory may be unsuccessful if the size is larger than 100 MB on Tesla GPUs and larger than 32 MB on Fermi GPUs.
- On GPUs with sm_20, if you are debugging code in device functions that get called by multiple kernels, then setting a breakpoint in the device function will insert the breakpoint in only one of the kernels.
- In a multi-GPU debugging environment on Mac OS X with Aqua running, you may experience some visible delay while single-stepping the application.
- Setting a breakpoint on a line within a __device__ or __global__ function before its module is loaded may result in the breakpoint being temporarily set on the first line of a function below in the source code. As soon as the module for the targeted function is loaded, the breakpoint will be reset properly. In the meantime, the breakpoint may be hit, depending on the application. In those situations, the breakpoint can be safely ignored, and the application can be resumed.
- The scheduler-locking option cannot be set to on.
- Stepping again after stepping out of a kernel results in undetermined behavior. It is recommended to use the continue command instead.
- OpenGL applications may require launching X in non-interactive mode. See "CUDA/OpenGL Interop Applications on Linux" for details.


Notice
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication of otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

Trademarks
NVIDIA, the NVIDIA logo, NVIDIA nForce, GeForce, NVIDIA Quadro, NVDVD, NVIDIA Personal Cinema, NVIDIA Soundstorm, Vanta, TNT2, TNT, RIVA, RIVA TNT, VOODOO, VOODOO GRAPHICS, WAVEBAY, Accuview Antialiasing, Detonator, Digital Vibrance Control, ForceWare, NVRotate, NVSensor, NVSync, PowerMizer, Quincunx Antialiasing, Sceneshare, See What You've Been Missing, StreamThru, SuperStability, T-BUFFER, The Way It's Meant to be Played Logo, TwinBank, TwinView and the Video & Nth Superscript Design Logo are registered trademarks or trademarks of NVIDIA Corporation in the United States and/or other countries. Other company and product names may be trademarks or registered trademarks of the respective owners with which they are associated.

Copyright
© 2007-2012 NVIDIA Corporation. All rights reserved.

www.nvidia.com
