
CUDA-GDB

NVIDIA CUDA Debugger - 4.1 Release for Linux and Mac


DU-05227-001_V4.1 | January 10, 2012

User Manual

TABLE OF CONTENTS

1 Introduction
    What is CUDA-GDB?
    Supported features
    About this document
2 Release Notes
    GDB 7.2 Source Base
    Support For Simultaneous CUDA-GDB Sessions
    New Autostep Command
    Support For Multiple Contexts
    Support for Device Assertions
3 Getting Started
    Installation Instructions
    Setting Up the Debugger Environment
        Linux
        Mac OS X
    Compiling the Application
        Debug Compilation
        Compiling for Fermi GPUs
        Compiling for Fermi and Tesla GPUs
    Using the Debugger
        Single GPU Debugging
        Multi-GPU Debugging
        Remote Debugging
        Multiple Debuggers
        CUDA/OpenGL Interop Applications on Linux
4 CUDA-GDB Extensions
    Command Naming Convention
    Getting Help
    Initialization File
    GUI Integration
        Emacs
        DDD
5 Kernel Focus
    Software Coordinates vs. Hardware Coordinates
    Current Focus
    Switching Focus
6 Program Execution
    Interrupting the Application
    Single-Stepping
7 Breakpoints
    Symbolic Breakpoints
    Line Breakpoints
    Address Breakpoints
    Kernel Entry Breakpoints
    Conditional Breakpoints
8 Inspecting Program State
    Memory and Variables
    Variable Storage and Accessibility
    Inspecting Textures
    Info CUDA Commands
        info cuda devices
        info cuda sms
        info cuda warps
        info cuda lanes
        info cuda kernels
        info cuda blocks
        info cuda threads
9 Context and Kernel Events
    Display CUDA context events
    Display CUDA kernel events
    Examples of displayed events
10 Checking Memory Errors
    Checking Memory Errors
    Increasing the Precision of Memory Errors With Autostep
        Usage
        Related Commands
    GPU Error Reporting
11 Walk-through Examples
    Example 1: bitreverse
        Source Code
        Walking Through the Code
    Example 2: autostep
        Source Code
        Debugging With Autosteps
Appendix A: Supported Platforms
    Host Platform Requirements
        Mac OS
        Linux
    GPU Requirements
Appendix B: Known Issues

01 INTRODUCTION

This document introduces CUDA-GDB, the NVIDIA CUDA debugger, and describes what is new in version 4.1.

What is CUDA-GDB?
CUDA-GDB is the NVIDIA tool for debugging CUDA applications running on Linux and Mac. CUDA-GDB is an extension to the x86-64 port of GDB, the GNU Project debugger. The tool provides developers with a mechanism for debugging CUDA applications running on actual hardware. This enables developers to debug applications without the potential variations introduced by simulation and emulation environments. CUDA-GDB runs on Linux and Mac OS X, 32-bit and 64-bit. CUDA-GDB is based on GDB 7.2 on both Linux and Mac OS X.

Supported features
CUDA-GDB is designed to present the user with a seamless debugging environment that allows simultaneous debugging of both GPU and CPU code within the same application. Just as programming in CUDA C is an extension to C programming, debugging with CUDA-GDB is a natural extension to debugging with GDB. The existing GDB debugging features are inherently present for debugging the host code, and additional features have been provided to support debugging CUDA device code.

CUDA-GDB supports C and C++ CUDA applications. All the C++ features supported by the NVCC compiler can be debugged by CUDA-GDB.

CUDA-GDB allows the user to set breakpoints, to single-step CUDA applications, and also to inspect and modify the memory and variables of any given thread running on the hardware.

CUDA-GDB supports debugging all CUDA applications, whether they use the CUDA driver API, the CUDA runtime API, or both.


CUDA-GDB supports debugging kernels that have been compiled for specific CUDA architectures, such as sm_10 or sm_20, but also supports debugging kernels compiled at runtime, referred to as just-in-time compilation, or JIT compilation for short.

About this document


This document is the main documentation for CUDA-GDB and is organized more as a user manual than a reference manual. The rest of the document will describe how to install and use CUDA-GDB to debug CUDA kernels and how to use the new CUDA commands that have been added to GDB. Some walk-through examples are also provided. It is assumed that the user already knows the basic GDB commands used to debug host applications.


02 RELEASE NOTES

The following features have been added for the 4.1 release:

GDB 7.2 Source Base


Until now, CUDA-GDB was based on GDB 6.6 on Linux, and GDB 6.3.5 on Darwin (the Apple branch). Now, both versions of CUDA-GDB are using the same 7.2 source base. Also, CUDA-GDB supports newer versions of GCC (tested up to GCC 4.5), has better support for DWARF3 debug information, and better C++ debugging support.

Support For Simultaneous CUDA-GDB Sessions


With the 4.1 release, the single CUDA-GDB process restriction is lifted. Now, multiple CUDA-GDB sessions are allowed to coexist as long as the GPUs are not shared between the applications being debugged. For instance, one CUDA-GDB process can debug process foo using GPU 0 while another CUDA-GDB process debugs process bar using GPU 1. The exclusive use of GPUs can be enforced with the CUDA_VISIBLE_DEVICES environment variable.


New Autostep Command


A new autostep command was added. The command increases the precision of CUDA exceptions by automatically single-stepping through portions of code.

Under normal execution, the thread and instruction where an exception occurred may be imprecisely reported. However, the exact instruction that generates the exception can be determined if the program is being single-stepped when the exception occurs.

Manually single-stepping through a program is a slow and tedious process. Therefore autostep aids the user by allowing them to specify sections of code where they suspect an exception could occur. These sections are automatically single-stepped through when the program is running, and any exception that occurs within these sections is precisely reported.

Type help autostep from CUDA-GDB for the syntax and usage of the command.

Support For Multiple Contexts


On GPUs with compute capability of sm_20 or higher, debugging multiple contexts on the same GPU is now supported. It was a known limitation in previous releases.

Support for Device Assertions


The R285 driver released with the 4.1 version of the toolkit supports device assertions. CUDA-GDB supports the assertion call and stops the execution of the application when the assertion is hit. Then the variables and memory can be inspected as usual. The application can also be resumed past the assertion if needed. Use the set cuda hide_internal_frames option to expose/hide the system call frames (hidden by default).


03 GETTING STARTED

Included in this chapter are instructions for installing CUDA-GDB and for using NVCC, the NVIDIA CUDA compiler driver, to compile CUDA programs for debugging.

Installation Instructions
Follow these steps to install CUDA-GDB.

1  Visit the NVIDIA CUDA Zone download page:
http://www.nvidia.com/object/cuda_get.html.

2  Select the appropriate operating system, Mac OS X or Linux.
   (See "Host Platform Requirements" in Appendix A.)

3  Download and install the CUDA Driver.

4  Download and install the CUDA Toolkit.


Setting Up the Debugger Environment


Linux
Set up the PATH and LD_LIBRARY_PATH environment variables:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/lib:$LD_LIBRARY_PATH

Mac OS X
Set up the PATH and DYLD_LIBRARY_PATH environment variables:
export PATH=/usr/local/cuda/bin:$PATH
export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:$DYLD_LIBRARY_PATH

Also, if you are unable to execute CUDA-GDB or if you hit the "Unable to find Mach task port for process-id" error, try resetting the correct permissions with the following commands:


sudo chgrp procmod /usr/local/cuda/bin/cuda-binary-gdb
sudo chmod 2755 /usr/local/cuda/bin/cuda-binary-gdb
sudo chmod 755 /usr/local/cuda/bin/cuda-gdb

Temporary Directory
By default, CUDA-GDB uses /tmp as the directory to store temporary files. To select a different directory, set the $TMPDIR environment variable.
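For example (the path below is only an illustration; any writable directory works):

export TMPDIR=/home/user/cuda-gdb-tmp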


Compiling the Application


Debug Compilation
NVCC, the NVIDIA CUDA compiler driver, provides a mechanism for generating the debugging information necessary for CUDA-GDB to work properly. The -g -G option pair must be passed to NVCC when an application is compiled in order to debug with CUDA-GDB; for example,
nvcc -g -G foo.cu -o foo

Using this line to compile the CUDA application foo.cu:
- forces -O0 compilation, with the exception of very limited dead-code eliminations and register-spilling optimizations.
- makes the compiler include debug information in the executable.

Compiling for Fermi GPUs


For Fermi GPUs, add the following flag to target Fermi output when compiling the application:
-gencode arch=compute_20,code=sm_20

It will compile the kernels specifically for the Fermi architecture once and for all. If the flag is not specified, then the kernels must be recompiled at runtime every time.

Compiling for Fermi and Tesla GPUs


If you are targeting both Fermi and Tesla GPUs, include these two flags:

-gencode arch=compute_20,code=sm_20 -gencode arch=compute_10,code=sm_10

Note: It is highly recommended to use the -gencode flag whenever possible.


Using the Debugger


Debugging a CUDA GPU involves pausing that GPU. When the graphics desktop manager is running on the same GPU, then debugging that GPU freezes the GUI and makes the desktop unusable. To avoid this, use CUDA-GDB in the following system configurations:

Single GPU Debugging


In a single GPU system, CUDA-GDB can be used to debug CUDA applications only if no X11 server (on Linux) or no Aqua desktop manager (on Mac OS X) is running on that system. On Linux you can stop the X11 server by stopping the gdm service. On Mac OS X you can log in with >console as the user name in the desktop UI login screen. This allows CUDA applications to be executed and debugged in a single GPU configuration.
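As a sketch, on a Linux system whose desktop is managed by gdm (the service name and the way services are stopped vary between distributions, and my_app is a placeholder for your application):

sudo /etc/init.d/gdm stop
cuda-gdb my_app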

Multi-GPU Debugging
Multi-GPU debugging is not much different from single-GPU debugging except for a few additional CUDA-GDB commands that let you switch between the GPUs.

Any GPU hitting a breakpoint will pause all the GPUs running CUDA on that system. Once paused, you can use info cuda kernels to view all the active kernels and the GPUs they are running on. When any GPU is resumed, all the GPUs are resumed.
Note: If the CUDA_VISIBLE_DEVICES environment variable is used, only the specified devices are suspended and resumed.

All CUDA-capable GPUs may run one or more kernels. To switch to an active kernel, use cuda kernel <n>, where n is the ID of the kernel retrieved from info cuda kernels.
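For instance (the kernel ID 2 below is illustrative; use an ID reported by info cuda kernels in your own session):

(cuda-gdb) info cuda kernels
(cuda-gdb) cuda kernel 2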


Note: The same kernel can be loaded and used by different contexts and devices at the same time. When a breakpoint is set in such a kernel, by either name or file name and line number, it will be resolved arbitrarily to only one instance of that kernel. With the runtime API, the exact instance to which the breakpoint will be resolved cannot be controlled. With the driver API, the user can control the instance to which the breakpoint will be resolved by setting the breakpoint right after its module is loaded.


Multi-GPU Debugging in Console Mode


CUDA-GDB allows simultaneous debugging of applications running CUDA kernels on multiple GPUs. In console mode, CUDA-GDB can be used to pause and debug every GPU in the system. You can enable console mode as described above for the single-GPU console mode.

Multi-GPU Debugging with the Desktop Manager Running


This can be achieved by running the desktop GUI on one GPU and CUDA on the other GPU to avoid hanging the desktop GUI.

On Linux

The CUDA driver automatically excludes the GPU used by X11 from being visible to the application being debugged. This may change the behavior of the application since, if there are n GPUs in the system, then only n-1 GPUs will be visible to the application.

On Mac OS X

The CUDA driver exposes every CUDA-capable GPU in the system, including the one used by the Aqua desktop manager. To determine which GPU should be used for CUDA, run the deviceQuery app from the CUDA SDK samples. The output of deviceQuery as shown in Figure 3.1 indicates all the GPUs in the system.

For example, if you have two GPUs you will see Device 0: GeForce xxxx and Device 1: GeForce xxxx. Choose the Device <index> that is not rendering the desktop on your connected monitor. If Device 0 is rendering the desktop, then choose Device 1 for running and debugging the CUDA application. This exclusion of the desktop can be achieved by setting the CUDA_VISIBLE_DEVICES environment variable to 1:
export CUDA_VISIBLE_DEVICES=1


Figure 3.1  deviceQuery Output

Remote Debugging
To remotely debug an application, use SSH or VNC from the host system to connect to the target system. From there, CUDA-GDB can be launched in console mode.
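A minimal sketch (user name, host name, and application name are placeholders):

ssh user@target-system
cuda-gdb my_app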


Multiple Debuggers
In a multi-GPU environment, several debugging sessions may take place simultaneously as long as the CUDA devices are used exclusively. For instance, one instance of CUDA-GDB can debug a first application that uses the first GPU while another instance of CUDA-GDB debugs a second application that uses the second GPU. The exclusive use of a GPU is achieved by specifying which GPU is visible to the application by using the CUDA_VISIBLE_DEVICES environment variable.
CUDA_VISIBLE_DEVICES=1 cuda-gdb my_app

CUDA/OpenGL Interop Applications on Linux


Any CUDA application that uses OpenGL interoperability requires an active window server. Such applications will fail to run under console mode debugging on both Linux and Mac OS X. However, if the X server is running on Linux, the render GPU will not be enumerated when debugging, so the application could still fail, unless the application uses the OpenGL device enumeration to access the render GPU. But if the X session is running in non-interactive mode while using the debugger, the render GPU will be enumerated correctly.

Instructions
1  Launch your X session in non-interactive mode.
   a  Stop your X server.
   b  Edit /etc/X11/xorg.conf to contain the following line in the Device section corresponding to your display:
Option "Interactive" "off

   c  Restart your X server.

2  Log in remotely (SSH, etc.) and launch your application under CUDA-GDB.
   This setup works properly for single-GPU and multi-GPU configurations.

3  Ensure your DISPLAY environment variable is set appropriately. For example:
export DISPLAY=:0.0

Limitations
While X is in non-interactive mode, interacting with the X session can cause your debugging session to stall or terminate.


04 CUDA-GDB EXTENSIONS

Command Naming Convention


The existing GDB commands are unchanged. Every new CUDA command or option is prefixed with the CUDA keyword. As much as possible, CUDA-GDB command names are similar to the equivalent GDB commands used for debugging host code. For instance, the GDB commands to display the host threads and to switch to host thread 1 are, respectively:
(cuda-gdb) info threads
(cuda-gdb) thread 1

To display the CUDA threads and switch to CUDA thread 1, the user only has to type:
(cuda-gdb) info cuda threads
(cuda-gdb) cuda thread 1

Getting Help
As with GDB commands, the built-in help for the CUDA commands is accessible from the cuda-gdb command line by using the help command:
(cuda-gdb) help cuda name_of_the_cuda_command
(cuda-gdb) help set cuda name_of_the_cuda_option
(cuda-gdb) help info cuda name_of_the_info_cuda_command


Initialization File
The initialization file for CUDA-GDB is named .cuda-gdbinit and follows the same rules as the standard .gdbinit file used by GDB. The initialization file may contain any CUDA-GDB command. Those commands will be processed in order when CUDA-GDB is launched.
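A minimal sketch of a .cuda-gdbinit file is shown below; the commands are only illustrative, and any CUDA-GDB or GDB command could appear instead:

set cuda memcheck on
set cuda kernel_events 1
break main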

GUI Integration
Emacs
CUDA-GDB works with GUD in Emacs and XEmacs. No extra step is required other than pointing to the right binary.

To use CUDA-GDB, the gud-gdb-command-name variable must be set to "cuda-gdb annotate=3". Use M-x customize-variable to set the variable.

Ensure that cuda-gdb is present in the Emacs/XEmacs $PATH.

DDD
CUDA-GDB works with DDD. To use DDD with CUDA-GDB, launch DDD with the following command:
ddd --debugger cuda-gdb

cuda-gdb must be in your $PATH.


05 KERNEL FOCUS

A CUDA application may be running several host threads and many device threads. To simplify the visualization of information about the state of the application, commands are applied to the entity in focus.

When the focus is set to a host thread, the commands will apply only to that host thread (unless the application is fully resumed, for instance). On the device side, the focus is always set to the lowest granularity level: the device thread.

Software Coordinates vs. Hardware Coordinates


A device thread belongs to a block, which in turn belongs to a kernel. Thread, block, and kernel are the software coordinates of the focus. A device thread runs on a lane. A lane belongs to a warp, which belongs to an SM, which in turn belongs to a device. Lane, warp, SM, and device are the hardware coordinates of the focus. Software and hardware coordinates can be used interchangeably and simultaneously as long as they remain coherent.

Another software coordinate is sometimes used: the grid. The difference between a grid and a kernel is the scope. The grid ID is unique per GPU whereas the kernel ID is unique across all GPUs. Therefore there is a 1:1 mapping between a kernel and a (grid, device) tuple.

Current Focus
To inspect the current focus, use the cuda command followed by the coordinates of interest:
(cuda-gdb) cuda device sm warp lane block thread
block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0
(cuda-gdb) cuda kernel block thread
kernel 1, block (0,0,0), thread (0,0,0)
(cuda-gdb) cuda kernel
kernel 1


Switching Focus
To switch the current focus, use the cuda command followed by the coordinates to be changed:
(cuda-gdb) cuda device 0 sm 1 warp 2 lane 3
[Switching focus to CUDA kernel 1, grid 2, block (8,0,0), thread (67,0,0), device 0, sm 1, warp 2, lane 3]
374     int totalThreads = gridDim.x * blockDim.x;

If the specified focus is not fully defined by the command, the debugger will assume that the omitted coordinates are set to the coordinates in the current focus, including the subcoordinates of the block and thread.


(cuda-gdb) cuda thread (15)
[Switching focus to CUDA kernel 1, grid 2, block (8,0,0), thread (15,0,0), device 0, sm 1, warp 0, lane 15]
374     int totalThreads = gridDim.x * blockDim.x;

The parentheses for the block and thread arguments are optional.
(cuda-gdb) cuda block 1 thread 3
[Switching focus to CUDA kernel 1, grid 2, block (1,0,0), thread (3,0,0), device 0, sm 3, warp 0, lane 3]
374     int totalThreads = gridDim.x * blockDim.x;


06 PROGRAM EXECUTION

Applications are launched the same way in CUDA-GDB as they are with GDB, by using the run command. This chapter describes how to interrupt and single-step CUDA applications.

Interrupting the Application


If the CUDA application appears to be hanging or stuck in an infinite loop, it is possible to manually interrupt the application by pressing CTRL+C. When the signal is received, the GPUs are suspended and the cuda-gdb prompt will appear.

At that point, the program can be inspected, modified, single-stepped, resumed, or terminated at the user's discretion.

This feature is limited to applications running within the debugger. It is not possible to break into and debug applications that have been launched outside the debugger.

Single-Stepping
Single-stepping device code is supported. However, unlike host code single-stepping, device code single-stepping works at the warp level. This means that single-stepping a device kernel advances all the active threads in the warp currently in focus. The divergent threads in the warp are not single-stepped.

In order to advance the execution of more than one warp, a breakpoint must be set at the desired location and then the application must be fully resumed.

A special case is single-stepping over a thread barrier call: __syncthreads(). In this case, an implicit temporary breakpoint is set immediately after the barrier and all threads are resumed until the temporary breakpoint is hit.

On GPUs with sm_type lower than sm_20 it is not possible to step over a subroutine in the device code. Instead, CUDA-GDB always steps into the device function. On GPUs with sm_type sm_20 and higher, you can step in, over, or out of the device functions as long as they are not inlined. To force a function to not be inlined by the compiler, the __noinline__ keyword must be added to the function declaration.
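A minimal sketch of such a declaration is shown below (the function itself is hypothetical):

__device__ __noinline__ int scale(int x)
{
    return 2 * x;   // trivial body; the keyword only prevents inlining
}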


07 BREAKPOINTS

There are multiple ways to set a breakpoint on a CUDA application. These methods are described below. The commands to set a breakpoint on device code are the same as the commands used to set a breakpoint on host code.

If the breakpoint is set on device code, the breakpoint will be marked pending until the ELF image of the kernel is loaded. At that point, the breakpoint will be resolved and its address will be updated.

When a breakpoint is set, it forces all resident GPU threads to stop at this location when they hit the corresponding PC.

When a breakpoint is hit by one thread, there is no guarantee that the other threads will hit the breakpoint at the same time. Therefore the same breakpoint may be hit several times, and the user must be careful to check which thread(s) actually hit the breakpoint.

Symbolic Breakpoints
To set a breakpoint at the entry of a function, use the break command followed by the name of the function or method:
(cuda-gdb) break my_function
(cuda-gdb) break my_class::my_method

For templatized functions and methods, the full signature must be given:
(cuda-gdb) break int my_templatized_function<int>(int)


The mangled name of the function can also be used. To find the mangled name of a function, you can use the following commands:
(cuda-gdb) set demangle-style none
(cuda-gdb) info function my_function_name
(cuda-gdb) set demangle-style auto

Line Breakpoints
To set a breakpoint on a specific line number, use the following syntax:
(cuda-gdb) break my_file.cu:185

If the specified line corresponds to an instruction within templatized code, multiple breakpoints will be created, one for each instance of the templatized code.

Address Breakpoints
To set a breakpoint at a specific address, use the break command with the address as argument:
(cuda-gdb) break 0x1afe34d0

The address can be any address on the device or the host.

Kernel Entry Breakpoints


To break on the first instruction of every launched kernel, set the break_on_launch option to application:
(cuda-gdb) set cuda break_on_launch application

Possible options are:
- application: any kernel launched by the user application
- system: any kernel launched by the driver, such as memset
- all: any kernel, application and system
- none: no kernel, application or system

Those automatic breakpoints are not displayed by the info breakpoints command and are managed separately from individual breakpoints. Turning off the option will not delete other individual breakpoints set to the same address, and vice versa.


Conditional Breakpoints
To make the breakpoint conditional, use the optional if keyword or the cond command.
(cuda-gdb) break foo.cu:23 if threadIdx.x == 1 && i < 5
(cuda-gdb) cond 3 threadIdx.x == 1 && i < 5

Conditional expressions may refer to any variable, including built-in variables such as threadIdx and blockIdx. Function calls are not allowed in conditional expressions.

Note that conditional breakpoints are always hit and evaluated, but the debugger reports the breakpoint as being hit only if the conditional statement evaluates to TRUE. The process of hitting the breakpoint and evaluating the corresponding conditional statement is time-consuming. Therefore, running applications while using conditional breakpoints may slow down the debugging session. Moreover, if the conditional statement always evaluates to FALSE, the debugger may appear to be hanging or stuck, although it is not the case. You can interrupt the application with CTRL-C to verify that progress is being made.

Conditional breakpoints can only be set on code from CUDA modules that are already loaded. Otherwise, CUDA-GDB will report an error that it is unable to find symbols in the current context. If unsure, first set an unconditional breakpoint at the desired location and add the conditional statement the first time the breakpoint is hit by using the cond command.
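A sketch of that two-step approach (file name, line number, breakpoint number, and condition are illustrative):

(cuda-gdb) break my_kernel.cu:42
(cuda-gdb) run
(cuda-gdb) cond 1 blockIdx.x == 0 && threadIdx.x == 7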


08 INSPECTING PROGRAM STATE

Memory and Variables


The GDB print command has been extended to decipher the location of any program variable and can be used to display the contents of any CUDA program variable, including:
- data allocated via cudaMalloc()
- data that resides in various GPU memory regions, such as shared, local, and global memory
- special CUDA runtime variables, such as threadIdx

Variable Storage and Accessibility


Depending on the variable type and usage, variables can be stored either in registers or in local, shared, const, or global memory. You can print the address of any variable to find out where it is stored and directly access the associated memory.

The example below shows how the variable array, which is of type shared int*, can be directly accessed in order to see what the stored values are in the array.
(cuda-gdb) print &array $1 = (@shared int (*)[0]) 0x20 (cuda-gdb) print array[0]@4 $2 = {0, 128, 64, 192}

You can also access the shared memory indexed into the starting offset to see what the stored values are:
(cuda-gdb) print *(@shared int*)0x20
$3 = 0
(cuda-gdb) print *(@shared int*)0x24
$4 = 128
(cuda-gdb) print *(@shared int*)0x28
$5 = 64


The example below shows how to access the starting address of the input parameter to the kernel.
(cuda-gdb) print &data $6 = (const @global void * const @parameter *) 0x10 (cuda-gdb) print *(@global void * const @parameter *) 0x10 $7 = (@global void * const @parameter) 0x110000

Note: The debugger can always read/write the source variables when the PC is on the first assembly instruction of a source instruction. When doing assembly-level debugging, the value of source variables is not always accessible.

Inspecting Textures

To inspect a texture, use the print command while dereferencing the texture recast to the type of the array it is bound to. For instance, if texture tex is bound to array A of type float*, use:


(cuda-gdb) print *(@texture float *)tex

All the array operators, such as [], can be applied to (@texture float *)tex:
(cuda-gdb) print ((@texture float *)tex)[2]
(cuda-gdb) print ((@texture float *)tex)[2]@4


Info CUDA Commands


These are commands that display information about the GPU and the application's CUDA state. The available options are:
- devices: information about all the devices
- sms: information about all the SMs in the current device
- warps: information about all the warps in the current SM
- lanes: information about all the lanes in the current warp
- kernels: information about all the active kernels
- blocks: information about all the active blocks in the current kernel
- threads: information about all the active threads in the current kernel

A filter can be applied to every info cuda command. The filter restricts the scope of the command. A filter is composed of one or more restrictions. A restriction can be any of the following:


- device n
- sm n
- warp n
- lane n
- kernel n
- grid n
- block x[,y] or block (x[,y])
- thread x[,y[,z]] or thread (x[,y[,z]])

where n, x, y, z are integers, or one of the following special keywords: current, any, and all. current indicates that the corresponding value in the current focus should be used. any and all indicate that any value is acceptable.
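For instance, the following filtered queries are possible (the coordinate values are illustrative):

(cuda-gdb) info cuda warps device 0 sm 1
(cuda-gdb) info cuda threads block (1,0) thread (3,0,0)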

info cuda devices


This command enumerates all the GPUs in the system sorted by device index. A * indicates the device currently in focus. This command supports filters. The default is device all. This command prints "No CUDA Devices" if no GPUs are found.

(cuda-gdb) info cuda devices
  Dev  Description  SM Type  SMs  Warps/SM  Lanes/Warp  Max Regs/Lane  Active SMs Mask
*   0  gt200        sm_13     24        32          32            128       0x00ffffff


info cuda sms


This command shows all the SMs for the device and the associated active warps on the SMs. This command supports filters and the default is device current sm all. A * indicates the SM in focus. The results are grouped per device.

(cuda-gdb) info cuda sms
  SM  Active Warps Mask
Device 0
*  0  0xffffffffffffffff
   1  0xffffffffffffffff
   2  0xffffffffffffffff
   3  0xffffffffffffffff
   4  0xffffffffffffffff
   5  0xffffffffffffffff
   6  0xffffffffffffffff
   7  0xffffffffffffffff
   8  0xffffffffffffffff
...

info cuda warps


This command takes you one level deeper and prints all the warps information for the SM in focus. This command supports filters and the default is device current sm current warp all. The command can be used to display which warp executes what block.

(cuda-gdb) info cuda warps
  Wp  Active Lanes Mask  Divergent Lanes Mask  Active Physical PC  Kernel  BlockIdx
Device 0 SM 0
*  0  0xffffffff         0x00000000            0x000000000000001c       0  (0,0,0)
   1  0xffffffff         0x00000000            0x0000000000000000       0  (0,0,0)
   2  0xffffffff         0x00000000            0x0000000000000000       0  (0,0,0)
   3  0xffffffff         0x00000000            0x0000000000000000       0  (0,0,0)
   4  0xffffffff         0x00000000            0x0000000000000000       0  (0,0,0)
   5  0xffffffff         0x00000000            0x0000000000000000       0  (0,0,0)
   6  0xffffffff         0x00000000            0x0000000000000000       0  (0,0,0)
   7  0xffffffff         0x00000000            0x0000000000000000       0  (0,0,0)
...


info cuda lanes


This command displays all the lanes (threads) for the warp in focus. This command supports filters and the default is device current sm current warp current lane all. In the example below you can see that all the lanes are at the same physical PC. The command can be used to display which lane executes what thread.

(cuda-gdb) info cuda lanes
  Ln  State   Physical PC         ThreadIdx
Device 0 SM 0 Warp 0
*  0  active  0x000000000000008c  (0,0,0)
   1  active  0x000000000000008c  (1,0,0)
   2  active  0x000000000000008c  (2,0,0)
   3  active  0x000000000000008c  (3,0,0)
   4  active  0x000000000000008c  (4,0,0)
   5  active  0x000000000000008c  (5,0,0)
   6  active  0x000000000000008c  (6,0,0)
   7  active  0x000000000000008c  (7,0,0)
   8  active  0x000000000000008c  (8,0,0)
   9  active  0x000000000000008c  (9,0,0)
  10  active  0x000000000000008c  (10,0,0)
  11  active  0x000000000000008c  (11,0,0)
  12  active  0x000000000000008c  (12,0,0)
  13  active  0x000000000000008c  (13,0,0)
  14  active  0x000000000000008c  (14,0,0)
  15  active  0x000000000000008c  (15,0,0)
  16  active  0x000000000000008c  (16,0,0)
...

info cuda kernels


This command displays all the active kernels on the GPU in focus. It prints the SM mask, kernel ID, and grid ID for each kernel, with the associated dimensions and arguments. The kernel ID is unique across all GPUs whereas the grid ID is unique per GPU. This command supports filters and the default is kernel all.

(cuda-gdb) info cuda kernels
  Kernel  Dev  Grid  SMs Mask    GridDim    BlockDim   Name       Args
  ...     ...  ...   0x00ffffff  (240,1,1)  (128,1,1)  acos_main  parms={...}


info cuda blocks


This command displays all the active or running blocks for the kernel in focus. The results are grouped per kernel. This command supports filters and the default is kernel current block all. The outputs are coalesced by default.

(cuda-gdb) info cuda blocks
  BlockIdx  To BlockIdx  Count  State
Kernel 1
* (0,0,0)   (191,0,0)    192    running

Coalescing can be turned off as follows, in which case more information on the Device and the SM gets displayed:

(cuda-gdb) set cuda coalescing off

The following is the output of the same command when coalescing is turned off.

(cuda-gdb) info cuda blocks
  BlockIdx  State    Dev  SM
Kernel 1
* (0,0,0)   running  0     0
  (1,0,0)   running  0     3
  (2,0,0)   running  0     6
  (3,0,0)   running  0     9
  (4,0,0)   running  0    12
  (5,0,0)   running  0    15
  (6,0,0)   running  0    18
  (7,0,0)   running  0    21
  (8,0,0)   running  0     1
...


info cuda threads


This command displays the application's active CUDA blocks and threads with the total count of threads in those blocks. Also displayed are the virtual PC and the associated source file and line number information. The results are grouped per kernel. The command supports filters, with the default being kernel current block all thread all. The outputs are coalesced by default as follows:

(cuda-gdb) info cuda threads
  BlockIdx ThreadIdx  To BlockIdx ThreadIdx  Count  Virtual PC          Filename  Line
Device 0 SM 0
* (0,0,0) (0,0,0)     (0,0,0) (31,0,0)       32     0x000000000088f88c  acos.cu   376
  (0,0,0) (32,0,0)    (191,0,0) (127,0,0)    24544  0x000000000088f800  acos.cu   374
...

Coalescing can be turned off as follows, in which case more information is displayed with the output.

(cuda-gdb) info cuda threads
  BlockIdx  ThreadIdx  Virtual PC          Dev  SM  Wp  Ln  Filename  Line
Kernel 1
* (0,0,0)   (0,0,0)    0x000000000088f88c  0    0   0   0   acos.cu   376
  (0,0,0)   (1,0,0)    0x000000000088f88c  0    0   0   1   acos.cu   376
  (0,0,0)   (2,0,0)    0x000000000088f88c  0    0   0   2   acos.cu   376
  (0,0,0)   (3,0,0)    0x000000000088f88c  0    0   0   3   acos.cu   376
  (0,0,0)   (4,0,0)    0x000000000088f88c  0    0   0   4   acos.cu   376
  (0,0,0)   (5,0,0)    0x000000000088f88c  0    0   0   5   acos.cu   376
  (0,0,0)   (6,0,0)    0x000000000088f88c  0    0   0   6   acos.cu   376
  (0,0,0)   (7,0,0)    0x000000000088f88c  0    0   0   7   acos.cu   376
  (0,0,0)   (8,0,0)    0x000000000088f88c  0    0   0   8   acos.cu   376
  (0,0,0)   (9,0,0)    0x000000000088f88c  0    0   0   9   acos.cu   376
...

Note: In coalesced form, threads must be contiguous in order to be coalesced. If some threads are not currently running on the hardware, they will create "holes" in the thread ranges. For instance, if a kernel consists of 2 blocks of 16 threads, and only the 8 lowest threads are active, then 2 coalesced ranges will be printed: one range for block 0 thread 0 to 7, and one range for block 1 thread 0 to 7. Because threads 8-15 in block 0 are not running, the 2 ranges cannot be coalesced.


09 CONTEXT AND KERNEL EVENTS

Within CUDA-GDB, kernel refers to your device code that executes on the GPU, while context refers to the virtual address space on the GPU for your kernel. You can turn ON or OFF the display of CUDA context and kernel events to review the flow of the active contexts and kernels.

Display CUDA context events


(cuda-gdb) set cuda context_events 1

Display CUDA context events.
(cuda-gdb) set cuda context_events 0

Do not display CUDA context events.

Display CUDA kernel events


(cuda-gdb) set cuda kernel_events 1

Display CUDA kernel events.
(cuda-gdb) set cuda kernel_events 0

Do not display CUDA kernel events.


Examples of displayed events


The following are examples of context events displayed:
[Context Create of context 0xad2fe60 on Device 0]
[Context Pop of context 0xad2fe60 on Device 0]
[Context Destroy of context 0xad2fe60 on Device 0]

The following are examples of kernel events displayed:
[Launch of CUDA Kernel 1 (kernel3) on Device 0]
[Termination of CUDA Kernel 1 (kernel3) on Device 0]

Note: The kernel termination event is only displayed when a kernel is launched asynchronously, or when the debugger can safely assume that the kernel has terminated.


10 CHECKING MEMORY ERRORS

Checking Memory Errors


The CUDA memcheck feature detects global memory violations and misaligned global memory accesses. This feature is off by default and can be enabled using the following variable in CUDA-GDB before the application is run.
(cuda-gdb) set cuda memcheck on

Once CUDA memcheck is enabled, any detection of global memory violations and misaligned global memory accesses will be reported.

When CUDA memcheck is enabled, all the kernel launches are made blocking, as if the environment variable CUDA_LAUNCH_BLOCKING was set to 1. The host thread launching a kernel will therefore wait until the kernel has completed before proceeding. This may change the behavior of your application.

You can also run the CUDA memory checker as a standalone tool named CUDA-MEMCHECK. This tool is also part of the toolkit. Please read the related documentation for more information.

By default, CUDA-GDB will report any memory error. See the next section for a list of the memory errors. To increase the number of memory errors being reported and to increase the precision of the memory errors, CUDA memcheck must be turned on.
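A sketch of the standalone usage (my_app is a placeholder for your executable):

cuda-memcheck ./my_app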


Increasing the Precision of Memory Errors With Autostep


Autostep is a command to increase the precision of CUDA exceptions to the exact lane and instruction, when they would not have been otherwise.

Under normal execution, an exception may be reported several instructions after the exception occurred, or the exact thread where an exception occurred may not be known unless the exception is a lane error. However, the precise origin of the exception can be determined if the program is being single-stepped when the exception occurs. Single-stepping manually is a slow and tedious process; stepping takes much longer than normal execution and the user has to single-step each warp individually.

Autostep aids the user by allowing them to specify sections of code where they suspect an exception could occur, and these sections are automatically and transparently single-stepped while the program is running. The rest of the program is executed normally to minimize the slowdown caused by single-stepping. The precise origin of an exception will be reported if the exception occurs within these sections. Thus the exact instruction and thread where an exception occurred can be found quickly and with much less effort by using autostep.

Usage
autostep [LOCATION]
autostep [LOCATION] for LENGTH [lines|instructions]

- LOCATION may be anything that you use to specify the location of a breakpoint, such as a line number, function name, or an instruction address preceded by an asterisk. If no LOCATION is specified, then the current instruction address is used.
- LENGTH specifies the size of the autostep window in number of lines or instructions (lines and instructions can be shortened, e.g., l or i). If the length type is not specified, then lines is the default. If the for clause is omitted, then the default is 1 line.


- astep can be used as an alias for the autostep command.
- Calls to functions made during an autostep will be stepped over.
- In case of divergence, the length of the autostep window is determined by the number of lines or instructions the first active lane in each warp executes. Divergent lanes are also single-stepped, but the instructions they execute do not count towards the length of the autostep window.


- If a breakpoint occurs while inside an autostep window, the warp where the breakpoint was hit will not continue autostepping when the program is resumed. However, other warps may continue autostepping.
- Overlapping autosteps are not supported.


- If an autostep is encountered while another autostep is being executed, then the second autostep is ignored.
Note: Autostep requires Fermi GPUs or above.
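For instance (file name, line number, address, and window lengths are illustrative):

(cuda-gdb) autostep my_kernel.cu:25 for 10 lines
(cuda-gdb) autostep *0x1afe34d0 for 20 instructions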

Related Commands
Autosteps and breakpoints share the same numbering, so most commands that work with breakpoints will also work with autosteps.

info autosteps
Shows all breakpoints and autosteps. Similar to info breakpoints.
(cuda-gdb) info autosteps
Num  Type      Disp  Enb  Address            What
1    autostep  keep  y    0x0000000000401234 in merge at sort.cu:30 for 49 instructions
3    autostep  keep  y    0x0000000000489913 in bubble at sort.cu:94 for 11 lines

disable autosteps n
Disables an autostep. Equivalent to disable breakpoints n.

delete autosteps n
Deletes an autostep. Equivalent to delete breakpoints n.

ignore n i
Do not single-step the next i times the debugger enters the window for autostep n. This command already exists for breakpoints.
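For instance, to skip the next 5 entries into the window of autostep 1 (both numbers are illustrative):

(cuda-gdb) ignore 1 5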


GPU Error Reporting


With improved GPU error reporting in CUDA-GDB, application bugs are now easier to identify and easy to fix. The following table shows the new errors that are reported on GPUs with compute capability sm_20 and higher.

Note: Continuing the execution of your application after these errors are found can lead to application termination or indeterminate results.

Table 10.1  CUDA Exception Codes

CUDA_EXCEPTION_0 : Device Unknown Exception
  Precision of the error: Not precise
  Scope of the error: Global error on the GPU
  Description: This is a global GPU error caused by the application which does not match any of the listed error codes below. This should be a rare occurrence. Potentially, this may be due to Device Hardware Stack overflows or a kernel generating an exception very close to its termination.

CUDA_EXCEPTION_1 : Lane Illegal Address
  Precision of the error: Precise (requires memcheck on)
  Scope of the error: Per lane/thread error
  Description: This occurs when a thread accesses an illegal (out of bounds) global address.

CUDA_EXCEPTION_2 : Lane User Stack Overflow
  Precision of the error: Precise
  Scope of the error: Per lane/thread error
  Description: This occurs when a thread exceeds its stack memory limit.

CUDA_EXCEPTION_3 : Device Hardware Stack Overflow
  Precision of the error: Not precise
  Scope of the error: Global error on the GPU
  Description: This occurs when the application triggers a global hardware stack overflow. The main cause of this error is large amounts of divergence in the presence of function calls.

CUDA_EXCEPTION_4 : Warp Illegal Instruction
  Precision of the error: Not precise
  Scope of the error: Warp error
  Description: This occurs when any thread within a warp has executed an illegal instruction.

CUDA_EXCEPTION_5 : Warp Out-of-range Address
  Precision of the error: Not precise
  Scope of the error: Warp error
  Description: This occurs when any thread within a warp accesses an address that is outside the valid range of local or shared memory regions.

CUDA_EXCEPTION_6 : Warp Misaligned Address
  Precision of the error: Not precise
  Scope of the error: Warp error
  Description: This occurs when any thread within a warp accesses an address in the local or shared memory segments that is not correctly aligned.

CUDA_EXCEPTION_7 : Warp Invalid Address Space
  Precision of the error: Not precise
  Scope of the error: Warp error
  Description: This occurs when any thread within a warp executes an instruction that accesses a memory space not permitted for that instruction.

CUDA_EXCEPTION_8 : Warp Invalid PC
  Precision of the error: Not precise
  Scope of the error: Warp error
  Description: This occurs when any thread within a warp advances its PC beyond the 40-bit address space.

CUDA_EXCEPTION_9 : Warp Hardware Stack Overflow
  Precision of the error: Not precise
  Scope of the error: Warp error
  Description: This occurs when any thread in a warp triggers a hardware stack overflow. This should be a rare occurrence.

CUDA_EXCEPTION_10 : Device Illegal Address
  Precision of the error: Not precise
  Scope of the error: Global error
  Description: This occurs when a thread accesses an illegal (out of bounds) global address. For increased precision, use the cuda memcheck feature.

CUDA_EXCEPTION_11 : Lane Misaligned Address
  Precision of the error: Precise (requires memcheck on)
  Scope of the error: Per lane/thread error
  Description: This occurs when a thread accesses a global address that is not correctly aligned.

CUDA_EXCEPTION_12 : Warp Assert
  Precision of the error: Precise
  Scope of the error: Per warp
  Description: This occurs when any thread in the warp hits a device side assertion.


11 WALK-THROUGH EXAMPLES

The chapter contains two CUDA-GDB walk-through examples:
- Example 1: bitreverse
- Example 2: autostep

Example 1: bitreverse
This section presents a walk-through of CUDA-GDB by debugging a sample application called bitreverse that performs a simple 8-bit reversal on a data set.

Source Code
1   #include <stdio.h>
2   #include <stdlib.h>
3
4   // Simple 8-bit bit reversal Compute test
5
6   #define N 256
7
8   __global__ void bitreverse(void *data) {
9      unsigned int *idata = (unsigned int*)data;
10     extern __shared__ int array[];
11
12     array[threadIdx.x] = idata[threadIdx.x];
13
14     array[threadIdx.x] = ((0xf0f0f0f0 & array[threadIdx.x]) >> 4) |
15                          ((0x0f0f0f0f & array[threadIdx.x]) << 4);
16     array[threadIdx.x] = ((0xcccccccc & array[threadIdx.x]) >> 2) |
17                          ((0x33333333 & array[threadIdx.x]) << 2);
18     array[threadIdx.x] = ((0xaaaaaaaa & array[threadIdx.x]) >> 1) |
19                          ((0x55555555 & array[threadIdx.x]) << 1);
20
21     idata[threadIdx.x] = array[threadIdx.x];


22  }
23
24  int main(void) {
25     void *d = NULL; int i;
26     unsigned int idata[N], odata[N];
27
28     for (i = 0; i < N; i++)
29         idata[i] = (unsigned int)i;
30
31     cudaMalloc((void**)&d, sizeof(int)*N);
32     cudaMemcpy(d, idata, sizeof(int)*N,
33                cudaMemcpyHostToDevice);
34     bitreverse<<<1, N, N*sizeof(int)>>>(d);
35
36     cudaMemcpy(odata, d, sizeof(int)*N,
37                cudaMemcpyDeviceToHost);
38
39     for (i = 0; i < N; i++)
40         printf("%u -> %u\n", idata[i], odata[i]);
41
42     cudaFree((void*)d);
43
44     return 0;
45  }

Walking Through the Code


1  Begin by compiling the bitreverse.cu CUDA application for debugging by entering the following command at a shell prompt:
$ nvcc -g -G bitreverse.cu -o bitreverse

This command assumes that the source file name is bitreverse.cu and that no additional compiler flags are required for compilation. See also "Debug Compilation" in the Getting Started chapter.

2  Start the CUDA debugger by entering the following command at a shell prompt:
$ cuda-gdb bitreverse

3  Set breakpoints. Set both the host (main) and GPU (bitreverse) breakpoints here. Also, set a breakpoint at a particular line in the device function (bitreverse.cu:21).
(cuda-gdb) break main
Breakpoint 1 at 0x18e1: file bitreverse.cu, line 25.
(cuda-gdb) break bitreverse
Breakpoint 2 at 0x18a1: file bitreverse.cu, line 8.
(cuda-gdb) break 21
Breakpoint 3 at 0x18ac: file bitreverse.cu, line 21.


4  Run the CUDA application, and it executes until it reaches the first breakpoint (main) set in step 3.

(cuda-gdb) run
Starting program: /Users/CUDA_User1/docs/bitreverse
Reading symbols for shared libraries ..++........ done

Breakpoint 1, main () at bitreverse.cu:25
25        void *d = NULL; int i;

5  At this point, commands can be entered to advance execution or to print the program state. For this walk-through, let's continue until the device kernel is launched.

(cuda-gdb) continue
Continuing.
Reading symbols for shared libraries .. done
Reading symbols for shared libraries .. done
[Context Create of context 0x80f200 on Device 0]
[Launch of CUDA Kernel 0 (bitreverse<<<(1,1,1),(256,1,1)>>>) on Device 0]
Breakpoint 3 at 0x8667b8: file bitreverse.cu, line 21.
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

Breakpoint 2, bitreverse<<<(1,1,1),(256,1,1)>>> (data=0x110000) at bitreverse.cu:9
9         unsigned int *idata = (unsigned int*)data;

CUDA-GDB has detected that a CUDA device kernel has been reached. The debugger prints the current CUDA thread of focus.

6  Verify the CUDA thread of focus with the info cuda threads command and switch between the host thread and the CUDA threads:

(cuda-gdb) info cuda threads
  BlockIdx ThreadIdx  To BlockIdx ThreadIdx  Count  Virtual PC          Filename       Line
Kernel 0
* (0,0,0) (0,0,0)     (0,0,0) (255,0,0)      256    0x0000000000866400  bitreverse.cu  9
(cuda-gdb) thread
[Current thread is 1 (process 16738)]
(cuda-gdb) thread 1
[Switching to thread 1 (process 16738)]
#0  0x000019d5 in main () at bitreverse.cu:34
34        bitreverse<<<1, N, N*sizeof(int)>>>(d);
(cuda-gdb) backtrace
#0  0x000019d5 in main () at bitreverse.cu:34
(cuda-gdb) info cuda kernels
  Kernel  Dev  Grid  SMs Mask    GridDim  BlockDim   Name        Args
  0       0    1     0x00000001  (1,1,1)  (256,1,1)  bitreverse  data=0x110000


(cuda-gdb) cuda kernel 0
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
9         unsigned int *idata = (unsigned int*)data;
(cuda-gdb) backtrace
#0  bitreverse<<<(1,1,1),(256,1,1)>>> (data=0x110000) at bitreverse.cu:9

7  Corroborate this information by printing the block and thread indexes:

(cuda-gdb) print blockIdx
$1 = {x = 0, y = 0}
(cuda-gdb) print threadIdx
$2 = {x = 0, y = 0, z = 0}

8  The grid and block dimensions can also be printed:

(cuda-gdb) print gridDim
$3 = {x = 1, y = 1}
(cuda-gdb) print blockDim
$4 = {x = 256, y = 1, z = 1}

9  Advance kernel execution and verify some data:

(cuda-gdb) next
12        array[threadIdx.x] = idata[threadIdx.x];
(cuda-gdb) next
14        array[threadIdx.x] = ((0xf0f0f0f0 & array[threadIdx.x]) >> 4) |
(cuda-gdb) next
16        array[threadIdx.x] = ((0xcccccccc & array[threadIdx.x]) >> 2) |
(cuda-gdb) next
18        array[threadIdx.x] = ((0xaaaaaaaa & array[threadIdx.x]) >> 1) |
(cuda-gdb) next

Breakpoint 3, bitreverse<<<(1,1,1),(256,1,1)>>> (data=0x100000) at bitreverse.cu:21
21        idata[threadIdx.x] = array[threadIdx.x];
(cuda-gdb) print array[0]@12
$7 = {0, 128, 64, 192, 32, 160, 96, 224, 16, 144, 80, 208}
(cuda-gdb) print/x array[0]@12
$8 = {0x0, 0x80, 0x40, 0xc0, 0x20, 0xa0, 0x60, 0xe0, 0x10, 0x90, 0x50, 0xd0}
(cuda-gdb) print &data
$9 = (@global void * @parameter *) 0x10
(cuda-gdb) print *(@global void * @parameter *) 0x10
$10 = (@global void * @parameter) 0x100000

The resulting output depends on the current content of the memory location.


10  Since thread (0,0,0) reverses the value of 0, switch to a different thread to show more interesting data:

(cuda-gdb) cuda thread 170
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (170,0,0), device 0, sm 0, warp 5, lane 10]

11  Delete the breakpoints and continue the program to completion:

(cuda-gdb) delete breakpoints
Delete all breakpoints? (y or n) y
(cuda-gdb) continue
Continuing.

Program exited normally.
(cuda-gdb)


Example 2: autostep
This section shows how to use the autostep command and demonstrates how it helps increase the precision of memory error reporting.

Source Code
1   #define NUM_BLOCKS 8
2   #define THREADS_PER_BLOCK 64
3
4   __global__ void example(int **data) {
5     int value1, value2, value3, value4, value5;
6     int idx1, idx2, idx3;
7
8     idx1 = blockIdx.x * blockDim.x;
9     idx2 = threadIdx.x;
10    idx3 = idx1 + idx2;
11    value1 = *(data[idx1]);
12    value2 = *(data[idx2]);
13    value3 = value1 + value2;
14    value4 = value1 * value2;
15    value5 = value3 + value4;
16    *(data[idx3]) = value5;
17    *(data[idx1]) = value3;
18    *(data[idx2]) = value4;
19
20    idx1 = idx2 = idx3 = 0;
21  }
22
23  int main(int argc, char *argv[]) {
24    int *host_data[NUM_BLOCKS*THREADS_PER_BLOCK];
25    int **dev_data;
26    const int zero = 0;
27
28    /* Allocate an integer for each thread in each block */
29    for (int block = 0; block < NUM_BLOCKS; block++) {
30      for (int thread = 0; thread < THREADS_PER_BLOCK; thread++) {
31        int idx = thread + block * THREADS_PER_BLOCK;
32        cudaMalloc(&host_data[idx], sizeof(int));
33        cudaMemcpy(host_data[idx], &zero, sizeof(int), cudaMemcpyHostToDevice);
34      }
35    }
36    /* This inserts an error into block 3, thread 39 */
37    host_data[3*THREADS_PER_BLOCK + 39] = NULL;
38
39    /* Copy the array of pointers to the device */
40    cudaMalloc((void**)&dev_data, sizeof(host_data));


41    cudaMemcpy(dev_data, host_data, sizeof(host_data), cudaMemcpyHostToDevice);
42
43    /* Execute example */
44    example <<< NUM_BLOCKS, THREADS_PER_BLOCK >>> (dev_data);
45    cudaThreadSynchronize();
46  }
47

In this small example, we have an array of pointers to integers, and we want to do some operations on the integers. Suppose, however, that one of the pointers is NULL, as shown in line 37. This will cause CUDA_EXCEPTION_10 "Device Illegal Address" to be thrown when we try to access the integer that corresponds with block 3, thread 39. This exception should occur at line 16 when we try to write to that value.

Debugging With Autosteps


1  Compile the example and start CUDA-GDB as normal. We begin by running the program:

(cuda-gdb) run
Starting program: /home/jitud/cudagdb_test/autostep_ex/example
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff5688700 (LWP 9083)]
[Context Create of context 0x617270 on Device 0]
[Launch of CUDA Kernel 0 (example<<<(8,1,1),(64,1,1)>>>) on Device 0]

Program received signal CUDA_EXCEPTION_10, Device Illegal Address.
[Switching focus to CUDA kernel 0, grid 1, block (1,0,0), thread (0,0,0), device 0, sm 1, warp 0, lane 0]
0x0000000000796f60 in example (data=0x200300000) at example.cu:17
17        *(data[idx1]) = value3;

As expected, we received a CUDA_EXCEPTION_10. However, the reported thread is block 1, thread 0, and the line is 17. Since CUDA_EXCEPTION_10 is a global error, there is no thread information that is reported, so we would manually have to inspect all 512 threads.

2  Set autosteps. To get more accurate information, we reason that since CUDA_EXCEPTION_10 is a memory access error, it must occur on code that accesses memory. This happens on lines 11, 12, 16, 17, and 18, so we set two autostep windows for those areas:

(cuda-gdb) autostep 11 for 2 lines
Breakpoint 1 at 0x796d18: file example.cu, line 11.
Created autostep of length 2 lines
(cuda-gdb) autostep 16 for 3 lines
Breakpoint 2 at 0x796e90: file example.cu, line 16.
Created autostep of length 3 lines


3  Finally, we run the program again with these autosteps:

(cuda-gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
[Termination of CUDA Kernel 0 (example<<<(8,1,1),(64,1,1)>>>) on Device 0]
Starting program: /home/jitud/cudagdb_test/autostep_ex/example
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff5688700 (LWP 9089)]
[Context Create of context 0x617270 on Device 0]
[Launch of CUDA Kernel 1 (example<<<(8,1,1),(64,1,1)>>>) on Device 0]
[Switching focus to CUDA kernel 1, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

Program received signal CUDA_EXCEPTION_10, Device Illegal Address.
[Current focus set to CUDA kernel 1, grid 1, block (3,0,0), thread (32,0,0), device 0, sm 1, warp 3, lane 0]
Autostep precisely caught exception at example.cu:16 (0x796e90)

This time we correctly caught the exception at line 16. Even though CUDA_EXCEPTION_10 is a global error, we have now narrowed it down to a warp error, so we now know that the thread that threw the exception must have been in the same warp as block 3, thread 32.

In this example, we have narrowed down the scope of the error from 512 threads down to 32 threads just by setting two autosteps and re-running the program.


APPENDIX A SUPPORTED PLATFORMS

The general platform and GPU requirements for running NVIDIA CUDA-GDB are described in this section.

Host Platform Requirements


Mac OS
CUDA-GDB is supported on both 32-bit and 64-bit editions of the following Mac OS versions:
- Mac OS X 10.6
- Mac OS X 10.7

Linux
CUDA-GDB is supported on both 32-bit and 64-bit editions of the following Linux distributions:
- Red Hat Enterprise Linux 4.8 (64-bit only)
- Red Hat Enterprise Linux 5.5, 5.6, and 5.7
- Red Hat Enterprise Linux 6.0 (64-bit only) and 6.1 (64-bit only)
- Ubuntu 10.04, 10.10, and 11.04
- Fedora 13 and 14
- OpenSUSE 11.2
- SUSE Linux Enterprise Server 11.1


GPU Requirements
Debugging is supported on all CUDA-capable GPUs with a compute capability of 1.1 or later. Compute capability is a device attribute that a CUDA application can query; for more information, see the latest NVIDIA CUDA Programming Guide on the NVIDIA CUDA Zone Web site: http://developer.nvidia.com/object/gpucomputing.html.

These GPUs have a compute capability of 1.0 and are not supported:
- GeForce 8800 GTS
- GeForce 8800 GTX
- GeForce 8800 Ultra
- Quadro Plex 1000 Model IV
- Quadro Plex 2100 Model S4
- Quadro FX 4600
- Quadro FX 5600
- Tesla C870
- Tesla D870
- Tesla S870


APPENDIX B KNOWN ISSUES

The following are known issues with the current release.

- Setting the cuda memcheck option ON will make all the launches blocking.
- Conditional breakpoints can only be set after the CUDA module is loaded.
- Device memory allocated via cudaMalloc() is not visible outside of the kernel function.
- On GPUs with sm_type lower than sm_20 it is not possible to step over a subroutine in the device code.
- Requesting to read or write GPU memory may be unsuccessful if the size is larger than 100 MB on Tesla GPUs and larger than 32 MB on Fermi GPUs.
- On GPUs with sm_20, if you are debugging code in device functions that get called by multiple kernels, then setting a breakpoint in the device function will insert the breakpoint in only one of the kernels.
- In a multi-GPU debugging environment on Mac OS X with Aqua running, you may experience some visible delay while single-stepping the application.
- Setting a breakpoint on a line within a __device__ or __global__ function before its module is loaded may result in the breakpoint being temporarily set on the first line of a function below in the source code. As soon as the module for the targeted function is loaded, the breakpoint will be reset properly. In the meantime, the breakpoint may be hit, depending on the application. In those situations, the breakpoint can be safely ignored, and the application can be resumed.
- The scheduler-locking option cannot be set to on.
- Stepping again after stepping out of a kernel results in undetermined behavior. It is recommended to use the continue command instead.
- OpenGL applications may require launching X in non-interactive mode. See "CUDA/OpenGL Interop Applications on Linux" for details.


Notice
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication of otherwise under any patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all other information previously supplied. NVIDIA Corporation products are not authorized as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

Trademarks
NVIDIA, the NVIDIA logo, NVIDIA nForce, GeForce, NVIDIA Quadro, NVDVD, NVIDIA Personal Cinema, NVIDIA Soundstorm, Vanta, TNT2, TNT, RIVA, RIVA TNT, VOODOO, VOODOO GRAPHICS, WAVEBAY, Accuview Antialiasing, Detonator, Digital Vibrance Control, ForceWare, NVRotate, NVSensor, NVSync, PowerMizer, Quincunx Antialiasing, Sceneshare, See What You've Been Missing, StreamThru, SuperStability, T-BUFFER, The Way It's Meant to be Played Logo, TwinBank, TwinView and the Video & Nth Superscript Design Logo are registered trademarks or trademarks of NVIDIA Corporation in the United States and/or other countries. Other company and product names may be trademarks or registered trademarks of the respective owners with which they are associated.

Copyright
© 2007-2012 NVIDIA Corporation. All rights reserved.

www.nvidia.com
